
In [4]:
!pip install convokit  # Install the convokit library (required for the import below)
import random
import re
import unicodedata
from convokit import Corpus, download

# Step 1: Download and load the Cornell Movie Dialogs Corpus
corpus = Corpus(filename=download("movie-corpus"))


In [5]:

# Step 2: Function to extract conversation pairs (question-answer pairs)
def extract_sentence_pairs(corpus):
    qa_pairs = []
    for conversation in corpus.iter_conversations():
        utterances = conversation.get_utterance_ids()
        for i in range(len(utterances) - 1):  # Iterate through the conversation
            input_sentence = corpus.get_utterance(utterances[i]).text
            output_sentence = corpus.get_utterance(utterances[i + 1]).text
            qa_pairs.append([input_sentence, output_sentence])
    return qa_pairs
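
The pairing logic can be seen on a toy conversation without downloading the corpus. This is a minimal sketch: `consecutive_pairs` and the three sample lines are hypothetical stand-ins for the ConvoKit objects, and it only illustrates that every utterance except the first and last appears once as an answer and once as a question.

```python
def consecutive_pairs(utterances):
    """Pair each utterance with the one that follows it."""
    return [[utterances[i], utterances[i + 1]]
            for i in range(len(utterances) - 1)]

# Hypothetical three-line conversation
toy_conversation = ["Where are you going?", "To the store.", "Bring back milk."]
toy_pairs = consecutive_pairs(toy_conversation)

for question, answer in toy_pairs:
    print(f"{question!r} -> {answer!r}")
```

Because pairs overlap, a conversation with n utterances yields n - 1 pairs, and many "questions" in the resulting dataset are not questions at all.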


In [4]:

# Step 3: Extract sentence pairs from the dataset
qa_pairs = extract_sentence_pairs(corpus)
print(f"Extracted {len(qa_pairs)} question-answer pairs.")
print(f"Example pair: {qa_pairs[0]}")


Extracted 221616 question-answer pairs.
Example pair: ['They do not!', 'They do to!']


In [5]:

# Step 4: Preprocessing - normalize the dataset
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Function to normalize the input text
def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Preprocess the QA pairs
for i in range(len(qa_pairs)):
    qa_pairs[i][0] = normalize_string(qa_pairs[i][0])  # Normalize question
    qa_pairs[i][1] = normalize_string(qa_pairs[i][1])  # Normalize answer
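
To see what the normalization does to typical input, the two functions above can be run on a few sample strings. The samples below are made up for illustration; the functions are copied unchanged so the snippet runs on its own.

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)       # Put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)   # Replace everything else with a space
    s = re.sub(r"\s+", r" ", s).strip()     # Collapse repeated whitespace
    return s

# Hypothetical sample inputs
samples = ["Héllo, World!", "I've been waiting.", "Call 911 now?"]
normalized = [normalize_string(s) for s in samples]

for raw, norm in zip(samples, normalized):
    print(f"{raw!r} -> {norm!r}")
```

Note that apostrophes and digits are replaced by spaces, which is why contractions appear as `i ve` in the chatbot output and why numbers disappear from the text entirely.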


In [6]:

# Step 5: Create a simple rule-based chatbot
def chatbot_response(user_input, qa_pairs):
    user_input = normalize_string(user_input)
    response = "I'm sorry, I don't understand. Can you rephrase?"  # Default response

    # An empty string is a substring of every question, so guard against it
    if not user_input:
        return response

    for question, answer in qa_pairs:
        if user_input in question:
            return answer  # Answer of the first question that contains the input

    return response



In [7]:
# Step 6: Function to run the chatbot interactively
def run_chatbot(qa_pairs):
    print("Chatbot: Hello! Ask me anything (type 'exit' to quit).")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            print("Chatbot: Goodbye!")
            break
        response = chatbot_response(user_input, qa_pairs)
        print(f"Chatbot: {response}")

# Run the chatbot
run_chatbot(qa_pairs)


Chatbot: Hello! Ask me anything (type 'exit' to quit).
You: hello
Chatbot: be polite . say hello . this is candy .
You: hello this is baby
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: how are you?
Chatbot: is there a problem ?
You: no, just asking
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: do you like movies
Chatbot: people watched the movies in their cars ?
You: which movie would you prefer me to watch
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: exit
Chatbot: Goodbye!


The chat log above shows that the chatbot handles only some inputs: three of the six user turns fall through to the default "I don't understand" reply, and the replies it does give are unrelated to the input. The root cause is the simplistic substring-matching strategy. The issues and possible improvements are broken down below:

##### Issues with Current Implementation:

1. **Substring Matching**:
   - The chatbot checks whether the normalized user input is a substring of a dataset question and returns the answer to the first such question. This fails to accommodate variations in phrasing or sentence structure, and short inputs can match inside unrelated questions.
   
2. **Handling User Input**:
   - Longer inputs such as "hello this is baby" and "which movie would you prefer me to watch" fall back to "I don't understand" because no dataset question contains them verbatim. Short inputs such as "how are you?" and "do you like movies" do get a reply, but only because their text happens to appear inside some question.

3. **Fallback Mechanism**:
   - The fallback response "Can you rephrase?" appears too often (three of six turns in the log above), indicating that substring containment is too strict a criterion for selecting responses.

4. **Response Relevance**:
   - Even when responses are provided, they are sometimes nonsensical (e.g., "be polite. say hello . this is candy.") because there is no contextual understanding involved.
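
The failure modes listed above can be reproduced on a tiny dataset. This is a minimal sketch: the two QA pairs are hypothetical, already normalized, and `first_substring_match` mirrors the matching loop from Step 5 without the default reply.

```python
# Hypothetical, already-normalized toy pairs
toy_pairs = [
    ["be polite . say hello .", "this is candy ."],
    ["how are you ?", "is there a problem ?"],
]

def first_substring_match(user_input, qa_pairs):
    """Return the answer of the first question containing user_input, else None."""
    for question, answer in qa_pairs:
        if user_input in question:
            return answer
    return None

# A short input matches inside an unrelated, longer question
print(first_substring_match("hello", toy_pairs))

# A longer input is not contained in any question, so nothing matches
print(first_substring_match("hello this is baby", toy_pairs))

# Even a fragment of a word is enough to trigger a match
print(first_substring_match("ar", toy_pairs))

# An empty input is a substring of every question
print(first_substring_match("", toy_pairs))
```

Substring containment is asymmetric: the input must be shorter than (or equal to) the stored question, so adding detail to a question makes a match less likely rather than more.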


##### Improvements You Can Make:

1. **Implementing a More Sophisticated Matching Algorithm**:
   - Instead of simple substring matching, you can implement similarity algorithms like **Levenshtein distance**, **Cosine similarity** with **TF-IDF** (Term Frequency-Inverse Document Frequency), or **Word2Vec**/**BERT** embeddings to improve response relevance.

2. **Threshold Tuning for Similarity**:
   - Set a dynamic threshold for matching questions and adjust it based on user input. If the chatbot struggles to find a close enough match, it can request clarification from the user.

3. **Response Diversification**:
   - Instead of repeating the same fallback responses, add more varied, contextually relevant responses when the bot cannot understand the user's question.

4. **Data Augmentation**:
   - Enrich your dataset by augmenting it with more possible variations of common questions and answers. This will help the bot become more adaptable to different ways of phrasing the same question.
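
As one illustration of the first two suggestions, the sketch below scores candidates by normalized Levenshtein (edit) distance and applies a similarity threshold. It uses only the standard library; the function names, the toy questions, and the 0.6 threshold are assumptions for illustration, not part of the notebook.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Map edit distance to a score in [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def best_match(user_input, questions, threshold=0.6):
    """Return (index, score) of the closest question, or (None, score) if below threshold."""
    score, index = max((similarity(user_input, q), i) for i, q in enumerate(questions))
    return (index, score) if score >= threshold else (None, score)

# Hypothetical normalized questions
toy_questions = ["how are you ?", "where is the car ?"]
print(best_match("how are yu ?", toy_questions))  # typo still finds the first question
print(best_match("zzzz", toy_questions))          # nothing close, so index is None
```

Edit distance tolerates typos and small wording changes but is still a character-level measure, and scanning every question this way is slow on a corpus of this size, so it is better suited to small curated question sets.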


## Version 2


In [8]:
!pip install convokit scikit-learn  # Install the necessary libraries




In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import unicodedata
import random


# Step 2: Function to extract conversation pairs (question-answer pairs)
def extract_sentence_pairs(corpus):
    qa_pairs = []
    for conversation in corpus.iter_conversations():
        utterances = conversation.get_utterance_ids()
        for i in range(len(utterances) - 1):  # Iterate through the conversation
            input_sentence = corpus.get_utterance(utterances[i]).text
            output_sentence = corpus.get_utterance(utterances[i + 1]).text
            qa_pairs.append([input_sentence, output_sentence])
    return qa_pairs


In [11]:

# Step 3: Extract sentence pairs from the dataset
qa_pairs = extract_sentence_pairs(corpus)
print(f"Extracted {len(qa_pairs)} question-answer pairs.")

# Step 4: Preprocessing - normalize the dataset
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Preprocess the QA pairs
questions = [normalize_string(pair[0]) for pair in qa_pairs]
answers = [normalize_string(pair[1]) for pair in qa_pairs]


Extracted 221616 question-answer pairs.


In [12]:

# Step 5: Initialize TF-IDF vectorizer and fit it on the questions
vectorizer = TfidfVectorizer().fit(questions)

# Vectorize the dataset questions once, rather than on every chatbot call
question_vecs = vectorizer.transform(questions)

# Step 6: Function to find the best response using cosine similarity
# (`questions` is kept in the signature for compatibility; the precomputed
# `question_vecs` above is used for matching)
def chatbot_response(user_input, vectorizer, questions, answers):
    user_input = normalize_string(user_input)  # Normalize the user input
    user_vec = vectorizer.transform([user_input])  # Convert input to TF-IDF vector

    # Compute cosine similarity between user input and all questions in the dataset
    similarities = cosine_similarity(user_vec, question_vecs).flatten()

    # Find the index of the question with the highest similarity
    best_match_index = similarities.argmax()

    # If the best match similarity score is too low, provide a fallback response
    if similarities[best_match_index] < 0.5:  # Threshold can be tuned
        return random.choice([
            "I'm sorry, I don't understand. Can you rephrase?",
            "Could you clarify your question?",
            "I don't have an answer for that, sorry.",
            "Can you try asking that differently?"
        ])

    return answers[best_match_index]  # Return the best matched answer


In [13]:

# Step 7: Function to run the chatbot interactively
def run_chatbot(vectorizer, questions, answers):
    print("Chatbot: Hello! Ask me anything (type 'exit' to quit).")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            print("Chatbot: Goodbye!")
            break
        response = chatbot_response(user_input, vectorizer, questions, answers)
        print(f"Chatbot: {response}")

# Run the chatbot
run_chatbot(vectorizer, questions, answers)


Chatbot: Hello! Ask me anything (type 'exit' to quit).
You: hello
Chatbot: we have a visitor .
You: how are you?
Chatbot: sun city . i ve been meaning to call you for months .
You: can you recomment a movie for me
Chatbot: because i need half a million to buy a script .
You: exit
Chatbot: Goodbye!


The chatbot's responses in the latest interaction don't seem coherent in the context of the questions being asked. The model selects responses that are movie dialogues, but they don't align well with the questions. This indicates a gap between matching user input and relevant responses.

##### Issues in the Current Chatbot Implementation:
1. **Irrelevant Responses**:
   - The responses don't seem to fit the questions well (e.g., "we have a visitor" and "i need half a million to buy a script"). These responses, while part of movie dialogues, are not suitable answers to the questions being asked.
   
2. **No Contextual Understanding**:
   - The current approach matches on TF-IDF word overlap only: it ignores word meaning, word order, and the earlier turns of the conversation, which is insufficient for meaningful conversations.

3. **Randomness of Responses**:
   - Since the chatbot uses pre-defined dialogue exchanges, the selected answers may not always make sense. This is a fundamental limitation of using raw movie dialogues without additional processing.
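
The effect of surface-level matching can be shown with a small, dependency-free TF-IDF sketch. This is a simplified stand-in, not scikit-learn's `TfidfVectorizer` (tokenization here is a plain whitespace split), and the two questions and the query are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency for each whitespace token."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}

def vectorize(text, idf):
    """Sparse TF-IDF vector as a dict; tokens unseen at fit time are dropped."""
    counts = Counter(tok for tok in text.split() if tok in idf)
    return {tok: tf * idf[tok] for tok, tf in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

toy_questions = [
    "can you lend me a dollar ?",   # shares many words with the query
    "which film should i watch ?",  # closer in meaning, few shared words
]
idf = fit_idf(toy_questions)
query = vectorize("can you recommend a movie ?", idf)
scores = [cosine(query, vectorize(q, idf)) for q in toy_questions]
best = max(range(len(toy_questions)), key=scores.__getitem__)
print(scores, "->", toy_questions[best])
```

The query's informative words ("recommend", "movie") are unknown to the vocabulary and contribute nothing, so the match is driven by function words such as "can", "you", and "a". The lending question wins even though the film question is closer in meaning.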

---

##### Improvements to Make:
1. **Contextual Filtering**:
   - Apply **semantic similarity** methods using pre-trained models like **BERT** to better understand the context and meaning of both the user's input and the movie dialogues.

2. **Dynamic Response Selection**:
   - To avoid irrelevant responses, you could filter potential answers based on content type or add rules to ensure answers fit specific question categories (e.g., responses about movies, actions, or greetings).
   
3. **Better Preprocessing**:
   - Improve normalization of both questions and answers by handling a wider range of characters, punctuation, and special symbols.

4. **Handling Conversational Context**:
   - Implement basic contextual memory to track ongoing conversation topics. For example, after a question about movies, subsequent responses should remain within the movie-related context.
   
5. **Threshold Tuning**:
   - The similarity threshold (currently 0.5) could be tuned against a small set of sample inputs rather than fixed by hand. A higher threshold filters out more irrelevant answers at the cost of more fallback responses.
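
As a sketch of the rule-based filtering idea in item 2, a few patterns can intercept common intents such as greetings before retrieval is attempted. Everything here is hypothetical: the category names, patterns, canned replies, and the `retrieve` callback, which stands in for whichever `chatbot_response` variant is in use.

```python
import re

# Hypothetical categories, patterns, and canned replies
RULES = [
    ("greeting", re.compile(r"^(hello|hi|hey)\b"), "hello ! what would you like to talk about ?"),
    ("farewell", re.compile(r"^(bye|goodbye)\b"), "goodbye !"),
]

def route(user_input, retrieve):
    """Answer from a rule if one matches; otherwise defer to the retrieval function."""
    text = user_input.lower().strip()
    for category, pattern, reply in RULES:
        if pattern.search(text):
            return category, reply
    return "retrieval", retrieve(text)

# A stub retrieval function for demonstration
stub = lambda text: f"<retrieved answer for: {text}>"

print(route("Hello", stub))
print(route("history of film", stub))  # starts with "hi" but is not a greeting
print(route("do you like movies", stub))
```

The word-boundary anchor `\b` keeps "history" from being treated as "hi". Rules like these only cover a handful of intents, so they complement retrieval rather than replace it.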

---

##### Revised Code with Semantic Similarity (Using Sentence Transformers/BERT):
To improve the chatbot's ability to match inputs by meaning rather than wording, let's incorporate **sentence embeddings** from a compact **BERT-style model** provided by the `sentence-transformers` library.


In [15]:

!pip install sentence-transformers convokit  # Install required libraries

from sentence_transformers import SentenceTransformer, util
from convokit import Corpus, download
import re
import unicodedata
import random

# Step 1: Download and load the Cornell Movie Dialogs Corpus
# (already loaded earlier in this notebook; uncomment to run this section on its own)
#corpus = Corpus(filename=download("movie-corpus"))




In [16]:

# Step 2: Function to extract conversation pairs (question-answer pairs)
def extract_sentence_pairs(corpus):
    qa_pairs = []
    for conversation in corpus.iter_conversations():
        utterances = conversation.get_utterance_ids()
        for i in range(len(utterances) - 1):  # Iterate through the conversation
            input_sentence = corpus.get_utterance(utterances[i]).text
            output_sentence = corpus.get_utterance(utterances[i + 1]).text
            qa_pairs.append([input_sentence, output_sentence])
    return qa_pairs

# Step 3: Extract sentence pairs from the dataset
qa_pairs = extract_sentence_pairs(corpus)
print(f"Extracted {len(qa_pairs)} question-answer pairs.")

# Step 4: Preprocessing - normalize the dataset
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Preprocess the QA pairs
questions = [normalize_string(pair[0]) for pair in qa_pairs]
answers = [normalize_string(pair[1]) for pair in qa_pairs]

# Step 5: Load a pre-trained sentence-transformer model for embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')  # A lightweight MiniLM (BERT-style) model

# Step 6: Encode all the questions in the dataset once, up front
encoded_questions = model.encode(questions, convert_to_tensor=True)

# Step 7: Function to find the best response using semantic similarity
def chatbot_response(user_input, model, encoded_questions, questions, answers):
    user_input = normalize_string(user_input)  # Normalize the user input
    user_embedding = model.encode(user_input, convert_to_tensor=True)  # Encode user input using BERT

    # Compute cosine similarity between user input and all questions in the dataset
    similarities = util.pytorch_cos_sim(user_embedding, encoded_questions).flatten()

    # Find the index of the question with the highest similarity (as a plain int)
    best_match_index = int(similarities.argmax())

    # If the best match similarity score is too low, provide a fallback response
    if similarities[best_match_index] < 0.7:  # Adjust the threshold as needed
        return random.choice([
            "I'm sorry, I don't understand. Can you rephrase?",
            "Could you clarify your question?",
            "I don't have an answer for that, sorry.",
            "Can you try asking that differently?"
        ])

    return answers[best_match_index]  # Return the best matched answer


Extracted 221616 question-answer pairs.


In [17]:

# Step 8: Function to run the chatbot interactively
def run_chatbot(model, encoded_questions, questions, answers):
    print("Chatbot: Hello! Ask me anything (type 'exit' to quit).")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            print("Chatbot: Goodbye!")
            break
        response = chatbot_response(user_input, model, encoded_questions, questions, answers)
        print(f"Chatbot: {response}")

# Run the chatbot
run_chatbot(model, encoded_questions, questions, answers)


Chatbot: Hello! Ask me anything (type 'exit' to quit).
You: hello
Chatbot: hi patrick . i thought that was you .
You: this is babu
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: how are you?
Chatbot: sun city . i ve been meaning to call you for months .
You: can you have any recommendation for the movies
Chatbot: Can you try asking that differently?
You: what is chat
Chatbot: i have a few people here i can t really chat right now .
You: do you have new people
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: exit
Chatbot: Goodbye!



##### Improvements in This Version:

1. **BERT for Semantic Matching**:
   - We now use a **BERT-style model** (`all-MiniLM-L6-v2` from `sentence-transformers`) to generate sentence embeddings for both the user input and the dataset questions. Matching on embeddings lets the chatbot pair inputs with questions that are similar in meaning rather than only in wording. The answers are still raw movie lines, so, as the "how are you?" reply in the log shows, a close question match can still return an unrelated reply.
   
2. **Increased Threshold for Response Matching**:
   - The similarity threshold has been raised to 0.7, so the chatbot falls back when there is no close match. In the log above this trades irrelevant answers for more fallback responses (three of six turns).

3. **Fallback Responses**:
   - As in Version 2, the fallback is picked at random from four phrasings, which avoids repeating the same line and makes the conversation feel more dynamic.

##### Expected Benefits:
- **More Relevant Responses**: Using **semantic similarity** lets the chatbot match on meaning rather than relying only on surface-level word overlap.
- **Handling a Variety of Input**: The chatbot can handle more ways of phrasing the same question, since paraphrases map to nearby embeddings.
- **Fewer Irrelevant Replies**: The increased similarity threshold filters out answers whose matched question is not close to the user's input, at the cost of more fallback responses.
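
When choosing a threshold such as 0.7, it helps to look at the top few candidates and their scores instead of only the single best match. The sketch below shows the idea with plain Python lists standing in for sentence embeddings; the three-dimensional vectors, the answers, and the helper names are all hypothetical, and real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, corpus_vecs, k=3):
    """Return the k highest (score, index) pairs, best first."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(corpus_vecs)]
    return sorted(scored, reverse=True)[:k]

def respond(query_vec, corpus_vecs, answers, threshold=0.7):
    """Return the best answer, or None when the best score is below the threshold."""
    score, index = top_k(query_vec, corpus_vecs, k=1)[0]
    return answers[index] if score >= threshold else None

# Toy stand-ins for encoded questions and their answers
toy_vecs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
toy_answers = ["answer a", "answer b", "answer c"]

print(top_k([1.0, 0.0, 0.0], toy_vecs, k=2))          # two closest candidates with scores
print(respond([1.0, 0.0, 0.0], toy_vecs, toy_answers))  # clear match
print(respond([0.0, 1.0, 0.0], toy_vecs, toy_answers))  # best score is about 0.6, so None
```

Logging the top candidates for a handful of real inputs shows where good and bad matches separate, which gives a more principled basis for the threshold than a fixed guess.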
