<a href="https://colab.research.google.com/github/babupallam/Msc_AI_Module2_Natural_Language_Processing/blob/main/L07-Chatbot%20Based%20on%20PyTorch/Chatbot_1_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [33]:
# Install the convokit library if not already installed
# !pip install convokit  

import random
import re
import unicodedata
from convokit import Corpus, download

# Step 1: Download and load the Cornell Movie Dialogs Corpus
# The Corpus class from convokit is used to handle conversational datasets
corpus = Corpus(filename=download("movie-corpus"))

Downloading movie-corpus to C:\Users\Girija\.convokit\downloads\movie-corpus
Downloading movie-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip (40.9MB)... Done


In [40]:

# Step 2: Function to extract conversation pairs (question-answer pairs)
# This function retrieves dialogues from conversations in the corpus
def extract_sentence_pairs(corpus):
    qa_pairs = []  # Initialize an empty list to store question-answer pairs
    for conversation in corpus.iter_conversations():
        # Get a list of utterance IDs in the conversation
        utterances = conversation.get_utterance_ids()
        # Iterate through pairs of utterances (current and next) in the conversation
        for i in range(len(utterances) - 1):  
            # The current utterance is treated as the question
            input_sentence = corpus.get_utterance(utterances[i]).text
            # The next utterance is treated as the answer
            output_sentence = corpus.get_utterance(utterances[i + 1]).text
            # Append the question-answer pair to the list
            qa_pairs.append([input_sentence, output_sentence])
    return qa_pairs  # Return the list of question-answer pairs
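
The pairing logic can be checked on a toy conversation without downloading the corpus. This is a minimal sketch: `pair_consecutive` and `toy_conversation` are hypothetical names, and the list is assumed to already be in speaking order, which is worth verifying for the IDs returned by the real corpus.

```python
def pair_consecutive(utterances):
    """Pair each utterance with the one that follows it."""
    return [[current, following]
            for current, following in zip(utterances, utterances[1:])]

# Hypothetical three-turn conversation, assumed to be in speaking order
toy_conversation = ["Hi.", "Hello, who is this?", "It's me."]
pairs = pair_consecutive(toy_conversation)

# A conversation with n utterances yields n - 1 pairs
for question, answer in pairs:
    print(f"{question!r} -> {answer!r}")
```

Every utterance except the first and last appears twice: once as an answer and once as the next question.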

In [35]:

# Step 3: Extract sentence pairs from the dataset
qa_pairs = extract_sentence_pairs(corpus)
print(f"Extracted {len(qa_pairs)} question-answer pairs.")
print(f"Example pair: {qa_pairs[0]}")

Extracted 221616 question-answer pairs.
Example pair: ['They do not!', 'They do to!']


In [36]:
# Step 4: Preprocessing - normalize the dataset
# Function to convert unicode characters to ASCII
def unicode_to_ascii(s):
    # Normalize string to NFD form, which separates accents from characters
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'  # Remove accents by excluding characters with 'Mn' (nonspacing mark) category
    )

# Function to normalize text by converting to lowercase, removing unwanted characters, and adding spaces around punctuation
def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())  # Convert text to ASCII and lowercase
    s = re.sub(r"([.!?])", r" \1", s)  # Separate punctuation from words with spaces
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # Remove characters that aren't letters, punctuation, or spaces
    s = re.sub(r"\s+", r" ", s).strip()  # Replace multiple spaces with a single space
    return s  # Return the normalized string

# Preprocess all question-answer pairs by normalizing each question and answer
for i in range(len(qa_pairs)):
    qa_pairs[i][0] = normalize_string(qa_pairs[i][0])  # Normalize question
    qa_pairs[i][1] = normalize_string(qa_pairs[i][1])  # Normalize answer
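
To see what normalization does to a sentence, the two functions can be run on a couple of sample strings. The sketch below repeats the definitions so it runs on its own; the sample sentences are made up for illustration.

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # everything else becomes a space
    s = re.sub(r"\s+", r" ", s).strip()    # collapse repeated whitespace
    return s

# Made-up sample sentences
samples = ["What's your favorite movie?", "Café at 9:30... OK!"]
for sample in samples:
    print(repr(sample), "->", repr(normalize_string(sample)))
```

Apostrophes and digits are replaced by spaces, which is why the chatbot's replies contain fragments such as `it s not that`.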


In [37]:
# Step 5: Create a simple rule-based chatbot
# This function finds the best answer for the user's input by checking each question-answer pair
def chatbot_response(user_input, qa_pairs):
    # Normalize the user input to match the format of the dataset
    user_input = normalize_string(user_input)
    # Default response if no match is found
    response = "I'm sorry, I don't understand. Can you rephrase?"

    # Loop through question-answer pairs to find a match for user_input
    for pair in qa_pairs:
        question, answer = pair  # Unpack the question and answer
        if user_input in question:  # Check if the user input is a substring of this question
            return answer  # Return the answer if a match is found

    return response  # Return default response if no match is found
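
The behaviour of the substring test is easier to judge on a tiny dataset. The following sketch uses hypothetical names (`toy_pairs`, `substring_match`) and made-up answers; it only reproduces the `in` check from the function above.

```python
# Made-up, already-normalized question-answer pairs
toy_pairs = [
    ["what are you doing here ?", "waiting for you ."],
    ["do you believe in love at first sight ?", "no ."],
]

def substring_match(user_input, qa_pairs):
    """Return the answer of the first question that contains user_input."""
    for question, answer in qa_pairs:
        if user_input in question:
            return answer
    return None  # no stored question contains the input

exact = substring_match("what are you doing here ?", toy_pairs)
shorter = substring_match("what are you doing ?", toy_pairs)
single_letter = substring_match("a", toy_pairs)

print(exact)          # the full question is found
print(shorter)        # a small rephrasing is not
print(single_letter)  # a very short input matches the first question containing it
```

A slightly shorter phrasing misses, while a single letter hits the first question that contains it, so the first match is not necessarily the best one.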

**Chat Simulation**

In [44]:

# Step 6: Function to run the chatbot with simulated inputs
def run_chatbot(qa_pairs, user_inputs):
    # Initialize conversation list to store messages
    conversation = []
    conversation.append("Chatbot: Hello! Ask me anything (type 'exit' to quit).")
    
    # Process each user input in the list
    for user_input in user_inputs:
        conversation.append(f"You: {user_input}")
        
        # Check if the user wants to exit
        if user_input.lower() == 'exit':
            conversation.append("Chatbot: Goodbye!")
            break
        
        # Get chatbot's response based on the user input
        response = chatbot_response(user_input, qa_pairs)
        conversation.append(f"Chatbot: {response}")
    
    # Print the entire conversation to simulate the chat
    output_text = "\n".join(conversation)
    print(output_text)

# List of simulated user inputs for testing based on typical conversational patterns
user_inputs = [
    "What are you doing here?",
    "Do you believe in love?",
    "Tell me a secret.",
    "What's your favorite movie?",
    "Who is your best friend?",
    "Do you ever feel lonely?",
    "Why do people lie?",
    "What makes you happy?",
    "Can you tell me a joke?",
    "Why do people fall in love?",
    "exit"
    ]

# Run the chatbot with simulated user inputs
run_chatbot(qa_pairs, user_inputs)

Chatbot: Hello! Ask me anything (type 'exit' to quit).
You: What are you doing here?
Chatbot: excuse me have you seen the feminine mystique ? i lost my copy .
You: Do you believe in love?
Chatbot: no . it s not that .
You: Tell me a secret.
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: What's your favorite movie?
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: Who is your best friend?
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: Do you ever feel lonely?
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: Why do people lie?
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: What makes you happy?
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: Can you tell me a joke?
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: Why do people fall in love?
Chatbot: I'm sorry, I don't understand. Can you rephrase?
You: exit
Chatbot: Goodbye!


**User Input Way**

In [42]:
# Step 6: Function to run the chatbot interactively with live user input
def run_chatbot(qa_pairs):
    # Print initial greeting message from chatbot
    print("Chatbot: Hello! Ask me anything (type 'exit' to quit).")
    
    # Begin a loop to continuously take user input and respond
    while True:
        # Capture live user input
        user_input = input("You: ")
        
        # Check if the user wants to exit the chat
        if user_input.lower() == 'exit':
            print("Chatbot: Goodbye!")
            break
        
        # Generate chatbot response based on user input
        response = chatbot_response(user_input, qa_pairs)
        
        # Display the response
        print(f"Chatbot: {response}")

# Run the chatbot interactively with qa_pairs
run_chatbot(qa_pairs)


Chatbot: Hello! Ask me anything (type 'exit' to quit).
Chatbot: be polite . say hello . this is candy .
Chatbot: I'm sorry, I don't understand. Can you rephrase?
Chatbot: end my career ?
Chatbot: Goodbye!


Observation:

The chat logs above show that the chatbot struggles to respond correctly to most inputs: 8 of the 10 simulated questions fall through to the default reply. The root cause is likely the simplistic substring-matching strategy, which results in a high rate of "I don't understand" responses. Here's a breakdown of the issues and improvements you can implement:

##### Issues with Current Implementation:

1. **Substring Matching**:
   - The chatbot checks whether the normalized user input is a substring of a stored question and returns the answer of the first question that contains it. This method fails to accommodate variations in phrasing or sentence structure.

2. **Handling User Input**:
   - Questions like "Tell me a secret." and "What makes you happy?" result in the chatbot defaulting to "I don't understand." because no stored question contains that exact text.

3. **Fallback Mechanism**:
   - The fallback response "Can you rephrase?" appears too often, indicating that the logic behind selecting responses is not sophisticated enough.

4. **Response Relevance**:
   - Even when responses are provided, they are sometimes nonsensical (e.g., "be polite . say hello . this is candy .") because there is no contextual understanding involved.


##### Improvements You Can Make:

1. **Implementing a More Sophisticated Matching Algorithm**:
   - Instead of simple substring matching, you can implement similarity algorithms like **Levenshtein distance**, **Cosine similarity** with **TF-IDF** (Term Frequency-Inverse Document Frequency), or **Word2Vec**/**BERT** embeddings to improve response relevance.

2. **Threshold Tuning for Similarity**:
   - Set a dynamic threshold for matching questions and adjust it based on user input. If the chatbot struggles to find a close enough match, it can request clarification from the user.

3. **Response Diversification**:
   - Instead of repeating the same fallback responses, add more varied, contextually relevant responses when the bot cannot understand the user's question.

4. **Data Augmentation**:
   - Enrich your dataset by augmenting it with more possible variations of common questions and answers. This will help the bot become more adaptable to different ways of phrasing the same question.
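
The first three suggestions can be sketched together in plain Python. This is only an illustration under stated assumptions: it uses a character-level Levenshtein distance, a hypothetical threshold of 0.75, made-up question-answer pairs, and an illustrative `FALLBACKS` list. A linear scan like this would be slow on the full set of 221,616 pairs.

```python
import random

def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Map edit distance to a score in [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Illustrative fallback replies, picked at random for variety
FALLBACKS = [
    "I'm sorry, I don't understand. Can you rephrase?",
    "Could you clarify your question?",
    "Can you try asking that differently?",
]

def fuzzy_response(user_input, qa_pairs, threshold=0.75):
    """Answer with the closest question's reply, or fall back below threshold."""
    best_question, best_answer = max(
        qa_pairs, key=lambda pair: similarity(user_input, pair[0])
    )
    if similarity(user_input, best_question) < threshold:
        return random.choice(FALLBACKS)
    return best_answer

# Made-up, already-normalized pairs
toy_pairs = [
    ["what are you doing here ?", "waiting for you ."],
    ["do you believe in love ?", "no ."],
]
print(fuzzy_response("what are you doing ?", toy_pairs))
print(fuzzy_response("tell me a secret .", toy_pairs))
```

Unlike the substring check, the rephrased first input still finds its question, while the unrelated second input gets one of the varied fallback replies.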


## Version 2


In [15]:
#!pip install convokit scikit-learn  # Install the necessary libraries -- for google colab


In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import unicodedata
import random


# Step 2: Function to extract conversation pairs (question-answer pairs)
def extract_sentence_pairs(corpus):
    qa_pairs = []
    for conversation in corpus.iter_conversations():
        utterances = conversation.get_utterance_ids()
        for i in range(len(utterances) - 1):  # Iterate through the conversation
            input_sentence = corpus.get_utterance(utterances[i]).text
            output_sentence = corpus.get_utterance(utterances[i + 1]).text
            qa_pairs.append([input_sentence, output_sentence])
    return qa_pairs


In [46]:

# Step 3: Extract sentence pairs from the dataset
qa_pairs = extract_sentence_pairs(corpus)
print(f"Extracted {len(qa_pairs)} question-answer pairs.")

# Step 4: Preprocessing - normalize the dataset
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Preprocess the QA pairs
questions = [normalize_string(pair[0]) for pair in qa_pairs]
answers = [normalize_string(pair[1]) for pair in qa_pairs]


Extracted 221616 question-answer pairs.


In [47]:

# Step 5: Initialize TF-IDF vectorizer and fit it on the questions
vectorizer = TfidfVectorizer().fit(questions)
# Vectorize the dataset questions once, so each chat turn does not repeat this work
question_vecs = vectorizer.transform(questions)

# Step 6: Function to find the best response using cosine similarity
def chatbot_response(user_input, vectorizer, questions, answers):
    user_input = normalize_string(user_input)  # Normalize the user input
    user_vec = vectorizer.transform([user_input])  # Convert input to TF-IDF vector

    # Compute cosine similarity between user input and the precomputed question vectors
    # (question_vecs is the module-level matrix built above; questions is kept for the existing call signature)
    similarities = cosine_similarity(user_vec, question_vecs).flatten()

    # Find the index of the question with the highest similarity
    best_match_index = similarities.argmax()

    # If the best match similarity score is too low, provide a fallback response
    if similarities[best_match_index] < 0.5:  # Threshold can be tuned
        return random.choice([
            "I'm sorry, I don't understand. Can you rephrase?",
            "Could you clarify your question?",
            "I don't have an answer for that, sorry.",
            "Can you try asking that differently?"
        ])

    return answers[best_match_index]  # Return the best matched answer
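
To make the scoring less of a black box, here is a small pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is a simplification, not scikit-learn's implementation: it splits on whitespace, drops one-character tokens to roughly mirror the vectorizer's default tokenization, and uses a smoothed IDF. All function names and the toy questions are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace split; one-character tokens (e.g. "?" or "s") are dropped
    return [token for token in text.split() if len(token) > 1]

def fit_idf(documents):
    """Smoothed inverse document frequency for every token seen in documents."""
    n_docs = len(documents)
    doc_freq = Counter(t for doc in documents for t in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def vectorize(text, idf):
    """Sparse TF-IDF vector as a dict; tokens unseen during fitting are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: count * idf[t] for t, count in counts.items()}

def cosine(u, v):
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

def best_match(query, idf, question_vecs):
    """Index and score of the stored question closest to the query."""
    query_vec = vectorize(query, idf)
    scores = [cosine(query_vec, vec) for vec in question_vecs]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]

# Made-up, already-normalized questions
toy_questions = [
    "what are you doing here ?",
    "do you believe in love ?",
    "why do people lie ?",
]
idf = fit_idf(toy_questions)
toy_vecs = [vectorize(q, idf) for q in toy_questions]  # computed once, reused per query

for query in ["do you believe in love ?", "why do people fall in love ?"]:
    index, score = best_match(query, idf, toy_vecs)
    print(f"{query!r} -> {toy_questions[index]!r} (score {score:.2f})")
```

The score only reflects shared words weighted by rarity, so a question with overlapping vocabulary can rank highly even when its recorded reply has nothing to do with the input.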


In [49]:
# Step 7: Function to run the chatbot interactively with user inputs shown
def run_chatbot(vectorizer, questions, answers, user_inputs):
    # Initialize conversation log to store all user and chatbot messages
    conversation = []
    conversation.append("Chatbot: Hello! Ask me anything (type 'exit' to quit).")
    
    # Process each simulated user input
    for user_input in user_inputs:
        conversation.append(f"You: {user_input}")
        
        # Check if the user wants to exit
        if user_input.lower() == 'exit':
            conversation.append("Chatbot: Goodbye!")
            break
        
        # Generate chatbot response
        response = chatbot_response(user_input, vectorizer, questions, answers)
        conversation.append(f"Chatbot: {response}")
    
    # Display the entire conversation at once
    output_text = "\n".join(conversation)
    print(output_text)


# List of simulated user inputs for testing based on typical conversational patterns
user_inputs = [
    "What are you doing here?",
    "Do you believe in love?",
    "Tell me a secret.",
    "What's your favorite movie?",
    "Who is your best friend?",
    "Do you ever feel lonely?",
    "Why do people lie?",
    "What makes you happy?",
    "Can you tell me a joke?",
    "Why do people fall in love?",
    "exit"
    ]

# Run the chatbot with the vectorizer, questions, answers, and user inputs
run_chatbot(vectorizer, questions, answers, user_inputs)


Chatbot: Hello! Ask me anything (type 'exit' to quit).
You: What are you doing here?
Chatbot: excuse me have you seen the feminine mystique ? i lost my copy .
You: Do you believe in love?
Chatbot: no . it s not that .
You: Tell me a secret.
Chatbot: mm hm .
You: What's your favorite movie?
Chatbot: i don t remember . but off the top of my head i d say black .
You: Who is your best friend?
Chatbot: i was her best friend .
You: Do you ever feel lonely?
Chatbot: what ?
You: Why do people lie?
Chatbot: i lied . to her . she thought she d seen you .
You: What makes you happy?
Chatbot: i hope you like large weddings .
You: Can you tell me a joke?
Chatbot: druid hill park .
You: Why do people fall in love?
Chatbot: so what do i do ?
You: exit
Chatbot: Goodbye!


The chatbot now answers every simulated question instead of falling back, but the responses are not coherent in the context of the questions being asked. The model selects movie lines that followed a similarly worded utterance, and those lines don't align well with the questions. This indicates a gap between matching user input and retrieving relevant responses.

##### Issues in the Current Chatbot Implementation:
1. **Irrelevant Responses**:
   - The responses don't fit the questions well (e.g., "druid hill park ." in reply to "Can you tell me a joke?" and "i hope you like large weddings ." in reply to "What makes you happy?"). These responses, while part of movie dialogues, are not suitable answers to the questions being asked.
   
2. **No Contextual Understanding**:
   - The current approach does not take context into account and selects responses based purely on surface-level similarity, which is insufficient for meaningful conversations.

3. **Randomness of Responses**:
   - Since the chatbot uses pre-defined dialogue exchanges, the selected answers may not always make sense. This is a fundamental limitation of using raw movie dialogues without additional processing.

---

##### Improvements to Make:
1. **Contextual Filtering**:
   - Apply **semantic similarity** methods using pre-trained models like **BERT** to better understand the context and meaning of both the user's input and the movie dialogues.

2. **Dynamic Response Selection**:
   - To avoid irrelevant responses, you could filter potential answers based on content type or add rules to ensure answers fit specific question categories (e.g., responses about movies, actions, or greetings).
   
3. **Better Preprocessing**:
   - Improve normalization of both questions and answers by handling a wider range of characters, punctuation, and special symbols.

4. **Handling Conversational Context**:
   - Implement basic contextual memory to track ongoing conversation topics. For example, after a question about movies, subsequent responses should remain within the movie-related context.
   
5. **Threshold Tuning**:
   - The threshold for similarity (currently set at 0.5) could be tuned more dynamically. Higher thresholds could help in avoiding irrelevant answers, though it might increase the fallback responses.
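
The trade-off described under threshold tuning can be measured before changing the chatbot. The sketch below assumes you have collected the best-match similarity score for a set of test inputs; the scores shown are hypothetical values, not results from this notebook, and `fallback_rate` is an illustrative helper.

```python
def fallback_rate(best_scores, threshold):
    """Fraction of inputs whose best similarity falls below the threshold."""
    below = sum(1 for score in best_scores if score < threshold)
    return below / len(best_scores)

# Hypothetical best-match scores for six test inputs
toy_scores = [0.92, 0.81, 0.66, 0.58, 0.47, 0.35]

for threshold in (0.3, 0.5, 0.7, 0.9):
    rate = fallback_rate(toy_scores, threshold)
    print(f"threshold={threshold:.1f}  fallback_rate={rate:.2f}")
```

Raising the threshold can only increase the fallback rate, so a sweep like this shows where the share of fallback replies becomes unacceptable for your test inputs.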

---

##### Revised Code with Semantic Similarity (Using Sentence Transformers/BERT):
To significantly improve the chatbot's ability to provide relevant answers, let's incorporate **sentence embeddings** via **BERT-based models** from the `sentence-transformers` library.


In [50]:

#!pip install sentence-transformers convokit  # Install required libraries

from sentence_transformers import SentenceTransformer, util
from convokit import Corpus, download
import re
import unicodedata
import random

# Step 1: Download and load the Cornell Movie Dialogs Corpus
#corpus = Corpus(filename=download("movie-corpus"))




In [51]:

# Step 2: Function to extract conversation pairs (question-answer pairs)
def extract_sentence_pairs(corpus):
    qa_pairs = []
    for conversation in corpus.iter_conversations():
        utterances = conversation.get_utterance_ids()
        for i in range(len(utterances) - 1):  # Iterate through the conversation
            input_sentence = corpus.get_utterance(utterances[i]).text
            output_sentence = corpus.get_utterance(utterances[i + 1]).text
            qa_pairs.append([input_sentence, output_sentence])
    return qa_pairs

# Step 3: Extract sentence pairs from the dataset
qa_pairs = extract_sentence_pairs(corpus)
print(f"Extracted {len(qa_pairs)} question-answer pairs.")

# Step 4: Preprocessing - normalize the dataset
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Preprocess the QA pairs
questions = [normalize_string(pair[0]) for pair in qa_pairs]
answers = [normalize_string(pair[1]) for pair in qa_pairs]

# Step 5: Load a pre-trained sentence-transformer model for embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')  # A lightweight MiniLM sentence-embedding model (BERT-style transformer)

# Step 6: Encode all the questions in the dataset using the BERT model
encoded_questions = model.encode(questions, convert_to_tensor=True)

# Step 7: Function to find the best response using semantic similarity
def chatbot_response(user_input, model, encoded_questions, questions, answers):
    user_input = normalize_string(user_input)  # Normalize the user input
    user_embedding = model.encode(user_input, convert_to_tensor=True)  # Encode user input using BERT

    # Compute cosine similarity between user input and all questions in the dataset
    similarities = util.pytorch_cos_sim(user_embedding, encoded_questions).flatten()

    # Find the index of the question with the highest similarity
    best_match_index = similarities.argmax()

    # If the best match similarity score is too low, provide a fallback response
    if similarities[best_match_index] < 0.7:  # Adjust the threshold as needed
        return random.choice([
            "I'm sorry, I don't understand. Can you rephrase?",
            "Could you clarify your question?",
            "I don't have an answer for that, sorry.",
            "Can you try asking that differently?"
        ])

    return answers[best_match_index]  # Return the best matched answer
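
When tuning the threshold, it can help to look at several nearest questions rather than only the single best one. The sketch below shows the cosine computation and a top-k ranking in plain Python. The three-dimensional vectors are hypothetical stand-ins for real sentence embeddings, which have many more dimensions, and `top_k` is an illustrative helper rather than part of `sentence-transformers`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, candidate_vecs, k=3):
    """Return the k highest (score, index) pairs, best first."""
    scored = [(cosine(query_vec, vec), index)
              for index, vec in enumerate(candidate_vecs)]
    return sorted(scored, reverse=True)[:k]

# Hypothetical toy "embeddings" for three stored questions
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
]
toy_query = [1.0, 0.0, 0.0]

ranking = top_k(toy_query, toy_embeddings, k=2)
for score, index in ranking:
    print(f"question {index}: similarity {score:.2f}")
```

Printing the top few matched questions alongside their scores makes it easier to tell whether a poor reply comes from a poor question match or from a good match whose recorded answer simply does not suit the input.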


Extracted 221616 question-answer pairs.



In [53]:
# Step 8: Function to run the chatbot interactively with simulated input
def run_chatbot(model, encoded_questions, questions, answers, user_inputs):
    # Initialize conversation list to store each interaction
    conversation = []
    conversation.append("Chatbot: Hello! Ask me anything (type 'exit' to quit).")
    
    # Process each user input in the provided list of simulated inputs
    for user_input in user_inputs:
        conversation.append(f"You: {user_input}")
        
        # Check if the user wants to exit
        if user_input.lower() == 'exit':
            conversation.append("Chatbot: Goodbye!")
            break
        
        # Generate the chatbot's response based on the simulated user input
        response = chatbot_response(user_input, model, encoded_questions, questions, answers)
        conversation.append(f"Chatbot: {response}")
    
    # Print the entire conversation to simulate the chat
    output_text = "\n".join(conversation)
    print(output_text)

# List of simulated user inputs for testing based on typical conversational patterns
user_inputs = [
    "What are you doing here?",
    "Do you believe in love?",
    "Tell me a secret.",
    "What's your favorite movie?",
    "Who is your best friend?",
    "Do you ever feel lonely?",
    "Why do people lie?",
    "What makes you happy?",
    "Can you tell me a joke?",
    "Why do people fall in love?",
    "exit"
    ]

# Run the chatbot with simulated user inputs
run_chatbot(model, encoded_questions, questions, answers, user_inputs)


Chatbot: Hello! Ask me anything (type 'exit' to quit).
You: What are you doing here?
Chatbot: excuse me have you seen the feminine mystique ? i lost my copy .
You: Do you believe in love?
Chatbot: no . it s not that .
You: Tell me a secret.
Chatbot: mm hm .
You: What's your favorite movie?
Chatbot: what do you care ? let em have their fun . so what s up ?
You: Who is your best friend?
Chatbot: was jack goodman your good friend ?
You: Do you ever feel lonely?
Chatbot: now i make sure that no one has the opportunity to test me .
You: Why do people lie?
Chatbot: how did we get here ?
You: What makes you happy?
Chatbot: . . . honest . at least you re honest with me .
You: Can you tell me a joke?
Chatbot: i m so tired i m about to drive off the road . keep me awake willya ?
You: Why do people fall in love?
Chatbot: give what up ?
You: exit
Chatbot: Goodbye!



##### Improvements in This Version:

1. **BERT for Semantic Matching**:
   - We now use a **BERT-based model** (`all-MiniLM-L6-v2`, loaded through `sentence-transformers`) to generate sentence embeddings for both the user input and the dataset questions. Matching on embeddings rather than shared words is intended to let the chatbot pick questions with a similar meaning and respond with more relevant answers.

2. **Increased Threshold for Response Matching**:
   - The similarity threshold has been increased to 0.7. This makes the chatbot fall back when there isn't a close match, which is meant to reduce irrelevant answers. Embedding similarities and TF-IDF similarities are on different scales, so 0.7 here is not directly comparable to the 0.5 used in the TF-IDF version.

3. **Fallback Responses**:
   - More diverse fallback responses are used when the chatbot cannot find a good match, making the conversation feel more dynamic.

##### Expected Benefits:
- **More Relevant Responses**: Using **semantic similarity** lets the chatbot match on meaning rather than relying only on surface-level word overlap.
- **Handling a Variety of Input**: The chatbot should be able to handle various ways of phrasing a question, providing more meaningful interactions.
- **Fewer Irrelevant Replies**: The increased similarity threshold is meant to keep the chatbot from providing answers that don't fit the user's input.
