#### **Section 1: Import Libraries**

We import all the necessary libraries. Notice that we no longer need to import `gensim`.



In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import re
import nltk
from sklearn.model_selection import train_test_split

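# Ensure reproducibility (NumPy is seeded too, because it draws the special-token
# embeddings in Section 3 and the teacher-forcing decisions in Section 6)
torch.manual_seed(42)
np.random.seed(42)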

# Download necessary NLTK resources
nltk.download('punkt')

# Import word_tokenize explicitly from NLTK
from nltk.tokenize import word_tokenize

# Observations:
# - Added an import for `word_tokenize` explicitly after downloading the NLTK 'punkt' resource.
# - The word_tokenize function is now available globally in the script, and the error will no longer occur.


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Girija\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### **Section 2: Load and Prepare Data**

The data loading process remains the same.



In [14]:
# Step 2: Load the Processed Data
#file_path = "../data/processed/customer_support_dataset_processed.csv"  # for complete set
file_path = "../data/processed/customer_support_test_dataset_processed_10%.csv"  # 10% sample of the complete set, for quick training
df = pd.read_csv(file_path)

# Ensure the dataset contains essential columns
if 'customer_query_cleaned' not in df.columns or 'support_response_cleaned' not in df.columns:
    raise ValueError("Dataset missing required columns: 'customer_query_cleaned' and 'support_response_cleaned'")

# Split data into input and output pairs
queries = df['customer_query_cleaned']
responses = df['support_response_cleaned']

# Split dataset into training and validation sets (90% train, 10% validation)
train_queries, val_queries, train_responses, val_responses = train_test_split(
    queries, responses, test_size=0.1, random_state=42
)

# Observations:
# - Loaded and validated the cleaned dataset.
# - Split the data into training and validation sets, which is critical for model evaluation and avoiding overfitting.


#### **Section 3: Load Pre-trained GloVe Embeddings Without Gensim**

Instead of using Gensim, you will manually download the GloVe embeddings, read them, and then use them to create the embedding matrix.

##### **3.1 Download GloVe Embeddings Manually**

- You can download GloVe embeddings manually from the [GloVe Website](https://nlp.stanford.edu/projects/glove/). Choose, for example, the **glove.6B.zip** file and extract it.
- It contains multiple files like `glove.6B.50d.txt`, `glove.6B.100d.txt`, etc. We'll use `glove.6B.100d.txt` for 100-dimensional word embeddings.
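
Each line of these text files is a word followed by its space-separated vector components. As a minimal sketch of how that layout maps to a vocabulary and an embedding matrix, the example below parses two made-up 3-dimensional lines from an in-memory buffer instead of the real file. The names `load_glove_lines`, `word2idx_demo`, and `matrix_demo` are hypothetical and used only for illustration.

```python
import io
import random

def load_glove_lines(handle, special_tokens, dim, seed=42):
    """Build a word-to-index map and a list-of-lists matrix from GloVe-style text."""
    word2idx, matrix = {}, []
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows so indices stay aligned with matrix rows
        word2idx[parts[0]] = len(matrix)
        matrix.append([float(value) for value in parts[1:]])
    # Special tokens get random vectors appended after the pre-trained rows
    rng = random.Random(seed)
    for token in special_tokens:
        word2idx[token] = len(matrix)
        matrix.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return word2idx, matrix

# Made-up sample data, not real GloVe vectors
sample = io.StringIO("the 0.1 0.2 0.3\norder 0.4 0.5 0.6\n")
word2idx_demo, matrix_demo = load_glove_lines(sample, ["<PAD>", "<UNK>"], dim=3)
print(word2idx_demo)  # {'the': 0, 'order': 1, '<PAD>': 2, '<UNK>': 3}
```

Indexing by `len(matrix)` rather than by the file line number is a deliberate choice in this sketch: if a row is skipped, later words still point at the correct matrix row.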

##### **3.2 Load GloVe Embeddings in Python**



In [3]:
# Step 3: Load Pre-trained GloVe Embeddings (Without Gensim)
embedding_dim = 100
glove_path = "../glove.6B.100d.txt"  # Path to the downloaded GloVe file

# Initialize word2idx and embedding matrix lists
word2idx = {}
embedding_matrix = []

# Open the GloVe file and read the embeddings
print("Loading pre-trained GloVe embeddings (may take a few minutes)...")
with open(glove_path, 'r', encoding='utf-8') as f:
    for idx, line in enumerate(f):
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        word2idx[word] = idx
        embedding_matrix.append(vector)

# Add special tokens with random embeddings
special_tokens = ['<PAD>', '<SOS>', '<EOS>', '<UNK>']
for token in special_tokens:
    word2idx[token] = len(embedding_matrix)
    embedding_matrix.append(np.random.normal(size=(embedding_dim,)))

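# Convert embedding matrix to a tensor.
# Stack the rows into a single NumPy array first: building a tensor directly from a
# list of arrays is very slow and triggers a PyTorch UserWarning.
embedding_matrix = torch.tensor(np.array(embedding_matrix, dtype=np.float32), dtype=torch.float32)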

# Observations:
# - Loaded GloVe embeddings manually using Python without Gensim.
# - Created an embedding matrix and added random embeddings for special tokens.


Loading pre-trained GloVe embeddings (may take a few minutes)...




#### **Section 4: Custom Dataset and DataLoader**

This section remains the same. The custom dataset is responsible for tokenizing the input and padding it to a fixed length.



In [4]:
# Step 4: Custom Dataset and DataLoader
class ChatDataset(Dataset):
    def __init__(self, queries, responses, word2idx, max_len=20):
        # Reset the index of queries and responses to ensure valid indexing
        self.queries = queries.reset_index(drop=True).fillna("")  # Replace NaN with empty string
        self.responses = responses.reset_index(drop=True).fillna("")  # Replace NaN with empty string
        self.word2idx = word2idx
        self.max_len = max_len

    def __len__(self):
        return len(self.queries)

    def __getitem__(self, idx):
        # Convert text to token ids and pad/truncate to max_len
        try:
            query = self._text_to_sequence(self.queries[idx])
            response = self._text_to_sequence(self.responses[idx])
        except KeyError:
            print(f"KeyError: Index {idx} out of bounds for dataset length {len(self.queries)}")
            raise
        except Exception as e:
            print(f"Unexpected error at index {idx}: {e}")
            raise
        return torch.tensor(query, dtype=torch.long), torch.tensor(response, dtype=torch.long)

    def _text_to_sequence(self, text):
        # Handle non-string inputs
        if not isinstance(text, str):
            print(f"Invalid input detected: {text} (type: {type(text)}). Converting to empty string.")
            text = ""

        tokens = word_tokenize(text)  # Tokenize the text
        sequence = [self.word2idx.get(token, self.word2idx['<UNK>']) for token in tokens]
        sequence = [self.word2idx['<SOS>']] + sequence + [self.word2idx['<EOS>']]
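        # Truncate to max_len, then pad. Note: for texts longer than max_len - 2 tokens,
        # the truncation also cuts off the trailing <EOS>.
        sequence = sequence[:self.max_len] + [self.word2idx['<PAD>']] * (self.max_len - len(sequence))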
        return sequence

# DataLoader instances for training and validation
train_dataset = ChatDataset(train_queries, train_responses, word2idx)
val_dataset = ChatDataset(val_queries, val_responses, word2idx)

# DataLoader setup with reduced batch size to lower memory consumption
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=0, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False, num_workers=0, pin_memory=True)
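
To make the pad/truncate step concrete, here is a minimal sketch on plain integer ids. This variant reserves two positions for the start and end markers so that long inputs keep their end marker; the function name `to_fixed_length` and the ids used are hypothetical.

```python
def to_fixed_length(token_ids, sos_id, eos_id, pad_id, max_len):
    """Wrap ids in start/end markers and pad or truncate to exactly max_len."""
    body = token_ids[: max_len - 2]          # leave room for the two markers
    sequence = [sos_id] + body + [eos_id]
    return sequence + [pad_id] * (max_len - len(sequence))

short = to_fixed_length([5, 6], sos_id=1, eos_id=2, pad_id=0, max_len=6)
long = to_fixed_length([5, 6, 7, 8, 9, 10], sos_id=1, eos_id=2, pad_id=0, max_len=6)
print(short)  # [1, 5, 6, 2, 0, 0]
print(long)   # [1, 5, 6, 7, 8, 2]
```

Every sequence comes out with the same length, which is what lets the `DataLoader` stack samples into a batch tensor.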


#### **Section 5: Encoder-Decoder Model Design with Attention**

The encoder and decoder design with attention remains largely unchanged, except that we use the manually loaded GloVe embeddings.

##### **5.1 Encoder Definition**



In [5]:
# Encoder Definition
class Encoder(nn.Module):
    def __init__(self, input_size, embedding_matrix, hidden_size, num_layers=1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        self.rnn = nn.LSTM(embedding_matrix.size(1), hidden_size, num_layers, batch_first=True)

    def forward(self, x):
        # x: [batch_size, seq_len]
        embedded = self.embedding(x)  # embedded: [batch_size, seq_len, embedding_dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        return outputs, hidden, cell

# Observations:
# - The encoder uses pre-trained GloVe embeddings loaded manually.
# - The embeddings are not frozen (`freeze=False`), meaning they will be fine-tuned during training.


##### **5.2 Decoder with Attention Definition**

The decoder is modified to include the attention layer for better context representation.



In [6]:
# Define Attention Mechanism (Make sure it's defined before the decoder)
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()
        self.hidden_size = hidden_size
        # Linear layers to compute alignment scores and convert to attention weights
        self.attention = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden: [batch_size, hidden_size]
        # encoder_outputs: [batch_size, seq_len, hidden_size]
        batch_size = encoder_outputs.shape[0]
        seq_len = encoder_outputs.shape[1]

        # Repeat hidden state seq_len times
        hidden = hidden.unsqueeze(1).repeat(1, seq_len, 1)

        # Concatenate encoder outputs with the repeated hidden state
        energy = torch.tanh(self.attention(torch.cat((hidden, encoder_outputs), dim=2)))
        # Calculate attention scores
        attention = self.v(energy).squeeze(2)

        # Apply softmax to calculate attention weights
        return torch.softmax(attention, dim=1)

# Decoder with Attention Definition (After Attention class is defined)
class DecoderWithAttention(nn.Module):
    def __init__(self, output_size, embedding_matrix, hidden_size, num_layers=1):
        super(DecoderWithAttention, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        self.attention = Attention(hidden_size)
        self.rnn = nn.LSTM(hidden_size + embedding_matrix.size(1), hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, output_size)

    def forward(self, x, hidden, cell, encoder_outputs):
        # x: [batch_size], hidden, cell: [num_layers, batch_size, hidden_size], encoder_outputs: [batch_size, seq_len, hidden_size]
        x = x.unsqueeze(1)  # Add time dimension: [batch_size, 1]
        embedded = self.embedding(x)  # embedded: [batch_size, 1, embedding_dim]

        # Calculate attention weights and apply to encoder outputs to get context vector
        attention_weights = self.attention(hidden[-1], encoder_outputs)
        attention_weights = attention_weights.unsqueeze(1)  # [batch_size, 1, seq_len]
        context = torch.bmm(attention_weights, encoder_outputs)  # [batch_size, 1, hidden_size]

        # Concatenate the context vector with the embedded input word
        rnn_input = torch.cat((embedded, context), dim=2)  # [batch_size, 1, hidden_size + embedding_dim]
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))

        # Use the output of RNN and context vector for prediction
        prediction = self.fc(torch.cat((output.squeeze(1), context.squeeze(1)), dim=1))  # [batch_size, output_size]
        return prediction, hidden, cell

# Observations:
# - The `Attention` class must be defined before it is used by `DecoderWithAttention`.
# - This ensures there is no `NameError` when defining the decoder.
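
The two operations at the heart of the attention step, a softmax over alignment scores and a weighted sum of encoder outputs, can be sketched without tensors. The scores below are made-up numbers standing in for the output of `self.v(energy)`, and the helper names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into weights that sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_outputs):
    """Weighted sum of encoder outputs, one value per hidden dimension."""
    hidden_size = len(encoder_outputs[0])
    return [
        sum(w * step[d] for w, step in zip(weights, encoder_outputs))
        for d in range(hidden_size)
    ]

scores = [2.0, 0.0, 0.0]                                    # one score per source position
encoder_outputs_demo = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # seq_len=3, hidden_size=2
weights = softmax(scores)
context = context_vector(weights, encoder_outputs_demo)
print(weights, context)
```

The first source position receives the largest weight, so the context vector leans toward its encoder output. In the decoder, `torch.bmm` performs the same weighted sum for a whole batch at once.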


#### **Section 6: Seq2Seq Model Class with Attention Decoder**

The Seq2Seq class integrates the **Encoder** and **DecoderWithAttention** to generate responses.



In [7]:
# Seq2Seq Model Class with Attention Decoder
class Seq2SeqWithAttention(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2SeqWithAttention, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, source, target, teacher_forcing_ratio=0.5):
        batch_size = source.shape[0]
        target_len = target.shape[1]
        output_size = self.decoder.fc.out_features

        outputs = torch.zeros(batch_size, target_len, output_size).to(self.device)

        # Pass input through the encoder
        encoder_outputs, hidden, cell = self.encoder(source)

        # First input to the decoder is the <SOS> token
        input = target[:, 0]  # <SOS> token for each batch

        for t in range(1, target_len):
            # Pass the input, hidden state, and encoder outputs to the decoder
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:, t, :] = output

            # Determine the next input using teacher forcing
            top1 = output.argmax(1)
            input = target[:, t] if np.random.random() < teacher_forcing_ratio else top1

        return outputs

# Observations:
# - The Seq2Seq model class integrates the encoder and decoder and passes encoder outputs to the decoder for attention.


#### **Step 9: Evaluate and Test the Trained Model**

##### Steps:
- Load the trained model: load the best model checkpoint saved during training.
- Put the model in evaluation mode: call `model.eval()`.
- Define metrics for evaluation: calculate the BLEU score to measure the quality of generated responses.
- Generate responses: for a given input, generate a response and compare it with the ground truth.
- Evaluate with test data: loop over a test dataset, generate responses, and calculate evaluation metrics.
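
Response generation in the steps above amounts to greedy decoding: start from `<SOS>`, repeatedly take the most likely next token, and stop at `<EOS>` or a length limit. A minimal sketch, assuming a hypothetical `step_fn` that stands in for one decoder call plus `argmax`:

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len):
    """Generate token ids one at a time until <EOS> or max_len steps."""
    tokens, current = [], sos_id
    for _ in range(max_len):
        current = step_fn(current)   # most likely next token given the current one
        if current == eos_id:
            break                    # stop without emitting the end marker
        tokens.append(current)
    return tokens

# Hypothetical stand-in for the decoder: a fixed table of next tokens
toy_transitions = {1: 7, 7: 8, 8: 2}
decoded = greedy_decode(lambda tok: toy_transitions[tok], sos_id=1, eos_id=2, max_len=10)
print(decoded)  # [7, 8]
```

The real decoder also carries hidden and cell states and the encoder outputs from step to step; they are left out here to keep the control flow visible.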

In [17]:
import pandas as pd

# Load the dataset
test_data_path = "../data/processed/customer_support_test_dataset_processed_10%.csv"
test_data = pd.read_csv(test_data_path)

# Display the first few rows
print(test_data.head())

# Display columns in the dataset
print(test_data.columns)

# Check for missing values
print(test_data.isnull().sum())

# Check data types of each column
print(test_data.dtypes)


                              customer_query_cleaned  \
0  swhelp dont worry i forgot there was a rmt strike   
1  aldiuk please tell me kevin will be back on st...   
2                      americanair  httpstcooevjrhfh   
3  comcastcares morning  back to deal w our hddta...   
4  americanair quite possibly the worst serviceit...   

                            support_response_cleaned  
0   no probs we got yo back fam if anything happe...  
1   thanks catherine ive passed this onto our dut...  
2   yes we are awaiting an update as to when the ...  
3   sorry to know youre facing issues with cash b...  
4   ive just checked and we arent listed on there...  
Index(['customer_query_cleaned', 'support_response_cleaned'], dtype='object')
customer_query_cleaned      72
support_response_cleaned     1
dtype: int64
customer_query_cleaned      object
support_response_cleaned    object
dtype: object


Import the necessary libraries first.

In [20]:
import torch
import nltk
import pandas as pd
from torch.utils.data import DataLoader
from tqdm import tqdm
from nltk.translate.bleu_score import sentence_bleu
import numpy as np

# Ensure nltk resources are available for BLEU calculation
nltk.download('punkt')

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the trained model checkpoint (Ensure compatibility with CPU-only machines)
checkpoint_path = "../models/seq2seq_attention_best_model.pth"

# Hyperparameters (Ensure these are the same as those used during training)
input_size = len(word2idx)
output_size = len(word2idx)
hidden_size = 256  # Ensure that this matches what was used during training

# Instantiate encoder, decoder with attention, and Seq2Seq model
encoder = Encoder(input_size, embedding_matrix, hidden_size).to(device)
decoder = DecoderWithAttention(output_size, embedding_matrix, hidden_size).to(device)
model = Seq2SeqWithAttention(encoder, decoder, device).to(device)

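# Load the trained model weights in non-strict mode.
# Note: strict=False silently skips missing or unexpected keys, so a checkpoint that
# does not match this architecture will load without raising an error.
model.load_state_dict(torch.load(checkpoint_path, map_location=device), strict=False)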

# Put the model in evaluation mode
model.eval()

# Create idx2word dictionary from word2idx for converting token IDs back to words
idx2word = {idx: word for word, idx in word2idx.items()}


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Girija\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
  model.load_state_dict(torch.load(checkpoint_path, map_location=device), strict=False)


Load and Prepare the Test Dataset


In [22]:
# Load test dataset from CSV file
test_data_path = "../data/processed/customer_support_test_dataset_processed_10%.csv"
test_data = pd.read_csv(test_data_path)

# Handle missing values by dropping rows containing any NaNs in `customer_query_cleaned` or `support_response_cleaned`
test_data.dropna(subset=['customer_query_cleaned', 'support_response_cleaned'], inplace=True)

# Extract queries and responses (as Pandas Series)
test_queries = test_data['customer_query_cleaned']
test_responses = test_data['support_response_cleaned']

# Create a DataLoader for the test dataset
test_dataset = ChatDataset(test_queries, test_responses, word2idx)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)


BLEU Score Evaluation for 1000 Test Inputs

In [23]:
# BLEU Score Evaluation for First 1000 Inputs
bleu_scores_1000 = []

with torch.no_grad():
    with tqdm(total=1000, desc="Evaluating First 1000 Inputs", unit="sample") as pbar:
        for i, (source, target) in enumerate(test_loader):
            if i >= 1000:
                break

            source = source.to(device)
            target = target.to(device)

            # Get the output from the model
            output_tokens = []
            encoder_outputs, hidden, cell = model.encoder(source)

            # Start the decoding process with <SOS> token
            input_token = torch.tensor([word2idx['<SOS>']], dtype=torch.long).to(device)
            for _ in range(target.size(1)):
                output, hidden, cell = model.decoder(input_token, hidden, cell, encoder_outputs)
                top1 = output.argmax(1)
                output_tokens.append(top1.item())

                # Break if <EOS> token is predicted
                if top1.item() == word2idx['<EOS>']:
                    break

                # The next input token is the current output token
                input_token = top1

            # Convert predicted token IDs to words
            predicted_sentence = [idx2word[token] for token in output_tokens if token != word2idx['<PAD>']]

            # Convert target token IDs to words (ground truth)
            target_sentence = [idx2word[token.item()] for token in target[0] if token.item() not in [word2idx['<PAD>'], word2idx['<SOS>'], word2idx['<EOS>']]]

            # Calculate BLEU score
            bleu_score = sentence_bleu([target_sentence], predicted_sentence, weights=(0.5, 0.5))
            bleu_scores_1000.append(bleu_score)

            pbar.set_postfix({"BLEU": bleu_score})
            pbar.update(1)

# Calculate and print the average BLEU score for the first 1000 test inputs
average_bleu_score_1000 = np.mean(bleu_scores_1000)
print(f"Average BLEU Score on First 1000 Test Inputs: {average_bleu_score_1000:.4f}")


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
Evaluating First 1000 Inputs: 100%|██████████| 1000/1000 [18:55<00:00,  1.14s/sample, BLEU=0.108]   

Average BLEU Score on First 1000 Test Inputs: 0.0290
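
The warning in the output above explains why many samples score zero: BLEU with weights `(0.5, 0.5)` is a geometric mean of the unigram and bigram precisions, so a single zero precision wipes out the whole score. The sketch below is a simplified single-reference re-implementation written for illustration, not NLTK's code; `ngram_precision` and `simple_bleu2` are hypothetical names and the sentences are made up.

```python
import math
from collections import Counter

def ngram_precision(reference, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / total

def simple_bleu2(reference, hypothesis):
    """Unsmoothed BLEU over unigrams and bigrams with equal weights."""
    p1 = ngram_precision(reference, hypothesis, 1)
    p2 = ngram_precision(reference, hypothesis, 2)
    if p1 == 0.0 or p2 == 0.0:
        return 0.0  # the geometric mean collapses when any precision is zero
    # Brevity penalty: penalize hypotheses shorter than the reference
    if len(hypothesis) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(hypothesis))
    return brevity * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))

reference = "thanks we are looking into it".split()
hypothesis = "sorry we can look".split()
print(ngram_precision(reference, hypothesis, 1))  # 0.25
print(simple_bleu2(reference, hypothesis))        # 0.0
```

Here one unigram matches but no bigram does, so the score is zero. As the warning suggests, `sentence_bleu` accepts a `smoothing_function` argument (for example `SmoothingFunction().method1` from `nltk.translate.bleu_score`) to avoid this collapse on short sentences. It is also worth noting that the evaluation loop filters only `<PAD>` from the prediction, so a generated `<EOS>` token is scored as if it were a word.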





BLEU Score Evaluation for Full Dataset

In [None]:
# BLEU Score Evaluation for Full Test Dataset
bleu_scores_full = []

with torch.no_grad():
    with tqdm(total=len(test_loader), desc="Evaluating Full Test Dataset", unit="sample") as pbar:
        for i, (source, target) in enumerate(test_loader):
            source = source.to(device)
            target = target.to(device)

            # Get the output from the model
            output_tokens = []
            encoder_outputs, hidden, cell = model.encoder(source)

            # Start the decoding process with <SOS> token
            input_token = torch.tensor([word2idx['<SOS>']], dtype=torch.long).to(device)
            for _ in range(target.size(1)):
                output, hidden, cell = model.decoder(input_token, hidden, cell, encoder_outputs)
                top1 = output.argmax(1)
                output_tokens.append(top1.item())

                # Break if <EOS> token is predicted
                if top1.item() == word2idx['<EOS>']:
                    break

                # The next input token is the current output token
                input_token = top1

            # Convert predicted token IDs to words
            predicted_sentence = [idx2word[token] for token in output_tokens if token != word2idx['<PAD>']]

            # Convert target token IDs to words (ground truth)
            target_sentence = [idx2word[token.item()] for token in target[0] if token.item() not in [word2idx['<PAD>'], word2idx['<SOS>'], word2idx['<EOS>']]]

            # Calculate BLEU score
            bleu_score = sentence_bleu([target_sentence], predicted_sentence, weights=(0.5, 0.5))
            bleu_scores_full.append(bleu_score)

            pbar.set_postfix({"BLEU": bleu_score})
            pbar.update(1)

# Calculate and print the average BLEU score for the full test dataset
average_bleu_score_full = np.mean(bleu_scores_full)
print(f"Average BLEU Score on Full Test Dataset: {average_bleu_score_full:.4f}")


Interactive Test with Predefined Inputs

In [25]:
# Interactive Test with Predefined Inputs

# Predefined test inputs for the chatbot
predefined_inputs = [
    "How do I track my order?",
    "I am facing issues with logging in. Can you help?",
    "Is my payment secure on your website?",
    "Where can I find the return policy?",
    "Can you tell me if my order has been shipped?"
]

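# Function to tokenize input text for use by the model.
# Note: unlike ChatDataset, this splits on whitespace, keeps punctuation attached to
# words (so a token like "order?" is unlikely to be in the vocabulary and falls back
# to <UNK>), and does not add <SOS>/<EOS>.
def prepare_input_text(input_text, word2idx, max_len=20):
    tokens = input_text.lower().split()  # Simple tokenization (splitting by space)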
    token_ids = [word2idx.get(token, word2idx['<UNK>']) for token in tokens]  # Convert tokens to ids
    token_ids = token_ids[:max_len]  # Truncate if longer than max_len
    token_ids += [word2idx['<PAD>']] * (max_len - len(token_ids))  # Pad if shorter than max_len
    return torch.tensor([token_ids], dtype=torch.long).to(device)

# Function to generate response from the model (gradients are not needed at inference time)
@torch.no_grad()
def generate_response(model, source_input, word2idx, idx2word):
    # Set model to evaluation mode
    model.eval()

    # Encode the source input
    encoder_outputs, hidden, cell = model.encoder(source_input)

    # Start decoding with <SOS> token
    input_token = torch.tensor([word2idx['<SOS>']], dtype=torch.long).to(device)
    output_tokens = []

    # Decoding loop
    max_output_length = 20  # Set a maximum length for the response
    for _ in range(max_output_length):
        output, hidden, cell = model.decoder(input_token, hidden, cell, encoder_outputs)
        top1 = output.argmax(1)
        output_tokens.append(top1.item())

        # Break if <EOS> token is predicted
        if top1.item() == word2idx['<EOS>']:
            break

        # The next input token is the current output token
        input_token = top1

    # Convert output tokens to words, skipping padding and the end-of-sequence marker
    skip_ids = {word2idx['<PAD>'], word2idx['<EOS>']}
    response_sentence = [idx2word[token] for token in output_tokens if token not in skip_ids]
    return ' '.join(response_sentence)

# Iterate through predefined customer queries
print("Interactive Chatbot Responses:\n")
for customer_query in predefined_inputs:
    # Prepare the input text
    source_input = prepare_input_text(customer_query, word2idx)

    # Generate response using the model
    response = generate_response(model, source_input, word2idx, idx2word)

    # Display the query and response
    print(f"Customer Query: {customer_query}")
    print(f"Chatbot Response: {response}")
    print('-' * 50)


Interactive Chatbot Responses:

Customer Query: How do I track my order?
Chatbot Response: hi there sorry for the trouble please dm us your your email address and we can take a look backstage
--------------------------------------------------
Customer Query: I am facing issues with logging in. Can you help?
Chatbot Response: hi there sorry for the trouble please dm us your your email address and we can take a look backstage
--------------------------------------------------
Customer Query: Is my payment secure on your website?
Chatbot Response: hi there sorry for the trouble please dm us your your email address and we can take a look backstage
--------------------------------------------------
Customer Query: Where can I find the return policy?
Chatbot Response: hi there sorry for the trouble please dm us your your email address and we can take a look backstage
--------------------------------------------------
Customer Query: Can you tell me if my order has been shipped?
Chatbot Respo
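
Every query above receives the same response. One thing worth checking is whether the inference-time preprocessing matches what `ChatDataset` does during training. The sketch below encodes a query with lowercasing, punctuation removal, and `<SOS>`/`<EOS>` markers. The cleaning regex is an assumption inferred from the sample rows (for example `dont`, `ive`), since the original cleaning rules are not shown, and whitespace splitting stands in for `word_tokenize`. The name `encode_query` and the toy vocabulary are hypothetical.

```python
import re

def encode_query(text, vocab, max_len=20):
    """Clean, tokenize, and encode a query the way the training data is shaped."""
    # Assumed cleaning: lowercase and drop everything except letters, digits, spaces
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    tokens = cleaned.split()
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in tokens][: max_len - 2]
    ids = [vocab["<SOS>"]] + ids + [vocab["<EOS>"]]
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))

toy_vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3, "track": 4, "order": 5, "my": 6}
encoded = encode_query("How do I track my order?", toy_vocab, max_len=10)
print(encoded)  # [1, 3, 3, 3, 4, 6, 5, 2, 0, 0]
```

With the question mark stripped, `order` resolves to its own id instead of `<UNK>`. Whether this changes the model's responses depends on the checkpoint; the sketch only aligns the input format.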

In [27]:

# Optional: Allow for user-inputted queries in real-time
while True:
    customer_query = input("You: ")
    if customer_query.lower() in ['exit', 'quit']:
        print("Exiting the interactive chat...")
        break

    # Prepare the input text
    source_input = prepare_input_text(customer_query, word2idx)

    # Generate response using the model
    response = generate_response(model, source_input, word2idx, idx2word)

    # Display the chatbot response
    print(f"Chatbot: {response}")
    print('-' * 50)


Chatbot: hi there sorry for the trouble please dm us your your email address and we can take a look backstage
--------------------------------------------------
Exiting the interactive chat...
