Alright, let's proceed with the **Seq2Seq Model Implementation** for building the foundational chatbot using an Encoder-Decoder architecture. Below, I've included a detailed impl
### **1.2 Seq2Seq Model Implementation** in `model_training.ipynb`

---

#### **Section 1: Import Libraries**

We start by importing the necessary libraries for data handling, deep learning, and other utilities.



In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import re
import nltk
from sklearn.model_selection import train_test_split

# Ensure reproducibility (NumPy is seeded too, because teacher forcing below draws from np.random)
torch.manual_seed(42)
np.random.seed(42)

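# Download the Punkt tokenizer data (used by word_tokenize to split text into words)
nltk.download('punkt')
# Newer NLTK releases look for 'punkt_tab' instead; requesting both is harmless
nltk.download('punkt_tab')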

# Observations:
# - PyTorch is used for creating and training the neural network.
# - NLTK is used for preprocessing and tokenizing text.


#### **Section 2: Load and Prepare Data**

Load the cleaned dataset that was saved in the previous preprocessing step (`customer_support_dataset_processed.csv`) and prepare it for training.



In [None]:
# Step 2: Load the Processed Data
file_path = "../data/processed/customer_support_dataset_processed.csv"
df = pd.read_csv(file_path)

# Split data into input and output pairs
queries = df['customer_query_cleaned']
responses = df['support_response_cleaned']

# Split dataset into training and validation sets (90% train, 10% validation)
train_queries, val_queries, train_responses, val_responses = train_test_split(
    queries, responses, test_size=0.1, random_state=42
)

# Observations:
# - The cleaned dataset is loaded, and customer queries are paired with their responses.
# - The data is split into training and validation sets for training and evaluating the model's performance.
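
One detail of the split is easy to miss: `train_test_split` returns pandas Series that keep their original row labels, so the labels come back shuffled and with gaps. The sketch below shows this on a toy frame and resets the index so that positional access works again. The toy rows and the `toy_*` names are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned dataset (hypothetical rows).
toy_df = pd.DataFrame({
    "customer_query_cleaned": [f"query {i}" for i in range(10)],
    "support_response_cleaned": [f"response {i}" for i in range(10)],
})

toy_train_q, toy_val_q, toy_train_r, toy_val_r = train_test_split(
    toy_df["customer_query_cleaned"],
    toy_df["support_response_cleaned"],
    test_size=0.2,
    random_state=42,
)

# The split keeps the original row labels, so they are shuffled and have gaps.
print(list(toy_train_q.index))

# Resetting the index makes position i and label i the same thing again.
toy_train_q = toy_train_q.reset_index(drop=True)
toy_train_r = toy_train_r.reset_index(drop=True)
print(list(toy_train_q.index))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Converting the Series to plain Python lists with `list(...)` has the same effect, which is handy when a `Dataset` looks items up by integer position.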


#### **Section 3: Text Tokenization and Vocabulary Building**

Tokenize the sentences and build vocabulary dictionaries for mapping words to integer tokens.



In [None]:
# Tokenize the text and create vocabulary
from collections import Counter
from nltk.tokenize import word_tokenize

# Step 3.1: Tokenization
train_queries_tokens = [word_tokenize(query) for query in train_queries]
train_responses_tokens = [word_tokenize(response) for response in train_responses]

# Step 3.2: Building Vocabulary
def build_vocab(token_lists):
    vocab = Counter()
    for tokens in token_lists:
        vocab.update(tokens)
    return vocab

vocab_queries = build_vocab(train_queries_tokens)
vocab_responses = build_vocab(train_responses_tokens)

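# Special tokens get the lowest indices, so '<PAD>' is always index 0
special_tokens = ['<PAD>', '<SOS>', '<EOS>', '<UNK>']

# Create word to index and index to word dictionaries.
# One shared vocabulary covers queries and responses, because the same word2idx
# is used later for both the encoder inputs and the decoder targets.
shared_vocab = vocab_queries + vocab_responses
word2idx = {token: idx for idx, token in enumerate(special_tokens)}
for word, _ in shared_vocab.most_common():
    if word not in word2idx:
        word2idx[word] = len(word2idx)
idx2word = {idx: word for word, idx in word2idx.items()}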

# Observations:
# - Text is tokenized using the NLTK tokenizer.
# - Vocabulary is built using word frequency counting.
# - Special tokens like '<PAD>' (padding), '<SOS>' (start of sentence), '<EOS>' (end of sentence), and '<UNK>' (unknown word) are added.
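
To make the mapping concrete, here is a minimal sketch of how a sentence round-trips through a word-to-index dictionary. It assumes already-tokenized toy sentences and places the special tokens at the lowest indices. The names `toy_word2idx`, `encode`, and `decode` are illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical, already-tokenized sentences standing in for the real data.
toy_token_lists = [
    ["my", "order", "is", "late"],
    ["where", "is", "my", "refund"],
]

counts = Counter()
for tokens in toy_token_lists:
    counts.update(tokens)

# Special tokens take the lowest indices; real words follow by frequency.
toy_special = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
toy_word2idx = {tok: i for i, tok in enumerate(toy_special)}
for word, _ in counts.most_common():
    toy_word2idx[word] = len(toy_word2idx)
toy_idx2word = {i: w for w, i in toy_word2idx.items()}

def encode(tokens):
    """Map tokens to ids, falling back to <UNK> for unseen words."""
    unk = toy_word2idx["<UNK>"]
    return [toy_word2idx.get(t, unk) for t in tokens]

def decode(ids):
    """Map ids back to tokens."""
    return [toy_idx2word[i] for i in ids]

ids = encode(["my", "package", "is", "late"])
print(ids)
print(decode(ids))  # ['my', '<UNK>', 'is', 'late']
```

Any word that was not seen while building the vocabulary, such as `package` here, collapses to `<UNK>` and cannot be recovered on decoding.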


#### **Section 4: Create Custom Dataset and DataLoader**

Prepare a PyTorch dataset and DataLoader to handle batching during training.



In [None]:
# Step 4: Custom Dataset and DataLoader
class ChatDataset(Dataset):
    def __init__(self, queries, responses, word2idx, max_len=20):
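        # Store plain lists: Series returned by train_test_split keep their original
        # (shuffled) index labels, so a positional lookup like queries[idx] would fail.
        self.queries = list(queries)
        self.responses = list(responses)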
        self.word2idx = word2idx
        self.max_len = max_len

    def __len__(self):
        return len(self.queries)

    def __getitem__(self, idx):
        # Convert text to token ids and pad/truncate to max_len
        query = self._text_to_sequence(self.queries[idx])
        response = self._text_to_sequence(self.responses[idx])
        return torch.tensor(query, dtype=torch.long), torch.tensor(response, dtype=torch.long)

    def _text_to_sequence(self, text):
        tokens = word_tokenize(text)
        sequence = [self.word2idx.get(token, self.word2idx['<UNK>']) for token in tokens]
        # Truncate before adding the markers so '<EOS>' is never cut off
        sequence = sequence[:self.max_len - 2]
        sequence = [self.word2idx['<SOS>']] + sequence + [self.word2idx['<EOS>']]
        # Pad up to max_len
        sequence = sequence + [self.word2idx['<PAD>']] * (self.max_len - len(sequence))
        return sequence

# Create DataLoader instances for training and validation datasets
train_dataset = ChatDataset(train_queries, train_responses, word2idx)
val_dataset = ChatDataset(val_queries, val_responses, word2idx)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Observations:
# - A custom PyTorch dataset (`ChatDataset`) is created to handle tokenization and padding.
# - Padding ensures that all sequences have the same length, facilitating easy batch processing.
# - DataLoader is used to create batches for training and validation, which helps to efficiently manage large datasets.
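
The pad/truncate step is easiest to follow on plain integers. The sketch below assumes ids 0, 1, and 2 for `<PAD>`, `<SOS>`, and `<EOS>`. These ids and the helper name `pad_or_truncate` are choices made for this example. The helper reserves two positions for the markers, so a long sentence still ends in `<EOS>`.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0, sos_id=1, eos_id=2):
    """Wrap ids in <SOS>/<EOS>, then fit them to exactly max_len positions."""
    body = token_ids[: max_len - 2]  # leave room for <SOS> and <EOS>
    sequence = [sos_id] + body + [eos_id]
    return sequence + [pad_id] * (max_len - len(sequence))

short = pad_or_truncate([10, 11, 12], max_len=8)
long = pad_or_truncate(list(range(10, 30)), max_len=8)
print(short)  # [1, 10, 11, 12, 2, 0, 0, 0]
print(long)   # [1, 10, 11, 12, 13, 14, 15, 2]
```

Because every sequence comes out with the same length, the default `DataLoader` collation can stack them into a single `(batch_size, max_len)` tensor.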


#### **Section 5: Encoder-Decoder Model Design**

Define the Encoder and Decoder classes for the Seq2Seq architecture.

##### **5.1 Encoder Class**



In [None]:
# Step 5.1: Encoder Definition
class Encoder(nn.Module):
    def __init__(self, input_size, embedding_dim, hidden_size, num_layers=1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_size, num_layers, batch_first=True)
    
    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.rnn(embedded)
        return hidden, cell

# Observations:
# - The Encoder takes the input sentence and produces hidden and cell states.
# - The embedding layer converts word indices to embedding vectors.
# - The LSTM captures sequential dependencies; only its final hidden and cell states are passed on to the decoder.
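
A quick shape check helps when wiring the encoder to the decoder. The sketch below runs the same two layers on random token ids. The sizes are deliberately small and hypothetical, not the ones used for training.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen small so the shapes are easy to read.
vocab_size, emb_dim, hid_dim, n_layers = 50, 8, 16, 1
batch, seq_len = 4, 20

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hid_dim, n_layers, batch_first=True)

token_ids = torch.randint(0, vocab_size, (batch, seq_len))
outputs, (hidden, cell) = lstm(embedding(token_ids))

print(outputs.shape)  # torch.Size([4, 20, 16])
print(hidden.shape)   # torch.Size([1, 4, 16])
print(cell.shape)     # torch.Size([1, 4, 16])
```

Note that `batch_first=True` only affects the input and `outputs`. The hidden and cell states are always shaped `(num_layers, batch, hidden_size)`.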


##### **5.2 Decoder Class**



In [None]:
# Step 5.2: Decoder Definition
class Decoder(nn.Module):
    def __init__(self, output_size, embedding_dim, hidden_size, num_layers=1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden, cell):
        x = x.unsqueeze(1)  # Add time dimension
        embedded = self.embedding(x)
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))
        return prediction, hidden, cell

# Observations:
# - The Decoder takes the previous word, hidden state, and cell state to produce the next word.
# - The fully connected layer is used to predict the next word in the sequence.
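
The decoder processes one time step per call, which is why it adds and then removes a length-1 time dimension. The sketch below traces a single step with the same three layers. It uses small hypothetical sizes and zero-initialised states in place of real encoder output.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
vocab_size, emb_dim, hid_dim = 50, 8, 16
batch = 4

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
fc = nn.Linear(hid_dim, vocab_size)

prev_tokens = torch.randint(0, vocab_size, (batch,))  # one token per sequence
hidden = torch.zeros(1, batch, hid_dim)               # stand-in for encoder state
cell = torch.zeros(1, batch, hid_dim)

step_in = embedding(prev_tokens.unsqueeze(1))             # (batch, 1, emb_dim)
step_out, (hidden, cell) = lstm(step_in, (hidden, cell))  # (batch, 1, hid_dim)
logits = fc(step_out.squeeze(1))                          # (batch, vocab_size)
next_tokens = logits.argmax(dim=1)                        # (batch,)

print(logits.shape, next_tokens.shape)
```

The logits are unnormalised scores over the vocabulary. `nn.CrossEntropyLoss` applies the softmax internally, so the decoder does not need one.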



##### **5.3 Seq2Seq Model Class**



In [None]:
# Step 5.3: Seq2Seq Model Class
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, source, target, teacher_forcing_ratio=0.5):
        batch_size = source.shape[0]
        target_len = target.shape[1]
        output_size = self.decoder.fc.out_features

        outputs = torch.zeros(batch_size, target_len, output_size).to(self.device)

        hidden, cell = self.encoder(source)

        # The first decoder input is the <SOS> token of each target sequence
        decoder_input = target[:, 0]

        for t in range(1, target_len):
            output, hidden, cell = self.decoder(decoder_input, hidden, cell)
            outputs[:, t, :] = output
            top1 = output.argmax(1)
            decoder_input = target[:, t] if np.random.random() < teacher_forcing_ratio else top1

        return outputs

# Observations:
# - The Seq2Seq class ties the Encoder and Decoder together.
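# - Teacher forcing: at each step the decoder is fed the true target token with probability `teacher_forcing_ratio`; otherwise it is fed its own previous prediction.
# - `outputs[:, 0]` stays all zeros because decoding starts at t = 1, so position 0 should be skipped when computing the loss.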
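
The `forward` method above needs a target tensor, so it cannot generate a reply on its own. One common option, which this notebook does not define, is greedy decoding: start from `<SOS>` and keep feeding back the most likely token until `<EOS>` or a length limit. The sketch below is a minimal version under those assumptions. `TinyEncoder`, `TinyDecoder`, and `greedy_decode` are hypothetical names. The two tiny classes only mirror the encoder and decoder interfaces so that the block runs on its own.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Minimal stand-in with the same interface as the Encoder above."""
    def __init__(self, vocab, emb, hid):
        super().__init__()
        self.embedding = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hid, batch_first=True)

    def forward(self, x):
        _, (hidden, cell) = self.rnn(self.embedding(x))
        return hidden, cell

class TinyDecoder(nn.Module):
    """Minimal stand-in with the same interface as the Decoder above."""
    def __init__(self, vocab, emb, hid):
        super().__init__()
        self.embedding = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hid, batch_first=True)
        self.fc = nn.Linear(hid, vocab)

    def forward(self, x, hidden, cell):
        out, (hidden, cell) = self.rnn(self.embedding(x.unsqueeze(1)), (hidden, cell))
        return self.fc(out.squeeze(1)), hidden, cell

@torch.no_grad()
def greedy_decode(encoder, decoder, source, sos_id, eos_id, max_len=20):
    """Generate token ids for one source sequence of shape (1, src_len)."""
    hidden, cell = encoder(source)
    token = torch.full((1,), sos_id, dtype=torch.long, device=source.device)
    generated = []
    for _ in range(max_len):
        logits, hidden, cell = decoder(token, hidden, cell)
        token = logits.argmax(dim=1)  # most likely next token
        if token.item() == eos_id:
            break
        generated.append(token.item())
    return generated

torch.manual_seed(0)
enc, dec = TinyEncoder(30, 8, 16).eval(), TinyDecoder(30, 8, 16).eval()
src = torch.randint(4, 30, (1, 6))  # random ids; untrained weights give arbitrary output
reply_ids = greedy_decode(enc, dec, src, sos_id=1, eos_id=2, max_len=10)
print(reply_ids)
```

With a trained model, the returned ids would be mapped back to words through `idx2word`. Beam search is a frequent alternative when greedy output is too repetitive.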



#### **Section 6: Training the Model**

Define the training process, including the loss function and optimizer.



In [None]:
# Step 6: Training the Model

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
input_size = len(word2idx)
output_size = len(word2idx)
embedding_dim = 256
hidden_size = 512

# Instantiate encoder, decoder, and Seq2Seq model
encoder = Encoder(input_size, embedding_dim, hidden_size).to(device)
decoder = Decoder(output_size, embedding_dim, hidden_size).to(device)
model = Seq2Seq(encoder, decoder, device).to(device)

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=word2idx['<PAD>'])

# Observations:
# - CrossEntropyLoss compares the predicted scores with the actual next words; `ignore_index` excludes '<PAD>' positions from the loss.
# - Adam (learning rate 0.001) is used as the optimizer; it adapts the step size for each parameter.
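
The cell above sets up the loss and optimizer but stops short of the loop itself. A minimal sketch of one training epoch follows. It assumes the model is called as `model(source, target)` and returns logits of shape `(batch, target_len, vocab)`. `train_one_epoch`, the gradient clipping value, and `ToyModel` are additions made for this example. `ToyModel` is only a stand-in so that the block runs by itself.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_one_epoch(model, loader, optimizer, criterion, device, clip=1.0):
    """Run one pass over loader and return the mean batch loss."""
    model.train()
    total = 0.0
    for source, target in loader:
        source, target = source.to(device), target.to(device)
        optimizer.zero_grad()
        logits = model(source, target)  # (batch, tgt_len, vocab)
        vocab = logits.shape[-1]
        # Position 0 is <SOS> and is never predicted, so drop it on both sides.
        loss = criterion(logits[:, 1:].reshape(-1, vocab), target[:, 1:].reshape(-1))
        loss.backward()
        # Optional: clip gradients, a common safeguard for recurrent models.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        total += loss.item()
    return total / len(loader)

class ToyModel(nn.Module):
    """Stand-in with the same call signature as Seq2Seq, for demonstration only."""
    def __init__(self, vocab=12, dim=8):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, source, target):
        # Predict every target position from the mean source embedding.
        ctx = self.emb(source).mean(dim=1, keepdim=True)  # (batch, 1, dim)
        return self.out(ctx.expand(-1, target.shape[1], -1))

torch.manual_seed(0)
# Three random (source, target) batches play the role of a DataLoader.
toy_batches = [(torch.randint(3, 12, (4, 5)), torch.randint(3, 12, (4, 5))) for _ in range(3)]
toy_model = ToyModel()
toy_opt = optim.Adam(toy_model.parameters(), lr=0.01)
toy_crit = nn.CrossEntropyLoss(ignore_index=0)
losses = [
    train_one_epoch(toy_model, toy_batches, toy_opt, toy_crit, torch.device("cpu"))
    for _ in range(5)
]
print(losses)
```

A matching validation pass would wrap the same forward and loss computation in `torch.no_grad()`, call `model.eval()`, and skip the backward and optimizer steps.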
