# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using the pretrained word embeddings is optional. The starter code below loads Brown embeddings with Gensim, so you can compare the performance of a model that uses them against a model that does not.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works in two stages. The Encoder reads the input sequence and summarizes it in its final hidden and cell states. These states initialize the Decoder, which then outputs a series of token predictions, one step at a time.
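
The hand-off between the two components can be sketched in a few lines. This is a minimal illustration with toy sizes, not the notebook's model: the layer names, the shared embedding, and the start-token index `1` are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim, hidden_dim = 20, 8, 16  # toy sizes (assumed)

embedding = nn.Embedding(vocab_size, embedding_dim)  # shared here only for brevity
encoder_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
decoder_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

# Encoder: read the whole source sequence, keep only the final states
source = torch.tensor([[4, 7, 9, 2]])                 # [batch=1, src_len=4]
_, (hidden, cell) = encoder_lstm(embedding(source))   # each: [1, 1, hidden_dim]

# Decoder: start from the encoder states and predict one token per step
token = torch.tensor([[1]])                           # assumed start-token index
predicted = []
for _ in range(3):
    output, (hidden, cell) = decoder_lstm(embedding(token), (hidden, cell))
    logits = to_vocab(output[:, -1])                  # [1, vocab_size]
    token = logits.argmax(dim=1, keepdim=True)        # greedy choice feeds the next step
    predicted.append(token.item())

print(predicted)
```

The only information that crosses from Encoder to Decoder is the pair `(hidden, cell)`, which is why the two LSTMs must agree on `hidden_dim` and the number of layers.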

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip
- Gensim
- tqdm


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html


In [1]:
# Check current torch version
import torch
print(torch.__version__)

1.11.0+cu102


In [2]:
# Install tqdm for progress bars
!pip install tqdm

Defaulting to user installation because normal site-packages is not writeable


## Step 1: Build your Vocabulary & create the Word Embeddings

In [1]:
# Import required libraries
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
from nltk.corpus import brown
from tqdm import tqdm
import torch.nn as nn
import os
from nltk.tokenize import word_tokenize
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

In [2]:
nltk.download('brown')
nltk.download('punkt')

# Loading provided embedding
w2v = gensim.models.Word2Vec.load('brown.embedding')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
# Note: Due to ongoing problems with torchtext, the SQuAD2.0 dataset was obtained directly from
# https://rajpurkar.github.io/SQuAD-explorer/

# Load SQuAD2.0 record from root directory
with open('train-v2.0.json', 'r') as f:
    squad_train = pd.read_json(f)
    
with open('dev-v2.0.json', 'r') as f:
    squad_dev = pd.read_json(f)

In [4]:
# Extract Questions and Answers
def extract_qa(data):
    questions = []
    answers = []
    for topic in data['data']:
        for paragraph in topic['paragraphs']:
            for qas in paragraph['qas']:
                questions.append(qas['question'])
                if qas['answers']:
                    answers.append(qas['answers'][0]['text'])
                else: 
                    answers.append('') # No-answer
    return questions, answers

train_questions, train_answers = extract_qa(squad_train)
dev_questions, dev_answers = extract_qa(squad_dev)
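
Because SQuAD2.0 contains unanswerable questions, `extract_qa` returns an empty string for some answers. The sketch below shows the nested structure the function walks through and one possible variant that keeps only answerable pairs. The toy record and the helper name `extract_answerable_qa` are made up for illustration.

```python
# Toy record mirroring the nesting used above: data -> paragraphs -> qas -> answers
toy_squad = {
    "data": [
        {
            "title": "Toy topic",
            "paragraphs": [
                {
                    "context": "Normandy is a region in France.",
                    "qas": [
                        {
                            "question": "In what country is Normandy located?",
                            "answers": [{"text": "France", "answer_start": 24}],
                        },
                        {
                            # Unanswerable question: the answers list is empty
                            "question": "Who founded Normandy in 1500?",
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}


def extract_answerable_qa(data):
    """Return (question, answer) pairs, skipping questions without an answer."""
    pairs = []
    for topic in data["data"]:
        for paragraph in topic["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["answers"]:
                    pairs.append((qa["question"], qa["answers"][0]["text"]))
    return pairs


print(extract_answerable_qa(toy_squad))
```

Whether to drop unanswerable questions or to keep them with a dedicated "no answer" target is a design choice; an empty target sequence gives the decoder nothing to learn from.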

In [5]:
# Check sample questions and answers
print(train_questions[0])
print(train_answers[0])
print(dev_questions[0])
print(dev_answers[0])

When did Beyonce start becoming popular?
in the late 1990s
In what country is Normandy located?
France


In [6]:
# Tokenize using NLTK
def tokenize_data(questions, answers):
    questions_tokens = []
    answers_tokens = []
    
    for question in tqdm(questions, desc="Tokenizing questions"):
        questions_tokens.append(word_tokenize(question))
    
    for answer in tqdm(answers, desc="Tokenizing answers"):
        answers_tokens.append(word_tokenize(answer))
        
    return questions_tokens, answers_tokens

train_questions_tokens, train_answers_tokens = tokenize_data(train_questions, train_answers)
dev_questions_tokens, dev_answers_tokens = tokenize_data(dev_questions, dev_answers)

Tokenizing questions: 100%|██████████| 130319/130319 [00:27<00:00, 4799.08it/s]
Tokenizing answers: 100%|██████████| 130319/130319 [00:13<00:00, 9356.59it/s] 
Tokenizing questions: 100%|██████████| 11873/11873 [00:02<00:00, 4722.63it/s]
Tokenizing answers: 100%|██████████| 11873/11873 [00:01<00:00, 11663.02it/s]


In [7]:
# Create the Vocabulary
all_tokens = [token for sublist in train_questions_tokens+train_answers_tokens for token in sublist]
vocab = set(all_tokens)
vocab_size = len(vocab)

In [8]:
len(vocab)

73078

In [9]:
# Create Word-Index Mapping
word2index = {}
index2word = {}

special_tokens = ['<pad>', '<sos>', '<eos>', '<unk>']

for token in special_tokens:
    word2index[token] = len(word2index)
    index2word[len(index2word)] = token

for tokens in tqdm(train_questions_tokens + train_answers_tokens, desc="Building Vocabulary"):
    for word in tokens:
        if word not in word2index:
            word2index[word] = len(word2index)
            index2word[len(index2word)] = word

         

Building Vocabulary: 100%|██████████| 260638/260638 [00:00<00:00, 418390.20it/s]


In [10]:
len(word2index)

73082

In [11]:
vocab_size = len(word2index)
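
The special tokens reserved above only take effect if sequences are encoded with them. As a small sketch on a toy vocabulary, the helpers below map unknown words to `<unk>`, optionally wrap a sequence in `<sos>`/`<eos>`, and pad to a fixed length. The names `encode` and `pad_to` are hypothetical and do not appear in the notebook.

```python
special_tokens = ['<pad>', '<sos>', '<eos>', '<unk>']

# Build a toy mapping the same way as above: special tokens first, then words
toy_word2index = {token: i for i, token in enumerate(special_tokens)}
for word in ["When", "did", "Beyonce", "start", "?"]:
    toy_word2index.setdefault(word, len(toy_word2index))


def encode(tokens, word2index, add_markers=False):
    """Map tokens to indices; words missing from the vocabulary become <unk>."""
    unk = word2index['<unk>']
    ids = [word2index.get(token, unk) for token in tokens]
    if add_markers:
        ids = [word2index['<sos>']] + ids + [word2index['<eos>']]
    return ids


def pad_to(ids, length, pad_id):
    """Truncate or right-pad a list of indices to exactly `length` entries."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


ids = encode(["When", "did", "Elvis", "?"], toy_word2index, add_markers=True)
print(ids)
print(pad_to(ids, 8, toy_word2index['<pad>']))
```

Wrapping target sequences in `<sos>` and `<eos>` gives the decoder a defined first input and a way to signal where an answer ends.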

In [12]:
# Initialize Embeddings
embedding_dim = w2v.vector_size
embedding_matrix = np.random.uniform(-1, 1, (len(word2index), embedding_dim))

for word, i in word2index.items():
    try:
        if word in special_tokens:
            continue
        embedding_vector = w2v.wv[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        # If word not in pretrained embeddings (alternatively: pass)
        embedding_matrix[i] = np.random.randn(embedding_dim)

In [13]:
# Check for NaN or infinities
if np.isnan(embedding_matrix).any():
    print("NaN found in embedding matrix")
if np.isinf(embedding_matrix).any():
    print("Infinity found in embedding matrix")
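
Note that `embedding_matrix` only has an effect once it is loaded into an embedding layer; a plain `nn.Embedding(vocab_size, embedding_dim)` starts from random weights. One way to wire it in, shown here on a small stand-in matrix, is `nn.Embedding.from_pretrained`. The toy sizes and the choice `freeze=False` are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for embedding_matrix: 6 words, 4 dimensions (illustrative sizes)
rng = np.random.default_rng(0)
toy_matrix = rng.uniform(-1, 1, (6, 4))
pad_index = 0  # '<pad>' is the first special token, so it has index 0

# NumPy defaults to float64; PyTorch layers expect float32
weights = torch.tensor(toy_matrix, dtype=torch.float32)

# freeze=False keeps the vectors trainable; padding_idx excludes the <pad>
# row from gradient updates
embedding = nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=pad_index)

batch = torch.tensor([[1, 2, 0]])  # one sequence: two tokens, then padding
vectors = embedding(batch)         # shape: [1, 3, 4]
print(vectors.shape)
```

Setting `freeze=True` instead keeps the pretrained vectors fixed, which is a reasonable baseline when comparing against randomly initialized embeddings.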

## Step 2: Create the Encoder

In [14]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, n_layers=1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)
        
    def forward(self, input_seq):
        embedded = self.embedding(input_seq)
        outputs, (hidden, cell) = self.lstm(embedded)
        return hidden, cell

## Step 3: Create the Decoder

In [15]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, n_layers=1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)
        
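    def forward(self, input_step, hidden, cell):
        # input_step: [batch_size]. The LSTM is batch_first, so the sequence
        # dimension goes at position 1: [batch_size, 1, embedding_dim]
        embedded = self.embedding(input_step).unsqueeze(1)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        # Drop the sequence dimension again before projecting to the vocabulary
        output = self.out(output.squeeze(1))  # [batch_size, vocab_size]
        return output, (hidden, cell)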

## Step 4: Combine them into a Seq2Seq Architecture

In [16]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, source, target):
        # Placeholder for decoder outputs
        outputs = torch.zeros(target.size(0), target.size(1), self.decoder.out.out_features).to(target.device)
        
        # Pass input through the encoder
        hidden, cell = self.encoder(source)
        
        # The first decoder input is the first target token. This is meant to be
        # <sos>, so target sequences should start with that token
        input = target[:, 0]
        
        for t in range(1, target.size(1)):
            # The decoder takes a [batch_size] tensor of token indices and adds
            # the sequence dimension itself, so no unsqueeze is needed here
            output, (hidden, cell) = self.decoder(input, hidden, cell)
            outputs[:, t] = output
            # No teacher forcing: use model's prediction as next input
            input = output.argmax(1)
        
        return outputs
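
The loop above always feeds the model's own prediction back into the decoder. A common alternative is teacher forcing, where the ground-truth token is fed instead with some probability. The sketch below isolates that decision; the function name `choose_next_input` and the parameter `teacher_forcing_ratio` are hypothetical and not part of the notebook.

```python
import random


def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng):
    """Pick the next decoder input.

    With probability `teacher_forcing_ratio` the ground-truth token is used,
    otherwise the model's own prediction.
    """
    if rng.random() < teacher_forcing_ratio:
        return target_token
    return predicted_token


rng = random.Random(0)
target_tokens = [11, 12, 13, 14]     # made-up ground-truth indices
predicted_tokens = [11, 99, 13, 98]  # made-up model predictions

always_forced = [choose_next_input(t, p, 1.0, rng) for t, p in zip(target_tokens, predicted_tokens)]
never_forced = [choose_next_input(t, p, 0.0, rng) for t, p in zip(target_tokens, predicted_tokens)]
print(always_forced, never_forced)
```

A ratio of 0.0 reproduces the behaviour of the loop above. Higher ratios usually make early training easier, because a wrong prediction at one step no longer corrupts every following step.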

In [17]:
# Initialize model
hidden_dim = 128
encoder = Encoder(vocab_size, embedding_dim, hidden_dim)
decoder = Decoder(vocab_size, embedding_dim, hidden_dim)
seq2seq = Seq2Seq(encoder, decoder)

## Step 5: Train & evaluate your model

In [18]:
# Convert Tokens to Indices
def tokens_to_indices(tokens_list, word2index):
    return [[word2index[token] for token in tokens] for tokens in tokens_list]

train_questions_indices = tokens_to_indices(train_questions_tokens, word2index)
train_answers_indices = tokens_to_indices(train_answers_tokens, word2index)

In [19]:
# Move model to available device, set optimizer with scheduler
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
seq2seq = seq2seq.to(device)
optimizer = optim.Adam(seq2seq.parameters(), lr=0.001) # Lowered to avoid exploding values
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)  # Decrease the learning rate by a factor of 0.1 every 5 epochs
criterion = nn.CrossEntropyLoss(ignore_index=word2index['<pad>'])
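
`StepLR` multiplies the learning rate by `gamma` once every `step_size` calls to `scheduler.step()`, so what counts as a "step" depends on where that call is placed. The arithmetic can be checked without PyTorch; the helper `step_lr` below is a hypothetical stand-in for the schedule, not a library function.

```python
def step_lr(base_lr, gamma, step_size, num_steps):
    """Learning rate after `num_steps` scheduler steps of a StepLR-style schedule."""
    return base_lr * gamma ** (num_steps // step_size)


base_lr, gamma, step_size = 0.001, 0.1, 5  # values from the cell above

# One scheduler step per epoch: three epochs leave the learning rate unchanged
lr_after_3_epochs = step_lr(base_lr, gamma, step_size, 3)

# One scheduler step per training example: 1000 steps within a single epoch
lr_after_1000_examples = step_lr(base_lr, gamma, step_size, 1000)

print(lr_after_3_epochs, lr_after_1000_examples)
```

Stepping per training example shrinks the learning rate by a factor of ten every five examples, which effectively stops learning after a few dozen updates.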

In [20]:
# Training loop (batch size of 1 for simplicity)
epochs = 3  # Kept small because the Udacity GPU workspace was slow and unreliable
max_len = 100  # Every question and answer is padded or truncated to this length
limit_data = 1000  # Train on the first N examples only; the maximum is len(train_questions_indices)

# Set model to train state
seq2seq.train()  

# Initialize a variable to keep track of the best loss so far
best_loss = float('inf')
model_save_path = 'best_seq2seq_model.pt'

for epoch in range(epochs):
    epoch_loss = 0
    
    for i in tqdm(range(min(len(train_questions_indices), limit_data)), desc=f"Epoch {epoch+1}/{epochs}"):
        question = train_questions_indices[i]
        answer = train_answers_indices[i]
        # Note: for debugging purposes a few print statements were added, 
        # which are now commented out again
        # print("Question (indices):", question)
        # print("Answer (indices):", answer)
        # print("Length of question before slicing:", len(question))
        # print("Length of answer before slicing:", len(answer))
        
        # Pad or truncate to max_len
        question = question[:max_len] + [word2index['<pad>']] * (max_len - len(question))
        answer = answer[:max_len] + [word2index['<pad>']] * (max_len - len(answer))
        
        # Convert list of indices to tensors
        question_tensor = torch.LongTensor(question).unsqueeze(0).to(device)
        answer_tensor = torch.LongTensor(answer).unsqueeze(0).to(device)
        # print("Before reshaping:")
        # print("Answer tensor shape:", answer_tensor.shape)
        # print("Answer tensor:", answer_tensor)
        
        optimizer.zero_grad()
        
        outputs = seq2seq(question_tensor, answer_tensor)
        #print("Outputs shape:", outputs.shape)
        #print("Outputs:", outputs)
        outputs = outputs[:, 1:].reshape(-1, outputs.shape[2])
        answer_tensor = answer_tensor[:, 1:].reshape(-1)
        
        #print("After reshaping:")
        #print("Outputs shape:", outputs.shape)
        #print("Answer tensor shape:", answer_tensor.shape)
        
        loss = criterion(outputs, answer_tensor)
        
        '''
        for name, param in seq2seq.named_parameters():
            if torch.isnan(param).any():
                print(f"NaN found in {name} before backward")
            if torch.isinf(param).any():
                print(f"Infinity found in {name} before backward")
        '''
        loss.backward()
        '''
        for name, param in seq2seq.named_parameters():
            if torch.isnan(param.grad).any():
                print(f"NaN found in gradient of {name} after backward")
            if torch.isinf(param.grad).any():
                print(f"Infinity found in gradient of {name} after backward")
        '''
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(seq2seq.parameters(), max_norm=1)
        
        optimizer.step()
        epoch_loss += loss.item()
        # print(loss.item())
        # print(epoch_loss)
        
    # Step the scheduler once per epoch, so that step_size=5 means five epochs
    scheduler.step()

    # Calculate average epoch loss
    avg_epoch_loss = epoch_loss/min(len(train_questions_indices), limit_data)
    print(f"Epoch: {epoch+1}, Loss: {avg_epoch_loss}")

    # Check if the current model is better than previous best before saving
    if epoch == 0:
        best_loss = avg_epoch_loss
        torch.save(seq2seq.state_dict(), model_save_path)
    else:
        if avg_epoch_loss < best_loss:
            best_loss = avg_epoch_loss
            torch.save(seq2seq.state_dict(), model_save_path)

Epoch 1/3: 100%|██████████| 1000/1000 [08:46<00:00,  1.90it/s]


Epoch: 1, Loss: nan


Epoch 2/3: 100%|██████████| 1000/1000 [08:49<00:00,  1.89it/s]

Epoch: 2, Loss: nan


Epoch 3/3: 100%|██████████| 1000/1000 [08:52<00:00,  1.88it/s]

Epoch: 3, Loss: nan
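
One plausible cause of the NaN loss, stated as an assumption rather than a confirmed diagnosis: with `ignore_index` set to the `<pad>` index, `nn.CrossEntropyLoss` averages over the non-padding targets only. If every target position is padding, that average is 0/0. This can happen here for unanswerable questions (empty answer) and for one-token answers, because the first target position is dropped before the loss is computed. The helper `has_real_targets` is hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_index = 0
criterion = nn.CrossEntropyLoss(ignore_index=pad_index)

logits = torch.randn(3, 5)  # 3 target positions, 5 classes (toy sizes)
mixed_targets = torch.tensor([2, pad_index, pad_index])
all_pad_targets = torch.tensor([pad_index, pad_index, pad_index])

loss_ok = criterion(logits, mixed_targets)     # averaged over the one real token
loss_nan = criterion(logits, all_pad_targets)  # nothing left to average over

print(loss_ok.item(), loss_nan.item())


def has_real_targets(targets, pad_index):
    """True if at least one target position is not padding."""
    return bool((targets != pad_index).any())


print(has_real_targets(mixed_targets, pad_index), has_real_targets(all_pad_targets, pad_index))
```

A single NaN loss is enough to spoil the run: its gradients are NaN, the optimizer writes them into the weights, and every later loss is NaN as well. Skipping examples for which `has_real_targets` is false, or filtering them out before training, avoids this case.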





In [66]:
# Additional checks to narrow down where the NaN loss comes from
def check_for_nan_or_inf(data):
    if torch.any(torch.isnan(data)):
        return "Contains NaN"
    if torch.any(torch.isinf(data)):
        return "Contains Inf"
    return "No NaN or Inf"

# Just test on a small subset to check for potential issues
for i in range(10):  
    question = train_questions_indices[i]
    answer = train_answers_indices[i]
    question = question[:max_len] + [word2index['<pad>']] * (max_len - len(question))
    answer = answer[:max_len] + [word2index['<pad>']] * (max_len - len(answer))
    question_tensor = torch.LongTensor(question).unsqueeze(0).to(device)
    answer_tensor = torch.LongTensor(answer).unsqueeze(0).to(device)
    
    print(f"Question {i} status: {check_for_nan_or_inf(question_tensor)}")
    print(f"Answer {i} status: {check_for_nan_or_inf(answer_tensor)}")

Question 0 status: No NaN or Inf
Answer 0 status: No NaN or Inf
Question 1 status: No NaN or Inf
Answer 1 status: No NaN or Inf
Question 2 status: No NaN or Inf
Answer 2 status: No NaN or Inf
Question 3 status: No NaN or Inf
Answer 3 status: No NaN or Inf
Question 4 status: No NaN or Inf
Answer 4 status: No NaN or Inf
Question 5 status: No NaN or Inf
Answer 5 status: No NaN or Inf
Question 6 status: No NaN or Inf
Answer 6 status: No NaN or Inf
Question 7 status: No NaN or Inf
Answer 7 status: No NaN or Inf
Question 8 status: No NaN or Inf
Answer 8 status: No NaN or Inf
Question 9 status: No NaN or Inf
Answer 9 status: No NaN or Inf


In [67]:
seq2seq.train()

# Use only one data point for simplicity
question = train_questions_indices[0]
answer = train_answers_indices[0]
question = question[:max_len] + [word2index['<pad>']] * (max_len - len(question))
answer = answer[:max_len] + [word2index['<pad>']] * (max_len - len(answer))
question_tensor = torch.LongTensor(question).unsqueeze(0).to(device)
answer_tensor = torch.LongTensor(answer).unsqueeze(0).to(device)

optimizer.zero_grad()

outputs = seq2seq(question_tensor, answer_tensor)
print(f"Model output status: {check_for_nan_or_inf(outputs)}")

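# Slice along the sequence dimension, as in the training loop. Slicing dimension 0
# of a batch of one (outputs[1:]) leaves an empty tensor, and the mean loss over
# an empty tensor is NaN
outputs = outputs[:, 1:].reshape(-1, len(word2index))
answer_tensor = answer_tensor[:, 1:].reshape(-1)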

loss = criterion(outputs, answer_tensor)
print(f"Loss status: {check_for_nan_or_inf(loss)}")

loss.backward()

for name, param in seq2seq.named_parameters():
    if param.requires_grad:
        print(f"Gradient of {name} status: {check_for_nan_or_inf(param.grad)}")


Model output status: No NaN or Inf
Loss status: Contains NaN
Gradient of encoder.embedding.weight status: No NaN or Inf
Gradient of encoder.lstm.weight_ih_l0 status: No NaN or Inf
Gradient of encoder.lstm.weight_hh_l0 status: No NaN or Inf
Gradient of encoder.lstm.bias_ih_l0 status: No NaN or Inf
Gradient of encoder.lstm.bias_hh_l0 status: No NaN or Inf
Gradient of decoder.embedding.weight status: No NaN or Inf
Gradient of decoder.lstm.weight_ih_l0 status: No NaN or Inf
Gradient of decoder.lstm.weight_hh_l0 status: No NaN or Inf
Gradient of decoder.lstm.bias_ih_l0 status: No NaN or Inf
Gradient of decoder.lstm.bias_hh_l0 status: No NaN or Inf
Gradient of decoder.out.weight status: No NaN or Inf
Gradient of decoder.out.bias status: No NaN or Inf


## Step 6: Interact with the Chatbot

In [21]:
# Load best model
seq2seq.load_state_dict(torch.load('best_seq2seq_model.pt'))
seq2seq.eval()  # Set the model to evaluation mode

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(73082, 100)
    (lstm): LSTM(100, 128, batch_first=True)
  )
  (decoder): Decoder(
    (embedding): Embedding(73082, 100)
    (lstm): LSTM(100, 128, batch_first=True)
    (out): Linear(in_features=128, out_features=73082, bias=True)
  )
)

In [22]:
# Define a Function to Generate Answers
def generate_answer(model, question):
    # Convert question to tensor. Note: training used word_tokenize, so splitting
    # on whitespace typically maps words with attached punctuation to <unk>
    indexed = [word2index.get(word, word2index['<unk>']) for word in question.split()]
    question_tensor = torch.LongTensor(indexed).unsqueeze(0).to(device)

    # Use the model to get the answer. The all-zero dummy target only sets the
    # output length and makes <pad> (index 0) the first decoder input
    with torch.no_grad():
        outputs = model(question_tensor, torch.zeros_like(question_tensor))
    
    # Convert tensor outputs to words. Position 0 is never written by the model
    # (all zeros, so argmax yields <pad>) and is skipped
    answer_indices = outputs.argmax(dim=2)[0, 1:].tolist()
    answer = ' '.join(index2word[idx] for idx in answer_indices)

    return answer
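
If the target sequences are wrapped in `<sos>` and `<eos>` during training (the cells above do not do this, so it is an assumption), the predicted indices can be turned into text by cutting at the first `<eos>` and dropping the remaining special tokens. The helper `ids_to_text` is a hypothetical sketch on a toy vocabulary.

```python
def ids_to_text(ids, index2word, stop_token='<eos>', skip_tokens=('<pad>', '<sos>')):
    """Join predicted indices into a string, stopping at the first stop token."""
    words = []
    for idx in ids:
        word = index2word[idx]
        if word == stop_token:
            break  # everything after <eos> is ignored
        if word not in skip_tokens:
            words.append(word)
    return ' '.join(words)


toy_index2word = {0: '<pad>', 1: '<sos>', 2: '<eos>', 3: '<unk>', 4: 'in', 5: 'France'}
print(ids_to_text([0, 4, 5, 2, 4], toy_index2word))
```

Cutting at `<eos>` also removes the need to tie the answer length to the question length, since decoding can simply run up to a fixed maximum number of steps.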

In [26]:
# Test chatbot by interacting with Sample Questions
sample_questions = [
    "What is the capital of France?",
    "Who wrote the Iliad?",
    "When was the Declaration of Independence signed?"
]

for q in sample_questions:
    answer = generate_answer(seq2seq, q)
    print(f"Question: {q}")
    print(f"Answer: {answer}\n")

Question: What is the capital of France?
Answer: godfather Menu entirety 10,826 lineup

Question: Who wrote the Iliad?
Answer: Educación Gerhard OEMs

Question: When was the Declaration of Independence signed?
Answer: godfather Menu entirety 10,826 lineup génos

