# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot uses a sequence-to-sequence text generation architecture with an LSTM as its memory unit. At the conclusion of the project, you will be able to show your chatbot to potential employers.

You also have the option to use pretrained word embeddings to improve the performance of the model. We have loaded Brown embeddings from Gensim in the starter code below. You can compare the performance of your model with pretrained embeddings against a model without them.



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Encoder reads the input sequence and passes its final hidden state (for an LSTM, the hidden and cell states) to the Decoder, which uses that state to output a series of token predictions.
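
The hand-off from Encoder to Decoder can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the project's `Encoder` and `Decoder` classes: the names `TinyEncoder` and `TinyDecoder`, the vocabulary size of 50, the `<SOS>` id of 1, and the layer sizes are all invented for the example.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical encoder: embedding layer followed by an LSTM."""

    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) tensor of token ids
        _, (hidden, cell) = self.lstm(self.embedding(src))
        return hidden, cell  # each: (num_layers, batch, hidden_size)


class TinyDecoder(nn.Module):
    """Hypothetical decoder: embedding, LSTM, and linear output layer."""

    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.lin_out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden, cell):
        # token: (batch,) ids for a single time step
        embedded = self.embedding(token.unsqueeze(1))       # (batch, 1, emb)
        out, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        logits = self.lin_out(out.squeeze(1))               # (batch, vocab)
        return logits, hidden, cell


# Illustrative sizes only (assumptions, not the project's settings)
vocab_size, sos_id, max_len = 50, 1, 5
encoder = TinyEncoder(vocab_size, embedding_size=16, hidden_size=8)
decoder = TinyDecoder(vocab_size, embedding_size=16, hidden_size=8)

src = torch.randint(0, vocab_size, (2, 7))   # batch of 2 questions, 7 tokens each
steps = []
with torch.no_grad():
    hidden, cell = encoder(src)              # encoder state seeds the decoder
    token = torch.full((2,), sos_id, dtype=torch.long)
    for _ in range(max_len):
        logits, hidden, cell = decoder(token, hidden, cell)
        steps.append(logits)
        token = logits.argmax(dim=1)         # greedy choice feeds the next step

predictions = torch.stack(steps, dim=1)      # (batch, max_len, vocab_size)
print(predictions.shape)
```

The decoder here always consumes its own previous prediction. During training, a Seq2Seq model often feeds the ground-truth token instead for some fraction of steps (teacher forcing).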

## Dependencies

- PyTorch
- NumPy
- Pandas
- NLTK
- gzip (part of the Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
# import torch
from dataset_helper import load_df, prepare_text, train_test_split

train, dev = load_df()
v, token_df = prepare_text(count_limit=2, min_length=2, max_length=13, stage='dev') # dev or train
print('create train, test and validation data sets ...')
train_set, test_set, valid_set = train_test_split(token_df)


Adding word 0 to our vocabulary.
Adding word 15000 to our vocabulary.
Adding word 30000 to our vocabulary.
Adding word 45000 to our vocabulary.
Adding word 60000 to our vocabulary.
Adding word 75000 to our vocabulary.
Word count in vocab is 5579. Removed 5059 words during cleanup.
Data frame contains 5928 rows.
Data frame after row cleanup contains 1046 rows.
create train, test and validation data sets ...
Train set of length: 732
Test set of length: 236
Valid set of length: 78


In [2]:
# print the first 10 question-answer pairs
for i in range(10):
    ex_q = train_set.iloc[i, 0]
    ex_a = train_set.iloc[i, 1]
    # map token ids back to words, dropping unknown-word placeholders
    ex_question = [w for w in v.index2word(ex_q) if w != '<UNK>']
    ex_answer = [w for w in v.index2word(ex_a) if w != '<UNK>']
    # print the decoded pair
    print("Q:", " ".join(ex_question))
    print("A:", " ".join(ex_answer), "\n")

Q: <SOS> petrologists identify rock samples in the field and where else <EOS> <PAD>
A: <SOS> the laboratory <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> how high are victoria s alpine regions <EOS> <PAD> <PAD> <PAD> <PAD>
A: <SOS> 2 000 m <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> what do some believe the treaty of versailles assisted in <EOS> <PAD>
A: <SOS> adolf hitler s rise to power <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> friedrich ratzel thought imperialism was what for the country <EOS> <PAD> <PAD>
A: <SOS> geographical societies in europe <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> what did this agreement do <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
A: <SOS> granted the huguenots substantial religious political and military autonomy <EOS> <PAD> <PAD> 

Q: <SOS> what is negatively correlated to the duration of economic growth <EOS> <PAD>
A: <SOS> inequality in wealth and income <EOS> <PAD> <PAD> <PAD> <PAD> <PA
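
In the pairs printed above, every sequence appears to be wrapped in `<SOS>` and `<EOS>` and right-padded with `<PAD>` to 13 tokens, which matches the `max_length=13` argument passed to `prepare_text`. A minimal sketch of that wrapping step follows. The helper name `wrap_and_pad` is hypothetical, and the project's own `prepare_text` may implement this differently.

```python
def wrap_and_pad(tokens, total_len=13, sos="<SOS>", eos="<EOS>", pad="<PAD>"):
    """Wrap tokens in <SOS>/<EOS> and right-pad to total_len (hypothetical helper)."""
    wrapped = [sos] + list(tokens) + [eos]
    if len(wrapped) > total_len:
        raise ValueError(f"sequence of {len(wrapped)} tokens exceeds {total_len}")
    # Fill the remaining positions so every sequence has the same length
    return wrapped + [pad] * (total_len - len(wrapped))


answer = wrap_and_pad("the laboratory".split())
print(" ".join(answer))
```

Fixed-length sequences let a whole batch be stacked into a single tensor. The padding positions are later excluded from the loss through `ignore_index`.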

In [30]:
# define parameters
input_size = len(v.words)
output_size = len(v.words)
embedding_size = 128 # 256
hidden_size = 32 # 256
lstm_layer = 2
dropout = 0.5
learning_rate = 0.001
epochs = 100
clip = 1
BATCH_SIZE = 32


In [31]:
from dataset_helper import get_dataloader
from encoder import Encoder
from decoder import Decoder
from seq2seq import Seq2Seq
import torch

train_dataloader, test_dataloader, valid_dataloader = get_dataloader(train_set, test_set, valid_set, BATCH_SIZE)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# instantiate encoder and decoder classes
enc = Encoder(input_size, hidden_size, embedding_size, lstm_layer, dropout, BATCH_SIZE).to(device)
dec = Decoder(input_size, hidden_size, output_size, embedding_size, lstm_layer, dropout).to(device)
# instantiate seq2seq model
model = Seq2Seq(enc, dec, device).to(device)
print(model)

Created `train_dataloader` with 22 batches!
Created `test_dataloader` with 7 batches!
Created `test_dataloader` with 2 batches!
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(5579, 128)
    (lstm): LSTM(128, 32, num_layers=2, batch_first=True, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5579, 128)
    (lstm): LSTM(128, 32, num_layers=2, batch_first=True, dropout=0.5)
    (lin_out): Linear(in_features=32, out_features=5579, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
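
The layer shapes in the printout above are enough to work out the number of trainable parameters by hand. The sketch below assumes the standard `torch.nn.LSTM` layout (four gates per layer, each with input weights, recurrent weights, and two bias vectors) and uses only the sizes shown in the printout. The function name `lstm_params` is made up for the example.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional torch.nn.LSTM with bias=True."""
    total = 0
    for layer in range(num_layers):
        # The first layer reads embeddings; later layers read the previous layer's output
        layer_input = input_size if layer == 0 else hidden_size
        weights = 4 * hidden_size * (layer_input + hidden_size)
        biases = 2 * 4 * hidden_size  # bias_ih and bias_hh
        total += weights + biases
    return total


vocab, emb, hid, layers = 5579, 128, 32, 2  # sizes from the model printout

embedding = vocab * emb                  # Embedding(5579, 128)
lstm = lstm_params(emb, hid, layers)     # LSTM(128, 32, num_layers=2)
linear = hid * vocab + vocab             # Linear(32, 5579) with bias

encoder_total = embedding + lstm
decoder_total = embedding + lstm + linear
print(encoder_total, decoder_total, encoder_total + decoder_total)
# 743296 927403 1670699
```

About 85% of the roughly 1.67 million parameters sit in the two embedding tables, which is worth keeping in mind given that the training set has only 732 rows.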


In [32]:
from helpers import evaluate, train_loop
import torch.optim as optim
import torch.nn as nn

model_out_path = 'qna-model.pt'
# define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# ignore padding index when calculating the loss (<PAD>=3 in vocab)
criterion = nn.CrossEntropyLoss(ignore_index=3)
# train loop
train_loop(model, train_dataloader, test_dataloader, optimizer, criterion, clip, device, epochs, model_out_path)

	Epoch: 0 | 	Train Loss: 8.627 | 	 Val. Loss: 8.505
	Epoch: 5 | 	Train Loss: 6.242 | 	 Val. Loss: 6.987
	Epoch: 10 | 	Train Loss: 6.099 | 	 Val. Loss: 7.093
	Epoch: 15 | 	Train Loss: 6.055 | 	 Val. Loss: 7.210
	Epoch: 20 | 	Train Loss: 6.013 | 	 Val. Loss: 7.278
	Epoch: 25 | 	Train Loss: 5.973 | 	 Val. Loss: 7.333
	Epoch: 30 | 	Train Loss: 5.947 | 	 Val. Loss: 7.405
	Epoch: 35 | 	Train Loss: 5.917 | 	 Val. Loss: 7.391
	Epoch: 40 | 	Train Loss: 5.872 | 	 Val. Loss: 7.422
	Epoch: 45 | 	Train Loss: 5.841 | 	 Val. Loss: 7.509
	Epoch: 50 | 	Train Loss: 5.799 | 	 Val. Loss: 7.595
	Epoch: 55 | 	Train Loss: 5.754 | 	 Val. Loss: 7.535
	Epoch: 60 | 	Train Loss: 5.721 | 	 Val. Loss: 7.616
	Epoch: 65 | 	Train Loss: 5.677 | 	 Val. Loss: 7.583
	Epoch: 70 | 	Train Loss: 5.645 | 	 Val. Loss: 7.642
	Epoch: 75 | 	Train Loss: 5.594 | 	 Val. Loss: 7.670
	Epoch: 80 | 	Train Loss: 5.556 | 	 Val. Loss: 7.666
	Epoch: 85 | 	Train Loss: 5.505 | 	 Val. Loss: 7.702
	Epoch: 90 | 	Train Loss: 5.476 | 	 Val. Loss: 7
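
One way to read these numbers is to compare them with the loss of a model that guesses uniformly over the vocabulary, which for cross-entropy with natural logarithms is ln(V). The sketch below uses the vocabulary size of 5579 reported during preprocessing and a few validation losses copied from the log above. The helper `perplexity` and the dictionary `val_log` are names introduced for this illustration.

```python
import math

vocab_size = 5579  # vocabulary size reported during preprocessing

# Cross-entropy of a uniform guess over the vocabulary: ln(V)
uniform_loss = math.log(vocab_size)
print(f"uniform baseline: {uniform_loss:.3f}")  # uniform baseline: 8.627


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)


# First few logged validation losses, copied from the training output
val_log = {0: 8.505, 5: 6.987, 10: 7.093, 15: 7.210, 20: 7.278}
best_epoch = min(val_log, key=val_log.get)
print(best_epoch, round(perplexity(val_log[best_epoch])))
```

The epoch-0 training loss of 8.627 matches the uniform baseline, which suggests the model started from essentially uniform predictions. Among the logged epochs, validation loss is lowest at epoch 5 and climbs afterwards while training loss keeps falling, a pattern commonly read as overfitting.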

In [33]:
# Evaluate the model on data it has never seen
model_out_path = 'qna-model.pt'
model.load_state_dict(torch.load(model_out_path, map_location=device))
valid_loss = evaluate(model, valid_dataloader, criterion, device)
print(f'Validation Loss: {valid_loss:.3f}')

Validation Loss: 7.167


In [34]:
# inference: rebuild the model, load the trained weights, switch to eval mode
model = Seq2Seq(enc, dec, device).to(device)
model.load_state_dict(torch.load(model_out_path, map_location=device))
model.eval()  # disable dropout for inference

# define the input question
# question = "What does the urban education institute help run?"
print("Type 'exit' to finish the chat.\n", "-"*30, '\n')
while True:
    question = input("> ")
    if question.strip() == "exit":
        break
    # clean and tokenize the input question
    src = v.word2index(v.clean_text(question))
    # convert the tokenized question to a tensor and add a batch dimension
    src = torch.tensor(src, dtype=torch.long).unsqueeze(0).to(device)
    # generate the answer using the model (no gradients needed at inference)
    with torch.no_grad():
        output = model(src=src, trg=None, teaching=0, max_len=13)
    # convert the output tensor to a list of token IDs
    preds = output.argmax(dim=2).tolist()[0]
    # convert the token IDs to tokens
    answer = v.index2word(preds)
    # print the predicted answer
    print(answer)

Type 'exit' to finish the chat.
 ------------------------------ 

['<SOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
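
The raw token list is hard to read in a chat setting. A small post-processing step can cut the prediction at the first `<EOS>` and drop the special tokens before printing. The sketch below is one possible approach. The helper `format_answer` and the `SPECIAL` set are hypothetical names, and the token strings are taken from the outputs shown in this notebook.

```python
SPECIAL = {"<SOS>", "<EOS>", "<PAD>", "<UNK>"}


def format_answer(tokens, eos="<EOS>"):
    """Cut at the first <EOS>, drop special tokens, and join into a string."""
    if eos in tokens:
        tokens = tokens[:tokens.index(eos)]
    words = [t for t in tokens if t not in SPECIAL]
    return " ".join(words)


# A well-formed prediction
print(format_answer(["<SOS>", "the", "laboratory", "<EOS>", "<PAD>"]))  # the laboratory

# The prediction shown above: <EOS> immediately after <SOS>
print(repr(format_answer(["<SOS>"] + ["<EOS>"] * 12)))  # ''
```

For the output shown above, this yields an empty string: the model predicts `<EOS>` right after `<SOS>`, so there is no answer text to display. That behavior is in line with the high validation loss reported earlier.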
