# LSTM Bot

## Project Overview

In this project, you will build a chatbot that converses with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the model's performance. By the end of the project, you will have a chatbot that you can show to potential employers.

Using pretrained word embeddings is optional. We have loaded Brown embeddings from Gensim in the starter code below, so you can compare the performance of a model that uses pretrained embeddings against one that does not.
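
One way to wire pretrained vectors into a model is to build a weight matrix with one row per vocabulary word, copying the pretrained vector where one exists and falling back to a small random vector otherwise. The sketch below is illustrative only: `build_embedding_matrix`, the toy vocabulary, and the toy `toy_pretrained` dictionary are hypothetical stand-ins, and in the real project the vectors would come from the Gensim model in the starter code.

```python
import random


def build_embedding_matrix(vocab_words, pretrained, dim, seed=0):
    """Return (matrix, hits) with one row of length `dim` per vocabulary word."""
    rng = random.Random(seed)
    matrix, hits = [], 0
    for word in vocab_words:
        vector = pretrained.get(word)
        if vector is not None and len(vector) == dim:
            # A pretrained vector exists: copy it into the matrix.
            matrix.append(list(vector))
            hits += 1
        else:
            # No pretrained vector: start from small random values.
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, hits


# Toy data, not taken from the project.
toy_vocab = ["<PAD>", "<SOS>", "<EOS>", "school", "student"]
toy_pretrained = {"school": [0.1, 0.2, 0.3], "student": [0.4, 0.5, 0.6]}

matrix, hits = build_embedding_matrix(toy_vocab, toy_pretrained, dim=3)
print(f"{hits} of {len(toy_vocab)} words initialised from pretrained vectors")
```

A matrix like this can be converted to a tensor and passed to `torch.nn.Embedding.from_pretrained(weights, freeze=False)`, which keeps the rows trainable. Reporting the hit count is a quick check of how much of the vocabulary the pretrained vectors actually cover.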



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model feeds an input sequence into the Encoder, which passes its final hidden state to the Decoder. The Decoder then uses that state to generate a series of token predictions, one token at a time.
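
The data flow described above can be sketched in PyTorch as follows. This is a minimal illustration under simplifying assumptions (single-layer LSTMs, no dropout, greedy decoding), not the project's `Encoder`, `Decoder`, or `Seq2Seq` classes; the names `TinyEncoder`, `TinyDecoder`, and `greedy_decode` are hypothetical.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) tensor of token ids
        _, (hidden, cell) = self.lstm(self.embedding(src))
        return hidden, cell  # each: (1, batch, hidden_dim)


class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.lin_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden, cell):
        # token: (batch,) ids for a single decoding step
        emb = self.embedding(token.unsqueeze(1))        # (batch, 1, emb_dim)
        out, (hidden, cell) = self.lstm(emb, (hidden, cell))
        logits = self.lin_out(out.squeeze(1))           # (batch, vocab_size)
        return logits, hidden, cell


def greedy_decode(encoder, decoder, src, sos_id, max_len):
    """Encode `src`, then emit `max_len` tokens, feeding each prediction back in."""
    hidden, cell = encoder(src)
    token = torch.full((src.size(0),), sos_id, dtype=torch.long)
    steps = []
    for _ in range(max_len):
        logits, hidden, cell = decoder(token, hidden, cell)
        token = logits.argmax(dim=1)
        steps.append(token)
    return torch.stack(steps, dim=1)  # (batch, max_len)


torch.manual_seed(0)
toy_enc = TinyEncoder(vocab_size=20, emb_dim=8, hidden_dim=16)
toy_dec = TinyDecoder(vocab_size=20, emb_dim=8, hidden_dim=16)
toy_src = torch.randint(0, 20, (2, 5))  # two random "questions" of length 5
with torch.no_grad():
    preds = greedy_decode(toy_enc, toy_dec, toy_src, sos_id=1, max_len=4)
print(preds.shape)
```

The only link between the two modules is the `(hidden, cell)` pair returned by the encoder, which is why the encoder and decoder LSTMs must share the same hidden size and number of layers.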

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip (part of the Python standard library)
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. The available datasets are listed here:

- https://pytorch.org/text/stable/datasets.html





In [16]:
# import torch
from dataset_helper import prepare_text, train_test_split
# load and prepare data
v, token_df = prepare_text(max_rows_train_set=20000, count_limit=2, min_length=3, max_length=12, stage='train') # dev or train
print('create train, test and validation data sets ...')
train_set, test_set, valid_set = train_test_split(token_df)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/q439310/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Train data frame contains 20000 rows.
Dev data frame contains 5928 rows.
Adding word 0 to our vocabulary.
Adding word 50000 to our vocabulary.
Adding word 100000 to our vocabulary.
Adding word 150000 to our vocabulary.
Word count in vocab is now 9091, removed 5890 words during cleanup.
Data frame contains 20000 rows.
Data frame after row cleanup contains 3391 rows.
create train, test and validation data sets ...
Train set of length: 2374
Test set of length: 763
Valid set of length: 254


In [17]:
# print first 15 QnAs
for i in range(15):
    ex_q = train_set.iloc[i, 0]
    ex_a = train_set.iloc[i, 1]
    ex_question = [w for w in v.index2word(ex_q) if w != '<UNK>']
    ex_answer = [w for w in v.index2word(ex_a) if w != '<UNK>']
    # print the decoded question and answer
    print("Q:", " ".join(ex_question))
    print("A:", " ".join(ex_answer), "\n")

Q: <SOS> provision govern democrat feder yugoslavia assembl <EOS> <PAD> <PAD> <PAD> <PAD>
A: <SOS> 7 march 1945 <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> much money use strengthen construct school <EOS> <PAD> <PAD> <PAD> <PAD>
A: <SOS> 400 000 yuan <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> mani student new york citi public school <EOS> <PAD> <PAD> <PAD>
A: <SOS> 1 1 million <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> mani peopl watch first episod american idol second season <EOS> <PAD>
A: <SOS> 26 5 million <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> televis station region headquart plymouth <EOS> <PAD> <PAD> <PAD> <PAD> <PAD>
A: <SOS> bbc south west <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> latitud guinea bissau most lie <EOS> <PAD> <PAD> <PAD> <PAD> <PAD>
A: <SOS> 11 13 n <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> accord articl tibet remain jurisdict <EOS> <PAD> <PAD> <PAD> <PAD> <PAD>
A:
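
The padded rows above follow a fixed layout: `<SOS>`, the tokens, `<EOS>`, then `<PAD>` up to the maximum length of 12. The sketch below shows one way to produce that layout. It is not the project's `prepare_text` or vocabulary class; `build_vocab`, `encode`, `decode`, and the ordering of the special tokens are assumptions made for illustration.

```python
# Hypothetical ordering of special tokens; the project's vocabulary may differ.
SPECIALS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]


def build_vocab(sentences):
    """Map every word to an integer id, reserving the first ids for specials."""
    word2idx = {tok: i for i, tok in enumerate(SPECIALS)}
    for sentence in sentences:
        for word in sentence.split():
            word2idx.setdefault(word, len(word2idx))
    return word2idx


def encode(sentence, word2idx, max_len):
    """Wrap a sentence in <SOS>/<EOS> and right-pad it to `max_len` ids."""
    body = [word2idx.get(w, word2idx["<UNK>"]) for w in sentence.split()]
    body = body[: max_len - 2]  # leave room for <SOS> and <EOS>
    ids = [word2idx["<SOS>"]] + body + [word2idx["<EOS>"]]
    return ids + [word2idx["<PAD>"]] * (max_len - len(ids))


def decode(ids, word2idx):
    """Turn a list of ids back into tokens."""
    idx2word = {i: w for w, i in word2idx.items()}
    return [idx2word[i] for i in ids]


word2idx = build_vocab(["7 march 1945", "bbc south west"])
ids = encode("7 march 1945", word2idx, max_len=12)
print(" ".join(decode(ids, word2idx)))
```

Padding every row to the same length is what allows questions and answers to be stacked into fixed-size batches.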

In [21]:
# define parameters
input_size = len(v.words)
output_size = len(v.words)
embedding_size = 300
hidden_size = 512
lstm_layer = 2
dropout = 0.3
learning_rate = 0.02
epochs = 200
clip = 1  # something between 1 and 5 as a starting point
BATCH_SIZE = 64
teaching_ratio = 0.5
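
The `teaching_ratio` parameter presumably controls teacher forcing: during training, the decoder is sometimes fed the ground-truth token from the previous step instead of its own prediction. The sketch below illustrates the idea with a hypothetical helper, `next_decoder_input`, that makes the choice once per step. Whether `train_loop` decides per step or per batch is not shown in this README.

```python
import random


def next_decoder_input(target_token, predicted_token, teaching_ratio, rng):
    """Pick the token that is fed to the decoder at the next step."""
    use_teacher = rng.random() < teaching_ratio
    return target_token if use_teacher else predicted_token


rng = random.Random(0)
target = ["<SOS>", "bbc", "south", "west", "<EOS>"]
predicted = ["<SOS>", "us", "3", "million", "<EOS>"]  # stand-in for model output

# With a ratio of 0.5, roughly half of the inputs come from the ground truth.
fed = [next_decoder_input(t, p, 0.5, rng) for t, p in zip(target, predicted)]
print(fed)
```

A ratio of 1.0 always feeds the ground truth, which tends to speed up early training, while a ratio of 0.0 matches inference, where no ground truth is available.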

In [22]:
from dataset_helper import get_dataloader
from encoder import Encoder
from decoder import Decoder
from seq2seq import Seq2Seq
import torch

train_dataloader, test_dataloader, valid_dataloader = get_dataloader(train_set, test_set, valid_set, BATCH_SIZE)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# instantiate encoder and decoder classes
enc = Encoder(input_size, hidden_size, embedding_size, lstm_layer, dropout, BATCH_SIZE).to(device)
dec = Decoder(input_size, hidden_size, output_size, embedding_size, lstm_layer, dropout).to(device)
# instantiate seq2seq model
model = Seq2Seq(enc, dec, device).to(device)
print(model)

Created `train_dataloader` with 37 batches!
Created `test_dataloader` with 11 batches!
Created `test_dataloader` with 3 batches!
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(9091, 300)
    (lstm): LSTM(300, 512, num_layers=2, batch_first=True, dropout=0.3)
    (dropout): Dropout(p=0.3, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(9091, 300)
    (lstm): LSTM(300, 512, num_layers=2, dropout=0.3)
    (lin_out): Linear(in_features=512, out_features=9091, bias=True)
    (dropout): Dropout(p=0.3, inplace=False)
    (softmax): LogSoftmax(dim=1)
  )
)
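
The decoder printed above ends in a `LogSoftmax` layer, so it emits log-probabilities rather than raw scores. That output pairs with `nn.NLLLoss`, whereas `nn.CrossEntropyLoss` applies the log-softmax itself and expects raw logits. The plain-Python sketch below, with made-up numbers, shows what the negative log-likelihood computes and how an `ignore_index` removes padded positions from the average.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs, targets, ignore_index):
    """Mean negative log-likelihood over targets that are not `ignore_index`."""
    kept = [-row[t] for row, t in zip(log_probs, targets) if t != ignore_index]
    return sum(kept) / len(kept)


# Three positions, three-word vocabulary; pretend id 1 is the padding token.
logits = [[2.0, 0.5, 0.1], [0.2, 0.2, 3.0], [1.0, 1.0, 1.0]]
targets = [0, 2, 1]

log_probs = [log_softmax(row) for row in logits]
loss_all = nll_loss(log_probs, targets, ignore_index=-100)  # nothing ignored
loss_masked = nll_loss(log_probs, targets, ignore_index=1)  # padding ignored
print(f"all positions: {loss_all:.3f}, padding ignored: {loss_masked:.3f}")
```

Without the mask, padded positions contribute to the loss like any other token, so the model is rewarded for predicting `<PAD>` instead of learning the answer tokens.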


In [23]:
from helpers import evaluate, train_loop
import torch.optim as optim
import torch.nn as nn
import torch.optim.lr_scheduler as lr_scheduler

model_out_path = 'qna-model.pt'
# define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# ignore the padding index when calculating the loss; ignore_index must match the <PAD> id in the vocab
# (the commented-out variant assumes <PAD>=3, the active one uses 1)
# criterion = nn.CrossEntropyLoss(ignore_index=3)
# NLLLoss pairs with the decoder's LogSoftmax output; test whether this helps training converge
criterion = nn.NLLLoss(ignore_index=1)
# lr scheduler to see if this improves the validation loss
# (named `scheduler` so it does not shadow the imported lr_scheduler module)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10)
# train loop
train_loop(model, train_dataloader, test_dataloader, optimizer, criterion, scheduler, clip, teaching_ratio, device, epochs, model_out_path)

	Epoch:   0 | 	 Train Loss: 3.634 | 	  lr: 0.02
	Epoch:   1 | 	 Train Loss: 2.833 | 	  lr: 0.02
	Epoch:   2 | 	 Train Loss: 2.697 | 	  lr: 0.02
	Epoch:   3 | 	 Train Loss: 2.639 | 	  lr: 0.02
	Epoch:   4 | 	 Train Loss: 2.559 | 	  lr: 0.02
	Epoch:   5 | 	 Train Loss: 2.504 | 	  lr: 0.02
	Epoch:   6 | 	 Train Loss: 2.463 | 	  lr: 0.02
	Epoch:   7 | 	 Train Loss: 2.454 | 	  lr: 0.02
	Epoch:   8 | 	 Train Loss: 2.441 | 	  lr: 0.02
	Epoch:   9 | 	 Train Loss: 2.453 | 	  lr: 0.02
	Epoch:  10 | 	 Train Loss: 2.444 | 	  lr: 0.02
	Epoch:  11 | 	 Train Loss: 2.428 | 	  lr: 0.02
	Epoch:  12 | 	 Train Loss: 2.359 | 	  lr: 0.02
	Epoch:  13 | 	 Train Loss: 2.334 | 	  lr: 0.02
	Epoch:  14 | 	 Train Loss: 2.301 | 	  lr: 0.02
	Epoch:  15 | 	 Train Loss: 2.292 | 	  lr: 0.02
	Epoch:  16 | 	 Train Loss: 2.265 | 	  lr: 0.02
	Epoch:  17 | 	 Train Loss: 2.237 | 	  lr: 0.02
	Epoch:  18 | 	 Train Loss: 2.221 | 	  lr: 0.02
	Epoch:  19 | 	 Train Loss: 2.203 | 	  lr: 0.02
	Epoch:  20 | 	 Train Loss: 2.260 | 	  l
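
The learning rate stays at 0.02 throughout the log above because `ReduceLROnPlateau` only lowers it after the monitored metric has stopped improving for more than `patience` epochs. The class below is a simplified stand-in for that rule, written for illustration: `PlateauTracker` is hypothetical, it omits the real scheduler's threshold, cooldown, and minimum learning rate options, and the loss values are made up. Which metric `train_loop` passes to the scheduler is not shown here; the sketch assumes a validation loss.

```python
class PlateauTracker:
    """Simplified sketch of reduce-on-plateau in 'min' mode."""

    def __init__(self, lr, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            # Improvement: remember it and reset the counter.
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # Plateau detected: shrink the learning rate and start counting again.
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


tracker = PlateauTracker(lr=0.02, factor=0.1, patience=10)
losses = [4.2, 4.1, 4.0] + [4.05] * 11  # made-up validation losses
history = [tracker.step(loss) for loss in losses]
print(history[-2], history[-1])
```

In this sketch the rate survives ten stagnant epochs and is multiplied by `factor` on the eleventh, so with `patience=10` a reduction cannot happen before epoch 11 at the earliest.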

In [24]:
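# Reload the saved checkpoint and evaluate it.
# Note: test_dataloader was also passed to train_loop above; valid_dataloader is the split not used so far.
model_out_path = 'qna-model.pt'
model.load_state_dict(torch.load(model_out_path, map_location=device))
valid_loss = evaluate(model, test_dataloader, criterion, device)
print(f'Validation Loss: {valid_loss:.3f}')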

Validation Loss: 4.011


In [32]:
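# inference: rebuild the model, load the trained weights, switch to eval mode
model = Seq2Seq(enc, dec, device).to(device)
model.load_state_dict(torch.load(model_out_path, map_location=device))
model.eval()  # call eval() so dropout is disabled during inference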

# define the input question
# question = "What does the urban education institute help run?"
print("Type 'exit' to finish the chat.\n", "-"*30, '\n')
while (True):
    question = input("> ")
    if question.strip() == "exit":
        break
    # clean and tokenize the input question
    print(f'Question:\t {question}')
    src = v.word2index(v.clean_text(question))
    # print(f'Clean Question: {src}')
    # convert the tokenized question to a tensor and add a batch dimension
    src = torch.tensor(src, dtype=torch.long).unsqueeze(0).to(device)
    # generate the answer using the model (no gradients are needed at inference time)
    with torch.no_grad():
        output = model(src=src, trg=None, teaching=0, max_len=12)
    # convert the output tensor to a list of token IDs
    preds = output.argmax(dim=1).tolist()[0]
    # convert the token IDs to tokens
    answer = v.index2word(preds)
    # pretty answer: join the tokens and blank out the special tokens
    pretty_answer = ' '.join(answer).replace('<SOS>', '').replace('<EOS>', '').replace('<PAD>', '')
    # print the predicted answer
    print(f'Answer:\t {pretty_answer}')

Type 'exit' to finish the chat.
 ------------------------------ 

Question:	 How did Beyoncé name her daughter?
Answer:	  wolf singl conservatori        
Question:	 Who led Issacs troops to Cyprus?
Answer:	  us 3 million price       
Question:	 Who began a program of church reform in the 1100s?
Answer:	  119 21 1991 coup       
Question:	 Who made fun of the Latin language?
Answer:	  10 30 1991 000       
Question:	 Which first chopin work did he gain international renown for?
Answer:	  new york client repres       
Question:	 Which instrument for disco songs do incorporate house music?
Answer:	  almost arkadievich malyarchuk hors       
Question:	 How many people watched the first episode of american idol?
Answer:	  us 000 million 000       
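
The answers in the transcript above carry leading and trailing whitespace because the special tokens are blanked out after the tokens have been joined, which leaves the separating spaces behind. A possible alternative, sketched below with a hypothetical `prettify` helper, cuts the sequence at the first `<EOS>` and filters the remaining special tokens before joining.

```python
SPECIAL_TOKENS = {"<SOS>", "<PAD>", "<UNK>"}


def prettify(tokens, eos="<EOS>"):
    """Keep tokens up to the first <EOS> and drop the other special tokens."""
    if eos in tokens:
        tokens = tokens[: tokens.index(eos)]
    return " ".join(t for t in tokens if t not in SPECIAL_TOKENS)


# Made-up decoder output; the stray "west" after <EOS> is discarded.
raw = ["<SOS>", "bbc", "south", "west", "<EOS>", "<PAD>", "<PAD>", "west"]
print(prettify(raw))  # prints: bbc south west
```

Stopping at `<EOS>` also discards anything the decoder emits after it has signalled the end of the answer.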


## Sample questions for chatbot

- Who led Issacs troops to Cyprus?
- How did Beyoncé name her daughter?
- Who began a program of church reform in the 1100s?
- Who made fun of the Latin language?
- Which first chopin work did he gain international renown for?
- Which instrument for disco songs do incorporate house music?
- How many people watched the first episode of american idol?