# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings in your model is optional. We have loaded Brown embeddings from Gensim in the starter code below, so you can compare the performance of a model that uses the pretrained embeddings against one that does not.
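
One way such a comparison could be prepared, as a minimal sketch: build a weight matrix whose row `i` holds the pretrained vector for vocabulary word `i`, and fall back to small random values for words the pretrained vectors do not cover. The `pretrained` dictionary, the toy vocabulary, and the helper name `build_embedding_matrix` are made up for illustration; in the project the vectors would come from the Gensim model and the vocabulary from `prepare_text`.

```python
import random

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return (matrix, hits): one row per vocabulary word, in vocabulary order."""
    rng = random.Random(seed)
    matrix, hits = [], 0
    for word in vocab:
        vector = pretrained.get(word)
        if vector is not None and len(vector) == dim:
            matrix.append(list(vector))  # copy the pretrained vector
            hits += 1
        else:
            # word not covered by the pretrained vectors: small random init
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, hits

# Hypothetical toy data, not taken from the project.
vocab = ["<SOS>", "<EOS>", "<UNK>", "<PAD>", "music", "sport"]
pretrained = {"music": [0.1, 0.2, 0.3], "sport": [0.4, 0.5, 0.6]}

matrix, hits = build_embedding_matrix(vocab, pretrained, dim=3)
print(f"{hits} of {len(vocab)} words found in the pretrained vectors")
```

A matrix like this could then be converted to a tensor and passed to `torch.nn.Embedding.from_pretrained`. Whether to freeze the weights or fine-tune them is a design choice worth including in the comparison.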



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Encoder reads the input sequence and summarizes it in its final hidden state. That state is passed to the Decoder, which uses it to produce a series of token predictions, one token at a time.
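
The data flow described above can be sketched in PyTorch as follows. This is a deliberately small, single-layer illustration, not the project's `encoder.py` or `decoder.py`: the class names `TinyEncoder` and `TinyDecoder`, the helper `greedy_decode`, and all sizes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len); only the final hidden and cell states are kept
        _, (hidden, cell) = self.lstm(self.embedding(src))
        return hidden, cell

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.lin_out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden, cell):
        # token: (batch,) -> one decoding step -> logits: (batch, vocab_size)
        emb = self.embedding(token.unsqueeze(1))
        out, (hidden, cell) = self.lstm(emb, (hidden, cell))
        return self.lin_out(out.squeeze(1)), hidden, cell

def greedy_decode(enc, dec, src, sos_idx, max_len):
    """Encode src once, then feed each prediction back in as the next input."""
    with torch.no_grad():
        hidden, cell = enc(src)
        token = torch.full((src.size(0),), sos_idx, dtype=torch.long)
        steps = []
        for _ in range(max_len):
            logits, hidden, cell = dec(token, hidden, cell)
            token = logits.argmax(dim=1)
            steps.append(token)
    return torch.stack(steps, dim=1)  # (batch, max_len)

torch.manual_seed(0)
enc = TinyEncoder(vocab_size=20, emb_size=8, hidden_size=16)
dec = TinyDecoder(vocab_size=20, emb_size=8, hidden_size=16)
src = torch.randint(0, 20, (2, 5))  # two random "questions" of five tokens
pred = greedy_decode(enc, dec, src, sos_idx=1, max_len=4)
print(tuple(pred.shape))
```

The untrained sketch only demonstrates shapes and control flow; the predicted token IDs are meaningless until the model has been trained.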

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. The available options are listed here:

- https://pytorch.org/text/stable/datasets.html





In [1]:
# import torch
from dataset_helper import prepare_text, train_test_split
# load and prepare data
v, token_df = prepare_text(max_rows_train_set=30000, count_limit=2, min_length=3, max_length=12, stage='train') # dev or train
print('create train, test and validation data sets ...')
train_set, test_set, valid_set = train_test_split(token_df)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/q439310/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Train data frame contains 30000 rows.
Dev data frame contains 5928 rows.
Adding word 0 to our vocabulary.
Adding word 50000 to our vocabulary.
Adding word 100000 to our vocabulary.
Adding word 150000 to our vocabulary.
Adding word 200000 to our vocabulary.
Word count in vocab is now 11908, removed 8143 words during cleanup.
Data frame contains 30000 rows.
Data frame after row cleanup contains 5380 rows.
create train, test and validation data sets ...
Train set of length: 3766
Test set of length: 1210
Valid set of length: 404


In [2]:
# print the first 15 QnAs
for i in range(15):
    ex_q = train_set.iloc[i, 0]
    ex_a = train_set.iloc[i, 1]
    ex_question = [w for w in v.index2word(ex_q) if w!= '<UNK>']
    ex_answer = [w for w in v.index2word(ex_a) if w!= '<UNK>']
    # Finally, write out an answer for user
    print("Q:", " ".join(ex_question))
    print("A:", " ".join(ex_answer), "\n")

Q: <SOS> facil center southampton public sport outdoor activ <EOS> <PAD> <PAD> <PAD>
A: <SOS> southampton sport centr <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> instrument disco song incorpor hous music <EOS> <PAD> <PAD> <PAD> <PAD>
A: <SOS> synthes drum machin <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> date afl take control one team <EOS> <PAD> <PAD> <PAD> <PAD>
A: <SOS> januari 6 2016 <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> first chopin work gain intern renown <EOS> <PAD> <PAD> <PAD> <PAD>
A: <SOS> rondo op 1 <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> relief team travel wenchuan counti <EOS> <PAD> <PAD> <PAD> <PAD> <PAD>
A: <SOS> two militari transport plane <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> centr provid elect outsid scienc student <EOS> <PAD> <PAD> <PAD> <PAD>
A: <SOS> centr co curricular studi <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> posit malenkov demot <EOS> <PAD> <PAD> <PAD> <PAD> <
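
Every row printed above has the same shape: `<SOS>`, the stemmed tokens, `<EOS>`, then `<PAD>` up to a total of 12 tokens. A minimal sketch of how such a row could be produced is shown below. The helper `encode` and the index assignment are hypothetical; only `<PAD>=3` is stated in this notebook, and `prepare_text` may work differently.

```python
def encode(tokens, word2idx, max_length=12):
    """Wrap tokens in <SOS>/<EOS>, map them to indices, and pad to max_length."""
    unk = word2idx["<UNK>"]
    body = tokens[: max_length - 2]  # leave room for <SOS> and <EOS>
    ids = [word2idx["<SOS>"]]
    ids += [word2idx.get(token, unk) for token in body]
    ids.append(word2idx["<EOS>"])
    ids += [word2idx["<PAD>"]] * (max_length - len(ids))
    return ids

# Hypothetical index assignment; only <PAD>=3 comes from the notebook.
word2idx = {"<SOS>": 0, "<EOS>": 1, "<UNK>": 2, "<PAD>": 3,
            "southampton": 4, "sport": 5, "centr": 6}

ids = encode(["southampton", "sport", "centr"], word2idx)
print(ids)  # [0, 4, 5, 6, 1, 3, 3, 3, 3, 3, 3, 3]
```

Fixed-length rows are what allow the questions and answers to be stacked into batches by the dataloader.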

In [3]:
# define parameters
input_size = len(v.words)
output_size = len(v.words)
embedding_size = 300 # 256  -- 300
hidden_size = 500 # 256  -- 300
lstm_layer = 4
dropout = 0.5
learning_rate = 0.0001  # -- 0.00001
epochs = 100
clip = 1
BATCH_SIZE = 256 # -- 100
teaching_ratio = 0.3
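
The name `teaching_ratio` suggests a teacher-forcing ratio: at each decoding step during training, the decoder is fed the ground-truth token with that probability and its own previous prediction otherwise. The project's `Seq2Seq.forward` is not shown here, so the sketch below is an assumption about the mechanism, and `choose_next_input` is a hypothetical helper.

```python
import random

def choose_next_input(target_token, predicted_token, teaching_ratio, rng):
    """Return (next_input, forced): the ground truth with probability teaching_ratio."""
    if rng.random() < teaching_ratio:
        return target_token, True
    return predicted_token, False

rng = random.Random(42)
teaching_ratio = 0.3
steps = 10_000
forced = sum(
    choose_next_input("gold", "pred", teaching_ratio, rng)[1]
    for _ in range(steps)
)
print(f"teacher-forced steps: {forced / steps:.1%}")
```

With a ratio of 0 the decoder always consumes its own predictions, which is the setting the inference loop uses (`teaching=0`).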

In [4]:
from dataset_helper import get_dataloader
from encoder import Encoder
from decoder import Decoder
from seq2seq import Seq2Seq
import torch

train_dataloader, test_dataloader, valid_dataloader = get_dataloader(train_set, test_set, valid_set, BATCH_SIZE)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# instantiate encoder and decoder classes
enc = Encoder(input_size, hidden_size, embedding_size, lstm_layer, dropout, BATCH_SIZE).to(device)
dec = Decoder(input_size, hidden_size, output_size, embedding_size, lstm_layer, dropout).to(device)
# instantiate seq2seq model
model = Seq2Seq(enc, dec, device).to(device)
print(model)

Created `train_dataloader` with 14 batches!
Created `test_dataloader` with 4 batches!
Created `test_dataloader` with 1 batches!
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(11908, 300)
    (lstm): LSTM(300, 500, num_layers=4, batch_first=True, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(11908, 300)
    (lstm): LSTM(300, 500, num_layers=4, batch_first=True, dropout=0.5)
    (lin_out): Linear(in_features=500, out_features=11908, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (softmax): LogSoftmax(dim=1)
  )
)


In [5]:
from helpers import evaluate, train_loop
import torch.optim as optim
import torch.nn as nn
import torch.optim.lr_scheduler as lr_scheduler

model_out_path = 'qna-model.pt'
# define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# ignore padding index when calculating the loss (<PAD>=3 in vocab)
# criterion = nn.CrossEntropyLoss(ignore_index=3)
# the decoder ends in LogSoftmax, so use NLLLoss and check whether training converges better
criterion = nn.NLLLoss(ignore_index=3)
# lr scheduler: lower the learning rate when the validation loss stops improving
# (named `scheduler` so that it does not shadow the imported lr_scheduler module)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)
# train loop
train_loop(model, train_dataloader, test_dataloader, optimizer, criterion, scheduler, clip, teaching_ratio, device, epochs, model_out_path)

	Epoch:   0 | 	 Train Loss: 7.706 | 	  Val. Loss: 7.554 | 	  lr: 0.0001
	Epoch:   1 | 	 Train Loss: 7.001 | 	  Val. Loss: 6.142 | 	  lr: 0.0001
	Epoch:   2 | 	 Train Loss: 5.916 | 	  Val. Loss: 5.841 | 	  lr: 0.0001
	Epoch:   3 | 	 Train Loss: 5.596 | 	  Val. Loss: 5.658 | 	  lr: 0.0001
	Epoch:   4 | 	 Train Loss: 5.421 | 	  Val. Loss: 5.587 | 	  lr: 0.0001
	Epoch:   5 | 	 Train Loss: 5.332 | 	  Val. Loss: 5.631 | 	  lr: 0.0001
	Epoch:   6 | 	 Train Loss: 5.286 | 	  Val. Loss: 5.598 | 	  lr: 0.0001
	Epoch:   7 | 	 Train Loss: 5.261 | 	  Val. Loss: 5.608 | 	  lr: 0.0001
	Epoch:   8 | 	 Train Loss: 5.232 | 	  Val. Loss: 5.616 | 	  lr: 0.0001
	Epoch:   9 | 	 Train Loss: 5.221 | 	  Val. Loss: 5.634 | 	  lr: 0.0001
	Epoch:  10 | 	 Train Loss: 5.207 | 	  Val. Loss: 5.655 | 	  lr: 1e-05
	Epoch:  11 | 	 Train Loss: 5.195 | 	  Val. Loss: 5.613 | 	  lr: 1e-05
	Epoch:  12 | 	 Train Loss: 5.189 | 	  Val. Loss: 5.666 | 	  lr: 1e-05
	Epoch:  13 | 	 Train Loss: 5.187 | 	  Val. Loss: 5.633 | 	  lr: 1e

KeyboardInterrupt: 
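
The drop in learning rate at epoch 10 in the log above is consistent with `ReduceLROnPlateau(factor=0.1, patience=5)`: the best validation loss occurs at epoch 4, and the rate is cut once more than five epochs pass without improvement. The following is a simplified re-implementation of that rule for illustration only; it ignores PyTorch's `threshold`, `cooldown`, and `min_lr` options, and `simulate_plateau` is a hypothetical helper.

```python
def simulate_plateau(losses, lr, factor=0.1, patience=5):
    """Simplified plateau rule: cut lr after more than `patience` bad epochs."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

# Validation losses copied from the training log above (epochs 0 to 13).
val_losses = [7.554, 6.142, 5.841, 5.658, 5.587, 5.631, 5.598,
              5.608, 5.616, 5.634, 5.655, 5.613, 5.666, 5.633]

for epoch, lr in enumerate(simulate_plateau(val_losses, lr=1e-4)):
    print(f"epoch {epoch:2d}  lr {lr:.0e}")
```

Since the validation loss plateaus around 5.6 while the training loss keeps falling slowly, lowering the learning rate alone is unlikely to close the gap.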

In [6]:
# Evaluate the model on data it has never seen
model_out_path = 'qna-model.pt'
model.load_state_dict(torch.load(model_out_path))
valid_loss = evaluate(model, valid_dataloader, criterion, device)
print(f'Validation Loss: {valid_loss:.3f}')

Validation Loss: 5.666
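
The reported loss is a negative log-likelihood averaged over target tokens, with `<PAD>` positions skipped through `ignore_index=3`. Applying `NLLLoss` to `LogSoftmax` outputs computes the same quantity as `CrossEntropyLoss` on raw logits. The pure-Python sketch below reproduces the arithmetic on toy numbers; the helpers `log_softmax` and `masked_nll` and the values are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]

def masked_nll(log_probs, targets, ignore_index=3):
    """Mean negative log-likelihood over targets that are not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(log_probs, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing
        total -= row[target]
        count += 1
    # This sketch returns 0.0 when every target is ignored.
    return total / count if count else 0.0

# Toy vocabulary of 4 tokens; index 3 plays the role of <PAD>.
logits = [[2.0, 0.5, 0.1, 0.0],
          [0.2, 1.5, 0.3, 0.0],
          [0.0, 0.0, 0.0, 5.0]]
targets = [0, 1, 3]  # the last position is padding

log_probs = [log_softmax(row) for row in logits]
loss = masked_nll(log_probs, targets)
print(f"loss over 2 real tokens: {loss:.3f}")
```

As a reference point, a model that guessed uniformly over the 11908-word vocabulary would score ln(11908), roughly 9.4, so a loss near 5.6 is better than chance but still far from a confident prediction.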


In [8]:
# inference, load model
model = Seq2Seq(enc, dec, device)
model.load_state_dict(torch.load(model_out_path))
model.eval()  # call the method so dropout is disabled during inference

# define the input question
# question = "What does the urban education institute help run?"
print("Type 'exit' to finish the chat.\n", "-"*30, '\n')
while True:
    question = input("> ")
    if question.strip() == "exit":
        break
    # clean and tokenize the input question
    print(f'Question:\t {question}')
    src = v.word2index(v.clean_text(question))
    # convert the tokenized question to a tensor and add a batch dimension
    src = torch.tensor(src, dtype=torch.long).unsqueeze(0).to(device)
    # generate the answer using the model
    output = model(src=src, trg=None, teaching=0, max_len=13)
    # convert the output tensor to a list of token IDs
    preds = output.argmax(dim=2).tolist()[0]
    # convert the token IDs to tokens
    answer = v.index2word(preds)
    # pretty answer
    pretty_answer = ' '.join(answer) # .replace('<SOS>', '').replace('<EOS>', '')
    # print the predicted answer
    print(f'Answer:\t {pretty_answer}')

Type 'exit' to finish the chat.
 ------------------------------ 

Question:	 Which first chopin work gained international renown for?
Answer:	 <SOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS>
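
The raw prediction still contains the special tokens. A small post-processing step, hinted at by the commented-out `.replace(...)` calls in the cell, could cut the sequence at the first `<EOS>` and drop the remaining markers. The helper `prettify` below is a hypothetical sketch, not part of the project code.

```python
SPECIAL_TOKENS = {"<SOS>", "<PAD>"}

def prettify(tokens, eos="<EOS>"):
    """Cut the sequence at the first <EOS> and drop <SOS>/<PAD> markers."""
    if eos in tokens:
        tokens = tokens[: tokens.index(eos)]
    return " ".join(t for t in tokens if t not in SPECIAL_TOKENS)

print(repr(prettify(["<SOS>", "rondo", "op", "1", "<EOS>", "<PAD>", "<PAD>"])))
print(repr(prettify(["<SOS>"] + ["<EOS>"] * 12)))
```

For the answer printed above this yields an empty string, which makes it plain that the model predicted end-of-sequence immediately instead of producing any content tokens.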


## Sample questions for chatbot
- Who led Issacs troops to Cyprus?
- Who began a program of church reform in the 1100s?
- Who made fun of the Latin language?
- Which first chopin work gained international renown for?