# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings is optional. We have loaded Brown embeddings from Gensim in the starter code below, so you can compare the performance of a model with pretrained embeddings against a model without them.
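
As a rough illustration of that comparison, the sketch below shows one way pretrained vectors could initialize an embedding layer. It is not the starter code: `build_embedding_layer`, `toy_vocab`, and `toy_vectors` are hypothetical names, and the dictionary stands in for vectors looked up from a Gensim model. Words without a pretrained vector keep a small random initialization.

```python
import torch
import torch.nn as nn

def build_embedding_layer(vocab_words, pretrained, embedding_size, freeze=False):
    """Build an nn.Embedding whose rows are copied from `pretrained` where available.

    vocab_words: list of words; the list index is the token ID (assumed layout).
    pretrained:  dict mapping word -> list of floats of length embedding_size
                 (a stand-in for vectors looked up from a Gensim model).
    """
    # Start from small random vectors so words without a pretrained vector still get a row.
    weights = torch.randn(len(vocab_words), embedding_size) * 0.1
    hits = 0
    for idx, word in enumerate(vocab_words):
        vector = pretrained.get(word)
        if vector is not None:
            weights[idx] = torch.tensor(vector, dtype=torch.float)
            hits += 1
    # freeze=False lets the copied vectors be fine-tuned during training.
    layer = nn.Embedding.from_pretrained(weights, freeze=freeze)
    return layer, hits

# Toy data for illustration only; the token order here is an assumption.
toy_vocab = ["<SOS>", "<EOS>", "<UNK>", "<PAD>", "bird", "date"]
toy_vectors = {"bird": [0.1, 0.2, 0.3, 0.4], "date": [0.5, 0.6, 0.7, 0.8]}
embedding, n_hits = build_embedding_layer(toy_vocab, toy_vectors, embedding_size=4)
print(embedding, f"({n_hits} words initialized from pretrained vectors)")
```

Training one model with a layer built this way and one with a plain `nn.Embedding` of the same size gives the two variants to compare.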



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder, which summarizes it in a hidden state. That hidden state is passed to the Decoder, which uses it to output a series of token predictions.
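
The data flow described above can be sketched in a few lines of PyTorch. This is a simplified illustration, not the project's `Encoder`, `Decoder`, or `Seq2Seq` modules: the `TinyEncoder`, `TinyDecoder`, and `greedy_decode` names are hypothetical, and the sketch assumes single-layer LSTMs, no dropout, and greedy (argmax) decoding.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) token IDs; only the final LSTM state is kept
        _, (hidden, cell) = self.lstm(self.embedding(src))
        return hidden, cell

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.lin_out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden, cell):
        # token: (batch,) -> (batch, 1) so the LSTM sees a length-1 sequence
        emb = self.embedding(token.unsqueeze(1))
        out, (hidden, cell) = self.lstm(emb, (hidden, cell))
        logits = self.lin_out(out.squeeze(1))  # (batch, vocab_size)
        return logits, hidden, cell

def greedy_decode(encoder, decoder, src, sos_id, max_len):
    """Encode src, then predict one token per step, feeding each prediction back in."""
    hidden, cell = encoder(src)
    token = torch.full((src.size(0),), sos_id, dtype=torch.long, device=src.device)
    steps = []
    for _ in range(max_len):
        logits, hidden, cell = decoder(token, hidden, cell)
        token = logits.argmax(dim=1)
        steps.append(token)
    return torch.stack(steps, dim=1)  # (batch, max_len)

torch.manual_seed(0)
toy_vocab_size = 20
tiny_enc = TinyEncoder(toy_vocab_size, embedding_size=8, hidden_size=16)
tiny_dec = TinyDecoder(toy_vocab_size, embedding_size=8, hidden_size=16)
toy_src = torch.randint(0, toy_vocab_size, (2, 7))  # 2 questions, 7 tokens each
with torch.no_grad():
    toy_preds = greedy_decode(tiny_enc, tiny_dec, toy_src, sos_id=0, max_len=5)
print(toy_preds.shape)
```

The untrained toy model produces arbitrary token IDs; the point is only the shape of the pipeline, where the Encoder's final state seeds the Decoder.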

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip (part of the Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
# import torch
from dataset_helper import load_df, prepare_text, train_test_split

train, dev = load_df()
# stage: 'train' or 'dev'
v, token_df = prepare_text(count_limit=2, min_length=2, max_length=13, stage='train')
print('create train, test and validation data sets ...')
train_set, test_set, valid_set = train_test_split(token_df)


Adding word 0 to our vocabulary.
Adding word 15000 to our vocabulary.
Adding word 30000 to our vocabulary.
Adding word 45000 to our vocabulary.
Adding word 60000 to our vocabulary.
Adding word 75000 to our vocabulary.
Adding word 90000 to our vocabulary.
Adding word 105000 to our vocabulary.
Adding word 120000 to our vocabulary.
Adding word 135000 to our vocabulary.
Adding word 150000 to our vocabulary.
Adding word 165000 to our vocabulary.
Adding word 180000 to our vocabulary.
Adding word 195000 to our vocabulary.
Adding word 210000 to our vocabulary.
Adding word 225000 to our vocabulary.
Adding word 240000 to our vocabulary.
Adding word 255000 to our vocabulary.
Adding word 270000 to our vocabulary.
Adding word 285000 to our vocabulary.
Adding word 300000 to our vocabulary.
Adding word 315000 to our vocabulary.
Adding word 330000 to our vocabulary.
Adding word 345000 to our vocabulary.
Adding word 360000 to our vocabulary.
Adding word 375000 to our vocabulary.
Adding word 390000 to o

In [2]:
# print first 10 QnAs
for i in range(10):
    ex_q = train_set.iloc[i, 0]
    ex_a = train_set.iloc[i, 1]
    ex_question = [w for w in v.index2word(ex_q) if w != '<UNK>']
    ex_answer = [w for w in v.index2word(ex_a) if w != '<UNK>']
    # print the question and its answer as space-separated tokens
    print("Q:", " ".join(ex_question))
    print("A:", " ".join(ex_answer), "\n")

Q: <SOS> on what date was the ussr dissolved <EOS> <PAD> <PAD> <PAD> <PAD>
A: <SOS> december 26 1991 <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> why did queen victoria want to take over other countries <EOS> <PAD>
A: <SOS> protecting native peoples from more aggressive powers or cruel rulers <EOS> <PAD> 

Q: <SOS> what do young birds form attachments to <EOS> <PAD> <PAD> <PAD> <PAD>
A: <SOS> potential breeding sites <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> who is toni morrison <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
A: <SOS> nobel prize winning novelist <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> who wrote the book robopocalypse is based on <EOS> <PAD> <PAD> <PAD>
A: <SOS> daniel h wilson <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> what team did the cincinnati red stockings become <EOS> <PAD> <PAD> <PAD>
A: <SOS> the atlanta braves <EOS> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

Q: <SOS> wh
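
Every printed sequence above has exactly 13 tokens: a `<SOS>` marker, the words, an `<EOS>` marker, and `<PAD>` tokens filling the remainder. The sketch below reproduces that layout for the first question. It is an illustration only: `wrap_and_pad` is a hypothetical helper, not the `prepare_text` implementation, and the total length of 13 is inferred from the output rather than confirmed by the helper's source.

```python
SOS, EOS, PAD = "<SOS>", "<EOS>", "<PAD>"

def wrap_and_pad(tokens, total_length=13):
    """Add <SOS>/<EOS> markers, then right-pad with <PAD> to a fixed length."""
    if len(tokens) + 2 > total_length:
        raise ValueError("sequence too long for the requested length")
    wrapped = [SOS] + list(tokens) + [EOS]
    return wrapped + [PAD] * (total_length - len(wrapped))

question = "on what date was the ussr dissolved".split()
padded = wrap_and_pad(question)
print(" ".join(padded))
# prints: <SOS> on what date was the ussr dissolved <EOS> <PAD> <PAD> <PAD> <PAD>
```

Fixed-length sequences let a whole batch be stacked into a single tensor, which is why the padding index is later excluded from the loss.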

In [3]:
# define parameters
input_size = len(v.words)
output_size = len(v.words)
embedding_size = 256
hidden_size = 512
lstm_layer = 2
dropout = 0.3
learning_rate = 0.01
epochs = 100
clip = 1
BATCH_SIZE = 128


In [4]:
from dataset_helper import get_dataloader
from encoder import Encoder
from decoder import Decoder
from seq2seq import Seq2Seq
import torch

train_dataloader, test_dataloader, valid_dataloader = get_dataloader(train_set, test_set, valid_set, BATCH_SIZE)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# instantiate encoder and decoder classes
enc = Encoder(input_size, hidden_size, embedding_size, lstm_layer, dropout, BATCH_SIZE).to(device)
dec = Decoder(input_size, hidden_size, output_size, embedding_size, lstm_layer, dropout).to(device)
# instantiate seq2seq model
model = Seq2Seq(enc, dec, device).to(device)
print(model)

Created `train_dataloader` with 145 batches!
Created `test_dataloader` with 46 batches!
Created `test_dataloader` with 15 batches!
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(30363, 256)
    (lstm): LSTM(256, 512, num_layers=2, batch_first=True, dropout=0.3)
    (dropout): Dropout(p=0.3, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(30363, 256)
    (lstm): LSTM(256, 512, num_layers=2, batch_first=True, dropout=0.3)
    (lin_out): Linear(in_features=512, out_features=30363, bias=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
)


In [5]:
from helpers import evaluate, train_loop
import torch.optim as optim
import torch.nn as nn

model_out_path = 'qna-model.pt'
# define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# ignore padding index when calculating the loss (<PAD>=3 in vocab)
criterion = nn.CrossEntropyLoss(ignore_index=3)
# train loop
train_loop(model, train_dataloader, test_dataloader, optimizer, criterion, clip, device, epochs, model_out_path)

	Epoch: 0 | 	Train Loss: 8.197 | 	 Val. Loss: 7.696


KeyboardInterrupt: 
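
To make the effect of `ignore_index` concrete, the toy example below checks that a loss computed with the padding index ignored equals the loss averaged over the real tokens only. The logits and targets are made up for illustration; only the padding index of 3 comes from the cell above.

```python
import torch
import torch.nn as nn

PAD_ID = 3  # the <PAD> index stated in the training cell

torch.manual_seed(0)
logits = torch.randn(4, 6)                      # 4 token positions, toy vocabulary of 6
targets = torch.tensor([1, 4, PAD_ID, PAD_ID])  # the last two positions are padding

# Padding positions contribute nothing to the loss or to the averaging denominator.
masked = nn.CrossEntropyLoss(ignore_index=PAD_ID)(logits, targets)
# Equivalent computation: average the loss over the two real tokens only.
manual = nn.CrossEntropyLoss()(logits[:2], targets[:2])
print(f"masked: {masked.item():.4f}  manual: {manual.item():.4f}")
```

Without `ignore_index`, the model would be rewarded for predicting `<PAD>` at the end of short answers, which would lower the reported loss without improving the answers.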

In [6]:
# Evaluate the model on data it has never seen
model_out_path = 'qna-model.pt'
# map_location lets a checkpoint saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load(model_out_path, map_location=device))
valid_loss = evaluate(model, valid_dataloader, criterion, device)
print(f'Validation Loss: {valid_loss:.3f}')

Validation Loss: 7.703


In [7]:
# inference, load model
model = Seq2Seq(enc, dec, device)
model.load_state_dict(torch.load(model_out_path, map_location=device))
model.eval()  # switch to evaluation mode (disables dropout)

# define the input question
# question = "What does the urban education institute help run?"
print("Type 'exit' to finish the chat.\n", "-"*30, '\n')
while True:
    question = input("> ")
    if question.strip() == "exit":
        break
    # clean and tokenize the input question
    src = v.word2index(v.clean_text(question))
    # convert the tokenized question to a tensor and add a batch dimension
    src = torch.tensor(src, dtype=torch.long).unsqueeze(0).to(device)
    # generate the answer using the model (no gradients are needed at inference)
    with torch.no_grad():
        output = model(src, None, 0)
    # convert the output tensor to a list of token IDs
    preds = output.argmax(dim=2).tolist()[0]
    # convert the token IDs to tokens
    answer = v.index2word(preds)
    # print the predicted answer
    print(answer)

Type 'exit' to finish the chat.
 ------------------------------ 

['<SOS>', 'the', '<EOS>', '<EOS>', '<EOS>']
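
The raw prediction still contains the marker tokens and repeats `<EOS>` after the answer has ended. One possible post-processing step, sketched below with a hypothetical `trim_answer` helper that is not part of the project code, cuts the list at the first `<EOS>` and drops the remaining markers before printing.

```python
def trim_answer(tokens, specials=("<SOS>", "<PAD>"), eos="<EOS>"):
    """Cut a predicted token list at the first <EOS> and drop marker tokens."""
    words = []
    for tok in tokens:
        if tok == eos:
            break  # everything after the first <EOS> is discarded
        if tok not in specials:
            words.append(tok)
    return " ".join(words)

print(trim_answer(['<SOS>', 'the', '<EOS>', '<EOS>', '<EOS>']))  # prints: the
```

In the chat loop, this would replace `print(answer)` with `print(trim_answer(answer))`.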
