# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings in your model is optional. We have loaded Brown Embeddings from Gensim in the starter code below, so you can compare the performance of a model that uses the pretrained embeddings against one that does not.
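One common way to use pretrained vectors is to copy them into an embedding matrix whose rows follow the vocabulary indices. The sketch below is only an illustration: `build_embedding_layer`, `word2idx`, and `pretrained` are hypothetical names, not part of the starter code, and the toy vectors stand in for the Gensim embeddings. It assumes the vocabulary indices run from 0 to `len(word2idx) - 1`.

```python
import torch
import torch.nn as nn

def build_embedding_layer(word2idx, pretrained, embedding_size, freeze=False):
    """Create an nn.Embedding whose rows are initialised from pretrained vectors.

    word2idx   -- hypothetical dict mapping each vocabulary word to its row index
    pretrained -- hypothetical dict-like mapping word -> sequence of floats
    Words missing from `pretrained` keep a small random initialisation.
    """
    weights = torch.randn(len(word2idx), embedding_size) * 0.1
    for word, idx in word2idx.items():
        if word in pretrained:
            weights[idx] = torch.as_tensor(pretrained[word], dtype=torch.float)
    # freeze=False lets the vectors be fine-tuned during training
    return nn.Embedding.from_pretrained(weights, freeze=freeze)

# Toy data standing in for the real vocabulary and pretrained vectors.
toy_word2idx = {"<pad>": 0, "hello": 1, "world": 2}
toy_vectors = {"hello": [1.0, 0.0, 0.0, 0.0], "world": [0.0, 1.0, 0.0, 0.0]}
embedding = build_embedding_layer(toy_word2idx, toy_vectors, embedding_size=4)
```

Setting `freeze=True` instead would keep the pretrained vectors fixed, which is one of the variants worth comparing against a model trained without pretrained embeddings.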



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works as follows: the Encoder reads the input sequence and summarizes it in its hidden state, that hidden state is passed to the Decoder, and the Decoder uses it to output a series of token predictions.
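The flow described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's `src.model` code, which is not shown here: the class names `SketchEncoder` and `SketchDecoder`, the `greedy_decode` helper, and the start-of-sequence index `sos_idx=1` are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class SketchEncoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size)

    def forward(self, src):
        # src: (src_len, batch) of token indices
        embedded = self.embedding(src)
        _, (hidden, cell) = self.lstm(embedded)
        return hidden, cell  # each has shape (1, batch, hidden_size)

class SketchDecoder(nn.Module):
    def __init__(self, output_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, token, hidden, cell):
        # token: (batch,) -> (1, batch) so the LSTM sees a single time step
        embedded = self.embedding(token.unsqueeze(0))
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        log_probs = torch.log_softmax(self.out(output.squeeze(0)), dim=1)
        return log_probs, hidden, cell

def greedy_decode(encoder, decoder, src, sos_idx, max_len):
    """Encode src, then emit max_len tokens, feeding each prediction back in."""
    hidden, cell = encoder(src)
    token = torch.full((src.shape[1],), sos_idx, dtype=torch.long)
    predictions = []
    for _ in range(max_len):
        log_probs, hidden, cell = decoder(token, hidden, cell)
        token = log_probs.argmax(dim=1)  # most likely next token per example
        predictions.append(token)
    return torch.stack(predictions)  # (max_len, batch)

sketch_encoder = SketchEncoder(input_size=20, embedding_size=8, hidden_size=16)
sketch_decoder = SketchDecoder(output_size=20, embedding_size=8, hidden_size=16)
src = torch.randint(0, 20, (5, 2))  # 5 tokens per sequence, batch of 2
with torch.no_grad():
    predicted = greedy_decode(sketch_encoder, sketch_decoder, src, sos_idx=1, max_len=7)
```

The decoder returns log-probabilities so that it pairs naturally with a negative log-likelihood loss; an untrained model like this one produces arbitrary tokens, so only the tensor shapes are meaningful here.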

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip (part of the Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
from src.data import Seq2SeqDataset, Vocabulary
import torch
from torch.nn.utils.rnn import pad_sequence


In [2]:
# instantiate the custom dataset and prepare the training data
train_dataset = Seq2SeqDataset()
train_dataset.load_train_data()
train_dataset.get_pairs()
train_dataset.tokenize()
# create a vocabulary based on training data
vocab = Vocabulary().from_iterator(train_dataset)
# apply the vocabulary to numerize the training data
train_dataset.numerize(vocab)
# create the validation dataset and apply the vocabulary to numerize it
val_dataset = Seq2SeqDataset()
val_dataset.load_val_data().get_pairs().tokenize()
val_dataset.numerize(vocab)

<src.data.Seq2SeqDataset at 0x7f50f498c4f0>

In [3]:
# get a sense of the data
print(vocab.word_count)
print(len(train_dataset))
print(len(val_dataset))

3451
5000
1500
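The `Vocabulary` class comes from `src/data.py`, which is not shown in this notebook. As a rough idea of what such a class typically does, here is a minimal, hypothetical stand-in. The class name `ToyVocabulary`, its methods, and the ordering of the special tokens are assumptions; the only detail taken from the notebook is that index 3 is used for padding.

```python
from collections import Counter

class ToyVocabulary:
    """Hypothetical stand-in for src.data.Vocabulary; not the project's code."""

    # Assumed ordering; only "<pad>" at index 3 matches the notebook.
    SPECIALS = ["<sos>", "<eos>", "<unk>", "<pad>"]

    def __init__(self):
        self.word2idx = {tok: i for i, tok in enumerate(self.SPECIALS)}
        self.idx2word = list(self.SPECIALS)
        self.counts = Counter()

    def add_sentence(self, tokens):
        # register every unseen token under the next free index
        for tok in tokens:
            self.counts[tok] += 1
            if tok not in self.word2idx:
                self.word2idx[tok] = len(self.idx2word)
                self.idx2word.append(tok)

    @property
    def word_count(self):
        return len(self.idx2word)

    def numerize(self, tokens):
        # unknown words map to "<unk>"; the sequence is wrapped in <sos>/<eos>
        unk = self.word2idx["<unk>"]
        body = [self.word2idx.get(tok, unk) for tok in tokens]
        return [self.word2idx["<sos>"]] + body + [self.word2idx["<eos>"]]

toy_vocab = ToyVocabulary()
toy_vocab.add_sentence(["how", "are", "you"])
toy_vocab.add_sentence(["how", "is", "it"])
ids = toy_vocab.numerize(["how", "are", "they"])
print(ids)  # [0, 4, 5, 2, 1]
```

Building the vocabulary from the training data only, as the notebook does, means that words seen only in the validation set fall back to the unknown token in a scheme like this one.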


In [4]:

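def padding(batch: list):
    """Collate function: pad source and target sequences to the longest one in the batch."""
    # each item in the batch is a (source, target) pair of index tensors
    src_seqs = [pairs[0] for pairs in batch]
    src_padded = pad_sequence(src_seqs, padding_value=3)  # 3 is the index of the 'pad' token
    trg_seqs = [pairs[1] for pairs in batch]
    trg_padded = pad_sequence(trg_seqs, padding_value=3)
    # pad_sequence defaults to batch_first=False, so both outputs are (max_len, batch_size)
    return src_padded, trg_padded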
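To see what `pad_sequence` does to a batch, here is a small standalone example with two toy sequences of different lengths. The token values and the `PAD_IDX` name are made up for illustration; the padding index 3 is the one used in the notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 3  # same padding index as in the notebook

seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]

# Default layout: (max_len, batch_size), one column per sequence.
padded = pad_sequence(seqs, padding_value=PAD_IDX)
print(padded.shape)     # torch.Size([3, 2])
print(padded.tolist())  # [[5, 8], [6, 9], [7, 3]]

# batch_first=True gives (batch_size, max_len), one row per sequence.
padded_bf = pad_sequence(seqs, batch_first=True, padding_value=PAD_IDX)
print(padded_bf.tolist())  # [[5, 6, 7], [8, 9, 3]]
```

The default layout matches what `nn.LSTM` expects when it is created without `batch_first=True`.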

In [5]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=128, collate_fn=padding)
val_dataloader = DataLoader(val_dataset, batch_size=128, collate_fn=padding)
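As a quick sanity check on batching, the sketch below builds a toy dataset with the same number of pairs as the training set (5000) and the same batch size (128), then counts the batches. The random `toy_pairs` and the `collate` function are illustrative stand-ins for `train_dataset` and `padding`, not the project's data.

```python
import math
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate(batch):
    # same idea as the notebook's padding function: pad with index 3
    src = pad_sequence([pair[0] for pair in batch], padding_value=3)
    trg = pad_sequence([pair[1] for pair in batch], padding_value=3)
    return src, trg

# Toy dataset: 5000 (source, target) pairs with random lengths and tokens.
gen = torch.Generator().manual_seed(0)
toy_pairs = []
for _ in range(5000):
    src_len = int(torch.randint(2, 10, (1,), generator=gen))
    trg_len = int(torch.randint(2, 10, (1,), generator=gen))
    toy_pairs.append((torch.randint(4, 50, (src_len,), generator=gen),
                      torch.randint(4, 50, (trg_len,), generator=gen)))

loader = DataLoader(toy_pairs, batch_size=128, collate_fn=collate)
num_batches = len(loader)
batch_sizes = [src.shape[1] for src, _ in loader]
print(num_batches, math.ceil(5000 / 128))  # 40 40
print(batch_sizes[0], batch_sizes[-1])     # 128 8
```

With 5000 examples and a batch size of 128, each epoch has 40 batches and the last one holds only 8 examples, so code that assumes a fixed batch size needs to handle that smaller final batch (or pass `drop_last=True` to the `DataLoader`).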

In [6]:
from src.model import Seq2Seq, Encoder, Decoder
import torch.nn as nn
from train import epoch_train, epoch_evaluate

# hyperparameters
learning_rate = 0.00001
hidden_size = 300  # encoder and decoder hidden size
embedding_size = 300
dropout = 0.5
batch_size = 100  # note: the DataLoaders above were created with batch_size=128
epochs = 50
words_count = vocab.word_count

# build the model; the encoder and decoder use the same vocabulary size
encoder = Encoder(input_size=words_count, hidden_size=hidden_size, embedding_size=embedding_size, dropout=dropout)
decoder = Decoder(output_size=words_count, hidden_size=hidden_size, embedding_size=embedding_size, dropout=dropout)
seq2seq = Seq2Seq(encoder=encoder, decoder=decoder)
criterion = nn.NLLLoss()  # NLLLoss expects log-probabilities as input
optimizer = torch.optim.Adam(seq2seq.parameters(), lr=learning_rate)

for epoch in range(epochs):
    train_loss = epoch_train(seq2seq, train_dataloader, optimizer=optimizer, criterion=criterion, batch_size=batch_size)
    val_loss = epoch_evaluate(seq2seq, val_dataloader, criterion=criterion)
    print(f"epoch {epoch} train loss {train_loss} || val loss {val_loss}")

epoch 0 train loss 8.101334762573241 || val loss 8.120554367701212
epoch 1 train loss 7.975427377223968 || val loss 7.698226531346639
epoch 2 train loss 7.7873450994491575 || val loss 7.2554095188776655
epoch 3 train loss 7.45891923904419 || val loss 6.4916675090789795
epoch 4 train loss 6.859838712215423 || val loss 5.348465879758199
epoch 5 train loss 5.870908486843109 || val loss 4.196055233478546
epoch 6 train loss 4.7138101875782015 || val loss 3.3295124570528665
epoch 7 train loss 3.79291672706604 || val loss 2.7480830748875937
epoch 8 train loss 3.1611127495765685 || val loss 2.3885755836963654
epoch 9 train loss 2.7499813050031663 || val loss 2.1622140606244407
epoch 10 train loss 2.482448935508728 || val loss 2.0034494598706565
epoch 11 train loss 2.291770887374878 || val loss 1.884276459614436
epoch 12 train loss 2.1550375282764436 || val loss 1.7897451718648274
epoch 13 train loss 2.047431567311287 || val loss 1.7171014348665874
epoch 14 train loss 1.96210874915123 || val lo