# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings in your model is optional. The starter code below includes commented-out code that trains and loads Word2Vec embeddings on the Brown corpus with Gensim. You can compare the performance of your model with pretrained embeddings against a model without them.
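One way to plug pretrained vectors into the model is to copy them into an `nn.Embedding` layer. The sketch below is only an illustration under stated assumptions: the tiny `pretrained` dictionary stands in for vectors you would read from the Gensim model, `index2word` stands in for the vocabulary built later in the notebook, and `build_embedding_matrix` is a hypothetical helper that is not part of the starter code. Words without a pretrained vector keep a small random initialization.

```python
import numpy as np
import torch
import torch.nn as nn


def build_embedding_matrix(index2word, pretrained, dim, seed=0):
    """Return a (vocab_size, dim) float32 matrix. Rows for words found in
    `pretrained` are copied from it; all other rows stay randomly initialized."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(index2word), dim)).astype(np.float32)
    for idx, word in index2word.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
    return matrix


# Stand-in data (assumption): in the notebook these would come from the
# vocabulary object and from the trained Word2Vec model.
index2word = {0: "PAD", 1: "SOS", 2: "EOS", 3: "when", 4: "did"}
pretrained = {
    "when": np.ones(4, dtype=np.float32),
    "did": np.full(4, 2.0, dtype=np.float32),
}

weights = torch.from_numpy(build_embedding_matrix(index2word, pretrained, dim=4))

# freeze=False lets the copied vectors be fine-tuned during training.
embedding = nn.Embedding.from_pretrained(weights, freeze=False)

print(embedding(torch.tensor([3, 4])))
```

To compare against a model without pretrained embeddings, the same layer can be created with `nn.Embedding(len(index2word), 4)` and everything else left unchanged.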



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder. The Encoder's final hidden state (and, for an LSTM, its cell state) is passed to the Decoder, which uses it to output a series of token predictions.
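The data flow described above can be sketched with bare PyTorch layers. This is a toy illustration, not the project's model: the sizes, the start-of-sentence index `1`, and the greedy decoding loop are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_dim, hidden_dim = 20, 8, 16    # toy sizes (assumptions)

# Encoder: embedding layer + LSTM
enc_embedding = nn.Embedding(vocab_size, emb_dim)
enc_lstm = nn.LSTM(emb_dim, hidden_dim)

# Decoder: embedding layer + LSTM + linear output layer
dec_embedding = nn.Embedding(vocab_size, emb_dim)
dec_lstm = nn.LSTM(emb_dim, hidden_dim)
dec_out = nn.Linear(hidden_dim, vocab_size)

src = torch.randint(0, vocab_size, (5, 1))     # (src_len, batch_size)
_, (h, c) = enc_lstm(enc_embedding(src))       # keep only the final state

token = torch.tensor([1])                      # assumed start-of-sentence index
predicted = []
for _ in range(4):                             # greedily generate 4 tokens
    emb = dec_embedding(token).unsqueeze(0)    # (1, batch_size, emb_dim)
    out, (h, c) = dec_lstm(emb, (h, c))        # the state carries the context forward
    logits = dec_out(out.squeeze(0))           # (batch_size, vocab_size)
    token = logits.argmax(dim=1)               # most likely next token
    predicted.append(token.item())

print(predicted)
```

The only link between the two halves is the `(h, c)` pair: the decoder never sees the source tokens directly.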

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip (part of the Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first; the code below uses `SQuAD2`. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
!pip install torchdata

Defaulting to user installation because normal site-packages is not writeable
Collecting torchdata
  Downloading torchdata-0.4.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)
Collecting torch==1.12.1
  Downloading torch-1.12.1-cp37-cp37m-manylinux1_x86_64.whl (776.3 MB)

In [2]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
from nltk.corpus import brown
from torchtext.datasets import SQuAD2
# Already downloaded
nltk.download('brown')
nltk.download('punkt')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
# model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')
# w2v = gensim.models.Word2Vec.load('brown.embedding')

In [4]:
# Now let's import our files
import src.helpers as helpers
import src.model as model
import src.givens as givens

In [5]:
# Load the train and dev splits; the dev split serves as the test set
train_dataset, test_dataset = SQuAD2()

In [6]:
# Training dataset to dataframe
train_df = givens.loadDF(train_dataset)

In [7]:
# Let's view our dataframe
train_df.head()

|   | question | answer |
|---|----------|--------|
| 0 | When did Beyonce start becoming popular? | in the late 1990s |
| 1 | What areas did Beyonce compete in when she was... | singing and dancing |
| 2 | When did Beyonce leave Destiny's Child and bec... | 2003 |
| 3 | In what city and state did Beyonce grow up? | Houston, Texas |
| 4 | In which decade did Beyonce become famous? | late 1990s |


In [8]:
print("Start preparing training data ...")
voc, pairs = helpers.readVocs(train_df, "train")
print("Read {!s} sentence pairs".format(len(pairs)))

Start preparing training data ...
Reading lines...
Read 86821 sentence pairs


In [9]:
pairs = helpers.filterPairs(pairs)
print("Trimmed to {!s} sentence pairs".format(len(pairs)))

Trimmed to 29334 sentence pairs


In [10]:
print("Counting words...")
for pair in pairs:
    voc.addSentence(pair[0])
    voc.addSentence(pair[1])
print("Counted words:", voc.num_words)

Counting words...
Counted words: 28298


In [11]:
for pair in pairs[:5]:
    print(pair)

['when did beyonce start becoming popular ?', 'in the late s']
['in which decade did beyonce become famous ?', 'late s']
['what album made her a worldwide known artist ?', 'dangerously in love']
['who managed the destiny s child group ?', 'mathew knowles']
['when did beyonce rise to fame ?', 'late s']
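The printed pairs are lowercased, digits and apostrophes are gone, and the question mark is split off as its own token. The normalization itself lives in `src/helpers.py` and is not shown here. The hypothetical `normalize_string` below is a sketch of one normalization that is consistent with the examples above; the actual helper may differ.

```python
import re


def normalize_string(s):
    """Lowercase, split off sentence punctuation, and drop non-letter characters."""
    s = s.lower().strip()
    s = re.sub(r"([.!?])", r" \1", s)        # put a space before . ! ?
    s = re.sub(r"[^a-z.!?]+", " ", s)        # replace any other non-letter run with a space
    return re.sub(r"\s+", " ", s).strip()    # collapse repeated spaces


print(normalize_string("When did Beyonce start becoming popular?"))
print(normalize_string("in the late 1990s"))
print(normalize_string("Who managed the Destiny's Child group?"))
```

Note the side effect visible in the data: because digits are dropped, an answer such as "late 1990s" collapses to "late s", so numeric answers lose their meaning under this kind of normalization.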


In [12]:
MIN_COUNT = 3    # Minimum word count threshold for trimming



def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)

keep_words 9536 / 28295 = 0.3370
Trimmed from 29334 pairs to 13802, 0.4705 of total
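The vocabulary object comes from `src/helpers.py`, so its `trim` method is not shown in this notebook. The sketch below uses a hypothetical, simplified `TinyVoc` class to illustrate the idea: count every word, then rebuild the index from only the words seen at least `min_count` times. The three reserved tokens are an assumption, chosen because the reported `num_words` is always three more than the number of kept words.

```python
RESERVED = ["PAD", "SOS", "EOS"]  # assumed special tokens at indices 0-2


class TinyVoc:
    """Hypothetical, simplified stand-in for the notebook's vocabulary object."""

    def __init__(self):
        self.word2count = {}
        self._rebuild([])

    def _rebuild(self, words):
        # Start from the reserved tokens, then index the given words in order.
        self.index2word = dict(enumerate(RESERVED))
        self.word2index = {}
        for word in words:
            self._index(word)

    def _index(self, word):
        if word not in self.word2index:
            idx = len(self.index2word)
            self.word2index[word] = idx
            self.index2word[idx] = word

    @property
    def num_words(self):
        return len(self.index2word)

    def add_sentence(self, sentence):
        for word in sentence.split(" "):
            self.word2count[word] = self.word2count.get(word, 0) + 1
            self._index(word)

    def trim(self, min_count):
        # Keep only words that occur at least min_count times.
        keep = [w for w, n in self.word2count.items() if n >= min_count]
        self.word2count = {w: self.word2count[w] for w in keep}
        self._rebuild(keep)


voc_demo = TinyVoc()
for sentence in ["late s", "late s", "in the late s", "mathew knowles"]:
    voc_demo.add_sentence(sentence)
print(voc_demo.num_words)                 # 3 reserved tokens + 6 distinct words
voc_demo.trim(min_count=3)
print(voc_demo.num_words, voc_demo.word2index)
```

After trimming, any sentence pair that still contains a removed word has to be dropped as well, which is what `trimRareWords` does with `voc.word2index`.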


In [13]:
print(dir(voc))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'addSentence', 'addWord', 'index2word', 'name', 'num_words', 'trim', 'trimmed', 'word2count', 'word2index']


In [14]:
voc.num_words

9539

In [24]:
import torch.nn as nn
import torch.optim as optim

In [16]:
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, embedding_size):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.embedding_dim = embedding_size

        self.hidden = torch.zeros(1, 1, hidden_size)

        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(num_embeddings=self.input_size,
                                      embedding_dim=self.embedding_dim)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(input_size=self.embedding_dim,
                            hidden_size=self.hidden_size,
                            num_layers=1)

    # Forward pass
    def forward(self, i):
        """
        Inputs: i, the src tensor of token indices, shape (src_len, batch_size)
        Outputs: o, the encoder outputs
                (h, c), the final hidden state and cell state, as a tuple
        """
        embedded = self.embedding(i)

        # nn.LSTM returns (o, (h, c)); return it unchanged, so callers
        # unpack the result as `o, (h, c) = encoder(src)`
        return self.lstm(embedded)

In [27]:
# Decoder class
class Decoder(nn.Module):
    def __init__(self, hidden_size, output_size, embedding_size):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding_size = embedding_size
        
        # self.embedding provides a vector representation of the target to our model
        self.embedding = nn.Embedding(num_embeddings=self.output_size,
                                      embedding_dim=self.embedding_size)
        
        # self.lstm, accepts the embeddings and outputs a hidden state
        self.lstm = nn.LSTM(self.embedding_size, hidden_size, num_layers=3)
        
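        # self.out, predicts on the LSTM output via a linear output layer
        self.out = nn.Linear(self.hidden_size, self.output_size)

    # Forward pass
    def forward(self, i, h, c):
        '''
        Inputs: i, the current target tokens, shape (batch_size,)
                h, the hidden state
                c, the cell state
        Outputs: p, the prediction (logits over the vocabulary)
                h, the updated hidden state
                c, the updated cell state
        '''
        # Add a sequence dimension of length 1: (batch_size,) -> (1, batch_size)
        unsqueezed = i.unsqueeze(0)
        embedded = self.embedding(unsqueezed)

        # nn.LSTM returns (output, (h, c))
        o, (h, c) = self.lstm(embedded, (h, c))

        # Project the LSTM output onto the vocabulary
        p = self.out(o.squeeze(0))

        return p, h, c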

In [18]:
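import random


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_size

        # Collects the decoder predictions for every target position
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size, device=trg.device)

        # The encoder returns (outputs, (h, c)); only the final state is needed
        _, (h, c) = self.encoder(src)

        # The encoder has one LSTM layer and the decoder has three, so repeat
        # the encoder state once per decoder layer (one simple way to bridge
        # the mismatch in the expected state shape)
        repeats = self.decoder.lstm.num_layers // h.shape[0]
        h, c = h.repeat(repeats, 1, 1), c.repeat(repeats, 1, 1)

        # The first decoder input is the first target token
        x = trg[0, :]
        for t in range(1, trg_len):
            p, h, c = self.decoder(x, h, c)
            outputs[t] = p
            # Randomly feed either the ground truth or the model's own prediction
            teacher_force = random.random() < teacher_forcing_ratio
            top_trg = p.argmax(1)
            x = trg[t] if teacher_force else top_trg

        return outputs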

In [19]:
# First initialize our model.
input_size = voc.num_words
output_size = voc.num_words
embedding_size = 256
hidden_size = 512

In [20]:
encoder = Encoder(input_size=input_size,   hidden_size=hidden_size, embedding_size=embedding_size)
decoder = Decoder(hidden_size=hidden_size, output_size=output_size, embedding_size=embedding_size)

model = Seq2Seq(encoder, decoder)

In [22]:
# Defining hyperparameters
lr = 0.001

In [25]:
optimizer = optim.Adam(model.parameters(), lr=lr)

criterion = nn.CrossEntropyLoss()
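An optimizer and loss like these are typically used in a training step such as the one sketched below. This is an illustration under assumptions, not the project's training loop: a small `nn.Linear` named `stand_in` replaces the Seq2Seq model so the block runs on its own, and the tensor sizes are toy values. What carries over is the shape handling: the forward pass starts decoding at position 1, so position 0 is skipped, and the time and batch axes are flattened before calling `CrossEntropyLoss`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
trg_len, batch_size, vocab_size = 6, 4, 10     # toy sizes (assumptions)

# Stand-in for the Seq2Seq model: any module that yields logits of shape
# (trg_len, batch_size, vocab_size) fits the same loss computation.
stand_in = nn.Linear(8, vocab_size)
features = torch.randn(trg_len, batch_size, 8)
trg = torch.randint(0, vocab_size, (trg_len, batch_size))

optimizer = optim.Adam(stand_in.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

optimizer.zero_grad()
output = stand_in(features)                    # (trg_len, batch_size, vocab_size)

# Position 0 holds the first input token and is never predicted, so skip it,
# then flatten time and batch into one axis for CrossEntropyLoss.
loss = criterion(output[1:].reshape(-1, vocab_size), trg[1:].reshape(-1))

loss.backward()
# Gradient clipping is a common, optional safeguard for recurrent models.
torch.nn.utils.clip_grad_norm_(stand_in.parameters(), max_norm=1.0)
optimizer.step()

print(f"loss: {loss.item():.4f}")
```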

In [26]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 17,133,891 trainable parameters
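The reported total can be checked by hand from the layer sizes used above (vocabulary of 9,539 words, embedding size 256, hidden size 512, one LSTM layer in the encoder and three in the decoder). The helper `lstm_params` below is a hypothetical name introduced for this check; it counts the weights and the two bias vectors that a unidirectional `nn.LSTM` holds for each of its four gates.

```python
def lstm_params(input_size, hidden_size, num_layers=1):
    """Trainable parameters of a unidirectional nn.LSTM with biases."""
    total = 0
    for layer in range(num_layers):
        in_features = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights, and two bias vectors
        total += 4 * (in_features * hidden_size
                      + hidden_size * hidden_size
                      + 2 * hidden_size)
    return total


vocab, emb, hidden = 9539, 256, 512

encoder_params = vocab * emb + lstm_params(emb, hidden, num_layers=1)
decoder_params = (vocab * emb                               # embedding
                  + lstm_params(emb, hidden, num_layers=3)  # 3-layer LSTM
                  + hidden * vocab + vocab)                 # linear weights + bias

print(f"{encoder_params + decoder_params:,}")   # 17,133,891
```

The two embedding tables and the output layer scale with the vocabulary size, so trimming rare words reduces the parameter count as well as the number of training pairs.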
