In this coding example, we will train an LSTM model to generate new text.

In [1]:
import torch
import os
import torch.nn as nn
import numpy as np
from torch.nn.utils import clip_grad_norm_
import torch.nn.functional as F

### Dictionary Class:
As we saw in the tutorial, before passing words to the Embedding layer we need to create a word-to-index mapping.
We will call this word2idx mapping a dictionary. The Dictionary class performs the following tasks:

- First, it checks whether the given word is already present in the dictionary.
- If the word is new, the class adds it to the dictionary and assigns it the next free index.

In [2]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1
            
    def __len__(self):
        return len(self.word2idx)
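
As a quick illustration of the mapping, the sketch below repeats the class so that it runs on its own and encodes a made-up sentence. The sentence and the variable names `vocab` and `encoded` are assumptions for the example only.

```python
class Dictionary(object):
    """Same word-to-index mapping as above, repeated so this runs standalone."""
    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __len__(self):
        return len(self.word2idx)


# Toy sentence (made up for illustration); '<eos>' marks the end of the line
sentence = "the cat sat on the mat".split() + ['<eos>']

vocab = Dictionary()
for word in sentence:
    vocab.add_word(word)

# A repeated word keeps its first index, so 'the' is stored only once
encoded = [vocab.word2idx[w] for w in sentence]
print(len(vocab))   # 6
print(encoded)      # [0, 1, 2, 3, 0, 4, 5]
```

The reverse mapping idx2word is what later turns sampled indices back into words.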

### Corpus Class:

The Corpus class, with the help of the Dictionary class, performs the following operations:

- Creates a Dictionary object, which is used in the later steps.
- The get_data method does the following:
    - Opens the file in read mode.
    - Reads the file line by line and splits each line into words.
    - Adds an end-of-sentence token ('<eos>') at the end of each line.
    - Maintains a variable tokens to keep track of the total number of words.
    - Adds the words to the dictionary.
    - Once all the words are in the dictionary, creates a long tensor named 'ids'.
    - Fills 'ids' with the dictionary index of every word, looked up via word2idx.
    - Trims the leftover words so that every row of the batch has the same length.

In [3]:
class Corpus(object):
    
    def __init__(self):
        self.dictionary = Dictionary()

    def get_data(self, path, batch_size=20):
        with open(path, 'r') as f:
            tokens = 0
            for line in f:
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words: 
                    self.dictionary.add_word(word)  
        # Create a 1-D tensor holding the index of every word in the file, looked up via word2idx
        ids = torch.LongTensor(tokens)
        token = 0
        with open(path, 'r') as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1
        # Number of tokens each row will hold
        num_batches = ids.shape[0] // batch_size
        # Drop the trailing remainder so that every row has the same length
        ids = ids[:num_batches*batch_size]
        # Reshape to (batch_size, num_batches)
        ids = ids.view(batch_size, -1)
        return ids
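
The trimming and reshaping at the end of get_data can be seen on a tiny example. This is a minimal sketch with toy values (10 stand-in indices and a batch size of 3, both chosen only for illustration), not the real corpus.

```python
import torch

batch_size = 3                      # toy value; the notebook uses 20
ids = torch.arange(10)              # stand-in for 10 word indices: 0..9

num_batches = ids.shape[0] // batch_size   # 10 // 3 = 3 tokens per row
ids = ids[:num_batches * batch_size]       # drop the leftover token (9)
ids = ids.view(batch_size, -1)             # shape (batch_size, num_batches)

print(ids.shape)  # torch.Size([3, 3])
print(ids)
```

Each row ends up as one contiguous slice of the text, so the rows are independent streams that the model reads in parallel.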

### Setting the parameter values

In [4]:
embed_size = 128    # Embedding layer size , input to the LSTM
hidden_size = 1024  # Hidden size of LSTM units
num_layers = 1      # number of stacked LSTM layers
num_epochs = 10     # total no of epochs
batch_size = 20     # batch size
seq_length = 100     # sequence length
learning_rate = 0.002 # learning rate

In [5]:
corpus = Corpus()

Call the get_data method with the path of the file and the batch size.
Data Source - https://raw.githubusercontent.com/yunjey/pytorch-tutorial/master/tutorials/02-intermediate/language_model/data/train.txt

In [6]:
ids = corpus.get_data('data.txt', batch_size)

In [7]:
# ids holds the dictionary index of every word in the file
print(ids.shape)

torch.Size([20, 18970])


In [8]:
ids

tensor([[   0,    1,    2,  ...,  737,  181,  247],
        [  42,   32,   79,  ..., 1132,   27,   27],
        [ 467,   27,   27,  ...,   24,  130,  154],
        ...,
        [  42,   26,   48,  ...,   32,  392,   34],
        [  35, 3039,   24,  ..., 3154, 1339, 1570],
        [1088,   24,  315,  ...,  108, 1691,   27]])

In [9]:
# What is the vocabulary size ?
vocab_size = len(corpus.dictionary)
print(vocab_size)

9468


Because the sequence length is greater than one, each training step consumes seq_length consecutive words from every row. With 18970 columns and a sequence length of 100, we need 189 steps to pass through the whole data once.

In [10]:
num_batches = ids.shape[1] // seq_length
print(num_batches)

189
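
The count can be cross-checked in plain Python. The sketch below assumes the 18970 columns printed earlier and builds the same window start positions that the training loop further down iterates over. The names `starts`, `last_input` and `last_target` are illustrative.

```python
total_columns = 18970   # ids.shape[1] from the output above
seq_length = 100

# Start positions of each input window
starts = list(range(0, total_columns - seq_length, seq_length))

last = starts[-1]
# Targets are the inputs shifted right by one word
last_input = (last, last + seq_length)
last_target = (last + 1, last + 1 + seq_length)

print(len(starts))               # number of windows per epoch
print(last_input, last_target)   # column ranges of the final window
```

The final target window still ends inside the tensor, which is why the loop stops seq_length columns before the end.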


### Creating the model architecture with LSTM Class

In [11]:
class LSTM(nn.Module):
    
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTM, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size) # maps words to feature vectors
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True) # LSTM layer
        self.linear = nn.Linear(hidden_size, vocab_size) # Fully connected layer

    def forward(self, x, h):
        # Perform Word Embedding 
        x = self.embed(x)

        out, (h, c) = self.lstm(x, h) # (input , hidden state)
        
        # Reshape output to (batch_size*sequence_length, hidden_size)
        out = out.reshape(out.size(0)*out.size(1), out.size(2))
        
        # Decode hidden states of all time steps
        out = self.linear(out)
        return out, (h, c)

In [12]:
model = LSTM(vocab_size, embed_size, hidden_size, num_layers)
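
To see what the forward pass returns, here is a small shape check. It repeats the class so that it runs standalone and uses toy sizes (a vocabulary of 50, hidden size 16, and so on) that are assumptions chosen only to keep it fast.

```python
import torch
import torch.nn as nn

class LSTM(nn.Module):
    # Same layers as the class above, repeated so the snippet is self-contained
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTM, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h):
        x = self.embed(x)
        out, (h, c) = self.lstm(x, h)
        out = out.reshape(out.size(0) * out.size(1), out.size(2))
        out = self.linear(out)
        return out, (h, c)

# Toy sizes, not the notebook's settings
vocab, embed, hidden, layers = 50, 8, 16, 1
batch, seq = 4, 5

toy_model = LSTM(vocab, embed, hidden, layers)
x = torch.randint(0, vocab, (batch, seq))          # random word indices
states = (torch.zeros(layers, batch, hidden),      # initial hidden state
          torch.zeros(layers, batch, hidden))      # initial cell state

out, (h, c) = toy_model(x, states)
print(out.shape)          # (batch * seq, vocab): one score row per word position
print(h.shape, c.shape)   # (layers, batch, hidden) each
```

The flattened first dimension of the output is what lets it be compared against targets.reshape(-1) in the loss.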

In [13]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

PyTorch LSTM documentation - https://pytorch.org/docs/stable/nn.html#lstm

### Train the network

In [14]:
# Detach the hidden and cell states from their previous history
def detach(states):
    return [state.detach() for state in states]
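
What detaching does can be checked on two small tensors. This is a sketch with made-up tensors `w`, `h` and `c`. The point is only that the detached copies keep their values but no longer track gradients, so backpropagation stops at the start of each window.

```python
import torch

def detach(states):
    return [state.detach() for state in states]

# Tensors that are part of an autograd graph
w = torch.ones(2, 3, requires_grad=True)
h = w * 2
c = w + 1

h_d, c_d = detach((h, c))

print(h.requires_grad, h_d.requires_grad)  # True False
print(torch.equal(h, h_d))                 # True: values are unchanged
```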

In [15]:
for epoch in range(num_epochs):
    # initial hidden and cell states
    states = (torch.zeros(num_layers, batch_size, hidden_size),
              torch.zeros(num_layers, batch_size, hidden_size))
    
    for i in range(0, ids.size(1) - seq_length, seq_length):
        
        # Step through the columns in strides of seq_length, stopping before ids.size(1) - seq_length
        
        # prepare mini-batch inputs and targets
        inputs = ids[:, i:i+seq_length] # fetch words for one seq length  
        targets = ids[:, (i+1):(i+1)+seq_length] # shifted by one word from inputs
        
        states = detach(states)

        outputs,states = model(inputs, states)
        loss = criterion(outputs, targets.reshape(-1))

        model.zero_grad()
        loss.backward()
         
        # Rescale the gradients so that their combined norm is at most 0.5.
        # This guards against the exploding gradient problem.
        clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
              
        step = (i+1) // seq_length
        if step % 100 == 0:
            print ('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, loss.item()))

Epoch [1/10], Loss: 9.1580
Epoch [1/10], Loss: 6.0836
Epoch [2/10], Loss: 5.4496
Epoch [2/10], Loss: 5.3647
Epoch [3/10], Loss: 4.7479
Epoch [3/10], Loss: 4.8044
Epoch [4/10], Loss: 4.1976
Epoch [4/10], Loss: 4.2439
Epoch [5/10], Loss: 3.6642
Epoch [5/10], Loss: 3.7502
Epoch [6/10], Loss: 3.1097
Epoch [6/10], Loss: 3.2812
Epoch [7/10], Loss: 2.6329
Epoch [7/10], Loss: 2.8812
Epoch [8/10], Loss: 2.2548
Epoch [8/10], Loss: 2.4805
Epoch [9/10], Loss: 1.9408
Epoch [9/10], Loss: 2.1896
Epoch [10/10], Loss: 1.7228
Epoch [10/10], Loss: 1.9036
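
The clipping step in the loop can be examined in isolation. The sketch below uses a small made-up linear layer and a deliberately inflated loss (both assumptions for illustration) to show that clip_grad_norm_ rescales all gradients together so that their combined L2 norm does not exceed max_norm.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)
layer = nn.Linear(4, 2)

# Produce deliberately large gradients
loss = (layer(torch.randn(8, 4)) * 100).pow(2).sum()
loss.backward()

max_norm = 0.5
# clip_grad_norm_ returns the total norm measured before clipping
norm_before = clip_grad_norm_(layer.parameters(), max_norm)

# Recompute the total L2 norm over all gradients after clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(float(norm_before), float(norm_after))
```

The direction of the gradient is preserved and only its length shrinks, which differs from clamping each value to a fixed range.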


#### Generate new text using the trained model

In [18]:
# Test the model
with torch.no_grad():
    with open('results.txt', 'w') as f:
        # initial hidden and cell states
        state = (torch.zeros(num_layers, 1, hidden_size),
                 torch.zeros(num_layers, 1, hidden_size))
        
        # Select one word id randomly and convert it to shape (1,1)
        input = torch.randint(0,vocab_size, (1,)).long().unsqueeze(1) 
                            # (min , max , shape) , convert to long tensor and make it a shape of 1,1 

        for i in range(500):
            output, state = model(input, state)  # carry the LSTM state forward to the next step

            
            # Sample a word id from the exponential of the output 
            prob = output.exp()
            word_id = torch.multinomial(prob, num_samples=1).item()
            #print(word_id)

            
            # Replace the input with sampled word id for the next time step
            input.fill_(word_id)

            # Write the results to file
            word = corpus.dictionary.idx2word[word_id]
            word = '\n' if word == '<eos>' else word + ' '
            f.write(word)

            
            if (i+1) % 100 == 0:
                print('Sampled [{}/{}] words and save to {}'.format(i+1, 500, 'results.txt'))

Sampled [100/500] words and save to results.txt
Sampled [200/500] words and save to results.txt
Sampled [300/500] words and save to results.txt
Sampled [400/500] words and save to results.txt
Sampled [500/500] words and save to results.txt


##### Why use multinomial distribution?

Always picking the most probable word would restrict the model to the most commonly used words, so we sample from the predicted distribution instead to add some randomness.

I tried a top-k approach, however it kept generating the same words, so I used the multinomial distribution approach instead.
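
The difference between the strategies can be sketched on a toy score vector. The four scores below are made up, and the top-k part is only one possible variant, not necessarily the one tried above.

```python
import torch

torch.manual_seed(0)

# Toy scores over a 4-word vocabulary (made-up numbers)
logits = torch.tensor([[2.0, 1.5, 0.5, -1.0]])
# Unnormalised weights; torch.multinomial only needs non-negative weights
prob = logits.exp()

# Greedy choice: always the same word for the same scores
greedy = [prob.argmax(dim=1).item() for _ in range(10)]

# Multinomial sampling: words are drawn in proportion to their weight
sampled = [torch.multinomial(prob, num_samples=1).item() for _ in range(200)]

# Top-k variant: sample only among the k highest-weighted words
k = 2
top_w, top_ids = prob.topk(k, dim=1)
choice = torch.multinomial(top_w, num_samples=1)   # index into the top-k list
top_k_id = top_ids.gather(1, choice).item()

print(set(greedy))            # a single id
print(sorted(set(sampled)))   # several different ids
print(top_k_id)               # one of the two highest-weighted ids
```

With a small k, sampling stays close to the greedy choice, which may explain the repetitive output described above.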

- Multinomial pytorch documentation - https://pytorch.org/docs/stable/torch.html?highlight=multinomial#torch.multinomial
- What is Multinomial distribution ? https://www.youtube.com/watch?v=syVW7DgvUaY

Code Inspiration : https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/02-intermediate/language_model