In [1]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/shakespeare-text/text.txt


#### Generation of Shakespeare-like words through a simple neural network model

This project is heavily influenced by Andrej Karpathy's awesome [video](https://youtu.be/TCH_1BHY58I). It is deliberately simple and is mainly a re-implementation of this [paper](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) (Bengio et al., 2003, "A Neural Probabilistic Language Model").

So, without further ado, let's get started with the necessary imports.

In [2]:
## Importing the necessary packages

import torch
import torch.nn as nn
import numpy as np
from tqdm import tqdm

Since this is meant to be simple and basic, we will do almost everything with the torch library, using numpy only for shuffling and sampling indices and tqdm for the progress bar.

##### Step 1: Dataset creation

As a first step, we are going to build our dataset. It is again going to be very simple:
- Given the text file, we read it, lowercase it, and build a vocabulary of unique words. I didn't remove punctuation or other symbols, so they also end up in the vocabulary (the text is split on whitespace only).
- Each word in the vocabulary is then assigned an index through a dictionary mapping.
- A reverse mapping (index to word) is also generated for the prediction stage.
- For the dataset itself, we give the model 3 input words (X) and expect it to predict the next word (y).
- We also add a special (< TOK >) token that marks both the start and the end of each line.
- Finally, we generate a train, validation and test split (80/10/10), as in a standard deep learning project.
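
To make the windowing concrete, here is a minimal pure-Python sketch of the steps above on a single toy line. The helper name `context_windows` and the example line are assumptions made for illustration; the real dataset is built by `make_datasets` in the next cell.

```python
def context_windows(line, num_context=3, tok="<TOK>"):
    """Return (context, target) word pairs for one line, padded with the special token."""
    words = [tok] * num_context + line.split() + [tok]
    pairs = []
    for i in range(len(words) - num_context):
        context = words[i:i + num_context]   # num_context consecutive words
        target = words[i + num_context]      # the word that follows them
        pairs.append((context, target))
    return pairs


line = "to be or not"
pairs = context_windows(line)

# Vocabulary with the special token at index 0, then the word -> index mapping.
vocabulary = ["<TOK>"] + sorted(set(line.split()))
words_2_idx = {word: idx for idx, word in enumerate(vocabulary)}

# Encode every pair as integer indices, ready to be turned into tensors.
X = [[words_2_idx[w] for w in context] for context, _ in pairs]
y = [words_2_idx[target] for _, target in pairs]

for context, target in pairs:
    print(context, "->", target)
```

A line with 4 words produces 5 pairs: one for each word plus one whose target is the end token.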

In [3]:
def make_datasets(text_file, num_context = 3):
    lines = []
    # Reading the lines
    with open(text_file, "r") as f:
        for line in f.readlines():
            line = " ".join(line.lower().strip('\n').split())
            if len(line) < 1 or (len(line) == 1 and not line.isalnum()):
                continue
            lines.append(line)
            
    # Making the words vocabulary
    words_vocabulary = set()
    for each_line in lines:
        words = each_line.split(' ')
        for each_word in words:
            if len(each_word) < 1:
                continue
            words_vocabulary.add(each_word)

    words_vocabulary = sorted(list(words_vocabulary))

    # Inserting the special token
    words_vocabulary.insert(0 , "<TOK>")
    
    print(f'The vocabulary was created with {len(words_vocabulary)} words!!')
    
    # Making the word to index mapping
    words_2_idx = {k:v for v,k in enumerate(words_vocabulary)}
    
    # Making index to word mapping
    idx_2_words = {k:v for k,v in enumerate(words_vocabulary)}
    
    #Making entire dataset
    X , y = [] , []
    
    for line in lines:
        # Pad with num_context start tokens and one end token
        line_words = ["<TOK>"] * num_context + line.split(" ") + ["<TOK>"]
        for i in range(len(line_words) - num_context):
            # Context: num_context consecutive words; target: the word that follows
            X.append([words_2_idx[w] for w in line_words[i:i + num_context]])
            y.append(words_2_idx[line_words[i + num_context]])
    
    X = torch.tensor(X)
    y = torch.tensor(y)
    
    print(X.shape)
    
    # Making datasets
    indices = np.arange(0,len(X))
    np.random.shuffle(indices)
    
    n1 = int(0.8 * len(X))
    n2 = int(0.9 * len(X))
    
    train_indices = indices[:n1]
    val_indices = indices[n1:n2]
    test_indices = indices[n2:]
    
    train_dataset = X[train_indices], y[train_indices]
    val_dataset = X[val_indices], y[val_indices]
    test_dataset = X[test_indices], y[test_indices]
    
    print(f'Train dataset has {len(train_dataset[0])} datapoints.')
    print(f'Val dataset has {len(val_dataset[0])} datapoints.')
    print(f'Test dataset has {len(test_dataset[0])} datapoints.')
    
    return train_dataset, val_dataset, test_dataset , words_vocabulary, words_2_idx , idx_2_words

In [4]:
train_dataset, val_dataset, test_dataset, vocabulary, words_2_idx, idx_2_words = make_datasets(
    '/kaggle/input/shakespeare-text/text.txt'
)

The vocabulary was created with 23642 words!!
torch.Size([235428, 3])
Train dataset has 188342 datapoints.
Val dataset has 23543 datapoints.
Test dataset has 23543 datapoints.


In [5]:
# Creation of dataloader instance

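# Each dataset is an (X, y) tuple of tensors, so wrap it in a TensorDataset
# to make the DataLoader yield (x, y) example pairs.
train_dl = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(*train_dataset), batch_size=128, shuffle=True)
val_dl = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(*val_dataset), batch_size=128, shuffle=False)
test_dl = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(*test_dataset), batch_size=128, shuffle=False)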

Trust me, it took me several attempts to get this right, so don't be harsh on yourself if you can't do it in one go. Just be patient and build each part one at a time. That is how I did it too, and I removed my testing code once everything worked.
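
For reference, here is a small sketch of how a pair of (X, y) tensors can be batched. A DataLoader indexes its dataset one example at a time, so the tensors are wrapped in a `TensorDataset` first. The toy tensors and batch size below are assumptions chosen only to keep the output readable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real tensors: 10 contexts of 3 word indices each.
X = torch.arange(30).reshape(10, 3)
y = torch.arange(10)

dataset = TensorDataset(X, y)   # dataset[i] returns the pair (X[i], y[i])
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Collect the shapes of every batch the loader yields.
batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
print(batch_shapes)
```

With 10 examples and a batch size of 4, the loader yields two full batches and a final batch of 2.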

##### Step 2: Simple model creation

In this step we are going to create our model. As usual, it's going to be very simple.

So what are the steps?
- First, we create an embedding layer, which is essentially a learnable lookup table of shape (num_words_in_vocab, feature_dim) that represents each word in a smaller feature space. Why is this needed? A neural network processes numbers, not words, and so far each word has only been mapped to an integer index such as 1, 2, 3, ... We can't feed these indices in directly, because they would be read as ordinal values, where a larger index means a larger quantity. Words have no such order, so each word must be represented as an independent nominal vector. The easiest way to do this is to convert each index into a one-hot vector of size (num_words_in_vocab), all zeros except for a one at the word's own index. For the whole vocabulary, that gives a (num_words_in_vocab, num_words_in_vocab) matrix, which is far too large and sparse. The embedding layer instead maps every word to a dense, lower-dimensional feature vector, and these vectors are learned during training.

- Next, we flatten the embeddings of the context words into a single vector, map it to a hidden layer with n neurons, and pass the result through a tanh function.

- Finally, the output layer maps the hidden activations to a vector the size of the vocabulary, with one score (logit) per candidate next word.

- We then apply our loss function to the output of this final layer and run the optimization.
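
To see why the embedding layer can be read as a compressed one-hot encoding, the sketch below checks that an embedding lookup gives the same result as multiplying one-hot vectors by the embedding weight matrix. The vocabulary size, feature dimension and indices are toy values assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 6, 4     # toy sizes
emb = nn.Embedding(vocab_size, embedding_dim)

idx = torch.tensor([[1, 4, 2]])      # (B=1, num_context=3) word indices
looked_up = emb(idx)                 # (1, 3, 4): one row of the table per word

# Same result the long way: one-hot vectors times the weight matrix.
one_hot = F.one_hot(idx, num_classes=vocab_size).float()   # (1, 3, 6)
via_matmul = one_hot @ emb.weight                          # (1, 3, 4)

# Flattening concatenates the context embeddings for the hidden layer.
flat = looked_up.flatten(start_dim=1)                      # (1, 12)

print(looked_up.shape, flat.shape)
print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids ever materializing the one-hot vectors, which matters when the vocabulary has tens of thousands of words.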

In [6]:
## The model

class NP_Model(nn.Module):
    """Pytorch model"""
    
    def __init__(self , vocab_size, embedding_dim, hidden_dim, num_context=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Embedding(vocab_size , embedding_dim), # gives (B, num_context, embedding_dim)
            nn.Flatten(), # gives (B, num_context * embedding_dim)
            nn.Linear(num_context * embedding_dim , hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim , vocab_size)
        )
        
    def forward(self , x):
        return self.model(x)

Let's test our model too.

In [7]:
model = NP_Model(vocab_size = len(vocabulary), embedding_dim = 30 , hidden_dim = 100)

model(train_dataset[0][3].unsqueeze(0)).shape

torch.Size([1, 23642])

Perfect.
Our model works as expected: a single context of 3 words yields one score per vocabulary word.

Now let's devise our loss function and our optimizer.

In [8]:
## Re-initializing our model

model = NP_Model(vocab_size = len(vocabulary), embedding_dim = 30 , hidden_dim = 100)

## Loss function

loss_func = nn.CrossEntropyLoss()

optim = torch.optim.Adam(model.parameters() , lr=0.1)
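
A useful reference point for reading the loss values: a model that assigns the same score to every word has a cross-entropy of ln(vocab_size). The sketch below checks this with all-zero logits, using the vocabulary size of 23642 reported earlier; the batch of 4 targets is an arbitrary assumption.

```python
import math

import torch
import torch.nn as nn

vocab_size = 23642                       # vocabulary size printed by make_datasets
logits = torch.zeros(4, vocab_size)      # equal logits = uniform prediction
targets = torch.tensor([0, 1, 2, 3])     # arbitrary target words

baseline = nn.CrossEntropyLoss()(logits, targets).item()
expected = math.log(vocab_size)          # about 10.07

print(baseline, expected)
```

A loss well above this value means the model is doing worse than a uniform guess, which is usually a sign that training has diverged.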

Now we are ready. Let's get to training.

In [9]:
## Training on minibatches

losses = []
X_tr,y_tr = train_dataset

num_steps = 10000
loop = tqdm(range(num_steps))
for i in loop:
    # Sample a random minibatch of 32 training examples
    idx = np.random.randint(0, len(X_tr), 32)
    X , y = X_tr[idx] , y_tr[idx]
    pred = model(X)
    loss = loss_func(pred,y)
    # Record the loss once per step
    losses.append(loss.item())
    # Each iteration is one optimization step, not an epoch
    loop.set_description(f'Step : {i+1}/{num_steps} :')
    loop.set_postfix(Loss=loss.item())
    optim.zero_grad()
    loss.backward()
    optim.step()

Step : 10000/10000 :: 100%|██████████| 10000/10000 [03:37<00:00, 46.00it/s, Loss=58.3]


Model evaluation.

In [10]:
## Model evaluation

X_val, y_val = val_dataset
X_val = X_val[:1000]
y_val = y_val[:1000]
model.eval()
# Gradients are not needed for evaluation, so skip building the graph
with torch.no_grad():
    pred = model(X_val)
    loss = loss_func(pred,y_val)
print(f'Evaluation Loss : {loss.item()}')

Evaluation Loss : 76.62408447265625
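
The cell above scores only the first 1000 validation examples in a single forward pass. As a rough sketch, the whole split could instead be evaluated batch by batch, which keeps memory use bounded. The helper name `evaluate`, the batch size, and the toy model and data are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, X, y, batch_size=256):
    """Mean cross-entropy over all of (X, y), computed batch by batch."""
    loss_func = nn.CrossEntropyLoss(reduction="sum")   # sum now, divide once at the end
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size)
    total = 0.0
    model.eval()
    with torch.no_grad():                              # no graph is stored during evaluation
        for xb, yb in loader:
            total += loss_func(model(xb), yb).item()
    return total / len(X)


# Toy stand-in model and data (hypothetical sizes, untrained weights).
torch.manual_seed(0)
vocab_size = 50
toy_model = nn.Sequential(
    nn.Embedding(vocab_size, 8),
    nn.Flatten(),
    nn.Linear(3 * 8, 16),
    nn.Tanh(),
    nn.Linear(16, vocab_size),
)
X_toy = torch.randint(0, vocab_size, (1000, 3))
y_toy = torch.randint(0, vocab_size, (1000,))

val_loss = evaluate(toy_model, X_toy, y_toy)
print(f"Evaluation Loss : {val_loss}")
```

Summing the per-example losses and dividing by the dataset size gives the same mean as one big forward pass, even when the last batch is smaller than the others.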


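I implemented only a very basic model, mainly because of memory consumption.

Both the final training loss (58.3) and the evaluation loss (76.6) are well above ln(23642) ≈ 10.07, the loss of a uniform guess over the vocabulary, so this run has not yet learned a useful model. The Adam learning rate of 0.1 is a likely culprit, and a smaller value would be worth trying.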

This was done just as an introductory exercise on the way to learning about transformers.
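
The reverse mapping from Step 1 was built for the prediction stage, which this notebook does not reach. As a rough sketch of how generation could work, the hypothetical helper `generate` below starts from a context of start tokens, samples the next word from the softmax of the model output, and slides the context forward. The toy vocabulary and the untrained stand-in model are assumptions, so the sampled words are random.

```python
import torch
import torch.nn as nn


def generate(model, words_2_idx, idx_2_words, num_context=3, max_words=20, tok="<TOK>"):
    """Sample words one at a time until the special token is drawn or max_words is reached."""
    context = [words_2_idx[tok]] * num_context
    out = []
    model.eval()
    with torch.no_grad():
        for _ in range(max_words):
            logits = model(torch.tensor([context]))          # (1, vocab_size)
            probs = torch.softmax(logits, dim=1)
            next_idx = torch.multinomial(probs, num_samples=1).item()
            if idx_2_words[next_idx] == tok:                 # end-of-line token
                break
            out.append(idx_2_words[next_idx])
            context = context[1:] + [next_idx]               # slide the window
    return " ".join(out)


# Toy vocabulary and untrained stand-in model (hypothetical sizes).
vocab = ["<TOK>", "be", "not", "or", "to"]
words_2_idx = {w: i for i, w in enumerate(vocab)}
idx_2_words = {i: w for i, w in enumerate(vocab)}

torch.manual_seed(0)
toy_model = nn.Sequential(
    nn.Embedding(len(vocab), 4),
    nn.Flatten(),
    nn.Linear(3 * 4, 8),
    nn.Tanh(),
    nn.Linear(8, len(vocab)),
)

sample = generate(toy_model, words_2_idx, idx_2_words)
print(sample)
```

Sampling from the distribution, rather than always taking the most likely word, keeps the generated text from repeating the same phrase every time.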