In [None]:
# unidecode removes accents: a fast but somewhat crude way
# to standardize the text (in reviews, people often omit accents)
!pip install unidecode


# TP6: Natural Language Generation

In this practical session, we will test a very simple text generation model based on a neural language model, using an RNN.

Here we use the data from the French reviews dataset, which is not ideal because it is too small.

If you have some time, you can also try:
- implementing functions to read the data in sent_recipes.txt, which contains recipes split into sentences. (These data could be used with a seq2seq model where the goal is to generate the next sentence; each line contains the input and target sentences.)
- testing an existing tutorial on movie plots that I followed: you can run its code in the notebook here (modify the number of iterations to get better results): https://colab.research.google.com/drive/1ARI_F0RKV-L4GvmyTPPStZWhmqf737D-?usp=sharing using the data here: https://drive.google.com/file/d/1PakdWMKYNyC5-2G_CSlLtkBsHezFpMHJ/view?usp=sharing 

In [None]:
import re
import pickle
import random

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

## 1.1 Load the data

The code below reads the data as usual, except that we ignore the sentiment label. We also don't need a separate dev set here, so the dev and test data are added to our training set.

We also lowercase the text and apply some preprocessing to remove punctuation, in order to focus on words. 
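
To make the three cleaning steps concrete, here is a minimal standalone sketch on a single sentence. It is an illustration only: `strip_accents` and `clean_review` are hypothetical helper names, and the standard library's `unicodedata` is used as a stand-in for `unidecode`, so results may differ from `unidecode` on characters that are not simple accented letters.

```python
import re
import unicodedata

def strip_accents(text):
    # Stand-in for unidecode.unidecode (assumption): decompose characters
    # with NFKD and drop the combining marks, so "é" becomes "e".
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_review(text):
    text = text.lower()          # 1. lowercase
    text = strip_accents(text)   # 2. remove accents
    # 3. keep only lowercase letters, apostrophes and spaces
    return re.sub(r"[^a-z' ]", "", text)

example = "Une grosse daube. La première saison était pas mal !"
cleaned = clean_review(example)
print(cleaned.split())
```

Note that the regular expression deletes characters without inserting a space, so `"bon,mais"` becomes `"bonmais"`; replacing punctuation with a space first would be one way to avoid merged words.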

In [None]:
import unidecode


# read movie data 
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_path = "allocine_train.tsv"
dev_path = "allocine_dev.tsv"
test_path = "allocine_test.tsv"

# Load train, dev and test set
train_df = pd.read_csv(train_path, header=0, delimiter="\t", quoting=3)
dev_df = pd.read_csv(dev_path, header=0, delimiter="\t", quoting=3)
test_df = pd.read_csv(test_path, header=0, delimiter="\t", quoting=3)

# get_tokenizer(None) splits the string on whitespace; we don't need the sentiment label here
tokenizer = get_tokenizer( None ) 
train_iter = []
for i in train_df.index:
    #train_iter.append( tuple( [train_df["sentiment"][i], train_df["review"][i]] ) )
    train_iter.append( train_df["review"][i] )
for i in dev_df.index:
    train_iter.append( dev_df["review"][i] )
for i in test_df.index:
    train_iter.append( test_df["review"][i] )

print( "Original first review:\t", train_iter[0])

# Optional: lower casing 
train_iter = [review.lower() for review in train_iter]
print( "Lower case:\t", train_iter[0])

# Optional: remove accents
train_iter = [unidecode.unidecode(accented_string) for accented_string in train_iter]
print( "Accents removed:\t", train_iter[0])

# Optional: clean text, remove punctuation and digits to focus on letters
train_iter = [re.sub("[^a-z' ]", "", i) for i in train_iter]
print( "Remove punctuation, numbers..:\t", train_iter[0])



def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>", "<PAD>"])
vocab.set_default_index(vocab["<unk>"])

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) #simple mapping to self
print( "Test pipeline:", text_pipeline( train_iter[0] ) )

vocab_size = len(vocab)
print( "\nVocab size:", vocab_size)

Original first review:	 Une grosse daube. La premiere saison etait pas mal, bon y'avait pas mal d'incohérences, mais bon les téléspectateurs sont pas trop regardant donc ça passe ... mais alors la 2e saison !! L'intérêt de l'intrigue de départ s'est envolée forcément, mais on se dit qu'ils vont peut etre trouver une idée qui donne un quelconque intéret à l'histoire. Meme si on emploie des procédés extremes, comme tuer des personnages centraux de la saison 1 pour ne pas s'en encombrer par la suite ... Comme rien ne choque personne on est plus à ça pres ! Tout le long de la saison 2 on se demande où ça va aller, on se dit "mais qu'est ce que c'est que ce truc ??", des épisodes font carrément marrer, tellement les scénaristes ont cherché des situations tarabiscotées pour aller là où ils avaient envie d'aller ... Le pompon pour le dernier épisode parce que c'est ENORME !!! A la fin de l'épisode, une phrase m'est sortie spontanément, c'est " Quel gros caca vraiment". Donc voilà la saison 3 

### Prepare sequences

Here, we give our model fixed-length sequences (n-grams); the sequence length is a hyperparameter that can be changed. 

Another option would be to work directly on sentences (with padding to deal with sequences of different lengths).
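
As a quick illustration of the fixed-length option, the sketch below slides a window over a token list. `sliding_windows` is a hypothetical helper written for this example; it assumes each window holds `seq_len + 1` tokens, so that an input and a shifted target of `seq_len` tokens each can be derived from it later.

```python
def sliding_windows(tokens, seq_len=5):
    # Each window holds seq_len + 1 tokens: seq_len inputs plus one extra
    # token, so a shifted target of the same length can be built from it.
    width = seq_len + 1
    if len(tokens) < width:
        # Text too short for a full window: keep it as one short sequence.
        return [tokens]
    return [tokens[i:i + width] for i in range(len(tokens) - width + 1)]

tokens = "une grosse daube la premiere saison etait pas".split()
windows = sliding_windows(tokens, seq_len=5)
for w in windows:
    print(" ".join(w))
```

With 8 tokens and `seq_len=5`, this yields 3 overlapping windows of 6 tokens, each starting one token after the previous one.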

In [None]:
# create overlapping sequences of seq_len + 1 tokens
# (seq_len input tokens plus one extra token for the shifted target)
def create_seq(text, seq_len = 5):
    sequences = []
    text = text.strip()
    # tokenize once instead of re-splitting the text at every step
    tokens = text.split()
    # if the number of tokens in 'text' is greater than seq_len
    if len(tokens) > seq_len:
      for i in range(seq_len, len(tokens)):
        # select a window of seq_len + 1 consecutive tokens
        seq = tokens[i-seq_len:i+1]
        # add to the list
        sequences.append(" ".join(seq))
      return sequences
    # if the number of tokens in 'text' is less than or equal to seq_len
    else:
      return [text]

In [None]:
seqs = [create_seq(i, seq_len = 5) for i in train_iter]
# merge list-of-lists into a single list
seqs = sum(seqs, [])
# count of sequences
len(seqs)

105806

### Input and target data

Now we create input and target sequences from our training data: the target is simply the input sequence shifted by one token. This way, the model reads the input one token at a time, starting with the first, and at each position tries to predict the next token, up to the last token of the target sequence. For example (after preprocessing): 
* input: une grosse daube la premiere 
* target: grosse daube la premiere saison

A cleaner solution is to segment the text into sentences and add special tokens signaling the start and end of each sequence. 
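
The shift between input and target can be checked on a toy window, and the same helper applies to the sentence-level variant. This is only a sketch: `shift_pair` is a hypothetical name, and the `<sos>` / `<eos>` marker strings are assumptions chosen for illustration, not tokens defined in this notebook.

```python
def shift_pair(tokens):
    # Input drops the last token, target drops the first one,
    # so target[i] is the token that follows input[i].
    return tokens[:-1], tokens[1:]

window = "une grosse daube la premiere saison".split()
x_tokens, y_tokens = shift_pair(window)
for inp, tgt in zip(x_tokens, y_tokens):
    print(f"{inp:>10} -> {tgt}")

# Sentence-level variant with hypothetical start/end markers.
sentence = ["<sos>"] + "la premiere saison etait pas mal".split() + ["<eos>"]
sx, sy = shift_pair(sentence)
print(sx)
print(sy)
```

With the markers, the model would learn to produce the first word from `<sos>` and to emit `<eos>` when a sentence is complete, which gives generation a natural stopping point.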

In [None]:
# create inputs and targets (x and y)
x = []
y = []

for s in seqs:
  if len(s.split()[:-1]) != 0:
    x.append(" ".join(s.split()[:-1]).strip())
    y.append(" ".join(s.split()[1:]).strip())
print( x[0])
print(y[0])

une grosse daube la premiere
grosse daube la premiere saison


### Map to integer

Now we map our token sequences to integer lists. Here we also add padding when necessary.
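
The mapping and padding step can be sketched without torchtext using a plain dictionary. Everything here is a toy assumption: `build_vocab` and `encode` are hypothetical helpers, indices follow order of first appearance rather than frequency, and `encode` also truncates sequences longer than `max_len`.

```python
def build_vocab(token_lists, specials=("<unk>", "<PAD>")):
    # Special tokens get the first indices, then tokens in order of appearance.
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, max_len=5):
    # Unknown tokens map to <unk>; longer inputs are truncated (assumption).
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens[:max_len]]
    # Pad on the right with the <PAD> index up to max_len.
    ids += [vocab["<PAD>"]] * (max_len - len(ids))
    return ids

vocab_toy = build_vocab([["une", "grosse", "daube"]])
full = encode(["une", "grosse", "daube"], vocab_toy)
short = encode(["une", "petite"], vocab_toy)
print(full)   # [2, 3, 4, 1, 1]
print(short)  # [2, 0, 1, 1, 1]
```

Padding only works as intended if the padding string used at lookup time is exactly the one registered in the vocabulary; otherwise the lookup silently falls back to the unknown-token index.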

In [None]:
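def get_integer_seq(seq, max_len=5):
  int_seq = text_pipeline(seq)
  # pad short sequences up to max_len
  # ('<' rather than '!=' avoids an infinite loop if a sequence is longer than max_len)
  while len(int_seq) < max_len:
      int_seq.append(vocab.lookup_indices(["<PAD>"])[0])
  return int_seq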

# convert text sequences to integer sequences
x_int = [get_integer_seq(i) for i in x]
y_int = [get_integer_seq(i) for i in y]

# convert lists to numpy arrays
x_int = np.array(x_int)
y_int = np.array(y_int)

print(x_int)

[[  11  434  760    3   89]
 [ 434  760    3   89   28]
 [ 760    3   89   28  149]
 ...
 [  67   84   59 2392    5]
 [  84   59 2392    5    3]
 [  59 2392    5    3 1298]]


### Batches

Batches are simply lists of sequences.
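
As an illustration with plain Python lists (not the NumPy arrays used in this notebook), the sketch below yields fixed-size batches. `iter_batches` is a hypothetical helper; it assumes that a trailing partial batch should be skipped, which keeps every batch the same size as the hidden state allocated for training.

```python
def iter_batches(rows_x, rows_y, batch_size):
    # Yield only full batches; leftover rows at the end are skipped so that
    # every batch has exactly batch_size rows.
    n_full = len(rows_x) // batch_size
    for b in range(n_full):
        start = b * batch_size
        end = start + batch_size
        yield rows_x[start:end], rows_y[start:end]

rows = list(range(10))
batches = list(iter_batches(rows, rows, batch_size=4))
for bx, by in batches:
    print(bx)
```

Here 10 rows with a batch size of 4 give two batches, and the last two rows are left out.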

In [None]:
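def get_batches(arr_x, arr_y, batch_size):
    # iterate through the arrays, yielding full batches only
    # (a trailing partial batch is dropped; the '+ 1' keeps the last full batch)
    prv = 0
    for n in range(batch_size, arr_x.shape[0] + 1, batch_size):
      x = arr_x[prv:n,:]
      y = arr_y[prv:n,:]
      prv = n
      yield x, y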

## 1.2 Model definition

The model is very similar to what we've seen until now.

▶▶ **Write the '__init__(..)' part, using an embedding layer and an LSTM**

In [None]:
class WordLSTM(nn.Module):
    
    def __init__(self, n_hidden=256, n_layers=4, drop_prob=0.3, lr=0.001):
        super().__init__()

        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        self.emb_layer = nn.Embedding(vocab_size, 200)

        ## define the LSTM
        self.lstm = nn.LSTM(200, n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        ## define a dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        ## define the fully-connected layer
        self.fc = nn.Linear(n_hidden, vocab_size)      
    
    def forward(self, x, hidden):
        ''' Forward pass through the network. 
            These inputs are x, and the hidden/cell state `hidden`. '''
        ## pass input through embedding layer
        embedded = self.emb_layer(x)     
        ## Get the outputs and the new hidden state from the lstm
        lstm_output, hidden = self.lstm(embedded, hidden)
        ## pass through a dropout layer
        out = self.dropout(lstm_output)
        #out = lstm_output  
        out = out.reshape(-1, self.n_hidden) 
        ## put "out" through the fully-connected layer
        out = self.fc(out)
        # return the final output and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        # if GPU is available
        if (torch.cuda.is_available()):
          hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                    weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        # if GPU is not available
        else:
          hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                    weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        return hidden

The training function is also very similar to what we saw before. 

In [None]:
def train(net, epochs=10, batch_size=32, lr=0.001, clip=1, print_every=32):
    # optimizer
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    # loss
    criterion = nn.CrossEntropyLoss()
    # push model to GPU
    net.cuda()
    counter = 0
    net.train()
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        for x, y in get_batches(x_int, y_int, batch_size):
            counter+= 1
            # convert numpy arrays to PyTorch arrays
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            # push tensors to GPU
            inputs, targets = inputs.cuda(), targets.cuda()
            # detach hidden states
            # https://discuss.pytorch.org/t/solved-why-we-need-to-detach-variable-which-contains-hidden-representation/1426/4
            h = tuple([each.data for each in h])
            # zero accumulated gradients
            net.zero_grad()
            # get the output from the model
            output, h = net(inputs, h)
            # calculate the loss and perform backprop
            loss = criterion(output, targets.view(-1))
            # back-propagate error
            loss.backward()
            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            # update weights
            opt.step()            
            if counter % print_every == 0:
              print("Epoch: {}/{}...".format(e+1, epochs),
                    "Step: {}...".format(counter))

### Predict

The predict function takes as input some tokens: the model needs a first input to predict the next tokens. The functions below can be used to generate some text based on an input sequence. 
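
The sampling strategy used for generation (pick uniformly among the few most probable next tokens) can be sketched in pure Python. This is an illustration under assumptions: `softmax` and `sample_top_k` are hypothetical helpers, and the scores and the five-word vocabulary are made up for the example.

```python
import math
import random

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(probs, k=3, rng=random):
    # Indices of the k most probable tokens, most probable first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Uniform choice among them: their relative probabilities are ignored.
    return rng.choice(top)

vocab_tokens = ["saison", "serie", "film", "acteur", "ile"]
probs = softmax([2.0, 1.5, 0.1, 1.0, -1.0])
rng = random.Random(0)
picks = {vocab_tokens[sample_top_k(probs, 3, rng)] for _ in range(200)}
print(sorted(picks))
```

Choosing among the top candidates rather than always taking the most probable one adds variety to the generated text; sampling in proportion to the probabilities would be an alternative design.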

In [None]:
# predict next token
def predict(net, tkn, h=None):
  # tensor inputs 
  x = np.array( [vocab.lookup_indices( [tkn] ) ] )
  inputs = torch.from_numpy(x)
  # push to GPU
  inputs = inputs.cuda()
  # detach hidden state from history 
  h = tuple([each.data for each in h])
  # get the output of the model
  out, h = net(inputs, h)
  # get the token probabilities
  p = F.softmax(out, dim=1).data
  p = p.cpu()
  p = p.numpy()
  p = p.reshape(p.shape[1],)
  # get indices of top 3 values
  top_n_idx = p.argsort()[-3:][::-1]
  # randomly select one of the three indices
  sampled_token_index = top_n_idx[random.sample([0,1,2],1)[0]]
  # return the predicted token (as a one-element list) and the hidden state
  return vocab.lookup_tokens( [sampled_token_index] ), h


# function to generate text
def sample(net, size, prime='un petit'):
    out_tokens = prime.split()
    # push to GPU
    net.cuda()
    net.eval()
    # batch size is 1
    h = net.init_hidden(1)
    toks = prime.split()
    # predict next token
    token, h = predict(net, toks[-1], h)
    out_tokens.append(token[0])
    for i in range(1, size-1):
      token, h = predict(net, out_tokens[-1], h)
      out_tokens.append(token[0])
    print(' '.join(out_tokens))


## 1.3 Run experiments

You can now start training a model. Once trained, you can use it to generate text with the predict and sample functions defined above.

▶▶ **Try to vary the hyper-parameters and see the influence on the results:**
* start with 2 epochs 
* increase the number of epochs
* increase the size of the hidden layer and the number of hidden layers
* try a GRU instead of the LSTM

In [None]:
# instantiate the model
net1 = WordLSTM( n_hidden=32, n_layers=1, drop_prob=0.3, lr=0.001 )
# push the model to GPU (avoid it if you are not using the GPU)
net1.cuda()
print(net1)

# train the model
train(net1, batch_size = 16, epochs=2, print_every=2000)

# Evaluation
sample(net1, 15)
sample(net1, 15, prime = "une des")
sample(net1, 15, prime = "une serie")
sample(net1, 15, prime = "ils")



WordLSTM(
  (emb_layer): Embedding(8788, 200)
  (lstm): LSTM(200, 32, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=32, out_features=8788, bias=True)
)
Epoch: 1/2... Step: 2000...
Epoch: 1/2... Step: 4000...
Epoch: 1/2... Step: 6000...
Epoch: 2/2... Step: 8000...
Epoch: 2/2... Step: 10000...
Epoch: 2/2... Step: 12000...
un petit monde pas a la saison est une bonne ile deserte a chaque personnage est
une des acteurs a la saison qui est tres tres bien la premiere et je n'ai
une serie qui est une etoile et le suspense a un peu de la serie qui
ils ne pas de la serie qui est tres bien on ne se dit pas


### Experiment



In [None]:
# instantiate the model
net2 = WordLSTM( n_hidden=256, n_layers=4, drop_prob=0.3, lr=0.001 )
# push the model to GPU (avoid it if you are not using the GPU)
net2.cuda()
print(net2)

# train the model
train(net2, batch_size = 16, epochs=2, print_every=2000)

# Evaluation
sample(net2, 15)
sample(net2, 15, prime = "un des")
sample(net2, 15, prime = "une serie")
sample(net2, 15, prime = "ils")

WordLSTM(
  (emb_layer): Embedding(8788, 200)
  (lstm): LSTM(200, 256, num_layers=4, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=8788, bias=True)
)
Epoch: 1/2... Step: 2000...
Epoch: 1/2... Step: 4000...
Epoch: 1/2... Step: 6000...
Epoch: 2/2... Step: 8000...
Epoch: 2/2... Step: 10000...
Epoch: 2/2... Step: 12000...
un petit fait de cette premiere serie et de montrer les acteurs sont et de plus
un des episodes et les personnages est de la saison est a la premiere serie et
une serie a la premiere bonne saison de regarder les episodes de cette serie est de
ils fait a la premiere saison de la serie est de cette ile de plus


In [None]:
# instantiate the model
net3 = WordLSTM( n_hidden=256, n_layers=4, drop_prob=0.3, lr=0.001 )
# push the model to GPU (avoid it if you are not using the GPU)
net3.cuda()
print(net3)

# train the model
train(net3, batch_size = 16, epochs=20, print_every=2000)

# Evaluation
sample(net3, 15)
sample(net3, 15, prime = "une des")
sample(net3, 15, prime = "une serie")
sample(net3, 15, prime = "ils")

WordLSTM(
  (emb_layer): Embedding(8788, 200)
  (lstm): LSTM(200, 256, num_layers=4, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=8788, bias=True)
)
Epoch: 1/20... Step: 2000...
Epoch: 1/20... Step: 4000...
Epoch: 1/20... Step: 6000...
Epoch: 2/20... Step: 8000...
Epoch: 2/20... Step: 10000...
Epoch: 2/20... Step: 12000...
Epoch: 3/20... Step: 14000...
Epoch: 3/20... Step: 16000...
Epoch: 3/20... Step: 18000...
Epoch: 4/20... Step: 20000...
Epoch: 4/20... Step: 22000...
Epoch: 4/20... Step: 24000...
Epoch: 4/20... Step: 26000...
Epoch: 5/20... Step: 28000...
Epoch: 5/20... Step: 30000...
Epoch: 5/20... Step: 32000...
Epoch: 6/20... Step: 34000...
Epoch: 6/20... Step: 36000...
Epoch: 6/20... Step: 38000...
Epoch: 7/20... Step: 40000...
Epoch: 7/20... Step: 42000...
Epoch: 7/20... Step: 44000...
Epoch: 7/20... Step: 46000...
Epoch: 8/20... Step: 48000...
Epoch: 8/20... Step: 50000...
Epoch: 8/20... Step: 52000...
E

### Additional exercises

Some possible improvements:

Data:
* Try with sentences (adding SOS and EOS symbols)
* Try with a restricted vocabulary (trimming infrequent words)
* Try with cleaner preprocessing
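
For the restricted vocabulary idea, one possible starting point is to count tokens and keep only those seen at least a given number of times, mapping the rest to the unknown token. The sketch below is standalone and hypothetical (`trim_vocab` and the toy reviews are not part of this notebook); torchtext's `build_vocab_from_iterator` also accepts a `min_freq` argument, which may be simpler in practice.

```python
from collections import Counter

def trim_vocab(token_lists, min_freq=2, specials=("<unk>", "<PAD>")):
    # Count every token over the whole corpus.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Keep tokens seen at least min_freq times, most frequent first.
    kept = [tok for tok, c in counts.most_common() if c >= min_freq]
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}

reviews = [
    ["une", "serie", "geniale"],
    ["une", "serie", "nulle"],
    ["une", "daube"],
]
small_vocab = trim_vocab(reviews, min_freq=2)
encoded = [small_vocab.get(tok, small_vocab["<unk>"]) for tok in reviews[2]]
print(small_vocab)  # {'<unk>': 0, '<PAD>': 1, 'une': 2, 'serie': 3}
print(encoded)      # [2, 0]
```

A smaller vocabulary shrinks both the embedding layer and the output layer, at the cost of generating the unknown token for rare words.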

Model:
* Try to use the sentiment as a condition


If you want to explore generation with a seq2seq model, see https://pytorch.org/tutorials/beginner/chatbot_tutorial.html and the accompanying notebook https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/cf54d584af1322e88020549223e907dc/chatbot_tutorial.ipynb 

### Sources
https://www.analyticsvidhya.com/blog/2020/08/build-a-natural-language-generation-nlg-system-using-pytorch/

https://www.kdnuggets.com/2020/07/pytorch-lstm-text-generation-tutorial.html

