# NLP for text generation (char based)

This notebook experiments with NLP using long short-term memory (LSTM) units.

Tasks performed with LSTMs:
- Generating text à la Shakespeare (with a character-based model)

In [1]:
# Load libraries
import numpy as np
import pandas as pd
pd.options.display.width=120
#pd.set_option('display.width',75)
#pd.options.display.max_columns=8
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import nlpia
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.casual import casual_tokenize
from collections import Counter
from collections import OrderedDict
import copy
from sklearn.feature_extraction.text import TfidfVectorizer

Keras is available in the two variants below; this notebook uses tf.keras.
- multibackend Keras 
- tf.keras. 

In [3]:
import tensorflow as tf
from tensorflow import keras
tf.__version__

'2.0.0'

In [4]:
keras.__version__

'2.2.4-tf'

In [18]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,LSTM
from tensorflow.keras.optimizers import RMSprop

## Generating Shakespearean text

The idea of the task:
- learn to predict the 41st character from the 40 characters that precede it.
- for this purpose, the source text is chopped into samples, each with a fixed length of 40 characters.
- the samples overlap: take 40 characters from the beginning, then move the start forward by 3 characters, take 40 from there, and so on.
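
The windowing scheme is easier to see on a tiny string. The following is a minimal sketch only: the toy text, the window of 5 (standing in for 40) and the variable names are assumptions chosen for readability, not part of the notebook.

```python
# Toy illustration of overlapping windows (all values here are made up for readability).
toy_text = "to be or not to be"
window = 5   # stands in for the 40-character window
stride = 3   # the start position moves forward by 3 characters each time

samples = []
targets = []
for start in range(0, len(toy_text) - window, stride):
    samples.append(toy_text[start:start + window])  # input window
    targets.append(toy_text[start + window])        # the character right after the window

for s, t in zip(samples, targets):
    print(repr(s), "->", repr(t))
```

Each printed pair is one training sample: the window on the left is the input and the single character on the right is the label to predict.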

### a) Load Project Gutenberg dataset

In [7]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/jupyterlab/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [8]:
from nltk.corpus import gutenberg  # corpus reader for the downloaded Gutenberg texts
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

### b) Load three Shakespeare texts and preprocess them

In [9]:
text=''
for txt in gutenberg.fileids():
    if 'shakespeare' in txt:
        text+=gutenberg.raw(txt).lower()  # load the raw text based on the selected fileid and concatenate it to text string
chars=sorted(list(set(text)))  # create a set of all characters, turn it into a list, and then sort it.
char_indices=dict((c,i) for i,c in enumerate(chars))  
indices_char=dict((i,c) for i,c in enumerate(chars))
'corpus length: {} total chars: {}'.format(len(text),len(chars))

'corpus length: 375542 total chars: 50'

In [13]:
# Note: only print() renders the line breaks and indentation; a bare cell output shows the escaped string instead
print(text[0:500])

[the tragedie of julius caesar by william shakespeare 1599]


actus primus. scoena prima.

enter flauius, murellus, and certaine commoners ouer the stage.

  flauius. hence: home you idle creatures, get you home:
is this a holiday? what, know you not
(being mechanicall) you ought not walke
vpon a labouring day, without the signe
of your profession? speake, what trade art thou?
  car. why sir, a carpenter

   mur. where is thy leather apron, and thy rule?
what dost thou with thy best apparrell on


### c) Create a training set

In [16]:
# Here you create overlapping samples of text of length 40 characters (starting from character positions 0,3,6,9,...)
maxlen=40
step=3
sentences=[]
next_chars=[]
for i in range(0,len(text)-maxlen,step):
    sentences.append(text[i:i+maxlen])  # 40 character length sequences, taken from every 3rd character
    next_chars.append(text[i+maxlen])  # 41st character for each sequence
print('number of sequences:',len(sentences))

number of sequences: 125168


### d) One-hot encode the training samples

In [17]:
X=np.zeros((len(sentences),maxlen,len(chars)),dtype=bool)  # np.bool was removed in newer NumPy versions; the builtin bool works everywhere
y=np.zeros((len(sentences),len(chars)),dtype=bool)
for i, sentence in enumerate(sentences):
    for t,char in enumerate(sentence):
        X[i,t,char_indices[char]]=1
    y[i,char_indices[next_chars[i]]]=1
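
To make the encoding concrete, here is a small round trip on a toy vocabulary. This is a sketch under assumptions: the word "hello" and the `toy_*` names are hypothetical and only mirror the `char_indices` / `indices_char` dictionaries used above.

```python
import numpy as np

# Toy vocabulary built the same way as `chars` (hypothetical example data).
toy_chars = sorted(set("hello"))  # ['e', 'h', 'l', 'o']
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

word = "hello"
# One row per character position, one column per vocabulary entry.
encoded = np.zeros((len(word), len(toy_chars)), dtype=bool)
for t, ch in enumerate(word):
    encoded[t, toy_char_indices[ch]] = True

# Decoding: the position of the single True value in each row gives the character index.
decoded = "".join(toy_indices_char[int(i)] for i in encoded.argmax(axis=1))
print(encoded.astype(int))
print(decoded)
```

Every row contains exactly one `True`, which is why `argmax` is enough to recover the original character.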

### e) Create character-based LSTM model for generating text

In [20]:
num_neurons=128
model=Sequential([
    LSTM(num_neurons,  # return_sequences=False is the default -> we want output only at last timestep
        input_shape=(maxlen,len(chars))),  # length of sequences * length of one-hot encoding
    # Flatten(), # Here we don't need flatten layer since the output comes only from last timestep, thus its shape is num_neurons
    Dense(len(chars),activation="softmax")  
])

In [21]:
optimizer=RMSprop(learning_rate=0.01)  # `lr` is a deprecated alias of `learning_rate`
model.compile(loss='categorical_crossentropy',optimizer=optimizer,metrics=['accuracy']) # If binary classes: loss='binary_crossentropy'

In [22]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               91648     
_________________________________________________________________
dense_1 (Dense)              (None, 50)                6450      
Total params: 98,098
Trainable params: 98,098
Non-trainable params: 0
_________________________________________________________________
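
The parameter counts in the summary can be checked by hand. A minimal sketch, assuming the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias; the helper function names are hypothetical and exist only for this illustration.

```python
def lstm_param_count(units, input_dim):
    # 4 gates, each with: input weights (input_dim * units),
    # recurrent weights (units * units) and a bias vector (units).
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    # Weight matrix plus one bias per output unit.
    return inputs * outputs + outputs

lstm_params = lstm_param_count(units=128, input_dim=50)
dense_params = dense_param_count(inputs=128, outputs=50)
print(lstm_params, dense_params, lstm_params + dense_params)  # 91648 6450 98098
```

The sequence length (40) does not appear in the formula because the same weights are reused at every timestep.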


### f) Train the model

In [23]:
epochs=6
batch_size=128
model_structure=model.to_json()

In [24]:
# Train for a while, save the weights, then continue training (each fit() call continues from where the previous one ended).
# An alternative is to use Keras callback functions.
with open("shakes_lstm_model.json","w") as json_file:
    json_file.write(model_structure)  # this only saves the structure, not the weights
np.random.seed(1337)
for i in range(5):
    model.fit(X,y,batch_size=batch_size, epochs=epochs)
    model.save_weights("shakes_lstm_weights_{}.h5".format(i+1))

Train on 125168 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Train on 125168 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Train on 125168 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Train on 125168 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Train on 125168 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


### g) Create a sampling function to generate character sequences

The LSTM network outputs predictions: a probability for each character.

Let's create a function that draws the next character from that probability distribution.
- dividing the log-probabilities by the temperature flattens (if >1) or sharpens (if <1) the distribution
- temperature (or diversity) <1 produces conservative text that stays close to the training text
- temperature (or diversity) >1 produces more diverse output.
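
The effect of the temperature can be checked on a toy distribution. The following is a sketch only: the three probabilities and the helper name `apply_temperature` are assumptions for illustration, and the helper expects strictly positive probabilities.

```python
import math

def apply_temperature(probs, temperature):
    # Scale the log-probabilities, then renormalize so they sum to 1 again.
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_probs = [0.6, 0.3, 0.1]                # hypothetical model output
sharp = apply_temperature(toy_probs, 0.2)  # temperature < 1: mass moves to the top choice
same = apply_temperature(toy_probs, 1.0)   # temperature = 1: distribution is unchanged
flat = apply_temperature(toy_probs, 2.0)   # temperature > 1: choices become more even

for name, dist in [("0.2", sharp), ("1.0", same), ("2.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

With a low temperature the most likely character is picked almost every time, which explains why low-diversity output tends to repeat the same phrases.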

The NumPy function np.random.multinomial(n, pvals, size)
- performs n draws from the distribution given in pvals and returns how many times each outcome was drawn.
- size is the number of such experiments; the output contains one row of counts per experiment.
- here n=1 and size=1: a single draw, so the output is a single one-hot row and np.argmax gives the index that was drawn.
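
A short sketch of what `np.random.multinomial` returns may help; the probabilities below are made-up example values.

```python
import numpy as np

toy_pvals = [0.2, 0.5, 0.3]  # hypothetical probabilities, must sum to 1

# One experiment with a single draw: a (1, 3) array with exactly one 1 in it.
draw = np.random.multinomial(1, toy_pvals, 1)
index = int(np.argmax(draw))  # position of the 1 = index of the drawn outcome
print(draw, index)

# Many draws in one experiment: the counts per outcome, summing to the number of draws.
counts = np.random.multinomial(1000, toy_pvals)
print(counts)
```

With a large number of draws the counts roughly follow the given probabilities, while a single draw simply picks one index at random according to them.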

In [26]:
import random
def sample(preds,temperature=1.0):
    preds=np.asarray(preds).astype('float64')
    preds=np.log(preds)/temperature
    exp_preds=np.exp(preds)
    preds=exp_preds/np.sum(exp_preds) # Renormalize to probabilities (softmax of the scaled log-probabilities)
    probas=np.random.multinomial(1,preds,1) # Multinomial draw from probabilities
    return np.argmax(probas) # return the class that was drawn.

### h) Generate three texts with three diversity levels

In [28]:
import sys
start_index=random.randint(0,len(text)-maxlen-1)
for diversity in [0.2,0.5,1.0]:
    print()
    print('-----------diversity:',diversity)
    generated=''
    sentence=text[start_index:start_index+maxlen]
    generated+=sentence
    print('-----------Generating with seed:"' + sentence + '"')
    sys.stdout.write(generated)
    for i in range(400):
        x=np.zeros((1,maxlen,len(chars)))
        for t,char in enumerate(sentence):
            x[0,t,char_indices[char]]=1
        preds=model.predict(x,verbose=0)[0] # create prediction (proba_list)
        next_index=sample(preds,diversity) # draw the index of next character
        next_char=indices_char[next_index] # look up the character for the drawn index
        generated+=next_char
        sentence=sentence[1:]+next_char # create 40-character seed for next round of character prediction
        sys.stdout.write(next_char)
        sys.stdout.flush() # flushes the internal buffer to the console so next char appears immediately
    print()       


-----------diversity: 0.2
-----------Generating with seed:"france of the best ranck and station,
ar"
france of the best ranck and station,
are ankellish then the world to the common the selfe,
and the selfe and fierie to me to the sould,
the face to the common the soule be well,
that i will not be a will him to the street,
that i will see the death, the selfe to the compost,
that i will see the world to the selfe, and there
i shall be so be so father: the selfe the street,
that i will be sonne of men, and with the man of the fathers co

-----------diversity: 0.5
-----------Generating with seed:"france of the best ranck and station,
ar"
france of the best ranck and station,
are like a good of the boben, that i haue
their fire as chome dismitted enterping ang
ye manney to thy thing, i to make a creat,
as thou my lord, i haue not be a bone,
that i haue me and thing, and then my lord?
  os. they in him, you may well and honor, will plucke them bend,
that should be a countence of ites to t

Evaluation of the generated text:
- Diversity 0.2 and 0.5 both give something that looks like Shakespeare.
- Diversity 1.0 (with this dataset) starts to go off the rails fairly quickly.

How to improve the model:
- use a larger corpus
- use a larger number of neurons
- segment sentences

### i) Other type of cell: GRU (gated recurrent unit)

In [29]:
# The syntax is exactly the same as with LSTM (GRU must first be imported from tensorflow.keras.layers)
# GRU(num_neurons,return_sequences=True,input_shape=X[0].shape)

### j) Several layers of LSTM

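You can stack several LSTM layers, i.e. create a deeper LSTM network
- return_sequences=True is then essential on each LSTM layer that feeds another LSTM layer, because the next layer expects a full sequence as input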

In [31]:
# Example of deep LSTM with two layers
num_neurons2=128
model_deep=Sequential([
    LSTM(num_neurons, return_sequences=True, input_shape=X[0].shape),  
    LSTM(num_neurons2),  # last LSTM layer returns only the final timestep, so the Dense output shape matches y
    Dense(len(chars),activation="softmax")  
])

Note, however, that a complex model capable of representing more complex relationships than are present in the data can overfit and lead to strange results.