# Lab 8: Training Deep Recurrent Neural Network - Part 2

Name1, Student's ID1<br>

## Lab Instruction - Language Modelling and Text Classification

In this lab, you will learn to train a deep recurrent neural network with LSTM layers using the Keras library on the TensorFlow backend. Your task is to implement natural language modelling and text generation.

```
alice_in_wonderland.txt
```

In class, we will use alice_in_wonderland.txt as the text file. You will then train your language model using an RNN with LSTM layers.



- Language model (in Thai): http://bit.ly/language_model_1
- Tutorial on how to create a language model (in English): https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275

To evaluate the model, the perplexity measurement is used: https://stats.stackexchange.com/questions/10302/what-is-perplexity
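
As a small worked example of this measurement, perplexity can be computed as the exponential of the average negative log-probability that a model assigns to the true next words. The sketch below uses NumPy only; the helper name `perplexity_from_probs` and the values in `true_word_probs` are illustrative assumptions, not results from this lab.

```python
import numpy as np

def perplexity_from_probs(true_word_probs):
    """Perplexity = exp(mean negative natural-log probability of the true words)."""
    probs = np.asarray(true_word_probs, dtype=float)
    cross_entropy = -np.mean(np.log(probs))  # average cross-entropy in nats
    return float(np.exp(cross_entropy))

# Hypothetical probabilities that a model assigned to four correct next words.
true_word_probs = [0.25, 0.25, 0.25, 0.25]
uniform_ppl = perplexity_from_probs(true_word_probs)
print(uniform_ppl)
```

A model that always gives probability 1/4 to the correct word is as uncertain as a uniform choice among four words, so its perplexity is about 4. Lower perplexity means the model is less "surprised" by the text.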

Finally, fine-tune your model. Try different hyperparameters or add more data, then discuss your results.




#### 1. Load your data 

In [1]:
# Import required libraries
from keras import *
from keras.preprocessing import text
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split

import _utils as fn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow import keras

In [2]:
# Load data (the context manager closes the file after reading)
with open("/content/alice_in_wonderland.txt", "r", encoding="utf8", errors="ignore") as file:
    raw_text = file.read()

In [3]:
raw_text[:200]

'CHAPTER I.\nDown the Rabbit-Hole\n\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into\nthe book her sister was rea'

In [4]:
chars = sorted(list(set(raw_text)))

In [5]:
print("Total characters: ", len(chars))
print("Total word: ", len(raw_text.split()))

Total characters:  82
Total word:  29371


#### 2. Data Preprocessing 

*Note that only the story is used as the dataset; footnotes and credits are not included.*

The symbol '\n' indicates the end of a line. It is replaced with an ``<EOS>`` token so that the model learns where to end a sentence.

To create a corpus for your model, the following code can be used:<br>
*Note that other techniques can be used*

```python
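# Cut the text into semi-redundant sequences of maxlen characters.
maxlen, step = 40, 3  # example values, adjust for your data
sentences, next_chars = [], []
for i in range(0, len(raw_text) - maxlen, step):
    sentences.append(raw_text[i: i + maxlen])   # input window
    next_chars.append(raw_text[i + maxlen])     # character that follows the window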
```

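The code slides a window over the text from the first character to the last, moving `step` characters at a time. `maxlen` defines the length of each input sequence, and the character that immediately follows the window is the target the model learns to predict.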
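
To make the windowing concrete, here is a toy run on a short string. The names `sample_text`, `maxlen = 5`, and `step = 3` are arbitrary choices for illustration only, not settings from the lab.

```python
# Toy illustration of the sliding-window corpus; all values are arbitrary.
sample_text = "alice was here"
maxlen = 5   # length of each input window, in characters
step = 3     # how far the window moves on each iteration

sentences = []
next_chars = []
for i in range(0, len(sample_text) - maxlen, step):
    sentences.append(sample_text[i: i + maxlen])   # input window
    next_chars.append(sample_text[i + maxlen])     # character that follows it

for window, target in zip(sentences, next_chars):
    print(repr(window), "->", repr(target))
```

Because `step` is smaller than `maxlen`, consecutive windows overlap, which is why the sequences are called semi-redundant.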


In [6]:
from keras.preprocessing.sequence import pad_sequences

In [7]:
# Add an end-of-sentence symbol: use .replace to turn each blank line ('\n\n') into " <EOS> "
raw_text = raw_text.replace('\n\n', " <EOS> ")
raw_text[:200]


'CHAPTER I.\nDown the Rabbit-Hole <EOS> \nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into\nthe book her sister wa'

In [8]:
# Preprocessing 
# Create corpus & Vectorization

tokenizer = text.Tokenizer()

# basic cleanup
corpus = raw_text.lower().split("\n")

# tokenization
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# create input sequences using list of tokens
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])

# Pre padding 
input_sequences = np.array(sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]

# One-hot label
label = keras.utils.to_categorical(label, num_classes=total_words)
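
The n-gram expansion and pre-padding used in the cell above can be traced by hand on a single line. This is a minimal sketch under stated assumptions: the token ids in `token_list` are made up, and the padding is written out manually with NumPy to mimic `padding='pre'` rather than calling Keras.

```python
import numpy as np

# Hypothetical token ids for one line of text.
token_list = [5, 12, 7, 31]

# Every prefix of length >= 2 becomes one training sequence.
n_gram_sequences = [token_list[:i + 1] for i in range(1, len(token_list))]

# Pre-pad with zeros so that all sequences share the same length.
max_len = max(len(seq) for seq in n_gram_sequences)
padded = np.array([[0] * (max_len - len(seq)) + seq for seq in n_gram_sequences])

# The last column is the word to predict; the rest is the model input.
toy_predictors, toy_label = padded[:, :-1], padded[:, -1]
print(toy_predictors)
print(toy_label)
```

Padding on the left keeps the most recent words next to the prediction position, so the final column can always be sliced off as the label.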

In [11]:
print('Max sequence len: %s' % max_sequence_len)
print('Total word len: %s' % total_words)

Max sequence len: 112
Total word len: 3162


In [13]:
n_gram_sequence[0]

3160

In [9]:
print(predictors[0])

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0 330]


In [10]:
print(label[0])

[0. 0. 0. ... 0. 0. 0.]


#### 3. Language Model

Define an RNN model using LSTM layers and a word embedding representation.<br>
We will use perplexity as a metric.

```python
def perplexity(y_true, y_pred):
    # Keras cross-entropy uses the natural log, so perplexity is exp(mean cross-entropy)
    cross_entropy = keras.backend.categorical_crossentropy(y_true, y_pred)
    return keras.backend.exp(keras.backend.mean(cross_entropy))
```

To use a custom metric function, see https://keras.io/metrics/

For the loss function, `categorical_crossentropy` is used; any optimization method can be applied.

In [16]:
from keras.layers import Embedding 
from keras.layers import LSTM
from keras.layers import Dropout 
from keras.layers import Dense
import keras.backend 

In [17]:
def perplexity(y_true, y_pred):
    # Keras cross-entropy uses the natural log, so perplexity is exp(mean cross-entropy)
    cross_entropy = keras.backend.categorical_crossentropy(y_true, y_pred)
    return keras.backend.exp(keras.backend.mean(cross_entropy))

In [18]:

# Define your model
# Use a word embedding layer

model = models.Sequential()
model.add(layers.Embedding(total_words, 512,input_length=max_sequence_len-1,name='Embedding'))
model.add(layers.LSTM(512, kernel_initializer = 'he_normal',
                      dropout=0.3,
                      return_sequences=True,
                     name='LSTM1'))
model.add(layers.LSTM(256, kernel_initializer = 'he_normal',
                     dropout=0.3,
                     name='LSTM2'))
model.add(layers.Dense(total_words, activation='softmax',name='Output'))

model.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics=[perplexity])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Embedding (Embedding)        (None, 111, 512)          1618944   
_________________________________________________________________
LSTM1 (LSTM)                 (None, 111, 512)          2099200   
_________________________________________________________________
LSTM2 (LSTM)                 (None, 256)               787456    
_________________________________________________________________
Output (Dense)               (None, 3162)              812634    
Total params: 5,318,234
Trainable params: 5,318,234
Non-trainable params: 0
_________________________________________________________________


In [19]:
# Recompile the model, switching the optimizer from RMSprop to Adam

model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=[ perplexity])

In [20]:
history = model.fit(predictors, label,batch_size=32, epochs=10)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### 4. Evaluate your model 

In [21]:

# Create a function to evaluate your model using the perplexity measurement (you can try adding other measurements as well)
def evaluate_result(features, label, model):
    # Return the [loss, perplexity] values computed by model.evaluate
    return model.evaluate(features, label)

In [22]:
evaluate_result(predictors, label, model)



#### 5. Text generating

In [40]:

    
def generate_text(seedtext, next_words, max_sequence_len, model):
    # Uses the global `tokenizer` fitted during preprocessing
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seedtext])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes is no longer available in recent Keras versions; take the argmax instead
        predict_x = model.predict(token_list, verbose=0)
        predicted = int(np.argmax(predict_x, axis=1)[0])

        # Map the predicted index back to its word (empty string if not found)
        output_word = tokenizer.index_word.get(predicted, "")
        seedtext += " " + output_word
    return seedtext
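
Always taking the argmax (greedy decoding) tends to produce repetitive text. One common alternative, not part of the original lab, is to sample the next word from the predicted distribution after rescaling it with a temperature. The sketch below is a hypothetical helper written with NumPy only; `sample_with_temperature` and `toy_probs` are illustrative names and values.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw a word index from `probs` after rescaling by `temperature`."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Low temperatures sharpen the distribution, high ones flatten it.
    logits = np.log(probs + 1e-12) / temperature
    scaled = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# Hypothetical softmax output over a four-word vocabulary.
toy_probs = [0.1, 0.6, 0.2, 0.1]
rng = np.random.default_rng(0)
print(sample_with_temperature(toy_probs, temperature=0.8, rng=rng))
```

Inside `generate_text`, a call such as `sample_with_temperature(predict_x[0], temperature=0.8)` could stand in for the argmax step. As the temperature approaches zero, the behaviour approaches greedy decoding.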

In [45]:
# Generate your sample text

seed_text = input('Enter your start sentence:')
# Arguments: seed text, number of words to generate, max sequence length, model
gen_text = generate_text(seed_text,10,max_sequence_len,model)

Enter your start sentence:I am


In [46]:
gen_text

'I am i must be a little girl said the king eos'

### More on Natural Language Processing and Language Models
1. https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e 
2. https://medium.com/phrasee/neural-text-generation-generating-text-using-conditional-language-models-a37b69c7cd4b
3. http://karpathy.github.io/2015/05/21/rnn-effectiveness/

**Music generated by an RNN**
https://soundcloud.com/optometrist-prime/recurrence-music-written-by-a-recurrent-neural-network
