## Char Prediction using LSTM

1. Download the text of Alice in Wonderland or Dracula from https://www.gutenberg.org/browse/scores/top in plain-text format.
2. Create a char_to_int map that maps each character used in the novel to an integer, for example {a: 3}.
3. Read the data from the text file and do the following:
    3.1 Create a sliding window that takes the first 100 characters as the input sequence and the 101st character as the output. The window advances one character at a time.
    For example: 
        "Avul Pakir Jainulabdeen Abdul Kalam better known as A.P.J. Abdul Kalam"
        The first window runs from "A" to the 100th character, and the 101st character is the output.
        The second window runs from "v" to the 101st character, and the 102nd character is the output.
    Convert the input and output sequences to their integer representations using the char_to_int map.
    This gives two arrays, seqIn and seqOut, whose elements hold the integer representation of 100 characters and of 1 character respectively.
    seqIn = [[10........15], [5.....25]...] seqOut = [5, 2, 5]
4. Reshape seqIn to (NumberOfSamples, 100, 1), which gives [[[10]........[15]], [[5]..... [25]]...]
5. One-hot encode seqOut using to_categorical (np_utils.to_categorical in older Keras, keras.utils.to_categorical in this notebook).

6. Create a simple model with an LSTM followed by a Dense layer.

7. Given a seed sentence, predict the next character using the trained model.
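
Steps 3 to 4 can be tried on a toy string before running them on the whole novel. The following is a minimal sketch, assuming a window of 5 characters instead of 100; the helper `make_windows` and the `toy_*` names are illustrative and are not part of the notebook.

```python
import numpy as np

def make_windows(text, window):
    """Return (inputs, targets): each input is `window` chars, each target the next char."""
    inputs = [text[i:i + window] for i in range(len(text) - window)]
    targets = [text[i + window] for i in range(len(text) - window)]
    return inputs, targets

toy_text = "Avul Pakir"                       # toy text; the notebook uses the whole novel
toy_chars = sorted(set(toy_text))             # sorted so the mapping is reproducible
toy_map = {ch: i for i, ch in enumerate(toy_chars)}

inputs, targets = make_windows(toy_text, window=5)
seq_in = np.array([[toy_map[ch] for ch in s] for s in inputs])
seq_out = np.array([toy_map[ch] for ch in targets])

# Shape expected by an LSTM layer: (samples, timesteps, features)
seq_in = seq_in.reshape(len(inputs), 5, 1)
print(repr(inputs[0]), "->", repr(targets[0]))
print(seq_in.shape, seq_out.shape)
```

A text of N characters yields N - window samples, because the last window still needs one following character as its target.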


### Importing Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split

import keras
# Sequence to attain Padding
from keras.preprocessing import sequence
# Importing RNN's LSTM
from keras.layers import LSTM, Dense, Dropout
from keras.layers import Embedding
# Applying Sequential algorithm to model
from keras.models import Sequential

Using TensorFlow backend.


### Storing the Document

In [2]:
# Reading the whole document; the with-block closes the file afterwards
with open('AliceInWonderland') as f:
    Original_file = f.read()

In [3]:
# Replacing each '\n' with a space and removing every '\r' from the document
file = Original_file.replace('\n', ' ').replace('\r', '')

#### Calculating number of Unique Letters in the document

In [4]:
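# Stores the unique characters from the document.
# Sorted so that the character-to-integer mapping is identical on every run
# (set order can differ between runs, which would break weights saved earlier).
letters = sorted(set(file))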

# Stores the number of unique letters which is the num_classes in outputs
unique_output_Values = len(letters)
unique_output_Values

85

#### Conversions

In [5]:
# Neural networks accept only numeric inputs, so the characters are converted to integers

## Maps characters to integers
char_to_int = {ch: i for i, ch in enumerate(letters)}

## Maps integers back to characters
int_to_char = {i: ch for i, ch in enumerate(letters)}
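
A quick way to sanity-check two such dictionaries is a round trip: encode a string, decode it, and compare. This is a minimal sketch, assuming a small stand-in alphabet rather than the novel's 85 characters; the `sample_*` names are illustrative.

```python
sample_text = "alice was beginning"
sample_letters = sorted(set(sample_text))
sample_char_to_int = {ch: i for i, ch in enumerate(sample_letters)}
sample_int_to_char = {i: ch for ch, i in sample_char_to_int.items()}

# Encode every character, then map the integers back to characters
encoded = [sample_char_to_int[ch] for ch in sample_text]
decoded = "".join(sample_int_to_char[i] for i in encoded)
print(decoded == sample_text)
```

If the comparison is false, the two maps are not inverses of each other and predictions will decode to the wrong characters.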

### METHODS

In [6]:
''' SLIDING FUNCTION: Slides over the input text file character by character'''

def generate_char_Dataset(data, slide):
    
    x = []
    y = []
    
    ## Generating input windows (lists of characters) and their next-character targets
    for index in range(len(data) - slide):
        x.append(list(data[index:index+slide]))
        y.append(data[index+slide:index+slide+1])
        
    return x,y

In [7]:
''' CHAR TO INT CONVERSION FUNCTION: Converts character dataset to int dataset '''

def char_Dataset_to_int_Dataset(x,y, char_to_int):
    
    input_to_int = []
    output_to_int = []

    for i in range(len(x)):
        input_to_int.append([char_to_int[char] for char in x[i]])
        output_to_int.append([char_to_int[char] for char in y[i]])
    
    return input_to_int, output_to_int

In [8]:
''' (BACK) INT TO CHAR CONVERSION FUNCTION: Accepts output(y) i.e. List of lists '''

def int_Dataset_to_char_Dataset(y, int_to_char):
    
    back_to_char = []

    for i in range(len(y)):
        back_to_char.append([int_to_char[y[i][0]]])
        
    return back_to_char

In [9]:
''' INITIALIZATION FUNCTION: Accepts the text, the window size (slide), and the char_to_int map '''

def initialize(data, slide, char_to_int):
    
    char_Dataset = generate_char_Dataset(data, slide)
    int_Dataset = char_Dataset_to_int_Dataset(char_Dataset[0], char_Dataset[1], char_to_int)
    
    # INPUT: e.g. [[12,21,34], [12,33,41], ...] - List of Lists
    seqInput = int_Dataset[0]
    
    # OUTPUT: e.g. [[12],[24],[2],[5], ...] (List of Lists) flattened to [12,24,2,5, ...]
    seqOutput = list(np.array(int_Dataset[1]).flatten())
    
    seqInput_RESHAPED = np.array(seqInput).reshape(len(seqInput), slide, 1)
    
    return seqInput_RESHAPED, seqOutput
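
The Python loop over every character can be slow on a full novel. As an alternative, the windowing and reshaping can be vectorized. This is a minimal sketch, assuming NumPy 1.20 or later (for `sliding_window_view`) and a toy integer sequence in place of the encoded text; it is not the notebook's implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_chars, slide = 12, 4                    # toy sizes; the notebook uses slide=100
encoded = np.arange(n_chars)              # stand-in for an integer-encoded text

all_windows = sliding_window_view(encoded, slide)   # shape (n_chars - slide + 1, slide)
windows = all_windows[:-1]                # the last window has no following character
targets = encoded[slide:]                 # the character that follows each window

# Same (samples, timesteps, features) layout that initialize() returns
x = windows.reshape(len(windows), slide, 1)
print(x.shape, targets.shape)
```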

### Initializing

In [10]:
DATA_SET = initialize(file, 100, char_to_int)

X = DATA_SET[0]
Y = DATA_SET[1]

In [11]:
''' X=(163716, 100, 1) 

    Number of samples = 163716
    Number of inputs  = 100 (Letter1, Letter2...., Letter100)
               Output = 1 (Letter101)
'''

X.shape

(163716, 100, 1)

### Defining Parameters

In [12]:
num_words = 20000

## Dividing the whole No. of samples into batches of 32
batch_size = 32

## Number of passes over the training data (note: model.fit below hard-codes epochs=1)
epochs = 2

## Number of Output classes
num_classes = unique_output_Values

In [13]:
print(len(X),len(Y))

163716 163716


### Training and Testing units

In [14]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.01, random_state=1)

### One-Hot-Encoding Output Values

In [15]:
# Total no. of classes = Unique Values in the document, [0,0,0,.....1]
y_train_oneHotEncoded = keras.utils.to_categorical(y_train, num_classes=unique_output_Values)
y_test_oneHotEncoded = keras.utils.to_categorical(y_test, num_classes=unique_output_Values)
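
To make the encoding concrete, the effect of `to_categorical` can be reproduced in plain NumPy. This is a minimal sketch with a hypothetical helper `one_hot` and 4 classes instead of 85; it only illustrates the resulting layout.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Row i is all zeros except for a 1 at column labels[i]."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

encoded_labels = one_hot([2, 0, 3], num_classes=4)
print(encoded_labels)
```

Each row has exactly one 1, and taking `argmax` along the last axis recovers the original integer labels.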

In [16]:
print(len(y_train_oneHotEncoded), len(y_test_oneHotEncoded))

162078 1638


In [17]:
x_train.shape

(162078, 100, 1)

In [None]:
y_test_oneHotEncoded.shape

(1638, 85)

### Model

In [None]:
# Model Architecture:  Consists of 2 LSTM Layers and an Output Dense Layer

#       t0 - t1 - t2 - t3 ------ t99    = Sample 1                
#       t0 - t1 - t2 - t3 ------ t99    = Sample 2
#                                         ...
#       t0 - t1 - t2 - t3 ------ t99    = Sample 162078 


## Sequential one by one
model = Sequential()

## LSTM Layer 1: 64 units. return_sequences=True passes the full sequence on to the second LSTM layer
model.add(LSTM(64, input_shape=(x_train.shape[1], x_train.shape[2]), return_sequences=True))
## LSTM Layer 2: 128 units; returns only the output of the last time step
model.add(LSTM(128))
## Output layer: softmax yields a probability distribution over the character classes,
## which is what categorical_crossentropy expects
model.add(Dense(unique_output_Values, activation="softmax"))

## Compiling Model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

## Fitting the model from scratch (no pre-trained weights loaded)
model.fit(x_train, y_train_oneHotEncoded, batch_size=batch_size, epochs=1, validation_data=(x_test, y_test_oneHotEncoded))

Train on 162078 samples, validate on 1638 samples
Epoch 1/1
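
Categorical cross-entropy, the loss passed to `model.compile`, compares a predicted probability distribution over the character classes with the one-hot target. A softmax activation produces such a distribution. The following is a minimal NumPy sketch of both pieces on a 3-class toy example; the function names are illustrative and this is not how Keras implements them internally.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(one_hot_targets, probs, eps=1e-12):
    """Negative log-probability assigned to the true class, per sample."""
    return -np.sum(one_hot_targets * np.log(probs + eps), axis=-1)

scores = np.array([[2.0, 1.0, 0.1]])      # raw outputs for one sample
target = np.array([[1.0, 0.0, 0.0]])      # true class is index 0
probs = softmax(scores)
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.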

#### Loading Weights

In [None]:
#model.load_weights('weights-improvement-49-1.2575.hdf5', by_name=False)

##### Loading Dropouts

In [None]:
#model.add(Dropout(0.2))  # Dropout takes a rate between 0 and 1 (0.2 is only an illustrative value); the original 32 is not a valid rate

### Predictions

In [None]:
predict = model.predict(x_test)

### Accuracy

In [None]:
evaluate = model.evaluate(x_test, y_test_oneHotEncoded)

In [None]:
accuracy = evaluate[1]
accuracy*100
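
`model.evaluate` returns the loss followed by the metrics requested in `compile`, so index 1 is the accuracy here. The same figure can be recomputed by hand from predicted probabilities and one-hot targets. This is a minimal sketch on made-up arrays with 3 classes; with the notebook's variables the inputs would be `predict` and `y_test_oneHotEncoded`.

```python
import numpy as np

# Made-up predicted probabilities for 4 samples and 3 classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.4, 0.1]])
# Made-up one-hot targets for the same samples
one_hot_targets = np.array([[0, 1, 0],
                            [1, 0, 0],
                            [0, 1, 0],
                            [1, 0, 0]])

predicted = probs.argmax(axis=1)              # most probable class per sample
actual = one_hot_targets.argmax(axis=1)       # true class per sample
manual_accuracy = float((predicted == actual).mean())
print(manual_accuracy * 100)
```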

### TEST INPUT

In [None]:
test_input = "Project Gutenberg’s Alice’s Adventures in Wonderland, by Lewis Carroll This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: Alice’s Adventures in Wonderland Author: Lew"

In [None]:
test_input_seqIn_withChar,test_input_seqOut_withChar = generate_char_Dataset(test_input, 100)

In [None]:
'''Converting char input sequence to Integer'''
test_input_seqIn = []

for i in range(len(test_input_seqIn_withChar)):
    test_input_seqIn.append([char_to_int[letter] for letter in test_input_seqIn_withChar[i]])

In [None]:
'''Reshaping seqIn sample'''

test_input_seqIn_array = np.array(test_input_seqIn)
test_input_seqIn_reshape = test_input_seqIn_array.reshape(test_input_seqIn_array.shape[0], test_input_seqIn_array.shape[1], 1)

In [None]:
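# predict_classes is not available in recent Keras releases;
# taking the argmax of the predicted probabilities gives the same class indices
predictions = np.argmax(model.predict(test_input_seqIn_reshape), axis=-1)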

#### Storing inputs and outputs in a proper string

In [None]:
# Named `inputs` rather than `input` to avoid shadowing the Python builtin
inputs = []

for i in range(len(test_input_seqIn_withChar)):
    inputs.append(''.join(test_input_seqIn_withChar[i]))

In [None]:
output = []

for i in predictions:
    output.append(int_to_char[i])

### OUTPUT

In [None]:
''.join(output)
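
The cells above predict one character for each 100-character window of a fixed text. To generate new text from a seed, each predicted character can instead be appended to the text and fed back in as part of the next window. This is a minimal sketch of that loop; `generate`, `toy_predict` and the 4-letter alphabet are hypothetical stand-ins so the example runs without a trained model. With the notebook's objects, `predict_fn` would be `model.predict` and `window` would be 100.

```python
import numpy as np

def generate(seed, n_chars, predict_fn, char_to_int, int_to_char, window):
    """Extend `seed` by `n_chars` characters, one greedy prediction at a time."""
    text = seed
    for _ in range(n_chars):
        last = text[-window:]                                  # most recent window
        x = np.array([char_to_int[ch] for ch in last]).reshape(1, window, 1)
        probs = predict_fn(x)                                  # shape (1, num_classes)
        text += int_to_char[int(np.argmax(probs, axis=-1)[0])]
    return text

# Hypothetical stand-in for model.predict: always favours the character that
# follows the last input character in the alphabet (wrapping around).
alphabet = "abcd"
toy_char_to_int = {ch: i for i, ch in enumerate(alphabet)}
toy_int_to_char = {i: ch for ch, i in toy_char_to_int.items()}

def toy_predict(x):
    nxt = (int(x[0, -1, 0]) + 1) % len(alphabet)
    return np.eye(len(alphabet))[[nxt]]                        # one-hot row, shape (1, 4)

result = generate("abc", 5, toy_predict, toy_char_to_int, toy_int_to_char, window=3)
print(result)
```

The seed must be at least `window` characters long and contain only characters present in the mapping.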