# Training a Neural Language Model

We can now train a statistical language model from the prepared data.

The model we will train is a neural language model. It has a few unique characteristics:

(a) It uses a distributed representation for words so that different words with similar meanings will have a similar representation.
(b) It learns the representation at the same time as learning the model.
(c) It learns to predict the probability of the next word from the context of the preceding N words (N = 300 in this notebook).
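
To make point (a) concrete, here is a minimal sketch of a distributed representation: each word index selects a row of a dense matrix, and similarity between words is measured between those rows. The four-word vocabulary and the vector values are invented for illustration; in the model trained below, the embedding matrix is learned rather than hand-picked.

```python
import numpy as np

# Hypothetical 4-word vocabulary with hand-picked 3-dimensional vectors.
vocab = {"cat": 0, "dog": 1, "car": 2, "truck": 3}
embedding_matrix = np.array([
    [0.9, 0.1, 0.0],  # cat
    [0.8, 0.2, 0.1],  # dog
    [0.0, 0.9, 0.8],  # car
    [0.1, 0.8, 0.9],  # truck
])

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity(w1, w2):
    """Look up both words in the embedding matrix and compare their rows."""
    return cosine_similarity(embedding_matrix[vocab[w1]], embedding_matrix[vocab[w2]])

sim_cat_dog = word_similarity("cat", "dog")
sim_cat_car = word_similarity("cat", "car")
```

In this toy setup, "cat" and "dog" end up far more similar to each other than "cat" and "car", which is the property the embedding layer is meant to learn from data.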

First, let's load the encoded data from the 'encoded_sequences' folder.

In [None]:
import pandas as pd
import numpy as np

In [None]:
train_data = pd.read_csv('encoded_sequences/encoded_training_20pc_300.csv')
validation_data = pd.read_csv('encoded_sequences/encoded_validation_20pc_300.csv')
testing_data = pd.read_csv('encoded_sequences/encoded_testing_20pc_300.csv')

In [None]:
N = 300 # Length of input sequence

# Columns '0' through 'N-1' hold the input tokens; column 'N' holds the target token
a = [str(i) for i in range(N)]
XTrain = np.array(train_data[a])
YTrain = np.array(train_data[str(N)])
XVal = np.array(validation_data[a])
YVal = np.array(validation_data[str(N)])
XTest = np.array(testing_data[a])
YTest = np.array(testing_data[str(N)])
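
The slicing above assumes that each CSV row holds N input token ids in columns named '0' through 'N-1', followed by the target token id in column 'N'. Column names are strings because `pd.read_csv` reads them from the header row. A small sketch of the same split on an invented DataFrame, with `N_demo = 3` standing in for N = 300:

```python
import numpy as np
import pandas as pd

N_demo = 3  # stand-in for N = 300

# Invented rows: three input token ids followed by the target token id.
demo = pd.DataFrame(
    [[5, 2, 7, 1],
     [2, 7, 1, 4]],
    columns=[str(i) for i in range(N_demo + 1)],
)

input_cols = [str(i) for i in range(N_demo)]
X_demo = demo[input_cols].to_numpy()   # shape (rows, N_demo)
y_demo = demo[str(N_demo)].to_numpy()  # shape (rows,)
```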

The vocabulary contains 16689 distinct tokens, so we one-hot encode the Y vectors for the training, validation, and test sets as vectors of that length.

In [None]:
# One-hot encoding for the training set
oh_YTrain = []
for j in range(len(YTrain)):
    temp = np.zeros(16689)
    temp[YTrain[j]] = 1
    oh_YTrain.append(temp)

In [None]:
# One-hot encoding for the validation set
oh_YVal = []
for j in range(len(YVal)):
    temp = np.zeros(16689)
    temp[YVal[j]] = 1
    oh_YVal.append(temp)


In [None]:
# One-hot encoding for the test set
oh_YTest = []
for i in range(len(YTest)):
    temp = np.zeros(16689)
    temp[YTest[i]] = 1
    oh_YTest.append(list(temp))
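
The three loops above can also be written without Python-level iteration. The following is a minimal sketch using NumPy integer indexing; the helper name `one_hot` is hypothetical, and it assumes every label is an integer in the range 0 to `vocab_size - 1`.

```python
import numpy as np

def one_hot(labels, vocab_size):
    """Return a (len(labels), vocab_size) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.shape[0], vocab_size), dtype=np.float32)
    # Row i gets a 1.0 in column labels[i].
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

demo_encoded = one_hot([2, 0, 3], vocab_size=4)
```

With the notebook's variables this would be called as `one_hot(YTrain, 16689)`. Using `float32` halves the memory compared with the `float64` default of `np.zeros`, which matters when every row has 16689 columns.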

In [None]:
# Number of tokens per input sequence (XTrain is already a NumPy array)
inp_shape = XTrain.shape[1]

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, LSTM

In [None]:
# Sequential model: an embedding layer, two stacked LSTM layers (100 units each),
# a dense ReLU layer, and a softmax output over the 16689-token vocabulary

model = Sequential()  
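# The second argument is the embedding dimension; it reuses N (300), which is also the sequence length
model.add(Embedding(16689, N, input_length=inp_shape))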
model.add(LSTM(100, return_sequences=True))  
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(16689, activation='softmax'))
# summary() prints the table itself and returns None, so no print() is needed
model.summary()
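
As a sanity check on the output of `model.summary()`, the parameter counts can be worked out by hand. The sketch below uses the standard formulas: an embedding has `vocab_size * dim` weights, an LSTM has four gates that each carry input weights, recurrent weights, and a bias, and a dense layer has `input_dim * units` weights plus `units` biases. It assumes the Keras default layer settings (biases enabled, no extra options); the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

layer_params = {
    "embedding": embedding_params(16689, 300),
    "lstm_1": lstm_params(300, 100),
    "lstm_2": lstm_params(100, 100),
    "dense_1": dense_params(100, 100),
    "dense_2": dense_params(100, 16689),
}
total_params = sum(layer_params.values())
```

Under these assumptions the total comes to 6,943,189 parameters, of which the embedding layer accounts for roughly 72%.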

In [None]:
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

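# Fit model, monitoring loss and accuracy on the validation set after each epoch
model.fit(XTrain, np.asarray(oh_YTrain),
          validation_data=(XVal, np.asarray(oh_YVal)),
          batch_size=128, epochs=80)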
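
One-hot targets of width 16689 take a lot of memory. Keras also provides a `sparse_categorical_crossentropy` loss that accepts integer labels directly. The NumPy sketch below illustrates why the two are interchangeable: with a one-hot target, the categorical cross-entropy reduces to the negative log-probability of the true class. The probabilities and labels are invented for illustration.

```python
import numpy as np

# Invented softmax outputs for two samples over a 3-token vocabulary.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])
labels = np.array([0, 2])  # integer class ids

# Categorical cross-entropy with explicit one-hot targets.
one_hot_targets = np.eye(3)[labels]
dense_loss = -np.sum(one_hot_targets * np.log(probs), axis=1)

# Sparse form: pick out the probability of the true class directly.
sparse_loss = -np.log(probs[np.arange(len(labels)), labels])
```

In Keras this would correspond to compiling with `loss='sparse_categorical_crossentropy'` and passing `YTrain` in place of `oh_YTrain`. This is a suggested alternative, not what the notebook does.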

In [None]:
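# Predict the most likely next word for each test sequence.
# Sequential.predict_classes was removed in TensorFlow 2.6, so take the
# argmax over the predicted probabilities instead.
y_hat = np.argmax(model.predict(XTest, verbose=0), axis=-1)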

In [None]:
print('Accuracy: ' + str(np.mean(y_hat == YTest)))