In this practical you will explore how LSTMs can be applied to text data in order to perform tasks such as predicting the next word in a sequence and predicting the sentiment of a text.

# Task 1: Predicting a word in a sequence

In this task, you will train a model that takes a sequence of words as input and predicts the next word in the sequence.

**T1.1** Obtaining data

For this task we will use the [20newsgroup dataset](http://qwone.com/~jason/20Newsgroups/), which is available through [sklearn](https://scikit-learn.org/stable/datasets/index.html). You can load the data using the code below. Familiarize yourself with the dataset before moving to the next task.

In [1]:
from sklearn.datasets import fetch_20newsgroups
corpus = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),categories=['sci.med'])

In [2]:
data = corpus.data
data_train = data[:400]
data_test = data[400:]
print(len(data))
print(len(data_train))
print(len(data_test))

594
400
194
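One way to familiarize yourself with the data is to look at document lengths, since removing headers, footers and quotes can leave some posts very short or empty, and documents with 20 tokens or fewer will not produce any training windows later on. The sketch below is only an illustration: `summarize_corpus` is a hypothetical helper (not part of sklearn) and `docs` is a tiny made-up list standing in for `corpus.data`.

```python
def summarize_corpus(docs):
    """Return basic statistics for a list of raw text documents."""
    lengths = [len(doc.split()) for doc in docs]  # whitespace-separated word counts
    return {
        "documents": len(docs),
        "empty": sum(1 for n in lengths if n == 0),
        "shortest": min(lengths),
        "longest": max(lengths),
        "mean_words": sum(lengths) / len(lengths),
    }

# Hypothetical stand-in for corpus.data
docs = ["The patient was given aspirin.", "", "Is vitamin C useful against colds ?"]
stats = summarize_corpus(docs)
print(stats)
```

Calling `summarize_corpus(data_train)` in the notebook should give a first impression of how many documents are long enough to be useful.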


***
**T1.2** Data pre-processing

Since we will be using the Embedding layer, the data should be pre-processed as in the previous practicals. To clean the data, you can use the filters argument of the Tokenizer to specify which characters should be removed from the text.

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,split=' ')
tokenizer.fit_on_texts(data_train)
token_list_train = tokenizer.texts_to_sequences(data_train)
token_list_test = tokenizer.texts_to_sequences(data_test)
num_words = len(tokenizer.word_index)+1
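To make the pre-processing step less of a black box, here is a simplified pure-Python imitation of what the Tokenizer does with these settings: lowercase the text, replace the filtered characters with spaces, split on spaces, and number the words by decreasing frequency starting from 1. It is a sketch for illustration, not the Keras implementation, and the names `split_words`, `fit_word_index` and `to_sequence` are assumptions. Note the last step: words that were not seen while fitting are silently dropped, which is what happens to unseen test words when no out-of-vocabulary token is configured.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text, filters=FILTERS):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in split_words(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Encode a text, silently dropping words missing from word_index."""
    return [word_index[w] for w in split_words(text) if w in word_index]

# Hypothetical training texts
train = ["The dose, the drug.", "The drug works!"]
word_index = fit_word_index(train)
print(word_index)

# "new" was never seen during fitting, so it does not appear in the output
print(to_sequence("The new drug works?", word_index))
```

Index 0 is never assigned to a word, which is why the vocabulary size is computed as len(tokenizer.word_index)+1.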

***
**T1.3** Creating features and labels vectors 

Each training instance will be composed of a sequence of words (we will set the length to 20 for this exercise), and its label will be a single word.

The feature and labels vectors can be generated as follows. We use the first 20 words as features with the 21st as the label, then use words 2–21 as features and predict the 22nd and so on. This gives us significantly more training data. 
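The sliding window is easier to see on a small example. The sketch below uses a window of 3 instead of 20 and a made-up token list; `make_windows` is a hypothetical helper name used only for this illustration.

```python
def make_windows(tokens, window):
    """Split one token list into (features, label) pairs with a sliding window."""
    features, labels = [], []
    for i in range(window, len(tokens)):
        features.append(tokens[i - window:i])  # the `window` tokens before position i
        labels.append(tokens[i])               # the token at position i is the label
    return features, labels

tokens = [11, 12, 13, 14, 15, 16]
features, labels = make_windows(tokens, window=3)
for f, l in zip(features, labels):
    print(f, "->", l)
```

A document with n tokens yields n - window pairs, so documents that are not longer than the window contribute nothing.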

Generate the features and labels vectors for the train and the test datasets.

In [4]:
import numpy as np

x_train = []
y_train = []
train_length = 20

for row in token_list_train:
    for i in range(train_length, len(row)):
        sequence = row[i-train_length:i+1] 
        x_train.append(sequence[:-1])
        y_train.append(sequence[-1])
x_train = np.array(x_train)
y_train = np.array(y_train)
print(x_train.shape)
print(len(y_train))

(82696, 20)
82696


In [5]:
x_test = []
y_test = []
train_length = 20

for row in token_list_test:
    for i in range(train_length, len(row)):
        sequence = row[i-train_length:i+1]       
        x_test.append(sequence[:-1])
        y_test.append(sequence[-1])
x_test = np.array(x_test)
y_test = np.array(y_test)
print(x_test.shape)
print(len(y_test))

(24209, 20)
24209


***
**T1.3** Constructing the embedding weights matrix.

In this task we will use pre-trained word embeddings from the word2vec model. Create the embedding weights matrix for the Embedding layer.

In [6]:
from gensim.models import KeyedVectors

# Pre-trained 300-dimensional Google News vectors (the file must be present locally)
file = 'GoogleNews-vectors-negative300.bin'
word2vec = KeyedVectors.load_word2vec_format(file, binary=True)
word2vec_vectors = word2vec  # alias used by the cells below

In [7]:
import numpy as np

num_words = len(tokenizer.word_index)+1
embedding_matrix = np.zeros((num_words, 300))
for word, i in tokenizer.word_index.items():
    if word in word2vec_vectors:
        embedding_vector = word2vec[word]
        embedding_matrix[i] = embedding_vector

In [8]:
embedding_matrix.shape

(11693, 300)
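It can be useful to check how many vocabulary words actually receive a pre-trained vector, because every word missing from word2vec keeps an all-zero row. The sketch below shows the idea with made-up 2-dimensional vectors in a plain dictionary standing in for the word2vec model; `build_embedding_matrix`, `toy_vectors` and `toy_index` are hypothetical names.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim))  # row 0 is reserved, no word has index 0
    found = 0
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
            found += 1
    return matrix, found

# Hypothetical 2-dimensional vectors standing in for the 300-dimensional model
toy_vectors = {"drug": np.array([0.1, 0.2]), "dose": np.array([0.3, 0.4])}
toy_index = {"the": 1, "drug": 2, "dose": 3}

matrix, found = build_embedding_matrix(toy_index, toy_vectors, dim=2)
print(matrix.shape, f"coverage: {found}/{len(toy_index)}")
```

The same counting can be added to the loop above to see what fraction of the 20newsgroup vocabulary is covered.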

***
**T1.4** One-hot encoding the labels.

Since we are dealing with a multi-class classification problem, we need to convert each label into a vector whose dimension equals the number of words in the vocabulary. Convert the train and test labels into one-hot encoded vectors.

In [9]:
y_train_array = np.zeros((len(y_train), num_words),dtype=int)
for idx,word_idx in enumerate(y_train):
    y_train_array[idx,word_idx] = 1
    
y_test_array = np.zeros((len(y_test), num_words),dtype=int)
for idx,word_idx in enumerate(y_test):
    y_test_array[idx,word_idx] = 1
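With 82,696 training labels and 11,693 vocabulary entries, the one-hot training matrix has close to a billion cells, so memory can become a concern. The sketch below shows a vectorised variant of the loop that uses a one-byte dtype; `one_hot` is a hypothetical helper, and the choice of `np.uint8` is an assumption rather than a requirement. As an alternative, Keras provides the `sparse_categorical_crossentropy` loss, which accepts the integer labels directly and avoids building this matrix at all.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Vectorised one-hot encoding using integer array indexing."""
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes), dtype=np.uint8)
    encoded[np.arange(len(labels)), labels] = 1  # set one cell per row
    return encoded

toy_labels = [2, 0, 3]
encoded = one_hot(toy_labels, num_classes=4)
print(encoded)

# Taking the argmax of each row recovers the integer labels
print(encoded.argmax(axis=1))
```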

***
**T1.5** Building and training the model.

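Now you can construct your neural network. You should add the Embedding layer as the first layer. Words that do not have a pre-trained embedding are represented by all-zero rows in the weights matrix. Note that configuring mask_zero = True in the Embedding layer masks time steps whose input index is 0 (the value used for padding); it does not mask words based on their embedding vector.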


The model will be very similar to the model from the last practical, but instead of the Convolutional layer you will be using an LSTM layer. Please read about the different configurations of the LSTM layer [here](https://keras.io/api/layers/recurrent_layers/lstm/).

In [10]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Embedding

model = Sequential()
model.add(Embedding(input_dim=num_words,
              input_length = train_length,
              output_dim=300,
              weights=[embedding_matrix],
              trainable=False,
              mask_zero=True))

# Recurrent layer
model.add(LSTM(64,dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_words, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train,  y_train_array, batch_size=64, epochs=15, validation_data=(x_test, y_test_array))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
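Once the model is trained, calling `model.predict` on an array of shape (1, 20) returns one probability per vocabulary index. The sketch below shows how such a probability vector could be turned back into words. It uses a made-up probability vector and a small dictionary in place of the real model output and of `tokenizer.index_word`; `top_k_words` is a hypothetical helper name.

```python
import numpy as np

def top_k_words(probs, index_word, k=3):
    """Return the k most probable words with their probabilities, best first."""
    best = np.argsort(probs)[::-1][:k]  # indices sorted by decreasing probability
    return [(index_word.get(int(i), "<unknown>"), float(probs[i])) for i in best]

# Hypothetical softmax output over a 5-entry vocabulary (index 0 is unused)
index_word = {1: "the", 2: "drug", 3: "dose", 4: "works"}
probs = np.array([0.0, 0.1, 0.6, 0.05, 0.25])
print(top_k_words(probs, index_word, k=2))
```

Looking at the top few candidates rather than only the single best word usually gives a better feeling for what the model has learned.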


# Task 2: Sentiment Analysis with LSTM

Implement an LSTM neural network to solve the sentiment analysis problem from the last practical. You can explore different variants of the model (with pre-trained embeddings, with embeddings trained via the Embedding layer, with transfer learning from word2vec embeddings). To prevent the RNN from being trained on the padded values, you can configure mask_zero = True in the Embedding layer.

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('yelp_reviews.csv',encoding = "ISO-8859-1")

#select input and output variables
data = df.values[:,0]
labels = df.values[:,1]

x_train, x_test, y_train, y_test = train_test_split(data, labels,test_size=0.5, random_state=0)

Data pre-processing: encoding each entry from the train/test sets as a sequence of integers for the Embedding layer.

In [12]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=50000)
tokenizer.fit_on_texts(x_train)
sequences = tokenizer.texts_to_sequences(x_train)

length = []
for x in x_train:
    length.append(len(x.split()))
max(length)

num_words = len(tokenizer.word_index)+1

In [17]:
from keras.preprocessing.sequence import pad_sequences

x_train_seq = pad_sequences(sequences, maxlen=45)
sequences_val = tokenizer.texts_to_sequences(x_test)
x_test_seq = pad_sequences(sequences_val, maxlen=45)
x_test_seq=np.asarray(x_test_seq).astype(np.float32)
x_train_seq=np.asarray(x_train_seq).astype(np.float32)
y_test=np.asarray(y_test).astype(np.float32)
y_train=np.asarray(y_train).astype(np.float32)
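With its default arguments, `pad_sequences` pads with zeros at the front and also truncates longer sequences at the front. The NumPy sketch below imitates that behaviour on made-up sequences and shows which positions a zero-based mask would mark as padding. `pad_pre` is a hypothetical helper written for this illustration, not the Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep the last `maxlen` tokens of longer sequences."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]                  # truncate from the front
        if kept:
            padded[row, -len(kept):] = kept   # right-align the real tokens
    return padded

toy_sequences = [[5, 8], [3, 9, 4, 7, 2]]
padded = pad_pre(toy_sequences, maxlen=4)
mask = padded != 0                            # True where a real token is present
print(padded)
print(mask)
```

Positions where the mask is False are the ones that mask_zero = True tells the LSTM to skip.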

Generating the weight matrix with pre-trained word2vec embeddings.

In [18]:
num_words = len(tokenizer.word_index)+1
embedding_matrix = np.zeros((num_words, 300))
for word, i in tokenizer.word_index.items():
    if word in word2vec_vectors:
        embedding_vector = word2vec[word]
        embedding_matrix[i] = embedding_vector

In [19]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Embedding

In [20]:
model = Sequential()
e = Embedding(num_words, 300, weights=[embedding_matrix], input_length=45, trainable=False, mask_zero = True)
model.add(e)
model.add(LSTM(64, dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train_seq, y_train, validation_data=(x_test_seq, y_test), epochs=5, batch_size=32, verbose=2)

Epoch 1/5
16/16 - 4s - loss: 0.6790 - accuracy: 0.5442 - val_loss: 0.6498 - val_accuracy: 0.7108 - 4s/epoch - 231ms/step
Epoch 2/5
16/16 - 1s - loss: 0.5614 - accuracy: 0.7610 - val_loss: 0.5304 - val_accuracy: 0.7430 - 707ms/epoch - 44ms/step
Epoch 3/5
16/16 - 1s - loss: 0.4445 - accuracy: 0.8092 - val_loss: 0.4610 - val_accuracy: 0.8032 - 666ms/epoch - 42ms/step
Epoch 4/5
16/16 - 1s - loss: 0.3893 - accuracy: 0.8353 - val_loss: 0.4502 - val_accuracy: 0.7932 - 697ms/epoch - 44ms/step
Epoch 5/5
16/16 - 1s - loss: 0.3169 - accuracy: 0.8735 - val_loss: 0.4424 - val_accuracy: 0.8032 - 685ms/epoch - 43ms/step


<keras.callbacks.History at 0x1e8ac510a30>
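Accuracy alone does not show which kind of mistakes the model makes. As a possible next step, the sigmoid outputs can be thresholded and compared with the true labels. The sketch below uses made-up probabilities in place of `model.predict(x_test_seq)`; `binary_report` and the 0.5 threshold are assumptions chosen for illustration.

```python
import numpy as np

def binary_report(probs, y_true, threshold=0.5):
    """Threshold sigmoid outputs and count the four outcome types."""
    y_pred = (np.asarray(probs).ravel() >= threshold).astype(int)
    y_true = np.asarray(y_true).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())  # positive, predicted positive
    tn = int(((y_pred == 0) & (y_true == 0)).sum())  # negative, predicted negative
    fp = int(((y_pred == 1) & (y_true == 0)).sum())  # negative, predicted positive
    fn = int(((y_pred == 0) & (y_true == 1)).sum())  # positive, predicted negative
    return {"accuracy": (tp + tn) / len(y_true), "tp": tp, "tn": tn, "fp": fp, "fn": fn}

# Hypothetical model outputs of shape (4, 1) and matching true labels
toy_probs = np.array([[0.9], [0.2], [0.6], [0.4]])
toy_labels = np.array([1.0, 0.0, 0.0, 1.0])
report = binary_report(toy_probs, toy_labels)
print(report)
```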