<div style="height:100px">

<div style="display:inline-block; width:77%; vertical-align:middle;">
    <div>
        <b>Author</b>: <a href="http://pages.di.unipi.it/castellana/">Daniele Castellana</a>
    </div>
    <div>
        PhD student at the University of Pisa and member of the Computational Intelligence & Machine Learning Group (<a href="http://www.di.unipi.it/groups/ciml/">CIML</a>)
    </div>
    <div>
        <b>Mail</b>: <a href="mailto:daniele.castellana@di.unipi.it">daniele.castellana@di.unipi.it</a>
    </div>
</div>

<div style="display:inline-block; width: 10%; vertical-align:middle;">
    <img align="right" width="100%" src="https://upload.wikimedia.org/wikipedia/it/7/72/Stemma_unipi.png">
</div>

<div style="display:inline-block; width: 10%; vertical-align:middle;">
    <img align="right" width="100%" src="http://www.di.unipi.it/groups/ciml/Home_files/loghi/logo_ciml-restyling2018.svg">
</div>
</div>

# LSTM for Sentiment Classification

A typical NLP task is sentiment classification, which requires assigning a sentiment label to natural-language sentences.

We demonstrate sequence classification on the IMDB movie review sentiment problem. Each movie review is a variable-length sequence of words, and the sentiment of each review must be classified.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly-polar movie reviews (good or bad) for training and the same amount again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment.

Keras ships the IMDB dataset built in: the imdb.load_data() function loads it in a format that is ready for use in neural network and deep learning models.

However, when we work with natural-language data, we should put considerable effort into data pre-processing, since most real data are not clean.

Hence, we will start from the raw data ([download](https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch09/movie_data.csv.gz)).

When everything is clear, **try to improve the model performance using what you have learnt during this course!**


## Read the data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/IMDB_movie_data.csv', encoding='utf-8')
df.head(3)


Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


In [2]:
df.shape

(50000, 2)

In [3]:
x_train = df.loc[:25000-1, 'review'].values
y_train = df.loc[:25000-1, 'sentiment'].values

x_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

print('The input shape is {}\n'
      'The output shape is {}'.format(x_train.shape, y_train.shape))

The input shape is (25000,)
The output shape is (25000,)


In [4]:
x_train[0]

'In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich famil

## Clean the data

As the example above shows, the text contains many undesirable tokens, such as HTML tags.

We use regular expressions to remove the br tags along with punctuation and brackets, so that the input text becomes a plain sequence of words.

In [5]:
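# we need to clean the data
import re

# raw strings avoid invalid-escape warnings in recent Python versions
# characters that are deleted outright
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
# double br tags, hyphens and slashes are replaced by a single space
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")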

def clean_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    
    return reviews

In [6]:
x_train = clean_reviews(x_train)
x_test = clean_reviews(x_test)

In [7]:
x_train[0]

'in 1974 the teenager martha moxley maggie grace moves to the high class area of belle haven greenwich connecticut on the mischief night eve of halloween she was murdered in the backyard of her house and her murder remained unsolved twenty two years later the writer mark fuhrman christopher meloni who is a former la detective that has fallen in disgrace for perjury in oj simpson trial and moved to idaho decides to investigate the case with his partner stephen weeks andrew mitchell with the purpose of writing a book the locals squirm and do not welcome them but with the support of the retired detective steve carroll robert forster that was in charge of the investigation in the 70s they discover the criminal and a net of power and money to cover the murder murder in greenwich is a good tv movie with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a kennedy the powerful and rich family used their influence to cover the murde
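
The effect of the two substitutions is easier to follow on a short string. The sketch below reuses the same two patterns on a made-up review (`toy_review` and the helper `clean_review` are illustrative names, not part of the notebook): punctuation is deleted first, then the double br tag, hyphens and slashes become spaces.

```python
import re

# Same two patterns as in the cleaning cell, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def clean_review(text):
    """Lower-case the text, delete punctuation, then turn tags, hyphens and slashes into spaces."""
    text = REPLACE_NO_SPACE.sub("", text.lower())
    return REPLACE_WITH_SPACE.sub(" ", text)

toy_review = "Great movie!<br /><br />A must-see (8/10)."  # made-up example
cleaned = clean_review(toy_review)
print(cleaned)  # great movie a must see 8 10
```

Note that the tag pattern only matches two consecutive br tags, which is how line breaks appear in the review shown above.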

## Create the word vocabulary

An LSTM cannot process words directly. Hence, we need to map each word to an integer that represents the index of that word in a vocabulary.

The vocabulary construction and the translation from sequences of words to sequences of integers can be done easily using the Keras Tokenizer ([doc page](https://keras.io/preprocessing/text/)).

To speed up training, we keep only the 10000 most frequent words. This option is enabled by passing the parameter **num_words** to the tokenizer constructor. Moreover, we specify the special token OOV to represent all the words that are not in the vocabulary.

In [8]:
from keras.preprocessing.text import Tokenizer

# we use only the 10k most frequent words
num_words = 10000
tokenizer_obj = Tokenizer(num_words=num_words, oov_token='OOV')
# the tokenizer will not use the idx 0 because it will be used to pad sequences
all_reviews = x_train + x_test
tokenizer_obj.fit_on_texts(all_reviews)

Using TensorFlow backend.


In [9]:
x_train_seq = tokenizer_obj.texts_to_sequences(x_train)
x_test_seq = tokenizer_obj.texts_to_sequences(x_test)

#the text becomes an integer sequence!
np.array(x_train_seq[0])

array([   8, 5774,    2, 2191, 4079,    1, 4942, 1608, 1114,    6,    2,
        297,  707, 1597,    5, 7024,    1,    1, 8719,   20,    2,    1,
        311, 3611,    5, 2118,   59,   13, 1909,    8,    2, 8720,    5,
         41,  320,    3,   41,  590, 4640,    1, 1699,  104,  149,  305,
          2,  569,  966,    1, 1467,    1,   37,    7,    4, 1105,  982,
       1354,   12,   45, 3019,    8, 5880,   15,    1,    8,    1, 5748,
       3161,    3, 1634,    6,    1, 1089,    6, 3882,    2,  420,   16,
         24, 1959, 1791, 2237, 3803, 3505,   16,    2, 1264,    5,  488,
          4,  274,    2, 5380,    1,    3,   77,   21, 2550,   92,   18,
         16,    2, 1438,    5,    2, 4993, 1354, 1337, 9770,  614,    1,
         12,   13,    8, 2760,    5,    2, 3465,    8,    2,  977,   34,
       1837,    2, 1777,    3,    4, 5515,    5,  657,    3,  290,    6,
        992,    2,  590,  590,    8,    1,    7,    4,   49,  229,   17,
         16,    2,  286,   64,    5,    4,  590,   

In [10]:
# we can revert the process
tokenizer_obj.sequences_to_texts(x_train_seq[:1])

['in 1974 the teenager martha OOV maggie grace moves to the high class area of belle OOV OOV connecticut on the OOV night eve of halloween she was murdered in the backyard of her house and her murder remained OOV twenty two years later the writer mark OOV christopher OOV who is a former la detective that has fallen in disgrace for OOV in OOV simpson trial and moved to OOV decides to investigate the case with his partner stephen weeks andrew mitchell with the purpose of writing a book the locals OOV and do not welcome them but with the support of the retired detective steve carroll robert OOV that was in charge of the investigation in the 70s they discover the criminal and a net of power and money to cover the murder murder in OOV is a good tv movie with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a kennedy the powerful and rich family used their influence to cover the murder for more than twenty years however a OOV de
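
To make the mapping less opaque, here is a minimal pure-Python sketch of a frequency-ranked vocabulary with an OOV token. It is not the Keras implementation: the helpers `build_word_index` and `to_sequences` are hypothetical, and the index layout (0 reserved for padding, 1 for OOV, real words from 2 upwards, only indices below `num_words` kept) is an assumption chosen to be consistent with the outputs above, where every OOV word is encoded as 1.

```python
from collections import Counter

def build_word_index(texts, oov_token="OOV"):
    """Rank words by frequency; index 0 is left free for padding, index 1 is OOV."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, num_words, oov_token="OOV"):
    """Replace each word by its index; rare or unseen words map to the OOV index."""
    oov_idx = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.split():
            idx = word_index.get(word, num_words)  # unseen words fall outside the limit
            seq.append(idx if idx < num_words else oov_idx)
        sequences.append(seq)
    return sequences

toy_texts = ["the movie was good", "the movie was bad", "the movie", "the plot"]
toy_index = build_word_index(toy_texts)
toy_seqs = to_sequences(toy_texts + ["a plot"], toy_index, num_words=5)
print(toy_seqs[0])   # [2, 3, 4, 1]
print(toy_seqs[-1])  # [1, 1]
```

With `num_words=5` only "the", "movie" and "was" keep their own index; "good" is known but too rare, and "a" was never seen, so both collapse to OOV.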

## Sequence alignment

The last step of the data preparation is sequence alignment, since the Keras LSTM works on batches of sequences that all have the same length.

Again, Keras helps: the **pad_sequences()** function pads the short sequences and trims the long ones. The value used for the padding is 0, which the tokenizer never assigns to a word during the vocabulary construction.

In [11]:
# each sequence has a different length
from keras.preprocessing.sequence import pad_sequences

# find the max len
# maxlen = max([len(l) for l in x_train_seq+x_test_seq]) = 2473 is too large
# we fix the maxlen to 500
maxlen = 500
# the value used for padding is 0
x_train_seq = pad_sequences(x_train_seq, maxlen=maxlen)
x_test_seq = pad_sequences(x_test_seq, maxlen=maxlen)


print('The training shape is {}'.format(x_train_seq.shape))

The training shape is (25000, 500)
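
The padding step can be sketched with NumPy alone. The helper below (`pad_pre`, a hypothetical name) pads and truncates at the start of each sequence, which, to the best of my knowledge, matches the default `padding='pre'` and `truncating='pre'` behaviour of `pad_sequences`; it is an illustration, not a replacement for the Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy, the row stays all padding
        kept = seq[-maxlen:]
        out[row, -len(kept):] = kept
    return out

toy_padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9], []], maxlen=4)
print(toy_padded.tolist())  # [[0, 1, 2, 3], [6, 7, 8, 9], [0, 0, 0, 0]]
```

Padding on the left keeps the real words next to the end of the sequence, which is where the LSTM produces the state used for classification.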


## Use the embedding layer

We use an embedding layer to map each word id to a vector with 100 features.

The embedding layer can be easily defined using the Keras **Embedding** layer; when constructing it we need to specify the number of words and the size of the embedding ([doc page](https://keras.io/layers/embeddings/)).

In [12]:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

embedding_dim = 100

def build_model():
    
    model = Sequential()
    
    # the 0 is the padding
    model.add(Embedding(num_words, embedding_dim, mask_zero=True))
    model.add(LSTM(128))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        
    return model
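
Conceptually, an embedding layer is a trainable lookup table: row `i` of a matrix is the vector of word id `i`. The NumPy sketch below shows only that lookup and the padding mask on toy sizes; all `toy_*` names are illustrative and the random matrix stands in for weights that Keras would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_dim = 6, 4

# One row per word id; in the real layer these values are learned.
toy_embeddings = rng.normal(size=(toy_vocab_size, toy_dim))

# Two left-padded sequences of word ids (0 is the padding id).
toy_batch = np.array([[0, 2, 5],
                      [0, 0, 3]])

# Integer-array indexing performs the lookup for the whole batch at once.
toy_vectors = toy_embeddings[toy_batch]

# Positions holding the padding id are the ones a mask would tell the LSTM to skip.
toy_mask = toy_batch != 0

print(toy_vectors.shape)  # (2, 3, 4)
print(toy_mask.tolist())  # [[False, True, True], [False, False, True]]
```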


## Fit the data

In [13]:
LSTM_class = build_model()
print(LSTM_class.summary())

# we use less data to speed-up the computation
LSTM_class.fit(x_train_seq[:5000,:], y_train[:5000], validation_data=(x_test_seq[:1000,:], y_test[:1000]), epochs=10, batch_size=256, verbose=1);

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 100)         1000000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 1,117,377
Trainable params: 1,117,377
Non-trainable params: 0
_________________________________________________________________
None
Instructions for updating:
Use tf.cast instead.
Train on 5000 samples, validate on 1000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
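
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch assumes the standard LSTM parameterisation (four gates, each with input weights, recurrent weights and a bias); the variable names are illustrative.

```python
num_words, embedding_dim, units = 10000, 100, 128

# Embedding: one vector of size embedding_dim per word id.
embedding_params = num_words * embedding_dim

# LSTM: 4 gates, each with (input + recurrent) weights and one bias per unit.
lstm_params = 4 * ((embedding_dim + units) * units + units)

# Dense: one weight per LSTM unit plus a bias, for a single output.
dense_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 1000000 117248 129 1117377
```

Almost 90% of the parameters sit in the embedding matrix, which is one reason why initialising it from pretrained vectors can matter.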


## Check the learned embeddings

In [14]:
# this is the embedding matrix
emb_matrix = LSTM_class.layers[0].get_weights()[0]
emb_matrix.shape

(10000, 100)

In [15]:
# find most similar embeddings to the words 'horrible' and 'fantastic'
from sklearn.metrics.pairwise import cosine_similarity

horrible_emb = emb_matrix[tokenizer_obj.word_index['horrible'],:]
fantastic_emb = emb_matrix[tokenizer_obj.word_index['fantastic'],:]

emb_sim = cosine_similarity(emb_matrix, np.stack((horrible_emb, fantastic_emb), axis=0))

emb_sim.shape

(10000, 2)

In [16]:
# the 5 most similar words to 'horrible'
[tokenizer_obj.index_word[i] for i in np.flip(np.argsort(emb_sim[:,0])[-5:])]

['horrible', '4', 'unclear', 'boring', 'avoid']

In [17]:
# the 5 most similar words to 'fantastic'
[tokenizer_obj.index_word[i] for i in np.flip(np.argsort(emb_sim[:,1])[-5:])]

['fantastic', 'heights', 'images', 'wonderful', 'ages']
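
The nearest-neighbour search above relies on cosine similarity, i.e. the dot product of two vectors divided by the product of their norms. A minimal NumPy version on a toy matrix is sketched below; `most_similar` and `toy_emb` are hypothetical names, and the code assumes no row is all zeros.

```python
import numpy as np

def most_similar(emb_matrix, query_idx, k=5):
    """Return the indices of the k rows with the highest cosine similarity to row query_idx."""
    norms = np.linalg.norm(emb_matrix, axis=1)
    sims = emb_matrix @ emb_matrix[query_idx] / (norms * norms[query_idx])
    # argsort is ascending, so sort the negated similarities
    return np.argsort(-sims)[:k]

toy_emb = np.array([[1.0, 0.0],    # query
                    [0.9, 0.1],    # almost the same direction
                    [0.0, 1.0],    # orthogonal
                    [-1.0, 0.0]])  # opposite direction
nearest = most_similar(toy_emb, query_idx=0, k=3)
print(nearest)  # [0 1 2]
```

As in the outputs above, the query word is always its own nearest neighbour, so the informative results start from the second position.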

## How can we use a pretrained embedding in Keras?

There are a lot of word-embeddings that are already pretrained on huge text corpora. In some cases, using the pretrained word-embeddings can boost our model performance.

## GloVe embeddings

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

More information and downloads are available on the [project page](https://nlp.stanford.edu/projects/glove/).

In [18]:
# first, we read the .txt file which contains the embeddings and store its content in the dictionary word2embedding
import os
word2embedding = {}
# we keep embedding_dim = 100
with open(os.path.join('data', 'glove_embs', 'glove.6B.100d.txt'), 'r', encoding='utf8') as f:
    # each line holds a word followed by its embedding coefficients;
    # iterating over the file avoids loading all lines in memory at once
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word2embedding[word] = coefs

print('Found {} word embeddings.'.format(len(word2embedding)))

Found 400000 word embeddings.


In [19]:
# retrieve the idx2word vocabulary from the tokenizer
idx2word = tokenizer_obj.index_word

glove_emb_matrix = np.random.randn(num_words, embedding_dim)

not_found = 0
#we start from 1 to skip the 0 padding
for i in range(1,num_words):
    # retrieve the word
    w = idx2word[i]
    
    # check whether a pretrained embedding exists
    if w in word2embedding:
        glove_emb_matrix[i,:] = word2embedding[w]
    else:
        not_found = not_found+1

print('The embedding matrix has shape {}.\n'
      '{} embeddings not found.'.format(glove_emb_matrix.shape, not_found))

The embedding matrix has shape (10000, 100).
45 embeddings not found.
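
The two cells above can be tried without downloading the GloVe files. The sketch below parses a tiny in-memory text in the same "word followed by coefficients" layout and fills a matrix for a toy vocabulary. Everything prefixed with `toy_` is made up for illustration, and missing words are left as zero rows here, whereas the notebook initialises them randomly.

```python
import io
import numpy as np

# Stand-in for a GloVe text file: one word per line, followed by its coefficients.
toy_glove_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")

toy_word2embedding = {}
for line in toy_glove_file:
    word, *coefs = line.split()
    toy_word2embedding[word] = np.asarray(coefs, dtype="float32")

# Toy index-to-word vocabulary; index 0 is reserved for padding.
toy_index_word = {1: "OOV", 2: "good", 3: "bad", 4: "meh"}
toy_dim = 3

toy_matrix = np.zeros((len(toy_index_word) + 1, toy_dim), dtype="float32")
missing = []
for idx, word in toy_index_word.items():
    if word in toy_word2embedding:
        toy_matrix[idx] = toy_word2embedding[word]
    else:
        missing.append(word)

print(toy_matrix.shape, missing)  # (5, 3) ['OOV', 'meh']
```

Collecting the missing words, rather than only counting them, makes it easy to see whether they are typos, rare names or tokens such as OOV that can never have a pretrained vector.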


In [20]:
from keras.initializers import Constant

def build_model_with_glove_embs():
    model = Sequential()
    
    model.add(Embedding(num_words, embedding_dim, mask_zero=True,
                        embeddings_initializer=Constant(glove_emb_matrix), trainable=True))
    model.add(LSTM(128))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        
    return model

In [21]:
glove_LSTM_class = build_model_with_glove_embs()
print(glove_LSTM_class.summary())

# we use less data to speed-up the computation
glove_LSTM_class.fit(x_train_seq[:5000,:], y_train[:5000], validation_data=(x_test_seq[:1000,:], y_test[:1000]), epochs=10, batch_size=256, verbose=1);

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 100)         1000000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               117248    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 1,117,377
Trainable params: 1,117,377
Non-trainable params: 0
_________________________________________________________________
None
Train on 5000 samples, validate on 1000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
