# Sentiment Classification with ELMo

We want to predict sentiment on the IMDB movie review dataset. For every review we keep only the first 100 words: we want to check whether most of the sentiment is expressed at the beginning of a review rather than at the end.

In this dataset, sentiment is binary (positive or negative), encoded as 0 or 1. (A version with the 1-to-10 rating scale also exists.)


- [TensorFlow Hub](https://www.tensorflow.org/hub): a hub of models
- [ELMo](https://tfhub.dev/google/elmo/2) webpage



### In this notebook, you will complete the following TODOs
- Always try to read all of the code and understand it.
- Exercise with early stopping and hyperparameter search.

In [1]:
import os, sys

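def add_path(path):
    """Append the absolute version of `path` to sys.path if it is not already there."""
    module_path = os.path.abspath(path)
    if module_path not in sys.path:
        sys.path.append(module_path)


# Make the local helper libraries importable.
add_path('../../pythonlibs/embeddings')
add_path('../../pythonlibs')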


Uncomment and run one of these lines to install tensorflow_hub, using either pip or conda.


In [2]:
#!{sys.executable} -m pip install tensorflow_hub
#!conda install --yes --prefix {sys.prefix} -c conda-forge tensorflow-hub

In [3]:
import os
import numpy as np
import re
import tensorflow_hub as hub
import tensorflow as tf
from imdb.helper import get_imdb_reviews_dataset
from elmo.helper import get_elmo_embeddings_layer, transform_imdb_dataset, loss_pass


W0509 00:01:02.011122 140428235376448 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


## 1. Load the dataset

 - See the `imdb.helper` module for the code of the loading function.

In [4]:

dataset_folder = '../../../data/aclImdb'


(train_x, train_y), (test_x, test_y) = get_imdb_reviews_dataset(path=dataset_folder, max_dataset_size=6000, trunc=100)


x, y = np.vstack([train_x, test_x]), np.vstack([train_y, test_y])


In [5]:
print(x)

[['i am very sorry that this charming and whimsical film which i first saw soon after it was first released in the early fifties has had such a poor reception more recently in my opinion it has been greatly underrated  but perhaps it appeals more to the european sense of humour than to for example the american maybe we in europe can understand and appreciate its subtleties and situations more since we are closer to some of them in real life particular mention should be made of the limited but good music  especially the catchy and memorable song']
 ['xizao is a rare little movie it is simple and undemanding and at the same time so rewarding in emotion and joy the story is simple and the theme of old and new clashing is wonderfully introduced in the first scenes this theme is the essence of the movie but it would have fallen flat if it was not for the magnificent characters and the actors portraying themthe aging patriarch master liu is a relic of chinas preexpansion days he runs a bath 

## 2. Transform the dataset from (string, class) to (embedding, string, class) using ELMo


ELMo is attached to the input placeholder.

We will have: **Input(string) -> ELMo -> Output(float x 1024)**

We use the 'default' output of ELMo, which, for a sentence with tokens k1, k2, ..., averages the contextualized embeddings of all tokens into a single 1024-dimensional vector.
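To make the averaging concrete, here is a minimal NumPy sketch of mean pooling over token embeddings. It only illustrates the idea and is not the internal implementation of the TF Hub module; the function name `mean_pool` and the padding convention (padded positions are ignored through a `lengths` argument) are assumptions made for this example.

```python
import numpy as np

def mean_pool(token_embeddings, lengths):
    """Average the token embeddings of each sentence, ignoring padding.

    token_embeddings: array of shape [batch, max_tokens, dim]
    lengths: number of real (non-padding) tokens in each sentence, shape [batch]
    Returns an array of shape [batch, dim].
    """
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    lengths = np.asarray(lengths)
    max_tokens = token_embeddings.shape[1]
    # mask[b, t] is True when token t of sentence b is a real token
    mask = np.arange(max_tokens)[None, :] < lengths[:, None]
    summed = (token_embeddings * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]

# Two hypothetical "sentences" with 5 and 3 tokens, each token a 1024-d vector
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 1024))
pooled = mean_pool(tokens, lengths=[5, 3])
print(pooled.shape)  # (2, 1024)
```

Each review is therefore represented by one fixed-size vector, whatever its length, which is what lets a plain feed-forward classifier consume it.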

In [6]:
batch_size = 1

sess = tf.Session()

# Placeholder for a batch of raw review strings; ELMo is applied on top of it.
input_text = tf.placeholder(shape=(batch_size, ), dtype='string')
elmo = get_elmo_embeddings_layer(input_text)


# Initialize the lookup tables and the variables of the graph.
sess.run([
    tf.tables_initializer(), 
    tf.global_variables_initializer()
])


# We pass the dataset to ELMo so that it generates the latent representations of the reviews.
# Set load = False to generate the embeddings with ELMo. It takes a lot of time.
# Set load = True to load a pre-processed dataset (given) from the hard disk.
embs = transform_imdb_dataset(sess, input_text, elmo, x, batch_size, load=True, npy_path='./imdb_embs.npy')


assert len(embs) == len(x) == len(y)

# embs, x and y are row-aligned


perc_train = .8


row_split = int(perc_train * len(x))


# We shuffle the dataset
randind = np.random.permutation(x.shape[0])

x = x[randind]
y = y[randind, :]
embs = embs[randind, :]


train_embed, train_labels, train_values = embs[:row_split, :], y[:row_split], x[:row_split, :]
test_embed, test_labels, test_values    = embs[row_split:, :], y[row_split:], x[row_split:, :]



## 3. Specify a classifier that will use ELMo embeddings

It is a standard feed-forward network. It is not connected directly to ELMo: it takes the precomputed embeddings as input.

We separate the classifier from the generation of the representations to reduce training time: the embeddings are computed once and reused in every epoch.
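The speed-up comes from computing the embeddings once and caching them on disk. The sketch below shows this compute-once, load-afterwards pattern in isolation. The function `load_or_compute_embeddings` and its `compute_fn` argument are hypothetical names for this example; the actual helper `transform_imdb_dataset` may be implemented differently.

```python
import os
import numpy as np

def load_or_compute_embeddings(npy_path, compute_fn):
    """Return cached embeddings if they exist, otherwise compute and cache them.

    npy_path: path of the cache file; it should end with '.npy', because
              np.save appends that extension when it is missing.
    compute_fn: zero-argument callable that runs the expensive embedding step
                and returns an array of shape [n_reviews, dim].
    """
    if os.path.exists(npy_path):
        # Fast path: reuse the embeddings computed in a previous run
        return np.load(npy_path)
    embeddings = np.asarray(compute_fn())
    np.save(npy_path, embeddings)
    return embeddings
```

With this pattern the expensive ELMo pass runs a single time, and every later training run or hyperparameter trial only pays for the small classifier.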

In [7]:
# Specifications of the classifier

input_embed  = tf.placeholder(dtype='float', shape=(None, 1024), name='input_emb')
input_labels = tf.placeholder(dtype='float',  shape=(None, 2), name='input_lbl')

dense1 = tf.layers.dense(input_embed, 500, activation='sigmoid')

prob = tf.placeholder_with_default(.0, shape=())
dout1  = tf.layers.dropout(dense1, rate=prob, training=True)

output = tf.layers.dense(dout1, 2, activation='linear')


# Loss function and optimization criteria

loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=output, labels=input_labels)
opt = tf.train.RMSPropOptimizer(1e-3)
opt_op = opt.minimize(loss)



## 4. Train the model

We train the network defined above with the following parameters.

In [8]:
# Now we train the model!

epochs = 20
batch_size = 16


# Feed dictionaries
train_feeds = {input_embed: train_embed, input_labels: train_labels}
test_feeds  = {input_embed: test_embed,  input_labels: test_labels}


sess.run([
    tf.tables_initializer(), 
    tf.global_variables_initializer()
])

for e in range(epochs):

    train_loss = loss_pass(train_feeds, batch_size, sess, loss, train_op=opt_op, log=False)
    test_loss  = loss_pass(test_feeds,  batch_size, sess, loss)

    print('Epoch: {:>6}/{:<6} Train loss: {} Test loss: {}'.format(e+1, epochs, train_loss, test_loss))




Epoch:      1/20     Train loss: 0.7266153552134832 Test loss: 0.7168813848495483
Epoch:      2/20     Train loss: 0.6692285135388374 Test loss: 0.6414126340548197
Epoch:      3/20     Train loss: 0.6419026438395182 Test loss: 0.6295867749055226
Epoch:      4/20     Train loss: 0.629452684322993 Test loss: 0.6258252811431885
Epoch:      5/20     Train loss: 0.6201454809308052 Test loss: 0.6236344202359517
Epoch:      6/20     Train loss: 0.6113301544388136 Test loss: 0.6226768004894256
Epoch:      7/20     Train loss: 0.602236191034317 Test loss: 0.6225941967964173
Epoch:      8/20     Train loss: 0.5929680331548055 Test loss: 0.6227985227108002
Epoch:      9/20     Train loss: 0.5836559851964315 Test loss: 0.6231621913115184
Epoch:     10/20     Train loss: 0.5741625733176867 Test loss: 0.6238262462615967
Epoch:     11/20     Train loss: 0.5643218252062797 Test loss: 0.6248316756884257
Epoch:     12/20     Train loss: 0.5539860497911772 Test loss: 0.6261344528198243
Epoch:     13/20  

## 5. Predict

Now we print some predictions on the test set.
- Can you guess which class is which?

In [9]:
probabilities = tf.nn.softmax(output)


# Iterate over the shuffled test split (test_embed / test_values), not over test_x:
# test_x is the unshuffled split returned by the loader, so it is not row-aligned
# with test_embed and has more rows than it.
for i in range(len(test_embed)):
    
    result = sess.run(probabilities, 
                      feed_dict={
                          input_embed: test_embed[i, :].reshape(1, -1),
                          prob: .0
                      })
    
    print(f'{str(test_values[i])} \n Class 0: {result[0][0]} Class 1: {result[0][1]}')
    print('\n')
    




['lets put political correctness aside and just look at this in terms of the numerous sex comedies that came out in the 1980s because i for one do not think this is any better or any worse than the others unless your some religious kook or an uptight female you can probably view a silly film such as this without getting all worked up about the content and i personally had a totally innocuous feeling towards this before and after watching it story is set in albuquerque new mexico where a rich 15 year old boy named phillip philly fillmore'] 
 Class 0: 0.003334630513563752 Class 1: 0.9966654181480408


['i was young film student in 1979 when the union of the soviet filmmakers came to sofia bulgaria and premiered konchalovskys siberiade tarkosvkys stalker and daneliaa autumn marathon i was stunned by the cosmopolitan dimension of the art form then and only then i saw siberiade 4 and 12 hours epic and was speechless way better then bertoluccis 1900 by farhope andron will somehow get to the 

## 6. Exercises


### Early stopping

Design an early stopping criterion and implement it.


#### From Wikipedia:

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set (*the test set*). Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation. 
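As a starting point for the exercise, here is a minimal sketch of one common criterion, patience-based early stopping: stop when the monitored loss has not improved for `patience` consecutive epochs. The function name and the `patience` and `min_delta` parameters are choices made for this example, not part of the notebook's helpers. The sample losses are the test losses from the training log above, rounded to four decimals; ideally the monitored loss would come from a validation set kept separate from the final test set.

```python
def early_stopping_epoch(losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-based.

    losses: monitored loss after each epoch (e.g. validation loss)
    patience: number of consecutive epochs without improvement before stopping
    min_delta: minimum decrease that counts as an improvement
    """
    best_loss = float('inf')
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # Training ended before the patience ran out
    return best_epoch, len(losses)

# Test losses of epochs 1-12 from the training log, rounded
logged_losses = [0.7169, 0.6414, 0.6296, 0.6258, 0.6236, 0.6227,
                 0.6226, 0.6228, 0.6232, 0.6238, 0.6248, 0.6261]
print(early_stopping_epoch(logged_losses, patience=3))  # (7, 10)
```

In a real training loop the same logic runs inside the epoch loop: save the weights whenever a new best is found, break when the patience is exhausted, and restore the saved weights afterwards.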


### Hyperparameter search
Design an algorithm that brute-forces all combinations of the selected hyperparameters, keeping only the combination with the best performance.
- Select some parameters, such as layer size, learning rate, and dropout
- Select a min value, a max value, and a step for each parameter
- Run the training procedure inside nested loops, storing the final loss
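The steps above can be sketched as follows. `itertools.product` replaces the hand-written nested loops. The names `value_range`, `grid_search` and `toy_loss` are hypothetical; `toy_loss` only stands in for a function that would build the classifier, train it with the given hyperparameters, and return its final loss.

```python
import itertools

def value_range(min_value, max_value, step):
    """List the values min_value, min_value + step, ..., up to max_value."""
    n = int(round((max_value - min_value) / step)) + 1
    return [round(min_value + i * step, 10) for i in range(n)]

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the one with the lowest loss.

    param_grid: dict mapping a parameter name to the list of values to try
    evaluate: callable taking the parameters as keyword arguments and
              returning the final loss of a training run
    """
    names = sorted(param_grid)
    best_params, best_loss = None, float('inf')
    for combo in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        loss = evaluate(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

def toy_loss(hidden_units, dropout):
    # Placeholder objective (assumption): replace with a real training run
    return (hidden_units - 300) ** 2 * 1e-6 + (dropout - 0.2) ** 2

grid = {
    'hidden_units': [100, 300, 500],
    'dropout': value_range(0.0, 0.4, 0.2),
}
best_params, best_loss = grid_search(grid, toy_loss)
print(best_params, best_loss)
```

The number of training runs is the product of the grid sizes, so the search grows quickly with every added parameter. Reusing the cached ELMo embeddings keeps each run cheap.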


#### From Wikipedia:

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned. 


### Can you draw any conclusion on the efficacy of the model?


## 7. Conclusions

This notebook contains code to load a dataset, process it with ELMo, and train a classifier on the resulting representations. This is transfer learning in practice!