# Transfer Learning

This notebook describes how to apply a previously trained sentiment analysis model to a new data set. The model used here was trained earlier on the [Sentiment140](http://help.sentiment140.com/home) data. For details on building and training that model, see the Python code files in this project, namely <i>01_read_data.py</i>, <i>02_pre-processing_and_model_training.py</i> and <i>3_model_training_on_sentiment140_notebook</i>. The model is then adapted to make predictions on the [IMDB Movie Reviews Dataset](https://www.tensorflow.org/datasets/catalog/imdb_reviews) from TensorFlow. This process of taking a previously trained network and customizing it for a new task is called <b>Transfer Learning</b>. This notebook demonstrates the following steps:

1. Download and prepare the data. We reuse the same Tokenizer that was fitted on the Sentiment140 data.


2. Build the model:

    a) Load the pre-trained base model
    
    b) Stack the new classification layers on top
    
    
3. Train the new model, evaluate and make predictions.


The following work will look fairly standard to anyone who has trained machine learning models in Python Jupyter notebooks. The CML platform provides a fully capable Jupyter notebook environment that data scientists know and love.

In [1]:
import csv
import tensorflow as tf
print("Tensorflow Version: ", tf.__version__)
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers
from tensorflow.keras import Model
import h5py
import pickle

Tensorflow Version:  2.2.0


### Download the IMDB dataset from TF

The IMDB Movie Reviews dataset is a collection of 50,000 highly polar movie reviews, split into training and test sets with 25,000 reviews in each set. It is available through the TensorFlow Datasets package (<i>tensorflow_datasets</i>), and is easy to download and import.

In [2]:
#!pip3 install tensorflow_datasets

In [3]:
import tensorflow_datasets as tfds

In [4]:
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

In [5]:
print(type(imdb))
print(type(info))

<class 'dict'>
<class 'tensorflow_datasets.core.dataset_info.DatasetInfo'>


### Read and prepare data

In this section, we read the data from the IMDB set and examine what the training and test sets look like in terms of data size, labels, etc. Both sets have only two unique labels, 0 and 1, which indicates that the data is built for a binary classification problem. We also decode each review as UTF-8, in order to avoid trouble with Python string handling later on.
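As a small illustration of the decoding step, the sketch below decodes raw bytes as UTF-8 and substitutes a replacement character for any invalid byte instead of raising an error. The helper name `decode_review` and the choice of `errors="replace"` are assumptions made for this example; the notebook itself calls `.decode('utf8')` directly.

```python
def decode_review(raw: bytes) -> str:
    """Decode a raw review, replacing invalid UTF-8 bytes instead of raising."""
    return raw.decode("utf-8", errors="replace")


# A valid UTF-8 byte string decodes unchanged
print(decode_review("café".encode("utf-8")))

# An invalid byte (0xff) becomes the Unicode replacement character U+FFFD
print(decode_review(b"bad \xff byte"))
```

Whether to replace, ignore, or fail on invalid bytes is a design choice; replacing keeps every review in the data set at the cost of a few unknown characters.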

In [6]:
train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

for s,l in train_data:
    training_sentences.append(s.numpy().decode('utf8'))
    training_labels.append(l.numpy())

for s,l in test_data:
    testing_sentences.append(s.numpy().decode('utf8'))
    testing_labels.append(l.numpy())

training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

In [7]:
testing_labels_final.shape

(25000,)

In [8]:
training_labels_final.shape

(25000,)

In [9]:
unique_labels = set(training_labels)
unique_labels

{0, 1}

In [10]:
unique_labels = set(testing_labels)
unique_labels

{0, 1}

In [11]:
training_sentences[0:2]

["This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
 'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development wa

In [12]:
testing_sentences[0:2]

["There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven't laughed this hard since I saw THE FULL MONTY. (And, even then, I don't think I laughed quite this hard... So to speak.) Tukel's talent is considerable: DING-A-LING-LESS is so chock full of double entendres that one would have to sit down with a copy of this script and do a line-by-line examination of it to fully appreciate the, uh, breadth and width of it. Every shot is beautifully composed (a clear sign of a sure-handed director), and the performances all around are solid (there's none of the over-the-top scenery chewing one might've expected from a film like this). DING-A-LING-LESS is a film whose time has come.",
 "A blackly comic tale of a down-trodden priest, Nazarin

### Pre-processing

In this section, we prepare the text data for training the model. <b>Tokenization</b> and <b>padding</b> are applied to the review text, but instead of fitting a new tokenizer we load the previously saved one. This is an important step: the embedding layer of the pre-trained model maps each word index to a learned vector, so the new data must use the same word-to-index mapping. In addition, if a new tokenizer were fitted each time, we would end up preprocessing the entire corpus every time, even to classify a single short sentence. By loading the previous tokenizer, we use the same word index that was created for the Sentiment140 data (please see <i>02_pre-processing_and_model_training.py</i>).
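To make the idea of a fixed word index concrete, here is a minimal pure-Python sketch. It is a simplified stand-in and not the Keras `Tokenizer` implementation: the miniature `toy_word_index`, the helper `to_sequence`, and the optional `oov_index` parameter are all hypothetical names introduced for illustration.

```python
import string

# Hypothetical miniature word index; the real one comes from the saved Tokenizer.
toy_word_index = {"i": 1, "to": 2, "the": 3, "a": 4, "movie": 5, "great": 6}


def to_sequence(text, word_index, oov_index=None):
    """Map a sentence to integer ids using a fixed word index.

    Words missing from the index are skipped, or mapped to `oov_index` if given.
    """
    # Lowercase and strip ASCII punctuation before splitting on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    ids = []
    for word in cleaned.split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_index is not None:
            ids.append(oov_index)
    return ids


print(to_sequence("The movie: GREAT!", toy_word_index))             # [3, 5, 6]
print(to_sequence("A superb movie", toy_word_index))                # [4, 5]
print(to_sequence("A superb movie", toy_word_index, oov_index=7))   # [4, 7, 5]
```

Because the index is fixed, a new sentence can be converted without touching the original corpus, and the ids it produces line up with the rows of the pre-trained embedding matrix.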

Another significant change for this data is the maximum sequence length. For the Sentiment140 data, we specified a maximum length of 16; as that data is composed of tweets, 16 words was a reasonable assumption. Here we have movie reviews, which are often longer and more descriptive, so we raise the limit to 100.

In [13]:
embedding_dim = 100
max_length = 100
padding_type='post'
trunc_type='post'
oov_tok = "<OOV>"


with open('../../models/sentiment140_tokenizer.pickle', 'rb') as handle:
    loaded_tokenizer = pickle.load(handle)

print('Words in the previously saved Tokenizer: ' + str(len(loaded_tokenizer.word_index)))

sequences = loaded_tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

word_index = loaded_tokenizer.word_index

testing_sequences = loaded_tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

Words in the previously saved Tokenizer: 690960


In [14]:
print(padded.shape)
print(testing_padded.shape)

(25000, 100)
(25000, 100)


In [15]:
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

n_items = take(20, word_index.items())
print(n_items)

[('i', 1), ('to', 2), ('the', 3), ('a', 4), ('my', 5), ('and', 6), ('you', 7), ('is', 8), ('it', 9), ('in', 10), ('for', 11), ('of', 12), ('on', 13), ('me', 14), ('so', 15), ('have', 16), ('that', 17), ('but', 18), ("i'm", 19), ('just', 20)]


In [16]:
print("Length of an example training sequence: " + str(len(sequences[1])))
print("Length of the example training sequence after padding: " + str(len(padded[1])))
print("Length of an example test sequence: " + str(len(testing_sequences[1])))
print("Length of the example test sequence after padding: " + str(len(testing_padded[1])))

Length of an example training sequence: 111
Length of the example training sequence after padding: 100
Length of an example test sequence: 275
Length of the example test sequence after padding: 100
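The lengths printed above (111 and 275 tokens, both reduced to 100) follow from post-truncation, while shorter sequences are filled up by post-padding. The sketch below reproduces that behaviour in plain Python for the `padding='post'`, `truncating='post'` case only; `pad_post` is a hypothetical helper written for illustration, not the Keras `pad_sequences` function.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate each sequence at the end and pad it at the end to length `maxlen`."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])                   # post-truncation: keep the first maxlen ids
        trimmed += [value] * (maxlen - len(trimmed))   # post-padding: fill the tail with `value`
        padded.append(trimmed)
    return padded


# One short and one long toy sequence, padded or truncated to length 4
print(pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4))
# [[5, 8, 2, 0], [7, 1, 9, 4]]
```

With post-truncation, only the first 100 tokens of a long review reach the model, so any sentiment expressed near the end of the review is discarded.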


### Load pre-trained Model

Once the training and test data are prepared, we load the previously trained model. In this example, we load the base model with its pretrained weights and take the output of its LSTM layer, which is where the new classification layers will be attached.

In [17]:
pre_trained_model = tf.keras.models.load_model('../../models/model_conv1D_LSTM_with_batch_100_epochs.h5')

# Show the base model architecture
pre_trained_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 16, 100)           69096100  
_________________________________________________________________
conv1d (Conv1D)              (None, 9, 32)             25632     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 4, 32)             0         
_________________________________________________________________
dropout (Dropout)            (None, 4, 32)             0         
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
flatten (Flatten)            (None, 32)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                5

In [18]:
last_layer = pre_trained_model.get_layer('lstm')
print('last layer output shape: ', last_layer.output_shape)

last layer output shape:  (None, 32)


In [19]:
last_output = last_layer.output
print(last_output)

Tensor("lstm/Identity:0", shape=(None, 32), dtype=float32)


### Build and Train new model

In this step, we build a new model using the base model loaded previously. The new model reuses the representations learned by the base model to extract meaningful features from the new data samples. On top of the pretrained layers we add a new classifier, whose layers are trained from scratch. Note that the code below does not freeze the base layers, so their weights are also updated during training. We then train the model for 50 epochs.

In [20]:
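# Stack a new classification head on the output of the pre-trained LSTM layer
x = layers.Flatten()(last_output)  # no-op here: the LSTM output is already (None, 32)
x = layers.Dense(32, activation='relu')(x)
x = layers.Dense(1, activation='sigmoid')(x)  # single probability for binary sentiment

model = Model(pre_trained_model.input, x)

# The string 'rmsprop' selects the RMSprop optimizer with its default settings
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['acc'])

model.summary()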

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_input (InputLayer) [(None, 16)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 16, 100)           69096100  
_________________________________________________________________
conv1d (Conv1D)              (None, 9, 32)             25632     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 4, 32)             0         
_________________________________________________________________
dropout (Dropout)            (None, 4, 32)             0         
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
flatten (Flatten)            (None, 32)                0     
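The parameter counts in the summary can be reproduced by hand, which is a useful check on what each layer contains. The sketch below assumes a Conv1D kernel size of 8 (inferred from the shapes, since 16 - 8 + 1 = 9, and not stated in this notebook) and an embedding vocabulary of 690960 words plus one extra row, presumably for the padding index 0. It also works out the sizes of the two new dense layers, which are not visible in the summary as shown here.

```python
vocab_size = 690960 + 1   # tokenizer words plus one extra row (assumed: padding index 0)
embedding_dim = 100
embedding_params = vocab_size * embedding_dim

# Conv1D: one (kernel_size x input_channels) filter per output channel, plus biases
kernel_size, conv_filters = 8, 32   # kernel size is inferred, not given in the notebook
conv_params = kernel_size * embedding_dim * conv_filters + conv_filters

# LSTM: four gates, each with input weights, recurrent weights and a bias
lstm_units = 32
lstm_params = 4 * ((conv_filters + lstm_units) * lstm_units + lstm_units)

# New head: Dense(32) on the 32 LSTM features, followed by Dense(1)
dense_32_params = 32 * 32 + 32
dense_1_params = 32 * 1 + 1

print("embedding:", embedding_params)   # 69096100
print("conv1d:   ", conv_params)        # 25632
print("lstm:     ", lstm_params)        # 8320
print("dense(32):", dense_32_params)    # 1056
print("dense(1): ", dense_1_params)     # 33
```

The embedding layer accounts for almost all of the parameters, while the new classification head adds only 1,089.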

In [21]:
num_epochs = 50
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final), verbose=2)


Epoch 1/50
782/782 - 14s - loss: 0.5968 - acc: 0.6714 - val_loss: 0.4999 - val_acc: 0.7548
Epoch 2/50
782/782 - 12s - loss: 0.5036 - acc: 0.7592 - val_loss: 0.5068 - val_acc: 0.7390
Epoch 3/50
782/782 - 12s - loss: 0.4631 - acc: 0.7823 - val_loss: 0.6284 - val_acc: 0.7106
Epoch 4/50
782/782 - 12s - loss: 0.4374 - acc: 0.7990 - val_loss: 0.5101 - val_acc: 0.7586
Epoch 5/50
782/782 - 12s - loss: 0.4166 - acc: 0.8100 - val_loss: 0.4255 - val_acc: 0.8010
Epoch 6/50
782/782 - 12s - loss: 0.3973 - acc: 0.8210 - val_loss: 0.4331 - val_acc: 0.7955
Epoch 7/50
782/782 - 12s - loss: 0.3840 - acc: 0.8266 - val_loss: 0.4194 - val_acc: 0.8087
Epoch 8/50
782/782 - 12s - loss: 0.3705 - acc: 0.8360 - val_loss: 0.6431 - val_acc: 0.7449
Epoch 9/50
782/782 - 12s - loss: 0.3601 - acc: 0.8386 - val_loss: 0.4207 - val_acc: 0.8061
Epoch 10/50
782/782 - 12s - loss: 0.3420 - acc: 0.8500 - val_loss: 0.4852 - val_acc: 0.7901
Epoch 11/50
782/782 - 12s - loss: 0.3310 - acc: 0.8554 - val_loss: 0.4378 - val_acc: 0.8092
Epoch 12

<tensorflow.python.keras.callbacks.History at 0x7f9a118ad390>
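The `782/782` shown for each epoch is the number of batches per epoch. Since no `batch_size` is passed to `model.fit`, Keras falls back to its default of 32 for array inputs, and the count can be checked with a short calculation:

```python
import math

num_samples = 25000   # training reviews
batch_size = 32       # Keras default when batch_size is not passed to model.fit

# The last batch is smaller than the others, hence the ceiling
steps_per_epoch = math.ceil(num_samples / batch_size)
print(steps_per_epoch)   # 782
```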

### Model Evaluation

#### Overall Training and Test Accuracies

In [22]:
loss, acc = model.evaluate(np.array(padded), np.array(training_labels_final), verbose=0)
print('Training Accuracy: %.2f%% ' % (acc*100))

Training Accuracy: 96.90% 


In [23]:
loss, acc = model.evaluate(np.array(testing_padded), np.array(testing_labels_final), verbose=0)
print('Test Accuracy: %.2f%% ' % (acc*100))

Test Accuracy: 77.54% 


#### Sentiment Prediction Examples

Before testing the model on example predictions, the input sentences need to be pre-processed into the same format as the training data, i.e. tokenized and padded. To evaluate the predictive capability of the trained model, we do the following:

1. Preprocess the records using the same Tokenizer as the training data.

2. Use the preprocessed data as the input vectors of the model, and compute the output vectors, i.e. the predicted classes and the confidence.

In this example, the model outputs the probability of the positive class, and a threshold of 0.5 on that probability performs the classification. If the predicted probability is 0.5 or below, we classify the statement as "negative", and as "positive" if it is above 0.5. We then display the probability of the predicted class as a percentage, as the <i>confidence</i> value for the model prediction.
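The thresholding rule can be sketched in isolation, without the model. The helper `label_with_confidence` is a hypothetical name used only for this illustration; it mirrors the strict `> 0.5` comparison used in `predict_sentiment`, so a probability of exactly 0.5 is labelled negative.

```python
def label_with_confidence(prob_positive, threshold=0.5):
    """Turn the probability of the positive class into a label and a confidence in percent."""
    if prob_positive > threshold:
        return "Positive", prob_positive * 100
    # For a negative prediction, the confidence is the probability of the negative class
    return "Negative", 100 - prob_positive * 100


print(label_with_confidence(0.75))   # ('Positive', 75.0)
print(label_with_confidence(0.25))   # ('Negative', 75.0)
```

Reporting the probability of the predicted class means the confidence never drops below 50%, whichever label is chosen.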

In [27]:
def text_prep(sent):
    print("Input sentence : " + sent)
    sent = np.array([sent])
    # Do not call fit_on_texts here: it would update the saved tokenizer's word counts
    # and could change the word index that the model was trained with.
    sequences = loaded_tokenizer.texts_to_sequences(sent)
    padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
    return padded

In [28]:
def predict_sentiment(sent):
    test_example = text_prep(sent)
    # Predict once; the sigmoid output is the probability of the positive class
    prob_positive = float(model.predict(test_example)[0][0])
    if prob_positive > 0.5:
        sentiment = 'Positive'
        conf = prob_positive * 100
    else:
        sentiment = 'Negative'
        conf = 100 - (prob_positive * 100)
    print("%s sentiment; %.2f%% confidence" % (sentiment, conf))

In [29]:
predict_sentiment("The movie was inspired!")

Input sentence : The movie was inspired!
Positive sentiment; 62.69% confidence


In [30]:
predict_sentiment("The movie is a masterpiece and the acting was brilliant")

Input sentence : The movie is a masterpiece and the acting was brilliant
Positive sentiment; 99.74% confidence


In [31]:
predict_sentiment("The story was convoluted and way too twisted")

Input sentence : The story was convoluted and way too twisted
Negative sentiment; 93.67% confidence


In [32]:
predict_sentiment("What a waste of time!")

Input sentence : What a waste of time!
Negative sentiment; 98.54% confidence


In [33]:
predict_sentiment("Nolan is the greatest director Ive known")

Input sentence : Nolan is the greatest director Ive known
Positive sentiment; 78.33% confidence


In [34]:
predict_sentiment("Shawshank might be a great movie, but you can't beat Interstellar!")

Input sentence : Shawshank might be a great movie, but you can't beat Interstellar!
Positive sentiment; 99.69% confidence


In [35]:
predict_sentiment("It was too slow!")

Input sentence : It was too slow!
Negative sentiment; 99.23% confidence


In [36]:
predict_sentiment("The Devil, indeed, was in the detail!")

Input sentence : The Devil, indeed, was in the detail!
Negative sentiment; 96.10% confidence


In [37]:
predict_sentiment("Everybody at the theatre could not wait for it to be over")

Input sentence : Everybody at the theatre could not wait for it to be over
Negative sentiment; 72.72% confidence


In [38]:
predict_sentiment("I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else.")

Input sentence : I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else.
Positive sentiment; 93.74% confidence


In [39]:
predict_sentiment("The film's last scenes, in which he casts doubt on his behaviour and, in a split second, has to choose between the life he has been leading or the conventional life that is expected of a priest, are so emotional because they concern his moral integrity and we are never quite sure whether it remains intact or not.")

Input sentence : The film's last scenes, in which he casts doubt on his behaviour and, in a split second, has to choose between the life he has been leading or the conventional life that is expected of a priest, are so emotional because they concern his moral integrity and we are never quite sure whether it remains intact or not.
Negative sentiment; 68.11% confidence


In [40]:
predict_sentiment("If you haven't seen this movie see it right now!")

Input sentence : If you haven't seen this movie see it right now!
Positive sentiment; 97.22% confidence


In [41]:
predict_sentiment("I went into this not expecting to much but I came out blown away.")

Input sentence : I went into this not expecting to much but I came out blown away.
Positive sentiment; 85.31% confidence


In [43]:
predict_sentiment("It is a movie worth seeing due to the fact that the workers put so much time and effort into this one film and they did not want to waste money putting up garbage.")

Input sentence : It is a movie worth seeing due to the fact that the workers put so much time and effort into this one film and they did not want to waste money putting up garbage.
Positive sentiment; 91.78% confidence
