![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/collab/Training/binary_text_classification/NLU_training_sentiment_classifier_demo_IMDB.ipynb)


# Training a Sentiment Analysis Classifier with NLU 
With the [SentimentDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#sentimentdl-multi-class-sentiment-analysis-annotator) from Spark NLP, you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
import os
from sklearn.metrics import classification_report
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu pyspark==2.4.7 > /dev/null  


import nlu

# 2. Download IMDB dataset
https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

The IMDB dataset contains 50,000 movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset with substantially more data than earlier benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether a review is positive or negative. The CSV file downloaded below contains 2,500 labeled reviews.
For more information about the dataset, see:
http://ai.stanford.edu/~amaas/data/sentiment/

In [None]:
! wget http://ckl-it.de/wp-content/uploads/2021/01/IMDB-Dataset.csv


--2021-01-16 09:07:54--  http://ckl-it.de/wp-content/uploads/2021/01/IMDB-Dataset.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3288450 (3.1M) [text/csv]
Saving to: ‘IMDB-Dataset.csv’


2021-01-16 09:07:56 (2.29 MB/s) - ‘IMDB-Dataset.csv’ saved [3288450/3288450]



In [None]:
import pandas as pd
train_path = '/content/IMDB-Dataset.csv'

train_df = pd.read_csv(train_path)
# the text data used for classification must be in a column named 'text'
# the label column must be named 'y' and be of type str
columns=['text','y']
train_df = train_df[columns]
train_df

Unnamed: 0,text,y
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
2495,Another great movie by Costa-Gavras. It's a gr...,negative
2496,Though structured totally different from the b...,positive
2497,Handsome and dashing British airline pilot Geo...,positive
2498,This film breeches the fine line between satir...,negative
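
Some reviews in the preview above contain raw `<br />` tags. The notebook trains on the text as-is; if you would rather remove the markup first, a minimal sketch could look like the following. The helper name `strip_html_breaks` and the sample sentence are made up for illustration and are not part of NLU.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_html_breaks(text: str) -> str:
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = _BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


# Hypothetical review text, used only to demonstrate the helper.
sample = "A wonderful little production. <br /><br />The rest of the review."
cleaned = strip_html_breaks(sample)
print(cleaned)
```

With pandas, the helper could be applied to the whole column via `train_df['text'] = train_df['text'].map(strip_html_breaks)` before fitting. Whether this improves accuracy on this dataset is not shown here.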


# 3. Train Deep Learning Classifier using nlu.load('train.sentiment')

Your dataset's label column must be named 'y', and the feature column with the text data must be named 'text'.

In [None]:
import nlu 
# load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
# by default, Universal Sentence Encoder (USE) sentence embeddings are used as input features
trainable_pipe = nlu.load('train.sentiment')
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50])

# predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:50],output_level='document')
# the sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.70      0.70      0.70        27
     neutral       0.00      0.00      0.00         0
    positive       0.79      0.65      0.71        23

    accuracy                           0.68        50
   macro avg       0.50      0.45      0.47        50
weighted avg       0.74      0.68      0.71        50



Unnamed: 0_level_0,text,default_name_embeddings,sentiment,sentiment_confidence,y,document
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,One of the other reviewers has mentioned that ...,"[-0.04935329407453537, -0.01034686528146267, -...",positive,0.968638,positive,One of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,"[0.040489643812179565, -0.054199717938899994, ...",negative,0.990273,positive,A wonderful little production. <br /><br />The...
2,I thought this was a wonderful way to spend ti...,"[0.026364900171756744, 0.07112795859575272, 0....",negative,0.957352,positive,I thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,"[-0.05151151493191719, 0.008207003585994244, -...",negative,0.958503,negative,Basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...","[0.06880538165569305, 0.019250543788075447, -0...",positive,0.999108,positive,"Petter Mattei's ""Love in the Time of Money"" is..."
5,"Probably my all-time favorite movie, a story o...","[0.004764211364090443, 0.027671916410326958, -...",positive,0.993937,positive,"Probably my all-time favorite movie, a story o..."
6,I sure would like to see a resurrection of a u...,"[-0.03813941031694412, -0.03322296217083931, 0...",positive,0.974884,positive,I sure would like to see a resurrection of a u...
7,"This show was an amazing, fresh & innovative i...","[0.010670202784240246, -0.04322813078761101, -...",negative,0.721451,negative,"This show was an amazing, fresh & innovative i..."
8,Encouraged by the positive comments about this...,"[0.010801736265420914, -0.07724311947822571, -...",positive,0.884824,negative,Encouraged by the positive comments about this...
9,If you like original gut wrenching laughter yo...,"[-0.0245585348457098, 0.0005475765210576355, -...",negative,0.850509,positive,If you like original gut wrenching laughter yo...
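
The `macro avg` row in the report above looks low compared with the per-class scores. That is because the macro average is an unweighted mean over all three rows, including `neutral`, which has zero support and therefore scores 0.00. The weighted average ignores it in practice, since its weight is 0. The sketch below redoes that arithmetic from the rounded values printed in the report; the function names are illustrative.

```python
# Per-class precision, recall and support, copied from the printed report.
report = {
    "negative": {"precision": 0.70, "recall": 0.70, "support": 27},
    "neutral": {"precision": 0.00, "recall": 0.00, "support": 0},
    "positive": {"precision": 0.79, "recall": 0.65, "support": 23},
}


def macro_average(rows, metric):
    """Unweighted mean over all classes, including classes with zero support."""
    return sum(row[metric] for row in rows.values()) / len(rows)


def weighted_average(rows, metric):
    """Mean weighted by the number of true examples (support) per class."""
    total = sum(row["support"] for row in rows.values())
    return sum(row[metric] * row["support"] for row in rows.values()) / total


for metric in ("precision", "recall"):
    macro = round(macro_average(report, metric), 2)
    weighted = round(weighted_average(report, metric), 2)
    print(f"{metric}: macro avg = {macro}, weighted avg = {weighted}")
```

Because the inputs are already rounded to two decimals, results recomputed this way can differ from scikit-learn's printed values by about 0.01 for some metrics.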


# Test the fitted pipe on a new example

In [None]:
fitted_pipe.predict('It was one of the best films i have ever watched in my entire life !!')

Unnamed: 0_level_0,default_name_embeddings,sentiment,sentiment_confidence,document
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,"[0.06468033790588379, -0.040837567299604416, -...",positive,0.982375,Bitcoin is going to the moon!


## Configure pipe training parameters

In [None]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['sentiment_dl'] has settable params:
pipe['sentiment_dl'].setMaxEpochs(2)                 | Info: Maximum number of epochs to train | Currently set to : 2
pipe['sentiment_dl'].setLr(0.005)                    | Info: Learning Rate | Currently set to : 0.005
pipe['sentiment_dl'].setBatchSize(64)                | Info: Batch size | Currently set to : 64
pipe['sentiment_dl'].setDropout(0.5)                 | Info: Dropout coefficient | Currently set to : 0.5
pipe['sentiment_dl'].setEnableOutputLogs(True)       | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True
pipe['sentiment_dl'].setThreshold(0.6)               | Info: The minimum threshold for the final result otheriwse it will be neutral | Currently set to : 0.6
pipe['sentiment_dl'].setThresholdLabel('neutral')    | Info: In case the score is less than threshold, what should be the label. Default i
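
The `setThreshold` and `setThresholdLabel` descriptions explain why a `neutral` row shows up in the classification reports even though the data only contains `positive` and `negative` labels. The following is a plain-Python illustration of the rule as described in the info text, not the actual SentimentDL implementation. The function name and the score dictionaries are hypothetical.

```python
def apply_threshold(scores, threshold=0.6, threshold_label="neutral"):
    """Return the highest-scoring label, or threshold_label if its score is below threshold.

    scores: dict mapping label -> confidence (illustrative structure).
    """
    best_label = max(scores, key=scores.get)
    if scores[best_label] < threshold:
        return threshold_label
    return best_label


# A confident prediction keeps its label.
confident = apply_threshold({"positive": 0.98, "negative": 0.02})
# A prediction below the 0.6 threshold falls back to the threshold label.
uncertain = apply_threshold({"positive": 0.54, "negative": 0.46})
print(confident, uncertain)
```

Under this reading, lowering the threshold reduces the number of `neutral` predictions, at the cost of accepting less confident labels.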

## Retrain with new parameters

In [None]:
# Train longer!
trainable_pipe['sentiment_dl'].setMaxEpochs(5)  
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50])
# predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:50],output_level='document')

# the sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

              precision    recall  f1-score   support

    negative       0.81      0.96      0.88        27
     neutral       0.00      0.00      0.00         0
    positive       0.94      0.70      0.80        23

    accuracy                           0.84        50
   macro avg       0.58      0.55      0.56        50
weighted avg       0.87      0.84      0.84        50



Unnamed: 0_level_0,text,default_name_embeddings,sentiment,sentiment_confidence,y,document
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,One of the other reviewers has mentioned that ...,"[-0.04935329407453537, -0.01034686528146267, -...",positive,0.966858,positive,One of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,"[0.040489643812179565, -0.054199717938899994, ...",negative,0.985679,positive,A wonderful little production. <br /><br />The...
2,I thought this was a wonderful way to spend ti...,"[0.026364900171756744, 0.07112795859575272, 0....",negative,0.988745,positive,I thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,"[-0.05151151493191719, 0.008207003585994244, -...",negative,0.999291,negative,Basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...","[0.06880538165569305, 0.019250543788075447, -0...",positive,0.999684,positive,"Petter Mattei's ""Love in the Time of Money"" is..."
5,"Probably my all-time favorite movie, a story o...","[0.004764211364090443, 0.027671916410326958, -...",positive,0.996598,positive,"Probably my all-time favorite movie, a story o..."
6,I sure would like to see a resurrection of a u...,"[-0.03813941031694412, -0.03322296217083931, 0...",positive,0.960203,positive,I sure would like to see a resurrection of a u...
7,"This show was an amazing, fresh & innovative i...","[0.010670202784240246, -0.04322813078761101, -...",negative,0.753273,negative,"This show was an amazing, fresh & innovative i..."
8,Encouraged by the positive comments about this...,"[0.010801736265420914, -0.07724311947822571, -...",negative,0.958928,negative,Encouraged by the positive comments about this...
9,If you like original gut wrenching laughter yo...,"[-0.0245585348457098, 0.0005475765210576355, -...",neutral,0.536441,positive,If you like original gut wrenching laughter yo...
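
Both runs above score the model on the same 50 rows it was fitted on, so the reported accuracy is likely optimistic. A held-out split gives a more honest estimate. The sketch below only produces row positions, using the standard library; `split_indices` is a hypothetical helper, and the 80/20 ratio and seed are arbitrary choices.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row positions reproducibly and split them into train and test lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Keep at least one row for testing.
    n_test = max(1, int(round(n_rows * test_fraction)))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


train_idx, test_idx = split_indices(50)
print(len(train_idx), len(test_idx))
```

In the notebook, these positions could be used as `trainable_pipe.fit(train_df.iloc[train_idx])` followed by `fitted_pipe.predict(train_df.iloc[test_idx], output_level='document')`, so that the classification report is computed on rows the model has not seen.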


# Try training with different Embeddings

In [None]:
# nlu.print_components(action='embed_sentence') lists every sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

In [None]:
trainable_pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.sentiment')
# Non-USE sentence embeddings usually need longer training and a smaller learning rate
# The hyperparameters could be tuned further with methods such as grid search
# Longer training also usually improves accuracy
trainable_pipe['sentiment_dl'].setMaxEpochs(120)  
trainable_pipe['sentiment_dl'].setLr(0.0005) 
fitted_pipe = trainable_pipe.fit(train_df)
# predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# the sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

#preds

sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.85      0.81      0.83      1234
     neutral       0.00      0.00      0.00         0
    positive       0.87      0.79      0.83      1266

    accuracy                           0.80      2500
   macro avg       0.57      0.54      0.55      2500
weighted avg       0.86      0.80      0.83      2500



# 5. Save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 6. Load the model from disk
This makes offline NLU usage possible!   
Call nlu.load(path=path_to_the_pipe) to load a model or pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('It was one of the best films i have ever watched in my entire life !!')
preds

Fitting on empty Dataframe, could not infer correct training method!


Unnamed: 0_level_0,sentiment,en_embed_sentence_small_bert_L12_768_embeddings,sentiment_confidence,document
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,positive,"[0.09222018718719482, 0.11720675230026245, 0.1...",0.999543,It was one of the best films i have ever watch...


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')            | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector'] has settable params:
pipe['sentence_detector'].setCustomBounds([])                  | Info: characters used to explicitly mark sentence bounds | Currently set to : []
pipe['sentence_detector'].setDetectLists(True)                 | Info: whether detect lists during sentence detection | Currently set to : True
pipe['sentence_detector'].setExplodeSentences(False)           | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector'].setMaxLength(99999)                  | Info: Set the maximum allowed length for ea