![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/colab/Training/binary_text_classification/NLU_training_sentiment_classifier_demo_apple_twitter.ipynb)



# Training a Sentiment Analysis Classifier with NLU 
With the [SentimentDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#sentimentdl-multi-class-sentiment-analysis-annotator) from Spark NLP, you can train a deep learning sentiment classifier that achieves strong results on many text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



In [None]:
import os
from sklearn.metrics import classification_report
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu pyspark==2.4.7 > /dev/null  


import nlu

# 2. Download the Apple Twitter Sentiment dataset
https://www.kaggle.com/seriousran/appletwittersentimenttexts

This dataset contains tweets directed at Apple. We will train a model to predict whether a tweet expresses positive or negative sentiment.


In [None]:
! wget https://raw.githubusercontent.com/ahmedlone127/nlu-master/main/apple-twitter-sentiment-texts.csv


--2021-01-01 02:27:38--  https://raw.githubusercontent.com/ahmedlone127/nlu-master/main/apple-twitter-sentiment-texts.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31678 (31K) [text/plain]
Saving to: ‘apple-twitter-sentiment-texts.csv’


2021-01-01 02:27:39 (12.9 MB/s) - ‘apple-twitter-sentiment-texts.csv’ saved [31678/31678]



In [None]:
import pandas as pd
train_path = '/content/apple-twitter-sentiment-texts.csv'

train_df = pd.read_csv(train_path)
# The text to classify must be in a column named 'text'
# The label column must be named 'y' and contain values of type str
columns=['text','y']
train_df = train_df[columns]
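# Drop neutral rows; both spellings are listed in case the CSV uses the misspelled "neuteral"
train_df = train_df[~train_df["y"].isin(["neutral", "neuteral"])]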
train_df

Unnamed: 0,text,y
0,@Apple you need to sort your phones out.,negative
1,Wow. Yall needa step it up @Apple RT @heynyla:...,negative
2,I'm surprised there isn't more talk about what...,negative
3,Realised the reason @apple make huge phones is...,negative
4,Apple Inc. CEO Donates $291K To Pennsylvania S...,positive
...,...,...
281,@apple so thanks for being greedy assholes who...,negative
282,@apple iCal AGAIN!!! it reset all my recurring...,negative
283,Just did my first transaction with @Apple Pay ...,positive
284,RT @JPDesloges: Kantar Worldpanel: iPhone sale...,positive
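
Before training, it can help to check how the labels are distributed, since a heavily skewed dataset would make accuracy misleading. The following is a minimal sketch using only the standard library; the helper name `label_distribution` is hypothetical and not part of NLU. It accepts any iterable of labels, so a column such as `train_df['y']` could be passed in directly.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: fraction of rows} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}

# Toy labels standing in for train_df['y'] (illustrative, not the real data)
example_labels = ["negative", "negative", "positive", "positive"]
print(label_distribution(example_labels))
```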


# 3. Train Deep Learning Classifier using nlu.load('train.sentiment')

Your dataset's label column should be named 'y', and the feature column with the text data should be named 'text'.

In [None]:
import nlu 
# Load a trainable pipeline by using the 'train.' prefix, then fit it on a dataset with 'text' and 'y' columns
# By default, Universal Sentence Encoder (USE) sentence embeddings are used as input features
trainable_pipe = nlu.load('train.sentiment')
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50])

# Predict with the fitted pipeline on the dataset
preds = fitted_pipe.predict(train_df.iloc[:50],output_level='document')
# The sentence detector in the pipe can generate some NaNs; drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.91      0.80      0.85       143
     neutral       0.00      0.00      0.00         0
    positive       0.82      0.91      0.86       143

    accuracy                           0.86       286
   macro avg       0.58      0.57      0.57       286
weighted avg       0.86      0.86      0.86       286



origin_index,sentiment_confidence,y,default_name_embeddings,text,document,sentiment
0,0.998447,negative,"[-0.01731022447347641, 0.010604134760797024, -...",@Apple you need to sort your phones out.,@Apple you need to sort your phones out.,negative
1,0.990570,negative,"[0.019931159913539886, -0.04991159215569496, -...",Wow. Yall needa step it up @Apple RT @heynyla:...,Wow. Yall needa step it up @Apple RT @heynyla:...,positive
2,0.969844,negative,"[0.01646081730723381, -0.02681073546409607, -0...",I'm surprised there isn't more talk about what...,I'm surprised there isn't more talk about what...,negative
3,0.996128,negative,"[0.04638500511646271, -0.037105873227119446, -...",Realised the reason @apple make huge phones is...,Realised the reason @apple make huge phones is...,negative
4,0.959235,positive,"[-0.028623634949326515, 0.03947276994585991, -...",Apple Inc. CEO Donates $291K To Pennsylvania S...,Apple Inc. CEO Donates $291K To Pennsylvania S...,positive
...,...,...,...,...,...,...
281,0.978435,negative,"[0.03778046742081642, 0.03407461196184158, 0.0...",@apple so thanks for being greedy assholes who...,@apple so thanks for being greedy assholes who...,negative
282,0.623791,negative,"[-0.013547728769481182, -0.001025827950797975,...",@apple iCal AGAIN!!! it reset all my recurring...,@apple iCal AGAIN!!! it reset all my recurring...,positive
283,0.999104,positive,"[-0.0015363194979727268, -0.01644994132220745,...",Just did my first transaction with @Apple Pay ...,Just did my first transaction with @Apple Pay ...,positive
284,0.999854,positive,"[0.0656985342502594, 0.028557728976011276, -0....",RT @JPDesloges: Kantar Worldpanel: iPhone sale...,RT @JPDesloges: Kantar Worldpanel: iPhone sale...,positive
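
The `macro avg` row in the report is much lower than the accuracy because the report lists a `neutral` class with zero support, and the macro average weighs every listed class equally. A small sketch of the two averages, using the per-class precision values printed in the report as inputs (the helper names are hypothetical):

```python
def macro_average(scores):
    """Unweighted mean over classes; every class counts equally."""
    return sum(scores) / len(scores)

def weighted_average(scores, supports):
    """Mean over classes weighted by the number of true samples per class."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Per-class precision and support as printed in the report:
# negative, neutral, positive
precision = [0.91, 0.00, 0.82]
support = [143, 0, 143]

print(f"macro avg:    {macro_average(precision):.3f}")
print(f"weighted avg: {weighted_average(precision, support):.3f}")
```

The zero-support `neutral` class pulls the macro average down, while the weighted average ignores it because its weight is zero.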


# Test the fitted pipe on a new example

In [None]:
fitted_pipe.predict('I hate the newest update')

origin_index,sentiment_confidence,default_name_embeddings,document,sentiment
0,0.996097,"[0.06468033790588379, -0.040837567299604416, -...",Bitcoin is going to the moon!,positive


## Configure pipe training parameters

In [None]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['sentiment_dl'] has settable params:
pipe['sentiment_dl'].setMaxEpochs(2)                 | Info: Maximum number of epochs to train | Currently set to : 2
pipe['sentiment_dl'].setLr(0.005)                    | Info: Learning Rate | Currently set to : 0.005
pipe['sentiment_dl'].setBatchSize(64)                | Info: Batch size | Currently set to : 64
pipe['sentiment_dl'].setDropout(0.5)                 | Info: Dropout coefficient | Currently set to : 0.5
pipe['sentiment_dl'].setEnableOutputLogs(True)       | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True
pipe['sentiment_dl'].setThreshold(0.6)               | Info: The minimum threshold for the final result otheriwse it will be neutral | Currently set to : 0.6
pipe['sentiment_dl'].setThresholdLabel('neutral')    | Info: In case the score is less than threshold, what should be the label. Default i
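
The threshold parameters explain why a `neutral` class can show up in the predictions even though the training labels are only `positive` and `negative`: according to the parameter descriptions printed by `print_info()`, a prediction whose score is below the threshold is replaced by the threshold label. A simplified illustration of that rule in plain Python follows. It is a sketch of the described behaviour, not Spark NLP's actual implementation, and `apply_threshold` is a hypothetical name.

```python
def apply_threshold(label, confidence, threshold=0.6, threshold_label="neutral"):
    """Keep the predicted label only if its confidence reaches the threshold."""
    return label if confidence >= threshold else threshold_label

# Illustrative (label, confidence) pairs, not real model output
predictions = [("positive", 0.97), ("negative", 0.55), ("negative", 0.60)]
for label, confidence in predictions:
    print(label, confidence, "->", apply_threshold(label, confidence))
```

Lowering the value passed to `setThreshold` would make the fallback label appear less often, at the cost of accepting less confident predictions.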

## Retrain with new parameters

In [None]:
# Train longer!
trainable_pipe['sentiment_dl'].setMaxEpochs(5)  
fitted_pipe = trainable_pipe.fit(train_df.iloc[:100])
# Predict with the fitted pipeline on the dataset
preds = fitted_pipe.predict(train_df.iloc[:100],output_level='document')

# The sentence detector in the pipe can generate some NaNs; drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

              precision    recall  f1-score   support

    negative       0.96      0.85      0.90       143
     neutral       0.00      0.00      0.00         0
    positive       0.87      0.95      0.91       143

    accuracy                           0.90       286
   macro avg       0.61      0.60      0.60       286
weighted avg       0.92      0.90      0.91       286



origin_index,sentiment_confidence,y,default_name_embeddings,text,document,sentiment
0,0.999738,negative,"[-0.01731022447347641, 0.010604134760797024, -...",@Apple you need to sort your phones out.,@Apple you need to sort your phones out.,negative
1,0.937319,negative,"[0.019931159913539886, -0.04991159215569496, -...",Wow. Yall needa step it up @Apple RT @heynyla:...,Wow. Yall needa step it up @Apple RT @heynyla:...,positive
2,0.974594,negative,"[0.01646081730723381, -0.02681073546409607, -0...",I'm surprised there isn't more talk about what...,I'm surprised there isn't more talk about what...,negative
3,0.997196,negative,"[0.04638500511646271, -0.037105873227119446, -...",Realised the reason @apple make huge phones is...,Realised the reason @apple make huge phones is...,negative
4,0.709098,positive,"[-0.028623634949326515, 0.03947276994585991, -...",Apple Inc. CEO Donates $291K To Pennsylvania S...,Apple Inc. CEO Donates $291K To Pennsylvania S...,positive
...,...,...,...,...,...,...
281,0.984257,negative,"[0.03778046742081642, 0.03407461196184158, 0.0...",@apple so thanks for being greedy assholes who...,@apple so thanks for being greedy assholes who...,negative
282,0.904880,negative,"[-0.013547728769481182, -0.001025827950797975,...",@apple iCal AGAIN!!! it reset all my recurring...,@apple iCal AGAIN!!! it reset all my recurring...,negative
283,0.995687,positive,"[-0.0015363194979727268, -0.01644994132220745,...",Just did my first transaction with @Apple Pay ...,Just did my first transaction with @Apple Pay ...,positive
284,0.998746,positive,"[0.0656985342502594, 0.028557728976011276, -0....",RT @JPDesloges: Kantar Worldpanel: iPhone sale...,RT @JPDesloges: Kantar Worldpanel: iPhone sale...,positive
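
The cells so far fit and predict on the same rows, so the reported scores describe how well the model fits its training data rather than how it generalises. One way to get a more realistic estimate is to hold out part of the data. Below is a minimal sketch of a stratified split over row positions using only the standard library; `stratified_split` is a hypothetical helper, and the 80/20 ratio and the seed are arbitrary assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row positions into train/test lists, preserving label ratios."""
    rng = random.Random(seed)
    positions_by_label = defaultdict(list)
    for position, label in enumerate(labels):
        positions_by_label[label].append(position)

    train_idx, test_idx = [], []
    for positions in positions_by_label.values():
        rng.shuffle(positions)
        n_test = int(round(len(positions) * test_fraction))
        test_idx.extend(positions[:n_test])
        train_idx.extend(positions[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels standing in for train_df['y']
labels = ["negative"] * 5 + ["positive"] * 5
train_idx, test_idx = stratified_split(labels)
print("train rows:", train_idx)
print("test rows: ", test_idx)
```

With the DataFrame from this notebook, the positions could be used as `train_df.iloc[train_idx]` for `fit` and `train_df.iloc[test_idx]` for `predict`.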


# Try training with different Embeddings

In [None]:
# nlu.print_components(action='embed_sentence') lists every available sentence embedding. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
trainable_pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.sentiment')
# Embeddings other than USE usually need longer training and a smaller learning rate
# The hyperparameters could be tuned further with methods such as grid search
# Longer training can also improve accuracy
trainable_pipe['sentiment_dl'].setMaxEpochs(110)  
trainable_pipe['sentiment_dl'].setLr(0.0005) 
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the dataset
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector in the pipe can generate some NaNs; drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

#preds

sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.96      0.85      0.90       143
     neutral       0.00      0.00      0.00         0
    positive       0.92      0.92      0.92       143

    accuracy                           0.88       286
   macro avg       0.63      0.59      0.61       286
weighted avg       0.94      0.88      0.91       286
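
The comments in the training cell mention grid search. A minimal sketch of that idea: enumerate every combination of candidate values and keep the best-scoring one. The `grid_search` helper and the `score_fn` callback are hypothetical; in this notebook `score_fn` would set the values with `setMaxEpochs` and `setLr`, fit the pipeline, and return a metric measured on held-out data. A toy scoring function is used here so the example runs on its own.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Stand-in for "fit the pipeline and evaluate it"; peaks at lr=0.001, 20 epochs
    return -abs(params["lr"] - 0.001) * 1000 - abs(params["max_epochs"] - 20) / 100

grid = {"max_epochs": [5, 20, 110], "lr": [0.005, 0.001, 0.0005]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

Because every combination triggers a full training run, the grid should stay small when each fit takes minutes.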



# 5. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained
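
If you retrain several times, you may want each saved pipeline to land in its own directory instead of reusing one path. A small sketch of one way to pick such a path, using only the standard library; `next_model_path` and the `_vN` naming scheme are assumptions for illustration, not an NLU feature.

```python
from pathlib import Path
import tempfile

def next_model_path(base_dir, name):
    """Return the first '<name>_vN' path under base_dir that does not exist yet."""
    base = Path(base_dir)
    version = 1
    while (base / f"{name}_v{version}").exists():
        version += 1
    return base / f"{name}_v{version}"

# Demonstration in a temporary directory so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    first = next_model_path(tmp, "sentiment_pipe")
    first.mkdir(parents=True)  # pretend a pipeline was saved here
    second = next_model_path(tmp, "sentiment_pipe")
    print(first.name, second.name)
```

The returned path, converted with `str(...)`, could then be passed to `fitted_pipe.save`.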


# 6. Let's load the model from disk
This makes offline NLU usage possible!   
Call nlu.load(path=path_to_the_pipe) to load a model or pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('I hate the newest update')
preds

Fitting on empty Dataframe, could not infer correct training method!


origin_index,sentiment_confidence,en_embed_sentence_small_bert_L12_768_embeddings,document,sentiment
0,0.974083,"[-0.058236218988895416, -0.3061041235923767, 0...",I hate it,negative


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')            | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['regex_tokenizer'] has settable params:
pipe['regex_tokenizer'].setCaseSensitiveExceptions(True)       | Info: Whether to care for case sensitiveness in exceptions | Currently set to : True
pipe['regex_tokenizer'].setTargetPattern('\S+')                | Info: pattern to grab from text as token candidates. Defaults \S+ | Currently set to : \S+
pipe['regex_tokenizer'].setMaxLength(99999)                    | Info: Set the maximum allowed length for each token | Currently set to : 99999
pipe['regex_tokenizer'].setMinLength(0)                        | Info: Set the minimum allowed length for each token | Currently set to : 0
>>> pipe['