![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/collab/Training/multi_class_text_classification/NLU_training_multi_class_text_classifier_demo.ipynb)



# Training a Deep Learning Classifier with NLU 
## ClassifierDL (Multi-class Text Classification)
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP, you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu pyspark==2.4.7 > /dev/null

import nlu

# 2. Download news classification dataset

In [None]:
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv

--2020-12-14 02:23:36--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.154.38
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.154.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24032125 (23M) [text/csv]
Saving to: ‘news_category_train.csv’


2020-12-14 02:23:37 (21.7 MB/s) - ‘news_category_train.csv’ saved [24032125/24032125]

--2020-12-14 02:23:37--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.74.118
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.74.118|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1504408 (1.4M) [text/csv]
Saving to: ‘news_category_test.csv’


2020-12-14 02:23:38 (2.77 MB/s) - ‘news_category_test.csv’ saved [1504408/1504408]



In [None]:
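import pandas as pd
# Note: this path points at the test split, so the frame named train_df below
# holds the news_category_test.csv rows that are used for fitting in this demo.
test_path = '/content/news_category_test.csv'
train_df = pd.read_csv(test_path)
# Rename the columns to the names the trainable pipeline expects: label 'y', text 'text'
train_df.columns=['y','text']
train_df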

Unnamed: 0,y,text
0,Business,Unions representing workers at Turner Newall...
1,Sci/Tech,"TORONTO, Canada A second team of rocketeer..."
2,Sci/Tech,A company founded by a chemistry researcher a...
3,Sci/Tech,It's barely dawn when Mike Fitzpatrick starts...
4,Sci/Tech,Southern California's smog fighting agency we...
...,...,...
7595,World,Ukrainian presidential candidate Viktor Yushch...
7596,Sports,With the supply of attractive pitching options...
7597,Sports,Like Roger Clemens did almost exactly eight ye...
7598,Business,SINGAPORE : Doctors in the United States have ...


# 3. Train Deep Learning Classifier using nlu.load('train.classifier')

By default, the Universal Sentence Encoder (USE) embeddings are downloaded to provide embeddings for the classifier. You can also use any of the 50+ other sentence embeddings in NLU, though!

Your dataset's label column should be named 'y' and the feature column with text data should be named 'text'.
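
Before fitting, it can help to confirm that the frame really exposes those two column names. The helper below is a hypothetical convenience and not part of NLU; it accepts any iterable of column names, such as `train_df.columns`, and the example column names are made up.

```python
REQUIRED_COLUMNS = ("y", "text")  # names the text above says the trainable pipeline expects


def missing_training_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = {str(name) for name in columns}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Plain lists are used here; in the notebook you could pass train_df.columns instead.
print(missing_training_columns(["category", "description"]))  # both required names are missing
print(missing_training_columns(["y", "text", "source"]))      # nothing is missing
```

An empty result means the frame can be passed to `fit` as described above; otherwise rename the columns first.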

In [None]:
# Load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
fitted_pipe = nlu.load('train.classifier').fit(train_df)

# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df)
preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


origin_index,y,default_name_embeddings,text,sentence,category_confidence,category
0,Business,"[0.012997539713978767, 0.019844762980937958, -...",Unions representing workers at Turner Newall...,Unions representing workers at Turner Newall s...,0.999985,Business
1,Sci/Tech,"[0.023022323846817017, -0.01595703884959221, -...","TORONTO, Canada A second team of rocketeer...","TORONTO, Canada A second team of rocketeers co...",1.000000,Sports
1,Sci/Tech,"[-0.010587693192064762, 0.011531050316989422, ...","TORONTO, Canada A second team of rocketeer...","10 million Ansari X Prize, a contest for priva...",1.000000,Sports
2,Sci/Tech,"[0.038641855120658875, 0.02322080172598362, -0...",A company founded by a chemistry researcher a...,A company founded by a chemistry researcher at...,0.744563,Business
3,Sci/Tech,"[-0.006857294123619795, 0.01967567577958107, -...",It's barely dawn when Mike Fitzpatrick starts...,It's barely dawn when Mike Fitzpatrick starts ...,0.999360,Sci/Tech
...,...,...,...,...,...,...
7596,Sports,"[0.005107458680868149, -0.011805553920567036, ...",With the supply of attractive pitching options...,.,1.000000,Sports
7596,Sports,"[0.005107458680868149, -0.011805553920567036, ...",With the supply of attractive pitching options...,.,1.000000,Sports
7597,Sports,"[0.044696468859910965, 0.0015660696662962437, ...",Like Roger Clemens did almost exactly eight ye...,Like Roger Clemens did almost exactly eight ye...,1.000000,Sports
7598,Business,"[0.05564942583441734, -0.021285761147737503, -...",SINGAPORE : Doctors in the United States have ...,SINGAPORE : Doctors in the United States have ...,0.999433,Business
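
In the output above, `origin_index` 1 occurs twice with different `sentence` values, which suggests that `predict` returned one row per sentence rather than one row per input document. If a single label per document is needed, one option is to keep the most confident sentence prediction for each `origin_index`. The sketch below uses plain tuples so it runs without Spark; the function name and the keep-the-most-confident rule are assumptions for illustration, not NLU behaviour, and the rows are made up.

```python
def document_level_labels(sentence_rows):
    """Collapse (origin_index, category, confidence) rows to one label per origin_index.

    Keeps the category of the most confident sentence; on ties the first row seen wins.
    """
    best = {}
    for origin_index, category, confidence in sentence_rows:
        current = best.get(origin_index)
        if current is None or confidence > current[1]:
            best[origin_index] = (category, confidence)
    return {index: category for index, (category, _) in best.items()}


# Made-up rows in the same shape as the table above.
rows = [
    (0, "Business", 0.99),
    (1, "Sports", 0.60),
    (1, "Sci/Tech", 0.85),
    (2, "World", 0.70),
]
print(document_level_labels(rows))
```

In the notebook, the rows could be built with `zip(preds.index, preds['category'], preds['category_confidence'])`, assuming `origin_index` is the frame's index as the output suggests.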


# 4. Evaluate the model

In [None]:
from sklearn.metrics import classification_report
print(classification_report(preds['y'], preds['category']))


              precision    recall  f1-score   support

    Business       0.76      0.81      0.78      3671
    Sci/Tech       0.80      0.79      0.79      3983
      Sports       0.86      0.92      0.89      3687
       World       0.89      0.77      0.83      3058

    accuracy                           0.82     14399
   macro avg       0.83      0.82      0.82     14399
weighted avg       0.82      0.82      0.82     14399
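
The report above scores the pipeline on the same rows it was fitted on, so it likely overstates how well the model generalises. A minimal sketch of a stratified hold-out split using only the standard library follows. The function name, the 80/20 ratio, and the seed are illustrative choices, not something the notebook prescribes.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each label in the same proportion."""
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_indices, test_indices = [], []
    for positions in by_label.values():
        rng.shuffle(positions)
        n_test = int(round(len(positions) * test_fraction))
        test_indices.extend(positions[:n_test])
        train_indices.extend(positions[n_test:])
    return sorted(train_indices), sorted(test_indices)


# Toy labels: 10 rows per class, so 2 rows per class end up in the test part.
labels = ["Business"] * 10 + ["Sports"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 16 4
```

In the notebook, the two index lists could be used as `train_df.iloc[train_idx]` for fitting and `train_df.iloc[test_idx]` for the classification report.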



# 5. Let's try different Sentence Embeddings

In [None]:
# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
# Load pipe with bert embeds
# using large embeddings can take a few hours..
# fitted_pipe = nlu.load('en.embed_sentence.bert_large_uncased train.classifier').fit(train_df)
fitted_pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.classifier').fit(train_df)


# predict with the trained pipeline on dataset and get predictions
preds = fitted_pipe.predict(train_df)
from sklearn.metrics import classification_report
print(classification_report(preds['y'], preds['category']))


sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]
              precision    recall  f1-score   support

    Business       0.00      0.00      0.00      1900
    Sci/Tech       0.25      1.00      0.40      1900
      Sports       0.00      0.00      0.00      1900
       World       0.00      0.00      0.00      1900

    accuracy                           0.25      7600
   macro avg       0.06      0.25      0.10      7600
weighted avg       0.06      0.25      0.10      7600
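
In the report above, only `Sci/Tech` has non-zero scores, with a recall of 1.00, which is the pattern you would see if the classifier predicted a single class for every row. A quick way to confirm this kind of collapse is to count the distinct predicted labels. The helper below is a hypothetical check written with the standard library; in the notebook it could be called with `preds['category']`.

```python
from collections import Counter


def prediction_summary(predicted_labels):
    """Return (label counts, True if every prediction shares one single label)."""
    counts = Counter(predicted_labels)
    collapsed = len(counts) == 1
    return counts, collapsed


# Toy input where every prediction is the same class.
counts, collapsed = prediction_summary(["Sci/Tech"] * 5)
print(counts.most_common(), collapsed)  # [('Sci/Tech', 5)] True
```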



In [None]:
# Load pipe with bert embeds
fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df)

# predict with the trained pipeline on dataset and get predictions
preds = fitted_pipe.predict(train_df)
from sklearn.metrics import classification_report
print(classification_report(preds['y'], preds['category']))


sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
              precision    recall  f1-score   support

    Business       0.81      0.74      0.77      1900
    Sci/Tech       0.74      0.87      0.80      1900
      Sports       0.92      0.94      0.93      1900
       World       0.91      0.81      0.86      1900

    accuracy                           0.84      7600
   macro avg       0.85      0.84      0.84      7600
weighted avg       0.85      0.84      0.84      7600



# 6. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained'
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 7. Let's load the model from HDD.
This makes offline NLU usage possible!   
You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Tesla plans to invest 10M into the ML sector')
preds

Fitting on empty Dataframe, could not infer correct training method!


origin_index,classifier_confidence,document,classifier,embed_sentence_bert_embeddings
0,0.997592,Tesla plans to invest 10M into the ML sector,Business,"[-0.07111635059118271, 0.9532930850982666, -1...."


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')                            | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['regex_tokenizer'] has settable params:
pipe['regex_tokenizer'].setCaseSensitiveExceptions(True)                       | Info: Whether to care for case sensitiveness in exceptions | Currently set to : True
pipe['regex_tokenizer'].setTargetPattern('\S+')                                | Info: pattern to grab from text as token candidates. Defaults \S+ | Currently set to : \S+
pipe['regex_tokenizer'].setMaxLength(99999)                                    | Info: Set the maximum allowed length for each token | Currently set to : 99999
pipe['regex_tokenizer'].setMinLength(0)                                        | Info: