# Text Classification Model

In this notebook we train and explore different text classifiers to solve our problem.

  1. Train a single classifier (with variations).
  2. Train multiple classifiers on the unbalanced dataset.
  3. Train multiple classifiers on the balanced dataset.

In [1]:
# First, let's check what Google has given us! Thank you, Google, for the GPU
!nvidia-smi

Fri Jan 10 07:12:08 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [2]:
# Let's mount our Google Drive. Hey! In exchange for the GPU, you now give your data to Google

from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [3]:
# Install necessary packages and restart the environment

! pip install tiny-tokenizer
! pip install  flair
# ! pip install -U tensorflow-gpu

Collecting tiny-tokenizer
  Downloading https://files.pythonhosted.org/packages/8d/0f/aa52c227c5af69914be05723b3deaf221805a4ccbce87643194ef2cdde43/tiny_tokenizer-3.1.0.tar.gz
Building wheels for collected packages: tiny-tokenizer
  Building wheel for tiny-tokenizer (setup.py) ... done
  Created wheel for tiny-tokenizer: filename=tiny_tokenizer-3.1.0-cp36-none-any.whl size=10550 sha256=d5c57ad3510339b668f5795d2101ffad41fdaae645f2d2c74da2385db55e4b70
  Stored in directory: /root/.cache/pip/wheels/d1/c8/36/334497a689fab90128232e86b5829b800dd271a3d5d5959c53
Successfully built tiny-tokenizer
Installing collected packages: tiny-tokenizer
Successfully installed tiny-tokenizer-3.1.0
Collecting flair
  Downloading https://files.pythonhosted.org/packages/16/22/8fc8e5978ec05b710216735ca47415700e83f304dec7e4281d61cefb6831/flair-0.4.4-py3-none-any.whl (193kB)
     |████████████████████████████████| 194kB 6.5MB/s 
Collecting deprecated>=1.2.4
  Downloading https://files.pythonho

In [1]:
# Let's import our packages !

import pandas as pd
from tqdm import tqdm
import html
import re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
import flair
import pickle
from torch.optim.adam import Adam

from flair.data import Corpus
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings, FastTextEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from flair.samplers import ImbalancedClassificationDatasetSampler


In [0]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

# 1. Train a Single Classifier

The first approach is to load the entire dataset (~1M examples) and train a single classifier powerful enough to perform well on all of it.

Challenges :
  * Super slow to experiment with different architectures
  * Super slow to train a model
  * Super slow for Hyper-parameter tuning
  * Highly Skewed Dataset
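
One way to put a number on the skew before training is to count how often each label occurs. The following is a minimal sketch using only the standard library; the `labels` list and its values are made-up placeholders, not the tutorial's data, and the helper names are hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Return (label, count, share) tuples, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


def imbalance_ratio(labels):
    """Ratio between the most frequent and the least frequent label count."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical toy labels, for illustration only.
labels = ["billing"] * 8 + ["login"] * 3 + ["other"] * 1

for label, n, share in label_distribution(labels):
    print(f"{label:10s} {n:3d} {share:.2f}")
print("imbalance ratio:", imbalance_ratio(labels))
```

A large ratio is a hint that plain accuracy will be misleading and that per-label metrics deserve a closer look.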


Experiments :
  * Try stacking multiple embeddings (step #3 in the code below)
  * Try custom embeddings (step #3)
  * Try a different RNN cell (step #4)
  * Try a different hidden size (step #4)
  * All hyperparameters are tunable for extensive experimenting (step #7)
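
To keep such experiments organised, the combinations can be enumerated up front. This is only a sketch: the search space values below are placeholders, and `expand_grid` is a hypothetical helper, not part of Flair.

```python
from itertools import product

# Hypothetical search space; adapt the values to your own hypotheses.
search_space = {
    "rnn_type": ["GRU", "LSTM"],
    "hidden_size": [64, 128, 256],
    "learning_rate": [0.03, 0.06, 0.1],
}


def expand_grid(space):
    """Yield one dict per combination of the values in `space`."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(expand_grid(search_space))
print(len(configs), "configurations")
print(configs[0])
```

Each dict can then be mapped onto the arguments used in this notebook, for example `rnn_type` and `hidden_size` for `DocumentRNNEmbeddings` and `learning_rate` for `trainer.train`.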

In [0]:
path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/'
corpus_path = path+'training_data/classification_corpus.pkl'
label_dict_path = path+'training_data/classification_corpus_label_dict.pkl'


# 1. Reading Corpus File : which we prepared before-hand
with open(corpus_path, mode='rb') as f:
  corpus = pickle.load(f)

# 2. Reading Corpus Dictionary : which we computed & saved
with open(label_dict_path, mode='rb') as f:
  label_dict = pickle.load(f)

# 3. make a list of word embeddings 
word_embeddings = [
  WordEmbeddings('glove'),

  # comment in flair embeddings for state-of-the-art results
  # FlairEmbeddings('news-forward'),
  # FlairEmbeddings('news-backward'),
]

# 4. initialize document embedding by passing list of word embeddings
## Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
document_embeddings = DocumentRNNEmbeddings(
  word_embeddings,
  hidden_size=128,
  reproject_words=True,
  reproject_words_dimension=256,
)

# 5. create the text classifier
classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label=True)

# 6. initialize the text classifier trainer
trainer = ModelTrainer(classifier,
                        corpus,
                        optimizer=Adam,
                        use_tensorboard=True)

# 7. start the training
model_path = path + 'model/full_model'  # `path` already ends with '/'
trainer.train(model_path,
              learning_rate=0.06,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=2,
              checkpoint=True,
              embeddings_storage_mode='gpu',
              num_workers=12)

# 2. Train Multiple Classifiers on unbalanced Dataset

A single classifier fails to capture and fit the data and performs poorly on all the metrics. This led us to divide the problem into sub-groups and focus on each group individually rather than training one titan model for everything.

In this section, we train multiple classifiers, one for each of the groups we created.

### Experiments 

  1. We choose a couple of representative groups
  2. We try different architectures & embeddings
  3. We train for only 2 epochs to get quick results
  4. We manually tune a couple of hyper-parameters based on our hypotheses
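
One thing to keep in mind with 2-epoch runs: the cells in this notebook pass `patience=5` and `anneal_factor=0.5`. Assuming the trainer anneals on a plateau (the learning rate is multiplied by `anneal_factor` once the dev score has failed to improve for more than `patience` epochs, similar to PyTorch's `ReduceLROnPlateau`), the learning rate cannot change within 2 epochs. The sketch below is a simplified pure-Python simulation of that rule, not Flair's implementation, and the dev scores are made up.

```python
def anneal_schedule(dev_scores, learning_rate, anneal_factor, patience):
    """Return the learning rate in effect after each epoch.

    Simplified plateau rule (an assumption, see the text above): when the
    dev score has not improved for more than `patience` consecutive epochs,
    multiply the learning rate by `anneal_factor` and reset the counter.
    """
    best = float("-inf")
    bad_epochs = 0
    rates = []
    for score in dev_scores:
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            learning_rate *= anneal_factor
            bad_epochs = 0
        rates.append(learning_rate)
    return rates


# Two epochs: annealing never has a chance to trigger.
print(anneal_schedule([0.50, 0.60], 0.1, 0.5, 5))

# One good epoch followed by six stagnant ones: the rate halves at the end.
print(anneal_schedule([0.50] + [0.40] * 6, 0.1, 0.5, 5))
```

So in the quick experiments the initial `learning_rate` is effectively the only learning rate the model sees; the annealing settings start to matter in the longer 10-epoch runs.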


### Things to try & Build Hypothesis on:
  1. GRU/LSTM cells
  2. Number of RNN Layers 
  3. Hidden Units / Time steps / Sequence Length
  4. Embeddings
  5. Batch Size
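
Batch size directly sets how many optimizer steps one epoch contains. The training logs in this notebook show `iter 0/71` for 9085 training sentences with `mini_batch_size=128`, and `iter 0/138` for 8811 training sentences with `mini_batch_size=64`. A small sketch of that arithmetic, assuming the last partial batch is kept (the function name is hypothetical):

```python
import math


def iterations_per_epoch(num_train_sentences, mini_batch_size):
    """Number of mini-batches per epoch when the last partial batch is kept."""
    return math.ceil(num_train_sentences / mini_batch_size)


# Numbers taken from the training logs in this notebook.
print(iterations_per_epoch(9085, 128))
print(iterations_per_epoch(8811, 64))
```

Halving the batch size roughly doubles the number of updates per epoch, which is worth remembering when comparing runs that were trained for the same number of epochs.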







In [4]:
for grp_id in [10] :
  print("================================================================================================")
  print("Group ID : {}".format(grp_id))
  print("================================================================================================")

  path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/'
  base_path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/' + str(grp_id) + '/'

  corpus_path =  path + str(grp_id) + '/classification_corpus.pkl'
  label_dict_path = path + str(grp_id) + '/classification_corpus_label_dict.pkl'

  # 1. Reading Corpus File : which we prepared before-hand
  with open(corpus_path, mode='rb') as f:
    corpus = pickle.load(f)

  # 2. Reading Corpus Dictionary : which we computed & saved
  with open(label_dict_path, mode='rb') as f:
    label_dict = pickle.load(f)

  # 3. make a list of word embeddings 
  word_embeddings = [ 
                     WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/gensim_model'), # Custom Word Embedding 
                    ## comment in different embeddings for state-of-the-art results
                     
                    #  WordEmbeddings('glove'),                 
                    # FlairEmbeddings('news-forward'),
                    # FlairEmbeddings('news-backward'),
                     
    ]

  # Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
  document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=256, # Build a hypothesis for different values
    rnn_layers = 2,  # Build a hypothesis for different values
    reproject_words=True, 
    reproject_words_dimension=256
  )

  classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label=True)

  # 6. initialize the text classifier trainer
  trainer = ModelTrainer(classifier,
                          corpus,
                          optimizer=Adam
                          )

  # model_path = path + '/model_70_30/full_model_1'

  trainer.train(base_path,
                learning_rate=0.1,
                mini_batch_size=128,
                anneal_factor=0.5,
                patience=5,
                max_epochs=2,
                checkpoint=True,
                embeddings_storage_mode='gpu'
                )

Group ID : 10


  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


2020-01-10 07:16:18,761 ----------------------------------------------------------------------------------------------------
2020-01-10 07:16:18,767 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/gensim_model')
    )
    (word_reprojection_map): Linear(in_features=300, out_features=256, bias=True)
    (rnn): GRU(256, 256, num_layers=2, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=256, out_features=4, bias=True)
  (loss_function): BCEWithLogitsLoss()
)"
2020-01-10 07:16:18,771 ----------------------------------------------------------------------------------------------------
2020-01-10 07:16:18,775 Corpus: "Corpus: 9085 train + 1947 dev + 1947 test sentences"
2020-01-10 07:16:18,779 ------------------------------------------------------------------------

  "type " + obj.__name__ + ". It won't be checked "


2020-01-10 07:21:27,452 ----------------------------------------------------------------------------------------------------
2020-01-10 07:21:35,644 epoch 2 - iter 0/71 - loss 0.68568194 - samples/sec: 287.36
2020-01-10 07:22:05,410 epoch 2 - iter 7/71 - loss 0.67302275 - samples/sec: 38.41
2020-01-10 07:22:29,020 epoch 2 - iter 14/71 - loss 0.70303029 - samples/sec: 43.61
2020-01-10 07:22:52,829 epoch 2 - iter 21/71 - loss 0.77519718 - samples/sec: 42.53
2020-01-10 07:23:12,962 epoch 2 - iter 28/71 - loss 0.81017300 - samples/sec: 52.21
2020-01-10 07:23:33,755 epoch 2 - iter 35/71 - loss 0.81906394 - samples/sec: 47.67
2020-01-10 07:23:54,393 epoch 2 - iter 42/71 - loss 0.80896194 - samples/sec: 48.46
2020-01-10 07:24:13,609 epoch 2 - iter 49/71 - loss 0.81108198 - samples/sec: 52.12
2020-01-10 07:24:34,933 epoch 2 - iter 56/71 - loss 0.81042614 - samples/sec: 50.62
2020-01-10 07:24:56,373 epoch 2 - iter 63/71 - loss 0.80587265 - samples/sec: 48.62
2020-01-10 07:25:16,368 epoch 2 - it

Once your experiments are done and you have finalised the top 2-3 architectures and configurations, use them to train the classifiers.

**Note** : You should run these experiments individually for every group and train a custom model for each of them. Here we use a vanilla configuration for all the groups.
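
If you do end up with different settings per group, one simple way to organise them is a default configuration plus per-group overrides. This is only a sketch: the defaults mirror the vanilla configuration used in this notebook, while the override for group 10 and the helper name `config_for_group` are hypothetical.

```python
# Defaults mirroring the vanilla configuration used for all groups.
DEFAULT_CONFIG = {
    "hidden_size": 128,
    "rnn_layers": 1,
    "learning_rate": 0.1,
    "mini_batch_size": 128,
}

# Hypothetical overrides; fill these in from your own experiments.
GROUP_OVERRIDES = {
    10: {"hidden_size": 256, "rnn_layers": 2},
}


def config_for_group(grp_id):
    """Return the default configuration with any overrides for `grp_id` applied."""
    config = dict(DEFAULT_CONFIG)
    config.update(GROUP_OVERRIDES.get(grp_id, {}))
    return config


for grp_id in (1, 10):
    print(grp_id, config_for_group(grp_id))
```

Inside the training loop, the values can be read from `config_for_group(grp_id)` instead of being hard-coded.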

In [0]:
for grp_id in range(1,15) :

  print("================================================================================================")
  print("Group ID : {}".format(grp_id))
  print("================================================================================================")

  path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/'
  base_path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/' + str(grp_id) + '/'

  corpus_path =  path + str(grp_id) + '/classification_corpus.pkl'
  label_dict_path = path + str(grp_id) + '/classification_corpus_label_dict.pkl'

  # 1. Reading Corpus File : which we prepared before-hand
  with open(corpus_path, mode='rb') as f:
    corpus = pickle.load(f)

  # 2. Reading Corpus Dictionary : which we computed & saved
  with open(label_dict_path, mode='rb') as f:
    label_dict = pickle.load(f)

  # 3. make a list of word embeddings 
  word_embeddings = [ 
                     WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/gensim_model'), # Custom Word Embedding 
                     
    ]

  # Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
  document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=128, 
    rnn_layers = 1,  
    reproject_words=True, 
    reproject_words_dimension=256
  )

  classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label=True)

  # 6. initialize the text classifier trainer
  trainer = ModelTrainer(classifier,
                          corpus,
                          optimizer=Adam
                          )


  trainer.train(base_path,
                learning_rate=0.1,
                mini_batch_size=128,
                anneal_factor=0.5,
                patience=5,
                max_epochs=10,
                checkpoint=True,
                embeddings_storage_mode='gpu'
                )

# 3. Train Multiple Classifiers on Balanced Dataset


Although the metrics improve when multiple classifiers are evaluated on the validation/test datasets of their respective groups, those groups are still fairly skewed, which makes it hard for the model to learn meaningful relations.

Hence we now train the models on the normalised dataset that we prepared.

**Note** : Different ways of balancing/normalising work for different datasets and problems. Here clipping off worked well for us, but penalising the loss for under-represented classes did not.
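
The notebook does not show how the normalised dataset was built. As an illustration only, one common reading of "clipping" is to cap the number of examples kept per label. The sketch below assumes a single label per example (the real corpus is multi-label) and uses made-up data; `clip_per_label` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def clip_per_label(examples, max_per_label, seed=0):
    """Keep at most `max_per_label` randomly chosen examples for each label.

    `examples` is a list of (text, label) pairs with one label per example.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    clipped = []
    for items in by_label.values():
        if len(items) > max_per_label:
            items = rng.sample(items, max_per_label)
        clipped.extend(items)
    rng.shuffle(clipped)
    return clipped


# Hypothetical toy data: 10 examples of label "a", 3 of label "b".
examples = [(f"a{i}", "a") for i in range(10)] + [(f"b{i}", "b") for i in range(3)]
clipped = clip_per_label(examples, max_per_label=4)
print(Counter(label for _, label in clipped))
```

Clipping throws data away, so it is worth checking that the capped labels still have enough variety left to learn from.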

We run the same set of experiments & build a hypothesis.

### Experiments 

  1. We choose a couple of representative groups
  2. We try different architectures & embeddings
  3. We train for only 2 epochs to get quick results
  4. We manually tune a couple of hyper-parameters based on our hypotheses


### Things to try & Build Hypothesis on:
  1. GRU/LSTM cells
  2. Number of RNN Layers 
  3. Hidden Units / Time steps / Sequence Length
  4. Embeddings
  5. Batch Size
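
The cells in this section also vary `multi_label_threshold`. Since the model is trained with `BCEWithLogitsLoss`, each label gets its own independent score; the sketch below assumes a label is predicted when the sigmoid of its logit exceeds the threshold. It is a pure-Python illustration with made-up logits and label names, not Flair's code.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold):
    """Return (label, probability) pairs whose probability exceeds `threshold`."""
    scored = [(name, sigmoid(z)) for name, z in zip(label_names, logits)]
    return [(name, p) for name, p in scored if p > threshold]


# Hypothetical logits for a four-label group.
label_names = ["A", "B", "C", "D"]
logits = [2.0, -0.5, -1.5, -3.0]

for threshold in (0.5, 0.3, 0.1):
    picked = [name for name, _ in predict_labels(logits, label_names, threshold)]
    print(threshold, picked)
```

Lowering the threshold trades precision for recall: more labels are emitted per document, including less certain ones.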



In [0]:
for grp_id in [10] :
  print("================================================================================================")
  print("Group ID : {}".format(grp_id))
  print("================================================================================================")

  path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/'
  base_path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/' + str(grp_id) + '/'

  corpus_path =  path + str(grp_id) + '/classification_corpus.pkl'
  label_dict_path = path + str(grp_id) + '/classification_corpus_label_dict.pkl'

  # 1. Reading Corpus File : which we prepared before-hand
  with open(corpus_path, mode='rb') as f:
    corpus = pickle.load(f)

  # 2. Reading Corpus Dictionary : which we computed & saved
  with open(label_dict_path, mode='rb') as f:
    label_dict = pickle.load(f)

  # 3. make a list of word embeddings 
  word_embeddings = [ 
                     WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/gensim_model'), # Custom Word Embedding 
                    ## comment in different embeddings for state-of-the-art results
                     
                    #  WordEmbeddings('glove'),                 
                    # FlairEmbeddings('news-forward'),
                    # FlairEmbeddings('news-backward'),
                     
    ]

  # Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
  document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=64, # Build a hypothesis for different values
    rnn_layers = 1,  # Build a hypothesis for different values
    bidirectional = True, # Try changing the behaviour of the model
    reproject_words=True, 
    reproject_words_dimension=256,
    dropout = 0 ,
    rnn_type = 'LSTM'
  )

  classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label_threshold = 0.3 , # Check with different Thresholds
                            multi_label=True)

  # 6. initialize the text classifier trainer
  trainer = ModelTrainer(classifier,
                          corpus,
                          optimizer=Adam
                          )


  trainer.train(base_path,
                learning_rate=0.06,
                mini_batch_size=64,
                anneal_factor=0.5,
                patience=5,
                max_epochs=2,
                checkpoint=True,
                embeddings_storage_mode='gpu',
#                 sampler=ImbalancedClassificationDatasetSampler # Check if punishing the mis-classification of less frequent labels heavily helps?
                )

Group ID : 10


  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


2020-01-10 07:34:12,540 ----------------------------------------------------------------------------------------------------
2020-01-10 07:34:12,546 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/gensim_model')
    )
    (word_reprojection_map): Linear(in_features=300, out_features=256, bias=True)
    (rnn): LSTM(256, 64, batch_first=True, bidirectional=True)
  )
  (decoder): Linear(in_features=256, out_features=4, bias=True)
  (loss_function): BCEWithLogitsLoss()
)"
2020-01-10 07:34:12,550 ----------------------------------------------------------------------------------------------------
2020-01-10 07:34:12,553 Corpus: "Corpus: 8811 train + 1101 dev + 1102 test sentences"
2020-01-10 07:34:12,556 ----------------------------------------------------------------------------------------------------
2020-01-10

  "type " + obj.__name__ + ". It won't be checked "


2020-01-10 07:39:12,944 ----------------------------------------------------------------------------------------------------
2020-01-10 07:39:22,359 epoch 2 - iter 0/138 - loss 0.68621749 - samples/sec: 374.58
2020-01-10 07:39:47,127 epoch 2 - iter 13/138 - loss 0.61067164 - samples/sec: 41.58
2020-01-10 07:40:05,630 epoch 2 - iter 26/138 - loss 0.61288287 - samples/sec: 48.34
2020-01-10 07:40:23,371 epoch 2 - iter 39/138 - loss 0.61177309 - samples/sec: 51.07
2020-01-10 07:40:40,518 epoch 2 - iter 52/138 - loss 0.61445505 - samples/sec: 52.62
2020-01-10 07:40:58,839 epoch 2 - iter 65/138 - loss 0.62030486 - samples/sec: 49.07
2020-01-10 07:41:18,826 epoch 2 - iter 78/138 - loss 0.62219412 - samples/sec: 45.22
2020-01-10 07:41:37,242 epoch 2 - iter 91/138 - loss 0.61707596 - samples/sec: 53.72
2020-01-10 07:41:55,451 epoch 2 - iter 104/138 - loss 0.61539865 - samples/sec: 56.66
2020-01-10 07:42:14,409 epoch 2 - iter 117/138 - loss 0.61404586 - samples/sec: 51.61
2020-01-10 07:42:30,974

Once your experiments are done and you have finalised the top 2-3 architectures and configurations, use them to train the classifiers.

**Note** : You should run these experiments individually for every group and train a custom model for each of them. Here we use a vanilla configuration for all the groups.

**Fun Fact** : We ran ~80 architecture experiments on this small dataset alone to build the hypotheses for this demonstration.

In [0]:
for grp_id in range(1,15) :
  print("================================================================================================")
  print("Group ID : {}".format(grp_id))
  print("================================================================================================")

  path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/'
  base_path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/' + str(grp_id) + '/'

  corpus_path =  path + str(grp_id) + '/classification_corpus.pkl'
  label_dict_path = path + str(grp_id) + '/classification_corpus_label_dict.pkl'

  # 1. Reading Corpus File : which we prepared before-hand
  with open(corpus_path, mode='rb') as f:
    corpus = pickle.load(f)

  # 2. Reading Corpus Dictionary : which we computed & saved
  with open(label_dict_path, mode='rb') as f:
    label_dict = pickle.load(f)

  # 3. make a list of word embeddings 
  word_embeddings = [ 
                     WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/word_embedding/gensim_model'), # Custom Word Embedding 
                     
    ]

  # Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
  document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=64, # Build a hypothesis for different values
    rnn_layers = 1,  # Build a hypothesis for different values
    bidirectional = True, # Try changing the behaviour of the model
    reproject_words=True, 
    reproject_words_dimension=256,
    dropout = 0 ,
    rnn_type = 'LSTM'
  )

  classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label_threshold = 0.1 , # Check with different Thresholds
                            multi_label=True)

  # 6. initialize the text classifier trainer
  trainer = ModelTrainer(classifier,
                          corpus,
                          optimizer=Adam
                          )


  trainer.train(base_path,
                learning_rate=0.03,
                mini_batch_size=16,
                anneal_factor=0.5,
                patience=5,
                max_epochs=10,
                checkpoint=True,
                embeddings_storage_mode='gpu'
                )