# Text Classification Model

In this notebook we train and compare different text classifiers to solve our problem.

  1. Train a single classifier and its variations.
  2. Train multiple classifiers on the unbalanced dataset.
  3. Train multiple classifiers on the balanced dataset.

In [0]:
# First let's check what Google has given us! Thank you, Google, for the GPU
!nvidia-smi

In [0]:
# Let's mount our Google Drive. Hey!! For the GPU, you now give your data to Google

from google.colab import drive
drive.mount('/content/drive')

In [0]:
# Install necessary packages and restart the environment

! pip install tiny-tokenizer
! pip install  flair
! pip install -U tensorflow-gpu

In [0]:
# Let's import our packages !

import pandas as pd
from tqdm import tqdm
import html
import re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
import flair
import pickle
from torch.optim.adam import Adam

from flair.data import Corpus
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings, FastTextEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from flair.samplers import ImbalancedClassificationDatasetSampler


In [0]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

# 1. Train a Single Classifier

The first approach is to load the entire dataset (~1M) and train a single classifier powerful enough to perform well on all of it.

Challenges:
  * Super slow to experiment with different architectures
  * Super slow to train a model
  * Super slow to tune hyperparameters
  * Highly skewed dataset
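
One quick way to put a number on "highly skewed" is the ratio between the most and least frequent label. The snippet below is a minimal sketch: `label_skew` and the toy labels are hypothetical and are not taken from the tutorial data.

```python
from collections import Counter


def label_skew(labels):
    """Return (counts, imbalance_ratio) for a list of class labels."""
    counts = Counter(labels)
    most = max(counts.values())
    least = min(counts.values())
    return counts, most / least


# Toy labels, purely illustrative (NOT the tutorial dataset).
toy_labels = ["a"] * 90 + ["b"] * 9 + ["c"] * 1
counts, ratio = label_skew(toy_labels)
print(counts.most_common())
print("imbalance ratio:", ratio)
```

A ratio close to 1 means the labels are roughly balanced; the larger it gets, the more the rare labels are drowned out during training.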


Experiments (the # numbers refer to the numbered steps in the code cell below):
  * Try stacking multiple embeddings at #3
  * Try custom embeddings at #3
  * Try a different RNN cell at #4
  * Try a different hidden size at #4
  * All hyperparameters at #7 are tunable for extensive experimentation
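
To keep such experiments organised, it can help to write the search space down explicitly and expand it into one configuration per run. This is only a sketch: `search_space` and `expand_grid` are hypothetical names, and the values are placeholders rather than recommendations. The keys match the `DocumentRNNEmbeddings` arguments used in this notebook (`rnn_type`, `hidden_size`, `rnn_layers`).

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not recommendations.
search_space = {
    "rnn_type": ["GRU", "LSTM"],
    "hidden_size": [64, 128, 256],
    "rnn_layers": [1, 2],
}


def expand_grid(space):
    """Expand a dict of option lists into a list of config dicts."""
    keys = list(space)
    combos = product(*(space[key] for key in keys))
    return [dict(zip(keys, values)) for values in combos]


configs = expand_grid(search_space)
print(len(configs), "configurations to try")
print(configs[0])
```

Each dict could then be unpacked into the document embedding at step 4, for example `DocumentRNNEmbeddings(word_embeddings, **config)`.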

In [0]:
path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/'
corpus_path = path+'training_data/classification_corpus.pkl'
label_dict_path = path+'training_data/classification_corpus_label_dict.pkl'


# 1. Reading Corpus File : which we prepared before-hand
with open(corpus_path, mode='rb') as f:
  corpus = pickle.load(f)

# 2. Reading Corpus Dictionary : which we computed & saved
with open(label_dict_path, mode='rb') as f:
  label_dict = pickle.load(f)

# 3. make a list of word embeddings 
word_embeddings = [
  WordEmbeddings('glove'),

  # uncomment the flair embeddings for state-of-the-art results
  # FlairEmbeddings('news-forward'),
  # FlairEmbeddings('news-backward'),
]

# 4. initialize document embedding by passing list of word embeddings
## Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
document_embeddings = DocumentRNNEmbeddings(
  word_embeddings,
  hidden_size=128,
  reproject_words=True,
  reproject_words_dimension=256,
)

# 5. create the text classifier
classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label=True)

# 6. initialize the text classifier trainer
trainer = ModelTrainer(classifier,
                        corpus,
                        optimizer=Adam,
                        use_tensorboard=True)

# 7. start the training
model_path = path + 'model/full_model'  # path already ends with '/'
trainer.train(model_path,
              learning_rate=0.06,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=2,
              checkpoint=True,
              embeddings_storage_mode='gpu',
              num_workers=12)

# 2. Train Multiple Classifiers on Unbalanced Dataset

A single classifier fails to fit the data and performs poorly on all the metrics. This led us to divide the problem into sub-groups and focus on each group individually rather than training one titan model for everything.

In this section, we train multiple classifiers, one for each of the groups that we created.

### Experiments 

  1. We choose a couple of representative groups
  2. We try different architectures & embeddings
  3. We train for only 2 epochs to get quick results
  4. We manually try a few hyperparameter settings based on our hypotheses


### Things to try & build hypotheses on:
  1. GRU/LSTM cells
  2. Number of RNN Layers 
  3. Hidden Units / Time steps / Sequence Length
  4. Embeddings
  5. Batch Size
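
When forming a hypothesis about cell type, depth and hidden size, it helps to know how many trainable weights each choice adds. The sketch below counts the recurrent parameters using the layout PyTorch uses for `nn.GRU` and `nn.LSTM` (input weights, recurrent weights and two bias vectors per gate). The function name is hypothetical, and the count leaves out the word reprojection layer and the classifier head.

```python
# Number of gates per cell type (GRU has 3, LSTM has 4).
GATES = {"GRU": 3, "LSTM": 4}


def rnn_parameter_count(rnn_type, input_size, hidden_size,
                        rnn_layers=1, bidirectional=False):
    """Count the trainable parameters of a stacked GRU/LSTM."""
    gates = GATES[rnn_type]
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(rnn_layers):
        # weight_ih + weight_hh + bias_ih + bias_hh, for all gates at once
        per_direction = gates * hidden_size * (layer_input + hidden_size + 2)
        total += directions * per_direction
        # deeper layers read the (possibly concatenated) output of the previous one
        layer_input = hidden_size * directions
    return total


# input_size=256 corresponds to reproject_words_dimension=256 in this notebook
for cell in ("GRU", "LSTM"):
    print(cell, rnn_parameter_count(cell, input_size=256, hidden_size=128))
```

With the same sizes an LSTM carries a third more recurrent weights than a GRU, which is worth keeping in mind when only 2 epochs are available per experiment.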







In [0]:
for grp_id in [3,11] :
  print("================================================================================================")
  print("Group ID : {}".format(grp_id))
  print("================================================================================================")

  path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/'
  base_path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/' + str(grp_id) + '/'

  corpus_path =  path + str(grp_id) + '/classification_corpus.pkl'
  label_dict_path = path + str(grp_id) + '/classification_corpus_label_dict.pkl'

  # 1. Reading Corpus File : which we prepared before-hand
  with open(corpus_path, mode='rb') as f:
    corpus = pickle.load(f)

  # 2. Reading Corpus Dictionary : which we computed & saved
  with open(label_dict_path, mode='rb') as f:
    label_dict = pickle.load(f)

  # 3. make a list of word embeddings 
  word_embeddings = [ 
                     WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/model_300/gensim_model'), # Custom Word Embedding 
                    ## uncomment different embeddings for state-of-the-art results
                     
                    #  WordEmbeddings('glove'),                 
                    # FlairEmbeddings('news-forward'),
                    # FlairEmbeddings('news-backward'),
                     
    ]

  # Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
  document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=256, # Build a hypothesis for different values
    rnn_layers = 2,  # Build a hypothesis for different values
    reproject_words=True, 
    reproject_words_dimension=256
  )

  classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label=True)

  # 6. initialize the text classifier trainer
  trainer = ModelTrainer(classifier,
                          corpus,
                          optimizer=Adam
                          )

  # model_path = path + '/model_70_30/full_model_1'

  trainer.train(base_path,
                learning_rate=0.1,
                mini_batch_size=128,
                anneal_factor=0.5,
                patience=5,
                max_epochs=2,
                checkpoint=True,
                )

Once your experiments have narrowed things down to the top 2-3 architectures and configurations, use them to train the classifiers.

**Note**: Ideally you should run these experiments individually for every group and train a custom model for each of them. Here we use a vanilla configuration for all the groups.
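
Picking the "top 2-3" is easier if every run is logged with its configuration and score. A minimal sketch follows, assuming each experiment is recorded as a dict. The names and scores are made-up placeholders, not results from this tutorial, and `top_k` is a hypothetical helper.

```python
from operator import itemgetter

# Hypothetical experiment log: names and scores are invented placeholders.
experiment_log = [
    {"name": "gru_h128_l1", "dev_f1": 0.61},
    {"name": "gru_h256_l2", "dev_f1": 0.64},
    {"name": "lstm_h64_l1_bi", "dev_f1": 0.66},
    {"name": "lstm_h128_l2", "dev_f1": 0.58},
]


def top_k(log, k=3, metric="dev_f1"):
    """Return the k best experiments, highest metric first."""
    return sorted(log, key=itemgetter(metric), reverse=True)[:k]


best = top_k(experiment_log, k=2)
for entry in best:
    print(entry["name"], entry["dev_f1"])
```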

In [0]:
for grp_id in range(1,15) :

  print("================================================================================================")
  print("Group ID : {}".format(grp_id))
  print("================================================================================================")

  path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/training_data/group/'
  base_path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/training_data/group/' + str(grp_id) + '/'

  corpus_path =  path + str(grp_id) + '/classification_corpus.pkl'
  label_dict_path = path + str(grp_id) + '/classification_corpus_label_dict.pkl'

  # 1. Reading Corpus File : which we prepared before-hand
  with open(corpus_path, mode='rb') as f:
    corpus = pickle.load(f)

  # 2. Reading Corpus Dictionary : which we computed & saved
  with open(label_dict_path, mode='rb') as f:
    label_dict = pickle.load(f)

  # 3. make a list of word embeddings 
  word_embeddings = [ 
                     WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/model_300/gensim_model'), # Custom Word Embedding 
                     
    ]

  # Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
  document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=128, 
    rnn_layers = 1,  
    reproject_words=True, 
    reproject_words_dimension=256
  )

  classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label=True)

  # 6. initialize the text classifier trainer
  trainer = ModelTrainer(classifier,
                          corpus,
                          optimizer=Adam
                          )


  trainer.train(base_path,
                learning_rate=0.1,
                mini_batch_size=128,
                anneal_factor=0.5,
                patience=5,
                max_epochs=10,
                checkpoint=True,
                )

# 3. Train Multiple Classifiers on Balanced Dataset


Training multiple classifiers, one per group, improves the metrics on each group's validation/test dataset. It should still be noted that those groups are fairly skewed, which makes it hard for the model to learn meaningful relations.

Hence we now train the models on the normalised dataset that we prepared.

**Note**: Different ways of balancing/normalising work for different datasets and problems. Here clipping off worked well for us, but penalising the loss for under-represented classes did not.
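
The notebook does not show how the normalised dataset was built, so the following is only a sketch of one common reading of "clipping": cap the number of examples kept per label. It assumes single-label `(text, label)` pairs for simplicity, whereas the classifiers here are multi-label, so real data would need a rule for examples that carry several labels. `clip_per_label` and the toy examples are hypothetical.

```python
import random
from collections import defaultdict


def clip_per_label(examples, max_per_label, seed=13):
    """Keep at most max_per_label randomly chosen examples for each label."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    clipped = []
    for items in by_label.values():
        if len(items) > max_per_label:
            items = rng.sample(items, max_per_label)
        clipped.extend(items)
    rng.shuffle(clipped)
    return clipped


# Toy data, purely illustrative: 90 / 9 / 1 examples for labels a / b / c.
toy_examples = (
    [("text a%d" % i, "a") for i in range(90)]
    + [("text b%d" % i, "b") for i in range(9)]
    + [("text c0", "c")]
)
clipped = clip_per_label(toy_examples, max_per_label=10)
print(len(toy_examples), "->", len(clipped))
```

Only the over-represented label loses examples; the rare labels are left untouched, so the gap between them shrinks without duplicating any data.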

We run the same set of experiments & build a hypothesis.

### Experiments 

  1. We choose a couple of representative groups
  2. We try different architectures & embeddings
  3. We train for only 2 epochs to get quick results
  4. We manually try a few hyperparameter settings based on our hypotheses


### Things to try & build hypotheses on:
  1. GRU/LSTM cells
  2. Number of RNN Layers 
  3. Hidden Units / Time steps / Sequence Length
  4. Embeddings
  5. Batch Size
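
The `anneal_factor` and `patience` arguments passed to `trainer.train` control learning-rate annealing. The sketch below simulates plateau-style annealing under the assumption that the rate is multiplied by `anneal_factor` once the score has failed to improve for more than `patience` consecutive epochs (the rule PyTorch's `ReduceLROnPlateau` follows). `plateau_schedule` is a hypothetical helper and the scores are toy values.

```python
def plateau_schedule(scores, learning_rate, anneal_factor=0.5, patience=5):
    """Return the learning rate in effect after each epoch (higher score = better)."""
    best = float("-inf")
    bad_epochs = 0
    history = []
    for score in scores:
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            learning_rate *= anneal_factor
            bad_epochs = 0
        history.append(learning_rate)
    return history


# Toy scores: one good epoch followed by a long plateau.
print(plateau_schedule([0.5] + [0.4] * 7, learning_rate=0.1))
# A 2-epoch quick experiment never reaches the patience limit.
print(plateau_schedule([0.3, 0.2], learning_rate=0.1))
```

Under this assumption, the 2-epoch quick runs always train at the initial learning rate, so `anneal_factor` and `patience` only start to matter in the longer 10-epoch runs.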



In [0]:
for grp_id in [3,11] :
  print("================================================================================================")
  print("Group ID : {}".format(grp_id))
  print("================================================================================================")

  path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised_training_data/group/'
  base_path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised_training_data/group/' + str(grp_id) + '/'

  corpus_path =  path + str(grp_id) + '/classification_corpus.pkl'
  label_dict_path = path + str(grp_id) + '/classification_corpus_label_dict.pkl'

  # 1. Reading Corpus File : which we prepared before-hand
  with open(corpus_path, mode='rb') as f:
    corpus = pickle.load(f)

  # 2. Reading Corpus Dictionary : which we computed & saved
  with open(label_dict_path, mode='rb') as f:
    label_dict = pickle.load(f)

  # 3. make a list of word embeddings 
  word_embeddings = [ 
                     WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/model_300/gensim_model'), # Custom Word Embedding 
                    ## uncomment different embeddings for state-of-the-art results
                     
                    #  WordEmbeddings('glove'),                 
                    # FlairEmbeddings('news-forward'),
                    # FlairEmbeddings('news-backward'),
                     
    ]

  # Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
  document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=64, # Build a hypothesis for different values
    rnn_layers = 1,  # Build a hypothesis for different values
    bidirectional = True, # Try changing the behaviour of the model
    reproject_words=True, 
    reproject_words_dimension=256,
    dropout = 0 ,
    rnn_type = 'LSTM'
  )

  classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label_threshold = 0.3 , # Check with different Thresholds
                            multi_label=True)

  # 6. initialize the text classifier trainer
  trainer = ModelTrainer(classifier,
                          corpus,
                          optimizer=Adam
                          )


  trainer.train(base_path,
                learning_rate=0.06,
                mini_batch_size=64,
                anneal_factor=0.5,
                patience=5,
                max_epochs=2,
                checkpoint=True,
                sampler=ImbalancedClassificationDatasetSampler # Check whether sampling less frequent labels more often helps
                )

Once your experiments have narrowed things down to the top 2-3 architectures and configurations, use them to train the classifiers.

**Note**: Ideally you should run these experiments individually for every group and train a custom model for each of them. Here we use a vanilla configuration for all the groups.

**Fun Fact**: We ran ~80 architecture experiments on this small dataset alone to build the hypotheses for this demonstration.
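
The cells in this section also vary `multi_label_threshold` (0.3 in the quick experiments, 0.1 in the final run). The sketch below shows why that value matters: it turns per-label scores into a label set, and moving it trades precision against recall. It assumes a label is predicted when its score is strictly above the threshold; the scores and gold labels are invented toy values, and both helper functions are hypothetical.

```python
def predict_labels(scores, threshold):
    """Keep every label whose score exceeds the threshold."""
    return {label for label, score in scores.items() if score > threshold}


def micro_f1(gold_sets, predicted_sets):
    """Micro-averaged F1 over a list of gold/predicted label sets."""
    pairs = list(zip(gold_sets, predicted_sets))
    tp = sum(len(gold & pred) for gold, pred in pairs)
    fp = sum(len(pred - gold) for gold, pred in pairs)
    fn = sum(len(gold - pred) for gold, pred in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy scores and gold labels, invented for illustration only.
toy_scores = [
    {"x": 0.9, "y": 0.35, "z": 0.05},
    {"x": 0.2, "y": 0.6, "z": 0.15},
]
toy_gold = [{"x", "y"}, {"y"}]

for threshold in (0.1, 0.3, 0.5):
    predictions = [predict_labels(scores, threshold) for scores in toy_scores]
    print(threshold, round(micro_f1(toy_gold, predictions), 3))
```

A low threshold lets extra labels through (more false positives), a high one drops correct labels (more false negatives), so it is worth sweeping a few values on the dev set of each group.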

In [0]:
for grp_id in range(1,15) :
  print("================================================================================================")
  print("Group ID : {}".format(grp_id))
  print("================================================================================================")

  path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised_training_data/group/'
  base_path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised_training_data/group/' + str(grp_id) + '/'

  corpus_path =  path + str(grp_id) + '/classification_corpus.pkl'
  label_dict_path = path + str(grp_id) + '/classification_corpus_label_dict.pkl'

  # 1. Reading Corpus File : which we prepared before-hand
  with open(corpus_path, mode='rb') as f:
    corpus = pickle.load(f)

  # 2. Reading Corpus Dictionary : which we computed & saved
  with open(label_dict_path, mode='rb') as f:
    label_dict = pickle.load(f)

  # 3. make a list of word embeddings 
  word_embeddings = [ 
                     WordEmbeddings('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/model_300/gensim_model'), # Custom Word Embedding 
                     
    ]

  # Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
  document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=64, # Build a hypothesis for different values
    rnn_layers = 1,  # Build a hypothesis for different values
    bidirectional = True, # Try changing the behaviour of the model
    reproject_words=True, 
    reproject_words_dimension=256,
    dropout = 0 ,
    rnn_type = 'LSTM'
  )

  classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label_threshold = 0.1 , # Check with different Thresholds
                            multi_label=True)

  # 6. initialize the text classifier trainer
  trainer = ModelTrainer(classifier,
                          corpus,
                          optimizer=Adam
                          )


  trainer.train(base_path,
                learning_rate=0.03,
                mini_batch_size=16,
                anneal_factor=0.5,
                patience=5,
                max_epochs=10,
                checkpoint=True
                )