# Prediction Pipeline | Ensemble Pipeline

In this notebook, we ensemble our models into a prediction pipeline.
The key steps are as follows:

  1. Preprocess text for inference.
  2. Load the classifiers iteratively into a list.
  3. Load the test data and preprocess it.
  4. Set the classifier threshold and run the test data through the ensemble.



In [1]:
# First, let's check what Google has given us! Thank you, Google, for the GPU

!nvidia-smi

Mon Jan 13 17:29:56 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [2]:
# Let's mount our Google Drive. Hey! In exchange for the GPU, you now give your data to Google

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Install necessary packages and restart the environment

!pip install tiny-tokenizer
!pip install  flair

Collecting tiny-tokenizer
  Downloading https://files.pythonhosted.org/packages/8d/0f/aa52c227c5af69914be05723b3deaf221805a4ccbce87643194ef2cdde43/tiny_tokenizer-3.1.0.tar.gz
Building wheels for collected packages: tiny-tokenizer
  Building wheel for tiny-tokenizer (setup.py) ... done
  Created wheel for tiny-tokenizer: filename=tiny_tokenizer-3.1.0-cp36-none-any.whl size=10550 sha256=c0ca89bccd7511da93e1bb1efdf7ec0527fb87a267de83cce971ec0242e1d196
  Stored in directory: /root/.cache/pip/wheels/d1/c8/36/334497a689fab90128232e86b5829b800dd271a3d5d5959c53
Successfully built tiny-tokenizer
Installing collected packages: tiny-tokenizer
Successfully installed tiny-tokenizer-3.1.0
Collecting flair
  Downloading https://files.pythonhosted.org/packages/16/22/8fc8e5978ec05b710216735ca47415700e83f304dec7e4281d61cefb6831/flair-0.4.4-py3-none-any.whl (193kB)
     |████████████████████████████████| 194kB 2.7MB/s 
Collecting mpld3==0.3
  Downloading https://files.pyt

In [1]:
# Let's import our packages !

import pandas as pd
from tqdm import tqdm
import html
import re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
# import flair
import pickle
from torch.optim.adam import Adam

# The Sentence object holds a sentence that we may want to embed or tag
from flair.data import Sentence
from flair.data import Corpus
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from flair.samplers import ImbalancedClassificationDatasetSampler

# 1. Text Pre-processing

### Note -
We must apply exactly the same set of text transformations during inference as we did during training. Any mismatch changes the input the model sees and directly degrades its predictions.

In [0]:
# matches HTML tags (non-greedy), e.g. <p> or </code>
clean = re.compile('<.*?>')

def preprocess_text(text) :
  try :
    # strip HTML tags first, then decode entities such as &amp;
    # soup = BeautifulSoup(text, "html.parser")
    # text = soup.get_text()
    text = re.sub(clean, '', text)
    text = html.unescape(text)
  except Exception :
    # on failure, report the input and continue with the text unchanged
    print("Error in HTML Processing ...")
    print(text)
  try :
    # replace newlines, tabs and carriage returns with spaces (often present in really noisy text)
    text = text.translate(str.maketrans("\n\t\r", "   "))
  except Exception :
    print("Error in removing extra lines ...")
    print(text)

  try :
    # collapse repeated spaces and trim both ends
    text = re.sub(' +', ' ', text)
    text = text.strip()
  except Exception :
    print("Error in extra whitespace removal ...")
    print(text)

  return text
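
As a quick sanity check, the sketch below applies the same three steps (strip tags, unescape entities, normalise whitespace) to a made-up input. The helper name `preprocess_demo` and the sample string are hypothetical and only serve as illustration. Note that tags are stripped before entities are decoded, so an escaped `&lt;int&gt;` survives as `<int>` instead of being removed as a tag.

```python
import html
import re

# Same non-greedy tag pattern as in the cell above
TAG_RE = re.compile(r'<.*?>')

def preprocess_demo(text):
  # 1. strip HTML tags, 2. decode entities
  text = html.unescape(TAG_RE.sub('', text))
  # 3. turn newlines/tabs/carriage returns into spaces
  text = text.translate(str.maketrans("\n\t\r", "   "))
  # 4. collapse repeated spaces and trim
  return re.sub(' +', ' ', text).strip()

# Hypothetical noisy input, not taken from the dataset
raw = "<p>How to use   <code>std::vector&lt;int&gt;</code>\n in C++ ?</p>"
cleaned = preprocess_demo(raw)
print(cleaned)  # How to use std::vector<int> in C++ ?
```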

# 2. Load Classifiers

We iterate through the saved classifiers and load them, appending each one to a list that can be easily iterated over during evaluation/prediction.

In [3]:
classifiers  = []

for grp_id in tqdm(range(1,15)) :
  path  = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/' + str(grp_id) + '/best-model.pt'
  classifier = TextClassifier.load(path)
  classifiers.append(classifier)

  0%|          | 0/14 [00:00<?, ?it/s]

2020-01-13 17:32:19,739 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/1/best-model.pt


  7%|▋         | 1/14 [00:19<04:10, 19.28s/it]

2020-01-13 17:32:39,018 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/2/best-model.pt


 14%|█▍        | 2/14 [00:46<04:21, 21.77s/it]

2020-01-13 17:33:06,616 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/3/best-model.pt


 21%|██▏       | 3/14 [00:56<03:20, 18.25s/it]

2020-01-13 17:33:16,632 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/4/best-model.pt


 29%|██▊       | 4/14 [01:21<03:22, 20.29s/it]

2020-01-13 17:33:41,682 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/5/best-model.pt


 36%|███▌      | 5/14 [01:53<03:31, 23.53s/it]

2020-01-13 17:34:12,777 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/6/best-model.pt


 43%|████▎     | 6/14 [02:23<03:23, 25.46s/it]

2020-01-13 17:34:42,741 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/7/best-model.pt


 50%|█████     | 7/14 [02:50<03:02, 26.11s/it]

2020-01-13 17:35:10,354 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/8/best-model.pt


 57%|█████▋    | 8/14 [03:19<02:41, 26.86s/it]

2020-01-13 17:35:38,974 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/9/best-model.pt


 64%|██████▍   | 9/14 [03:49<02:19, 27.87s/it]

2020-01-13 17:36:09,186 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/10/best-model.pt


 71%|███████▏  | 10/14 [04:20<01:55, 28.79s/it]

2020-01-13 17:36:40,135 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/11/best-model.pt


 79%|███████▊  | 11/14 [04:46<01:24, 28.09s/it]

2020-01-13 17:37:06,590 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/12/best-model.pt


 86%|████████▌ | 12/14 [05:17<00:57, 28.78s/it]

2020-01-13 17:37:36,993 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/13/best-model.pt


 93%|█████████▎| 13/14 [05:50<00:30, 30.01s/it]

2020-01-13 17:38:09,865 loading file /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/normalised/group/14/best-model.pt


100%|██████████| 14/14 [06:18<00:00, 29.41s/it]


# 3. Load Test Data & Preprocess

We load the test data from the previously created split and convert it back into a DataFrame with the corresponding labels.

Finally, we save the DataFrame with a placeholder column for the predicted labels.

In [0]:
test_data = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/70_30_split/test.txt'

with open(test_data,'r',encoding='utf-8') as f :
  data = f.readlines()

# prefix from the Training Data Format
prefix = '__label__'

In [0]:
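test_text = []
test_label = []
for doc in tqdm(data) :
  splits = doc.split()
  labels = []
  # leading tokens that start with the prefix are labels; everything after is the text
  for word in splits :
    if word.startswith(prefix) :
      labels.append(word[len(prefix):])
    else :
      break
  # re-join the remaining tokens; summing label lengths as a character offset
  # would skip the separating spaces and leak label characters into the text
  text = ' '.join(splits[len(labels):])
  test_text.append(text)
  test_label.append(labels)

test_df = pd.DataFrame(list(zip(test_text,test_label)),columns = ['text','original_labels'])
test_df['predicted_labels'] = None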


100%|██████████| 157799/157799 [00:02<00:00, 71856.09it/s]
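
To make the expected input format concrete, here is a minimal sketch of how a single label-prefixed line separates into labels and text. The function name `split_labelled_line` and the sample line are hypothetical; only the `__label__` prefix comes from the training data format used above.

```python
def split_labelled_line(line, prefix='__label__'):
  """Split one label-prefixed line into (labels, text)."""
  tokens = line.split()
  labels = []
  for token in tokens:
    # labels are the leading tokens that carry the prefix
    if not token.startswith(prefix):
      break
    labels.append(token[len(prefix):])
  # everything after the last leading label is the document text
  text = ' '.join(tokens[len(labels):])
  return labels, text

# Hypothetical example line with two labels
labels, text = split_labelled_line("__label__java __label__multithreading How to lock memory ?\n")
print(labels)  # ['java', 'multithreading']
print(text)    # How to lock memory ?
```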


In [0]:
test_df.to_pickle('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/70_30_split/test.pkl')
test_df.head()

# 4. Run the Ensemble Pipeline

Once the test data is loaded, we run each example through all the classifiers and store the predictions so that we can evaluate performance.

### Coding Exercise 
  1. Multi-thread/process the prediction pipeline.
  2. Find a redundant line of code in the cell below
  3. Optimize it

In [0]:
def predict_ensemble(sentence,threshold=0.1) :
  labels  = [] 
  ## Iterate through each classifier for prediction
  for classifier in classifiers:
    classifier.multi_label_threshold = threshold
    classifier.predict(sentence)
    for label in sentence.labels :
      ## Append labels from all classifiers
      labels.append(label.value)
  return labels
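
To see how the threshold shapes the ensemble output without loading any models, the sketch below uses a hypothetical `StubClassifier` with fixed, made-up confidence scores. It is not the flair API; it only mimics the idea that each classifier contributes the labels whose confidence exceeds the threshold, and that the ensemble concatenates those contributions.

```python
class StubClassifier:
  """Hypothetical stand-in for a loaded model: label -> fixed confidence."""
  def __init__(self, scores):
    self.scores = scores

  def predict(self, threshold):
    # keep only labels whose confidence exceeds the threshold
    return [label for label, score in self.scores.items() if score > threshold]

def ensemble_labels(stubs, threshold):
  # concatenate the labels contributed by every classifier
  labels = []
  for stub in stubs:
    labels.extend(stub.predict(threshold))
  return labels

# Made-up scores, for illustration only
stubs = [
  StubClassifier({'java': 0.95, 'python': 0.40}),
  StubClassifier({'testing': 0.85, 'excel': 0.15}),
]
strict = ensemble_labels(stubs, threshold=0.9)
loose = ensemble_labels(stubs, threshold=0.1)
print(strict)  # ['java']
print(loose)   # ['java', 'python', 'testing', 'excel']
```

A high threshold favours precision (fewer, more confident labels), while a low threshold favours recall at the cost of more spurious labels.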

In [0]:
# Read the cleaned Test Data for running with different classifiers & threshold

with open('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/70_30_split/test.pkl','rb') as f :
  test_df = pickle.load(f)

In [0]:
## Setting a global threshold for all classifiers
threshold = 0.9

## Iterating through each Test Example
for idx in tqdm(test_df.index) :
  text = preprocess_text(test_df.loc[idx,'text'])
  
  # create example sentence
  sentence = Sentence(text)
  labels = []

  ## Iterate through each classifier for prediction
  for classifier in classifiers:
    classifier.multi_label_threshold = threshold
    classifier.predict(sentence)

    for label in sentence.labels :
      ## Append labels from all classifiers
      labels.append(label.value)

  test_df.at[idx,'predicted_labels'] = labels


 85%|████████▌ | 134329/157799 [4:09:11<53:03,  7.37it/s]

Buffered data was truncated after reaching the output size limit.

In [0]:
# Save the predictions
test_df.to_pickle('/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/70_30_split/normalised_test_predcted_th_0'+ str(int(threshold*10)) +'.pkl')
test_df.head()
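
With both label columns in place, one possible way to score the predictions is micro-averaged precision, recall and F1 over the label sets. The sketch below is an assumption about how evaluation could be done, not the evaluation used in this tutorial; the function name `micro_prf` and the toy label lists are hypothetical. In the notebook, `test_df['original_labels']` and `test_df['predicted_labels']` could be passed in directly.

```python
def micro_prf(true_labels, predicted_labels):
  """Micro-averaged precision, recall and F1 over per-example label lists."""
  tp = fp = fn = 0
  for true, pred in zip(true_labels, predicted_labels):
    true, pred = set(true), set(pred)
    tp += len(true & pred)   # labels predicted and correct
    fp += len(pred - true)   # labels predicted but wrong
    fn += len(true - pred)   # labels missed
  precision = tp / (tp + fp) if tp + fp else 0.0
  recall = tp / (tp + fn) if tp + fn else 0.0
  f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
  return precision, recall, f1

# Toy data, for illustration only
original = [['java', 'testing'], ['python']]
predicted = [['java'], ['python', 'excel']]
precision, recall, f1 = micro_prf(original, predicted)
print(precision, recall, f1)
```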

# Let's Infer !

Let's manually create some texts/questions and ask the model about them.

In [36]:
text1 = preprocess_text("How to handle memory locking ?")
# create example sentence & tokenize
sentence1 = Sentence(text1)
# predict
labels = predict_ensemble(sentence1,threshold=0.8)
print(labels)


['haskell', 'testing']


In [35]:
text2 = preprocess_text("How to handle memory locking in java ?")
# create example sentence & tokenize
sentence2 = Sentence(text2)
# predict
labels = predict_ensemble(sentence2,threshold=0.8)
print(labels)

['haskell', 'testing']


In [37]:
text3 = preprocess_text("How to handle memory locking in java python ?")
# create example sentence & tokenize
sentence3 = Sentence(text3)
# predict
labels = predict_ensemble(sentence3,threshold=0.8)
print(labels)

['testing', 'excel']


In [38]:
text4 = preprocess_text("This post is not about java")
# create example sentence & tokenize
sentence4 = Sentence(text4)
# predict
labels = predict_ensemble(sentence4,threshold=0.8)
print(labels)

['typescript', 'haskell', 'testing']
