# Classical Machine Learning Approach

In this notebook we will learn to:
  1. Create a naive TF-IDF based Bag of Words representation of text.
  2. Use classical ML models to solve text classification.
  3. Use a One-vs-Rest strategy to solve multi-label text classification.


  **HOT TIP**: *Save intermediate results (prepared data, trained models, predictions) as pickle files so they can be reloaded quickly across experiments.*
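
The tip above can be sketched as follows. This is only a minimal illustration: the small DataFrame and the temporary file location are assumptions made for the example, not part of the tutorial data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical intermediate result; the real notebook works with a much larger frame.
df = pd.DataFrame({"Id": [120, 260], "Tag": [["sql", "asp.net"], ["c#", ".net"]]})

# Write to a temporary location (an assumption; use your own experiment folder instead).
pickle_path = os.path.join(tempfile.mkdtemp(), "intermediate.pkl")
df.to_pickle(pickle_path)

# Reload later without repeating the preparation; list-valued cells survive the round trip.
restored = pd.read_pickle(pickle_path)
print(restored["Tag"].tolist() == df["Tag"].tolist())
```

Unlike CSV, a pickle keeps Python objects such as the tag lists intact, so no re-parsing is needed after loading.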

  This Notebook uses code from https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Multi%20label%20text%20classification.ipynb


In [1]:
# Installing packages.
!pip install contractions
!pip install textsearch
!pip install tqdm

# Importing packages.
import nltk
nltk.download('punkt')
nltk.download('stopwords')
%matplotlib inline
import re
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
import pickle
import ast
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
from datetime import datetime
from sklearn.preprocessing import MultiLabelBinarizer

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/85/41/c3dfd5feb91a8d587ed1a59f553f07c05f95ad4e5d00ab78702fbf8fe48a/contractions-0.0.24-py2.py3-none-any.whl
Collecting textsearch
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting Unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Collecting pyahocorasick
  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  



In [15]:
# Let's mount our G-Drive.

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [0]:
# Data read and preparation.
# Location of our data on G-Drive. Make sure to adjust the path to your own setup.
path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/'
data = 'filtered_data/question_tag_text_mapping.pkl'
ml_model = path + 'ml_model/'

In [4]:
# Let us quickly load our question tag data
question_tag = pd.read_pickle(path+data)
question_tag.head(3)

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,Tag
0,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,"[sql, asp.net]"
1,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,"[c#, .net]"
2,330,63.0,2008-08-02T02:51:36Z,,29,Should I use nested classes in this case?,<p>I am working on a collection of classes use...,[c++]


### Creating one hot encoding from multilabelled tagged data

In [5]:
# To use the one-vs-rest strategy, we need to one-hot encode each tag across all documents.
mlb = MultiLabelBinarizer()
question_tag['Tag_pop'] = question_tag['Tag']
question_tag = question_tag.join(pd.DataFrame(mlb.fit_transform(question_tag.pop('Tag_pop')),
                          columns=mlb.classes_,
                          index=question_tag.index))
question_tag.head(3)

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,Tag,.net,agile,ajax,amazon-web-services,android,android-studio,angular2,angularjs,apache,apache-spark,api,asp.net,asp.net-web-api,azure,bash,c,c#,c++,cloud,codeigniter,css,devops,django,docker,drupal,eclipse,elasticsearch,embedded,entity-framework,excel,excel-vba,express,...,qt,r,react-native,reactjs,redis,redux,regex,rest,ruby,ruby-on-rails,sass,scala,selenium,shell,spring,spring-boot,spring-mvc,sql,sql-server,swift,tdd,testing,twitter-bootstrap,twitter-bootstrap-3,typescript,ubuntu,unity3d,unix,vb.net,vba,visual-studio,vue.js,wcf,web-services,windows,wordpress,wpf,xamarin,xcode,xml
0,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,"[sql, asp.net]",0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,"[c#, .net]",1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,330,63.0,2008-08-02T02:51:36Z,,29,Should I use nested classes in this case?,<p>I am working on a collection of classes use...,[c++],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
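
The one-hot encoding step above can be sketched on toy data. The three rows below are a hypothetical stand-in for the `Tag` column; only the list-of-tags layout is taken from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the 'Tag' column (hypothetical rows, same list-of-tags layout).
toy = pd.DataFrame({"Tag": [["sql", "asp.net"], ["c#", ".net"], ["c++"]]})

mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(toy["Tag"])

# classes_ holds the sorted tag vocabulary; each row of `indicator` is a 0/1 mask over it.
encoded = pd.DataFrame(indicator, columns=mlb.classes_, index=toy.index)
print(encoded)

# inverse_transform maps each mask back to a tuple of tags.
decoded = mlb.inverse_transform(indicator)
print(decoded)
```

Each tag becomes its own 0/1 column, which is exactly the shape a per-tag binary classifier needs as its target.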


In [0]:
# Creating a list of all existing 'Tags'
dummy = question_tag.drop(['Id', 'OwnerUserId', 'CreationDate', 'ClosedDate', 'Score', 'Title','Body','Tag'], axis=1)
categories = list(dummy.columns.values)

### Text preprocessing

In [0]:
# Let us create a very basic text preprocessor which we will use for cleaning text.
def clean_text(text):
    text = text.lower()
    # Expand common contractions. Specific patterns run before the general ones
    # ("'scuse" before "'s", "can't" before "n't"), otherwise they would never match.
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"'scuse", " excuse ", text)
    text = re.sub(r"'s", " ", text)
    text = re.sub(r"'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"'re", " are ", text)
    text = re.sub(r"'d", " would ", text)
    text = re.sub(r"'ll", " will ", text)
    # Replace non-word characters with spaces and collapse repeated whitespace.
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip(' ')
    return text

question_tag['Body'] = question_tag['Body'].map(clean_text)
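
Because the cleaning rules are applied one after another, their order matters. The following small sketch shows this with two of the rules; the helper names `expand_specific_first` and `expand_general_first` are hypothetical and exist only for this illustration.

```python
import re

def expand_specific_first(text):
    # Hypothetical helper: handle "can't" before the general "n't" rule.
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    return re.sub(r"\s+", " ", text).strip()

def expand_general_first(text):
    # Same rules in the opposite order: "n't" fires first and leaves "ca" behind.
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"can't", "can not ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "i can't and don't"
print(expand_specific_first(sample))
print(expand_general_first(sample))
```

The first variant keeps the word "can" intact, while the second one turns "can't" into "ca not".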

### Creating a 70/30 Train-Test Split

In [13]:
train, test = train_test_split(question_tag, random_state=42, test_size=0.30, shuffle=True)

X_train = train.Body
X_test = test.Body

print("Train data shape : {}".format(X_train.shape))
print("Test data shape : {}".format(X_test.shape))

Train data shape : (736394,)
Test data shape : (315598,)


# Creating a Bag of Words representation using TF-IDF
  1. Initialize the Vectorizer object.
  2. Learn the vocabulary (corpus) from the training data.
  3. Create a document-term matrix.

In [0]:
# Initialize the Vectorizer object.
# stop_words is passed as a list because recent scikit-learn versions expect a list (or 'english'), not a set.
tfidf = TfidfVectorizer(stop_words=list(stop_words))

# Learn the vocabulary from the training data and
# create the document-term matrix (DTM) of the training data.
X_train_dtm = tfidf.fit_transform(X_train)

# Create the DTM of the test data using the vocabulary learned from the training data.
# Note that the dimensions/columns of the test DTM are therefore based on the training data only.
X_test_dtm = tfidf.transform(X_test)
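
The fit/transform split above can be illustrated on a toy corpus. The documents below are made up for this sketch, and no stop-word list is used to keep it short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus.
train_docs = ["java memory locking", "python list sorting", "java thread locking"]
test_docs = ["locking in rust", "python memory"]

vectorizer = TfidfVectorizer()
train_dtm = vectorizer.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_dtm = vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

print(sorted(vectorizer.vocabulary_))
print(train_dtm.shape, test_dtm.shape)
```

Both matrices have the same number of columns, and words that appear only in the test documents (here "rust" and "in") do not get a column of their own.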

## Pipeline
scikit-learn provides a Pipeline utility to help automate machine learning workflows. Pipelines are very common in machine learning systems, because data usually passes through several transformations before it reaches a model. We will therefore wrap every classifier in a pipeline for training.
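
As a minimal sketch of what a Pipeline does, the example below chains a vectorizer and a classifier into a single estimator. Note that this is a variation for illustration: in this notebook the TF-IDF step is applied before the pipeline, and the toy documents and labels are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 1 means the question is about java.
docs = ["java memory locking", "python list sorting",
        "java thread pool", "python dict comprehension"]
labels = [1, 0, 1, 0]

# Each step is a (name, estimator) pair; all but the last must be transformers.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# fit() runs fit_transform on the vectorizer and then fits the classifier;
# predict() applies transform and then the classifier's predict.
text_clf.fit(docs, labels)
pred = text_clf.predict(["java thread locking"])
print(pred)
```

Keeping the steps in one object means the same transformations are applied at training and at prediction time.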

## One Vs Rest Multilabel strategy
A multi-label algorithm accepts a binary mask over multiple labels. The result of each prediction is an array of 0s and 1s marking which class labels apply to the corresponding input sample.

The OneVsRest strategy can be used for multi-label learning, where a classifier predicts multiple labels for each instance. **Naive Bayes**, **SVM** and **Logistic Regression** support multi-class classification, but we are in a multi-label scenario; therefore, we wrap them in the OneVsRestClassifier.
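
The binary-mask idea can be sketched by handing OneVsRestClassifier the full indicator matrix at once. This is an alternative usage shown for illustration only (the notebook itself loops over the tags and fits one model per tag), and the toy documents and tags are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy questions with one or more tags each.
docs = ["java thread locking", "python list sorting", "java and python bindings",
        "sql join query", "java garbage collector", "python decorators"]
tags = [["java"], ["python"], ["java", "python"], ["sql"], ["java"], ["python"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)               # shape: (n_documents, n_tags), values 0/1
X = TfidfVectorizer().fit_transform(docs)

# One binary LinearSVC is trained per column of Y.
ovr = OneVsRestClassifier(LinearSVC())
ovr.fit(X, Y)

pred = ovr.predict(X)                     # again a 0/1 mask, one row per document
print(mlb.classes_)
print(pred)
```

Each row of `pred` can contain several 1s, which is what distinguishes multi-label from multi-class prediction.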

### We create a Training Pipeline and a Scoring Pipeline

In [0]:
def tag_level_training_pipeline(X_train, train, X_test, test, classifier_pipeline, output_directory):

  # Train one binary classifier for each tag.
  for category in categories:
    print('... Processing {}'.format(category))

    # 1. Train the model using the training DTM and the 0/1 column of this tag.
    classifier_pipeline.fit(X_train, train[category])

    # 2. Save the model to disk.
    filename = ml_model + output_directory + str(category) + '_model.pkl'
    joblib.dump(classifier_pipeline, filename, compress=1)

    # 3. Compute the test accuracy and the per-class report.
    prediction = classifier_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
    print(classification_report(test[category], prediction))
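
The save-to-disk step used in the training function can be tried in isolation. The sketch below assumes a toy model and a temporary directory; the file name only mimics the `<tag>_model.pkl` pattern used above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: label 1 means the question is tagged 'java'.
docs = ["java memory locking", "python list sorting",
        "java thread pool", "python dict comprehension"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(docs)
model = MultinomialNB().fit(X, labels)

# Persist the fitted model (temporary directory is an assumption for this example).
model_path = os.path.join(tempfile.mkdtemp(), "java_model.pkl")
joblib.dump(model, model_path, compress=1)

# Load it back and check that it behaves like the original.
reloaded = joblib.load(model_path)
print((reloaded.predict(X) == model.predict(X)).all())
```

Storing one file per tag lets the scoring step load models one at a time instead of keeping all of them in memory.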



In [0]:
def tag_level_predict(X_train, train, X_test, test, model_directory):
  prediction_df = pd.DataFrame(columns=['dummy1'])

  # Score the documents with the classifier of each tag.
  for category in categories:

    # 1. Load the model.
    filename = ml_model + model_directory + str(category) + '_model.pkl'
    classifier_pipeline = joblib.load(filename)

    # 2. Predict on the test data.
    prediction = classifier_pipeline.predict(X_test)
    prediction_df[str(category)] = prediction

  # Remember that we had encoded the labels. It is time to bring them back to their original form:
  # every 1 is replaced by the tag name, while 0 stays as a placeholder for tags that were not predicted.
  for category in categories:
    prediction_df.loc[prediction_df[str(category)] == 1, str(category)] = category
  prediction_df['predicted_labels'] = prediction_df[[str(i) for i in categories]].values.tolist()
  prediction_df['predicted_labels'] = prediction_df['predicted_labels'].apply(lambda x : list(set(x)))
  # prediction_df['predicted_labels'] = prediction_df['predicted_labels'].apply(lambda x: x.remove(0) if (0 in x) else x )

  # We create a result holding the original labels and the predicted labels for metrics evaluation.
  final_pred_df = pd.concat([test[['Id','Tag']].reset_index(), prediction_df[['predicted_labels']].reset_index()], axis=1)
  final_pred_df['original_labels'] = final_pred_df['Tag']
  # prediction_df[['Id']] = test[['Id']]
  final_pred_df_result = final_pred_df[['Id','original_labels','predicted_labels']]
  return final_pred_df_result

In [0]:
# Importing the os module.
import os
# Rename the '.net' model file in case it was saved as '_net_model.pkl'.
try:
  os.rename('/content/drive/My Drive/ICDMAI_Tutorial/notebook/ml_model/SVM/_net_model.pkl', '/content/drive/My Drive/ICDMAI_Tutorial/notebook/ml_model/SVM/.net_model.pkl')
except OSError:
  print("Already in proper filename!")

In [16]:
## A dummy example.
# Separate names are used so that the real X_test and X_test_dtm are not overwritten.
demo_questions = ["How to handle memory locking ?", "How to handle memory locking in java ?", "How to handle memory locking in java python ?","This post is not about java"]
demo_dtm = tfidf.transform(demo_questions)
result = tag_level_predict(X_train_dtm, train, demo_dtm, test.head(1), 'SVM/')

for i in range(result.shape[0]):
  print("Input [",demo_questions[i],"] || Predicted classes: ",result.predicted_labels[i])



Input [ How to handle memory locking ? ] || Predicted classes:  [0]
Input [ How to handle memory locking in java ? ] || Predicted classes:  [0, 'java']
Input [ How to handle memory locking in java python ? ] || Predicted classes:  [0, 'python', 'java']
Input [ This post is not about java ] || Predicted classes:  [0, 'java']


# Evaluating our results

In [0]:
# Here we define precision, recall and F1 measure at the level of a single document.
def document_evaluation_metrics(prd_grp,grp,metric="precision"):
    pred_group = prd_grp
    # Drop the 0 placeholder left for tags that were not predicted (this modifies the list in place).
    if 0 in pred_group: pred_group.remove(0)
    group = grp

    set_pred_group = set(pred_group)
    set_group = set(group)
    intrsct = set_group.intersection(set_pred_group)
    # Use a denominator of at least 1 to avoid division by zero when a set is empty.
    precision = len(intrsct) / float(max(len(set_pred_group), 1))
    recall = len(intrsct) / float(max(len(set_group), 1))
    if metric == "precision":
      return precision
    elif metric == "recall":
      return recall
    elif metric == "f1_measure":
      if precision == 0 or recall == 0:
        return 0
      return 2*precision*recall/(float(precision + recall))

    # Unknown metric name.
    return -1

# Provide overall average stats and populate document-level metrics.
def model_evaluation_stats(final_pred_df, model_name="default"):
  final_pred_df['doc_precision'] = final_pred_df.apply(lambda x: document_evaluation_metrics(x.predicted_labels, x.original_labels, "precision"), axis=1)
  final_pred_df['doc_recall'] = final_pred_df.apply(lambda x: document_evaluation_metrics(x.predicted_labels, x.original_labels, "recall"), axis=1)
  final_pred_df['doc_f1_measure'] = final_pred_df.apply(lambda x: document_evaluation_metrics(x.predicted_labels, x.original_labels, "f1_measure"), axis=1)

  print('Average precision across documents is {}'.format(final_pred_df['doc_precision'].mean()))
  print('Average recall across documents is {}'.format(final_pred_df['doc_recall'].mean()))
  print('Average f1 measure across documents is {}'.format(final_pred_df['doc_f1_measure'].mean()))
  # Save the per-document results; the with-block makes sure the file is closed.
  with open(ml_model + model_name + ".pkl", 'wb') as f:
    pickle.dump(final_pred_df, f)
  # final_pred_df.to_csv(ml_model + 'SVM_Tag_predictions.txt',sep='\t',index=False)
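
As a cross-check of the document-level metrics above, scikit-learn's built-in scores with `average="samples"` compute precision, recall and F1 per document and then average them. The sketch below uses three hypothetical documents; `zero_division=0` is chosen here so that a document without predictions scores 0, which should match the convention of the functions above.

```python
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical original and predicted tags for three documents.
original = [["java", "python"], ["c++"], ["sql"]]
predicted = [["java"], ["c++", "c"], []]

# Binarize both sides over the same tag vocabulary.
mlb = MultiLabelBinarizer()
mlb.fit(original + predicted)
y_true = mlb.transform(original)
y_pred = mlb.transform(predicted)

# average="samples": score each document separately, then take the mean.
precision = precision_score(y_true, y_pred, average="samples", zero_division=0)
recall = recall_score(y_true, y_pred, average="samples", zero_division=0)
f1 = f1_score(y_true, y_pred, average="samples", zero_division=0)
print(precision, recall, f1)
```

For these three documents the per-document precisions are 1, 0.5 and 0, and the recalls are 0.5, 1 and 0, so both averages come out at 0.5.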

# Let us train, score and evaluate Naive Bayes

In [0]:
#Naive Bayes Classifier
NB_pipeline = Pipeline([
                ('clf', OneVsRestClassifier(MultinomialNB(
                    fit_prior=True, class_prior=None))),
            ])

tag_level_training_pipeline(X_train_dtm, train, X_test_dtm, test, NB_pipeline, 'NaiveBayes/')
result = tag_level_predict(X_train_dtm, train, X_test_dtm, test, 'NaiveBayes/')
model_evaluation_stats(result, "NaiveBayes")

# Let us train, score and evaluate Support Vector Machines

In [0]:
#SVM Classifier
SVC_pipeline = Pipeline([
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])

tag_level_training_pipeline(X_train_dtm, train, X_test_dtm, test, SVC_pipeline, 'SVM/')
result = tag_level_predict(X_train_dtm, train, X_test_dtm, test, 'SVM/')
model_evaluation_stats(result, "SVM")

... Processing .net
Test accuracy is 0.9771893358006071
              precision    recall  f1-score   support

           0       0.98      1.00      0.99    308362
           1       0.51      0.09      0.15      7236

    accuracy                           0.98    315598
   macro avg       0.75      0.54      0.57    315598
weighted avg       0.97      0.98      0.97    315598

... Processing agile
Test accuracy is 0.9999429654180318
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    315573
           1       0.89      0.32      0.47        25

    accuracy                           1.00    315598
   macro avg       0.94      0.66      0.74    315598
weighted avg       1.00      1.00      1.00    315598

... Processing ajax
Test accuracy is 0.9887356700612805
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    310952
           1       0.70      0.41      0.52      4646

    accuracy 

  'precision', 'predicted', average, warn_for)


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    315583
           1       0.00      0.00      0.00        15

    accuracy                           1.00    315598
   macro avg       0.50      0.50      0.50    315598
weighted avg       1.00      1.00      1.00    315598

... Processing django
Test accuracy is 0.9972718458291878
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    311726
           1       0.96      0.81      0.88      3872

    accuracy                           1.00    315598
   macro avg       0.98      0.90      0.94    315598
weighted avg       1.00      1.00      1.00    315598

... Processing docker
Test accuracy is 0.9996546239203037
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    315127
           1       0.93      0.83      0.88       471

    accuracy                           1.00    315598
   macro avg

# Let us train, score and evaluate Logistic Regression

In [0]:
#Logistic Regression Classifier
LogReg_pipeline = Pipeline([
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
            ])

tag_level_training_pipeline(X_train_dtm, train, X_test_dtm, test, LogReg_pipeline, 'LogisticRegression/')
result = tag_level_predict(X_train_dtm, train, X_test_dtm, test, 'LogisticRegression/')
model_evaluation_stats(result, "LogisticRegression")