# Custom Evaluation Criteria

In this notebook we define evaluation criteria suited to a multi-class, multi-label classification problem. Using a single standardized metric gives an apples-to-apples comparison between different classifiers and modelling approaches.

The notebook covers the following steps:

  1. Install & Import Packages
  2. Define Document level 'Precision', 'Recall' and 'F1-measure'
  3. Provide aggregated evaluation across documents along with illustrated examples
  4. Save the averaged metrics and the scored predictions

# 1. Install & Import Packages

In [0]:
import os
import numpy as np
import pandas as pd
import pickle
import ast

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# 2. Define Document level 'Precision', 'Recall' and 'F1-measure'


We define precision for a document as follows:
$$\text{precision} = \frac{|\{\text{original labels}\} \cap \{\text{predicted labels}\}|}{|\{\text{predicted labels}\}|}$$

Whereas recall for a document is:
$$\text{recall} = \frac{|\{\text{original labels}\} \cap \{\text{predicted labels}\}|}{|\{\text{original labels}\}|}$$

And F measure derived from the above two values is:

$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
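
As a quick check of the three formulas, the sketch below evaluates them for a single document using plain Python sets. The helper name `doc_scores` is hypothetical and is not the notebook's own function, and treating an empty label set as a score of 0 is an assumption made here for safety. The labels mirror the `[git, .net]` versus `[git]` row that appears in the results table at the end of this notebook.

```python
def doc_scores(original_labels, predicted_labels):
    """Return (precision, recall, f1) for one document's label sets."""
    original = set(original_labels)
    predicted = set(predicted_labels)
    overlap = len(original & predicted)

    # Assumption: an empty label set scores 0 instead of raising ZeroDivisionError.
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(original) if original else 0.0
    # With no overlap both precision and recall are 0, so F is defined as 0.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


precision, recall, f1 = doc_scores(["git", ".net"], ["git"])
print(precision, recall, round(f1, 6))  # 1.0 0.5 0.666667
```

The single predicted label is correct, so precision is 1.0, but only one of the two original labels was found, so recall is 0.5.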

The first function below computes these metrics for a single document. The second applies it to every document, averages the results to summarize overall performance, and saves them.

In [0]:

def document_evaluation_metrics(prd_grp, grp, metric="precision"):
    """Score one document's predicted labels against its original labels.

    Returns precision, recall or F1 depending on `metric`,
    and -1 for an unrecognised metric name.
    """
    # Build sets instead of editing the caller's list in place;
    # any 0 entries are dropped from the predictions.
    set_pred_group = set(prd_grp) - {0}
    set_group = set(grp)
    intrsct = set_group & set_pred_group

    # max(..., 1) avoids division by zero when a label set is empty.
    precision = len(intrsct) / max(len(set_pred_group), 1)
    recall = len(intrsct) / max(len(set_group), 1)

    if metric == "precision":
        return precision
    elif metric == "recall":
        return recall
    elif metric == "f1_measure":
        if precision == 0 or recall == 0:
            return 0
        return 2 * precision * recall / (precision + recall)

    return -1

def model_evaluation_stats(final_pred_df, model_name="default"):
    # Score every document: one new column per metric.
    final_pred_df['doc_precision'] = final_pred_df.apply(
        lambda x: document_evaluation_metrics(x.predicted_labels, x.original_labels, "precision"), axis=1)
    final_pred_df['doc_recall'] = final_pred_df.apply(
        lambda x: document_evaluation_metrics(x.predicted_labels, x.original_labels, "recall"), axis=1)
    final_pred_df['doc_f1_measure'] = final_pred_df.apply(
        lambda x: document_evaluation_metrics(x.predicted_labels, x.original_labels, "f1_measure"), axis=1)

    # Average the per-document scores; every document carries equal weight.
    avg_precision = final_pred_df['doc_precision'].mean()
    avg_recall = final_pred_df['doc_recall'].mean()
    avg_f1 = final_pred_df['doc_f1_measure'].mean()

    print('Average precision across documents is {}'.format(avg_precision))
    print('Average recall across documents is {}'.format(avg_recall))
    print('Average f1 measure across documents is {}'.format(avg_f1))

    row = {'name': model_name,
           'avg_precision': avg_precision,
           'avg_recall': avg_recall,
           'avg_f1': avg_f1}

    # ml_model is the global output directory defined in section 3.
    record_path = ml_model + 'record_metrics.csv'

    # Add this run to the metrics log, creating the log on the first run.
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    if os.path.exists(record_path):
        record_df = pd.concat([pd.read_csv(record_path), pd.DataFrame([row])], ignore_index=True)
    else:
        record_df = pd.DataFrame([row])
    record_df.to_csv(record_path, index=False)

    # Save the scored predictions for later inspection.
    with open(ml_model + model_name + ".pkl", 'wb') as f:
        pickle.dump(final_pred_df, f)
    # final_pred_df.to_csv(ml_model + 'SVM_Tag_predictions.txt',sep='\t',index=False)
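
To make the averaging step concrete, here is a minimal standard-library sketch of the same idea on four toy documents. The helper `f1_for_document` is hypothetical, the first three label pairs echo rows from the results shown later in this notebook, and the last pair is a made-up complete miss.

```python
from statistics import mean

# Toy (original_labels, predicted_labels) pairs; the last pair is invented.
documents = [
    (["jquery"], ["jquery", "tdd"]),
    (["git", ".net"], ["git"]),
    (["r"], ["r"]),
    (["perl"], ["regex"]),
]


def f1_for_document(original_labels, predicted_labels):
    """Per-document F1 computed from the overlap of the two label sets."""
    original, predicted = set(original_labels), set(predicted_labels)
    overlap = len(original & predicted)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(original)
    return 2 * precision * recall / (precision + recall)


per_document_f1 = [f1_for_document(o, p) for o, p in documents]
avg_f1 = mean(per_document_f1)
print(round(avg_f1, 4))  # 0.5833
```

Each document contributes equally to the average regardless of how many labels it has, so one complete miss pulls the score down as much as one perfect match lifts it.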


# 3. Provide aggregated evaluation across documents along with illustrated examples

In [0]:
# Data read.
path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/'
data_path = path + 'training_data/70_30_split/'

# temporary output path
ml_model = path + 'ml_model/'

In [0]:
# List the prediction files produced by the test run in notebook 6_inference_pipeline.ipynb
for data_filename in ['normalised_test_predicted_th_03', 'normalised_test_predicted_th_04', 'normalised_test_predicted_th_05',
                      'normalised_test_predicted_th_06', 'normalised_test_predicted_th_07', 'normalised_test_predicted_th_09']:

    print("Processing : {}".format(data_filename))
    print()
    pred_df = pd.read_pickle(data_path + data_filename + '.pkl')

    # Score the predictions, log the averages, and save the scored DataFrame.
    model_evaluation_stats(pred_df, data_filename)
    print()
    print("========================================")
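
Once the loop has finished, `record_metrics.csv` holds one row per prediction file, so the runs can be compared directly. The sketch below shows one way to pick the run with the highest average F1, assuming the file has the four columns written by `model_evaluation_stats`. The CSV text and the run names are placeholders with invented numbers, not results from this notebook.

```python
import csv
import io

# Placeholder contents standing in for record_metrics.csv.
# These numbers are invented for illustration only.
sample_csv = """name,avg_precision,avg_recall,avg_f1
run_a,0.40,0.60,0.45
run_b,0.55,0.50,0.50
run_c,0.70,0.30,0.40
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# csv reads every field as a string, so convert before comparing.
best = max(rows, key=lambda row: float(row["avg_f1"]))
print(best["name"], best["avg_f1"])  # run_b 0.50
```

Selecting on `avg_f1` balances precision against recall. Sorting on `avg_precision` or `avg_recall` instead would favour one side of that trade-off.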

In [0]:
# ml_model already ends with '/', so no extra separator is needed.
df = pd.read_pickle(ml_model + 'normalised_test_predicted_th_09.pkl')

In [0]:
df.describe()

| | doc_precision | doc_recall | doc_f1_measure |
|---|---|---|---|
| count | 157799.0 | 157799.0 | 157799.0 |
| mean | 0.038637 | 0.067447 | 0.045914 |
| std | 0.141094 | 0.235249 | 0.159461 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 |


In [11]:
df = df.sort_values(by=['doc_f1_measure'],ascending=False)
df.head()

| | text | original_labels | predicted_labels | doc_precision | doc_recall | doc_f1_measure |
|---|---|---|---|---|---|---|
| 122338 | "TSQLT unit test - The data types text and tex... | [tdd] | [tdd] | 1.0 | 1.0 | 1.0 |
| 81538 | TFS Build giving different results according t... | [selenium] | [selenium] | 1.0 | 1.0 | 1.0 |
| 58872 | Conversion of long values into double in R. I'... | [r] | [r] | 1.0 | 1.0 | 1.0 |
| 73228 | "Why does this Perl function appear to process... | [perl] | [perl] | 1.0 | 1.0 | 1.0 |
| 8666 | Android Fragments Button Click Open new Window... | [android] | [android] | 1.0 | 1.0 | 1.0 |


In [0]:
df[df.doc_f1_measure>0.5].tail(50)

| | text | original_labels | predicted_labels | doc_precision | doc_recall | doc_f1_measure |
|---|---|---|---|---|---|---|
| 155892 | "jquery ui datepicker with tooltip - tooltip f... | [jquery] | [jquery, tdd] | 0.5 | 1.0 | 0.666667 |
| 156211 | "accessing specific columns from another table... | [ruby-on-rails] | [ruby-on-rails, typescript] | 0.5 | 1.0 | 0.666667 |
| 21339 | AWS Domain Name Email for SSL Cert. I create a... | [amazon-web-services] | [amazon-web-services, testing] | 0.5 | 1.0 | 0.666667 |
| 9591 | "Can a `ST`-like monad be executed purely (wit... | [haskell] | [haskell, regex] | 0.5 | 1.0 | 0.666667 |
| 89732 | PostgreSQL - Empty table. I have a table calle... | [postgresql] | [postgresql, testing] | 0.5 | 1.0 | 0.666667 |
| 122812 | "How to login with a user with role userAdmin ... | [mongodb] | [mongodb, testing] | 0.5 | 1.0 | 0.666667 |
| 83679 | How to invoke ModelB.create on ModelA.afterCre... | [node.js] | [node.js, testing] | 0.5 | 1.0 | 0.666667 |
| 138205 | "Regex to find numbers and letters. I would fi... | [regex] | [testing, regex] | 0.5 | 1.0 | 0.666667 |
| 29742 | "iText filling multi pages with same textfield... | [java] | [java, tdd] | 0.5 | 1.0 | 0.666667 |
| 21416 | t\tHow do I remove a Visual Studio 2015 projec... | [git, .net] | [git] | 1.0 | 0.5 | 0.666667 |
