# Model Interpretation

At this point we have selected the SVM as our preferred model for making predictions. We will now study its behaviour by analyzing the tender titles it misclassifies, which should give us some insight into how the model works.

In [1]:
import pickle
import pandas as pd
import numpy as np
import random

Let's load what we need:

In [4]:
def load_pickle(path):
    """Load and return the object stored in a pickle file."""
    with open(path, 'rb') as data:
        return pickle.load(data)

# Dataframe
df = load_pickle("Pickles/df.pickle")

# Raw train/test splits
X_train = load_pickle("Pickles/X_train.pickle")
X_test = load_pickle("Pickles/X_test.pickle")
y_train = load_pickle("Pickles/y_train.pickle")
y_test = load_pickle("Pickles/y_test.pickle")

# Feature matrices and label vectors
features_train = load_pickle("Pickles/features_train.pickle")
labels_train = load_pickle("Pickles/labels_train.pickle")
features_test = load_pickle("Pickles/features_test.pickle")
labels_test = load_pickle("Pickles/labels_test.pickle")
    
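# Selected model. Note: the text calls it the SVM, but the path points to
# "best_rfc.pickle" and the variable is named mnb_model; check that this is
# the intended file.
path_model = "Models/best_rfc.pickle"
with open(path_model, 'rb') as data:
    mnb_model = pickle.load(data)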
    
# Category mapping dictionary
category_codes = {
    'electronics': 1,
    'hardware': 2,
    'machine': 3,
    'none': 4,
    'raw_materials': 5,
    'skilled_manpower': 6,
    'unskilled_manpower': 7,
    'vehicle/equipment_hiring': 8,
}

category_names = {
    1: 'electronics',
    2: 'hardware',
    3: 'machine',
    4: 'none',
    5: 'raw_materials',
    6: 'skilled_manpower',
    7: 'unskilled_manpower',
    8: 'vehicle/equipment_hiring',
}
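
The two dictionaries above are inverses of each other, so one can be derived from the other to keep them in sync. A minimal sketch, reusing the same codes (the final assertion is only an illustrative sanity check):

```python
category_codes = {
    'electronics': 1,
    'hardware': 2,
    'machine': 3,
    'none': 4,
    'raw_materials': 5,
    'skilled_manpower': 6,
    'unskilled_manpower': 7,
    'vehicle/equipment_hiring': 8,
}

# Invert the mapping instead of typing it out a second time
category_names = {code: name for name, code in category_codes.items()}

# Sanity check: every code maps back to its original name
assert all(category_names[code] == name
           for name, code in category_codes.items())
```

This only works because the codes are unique; duplicated codes would silently overwrite each other in the inverted dictionary.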

Let's get the predictions on the test set:

In [5]:
predictions = mnb_model.predict(features_test)
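
As a quick sanity check on the predictions, the share of correct ones can be computed directly with NumPy. The sketch below uses small made-up arrays in place of `predictions` and `labels_test` (assumed here to hold the true integer category codes), so the numbers are illustrative only:

```python
import numpy as np

# Toy stand-ins for labels_test and predictions (hypothetical values)
toy_labels = np.array([4, 5, 4, 5, 3])
toy_predictions = np.array([7, 5, 4, 4, 4])

# Fraction of positions where the predicted code equals the true code
accuracy = np.mean(toy_predictions == toy_labels)
print(f"Accuracy: {accuracy:.2f}")
```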

Now we'll build a dataframe for the test set with the actual and predicted categories:

In [6]:
# Index labels of the test set
index_X_test = X_test.index

# Select those rows from the original df; copy to avoid SettingWithCopyWarning
df_test = df.loc[index_X_test].copy()

# Add the predicted category codes
df_test['Prediction'] = predictions

# Decode the predicted codes into category names
df_test['Label_Predicted'] = df_test['Prediction'].map(category_names)

# Keep only the columns needed for the analysis
df_test = df_test[['Tender Title', 'Label', 'Label_Predicted']]

In [7]:
df_test.head()

|     | Tender Title | Label | Label_Predicted |
|-----|--------------|-------|-----------------|
| 488 | RELATED TO COMPLIANCE OF DIFFERENT ACTS OF JHA... | none | unskilled_manpower |
| 168 | OLFA FOR SALE OF IRON ORE FINES | raw_materials | raw_materials |
| 127 | FA No. 24003011 dt. 25.03.20 – Air Cooled BF S... | none | none |
| 227 | Centralised Procuremenmt of SiMn & HC FeMn | raw_materials | none |
| 207 | Installation of Electro Slag Remelting (ESR) U... | machine | none |


Let's check the size of the test set and save the results to a CSV file:

In [16]:
df_test.shape

(147, 3)

In [17]:
df_test.to_csv("test_results.csv", index=False)

Now let's get the misclassified tender titles:

In [8]:
condition = (df_test['Label'] != df_test['Label_Predicted'])

df_misclassified = df_test[condition]

df_misclassified.head(3)

|     | Tender Title | Label | Label_Predicted |
|-----|--------------|-------|-----------------|
| 488 | RELATED TO COMPLIANCE OF DIFFERENT ACTS OF JHA... | none | unskilled_manpower |
| 227 | Centralised Procuremenmt of SiMn & HC FeMn | raw_materials | none |
| 207 | Installation of Electro Slag Remelting (ESR) U... | machine | none |
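
To see which categories get confused with which, the actual and predicted labels can be cross-tabulated. The sketch below runs on a toy frame whose five rows mirror the labels printed by `df_test.head()`; the names `toy_test`, `confusion` and `error_counts` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Toy frame with the same label columns as df_test
toy_test = pd.DataFrame({
    'Label': ['none', 'raw_materials', 'none', 'raw_materials', 'machine'],
    'Label_Predicted': ['unskilled_manpower', 'raw_materials', 'none',
                        'none', 'none'],
})

# Rows are actual labels, columns are predicted labels
confusion = pd.crosstab(toy_test['Label'], toy_test['Label_Predicted'])
print(confusion)

# Count each (actual, predicted) pair among the misclassified rows only
toy_misclassified = toy_test[toy_test['Label'] != toy_test['Label_Predicted']]
error_counts = toy_misclassified.groupby(['Label', 'Label_Predicted']).size()
print(error_counts)
```

On the real `df_test`, the largest off-diagonal cells of the table would point to the category pairs worth inspecting first.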


Let's inspect a sample of the misclassified tender titles. We'll define a helper function to print each one:

In [12]:
def output_article(row_article):
    print('Actual Category: %s' %(row_article['Label']))
    print('Predicted Category: %s' %(row_article['Label_Predicted']))
    print('-------------------------------------------')
    print('Text: ')
    print('%s' %(row_article['Tender Title']))

We'll draw one random index from the misclassified rows:

In [13]:
random.seed(8)
list_samples = random.sample(list(df_misclassified.index), 1)
list_samples

[663]
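
If more than one example is wanted, the sample size can be capped at the number of misclassified rows so that later indexing never runs past the end of the list. A sketch with a hypothetical list of index labels standing in for `df_misclassified.index`:

```python
import random

# Hypothetical index labels standing in for list(df_misclassified.index)
toy_index = [488, 227, 207, 663]

random.seed(8)

# Never ask for more samples than there are rows
n_samples = min(3, len(toy_index))
toy_samples = random.sample(toy_index, n_samples)

# Iterate over whatever was drawn instead of hard-coding positions 0, 1, 2
for position, idx in enumerate(toy_samples, start=1):
    print(f"Case {position}: index {idx}")
```

Looping over the sampled list, rather than indexing fixed positions, keeps the cells valid whatever sample size is chosen.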

First case:

In [14]:
output_article(df_misclassified.loc[list_samples[0]])

Actual Category: unskilled_manpower 
Predicted Category: unskilled_manpower
-------------------------------------------
Text: 
BALANCE OF WORKS FOR CONSTRUCTION OF FLYOVER AT ISP, BURNPUR (Pkg.No.77-01)
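
In the output above the actual label appears to end with a trailing space, which would make it compare unequal to the predicted label even though both name the same category. If that is the cause, this row is a labelling artifact rather than a genuine model error, and normalizing whitespace before the comparison would keep such rows out of the misclassified set. A sketch on a toy frame (the rows are hypothetical; the column names match `df_test`):

```python
import pandas as pd

# Toy frame: the first actual label carries a trailing space
toy = pd.DataFrame({
    'Label': ['unskilled_manpower ', 'none'],
    'Label_Predicted': ['unskilled_manpower', 'machine'],
})

# Raw comparison: the trailing space makes the first row look misclassified
raw_mismatch = toy['Label'] != toy['Label_Predicted']

# Comparison after stripping surrounding whitespace from both columns
clean_mismatch = toy['Label'].str.strip() != toy['Label_Predicted'].str.strip()

print("Mismatches before stripping:", raw_mismatch.sum())
print("Mismatches after stripping:", clean_mismatch.sum())
```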


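Second and third cases:

Only one index was sampled above, so `list_samples[1]` and `list_samples[2]` do not exist; accessing them raises `IndexError: list index out of range`. The calls are therefore left commented out:

In [15]:
# output_article(df_misclassified.loc[list_samples[1]])

In [22]:
# output_article(df_misclassified.loc[list_samples[2]])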