# Model Interpretation

We have selected the SVM as our preferred model for making predictions. We will now study its behaviour by analyzing the articles it misclassifies, which should give us some insight into how the model works.

In [34]:
import pickle
import pandas as pd
import numpy as np
import random

Let's load what we need:

In [35]:
# Dataframe
path_df = "../03. Feature Engineering/Pickles/df.pickle"
with open(path_df, 'rb') as data:
    df = pickle.load(data)
    
# X_train
path_X_train = "../03. Feature Engineering/Pickles/X_train.pickle"
with open(path_X_train, 'rb') as data:
    X_train = pickle.load(data)

# X_test
path_X_test = "../03. Feature Engineering/Pickles/X_test.pickle"
with open(path_X_test, 'rb') as data:
    X_test = pickle.load(data)

# y_train
path_y_train = "../03. Feature Engineering/Pickles/y_train.pickle"
with open(path_y_train, 'rb') as data:
    y_train = pickle.load(data)

# y_test
path_y_test = "../03. Feature Engineering/Pickles/y_test.pickle"
with open(path_y_test, 'rb') as data:
    y_test = pickle.load(data)

# features_train
path_features_train = "../03. Feature Engineering/Pickles/features_train.pickle"
with open(path_features_train, 'rb') as data:
    features_train = pickle.load(data)

# labels_train
path_labels_train = "../03. Feature Engineering/Pickles/labels_train.pickle"
with open(path_labels_train, 'rb') as data:
    labels_train = pickle.load(data)

# features_test
path_features_test = "../03. Feature Engineering/Pickles/features_test.pickle"
with open(path_features_test, 'rb') as data:
    features_test = pickle.load(data)

# labels_test
path_labels_test = "../03. Feature Engineering/Pickles/labels_test.pickle"
with open(path_labels_test, 'rb') as data:
    labels_test = pickle.load(data)
    
# SVM Model
path_model = "../04. Model Training/Models/best_svc.pickle"
with open(path_model, 'rb') as data:
    svc_model = pickle.load(data)
    
# Category mapping dictionary
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

category_names = {
    0: 'business',
    1: 'entertainment',
    2: 'politics',
    3: 'sport',
    4: 'tech'
}
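
The open/load pattern above is repeated once per file, so it could be wrapped in a small helper. The following is only a sketch: `load_pickle` and `load_many` are hypothetical names that do not appear in the notebook, and the sketch assumes every file is named `<name>.pickle` inside a single folder.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Open `path` in binary mode and return the unpickled object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_many(folder, names):
    """Return {name: object} for each '<name>.pickle' file in `folder`."""
    return {name: load_pickle(Path(folder) / f"{name}.pickle") for name in names}
```

With the paths used above, a call such as `load_many("../03. Feature Engineering/Pickles", ["df", "X_train", "X_test"])` would return a dictionary keyed by file name, at the cost of no longer having one top-level variable per object.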

Let's get the predictions on the test set:

In [36]:
predictions = svc_model.predict(features_test)
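
Before looking at individual errors, it can be useful to check how often the predictions agree with the true labels. The sketch below computes that fraction with NumPy. The `accuracy` helper and the toy codes are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted code equals the true code."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Toy category codes (hypothetical, not the real test set)
toy_labels = [0, 1, 2, 3, 4, 2]
toy_predictions = [0, 1, 0, 3, 4, 4]

print(accuracy(toy_labels, toy_predictions))
```

With the notebook's own variables the call would presumably be `accuracy(labels_test, predictions)`.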

Now we'll create the Test Set dataframe with the actual and predicted categories:

In [37]:
# Indexes of the test set
index_X_test = X_test.index

# Select the test rows from the original df
# (copy so that adding columns does not raise SettingWithCopyWarning)
df_test = df.loc[index_X_test].copy()

# Add the predictions
df_test['Prediction'] = predictions

# Decode the predicted codes into category names
df_test['Category_Predicted'] = df_test['Prediction'].map(category_names)

# Keep only the columns we need
df_test = df_test[['Content', 'Category', 'Category_Predicted']]

In [38]:
df_test.head()

|      | Content | Category | Category_Predicted |
|------|---------|----------|--------------------|
| 1691 | Ireland call up uncapped Campbell\n\nUlster sc... | sport | sport |
| 1103 | Gurkhas to help tsunami victims\n\nBritain has... | politics | business |
| 477 | Egypt and Israel seal trade deal\n\nIn a sign ... | business | business |
| 197 | Cairn shares up on new oil find\n\nShares in C... | business | business |
| 475 | Saudi NCCI's shares soar\n\nShares in Saudi Ar... | business | business |


Let's get the misclassified articles:

In [39]:
condition = (df_test['Category'] != df_test['Category_Predicted'])

df_misclassified = df_test[condition]

df_misclassified.head(3)

|      | Content | Category | Category_Predicted |
|------|---------|----------|--------------------|
| 1103 | Gurkhas to help tsunami victims\n\nBritain has... | politics | business |
| 1880 | Half-Life 2 sweeps Bafta awards\n\nPC first pe... | tech | entertainment |
| 2137 | Junk e-mails on relentless rise\n\nSpam traffi... | tech | business |
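
Beyond reading individual rows, we could also count how often each (actual, predicted) pair occurs among the errors. The sketch below uses `pd.crosstab` on a toy stand-in for `df_test`. The rows are hypothetical and only the column names match the notebook.

```python
import pandas as pd

# Toy stand-in for df_test (hypothetical rows, not the real data)
toy_test = pd.DataFrame({
    "Category": ["politics", "tech", "tech", "sport", "business", "politics"],
    "Category_Predicted": ["business", "entertainment", "business",
                           "sport", "business", "business"],
})

# Keep only the rows where the prediction disagrees with the label
toy_errors = toy_test[toy_test["Category"] != toy_test["Category_Predicted"]]

# Rows: actual category, columns: predicted category, cells: error counts
error_counts = pd.crosstab(toy_errors["Category"], toy_errors["Category_Predicted"])
print(error_counts)
```

Applied to `df_misclassified`, the same call would show which category pairs the model confuses most often.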


Let's inspect a sample of three misclassified articles. We'll define a helper function that prints the actual category, the predicted category and the text of an article:

In [40]:
def output_article(row_article):
    print('Actual Category: %s' %(row_article['Category']))
    print('Predicted Category: %s' %(row_article['Category_Predicted']))
    print('-------------------------------------------')
    print('Text: ')
    print('%s' %(row_article['Content']))

We'll draw three random index labels from the misclassified articles, fixing the seed so the sample is reproducible:

In [41]:
random.seed(8)
list_samples = random.sample(list(df_misclassified.index), 3)
list_samples

[956, 1339, 1205]
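
Calling `random.seed` changes the global generator for the whole session. As an alternative sketch, the draw could use a private `random.Random` instance so that other code relying on `random` is unaffected. The `sample_labels` helper and the toy index labels are hypothetical.

```python
import random


def sample_labels(labels, k, seed):
    """Draw k distinct labels with a private generator, leaving global state untouched."""
    rng = random.Random(seed)
    return rng.sample(list(labels), k)


# Hypothetical index labels standing in for df_misclassified.index
toy_index = [12, 45, 78, 103, 256, 311]

first = sample_labels(toy_index, 3, seed=8)
second = sample_labels(toy_index, 3, seed=8)
print(first == second)  # the same seed reproduces the same draw
```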

First case:

In [42]:
output_article(df_misclassified.loc[list_samples[0]])

Actual Category: politics
Predicted Category: tech
-------------------------------------------
Text: 
Assembly ballot papers 'missing'

Hundreds of ballot papers for the regional assembly referendum in the North East have "disappeared".

Royal Mail says it is investigating the situation, which has meant about 300 homes in County Durham are not receiving voting packs. Officials at Darlington Council are now in a race against time to try and rectify the situation. The all-postal votes of about two million electors are due to be handed in by 4 November. A spokesman for Darlington Council said: "We have sent out the ballot papers, the problem is with Royal Mail. "Somewhere along the line, something has gone wrong and these ballot papers have not been delivered. "The Royal Mail is investigating to see if they can find out what the problem is."

A spokeswoman for Royal Mail said: "We are investigating a problem with the delivery route in the Mowden area of Darlington. "This is affecting seve

Second case:

In [43]:
output_article(df_misclassified.loc[list_samples[1]])

Actual Category: sport
Predicted Category: entertainment
-------------------------------------------
Text: 
Holmes feted with further honour

Double Olympic champion Kelly Holmes has been voted European Athletics (EAA) woman athlete of 2004 in the governing body's annual poll.

The Briton, made a dame in the New Year Honours List for taking 800m and 1,500m gold, won vital votes from the public, press and EAA member federations. She is only the second British woman to land the title after- Sally Gunnell won for her world 400m hurdles win in 1993. Swedish triple jumper Christian Olsson was voted male athlete of the year. The accolade is the latest in a long list of awards that Holmes has received since her success in Athens. In addition to becoming a dame, she was also named the BBC Sports Personality of the Year in December. Her gutsy victory in the 800m also earned her the International Association of Athletics Federations' award for the best women's performance in the world for 2004. 

Third case:

In [44]:
output_article(df_misclassified.loc[list_samples[2]])

Actual Category: politics
Predicted Category: tech
-------------------------------------------
Text: 
MPs issued with Blackberry threat

MPs will be thrown out of the Commons if they use Blackberries in the chamber Speaker Michael Martin has ruled.

The £200 handheld computers can be used as a phone, pager or to send e-mails. The devices gained new prominence this week after Alastair Campbell used his to accidentally send an expletive-laden message to a Newsnight journalist. Mr Martin revealed some MPs had been using their Blackberries during debates and he also cautioned members against using hidden earpieces.

The use of electronic devices in the Commons chamber has long been frowned on. The sound of a mobile phone or a pager can result in a strong rebuke from either the Speaker or his deputies. The Speaker chairs debates in the Commons and is charged with ensuring order in the chamber and enforcing rules and conventions of the House. He or she is always an MP chosen by colleagues w

We can see that in all three cases the category is not clear-cut, since each article contains concepts from more than one category. Errors of this kind will always occur, and we do not expect the model to classify such articles with 100% accuracy.
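
One possible way to quantify this ambiguity is to compare the two highest decision scores for each article: a small gap suggests the model was torn between two categories. The sketch below assumes `svc_model` is a scikit-learn `SVC`, which the file name `best_svc.pickle` suggests but this notebook does not confirm. The `top_two_margin` helper and the toy data are hypothetical, and the approach needs at least three classes.

```python
import numpy as np
from sklearn.svm import SVC


def top_two_margin(model, features):
    """Gap between the highest and second-highest decision scores per row."""
    scores = model.decision_function(features)  # shape: (n_samples, n_classes)
    ordered = np.sort(scores, axis=1)
    return ordered[:, -1] - ordered[:, -2]


# Toy stand-in for svc_model and features_test (hypothetical data)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
toy_X = np.vstack([rng.normal(c, 1.0, size=(20, 2)) for c in centers])
toy_y = np.repeat([0, 1, 2], 20)
toy_model = SVC(kernel="linear", decision_function_shape="ovr").fit(toy_X, toy_y)

margins = top_two_margin(toy_model, toy_X)
print(margins.min(), margins.max())
```

Under that assumption, `top_two_margin(svc_model, features_test)` would give one value per test article, and the misclassified articles could be checked for unusually small gaps.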