# Model Interpretation

At this point we have selected the SVM as our preferred model for making predictions. We will now study its behaviour by analyzing the articles it misclassifies, which should give us some insight into how the model works.

In [1]:
import pickle
import pandas as pd
import numpy as np
import random

Let's load what we need:

In [11]:
# Dataframe
path_df = "pickles/df.pickle"
with open(path_df, 'rb') as data:
    df = pickle.load(data)
    
# X_train
path_X_train = "pickles/X_train.pickle"
with open(path_X_train, 'rb') as data:
    X_train = pickle.load(data)

# X_test
path_X_test = "pickles/X_test.pickle"
with open(path_X_test, 'rb') as data:
    X_test = pickle.load(data)

# y_train
path_y_train = "pickles/y_train.pickle"
with open(path_y_train, 'rb') as data:
    y_train = pickle.load(data)

# y_test
path_y_test = "pickles/y_test.pickle"
with open(path_y_test, 'rb') as data:
    y_test = pickle.load(data)

# features_train
path_features_train = "pickles/features_train.pickle"
with open(path_features_train, 'rb') as data:
    features_train = pickle.load(data)

# labels_train
path_labels_train = "pickles/labels_train.pickle"
with open(path_labels_train, 'rb') as data:
    labels_train = pickle.load(data)

# features_test
path_features_test = "pickles/features_test.pickle"
with open(path_features_test, 'rb') as data:
    features_test = pickle.load(data)

# labels_test
path_labels_test = "pickles/labels_test.pickle"
with open(path_labels_test, 'rb') as data:
    labels_test = pickle.load(data)
    
# Trained model (loaded from the file named best_knnc.pickle)
path_model = "Models/best_knnc.pickle"
with open(path_model, 'rb') as data:
    knn_model = pickle.load(data)
    
# Category mapping dictionary
category_codes = {
    'first': 1,
    'second': 2,
    'third': 3
}

category_names = {
    1: 'first',
    2: 'second',
    3: 'third'
}
print("done")

done
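
The loading cell repeats the same open-and-load pattern for every file. As a minimal sketch, a small helper could do the same work in a loop. The names `load_pickle` and `load_many` are hypothetical, and the demo writes throwaway files to a temporary folder instead of reading the real `pickles/` directory:

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load and return the object stored in a single pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_many(folder, names):
    """Return {name: object} for every '<name>.pickle' file in folder."""
    return {name: load_pickle(os.path.join(folder, f"{name}.pickle")) for name in names}


# Demo on throwaway files so the sketch runs without the real pickles.
with tempfile.TemporaryDirectory() as tmp:
    for name, value in {"X_test": [1, 2, 3], "y_test": [0, 1, 0]}.items():
        with open(os.path.join(tmp, f"{name}.pickle"), "wb") as handle:
            pickle.dump(value, handle)
    demo_objects = load_many(tmp, ["X_test", "y_test"])

print(demo_objects)
```

With the real files, a call such as `load_many("pickles", ["df", "X_train", "X_test"])` would presumably return the same objects as the cell above, keyed by name.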


Let's get the predictions on the test set:

In [6]:
predictions = knn_model.predict(features_test)

In [12]:
df = df.rename(columns={'text':'Content'})

df.head()

Unnamed: 0,id,Category_Code,headline,n_posts_author,date_seq,month_seq,year,column1,column2,Content,Content_Parsed,Category
0,730,1,"diary of a british scientist, part 2: brushing...",7,977,42,1999,no,yes,by so after deciding that i wanted...,decide i want move lab science c...,first
1,4200,3,alyson reed takes the helm at npa,1,2545,93,2003,no,no,"by n 4 september, a new force joined the...",n 4 september new force join struggle ...,third
2,4453,1,unveiling the blindness,1,2699,98,2004,no,no,"by n a daily basis, i strive to be the ...",n daily basis i strive laziest mexica...,first
3,286,2,talk yourself right into a job,247,7499,256,2017,no,yes,by i’m sure you’ve heard the expression use...,i’ sure you’ hear expression use describ...,second
4,1034,3,loan-repayment for biomedical researchers,84,1901,72,2001,no,yes,by whatever happe...,whatever happen fund proposal cong...,third


Now we'll create the Test Set dataframe with the actual and predicted categories:

In [13]:
# Indexes of the test set
index_X_test = X_test.index

# Select the test rows from the original df (copy so we don't write to a view of df)
df_test = df.loc[index_X_test].copy()

# Add the predictions
df_test['Prediction'] = predictions

# Decode the predicted category codes into category names
df_test['Category_Predicted'] = df_test['Prediction'].map(category_names)

# Keep only the columns needed for the analysis
df_test = df_test[['Content', 'Category', 'Category_Predicted']]

In [14]:
df_test.head()

Unnamed: 0,Content,Category,Category_Predicted
67,"by bout 3 years ago, i was sitting at ...",second,second
69,"by \t\t\t\t\t\t\t david price, 46, can't r...",first,first
192,by making it in academia is hard for anyone...,third,third
160,by i am finishing...,second,third
65,by his article concerns itself with own...,third,third
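
With actual and predicted categories side by side, one way to see which categories get confused with which is a cross-tabulation. The sketch below uses `pd.crosstab` on a made-up stand-in for `df_test` (the name `toy_test` and its rows are assumptions for illustration, not the real test set):

```python
import pandas as pd

# Toy stand-in for df_test: same column names, made-up rows.
toy_test = pd.DataFrame({
    "Category": ["first", "second", "second", "third", "third"],
    "Category_Predicted": ["first", "second", "third", "third", "second"],
})

# Rows are actual categories, columns are predicted categories.
confusion = pd.crosstab(toy_test["Category"], toy_test["Category_Predicted"])
print(confusion)
```

Off-diagonal cells count the misclassifications for each pair of actual and predicted categories.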


Let's get the misclassified articles:

In [17]:
condition = (df_test['Category'] != df_test['Category_Predicted'])

df_misclassified = df_test[condition]

print(len(df_misclassified))
df_misclassified

4


Unnamed: 0,Content,Category,Category_Predicted
160,by i am finishing...,second,third
127,"by , atricia gosling and bart noordam are ...",third,second
141,by ara is 2 years into her first postdo...,second,third
194,by our new discovery has implications f...,second,third
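
The same boolean condition can also be summarized numerically. As a rough sketch on made-up data (`toy_test` is a hypothetical stand-in for `df_test`), the mean of the condition gives the overall error rate, and grouping it by the actual category gives the number of errors per category:

```python
import pandas as pd

# Toy stand-in for df_test: same column names, made-up rows.
toy_test = pd.DataFrame({
    "Category": ["first", "second", "second", "third", "third"],
    "Category_Predicted": ["first", "second", "third", "third", "second"],
})

# True where the prediction disagrees with the actual category.
is_wrong = toy_test["Category"] != toy_test["Category_Predicted"]

# Share of misclassified rows (mean of a boolean Series).
error_rate = is_wrong.mean()

# Number of misclassified rows for each actual category.
errors_by_category = is_wrong.groupby(toy_test["Category"]).sum()

print(error_rate)
print(errors_by_category)
```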


Let's look at a sample of 3 of these articles. We'll define a function that prints an article together with its actual and predicted categories:

In [18]:
def output_article(row_article):
    # Print the actual and predicted category, then the article text
    print(f"Actual Category: {row_article['Category']}")
    print(f"Predicted Category: {row_article['Category_Predicted']}")
    print('-------------------------------------------')
    print('Text: ')
    print(row_article['Content'])

We'll draw three random indexes from the misclassified articles, fixing the seed so the sample is reproducible:

In [19]:
random.seed(8)
list_samples = random.sample(list(df_misclassified.index), 3)
list_samples

[127, 194, 141]
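
An alternative that stays within pandas is `DataFrame.sample` with a fixed `random_state`. This is only a sketch on a made-up frame (`toy_misclassified` is hypothetical); it uses a different random generator than `random.sample`, so it would not necessarily pick the same three rows as above:

```python
import pandas as pd

# Toy stand-in for df_misclassified, reusing the four index labels shown above.
toy_misclassified = pd.DataFrame(
    {
        "Category": ["second", "third", "second", "second"],
        "Category_Predicted": ["third", "second", "third", "third"],
    },
    index=[160, 127, 141, 194],
)

# Draw 3 distinct rows; the fixed random_state makes the draw repeatable.
sampled = toy_misclassified.sample(n=3, random_state=8)
print(list(sampled.index))
```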

First case:

In [20]:
output_article(df_misclassified.loc[list_samples[0]])

Actual Category: third
Predicted Category: second
-------------------------------------------
Text: 
  by  ,  atricia gosling and bart noordam are the authors of   ( ). gosling is a senior medical writer at novartis vaccines and diagnostics in germany and freelance science writer. noordam is a professor of physics at the  , the netherlands, and director  development and engineering at  . he has also worked for mckinsey and co. 5 september 2008 research in industry differs from academic research in several important ways. 25 july 2008 the key to doing well in your thesis defense is extensive preparation. 27 june 2008 if you aren't having regular, structured conversations with your ph.d. adviser, poor communication is probably holding you back. 23 may 2008 problem-solving and communication skills are important for those who aspire to careers advising the captains of industry. 2 may 2008 the values and culture of the nonprofit setting make it an exciting and rewarding career choice for so

Second case:

In [21]:
output_article(df_misclassified.loc[list_samples[1]])

Actual Category: second
Predicted Category: third
-------------------------------------------
Text: 
  by      our new discovery has implications for breast cancer therapy. who funds you, the national cancer institute (nci)? nope: the u.s. army. you've just developed a swift-growing tree that drinks up metals in the soil as if they were lemonade in july, and it could be the next killer app for cleaning up superfund sites. who cut the r&d checks, the environmental protection agency (epa)? uh-uh: the u.s. air force. a new airborne chemical sensor: epa? the department of energy? homeland security? no: it owes its existence to the small business administration (sba).  why be normal? securing funding is difficult, time-consuming, and unpredictable, especially in an election year. who knows how next year's president's policies will impact the budgets of the national institutes of health (nih), the national science foundation (nsf), and other stalwarts of scientific funding? the outlook right

Third case:

In [22]:
output_article(df_misclassified.loc[list_samples[2]])

Actual Category: second
Predicted Category: third
-------------------------------------------
Text: 
  by      ara is 2 years into her first postdoc, and she has begun to give some serious thought to the next step in her career. although she is concerned about starting a family and is apprehensive about diving into the publish-or-perish, grant-intensive faculty cycle, she hasn't really considered any option other than applying for academic positions. during her many late nights in the lab she takes time to read the back pages of major science journals and scans online career sites. she visits the job board and tries to talk to pi's at the one major meeting that she attends each year.  despite the generally glum articles about academic careers for scientists, she tells herself that she'll be able to find something. "after all," she thinks, "gary in the next lab got a faculty position, and his cv wasn't much better than mine." but even though she hasn't started applying for jobs, sara ha

We can see that in all three cases the category is not clear-cut, since each article contains concepts from both categories. Errors of this kind are to be expected, and we do not aim for 100% accuracy on such articles.