# Investigating Misclassifications by Model

**Methods:**
>1. Load and concat data
>2. Identify misclassified cases
>3. Investigate the respective predictions by model

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.svm as skl_svm
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
import sklearn.model_selection as skl_cv
import seaborn as sns
import os
import sys

base_path = '/home/lundi/Python/MNIST/'
# Make the project's helper modules importable (os.path.join avoids the doubled slash)
sys.path.append(os.path.join(base_path, 'libraries'))

import time
import glob

import MNIST_model_functions as mmf
MNIST_model_functions = mmf.MNIST_model_functions()

## 1. Load and concat data

In [None]:
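# Read every results file matching the 2016.11.7 pattern and stack them in a single concat
# (cheaper than growing a DataFrame inside the loop; raises if no file matches).
result_files = glob.glob(base_path + '/data/prediction_results/2016.11.7-*_results.csv')
prediction_data_v1 = pd.concat([pd.read_csv(filename) for filename in result_files])
# pandas names a header-less CSV column 'Unnamed: 0'; it is used here as the datum index
prediction_data_v1 = prediction_data_v1.rename(columns = {'Unnamed: 0': 'datum_index'})
#prediction_data_v1 = prediction_data_v1.drop(['Unnamed: 0'], axis = 1)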

## 2. Identify misclassified cases

In [None]:
# Sanity check: show every prediction recorded for a single datum (index 0)
prediction_data_v1.loc[prediction_data_v1['datum_index'] == 0]

I will calculate, for each datum, the fraction of its predictions that were misclassified.

In [None]:
average_misclassification_fraction = prediction_data_v1.groupby(['datum_index'])['is_misclassified'].mean().reset_index()
average_misclassification_fraction = average_misclassification_fraction.rename(columns = {'is_misclassified': 'misclassified_frac'})
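
As a quick check of what this group-by produces, here is a minimal sketch on toy data. The frame `toy` and the model names `model_a` to `model_d` are hypothetical stand-ins, not values from the real results files; only the column names match the ones used above.

```python
import pandas as pd

# Hypothetical toy data: two data points, each scored by four made-up models.
toy = pd.DataFrame({
    'datum_index': [0, 0, 0, 0, 1, 1, 1, 1],
    'Model': ['model_a', 'model_b', 'model_c', 'model_d'] * 2,
    'is_misclassified': [0, 0, 0, 0, 1, 0, 0, 1],
})

# Mean of the 0/1 flag per datum = fraction of predictions that were wrong.
toy_frac = (
    toy.groupby(['datum_index'])['is_misclassified']
       .mean()
       .reset_index()
       .rename(columns={'is_misclassified': 'misclassified_frac'})
)
print(toy_frac)
```

Datum 0 is classified correctly by every toy model, while datum 1 is missed by two of the four.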

Now I will merge these fractions back onto the original prediction data, so that every prediction row carries its datum's misclassification fraction.

In [None]:
prediction_data_v2 = pd.merge(prediction_data_v1, average_misclassification_fraction, on = ['datum_index'])
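
The group-by followed by a merge could probably also be written in one step with `groupby(...).transform('mean')`, which broadcasts each group's mean back onto its rows. This is only a sketch of that alternative on hypothetical toy data (`toy` and the `model_*` names are made up), not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the prediction data.
toy = pd.DataFrame({
    'datum_index': [0, 0, 0, 0, 1, 1, 1, 1],
    'Model': ['model_a', 'model_b', 'model_c', 'model_d'] * 2,
    'is_misclassified': [0, 0, 0, 0, 1, 0, 0, 1],
})

# transform('mean') returns one value per original row, aligned on the index,
# so no separate merge on datum_index is needed.
toy_v2 = toy.copy()
toy_v2['misclassified_frac'] = (
    toy_v2.groupby('datum_index')['is_misclassified'].transform('mean')
)
print(toy_v2)
```

The merge keeps the intermediate per-datum table around, which is handy if it is inspected on its own; the transform version is shorter when only the row-level column is needed.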

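Now I will grab the cases where the misclassification fraction is exactly 0.5, i.e. the data on which half of the predictions were wrong and half were right.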

In [None]:
split_classified_data = prediction_data_v2.loc[prediction_data_v2['misclassified_frac'] == 0.5].drop(['misclassified_frac'], axis=1)
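
The exact comparison with 0.5 should be safe here, since 0.5 is exactly representable as a float and the fraction is a count divided by an even total. A more defensive spelling of the same filter uses `np.isclose`; the sketch below shows it on hypothetical toy data (`toy` and the `model_*` names are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: datum 1 is missed by exactly half of the toy models.
toy = pd.DataFrame({
    'datum_index': [0, 0, 0, 0, 1, 1, 1, 1],
    'Model': ['model_a', 'model_b', 'model_c', 'model_d'] * 2,
    'is_misclassified': [0, 0, 0, 0, 1, 0, 0, 1],
})
toy['misclassified_frac'] = (
    toy.groupby('datum_index')['is_misclassified'].transform('mean')
)

# Keep only the evenly split data, tolerating tiny floating-point differences.
toy_split = (
    toy.loc[np.isclose(toy['misclassified_frac'], 0.5)]
       .drop(['misclassified_frac'], axis=1)
)
print(toy_split)
```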

## 3. Investigate the respective predictions by model

Let's see which models err together. To do this, I will pivot the table so that each model's `is_misclassified` flag becomes its own column, with one row per datum.

In [None]:
split_classified_pivot_data = pd.pivot_table(
    split_classified_data[['datum_index', 'Actual', 'Predicted', 'is_misclassified', 'Model']],
    values = ['is_misclassified'], index = ['datum_index'], columns = ['Model'])

In [None]:
split_classified_pivot_data.corr()

A correlation of -1.0 here means that two models never misclassify the same datum within this subset: whenever one is wrong, the other is right. So GBM predicts quite differently from RF and SVC_Poly.
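
To make the reading of the correlation matrix concrete, here is a minimal sketch on hypothetical toy data. The four `model_*` names and their error patterns are made up, and each toy datum is missed by exactly two of the four models, mirroring the 0.5 filter above.

```python
import pandas as pd

# Hypothetical split cases: four data points, each missed by two of four models.
toy_split = pd.DataFrame({
    'datum_index': [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4,
    'Model': ['model_a', 'model_b', 'model_c', 'model_d'] * 4,
    'is_misclassified': [1, 1, 0, 0,
                         0, 0, 1, 1,
                         1, 1, 0, 0,
                         1, 0, 1, 0],
})

# One row per datum, one column per model, values are the 0/1 error flags.
toy_pivot = pd.pivot_table(toy_split, values='is_misclassified',
                           index='datum_index', columns='Model')

# Pairwise Pearson correlation between the models' error flags.
toy_corr = toy_pivot.corr()
print(toy_corr)
```

In this toy setup `model_a` and `model_d` are never wrong on the same datum, so their correlation comes out at (numerically) -1, while `model_a` and `model_b` mostly fail together and correlate positively.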

Next, let's see the misclassifications by model and by number, using the overall data rather than only the split cases.