# Visualizing Results Interactively
This notebook takes Scania heavy truck failure data and visualizes it in two different ways.

1) Confusion Matrix with Sliding Threshold:  
* Demonstrates the effect of prioritizing different errors
* Shows how I minimize the model cost

2) 3-D Visualization of Data with PCA:
* Fun tool to visualize how separable the classes are

In [2]:
#plotly imports 

import plotly as py 
import plotly.graph_objs as go
from plotly import __version__
#use this format for working locally 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot, plot_mpl

init_notebook_mode(connected=True)

print('Plotly version: %s' %(__version__))


#Other Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from custom_metrics import scania_score, confusion_matrix
import data_cleaning as dc
import pickle

#Interactive Visualization
from ipywidgets import interactive, FloatSlider

Plotly version: 3.3.0


In [3]:
# Bring in data from csv files
X_train, X_test, y_train, y_test = dc.ready_aps_data()

In [4]:
# Reduce testing data to 3 principal components
# Scale and standardize (mean 0, std 1) before reducing to 3 components
ssx = StandardScaler()
scaled_X_test = ssx.fit_transform(X_test)

pca_3 = PCA(n_components=3)
X_3d = pca_3.fit_transform(scaled_X_test)
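
A quick way to judge how much information survives the reduction to 3 components is to look at the explained variance ratio of the fitted PCA. The sketch below uses synthetic data as a stand-in for `X_test` (the `demo_*` names and the generated data are assumptions for illustration, not part of the notebook); the same attribute should be available on `pca_3` after fitting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_test: 500 rows, 20 correlated features (hypothetical data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
demo_X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

# Same two steps as the cell above: standardize, then fit a 3-component PCA
demo_scaled = StandardScaler().fit_transform(demo_X)
demo_pca = PCA(n_components=3).fit(demo_scaled)

# Fraction of the total variance captured by each of the 3 components
explained = demo_pca.explained_variance_ratio_
print(explained, explained.sum())
```

A low total here would be one reason to treat the 3-D picture as a rough impression rather than a faithful view of the full feature space.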



In [5]:
# Trim absurdly large outliers for visualization reasons
values_to_retain = dc.reject_outliers(X_3d, m=10)
X_3d_graph = X_3d[values_to_retain]
y_3d_graph = y_test.values[values_to_retain]

print(X_3d_graph.shape, y_3d_graph.shape)

(15970, 3) (15970,)
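
The source of `dc.reject_outliers` is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch that keeps only rows where every component lies within `m` standard deviations of its column mean. The function name `reject_outliers_sketch` and its logic are assumptions, not the actual implementation in `data_cleaning`.

```python
import numpy as np

def reject_outliers_sketch(data, m=10):
    """Hypothetical stand-in for dc.reject_outliers.

    Returns a boolean row mask that is True where every column of the row
    is within m standard deviations of that column's mean.
    """
    data = np.asarray(data, dtype=float)
    z = np.abs(data - data.mean(axis=0)) / data.std(axis=0)
    return (z < m).all(axis=1)

# Demo: 100 random rows with one extreme value planted in the last row
demo = np.random.default_rng(0).normal(size=(100, 3))
demo[-1, 0] = 1000.0

mask = reject_outliers_sketch(demo, m=5)
print(demo[mask].shape)
```

Because the mask is a plain boolean array, it can index both the reduced features and the labels, as the cell above does with `X_3d` and `y_test.values`.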


In [6]:
# Create mask for positive and negative class
pos = y_3d_graph == 1

Plotly can only handle up to 5000 data points, likely fewer if I want to save the plot on their servers. With that in mind, I will still graph all of the rarer positive-class examples, but will limit the negative-class examples to a smaller subset of the current ~16,000.

In [12]:
# Create smaller section of negative class to plot

X_3d_graph_neg = X_3d_graph[~pos][np.random.choice(
    X_3d_graph[~pos].shape[0], 3000, replace=False)]
print(X_3d_graph_neg.shape)

(3000, 3)
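
The random subset above changes every time the cell runs, so the plot will look slightly different on each execution. If a repeatable figure is wanted, one option is a seeded NumPy generator. This is a sketch on made-up data; `demo_points`, `demo_pos`, and the seed value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed (arbitrary choice) so the sample is repeatable

# Hypothetical stand-ins for X_3d_graph and the positive-class mask
demo_points = rng.normal(size=(1000, 3))
demo_pos = rng.random(1000) < 0.05

neg_points = demo_points[~demo_pos]
n_keep = min(300, len(neg_points))  # guard against asking for more rows than exist
keep_idx = rng.choice(len(neg_points), size=n_keep, replace=False)
neg_sample = neg_points[keep_idx]
print(neg_sample.shape)
```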


Making this visualization was a lot of fun. It becomes more apparent the more of the negative class is plotted, but the two classes do not separate well. There's plenty of overlap in the 3D space. While it may not be fully representative of the original 100+ dimensional feature space, I do think it illustrates why, in order to minimize the number of false negatives, the models produce a fair number of false positives.

As far as the visuals go, adding lines to the spheres (giving them a visible edge) really helped to make the shapes pop and give a better sense of depth within 3 dimensions.

In [13]:
# Graphing 3-D reduced data
marker_pos = dict(size=3, symbol='circle',
                  #color='rgb(127, 127, 127)',
                  color = 'rgb(255, 127, 14)',
                  line=dict(width=1, color='rgba(217, 217, 217, 0.14)'),
                  opacity=0.8)
marker_neg = dict(size=3, symbol='circle',
                  color='rgb(127, 127, 127)',
                  #line=dict(color='rgb(204, 204, 204)',width=0.5),
                  opacity=0.8)

trace1 = go.Scatter3d(x = X_3d_graph[pos][:,0], y=X_3d_graph[pos][:,1], z=X_3d_graph[pos][:,2],
                    mode = 'markers',
                     marker = marker_pos,
                     name = 'APS Failure')

trace2 = go.Scatter3d(x = X_3d_graph_neg[:,0], y=X_3d_graph_neg[:,1], z=X_3d_graph_neg[:,2],
                    mode = 'markers',
                     marker = marker_neg,
                     name = 'No Check Needed')
data=[trace1, trace2]

layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    ),
    title = 'Scania Truck Data Reduced to 3 Components',
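    # Axis titles for a 3-D plot belong under `scene`, not at the top level of the layout
    scene = dict(
        xaxis = dict(title = 'PC1', titlefont = dict(size=18)),
        yaxis = dict(title = 'PC2', titlefont = dict(size=18)),
        zaxis = dict(title = 'PC3', titlefont = dict(size=18))
    )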
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='simple-3d-scatter')

## Sliding Confusion Matrix
This tool comes from a notebook written by one of the instructors at Metis, Lara Kattan. I've added the custom cost function for this classification problem, my current best-fit model, and a print statement to display the dynamically changing cost.
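
The body of `scania_score` lives in `custom_metrics` and is not shown here. The APS challenge that this dataset comes from prices an unnecessary check (false positive) at 10 and a missed failure (false negative) at 500, so a cost function along the following lines seems likely. Treat the name `scania_cost_sketch` and the implementation as assumptions, not the notebook's actual code.

```python
import numpy as np

COST_FALSE_POSITIVE = 10   # unnecessary workshop check
COST_FALSE_NEGATIVE = 500  # missed APS failure

def scania_cost_sketch(y_true, y_pred):
    """Hypothetical stand-in for scania_score: total cost of the two error types."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    false_pos = np.sum(~y_true & y_pred)
    false_neg = np.sum(y_true & ~y_pred)
    return int(COST_FALSE_POSITIVE * false_pos + COST_FALSE_NEGATIVE * false_neg)

# 2 false positives and 1 false negative: 2 * 10 + 1 * 500 = 520
demo_cost = scania_cost_sketch([0, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(demo_cost)
```

With a 50:1 penalty ratio, a threshold well below 0.5 is usually worth the extra false positives, which is why the slider further down only covers the lower half of the probability range.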

In [14]:
# load trained model
# Use a context manager so the file handle is closed after loading
with open('vanilla_rfc.pkl', 'rb') as f:
    best_model = pickle.load(f)

In [15]:
def make_confusion_matrix(model, threshold=0.5):
    # Predict class 1 if probability of being in class 1 is at least the threshold
    # (model.predict(X_test) does this automatically with a threshold of 0.5)
    y_predict = (model.predict_proba(X_test)[:, 1] >= threshold)
    aps_confusion = confusion_matrix(y_test, y_predict)
    plt.figure(dpi=130)
    sns.heatmap(aps_confusion, cmap=plt.cm.Blues, annot=True, square=True, fmt='d',
           xticklabels=['No Failure', 'Failure'],
           yticklabels=['No Failure', 'Failure']);
    plt.xlabel('Predicted')
    plt.ylabel('APS Status')
    print("Cost to Scania: $", scania_score(y_test,y_predict))

In [16]:
interactive(lambda threshold: make_confusion_matrix(best_model, threshold), threshold=(0.0,0.5,0.005))

interactive(children=(FloatSlider(value=0.25, description='threshold', max=0.5, step=0.005), Output()), _dom_c…
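
The slider is good for building intuition, but the cost-minimizing threshold can also be found with a plain sweep over the same grid. The sketch below runs on made-up labels and scores and assumes the 10/500 cost weights from the APS challenge; `cost_at_threshold`, `demo_y`, and `demo_proba` are hypothetical names, and in the notebook the scores would come from `best_model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np

def cost_at_threshold(y_true, proba, threshold, fp_cost=10, fn_cost=500):
    """Cost of predicting class 1 whenever proba >= threshold (assumed cost weights)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(proba) >= threshold
    false_pos = np.sum(~y_true & y_pred)
    false_neg = np.sum(y_true & ~y_pred)
    return int(fp_cost * false_pos + fn_cost * false_neg)

# Hypothetical labels (about 5% positive) and noisy scores that loosely track them
rng = np.random.default_rng(0)
demo_y = rng.random(2000) < 0.05
demo_proba = np.clip(np.where(demo_y, 0.6, 0.1) + rng.normal(scale=0.15, size=2000), 0, 1)

# Same grid as the slider: 0.0 to 0.5 in steps of 0.005
thresholds = np.linspace(0.0, 0.5, 101)
costs = np.array([cost_at_threshold(demo_y, demo_proba, t) for t in thresholds])

best_threshold = thresholds[costs.argmin()]
print(best_threshold, costs.min())
```

A threshold chosen on the test set this way is optimistic; picking it on a validation split and then reporting the test cost would give a fairer number.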