# Model Scoring Evaluation

Using the results data set constructed in the `4b_Model_Scoring` Jupyter notebook, this notebook loads the scored data and evaluates the model's predictions.

**Note:** This notebook will take about 1 minute to execute all cells, depending on the compute configuration you have set up.

In [0]:
# import the libraries

# For some data handling
import numpy as np
from pyspark.ml import PipelineModel
# for creating pipelines and model
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer

# Name of the table holding the scored results to evaluate
results_table = 'results_output'

In [0]:
dbutils.widgets.removeAll()
dbutils.widgets.text("results_data", results_table)

In [0]:
# Load the scored predictions from the results table
spark.catalog.refreshTable(dbutils.widgets.get("results_data"))
predictions = spark.table(dbutils.widgets.get("results_data"))

# Create the confusion matrix for the multiclass prediction results
# This result assumes a decision boundary of p = 0.5
conf_table = predictions.stat.crosstab('indexedLabel', 'prediction')
confuse = conf_table.toPandas()
confuse.head()

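|  | indexedLabel_prediction | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 89736 | 1 | 2 | 1 | 0 |
| 1 | 1.0 | 1582 | 716 | 0 | 0 | 0 |
| 2 | 2.0 | 1161 | 0 | 440 | 0 | 0 |
| 3 | 3.0 | 830 | 0 | 0 | 507 | 8 |
| 4 | 4.0 | 748 | 11 | 0 | 1 | 301 |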


The confusion matrix lists each true component failure in rows and the predicted value in columns. The label 0.0 corresponds to no component failures. Labels 1.0 through 4.0 correspond to failures in one of the four components in the machine. As an example, the third number in the top row indicates how many days we predicted component 2 would fail when no components actually did fail. The second number in the second row indicates how many days we correctly predicted a component 1 failure within the next 7 days.

We read the numbers along the diagonal of the confusion matrix as correctly classified days. The off-diagonal numbers in the top row indicate the model incorrectly predicting a failure when none occurred, and those in the first column indicate incorrectly predicting a non-failure for the component failure indicated by the row. Any remaining off-diagonal numbers are days where a failure occurred but the model predicted the wrong component.
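As a rough illustration of this reading, the sketch below hard-codes the confusion matrix shown above as a plain list of lists (the notebook itself works from the `confuse` DataFrame) and sorts every cell into correct predictions, false alarms, missed failures, or wrong-component predictions. The names `matrix` and `split_confusion` are illustrative and do not appear in the original notebook.

```python
# Confusion matrix copied from the output above:
# rows are true labels 0.0-4.0, columns are predicted labels 0.0-4.0.
matrix = [
    [89736, 1, 2, 1, 0],
    [1582, 716, 0, 0, 0],
    [1161, 0, 440, 0, 0],
    [830, 0, 0, 507, 8],
    [748, 11, 0, 1, 301],
]


def split_confusion(m):
    """Group the cells of a confusion matrix whose class 0 means 'no failure'."""
    groups = {"correct": 0, "false_alarm": 0,
              "missed_failure": 0, "wrong_component": 0}
    for true in range(len(m)):
        for pred in range(len(m)):
            count = m[true][pred]
            if true == pred:
                groups["correct"] += count          # diagonal
            elif true == 0:
                groups["false_alarm"] += count      # top row, off-diagonal
            elif pred == 0:
                groups["missed_failure"] += count   # first column, off-diagonal
            else:
                groups["wrong_component"] += count  # failure, wrong component
    return groups


groups = split_confusion(matrix)
for name, count in groups.items():
    print("%s: %d" % (name, count))
```

For this matrix, almost all of the errors fall into the missed-failure group, that is, the first column below the top row.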

When evaluating classification models, it is convenient to reduce the results in the confusion matrix into a single performance statistic. However, no single statistic is appropriate for every problem space. Below, we calculate four such statistics.

- **Accuracy**: reports how often we correctly predicted the labeled data. Unfortunately, when there is a class imbalance (a large number of one of the labels relative to others), this measure is biased towards the largest class, in this case non-failure days.

Because of the class imbalance inherent in predictive maintenance problems, it is better to look at the remaining statistics instead. Here positive predictions indicate a failure.
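One way to see the bias in accuracy is to compare the model against a trivial baseline that always predicts the most frequent class (here, "no failure"). The sketch below does this using the counts copied from the confusion matrix above; `matrix`, `accuracy`, and `majority_baseline_accuracy` are illustrative names that are not part of the notebook.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
matrix = [
    [89736, 1, 2, 1, 0],
    [1582, 716, 0, 0, 0],
    [1161, 0, 440, 0, 0],
    [830, 0, 0, 507, 8],
    [748, 11, 0, 1, 301],
]


def accuracy(m):
    """Fraction of all days that lie on the diagonal."""
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / total


def majority_baseline_accuracy(m):
    """Accuracy of always predicting the most frequent true class."""
    total = sum(sum(row) for row in m)
    return max(sum(row) for row in m) / total


model_acc = accuracy(matrix)
baseline_acc = majority_baseline_accuracy(matrix)
print("Model accuracy    = %g" % model_acc)
print("Baseline accuracy = %g" % baseline_acc)
```

With these counts the baseline already reaches roughly 93% accuracy without detecting a single failure, so a high accuracy on its own says little about how well failures are found.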

- **Precision**: Precision measures how many of the days the model predicts as positive are truly positive. Precision drops as more negative days are falsely classified as positive.

- **Recall**: Recall measures how many of the truly positive days the model finds. Recall drops as more positive days are falsely classified as negative.

- **F1**: F1 considers both the precision and the recall. The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 (perfect precision and recall) and its worst at 0.

These metrics make the most sense for binary classifiers, though they are still useful for comparison in our multiclass setting. Below we calculate these evaluation statistics for the selected classifier.
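Because these metrics are most natural for binary problems, one common way to apply them to a multiclass result is one-vs-rest: treat each class in turn as "positive" and all other classes as "negative". The sketch below illustrates that idea; it is not the calculation this notebook uses, and `matrix` and `per_class_metrics` are illustrative names. The counts are copied from the confusion matrix above.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
matrix = [
    [89736, 1, 2, 1, 0],
    [1582, 716, 0, 0, 0],
    [1161, 0, 440, 0, 0],
    [830, 0, 0, 507, 8],
    [748, 11, 0, 1, 301],
]


def per_class_metrics(m, k):
    """One-vs-rest precision, recall, and F1 for class index k."""
    tp = m[k][k]
    fp = sum(m[i][k] for i in range(len(m))) - tp  # rest of column k
    fn = sum(m[k]) - tp                            # rest of row k
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Report the four component-failure classes (index 0 is "no failure").
for k in range(1, len(matrix)):
    p, r, f = per_class_metrics(matrix, k)
    print("Component %d: precision = %.3f, recall = %.3f, F1 = %.3f" % (k, p, r, f))
```

Reporting the classes separately shows whether one component is harder to predict than the others, which a single pooled number can hide.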

In [0]:
# Compute the evaluation statistics from the confusion matrix
# True positives - diagonal failure terms
tp = confuse['1.0'][1]+confuse['2.0'][2]+confuse['3.0'][3]+confuse['4.0'][4]

# False positives - all predicted failure terms minus true positives
# (this also counts failures attributed to the wrong component)
fp = np.sum(np.sum(confuse[['1.0', '2.0','3.0','4.0']])) - tp

# True negatives
tn = confuse['0.0'][0]

# False negatives - total of the non-failure prediction column minus TN
fn = np.sum(np.sum(confuse[['0.0']])) - tn

# Accuracy is diagonal/total 
acc_n = tn + tp
acc_d = np.sum(np.sum(confuse[['0.0','1.0', '2.0','3.0','4.0']]))
acc = acc_n/acc_d

# Calculate precision and recall.
prec = tp/(tp+fp)
rec = tp/(tp+fn)

# Print the evaluation metrics to the notebook
print("Accuracy = %g" % acc)
print("Precision = %g" % prec)
print("Recall = %g" % rec )
print("F1 = %g" % (2.0 * prec * rec/(prec + rec)))
print("")

Remember that this is a simulated data set. We would expect a model built on real-world data to behave very differently. The accuracy may still be close to one, but the precision and recall numbers would be much lower.
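The metric calculation above looks up cells by position (for example `confuse['1.0'][1]`), which is only valid when the crosstab rows are sorted by label, as they are in the output shown. As a hedge against a different row order, the sketch below rebuilds the matrix keyed by label value. It assumes the rows are available as dictionaries, for instance via `[r.asDict() for r in conf_table.collect()]`; the helper name `ordered_matrix` and the shuffled sample rows are illustrative.

```python
def ordered_matrix(rows, label_col="indexedLabel_prediction"):
    """Return (labels, matrix) with rows and columns sorted by numeric label.

    Labels are taken from the true-label rows of the crosstab.
    """
    labels = sorted((row[label_col] for row in rows), key=float)
    by_label = {row[label_col]: row for row in rows}
    # A class that is never predicted has no column, so default to 0.
    matrix = [[by_label[t].get(p, 0) for p in labels] for t in labels]
    return labels, matrix


# Sample rows in shuffled order, using the counts from the matrix above.
rows = [
    {"indexedLabel_prediction": "2.0", "0.0": 1161, "1.0": 0, "2.0": 440, "3.0": 0, "4.0": 0},
    {"indexedLabel_prediction": "0.0", "0.0": 89736, "1.0": 1, "2.0": 2, "3.0": 1, "4.0": 0},
    {"indexedLabel_prediction": "4.0", "0.0": 748, "1.0": 11, "2.0": 0, "3.0": 1, "4.0": 301},
    {"indexedLabel_prediction": "1.0", "0.0": 1582, "1.0": 716, "2.0": 0, "3.0": 0, "4.0": 0},
    {"indexedLabel_prediction": "3.0", "0.0": 830, "1.0": 0, "2.0": 0, "3.0": 507, "4.0": 8},
]

labels, matrix = ordered_matrix(rows)
tp = sum(matrix[i][i] for i in range(1, len(labels)))  # diagonal failure terms
tn = matrix[0][0]                                      # correct non-failure days
print(labels)
print("tp = %d, tn = %d" % (tp, tn))
```
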

In [0]:
predictions.toPandas().head(20)

|  | machineID | dt_truncated | label_e | features | indexedLabel | indexedFeatures | rawPrediction | probability | prediction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 2016-01-01 12:00:00 | 0.0 | (190.8501951794539, 460.54198634121167, 99.540... | 0.0 | (190.8501951794539, 460.54198634121167, 99.540... | [191.6497477890058, 2.49599081161969, 5.239626... | [0.9582487389450287, 0.012479954058098446, 0.0... | 0.0 |
| 1 | 7 | 2016-01-01 00:00:00 | 0.0 | [187.6266823616143, 456.81889219354565, 99.762... | 0.0 | [187.6266823616143, 456.81889219354565, 99.762... | [176.11622174160905, 1.915640911787661, 20.787... | [0.8805811087080453, 0.009578204558938305, 0.1... | 0.0 |
| 2 | 7 | 2015-12-31 12:00:00 | 0.0 | [182.18881690839308, 435.6879390313984, 99.507... | 0.0 | [182.18881690839308, 435.6879390313984, 99.507... | [187.13372568619744, 1.6115818155734885, 9.222... | [0.9356686284309872, 0.008057909077867442, 0.0... | 0.0 |
| 3 | 7 | 2015-12-31 00:00:00 | 0.0 | [173.7423203814842, 460.0246224357595, 97.9361... | 0.0 | [173.7423203814842, 460.0246224357595, 97.9361... | [192.72941246631925, 2.5728646407658227, 1.231... | [0.9636470623315964, 0.012864323203829115, 0.0... | 0.0 |
| 4 | 7 | 2015-12-30 12:00:00 | 0.0 | [171.96918720533998, 437.9122537847154, 100.28... | 0.0 | [171.96918720533998, 437.9122537847154, 100.28... | [193.3305797449037, 1.7021715942472202, 1.1730... | [0.9666528987245182, 0.008510857971236099, 0.0... | 0.0 |
| 5 | 7 | 2015-12-30 00:00:00 | 0.0 | [166.36129053267317, 447.15764989587296, 99.86... | 0.0 | [166.36129053267317, 447.15764989587296, 99.86... | [194.62444550600773, 1.598460666575496, 1.1142... | [0.9731222275300384, 0.007992303332877477, 0.0... | 0.0 |
| 6 | 7 | 2015-12-29 12:00:00 | 0.0 | [170.4166984696444, 464.23182547134314, 101.90... | 0.0 | [170.4166984696444, 464.23182547134314, 101.90... | [194.30643525776676, 1.705367288766955, 1.1973... | [0.9715321762888339, 0.008526836443834777, 0.0... | 0.0 |
| 7 | 7 | 2015-12-29 00:00:00 | 0.0 | [178.20149940482136, 472.6140967147758, 100.48... | 0.0 | [178.20149940482136, 472.6140967147758, 100.48... | [189.5802963328041, 2.4186690545544764, 6.2647... | [0.9479014816640203, 0.012093345272772381, 0.0... | 0.0 |
| 8 | 7 | 2015-12-28 12:00:00 | 0.0 | [169.05200841442934, 467.7200447111329, 101.69... | 0.0 | [169.05200841442934, 467.7200447111329, 101.69... | [194.11216719370182, 1.771844144291362, 1.2209... | [0.9705608359685088, 0.008859220721456807, 0.0... | 0.0 |
| 9 | 7 | 2015-12-28 00:00:00 | 0.0 | [165.37406645393918, 472.12783096596195, 104.2... | 0.0 | [165.37406645393918, 472.12783096596195, 104.2... | [194.19892903654764, 1.709601057243787, 1.3243... | [0.9709946451827384, 0.008548005286218937, 0.0... | 0.0 |


In [0]:
print(predictions.summary())

In [0]:
predictions.explain()