# Model Evaluation

- Choose, justify and apply a model performance indicator (e.g. F1 score, true positive rate, within cluster sum of squared error, …) to assess your model and justify the choice of an algorithm

- Implement your algorithm in at least one deep learning and at least one non-deep learning algorithm, compare and document model performance

- Apply at least one additional iteration in the process model involving at least the feature creation task and record impact on model performance (e.g. data normalizing, PCA, …)

- Depending on the algorithm class and data set size you might choose specific technologies / frameworks to solve your problem. Please document all your decisions in the ADD (Architectural Decisions Document).

In [1]:
# load cleaned data
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [2]:
preppedDataDF = spark.read.parquet('preppedDataDF.parquet')
preppedDataDF.createOrReplaceTempView("preppedDataDF")

### Third training iteration after adding new features with function and loop

Note: file versions may differ slightly. I originally worked on model definition, training and evaluation in a single notebook; to finalize the project, I am moving each piece into its own notebook.

## Load trained models and evaluate

I use the confusion matrix as a measure of performance for these models.

## Model 1: Logistic Regression (Classification)
### Supervised machine learning
#### Classification of tiers of players

In [12]:
from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

In [40]:
lrModel = LogisticRegressionModel.load("lrModel_trained")

In [41]:
result = lrModel.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
metrics = MulticlassMetrics(predictionAndLabels.rdd)

In [42]:
# Called without a label, precision(), recall() and fMeasure() return micro-averaged values,
# which for single-label multiclass data all equal the overall accuracy
print('Precision: {}, Recall: {}'.format(metrics.precision(), metrics.recall()))
print('F-Score: {}'.format(metrics.fMeasure()))

Precision: 0.7849223946784922, Recall: 0.7849223946784922
F-Score: 0.7849223946784922

In [43]:
metrics.confusionMatrix()

DenseMatrix(4, 4, [43.0, 15.0, 0.0, 1.0, 8.0, 18.0, 8.0, 1.0, 1.0, 7.0, 6.0, 0.0, 1.0, 14.0, 41.0, 287.0], 0)
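The `DenseMatrix` above appears to store its values in column-major order (the trailing `0` being the `isTransposed` flag), and Spark's `MulticlassMetrics` places actual labels in rows and predicted labels in columns. A minimal sketch, assuming that layout, that rebuilds the 4x4 grid and reads off accuracy plus per-class precision and recall. The helper name `unflatten_column_major` is made up for illustration and is not part of Spark.

```python
# Values copied from the DenseMatrix output above (assumed column-major).
values = [43.0, 15.0, 0.0, 1.0, 8.0, 18.0, 8.0, 1.0,
          1.0, 7.0, 6.0, 0.0, 1.0, 14.0, 41.0, 287.0]
n = 4


def unflatten_column_major(flat, size):
    """Return the rows of a size x size matrix from column-major values."""
    return [[flat[col * size + row] for col in range(size)]
            for row in range(size)]


# Assumed layout: cm[actual][predicted]
cm = unflatten_column_major(values, n)

total = sum(values)
accuracy = sum(cm[k][k] for k in range(n)) / total
print('Accuracy: {:.4f}'.format(accuracy))

for k in range(n):
    predicted_k = sum(cm[a][k] for a in range(n))  # column total
    actual_k = sum(cm[k])                          # row total
    precision = cm[k][k] / predicted_k if predicted_k else 0.0
    recall = cm[k][k] / actual_k if actual_k else 0.0
    print('Class {}: precision {:.2f}, recall {:.2f}'.format(k, precision, recall))
```

The accuracy works out to 354 / 451, the same 0.7849 printed for precision, recall and F-score above, which is consistent with those three being micro-averaged values.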

After tuning the hyper-parameters, this model's performance improved.
Previously, all of the predictions fell into the same class. Most of the player and team stats appear to be quite similar, so logistic regression finds no clear linear boundary between the tiers.
The tuned model nevertheless produces much better results.

In [56]:
print('Optimal hyper-parameters used:')
lrModel.stages

Optimal hyper-parameters used:


AttributeError: 'LogisticRegressionModel' object has no attribute 'stages'
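`stages` is an attribute of `PipelineModel`; a `LogisticRegressionModel` loaded on its own is a single stage, which would explain the `AttributeError`. A hedged sketch of one way to list the parameters a fitted model carries, using `extractParamMap()` from `pyspark.ml.param.Params`. The helper `param_summary` and the stub classes are hypothetical and exist only so the snippet runs without a Spark session; the stub values are placeholders, not the tuned settings. Which parameters a loaded model exposes depends on the Spark version.

```python
def param_summary(model):
    """Map parameter names to values for any object exposing extractParamMap()."""
    return {param.name: value
            for param, value in model.extractParamMap().items()}


# Stand-ins for a Spark Param and model (hypothetical, for a Spark-free demo).
class _StubParam:
    def __init__(self, name):
        self.name = name


class _StubModel:
    def extractParamMap(self):
        # Placeholder values only; NOT the hyper-parameters of lrModel.
        return {_StubParam('regParam'): 0.01, _StubParam('maxIter'): 100}


summary = param_summary(_StubModel())
for name in sorted(summary):
    print('{}: {}'.format(name, summary[name]))
```

In this notebook the call would be `param_summary(lrModel)`.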

## Model 2: MultilayerPerceptronClassifier (MLP) (Classification)
### More primitive deep learning
#### Classification of tiers of players

In [36]:
from pyspark.ml.classification import MultilayerPerceptronClassificationModel

In [57]:
MLPModel = MultilayerPerceptronClassificationModel.load("MLPModel_trained")

In [58]:
result2 = MLPModel.transform(test)
predictionAndLabels2 = result2.select("prediction", "label")
evaluator2 = MulticlassClassificationEvaluator(metricName="accuracy")
metrics2 = MulticlassMetrics(predictionAndLabels2.rdd)

In [59]:
metrics2.confusionMatrix()

DenseMatrix(4, 4, [6.0, 5.0, 7.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 47.0, 49.0, 48.0, 287.0], 0)

MLP performs substantially worse than logistic regression in this example. Most of the predictions fall in class 3, which contains most of the players (the ones not ranked in the top 15). There are a few predictions for the top class, but the precision for that class is poor: only 6 of the 20 are correct.
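Because class 3 holds 289 of the 451 test rows, overall accuracy is dominated by that class. A sketch of macro-averaged F1, which weights every class equally, computed from the two Spark confusion matrices above. It assumes column-major values with actual labels in rows and predictions in columns; `macro_f1` is an illustrative helper, not a Spark function.

```python
def macro_f1(flat, size):
    """Macro-averaged F1 from column-major confusion-matrix values.

    Assumes entry (row, col) counts test rows whose actual label is `row`
    and whose predicted label is `col`.
    """
    scores = []
    for k in range(size):
        tp = flat[k * size + k]
        predicted_k = sum(flat[k * size + row] for row in range(size))  # column k
        actual_k = sum(flat[col * size + k] for col in range(size))     # row k
        denom = predicted_k + actual_k
        # F1 = 2*TP / (predicted + actual); taken as 0 when the class never appears
        scores.append(2.0 * tp / denom if denom else 0.0)
    return sum(scores) / size


# Values copied from the two DenseMatrix outputs above.
lr_values = [43.0, 15.0, 0.0, 1.0, 8.0, 18.0, 8.0, 1.0,
             1.0, 7.0, 6.0, 0.0, 1.0, 14.0, 41.0, 287.0]
mlp_values = [6.0, 5.0, 7.0, 2.0, 0.0, 0.0, 0.0, 0.0,
              0.0, 0.0, 0.0, 0.0, 47.0, 49.0, 48.0, 287.0]

lr_f1 = macro_f1(lr_values, 4)
mlp_f1 = macro_f1(mlp_values, 4)
print('Macro F1, logistic regression: {:.3f}'.format(lr_f1))
print('Macro F1, MLP: {:.3f}'.format(mlp_f1))
```

Classes 1 and 2 contribute an F1 of zero for the MLP because it never predicts them, so the gap between the two models is wider on this metric than on accuracy.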

## Model 3: Neural Net (Classification)
### Deep learning
#### Classification of tiers of players

In [25]:
import torch
import torch.nn as nn
import torchvision
import torch.nn.functional as F

In [26]:
# define the model by subclassing Module
class Net(nn.Module):

    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate three nn.Linear modules and one nn.ReLU
        and assign them as member variables.

        D_in: input dimension
        H: dimension of hidden layer
        D_out: output dimension
        """
        super(Net, self).__init__()
        # define layers here; they can be reused
        self.layer_1 = nn.Linear(D_in, H, bias=True)
        self.relu = nn.ReLU()
        self.layer_2 = nn.Linear(H, H, bias=True)
        self.output_layer = nn.Linear(H, D_out, bias=True)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must
        return a Tensor of output data. We can use Modules defined in the
        constructor as well as arbitrary operators on Tensors.
        """
        out = self.layer_1(x)
        out = self.relu(out)
        out = self.layer_2(out)
        out = self.relu(out)
        out = self.output_layer(out)
        return out

In [27]:
model = torch.load("torch_nn")

In [28]:
model.load_state_dict(torch.load("torch_nn_trained.pt"))

In [33]:
from torch.autograd import Variable
import collections

In [29]:
# unfortunately, a multilabelconfusionmeter isn't available in pytorch yet
# https://discuss.pytorch.org/t/multilabelconfusionmeter-in-pytorch/12803
# so had to write one myself

def multilabelconfusionmeter(test_dataset, model):
    '''
    Build a confusion matrix as a nested dict: predicted class -> actual class -> count.

    :param test_dataset: test data set that was split (or use in-sample)
    :param model: the instance of my model class that subclasses nn.Module
    :return: confusion_matrix[predicted][actual] = number of test rows
    '''
    # convert the Spark DataFrame to pandas once and reuse it
    test_pdf = test_dataset.toPandas()
    ff_features = torch.tensor(test_pdf['features']).float()
    labels = torch.tensor(test_pdf['label']).long()

    # inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(ff_features)
    _, predicted = torch.max(outputs, 1)

    # one class per output unit, so a class missing from the test labels still gets a row
    num_labels = outputs.shape[1]

    confusion_matrix = collections.defaultdict(dict)
    # initialize the results dictionary
    for i in range(num_labels):
        for j in range(num_labels):
            confusion_matrix[i][j] = 0

    # loop and fill in matrix
    for a, p in zip(labels, predicted):
        # labels are the actual value
        confusion_matrix[p.item()][a.item()] += 1
    return confusion_matrix

In [34]:
mcm = multilabelconfusionmeter(test, model)
mcm

defaultdict(dict,
            {0: {0: 48, 1: 3, 2: 0, 3: 1},
             1: {0: 5, 1: 42, 2: 8, 3: 1},
             2: {0: 0, 1: 8, 2: 42, 3: 6},
             3: {0: 0, 1: 1, 2: 5, 3: 281}})

The neural net fares much better at prediction and compares favorably to the other models. The highest and lowest classes are predicted quite well. The middle classes are much smaller than class 3, and the NN gets those correct far more often than the original model and feature specification did. With the additional features created in feat_eng.v4, this iteration also beats the model's original performance.

In [35]:
# rows of mcm are predicted classes, so diagonal / row total is per-class precision
for x in [0, 1, 2, 3]:
    print('Class {} precision: {:.2f}%'.format(
        x, mcm[x][x] / sum(mcm[x].values()) * 100))

Class 0 precision: 92.31%
Class 1 precision: 75.00%
Class 2 precision: 75.00%
Class 3 precision: 97.91%
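The per-class figures above divide each diagonal count by its row total, and the rows of `mcm` are predictions, so they measure per-class precision. A short sketch of the complementary numbers, per-class recall (diagonal over column total) and overall accuracy, using the counts copied from the `mcm` output; the variable names are illustrative.

```python
# Counts copied from the mcm output above: mcm[predicted][actual]
mcm = {0: {0: 48, 1: 3, 2: 0, 3: 1},
       1: {0: 5, 1: 42, 2: 8, 3: 1},
       2: {0: 0, 1: 8, 2: 42, 3: 6},
       3: {0: 0, 1: 1, 2: 5, 3: 281}}

classes = sorted(mcm)
total = sum(sum(row.values()) for row in mcm.values())
correct = sum(mcm[k][k] for k in classes)
accuracy = correct / total
print('Overall accuracy: {:.2f}%'.format(accuracy * 100))

recall = {}
for k in classes:
    # column total: every test row whose actual class is k
    actual_k = sum(mcm[p][k] for p in classes)
    recall[k] = mcm[k][k] / actual_k if actual_k else 0.0
    print('Class {} recall: {:.2f}%'.format(k, recall[k] * 100))
```

Overall accuracy comes to 413 / 451, compared with 354 / 451 for the logistic regression model on the same test split.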
