# Model Training

- Choose, justify and apply a model performance indicator (e.g. F1 score, true positive rate, within cluster sum of squared error, …) to assess your model and justify the choice of an algorithm

- Implement your algorithm in at least one deep learning and at least one non-deep learning algorithm, compare and document model performance

- Apply at least one additional iteration in the process model involving at least the feature creation task and record impact on model performance (e.g. data normalizing, PCA, …)

- Depending on the algorithm class and data set size you might choose specific technologies / frameworks to solve your problem. Please document all your decisions in the ADD (Architectural Decisions Document).

In [1]:
# load cleaned data
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [2]:
preppedDataDF = spark.read.parquet('preppedDataDF.parquet')
preppedDataDF.createOrReplaceTempView("preppedDataDF")

### Third training iteration after adding new features with function and loop

Note: file versions may differ slightly. I originally worked on model definition, training and evaluation in a single notebook; to finalize the project, I am moving each piece into its own notebook.

## Get training and test datasets

In [3]:
from systemml import MLContext, dml
ml = MLContext(spark)

In [4]:
# Base the classification on previous years' attempts for the player's position.
# Create a label (the player tier) to classify against.
sdf2 = spark.sql('''
SELECT player_tier AS label, features
FROM preppedDataDF
''')
#SELECT cast(player_att_norm as float) AS class, player_tier AS label, features
sdf2.createOrReplaceTempView("sdf2")

In [5]:
# Split the data into train and test
splits = sdf2.randomSplit([0.6, 0.4], seed=1234)
train = splits[0]
test = splits[1]

In [59]:
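# Number of input features, read from the first training row
# (avoids converting the whole DataFrame to pandas just to get a length)
num_inputs = len(train.first()['features'])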

## Model 1: Logistic Regression (Classification)
### Supervised machine learning
#### Classification of tiers of players

In [29]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

In [160]:
lr = LogisticRegression.load("logistic_regression")

**Hyper-parameter tuning**

In [161]:
# for hyper parameters tuning
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [162]:
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.5, 0.3, 0.1, 0.01]).build()

In [163]:
crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=3)

In [164]:
# Run cross-validation, and choose the best set of parameters.
lr_cvModel = crossval.fit(train)
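
As a rough illustration of what the cross-validation above involves: with 4 values of regParam and 3 folds, one model is fitted per (parameter, fold) pair, the metric is averaged per parameter value, and the best setting is refitted on the full training set. The sketch below is plain Python, not Spark code; `k_fold_indices` is a hypothetical helper, and the strided fold assignment is a simplification (Spark assigns rows to folds randomly). Note also that `MulticlassClassificationEvaluator()` with no arguments selects on F1 rather than accuracy, as far as I know.

```python
from itertools import product


def k_fold_indices(n_rows, num_folds):
    """Yield (training_indices, validation_indices) for each fold.

    Simplified: rows are assigned to folds by striding, not randomly.
    """
    indices = list(range(n_rows))
    for fold in range(num_folds):
        validation = indices[fold::num_folds]
        validation_set = set(validation)
        training = [i for i in indices if i not in validation_set]
        yield training, validation


# Same grid and fold count as the notebook cells above.
reg_params = [0.5, 0.3, 0.1, 0.01]
num_folds = 3

# Toy data set of 9 rows, just to show how the folds partition it.
folds = list(k_fold_indices(9, num_folds))

# Every (regParam, fold) pair corresponds to one model fit during the search.
fits = list(product(reg_params, range(num_folds)))
print("model fits during search:", len(fits))
```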

In [165]:
result = lr_cvModel.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
metrics = MulticlassMetrics(predictionAndLabels.rdd)

In [166]:
evaluator.evaluate(predictionAndLabels)

0.7849223946784922

In [173]:
lr_cvModel.bestModel.interceptVector

DenseVector([-0.3803, -0.3984, -0.5242, 1.3029])

In [119]:
!rm -rf lrModel_trained
lr_cvModel.bestModel.save("lrModel_trained")

## Model 2: MultilayerPerceptronClassifier (MLP) (Classification)
### More primitive deep learning
#### Classification of tiers of players

In [32]:
from pyspark.ml.classification import MultilayerPerceptronClassifier

In [155]:
MLP_trainer = MultilayerPerceptronClassifier.load("MLP_trainer")

In [156]:
# train the model
MLPModel = MLP_trainer.fit(train)

In [157]:
# compute accuracy on the test set
result2 = MLPModel.transform(test)
predictionAndLabels2 = result2.select("prediction", "label")
evaluator2 = MulticlassClassificationEvaluator(metricName="accuracy")
metrics2 = MulticlassMetrics(predictionAndLabels2.rdd)

In [158]:
evaluator2.evaluate(predictionAndLabels2)

0.6496674057649667

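Trained to predict player tier from just the number of plays, this MLP reaches about 65% accuracy on the test set. That is not a great result, given that most players fall in tier 3, so a classifier that always predicts tier 3 would already score reasonably well. Across various runs, the MLP performs about the same as logistic regression, although in this particular run logistic regression scored higher (78% vs. 65%).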
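
To make the comparison against "most players are in tier 3" concrete, one option is to compute a majority-class baseline: the accuracy obtained by always predicting the most frequent label. The sketch below is plain Python with made-up labels (they are NOT this project's data), and `majority_baseline_accuracy` is a hypothetical helper name. In the notebook, the real label counts could come from something like `test.groupBy("label").count()`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical tier labels, not the project's data: tier 3 dominates.
example_labels = [3] * 6 + [0] * 2 + [1] + [2]

baseline_label, baseline_accuracy = majority_baseline_accuracy(example_labels)
print("always predict tier", baseline_label, "->", baseline_accuracy)
```

A model is only adding value if its accuracy is clearly above this baseline.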

In [159]:
!rm -rf "MLPModel_trained"
MLPModel.save("MLPModel_trained")

## Model 3: Neural Net (Classification)
### Deep learning
#### Classification of tiers of players

In [40]:
import torch
import torch.nn as nn
import torchvision
import torch.nn.functional as F

In [55]:
import sys
import os
import imp

I tried to load the saved PyTorch model as is, but that fails because it cannot find the Net class. A slightly cleaner solution is to load the class explicitly.

See: https://discuss.pytorch.org/t/error-loading-saved-model/8371/4

I tried to work around this by using imp to load the class from the module file manually (the periods in the file name prevent a simple 'from module import Net').
That did not work either, so I am defining the Net class again here in this notebook.

In [57]:
os.getcwd()

'/gpfs/global_fs01/sym_shared/YPProdSpark/user/s6b7-e822a2b9f546a1-2b5348f6e911/notebook/work'

In [51]:
sys.path.append(os.getcwd())

In [58]:
Net = imp.load_source('Net', '/gpfs/global_fs01/sym_shared/YPProdSpark/user/s6b7-e822a2b9f546a1-2b5348f6e911/notebook/work/YahooFF.model_def.python.v4.py')

FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/global_fs01/sym_shared/YPProdSpark/user/s6b7-e822a2b9f546a1-2b5348f6e911/notebook/work/YahooFF.model_def.python.v4.py'
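
The error above only says that the file was not found at that path. Separately, `imp` is deprecated and was removed in Python 3.12, so a sketch of the same idea with `importlib` may be useful. It loads a source file by path, which sidesteps the extra periods in the file name. The helper name `load_module_from_path` and the demo file are hypothetical and stand in for the real model definition file.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, file_path):
    """Load a Python source file as a module, even if its name contains dots."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway file whose name has extra periods (hypothetical name).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.model_def.python.v4.py")
with open(demo_path, "w") as fh:
    fh.write("class Net:\n    hidden = 20\n")

demo_module = load_module_from_path("model_def", demo_path)
DemoNet = demo_module.Net
print(DemoNet.hidden)
```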

Loading model explicitly

In [75]:
# define the model by subclassing Module
class Net(nn.Module):

    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.

        D_in: input dimension
        H: dimension of hidden layer
        D_out: output dimension
        """
        super(Net, self).__init__()
        # Define the layers here; they can be reused in forward().
        self.layer_1 = nn.Linear(D_in, H, bias=True)
        self.relu = nn.ReLU()
        self.layer_2 = nn.Linear(H, H, bias=True)
        self.output_layer = nn.Linear(H, D_out, bias=True)

    def forward(self, x):
        """
        In the forward function we accept a Variable of input data and we must 
        return a Variable of output data. We can use Modules defined in the 
        constructor as well as arbitrary operators on Variables.
        """
        out = self.layer_1(x)
        out = self.relu(out)
        out = self.layer_2(out)
        out = self.relu(out)
        out = self.output_layer(out)
        return out


In [63]:
model = torch.load("torch_nn")

In [64]:
def one_hot_encode(target_rows, target_cols, y_tensor):
    '''
    :param target_rows: row dimension
    :param target_cols: column dimension
    :param y_tensor: the y which is a tensor of integers that you want 1 hot encoded (column dim must be correct)
    '''
    # initialize a tensor with the desired dimensions
    y_onehot = torch.LongTensor(target_rows, target_cols)
    # loop through.  make the whole row zero and then set the index indicated by each value in y_tensor to 1
    for y_row, y1s_row in zip(y_tensor.view(-1,1),y_onehot):
        y1s_row.zero_()    
        y1s_row.scatter_(-1, y_row.long(), 1)
    return y_onehot

In [65]:
# you have to one hot encode y so you get a tensor of size 827x4
y_onehot = one_hot_encode(827, 4, y)

In [66]:
y_onehot

tensor([[1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        ...,
        [0, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 1]])

In [67]:
# loss_fn = torch.nn.MSELoss(reduction='sum')
loss_fn = nn.CrossEntropyLoss()
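
The choice of loss determines the target format: `nn.CrossEntropyLoss` takes raw logits and integer class indices, whereas the commented-out `MSELoss` needs one-hot targets. The sketch below works through the cross-entropy of a single sample in plain Python to show that indexing by class and multiplying by a one-hot vector give the same value. The logits are hypothetical, and the function names are illustrative rather than part of PyTorch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return [v - log_sum_exp for v in logits]


def cross_entropy(logits, target_index):
    """Cross-entropy with an integer class index as target."""
    return -log_softmax(logits)[target_index]


def cross_entropy_one_hot(logits, one_hot):
    """The same quantity written with a one-hot target vector."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))


# Hypothetical raw network outputs for one player over the 4 tiers.
example_logits = [2.0, 0.5, -1.0, 0.1]
loss_from_index = cross_entropy(example_logits, 0)
loss_from_one_hot = cross_entropy_one_hot(example_logits, [1, 0, 0, 0])

# With all-equal logits the loss is log(number of classes).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
print(loss_from_index, loss_from_one_hot, uniform_loss)
```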

In [68]:
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [69]:
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

model.train()  # Put layers such as dropout and batch normalization into training mode.
for epoch in range(2):
    for t in range(10000):
        # Forward pass: compute predicted y by passing x to the model.
        y_pred = model(x)

        # Compute and print loss.
        # if using crossentropy loss, it expects class indices.  if MSELoss, onehot
        if type(loss_fn) is torch.nn.modules.loss.CrossEntropyLoss:
            loss = loss_fn(y_pred, y.long())
        else:
            loss = loss_fn(y_pred, y_onehot.float())
        if t % 5000 == 4999:  # print every 5000 mini-batches
            print('epoch: {}, mini-batch: {}, loss: {}'.format(epoch+1, t+1, loss.item()))

        # Before the backward pass, use the optimizer object to zero all of the
        # gradients for the variables it will update (which are the learnable
        # weights of the model). This is because by default, gradients are
        # accumulated in buffers( i.e, not overwritten) whenever .backward()
        # is called. Checkout docs of torch.autograd.backward for more details.
        optimizer.zero_grad()

        # Backward pass: compute gradient of the loss with respect to model
        # parameters
        loss.backward()

        # Calling the step function on an Optimizer makes an update to its
        # parameters
        optimizer.step()
        
print('Finished Training')

epoch: 1, mini-batch: 5000, loss: 0.6576940417289734
epoch: 1, mini-batch: 10000, loss: 0.3151319921016693
epoch: 2, mini-batch: 5000, loss: 0.24604155123233795
epoch: 2, mini-batch: 10000, loss: 0.22356271743774414
Finished Training


In [70]:
model.eval()  # Put layers such as dropout and batch normalization into evaluation mode.

Net(
  (layer_1): Linear(in_features=24, out_features=20, bias=True)
  (relu): ReLU()
  (layer_2): Linear(in_features=20, out_features=20, bias=True)
  (output_layer): Linear(in_features=20, out_features=4, bias=True)
)

In [74]:
torch.save(model.state_dict(), "torch_nn_trained.pt")

In [77]:
from torch.autograd import Variable

In [78]:
# Test the model
correct = 0
total = 0
total_test_data = test.count()
test_pdf = test.toPandas()  # convert once and reuse for features and labels
batch_x_test = torch.tensor(test_pdf['features'])
batch_y_test = torch.tensor(test_pdf['label']).long()
# batch_y_test = one_hot_encode(total_test_data, 4, batch_y_test)
ff_features = Variable(torch.FloatTensor(batch_x_test))
labels = Variable(torch.LongTensor(batch_y_test))
outputs = model(ff_features)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()

In [79]:
print('Got {:.2f}% : {} out of {}'.format(correct/total*100,correct,total))

Got 91.57% : 413 out of 451
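
Accuracy alone can hide weak performance on the smaller tiers, so a per-class metric such as macro-averaged F1 is a useful cross-check. The following is a minimal plain-Python sketch, assuming integer labels and predictions as lists. The example values are made up and are not the notebook's predictions, and `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels and predictions for three classes.
example_true = [0, 0, 1, 1, 2, 2]
example_pred = [0, 0, 1, 2, 2, 2]

example_accuracy = sum(t == p for t, p in zip(example_true, example_pred)) / len(example_true)
example_macro_f1 = macro_f1(example_true, example_pred)
print(example_accuracy, example_macro_f1)
```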


This is a much better result than the MLP or logistic regression, and better than the first version of the neural network.

The difference this time is the additional features: z-scores of each player's stats relative to the other players at the same position in the same season.
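
As an illustration of that feature, the sketch below computes a z-score within each (position, season) group in plain Python. The rows, the column names and the `add_group_z_scores` helper are hypothetical, and using the population standard deviation is an assumption; the actual feature creation code lives in a separate notebook and may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev


def add_group_z_scores(rows, value_key, group_keys):
    """Return copies of rows with a '<value_key>_z' field computed per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row[value_key])

    # Mean and population standard deviation for each group.
    stats = {g: (mean(v), pstdev(v)) for g, v in groups.items()}

    scored = []
    for row in rows:
        mu, sigma = stats[tuple(row[k] for k in group_keys)]
        # Guard against groups where every value is identical.
        z = (row[value_key] - mu) / sigma if sigma > 0 else 0.0
        scored.append({**row, value_key + "_z": z})
    return scored


# Hypothetical rows, not the project's data.
example_rows = [
    {"player": "A", "position": "RB", "season": 2017, "attempts": 100},
    {"player": "B", "position": "RB", "season": 2017, "attempts": 200},
    {"player": "C", "position": "RB", "season": 2017, "attempts": 300},
    {"player": "D", "position": "WR", "season": 2017, "attempts": 10},
    {"player": "E", "position": "WR", "season": 2017, "attempts": 30},
]

scored_rows = add_group_z_scores(example_rows, "attempts", ["position", "season"])
for row in scored_rows:
    print(row["player"], row["position"], round(row["attempts_z"], 3))
```

Scaling within position and season means a raw stat is judged against comparable players rather than the whole league.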