Welcome! This notebook takes you through the process I went through while creating models to predict fantasy league scores for a player. Let me start by describing the data.

I have data in the form X[0], X[1], X[2], ... , X[35], Y for each player in the premier league for 2013-14 and 2016-17 season. I have cleaned the data for some fields but I will skip describing that process for now. A single entry in dictionary form looks like this -

In [None]:
{
    'X': {
        'assists_per_match_played': 0.1111111111111111,
        'avg_assists_form': 0.0,
        'avg_bps_form': 20.0,
        'avg_clean_sheets_form': 0.0,
        'avg_goals_conceded_form': 2.0,
        'avg_goals_scored_form': 0.3333333333333333,
        'avg_minutes_form': 86.66666666666667,
        'avg_net_transfers_form': 103.33333333333333,
        'avg_points_form': 4.0,
        'avg_red_cards_form': 0.0,
        'avg_saves_form': 0.0,
        'avg_yellow_cards_form': 0.0,
        'bps_per_match_played': 10.777777777777779,
        'clean_sheets_per_match_played': 0.07407407407407407,
        'goals_conceded_per_match_played': 1.0,
        'goals_scored_per_match_played': 0.037037037037037035,
        'is_at_home': 1,
        'last_season_points_per_minutes': 0.03913894324853229,
        'minutes_per_match_played': 57.925925925925924,
        'net_transfers_per_match_played': 409.3703703703704,
        'opponent_goals_conceded_per_match': 1.4324324324324325,
        'opponent_goals_scored_per_match': 1.5405405405405406,
        'opponent_points_per_match': 1.6486486486486487,
        'opponent_points_per_match_last_season': 1.8157894736842106,
        'points_per_match_played': 2.0,
        'price': 4.4,
        'price_change_form': 0.0,
        'red_cards_per_match_played': 0.037037037037037035,
        'saves_per_match_played': 0.0,
        'team_goals_conceded_per_match': 1.3783783783783783,
        'team_goals_scored_per_match': 0.8918918918918919,
        'team_points_per_match': 0.918918918918919,
        'team_points_per_match_last_season': 0.9736842105263158,
        'yellow_cards_per_match_played': 0.07407407407407407
    },
    'Y': {u'points_scored': 2.0}
}

# A bit about earlier trials

Before I get into the details of what worked, let me describe some initial attempts that didn't.

The primary problem with our data is that it is highly imbalanced. Most players score low, with only a few instances of players who score above 5 pts.

Also, since charateristics of each outfield position are different, I decided to train the models seperately for forwards, midfielders, defenders and goalkeepers.

Things I tried before settling in on the solution -

* Linear Regression - Simply wasn't able to give any meaningful predictions. I suspect this is because of high non-linearity and imbalance

* Regression using neural network - Just couldn't get it to work, all predictions seemed to be either concentrated in the low points zone or were very random

This is when I decided to convert the problem into a classification one. I experimented with multiple bin sizes and finally came to this categorisation -
    a. points scored less than 5 - 'low'
    b. between 5 - 8 - 'medium'
    c. greater than 8 - 'high'
For midfielders, this gives a ratio of 'low': 'medium': 'high' as 21.1: 1.53: 1.0

It is already clear that our data is heavily skewed in favor of 'low' points. To test our model in such a case, accuracy is a bad metric to use. If I predict all outcomes to be 'low', I immediately get accuracy of 89%, but it quite meaningless in terms of prediction. We will use a confusion matrix to tune our model instead as it gives a much better picture of how our model is performing. So next I tried -

* 3 layer neural network (3nn) - As expected, it breaks down and predicts all outcomes to be 'low'. We need to somehow offset the imbalance in our data.

* 3nn with oversampling using SMOTE - One of the techniques to overcome imbalance is oversampling. Here we create interpolated copies of minority class data points to balance the dataset. SMOTE is one such algorithm to create these artificial data points. Unfortunately, this didn't work in our case (why needs to be still evaluated).

* 3nn with undersampling - Contrary to oversampling, undersampling creates balance in dataset by removing datapoints from the majority class. This though has a clear disadvantage of losing out on information. This alone also didn't work well on the model.

* 3nn with class weights - Another way to tackle imbalance is by assigning additional costs or weights to each class in the loss calculation. This forces the loss function to assign more importance to the minority classes. I started with weights as ratios of datapoints for each class and gradually iterated to a value which was giving better confusion matrix for our validation data. Finally, something that seemed to work!

* 5nn with class weights - Deeper networks are often better at defining complex relationships to identify minority classes. I started increasing the number of hidden layers and found that a 5 layered fully connected network was working well for our problem.

The 5nn with weights was converging to a solution, but what I realized was that depending on the learning rate, it would sometimes converge to a local minima of 'all low' solution, for which the cost was only slightly higher than our desired model. It struck me that it might be a good idea to undersample the data a bit to increase this difference in loss and stabilize it.

This brings us to the model which I am currently using to predict results, a 5-layered neural network with optimized class weights and 40% undersampling for majority class.

# Model description

## Preprocessing data


In [None]:
def preprocess(
    df,
    trial,
):
    # # shuffle dataframe rows
    if trial:
        frac = 0.25
    else:
        frac = 1
    df = df.sample(frac=frac).reset_index(drop=True)
    dataset = df.values
    # # split into X and Y
    num_of_features = dataset.shape[1] - 1
    X = dataset[:, 0:num_of_features]
    Y = dataset[:, num_of_features]

    # # bin data into categories
    bins = CLASS_BINS
    bin_names = CLASSES
    categories = pd.cut(Y, bins, labels=bin_names)
    
    # # one hot encode
    Y = pd.get_dummies(categories).values

    # # split into training, test and validation sets
    num_of_samples = X.shape[0]
    val_ratio = 0.2
    test_ratio = 0.1
    train_ratio = 1 - val_ratio - test_ratio
    num_of_val_samples = int(val_ratio * num_of_samples)
    num_of_train_samples = int(train_ratio * num_of_samples)

    X_train = X[0:(num_of_train_samples + 1), :]
    Y_train = Y[0:(num_of_train_samples + 1), :]
    X_val = X[num_of_train_samples:(num_of_train_samples + num_of_val_samples + 1), :]
    Y_val = Y[num_of_train_samples:(num_of_train_samples + num_of_val_samples + 1), :]
    X_test = X[num_of_train_samples + num_of_val_samples:, :]
    Y_test = Y[num_of_train_samples + num_of_val_samples:, :]

    # # fix random seed for reproducibility
    seed = 7
    np.random.seed(seed)

    # # standardize data and store mean and scale to file
    scaler = StandardScaler().fit(X_train)
    mean_array = scaler.mean_
    scale_array = scaler.scale_
    X_train_transformed = scaler.transform(X_train)
    X_val_transformed = (X_val - mean_array) / (scale_array)
    X_test_transformed = (X_test - mean_array) / (scale_array)

    data_dict = {
        'train_data': (X_train_transformed, Y_train),
        'val_data': (X_val_transformed, Y_val),
        'test_data': (X_test_transformed, Y_test),
        'norm_arrays': (mean_array, scale_array),
        'num_of_features': num_of_features,
    }

    return data_dict

We randomly shuffle the data first.

We first categorize Y values as 'low', 'mid' and 'high'. Then we apply a trick called one-hot encoding to deal with multiple classes. This converts the Y-array to m X 3 from m X 1, with each row consisting of a 3 elements, one for each class. So, an example with value 'low' would now be represented as [1 0 0]. Similarly 'mid' -> [0 1 0] and 'high' -> [0 0 1].

Now we split our examples into training (70%), validation (20%) and test (10%) datasets.

Next, we normalize our training X values using scikit-learn's StandardScaler. This brings all values for our features in the same range with mean for all values being 0. This usually helps any machine learning algorithm perform better.

We store the mean and scale arrays for transforming validation, test as well as any future values we might need to predict for. It is important to normalize the training set separately instead of doing it together with validation data as we don't want any information from our validation set to leak to the training set.

## Undersampling

In [None]:
def apply_undersampling(X, Y, class_num=0, frac_removed=0.4):
    # # print stats before undersampling
    print('number of training samples before undersampling = %s' % X.shape[0])
    print('Y counter before undersampling -')
    print(get_class_counts(Y))

    # # get indices of rows to be removed for majority class
    y_0 = Y[:, class_num]
    low_indices = np.where(y_0 == 1)[0]
    n_removed = int(low_indices.shape[0] * frac_removed)

    # # delected removed rows
    removed = low_indices[0:n_removed]
    Y = np.delete(Y, removed, axis=0)
    X = np.delete(X, removed, axis=0)

    # # print stats after undersampling
    print('number of training samples after undersampling = %s' % X.shape[0])
    print('Y counter after undersampling -')
    print(get_class_counts(Y))

    return X, Y

We randomly remove 40% examples of majority class ('low') from the training dataset to make the dataset more balanced.

## Network definition

In [None]:
def five_layer_nn(num_of_features=0):
    K.set_learning_phase(1)

    # # create model
    model = Sequential()
    model.add(Dense(
        num_of_features,
        input_dim=num_of_features,
        W_regularizer=l1l2(0.01),
        init='normal',
        activation='relu'
    ))
    # model.add(BatchNormalization())
    model.add(Dropout(0.35))
    model.add(Dense(int(num_of_features * 1.5), init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(num_of_features / 2, init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(num_of_features / 4, init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(len(CLASSES), init='normal', activation='softmax'))

    # # define loss optimizer
    adam = Adam(lr=0.0002)

    # # compile model
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy', fbeta_custom])
    return model

Here we define a simple neural network with 3 hidden layers using Keras. We use l1l2 and dropout regularization, softmax activation for output layer (outputs probabilities per class) and cross-entropy loss function. Note that we have already defined a custom f-measure metric (fbeta_custom) with beta=1 to have a single score to judge our model.

The learning rate lr needed to be iterated upon to get good convergence. Too small a value would sometimes make the training stuck in a local minima. Too large a value makes it difficult to get good convergance.

## weight definition

Now we need to optimise the weights for each class to get a good confusion matrix. I started with the ratios of examples for each class (adjusted for undersampling) and iterated as per predictions on validation data. Finally I obtained the following weights.

In [None]:
# define class weight to tackle skew
CLASS_WEIGHT = {
    'forward': {
        0: 1.0,
        1: 7.96 * 0.6 * 0.7,
        2: 18.30 * 0.6 * 0.7 * 0.8,
    },
    'midfielder': {
        0: 1.0,
        1: 13.8 * 0.6 * 0.7,
        2: 21.1 * 0.6 * 0.7 * 0.8,
    },
    'defender': {
        0: 1.0,
        1: 5.58 * 0.6 * 0.7,
        2: 30.71 * 0.6 * 0.7 * 0.8,
    },
    'goalkeeper': {
        0: 1.0,
        1: 4.70 * 0.6 * 0.7 * 1.05,
        2: 27.03 * 0.6 * 0.7 * 0.7,
    },
}

# Putting it all together

I have copied the complete code for prediction below. The main method is train (at the end).

In [None]:
# import random
from collections import Counter
import numpy as np
import pandas as pd
import json
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.regularizers import l1l2
from keras.metrics import fbeta_score
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt
from imblearn.over_sampling import SMOTE    # , ADASYN
from keras.optimizers import Adam

import sys
import os
import errno

sys.setrecursionlimit(10000)

SCRIPT_DIR = os.path.dirname(__file__)


# define fbeta beta fn
def fbeta_custom(x, y):
    return fbeta_score(x, y, beta=1.0)


# # to create dir if dir does not exist
def create_filepath(filename):
    if not os.path.exists(os.path.dirname(filename)):
        try:
            os.makedirs(os.path.dirname(filename))
        except OSError as exc:  # Guard against race condition
            if exc.errno != errno.EEXIST:
                raise


def shuffle_X_Y(a, b):
    assert len(a) == len(b)
    p = np.random.permutation(len(a))
    return a[p], b[p]


CLASSES = [
    'low',
    'mid',
    'high',
]

CLASS_BINS = [
    -10.0,
    4.0,
    8.0,
    100.0,
]

# define class weight to tackle skew
CLASS_WEIGHT = {
    'forward': {
        0: 1.0,
        1: 7.96 * 0.6 * 0.7,
        2: 18.30 * 0.6 * 0.7 * 0.8,
    },
    'midfielder': {
        0: 1.0,
        1: 13.8 * 0.6 * 0.7,
        2: 21.1 * 0.6 * 0.7 * 0.8,
    },
    'defender': {
        0: 1.0,
        1: 5.58 * 0.6 * 0.7,
        2: 30.71 * 0.6 * 0.7 * 0.8,
    },
    'goalkeeper': {
        0: 1.0,
        1: 4.70 * 0.6 * 0.7 * 1.05,
        2: 27.03 * 0.6 * 0.7 * 0.7,
    },
}


def get_class_counts(Y):
    y_1d = np.empty((Y.shape[0], 1), dtype=np.object_)
    i = 0
    for y_entry in Y:
        class_num = np.where(y_entry == 1)[0][0]
        y_1d[i] = CLASSES[class_num]
        i += 1

    y_1d = y_1d.ravel()
    unique, counts = np.unique(y_1d, return_counts=True)
    return dict(zip(unique, counts))


def apply_undersampling(X, Y, class_num=0, frac_removed=0.4):
    # # print stats before undersampling
    print('number of training samples before undersampling = %s' % X.shape[0])
    print('Y counter before undersampling -')
    print(get_class_counts(Y))

    # # get indices of rows to be removed for majority class
    y_0 = Y[:, class_num]
    low_indices = np.where(y_0 == 1)[0]
    n_removed = int(low_indices.shape[0] * frac_removed)

    # # delected removed rows
    removed = low_indices[0:n_removed]
    Y = np.delete(Y, removed, axis=0)
    X = np.delete(X, removed, axis=0)

    # # print stats after undersampling
    print('number of training samples after undersampling = %s' % X.shape[0])
    print('Y counter after undersampling -')
    print(get_class_counts(Y))

    return X, Y


def apply_smote(X, Y, ratio=1.0):
    print('number of training samples before smote = %s' % X.shape[0])
    sm = SMOTE(kind='regular', ratio=ratio)
    # ada = ADASYN(ratio=ratio)

    # convert y to 1d
    y_1d = np.empty((Y.shape[0], 1), dtype=np.object_)
    i = 0
    for y_entry in Y:
        class_num = np.where(y_entry == 1)[0][0]
        y_1d[i] = CLASSES[class_num]
        i += 1

    y_1d = y_1d.ravel()
    print(y_1d.shape)
    print(y_1d)
    print('Y counter before SMOTE -')
    unique, counts = np.unique(y_1d, return_counts=True)
    print(dict(zip(unique, counts)))
    X, Y = sm.fit_sample(X, y_1d)
    # X, Y = ada.fit_sample(X, y_1d)
    print('Y counter after 1st SMOTE -')
    print(Counter(Y))
    X, Y = sm.fit_sample(X, Y)
    print('Y counter after 2nd SMOTE -')
    print(Counter(Y))

    # one hot encode again
    i = 0
    y_temp = np.zeros(shape=(len(Y), len(CLASSES)))
    for y_entry in Y:
        class_num = CLASSES.index(y_entry)
        y_temp[i][class_num] = 1
        i = i + 1
    Y = y_temp
    print('number of training samples after smote = %s' % X.shape[0])
    return X, Y


# # method to create confusion matrix
def get_confusion_matrix_one_hot(model_results, truth):
    '''model_results and truth should be for one-hot format, i.e, have >= 2 columns,
    where truth is 0/1, and max along each row of model_results is model result
    '''
    assert model_results.shape == truth.shape
    num_outputs = truth.shape[1]
    confusion_matrix = np.zeros((num_outputs, num_outputs), dtype=np.int32)
    predictions = np.argmax(model_results, axis=1)
    assert len(predictions) == truth.shape[0]

    for actual_class in range(num_outputs):
        idx_examples_this_class = truth[:, actual_class] == 1
        prediction_for_this_class = predictions[idx_examples_this_class]
        for predicted_class in range(num_outputs):
            count = np.sum(prediction_for_this_class == predicted_class)
            confusion_matrix[actual_class, predicted_class] = count
    assert np.sum(confusion_matrix) == len(truth)
    assert np.sum(confusion_matrix) == np.sum(truth)
    return confusion_matrix


# # define models
def three_layer_nn(num_of_features=0):
    # # create model
    K.set_learning_phase(1)
    model = Sequential()
    model.add(Dense(
        num_of_features,
        input_dim=num_of_features,
        W_regularizer=l1l2(0.1),
        init='normal',
        activation='relu'
    ))
    # # add dropout to regularize
    model.add(Dropout(0.35))
    # # add hidden layer
    model.add(Dense(num_of_features, init='normal', activation='relu'))
    # # add output layer
    model.add(Dense(len(CLASSES), init='normal', activation='softmax'))
    # # define loss optimiser
    adam = Adam(lr=0.0002)
    # # compile model
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy', fbeta_custom])
    return model


def four_layer_nn(num_of_features=0):
    # # create model
    K.set_learning_phase(1)
    model = Sequential()
    model.add(Dense(
        num_of_features,
        input_dim=num_of_features,
        W_regularizer=l1l2(0.1),
        init='normal',
        activation='relu'
    ))
    # model.add(BatchNormalization())
    model.add(Dropout(0.35))
    model.add(Dense(int(num_of_features * 1.5), init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(num_of_features / 4, init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(len(CLASSES), init='normal', activation='softmax'))
    # # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', fbeta_custom])
    return model


def five_layer_nn(num_of_features=0):
    K.set_learning_phase(1)

    # # create model
    model = Sequential()
    model.add(Dense(
        num_of_features,
        input_dim=num_of_features,
        W_regularizer=l1l2(0.01),
        init='normal',
        activation='relu'
    ))
    # model.add(BatchNormalization())
    model.add(Dropout(0.35))
    model.add(Dense(int(num_of_features * 1.5), init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(num_of_features / 2, init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(num_of_features / 4, init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(len(CLASSES), init='normal', activation='softmax'))

    # # define loss optimizer
    adam = Adam(lr=0.0002)

    # # compile model
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy', fbeta_custom])
    return model


model_dict = {
    'three_layer_nn': three_layer_nn,
    'four_layer_nn': four_layer_nn,
    'five_layer_nn': five_layer_nn,
}


def preprocess(
    df,
    trial,
):
    # # shuffle dataframe rows
    if trial:
        frac = 0.25
    else:
        frac = 1
    df = df.sample(frac=frac).reset_index(drop=True)
    dataset = df.values

    # # split into X and Y
    num_of_features = dataset.shape[1] - 1
    X = dataset[:, 0:num_of_features]
    Y = dataset[:, num_of_features]

    # # bin data into categories
    bins = CLASS_BINS
    bin_names = CLASSES
    categories = pd.cut(Y, bins, labels=bin_names)

    # # one hot encode
    Y = pd.get_dummies(categories).values

    # # split into training, test and validation sets
    num_of_samples = X.shape[0]
    val_ratio = 0.2
    test_ratio = 0.1
    train_ratio = 1 - val_ratio - test_ratio
    num_of_val_samples = int(val_ratio * num_of_samples)
    num_of_train_samples = int(train_ratio * num_of_samples)

    X_train = X[0:(num_of_train_samples + 1), :]
    Y_train = Y[0:(num_of_train_samples + 1), :]
    X_val = X[num_of_train_samples:(num_of_train_samples + num_of_val_samples + 1), :]
    Y_val = Y[num_of_train_samples:(num_of_train_samples + num_of_val_samples + 1), :]
    X_test = X[num_of_train_samples + num_of_val_samples:, :]
    Y_test = Y[num_of_train_samples + num_of_val_samples:, :]

    # # fix random seed for reproducibility
    seed = 7
    np.random.seed(seed)

    # # standardize data and store mean and scale to file
    scaler = StandardScaler().fit(X_train)
    mean_array = scaler.mean_
    scale_array = scaler.scale_
    X_train_transformed = scaler.transform(X_train)
    X_val_transformed = (X_val - mean_array) / (scale_array)
    X_test_transformed = (X_test - mean_array) / (scale_array)

    # # return data
    data_dict = {
        'train_data': (X_train_transformed, Y_train),
        'val_data': (X_val_transformed, Y_val),
        'test_data': (X_test_transformed, Y_test),
        'norm_arrays': (mean_array, scale_array),
        'num_of_features': num_of_features,
    }

    return data_dict


def save_model_files(
    model_name,
    model,
    position,
    mean_array,
    scale_array
):
    mean_list = [mean for mean in mean_array]
    scale_list = [scale for scale in scale_array]
    # # dump model files
    model_filepath = os.path.join(SCRIPT_DIR, 'dumps/%s/keras_%ss/keras_%ss.json' % (model_name, position, position))
    create_filepath(model_filepath)
    weights_filepath = os.path.join(SCRIPT_DIR, 'dumps/%s/keras_%ss/weights.h5' % (model_name, position))
    create_filepath(weights_filepath)
    mean_filepath = os.path.join(SCRIPT_DIR, 'dumps/%s/keras_%ss/mean.json' % (model_name, position))
    create_filepath(mean_filepath)
    scale_filepath = os.path.join(SCRIPT_DIR, 'dumps/%s/keras_%ss/scale.json' % (model_name, position))
    create_filepath(scale_filepath)

    # model.save(model_filepath)
    model.save_weights(weights_filepath)
    model_json = model.to_json()

    with open(model_filepath, "w+") as f:
        f.write(model_json)
    with open(mean_filepath, 'w+') as f:
        f.write(json.dumps(mean_list))
    with open(scale_filepath, 'w+') as f:
        f.write(json.dumps(scale_list))


def train(
    position='forward',
    model_name='three_layer_nn',
    use_class_weights=False,
    smote=False,
    undersampling=False,
    trial=False,
):
    """Creates a network, trains it and dumps it on disk
    """

    # # load dataset
    data_path = os.path.join(SCRIPT_DIR, 'fpl_%ss.ssv' % (position))
    df = pd.read_csv(data_path, delim_whitespace=True, header=None)

    # # preprocess data
    processed_data = preprocess(df=df, trial=trial)
    X_train_transformed, Y_train = processed_data['train_data']
    X_val_transformed, Y_val = processed_data['val_data']
    X_test_transformed, Y_test = processed_data['test_data']
    mean_array, scale_array = processed_data['norm_arrays']
    num_of_features = processed_data['num_of_features']

    X_train_resampled, Y_train_resampled = X_train_transformed, Y_train

    smote_ratio = 1.0
    if undersampling:
        # # Apply random undersampling for dominant class
        X_train_resampled, Y_train_resampled = apply_undersampling(X_train_resampled, Y_train_resampled)
    if smote:
        # # Apply regular SMOTE
        smote_ratio = 0.5
        X_train_resampled, Y_train_resampled = apply_smote(X_train_resampled, Y_train_resampled, ratio=smote_ratio)

    # # evaluate model with standardized dataset
    CLASS_WEIGHT_UNIFORM = {
        0: 1.,
        1: 1.0,
        2: 1.0,
    }
    if use_class_weights:
        class_weight_dict = CLASS_WEIGHT[position]
    else:
        class_weight_dict = CLASS_WEIGHT_UNIFORM
    print('Class weights - ')
    print(class_weight_dict)
    model_fn = model_dict[model_name]
    model = model_fn(num_of_features=num_of_features)
    # print(Y_train_resampled[0:10, :])
    history = model.fit(
        X_train_resampled,
        Y_train_resampled,
        nb_epoch=50,
        batch_size=10,
        verbose=0,
        # validation_split=0.2,
        validation_data=(X_val_transformed, Y_val),
        class_weight=class_weight_dict,
    )
    val_predictions_keras = model.predict(X_val_transformed)
    test_predictions_keras = model.predict(X_test_transformed)
    # print(test_predictions_keras[0:10])

    # # generate and print confusion matrix for validation and test data
    val_conf_matrix = get_confusion_matrix_one_hot(val_predictions_keras, Y_val)
    print('validation confusion matrix - ')
    print(val_conf_matrix)
    conf_matrix = get_confusion_matrix_one_hot(test_predictions_keras, Y_test)
    print('test confusion matrix - ')
    print(conf_matrix)

    if not trial:
        model_name = model_fn.__name__
        save_model_files(model_name, model, position, mean_array, scale_array)


# Let's run it!

Let's try to run the model for midfielders and see how it goes.

In [1]:
from analysis.lib_1 import classifier
classifier.train(position='midfielder', use_class_weights=True, model_name='five_layer_nn', undersampling=True)

Using TensorFlow backend.


number of training samples before undersampling = 6785
Y counter before undersampling -
{'high': 298, 'low': 6048, 'mid': 439}
number of training samples after undersampling = 4366
Y counter after undersampling -
{'high': 298, 'low': 3629, 'mid': 439}
Class weights - 
{0: 1.0, 1: 5.795999999999999, 2: 7.089600000000001}
validation confusion matrix - 
[[1417  172  154]
 [  68   25   34]
 [  25   14   30]]
test confusion matrix - 
[[687  91  79]
 [ 30  14  31]
 [  9   6  23]]


Confusion matrices shown above are of the following format - 

                    (predicted 'low)    (predicted 'mid')    (predicted 'high')
    (actual 'low')      1417                172                  154
    (actual 'mid')      68                  25                   34
    (actual ''high')    25                  14                   30

First thing to note here is that the confusion matrix ratios are similar for validation as well as test cases. This is good, it means our model works well on previously unseen data.

Another interesting point is that our model is having a hard time predicting the 'mid' scores. I suspect that is because of their being similar likelihood of these cases being either 'low' or 'high'. I will look into this in the future to see if this can be improved.

But the great thing about this matrix is that it show our model is pretty good at predicting 'low' and 'high' classes! We are off to a good start :)

# Feature importance

Work in progress

# Visualization and examples

Work in progress

# Future scope of work

* Try anomaly detection algorithms and see how they work to predict high scoring players

* See if I can manually clean data better instead of using random undersampling

* See if I can engineer the features better to enhance performance

* Investigate why oversampling was not working