Welcome! This notebook walks through the process I followed while building models to predict a player's fantasy league score. Let me start by describing the data.

I have data in the form X[0], X[1], X[2], ... , X[35], Y for each player in the Premier League for the 2013-14 and 2016-17 seasons. I have cleaned the data for some fields, but I will skip describing that process for now. A single entry in dictionary form looks like this -

In [None]:
{
    'X': {
        'assists_per_match_played': 0.1111111111111111,
        'avg_assists_form': 0.0,
        'avg_bps_form': 20.0,
        'avg_clean_sheets_form': 0.0,
        'avg_goals_conceded_form': 2.0,
        'avg_goals_scored_form': 0.3333333333333333,
        'avg_minutes_form': 86.66666666666667,
        'avg_net_transfers_form': 103.33333333333333,
        'avg_points_form': 4.0,
        'avg_red_cards_form': 0.0,
        'avg_saves_form': 0.0,
        'avg_yellow_cards_form': 0.0,
        'bps_per_match_played': 10.777777777777779,
        'clean_sheets_per_match_played': 0.07407407407407407,
        'goals_conceded_per_match_played': 1.0,
        'goals_scored_per_match_played': 0.037037037037037035,
        'is_at_home': 1,
        'last_season_points_per_minutes': 0.03913894324853229,
        'minutes_per_match_played': 57.925925925925924,
        'net_transfers_per_match_played': 409.3703703703704,
        'opponent_goals_conceded_per_match': 1.4324324324324325,
        'opponent_goals_scored_per_match': 1.5405405405405406,
        'opponent_points_per_match': 1.6486486486486487,
        'opponent_points_per_match_last_season': 1.8157894736842106,
        'points_per_match_played': 2.0,
        'price': 4.4,
        'price_change_form': 0.0,
        'red_cards_per_match_played': 0.037037037037037035,
        'saves_per_match_played': 0.0,
        'team_goals_conceded_per_match': 1.3783783783783783,
        'team_goals_scored_per_match': 0.8918918918918919,
        'team_points_per_match': 0.918918918918919,
        'team_points_per_match_last_season': 0.9736842105263158,
        'yellow_cards_per_match_played': 0.07407407407407407
    },
    'Y': {u'points_scored': 2.0}
}

# A bit about earlier trials

Before I get into the details of what worked, let me describe some initial attempts that didn't.

The primary problem with our data is that it is highly imbalanced. Most players score low, with only a few instances of players who score above 5 pts.

Also, since the characteristics of each position are different, I decided to train the models separately for forwards, midfielders, defenders and goalkeepers.

Things I tried before settling on the solution -

    * Linear Regression - Simply wasn't able to give any meaningful predictions. I suspect this is because of high non-linearity and imbalance

    * Regression using neural network - Just couldn't get it to work, all predictions seemed to be either concentrated in the low points zone or were very random

This is when I decided to convert the problem into a classification one. I experimented with multiple bin sizes and finally came to this categorisation -
    a. points scored less than 5 - 'low'
    b. between 5 - 8 - 'medium'
    c. greater than 8 - 'high'
For midfielders, this gives a 'low' : 'medium' : 'high' ratio of 21.1 : 1.53 : 1.0
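The binning above can be sketched with `pd.cut`. The bin edges here are my assumption (the real values live in `CLASS_BINS`, which is not shown in this notebook); they treat scores as integers, so 'low' is 4 or less, 'medium' is 5 to 8 and 'high' is 9 or more. `EXAMPLE_BINS` and `EXAMPLE_CLASSES` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Assumed bin edges: pd.cut uses right-closed intervals by default, so
# (-inf, 4] is 'low', (4, 8] is 'medium' and (8, inf) is 'high'.
EXAMPLE_BINS = [-np.inf, 4, 8, np.inf]
EXAMPLE_CLASSES = ['low', 'medium', 'high']

points = np.array([-1, 2, 4, 5, 8, 9, 15])
categories = pd.cut(points, EXAMPLE_BINS, labels=EXAMPLE_CLASSES)

binned = list(categories)
print(binned)
```

Negative scores are possible in fantasy football, which is why the lowest edge is left open here.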

It is already clear that our data is heavily skewed in favor of 'low' points. Accuracy is a bad metric for testing a model in such a case: if I predict all outcomes to be 'low', I immediately get an accuracy of 89%, but that is quite meaningless in terms of prediction. We will use a confusion matrix to tune our model instead, as it gives a much better picture of how the model is performing. So next I tried -
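To make the 89% figure concrete, here is a small sketch with made-up class counts in the 21.1 : 1.53 : 1.0 ratio (the counts themselves are illustrative, not the real dataset sizes). A "model" that always answers 'low' gets high accuracy, while the confusion matrix shows that it never finds a single 'medium' or 'high' example.

```python
import numpy as np

# Illustrative class counts in the 21.1 : 1.53 : 1.0 ratio quoted above.
counts = [2110, 153, 100]

y_true = np.repeat(np.arange(3), counts)  # 0 = low, 1 = medium, 2 = high
y_pred = np.zeros_like(y_true)            # always predict 'low'

accuracy = np.mean(y_true == y_pred)

# confusion[i, j] = number of samples of true class i predicted as class j
confusion = np.zeros((3, 3), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

print('accuracy = %.3f' % accuracy)
print(confusion)
```

Every row of the matrix puts all of its samples in the first column, which is exactly the failure mode that accuracy hides.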

    * 3 layer neural network (3nn) - As expected, it breaks down and predicts all outcomes to be 'low'. We need to somehow offset the imbalance in our data.

    * 3nn with oversampling using SMOTE - One of the techniques to overcome imbalance is oversampling. Here we create interpolated copies of minority class data points to balance the dataset. SMOTE is one such algorithm to create these artificial data points. Unfortunately, this didn't work in our case (I still need to investigate why).

    * 3nn with undersampling - Contrary to oversampling, undersampling creates balance in dataset by removing datapoints from the majority class. This though has a clear disadvantage of losing out on information. This alone also didn't work well on the model.

    * 3nn with class weights - Another way to tackle imbalance is by assigning additional costs or weights to each class in the loss calculation. This forces the loss function to assign more importance to the minority classes. I started with weights as ratios of datapoints for each class and gradually iterated to a value which was giving better confusion matrix for our validation data. Finally, something that seemed to work!

    * 5nn with class weights - Deeper networks are often better at defining complex relationships to identify minority classes. I started increasing the number of hidden layers and found that a 5 layered fully connected network was working well for our problem.
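To illustrate why class weights help, here is a minimal NumPy sketch of a class-weighted cross-entropy. It is not Keras's exact implementation, and the weights (8.0 for the minority classes) are hypothetical values chosen only for the example. It shows that an 'all low' prediction becomes much more expensive once minority classes carry extra weight.

```python
import numpy as np

def weighted_cross_entropy(y_true, y_prob, class_weights):
    """Mean categorical cross-entropy with a per-class weight on each sample."""
    y_prob = np.clip(y_prob, 1e-7, 1.0)
    per_sample = -np.sum(y_true * np.log(y_prob), axis=1)
    # each sample takes the weight of its true class
    sample_weights = y_true @ class_weights
    return np.mean(sample_weights * per_sample)

# 8 'low', 1 'medium' and 1 'high' one-hot labels
y_true = np.eye(3)[[0] * 8 + [1, 2]]
# a model that is confidently 'low' for every sample
all_low = np.tile([0.9, 0.05, 0.05], (10, 1))

unweighted = weighted_cross_entropy(y_true, all_low, np.array([1.0, 1.0, 1.0]))
weighted = weighted_cross_entropy(y_true, all_low, np.array([1.0, 8.0, 8.0]))
print('unweighted = %.3f, weighted = %.3f' % (unweighted, weighted))
```

The larger the gap between these two numbers, the less attractive the 'all low' solution is to the optimizer.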

The 5nn with weights was converging to a solution, but I realized that, depending on the learning rate, it would sometimes converge to the local minimum of an 'all low' solution, for which the cost was only slightly higher than for our desired model. It struck me that it might be a good idea to undersample the data a bit to increase this difference in loss and stabilize training.

This brings us to the model which I am currently using to predict results, a 5-layered neural network with optimized class weights and 40% undersampling for majority class.

# Model description

## Preprocessing data


In [None]:
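import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# CLASS_BINS (bin edges) and CLASSES (bin labels) are constants defined elsewhere


def preprocess(
    df,
    trial,
):
    # # shuffle dataframe rows (a trial run keeps only a random 25% of them)
    if trial:
        frac = 0.25
    else:
        frac = 1
    df = df.sample(frac=frac).reset_index(drop=True)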
    dataset = df.values
    # # split into X and Y
    num_of_features = dataset.shape[1] - 1
    X = dataset[:, 0:num_of_features]
    Y = dataset[:, num_of_features]

    # # bin data into categories
    bins = CLASS_BINS
    bin_names = CLASSES
    categories = pd.cut(Y, bins, labels=bin_names)
    
    # # one hot encode
    Y = pd.get_dummies(categories).values

    # # split into training, test and validation sets
    num_of_samples = X.shape[0]
    val_ratio = 0.2
    test_ratio = 0.1
    train_ratio = 1 - val_ratio - test_ratio
    num_of_val_samples = int(val_ratio * num_of_samples)
    num_of_train_samples = int(train_ratio * num_of_samples)

    # # slices are half-open, so consecutive sets share no rows
    X_train = X[0:num_of_train_samples, :]
    Y_train = Y[0:num_of_train_samples, :]
    X_val = X[num_of_train_samples:(num_of_train_samples + num_of_val_samples), :]
    Y_val = Y[num_of_train_samples:(num_of_train_samples + num_of_val_samples), :]
    X_test = X[num_of_train_samples + num_of_val_samples:, :]
    Y_test = Y[num_of_train_samples + num_of_val_samples:, :]

    # # fix random seed for reproducibility of later steps
    # # (the shuffle above has already run, so it is not affected)
    seed = 7
    np.random.seed(seed)

    # # standardize with training-set statistics only; mean and scale are
    # # returned so they can be stored and reused at prediction time
    scaler = StandardScaler().fit(X_train)
    mean_array = scaler.mean_
    scale_array = scaler.scale_
    X_train_transformed = scaler.transform(X_train)
    X_val_transformed = scaler.transform(X_val)
    X_test_transformed = scaler.transform(X_test)

    data_dict = {
        'train_data': (X_train_transformed, Y_train),
        'val_data': (X_val_transformed, Y_val),
        'test_data': (X_test_transformed, Y_test),
        'norm_arrays': (mean_array, scale_array),
        'num_of_features': num_of_features,
    }

    return data_dict

We start by randomly shuffling the data (a trial run keeps only a random 25% of the rows).

Next we categorize the Y values as 'low', 'mid' and 'high', and then one-hot encode them to handle multiple classes. This converts the Y array from m x 1 to m x 3, with each row consisting of 3 elements, one per class. An example with value 'low' is now represented as [1 0 0]; similarly 'mid' -> [0 1 0] and 'high' -> [0 0 1].
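A minimal sketch of the one-hot step on four hand-written labels. The class names are taken from the text; the real labels come from `CLASSES`, which is not shown here.

```python
import pandas as pd

classes = ['low', 'mid', 'high']
labels = pd.Categorical(['low', 'high', 'mid', 'low'], categories=classes)

# dtype=int keeps the output as 0/1 (recent pandas versions default to booleans)
one_hot = pd.get_dummies(labels, dtype=int).values
print(one_hot)
```

The column order follows the order of the categories, so column 0 is 'low', column 1 is 'mid' and column 2 is 'high'.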

Now we split our examples into training (70%), validation (20%) and test (10%) datasets.

Next, we standardize our training X values using scikit-learn's StandardScaler. This rescales each feature to zero mean and unit variance, so that all features are on a comparable scale. This usually helps many machine learning algorithms, neural networks in particular, perform better.

We store the mean and scale arrays for transforming validation, test as well as any future values we might need to predict for. It is important to normalize the training set separately instead of doing it together with validation data as we don't want any information from our validation set to leak to the training set.
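The idea of reusing the training statistics can be sketched in plain NumPy. The data here is synthetic, and `X_new` stands in for any later validation, test or prediction input.

```python
import numpy as np

rng = np.random.RandomState(7)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 3))

# statistics come from the training set only
mean_array = X_train.mean(axis=0)
scale_array = X_train.std(axis=0)  # population std (ddof=0), like StandardScaler

X_train_std = (X_train - mean_array) / scale_array
# new data is transformed with the stored training statistics, never its own
X_new_std = (X_new - mean_array) / scale_array

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

The transformed training set has zero mean and unit variance per feature. The new data generally will not, and that is expected.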

## Undersampling

In [None]:
def apply_undersampling(X, Y, class_num=0, frac_removed=0.4):
    # # print stats before undersampling
    print('number of training samples before undersampling = %s' % X.shape[0])
    print('Y counter before undersampling -')
    print(get_class_counts(Y))

    # # get indices of rows to be removed for majority class
    y_0 = Y[:, class_num]
    low_indices = np.where(y_0 == 1)[0]
    n_removed = int(low_indices.shape[0] * frac_removed)

    # # delete the rows selected for removal
    removed = low_indices[0:n_removed]
    Y = np.delete(Y, removed, axis=0)
    X = np.delete(X, removed, axis=0)

    # # print stats after undersampling
    print('number of training samples after undersampling = %s' % X.shape[0])
    print('Y counter after undersampling -')
    print(get_class_counts(Y))

    return X, Y

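We remove 40% of the majority class ('low') examples from the training dataset to make it more balanced. The function drops the first 40% of the 'low' rows it finds; because the rows were already shuffled in preprocess, this amounts to a random selection.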
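If the data had not been shuffled beforehand, the rows to drop could be picked at random explicitly. The following is a sketch of that variant, not the function used above; `random_undersample` and its `seed` parameter are hypothetical names.

```python
import numpy as np

def random_undersample(X, Y, class_num=0, frac_removed=0.4, seed=7):
    """Drop a random fraction of the rows whose one-hot label is class_num."""
    rng = np.random.RandomState(seed)
    class_indices = np.where(Y[:, class_num] == 1)[0]
    n_removed = int(class_indices.shape[0] * frac_removed)
    removed = rng.choice(class_indices, size=n_removed, replace=False)
    keep = np.setdiff1d(np.arange(X.shape[0]), removed)
    return X[keep], Y[keep]

# 10 'low' rows followed by 2 'mid' and 2 'high' rows
Y = np.eye(3, dtype=int)[[0] * 10 + [1, 1, 2, 2]]
X = np.arange(14).reshape(-1, 1)

X_small, Y_small = random_undersample(X, Y)
print(Y_small.sum(axis=0))  # class counts after undersampling
```

Only the 'low' column shrinks; the minority classes are left untouched.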

## Network definition

In [None]:
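# Keras 1.x API
from keras import backend as K
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.optimizers import Adam
from keras.regularizers import l1l2


def five_layer_nn(num_of_features=0):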
    K.set_learning_phase(1)

    # # create model
    model = Sequential()
    model.add(Dense(
        num_of_features,
        input_dim=num_of_features,
        W_regularizer=l1l2(0.01),
        init='normal',
        activation='relu'
    ))
    # model.add(BatchNormalization())
    model.add(Dropout(0.35))
    model.add(Dense(int(num_of_features * 1.5), init='normal', activation='relu'))
    # model.add(BatchNormalization())
    # # integer division keeps the layer width an int on Python 3 as well
    model.add(Dense(num_of_features // 2, init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(num_of_features // 4, init='normal', activation='relu'))
    # model.add(BatchNormalization())
    model.add(Dense(len(CLASSES), init='normal', activation='softmax'))

    # # define loss optimizer
    adam = Adam(lr=0.0002)

    # # compile model
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy', fbeta_custom])
    return model

Here we define a simple fully connected network in Keras with five Dense layers: four hidden ReLU layers followed by a softmax output layer, which outputs a probability per class. We use l1l2 weight regularization on the first layer, dropout after it, and the cross-entropy loss function. Note that we have already defined a custom F-measure metric (fbeta_custom) with beta=1 to have a single score to judge our model.
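The implementation of fbeta_custom is not shown in this notebook. As a rough illustration of what an F-measure with beta=1 captures, here is a NumPy sketch that computes a per-class F-beta score from a confusion matrix. The function name and the example matrices are made up, and the real metric may aggregate differently.

```python
import numpy as np

def fbeta_per_class(confusion, beta=1.0):
    """F-beta per class from a confusion matrix (rows = true, cols = predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion).copy()
    predicted = confusion.sum(axis=0)
    actual = confusion.sum(axis=1)
    # classes that are never predicted (or never occur) get a score of 0
    precision = np.divide(true_pos, predicted,
                          out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual,
                       out=np.zeros_like(true_pos), where=actual > 0)
    denom = beta ** 2 * precision + recall
    return np.divide((1 + beta ** 2) * precision * recall, denom,
                     out=np.zeros_like(true_pos), where=denom > 0)

mixed = [[80, 15, 5],
         [4, 5, 1],
         [2, 2, 6]]
all_low = [[10, 0, 0],
           [1, 0, 0],
           [1, 0, 0]]

scores_mixed = fbeta_per_class(mixed)
scores_all_low = fbeta_per_class(all_low)
print(scores_mixed.round(3), scores_all_low.round(3))
```

Unlike accuracy, the 'all low' predictor scores zero on both minority classes here.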

The learning rate lr needed several iterations of tuning to get good convergence. Too small a value would sometimes leave the training stuck in a local minimum. Too large a value makes it difficult to get good convergence.

## Weight definition

Now we need to optimise the weights for each class to get a good confusion matrix. I started with the ratios of examples for each class (adjusted for undersampling) and iterated as per predictions on validation data.
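As a sketch of that starting point, the snippet below derives inverse-frequency weights from class counts after 40% undersampling of the majority class. The counts are illustrative values in the midfielder ratio quoted earlier, `initial_class_weights` is a hypothetical helper, and the final weights used by the model were tuned by hand from here.

```python
import numpy as np

def initial_class_weights(class_counts, majority_class=0, frac_removed=0.4):
    """Inverse-frequency weights after undersampling; the largest class gets 1.0."""
    counts = np.asarray(class_counts, dtype=float).copy()
    counts[majority_class] -= int(counts[majority_class] * frac_removed)
    weights = counts.max() / counts
    return {i: float(w) for i, w in enumerate(weights)}

# illustrative counts for 'low', 'medium', 'high' (21.1 : 1.53 : 1.0)
weights = initial_class_weights([2110, 153, 100])
print(weights)
```

A dictionary of this shape, mapping class index to weight, is what the `class_weight` argument of Keras's `model.fit` accepts.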