# Age of Empires 2 Player Random Forest Model

This model ranks individual players in team games. It lets us balance teams better by estimating the probability that a given team wins before we actually play.

Disclaimer: I am not a data scientist. Just a guy with too much free time.

## Todo:
- Create test cases
- Load data from Google Sheet instead of local CSV
- Determine what EDA should be done
- Fix GridSearchCV to LogisticRegression import
- Explore adding a time component to factor in player improvement
- Determine how to better input data for predicting
- Build other classifiers

## Import dependencies

In [1]:
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
#from sklearn import metrics
import numpy as np

## Import Data

In [2]:
df = pd.read_csv("sample_data.csv")

# Designate all columns that are not `Outcome` as features and `Outcome` as target
X = df.loc[:, df.columns != 'Outcome']
y = df.Outcome

df.head(2)

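|   | Shaq | Gray | Rushi | Marc | Peter | Pat | Sam | Ori | Vic | Ardy | Chad | Pat_Jr | Pat_Jr_Jr | Matt_M | Ben | Mikey | Evan | Medium_AI | Extra_Team | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | -1 | -1 | 1 | 0 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | -1 |
| 1 | 1 | 0 | -1 | 0 | -1 | 1 | -1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |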


## Explore Data
This is where I should explore data. I haven't done any EDA since I created this dataset.

## Split data
Normally, I would split the data into a training set and a validation set, and use the validation set to check the accuracy of the best tuned model that comes out of cross-validation. However, we are working with a really small dataset. Rather than hold out data for validation, we will assess the model's performance through the out-of-sample cross-validation results.

## Double data
Since the assignment of teams to sides is arbitrary, we want to ensure that the dataset is balanced. For example, when I record data, I almost always put myself on the home team (coded as `1`). A model with an intercept term could mitigate this by dropping the intercept, but a random forest has no intercept term, so we double the dataset by inverting all the records and concatenating them to the original dataset.

Doubling should happen after splitting, so if we held out a validation set we would need to double the training and validation sets separately. Since we do not split here, we double the whole dataset. We use helper functions for readability.
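As a small illustration of what doubling does, the sketch below mirrors a made-up two-game frame and checks that every column, including `Outcome`, sums to zero afterwards. The player names and values are hypothetical and are not taken from `sample_data.csv`.

```python
import pandas as pd

# Hypothetical two-game frame: 1 = home team, -1 = away team, 0 = did not play.
games = pd.DataFrame({
    "Player_A": [1, 1],
    "Player_B": [-1, 0],
    "Player_C": [0, -1],
    "Outcome": [1, -1],
})

# Mirror every game by flipping all signs, then stack original and mirror.
doubled = pd.concat([games, games.multiply(-1)], ignore_index=True)

# After doubling, each column sums to zero, so neither side is favoured on average.
column_sums = doubled.sum()
print(doubled)
print(column_sums)
```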

In [3]:
def invert_dataframe(original_dataframe):
    """Inverts the dataframe by simply multiplying all values by -1.

    Args:
        original_dataframe (df): The dataframe to be inverted.

    Returns:
        inverted_dataframe (df): The inverted dataframe.

    """
    inverted_dataframe = original_dataframe.multiply(-1)
    return inverted_dataframe


def combine_dataframe(first_dataframe, second_dataframe):
    """Combines the dataframes. Assumes that both dataframes have the same columns

    Args:
        first_dataframe (df): The first dataframe to be combined.
        second_dataframe (df): The second dataframe to be combined.

    Returns:
        combined_dataframe (df): The combined dataframe.

    """
    combined_dataframe = pd.concat([first_dataframe, second_dataframe])
    return combined_dataframe


def invert_and_combine(original_dataframe):
    """Inverts the dataframe and combines it with the original. Also works on a Series such as `y`.

    Args:
        original_dataframe (df): The dataframe to be inverted and combined with the original.

    Returns:
        new_dataframe (df): The combined dataframe.

    """
    inverted_dataframe = invert_dataframe(original_dataframe)
    new_dataframe = combine_dataframe(original_dataframe, inverted_dataframe)
    return new_dataframe

In [4]:
X = invert_and_combine(X)
y = invert_and_combine(y)
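One caveat: because `X` and `y` are doubled before cross-validation, a game and its mirror image can end up in different folds. A possible way around this, sketched here on synthetic data, is to give each game and its mirror the same group id and split with scikit-learn's `GroupKFold`. The arrays and the `groups` variable are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in for the real data: 6 games, 3 player columns.
rng = np.random.default_rng(0)
X_games = rng.integers(-1, 2, size=(6, 3))
y_games = rng.choice([-1, 1], size=6)

# Double the data and give each game and its mirror the same group id.
X_doubled = np.vstack([X_games, -X_games])
y_doubled = np.concatenate([y_games, -y_games])
groups = np.concatenate([np.arange(6), np.arange(6)])

# GroupKFold never places rows from one group in both the train and test fold.
cv = GroupKFold(n_splits=3)
leaks = 0
for train_idx, test_idx in cv.split(X_doubled, y_doubled, groups):
    leaks += len(set(groups[train_idx]) & set(groups[test_idx]))
print("Games shared between train and test folds:", leaks)
```

In the searches below, the same idea would presumably mean passing `cv=GroupKFold(n_splits=3)` to the search object and `groups=groups` to its `fit` call; I have not run that against this dataset.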

## Cross Validation
We will use 3-fold cross-validation to determine the optimal hyperparameters for the random forest. We first use a randomized search to narrow down the hyperparameters, then run a grid search over that narrower range. [Credit for this code and approach](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0)

In [5]:
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
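# For classifiers 'auto' means the same as 'sqrt'. 'auto' was removed in
# scikit-learn 1.3, so drop it from this list on newer versions.
max_features = ['auto', 'sqrt']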
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
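To get a feel for why a randomized search comes first, the sketch below counts how many combinations the grid above contains. It rebuilds the list lengths with plain `range` calls instead of NumPy; the dictionary name `random_grid_sizes` is only used for this illustration.

```python
from math import prod

# Number of candidate values per hyperparameter in the random grid above.
random_grid_sizes = {
    "n_estimators": len(range(200, 2001, 200)),   # 200, 400, ..., 2000
    "max_features": 2,                            # 'auto', 'sqrt'
    "max_depth": len(range(10, 111, 10)) + 1,     # 10, 20, ..., 110 plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}

total_combinations = prod(random_grid_sizes.values())
print(total_combinations)  # 4320
```

With `n_iter = 100`, the randomized search therefore tries roughly 2% of the full grid.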

In [6]:
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator = rf, 
                               param_distributions = random_grid, 
                               n_iter = 100, cv = 3, verbose=2, 
                               random_state=0, n_jobs = -1)

rf_random.fit(X,y)

print("Best model according to random search: {0} using {1}".format(
    round(rf_random.best_score_, 2), rf_random.best_params_))

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   49.8s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  1.5min finished


Best model according to random search: 0.54 using {'n_estimators': 400, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': 70, 'bootstrap': False}


## Cross Validation - Grid Search
Now that we have narrowed down the parameters, we can run a more targeted Grid Search. Since this is a brute force method, it takes a while to run on my quad core 16GB laptop. We will use parameters that are close to the best random search parameters.

In [7]:
param_grid = {
    'bootstrap': [False],
    'max_depth': [None, 10, 20, 30],
    'max_features': ['auto'],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 4, 6],
    'n_estimators': [100, 200, 300, 400, 500]
}

In [8]:
# Reuse the `rf` estimator defined for the random search
grid = GridSearchCV(estimator=rf, param_grid=param_grid,
                    cv=3, n_jobs=-1, verbose=1, scoring='accuracy')
grid_models = grid.fit(X, y)

print("Best model according to grid search: {0} using {1}".format(
    round(grid_models.best_score_, 2), grid_models.best_params_))

Fitting 3 folds for each of 180 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done 216 tasks      | elapsed:   17.5s
[Parallel(n_jobs=-1)]: Done 466 tasks      | elapsed:   40.0s
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:   45.9s finished


Best model according to grid search: 0.54 using {'bootstrap': False, 'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}


In [9]:
change_in_accuracy = grid_models.best_score_ - rf_random.best_score_

print("Difference in accuracy between grid and random: {0}".format(round(change_in_accuracy, 2)))

Difference in accuracy between grid and random: 0.0


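The best model from grid search reaches only 54% accuracy on the test folds. Because the doubled dataset contains exactly as many wins as losses, guessing would score 50%, so this model is barely better than guessing.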
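To put the 54% in context, here is a rough sketch of a 95% interval for an accuracy estimate, using the normal approximation to the binomial. The game counts in the loop are hypothetical placeholders, since the dataset size is not stated here. Mirrored rows are not independent observations, so the number of unique games is probably the fairer value to plug in.

```python
import math

def accuracy_interval(accuracy, n_games, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_games games."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_games)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Hypothetical game counts; substitute the real number of recorded games.
for n_games in (50, 100, 500):
    low, high = accuracy_interval(0.54, n_games)
    print(n_games, round(low, 3), round(high, 3))
```

For all three placeholder sizes the interval still contains 0.5, which is consistent with the "barely better than guessing" reading.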

## Create final model
We will re-fit the best performing random forest to the whole data set. We do not have much confidence in this model because of how poorly it performed in cross validation.

In [10]:
final_model_params = grid_models.best_params_

In [11]:
# Unpack every tuned parameter, including min_samples_split, into the constructor
final_model_with_all_data = RandomForestClassifier(**final_model_params)

final_model_with_all_data = final_model_with_all_data.fit(X, y)
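As a possible shortcut, `GridSearchCV` with its default `refit=True` already refits the best parameter combination on all the data passed to `fit`, and exposes the result as `best_estimator_`. The sketch below shows this on synthetic stand-in data; the variable names and the tiny parameter grid are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook would pass its own X and y instead.
rng = np.random.default_rng(0)
X_demo = rng.integers(-1, 2, size=(60, 4))
y_demo = rng.choice([-1, 1], size=60)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "min_samples_split": [2, 4]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already fitted on every
# row passed to fit(), using best_params_.
final_model = search.best_estimator_
print(search.best_params_)
print(final_model.predict_proba(X_demo[:1]))
```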

We will now check the feature importances, which tell us which features matter most when predicting the game outcome (win or loss). We could build a slimmed-down model using only the variables with high importance. While the importances can be ordered, none of the values is very high: they sum to 1 across all features, so a single feature could score at most 1. In general, most of the features have similar importance, and it is not really worth creating a new model from only the most important ones.

In [12]:
importances = list(final_model_with_all_data.feature_importances_)

# List of tuples with variable and importance
feature_list = list(X.columns)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

for pair in feature_importances:
    print('Feature: {:20} Importance: {}'.format(*pair))

Feature: Rushi                Importance: 0.1
Feature: Marc                 Importance: 0.1
Feature: Peter                Importance: 0.1
Feature: Gray                 Importance: 0.08
Feature: Pat                  Importance: 0.08
Feature: Sam                  Importance: 0.08
Feature: Ori                  Importance: 0.08
Feature: Extra_Team           Importance: 0.07
Feature: Shaq                 Importance: 0.05
Feature: Vic                  Importance: 0.05
Feature: Ardy                 Importance: 0.05
Feature: Matt_M               Importance: 0.03
Feature: Medium_AI            Importance: 0.03
Feature: Chad                 Importance: 0.02
Feature: Pat_Jr_Jr            Importance: 0.02
Feature: Ben                  Importance: 0.02
Feature: Evan                 Importance: 0.02
Feature: Pat_Jr               Importance: 0.01
Feature: Mikey                Importance: 0.0
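To read the listing above, it helps to know that a random forest's impurity-based importances are normalised to sum to 1, so with 19 features a perfectly even split would give each about 0.05. The sketch below checks this on synthetic data with 19 columns; the data is random and only the column count mirrors the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 19 columns, matching the feature count above.
rng = np.random.default_rng(0)
X_demo = rng.integers(-1, 2, size=(80, 19))
y_demo = rng.choice([-1, 1], size=80)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Importances sum to 1, so 1 / n_features is the "everything equal" baseline.
total_importance = model.feature_importances_.sum()
uniform_share = 1 / X_demo.shape[1]
print(round(total_importance, 6), round(uniform_share, 3))
```

Against that baseline, the top features at 0.1 carry about twice an even share, which fits the observation that no single player dominates.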


## Predictions
Ultimately, we want to use this model to estimate the probability of a game's outcome. Each value in the input array corresponds to a column of `X`, in order: the first number is Shaq, the second is Gray, and so on.

However, there isn't much to take away from this model. Its accuracy in cross-validation was barely better than guessing.
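Typing 19 numbers by hand is error-prone, so a small helper that builds the input row from player names might help. This is only a sketch: `encode_game` and its arguments are hypothetical names, the column order is copied from the data preview, and treating `Extra_Team` as the side holding the extra player is my reading of the examples in this section.

```python
FEATURE_NAMES = [
    "Shaq", "Gray", "Rushi", "Marc", "Peter", "Pat", "Sam", "Ori", "Vic",
    "Ardy", "Chad", "Pat_Jr", "Pat_Jr_Jr", "Matt_M", "Ben", "Mikey", "Evan",
    "Medium_AI", "Extra_Team",
]

def encode_game(feature_names, away_team, home_team, extra_team=0):
    """Build one input row: -1 for away players, 1 for home players, 0 otherwise."""
    overlap = set(away_team) & set(home_team)
    if overlap:
        raise ValueError(f"Players on both teams: {sorted(overlap)}")
    unknown = (set(away_team) | set(home_team)) - set(feature_names)
    if unknown:
        raise ValueError(f"Unknown players: {sorted(unknown)}")
    row = []
    for name in feature_names:
        if name == "Extra_Team":
            row.append(extra_team)  # assumed: sign of the side with more players
        elif name in away_team:
            row.append(-1)
        elif name in home_team:
            row.append(1)
        else:
            row.append(0)
    return row

marc_vs_rushi = encode_game(FEATURE_NAMES, away_team=["Marc"], home_team=["Rushi"])
two_vs_one = encode_game(FEATURE_NAMES, ["Shaq", "Gray"], ["Rushi"], extra_team=-1)
print(marc_vs_rushi)
print(two_vs_one)
```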

Here we have modeled the probability that Marc (-1) beats Rushi (1).

In [13]:
result = final_model_with_all_data.predict_proba(
    [[0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
print("The probability that Marc beats Rushi is {0}%".format(round(result[0][0]*100, 2)))

The probability that Marc beats Rushi is 29.5%
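The `result[0][0]` lookup works because `predict_proba` orders its columns like `model.classes_`, which is sorted, so column 0 is the `-1` side. A slightly more explicit version, sketched on synthetic data with a hypothetical helper named `win_probability`, looks the column up instead of hard-coding it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; labels use the same -1 / 1 coding as Outcome.
rng = np.random.default_rng(0)
X_demo = rng.integers(-1, 2, size=(60, 4))
y_demo = rng.choice([-1, 1], size=60)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

def win_probability(model, row, side=-1):
    """Probability that the given side (-1 or 1) wins, looked up via classes_."""
    column = list(model.classes_).index(side)
    return model.predict_proba([row])[0][column]

row = [1, -1, 0, 0]
p_away = win_probability(model, row, side=-1)
p_home = win_probability(model, row, side=1)
print(list(model.classes_), p_away, p_home)
```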


Here we have modeled the probability that Shaq (-1) beats Gray (1).

In [14]:
result = final_model_with_all_data.predict_proba(
    [[-1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
print("The probability that Shaq beats Gray is {0}%".format(round(result[0][0]*100, 2)))

The probability that Shaq beats Gray is 0.0%


Here we have modeled the probability that Shaq (-1) and Gray (-1) beat Rushi (1).

In [15]:
result = final_model_with_all_data.predict_proba(
    [[-1, -1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1]])
print("The probability that Shaq and Gray beat Rushi is {0}%".format(round(result[0][0]*100, 2)))

The probability that Shaq and Gray beat Rushi is 100.0%


Here we have modeled the probability that Marc (-1) beats Sam (1).

In [16]:
result = final_model_with_all_data.predict_proba(
    [[0, 0, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
print("The probability that Marc beats Sam is {0}%".format(round(result[0][0]*100, 2)))

The probability that Marc beats Sam is 71.0%


Here we have modeled the probability that Marc (-1) and Sam (-1) beat Rushi (1).

In [17]:
result = final_model_with_all_data.predict_proba(
    [[0, 0, 1, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1]])
print("The probability that Marc and Sam beat Rushi is {0}%".format(round(result[0][0]*100, 2)))

The probability that Marc and Sam beat Rushi is 28.75%


Here we have modeled the probability that Vic (-1) beats Rushi (1).

In [18]:
result = final_model_with_all_data.predict_proba(
    [[0, 0, 1, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
print("The probability that Vic beats Rushi is {0}%".format(round(result[0][0]*100, 2)))

The probability that Vic beats Rushi is 60.5%
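Even with a doubled dataset, a random forest is not guaranteed to give consistent answers when the same matchup is entered with the sides swapped. One possible workaround, sketched below on synthetic data, is to average the direct prediction with the prediction for the mirrored row. The helper name `symmetric_away_win_probability` is hypothetical and this is not something the notebook currently does.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, doubled the same way as in the notebook.
rng = np.random.default_rng(0)
X_games = rng.integers(-1, 2, size=(40, 4))
y_games = rng.choice([-1, 1], size=40)
X_demo = np.vstack([X_games, -X_games])
y_demo = np.concatenate([y_games, -y_games])
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

def symmetric_away_win_probability(model, row):
    """Average the away-side win probability over the row and its mirror."""
    row = np.asarray(row)
    away_col = list(model.classes_).index(-1)
    home_col = list(model.classes_).index(1)
    p_direct = model.predict_proba([row])[0][away_col]
    # In the mirrored row the same players are coded 1, so read the home column.
    p_mirrored = model.predict_proba([-row])[0][home_col]
    return (p_direct + p_mirrored) / 2

p = symmetric_away_win_probability(model, [1, -1, 0, 0])
q = symmetric_away_win_probability(model, [-1, 1, 0, 0])
print(p, q)
```

By construction the two printed values add up to 1, so swapping the sides of a matchup simply swaps the two probabilities.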
