# Age of Empires 2 Player Model

This model is used to rank individual players in team games. It will allow us to better balance teams by estimating the probability that a given team wins before we actually play.

Disclaimer: I am not a data scientist, and I do not fully understand the underlying math.

## Todo:
- Create test cases
- Load data from Google Sheet instead of local CSV
- Determine what EDA should be done
- Fix GridSearchCV to LogisticRegression import
- Explore adding a time component to factor in player improvement
- Determine how to better input data for predicting
- Build other classifiers

## Import dependencies

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import metrics
import numpy as np
from utilities.predictions import make_predictions
from utilities.transformations import invert_and_combine

## Import Data

In [2]:
df = pd.read_csv("../data/sample_data.csv")

# Designate all columns that are not `Outcome` as features and `Outcome` as target
X = df.loc[:, df.columns != 'Outcome']
y = df.Outcome

df.head(2)

Unnamed: 0,Shaq,Gray,Rushi,Marc,Peter,Pat,Sam,Ori,Vic,Ardy,Chad,Pat_Jr,Rory,Matt_M,Ben,Mikey,Evan,Medium_AI,Extra_Team,Outcome
0,1,0,-1,-1,1,0,-1,0,0,0,0,0,0,0,0,0,0,0,-1,-1
1,1,0,-1,0,-1,1,-1,1,0,0,0,0,0,0,0,0,0,0,0,-1


## Explore Data
This is where I should explore data. I haven't done any EDA since I created this dataset.

## Split data
Normally, I would split the data into a training set and a validation set. The validation set is for checking the accuracy of the best tuned model that results from cross-validation. HOWEVER, we are working with a really small dataset. Rather than hold out data for validation, we will assess the performance of the model through the out-of-sample cross-validation results.

In [3]:
#X_train,X_validate,y_train,y_validate=train_test_split(X,y,test_size=0.33,random_state=0)

## Double data
Since the assignment of teams to sides is random, we want to ensure that the dataset is balanced. For example, when I record data, I almost always put myself on the home team (coded as `1`). We mitigate this by not having an intercept term in our model. To be safe, we will still double the dataset by inverting all the records and concatenating them to the original dataset.

Doubling happens after splitting, so if we were using a validation set we would need to double both the training and validation sets. We use a helper function for readability. It lives in the `utilities` package.

In [4]:
X = invert_and_combine(X)
y = invert_and_combine(y)

# These are commented out because we are not using a validation set
# X_train = invert_and_combine(X_train)
# X_validate = invert_and_combine(X_validate)
# y_train = invert_and_combine(y_train)
# y_validate = invert_and_combine(y_validate)
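
The helper itself is not shown in this notebook. Here is a minimal sketch of what `invert_and_combine` might look like, assuming it simply negates every value and appends the negated copy to the original. The body below is a guess for illustration, not the actual code in `utilities.transformations`, and the toy games are made up.

```python
import pandas as pd


def invert_and_combine(data):
    """Return the data followed by a sign-flipped copy of itself.

    Hypothetical re-implementation: works for a DataFrame of features
    (1 = home, -1 = away, 0 = not playing) or a Series of outcomes.
    """
    inverted = data * -1  # swap home and away, and flip the outcome
    return pd.concat([data, inverted], ignore_index=True)


# Toy example with two made-up games
games = pd.DataFrame({"Shaq": [1, 1], "Gray": [0, -1], "Outcome": [-1, 1]})
doubled = invert_and_combine(games)
print(doubled)
```

After doubling, every column sums to zero, which is the balance the section above is after.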

## Cross Validation
We will use 3-fold cross-validation and a grid search to determine the optimal hyperparameters for the logistic regression. The only parameter we will search over is `C` (the inverse of the regularization strength, i.e. 1/lambda). We will assume the `l2` penalty (default), the `liblinear` solver (the default in older scikit-learn releases; newer releases default to `lbfgs`), and no intercept (`fit_intercept=False`).

In [5]:
C = np.logspace(-1,4,1000)
penalty = ['l2']
solver = ['liblinear']
fit_intercept = [False]
param_grid = dict(C=C, penalty=penalty, fit_intercept=fit_intercept, solver=solver)

We will now execute the GridSearch over three folds. We will use `accuracy` to assess the performance of the hyperparameters.

In [6]:
lr = LogisticRegression()
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv=3, n_jobs=-1, scoring='accuracy')
cross_validation_models = grid.fit(X, y)

print("Best model according to grid search: {0} using {1}".format(
    round(cross_validation_models.best_score_, 2), cross_validation_models.best_params_))

Best model according to grid search: 0.82 using {'C': 14.865248449978571, 'fit_intercept': False, 'penalty': 'l2', 'solver': 'liblinear'}


### Picking the right model
There is a balance between model generalization and accuracy. Too little penalization (i.e. a high C) means that the model could be overfitting. However, less penalization also tends to give higher accuracy. Therefore, we will pick the model that has the lowest C while still being within the bounds of the highest-accuracy model. Specifically, we want the lowest C whose mean cross-validation accuracy is within one standard deviation of the best C's accuracy.

In [7]:
def find_target_accuracy(cv_models):
    """Finds the target accuracy that the second best model has to exceed

    Args:
        cv_models (GridSearchCV): The object that has the info from cross validation.

    Returns:
        target_accuracy (float): The best model's accuracy decreased by its standard deviation

    """
    best_cv_model_index = cv_models.best_index_
    best_cv_model_mean_accuracy = cv_models.cv_results_['mean_test_score'][best_cv_model_index]
    best_cv_model_std_accuracy = cv_models.cv_results_['std_test_score'][best_cv_model_index]
    target_accuracy = best_cv_model_mean_accuracy - best_cv_model_std_accuracy
    return target_accuracy


def find_final_model_params(cv_models):
    """Finds the parameters for the final model that will be trained on all data.
    We want to see whether there is a model that has more generalization but satisfactory accuracy

    Args:
        cv_models (GridSearchCV): The object that has the info from cross validation.

    Returns:
        final_model_params (dict): The final model's parameters

    """
    target_accuracy = find_target_accuracy(cv_models)
    index_of_final_model = loop_through_cv_to_find_index_of_final_model(cv_models, target_accuracy)
    final_model_params = cv_models.cv_results_['params'][index_of_final_model]
    return final_model_params


def loop_through_cv_to_find_index_of_final_model(cv_models, target_accuracy):
    """We want to see whether there is a model that has more generalization but
    satisfactory accuracy. We loop through the results until we find a model
    that has a satisfactory accuracy. The loop will stop at the best model in
    case none of the models with higher generalization are satisfactory.
    This function assumes that the models are sorted by increasing C.

    Args:
        cv_models (GridSearchCV): The object that has the info from cross validation.
        target_accuracy (float): The cv best model's accuracy decreased by its standard deviation

    Returns:
        target_index (int): The index of the final model

    """
    target_index = cv_models.best_index_
    for i, score in enumerate(cv_models.cv_results_['mean_test_score']):
        if score > target_accuracy and i < target_index:
            target_index = i
            print("Found adequate model with better generalization: {0:.3f} (+/-{1:.3f}) using {2}".
                  format(cv_models.cv_results_['mean_test_score'][target_index],
                         2 * cv_models.cv_results_['std_test_score'][target_index],
                         cv_models.cv_results_['params'][target_index]))
            break
    return target_index

In [8]:
best_model_parameters = find_final_model_params(cross_validation_models)

Found adequate model with better generalization: 0.793 (+/-0.059) using {'C': 1.867545842761076, 'fit_intercept': False, 'penalty': 'l2', 'solver': 'liblinear'}
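
The same selection rule can be sketched more compactly with NumPy. This is an illustrative alternative, not the notebook's code: the function name `lowest_c_within_one_std` and the scores below are made up, and it uses `>=` rather than `>` so that the best model itself always qualifies.

```python
import numpy as np


def lowest_c_within_one_std(c_values, mean_scores, std_scores):
    """Index of the smallest C scoring within one std of the best score."""
    c_values = np.asarray(c_values, dtype=float)
    mean_scores = np.asarray(mean_scores, dtype=float)
    std_scores = np.asarray(std_scores, dtype=float)

    best = int(np.argmax(mean_scores))
    target = mean_scores[best] - std_scores[best]

    # The best model always qualifies, so `candidates` is never empty
    qualifies = (mean_scores >= target) & (c_values <= c_values[best])
    candidates = np.flatnonzero(qualifies)
    return int(candidates[np.argmin(c_values[candidates])])


# Made-up scores for four values of C
c_grid = [0.1, 1.0, 10.0, 100.0]
means = [0.70, 0.80, 0.82, 0.81]
stds = [0.05, 0.03, 0.03, 0.04]
chosen = lowest_c_within_one_std(c_grid, means, stds)
print(c_grid[chosen])  # 1.0
```

With the real grid search, the inputs could presumably be built from `cv_results_`, for example `[p['C'] for p in cv_results_['params']]` together with `mean_test_score` and `std_test_score`. Unlike the loop above, this version does not depend on the results being sorted.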


## Assess performance
Normally, we would now use the tuned hyperparameters to ensure accuracy on the validation set. As a reminder, the model is trained on the training set and the scores are computed on the validation set. HOWEVER, as mentioned, we would rather use all of our limited data for model building, so we will not assess performance on a validation set.

In [9]:
# from sklearn.metrics import roc_auc_score, roc_curve, classification_report
# import matplotlib.pyplot as plt
# final_model_with_only_training_data = LogisticRegression(penalty=best_model_parameters['penalty'], 
#                                  C=best_model_parameters['C'],
#                                  fit_intercept=best_model_parameters['fit_intercept'],
#                                  solver=best_model_parameters['solver'])

# final_model_with_only_training_data = final_model_with_only_training_data.fit(X_train, y_train)
# y_true, y_pred = y_validate, final_model_with_only_training_data.predict(X_validate)
# print(classification_report(y_true, y_pred))

We can also use a ROC curve to visualize performance. The further the curve sits above the diagonal, the better. More info [here](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5).

In [10]:
# logit_roc_auc = roc_auc_score(y_validate, final_model_with_only_training_data.predict_proba(X_validate)[:,1])
# fpr, tpr, thresholds = roc_curve(y_validate, final_model_with_only_training_data.predict_proba(X_validate)[:,1])
# plt.figure()
# plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
# plt.plot([0, 1], [0, 1],'r--')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('Receiver operating characteristic')
# plt.legend(loc="lower right")
# plt.show()

## Create final model
Normally, once we were happy with the performance of our model on the validation set, we would re-fit it with all the data. There would be no concern of overfitting because we would already have validated against data the model hadn't seen. Since we skipped the validation set, the final model is simply a logistic regression with the hyperparameters selected above, fitted on all the (doubled) data.

In [11]:
# TODO: must be a cleaner way to import GridSearchCV into LogisticRegression
final_model_with_all_data = LogisticRegression(penalty=best_model_parameters['penalty'],
                                               C=best_model_parameters['C'],
                                               fit_intercept=best_model_parameters['fit_intercept'],
                                               solver=best_model_parameters['solver'])

final_model_with_all_data = final_model_with_all_data.fit(X, y)
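
One possible way to address the TODO above: the dictionaries stored in `cv_results_['params']` use the estimator's own argument names as keys, so they can be unpacked straight into the constructor. A small sketch, with the parameter values copied from the output above and toy data standing in for the real `X` and `y`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

best_model_parameters = {
    "C": 1.867545842761076,
    "fit_intercept": False,
    "penalty": "l2",
    "solver": "liblinear",
}

# Unpack the dict instead of passing each keyword by hand
model = LogisticRegression(**best_model_parameters)

# Toy stand-in data (not the real dataset): 1 = home, -1 = away
X_toy = np.array([[1, -1], [-1, 1], [1, -1], [-1, 1]])
y_toy = np.array([1, -1, 1, -1])
model.fit(X_toy, y_toy)
print(model.get_params()["C"])
```

Calling `set_params(**best_model_parameters)` on an existing estimator should work the same way.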

We will output the final coefficients to see how players are ranked and with what magnitude. We need to do some busy work to output a list of coefficients.

In [12]:
features = list(df.columns)
features.remove("Outcome")
[coef] = final_model_with_all_data.coef_.tolist()

rounded_coef = []
for number in coef:
    rounded_number = round(number, 2)
    rounded_coef.append(rounded_number)

x = zip(rounded_coef, features)
print("Final model coefficients are: {0}".format(sorted(list(x))))

Final model coefficients are: [(-1.82, 'Sam'), (-1.44, 'Chad'), (-1.42, 'Ori'), (-1.1, 'Marc'), (-0.72, 'Rory'), (-0.7, 'Mikey'), (-0.35, 'Matt_M'), (-0.19, 'Evan'), (-0.05, 'Ben'), (0.34, 'Medium_AI'), (0.48, 'Pat_Jr'), (0.54, 'Pat'), (0.62, 'Peter'), (0.76, 'Shaq'), (1.1, 'Ardy'), (1.36, 'Gray'), (1.91, 'Vic'), (2.07, 'Extra_Team'), (2.68, 'Rushi')]
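
Because the model has no intercept, a predicted win probability should reduce to the logistic function applied to the first side's summed coefficients minus the second side's. Here is a small sketch using a few of the rounded coefficients printed above, assuming class `1` means the first-listed side wins. The helper name `win_probability` is made up, and results can differ slightly from the model's own output because the coefficients are rounded.

```python
import math

# Rounded coefficients copied from the output above
coef = {"Marc": -1.1, "Rushi": 2.68, "Shaq": 0.76, "Gray": 1.36}


def win_probability(home, away):
    """Logistic function of (sum of home coefficients - sum of away coefficients)."""
    z = sum(coef[p] for p in home) - sum(coef[p] for p in away)
    return 1 / (1 + math.exp(-z))


p = win_probability(["Marc"], ["Rushi"])
print(f"{p:.2%}")  # 2.23%
```

Swapping the two sides gives the complementary probability, which is the symmetry that dropping the intercept is meant to enforce.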


## Predictions
Ultimately, we want to use this model to estimate the probability that one side wins a game. Each value in the input array corresponds to a person. For example, the first number is Shaq, the second number is Gray, etc. We use a utility function to compute the probabilities.

In [13]:
make_predictions(final_model_with_all_data)

The probability that Marc beats Rushi is 2.23%
The probability that Shaq beats Gray is 35.55%
The probability that Shaq and Gray beat Rushi is 82.0%
The probability that Marc beats Sam is 67.1%
The probability that Marc and Sam beat Rushi is 2.86%
The probability that Vic beats Rushi is 31.64%
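
The utility function is not shown here. As a rough sketch of how a single prediction could be made, assuming the input row uses `1` for the first side, `-1` for the second side and `0` for everyone else: the helper `predict_win_probability`, the player names `A`, `B`, `C` and the toy games are all hypothetical, and this is not the code in `utilities.predictions`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def predict_win_probability(model, features, home, away):
    """Probability that `home` beats `away` (hypothetical helper)."""
    row = np.zeros((1, len(features)))
    for player in home:
        row[0, features.index(player)] = 1
    for player in away:
        row[0, features.index(player)] = -1
    # Column of predict_proba that corresponds to the "home wins" class (1)
    home_column = list(model.classes_).index(1)
    return model.predict_proba(row)[0, home_column]


# Toy model on made-up, already doubled games, only to make the sketch runnable
features = ["A", "B", "C"]
X_toy = np.array([[1, -1, 0], [1, 0, -1], [0, 1, -1],
                  [-1, 1, 0], [-1, 0, 1], [0, -1, 1]])
y_toy = np.array([1, 1, 1, -1, -1, -1])
toy_model = LogisticRegression(fit_intercept=False).fit(X_toy, y_toy)

p_ab = predict_win_probability(toy_model, features, home=["A"], away=["B"])
p_ba = predict_win_probability(toy_model, features, home=["B"], away=["A"])
print(round(p_ab, 3), round(p_ba, 3))
```

With the real model, `features` would be the list of player columns in the same order as the training data.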
