# Age of Empires 2 Player Naive Bayes Model

This model ranks individual players in team games. It will let us balance teams better by calculating the probability that a team wins before we actually play. This [tutorial](https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn) is a good introduction to building a naive Bayes classifier.

Disclaimer: I am not a data scientist and do not fully understand the underlying math.

## Todo:
- Create test cases
- Load data from Google Sheet instead of local CSV
- Determine what EDA should be done
- Explore adding a time component to factor in player improvement
- Determine how to better input data for predicting
- Build other classifiers

## Import dependencies

In [1]:
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from utilities.predictions import make_predictions

## Import Data

In [2]:
df = pd.read_csv("../data/sample_data.csv")

# Designate all columns that are not `Outcome` as features and `Outcome` as target
X = df.loc[:, df.columns != 'Outcome']
y = df.Outcome

df.head(2)

Unnamed: 0,Shaq,Gray,Rushi,Marc,Peter,Pat,Sam,Ori,Vic,Ardy,Chad,Pat_Jr,Rory,Matt_M,Ben,Mikey,Evan,Medium_AI,Extra_Team,Outcome
0,1,0,-1,-1,1,0,-1,0,0,0,0,0,0,0,0,0,0,0,-1,-1
1,1,0,-1,0,-1,1,-1,1,0,0,0,0,0,0,0,0,0,0,0,-1


## Explore Data
This is where I should explore data. I haven't done any EDA since I created this dataset.

## Split data
Normally, I would split the data into a training set and a validation set. The validation set is for checking the accuracy of the best tuned model that results from cross-validation. However, we are working with a really small dataset. Rather than hold out data for validation, we will assess the performance of the model through the out-of-sample cross-validation results.

In [3]:
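# Hold-out split intentionally skipped; see the note above.
# If re-enabled, this needs `from sklearn.model_selection import train_test_split`.
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)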

## Double data
Since assigning teams is random, we want to ensure that the dataset is balanced. For example, when I record data, I almost always put myself on the home team (coded as `1`). We mitigate this by not having an intercept term in our model. To be safe, we still double the dataset by inverting all the records and concatenating them to the original dataset.

Doubling should happen after splitting, so with a hold-out split we would need to double the training and validation sets separately. We use helper functions for readability.

In [4]:
def invert_dataframe(original_dataframe):
    """Inverts the dataframe by simply multiplying all values by -1.

    Args:
        original_dataframe (df): The dataframe to be inverted.

    Returns:
        inverted_dataframe (df): The inverted dataframe.

    """
    inverted_dataframe = original_dataframe.multiply(-1)
    return inverted_dataframe


def combine_dataframe(first_dataframe, second_dataframe):
    """Combines the dataframes. Assumes that both dataframes have the same columns.

    Args:
        first_dataframe (df): The first dataframe to be combined.
        second_dataframe (df): The second dataframe to be combined.

    Returns:
        combined_dataframe (df): The combined dataframe.

    """
    combined_dataframe = pd.concat([first_dataframe, second_dataframe])
    return combined_dataframe


def invert_and_combine(original_dataframe):
    """Inverts the dataframe and appends the inverted copy to the original.

    Args:
        original_dataframe (df): The dataframe to be inverted and combined with the original.

    Returns:
        new_dataframe (df): The combined dataframe.

    """
    inverted_dataframe = invert_dataframe(original_dataframe)
    new_dataframe = combine_dataframe(original_dataframe, inverted_dataframe)
    return new_dataframe

In [5]:
X = invert_and_combine(X)
y = invert_and_combine(y)
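
To make the doubling step concrete, here is a minimal sketch on toy data. The players `A`, `B`, `C` and the two games are made up for illustration, and `ignore_index=True` is an addition of this sketch (the helpers above keep the original index labels, so the doubled frame has duplicate labels).

```python
import pandas as pd

# Toy data: two games, three hypothetical players (A, B, C).
# 1 = home team, -1 = away team, 0 = did not play.
toy_X = pd.DataFrame({"A": [1, 1], "B": [-1, 0], "C": [0, -1]})
toy_y = pd.Series([1, -1], name="Outcome")

# Same operation as invert_and_combine: flip every sign, then append.
doubled_X = pd.concat([toy_X, toy_X.multiply(-1)], ignore_index=True)
doubled_y = pd.concat([toy_y, toy_y.multiply(-1)], ignore_index=True)

# Each game now appears twice: once as recorded and once with the sides swapped.
print(doubled_X.assign(Outcome=doubled_y))
```

After doubling, every player column sums to zero and both outcomes occur equally often, so which side was recorded as "home" no longer carries any information.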

## Cross Validation
We use 3-fold cross-validation to estimate the accuracy of the naive Bayes classifier. We take this approach because we don't have enough data to hold out a test set. We have no priors to specify, so we don't tune any hyperparameters and the parameter grid is left empty. This follows an [approach](https://stackoverflow.com/questions/51194627/python-naive-bayes-with-cross-validation-using-gaussiannb-classifier) found on Stack Overflow.

In [6]:
params = {}
gnb = GaussianNB()
grid = GridSearchCV(estimator=gnb, cv=3, param_grid=params, return_train_score=True, scoring="accuracy")
cross_validation_models = grid.fit(X, y)

print("Best model according to grid search: {0} using {1}".format(
    round(cross_validation_models.best_score_, 2), cross_validation_models.best_params_))

Best model according to grid search: 0.45 using {}


This model is worse than guessing: 0.45 accuracy against 0.50 for a coin flip on the doubled dataset, where both outcomes are equally common. It's not too surprising that the naive Bayes classifier doesn't work very well, since the independence assumption is violated. When we make teams, we have an implicit bias towards balancing them based on our priors of each person's skill level.
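
One caveat worth checking: because the dataset is doubled before cross-validation, a game and its mirrored copy can end up in different folds, so the model may be tested on a game it has effectively already seen. Below is a minimal sketch of keeping each pair together with scikit-learn's `GroupKFold`. It runs on toy data with hypothetical players `A`, `B`, `C` and an assumed `game_ids` array; it is not the method used in this notebook, and the scores it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the real data (hypothetical players A, B, C).
toy_X = pd.DataFrame({
    "A": [1, 1, -1, 0, 1, -1],
    "B": [-1, 0, 1, 1, -1, 0],
    "C": [0, -1, 0, -1, 0, 1],
})
toy_y = pd.Series([1, 1, -1, -1, 1, -1])

# Double the data, then tag each row with the id of the game it came from.
doubled_X = pd.concat([toy_X, toy_X.multiply(-1)], ignore_index=True)
doubled_y = pd.concat([toy_y, toy_y.multiply(-1)], ignore_index=True)
game_ids = np.tile(np.arange(len(toy_X)), 2)

# GroupKFold never places rows from the same group in both train and test.
fold_scores = cross_val_score(
    GaussianNB(),
    doubled_X,
    doubled_y,
    groups=game_ids,
    cv=GroupKFold(n_splits=3),
    scoring="accuracy",
)
print(fold_scores, fold_scores.mean())
```

On the real data, `game_ids` would be built the same way from the number of rows in the original `X` before doubling.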

## Create final model
The final model is simply a naive Bayes classifier fit on all the data.

In [7]:
final_model_with_all_data = GaussianNB()
final_model_with_all_data = final_model_with_all_data.fit(X, y)

## Predictions
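Ultimately, we want to use this model to estimate the probability that one side wins a game. Each matchup is passed to the model as an array in which each value corresponds to a person, in the same order as the feature columns. For example, the first number is Shaq, the second number is Gray, etc.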

In [9]:
make_predictions(final_model_with_all_data)

The probability that Marc beats Rushi is 30.47%
The probability that Shaq beats Gray is 58.31%
The probability that Shaq and Gray beat Rushi is 41.28%
The probability that Marc beats Sam is 61.99%
The probability that Marc and Sam beat Rushi is 19.92%
The probability that Vic beats Rushi is 50.53%
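
`make_predictions` lives in `utilities.predictions` and is not shown here, so the following is only a sketch of how a single matchup could be scored, not its actual implementation. The helpers `matchup_row` and `home_win_probability`, the players `A`, `B`, `C`, and the toy games are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB


def matchup_row(columns, home, away):
    """Build a one-row frame: 1 for home players, -1 for away players, 0 otherwise."""
    row = pd.DataFrame(np.zeros((1, len(columns)), dtype=int), columns=list(columns))
    row.loc[0, list(home)] = 1
    row.loc[0, list(away)] = -1
    return row


def home_win_probability(model, row):
    """Return the predicted probability of Outcome == 1 for a one-row frame."""
    probabilities = model.predict_proba(row)[0]
    # classes_ gives the label order of the predict_proba columns.
    return probabilities[list(model.classes_).index(1)]


# Toy training data, doubled the same way as the real dataset.
toy_X = pd.DataFrame({"A": [1, 1, -1, 0], "B": [-1, 0, 1, 1], "C": [0, -1, 0, -1]})
toy_y = pd.Series([1, 1, -1, -1])
toy_X = pd.concat([toy_X, toy_X.multiply(-1)], ignore_index=True)
toy_y = pd.concat([toy_y, toy_y.multiply(-1)], ignore_index=True)
toy_model = GaussianNB().fit(toy_X, toy_y)

p_a_beats_b = home_win_probability(toy_model, matchup_row(toy_X.columns, ["A"], ["B"]))
p_b_beats_a = home_win_probability(toy_model, matchup_row(toy_X.columns, ["B"], ["A"]))
print("The probability that A beats B is {0:.2%}".format(p_a_beats_b))
```

With the real model, the column list would be `X.columns`. Because the training data is doubled, swapping the two sides should give the complementary probability, which is a quick sanity check on any matchup.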
