In [None]:
# Standard Data Science Utility Belt
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# User defined methods
#from wrangle import wrangle, wrangle_explore
from acquire import acquire,team_data_list
import prepare
import model
import env
#from functions import get_data_dictionary

# Stats
from scipy.stats import mannwhitneyu, wilcoxon
from scipy.stats import levene

# Sklearn Modeling
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Viewing Options
pd.set_option("display.max_rows", None, "display.max_columns", None) 
pd.reset_option("display.max_rows", "display.max_columns")

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

# Project Planning

### Project and Goals

        Create a machine learning model that can accurately predict the win/loss outcome of a professional League of Legends match at the 10-minute mark.

### Original Hypothesis

    The biggest driver for predicting win rates will be the data on 'Towers lost'.

### Domain Knowledge Quick Tip


    League of Legends is a multiplayer online battle arena (MOBA) game in which the player controls a character ("champion") with a set of unique abilities from an isometric perspective.  As of April 2021, there are 155 champions available to play. Over the course of a match, champions gain levels by accruing experience points (XP) through killing enemies. Items also increase champions' strength, and are bought with gold, which players accrue passively over time and earn actively by defeating the opposing team's minions, champions, or defensive structures. In the main game mode, Summoner's Rift, items are purchased through a shop menu available to players only when their champion is in the team's base. Each match is discrete; levels and items do not transfer from one match to another.

### The Plan

    Set up the environment, create a new repository, update the .gitignore, create a README.md with the data and common terminology dictionaries, create a Trello board, come up with an original hypothesis, and set up a morning standup living document.
    
    Acquire the data using the Riot API.
    
    Clean the data, drop any useless features, remove duplicate observations, double check data types, find any null values, decide what to do with null values, and encode the features.
    
    Split the data into two data sets named train and test.
    
    Explore the data, look through the graphs and evaluate each feature to find drivers of predicting win rates.  Exploration will also include two hypotheses, setting of alpha, statistical tests, rejecting or failing to reject the null hypothesis, and documentation of the findings and takeaways.
    
    Create Models, create three machine learning models plus a baseline model.  Will be using a DecisionTreeClassifier, RandomForestClassifier, and KNeighborsClassifier.  Evaluate models on train and validate datasets.  Pick the model with highest validate accuracy to run on my final test data.
    
    Wrap it up, document conclusions, recommendations, and takeaways in the final report notebook.  Create a presentation. 

# Executive Summary - Conclusions & Next Steps

### Conclusion

    Our random forest classifier's mean cross-validated accuracy was roughly 97%, beating our baseline accuracy by 42%.
    
    Our model confirmed that our original hypothesis of 'TeamWards' being the biggest driver of win rates was incorrect.  Our model's feature importances indicated that 'towers_lost' was the biggest driver in predicting win rates. 
    
    Our model identified the most important features as:
        
        'towers_lost_'
        'inhibs_lost_team200'
        'baron_team100'
        'dragon_team100'
        'team_totalGold_100'
        'team_xp_100'
        
    If we had more time we would have liked to:
    
        - run our model on non-pro games & see if 'towers_lost' is still the biggest driver for predicting win rates
        - dive deeper into what are the drivers for gaining towers
        - engineer more features
        - predict a winner at a much earlier time than 10 min into a game
        

### Recommendations

    The data suggests:  
    
    - If you are a coach, revolving your team strategy around objectives can lead to more wins. 
    
    - If you are a player, encouraging members of your team to work around objectives can lead to more wins. 

### Key Takeaways

    We used event data from the Riot API to calculate what the value of each observation was at the 10 min marker.
    
    The only 'gameType' used was "classic", which means all the data is from the most popular game mode and on the same map.

    Games were pulled from various game patches, including:

        '11.10.376.4811'
        '11.11.377.6311'
        '11.12.379.4946'
        '11.13.382.1241'
        '11.14.384.6677'
        '11.14.385.9967'
        '11.15.388.2387'
        '11.15.389.2308'
        '11.16.390.1945'
        '11.17.393.607'
        '11.17.394.4489'
        '11.18.395.7538'
        '11.19.398.2521'
        '11.19.398.9466'
        '11.20.400.7328'
        '11.21.403.3002'
        '11.22.406.3587'
        '11.23.409.111'

# Data Acquisition (Jared)

In [None]:
df = pd.read_csv('final_10.csv')

# Preparation

In [None]:
df = prepare.clean(df)

In [None]:
### .info(), .head(), .describe()
df.info(verbose=True)

In [None]:
df.head()

In [None]:
df.describe().T

### Distribution - "Team Data Stats"

    We will only be graphing team data stats for 'blue team' due to the number of features in this df.  
    
    These features are:
    
         'riftherald_team100',
         'inhibs_lost_team100',
         'team_totalGold_100',
         'team_trueDamageDoneToChampions_100',
         'team_ward_player_100',
         'team_assistsplayer_100',
         'team_xp_100',
         'team_deathsplayer_100',
         'team_jungleMinionsKilled_100',
         'team_killsplayer_100',
         'team_level_100',
         'team_magicDamageDoneToChampions_100',
         'team_minionsKilled_100',
         'team_physicalDamageDoneToChampions_100',
         'team_timeEnemySpentControlled_100'

In [None]:
# data distributions for blue team.
fig, axs = plt.subplots(5,3, sharey = False, figsize = (25,20))
axe = axs.ravel()
sns.set(font_scale = 1.25)
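for i, col in enumerate(team_data_list):
    # One histogram per team-level feature
    p = df[col].plot.hist(ax = axe[i], title = col, ec = 'black', bins = 10)
    p.set_title(col, fontsize = 20)
# Adjust subplot spacing once, after every histogram has been drawn
plt.tight_layout()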

### Data Dictionary
| Feature                    | Datatype                | Definition   |
|:----------------------|:------------------------|:-------------|
| RedTeamKills|int64|Gives a total of the red team's kills|
| BlueTeamKills|int64|Gives a total of the blue team's kills|
| RedTeamTowerKills|int64|Gives a total of the number of towers taken by the red team|
| BlueTeamTowerKills|int64|Gives a total of the number of towers taken by the blue team|
| RedTeamTowerAssists|int64|Gives the total number of assists on the red team|
| BlueTeamTowerAssists|int64|Gives the total number of assists on the blue team|
| RedTeamAvgLvl|int64|Takes the mean level of all the players on the red team|
| BlueTeamAvgLvl|int64|Takes the mean level of all the players on the blue team|
| RedTeamGoldSpent|int64|Gives a total amount of gold spent by the red team|
| BlueTeamGoldSpent|int64|Gives a total amount of gold spent by the blue team|
| RedTeamDragons|int64|Gives a total number of dragons killed by the red team|
| BlueTeamDragons|int64|Gives a total number of dragons killed by the blue team|
| RedTeamHeralds|int64|Gives a total number of heralds killed by the red team|
| BlueTeamHeralds|int64|Gives a total number of heralds killed by the blue team|
| RedTeamBarons|int64|Gives a total number of barons killed by the red team|
| BlueTeamBarons|int64|Gives a total number of barons killed by the blue team|
| RedTeamInhibTaken|int64|Gives a total number of inhibitors taken by the red team|
| BlueTeamInhibTaken|int64|Gives a total number of inhibitors taken by the blue team|

## Target Feature - 'winningTeam'

    Riot API's code convention
    
        BlueTeam = 100.0
    
        RedTeam = 200.0

In [None]:
# Count of wins per team (100.0 = blue team, 200.0 = red team)
df['winningTeam'].value_counts()

### Target Distribution

In [None]:
# Graphing the Distribution
sns.set(font_scale = 1)
df.winningTeam.hist()
plt.title("Target 'winningTeam' distribution")
plt.show()

### Acquire takeaway

    Summoner names were webscraped from popular sites:
    
        https://na.op.gg/ranking/ladder/
        https://www.trackingthepros.com/players/na/
        
    Only selected the data from the top 3000 players in the North American server.
    
    Includes only skill levels of: 
    
        - masters
        - grandmasters
        - challenger
        - professionals
    
    The Riot API was used to get PUUIDs for the players.
    
    PUUIDs are used to get match lists from the players' match histories.
    
    Matches from those lists that lasted longer than 10 min were pulled as JSON files and prepared into a tidy dataframe.
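
The last step above, turning match events into a 10-minute snapshot, can be sketched roughly as follows. This is only an illustration: it assumes the timeline has already been flattened into simple records with hypothetical fields `timestamp` (milliseconds), `teamId`, and `type`, and the event type names are placeholders. The real Riot timeline JSON is nested differently, and the project's `acquire` module is not shown here.

```python
import pandas as pd

TEN_MINUTES_MS = 10 * 60 * 1000  # snapshot cutoff in milliseconds

# Hypothetical, already-flattened events for a single match
events = [
    {"timestamp": 120_000, "teamId": 100, "type": "CHAMPION_KILL"},
    {"timestamp": 400_000, "teamId": 200, "type": "CHAMPION_KILL"},
    {"timestamp": 580_000, "teamId": 100, "type": "BUILDING_KILL"},
    {"timestamp": 700_000, "teamId": 200, "type": "BUILDING_KILL"},  # after the cutoff
]

def snapshot_at_cutoff(events, cutoff_ms=TEN_MINUTES_MS):
    """Count events per team and event type at or before the cutoff."""
    frame = pd.DataFrame(events)
    early = frame[frame["timestamp"] <= cutoff_ms]
    # Rows are teams, columns are event types, missing combinations become 0
    return early.groupby(["teamId", "type"]).size().unstack(fill_value=0)

snapshot = snapshot_at_cutoff(events)
print(snapshot)
```

Events after the cutoff are dropped before counting, so the resulting row for each team reflects only what had happened by the 10-minute mark.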

# Prepare

#### Dropped Columns

    The following column got dropped because it didn't offer any value:
    
        - killsplayer_0

    It represents how many kills were made by game objects, not players, and contains several null values.

#### Handle Nulls
    
    Filled in all missing values with a 0. 
    
    NaNs appeared when a player had no value for a particular feature.  
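
A minimal sketch of this fill step on a toy frame. The column names only mimic the notebook's naming, and the actual logic lives in `prepare.clean`, which is not shown here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the per-team stats
toy = pd.DataFrame({
    "team_ward_player_100": [3.0, np.nan, 5.0],
    "team_killsplayer_100": [np.nan, 2.0, 1.0],
})

# Count missing values per column before filling
nulls_before = toy.isna().sum()

# Treat "no recorded value" as zero
filled = toy.fillna(0)

print(nulls_before)
print(filled)
```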

#### Convert Data Types

    No data types were converted

#### Rename

    Changed the following column names to something more readable: 
    
    BlueTeam
    
        - deathsplayer_100 -> BlueTeamDeaths
        
        - goldPerSecond_100 -> BlueTeamGoldPerSecond
        
        - jungleMinionsKilled_100 -> BlueTeamJungleMinionsKilled
        
        - killsplayer_100 -> BlueTeamKills
        
        - level_100 -> BlueTeamLevel
        
        - magicDamageDoneToChampions100 -> BlueTeamMagicDamageDoneToChampions
        
        - minionsKilled_100 -> BlueTeamMinionsKilled
        
        - physicalDamageDoneToChampions_100 -> BlueTeamPhysicalDamageDoneToChampions
        
        - timeEnemySpentControlled_100 -> BlueTeamTimeEnemySpentControlled
        
        - totalDamageDoneToChampions_100 -> BlueTeamTotalDamageDoneToChampions
        
        - totalGold_100 -> BlueTeamTotalGold
        
        - trueDamageDoneToChampions_100 -> BlueTeamTrueDamageDoneToChampions
        
        - ward_player_100 -> BlueTeamWard_player
        
        - assistsplayer_100 -> BlueTeamAssistsplayer
        
        - xp_100 -> BlueTeamXP
        
        
    RedTeam
    
        - deathsplayer_200 -> RedTeamDeaths
        
        - goldPerSecond_200 -> RedTeamGoldPerSecond
        
        - jungleMinionsKilled_200 -> RedTeamJungleMinionsKilled
        
        - killsplayer_200 -> RedTeamKills
        
        - level_200 -> RedTeamLevel
        
        - magicDamageDoneToChampions200 -> RedTeamMagicDamageDoneToChampions
        
        - minionsKilled_200 -> RedTeamMinionsKilled
        
        - physicalDamageDoneToChampions_200 -> RedTeamPhysicalDamageDoneToChampions
        
        - timeEnemySpentControlled_200 -> RedTeamTimeEnemySpentControlled
        
        - totalDamageDoneToChampions_200 -> RedTeamTotalDamageDoneToChampions
        
        - totalGold_200 -> RedTeamTotalGold
        
        - trueDamageDoneToChampions_200 -> RedTeamTrueDamageDoneToChampions
        
        - ward_player_200 -> RedTeamWard_player
        
        - assistsplayer_200 -> RedTeamAssistsplayer
        
        - xp_200 -> RedTeamXP

#### Engineered Features

    - BlueTeamTotalGoldDifference = BlueTeamTotalGold - RedTeamTotalGold
    
    - RedTeamTotalGoldDifference = RedTeamTotalGold - BlueTeamTotalGold
    
    - BlueTeamMVPKills = Blue team's highest individual kill count 
    
    - RedTeamMVPKills = Red team's highest individual kill count 
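
The formulas above can be sketched on a toy frame as follows. The per-player kill columns (`blue_p1_kills`, and so on) are hypothetical names used only for illustration; the notebook does not show how individual kill counts are stored.

```python
import pandas as pd

toy = pd.DataFrame({
    "BlueTeamTotalGold": [15200, 14100],
    "RedTeamTotalGold": [14800, 15900],
    # Hypothetical per-player kill columns for the blue team
    "blue_p1_kills": [2, 0],
    "blue_p2_kills": [0, 1],
    "blue_p3_kills": [4, 1],
})

# Gold differences are mirror images of each other
toy["BlueTeamTotalGoldDifference"] = toy["BlueTeamTotalGold"] - toy["RedTeamTotalGold"]
toy["RedTeamTotalGoldDifference"] = toy["RedTeamTotalGold"] - toy["BlueTeamTotalGold"]

# MVP kills: the highest individual kill count on the team, row by row
blue_kill_cols = ["blue_p1_kills", "blue_p2_kills", "blue_p3_kills"]
toy["BlueTeamMVPKills"] = toy[blue_kill_cols].max(axis=1)

print(toy[["BlueTeamTotalGoldDifference", "RedTeamTotalGoldDifference", "BlueTeamMVPKills"]])
```

Because the two gold differences always sum to zero, a model only gains information from one of them.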

#### Removed Outliers
    
    No outliers were removed.

#### Scaling
    No scaling was done.

#### Encode

    Created dummy columns for:
    
        - gameVersion
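
One common way to create such dummy columns is `pd.get_dummies`. The sketch below assumes that approach (the project's `prepare` module is not shown) and uses two of the patch strings listed in this notebook on a toy frame.

```python
import pandas as pd

toy = pd.DataFrame({
    "gameVersion": ["11.10.376.4811", "11.11.377.6311", "11.10.376.4811"],
    "winningTeam": [100.0, 200.0, 100.0],
})

# Replace the text column with one indicator column per patch version
encoded = pd.get_dummies(toy, columns=["gameVersion"])

print(encoded.columns.tolist())
```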
        

#### Split

    Split data into two data frames:
    
        - train
        - test
        
    Used a random_state of 123
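
A sketch of a split like this with scikit-learn's `train_test_split`. The `random_state` of 123 comes from the text; the 20% test size is taken from the modeling section of this notebook, and stratifying on the target is an assumption added here, not something the notebook states.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a balanced target
toy = pd.DataFrame({
    "team_totalGold_100": list(range(100)),
    "winningTeam": [100.0, 200.0] * 50,
})

train, test = train_test_split(
    toy,
    test_size=0.2,                 # 20% held out for testing
    random_state=123,              # reproducible split
    stratify=toy["winningTeam"],   # assumption: keep the win ratio equal in both sets
)

print(train.shape, test.shape)
```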

### Prepare Takeaway
    
    - All features and observations have no null or empty values.
    
    - Most of the data was already prepped the way we needed it because we pulled the data off of JSON files ourselves.

# Exploration

In [None]:
# Acquire, prepare, and split the data; values are not scaled or encoded for exploration
train, test = prep(acquire())

### Original Hypothesis

    The biggest driver for predicting win rates will be the data on ''.

### Questions
    
    Ward score affects the outcome?
    
    Assists affects the outcome?
    
    Damage by magic affects the outcome?
    
    Towers lost affect the outcome?
    
    XP gained affects the outcome?
    
    Number of inhibitors lost affects the outcome?
    
    Team dragon kills affects the outcome of the game?

## What key drivers affect the outcome of a match?

In [None]:
# Heatmap
sns.set()
plt.figure(figsize=(20,28))
heatmap = sns.heatmap(train.corr()[['winningTeam']].sort_values(by='winningTeam', ascending=False), vmin=-.50, vmax=.50, annot=True)
heatmap.set_title('Features Correlating with winningTeam')

### Question 1
- Is there a correlation between Blue team's gold and team levels that could affect the outcome of the game?

# Univariate Study

In [None]:
# univariate study
univariate_study = ['BlueTeamTotalGoldDifference','BlueTeamLevelDifference']

for i in univariate_study:
    explore_univariate(train, i)
    print(f'Summary Statistics for {i}\n{train[i].describe()}')

 - Blue team had less than average gold and almost average levels at the 10 min mark

# Levene's Test for Equal Variances

In [None]:
# Stats
from scipy.stats import mannwhitneyu, wilcoxon
from scipy.stats import levene
# From the scipy.stats library, I'm going to use the Levene test to check variance.
# It tests the null hypothesis that all input samples are from populations with equal variances.
# The statistic is stored as `stat` so it does not shadow the scipy `stats` module alias.
stat, p = levene(train.BlueTeamTotalGoldDifference, train.BlueTeamLevelDifference)
print(stat, p)
alpha = .05
if p < alpha:
    print("the two samples do not have equal variances")
else:
    print("the two samples do have equal variances")


In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(x='BlueTeamTotalGoldDifference',y='BlueTeamLevelDifference',data=train,hue='winningTeam', palette='colorblind')
plt.title('Blue Team Gold and Level difference based on winning team', fontsize = 20)
plt.show()

# Hypothesis Testing

## $H_0$: Blue team's gold difference over 40 and blue team's level difference over 1  is not significant

## $H_a$: Blue team's gold difference under 40 and blue team's level difference under 1 is significant

In [None]:
import scipy.stats as stats
# hypothesis testing

null_hypothesis = "Blue team's gold difference over 40 and blue team's level difference over 1  is not significant"
alternative_hypothesis = "Blue team's gold difference under 40 and blue team's level difference under 1 is significant"
a = 0.05 #a for alpha 

big_loss = train[train.BlueTeamTotalGoldDifference > 40]
little_loss = train[train.BlueTeamLevelDifference >= 1]
t, p = stats.ttest_ind(big_loss.winningTeam, little_loss.winningTeam)
print(p)
if p < a:
    print(f'Reject null hypothesis that: {null_hypothesis}')
    print (f'There is evidence to suggest: {alternative_hypothesis}')
else:
    print(f'Fail to reject null hypothesis that: {null_hypothesis} There is not sufficient evidence to reject it.')

### Hypothesis Results:

- There is no significant difference in the result of the game when the blue team's gold difference is over 40 and the blue team's level difference is greater than or equal to 1

### Takeaways

- At a significance level of 0.05, I fail to reject the null hypothesis: there is not sufficient evidence of a difference in outcome when blue team gold difference is over 40 and team level difference is over 1
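
As a possible cross-check, a rank-based Mann-Whitney U test can compare a feature between games the blue team won and games it lost without assuming normality. This is a swapped-in alternative on synthetic data, not the test this notebook ran, and the numbers below are made up for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(123)

# Synthetic gold differences: positive on average in blue wins, negative in blue losses
gold_diff_blue_won = rng.normal(loc=500, scale=800, size=200)
gold_diff_blue_lost = rng.normal(loc=-500, scale=800, size=200)

# Two-sided test of whether the two distributions differ
stat, p = mannwhitneyu(gold_diff_blue_won, gold_diff_blue_lost, alternative="two-sided")

alpha = 0.05
reject_null = p < alpha
print(stat, p, reject_null)
```

On the real data, the two samples would be one feature column filtered by `winningTeam`, rather than two different features.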
    

### Question 2
- Is there a correlation between Blue Team Physical Damage Difference and Blue Team KDA Difference?

# Univariate Study

In [None]:
# univariate study
univariate_study = ['BlueTeamPhysicalDmgDifference','BlueTeamKdaDifference']

for i in univariate_study:
    explore_univariate(train, i)
    print(f'Summary Statistics for {i}\n{train[i].describe()}')

# Levene's Test for Equal Variances

In [None]:
# From the scipy.stats library, I'm going to use the Levene test to check variance.
# It tests the null hypothesis that all input samples are from populations with equal variances.
# The statistic is stored as `stat` so it does not shadow the scipy `stats` module alias.
stat, p = levene(train.BlueTeamPhysicalDmgDifference, train.BlueTeamKdaDifference)
print(stat, p)
alpha = .05
if p < alpha:
    print("the two samples do not have equal variances")
else:
    print("the two samples do have equal variances")


# Hypothesis Testing

## $H_0$: Blue team's physical damage difference under -85 and blue team's kda difference under 0 is not significant

## $H_a$: Blue team's physical damage difference over -85 and blue team's kda difference over 0 is significant

In [None]:
# hypothesis testing
import scipy.stats as stats
null_hypothesis = "Blue team's physical damage difference under -85 and blue team's kda difference under 0 is not significant"
alternative_hypothesis = "Blue team's physical damage difference over -85 and blue team's kda difference over 0 is significant"
a = 0.05 #a for alpha 

big_xp = train[train.BlueTeamPhysicalDmgDifference > -85]
little_xp = train[train.BlueTeamKdaDifference >= 0]
t, p = stats.ttest_ind(big_xp.winningTeam, little_xp.winningTeam)
print(p)
if p < a:
    print(f'Reject null hypothesis that: {null_hypothesis}')
    print (f'There is evidence to suggest: {alternative_hypothesis}')
else:
    print(f'Fail to reject null hypothesis that: {null_hypothesis} There is not sufficient evidence to reject it.')

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(x='BlueTeamPhysicalDmgDifference',y='BlueTeamKdaDifference',data=train,hue='winningTeam', palette='colorblind')
plt.title('Blue Team physical damage and kda difference based on winning team', fontsize = 20)
plt.show()

### Hypothesis Results:
    Blue team's physical damage difference over -85 and blue team's kda difference over 0 is significant

### Takeaways

    At a significance level of 0.05, there is evidence of a difference in outcome when blue team's physical damage difference is over -85 and blue team's kda difference is over 0

### Question 3
- Is there a correlation between Blue team ward difference and blue team's minion kill difference that will affect the outcome of the game?

In [None]:
# univariate study
univariate_study = ['BlueTeamWardDifference',
       'BlueTeamminionKillDifference']

for i in univariate_study:
    explore_univariate(train, i)
    print(f'Summary Statistics for {i}\n{train[i].describe()}')

- blue team wards and minion kill difference are normally distributed
- blue team ward and minion kill difference are averaging the same

# Levene's Test for Equal Variances

In [None]:
# From the scipy.stats library, I'm going to use the Levene test to check variance.
# It tests the null hypothesis that all input samples are from populations with equal variances.
# The statistic is stored as `stat` so it does not shadow the scipy `stats` module alias.
stat, p = levene(train.BlueTeamWardDifference, train.BlueTeamminionKillDifference)
print(stat, p)
alpha = .05
if p < alpha:
    print("the two samples do not have equal variances")
else:
    print("the two samples do have equal variances")


# Hypothesis Testing

## $H_0$: Blue team's ward difference over 0 and blue team's minion kills difference over 0  is not significant

## $H_a$: Blue team's ward difference under 0 and blue team's minion kills difference under 0 is significant

In [None]:
# hypothesis testing
import scipy.stats as stats
null_hypothesis = "Blue team's ward difference over 0 and blue team's minion kills difference over 0  is not significant"
alternative_hypothesis = "Blue team's ward difference under 0 and blue team's minion kills difference under 0 is significant"
a = 0.05 #a for alpha 

big_xp = train[train.BlueTeamWardDifference > 0]
little_xp = train[train.BlueTeamminionKillDifference >= 0]
t, p = stats.ttest_ind(big_xp.winningTeam, little_xp.winningTeam)
print(p)
if p < a:
    print(f'Reject null hypothesis that: {null_hypothesis}')
    print (f'There is evidence to suggest: {alternative_hypothesis}')
else:
    print(f'Fail to reject null hypothesis that: {null_hypothesis} There is not sufficient evidence to reject it.')

### Hypothesis Results:
- With a p-value of .06, just above our alpha of .05, we fail to reject the null hypothesis, though there may be some reason to believe that the blue team's ward difference and minion kill difference over 0 may be significant

# Exploration Takeaway

    - Features I am predicting to do reasonably well in my model.  I will try them in the modeling stage and document the process.
    
        'Assists'
        'Barons'
        'Inhibitors'
        'Towers'

# Modeling


    - Create a baseline
    - Split the data into X and y groups
    - No need for a validate set; cross-validation is used in its place
    - Build RandomForestClassifier models
    - Evaluate the best one on the test dataset.

In [None]:
train, test = prep(acquire())

### Create a Baseline

    Since this is a classification problem, we will set the baseline to whichever team has the most wins in our training set.
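
`model.baseline_acc` is defined in the project's `model` module and is not shown here. The idea of a most-frequent-class baseline can be sketched as follows, using a made-up target Series rather than the real `y_train`:

```python
import pandas as pd

# Made-up target: 200.0 (red team) wins three of five games
y_toy = pd.Series([200.0, 200.0, 100.0, 200.0, 100.0])

# The baseline always predicts the most common class
most_common = y_toy.mode()[0]
baseline_predictions = pd.Series(most_common, index=y_toy.index)

# Accuracy is the share of games where that constant guess is right
baseline_accuracy = (baseline_predictions == y_toy).mean()
print(most_common, baseline_accuracy)
```

Any model worth keeping should beat this number, since it requires no features at all.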

In [None]:
# Which team has the most wins? Team 200.0
y_train.value_counts()

In [None]:
# The following function call creates the baseline, returns the model, and prints the accuracy
baseline_model = model.baseline_acc(X_train, y_train)

### Split Into X and y Groups

In [None]:
#X group = features, y group = target
X_train, y_train = train.drop(columns = ['winningTeam']), train.winningTeam

In [None]:
#X group = features, y group = target
X_test, y_test = test.drop(columns = ['winningTeam']), test.winningTeam

In [None]:
#Verify they are the same length
X_train.shape, y_train.shape

### Create Train and Test Sets

    We will not need a validate dataset since we are utilizing cross-validation. The split set aside 20% of the data for testing; the following line of code confirms the shape of each set.

In [None]:
train.shape, test.shape

### Random Forest Classifier

    To use the function we wrote, you must first create a dictionary keyed with hyperparameters and a range of values for each. It will be used to find and return an optimized Random Forest Classifier model.

In [None]:
#Create the dictionary of hyperparameters we want to optimize across
param_dict = {
    'max_depth': range(1, 16),
    'min_samples_leaf': range(1, 16)
}

In [None]:
#The following function call will find and return an optimized Random Forest Classifier
#and print out that model's mean cross-validated accuracy and its hyperparameters.
best_model = model.get_random_forest_models(X_train, y_train, param_dict)

__Best Model Takeaways__

    - Random Forest Classifier
    - Mean Cross-Validated Accuracy: 96.64%
    - Outperformed Baseline by 42%
    - Max Depth: 8
    - Minimum Samples Per Leaf: 3
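
`model.get_random_forest_models` lives in the project's `model` module and is not shown here. A search of this kind could be sketched with scikit-learn's `GridSearchCV`. This is an assumption about the general approach, not the helper's actual code; it uses synthetic data, 3 folds, and a much smaller grid so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=123)

# Reduced grid; the notebook searches range(1, 16) for both hyperparameters
small_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 3]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=123),
    param_grid=small_grid,
    cv=3,                 # assumed fold count
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

best_params = search.best_params_
mean_cv_accuracy = search.best_score_   # mean accuracy across folds for the best setting
refit_model = search.best_estimator_    # refit on all of X_toy, since refit=True is the default
print(best_params, round(mean_cv_accuracy, 3))
```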

__Top 10 Features__

In [None]:
best_features = pd.DataFrame(best_model.feature_importances_, X_train.columns)
best_features.sort_values(by = 0, ascending = False).head(10)

# Evaluation

    - Refit on train data before using it on the test data. 
    
    - This needs to be done because cross-validation fits each model on only part of the given data (the folds not held out for scoring), which means that our best model may not yet have been trained on the full training data set.
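
How much of the training data each cross-validation fit sees depends on the number of folds. A small sketch, assuming 5 folds (the notebook does not state the fold count) and a toy array in place of `X_train`:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy feature matrix with 50 rows
X_toy = np.arange(100).reshape(50, 2)

# For each fold, record how many rows are used for fitting and for scoring
fold_sizes = [
    (len(fit_idx), len(score_idx))
    for fit_idx, score_idx in KFold(n_splits=5).split(X_toy)
]
print(fold_sizes)
```

With 5 folds, every fit uses 40 of the 50 rows and is scored on the remaining 10, so no single cross-validation fit has seen the whole training set.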

__Fit Best Model on All Train Data__

In [None]:
#The following line of code will fit our best model to all of the training data
best_model.fit(X_train, y_train)

__Evaluate On Test__

In [None]:
#The following code will score our best model on the unseen test data

#Remove this comment and uncomment the following line of code only when ready for testing!

#best_model.score(X_test, y_test)

### Conclusion

    Our random forest classifier's mean cross-validated accuracy was roughly 97%, beating our baseline accuracy by 42%.
    
    Our model confirmed that our original hypothesis of 'TeamWards' being the biggest driver of win rates was incorrect.  Our model's feature importances indicated that 'towers_lost' was the biggest driver in predicting win rates. 
    
    Our model identified the most important features as:
        
        'towers_lost_'
        'inhibs_lost_team200'
        'baron_team100'
        'dragon_team100'
        'team_totalGold_100'
        'team_xp_100'
        
    If we had more time we would have liked to:
    
        - run our model on non-pro games & see if 'towers_lost' is still the biggest driver for predicting win rates
        - dive deeper into what are the drivers for gaining towers
        - engineer more features
        - predict a winner at a much earlier time than 10 min into a game
        

### Recommendations

    The data suggests:  
    
    - If you are a coach, revolving your team strategy around objectives can lead to more wins. 
    
    - If you are a player, encouraging members of your team to work around objectives can lead to more wins. 

### Key Takeaways

    We used event data from the Riot API to calculate what the value of each observation was at the 10 min marker.
    
    The only 'gameType' used was "classic", which means all the data is from the most popular game mode and on the same map.

    Games were pulled from various game patches, including:

        '11.10.376.4811'
        '11.11.377.6311'
        '11.12.379.4946'
        '11.13.382.1241'
        '11.14.384.6677'
        '11.14.385.9967'
        '11.15.388.2387'
        '11.15.389.2308'
        '11.16.390.1945'
        '11.17.393.607'
        '11.17.394.4489'
        '11.18.395.7538'
        '11.19.398.2521'
        '11.19.398.9466'
        '11.20.400.7328'
        '11.21.403.3002'
        '11.22.406.3587'
        '11.23.409.111'

# Thank you