<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Loading-data" data-toc-modified-id="Loading-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Loading data</a></span></li><li><span><a href="#Additional-preprocessing" data-toc-modified-id="Additional-preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Additional preprocessing</a></span></li><li><span><a href="#Running-logistic-regression" data-toc-modified-id="Running-logistic-regression-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Running logistic regression</a></span></li></ul></div>

# Overview

This notebook trains the model used for LoLwinner.

# Imports

In [31]:
from sklearn.model_selection import train_test_split
import numpy as np
import json
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, log_loss
from sklearn import metrics
from numpy import interp  # scipy.interp was a deprecated alias of numpy.interp
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.tree import export_graphviz
from IPython.display import Image  
import pydotplus
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from lib import feature_calculators, utils, match_factory
from importlib import reload
reload(feature_calculators);
reload(utils);
reload(match_factory);

LOG_LEVEL = 'Off'

# Loading data

In [49]:
fin = 'lolwinner_data.npz'

df = np.load(fin)

duration = df['duration']

**Filter matches.** Retain only matches that we actually care about: those lasting between 20 and 60 minutes (this excludes, e.g., matches that end in an early surrender).

In [50]:
min_duration = 20
max_duration = 60

considered_matches = np.where(
    (duration / 60 >= min_duration) &
    (duration / 60 <= max_duration)
    )[0]

In [51]:
total_gold = df['total_gold'][considered_matches]
winners = Y_all = df['winners'][considered_matches]
versions = df['versions'][considered_matches]
champions = df['champions'][considered_matches]
player_tiers = df['player_tiers'][considered_matches]
num_frames = df['num_frames'][considered_matches]
kills = df['kills'][considered_matches]
buildings = df['buildings'][considered_matches]
monsters = df['monsters'][considered_matches]

gold_diff = total_gold[:, :, 2]

In [58]:
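def run_logistic_regression(X_train, Y_train, X_test, cv = 5):
    """Fit a cross-validated logistic regression and predict on X_test.

    Returns the training-set accuracy (in percent), the predicted labels and
    class probabilities for X_test, and the fitted coefficients.
    """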
    model = LogisticRegressionCV(cv = cv)
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    accuracy = round(model.score(X_train, Y_train) * 100, 2)
    Y_pred_prob = model.predict_proba(X_test)
    return accuracy, Y_pred, Y_pred_prob, model.coef_

# Additional preprocessing

Here, we scale gold and combine the features into a single tensor. Matches were already filtered to durations of 20 to 60 minutes when the data was loaded.

**Set of features**

We use four features:

| Feature | Description | Dimension |
| --- | --- | --- |
| `gold_diff` | Difference in total accumulated gold between Blue and Red | `(num_samples, max_frames)` |
| `kills` | Difference in kills between Blue and Red | `(num_samples, max_frames)` |
| `buildings` | Difference in buildings destroyed (for five building types) between Blue and Red | `(num_samples, max_frames, 5)` |
| `monsters` | Difference in elite monsters destroyed (for seven monster types) between Blue and Red | `(num_samples, max_frames, 7)` |

where

`num_samples` is the number of matches in our data set and

`max_frames` is the max number of frames across samples (i.e., longest duration in minutes)

**Scaling gold and combining features into single tensor**

- We apply min-max scaling to `gold_diff`, mapping it to the range `[-1, 1]`.

- For ease of use in the model, we combine the four features into a single tensor with dimension

    `(num_samples, max_frames, 14 = 1 + 1 + 5 + 7)`
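
The implementation of `feature_calculators.rescale_tensor` is not shown in this notebook, so the following is only a minimal sketch of min-max scaling to `[-1, 1]`, assuming a single global minimum and maximum over the whole array. The function name `rescale_to_unit_range` and the toy values are hypothetical.

```python
import numpy as np

def rescale_to_unit_range(x):
    """Min-max scale an array to [-1, 1] using its global min and max."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # constant input: avoid dividing by zero
        return np.zeros_like(x)
    return 2.0 * (x - lo) / (hi - lo) - 1.0

# toy gold differences (Blue minus Red), not taken from the data set
toy_gold_diff = np.array([[-3000.0, 0.0], [1000.0, 5000.0]])
scaled = rescale_to_unit_range(toy_gold_diff)
print(scaled)
```

Note that with plain min-max scaling a gold difference of zero does not generally map to zero (here it maps to `-0.25`). Dividing by the largest absolute value instead would preserve the sign of the lead; which of the two the library helper does is not known from this notebook.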

In [52]:
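gold_diff_rescaled = feature_calculators.rescale_tensor(gold_diff)
# NOTE: the reshape below uses the unscaled gold_diff, so X_all is built from
# raw gold values; reshape gold_diff_rescaled instead to feed the model the
# scaled values described above.
gold_diff_reshape = gold_diff.reshape((gold_diff.shape[0], gold_diff.shape[1], 1))
kills_reshape = kills.reshape((kills.shape[0], kills.shape[1], 1))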

In [57]:
print(gold_diff_reshape.shape)
print(kills_reshape.shape)
print(buildings.shape)
print(monsters.shape)
X_all = np.concatenate((gold_diff_reshape, kills_reshape, buildings, monsters), axis=2)
print(X_all.shape)

(54494, 72, 1)
(54494, 72, 1)
(54494, 72, 5)
(54494, 72, 7)
(54494, 72, 14)


# Running logistic regression

In [59]:
# minutes in game to analyze
start = 20
end = 23

# how many prior minutes to include
trailing_window = 4 

acc_by_frame = []
auc_by_frame = []
acc_train_by_frame = []
acc_test_by_frame = []
N_by_frame = []
coef = []

#base_fpr = np.linspace(0, 1, 101)

for frame in range(start, end):

    earliest_frame = max(0, frame - trailing_window)

    # filter out matches that have ended by this minute
    matches = np.where((num_frames > frame))[0]

    X = X_all[matches][:, earliest_frame:frame + 1, ]

    X = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))

    Y = Y_all[matches]

    train_X, test_X, train_Y, test_Y = train_test_split(
        X, Y, test_size=0.2, random_state=42, stratify=Y)

    try:
        acc_train, Y_pred, Y_pred_prob, model_coef = run_logistic_regression(
            train_X, train_Y, test_X, cv=5)

        coef.append(model_coef)

        Y_pred_prob = Y_pred_prob[:, 1]
        acc_test = accuracy_score(test_Y, Y_pred) * 100
#         fpr, tpr, thresholds = metrics.roc_curve(
#             test_Y, Y_pred_prob, pos_label=1)
#         tpr = interp(base_fpr, fpr, tpr)
#         tpr[0] = 0.0
        auc = metrics.roc_auc_score(test_Y, Y_pred_prob)
        print(
            'frame: %d, train: %0.2f, test: %0.2f, AUC: %0.2f, N: %d\n' %
            (frame, acc_train, acc_test, auc, len(X)),
            end='')
        
        auc_by_frame.append(auc)
        acc_train_by_frame.append(acc_train)
        acc_test_by_frame.append(acc_test)
        N_by_frame.append(len(X))

    except Exception as exc:
        # report the failure instead of silently swallowing it
        print('frame: ', frame, '...issue:', exc)

coef = np.array([np.array(x[0]) for x in coef])

frame: 20, train: 81.05, test: 81.62, AUC: 0.90, N: 54494
frame: 21, train: 82.51, test: 82.37, AUC: 0.91, N: 54160
frame: 22, train: 82.94, test: 82.72, AUC: 0.91, N: 51158
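
The loop above builds each design matrix by slicing a trailing window of frames out of the feature tensor and flattening it to one row per match. A minimal sketch of that slicing on a toy tensor follows; the toy sizes and the name `X_toy` are illustrative assumptions, not values from the data set.

```python
import numpy as np

# toy tensor: 3 matches, 6 frames, 2 features per frame
num_matches, max_frames, num_features = 3, 6, 2
X_toy = np.arange(num_matches * max_frames * num_features).reshape(
    num_matches, max_frames, num_features)

frame = 4
trailing_window = 2
earliest_frame = max(0, frame - trailing_window)

# keep frames earliest_frame..frame (inclusive), then flatten per match
window = X_toy[:, earliest_frame:frame + 1, :]
flat = window.reshape(window.shape[0], -1)

print(window.shape)  # (3, 3, 2)
print(flat.shape)    # (3, 6)
```

With `trailing_window = 4` and 14 features per frame, as in the cell above, each row has `5 * 14 = 70` columns once `frame >= 4`.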


In [22]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV 
from sklearn.svm import SVC as svc 
from sklearn.metrics import make_scorer, roc_auc_score, accuracy_score
from scipy import stats


mdl = svc(probability = True, random_state = 1)
auc = make_scorer(roc_auc_score)

In [24]:
lag = 1
frame = 25

print(lag)
acc_train_list = []
acc_test_list = []
valid_frames = []
roc_curves = []
auc = []
sample_size = []
coef = []

earliest_frame = max(0,frame-lag)

#X = generate_X(gold_diff_rescaled, kills, monsters, buildings, earliest_frame, frame, valid_games)
X = train_X[:, earliest_frame:frame+1, ]

X = X.reshape((X.shape[0],X.shape[1]*X.shape[2]))
indices = np.where(
    (num_frames[valid_games][indices_train] > frame) #&
#           (buildings[:,-3] == 0)
)[0]

Y = train_y[indices]
X = X[indices]
print(X.shape, Y.shape)


# C_range = 10. ** np.arange(-3, 8)
# gamma_range = 10. ** np.arange(-5, 4)

# param_grid = dict(gamma=gamma_range, C=C_range)

# grid = GridSearchCV(svc(), param_grid=param_grid, verbose=2)

# grid.fit(X, Y)

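# Random search over 20 parameter combinations.
# stats.uniform(loc, scale) samples from [loc, loc + scale],
# so C is drawn from [2, 12] and gamma from [0.1, 1.1].
rand_list = {"C": stats.uniform(2, 10),
             "gamma": stats.uniform(0.1, 1)}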
              
rand_search = RandomizedSearchCV(svc(), param_distributions = rand_list, n_iter = 20, n_jobs = 4, cv = 3, random_state = 2017, scoring = make_scorer(accuracy_score),
                                verbose = 5)
rand_search.fit(X, Y) 
rand_search.cv_results_

1
(33101, 28) (33101,)
Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] C=2.209602254061174, gamma=0.8670701646824878 ...................
[CV] C=2.209602254061174, gamma=0.8670701646824878 ...................
[CV] C=2.209602254061174, gamma=0.8670701646824878 ...................
[CV] C=6.479197998006022, gamma=0.22054161556730448 ..................


KeyboardInterrupt: 

In [27]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' is equivalent to 'sqrt' for classifiers and was removed in recent scikit-learn releases)
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}


In [28]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=5, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X, Y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed: 25.3min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 56.0min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed: 99.7min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 105.2min finished


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid=True, n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring=None, verbose=5)

In [38]:
type(rf_random)

sklearn.model_selection._search.RandomizedSearchCV

In [50]:
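import pandas as pd

# use a distinct name so the np.load handle `df` from the loading step is not shadowed
cv_results_df = pd.DataFrame(data = rf_random.cv_results_)



In [53]:
fout = 'rf_randomsearch_frame25.json'
# DataFrame.to_json already produces JSON, so write it directly;
# passing its output through json.dump would encode it a second time
cv_results_df.to_json(fout)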

In [35]:
for i, x in enumerate(rf_random.cv_results_['mean_test_score']):
    print(i, '%0.4f' % (x))

0 0.8314
1 0.8333
2 0.8326
3 0.8321
4 0.8334
5 0.8334
6 0.8306
7 0.8295
8 0.8282
9 0.8327
10 0.8334
11 0.8293
12 0.8324
13 0.8323
14 0.8324
15 0.8328
16 0.8300
17 0.8322
18 0.8323
19 0.8327
20 0.8310
21 0.8299
22 0.8336
23 0.8332
24 0.8295
25 0.8285
26 0.8285
27 0.8340
28 0.8329
29 0.8331
30 0.8330
31 0.8307
32 0.8295
33 0.8345
34 0.8336
35 0.8328
36 0.8331
37 0.8317
38 0.8327
39 0.8328
40 0.8324
41 0.8334
42 0.8331
43 0.8331
44 0.8301
45 0.8310
46 0.8329
47 0.8326
48 0.8325
49 0.8305
50 0.8302
51 0.8317
52 0.8325
53 0.8326
54 0.8309
55 0.8327
56 0.8331
57 0.8337
58 0.8314
59 0.8336
60 0.8304
61 0.8334
62 0.8330
63 0.8319
64 0.8344
65 0.8335
66 0.8330
67 0.8322
68 0.8312
69 0.8318
70 0.8324
71 0.8331
72 0.8332
73 0.8291
74 0.8330
75 0.8319
76 0.8338
77 0.8326
78 0.8331
79 0.8329
80 0.8322
81 0.8325
82 0.8334
83 0.8319
84 0.8325
85 0.8309
86 0.8316
87 0.8335
88 0.8344
89 0.8328
90 0.8335
91 0.8327
92 0.8325
93 0.8329
94 0.8327
95 0.8321
96 0.8331
97 0.8324
98 0.8333
99 0.8339
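
To pick the winning candidate out of a listing like the one above, one option is an `argmax` over the mean test scores. The sketch below uses only a four-entry excerpt copied from the printout (candidates 27, 33, 64 and 88); the variable names are illustrative.

```python
import numpy as np

# (candidate index -> mean_test_score), excerpted from the printout above
scores_excerpt = {27: 0.8340, 33: 0.8345, 64: 0.8344, 88: 0.8344}

indices = np.array(list(scores_excerpt.keys()))
scores = np.array(list(scores_excerpt.values()))

best_candidate = int(indices[np.argmax(scores)])
best_score = float(scores.max())
print(best_candidate, best_score)
```

On the fitted search object itself, `rf_random.best_index_`, `rf_random.best_params_` and `rf_random.best_score_` expose the same information directly. The full listing ranges only from 0.8282 to 0.8345, so the candidates differ by well under one percentage point of accuracy.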


In [53]:
# per-frame logistic regression with 5-fold stratified CV
# (X holds all 14 features per frame, not only the lead in total gold)

base_fpr = np.linspace(0, 1, 101)

all_train = []
all_test = []
all_auc = []

for lag in range(5,6):
    print(lag)
    acc_train_list = []
    acc_test_list = []
    valid_frames = []
    roc_curves = []
    auc = []
    sample_size = []
    coef = []
    for frame in range(40):
        earliest_frame = max(0,frame-lag)
        
        #X = generate_X(gold_diff_rescaled, kills, monsters, buildings, earliest_frame, frame, valid_games)
        X = train_X[:, earliest_frame:frame+1, ]

        X = X.reshape((X.shape[0],X.shape[1]*X.shape[2]))
        indices = np.where(
            (num_frames[valid_games][indices_train] > frame) #&
#           (buildings[:,-3] == 0)
        )[0]

        if len(indices) == 0:
            continue

        Y = train_y[indices]
        X = X[indices]
        print(X.shape, Y.shape)
        try:
            skf = StratifiedKFold(n_splits=5)
            skf.get_n_splits(X, Y)
            avg_acc_test = []
            avg_acc_train = []
            cur_roc_fpr = []
            cur_roc_tpr = []
            cur_auc = []
            cur_coef = []
            for train_indices, test_indices in skf.split(X, Y):
                X_train = X[train_indices]
                X_test = X[test_indices]
                y_train = Y[train_indices]
                y_test = Y[test_indices]

                #cur_acc_train, y_pred, y_pred_prob, model_coef = runDecisionTree(X_train, y_train, X_test, max_depth = 8)
                cur_acc_train, y_pred, y_pred_prob, model_coef = run_logistic_regression(X_train, y_train, X_test)
                #cur_acc_train, y_pred, y_pred_prob, model_coef = runRandomForest(X_train, y_train, X_test, max_depth = 5, n_estimators=100)
                cur_coef.append(model_coef)
                y_pred_prob = y_pred_prob[:,1]

                avg_acc_train.append(cur_acc_train)
                cur_acc_test = accuracy_score(y_test, y_pred)*100
                avg_acc_test.append(cur_acc_test)
                fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob, pos_label=1)
                tpr = np.interp(base_fpr, fpr, tpr)
                tpr[0] = 0.0
                cur_roc_tpr.append(tpr)
                cur_auc.append(metrics.roc_auc_score(y_test, y_pred_prob))
                
            coef.append(np.array(cur_coef).mean(axis=0))
            roc_curves.append([base_fpr,np.array(cur_roc_tpr).mean(axis=0)])
            auc.append(np.mean(cur_auc))
            acc_train = np.mean(avg_acc_train)
            acc_test = np.mean(avg_acc_test)
            acc_train_list.append(acc_train)
            acc_test_list.append(acc_test)
            valid_frames.append(frame)
            print('frame: %d, train: %0.2f, test: %0.2f, AUC: %0.2f, N %d ' % (frame, acc_train, acc_test,
                                                                               np.mean(cur_auc), len(indices)), end='')
            print(X.shape)
        except Exception as exc:
            print('frame: ', frame, '...issue:', exc)
    all_train.append(acc_train_list)
    all_test.append(acc_test_list)
    all_auc.append(np.array(auc)*100)
    
coef = np.array([np.array(x[0]) for x in coef])

5
(43595, 14) (43595,)
frame: 0, train: 52.41, test: 52.41, AUC: 0.50, N 43595 (43595, 14)
(43595, 28) (43595,)
frame: 1, train: 52.63, test: 52.60, AUC: 0.51, N 43595 (43595, 28)
(43595, 42) (43595,)
frame: 2, train: 53.40, test: 53.45, AUC: 0.54, N 43595 (43595, 42)
(43595, 56) (43595,)
frame: 3, train: 57.76, test: 57.82, AUC: 0.61, N 43595 (43595, 56)
(43595, 70) (43595,)
frame: 4, train: 60.78, test: 60.74, AUC: 0.65, N 43595 (43595, 70)
(43595, 84) (43595,)
frame: 5, train: 62.42, test: 62.35, AUC: 0.67, N 43595 (43595, 84)
(43595, 84) (43595,)
frame: 6, train: 63.91, test: 63.83, AUC: 0.69, N 43595 (43595, 84)
(43595, 84) (43595,)
frame: 7, train: 65.63, test: 65.51, AUC: 0.71, N 43595 (43595, 84)
(43595, 84) (43595,)
frame: 8, train: 66.87, test: 66.83, AUC: 0.73, N 43595 (43595, 84)
(43595, 84) (43595,)
frame: 9, train: 68.26, test: 68.27, AUC: 0.75, N 43595 (43595, 84)
(43595, 84) (43595,)
frame: 10, train: 69.77, test: 69.72, AUC: 0.77, N 43595 (43595, 84)
(43595, 84) (43595
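
The cell above averages ROC curves across folds by interpolating each fold's true-positive rate onto a shared false-positive-rate grid (`base_fpr`). A minimal sketch of that step on two made-up curves, using `np.interp`, might look like this; the curve values are illustrative and not taken from the data.

```python
import numpy as np

base_fpr = np.linspace(0, 1, 101)

# two toy ROC curves as (fpr, tpr) pairs, one per fold
fold_curves = [
    (np.array([0.0, 0.2, 1.0]), np.array([0.0, 0.6, 1.0])),
    (np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.9, 1.0])),
]

interp_tprs = []
for fpr, tpr in fold_curves:
    # resample this fold's curve on the shared grid
    tpr_on_grid = np.interp(base_fpr, fpr, tpr)
    tpr_on_grid[0] = 0.0  # force the curve through the origin
    interp_tprs.append(tpr_on_grid)

# pointwise mean across folds
mean_tpr = np.mean(interp_tprs, axis=0)
print(mean_tpr[20])
```

Interpolating first is what makes the pointwise mean well defined, since each fold's curve is otherwise sampled at its own set of thresholds.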

In [28]:
max_length = max([len(x) for x in coef])
padded_coef = np.zeros((len(coef), max_length))
padded_coef.shape

(5, 70)

In [29]:
for i, row in enumerate(coef):
    padded_coef[i,:len(row)] = row

In [80]:
for i in range(10,-1,-1):
    print(i)

10
9
8
7
6
5
4
3
2
1
0


In [38]:
test_Y.shape

NameError: name 'test_Y' is not defined

In [74]:
predictions = []
for lag in range(5,6):
    acc_train_list = []
    acc_test_list = []
    valid_frames = []
    roc_curves = []
    auc = []
    sample_size = []
    for frame in range(20):
        earliest_frame = max(0,frame-lag)

        X = test_X[:, earliest_frame:frame+1, ]

        X = X.reshape((X.shape[0],X.shape[1]*X.shape[2]))
        
        indices = np.where(
            (num_frames[valid_games][indices_test] > frame)
        )[0]

        if len(indices) == 0:
            continue

        Y = test_y[indices]
        X = X[indices]
        BX = 1 / (1 + np.exp(-1 * np.einsum('ij,j->i',X,coef[frame])))
        auc = metrics.roc_auc_score(Y, BX)
        acc = accuracy_score(Y, (BX>=0.5)*1)*100
        predictions.append(BX)
        print(frame, Y.shape, BX.shape, 'acc: %0.2f, auc: %0.2f' % (acc, auc))

0 (10899,) (10899,) acc: 47.59, auc: 0.50
1 (10899,) (10899,) acc: 48.87, auc: 0.51
2 (10899,) (10899,) acc: 53.52, auc: 0.55
3 (10899,) (10899,) acc: 57.59, auc: 0.61
4 (10899,) (10899,) acc: 60.40, auc: 0.65
5 (10899,) (10899,) acc: 62.29, auc: 0.67
6 (10899,) (10899,) acc: 63.89, auc: 0.69
7 (10899,) (10899,) acc: 65.59, auc: 0.71
8 (10899,) (10899,) acc: 66.70, auc: 0.73
9 (10899,) (10899,) acc: 68.05, auc: 0.75
10 (10899,) (10899,) acc: 69.67, auc: 0.77
11 (10899,) (10899,) acc: 71.36, auc: 0.79
12 (10899,) (10899,) acc: 72.74, auc: 0.80
13 (10899,) (10899,) acc: 73.82, auc: 0.82
14 (10899,) (10899,) acc: 74.58, auc: 0.83
15 (10899,) (10899,) acc: 75.74, auc: 0.84
16 (10899,) (10899,) acc: 76.91, auc: 0.85
17 (10899,) (10899,) acc: 78.14, auc: 0.86
18 (10899,) (10899,) acc: 79.46, auc: 0.87
19 (10899,) (10899,) acc: 80.30, auc: 0.89
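
The cell above scores the held-out matches by applying the stored coefficients through a sigmoid by hand. It uses only `coef`, and `run_logistic_regression` does not return the model's intercept, so these probabilities can differ from `predict_proba` whenever the fitted intercept is nonzero. A minimal sketch that includes the intercept follows; the helper name, weights and intercept are made-up illustrative values, not fitted ones.

```python
import numpy as np

def predict_proba_manual(X, weights, intercept=0.0):
    """Probability of class 1 under a logistic model: sigmoid(X @ w + b)."""
    z = X @ weights + intercept
    return 1.0 / (1.0 + np.exp(-z))

# toy inputs; weights and intercept are illustrative, not fitted coefficients
X_toy = np.array([[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]])
weights = np.array([0.8, -0.4])
intercept = 0.1

probs = predict_proba_manual(X_toy, weights, intercept)
labels = (probs >= 0.5).astype(int)
print(probs, labels)
```

One possible reading of the frame 0 result (47.59% here versus 52.41% in cross-validation) is that, without an intercept, near-zero features give a probability of 0.5 and every match is assigned class 1. This is a guess; the outputs alone do not confirm it.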


In [30]:
[1]+list(range(10))

[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [68]:
predictions_cum = np.cumsum(predictions,axis=0)
predictions_cum.shape

(20, 10899)

In [69]:
for i in range(len(predictions_cum)):
    predictions_cum[i] = predictions_cum[i] / (i+1)
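
The two cells above turn per-frame probabilities into a running mean over all frames seen so far. The same computation can be sketched in vectorized form on a toy array; the probabilities below are made up for illustration.

```python
import numpy as np

# rows are frames, columns are matches
per_frame_probs = np.array([
    [0.50, 0.50],
    [0.60, 0.40],
    [0.70, 0.10],
])

# divide the cumulative sum at frame i by the number of frames so far (i + 1)
frame_counts = np.arange(1, per_frame_probs.shape[0] + 1)[:, None]
running_mean = np.cumsum(per_frame_probs, axis=0) / frame_counts
print(running_mean)
```

Because early frames carry little signal, an unweighted running mean pulls later predictions toward 0.5. That is consistent with the averaged predictions scoring lower at frame 19 (76.39%) than the per-frame predictions (80.30%) in the outputs of this notebook.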

In [73]:
for lag in range(5,6):
    for frame in range(20):
        auc = metrics.roc_auc_score(test_y, predictions_cum[frame])
        acc = accuracy_score(test_y, (predictions_cum[frame]>=0.5)*1)*100
        print(frame, 'acc: %0.2f, auc: %0.2f' % (acc, auc))

0 acc: 47.59, auc: 0.50
1 acc: 48.87, auc: 0.51
2 acc: 53.56, auc: 0.55
3 acc: 57.73, auc: 0.61
4 acc: 60.33, auc: 0.64
5 acc: 61.88, auc: 0.66
6 acc: 62.84, auc: 0.68
7 acc: 63.97, auc: 0.69
8 acc: 65.00, auc: 0.71
9 acc: 66.15, auc: 0.72
10 acc: 67.49, auc: 0.74
11 acc: 68.70, auc: 0.75
12 acc: 69.52, auc: 0.76
13 acc: 70.46, auc: 0.78
14 acc: 71.36, auc: 0.79
15 acc: 72.50, auc: 0.80
16 acc: 73.62, auc: 0.81
17 acc: 74.46, auc: 0.82
18 acc: 75.46, auc: 0.83
19 acc: 76.39, auc: 0.84
