# Preparing data for Logistic Regression

In [64]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

In [65]:
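# DataFrame.from_csv has been removed from pandas; read_csv with index_col=0 keeps the first column (shot_id) as the index.
kobe = pd.read_csv("./data/kobe_new_variables.csv", sep=",", index_col=0)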

## Converting variables to category type for better summarization

For the benefits of converting variables to the categorical dtype, see: http://pandas.pydata.org/pandas-docs/stable/categorical.html
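One benefit is easy to check on a toy column: a categorical stores each distinct label once plus small integer codes, so repeated strings take far less memory. A minimal sketch, with a made-up `shots` Series standing in for a real column:

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times.
shots = pd.Series(["Jump Shot", "Layup Shot", "Dunk Shot"] * 10000)

# deep=True counts the memory of the Python strings themselves.
as_object = shots.astype(object).memory_usage(deep=True)
as_category = shots.astype("category").memory_usage(deep=True)

print("object bytes:  ", as_object)
print("category bytes:", as_category)
```

The exact byte counts depend on the pandas version and platform, so none are quoted here.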

In [66]:
kobe.dtypes

action_type                object
combined_shot_type         object
game_event_id               int64
game_id                     int64
loc_x                       int64
loc_y                       int64
period                      int64
playoffs                    int64
season                     object
shot_distance               int64
shot_made_flag            float64
shot_zone_area             object
shot_zone_basic            object
shot_zone_range            object
opponent                   object
seconds_to_period_end       int64
accurate_shot_distance    float64
game_year                   int64
game_month                  int64
game_dayofweek              int64
game_dayofyear              int64
local                       int64
at_sza                     object
at_szb                     object
at_szr                     object
at_sd                      object
dtype: object

Before converting variables to categories, we drop 'season' (strings made up of digits). If we converted it to a category, as we should, we are not sure how LogisticRegression would handle it (it is neither int nor float). We could build dummy variables from it, but that is not worth the effort because we already have the 'game_year' variable (int).

In [67]:
kobe = kobe.drop("season", axis=1)
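Had we wanted to keep the information in 'season' as a number instead of dropping it, something like the following might work. This is only a sketch: the "YYYY-YY" layout of the labels is an assumption (the values are not shown above), and `season_start` is a hypothetical name.

```python
import pandas as pd

# Hypothetical season labels; the "YYYY-YY" layout is an assumption.
season = pd.Series(["1996-97", "2000-01", "2015-16"])

# Keep the part before the dash and cast it to an integer starting year.
season_start = season.str.split("-").str[0].astype(int)

print(season_start.tolist())
```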

First, we convert the non-numerical variables, and the numeric flags that behave like labels, to the 'category' dtype.

In [68]:
kobe["action_type"] = kobe["action_type"].astype("category")
kobe["combined_shot_type"] = kobe["combined_shot_type"].astype("category")
kobe["playoffs"] = kobe["playoffs"].astype("category")
kobe["shot_made_flag"] = kobe["shot_made_flag"].astype("category")
kobe["shot_zone_area"] = kobe["shot_zone_area"].astype("category")               
kobe["shot_zone_basic"] = kobe["shot_zone_basic"].astype("category")             
kobe["shot_zone_range"] = kobe["shot_zone_range"].astype("category")
kobe["opponent"] = kobe["opponent"].astype("category")
kobe["local"] = kobe["local"].astype("category")
kobe["at_sza"] = kobe["at_sza"].astype("category")                          
kobe["at_szb"] = kobe["at_szb"].astype("category")    
kobe["at_szr"] = kobe["at_szr"].astype("category")    
kobe["at_sd"] = kobe["at_sd"].astype("category")                           
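The same conversion can be written in one call, since `DataFrame.astype` accepts a mapping from column name to dtype. A minimal sketch on a toy frame (the values and the shortened column list are made up):

```python
import pandas as pd

# Toy frame standing in for `kobe`; values are made up.
df = pd.DataFrame({
    "action_type": ["Jump Shot", "Dunk Shot", "Jump Shot"],
    "playoffs": [0, 1, 0],
    "loc_x": [10, -5, 32],
})

to_category = ["action_type", "playoffs"]

# One astype call converts every listed column; the others are untouched.
df = df.astype({col: "category" for col in to_category})

print(df.dtypes)
```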

We check the categories of the 'shot_made_flag' target.

In [69]:
kobe["shot_made_flag"].cat.categories

Float64Index([0.0, 1.0], dtype='float64')

We also convert some numerical variables to categories, specifying an order on them.

In [70]:
kobe["game_event_id"] = pd.Categorical(kobe["game_event_id"], ordered=True)
kobe["game_id"] = pd.Categorical(kobe["game_id"], ordered=True)
kobe["period"] = pd.Categorical(kobe["period"], ordered=True)
kobe["game_year"] = pd.Categorical(kobe["game_year"], ordered=True)
kobe["game_month"] = pd.Categorical(kobe["game_month"], ordered=True)
kobe["game_dayofweek"] = pd.Categorical(kobe["game_dayofweek"], ordered=True)
kobe["game_dayofyear"] = pd.Categorical(kobe["game_dayofyear"], ordered=True)
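What the order buys us can be seen on a small example: an ordered categorical supports `min`, `max` and inequality comparisons, while an unordered one only supports equality tests. A minimal sketch with made-up period values:

```python
import pandas as pd

period = pd.Series([3, 1, 4, 2, 1])  # made-up values

unordered = period.astype("category")
ordered = pd.Series(pd.Categorical(period, ordered=True))

# Ordered categories know that 1 < 2 < 3 < 4.
print(ordered.min(), ordered.max())
later_than_2 = (ordered > 2).tolist()
print(later_than_2)

# The same comparison on an unordered categorical raises TypeError.
try:
    unordered > 2
    unordered_comparable = True
except TypeError:
    unordered_comparable = False
print("unordered supports '>':", unordered_comparable)
```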

Let's take a look at some basic statistics by dtype.

In [71]:
kobe.describe(include=["category"])

| | action_type | combined_shot_type | game_event_id | game_id | period | playoffs | shot_made_flag | shot_zone_area | shot_zone_basic | shot_zone_range | opponent | game_year | game_month | game_dayofweek | game_dayofyear | local | at_sza | at_szb | at_szr | at_sd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 30697 | 30697 | 30697 | 30697 | 30697 | 30697 | 25697 | 30697 | 30697 | 30697 | 30697 | 30697 | 30697 | 30697 | 30697 | 30697 | 30697 | 30697 | 30697 | 30697 |
| unique | 57 | 6 | 620 | 1559 | 7 | 2 | 2 | 6 | 7 | 5 | 33 | 21 | 9 | 7 | 237 | 2 | 122 | 130 | 105 | 514 |
| top | Jump Shot | Jump Shot | 2 | 21501228 | 3 | 0 | 0 | Center(C) | Mid-Range | Less Than 8 ft. | SAS | 2009 | 3 | 6 | 346 | 0 | Jump Shot-Center(C) | Jump Shot-Mid-Range | Jump Shot-16-24 ft. | Layup Shot-0 |
| freq | 18880 | 23485 | 132 | 50 | 8296 | 26198 | 14232 | 13455 | 12625 | 9398 | 1978 | 2357 | 5132 | 6907 | 257 | 15741 | 4742 | 9797 | 7060 | 1958 |


In [72]:
kobe.describe(include=["number"])

| | loc_x | loc_y | shot_distance | seconds_to_period_end | accurate_shot_distance |
|---|---|---|---|---|---|
| count | 30697.0 | 30697.0 | 30697.0 | 30697.0 | 30697.0 |
| mean | 7.110499 | 91.107535 | 13.437437 | 321.502525 | 13.846562 |
| std | 110.124578 | 87.791361 | 9.374189 | 208.175176 | 9.491986 |
| min | -250.0 | -44.0 | 0.0 | 0.0 | 0.0 |
| 25% | -68.0 | 4.0 | 5.0 | 142.0 | 5.3 |
| 50% | 0.0 | 74.0 | 15.0 | 304.0 | 15.4 |
| 75% | 95.0 | 160.0 | 21.0 | 498.0 | 21.1 |
| max | 248.0 | 791.0 | 79.0 | 714.0 | 79.2 |


## Making dummy variables

sklearn's LogisticRegression model doesn't accept string values. OneHotEncoder is the standard sklearn approach for converting categorical features into numerical ones (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), but it returned errors when reading str values. LabelEncoder is much easier to apply, but the integer codes it creates might not work well in the prediction model (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). So let's use the pandas get_dummies function to create dummy variables.
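The difference between the two encodings can be sketched on a few values. Integer codes (what a label encoder produces) are compact, but a linear model would treat 0 < 1 < 2 as a real ordering; dummy columns avoid that. The zone labels below are illustrative only:

```python
import pandas as pd

# Illustrative labels, not the full set of levels in the data.
zone = pd.Series(
    ["Center(C)", "Left Side(L)", "Right Side(R)", "Center(C)"],
    name="shot_zone_area",
)

# Integer codes: one column, but the numbers imply an order that
# does not exist between court zones.
codes = zone.astype("category").cat.codes

# Dummy variables: one 0/1 column per level, no implied order.
dummies = pd.get_dummies(zone, prefix="shot_zone_area", dtype=int)

print(codes.tolist())
print(dummies)
```

Each row of `dummies` has exactly one 1, in the column of its level.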

In [88]:
old_cat_variables = ["action_type", "combined_shot_type", "shot_zone_area", "shot_zone_basic", "shot_zone_range", "opponent"]

In [112]:
kobe_old_cat_vars = kobe[old_cat_variables]

In [113]:
kobe_old_cat_vars.shape

(30697, 6)

In [117]:
kobe_sparse_old_catvars = pd.get_dummies(kobe_old_cat_vars)

In [119]:
kobe_sparse_old_catvars.shape

(30697, 114)

In [114]:
new_cat_variables = ["at_sza", "at_szb", "at_szr", "at_sd"]

In [115]:
kobe_new_cat_vars = kobe[new_cat_variables]

In [116]:
kobe_new_cat_vars.shape

(30697, 4)

In [118]:
kobe_sparse_new_catvars = pd.get_dummies(kobe_new_cat_vars)

In [120]:
kobe_sparse_new_catvars.shape

(30697, 871)
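The two widths can be cross-checked against the "unique" row of the categorical summary above: get_dummies creates one column per level, so the column count is the sum of the levels of the input variables. A small sketch of that arithmetic, plus the same rule on a made-up toy frame:

```python
import pandas as pd

# Level counts copied from the "unique" row of describe(include=["category"]).
old_levels = {"action_type": 57, "combined_shot_type": 6, "shot_zone_area": 6,
              "shot_zone_basic": 7, "shot_zone_range": 5, "opponent": 33}
new_levels = {"at_sza": 122, "at_szb": 130, "at_szr": 105, "at_sd": 514}

old_total = sum(old_levels.values())
new_total = sum(new_levels.values())
print(old_total, new_total)  # 114 871

# Same rule on a toy frame: dummy columns == sum of distinct values.
toy = pd.DataFrame({"a": ["x", "y", "x"], "b": ["p", "q", "r"]})
toy_width = pd.get_dummies(toy).shape[1]
print(toy_width, toy.nunique().sum())
```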

## Reducing the number of variables with Boruta and Random Forest Classifier

#### Note: incomplete.

We are currently handling slightly fewer than 1,000 variables, so there is a risk that our model will not pick out the significant ones. We therefore apply an all-relevant feature selection method, Boruta, closely following the example at https://github.com/danielhomola/boruta_py.
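The core idea behind Boruta can be sketched in a few lines: add a shuffled "shadow" copy of every feature, fit a random forest, and keep the real features whose importance beats the best shadow. The code below is a simplified single iteration on made-up data, not the boruta_py implementation, which repeats this many times and applies a statistical test to the hit counts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))      # made-up data
y = (X[:, 0] > 0).astype(int)      # only column 0 carries signal

# Shadow features: each column shuffled on its own, which keeps its
# distribution but breaks any link with the target.
shadows = np.column_stack(
    [rng.permutation(X[:, j]) for j in range(X.shape[1])]
)

forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(np.hstack([X, shadows]), y)

n_real = X.shape[1]
real_importance = forest.feature_importances_[:n_real]
best_shadow = forest.feature_importances_[n_real:].max()

# A real feature scores a "hit" when it beats the best shadow.
hits = real_importance > best_shadow
print(hits)
```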

In [1]:
from sklearn.ensemble import RandomForestClassifier
from boruta_py import BorutaPy

In [125]:
kobe_target = kobe["shot_made_flag"]

In [126]:
kobe_target

shot_id
1       NaN
2         0
3         1
4         0
5         1
6         0
7         1
8       NaN
9         1
10        0
11        0
12        1
13        1
14        0
15        0
16        0
17      NaN
18        1
19        0
20      NaN
21        0
22        0
23        1
24        1
25        1
26        0
27        0
28        0
29        0
30        0
         ..
30668     0
30669   NaN
30670     0
30671     0
30672     0
30673     1
30674     0
30675     1
30676     0
30677     1
30678     0
30679     0
30680     0
30681   NaN
30682     1
30683   NaN
30684     0
30685     0
30686     0
30687   NaN
30688     0
30689     1
30690     0
30691     0
30692     0
30693     0
30694   NaN
30695     1
30696     0
30697     0
Name: shot_made_flag, dtype: category
Categories (2, float64): [0, 1]

In [128]:
kobe_socv = pd.concat([kobe_sparse_old_catvars, kobe_target], axis=1)

In [129]:
kobe_socv["shot_made_flag"]

shot_id
1       NaN
2         0
3         1
4         0
5         1
6         0
7         1
8       NaN
9         1
10        0
11        0
12        1
13        1
14        0
15        0
16        0
17      NaN
18        1
19        0
20      NaN
21        0
22        0
23        1
24        1
25        1
26        0
27        0
28        0
29        0
30        0
         ..
30668     0
30669   NaN
30670     0
30671     0
30672     0
30673     1
30674     0
30675     1
30676     0
30677     1
30678     0
30679     0
30680     0
30681   NaN
30682     1
30683   NaN
30684     0
30685     0
30686     0
30687   NaN
30688     0
30689     1
30690     0
30691     0
30692     0
30693     0
30694   NaN
30695     1
30696     0
30697     0
Name: shot_made_flag, dtype: category
Categories (2, float64): [0, 1]

In [130]:
kobe_socv = kobe_socv.dropna()

In [132]:
kobe_socv.shape

(25697, 115)
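The drop removes 30697 - 25697 = 5000 rows, which matches the number of missing 'shot_made_flag' values in the summary above (the target is the only column whose count is below 30697). If those unlabeled rows are needed later, a boolean mask keeps both parts. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy frame: two rows have no recorded outcome.
toy = pd.DataFrame(
    {"feature": [1, 0, 1, 0],
     "shot_made_flag": [np.nan, 0.0, 1.0, np.nan]},
    index=pd.Index([1, 2, 3, 4], name="shot_id"),
)

has_label = toy["shot_made_flag"].notna()
labeled = toy[has_label]      # the rows dropna() keeps in this frame
unlabeled = toy[~has_label]   # the rows whose outcome is unknown

print(labeled.shape, unlabeled.shape)
```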

In [134]:
y=kobe_socv.pop("shot_made_flag")
X=kobe_socv.copy()

First we will instantiate an estimator that Boruta will use. Then we will instantiate a Boruta Object.

In [146]:
# class_weight="auto" was removed from scikit-learn; "balanced" is its replacement.
rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=8)
feat_selector = BorutaPy(rf, n_estimators="auto", verbose=2)

Once built, we can use this object to identify the relevant features in our dataset.

In [147]:
feat_selector.fit(X,y)

TypeError: unhashable type
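One plausible cause, which we have not verified on this dataset, is that X and y are passed as pandas objects: the boruta_py README notes that BorutaPy accepts NumPy arrays only. A sketch of the conversion on toy stand-ins for X and y (the final fit call is left commented out because it is untested here):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X (dummy columns) and y (categorical 0.0/1.0 target).
X = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 0]})
y = pd.Series([0.0, 1.0, 0.0]).astype("category")

# Plain NumPy arrays; the target is cast from category to int first.
X_arr = X.to_numpy()
y_arr = y.astype(int).to_numpy()

print(type(X_arr).__name__, X_arr.shape, y_arr.dtype)

# feat_selector.fit(X_arr, y_arr)  # assumption: not run against this data
```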

We check the selected features.

In [None]:
print(feat_selector.support_)

We check the ranking of the features.

In [None]:
print(feat_selector.ranking_)

We call transform() on X to filter it down to the selected features.

In [None]:
X_filtered = feat_selector.transform(X)

_________________________________________________________________________________________________________________________

Since we were not able to reduce the number of variables, the model may well be less effective. We will therefore try different groupings of the variables we currently have, and export three datasets: the first with the original variables containing numerical data, the second with the dummy columns of the original categorical variables, and the third with the dummy columns of the newly created categorical variables.

In [150]:
# Each column is listed once; repeating a name would duplicate that column in the result.
kobe = kobe[["game_event_id", "game_id", "loc_x", "loc_y", "period", "playoffs", "shot_distance", "shot_made_flag",
             "seconds_to_period_end", "accurate_shot_distance", "game_year", "game_month", "game_dayofweek",
             "game_dayofyear", "local"]]

In [151]:
kobe.to_csv("./data/kobe_num_variables.csv", sep= ",")

In [152]:
kobe_sparse_old_catvars.to_csv("./data/kobe_old_cat_variables.csv", sep= ",")

In [153]:
kobe_sparse_new_catvars.to_csv("./data/kobe_new_cat_variables.csv", sep= ",")
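One thing to keep in mind when reading these files back: CSV stores no dtype information, so the category columns come back as plain integers or strings and have to be converted again. A minimal sketch of the round trip, using an in-memory buffer and a made-up frame instead of the real files:

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"period": pd.Categorical([1, 2, 1], ordered=True),
     "loc_x": [10, -5, 32]},
    index=pd.Index([1, 2, 3], name="shot_id"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

restored = pd.read_csv(buffer, index_col="shot_id")
dtype_after_reload = str(restored["period"].dtype)
print(dtype_after_reload)  # the categorical dtype is gone

# Re-apply the ordered categorical dtype after loading.
restored["period"] = pd.Categorical(restored["period"], ordered=True)
print(restored.dtypes)
```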