Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 3

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [X] Continue to iterate on your project: data cleaning, exploration, feature engineering, modeling.
- [ ] Make at least 1 partial dependence plot to explain your model.
- [ ] Share at least 1 visualization on Slack.

(If you have not yet completed an initial model for your portfolio project, then do today's assignment using your Tanzania Waterpumps model.)
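
As a reminder of what a partial dependence plot actually computes, here is a minimal sketch that uses only the standard library. `toy_predict_proba` is a hypothetical stand-in for a fitted model's positive-class probability, and the four rows are made-up data, so treat this as an illustration of the averaging step rather than a replacement for pdpbox.

```python
import math


def toy_predict_proba(row):
    """Hypothetical fitted model: probability rises with feature 0, falls with feature 1."""
    score = 0.8 * row[0] - 0.5 * row[1]
    return 1.0 / (1.0 + math.exp(-score))


def partial_dependence(predict, rows, feature_idx, grid):
    """Average prediction when `feature_idx` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)           # copy so the original data is not mutated
            modified[feature_idx] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Made-up rows with two numeric features each.
rows = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]]
grid = [0.0, 1.0, 2.0, 3.0]

pdp_curve = partial_dependence(toy_predict_proba, rows, feature_idx=0, grid=grid)
for value, avg in zip(grid, pdp_curve):
    print(f"feature_0 = {value:.1f} -> mean predicted probability {avg:.3f}")
```

Plotting `grid` against `pdp_curve` gives the one-feature PDP. Every other feature keeps its observed values, which is why the curve describes an average effect and can hide interactions.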

## Stretch Goals
- [ ] Make multiple PDPs with 1 feature in isolation.
- [ ] Make multiple PDPs with 2 features in interaction. 
- [ ] Use Plotly to make a 3D PDP.
- [ ] Make PDPs with categorical feature(s). Use Ordinal Encoder, outside of a pipeline, to encode your data first. If there is a natural ordering, then take the time to encode it that way, instead of random integers. Then use the encoded data with pdpbox. Get readable category names on your plot, instead of integer category codes.
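
For the categorical stretch goal, one possible way to keep a natural ordering and still recover readable names is to build the mapping by hand. The sketch below is plain Python; the level names and `raw_values` are hypothetical, and the same dictionary could then be passed to whichever encoder you use.

```python
# Hypothetical ordered categorical feature; the level names are illustrative only.
ordered_levels = ["low", "medium", "high"]

# Encode by position so the integer codes respect the natural order.
level_to_code = {level: code for code, level in enumerate(ordered_levels)}
code_to_level = {code: level for level, code in level_to_code.items()}

raw_values = ["medium", "low", "high", "medium"]
encoded = [level_to_code[value] for value in raw_values]

# After plotting against the encoded column, swap the integer ticks for names,
# for example with matplotlib: ax.set_xticks(codes); ax.set_xticklabels(tick_labels)
codes = sorted(code_to_level)
tick_labels = [code_to_level[code] for code in codes]

print(encoded)      # [1, 0, 2, 1]
print(tick_labels)  # ['low', 'medium', 'high']
```

Keeping the inverse mapping next to the forward one means the plot labels cannot drift out of sync with the encoding.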

## Links
- [Christoph Molnar: Interpretable Machine Learning — Partial Dependence Plots](https://christophm.github.io/interpretable-ml-book/pdp.html) + [animated explanation](https://twitter.com/ChristophMolnar/status/1066398522608635904)
- [Kaggle / Dan Becker: Machine Learning Explainability — Partial Dependence Plots](https://www.kaggle.com/dansbecker/partial-plots)
- [Plotly: 3D PDP example](https://plot.ly/scikit-learn/plot-partial-dependence/#partial-dependence-of-house-value-on-median-age-and-average-occupancy)

In [1]:
import pandas as pd

import numpy as np

from scipy.stats import randint, uniform
import random as ran

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans 
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import f_classif, chi2, SelectKBest, SelectPercentile, SelectFpr, SelectFromModel
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score

import category_encoders as ce

from xgboost import XGBClassifier

import eli5
from eli5.sklearn import PermutationImportance

import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='sklearn')

In [3]:
df = pd.read_csv("vgcSun_hot.csv")

In [5]:
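# Hold out two blocks of rows as the test set; the remaining rows form the training set.
# Note: these slice bounds leave rows 903, 3614, and 4518 in neither set.
dfTe = pd.concat([df[:903], df[3615:4518]])
dfTr = pd.concat([df[904:3614], df[4519:9034]])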
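
Because Python slices exclude their stop index, hand-written split bounds are easy to get off by one. The following sketch checks the four slices used in this cell for overlaps and gaps without touching the data. It assumes the frame has 9034 rows, which is the upper bound in the last slice and not something confirmed by the notebook output.

```python
n_rows = 9034  # assumption: total number of rows in df

# The same bounds as the split in this cell, expressed as slice objects.
test_slices = [slice(None, 903), slice(3615, 4518)]
train_slices = [slice(904, 3614), slice(4519, 9034)]


def covered(slices, n):
    """Return the set of row positions selected by a list of slices."""
    positions = set()
    for s in slices:
        positions.update(range(*s.indices(n)))
    return positions


test_idx = covered(test_slices, n_rows)
train_idx = covered(train_slices, n_rows)

overlap = sorted(test_idx & train_idx)
missing = sorted(set(range(n_rows)) - test_idx - train_idx)

print("test rows:", len(test_idx))    # 1806
print("train rows:", len(train_idx))  # 7225
print("overlap:", overlap)            # []
print("in neither set:", missing)     # [903, 3614, 4518]
```

If the three skipped rows are unintentional, starting each training slice where the neighbouring test slice stops (for example `df[903:3615]`) would close the gaps, at the cost of slightly changing the scores reported later.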

In [4]:
target = "winner"
features = df.columns.drop(target)

In [28]:
# Columns are split by position; note that features[0] falls in neither group.
numeric_features = features[788:]
categorical_features = features[1:788]

In [29]:
class ColumnSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a group of columns based on a list.
    """
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.cols]

In [51]:
# Feature Selection Pipelines
numPipe = Pipeline( [
    ("ncol", ColumnSelector(numeric_features)),
    ("nkbe", SelectKBest(score_func=f_classif, k=111))
    ] )

catPipe = Pipeline( [
    ("ccol", ColumnSelector(categorical_features)),
    ("ckbe", SelectKBest(score_func=chi2, k=726))
    ] )

feats = FeatureUnion([('nums', numPipe), ('cats', catPipe)])

In [52]:
XGB = Pipeline([
    ("feats", feats),
    ("XGBoost", XGBClassifier(
        n_estimators=300,
        max_depth=7,
        learning_rate=0.5,
        n_jobs=-1,
    )),
])

Ran = Pipeline([
    ("feats", feats),
    ("Rand", RandomForestClassifier(n_estimators=100, n_jobs=-1, max_features=0.91)),
])

In [35]:
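# Note: this fits on dfTe (the smaller split); the next cell scores on dfTr,
# which is the reverse of the usual train/test roles.
Ran.fit(dfTe[features], dfTe[target])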

Pipeline(memory=None,
         steps=[('feats',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('nums',
                                                 Pipeline(memory=None,
                                                          steps=[('ncol',
                                                                  ColumnSelector(cols=Index(['team1_Nones', 'team1_Normals', 'team1_Fightings', 'team1_Flyings',
       'team1_Poisons', 'team1_Grounds', 'team1_Rocks', 'team1_Bugs',
       'team1_Ghosts', 'team1_Steels',
       ...
       'team2_Normal_immunity', 'team2_Fighting_imm...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                 

In [36]:
Ran.score(dfTr[features], dfTr[target])

0.5472664359861592

In [49]:
param_distributions = {
    # Keys follow <step>__<substep>__<parameter>, using the step names from the pipeline.
    # 'feats__nums__nimp__strategy': ['mean', 'median'],  # requires an imputer step named 'nimp'
    # 'feats__nums__nkbe__k': randint(1, len(numeric_features)),
    # 'feats__nums__nkbe__score_func',
    'feats__cats__ckbe__k': randint(1, len(categorical_features)),
    # 'feats__cats__ckbe__score_func',
    # 'Rand__max_depth': [5, 10, 15, 20, None],
    'Rand__max_features': uniform(0, 1),
    # 'Rand__max_leaf_nodes',
    # 'Rand__min_samples_leaf',
    # 'Rand__min_samples_split',
    # 'Rand__n_estimators': randint(50, 500),
}

search = RandomizedSearchCV(
    Ran, 
    param_distributions=param_distributions, 
    n_iter=100, 
    cv=3, 
    scoring='accuracy', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(df[features], df[target]);

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed:  6.4min
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:  7.3min
[Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  9.8min
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed: 11.1min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 12

In [50]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation Accuracy', search.best_score_)

Best hyperparameters {'Rand__max_features': 0.9133514547334742, 'feats__cats__ckbe__k': 726, 'feats__nums__nkbe__k': 111}
Cross-validation Accuracy 0.5904361301748948


In [55]:
# `model` is the RandomForestClassifier created in cell In [45].
scores = cross_val_score(model, df[features], df[target], cv=5, scoring='accuracy')
print('Accuracy for 5 folds:', scores.mean())

Accuracy for 5 folds: 0.5853378365134899


In [45]:
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model.fit(dfTr[features], dfTr[target])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [46]:
model.score(dfTe[features], dfTe[target])

0.5714285714285714
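
To get a rough sense of how much this test accuracy could move by chance alone, the sketch below computes a normal-approximation standard error. The row count of 1806 comes from the two test slices (903 + 903); the normal approximation and the 1.96 multiplier are assumptions of the sketch, not something the notebook establishes.

```python
import math

accuracy = 0.5714285714285714  # value returned by model.score on dfTe
n_test = 1806                  # 903 + 903 rows from the two test slices

# Binomial standard error of a proportion, using the normal approximation.
standard_error = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Approximate 95% interval around the observed accuracy.
low = accuracy - 1.96 * standard_error
high = accuracy + 1.96 * standard_error

print(f"accuracy = {accuracy:.3f}")
print(f"standard error = {standard_error:.4f}")
print(f"approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

Whether the lower end of the interval beats a naive baseline depends on the class balance of `winner`, which is not shown here; checking the majority-class frequency would settle that.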

In [44]:
scores = cross_val_score(model, df[features], df[target], cv=5, scoring='accuracy')
print('Accuracy for 5 folds:', scores)
scores.mean()

Accuracy for 5 folds: [0.59015487 0.59679204 0.56423034 0.56976744 0.58250277]


0.5806894912729447
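
The mean alone hides how much the folds disagree. As a small sketch, the fold scores printed above can be summarised with the standard library; the values are copied from the output, and using the sample standard deviation as the measure of spread is a choice made for this example.

```python
import statistics

# Fold accuracies copied from the cross_val_score output above.
fold_scores = [0.59015487, 0.59679204, 0.56423034, 0.56976744, 0.58250277]

mean_score = statistics.mean(fold_scores)
spread = statistics.stdev(fold_scores)  # sample standard deviation across folds

print(f"mean accuracy = {mean_score:.4f}")
print(f"std across folds = {spread:.4f}")
print(f"range = [{min(fold_scores):.4f}, {max(fold_scores):.4f}]")
```

Reporting the spread alongside the mean makes it easier to judge whether a later change to the model is a real improvement or within fold-to-fold noise.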