<a href="https://colab.research.google.com/github/worldwidekatie/DS-Unit-2-Kaggle-Challenge/blob/master/module4-classification-metrics/LS_DS_224_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

# Classification Metrics

## Assignment
- [ ] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.
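
For the confusion matrix item in the checklist above, a minimal sketch of the computation could look like the following. The label lists are made-up stand-ins for `y_val` and `pipeline.predict(X_val)`, and the three class names are assumed to match the `status_group` values in the waterpumps data; a heatmap could be drawn from the resulting table with any plotting library.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Assumed class names for status_group; the predictions below are made up.
labels = ['functional', 'functional needs repair', 'non functional']
y_true = ['functional', 'functional', 'non functional',
          'functional needs repair', 'non functional', 'functional']
y_hat = ['functional', 'non functional', 'non functional',
         'functional', 'non functional', 'functional']

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f'Actual {name}' for name in labels],
    columns=[f'Predicted {name}' for name in labels],
)
print(cm_df)
```

The diagonal holds the correct predictions, so `cm.trace() / cm.sum()` is the accuracy for the same predictions.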


## Stretch Goals

### Reading

- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] Stacking Ensemble. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [0]:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import accuracy_score
import sklearn as sk
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

In [84]:
train, val = train_test_split(train, random_state=42)
train.shape, val.shape

((44550, 41), (14850, 41))
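
The assignment compares scores against the majority class baseline. As a rough sketch, that baseline is the share of the most common label in the training split; the series below is made-up data standing in for `train['status_group']`.

```python
import pandas as pd

# Made-up labels standing in for train['status_group'].
status = pd.Series(
    ['functional'] * 5 + ['non functional'] * 4 + ['functional needs repair']
)

# Share of each class; the largest share is the accuracy obtained by
# always predicting that class.
shares = status.value_counts(normalize=True)
majority_class = shares.idxmax()
baseline_accuracy = shares.max()
print(majority_class, baseline_accuracy)
```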

In [0]:
def wrangle(X):
    X = X.copy()
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    # Days elapsed between each record and the reference date 2013-12-03
    X['recent_rec'] = pd.to_datetime('2013-12-03')
    X['date_days'] = (X['recent_rec'] - X['date_recorded']).dt.days

    # Fixing the high cardinality (disabled for this first attempt)
    # high_card = ['funder', 'installer', 'wpt_name', 'subvillage', 'ward',
    #              'scheme_name', 'lga', 'region', 'scheme_management',
    #              'extraction_type', 'management', 'source', 'extraction_type_group']
    # for i in high_card:
    #     top10 = X[i].value_counts()[:10].index
    #     X.loc[~X[i].isin(top10), i] = 'OTHER'
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
target = 'status_group'  # a string, so train[target] is a 1-D Series as scikit-learn expects

features = ['amount_tsh', 'date_days',
            'funder', 'gps_height', 'installer',
            'longitude', 'latitude', 'wpt_name', 'num_private', 'basin',
            'subvillage', 'region', 'region_code', 'district_code', 'lga', 'ward',
            'population', 'public_meeting', 'scheme_management', 'scheme_name',
            'permit', 'construction_year', 'extraction_type', 'extraction_type_group',
            'management', 'management_group', 'payment',
            'water_quality', 'quality_group', 'quantity', 'source',
            'source_class', 'waterpoint_type']

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
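
The `date_days` feature built in `wrangle` is a plain date subtraction. A small sketch with made-up dates shows the idea: subtracting a datetime column from a reference timestamp gives timedeltas, and `.dt.days` turns them into whole days.

```python
import pandas as pd

# Made-up dates; in the notebook the column is X['date_recorded'].
dates = pd.to_datetime(pd.Series(['2013-12-03', '2013-12-01', '2012-12-03']))
reference = pd.Timestamp('2013-12-03')

# Timestamp minus datetime column -> timedelta column -> integer days.
days = (reference - dates).dt.days
print(days.tolist())
```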

In [0]:
from sklearn.feature_selection import SelectKBest

In [71]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    IterativeImputer(),
    StandardScaler(),
    SelectKBest(),
    xgb.XGBClassifier(n_estimators=3500, learning_rate=.6)
)
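# With no entries here, the search has a single candidate (the pipeline as
# defined above), so it just cross-validates that one configuration.
param_distributions = {
    #'xgbclassifier__n_estimators': randint(10, 00),
    #'randomforestclassifier__max_depth': [20, 21, 22, 23, 24, 25],
}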

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=10, 
    cv=4, 
    scoring='accuracy', 
    verbose=1, 
    return_train_score=True
)

search.fit(X_train, y_train);

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 4 folds for each of 1 candidates, totalling 4 fits
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 11.5min finished


In [72]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation Accuracy', search.best_score_)
pipeline = search.best_estimator_
print("Train Accuracy:", pipeline.score(X_train, y_train))
print("Validation Accuracy:", pipeline.score(X_val, y_val))

Best hyperparameters {}
Cross-validation Accuracy 0.7490461079160149
Train Accuracy: 0.9403591470258137
Validation Accuracy: 0.7596632996632997
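
The keys in `param_distributions` follow scikit-learn's `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. The sketch below uses a two-step pipeline chosen only for illustration (not the notebook's full pipeline) to show how to list the valid keys, which also makes it easy to spot a key that belongs to a step the pipeline does not contain.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))

# make_pipeline names each step after its lowercased class name.
step_names = list(pipe.named_steps)
print(step_names)

# Tunable keys look like '<step name>__<parameter>'.
forest_keys = sorted(
    key for key in pipe.get_params()
    if key.startswith('randomforestclassifier__')
)
print(forest_keys[:5])
```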


# With these settings XGBoost overfits (train accuracy 0.94 vs. validation 0.76), so I'm setting it aside for now and trying a random forest next.

In [0]:
def wrangle(X):
    X = X.copy()
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    # Days elapsed between each record and the reference date 2013-12-03
    X['recent_rec'] = pd.to_datetime('2013-12-03')
    X['date_days'] = (X['recent_rec'] - X['date_recorded']).dt.days

    # Fixing the high cardinality: keep each column's 10 most frequent values
    # (counted within the frame passed in) and label everything else 'OTHER'
    high_card = ['funder', 'installer', 'wpt_name', 'subvillage', 'ward',
                 'scheme_name', 'lga', 'region', 'scheme_management',
                 'extraction_type', 'management', 'source', 'extraction_type_group']
    for col in high_card:
        top10 = X[col].value_counts()[:10].index
        X.loc[~X[col].isin(top10), col] = 'OTHER'
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
target = 'status_group'  # a string, so train[target] is a 1-D Series as scikit-learn expects

features = ['amount_tsh', 'date_days',
            'funder', 'gps_height', 'installer',
            'longitude', 'latitude', 'wpt_name', 'num_private', 'basin',
            'subvillage', 'region', 'region_code', 'district_code', 'lga', 'ward',
            'population', 'public_meeting', 'scheme_management', 'scheme_name',
            'permit', 'construction_year', 'extraction_type', 'extraction_type_group',
            'management', 'management_group', 'payment',
            'water_quality', 'quality_group', 'quantity', 'source',
            'source_class', 'waterpoint_type']

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
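
Because `wrangle` is called separately on `train`, `val`, and `test`, each frame's top-10 categories are counted within that frame and may differ between splits. One possible alternative, sketched here with made-up data and two hypothetical helpers (`top_categories`, `collapse_rare`), is to learn the categories to keep from the training split only and reuse them everywhere.

```python
import pandas as pd

def top_categories(series, n=10):
    """Return the n most frequent values of a column."""
    return series.value_counts().index[:n]

def collapse_rare(series, keep):
    """Replace every value not in `keep` with 'OTHER'."""
    return series.where(series.isin(keep), 'OTHER')

# Made-up funder names for illustration.
train_funder = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])
val_funder = pd.Series(['C', 'C', 'C', 'A', 'D'])

keep = top_categories(train_funder, n=2)  # learned from train only
train_collapsed = collapse_rare(train_funder, keep)
val_collapsed = collapse_rare(val_funder, keep)
print(train_collapsed.tolist())
print(val_collapsed.tolist())
```

With this approach a category that is common in validation but rare in training (`'C'` here) still maps to `'OTHER'`, matching what the model saw during fitting.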

In [78]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    IterativeImputer(),
    RandomForestClassifier(random_state=42, n_jobs=-1)
)
param_distributions = { 
    #'randomforestclassifier__n_estimators': randint(10, 300),
    'randomforestclassifier__max_depth': [15,16,17,18,19,20,21,22,23,24,25], 
    'randomforestclassifier__min_samples_leaf': [1,2,3,4,5,6,7,8,9,10,11]
}

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=10, 
    cv=4, 
    scoring='accuracy', 
    verbose=2, 
    return_train_score=True
)

search.fit(X_train, y_train);

Fitting 4 folds for each of 10 candidates, totalling 40 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=25, total=   8.3s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.1s remaining:    0.0s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=25, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=25, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=25, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=8, randomforestclassifier__max_depth=23, total=   6.8s
[CV]  randomforestclassifier__min_samples_leaf=8, randomforestclassifier__max_depth=23, total=   6.9s
[CV]  randomforestclassifier__min_samples_leaf=8, randomforestclassifier__max_depth=23, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=8, randomforestclassifier__max_depth=23, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=5, randomforestclassifier__max_depth=22, total=   7.0s
[CV]  randomforestclassifier__min_samples_leaf=5, randomforestclassifier__max_depth=22, total=   7.2s
[CV]  randomforestclassifier__min_samples_leaf=5, randomforestclassifier__max_depth=22, total=   7.2s
[CV]  randomforestclassifier__min_samples_leaf=5, randomforestclassifier__max_depth=22, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=19, total=   7.0s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=19, total=   7.0s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=19, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=19, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=23, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=23, total=   7.2s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=23, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=23, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=9, randomforestclassifier__max_depth=24, total=   6.8s
[CV]  randomforestclassifier__min_samples_leaf=9, randomforestclassifier__max_depth=24, total=   6.8s
[CV]  randomforestclassifier__min_samples_leaf=9, randomforestclassifier__max_depth=24, total=   7.0s
[CV]  randomforestclassifier__min_samples_leaf=9, randomforestclassifier__max_depth=24, total=   6.9s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=25, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=25, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=25, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=25, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=22, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=22, total=   7.2s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=22, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=22, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=11, randomforestclassifier__max_depth=15, total=   6.5s
[CV]  randomforestclassifier__min_samples_leaf=11, randomforestclassifier__max_depth=15, total=   6.5s
[CV]  randomforestclassifier__min_samples_leaf=11, randomforestclassifier__max_depth=15, total=   6.7s
[CV]  randomforestclassifier__min_samples_leaf=11, randomforestclassifier__max_depth=15, total=   6.7s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=24, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=24, total=   7.7s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=24, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=24, total=   7.7s
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:  5.3min finished


In [79]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation Accuracy', search.best_score_)
pipeline = search.best_estimator_
print("Train Accuracy:", pipeline.score(X_train, y_train))
print("Validation Accuracy:", pipeline.score(X_val, y_val))

Best hyperparameters {'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__max_depth': 24}
Cross-validation Accuracy 0.8060830110428802
Train Accuracy: 0.9148372615039282
Validation Accuracy: 0.8102356902356902
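
Beyond `best_params_` and `best_score_`, the fitted search object keeps every candidate's scores in `cv_results_`, which can help when deciding how to narrow the ranges for a next round. The following is a small self-contained sketch on synthetic data (not the waterpumps data), with a bare random forest instead of the notebook's pipeline, so the parameter names have no step prefix.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

search_demo = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_distributions={'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 2, 4]},
    n_iter=4,
    cv=3,
    scoring='accuracy',
    return_train_score=True,
    random_state=42,
)
search_demo.fit(X, y)

# One row per sampled candidate, best candidate first.
results = pd.DataFrame(search_demo.cv_results_).sort_values('rank_test_score')
cols = ['param_max_depth', 'param_min_samples_leaf',
        'mean_train_score', 'mean_test_score']
print(results[cols])
```

Comparing `mean_train_score` with `mean_test_score` per candidate gives a quick view of how much each setting overfits.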


In [0]:
y_pred = pipeline.predict(X_test)
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission = submission.set_index('id')
submission.to_csv('katie-submission8.csv')

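# Download the file when running on Colab; locally it is already on disk
if 'google.colab' in sys.modules:
    from google.colab import files
    files.download('katie-submission8.csv')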

# Did get a higher score! But gotta keep trying because why not?

In [85]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    IterativeImputer(),
    RandomForestClassifier(random_state=42, n_jobs=-1)
)
param_distributions = { 
    #'randomforestclassifier__n_estimators': randint(50, 500),
    'randomforestclassifier__max_depth': [20,21,22,23,24,25], 
    'randomforestclassifier__min_samples_leaf': [1,2,3,4]
}

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=15, 
    cv=4, 
    scoring='accuracy', 
    verbose=2, 
    return_train_score=True
)

search.fit(X_train, y_train);

Fitting 4 folds for each of 15 candidates, totalling 60 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=20, total=   7.8s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.7s remaining:    0.0s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=20, total=   8.0s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=20, total=   8.0s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=20, total=   7.9s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=24, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=24, total=   7.5s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=24, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=24, total=   7.7s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=21, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=21, total=   7.5s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=21, total=   7.5s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=21, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=22, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=22, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=22, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=22, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=21, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=21, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=21, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=21, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=25, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=25, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=25, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=25, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=20, total=   7.1s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=20, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=20, total=   7.4s
[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=20, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=24, total=   8.2s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=24, total=   8.2s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=24, total=   8.3s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=24, total=   8.4s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=23, total=   7.8s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=23, total=   7.7s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=23, total=   7.9s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=23, total=   7.8s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=23, total=   8.1s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=23, total=   8.2s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=23, total=   8.3s
[CV]  randomforestclassifier__min_samples_leaf=1, randomforestclassifier__max_depth=23, total=   8.1s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=21, total=   7.5s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=21, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=21, total=   7.7s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=21, total=   7.5s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=24, total=   7.5s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=24, total=   7.6s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=24, total=   7.7s
[CV]  randomforestclassifier__min_samples_leaf=2, randomforestclassifier__max_depth=24, total=   7.7s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=20, total=   7.2s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=20, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=20, total=   7.3s
[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=20, total=   7.4s
[CV] randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=24 


  self._final_estimator.fit(Xt, y, **fit_params)


[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=24, total=   7.2s
[CV] randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=24 


  self._final_estimator.fit(Xt, y, **fit_params)


[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=24, total=   7.4s
[CV] randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=24 


  self._final_estimator.fit(Xt, y, **fit_params)


[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=24, total=   7.4s
[CV] randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=24 


  self._final_estimator.fit(Xt, y, **fit_params)


[CV]  randomforestclassifier__min_samples_leaf=4, randomforestclassifier__max_depth=24, total=   7.4s
[CV] randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=22 


  self._final_estimator.fit(Xt, y, **fit_params)


[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=22, total=   7.3s
[CV] randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=22 


  self._final_estimator.fit(Xt, y, **fit_params)


[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=22, total=   7.4s
[CV] randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=22 


  self._final_estimator.fit(Xt, y, **fit_params)


[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=22, total=   7.4s
[CV] randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=22 


  self._final_estimator.fit(Xt, y, **fit_params)


[CV]  randomforestclassifier__min_samples_leaf=3, randomforestclassifier__max_depth=22, total=  12.1s


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:  8.5min finished
  self._final_estimator.fit(Xt, y, **fit_params)


In [86]:
print('Best hyperparameters', search.best_params_)
# best_score_ is already the mean cross-validated accuracy (higher is better),
# so it is reported as-is rather than negated.
print('Cross-validation Accuracy', search.best_score_)
pipeline = search.best_estimator_
print("Train Accuracy:", pipeline.score(X_train, y_train))
print("Validation Accuracy:", pipeline.score(X_val, y_val))

Best hyperparameters {'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__max_depth': 24}
Cross-validation Accuracy 0.8060830110428802
Train Accuracy: 0.9148372615039282
Validation Accuracy: 0.8102356902356902
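
The validation accuracy above is a single number, and the assignment also asks for a confusion matrix. As a minimal, dependency-free sketch of what such a matrix counts, the block below tallies (true, predicted) label pairs. The helper names `confusion_counts` and `accuracy` are hypothetical, and the toy labels are made up for illustration and are not results from this notebook. In the notebook itself one would presumably pass `y_val` and `pipeline.predict(X_val)` instead.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels):
    """Return a nested list where row i, column j counts the samples
    whose true label is labels[i] and whose predicted label is labels[j]."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Accuracy is the diagonal (correct predictions) divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical toy labels for illustration only (not this notebook's data).
labels = ['functional', 'functional needs repair', 'non functional']
y_true = ['functional', 'functional', 'non functional',
          'functional needs repair', 'non functional', 'functional']
y_pred = ['functional', 'non functional', 'non functional',
          'functional', 'non functional', 'functional']

matrix = confusion_counts(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f'{label:>25}', row)
print('Accuracy:', accuracy(matrix))
```

Rows are true classes and columns are predicted classes, so off-diagonal cells show which classes are being confused with each other, which a single accuracy score hides.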


In [0]:
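# Predict on the test set and write a Kaggle submission file
y_pred = pipeline.predict(X_test)
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission = submission.set_index('id')  # keep 'id' as the first CSV column
submission.to_csv('katie-submission9.csv')

# Download the file from the Colab runtime
from google.colab import files
files.download('katie-submission9.csv')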