<a href="https://colab.research.google.com/github/timrocar/DS-Unit-2-Kaggle-Challenge/blob/master/module3-cross-validation/LS_DS_223_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 3*

---

# Cross-Validation


## Assignment
- [ ] [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Continue to participate in our Kaggle challenge. 
- [ ] Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


**You can't just copy** from the lesson notebook to this assignment.

- Because the lesson was **regression**, but the assignment is **classification.**
- Because the lesson used [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html), which doesn't work as-is for _multi-class_ classification.

So you will have to adapt the example, which is good real-world practice.

1. Use a model for classification, such as [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
2. Use hyperparameters that match the classifier, such as `randomforestclassifier__ ...`
3. Use a metric for classification, such as [`scoring='accuracy'`](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)
4. If you’re doing a multi-class classification problem (such as whether a water pump is functional, functional needs repair, or nonfunctional), then use a categorical encoding that works for multi-class classification, such as [OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) (not [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html))
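
The four adaptations above can be sketched in one place. This is a minimal illustration under stated assumptions, not the lesson's code: the toy DataFrame and its column values are hypothetical, and scikit-learn's own `OrdinalEncoder` stands in for the `category_encoders` version so the example needs no extra install.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: two categorical features and a three-class target
rng = np.random.default_rng(0)
n_rows = 300
toy_X = pd.DataFrame({
    'source': rng.choice(['spring', 'dam', 'river'], size=n_rows),
    'quantity': rng.choice(['enough', 'insufficient', 'dry'], size=n_rows),
})
classes = ['functional', 'functional needs repair', 'non functional']
toy_y = rng.choice(classes, size=n_rows)

# 1. A classifier, with an encoder that does not depend on the target
pipeline = make_pipeline(
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
    SimpleImputer(),
    RandomForestClassifier(random_state=0),
)

# 2. Hyperparameter names are prefixed with the lowercased step name
param_distributions = {
    'randomforestclassifier__n_estimators': range(10, 50, 10),
    'randomforestclassifier__max_depth': [3, 5, None],
}

# 3. A classification metric
search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    scoring='accuracy',
    random_state=0,
)
search.fit(toy_X, toy_y)

print('Best hyperparameters:', search.best_params_)
print('Cross-validated accuracy:', search.best_score_)
```

Because the toy target is random, the accuracy itself is meaningless here; the point is only the shape of the pipeline and the search.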



## Stretch Goals

### Reading
- Jake VanderPlas, [Python Data Science Handbook, Chapter 5.3](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html), Hyperparameters and Model Validation
- Jake VanderPlas, [Statistics for Hackers](https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=107)
- Ron Zacharski, [A Programmer's Guide to Data Mining, Chapter 5](http://guidetodatamining.com/chapter5/), 10-fold cross validation
- Sebastian Raschka, [A Basic Pipeline and Grid Search Setup](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/svm_iris_pipeline_and_gridsearch.ipynb)
- Peter Worcester, [A Comparison of Grid Search and Randomized Search Using Scikit Learn](https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85)

### Doing
- Add your own stretch goals!
- Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/). See the previous assignment notebook for details.
- In addition to `RandomizedSearchCV`, scikit-learn has [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Another library called scikit-optimize has [`BayesSearchCV`](https://scikit-optimize.github.io/notebooks/sklearn-gridsearchcv-replacement.html). Experiment with these alternatives.
- _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?
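
As a rough idea of what searching over pipeline steps can look like, here is a small sketch on the iris dataset. It is not the book's code: the step names `preprocessing` and `classifier`, the candidate values, and the dataset are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# The initial steps are placeholders; the grid below swaps them out
pipe = Pipeline([
    ('preprocessing', StandardScaler()),
    ('classifier', SVC()),
])

# A list of grids: each dict is searched separately, so every model
# is only combined with its own hyperparameters
param_grid = [
    {
        'classifier': [SVC()],
        'preprocessing': [StandardScaler()],
        'classifier__C': [0.1, 1, 10],
    },
    {
        'classifier': [RandomForestClassifier(n_estimators=20, random_state=0)],
        'preprocessing': ['passthrough'],  # skip scaling for the forest
        'classifier__max_features': [1, 2, 3],
    },
]

grid = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
grid.fit(X_iris, y_iris)

print('Best model:', type(grid.best_params_['classifier']).__name__)
print('Cross-validated accuracy:', grid.best_score_)
```

The same idea should carry over to a pipeline with an encoder and imputer in front, at the cost of a larger search space.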


### BONUS: Stacking!

Here's some code you can use to "stack" multiple submissions, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
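
The code above takes a per-row majority vote across submissions. The sketch below runs the same `mode(axis='columns')[0]` step on a small in-memory DataFrame (the model names and predictions are made up) so you can see what happens when every submission disagrees: `mode` returns several values for that row, and column `0` simply picks one of them.

```python
import pandas as pd

# Hypothetical predictions from three submissions for three pumps
votes = pd.DataFrame({
    'model_a': ['functional', 'non functional', 'functional'],
    'model_b': ['functional', 'non functional', 'non functional'],
    'model_c': ['non functional', 'functional', 'functional needs repair'],
})

# One column per tied mode; rows without a tie are padded with NaN
modes = votes.mode(axis='columns')
majority = modes[0]

# Rows where more than one value is tied for most frequent
has_tie = modes.notna().sum(axis='columns') > 1

print(majority)
print(has_tie)
```

If ties matter for your score, one option is to order the files so that your strongest submission breaks them explicitly rather than relying on which mode comes first.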

In [169]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [170]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [171]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [172]:
train.set_index('id', inplace=True)
test.set_index('id', inplace=True)

In [173]:
## Writing a wrangle function
import numpy as np

def wrangle(X):
  # Work on a copy so the caller's DataFrame is not modified
  X = X.copy()

  # Replace bad latitude measurements (-2e-08) with 0 so they become NaN below
  X['latitude'] = X['latitude'].replace(-2e-08, 0)

  # Replace zeros in longitude and latitude with NaN
  cols_with_zeros = ['longitude', 'latitude']
  for col in cols_with_zeros:
    X[col] = X[col].replace(0, np.nan)

  # Find high-cardinality categorical columns (more than 100 unique values)
  hc_cols = [col for col in X.describe(include='object').columns
             if X[col].nunique() > 100]

  # Drop high-cardinality columns, repeated columns, num_private, and recorded_by
  X = X.drop(['quantity_group', 'recorded_by', 'num_private',
              'scheme_name', 'extraction_type_group', 'extraction_type_class',
              'management_group', 'payment_type', 'source_type',
              'source_class', 'waterpoint_type_group'] + hc_cols, axis=1)

  return X

In [174]:
train = wrangle(train)
test = wrangle(test)

## Splitting

In [175]:
# Split our feature matrix and target vector
y = train['status_group']
X = train.drop('status_group', axis=1)

In [176]:
# train-validation split 
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=73)

## Baseline

In [177]:
print('Baseline Accuracy:', y_train.value_counts(normalize=True).max())

Baseline Accuracy: 0.5430976430976431
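
The baseline above is the share of the most common class. A sketch of the same idea with scikit-learn's `DummyClassifier`, using a made-up ten-row target rather than the real labels, shows that the two numbers agree:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical target: 6 functional, 3 non functional, 1 needs repair
toy_y = pd.Series(['functional'] * 6
                  + ['non functional'] * 3
                  + ['functional needs repair'])
# The features are ignored by the dummy model; this column is a placeholder
toy_X = pd.DataFrame({'placeholder': range(len(toy_y))})

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X, toy_y)

majority_share = toy_y.value_counts(normalize=True).max()
dummy_accuracy = baseline.score(toy_X, toy_y)

print('Majority class share:', majority_share)
print('DummyClassifier accuracy:', dummy_accuracy)
```

The estimator form can be convenient because it can be scored on a validation set or dropped into cross-validation like any other model.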


## Classifiers


*   I'm going to try both LightGBM and RandomForest first.
*   I will use them both with the same hyperparameters I used yesterday, and will add more later.
*   I will also try using `RandomizedSearchCV`.



In [178]:
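from category_encoders import OrdinalEncoder
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline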

In [179]:
modelLGBM = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    LGBMClassifier(random_state=17, num_leaves=300)
)

modelLGBM.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['basin', 'region', 'public_meeting',
                                      'scheme_management', 'permit',
                                      'extraction_type', 'management',
                                      'payment', 'water_quality',
                                      'quality_group', 'quantity', 'source',
                                      'waterpoint_type'],
                                drop_invariant=False, handle_missing='value',
                                handle_unknown='value',
                                mapping=[{'col': 'basin',
                                          'data_type': dtype...
                 LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                colsample_bytree=1.0, importance_type='split',
                                learning_rate=0.1, max_depth=-1,
                                min_child_samples=20, 

In [180]:
print('Training Accuracy:', modelLGBM.score(X_train, y_train))
print('Validation Accuracy:', modelLGBM.score(X_val, y_val))

Training Accuracy: 0.892648709315376
Validation Accuracy: 0.8151515151515152


Changing my train-validation split to a 90/10 ratio gave me a 0.07 boost in validation accuracy and (seemingly) reduced the overfitting a bit!
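
A score from a single split depends on which rows happen to land in the validation set, so a change like this is worth checking with k-fold cross-validation. The sketch below uses synthetic data from `make_classification` (an assumption, not the water pump data) to show how `cross_val_score` reports one score per fold, which gives both a mean and a sense of the spread:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

model = RandomForestClassifier(n_estimators=30, random_state=0)

# Five folds: each row is used for validation exactly once
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')

print('Fold accuracies:', scores)
print('Mean:', scores.mean(), 'Std:', scores.std())
```

If the difference between two setups is smaller than the fold-to-fold spread, it may just be noise from the split.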

In [181]:
modelrfoe = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=17, max_leaf_nodes=1550)
)

modelrfoe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['basin', 'region', 'public_meeting',
                                      'scheme_management', 'permit',
                                      'extraction_type', 'management',
                                      'payment', 'water_quality',
                                      'quality_group', 'quantity', 'source',
                                      'waterpoint_type'],
                                drop_invariant=False, handle_missing='value',
                                handle_unknown='value',
                                mapping=[{'col': 'basin',
                                          'data_type': dtype...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_le

In [182]:
print('Training Accuracy:', modelrfoe.score(X_train, y_train))
print('Validation Accuracy:', modelrfoe.score(X_val, y_val))

Training Accuracy: 0.8517583239805462
Validation Accuracy: 0.8102693602693603


## TUNED Classifiers

In [183]:
print(X_train.shape)

(53460, 21)


In [229]:
tunedLGBM = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    LGBMClassifier(random_state=17, num_leaves=300)
)

# Note: the tuple (5, 120) lists two candidate values for min_data_in_leaf, not a range
parameters = {
    'lgbmclassifier__num_leaves': range(100, 700, 5),
    'lgbmclassifier__min_data_in_leaf': (5, 120)
}

gs = RandomizedSearchCV(tunedLGBM,
                        param_distributions=parameters,
                        n_iter=10,
                        n_jobs=-1,
                        verbose=1)



In [230]:
gs.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  5.4min finished


RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
                                                             handle_missing='value',
                                                             handle_unknown='value',
                                                             mapping=None,
                                                             return_df=True,
                                                             verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=False,
                                                            copy=True,
                                                            fill_value=Non

In [231]:
gs.best_params_

{'lgbmclassifier__min_data_in_leaf': 5, 'lgbmclassifier__num_leaves': 295}

In [232]:
best_model = gs.best_estimator_

In [233]:
print('Training Accuracy:', best_model.score(X_train, y_train))
print('Validation Accuracy:', best_model.score(X_val, y_val))

Training Accuracy: 0.9023756079311634
Validation Accuracy: 0.8166666666666667


To try to be slightly less overfit, I'm going to run this again with a tighter range.
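
One thing to keep in mind when tightening the search: a tuple such as `(5, 120)` is treated as a list of two candidate values, not as a range. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on the search space from the first run (with the `lgbmclassifier__` prefix dropped for brevity) and contrasts it with `scipy.stats.randint`, which is one way to sample every integer in between. The `randint` alternative is a suggestion, not what this notebook runs.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same shape as the first search space: a range and a two-value tuple
two_values = {
    'num_leaves': range(100, 700, 5),
    'min_data_in_leaf': (5, 120),
}
print(len(ParameterGrid(two_values)))  # 120 * 2 = 240 candidates

# Randomized sampling only ever draws 5 or 120 for min_data_in_leaf
sampled_two = {
    params['min_data_in_leaf']
    for params in ParameterSampler(two_values, n_iter=50, random_state=0)
}
print(sorted(sampled_two))

# Alternative: a distribution over all integers from 5 to 120 inclusive
int_range = {
    'num_leaves': range(100, 700, 5),
    'min_data_in_leaf': randint(5, 121),  # upper bound is exclusive
}
sampled_range = {
    int(params['min_data_in_leaf'])
    for params in ParameterSampler(int_range, n_iter=50, random_state=0)
}
print(len(sampled_range), 'distinct values sampled')
```

`ParameterSampler` is what `RandomizedSearchCV` uses to draw candidates, so inspecting it is a cheap way to check a search space before spending minutes on fits.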

In [273]:
tunedLGBM = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    LGBMClassifier(random_state=17, num_leaves=300)
)

# Note: the tuple (20, 40) lists two candidate values for min_data_in_leaf, not a range
parameters = {
    'lgbmclassifier__num_leaves': range(100, 400, 5),
    'lgbmclassifier__min_data_in_leaf': (20, 40)
}

gs_one_two = RandomizedSearchCV(tunedLGBM,
                                param_distributions=parameters,
                                n_iter=10,
                                n_jobs=-1,
                                verbose=1)



In [274]:
gs_one_two.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  4.7min finished


RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
                                                             handle_missing='value',
                                                             handle_unknown='value',
                                                             mapping=None,
                                                             return_df=True,
                                                             verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=False,
                                                            copy=True,
                                                            fill_value=Non

In [278]:
gs_one_two.best_params_

{'lgbmclassifier__min_data_in_leaf': 40, 'lgbmclassifier__num_leaves': 330}

In [279]:
best_model_one_two = gs_one_two.best_estimator_

In [280]:
print('Training Accuracy:', best_model_one_two.score(X_train, y_train))
print('Validation Accuracy:', best_model_one_two.score(X_val, y_val))

Training Accuracy: 0.8909652076318743
Validation Accuracy: 0.8166666666666667


In [281]:
# Copy of the sample submission
submission_random_LGBM = sample_submission.copy()

In [282]:
y_pred_random = best_model_one_two.predict(test)

In [283]:
submission_random_LGBM['status_group'] = y_pred_random

In [284]:
submission_random_LGBM.to_csv('submission_random_LGBM.csv', index=False)

## Tuned Random Forest

In [289]:
tunedrfoe = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=17, max_leaf_nodes=1550)
)

params = {'randomforestclassifier__max_leaf_nodes': range(2000, 4000, 15),
          'randomforestclassifier__n_estimators': range(20, 120, 5)}

gs_two = RandomizedSearchCV(tunedrfoe,
                            param_distributions=params,
                            n_iter=10,
                            n_jobs=-1,
                            verbose=1)


In [None]:
gs_two.fit(X_train,y_train)

In [None]:
gs_two.best_params_

In [222]:
best_model_two = gs_two.best_estimator_

In [223]:
print('Training Accuracy:', best_model_two.score(X_train, y_train))
print('Validation Accuracy:', best_model_two.score(X_val, y_val))

Training Accuracy: 0.9164983164983165
Validation Accuracy: 0.8196969696969697


In [258]:
# Copy of the sample submission
submission_one_rf = sample_submission.copy()

In [259]:
y_pred_one_rf = best_model_two.predict(test)

In [260]:
submission_one_rf['status_group'] = y_pred_one_rf

In [261]:
submission_one_rf.to_csv('submission_one_rf.csv', index=False)

## Tuned Classifiers with GridSearch

In [234]:
gridLGBM = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    LGBMClassifier(random_state=17, num_leaves=300)
)

# Note: the tuple (5, 120) lists two candidate values for min_data_in_leaf, not a range
parameters = {
    'lgbmclassifier__num_leaves': range(100, 700, 5),
    'lgbmclassifier__min_data_in_leaf': (5, 120)
}

gridgs = GridSearchCV(gridLGBM,
                      param_grid=parameters,
                      n_jobs=-1,
                      verbose=1)



In [235]:
gridgs.fit(X_train,y_train)

Fitting 5 folds for each of 240 candidates, totalling 1200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed: 15.9min
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed: 46.6min
[Parallel(n_jobs=-1)]: Done 796 tasks      | elapsed: 88.8min
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed: 131.2min finished


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('ordinalencoder',
                                        OrdinalEncoder(cols=None,
                                                       drop_invariant=False,
                                                       handle_missing='value',
                                                       handle_unknown='value',
                                                       mapping=None,
                                                       return_df=True,
                                                       verbose=0)),
                                       ('simpleimputer',
                                        SimpleImputer(add_indicator=False,
                                                      copy=True,
                                                      fill_value=None,
                                                      missing_values=nan,
       

In [237]:
gridgs.best_params_

{'lgbmclassifier__min_data_in_leaf': 5, 'lgbmclassifier__num_leaves': 370}

In [238]:
best_model_three = gridgs.best_estimator_

In [239]:
print('Training Accuracy:', gridgs.score(X_train, y_train))
print('Validation Accuracy:', gridgs.score(X_val, y_val))

Training Accuracy: 0.915843621399177
Validation Accuracy: 0.813973063973064


In [285]:
# Copy of the sample submission
submission_last = sample_submission.copy()

In [286]:
y_pred_three = best_model_three.predict(test)

In [287]:
submission_last['status_group'] = y_pred_three

In [288]:
submission_last.to_csv('submission_last.csv', index=False)

## ONE LAST TRY

In [293]:
lasttuned = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=17, max_leaf_nodes=1550)
)

params = {'randomforestclassifier__max_leaf_nodes': range(2000, 4000, 5),
          'randomforestclassifier__n_estimators': range(20, 140, 1)}

gs_last = RandomizedSearchCV(lasttuned,
                             param_distributions=params,
                             n_iter=15,
                             n_jobs=-1,
                             verbose=1)


In [294]:
gs_last.fit(X_train,y_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:  5.9min finished


RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
                                                             handle_missing='value',
                                                             handle_unknown='value',
                                                             mapping=None,
                                                             return_df=True,
                                                             verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=False,
                                                            copy=True,
                                                            fill_value=Non

In [295]:
gs_last.best_params_

{'randomforestclassifier__max_leaf_nodes': 3590,
 'randomforestclassifier__n_estimators': 130}

In [296]:
best_model_plz = gs_last.best_estimator_

In [297]:
print('Training Accuracy:', best_model_plz.score(X_train, y_train))
print('Validation Accuracy:', best_model_plz.score(X_val, y_val))

Training Accuracy: 0.910082304526749
Validation Accuracy: 0.8176767676767677


In [298]:
# Copy of the sample submission
last = sample_submission.copy()

In [299]:
last_pred = best_model_plz.predict(test)

In [300]:
last['status_group'] = last_pred

In [301]:
last.to_csv('last.csv', index=False)