Lambda School Data Science

*Unit 2, Sprint 2, Module 3*

---

# Cross-Validation


## Assignment
- [ ] [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Continue to participate in our Kaggle challenge. 
- [ ] Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


You won't be able to just copy from the lesson notebook to this assignment.

- Because the lesson was ***regression***, but the assignment is ***classification.***
- Because the lesson used [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html), which doesn't work as-is for _multi-class_ classification.

So you will have to adapt the example, which is good real-world practice.

1. Use a model for classification, such as [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
2. Use hyperparameters that match the classifier, such as `randomforestclassifier__ ...`
3. Use a metric for classification, such as [`scoring='accuracy'`](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)
4. If you’re doing a multi-class classification problem — such as whether a waterpump is functional, functional needs repair, or nonfunctional — then use a categorical encoding that works for multi-class classification, such as [OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) (not [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html))
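
If you are unsure what the hyperparameter names in step 2 should look like, one way to check is to ask the pipeline itself. The following is a minimal sketch, assuming a two-step pipeline built with `make_pipeline`; the steps chosen here are only an illustration, not the required assignment pipeline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# make_pipeline names each step after its class, lowercased
pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(random_state=42)
)

step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Searchable names take the form '<step name>__<parameter name>'
rf_params = sorted(
    name for name in pipeline.get_params()
    if name.startswith('randomforestclassifier__')
)
print(rf_params)
```

Any name printed by the last line can be used as a key in `param_distributions`.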



## Stretch Goals

### Reading
- Jake VanderPlas, [Python Data Science Handbook, Chapter 5.3](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html), Hyperparameters and Model Validation
- Jake VanderPlas, [Statistics for Hackers](https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=107)
- Ron Zacharski, [A Programmer's Guide to Data Mining, Chapter 5](http://guidetodatamining.com/chapter5/), 10-fold cross validation
- Sebastian Raschka, [A Basic Pipeline and Grid Search Setup](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/svm_iris_pipeline_and_gridsearch.ipynb)
- Peter Worcester, [A Comparison of Grid Search and Randomized Search Using Scikit Learn](https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85)

### Doing
- Add your own stretch goals!
- Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/). See the previous assignment notebook for details.
- In addition to `RandomizedSearchCV`, scikit-learn has [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Another library called scikit-optimize has [`BayesSearchCV`](https://scikit-optimize.github.io/notebooks/sklearn-gridsearchcv-replacement.html). Experiment with these alternatives.
- _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?
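
As a rough illustration of the idea in the quote, here is a minimal sketch of searching over which model to use. It is not the book's code: the synthetic data, the step names `scaler` and `classifier`, and the grid values are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data as a stand-in (an assumption, not the waterpumps data)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

# Named steps, so the grid can swap out whole steps by name
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC()),
])

# A list of dicts: each dict is its own sub-grid with its own hyperparameters
param_grid = [
    {
        'scaler': [StandardScaler()],
        'classifier': [SVC()],
        'classifier__C': [0.1, 1, 10],
    },
    {
        'scaler': ['passthrough'],  # trees do not need scaling
        'classifier': [RandomForestClassifier(n_estimators=50, random_state=42)],
        'classifier__max_depth': [3, None],
    },
]

grid = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
grid.fit(X, y)

print('Best model:', type(grid.best_params_['classifier']).__name__)
print('Cross-validation score:', grid.best_score_)
```

Using a list of dictionaries keeps each model's hyperparameters separate, so the search never tries an SVC parameter on a random forest.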


### BONUS: Stacking!

Here's some code you can use to "stack" multiple submissions, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
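
To see what the majority vote does without any CSV files, here is a small sketch on made-up predictions. The column names `model_1` to `model_3` and the values are hypothetical. Note the last row, where all three models disagree: `mode` returns every tied value in sorted order, so taking column `0` picks the alphabetically first label.

```python
import pandas as pd

# Hypothetical predictions from three models for three pumps
ensemble = pd.DataFrame({
    'model_1': ['functional', 'non functional', 'functional'],
    'model_2': ['functional', 'non functional', 'non functional'],
    'model_3': ['non functional', 'functional', 'functional needs repair'],
})

# Row-wise mode; extra columns appear only when a row has ties
modes = ensemble.mode(axis='columns')
majority_vote = modes[0]

print(modes)
print(majority_vote.tolist())
```

If ties matter for your submission, one option is to break them in favor of your strongest single model rather than relying on alphabetical order.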

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [3]:
train.shape, test.shape

((59400, 41), (14358, 40))

In [4]:
#Wrangle Functions.  We are just going to keep building off the last few assignments
import numpy as np

def wrangle(df):

  #deep copy since we are changing values and the shape of our data. Don't want
  #any warnings.
  df = df.copy()

  #Changing the almost zeros to zero on latitude.  Zero latitude is definitely
  #a mistake, as it is in another part of the world

  df['latitude'] = df['latitude'].replace(-2e-08, 0)

  #last assignment I explored and found what columns had too many zeros, so we
  #are just going to build off that.
  columns = ['gps_height', 'longitude', 'latitude', 'population', 
             'construction_year']

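  for column in columns:
    #assign the result back instead of replacing inplace on a column
    #selection, which newer pandas versions may not apply to df
    df[column] = df[column].replace(0, np.nan)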

    #I'm taking this from the lecture note book.  I suspect what we are doing
    #is creating a boolean column, to identify which of our data is collected
    #vs imputed(happening later in the pipeline).  Then our model may make
    #better decisions, because it can weigh the importance of imputed vs
    #collected data.  Just a theory.

    #quote from Xander: missing values may be a predictive signal
    df[column+'_MISSING'] = df[column].isnull()

  #drop the dupes
  df.drop(columns = ['quantity_group', 'payment_type'], inplace = True)

  #drop never varying, and always varying columns
  df.drop(columns = ['recorded_by', 'id'], inplace = True)

  #Convert date_recorded to datetime (newer pandas versions infer the format
  #by default, so infer_datetime_format is no longer needed)
  df['date_recorded'] = pd.to_datetime(df['date_recorded'])
  
  #making date more usable by extracting month, day and year
  df['year_recorded'] = df['date_recorded'].dt.year
  df['month_recorded'] = df['date_recorded'].dt.month
  df['day_recorded'] = df['date_recorded'].dt.day

  #dropping the datetime column
  df.drop(columns = 'date_recorded', inplace = True)

  #Cool feature I'm borrowing from class: How many years from construction to
  #date recorded
  df['years'] = df['year_recorded'] - df['construction_year']
  df['years_MISSING'] = df['years'].isnull()

  return df
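
The zero-to-NaN step and the `_MISSING` flags in `wrangle` can be checked on a tiny example. This is only a sketch on a made-up two-column frame (the values are invented), mirroring the loop above rather than calling `wrangle` itself.

```python
import numpy as np
import pandas as pd

# Made-up rows where 0 stands in for "not recorded"
toy = pd.DataFrame({
    'population': [120, 0, 35],
    'construction_year': [1999, 2005, 0],
})

for column in ['population', 'construction_year']:
    # Turn placeholder zeros into real missing values
    toy[column] = toy[column].replace(0, np.nan)
    # Record which rows were missing before any imputation happens
    toy[column + '_MISSING'] = toy[column].isnull()

print(toy)
```

The flags survive imputation later in the pipeline, so the model can still tell collected values from filled-in ones.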

In [5]:
train = wrangle(train)
test = wrangle(test)

In [6]:
#Same as last assignment
target = 'status_group'

X_train = train.drop(columns = target)
y_train = train[target]

X_test = test

In [7]:
X_train.shape, X_test.shape

((59400, 45), (14358, 45))

In [8]:
#okay lets just set up the pipeline here
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_jobs = -1, random_state = 42)
)
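
Before tuning, it can help to get a cross-validated baseline for an untuned pipeline, so there is something to compare the search result against. A minimal sketch, assuming synthetic data in place of the waterpumps features and leaving out the encoder because the stand-in data is already numeric:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic 3-class data as a stand-in (an assumption, not the real dataset)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

baseline = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(n_estimators=50, random_state=42)
)

# One accuracy score per fold; the whole pipeline is refit on each split
scores = cross_val_score(baseline, X, y, cv=5, scoring='accuracy')
print('Fold accuracies:', scores)
print('Mean:', scores.mean(), 'Std:', scores.std())
```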

In [13]:
#Okay, let's go ahead and set up our randomized search CV.  Been really wanting to
#give this a try.  My goal is to run it once, with some pretty big steps in my
#ranges here.  Then if I want, maybe run it again with some smaller ones.

#Justification for my ranges.  Research indicates that min_samples_split is most
#responsive in a range of 2-40, and min_samples_leaf is most responsive in a
#range of 1-20.  Max leaf nodes I'm selecting based on past good result ranges.

#I'm also using these ranges, because I'm not doing a wild amount of iterations.
#if I were to do a crazy amount of iterations, then I'd be better served by using
#randint over a range, to get better granularity.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_leaf_nodes': randint(2, 10000),  #must be at least 2
    'randomforestclassifier__min_samples_split': randint(2, 40),
    'randomforestclassifier__min_samples_leaf': randint(1, 20),
    'randomforestclassifier__max_depth': randint(1, 20),
    'randomforestclassifier__max_features': uniform(0, 1)
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions = param_distributions,
    n_iter = 100,
    cv = 5,
    scoring = 'accuracy',
    verbose = 10,
    return_train_score = True,
    n_jobs = -1
)
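
The scipy distributions used above have two conventions that are easy to get wrong: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` takes a start and a width rather than a start and an end. A short sketch that samples from them to confirm the ranges (the sample size and seed are arbitrary choices):

```python
from scipy.stats import randint, uniform

split_dist = randint(2, 40)     # integers 2..39; the upper bound is exclusive
feature_dist = uniform(0, 1)    # floats from 0 to 1; arguments are loc and scale

split_samples = split_dist.rvs(size=1000, random_state=42)
feature_samples = feature_dist.rvs(size=1000, random_state=42)

print(split_samples.min(), split_samples.max())
print(feature_samples.min(), feature_samples.max())

# uniform(loc, scale) spans loc to loc + scale, so this covers 0.1 to 1.0
shifted = uniform(0.1, 0.9).rvs(size=1000, random_state=42)
print(shifted.min(), shifted.max())
```

To include 40 as a possible value for `min_samples_split`, the distribution would need to be `randint(2, 41)`.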

In [14]:
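#okay let's give it a go
search.fit(X_train, y_train)

#fit returns the search object itself; the refit pipeline with the best
#hyperparameters lives in best_estimator_
best_model = search.best_estimator_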

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   18.2s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   23.5s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   41.7s
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   49.7s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   59.4s
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  61 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done  89 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 121 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 157 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done 197 tasks      | elapsed:  

In [15]:
print('Best hyperparameters:', search.best_params_)
print('Cross-Validation Score:', search.best_score_)

Best hyperparameters: {'randomforestclassifier__max_depth': 18, 'randomforestclassifier__max_features': 0.6691986778653586, 'randomforestclassifier__max_leaf_nodes': 2804, 'randomforestclassifier__min_samples_leaf': 3, 'randomforestclassifier__min_samples_split': 5, 'simpleimputer__strategy': 'median'}
Cross-Validation Score: 0.8074242424242424
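
Beyond the single best score, `cv_results_` holds the results for every candidate, and with `return_train_score = True` it also shows how far the train score sits above the validation score. A minimal sketch on synthetic data with a deliberately tiny search (the data, `n_iter`, and the two tuned parameters are assumptions for illustration):

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class data as a stand-in (an assumption, not the real dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_distributions={
        'max_depth': randint(1, 20),
        'min_samples_leaf': randint(1, 20),
    },
    n_iter=5,
    cv=3,
    scoring='accuracy',
    return_train_score=True,
    random_state=42,
)
demo_search.fit(X, y)

# One row per candidate, best validation score first
results = pd.DataFrame(demo_search.cv_results_).sort_values('rank_test_score')
columns = ['rank_test_score', 'mean_test_score', 'mean_train_score',
           'param_max_depth', 'param_min_samples_leaf']
print(results[columns])
```

A large gap between `mean_train_score` and `mean_test_score` for the top candidates suggests overfitting, which can guide how to narrow the ranges for a second search.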


In [17]:
#I'm definitely not excited about that score.  But this is an MVP, so I need to
#finish it first and come back to play later.

y_pred = best_model.predict(X_test)

submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('rand_search_cross_val_2.csv', index = False)