<a href="https://colab.research.google.com/github/dondreojordan/DS-Unit-2-Kaggle-Challenge/blob/master/223_Random_Forests_Assignment_Notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 3*

---

# Cross-Validation


## Assignment
- [ ] [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Continue to participate in our Kaggle challenge. 
- [ ] Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


**You can't just copy** from the lesson notebook to this assignment.

- Because the lesson was **regression**, but the assignment is **classification.**
- Because the lesson used [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html), which doesn't work as-is for _multi-class_ classification.

So you will have to adapt the example, which is good real-world practice.

1. Use a model for classification, such as [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
2. Use hyperparameters that match the classifier, such as `randomforestclassifier__ ...`
3. Use a metric for classification, such as [`scoring='accuracy'`](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)
4. If you’re doing a multi-class classification problem — such as whether a waterpump is functional, functional needs repair, or nonfunctional — then use a categorical encoding that works for multi-class classification, such as [OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) (not [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html))
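
As a minimal sketch of points 1 and 2, the snippet below builds a classification pipeline and lists the hyperparameter names a search would accept. It uses scikit-learn's own `OrdinalEncoder` as a stand-in for the category_encoders version (an assumption, made only to keep the example free of extra installs). The step names come from `make_pipeline`, which names each step after its lowercased class name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder  # stand-in for category_encoders.OrdinalEncoder

pipeline = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=42),
)

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Hyperparameters are addressed as '<step name>__<parameter name>'
forest_params = sorted(
    key for key in pipeline.get_params()
    if key.startswith('randomforestclassifier__')
)
print(forest_params)
```

Any key printed by the last line can be used in `param_distributions` for `RandomizedSearchCV`.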



## Stretch Goals

### Reading
- Jake VanderPlas, [Python Data Science Handbook, Chapter 5.3](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html), Hyperparameters and Model Validation
- Jake VanderPlas, [Statistics for Hackers](https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=107)
- Ron Zacharski, [A Programmer's Guide to Data Mining, Chapter 5](http://guidetodatamining.com/chapter5/), 10-fold cross validation
- Sebastian Raschka, [A Basic Pipeline and Grid Search Setup](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/svm_iris_pipeline_and_gridsearch.ipynb)
- Peter Worcester, [A Comparison of Grid Search and Randomized Search Using Scikit Learn](https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85)

### Doing
- Add your own stretch goals!
- Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/). See the previous assignment notebook for details.
- In addition to `RandomizedSearchCV`, scikit-learn has [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Another library called scikit-optimize has [`BayesSearchCV`](https://scikit-optimize.github.io/notebooks/sklearn-gridsearchcv-replacement.html). Experiment with these alternatives.
- _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?


### BONUS: Stacking!

Here's some code you can use to "stack" multiple submissions, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
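
To see what the majority vote does, here is a small sketch on three made-up submissions held in memory instead of CSV files. The column names and predictions are hypothetical.

```python
import pandas as pd

# Three hypothetical sets of predictions for the same three pumps
preds = pd.DataFrame({
    'model_a': ['functional', 'non functional', 'functional'],
    'model_b': ['functional', 'functional', 'non functional'],
    'model_c': ['non functional', 'functional', 'functional needs repair'],
})

# mode(axis='columns') finds the most frequent value in each row.
# Column 0 of the result holds the first mode for every row.
majority_vote = preds.mode(axis='columns')[0]
print(majority_vote)
```

In the first two rows, two of the three models agree, so their label wins. In the last row all three disagree: `mode` returns several values for that row and `[0]` keeps only the first of them, so an odd number of reasonably strong submissions tends to work best.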

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

#Engineer DataFrames Train, Test, and Validate

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_predict

#Merge Train Features and Train Labels
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

#Read test_features and sample_submission
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

#Split train into train and validation
train, val = train_test_split(train, train_size=.80, test_size=0.20,
                              stratify=train['status_group'], random_state=42)

print("Check Train Shape: ",train.shape, "\nCheck Validation Shape", val.shape, "\nCheck Test Shape", test.shape)

Check Train Shape:  (47520, 41) 
Check Validation Shape (11880, 41) 
Check Test Shape (14358, 40)


In [3]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
43360,72938,0.0,2011-07-27,,0,,33.542898,-9.174777,Kwa Mzee Noa,0,Lake Nyasa,Mpandapanda,Mbeya,12,4,Rungwe,Kiwira,0,True,GeoData Consultants Ltd,VWC,K,,0,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,functional
7263,65358,500.0,2011-03-23,Rc Church,2049,ACRA,34.66576,-9.308548,Kwa Yasinta Ng'Ande,0,Rufiji,Kitichi,Iringa,11,4,Njombe,Imalinyi,175,True,GeoData Consultants Ltd,WUA,Tove Mtwango gravity Scheme,True,2008,gravity,gravity,gravity,wua,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
2486,469,25.0,2011-03-07,Donor,290,Do,38.238568,-6.179919,Kwasungwini,0,Wami / Ruvu,Kwedigongo,Pwani,6,1,Bagamoyo,Mbwewe,2300,True,GeoData Consultants Ltd,VWC,,False,2010,india mark ii,india mark ii,handpump,vwc,user-group,pay per bucket,per bucket,salty,salty,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional
313,1298,0.0,2011-07-31,Government Of Tanzania,0,DWE,30.716727,-1.289055,Kwajovin 2,0,Lake Victoria,Kihanga,Kagera,18,1,Karagwe,Isingiro,0,True,GeoData Consultants Ltd,,,True,0,other,other,other,vwc,user-group,never pay,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other,non functional
52726,27001,0.0,2011-03-10,Water,0,Gove,35.389331,-6.399942,Chama,0,Internal,Mtakuj,Dodoma,1,6,Bahi,Nondwa,0,True,GeoData Consultants Ltd,VWC,Zeje,True,0,mono,mono,motorpump,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional


#Engineer New Features
- Define `clean(X)` and apply it to the train, validation, and test sets.
- Clean outliers.
- Engineer features.

(For time-based, see Kaggle Challenge 3 for DSPT2 w/ Keri Kalmbach: https://www.youtube.com/watch?v=Ny6H6VEfHwA&feature=youtu.be)

In [4]:
train.columns.sort_values()
# In alphabetical order to find duplicates

Index(['amount_tsh', 'basin', 'construction_year', 'date_recorded',
       'district_code', 'extraction_type', 'extraction_type_class',
       'extraction_type_group', 'funder', 'gps_height', 'id', 'installer',
       'latitude', 'lga', 'longitude', 'management', 'management_group',
       'num_private', 'payment', 'payment_type', 'permit', 'population',
       'public_meeting', 'quality_group', 'quantity', 'quantity_group',
       'recorded_by', 'region', 'region_code', 'scheme_management',
       'scheme_name', 'source', 'source_class', 'source_type', 'status_group',
       'subvillage', 'ward', 'water_quality', 'waterpoint_type',
       'waterpoint_type_group', 'wpt_name'],
      dtype='object')

In [5]:
import numpy as np
def clean(X):
    X = X.copy()
    # Some latitudes are a near-zero value (-2e-08); set them to exactly 0
    # so the zero-to-NaN step below catches them
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    # Drop columns that duplicate 'quantity' and 'payment'
    duplicates = ['quantity_group', 'payment_type']
    X = X.drop(columns=duplicates)
    # In these columns a zero means "missing", so replace zeros with NaN
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 
                       'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
    # Convert date_recorded to datetime
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    # Split the date into year, month, and day features
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')
    # Years from construction to inspection (NaN when construction_year is missing)
    X['years'] = X['year_recorded'] - X['construction_year']
    # Despite its name, 'age' is a boolean flag: True when 'years' is missing
    X['age'] = X['years'].isnull()
    # Drop high-cardinality columns
    X = X.drop(['scheme_name','funder','installer'], axis=1)
    # Drop low-variance column
    X = X.drop(['recorded_by'], axis=1)
    # Return the cleaned dataframe
    return X

train = clean(train)
val = clean(val) #Val Set is to help w/ Hyperparameter tuning.
test = clean(test)

print("Check Train Shape: ",train.shape, "\nCheck Validation Shape", val.shape,"\nCheck Test Shape", test.shape)

Check Train Shape:  (47520, 39) 
Check Validation Shape (11880, 39) 
Check Test Shape (14358, 38)


In [6]:
train.head()
# Check that clean() cleaned the data set and added the new columns
# New columns: year_recorded, month_recorded, day_recorded, years, age
# Cleaned: longitude, latitude, construction_year, gps_height, population

Unnamed: 0,id,amount_tsh,gps_height,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,scheme_management,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group,year_recorded,month_recorded,day_recorded,years,age
43360,72938,0.0,,33.542898,-9.174777,Kwa Mzee Noa,0,Lake Nyasa,Mpandapanda,Mbeya,12,4,Rungwe,Kiwira,,True,VWC,,,gravity,gravity,gravity,vwc,user-group,never pay,soft,good,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,functional,2011,7,27,,True
7263,65358,500.0,2049.0,34.66576,-9.308548,Kwa Yasinta Ng'Ande,0,Rufiji,Kitichi,Iringa,11,4,Njombe,Imalinyi,175.0,True,WUA,True,2008.0,gravity,gravity,gravity,wua,user-group,pay monthly,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional,2011,3,23,3.0,False
2486,469,25.0,290.0,38.238568,-6.179919,Kwasungwini,0,Wami / Ruvu,Kwedigongo,Pwani,6,1,Bagamoyo,Mbwewe,2300.0,True,VWC,False,2010.0,india mark ii,india mark ii,handpump,vwc,user-group,pay per bucket,salty,salty,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional,2011,3,7,1.0,False
313,1298,0.0,,30.716727,-1.289055,Kwajovin 2,0,Lake Victoria,Kihanga,Kagera,18,1,Karagwe,Isingiro,,True,,True,,other,other,other,vwc,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,other,other,non functional,2011,7,31,,True
52726,27001,0.0,,35.389331,-6.399942,Chama,0,Internal,Mtakuj,Dodoma,1,6,Bahi,Nondwa,,True,VWC,True,,mono,mono,motorpump,vwc,user-group,pay per bucket,soft,good,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional,2011,3,10,,True


In [7]:
# The status_group column is the target
target = 'status_group'

# Get a dataframe with all train columns except the target (id is still included)
train_features = train.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features
print(features)

['id', 'amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'year_recorded', 'month_recorded', 'day_recorded', 'years', 'basin', 'region', 'public_meeting', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'water_quality', 'quality_group', 'quantity', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group', 'age']


#Arrange data into X features matrix and y target vector

In [8]:
# Arrange data into X features matrix and y target vector 
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]

###Import Classifiers

In [9]:
import category_encoders as ce
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier



#Decide how to validate your model

    -Train/Validate/Test Split: time-based
    -Train/Validate/Test Split: random 80/20
    -Cross-Validation w/ independent test set (scikit-learn method)

#Model
*Change model based on expected output design. (Repeat)*

In [10]:
%%time
model = Pipeline([
                  ('ohe', OneHotEncoder()),
                  ('impute', SimpleImputer()),
                  ('classifier', DecisionTreeClassifier())
])

CPU times: user 186 µs, sys: 26 µs, total: 212 µs
Wall time: 219 µs


#Fit Model to Data 
    model.fit(X_train, y_train)
    print('Training Accuracy Score:', model.score(X_train, y_train))
    print('Validation Accuracy Score:', model.score(X_val, y_val))

In [11]:
%%time
model.fit(X_train, y_train)

print('training accuracy:', model.score(X_train, y_train))
print('validation accuracy:', model.score(X_val, y_val))

training accuracy: 1.0
validation accuracy: 0.7379629629629629
CPU times: user 7.62 s, sys: 294 ms, total: 7.92 s
Wall time: 7.89 s


#S  T  O  P                        

In [12]:
%%html
<marquee style='width: 65%; color: black;'><b>Compare Model (above) with Model (below). Swap models as many times as you need to determine max optimization.</b></marquee>

#Model
*Change model based on expected output design. (Repeat)*

In [13]:
%%time
model = Pipeline([
                  ('ohe', OneHotEncoder()),
                  ('impute', SimpleImputer()),
                  ('select', SelectKBest(k=20)),
                  ('classifier', DecisionTreeClassifier())
])

CPU times: user 0 ns, sys: 602 µs, total: 602 µs
Wall time: 798 µs


#Fit Model to Data

In [14]:
%%time
model.fit(X_train, y_train)

print('training accuracy:', model.score(X_train, y_train))
print('validation accuracy:', model.score(X_val, y_val))

training accuracy: 0.7683080808080808
validation accuracy: 0.7393939393939394
CPU times: user 6.21 s, sys: 233 ms, total: 6.44 s
Wall time: 6.45 s


#S T O P

In [15]:
%%html
<marquee style='width: 65%; color: black;'><b>Compare Model (below) with (2) Models (above). Swap models as many times as you need to determine max optimization.</b></marquee>

###cross_val_score
cross_val_score helper function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [16]:
import category_encoders as ce
import numpy as np
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#Model
*Change model based on expected output design. (Repeat)*

In [17]:
#Linear Model

In [18]:
model = make_pipeline(
     ce.OneHotEncoder(use_cat_names=True), 
     SimpleImputer(strategy='mean'), 
     StandardScaler(), 
     SelectKBest(f_regression, k=20), 
     Ridge(alpha=1.0)
)

#Fit Model to Cross Validation Score

In [19]:
k = 5
scores = cross_val_score(model, X_train, y_train, cv=k, 
                         scoring='neg_mean_absolute_error')


print(f'Mean Absolute Error for {k} number of folds:', -scores)
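############ ERROR MESSAGE: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
############ TypeError: unsupported operand type(s) for /: 'str' and 'int' 


# Why?
# y_train holds string class labels ('functional', 'non functional', ...), but
# f_regression, Ridge, and scoring='neg_mean_absolute_error' all expect a numeric
# regression target. Every fold fails to fit, so every score is nan.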

TypeError: unsupported operand type(s) for /: 'str' and 'int'

TypeError: unsupported operand type(s) for /: 'str' and 'int'

TypeError: unsupported operand type(s) for /: 'str' and 'int'

TypeError: unsupported operand type(s) for /: 'str' and 'int'



Mean Absolute Error for 5 number of folds: [nan nan nan nan nan]


TypeError: unsupported operand type(s) for /: 'str' and 'int'



In [20]:
-scores.mean()

nan
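
One way to make this cell run is to swap each regression piece for its classification counterpart. The sketch below does that on made-up numeric data rather than the waterpump features: `X_demo` and `y_demo` are hypothetical, and `LogisticRegression` and `f_classif` are my substitutions for `Ridge` and `f_regression`, not part of the lesson notebook. With the real data you would keep `ce.OneHotEncoder(use_cat_names=True)` as the first pipeline step.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 300 rows, 6 numeric features, string class labels
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)),
                      columns=[f'x{i}' for i in range(6)])
labels = np.array(['functional', 'functional needs repair', 'non functional'])
# Tie the label to the first feature so there is some signal to learn
y_demo = pd.Series(labels[np.digitize(X_demo['x0'], bins=[-0.5, 0.5])],
                   name='status_group')

model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    SelectKBest(f_classif, k=3),        # f_classif instead of f_regression
    LogisticRegression(max_iter=1000),  # a classifier instead of Ridge
)

k = 5
scores = cross_val_score(model, X_demo, y_demo, cv=k, scoring='accuracy')
print(f'Accuracy for {k} folds:', scores)
print('Mean accuracy:', scores.mean())
```

Because accuracy is a score (higher is better), there is no sign flip as there was with `neg_mean_absolute_error`.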

#S T O P

In [21]:
%%html
<marquee style='width: 65%; color: black;'><b>Compare Model (below) with (3) Models (above). Swap models as many times as you need to determine max optimization.</b></marquee>

In [22]:
from sklearn.ensemble import RandomForestRegressor

#Model
*Change model based on expected output design. (Repeat)*

In [23]:
#Random Forest

In [24]:
pipeline = make_pipeline(
    ce.TargetEncoder(min_samples_leaf=1, smoothing=1), 
    SimpleImputer(strategy='median'), 
    RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=42)
)

#Fit Model to Cross Validation Score

In [25]:
k = 3
scores = cross_val_score(pipeline, X_train, y_train, cv=k, 
                         scoring='neg_mean_absolute_error')
print(f'MAE for {k} folds:', -scores)

TypeError: Could not convert functional needs repairfunctionalfunctional needs repairfunctionalfunctionalnon functional... (error text truncated; repeated for each fold)

MAE for 3 folds: [nan nan nan]

In [26]:
-scores.mean()

nan

# S T O P

#Model
*Change model based on expected output design. (Repeat)*

In [27]:
# RandomizedSearchCV (Hyperparameter Optimization)

In [46]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [47]:
pipeline = make_pipeline(
    ce.TargetEncoder(), 
    SimpleImputer(), 
    RandomForestRegressor(random_state=42)
)

##Parameter Distribution / RandomizedSearchCV 
*If you're on Colab, decrease n_iter & cv parameters*

    param_distributions = {}
    search = RandomizedSearchCV()

In [48]:
from scipy.stats import randint, uniform

In [49]:
param_distributions = {
    'targetencoder__min_samples_leaf': randint(1, 1000),     
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestregressor__n_estimators': randint(50, 500), 
    'randomforestregressor__max_depth': [5, 10, 15, 20, None], 
    'randomforestregressor__max_features': uniform(0, 1), 
}

In [50]:
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=3, 
    cv=3, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

#Fit Pipeline with Parameter Distributions & RandomizedSearchCV

In [None]:
search.fit(X_train, y_train);

print('Best hyperparameters', search.best_params_)
print('Cross-validation MAE', -search.best_score_)

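########### ValueError: could not convert string to float: 'functionalfunctionalfunctionalnon functional...
###########


########## WHY??
# y_train holds string class labels, but TargetEncoder (as used here),
# RandomForestRegressor, and scoring='neg_mean_absolute_error' need a numeric target.
# Use a classifier, an encoder that works for multi-class targets such as
# OrdinalEncoder, and scoring='accuracy' instead.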

See detailed results

In [None]:
pd.DataFrame(search.cv_results_).sort_values(by='rank_test_score').T
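
A classification version of the search could look like the sketch below. It runs on made-up numeric data (`X_demo`, `y_demo` are hypothetical), so it leaves out the encoder; with the waterpump features, `ce.OrdinalEncoder()` would go first in the pipeline in place of `ce.TargetEncoder()`. The ranges in `param_distributions` are illustrative values chosen to keep the run short, not tuned recommendations.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data with string class labels
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(240, 5)), columns=list('abcde'))
labels = np.array(['functional', 'functional needs repair', 'non functional'])
y_demo = pd.Series(labels[np.digitize(X_demo['a'], bins=[-0.5, 0.5])])

pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(random_state=42),
)

# Keys use the classifier's step name, not 'randomforestregressor__...'
param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__n_estimators': randint(20, 60),
    'randomforestclassifier__max_depth': [5, 10, 15, None],
    'randomforestclassifier__max_features': uniform(0.1, 0.9),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=3,
    cv=3,
    scoring='accuracy',  # a classification metric
    return_train_score=True,
    random_state=42,
)
search.fit(X_demo, y_demo)

print('Best hyperparameters', search.best_params_)
print('Cross-validation accuracy', search.best_score_)
```

Note that `best_score_` is reported as-is here; the minus sign was only needed for the negated error metric.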

Make predictions for test set

In [58]:
model = search.best_estimator_

In [59]:
model

Pipeline(memory=None,
         steps=[('targetencoder',
                 TargetEncoder(cols=['basin', 'region', 'public_meeting',
                                     'scheme_management', 'permit',
                                     'extraction_type', 'extraction_type_group',
                                     'extraction_type_class', 'management',
                                     'management_group', 'payment',
                                     'water_quality', 'quality_group',
                                     'quantity', 'source', 'source_type',
                                     'source_class', 'waterpoint_type',
                                     'waterpoint_typ...
                 RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                       criterion='mse', max_depth=10,
                                       max_features=0.4565403180560086,
                                       max_leaf_nodes=None, max_samples=None,
              

In [60]:
from sklearn.metrics import mean_absolute_error

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test MAE: ${mae:,.0f}')

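############### TypeError: 'NoneType' object is not subscriptable
###############
#Generally speaking, when you get a TypeError it means that your values aren't what you think they are.


### WHY?
# Two things to check: (1) the Kaggle test set has no status_group column, so there is
# no y_test to score against here; score on X_val / y_val instead. (2) MAE and the '$'
# format are left over from the regression lesson; use accuracy for this target.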

TypeError: ignored

>>best_estimator_ : estimator
Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False. ... See refit parameter for more information ...

>>refit : boolean, string, or callable, default=True
Refit an estimator using the best found parameters on the whole dataset.

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" width="50%">

By default, scikit-learn cross-validation will "refit an estimator using the best found parameters on the whole dataset", which means it uses all the training data.

Tip: If you're doing a 3-way train/validation/test split, you should do this too! After you've optimized your hyperparameters and selected your final model, manually refit on both the training and validation data.
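
The tip can be sketched as follows, on made-up data with hypothetical names (`X_train_demo`, `X_val_demo`, and so on). The candidate `max_depth` values are arbitrary; the point is only the order of operations: choose on validation, then refit on train plus validation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with a three-way split (60% / 20% / 20%)
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=list('abcd'))
y = pd.Series(np.where(X['a'] > 0, 'functional', 'non functional'))

X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

# 1. Choose a hyperparameter value using the validation set
best_depth, best_score = None, -1.0
for depth in [2, 5, 10]:
    candidate = RandomForestClassifier(n_estimators=30, max_depth=depth,
                                       random_state=42)
    candidate.fit(X_train_demo, y_train_demo)
    score = candidate.score(X_val_demo, y_val_demo)
    if score > best_score:
        best_depth, best_score = depth, score

# 2. Refit the chosen configuration on training + validation data combined
final_model = RandomForestClassifier(n_estimators=30, max_depth=best_depth,
                                     random_state=42)
final_model.fit(pd.concat([X_train_demo, X_val_demo]),
                pd.concat([y_train_demo, y_val_demo]))

test_accuracy = final_model.score(X_test_demo, y_test_demo)
print('Chosen max_depth:', best_depth)
print('Held-out test accuracy:', test_accuracy)
```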

#Visualize the Train/Validation Curve

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# status_group is a class label, so use a classifier and an accuracy metric
model = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(), 
    DecisionTreeClassifier()
)

depth = range(1, 30, 3)
train_scores, val_scores = validation_curve(
    model, X_train, y_train,
    param_name='decisiontreeclassifier__max_depth', 
    param_range=depth, scoring='accuracy', 
    cv=3,
    n_jobs=-1
)

# Accuracy is already "higher is better", so the scores are not negated
plt.figure(dpi=150)
plt.plot(depth, np.mean(train_scores, axis=1), color='blue', label='training accuracy')
plt.plot(depth, np.mean(val_scores, axis=1), color='red', label='validation accuracy')
plt.title('Validation Curve')
plt.xlabel('model complexity: DecisionTreeClassifier max_depth')
plt.ylabel('model score: Accuracy')
plt.legend();

#Dondre' After Hours Study Session DS 211 Notes
Instructor: Keri Kalmbuch

#Differences between parameter and hyperparameter:

###Parameter
Determined DURING Fitting

    model.fit(X_train, y_train)

###Hyperparameter
Determined BEFORE Fitting

    model = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(strategy='mean'), 
    StandardScaler(), 
    SelectKBest(f_regression, k=20), 
    Ridge(alpha=1.0))


    Model Hyperparameters:
    RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features=0.34431534199991853,
                      max_leaf_nodes=None, max_samples=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=186,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)
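
The distinction can be checked directly on an estimator. This is a small sketch on made-up regression data (`X_demo`, `y_demo` are hypothetical): `alpha` is set before fitting and readable through `get_params()`, while `coef_` and `intercept_` only exist after `fit`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: y is roughly 3*x0 - 2*x1 plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

model = Ridge(alpha=1.0)

# Hyperparameter: chosen by us, available before fitting
alpha_before_fit = model.get_params()['alpha']
# Parameters: not learned yet, so the attribute does not exist
has_coef_before_fit = hasattr(model, 'coef_')

model.fit(X_demo, y_demo)

# Parameters: learned from the data during fitting
print('alpha (hyperparameter):', alpha_before_fit)
print('coef_ existed before fit:', has_coef_before_fit)
print('coef_ (parameters):', model.coef_)
print('intercept_ (parameter):', model.intercept_)
```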