<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Categorical-features-in-tree-based-algorithms-in-scikit-learn" data-toc-modified-id="Categorical-features-in-tree-based-algorithms-in-scikit-learn-1">Categorical features in tree-based algorithms in scikit-learn</a></span></li><li><span><a href="#Setting-up-and-applying-cross-validation-search" data-toc-modified-id="Setting-up-and-applying-cross-validation-search-2">Setting up and applying cross validation search</a></span></li></ul></div>

<center><h2>Categorical features in tree-based algorithms in scikit-learn</h2></center>

Below is broken code: the categorical data does not work with a tree-based algorithm as written. Your task is to get the categorical data to work with a tree-based algorithm.

In [82]:
reset -fs

In [83]:
import pandas as pd 

from sklearn.ensemble      import RandomForestClassifier
from sklearn.pipeline      import Pipeline

In [84]:
# Data

data = pd.DataFrame()
data['target'] = ['pos','neg', 'pos',  'neg']
data['pet']    = ['🐱',  '🐱',  '🐱',  '🐶']  # Categorical data

In [85]:
# TODO: Fix the broken code so the data can be modeled with a tree-based algorithm

# pipe = Pipeline([('tree_algorithm', RandomForestClassifier())])

# pipe.fit(X=data[['pet']], 
#          y=data['target'])
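
To see why the code above is considered broken, the following minimal sketch fits the forest directly on the raw string column and captures the resulting exception. It assumes a scikit-learn version in which `RandomForestClassifier` requires numeric input; the variable name `error_message` is only illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same toy data as in the notebook
data = pd.DataFrame({'target': ['pos', 'neg', 'pos', 'neg'],
                     'pet':    ['🐱', '🐱', '🐱', '🐶']})

error_message = None
try:
    # Fitting on raw strings: the forest tries to convert the column to floats
    RandomForestClassifier(n_estimators=5).fit(data[['pet']], data['target'])
except (ValueError, TypeError) as err:
    error_message = str(err)

print(error_message)
```

The printed message typically complains that a string could not be converted to a float, which is the symptom the exercise asks you to fix.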

In [86]:
## A solution

from sklearn.preprocessing import OneHotEncoder 



pipe = Pipeline([
                  ('cat_encoder',    OneHotEncoder()),        # Explicitly encode categorical features
                  ('tree_algorithm', RandomForestClassifier())
                ])

pipe.fit(X=data[['pet']], 
         y=data['target'])


# Tree-based algorithms in scikit-learn do NOT automatically handle categorical features.
# Categorical features must be explicitly encoded. A common encoding scheme is One Hot Encoding.

# Sources of Inspiration
# This issue has been outstanding for many years. Do not wait for it to be resolved! - https://github.com/scikit-learn/scikit-learn/pull/12866
# https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree

Pipeline(steps=[('cat_encoder', OneHotEncoder()),
                ('tree_algorithm', RandomForestClassifier())])
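
As a rough illustration of what the encoder step does, the sketch below prints the one-hot matrix produced for the `pet` column. It also shows one possible variation, `handle_unknown='ignore'`, which lets the pipeline score a category it never saw during fitting. The names `robust_pipe` and `unseen`, and the 🐹 example, are hypothetical and not part of the original solution.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'target': ['pos', 'neg', 'pos', 'neg'],
                     'pet':    ['🐱', '🐱', '🐱', '🐶']})

# Inspect the encoding: one column per distinct category, one 1 per row
encoder = OneHotEncoder()
encoded = encoder.fit_transform(data[['pet']]).toarray()
print(encoder.categories_)
print(encoded)

# Variation (an assumption, not the notebook's solution):
# unseen categories are encoded as all zeros instead of raising an error
robust_pipe = Pipeline([
    ('cat_encoder',    OneHotEncoder(handle_unknown='ignore')),
    ('tree_algorithm', RandomForestClassifier(n_estimators=10, random_state=0)),
])
robust_pipe.fit(X=data[['pet']], y=data['target'])

unseen = pd.DataFrame({'pet': ['🐹']})  # category not present in training data
prediction = robust_pipe.predict(unseen)
print(prediction)
```

With the default `handle_unknown='error'`, the last `predict` call would raise instead, so the choice depends on whether unseen categories are expected at prediction time.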

<center><h2>Setting up and applying cross validation search</h2></center>

I have started defining the search space. Your task is to define the rest of the search space and run `RandomizedSearchCV` to find the best parameters.

In [87]:
import numpy as np # 👈 Hint to programmatically create search space

In [88]:
# I have started to define the search space. 
# Your task is to define the search space for other important hyperparameters.

# Number of features to consider at every split
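# Note: 'auto' was deprecated in scikit-learn 1.1 and later removed;
# on recent versions, use values such as 'sqrt' or 'log2' instead
max_features = ['auto', 'sqrt']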
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 20]
# Minimum number of samples required at each leaf node
min_samples_leaf = 2 ** np.arange(5)

serach_space = {'max_features': max_features,
                'bootstrap': bootstrap,
                'min_samples_split': min_samples_split,
                'min_samples_leaf': min_samples_leaf,
                }

In [89]:
# TODO: 
# Programmatically define the following hyperparameter space values and add them to the search space
# Pick a reasonable number of values and an appropriate sampling scheme within each range

# n_estimators - should be from 1 to 100
# max_depth - should be from 10 to 110, also include None




In [96]:
# Tests

assert 'n_estimators' in serach_space
assert 'max_depth'    in serach_space

In [90]:
## A solution

# Number of trees in random forest
# Some hyperparameter values must be integers, not numpy.float64
n_estimators = [int(x) for x in np.linspace(start=1, stop=100, num=15)]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(start=10, stop=110, num=11)]
max_depth.append(None)

serach_space['n_estimators'] = n_estimators
serach_space['max_depth']    = max_depth
serach_space

{'max_features': ['auto', 'sqrt'],
 'bootstrap': [True, False],
 'min_samples_split': [2, 5, 10, 20],
 'min_samples_leaf': array([ 1,  2,  4,  8, 16]),
 'n_estimators': [1, 8, 15, 22, 29, 36, 43, 50, 57, 64, 71, 78, 85, 92, 100],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None]}
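
Evenly spaced lists are only one way to read "appropriate sampling". The sketch below shows two possible alternatives: log-spaced candidate values via `np.geomspace`, and a `scipy.stats.randint` distribution, which `RandomizedSearchCV` can sample from directly. It assumes SciPy is importable (it is a scikit-learn dependency); the names `n_estimators_log`, `alt_space`, and `samples` are hypothetical.

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Log-spaced candidates: denser coverage of small forests, sparser for large ones.
# The set removes duplicates that rounding could create.
n_estimators_log = sorted({int(round(x)) for x in np.geomspace(1, 100, num=10)})
print(n_estimators_log)

# Alternative search space mixing a distribution with a list
alt_space = {
    'n_estimators': randint(1, 101),  # integers drawn uniformly from 1..100
    'max_depth': [None] + [int(x) for x in np.linspace(10, 110, num=11)],
}

# ParameterSampler draws candidates the same way RandomizedSearchCV does
samples = list(ParameterSampler(alt_space, n_iter=5, random_state=0))
for candidate in samples:
    print(candidate)
```

A distribution avoids committing to a fixed grid up front, while an explicit list keeps the candidate set easy to read; either form can be passed as `param_distributions`.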

In [91]:
# Create data

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)

In [92]:
from sklearn.ensemble         import RandomForestClassifier
from sklearn.model_selection  import RandomizedSearchCV

In [93]:
# TODO:
# 1. Define RandomizedSearchCV
# 2. Fit RandomizedSearchCV
# 3. Print best hyperparameters


In [94]:
## A solution
clf_rand_cv = RandomizedSearchCV( estimator=RandomForestClassifier(), 
                                  param_distributions=serach_space, 
                                  n_iter=25,
                                  cv=5, 
                                  n_jobs=-1,
                                  verbose=0,
                                )
clf_rand_cv.fit(X, y)
clf_rand_cv.best_params_

{'n_estimators': 43,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 10,
 'bootstrap': True}
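
Beyond `best_params_`, a fitted search also exposes `best_score_` and `cv_results_`. The sketch below runs a deliberately small search so it finishes quickly and then ranks the sampled candidates. The reduced `small_space`, the smaller dataset, and the fixed `random_state` values are assumptions made for illustration and differ from the search above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0, shuffle=False)

# Hypothetical, reduced search space to keep the run short
small_space = {'n_estimators': [5, 10, 20],
               'max_depth': [3, 5, None],
               'min_samples_leaf': [1, 2, 4]}

search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=0),
                            param_distributions=small_space,
                            n_iter=5,
                            cv=3,
                            random_state=0)  # fixes which candidates are sampled
search.fit(X, y)

# Mean cross-validated score (accuracy by default) of the best candidate
print(search.best_score_)

# One row per sampled candidate, best first
results = pd.DataFrame(search.cv_results_).sort_values('rank_test_score')
print(results[['params', 'mean_test_score', 'std_test_score']].head())
```

Setting `random_state` on both the search and the estimator makes repeated runs comparable, which helps when checking whether a change to the search space actually matters.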