# Random Forest


In [1]:
# Add the parent directory (..) to the Python path
import sys
sys.path.append("..")

import numpy as np
import config
from helpers import get_scores
# Note: scikit-learn 0.18+ provides RandomizedSearchCV in sklearn.model_selection
from sklearn.grid_search import RandomizedSearchCV
from transformers import transformer
from data_builder import load_test_data, load_dev_data, load_small_dev_data

df, target = load_dev_data()
X = transformer.fit_transform(df)

print(X.shape)

(80993, 273)


I drop the text column because it takes up too much space.

# Hyperparameter optimization

Let's look for the best (possible) hyperparameters.

To do that, let's first see which hyperparameters the scikit-learn implementation exposes.

In [2]:
from sklearn.ensemble import RandomForestClassifier
help(RandomForestClassifier)

Help on class RandomForestClassifier in module sklearn.ensemble.forest:

class RandomForestClassifier(ForestClassifier)
 |  A random forest classifier.
 |  
 |  A random forest is a meta estimator that fits a number of decision tree
 |  classifiers on various sub-samples of the dataset and use averaging to
 |  improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is always the same as the original
 |  input sample size but the samples are drawn with replacement if
 |  `bootstrap=True` (default).
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  Parameters
 |  ----------
 |  n_estimators : integer, optional (default=10)
 |      The number of trees in the forest.
 |  
 |  criterion : string, optional (default="gini")
 |      The function to measure the quality of a split. Supported criteria are
 |      "gini" for the Gini impurity and "entropy" for the information gain.
 |      Note: this parameter is tree-specific.
 |  
 |  max_features : int, fl

http://scikit-learn.org/stable/modules/ensemble.html#parameters

The main parameters to adjust when using these methods is n_estimators and max_features. The former is the number of trees in the forest. The larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees. The latter is the size of the random subsets of features to consider when splitting a node. The lower the greater the reduction of variance, but also the greater the increase in bias. Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks (where n_features is the number of features in the data).

Good results are often achieved when setting max_depth=None in combination with min_samples_split=1 (i.e., when fully developing the trees). Bear in mind though that these values are usually not optimal, and might result in models that consume a lot of RAM. The best parameter values should always be cross-validated.

In addition, note that in random forests, bootstrap samples are used by default (bootstrap=True) while the default strategy for extra-trees is to use the whole dataset (bootstrap=False). When using bootstrap sampling the generalization accuracy can be estimated on the left out or out-of-bag samples. This can be enabled by setting oob_score=True.
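The out-of-bag estimate mentioned at the end of the quote can be tried without a separate validation split. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the chosen parameter values are illustrative assumptions, not the notebook's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption); the notebook's X and target are not used here.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,     # many trees, so each sample is left out by at least some of them
    max_features="sqrt",  # the classification rule of thumb from the quote
    bootstrap=True,       # out-of-bag scoring needs bootstrap sampling
    oob_score=True,
    random_state=0,
)
forest.fit(X_demo, y_demo)

# Accuracy estimated only on samples each tree did not see while training.
print("OOB accuracy: {:.3f}".format(forest.oob_score_))
```

Because every tree is evaluated on the rows it never saw, the estimate comes almost for free with the fit, which can be handy when a full cross-validated search is too slow.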



Options:

- n_estimators: number of trees
- max_features: the fraction of features considered when splitting a node.
  These features are chosen at random
- min_samples_split: how many samples a node must contain before it is
  considered for splitting
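To make the fractional `max_features` concrete: recent scikit-learn documentation describes a float value as a fraction, with `max(1, int(max_features * n_features))` features considered at each split. The helper below is a hypothetical illustration of that rule (it is not part of scikit-learn or of this project), applied to the 273 columns reported by `X.shape` above.

```python
def features_per_split(max_features, n_features):
    """Candidate features per split for a fractional max_features (illustrative helper)."""
    # Assumption: mirrors the documented rule max(1, int(fraction * n_features)).
    return max(1, int(max_features * n_features))


n_features = 273  # number of columns printed by X.shape in the first cell
for fraction in (0.1, 0.2, 0.5, 0.95):
    print(fraction, features_per_split(fraction, n_features))
```

So a value such as 0.2 means roughly 54 of the 273 features are drawn at random for each split.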

In [3]:
from search import find_best_classifier
from sklearn.ensemble import RandomForestClassifier

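# Candidate values sampled by the randomized search
options = {
    #'criterion': ['gini', 'entropy'],
    #'splitter': ['best', 'random'],
    'n_estimators': range(50, 105, 5),          # 50, 55, ..., 100 trees
    'max_features': np.arange(0.1, 1.0, 0.05),  # fraction of features tried per split
    # Note: recent scikit-learn releases require min_samples_split >= 2
    'min_samples_split': range(1, 101, 10),     # 1, 11, ..., 91
}

# Search settings: 10-fold CV, ROC AUC scoring, all cores, 100 sampled candidates
search_options = {
    'cv': 10,
    'scoring': 'roc_auc',
    'n_jobs': -1,
    'n_iter': 100,
}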

search = RandomizedSearchCV(RandomForestClassifier(), options, verbose=1, **search_options)

search.fit(X, target)

print("Best parameters: {}".format(search.best_params_))

get_scores(search.best_estimator_, transformer)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 36.9min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 406.7min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 1066.9min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 1755.9min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 2137.9min finished


Best parameters: {'n_estimators': 65, 'min_samples_split': 1, 'max_features': 0.20000000000000004}


                        precision_score  accuracy_score  f1_score  recall_score  roc_auc_score
RandomForestClassifier         0.990007        0.990333  0.990337      0.990667       0.990333
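The search above took about 2138 minutes, and it relies on the older `sklearn.grid_search` module and on `min_samples_split=1`. For quick experiments on a recent scikit-learn release, a scaled-down variant might look like the sketch below. It assumes synthetic data, far fewer iterations and folds, and a `min_samples_split` range starting at 2; none of these choices come from the original run, and the scores it prints are not comparable to the table above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption); replace with the real X and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

param_distributions = {
    "n_estimators": list(range(50, 105, 5)),
    "max_features": np.arange(0.1, 1.0, 0.05).tolist(),
    "min_samples_split": list(range(2, 102, 10)),  # starts at 2 instead of 1
}

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # 5 sampled candidates instead of 100
    cv=3,              # 3 folds instead of 10
    scoring="roc_auc",
    random_state=0,    # makes the sampled candidates reproducible
)
demo_search.fit(X_demo, y_demo)

print("Best parameters: {}".format(demo_search.best_params_))
print("Best CV ROC AUC: {:.3f}".format(demo_search.best_score_))
```

Running a small search like this first is a cheap way to check that the parameter names and ranges are accepted before committing to a multi-day run.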


In [4]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('pca', PCA()),
    ('tree', RandomForestClassifier()),
])
clf

Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('tree', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])
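Parameters of a pipeline step are addressed as `<step name>__<parameter name>`, which is why the search over this pipeline needs prefixed keys. A small sketch of that naming convention follows; `demo_clf` is a throwaway copy built only for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

demo_clf = Pipeline([
    ("pca", PCA()),
    ("tree", RandomForestClassifier()),
])

# Nested parameters are exposed under "<step name>__<parameter name>".
nested_names = sorted(name for name in demo_clf.get_params() if "__" in name)
print(nested_names[:5])

# set_params accepts the same names; a search over the pipeline relies on this.
demo_clf.set_params(pca__n_components=5, tree__n_estimators=20)
print(demo_clf.named_steps["tree"].n_estimators)
```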

In [5]:
from helpers import add_prefix

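# Prefix each key with the step name so it targets the 'tree' step of the pipeline
new_options = add_prefix(options, 'tree__')

n_features = X.shape[1]

# Also search over the number of PCA components: 20, 30, ... (below n_features)
new_options.update({
    'pca__n_components': range(20, n_features, 10)
})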


search = RandomizedSearchCV(clf, new_options, verbose=1, **search_options)

search.fit(X, target)

print("Best parameters: {}".format(search.best_params_))

get_scores(search.best_estimator_, transformer)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 138.8min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 778.7min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 1823.7min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 3302.8min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 4002.7min finished


Best parameters: {'tree__min_samples_split': 1, 'tree__max_features': 0.25000000000000006, 'pca__n_components': 190, 'tree__n_estimators': 65}


          precision_score  accuracy_score  f1_score  recall_score  roc_auc_score
Pipeline          0.98361        0.985222  0.985247      0.986889       0.985222
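The `add_prefix` helper used in this cell comes from the project's `helpers` module, whose source is not shown here. Judging only from the keys in the best parameters above (for example `tree__n_estimators`), it presumably prepends the prefix to every key. A hypothetical equivalent, named `add_prefix_sketch` to make clear it is a guess and not the project's code:

```python
def add_prefix_sketch(options, prefix):
    """Return a new dict with `prefix` prepended to every key (hypothetical helper)."""
    # The values (candidate lists or ranges) are passed through unchanged.
    return {prefix + key: values for key, values in options.items()}


demo_options = {"n_estimators": [50, 100], "min_samples_split": [2, 12]}
prefixed = add_prefix_sketch(demo_options, "tree__")
print(sorted(prefixed))
```

Returning a new dictionary leaves the original options untouched, so they can still be reused for a search on the bare classifier.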
