
In [1]:
import warnings
warnings.simplefilter("ignore")

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
X = pd.read_csv('https://raw.githubusercontent.com/cris-her/datasets-platzi-course/master/intermediate_results/X_opening.csv')
y = X['worldwide_gross']
X = X.drop('worldwide_gross',axis=1)

## Gradient Boosted Trees

In [4]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

ensemble = GradientBoostingRegressor()
results = cross_validate(ensemble,X,y,cv=5,scoring='r2')

In [5]:
test_scores = results['test_score']
# 'train_score' is only returned when cross_validate is called with return_train_score=True
#train_scores = results['train_score']
#print(np.mean(train_scores))
print(np.mean(test_scores))

0.5231662496725721


How do we optimize the parameters of this last model?

## Hyperparameter optimization

- Fix a high learning rate
- Fix the tree parameters
- With those parameters fixed, choose the best number of estimators for the ensemble
- (Homework) With the given learning rate and the optimal number of estimators, optimize the tree parameters
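
The steps above can be sketched as follows. This is a minimal illustration only: the synthetic data from `make_regression`, the small grids, and the names `X_demo`, `y_demo`, `search_n` and `search_tree` are assumptions made for the example, not the course dataset or recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the course dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)

# Steps 1-2: fix a high learning rate and some initial tree parameters.
base = GradientBoostingRegressor(learning_rate=0.1, max_depth=3,
                                 min_samples_leaf=5, subsample=0.8,
                                 random_state=0)

# Step 3: with those fixed, search only over the number of estimators.
search_n = GridSearchCV(base, {'n_estimators': [25, 50, 100]},
                        scoring='r2', cv=3)
search_n.fit(X_demo, y_demo)
best_n = search_n.best_params_['n_estimators']

# Step 4 (homework): keep the learning rate and best_n, tune the trees.
tuned = GradientBoostingRegressor(learning_rate=0.1, n_estimators=best_n,
                                  subsample=0.8, random_state=0)
tree_grid = {'max_depth': [2, 3, 4], 'min_samples_leaf': [5, 20]}
search_tree = GridSearchCV(tuned, tree_grid, scoring='r2', cv=3)
search_tree.fit(X_demo, y_demo)

print(best_n, search_tree.best_params_, search_tree.best_score_)
```

Searching one group of parameters at a time keeps each grid small. A joint grid over all of them would multiply the number of fits.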

**Grid Search**

So far we have said that:

- train_test_split is useful for quick evaluations, tests, and prototyping
- cross_validate is a more robust way to estimate the performance of your algorithm

However, once the prototyping stage is over and we want to settle on a final model, we should follow the workflow below.
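
As a rough illustration of the difference, the sketch below scores the same model both ways. The synthetic data and the names `X_demo`, `y_demo`, `single_score` and `cv_scores` are assumptions for this example only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in data (assumption: not the course dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)
model = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Quick check: one split, one R^2 value that depends on how the split fell.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
single_score = model.fit(X_tr, y_tr).score(X_te, y_te)

# More robust: five folds, five R^2 values, summarized by mean and spread.
cv_scores = cross_validate(model, X_demo, y_demo,
                           cv=5, scoring='r2')['test_score']

print(single_score, np.mean(cv_scores), np.std(cv_scores))
```

The single split gives one number with no indication of its variability, while the fold scores also show how much the estimate moves from one partition to another.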

<img src="https://raw.githubusercontent.com/cris-her/machine-learning-platzi/master/img/grid_search_crossval.png" width=700>

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1)

In [7]:
from sklearn.model_selection import GridSearchCV

param_test1 = {'n_estimators': range(20,501,20)}

In [8]:
list(param_test1['n_estimators'])

[20,
 40,
 60,
 80,
 100,
 120,
 140,
 160,
 180,
 200,
 220,
 240,
 260,
 280,
 300,
 320,
 340,
 360,
 380,
 400,
 420,
 440,
 460,
 480,
 500]

In [9]:
# Learning rate and tree parameters are fixed here; only n_estimators is searched
estimator = GradientBoostingRegressor(learning_rate=0.1,
                                     min_samples_split=500,
                                     min_samples_leaf=50,
                                     max_depth=8,
                                     max_features='sqrt',
                                     subsample=0.8,
                                     random_state=10)

In [10]:
gsearch1 = GridSearchCV(estimator,
                       param_grid = param_test1,
                       scoring='r2',
                       cv=5)

In [11]:
gsearch1.fit(X_train,y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.1,
                                                 loss='ls', max_depth=8,
                                                 max_features='sqrt',
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=50,
                                                 min_samples_split=500,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100,
                                                 n_iter_no_change=None,
            

In [12]:
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_

({'mean_fit_time': array([0.02890449, 0.05406265, 0.08080063, 0.11004753, 0.13504181,
         0.16003218, 0.18966436, 0.21482196, 0.24103608, 0.2692225 ,
         0.30340309, 0.32663245, 0.35684147, 0.38757143, 0.40976191,
         0.44128838, 0.47087436, 0.49716163, 0.52265892, 0.56014495,
         0.58686309, 0.62073998, 0.64571075, 0.67361197, 0.69332862]),
  'mean_score_time': array([0.00150294, 0.00166564, 0.00179734, 0.00210733, 0.00216675,
         0.00228114, 0.00253778, 0.00262918, 0.00279889, 0.00292892,
         0.00326095, 0.00331416, 0.00345984, 0.00387669, 0.0037168 ,
         0.00391922, 0.00437102, 0.00426006, 0.00437279, 0.0046391 ,
         0.00540552, 0.00488133, 0.00507827, 0.00540609, 0.00536194]),
  'mean_test_score': array([0.65533772, 0.71947072, 0.73472393, 0.73893391, 0.74204852,
         0.74593224, 0.74954068, 0.75081976, 0.75256545, 0.7534906 ,
         0.75456927, 0.75530597, 0.75517149, 0.75388522, 0.75460231,
         0.75250064, 0.75350086, 0.75354341,
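
The raw `cv_results_` dictionary is hard to read. One option, sketched here on synthetic data with a much smaller grid (the names `X_demo`, `y_demo`, `search` and `summary` are assumptions for the example), is to load it into a pandas DataFrame and sort by rank.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the course dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      {'n_estimators': [20, 40, 60]}, scoring='r2', cv=3)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of equal-length arrays, so it loads into a DataFrame.
table = pd.DataFrame(search.cv_results_)
columns = ['param_n_estimators', 'mean_test_score',
           'std_test_score', 'rank_test_score']
summary = table[columns].sort_values('rank_test_score')
print(summary)
```

The first row of `summary` corresponds to `best_params_`, and `std_test_score` shows whether neighbouring candidates really differ or sit within fold-to-fold noise.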

In [13]:
# With no cv or scoring given, cross_validate uses 5-fold CV and the estimator's own score (R^2)
final_results = cross_validate(gsearch1.best_estimator_,X_train,y_train)

In [14]:
test_scores = final_results['test_score']
# 'train_score' is only returned when cross_validate is called with return_train_score=True
#train_scores = final_results['train_score']
#print(np.mean(train_scores))
print(np.mean(test_scores))

0.7553059694284988


In [15]:
estimator = GradientBoostingRegressor(learning_rate=0.1,
                                     min_samples_split=500,
                                     min_samples_leaf=50,
                                     max_depth=8,
                                     max_features='sqrt',
                                     subsample=0.8,
                                     random_state=10,
                                     n_estimators=240)

In [16]:
estimator.fit(X_train,y_train)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=8,
                          max_features='sqrt', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=50, min_samples_split=500,
                          min_weight_fraction_leaf=0.0, n_estimators=240,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=10, subsample=0.8, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [17]:
estimator.score(X_test,y_test)

0.8092888852563106
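
Rebuilding the estimator by hand, as in the cells above, is one way to get the final model. Because `GridSearchCV` uses `refit=True` by default, the fitted search object already holds the best configuration retrained on the full training set. The sketch below checks on synthetic data that both routes give the same test score. The data, the small grid, and the names `search`, `manual`, `refit_score` and `manual_score` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the course dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

search = GridSearchCV(GradientBoostingRegressor(subsample=0.8, random_state=10),
                      {'n_estimators': [20, 40, 60]}, scoring='r2', cv=3)
search.fit(X_tr, y_tr)  # refit=True: the best model is retrained on all of X_tr

# Manual route: rebuild with the winning value and fit again.
manual = GradientBoostingRegressor(subsample=0.8, random_state=10,
                                   n_estimators=search.best_params_['n_estimators'])
manual.fit(X_tr, y_tr)

refit_score = search.score(X_te, y_te)   # uses scoring='r2' on the refit model
manual_score = manual.score(X_te, y_te)  # a regressor's default score is also R^2
print(refit_score, manual_score)
```

The two scores match only because `random_state` is fixed. Without it, `subsample=0.8` would make each fit draw different samples.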

## Closing thoughts

**Resources**

- Reddit: /machinelearning and /learnmachinelearning
- Analytics Vidhya and KDnuggets
- Kaggle.com and the "No Free Hunch" blog
- arXiv, papers
- Books: "Pattern Recognition and Machine Learning" by C. Bishop and "The Elements of Statistical Learning".

**Next steps**

- Mathematics
- Practice: feature engineering, model selection, and tuning
- Deep learning for NLP and computer vision
- Bayesian machine learning