# Decission Trees & Random Forests

En este notebook haremos uso de los Decission Trees & Random Forests para intentar predecir el precio en base al sentimiento.


## Imports

In [14]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats
import datetime

from pprint import pprint

from sklearn.model_selection import cross_val_score,  GridSearchCV, KFold, StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

from JABA.utils import DFPicker

### Pick the data

Haremos uso de la funcion **get_complete_df(dateFrom, dateTo)** de **DFPicker** para recopilar los datos de las fechas de las que queremos realizar la observacion

In [9]:
df = DFPicker.get_complete_df(datetime.date(2018, 3, 2),datetime.date(2018, 3, 4))
price = df['Open']
sentiment = df['sentiment']

C:\Users\User\Documents\GitHub\JABA
JABA/data/tweets/2018-03-03/sentiment_file_nltk.csv


### Model Evaluation

#### 1- RandomForest Regressor

In [12]:
price_normalized = np.log1p(price)
X_train, X_test, y_train, y_test = DFPicker.train_test_splitter(sentiment, price_normalized, 0.2)


Training set has 2304 samples.
Testing set has 576 samples.


In [13]:
kf = KFold(5, shuffle=True, random_state = 42).get_n_splits(sentiment)
result = cross_val_score(RandomForestRegressor(), X_train, y_train, scoring='neg_mean_squared_error', cv=kf,n_jobs=-1)
print('Modelo {}: Error medio y desviación {:.5f} +/- {:5f}'.format(RandomForestRegressor.__name__, -result.mean(), result.std()))

Modelo RandomForestRegressor: Error medio y desviación 0.00052 +/- 0.000029


#### 2- Grid Search

Observar los parametros de los que esta haciendo uso el Random Forest Regressor

In [15]:
rf = RandomForestRegressor(random_state=42)
print('Parameters currently in use: \n')
pprint(rf.get_params())

Parameters currently in use: 

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}


Delimitar los valores de estos parametros para el GridSearch

In [37]:
n_estimators = [int(x) for x in np.linspace(start = 10, stop=3000,num=10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 1000, num=10)]
max_depth.append(None)
min_samples_split = range(1,20)
min_samples_leaf = range(1,20)
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
        'max_features': max_features,
        'max_depth': max_depth,
        'min_samples_split': min_samples_split,
        'min_samples_leaf': min_samples_leaf,
        'bootstrap': bootstrap}

Crear Randomized Search

In [38]:
rf_random = RandomizedSearchCV(estimator=rf, param_distributions = random_grid, n_iter=10, cv = 3, verbose=2, random_state=42, n_jobs=-1)
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


  self.best_estimator_.fit(X, y, **fit_params)


RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(random_state=42),
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 120, 230, 340, 450,
                                                      560, 670, 780, 890, 1000,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': range(1, 20),
                                        'min_samples_split': range(1, 20),
                                        'n_estimators': [10, 342, 674, 1006,
                                                         1338, 1671, 2003, 2335,
                                                         2667, 3000]},
                   random_state=42, verbose=2)

Metricas de Error

In [39]:
y_pred_train = rf_random.predict(X_train)
y_pred_test = rf_random.predict(X_test)
print("MSE Train", mean_squared_error(y_train, y_pred_train))
print("MSE Test", mean_squared_error(y_test, y_pred_test))
print("R2 Train", r2_score(y_train, y_pred_train))
print("R2 Test", r2_score(y_test, y_pred_test))

MSE Train 0.0002721160995284901
MSE Test 0.0004458258821162942
R2 Train 0.2693237835712087
R2 Test -0.11588270903496278
