# Ensembling exercises
In this exercise you will make predictions on the Pima Indians diabetes dataset. It is a classification problem in which we will try to predict 1 (diabetic) or 0 (non-diabetic). All variables are numeric.

## 1. Load the libraries you consider common to the notebook

In [1]:
import pandas as pd
import numpy as np

## 2. Read the data from [this address](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
The column names are:
```Python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
```

In [4]:
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'


names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

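# Note: the file has no header row, so read_csv uses the first data row as the header
# and the frame ends up with 767 of the 768 records; pass header=None to keep them all.
df = pd.read_csv(url)
df.columns = names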
print(df.shape)

(767, 9)
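The row count depends on how `read_csv` treats the first line of a headerless file. The following is a minimal sketch of that behaviour, assuming two inline sample rows in the same comma-separated layout instead of the real URL; the variables `raw`, `df_default` and `df_named` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Two sample rows in the same headerless, comma-separated layout as the file (assumption).
raw = "6,148,72,35,0,33.6,0.627,50,1\n1,85,66,29,0,26.6,0.351,31,0\n"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Default behaviour: the first data row is consumed as the header.
df_default = pd.read_csv(io.StringIO(raw))

# Passing names= tells pandas there is no header row, so every row is kept.
df_named = pd.read_csv(io.StringIO(raw), names=names)

print(df_default.shape)
print(df_named.shape)
```

With `names=` the column labels are assigned while reading, so a separate `df.columns = names` step would not be needed.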


In [5]:
# Build a new list so that names itself is not mutated
features = [col for col in names if col != 'class']
target = 'class'

## 3. Bagging
For this section you will build an ensemble using the bagging technique ([BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)), combining 100 [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) estimators. Remember to also use [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with 10 folds.

**For this section and the following ones, you do not need to split into train/test**, to keep things simple. Just split your data into features and target.

Set a seed.
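One way to follow these instructions literally, without a train/test split, is to score the ensemble directly with `cross_val_score` and an explicit `KFold`. This is only a sketch: it assumes a synthetic dataset from `make_classification` as a stand-in for `df[features]` and `df[target]`, and the names `kfold`, `bagging` and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features/target split (assumption: 8 numeric features).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# 10 folds, shuffled with a fixed seed so the split is reproducible.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# 100 decision trees, each fitted on a bootstrap sample of the training fold.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)

# One accuracy value per fold.
scores = cross_val_score(bagging, X, y, cv=kfold, scoring='accuracy')
print('Mean accuracy: %.3f' % scores.mean())
```

The cells below take a different route: they hold out a test set and tune `max_samples` with `GridSearchCV`, which also runs 10-fold cross-validation internally.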

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=.2,
                                                    random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(613, 8)
(154, 8)
(613,)
(154,)


from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# define model
model = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), bootstrap=True, random_state=42)
# define model evaluation method

# define grid
grid = dict()
grid['n_estimators'] = list(range(1,501, 50))
grid['max_samples'] = list(range(50, 200, 50))
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=10, n_jobs=-1)
# perform the search
results = search.fit(X_train, y_train)

# summarize
print('Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

In [43]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# define model
grid_bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),n_estimators=100, bootstrap=True, random_state=42)
# define model evaluation method

# define grid
grid = dict()
grid['max_samples'] = list(range(50, 200, 50))
# define search
search = GridSearchCV(grid_bag, grid, scoring='accuracy', cv=10, n_jobs=-1)
# perform the search
results_bag = search.fit(X_train, y_train)

# summarize
print('Accuracy: %.3f' % results_bag.best_score_)
print('Config: %s' % results_bag.best_params_)

Accuracy: 0.763
Config: {'max_samples': 50}


In [45]:
bag_model =  BaggingClassifier(
    DecisionTreeClassifier(random_state=42),**results_bag.best_params_,
    n_estimators=100,
    bootstrap=True, random_state=42)
bag_model.fit(X_train, y_train)
bag_predictions = bag_model.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, bag_predictions)

0.8051948051948052

## 4. Random Forest
In this case, train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees and a `max_features` of 3. Use cross-validation here as well.
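As a rough illustration of the requested configuration, the sketch below cross-validates a forest on synthetic data; `make_classification` stands in for the diabetes features (an assumption), and `forest` and `rf_scores` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with 8 numeric features (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# max_features=3: every split considers a random subset of 3 of the 8 features.
forest = RandomForestClassifier(n_estimators=100, max_features=3, random_state=42)

# With an integer cv and a classifier, scikit-learn uses stratified folds.
rf_scores = cross_val_score(forest, X, y, cv=10, scoring='accuracy')
print('Mean accuracy: %.3f' % rf_scores.mean())
```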

In [47]:
from sklearn.ensemble import RandomForestClassifier
# define model
grid_rf = RandomForestClassifier(n_estimators=100, max_features=3, random_state=42)
# define model evaluation method

# define grid
grid = dict()
# define search
grid['criterion'] = ['gini', 'entropy']
search = GridSearchCV(grid_rf, grid, scoring='accuracy', cv=10, n_jobs=-1)
# perform the search
rf_results = search.fit(X_train, y_train)

# summarize
print('Accuracy: %.3f' % rf_results.best_score_)
print('Config: %s' % rf_results.best_params_)

Accuracy: 0.736
Config: {'criterion': 'gini'}


In [49]:
grid_rf = RandomForestClassifier(**rf_results.best_params_,n_estimators=100, max_features=3, random_state=42)

grid_rf.fit(X_train, y_train)
rf_predictions = grid_rf.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, rf_predictions)

0.7922077922077922

## 5. AdaBoost
Implement an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with 30 trees.
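A possible baseline, before any tuning, is to boost 30 decision stumps and cross-validate them. The sketch assumes synthetic data in place of the diabetes features; `stump`, `ada` and `ada_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 8 numeric features (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Depth-1 trees (stumps) are the usual weak learner for AdaBoost.
stump = DecisionTreeClassifier(max_depth=1)

# n_estimators is an upper bound: boosting can stop early on a perfect fit.
ada = AdaBoostClassifier(stump, n_estimators=30, random_state=42)

ada_scores = cross_val_score(ada, X, y, cv=10, scoring='accuracy')
print('Mean accuracy: %.3f' % ada_scores.mean())
```

The base estimator is passed positionally here because the name of that keyword argument differs between scikit-learn versions.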

In [50]:
from sklearn.ensemble import AdaBoostClassifier

grid_ada =  AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30, random_state=42)
# define model evaluation method

# define grid
grid = dict()
# define search
grid['learning_rate'] = list(np.arange(0.1, 1.0, 0.1))
# Note: 'SAMME.R' is deprecated in recent scikit-learn releases; drop it from the grid there.
grid['algorithm'] = ['SAMME', 'SAMME.R']
search = GridSearchCV(grid_ada, grid, scoring='accuracy', cv=10, n_jobs=-1)
# perform the search
ada_results = search.fit(X_train, y_train)

# summarize
print('Accuracy: %.3f' % ada_results.best_score_)
print('Config: %s' % ada_results.best_params_)

Accuracy: 0.755
Config: {'algorithm': 'SAMME.R', 'learning_rate': 0.30000000000000004}


In [51]:
grid_ada =  AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), **ada_results.best_params_,
    n_estimators=30,
    random_state=42)

grid_ada.fit(X_train, y_train)
ada_predictions = grid_ada.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, ada_predictions)

0.7922077922077922

## 6. GradientBoosting
Implement a [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) with 100 estimators.
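For reference, an untuned version of this model can be cross-validated in a few lines. This sketch again assumes synthetic data from `make_classification`; `grad_boost` and `gb_scores` are hypothetical names, and all hyperparameters other than `n_estimators` are left at their defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with 8 numeric features (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# 100 boosting stages; each stage fits a small regression tree to the current errors.
grad_boost = GradientBoostingClassifier(n_estimators=100, random_state=42)

gb_scores = cross_val_score(grad_boost, X, y, cv=10, scoring='accuracy')
print('Mean accuracy: %.3f' % gb_scores.mean())
```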

In [52]:
from sklearn.ensemble import GradientBoostingClassifier

# define model evaluation method

grid_gradBoost = GradientBoostingClassifier(n_estimators=100,
                                  random_state=42)
# define grid
grid = dict()
# define search
grid['learning_rate'] = list(np.arange(0.1, 1.0, 0.1))
grid['max_depth'] = [1,2,3]
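# Note: 'mse' and 'mae' are only accepted by older scikit-learn releases
# ('mse' was later renamed 'squared_error' and 'mae' was removed).
grid['criterion'] = ['friedman_mse', 'mse', 'mae']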
search = GridSearchCV(grid_gradBoost, grid, scoring='accuracy', cv=10, n_jobs=-1)
# perform the search
gradBoost_results = search.fit(X_train, y_train)

# summarize
print('Accuracy: %.3f' % gradBoost_results.best_score_)
print('Config: %s' % gradBoost_results.best_params_)

Accuracy: 0.757
Config: {'criterion': 'friedman_mse', 'learning_rate': 0.1, 'max_depth': 3}


In [53]:
grid_gradBoost = GradientBoostingClassifier(**gradBoost_results.best_params_,
                                            n_estimators=100,
                                            random_state=42)

grid_gradBoost.fit(X_train, y_train)
gradBoost_predictions = grid_gradBoost.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, gradBoost_predictions)

0.7792207792207793

## 7. XGBoost
For this section use an [XGBoostClassifier](https://docs.getml.com/latest/api/getml.predictors.XGBoostClassifier.html) with 100 estimators. XGBoost is not part of sklearn's suite of models, so you will have to install it with `pip install`.

In [29]:
#!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.2.1-py3-none-win_amd64.whl (86.5 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.2.1


In [30]:
import xgboost

In [54]:
# define model evaluation method

grid_XGB = xgboost.XGBClassifier(n_estimators = 100, random_state=42)

# define grid
grid = dict()
# define search
grid['learning_rate'] = list(np.arange(0.1, 1.0, 0.1))
grid['max_depth'] = [1,2,3]
search = GridSearchCV(grid_XGB, grid, scoring='accuracy', cv=10, n_jobs=-1)
# perform the search
results_XGB = search.fit(X_train, y_train)

# summarize
print('Accuracy: %.3f' % results_XGB.best_score_)
print('Config: %s' % results_XGB.best_params_)

Accuracy: 0.765
Config: {'learning_rate': 0.2, 'max_depth': 2}


In [56]:
grid_XGB = xgboost.XGBClassifier(**results_XGB.best_params_,n_estimators = 100, random_state=42)
grid_XGB.fit(X_train, y_train)
XGB_predictions = grid_XGB.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, XGB_predictions)

0.7662337662337663

In [None]:
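from sklearn.metrics import mean_squared_error

# Regression variant for comparison. This notebook has no separate validation
# split, so the held-out test split is used instead.
xgb_reg = xgboost.XGBRegressor(random_state=42)
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_test)
test_error = mean_squared_error(y_test, y_pred)
print("Test MSE:", test_error)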

## 8. Resultados
Create a Series with the results and their algorithms, sorted from highest to lowest.
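Since the exercise asks for a Series, one option is to index the accuracies by algorithm name and sort in descending order. The sketch below hard-codes the test accuracies reported in the cells above, rounded to six decimals; `results` and `ranked` are illustrative names.

```python
import pandas as pd

# Test accuracies reported earlier in this notebook, rounded to six decimals.
results = pd.Series(
    {
        'Bagging': 0.805195,
        'Random Forest': 0.792208,
        'AdaBoost': 0.792208,
        'GradientBoost': 0.779221,
        'XGBoost': 0.766234,
    },
    name='accuracy',
)

# ascending=False puts the best-scoring algorithm first.
ranked = results.sort_values(ascending=False)
print(ranked)
```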

In [57]:
accuracies = [
            accuracy_score(y_test, bag_predictions),
            accuracy_score(y_test, rf_predictions),
            accuracy_score(y_test, ada_predictions),
            accuracy_score(y_test, gradBoost_predictions),
            accuracy_score(y_test, XGB_predictions)]

In [58]:
algoritmos = ['Bagging', 'Random Forest', 'AdaBoost', 'GradientBoost',
             'XGBoost']

In [62]:
df_resultados = pd.DataFrame(zip(algoritmos, accuracies),
                             columns=['Algoritmo', 'Precision'])

In [63]:
df_resultados

|   | Algoritmo | Precision |
|---|---|---|
| 0 | Bagging | 0.805195 |
| 1 | Random Forest | 0.792208 |
| 2 | AdaBoost | 0.792208 |
| 3 | GradientBoost | 0.779221 |
| 4 | XGBoost | 0.766234 |


In [64]:
df_resultados.sort_values('Precision', ascending=False)

|   | Algoritmo | Precision |
|---|---|---|
| 0 | Bagging | 0.805195 |
| 1 | Random Forest | 0.792208 |
| 2 | AdaBoost | 0.792208 |
| 3 | GradientBoost | 0.779221 |
| 4 | XGBoost | 0.766234 |
