# Ensemble Models

## Introduction

The goal of this notebook is to use a well-known dataset to predict whether a movie will receive an Oscar.

We will train several base models and then compare their performance with that of an ensemble model.

Finally, we will briefly explore how to do the same with PyCaret, an AutoML library.


## Dataset

This dataset contains the following features:

 *   **Marketing expense:**    (float64)    Total marketing spend
 *   **Production expense:**   (float64)    Total production spend
 *   **Multiplex coverage:**   (float64)    Average multiplex coverage
 *   **Budget:**               (float64)    Budget
 *   **Movie_length:**         (float64)    Length of the movie
 *   **Lead_ Actor_Rating:**   (float64)    Rating of the lead actor
 *   **Lead_Actress_rating:**  (float64)    Rating of the lead actress
 *   **Director_rating:**      (float64)    Rating of the director
 *   **Producer_rating:**      (float64)    Rating of the producer
 *   **Critic_rating:**        (float64)    Rating given by critics
 *   **Trailer_views:**        (int64)      Number of trailer views
 *   **3D_available:**         (object)     Whether it is available in 3D (Yes/No)
 *   **Time_taken:**           (float64)    Time taken to make the movie
 *   **Twitter_hastags:**      (float64)    Number of mentions on Twitter
 *   **Genre:**                (object)     Movie genre
 *   **Avg_age_actors:**       (int64)      Average age of the actors
 *   **Num_multiplex:**        (int64)      Number of multiplexes
 *   **Collection:**           (int64)      Box-office collection
 *   **Start_Tech_Oscar:**     (int64)      Whether it received an Oscar (1) or not (0)
 

## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from scipy.stats import mode
import seaborn as sns

In [2]:
df_total= pd.read_csv("../data/Movie_classification.csv")

In [3]:
df_total.shape

(506, 19)

In [4]:
df_total = df_total.dropna()

In [5]:
df_total.shape

(494, 19)

In [6]:
df = df_total.drop(["Genre", "3D_available"], axis = 1)
df.shape

(494, 17)

In [7]:
df.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,Collection,Start_Tech_Oscar
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,109.6,223.84,23,494,48000,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,146.64,243.456,42,462,43200,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,147.88,2022.4,38,458,69400,1
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,185.36,225.344,45,472,66800,1
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,176.48,225.792,55,395,72400,1


In [8]:
X = df.drop("Start_Tech_Oscar", axis = 1)
print(X.shape)

y = df['Start_Tech_Oscar']
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 23)


(494, 16)
(494,)


In [9]:
scaler = StandardScaler()
X_train_scl = scaler.fit_transform(X_train)

X_test_scl = scaler.transform(X_test)

Let's train a logistic regression model to predict the value of "Start_Tech_Oscar" and evaluate its performance on the test set using:

* accuracy

* confusion matrix
    


In [10]:
model_1 = LogisticRegression()
fit_1 = model_1.fit(X_train_scl, y_train)

In [11]:
predict_1_cat = fit_1.predict(X_test_scl)
accuracy_1 = accuracy_score(y_test, predict_1_cat)
print(accuracy_1)
conf_mat_1 = confusion_matrix(y_test, predict_1_cat)
print(conf_mat_1)

0.6451612903225806
[[33 22]
 [22 47]]
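As a quick check on how the two metrics relate, accuracy can be recovered from the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of test samples. The sketch below reuses the matrix printed above; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix printed above for the logistic regression model.
# scikit-learn convention: rows are true classes, columns are predicted classes.
conf_mat = np.array([[33, 22],
                     [22, 47]])

correct = np.trace(conf_mat)   # correctly classified samples (diagonal)
total = conf_mat.sum()         # all test samples
accuracy = correct / total

print(correct, total, accuracy)
```

The same arithmetic applies to every confusion matrix in this notebook, which makes it easy to cross-check the printed accuracies.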


In [12]:
model_2 = KNeighborsClassifier()
fit_2 = model_2.fit(X_train_scl, y_train)

In [13]:
predict_2_cat = fit_2.predict(X_test_scl)
accuracy_2 = accuracy_score(y_test, predict_2_cat)
print(accuracy_2)
conf_mat_2 = confusion_matrix(y_test, predict_2_cat)
print(conf_mat_2)

0.532258064516129
[[28 27]
 [31 38]]


In [14]:
model_3 = tree.DecisionTreeClassifier(criterion='gini')
fit_3 = model_3.fit(X_train_scl, y_train)

In [15]:
predict_3_cat = fit_3.predict(X_test_scl)
accuracy_3 = accuracy_score(y_test, predict_3_cat)
print(accuracy_3)
conf_mat_3 = confusion_matrix(y_test, predict_3_cat)
print(conf_mat_3)

0.6129032258064516
[[34 21]
 [27 42]]


Let's build an ensemble model using the three previous models as base models.

To do this, write a function `predict_ensamble` that computes the `Start_Tech_Oscar` label as the mode of the base predictors' answers.

Evaluate the ensemble's performance using:

* accuracy

* confusion matrix




In [22]:
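def predict_ensamble(X, model_1, model_2, model_3):
    # Collect the label predicted by each base model for every row of X
    y_pred_1 = model_1.predict(X)
    y_pred_2 = model_2.predict(X)
    y_pred_3 = model_3.predict(X)
    # Majority vote: the mode across the three predictions of each sample
    result_mode = mode([y_pred_1, y_pred_2, y_pred_3], axis=0).mode
    # Flatten so the result is 1-D regardless of the SciPy version
    return np.ravel(result_mode)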

In [23]:
predict_cat_ensemble = predict_ensamble(X_test_scl, fit_1, fit_2, fit_3)
accuracy_ensemble = accuracy_score(y_test, predict_cat_ensemble)
print(accuracy_ensemble)
predict_cat_ensemble.shape
conf_mat_ensemble = confusion_matrix(y_test, predict_cat_ensemble)
print(conf_mat_ensemble)

0.6048387096774194
[[35 20]
 [29 40]]
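For comparison, scikit-learn ships a `VotingClassifier` whose `voting="hard"` mode performs the same kind of majority vote. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` instead of the movie CSV, and the `*_demo` names are hypothetical. Note that `VotingClassifier` refits its own copies of the base models rather than reusing already fitted ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the movie data (assumption: any binary dataset works here).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

# Hard voting: each fitted model casts one vote per sample, the majority label wins.
voter = VotingClassifier(estimators=base_models, voting="hard")
voter.fit(X_tr, y_tr)
hard_pred = voter.predict(X_te)

# Same vote computed by hand from the models that VotingClassifier fitted internally.
votes = np.array([est.predict(X_te) for est in voter.estimators_])
manual_pred = (votes.sum(axis=0) >= 2).astype(int)   # at least 2 of 3 vote for class 1

print((hard_pred == manual_pred).all())
```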


How does the ensemble model perform compared with the base models?

Does it improve if we use two models in the ensemble? Which models would you use?

In this case, the ensemble (accuracy 0.605) does not outperform the best base model, the logistic regression (accuracy 0.645).

In [24]:
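def predict_ensamble_2(X, model_1, model_2):
    y_pred_1 = model_1.predict(X)
    y_pred_2 = model_2.predict(X)
    # With two voters ties are possible; scipy's mode returns the smallest label in that case
    result_mode = mode([y_pred_1, y_pred_2], axis=0).mode
    # Flatten so the result is 1-D regardless of the SciPy version
    return np.ravel(result_mode)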


In [25]:
predict_cat_ensemble_2_3 = predict_ensamble_2(X_test_scl, fit_2, fit_3)
accuracy_ensemble_2_3 = accuracy_score(y_test, predict_cat_ensemble_2_3)
print(accuracy_ensemble_2_3)
accuracy_ensemble_2_3.shape
conf_mat_ensemble_2_3 = confusion_matrix(y_test, predict_cat_ensemble_2_3)
print(conf_mat_ensemble_2_3)

0.5645161290322581
[[43 12]
 [42 27]]


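With an accuracy of 0.565, this ensemble is slightly below the three-model ensemble (0.605) and worse than the best base model (0.645). With only two voters, `mode` breaks ties by returning the smallest label, so the ensemble predicts 1 only when both models predict 1.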
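The tie-breaking behaviour of `scipy.stats.mode` can be checked on a toy example. The two prediction arrays below are made up for illustration; they are not taken from the fitted models.

```python
import numpy as np
from scipy.stats import mode

# Hypothetical predictions from two binary classifiers on five samples.
pred_a = np.array([1, 1, 0, 0, 1])
pred_b = np.array([1, 0, 1, 0, 1])

# np.ravel keeps the result 1-D regardless of the SciPy version's output shape.
vote = np.ravel(mode(np.vstack([pred_a, pred_b]), axis=0).mode)

# With two voters, a 1-vs-0 tie resolves to 0, so the vote matches a logical AND.
print(vote)
print(np.array_equal(vote, pred_a & pred_b))
```

This is consistent with the confusion matrices of the two-model ensembles, which show few false positives and many false negatives.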

We also try the combinations 1+2 and 1+3; neither improves on the best base model (1+3 only matches it):

In [26]:
predict_cat_ensemble_1_2 = predict_ensamble_2(X_test_scl, fit_1, fit_2)
accuracy_ensemble_1_2 = accuracy_score(y_test, predict_cat_ensemble_1_2)
print(accuracy_ensemble_1_2)
accuracy_ensemble_1_2.shape
conf_mat_ensemble_1_2 = confusion_matrix(y_test, predict_cat_ensemble_1_2)
print(conf_mat_ensemble_1_2)

0.5806451612903226
[[42 13]
 [39 30]]


In [25]:
predict_cat_ensemble_1_3 = predict_ensamble_2(X_test_scl, fit_1, fit_3)
accuracy_ensemble_1_3 = accuracy_score(y_test, predict_cat_ensemble_1_3)
print(accuracy_ensemble_1_3)
accuracy_ensemble_1_3.shape
conf_mat_ensemble_1_3 = confusion_matrix(y_test, predict_cat_ensemble_1_3)
print(conf_mat_ensemble_1_3)

0.6451612903225806
[[44 11]
 [33 36]]


*  Can you extend this to more models?
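One possible way to generalize the voting function to any number of models is sketched below. The function name `predict_ensemble_n` and the `ConstantModel` helper are hypothetical; the sketch assumes that labels are non-negative integers such as 0/1 and that every model exposes a `predict` method.

```python
import numpy as np

def predict_ensemble_n(X, *models):
    """Majority vote over any number of fitted classifiers.

    Assumes labels are non-negative integers (e.g. 0/1); ties go to the smallest label.
    """
    # One row of predictions per model: shape (n_models, n_samples).
    votes = np.vstack([np.asarray(m.predict(X)).astype(int) for m in models])
    # For each sample (column), count the votes per label and keep the most frequent one.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)


class ConstantModel:
    """Hypothetical stand-in for a fitted classifier: always predicts one label."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return np.full(len(X), self.label)


X_dummy = np.zeros((4, 2))
models = [ConstantModel(1), ConstantModel(0), ConstantModel(1),
          ConstantModel(1), ConstantModel(0)]
ensemble_pred = predict_ensemble_n(X_dummy, *models)
print(ensemble_pred)
```

With the objects from this notebook the call would look like `predict_ensemble_n(X_test_scl, fit_1, fit_2, fit_3)`. Using an odd number of models avoids ties in the binary case.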


# Using AutoML

Let's use what we learned in the AutoML class to see the results of running many models at once and weighting them. Recall that our target variable is 'Start_Tech_Oscar'.

For more information: [Ensemble model Pycaret](https://pycaret.gitbook.io/docs/get-started/functions/optimize#ensemble_model)

## Bagging Method

Bagging, also known as bootstrap aggregating, is a machine learning meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
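The mechanism can be sketched by hand: draw bootstrap samples, fit one tree per sample, and aggregate the predictions. This is a minimal illustration on synthetic data (an assumption, not the movie dataset), with hypothetical variable names. scikit-learn's `BaggingClassifier` packages the same idea in a single estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption: stands in for any training set).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
rng = np.random.default_rng(0)
n_estimators = 11   # odd number, so a binary majority vote cannot tie

trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw n rows with replacement from the training set.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Aggregate: majority vote across the trees (labels are 0/1, so the mean works).
votes = np.array([t.predict(X_demo) for t in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

print(bagged_pred[:10])
```

Each tree sees a slightly different sample, so their individual errors are partly independent; combining them is what reduces the variance.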

## Boosting Method

Boosting is an ensemble meta-algorithm that primarily reduces bias, and also variance, in supervised learning. It belongs to the family of machine learning algorithms that convert weak learners into strong ones. A weak learner is a classifier that is only slightly correlated with the true classification (it labels examples better than random guessing). A strong learner, in contrast, is a classifier that is arbitrarily well correlated with the true classification.
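To make the idea concrete, the sketch below works through a single reweighting round of the classic discrete AdaBoost update. The labels and weak-learner predictions are made up, and this textbook variant is used only as an illustration; it is not necessarily the exact variant that a given library runs.

```python
import numpy as np

# Hypothetical labels and weak-learner predictions, encoded as -1 / +1.
y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])
h_pred = np.array([1, 1, -1, -1, -1, -1, 1, 1])   # two mistakes out of eight

w = np.full(len(y_true), 1 / len(y_true))          # start with uniform weights
miss = h_pred != y_true

err = w[miss].sum()                                # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)              # its weight in the final vote

# Misclassified samples gain weight, correct ones lose weight, then renormalize.
w_new = w * np.exp(-alpha * y_true * h_pred)
w_new /= w_new.sum()

print(err, alpha)
print(w_new[miss].sum())
```

After the update, the misclassified samples carry half of the total weight, so the next weak learner is pushed to focus on them.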

<img src='bb.png' width=90%>

In [27]:
from pycaret.classification import *

In [28]:
df.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,Collection,Start_Tech_Oscar
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,109.6,223.84,23,494,48000,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,146.64,243.456,42,462,43200,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,147.88,2022.4,38,458,69400,1
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,185.36,225.344,45,472,66800,1
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,176.48,225.792,55,395,72400,1


In [30]:
# Initialize the experiment configuration with setup
clf = setup(df, target = 'Start_Tech_Oscar', log_experiment = True, experiment_name = 'oscars',use_gpu=True)

Unnamed: 0,Description,Value
0,session_id,762
1,Target,Start_Tech_Oscar
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(494, 17)"
5,Missing Values,False
6,Numeric Features,16
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


2022/06/11 09:05:25 INFO mlflow.tracking.fluent: Experiment with name 'oscars' does not exist. Creating a new experiment.


In [32]:
# train model
dt = create_model('dt')

# ensemble model
boosted_dt = ensemble_model(dt, method = 'Boosting')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6857,0.6833,0.7,0.7368,0.7179,0.3636,0.3642
1,0.4857,0.477,0.5789,0.5238,0.55,-0.0465,-0.0468
2,0.4571,0.4408,0.6316,0.5,0.5581,-0.1214,-0.1271
3,0.5143,0.5033,0.6316,0.5455,0.5854,0.0067,0.0068
4,0.4857,0.4967,0.3684,0.5385,0.4375,-0.0064,-0.0068
5,0.5294,0.5088,0.6842,0.5652,0.619,0.0181,0.0186
6,0.4412,0.4439,0.4211,0.5,0.4571,-0.11,-0.1117
7,0.6471,0.6421,0.6842,0.6842,0.6842,0.2842,0.2842
8,0.6471,0.6491,0.6316,0.7059,0.6667,0.2941,0.2962
9,0.5588,0.5491,0.6316,0.6,0.6154,0.0989,0.0991


In [34]:
type(boosted_dt)
print(boosted_dt)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=None,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                       

By default, PyCaret uses 10 estimators for both Bagging and Boosting. This can be increased through the `n_estimators` parameter.

In [35]:
# ensemble model
ensemble_model(dt, n_estimators = 100)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6571,0.6717,0.8,0.6667,0.7273,0.2759,0.2843
1,0.4,0.472,0.5263,0.4545,0.4878,-0.227,-0.2306
2,0.6286,0.6234,0.7368,0.6364,0.6829,0.2404,0.2442
3,0.6571,0.6941,0.8421,0.64,0.7273,0.2881,0.3083
4,0.6286,0.6069,0.5789,0.6875,0.6286,0.2626,0.2664
5,0.5882,0.614,0.6316,0.6316,0.6316,0.1649,0.1649
6,0.5882,0.6877,0.5263,0.6667,0.5882,0.1877,0.193
7,0.6471,0.7053,0.7368,0.6667,0.7,0.274,0.276
8,0.5882,0.7,0.7368,0.6087,0.6667,0.1408,0.1452
9,0.5882,0.6509,0.6316,0.6316,0.6316,0.1649,0.1649


BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=762,
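The fold tables above do not include a summary row, so a short sketch like the following can be used to compare the two runs. It assumes that the first table belongs to the `method = 'Boosting'` call and the second to the `n_estimators = 100` call; the accuracies are copied by hand from those tables.

```python
import numpy as np

# Per-fold accuracies copied from the two tables above.
acc_boosting = np.array([0.6857, 0.4857, 0.4571, 0.5143, 0.4857,
                         0.5294, 0.4412, 0.6471, 0.6471, 0.5588])
acc_bagging = np.array([0.6571, 0.4000, 0.6286, 0.6571, 0.6286,
                        0.5882, 0.5882, 0.6471, 0.5882, 0.5882])

for name, acc in [("Boosting (10 estimators)", acc_boosting),
                  ("Bagging (100 estimators)", acc_bagging)]:
    print(f"{name}: mean={acc.mean():.4f}, std={acc.std():.4f}")
```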
 

In [36]:
# train model
lr = create_model('lr')

# ensemble model
ensemble_model(lr, choose_better = True)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6286,0.5733,0.75,0.6522,0.6977,0.2222,0.2259
1,0.6286,0.6053,0.6316,0.6667,0.6486,0.2553,0.2557
2,0.6,0.6842,0.7368,0.6087,0.6667,0.1779,0.183
3,0.6286,0.6941,0.7368,0.6364,0.6829,0.2404,0.2442
4,0.6,0.6217,0.6316,0.6316,0.6316,0.1941,0.1941
5,0.6176,0.6912,0.6842,0.65,0.6667,0.2191,0.2195
6,0.6176,0.6596,0.7368,0.6364,0.6829,0.2079,0.2114
7,0.4706,0.5263,0.4737,0.5294,0.5,-0.0588,-0.0592
8,0.6471,0.7789,0.8421,0.64,0.7273,0.2527,0.2725
9,0.5588,0.5509,0.5263,0.625,0.5714,0.1237,0.1257


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=762, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

`blend_models` trains a Soft Voting / Majority Rule classifier for the models passed in the `estimator_list` parameter. Its output is a score grid with the cross-validation (CV) scores per fold. The metrics evaluated during CV can be accessed with the `get_metrics` function, and custom metrics can be added or removed with `add_metric` and `remove_metric`.
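The difference between soft voting and majority-rule (hard) voting can be seen in a small numeric sketch. The probabilities below are invented for illustration and do not come from any fitted model.

```python
import numpy as np

# Hypothetical class-1 probabilities from three models for four samples.
proba_class1 = np.array([
    [0.90, 0.45, 0.40, 0.20],   # model A
    [0.40, 0.45, 0.60, 0.30],   # model B
    [0.45, 0.95, 0.55, 0.10],   # model C
])

# Hard voting: threshold each model at 0.5, then take the majority label.
hard_votes = (proba_class1 > 0.5).astype(int)
hard_pred = (hard_votes.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities first, then threshold the average.
soft_pred = (proba_class1.mean(axis=0) > 0.5).astype(int)

print(hard_pred)
print(soft_pred)
```

In the first two samples a single very confident model outweighs two mildly negative ones under soft voting, while hard voting ignores how confident each model is.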

In [37]:
# train a few models
lr = create_model('lr')
dt = create_model('dt')
knn = create_model('knn')

# blend models
blender = blend_models([lr, dt, knn])

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6571,0.7067,0.6,0.75,0.6667,0.3226,0.3311
1,0.4857,0.4309,0.7368,0.5185,0.6087,-0.0788,-0.0898
2,0.4286,0.5132,0.6316,0.48,0.5455,-0.1864,-0.1995
3,0.4,0.4441,0.6842,0.4643,0.5532,-0.2651,-0.3154
4,0.5714,0.6678,0.4737,0.6429,0.5455,0.1573,0.1639
5,0.5294,0.593,0.5789,0.5789,0.5789,0.0456,0.0456
6,0.4706,0.5123,0.4211,0.5333,0.4706,-0.0444,-0.0456
7,0.7059,0.6772,0.7368,0.7368,0.7368,0.4035,0.4035
8,0.5588,0.6561,0.5789,0.6111,0.5946,0.1115,0.1117
9,0.5882,0.6035,0.6316,0.6316,0.6316,0.1649,0.1649


In [38]:
type(blender)

sklearn.ensemble._voting.VotingClassifier

In [39]:
print(blender)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=1000,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=762,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('dt',
                              DecisionTreeClassifier(ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',...
                                        