# Bagging - Random Forest

## Introduction

In this class we will compare the performance of different models based on tree ensembles:

* Tree ensemble -> Bagging
* Tree ensemble -> Random Forest


## Dataset

In this class we will use a dataset with information about movies ("Movie_classification.csv").  
The dataset is made up of the following features:  

 *   **Marketing expense:**    (float64)    Total marketing spend
 *   **Production expense:**   (float64)    Total production spend
 *   **Multiplex coverage:**   (float64)    Average Multiplex coverage
 *   **Budget:**               (float64)    Budget
 *   **Movie_length:**         (float64)    Length of the movie
 *   **Lead_ Actor_Rating:**   (float64)    Rating of the lead actor
 *   **Lead_Actress_rating:**  (float64)    Rating of the lead actress
 *   **Director_rating:**      (float64)    Rating of the director
 *   **Producer_rating:**      (float64)    Rating of the producer
 *   **Critic_rating:**        (float64)    Rating given by the critics
 *   **Trailer_views:**        (int64)      Number of trailer views
 *   **3D_available:**         (object)     Whether it is available in 3D (YES/NO)
 *   **Time_taken:**           (float64)    Time taken, presumably to produce the movie
 *   **Twitter_hastags:**      (float64)    Number of mentions on Twitter
 *   **Genre:**                (object)     Genre of the movie
 *   **Avg_age_actors:**       (int64)      Average age of the actors
 *   **Num_multiplex:**        (int64)      Number of multiplexes
 *   **Collection:**           (int64)      Revenue collected
 *   **Start_Tech_Oscar:**     (int64)      Whether it received an Oscar or not.
 
 


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("../data/Movie_classification.csv", header=0)

In [3]:
df.shape

(506, 19)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64  
 11  3D_available         506 non-null    object 
 12  Time_taken           494 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object 
 15  Avg_age_actors       506 non-null    int

In [5]:
df.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,3D_available,Time_taken,Twitter_hastags,Genre,Avg_age_actors,Num_multiplex,Collection,Start_Tech_Oscar
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,YES,109.6,223.84,Thriller,23,494,48000,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,NO,146.64,243.456,Drama,42,462,43200,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,NO,147.88,2022.4,Comedy,38,458,69400,1
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,YES,185.36,225.344,Drama,45,472,66800,1
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,NO,176.48,225.792,Drama,55,395,72400,1


In [6]:
# Time_taken is the only column with fewer than 506 non-null observations,
# so we impute its missing values with the column mean.
df['Time_taken'].mean()

157.3914979757085

In [7]:
df['Time_taken'] = df['Time_taken'].fillna(df['Time_taken'].mean())
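
The imputation step can be checked on a tiny example. The sketch below is only illustrative: the `toy` frame and its values are made up and do not come from the movie dataset. It shows that the mean is computed over the non-missing values and that no missing values remain after filling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for Time_taken (values are made up).
toy = pd.DataFrame({"Time_taken": [100.0, np.nan, 140.0, 180.0, np.nan]})

# mean() skips NaN by default: (100 + 140 + 180) / 3 = 140.0
col_mean = toy["Time_taken"].mean()

# Assign the filled column back to the frame.
toy["Time_taken"] = toy["Time_taken"].fillna(col_mean)

n_missing = int(toy["Time_taken"].isna().sum())
print(col_mean, n_missing)  # 140.0 0
```

Note that the mean here is computed over all rows. In a stricter workflow one might compute it on the training rows only, after the train/test split, so that no information from the test rows enters the preprocessing.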

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64  
 11  3D_available         506 non-null    object 
 12  Time_taken           506 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object 
 15  Avg_age_actors       506 non-null    int

Generating dummy variables. Let's check whether there are categorical variables and, if so, generate dummy variables for those columns.

In [9]:
df.dtypes.loc[df.dtypes=="object"]  

3D_available    object
Genre           object
dtype: object

In [10]:
df[['3D_available','Genre']].head()

Unnamed: 0,3D_available,Genre
0,YES,Thriller
1,NO,Drama
2,NO,Comedy
3,YES,Drama
4,NO,Drama


In [11]:
# dtype=int keeps the indicators as 0/1 (recent pandas versions default to True/False)
df = pd.get_dummies(df, columns=["3D_available", "Genre"], drop_first=True, dtype=int)
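
To see what `drop_first=True` does, here is a minimal sketch on made-up rows. The category labels `YES`, `NO`, `Thriller`, `Drama` and `Comedy` appear in the output above; the fourth genre label, `Action`, is an assumption used only for illustration (the notebook does not show which genre is dropped).

```python
import pandas as pd

# Hypothetical toy rows; "Action" is an assumed label, not confirmed by the notebook.
toy = pd.DataFrame({
    "3D_available": ["YES", "NO", "NO", "YES"],
    "Genre": ["Thriller", "Drama", "Comedy", "Action"],
})

# drop_first=True drops the first category (in sorted order) of each column,
# so a column with k categories becomes k - 1 indicator columns.
dummies = pd.get_dummies(toy, columns=["3D_available", "Genre"],
                         drop_first=True, dtype=int)

print(list(dummies.columns))
# ['3D_available_YES', 'Genre_Comedy', 'Genre_Drama', 'Genre_Thriller']
```

A row whose genre is the dropped category is encoded as all zeros in the `Genre_*` columns, which is why dropping one level loses no information.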

In [12]:
df.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,...,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,Collection,Start_Tech_Oscar,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,...,109.6,223.84,23,494,48000,1,1,0,0,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,...,146.64,243.456,42,462,43200,0,0,0,1,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,...,147.88,2022.4,38,458,69400,1,0,1,0,0
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,...,185.36,225.344,45,472,66800,1,1,0,1,0
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,...,176.48,225.792,55,395,72400,1,0,0,1,0


Let's build a feature matrix (X) and the target vector (y) to predict `Start_Tech_Oscar` on the complete dataset.

What values does the variable `Start_Tech_Oscar` take in the dataset?
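
One way to answer this question is with `unique()` and `value_counts()`. The sketch below uses only the five target values visible in the `df.head()` output above, so the counts are illustrative and not the class balance of the full dataset; on the real data the same calls would be applied to `df["Start_Tech_Oscar"]`.

```python
import pandas as pd

# The five target values shown in the head() output (not the full column).
y_head = pd.Series([1, 0, 1, 1, 1], name="Start_Tech_Oscar")

# Distinct labels present in the column.
unique_values = sorted(int(v) for v in y_head.unique())

# Number of rows per label, most frequent first.
counts = {int(k): int(v) for k, v in y_head.value_counts().items()}

print(unique_values)  # [0, 1]
print(counts)         # {1: 4, 0: 1}
```

Checking the class balance this way helps put the accuracy numbers reported later into context.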


In [13]:
X = df.loc[:,df.columns!="Start_Tech_Oscar"]

In [14]:
X.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,Collection,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
0,20.1264,59.62,0.462,36524.125,138.7,7.825,8.095,7.91,7.995,7.94,527367,109.6,223.84,23,494,48000,1,0,0,1
1,20.5462,69.14,0.531,35668.655,152.4,7.505,7.65,7.44,7.47,7.44,494055,146.64,243.456,42,462,43200,0,0,1,0
2,20.5458,69.14,0.531,39912.675,134.6,7.485,7.57,7.495,7.515,7.44,547051,147.88,2022.4,38,458,69400,0,1,0,0
3,20.6474,59.36,0.542,38873.89,119.3,6.895,7.035,6.92,7.02,8.26,516279,185.36,225.344,45,472,66800,1,0,1,0
4,21.381,59.36,0.542,39701.585,127.7,6.92,7.07,6.815,7.07,8.26,531448,176.48,225.792,55,395,72400,0,0,1,0


In [15]:
X.shape

(506, 20)

In [16]:
y = df["Start_Tech_Oscar"]

In [17]:
y.head()

0    1
1    0
2    1
3    1
4    1
Name: Start_Tech_Oscar, dtype: int64

In [18]:
y.shape

(506,)

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,random_state=0)
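
The sizes of the resulting partitions can be reasoned about before looking at them. The following sketch uses placeholder arrays (zeros, not the movie data) with the same number of rows and feature columns as `X`, to show how `test_size=0.2` turns 506 rows into the two partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X and y (contents are irrelevant here).
X_demo = np.zeros((506, 20))
y_demo = np.arange(506) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 0.2 * 506 = 101.2, which scikit-learn rounds up to 102 test rows;
# the remaining 404 rows form the training set.
print(X_tr.shape, X_te.shape)  # (404, 20) (102, 20)
```

If the classes were strongly imbalanced, passing `stratify=y` to `train_test_split` would be one option to keep the class proportions similar in both partitions; the notebook does not use it.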

In [20]:
X_train.head()

Unnamed: 0,Marketing expense,Production expense,Multiplex coverage,Budget,Movie_length,Lead_ Actor_Rating,Lead_Actress_rating,Director_rating,Producer_rating,Critic_rating,Trailer_views,Time_taken,Twitter_hastags,Avg_age_actors,Num_multiplex,Collection,3D_available_YES,Genre_Comedy,Genre_Drama,Genre_Thriller
220,27.1618,67.4,0.493,38612.805,162.0,8.485,8.64,8.485,8.67,8.52,480270,174.68,224.272,23,536,53400,0,0,0,1
71,23.1752,76.62,0.587,33113.355,91.0,7.28,7.4,7.29,7.455,8.16,491978,200.68,263.472,46,400,43400,0,0,0,0
240,22.2658,64.86,0.572,38312.835,127.8,6.755,6.935,6.8,6.84,8.68,470107,204.8,224.32,24,387,54000,1,1,0,0
6,21.7658,70.74,0.476,33396.66,140.1,7.065,7.265,7.15,7.4,8.96,459241,139.16,243.664,41,522,45800,1,0,0,1
417,538.812,91.2,0.321,29463.72,162.6,9.135,9.305,9.095,9.165,6.96,302776,172.16,301.664,60,589,20800,1,0,0,0


In [21]:
X_train.shape

(404, 20)

In [22]:
X_test.shape

(102, 20)

## Creating a model using Bagging

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html  

1) Create a simple tree classifier.  
2) With this simple classifier, build the 'meta-model' based on the Bagging technique (use 1000 estimators).  
3) Train the Bagging model.  
4) Compute the confusion matrix.  
5) Compute the accuracy on the test dataset.  
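
Before using `BaggingClassifier`, it may help to see the idea behind it written out by hand. The sketch below is a simplified illustration on synthetic data from `make_classification` (not the movie dataset): each tree is trained on a bootstrap sample of the rows, and the predictions are combined by majority vote. It is not how scikit-learn implements bagging internally; in particular, `BaggingClassifier` averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any binary classification data would do).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_fit, y_fit = X_demo[:200], y_demo[:200]
X_new = X_demo[200:]

rng = np.random.default_rng(0)
n_trees = 25  # kept small and odd for the sketch; the notebook uses 1000
trees = []
for _ in range(n_trees):
    # Bootstrap: draw len(X_fit) row indices with replacement.
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_fit[idx], y_fit[idx]))

# Aggregate: majority vote across trees (labels are 0/1 here).
votes = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, n_new)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
```

Because each tree sees a different resample of the rows, their individual errors are partly independent, and averaging them tends to reduce variance compared with a single tree.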

In [23]:
from sklearn import tree
clftree = tree.DecisionTreeClassifier()

In [24]:
from sklearn.ensemble import BaggingClassifier

In [25]:
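# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`, and the old
# name was removed in 1.4; on recent versions use `estimator=clftree` instead.
bag_clf = BaggingClassifier(base_estimator=clftree, n_estimators=1000,
                            bootstrap=True, n_jobs=-1,
                            random_state=42)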

In [26]:
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=1000,
                  n_jobs=-1, random_state=42)

Exercise: try another classification model as the base estimator.
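
As one possible starting point for this exercise, the sketch below bags a k-nearest-neighbours classifier instead of a tree. The choice of `KNeighborsClassifier`, the 50 estimators and the synthetic data are all assumptions made for illustration; none of them come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the movie dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The base estimator is passed positionally because its keyword name differs
# between scikit-learn versions (base_estimator vs. estimator).
bag_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                            n_estimators=50, random_state=42)
bag_knn.fit(X_tr, y_tr)

# Mean accuracy on the held-out rows.
test_accuracy = bag_knn.score(X_te, y_te)
```

On the movie data, the same pattern would use `X_train`, `y_train`, `X_test` and `y_test` from the split above. Distance-based models such as k-nearest neighbours are sensitive to feature scale, so scaling the features first would likely matter there.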

In [27]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [28]:
confusion_matrix(y_test, bag_clf.predict(X_test))

array([[27, 17],
       [22, 36]], dtype=int64)

In [29]:
accuracy_score(y_test, bag_clf.predict(X_test))

0.6176470588235294
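
The accuracy can be recovered directly from the confusion matrix, which is a useful sanity check. The sketch below reuses the matrix printed above for the Bagging model; in scikit-learn's convention, rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Confusion matrix reported above (rows: true class 0/1, columns: predicted 0/1).
cm = np.array([[27, 17],
               [22, 36]])

# Accuracy = correct predictions (the diagonal) divided by all predictions.
correct = int(np.trace(cm))   # 27 + 36 = 63
total = int(cm.sum())         # 102 test rows
accuracy = correct / total
print(accuracy)  # 0.6176470588235294
```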

## Building a model using Random Forest

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
  


In [30]:
from sklearn.ensemble import RandomForestClassifier

In [31]:
rf_clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1 ,random_state=42)

In [32]:
rf_clf.fit(X_train, y_train)

RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=42)

In [33]:
confusion_matrix(y_test, rf_clf.predict(X_test))

array([[25, 19],
       [18, 40]], dtype=int64)

In [34]:
accuracy_score(y_test, rf_clf.predict(X_test))

0.6372549019607843

## Building a model using ExtraTreesClassifier()

This model is a variation of Random Forest that aims to be faster. Instead of searching for the optimal split point at each node of the trees, it draws split thresholds at random for the candidate features and keeps the best of those random candidates. This makes training faster, and the resulting model can match or even exceed the accuracy reached by Random Forest.
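
The difference between the two splitting strategies can be illustrated on a single feature. The sketch below is a simplification under stated assumptions: the feature `x`, the labels `y` and the helper `gini_after_split` are hypothetical, and only one feature and one random threshold are considered. It compares an exhaustive search over all candidate thresholds with a single threshold drawn uniformly at random.

```python
import numpy as np

def gini_after_split(x, y, threshold):
    """Weighted Gini impurity of the two children of the split x <= threshold."""
    total = 0.0
    for mask in (x <= threshold, x > threshold):
        if mask.any():
            p = y[mask].mean()                     # share of class 1 in this child
            total += mask.mean() * 2 * p * (1 - p)  # child weight * binary Gini
    return total

rng = np.random.default_rng(0)
x = rng.normal(size=200)                            # one hypothetical feature
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Exhaustive search: evaluate every midpoint between consecutive sorted values.
xs = np.unique(x)
midpoints = (xs[:-1] + xs[1:]) / 2
best_t = min(midpoints, key=lambda t: gini_after_split(x, y, t))

# Extra-Trees style: one threshold drawn uniformly between min and max.
random_t = rng.uniform(x.min(), x.max())

best_gini = gini_after_split(x, y, best_t)
random_gini = gini_after_split(x, y, random_t)
```

The exhaustive split is never worse than the random one in impurity, but it costs one evaluation per candidate threshold, while the random split costs a single evaluation. An ensemble of many such randomized trees can compensate for the weaker individual splits.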

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html  

1) Build the meta-model based on the ExtraTrees technique (use 10000 estimators).  
2) Train the model.  
3) Compute the confusion matrix.  
4) Compute the accuracy on the test dataset.

In [35]:
from sklearn.ensemble import ExtraTreesClassifier

In [36]:
et = ExtraTreesClassifier(n_estimators=10000, class_weight='balanced', random_state=1)

In [37]:
et.fit(X_train, y_train)

ExtraTreesClassifier(class_weight='balanced', n_estimators=10000,
                     random_state=1)

In [38]:
confusion_matrix(y_test, et.predict(X_test))

array([[23, 21],
       [18, 40]], dtype=int64)

In [39]:
accuracy_score(y_test, et.predict(X_test))

0.6176470588235294