# Activity No. 05: iFood

## Team members

**Group No. 03**

- Adriana Villalobos
- Gustavo Ledesma
- Alejo Cuello

## Activity description

We work with iFood's *marketing_campaign.csv* dataset. The goal of the activity is to validate the classification and regression models used to predict different variables.

# Assignment

- Create a classification model using Random Forest for the `Response` column.
- Save the Random Forest classification model as `rfc.pkl`.
- Create a linear regression model and a Random Forest + GridSearchCV model to predict the `Income` column.
- Save both regression models as pickle files: `lr.pkl` and `rfr.pkl`.
- Upload the project to GitHub / GitLab, using git and git-lfs for the `.csv` and `.pkl` files.

## Considerations

- Replicate this notebook to solve the exercise.
- Consider the stages: 1) loading the data, 2) preparing the data, 3) classification, 4) regression, and 5) saving a model.

**We are free to decide:**
- How to prepare and condition the dataset.
- Which columns to add to or remove from the dataset.
- Which parameters to tune in the classification and regression models.

# Code

#### DATA LOADING

In [6]:
import pandas as pd
import numpy as np
import pickle

from funpymodeling.exploratory import status
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor 

In [7]:
data = pd.read_csv("marketing_campaign.csv", sep=';', index_col=0)
data.head(5)

| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11 | 4 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |


In [8]:
status(data)

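|  | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | Year_Birth | 0 | 0.0 | 0 | 0.0 | 59 | int64 |
| 1 | Education | 0 | 0.0 | 0 | 0.0 | 5 | object |
| 2 | Marital_Status | 0 | 0.0 | 0 | 0.0 | 8 | object |
| 3 | Income | 24 | 0.010714 | 0 | 0.0 | 1974 | float64 |
| 4 | Kidhome | 0 | 0.0 | 1293 | 0.577232 | 3 | int64 |
| 5 | Teenhome | 0 | 0.0 | 1158 | 0.516964 | 3 | int64 |
| 6 | Dt_Customer | 0 | 0.0 | 0 | 0.0 | 663 | object |
| 7 | Recency | 0 | 0.0 | 28 | 0.0125 | 100 | int64 |
| 8 | MntWines | 0 | 0.0 | 13 | 0.005804 | 776 | int64 |
| 9 | MntFruits | 0 | 0.0 | 400 | 0.178571 | 158 | int64 |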


#### DISCARDING RECORDS WITH A NULL TARGET VARIABLE

In [None]:
discarded_data = data[data["Income"].isna()]
data = data[data["Income"].notna()]
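
A minimal sketch of the same filtering step on a toy DataFrame (the values are invented for illustration), showing that `isna()` and `notna()` split the rows into two disjoint sets:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, np.nan],
    "Response": [1, 0, 0, 1],
})

toy_discarded = toy[toy["Income"].isna()]   # rows without a target value
toy_clean = toy[toy["Income"].notna()]      # rows kept for modeling

print(len(toy_discarded), len(toy_clean))
```

Keeping the discarded rows in a separate variable makes it possible to inspect them later, or to predict their `Income` once a regression model is available.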

#### CHECKING SOME VARIABLES

In [21]:
print("Unusual cases where the customer spent more on gold products than on all other categories combined")
other_cols = ["MntFishProducts", "MntMeatProducts", "MntFruits", "MntSweetProducts", "MntWines"]
data[data["MntGoldProds"] > data[other_cols].sum(axis=1)][["MntGoldProds"] + other_cols]

Unusual cases where the customer spent more on gold products than on all other categories combined


| ID | MntGoldProds | MntFishProducts | MntMeatProducts | MntFruits | MntSweetProducts | MntWines |
|---|---|---|---|---|---|---|
| 5255 | 362 | 3 | 3 | 1 | 263 | 5 |
| 4246 | 262 | 4 | 26 | 11 | 3 | 67 |
| 6237 | 291 | 5 | 33 | 4 | 2 | 81 |
| 10311 | 321 | 2 | 12 | 4 | 4 | 16 |


In [39]:
print("Check that the Response variable has a success rate of approximately 15%")
data["Response"].sum() / data.shape[0]

Check that the Response variable has a success rate of approximately 15%


0.14910714285714285

#### DROPPING DATE COLUMNS AND COLUMNS WITH A SINGLE UNIQUE VALUE

In [22]:
columnas_a_eliminar = ['Year_Birth', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue']
data = data.drop(columns=columnas_a_eliminar)
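
Constant columns can also be found programmatically with `DataFrame.nunique()`. The sketch below uses a toy frame whose values are invented, apart from the constants 3 and 11 seen in the preview above. It is an illustration, not the procedure the notebook followed:

```python
import pandas as pd

# Toy frame: one informative column and two constant ones
toy = pd.DataFrame({
    "Recency": [58, 38, 26],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

n_unique = toy.nunique()                                  # distinct values per column
constant_cols = n_unique[n_unique == 1].index.tolist()    # columns that never vary
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
```

A column with a single value carries no information for a model, so dropping it does not change what the model can learn.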

In [23]:
status(data)

|  | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | Education | 0 | 0.0 | 0 | 0.0 | 5 | object |
| 1 | Marital_Status | 0 | 0.0 | 0 | 0.0 | 8 | object |
| 2 | Income | 24 | 0.010714 | 0 | 0.0 | 1974 | float64 |
| 3 | Kidhome | 0 | 0.0 | 1293 | 0.577232 | 3 | int64 |
| 4 | Teenhome | 0 | 0.0 | 1158 | 0.516964 | 3 | int64 |
| 5 | Recency | 0 | 0.0 | 28 | 0.0125 | 100 | int64 |
| 6 | MntWines | 0 | 0.0 | 13 | 0.005804 | 776 | int64 |
| 7 | MntFruits | 0 | 0.0 | 400 | 0.178571 | 158 | int64 |
| 8 | MntMeatProducts | 0 | 0.0 | 1 | 0.000446 | 558 | int64 |
| 9 | MntFishProducts | 0 | 0.0 | 384 | 0.171429 | 182 | int64 |


#### CONVERTING OBJECT COLUMNS TO NUMERIC, SINCE THEY HAVE FEW DISTINCT VALUES

In [24]:
data['Marital_Status'].unique()

array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Alone',
       'Absurd', 'YOLO'], dtype=object)

In [25]:
class_map = {'Single':0, 'Married':1, 'Together':1, 'Divorced':2, 'Widow':3, 'Alone':0, 'Absurd':0, 'YOLO':0}
data['Marital_Status'] = data['Marital_Status'].map(class_map)
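
`Series.map` with a dictionary returns NaN for any value that is missing from the dictionary, so a quick check after mapping can catch a forgotten or misspelled key. A small sketch with invented values (`'Unknown'` is a hypothetical category that does not appear in the dataset):

```python
import pandas as pd

# Deliberately incomplete mapping, for illustration only
class_map_demo = {'Single': 0, 'Married': 1, 'Together': 1, 'Divorced': 2, 'Widow': 3}
status_demo = pd.Series(['Single', 'Together', 'Unknown'])

mapped = status_demo.map(class_map_demo)

# Original labels that ended up as NaN after mapping
unmapped = status_demo[mapped.isna()].unique().tolist()
print(unmapped)
```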

In [26]:
data['Education'].unique()

array(['Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'], dtype=object)

In [27]:
# If we knew the order from the most basic to the most specialized level, we could encode them ordinally
class_map = {'Graduation':0, 'PhD':1, 'Master':2, 'Basic':3, '2n Cycle':4}
data['Education'] = data['Education'].map(class_map)
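
If such an order were available, the mapping could be built from an ordered list. The order used below is an assumption made purely for illustration, since the dataset does not document it:

```python
import pandas as pd

# ASSUMED order, from most basic to most specialized (not confirmed by the dataset)
education_order = ['Basic', '2n Cycle', 'Graduation', 'Master', 'PhD']
ordinal_map = {level: rank for rank, level in enumerate(education_order)}

education_demo = pd.Series(['Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'])
encoded = education_demo.map(ordinal_map)

print(encoded.tolist())
```

With an ordinal encoding, a larger number means a higher level, which a linear model can exploit. The arbitrary numbering used in the cell above carries no such meaning.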

In [28]:
status(data)

|  | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | Education | 0 | 0.0 | 1127 | 0.503125 | 5 | int64 |
| 1 | Marital_Status | 0 | 0.0 | 487 | 0.217411 | 4 | int64 |
| 2 | Income | 24 | 0.010714 | 0 | 0.0 | 1974 | float64 |
| 3 | Kidhome | 0 | 0.0 | 1293 | 0.577232 | 3 | int64 |
| 4 | Teenhome | 0 | 0.0 | 1158 | 0.516964 | 3 | int64 |
| 5 | Recency | 0 | 0.0 | 28 | 0.0125 | 100 | int64 |
| 6 | MntWines | 0 | 0.0 | 13 | 0.005804 | 776 | int64 |
| 7 | MntFruits | 0 | 0.0 | 400 | 0.178571 | 158 | int64 |
| 8 | MntMeatProducts | 0 | 0.0 | 1 | 0.000446 | 558 | int64 |
| 9 | MntFishProducts | 0 | 0.0 | 384 | 0.171429 | 182 | int64 |


#### Train/Test Split

In [29]:
data_x = data.drop('Response', axis=1)
data_y = data['Response']

In [30]:
data_x = data_x.values
data_y = data_y.values

In [31]:

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.3)
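
Since only about 15% of the rows have `Response` equal to 1, a stratified split keeps that proportion similar in both partitions, and `random_state` makes the split reproducible. A sketch on synthetic data (the arrays and the seed are assumptions, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))        # synthetic features
y_demo = np.array([1] * 30 + [0] * 170)   # 15% positives, as in Response

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    stratify=y_demo,     # preserve the class proportions in both partitions
    random_state=42,     # fixed seed for reproducibility
)

print(y_tr.mean(), y_te.mean())
```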

#### Linear Regression

In [32]:
x_data_reg = data.drop('Income', axis=1)
y_data_reg = data['Income']
x_data_reg = x_data_reg.values
y_data_reg = y_data_reg.values

In [33]:
x_train, x_test, y_train, y_test = train_test_split(x_data_reg, y_data_reg, test_size=0.3)

In [34]:
# a. Create the model
model = LinearRegression()

In [35]:
# b. Fit the model
model.fit(x_train, y_train)

ValueError: Input y contains NaN.

`Income` still contains 24 NaN values at this point (see the `status(data)` output above), which indicates that the cell discarding those rows, shown as `In [None]`, was not executed before fitting.

In [None]:
# c. Get predictions for train and test
pred_tr = model.predict(x_train)
pred_ts = model.predict(x_test)
pred_tr[0:6]
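
To quantify how good the predictions are, scikit-learn provides `mean_absolute_error` and `r2_score`. A self-contained sketch on synthetic data (the data is generated here and is unrelated to the iFood dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic linear relationship with a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
lr_demo = LinearRegression().fit(X_tr, y_tr)

pred_te_demo = lr_demo.predict(X_te)
mae_te = mean_absolute_error(y_te, pred_te_demo)   # average absolute error, in target units
r2_te = r2_score(y_te, pred_te_demo)               # share of variance explained

print(mae_te, r2_te)
```

Comparing the same metrics on train and test gives a first indication of overfitting.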

#### Saving the model to lr.pkl

In [None]:

with open('lr.pkl', 'wb') as handle:
    pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)
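
A saved model can be verified by loading it back and comparing its predictions with those of the original object. A sketch using a temporary directory and a toy model (the file name `lr_demo.pkl` is hypothetical):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model: y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
lr_demo = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary location so the example leaves no files in the project
path = os.path.join(tempfile.mkdtemp(), "lr_demo.pkl")
with open(path, "wb") as handle:
    pickle.dump(lr_demo, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back and check that both objects predict the same values
with open(path, "rb") as handle:
    lr_loaded = pickle.load(handle)

same = np.allclose(lr_demo.predict(X_demo), lr_loaded.predict(X_demo))
print(same)
```

Pickle files should be loaded with the same scikit-learn version that created them, and only from trusted sources.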


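#### RandomForest: `Response` Column

In [None]:
# The assignment asks for a classifier on `Response`, so a RandomForestClassifier is used.
# The split is rebuilt from data_x / data_y because the regression cells above reused
# the names x_train, x_test, y_train and y_test for the `Income` target.
from sklearn.ensemble import RandomForestClassifier

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.3)
model_rf = RandomForestClassifier()

In [None]:
params = {
    'n_estimators' : [10, 100, 300, 500,1000],
    'max_features': [50,100],
    #'bootstrap': [False, True],
    #'max_depth': [50, 500],
    #'min_samples_leaf': [3, 50],
    #'min_samples_split': [10, 50],
}

grid_rf = GridSearchCV(estimator = model_rf,
                        param_grid = params,
                        scoring = 'accuracy',  # classification metric instead of the regression MAE
                        cv = 5, 
                        verbose = 1
                        )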

In [None]:
grid_rf.fit(x_train, y_train)

In [None]:
grid_rf.best_estimator_

In [None]:
grid_rf.predict(x_train)
grid_rf.predict(x_test)
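
For a binary target such as `Response`, predictions are usually compared with the true labels using classification metrics. A sketch on synthetic data (the data, the seed, and the hyperparameters are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target driven by the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
pred_te_demo = clf_demo.predict(X_te)

acc = accuracy_score(y_te, pred_te_demo)
cm = confusion_matrix(y_te, pred_te_demo)   # rows: true class, columns: predicted class

print(acc)
print(cm)
```

With an imbalanced target, the confusion matrix is worth reading alongside accuracy, because a model that always predicts the majority class already reaches a high accuracy.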

In [None]:
grid_rf.best_params_

#### Saving the model to rfc.pkl

In [None]:
# Save the fitted model (best_estimator_), not only the best parameters
with open('rfc.pkl', 'wb') as handle:
    pickle.dump(grid_rf.best_estimator_, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
with open('rfc.pkl', 'rb') as handle:
    rfc_tr = pickle.load(handle)

#### RandomForest: `Income` Column

In [None]:
X = data.drop('Income', axis=1)
y = data['Income']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:


model_rf = RandomForestRegressor()

In [None]:
params = {
    'n_estimators' : [10, 100, 300, 500,1000],
    'max_features': [50,100],
    #'bootstrap': [False, True],
    #'max_depth': [50, 500],
    #'min_samples_leaf': [3, 50],
    #'min_samples_split': [10, 50],
}

grid_rf = GridSearchCV(estimator = model_rf,
                        param_grid = params,
                        scoring = 'neg_mean_absolute_error',
                        cv = 5, 
                        verbose = 1
                        )

In [None]:
grid_rf.fit(X_train, y_train)

In [None]:
grid_rf.best_estimator_

In [None]:
grid_rf.predict(X_train)
grid_rf.predict(X_test)

In [None]:
grid_rf.best_params_

#### Parameter combinations

In [None]:
pd.concat([pd.DataFrame(grid_rf.cv_results_["params"]),
           pd.DataFrame(grid_rf.cv_results_["mean_test_score"], 
                        columns=["neg_mean_absolute_error"])],axis=1).sort_values('neg_mean_absolute_error', ascending=False)
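
A more compact view of the same information can be built directly from `cv_results_`, which also exposes a `rank_test_score` column. A sketch with a small grid on synthetic data (the grid values and the data are assumptions chosen so the example runs quickly):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

grid_demo = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [5, 20], "max_features": [2, 4]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

# One row per parameter combination, best combination first
results = (
    pd.DataFrame(grid_demo.cv_results_)[
        ["param_n_estimators", "param_max_features", "mean_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```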

In [None]:
grid_rf.score(X_train, y_train)

In [None]:
grid_rf.score(X_test, y_test)
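
Because the search was configured with `scoring = 'neg_mean_absolute_error'`, `score` returns the negated MAE rather than R², so values closer to zero are better. A sketch on synthetic data that checks this equivalence (the data and the one-value grid are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

grid_demo = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10]},
    scoring="neg_mean_absolute_error",
    cv=3,
).fit(X_demo, y_demo)

score = grid_demo.score(X_demo, y_demo)                        # uses the configured scorer
mae = mean_absolute_error(y_demo, grid_demo.predict(X_demo))   # same quantity, positive sign

print(score, -mae)
```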

In [None]:
# Save the model
# rfr.pkl
with open('rfr.pkl', 'wb') as handle:
    pickle.dump(grid_rf.best_estimator_, handle, protocol=pickle.HIGHEST_PROTOCOL)
