# Activity No. 05: iFood

## Team members

**Group No. 03**

- Adriana Villalobos
- Gustavo Ledesma
- Alejo Cuello

## Activity description

We work with iFood's *marketing_campaign.csv* dataset. The goal of the activity is to validate the classification and regression models used to predict the `Response` and `Income` variables.

# Assignment

- Create a classification model using Random Forest for the `Response` column.
- Save the Random Forest classification model as `rfc.pkl`.
- Create a linear regression model and a Random Forest + GridSearchCV model to predict the `Income` column.
- Save both regression models as pickle files: `lr.pkl` and `rfr.pkl`.
- Upload the project to GitHub / GitLab; use git and git-lfs for the `.csv` and `.pkl` files.

## Considerations

- Replicate this notebook to solve the exercise.
- Consider the stages: 1) Loading the data, 2) Data preparation, 3) Classification, 4) Regression, and 5) Saving a model.

**We can decide:**
- How to prepare and condition the dataset.
- Which columns to add to or remove from the dataset.
- Which parameters to tune in the classification and regression models.

# Code

#### DATA LOADING

In [1]:
import pandas as pd
import numpy as np
import pickle

from funpymodeling.exploratory import status
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor 

In [2]:
data = pd.read_csv("marketing_campaign.csv", sep=';', index_col=0)
data.head(5)

| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11 | 4 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |


In [3]:
status(data)

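|  | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | Year_Birth | 0 | 0.0 | 0 | 0.0 | 59 | int64 |
| 1 | Education | 0 | 0.0 | 0 | 0.0 | 5 | object |
| 2 | Marital_Status | 0 | 0.0 | 0 | 0.0 | 8 | object |
| 3 | Income | 24 | 0.010714 | 0 | 0.0 | 1974 | float64 |
| 4 | Kidhome | 0 | 0.0 | 1293 | 0.577232 | 3 | int64 |
| 5 | Teenhome | 0 | 0.0 | 1158 | 0.516964 | 3 | int64 |
| 6 | Dt_Customer | 0 | 0.0 | 0 | 0.0 | 663 | object |
| 7 | Recency | 0 | 0.0 | 28 | 0.0125 | 100 | int64 |
| 8 | MntWines | 0 | 0.0 | 13 | 0.005804 | 776 | int64 |
| 9 | MntFruits | 0 | 0.0 | 400 | 0.178571 | 158 | int64 |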


In [4]:
status(data)[status(data)["unique"] > (data.shape[0]/2)]

|  | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 3 | Income | 24 | 0.010714 | 0 | 0.0 | 1974 | float64 |
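The summary above can be approximated with pandas alone, which may help when `funpymodeling` is not installed. The sketch below is an assumption about what each column means (counts and proportions of NaN and zeros, number of distinct values, dtype); `column_status` and the toy frame are hypothetical names and made-up values, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def column_status(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary, similar in spirit to funpymodeling's status()."""
    n_rows = len(df)
    q_nan = df.isna().sum()
    q_zeros = (df == 0).sum()
    return pd.DataFrame({
        "variable": df.columns,
        "q_nan": q_nan.values,
        "p_nan": (q_nan / n_rows).values,
        "q_zeros": q_zeros.values,
        "p_zeros": (q_zeros / n_rows).values,
        "unique": df.nunique().values,  # NaN is not counted as a distinct value
        "type": df.dtypes.astype(str).values,
    })


# Toy frame (made-up rows) standing in for the campaign data.
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0],
    "Kidhome": [0, 1, 0, 1],
    "Education": ["Graduation", "Graduation", "PhD", "Master"],
})
summary = column_status(toy)

# Same filter as in the notebook: columns whose distinct values exceed half the rows.
high_cardinality = summary[summary["unique"] > len(toy) / 2]
```

Computing the summary once and filtering it avoids calling the summary function twice, as the `status(data)[status(data)[...]]` expression does.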


#### DROPPING DATE COLUMNS AND COLUMNS WITH A SINGLE UNIQUE VALUE

In [None]:
columnas_a_eliminar = ['Year_Birth', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue']
data = data.drop(columns=columnas_a_eliminar)


In [None]:
status(data)
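Columns such as `Z_CostContact` and `Z_Revenue` hold the same value in every row shown, so they carry no information for a model. As a minimal sketch, constant columns could be detected instead of listed by hand; the toy frame below reuses the few values visible in the `head()` output and is only illustrative.

```python
import pandas as pd

# Toy frame built from values visible in the head() output above.
toy = pd.DataFrame({
    "Recency": [58, 38, 26, 26],
    "Z_CostContact": [3, 3, 3, 3],
    "Z_Revenue": [11, 11, 11, 11],
})

# A column with at most one distinct value is constant.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

reduced = toy.drop(columns=constant_cols)
```

On the full dataset the same check would run on `data` before the manual `drop`, leaving only the date-like columns to be listed explicitly.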

#### CONVERTING OBJECT COLUMNS TO NUMERIC, SINCE THEY HAVE FEW DISTINCT VALUES

In [None]:
data['Marital_Status'].unique()

In [None]:
class_map = {'Single':0, 'Married':1, 'Together':1, 'Divorced':2, 'Widow':3, 'Alone':0, 'Absurd':0, 'YOLO':0}
data['Marital_Status'] = data['Marital_Status'].map(class_map)

In [None]:
data['Education'].unique()

In [None]:
# If we knew the order from the most basic to the most specialized level, we could encode the values in that order
class_map = {'Graduation':0, 'PhD':1, 'Master':2, 'Basic':3, '2n Cycle':4}
data['Education'] = data['Education'].map(class_map)

In [None]:
status(data)
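If an education ranking were available, the mapping could be made ordinal so that larger codes mean more advanced studies. The sketch below assumes one possible ranking purely for illustration; the order in `education_order` is hypothetical and is not stated in the dataset or in the notebook.

```python
import pandas as pd

# ASSUMPTION: illustrative ranking from least to most advanced; not confirmed by the data source.
education_order = ["Basic", "2n Cycle", "Graduation", "Master", "PhD"]

toy = pd.DataFrame({"Education": ["Graduation", "PhD", "Master", "Basic", "2n Cycle"]})

# An ordered categorical assigns integer codes following the given category order.
ordered = pd.Categorical(toy["Education"], categories=education_order, ordered=True)
toy["Education_ord"] = ordered.codes
```

Labels missing from `education_order` receive the code `-1`, which makes unexpected categories easy to spot.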

#### Train/Test Split

In [None]:
data_x = data.drop('Response', axis=1)
data_y = data['Response']

In [None]:
data_x = data_x.values
data_y = data_y.values

In [None]:

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.3)
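The assignment asks for a Random Forest classifier on `Response`, which would be fit on a split like the one above. The following is a minimal sketch on synthetic data, not the notebook's actual model: `demo_x` and `demo_y` stand in for `data_x` and `data_y`, and the hyperparameters and the stratified split are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for (data_x, data_y).
demo_x, demo_y = make_classification(n_samples=300, n_features=10, random_state=0)

# stratify keeps the class proportions equal in both splits (an optional choice).
cx_train, cx_test, cy_train, cy_test = train_test_split(
    demo_x, demo_y, test_size=0.3, stratify=demo_y, random_state=0
)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(cx_train, cy_train)

# Accuracy on the held-out split.
test_accuracy = accuracy_score(cy_test, rfc.predict(cx_test))
```

A classifier fitted this way could then be written to `rfc.pkl` with the same `pickle.dump` pattern the notebook uses for `lr.pkl`.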

#### Linear Regression

In [None]:
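# Income has 24 missing values (see the status output); LinearRegression cannot fit a NaN target,
# so rows without Income are dropped before building the regression arrays.
data_reg = data.dropna(subset=['Income'])
x_data_reg = data_reg.drop('Income', axis=1).values
y_data_reg = data_reg['Income'].values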

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data_reg, y_data_reg, test_size=0.3)

In [None]:
# a. Create the model
model = LinearRegression()

In [None]:
# b. Fit the model
model.fit(x_train, y_train)

In [None]:
# c. Get predictions for the train and test sets
pred_tr = model.predict(x_train)
pred_ts = model.predict(x_test)
pred_tr[0:6]
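Since the goal of the activity is to validate the models, the train and test predictions would normally be compared against the true values. A minimal sketch, assuming mean absolute error and R² as the metrics (the notebook does not name any for the linear model) and using synthetic data in place of the `Income` arrays:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for (x_data_reg, y_data_reg).
demo_x, demo_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rx_train, rx_test, ry_train, ry_test = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=0
)

lin = LinearRegression().fit(rx_train, ry_train)

# Compare train and test metrics; a large gap between them suggests overfitting.
metrics = {
    "mae_train": mean_absolute_error(ry_train, lin.predict(rx_train)),
    "mae_test": mean_absolute_error(ry_test, lin.predict(rx_test)),
    "r2_train": r2_score(ry_train, lin.predict(rx_train)),
    "r2_test": r2_score(ry_test, lin.predict(rx_test)),
}
```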

#### Saving the model to lr.pkl

In [None]:

with open('lr.pkl', 'wb') as handle:
    pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)
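A saved model is only useful if it can be loaded back and produces the same predictions. The sketch below checks that round trip on a small synthetic model; it writes to a temporary directory rather than the project folder, and the data and variable names are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic linear problem (made-up coefficients).
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(50, 3))
demo_y = demo_x @ np.array([1.5, -2.0, 0.5]) + 4.0

fitted = LinearRegression().fit(demo_x, demo_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lr.pkl")
    # Write the fitted model, then read it back from disk.
    with open(path, "wb") as handle:
        pickle.dump(fitted, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(fitted.predict(demo_x), restored.predict(demo_x))
```

Pickled scikit-learn models are generally expected to be loaded with the same library version that created them, so it can help to record that version next to the file.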


#### RandomForest

In [None]:
model_rf = RandomForestRegressor()

In [None]:
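params = {
    'n_estimators' : [10, 100, 300, 500, 1000],
    # The prepared dataset has fewer than 50 columns, so integer values such as 50 or 100
    # exceed the feature count; fractions of the available features are used instead.
    'max_features': [0.5, 1.0],
    #'bootstrap': [False, True],
    #'max_depth': [50, 500],
    #'min_samples_leaf': [3, 50],
    #'min_samples_split': [10, 50],
}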

grid_rf = GridSearchCV(estimator = model_rf,
                        param_grid = params,
                        scoring = 'neg_mean_absolute_error',
                        cv = 5, 
                        verbose = 1
                        )

In [None]:
grid_rf.fit(x_train, y_train)

In [None]:
grid_rf.best_estimator_

In [None]:
grid_rf.predict(x_train)
grid_rf.predict(x_test)

In [None]:
grid_rf.best_params_

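#### Saving the model to rfc.pkl

In [None]:
# Save the best estimator found by grid_rf (a RandomForestRegressor fit on Income),
# not only its best_params_, so it can be reloaded for prediction.
with open('rfc.pkl', 'wb') as handle:
    pickle.dump(grid_rf.best_estimator_, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
with open('rfc.pkl', 'rb') as handle:
    rfc_tr = pickle.load(handle)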

#### RandomForest for the Income column

In [None]:
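# Drop rows with missing Income: the regressor cannot be fit on a NaN target.
data_reg = data.dropna(subset=['Income'])
X = data_reg.drop('Income', axis=1)
y = data_reg['Income']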

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:


model_rf = RandomForestRegressor()

In [None]:
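params = {
    'n_estimators' : [10, 100, 300, 500, 1000],
    # The prepared dataset has fewer than 50 columns, so integer values such as 50 or 100
    # exceed the feature count; fractions of the available features are used instead.
    'max_features': [0.5, 1.0],
    #'bootstrap': [False, True],
    #'max_depth': [50, 500],
    #'min_samples_leaf': [3, 50],
    #'min_samples_split': [10, 50],
}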

grid_rf = GridSearchCV(estimator = model_rf,
                        param_grid = params,
                        scoring = 'neg_mean_absolute_error',
                        cv = 5, 
                        verbose = 1
                        )

In [None]:
grid_rf.fit(X_train, y_train)

In [None]:
grid_rf.best_estimator_

In [None]:
# Predict on the same DataFrame splits the search was fit on (X_train / X_test).
grid_rf.predict(X_train)
grid_rf.predict(X_test)

In [None]:
grid_rf.best_params_

#### Parameter combinations

In [None]:
pd.concat([pd.DataFrame(grid_rf.cv_results_["params"]),
           pd.DataFrame(grid_rf.cv_results_["mean_test_score"], 
                        columns=["neg_mean_absolute_error"])],axis=1).sort_values('neg_mean_absolute_error', ascending=False)
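The same ranking can be read directly from `cv_results_`, which already stores one `param_*` column per hyperparameter along with the mean, standard deviation and rank of the test score. The sketch below uses a small synthetic problem and a reduced grid so that it runs quickly; the data and grid values are assumptions, not the notebook's.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a deliberately small grid (illustrative values).
demo_x, demo_y = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 30], "max_features": [0.5, 1.0]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(demo_x, demo_y)

results = pd.DataFrame(search.cv_results_)
columns = [
    "param_n_estimators", "param_max_features",
    "mean_test_score", "std_test_score", "rank_test_score",
]
table = (
    results[columns]
    .assign(mae=lambda df: -df["mean_test_score"])  # flip the sign to get a plain MAE
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
```

Because the scorer is the negated MAE, higher (closer to zero) is better; the extra `mae` column makes the table easier to read in the target's own units.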

In [None]:
grid_rf.score(X_train, y_train)

In [None]:
grid_rf.score(X_test, y_test)
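When `scoring` is set, `GridSearchCV.score` reports that metric rather than the estimator's default R², so the two values above are negated mean absolute errors. A minimal sketch on synthetic data (illustrative names and grid) that confirms the relationship:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

demo_x, demo_y = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)
hx_train, hx_test, hy_train, hy_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=42
)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 30]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(hx_train, hy_train)

# score() applies the configured scorer to the refit best estimator.
neg_mae_test = search.score(hx_test, hy_test)
mae_test = mean_absolute_error(hy_test, search.predict(hx_test))
```

Here `neg_mae_test` equals `-mae_test`, so a train score much closer to zero than the test score points to overfitting.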

In [None]:
# Save the best Random Forest regressor as rfr.pkl
with open('rfr.pkl', 'wb') as handle:
    pickle.dump(grid_rf.best_estimator_, handle, protocol=pickle.HIGHEST_PROTOCOL)
