# Activity No. 05: iFood

## Team members

**Group No. 03**

- Adriana Villalobos
- Gustavo Ledesma
- Alejo Cuello

## Activity description

We work with iFood's *marketing_campaign.csv* dataset. The goal of the activity is to validate the classification and regression models used to predict different variables.

# Assignment

- Create a classification model using Random Forest for the `Response` column.
- Save the Random Forest classification model as `rfc.pkl`.
- Create a linear regression model and a Random Forest + GridSearchCV model to predict the `Income` column.
- Save both regression models as pickle files: `lr.pkl` and `rfr.pkl`.
- Upload the project to GitHub / GitLab; use git and git-lfs for the `.csv` and `.pkl` files.

## Considerations

Replicate this notebook to solve the exercise. Consider the following stages:

1) Load the data

2) Prepare the data

3) Classification

4) Regression

5) Save a model.

**You may decide:**
- How to prepare and condition the dataset.
- Which columns to add to or remove from the dataset.
- Which parameters to tune in the classification and regression models.

# Code

#### DATA LOADING

In [63]:
import pandas as pd
import numpy as np
import pickle

from funpymodeling.exploratory import status
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [64]:
data = pd.read_csv("marketing_campaign.csv", sep=';', index_col=0)
data.head(5)

| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11 | 4 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |


In [65]:
status(data)

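| | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | Year_Birth | 0 | 0.0 | 0 | 0.0 | 59 | int64 |
| 1 | Education | 0 | 0.0 | 0 | 0.0 | 5 | object |
| 2 | Marital_Status | 0 | 0.0 | 0 | 0.0 | 8 | object |
| 3 | Income | 24 | 0.010714 | 0 | 0.0 | 1974 | float64 |
| 4 | Kidhome | 0 | 0.0 | 1293 | 0.577232 | 3 | int64 |
| 5 | Teenhome | 0 | 0.0 | 1158 | 0.516964 | 3 | int64 |
| 6 | Dt_Customer | 0 | 0.0 | 0 | 0.0 | 663 | object |
| 7 | Recency | 0 | 0.0 | 28 | 0.0125 | 100 | int64 |
| 8 | MntWines | 0 | 0.0 | 13 | 0.005804 | 776 | int64 |
| 9 | MntFruits | 0 | 0.0 | 400 | 0.178571 | 158 | int64 |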


#### DISCARD RECORDS WITH A NULL TARGET VARIABLE

In [66]:
discarded_data = data[data["Income"].isna()]
data = data[data["Income"].notna()]
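The same filter can also be written with `dropna(subset=...)`. A minimal sketch on a made-up three-row frame (the IDs and values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame(
    {"Income": [58138.0, np.nan, 71613.0], "Response": [1, 0, 0]},
    index=pd.Index([1, 2, 3], name="ID"),
)

# Rows whose target is missing, kept aside for inspection
toy_discarded = toy[toy["Income"].isna()]

# Equivalent to toy[toy["Income"].notna()]
toy_clean = toy.dropna(subset=["Income"])

print(len(toy_discarded), len(toy_clean))
```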

#### CHECK OF SOME VARIABLES

In [67]:
print("Rare cases where the customer spent more on gold products than on all other categories combined")
data[data["MntGoldProds"] > (data["MntFishProducts"] + data["MntMeatProducts"] + data["MntFruits"] + data["MntSweetProducts"] + data["MntWines"])][["MntGoldProds","MntFishProducts","MntMeatProducts","MntFruits","MntSweetProducts","MntWines"]]

Rare cases where the customer spent more on gold products than on all other categories combined


| ID | MntGoldProds | MntFishProducts | MntMeatProducts | MntFruits | MntSweetProducts | MntWines |
|---|---|---|---|---|---|---|
| 4246 | 262 | 4 | 26 | 11 | 3 | 67 |
| 6237 | 291 | 5 | 33 | 4 | 2 | 81 |
| 10311 | 321 | 2 | 12 | 4 | 4 | 16 |


In [68]:
print("Check that the Response variable has a success rate of roughly 15%")
data["Response"].sum() / data.shape[0]

Check that the Response variable has a success rate of roughly 15%


0.15027075812274368
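With only about 15% positive responses, a purely random split can leave the train and test sets with slightly different class proportions. A minimal sketch, on synthetic labels with the same 15% rate, of how the `stratify` argument of `train_test_split` keeps the proportion stable; the arrays here are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 150 positives out of 1000 (15%), mimicking Response
y_demo = np.array([1] * 150 + [0] * 850)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Positive rate in each partition
print(y_tr.mean(), y_te.mean())
```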

#### DROP DATE COLUMNS AND COLUMNS WITH A SINGLE VALUE

In [69]:
columnas_a_eliminar = ['Year_Birth', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue']
data = data.drop(columns=columnas_a_eliminar)
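Constant columns such as `Z_CostContact` and `Z_Revenue` can also be detected programmatically rather than listed by hand. A small sketch on a made-up frame, using `nunique()`:

```python
import pandas as pd

# Toy frame: two constant columns and one informative column (made-up values)
toy = pd.DataFrame(
    {
        "Recency": [58, 38, 26, 94],
        "Z_CostContact": [3, 3, 3, 3],
        "Z_Revenue": [11, 11, 11, 11],
    }
)

# Columns with a single distinct value carry no information for a model
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```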

In [70]:
status(data)

| | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | Education | 0 | 0.0 | 0 | 0.0 | 5 | object |
| 1 | Marital_Status | 0 | 0.0 | 0 | 0.0 | 8 | object |
| 2 | Income | 0 | 0.0 | 0 | 0.0 | 1974 | float64 |
| 3 | Kidhome | 0 | 0.0 | 1283 | 0.578971 | 3 | int64 |
| 4 | Teenhome | 0 | 0.0 | 1147 | 0.517599 | 3 | int64 |
| 5 | Recency | 0 | 0.0 | 28 | 0.012635 | 100 | int64 |
| 6 | MntWines | 0 | 0.0 | 13 | 0.005866 | 776 | int64 |
| 7 | MntFruits | 0 | 0.0 | 395 | 0.178249 | 158 | int64 |
| 8 | MntMeatProducts | 0 | 0.0 | 1 | 0.000451 | 554 | int64 |
| 9 | MntFishProducts | 0 | 0.0 | 379 | 0.171029 | 182 | int64 |


#### CONVERT OBJECT COLUMNS TO NUMERIC, SINCE THEY HAVE FEW DISTINCT VALUES

In [71]:
data['Marital_Status'].unique()

array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Alone',
       'Absurd', 'YOLO'], dtype=object)

In [72]:
class_map = {'Single':0, 'Married':1, 'Together':1, 'Divorced':2, 'Widow':3, 'Alone':0, 'Absurd':0, 'YOLO':0}
data['Marital_Status'] = data['Marital_Status'].map(class_map)

In [73]:
data['Education'].unique()

array(['Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'], dtype=object)

In [74]:
# If we knew the order from the most basic to the most specialized level, we could encode the categories in that order
class_map = {'Graduation':0, 'PhD':1, 'Master':2, 'Basic':3, '2n Cycle':4}
data['Education'] = data['Education'].map(class_map)
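The integer codes above are arbitrary, so a linear model would read `Basic` (3) as "more" than `PhD` (1). Two common alternatives are sketched below on a toy column: an ordinal mapping, where the order `Basic < 2n Cycle < Graduation < Master < PhD` is an assumption about the education levels and not something the dataset documents, and one-hot encoding with `pd.get_dummies`, which implies no order at all.

```python
import pandas as pd

toy = pd.DataFrame({"Education": ["Graduation", "PhD", "Master", "Basic", "2n Cycle"]})

# Option 1: ordinal codes. The order below is an assumption, not documented in the dataset.
assumed_order = ["Basic", "2n Cycle", "Graduation", "Master", "PhD"]
ordinal_map = {level: rank for rank, level in enumerate(assumed_order)}
toy["Education_ordinal"] = toy["Education"].map(ordinal_map)

# Option 2: one-hot encoding, one indicator column per category, no order implied
one_hot = pd.get_dummies(toy["Education"], prefix="Education")

print(toy)
print(one_hot.columns.tolist())
```

One-hot encoding adds columns, which tree-based models tolerate well; the ordinal version stays compact but is only meaningful if the assumed order is right.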

In [75]:
status(data)

| | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | Education | 0 | 0.0 | 1116 | 0.50361 | 5 | int64 |
| 1 | Marital_Status | 0 | 0.0 | 478 | 0.215704 | 4 | int64 |
| 2 | Income | 0 | 0.0 | 0 | 0.0 | 1974 | float64 |
| 3 | Kidhome | 0 | 0.0 | 1283 | 0.578971 | 3 | int64 |
| 4 | Teenhome | 0 | 0.0 | 1147 | 0.517599 | 3 | int64 |
| 5 | Recency | 0 | 0.0 | 28 | 0.012635 | 100 | int64 |
| 6 | MntWines | 0 | 0.0 | 13 | 0.005866 | 776 | int64 |
| 7 | MntFruits | 0 | 0.0 | 395 | 0.178249 | 158 | int64 |
| 8 | MntMeatProducts | 0 | 0.0 | 1 | 0.000451 | 554 | int64 |
| 9 | MntFishProducts | 0 | 0.0 | 379 | 0.171029 | 182 | int64 |


## Classification
Target variable: Response

### Random Forest Classifier

In [76]:
x_data_classification = data.drop('Response', axis=1)
y_data_classification = data['Response']

x_data_classification = x_data_classification.values
y_data_classification = y_data_classification.values

x_train_classification, x_test_classification, y_train_classification, y_test_classification = train_test_split(x_data_classification, y_data_classification, test_size=0.3, random_state=42)

In [87]:
rfc = RandomForestClassifier(n_estimators = 100, random_state = 42)

rfc.fit(x_train_classification, y_train_classification)
y_train_classification_pred = rfc.predict(x_train_classification)

In [88]:
pred_probs = rfc.predict_proba(x_train_classification)
pred_probs

array([[1.  , 0.  ],
       [0.4 , 0.6 ],
       [0.99, 0.01],
       ...,
       [0.99, 0.01],
       [0.92, 0.08],
       [0.9 , 0.1 ]])
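The cells above only predict on the training data, where a Random Forest typically looks close to perfect. A hedged sketch of how the held-out split could be scored, using synthetic data from `make_classification` (roughly 15% positives) in place of the real features; the choice of metrics is a suggestion, not part of the original notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the iFood features, with about 15% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)

# Score on data the model has not seen
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

test_accuracy = accuracy_score(y_te, y_pred)
test_auc = roc_auc_score(y_te, y_prob)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class

print(f"accuracy={test_accuracy:.3f}  roc_auc={test_auc:.3f}")
print(cm)
```

With imbalanced classes, accuracy alone can mislead: always predicting 0 would already score about 85%, which is why the sketch also reports ROC AUC and the confusion matrix.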

In [89]:
with open('rfc.pkl', 'wb') as handle:
    pickle.dump(rfc, handle, protocol=pickle.HIGHEST_PROTOCOL)
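A saved model is only useful if it can be reloaded and still produces the same predictions. A minimal round-trip sketch, using a small synthetic classifier and a temporary directory so that it does not touch the real model files; the file name `demo_model.pkl` is hypothetical:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name

    # Serialize the fitted model
    with open(model_path, "wb") as handle:
        pickle.dump(clf, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load it back into a new object
    with open(model_path, "rb") as handle:
        clf_loaded = pickle.load(handle)

# The reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(clf.predict(X_demo), clf_loaded.predict(X_demo))
print(same_predictions)
```

Pickle files are best loaded with the same scikit-learn version that produced them, and only from trusted sources.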

## Regression
Target variable: Income

### Linear Regression

In [90]:
x_data_reg = data.drop('Income', axis=1)
y_data_reg = data['Income']

x_data_reg = x_data_reg.values
y_data_reg = y_data_reg.values

In [91]:
x_train, x_test, y_train, y_test = train_test_split(x_data_reg, y_data_reg, test_size=0.3, random_state=42)

In [92]:
model = LinearRegression()
model.fit(x_train, y_train)

LinearRegression parameters:

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


In [93]:
pred_tr = model.predict(x_train)
pred_ts = model.predict(x_test)
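The predictions `pred_tr` and `pred_ts` are computed but never scored. A sketch of how train and test error could be compared, using `make_regression` data as a stand-in for the real features; MAE is chosen here to match the `neg_mean_absolute_error` scoring used for the Random Forest search, and R² is added as a scale-free complement:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Income regression problem
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

lin = LinearRegression().fit(X_tr, y_tr)

# Same two metrics on both partitions, so the gap between them is visible
metrics = {}
for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = lin.predict(X_part)
    metrics[name] = {
        "mae": mean_absolute_error(y_part, pred),
        "r2": r2_score(y_part, pred),
    }
    print(f"{name}: MAE={metrics[name]['mae']:.2f}  R2={metrics[name]['r2']:.3f}")
```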

In [115]:

with open('lr.pkl', 'wb') as handle:
    pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)


### Random Forest Regressor

In [95]:
model_rf = RandomForestRegressor()

In [96]:
params = {
    # 'n_estimators' : [10, 100, 300, 500,1000],
    'n_estimators' : [10],
    'max_features': [50,100],
    #'bootstrap': [False, True],
    #'max_depth': [50, 500],
    #'min_samples_leaf': [3, 50],
    #'min_samples_split': [10, 50],
}

grid_rf = GridSearchCV(estimator = model_rf,
                        param_grid = params,
                        scoring = 'neg_mean_absolute_error',
                        cv = 5, 
                        verbose = 1
                        )
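An integer `max_features` is the number of columns each split may consider, so values such as 50 or 100 only make sense when the dataset has at least that many features. This dataset appears to have fewer, so the two candidates may not actually explore different settings. A hedged sketch of a grid expressed as fractions of the available columns instead, on synthetic data with 20 features; the specific values are illustrative choices, not tuned recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with a known number of features
X_demo, y_demo = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=42)

# Floats are read as a fraction of the columns; "sqrt" uses sqrt(n_features)
demo_params = {
    "n_estimators": [20, 50],
    "max_features": [0.3, 0.6, 1.0, "sqrt"],
}

demo_grid = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=demo_params,
    scoring="neg_mean_absolute_error",
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
```

Fixing `random_state` on the estimator makes repeated searches comparable, since otherwise differences between candidates can come from the forest's own randomness.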

In [97]:
grid_rf.fit(x_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


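GridSearchCV parameters:

| Parameter | Value |
|---|---|
| estimator | RandomForestRegressor() |
| param_grid | {'max_features': [50, 100], 'n_estimators': [10]} |
| scoring | 'neg_mean_absolute_error' |
| n_jobs | None |
| refit | True |
| cv | 5 |
| verbose | 1 |
| pre_dispatch | '2*n_jobs' |
| error_score | nan |
| return_train_score | False |

RandomForestRegressor (best estimator) parameters:

| Parameter | Value |
|---|---|
| n_estimators | 10 |
| criterion | 'squared_error' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 50 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |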


In [98]:
grid_rf.best_estimator_

| Parameter | Value |
|---|---|
| n_estimators | 10 |
| criterion | 'squared_error' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 50 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [99]:
grid_rf.predict(x_train)
grid_rf.predict(x_test)

array([ 60070.6,  26466.2,  35668. ,  65699.5,  20107.1,  83042.4,
        79994.2,  61627.4,  47001.6,  84312.4,  55626.2,  33869.2,
        46548.4,  51031.3,  37012.8,  79024.4,  37723.4,  73243.4,
        36143.2,  84979.3,  49982.1,  60832. ,  31539.7,  35586.6,
        61505.9,  64499.2,  61443.5,  21890.8,  87356.2,  77722.2,
        27954.1,  74054.8,  35600.4,  34851.9,  26315.8,  54963.2,
        93010.3,  45463.8,  65010.2,  70642.9,  76168.6,  46807.4,
        82334.3,  44743.5,  56476.1, 105861.4,  57517.4,  30586.7,
        47889.6,  26690.7,  71977.2,  44490.5,  24503.1,  64562.8,
        63810. ,  73382.7,  74301.6,  64391.3,  41925.2,  33428.2,
        72638.1,  19221.2,  28829.8,  36248.5,  84359.8,  57147.6,
        55100.6,  73991.3,  66996.8,  50198.1,  79002.1,  57580.3,
        18369.8,  37192.2,  44679.7,  44654.2,  59498.3,  81778.3,
        14273.1,  57536.9,  79525.9,  72287.7,  81774.4,  70871.7,
        58008.1,  26099.5,  59406.3,  39815.2,  65874.1,  3111

In [100]:
grid_rf.best_params_

{'max_features': 50, 'n_estimators': 10}

In [105]:
X = data.drop('Income', axis=1)
y = data['Income']

In [106]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [107]:
grid_rf.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


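GridSearchCV parameters:

| Parameter | Value |
|---|---|
| estimator | RandomForestRegressor() |
| param_grid | {'max_features': [50, 100], 'n_estimators': [10, 100, ...]} |
| scoring | 'neg_mean_absolute_error' |
| n_jobs | None |
| refit | True |
| cv | 5 |
| verbose | 1 |
| pre_dispatch | '2*n_jobs' |
| error_score | nan |
| return_train_score | False |

RandomForestRegressor (best estimator) parameters:

| Parameter | Value |
|---|---|
| n_estimators | 1000 |
| criterion | 'squared_error' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 50 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |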


In [108]:
grid_rf.best_estimator_

| Parameter | Value |
|---|---|
| n_estimators | 1000 |
| criterion | 'squared_error' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 50 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [109]:
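# x_train / x_test are the NumPy arrays from the earlier 70/30 split,
# not the X_train / X_test DataFrames that grid_rf was just refit on
grid_rf.predict(x_train)
grid_rf.predict(x_test)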



array([ 57255.209,  29693.828,  33553.363,  69277.491,  17765.462,
        83256.401,  80338.459,  62524.41 ,  51930.901,  83468.948,
        54822.29 ,  33076.422,  52771.862,  57352.12 ,  33875.339,
        74985.933,  38733.451,  76147.935,  38313.823,  84896.911,
        52523.505,  59177.402,  30958.334,  35148.607,  60019.895,
        73761.503,  64534.072,  22483.07 ,  85338.788,  80784.789,
        28428.045,  75950.962,  34194.39 ,  35596.052,  26267.944,
        50395.265,  88424.875,  47837.373,  72346.256,  71088.845,
        78927.092,  45859.485,  82247.112,  51517.736,  52317.489,
        53116.913,  50373.037,  30053.748,  52512.09 ,  22811.006,
        68347.838,  46836.772,  23957.728,  65517.045,  64591.202,
        74262.289,  74818.398,  63832.894,  36522.196,  35475.799,
        70628.688,  19853.717,  32715.574,  33179.96 ,  83562.777,
        58173.584,  55079.322,  74357.664,  69461.512,  49114.459,
       109067.984,  56299.325,  22849.267,  32920.408,  43421.

In [110]:
grid_rf.best_params_

{'max_features': 50, 'n_estimators': 1000}

#### Parameter combinations

In [111]:
pd.concat([pd.DataFrame(grid_rf.cv_results_["params"]),
           pd.DataFrame(grid_rf.cv_results_["mean_test_score"], 
                        columns=["neg_mean_absolute_error"])],axis=1).sort_values('neg_mean_absolute_error', ascending=False)

| | max_features | n_estimators | neg_mean_absolute_error |
|---|---|---|---|
| 4 | 50 | 1000 | -6457.364939 |
| 8 | 100 | 500 | -6490.710532 |
| 9 | 100 | 1000 | -6495.310717 |
| 7 | 100 | 300 | -6513.189482 |
| 1 | 50 | 100 | -6515.272858 |
| 2 | 50 | 300 | -6522.712199 |
| 3 | 50 | 500 | -6528.791039 |
| 6 | 100 | 100 | -6541.850717 |
| 5 | 100 | 10 | -6716.23781 |
| 0 | 50 | 10 | -7154.195812 |
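`cv_results_` already stores each parameter under its own `param_*` key together with a `rank_test_score`, so the same ranking can be built from a single DataFrame. A sketch on a small synthetic search; the data and grid are placeholders:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

demo_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [10, 30], "max_features": [0.5, 1.0]},
    scoring="neg_mean_absolute_error",
    cv=3,
).fit(X_demo, y_demo)

# One row per parameter combination, best candidate first
wanted_cols = [
    "param_max_features",
    "param_n_estimators",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
results = (
    pd.DataFrame(demo_grid.cv_results_)[wanted_cols]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

Keeping `std_test_score` next to the mean helps judge whether the gap between two candidates is larger than the fold-to-fold variation.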


In [112]:
grid_rf.score(X_train, y_train)

-2336.717110045147

In [113]:
grid_rf.score(X_test, y_test)

-5712.869195945947
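Because the scorer is `neg_mean_absolute_error`, the two scores are MAE values with the sign flipped. A short calculation with the numbers reported above makes the train/test gap explicit; reading it as overfitting is an interpretation of these figures, not a result stated in the notebook:

```python
# Scores reported by grid_rf.score (negated MAE, in Income units)
train_score = -2336.717110045147
test_score = -5712.869195945947

train_mae = -train_score
test_mae = -test_score

# A ratio above 1 means the model errs more on unseen data than on training data
gap_ratio = test_mae / train_mae
print(f"train MAE={train_mae:.0f}  test MAE={test_mae:.0f}  ratio={gap_ratio:.2f}")
```

A test error roughly 2.4 times the training error suggests the forest fits the training data much more closely than it generalizes.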

In [None]:
# Save the model
with open('rfr.pkl', 'wb') as handle:
    pickle.dump(grid_rf.best_estimator_, handle, protocol=pickle.HIGHEST_PROTOCOL)
