# Activity No. 05: iFood

## Team members

**Group No. 03**

- Adriana Villalobos
- Gustavo Ledesma
- Alejo Cuello

## Activity description

We work with the iFood *marketing_campaign.csv* dataset. The goal of the activity is to validate the classification and regression models used to predict the `Response` and `Income` columns.

# Assignment

- Build a classification model with Random Forest for the `Response` column.
- Save the Random Forest classification model as `rfc.pkl`.
- Build a linear regression model and a Random Forest + `GridSearchCV` model to predict the `Income` column.
- Save both regression models as pickle files: `lr.pkl` and `rfr.pkl`.
- Upload the project to GitHub / GitLab, using git and git-lfs for the `.csv` and `.pkl` files.

## Considerations

- Replicate this notebook to solve the exercise.
- Follow these stages: 1) Data loading, 2) Data preparation, 3) Classification, 4) Regression, and 5) Saving a model.

**We are free to decide:**
- How to prepare and condition the dataset.
- Which columns to add to or remove from the dataset.
- Which parameters to tune in the classification and regression models.

# Code

#### DATA LOADING

In [None]:
import pandas as pd
import numpy as np
import pickle

from funpymodeling.exploratory import status
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor 

In [None]:
data = pd.read_csv("marketing_campaign.csv", sep=';', index_col=0)
data.head(5)

In [None]:
status(data)
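
As a rough cross-check of the `status` output, a similar per-column summary can be built with pandas alone. This is only a sketch: `quick_status`, its column names, and the toy frame are made up for illustration and are not part of funpymodeling.

```python
import numpy as np
import pandas as pd

def quick_status(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: missing values, zeros, dtype and unique count."""
    return pd.DataFrame({
        "q_nan": df.isna().sum(),        # number of missing values
        "p_nan": df.isna().mean(),       # share of missing values
        "q_zeros": (df == 0).sum(),      # number of zeros
        "type": df.dtypes.astype(str),   # column dtype
        "unique": df.nunique(),          # distinct non-missing values
    })

# Tiny made-up frame, only to show the shape of the output
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0],
    "Kidhome": [0, 1, 0, 1],
    "Education": ["PhD", "Master", "PhD", "Basic"],
})
summary = quick_status(toy)
print(summary)
```

Running it on `data` instead of `toy` should flag the same columns with missing values that `status` reports.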

#### DATA PREPARATION

##### --- Helper functions that fill in missing numeric values without distorting the mean and standard deviation (too much)

In [None]:
def minmax(x, y):
    # Return the range [x - y, x + y] as a dict with 'min' and 'max' keys.
    resul1 = x - y
    resul2 = x + y
    resultados = {'min': resul1, 'max': resul2}
    return resultados

In [None]:
def imp_numericas(f):
    # Impute missing values (NaN) in a vectorized way.
    # The input 'f' is expected to be a pandas Series.
    if not isinstance(f, pd.Series):
        try:
            f = pd.Series(f, dtype=float)
        except ValueError:
            return f

    nan_mask = f.isna()

    # Nothing to impute: return the column unchanged
    if not nan_mask.any():
        return f

    mean_val = f.mean(skipna=True)
    std_val = f.std(skipna=True)

    # Integer bounds one standard deviation either side of the mean
    mn_sd = minmax(x=int(round(mean_val)), y=int(round(std_val)))

    # One random integer per missing value, drawn from [min, max]
    num_nan = int(nan_mask.sum())
    aleatorios = np.random.randint(mn_sd['min'], mn_sd['max'] + 1, size=num_nan)

    f_imputado = f.copy()
    f_imputado[nan_mask] = aleatorios

    # Keep the imputed values within [1, max]
    f_imputado[nan_mask & (f_imputado < 1)] = 1
    f_imputado[nan_mask & (f_imputado > mn_sd['max'])] = mn_sd['max']

    return f_imputado

In [None]:
def imp_data(data):
    """
    Impute the missing values of every numeric column.

    Args:
        data (pd.DataFrame): The dataframe to process.

    Returns:
        pd.DataFrame: A new dataframe with the numeric columns imputed.
    """
    df_imputado = data.copy()

    for column in df_imputado.columns:
        if pd.api.types.is_numeric_dtype(df_imputado[column]):
            #print(f"Processing numeric column: '{column}'")
            df_imputado[column] = imp_numericas(df_imputado[column])

    return df_imputado

In [None]:
data_imp = imp_data(data)

In [None]:
# Check that no missing values remain
status(data_imp)

In [None]:
print("### Mean, original data")
print(round(data['Income'].mean(), 2))

print("-"*40)
print("### Mean, imputed data")
print(round(data_imp['Income'].mean(), 2))

In [None]:
print("### Standard deviation, original data")
print(round(data['Income'].std(), 2))

print("-"*40)
print("### Standard deviation, imputed data")
print(round(data_imp['Income'].std(), 2))
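
The comparison above can be reproduced on synthetic data to see how this kind of imputation behaves. The sketch below assumes a made-up income-like column with roughly 5% missing values and uses a seeded generator so the run is repeatable. The column, its parameters, and the variable names are illustrative only. Because the filled values are drawn uniformly within one standard deviation of the mean, the mean barely moves while the standard deviation tends to shrink slightly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic income-like column with about 5% missing values (made up)
income = pd.Series(rng.normal(52000, 21000, size=2000).round())
income[rng.random(2000) < 0.05] = np.nan

mean, std = income.mean(), income.std()
low, high = int(round(mean - std)), int(round(mean + std))

filled = income.copy()
mask = filled.isna()
# Same idea as imp_numericas: random integers within one std of the mean
filled[mask] = rng.integers(low, high + 1, size=int(mask.sum()))

print(f"mean: {mean:.2f} -> {filled.mean():.2f}")
print(f"std:  {std:.2f} -> {filled.std():.2f}")
```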

#### DROP DATE COLUMNS AND COLUMNS WITH A SINGLE VALUE

In [None]:
columnas_a_eliminar = ['Year_Birth', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue']
data_imp2 = data_imp.drop(columns=columnas_a_eliminar)

In [None]:
status(data_imp2)

#### CONVERT OBJECT COLUMNS TO NUMERIC, SINCE THEY HAVE FEW DISTINCT VALUES

In [None]:
data_imp2['Marital_Status'].unique()

In [None]:
class_map = {'Single':0, 'Married':1, 'Together':1, 'Divorced':2, 'Widow':3, 'Alone':0, 'Absurd':0, 'YOLO':0}
data_imp2['Marital_Status'] = data_imp2['Marital_Status'].map(class_map)

In [None]:
data_imp2['Education'].unique()

In [None]:
class_map = {'Graduation':0, 'PhD':1, 'Master':2, 'Basic':3, '2n Cycle':4}
data_imp2['Education'] = data_imp2['Education'].map(class_map)
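
One thing worth checking after a dictionary-based encoding: `Series.map` returns NaN for any label that is missing from the dictionary, which would silently reintroduce missing values. A minimal sketch of that check, using a toy series where `"Diploma"` is a hypothetical label added only to trigger the case:

```python
import pandas as pd

# Toy series; "Diploma" is a hypothetical label not present in the mapping
education = pd.Series(["Graduation", "PhD", "Master", "Basic", "2n Cycle", "Diploma"])
class_map = {"Graduation": 0, "PhD": 1, "Master": 2, "Basic": 3, "2n Cycle": 4}

encoded = education.map(class_map)

# Labels absent from the dictionary come back as NaN
unmapped = education[encoded.isna()].unique()
print(unmapped)
```

An empty result for the real columns would confirm that every category was covered by its `class_map`.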

In [None]:
status(data_imp2)

#### Train/Test split

In [None]:
data_x = data_imp2.drop('Response', axis=1)
data_y = data_imp2['Response']

In [None]:
data_x = data_x.values
data_y = data_y.values

In [None]:

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.3)
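
If the `Response` classes turn out to be imbalanced, a stratified and seeded split keeps the class ratio the same in both partitions and makes the run repeatable. The sketch below uses synthetic data with an assumed positive rate of about 15%, a figure chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.15).astype(int)  # about 15% positives, made up

# stratify=y preserves the class ratio; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(y.mean(), y_tr.mean(), y_te.mean())
```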

#### ------------------------------------------------------------------

#### Linear Regression

In [None]:
x_data_reg = data_imp2.drop('Income', axis=1)
y_data_reg = data_imp2['Income']
x_data_reg = x_data_reg.values
y_data_reg = y_data_reg.values

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data_reg, y_data_reg, test_size=0.3)

In [None]:
# a. Create the model
model = LinearRegression()

In [None]:
# b. Fit it
model.fit(x_train, y_train)

In [None]:
# c. Get predictions for train and test
pred_tr = model.predict(x_train)
pred_ts = model.predict(x_test)
pred_tr[0:6]
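
Since the goal of the activity is to validate the models, the train and test predictions can be turned into error metrics. A minimal sketch on synthetic data from `make_regression`, which stands in for the real `Income` split and is an assumption of this example:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the Income data
X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

reg = LinearRegression().fit(X_tr, y_tr)

# Report the same metrics on both partitions to spot overfitting
for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = reg.predict(X_part)
    print(f"{name}: MAE={mean_absolute_error(y_part, pred):.2f}  "
          f"R2={r2_score(y_part, pred):.3f}")
```

A large gap between the train and test figures would suggest the model does not generalize well.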

#### Saving the model to lr.pkl

In [None]:

with open('lr.pkl', 'wb') as handle:
    pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)
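
A quick way to confirm that a pickled model is usable is to load it back and compare its predictions with the in-memory one. The sketch below runs on made-up data and writes to a temporary file (`lr_demo.pkl`, a hypothetical name) so the real `lr.pkl` is not touched.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
fitted = LinearRegression().fit(X, y)

# Temporary path so the sketch does not overwrite the real lr.pkl
path = Path(tempfile.mkdtemp()) / "lr_demo.pkl"
with open(path, "wb") as handle:
    pickle.dump(fitted, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

# The restored model should predict exactly like the original
same = np.allclose(fitted.predict(X), restored.predict(X))
print(same)
```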


#### ------------------------------------------------------------------

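#### RandomForest for the `Response` column

In [None]:
from sklearn.ensemble import RandomForestClassifier

# x_train/y_train now hold the Income regression split, so split the
# Response features and labels (data_x, data_y) again for classification
x_train_clf, x_test_clf, y_train_clf, y_test_clf = train_test_split(data_x, data_y, test_size=0.3)

model_rfc = RandomForestClassifier()

In [None]:
params = {
    'n_estimators' : [10, 100, 300, 500, 1000],
    # Integer values above the number of predictor columns are invalid or
    # redundant, so search over 'sqrt' and fractions of the columns instead
    'max_features': ['sqrt', 0.5, 1.0],
    #'bootstrap': [False, True],
    #'max_depth': [50, 500],
    #'min_samples_leaf': [3, 50],
    #'min_samples_split': [10, 50],
}

grid_rfc = GridSearchCV(estimator = model_rfc,
                        param_grid = params,
                        scoring = 'accuracy',
                        cv = 5,
                        verbose = 1
                        )

In [None]:
grid_rfc.fit(x_train_clf, y_train_clf)

In [None]:
grid_rfc.best_estimator_

In [None]:
grid_rfc.predict(x_train_clf)
grid_rfc.predict(x_test_clf)

In [None]:
grid_rfc.best_params_

#### Saving the model to rfc.pkl

In [None]:
# Save the fitted estimator itself, not only its parameters
with open('rfc.pkl', 'wb') as handle:
    pickle.dump(grid_rfc.best_estimator_, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
with open('rfc.pkl', 'rb') as handle:
    rfc_tr = pickle.load(handle)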
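
Because `Response` is a binary label, a classifier for it is usually judged with a confusion matrix and per-class precision and recall rather than a regression error. A minimal sketch on synthetic, imbalanced data from `make_classification`. The data, the 85/15 class weights, and the small forest size are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem standing in for the Response data
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred)
print(cm)
print(classification_report(y_te, pred))
```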

#### RandomForest for the `Income` column

In [None]:
X = data_imp2.drop('Income', axis=1)
y = data_imp2['Income']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:


model_rf = RandomForestRegressor()

In [None]:
params = {
    'n_estimators' : [10, 100, 300, 500,1000],
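    # Integer values above the number of predictor columns are invalid or
    # redundant, so search over 'sqrt' and fractions of the columns instead
    'max_features': ['sqrt', 0.5, 1.0],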
    #'bootstrap': [False, True],
    #'max_depth': [50, 500],
    #'min_samples_leaf': [3, 50],
    #'min_samples_split': [10, 50],
}

grid_rf = GridSearchCV(estimator = model_rf,
                        param_grid = params,
                        scoring = 'neg_mean_absolute_error',
                        cv = 5, 
                        verbose = 1
                        )

In [None]:
grid_rf.fit(X_train, y_train)

In [None]:
grid_rf.best_estimator_

In [None]:
# Predict on the same split the grid was fitted on (X_train / X_test)
grid_rf.predict(X_train)
grid_rf.predict(X_test)

In [None]:
grid_rf.best_params_

#### Parameter combinations

In [None]:
pd.concat([pd.DataFrame(grid_rf.cv_results_["params"]),
           pd.DataFrame(grid_rf.cv_results_["mean_test_score"], 
                        columns=["neg_mean_absolute_error"])],axis=1).sort_values('neg_mean_absolute_error', ascending=False)

In [None]:
grid_rf.score(X_train, y_train)

In [None]:
grid_rf.score(X_test, y_test)
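
When a `scoring` string is passed to `GridSearchCV`, its `score` method reports that metric, so with `'neg_mean_absolute_error'` the values are negative and closer to zero is better. The sketch below illustrates the relationship on synthetic data. The small grid and dataset are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 30]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_tr, y_tr)

# score() is the negated MAE of the refitted best estimator
score = search.score(X_te, y_te)
mae = mean_absolute_error(y_te, search.predict(X_te))
print(score, mae)
```

Flipping the sign of the score therefore gives the mean absolute error in the units of the target.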

In [None]:
# Save the model
# rfr.pkl
with open('rfr.pkl', 'wb') as handle:
    pickle.dump(grid_rf.best_estimator_, handle, protocol=pickle.HIGHEST_PROTOCOL)
