# 2. Modeling

This notebook covers the modeling pipeline for the iFood case study.

## 2.1 Setup

Import necessary libraries and load processed data.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GroupKFold, GridSearchCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, classification_report, confusion_matrix
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

from sklearn.inspection import permutation_importance

pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

## 2.2 Load Processed Data

Load the processed data from the data/processed directory.

In [0]:
# Load processed data
path = "/Workspace/Users/castrobeneyto@gmail.com/ifood-case/data/processed"
profile_df = pd.read_csv(f"{path}/profile.csv")
offers_df = pd.read_csv(f"{path}/offers.csv")
df = pd.read_csv(f"{path}/train.csv")
df.head()

In [0]:
df.info()

## 2.3 Target Specification 

Defining the Prediction Target

With the goal of determining which offer to send to each customer, the problem was modeled as a **propensity model** to predict the probability of a successful outcome given a sent offer. To achieve this, we frame it as a supervised binary classification task. First, we need to define what constitutes a “successful” offer.

In general, possible event paths for `bogo` and `discount` offers are:

- **Successful Offer:** Received → Viewed → Transaction(s) → Completed (**_target = 1_**)

- **Unviewed Success:** Received → Transaction(s) → Completed (**_target = 1_**)

- **Ineffective Offer:** Received → Viewed (**_target = 0_**)

- **Unviewed Offer:** Received (**_target = 0_**)

for `informational`:
- **Successful Offer:** Received → Viewed → Transaction(s) (**_target = 1_**)

- **Ineffective Offer:** Received → Viewed (**_target = 0_**)

- **Unviewed Offer:** Received (**_target = 0_**)


In [0]:
def build_targets(df):
    df = df.copy()
    df.sort_values(["account_id", "time_since_test_start"], inplace=True)

    results = []

    # Process data grouped by user
    for acc, user_df in df.groupby("account_id"):
        
        # Iterate over all received offers for the user
        received_events = user_df[user_df["event"] == "offer received"]

        for idx, offer in received_events.iterrows():
            offer_id = offer["offer_id_unified"]
            offer_type = offer["offer_type"]
            t0 = offer["time_since_test_start"]
            duration = offer["duration"]
            min_value = offer["min_value"]

            # Validity window of the offer
            t_end = t0 + duration

            # Filter all events inside the offer’s validity period
            window = user_df[
                (user_df["time_since_test_start"] >= t0) &
                (user_df["time_since_test_start"] <= t_end)
            ]

            viewed = "offer viewed" in window["event"].values
            completed = "offer completed" in window["event"].values

            transactions = window[window["event"] == "transaction"]

            # ---------------------------
            #   Rules for Informational Offers
            # ---------------------------
            if offer_type == "informational":
                if viewed and (len(transactions) > 0):
                    target = 1
                else:
                    target = 0

            # ---------------------------
            #   Rules for BOGO / Discount Offers
            # ---------------------------
            else:
                if completed:
                    target = 1
                else:
                    target = 0

            results.append({
                "account_id": acc,
                "offer_id_unified": offer_id,
                "t_received": t0,
                "offer_type": offer_type,
                "target": target
            })

    return pd.DataFrame(results)

ml_df = build_targets(df)
ml_df.head()

In [0]:
ml_df.info()

In [0]:
ml_df.target.value_counts(normalize=True)

**Notes:**

- Successes are only valid within each offer’s defined duration window.

- For `bogo` and `discount` offers, an `offer completed` event is the only requirement—and is sufficient—to count as a success.

- For `informational` offers, success requires both an `offer view` and at least one transaction during the validity window.

- There is no transaction ID linked to a specific offer ID, which makes it difficult to attribute financial gains to individual offers.

- A single completed event may satisfy more than one received offer, as long as it falls within the validity windows of both offers and they belong to the same offer type (offer_id).

- It is not possible to determine whether a given transaction should be associated with an informational offer or exclusively with another offer type.

- Even if a customer receives the same offer multiple times, a single instance of satisfying the success rules is enough for the customer to receive a success target for that `offer_id`.

## 2.4 Feature Engineering

Prepare features for modeling.

In [0]:
def feature_engineering(profile_df, offers_df, df):
    df = df.copy()

    # Relevant column subsets
    profile_columns = [
        'id',
        'age',
        'credit_card_limit',
        'gender',
        'registered_on'
    ]

    offers_columns = [
        'id',
        'discount_value',
        'duration',
        'min_value',
        'email',
        'mobile',
        'social',
        'web'
    ]

    # Merge with profile
    df = df.merge(profile_df[profile_columns], left_on="account_id", right_on="id", how='left').drop(columns=['id'])

    # Merge with offer
    df = df.merge(offers_df[offers_columns], left_on="offer_id_unified", right_on="id", how='left').drop(columns=['id'])

    # Temporal features of customer registration
    reference_date = pd.Timestamp("2018-12-31")
    df["registered_on"] = pd.to_datetime(df["registered_on"])
    df["membership_days"] = (reference_date - df["registered_on"]).dt.days
    df["membership_months"] = df["membership_days"] // 30
    df["membership_years"] = df["membership_days"] // 365
    df["registration_year"] = df["registered_on"].dt.year
    df["registration_month"] = df["registered_on"].dt.month
    df["registration_day"] = df["registered_on"].dt.day
    df.drop(columns='registered_on', inplace=True)

    return df

final_df = feature_engineering(profile_df, offers_df, ml_df)
final_df

In [0]:
X = final_df.drop(columns=["target","account_id", "offer_id_unified", "t_received"])
y = final_df["target"]
account_ids = final_df["account_id"]
categorical_cols = ["gender", "offer_type"]
numerical_cols = [col for col in X.columns if col not in categorical_cols]
X

**Notes:**

- Variables were created related to the user's registration period, including membership duration in days, months, and years, as well as registration year, month, and day.



## 2.4 Model Training

Train and evaluate models.

In [0]:
def define_models_and_params(random_state=42):
    
    # 1. Decision Tree Classifier (DT)
    dt_params = {
        'model__max_depth': [3, 5, 7],
        'model__min_samples_split': [2, 5]
    }
    dt_model = ('Decision Tree', DecisionTreeClassifier(random_state=random_state), dt_params)

    # 2. Random Forest Classifier (RF)
    rf_params = {
        'model__n_estimators': [50, 100],
        'model__max_depth': [5, 10]
    }
    rf_model = ('Random Forest', RandomForestClassifier(random_state=random_state, n_jobs=-1), rf_params)

    # 3. XGBoost Classifier (XGB)
    xgb_params = {
        'model__n_estimators': [50, 100],
        'model__learning_rate': [0.05, 0.1]
    }
    xgb_model = ('XGBoost', XGBClassifier(random_state=random_state, eval_metric='logloss', n_jobs=-1), xgb_params)

    # 4. CatBoost Classifier (CAT)
    cat_params = {
        'model__iterations': [50, 100],
        'model__depth': [5, 7],
        'model__verbose': [0]
    }
    cat_model = ('CatBoost', CatBoostClassifier(random_state=random_state), cat_params)

    return [dt_model, rf_model, xgb_model, cat_model]

In [0]:
def create_preprocessor(categorical_cols, numerical_cols):
    preprocess = ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
            ("num", "passthrough", numerical_cols)
        ],
        remainder='drop' 
    )
    return preprocess

In [0]:
def tune_and_evaluate_models(X_train, y_train, X_valid, y_valid, models, preprocessor, groups_train, n_splits=5, scoring='roc_auc', random_state=42):
    """
    Performs GridSearchCV for each model and evaluates the best estimator 
    on the holdout validation set.
    
    Returns: A dictionary of results and a dictionary of the best estimators.
    """
    
    results = {}
    best_estimators = {}
    
    # Stratified K-Fold for robust CV on the training data
    cv_folds = GroupKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    
    for name, model, params in models:
        print(f"\n--- Starting Hyperparameter Optimization: {name} ---")

        model_pipeline = Pipeline([
            ("preprocess", preprocessor),
            ("model", model)
        ])

        grid_search = GridSearchCV(
            estimator=model_pipeline,
            param_grid=params,
            scoring=scoring,
            cv=cv_folds,
            verbose=1,
            n_jobs=-1 
        )

        grid_search.fit(X_train, y_train, groups=groups_train)

        best_estimator = grid_search.best_estimator_
        best_estimators[name] = best_estimator
        y_pred_valid = best_estimator.predict(X_valid)
        y_proba_valid = best_estimator.predict_proba(X_valid)[:, 1]

        # Metrics
        roc_auc = roc_auc_score(y_valid, y_proba_valid)
        accuracy = accuracy_score(y_valid, y_pred_valid)
        precision = precision_score(y_valid, y_pred_valid, zero_division=0)
        recall = recall_score(y_valid, y_pred_valid, zero_division=0)

        results[name] = {
            'best_cv_score': grid_search.best_score_,
            'best_params': grid_search.best_params_,
            'valid_accuracy': accuracy,
            'valid_precision': precision,
            'valid_recall': recall,
            'valid_roc_auc': roc_auc
        }

        print(f"{name} Best Params: {results[name]['best_params']}")
        print(f"{name} CV {scoring.upper()}: {results[name]['best_cv_score']:.4f}")
        print(f"{name} Holdout Validation Metrics:")
        print(f"   Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | ROC AUC: {roc_auc:.4f}")
        
    return results, best_estimators

In [0]:
preprocessor = create_preprocessor(categorical_cols, numerical_cols)
models = define_models_and_params(random_state=42)

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, valid_idx in gss.split(X, y, groups=account_ids):
    X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
    groups_train = account_ids.iloc[train_idx]

final_results, final_estimators = tune_and_evaluate_models(
    X_train, 
    y_train, 
    X_valid, 
    y_valid, 
    models, 
    preprocessor,
    groups_train,
    scoring='roc_auc'
)


**Notes:**

- Four classifier models were tested using a grid search approach: Decision Tree, Random Forest, XGBoost, and CatBoost.

- Label Encoding was applied to handle categorical variables such as `gender` and `offer_type`.

- Group K-Fold validation was used to prevent any overlap of users between training and validation sets.

- Classification metrics including accuracy, precision, and recall were used to evaluate model performance.

- ROC AUC was used for model optimization, as this metric is threshold-invariant.

## 2.5 Model Evaluation

Evaluate model performance on the test set.

In [0]:
test_df = pd.read_csv(f"{path}/test.csv")
test_df = build_targets(test_df)
test_df = feature_engineering(profile_df, offers_df, test_df)
X_test = test_df.drop(columns=["target","account_id", "offer_id_unified", "t_received"])
y_test = test_df["target"]
X_test

In [0]:
estimator = final_estimators['Random Forest']
y_pred_test = estimator.predict(X_test)
y_proba_test = estimator.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba_test)
print(f"ROC AUC: {roc_auc:.4f}")

print(classification_report(y_test, y_pred_test))

In [0]:
cm = confusion_matrix(y_test, y_pred_test)
sns.heatmap(cm, annot=True, fmt='d', cmap='crest')
plt.xlabel('predicted')
plt.ylabel('actual')
plt.show()

In [0]:
print(f"Current Efficiency: {test_df.target.value_counts(normalize=True)[1] * 100:.2f}%")
print(f"Model Efficiency: {precision_score(y_test, y_pred_test) * 100:.2f}%")

**Notes:**

- The model performs well at identifying the positive class (successful offers), with a recall of **78%**.

- Positive class precision of **73.81**% indicates that most positive predictions are correct.

- ROC AUC of **0.768** suggests the model discriminates effectively between successful and unsuccessful offers.

- Current Efficiency (baseline success rate): **60.12%**

- Model Efficiency (precision for positive class): **73.81%**

## 2.6 Model Interpretability
Assess the impact of each feature on the model’s performance.

In [0]:
result = permutation_importance(estimator, X, y, n_repeats=10, random_state=42)

perm_df = pd.DataFrame({
    "feature": X_test.columns,
    "importance_mean": result.importances_mean,
    "importance_std": result.importances_std
}).sort_values("importance_mean", ascending=False)

perm_df

**Notes:**

- **credit_card_limit** and **membership** duration are the most influential features in predicting offer success.

- Demographic features like **gender** and **age** also play a notable role.

- Behavioral or channel-related features like **email** and **mobile** are less important in this model.

## 2.7 Model Inference

Simulation of the model's predictions for new users.

In [0]:
new_users_df = pd.read_csv(f"{path}/users_without_offers.csv")
new_users_df = new_users_df.rename(columns={"id": "account_id"}).copy()

# Temporary keys
users = new_users_df.assign(_key=1)
offers = offers_df.assign(_key=1)

# Cross join
df = (
    users[["_key", "account_id"]]
    .merge(offers[["_key", "id", "offer_type"]], on="_key")
    .drop(columns="_key")
)

df = df.rename(columns={"id": "offer_id_unified"})
print(f"Shape: {df.shape}")
df

In [0]:
X_new_users = feature_engineering(profile_df, offers_df, df).drop(columns=["account_id", "offer_id_unified"])
X_new_users.head()

In [0]:
df['score'] = estimator.predict_proba(X_new_users)[:, 1]
df.sort_values(by=['account_id', 'score'], ascending=[True, False], inplace=True)
df.reset_index(drop=True, inplace=True)
df

In [0]:
df_max = df.loc[df.groupby('account_id')['score'].idxmax()].sort_values('score', ascending=False)
df_max.reset_index(drop=True, inplace=True)

df_max[['account_id', 'offer_id_unified', 'offer_type', 'score']].style.background_gradient(
    subset=['score'], cmap='crest'
).set_table_styles([{'selector': 'td', 'props': [('font-size', '11pt')]}])

**Notes:**

- All possible user-offer combinations were generated for new users, allowing the trained model to predict the propensity score for each pair.

- Based on these scores, offers can be ranked to recommend the best option for each user.


## 2.8 Conclusions

### 2.8.1 Summary

- Developed a propensity model to predict the probability of success for a given offer.

- Increased Current Efficiency (baseline success rate) from **60.12%** to **73.81%**.

- Enabled offer ranking and recommendation for each customer.

### 2.8.2 Improvements

- Further exploration and engineering of new features.

- Optimize decision thresholds for improved performance.

- Refine feature selection to remove less relevant variables.

- Develop a model to predict not only the best offer, but also the optimal timing to send it.

- Enhance model selection using Bayesian optimization.

- Use SHAP or other explainability tools to understand why certain offers are recommended, supporting trust in deployment.

- Track transactions for each offer to attribute financial returns to the offers sent.

### 2.8.3 Next Steps

- Deploy the model in a controlled environment to validate real-world performance and compare against baseline strategies.

- Conduct a randomized controlled experiment to evaluate the causal impact of offers and develop an uplift model.