## Challenge

In [1]:
import sys
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
import joblib

In [2]:
# Load data
train = pd.read_csv('ecommerce_returns_train.csv')
test = pd.read_csv('ecommerce_returns_test.csv')

In [3]:
def preprocess(df):
    """Simple preprocessing pipeline"""
    df_processed = df.copy()
    
    # Encode categorical: product_category
    le_category = LabelEncoder()
    df_processed['product_category_encoded'] = le_category.fit_transform(
        df_processed['product_category']
    )
    
    # Handle missing sizes (Fashion items only have sizes)
    if df_processed['size_purchased'].notna().any():
        most_common_size = df_processed['size_purchased'].mode()[0]
        df_processed['size_purchased'].fillna(most_common_size, inplace=True)
        
        le_size = LabelEncoder()
        df_processed['size_encoded'] = le_size.fit_transform(
            df_processed['size_purchased']
        )
    
    # Feature selection
    feature_cols = [
        'customer_age', 'customer_tenure_days', 'product_category_encoded',
        'product_price', 'days_since_last_purchase', 'previous_returns',
        'product_rating', 'size_encoded', 'discount_applied'
    ]
    
    X = df_processed[feature_cols]
    y = df_processed['is_return']
    
    return X, y

In [4]:
# Prepare data
X_train, y_train = preprocess(train)
X_test, y_test = preprocess(test)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train baseline model
baseline_model = LogisticRegression(random_state=42, max_iter=1000)
baseline_model.fit(X_train_scaled, y_train)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_processed['size_purchased'].fillna(most_common_size, inplace=True)


In [5]:
# Predictions
y_pred = baseline_model.predict(X_test_scaled)

# Basic evaluation
print("Baseline Model Performance")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Save artifacts
joblib.dump(baseline_model, 'baseline_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

print("\n" + "=" * 50)
print("YOUR TASK: Evaluate thoroughly and improve this baseline")
print("=" * 50)

Baseline Model Performance
Accuracy: 0.7475

Classification Report:
              precision    recall  f1-score   support

           0       0.75      1.00      0.86      1495
           1       0.00      0.00      0.00       505

    accuracy                           0.75      2000
   macro avg       0.37      0.50      0.43      2000
weighted avg       0.56      0.75      0.64      2000


YOUR TASK: Evaluate thoroughly and improve this baseline




## == Part 1 ==

### 1. Weaknesses
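The model fails completely on the minority class (label = 1), most likely because of class imbalance:
- Precision = 0
- Recall = 0
- F1 = 0

The model never predicts class 1, so accuracy is misleading: predicting all zeros already gives ~75% accuracy (1495 of the 2000 test rows are class 0).

Categorical variables are also handled poorly. `LabelEncoder` imposes an arbitrary order on `product_category`, which a linear model reads as a numeric scale even though the categories have no order. In addition, `preprocess` refits the encoders on the test set, so the integer codes are only consistent if both files contain exactly the same set of values. One-hot encoding may work better.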
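
As a rough illustration of the one-hot alternative, the sketch below fits scikit-learn's `OneHotEncoder` on a toy training frame and reuses it on a toy test frame. The category values (`Fashion`, `Electronics`, `Home`, `Toys`) are made up for the example and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames; the category values are illustrative assumptions.
train_toy = pd.DataFrame({"product_category": ["Fashion", "Electronics", "Home", "Fashion"]})
test_toy = pd.DataFrame({"product_category": ["Home", "Toys"]})  # "Toys" is unseen in training

# Fit on the training data only; unseen test categories become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train_toy[["product_category"]]).toarray()
test_encoded = encoder.transform(test_toy[["product_category"]]).toarray()

print(encoder.categories_[0])  # one indicator column per category, no implied order
print(test_encoded)
```

Fitting once on the training data and only calling `transform` on the test data keeps the column layout identical across both sets.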

### 2. Where Does the Model Fail Most?
The model fails most heavily on return (class 1) predictions: it misses every actual return (recall = 0).

Potential reasons:
- Logistic regression may be too simple for this problem.
- The classes are imbalanced.
- The features may not capture the interactions that separate the two classes.

### 3. Is Accuracy the Right Metric?
No, because the dataset is imbalanced. F1 or recall on class 1 is a better choice, since either one exposes the model's failure to detect returns.
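
To make the point concrete, the following sketch scores a hypothetical "model" that always predicts class 0, using only the class counts from the support column of the baseline report (1495 and 505).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Class counts taken from the support column of the baseline report.
y_true = np.array([0] * 1495 + [1] * 505)
y_all_zero = np.zeros_like(y_true)  # never predicts a return

acc = accuracy_score(y_true, y_all_zero)
rec = recall_score(y_true, y_all_zero, pos_label=1, zero_division=0)
f1 = f1_score(y_true, y_all_zero, pos_label=1, zero_division=0)

print(f"accuracy={acc:.4f}, recall={rec:.2f}, f1={f1:.2f}")
# accuracy=0.7475, recall=0.00, f1=0.00
```

The accuracy matches the baseline exactly, while recall and F1 on class 1 show that no return is ever caught.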

## == Part 2 ==

### 1. Define “Success” in Business Terms
A model is successful if it **reduces total financial loss from returns** by:  
Catching as many true returns as possible while avoiding unnecessary interventions on non-returns.

**Final business objective**:  
`Minimize total financial cost = return losses + intervention losses`

### 2. Recommended Metrics (Business-Aligned)
- **Recall (class 1)**: the "catch rate" of costly returns  
  False negatives are very expensive → high recall saves money.
- **Precision (class 1)**  
  Each false positive (unnecessary intervention) costs $3 → low precision = wasted money.
- **Precision-Recall AUC**  
  Summarizes the precision/recall trade-off across all thresholds, which suits an imbalanced dataset better than accuracy.

### 3. False Positives vs False Negatives (Cost Trade-Off)
| Mistake                  | Cost    | Description                                      |
|--------------------------|---------|--------------------------------------------------|
| False Negative (missed return) | **–$18** | Predicted no-return → customer returns → big loss |
| False Positive (wrong intervention) | **–$3**  | Predicted return → unnecessary action → small loss |

**Key insight**: Missing a return is **6× more expensive** than a wasted intervention.  
→ The optimal model should **accept more false positives** to **maximize recall**, even if precision drops.
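
The cost asymmetry also implies a break-even probability for intervening. The sketch below assumes well-calibrated probabilities and uses the -$3 and -$18 costs from the table together with the +$15 credit for a caught return used in the impact function of this notebook. The function name `break_even_threshold` is hypothetical. Intervening has the higher expected value when `p * 15 - (1 - p) * 3 > -p * 18`, which rearranges to `p > 3 / (15 + 3 + 18)`.

```python
def break_even_threshold(tp_gain=15.0, fp_cost=3.0, fn_cost=18.0):
    """Probability of return above which intervening beats not intervening.

    Intervene:      p * tp_gain - (1 - p) * fp_cost
    Do not act:    -p * fn_cost
    Setting the two equal and solving for p gives the expression below.
    """
    return fp_cost / (tp_gain + fp_cost + fn_cost)


threshold = break_even_threshold()
print(f"Intervene when P(return) > {threshold:.4f}")
# Intervene when P(return) > 0.0833
```

Under these assumptions the cost-optimal cut-off sits far below the default 0.5, which is consistent with accepting more false positives.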

### 4. Calculate Financial Impact of Predictions
```python
def compute_financial_impact(y_true, y_proba, threshold=0.5):
    # y_proba: predicted probability of class 1 (return) for each order
    y_pred = (y_proba >= threshold).astype(int)
    impact = 0
    for yt, yp in zip(y_true, y_pred):
        if yp == 1 and yt == 1:      # Correctly caught return → save $15 net
            impact += 15
        elif yp == 1 and yt == 0:    # False positive → waste $3
            impact -= 3
        elif yp == 0 and yt == 1:    # False negative → lose $18
            impact -= 18
        # True negative (no intervention, no return) → $0
    return impact
```

### 5. Determine optimal threshold

In [6]:
def compute_financial_impact(y_true, y_proba, threshold=0.5):
    y_pred = (y_proba >= threshold).astype(int)
    
    impact = 0
    for yt, yp in zip(y_true, y_pred):
        if yp == 1 and yt == 1:
            impact += 15
        elif yp == 1 and yt == 0:
            impact -= 3
        elif yp == 0 and yt == 1:
            impact -= 18    
    return impact
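
For larger test sets the same calculation can be done without a Python loop. The following is a sketch of an equivalent vectorized version, not the function used in this notebook. The name `financial_impact_vectorized` and the keyword arguments for the dollar amounts are assumptions added for illustration.

```python
import numpy as np


def financial_impact_vectorized(y_true, y_proba, threshold=0.5,
                                tp_gain=15, fp_cost=3, fn_cost=18):
    """Same cost model as compute_financial_impact, using NumPy masks."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)

    tp = np.sum((y_pred == 1) & (y_true == 1))  # caught returns
    fp = np.sum((y_pred == 1) & (y_true == 0))  # unnecessary interventions
    fn = np.sum((y_pred == 0) & (y_true == 1))  # missed returns
    return int(tp_gain * tp - fp_cost * fp - fn_cost * fn)


# Toy example: 2 caught returns, 1 false positive, 1 missed return.
y_true_toy = np.array([1, 1, 0, 0, 1])
y_proba_toy = np.array([0.9, 0.2, 0.7, 0.1, 0.6])
toy_impact = financial_impact_vectorized(y_true_toy, y_proba_toy)
print(toy_impact)  # 2 * 15 - 1 * 3 - 1 * 18 = 9
```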

In [7]:
thresholds = np.linspace(0.05, 0.95, 10)
results = []

for thre in thresholds:
    impact = compute_financial_impact(y_test, y_pred, threshold=thre)
    results.append((thre, impact))

df = pd.DataFrame(results, columns=['threshold', 'total_impact'])
df.sort_values('total_impact', ascending=False).head()
df

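| threshold | total_impact |
|-----------|--------------|
| 0.05 | -9090 |
| 0.15 | -9090 |
| 0.25 | -9090 |
| 0.35 | -9090 |
| 0.45 | -9090 |
| 0.55 | -9090 |
| 0.65 | -9090 |
| 0.75 | -9090 |
| 0.85 | -9090 |
| 0.95 | -9090 |

The result is -9090 USD at every threshold. The sweep is flat because the loop passes `y_pred` (the hard 0/1 labels from `baseline_model.predict`, which are all 0 here) instead of predicted probabilities. No value ever reaches a threshold, so all 505 actual returns count as false negatives: 505 × $18 = $9,090. A meaningful sweep needs the class 1 probabilities, for example `baseline_model.predict_proba(X_test_scaled)[:, 1]`.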
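
A threshold sweep only varies when it receives probabilities rather than hard labels. The sketch below shows the pattern on synthetic stand-in data from `make_classification` with a roughly 75/25 class split, because the original CSV files are not available here. The helper `impact_at` and all variable names are assumptions, and the numbers it produces say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def impact_at(y_true, y_proba, threshold):
    """Cost model from this notebook: +15 per TP, -3 per FP, -18 per FN."""
    y_pred = (y_proba >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return int(15 * tp - 3 * fp - 18 * fn)


# Synthetic stand-in data (assumption), not the e-commerce returns data.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # probability of class 1, not hard labels

thresholds = np.linspace(0.05, 0.95, 10)
sweep = pd.DataFrame({
    "threshold": thresholds,
    "total_impact": [impact_at(y_te, proba, t) for t in thresholds],
})
best = sweep.sort_values("total_impact", ascending=False).iloc[0]
print(sweep)
print(f"Best threshold on the synthetic data: {best['threshold']:.2f}")
```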

### 6. Critical Questions

Because a missed return (-$18) costs 6× as much as a wasted intervention (-$3), the optimal balance leans heavily toward recall, even if precision decreases.
The target is to maximize recall on class 1 up to the point where the additional false positives (the drop in precision) begin to reduce the net financial benefit.

## == Part 3 ==

### 1. Dataset Problems
- **Issue 1**: The main problem is class imbalance. Although SMOTE and similar resampling techniques are often suggested, they change the distribution of the training data, so it is better to keep the data as it is and handle the imbalance in the model (for example with class weights).
- **Issue 2**: Try a different categorical encoding, because `LabelEncoder` imposes an order on the categories.
- **Issue 3**: Try a model that can capture nonlinear relationships in the data while staying reasonably interpretable, such as a tree-based model.

### 2. Feature Engineering
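A good set of new features might be:
- `product_price * (1 - discount_applied)`: effective purchase price
- `customer_tenure_days / customer_age`: loyalty tendency
- `previous_returns_ratio` (`previous_returns / customer_tenure_days`): return frequency, likely a strong return predictor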

### 3. Different Algorithm
`RandomForestClassifier` is a good starting point for modeling the data differently, with simple hyperparameter tuning via a randomized search.

In [8]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score,  make_scorer, f1_score, recall_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV

In [9]:
def feat_eng(df, fit_encoders=False, encoders=None):
    df = df.copy()

    # Fill missing sizes with the most common size of the frame being processed.
    # Assigning the result back avoids the chained `inplace=True` call,
    # which stops working in pandas 3.0.
    df["size_purchased"] = df["size_purchased"].fillna(df["size_purchased"].mode()[0])

    if fit_encoders:
        # Training data: fit the encoders and keep them for reuse
        encoders = {}

        le_cat = LabelEncoder()
        df["product_category_encoded"] = le_cat.fit_transform(df["product_category"])
        encoders["cat"] = le_cat

        le_size = LabelEncoder()
        df["size_encoded"] = le_size.fit_transform(df["size_purchased"])
        encoders["size"] = le_size
    else:
        # Test data: reuse the encoders fitted on the training data
        df["product_category_encoded"] = encoders["cat"].transform(df["product_category"])
        df["size_encoded"] = encoders["size"].transform(df["size_purchased"])

    # New features (the +1 avoids division by zero)
    df["effective_price"] = df["product_price"] * (1 - df["discount_applied"])
    df["age_tenure_ratio"] = df["customer_tenure_days"] / (df["customer_age"] + 1)
    df["previous_returns_ratio"] = df["previous_returns"] / (df["customer_tenure_days"] + 1)

    feature_cols = [
        "customer_age","customer_tenure_days","product_category_encoded",
        "product_price","effective_price","age_tenure_ratio",
        "days_since_last_purchase","previous_returns","previous_returns_ratio",
        "product_rating","size_encoded","discount_applied"
    ]

    X = df[feature_cols]
    y = df["is_return"]

    return X, y, encoders

In [10]:
# Fit encoders
X_train, y_train, encoders = feat_eng(train, fit_encoders=True)
X_test, y_test, _ = feat_eng(test, fit_encoders=False, encoders=encoders)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [11]:
param_dist = {
    "n_estimators": [150, 300, 500, 800],
    "max_depth": [6, 8, 10, 12, 15, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True, False],
    "class_weight": ["balanced"]  # do not change this
}

scoring = {
    "recall_class_1": make_scorer(recall_score, pos_label=1),
    "f1_class_1": make_scorer(f1_score, pos_label=1),
}

rf_base = RandomForestClassifier(random_state=42)

rf_search = RandomizedSearchCV(
    estimator=rf_base,
    param_distributions=param_dist,
    n_iter=40,
    scoring=scoring,
    refit="f1_class_1",
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1
)

rf_search.fit(X_train, y_train)

best_rf = rf_search.best_estimator_
pred_rf_tuned = best_rf.predict(X_test)

Fitting 3 folds for each of 40 candidates, totalling 120 fits


In [12]:
import pickle

# Save to pickle
with open("best_rf_model.pkl", "wb") as f:
    pickle.dump(best_rf, f)

In [13]:
print("Best Random Forest")
print(classification_report(y_test, pred_rf_tuned))
print("Accuracy:", accuracy_score(y_test, pred_rf_tuned))

Best Random Forest
              precision    recall  f1-score   support

           0       0.84      0.42      0.56      1495
           1       0.31      0.76      0.44       505

    accuracy                           0.51      2000
   macro avg       0.57      0.59      0.50      2000
weighted avg       0.70      0.51      0.53      2000

Accuracy: 0.5055


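Even though the accuracy is lower (0.51 vs. 0.75 for the baseline), the model now predicts class 1. Compared with the baseline, the results show the changes we expected from this experiment:

- Recall for class 1 increased from 0.00 to 0.76
- Precision for class 1 increased from 0.00 to 0.31
- Consequently, the F1-score for class 1 increased from 0.00 to 0.44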

## == Part 4 ==

### Metrics to Track
#### A. Model Performance Metrics
- Precision (class 1, returners): ensures interventions are accurate and not wasted.
- Recall (class 1, returners): critical because missing a return costs $18.
- F1 score (class 1): balances recall and precision for the business objective.
- Confusion matrix over time: tracks false positives and false negatives, which are directly tied to cost.

#### B. Business Metrics
- Net financial impact per day/week (savings vs losses)
- Intervention rate (% of orders flagged)
- Return rate drift (actual returns shifting over time)

### How to Detect Model Degradation

#### A. Metric Drift
- **Recall (class 1)** drops >15% from baseline  
- **Precision (class 1)** drops >20% from baseline  

#### B. Data Drift (Population Stability Index - PSI)
- **PSI > 0.25**: severe drift, **retraining required**
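
One common way to compute PSI for a single numeric feature is sketched below. The function name `population_stability_index`, the choice of 10 quantile bins taken from the reference sample, and the small `eps` floor are all assumptions for illustration. The demo data is random and unrelated to the returns dataset.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a new sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges from quantiles of the reference sample;
    # the outer bins are open-ended so no value falls outside.
    inner_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    exp_bins = np.searchsorted(inner_edges, expected, side="right")
    act_bins = np.searchsorted(inner_edges, actual, side="right")

    exp_frac = np.bincount(exp_bins, minlength=n_bins) / len(expected)
    act_frac = np.bincount(act_bins, minlength=n_bins) / len(actual)

    # Floor the fractions to avoid log(0) for empty bins.
    exp_frac = np.clip(exp_frac, eps, None)
    act_frac = np.clip(act_frac, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)   # e.g. a feature at training time
same_dist = rng.normal(0.0, 1.0, 5000)   # new data, no drift
shifted = rng.normal(1.5, 1.0, 5000)     # new data, strong drift

psi_same = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)
print(f"no drift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

In a monitoring job, a check like this would run per feature on each new batch of orders and flag any feature whose PSI exceeds 0.25.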

#### C. Concept Drift
- Customer behavior or return patterns have changed  

### When to Retrain the Model

#### A. Triggered by Performance Degradation
- **Recall (class 1)** drops >15% from baseline  
- **F1-score (class 1)** drops >10% from baseline  
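
A minimal sketch of such a trigger follows, assuming the percentages are relative drops from the baseline value (the text does not say whether relative or absolute drops are meant). The function `needs_retraining`, the metric keys, and the "current" values are hypothetical. The baseline values are the class 1 recall and F1 from the tuned random forest report.

```python
def needs_retraining(baseline, current, max_relative_drop):
    """Return the metrics whose relative drop from baseline exceeds its limit."""
    breaches = {}
    for name, limit in max_relative_drop.items():
        drop = (baseline[name] - current[name]) / baseline[name]
        if drop > limit:
            breaches[name] = round(drop, 3)
    return breaches


baseline_metrics = {"recall_1": 0.76, "f1_1": 0.44}  # from the tuned random forest report
current_metrics = {"recall_1": 0.60, "f1_1": 0.41}   # hypothetical production values
limits = {"recall_1": 0.15, "f1_1": 0.10}            # thresholds from the list above

breaches = needs_retraining(baseline_metrics, current_metrics, limits)
print(breaches)  # {'recall_1': 0.211}
```

Here recall fell by about 21% relative to baseline, which exceeds the 15% limit, while the F1 drop of about 7% stays under its 10% limit.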

#### B. Scheduled Retraining
- Mandatory before major sales events:  
  - Black Friday / Cyber Monday  
  - Holiday season  
  - Summer sales, etc.

#### C. Triggered by Business Events
- Launch of **new markets** or **new product lines**  
- Major **pricing strategy changes**  
- Updates to **return/refund policies** or logistics processes