# Click Prediction Preprocessing Pipeline

This notebook outlines a preprocessing workflow for predicting the probability of a customer clicking an offer (target `y`) given they have seen it. After preprocessing, we perform PCA for dimensionality reduction and train a Naive Bayes classifier.

**Steps covered:**
1. Setup & Data Loading
2. Class Distribution Analysis
3. Dropping Unused ID Variables
4. Missing-Value Handling
5. Temporal Feature Engineering
6. Encoding High-Cardinality Categories
7. Feature Selection
8. Train/Validation Split
9. Saving Preprocessed Data to CSV
10. PCA Dimensionality Reduction
11. Naive Bayes Classification

Each section contains markdown explanations and commented code.

---

## 1. Setup & Data Loading

Import the required libraries, then load the 500-row train and test samples used to prototype the pipeline. Loading the data dictionary is optional.

In [None]:
import pandas as pd  # data manipulation
import numpy as np   # numerical operations
from sklearn.model_selection import train_test_split

# Load the sample data (first 500 rows of each file); the data dictionary is optional
train_sample = pd.read_csv('output/train_df_head.csv')
test_sample  = pd.read_csv('output/test_df_head.csv')
# data_dict    = pd.read_csv('output/data_dict.csv')  # Uncomment if available

print(f"Train sample shape: {train_sample.shape}")
print(f"Test sample shape:  {test_sample.shape}")

## 2. Class Distribution Analysis

Check how imbalanced the target variable `y` is by computing its class counts and proportions.

In [None]:
# Compute class distribution of y
class_counts = train_sample['y'].value_counts()
class_props  = train_sample['y'].value_counts(normalize=True)
print("Class Counts:\n", class_counts)
print("\nClass Proportions:\n", class_props)
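
The proportions matter for evaluation later on: the average precision of a random ranker is roughly equal to the positive rate, so the click rate gives a baseline to compare against. Below is a minimal sketch of a summary helper on a toy target; `summarize_target` and the toy values are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

def summarize_target(y: pd.Series) -> dict:
    """Return counts, positive rate, and negatives-per-positive for a binary 0/1 target."""
    counts = y.value_counts()
    n_pos = int(counts.get(1, 0))
    n_neg = int(counts.get(0, 0))
    return {
        "n_pos": n_pos,
        "n_neg": n_neg,
        "pos_rate": n_pos / len(y) if len(y) else float("nan"),
        "neg_per_pos": n_neg / n_pos if n_pos else float("inf"),
    }

# Toy target with 2 clicks out of 10 impressions (illustrative values only)
toy_y = pd.Series([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])
summary = summarize_target(toy_y)
print(summary)
```

On the real sample, the same helper could be called as `summarize_target(train_sample['y'])`.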

## 3. Dropping Unused ID Variables

Drop ID columns that are not used as features.

In [None]:
drop_cols = ['id1']
train_sample.drop(columns=drop_cols, inplace=True)
test_sample.drop(columns=drop_cols, inplace=True)

## 4. Missing-Value Handling

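Handle missing values in three steps: replace the `-9999` sentinel with `NaN`, drop features that are more than 95% missing, and impute the remaining gaps with the training median while adding missing-value indicator columns.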

In [None]:
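from sklearn.impute import SimpleImputer

# Numeric feature columns, excluding the target (absent from the test set)
# and the ID/timestamp columns that later sections handle separately
exclude = ['y', 'id2', 'id3', 'id4', 'id5']
num_cols = [c for c in train_sample.select_dtypes(include='number').columns
            if c not in exclude]

# Replace sentinel values with NaN
train_sample[num_cols] = train_sample[num_cols].replace(-9999, np.nan)
test_sample[num_cols]  = test_sample[num_cols].replace(-9999, np.nan)

# Drop features with >95% missing (measured on the training sample only)
miss_pct = train_sample[num_cols].isna().mean()
drop_high_miss = miss_pct[miss_pct > 0.95].index.tolist()
train_sample.drop(columns=drop_high_miss, inplace=True)
test_sample.drop(columns=drop_high_miss, inplace=True)
num_cols = [c for c in num_cols if c not in drop_high_miss]  # keep the list in sync

# Add missing-value indicators before imputing, so the information is not lost
miss_cols = [c for c in num_cols if train_sample[c].isna().any()]
for c in miss_cols:
    train_sample[f'{c}_missing'] = train_sample[c].isna().astype(int)
    test_sample[f'{c}_missing']  = test_sample[c].isna().astype(int)

# Impute with the training median (output shape matches num_cols)
median_imp = SimpleImputer(strategy='median')
train_sample[num_cols] = median_imp.fit_transform(train_sample[num_cols])
test_sample[num_cols]  = median_imp.transform(test_sample[num_cols])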
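
To make the order of operations concrete, here is a small pandas-only sketch of the same idea on toy data: replace the sentinel, flag missing values, then fill with the training median. The helper name `impute_with_indicators`, the `_missing` suffix, and the toy frames are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

SENTINEL = -9999  # same sentinel value as in the cell above

train_toy = pd.DataFrame({"f1": [1.0, SENTINEL, 3.0, 5.0], "f2": [10.0, 20.0, 30.0, 40.0]})
test_toy = pd.DataFrame({"f1": [SENTINEL, 2.0], "f2": [SENTINEL, 15.0]})

def impute_with_indicators(train, test, cols):
    """Replace the sentinel with NaN, flag missing values, and fill with the train median."""
    train, test = train.copy(), test.copy()
    train[cols] = train[cols].replace(SENTINEL, np.nan)
    test[cols] = test[cols].replace(SENTINEL, np.nan)
    medians = train[cols].median()  # statistics come from the training data only
    for c in cols:
        # Indicators must be computed before the gaps are filled
        train[f"{c}_missing"] = train[c].isna().astype(int)
        test[f"{c}_missing"] = test[c].isna().astype(int)
    train[cols] = train[cols].fillna(medians)
    test[cols] = test[cols].fillna(medians)
    return train, test

train_out, test_out = impute_with_indicators(train_toy, test_toy, ["f1", "f2"])
print(train_out)
print(test_out)
```

Taking the medians from the training frame alone keeps test-set statistics out of the preprocessing, which mirrors the `fit_transform` / `transform` split used with `SimpleImputer`.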

## 5. Temporal Feature Engineering

Extract hour-of-day, day-of-week, and weekend-flag features from the impression timestamp (`id4`), then drop the raw timestamp.

In [None]:
# Convert the impression timestamp to datetime and extract features,
# applying identical steps to the train and test samples
for df in (train_sample, test_sample):
    ts_imp = pd.to_datetime(df['id4'])
    df['hour'] = ts_imp.dt.hour
    df['weekday'] = ts_imp.dt.dayofweek              # Monday=0 ... Sunday=6
    df['is_weekend'] = df['weekday'].isin([5, 6]).astype(int)
    # Drop raw timestamp
    df.drop(columns=['id4'], inplace=True)

## 6. Encoding High-Cardinality Categories

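Encode the high-cardinality categorical features `id2` and `id3` with smoothed target encoding. The encoder is fit on the training sample only and then applied to the test sample.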

In [None]:
from category_encoders import TargetEncoder

tgt_enc = TargetEncoder(cols=['id2','id3'], smoothing=0.2)
train_sample[['id2_enc','id3_enc']] = tgt_enc.fit_transform(train_sample[['id2','id3']], train_sample['y'])
test_sample[['id2_enc','id3_enc']]  = tgt_enc.transform(test_sample[['id2','id3']])

# Drop original IDs
train_sample.drop(columns=['id2','id3'], inplace=True)
test_sample.drop(columns=['id2','id3'], inplace=True)
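
For intuition, target encoding replaces each category with a smoothed mean of the target, pulling rare categories toward the global click rate. The sketch below uses simple additive (m-estimate) smoothing, which is not necessarily the formula `category_encoders.TargetEncoder` applies; the function names and the parameter `m` are hypothetical.

```python
import pandas as pd

def fit_target_encoding(cats: pd.Series, y: pd.Series, m: float = 2.0):
    """Learn a category -> smoothed target mean mapping using additive smoothing."""
    prior = y.mean()                                  # global positive rate
    stats = y.groupby(cats).agg(["sum", "count"])     # per-category clicks and rows
    mapping = (stats["sum"] + m * prior) / (stats["count"] + m)
    return mapping, prior

def apply_target_encoding(cats: pd.Series, mapping: pd.Series, prior: float) -> pd.Series:
    """Encode categories; unseen categories fall back to the global prior."""
    return cats.map(mapping).fillna(prior)

train_cats = pd.Series(["a", "a", "a", "b"])
train_y = pd.Series([1, 1, 0, 0])
mapping, prior = fit_target_encoding(train_cats, train_y, m=2.0)

# "c" never appears in training, so it receives the prior
test_enc = apply_target_encoding(pd.Series(["a", "b", "c"]), mapping, prior)
print(mapping)
print(test_enc.tolist())
```

Larger values of `m` shrink every category harder toward the prior, which trades variance for bias on categories with few rows.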

## 7. Feature Selection

Select features with an L1-regularized (lasso-style) logistic regression: features whose coefficients are shrunk to zero are discarded.

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Exclude the target and the id5 timestamp, which is kept only for the time-based split
X_full = train_sample.drop(columns=['y', 'id5'])
y_full = train_sample['y']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_full)

lasso = LogisticRegressionCV(penalty='l1', solver='saga', cv=3, scoring='roc_auc', max_iter=1000)
lasso.fit(X_scaled, y_full)

coef_mask = np.abs(lasso.coef_).ravel() > 1e-6
selected_features = X_full.columns[coef_mask].tolist()
print(f"Selected {len(selected_features)} features out of {X_full.shape[1]}")

train_sel = train_sample[selected_features + ['y']]
test_sel  = test_sample[selected_features]
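
The effect of the L1 penalty is easiest to see on synthetic data where only some features carry signal. The following sketch assumes a fixed regularization strength (`C=0.05`) and the `liblinear` solver instead of the cross-validated `saga` setup above; the data and all names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Six features, but only x0 and x1 influence the label
X_toy = pd.DataFrame(rng.normal(size=(400, 6)), columns=[f"x{i}" for i in range(6)])
logits = 3.0 * X_toy["x0"] - 3.0 * X_toy["x1"]
y_toy = (logits + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Standardize so the penalty treats all features on the same scale
X_std = StandardScaler().fit_transform(X_toy)

# A small C means strong regularization, which drives weak coefficients to zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l1_model.fit(X_std, y_toy)

mask = np.abs(l1_model.coef_).ravel() > 1e-6
selected_toy = X_toy.columns[mask].tolist()
print(selected_toy)
```

The two informative features should survive the penalty, while most or all of the noise features are expected to be dropped.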

## 8. Train/Validation Split

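Split the data chronologically: sort by the `id5` timestamp, then use the earliest 80% of rows for training and the most recent 20% for validation. Because the target encoder and feature selector above were fit on the full training sample before this split, the validation score is likely to be somewhat optimistic.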

In [None]:
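# Time-based split using id5 timestamp.
# assign() and sort_values() return new frames, which avoids
# SettingWithCopyWarning from modifying the train_sel slice in place.
train_sel = train_sel.assign(imp_time=pd.to_datetime(train_sample['id5'])).sort_values('imp_time')
cutoff = int(len(train_sel)*0.8)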

train_final = train_sel.iloc[:cutoff].drop(columns=['imp_time'])
valid_final = train_sel.iloc[cutoff:].drop(columns=['imp_time'])

X_train, y_train = train_final.drop(columns=['y']), train_final['y']
X_val,   y_val   = valid_final.drop(columns=['y']), valid_final['y']

print(f"Train: {X_train.shape}, Val: {X_val.shape}")
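
The split logic can be wrapped in a small reusable function. This is a sketch on toy data; `time_based_split` and its `train_frac` parameter are illustrative names rather than part of the original notebook.

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str, train_frac: float = 0.8):
    """Sort by time and return (earliest train_frac of rows, remaining most recent rows)."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cutoff = int(len(ordered) * train_frac)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

toy = pd.DataFrame({
    "t": pd.to_datetime(["2023-01-05", "2023-01-01", "2023-01-03", "2023-01-02", "2023-01-04"]),
    "y": [1, 0, 0, 1, 0],
})
train_part, valid_part = time_based_split(toy, "t")

# Every training row precedes every validation row in time
print(train_part["t"].max(), valid_part["t"].min())
```

Splitting by time rather than at random keeps future impressions out of the training set, which is closer to how the model would be used on new traffic.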

## 9. Saving Preprocessed Data to CSV

Save the preprocessed train, validation, and test sets to CSV.

In [None]:
train_final.to_csv('output/preprocessed_train.csv', index=False)
valid_final.to_csv('output/preprocessed_valid.csv', index=False)
test_sel.to_csv('output/preprocessed_test.csv', index=False)
print("Saved preprocessed CSV files.")

## 10. PCA Dimensionality Reduction

Apply PCA on scaled training features, retaining 95% variance, then transform validation and test sets.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale features first
all_features = selected_features
scaler_pca = StandardScaler()
X_train_scaled = scaler_pca.fit_transform(X_train[all_features])
X_val_scaled   = scaler_pca.transform(X_val[all_features])
X_test_scaled  = scaler_pca.transform(test_sel[all_features])

# Fit PCA to 95% variance
pca = PCA(n_components=0.95, svd_solver='full')
X_train_pca = pca.fit_transform(X_train_scaled)
X_val_pca   = pca.transform(X_val_scaled)
X_test_pca  = pca.transform(X_test_scaled)

print(f"PCA components: {pca.n_components_}")
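
Passing a float to `n_components` makes PCA keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below checks this on synthetic data with two underlying factors; the data are invented for illustration, and scaling is skipped here for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Five observed features driven by two latent factors plus a little noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

pca_toy = PCA(n_components=0.95, svd_solver="full").fit(X_toy)

# Cumulative share of variance explained by the retained components
cum_var = np.cumsum(pca_toy.explained_variance_ratio_)
print(pca_toy.n_components_, cum_var)
```

Inspecting `np.cumsum(pca.explained_variance_ratio_)` on the real data in the same way shows how much variance each additional component contributes.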

## 11. Naive Bayes Classification

Train a Gaussian Naive Bayes classifier on the PCA-transformed data and evaluate it on the validation set.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import average_precision_score

# Fit
nb = GaussianNB()
nb.fit(X_train_pca, y_train)

# Predict click probabilities and evaluate average precision on the validation set.
# Average precision over all rows is only a rough proxy for MAP@7.
y_val_pred = nb.predict_proba(X_val_pca)[:,1]
val_ap = average_precision_score(y_val, y_val_pred)
print(f"Validation Average Precision (Proxy for MAP@7): {val_ap:.4f}")

# Save predictions for test set
y_test_pred = nb.predict_proba(X_test_pca)[:,1]
pd.DataFrame({'y_pred': y_test_pred}).to_csv('output/nb_test_predictions.csv', index=False)
print("Saved Naive Bayes test predictions.")
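
Since average precision is reported only as a proxy for MAP@7, a direct MAP@k calculation may be a useful cross-check. The sketch below assumes offers are ranked per customer, that a grouping column (here the hypothetical `customer`) is available, and that AP@k is normalized by `min(k, number of positives)`; definitions of MAP@k vary, so this is one common variant rather than the official metric.

```python
import pandas as pd

def average_precision_at_k(labels_ranked, k):
    """AP@k for one ranked list of 0/1 labels (highest-scored item first)."""
    hits, score = 0, 0.0
    for rank, label in enumerate(labels_ranked[:k], start=1):
        if label == 1:
            hits += 1
            score += hits / rank      # precision at this rank
    n_pos = sum(labels_ranked)
    # Assumption: groups without any positive contribute 0.0
    return score / min(k, n_pos) if n_pos else 0.0

def map_at_k(df, group_col, score_col, label_col, k=7):
    """Mean of AP@k across groups, ranking rows within each group by score."""
    ap_values = []
    for _, grp in df.groupby(group_col):
        ranked = grp.sort_values(score_col, ascending=False)[label_col].tolist()
        ap_values.append(average_precision_at_k(ranked, k))
    return sum(ap_values) / len(ap_values)

toy = pd.DataFrame({
    "customer": ["c1", "c1", "c1", "c2", "c2"],
    "score":    [0.9, 0.2, 0.6, 0.3, 0.8],
    "y":        [1, 0, 0, 1, 0],
})
toy_map = map_at_k(toy, "customer", "score", "y", k=7)
print(toy_map)
```

In the toy data, customer `c1` has its only click ranked first (AP = 1.0) and `c2` has its click ranked second (AP = 0.5), so the mean is 0.75.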