This notebook prepares the CTR prediction dataset for machine learning model development by applying standard preprocessing steps. 

The goal is to transform the cleaned dataset into a well-structured, numeric format suitable for model training.

Create a cleaned development dataset by:
- Encoding categorical features
- Standardizing numerical features
- Splitting data into training and testing sets

In [25]:
import sys
!{sys.executable} -m pip install category_encoders


Collecting category_encoders
  Downloading category_encoders-2.8.1-py3-none-any.whl.metadata (7.9 kB)
Collecting scikit-learn>=1.6.0 (from category_encoders)
  Downloading scikit_learn-1.7.0-cp312-cp312-win_amd64.whl.metadata (14 kB)
Downloading category_encoders-2.8.1-py3-none-any.whl (85 kB)
Downloading scikit_learn-1.7.0-cp312-cp312-win_amd64.whl (10.7 MB)
Installing collected packages: scikit-learn, category_encoders
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.5.1
    Uninstalling scikit-learn-1.5.1:
      Successfully uninstalled scikit-learn-1.5.1
Successfully installed category_encoders-2.8.1 scikit-learn-1.7.0


In [1]:
# CTR-Pre-processingAndTrainingDataDevelopment
# Load the necessary packages

# %reset
%reset_selective -f regex
import os
import datetime
import pprint

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

%matplotlib inline


In [3]:

# Load cleaned data (raw string so the backslashes are not treated as escapes)
file_path = r"C:\Users\vidus\Projects\Springboard\CapstoneTwo_CTRprediction\data\processed\cleaned_ctr_prediction_data.csv"
ctr_df = pd.read_csv(file_path)

# Identify categorical and numeric features
categorical_low_card = ['banner_pos', 'device_type', 'device_conn_type']
categorical_high_card = ['site_id', 'app_id', 'device_model']
numeric_features = ['C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
target_variable = 'click'

# Preview data to confirm structure
ctr_df[categorical_low_card + categorical_high_card + numeric_features + [target_variable]].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75058 entries, 0 to 75057
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   banner_pos        75058 non-null  int64
 1   device_type       75058 non-null  int64
 2   device_conn_type  75058 non-null  int64
 3   site_id           75058 non-null  int64
 4   app_id            75058 non-null  int64
 5   device_model      75058 non-null  int64
 6   C14               75058 non-null  int64
 7   C15               75058 non-null  int64
 8   C16               75058 non-null  int64
 9   C17               75058 non-null  int64
 10  C18               75058 non-null  int64
 11  C19               75058 non-null  int64
 12  C20               75058 non-null  int64
 13  C21               75058 non-null  int64
 14  click             75058 non-null  int64
dtypes: int64(15)
memory usage: 8.6 MB


In [5]:
# Step 1: Create dummy variables for low-cardinality categorical features
categorical_low_card = ['banner_pos', 'device_type', 'device_conn_type']
ctr_df_dummies = pd.get_dummies(ctr_df, columns=categorical_low_card, drop_first=True)

Step 1 above creates dummy variables for the low-cardinality categorical features.
The low-cardinality categorical variables (`banner_pos`, `device_type`, `device_conn_type`) were one-hot encoded using `pd.get_dummies()` to convert them into binary features.
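
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame (the column name mirrors the notebook, but the values are invented for illustration). A variable with k levels yields k-1 indicator columns, because the first level is dropped and serves as the baseline.

```python
import pandas as pd

# Toy frame: the values are invented, only the column names mirror the notebook.
toy = pd.DataFrame({"banner_pos": [0, 1, 0, 2], "click": [0, 1, 0, 1]})

# drop_first=True drops the indicator for the first level (banner_pos_0),
# so three levels become two indicator columns.
toy_dummies = pd.get_dummies(toy, columns=["banner_pos"], drop_first=True)

print(toy_dummies.columns.tolist())
```

A row with both indicators equal to 0 therefore corresponds to the dropped level, `banner_pos == 0`.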


In [18]:
# Step 2: Standardize numeric features
scaler = StandardScaler()
ctr_df_dummies[numeric_features] = scaler.fit_transform(ctr_df_dummies[numeric_features])


Step 2 above standardizes the numeric features (`C14` to `C21`). The next step, Step 3, target encodes the high-cardinality categorical features.

High-cardinality features like:

- `site_id`
- `app_id`
- `device_model`

can introduce sparsity and noise when one-hot encoded.

Instead, I apply Target Encoding, which replaces each category with the mean of the target variable (`click`) within that category.

This produces one numeric column per feature instead of one column per category, while keeping each category's relationship with the target.
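
As a rough illustration of the idea, the per-category mean can be computed by hand with pandas. This is a simplified sketch on invented data, not what `category_encoders.TargetEncoder` does internally: the library additionally blends each category mean with the overall mean (controlled by its `min_samples_leaf` and `smoothing` parameters), so its output will generally differ from these raw means.

```python
import pandas as pd

# Toy data: the category labels and clicks are invented for illustration.
toy = pd.DataFrame({
    "site_id": ["a", "a", "b", "b", "b", "c"],
    "click":   [1, 0, 0, 0, 1, 1],
})

# Mean of the target within each category.
category_means = toy.groupby("site_id")["click"].mean()

# Replace each category label with its mean click rate.
toy["site_id_encoded"] = toy["site_id"].map(category_means)
print(toy)
```

Category `c` appears only once, so its raw mean is driven by a single row. That instability for rare categories is the reason smoothing toward the overall mean is commonly applied.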

In [42]:
# Step 3: Target Encode High-Cardinality Features
# Target encoding helps with features like site_id, app_id, device_model.
# Using category_encoders.TargetEncoder, we replace each category with the mean of the target (click) for that category.

import category_encoders as ce

# Instantiate target encoder
target_encoder = ce.TargetEncoder(cols=['site_id', 'app_id', 'device_model'])

# Fit and transform
ctr_df_encoded = target_encoder.fit_transform(ctr_df_dummies, ctr_df['click'])



Step 3 above: Target Encode High-Cardinality Features

Some categorical variables in this dataset, such as `site_id`, `app_id`, and `device_model`, have a large number of unique categories (i.e., high cardinality). Applying one-hot encoding to these would create a very sparse dataset and potentially introduce noise or overfitting in models.

To address this, I use **Target Encoding** via `category_encoders.TargetEncoder`, which replaces each category with the **mean of the target variable (`click`)** for that category. This approach reduces dimensionality while still preserving useful predictive information.

This step is crucial for improving model performance when dealing with high-cardinality categorical features.
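
Because target encoding uses the label itself, a common precaution is to learn the category statistics from training rows only and then apply them to held-out rows. The sketch below shows that pattern with plain pandas on two invented frames (`train` and `test` are hypothetical names, not variables from this notebook); categories unseen in training fall back to the overall training mean.

```python
import pandas as pd

# Hypothetical toy frames; "z" never appears in the training rows.
train = pd.DataFrame({"app_id": ["x", "x", "y", "y"], "click": [1, 0, 0, 0]})
test = pd.DataFrame({"app_id": ["x", "z"]})

# Statistics are learned from the training rows only.
global_mean = train["click"].mean()
train_means = train.groupby("app_id")["click"].mean()

# Held-out rows are mapped with the training statistics;
# unseen categories receive the overall training mean.
test["app_id_encoded"] = test["app_id"].map(train_means).fillna(global_mean)
print(test)
```

With an encoder object, the same pattern would be to call `fit` on the training split and `transform` on the test split, rather than `fit_transform` on all rows.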


In [26]:
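# Step 4: Standard Scale the Numeric Features
# Note: Step 2 already standardized these columns, so refitting here leaves them numerically unchanged.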
from sklearn.preprocessing import StandardScaler

numeric_features = ['C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
scaler = StandardScaler()
ctr_df_encoded[numeric_features] = scaler.fit_transform(ctr_df_encoded[numeric_features])


Step 4 above: Standardize the Numeric Features

To keep features with larger magnitudes from dominating and to improve optimization convergence for scale-sensitive algorithms like Logistic Regression, we apply **standard scaling**. Tree-based methods such as Gradient Boosting are largely insensitive to feature scale, so the step is harmless but not required for them.

We use `StandardScaler` from `sklearn.preprocessing` to transform each numeric feature (`C14` to `C21`) so that it has a **mean of 0 and standard deviation of 1**.

Only numeric features are standardized; categorical features (including dummy or encoded ones) are not scaled.
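
The transformation itself is z = (x - mean) / std, where the standard deviation is the population version (`ddof=0`), which is what `StandardScaler` uses. A minimal NumPy sketch on invented values:

```python
import numpy as np

# Invented values standing in for one numeric column such as C14.
c14 = np.array([10.0, 20.0, 30.0, 40.0])

# np.std defaults to ddof=0 (population standard deviation).
z = (c14 - c14.mean()) / c14.std()

print(z.mean(), z.std())
```

Note that `pandas.Series.std()` defaults to `ddof=1`, so standardizing with pandas defaults gives slightly different values than `StandardScaler`.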


In [13]:
# Step 5: Split into Training and Testing Sets
from sklearn.model_selection import train_test_split

# Define X and y
X = ctr_df_encoded.drop(columns=['click'])
y = ctr_df_encoded['click']

# Train-Test Split with stratification to maintain class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Optional: Check shape
print("Train set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


Train set shape: (60046, 31)
Test set shape: (15012, 31)


Step 5: Split into Training and Testing Sets

To evaluate model performance fairly, we split our dataset into training and testing subsets:

- Training Set: Used to train the machine learning model.
- Testing Set: Used to evaluate the model’s performance on unseen data.

We use the `train_test_split()` function from `sklearn.model_selection` with the following configurations:

- `test_size=0.2`: 20% of the data is reserved for testing.
- `random_state=42`: Ensures reproducibility of the split.
- `stratify=y`: Maintains the original class distribution (important for imbalanced data like CTR).

This ensures the model is trained on a representative subset of the data and evaluated on a holdout set with the same class proportions. Note that the target encoder and the scaler above were fit on the full dataset before the split, so the test rows are not completely unseen; fitting both on the training set only would remove this source of leakage.
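
The effect of `stratify` can be checked on a small synthetic example. The arrays below are invented (20% positives, chosen only so the proportions divide evenly); the call itself uses the same arguments as the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 20% positives, plus a dummy feature column.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the original positive rate.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance.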

In [33]:
# Use a raw string (prefix with 'r') to avoid issues with backslashes
# Save datasets in the processed data folder
processed_dir = r"C:\Users\vidus\Projects\Springboard\CapstoneTwo_CTRprediction\data\processed"
X_train.to_csv(os.path.join(processed_dir, "X_train.csv"), index=False)
X_test.to_csv(os.path.join(processed_dir, "X_test.csv"), index=False)
y_train.to_csv(os.path.join(processed_dir, "y_train.csv"), index=False)
y_test.to_csv(os.path.join(processed_dir, "y_test.csv"), index=False)


Save Processed Datasets
We save the train and test sets as `.csv` files for reuse in the modeling phase:

- `X_train.csv`, `X_test.csv`
- `y_train.csv`, `y_test.csv`
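
One detail worth keeping in mind when reloading: a Series written with `to_csv(index=False)` comes back from `pd.read_csv` as a one-column DataFrame, so the target column has to be selected again. A small round-trip sketch using a temporary directory and an invented Series (the real files live in the processed data folder):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented target values; the Series name becomes the CSV header.
y_toy = pd.Series([0, 1, 0], name="click")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train.csv"
    y_toy.to_csv(path, index=False)

    # read_csv returns a DataFrame; select the column to get a Series back.
    y_loaded = pd.read_csv(path)["click"]

print(y_loaded.tolist())
```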

Next Step: 
Use the preprocessed datasets to train and evaluate machine learning models during the modeling phase of the capstone.


### Feature Type Justification

We identified categorical and continuous features through data types and domain understanding.  
- **Categorical Features**: `'banner_pos'`, `'device_type'`, and `'device_conn_type'` have a small number of unique values and are encoded using one-hot encoding.  
- **High Cardinality Categorical**: `'site_id'`, `'app_id'`, and `'device_model'` have a large number of unique categories, so we applied target encoding to avoid high dimensionality.  
- **Continuous Features**: `'C14'` to `'C21'` are treated as numeric based on their value distribution and usage in previous CTR prediction literature.

We applied `StandardScaler` only to these numeric features to standardize their magnitude, which is important for many machine learning algorithms.
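
One way to back the low/high cardinality grouping with numbers is to count unique values per column. The sketch below uses an invented frame and a hypothetical cutoff (`max_levels = 3`); the appropriate threshold for the real data is a judgment call and is not taken from this notebook.

```python
import pandas as pd

# Invented frame: device_type has 2 levels, site_id has 6.
toy = pd.DataFrame({
    "device_type": [0, 1, 1, 0, 1, 0],
    "site_id": [101, 102, 103, 104, 105, 106],
})

max_levels = 3  # hypothetical cutoff between low and high cardinality

# Number of distinct values in each column.
cardinality = toy.nunique()

low_card = cardinality[cardinality <= max_levels].index.tolist()
high_card = cardinality[cardinality > max_levels].index.tolist()

print("one-hot candidates:", low_card)
print("target-encoding candidates:", high_card)
```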


Conclusion:

In this notebook, I successfully completed the Pre-processing and Training Data Development phase for the Click-Through Rate (CTR) prediction project. Key steps included:

- Dummy Encoding: Applied one-hot encoding to low-cardinality categorical features (`banner_pos`, `device_type`, `device_conn_type`) to prepare them for modeling.
- Target Encoding: Handled high-cardinality categorical features (`site_id`, `app_id`, `device_model`) using target encoding, which reduces dimensionality while preserving meaningful patterns with respect to the target variable `click`.
- Standardization: Scaled continuous numeric features (`C14` to `C21`) using `StandardScaler` so that they share a common scale.
- Train-Test Split: Split the dataset into training and testing subsets using an 80/20 ratio while maintaining the target distribution (`stratify=y`) to ensure fair model evaluation.
- Data Export: Saved the final training and testing datasets as CSV files for use in the next modeling step.

With these preprocessing steps complete, I now have a clean and standardized dataset ready for building and evaluating predictive machine learning models.