In [2]:
import pandas as pd
import numpy as np
import pickle

# Load cleaned data from previous notebook using pickle
with open('step2.pkl', 'rb') as f:
    df = pickle.load(f)
    
print("Cleaned data loaded from step2.pkl:", df.shape)

Cleaned data loaded from step2.pkl: (7953, 9)


## Categorical Encoding for Machine Learning
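Most machine learning algorithms require numeric input, so categorical features must be encoded before modeling. Two common techniques are:
- **One-Hot Encoding:** Converts each category value into a new binary column. Used in this notebook for 'brand', 'category_code', and 'event_type'.
- **Label Encoding:** Assigns each unique category value an integer label. Because the integers imply an order, it suits ordinal features or tree-based models better than distance-based clustering. It is described here for comparison and is not applied in this notebook.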

These encodings transform categorical variables into a format suitable for clustering and other ML models.
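
To make the ordering issue with label encoding concrete, here is a minimal sketch on a toy series. The values are hypothetical and not taken from the dataset, and `pd.factorize` is used as one simple way to produce integer labels.

```python
import pandas as pd

# Hypothetical toy data, not from step2.pkl
colors = pd.Series(['red', 'green', 'blue', 'green'])

# sort=True assigns codes in sorted order of the unique values
codes, uniques = pd.factorize(colors, sort=True)

print(list(uniques))  # ['blue', 'green', 'red']
print(list(codes))    # [2, 1, 0, 1]
```

A distance-based model would treat 'red' (2) as twice as far from 'blue' (0) as 'green' (1) is, even though the colors have no such relationship.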

### Example: One-Hot Encoding
- **Input:** ['red', 'green', 'blue']
- **Output:**
    | red | green | blue |
    |-----|-------|------|
    |  1  |   0   |  0   |
    |  0  |   1   |  0   |
    |  0  |   0   |  1   |
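
One-hot encoding suits distance-based methods such as KMeans because it does not imply any order or priority among categories. The trade-off is width: each distinct value adds a column, so a high-cardinality feature such as 'brand' can produce many sparse columns.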
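
The same example can be reproduced with scikit-learn's `OneHotEncoder`. This is a small sketch on the toy input above. Note that the encoder sorts categories, so its columns come out as blue, green, red rather than in the order shown in the table.

```python
from sklearn.preprocessing import OneHotEncoder

# One feature column with three rows (toy input from the example above)
colors = [['red'], ['green'], ['blue']]

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(colors).toarray()  # default output is sparse

print(list(encoder.categories_[0]))  # columns are sorted: blue, green, red
print(encoded)

# With handle_unknown='ignore', a category not seen during fit
# is encoded as an all-zero row instead of raising an error
unseen = encoder.transform([['purple']]).toarray()
print(unseen)
```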

## Feature Scaling for Machine Learning
Feature scaling is an essential preprocessing step for numeric columns (such as 'price') before clustering. We use StandardScaler, which subtracts each column's mean and divides by its standard deviation so that the result has mean 0 and variance 1. Distance-based algorithms like KMeans are sensitive to scale: without this step, a raw price column with a large range would dominate the 0/1 one-hot columns in every distance calculation.

In the pipeline below, the ColumnTransformer applies StandardScaler to the numeric columns before the data reaches KMeans.
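
As a quick check of what standardization does, the sketch below compares `StandardScaler` with the formula applied by hand. The prices are made-up values for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical prices, one column
prices = np.array([[10.0], [20.0], [30.0], [100.0]])

scaled = StandardScaler().fit_transform(prices)

# Same result by hand: subtract the mean, divide by the
# population standard deviation (NumPy's default, ddof=0)
manual = (prices - prices.mean(axis=0)) / prices.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```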

In [5]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

In [6]:
# Define columns
categorical_cols = ['brand', 'category_code', 'event_type']
numeric_cols = ['price']

In [7]:
# Preprocessing: OneHot for categorical, Scale for numeric
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('num', StandardScaler(), numeric_cols)
])
X = preprocessor.fit_transform(df[categorical_cols + numeric_cols])

In [8]:
# Full pipeline: Preprocessing + Clustering
pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('cluster', KMeans(n_clusters=4, random_state=42))
])

# Fit the pipeline
pipeline.fit(df[categorical_cols + numeric_cols])

# Predict clusters
df['cluster'] = pipeline.predict(df[categorical_cols + numeric_cols])
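
Because preprocessing and clustering live in one pipeline, new rows can be passed to `predict` in their raw form. The sketch below rebuilds the same structure on a tiny hypothetical DataFrame (the brand and category values are invented, and `n_clusters=2` and `n_init=10` are choices made only for this toy example) and shows that a row with an unseen brand is still assigned a cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ['brand', 'category_code', 'event_type']
numeric_cols = ['price']

# Hypothetical rows, not taken from step2.pkl
toy = pd.DataFrame({
    'brand': ['a', 'a', 'a', 'b', 'b', 'b'],
    'category_code': ['phones', 'phones', 'phones', 'tv', 'tv', 'tv'],
    'event_type': ['view', 'view', 'cart', 'purchase', 'purchase', 'view'],
    'price': [100.0, 110.0, 105.0, 900.0, 950.0, 920.0],
})

toy_pipeline = Pipeline(steps=[
    ('preprocess', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', StandardScaler(), numeric_cols),
    ])),
    ('cluster', KMeans(n_clusters=2, n_init=10, random_state=42)),
])
toy_labels = toy_pipeline.fit_predict(toy)

# Brand 'c' was never seen during fit: its brand columns are all zero,
# and the row is assigned to the nearest centroid on the remaining features
new_row = pd.DataFrame({
    'brand': ['c'], 'category_code': ['tv'],
    'event_type': ['view'], 'price': [910.0],
})
new_label = toy_pipeline.predict(new_row)
print(toy_labels, new_label)
```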

## Create Customer-Level Features

In [9]:
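# Aggregate event-level rows to one row per customer.
# Note: each aggregate covers every row for a user regardless of event_type, so
# 'total_spend' and 'purchase_count' describe purchases only if df holds purchase events only.
customer_features = df.groupby('user_id').agg({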
    'price': ['sum', 'mean', 'count'],
    'event_type': 'nunique',
    'category_code': 'nunique',
    'brand': 'nunique'
})
customer_features.columns = ['total_spend', 'avg_spend', 'purchase_count', 'event_type_count', 'category_count', 'brand_count']
customer_features = customer_features.reset_index()
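
The aggregation above counts every row per user. Whether `df` contains only purchase events is not stated here, so the sketch below uses hypothetical toy events (including an assumed `'purchase'` value for `event_type`) to show how the numbers change if non-purchase rows are filtered out first. It uses pandas named aggregation, which sets the output column names directly.

```python
import pandas as pd

# Hypothetical events: user 1 views, carts, and buys one item; user 2 only views
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'event_type': ['view', 'cart', 'purchase', 'view'],
    'price': [10.0, 10.0, 10.0, 50.0],
})

# Aggregating all rows: sums and counts include views and cart events
per_user = events.groupby('user_id').agg(
    total_spend=('price', 'sum'),
    event_rows=('price', 'count'),
).reset_index()

# Filtering first: only purchase rows contribute
purchases = events[events['event_type'] == 'purchase']
purchase_only = purchases.groupby('user_id').agg(
    total_spend=('price', 'sum'),
    purchase_count=('price', 'count'),
).reset_index()

print(per_user)
print(purchase_only)
```

In the toy data, user 1 shows a total of 30.0 across three rows when all events are aggregated, but 10.0 and one purchase after filtering. User 2 drops out of the filtered result entirely.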

In [10]:
# Save the encoded matrix, the fitted preprocessor, the customer features,
# and the event-level data (now including 'cluster') for the next notebook
data_bundle = {
    'X': X,
    'preprocessor': preprocessor,
    'customer_features': customer_features,
    'df': df
}

with open('step3.pkl', 'wb') as f:
    pickle.dump(data_bundle, f)

print("Preprocessed data saved as step3.pkl")

Preprocessed data saved as step3.pkl
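
The next notebook can read the bundle back by key. The sketch below demonstrates the round trip with a small stand-in dictionary written to a temporary file, so that it runs on its own. The stand-in contents and the file name `step3_demo.pkl` are hypothetical; the real bundle holds the keys 'X', 'preprocessor', 'customer_features', and 'df'.

```python
import os
import pickle
import tempfile

# Stand-in for the real bundle (hypothetical contents)
bundle = {'X': [[0.0, 1.0]], 'note': 'stand-in for the real objects'}

path = os.path.join(tempfile.mkdtemp(), 'step3_demo.pkl')
with open(path, 'wb') as f:
    pickle.dump(bundle, f)

# Loading mirrors saving; access each object by its key
with open(path, 'rb') as f:
    restored = pickle.load(f)

X_restored = restored['X']
print(X_restored)
```

Pickle files should only be loaded from trusted sources, and fitted scikit-learn objects are best unpickled with the same library version that saved them.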
