In [1]:
import pandas as pd
import numpy as np
import pickle

# Load cleaned data from previous notebook using pickle
with open('Pickle Files/step2.pkl', 'rb') as f:
    df = pickle.load(f)
    
print("Cleaned data loaded from Pickle Files/step2.pkl:", df.shape)

Cleaned data loaded from Pickle Files/step2.pkl: (8536, 9)


## Categorical Encoding for Machine Learning
To prepare categorical features for machine learning algorithms, we apply encoding techniques:
- **One-Hot Encoding:** Converts each category value into a new binary column. Used for 'brand', 'category_code', and 'event_type'.
- **Label Encoding:** Assigns each unique category value an integer label. Appropriate for ordinal categories, or for models (such as tree-based ones) that do not treat the integers as distances. The pipeline below uses only one-hot encoding.

These encodings transform categorical variables into a format suitable for clustering and other ML models.

### Example: One-Hot Encoding
- **Input:** ['red', 'green', 'blue']
- **Output:**
    | red | green | blue |
    |-----|-------|------|
    |  1  |   0   |  0   |
    |  0  |   1   |  0   |
    |  0  |   0   |  1   |

One-Hot Encoding is preferred here because it avoids implying any order or priority among categories, which a distance-based algorithm such as KMeans would otherwise treat as meaningful.

## Feature Scaling for Machine Learning
Feature scaling is an essential preprocessing step for numeric columns (such as 'price') before clustering. We use StandardScaler to standardize each numeric feature to mean 0 and variance 1. This puts features on a comparable scale for distance-based algorithms like KMeans, so that features with larger raw ranges do not dominate the distance computation.

In our pipeline, the ColumnTransformer applies StandardScaler to the numeric columns before clustering.
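
As a minimal sketch of what standardization does, the example below compares StandardScaler against the formula z = (x - mean) / std computed by hand. The price values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative prices only; one extreme value to show the effect of scale
prices = np.array([[10.0], [20.0], [30.0], [1000.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(prices)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for ndarray.std()
manual = (prices - prices.mean()) / prices.std()

print(scaled.ravel())
print("mean:", scaled.mean(), "std:", scaled.std())
```

Standardization rescales but does not remove outliers: the extreme price still sits far from the others after scaling.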

In [2]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

In [3]:
# Define columns
categorical_cols = ['brand', 'category_code', 'event_type']
numeric_cols = ['price']

In [4]:
# Preprocessing: OneHot for categorical, Scale for numeric
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('num', StandardScaler(), numeric_cols)
])
X = preprocessor.fit_transform(df[categorical_cols + numeric_cols])

In [5]:
# Full pipeline: Preprocessing + Clustering
pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('cluster', KMeans(n_clusters=4, random_state=42))
])

# Fit the pipeline
pipeline.fit(df[categorical_cols + numeric_cols])

# Predict clusters
df['cluster'] = pipeline.predict(df[categorical_cols + numeric_cols])
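
To make the fit and predict steps above concrete, here is a small self-contained sketch of the same pipeline shape on toy data. The `toy` frame, the two-cluster setting, and `n_init=10` are assumptions chosen for the illustration and are not the notebook's values.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): three cheap items and three expensive ones
toy = pd.DataFrame({
    "brand": ["a", "a", "b", "c", "c", "d"],
    "price": [10.0, 12.0, 11.0, 500.0, 510.0, 505.0],
})

toy_pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
        ("num", StandardScaler(), ["price"]),
    ])),
    ("cluster", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# fit_predict fits every step and returns the cluster label of each row
labels = toy_pipeline.fit_predict(toy)

# The fitted KMeans step is reachable by name for further inspection
centers = toy_pipeline.named_steps["cluster"].cluster_centers_
print(labels, centers.shape)
```

The cluster centers live in the transformed feature space (one-hot columns plus the scaled price), so their width equals the number of encoded columns, not the number of original columns.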

## Browsing Patterns Analysis
Analyze customer browsing behavior to understand user engagement patterns:
- **Browsing-to-Purchase Ratio:** How many views/carts lead to actual purchases
- **Session Activity:** Number of actions per session
- **Event Funnel:** View → Cart → Purchase conversion rates
- **Category/Brand Exploration:** Diversity of browsing vs purchasing behavior

These patterns help identify different customer segments like window shoppers, decisive buyers, and exploratory users.
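
The per-user event counts and the funnel rates described above can be sketched on toy data as follows. The `events` frame is invented for illustration; `pd.crosstab` is used here as one possible way to count event types per user and is not the approach taken in the next cell.

```python
import pandas as pd

# Toy event log (hypothetical): user 1 buys, users 2 and 3 only browse
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "event_type": ["view", "cart", "purchase", "view", "view", "view"],
})

stages = ["view", "cart", "purchase"]

# One row per user, one column per event type; missing stages become 0
per_user = pd.crosstab(events["user_id"], events["event_type"]).reindex(
    columns=stages, fill_value=0
)

# Overall funnel: share of events that move on to the next stage
totals = per_user.sum()
view_to_cart = totals["cart"] / totals["view"]
cart_to_purchase = totals["purchase"] / totals["cart"]

print(per_user)
print(view_to_cart, cart_to_purchase)
```

On real data a stage total can be zero, in which case the division needs a guard.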

In [6]:
# Calculate browsing patterns for each user
browsing_patterns = df.groupby('user_id').agg({
    'event_type': ['count', lambda x: (x == 'view').sum(), 
                   lambda x: (x == 'cart').sum(), 
                   lambda x: (x == 'purchase').sum()],
    'user_session': 'nunique',
    'category_code': 'nunique',
    'brand': 'nunique'
})

# Flatten column names
browsing_patterns.columns = ['total_events', 'views', 'carts', 'purchases', 
                            'unique_sessions', 'categories_browsed', 'brands_browsed']

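# Calculate browsing ratios and patterns
# The +1 in the denominators avoids division by zero for users with no purchases,
# so for non-buyers these ratios simply equal the raw view/cart counts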
browsing_patterns['view_to_purchase_ratio'] = browsing_patterns['views'] / (browsing_patterns['purchases'] + 1)
browsing_patterns['cart_to_purchase_ratio'] = browsing_patterns['carts'] / (browsing_patterns['purchases'] + 1)
browsing_patterns['events_per_session'] = browsing_patterns['total_events'] / browsing_patterns['unique_sessions']
browsing_patterns['purchase_conversion_rate'] = browsing_patterns['purchases'] / browsing_patterns['total_events']

browsing_patterns = browsing_patterns.reset_index()
print("Browsing patterns calculated:", browsing_patterns.shape)

Browsing patterns calculated: (2436, 12)


In [7]:
# Calculate session duration for each user
# Convert event_time to datetime if not already done
df['event_time'] = pd.to_datetime(df['event_time'])

# Calculate session duration metrics
session_duration = df.groupby(['user_id', 'user_session']).agg({
    'event_time': ['min', 'max', 'count']
})

# Flatten column names
session_duration.columns = ['session_start', 'session_end', 'events_in_session']

# Calculate duration in minutes for each session
session_duration['session_duration_minutes'] = (
    session_duration['session_end'] - session_duration['session_start']
).dt.total_seconds() / 60

# Aggregate session duration metrics per user
user_session_metrics = session_duration.groupby('user_id').agg({
    'session_duration_minutes': ['mean', 'sum', 'max', 'count'],
    'events_in_session': 'mean'
})

# Flatten column names
user_session_metrics.columns = ['avg_session_duration', 'total_session_time', 
                               'max_session_duration', 'session_count', 'avg_events_per_session']

user_session_metrics = user_session_metrics.reset_index()
print("Session duration metrics calculated:", user_session_metrics.shape)

Session duration metrics calculated: (2436, 6)


## Spending Behavior Analysis
Analyze customer spending patterns to understand purchase behavior:
- **Total and Average Spend:** Overall spending power and typical transaction size
- **Purchase Frequency:** How often customers make purchases
- **Spending Consistency:** Variance in spending amounts
- **Price Sensitivity:** Range of prices customers engage with

These metrics help identify high-value customers, frequent buyers, and price-sensitive segments.
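
A short sketch of the spending metrics on toy data, mainly to show one edge case: the sample standard deviation is undefined (NaN) for a user with a single purchase. The `purchases` frame is hypothetical, and the named-aggregation style is just one way to write the aggregation.

```python
import pandas as pd

# Toy purchases (hypothetical): user 1 buys three times, user 2 once
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "price": [10.0, 20.0, 30.0, 50.0],
})

stats = purchases.groupby("user_id")["price"].agg(
    total_spend="sum",
    avg_spend="mean",
    spend_std="std",       # sample std (ddof=1): NaN when there is one purchase
    purchase_count="count",
)

# Variation relative to average spend; the +1 guards against a zero average
stats["spend_consistency"] = stats["spend_std"] / (stats["avg_spend"] + 1)

print(stats)
```

Filling those NaN values with 0 treats single-purchase users as perfectly consistent spenders, which is a modeling choice worth keeping in mind when reading the clusters.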

In [8]:
# Calculate spending behavior patterns for each user
spending_behavior = df[df['event_type'] == 'purchase'].groupby('user_id').agg({
    'price': ['sum', 'mean', 'std', 'min', 'max', 'count']
})

# Flatten column names
spending_behavior.columns = ['total_spend', 'avg_spend', 'spend_std', 
                           'min_spend', 'max_spend', 'purchase_count']

# Calculate additional spending metrics
spending_behavior['spend_range'] = spending_behavior['max_spend'] - spending_behavior['min_spend']
spending_behavior['spend_consistency'] = spending_behavior['spend_std'] / (spending_behavior['avg_spend'] + 1)
spending_behavior['spending_per_day'] = spending_behavior['total_spend'] / 30  # Assuming 30-day period

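# Fill NaN values: spend_std (and spend_consistency) is NaN for users with a single purchase.
# Users with no purchases are not in this table at all; they are filled after the merge below.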
spending_behavior = spending_behavior.fillna(0)
spending_behavior = spending_behavior.reset_index()
print("Spending behavior calculated:", spending_behavior.shape)

Spending behavior calculated: (87, 10)


## Comprehensive Customer Features
Combine **Purchase History**, **Browsing Patterns**, **Spending Behavior**, and **Session Duration** into a single customer dataset, one row per user, ready for clustering analysis.
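
Because only a small share of users have purchases, the outer merge leaves missing spending values for everyone else. The sketch below, on hypothetical toy frames, uses the `indicator` option of `merge` to count those users before filling with 0.

```python
import pandas as pd

# Toy feature tables (hypothetical): three browsers, only one of them a buyer
browsing = pd.DataFrame({"user_id": [1, 2, 3], "views": [5, 2, 7]})
spending = pd.DataFrame({"user_id": [1], "total_spend": [99.0]})

# indicator=True adds a _merge column saying where each row came from
merged = browsing.merge(spending, on="user_id", how="outer", indicator=True)
non_buyers = int((merged["_merge"] == "left_only").sum())

# Drop the helper column, then treat missing spending as zero
features = merged.drop(columns="_merge").fillna(0)

print(non_buyers)
print(features)
```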

In [9]:
# Combine all customer features: Purchase History + Browsing Patterns + Spending Behavior + Session Duration
customer_features_comprehensive = browsing_patterns.merge(spending_behavior, on='user_id', how='outer')
customer_features_comprehensive = customer_features_comprehensive.merge(user_session_metrics, on='user_id', how='outer')

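# Add purchase history features from original data
# NOTE: these aggregates run over ALL events (views, carts, and purchases), not only purchases,
# so total_revenue sums the price of every event and the *_purchased columns
# repeat categories_browsed / brands_browsed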
purchase_history = df.groupby('user_id').agg({
    'price': ['sum', 'mean', 'count'],
    'event_type': 'nunique',
    'category_code': 'nunique',
    'brand': 'nunique'
})
purchase_history.columns = ['total_revenue', 'avg_transaction', 'total_transactions', 
                           'event_types_used', 'categories_purchased', 'brands_purchased']
purchase_history = purchase_history.reset_index()

# Final comprehensive customer features
customer_features_final = customer_features_comprehensive.merge(purchase_history, on='user_id', how='outer')
customer_features_final = customer_features_final.fillna(0)

print("✅ ALL ENGINEERED FEATURES COMPLETE:")
print("📊 Purchase History: Total spend, transaction patterns, frequency")
print("🔍 Browsing Patterns: View-to-purchase ratios, session activity, conversion rates")
print("💰 Spending Behavior: Spending consistency, price sensitivity, purchase patterns")
print("⏰ Session Duration: Average session time, total engagement time, session patterns")
print(f"\nFinal dataset shape: {customer_features_final.shape}")
print("\n✅ Feature Engineering Complete: Purchase frequency, browsing-to-purchase ratio, session duration, and average spend")
customer_features_final.head()

✅ ALL ENGINEERED FEATURES COMPLETE:
📊 Purchase History: Total spend, transaction patterns, frequency
🔍 Browsing Patterns: View-to-purchase ratios, session activity, conversion rates
💰 Spending Behavior: Spending consistency, price sensitivity, purchase patterns
⏰ Session Duration: Average session time, total engagement time, session patterns

Final dataset shape: (2436, 32)

✅ Feature Engineering Complete: Purchase frequency, browsing-to-purchase ratio, session duration, and average spend


|   | user_id | total_events | views | carts | purchases | unique_sessions | categories_browsed | brands_browsed | view_to_purchase_ratio | cart_to_purchase_ratio | ... | total_session_time | max_session_duration | session_count | avg_events_per_session | total_revenue | avg_transaction | total_transactions | event_types_used | categories_purchased | brands_purchased |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 474832046.0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 1 | 1.0 | 102.71 | 102.71 | 1 | 1 | 1 | 1 |
| 1 | 474967396.0 | 5 | 5 | 0 | 0 | 1 | 1 | 3 | 5.0 | 0.0 | ... | 2.533333 | 2.533333 | 1 | 5.0 | 1343.52 | 268.704 | 5 | 1 | 1 | 3 |
| 2 | 477121012.0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 1 | 1.0 | 179.39 | 179.39 | 1 | 1 | 1 | 1 |
| 3 | 479233261.0 | 7 | 7 | 0 | 0 | 1 | 2 | 2 | 7.0 | 0.0 | ... | 1.083333 | 1.083333 | 1 | 7.0 | 1789.71 | 255.672857 | 7 | 1 | 2 | 2 |
| 4 | 485580346.0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 1 | 1.0 | 322.91 | 322.91 | 1 | 1 | 1 | 1 |


In [10]:
# Save comprehensive preprocessed data for next notebook using pickle
import pickle

# Save all data including comprehensive customer features
data_bundle = {
    'X': X,
    'preprocessor': preprocessor,
    'customer_features_final': customer_features_final,
    'browsing_patterns': browsing_patterns,
    'spending_behavior': spending_behavior,
    'purchase_history': purchase_history,
    'df': df
}

with open('Pickle Files/step3.pkl', 'wb') as f:
    pickle.dump(data_bundle, f)

print("✅ Comprehensive preprocessed data saved as Pickle Files/step3.pkl")
print("📦 Includes: Purchase History + Browsing Patterns + Spending Behavior")
print("🎯 Ready for K-Means clustering analysis")

✅ Comprehensive preprocessed data saved as Pickle Files/step3.pkl
📦 Includes: Purchase History + Browsing Patterns + Spending Behavior
🎯 Ready for K-Means clustering analysis
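
For the next notebook, reading the bundle back is the mirror image of the save step. The sketch below round-trips a small stand-in bundle through a temporary file; the file name `step3_demo.pkl` and the toy frame are hypothetical, and the real bundle would be read from `Pickle Files/step3.pkl` instead.

```python
import os
import pickle
import tempfile

import pandas as pd

# Stand-in bundle (hypothetical); the real one holds the keys saved above
bundle = {
    "customer_features_final": pd.DataFrame(
        {"user_id": [1, 2], "total_spend": [99.0, 0.0]}
    ),
}

# Temporary demo path so the sketch does not touch the project's files
path = os.path.join(tempfile.mkdtemp(), "step3_demo.pkl")

with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

features = loaded["customer_features_final"]
print(features.shape)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn object such as the preprocessor is best loaded with the same library version that saved it.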
