# 03 – Exploratory Data Analysis & Data Fusion (PySpark)

In this notebook we explore the cleaned data and perform data fusion.  We demonstrate **early fusion**, where all relevant tables are joined into one wide table, and **hybrid fusion**, where we engineer features from auxiliary tables and merge them back into the main data set.

## Load cleaned data

We load the cleaned datasets that the previous step saved as Parquet files.  Parquet is a columnar format that stores the schema alongside the data, so column types carry over without having to be inferred again.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count
import os

spark = SparkSession.builder.appName('CTR_EDA_Fusion').getOrCreate()
processed_dir = os.path.join('..', 'data', 'processed')

user_df = spark.read.parquet(os.path.join(processed_dir, 'user_profile_clean.parquet'))
ad_df = spark.read.parquet(os.path.join(processed_dir, 'ad_feature_clean.parquet'))
click_df = spark.read.parquet(os.path.join(processed_dir, 'raw_sample_clean.parquet'))
behaviour_df = spark.read.parquet(os.path.join(processed_dir, 'behavior_log_clean.parquet'))

print('Data loaded:')
print('user_df:', user_df.count())
print('ad_df:', ad_df.count())
print('click_df:', click_df.count())
print('behaviour_df:', behaviour_df.count())


## Early fusion

We join the user profile and ad feature tables onto the click log.  A left join keeps every click record and adds demographic and ad attributes where a match exists; clicks without a match get nulls in the added columns.  The join keys are:

- `user` from click log matches `userid` in the user profile.
- `adgroup_id` from click log matches `adgroup_id` in the ad feature table.
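
Whether the left joins really keep one row per click depends on the keys being unique in the profile and ad tables.  The sketch below illustrates this on made-up rows, using pandas rather than Spark so that it runs without a cluster.  The toy values and the names `clicks`, `users`, `fused` and `clean` are hypothetical; only the column names `user`, `userid`, `adgroup_id` and `age_level` follow this notebook.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy rows: user 2 clicks twice, user 3 has no profile.
clicks = pd.DataFrame({'user': [1, 2, 2, 3], 'adgroup_id': [10, 10, 11, 12]})
# userid 2 appears twice in the profile table (a deliberate defect).
users = pd.DataFrame({'userid': [1, 2, 2], 'age_level': [3, 4, 5]})

# A plain left join silently duplicates the clicks of user 2.
fused = clicks.merge(users, left_on='user', right_on='userid', how='left')
rows_before, rows_after = len(clicks), len(fused)
print(rows_before, rows_after)

# validate='many_to_one' makes pandas check that the right-hand keys are unique.
try:
    clicks.merge(users, left_on='user', right_on='userid',
                 how='left', validate='many_to_one')
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

# After de-duplicating the profile table the join keeps one row per click.
unique_users = users.drop_duplicates(subset='userid')
clean = clicks.merge(unique_users, left_on='user', right_on='userid',
                     how='left', validate='many_to_one')
print(len(clean), int(clean['age_level'].isna().sum()))
```

On the Spark side, a comparable sanity check is to compare `full_df.count()` with `click_df.count()` after the joins.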

In [None]:
# Join click log with user profile on user ID
click_user = click_df.join(user_df, click_df['user'] == user_df['userid'], how='left')

# Join with ad features on adgroup ID
full_df = click_user.join(ad_df, 'adgroup_id', how='left')

print('Early fused dataframe rows:', full_df.count())


## Hybrid fusion features

We engineer behavioural features from the behaviour log.  We compute the count of each behaviour type (page view, cart addition, favourite and purchase) per user and merge these aggregated features into the fused table.
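
To make the shape of the aggregated features concrete, here is a small pandas analogue of the pivot and merge on made-up rows.  It is only a sketch: the rows and the names `log`, `clicks`, `counts` and `fused` are hypothetical, and the `btag` values are assumed from the column names used in this notebook.

```python
import pandas as pd

# Hypothetical behaviour log: one row per (user, behaviour) event.
log = pd.DataFrame({
    'user': [1, 1, 1, 2, 2],
    'btag': ['ipv', 'ipv', 'cart', 'ipv', 'buy'],
})
# Hypothetical click log; user 3 has no behaviour-log rows at all.
clicks = pd.DataFrame({'user': [1, 2, 3], 'clk': [0, 1, 0]})

# One row per user, one column per behaviour type, cells hold event counts.
counts = pd.crosstab(log['user'], log['btag']).reset_index()
behaviour_cols = [c for c in counts.columns if c != 'user']

# The left join keeps user 3, whose counts are missing until filled with zero.
fused = clicks.merge(counts, on='user', how='left')
fused[behaviour_cols] = fused[behaviour_cols].fillna(0).astype(int)
print(fused)
```

The toy log contains no `fav` rows, so no `fav` column is produced: a pivot only creates columns for categories that actually occur in the data.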

In [None]:
# Pivot behaviour log to get counts per user for each btag category
behaviour_counts = behaviour_df.groupBy('user').pivot('btag').agg(count('*')).fillna(0)

# Merge counts into the fused table
full_df = full_df.join(behaviour_counts, 'user', how='left')

# Users with no behaviour-log rows get nulls from the left join; set those counts to zero
count_cols = [c for c in ['buy', 'cart', 'fav', 'ipv'] if c in full_df.columns]
full_df = full_df.na.fill(0, subset=count_cols)

print('Hybrid fused dataframe rows:', full_df.count())


In [None]:
# Save hybrid fused DataFrame for later use
import os
processed_dir = os.path.join('..', 'data', 'processed')
full_df.write.mode('overwrite').parquet(os.path.join(processed_dir, 'hybrid_fusion.parquet'))
print('Hybrid fused data saved to hybrid_fusion.parquet')


## Sampling for visual exploration

For plotting and correlation analysis, we take a random sample of the fused DataFrame and convert it to a Pandas DataFrame.  This is necessary because matplotlib and seaborn work on in-memory data, not on distributed Spark DataFrames, and the full table may not fit on the driver.
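
Fraction-based sampling does not return an exact number of rows.  The pure-Python sketch below illustrates why, under the assumption that each row is kept independently with probability equal to the fraction (Bernoulli sampling).  The helper names `sample_fraction` and `bernoulli_sample_size` and the table size of 200 000 rows are hypothetical.

```python
import random

def sample_fraction(target_rows, total_rows):
    """Fraction to request so that roughly target_rows are kept, capped at 1.0."""
    if total_rows <= 0:
        return 0.0
    return min(1.0, target_rows / total_rows)

def bernoulli_sample_size(total_rows, fraction, seed):
    """Simulate keeping each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return sum(1 for _ in range(total_rows) if rng.random() < fraction)

# Assumed table size of 200 000 rows and a target of 10 000 sampled rows.
fraction = sample_fraction(10_000, 200_000)
sizes = [bernoulli_sample_size(200_000, fraction, seed) for seed in (1, 2, 3)]
print(fraction, sizes)  # the sizes land near 10 000 but differ from seed to seed

# A table smaller than the target yields a fraction of 1.0, not a value above 1.
print(sample_fraction(10_000, 500))
```

If an exact row count matters, one option is to oversample slightly and then truncate the Pandas DataFrame with `.head(sample_size)`.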

In [None]:
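# Take a random sample of roughly 10 000 rows for plotting.
# sample() works with a fraction, so the resulting size is approximate;
# the fraction is capped at 1.0 in case the table has fewer rows than requested.
sample_size = 10000
fraction = min(1.0, sample_size / full_df.count())
sample_df = full_df.sample(fraction=fraction, seed=42).toPandas()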

# Plot distributions of selected numeric features
import matplotlib.pyplot as plt
import seaborn as sns

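# Keep only the columns present in the sample (a behaviour count is absent if its btag never occurs)
numeric_cols = [c for c in ['age_level', 'price', 'buy', 'cart', 'fav', 'ipv'] if c in sample_df.columns]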

fig, axes = plt.subplots(len(numeric_cols), 2, figsize=(12, 4 * len(numeric_cols)))
for i, col_name in enumerate(numeric_cols):
    sns.histplot(sample_df[col_name], ax=axes[i, 0], kde=True)
    axes[i, 0].set_title(f'Distribution of {col_name}')
    sns.boxplot(x=sample_df[col_name], ax=axes[i, 1])
    axes[i, 1].set_title(f'Boxplot of {col_name}')
plt.tight_layout()
plt.show()


## Correlation analysis

We compute the pairwise Pearson correlation matrix for the numeric features on the sample.  Strongly correlated pairs point to redundant features and are a first indication of multicollinearity.
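
Reading strongly correlated pairs off a heatmap gets tedious as the number of features grows.  A minimal sketch of listing them programmatically follows; the helper name `high_corr_pairs`, the threshold of 0.8 and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:  # NaN correlations compare as False and are skipped
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Hypothetical toy data: b is a linear function of a, c is only weakly related.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [3, 5, 7, 9, 11, 13],
    'c': [1, -1, 1, -1, 1, -1],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

In this notebook the helper could be called as `high_corr_pairs(sample_df[numeric_cols])`.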

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

corr = sample_df[numeric_cols].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', center=0)
plt.title('Correlation Matrix of Numeric Features')
plt.show()
