# 01 – Data Overview (PySpark)

This notebook introduces the Taobao CTR dataset. Using PySpark (to handle the large data volumes), we draw a random sample of up to **500k rows** from each CSV file and inspect the schemas and basic statistics.
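As a rough guide to how many rows each sample will contain, the arithmetic can be sketched as follows. This assumes the loader first samples a fixed fraction of the rows and then caps the result, as the loading cell in this notebook does with a 25% fraction. The helper name `expected_sample_rows` and the file sizes are hypothetical and used only for illustration.

```python
def expected_sample_rows(total_rows: int, fraction: float, target_rows: int) -> int:
    """Approximate row count after sampling `fraction` of the rows, then capping at `target_rows`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Expected size of the random sample; the actual count varies slightly from run to run.
    sampled = round(total_rows * fraction)
    return min(sampled, target_rows)


# Hypothetical file sizes, chosen only for illustration.
print(expected_sample_rows(1_000_000, 0.25, 500_000))   # 250000: the sample is smaller than the cap
print(expected_sample_rows(10_000_000, 0.25, 500_000))  # 500000: the cap applies
```

Files with fewer than roughly 2 million rows therefore end up below the 500k cap at a 25% fraction, while larger files are truncated to exactly 500k rows.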

**Note on Kaggle download**: If you have Kaggle API credentials set up, you can download the dataset automatically; make sure your `kaggle.json` credentials file is in the correct location (`~/.kaggle`). Otherwise, place the CSV files manually into `data/raw`.
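A minimal sketch of what such an automatic download could look like, assuming the official `kaggle` Python package is installed. The dataset slug is a placeholder (the notebook does not name one), and the helper names `kaggle_credentials_present` and `download_dataset` are hypothetical. The credential check only looks for `kaggle.json` in a directory; it does not cover credentials supplied through environment variables.

```python
import os
from pathlib import Path

# Assumption: placeholder slug; replace it with the "owner/dataset-name" of the Kaggle dataset you use.
DATASET_SLUG = "<owner>/<dataset-name>"


def kaggle_credentials_present(config_dir=None) -> bool:
    """Return True if kaggle.json exists in config_dir (default: ~/.kaggle)."""
    base = Path(config_dir) if config_dir is not None else Path.home() / ".kaggle"
    return (base / "kaggle.json").is_file()


def download_dataset(slug: str, raw_dir: str) -> None:
    """Download a Kaggle dataset and unzip it into raw_dir."""
    if not kaggle_credentials_present():
        raise FileNotFoundError(
            "kaggle.json not found in ~/.kaggle; copy the CSV files into data/raw manually instead"
        )
    # Imported lazily because the kaggle package looks for credentials at import time.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    os.makedirs(raw_dir, exist_ok=True)
    api.dataset_download_files(slug, path=raw_dir, unzip=True)


# Example call (not executed here):
# download_dataset(DATASET_SLUG, os.path.join("data", "raw"))
```

After a download, check that the extracted file names match the ones listed under the data sources, since the loading cell expects those exact names.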

* Data sources:
  * `raw_sample.csv` – user impressions and clicks
  * `ad_feature.csv` – advertisement metadata
  * `user_profile.csv` – user demographics
  * `behavior_log.csv` – user behaviour logs (page view, cart, favourite, purchase)


In [None]:
from pyspark.sql import SparkSession
import os

# Start Spark session
spark = (
    SparkSession.builder
        .appName("CTR_Data_Overview")
        .config("spark.sql.shuffle.partitions", 200) 
        .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN") # Reduce log verbosity
print("Spark version:", spark.version)

# Define project and data paths
project_root = r"D:\projects\Ai\project_fusion_ecu"

raw_dir = os.path.join(project_root, "data", "raw")
print("Raw data directory:", raw_dir)

# Expected file names
file_names = {
    "raw_sample": "raw_sample.csv",
    "ad_feature": "ad_feature.csv",
    "user_profile": "user_profile.csv",
    "behavior_log": "behavior_log.csv",
}

# Verify that files exist
missing = []
for key, fname in file_names.items():
    path = os.path.join(raw_dir, fname)
    exists = os.path.exists(path)
    print(f"{key:15s} -> {path} | exists: {exists}")
    if not exists:
        missing.append(fname)

if missing:
    raise FileNotFoundError(
        "Missing files in data/raw: "
        + ", ".join(missing)
        + "\nPlease copy them into "
        + raw_dir
        + " with the exact same names."
    )

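# Load a random sample of each file
TARGET_ROWS = 500_000   # upper bound on rows kept per file

SAMPLE_FRACTION = 0.25  # fraction of rows drawn before the cap is applied
SEED = 42

def read_random_sample_csv(path, target_rows=500_000, frac=0.25, seed=42, cache_df=True):
    """Read a CSV, draw a random sample of about `frac` of its rows, and keep at most `target_rows`."""
    df = (
        spark.read.csv(path, header=True, inferSchema=True)
        .sample(withReplacement=False, fraction=frac, seed=seed)  # random sampling
        .limit(target_rows)  # cap the sample size
    )
    if cache_df:
        # Spark evaluates lazily: without caching, every action (count, describe, ...)
        # would re-read and re-sample the CSV.
        df = df.cache()
    return df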

user_df = read_random_sample_csv(
    os.path.join(raw_dir, file_names["user_profile"]),
    target_rows=TARGET_ROWS,
    frac=SAMPLE_FRACTION,
    seed=SEED,
)

ad_df = read_random_sample_csv(
    os.path.join(raw_dir, file_names["ad_feature"]),
    target_rows=TARGET_ROWS,
    frac=SAMPLE_FRACTION,
    seed=SEED + 1, # different seed for different samples
)

click_df = read_random_sample_csv(
    os.path.join(raw_dir, file_names["raw_sample"]),
    target_rows=TARGET_ROWS,
    frac=SAMPLE_FRACTION,
    seed=SEED + 2,
)

behavior_df = read_random_sample_csv(
    os.path.join(raw_dir, file_names["behavior_log"]),
    target_rows=TARGET_ROWS,
    frac=SAMPLE_FRACTION,
    seed=SEED + 3,
)

print("Loaded dataframes (random sample):")
print("user_df rows:     ", user_df.count())
print("ad_df rows:       ", ad_df.count())
print("click_df rows:    ", click_df.count())
print("behavior_df rows: ", behavior_df.count())

# Inspect schemas
for name, df in [
    ("User Profile", user_df),
    ("Ad Feature", ad_df),
    ("Click Log", click_df),
    ("Behavior Log", behavior_df),
]:
    print(f"\nSchema for {name}:")
    df.printSchema()

# Basic statistics for numeric columns in click_df
numeric_cols = [c for c, dtype in click_df.dtypes if dtype in ("int", "bigint", "double", "float")]
print("\nNumeric columns in click_df:", numeric_cols)

if numeric_cols:
    click_df.describe(numeric_cols).show()
else:
    print("No numeric columns found in click_df.")

spark.stop()


Spark version: 4.0.1
Raw data directory: D:\projects\Ai\project_fusion_ecu\data\raw
raw_sample      -> D:\projects\Ai\project_fusion_ecu\data\raw\raw_sample.csv | exists: True
ad_feature      -> D:\projects\Ai\project_fusion_ecu\data\raw\ad_feature.csv | exists: True
user_profile    -> D:\projects\Ai\project_fusion_ecu\data\raw\user_profile.csv | exists: True
behavior_log    -> D:\projects\Ai\project_fusion_ecu\data\raw\behavior_log.csv | exists: True
Loaded dataframes (random sample):
user_df rows:      265613
ad_df rows:        211615
click_df rows:     500000
behavior_df rows:  500000

Schema for User Profile:
root
 |-- userid : double (nullable = true)
 |--  cms_segid: double (nullable = true)
 |--  cms_group_id: double (nullable = true)
 |--  final_gender_code: double (nullable = true)
 |--  age_level: double (nullable = true)
 |--  pvalue_level: string (nullable = true)
 |--  shopping_level: double (nullable = true)
 |--  occupation: double (nullable = true)
 |--  new_user_class_