# 01 – Data Overview (PySpark)

This notebook introduces the Taobao CTR dataset. It loads up to **1 million rows** from each CSV file with PySpark, which scales to the full data volume, prints each schema, and computes basic statistics for the numeric columns of the click log.

**Note on Kaggle download**: If you have Kaggle API credentials set up, you can download the dataset automatically: uncomment the lines in the code cell below and make sure your `kaggle.json` credentials file is in `~/.kaggle`. Otherwise, place the CSV files manually in `data/raw`.

* Data sources:
  * `raw_sample.csv` – user impressions and clicks
  * `ad_feature.csv` – advertisement metadata
  * `user_profile.csv` – user demographics
  * `behavior_log.csv` – user behaviour logs (page view, cart, favourite, purchase)
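
Before starting Spark, it can help to confirm that each file has the header you expect. The following is a minimal sketch that uses only the standard library. `peek_csv` is a hypothetical helper, not part of this notebook, and it assumes the files are comma-delimited, UTF-8 encoded, and start with a header row.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) without reading the whole file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # empty list if the file is empty
        rows = list(islice(reader, n_rows))  # stops after n_rows lines
    return header, rows
```

For example, `header, rows = peek_csv("data/raw/user_profile.csv")` reads only the first few lines, so it stays fast even on the largest file. All values come back as strings; type inference is left to Spark.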


In [1]:

# Uncomment and configure the following if Kaggle API is available:
# from kaggle.api.kaggle_api_extended import KaggleApi
# api = KaggleApi()
# api.authenticate()
# api.dataset_download_files('t/t', path='../data/raw', unzip=True)
# The above will download the dataset into the raw data folder.
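
To decide whether the download lines above are worth uncommenting, you can first check for the credentials file. This is a small sketch under the assumption that the file lives at `~/.kaggle/kaggle.json`, the location named in the note above. `kaggle_json_present` is a hypothetical helper, and it only tests that the file exists, not that the credentials are valid.

```python
from pathlib import Path


def kaggle_json_present(home=None):
    """Return True if <home>/.kaggle/kaggle.json exists (home defaults to the user's home)."""
    home_dir = Path(home) if home is not None else Path.home()
    return (home_dir / ".kaggle" / "kaggle.json").is_file()


print("kaggle.json found:", kaggle_json_present())
```

If this prints `False`, copy the CSV files into `data/raw` manually instead.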


In [4]:
from pyspark.sql import SparkSession
import os

# ============================
# 1. Start Spark session
# ============================
spark = (
    SparkSession.builder
        .appName("CTR_Data_Overview")
        .config("spark.sql.shuffle.partitions", "200")
        .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
print("Spark version:", spark.version)

# ============================
# 2. Define project and data paths
# ============================

# Option A: hard-coded project root (recommended for now)
project_root = r"D:\projects\Ai\project_fusion_ecu"

# Option B (alternative): if this notebook lives inside `project_fusion_ecu/notebooks`
# project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))

raw_dir = os.path.join(project_root, "data", "raw")
print("Raw data directory:", raw_dir)

# ============================
# 3. Expected file names
# ============================
file_names = {
    "raw_sample": "raw_sample.csv",
    "ad_feature": "ad_feature.csv",
    "user_profile": "user_profile.csv",
    # If your file is named `raw_behavior_log.csv` change the next line accordingly
    "behavior_log": "behavior_log.csv",
}

# ============================
# 4. Verify that files exist
# ============================
missing = []
for key, fname in file_names.items():
    path = os.path.join(raw_dir, fname)
    exists = os.path.exists(path)
    print(f"{key:15s} -> {path} | exists: {exists}")
    if not exists:
        missing.append(fname)

if missing:
    raise FileNotFoundError(
        "Missing files in data/raw: "
        + ", ".join(missing)
        + "\nPlease copy them into "
        + raw_dir
        + " with the exact same names."
    )

# ============================
# 5. Load up to 1M rows per file
# ============================
MAX_ROWS = 1_000_000


def load_csv(key, max_rows=MAX_ROWS):
    """Read one raw CSV, keep at most `max_rows` rows, and cache the result."""
    # inferSchema=True costs one extra pass over the file to guess column
    # types; the row limit below does not shorten that pass.
    return (
        spark.read.csv(
            os.path.join(raw_dir, file_names[key]),
            header=True,
            inferSchema=True,
        )
        .limit(max_rows)
        .cache()
    )


user_df = load_csv("user_profile")
ad_df = load_csv("ad_feature")
click_df = load_csv("raw_sample")
behavior_df = load_csv("behavior_log")

print("Loaded dataframes:")
print("user_df rows:     ", user_df.count())
print("ad_df rows:       ", ad_df.count())
print("click_df rows:    ", click_df.count())
print("behavior_df rows: ", behavior_df.count())

# ============================
# 6. Inspect schemas
# ============================
for name, df in [
    ("User Profile", user_df),
    ("Ad Feature", ad_df),
    ("Click Log", click_df),
    ("Behavior Log", behavior_df),
]:
    print(f"\nSchema for {name}:")
    df.printSchema()

# ============================
# 7. Basic statistics for numeric columns in click_df
# ============================
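# Match every Spark numeric type name, not only int/bigint/double
numeric_types = ("tinyint", "smallint", "int", "bigint", "float", "double")
numeric_cols = [
    c for c, dtype in click_df.dtypes
    if dtype in numeric_types or dtype.startswith("decimal")
]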
print("\nNumeric columns in click_df:", numeric_cols)

if numeric_cols:
    click_df.describe(numeric_cols).show()
else:
    print("No numeric columns found in click_df.")

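# ============================
# 8. Stop Spark session (optional)
# ============================
# After this call the cached DataFrames above can no longer be used;
# skip it if later cells in the same session still need them.
spark.stop()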


Spark version: 4.0.1
Raw data directory: D:\projects\Ai\project_fusion_ecu\data\raw
raw_sample      -> D:\projects\Ai\project_fusion_ecu\data\raw\raw_sample.csv | exists: True
ad_feature      -> D:\projects\Ai\project_fusion_ecu\data\raw\ad_feature.csv | exists: True
user_profile    -> D:\projects\Ai\project_fusion_ecu\data\raw\user_profile.csv | exists: True
behavior_log    -> D:\projects\Ai\project_fusion_ecu\data\raw\behavior_log.csv | exists: True
Loaded dataframes:
user_df rows:      1000000
ad_df rows:        846811
click_df rows:     1000000
behavior_df rows:  1000000

Schema for User Profile:
root
 |-- userid: integer (nullable = true)
 |-- cms_segid: integer (nullable = true)
 |-- cms_group_id: integer (nullable = true)
 |-- final_gender_code: integer (nullable = true)
 |-- age_level: integer (nullable = true)
 |-- pvalue_level: integer (nullable = true)
 |-- shopping_level: integer (nullable = true)
 |-- occupation: integer (nullable = true)
 |-- new_user_class_level : intege