# 01 – Data Overview (PySpark)

This notebook introduces the Taobao CTR dataset. We load up to **500k rows** from each CSV file with PySpark, which scales to large data volumes, and inspect each table's schema, a few sample rows, and per-column missing-value counts.

**Note on Kaggle download**: If your Kaggle API credentials are set up, you can download the dataset automatically: uncomment the lines in the code cell below and make sure your `kaggle.json` credentials file is in `~/.kaggle`. Otherwise, place the CSV files manually in `data/raw`.

* Data sources:
  * `raw_sample.csv` – user impressions and clicks
  * `ad_feature.csv` – advertisement metadata
  * `user_profile.csv` – user demographics
  * `behavior_log.csv` – user behaviour logs (page view, cart, favourite, purchase)
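
The automatic download mentioned in the note could look roughly like the sketch below. It is an illustration, not the notebook's own code: the helper names `missing_files` and `download_if_missing` are hypothetical, and the Kaggle dataset slug is not given in this notebook, so the caller has to pass it in. The sketch assumes the `kaggle` package is installed and only contacts Kaggle when one of the four files listed above is absent.

```python
import os

# The four CSV files listed under "Data sources".
REQUIRED_FILES = (
    "raw_sample.csv",
    "ad_feature.csv",
    "user_profile.csv",
    "behavior_log.csv",
)


def missing_files(raw_dir, names=REQUIRED_FILES):
    """Return the required file names that are not present in raw_dir."""
    return [n for n in names if not os.path.isfile(os.path.join(raw_dir, n))]


def download_if_missing(raw_dir, dataset_slug):
    """Download and unzip the dataset into raw_dir unless every CSV is already there.

    dataset_slug is an "<owner>/<dataset-name>" string. It is an assumption of
    this sketch: the notebook does not name the dataset, so supply it yourself.
    Returns the list of files that were missing before the download.
    """
    todo = missing_files(raw_dir)
    if not todo:
        return []  # everything is in place, no network access needed

    # Imported lazily: the kaggle package typically tries to authenticate on
    # import, so we only touch it when a download is actually required.
    from kaggle.api.kaggle_api_extended import KaggleApi

    os.makedirs(raw_dir, exist_ok=True)
    api = KaggleApi()
    api.authenticate()  # reads the credentials from ~/.kaggle/kaggle.json
    api.dataset_download_files(dataset_slug, path=raw_dir, unzip=True)
    return todo
```

Calling `download_if_missing(raw_dir, "<owner>/<dataset-name>")` before the file check in the code cell would leave an already populated `data/raw` folder untouched.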


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import os


# 1) Start Spark session
spark = (
    SparkSession.builder
        .appName("CTR_Data_Overview")
        .config("spark.sql.shuffle.partitions", 200)
        .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
print("Spark version:", spark.version)



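# 2) Resolve the project root (assumed to be the parent of this file's directory)
try:
    here = os.path.dirname(os.path.abspath(__file__))
    project_root = os.path.abspath(os.path.join(here, ".."))
except NameError:
    # __file__ is undefined in notebooks / interactive sessions, so fall back
    # to the parent of the current working directory.
    project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))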

raw_dir = os.path.join(project_root, "data", "raw")
print("Project root:", project_root)
print("Raw data directory:", raw_dir)

file_names = {
    "raw_sample": "raw_sample.csv",
    "ad_feature": "ad_feature.csv",
    "user_profile": "user_profile.csv",
    "behavior_log": "behavior_log.csv",
}


missing = []
paths = {}

for key, fname in file_names.items():
    fpath = os.path.join(raw_dir, fname)
    paths[key] = fpath
    if not os.path.isfile(fpath):
        missing.append(fpath)

if missing:
    print("\nERROR: Missing required files:")
    for p in missing:
        print(" -", p)
    raise FileNotFoundError("Some required CSV files are missing. Check your data/raw folder.")

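def read_csv_safe(path: str, max_rows: int = 500_000):
    """Read a CSV with a header row and an inferred schema, capped at max_rows rows."""
    return (
        spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            # DROPMALFORMED silently skips rows that fail to parse
            .option("mode", "DROPMALFORMED")
            .csv(path)
            .limit(max_rows)
    )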


def overview_df(df, name: str, show_n: int = 5):
    print("\n" + "=" * 70)
    print(f"DATASET: {name}")
    print("=" * 70)

    # Row count (action)
    n_rows = df.count()
    n_cols = len(df.columns)
    print(f"Rows: {n_rows:,} | Columns: {n_cols}")

    # Schema
    print("\nSchema:")
    df.printSchema()

    # Sample
    print(f"\nSample ({show_n} rows):")
    df.show(show_n, truncate=False)

    # Missing values per column: a value counts as missing if it is null
    # or blank after trimming. All columns are aggregated in a single pass.
    miss_exprs = [
        F.sum(
            F.when(F.col(c).isNull() | (F.trim(F.col(c).cast("string")) == ""), 1).otherwise(0)
        ).alias(c)
        for c in df.columns
    ]
    miss_row = df.select(miss_exprs).collect()[0].asDict()

    # Print top missing columns
    miss_sorted = sorted(miss_row.items(), key=lambda x: x[1], reverse=True)
    top_missing = [(c, m) for c, m in miss_sorted if m and m > 0][:20]

    print("\nTop missing columns (up to 20):")
    if not top_missing:
        print("No missing values detected (null/empty).")
    else:
        for c, m in top_missing:
            pct = (m / n_rows * 100.0) if n_rows > 0 else 0.0
            print(f"- {c}: {m:,} ({pct:.2f}%)")


df_raw_sample   = read_csv_safe(paths["raw_sample"])
df_ad_feature   = read_csv_safe(paths["ad_feature"])
df_user_profile = read_csv_safe(paths["user_profile"])
df_behavior_log = read_csv_safe(paths["behavior_log"])


overview_df(df_raw_sample,   "raw_sample")
overview_df(df_ad_feature,   "ad_feature")
overview_df(df_user_profile, "user_profile")
overview_df(df_behavior_log, "behavior_log")


print("\nDone. All datasets loaded and profiled.")


Spark version: 4.0.1
Project root: d:\projects\Ai\project_fusion_ecu
Raw data directory: d:\projects\Ai\project_fusion_ecu\data\raw

DATASET: raw_sample
Rows: 100,836 | Columns: 6

Schema:
root
 |-- user: integer (nullable = true)
 |-- adgroup_id: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- time_stamp: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- clk: integer (nullable = true)


Sample (5 rows):
+----+----------+------+----------+-----+---+
|user|adgroup_id|rating|time_stamp|label|clk|
+----+----------+------+----------+-----+---+
|1   |1         |4.0   |964982703 |1    |1  |
|1   |3         |4.0   |964981247 |1    |1  |
|1   |6         |4.0   |964982224 |1    |1  |
|1   |47        |5.0   |964983815 |1    |1  |
|1   |50        |5.0   |964982931 |1    |1  |
+----+----------+------+----------+-----+---+
only showing top 5 rows

Top missing columns (up to 20):
No missing values detected (null/empty).

DATASET: ad_feature
Rows: 9,742 | Colum