### 🧹 **Euroleague Data Cleaning & Exploration Notebook**
#### 🏀 **Stage 1 – Overview**

#### **Goal:**
Prepare all seven Euroleague CSVs for integration by cleaning columns, fixing datatypes, and verifying join keys.
At the end, we’ll have:

* `clean_header.csv`
* `clean_boxscore.csv`
* `clean_teams.csv`

… ready for Stage 2.

In [3]:
# Imports
import pandas as pd
from pathlib import Path

# Path to extracted dataset folder
DATA = Path("data")   # adjust if needed

# File mapping
files = {
    "header": "euroleague_header.csv",
    "boxscore": "euroleague_box_score.csv",
    "teams": "euroleague_teams.csv",
    "players": "euroleague_players.csv",
    "points": "euroleague_points.csv",
    "comparison": "euroleague_comparison.csv",
    "playbyplay": "euroleague_play_by_play.csv"
}

# Load all dataframes
dfs = {k: pd.read_csv(DATA / v, low_memory=False) for k, v in files.items()}


**Let's print the shape, the first ten column names, and the first two rows of every table!**

In [None]:
for name, df in dfs.items():
    print(f"\n--- {name.upper()} ---")
    print("Shape:", df.shape)
    print("Columns:", df.columns.tolist()[:10], "...")
    print(df.head(2))


### ✨ **Let's Standardize Column Names**

In [6]:
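def clean_cols(df):
    """Rename columns to snake_case in place and return the same DataFrame."""
    df.columns = (
        df.columns
        .str.lower()
        .str.strip()                                  # drop surrounding whitespace
        .str.replace(r"[^a-z0-9]+", "_", regex=True)  # any other run of characters becomes "_"
    )
    return df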

cleaned = {k: clean_cols(v) for k, v in dfs.items()}


All column names are now snake_case (e.g., `team_id_a`, `score_quarter_1_b`).
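One thing worth checking after a rename like this is whether two differently spelled columns collapse into the same snake_case name. The sketch below applies the same three string steps to a few made-up column names (they are hypothetical, not taken from the Euroleague files) and lists any collisions. `normalize_columns` is a helper introduced only for this illustration.

```python
import pandas as pd

def normalize_columns(columns):
    """Lowercase, trim, and collapse non-alphanumeric runs to underscores."""
    return (
        pd.Index(columns)
        .str.lower()
        .str.strip()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
    )

# Hypothetical raw column names, not taken from the Euroleague files
raw = ["Team ID A", "team_id_a", "Score-Quarter 1 B"]
normalized = normalize_columns(raw)

# Names that occur more than once after normalization
collisions = sorted(normalized[normalized.duplicated(keep=False)].unique())
print(list(normalized))
print("Collisions:", collisions)
```

If the collision list is non-empty for a real table, the affected columns would need to be renamed by hand before saving.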

### 🗓️ Parse Dates

In [7]:
clean_header = cleaned["header"].copy()
if "date" in clean_header.columns:
    clean_header["date"] = pd.to_datetime(clean_header["date"], errors="coerce")



The parsing succeeded without introducing invalid `NaT` values.
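Since `errors="coerce"` turns unparseable values into `NaT` instead of raising, a quick count is one way to confirm that nothing was silently dropped. This is a minimal sketch on a made-up three-row sample (the last value is deliberately malformed), not on the real header table.

```python
import pandas as pd

# Hypothetical sample; the last value is deliberately malformed
sample = pd.DataFrame({"date": ["2007-10-25", "2007-10-24", "not a date"]})

raw_dates = sample["date"]
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Rows that had a value before parsing but are NaT afterwards failed to parse
failed = parsed.isna() & raw_dates.notna()
print("Unparseable dates:", int(failed.sum()))
print(sample.loc[failed, "date"].tolist())
```

On the real data the same comparison would need the raw `date` column kept aside before it is overwritten.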

### 🧩 Check Join Keys

In [9]:
for name in ["header", "boxscore", "teams"]:
    df = cleaned[name]
    # Substring match: any column whose name contains "id" counts as a candidate key
    keys = [c for c in df.columns if "id" in c]
    print(f"{name.upper()} join keys:", keys)


HEADER join keys: ['game_id', 'team_id_a', 'team_id_b', 'w_id']
BOXSCORE join keys: ['game_player_id', 'game_id', 'player_id', 'team_id']
TEAMS join keys: ['season_team_id', 'team_id']


`header` uses `game_id`, `team_id_a`, `team_id_b`, and `w_id`.

`boxscore` uses `game_player_id`, `game_id`, `player_id`, and `team_id`.

`teams` has `team_id` and `season_team_id`.
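Listing the key columns shows which names exist, but not whether the values actually line up. A possible next check, sketched here on made-up miniature tables that reuse the key names listed above (the rows and the `game_player_id` values are hypothetical), is to test uniqueness on one side and look for orphaned keys on the other.

```python
import pandas as pd

# Hypothetical miniature tables that reuse the key names listed above
header = pd.DataFrame({"game_id": ["E2007_001", "E2007_002"]})
boxscore = pd.DataFrame({
    "game_player_id": ["g1_p1", "g1_p2", "g3_p1"],
    "game_id": ["E2007_001", "E2007_001", "E2007_003"],
})

# A key that identifies rows should be unique within its own table
print("header.game_id unique:", header["game_id"].is_unique)

# Every boxscore row should point at a game that exists in header
orphans = boxscore.loc[~boxscore["game_id"].isin(header["game_id"]), "game_id"]
print("Orphan game_ids:", sorted(orphans.unique()))
```

The same pattern could be repeated for `team_id` against the teams table, assuming `team_id` values are formatted the same way in both.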

### **Saving the Clean Data**

In [14]:
from pathlib import Path

# create folder if it doesn't exist
CLEAN = Path("clean_data")
CLEAN.mkdir(exist_ok=True)

# save cleaned files
clean_header.to_csv(CLEAN / "clean_header.csv", index=False)
cleaned["boxscore"].to_csv(CLEAN / "clean_boxscore.csv", index=False)
cleaned["teams"].to_csv(CLEAN / "clean_teams.csv", index=False)
cleaned["players"].to_csv(CLEAN / "clean_players.csv", index=False)
cleaned["points"].to_csv(CLEAN / "clean_points.csv", index=False)
cleaned["comparison"].to_csv(CLEAN / "clean_comparison.csv", index=False)
cleaned["playbyplay"].to_csv(CLEAN / "clean_playbyplay.csv", index=False)

print("✅ Cleaned CSVs saved successfully in:", CLEAN)


✅ Cleaned CSVs saved successfully in: clean_data


Adjust to new path if needed.

In [13]:
CLEAN = Path("clean_data")
pd.read_csv(CLEAN / "clean_header.csv")

Unnamed: 0,game_id,game,date,time,round,phase,season_code,score_a,score_b,team_a,...,score_quarter_3_b,score_quarter_4_b,score_extra_time_1_a,score_extra_time_2_a,score_extra_time_3_a,score_extra_time_4_a,score_extra_time_1_b,score_extra_time_2_b,score_extra_time_3_b,score_extra_time_4_b
0,E2007_001,OLY-BAS,2007-10-25,20:30:00,1,REGULAR SEASON,E2007,95,90,OLYMPIACOS PIRAEUS B.C.,...,65,90,,,,,,,,
1,E2007_002,VIR-ZAL,2007-10-24,20:30:00,1,REGULAR SEASON,E2007,81,75,VIRTUS VIDIVICI BOLOGNA,...,61,75,,,,,,,,
2,E2007_003,SOP-CSK,2007-10-22,20:15:00,1,REGULAR SEASON,E2007,69,88,PROKOM TREFL SOPOT,...,69,88,,,,,,,,
3,E2007_004,SIE-LJU,2007-10-24,20:30:00,1,REGULAR SEASON,E2007,80,52,MONTEPASCHI,...,41,52,,,,,,,,
4,E2007_005,ARI-MAL,2007-10-24,20:45:00,1,REGULAR SEASON,E2007,87,83,ARIS THESSALONIKI,...,65,83,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4662,E2025_026,PAR-IST,2025-10-09,20:30:00,3,REGULAR SEASON,E2025,93,87,PARTIZAN MOZZART BET BELGRADE,...,69,87,,,,,,,,
4663,E2025_027,OLY-DUB,2025-10-10,20:15:00,3,REGULAR SEASON,E2025,86,67,OLYMPIACOS PIRAEUS,...,52,67,,,,,,,,
4664,E2025_028,BAR-PAM,2025-10-10,20:30:00,3,REGULAR SEASON,E2025,108,102,FC BARCELONA,...,75,102,,,,,,,,
4665,E2025_029,MUN-ZAL,2025-10-10,20:30:00,3,REGULAR SEASON,E2025,70,98,FC BAYERN MUNICH,...,68,98,,,,,,,,
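CSV files do not store dtypes, so a plain `pd.read_csv` brings `date` back as text even though it was parsed before saving. The sketch below shows the difference using an in-memory buffer and a made-up two-row frame standing in for `clean_header`.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for clean_header
frame = pd.DataFrame({
    "game_id": ["E2007_001", "E2007_002"],
    "date": pd.to_datetime(["2007-10-25", "2007-10-24"]),
})

# Write to an in-memory buffer instead of disk to keep the sketch self-contained
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                            # date comes back as text
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date"])   # date comes back as datetime

print(plain["date"].dtype, restored["date"].dtype)
```

Passing `parse_dates=["date"]` when reloading `clean_header.csv` in Stage 2 would avoid repeating the conversion step.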


---


### 🧾 **Reflection — Stage 1: Data Cleaning Accomplishments**

In this stage, I completed the foundational data-engineering work for the Euroleague project:

* **Standardized all column names** across seven CSVs into a unified snake-case format (e.g., `team_id_a`, `score_quarter_1_b`) to ensure consistent referencing and clean joins.
* **Parsed and validated date fields** in the header data, confirming that timestamps convert cleanly to `datetime` without producing invalid `NaT` entries.
* **Verified relational keys** such as `game_id`, `team_id`, and `player_id`, establishing how each table connects (e.g., `header` uses `game_id`, `team_id_a`, `team_id_b`, and `w_id`; `boxscore` uses `game_player_id`, `game_id`, `player_id`, and `team_id`; `teams` links through `team_id` and `season_team_id`).
* **Inspected missing values** (e.g., NaNs in `points_off_turnover` within `points.csv` and gaps in `header.csv`) to plan proper handling in Stage 2.
* **Separated raw vs. processed data** by creating a dedicated `clean_data/` directory, preserving original files under `data/` for reproducibility.
* **Confirmed structural consistency**: all tables share `season_code`, enabling reliable cross-season analysis later on.
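The missing-value inspection and the shared-column check listed above could be reproduced with a few lines like the following. This is only a sketch on made-up miniature tables (the values are hypothetical); in the notebook, the real frames live in `cleaned`.

```python
import pandas as pd

# Hypothetical miniature tables; the real ones are the frames in `cleaned`
tables = {
    "header": pd.DataFrame({"season_code": ["E2007", "E2007"], "score_a": [95, None]}),
    "points": pd.DataFrame({"season_code": ["E2007", "E2007"], "points_off_turnover": [None, None]}),
}

# Share of missing values per column, per table (only columns with gaps are shown)
for name, df in tables.items():
    missing = df.isna().mean()
    print(name, missing[missing > 0].round(2).to_dict())

# Columns present in every table
shared = set.intersection(*(set(df.columns) for df in tables.values()))
print("Shared columns:", sorted(shared))
```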



> **Summary:** Stage 1 established a reproducible, schema-consistent foundation for the Euroleague dataset. The data are now clean, join-ready, and safely stored for downstream integration and feature engineering.

---

