# Lesson 3: Data Modeling & Schema Design - Practice Notebook

This notebook provides hands-on practice for designing optimal database schemas for ML feature engineering.

**Prerequisites:**
- Lesson 1 (Data Pipelines)
- Lesson 2 (BigQuery Deep Dive)
- BigQuery with MLB data

**What you'll learn:**
- Star schema vs snowflake schema
- Normalization vs denormalization
- Creating fact and dimension tables
- Building denormalized feature tables for ML
- Schema optimization techniques

## Setup and Configuration

In [1]:
# Import libraries
from google.cloud import bigquery
import pandas as pd
from datetime import datetime
import os

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [None]:
# BigQuery Configuration
PROJECT_ID = "hankstank"
DATASET = "mlb_historical_data"

# Initialize BigQuery client
client = bigquery.Client(project=PROJECT_ID)

print(f"üîß Connected to project: {PROJECT_ID}")
print(f"üìä Using dataset: {DATASET}")

In [None]:
# Helper function from Lesson 2
def run_query(query, show_cost=True, limit=10):
    """Execute a BigQuery query and return results as a DataFrame"""
    try:
        if show_cost:
            job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
            dry_run_job = client.query(query, job_config=job_config)
            bytes_processed = dry_run_job.total_bytes_processed
            gb_processed = bytes_processed / 1e9
            # Assumes on-demand pricing of $5 per TB; check the current rate for your region
            cost_estimate = (bytes_processed / 1e12) * 5
            
            print(f"📊 Query will process: {gb_processed:.3f} GB")
            print(f"💰 Estimated cost: ${cost_estimate:.6f}\n")
        
        df = client.query(query).to_dataframe()
        print(f"✅ Query returned {len(df)} rows")
        
        if limit and len(df) > limit:
            print(f"   (showing first {limit} rows)\n")
            return df.head(limit)
        
        return df
        
    except Exception as e:
        print(f"❌ Query error: {e}")
        return None

print("✅ Helper functions loaded")
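
The cost arithmetic inside `run_query` can be checked on its own, without a BigQuery connection. The following is a minimal sketch: the function name `estimate_query_cost` is hypothetical, and the default of $5 per TB is only the rate assumed by the helper above, not a statement of current BigQuery pricing.

```python
def estimate_query_cost(bytes_processed: int, price_per_tb: float = 5.0) -> dict:
    """Convert a dry-run byte count into GB scanned and an estimated cost in USD.

    price_per_tb is an assumed on-demand rate; adjust it to match your billing.
    """
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    gb_processed = bytes_processed / 1e9                 # decimal gigabytes
    cost_usd = (bytes_processed / 1e12) * price_per_tb   # decimal terabytes x rate
    return {"gb_processed": gb_processed, "cost_usd": cost_usd}


# Example: a dry run that reports 250 GB scanned
estimate = estimate_query_cost(250_000_000_000)
print(f"{estimate['gb_processed']:.3f} GB, ${estimate['cost_usd']:.6f}")
```

Keeping the rate as a parameter makes it easy to rerun the estimate if your project is billed at a different price.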

---

## Section 1: Analyze Current Schema

Let's examine your current database schema to understand its structure.

### Check All Tables in Your Dataset

In [None]:
# List all tables in your dataset
# client.dataset() is deprecated; build the reference explicitly
dataset_ref = bigquery.DatasetReference(PROJECT_ID, DATASET)
tables = list(client.list_tables(dataset_ref))

print(f"📋 Tables in {DATASET}:\n")
for table in tables:
    # list_tables returns lightweight items; fetch the full table for row and byte counts
    table_obj = client.get_table(table.reference)
    print(f"  {table.table_id:30s} - {table_obj.num_rows:,} rows, {table_obj.num_bytes / 1e9:.2f} GB")


### Examine games_historical Schema

In [None]:
schema_query = f"""
SELECT 
  column_name,
  data_type,
  is_nullable
FROM `{PROJECT_ID}.{DATASET}.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'games_historical'
ORDER BY ordinal_position
"""

schema_df = run_query(schema_query, show_cost=False, limit=None)
display(schema_df)

### Sample Data from games_historical

In [None]:
sample_query = f"""
SELECT *
FROM `{PROJECT_ID}.{DATASET}.games_historical`
ORDER BY game_date DESC
LIMIT 5
"""

sample_df = run_query(sample_query, show_cost=False, limit=None)
display(sample_df)

---

## Section 2: Create a Star Schema

Let's design a star schema with fact and dimension tables.

### Create Dimension Table: dim_teams

Extract unique team information into a dimension table.

In [None]:
# Note: Adjust column names based on your actual schema
# This is a template - you'll need to modify based on the schema you saw above

create_dim_teams = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET}.dim_teams` AS
SELECT DISTINCT
  team_id AS team_key,
  team_name,
  -- Add other team attributes here based on your data
  -- city,
  -- division,
  -- league,
  CURRENT_TIMESTAMP() AS created_at
FROM (
  SELECT DISTINCT home_team_id AS team_id, home_team_name AS team_name
  FROM `{PROJECT_ID}.{DATASET}.games_historical`
  WHERE home_team_id IS NOT NULL
  
  UNION DISTINCT
  
  SELECT DISTINCT away_team_id AS team_id, away_team_name AS team_name
  FROM `{PROJECT_ID}.{DATASET}.games_historical`
  WHERE away_team_id IS NOT NULL
)
"""

print("Creating dim_teams dimension table...")
print("\n⚠️  Review this query and adjust column names based on your schema before running!\n")
print(create_dim_teams)

# Uncomment when ready (.result() waits for the job to finish):
# result = client.query(create_dim_teams).result()
# print("✅ dim_teams created successfully")
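
Before running the query, it can help to see what the `UNION DISTINCT` produces and why the resulting `team_key` may not be unique. The sketch below mimics the logic in plain Python on a few made-up rows; the column names follow the template above and the helper names `build_dim_teams` and `duplicate_keys` are hypothetical. If a team changed its name over time, the same ID appears with two names, which is worth checking for in your own data.

```python
from collections import defaultdict

# Illustrative rows shaped like games_historical (values are made up for the example).
games = [
    {"home_team_id": 147, "home_team_name": "New York Yankees",
     "away_team_id": 111, "away_team_name": "Boston Red Sox"},
    {"home_team_id": 111, "home_team_name": "Boston Red Sox",
     "away_team_id": 139, "away_team_name": "Tampa Bay Rays"},
    {"home_team_id": 139, "home_team_name": "Tampa Bay Devil Rays",
     "away_team_id": 147, "away_team_name": "New York Yankees"},
]


def build_dim_teams(rows):
    """Mimic the UNION DISTINCT: collect unique (team_id, team_name) pairs."""
    pairs = set()
    for row in rows:
        for side in ("home", "away"):
            team_id = row[f"{side}_team_id"]
            if team_id is not None:
                pairs.add((team_id, row[f"{side}_team_name"]))
    return sorted(pairs)


def duplicate_keys(dim_rows):
    """Return team_ids that map to more than one name, i.e. non-unique keys."""
    names_by_id = defaultdict(set)
    for team_id, name in dim_rows:
        names_by_id[team_id].add(name)
    return {tid: sorted(names) for tid, names in names_by_id.items() if len(names) > 1}


dim_teams = build_dim_teams(games)
print(dim_teams)
print(duplicate_keys(dim_teams))
```

If `duplicate_keys` is non-empty for your data, one option is to keep only the most recent name per `team_id` when building the dimension.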

### Create Dimension Table: dim_dates

Create a date dimension with useful attributes for analysis.

In [None]:
create_dim_dates = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET}.dim_dates` AS
WITH date_range AS (
  SELECT DISTINCT game_date AS date_key
  FROM `{PROJECT_ID}.{DATASET}.games_historical`
  WHERE game_date IS NOT NULL
)
SELECT 
  date_key,
  EXTRACT(YEAR FROM date_key) AS year,
  EXTRACT(MONTH FROM date_key) AS month,
  EXTRACT(DAY FROM date_key) AS day,
  EXTRACT(QUARTER FROM date_key) AS quarter,
  FORMAT_DATE('%A', date_key) AS day_of_week,
  FORMAT_DATE('%B', date_key) AS month_name,
  EXTRACT(DAYOFWEEK FROM date_key) IN (1, 7) AS is_weekend,
  EXTRACT(DAYOFYEAR FROM date_key) AS day_of_year,
  CURRENT_TIMESTAMP() AS created_at
FROM date_range
ORDER BY date_key
"""

print("Creating dim_dates dimension table...")
print(create_dim_dates)

# Uncomment when ready (.result() waits for the job to finish):
# result = client.query(create_dim_dates).result()
# print("✅ dim_dates created successfully")
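
The date attributes can be reproduced locally to sanity-check the SQL. This is a rough Python equivalent, not the BigQuery implementation; the function name `date_attributes` is hypothetical. One detail worth noting is that BigQuery's `DAYOFWEEK` numbers days from 1 (Sunday) to 7 (Saturday), which is why the query tests for `(1, 7)`.

```python
from datetime import date

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def date_attributes(d: date) -> dict:
    """Compute the same attributes as the dim_dates query for a single date."""
    # Python's isoweekday() runs 1 (Monday) to 7 (Sunday);
    # shift it to BigQuery's DAYOFWEEK convention of 1 (Sunday) to 7 (Saturday).
    bq_day_of_week = d.isoweekday() % 7 + 1
    return {
        "date_key": d,
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "quarter": (d.month - 1) // 3 + 1,
        "day_of_week": DAY_NAMES[d.weekday()],
        "is_weekend": bq_day_of_week in (1, 7),
        "day_of_year": d.timetuple().tm_yday,
    }


row = date_attributes(date(2024, 7, 6))
print(row)
```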

### Create Fact Table: fact_games

Create a lean fact table with keys and measures.

In [None]:
create_fact_games = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET}.fact_games`
PARTITION BY date_key
CLUSTER BY home_team_key, away_team_key
AS
SELECT 
  game_pk,
  game_date AS date_key,
  home_team_id AS home_team_key,
  away_team_id AS away_team_key,
  
  -- Measures (quantifiable metrics)
  home_score,
  away_score,
  ABS(home_score - away_score) AS score_differential,
  home_score + away_score AS total_runs,
  
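  -- Flags
  home_score > away_score AS home_won,
  -- NULL for ties or missing scores rather than defaulting to the away team
  CASE 
    WHEN home_score > away_score THEN home_team_id
    WHEN away_score > home_score THEN away_team_id
  END AS winning_team_key,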
  
  -- Metadata
  season,
  CURRENT_TIMESTAMP() AS created_at
FROM `{PROJECT_ID}.{DATASET}.games_historical`
WHERE game_date IS NOT NULL
"""

print("Creating fact_games fact table...")
print("\n⚠️  Review and adjust column names before running!\n")
print(create_fact_games)

# Uncomment when ready (.result() waits for the job to finish):
# result = client.query(create_fact_games).result()
# print("✅ fact_games created successfully")
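
The derived measures are simple enough to unit-test outside BigQuery. The sketch below is one possible Python rendering of the measure and flag logic, with a hypothetical function name `game_measures`; it treats tied or missing scores as having no winner, which is an assumption you may want to adapt to how your data records suspended or postponed games.

```python
from typing import Optional


def game_measures(home_team_id: int, away_team_id: int,
                  home_score: Optional[int], away_score: Optional[int]) -> dict:
    """Derive fact-table measures for one game; undecided games get None."""
    if home_score is None or away_score is None:
        return {"score_differential": None, "total_runs": None,
                "home_won": None, "winning_team_key": None}

    if home_score > away_score:
        winner = home_team_id
    elif away_score > home_score:
        winner = away_team_id
    else:
        winner = None  # tie: neither team is recorded as the winner

    return {
        "score_differential": abs(home_score - away_score),
        "total_runs": home_score + away_score,
        "home_won": home_score > away_score,
        "winning_team_key": winner,
    }


print(game_measures(1, 2, 5, 3))
print(game_measures(1, 2, 4, 4))
```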

### Query the Star Schema

Now query using the star schema (fact + dimensions).

In [None]:
star_schema_query = f"""
SELECT 
  d.year,
  d.month_name,
  d.day_of_week,
  ht.team_name AS home_team,
  awt.team_name AS away_team,
  f.home_score,
  f.away_score,
  f.home_won
FROM `{PROJECT_ID}.{DATASET}.fact_games` f
JOIN `{PROJECT_ID}.{DATASET}.dim_dates` d 
  ON f.date_key = d.date_key
JOIN `{PROJECT_ID}.{DATASET}.dim_teams` ht 
  ON f.home_team_key = ht.team_key
-- "at" is a reserved keyword in BigQuery, so the away-team alias is "awt"
JOIN `{PROJECT_ID}.{DATASET}.dim_teams` awt 
  ON f.away_team_key = awt.team_key
WHERE d.year = 2026
ORDER BY f.date_key DESC
LIMIT 20
"""

print("Querying star schema...")
# Uncomment after creating dimension and fact tables:
# result = run_query(star_schema_query)
# display(result)
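
To experiment with the join pattern without touching BigQuery, the same shape can be built in an in-memory SQLite database. This is only a local sketch with made-up teams and a reduced set of columns; it shows the key idea that `dim_teams` is joined twice under different aliases, once for the home side and once for the away side. The alias names are arbitrary (`awt` is used here because `at` is a reserved keyword in BigQuery).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_teams (team_key INTEGER PRIMARY KEY, team_name TEXT);
CREATE TABLE dim_dates (date_key TEXT PRIMARY KEY, year INTEGER, day_of_week TEXT);
CREATE TABLE fact_games (
    game_pk       INTEGER PRIMARY KEY,
    date_key      TEXT    REFERENCES dim_dates(date_key),
    home_team_key INTEGER REFERENCES dim_teams(team_key),
    away_team_key INTEGER REFERENCES dim_teams(team_key),
    home_score    INTEGER,
    away_score    INTEGER
);
-- Made-up sample data
INSERT INTO dim_teams VALUES (1, 'Hawks'), (2, 'Owls');
INSERT INTO dim_dates VALUES ('2024-07-06', 2024, 'Saturday'),
                             ('2024-07-07', 2024, 'Sunday');
INSERT INTO fact_games VALUES (1001, '2024-07-06', 1, 2, 5, 3),
                              (1002, '2024-07-07', 2, 1, 2, 4);
""")

# The fact table holds keys and measures; names and calendar attributes
# come from the dimensions, with dim_teams joined once per side.
rows = conn.execute("""
SELECT d.day_of_week, ht.team_name, awt.team_name, f.home_score, f.away_score
FROM fact_games AS f
JOIN dim_dates AS d   ON f.date_key = d.date_key
JOIN dim_teams AS ht  ON f.home_team_key = ht.team_key
JOIN dim_teams AS awt ON f.away_team_key = awt.team_key
WHERE d.year = 2024
ORDER BY f.date_key
""").fetchall()

for row in rows:
    print(row)
```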

---

## Section 3: Create Denormalized ML Feature Table

A common way to keep ML training queries fast is to precompute features into one denormalized table, so no joins or window functions run at read time.

### Build ML Game Prediction Features

This table combines raw data with rolling statistics in one place.

In [None]:
create_ml_features = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET}.ml_game_prediction_features`
PARTITION BY game_date
CLUSTER BY home_team_id, away_team_id
AS
WITH team_games AS (
  -- Unpivot games to team-level view
  SELECT 
    game_pk,
    game_date,
    season,
    home_team_id AS team_id,
    home_score AS runs_scored,
    away_score AS runs_allowed,
    CASE WHEN home_score > away_score THEN 1 ELSE 0 END AS won
  FROM `{PROJECT_ID}.{DATASET}.games_historical`
  
  UNION ALL
  
  SELECT 
    game_pk,
    game_date,
    season,
    away_team_id AS team_id,
    away_score AS runs_scored,
    home_score AS runs_allowed,
    CASE WHEN away_score > home_score THEN 1 ELSE 0 END AS won
  FROM `{PROJECT_ID}.{DATASET}.games_historical`
),

team_rolling_stats AS (
  -- Calculate rolling statistics per team
  SELECT 
    team_id,
    game_date,
    game_pk,
    
    -- Last 10 games stats, excluding the current game to avoid target leakage
    AVG(won) OVER w10 AS l10_win_pct,
    AVG(runs_scored) OVER w10 AS l10_runs_scored,
    AVG(runs_allowed) OVER w10 AS l10_runs_allowed,
    
    -- Season-to-date stats before the current game
    SUM(won) OVER season_to_date AS season_wins,
    COUNT(*) OVER season_to_date AS games_played,
    SUM(runs_scored) OVER season_to_date AS season_runs_scored,
    SUM(runs_allowed) OVER season_to_date AS season_runs_allowed
    
  FROM team_games
  WINDOW 
    w10 AS (
      PARTITION BY team_id, season 
      -- game_pk breaks ties when a team plays twice on the same date
      ORDER BY game_date, game_pk
      ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING
    ),
    season_to_date AS (
      PARTITION BY team_id, season 
      ORDER BY game_date, game_pk
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    )
)

-- Join everything together in denormalized form
SELECT 
  g.game_pk,
  g.game_date,
  g.season,
  
  -- Home team features (denormalized)
  g.home_team_id,
  h.l10_win_pct AS home_l10_win_pct,
  h.l10_runs_scored AS home_l10_runs_scored,
  h.l10_runs_allowed AS home_l10_runs_allowed,
  h.season_wins AS home_season_wins,
  h.games_played AS home_games_played,
  
  -- Away team features (denormalized)
  g.away_team_id,
  a.l10_win_pct AS away_l10_win_pct,
  a.l10_runs_scored AS away_l10_runs_scored,
  a.l10_runs_allowed AS away_l10_runs_allowed,
  a.season_wins AS away_season_wins,
  a.games_played AS away_games_played,
  
  -- Matchup features (derived)
  h.l10_win_pct - a.l10_win_pct AS win_pct_diff,
  h.l10_runs_scored - a.l10_runs_scored AS runs_scored_diff,
  a.l10_runs_allowed - h.l10_runs_allowed AS pitching_diff,
  
  -- Target variable
  g.home_score,
  g.away_score,
  CASE WHEN g.home_score > g.away_score THEN 1 ELSE 0 END AS home_won,
  
  -- Metadata
  CURRENT_TIMESTAMP() AS feature_created_at
  
FROM `{PROJECT_ID}.{DATASET}.games_historical` g
LEFT JOIN team_rolling_stats h 
  ON g.game_pk = h.game_pk 
  AND g.home_team_id = h.team_id
LEFT JOIN team_rolling_stats a 
  ON g.game_pk = a.game_pk 
  AND g.away_team_id = a.team_id
-- No ORDER BY here: BigQuery cannot write ordered results into a partitioned table
WHERE g.game_date IS NOT NULL
"""

print("Creating ml_game_prediction_features table...")
print("\n⚠️  This creates a denormalized feature table optimized for ML!\n")
print("\n🔍 Review the query structure before running\n")
print(create_ml_features)

# Uncomment when ready (.result() waits for the job to finish):
# result = client.query(create_ml_features).result()
# print("✅ ml_game_prediction_features created successfully")
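
Whether a rolling window includes the current game matters for prediction: a frame that ends at `CURRENT ROW` folds the game's own result into its features, which leaks the target. The sketch below contrasts the two frames on four made-up results for one team, using an in-memory SQLite database as a stand-in for BigQuery (the frame syntax is the same in both; table and column names here are illustrative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team_games (team_id INTEGER, game_no INTEGER, won INTEGER)")
# Made-up results for one team, in order: win, win, loss, win
conn.executemany("INSERT INTO team_games VALUES (?, ?, ?)",
                 [(1, 1, 1), (1, 2, 1), (1, 3, 0), (1, 4, 1)])

rows = conn.execute("""
SELECT
  game_no,
  won,
  AVG(won) OVER with_current AS win_pct_with_current,
  AVG(won) OVER before_game  AS win_pct_before_game
FROM team_games
WINDOW
  with_current AS (PARTITION BY team_id ORDER BY game_no
                   ROWS BETWEEN 9 PRECEDING AND CURRENT ROW),
  before_game  AS (PARTITION BY team_id ORDER BY game_no
                   ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING)
ORDER BY game_no
""").fetchall()

for game_no, won, with_current, before_game in rows:
    print(game_no, won, with_current, before_game)
```

With the prior-only frame, a team's first game of the window has no history and the feature is `NULL`, so downstream queries need to filter or impute those rows.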

### Query the ML Feature Table

Now querying is simple - no joins needed!

In [None]:
ml_query = f"""
SELECT 
  game_date,
  home_team_id,
  away_team_id,
  home_l10_win_pct,
  away_l10_win_pct,
  win_pct_diff,
  home_score,
  away_score,
  home_won
FROM `{PROJECT_ID}.{DATASET}.ml_game_prediction_features`
WHERE season = 2026
  AND home_l10_win_pct IS NOT NULL
  AND away_l10_win_pct IS NOT NULL
ORDER BY game_date DESC
LIMIT 20
"""

print("Querying denormalized ML features (no joins!)...")
# Uncomment after creating the feature table:
# result = run_query(ml_query)
# display(result)

---

## Section 4: Performance Comparison

Let's benchmark normalized vs denormalized queries.

### Benchmark: Normalized Query (Multiple Joins)

In [None]:
import time

# Normalized query with joins
normalized_query = f"""
SELECT 
  f.date_key,
  ht.team_name AS home_team,
  awt.team_name AS away_team,
  d.day_of_week,
  f.home_score,
  f.away_score
FROM `{PROJECT_ID}.{DATASET}.fact_games` f
JOIN `{PROJECT_ID}.{DATASET}.dim_teams` ht ON f.home_team_key = ht.team_key
-- "at" is a reserved keyword in BigQuery, so the away-team alias is "awt"
JOIN `{PROJECT_ID}.{DATASET}.dim_teams` awt ON f.away_team_key = awt.team_key
JOIN `{PROJECT_ID}.{DATASET}.dim_dates` d ON f.date_key = d.date_key
WHERE d.year = 2026
LIMIT 1000
"""

print("⏱️  Running normalized query (with joins)...")
# Uncomment to test:
# start = time.time()
# result_norm = run_query(normalized_query, show_cost=True, limit=5)
# norm_time = time.time() - start
# print(f"Execution time: {norm_time:.2f} seconds")

### Benchmark: Denormalized Query (No Joins)

In [None]:
# Denormalized query - no joins!
denormalized_query = f"""
SELECT 
  game_date,
  home_team_id,
  away_team_id,
  home_l10_win_pct,
  away_l10_win_pct,
  home_score,
  away_score
FROM `{PROJECT_ID}.{DATASET}.ml_game_prediction_features`
WHERE season = 2026
LIMIT 1000
"""

print("⏱️  Running denormalized query (no joins)...")
# Uncomment to test:
# start = time.time()
# result_denorm = run_query(denormalized_query, show_cost=True, limit=5)
# denorm_time = time.time() - start
# print(f"Execution time: {denorm_time:.2f} seconds")
# The two queries return different columns, so treat the ratio as a rough indicator
# print(f"\n🚀 Speedup: {norm_time / denorm_time:.1f}x")

---

## Section 5: Practice Exercises

Now it's your turn to design schemas!

### Exercise 1: Create a Player Dimension Table

Design and create a `dim_players` table with player attributes.

In [None]:
# Exercise 1: Create dim_players
# Hint: Extract unique players from player_stats_historical or similar table
# Include: player_id, full_name, position, bats, throws, etc.

exercise_1_query = """
-- Write your CREATE TABLE query here

"""

# Uncomment when ready (.result() waits for the job to finish):
# result = client.query(exercise_1_query).result()
# print("✅ dim_players created!")

### Exercise 2: Create Player Performance Feature Table

Build a denormalized table for player batting prediction with rolling stats.

In [None]:
# Exercise 2: Create ml_player_batting_features
# Include:
# - player_id, player_name, game_date
# - last_10_games_batting_avg
# - last_10_games_home_runs
# - season_to_date_stats
# - opponent_pitcher_stats

exercise_2_query = """
-- Write your feature engineering query here
-- Use window functions for rolling statistics

"""

# Uncomment when ready (.result() waits for the job to finish):
# result = client.query(exercise_2_query).result()
# print("✅ ml_player_batting_features created!")
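
As a hint for the rolling batting average, note that it is total hits divided by total at-bats over the window, not the mean of per-game averages. The plain-Python sketch below shows the intended result for one player so you can check your SQL against it; the function name and the `(hits, at_bats)` input shape are assumptions for illustration, and the current game is excluded so the feature is known before the game starts.

```python
from collections import deque


def rolling_batting_avg(games, window=10):
    """For each game, return hits / at-bats over the previous `window` games.

    games: list of (hits, at_bats) tuples for one player, in chronological order.
    Returns None where the player has no prior at-bats in the window.
    """
    recent = deque(maxlen=window)  # holds only the most recent prior games
    features = []
    for hits, at_bats in games:
        total_hits = sum(h for h, _ in recent)
        total_at_bats = sum(ab for _, ab in recent)
        features.append(total_hits / total_at_bats if total_at_bats else None)
        recent.append((hits, at_bats))  # the current game counts only for later games
    return features


result = rolling_batting_avg([(1, 4), (0, 3), (2, 5), (3, 4)], window=2)
print(result)
```

In SQL, the same idea would be `SUM(hits) OVER w / SUM(at_bats) OVER w` with a frame that ends at `1 PRECEDING`, guarded against division by zero.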

### Exercise 3: Design Your Own Schema

Choose a prediction task and design an optimal schema for it.

Ideas:
- Pitcher strikeout prediction
- Player home run prediction  
- Team runs scored prediction
- Win streak prediction

In [None]:
# Exercise 3: Design your own ML feature table
# Think about:
# 1. What are you predicting? (target variable)
# 2. What features would be useful?
# 3. What rolling statistics make sense?
# 4. What time windows? (last 5, 10, 20 games?)

exercise_3_query = """
-- Your creative schema design here!

"""

# Uncomment when ready (.result() waits for the job to finish):
# result = client.query(exercise_3_query).result()
# print("✅ Your custom feature table created!")

---

## Summary

**What you learned:**
- ✅ Star schema design (fact + dimension tables)
- ✅ Normalization vs denormalization tradeoffs
- ✅ Creating denormalized ML feature tables
- ✅ Performance benefits of denormalization
- ✅ Designing schemas for specific ML tasks

**Key Insights:**
1. **Normalize for storage**, denormalize for ML queries
2. **Star schema** is ideal for BigQuery analytics
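3. **Denormalized feature tables** skip join and window-function work at read time, which often makes queries noticeably faster; benchmark on your own data to quantify the gain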
4. **Partition and cluster** feature tables for best performance
5. **Window functions** in feature creation enable powerful rolling stats

**Next Steps:**
1. Create the dimension and fact tables for your data
2. Build your first denormalized ML feature table
3. Benchmark query performance improvements
4. Design feature tables for your specific ML use cases
5. **Move to Lesson 4:** Workflow Orchestration

---

**Ready to design lightning-fast ML schemas? Run the exercises above!** 🚀