# Data Collection System Test Notebook (2025 Season)

**Project:** Hank's Tank MLB Data Platform  
**Component:** Data Collection Prototype

This notebook serves as a proof-of-concept for the automated data collection system defined in `docs/DATA_COLLECTION_SYSTEM_DESIGN.md`. 
We will validate the specific API endpoints, parameters, and response structures required to build our historical database.

**Objectives:**
1. Verify connectivity to MLB Stats API and Baseball Savant.
2. Validate that critical parameters (like `qualified=false`) return the expected data volume.
3. Inspect JSON response structures to ensure our schema mapping is correct.
4. Test data availability for a sample date in the 2025 season.

**Test Date:** May 15, 2025 (selected as a typical regular-season day; the schedule query in Section 1 returned 6 games for this date)

In [None]:
%pip install pybaseball

Collecting pybaseball
  Using cached pybaseball-2.2.7-py3-none-any.whl.metadata (11 kB)
Collecting beautifulsoup4>=4.4.0 (from pybaseball)
  Downloading beautifulsoup4-4.14.3-py3-none-any.whl.metadata (3.8 kB)
Collecting lxml>=4.2.1 (from pybaseball)
  Downloading lxml-6.0.2-cp313-cp313-macosx_10_13_universal2.whl.metadata (3.6 kB)
Collecting pygithub>=1.51 (from pybaseball)
  Downloading pygithub-2.8.1-py3-none-any.whl.metadata (3.9 kB)
Collecting scipy>=1.4.0 (from pybaseball)
  Downloading scipy-1.16.3-cp313-cp313-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting matplotlib>=2.0.0 (from pybaseball)
  Downloading matplotlib-3.10.8-cp313-cp313-macosx_11_0_arm64.whl.metadata (52 kB)
Collecting tqdm>=4.50.0 (from pybaseball)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting attrs>=20.3.0 (from pybaseball)
  Downloading attrs-25.4.0-py3-none-any.whl.metadata (10 kB)
Collecting soupsieve>=1.6.1 (from beautifulsoup4>=4.4.0->pybaseball)
  Downloading soupsieve-2.8.1-py

In [25]:
import requests
import pandas as pd
import json
from datetime import datetime
import time
import urllib3

# Disable SSL warnings for testing
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Configuration
BASE_URL = "https://statsapi.mlb.com/api/v1"
TEST_DATE = "2025-05-15"

def print_structure(d, indent=0):
    """
    Recursively prints the structure of a dictionary/list with types and values.
    """
    spacing = '  ' * indent
    if isinstance(d, dict):
        for key, value in d.items():
            if isinstance(value, dict):
                print(f"{spacing}{key} (dict):")
                print_structure(value, indent + 1)
            elif isinstance(value, list):
                print(f"{spacing}{key} (list) [{len(value)} items]:")
                if value:
                    print(f"{spacing}  Sample item:")
                    print_structure(value[0], indent + 2)
            else:
                print(f"{spacing}{key}: {value} ({type(value).__name__})")
    elif isinstance(d, list):
        print(f"{spacing}List [{len(d)} items]:")
        if d:
            print(f"{spacing}  Sample item:")
            print_structure(d[0], indent + 1)
    else:
        print(f"{spacing}{d} ({type(d).__name__})")

def print_json_summary(data, keys_to_show=None):
    """Helper to print JSON structure without flooding the output"""
    if isinstance(data, dict):
        print(f"Keys: {list(data.keys())}")
        if keys_to_show:
            subset = {k: data[k] for k in keys_to_show if k in data}
            print(json.dumps(subset, indent=2))
    elif isinstance(data, list):
        print(f"List Length: {len(data)}")
        if len(data) > 0:
            print("First Item Keys:", list(data[0].keys()))

## 1. Schedule Endpoint Test

**Goal:** Identify all games played on a specific date to drive the rest of the collection process.

**Endpoint:** `/schedule`

**Key Parameters:**
- `sportId=1`: Filters for MLB games only.
- `hydrate=team,linescore,flags,venue,decisions`: This is crucial. It embeds ("hydrates") related objects in the response that would otherwise require separate API calls, so a single request returns venue info, final scores, and winning/losing pitchers.

**Expected Output:** A list of game objects containing the `gamePk` (unique ID) which is required for fetching detailed game feeds.
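As a minimal sketch of how the `gamePk` values could drive later steps, the helper below walks a `/schedule` response and collects every `gamePk`, optionally keeping only finished games. The function name `extract_game_pks` and the `final_only` flag are illustrative assumptions, not part of the design document; the field names (`dates`, `games`, `gamePk`, `status.abstractGameState`) are the ones visible in the response structure printed by the schedule cell.

```python
from typing import Any


def extract_game_pks(schedule_data: dict[str, Any], final_only: bool = False) -> list[int]:
    """Collect gamePk values from every date block in a /schedule response."""
    game_pks: list[int] = []
    for date_block in schedule_data.get("dates", []):
        for game in date_block.get("games", []):
            state = game.get("status", {}).get("abstractGameState")
            if final_only and state != "Final":
                continue  # Skip games that have not finished yet
            game_pks.append(game["gamePk"])
    return game_pks


# Hand-written payload that mimics the shape of a schedule response.
example = {
    "dates": [
        {
            "games": [
                {"gamePk": 777909, "status": {"abstractGameState": "Final"}},
                {"gamePk": 777911, "status": {"abstractGameState": "Live"}},
            ]
        }
    ]
}
print(extract_game_pks(example))                   # [777909, 777911]
print(extract_game_pks(example, final_only=True))  # [777909]
```

Iterating over every entry in `dates` rather than only `dates[0]` keeps the helper usable if the query window is later widened beyond a single day.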

In [29]:
url = f"{BASE_URL}/schedule"
params = {
    "sportId": 1,
    "startDate": TEST_DATE,
    "endDate": TEST_DATE,
    "hydrate": "team,linescore,flags,venue,decisions"
}

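print(f"Fetching schedule for {TEST_DATE}...")
# verify=False bypasses SSL errors in the local environment (testing only)
response = requests.get(url, params=params, verify=False, timeout=30)
response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error body
schedule_data = response.json()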

if 'dates' in schedule_data and len(schedule_data['dates']) > 0:
    games = schedule_data['dates'][0]['games']
    print(f"✅ Success: Found {len(games)} games.")
    
    # Pick a sample game for later tests
    sample_game = games[0]
    print("Example Game Structure:")
    print_structure(sample_game)
    SAMPLE_GAME_PK = sample_game['gamePk']
    SAMPLE_HOME_TEAM_ID = sample_game['teams']['home']['team']['id']
    
    print(f"\nSample Game Info:")
    print(f"Game PK: {SAMPLE_GAME_PK}")
    print(f"Matchup: {sample_game['teams']['away']['team']['name']} @ {sample_game['teams']['home']['team']['name']}")
    print(f"Status: {sample_game['status']['detailedState']}")
    print(f"Game Type: {sample_game['gameType']}")
else:
    print("❌ Error: No games found.")

Fetching schedule for 2025-05-15...
✅ Success: Found 6 games.
Example Game Structure:
gamePk: 777909 (int)
gameGuid: f78bf3a9-58a8-4578-a852-8819c1eeaa0f (str)
link: /api/v1.1/game/777909/feed/live (str)
gameType: R (str)
season: 2025 (str)
gameDate: 2025-05-15T16:15:00Z (str)
officialDate: 2025-05-15 (str)
status (dict):
  abstractGameState: Final (str)
  codedGameState: F (str)
  detailedState: Final (str)
  statusCode: F (str)
  startTimeTBD: False (bool)
  abstractGameCode: F (str)
teams (dict):
  away (dict):
    team (dict):
      springLeague (dict):
        id: 115 (int)
        name: Grapefruit League (str)
        link: /api/v1/league/115 (str)
        abbreviation: GL (str)
      allStarStatus: N (str)
      id: 120 (int)
      name: Washington Nationals (str)
      link: /api/v1/teams/120 (str)
      season: 2025 (int)
      venue (dict):
        id: 3309 (int)
        name: Nationals Park (str)
        link: /api/v1/venues/3309 (str)
      springVenue (dict):
        id: 5

## 2. Full Game Feed Test (The "Firehose")

**Goal:** Retrieve the complete data package for a single game, including every play, pitch, and stat.

**Endpoint:** `/game/{gamePk}/feed/live`

**Structure:**
- `gameData`: Static metadata (teams, venue, start time, weather, players involved).
- `liveData`: Dynamic data.
    - `linescore`: Inning-by-inning scores.
    - `boxscore`: Final stats for every player.
    - `plays`: Detailed play-by-play log (used as a fallback if Statcast fails).

This endpoint provides the "source of truth" for game outcomes and traditional stats.
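To make the `liveData.plays` structure concrete, here is a small sketch that tallies completed plays by `result.eventType`. The helper name `summarize_plays` is hypothetical; the keys it reads (`allPlays`, `about.isComplete`, `result.eventType`) match the play structure printed by the game feed cell. A tally like this could serve as a quick sanity check before relying on play-by-play as a Statcast fallback.

```python
from collections import Counter
from typing import Any


def summarize_plays(game_feed: dict[str, Any]) -> Counter:
    """Count completed plays in a game feed by result.eventType."""
    plays = game_feed.get("liveData", {}).get("plays", {}).get("allPlays", [])
    counts: Counter = Counter()
    for play in plays:
        if not play.get("about", {}).get("isComplete", False):
            continue  # Ignore a play that is still in progress
        counts[play.get("result", {}).get("eventType", "unknown")] += 1
    return counts


# Hand-written feed fragment with three completed plays and one in progress.
example_feed = {
    "liveData": {
        "plays": {
            "allPlays": [
                {"result": {"eventType": "strikeout"}, "about": {"isComplete": True}},
                {"result": {"eventType": "single"}, "about": {"isComplete": True}},
                {"result": {"eventType": "strikeout"}, "about": {"isComplete": True}},
                {"result": {}, "about": {"isComplete": False}},
            ]
        }
    }
}
print(summarize_plays(example_feed).most_common())
# [('strikeout', 2), ('single', 1)]
```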

In [27]:
url = f"https://statsapi.mlb.com/api/v1.1/game/{SAMPLE_GAME_PK}/feed/live"

print(f"Fetching game feed for Game PK {SAMPLE_GAME_PK}...")
response = requests.get(url, verify=False, timeout=30)
response.raise_for_status()  # Fail loudly on HTTP errors
game_feed = response.json()

if 'gameData' in game_feed and 'liveData' in game_feed:
    print("✅ Success: Game feed structure valid.")
    
    # Check Boxscore
    boxscore = game_feed['liveData']['boxscore']
    home_batters = boxscore['teams']['home']['batters']
    away_batters = boxscore['teams']['away']['batters']
    print(f"Boxscore: {len(home_batters)} home batters, {len(away_batters)} away batters.")
    
    # Check Linescore
    linescore = game_feed['liveData']['linescore']
    print(f"Linescore: {linescore['teams']['home']['runs']} - {linescore['teams']['away']['runs']}")
    
    # Check for Play-by-Play (Statcast fallback)
    plays = game_feed['liveData']['plays']['allPlays']
    print(f"Play-by-Play: {len(plays)} events found.")
    if len(plays) > 0:
        print("Example Play Structure:")
        print_structure(plays[0])
else:
    print("❌ Error: Invalid game feed structure.")

Fetching game feed for Game PK 777909...
✅ Success: Game feed structure valid.
Boxscore: 13 home batters, 14 away batters.
Linescore: 5 - 2
Play-by-Play: 71 events found.
Example Play Structure:
result (dict):
  type: atBat (str)
  event: Strikeout (str)
  eventType: strikeout (str)
  description: CJ Abrams strikes out swinging. (str)
  rbi: 0 (int)
  awayScore: 0 (int)
  homeScore: 0 (int)
  isOut: True (bool)
about (dict):
  atBatIndex: 0 (int)
  halfInning: top (str)
  isTopInning: True (bool)
  inning: 1 (int)
  startTime: 2025-05-15T16:15:22.772Z (str)
  endTime: 2025-05-15T16:15:56.479Z (str)
  isComplete: True (bool)
  isScoringPlay: False (bool)
  hasReview: False (bool)
  hasOut: True (bool)
  captivatingIndex: 14 (int)
count (dict):
  balls: 0 (int)
  strikes: 3 (int)
  outs: 1 (int)
matchup (dict):
  batter (dict):
    id: 682928 (int)
    fullName: CJ Abrams (str)
    link: /api/v1/people/682928 (str)
  batSide (dict):
    code: L (str)
    description: Left (str)
  pitcher

## 3. Player Stats Endpoint Test

**Goal:** Fetch daily performance stats for **every** player who appeared in a game, not just the stars.

**Endpoint:** `/stats`

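**Critical Configuration:**
- `qualified=false`: **Most important.** Stats queries often default to "qualified" players only (for hitters, 3.1 plate appearances per team game). We set this to `false` to capture relievers, pinch hitters, and defensive replacements.
- `stats=byDateRange` with `startDate`/`endDate`: Requests stats for a specific date window (here a single day) rather than season totals. The original plan was `stats=gameLog`; the cells in this section use `byDateRange` to get all players.
- `limit=1000`: A high page size so that results are not truncated on busy days.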
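One way to check that these parameters really return every player is to compare the player IDs in the `/stats` response against the batters listed in each game's boxscore. The sketch below is a hedged illustration: the function names are hypothetical, and it assumes that `liveData.boxscore.teams.<side>.batters` is a list of integer player IDs (the notebook only confirms that it is a list). The boxscore list may also include players who never came to the plate, so a non-empty result is a prompt to investigate rather than proof of missing data.

```python
from typing import Any, Iterable


def boxscore_batter_ids(game_feed: dict[str, Any]) -> set[int]:
    """Player IDs listed under boxscore 'batters' for both teams (assumed ints)."""
    teams = game_feed.get("liveData", {}).get("boxscore", {}).get("teams", {})
    ids: set[int] = set()
    for side in ("home", "away"):
        ids.update(teams.get(side, {}).get("batters", []))
    return ids


def stats_player_ids(stats_data: dict[str, Any]) -> set[int]:
    """Player IDs present in the splits of a /stats response."""
    ids: set[int] = set()
    for group in stats_data.get("stats", []):
        for split in group.get("splits", []):
            ids.add(split["player"]["id"])
    return ids


def missing_from_stats(game_feeds: Iterable[dict[str, Any]], stats_data: dict[str, Any]) -> set[int]:
    """Boxscore batter IDs that do not appear in the stats response."""
    expected: set[int] = set()
    for feed in game_feeds:
        expected |= boxscore_batter_ids(feed)
    return expected - stats_player_ids(stats_data)


# Placeholder IDs (1, 2, 3) stand in for real player IDs.
feeds = [{"liveData": {"boxscore": {"teams": {"home": {"batters": [1, 2]}, "away": {"batters": [3]}}}}}]
stats = {"stats": [{"splits": [{"player": {"id": 1}}, {"player": {"id": 3}}]}]}
print(missing_from_stats(feeds, stats))  # {2}
```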

In [36]:
url = f"{BASE_URL}/stats"
params = {
    "stats": "byDateRange", # Changed from 'gameLog' to 'byDateRange' to get all players
    "group": "hitting", 
    "gameType": "R",
    "startDate": TEST_DATE, # 'byDateRange' requires start/end dates
    "endDate": TEST_DATE,
    "limit": 1000,
    "qualified": "false"
}

print(f"Fetching hitting stats for {TEST_DATE}...")
response = requests.get(url, params=params, verify=False, timeout=30)
response.raise_for_status()  # Fail loudly on HTTP errors
stats_data = response.json()

if 'stats' in stats_data and len(stats_data['stats']) > 0:
    stat_group = stats_data['stats'][0]
    if 'splits' in stat_group:
        splits = stat_group['splits']
        print(f"✅ Success: Found {len(splits)} player records.")
        
        # Verify we have some data
        if len(splits) > 0:
            sample = splits[0]
            print("\nSample Player Stat:")
            print(f"Player: {sample['player']['fullName']} (ID: {sample['player']['id']})")
            print(f"Team: {sample['team']['name']}")
            print("Stats Structure:")
            print(json.dumps(sample['stat'], indent=2))
    else:
        print("⚠️ Warning: Stats group found but no 'splits' key.")
else:
    print("❌ Error: No stats found. Response keys:", list(stats_data.keys()))

Fetching hitting stats for 2025-05-15...
✅ Success: Found 75 player records.

Sample Player Stat:
Player: Hyeseong Kim (ID: 808975)
Team: Los Angeles Dodgers
Stats Structure:
{
  "gamesPlayed": 1,
  "groundOuts": 0,
  "airOuts": 0,
  "runs": 4,
  "doubles": 1,
  "triples": 0,
  "homeRuns": 0,
  "strikeOuts": 0,
  "baseOnBalls": 2,
  "intentionalWalks": 0,
  "hits": 3,
  "hitByPitch": 0,
  "avg": "1.000",
  "atBats": 3,
  "obp": "1.000",
  "slg": "1.333",
  "ops": "2.333",
  "caughtStealing": 0,
  "stolenBases": 1,
  "stolenBasePercentage": "1.000",
  "caughtStealingPercentage": ".000",
  "groundIntoDoublePlay": 0,
  "groundIntoTriplePlay": 0,
  "numberOfPitches": 21,
  "plateAppearances": 5,
  "totalBases": 4,
  "rbi": 2,
  "leftOnBase": 0,
  "sacBunts": 0,
  "sacFlies": 0,
  "babip": "1.000",
  "groundOutsToAirouts": "-.--",
  "catchersInterference": 0,
  "atBatsPerHomeRun": "-.--"
}
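The stat line above mixes integers, numeric strings such as `"1.000"`, and placeholder strings such as `"-.--"`, which matters for schema mapping. The following is a minimal sketch of one possible normalization step; the helper names and the `KEEP_AS_TEXT` set are assumptions made for illustration. `inningsPitched` is deliberately left as text because baseball notation like `"6.1"` means six and one-third innings, not the decimal 6.1.

```python
from typing import Any, Optional, Union

Number = Union[int, float]

# Fields whose string form is not a plain decimal (assumption: extend as needed).
KEEP_AS_TEXT = {"inningsPitched"}


def parse_stat_value(value: Any) -> Optional[Number]:
    """Return a number for numeric values, or None for placeholders like '-.--'."""
    if isinstance(value, (int, float)):
        return value
    if isinstance(value, str):
        text = value.strip()
        if not any(ch.isdigit() for ch in text):
            return None  # Placeholders such as "-.--" and ".---" contain no digits
        try:
            return float(text)
        except ValueError:
            return None
    return None


def normalize_stat_line(stat: dict[str, Any]) -> dict[str, Any]:
    """Normalize every field of a stat dict, leaving KEEP_AS_TEXT fields untouched."""
    return {
        key: (value if key in KEEP_AS_TEXT else parse_stat_value(value))
        for key, value in stat.items()
    }


sample_stat = {"hits": 3, "avg": "1.000", "atBatsPerHomeRun": "-.--", "inningsPitched": "1.0"}
print(normalize_stat_line(sample_stat))
# {'hits': 3, 'avg': 1.0, 'atBatsPerHomeRun': None, 'inningsPitched': '1.0'}
```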


In [37]:
url = f"{BASE_URL}/stats"
params = {
    "stats": "byDateRange", # Changed from 'gameLog' to 'byDateRange' to get all players
    "group": "pitching", 
    "gameType": "R",
    "startDate": TEST_DATE, # 'byDateRange' requires start/end dates
    "endDate": TEST_DATE,
    "limit": 1000,
    "qualified": "false"
}

print(f"Fetching pitching stats for {TEST_DATE}...")
response = requests.get(url, params=params, verify=False)
stats_data = response.json()

if 'stats' in stats_data and len(stats_data['stats']) > 0:
    stat_group = stats_data['stats'][0]
    if 'splits' in stat_group:
        splits = stat_group['splits']
        print(f"✅ Success: Found {len(splits)} player records.")
        
        # Verify we have some data
        if len(splits) > 0:
            sample = splits[0]
            print("\nSample Player Stat:")
            print(f"Player: {sample['player']['fullName']} (ID: {sample['player']['id']})")
            print(f"Team: {sample['team']['name']}")
            print("Stats Structure:")
            print(json.dumps(sample['stat'], indent=2))
    else:
        print("⚠️ Warning: Stats group found but no 'splits' key.")
else:
    print("❌ Error: No stats found. Response keys:", list(stats_data.keys()))

Fetching pitching stats for 2025-05-15...
✅ Success: Found 34 player records.

Sample Player Stat:
Player: Shawn Armstrong (ID: 542888)
Team: Texas Rangers
Stats Structure:
{
  "gamesPlayed": 1,
  "gamesStarted": 0,
  "groundOuts": 0,
  "airOuts": 3,
  "runs": 0,
  "doubles": 0,
  "triples": 0,
  "homeRuns": 0,
  "strikeOuts": 0,
  "baseOnBalls": 1,
  "intentionalWalks": 0,
  "hits": 0,
  "hitByPitch": 0,
  "avg": ".000",
  "atBats": 3,
  "obp": ".250",
  "slg": ".000",
  "ops": ".250",
  "caughtStealing": 0,
  "stolenBases": 0,
  "stolenBasePercentage": ".---",
  "caughtStealingPercentage": ".---",
  "groundIntoDoublePlay": 0,
  "numberOfPitches": 13,
  "era": "0.00",
  "inningsPitched": "1.0",
  "outsPitched": 3,
  "wins": 0,
  "losses": 0,
  "saves": 1,
  "saveOpportunities": 1,
  "holds": 0,
  "blownSaves": 0,
  "earnedRuns": 0,
  "whip": "1.00",
  "battersFaced": 4,
  "outs": 3,
  "gamesPitched": 1,
  "completeGames": 0,
  "shutouts": 0,
  "balls": 6,
  "strikes": 7,
  "strikePerce

## 4. Transactions Endpoint Test

**Goal:** Track player movement to maintain accurate rosters and injury status.

**Endpoint:** `/transactions`

**Use Case:**
We use this to detect:
- Roster moves (Call-ups, options)
- Injuries (IL placements)
- Trades
- DFAs and Releases

This data is vital for feature engineering (e.g., "Is the starting pitcher coming off the IL?").
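As a rough sketch of how such a feature could be derived, the helper below inspects a transaction's `description` text for injured-list moves. This is a text heuristic, not a documented API contract: the function name `classify_il_move` is hypothetical, and the "placed" wording is an assumption about how IL placements are phrased (the sample output in this notebook only shows an activation).

```python
import re
from typing import Any, Optional

# Matches phrases like "10-day injured list" and captures the number of days.
IL_PATTERN = re.compile(r"(\d+)-day injured list", re.IGNORECASE)


def classify_il_move(transaction: dict[str, Any]) -> Optional[tuple[str, int]]:
    """Return (direction, days) for injured-list moves, or None for anything else."""
    description = transaction.get("description", "")
    match = IL_PATTERN.search(description)
    if match is None:
        return None
    days = int(match.group(1))
    lowered = description.lower()
    if "activated" in lowered:
        return ("activated", days)
    if "placed" in lowered:  # Assumed wording for IL placements
        return ("placed", days)
    return ("other", days)


example_tx = {
    "typeCode": "SC",
    "description": "Tampa Bay Rays activated RF Josh Lowe from the 10-day injured list.",
}
print(classify_il_move(example_tx))  # ('activated', 10)
```

Because the check relies on free text, it is worth spot-checking against real responses before using it for feature engineering.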

In [39]:
url = f"{BASE_URL}/transactions"
params = {
    "sportId": 1,
    "startDate": TEST_DATE,
    "endDate": TEST_DATE
}

print(f"Fetching transactions for {TEST_DATE}...")
response = requests.get(url, params=params, verify=False, timeout=30)
response.raise_for_status()  # Fail loudly on HTTP errors
trans_data = response.json()

if 'transactions' in trans_data:
    print(f"✅ Success: Found {len(trans_data['transactions'])} transactions.")
    if len(trans_data['transactions']) > 0:
        print("Sample Transaction Structure:")
        print(json.dumps(trans_data['transactions'][0], indent=2))
else:
    print("⚠️ Note: No transactions found (this might be normal for some dates).")

Fetching transactions for 2025-05-15...
✅ Success: Found 30 transactions.
Sample Transaction Structure:
{
  "id": 837759,
  "person": {
    "id": 666139,
    "fullName": "Josh Lowe",
    "link": "/api/v1/people/666139"
  },
  "toTeam": {
    "id": 139,
    "name": "Tampa Bay Rays",
    "link": "/api/v1/teams/139"
  },
  "date": "2025-05-15",
  "effectiveDate": "2025-05-15",
  "resolutionDate": "2025-05-15",
  "typeCode": "SC",
  "typeDesc": "Status Change",
  "description": "Tampa Bay Rays activated RF Josh Lowe from the 10-day injured list."
}


## 5. Rosters Endpoint Test

**Goal:** Get a snapshot of who is actually available to play for a team on a given day.

**Endpoint:** `/teams/{teamId}/roster`

**Parameters:**
- `rosterType=active`: Fetches the 26-man active roster.
- `date`: Allows us to look up historical rosters (time-travel), which is essential for training models on past data.
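Because the `date` parameter returns a roster snapshot for any day, two snapshots can be compared to see who joined or left in between. The sketch below assumes two roster responses have already been fetched; `roster_ids` and `roster_diff` are hypothetical helper names, and the IDs `1` and `2` in the example are placeholders rather than real players.

```python
from typing import Any


def roster_ids(roster_data: dict[str, Any]) -> set[int]:
    """Player IDs present in a /teams/{teamId}/roster response."""
    return {entry["person"]["id"] for entry in roster_data.get("roster", [])}


def roster_diff(earlier: dict[str, Any], later: dict[str, Any]) -> dict[str, list[int]]:
    """Players added to or removed from the roster between two snapshots."""
    before, after = roster_ids(earlier), roster_ids(later)
    return {"added": sorted(after - before), "removed": sorted(before - after)}


# Hand-written snapshots; 700363 is the sample roster entry from this notebook.
snapshot_a = {"roster": [{"person": {"id": 700363}}, {"person": {"id": 1}}]}
snapshot_b = {"roster": [{"person": {"id": 700363}}, {"person": {"id": 2}}]}
print(roster_diff(snapshot_a, snapshot_b))
# {'added': [2], 'removed': [1]}
```

A diff like this could be cross-referenced with the transactions endpoint to explain each roster change.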

In [41]:
url = f"{BASE_URL}/teams/{SAMPLE_HOME_TEAM_ID}/roster"
params = {
    "rosterType": "active",
    "date": TEST_DATE
}

print(f"Fetching roster for Team ID {SAMPLE_HOME_TEAM_ID} on {TEST_DATE}...")
response = requests.get(url, params=params, verify=False, timeout=30)
response.raise_for_status()  # Fail loudly on HTTP errors
roster_data = response.json()

if 'roster' in roster_data:
    print(f"✅ Success: Found {len(roster_data['roster'])} players on active roster.")
    if len(roster_data['roster']) > 0:
        print("Sample Roster Entry Structure:")
        print(json.dumps(roster_data['roster'][0], indent=2))
else:
    print("❌ Error: Roster fetch failed.")

Fetching roster for Team ID 144 on 2025-05-15...
✅ Success: Found 26 players on active roster.
Sample Roster Entry Structure:
{
  "person": {
    "id": 700363,
    "fullName": "AJ Smith-Shawver",
    "link": "/api/v1/people/700363"
  },
  "jerseyNumber": "32",
  "position": {
    "code": "1",
    "name": "Pitcher",
    "type": "Pitcher",
    "abbreviation": "P"
  },
  "status": {
    "code": "A",
    "description": "Active"
  },
  "parentTeamId": 144
}


## 6. Statcast Data Test (pybaseball)

**Goal:** Retrieve advanced pitch-by-pitch physics data (velocity, spin rate, exit velocity, etc.).

**Source:** Baseball Savant (via `pybaseball` library)

**Why pybaseball?**
While the MLB API contains some pitch data, Baseball Savant provides the Statcast-specific metrics (effective velocity, break angle, catch probability) in a format that is easier to analyze for ML.

**Note:** This step fetches data from external servers and may take a few seconds.
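Since each game is fetched from an external server, a per-game loop with a short pause and a few retries is one reasonable pattern. The sketch below is an assumption about how this could be structured, not the collection system's actual implementation: `fetch_each` is a hypothetical helper that takes any callable, so `statcast_single_game` from `pybaseball` could be passed as `fetch`. The example uses a fake fetcher so that it runs without network access.

```python
import time
from typing import Any, Callable, Iterable


def fetch_each(
    keys: Iterable[Any],
    fetch: Callable[[Any], Any],
    delay_seconds: float = 0.5,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[dict, dict]:
    """Call fetch(key) for each key with retries; return (results, failures)."""
    results: dict = {}
    failures: dict = {}
    for key in keys:
        for attempt in range(1, max_attempts + 1):
            try:
                results[key] = fetch(key)
                failures.pop(key, None)  # Clear an error from an earlier attempt
                break
            except Exception as exc:
                failures[key] = exc
                if attempt < max_attempts:
                    # Exponential backoff before the next attempt
                    sleep(delay_seconds * 2 ** (attempt - 1))
        sleep(delay_seconds)  # Pause between keys to be polite to the server
    return results, failures


def fake_fetch(game_pk: int) -> str:
    """Stand-in for a real fetcher; fails for one placeholder key."""
    if game_pk == -1:
        raise RuntimeError("simulated failure")
    return f"data for {game_pk}"


results, failures = fetch_each([777909, -1], fake_fetch, delay_seconds=0.0, max_attempts=2)
print(sorted(results))   # [777909]
print(sorted(failures))  # [-1]
```

Injecting `sleep` as a parameter keeps the helper easy to test, and returning failures separately lets a later step decide whether to fall back to the play-by-play data.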

In [46]:
import requests
import urllib3

# Suppress warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Monkey-patch requests to force verify=False for pybaseball
# This is necessary because pybaseball doesn't expose a verify parameter
_orig_get = requests.get
_orig_session_request = requests.Session.request

def patched_get(*args, **kwargs):
    kwargs['verify'] = False
    return _orig_get(*args, **kwargs)

def patched_session_request(self, *args, **kwargs):
    kwargs['verify'] = False
    return _orig_session_request(self, *args, **kwargs)

requests.get = patched_get
requests.Session.request = patched_session_request

try:
    from pybaseball import statcast_single_game
    print("✅ pybaseball is installed.")
    
    # Ensure we have game PKs from the schedule step
    if 'games' in locals() and len(games) > 0:
        game_pks = [g['gamePk'] for g in games]
        print(f"Found {len(game_pks)} games to fetch. Processing individually...")
        
        statcast_frames = []
        
        for pk in game_pks:
            print(f"  Fetching Statcast for Game {pk}...", end=" ")
            try:
                # Fetch data for single game
                game_df = statcast_single_game(pk)
                
                if not game_df.empty:
                    statcast_frames.append(game_df)
                    print(f"✅ {len(game_df)} records")
                else:
                    print("⚠️ No data")
                
                # Sleep briefly to be nice to the API
                time.sleep(0.5)
                
            except Exception as e:
                print(f"❌ Error: {e}")

        if statcast_frames:
            df = pd.concat(statcast_frames)
            print(f"\n✅ Success: Fetched total {len(df)} pitch records.")
            print("\nSample Data:")
            print(df[['game_pk', 'batter', 'pitcher', 'events', 'launch_speed', 'launch_angle']].head(3))
        else:
            print("\n⚠️ Warning: No data returned for any game.")
            
    else:
        print("⚠️ No games found in 'games' variable. Run the Schedule cell first.")
        
except ImportError:
    print("⚠️ pybaseball not installed. Skipping this test.")
    print("To install: %pip install pybaseball")
except Exception as e:
    print(f"❌ Error fetching Statcast data: {e}")
finally:
    # Restore original requests methods to avoid side effects later
    requests.get = _orig_get
    requests.Session.request = _orig_session_request

✅ pybaseball is installed.
Found 6 games to fetch. Processing individually...
  Fetching Statcast for Game 777909... 

  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)


✅ 267 records
  Fetching Statcast for Game 777911... 

✅ 289 records
  Fetching Statcast for Game 777912... 

✅ 246 records
  Fetching Statcast for Game 777913... 

✅ 268 records
  Fetching Statcast for Game 777910... 

✅ 200 records
  Fetching Statcast for Game 777946... 

✅ 320 records

✅ Success: Fetched total 1590 pitch records.

Sample Data:
     game_pk  batter  pitcher                     events  launch_speed  \
178   777909  677588   628452  grounded_into_double_play          70.3   
191   777909  677588   628452                        NaN           NaN   
197   777909  669743   628452                  force_out          78.2   

     launch_angle  
178         -15.0  
191           NaN  
197         -12.0  
