# NBA Player Performance Prediction: Data Acquisition Module - Usage & Design

**Version:** 1.1
**Date:** 2025-05-10
**Author:** [Your Name/Team Name]

## 1. Introduction

This document explains how to use the data acquisition Python script for the NBA Player Performance Prediction framework. It also records the key programming choices made during development and the reasoning behind them, so it can double as presentation material.

The primary goal of this module is to:
* Fetch comprehensive NBA data (player stats, player info, game details, team info).
* Utilize the official BallDontLie API via its Python SDK.
* Process and store this data efficiently in Parquet format for downstream analysis and model training.

## 2. How to Use the Script

This script runs in any environment where Python and the required libraries are installed, such as a Jupyter Notebook or a standalone Python interpreter.

### 2.1. Prerequisites

* **Python 3.7+**
* **Required Libraries:**
    * `pandas`
    * `numpy`
    * `balldontlie-py` (the official BallDontLie SDK)
    * `pyarrow`
    Install them via pip:
    ```bash
    pip install pandas numpy balldontlie-py pyarrow
    ```
* **BallDontLie API Key:**
    1.  Obtain a free API key from [https://www.balldontlie.io/#introduction](https://www.balldontlie.io/#introduction).
    2.  Set this key as an environment variable named `BALLDONTLIE_API_KEY`.
        * **Why environment variables?** This is a security best practice. It avoids hardcoding sensitive credentials directly into the script, making the code safer to share and manage across different environments (development, production).
        * **How to set (example for Conda):**
            ```bash
            conda activate your_env_name
            conda env config vars set BALLDONTLIE_API_KEY="YOUR_ACTUAL_API_KEY"
            conda deactivate
            conda activate your_env_name
            ```
            (Replace `your_env_name` and `YOUR_ACTUAL_API_KEY` accordingly).
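
Before running the script, it can help to confirm that the variable is actually visible to the Python process. The snippet below is a minimal sketch of such a check; the helper name `require_api_key` is hypothetical and not part of the script.

```python
import os


def require_api_key(var_name: str = "BALLDONTLIE_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Set it and restart the kernel or shell."
        )
    return key


if __name__ == "__main__":
    try:
        key = require_api_key()
        # Never print the key itself; its length is enough to confirm it loaded.
        print(f"API key loaded ({len(key)} characters).")
    except RuntimeError as err:
        print(f"CRITICAL ERROR: {err}")
```

Note that a Jupyter kernel started before the variable was set will not see it until the kernel is restarted.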

### 2.2. Script Configuration

Before running the main fetching functions, you can adjust these global variables at the top of the script:

* `RAW_DATA_DIR`, `PROCESSED_DATA_DIR`, `MODELS_DIR`, `REPORTS_DIR`: These define the directory structure. The script will create these if they don't exist.
    * **Why this structure?** A standardized directory layout promotes organization and makes it easy to locate raw data, processed datasets, trained models, and generated reports. This is crucial for reproducibility and collaboration.
* `TARGET_STATS`: A list of player statistics (e.g., `['pts', 'reb', 'ast']`) that are the primary focus for prediction. This helps in keeping the project scope defined.
* `SEASONS_TO_FETCH`: A list of NBA seasons (e.g., `[2023, 2024, 2025]`) for which to acquire data. The BallDontLie API identifies a season by the year in which it *starts* (e.g., 2023 for the 2023-2024 NBA season).

### 2.3. Running the Script / Fetching Data

The script is typically run by executing its cells in a Jupyter Notebook or by running the Python file. The core functionality lies in the `fetch_*` functions.

**Example Usage (within the `if __name__ == "__main__":` block or a notebook cell):**

1.  **Ensure SDK Initialization:**
    The script automatically attempts to initialize the `sdk_api_client`. If successful, it prints:
    `BalldontlieAPI SDK client initialized.`
    If it fails (e.g., API key not set), it prints a critical error, and the fetching functions will return empty lists.

2.  **Calling Fetch Functions:**
    Each `fetch_*` function retrieves a specific type of data.

    * **Fetch Player Game Statistics:**
        ```python
        # To fetch stats for seasons defined in SEASONS_TO_FETCH:
        list_player_stats = fetch_player_game_stats(seasons=SEASONS_TO_FETCH)
        
        # Or, to fetch for a specific date range:
        # list_player_stats_dates = fetch_player_game_stats(start_date="2023-10-24", end_date="2023-10-30")
        
        if list_player_stats:
            df_player_stats = pd.DataFrame(list_player_stats)
            print("\n--- Player Game Stats ---")
            print(f"Number of records: {len(df_player_stats)}")
            print("Columns in df_player_stats:")
            print(df_player_stats.columns.tolist())
            print(df_player_stats.head())
        ```
        * **Output:** Saves data to `data/raw/player_game_stats_seasons_YYYY_YYYY.parquet` (or `data/raw/player_game_stats_dates_START_to_END.parquet` for a date range) and returns a list of dictionaries, which the example converts to `df_player_stats`.

    * **Fetch All Players Data:**
        ```python
        list_all_players = fetch_all_players_data()
        # To search for a specific player:
        # list_lebron = fetch_all_players_data(search="LeBron James")
        
        if list_all_players:
            df_all_players = pd.DataFrame(list_all_players)
            print("\n--- All Players ---")
            # ... (print info as above) ...
        ```
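        * **Output:** Saves to `data/raw/all_players_data_sdk.parquet` (or `data/raw/all_players_data_search_<term>_sdk.parquet` when `search` is given, with spaces in the term replaced by underscores) and returns player data.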

    * **Fetch Games Data:**
        ```python
        list_games = fetch_games_data(seasons=SEASONS_TO_FETCH)
        if list_games:
            df_games = pd.DataFrame(list_games)
            print("\n--- Games ---")
            # ... (print info as above) ...
        ```
        * **Output:** Saves to `data/raw/games_data_seasons_YYYY_sdk.parquet` and returns game schedule/results.

    * **Fetch All Teams Data:**
        ```python
        list_all_teams = fetch_all_teams_data()
        if list_all_teams:
            df_all_teams = pd.DataFrame(list_all_teams)
            print("\n--- All Teams ---")
            # ... (print info as above) ...
        ```
        * **Output:** Saves to `data/raw/all_teams_data_sdk.parquet` and returns team information.
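
The output filenames follow directly from the arguments passed to each fetch function. The sketch below reproduces the naming rule for player stats in isolation; `stats_parquet_path` is a hypothetical helper written for illustration, and unlike the script (which prints an error and returns an empty list) it raises `ValueError` when no criteria are given.

```python
import os
from typing import List, Optional

RAW_DATA_DIR = "data/raw"  # same default as the script's configuration


def stats_parquet_path(seasons: Optional[List[int]] = None,
                       start_date: Optional[str] = None,
                       end_date: Optional[str] = None) -> str:
    """Build the player-stats output path from the fetch criteria."""
    parts = ["player_game_stats"]
    if start_date and end_date:  # a date range takes precedence over seasons
        parts.append(f"dates_{start_date}_to_{end_date}")
    elif seasons:
        parts.append("seasons_" + "_".join(str(s) for s in seasons))
    else:
        raise ValueError("Provide seasons or both start_date and end_date.")
    return os.path.join(RAW_DATA_DIR, "_".join(parts) + ".parquet")


print(stats_parquet_path(seasons=[2023, 2024]))
print(stats_parquet_path(start_date="2023-10-24", end_date="2023-10-30"))
```

Because the path encodes the exact criteria, requesting a different set of seasons or a different date range produces a separate file rather than overwriting an existing one.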

### 2.4. Output Data

* **Console Logs:** The script provides print statements indicating its progress, any errors, and confirmation of saved files.
* **Parquet Files:** The primary output. Raw, flattened data is saved into the `data/raw/` directory.
    * **Why Parquet?** Parquet is a columnar storage format optimized for analytical queries. It offers:
        * **Efficiency:** Smaller file sizes due to good compression.
        * **Performance:** Faster read times for analytical queries, especially when selecting a subset of columns, as only the required columns are read.
        * **Schema Support:** Stores schema metadata with the data.
    This choice is beneficial for handling potentially large datasets and for efficient use in subsequent data analysis and machine learning pipelines.

## 3. Key Programming Choices & Rationale

### 3.1. Official BallDontLie SDK (`balldontlie-py`)

* **Choice:** Instead of making direct HTTP requests using libraries like `requests`, the official SDK is used.
* **Why?**
    * **Abstraction:** The SDK handles the low-level details of API endpoint URLs, request formatting, and often some basic error handling/retry logic. This simplifies our data acquisition code.
    * **Maintenance:** If the API changes slightly, the SDK maintainers are likely to update it, reducing our maintenance burden.
    * **Pydantic Models:** The SDK returns data as Pydantic model instances (e.g., `NBAStats`, `NBAPlayer`). This provides type safety and structured objects, making it easier to understand the data structure directly in Python.

### 3.2. Helper Function for Pagination (`_fetch_all_pages_sdk`)

* **Choice:** A generic helper function to handle API endpoints that return data in multiple pages.
* **Why?**
    * **DRY (Don't Repeat Yourself):** Many API endpoints (like player stats, games, all players) are paginated using a similar "cursor" mechanism. This helper centralizes the pagination logic, avoiding repetitive code in each main `fetch_*` function.
    * **Modularity:** Makes the main fetching functions cleaner and focused on their specific data type and parameters.
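
The cursor loop can be sketched without network access. In the example below, `fake_list` is a stand-in for an SDK list method and `_FAKE_PAGES` is invented data; the only assumption carried over from the script is that each response exposes `.data` and `.meta.next_cursor`.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Optional

# Three fake "pages" keyed by cursor; None is the first request.
_FAKE_PAGES = {
    None: (["a", "b"], 2),
    2: (["c", "d"], 4),
    4: (["e"], None),
}


def fake_list(per_page: int = 100, cursor: Optional[int] = None) -> SimpleNamespace:
    """Stand-in for an SDK list method; returns an object with .data and .meta."""
    data, next_cursor = _FAKE_PAGES[cursor]
    return SimpleNamespace(data=data, meta=SimpleNamespace(next_cursor=next_cursor))


def fetch_all_pages(list_method: Callable[..., Any],
                    base_params: Optional[Dict[str, Any]] = None,
                    per_page: int = 100) -> List[Any]:
    """Follow next_cursor until the API reports no further page."""
    items: List[Any] = []
    cursor = None
    while True:
        params = dict(base_params or {}, per_page=per_page)
        if cursor is not None:
            params["cursor"] = cursor
        response = list_method(**params)
        items.extend(response.data)
        cursor = getattr(response.meta, "next_cursor", None)
        if cursor is None:
            break
    return items


all_items = fetch_all_pages(fake_list, per_page=2)
print(all_items)  # ['a', 'b', 'c', 'd', 'e']
```

Passing the list method as a callable is what lets one helper serve every paginated endpoint.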

### 3.3. Helper Function for Data Flattening (`_parse_and_flatten_data`)

* **Choice:** A generic function to convert lists of SDK Pydantic model instances (which can have nested model attributes) into a flat list of Python dictionaries.
* **Why?**
    * **Tabular Format for Analysis:** Pandas DataFrames, which are excellent for analysis, work best with flat, tabular data. API responses are often nested JSON.
    * **Handling Nested SDK Objects:** The SDK returns Pydantic objects, where nested data (e.g., player details within a stat record) are themselves Pydantic objects. This function:
        1.  Accesses top-level attributes directly from the main Pydantic object (e.g., `item_obj.pts`).
        2.  Accesses nested Pydantic objects (e.g., `item_obj.player`).
        3.  Converts these nested Pydantic objects to dictionaries (using `.to_dict()` or `.model_dump()`).
        4.  Extracts specified attributes from these nested dictionaries.
        5.  Creates new, prefixed column names for these flattened attributes (e.g., `player_id`, `team_abbreviation`) to avoid name collisions and maintain clarity.
    * **Consistency:** Provides a consistent way to process different types of SDK objects into a ready-to-use format for DataFrame creation.
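
The flattening steps above can be sketched on a made-up record. Here `SimpleNamespace` stands in for the SDK's model classes, and the sketch reads nested attributes directly instead of converting them to dictionaries first, so it illustrates the resulting shape rather than the script's exact code path.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def flatten(item: Any, main_fields: List[str],
            nested_fields_map: Dict[str, List[str]]) -> Dict[str, Any]:
    """Flatten one object with nested attributes into a single-level dict."""
    row: Dict[str, Any] = {f: getattr(item, f, None) for f in main_fields}
    for nested_name, attrs in nested_fields_map.items():
        nested = getattr(item, nested_name, None)
        for attr in attrs:
            # Prefix with the parent name so 'id' becomes 'player_id', etc.
            row[f"{nested_name}_{attr}"] = getattr(nested, attr, None)
    return row


# Made-up record; a missing nested object still yields its columns, set to None.
stat = SimpleNamespace(
    id=101,
    pts=30,
    player=SimpleNamespace(id=7, last_name="Example"),
    team=None,
)

row = flatten(stat, ["id", "pts"],
              {"player": ["id", "last_name"], "team": ["abbreviation"]})
print(row)
```

Keeping the same set of keys in every row matters later, because DataFrame and Arrow table construction both work best when all rows share one schema.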

### 3.4. Local Caching for Player Stats

* **Choice:** The `fetch_player_game_stats` function checks if a Parquet file for the requested seasons/dates already exists and loads it, rather than re-fetching from the API.
* **Why?**
    * **Efficiency:** Player game stats can be a large dataset. Re-fetching it every time is time-consuming and puts unnecessary load on the API.
    * **API Rate Limits:** Reduces the number of API calls, helping to stay within potential rate limits.
    * **Development Speed:** Allows for faster iteration during development and analysis, as data can be loaded quickly from local storage.
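
A minimal sketch of this load-or-fetch pattern follows. To stay dependency-free it caches to JSON instead of Parquet, and both `load_or_fetch` and `fake_fetch` are hypothetical names used only for illustration.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List


def load_or_fetch(cache_path: str,
                  fetch: Callable[[], List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Return cached records if the file is readable; otherwise fetch and cache."""
    if os.path.exists(cache_path):
        try:
            with open(cache_path, "r", encoding="utf-8") as fh:
                return json.load(fh)
        except (OSError, json.JSONDecodeError) as err:
            print(f"Cache unreadable ({err}); re-fetching.")
    records = fetch()
    if records:  # do not cache an empty result
        with open(cache_path, "w", encoding="utf-8") as fh:
            json.dump(records, fh)
    return records


calls = {"count": 0}


def fake_fetch() -> List[Dict[str, Any]]:
    """Stand-in for the API call; counts how often it is invoked."""
    calls["count"] += 1
    return [{"id": 1, "pts": 20}]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stats_cache.json")
    first = load_or_fetch(path, fake_fetch)   # calls the fake API, writes the file
    second = load_or_fetch(path, fake_fetch)  # served from the file

print(calls["count"])  # 1
```

One consequence of checking only for the file's existence is that the cache never expires: for a season still in progress, the cached file has to be deleted to pick up newly played games.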

### 3.5. Modular Fetch Functions (`fetch_player_game_stats`, `fetch_all_players_data`, etc.)

* **Choice:** Separate functions for each distinct data entity.
* **Why?**
    * **Clarity & Organization:** Each function has a clear responsibility.
    * **Reusability:** These functions can be called independently as needed.
    * **Maintainability:** Easier to debug or modify logic for a specific data type without affecting others. Each function defines its own `main_fields` and `nested_fields_map` tailored to the API response structure for that entity.

### 3.6. Error Handling

* **Choice:** Basic `try-except` blocks are used for SDK initialization, API calls within helper functions, and file operations (Parquet saving/loading).
* **Why?**
    * **Robustness:** Prevents the script from crashing due to common issues like network problems, API errors, or file system errors.
    * **Informative Feedback:** Prints error messages to the console, helping to diagnose problems.
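
The pattern amounts to "catch, report, return an empty list". The sketch below isolates it; `safe_fetch` and `failing_call` are hypothetical names, and the broad `except Exception` mirrors the script's approach rather than recommending it in general.

```python
from typing import Any, Callable, List


def safe_fetch(label: str, call: Callable[[], List[Any]]) -> List[Any]:
    """Run a fetch callable; on any exception, report it and return an empty list."""
    try:
        return call()
    except Exception as err:  # broad on purpose, mirroring the script's approach
        print(f"ERROR: {label} failed: {err}")
        return []


def failing_call() -> List[Any]:
    """Simulate a network problem."""
    raise ConnectionError("simulated network failure")


ok = safe_fetch("demo", lambda: [1, 2, 3])
failed = safe_fetch("demo", failing_call)
print(ok, failed)  # [1, 2, 3] []
```

A trade-off of this design is that an empty list can mean either "no matching data" or "the call failed", so callers that need to tell the two apart have to rely on the console output.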

### 3.7. Type Hinting

* **Choice:** Using Python's `typing` module for type hints (e.g., `List[Dict[str, Any]]`).
* **Why?**
    * **Readability:** Makes the expected types of function parameters and return values clear.
    * **Maintainability & Error Prevention:** Helps catch type-related errors early, especially with static analysis tools (like MyPy).

## 4. Conclusion

This data acquisition script provides a structured, robust, and efficient way to gather NBA data. The design choices prioritize security (environment variables), code organization (helper functions, modularity), efficiency (Parquet, local caching), and maintainability (SDK usage, type hinting). This foundation enables reliable data collection for the subsequent stages of the NBA Player Performance Prediction framework.

In [2]:
# NBA Player Performance Prediction Framework
# This script integrates data loading functionality directly for use in a Jupyter Notebook.
# It uses the official BallDontLie SDK and PyArrow for efficient Parquet storage.

# --- Project Setup & Configuration ---

# Standard library imports
import os # For interacting with the operating system, like accessing environment variables (e.g., API keys) and managing file paths.
import time # For adding delays if needed (e.g., for API rate limiting, though currently minimal in this script).
import datetime # For handling date and time information, which can be useful for filtering data or naming files.
from typing import List, Dict, Any, Tuple, Optional # For type hinting, which improves code readability, helps catch errors, and aids static analysis.

# Third-party library imports
import pandas as pd # Pandas is a powerful library for data manipulation and analysis. It's used here to convert the fetched data into DataFrames for easier processing AFTER parsing.
import numpy as np # NumPy is used for numerical operations and is a core dependency for Pandas. It provides efficient array operations.
# import requests # This line would be for direct HTTP calls, but the BallDontLie SDK handles this internally, simplifying our code.
from balldontlie import BalldontlieAPI # The official Python SDK (Software Development Kit) for interacting with the BallDontLie API, abstracting away direct API call complexities.

# PyArrow for Parquet handling
import pyarrow as pa # PyArrow provides a cross-language development platform for in-memory data. It's used here for its efficient data structures when creating Parquet files.
import pyarrow.parquet as pq # Specifically used for reading and writing data in the Parquet format, a columnar storage file format optimized for big data processing and analytics.


# --- Configuration ---
# API Key: Fetched from an environment variable for security to avoid hardcoding sensitive information like API keys directly into the script.
BALLDONTLIE_API_KEY = os.getenv('BALLDONTLIE_API_KEY')

# Data Storage Paths: Define standardized locations for clear project structure and efficient management of inputs and outputs.
RAW_DATA_DIR = "data/raw" # Directory for storing raw data fetched directly from the API (e.g., as Parquet files).
PROCESSED_DATA_DIR = "data/processed" # Directory for storing data that has been cleaned, transformed, or feature-engineered.
MODELS_DIR = "models" # Directory for storing trained machine learning models.
REPORTS_DIR = "reports" # Directory for storing outputs like Exploratory Data Analysis (EDA) reports, model evaluation summaries, etc.

# Iterate over the configured directory paths and create each one with `os.makedirs()` if it doesn't already exist.
# `exist_ok=True` ensures that if a directory already exists, the function doesn't raise an error, allowing the script to be run multiple times if needed. 
for dir_path in [RAW_DATA_DIR, PROCESSED_DATA_DIR, MODELS_DIR, REPORTS_DIR]:
    os.makedirs(dir_path, exist_ok=True)

# Target Player Statistics: Define the primary outcomes the project aims to predict.
# This list specifies which statistical categories are of interest for our prediction models.
TARGET_STATS = ['pts', 'reb', 'ast'] # Points, Rebounds, Assists.

# Seasons to Fetch: Specify the NBA seasons for which data should be acquired. The API identifies a season by its starting year (e.g., 2023 for the 2023-2024 season).
SEASONS_TO_FETCH = [2021, 2022, 2023, 2024, 2025]

# --- SDK Initialization ---
# Initialize the BallDontLieAPI client. This object will be used for all interactions with the API.
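# Reuse a client left over from an earlier run of this block (e.g., a re-executed Jupyter cell); otherwise start from None,
# which lets the script run even if initialization fails, though API calls will then be skipped.
sdk_api_client = globals().get("sdk_api_client")
# True only when a client already exists, so initialization and its message happen once per session.
# (Unconditionally resetting this flag to False here would make the check below always pass.)
_sdk_initialized_flag_main = sdk_api_client is not None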

# Check if the API key was successfully loaded from the environment variable.
if not BALLDONTLIE_API_KEY:
    # If the API key is not found, print a critical error message.
    # Data fetching relies on this key, so the script cannot proceed with API calls without it.
    print("CRITICAL ERROR: BALLDONTLIE_API_KEY environment variable not set.")
else:
    # If the API key is present, proceed with SDK initialization.
    # The `_sdk_initialized_flag_main` prevents re-initializing if already done.
    if not _sdk_initialized_flag_main:
        try:
            # Attempt to create an instance of the BalldontlieAPI client, passing the API key.
            sdk_api_client = BalldontlieAPI(api_key=BALLDONTLIE_API_KEY)
            print("BalldontlieAPI SDK client initialized.") # Confirmation message.
            _sdk_initialized_flag_main = True # Set the flag to True after successful initialization.
        except Exception as e:
            # If any error occurs during SDK initialization, print an error message.
            print(f"ERROR: Failed to initialize BalldontlieAPI SDK: {e}")
            # `sdk_api_client` will remain None, and subsequent checks for its existence will prevent API calls.


# --- Data Acquisition Helper Functions ---

def _fetch_all_pages_sdk(sdk_list_method_callable,
                         base_params: Optional[Dict[str, Any]] = None,
                         per_page: int = 100) -> List[Any]:
    """
    Fetches all data from a paginated API endpoint using the provided SDK method.
    This function is chosen for endpoints that return data in pages, abstracting the pagination logic.
    The BallDontLie API uses cursor-based pagination.
    """
    # If the SDK client isn't initialized (e.g., missing API key), we can't make calls.
    if not sdk_api_client:
        print("SDK Helper: SDK client not initialized. Cannot fetch pages.")
        return [] # Return empty list to signify no data can be fetched.

    all_items_data = [] # This list will accumulate data from all pages.
    current_cursor = None # Initialize cursor; the first API call doesn't use one.

    # Ensure base_params is a dictionary to avoid errors if None is passed.
    if base_params is None:
        base_params = {}

    # Try to get a readable name for the SDK method for better error messages.
    method_name = "SDK_call" # Default if introspection fails.
    if hasattr(sdk_list_method_callable, '__name__'):
        method_name = sdk_list_method_callable.__name__
    # If it's a bound method, include the class name (e.g., "Players.list").
    if hasattr(sdk_list_method_callable, '__self__') and hasattr(sdk_list_method_callable.__self__, '__class__'):
        method_name = f"{sdk_list_method_callable.__self__.__class__.__name__}.{method_name}"

    # Loop to fetch data page by page.
    while True:
        params_for_sdk = base_params.copy() # Use a copy to avoid modifying the original base_params.
        params_for_sdk['per_page'] = per_page # Set how many items to fetch per request.
        if current_cursor is not None: # If we have a cursor from a previous page,
            params_for_sdk['cursor'] = current_cursor # add it to get the next page.

        try:
            # Make the actual API call using the SDK method provided.
            # `**params_for_sdk` unpacks the dictionary into keyword arguments.
            response = sdk_list_method_callable(**params_for_sdk)
        except Exception as e:
            # If the API call itself fails (network error, API server error not caught by SDK).
            print(f"ERROR: SDK call failed for {method_name}: {e}. Params: {params_for_sdk}")
            return [] # Return empty to indicate failure.

        # Process the SDK's response.
        # We expect the SDK to return an object that has a 'data' attribute which is a list of items.
        if response and hasattr(response, 'data') and isinstance(response.data, list):
            all_items_data.extend(response.data) # Add items from the current page to our main list.

            # Check for pagination metadata to see if there's a next page.
            if hasattr(response, 'meta') and response.meta:
                next_cursor_val = getattr(response.meta, 'next_cursor', None) # Get 'next_cursor' if it exists.
                if next_cursor_val is not None:
                    current_cursor = next_cursor_val # Update cursor for the next loop iteration.
                else:
                    break # No 'next_cursor' means we've reached the last page.
            else:
                break # No 'meta' object, assume no more pages.

        else:
            # The response is missing or has no list-valued 'data' attribute.
            # (An empty 'data' list is handled by the branch above, which stops once no 'next_cursor' is returned.)
            print(f"SDK Helper: Failed to fetch data or invalid response structure for {method_name}.")
            break # Exit loop if response is malformed.
    return all_items_data


def _parse_and_flatten_data(sdk_objects: List[Any],
                            main_fields: List[str],
                            nested_fields_map: Optional[Dict[str, List[str]]] = None
                            ) -> List[Dict[str, Any]]:
    """
    Parses a list of SDK Pydantic model instances (e.g., NBAStats, NBAPlayer) into a 
    flattened list of Python dictionaries. This is chosen because SDKs often return objects,
    and for tabular analysis (like with Pandas), a flat list of dictionaries is ideal.
    Nested Pydantic objects are converted to dictionaries and their specified attributes are extracted.
    """
    if not sdk_objects: # If there are no SDK objects to parse, return an empty list.
        # print("Debug: _parse_and_flatten_data received empty sdk_objects.") # For debugging
        return []

    processed_rows: List[Dict[str, Any]] = [] # This will store our flattened rows.
    if nested_fields_map is None: # Ensure nested_fields_map is a dictionary for consistent handling.
        nested_fields_map = {}

    for i, item_obj in enumerate(sdk_objects): # `item_obj` is typically a Pydantic model instance from the SDK.
        row: Dict[str, Any] = {} # Initialize an empty dictionary for the current flat row.
        
        # Populate main fields by directly accessing attributes of the Pydantic model instance.
        # This is preferred over converting the whole `item_obj` to a dict first if we only need specific fields
        # and know the structure (as defined by Pydantic models).
        for field in main_fields:
            if hasattr(item_obj, field): # Check if the Pydantic object has the attribute.
                row[field] = getattr(item_obj, field) # Get the attribute's value.
            else:
                row[field] = None # Assign None if a main field is unexpectedly missing.

        # Process and flatten specified nested Pydantic model objects.
        for nested_name, nested_attrs_to_extract in nested_fields_map.items():
            # `nested_name` (e.g., 'player', 'team') is an attribute of `item_obj` that holds another Pydantic model.
            # `nested_attrs_to_extract` is a list of attributes we want from that nested model.
            # Pre-fill every prefixed column with None so all rows share the same keys, even when the
            # nested object is missing. This matters because pa.Table.from_pylist() takes its column
            # names from the first row when no schema is supplied.
            for attr_name in nested_attrs_to_extract:
                row.setdefault(f'{nested_name}_{attr_name}', None)

            if hasattr(item_obj, nested_name): # Check if the main object has the nested object attribute.
                nested_pydantic_object = getattr(item_obj, nested_name) # Get the nested Pydantic model instance.

                if nested_pydantic_object is not None:
                    # The nested object needs to be converted to a dictionary to easily .get() its attributes.
                    # Pydantic models typically have .model_dump() (v2) or .dict() (v1).
                    # The balldontlie-py SDK might provide a .to_dict() alias for this.
                    nested_dict_to_extract_from = None
                    if hasattr(nested_pydantic_object, 'to_dict'): # Check for SDK's specific method first.
                        try:
                            nested_dict_to_extract_from = nested_pydantic_object.to_dict()
                        except Exception as e_sdk_to_dict:
                            print(f"Warning: item {i}, nested_name '{nested_name}', .to_dict() failed: {e_sdk_to_dict}")
                    elif hasattr(nested_pydantic_object, 'model_dump'): # Standard Pydantic v2 method.
                        try:
                            nested_dict_to_extract_from = nested_pydantic_object.model_dump(exclude_none=True)
                        except Exception as e_model_dump:
                             print(f"Warning: item {i}, nested_name '{nested_name}', .model_dump() failed: {e_model_dump}")
                    elif isinstance(nested_pydantic_object, dict): # If it was already a dictionary (e.g. from a raw API response if not using SDK models directly).
                         nested_dict_to_extract_from = nested_pydantic_object

                    # If we successfully obtained a dictionary for the nested object:
                    if nested_dict_to_extract_from:
                        for attr_name in nested_attrs_to_extract:
                            # Create a prefixed column name (e.g., 'player_id', 'team_abbreviation').
                            # Use .get() for safety, returning None if an attribute is missing from the nested dict.
                            row[f'{nested_name}_{attr_name}'] = nested_dict_to_extract_from.get(attr_name)

        processed_rows.append(row) # Add the fully processed (flattened) row.
        
    return processed_rows

# --- Main Data Fetching Functions ---

def fetch_player_game_stats(
    seasons: Optional[List[int]] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    per_page: int = 100,
    save_to_parquet: bool = True
) -> List[Dict[str, Any]]:
    """
    Fetches player game-by-game statistics.
    This function is chosen to encapsulate all logic for acquiring this specific dataset,
    including API parameter handling, filename generation, local caching (reading from Parquet),
    and parsing the specific structure of player stats responses.
    """
    # Ensure SDK client is available before making API calls.
    if not sdk_api_client:
        print("SDK client not initialized. Cannot fetch player game stats.")
        return []

    base_params: Dict[str, Any] = {} # Dictionary to hold parameters for the SDK call.
    filename_parts = ["player_game_stats"] # Base for constructing the output Parquet filename.
    log_criteria = "" # String for logging the criteria used for fetching.

    # Determine fetch criteria: date range takes precedence over seasons if both are provided.
    # This logic defines how the API will be queried and how the output file will be named.
    if start_date and end_date:
        base_params["start_date"] = start_date
        base_params["end_date"] = end_date
        filename_parts.append(f"dates_{start_date}_to_{end_date}") # Add date range to filename for uniqueness.
        log_criteria = f"from {start_date} to {end_date}"
    elif seasons:
        base_params["seasons"] = seasons # Use 'seasons' parameter for the API.
        filename_parts.append(f"seasons_{'_'.join(map(str, seasons))}") # Add seasons to filename.
        log_criteria = f"for season(s): {seasons}"
    else:
        # It's important to have some filter (seasons or date range) for stats to avoid overly broad queries.
        print("ERROR: Must provide season(s) or start_date/end_date for player stats.")
        return [] # Return empty if no valid criteria.
    
    parquet_file_path = os.path.join(RAW_DATA_DIR, f"{'_'.join(filename_parts)}.parquet")

    # Optimization: If data already exists in a Parquet file, load it to save API calls and time.
    # This "caching" is useful for development and re-running analyses without re-fetching.
    if os.path.exists(parquet_file_path):
        print(f"Loading stats {log_criteria} from {parquet_file_path}...")
        try:
            table = pq.read_table(parquet_file_path) # Read the Parquet file into a PyArrow Table.
            loaded_data = table.to_pylist() # Convert the PyArrow Table to a list of Python dictionaries.
            print(f"Loaded {len(loaded_data)} records.")
            return loaded_data # Return the loaded data.
        except Exception as e:
            # If loading from Parquet fails (e.g., corrupted file), print an error and proceed to re-fetch from API.
            print(f"Error loading Parquet {parquet_file_path}: {e}. Re-fetching.")
            # Do not return here; proceed to fetch the data.

    print(f"Fetching stats from API {log_criteria}...")
    # Fetch raw SDK objects (list of NBAStats Pydantic models) using the helper function.
    # `sdk_api_client.nba.stats.list` is the specific SDK method for fetching game statistics.
    sdk_objects = _fetch_all_pages_sdk(sdk_api_client.nba.stats.list,
                                       base_params=base_params, per_page=per_page)
    if not sdk_objects: # If no data is returned from the API (e.g., no games in the specified range).
        print(f"No stats data fetched from API for {log_criteria}.")
        return [] # Return empty list.

    # Define fields for parsing player_game_stats.
    # These are the direct attributes of an `NBAStats` Pydantic model.
    main_stat_fields = [
        'id', 'min', 'fgm', 'fga', 'fg_pct', 'fg3m', 'fg3a', 'fg3_pct',
        'ftm', 'fta', 'ft_pct', 'oreb', 'dreb', 'reb', 'ast', 'stl',
        'blk', 'turnover', 'pf', 'pts'
    ]
    # Define nested Pydantic model attributes within an `NBAStats` object and the fields to extract from them.
    # `player`, `team`, and `game` are attributes of `NBAStats` that are instances of `NBAPlayer`, `NBATeam`, `NBAGame`.
    nested_stat_fields = {
        'player': ['id', 'first_name', 'last_name', 'position', 'team_id'],
        'team': ['id', 'abbreviation', 'full_name'],
        'game': ['id', 'date', 'season', 'home_team_id', 'visitor_team_id', 'postseason']
    }
    # Parse the raw SDK objects (Pydantic models) into a list of flattened dictionaries.
    list_of_dicts = _parse_and_flatten_data(sdk_objects, main_stat_fields, nested_stat_fields)
    
    if not list_of_dicts: # If parsing results in no data (e.g., SDK objects were empty or couldn't be parsed).
        print(f"Parsing resulted in no data for {log_criteria}.")
        return []

    # Save the processed data to a Parquet file if requested and if data exists.
    # Parquet is chosen for its efficiency in storage and query performance.
    if save_to_parquet:
        try:
            if list_of_dicts: # Ensure there's data to write.
                # Create a PyArrow Table from the list of dictionaries. This is an efficient in-memory columnar format.
                table = pa.Table.from_pylist(list_of_dicts)
                # Write the PyArrow Table to a Parquet file.
                pq.write_table(table, parquet_file_path)
                print(f"Saved {len(list_of_dicts)} stats records to {parquet_file_path}")
        except Exception as e: # If writing to Parquet fails.
            print(f"ERROR: Could not write Parquet {parquet_file_path}: {e}")
    return list_of_dicts # Return the list of dictionaries.

def fetch_all_players_data(search: Optional[str] = None, save_to_parquet: bool = True) -> List[Dict[str,Any]]:
    """ 
    Fetches data for NBA players using the SDK. Can optionally search for specific players.
    The structure of player data often includes a nested 'team' object representing the player's current team.
    Returns a list of dictionaries, where each dictionary represents a player.
    """
    if not sdk_api_client: # Ensure SDK is initialized.
        print("SDK client not initialized. Cannot fetch players data.")
        return []
    print(f"Fetching players data (SDK){f' matching {search}' if search else ''}...")
    base_params = {} # Initialize parameters for the API call.
    if search: # If a search term is provided, add it to the parameters.
        base_params["search"] = search
    
    # Fetch player objects from the API. `sdk_objects` will be a list of `NBAPlayer` Pydantic models.
    sdk_objects = _fetch_all_pages_sdk(sdk_api_client.nba.players.list, base_params=base_params)
    
    # Define the main fields to extract directly from each `NBAPlayer` Pydantic model.
    main_player_fields = ['id', 'first_name', 'last_name', 'position', 'height', 'weight',
                          'jersey_number', 'college', 'country', 'draft_year', 
                          'draft_round', 'draft_number']
    # Define the nested 'team' object (an `NBATeam` model) and the fields to extract from it.
    nested_player_fields = {
        'team': ['id', 'abbreviation', 'full_name', 'conference', 'division', 'city']
    }
    # Parse the fetched SDK objects (Pydantic models) into a list of flattened dictionaries.
    list_of_dicts = _parse_and_flatten_data(sdk_objects, main_player_fields, nested_player_fields)
    
    # If data was fetched and `save_to_parquet` is true, save it.
    if list_of_dicts and save_to_parquet:
        # Create a suffix for the filename if a search term was used, to distinguish it.
        suffix = f"_search_{search.replace(' ','_')}" if search else ""
        filepath = os.path.join(RAW_DATA_DIR, f"all_players_data{suffix}_sdk.parquet")
        try:
            table = pa.Table.from_pylist(list_of_dicts) # Convert list of dicts to PyArrow Table.
            pq.write_table(table, filepath) # Write table to Parquet file.
            print(f"Saved {len(list_of_dicts)} players records to {filepath}")
        except Exception as e: # Catch any error during Parquet saving.
            print(f"ERROR saving players Parquet: {e}")
    elif not list_of_dicts: # If no data was fetched/parsed.
        print("No players data fetched.")
    return list_of_dicts # Return the list of player dictionaries.

def fetch_games_data(seasons: Optional[List[int]] = None, 
                     start_date: Optional[str] = None,
                     end_date: Optional[str] = None,
                     save_to_parquet: bool = True) -> List[Dict[str,Any]]:
    """ 
    Fetches game details (schedule, scores, etc.) using the SDK.
    Game data includes nested 'home_team' and 'visitor_team' objects.
    Returns a list of dictionaries, where each dictionary represents a game.
    """
    if not sdk_api_client: # Ensure SDK is initialized.
        print("SDK client not initialized. Cannot fetch games data.")
        return []
    
    log_parts = ["Fetching game details (SDK)"] # For constructing a descriptive log message.
    base_params: Dict[str, Any] = {} # API parameters.
    filename_parts = ["games_data"] # Base for Parquet filename.

    # Set API parameters and filename parts based on the provided arguments.
    # A date range takes precedence over `seasons`, and both dates are required for it to apply.
    if bool(start_date) != bool(end_date):
        print("Warning: start_date and end_date must be provided together; ignoring the date filter.")
    if start_date and end_date:
        base_params["start_date"] = start_date
        base_params["end_date"] = end_date
        filename_parts.append(f"dates_{start_date}_to_{end_date}")
        log_parts.append(f"from {start_date} to {end_date}")
    elif seasons:
        base_params["seasons"] = seasons
        filename_parts.append(f"seasons_{'_'.join(map(str, seasons))}")
        log_parts.append(f"for seasons: {seasons}")
    else:
        # With no filter, the API may return all games, which can be a very large dataset.
        log_parts.append("for all available games (caution: may be a large query)")
        filename_parts.append("all_time")  # Filename for all-time data.

    print(f"{' '.join(log_parts)}...")
    
    # Fetch game objects. `sdk_objects` will be a list of `NBAGame` Pydantic models.
    sdk_objects = _fetch_all_pages_sdk(sdk_api_client.nba.games.list, base_params=base_params)

    # Define main fields for `NBAGame` Pydantic models.
    main_game_fields = ['id', 'date', 'season', 'status', 'period', 'time', 
                        'postseason', 'home_team_score', 'visitor_team_score']
    # Define nested 'home_team' and 'visitor_team' objects (which are `NBATeam` models) and fields to extract.
    nested_game_fields = {
        'home_team': ['id', 'abbreviation', 'full_name', 'conference', 'division', 'city'],
        'visitor_team': ['id', 'abbreviation', 'full_name', 'conference', 'division', 'city']
    }
    # Parse the Pydantic models into flattened dictionaries.
    list_of_dicts = _parse_and_flatten_data(sdk_objects, main_game_fields, nested_game_fields)

    # Save to Parquet if requested and data exists.
    if list_of_dicts and save_to_parquet:
        filepath = os.path.join(RAW_DATA_DIR, f"{'_'.join(filename_parts)}_sdk.parquet")
        try:
            table = pa.Table.from_pylist(list_of_dicts)  # Convert list of dicts to a PyArrow Table.
            pq.write_table(table, filepath)  # Write the table to a Parquet file.
            print(f"Saved {len(list_of_dicts)} games records to {filepath}")
        except Exception as e:  # Catch any error during Parquet saving.
            print(f"ERROR saving games Parquet: {e}")
    elif not list_of_dicts:  # If no data was fetched/parsed.
        print("No games data fetched.")
    return list_of_dicts  # Return the list of game dictionaries.


def fetch_all_teams_data(save_to_parquet: bool = True) -> List[Dict[str,Any]]:
    """ 
    Fetches data for all NBA teams using the SDK.
    Team data from the API is generally flat (not deeply nested).
    The SDK method is called directly rather than through `_fetch_all_pages_sdk`,
    because the teams endpoint does not support pagination parameters such as 'per_page'.
    Returns a list of dictionaries, where each dictionary represents a team.
    """
    if not sdk_api_client: # Ensure SDK is initialized.
        print("SDK client not initialized. Cannot fetch teams data.")
        return []
    print("Fetching all teams data (SDK)...")
    
    sdk_objects_raw_data = [] # Initialize to an empty list to store raw data from SDK.
    try:
        # Call `teams.list()` directly: this endpoint typically returns all teams at once
        # and does not take pagination parameters such as `per_page` or `cursor`.
        response_obj = sdk_api_client.nba.teams.list()  # SDK response wrapper; teams are expected in `.data`.
        
        # The SDK response object should have a 'data' attribute containing the list of team Pydantic models.
        if response_obj and hasattr(response_obj, 'data') and isinstance(response_obj.data, list):
            sdk_objects_raw_data = response_obj.data  # NBATeam Pydantic model instances.
            print(f"Successfully fetched {len(sdk_objects_raw_data)} team objects from API.")
        else:
            print("Warning: No 'data' attribute in teams response or 'data' is not a list.")
            if response_obj:
                # Log the raw response for debugging.
                print(f"Raw response from teams.list(): {response_obj}")
            
    except TypeError as te:
        # Raised, for example, if teams.list() is called with arguments it does not accept.
        # No arguments are passed above, so this is a defensive guard.
        print(f"TypeError during direct call to teams.list(): {te}")
        return []
    except Exception as e:
        # Catch any other exception raised during the API call.
        print(f"Error fetching teams data directly: {e}")
        return []

    if not sdk_objects_raw_data: # If no objects were retrieved.
        print("No team sdk_objects were successfully retrieved from the API call.")
        # The "No teams data fetched." message will be printed later if list_of_dicts is empty.

    # Define fields for `NBATeam` Pydantic models. Team data is usually flat.
    main_team_fields = ['id', 'conference', 'division', 'city', 'name', 'full_name', 'abbreviation']
    # Parse the Pydantic models (NBATeam instances) into flattened dictionaries.
    # An empty `nested_fields_map` ({}) is passed as team objects are not expected to have further nested structures that need specific flattening by this parser.
    list_of_dicts = _parse_and_flatten_data(sdk_objects_raw_data, main_team_fields, {}) 

    if list_of_dicts and save_to_parquet:
        filepath = os.path.join(RAW_DATA_DIR, "all_teams_data_sdk.parquet")
        try:
            table = pa.Table.from_pylist(list_of_dicts)  # Convert list of dicts to a PyArrow Table.
            pq.write_table(table, filepath)  # Write the table to a Parquet file.
            print(f"Saved {len(list_of_dicts)} teams records to {filepath}")
        except Exception as e:  # Catch any error during Parquet saving.
            print(f"ERROR saving teams Parquet: {e}")
    elif not list_of_dicts:
        # Distinguish a parsing failure from an empty or failed API call.
        if sdk_objects_raw_data:
            print("Parsing team data resulted in no records (check _parse_and_flatten_data or raw object structure).")
        else:
            print("No teams data fetched from API (API call might have returned empty or failed).")
    return list_of_dicts  # Return the list of team dictionaries.

BalldontlieAPI SDK client initialized.
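
The fetch functions above pass a list of top-level fields and a map of nested objects to `_parse_and_flatten_data`. The following is a minimal, standalone sketch of that flattening idea, using plain dictionaries rather than the SDK's Pydantic models. The helper name `flatten_record` and the `<parent>_<field>` column naming are assumptions made for illustration; the module's own parser may name its columns differently.

```python
from typing import Any, Dict, List


def flatten_record(record: Dict[str, Any],
                   main_fields: List[str],
                   nested_fields_map: Dict[str, List[str]]) -> Dict[str, Any]:
    """Hypothetical helper: copy top-level fields and lift nested fields to prefixed columns."""
    flat: Dict[str, Any] = {field: record.get(field) for field in main_fields}
    for parent, fields in nested_fields_map.items():
        nested = record.get(parent) or {}  # Tolerate a missing or None nested object.
        for field in fields:
            # Assumed naming convention: "<parent>_<field>", e.g. "team_abbreviation".
            flat[f"{parent}_{field}"] = nested.get(field)
    return flat


# Sample record shaped like a single player entry with a nested 'team' object.
player = {
    "id": 1, "first_name": "Alex", "last_name": "Abrines", "position": "G",
    "team": {"id": 21, "abbreviation": "OKC", "full_name": "Oklahoma City Thunder"},
}

flat_player = flatten_record(
    player,
    main_fields=["id", "first_name", "last_name", "position"],
    nested_fields_map={"team": ["id", "abbreviation", "full_name"]},
)
print(flat_player)
```

Flat, prefixed columns of this kind map directly onto a PyArrow table, which is why the nested `team`, `home_team`, and `visitor_team` objects are flattened before the Parquet write.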

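`fetch_games_data` derives both its API parameters and its output filename from whichever filter is supplied: a complete date range takes precedence over seasons, and with no filter it falls back to an all-time query. That selection logic can be sketched on its own, without the SDK, roughly as follows. `build_games_query` is a hypothetical name, and raising `ValueError` for a half-specified date range is an assumption of this sketch rather than the module's behavior.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_games_query(seasons: Optional[List[int]] = None,
                      start_date: Optional[str] = None,
                      end_date: Optional[str] = None) -> Tuple[Dict[str, Any], str]:
    """Hypothetical helper: return (API parameters, Parquet filename) for a games request."""
    if bool(start_date) != bool(end_date):
        # Assumption: treat a half-specified date range as a caller error.
        raise ValueError("start_date and end_date must be provided together")

    params: Dict[str, Any] = {}
    filename_parts = ["games_data"]
    if start_date and end_date:
        params["start_date"] = start_date
        params["end_date"] = end_date
        filename_parts.append(f"dates_{start_date}_to_{end_date}")
    elif seasons:
        params["seasons"] = seasons
        filename_parts.append("seasons_" + "_".join(map(str, seasons)))
    else:
        filename_parts.append("all_time")  # No filter: potentially a very large query.
    return params, "_".join(filename_parts) + "_sdk.parquet"


params, filename = build_games_query(seasons=[2022, 2023])
print(params, filename)
```

Keeping this logic in a pure function makes the precedence rules easy to unit-test without an API key or network access.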

In [3]:
print("\n--- Starting Data Acquisition ---")

# fetch_all_players_data fetches general information about all players.
list_all_players = fetch_all_players_data()

# fetch_games_data fetches details for games in the specified seasons.
list_games = fetch_games_data(seasons=SEASONS_TO_FETCH)

# fetch_all_teams_data fetches information about all NBA teams.
list_all_teams = fetch_all_teams_data()

# fetch_player_game_stats fetches player statistics for each game in the specified seasons.
list_player_stats = fetch_player_game_stats(seasons=SEASONS_TO_FETCH)


--- Starting Data Acquisition ---
Fetching players data (SDK)...
{'data': [{'id': 1, 'first_name': 'Alex', 'last_name': 'Abrines', 'position': 'G', 'height': '6-6', 'weight': '190', 'jersey_number': '8', 'college': 'FC Barcelona', 'country': 'Spain', 'draft_year': 2013, 'draft_round': 2, 'draft_number': 32, 'team': {'id': 21, 'conference': 'West', 'division': 'Northwest', 'city': 'Oklahoma City', 'name': 'Thunder', 'full_name': 'Oklahoma City Thunder', 'abbreviation': 'OKC'}}, {'id': 2, 'first_name': 'Jaylen', 'last_name': 'Adams', 'position': 'G', 'height': '6-0', 'weight': '225', 'jersey_number': '10', 'college': 'St. Bonaventure', 'country': 'USA', 'draft_year': None, 'draft_round': None, 'draft_number': None, 'team': {'id': 1, 'conference': 'East', 'division': 'Southeast', 'city': 'Atlanta', 'name': 'Hawks', 'full_name': 'Atlanta Hawks', 'abbreviation': 'ATL'}}, {'id': 3, 'first_name': 'Steven', 'last_name': 'Adams', 'position': 'C', 'height': '6-11', 'weight': '265', 'jersey_numb