# NBA Fantasy Data Collection

This notebook demonstrates the data collection process for NBA player performance prediction, following the methodology from Papageorgiou et al. (2024).

## Setup

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yaml
from datetime import datetime
from tqdm.notebook import tqdm

from src.data.collector import NBADataCollector

# Set plotting style (the 'seaborn' style name was deprecated in Matplotlib 3.6)
sns.set_theme()
%matplotlib inline

## Load Configuration

In [None]:
# Load configuration
with open('../config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Extract data collection parameters
SEASONS = config['data']['seasons']
MIN_GAMES = config['data']['min_games_played']
MIN_MINUTES = config['data']['min_minutes_per_game']

print(f"Collecting data for seasons: {SEASONS}")
print(f"Minimum games required: {MIN_GAMES}")
print(f"Minimum minutes per game: {MIN_MINUTES}")
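
The cell above assumes that `config.yaml` has a `data` section containing the three keys it reads. As a minimal sketch, a check like the following could fail early with a clear message if a key is missing or has an unexpected type. The helper name `validate_data_config` and the sample values are hypothetical and not part of the project; the expected types (a list of seasons, an integer game count, a numeric minutes threshold) are also assumptions.

```python
def validate_data_config(config):
    """Return the `data` section after checking the keys this notebook reads.

    Hypothetical helper: the expected types are assumptions, not project rules.
    """
    expected = {
        "seasons": list,
        "min_games_played": int,
        "min_minutes_per_game": (int, float),
    }
    # An empty YAML file loads as None, so guard before calling .get
    data = config.get("data") if isinstance(config, dict) else None
    if not isinstance(data, dict):
        raise ValueError("config is missing a 'data' section")
    for key, types in expected.items():
        if key not in data:
            raise ValueError(f"config['data'] is missing '{key}'")
        if not isinstance(data[key], types):
            found = type(data[key]).__name__
            raise ValueError(f"config['data']['{key}'] has unexpected type {found}")
    return data


# Illustrative values only; the real ones come from config.yaml
sample_config = {
    "data": {
        "seasons": ["2021-22", "2022-23"],
        "min_games_played": 20,
        "min_minutes_per_game": 10.0,
    }
}
validated = validate_data_config(sample_config)
```

In the notebook, the same check could be applied to the dictionary returned by `yaml.safe_load` before the parameters are extracted.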

## Initialize Data Collector

In [None]:
collector = NBADataCollector(rate_limit_pause=1.0)

# Get list of active players
active_players = collector.get_active_players()
print(f"Found {len(active_players)} active players")
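
The `rate_limit_pause=1.0` argument suggests the collector waits about a second between API requests. How `NBADataCollector` implements this is not shown here, so the following is only a sketch of one common approach: enforcing a minimum interval between consecutive calls. The `RateLimiter` class and its injectable `clock` and `sleep` parameters are hypothetical.

```python
import time


class RateLimiter:
    """Enforce a minimum pause between consecutive calls (hypothetical sketch)."""

    def __init__(self, pause, clock=time.monotonic, sleep=time.sleep):
        self.pause = pause
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough that calls are at least `pause` seconds apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.pause - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


limiter = RateLimiter(pause=0.05)  # short pause so the demo runs quickly
for _ in range(3):
    limiter.wait()  # a real collector would issue one API request here
```

Passing the clock and sleep functions in as parameters keeps the pause logic testable without real delays.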

## Filter Eligible Players

Following the paper's methodology, we keep only players who have played at least `MIN_GAMES` games and who average at least `MIN_MINUTES` minutes per game.

In [None]:
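def check_player_eligibility(player_data):
    """Return True if the player meets the minimum games and minutes thresholds."""
    # .get avoids a KeyError when the collector returns no 'games' entry
    if player_data is None or player_data.get('games') is None:
        return False

    games = player_data['games']
    if games.empty:
        return False
    # One row per game: the row count is games played, MIN is minutes in that game
    return len(games) >= MIN_GAMES and games['MIN'].mean() >= MIN_MINUTES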

# Collect and filter players
eligible_players = []
player_data_dict = {}

for player in tqdm(active_players, desc="Collecting player data"):
    player_id = player['id']
    
    # Collect player data
    data = collector.collect_player_data(
        player_id=player_id,
        seasons=SEASONS,
        include_info=True
    )
    
    if check_player_eligibility(data):
        eligible_players.append(player)
        player_data_dict[player_id] = data

print(f"Found {len(eligible_players)} eligible players")
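
To make the eligibility rule concrete, here is a pandas-free restatement applied to toy data: a player qualifies when both the number of games and the mean minutes per game reach their thresholds. The function `is_eligible`, the threshold values, and the minute lists are all made up for illustration; the real thresholds come from `config.yaml`.

```python
from statistics import mean


def is_eligible(minutes_per_game, min_games, min_minutes):
    """Apply the same two-part rule as check_player_eligibility to a plain list.

    minutes_per_game holds one entry per game played (hypothetical input shape).
    """
    if not minutes_per_game:
        return False
    enough_games = len(minutes_per_game) >= min_games
    enough_minutes = mean(minutes_per_game) >= min_minutes
    return enough_games and enough_minutes


# Hypothetical thresholds and game logs, for illustration only
toy_logs = {
    "regular_starter": [34, 36, 31, 38],  # enough games, high minutes
    "deep_bench": [4, 6, 3, 5],           # enough games, too few minutes
    "short_stint": [30, 32],              # high minutes, too few games
}
eligible = [
    name for name, minutes in toy_logs.items()
    if is_eligible(minutes, min_games=3, min_minutes=10)
]
print(eligible)  # ['regular_starter']
```

Both conditions must hold, so a player with high minutes in only a handful of games is excluded just like a player who appears often but barely plays.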

## Analyze Player Distribution

In [None]:
# Create position distribution plot
position_counts = pd.Series([p['info']['POSITION'] for p in player_data_dict.values()])

plt.figure(figsize=(10, 6))
position_counts.value_counts().plot(kind='bar')
plt.title('Distribution of Player Positions')
plt.xlabel('Position')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Create Combined Dataset

In [None]:
# Combine all player games into single DataFrame
all_games = []

for player_id, data in player_data_dict.items():
    games = data['games'].copy()
    games['PLAYER_ID'] = player_id
    
    # Add player info
    for key in ['POSITION', 'TEAM_ID']:
        games[key] = data['info'][key]
        
    all_games.append(games)

combined_df = pd.concat(all_games, ignore_index=True)
print(f"Combined dataset shape: {combined_df.shape}")
combined_df.head()

## Analyze Game Statistics

In [None]:
# Plot distribution of key statistics
key_stats = ['PTS', 'REB', 'AST', 'MIN']

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
for ax, stat in zip(axes.flat, key_stats):
    sns.histplot(data=combined_df, x=stat, ax=ax)
    ax.set_title(f'Distribution of {stat}')
    
plt.tight_layout()
plt.show()

## Save Collected Data

In [None]:
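# Make sure the output directory exists before writing
from pathlib import Path
raw_dir = Path('../data/raw')
raw_dir.mkdir(parents=True, exist_ok=True)

# Save combined dataset
combined_df.to_csv(raw_dir / 'all_games.csv', index=False)

# Save player info
player_info = pd.DataFrame([data['info'] for data in player_data_dict.values()])
player_info.to_csv(raw_dir / 'player_info.csv', index=False)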

print("Data saved successfully!")

## Data Quality Check

In [None]:
def print_data_quality_report(df):
    """Print basic data quality metrics."""
    print("Data Quality Report\n")
    print(f"Number of records: {len(df)}")
    print(f"Number of features: {df.shape[1]}\n")
    
    print("Missing values:")
    missing = df.isnull().sum()
    print(missing[missing > 0])
    print("\nFeature datatypes:")
    print(df.dtypes)

print_data_quality_report(combined_df)

## Next Steps

The collected data will be used in the next notebook for processing and feature engineering. This collection phase:

1. Collected game logs for every player who met the eligibility thresholds (the count is printed by the filtering cell)
2. Built a combined dataset with one row per player-game (its shape is printed when the dataset is created)
3. Saved the raw data to `../data/raw/` for further processing

Before moving on, check the data quality report above for unexpected missing values or data types.