# Flagrant Fouls Data Collection

Extract flagrant foul data and box score statistics from NBA API for seasons 2020-21 through 2024-25.

**Data collected per game:**
- Flagrant fouls (home/away)
- Final scores (home/away)
- Rebounds (home/away)
- Assists (home/away)
- Turnovers (home/away)
- Free throws made/attempted (home/away)
- Inactive players count (home/away)

**Process:**
- Loads existing CSV to avoid duplicate game IDs
- Fetches new games from API
- Saves to CSV immediately after each successful API call
- Stops on first read timeout (1 hour cooldown needed)

In [None]:
import pandas as pd
import numpy as np
import time
from datetime import datetime, timedelta
from pathlib import Path
from nba_api.stats.endpoints.leaguegamefinder import LeagueGameFinder
from nba_api.stats.endpoints.playbyplayv3 import PlayByPlayV3
from nba_api.stats.endpoints.boxscoretraditionalv3 import BoxScoreTraditionalV3
import warnings
warnings.filterwarnings('ignore')

## 1. Load Existing Data

In [2]:
csv_file = Path('nba_flagrant_fouls.csv')

# Load existing data if it exists
if csv_file.exists():
    existing_df = pd.read_csv(csv_file)
    existing_game_ids = set(existing_df['game_id'].unique())
    print(f"Loaded existing CSV with {len(existing_df)} games")
    print(f"Unique game IDs to skip: {len(existing_game_ids)}")
else:
    existing_df = pd.DataFrame()
    existing_game_ids = set()
    print("No existing CSV found. Starting fresh.")

Loaded existing CSV with 600 games
Unique game IDs to skip: 600


## 2. Define Data Extraction Function

In [None]:
def extract_game_data(game_id):
    """
    Extract flagrant fouls, game outcome, and box score statistics from a single game.
    
    Args:
        game_id: NBA game ID
    
    Returns:
        tuple: (game_data dict, error) - one will be None
    """
    try:
        # Get play-by-play for flagrant fouls
        pbp_response = PlayByPlayV3(game_id=game_id)
        pbp = pbp_response.play_by_play.get_data_frame()
        
        # Extract flagrants by team
        flagrants = pbp[pbp['subType'].isin(['Flagrant Type 1', 'Flagrant Type 2'])]
        
        home_flagrants = len(flagrants[flagrants['location'] == 'h'])
        away_flagrants = len(flagrants[flagrants['location'] == 'v'])
        
        # Get final score (from last row)
        final_row = pbp.iloc[-1]
        home_score = final_row['scoreHome']
        away_score = final_row['scoreAway']
        
        # Get team IDs (from first non-empty row)
        team_rows = pbp[pbp['teamId'] != 0]
        home_team = team_rows[team_rows['location'] == 'h']['teamId'].iloc[0]
        away_team = team_rows[team_rows['location'] == 'v']['teamId'].iloc[0]
        
        # Get box score statistics
        box_response = BoxScoreTraditionalV3(game_id=game_id)
        box_stats = box_response.box_score_player_stats.get_data_frame()
        
        # Aggregate team-level statistics
        team_stats = box_stats.groupby('teamId').agg({
            'reb': 'sum',      # Total rebounds
            'ast': 'sum',      # Assists
            'tov': 'sum',      # Turnovers
            'ftm': 'sum',      # Free throws made
            'fta': 'sum'       # Free throws attempted
        }).to_dict('index')
        
        # Extract stats for each team
        home_stats = team_stats.get(home_team, {})
        away_stats = team_stats.get(away_team, {})
        
        # Count inactive players (proxy for injuries)
        # Players with 0 minutes played and DNP status
        home_inactive = len(box_stats[(box_stats['teamId'] == home_team) & (box_stats['min'] == 0)])
        away_inactive = len(box_stats[(box_stats['teamId'] == away_team) & (box_stats['min'] == 0)])
        
        game_data = {
            'game_id': game_id,
            'home_team': home_team,
            'away_team': away_team,
            'home_flagrants': home_flagrants,
            'away_flagrants': away_flagrants,
            'home_score': home_score,
            'away_score': away_score,
            'home_rebounds': home_stats.get('reb', 0),
            'away_rebounds': away_stats.get('reb', 0),
            'home_assists': home_stats.get('ast', 0),
            'away_assists': away_stats.get('ast', 0),
            'home_turnovers': home_stats.get('tov', 0),
            'away_turnovers': away_stats.get('tov', 0),
            'home_ftm': home_stats.get('ftm', 0),
            'away_ftm': away_stats.get('ftm', 0),
            'home_fta': home_stats.get('fta', 0),
            'away_fta': away_stats.get('fta', 0),
            'home_inactive_players': home_inactive,
            'away_inactive_players': away_inactive
        }
        return game_data, None
    
    except Exception as e:
        return None, e

## 3. Fetch Games from NBA API

In [4]:
# Define seasons to extract
seasons = ['2020-21', '2021-22', '2022-23', '2023-24', '2024-25']

all_game_ids = []

print(f"Fetching game IDs from {len(seasons)} seasons...")
for season in seasons:
    try:
        gamefinder = LeagueGameFinder(season_nullable=season)
        games_df = gamefinder.get_data_frames()[0]
        game_ids = games_df['GAME_ID'].unique()
        all_game_ids.extend(game_ids)
        print(f"  {season}: {len(game_ids)} games")
    except Exception as e:
        print(f"  {season}: ERROR - {type(e).__name__}")
        continue

print(f"\nTotal unique games: {len(set(all_game_ids))}")

# Filter out games already in CSV
new_game_ids = [gid for gid in all_game_ids if gid not in existing_game_ids]
print(f"New games to extract: {len(new_game_ids)}")
print(f"Games to skip (already in CSV): {len(all_game_ids) - len(new_game_ids)}")

Fetching game IDs from 5 seasons...
  2020-21: 1221 games
  2021-22: 1394 games
  2022-23: 1395 games
  2023-24: 1397 games
  2024-25: 1401 games

Total unique games: 6808
New games to extract: 6808
Games to skip (already in CSV): 0


## 4. Extract Flagrant Foul Data

In [5]:
print(f"\nStarting data extraction for {len(new_game_ids)} new games...")
print(f"(1 second throttle between API calls)\n")

successful_count = 0
failed_count = 0

try:
    for i, game_id in enumerate(new_game_ids):
        if i % 100 == 0 and i > 0:
            print(f"Progress: {i}/{len(new_game_ids)} | Saved: {successful_count} | Errors: {failed_count}")
        
        game_data, error = extract_game_data(game_id)
        
        if game_data:
            # Convert to DataFrame and save immediately
            game_df = pd.DataFrame([game_data])
            
            # Append to CSV
            if csv_file.exists():
                game_df.to_csv(csv_file, mode='a', header=False, index=False)
            else:
                game_df.to_csv(csv_file, mode='w', header=True, index=False)
            
            successful_count += 1
        else:
            # Check if it's a read timeout
            if 'ReadTimeout' in str(type(error).__name__) or 'timeout' in str(error).lower():
                print(f"\n{'='*70}")
                print(f"READ TIMEOUT DETECTED - IP likely rate limited")
                print(f"{'='*70}")
                
                # Calculate 1 hour from now
                retry_time = datetime.now() + timedelta(hours=1)
                print(f"\nPlease wait until: {retry_time.strftime('%Y-%m-%d %H:%M:%S')} (local time)")
                print(f"This is approximately 1 hour from now.")
                print(f"\nProgress so far:")
                print(f"  Games extracted: {successful_count}")
                print(f"  Games processed: {i + 1}/{len(new_game_ids)}")
                print(f"\nRun this notebook again in 1 hour to continue extraction.")
                break
            else:
                failed_count += 1
        
        # Throttle to avoid rate limiting
        time.sleep(1.0)

except KeyboardInterrupt:
    print(f"\n\nInterrupted by user")

print(f"\n{'='*70}")
print(f"EXTRACTION COMPLETE")
print(f"{'='*70}")
print(f"Successfully saved: {successful_count} games")
print(f"Errors encountered: {failed_count} games")
print(f"Total processed: {successful_count + failed_count}/{len(new_game_ids)}")
print(f"Data saved to: {csv_file}")


Starting data extraction for 6808 new games...
(1 second throttle between API calls)

Progress: 100/6808 | Saved: 100 | Errors: 0
Progress: 200/6808 | Saved: 200 | Errors: 0
Progress: 300/6808 | Saved: 300 | Errors: 0
Progress: 400/6808 | Saved: 400 | Errors: 0
Progress: 500/6808 | Saved: 500 | Errors: 0

READ TIMEOUT DETECTED - IP likely rate limited

Please wait until: 2025-11-29 12:23:01 (local time)
This is approximately 1 hour from now.

Progress so far:
  Games extracted: 597
  Games processed: 598/6808

Run this notebook again in 1 hour to continue extraction.

EXTRACTION COMPLETE
Successfully saved: 597 games
Errors encountered: 0 games
Total processed: 597/6808
Data saved to: nba_flagrant_fouls.csv


## 5. Verify Final Data

In [6]:
# Load and display summary
final_df = pd.read_csv(csv_file)

print(f"\nFinal CSV Summary:")
print(f"Total games: {len(final_df)}")
print(f"Unique games: {final_df['game_id'].nunique()}")
print(f"\nGames per season (inferred from game_id):")
final_df['season'] = final_df['game_id'].str[3:7]
print(final_df.groupby('season').size().sort_index())
print(f"\nData shape: {final_df.shape}")
print(f"Columns: {final_df.columns.tolist()}")


Final CSV Summary:
Total games: 1197
Unique games: 1197

Games per season (inferred from game_id):


AttributeError: Can only use .str accessor with string values!