# PBPSTATS Data Preprocessing - pbpstats_2024.csv

This notebook preprocesses the NBA play-by-play statistics data from `pbpstats_2024.csv`, which contains detailed event-level data for NBA games. The dataset includes information such as event times, shot attempts, rebounds, fouls, scoring details, and video URLs. 

The notebook covers the following sections:
 - Data Loading
 - Data Description
 - Missing Values & Data Type Checks
 - Feature Engineering
 - Exploratory Data Analysis (EDA)
 - Data Quality & Anomaly Detection
 - Next Steps

Our goal is to obtain a clean, enriched play-by-play dataset that can later be merged with tracking and shot detail data for comprehensive modeling of game events and the eventual EPV (Expected Possession Value) analysis.

## Table of Contents

1. Introduction
2. Data Loading
3. Data Description
4. Missing Values & Data Types
5. Feature Engineering
6. Exploratory Data Analysis (EDA)
7. Data Quality & Anomaly Detection
8. Next Steps & Save Cleaned Data

## 1. Introduction <a id="introduction"></a>

The dataset `pbpstats_2024.csv` contains the following key columns:

- **ENDTIME:** End time of the event (in MM:SS format).
- **EVENTS:** Description of the event(s) that occurred.
- **FG2A:** 2-point field goal attempts.
- **FG2M:** 2-point field goals made.
- **FG3A:** 3-point field goal attempts.
- **FG3M:** 3-point field goals made.
- **GAMEDATE:** Date of the game.
- **GAMEID:** Unique game identifier.
- **NONSHOOTINGFOULSTHATRESULTEDINFTS:** Number of non-shooting fouls resulting in free throws.
- **OFFENSIVEREBOUNDS:** Number of offensive rebounds.
- **OPPONENT:** Opposing team abbreviation.
- **PERIOD:** Game period/quarter.
- **SHOOTINGFOULSDRAWN:** Number of shooting fouls drawn.
- **STARTSCOREDIFFERENTIAL:** Score differential at the start of the event.
- **STARTTIME:** Start time of the event (in MM:SS format).
- **STARTTYPE:** Type of event start (e.g., regular, timeout).
- **TURNOVERS:** Number of turnovers during the event.
- **DESCRIPTION:** Detailed description of the event.
- **URL:** Link to video footage of the event.

We will inspect, clean, and engineer features from this data to facilitate deeper analysis and later integration with other datasets.

## Extended Introduction and EPV Overview

This notebook not only cleans the PBPSTATS data but also lays the groundwork for extracting features that feed into an Expected Possession Value (EPV) model. By integrating this dataset with player tracking and shot detail data, we aim to analyze game flow and predict possession outcomes.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import logging
from pathlib import Path

# Set visualization parameters
sns.set(style="whitegrid", context="talk")
plt.rcParams["figure.figsize"] = (12, 6)

# Set logging configuration
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

print("Libraries imported successfully.")

## 2. Data Loading <a id="data-loading"></a>

Load the play-by-play data from the CSV file. Ensure that the file is located in the `data/raw/` folder. In this example, we assume the path is `../data/raw/pbpstats_2024.csv`.

In [None]:
# Define the file path for PBPSTATS data
file_path = Path('../data/raw/pbpstats_2024.csv')

# Load the data into a DataFrame
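try:
    df_pbp = pd.read_csv(file_path)
except Exception as e:
    logging.error(f"Error loading file: {e}")
    raise  # Stop here: every later cell needs df_pbp
else:
    logging.info("PBPSTATS data loaded successfully.")
    df_pbp.info()  # info() prints its summary and returns None, so call it directly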

# Display the first few rows and shape
display(df_pbp.head())
print('Dataset shape:', df_pbp.shape)
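
Before going further, it can help to confirm that the loaded file actually contains the columns listed in the introduction. The cell below is a minimal sketch rather than part of the original pipeline: `EXPECTED_COLUMNS` and `find_missing_columns` are hypothetical names, the column list is copied from Section 1, and a toy frame stands in for `df_pbp`.

```python
import pandas as pd

# Column names as listed in Section 1 of this notebook.
EXPECTED_COLUMNS = [
    'ENDTIME', 'EVENTS', 'FG2A', 'FG2M', 'FG3A', 'FG3M', 'GAMEDATE', 'GAMEID',
    'NONSHOOTINGFOULSTHATRESULTEDINFTS', 'OFFENSIVEREBOUNDS', 'OPPONENT',
    'PERIOD', 'SHOOTINGFOULSDRAWN', 'STARTSCOREDIFFERENTIAL', 'STARTTIME',
    'STARTTYPE', 'TURNOVERS', 'DESCRIPTION', 'URL',
]


def find_missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Toy frame standing in for df_pbp; in the notebook, pass df_pbp instead.
toy = pd.DataFrame({'GAMEID': [1], 'PERIOD': [1], 'URL': [None]})
missing = find_missing_columns(toy)
print(f"{len(missing)} expected columns are missing: {missing}")
```

An empty list means every expected column is present; anything else is worth resolving before the later cells index into those columns.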

In [None]:
# Convert GAMEDATE column to datetime
df_pbp['GAMEDATE'] = pd.to_datetime(df_pbp['GAMEDATE'], errors='coerce')
print('GAMEDATE converted; sample:', df_pbp['GAMEDATE'].head())

## 3. Data Description <a id="data-description"></a>

### Detailed Column Descriptions

- **ENDTIME:** End time of the event (MM:SS). Used to calculate event duration.
- **EVENTS:** Textual event summary; helps in classifying the type of play (e.g. shot, turnover).
- **FG2A & FG2M:** Essential for computing 2-point shooting efficiency.
- **FG3A & FG3M:** Essential for 3-point efficiency metrics.
- **GAMEDATE & GAMEID:** For merging datasets and time-based analysis.
- **TURNOVERS:** Indicates possession disruptions impacting team performance.
- **URL:** Video link field; missing values are imputed with 'no_url_provided'.

### Why These Columns Matter

- **ENDTIME & STARTTIME:** Critical for calculating event durations which indicate pace and play intensity.
- **FG2A/FG2M/FG3A/FG3M:** Determine shooting efficiency, an essential metric for performance analysis.
- **GAMEDATE & GAMEID:** Required for merging datasets and time-based analysis.
- **TURNOVERS:** Help quantify decision-making and team control during possessions.
- **URL:** Although often missing, these links can support game video validations; missing values are replaced with 'no_url_provided'.

## 4. Missing Values & Data Types <a id="missing-values"></a>

### Handling Missing Data & Data Types

Special attention is given to time fields and the URL column. Time fields (`STARTTIME`, `ENDTIME`) are converted to seconds for numerical analysis, while missing URLs are imputed with a consistent placeholder.

### 4.1 Missing Values Analysis

Below we summarize missing data as a percentage per column. For example, a high percentage in URL is expected and handled by replacing missing values with 'no_url_provided'.

In [None]:
# Check for missing values in the PBPSTATS dataset
missing_counts_pbp = df_pbp.isnull().sum()
total = len(df_pbp)
missing_percent = (missing_counts_pbp/total)*100
print('Missing values in each column:')
print(missing_counts_pbp)
print('\nMissing percentages in each column:')
print(missing_percent)
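
The two printed Series above can be hard to scan for wide datasets. As an optional sketch, the counts and percentages can be combined into one table sorted by severity. `missing_summary` is a hypothetical helper, and the toy frame (including its placeholder URL) is made up for illustration.

```python
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, worst first (hypothetical helper)."""
    summary = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_pct': df.isnull().mean() * 100,
    })
    return summary.sort_values('missing_pct', ascending=False)


# Toy data standing in for df_pbp.
toy = pd.DataFrame({
    'URL': [None, None, 'http://example.com/clip', None],
    'PERIOD': [1, 1, 2, 2],
    'STARTTIME': ['12:00', None, '11:20', '10:58'],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook, `missing_summary(df_pbp)` would put columns such as `URL` at the top if they are mostly empty.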

### 4.2 Display Current Data Types

In [None]:
# Display the current data types for the PBPSTATS dataset
print('\nData types:')
print(df_pbp.dtypes)

#### **Next Steps for Data Types & Missing Values:**
- Convert time columns such as `ENDTIME` and `STARTTIME` to datetime objects or to seconds (numerical format) for easier time-based computations.
- Evaluate and impute (or drop) any columns with significant missing values.
- Ensure categorical columns (e.g., `STARTTYPE`, `DESCRIPTION`) are correctly formatted.
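
The last bullet can be sketched as follows. This is one possible approach, not something the notebook does elsewhere: `to_category` is a hypothetical helper, and treating `STARTTYPE` and `OPPONENT` as categorical is an assumption (free-text columns such as `DESCRIPTION` may be better left as strings).

```python
import pandas as pd


def to_category(df, columns):
    """Return a copy of df with the given columns cast to pandas' category dtype.

    Columns that are not present are skipped (hypothetical helper).
    """
    out = df.copy()
    for col in columns:
        if col in out.columns:
            out[col] = out[col].astype('category')
    return out


# Toy data standing in for df_pbp.
toy = pd.DataFrame({
    'STARTTYPE': ['regular', 'timeout', 'regular'],
    'PERIOD': [1, 1, 2],
})
converted = to_category(toy, ['STARTTYPE', 'OPPONENT'])
print(converted.dtypes)
```

The helper returns a copy, so the original frame keeps its dtypes until the result is assigned back.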

## 5. Feature Engineering <a id="feature-engineering"></a>

In this section, we derive new features from the raw play-by-play data to capture key aspects of game events. The planned steps include:

1. **Time Features:**
   - Convert `STARTTIME` and `ENDTIME` from MM:SS to seconds.
   - Derive **EVENT_DURATION** as the absolute difference between `ENDTIME` and `STARTTIME` (in seconds).

2. **Categorical Features:**
   - Parse the `EVENTS` and `DESCRIPTION` columns to extract common event types (e.g., shot attempt, turnover, rebound).
   - Create dummy variables if necessary to indicate the presence of key event types.

3. **Game State Features:**
   - Compute **SCORE_DIFF** as the difference in scores at the start of each event.
   - Calculate **SCORE_CHANGE** as the difference in scores between the start and end of each event.

4. **URL Handling:**
   - Replace missing URLs with a consistent placeholder (e.g., 'no_url_provided').
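
Step 2 of the plan (categorical features) could look roughly like the sketch below. The keyword patterns, the flag names, and the sample descriptions are all hypothetical; real `DESCRIPTION` text should be inspected before choosing patterns.

```python
import pandas as pd

# Hypothetical keyword patterns; adjust them after inspecting real DESCRIPTION text.
EVENT_PATTERNS = {
    'IS_TURNOVER': r'turnover',
    'IS_REBOUND': r'rebound',
    'IS_FOUL': r'foul',
}


def add_event_flags(df, text_col='DESCRIPTION', patterns=EVENT_PATTERNS):
    """Return a copy of df with one 0/1 column per keyword pattern (hypothetical helper)."""
    out = df.copy()
    text = out[text_col].fillna('')  # Treat missing descriptions as empty text
    for flag, pattern in patterns.items():
        out[flag] = text.str.contains(pattern, case=False, regex=True).astype(int)
    return out


# Made-up descriptions standing in for df_pbp['DESCRIPTION'].
toy = pd.DataFrame({'DESCRIPTION': ['Bad pass Turnover', 'Offensive REBOUND', None]})
flagged = add_event_flags(toy)
print(flagged)
```

Keyword matching is crude: it cannot tell an offensive foul from a shooting foul, so the flags are best treated as a first pass.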

In [None]:
# Define a helper function to convert time strings in MM:SS to seconds
def time_str_to_seconds(time_str):
    """
    Convert a time string in MM:SS format to seconds.
    Returns 0 in case of failure.
    """
    try:
        minutes, seconds = time_str.split(':')
        return int(minutes) * 60 + float(seconds)
    except Exception:
        return 0

# Convert STARTTIME and ENDTIME to seconds
df_pbp['STARTTIME_SEC'] = df_pbp['STARTTIME'].apply(time_str_to_seconds)
df_pbp['ENDTIME_SEC'] = df_pbp['ENDTIME'].apply(time_str_to_seconds)

# Compute event duration as the absolute difference
df_pbp['EVENT_DURATION'] = abs(df_pbp['ENDTIME_SEC'] - df_pbp['STARTTIME_SEC'])

# Display a sample of the new features
display(df_pbp[['STARTTIME', 'ENDTIME', 'STARTTIME_SEC', 'ENDTIME_SEC', 'EVENT_DURATION']].head())
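
Because `time_str_to_seconds` returns 0 on failure, an unparseable or missing time is indistinguishable from a genuine `0:00`. The sketch below is an alternative under that concern, not the notebook's method: `mmss_to_seconds` is a hypothetical vectorized helper that yields NaN for anything that does not match `MM:SS`.

```python
import pandas as pd


def mmss_to_seconds(times):
    """Convert a Series of 'MM:SS' (or 'MM:SS.s') strings to seconds.

    Values that do not match the pattern, including missing values,
    become NaN instead of 0 (hypothetical helper).
    """
    parts = times.str.extract(r'^\s*(\d+):(\d+(?:\.\d+)?)\s*$')
    minutes = pd.to_numeric(parts[0], errors='coerce')
    seconds = pd.to_numeric(parts[1], errors='coerce')
    return minutes * 60 + seconds


sample = pd.Series(['12:00', '0:07.5', 'bad', None])
result = mmss_to_seconds(sample)
print(result)
```

With NaN in place of 0, parse failures show up in the missing-value summary rather than silently shrinking or inflating `EVENT_DURATION`.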

### 5.2 Calculating Shooting Percentages

Calculate 2-point and 3-point shooting percentages with division-by-zero handling.

In [None]:
# Calculate shooting percentages with division-by-zero handling
df_pbp['FG2_PCT'] = (df_pbp['FG2M'] / df_pbp['FG2A']).where(df_pbp['FG2A'] > 0, 0)
df_pbp['FG3_PCT'] = (df_pbp['FG3M'] / df_pbp['FG3A']).where(df_pbp['FG3A'] > 0, 0)
print('Shooting percentages computed.')
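
Setting the percentage to 0 when there are no attempts makes "no shot taken" look the same as "shot missed". If that distinction matters downstream, one option is to leave zero-attempt rows as NaN. The sketch below illustrates the idea; `shooting_pct` and the `FG2_PCT_NAN` column are hypothetical names.

```python
import pandas as pd


def shooting_pct(made, attempts):
    """Makes divided by attempts; NaN (not 0) where there were no attempts."""
    return made.div(attempts).where(attempts > 0)


# Toy rows: no attempt, one miss, one of two made, one of one made.
toy = pd.DataFrame({'FG2A': [0, 1, 2, 1], 'FG2M': [0, 0, 1, 1]})
toy['FG2_PCT_NAN'] = shooting_pct(toy['FG2M'], toy['FG2A'])
print(toy)
```

Summary statistics and histograms then skip the zero-attempt rows automatically, since pandas ignores NaN in those computations.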

### 5.3 Game State Features

- **SCORE_DIFF:** Use the `STARTSCOREDIFFERENTIAL` column (if available) to indicate the score difference at the start of the event.
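- **SCORE_CHANGE:** Approximate the change in score during the event. The columns listed in Section 1 include no end-of-event score, so the cell below uses points from made field goals (2 per `FG2M`, 3 per `FG3M`) as a stand-in; free-throw points are not counted.

In [8]:
# Calculate score differential and score change
df_pbp['SCORE_DIFF'] = df_pbp['STARTSCOREDIFFERENTIAL']
# SCORE_DIFF minus STARTSCOREDIFFERENTIAL would always be 0, because the two columns are identical.
# Assumption: with no end-of-event score column, approximate the change by field-goal points only.
df_pbp['SCORE_CHANGE'] = 2 * df_pbp['FG2M'] + 3 * df_pbp['FG3M']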


### 5.4 URL Handling

Replace missing URL values with a placeholder.

In [None]:
df_pbp['URL'] = df_pbp['URL'].fillna('no_url_provided')
print('Missing URL values have been handled.')

## 6. Exploratory Data Analysis (EDA) <a id="eda"></a>

## 6.1 Distribution Analysis
We start with summary statistics for the numerical features, then examine distributions with histograms (with KDE overlays) in the subsections that follow.

In [None]:
# Compute summary statistics for key numerical features in the PBPSTATS dataset
pbp_summary_stats = df_pbp.describe()
print("Summary Statistics for PBPSTATS Key Features:")
display(pbp_summary_stats)

### 6.1.1 First Game Detailed Analysis

In this section, we extract the data for the first game (based on the smallest GAMEID) and analyze key features. We focus on:
- **Event Duration:** How long events last in this game.
- **Shooting Percentages:** Distribution of FG2_PCT and FG3_PCT.
- **Turnovers:** Frequency of turnovers as an indicator of possession control.


In [None]:
# Identify the first game based on the smallest GAMEID
first_game_id = df_pbp['GAMEID'].min()
print(f"First GAMEID: {first_game_id}")

# Filter the dataframe for the first game
df_first_game = df_pbp[df_pbp['GAMEID'] == first_game_id]
print(f"First game shape: {df_first_game.shape}")
display(df_first_game.head())

### 6.1.2 Analysis of the First Game

Let's analyze some key features for the first game:
- **Event Duration Distribution**
- **Shooting Percentages (FG2_PCT & FG3_PCT)**
- **Turnover Analysis**

In [None]:
plt.figure(figsize=(14, 5))

# Histogram for EVENT_DURATION in the first game
plt.subplot(1, 2, 1)
sns.histplot(df_first_game['EVENT_DURATION'], bins=20, kde=True, color='skyblue')
plt.title('EVENT_DURATION Distribution - First Game')

# Histogram for FG2_PCT in the first game
plt.subplot(1, 2, 2)
sns.histplot(df_first_game['FG2_PCT'], bins=20, kde=True, color='salmon')
plt.title('FG2 Percentage - First Game')

plt.tight_layout()
plt.show()

# Display a count plot for TURNOVERS in the first game
plt.figure(figsize=(7,5))
sns.countplot(x='TURNOVERS', data=df_first_game, palette='viridis')
plt.title('Turnover Counts - First Game')
plt.xlabel('Turnovers')
plt.ylabel('Count')
plt.show()

### 6.1.3 Random Games Comparison

Next, we randomly select up to five distinct games from the dataset and analyze key features to compare game dynamics. For these games, we will look at:
- **Event Duration Distribution**
- **Scoring and Turnover Metrics**
- **Comparative Summary Statistics**

In [None]:
# Get unique game IDs
unique_games = df_pbp['GAMEID'].unique()

# Randomly select up to 5 games (fewer if the dataset has fewer than 5)
np.random.seed(42)
n_games = min(5, len(unique_games))
selected_game_ids = np.random.choice(unique_games, size=n_games, replace=False)
print(f"Selected Game IDs for random analysis: {selected_game_ids}")

# Create a dataframe for the selected games
df_random_games = df_pbp[df_pbp['GAMEID'].isin(selected_game_ids)]
print(f"Random games dataframe shape: {df_random_games.shape}")
display(df_random_games.head())

### 6.1.4 Analysis of Random Games

For the selected games, we create plots to compare key metrics across games. We focus on EVENT_DURATION and average 2-point shooting percentage (FG2_PCT).

In [None]:
import matplotlib.pyplot as plt

# Create a boxplot of EVENT_DURATION grouped by GAMEID
plt.figure(figsize=(10, 6))
sns.boxplot(x='GAMEID', y='EVENT_DURATION', data=df_random_games, palette='Set2')
plt.title('Event Duration by GAMEID (Random Games)')
plt.xlabel('GAMEID')
plt.ylabel('EVENT_DURATION (seconds)')
plt.show()

# Create a bar plot of average FG2_PCT for each game
avg_fg2_pct = df_random_games.groupby('GAMEID')['FG2_PCT'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.barplot(x='GAMEID', y='FG2_PCT', data=avg_fg2_pct, palette='Set1')
plt.title('Average 2-Point FG Percentage by GAMEID (Random Games)')
plt.xlabel('GAMEID')
plt.ylabel('Average FG2_PCT')
plt.show()

# Display summary statistics for the selected games
print('Summary Statistics for Selected Games:')
display(df_random_games.groupby('GAMEID').describe())

The analysis of random games provides insights into the variability of game dynamics, event durations, and scoring patterns. This comparison helps identify trends and outliers in the dataset. 

In [15]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Global visualization settings and color palette
sns.set_style("whitegrid")
sns.set_context("talk")
colors = {
    'primary': '#1f77b4',
    'secondary': '#ff7f0e',
    'highlight': '#2ca02c'
}
plt.rcParams.update({
    'figure.figsize': (12, 6),
    'axes.titlesize': 14,
    'axes.labelsize': 12,
    'axes.grid': True
})


## 6.2 Correlation Heatmap



We create a correlation heatmap to visualize the relationships between numerical features. This analysis helps identify potential multicollinearity and relationships between variables.


In [None]:
numerical_cols = ['EVENT_DURATION', 'FG2_PCT', 'FG3_PCT', 'PERIOD', 'STARTSCOREDIFFERENTIAL']
correlation_matrix = df_pbp[numerical_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

### Overview
The correlation matrix shows relationships between five key features:
- EVENT_DURATION
- FG2_PCT (2-point Field Goal Percentage)
- FG3_PCT (3-point Field Goal Percentage)
- PERIOD
- STARTSCOREDIFFERENTIAL

### Key Correlations

#### Strong Correlations (|r| > 0.5)
- None observed - all correlations are relatively weak (|r| < 0.2)

#### Moderate Correlations (0.2 < |r| < 0.5)
- None observed

#### Weak Correlations (|r| < 0.2)
1. **FG2_PCT and FG3_PCT** (r = -0.16)
   - Slight negative correlation
   - Suggests that when 2-point shooting percentage is higher, 3-point percentage tends to be slightly lower
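   - Could indicate trade-offs in shot selection, but is at least partly mechanical: a row with no attempts of one shot type is coded as 0 for that percentage, so a row with only 2-point attempts always has FG3_PCT = 0 (and vice versa)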

2. **EVENT_DURATION and FG2_PCT** (r = -0.078)
   - Very weak negative correlation
   - Longer events are slightly associated with lower 2-point shooting percentages
   - May reflect defensive pressure leading to longer possessions

3. **Other Correlations** (all |r| < 0.05)
   - EVENT_DURATION and FG3_PCT: -0.035
   - PERIOD and other variables: all near 0.02
   - STARTSCOREDIFFERENTIAL and other variables: all near 0.02 or lower

### Insights

1. **Independence of Features**
   - Most features show very weak correlations
   - Suggests these metrics capture different aspects of game play
   - Variables are largely linearly uncorrelated, which by itself does not establish independence

2. **Shot Selection**
   - The weak negative correlation between FG2_PCT and FG3_PCT might reflect:
     - Team shooting strategies
     - Defensive adjustments
     - Player specialization

3. **Game Flow**
   - PERIOD and STARTSCOREDIFFERENTIAL showing minimal correlations indicates:
     - Consistent play patterns across quarters
     - Score differential doesn't strongly influence other metrics
     - Game dynamics remain relatively stable regardless of game situation

4. **Event Duration**
   - Weak correlations with shooting percentages suggest:
     - Play duration isn't strongly tied to shooting success
     - Teams maintain consistent efficiency regardless of possession length

### Implications for Analysis
1. **Feature Selection**
   - Low correlations suggest these features provide unique information
   - All five features can be retained for modeling
   - Multicollinearity among these five features is not a concern

2. **Modeling Considerations**
   - May need to engineer interaction terms
   - Consider non-linear relationships
   - Look for conditional dependencies not captured by linear correlation

3. **Future Investigation**
   - Consider additional features that might show stronger relationships
   - Examine temporal patterns within games
   - Investigate team-specific patterns 
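
One low-cost way to follow up on the "non-linear relationships" point is to compare Pearson and Spearman correlations, since Spearman captures any monotonic relationship. The sketch below uses synthetic data (not the play-by-play data) purely to show the difference; in the notebook the equivalent call would be `df_pbp[numerical_cols].corr(method='spearman')`.

```python
import numpy as np
import pandas as pd

# Synthetic example: y increases with x, but very non-linearly.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
toy = pd.DataFrame({'x': x, 'y': np.exp(3 * x)})

pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A Spearman value well above the Pearson value for a feature pair would suggest a monotonic but non-linear relationship that the heatmap above understates.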

## 6.3 Additional Graphs: Boxplot & Scatter Plot



We first plot the distributions of EVENT_DURATION, FG2_PCT, and FG3_PCT together with event counts by period. We then create a boxplot for EVENT_DURATION to highlight outliers and a scatter plot of FG2A vs FG2M to inspect the relationship between shot attempts and makes.

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# EVENT_DURATION distribution with KDE
sns.histplot(data=df_pbp, x='EVENT_DURATION', kde=True, ax=ax1, color=colors['primary'])
ax1.set_title('Event Duration Distribution')

# FG2_PCT distribution with KDE
sns.histplot(data=df_pbp, x='FG2_PCT', kde=True, ax=ax2, color=colors['secondary'])
ax2.set_title('2-Point FG Percentage')

# FG3_PCT distribution with KDE
sns.histplot(data=df_pbp, x='FG3_PCT', kde=True, ax=ax3, color=colors['highlight'])
ax3.set_title('3-Point FG Percentage')

# Events count by PERIOD as a bar plot
df_pbp['PERIOD'].value_counts().sort_index().plot(kind='bar', ax=ax4, color=colors['primary'])
ax4.set_title('Events by Period')
ax4.set_xlabel('Period')
ax4.set_ylabel('Count')

plt.tight_layout()
plt.show()



| Panel | Observations | Interpretation |
|-------|--------------|----------------|
| **Event Duration Distribution** | Right-skewed distribution with a long tail from 0 to ~60 seconds, peaking between 10–20 seconds. | Most plays resolve within 20 seconds, with few extended events (over 30 seconds). |
| **2-Point FG Percentage Distribution** | Trimodal pattern with a large spike at 0.0, a moderate peak around 0.5–0.6, and a spike at 1.0 (every attempt made). | The spike at 0.0 combines missed shots with rows that have no 2-point attempt, because FG2_PCT is set to 0 when FG2A is 0. Most rows sit at one extreme or the other, with fewer intermediate values. |
| **3-Point FG Percentage Distribution** | Similar to the 2-point pattern but more extreme: pronounced spikes at 0 and 1 with little middle-range variation. | 3-point outcomes are close to all-or-nothing per row. As with FG2_PCT, the spike at 0 includes rows with no 3-point attempt. |
| **Events by Period** | Regular quarters (1–4) maintain consistent event counts (~60,000 each), while overtime periods (5–6) have dramatically fewer events. | Game pace is steady across regulation; the low overtime counts are expected, since only tied games reach overtime. |

In [None]:
# Boxplot for EVENT_DURATION to highlight outliers
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
sns.boxplot(x=df_pbp['EVENT_DURATION'], color='lightgreen')
plt.title('Boxplot of EVENT_DURATION')

# Scatter plot for FG2A vs FG2M to inspect relationship between shot attempts and makes
plt.subplot(1, 2, 2)
sns.scatterplot(x='FG2A', y='FG2M', data=df_pbp, alpha=0.5, color='navy')
plt.title('Scatter Plot: FG2A vs FG2M')
plt.xlabel('2-Point Field Goal Attempts')
plt.ylabel('2-Point Field Goal Makes')

plt.tight_layout()
plt.show()

| **Boxplot of EVENT_DURATION** | **Scatter Plot: FG2A vs FG2M (2-Point FG Attempts vs Makes)** |
|------------------------------|---------------------------------------------------------------|
| **Graph Type:** Boxplot of event durations.  <br> **Key Observations:** <br> - Median around 15 seconds <br> - IQR approximately 10–20 seconds <br> - Whiskers extend to ~5–30 seconds <br> - Numerous outliers beyond 30 seconds (some reaching 60+ seconds)  <br><br> **Statistical Insights:** <br> - Most events cluster between 10–20 seconds <br> - Median aligns with typical play duration (~15 sec), reflecting the NBA shot clock context  <br><br> **Outlier Analysis:** <br> - Upper outliers appear systematically, likely representing timeouts, free throw sequences, review periods, or end-of-quarter situations. | **Graph Type:** Scatter plot of 2-Point Field Goal Attempts (FG2A) vs Makes (FG2M).  <br> **Key Observations:** <br> - Points appear at specific integer coordinates <br> - Clear upper boundary (makes cannot exceed attempts) <br> - Three discrete levels corresponding to shot outcomes: 0, 1, and 2 makes  <br><br> **Pattern Analysis:** <br> - Zero makes are most common (attempts from 0 to 7) <br> - One make spans scenarios with 1–7 attempts <br> - Two makes are rarer, possible only with 2+ attempts  <br><br> **Efficiency Insights:** <br> - Perfect shooting efficiency (points along the diagonal) is rare <br> - Most data points lie below optimal efficiency, implying defensive impact on scoring. |

## 7. Data Quality & Anomaly Detection <a id="data-quality"></a>

### 7.1 Initial Data Quality Assessment
First, we perform a comprehensive check of data quality metrics including missing values, duplicates, and basic validation rules.

In [None]:
def initial_quality_assessment(df):
    quality_metrics = {
        'total_rows': len(df),
        'missing_values': df.isnull().sum(),
        'duplicate_rows': len(df[df.duplicated()]),
        'memory_usage': df.memory_usage().sum() / 1024**2  # in MB
    }
    return quality_metrics

# Run initial assessment
quality_results = initial_quality_assessment(df_pbp)
print("Data Quality Metrics:")
for metric, value in quality_results.items():
    if metric != 'missing_values':
        print(f"{metric}: {value}")
print("\nMissing Values by Column:")
print(quality_results['missing_values'])

### 7.2 Feature-Specific Validation
Validate business rules and logical constraints for specific features.

In [None]:
def validate_feature_rules(df):
    validation_results = {
        'valid_periods': df['PERIOD'].between(1, 6).all(),
        'valid_shots': (df['FG2M'] <= df['FG2A']).all() and (df['FG3M'] <= df['FG3A']).all(),
        'valid_times': (df['ENDTIME_SEC'] > df['STARTTIME_SEC']).all(),
        'valid_scores': (df['STARTSCOREDIFFERENTIAL'].abs() <= 50).all()  # all() must be called; the bare method object is always truthy
    }
    return validation_results

# Run validation checks
validation_results = validate_feature_rules(df_pbp)
print("Feature Validation Results:")
for rule, passed in validation_results.items():
    print(f"{rule}: {'✓' if passed else '✗'}")

# Identify problematic records
invalid_shots = df_pbp[~((df_pbp['FG2M'] <= df_pbp['FG2A']) & (df_pbp['FG3M'] <= df_pbp['FG3A']))]
if len(invalid_shots) > 0:
    print("\nFound invalid shot records:")
    display(invalid_shots[['GAMEID', 'FG2A', 'FG2M', 'FG3A', 'FG3M']])

#### Passing Validations
- **Period Values** (✓): All periods are within valid range (1-6)
- **Shot Attempts** (✓): All FG2M/FG3M are less than or equal to FG2A/FG3A
- **Score Differentials** (✓): All score differentials are within reasonable bounds (≤ 50)

#### Failed Validation
- **Time Sequence** (✗): Some ENDTIME_SEC values are not greater than STARTTIME_SEC
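
To see why the time check fails, it helps to split rows by how the two times compare rather than relying on a single pass/fail flag. The sketch below is a hypothetical helper run on made-up numbers. One possibility worth checking, which this notebook does not verify, is that the times come from a game clock that counts down, in which case an end time below the start time would be the normal case and the rule itself would need inverting.

```python
import pandas as pd


def time_sequence_report(df):
    """Count rows by how ENDTIME_SEC compares with STARTTIME_SEC (hypothetical helper)."""
    diff = df['ENDTIME_SEC'] - df['STARTTIME_SEC']
    return {
        'end_after_start': int((diff > 0).sum()),
        'end_equals_start': int((diff == 0).sum()),
        'end_before_start': int((diff < 0).sum()),
    }


# Made-up values standing in for df_pbp.
toy = pd.DataFrame({
    'STARTTIME_SEC': [720, 700, 650, 0],
    'ENDTIME_SEC': [700, 700, 660, 0],
})
report = time_sequence_report(toy)
print(report)
```

If nearly all rows fall in one bucket, the failure reflects the direction of the clock or zero-length events rather than scattered data errors.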

### 7.3 Comprehensive Outlier Detection
Analyze outliers across all numerical features using the IQR method and additional statistical measures.

In [22]:
def detect_outliers_iqr_pbp(df, feature):
    """
    Detect outliers using the IQR method.
    
    Args:
        df (pd.DataFrame): Input DataFrame
        feature (str): Column name to analyze
        
    Returns:
        tuple: (outliers DataFrame, lower bound, upper bound)
    """
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
    return outliers, lower_bound, upper_bound
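
A caveat when applying the IQR rule to count features: if at least 75% of the values are 0, then Q1 and Q3 are both 0, the IQR collapses to 0, and every nonzero value is flagged. The sketch below shows this on a synthetic count column (not the real data); `iqr_bounds` is a hypothetical helper that mirrors the bound calculation above.

```python
import pandas as pd


def iqr_bounds(values):
    """Return the (lower, upper) 1.5 * IQR bounds for a Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Synthetic count feature: 80 zeros and 20 ones.
counts = pd.Series([0] * 80 + [1] * 20)
lower, upper = iqr_bounds(counts)
flagged_pct = ((counts < lower) | (counts > upper)).mean() * 100
print(f"bounds=({lower}, {upper}), flagged={flagged_pct:.1f}%")
```

Here both bounds are 0, so all 20 ordinary nonzero rows count as "outliers". High outlier percentages on zero-dominated columns should be read with this in mind.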

In [None]:
def analyze_all_features(df):
    """
    Comprehensive analysis of all numerical features.
    
    Args:
        df (pd.DataFrame): Input DataFrame
        
    Returns:
        dict: Analysis results for each feature
    """
    numerical_features = df.select_dtypes(include=['int64', 'float64']).columns
    outlier_summary = {}
    
    for feature in numerical_features:
        outliers, lower, upper = detect_outliers_iqr_pbp(df, feature)
        stats = df[feature].describe()
        outlier_summary[feature] = {
            'count': len(outliers),
            'percentage': (len(outliers)/len(df))*100,
            'bounds': (lower, upper),
            'skewness': df[feature].skew(),
            'kurtosis': df[feature].kurtosis(),
            'stats': stats
        }
    return outlier_summary

In [24]:
def plot_feature_distribution(df, feature, outlier_bounds=None):
    """
    Plot distribution of a feature with outlier bounds if provided.
    
    Args:
        df (pd.DataFrame): Input DataFrame
        feature (str): Feature to plot
        outlier_bounds (tuple): Optional (lower, upper) bounds for outliers
    """
    plt.figure(figsize=(10, 6))
    sns.histplot(df[feature], kde=True)
    if outlier_bounds:
        plt.axvline(x=outlier_bounds[0], color='r', linestyle='--', label='Lower bound')
        plt.axvline(x=outlier_bounds[1], color='r', linestyle='--', label='Upper bound')
        plt.legend()  # Only draw a legend when the labeled bound lines exist
    plt.title(f'Distribution of {feature}')
    plt.show()

In [26]:
def plot_feature_distributions(df, outlier_analysis, features_per_row=3):
    """
    Plot multiple feature distributions in a grid of subplots.
    
    Args:
        df (pd.DataFrame): Input DataFrame
        outlier_analysis (dict): Dictionary containing outlier analysis results
        features_per_row (int): Number of plots per row
    """
    # Calculate number of rows needed
    n_features = len(outlier_analysis)
    n_rows = (n_features + features_per_row - 1) // features_per_row
    
    # Create figure and subplots
    fig, axes = plt.subplots(n_rows, features_per_row, 
                            figsize=(15, 5*n_rows),
                            squeeze=False)
    
    # Flatten axes for easier iteration
    axes_flat = axes.flatten()

    # Plot each feature
    for idx, (feature, metrics) in enumerate(outlier_analysis.items()):
        ax = axes_flat[idx]
        
        # Plot histogram with KDE
        sns.histplot(df[feature], kde=True, ax=ax)
        
        # Add outlier bounds
        if 'bounds' in metrics:
            lower, upper = metrics['bounds']
            ax.axvline(x=lower, color='r', linestyle='--', label='Lower bound')
            ax.axvline(x=upper, color='r', linestyle='--', label='Upper bound')
        
        # Add title with statistics
        ax.set_title(f'{feature}\nOutliers: {metrics["percentage"]:.1f}%\n' +
                    f'Skew: {metrics["skewness"]:.2f}')
        
        # Rotate x-axis labels if needed
        ax.tick_params(axis='x', rotation=45)
        
        # Add legend
        ax.legend()
    
    # Remove empty subplots if any
    for idx in range(len(outlier_analysis), len(axes_flat)):
        fig.delaxes(axes_flat[idx])
    
    # Adjust layout
    plt.tight_layout()
    return fig


def run_feature_analysis(df):
    """
    Run complete feature analysis with compact visualization.
    """
    # Run analysis
    outlier_analysis = analyze_all_features(df)
    
    # Create visualization
    fig = plot_feature_distributions(df, outlier_analysis)
    
    # Print detailed statistics
    print("\nDetailed Feature Analysis:")
    print("="*80)
    for feature, metrics in outlier_analysis.items():
        print(f"\nFeature: {feature}")
        print(f"{'='*40}")
        print(f"Outliers: {metrics['count']} ({metrics['percentage']:.2f}%)")
        print(f"Bounds: {metrics['bounds']}")
        print(f"Skewness: {metrics['skewness']:.2f}")
        print(f"Kurtosis: {metrics['kurtosis']:.2f}")
        if 'stats' in metrics:
            print("\nDescriptive Statistics:")
            print(metrics['stats'])
    
    return outlier_analysis, fig

In [None]:
# Run the analysis
outlier_analysis, fig = run_feature_analysis(df_pbp)

# Save the figure if needed
fig.savefig('feature_analysis.png', dpi=300, bbox_inches='tight')
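
The printed report is long, so it can be convenient to flatten the scalar parts of `outlier_analysis` into one sortable table. The following is a sketch under the assumption that each entry has the keys produced by `analyze_all_features`; `outlier_summary_table` is a hypothetical helper, and the toy input uses made-up numbers.

```python
import pandas as pd


def outlier_summary_table(outlier_analysis):
    """Flatten the per-feature analysis dict into a DataFrame, worst first (hypothetical helper)."""
    rows = {
        feature: {
            'outlier_count': m['count'],
            'outlier_pct': m['percentage'],
            'lower_bound': m['bounds'][0],
            'upper_bound': m['bounds'][1],
            'skewness': m['skewness'],
        }
        for feature, m in outlier_analysis.items()
    }
    table = pd.DataFrame.from_dict(rows, orient='index')
    return table.sort_values('outlier_pct', ascending=False)


# Toy input with made-up numbers, shaped like the dict from analyze_all_features.
toy_analysis = {
    'A': {'count': 2, 'percentage': 2.0, 'bounds': (-1.0, 3.0), 'skewness': 0.1},
    'B': {'count': 30, 'percentage': 30.0, 'bounds': (0.0, 0.0), 'skewness': 1.5},
}
table = outlier_summary_table(toy_analysis)
print(table)
```

In the notebook, `outlier_summary_table(outlier_analysis)` would list the features with the highest outlier share at the top.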


# Analysis of Feature Distributions and Outliers

## Shot-Related Features

### 1. FG2A (2-Point Field Goal Attempts)
- **Distribution**: Discrete, concentrated at 0-4 attempts
- **Outliers**: 1.5% of data
- **Skewness**: 1.25 (right-skewed)
- **Pattern**: Most possessions have 0-1 attempts, rare cases of 4+ attempts

### 2. FG2M (2-Point Field Goals Made)
- **Distribution**: Highly concentrated at 0-1 makes
- **Outliers**: 22.5% (relatively high)
- **Skewness**: 1.33 (right-skewed)
- **Pattern**: Strong binary pattern (make/miss)

### 3. FG3A/FG3M (3-Point Attempts/Makes)
- **Distribution**: Similar to FG2A/M but more extreme
- **Outliers**: 0.3% (very few)
- **Skewness**: 1.12/2.86 respectively
- **Pattern**: More concentrated at 0, fewer multiple-attempt possessions

## Game Flow Features

### 4. PERIOD
- **Distribution**: Uniform across periods 1-4
- **Outliers**: 0.0% (as expected)
- **Pattern**: Clear peaks for each quarter, minimal overtime periods

### 5. STARTSCOREDIFFERENTIAL
- **Distribution**: Normal/bell-shaped
- **Outliers**: 3.8%
- **Skewness**: -0.02 (nearly symmetric)
- **Range**: Mostly within ±25 points

### 6. EVENT_DURATION
- **Distribution**: Right-skewed
- **Outliers**: 1.1%
- **Pattern**: 
  - Peak around 15-20 seconds
  - Long tail extending to 60 seconds
  - Consistent with the NBA's 24-second shot clock

## Foul-Related Features

### 7. NONSHOOTINGFOULSTHATRESULTEDINFTS
- **Distribution**: Highly skewed
- **Outliers**: 3.8%
- **Pattern**: Majority at 0, sharp decline

### 8. SHOOTINGFOULSDRAWN
- **Distribution**: Right-skewed
- **Outliers**: 19.8%
- **Pattern**: Most common at 0-1, rare above 2

## Time-Related Features

### 9. STARTTIME_SEC/ENDTIME_SEC
- **Distribution**: Relatively uniform within game periods
- **Outliers**: 0.0%
- **Pattern**: Clear period boundaries at multiples of 720 seconds

## Additional Metrics

### 10. OFFENSIVEREBOUNDS
- **Distribution**: Right-skewed
- **Outliers**: 24.5%
- **Pattern**: Most possessions have 0-1 rebounds

### 11. TURNOVERS
- **Distribution**: Highly right-skewed
- **Outliers**: 10.7%
- **Pattern**: Majority at 0, rapid decline

## Key Insights

1. **Shot Distributions**:
   - Most possessions involve 0-1 shot attempts
   - Higher variance in 2-point attempts vs 3-point
   - Clear make/miss patterns align with expected basketball statistics

2. **Game Flow**:
   - Score differentials follow normal distribution
   - Event durations align with shot clock expectations
   - Clear period structure with expected frequency drops in overtime

3. **Outlier Patterns**:
   - Highest outlier percentages in:
     - FG2M (22.5%)
     - OFFENSIVEREBOUNDS (24.5%)
     - SHOOTINGFOULSDRAWN (19.8%)
   - Lowest in:
     - PERIOD (0.0%)
     - FG3A (0.3%)
     - TIME-related features (0.0%)

4. **Data Quality Implications**:
   - Most distributions follow expected basketball patterns
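   - Outliers generally represent legitimate game situations; for zero-dominated count features (e.g., FG2M, OFFENSIVEREBOUNDS), a high IQR outlier rate likely means Q1 = Q3 = 0, so every nonzero value is flagged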
   - Time-related features show good structural integrity

## Recommendations

1. **Outlier Handling**:
   - Keep most outliers as they represent valid game situations
   - Focus validation on extreme cases in:
     - SHOOTINGFOULSDRAWN > 2
     - EVENT_DURATION > 40 seconds
     - Multiple shot attempts in single possession

2. **Feature Engineering**:
   - Consider creating composite features for:
     - Shooting efficiency
     - Possession outcome classification
     - Time management metrics

3. **Data Quality**:
   - Implement range validation for score differentials
   - Verify time sequence integrity
   - Cross-validate multiple shot attempts
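
The validation targets above can be turned into review flags instead of filters, so that no rows are dropped. The sketch below is one way to do that. `flag_extreme_cases` and the `FLAG_*` column names are hypothetical; the thresholds of 2 fouls and 40 seconds come from the recommendations, and the ±50 score bound from the validation rule in Section 7.2.

```python
import pandas as pd


def flag_extreme_cases(df, max_fouls=2, max_duration=40, max_abs_diff=50):
    """Return a copy of df with boolean review flags (hypothetical helper)."""
    out = df.copy()
    out['FLAG_FOULS'] = out['SHOOTINGFOULSDRAWN'] > max_fouls
    out['FLAG_DURATION'] = out['EVENT_DURATION'] > max_duration
    out['FLAG_SCORE_DIFF'] = out['STARTSCOREDIFFERENTIAL'].abs() > max_abs_diff
    flag_cols = ['FLAG_FOULS', 'FLAG_DURATION', 'FLAG_SCORE_DIFF']
    out['NEEDS_REVIEW'] = out[flag_cols].any(axis=1)
    return out


# Made-up rows standing in for df_pbp.
toy = pd.DataFrame({
    'SHOOTINGFOULSDRAWN': [0, 3, 1],
    'EVENT_DURATION': [15.0, 20.0, 55.0],
    'STARTSCOREDIFFERENTIAL': [-4, 10, 2],
})
flagged = flag_extreme_cases(toy)
print(flagged[['FLAG_FOULS', 'FLAG_DURATION', 'FLAG_SCORE_DIFF', 'NEEDS_REVIEW']])
```

Keeping the flags as columns lets a later step decide whether to inspect, cap, or exclude those rows.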

## 8. Next Steps & Save Cleaned Data

After further processing and EDA, export the refined dataset for modeling.

In [None]:
# Save the refined PBPSTATS dataset to a CSV file
output_path = Path('../data/processed/refined_pbpstats_2024.csv')
output_path.parent.mkdir(parents=True, exist_ok=True)  # Create the output folder if it is missing
df_pbp.to_csv(output_path, index=False)
logging.info(f"Refined PBPSTATS data saved at: {output_path}")
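
CSV does not store dtypes, so a quick round-trip check can catch surprises before the file is handed to the modeling stage. The sketch below writes a toy frame (made-up values) to a temporary file and reads it back; for the real output, the same `pd.read_csv(..., parse_dates=['GAMEDATE'])` call would be pointed at `output_path`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows standing in for the refined dataset.
toy = pd.DataFrame({
    'GAMEID': [1, 1],
    'GAMEDATE': pd.to_datetime(['2024-01-01', '2024-01-01']),
    'EVENT_DURATION': [12.0, 18.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'roundtrip_check.csv'
    toy.to_csv(path, index=False)
    # Without parse_dates, GAMEDATE would come back as plain strings.
    reloaded = pd.read_csv(path, parse_dates=['GAMEDATE'])

same_shape = reloaded.shape == toy.shape
dates_ok = pd.api.types.is_datetime64_any_dtype(reloaded['GAMEDATE'])
print(f"same shape: {same_shape}, GAMEDATE parsed as datetime: {dates_ok}")
```

Categorical dtypes are likewise lost in CSV and would need to be re-applied after loading.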

### Final Thoughts

Our comprehensive preprocessing and EDA have prepared a clean, enriched play-by-play dataset. The next phase will focus on building predictive models that leverage these insights to forecast game events and contribute to our overall EPV model for NBA games.