# **NBA Play-by-Play Preprocessing: Durant Era Impact**

This notebook prepares NBA play-by-play data for downstream analysis of team behavior during different phases of Kevin Durant's career.

## **What this notebook does**
- Loads play-by-play data from 11 NBA seasons (2013–2024)
- Combines all seasons into a single DataFrame
- Optimizes data types to reduce memory usage
- Converts date and time columns for proper filtering and analysis
- Segments the data into three career eras:
  - Pre-GSW Era (before July 7, 2016)
  - GSW Era (July 7, 2016 through July 6, 2019)
  - Post-GSW Era (July 7, 2019 onward)
- Filters for OKC and GSW games during each era
- Displays basic statistics to verify correct filtering

## **Purpose**
This preprocessing sets up the data for comparing how OKC and GSW team behavior shifted before, during, and after Durant's time with the Warriors.


**SECTION 1: DATA LOADING AND PREPARATION**

This section loads NBA play-by-play data from multiple seasons and prepares it for analysis.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import os

# Base path to your Google Drive folder
base_path = "/content/drive/MyDrive/NBA Data/Combined Files [With Game ID]/"

# List of file names by season
seasons = [
    "2013-2014", "2014-2015", "2015-2016", "2016-2017", "2017-2018",
    "2018-2019", "2019-2020", "2020-2021", "2021-2022", "2022-2023", "2023-2024"
]

# Read each season into a list, then combine them into one DataFrame at the end
# (concatenating inside the loop re-copies the growing DataFrame on every pass)
season_frames = []

for season in seasons:
    file_name = f"{season}_NBA_PbP_Logs.csv"
    full_path = os.path.join(base_path, file_name)

    print(f"Loading {file_name}...")
    df = pd.read_csv(full_path)
    df['season'] = season  # Track what season each row is from
    season_frames.append(df)

all_data = pd.concat(season_frames, ignore_index=True)

# Display basic information about the dataset
all_data.info()


Mounted at /content/drive
Loading 2013-2014_NBA_PbP_Logs.csv...
Loading 2014-2015_NBA_PbP_Logs.csv...
Loading 2015-2016_NBA_PbP_Logs.csv...
Loading 2016-2017_NBA_PbP_Logs.csv...
Loading 2017-2018_NBA_PbP_Logs.csv...
Loading 2018-2019_NBA_PbP_Logs.csv...
Loading 2019-2020_NBA_PbP_Logs.csv...
Loading 2020-2021_NBA_PbP_Logs.csv...
Loading 2021-2022_NBA_PbP_Logs.csv...
Loading 2022-2023_NBA_PbP_Logs.csv...
Loading 2023-2024_NBA_PbP_Logs.csv...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6553006 entries, 0 to 6553005
Data columns (total 47 columns):
 #   Column          Dtype  
---  ------          -----  
 0   game_id         int64  
 1   data_set        object 
 2   date            object 
 3   a1              object 
 4   a2              object 
 5   a3              object 
 6   a4              object 
 7   a5              object 
 8   h1              object 
 9   h2              object 
 10  h3              object 
 11  h4              object 
 12  h5              object 
 13  per

**SECTION 2: DATA TYPE OPTIMIZATION**

This section converts columns to appropriate data types to improve performance and reduce memory usage.

In [None]:
# Convert date to datetime format for better filtering.
# errors='coerce' turns unparseable values into NaT, and NaT rows match none of the era filters.
all_data['date'] = pd.to_datetime(all_data['date'], errors='coerce')

# Convert time columns to timedelta for accurate time calculations
# (unparseable values likewise become NaT)
time_cols = ['remaining_time', 'elapsed', 'play_length']
for col in time_cols:
    all_data[col] = pd.to_timedelta(all_data[col], errors='coerce')

# Convert text columns with limited unique values to category type to save memory
category_cols = [
    'data_set', 'season', 'team', 'event_type', 'type', 'player', 'assist', 'block',
    'steal', 'entered', 'left', 'reason', 'result', 'possession', 'a1', 'a2', 'a3', 'a4', 'a5',
    'h1', 'h2', 'h3', 'h4', 'h5', 'home', 'away', 'opponent'
]

for col in category_cols:
    all_data[col] = all_data[col].astype('category')
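
The memory saving from the category conversion can be checked on a small example. The sketch below uses a made-up column of team codes (the values and row count are assumptions, not taken from the dataset) and compares `memory_usage(deep=True)` before and after the conversion.

```python
import pandas as pd

# Hypothetical toy column: four team codes repeated many times,
# stored as plain Python strings (object dtype)
teams = pd.Series(["OKC", "GSW", "SAS", "HOU"] * 25_000, dtype="object", name="team")

# deep=True counts the string payloads, not just the pointers
as_object = teams.memory_usage(deep=True)
as_category = teams.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

The same comparison should work on a real column, for example `all_data['team'].memory_usage(deep=True)` before and after the cast. The saving shrinks for columns with many distinct values, such as `player`.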

**SECTION 3: DURANT CAREER ERA ANALYSIS FUNCTION**

This function divides the data into three periods based on Durant's team changes, then filters each period for OKC and GSW games.

In [None]:
def analyze_durant_impact(all_data):
    """
    Filter OKC and GSW games across Durant's career eras.
    Returns a dictionary mapping keys such as 'okc_pre' and 'gsw_during'
    to the filtered DataFrames.
    """
    # 1. Time-based era division
    pre_warriors_era = all_data[all_data['date'] < '2016-07-07']        # Durant on OKC
    warriors_era = all_data[(all_data['date'] >= '2016-07-07') &
                          (all_data['date'] < '2019-07-07')]             # Durant on GSW
    post_warriors_era = all_data[all_data['date'] >= '2019-07-07']      # Durant on BKN/PHX

    # 2. Function to filter team games
    def filter_team_games(df, team_code):
        return df[
            df['matchup'].str.contains(f'{team_code}@', na=False) |
            df['matchup'].str.contains(f'@{team_code}', na=False)
        ]

    # 3. Filter for each team in each era
    # OKC games across eras
    okc_pre_durant_leave = filter_team_games(pre_warriors_era, 'OKC')
    okc_durant_gsw_era = filter_team_games(warriors_era, 'OKC')
    okc_post_durant = filter_team_games(post_warriors_era, 'OKC')

    # GSW games across eras
    gsw_pre_durant = filter_team_games(pre_warriors_era, 'GSW')
    gsw_durant_era = filter_team_games(warriors_era, 'GSW')
    gsw_post_durant = filter_team_games(post_warriors_era, 'GSW')

    return {
        'okc_pre': okc_pre_durant_leave,
        'okc_during': okc_durant_gsw_era,
        'okc_post': okc_post_durant,
        'gsw_pre': gsw_pre_durant,
        'gsw_during': gsw_durant_era,
        'gsw_post': gsw_post_durant
    }
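
The boundary handling in the function above can be sanity-checked on a few hand-made rows. This is a minimal sketch: the toy `game_id`, `date`, and `matchup` values are invented for illustration, and the helper is a standalone copy of the nested `filter_team_games` with `regex=False` added so the team code is matched literally.

```python
import pandas as pd

# Hypothetical toy rows that sit on or next to the era boundaries;
# the last row has a missing date (NaT)
toy = pd.DataFrame({
    "game_id": [1, 2, 3, 4, 5],
    "date": pd.to_datetime(
        ["2016-07-06", "2016-07-07", "2019-07-06", "2019-07-07", None]
    ),
    "matchup": ["LAL@OKC", "OKC@DEN", "GSW@PHX", "DAL@GSW", "SAS@HOU"],
})

start = pd.Timestamp("2016-07-07")
end = pd.Timestamp("2019-07-07")

# Same masks as analyze_durant_impact: start is inclusive, end is exclusive
pre = toy[toy["date"] < start]
during = toy[(toy["date"] >= start) & (toy["date"] < end)]
post = toy[toy["date"] >= end]

def filter_team_games(df, team_code):
    # Match the team as the away side ("OKC@...") or the home side ("...@OKC")
    matchup = df["matchup"]
    return df[
        matchup.str.contains(f"{team_code}@", na=False, regex=False)
        | matchup.str.contains(f"@{team_code}", na=False, regex=False)
    ]

okc = filter_team_games(toy, "OKC")

print("pre:   ", pre["game_id"].tolist())
print("during:", during["game_id"].tolist())
print("post:  ", post["game_id"].tolist())
print("okc:   ", okc["game_id"].tolist())
```

Game 5 has no date, so it lands in none of the three eras. Rows whose `date` was coerced to `NaT` during preprocessing would drop out of the real results in the same way.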

In [None]:
# Call the analysis function to get the filtered dataframes
results = analyze_durant_impact(all_data)

# Print the number of games in each filtered dataset
print("OKC Games by Era:")
print(f"Pre-Durant leaving OKC: {results['okc_pre']['game_id'].nunique()} games")
print(f"During Durant's GSW era: {results['okc_during']['game_id'].nunique()} games")
print(f"Post-Durant GSW era: {results['okc_post']['game_id'].nunique()} games")

print("\nGSW Games by Era:")
print(f"Pre-Durant joining GSW: {results['gsw_pre']['game_id'].nunique()} games")
print(f"With Durant: {results['gsw_during']['game_id'].nunique()} games")
print(f"Post-Durant: {results['gsw_post']['game_id'].nunique()} games")

# Show sample matchups to verify correct filtering
print("\nSample matchups for OKC pre-Durant leaving:")
print(results['okc_pre']['matchup'].unique()[:5])  # First 5 matchups

print("\nSample matchups for GSW pre-Durant:")
print(results['gsw_pre']['matchup'].unique()[:5])  # First 5 matchups

# Print the total number of plays in each dataset
print("\nTotal plays in each dataset:")
for key, df in results.items():
    print(f"{key}: {len(df)} plays")

OKC Games by Era:
Pre-Durant leaving OKC: 283 games
During Durant's GSW era: 262 games
Post-Durant GSW era: 409 games

GSW Games by Era:
Pre-Durant joining GSW: 298 games
With Durant: 306 games
Post-Durant: 421 games

Sample matchups for OKC pre-Durant leaving:
['LAL@OKC' 'ORL@OKC' 'OKC@DEN' 'CHI@OKC' 'OKC@SAS']

Sample matchups for GSW pre-Durant:
['DAL@GSW' 'HOU@GSW' 'GSW@PHX' 'NOP@GSW' 'SAS@GSW']

Total plays in each dataset:
okc_pre: 132409 plays
okc_during: 125256 plays
okc_post: 192389 plays
gsw_pre: 138984 plays
gsw_during: 142432 plays
gsw_post: 198561 plays
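
Beyond the counts above, one further check is that no game appears in more than one era for the same team. The sketch below assumes a dictionary shaped like `results` (keys prefixed with `okc_` or `gsw_`, each value a DataFrame with a `game_id` column). The helper name `overlapping_game_ids` and the toy data are hypothetical.

```python
from itertools import combinations

import pandas as pd

def overlapping_game_ids(results, team_prefix):
    """Return the game_ids that appear in more than one era for one team."""
    eras = {
        key: set(df["game_id"].unique().tolist())
        for key, df in results.items()
        if key.startswith(team_prefix)
    }
    overlap = set()
    # Compare every pair of eras and collect any shared game_ids
    for ids_a, ids_b in combinations(eras.values(), 2):
        overlap |= ids_a & ids_b
    return overlap

# Hypothetical toy dictionaries standing in for `results`
clean = {
    "okc_pre": pd.DataFrame({"game_id": [1, 1, 2]}),
    "okc_during": pd.DataFrame({"game_id": [3, 3]}),
    "okc_post": pd.DataFrame({"game_id": [4]}),
}
broken = {
    "okc_pre": pd.DataFrame({"game_id": [1, 2]}),
    "okc_during": pd.DataFrame({"game_id": [2, 3]}),
}

clean_overlap = overlapping_game_ids(clean, "okc")
broken_overlap = overlapping_game_ids(broken, "okc")
print(clean_overlap)
print(broken_overlap)
```

On the real data, `overlapping_game_ids(results, 'okc')` and `overlapping_game_ids(results, 'gsw')` would both be expected to return an empty set, provided each game carries a single date.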
