# Tracking Data Preprocessing - datanba_2024.csv



This notebook preprocesses the NBA tracking data from `datanba_2024.csv`, which contains high-frequency (25 Hz) positional data for NBA players and the ball. The notebook covers data loading, description, missing values and data type checks, feature engineering, exploratory data analysis (EDA), and data quality & anomaly detection. 

The final goal is to obtain a clean and enriched dataset ready for advanced modeling (e.g., using LSTMs, Transformers, or Graph Neural Networks) to predict player trajectories and analyze team strategies.

## Table of Contents


1. [Introduction](#introduction)
2. [Data Loading](#data-loading)
3. [Data Description](#data-description)
4. [Missing Values & Data Types](#missing-values)
5. [Feature Engineering](#feature-engineering)
6. [Exploratory Data Analysis (EDA)](#eda)
7. [Data Quality & Anomaly Detection](#data-quality)
8. [Next Steps](#next-steps)

## 1. Introduction <a id="introduction"></a>

The dataset `datanba_2024.csv` contains the following key columns:

- **evt:** Event identifier (integer).
- **wallclk:** Wall clock timestamp (string in ISO format).
- **cl:** In-game clock (string, e.g., '12:00').
- **de:** Event description (string).
- **locX, locY:** Spatial coordinates on the court (integers).
- **opt1, opt2, opt3, opt4:** Optional fields with additional numerical data.
- **mtype, etype:** Codes representing event types (integers).
- **opid:** Optional event identifier (float, many missing values).
- **tid:** Team identifier (integer).
- **pid:** Player identifier (integer).
- **hs, vs:** Home and visitor scores (integers).
- **epid:** Event period identifier (float).
- **oftid:** Offensive team identifier (integer).
- **ord:** Order of the event in the game (integer).
- **pts:** Points scored in the event (integer).
- **PERIOD:** Game period/quarter (integer).
- **GAME_ID:** Unique game identifier (integer).

We will inspect, clean, and engineer features from this data before integrating it with other datasets.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Set visualization parameters
sns.set(style="whitegrid", context="talk")
plt.rcParams["figure.figsize"] = (12, 6)

print("Libraries imported successfully.")

## 2. Data Loading <a id="data-loading"></a>


Load the tracking data from the CSV file. Ensure that the file is located in the `data/raw/` folder. In this example, we assume the path is `../data/raw/datanba_2024.csv`.

In [None]:
# Define the file path
file_path = '../data/raw/datanba_2024.csv'

# Load the data into a DataFrame
try:
    df_tracking = pd.read_csv(file_path)
    print('Tracking data loaded successfully.')
except Exception as e:
    print(f'Error loading file: {e}')
    # Re-raise: every cell below needs df_tracking to be defined
    raise

# Display the first few rows
display(df_tracking.head())

# Print the dataset shape
print('Dataset shape:', df_tracking.shape)

## 3. Data Description <a id="data-description"></a>


Below is an overview of the columns in this dataset:

- **evt:** Event identifier.
- **wallclk:** Wall clock timestamp (ISO format).
- **cl:** In-game clock in MM:SS format.
- **de:** Description of the event.
- **locX, locY:** X and Y coordinates on the court.
- **opt1, opt2, opt3, opt4:** Additional optional numerical data.
- **mtype, etype:** Codes representing event types.
- **opid:** Optional event identifier (note many missing values).
- **tid:** Team identifier.
- **pid:** Player identifier.
- **hs, vs:** Home and visitor scores.
- **epid:** Event period identifier.
- **oftid:** Offensive team identifier.
- **ord:** Order of the event in the game.
- **pts:** Points scored in the event.
- **PERIOD:** Game period/quarter.
- **GAME_ID:** Unique game identifier.

This detailed description helps us plan the cleaning and feature engineering steps.

## 4. Missing Values & Data Types <a id="missing-values"></a>


We now inspect the dataset for missing values and review the current data types. This helps us identify which columns need conversion or imputation.

### 4.1 Check for Missing Values



Here we count the missing values in each column with `isnull().sum()`. Columns with non-zero counts are the candidates for imputation or special handling during cleaning.


In [None]:
# Check for missing values
missing_counts = df_tracking.isnull().sum()
print('Missing values in each column:')
print(missing_counts)


#### Missing Values Analysis

#### Observations:


- **Complete Columns:**  
  Most columns (e.g., `evt`, `wallclk`, `cl`, `de`, `locX`, `locY`, `tid`, `pid`, `hs`, `vs`, `PERIOD`, `GAME_ID`, etc.) have **0 missing values**. This indicates that these fields are fully populated and likely crucial for our analysis.
  
- **Columns with Missing Values:**  
  - **opid:** There are **267,514 missing values**. This column is labeled as an optional event identifier, and the high number of missing entries might be inherent to the data source.
  - **epid:** There are **211,048 missing values**. This column represents the event period identifier and also has many missing entries, which may indicate that this field is not consistently recorded or is optional.



#### Implications for Analysis:


- Since the majority of the columns have no missing values, we can focus our cleaning efforts on understanding and handling the missing data in `opid` and `epid`.  
- Depending on the relevance of these columns to our modeling objectives, we might choose to impute missing values, drop these columns, or simply flag them as optional data that may not affect our core analysis.
- Overall, the quality of the tracking data is strong, with only a couple of optional fields showing missing values.

This analysis of missing values helps us to identify where additional cleaning or special handling may be necessary before proceeding with further feature engineering and modeling.
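
The options listed above (impute, drop, or flag) can be compared more easily once the missing share per column is known. The following is a minimal sketch on a made-up toy frame, not the real `df_tracking`; the `has_opid` and `has_epid` flag columns are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_tracking (values are made up).
toy = pd.DataFrame({
    "evt": [1, 2, 3, 4],
    "opid": [np.nan, 203999.0, np.nan, np.nan],
    "epid": [np.nan, np.nan, 1628369.0, 201939.0],
})

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Record "was this optional field present?" as a boolean flag
# rather than imputing an identifier that has no natural default.
for col in ["opid", "epid"]:
    toy[f"has_{col}"] = toy[col].notna()

print(toy)
```

Flagging keeps the information that a field was absent without inventing identifier values, which is one reasonable choice for optional ID columns.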

### 4.2 Display Current Data Types

Before converting any columns, we display the current data types of our DataFrame. This gives us insight into the structure of the data and informs us which columns need conversion.

In [None]:
# Display the current data types
print('\nData types:')
print(df_tracking.dtypes)

#### Data Types Overview


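Below is the `dtypes` output for our dataset. It was captured after the feature engineering cells in Section 5 had been run, so it also lists the engineered columns (`cl_seconds` through `velocity`):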

```
evt             int64
wallclk        object
cl             object
de             object
locX            int64
locY            int64
opt1            int64
opt2            int64
opt3            int64
opt4            int64
mtype           int64
etype           int64
opid          float64
tid             int64
pid             int64
hs              int64
vs              int64
epid          float64
oftid           int64
ord             int64
pts             int64
PERIOD          int64
GAME_ID         int64
cl_seconds      int64
time_diff     float64
dx            float64
dy            float64
distance      float64
velocity      float64
```

#### **Observations:**


- **Integer Columns (`int64`)**:  
  Many fields, such as `evt`, `locX`, `locY`, `tid`, `pid`, `PERIOD`, `GAME_ID`, and `pts`, are stored as integers, which is appropriate for identifiers, codes, coordinates, and counts.  

- **Floating-Point Columns (`float64`)**:  
  Fields like `opid`, `epid`, `time_diff`, `dx`, `dy`, `distance`, and `velocity` are in floating-point format, which makes sense since these involve either missing values or continuous numerical calculations.

- **Object Columns (`object`)**:  
  - `wallclk`, `cl`, and `de` are stored as **object (string) types**.  
  - `wallclk` (wall clock timestamp) might be better converted into **datetime format** for easier time-based calculations.  
  - `cl` (in-game clock) may need **conversion into seconds** for easier numerical operations.  



#### **Next Steps:**


- Convert `wallclk` to `datetime64` format.
- Transform `cl` (clock) into a numerical format (`cl_seconds`).
- Ensure `de` (event description) remains an object since it contains textual data.
- Check if `opid` and `epid` should be imputed or left as is.

By ensuring that all data types are correctly formatted, we can perform **efficient computations, avoid type-related errors, and prepare the data for feature engineering and modeling.**
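
The first item in the list above, converting `wallclk` to a datetime type, could look roughly like the sketch below. The example strings are assumptions about what an ISO-formatted `wallclk` value might look like; the exact format in `datanba_2024.csv` may differ, and `wallclk_dt` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical wallclk strings; the real values may use a different ISO variant.
toy = pd.DataFrame({
    "wallclk": [
        "2024-01-15T19:42:07.300Z",
        "2024-01-15T19:42:11.900Z",
        "not a timestamp",
    ],
})

# errors="coerce" turns unparseable strings into NaT instead of raising;
# utc=True yields a timezone-aware column.
toy["wallclk_dt"] = pd.to_datetime(toy["wallclk"], errors="coerce", utc=True)

print(toy.dtypes)
print("Unparseable timestamps:", toy["wallclk_dt"].isna().sum())

# Datetime columns support direct arithmetic, e.g. elapsed wall-clock time.
elapsed = toy["wallclk_dt"].iloc[1] - toy["wallclk_dt"].iloc[0]
print("Elapsed seconds:", elapsed.total_seconds())
```

Counting the `NaT` values after conversion is a quick way to see whether the assumed format actually matches the data.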

### 4.3 Convert Specific Columns to Numeric



Some columns, such as spatial coordinates (`locX`, `locY`) and points (`pts`), should be numeric. We use `pd.to_numeric()` with `errors='coerce'` to ensure any non-numeric values are safely converted to `NaN`.

In [None]:
# Convert specific columns to numeric if needed
numeric_columns = ['locX', 'locY', 'pts']
for col in numeric_columns:
    df_tracking[col] = pd.to_numeric(df_tracking[col], errors='coerce')

print('\nData types after conversion:')
print(df_tracking.dtypes)
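
To see what `errors='coerce'` does in practice, the sketch below runs `pd.to_numeric()` on a small made-up series and reports which original values could not be parsed. The sample values are illustrative only.

```python
import pandas as pd

# Hypothetical raw column mixing valid numbers with a junk string.
raw = pd.Series(["12", "-45", "N/A", "7"], name="locX")

converted = pd.to_numeric(raw, errors="coerce")

# Values that were present before conversion but are NaN afterwards
# were not parseable as numbers.
newly_missing = converted.isna() & raw.notna()

print(converted)
print("Unparseable values:", raw[newly_missing].tolist())
```

Comparing missing counts before and after the conversion shows whether coercion silently discarded any data.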

#### Analysis of Data Type Conversion and Missing Values



After loading the dataset, we inspected the missing values and data types. Most columns are complete; however, we observed that the columns `opid` and `epid` have many missing values, which appears to be inherent in the data source (they are optional event identifiers). 

We then verified that critical numerical fields such as `locX`, `locY`, and `pts` are indeed numeric. This is crucial for any arithmetic operations and feature calculations that we will perform in the next section.

The in-game clock (`cl`), which is in `MM:SS` format, is converted into total seconds in a new column `cl_seconds` in Section 5.1. This conversion allows us to compute time differences between consecutive events.

When this cell is re-run after Section 5, the data types listing also includes the fields produced there (`cl_seconds`, `time_diff`, `dx`, `dy`, `distance`, and `velocity`). This sets a good foundation for our feature engineering steps.

## 5. Feature Engineering <a id="feature-engineering"></a>


In this section, we derive new features from the raw tracking data to capture the spatial and temporal dynamics of NBA player movement. These features are critical for our later modeling tasks (e.g., predicting player trajectories). The main steps include:

1. Sorting the data by player identifier and time.
2. Converting the in-game clock (in MM:SS format) to total seconds.
3. Computing the time difference between consecutive events for each player.
4. Calculating spatial differences (dx and dy) between consecutive positions.
5. Computing the Euclidean distance traveled between events.
6. Calculating velocity as distance divided by the time difference.

Let's implement these steps one by one.


### 5.1 Convert In-Game Clock to Seconds



The column `cl` contains the in-game clock in `MM:SS` format. For numerical operations, we need to convert these values to total seconds. We will create a new column called `cl_seconds` to store this conversion.


In [None]:
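# Define a helper function to convert "MM:SS" (or "MM:SS.f") to total seconds
def time_to_seconds(time_str):
    try:
        parts = time_str.split(':')
        minutes = int(parts[0])
        # Use float conversion for the seconds part to handle decimals
        seconds = float(parts[1])
        return minutes * 60 + seconds
    except (AttributeError, IndexError, ValueError):
        # Missing or malformed clock strings fall back to 0, which makes them
        # indistinguishable from a genuine "00:00" reading
        return 0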


# Apply the conversion function to the 'cl' column and create a new column 'cl_seconds'
df_tracking['cl_seconds'] = df_tracking['cl'].apply(time_to_seconds)

# Display a sample to verify the conversion
display(df_tracking[['cl', 'cl_seconds']].head())
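
As a possible alternative to the row-wise helper, the conversion can be sketched with vectorized string operations that return `NaN` for values that do not match the expected pattern, so they stay distinguishable from a real "00:00". The function name `clock_to_seconds` and the regular expression are assumptions made for this illustration, not part of the notebook's pipeline.

```python
import pandas as pd


def clock_to_seconds(clock: pd.Series) -> pd.Series:
    """Convert 'MM:SS' or 'MM:SS.f' strings to seconds; anything else becomes NaN."""
    # Two capture groups: minutes, and seconds with an optional decimal part.
    parts = clock.str.extract(r"^\s*(\d+):(\d+(?:\.\d+)?)\s*$")
    minutes = pd.to_numeric(parts[0], errors="coerce")
    seconds = pd.to_numeric(parts[1], errors="coerce")
    return (minutes * 60 + seconds).astype("float64")


# Made-up sample values, including a malformed string and a missing value.
sample = pd.Series(["12:00", "00:49.9", "00:00", "bad", None])
result = clock_to_seconds(sample)
print(result)
```

With this variant, a filter such as `result == 0` selects only genuine zero-clock readings, while `result.isna()` isolates the values that failed to parse.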


#### Investigating Event Types When the In-Game Clock is Zero


In this section, we focus on rows where the in-game clock (`cl_seconds`) is 0. We want to inspect these events, analyze their descriptions, and understand what kind of events typically occur at this time (for example, "End Period" or "Rebound" events). We will split our analysis into several steps:
1. Filter and display a sample of the zero clock events.
2. Analyze the frequency of event descriptions.
3. Search for specific keywords within the event descriptions.

**Our Goals in this Section:**

- **Identify Event Types:**  
  Analyze the `de` (event description) column to see which events are most common when `cl_seconds` is 0.

- **Frequency Analysis:**  
  Calculate the frequency of different event types (or key substrings) when the in-game clock is zero. This will help us understand whether a large number of events occur at the end of periods or if there are any anomalies.

- **Implications for Feature Engineering:**  
  Knowing the context behind these low clock values may inform how we treat them in further analysis (for example, whether to exclude some events or adjust our feature engineering steps).

Let's proceed by exploring the event descriptions for records with `cl_seconds` equal to 0.


In [None]:
# Filter the DataFrame for rows where cl_seconds is 0
zero_clock_events = df_tracking[df_tracking['cl_seconds'] == 0]

# Display a sample of these events to see what the event descriptions look like
display(zero_clock_events[['cl', 'cl_seconds', 'de', 'evt']].head(20))


In [None]:
# Analyze the frequency of event descriptions in these zero clock events
event_counts = zero_clock_events['de'].value_counts()

print("Frequency of event descriptions when cl_seconds == 0:")
print(event_counts)


In [None]:
# Optionally, if we want to check for keywords (e.g., 'End Period', 'Rebound')
keywords = ['End Period', 'Rebound', 'Shot', 'Turnover', 'Foul', 'Replay']
for keyword in keywords:
    keyword_count = zero_clock_events['de'].str.contains(keyword, case=False, na=False).sum()
    print(f"Number of events containing '{keyword}': {keyword_count}")


#### Summary of Investigating Zero Clock Events



In this section, we filtered the dataset to focus on events where the in-game clock (`cl_seconds`) is 0. Our analysis revealed the following:

- **Frequency Analysis:**  
  - "End Period" events are dominant, with 2,611 occurrences.
  - 390 events contain the keyword "Rebound", 494 contain "Shot", and 243 contain "Turnover".
  - Other keywords, such as "Replay", appear with lower frequencies.
  
- **Implications:**  
  - The prevalence of "End Period" events confirms that many events naturally occur at the end of a period.
  - The frequency of other events (like rebounds and shots) at 0 seconds provides context on how some actions are recorded right at the period boundaries.
  - This insight is useful for our feature engineering because it suggests that events with `cl_seconds` of 0 may need to be flagged or handled differently during modeling.

With these observations in hand, we are now ready to move on to the next step in our feature engineering process.
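
If zero-clock events are to be flagged rather than dropped, one way to do it is with boolean columns, as in the sketch below. The toy rows and the column names `is_zero_clock` and `is_end_period` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Made-up rows; the descriptions are placeholders, not real records.
toy = pd.DataFrame({
    "cl_seconds": [0.0, 0.0, 35.2, 0.0],
    "de": ["End Period", "Rebound", "Jump Shot: Made", None],
})

# Flag rows recorded at a clock reading of exactly zero.
toy["is_zero_clock"] = toy["cl_seconds"].eq(0)

# Flag period-end markers; na=False treats missing descriptions as non-matches.
toy["is_end_period"] = toy["de"].str.contains("End Period", case=False, na=False)

# Zero-clock rows that are not period-end markers.
boundary_plays = toy[toy["is_zero_clock"] & ~toy["is_end_period"]]
print(boundary_plays)
```

Keeping the flags as columns leaves the decision about excluding these rows to the modeling stage.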


### 5.2 Sorting the Data



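To accurately compute differences between consecutive events, we need the data for each player in a consistent order. We'll sort the DataFrame by the player identifier (`pid`) and the converted in-game clock (`cl_seconds`). Note that this orders each player's events by clock value only: it does not separate games or periods, and because the game clock counts down, ascending `cl_seconds` runs from the end of a period toward its start.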

In [None]:
# Sort the DataFrame by player identifier and in-game time (cl_seconds)
df_tracking.sort_values(by=['pid', 'cl_seconds'], inplace=True)

# Verify sorting by displaying a sample of rows for a specific player
sample_pid = df_tracking['pid'].iloc[0]
display(df_tracking[df_tracking['pid'] == sample_pid].head(10))


**Observations from the Sorting Step:**

- The sample output shows rows for a specific player (`pid` value 0) sorted by the `cl_seconds` column.
- In our sample, all `cl_seconds` values appear as `0.0`. This indicates one of two possibilities:
  - **Many events occur at "00:00":**  
    It is plausible that many events for this player occur right at the end of a period (e.g., "End Period" events), which might be recorded as "00:00". However, if all values are 0, this is unlikely for the entire dataset.
  - **Time Conversion Issue:**  
    There might be an issue with our `time_to_seconds` function. Decimal clock values such as "00:00.4" are handled, because the seconds part is parsed with `float`. However, any value that cannot be parsed falls into the `except` branch and is returned as 0, so malformed or missing clock strings would also show up as `cl_seconds` equal to 0.

- The sorting logic itself is working correctly, as it orders the data by `pid` and `cl_seconds`. This is crucial because accurate sorting is the foundation for computing time differences, spatial differences, and ultimately, velocity.


**Next Steps:**

- Investigate the conversion of the `cl` column to `cl_seconds` to ensure it correctly converts values like `"00:00.4"` to `0.4` seconds rather than `0`.
- Once the time values are correctly converted, the computed `time_diff` should reflect the actual elapsed time between events, and the derived velocity values will be more meaningful.
- Continue with the subsequent feature engineering steps (time differences, spatial differences, distance, and velocity) after ensuring the clock conversion is accurate.

This summary confirms that while our sorting logic works as expected, we need to double-check the time conversion to ensure that our derived features accurately capture the temporal dynamics of the game.
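
The sort used in this section relies on `pid` and `cl_seconds` alone. Since the dataset also has `GAME_ID` and `PERIOD` columns and the game clock counts down within a period, a game-by-game chronological order could be sketched as below. This is an alternative shown under those assumptions on made-up rows; it is not the ordering used for the results reported in this notebook.

```python
import pandas as pd

# Made-up rows for a single player across two games and two periods.
toy = pd.DataFrame({
    "GAME_ID": [1, 1, 1, 1, 2],
    "PERIOD": [2, 1, 1, 1, 1],
    "pid": [7, 7, 7, 7, 7],
    "cl_seconds": [700.0, 30.0, 710.0, 355.5, 715.0],
})

# Ascending game and period, but descending clock, because a higher
# clock reading means earlier in the period.
chrono = toy.sort_values(
    by=["pid", "GAME_ID", "PERIOD", "cl_seconds"],
    ascending=[True, True, True, False],
).reset_index(drop=True)

print(chrono)
```

With this ordering, differences between consecutive rows would need to be grouped by `pid`, `GAME_ID`, and `PERIOD` so that they never span a game or period boundary.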


### 5.3 Calculating Time Differences



Now that we have ensured our data is sorted by the player identifier (`pid`) and the converted in-game clock (`cl_seconds`), our next step is to calculate the time difference between consecutive events for each player.

**Why This Step is Important:**
- The time difference, stored in a new column `time_diff`, represents the elapsed time (in seconds) between successive events for a player.
- This feature is critical for calculating other dynamic metrics such as velocity.
- Accurate time differences help us understand the pace of play and the temporal dynamics of player movement.

We use the `.groupby()` method to group events by `pid` and then apply `.diff()` on the `cl_seconds` column to compute the difference between consecutive time stamps. If there is no previous event (i.e., the first event for a player), we fill the missing value with `0`.


In [None]:
# Calculate the time difference (in seconds) between consecutive events for each player
df_tracking['time_diff'] = df_tracking.groupby('pid')['cl_seconds'].diff().fillna(0)

# Display a sample of the new 'time_diff' column alongside 'pid' and 'cl_seconds'
display(df_tracking[['pid', 'cl', 'cl_seconds', 'time_diff']].head(10))
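
To make the behavior of `groupby().diff()` concrete, here is a small sketch on made-up data. It also adds a hypothetical `is_first_event` flag, because after `fillna(0)` a first event and a repeated clock value both end up with a `time_diff` of 0.

```python
import pandas as pd

# Made-up events for two players, already sorted by pid and cl_seconds.
toy = pd.DataFrame({
    "pid": [10, 10, 10, 20, 20],
    "cl_seconds": [0.0, 49.9, 136.0, 12.0, 12.0],
})

# diff() restarts within each pid group, so the first row of every
# group is NaN before fillna(0) replaces it.
toy["time_diff"] = toy.groupby("pid")["cl_seconds"].diff().fillna(0)

# cumcount() numbers the rows within each group starting at 0.
toy["is_first_event"] = toy.groupby("pid").cumcount().eq(0)

print(toy)
```

In this toy output, the last row has a `time_diff` of 0 without being a first event, which is exactly the case the flag is meant to separate.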

#### Additional Comparison of Time Differences

To better understand the time dynamics, we can compare the `time_diff` values across multiple players. This will help us identify if the 0-second differences are consistent across players or if some players have events with non-zero intervals. 

Below, we display the first 10 events for a few different player IDs to see how the time differences vary.


In [None]:
# Get a sample of unique player IDs from the dataset (for example, the first 5 unique players)
unique_players = df_tracking['pid'].unique()[:5]
print("Sample of unique player IDs:", unique_players)

# For each sampled player, display the first 10 events and their time differences
for player in unique_players:
    print(f"\nEvents for player {player}:")
    display(df_tracking[df_tracking['pid'] == player][['cl', 'cl_seconds', 'time_diff']].head(10))


#### Conclusion: Data Sorting and Time Differences



From our analysis of the sorted data and computed time differences, we observed the following:

- **Player 0:**  
  All events for player 0 show a `cl_seconds` value of 0 and, consequently, a `time_diff` of 0. This suggests that for player 0, the events we sampled (such as "End Period" events) occur at the boundary where the in-game clock reads "00:00".  
- **Other Players (e.g., 1320, 1371, 1497, 1658):**  
  - These players show non-zero values for `cl_seconds` and `time_diff`, which indicates that the time conversion is working properly for those events.
  - For instance, player 1320 has events with clock times like "00:49.9", "02:16", "06:47", and "08:44" (converted to seconds as 49.9, 136.0, 407.0, and 524.0, respectively), with corresponding time differences that reflect the elapsed time between events.
  
**Implications:**

- The sorting operation appears to be working correctly, as events for each player are ordered by their converted in-game clock (`cl_seconds`).
- The computed time differences (`time_diff`) vary as expected among players, which will be crucial for calculating derived features such as velocity in subsequent steps.
- The fact that some players (like player 0) consistently have `cl_seconds` equal to 0 is likely due to the nature of those events (e.g., "End Period"). This confirms that we need to consider the context when interpreting the time differences and the derived features.

With these observations, we conclude that our sorting and time difference computations are functioning as intended. We are now ready to move on to the next phase of feature engineering—specifically, computing spatial differences and then using these differences to calculate distance and velocity.

### 5.4 Computing Spatial Differences



In this step, we calculate how much a player's position changes between consecutive events. We do this by computing:

- **dx:** The difference in the X coordinate (`locX`) between consecutive events for each player.
- **dy:** The difference in the Y coordinate (`locY`) between consecutive events for each player.

**Why is this important?**
- These differences quantify the player's movement in each direction.
- They are essential for calculating the Euclidean distance traveled between events.
- Accurate spatial differences are a key input for computing dynamic features like velocity.

We achieve this by grouping the data by the player identifier (`pid`) and applying the `.diff()` method to the `locX` and `locY` columns. Any missing differences (e.g., for the first event of each player) are filled with 0.

Let's implement these steps.


In [None]:
# Compute the difference in the X coordinate (dx) for each player
df_tracking['dx'] = df_tracking.groupby('pid')['locX'].diff().fillna(0)

# Compute the difference in the Y coordinate (dy) for each player
df_tracking['dy'] = df_tracking.groupby('pid')['locY'].diff().fillna(0)

# Display a sample of the computed spatial differences
display(df_tracking[['pid', 'locX', 'locY', 'dx', 'dy']].head(10))
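
Both coordinate differences can also be computed in a single grouped call. The sketch below shows this on made-up coordinates; it is an equivalent formulation of the two `diff()` calls, not a change in method.

```python
import pandas as pd

# Made-up positions for two players.
toy = pd.DataFrame({
    "pid": [10, 10, 10, 20, 20],
    "locX": [0, -199, -150, 30, 30],
    "locY": [0, 454, 400, 80, 95],
})

# Selecting both columns yields a DataFrame of per-player differences.
diffs = toy.groupby("pid")[["locX", "locY"]].diff().fillna(0)
diffs.columns = ["dx", "dy"]

# join() aligns on the index, which diff() preserves.
toy = toy.join(diffs)
print(toy)
```

The first row of each player has `dx` and `dy` equal to 0, matching the behavior described above.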


#### Examining Spatial Differences in Later Events



The first event for each player naturally has `dx` and `dy` equal to 0 because there's no previous event for comparison. To confirm that our spatial difference calculations are working correctly, we need to inspect events later in a player’s sequence where we expect movement to occur.

We will:
- Filter for events where the `time_diff` is greater than 0 (this excludes each player's first event as well as events that share a clock value with the previous one).
- Display a sample for a few players to check that `dx` and `dy` have nonzero values when appropriate.


In [None]:
# Filter for events where the time difference is greater than 0
nonzero_time_events = df_tracking[df_tracking['time_diff'] > 0]

# Display a sample of these events for a few different players
sample_players = nonzero_time_events['pid'].unique()[:3]  # take 3 unique players
for player in sample_players:
    print(f"\nEvents for player {player} with nonzero time differences:")
    display(nonzero_time_events[nonzero_time_events['pid'] == player][['cl', 'cl_seconds', 'time_diff', 'locX', 'locY', 'dx', 'dy']].head(10))


#### Conclusion: Spatial Differences



After filtering for events with nonzero time differences, we examined the spatial differences (`dx` and `dy`) between consecutive events for several players. Our observations are as follows:

- **Player 0:**  
  We see nonzero values in many of the events, indicating that this player’s position does change over time. For example, in some events, the differences (`dx` and `dy`) are substantial (e.g., -199 and 454), reflecting noticeable movement.
  
- **Players 1320 and 1371:**  
  The sample events for these players show that their `dx` and `dy` values are consistently 0. This could indicate that either these events represent moments when these players are stationary (for instance, during certain types of events such as "End Period" or other non-movement-related events) or that the data for those specific events does not capture spatial change.

**Implications:**

- The nonzero spatial differences for player 0 confirm that our method for computing `dx` and `dy` is working as expected for events where movement occurs.
- For players with zero differences, it's important to consider the context of the events. If these events are expected to have little or no movement (for example, if they are administrative or boundary events), then zero values are acceptable.
- In further analysis and modeling, we may want to flag events with zero movement separately or investigate the context behind these records to decide how they should influence the modeling process.

With the spatial differences computed and validated, we now have the necessary components to proceed with the next step: calculating the Euclidean distance and then velocity. This will further enrich our feature set for modeling player movement dynamics.

### 5.5 Calculating Euclidean Distance



Using the spatial differences (`dx` and `dy`) computed in Section 5.4, we now calculate the Euclidean distance traveled between consecutive positions. This distance represents the straight-line distance that a player (or the ball) moved from one event to the next.

The formula used is:

$$
\text{distance} = \sqrt{dx^2 + dy^2}
$$

**Why this is important:**
- **Quantifying Movement:** The distance provides a quantitative measure of how far a player moved between events.
- **Foundation for Velocity Calculation:** This distance, combined with the time difference (`time_diff`), is used to compute the player's velocity.
- **Data Quality Check:** Nonzero distances in events with a nonzero `time_diff` indicate that the spatial tracking data is capturing movement accurately.

Let's implement the calculation of the Euclidean distance.

In [None]:
# Compute the Euclidean distance between consecutive positions using the formula: distance = sqrt(dx^2 + dy^2)
df_tracking['distance'] = np.sqrt(df_tracking['dx']**2 + df_tracking['dy']**2)

# Display a sample of the computed distance alongside the spatial differences for verification
display(df_tracking[['pid', 'dx', 'dy', 'distance']].head(10))


Now, we have computed the Euclidean distance traveled between consecutive events for each player. This distance metric is crucial for understanding player movement dynamics and will serve as a key input for calculating velocity in the next step.
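
The same distance can be obtained with `np.hypot`, which computes $\sqrt{dx^2 + dy^2}$ element-wise. The sketch below checks on made-up values that both formulations agree.

```python
import numpy as np
import pandas as pd

# Made-up spatial differences.
toy = pd.DataFrame({"dx": [0.0, 3.0, -199.0], "dy": [0.0, 4.0, 454.0]})

# np.hypot is the element-wise Euclidean norm of (dx, dy).
toy["distance"] = np.hypot(toy["dx"], toy["dy"])

# Cross-check against the explicit square-root formula.
explicit = np.sqrt(toy["dx"] ** 2 + toy["dy"] ** 2)
print(toy)
print("Formulations agree:", np.allclose(toy["distance"], explicit))
```

Either form works here; `np.hypot` is mainly a readability choice.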

In [None]:
# Filter events with a nonzero distance to show cases with movement
movement_events = df_tracking[df_tracking['distance'] > 0]

# Display a sample of these events to inspect the movement details
display(movement_events[['pid', 'cl', 'cl_seconds', 'dx', 'dy', 'distance']].head(20))


#### Conclusion: Euclidean Distance and Movement Analysis


From our analysis of spatial differences, we computed the Euclidean distance using the formula:

$$
\text{distance} = \sqrt{dx^2 + dy^2}
$$



**Key Observations:**

- For many events, especially the first event for each player, the computed `distance` is 0, as expected, because there is no prior position to compare.
- In contrast, for later events we see significant nonzero distance values. For example:
  - Player 460 shows a distance of approximately 202.63 units when transitioning from one event to the next.
  - Other players (e.g., player 1317, 2049, etc.) exhibit distances in the range of 86.68 to 621.49 units, indicating measurable movement.
- These nonzero distances confirm that our spatial difference calculations (i.e., `dx` and `dy`) are capturing meaningful movement between events.

**Implications:**

- The Euclidean distance feature is a reliable measure of how far players (or the ball) move between successive events.
- These distance values will be a crucial input for calculating velocity in the next section.
- Additionally, by examining these distances, we can later explore patterns such as bursts of movement, changes in pace, or differences across event types.

With the spatial differences and distances computed and validated, we now have a robust feature set capturing the physical movement on the court. We are ready to move on to the next step: calculating velocity (Section 5.6), which will combine these distance values with the computed time differences to quantify the rate of movement.


### 5.6 Calculating Velocity



Velocity is computed as the distance traveled divided by the time difference between consecutive events:

$$
\text{velocity} = \frac{\text{distance}}{\text{time\_diff}}
$$

This formula is only defined when `time_diff` is non-zero; `distance` itself may be zero, in which case the velocity is simply 0.


**Important Considerations:**

- **Zero Time Difference:**  
  When `time_diff` is 0 (e.g., for the first event of each player or events recorded at the exact same timestamp), division by zero must be avoided. In these cases, we set the velocity to 0.
  
- **Physical Interpretation:**  
  The calculated velocity represents the rate of movement between events. Nonzero values indicate measurable movement, while 0 indicates no movement or that it is the first event in a sequence.

We use a lambda function with an if-else condition to safely compute the velocity for each row. This ensures that if `time_diff` is zero, the velocity is explicitly set to 0.

Let's implement this calculation.


In [None]:
# Compute velocity as the ratio of distance to time_diff,
# with a safeguard to handle cases where time_diff is zero.
df_tracking['velocity'] = df_tracking.apply(
    lambda row: row['distance'] / row['time_diff'] if row['time_diff'] > 0 else 0,
    axis=1
)

# Display a sample of the computed velocity alongside related features for verification
display(df_tracking[['pid', 'cl', 'cl_seconds', 'time_diff', 'distance', 'velocity']].head(10))

print("Velocity computation completed successfully.")
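
On a large DataFrame, a row-wise `apply` can be slow. A vectorized formulation with the same zero-time safeguard might look like the sketch below, shown on made-up values; the `moving` mask is a hypothetical helper name.

```python
import pandas as pd

# Made-up distances and time differences, including zero-time rows.
toy = pd.DataFrame({
    "distance": [0.0, 5.0, 12.0, 3.0],
    "time_diff": [0.0, 2.0, 0.0, 0.5],
})

# Rows where the division is well defined.
moving = toy["time_diff"] > 0

# Default to 0, then divide only where time_diff is positive.
toy["velocity"] = 0.0
toy.loc[moving, "velocity"] = (
    toy.loc[moving, "distance"] / toy.loc[moving, "time_diff"]
)

print(toy)
```

Because the division is restricted to the masked rows, no division by zero ever takes place.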

#### Examining Velocity for Events with Nonzero Time Differences



Now that we have computed the velocity, we want to focus on events where the time difference (`time_diff`) is greater than 0. This will allow us to inspect cases where there is measurable movement and verify that the computed velocity is meaningful. The following cell filters these events and displays a sample for multiple players.


In [None]:
# Filter the DataFrame for events with a nonzero time difference
nonzero_velocity_events = df_tracking[df_tracking['time_diff'] > 0]

# Display a sample of these events to verify the computed velocity alongside other features
display(nonzero_velocity_events[['pid', 'cl', 'cl_seconds', 'time_diff', 'distance', 'velocity']].head(20))

#### Conclusion: Velocity Calculation



The velocity for each event has been computed as the ratio of the Euclidean distance traveled to the time difference between consecutive events:

$$
\text{velocity} = \frac{\text{distance}}{\text{time\_diff}}
$$

**Key Observations:**

- **Nonzero Time Differences:**  
  For events where the time difference is greater than 0, we observe nonzero velocity values. For example, in the sample for player 0, we see velocities such as 4956.98, 3144.30, 6984.84, etc. These values represent the rate of movement between consecutive events.

- **Magnitude of Velocities:**  
  The velocities appear to be very high. This may be due to the spatial coordinates' scale or the very short time intervals (fractions of a second) between events. Such high values can occur when even small changes in position are divided by a very small time difference.

- **Zero Velocities:**  
  For events where `time_diff` is 0 (typically the first event for each player or events recorded at the exact same timestamp), the velocity is correctly set to 0, as per our handling in the lambda function.

**Implications:**

- The computed velocity feature successfully captures the movement rate between consecutive events.  
- The high velocity values should be interpreted in the context of our data's scale and the very short time intervals. We may consider further normalization or clipping of these values in later steps, depending on the requirements of our modeling process.
- These velocity values, together with the other engineered features (time difference, spatial differences, and Euclidean distance), provide a robust foundation for understanding player movement dynamics and will serve as important inputs for our subsequent modeling tasks.

With the velocity calculation validated, we now have a comprehensive set of features to represent both the spatial and temporal dynamics of the game. We are ready to proceed to the next steps in our analysis.
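
The zero-`time_diff` handling and the clipping option mentioned above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame, not the notebook's actual implementation (which uses a lambda). The helper name `safe_velocity`, the toy values, and the 99th-percentile cap are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

def safe_velocity(distance: pd.Series, time_diff: pd.Series) -> pd.Series:
    """Return distance / time_diff, with 0 wherever time_diff is not positive."""
    dt = time_diff.where(time_diff > 0)   # non-positive gaps become NaN
    return (distance / dt).fillna(0.0)    # undefined rates are set to 0

# Hypothetical toy values, not taken from datanba_2024.csv
toy = pd.DataFrame({
    "distance":  [0.0, 50.0, 120.0, 30.0, 400.0],
    "time_diff": [0.0, 1.0, 2.0, 0.0, 0.04],
})
toy["velocity"] = safe_velocity(toy["distance"], toy["time_diff"])

# Optional: cap extreme values at an upper percentile (assumed here: 99th)
cap = toy["velocity"].quantile(0.99)
toy["velocity_clipped"] = toy["velocity"].clip(upper=cap)
print(toy)
```

The last toy row shows how a moderate distance divided by a very short interval produces a very large velocity, which is the effect described in the observations above.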


### 5.7 Summary of Feature Engineering


In this chapter, we systematically derived new features from the raw tracking data to capture the spatial and temporal dynamics of NBA player movement. Here's an overview of what we accomplished:

1. **Convert In-Game Clock to Seconds (5.1):**
   - We converted the `cl` column (in `MM:SS` format) to total seconds, storing the result in a new column called `cl_seconds`.  
   - This conversion allows us to perform numerical computations on time values and is essential for computing time differences.

2. **Sorting the Data (5.2):**
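   - The DataFrame was sorted by the player identifier (`pid`) and the converted in-game clock (`cl_seconds`), so that each player's events are ordered by clock value. Because the in-game clock counts down, chronological order within a period corresponds to descending `cl_seconds`.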
   - Proper ordering is critical for accurately computing differences between consecutive events.

3. **Calculating Time Differences (5.3):**
   - We computed the elapsed time (`time_diff`) between consecutive events for each player using the `cl_seconds` column.
   - This feature is a key input for calculating velocity and understanding the temporal dynamics of the game.

4. **Computing Spatial Differences (5.4):**
   - We calculated the differences in the X and Y coordinates (`dx` and `dy`) between consecutive events for each player.
   - These differences quantify the movement along each axis and set the stage for computing the physical distance traveled.

5. **Calculating Euclidean Distance (5.5):**
   - Using the computed spatial differences, we calculated the Euclidean distance with the formula:  
     $$ \text{distance} = \sqrt{dx^2 + dy^2} $$
   - This measure represents the straight-line distance a player moved between consecutive events.

6. **Calculating Velocity (5.6):**
   - Finally, we derived the velocity feature by dividing the Euclidean distance by the time difference (`distance / time_diff`), with appropriate handling for cases where `time_diff` is 0.
   - The velocity provides a measure of the rate of movement, which is essential for capturing dynamic aspects of gameplay.

**Overall Implications:**

- The engineered features (time differences, spatial differences, distance, and velocity) collectively provide a robust foundation for modeling player movement dynamics.
- These features will be crucial inputs for subsequent modeling tasks such as predicting trajectories and detecting key events during the game.
- While some players exhibit 0 values (typically at the start of their sequences or during boundary events), other players show measurable changes, validating the effectiveness of our feature engineering pipeline.

With these features in place, our dataset is now well-prepared for further analysis and modeling. Next, we can focus on integrating these features with additional data or proceeding to advanced exploratory data analysis and model development.
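
The six steps summarized above can be condensed into one short sketch. It runs on a hypothetical toy frame rather than on `datanba_2024.csv`. The helper name `clock_to_seconds`, the ascending sort, and filling each player's first difference with 0 are assumptions about the implementation details, so treat it as an illustration of the idea rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical events for two players (values are made up)
toy = pd.DataFrame({
    "pid":  [1, 1, 1, 2, 2],
    "cl":   ["11:58", "12:00", "11:55", "10:00", "09:58"],
    "locX": [30, 0, 30, -100, -100],
    "locY": [40, 0, 80, 50, 50],
})

def clock_to_seconds(cl: str) -> float:
    """Convert an 'MM:SS' clock string to seconds (5.1)."""
    minutes, seconds = cl.split(":")
    return int(minutes) * 60 + float(seconds)

toy["cl_seconds"] = toy["cl"].map(clock_to_seconds)

# Sort per player by clock value (5.2); ascending order is an assumption
toy = toy.sort_values(["pid", "cl_seconds"]).reset_index(drop=True)

# Per-player differences between consecutive events (5.3, 5.4)
grouped = toy.groupby("pid")
toy["time_diff"] = grouped["cl_seconds"].diff().fillna(0)
toy["dx"] = grouped["locX"].diff().fillna(0)
toy["dy"] = grouped["locY"].diff().fillna(0)

# Euclidean distance (5.5) and velocity with zero-gap handling (5.6)
toy["distance"] = np.hypot(toy["dx"], toy["dy"])
safe_dt = toy["time_diff"].replace(0, np.nan)
toy["velocity"] = (toy["distance"] / safe_dt).fillna(0.0)

print(toy[["pid", "cl_seconds", "time_diff", "dx", "dy", "distance", "velocity"]])
```

Grouping by `pid` before calling `diff()` keeps the first event of each player from being compared with the last event of the previous player.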


## 6. Exploratory Data Analysis (EDA) <a id="eda"></a>


In this chapter, we perform a thorough exploratory analysis of our engineered features and the original tracking data. Our objectives are to understand the distribution and relationships of our variables, identify potential outliers, and gain insights that will inform our subsequent modeling steps. The chapter is organized as follows:



1. **Introduction to EDA**
   - Overview of the goals of EDA
   - Summary of the features under analysis (e.g., `cl_seconds`, `time_diff`, `dx`, `dy`, `distance`, `velocity`)

2. **Summary Statistics**
   - Descriptive statistics for both original and engineered features
   - Insights from the summary (e.g., mean, median, standard deviation, etc.)

3. **Distribution Analysis**
   - Histograms and Kernel Density Estimates (KDE) for key numerical features
   - Box plots to identify outliers
   - Discussion of the spread and central tendency of features

4. **Spatial Trajectory Visualization**
   - Scatter plots of spatial coordinates (e.g., `locX` vs. `locY`)
   - Visualizing trajectories for selected players, possibly with color-coding by time or event type

5. **Correlation Analysis**
   - Correlation matrix of engineered features
   - Heatmaps to identify strong relationships among features
   - Discussion on how these correlations might affect model performance

6. **Event-Specific Analysis**
   - Comparison of feature distributions across different event types (e.g., "End Period", "Rebound", etc.)
   - Analysis of how event context influences movement dynamics

7. **Summary of EDA Findings**
   - Key insights and observations from the exploratory analysis
   - Implications for feature selection and model development


### 6.1 Summary Statistics



In this section, we will compute and review descriptive statistics for our dataset. Our focus will be on both the original tracking features (e.g., `locX`, `locY`, `cl_seconds`) and the engineered features (e.g., `time_diff`, `dx`, `dy`, `distance`, `velocity`).

**Objectives:**

- **Descriptive Overview:**  
  Provide an overview of key statistics (mean, median, standard deviation, min, max, quartiles) for each numerical feature. This helps in understanding the central tendency and dispersion of the data.

- **Identify Outliers:**  
  Look for unusual values or extreme variations in the distributions that may need further investigation or special handling during modeling.

- **Compare Features:**  
  Compare the distributions of the engineered features to see if they behave as expected (e.g., nonzero time differences, meaningful spatial differences, reasonable distances and velocity values).

The results from this analysis will inform any necessary adjustments in preprocessing and feature engineering before moving on to more advanced EDA and modeling.

Let's now proceed to calculate these summary statistics.


In [None]:
# Compute summary statistics for the original features and engineered features.
# Here we consider columns: cl_seconds, time_diff, locX, locY, dx, dy, distance, velocity

summary_stats = df_tracking[['cl_seconds', 'time_diff', 'locX', 'locY', 'dx', 'dy', 'distance', 'velocity']].describe()
print("Summary Statistics for Key Features:")
display(summary_stats)


#### Summary Statistics – Review and Conclusions



The descriptive statistics for our key features provide several important insights into the dynamics of the tracking data:



- **In-Game Clock (`cl_seconds`):**  
  - **Range:** 0 to 720 seconds (0 to 12 minutes)  
  - **Mean & Median:** Mean is approximately 338.24 seconds, and the median is 335.0 seconds.  
  - **Interpretation:** These values are consistent with the expected duration of a period, indicating that our conversion from `MM:SS` to seconds is functioning as expected.

- **Time Difference (`time_diff`):**  
  - **Central Tendency:** The mean time difference is about 1.20 seconds, with a median of 0.6 seconds.  
  - **Variation:** The standard deviation is 4.18 seconds, and although most events occur in rapid succession (with the 75th percentile at 1.0 second), there are some events with very high time gaps (up to 485 seconds), which might correspond to period transitions or pauses.
  
- **Spatial Coordinates (`locX` and `locY`):**  
  - **`locX`:** Ranges from -250 to 250 with a median around 0, suggesting that the court is centered on the X-axis as expected.
  - **`locY`:** Ranges from -80 to 887 with a median of 23, reflecting a broader distribution that captures various court positions and potential extreme values.
  
- **Spatial Differences (`dx` and `dy`):**  
  - **Observations:** Both `dx` and `dy` have medians of 0. This is consistent with displacements that are roughly balanced between positive and negative directions, together with the zero entries produced by each player's first event and by stationary phases.  
  - **Variability:** The standard deviations (approximately 135.83 for `dx` and 174.22 for `dy`) indicate that when movement occurs, it can be substantial.
  
- **Euclidean Distance (`distance`):**  
  - **Summary:** With a mean of about 171.87 units and a median of 159.13 units, the distance values indicate a wide range of movement, from small positional changes to large movements (up to nearly 976 units).
  - **Interpretation:** This feature effectively quantifies the magnitude of player movement between events.
  
- **Velocity:**  
  - **Calculation:** Velocity is computed as `distance / time_diff`.  
  - **Distribution:** The mean velocity is approximately 112.49 units per second; however, the median is 0. This suggests a skewed distribution, where many events (e.g., the first events or stationary periods) yield a velocity of 0, while a subset of events exhibits very high velocities.  
  - **Implication:** The high standard deviation (333.30) indicates significant variability, highlighting that some moments in the game involve rapid movements.

**Overall Conclusions:**

- **Feature Effectiveness:**  
  The engineered features (time difference, spatial differences, Euclidean distance, and velocity) capture both the temporal and spatial dynamics of player movement. They exhibit expected behavior (e.g., many zero values where no movement occurs, and nonzero values when movement is present).

- **Data Insights:**  
  The summary statistics indicate that while many events (especially initial events for each player) have minimal movement, there are significant bursts of movement captured by the data. This variability is critical for understanding the dynamics of gameplay and will be valuable for subsequent modeling.

- **Next Steps:**  
  With a solid understanding of our feature distributions, we are now ready to proceed with further exploratory data analysis (EDA) to visualize these patterns and eventually integrate these features into predictive models.

These insights suggest that our derived features capture meaningful aspects of the game dynamics. In the next section, we visualize the distributions of these features; the correlations between them are explored in Section 6.4.
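
A gap between mean and median, as seen for `velocity`, is easier to interpret alongside the share of exact zeros and a few upper percentiles. The sketch below shows one way to compute these. The helper name `skew_summary` and the sample values are hypothetical and are not results from the dataset.

```python
import pandas as pd

def skew_summary(values: pd.Series) -> pd.Series:
    """Summarise how zero-heavy and long-tailed a non-negative feature is."""
    return pd.Series({
        "mean": values.mean(),
        "median": values.median(),
        "zero_share": (values == 0).mean(),
        "p95": values.quantile(0.95),
        "p99": values.quantile(0.99),
        "max": values.max(),
    })

# Hypothetical velocities: mostly zeros plus a few large values
velocity = pd.Series([0, 0, 0, 0, 0, 0, 10, 40, 250, 5000], dtype=float)
summary = skew_summary(velocity)
print(summary.round(2))
```

On the real data, the same helper could be applied as `skew_summary(df_tracking['velocity'])`, assuming the column exists as described above.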



### 6.2 Distribution Analysis


In this section, we will explore the distributions of both the original and engineered features to gain deeper insights into their behavior. Our analysis will focus on the following aspects:


1. **Histograms and Kernel Density Estimates (KDE):**  
   - Visualize the distribution of continuous features such as `cl_seconds`, `time_diff`, `distance`, and `velocity`.  
   - Use histograms to observe the frequency of different values and KDE plots to understand the underlying density.

2. **Box Plots:**  
   - Generate box plots for key features to identify outliers and assess the spread and symmetry of the distributions.
   - Box plots will help us pinpoint potential anomalies or extreme values that might require further investigation.

3. **Comparative Analysis:**  
   - Compare the distributions of engineered features (e.g., `time_diff`, `distance`, `velocity`) to assess whether they behave as expected.
   - Identify if there are any skewed distributions or unexpected patterns that could affect subsequent modeling efforts.

4. **Insights and Implications:**  
   - Summarize key findings from the visualizations, including central tendencies, dispersion, and the presence of outliers.
   - Discuss how these insights might inform adjustments in data preprocessing, normalization, or model selection.

The goal of this section is to obtain a comprehensive visual understanding of our data’s distribution, ensuring that our engineered features are well-behaved and suitable for use in modeling.

Once we have a clear picture of these distributions, we will use the insights to inform any necessary transformations and to guide our feature selection for predictive modeling.


#### 6.2.1 Histograms and KDE Plots for Key Features


In this section, we will visualize the distributions of several key continuous features using histograms with Kernel Density Estimate (KDE) overlays. Specifically, we will analyze:

- **cl_seconds:** The in-game clock in seconds.
- **time_diff:** The time difference between consecutive events for each player.
- **distance:** The Euclidean distance traveled between consecutive positions.
- **velocity:** The computed velocity as the ratio of distance to time difference.

These visualizations will help us understand the central tendencies, spread, and potential outliers in our data, which is crucial for further analysis and modeling.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Define a list of key features to visualize
features = ['cl_seconds', 'time_diff', 'distance', 'velocity']

# Define a list of colors for each feature (you can customize these as needed)
colors = ['skyblue', 'salmon', 'lightgreen', 'plum']

# Create a figure with subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14, 10))
axes = axes.flatten()  # flatten the 2D array of axes for easy iteration

# Loop through each feature and corresponding axis
for ax, feature, color in zip(axes, features, colors):
    sns.histplot(df_tracking[feature], bins=50, kde=True, color=color, ax=ax)
    ax.set_title(f"Distribution of {feature}")
    ax.set_xlabel(feature)
    ax.set_ylabel("Frequency")

# Adjust the layout for better spacing between subplots
plt.tight_layout()
plt.show()


##### Observations



1. **Distribution of `cl_seconds` (top‐left):**  
   - The histogram spans roughly from 0 up to around 650+ seconds.
   - There is a pronounced peak near 0 (suggesting many entries close to zero), and then the frequency fluctuates but remains relatively high throughout the 100–600 second range.
   - Toward the upper end (beyond ~600 seconds), the frequencies taper off.
   - Overall, the distribution is only mildly right-skewed (mean of about 338 seconds versus a median of 335 seconds in Section 6.1): there is a strong concentration near zero, but counts stay substantial across a wide midrange of `cl_seconds`.


2. **Distribution of `time_diff` (top‐right):**  
   - The x‐axis goes from 0 to about 500, while the y‐axis extends to over 1.5×10^6 in frequency, indicating a *very* high count of near-zero values.
   - There is a sharp drop-off after a small number of seconds, so most `time_diff` values are clustered very close to zero, with relatively few data points above even a few seconds.
   - In other words, it is *highly* right‐skewed, dominated by very small `time_diff` values.


3. **Distribution of `distance` (bottom‐left):**  
   - The range shown is from 0 to 1000.
   - There is an initial large spike near zero—again suggesting a high volume of distances close to zero.
   - Beyond that, the histogram shows multiple modes (several peaks around 50–200 and some around 200–300), then frequencies drop to near zero by ~500–600 onward.
   - Overall, the data is concentrated in lower distances with some secondary clusters in the midrange.


4. **Distribution of `velocity` (bottom‐right):**  
   - The velocity axis goes from 0 to nearly 9000, but the vast majority of observations are near zero, as seen by the extreme spike in frequency at low velocities.
   - After that initial spike, frequency declines rapidly, and only a long tail extends out to higher velocities.
   - Similar to `time_diff`, it displays a strong right skew, with the bulk of velocities clustering very close to zero.


These observations provide a detailed understanding of the behavior of our key features. The insights gained from this analysis will inform any necessary transformations (such as normalization or outlier handling) before integrating these features into our modeling process.
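
One common transformation for heavily right-skewed, non-negative features is `log1p`, which compresses the long tail while keeping zeros at zero. The sketch below illustrates the effect on synthetic data. The lognormal sample is an assumption standing in for a feature such as `velocity`, and whether this transform suits the real data would need to be checked.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (a stand-in, not the real velocity column)
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=10_000))

# log1p(x) = log(1 + x): defined at 0 and strongly compresses large values
transformed = np.log1p(raw)

skew_before = raw.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```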


#### 6.2.2 Boxplots


In this section, we use boxplots to visualize the distribution of our key features. Boxplots are useful for:

- **Identifying Outliers:**  
  They clearly show the median, quartiles, and potential outliers in the data.
  
- **Understanding Spread and Central Tendency:**  
  The box shows the interquartile range (IQR) and the whiskers indicate the range of the data, providing insights into the overall distribution.

We will create boxplots for the following features:
- `cl_seconds`
- `time_diff`
- `distance`
- `velocity`

By comparing these boxplots, we can assess the variability in our data and identify any extreme values that may require further investigation or transformation before modeling.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# List of key features to visualize with boxplots
features = ['cl_seconds', 'time_diff', 'distance', 'velocity']

# Define a list of colors for each boxplot (customize as needed)
colors = ['skyblue', 'salmon', 'lightgreen', 'plum']

# Create a figure with subplots (one boxplot per feature)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14, 10))
axes = axes.flatten()  # Flatten the 2D array of axes for easier iteration

# Loop through each feature, plot its boxplot, and set appropriate titles and labels
for ax, feature, color in zip(axes, features, colors):
    sns.boxplot(data=df_tracking, x=feature, color=color, ax=ax)
    ax.set_title(f"Boxplot of {feature}")
    ax.set_xlabel(feature)

# Adjust the layout for better spacing between subplots
plt.tight_layout()
plt.show()


##### Analysis and Observations



From the boxplots, we observe that `time_diff`, `distance`, and `velocity` exhibit strong right skew with many outliers on the high end, while `cl_seconds` is comparatively balanced. Here are the detailed observations:



1. **`cl_seconds`**  
   - The interquartile range (IQR) spans roughly from about 200 seconds up to the 400–450 second range, with the median falling near the middle of this range.
   - The lower whisker extends down close to 0, while the upper whisker reaches up to around 700+ seconds, indicating that while most events occur within a mid-range of time, there are some relatively high values.
   - Overall, the box sits near the middle of the observed range, so `cl_seconds` is far less skewed than the other three features.

2. **`time_diff`**  
   - The box is nearly flat around zero, meaning that most time differences are very small (often near zero).
   - There is a very high frequency of near-zero values, with a long tail stretching to larger values (up to ~500 seconds), confirming that while most events occur in rapid succession, a few have significant time gaps.
   - This confirms the extreme right-skew, with the bulk of the data clustered at very small time intervals.

3. **`distance`**  
   - The IQR for distance is approximately 0 to 200, with the median in the lower half of this range.
   - The upper whisker extends to roughly 400, and a long tail reaches up to around 1000, indicating that while most movements are relatively small, some events capture larger spatial movements.
   - The distribution exhibits multiple modes at lower distances, reflecting common short movements, and a long tail for less frequent, longer movements.

4. **`velocity`**  
   - Similar to `time_diff`, the box for velocity is pinned near zero, indicating that most events have very low computed velocities.
   - However, there is a large set of outliers with velocity values extending up to nearly 9000. This suggests that when even moderate spatial movement is divided by a very small time difference, the resulting velocity can be extremely high.
   - The strong right skew in velocity is evident, with most values clustered very close to zero and a few very high values in the tail.

**Overall Implications:**

- The strong right skew in `time_diff`, `distance`, and `velocity` suggests that while many events involve minimal movement or occur in quick succession, a subset of events captures substantial changes.
- These distributions highlight the need for careful handling in subsequent modeling stages—such as normalization or transformation—to account for the skew and potential outliers.
- The insights from these boxplots will inform how we preprocess our data for robust predictive modeling of player trajectories and event detection.

With these observations in hand, we have a solid understanding of the data's distribution. We are now ready to proceed to the next section of our EDA.
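
The points drawn beyond the whiskers in these boxplots follow the usual 1.5 × IQR rule, which is the default in seaborn and matplotlib. The same rule can be applied directly to flag candidate outliers. The helper name `iqr_outlier_mask` and the sample values below are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical sample: Q1 = 2, Q3 = 4, so the fences are -1 and 7
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
mask = iqr_outlier_mask(sample)
print(sample[mask].tolist())  # [100.0]
```

For zero-heavy features such as `time_diff` or `velocity`, the IQR can be very small, so this rule may flag a large share of the data. A percentile-based threshold may be more practical in that case.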


### 6.3 Spatial Trajectory Visualization



In this section, we explore the spatial dimensions of the tracking data by visualizing the movement trajectories on the court. Our objectives are to:



- **Visualize Player Movement:**  
  Create scatter plots of the X and Y coordinates (`locX` vs. `locY`) to capture the spatial trajectory of players over time.
  
- **Color-code by Time:**  
  Use a color scale based on the in-game time (`cl_seconds`) to help identify how movement changes throughout a period.

- **Identify Patterns:**  
  Visualize trajectories to see spatial clustering, directional movement, and potential differences between offensive and defensive actions.

By visualizing these trajectories, we can gain insights into player behavior, court positioning, and tactical formations that can later inform our predictive modeling efforts.


In [None]:
import matplotlib.pyplot as plt

# Select a sample player (for example, the first unique player in the dataset)
sample_player = df_tracking['pid'].unique()[0]

# Filter the data for the selected player
player_data = df_tracking[df_tracking['pid'] == sample_player]

# Create a scatter plot of locX vs. locY, with the color representing in-game time (cl_seconds)
plt.figure(figsize=(10, 8))
scatter = plt.scatter(player_data['locX'], player_data['locY'], 
                      c=player_data['cl_seconds'], cmap='viridis', s=20, alpha=0.7)
plt.colorbar(scatter, label='In-Game Time (seconds)')
plt.xlabel('locX')
plt.ylabel('locY')
plt.title(f"Spatial Trajectory for Player {sample_player}")

# Optionally, invert the y-axis if needed to better reflect the court's orientation
plt.gca().invert_yaxis()

plt.show()


#### 6.3.1 Player 0 Movement Analysis and Description


**Overview:**
Based on the scatter plot above and the engineered movement features, we examined the spatial distribution and movement patterns recorded under `pid` 0 ("Player 0"). The data shows recurring patterns that hint at this player's role in both offensive and defensive contexts.


**Key Findings:**

- **Frequent Hotspots:**  
  - Player 0's positions tend to cluster along two distinct arcs. One cluster appears in the lower to mid-range of the court (approximately with `locY` between 100 and 300), and another cluster emerges around `locY` values between 400 and 500.  
  - These hotspots suggest that Player 0 consistently occupies key areas of the court, which may be critical for initiating offensive plays or providing defensive support.

- **Wide Horizontal Coverage:**  
  - The `locX` values range from about -200 to +200, indicating that Player 0 is active across the entire width of the court.  
  - This horizontal spread is consistent with a player who is involved in transitioning from one side of the court to the other, contributing to both offensive spacing and defensive balance.

- **Temporal Patterns:**  
  - The color gradient in the trajectory visualizations (representing in-game time) shows that Player 0 revisits these key zones throughout the game rather than following a linear progression.  
  - This recurring pattern indicates a structured movement strategy, likely driven by play design, where the player returns to specific areas at multiple points during the game.

- **Movement Dynamics:**  
  - There are distinct moments of high velocity (suggesting rapid movements during fast breaks or quick transitions) and periods of near-zero velocity (indicating stationary positioning for play setup or defensive alignment).  
  - The combination of these dynamic patterns illustrates that Player 0 is both a mover on the court and someone who can hold position when necessary.



**Tactical Implications:**

- **Offensive Role:**  
  - The consistent presence in the primary arcs suggests that Player 0 plays a key role in spacing the floor, which is essential for creating driving lanes and facilitating ball movement.
  - His positioning may also be leveraged for perimeter shooting, stretching the defense, and opening opportunities for teammates.

- **Defensive Role:**  
  - The recurrent clustering in specific areas may reflect his responsibilities on defense—particularly in guarding crucial zones around the perimeter.
  - Recognizing these patterns can help in planning defensive rotations and adjustments.

- **Coaching Insights:**  
  - The data confirms that Player 0’s movement is both deliberate and structured. Coaches can use this information to design plays that maximize his strengths (e.g., pick-and-roll opportunities, perimeter spacing) and to adjust tactics if opponents start to exploit predictable patterns.
  - Additionally, understanding the moments of high velocity can help in strategizing transitions and fast-break scenarios.

**Conclusion:**

The tracking data suggests that Player 0 repeatedly operates within specific zones on the court, with movement patterns that combine dynamic transitions and deliberate positioning. This is consistent with a perimeter-centric role in both offensive setups and defensive schemes, although confirming it would require event context such as shot attempts and matchups. Incorporating these findings into play design and tactical adjustments could inform team strategy.
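
Visual impressions of hotspots, such as the `locY` bands mentioned above, can be checked by counting the share of events that fall into each band. The sketch below uses `pd.cut` on hypothetical values. The band edges and labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical locY values for one player (not taken from the dataset)
loc_y = pd.Series([-20, 50, 120, 180, 250, 290, 410, 450, 480, 700])

# Bin locY into bands; intervals are right-inclusive, e.g. (100, 300]
bands = pd.cut(
    loc_y,
    bins=[-100, 100, 300, 400, 500, 900],
    labels=["up to 100", "100 to 300", "300 to 400", "400 to 500", "above 500"],
)

# Share of events per band, kept in band order
share = bands.value_counts(normalize=True, sort=False)
print(share)
```

Applied to `player_data['locY']`, a high share in the 100 to 300 and 400 to 500 bands would support the two clusters described in the key findings.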


#### 6.3.2 Spatial Trajectory Scatterplots for Random Players



In this section, we aim to visualize the spatial trajectories (i.e., the `(locX, locY)` positions) for 15 randomly selected players. To ensure that our visualization is representative, we will only consider players who have a sufficient number of events (for example, at least 50 events) in the dataset.

For each selected player, we will create a scatterplot where:
- The X-axis represents the `locX` coordinate.
- The Y-axis represents the `locY` coordinate.
- Each dot is color-coded by the in-game time (`cl_seconds`), which helps us see the temporal progression of the player's movement.

We will arrange the scatterplots in a grid with 3 rows and 5 columns. This layout provides an effective side-by-side comparison of the movement patterns of different players.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set a threshold for the minimum number of events per player (e.g., 50 events)
min_events = 50

# Count events per player (sort=False keeps the order in which pids first appear)
event_counts = df_tracking.groupby('pid', sort=False).size()
players_with_enough_data = event_counts[event_counts >= min_events].index.to_numpy()

# Randomly select up to 15 players (fewer if not enough players qualify)
np.random.seed(42)  # For reproducibility
n_players = min(15, len(players_with_enough_data))
selected_players = np.random.choice(players_with_enough_data, size=n_players, replace=False)

# Create a grid of subplots (3 rows x 5 columns)
fig, axes = plt.subplots(nrows=3, ncols=5, figsize=(20, 12))
axes = axes.flatten()

# Loop through each selected player and plot their spatial trajectory
for ax, player in zip(axes, selected_players):
    # Filter data for the current player
    player_data = df_tracking[df_tracking['pid'] == player]
    
    # Create a scatter plot: locX vs. locY, color-coded by cl_seconds
    scatter = ax.scatter(player_data['locX'], player_data['locY'], c=player_data['cl_seconds'], 
                         cmap='viridis', s=20, alpha=0.7)
    ax.set_title(f"Player {player}")
    ax.set_xlabel("locX")
    ax.set_ylabel("locY")
    # Optionally, invert the y-axis if needed to better reflect court orientation
    ax.invert_yaxis()

# Add a colorbar to the figure for cl_seconds
# cbar = fig.colorbar(scatter, ax=axes, orientation='vertical', fraction=0.02, pad=0.04)
# cbar.set_label("In-Game Time (seconds)")

plt.tight_layout()
plt.show()


#### 6.3.3 Conclusion: Spatial Trajectory Visualization


The spatial trajectory visualizations have provided valuable insights into player movement patterns over the course of a game. Key observations include:

- **Recurring Movement Patterns:**  
  The scatterplots reveal that players tend to occupy specific areas on the court repeatedly. For example, many trajectories form curved arcs, suggesting that players often move along the perimeter (possibly near the three-point line) and then transition toward the paint.

- **Temporal Dynamics:**  
  The color gradient based on `cl_seconds` shows that players revisit these key areas throughout the game rather than progressing linearly. This temporal layering indicates that a player’s positioning is dynamic and adapts to different phases of play.

- **Spatial Distribution:**  
  The trajectories demonstrate wide horizontal movement across the court, with `locX` values spanning a broad range. Vertical movement (`locY`) also varies significantly, highlighting periods of both concentrated and scattered activity depending on the game context.

- **Contextual Implications:**  
  - Areas with dense clustering may represent strategic zones (such as offensive hotspots or defensive strongholds).  
  - Outlier points in the trajectories could correspond to fast breaks, transitions, or brief defensive adjustments.  
  - The variability in the movement patterns underscores the importance of integrating additional contextual data (e.g., shot attempts, rebounds) to fully interpret these spatial trends.

Overall, these visualizations confirm that the spatial data is rich in information about player behavior and court positioning. The insights gathered here will be crucial for refining our predictive models and developing tactical recommendations. With a robust understanding of spatial trajectories, we can now proceed to further analyses, such as correlation analysis and event-specific insights, in subsequent sections.
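
Dense clusters in a scatterplot can also be located numerically by binning positions on a grid and finding the most populated cell. The sketch below does this with `np.histogram2d` on synthetic positions. The grid size, the coordinate ranges, and the simulated cluster are assumptions, not properties of the real court data.

```python
import numpy as np

# Synthetic positions: one tight cluster plus uniform background noise
rng = np.random.default_rng(1)
loc_x = np.concatenate([rng.normal(120, 5, 500), rng.uniform(-250, 250, 100)])
loc_y = np.concatenate([rng.normal(220, 5, 500), rng.uniform(-50, 950, 100)])

# Count points per cell on a 10 x 10 grid over an assumed coordinate range
counts, x_edges, y_edges = np.histogram2d(
    loc_x, loc_y, bins=(10, 10), range=[[-250, 250], [-50, 950]]
)

# Locate the densest cell and report its boundaries
ix, iy = np.unravel_index(np.argmax(counts), counts.shape)
hotspot = (x_edges[ix], x_edges[ix + 1], y_edges[iy], y_edges[iy + 1])
print(f"Densest cell: locX in [{hotspot[0]:.0f}, {hotspot[1]:.0f}], "
      f"locY in [{hotspot[2]:.0f}, {hotspot[3]:.0f}], n = {int(counts[ix, iy])}")
```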


### 6.4 Correlation Analysis



In this section, we will examine the correlations between the key features in our dataset. Our objectives are to:

- **Quantify Relationships:**  
  Compute the pairwise correlation coefficients among both the original and engineered features (e.g., `cl_seconds`, `time_diff`, `locX`, `locY`, `dx`, `dy`, `distance`, and `velocity`).

- **Identify Strong Associations:**  
  Use a correlation matrix and heatmap to identify which features are strongly correlated, either positively or negatively. This helps us understand potential redundancies and interdependencies in the data.

- **Implications for Modeling:**  
  - Features that are highly correlated may provide redundant information and could be candidates for dimensionality reduction or careful feature selection.
  - Strong correlations can also reveal underlying patterns or relationships that might be critical for the predictive modeling of player trajectories and game events.

By visualizing the correlation matrix, we can gain insights into the structure of our data and make informed decisions for subsequent modeling steps.

Next, we will proceed to compute and visualize these correlations using a heatmap.
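
Because several of these features are heavily skewed, it can be useful to compare the default Pearson coefficient with a rank-based alternative such as Spearman, which is less sensitive to extreme values. The sketch below shows the difference on a synthetic monotonic but nonlinear relationship. The toy data is an assumption used only to illustrate the two `method` options of `DataFrame.corr`.

```python
import numpy as np
import pandas as pd

# Synthetic example: y increases monotonically with x, but far from linearly
x = np.arange(1, 51, dtype=float)
toy = pd.DataFrame({"x": x, "y": np.exp(x / 5.0)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

On the real data, the same comparison could be made with `df_tracking[features_for_corr].corr(method="spearman")`, using the feature list defined in the next cell.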

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select the key features for correlation analysis
features_for_corr = ['cl_seconds', 'time_diff', 'locX', 'locY', 'dx', 'dy', 'distance', 'velocity']

# Compute the correlation matrix for these features
corr_matrix = df_tracking[features_for_corr].corr()

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", square=True,
            cbar_kws={"shrink": 0.75})
plt.title("Correlation Matrix for Key Features")
plt.show()


#### Conclusion



The correlation matrix provides valuable insights into the relationships among our key features:



1. **Spatial Relationships:**
   - **Strong Positive Associations:**  
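     - `locX` with `dx` (0.70) and `locY` with `dy` (0.65) are the strongest pairs in the matrix. A coordinate and its own first difference are positively correlated by construction, and a value near 0.7 is what results when consecutive positions are close to uncorrelated (the expected value is then $1/\sqrt{2} \approx 0.71$). The figures are therefore consistent with `dx` and `dy` being computed from the intended columns, and they also suggest that successive events often occur at largely unrelated court locations.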
   - **Weak Associations:**  
     - The near-zero correlation between `locX` and `locY` indicates no linear relationship between horizontal and vertical positions. This is expected for court positioning, although a zero correlation does not rule out nonlinear structure such as clustering around the basket.

2. **Temporal and Movement Dynamics:**
   - **Time and Velocity:**  
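     - A modest negative correlation between `cl_seconds` and `velocity` (-0.19) links larger clock values to lower computed velocities. Because the in-game clock counts down (a period starts at '12:00'), `cl_seconds` is largest at the start of a period, assuming it holds the remaining time in seconds. Under that reading, computed velocities are slightly higher late in a period, so this coefficient does not by itself support a fatigue explanation; end-of-period urgency or tactical adjustments are more plausible candidates.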
   - **Time Differences:**  
     - The weak correlation between `cl_seconds` and `time_diff` (0.02) indicates that the frequency of events is relatively independent of the period’s progression.

3. **Distance and Velocity:**
   - **Moderate Association:**  
     - The positive correlation between `distance` and `velocity` (0.32) confirms the intuitive expectation that higher velocities correspond to greater distances covered. The correlation is only moderate because velocity also depends on `time_diff`: the same distance yields very different velocities depending on how much time separates the two events.

4. **Overall Implications:**
   - The spatial correlations are consistent with correctly engineered difference features, while the time-based variables and their relationships with velocity offer insights into game dynamics such as pacing.
   - Some features exhibit weak correlations that look unexpected at first (e.g., the negligible correlation between `distance` and the individual spatial differences `dx`/`dy`). This follows from the definitions: `distance` is an unsigned magnitude, whereas `dx` and `dy` are signed, so large moves in opposite directions cancel out in a linear correlation.
   - These insights provide a basis for further analysis, such as exploring event-level contexts and considering additional combined metrics, to fully capture the complexities of player movement.

In summary, the correlation matrix reveals that while our spatial features are strongly interrelated, the temporal and derived movement metrics exhibit more nuanced relationships. These findings will help guide our next steps in feature selection and model development.
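
As a calibration aid for the `locX`/`dx` and `locY`/`dy` coefficients, the sketch below compares two synthetic regimes. Neither is drawn from `df_tracking`; the sample size, the coordinate range, and the helper name `corr_with_first_difference` are illustrative assumptions. When positions are independent from one event to the next, the correlation between a coordinate and its first difference is expected to be close to $1/\sqrt{2}$, whereas a smoothly evolving trajectory gives a value near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000  # illustrative sample size, not the size of df_tracking


def corr_with_first_difference(x):
    """Pearson correlation between a series and its own first difference."""
    dx = np.diff(x)
    # Align each difference with the position it leads to.
    return float(np.corrcoef(x[1:], dx)[0, 1])


# Regime 1: every position is drawn independently of the previous one.
x_independent = rng.uniform(-250, 250, size=n)

# Regime 2: positions evolve smoothly, as a random walk with small steps.
x_smooth = np.cumsum(rng.normal(0.0, 1.0, size=n))

r_independent = corr_with_first_difference(x_independent)
r_smooth = corr_with_first_difference(x_smooth)

print(f"independent positions: {r_independent:.3f} (theory: {2 ** -0.5:.3f})")
print(f"smooth trajectory:     {r_smooth:.3f}")
```

Comparing the observed coefficients against these two reference points gives a rough sense of how much continuity exists between consecutive events.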


## 7. Data Quality & Anomaly Detection <a id="data-quality"></a>


In this chapter, we shift our focus to the overall quality of the dataset and the identification of any anomalies. Our objectives in this section are to:



1. **Assess Data Quality:**
   - Evaluate the completeness, consistency, and accuracy of the data.
   - Check for any systematic issues (e.g., missing values, data entry errors, or unexpected patterns) that might affect the integrity of our engineered features.

2. **Detect Anomalies:**
   - Identify outliers and unusual patterns in both the original and engineered features.
   - Use statistical methods (e.g., Z-scores, Interquartile Range) and visual tools (e.g., boxplots) to flag observations that deviate significantly from the norm.
   - Investigate whether these anomalies are genuine (e.g., moments of exceptional movement during fast breaks) or are artifacts of data collection/processing.

3. **Interpretation and Implications:**
   - Discuss the potential reasons for any detected anomalies, such as measurement errors, specific game events (e.g., timeouts, transitions), or data recording issues.
   - Evaluate the impact of these anomalies on the overall dataset and on subsequent modeling efforts.
   - Consider strategies for handling anomalies (e.g., imputation, removal, or transformation) to ensure that our models are robust and not overly influenced by extreme values.

4. **Next Steps:**
   - Based on our findings, determine if additional preprocessing is required (e.g., normalization, outlier treatment).
   - Plan further analyses or modeling adjustments informed by the data quality and anomaly detection results.

By thoroughly assessing the data quality and detecting anomalies, we ensure that our feature set is reliable and that our subsequent modeling steps are based on high-quality, representative data.

Next, we will implement the analysis methods to quantify data quality and identify potential anomalies.
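
Before applying either method to the real features, it can help to see how Z-scores and the IQR rule differ on skewed data. The following is a minimal sketch on made-up values; the function name `flag_outliers` and the thresholds (3 standard deviations, 1.5 × IQR) are conventional choices assumed here, not settings taken from this notebook.

```python
import numpy as np
import pandas as pd


def flag_outliers(values, z_thresh=3.0, iqr_k=1.5):
    """Return boolean flags from the Z-score rule and from the IQR rule."""
    z = (values - values.mean()) / values.std(ddof=0)
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({
        "z_flag": z.abs() > z_thresh,
        "iqr_flag": (values < q1 - iqr_k * iqr) | (values > q3 + iqr_k * iqr),
    })


# Synthetic right-skewed sample: many small values, one moderate, one extreme.
sample = pd.Series(list(np.arange(20) * 0.1) + [5.0, 40.0])
flags = flag_outliers(sample)
print(flags.sum())
```

In this toy sample the extreme value inflates the standard deviation, so the Z-score rule flags fewer points than the IQR rule. That masking effect is one reason the IQR method is often preferred for heavily skewed features.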


### 7.1 Outlier Detection using the IQR Method



In this section, we will detect outliers in key numerical features using the Interquartile Range (IQR) method. The IQR method identifies outliers as observations that fall below:

$$ Q1 - 1.5 \times IQR $$

or above:

$$ Q3 + 1.5 \times IQR $$

where $Q1$ and $Q3$ are the 25th and 75th percentiles, respectively, and $IQR = Q3 - Q1$.

We will apply this method to the following key features:
- `cl_seconds`
- `time_diff`
- `distance`
- `velocity`

Our goals are to:
- Quantify the number of outliers for each feature.
- Display a sample of the outlier data for further inspection.

This analysis will help us understand if extreme values exist in our dataset and determine the appropriate strategies (e.g., normalization, transformation, or removal) for handling them during modeling.


In [None]:
# Define a function to detect outliers using the IQR method for a given feature
def detect_outliers_iqr(data, feature):
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)]
    return outliers, lower_bound, upper_bound

# List of key features for outlier detection
numeric_features = ['cl_seconds', 'time_diff', 'distance', 'velocity']

# Dictionary to store outlier information for each feature
outliers_dict = {}

for feature in numeric_features:
    outliers, lower_bound, upper_bound = detect_outliers_iqr(df_tracking, feature)
    outliers_dict[feature] = outliers
    print(f"Feature: {feature}")
    print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
    print(f"Number of outliers: {len(outliers)}")
    print("-" * 40)


#### Analysis and Conclusions: Outlier Detection



From the IQR-based outlier detection, we obtained the following results:

- **cl_seconds:**  
  - Lower bound: -375.00, Upper bound: 1049.00  
  - **Observation:** No outliers were detected for `cl_seconds`, which suggests that the in-game time (converted to seconds) is well-behaved and within the expected range (0 to 720 seconds for a 12-minute period).

- **time_diff:**  
  - Lower bound: -1.50, Upper bound: 2.50  
  - **Observation:** A large number of outliers (33,454 observations) were detected.  
  - **Interpretation:** The majority of `time_diff` values are very small (close to zero), which is expected since many events occur in rapid succession. However, there are some events with much larger gaps, likely due to period transitions or pauses. These extreme values stretch the distribution and are flagged as outliers.

- **distance:**  
  - Lower bound: -225.25, Upper bound: 544.18  
  - **Observation:** 4,730 outlier observations were detected.  
  - **Interpretation:** While many movements are small, some events show unusually high Euclidean distances, indicating substantial movement between events. These could be genuine (e.g., fast breaks) or might be influenced by measurement noise.

- **velocity:**  
  - Lower bound: -189.60, Upper bound: 316.00  
  - **Observation:** 22,976 outlier observations were found.  
  - **Interpretation:** The computed velocity shows a high degree of skewness. A substantial number of events yield very high velocities, which are likely the result of dividing moderate distances by very small time differences (e.g., 0.1 seconds). This can produce extreme values even if the actual movement is moderate.

**Overall Analysis and Implications:**

- The **lack of outliers in `cl_seconds`** confirms that the conversion of the in-game clock is consistent.
- The extremely high count of outliers in **`time_diff` and `velocity`** is largely due to the fact that most events occur with very small time differences, resulting in a clustering near zero and a long tail of high values.
- The presence of outliers in **`distance`** indicates that while many movements are small, some events capture unusually large displacements.
- **High velocity values** should be interpreted with caution, as they are sensitive to very small time differences. It might be useful to further investigate these cases, determine whether they represent genuine bursts of speed (e.g., during fast breaks) or artifacts of data collection, and consider normalization or capping of extreme values if necessary.

**Next Steps:**

1. **Investigate Extreme Values:**  
   - Look into events with very high `velocity` to determine if they correspond to specific game events (like fast breaks) or if they are anomalies due to measurement issues.
   
2. **Consider Data Transformation:**  
   - Depending on the modeling requirements, consider applying transformations (e.g., logarithmic scaling) or capping extreme outlier values to mitigate their impact on downstream analysis.
   
3. **Integrate Context:**  
   - Explore linking these extreme movement events with other game context (e.g., play types, shot attempts, turnovers) to better understand their significance.

With these insights, we have a better understanding of the data quality and the nature of anomalies present in our dataset. The next phase is to proceed with further visualizations and contextual analyses, which will help us refine our feature set before moving into predictive modeling.
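
One possible way to act on the transformation idea is sketched below. It floors the time difference before dividing, caps the result at an upper quantile, and applies a log transform. The floor of 0.5 seconds, the 99th-percentile cap, the function name `stabilise_velocity`, and the output column names are all assumptions chosen for illustration; suitable values would need to be checked against the actual distributions.

```python
import numpy as np
import pandas as pd


def stabilise_velocity(distance, time_diff, min_dt=0.5, upper_q=0.99):
    """Compute velocity with a floored denominator, then cap and log-scale it."""
    # Flooring the time difference avoids division by zero or near-zero gaps.
    dt = time_diff.clip(lower=min_dt)
    velocity = distance / dt
    # Cap the long right tail at the chosen quantile.
    cap = velocity.quantile(upper_q)
    capped = velocity.clip(upper=cap)
    return pd.DataFrame({
        "velocity_floored": velocity,
        "velocity_capped": capped,
        "velocity_log": np.log1p(capped),
    })


# Toy values: the first two rows would explode under a plain distance / time_diff.
toy_distance = pd.Series([10.0, 200.0, 150.0, 300.0, 50.0])
toy_time_diff = pd.Series([0.0, 0.1, 2.0, 10.0, 1.0])
result = stabilise_velocity(toy_distance, toy_time_diff)
print(result)
```

Keeping the raw, capped, and log-scaled versions side by side makes it easy to compare how each choice affects downstream models.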


In [None]:

# Optionally, display a sample of outliers for one feature (e.g., velocity)
print("\nSample outliers for 'velocity':")
display(outliers_dict['velocity'].head(10))


#### Analyzing Player Participation and Filtering Low-Activity Players



To ensure that our analysis and subsequent modeling are based on robust data, we need to identify and filter out players who have very limited participation in the dataset. Players with very few events may not provide enough information for reliable analysis, and including them could introduce noise into our models.

**Our Approach:**

1. **Count Events per Player:**  
   - We will compute the total number of events associated with each player (using the `pid` field).
   - This count will serve as a measure of each player's activity level across the dataset.

2. **Visualize the Distribution of Event Counts:**  
   - By plotting a histogram or boxplot of the number of events per player, we can observe the spread of participation.
   - This visualization will help us determine a reasonable threshold for what constitutes “sufficient” participation.

3. **Filter Out Low-Participation Players:**  
   - Based on the distribution, we will define a threshold (for example, players with fewer than 50 events) and remove these players from our dataset.
   - This step ensures that our analysis focuses on players with adequate data, improving the reliability and robustness of our findings.

By taking these steps, we can ensure that our dataset includes only those players whose movement patterns are well-represented in the data, leading to more accurate and meaningful analysis.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the count of events per player
event_counts = df_tracking['pid'].value_counts()

# Display summary statistics of event counts per player
print("Summary Statistics for Event Counts per Player:")
print(event_counts.describe())

# Plot a histogram of the event counts per player
plt.figure(figsize=(10, 6))
sns.histplot(event_counts, bins=30, kde=True, color='skyblue')
plt.title("Distribution of Events per Player (pid)")
plt.xlabel("Number of Events")
plt.ylabel("Frequency")
plt.show()


In [None]:
event_counts

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the count of events per player
event_counts = df_tracking['pid'].value_counts()

# Plot a boxplot for the distribution of event counts per player
plt.figure(figsize=(10, 6))
sns.boxplot(x=event_counts, color='lightblue')
plt.title("Boxplot of Event Counts per Player")
plt.xlabel("Number of Events")
plt.show()


#### Brief Report: Boxplot Analysis of Event Counts per Player (Including `pid=0`)


##### 1. Distribution & Spread


- **Central Tendency:**
  - The majority of players have event counts clustered below **2,000**. The interquartile range (IQR) is relatively tight.
  - The **median** event count falls well below **1,000**, although it is hard to read precisely at this scale, indicating that half of the players contribute relatively few events.
  
- **Spread:**
  - Most players are grouped within a narrow range of event counts.
  - However, a long right tail is evident, with a few players showing extremely high event counts—up to **25,000** events.



##### 2. Outlier Examination


- **Extreme Outliers:**
  - The entry with over **25,000 events** is `pid=0`, a clear outlier.
  - An identifier of 0 is more plausibly a placeholder for events with no associated player (for example, period boundaries, timeouts, or team-level events) than a real player. A data artifact such as duplicated records is also possible.
  
- **Star Players or Errors?**
  - While star players can naturally accumulate more events due to high minutes and involvement, such an extreme value warrants verification.
  - Cross-referencing with game logs and roster data is necessary to confirm whether this outlier is valid.



##### 3. Data Quality & Implications


- **Legitimacy of Outliers:**
  - Extreme outliers like `pid=0` may distort summary statistics (e.g., mean) and can bias subsequent analyses if not handled appropriately.
  
- **Impact on Analysis:**
  - Including extreme outliers may skew our understanding of typical player behavior.
  - It may be beneficial to treat or filter such outliers (e.g., through capping or removal) depending on our analysis goals.


##### Recommendations:


- **Verification:**  
  - Confirm that the extremely high event count for `pid=0` aligns with actual playing time and role, or if it is an artifact.
  
- **Filtering for Typical Analysis:**  
  - Consider excluding players with extremely low or extremely high event counts (after verification) to focus on the majority of players whose activity levels are more representative.

**Summary:**  
The boxplot analysis including `pid=0` reveals a heavily right-skewed distribution, with the median below 1,000 events and a single extreme outlier exceeding 25,000 events. This suggests that while most players have modest activity, a few cases need further investigation to ensure data quality.
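
A quick way to carry out the verification step is to look at which event types and descriptions are attached to `pid=0`. The sketch below uses a small hand-made frame; the player ids, `etype` codes, and `de` strings are placeholder values invented for illustration, and `profile_pid` is a hypothetical helper. On the real data the same call would be made against `df_tracking`.

```python
import pandas as pd


def profile_pid(df, pid, by="etype"):
    """Count the values of one column for the rows belonging to a given pid."""
    subset = df[df["pid"] == pid]
    return subset[by].value_counts()


# Toy frame with invented ids, codes, and descriptions.
toy = pd.DataFrame({
    "pid": [0, 0, 0, 101, 101, 102],
    "etype": [12, 13, 9, 1, 2, 1],
    "de": ["Start Period", "End Period", "Timeout",
           "Made Shot", "Missed Shot", "Made Shot"],
})

pid0_profile = profile_pid(toy, 0)
print(pid0_profile)
print(toy.loc[toy["pid"] == 0, "de"].tolist())
```

If the descriptions for `pid=0` turn out to be dominated by administrative or team-level events, that would support treating it as a non-player entry.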


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Remove player with id=0 from the DataFrame
df_non_zero = df_tracking[df_tracking['pid'] != 0]

# Calculate the count of events per player (excluding player 0)
event_counts = df_non_zero['pid'].value_counts()

# Plot a boxplot for the distribution of event counts per player
plt.figure(figsize=(10, 6))
sns.boxplot(x=event_counts, color='lightblue')
plt.title("Boxplot of Event Counts per Player (excluding player 0)")
plt.xlabel("Number of Events")
plt.show()

#### Refined Analysis: Boxplot of Event Counts (Excluding `pid=0`)

##### Key Observations


1. **Central Tendency & Spread:**
   - **Median:**  
     The median event count is approximately **370 events**, indicating that half of the players (excluding `pid=0`) have fewer than around 370 recorded events.
   - **Interquartile Range (IQR):**  
     The IQR spans roughly from **100 to 770 events**, which captures the middle 50% of players, suggesting a relatively modest participation level for most.

2. **Right-Skew & Outliers:**
   - The overall distribution remains right-skewed, with a small number of players showing higher event counts.
   - There are two notable outliers with counts in the range of **1,800 to 1,900 events**, which are plausible for players with significant court time.

3. **Scale Adjustment:**
   - By excluding `pid=0`, the distribution now focuses on typical player participation, with the x-axis extending only to about 1,900 events. This improves visualization and better represents the majority of players.

##### Interpretation


- **Impact of Excluding `pid=0`:**
  - Removing the extreme outlier (`pid=0`) reduces the skew of the distribution, enabling a clearer view of the event counts for the bulk of the players.
  - The refined distribution reveals that the middle half of players contribute between roughly **100 and 770 events**, which is more representative of typical participation levels across the dataset.

- **Remaining Outliers:**
  - The few players with event counts above 1,800 likely represent key contributors (e.g., star players) with high playing time.
  - These outliers, while still present, are within a more plausible range compared to the extreme value from `pid=0`.


##### Recommendations:



1. **Outlier Validation:**  
   - Investigate the two players with high event counts (above 1,800) to confirm they are indeed high-usage players.
   
2. **Focus on Active Participants:**  
   - For further analysis and modeling, consider filtering out players with very low event counts (e.g., below 250) to concentrate on those who are actively contributing.

3. **Modeling Considerations:**  
   - Use robust statistical measures (e.g., median) when summarizing data, and consider normalizing event counts to reduce the influence of remaining outliers.

**Summary:**  
Excluding `pid=0` significantly refines the distribution of event counts. The middle half of players now fall within a range of roughly 100–770 events, with a few key players showing higher counts. This refined view allows for a more balanced analysis and supports subsequent modeling efforts by focusing on the most relevant data.
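
To illustrate the point about robust measures, the sketch below contrasts the mean with the median on a small made-up set of counts and rescales them by the median and IQR. The values and the helper name `robust_scale` are assumptions for illustration only.

```python
import pandas as pd


def robust_scale(counts):
    """Centre on the median and scale by the interquartile range."""
    q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
    return (counts - counts.median()) / (q3 - q1)


# Toy event counts with one high-usage player.
toy_counts = pd.Series([100, 200, 300, 400, 500, 1800])
scaled = robust_scale(toy_counts)

print("mean:  ", toy_counts.mean())
print("median:", toy_counts.median())
print(scaled.round(2).tolist())
```

The single large count pulls the mean well above the median, while the median and IQR remain governed by the bulk of the values.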


### 7.2 Outlier Detection for Player Event Counts (Excluding `pid=0`)



In this section, we focus on analyzing player participation by examining the distribution of event counts per player, while explicitly excluding the extreme case of `pid=0`. Our objectives are to:

- **Quantify Player Activity:**  
  Compute the total number of events recorded for each player (using the `pid` field) to understand overall participation levels.

- **Apply the IQR Method:**  
  Calculate the first (Q1) and third (Q3) quartiles and the Interquartile Range (IQR) for the event counts. Using these, determine the lower boundary as:
  
  $$ \text{Lower Bound} = Q1 - 1.5 \times \text{IQR} $$
  
  and the upper boundary as:
  
  $$ \text{Upper Bound} = Q3 + 1.5 \times \text{IQR} $$
  
- **Determine Filtering Thresholds:**  
  By visualizing and quantifying the distribution, we can decide on a minimum event count threshold. This threshold will help us filter out players with very low participation (e.g., fewer than 250 events) so that our subsequent analysis and modeling are based on robust player data.

Let's now implement the IQR-based outlier detection on the event counts, after excluding `pid=0`.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Exclude pid=0 from the analysis
filtered_data = df_tracking[df_tracking['pid'] != 0]

# Calculate event counts per player
event_counts = filtered_data['pid'].value_counts()

# Display summary statistics for event counts
print("Summary Statistics for Event Counts (excluding pid=0):")
print(event_counts.describe())

# Calculate Q1, Q3, and IQR for event counts
Q1 = event_counts.quantile(0.25)
Q3 = event_counts.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"\nIQR for event counts: {IQR:.2f}")
print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# Identify outlier players based on event counts
outlier_players = event_counts[(event_counts < lower_bound) | (event_counts > upper_bound)]
print(f"\nNumber of outlier players (excluding pid=0): {len(outlier_players)}")
print("Outlier event counts:")
print(outlier_players)

#### Event Counts per Player & Filtering Recommendations


**Summary of Findings:**
- **Event Count Distribution (Excluding `pid=0`):**  
  - **Count:** 583 players  
  - **Mean:** ~486 events  
  - **Median:** 373 events  
  - **25th Percentile:** 101.5 events  
  - **75th Percentile:** 768.5 events  
  - **Max:** 1882 events  
- **IQR Analysis:**  
  - The Interquartile Range (IQR) is 667.00, which results in a calculated upper bound of 1769.00 (using $Q3 + 1.5 \times IQR$).  
  - There are 2 players with event counts exceeding this upper bound (1882 and 1804), which likely correspond to star players or possibly data anomalies.

**Implications:**
- The majority of players have modest event counts, with a median of 373, while a small subset of players are extremely active.
- The lower boundary calculated via IQR is negative (-899), which is not useful for filtering; instead, we need to set a practical threshold to exclude players with very low participation.
- Given that the 25th percentile is around 101 events and the median is 373 events, it may be reasonable to filter out players with extremely low counts (e.g., below 250 events) to focus our analysis on players with sufficient on-court activity.

**Recommendations for Next Steps:**
1. **Set a Lower Threshold:**  
   - We recommend filtering out players with fewer than **250 events**. This threshold helps ensure that our analysis and modeling are based on players with enough data to capture meaningful movement patterns.
2. **Remove Low-Participation Players:**  
   - Apply this threshold to further refine the dataset.
3. **Re-Evaluate Distribution:**  
   - After filtering, re-examine the distribution of event counts to ensure the dataset reflects the active participants.
4. **Proceed with Analysis:**  
   - With the filtered dataset, we can then integrate these robust player-level features into further exploratory analysis and modeling.

This refined approach will help us focus on the most representative players, thereby improving the quality and interpretability of our subsequent analyses and models.
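
Since the 250-event cutoff is a judgment call, it may be worth checking how sensitive the retained data is to that choice. The sketch below tabulates the share of players and events kept at several candidate thresholds. The helper `retention_by_threshold` and the toy counts are hypothetical; on the real data the input would be the per-player `value_counts()` with `pid=0` excluded.

```python
import pandas as pd


def retention_by_threshold(event_counts, thresholds):
    """Summarise how many players and events survive each minimum-count cutoff."""
    total_players = len(event_counts)
    total_events = event_counts.sum()
    rows = []
    for threshold in thresholds:
        kept = event_counts[event_counts >= threshold]
        rows.append({
            "min_events": threshold,
            "players_kept": len(kept),
            "players_share": len(kept) / total_players,
            "events_share": kept.sum() / total_events,
        })
    return pd.DataFrame(rows)


# Toy per-player counts, invented for illustration.
toy_counts = pd.Series([20, 80, 120, 260, 400, 700, 900, 1800])
table = retention_by_threshold(toy_counts, [50, 100, 250, 500])
print(table)
```

Because low-activity players contribute few rows each, a cutoff typically removes a much larger share of players than of events, which is worth keeping in mind when reporting the filtered dataset size.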


#### Filtering Low-Participation Players



Based on our outlier analysis of event counts per player, we determined that:
- We will exclude `pid=0`, which represents an extreme outlier.
- We will also filter out players with fewer than 250 events to focus our analysis on those with sufficient on-court activity.

This filtering step will result in a cleaner dataset that is more representative of players with meaningful participation. Next, we will apply these filters to our dataset.


In [None]:
# Calculate the count of events per player (excluding pid=0, if not already filtered)
event_counts = df_tracking[df_tracking['pid'] != 0]['pid'].value_counts()

# Identify eligible players: those with at least 250 events
eligible_pids = event_counts[event_counts >= 250].index

# Filter the dataset to include only eligible players
df_filtered = df_tracking[df_tracking['pid'].isin(eligible_pids)]

# Display the number of players and total events in the filtered dataset
print("Number of players after filtering:", df_filtered['pid'].nunique())
print("Total number of events in filtered dataset:", len(df_filtered))

### 7.3 Final Exploratory Data Analysis on Cleaned Data


With our dataset now filtered to exclude `pid=0` and players with fewer than 250 events, we have a more robust subset consisting of 336 players and 260,246 events. In this section, we will re-examine key aspects of the data to confirm that our cleaning and preprocessing steps have improved the quality of our dataset. We will focus on:

1. **Distribution of Event Counts per Player:**  
   - Display a histogram and boxplot of event counts per player to understand the spread of participation.

2. **Spatial Trajectory Visualizations:**  
   - Visualize the on-court movement for 15 randomly selected players from the cleaned dataset using scatterplots.  
   - This will help confirm that our spatial features are representative and that players' movement patterns are preserved in the filtered data.

3. **Additional Graphs (Optional):**  
   - We may include further visualizations (e.g., comparing distributions of key engineered features) to verify the overall data quality.

These visualizations will serve as the final confirmation of our preprocessing and cleaning efforts, ensuring that our dataset is ready for the modeling phase.

#### Distribution of Event Counts per Player – Histogram



Below, we plot a histogram of the event counts per player (from the filtered dataset) to visualize the overall participation levels. This helps us see how many events the majority of players contribute and confirms that our filtering criteria are appropriate.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Re-calculate event counts per player from the filtered dataset
filtered_event_counts = df_filtered['pid'].value_counts()

plt.figure(figsize=(10, 6))
sns.histplot(filtered_event_counts, bins=30, kde=True, color='steelblue')
plt.title("Distribution of Event Counts per Player (Filtered)")
plt.xlabel("Number of Events")
plt.ylabel("Frequency")
plt.show()


#### Distribution of Event Counts per Player – Boxplot



The following boxplot provides a visual summary of the event counts per player after filtering. This will help us identify the central tendency, spread, and any remaining outliers among active players.


In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x=filtered_event_counts, color='lightgreen')
plt.title("Boxplot of Event Counts per Player (Filtered)")
plt.xlabel("Number of Events")
plt.show()


#### Spatial Trajectory Scatterplots for Random Players


To further understand player movement patterns in our cleaned dataset, we will visualize the spatial trajectories for 15 randomly selected players (arranged in 3 rows and 5 columns). Each scatterplot shows the player's (locX, locY) positions color-coded by the in-game time (`cl_seconds`). 

To ensure that the spatial comparisons are meaningful, we will set fixed x and y boundaries for all subplots. For our dataset, we set the x-axis limits to [-250, 250] (representing the full horizontal spread on the court) and the y-axis limits to [900, -80] (inverted to reflect a typical court orientation).

This consistent scaling allows us to directly compare the positional tendencies and hotspots across different players.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# The minimum-event threshold has already been applied in df_filtered.
# Randomly select 15 players from the filtered dataset
np.random.seed(42)  # For reproducibility
selected_players = np.random.choice(df_filtered['pid'].unique(), size=15, replace=False)

# Create subplots for 15 players (3 rows x 5 columns)
fig, axes = plt.subplots(nrows=3, ncols=5, figsize=(25, 15))
axes = axes.flatten()  # Flatten the 2D array of axes for easier iteration

# Define consistent x and y limits for all subplots
x_limits = (-250, 250)
y_limits = (900, -80)  # Inverted y-axis to reflect typical court orientation

# Loop through each selected player and plot their spatial trajectory
for ax, player in zip(axes, selected_players):
    player_data = df_filtered[df_filtered['pid'] == player]
    
    # Create scatter plot: locX vs. locY, color-coded by cl_seconds
    scatter = ax.scatter(player_data['locX'], player_data['locY'], 
                         c=player_data['cl_seconds'], cmap='viridis', s=20, alpha=0.7)
    ax.set_title(f"Player {player}")
    ax.set_xlabel("locX")
    ax.set_ylabel("locY")
    ax.set_xlim(x_limits)
    ax.set_ylim(y_limits)

plt.tight_layout()
plt.show()


### 7.4 Overall Data Summary – Data Quality Confirmation

After extensive preprocessing, filtering, and exploratory analyses, we have arrived at a refined dataset that meets our quality standards. Key points include:

- **Robust Player Participation:**  
  After excluding `pid=0` and players with fewer than 250 events, our dataset now contains 336 players with a total of 260,246 events. This filtering has focused our analysis on active participants, reducing noise from low-activity entries.

- **Engineered Features:**  
  We have successfully computed and validated several key engineered features:
  - **Time Variables:** `cl_seconds` and `time_diff` capture the temporal dynamics.
  - **Spatial Variables:** `locX`, `locY`, along with their differences (`dx`, `dy`), quantify on-court positioning.
  - **Movement Metrics:** The Euclidean distance and velocity provide insights into the magnitude and speed of player movements.
  
- **Distribution & Outlier Analysis:**  
  Our analysis (histograms, boxplots, and correlation matrices) indicates that while some features are strongly right-skewed with outliers (especially `time_diff` and `velocity`), these characteristics align with the fast-paced and variable nature of NBA play. The majority of players exhibit consistent participation and movement patterns, with extreme values flagged for further investigation.

- **Spatial Trajectory Insights:**  
  Visualizations of spatial trajectories reveal distinct movement patterns and hotspots, consistent with typical player roles (e.g., perimeter activity for guards, clustering in the paint for big men).

Overall, the refined dataset is now robust, representative, and ready for advanced modeling and further analysis.
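
The quality criteria summarised above can also be expressed as explicit checks, so that they are re-verified whenever the pipeline is rerun. The following is a minimal sketch: the function name `check_filtered_dataset`, the list of required columns, and the 0 to 720 second clock range are assumptions based on the features and ranges discussed in this notebook.

```python
import pandas as pd

REQUIRED_COLUMNS = ["pid", "locX", "locY", "cl_seconds",
                    "time_diff", "distance", "velocity"]


def check_filtered_dataset(df, min_events=250):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if (df["pid"] == 0).any():
        problems.append("pid=0 rows are still present")
    counts = df["pid"].value_counts()
    low = counts[counts < min_events]
    if not low.empty:
        problems.append(f"{len(low)} player(s) below {min_events} events")
    if (df["distance"] < 0).any():
        problems.append("negative distance values")
    if not df["cl_seconds"].between(0, 720).all():
        problems.append("cl_seconds outside the 0-720 range")
    return problems


# Toy frame with invented values; one pid=0 row is left in on purpose.
toy = pd.DataFrame({
    "pid": [7, 7, 7, 0],
    "locX": [0, 10, 20, 0],
    "locY": [50, 60, 70, 0],
    "cl_seconds": [700, 690, 680, 720],
    "time_diff": [0.0, 10.0, 10.0, 0.0],
    "distance": [0.0, 14.1, 14.1, 0.0],
    "velocity": [0.0, 1.41, 1.41, 0.0],
})
problems = check_filtered_dataset(toy, min_events=3)
clean_problems = check_filtered_dataset(toy[toy["pid"] != 0], min_events=3)
print(problems)
print(clean_problems)
```

On the real data the call would be `check_filtered_dataset(df_filtered)` with the default threshold of 250.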


### 7.5 Saving the Refined Data



With our dataset now fully cleaned, filtered, and enriched with engineered features, the next step is to save this refined data for future modeling and analysis. Saving the dataset in a structured format (such as CSV) will ensure reproducibility and ease of use in subsequent phases of the project.

We will export the refined dataset (`df_filtered`) to a CSV file so that it can be easily loaded for the modeling work outlined in Section 8.


In [None]:
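from pathlib import Path

# Save the refined dataset to a CSV file, creating the output folder if needed
output_path = Path("../data/processed/refined_nba_tracking_data.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
df_filtered.to_csv(output_path, index=False)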

print(f"Refined dataset saved successfully at: {output_path}")

## 8. Next Steps <a id="next-steps"></a>


### Summary of Preprocessing and EDA


- **Data Cleaning & Filtering:**  
  - We filtered out the extreme outlier (`pid=0`) and players with fewer than 250 events, resulting in a refined dataset of 336 active players and 260,246 events.
  - This step ensured that our analysis focuses on players with robust participation, reducing noise from low-activity or potentially erroneous entries.

- **Feature Engineering:**  
  - We successfully derived key features that capture both temporal and spatial dynamics:
    - **Temporal Features:** `cl_seconds` and `time_diff` quantify in-game time and intervals between events.
    - **Spatial Features:** `locX`, `locY`, along with their differences (`dx` and `dy`), provide the basis for computing the Euclidean distance (`distance`) and velocity.
  - These engineered features provide a solid foundation for understanding player movement patterns and game dynamics.

- **Exploratory Data Analysis (EDA):**  
  - We examined the distributions (via histograms, KDE, and boxplots) of both raw and engineered features, revealing important characteristics such as right-skewness and the presence of outliers.
  - Spatial trajectory visualizations confirmed that players exhibit distinct movement patterns and hotspots on the court.
  - Correlation analysis provided insights into the interdependencies among features, further informing our feature selection and preprocessing strategies.


### Next Steps: Roadmap for Predictive Modeling and Further Analysis


1. **Feature Selection & Transformation:**  
   - Reassess and potentially normalize or transform highly skewed features (e.g., `time_diff` and `velocity`) to mitigate the influence of extreme values.
   - Explore additional derived features (e.g., diagonal movement metrics or acceleration) that may further improve model performance.

2. **Predictive Modeling:**  
   - Develop baseline predictive models (e.g., using linear regression, decision trees, or logistic regression) as benchmarks.
   - Explore advanced modeling approaches, such as:
     - **Sequence Models (LSTMs, Transformers):** For predicting future player trajectories based on historical movement data.
     - **Graph Neural Networks (GNNs):** To capture the multi-agent interactions and team dynamics inherent in the sport.

3. **Model Evaluation & Validation:**  
   - Use appropriate evaluation metrics (e.g., mean squared error, AUC, or log-loss) and cross-validation strategies to assess model performance.
   - Conduct error analysis to identify any systematic biases or areas for improvement.

4. **Integration of Additional Data:**  
   - Incorporate contextual game event data (e.g., play-by-play logs, shot charts, rebounds) to enrich the model’s inputs and better capture the nuances of player behavior.
   - Examine how different in-game situations affect player movement and outcomes.

5. **Actionable Insights & Tactical Recommendations:**  
   - Translate the modeling outcomes into actionable insights for coaches and team analysts.
   - Develop dashboards or visual reports to communicate the findings in an intuitive way.
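
As a starting point for the sequence-model step in the roadmap, the sketch below turns per-player position histories into fixed-length input windows and prediction targets. It is a sketch under stated assumptions: the helper `make_windows`, the window lengths, and the choice to group by `GAME_ID`, `PERIOD`, and `pid` (so that windows never span two games, periods, or players) are illustrative, and rows are assumed to be already in chronological order within each group.

```python
import numpy as np
import pandas as pd


def make_windows(df, history=4, horizon=1,
                 keys=("GAME_ID", "PERIOD", "pid"), cols=("locX", "locY")):
    """Build (inputs, targets) arrays of shape (n, history, d) and (n, horizon, d)."""
    inputs, targets = [], []
    for _, group in df.groupby(list(keys), sort=False):
        values = group[list(cols)].to_numpy(dtype=float)
        # Slide over the group; short groups simply produce no windows.
        for start in range(len(values) - history - horizon + 1):
            split = start + history
            inputs.append(values[start:split])
            targets.append(values[split:split + horizon])
    if not inputs:
        return (np.empty((0, history, len(cols))),
                np.empty((0, horizon, len(cols))))
    return np.stack(inputs), np.stack(targets)


# Toy frame: player 1 has six events, player 2 has three (too few for a window).
toy = pd.DataFrame({
    "GAME_ID": [1] * 9,
    "PERIOD": [1] * 9,
    "pid": [1] * 6 + [2] * 3,
    "locX": range(9),
    "locY": [2 * i for i in range(9)],
})
X, y = make_windows(toy)
print(X.shape, y.shape)
```

The resulting arrays have the batch-first layout that recurrent and attention-based models commonly expect, so they can be passed to a framework of choice after scaling.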



### Final Thoughts


Our preprocessing and EDA have produced a cleaned, feature-rich dataset of NBA player movements. The next phase will focus on building predictive models that leverage these insights to forecast player trajectories and extract tactical insights, ultimately contributing to more informed decision-making on and off the court.