# NBA Playoff Player Performance Analysis and Clustering

This notebook analyzes player performance using NBA Playoff data from 2010 to 2020 and applies the `K-Means` clustering algorithm to group players by performance.

## 1. Library Imports

In [None]:
import pandas as pd
import os
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

## 2. Data Loading and Merging

Reads every CSV file in the specified folder (`your_folder_path/playoffs`) and merges them into a single DataFrame.
The season year, taken from the first four characters of each file name, is added to every player row as a `Year` column.

In [None]:
# ATTENTION: Don't forget to update this path according to your file structure!
folder_path = 'your_folder_path/playoffs'

all_data = []
for file in os.listdir(folder_path):
    if file.endswith('.csv'):
        # Extract the year from the first 4 digits of the file name
        year = int(file[:4])
        df = pd.read_csv(os.path.join(folder_path, file))
        # Add the Year column
        df["Year"] = year
        all_data.append(df)

# Concatenate all DataFrames
df_all = pd.concat(all_data, ignore_index=True)
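
The loop above assumes that every file name begins with a four-digit year. As a minimal sketch, a stricter parser could report a clearer error when a file does not follow that pattern. The helper name `extract_year` and the sample file names are hypothetical and not part of the original notebook.

```python
import re


def extract_year(file_name: str) -> int:
    """Return the four-digit year that starts a file name (hypothetical helper)."""
    match = re.match(r"(\d{4})", file_name)
    if match is None:
        raise ValueError(f"File name does not start with a year: {file_name!r}")
    return int(match.group(1))


# Hypothetical file name, used only for illustration
print(extract_year("2015_playoffs.csv"))  # prints 2015
```

Inside the loading loop, `year = extract_year(file)` could stand in for `int(file[:4])` if this stricter check is wanted.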

## 3. Data Preprocessing and Cleaning

* Cleans up white spaces in column names.
* Removes repeated header rows (rows whose `Player` value is the literal string `Player`).
* Cleans up `*` (Hall of Fame indicator) and excess spaces from player names.
* Converts the `MP` (Minutes Played) column to a numeric format.
* Keeps only player-season rows with at least 100 minutes played (`MP >= 100`), so that the metrics are not distorted by very small samples.
* Selects the metrics to be used for analysis (`PER`, `WS`, `BPM`, `TS%`).
* Converts the selected metrics to numeric format and drops rows with missing values.

In [None]:
df_all.columns = df_all.columns.str.strip()
# .copy() avoids pandas' SettingWithCopyWarning on the column assignments below
df_all = df_all[df_all['Player'] != 'Player'].copy()
df_all['Player'] = df_all['Player'].str.replace('*', '', regex=False).str.strip()

# MP (Minutes Played) is converted to numeric format and filtered
df_all['MP'] = pd.to_numeric(df_all['MP'], errors='coerce')
df_all = df_all[df_all['MP'] >= 100]

# Performance metrics to be used
metrics = ['PER', 'WS', 'BPM', 'TS%']
# .copy() keeps the later in-place conversions from acting on a slice
df_all = df_all[['Player', 'Year'] + metrics].copy()

# Metrics are converted to numeric format and NaN values are dropped
df_all[metrics] = df_all[metrics].apply(pd.to_numeric, errors='coerce')
df_all.dropna(inplace=True)
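
To make the effect of each cleaning step visible, the sketch below runs the same operations on a tiny DataFrame. The rows are made up for illustration and are not real statistics. Only one metric column is included to keep the example short.

```python
import pandas as pd

# Made-up rows for illustration only (not real statistics)
toy = pd.DataFrame({
    "Player": ["Player A*", "Player", " Player B ", "Player C"],
    "MP": ["250", "MP", "80", "310"],
    "PER": ["21.5", "PER", "12.0", "n/a"],
})

# Drop the repeated header row
toy = toy[toy["Player"] != "Player"].copy()
# Strip the '*' marker and surrounding spaces from names
toy["Player"] = toy["Player"].str.replace("*", "", regex=False).str.strip()
# Text that cannot be parsed as a number becomes NaN
toy["MP"] = pd.to_numeric(toy["MP"], errors="coerce")
# Removes Player B (80 minutes)
toy = toy[toy["MP"] >= 100].copy()
toy["PER"] = pd.to_numeric(toy["PER"], errors="coerce")
# Removes Player C (unparseable PER)
toy = toy.dropna()

print(toy["Player"].tolist())  # ['Player A']
```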

## 4. Calculating Player Average Performance and Normalization

* Averages each metric per player across all of that player's 2010-2020 playoff appearances that passed the minutes filter.
* Normalizes the metrics between 0 and 1 using `MinMaxScaler`.
* Creates an **Overall Performance Score** for players by taking the mean of the normalized metrics.

In [None]:
# Calculate the average of metrics per player
df_player_avg = df_all.groupby('Player')[metrics].mean().reset_index()

# Normalization (Min-Max Scaling)
scaler = MinMaxScaler()
df_player_avg[metrics] = scaler.fit_transform(df_player_avg[metrics])

# Calculate the overall score with the mean of normalized metrics
df_player_avg['Score'] = df_player_avg[metrics].mean(axis=1)
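
Min-max scaling maps each value to `(x - min) / (max - min)` within its own column, and the score is the unweighted mean of the scaled columns. The sketch below works through that arithmetic in plain Python on made-up averages. The function name `min_max_scale` is hypothetical and only stands in for what `MinMaxScaler` does per column.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; return zeros to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up per-player averages for three players
per = [10.0, 20.0, 30.0]
ws = [0.5, 1.5, 1.0]

per_scaled = min_max_scale(per)  # [0.0, 0.5, 1.0]
ws_scaled = min_max_scale(ws)    # [0.0, 1.0, 0.5]

# Unweighted mean of the scaled metrics, one score per player
scores = [(p + w) / 2 for p, w in zip(per_scaled, ws_scaled)]
print(scores)  # [0.0, 0.75, 0.75]
```

Because the mean is unweighted, every metric contributes equally to the score. The second and third players tie here even though their strengths lie in different metrics.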

## 5. Top 10 Players

Lists the top 10 players of the 2010-2020 playoff era based on the calculated **Overall Performance Score**.

In [None]:
# Sort by Score and take the top 10
top_10_players = df_player_avg.sort_values('Score', ascending=False).head(10).reset_index(drop=True)

print("Top 10 Players in the 2010–2020 Playoffs (Based on Overall Average Score):")
print(top_10_players[['Player', 'Score']])

## 6. K-Means Clustering

K-Means clustering is applied to divide players into 4 different performance groups using the normalized performance metrics (`PER`, `WS`, `BPM`, `TS%`).

* `n_clusters` is set to 4.
* Clusters are labeled as `Elite`, `Good`, `Average`, and `Below Average` based on their mean **Score** values.

In [None]:
# Create and apply the K-Means model (k=4)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df_player_avg['Cluster'] = kmeans.fit_predict(df_player_avg[metrics])

# Sort and name cluster labels by score
cluster_order = df_player_avg.groupby('Cluster')['Score'].mean().sort_values(ascending=False).index
label_map = {
    cluster_order[0]: 'Elite',
    cluster_order[1]: 'Good',
    cluster_order[2]: 'Average',
    cluster_order[3]: 'Below Average'
}

# Replace cluster numbers with meaningful labels
df_player_avg['Label'] = df_player_avg['Cluster'].map(label_map)
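
The cluster ids returned by K-Means are arbitrary, which is why the cell above ranks clusters by mean score before naming them. A minimal sketch of that ranking step in plain Python, using hypothetical mean scores:

```python
# Hypothetical mean Score per cluster id (the ids carry no inherent order)
cluster_means = {0: 0.31, 1: 0.72, 2: 0.18, 3: 0.49}
tier_names = ["Elite", "Good", "Average", "Below Average"]

# Rank cluster ids from highest to lowest mean score
ranked_ids = sorted(cluster_means, key=cluster_means.get, reverse=True)

# Pair the best-ranked cluster with the first tier name, and so on
label_map = dict(zip(ranked_ids, tier_names))
print(label_map)
# {1: 'Elite', 3: 'Good', 0: 'Average', 2: 'Below Average'}
```

Building the mapping with `zip` also keeps it correct if the number of clusters changes, provided `tier_names` has one name per cluster.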

## 7. Visualization

A scatter plot shows each player by normalized `PER` and `WS`, with one color per performance group.
The names of the players in the **Elite** cluster, who have the highest level of performance, are shown on the chart.

In [None]:
plt.figure(figsize=(12, 8))

# Plot each performance group in a different color
for label in df_player_avg['Label'].unique():
    subset = df_player_avg[df_player_avg['Label'] == label]
    plt.scatter(subset['PER'], subset['WS'], label=label, alpha=0.7)

# Add the names of Elite players to the chart
elite_players = df_player_avg[df_player_avg['Label'] == 'Elite']
for i, row in elite_players.iterrows():
    plt.text(row['PER'], row['WS'], row['Player'], fontsize=8, alpha=0.9)

# The plotted values are the min-max normalized metrics, not the raw ones
plt.xlabel("Normalized PER (Player Efficiency Rating)")
plt.ylabel("Normalized WS (Win Shares)")
plt.title("Player Performance Groups with K-Means Clustering (Elite Players Labeled)")
plt.legend(title='Performance Group')
plt.grid(True)
plt.tight_layout()
plt.show()