**Analyzing 2022-2023 NBA Player Statistics**
---
*  We will set the project up in this section

**Importing Libraries**



*  To start, we import all the libraries needed for this project
*  These imports help us visualize the data and perform calculations

---

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier


**Reading the Data**

*  Using pandas, we read our data in and store it in a variable named data
*  Then, using the .sample() method with frac=1, the rows in our data are randomly shuffled
*  Removing players with 96 minutes of play time or fewer leaves fewer outliers that could skew the results

---

In [None]:
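data = pd.read_csv("data.csv")
# frac=1 returns every row in random order (a full shuffle)
data = data.sample(frac=1)
# keep players with more than 96 minutes played; copy() avoids pandas warnings when columns are added later
data = data[data.MP > 96].copy()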
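
To make the shuffle-and-filter step concrete, here is a minimal sketch on a made-up four-row table. The `toy` frame, its player names, and the `random_state` argument are assumptions added for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical toy table; the real data.csv is not reproduced here.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "MP": [1500, 40, 96, 300],
})

# frac=1 samples 100% of the rows, which amounts to a full shuffle.
# random_state is an addition here so the order is repeatable.
shuffled = toy.sample(frac=1, random_state=0)

# Keep only rows with strictly more than 96 minutes played.
filtered = shuffled[shuffled["MP"] > 96]

print(sorted(filtered["Player"]))  # ['A', 'D']
```

Note that the comparison is strict, so the player with exactly 96 minutes is dropped as well.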

**Descriptive Statistics**

*  To get a better look at our stats as a whole, the describe() function gives a summary (count, mean, standard deviation, min, quartiles, and max) of a selection of statistics

---


In [None]:
data_small = data[["Age", "MP", "FG", "3P", "2P", "FT", "TRB", "AST", "STL", "BLK", "PTS"]]
data_small.describe()
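
As a small illustration of what describe() returns, the sketch below runs it on a made-up two-column table. The `toy` values are hypothetical and chosen only so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical values, not taken from the NBA data set.
toy = pd.DataFrame({"PTS": [10, 20, 30, 40], "AST": [1, 2, 3, 4]})

# describe() returns one row per summary statistic and one column per stat.
summary = toy.describe()

print(list(summary.index))
print(summary.loc[["mean", "50%"]])
```

The row labels are `count`, `mean`, `std`, `min`, `25%`, `50%`, `75%`, and `max`, so a single value can be pulled out with, for example, `summary.loc["mean", "PTS"]`.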

**Cleaning the Data** 

*  In basketball, some players are listed at multiple positions
*  To clean this up, we assign each of these players a single position (the first one listed)
*  We will also redefine data_small as a compact version of our statistics, containing the position and only the stats we want to take a closer look at. This variable will be used throughout the project

---

In [None]:
# map each multi-position label to the first position listed, in the Pos column only
data["Pos"] = data["Pos"].replace({
    "PF-SF": "PF", "SG-PF": "SG", "SG-SF": "SG", "C-PF": "C",
    "PF-C": "PF", "SF-SG": "SF", "SG-PG": "SG",
})

data_small = data[["Pos", "TRB", "AST", "STL", "2P", "3P", "BLK"]]
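
Every replacement in the cell above keeps the first position listed, so the same cleanup could plausibly be written as one string operation. This is a sketch on a hypothetical `pos` Series, assuming the combined labels always use a hyphen as the separator.

```python
import pandas as pd

# Hypothetical position labels for illustration.
pos = pd.Series(["PF-SF", "SG-PF", "C", "PG", "SF-SG"])

# Split on the hyphen and keep the first piece; single-position
# labels such as "C" are returned unchanged.
primary = pos.str.split("-").str[0]

print(primary.tolist())  # ['PF', 'SG', 'C', 'PG', 'SF']
```

This version also handles combinations that the explicit list does not mention, at the cost of being less explicit about which labels were changed.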

In [None]:
# plot the compact data set using seaborn pairplot

# 'size' was renamed to 'height' in newer seaborn releases
seaborn.pairplot(data_small, hue='Pos', height=2)

**Identifying Clusters**
---
*  Before using k-means clustering to partition observations into clusters, we first look for potential clusters defined by our selected class: Player Position
*  We will use matplotlib scatter plots to visualize potential clusters



**3-Pointers per Minute vs Total Rebounds per Minute**


---

In [None]:
data["3P/MP"] = data["3P"]/data["MP"]
data["TRB/MP"] = data["TRB"]/data["MP"]

fig, ax = plt.subplots()

x_var = "TRB/MP"
y_var = "3P/MP"

colors = {'SG':'purple', 'PF':'blue', 'C':'red', 'SF':'orange', 'PG':'yellow'}

ax.scatter(data[x_var], data[y_var], c=data['Pos'].apply(lambda x:colors[x]), s =10)

**2-Pointers per Minute vs Blocks per Minute**

---

In [None]:
data["2P/MP"] = data["2P"]/data["MP"]
data["BLK/MP"] = data["BLK"]/data["MP"]

fig, ax = plt.subplots()

x_var = "BLK/MP"
y_var = "2P/MP"

colors = {'SG':'purple', 'PF':'blue', 'C':'red', 'SF':'orange', 'PG':'yellow'}

ax.scatter(data[x_var], data[y_var], c=data['Pos'].apply(lambda x:colors[x]), s =10)

**Steals per Minute vs Assists per Minute**

---

In [None]:
data["STL/MP"] = data["STL"]/data["MP"]
data["AST/MP"] = data["AST"]/data["MP"]

fig, ax = plt.subplots()

x_var = "AST/MP"
y_var = "STL/MP"

colors = {'SG':'purple', 'PF':'blue', 'C':'red', 'SF':'orange', 'PG':'yellow'}

ax.scatter(data[x_var], data[y_var], c=data['Pos'].apply(lambda x:colors[x]), s =10)

**Total Rebounds per Minute vs Assists per Minute**

---

In [None]:
data["TRB/MP"] = data["TRB"]/data["MP"]
data["AST/MP"] = data["AST"]/data["MP"]

fig, ax = plt.subplots()

x_var = "AST/MP"
y_var = "TRB/MP"

colors = {'SG':'purple', 'PF':'blue', 'C':'red', 'SF':'orange', 'PG':'yellow'}

ax.scatter(data[x_var], data[y_var], c=data['Pos'].apply(lambda x:colors[x]), s =10)

**Assists per Minute vs Blocks per Minute**

---

In [None]:
data["BLK/MP"] = data["BLK"]/data["MP"]
data["AST/MP"] = data["AST"]/data["MP"]

fig, ax = plt.subplots()

x_var = "AST/MP"
y_var = "BLK/MP"

colors = {'SG':'purple', 'PF':'blue', 'C':'red', 'SF':'orange', 'PG':'yellow'}

ax.scatter(data[x_var], data[y_var], c=data['Pos'].apply(lambda x:colors[x]), s =10)
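
The scatter plot cells above repeat the same two steps: dividing a stat by minutes played and looking up a color for each position. As a sketch of how that could be factored out, the hypothetical helper `add_per_minute` below computes all rate columns in one pass, and `Series.map` handles the color lookup. The helper name, the `toy` table, and the gray fallback color are assumptions, not part of the original notebook.

```python
import pandas as pd

COLORS = {"SG": "purple", "PF": "blue", "C": "red", "SF": "orange", "PG": "yellow"}

def add_per_minute(df, stats, minutes_col="MP"):
    """Return a copy of df with a '<stat>/<minutes_col>' column per stat."""
    out = df.copy()
    for stat in stats:
        out[f"{stat}/{minutes_col}"] = out[stat] / out[minutes_col]
    return out

# Hypothetical rows; "SF-PF" stands in for a label missing from COLORS.
toy = pd.DataFrame({
    "Pos": ["PG", "C", "SF-PF"],
    "MP": [100, 200, 400],
    "AST": [50, 20, 40],
    "BLK": [5, 40, 20],
})

rates = add_per_minute(toy, ["AST", "BLK"])

# map() yields NaN for positions missing from COLORS; fillna() turns
# those into gray instead of raising a KeyError.
point_colors = rates["Pos"].map(COLORS).fillna("gray")

print(rates[["AST/MP", "BLK/MP"]])
print(point_colors.tolist())
```

The resulting `point_colors` Series can be passed to the `c` argument of `ax.scatter` in place of the `apply` with a lambda.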

**K-Means Algorithm**
---
*  Using the clusters from Assists per Minute vs Blocks per Minute, we will now use the k-means algorithm to find 3 centroids in our per-minute (normalized) data
---

In [None]:
dfn = data[["AST/MP", "BLK/MP"]]

kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 530, n_init = 10, random_state = 0)

y_kmeans = kmeans.fit_predict(dfn)

print(kmeans.cluster_centers_)
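
To show roughly what happens inside the fit, here is a minimal NumPy sketch of the basic k-means iteration (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. It is not scikit-learn's implementation. The function name `lloyd_kmeans`, the toy points, and the hand-picked starting centers are all assumptions, and the starting centers are passed in explicitly rather than chosen with k-means++.

```python
import numpy as np

def lloyd_kmeans(points, init_centers, n_iter=10):
    """Basic k-means iteration from the given starting centers."""
    centers = np.array(init_centers, dtype=float)  # work on a copy
    for _ in range(n_iter):
        # Assignment step: distance from every point to every center,
        # then the index of the nearest center for each point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers, labels

# Three well-separated pairs of hypothetical 2-D points.
points = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 10.0], [10.0, 11.0],
    [20.0, 0.0], [20.0, 1.0],
])

# Start from one point in each pair.
centers, labels = lloyd_kmeans(points, init_centers=points[[0, 2, 4]])

print(centers)
print(labels)
```

Each row of `centers` plays the same role as a row of `kmeans.cluster_centers_`, and `labels` plays the role of `y_kmeans`.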

The above coordinates represent the 3 center points of the clusters our algorithm discovered. To visualize them, we will once again use a matplotlib scatter plot.

In [None]:
# split the points by assigned cluster label (n_clusters = 3, so labels are 0, 1, and 2)
d0=dfn[y_kmeans == 0]
d1=dfn[y_kmeans == 1]
d2=dfn[y_kmeans == 2]

# Clusters (x_var and y_var still hold "AST/MP" and "BLK/MP" from the previous plot)
plt.scatter(d0[x_var], d0[y_var], s = 10, c = 'blue', label = 'D0')
plt.scatter(d1[x_var], d1[y_var], s = 10, c = 'green', label = 'D1')
plt.scatter(d2[x_var], d2[y_var], s = 10, c = 'red', label = 'D2')

# Centroids: column 0 is AST/MP, column 1 is BLK/MP
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'yellow', label = 'Centroids')
# show the labels set above
plt.legend()

Our k-means algorithm found 3 clusters and their centers, which are very similar to the groupings we observed in the data by player position.
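
One way to check this similarity with numbers rather than by eye would be to cross-tabulate positions against cluster labels. The sketch below uses six made-up players; the `positions` and `clusters` values are hypothetical and do not come from the notebook's results.

```python
import pandas as pd

# Hypothetical labels; real values would come from data["Pos"] and y_kmeans.
positions = pd.Series(["PG", "PG", "C", "C", "SF", "SF"], name="Pos")
clusters = pd.Series([0, 0, 1, 1, 2, 0], name="cluster")

# Rows are positions, columns are cluster labels, cells are player counts.
table = pd.crosstab(positions, clusters)

print(table)
```

A position whose players fall mostly in one column lines up well with that cluster. In the notebook, the analogous call would be something like `pd.crosstab(data["Pos"], y_kmeans)`, assuming the rows of `data` and the entries of `y_kmeans` are in the same order.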