## MP Phase 1

**S11 - Group x**

**Submitted By:** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**Chua Ching, Janine**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**Ileto, Maxine**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**Dytoc, Ayisha**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**Tan, Jared**

# Introduction

## Target Task

# Dataset Description

## Import
The following are imported for this project:

In [None]:
# Python Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Dataset
df = pd.read_csv('sports.csv')

## Description

## Data Collection Process


## Dataset Features


## Dataset Variables

The dataset contains a total of 22 variables (columns). The following are the descriptions of each variable in the dataset:

- **`#`**: the index number of the observation, indicating the order in which it was added to the dataset

To show a few examples, the first 5 rows of the dataset will be shown below:

In [None]:
df.head()

### To verify that the variables used have no missing values, each variable is checked for null values:

In [None]:
# Copy dataframe, but only contain the players' height, weight, points, rebounds, and assists
name_height_weight_pts_reb_ast = df[['player_name','player_height', 'player_weight', 'pts', 'reb', 'ast']]

# Check if there is any null values in the variables
name_height_weight_pts_reb_ast.isnull().any()

With the list displayed above, we can determine that there are no missing values for the variables used.

### To verify that the data types of the variables used are correct, the data type of each variable is checked:

In [None]:
# Check the data type of the used variables
name_height_weight_pts_reb_ast.info()

All data types are *float64* data types except for the **`player_name`** column. This is to be expected as the **`player_name`** column is a string and pandas interprets it as an object while the five other columns are expected to be decimal numbers. 

### To verify that there are no default values, the variables used are checked for them (note that zero points, rebounds, or assists are possible, while height and weight cannot be zero):

In [None]:
# Copy dataframe, but only contain the players' height and weight
height_weight = df[['player_height', 'player_weight']]

# Copy dataframe, but only contain the players' points, rebounds, and assists
pts_reb_ast = df[['pts', 'reb', 'ast']]

# Check if there is any default values for the height and weight or the points, rebounds, and assists
((height_weight.values <= 0).any() or (pts_reb_ast.values < 0).any())

With the code returning the boolean *False*, we can determine that there are no default values for the variables used.

### To verify that no observations are duplicated, the observations are checked for uniqueness after dropping the **`#`** (index) column into a new dataset.

In [None]:
# Drop duplicate rows in dataframe
df.drop_duplicates()

The **`#`** (index) column should be removed, as it only indicates when an observation was added to the dataset. Because its value differs on every row, the **`#`** column would keep otherwise identical observations from being detected as duplicates.

In [None]:
# Create a copy of the dataframe without the index column (the first column)
df_no_index = df.drop(columns=df.columns[0])

After creating a dataset without the index column, we then check if all observations (rows) are unique.

In [None]:
df_no_index.drop_duplicates()

Since the number of observations (rows) did not change after dropping duplicates, we can determine that there are no duplicates in the dataset.
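
A more direct way to check is to count the rows that `duplicated()` flags. The sketch below uses a small made-up dataframe (the `toy` frame and its values are assumptions for illustration, not rows from `sports.csv`) to show why the index column has to be dropped first:

```python
import pandas as pd

# Hypothetical toy data: the last two rows differ only in the index column '#'
toy = pd.DataFrame({
    "#": [0, 1, 2],
    "player_name": ["A", "B", "B"],
    "pts": [10.0, 5.5, 5.5],
})

# With the index column present, every row looks unique
dupes_with_index = int(toy.duplicated().sum())

# After dropping the index column, the repeated row is detected
toy_no_index = toy.drop(columns=toy.columns[0])
dupes_without_index = int(toy_no_index.duplicated().sum())

print(dupes_with_index, dupes_without_index)
```

Applied to the notebook's data, a result of 0 from `df_no_index.duplicated().sum()` would support the same conclusion.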

### Despite the data being complete and without default values, the data needs to be pre-processed for the second Exploratory Data Analysis question, "What is the distribution of the NBA players' heights?"

Given that each observation in the dataset contains an NBA player's data for a specific season, multiple observations may include the same player. There is a possibility that the heights of players may not be consistent throughout seasons, which may affect the data. With this, the consistency of player heights will first be verified. 

In [None]:
# verify if all NBA players have consistent heights throughout their seasons
names_df = df.groupby(["player_name", "player_height"])[["player_name", "player_height"]]

# print result
for key, item in names_df:
    print(names_df.get_group(key))

In [None]:
# verify if each groupby has only 1 unique name and height value
unique_names_df = names_df.nunique() 
unique_names_df

In [None]:
# verify if only one unique value for player_name and player_height
unique_names_df.nunique()

As shown above, the values of the players' heights were found to remain consistent throughout different seasons.
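
One way to make this check explicit is to group by `player_name` alone and count the distinct heights recorded for each player; any count above 1 marks an inconsistency. The following is a minimal sketch on made-up rows (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data: player "B" is recorded with two different heights
toy = pd.DataFrame({
    "player_name": ["A", "A", "B", "B", "C"],
    "player_height": [198.12, 198.12, 205.74, 208.28, 190.50],
})

# Number of distinct heights recorded for each player
heights_per_player = toy.groupby("player_name")["player_height"].nunique()

# Players with more than one recorded height
inconsistent = heights_per_player[heights_per_player > 1]
print(inconsistent.index.tolist())
```

An empty result on the full dataset would support the statement above.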

The heights of each player will then be stored in the variable **`heights_df`** to be easily used in the Exploratory Data Analysis portion of the notebook.

In [None]:
heights_df = df.drop_duplicates(subset=['player_name'], keep='first')
heights_df = heights_df[["player_name", "player_height"]]

The first five observations in this dataframe are shown below.

In [None]:
heights_df.head()

# Exploratory Data Analysis

### Are there correlations between NBA players' height, weight, points, rebounds, and assists? 

Reading through the dataset, five variables stood out for the research and they were the following:

- **`player_height`**: the height of the player in centimeters
- **`player_weight`**: the weight of the player in kilograms
- **`pts`**: the average number of points the player scored per game for that season
- **`reb`**: the average number of rebounds the player grabbed per game for that season
- **`ast`**: the average number of assists the player made per game for that season

With these columns in mind, a correlation matrix indicating their levels of correlation is displayed below:

In [None]:
# Create a correlation matrix with the default Pearson method and round values to 2 decimal places
height_weight_pts_reb_ast = df[['player_height', 'player_weight', 'pts', 'reb', 'ast']]
height_weight_pts_reb_ast.corr().round(2)

Based on this correlation matrix, the following observations were found (Akoglu, 2018):

Positive Correlations:
<ul>
    <li>height and weight</li>
    <li>height and rebounds</li>
    <li>weight and rebounds</li>
    <li>points and rebounds</li>
    <li>points and assists</li>
    <li>rebounds and assists</li>
</ul>

Negative Correlations:
<ul>
    <li>height and assists</li>
    <li>weight and assists</li>    
</ul>

No Correlations:
<ul>
    <li>height and points</li>
    <li>weight and points</li>
</ul>

Strong Correlations:
<ul>
    <li>height and weight</li>
</ul>

Moderate Correlations:
<ul>
    <li>height and rebounds</li>
    <li>height and assists</li> 
    <li>weight and rebounds</li>
    <li>weight and assists</li> 
    <li>points and rebounds</li>
    <li>points and assists</li>
</ul>

Weak Correlations:
<ul>
    <li>rebounds and assists</li>
</ul>

To also visualize the relationship of each variable, a scatter plot was generated to compare each of the variables in the dataset.

In [None]:
# Create a scatter plot, comparing each variable to each other
import seaborn as sns

corr_scatter = sns.pairplot(height_weight_pts_reb_ast, plot_kws={'alpha': 0.1})
corr_scatter

From the displayed data, player **height** and **weight** are **strongly correlated**. According to Akoglu (2018), a Pearson correlation coefficient greater than 0.80 indicates a strong relationship. Because the NBA players' height and weight have a correlation coefficient of 0.83, exceeding 0.80, we can say that these two variables are strongly correlated.
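
To make the coefficient itself concrete, the sketch below computes Pearson's r by hand and compares it with the value returned by pandas. The heights and weights are made-up numbers chosen for illustration, not values from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical heights and weights, not taken from the dataset
toy = pd.DataFrame({
    "player_height": [185.0, 190.5, 198.1, 203.2, 210.8],
    "player_weight": [84.0, 88.5, 97.0, 104.3, 113.4],
})

x = toy["player_height"].to_numpy()
y = toy["player_weight"].to_numpy()

# Pearson r: sum of products of deviations, divided by the square root of
# the product of the two sums of squared deviations
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity with its default Pearson method
r_pandas = toy["player_height"].corr(toy["player_weight"])

print(round(r_manual, 4), round(r_pandas, 4))
```

Both values should agree up to floating-point error, which is the same quantity that each cell of the correlation matrix reports.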

### What is the distribution of the NBA players' heights?

In [None]:
# display histogram
heights_df.hist("player_height", bins=50, edgecolor='w', figsize=(10, 6))
plt.show()

Included below are the numerical summaries of the heights of NBA players. The quartile, minimum and maximum values are displayed, as well as the mean, median, mode and standard deviation.

In [None]:
# quartile, minimum and maximum values
heights_df["player_height"].quantile([0.0,0.25,0.5,0.75,1.0])

In [None]:
# mode
heights_df.mode()["player_height"][0]

In [None]:
# mean, median and std
heights_df.agg({"player_height": ["mean", "median", "std"]}).round(2)

The heights of NBA players are approximately normally distributed, with the middle 50% of the data (25% to 75%) ranging from 193.04 to 205.74 centimeters. The data has a standard deviation of 9.05 cm. 25% of the data fall below 193.04 cm, 50% fall below 200.66 cm, and 75% fall below 205.74 cm.

The shortest NBA player was found to be 160.02 cm, while the tallest was found to be 231.14 cm. Meanwhile, the most common height of NBA players was found to be 205.74 cm.

### What is the distribution of the average number of points scored per game per season?

In [None]:
# display histogram
df.hist("pts", bins=50, edgecolor='w', figsize=(10, 5))
plt.show()

Included below are the numerical summaries of the average number of points scored per game per season. The quartile, minimum and maximum values are displayed, as well as the mean, median, mode and standard deviation.

In [None]:
# quartile, minimum and maximum values
df["pts"].quantile([0.0,0.25,0.5,0.75,1.0])

In [None]:
# mode
df.mode()['pts'][0]

In [None]:
# mean, median and std
df.agg({"pts": ["mean", "median", "std"]}).round(2)

The histogram of the average number of points of each player per season shows a positively skewed (right-skewed) distribution. The middle 50% of the data (25% to 75%) contains values from 3.6 to 11.5 points. The data has a standard deviation of 5.97 points, while 25% of the data fall below 3.6, 50% fall below 6.7, and 75% fall below 11.5. The most common average number of points scored was found to be 2.

As shown in the histogram, a majority of the data lie between 0 and 10. Despite this, there are still outliers to the right of the histogram, with the highest average number of points scored being 36.1 and the lowest being 0.
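
A right-skewed distribution typically shows a mean above the median and a positive sample skewness. The sketch below illustrates this on a made-up series of scoring averages (the values are assumptions for illustration, not the dataset's `pts` column):

```python
import pandas as pd

# Hypothetical per-game scoring averages: many low values, a few very high ones
pts = pd.Series([0.0, 1.2, 2.0, 2.0, 3.6, 4.1, 6.7, 8.3, 11.5, 19.4, 36.1])

mean, median = pts.mean(), pts.median()

# Sample skewness; positive values indicate a longer right tail
skewness = pts.skew()

print(f"mean={mean:.2f} median={median:.2f} skew={skewness:.2f}")
```

The same two checks, `df["pts"].mean() > df["pts"].median()` and `df["pts"].skew() > 0`, could be used to back the visual reading of the histogram.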

### What is the distribution of the average number of rebounds per game?

In [None]:
# display histogram
df.hist("reb", bins=50, edgecolor='w', figsize=(10, 5))
plt.show()

As shown in the histogram above, the average number of rebounds per game is **positively skewed**.

In [None]:
# mode
df.mode()["reb"][0]

In [None]:
# mean, median and std
df.agg({"reb": ["mean", "median", "std"]}).round(2)

In [None]:
# quartile, minimum and maximum values
df["reb"].quantile([0.0, 0.25, 0.5, 0.75, 1.0])

After calculating the numerical summaries, it is clear that the average number of rebounds per game has a positively skewed or right-skewed distribution. The most common amount of average rebounds per game is at 2.0. The data has a standard deviation of 2.48 rebounds. It can also be observed that 25% of the data fall below 1.8 rebounds, 50% of the data fall below 3.0 rebounds, and 75% below 4.7 rebounds. The highest amount of average rebounds per game is at 16.3 while the lowest is at 0.0 rebounds per game.

### What is the distribution of the average number of assists per game?

In [None]:
# display histogram
df.hist("ast", bins=50, edgecolor='w', figsize=(10, 5))
plt.show()

As shown in the histogram above, the average number of assists per game is **positively skewed**.

In [None]:
# mode
df.mode()["ast"][0]

In [None]:
# mean, median and std
df.agg({"ast": ["mean", "median", "std"]}).round(2)

In [None]:
# quartile, minimum and maximum values
df["ast"].quantile([0.0, 0.25, 0.5, 0.75, 1.0])

After calculating the numerical summaries, it is clear that the average number of assists per game has a positively skewed or right-skewed distribution. The most common amount of average assists per game is at 0.3. The data has a standard deviation of 1.79 assists. It can also be observed that 25% of the data fall below 0.6 assists, 50% of the data fall below 1.2 assists, and 75% below 2.4 assists. The highest amount of average assists per game is at 11.7 while the lowest is at 0.0 assists per game.

# Research Question

In NBA teams, players are assigned one of the five positions in a basketball team, namely point guard, shooting guard, small forward, power forward, and center. Assignments are primarily determined by their role and height. Given that each player may be assigned to a different role within a game, a player's performance cannot solely be determined by their number of points scored. Rebounds gained and assists given can also be used to determine if a player is fulfilling a specific role given to them by their team. The scatterplots displayed in the first Exploratory Data Analysis question show a possibility of multiple clusters existing when comparing NBA players' season performances across different variables.

Because the Exploratory Data Analysis found that height and weight are strongly correlated with one another, the group decided to use only height for the research question.

With this in mind, the group aims to answer the research question "**How many clusters are found when basketball players are clustered by their height, points scored, rebounds gained, and assists given?**" in the second phase of this project.

## MP Phase 2

**S13 - Group 6**

**Submitted By:** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**Ileto, Maxine**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**Posadas, Annika**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**Tan, Jared**

# Data Modelling

### Data Normalization

Prior to modelling the data, a dataframe containing only values needed for answering the research question will first be made. This dataframe will be named `values_df`. A sample of the first five observations is also shown below.

In [None]:
values_df = pd.DataFrame(df, columns = ['player_name','player_height','pts','reb','ast'])
values_df.head()

A dataframe will also be made containing only numerical values (player height, points, rebounds, and assists). The dataframe values will be preprocessed using min-max normalization. After preprocessing, all values will lie in the range 0 to 1.

This dataframe will be called `normalized_df`, and its first five observations are shown below.

In [None]:
from sklearn.preprocessing import MinMaxScaler

int_values_df = df[['player_height','pts','reb','ast']]

# normalize values using MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(int_values_df)
normalized_df = scaler.transform(int_values_df)

normalized_df = pd.DataFrame(normalized_df, columns = ['player_height','pts','reb','ast'])

normalized_df.head()
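
As a sanity check on what the scaler does, the sketch below compares `MinMaxScaler` with the formula (x - min) / (max - min) applied column by column. The `toy` frame and its values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    "player_height": [180.0, 200.0, 220.0],
    "pts": [2.0, 10.0, 30.0],
})

# Manual min-max scaling: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# MinMaxScaler with its default feature_range=(0, 1) applies the same formula
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(np.allclose(manual.to_numpy(), scaled.to_numpy()))
```

Because each column is scaled by its own minimum and maximum, a variable with a wide range such as height does not dominate the distances used later in clustering.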

### Determining the Number of Clusters

After preprocessing the data, the optimal k-value was determined using the elbow method. Based on the graph below, the optimal value was found to be either 4 or 5.

For the remainder of this notebook, the k-value that will be used by the group is 5.

In [None]:
from sklearn.cluster import KMeans

limit = 15
 
# wcss - within cluster sum of squared distances
wcss = {}

for k in range(2, limit + 1):
    model = KMeans(n_clusters=k)
    model.fit(normalized_df)
    wcss[k] = model.inertia_

plt.plot(list(wcss.keys()), list(wcss.values()), 'gs-')
plt.xlabel('Values of "k"')
plt.ylabel('WCSS')
plt.show()
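
To illustrate how the elbow is read, the sketch below runs the same procedure on synthetic data with a known number of groups and prints how much the WCSS drops with each added cluster. The data, the cluster centers, and the `drops` helper dictionary are assumptions made for illustration; they are not part of the group's analysis:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (centers chosen for illustration)
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

# Within-cluster sum of squares for each candidate k
wcss = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    wcss[k] = model.fit(X).inertia_

# Improvement gained by each additional cluster; the elbow is where it collapses
drops = {k: wcss[k - 1] - wcss[k] for k in range(2, 8)}
for k, drop in drops.items():
    print(k, round(drop, 1))
```

On data like this the drop shrinks sharply after k = 4. On real data the change is usually more gradual, which is consistent with the graph above leaving both 4 and 5 as candidates.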

### Dataset Clustering

In [None]:
import plotly.express as px

kmeans = KMeans(
    n_clusters=5,
    init="k-means++",
    n_init=10,
    tol=1e-04,
    random_state=0,
)

kmeans.fit(normalized_df)

# Work on a copy so that normalized_df itself does not gain a 'label' column
clusters = normalized_df.copy()

clusters['label'] = kmeans.labels_

polar = clusters.groupby("label").mean().reset_index()

polar = pd.melt(polar,id_vars=["label"])

fig = px.line_polar(polar, r="value", theta="variable", color="label", color_discrete_sequence=['red','orange','green','blue','indigo'], line_close=True, height=600, width=600)

fig.show()

<!-- Image File -->
<img src="img/polarchart.png" alt="Polar Chart of Mean Normalized Values per Cluster"/>

Based on the polar chart above, the five resulting clusters all have distinct characteristics. Players in Cluster 0 are characterized by average player heights but the lowest points, rebounds and assists. Players in Cluster 1 are the shortest, have average points and assists but the lowest rebounds. Players in Cluster 2 are the tallest, have average assists but have high points and the highest rebounds. Meanwhile, players in Cluster 3 are relatively average across height, points, rebounds and assists, while lastly, players in Cluster 4 are average in terms of height and rebounds, but score the highest number of points and assists.

With the resulting polar chart, it also appears that the clustering resulted in two main groups with similar shapes, namely clusters 0, 2, and 3, and clusters 1 and 4. Clusters 0, 2, and 3 all have relatively similar shapes as indicated in the chart, however their sizes differ. While cluster 2 is the largest, cluster 0 is the smallest. On the other hand, clusters 1 and 4 have similar shapes, with cluster 1 being smaller.

The only very close similarities observed in terms of values can be seen in player points, where clusters 2 and 4 have very close values, and in player rebounds, where clusters 0 and 1 have close values as well.

In [None]:
clusters.groupby('label').size().reset_index()

print("Count of Each Cluster from Total Number of Players")
clusters['label'].value_counts().sort_index(ascending=True)

<!-- Image File -->
<img src="img/clustercount.png" alt="Count of Each Cluster from Total Number of Players"/>

In [None]:
print("Percentage of Each Cluster from Total Number of Players")
clusters['label'].value_counts(normalize=True) * 100

<!-- Image File -->
<img src="img/clusterpercent.png" alt="Percentage of Each Cluster from Total Number of Players"/>

After segregating each observation based on their clusters, it was found that Cluster 0 comprises 32% of all observations with 3,937 entries, Cluster 1 comprises 22.35% with 2,750 observations, Cluster 2 comprises 9.65% with 1,187 observations, Cluster 3 comprises 24.58% with 3,024 observations, and lastly, Cluster 4 comprises 11.43% with 1,407 observations.

### Cluster Description

#### Joining **`player_name`** Back Into the Clusters

In order to create a dataframe with clusters that do not change with every kernel restart, the following dataframes were saved once to external comma-separated values (CSV) files with the following code:

- clusters.to_csv('clusters.csv', index=False)
- names_normalized_df.to_csv('names_normalized_df.csv', index=False)

Thereafter, any future access in this notebook is a reference to these files.

In [None]:
clusters = pd.read_csv('clusters.csv')
names_normalized_df = pd.read_csv('names_normalized_df.csv')

In [None]:
labels_df = pd.DataFrame(clusters, columns = ['label'])

label_values_df = values_df.join(labels_df)
label_values_df.head()

To find the summary statistics across the five clusters, we first grouped the observations by their cluster.

In [None]:
# Group by a single column name so that each group key is a plain integer label
clusters_df = label_values_df.groupby('label')

#### Player Heights EDA per Cluster

Shown below are the numerical summaries of the heights of the NBA players per cluster, namely its mean, median, and standard deviation.

In [None]:
clusters_df.agg({"player_height": ["mean", "median", "std"]}).round(2)

We also displayed the box plot of each cluster to see how the data is distributed within each cluster.

In [None]:
# Set the figure size before the figure is created so that it takes effect
plt.rcParams["figure.figsize"] = [15, 6]
plt.title("Box Plot of Player Heights")
plt.ylabel("Player Height")

for i, grp in clusters_df:
    plt.boxplot(x='player_height', data=grp, positions=[i])

plt.xticks([0, 1, 2, 3, 4], ['Cluster 0', 'Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4',])
plt.show()

The box plot of Cluster 0 shows the heights of the players in this cluster with a mean of 204.80, a median of 205.74, and a standard deviation of 6.05. The overall shape of this cluster's box plot is negatively skewed, with multiple outliers on the right.

The box plot of Cluster 1 shows the heights of the players in this cluster with a mean of 190.05, a median of 190.50, and a standard deviation of 5.47. The overall shape of this cluster's box plot is nearly symmetrical, with multiple outliers on both sides but mostly on the left.

The box plot of Cluster 2 shows the heights of the players in this cluster with a mean of 208.61, a median of 208.28, and a standard deviation of 4.95. The overall shape of this cluster's box plot is nearly symmetrical, with multiple outliers on both sides but mostly on the right.

The box plot of Cluster 3 shows the heights of the players in this cluster with a mean of 205.54, a median of 205.74, and a standard deviation of 5.38. The overall shape of this cluster's box plot is negatively skewed, with multiple outliers on its right.

The box plot of Cluster 4 shows the heights of the players in this cluster with a mean of 192.20, a median of 190.50, and a standard deviation of 6.61. The overall shape of this cluster's box plot is positively skewed, with a few outliers on its left.
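
The points drawn individually in these box plots are the values that fall beyond the whiskers, which matplotlib places by default at 1.5 times the interquartile range (IQR) from the quartiles. The sketch below counts such points per cluster on made-up data; the `toy` frame and the `count_outliers` helper are hypothetical and only stand in for `label_values_df`:

```python
import pandas as pd

# Hypothetical labelled heights standing in for label_values_df
toy = pd.DataFrame({
    "label": [0] * 7 + [1] * 7,
    "player_height": [200, 203, 204, 205, 206, 207, 230,
                      175, 188, 189, 190, 191, 192, 193],
})

def count_outliers(values: pd.Series) -> pd.Series:
    """Count values below Q1 - 1.5*IQR and above Q3 + 1.5*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    low = int((values < q1 - 1.5 * iqr).sum())
    high = int((values > q3 + 1.5 * iqr).sum())
    return pd.Series({"low": low, "high": high})

# One row per cluster, with the number of low-end and high-end outliers
outliers = pd.DataFrame({
    label: count_outliers(group)
    for label, group in toy.groupby("label")["player_height"]
}).T

print(outliers)
```

Counting outliers this way gives a numerical complement to the visual descriptions of each cluster's box plot.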

#### Player Points EDA per Cluster

Shown below are the numerical summaries of the points scored by the NBA players per cluster, namely its mean, median, and standard deviation.

In [None]:
clusters_df.agg({"pts": ["mean", "median", "std"]}).round(2)

We also displayed the box plot of each cluster to see how the data is distributed within each cluster.

In [None]:
# Set the figure options before the figure is created so that they take effect
plt.rcParams["figure.figsize"] = [15, 6]
plt.rcParams["figure.autolayout"] = True
plt.title("Box Plot of Player Points")
plt.ylabel("Player Points")

for i, grp in label_values_df.groupby('label'):
    plt.boxplot(x='pts', data=grp, positions=[i])

plt.xticks([0, 1, 2, 3, 4], ['Cluster 0', 'Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4',])
plt.show()

The box plot of Cluster 0 shows the points of the players in this cluster with a mean of 3.11, a median of 3.00, and a standard deviation of 1.72. The overall shape of this cluster's box plot is positively skewed, with an upper whisker longer than its lower whisker and many outliers to its right.

The box plot of Cluster 1 shows the points of the players in this cluster with a mean of 6.29, a median of 6.05, and a standard deviation of 3.15. The overall shape of this cluster's box plot is positively skewed, with an upper whisker longer than its lower whisker and many outliers to its right.

The box plot of Cluster 2 shows the points of the players in this cluster with a mean of 16.63, a median of 16.40, and a standard deviation of 4.68. The overall shape of this cluster's box plot is nearly symmetrical, with one outlier to the left and many outliers to its right.

The box plot of Cluster 3 shows the points of the players in this cluster with a mean of 9.06, a median of 8.70, and a standard deviation of 3.06. The overall shape of this cluster's box plot is nearly symmetrical, with one outlier to the left and many outliers to its right.

The box plot of Cluster 4 shows the points of the players in this cluster with a mean of 16.99, a median of 16.50, and a standard deviation of 5.22. The overall shape of this cluster's box plot is positively skewed, with an upper whisker longer than its lower whisker and many outliers to its right.

#### Player Rebounds EDA per Cluster

Shown below are the numerical summaries of the rebounds of the NBA players per cluster, namely its mean, median, and standard deviation.

In [None]:
clusters_df.agg({"reb": ["mean", "median", "std"]}).round(2)

We also displayed the box plot of each cluster to see how the data is distributed within each cluster.

In [None]:
# Set the figure options before the figure is created so that they take effect
plt.rcParams["figure.figsize"] = [15, 6]
plt.rcParams["figure.autolayout"] = True
plt.title("Box Plot of Player Rebounds")
plt.ylabel("Player Rebounds")

for i, grp in clusters_df:
    plt.boxplot(x='reb', data=grp, positions=[i])

plt.xticks([0, 1, 2, 3, 4], ['Cluster 0', 'Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4',])
plt.show()

The box plot of Cluster 0 shows the rebounds of the players in this cluster with a mean of 1.97, a median of 1.9, and a standard deviation of 1.01. The overall shape of this cluster's box plot is positively skewed, with an upper whisker longer than its lower whisker and a few outliers to its right.

The box plot of Cluster 1 shows the rebounds of the players in this cluster with a mean of 1.83, a median of 1.8, and a standard deviation of 0.86. The overall shape of this cluster's box plot is positively skewed, with an upper whisker slightly longer than its lower whisker and many outliers to its right.

The box plot of Cluster 2 shows the rebounds of the players in this cluster with a mean of 8.80, a median of 8.6, and a standard deviation of 2.12. The overall shape of this cluster's box plot is positively skewed, with an upper whisker slightly longer than its lower whisker and many outliers to its right.

The box plot of Cluster 3 shows the rebounds of the players in this cluster with a mean of 4.89, a median of 4.7, and a standard deviation of 1.33. The overall shape of this cluster's box plot is nearly symmetrical, with many outliers to its right.

The box plot of Cluster 4 shows the rebounds of the players in this cluster with a mean of 4.09, a median of 3.8, and a standard deviation of 1.38. The overall shape of this cluster's box plot is nearly symmetrical, with one outlier to its left and many outliers to its right.

#### Player Assists EDA per Cluster

Shown below are the numerical summaries of the assists of the NBA players per cluster, namely its mean, median, and standard deviation.

In [None]:
clusters_df.agg({"ast": ["mean", "median", "std"]}).round(2)

We also displayed the box plot of each cluster to see how the data is distributed within each cluster.

In [None]:
# Set the figure options before the figure is created so that they take effect
plt.rcParams["figure.figsize"] = [15, 6]
plt.rcParams["figure.autolayout"] = True
plt.title("Box Plot of Player Assists")
plt.ylabel("Player Assists")

for i, grp in clusters_df:
    plt.boxplot(x='ast', data=grp, positions=[i])

plt.xticks([0, 1, 2, 3, 4], ['Cluster 0', 'Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4',])
plt.show()

The box plot of Cluster 0 shows the assists of the players in this cluster with a mean of 0.49, a median of 0.4, and a standard deviation of 0.37. The overall shape of this cluster's box plot is positively skewed, with an upper whisker longer than its lower whisker and a few outliers to its right.

The box plot of Cluster 1 shows the assists of the players in this cluster with a mean of 2.07, a median of 1.9, and a standard deviation of 1.07. The overall shape of this cluster's box plot is positively skewed, with an upper whisker slightly longer than its lower whisker and outliers to its right.

The box plot of Cluster 2 shows the assists of the players in this cluster with a mean of 2.34, a median of 2.1, and a standard deviation of 1.15. The overall shape of this cluster's box plot is positively skewed, with an upper whisker slightly longer than its lower whisker and many outliers to its right.

The box plot of Cluster 3 shows the assists of the players in this cluster with a mean of 1.36, a median of 1.2, and a standard deviation of 0.72. The overall shape of this cluster's box plot is positively skewed, with an upper whisker longer than its lower whisker and many outliers to its right.

The box plot of Cluster 4 shows the assists of the players in this cluster with a mean of 5.56, a median of 5.3, and a standard deviation of 1.88. The overall shape of this cluster's box plot is positively skewed, with an upper whisker slightly longer than its lower whisker and many outliers to its right.

# Statistical Inference

Using the elbow method to determine the number of clusters in the dataset, 5 distinct clusters were found. In this section of the notebook, the group will explore similarities found among clusters within the dataset, whether in the cluster shapes or in certain values.

Given the similarities in shape of clusters 0, 2, and 3, and clusters 1 and 4, the group further explored how similar these clusters are to one another in terms of the player height, points, rebounds and assists.

Aside from this, the group also compared clusters 0 and 1 due to their similarities in terms of player rebounds and compared clusters 2 and 4 due to their similarities in points scored.

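In analyzing these clusters, the group performed *t*-tests to compare two means from unpaired and independent groups. Because `equal_var = False` is passed to `ttest_ind`, these are Welch's *t*-tests, which do not assume equal variances between the two clusters. For all hypotheses made in this section of the paper, the significance level used was 5%.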

A comprehensive overview of all tested hypotheses along with their p-values and corresponding results can be found below.
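
Since every comparison in this section follows the same steps, the procedure can be sketched as a single helper that runs the test for a list of cluster pairs and tabulates the outcome. The `welch_overview` function and the randomly generated `toy` data below are hypothetical; they stand in for `label_values_df` and are not the group's actual results:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical labelled data standing in for label_values_df
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "label": np.repeat([0, 1, 2], 200),
    "player_height": np.concatenate([
        rng.normal(205, 6, 200),  # cluster 0
        rng.normal(190, 5, 200),  # cluster 1
        rng.normal(205, 6, 200),  # cluster 2, drawn like cluster 0
    ]),
})

def welch_overview(data, column, pairs, alpha=0.05):
    """Run Welch's t-test for each pair of clusters and tabulate the results."""
    rows = []
    for a, b in pairs:
        sample_a = data.loc[data["label"] == a, column]
        sample_b = data.loc[data["label"] == b, column]
        _, pvalue = ttest_ind(sample_a, sample_b, equal_var=False)
        rows.append({"clusters": f"{a} vs {b}", "p_value": pvalue,
                     "reject_H0": bool(pvalue < alpha)})
    return pd.DataFrame(rows)

pairs = list(combinations([0, 1, 2], 2))
overview = welch_overview(toy, "player_height", pairs)
print(overview)
```

With the notebook's data, the call would take `label_values_df`, a column such as `"player_height"`, and the cluster pairs compared in each subsection.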

## Player Height

### Cluster 0 and 1

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the player heights is 0.

$H_A$ (alternative hypothesis): The true difference of the player heights is not 0.

In [None]:
from scipy.stats import ttest_ind

tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["player_height"],
          label_values_df[label_values_df["label"] == 1]["player_height"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean player heights of Cluster 0 and Cluster 1 are statistically different under a 5% significance level.***

### Cluster 1 and 4

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the player heights is 0.

$H_A$ (alternative hypothesis): The true difference of the player heights is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 1]["player_height"],
          label_values_df[label_values_df["label"] == 4]["player_height"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean player heights of Cluster 1 and Cluster 4 are statistically different under a 5% significance level.***

### Cluster 2 and 4

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the player heights is 0.

$H_A$ (alternative hypothesis): The true difference of the player heights is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 2]["player_height"],
          label_values_df[label_values_df["label"] == 4]["player_height"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean player heights of Cluster 2 and Cluster 4 are statistically different under a 5% significance level.***

### Cluster 0 and 2

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the player heights is 0.

$H_A$ (alternative hypothesis): The true difference of the player heights is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["player_height"],
          label_values_df[label_values_df["label"] == 2]["player_height"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean player heights of Cluster 0 and Cluster 2 are statistically different under a 5% significance level.***

### Cluster 2 and 3

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the player heights is 0.

$H_A$ (alternative hypothesis): The true difference of the player heights is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 2]["player_height"],
          label_values_df[label_values_df["label"] == 3]["player_height"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean player heights of Cluster 2 and Cluster 3 are statistically different under a 5% significance level.***

### Cluster 0 and 3

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the player heights is 0.

$H_A$ (alternative hypothesis): The true difference of the player heights is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["player_height"],
          label_values_df[label_values_df["label"] == 3]["player_height"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean player heights of Cluster 0 and Cluster 3 are statistically different under a 5% significance level.***

## Points

### Cluster 0 and 1

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the points is 0.

$H_A$ (alternative hypothesis): The true difference of the points is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["pts"],
          label_values_df[label_values_df["label"] == 1]["pts"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean points of Cluster 0 and Cluster 1 are statistically different under a 5% significance level.***

### Cluster 1 and 4

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the points is 0.

$H_A$ (alternative hypothesis): The true difference of the points is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 1]["pts"],
          label_values_df[label_values_df["label"] == 4]["pts"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean points of Cluster 1 and Cluster 4 are statistically different under a 5% significance level.***

### Cluster 2 and 4

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the points is 0.

$H_A$ (alternative hypothesis): The true difference of the points is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 2]["pts"],
          label_values_df[label_values_df["label"] == 4]["pts"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we cannot reject the null hypothesis: the mean points of Cluster 2 and Cluster 4 are not statistically different under a 5% significance level.***

### Cluster 0 and 2

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the points is 0.

$H_A$ (alternative hypothesis): The true difference of the points is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["pts"],
          label_values_df[label_values_df["label"] == 2]["pts"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean points of Cluster 0 and Cluster 2 are statistically different under a 5% significance level.***

### Cluster 2 and 3

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the points is 0.

$H_A$ (alternative hypothesis): The true difference of the points is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 2]["pts"],
          label_values_df[label_values_df["label"] == 3]["pts"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean points of Cluster 2 and Cluster 3 are statistically different under a 5% significance level.***

### Cluster 0 and 3

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the points is 0.

$H_A$ (alternative hypothesis): The true difference of the points is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["pts"],
          label_values_df[label_values_df["label"] == 3]["pts"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean points of Cluster 0 and Cluster 3 are statistically different under a 5% significance level.***

## Rebounds

### Cluster 0 and 1

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the rebounds is 0.

$H_A$ (alternative hypothesis): The true difference of the rebounds is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["reb"],
          label_values_df[label_values_df["label"] == 1]["reb"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean rebounds of Cluster 0 and Cluster 1 are statistically different under a 5% significance level.***

### Cluster 1 and 4

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the rebounds is 0.

$H_A$ (alternative hypothesis): The true difference of the rebounds is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 1]["reb"],
          label_values_df[label_values_df["label"] == 4]["reb"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean rebounds of Cluster 1 and Cluster 4 are statistically different under a 5% significance level.***

### Cluster 2 and 4

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the rebounds is 0.

$H_A$ (alternative hypothesis): The true difference of the rebounds is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 2]["reb"],
          label_values_df[label_values_df["label"] == 4]["reb"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean rebounds of Cluster 2 and Cluster 4 are statistically different under a 5% significance level.***

### Cluster 0 and 2

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the rebounds is 0.

$H_A$ (alternative hypothesis): The true difference of the rebounds is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["reb"],
          label_values_df[label_values_df["label"] == 2]["reb"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean rebounds of Cluster 0 and Cluster 2 are statistically different under a 5% significance level.***

### Cluster 2 and 3

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the rebounds is 0.

$H_A$ (alternative hypothesis): The true difference of the rebounds is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 2]["reb"],
          label_values_df[label_values_df["label"] == 3]["reb"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean rebounds of Cluster 2 and Cluster 3 are statistically different under a 5% significance level.***

### Cluster 0 and 3

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the rebounds is 0.

$H_A$ (alternative hypothesis): The true difference of the rebounds is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["reb"],
          label_values_df[label_values_df["label"] == 3]["reb"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean rebounds of Cluster 0 and Cluster 3 are statistically different under a 5% significance level.***

## Assists

### Cluster 0 and 1

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the assists is 0.

$H_A$ (alternative hypothesis): The true difference of the assists is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["ast"],
          label_values_df[label_values_df["label"] == 1]["ast"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean assists of Cluster 0 and Cluster 1 are statistically different under a 5% significance level.***

### Cluster 1 and 4

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the assists is 0.

$H_A$ (alternative hypothesis): The true difference of the assists is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 1]["ast"],
          label_values_df[label_values_df["label"] == 4]["ast"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean assists of Cluster 1 and Cluster 4 are statistically different under a 5% significance level.***

### Cluster 2 and 4

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the assists is 0.

$H_A$ (alternative hypothesis): The true difference of the assists is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 2]["ast"],
          label_values_df[label_values_df["label"] == 4]["ast"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean assists of Cluster 2 and Cluster 4 are statistically different under a 5% significance level.***

### Cluster 0 and 2

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the assists is 0.

$H_A$ (alternative hypothesis): The true difference of the assists is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["ast"],
          label_values_df[label_values_df["label"] == 2]["ast"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean assists of Cluster 0 and Cluster 2 are statistically different under a 5% significance level.***

### Cluster 2 and 3

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the assists is 0.

$H_A$ (alternative hypothesis): The true difference of the assists is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 2]["ast"],
          label_values_df[label_values_df["label"] == 3]["ast"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean assists of Cluster 2 and Cluster 3 are statistically different under a 5% significance level.***

### Cluster 0 and 3

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference of the assists is 0.

$H_A$ (alternative hypothesis): The true difference of the assists is not 0.

In [None]:
tstat, pvalue = ttest_ind(label_values_df[label_values_df["label"] == 0]["ast"],
          label_values_df[label_values_df["label"] == 3]["ast"],
          equal_var = False)

In [None]:
print('Computed P-Value: ', pvalue)

In [None]:
if pvalue < 0.05:
    print('Null hypothesis is rejected, there is a statistical difference between the two means.')
else:
    print('We cannot reject the null hypothesis, there is no statistical significance with the difference.')

***Based on these results, we can conclude that the mean assists of Cluster 0 and Cluster 3 are statistically different under a 5% significance level.***
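
The pairwise tests above all follow the same pattern, so they could also be run in one loop, which avoids copy-paste slips in the cluster labels. The following is a minimal sketch, not the code used to produce the results above: the helper name `welch_ttest_table` is hypothetical, and a small synthetic stand-in for `label_values_df` is generated so the cell runs on its own. In the notebook, the real `label_values_df` would be passed in instead.

```python
from itertools import product

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in for label_values_df (made-up values, NOT the real data).
rng = np.random.default_rng(0)
n = 500
label_values_df = pd.DataFrame({
    "label": rng.integers(0, 5, size=n),
    "player_height": rng.normal(200, 9, size=n),
    "pts": rng.normal(8, 5, size=n),
    "reb": rng.normal(3.5, 2, size=n),
    "ast": rng.normal(1.8, 1.5, size=n),
})


def welch_ttest_table(df, pairs, features, alpha=0.05):
    """Run Welch's t-test for every (cluster pair, feature) combination."""
    rows = []
    for (a, b), feature in product(pairs, features):
        group_a = df.loc[df["label"] == a, feature]
        group_b = df.loc[df["label"] == b, feature]
        # equal_var=False selects Welch's t-test, as in the cells above.
        tstat, pvalue = ttest_ind(group_a, group_b, equal_var=False)
        rows.append({
            "cluster_a": a,
            "cluster_b": b,
            "feature": feature,
            "tstat": tstat,
            "pvalue": pvalue,
            "reject_h0": pvalue < alpha,
        })
    return pd.DataFrame(rows)


# The six cluster pairs and the four features tested in the sections above.
pairs = [(0, 1), (1, 4), (2, 4), (0, 2), (2, 3), (0, 3)]
features = ["player_height", "pts", "reb", "ast"]

summary = welch_ttest_table(label_values_df, pairs, features)
print(summary)
```

Each row of `summary` corresponds to one of the subsections above, and the `reject_h0` column mirrors the `if pvalue < 0.05` check.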

# Insights and Conclusions

## Number of Clusters

Based on the elbow method, the group chose k = 5 for k-means clustering. This produced five clusters, each with its own distinct profile.
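
As a reminder of how the elbow method works, the sketch below fits k-means for a range of k values and records the inertia (within-cluster sum of squares) for each. It assumes scikit-learn's `KMeans` and uses synthetic data with five blobs; the array `X` and the range of k are illustrative assumptions, not the notebook's actual features or settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: five blobs in a 4-dimensional feature space (illustrative only).
rng = np.random.default_rng(0)
centers = rng.uniform(-10, 10, size=(5, 4))
X = np.vstack([c + rng.normal(0, 1, size=(100, 4)) for c in centers])

# Fit k-means for k = 1..10 and record the inertia for each k.
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The "elbow" is the value of k after which the inertia stops dropping sharply; plotting `inertias` against k makes it easier to spot.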

## Cluster Descriptions


### Cluster 0

This cluster is characterized by tall players who do not get many points, rebounds, or assists. Cluster 0 has an above-average player height, but that is the only feature in which it stands out; its other statistics are lackluster. For these reasons, this cluster can be labeled as the **Tall Benchwarmers**.

### Cluster 1

This cluster is characterized by short players who do a bit of everything: scoring points, grabbing rebounds, and dishing out assists. The cluster has moderate to low statistics because its players generally fulfill a specific role on a team, such as three-point shooter, off-the-bench playmaker, or defensive specialist, to name a few. The heights in this cluster seldom exceed 200 cm. For these reasons, this cluster can be labeled as the **Short Role Players**.

### Cluster 2

This cluster is characterized by tall players who are great scorers and rebounders. They do get a few assists, but that is not the focus of their game. This cluster has the highest average player height as well as the highest average rebounds. They also score a high number of points, second only to Cluster 4. These statistics indicate above-average players in the NBA. For these reasons, this cluster can be labeled as the **Tall Stars**.

### Cluster 3

This cluster is characterized by tall players who get a moderate number of points and rebounds, but a low number of assists. This cluster has the second-highest average player height and rebounds, behind Cluster 2. Although their points and rebounds are not as high as those of Cluster 2 (**Tall Stars**), these players are still important to a team because they fulfill a specific role, such as three-point shooter, rebounder, or defensive specialist, to name a few. The heights in this cluster seldom fall below 200 cm. For these reasons, this cluster can be labeled as the **Tall Role Players**.

### Cluster 4

This cluster is characterized by players of moderate height who score a high number of points and give a lot of assists. Rebounding is less consistent: not every player in this cluster has a high number of rebounds, but those who do also have a lot of assists. For these reasons, this cluster can be labeled as the **All-Around Stars**.
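
The descriptions above rest on per-cluster averages. A table like this can be produced with a `groupby`, as in the minimal sketch below. The tiny DataFrame is made-up illustrative data standing in for `label_values_df`, and the variable names `cluster_means` and `ranks` are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for label_values_df (three clusters, two rows each).
label_values_df = pd.DataFrame({
    "label":         [0, 0, 1, 1, 2, 2],
    "player_height": [210.0, 208.0, 190.0, 192.0, 213.0, 215.0],
    "pts":           [3.0, 4.0, 6.0, 7.0, 20.0, 22.0],
    "reb":           [2.0, 2.5, 2.0, 3.0, 10.0, 11.0],
    "ast":           [0.5, 0.6, 2.0, 2.5, 2.0, 3.0],
})

features = ["player_height", "pts", "reb", "ast"]

# Mean of every feature within each cluster.
cluster_means = label_values_df.groupby("label")[features].mean()

# Rank the clusters per feature (1 = highest mean) to support statements
# such as "highest average player height".
ranks = cluster_means.rank(ascending=False, method="min").astype(int)

print(cluster_means)
print(ranks)
```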

## Cluster Similarity Analysis

### Clusters with Similar Shapes Based on the Polar Chart

Based on the shapes of the clusters in the polar chart, the following clusters were paired for comparison:
- Cluster 0 and Cluster 2
- Cluster 2 and Cluster 3
- Cluster 0 and Cluster 3
- Cluster 1 and Cluster 4
    
Clusters 0, 2, and 3 were paired with each other as they form an inverted kite shape with a small left side (*feature: assists*).
<br>Clusters 1 and 4 were paired with each other as they form an inverted kite shape with somewhat even sides.

The feature means of each cluster pair were compared using a t-test. For all cluster pairs, all the compared means are statistically different under a 5% significance level.

From this, it can be concluded that the paired clusters are statistically different from one another.

### Clusters with Similar Features Based on the Polar Chart

Based on the clusters in the polar chart whose features are close to each other, the following clusters were paired for comparison:

- Cluster 0 and Cluster 1
- Cluster 2 and Cluster 4

Clusters 0 and 1 were paired with each other as their average rebounds appear to be close.
<br>Clusters 2 and 4 were paired with each other as their average points appear to be close.

The feature means of each cluster pair were compared using a t-test. For all cluster pairs, all the compared means except one are statistically different under a 5% significance level.

The only exception was the mean points of Cluster 2 and Cluster 4, for which the null hypothesis could not be rejected: the two means are not statistically different under a 5% significance level.

We can conclude that the paired clusters are mostly statistically different from one another under a 5% significance level. Cluster 2 and Cluster 4 were found to have similar mean points, but only the mean points.

## In-Depth Cluster Analysis

The results are quite interesting and, on closer analysis, realistic. Cluster 0 (**Tall Benchwarmers**) was found to be the largest cluster, comprising 31.99% of the total data set. An interesting question is why there is a label for **Tall Benchwarmers** but not **Short Benchwarmers**. A possible explanation is that teams prioritize drafting tall players over short players. Given a choice between two players of similar skill, a team would opt for whoever is taller, because no matter how much a player trains, they cannot train to be taller. Although **Short Benchwarmers** do exist, they are not abundant enough to form their own cluster and can be found in the **Short Role Players** cluster.
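
A cluster's share of the data set, like the percentage quoted above, can be computed from the cluster labels with `value_counts(normalize=True)`. The sketch below uses a made-up ten-row `label` column, so its percentages are illustrative and do not match the real figure.

```python
import pandas as pd

# Made-up cluster labels standing in for label_values_df["label"].
label_values_df = pd.DataFrame({"label": [0, 0, 0, 1, 1, 2, 3, 3, 4, 4]})

# Share of rows per cluster, as a percentage rounded to two decimals.
cluster_share = (
    label_values_df["label"]
    .value_counts(normalize=True)
    .sort_index()
    * 100
).round(2)

# Label of the largest cluster.
largest = cluster_share.idxmax()

print(cluster_share)
print("Largest cluster:", largest)
```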

Another interesting result is that only one comparison showed no statistically significant difference: the mean points of Cluster 2 (**Tall Stars**) and Cluster 4 (**All-Around Stars**). A possible reason is that players in these two clusters are usually their team's star player. A team's star player is usually the first option for scoring on offense. Because **Tall Stars** and **All-Around Stars** are both priority offensive options, this may explain why their mean points are similar.

## Player Examples per Cluster

It is worth noting that the names given may appear in more than one cluster because players can play multiple seasons.
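
One way to check which players fall into more than one cluster is to count the distinct labels per player name. The sketch below assumes that a `player_name` column is kept alongside `label` (one row per player-season). The rows and the placeholder names are made up for illustration.

```python
import pandas as pd

# Made-up player-season rows; "Player A" changes cluster between seasons.
label_values_df = pd.DataFrame({
    "player_name": ["Player A", "Player A", "Player A",
                    "Player B", "Player B", "Player C"],
    "label":       [1, 4, 4, 0, 0, 3],
})

# Number of distinct clusters each player appears in across seasons.
n_clusters = label_values_df.groupby("player_name")["label"].nunique()

# Players assigned to more than one cluster.
multi_cluster_players = n_clusters[n_clusters > 1]

print(multi_cluster_players)
```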

### Cluster 0
**Tall Benchwarmers**
- Bol Bol
- Brian Scalabrine
- Tacko Fall

### Cluster 1
**Short Role Players**
- Alex Caruso
- Muggsy Bogues
- Patrick Beverley

### Cluster 2
**Tall Stars**
- Shaquille O'Neal
- Dennis Rodman
- Yao Ming

### Cluster 3
**Tall Role Players**
- JaVale McGee
- Nick Young
- Nicolas Batum

### Cluster 4
**All-Around Stars**
- Stephen Curry
- LeBron James
- Michael Jordan

# Bibliography

Akoglu, H. (2018). User's Guide to Correlation Coefficients. *Turkish Journal of Emergency Medicine, 18*(3), 91–93. https://doi.org/10.1016/j.tjem.2018.08.001

Cirtautas, J. (2022, August 6). *NBA players*. Kaggle. Retrieved November 20, 2022, from https://www.kaggle.com/datasets/justinas/nba-players-data 