# WNBA Analysis

This project uses WNBA draft data sourced from [Kaggle](https://www.kaggle.com/datasets/mattop/wnba-draft-basketball-player-data-1997-2021). The notebook shows how to read the data, filter it for useful statistics, and plot visualizations.

In this notebook we will:
- Find the top leaders for major statistics in the WNBA
- Visualize top scorers for each team
- Visualize pick, year, and scoring output
  
We will start by importing the libraries needed to run this project.

In [None]:
import plotly.express as px
import pandas as pd
print("Libraries imported")

## Obtaining Data

Let's begin by loading the `Kaggle` dataset and looking at its columns. We'll also use a *function* to turn the raw column names into readable titles by splitting them on underscores and capitalizing each word.

In programming, a function is a named set of instructions that performs a specific task. It's like a mini-program within a larger program. A function essentially takes inputs, performs some actions or calculations, and produces an output.

In [None]:
wnba_df = pd.read_csv(r"https://raw.githubusercontent.com/callysto/basketball-and-data-science/main/content/examples/wnbadraft.csv")

def convert_column_name(column):
    words = column.split('_')  # Split the column name by underscores
    words = [word.capitalize() for word in words]  # Capitalize each word
    return ' '.join(words)  # Join the words with spaces

# Rename the columns in the dataframe
wnba_df = wnba_df.rename(columns=convert_column_name)
display(wnba_df)
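To see what `convert_column_name` does on its own, here is a minimal sketch that applies it to a small hand-made DataFrame. The raw column names (`overall_pick`, `win_shares_40`) and the values are assumptions for illustration; they are not read from the actual CSV.

```python
import pandas as pd

def convert_column_name(column):
    words = column.split('_')  # Split the column name by underscores
    words = [word.capitalize() for word in words]  # Capitalize each word
    return ' '.join(words)  # Join the words with spaces

# Hypothetical raw column names and made-up values, for illustration only
sample_df = pd.DataFrame({"overall_pick": [1, 2], "win_shares_40": [0.15, 0.10]})

# rename() calls the function once per column name
renamed_df = sample_df.rename(columns=convert_column_name)
print(list(renamed_df.columns))
```

Note that `str.capitalize` also lowercases the rest of each word, so an all-caps word such as `WNBA` would come out as `Wnba`.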

| Column            | Description                                               |
|-------------------|---------------------------------------------------------- |
| Overall Pick      | The overall pick number of the player in the WNBA draft.  |
| Year              | The year in which the player was drafted.                 |
| Team              | The team for which the player was drafted.                |
| Player            | The name of the player.                                   |
| Former            | The player's former team or college.                      |
| College           | The college attended by the player.                       |
| Years Played      | The number of years the player has played in the WNBA.    |
| Games             | The total number of games played by the player.           |
| Win Shares        | A metric indicating the player's overall contribution to team wins.|
| Win Shares 40     | The player's win shares per 40 minutes played.            |
| Minutes Played    | The total number of minutes played by the player.         |
| Points            | The total number of points scored by the player.          |
| Total Rebounds    | The total number of rebounds grabbed by the player.       |
| Assists           | The total number of assists made by the player.           |

## Finding Top Performers

Now that we know which columns are in the dataset, let's identify the players who lead major basketball statistics such as `Points` and `Total Rebounds`.

In [None]:
# Create a dictionary to store the column names and their corresponding max player and value
max_players = {}

# Iterate over each column, skipping identifying columns, Overall Pick, and Minutes Played
for column in wnba_df.columns:
    if column not in ['Player', 'Year', 'Team', 'Former', 'College', 'Overall Pick', 'Minutes Played']:
        max_player = wnba_df[column].idxmax()  # Find the index of the max value for the column
        max_value = wnba_df.loc[max_player, column]  # Get the max value
        max_players[column] = {'Player': wnba_df.loc[max_player, 'Player'], 'Value': max_value}

# Print the player with the highest value and their corresponding statistic for each column
for stat, player_info in max_players.items():
    player = player_info['Player']
    value = player_info['Value']
    print(f"The player with the highest {stat} is {player} with a value of {value}.")

Looking at the output, one player appears more than once: **Sue Bird**. Many would consider Sue Bird to be the greatest to ever play in the WNBA. She holds the record for All-Star appearances and has won four championships. She is also the only WNBA player to win championships in three different decades.
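The same per-column lookup can also be written without an explicit loop. The following is a minimal sketch on a made-up table; the player names and numbers are placeholders, not values from the WNBA dataset, and selecting columns by numeric dtype is an assumption that may need adjusting for the real data.

```python
import pandas as pd

# Made-up players and statistics, for illustration only
toy_df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Points": [1200, 3400, 2100],
    "Assists": [800, 300, 950],
})

# Keep only the numeric columns, then find the row label of each column's maximum
stat_columns = toy_df.select_dtypes("number").columns
leader_rows = toy_df[stat_columns].idxmax()

# Build one summary row per statistic: who leads it, and with what value
leaders = pd.DataFrame(
    {
        "Player": toy_df.loc[leader_rows, "Player"].to_numpy(),
        "Value": toy_df[stat_columns].max().to_numpy(),
    },
    index=stat_columns,
)
print(leaders)
```

On the real data, columns such as `Year` and `Overall Pick` are numeric as well, so they would still need to be excluded by name.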

## Visualizing Results

Let's also find the top performers on each team (by `Points`) and see which teams stand out as the highest- and lowest-scoring.

In [None]:
# Create a dataframe with the top three scorers for each team
# This is done by first sorting the dataframe by "Team" and "Points" in descending order and then grouping by "Team"
# head(3) then returns the first 3 rows of each group, i.e. each team's three highest scorers
team_point_leaders = wnba_df.sort_values(['Team', 'Points'], ascending=False).groupby('Team').head(3)

# Reset index sets a new index for this dataframe, drop=True removes the old index
team_point_leaders = team_point_leaders.reset_index(drop=True)
display(team_point_leaders)

In [None]:
# Create a bar chart of the top three scorers for each team
team_fig = px.bar(team_point_leaders, x='Team', y='Points', hover_data=['Player'], color='Team')
# Dark theme
team_fig.update_layout(template = "plotly_dark").show()

Looking at the visualization, the *Seattle Storm* and *Phoenix Mercury* stand out as two of the highest-scoring teams, while the *Miami Sol* is the lowest-scoring team. One observation is that the highest-scoring team, the Seattle Storm, also has the highest-scoring individual in the WNBA dataset **(Breanna Stewart)**, which doesn't come as a surprise.
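Because `px.bar` stacks bars that share the same `x` value by default, each team's bar height is the combined points of its top three scorers. The sketch below reproduces that total numerically so teams can be ranked without reading the chart; the team names, players, and point values are made up for illustration.

```python
import pandas as pd

# Made-up teams and scorers, for illustration only
toy_df = pd.DataFrame({
    "Team": ["Team X"] * 4 + ["Team Y"] * 4,
    "Player": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Points": [500, 300, 200, 100, 900, 50, 40, 10],
})

# Same pattern as above: sort descending, then keep the first 3 rows of each team
top_three = (
    toy_df.sort_values(["Team", "Points"], ascending=False)
    .groupby("Team")
    .head(3)
)

# Sum the top-three points per team and rank teams from highest to lowest
team_totals = top_three.groupby("Team")["Points"].sum().sort_values(ascending=False)
print(team_totals)
```

In this toy data, Team Y has the single highest scorer but Team X has the higher top-three total, which shows that the tallest bar and the top individual scorer need not belong to the same team.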

## Finding Correlations

Using other basketball metrics, let's also see if there is a correlation between `Win Shares` and a WNBA player's `Points` or `Assists`.

In [None]:
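# Scatter plot of Win Shares against Points, one colour per player
px.scatter(wnba_df, x='Win Shares', y='Points', color='Player').show()
# Scatter plot of Win Shares against Assists, one colour per player
px.scatter(wnba_df, x='Win Shares', y='Assists', color='Player').show()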

In general, it appears that the more points and assists a player records, the more they contribute to their team's ability to win.

One possible reason for a *correlation* between points/assists and win shares is that players who score and distribute the ball effectively tend to have a *positive* impact on their team's success. By scoring *points*, players put pressure on the opposing defense and contribute directly to the team's offensive output. *Assists*, on the other hand, indicate a player's ability to create opportunities for their teammates, leading to efficient scoring and better team cohesion.
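The visual impression can be backed with a number. A minimal sketch, assuming the Pearson correlation coefficient is an acceptable measure here, uses `DataFrame.corr` on made-up values; the figures below are placeholders, not results from the WNBA dataset.

```python
import pandas as pd

# Made-up career totals, for illustration only
toy_df = pd.DataFrame({
    "Win Shares": [0.5, 2.0, 10.0, 25.0, 40.0],
    "Points": [50, 400, 1500, 3900, 6000],
    "Assists": [10, 120, 300, 1100, 900],
})

# corr() computes pairwise Pearson correlation coefficients by default
corr_matrix = toy_df.corr()

# Correlation of Win Shares with each of the other two statistics
print(corr_matrix.loc["Win Shares", ["Points", "Assists"]])
```

On the real data, the equivalent call would be `wnba_df[["Win Shares", "Points", "Assists"]].corr()`. Since these columns are career totals, part of any correlation likely reflects time spent on the court rather than skill alone.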

## Visualizing in 3 Dimensions

Lastly, let's visualize in three dimensions! A three-dimensional plot lets us look at the relationship between three variables at once. As a result, we can see whether a combination of variables leads to a different result than looking at only two.

In [None]:
# Create a 3d fig with x being the Year, y being the Overall Pick and z being the Points
fig_3d = px.scatter_3d(wnba_df, x = "Year", y = "Overall Pick", z = "Points",
                    hover_data = ["Player"],
                    color = "Overall Pick", color_continuous_scale = "Sunset")

# Update the marker size and make the symbol "circle-open" for ease of viewing
fig_3d.update_traces(marker = dict(size = 3, symbol = "circle-open"))

# Dark theme
fig_3d.update_layout(template = "plotly_dark").show()

## Conclusion

In this project we used data from [Kaggle](https://www.kaggle.com/datasets/mattop/wnba-draft-basketball-player-data-1997-2021) to explore the statistical output of players drafted into the WNBA.

In the future, it would be interesting to explore other correlations in two or three dimensions using different columns, such as `Win Shares 40`.