# Lecture 13 – Visualizing Two Numerical Variables

### Spark 010, Spring 2024

In [None]:
import pandas as pd
import numpy as np
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)
pd.options.display.width = 0
pd.options.display.max_colwidth = 100

Our first dataset today comes from [Basketball Reference](https://www.basketball-reference.com/leagues/NBA_2020_per_game.html). It contains per-game averages of statistics for players in the 2019-2020 NBA season.

Run the cell below to load it in, select the relevant columns, and do some data cleaning.

**Note:** Per-game averages are only reliable for players who played enough, so we will only look at players who averaged more than 10 minutes per game in the season. This cutoff isn't perfect, since there were plenty of good players who averaged fewer than 10 minutes per game.
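
The filtering step can be sketched on a tiny made-up table. The player names and numbers below are invented for illustration and are not taken from the dataset. The idea is that a comparison such as `toy['MP'] > 10` produces a Series of `True`/`False` values, and indexing with that Series keeps only the `True` rows.

```python
import pandas as pd

# Made-up rows for illustration only (not real player data)
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'MP': [8.5, 10.0, 24.3, 31.7],
})

mask = toy['MP'] > 10   # boolean Series: one True/False per row
kept = toy[mask]        # keeps only the rows where the mask is True

print(mask.tolist())
print(kept['Player'].tolist())
```

Note that the row with exactly 10.0 minutes is dropped, because `>` is a strict comparison.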

In [None]:
NBA_2019_2020_PlayerData = pd.read_csv('data/NBAPlayerStats_1920.csv')
nba = NBA_2019_2020_PlayerData

In [None]:
# Keep only the part of each entry that comes before the first backslash
nba['Player'] = nba['Player'].apply(lambda x: x.split("\\")[0])

In [None]:
Filter = nba['MP']>10
nba = nba[Filter]
# take only the requisite columns
nba = nba.loc[:,['Player','Pos','Age','MP','Tm','PTS','TRB','AST','3PA','3P%']]

In [None]:
nba.head()

A description of each column:

- `'Player'`: name
- `'Pos'`: position (for example `'C'`, `'PF'`, `'SF'`, `'SG'`, or `'PG'`; a few players are listed with a combination of positions)
- `'Age'`: age of the player
- `'MP'`: average minutes played per game
- `'Tm'`: abbreviated team
- `'PTS'`: average number of points scored per game
- `'TRB'`: average number of rebounds per game (a player receives a rebound when they grab the ball after someone misses)
- `'AST'`: average number of assists per game (a player receives an assist when they pass the ball to someone who then scores)
- `'3PA'`: average number of three-point shots attempted per game (a three-point shot is one taken from behind a certain line, which is between 22 and 24 feet from the basket)
- `'3P%'`: average proportion of three-point shots that go in

## Review – Bar Charts and Histograms

### Bar Charts

We can use the code below to generate average statistics for each basketball position. Don't worry about understanding `.groupby()` yet; we'll get to that soon.

In [None]:
stats_by_pos = nba.groupby('Pos')[['PTS', 'TRB', 'AST']].mean()
stats_by_pos
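
For the curious, here is a minimal sketch of what the grouped average computes, using a made-up table (the values are invented for illustration). Rows are split by `'Pos'`, and each selected column is averaged within each group.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Pos': ['C', 'PG', 'C', 'PG'],
    'PTS': [10.0, 20.0, 14.0, 24.0],
    'TRB': [9.0, 3.0, 11.0, 5.0],
})

# Split the rows by position, then average each selected column within each group
toy_means = toy.groupby('Pos')[['PTS', 'TRB']].mean()
print(toy_means)
```

The result has one row per position, with the position labels as the index, which is why `.loc` can later select positions by name.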

Now we can visualize this data by creating a bar chart. Since some players are listed with combination positions, let's focus on the five most common positions found in a starting lineup.

In [None]:
stats_by_pos = stats_by_pos.loc[['C','PF','SF','SG','PG']]
stats_by_pos.plot.barh() # Create a bar chart of mean statistics by position

### Histograms

Recall that histograms allow us to see the distribution (or frequencies) of values for a numerical variable. For example, we can visualize the distribution of points in the NBA.

In [None]:
nba['PTS'].plot.hist(density = False, bins = np.arange(10, 40, 2.5)) # Create a histogram showing the distribution of points
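
As a rough sketch of the counting behind a histogram, `np.histogram` can be applied to a few made-up values with the same bin edges. The values below are invented for illustration.

```python
import numpy as np

# Made-up scoring averages for illustration only
values = np.array([8.0, 10.0, 11.0, 12.5, 14.9, 26.0])
edges = np.arange(10, 40, 2.5)   # bin edges: 10, 12.5, 15, ..., 37.5

counts, _ = np.histogram(values, bins=edges)
print(counts)        # one count per bin
print(counts.sum())  # can be smaller than len(values) if values fall outside the edges
```

Here the value 8.0 lies below the first edge, so only five of the six values are counted. With bins that start at 10, any player averaging fewer than 10 points would likewise be left out of the plot.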

We can also use `df.groupby` to plot distributions of a numerical variable by category (e.g. one histogram per position).

In [None]:
# Create a histogram showing the distribution of rebounds grouped by position
NBA = nba[nba['Pos'].isin(['C','PF','SF','SG','PG'])]
plot = NBA.groupby('Pos')['TRB'].plot.hist(density = False, bins = np.arange(17),
         xlabel = 'Rebounds',
         title = 'Distribution of Rebounds by position', alpha = 0.5, legend=True)

## Scatter Plots

Scatter plots allow us to visualize and investigate relationships between two numerical variables. To start out, we're going to create an example table with some fake data for our variables `x` and `y`.

In [None]:
example_data = pd.DataFrame([[1,-1],[4,2],[4,8],[3,0],[6,1]], columns = ['x','y'])
example_data

Instead of looking at the data in a table, we can put it in a scatter plot using `df.plot.scatter()`.

In [None]:
example_data.plot.scatter(x = 'x', y = 'y', s = 50) # Create a scatter plot of y vs. x in `example_data`

### Example 1

Returning to our NBA data, we can explore the relationships between different statistics. For example, what is the relationship between the number of points scored by a player and the number of assists made by a player?

In [None]:
nba.plot.scatter('PTS', 'AST') # Create a scatter plot showing assists vs. points

Observation: On average, players who score more points per game also record more assists per game.

### Quick Check 1

Fill in the blanks to create a scatter plot showing Three-Point Attempts (`"3PA"`) vs. Rebounds (`"TRB"`) for **small forwards** in the `nba` table.

In [None]:
Filter = nba['Pos'] == ...
Forwards = nba[Filter]
Forwards.plot.scatter( ... , ... ,
                      xlabel = 'Rebounds Per Game (TRB)',
                      ylabel = 'Three-Point Attempts Per Game (3PA)',
                      figsize = (8,5))


Observation: On average, small forwards who grab more rebounds per game attempt fewer three-point shots per game.

## More customization

We can customize our plots even further by specifying optional arguments. 

### Point Sizes (`s` and `size`)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots(figsize=(7, 3))
Plot = sns.scatterplot(data = nba, x = 'PTS',y = '3P%', s=5, ax = ax)

In the following plot, a bigger circle corresponds to a player who attempts more three-point shots per game on average.

In [None]:
fig, ax = plt.subplots(figsize=(7, 3))
Plot = sns.scatterplot(data = nba, x = 'PTS',y = '3P%', size = '3PA', sizes = (5,45), ax = ax)
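
A rough sketch of the idea behind `size` and `sizes`: each value in the `size` column is mapped to a marker size between the two numbers given in `sizes`. The helper `rescale_sizes` below is hypothetical and written only for illustration. It assumes a simple linear mapping from the smallest value to the largest; seaborn's actual implementation may differ in its details.

```python
import numpy as np

def rescale_sizes(values, sizes=(5, 45)):
    """Hypothetical helper: linearly map values onto the range given by sizes."""
    values = np.asarray(values, dtype=float)
    lo, hi = sizes
    span = values.max() - values.min()
    if span == 0:
        # All values equal: using the midpoint is an arbitrary choice for this sketch
        return np.full(values.shape, (lo + hi) / 2)
    return lo + (values - values.min()) / span * (hi - lo)

# Made-up three-point attempt averages
marker_sizes = rescale_sizes([0.0, 2.5, 5.0, 10.0])
print(marker_sizes)
```

Under this assumption, the smallest value gets the smallest marker (5) and the largest value gets the largest marker (45).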

### Point Color by Grouping (`hue`)

In [None]:
sns.set_style("darkgrid")

In [None]:
fig, ax = plt.subplots(figsize=(7, 3))
# look only at centers or shooting guards
NBA = nba[nba['Pos'].isin(['C','SG'])]
Plot = sns.scatterplot(data = NBA, x = 'TRB',y = '3PA', hue = 'Pos', s = 10, ax = ax)
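
One way to think about `hue` is as drawing one scatter plot per group on the same axes. The sketch below does this by hand with matplotlib on a made-up table (the values are invented for illustration); it is an approximation of the idea, not seaborn's own code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Pos': ['C', 'C', 'SG', 'SG'],
    'TRB': [10.1, 8.4, 3.2, 4.0],
    '3PA': [0.5, 1.2, 6.3, 5.1],
})

fig, ax = plt.subplots(figsize=(7, 3))
# One scatter call per position; each call gets the next default color
for pos, group in toy.groupby('Pos'):
    ax.scatter(group['TRB'], group['3PA'], s=10, label=pos)
ax.set_xlabel('TRB')
ax.set_ylabel('3PA')
ax.legend(title='Pos')
```

Passing `hue = 'Pos'` to `sns.scatterplot` achieves a similar result in a single call.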

### Labels

In [None]:
# Keep only the players who average more than 25 points per game
df = nba[nba['PTS'] > 25]
plt.figure(figsize=(8,5))
sns.scatterplot(data=df,x='PTS',y='AST')
for i in range(len(df)):
    plt.text(x=df.PTS.iloc[i]-0.6,y=df.AST.iloc[i]+0.15,s=df.Player.iloc[i], 
             fontdict=dict(color='black',size=10))
plt.show()

In [None]:
# Keep only the players who average more than 25 points per game
df = nba[nba['PTS'] > 25]
plt.figure(figsize=(8,5))
sns.scatterplot(data=df,x='PTS',y='AST', size = '3PA',sizes = (5,150))
for i in range(len(df)):
    plt.text(x=df.PTS.iloc[i]-0.6,y=df.AST.iloc[i]+0.15,s=df.Player.iloc[i], 
             fontdict=dict(color='black',size=10))
plt.show()

## Line Plots

Line plots are similar to scatter plots in that they visualize relationships between two numerical variables. However, they are most useful when the variable on the x-axis is sequential (like time or distance), so that connecting consecutive points with a line is meaningful.
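
Because a line plot connects the points in the order the rows appear, it helps to sort by the ordered variable first. A minimal sketch with made-up values (the column name `'Value'` is hypothetical):

```python
import pandas as pd

# Made-up values, deliberately stored out of order
toy = pd.DataFrame({
    'Season': [2003, 2001, 2004, 2002],
    'Value': [3.0, 1.0, 4.0, 2.0],
})

# Sort by the ordered variable so the line moves left to right
toy_sorted = toy.sort_values('Season')
ax = toy_sorted.plot(x='Season', y='Value')
```

Plotting `toy` without sorting would draw a line that zigzags back and forth across the x-axis.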

In [None]:
nba_yearly = pd.read_csv('data/nba-league-averages.csv')

nba_yearly = nba_yearly.loc[:,['Season', 'PTS', 'FGA', '3PA', '3P%', 'Pace']]
nba_yearly.head()

In [None]:
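# Overwrite `Season` with the second calendar year of each season: 2021 down to 1980
# (this assumes the table has exactly 42 rows, ordered from most recent to oldest)
nba_yearly['Season'] = np.arange(2021,1979,-1)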
nba_yearly.head()

Our second dataset also comes from Basketball Reference. This dataset contains team-based average statistics for each year.

A little bit about our new dataset:
- `'Season'`: the second calendar year for each season (e.g. `2018` refers to the 2017-18 season)
- `'FGA'`: the average number of field goal attempts (shot attempts) per game
- `'Pace'`: the average number of times a team had possession of the ball per game
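
The `'Season'` convention can also be derived from text labels rather than typed in by hand. The sketch below assumes labels of the form `'2017-18'`; this format is an assumption made for illustration, since the raw file's labels are not shown here.

```python
import pandas as pd

# Hypothetical season labels; the raw file's actual format may differ
labels = pd.Series(['2019-20', '2018-19', '2017-18'])

# Take the first four characters (the starting year) and add one
second_year = labels.str[:4].astype(int) + 1
print(second_year.tolist())
```

With these labels, `'2017-18'` becomes `2018`, matching the convention described above.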

### Example 1

In [None]:
nba_yearly.plot('Season', 'Pace') # Generate a line plot of `pace` by season

Observation: The league slowed down in the late 90s and early 2000s, but is speeding back up.

### Example 2

In [None]:
# Generate a line plot of three point attempts by season
nba_yearly.plot('Season', '3PA',
               ylabel = 'Three-Point Attempts (3PA)',
               title = 'Three-Point Attempts Per Season')

Observation: The three-point shot has rapidly increased in popularity over the past decade.

### Example 3

In [None]:
nba_yearly.loc[:,['Season', 'FGA', '3PA']].head()

In [None]:
# Plot both field goal attempts (FGA) and three-point attempts (3PA) by season
nba_yearly.loc[:,['Season', 'FGA', '3PA']].plot('Season') # Notice how we only supplied `plot` with a single argument

Observation: Three-point attempts have increased a lot since the 1980s, while the number of field goals (shots) attempted has stayed more or less the same.
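
One way to put a number on this kind of observation is to compute the share of all shot attempts that are three-point attempts. The sketch below uses made-up numbers (not real league averages), and the column name `'3PA_share'` is introduced here only for illustration.

```python
import pandas as pd

# Made-up numbers for illustration only (not real league averages)
toy = pd.DataFrame({
    'Season': [2000, 2010, 2020],
    'FGA': [80.0, 80.0, 80.0],
    '3PA': [8.0, 16.0, 32.0],
})

# Fraction of all shot attempts that were three-point attempts
toy['3PA_share'] = toy['3PA'] / toy['FGA']
print(toy[['Season', '3PA_share']])
```

The same division could be applied to `nba_yearly`, and the resulting column plotted against `'Season'` as another line plot.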