![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Statistics

This notebook will introduce some basic statistics with data. Statistics is the process of collecting, describing, and analyzing data. This includes things like [mean, median, and mode](https://www.youtube.com/watch?v=zOYVZtTIjBA&t=49s).
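
As a quick illustration of those three terms, here is a minimal sketch using pandas. The `points` values are made up for this example and do not come from the player data.

```python
import pandas as pd

# Made-up points scored in seven games (illustrative values only)
points = pd.Series([12, 15, 15, 18, 20, 22, 31])

mean_points = points.mean()      # sum of the values divided by how many there are
median_points = points.median()  # middle value once the values are sorted
mode_points = points.mode()[0]   # most frequent value (mode() returns a Series)

print('Mean:', mean_points)
print('Median:', median_points)
print('Mode:', mode_points)
```

Notice that the one large value (`31`) pulls the mean above the median.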

First we will load the same data we have been using about [Pascal Siakam](https://www.basketball-reference.com/players/s/siakapa01.htm).

In [None]:
import pandas as pd
data = pd.read_csv('data/nba-players/Pascal_Siakam.csv')
data

And here's a reminder of the data glossary from [basketball-reference.com](https://www.basketball-reference.com):

|Column|Meaning|
|-|-|
|Age|Player's age on February 1 of the season|
|Lg|League|
|Pos|Position|
|G|Games|
|GS|Games Started|
|MP|Minutes Played Per Game|
|FG|Field Goals Per Game|
|FGA|Field Goal Attempts Per Game|
|FG%|Field Goal Percentage|
|3P|3-Point Field Goals Per Game|
|3PA|3-Point Field Goal Attempts Per Game|
|3P%|3-Point Field Goal Percentage|
|2P|2-Point Field Goals Per Game|
|2PA|2-Point Field Goal Attempts Per Game|
|2P%|2-Point Field Goal Percentage|
|eFG%|Effective Field Goal Percentage*|
|FT|Free Throws Per Game|
|FTA|Free Throw Attempts Per Game|
|FT%|Free Throw Percentage|
|ORB|Offensive Rebounds Per Game|
|DRB|Defensive Rebounds Per Game|
|TRB|Total Rebounds Per Game|
|AST|Assists Per Game|
|STL|Steals Per Game|
|BLK|Blocks Per Game|
|TOV|Turnovers Per Game|
|PF|Personal Fouls Per Game|
|PTS|Points Per Game|

Notice that there is a row for `Career`. Let's create a new dataframe that doesn't include it by dropping the row with an index of `7`.

In [None]:
seasons = data.drop(data.index[7])
seasons
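
Dropping by index works when we know where the `Career` row is. Another possible approach is to filter on the row's label instead. The sketch below uses a small made-up dataframe named `example` and assumes the summary row is marked with `'Career'` in a `Season` column, so it may need adjusting for the real file.

```python
import pandas as pd

# Small made-up table shaped like the player data (values are illustrative)
example = pd.DataFrame({
    'Season': ['2019-20', '2020-21', '2021-22', 'Career'],
    'G': [60, 56, 68, 184],
})

# Keep every row whose Season is not 'Career'
example_seasons = example[example['Season'] != 'Career']
example_seasons
```

This keeps working even if the number of seasons in the file changes.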

## Calculating Statistics

Now let's calculate some statistics, and maybe compare them to that `Career` row.

In [None]:
seasons['G'].sum()

Compare that to the value from row `7` (the `Career` row) in the original dataframe.

In [None]:
data['G'][7]

Let's find the average (mean) number of games he played per season.

In [None]:
seasons['G'].mean()

We can also calculate `median`, `max`imum, and `min`imum.

In [None]:
print('Median:', seasons['G'].median())
print('Maximum:', seasons['G'].max())
print('Minimum:', seasons['G'].min())

It is also possible to calculate any of these statistics for the whole dataframe. We will want to use `numeric_only = True` so that the calculation only uses columns containing numbers.

In [None]:
seasons.mean(numeric_only = True)
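
To see what `numeric_only` does, here is a small sketch with a made-up dataframe called `mixed` that has one text column and two number columns.

```python
import pandas as pd

# Made-up data: 'Pos' holds text, the other columns hold numbers
mixed = pd.DataFrame({
    'Pos': ['PF', 'PF', 'C'],
    'G': [60, 56, 68],
    'MP': [35.0, 36.0, 38.0],
})

# Only the numeric columns ('G' and 'MP') appear in the result
averages = mixed.mean(numeric_only=True)
print(averages)
```

The text column `Pos` is skipped, since a mean of position names would not make sense.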

We may also want to use the `.describe()` method, either on a single column or on the whole dataframe.

In [None]:
seasons['G'].describe()

In [None]:
seasons.describe()
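
For a numeric column, `.describe()` reports the count, mean, standard deviation, minimum, maximum, and the 25%, 50%, and 75% values. The sketch below uses made-up numbers in a Series called `games` to show that the `50%` row is the same as the median.

```python
import pandas as pd

# Made-up games played in seven seasons (illustrative values only)
games = pd.Series([50, 62, 70, 74, 78, 80, 82])

summary = games.describe()
print(summary)

# The 50% row is the median
print('Same as median:', summary['50%'] == games.median())
```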

---

### Exercise

Use code to find the maximum and minimum Minutes Played Per Game (`'MP'`) in the `seasons` dataframe.

---

## Making New Columns

The dataset has a few percentage columns. Let's create a new column that is the average of the Field Goal, 3-Point, 2-Point, and Free Throw percentages.

In [None]:
seasons['Shot Percentage'] = (seasons['FG%'] + seasons['3P%'] + seasons['2P%'] + seasons['FT%']) / 4
seasons

And we can also decide to display only those columns.

In [None]:
seasons[['FG%','3P%','2P%','FT%','Shot Percentage']]
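
Adding the columns and dividing by 4 is one way to average them. A possible alternative is to select the columns and take the mean across each row with `.mean(axis=1)`. This sketch uses a made-up dataframe called `shots`, and `percent_columns` is just a name chosen for this example.

```python
import pandas as pd

# Made-up shooting percentages for two seasons (illustrative values only)
shots = pd.DataFrame({
    'FG%': [0.50, 0.45],
    '3P%': [0.30, 0.35],
    '2P%': [0.55, 0.50],
    'FT%': [0.75, 0.80],
})

percent_columns = ['FG%', '3P%', '2P%', 'FT%']

# axis=1 means "average across the columns of each row"
shots['Shot Percentage'] = shots[percent_columns].mean(axis=1)
shots
```

One difference to keep in mind: `.mean(axis=1)` skips missing values by default, while adding columns with `+` gives a missing result if any one of them is missing.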

---

### Exercise

Create a new column that is Games (`'G'`) multiplied by Minutes Played Per Game (`'MP'`). Use the `*` symbol for multiplication.

---

## Working With More Data

We have data files from [basketball-reference.com](https://www.basketball-reference.com/players) for a lot of NBA players. We will read each data file and add it to one big dataframe.

In [None]:
import os

frames = []  # one dataframe per player
for filename in os.listdir('data/nba-players'):
    if filename.endswith('.csv'):
        data = pd.read_csv('data/nba-players/' + filename)
        data['Player'] = filename[:-4].replace('_', ' ')  # set the player name from the filename
        frames.append(data)
df = pd.concat(frames, ignore_index=True)  # combine all players into one big dataframe
for column in df.columns[5:-1]:
    df[column] = pd.to_numeric(df[column], errors='coerce')  # convert stat columns to numeric
df
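
The `errors='coerce'` option tells `pd.to_numeric` to turn anything it cannot read as a number into a missing value (`NaN`) instead of stopping with an error. Here is a small sketch with made-up values in a Series called `raw`.

```python
import pandas as pd

# Made-up column where numbers were read in as text, plus one non-numeric entry
raw = pd.Series(['0.452', '12', 'Did Not Play'])

# Text that looks like a number is converted; anything else becomes NaN
converted = pd.to_numeric(raw, errors='coerce')
print(converted)
```

Missing values are skipped by methods like `.mean()`, so they don't break later calculations.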

Now we can look at some statistics for all of those players, such as the average career field goal percentage.

In [None]:
df['FG%'] = pd.to_numeric(df['FG%'], errors='coerce')
df[df['Season'] == 'Career']['FG%'].mean()
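
The expression `df['Season'] == 'Career'` produces a column of `True` and `False` values, and putting it inside `df[...]` keeps only the `True` rows. Here is a step-by-step sketch using a made-up dataframe called `players`, where the player names and numbers are placeholders.

```python
import pandas as pd

# Made-up data with one season row and one career row per player
players = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player B', 'Player B'],
    'Season': ['2021-22', 'Career', '2021-22', 'Career'],
    'FG%': [0.48, 0.50, 0.42, 0.40],
})

is_career = players['Season'] == 'Career'  # True for career rows, False otherwise
career_rows = players[is_career]           # keep only the True rows
career_fg_mean = career_rows['FG%'].mean() # average of the career values
print(career_fg_mean)
```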

We can also compare statistics for the players on the Toronto Raptors' 2022-2023 season roster.

In [None]:
raptors_2023 = ['Scottie Barnes',
                'Chris Boucher',
                'Pascal Siakam',
                'Fred VanVleet',
                'OG Anunoby',
                'Gary Trent Jr.',
                'Christian Koloko',
                'Precious Achiuwa',
                'Thaddeus Young',
                'Malachi Flynn',
                'Dalano Banton',
                'Jakob Poeltl',
                'Jeff Dowtin',
                'Will Barton',
                'Ron Harper Jr.',
                'Joe Wieskamp',
                'Otto Porter Jr.']

raptors_stats = df[(df['Player'].isin(raptors_2023)) & (df['Season'] == 'Career')]

import plotly.express as px
px.bar(raptors_stats.sort_values('FG%'), x='Player', y='FG%', title='2023 Raptors Career Field Goal Percentage')
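
The filter above combines two conditions: `.isin()` checks whether each player is in a list, and `&` means both conditions must be true. Each condition needs its own parentheses when they are written on one line. Here is a minimal sketch with a made-up dataframe and a placeholder `roster` list.

```python
import pandas as pd

# Made-up data (player names and values are placeholders)
players = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player B', 'Player C'],
    'Season': ['2021-22', 'Career', 'Career', 'Career'],
    'FG%': [0.48, 0.50, 0.40, 0.44],
})

roster = ['Player A', 'Player C']

on_roster = players['Player'].isin(roster)  # True if the player is in the list
is_career = players['Season'] == 'Career'   # True for career rows

# Keep rows where both conditions are True
roster_career = players[on_roster & is_career]
roster_career
```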

### Histograms

A [histogram](https://plotly.com/python/histograms) is a similar visualization that can be used to show statistics such as `count`, `sum`, or `avg` (average).

In [None]:
px.histogram(raptors_stats, x='Player', y='FG', histfunc='avg', title='2023 Raptors Career Field Goal Averages')
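
With `histfunc='avg'`, the histogram averages the `y` values of all rows that share the same `x` value. A rough pandas equivalent is to group by that column and take the mean, as in this sketch with a made-up dataframe called `stats`.

```python
import pandas as pd

# Made-up data: Player A has two rows, Player B has one
stats = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player B'],
    'FG': [6.0, 8.0, 3.0],
})

# Average FG for each player, similar to what histfunc='avg' plots
average_fg = stats.groupby('Player')['FG'].mean()
print(average_fg)
```

In `raptors_stats` each player has only one `Career` row, so each bar is simply that player's career value.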

---

### Exercise

Create a visualization, such as a bar graph or scatter plot, from the dataframe called `df`.

Suggested visualizations are:
* a bar graph of a career stat other than field goal percentage for the 2022-2023 Raptors roster
* a scatter plot with field goal attempts on the x-axis and points on the y-axis to see if there is a relationship. Use `color='Player', height=800` to make it look nicer.
* a line graph with `Age` on the x-axis and a stat on the y-axis for a particular player, for example `df[df['Player']=='Jakob Poeltl']`

---

Now that you have an understanding of how Jupyter notebooks work, check out these [example projects](examples).

In the next notebook we will look at [markdown](04-markdown.ipynb) for formatting text.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)