**Insert Banner Here**

# Statistics

This notebook introduces some basic statistics using data. Statistics is the practice of collecting, describing, and analyzing data. This includes measures like [mean, median, and mode](https://www.youtube.com/watch?v=zOYVZtTIjBA&t=49s).
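
As a quick illustration of those three measures, here is a minimal sketch using Python's built-in `statistics` module. The list of numbers is made up for the example and is not taken from the Pascal Siakam data.

```python
import statistics

# hypothetical values, invented for this example
values = [2, 3, 3, 5, 7]

mean_value = statistics.mean(values)      # sum of the values divided by how many there are
median_value = statistics.median(values)  # middle value when the values are sorted
mode_value = statistics.mode(values)      # value that appears most often

print(mean_value, median_value, mode_value)
```

The rest of this notebook calculates the same kinds of measures with pandas instead.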

First we will load the same data we have been using about [Pascal Siakam](https://www.basketball-reference.com/players/s/siakapa01.htm).

In [None]:
import pandas as pd
data = pd.read_csv('data/Pascal_Siakam.csv')
data

And here's a reminder of the data glossary from [basketball-reference.com](https://www.basketball-reference.com):

|Column|Meaning|
|-|-|
|Age|Player's age on February 1 of the season|
|Lg|League|
|Pos|Position|
|G|Games|
|GS|Games Started|
|MP|Minutes Played Per Game|
|FG|Field Goals Per Game|
|FGA|Field Goal Attempts Per Game|
|FG%|Field Goal Percentage|
|3P|3-Point Field Goals Per Game|
|3PA|3-Point Field Goal Attempts Per Game|
|3P%|3-Point Field Goal Percentage|
|2P|2-Point Field Goals Per Game|
|2PA|2-Point Field Goal Attempts Per Game|
|2P%|2-Point Field Goal Percentage|
|eFG%|Effective Field Goal Percentage|
|FT|Free Throws Per Game|
|FTA|Free Throw Attempts Per Game|
|FT%|Free Throw Percentage|
|ORB|Offensive Rebounds Per Game|
|DRB|Defensive Rebounds Per Game|
|TRB|Total Rebounds Per Game|
|AST|Assists Per Game|
|STL|Steals Per Game|
|BLK|Blocks Per Game|
|TOV|Turnovers Per Game|
|PF|Personal Fouls Per Game|
|PTS|Points Per Game|

Notice that there is a row for `Career`. Let's create a new dataframe that doesn't include it by dropping the row with an index of `7`.

In [None]:
seasons = data.drop(data.index[7])
seasons
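
Dropping by position works as long as the `Career` row really is at index `7`. A sketch of an alternative is to filter on the label instead. This assumes the data has a `Season` column whose value is `'Career'` on that row; the small dataframe below is made up for illustration.

```python
import pandas as pd

# hypothetical miniature version of the data (values invented for illustration)
example = pd.DataFrame({
    'Season': ['Season 1', 'Season 2', 'Career'],
    'G': [10, 20, 30],
})

# keep every row whose Season is not 'Career'
example_seasons = example[example['Season'] != 'Career']
print(example_seasons)
```

This approach does not depend on how many seasons are in the file, which may help if the same code is reused for other players.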

## Calculating Statistics

Now let's calculate some statistics, and maybe compare them to that `Career` row.

In [None]:
seasons['G'].sum()

Let's compare that to the value from row `7` (the `Career` row) in the original dataframe.

In [None]:
data['G'][7]

Let's find the average (mean) number of games he played per season.

In [None]:
seasons['G'].mean()

We can also calculate `median`, `max`imum, and `min`imum.

In [None]:
seasons['G'].median()

In [None]:
seasons['G'].max()

In [None]:
seasons['G'].min()
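
If we want several of these statistics at once, one option is the `.agg()` method, which accepts a list of statistic names. This is a minimal sketch on a made-up column of numbers, not the real games data.

```python
import pandas as pd

# hypothetical games-per-season values, invented for this example
games = pd.Series([10, 20, 60])

# compute several statistics in one call; the result is indexed by statistic name
summary = games.agg(['mean', 'median', 'max', 'min'])
print(summary)
```

Notice that the mean and the median of these example values are different, because the one large value pulls the mean upward.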

It is also possible to calculate any of these statistics for the whole dataframe. We will use `numeric_only = True` so that only columns containing numbers are included in the calculation.

In [None]:
seasons.mean(numeric_only = True)
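
To see what `numeric_only` does, here is a small sketch with a made-up dataframe that has one text column and two number columns. The column names follow the glossary above, but the values are invented.

```python
import pandas as pd

# hypothetical data: one text column and two numeric columns
example = pd.DataFrame({
    'Pos': ['PF', 'PF', 'PF'],
    'G': [10, 20, 30],
    'MP': [15.0, 20.0, 25.0],
})

# the text column 'Pos' is left out of the result
column_means = example.mean(numeric_only=True)
print(column_means)
```

In recent versions of pandas, leaving out `numeric_only=True` generally causes an error when the dataframe contains text columns.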

We may also want to use the `.describe()` method, either on a single column or on the whole dataframe.

In [None]:
seasons['G'].describe()

In [None]:
seasons.describe()
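
For a numeric column, the output of `.describe()` includes the count, mean, standard deviation, minimum, quartiles, and maximum. The sketch below uses a made-up column to show that the `50%` row is the same value as the median.

```python
import pandas as pd

# hypothetical values, invented for this example
games = pd.Series([10, 20, 30, 40], name='G')

stats = games.describe()
print(stats)

# the 50% row of describe() matches the median
print(stats['50%'] == games.median())
```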

---

### Exercise

Use code to find the maximum and minimum Minutes Played Per Game (`'MP'`) in the `seasons` dataframe.

---

## Making New Columns

The dataset has a few percentage columns. Let's create a new column that is the average of the Field Goal, 3-Point, 2-Point, and Free Throw percentages.

In [None]:
seasons['Shot Percentage'] = (seasons['FG%'] + seasons['3P%'] + seasons['2P%'] + seasons['FT%']) / 4
seasons

And we can also decide to display only those columns.

In [None]:
seasons[['FG%','3P%','2P%','FT%','Shot Percentage']]
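
Another way to average several columns is to select them and call `.mean(axis=1)`, which calculates across each row. This is a sketch on made-up values; `percent_columns` is a name chosen here for illustration.

```python
import pandas as pd

# hypothetical percentages, invented for this example
example = pd.DataFrame({
    'FG%': [0.50, 0.40],
    '3P%': [0.30, 0.20],
    '2P%': [0.60, 0.50],
    'FT%': [0.80, 0.70],
})

percent_columns = ['FG%', '3P%', '2P%', 'FT%']

# axis=1 means "average across the columns in each row"
example['Shot Percentage'] = example[percent_columns].mean(axis=1)
print(example)
```

One difference to be aware of: `.mean()` skips missing values by default, while adding the columns together gives a missing result if any of the values is missing.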

---

### Exercise

Create a new column that is Games (`'G'`) times Minutes Played Per Game (`'MP'`). Use the `*` symbol for times.

---

## Working With More Data

So far we have worked with one player's data. The code below collects links to player pages from the Toronto Raptors page on basketball-reference.com, then downloads each player's per game statistics table and saves it as a CSV file in the `data` folder, pausing 30 seconds between requests.



In [None]:
# use BeautifulSoup to get all of the player links from https://www.basketball-reference.com/teams/TOR/players.html

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/teams/TOR/players.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')

player_links = []
for link in links:
    href = link.get('href')  # None when the <a> tag has no href attribute
    # keep links that point to a player page, not just to '/players/' itself
    if href and 'players/' in href and len(href) > 9 and href[1] == 'p':
        player_links.append(href)
player_links
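
The filtering rule in the loop can be pulled into a small function and tried on sample strings without making any web requests. The sketch below uses `startswith('/players/')`, which is a slightly stricter check than the one in the loop. The name `is_player_link` and the sample values are hypothetical, made up for this example.

```python
def is_player_link(href):
    """Return True for links that look like '/players/<something>'."""
    prefix = '/players/'
    # href can be None when an <a> tag has no href attribute
    return bool(href) and href.startswith(prefix) and len(href) > len(prefix)

# hypothetical sample values, including a missing href
sample_hrefs = ['/players/x/example01.html', '/players/', '/teams/TOR/', None]

filtered = [href for href in sample_hrefs if is_player_link(href)]
print(filtered)
```

Testing the rule this way makes it easier to check edge cases, such as links with no `href`, before running it on a real page.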

In [None]:
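import pandas as pd
from time import sleep

# player_links[12:] skips the first 12 links in the list
for player_link in player_links[12:]:
    url = 'https://www.basketball-reference.com' + player_link
    response = requests.get(url)

    # get the player name from the h1 tag
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.find('h1').text
    filename = title.strip().replace(' ','_') + '.csv'
    print(filename)

    tables = pd.read_html(url)
    # find the table that contains the per game stats
    df = None  # reset so a table from the previous player is never reused
    for table in tables:
        if 'G' in table.columns:
            df = table
            break
    if df is None:
        print('no per game table found, skipping')
        sleep(30)
        continue
    # find the row that contains Career totals
    career_row = None
    for i in range(len(df)):
        if df['Season'][i] == 'Career':
            career_row = i
            break
    # drop rows after the career totals (only if a Career row was found)
    if career_row is not None:
        df = df.drop(df.index[career_row+1:])

    df.to_csv('data/'+filename, index=False)
    sleep(30)  # pause between requests to avoid overloading the website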
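
The step that keeps rows up to and including the `Career` row can also be written without a loop. This is a sketch on a made-up table, assuming the default index (0, 1, 2, ...) and a `Season` column; the row labels and values are invented.

```python
import pandas as pd

# hypothetical table with an extra row after the Career totals
table = pd.DataFrame({
    'Season': ['Season 1', 'Season 2', 'Career', 'Extra row'],
    'G': [10, 20, 30, 30],
})

# index labels of the rows where Season is 'Career'
career_positions = table.index[table['Season'] == 'Career']

if len(career_positions) > 0:
    # .loc slicing includes the end label, so the Career row is kept
    trimmed = table.loc[:career_positions[0]]
else:
    # no Career row found, so keep the whole table
    trimmed = table

print(trimmed)
```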

In the next notebook we will look at [markdown](04-markdown.ipynb).
