**Insert Banner Here**

# Introduction to Data Science and Basketball

This series of notebooks is designed to introduce you to the foundational concepts and techniques used in data science, with a focus on basketball data.

As our world becomes increasingly data-driven, the ability to analyze, visualize, and draw insights from large and complex datasets is becoming an essential skill. Data science can provide you with the tools to make better decisions and solve complex problems.

We will use [Python](https://www.python.org/) as our programming language in [Jupyter notebooks](https://jupyter.org/), a combination that is widely used for data science in practice. Python also has great [code libraries](https://www.geeksforgeeks.org/libraries-in-python/) for data science, such as [pandas](https://pandas.pydata.org/) and [Plotly](https://plotly.com/python/).

Each notebook focuses on a specific topic and provides examples and exercises to help solidify your understanding.

To run code in a Jupyter notebook, click on a code cell such as the one below, then click the `▶Run` button at the top of the window, near the stop (`◼`) button.

In [None]:
print('Hello World')

## Working with Data

We will use the [pandas](https://pandas.pydata.org/) library to work with datasets in a format similar to spreadsheets or tables. `▶Run` the code cell below to import the `pandas` library using the short form `pd`. The same cell then uses pandas to load and display data from [basketball-reference.com](https://www.basketball-reference.com/players/s/siakapa01.html) about [Pascal Siakam](https://www.nba.com/stats/player/1627783).

In [None]:
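import pandas as pd
# Alternative: read the table straight from the website (needs an internet connection)
#data = pd.read_html('https://www.basketball-reference.com/players/s/siakapa01.html')[1]
# Load the saved copy of the data from a local CSV file
data = pd.read_csv('data/siakam.csv')
data  # the last line of a cell is displayed as output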

There are quite a few columns in that data table. Let's list just the column names.

In [None]:
data.columns
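
Besides `data.columns`, a couple of other quick checks are handy when first looking at a table. The sketch below uses a small made-up table called `example` (the values are invented for illustration and are not Siakam's statistics) so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical table; the numbers are invented for illustration only.
example = pd.DataFrame({
    'Season': ['Year 1', 'Year 2'],
    'G': [50, 60],
    'PTS': [20.0, 22.5],
})

column_names = list(example.columns)  # column names as a plain Python list
n_rows, n_cols = example.shape        # (number of rows, number of columns)

print(column_names)
print(n_rows, n_cols)
```

The same calls should work on `data` once it has been loaded.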

Here is the data glossary from basketball-reference:

|Column|Meaning|
|-|-|
|Age|Player's age on February 1 of the season|
|Lg|League|
|Pos|Position|
|G|Games|
|GS|Games Started|
|MP|Minutes Played Per Game|
|FG|Field Goals Per Game|
|FGA|Field Goal Attempts Per Game|
|FG%|Field Goal Percentage|
|3P|3-Point Field Goals Per Game|
|3PA|3-Point Field Goal Attempts Per Game|
|3P%|3-Point Field Goal Percentage|
|2P|2-Point Field Goals Per Game|
|2PA|2-Point Field Goal Attempts Per Game|
|2P%|2-Point Field Goal Percentage|
|eFG%|Effective Field Goal Percentage*|
|FT|Free Throws Per Game|
|FTA|Free Throw Attempts Per Game|
|FT%|Free Throw Percentage|
|ORB|Offensive Rebounds Per Game|
|DRB|Defensive Rebounds Per Game|
|TRB|Total Rebounds Per Game|
|AST|Assists Per Game|
|STL|Steals Per Game|
|BLK|Blocks Per Game|
|TOV|Turnovers Per Game|
|PF|Personal Fouls Per Game|
|PTS|Points Per Game|

<span style="font-size:10px">*This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.</span>
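
As a rough illustration of that adjustment, the commonly published formula is eFG% = (FG + 0.5 × 3P) / FGA. The function name `effective_fg_pct` and the input numbers below are made up for this sketch.

```python
def effective_fg_pct(fg, three_p, fga):
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA."""
    return (fg + 0.5 * three_p) / fga

# Invented example: 8 made shots (2 of them 3-pointers) on 16 attempts.
fg_pct = 8 / 16
efg = effective_fg_pct(fg=8, three_p=2, fga=16)

print(fg_pct)  # plain field goal percentage
print(efg)     # higher, because each 3-pointer counts as 1.5 made shots
```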

### Selecting Data

We can select and display just one column.

In [None]:
data[['FG']]

Or multiple columns.

In [None]:
data[['FG', 'FGA']]
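
The double square brackets matter: the inner pair is a Python list of column names. As a small sketch on a made-up table (`example` and its values are invented), single brackets return one column as a pandas `Series`, while double brackets return a `DataFrame`, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical values, for illustration only.
example = pd.DataFrame({'FG': [5.0, 6.5], 'FGA': [10.0, 12.0]})

as_series = example['FG']    # single brackets: a Series (one column of values)
as_frame = example[['FG']]   # double brackets: a DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
```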

---

### Exercise

In the cell below, use code to display the columns for Field Goal Percentage, 2-Point Field Goal Percentage, 3-Point Field Goal Percentage, and Free Throw Percentage.

---

### Filtering Data

We can filter the data to only display rows where Pascal Siakam's free throw percentage was greater than 75% (written as the decimal `0.75` in the code).

In [None]:
data[data['FT%'] > 0.75]

Or if we only want to display seasons where he started every game, we can filter using `==` which means "is equal to".

In [None]:
data[data['GS'] == data['G']]

It is even possible to include multiple conditions.

In [None]:
data[(data['FT%'] > 0.75) & (data['GS'] == data['G'])]
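
It may help to see what happens inside those brackets. Each condition produces a column of `True`/`False` values, one per row, and pandas keeps the rows marked `True`. The parentheses around each condition are needed because `&` and `|` are evaluated before `>` and `==` in Python. The sketch below uses an invented table and the hypothetical names `good_ft` and `started_all`.

```python
import pandas as pd

# Hypothetical values, for illustration only.
example = pd.DataFrame({
    'G': [10, 12, 8],
    'GS': [10, 9, 8],
    'FT%': [0.80, 0.70, 0.60],
})

good_ft = example['FT%'] > 0.75               # True where FT% is above 0.75
started_all = example['GS'] == example['G']   # True where every game was started

both = example[good_ft & started_all]    # rows where both conditions hold
either = example[good_ft | started_all]  # rows where at least one holds

print(good_ft.tolist())
print(started_all.tolist())
print(both)
print(either)
```

Naming the conditions first is optional, but it can make longer filters easier to read.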

These are the symbols we use for comparisons, and for combining conditions, when filtering with pandas:

|Symbol|Meaning|
|-|-|
|>|greater than|
|<|less than|
|==|is equal to|
|!=|not equal to|
|>=|greater than or equal to|
|<=|less than or equal to|
|&|and|
|\||or|

---

### Exercise

In the cell below, use code to display only rows where Assists Per Game (`'AST'`) was greater than `5` or Steals Per Game (`'STL'`) was greater than `1`.

---

### Sorting Data

We can also sort the data by the values in a column.

In [None]:
data.sort_values('PF')

The default is to sort in ascending order (`ascending=True`), but we can instead sort in descending order.

In [None]:
data.sort_values('PF', ascending=False)

Or we can sort by two columns, for example first by Blocks Per Game and then by Steals Per Game.

In [None]:
data.sort_values(['BLK', 'STL'])
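
When sorting by two columns, the second column only decides the order among rows that tie on the first. `sort_values` also accepts a list for `ascending`, one entry per column, so each column can have its own direction. Here is a minimal sketch on an invented table (the season labels and values are made up):

```python
import pandas as pd

# Hypothetical values, for illustration only.
example = pd.DataFrame({
    'Season': ['A', 'B', 'C'],
    'BLK': [0.5, 0.5, 1.0],
    'STL': [1.0, 0.8, 0.9],
})

# Both columns ascending: A and B tie on BLK, so STL breaks the tie.
default_order = example.sort_values(['BLK', 'STL'])

# BLK descending, then STL ascending among the tied rows.
mixed_order = example.sort_values(['BLK', 'STL'], ascending=[False, True])

print(default_order['Season'].tolist())
print(mixed_order['Season'].tolist())
```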

---

### Exercise

In the code cell below, display only the columns `'Season'`, `'FG%'`, `'2P%'`, and `'3P%'` sorted by `'FG%'`.

---

In the next notebook we will start [visualizing data](02-visualizing-data.ipynb).
