# Getting Started with Pandas and DuckDB

This notebook walks you through working with real NBA data using two Python tools:

- **DuckDB**, a super-fast in-process SQL database (like SQLite but for analytics)
- **Pandas**, a library for working with tables (called "DataFrames")

We’ll start by connecting to the database and then gradually learn how to filter, group, and analyse data using Python code.

Even if this is your first time seeing Python, don’t worry: each section explains what’s happening and why.


In [1]:
# First, import the libraries we need
import duckdb  # lets us run SQL in Python
import pandas as pd  # used to work with tabular data

# Now, connect to our NBA database
# read_only=True opens the file without allowing any changes to it
con = duckdb.connect('../duckdb/nba.duckdb', read_only=True)

# This database is just a single file on your computer — no server needed


In [None]:
# Let's see what tables exist in our database
# These are like sheets in Excel or tabs in Google Sheets
con.execute("SHOW TABLES").fetchdf()


Each row in the output above is one table in the database. We'll mostly use:

- `players_cleaned`: info about players (names, positions, height, age)
- `player_stats_cleaned`: performance data (points per game, assists, etc.)
- `top_scorers_by_season`: players with the highest average points (already filtered for us)


In [None]:
# Load the full player stats table into a DataFrame (think: spreadsheet in code)
df_stats = con.execute("SELECT * FROM player_stats_cleaned").fetchdf()

# Let's look at the first 5 rows
df_stats.head()


This gives us a preview of the table as a Pandas DataFrame. Each row is one player. Each column is one stat (like `points_per_game` or `games_played`).

You can think of `df_stats` as a named spreadsheet inside your code.
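
To make the spreadsheet comparison concrete, here is a small sketch with a hand-built DataFrame. The values are made up; only the column names follow the ones mentioned above.

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook
toy_stats = pd.DataFrame(
    {
        "player_id": [1, 2, 3, 4],
        "games_played": [72, 48, 65, 30],
        "points_per_game": [25.1, 12.4, 18.9, 7.3],
    }
)

print(toy_stats.shape)          # (number of rows, number of columns)
print(list(toy_stats.columns))  # the column names
print(toy_stats.head(2))        # the first 2 rows instead of the default 5

# Selecting one column gives a Series: a single labelled column of values
ppg = toy_stats["points_per_game"]
print(type(ppg).__name__)
```

Everything shown here works the same way on `df_stats`; the toy table just makes it easy to check the results by eye.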


In [None]:
# .info() shows column names, data types, and non-null counts (which reveal missing values)
df_stats.info()

In [None]:
# .describe() gives basic statistics like min, max, mean, and quartiles
df_stats.describe()


These are your quick stats for all numeric columns:
- Count = how many values are present (not missing)
- Mean = average
- Std = how spread out the data is (standard deviation)
- Min/Max = lowest and highest values
- 25%/50%/75% = quartiles (the 50% row is the median)

Great for spotting weird values or getting a feel for the data.
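
Here is a small sketch of how missing values show up in these summaries. The data is made up, with one value left out on purpose.

```python
import pandas as pd

# Made-up data; one points_per_game value is deliberately missing
toy_stats = pd.DataFrame(
    {
        "games_played": [72, 48, 65, 30],
        "points_per_game": [25.1, float("nan"), 18.9, 7.3],
    }
)

summary = toy_stats.describe()

# "count" skips missing values, so it is lower for points_per_game
print(summary.loc["count"])

# The 50% row is the median of each column
print(summary.loc["50%"])

# isna().sum() counts the missing values in each column directly
missing = toy_stats.isna().sum()
print(missing)
```

A count that is lower than the number of rows is usually the first hint that a column has gaps.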


In [None]:
# Let's find players who played more than 50 games
df_stats[df_stats["games_played"] > 50]


This is called **boolean indexing**. We're telling pandas:
> Show me only the rows where `games_played` is greater than 50.

This is one of the most common and powerful tools in pandas: filtering your data to focus on just what you need.
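
It can help to see the two steps of boolean indexing separately. This sketch uses a made-up table; the second filter column and the threshold of 20 are just examples.

```python
import pandas as pd

# Made-up rows for illustration
toy_stats = pd.DataFrame(
    {
        "player_id": [1, 2, 3, 4],
        "games_played": [72, 48, 65, 30],
        "points_per_game": [25.1, 12.4, 18.9, 7.3],
    }
)

# Step 1: the comparison produces one True/False value per row
mask = toy_stats["games_played"] > 50
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
regulars = toy_stats[mask]
print(regulars)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
regular_scorers = toy_stats[
    (toy_stats["games_played"] > 50) & (toy_stats["points_per_game"] >= 20)
]
print(regular_scorers)
```

The parentheses matter: without them Python applies `&` before the comparisons and pandas raises an error.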


In [None]:
# Show the top 10 scorers by points per game
df_stats.sort_values("points_per_game", ascending=False).head(10)


We use `.sort_values()` to order the DataFrame.
- `"points_per_game"` is the column we're sorting by
- `ascending=False` means highest scores first
- `.head(10)` shows only the top 10 rows
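
Two related options are worth knowing: sorting by more than one column to break ties, and `nlargest`, a shortcut for "sort descending and take the first rows". The sketch below uses made-up values, including a deliberate tie.

```python
import pandas as pd

# Made-up rows; players 1 and 3 are tied on points_per_game
toy_stats = pd.DataFrame(
    {
        "player_id": [1, 2, 3, 4],
        "games_played": [72, 48, 65, 30],
        "points_per_game": [18.9, 25.1, 18.9, 7.3],
    }
)

# Sort by points first, then break ties with games played (both descending)
ranked = toy_stats.sort_values(
    ["points_per_game", "games_played"], ascending=[False, False]
)

# nlargest(n, column) returns the n rows with the highest values in that column
top2 = toy_stats.nlargest(2, "points_per_game")

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... after sorting
leaderboard = ranked.reset_index(drop=True)
print(leaderboard)
print(top2)
```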


In [None]:
# Load player info so we can group by position
df_players = con.execute("SELECT * FROM players_cleaned").fetchdf()

# Merge player stats and player info using player_id (the shared key)
df_joined = pd.merge(df_stats, df_players, on="player_id")

# Now group by position and calculate average points per game
df_joined.groupby("position")["points_per_game"].mean().sort_values(ascending=False)


This is our first grouped analysis:
- `pd.merge(...)` combines the stats and player info into one table by matching rows on `player_id` (by default, only players found in both tables are kept)
- `.groupby("position")` splits the data into groups (one for each position)
- `["points_per_game"].mean()` calculates the average within each group

This is great for comparing categories.
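
The sketch below shows what the merge does with unmatched rows, and how to compute more than one summary per group. Both tables are made up, and the output column names `avg_ppg` and `players` are just illustrative labels.

```python
import pandas as pd

# Made-up tables; player 4 has stats but no info row
toy_stats = pd.DataFrame(
    {"player_id": [1, 2, 3, 4], "points_per_game": [20.0, 10.0, 16.0, 8.0]}
)
toy_players = pd.DataFrame(
    {"player_id": [1, 2, 3], "position": ["G", "G", "C"]}
)

# how="inner" is the default: player 4 is dropped because it has no match
joined = pd.merge(toy_stats, toy_players, on="player_id", how="inner")

# how="left" keeps every stats row; unmatched rows get a missing position
joined_left = pd.merge(toy_stats, toy_players, on="player_id", how="left")

# Several summaries per group at once, using named aggregation
by_position = (
    joined.groupby("position")
    .agg(avg_ppg=("points_per_game", "mean"), players=("player_id", "count"))
    .sort_values("avg_ppg", ascending=False)
)
print(by_position)
```

Showing the group size next to the average is a useful habit: an average over one or two players says much less than an average over fifty.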


## Visualisation Using Python

[matplotlib documentation](https://matplotlib.org/stable/index.html)

In this tutorial, we’ll use **matplotlib**, one of the most widely used plotting libraries in Python.

matplotlib lets you create charts and plots directly from your data — just like you'd do in Excel or Google Sheets, but with code. You can use it to draw:
- Bar charts
- Line graphs
- Histograms
- Scatter plots
- And more

It’s especially useful when you're exploring data and want to **see patterns** or **communicate insights visually**.

---

In the example below, we use matplotlib (through a built-in pandas method) to create a **histogram**: a chart that groups values into ranges (called bins) and shows how many values fall into each one.

This lets you quickly answer questions like:
- How common are 20+ PPG (points per game) players?
- Is the data skewed toward low or high scores?
- Are there any surprising gaps or outliers in the data?

You’ll see matplotlib pop up again and again in real-world data analysis — it’s a foundational skill worth getting comfortable with.
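
The first question in the list above can also be checked with numbers rather than a picture. This sketch uses made-up values and hand-picked bin edges; `np.histogram` is NumPy's function for counting values per bin, the same kind of counts a histogram draws as bars.

```python
import numpy as np
import pandas as pd

# Made-up points-per-game values for eight players
ppg = pd.Series([25.1, 12.4, 18.9, 7.3, 21.0, 3.2, 9.8, 15.5])

# The mean of a True/False Series is the fraction of True values
share_20_plus = (ppg >= 20).mean()
print(f"{share_20_plus:.0%} of these players average 20+ PPG")

# Count how many values fall between each pair of neighbouring bin edges
counts, edges = np.histogram(ppg, bins=[0, 5, 10, 15, 20, 25, 30])
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{int(left)}-{int(right)} PPG: {int(count)} players")
```

Passing `bins=30`, as the plotting cell does, lets the library choose 30 equal-width ranges instead of the hand-picked edges used here.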


In [None]:
import matplotlib.pyplot as plt

# Plot histogram of points per game
df_stats["points_per_game"].plot(kind="hist", bins=30, title="Points Per Game Distribution")

plt.xlabel("Points Per Game")
plt.ylabel("Number of Players")
plt.show()


## What You’ve Learned

- How to connect to a DuckDB database in Python
- How to read SQL tables into pandas DataFrames
- How to explore, filter, and sort tabular data
- How to group by categories and calculate averages
- How to visualise a distribution with a histogram

This is a great foundation for doing real-world data analysis.

Next steps:
- Try filtering by other columns (assists, rebounds)
- Build your own leaderboard
- Add visualisations like bar plots or scatter plots
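
As a starting point for these next steps, here is a rough sketch of a small leaderboard drawn as a bar chart. The column names `player_name` and `assists_per_game` are assumptions and may not match the real tables; check `df_stats.columns` and `df_players.columns` first. The values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data; both column names are hypothetical
toy = pd.DataFrame(
    {
        "player_name": ["Player A", "Player B", "Player C", "Player D"],
        "assists_per_game": [8.2, 3.1, 5.6, 6.4],
    }
)

# Leaderboard: the 3 rows with the highest assists per game
top3 = toy.nlargest(3, "assists_per_game")

# One bar per player, with the bar height set by assists per game
fig, ax = plt.subplots()
ax.bar(top3["player_name"], top3["assists_per_game"])
ax.set_xlabel("Player")
ax.set_ylabel("Assists Per Game")
ax.set_title("Top 3 by Assists (toy data)")
```

In a notebook the chart is normally rendered when the cell finishes; in a plain Python script, add `plt.show()` at the end.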
