# AI Lesson 04b Part 1: Aggregate Functions

**Course:** Applications of Artificial Intelligence  
**Focus:** Counting, Summing, and Calculating Averages with SQL  

---

## What You'll Learn Today

In ai04a, you learned to filter and sort individual rows. Now you'll learn to **summarize data**!

**Today's focus:**
- `COUNT()` - How many rows?
- `SUM()` - Add up numbers
- `AVG()` - Calculate averages
- `MAX()` - Find the highest value
- `MIN()` - Find the lowest value

**Why this matters for ML:**
- Calculate team win percentages
- Find average player performance
- Identify max/min stats for outlier detection

Let's go!

---

## Setup: Connect to Database

In [None]:
# Import libraries for database work and data manipulation
import pandas as pd
import sqlite3

# Create connection to NBA database
# This connection (conn) will be used in every query
conn = sqlite3.connect('nba_5seasons.db')

print("✅ Connected to NBA database")
print("Ready to learn aggregations!")
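
One thing to watch for: `sqlite3.connect()` quietly creates a new, empty database when the file is missing, so a typo in the filename only shows up later as a "no such table" error. Here is a minimal sketch of a guard, assuming the `.db` file should sit in the notebook's working directory. The helper name `connect_existing` and the demo filename are made up for illustration.

```python
import os
import sqlite3

def connect_existing(path):
    """Open a SQLite database only if the file is already on disk."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Database file not found: {path}")
    return sqlite3.connect(path)

# Demonstration with a filename that should not exist (hypothetical name)
try:
    connect_existing("no_such_file_for_demo.db")
    opened = True
except FileNotFoundError as err:
    opened = False
    print(err)
```

In the lesson you would call `connect_existing('nba_5seasons.db')` in place of `sqlite3.connect(...)`.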

---
## Part 1: COUNT - Counting Rows

COUNT tells you **how many rows** match your criteria.
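
Before trying it on the real database, here is a small self-contained sketch on a toy in-memory table. The four rows are made-up values; only the column names (`wl`, `pts`) follow the lesson's `team_game_stats` table.

```python
import sqlite3

# Toy in-memory table with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team_game_stats (team TEXT, wl TEXT, pts INTEGER)")
demo.executemany(
    "INSERT INTO team_game_stats VALUES (?, ?, ?)",
    [("BOS", "W", 112), ("NYK", "L", 101), ("LAL", "W", 125), ("MIA", "L", 98)],
)

# COUNT(*) with no WHERE counts every row
all_rows = demo.execute("SELECT COUNT(*) FROM team_game_stats").fetchone()[0]

# COUNT(*) with WHERE counts only the rows that pass the filter
win_rows = demo.execute(
    "SELECT COUNT(*) FROM team_game_stats WHERE wl = 'W'"
).fetchone()[0]

print(all_rows, win_rows)  # 4 2
demo.close()
```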

### Query 1: Count All Teams

How many NBA teams are in the database?

**Pattern:**
```sql
SELECT COUNT(*) FROM table_name
```

**Fill in the blank:**

In [None]:
# Query to count total teams
# COUNT(*) counts all rows in the table
# This is like asking "how many teams exist?"
query = """
SELECT COUNT(*)
FROM ______
"""

# Run query and convert result to DataFrame
result = pd.read_sql(query, conn)
display(result)

**💡 Hint:** Table is `teams`

**What you should see:** 30 (there are 30 NBA teams)

### Query 2: Count All Games

How many game records are in team_game_stats?

In [None]:
# Count total game records
# Remember: Each game has 2 records (one for each team)
query = """
SELECT COUNT(*)
FROM ______
"""

result = pd.read_sql(query, conn)
display(result)

**💡 Hint:** `team_game_stats`

**What you should see:** Large number (thousands of game records)

### Query 3: Count Games in 2021-22 Season

How many game records are from the 2021-22 season?

**Pattern:**
```sql
SELECT COUNT(*) FROM table WHERE condition
```

In [None]:
# Count games from specific season
# COUNT(*) with WHERE only counts rows that match the filter
query = """
SELECT COUNT(*)
FROM team_game_stats
WHERE ______ = ______
"""

result = pd.read_sql(query, conn)
display(result)

**💡 Hint:** `season = '2021-22'` (don't forget quotes!)

**What you should see:** 2,460 (30 teams × 82 games = 2,460 records)

### Query 4: Count Wins in 2021-22

How many wins happened in 2021-22?

In [None]:
# Count wins only
# Combining COUNT with multiple WHERE conditions
query = """
SELECT COUNT(*)
FROM team_game_stats
WHERE season = '2021-22' AND ______ = ______
"""

result = pd.read_sql(query, conn)
display(result)

**💡 Hint:** `wl = 'W'`

**What you should see:** 1,230 (exactly half of 2,460 - every game has 1 winner)
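
The expected counts come from simple schedule arithmetic. A quick sketch, assuming a full 82-game regular season and one row per team per game:

```python
teams = 30
games_per_team = 82

team_game_rows = teams * games_per_team   # one row per team per game
actual_games = team_game_rows // 2        # each game appears twice (once per team)
wins = actual_games                       # exactly one winner per game

print(team_game_rows, actual_games, wins)  # 2460 1230 1230
```

If your counts differ, the season in the database may not be a full 82-game season, or the table may include extra games.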

### Query 5: Count High-Scoring Games

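How many team-game records show 120 or more points? There is no season filter here, so this counts across every season in the database.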

In [None]:
# Count games meeting a numeric threshold
# COUNT tells us how many times something happened
query = """
SELECT COUNT(*)
FROM team_game_stats
WHERE ______ >= ______
"""

result = pd.read_sql(query, conn)
display(result)

**💡 Hint:** `pts >= 120` (no quotes for numbers!)

---
## Part 2: SUM - Adding Things Up

SUM adds up all the values in a column.
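
Here is a small sketch on a toy in-memory table (made-up numbers, column names borrowed from the lesson) that checks SQL's `SUM` against Python's built-in `sum()`:

```python
import sqlite3

# Toy in-memory table with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team_game_stats (wl TEXT, pts INTEGER, reb INTEGER)")
rows = [("W", 112, 44), ("L", 101, 39), ("W", 125, 51)]
demo.executemany("INSERT INTO team_game_stats VALUES (?, ?, ?)", rows)

# Two SUMs in one query, restricted to wins by WHERE
total_pts, total_reb = demo.execute(
    "SELECT SUM(pts), SUM(reb) FROM team_game_stats WHERE wl = 'W'"
).fetchone()

# Same totals computed in plain Python for comparison
py_pts = sum(p for wl, p, r in rows if wl == "W")
py_reb = sum(r for wl, p, r in rows if wl == "W")

print(total_pts, total_reb)  # 237 95
demo.close()
```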

### Query 6: Total Points Scored in 2021-22

Add up ALL points scored in the 2021-22 season.

**Pattern:**
```sql
SELECT SUM(column_name) FROM table
```

In [None]:
# Sum all points from a season
# SUM(pts) adds up every value in the pts column
# This gives us total points scored by all teams in all games
query = """
SELECT SUM(______)
FROM team_game_stats
WHERE season = '2021-22'
"""

result = pd.read_sql(query, conn)
display(result)

**💡 Hint:** `SUM(pts)`

**What you should see:** ~270,000 total points

### Query 7: Total Assists in Wins

How many total assists were made in winning games (2021-22)?

In [None]:
# Sum assists for winning teams only
# WHERE filters to wins, then SUM adds up assists
query = """
SELECT SUM(ast)
FROM team_game_stats
WHERE season = '2021-22' AND ______ = ______
"""

result = pd.read_sql(query, conn)
display(result)

**💡 Hint:** `wl = 'W'`

### Query 8: Multiple Sums at Once

Get total points AND total rebounds from 2021-22.

**You can use multiple aggregations in one query!**

In [None]:
# Multiple SUM functions in one query
# Separate each with a comma, just like regular columns
query = """
SELECT SUM(pts), SUM(______)
FROM team_game_stats
WHERE season = '2021-22'
"""

result = pd.read_sql(query, conn)
display(result)

**💡 Hint:** `SUM(reb)` - rebounds column

---
## Part 3: AVG - Calculating Averages

AVG calculates the average (mean) of a column.
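
`AVG` is the same as `SUM` divided by `COUNT`, with one catch in SQLite: dividing one integer by another drops the decimals. The sketch below uses a toy in-memory table with made-up scores to show the difference.

```python
import sqlite3

# Toy in-memory table with made-up scores
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team_game_stats (pts INTEGER)")
demo.executemany(
    "INSERT INTO team_game_stats VALUES (?)", [(112,), (101,), (125,), (99,)]
)

avg_pts, manual_int, manual_float = demo.execute(
    """
    SELECT AVG(pts),                   -- built-in mean, returns a decimal
           SUM(pts) / COUNT(*),        -- integer / integer drops the decimals
           SUM(pts) * 1.0 / COUNT(*)   -- multiplying by 1.0 keeps them
    FROM team_game_stats
    """
).fetchone()

print(avg_pts, manual_int, manual_float)  # 109.25 109 109.25
demo.close()
```

Using `AVG()` directly avoids the integer-division surprise.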

### Query 9: Average Points Per Game

What's the average number of points a team scored per game in 2021-22?

**Pattern:**
```sql
SELECT AVG(column_name) FROM table
```

In [None]:
# Calculate average points
# AVG(pts) adds all points and divides by the number of rows (one row per team per game)
query = """
SELECT AVG(______)
FROM team_game_stats
WHERE season = '2021-22'
"""

result = pd.read_sql(query, conn)
display(result)

**💡 Hint:** `AVG(pts)`

**What you should see:** Around 110 points (a typical score for one NBA team)

### Query 10: Average Assists in Wins vs Losses

Do winning teams average more assists? Let's check wins first.

In [None]:
# Average assists for winning teams
# This shows if teamwork (assists) correlates with winning
query = """
SELECT AVG(ast)
FROM team_game_stats
WHERE season = '2021-22' AND wl = ______
"""

result = pd.read_sql(query, conn)
print("Average assists in WINS:")
display(result)

**💡 Hint:** `'W'` for wins

### Query 11: Now Check Losses

What's the average assists in losses?

In [None]:
# Average assists for losing teams
# Compare this to Query 10 - is there a difference?
query = """
SELECT AVG(ast)
FROM team_game_stats
WHERE season = '2021-22' AND wl = ______
"""

result = pd.read_sql(query, conn)
print("Average assists in LOSSES:")
display(result)

**💡 Hint:** `'L'` for losses

**Question:** Do wins have more assists on average? What does this tell you?

---
## Part 4: MAX and MIN - Finding Extremes

MAX finds the highest value, MIN finds the lowest.
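
Keep in mind that `MAX` and `MIN` return only the value, not the row it came from. The sketch below uses a toy in-memory table with made-up rows. The follow-up `ORDER BY ... LIMIT 1` query is one possible way to see which row holds the maximum, and it assumes you have met `LIMIT` already.

```python
import sqlite3

# Toy in-memory table with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team_game_stats (team TEXT, pts INTEGER)")
demo.executemany(
    "INSERT INTO team_game_stats VALUES (?, ?)",
    [("BOS", 112), ("NYK", 101), ("LAL", 125), ("MIA", 98)],
)

# MAX and MIN each return a single value
hi, lo = demo.execute("SELECT MAX(pts), MIN(pts) FROM team_game_stats").fetchone()

# To see WHICH row has the top score, sort descending and keep the first row
top_row = demo.execute(
    "SELECT team, pts FROM team_game_stats ORDER BY pts DESC LIMIT 1"
).fetchone()

print(hi, lo, top_row)  # 125 98 ('LAL', 125)
demo.close()
```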

### Query 12: Highest Score Ever

What's the highest single-game point total?

**Pattern:**
```sql
SELECT MAX(column_name) FROM table
```

In [None]:
# Find maximum points scored
# MAX(pts) looks through all rows and returns the highest value
query = """
SELECT MAX(______)
FROM team_game_stats
"""

result = pd.read_sql(query, conn)
print("Highest score in database:")
display(result)

**💡 Hint:** `MAX(pts)`

### Query 13: Lowest Score Ever

What's the lowest single-game point total?

In [None]:
# Find minimum points scored
# MIN(pts) returns the smallest value
# Low scores usually mean poor offensive performance
query = """
SELECT MIN(______)
FROM team_game_stats
"""

result = pd.read_sql(query, conn)
print("Lowest score in database:")
display(result)

**💡 Hint:** `MIN(pts)`

### Query 14: Range of Turnovers

Find both the most and fewest turnovers in a single game (2021-22).

In [None]:
# Find MIN and MAX turnovers in same query
# Shows the range - helps identify outliers
query = """
SELECT MIN(tov), MAX(______)
FROM team_game_stats
WHERE season = '2021-22'
"""

result = pd.read_sql(query, conn)
print("Turnover range (min, max):")
display(result)

**💡 Hint:** `MAX(tov)` - turnovers column

---
## Part 5: Combining Aggregations

You can use multiple aggregate functions together!

### Query 15: Complete Stats Summary

Get COUNT, AVG, MIN, and MAX for points (2021-22).

In [None]:
# Comprehensive statistical summary
# This gives us a complete picture of the data
# Like running df.describe() in pandas!
query = """
SELECT 
    COUNT(*) as total_games,
    AVG(pts) as avg_points,
    MIN(______) as min_points,
    MAX(______) as max_points
FROM team_game_stats
WHERE season = '2021-22'
"""

result = pd.read_sql(query, conn)
print("2021-22 Season Points Summary:")
display(result)

**💡 Hint:** `MIN(pts)` and `MAX(pts)`

**Note:** `as avg_points` gives the column a readable name!
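
To see what the alias does, this sketch reads the output column names straight from the cursor. It runs on a toy in-memory table with made-up scores. With `pd.read_sql`, the same names would become the DataFrame's column headers.

```python
import sqlite3

# Toy in-memory table with made-up scores
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team_game_stats (pts INTEGER)")
demo.executemany("INSERT INTO team_game_stats VALUES (?)", [(112,), (101,), (125,)])

cur = demo.execute(
    "SELECT COUNT(*) AS total_games, AVG(pts) AS avg_points FROM team_game_stats"
)

# cursor.description holds one tuple per output column; item 0 is the name
column_names = [col[0] for col in cur.description]
row = cur.fetchone()

# Pair each alias with its value for easy lookup
summary = dict(zip(column_names, row))
print(summary)
demo.close()
```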

### Query 16: Win vs Loss Stats

Compare average points for wins vs losses.

In [None]:
# First, get average for WINS
query = """
SELECT AVG(pts) as avg_points_in_wins
FROM team_game_stats
WHERE season = '2021-22' AND wl = 'W'
"""

wins = pd.read_sql(query, conn)
print("Average points in WINS:")
display(wins)

# Now get average for LOSSES
query = """
SELECT AVG(pts) as avg_points_in_losses
FROM team_game_stats
WHERE season = '2021-22' AND wl = ______
"""

losses = pd.read_sql(query, conn)
print("\nAverage points in LOSSES:")
display(losses)

**💡 Hint:** `'L'` for losses

**Question:** Do winning teams score more on average? By how much?

### Query 17: Player Stats Summary

From player_season_stats, get average, min, and max points per game.

**Your turn - write the full query:**

In [None]:
# Summarize player scoring stats
# Remember: pts in player_season_stats is points PER GAME (already an average)
# So AVG(pts) here is the average of players' season averages
query = """
______
"""

result = pd.read_sql(query, conn)
print("Player scoring summary (2021-22):")
display(result)

**💡 Hint:**
```sql
SELECT 
    AVG(pts) as avg_ppg,
    MIN(pts) as min_ppg,
    MAX(pts) as max_ppg
FROM player_season_stats
WHERE season = '2021-22'
```

---
## Part 6: Practical ML Applications

### Query 18: Feature Engineering - High Scorers

For ML, we might want to know: what % of players score 15+ PPG?

In [None]:
# Count total players
query1 = """
SELECT COUNT(*) as total_players
FROM player_season_stats
WHERE season = '2021-22'
"""
total = pd.read_sql(query1, conn)

# Count high scorers (15+ PPG)
# This filter helps us create a "high scorer" feature for ML
query2 = """
SELECT COUNT(*) as high_scorers
FROM player_season_stats
WHERE season = '2021-22' AND pts >= ______
"""
high = pd.read_sql(query2, conn)

# Calculate percentage
percentage = (high['high_scorers'][0] / total['total_players'][0]) * 100
print(f"Total players: {total['total_players'][0]}")
print(f"High scorers (15+ PPG): {high['high_scorers'][0]}")
print(f"Percentage: {percentage:.1f}%")

**💡 Hint:** `pts >= 15`
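
The two-count pattern can be wrapped in a small reusable function. This is only a sketch: the helper name `pct_at_least`, the player labels, and the scoring numbers are all made up, and it runs on a toy in-memory table rather than the NBA database.

```python
import sqlite3

# Toy in-memory table with made-up players and points per game
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE player_season_stats (player TEXT, pts REAL)")
demo.executemany(
    "INSERT INTO player_season_stats VALUES (?, ?)",
    [("Player A", 27.3), ("Player B", 15.0), ("Player C", 8.4),
     ("Player D", 3.1), ("Player E", 19.8)],
)

def pct_at_least(conn, threshold):
    """Percentage of rows whose pts is at or above the threshold."""
    total = conn.execute("SELECT COUNT(*) FROM player_season_stats").fetchone()[0]
    high = conn.execute(
        "SELECT COUNT(*) FROM player_season_stats WHERE pts >= ?", (threshold,)
    ).fetchone()[0]
    if total == 0:
        return None  # avoid dividing by zero on an empty table
    return 100 * high / total

share = pct_at_least(demo, 15)
print(f"{share:.1f}%")  # 60.0%
demo.close()
```

The `?` placeholder lets you try different thresholds (10, 15, 20) without editing the SQL text.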

### Query 19: Outlier Detection

Find games that are statistical outliers (way above/below average).

In [None]:
# First, get the average
query_avg = """
SELECT AVG(pts) as avg_pts
FROM team_game_stats
WHERE season = '2021-22'
"""
avg_result = pd.read_sql(query_avg, conn)
avg_pts = avg_result['avg_pts'][0]

print(f"Average points: {avg_pts:.1f}")

# Count games 30+ points above average (outliers)
# In ML, outliers can affect model training
# The ? placeholder is filled in from params, so the threshold is not truncated
query_outliers = """
SELECT COUNT(*) as high_outliers
FROM team_game_stats
WHERE season = '2021-22' AND pts >= ?
"""
# Threshold is avg_pts + 30, passed to the query as a parameter
threshold = float(avg_pts) + 30

outliers = pd.read_sql(query_outliers, conn, params=(threshold,))
print(f"\nGames 30+ points above average: {outliers['high_outliers'][0]}")
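
The cell above counts only the high side. Here is a sketch that checks both sides on a toy in-memory table with made-up scores. The 30-point margin is the lesson's rule of thumb, not a formal statistical test.

```python
import sqlite3

# Toy in-memory table with made-up scores
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team_game_stats (pts INTEGER)")
demo.executemany(
    "INSERT INTO team_game_stats VALUES (?)",
    [(p,) for p in (110, 108, 112, 145, 75, 110)],
)

avg_pts = demo.execute("SELECT AVG(pts) FROM team_game_stats").fetchone()[0]
margin = 30  # assumed cutoff, same as the lesson's "30+ points"

# Rows at least `margin` points ABOVE the average
high = demo.execute(
    "SELECT COUNT(*) FROM team_game_stats WHERE pts >= ?", (avg_pts + margin,)
).fetchone()[0]

# Rows at least `margin` points BELOW the average
low = demo.execute(
    "SELECT COUNT(*) FROM team_game_stats WHERE pts <= ?", (avg_pts - margin,)
).fetchone()[0]

print(avg_pts, high, low)  # 110.0 1 1
demo.close()
```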

---
## Cleanup

In [None]:
# Always close database connection when finished
conn.close()
print("✅ Database connection closed")
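
If you would rather not remember to call `close()`, Python's `contextlib.closing` can do it for you when the `with` block ends. A minimal sketch on a throwaway in-memory database:

```python
import sqlite3
from contextlib import closing

# The connection is closed automatically when the with-block ends
with closing(sqlite3.connect(":memory:")) as demo:
    answer = demo.execute("SELECT 1 + 1").fetchone()[0]

# Using the connection after the block raises an error
try:
    demo.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(answer, still_open)  # 2 False
```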

---
## 🎉 Congratulations!

You learned aggregate functions! You can now:

✅ **COUNT** rows (how many?)  
✅ **SUM** values (add them up)  
✅ **AVG** values (calculate averages)  
✅ **MAX** values (find highest)  
✅ **MIN** values (find lowest)  
✅ Combine multiple aggregations  
✅ Use aggregations for ML feature engineering  

---

## 📝 Key Takeaways

**Aggregate Functions:**
- `COUNT(*)` - counts rows
- `SUM(column)` - adds up values
- `AVG(column)` - calculates mean
- `MAX(column)` - finds highest
- `MIN(column)` - finds lowest

**Remember:**
- Aggregations work on MULTIPLE rows, return ONE result
- Use `as column_name` to rename output columns
- Combine with WHERE to aggregate filtered data
- Great for feature engineering in ML!
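
The "many rows in, one row out" rule holds even when the filter matches nothing. In this sketch (toy in-memory table, made-up scores, and a season label chosen so that nothing matches), the query still returns exactly one row, with `COUNT(*)` equal to 0 and `AVG` coming back as `None`.

```python
import sqlite3

# Toy in-memory table with made-up scores
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team_game_stats (season TEXT, pts INTEGER)")
demo.executemany(
    "INSERT INTO team_game_stats VALUES (?, ?)",
    [("2021-22", 112), ("2021-22", 101), ("2021-22", 125)],
)

query = "SELECT COUNT(*), AVG(pts) FROM team_game_stats WHERE season = ?"

matched = demo.execute(query, ("2021-22",)).fetchall()
unmatched = demo.execute(query, ("1999-00",)).fetchall()  # no rows for this label

print(len(matched), matched[0][0])  # 1 3
print(unmatched)                    # [(0, None)]
demo.close()
```

A `None` average is worth checking for before doing arithmetic with the result.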

---

## Next Steps

In **ai04b Part 2**, you'll learn:
- **GROUP BY** - aggregate by categories
- **HAVING** - filter aggregated results
- Create team performance summaries
- Export to Excel for analysis

**Great work!** 📊🏀