# AI Lesson 04b Part 1: Aggregate Functions

**Course:** Applications of Artificial Intelligence  
**Focus:** Counting, Summing, and Calculating Averages with SQL  

---

## Learning Objectives

Master SQL aggregate functions to summarize data:
- `COUNT()` - Count rows
- `SUM()` - Add up values
- `AVG()` - Calculate averages
- `MAX()` - Find highest value
- `MIN()` - Find lowest value

**Why this matters for ML:**
- Calculate team statistics for features
- Find outliers using max/min
- Create summary statistics for model training

---

## Setup

In [26]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('nba_5seasons.db')
print("✅ Connected to NBA database")

✅ Connected to NBA database


---
## Part 1: COUNT - Counting Rows

**Pattern:**
```sql
SELECT COUNT(*) FROM table_name
SELECT COUNT(*) FROM table_name WHERE condition
```
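
The two forms of the pattern can be tried out on a small throwaway table first. This is a minimal sketch that assumes a hypothetical in-memory table named `toy_games` (not part of the course database) and uses a separate connection named `demo` so it does not interfere with `conn`:

```python
import sqlite3

# Hypothetical toy table, NOT the course database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_games (team TEXT, wl TEXT, pts INTEGER)")
demo.executemany(
    "INSERT INTO toy_games VALUES (?, ?, ?)",
    [("A", "W", 112), ("B", "L", 104), ("A", "L", 98), ("C", "W", 121)],
)

# COUNT(*) counts every row in the table
total_rows = demo.execute("SELECT COUNT(*) FROM toy_games").fetchone()[0]

# Adding WHERE counts only the rows that match the condition
win_rows = demo.execute(
    "SELECT COUNT(*) FROM toy_games WHERE wl = 'W'"
).fetchone()[0]

print(total_rows, win_rows)  # 4 2
demo.close()
```

The same two shapes apply to the queries in this part; only the table name and the condition change.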

### Query 1: Count All Teams

How many NBA teams are in the database?

**Table:** `teams`

In [None]:
query = """

"""

result = pd.read_sql(query, conn)
display(result)

**Expected:** 30 teams

### Query 2: Count Wins in 2021-22

How many wins were recorded in the 2021-22 season?

**Table:** `team_game_stats`  
**Hint:** Filter WHERE `season = '2021-22'` AND `wl = 'W'`

In [None]:
query = """

"""

result = pd.read_sql(query, conn)
display(result)

**Expected:** 1,230 (one win per game, so half of the 2,460 team-game rows)

### Query 3: Count High-Scoring Games

How many team performances had 120+ points in the '2021-22' season?

**Table:** `team_game_stats`  
**Hint:** WHERE `pts >= 120` AND season _______

In [None]:
query = """

"""

result = pd.read_sql(query, conn)
display(result)

---
## Part 2: SUM - Adding Values

**Pattern:**
```sql
SELECT SUM(column_name) FROM table_name
SELECT SUM(column_name) FROM table_name WHERE condition
```
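
As a rough illustration of the pattern, the sketch below sums a column on a hypothetical in-memory table (`toy_games`, invented for this example). It also shows one behavior worth knowing: when no rows match the `WHERE` condition, `SUM()` returns `NULL`, which Python receives as `None`.

```python
import sqlite3

# Hypothetical toy table, NOT the course database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_games (team TEXT, wl TEXT, pts INTEGER)")
demo.executemany(
    "INSERT INTO toy_games VALUES (?, ?, ?)",
    [("A", "W", 112), ("B", "L", 104), ("A", "L", 98), ("C", "W", 121)],
)

# SUM over every row
total_pts = demo.execute("SELECT SUM(pts) FROM toy_games").fetchone()[0]

# SUM over only the rows that match the condition
win_pts = demo.execute(
    "SELECT SUM(pts) FROM toy_games WHERE wl = 'W'"
).fetchone()[0]

# No rows match here, so SUM returns NULL (None in Python)
no_match = demo.execute(
    "SELECT SUM(pts) FROM toy_games WHERE wl = 'T'"
).fetchone()[0]

print(total_pts, win_pts, no_match)  # 435 233 None
demo.close()
```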

### Query 4: Total Points in 2021-22

Sum ALL points scored across the 2021-22 season.

**Table:** `team_game_stats`  
**Hint:** SUM the `pts` column, filter to season '2021-22'

In [None]:
query = """

"""

result = pd.read_sql(query, conn)
display(result)

**Expected:** ~270,000 total points

---
## Part 3: AVG - Calculate Averages

**Pattern:**
```sql
SELECT AVG(column_name) AS avg_points  -- AS names the output column `avg_points`
FROM table_name

SELECT AVG(column_name) 
FROM table_name 
WHERE condition
```
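
To see what the `AS` alias changes, here is a small sketch on a hypothetical in-memory table (`toy_games`, invented for this example). It reads the output column name from the cursor and shows that `AVG()` returns a floating-point value even when the column holds integers.

```python
import sqlite3

# Hypothetical toy table, NOT the course database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_games (team TEXT, wl TEXT, pts INTEGER)")
demo.executemany(
    "INSERT INTO toy_games VALUES (?, ?, ?)",
    [("A", "W", 112), ("B", "L", 104), ("A", "L", 98), ("C", "W", 121)],
)

# The alias becomes the name of the output column
cur = demo.execute("SELECT AVG(pts) AS avg_points FROM toy_games")
column_name = cur.description[0][0]
avg_all = cur.fetchone()[0]

# Same aggregate, restricted by a WHERE condition
avg_wins = demo.execute(
    "SELECT AVG(pts) AS avg_points FROM toy_games WHERE wl = 'W'"
).fetchone()[0]

print(column_name, avg_all, avg_wins)  # avg_points 108.75 116.5
demo.close()
```

With `pd.read_sql`, the alias shows up as the DataFrame column header, which makes results easier to read and to reference later.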

### Query 5: Average Points Per Game (2021-22)

What is the average number of points a team scores per game?

**Table:** `team_game_stats`  
**Hint:** AVG the `pts` column for season '2021-22'

In [None]:
query = """

"""

result = pd.read_sql(query, conn)
display(result)

**Expected:** ~109-110 points average

### Query 6: Compare Win vs Loss Averages

Do winning teams score more points on average?

Calculate TWO averages:
1. Average points in WINS (wl = 'W')
2. Average points in LOSSES (wl = 'L')

**Table:** `team_game_stats`  
**Season:** '2021-22'

In [None]:
# Average in WINS
query_wins = """

"""

wins = pd.read_sql(query_wins, conn)
print("Average points in WINS:")
display(wins)

# Average in LOSSES
query_losses = """

"""

losses = pd.read_sql(query_losses, conn)
print("\nAverage points in LOSSES:")
display(losses)

**Question:** What's the difference? Do winners score ~15-20 more points?

---
## Part 4: MAX and MIN - Find Extremes

**Pattern:**
```sql
SELECT MAX(column_name), MIN(column_name) FROM table_name
```
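
Both aggregates can sit in the same `SELECT`, and the result is a single row with one column per aggregate. A minimal sketch, again assuming a hypothetical in-memory table named `toy_games`:

```python
import sqlite3

# Hypothetical toy table, NOT the course database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_games (team TEXT, wl TEXT, pts INTEGER)")
demo.executemany(
    "INSERT INTO toy_games VALUES (?, ?, ?)",
    [("A", "W", 112), ("B", "L", 104), ("A", "L", 98), ("C", "W", 121)],
)

# One row comes back, holding both extremes
highest, lowest = demo.execute(
    "SELECT MAX(pts), MIN(pts) FROM toy_games"
).fetchone()

print(highest, lowest)  # 121 98
demo.close()
```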

### Query 7: Highest and Lowest Scores

Find the highest and lowest scores in 2021-22.

**Table:** `team_game_stats`  
**Hint:** Use both MAX(pts) and MIN(pts) in same query

In [None]:
query = """

"""

result = pd.read_sql(query, conn)
display(result)

---
## Part 5: Combining Aggregations

### Query 8: Complete Statistical Summary

Get COUNT, AVG, MIN, and MAX for points in 2021-22.

**Table:** `team_game_stats`  
**Hint:** Use all aggregates together, give each an alias with `AS`

In [None]:
query = """
SELECT 
    ____(*) as total_team_games,  -- one row per team per game, so this is twice the number of games played
    ____(pts) as avg_points,
    ____(pts) as min_points,
    ____(pts) as max_points
FROM ____
WHERE season = '____'
"""

result = pd.read_sql(query, conn)
print("2021-22 Season Points Summary:")
display(result)

---
## Part 6: Practice with Player Data

### Query 9: Player Scoring Summary

The SQL is provided. Run the query, then answer the questions that follow.

From `player_season_stats`, calculate:
- Average PPG (points per game)
- Highest PPG
- Lowest PPG

**Table:** `player_season_stats`  
**Season:** '2021-22'  
**Note:** `pts` in this table is the season total, so the query divides by `gp` to get a per-game value

In [30]:
query = """
SELECT
    ROUND(AVG(pts * 1.0 / gp), 2) AS avg_ppg,
    ROUND(MAX(pts * 1.0 / gp), 2) AS highest_ppg,
    ROUND(MIN(pts * 1.0 / gp), 2) AS lowest_ppg
FROM player_season_stats
WHERE season = '2021-22'
  AND gp > 0
"""
result = pd.read_sql(query, conn)
print("Player Scoring Summary (2021-22):")
display(result)


Player Scoring Summary (2021-22):

|   | avg_ppg | highest_ppg | lowest_ppg |
|---|---------|-------------|------------|
| 0 | 8.24 | 30.57 | 0.0 |
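
One detail of the query is the `* 1.0` factor. In SQLite, dividing one integer by another discards the fractional part, so multiplying by `1.0` first forces real-number division. The sketch below shows the difference with literal values. Whether `pts` and `gp` are actually stored as integers in the course database is an assumption here.

```python
import sqlite3

# In-memory connection used only to evaluate two expressions
demo = sqlite3.connect(":memory:")

# Integer / integer truncates; multiplying by 1.0 first keeps the fraction
int_div, real_div = demo.execute("SELECT 7 / 2, 7 * 1.0 / 2").fetchone()

print(int_div, real_div)  # 3 3.5
demo.close()
```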


### Questions for Query 9: 
a. What does `pts * 1.0 / gp` represent? <br>
b. Why does the query filter on `gp > 0`? <br>
c. What do `AVG`, `MAX`, and `MIN` calculate in this query? <br>
d. What does `ROUND(..., 2)` do to the results? <br>

### Your answers for Query 9 Questions:
a.      <br>
b.      <br>
c.      <br>
d.      <br>

---
## Part 7: ML Feature Engineering

Most of the answer is provided in this part.

### Query 10: Identify Elite Scorers

For ML classification, we might create a "high scorer" feature.

Find:
1. Total number of players
2. Number scoring 20+ PPG
3. Calculate percentage

**Table:** `player_season_stats`  
**Season:** '2021-22'

### What This Query Does

This query counts unique players in a season and identifies how many averaged 20+ points per game.

It uses a `CASE WHEN` statement (similar to an IF statement) to turn the scoring condition into a 1 (meets the condition) or 0 (does not). These values are summed to count elite scorers, then divided by the total number of players to calculate a percentage.
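
The conditional-count idea can be sketched on a tiny table before running it on real data. The example below assumes a hypothetical in-memory table `toy_players` with a per-game column named `ppg`; both names are invented for illustration and are not part of the course database.

```python
import sqlite3

# Hypothetical toy table, NOT the course database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_players (player_id INTEGER, ppg REAL)")
demo.executemany(
    "INSERT INTO toy_players VALUES (?, ?)",
    [(1, 25.3), (2, 8.1), (3, 20.0), (4, 14.9)],
)

# CASE WHEN yields 1 or 0 per row; SUM adds the 1s, which counts matches
total, elite, pct = demo.execute(
    """
    SELECT
        COUNT(*) AS total_players,
        SUM(CASE WHEN ppg >= 20 THEN 1 ELSE 0 END) AS elite_players,
        ROUND(100.0 * SUM(CASE WHEN ppg >= 20 THEN 1 ELSE 0 END) / COUNT(*), 1)
            AS elite_percentage
    FROM toy_players
    """
).fetchone()

print(total, elite, pct)  # 4 2 50.0
demo.close()
```

Writing `100.0` rather than `100` keeps the division in floating point, so the percentage is not truncated to a whole number.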

In [None]:
query = """
SELECT
    COUNT(DISTINCT player_id) AS total_players,
    -- as in Query 9, pts / gp gives the per-game value to compare against 20
    SUM(CASE WHEN pts * 1.0 / gp >= 20 THEN 1 ELSE 0 END) AS elite_players,
    ROUND(
        100.0 * SUM(CASE WHEN pts * 1.0 / gp >= 20 THEN 1 ELSE 0 END)
        / COUNT(DISTINCT player_id),
        1
    ) AS elite_percentage
FROM player_season_stats
WHERE season = '2021-22'
"""

result = pd.read_sql(query, conn)
display(result)

### Questions for Query 10: 
a. What does `COUNT(DISTINCT player_id)` count, and why is `DISTINCT` necessary? <br>
b. What does the `CASE WHEN pts * 1.0 / gp >= 20 THEN 1 ELSE 0 END` logic represent? <br>
c. Why is `SUM()` used with the `CASE WHEN` statement? <br>
d. How is the final percentage calculated? <br>

### Your answers for Query 10 Questions:
a.      <br>
b.      <br>
c.      <br>
d.      <br>

---
## Export Data for Excel Analysis

### Export 1: Season Statistics Summary

Create a comprehensive stats summary and export to Excel.

In [None]:
query = """
SELECT 
    'All Games' as category,
    COUNT(*) as game_count,  -- counts team-game rows (two per game)
    AVG(pts) as avg_points,
    AVG(reb) as avg_rebounds,
    AVG(ast) as avg_assists,
    MAX(pts) as highest_score,
    MIN(pts) as lowest_score
FROM team_game_stats
WHERE season = '2021-22'
"""

stats_summary = pd.read_sql(query, conn)

# Export to Excel (pandas needs an Excel writer engine such as openpyxl installed)
stats_summary.to_excel('nba_2021-22_summary.xlsx', 
                       index=False, 
                       sheet_name='Season Stats')

print("✅ Exported to 'nba_2021-22_summary.xlsx'")
display(stats_summary)

---
## Cleanup

In [None]:
conn.close()
print("✅ Database connection closed")

---
## Summary

### You Learned:
✅ `COUNT()` - Count rows  
✅ `SUM()` - Add up values  
✅ `AVG()` - Calculate averages  
✅ `MAX()` / `MIN()` - Find extremes  
✅ Combine multiple aggregations  
✅ Export results to Excel  

### Key Patterns:
```sql
-- Basic aggregation
SELECT COUNT(*), AVG(column) FROM table

-- With filtering
SELECT SUM(column) FROM table WHERE condition

-- Multiple aggregates
SELECT 
    COUNT(*) as total,
    AVG(col1) as average,
    MAX(col2) as maximum
FROM table
WHERE condition
```

### Next: Part 2
Learn GROUP BY to aggregate by categories (teams, seasons, etc.)