# AI Lesson 04a: SQL for Machine Learning (INSTRUCTOR SOLUTIONS)

## Database Applications Development

**Purpose:** Learn SQL to efficiently work with real-world datasets for machine learning

### Why We're Learning SQL in an AI Course

**The Reality:**
- Most ML data doesn't come from CSV files
- Companies store millions/billions of records in databases
- SQL lets you **filter data BEFORE loading** (saves memory!)
- SQL lets you **combine multiple data sources**
- SQL lets you **create new features** through aggregation

**Today's Dataset:** NBA basketball statistics (5 seasons of data)

**What You'll Learn:**
- Connect to SQLite database
- Write SELECT queries to extract data
- Use WHERE to filter records
- Use ORDER BY to sort results
- Use LIMIT to control output size
- Load SQL results into pandas for ML

Let's get started!

---

## Part 1: Setup and Database Connection

In [None]:
# Import the libraries we need
import pandas as pd
import sqlite3

# Connect to the NBA database
conn = sqlite3.connect('nba_5seasons.db')

print("✅ Connected to NBA database successfully!")
print("\nThis database contains:")
print("  - Team information (30 NBA teams)")
print("  - Player information (hundreds of players)")
print("  - Team game statistics (thousands of games)")
print("  - Player season statistics (multiple seasons)")

### What Just Happened?

- `import pandas as pd` - Brings in pandas for DataFrame operations
- `import sqlite3` - Brings in SQLite database functionality
- `sqlite3.connect()` - Opens a connection to our database file
- `conn` - This connection object lets us send SQL queries

**Important:** The database file (`nba_5seasons.db`) must be in the same folder as this notebook!
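
One reason the file location matters: `sqlite3.connect()` does not raise an error when the file is missing. It creates a new, empty database at that path, so the "connected" message would still print and the first query would fail with a "no such table" error. A minimal sketch of a guard follows; the helper name `connect_existing` is hypothetical and not part of this lesson's code.

```python
import sqlite3
from pathlib import Path


def connect_existing(db_path):
    """Open an SQLite database only if the file already exists.

    Hypothetical helper for illustration: sqlite3.connect() on its own
    would silently create a new empty file when the path is wrong.
    """
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"Database file not found: {path.resolve()}")
    return sqlite3.connect(str(path))
```

In the notebook this could be used as `conn = connect_existing('nba_5seasons.db')` in place of the plain `sqlite3.connect()` call.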

---

## Part 2: Exploring the Database Structure

In [None]:
# Let's see what tables are in our database
query = """
SELECT name 
FROM sqlite_master 
WHERE type='table'
ORDER BY name
"""

tables = pd.read_sql(query, conn)
print("Tables in our database:")
display(tables)

# Note: sqlite_master is a special system table that stores metadata

**Expected Tables:**
- `players` - Player names and IDs
- `teams` - Team information
- `player_season_stats` - Player statistics by season
- `team_game_stats` - Team performance in each game

In [None]:
# Let's look at the structure of the teams table
query = "PRAGMA table_info(teams)"
team_structure = pd.read_sql(query, conn)
print("Teams table structure:")
display(team_structure)

# PRAGMA is a special SQLite command to get table metadata
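
The two exploration steps above (list the tables, then inspect one table's columns) can be combined into a single schema overview. The sketch below runs against a toy in-memory database, since the real contents of `nba_5seasons.db` are not reproduced here; the two-table schema is an assumption made for illustration.

```python
import sqlite3

# Toy in-memory database standing in for nba_5seasons.db (assumed schema).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, full_name TEXT, state TEXT)")
demo.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, full_name TEXT)")

# Step 1: list the tables with the same sqlite_master query used above.
table_names = [row[0] for row in demo.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]

# Step 2: for each table, collect (column name, declared type) pairs.
# PRAGMA table_info returns rows of (cid, name, type, notnull, dflt_value, pk).
schema = {}
for name in table_names:
    columns = demo.execute(f"PRAGMA table_info({name})").fetchall()
    schema[name] = [(col[1], col[2]) for col in columns]

for name, columns in schema.items():
    print(name, columns)

demo.close()
```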

---

## Part 3: Your First SQL Query - SELECT and FROM

In [None]:
# Get all columns from the teams table
query = """
SELECT *
FROM teams
"""

all_teams = pd.read_sql(query, conn)
print("All NBA Teams:")
display(all_teams)

print(f"\nTotal teams: {len(all_teams)}")

### SQL Breakdown:

```sql
SELECT *          -- * means "all columns"
FROM teams        -- teams is the table name
```

**Translation:** "Give me all columns from the teams table"

**pandas equivalent:** `pd.read_csv('teams.csv')`

In [None]:
# Select only specific columns
query = """
SELECT full_name, city, state, year_founded
FROM teams
"""

team_basics = pd.read_sql(query, conn)
print("Team Basic Information:")
display(team_basics.head(10))

### Key Point:

- When you specify columns, separate them with commas
- Order matters - columns appear in the order you list them
- **pandas equivalent:** `df[['full_name', 'city', 'state', 'year_founded']]`

---

## Part 4: WHERE Clause - Filtering Data

In [None]:
# Find teams from California
query = """
SELECT full_name, city, state
FROM teams
WHERE state = 'California'
"""

ca_teams = pd.read_sql(query, conn)
print("California NBA Teams:")
display(ca_teams)

print(f"\nNumber of teams in California: {len(ca_teams)}")

### WHERE Clause Rules:

- Filters rows BEFORE returning results
- Text values need **single quotes**: `'California'`
- Use `=` for "equals" (NOT `==` like in Python!)
- **pandas equivalent:** `df[df['state'] == 'California']`
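
When the value to filter on comes from a Python variable, one common approach is a `?` placeholder plus the `params` argument of `pd.read_sql`, rather than pasting the value into the query string. The sketch below uses a toy in-memory `teams` table with made-up rows; only the column names are taken from this lesson.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the teams table (rows are illustrative only).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE teams (full_name TEXT, city TEXT, state TEXT)")
demo.executemany(
    "INSERT INTO teams VALUES (?, ?, ?)",
    [("Team A", "City A", "California"),
     ("Team B", "City B", "Texas"),
     ("Team C", "City C", "California")],
)

# The ? placeholder is filled in by the database driver, so the value
# does not need to be quoted or escaped by hand.
query = "SELECT full_name, city, state FROM teams WHERE state = ?"
state_to_find = "California"
ca_demo = pd.read_sql(query, demo, params=(state_to_find,))
print(ca_demo)

demo.close()
```

This keeps the single-quote rule out of your Python code entirely and avoids problems with values that themselves contain quotes.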

In [None]:
# Find older teams (founded before 1980)
query = """
SELECT full_name, city, year_founded
FROM teams
WHERE year_founded < 1980
"""

older_teams = pd.read_sql(query, conn)
print("Teams Founded Before 1980:")
display(older_teams)

# Numbers don't need quotes!

### Comparison Operators:

| Operator | Meaning | Example |
|----------|---------|----------|
| `=` | Equal to | `WHERE state = 'Texas'` |
| `!=` or `<>` | Not equal | `WHERE state != 'California'` |
| `>` | Greater than | `WHERE year_founded > 1990` |
| `<` | Less than | `WHERE year_founded < 1980` |
| `>=` | Greater or equal | `WHERE year_founded >= 1995` |
| `<=` | Less or equal | `WHERE year_founded <= 1970` |

In [None]:
# Multiple conditions with AND
query = """
SELECT full_name, city, state, year_founded
FROM teams
WHERE state = 'California' AND year_founded < 1970
"""

result = pd.read_sql(query, conn)
print("Old California Teams:")
display(result)

# Both conditions must be true

In [None]:
# Multiple conditions with OR
query = """
SELECT full_name, city, state
FROM teams
WHERE state = 'California' OR state = 'Texas'
"""

result = pd.read_sql(query, conn)
print("Teams from California OR Texas:")
display(result)

# At least one condition must be true
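
One thing to watch when mixing `AND` and `OR` in the same WHERE clause: SQL evaluates `AND` before `OR`, so parentheses may be needed to get the grouping you intend. A small sketch on a toy table (rows are made up for illustration):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE teams (full_name TEXT, state TEXT, year_founded INTEGER)")
demo.executemany(
    "INSERT INTO teams VALUES (?, ?, ?)",
    [("Team A", "California", 1950),
     ("Team B", "California", 1990),
     ("Team C", "Texas", 1960),
     ("Team D", "Texas", 1995)],
)

# Without parentheses: read as  California OR (Texas AND founded before 1970)
no_parens = demo.execute("""
    SELECT full_name FROM teams
    WHERE state = 'California' OR state = 'Texas' AND year_founded < 1970
    ORDER BY full_name
""").fetchall()

# With parentheses: (California OR Texas) AND founded before 1970
with_parens = demo.execute("""
    SELECT full_name FROM teams
    WHERE (state = 'California' OR state = 'Texas') AND year_founded < 1970
    ORDER BY full_name
""").fetchall()

print(no_parens)    # includes Team B, a California team founded in 1990
print(with_parens)  # only teams founded before 1970

demo.close()
```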

---

## Part 5: ORDER BY - Sorting Results

In [None]:
# Sort teams by year founded (oldest first)
query = """
SELECT full_name, city, year_founded
FROM teams
ORDER BY year_founded
"""

teams_by_age = pd.read_sql(query, conn)
print("Teams Sorted by Age (Oldest First):")
display(teams_by_age.head(10))

# Default sort order is ascending (low to high)

In [None]:
# Sort teams by year founded (newest first)
query = """
SELECT full_name, city, year_founded
FROM teams
ORDER BY year_founded DESC
"""

teams_newest = pd.read_sql(query, conn)
print("Teams Sorted by Age (Newest First):")
display(teams_newest.head(10))

# DESC means descending (high to low)

### ORDER BY Options:

- `ORDER BY column` - Sort ascending (default)
- `ORDER BY column ASC` - Sort ascending (explicit)
- `ORDER BY column DESC` - Sort descending
- **pandas equivalent:** `df.sort_values('column')`

In [None]:
# Sort by multiple columns
query = """
SELECT full_name, city, state, year_founded
FROM teams
ORDER BY state, year_founded
"""

teams_multi_sort = pd.read_sql(query, conn)
print("Teams Sorted by State, then by Year:")
display(teams_multi_sort.head(15))

# First sorts by state (alphabetically), then by year within each state
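
Each column in a multi-column `ORDER BY` can have its own direction. The sketch below sorts a toy table by state (A to Z) and then newest team first within each state, and shows what is likely the closest pandas equivalent. The rows are made up for illustration.

```python
import sqlite3
import pandas as pd

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE teams (full_name TEXT, state TEXT, year_founded INTEGER)")
demo.executemany(
    "INSERT INTO teams VALUES (?, ?, ?)",
    [("Team A", "Texas", 1980),
     ("Team B", "California", 1948),
     ("Team C", "Texas", 1967),
     ("Team D", "California", 1970)],
)

# SQL: state A-Z, then newest team first within each state.
sql_sorted = pd.read_sql(
    "SELECT full_name, state, year_founded FROM teams "
    "ORDER BY state ASC, year_founded DESC",
    demo,
)

# pandas: the same ordering, using one ascending flag per sort column.
all_rows = pd.read_sql("SELECT full_name, state, year_founded FROM teams", demo)
pandas_sorted = all_rows.sort_values(
    ["state", "year_founded"], ascending=[True, False]
).reset_index(drop=True)

print(sql_sorted)

demo.close()
```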

---

## Part 6: LIMIT - Controlling Output Size

In [None]:
# Get first 5 teams
query = """
SELECT full_name, city, state
FROM teams
LIMIT 5
"""

first_five = pd.read_sql(query, conn)
print("First 5 Teams:")
display(first_five)

# LIMIT goes at the very end of your query

In [None]:
# Top 5 oldest teams
query = """
SELECT full_name, city, year_founded
FROM teams
ORDER BY year_founded
LIMIT 5
"""

oldest_five = pd.read_sql(query, conn)
print("Top 5 Oldest NBA Teams:")
display(oldest_five)

# ORDER BY happens first, THEN LIMIT picks the first N

### Key Points about LIMIT:

- Always goes at the end of your query
- Useful for testing queries on large datasets
- **pandas equivalent:** `df.head(5)`
- Combine with ORDER BY for "Top N" queries
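
To make the "Top N" point concrete, the sketch below contrasts `LIMIT` on its own with `ORDER BY` plus `LIMIT` on a toy table (made-up rows). Without `ORDER BY`, SQL makes no promise about which rows come back first.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE teams (full_name TEXT, year_founded INTEGER)")
demo.executemany(
    "INSERT INTO teams VALUES (?, ?)",
    [("Team A", 1989), ("Team B", 1946), ("Team C", 1970), ("Team D", 1949)],
)

# LIMIT alone: some 2 rows, in whatever order the database returns them.
any_two = demo.execute("SELECT full_name FROM teams LIMIT 2").fetchall()

# ORDER BY + LIMIT: sort first, then keep the first 2 -> the 2 oldest teams.
oldest_two = demo.execute(
    "SELECT full_name, year_founded FROM teams ORDER BY year_founded LIMIT 2"
).fetchall()

print(any_two)
print(oldest_two)

demo.close()
```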

---

## Part 7: Working with Game Data

In [None]:
# Let's peek at the team_game_stats table structure
query = "PRAGMA table_info(team_game_stats)"
game_structure = pd.read_sql(query, conn)
print("Team Game Stats Columns:")
display(game_structure)

In [None]:
# Get a sample of game data
query = """
SELECT season, game_id, team_id, game_date, pts, wl
FROM team_game_stats
LIMIT 10
"""

sample_games = pd.read_sql(query, conn)
print("Sample Game Data:")
display(sample_games)

In [None]:
# Find high-scoring games (120+ points) from 2021-22 season
query = """
SELECT game_date, team_id, pts, wl
FROM team_game_stats
WHERE season = '2021-22' AND pts >= 120
ORDER BY pts DESC
LIMIT 10
"""

high_scoring = pd.read_sql(query, conn)
print("Top 10 Highest-Scoring Team Performances (2021-22):")
display(high_scoring)

print(f"\nHighest score: {high_scoring['pts'].max()} points")

### Why This Matters for ML:

Instead of loading ALL games into memory:
```python
# BAD - loads everything!
df = pd.read_csv('all_nba_games.csv')  # Maybe 100,000 rows
df = df[(df['season'] == '2021-22') & (df['pts'] >= 120)]
```

We use SQL to filter FIRST:
```python
# GOOD - only loads what we need!
query = "SELECT * FROM team_game_stats WHERE season = '2021-22' AND pts >= 120"
df = pd.read_sql(query, conn)  # Maybe only 200 rows
```

**This is how real data scientists work with large datasets!**
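
The difference can be measured directly. The sketch below builds a toy `team_game_stats` table with 1,000 generated rows (synthetic values, not real NBA data) and compares how many rows each approach pulls into pandas.

```python
import sqlite3
import pandas as pd

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team_game_stats (season TEXT, team_id INTEGER, pts INTEGER)")
# Synthetic rows: alternating seasons, scores cycling between 90 and 134.
rows = [("2020-21" if i % 2 else "2021-22", i % 30, 90 + i % 45) for i in range(1000)]
demo.executemany("INSERT INTO team_game_stats VALUES (?, ?, ?)", rows)

# Approach 1: load every row, then filter in pandas.
everything = pd.read_sql("SELECT * FROM team_game_stats", demo)
filtered_in_pandas = everything[
    (everything["season"] == "2021-22") & (everything["pts"] >= 120)
]

# Approach 2: let the database filter, so only matching rows are loaded.
filtered_in_sql = pd.read_sql(
    "SELECT * FROM team_game_stats WHERE season = '2021-22' AND pts >= 120",
    demo,
)

print(f"Rows loaded without WHERE: {len(everything)}")
print(f"Rows loaded with WHERE:    {len(filtered_in_sql)}")

demo.close()
```

Both approaches end with the same rows; the second one simply never brings the unwanted rows into memory.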

---

## Part 8: Putting It All Together - Complete Queries

In [None]:
# Complex query: Find wins by teams that scored 100-110 points in 2021-22
# (LIMIT 20 keeps only the 20 highest-scoring matches)
query = """
SELECT game_date, team_id, pts, wl
FROM team_game_stats
WHERE season = '2021-22'
  AND pts BETWEEN 100 AND 110
  AND wl = 'W'
ORDER BY pts DESC
LIMIT 20
"""

moderate_wins = pd.read_sql(query, conn)
print("Wins with 100-110 Points (2021-22):")
display(moderate_wins)

print(f"\nGames returned (capped at 20 by LIMIT): {len(moderate_wins)}")

### BETWEEN Operator:

- `WHERE pts BETWEEN 100 AND 110` is the same as `WHERE pts >= 100 AND pts <= 110`
- More readable for range queries
- Includes both endpoints (100 and 110)
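
A quick way to confirm the inclusive endpoints is to run both forms against a toy table whose values sit right on and just outside the boundaries (values chosen for illustration):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team_game_stats (pts INTEGER)")
demo.executemany(
    "INSERT INTO team_game_stats VALUES (?)",
    [(99,), (100,), (105,), (110,), (111,)],
)

with_between = demo.execute(
    "SELECT pts FROM team_game_stats WHERE pts BETWEEN 100 AND 110 ORDER BY pts"
).fetchall()
with_comparisons = demo.execute(
    "SELECT pts FROM team_game_stats WHERE pts >= 100 AND pts <= 110 ORDER BY pts"
).fetchall()

# Both endpoints (100 and 110) are kept; 99 and 111 are filtered out.
print(with_between)
print(with_between == with_comparisons)

demo.close()
```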

In [None]:
# Query for ML feature engineering
# Get all game stats we might need for a win prediction model
query = """
SELECT 
    team_id,
    pts,
    fgm,
    fga,
    fg3m,
    fg3a,
    reb,
    ast,
    stl,
    blk,
    tov,
    wl
FROM team_game_stats
WHERE season = '2021-22'
LIMIT 100
"""

ml_ready_data = pd.read_sql(query, conn)
print("Data Ready for Machine Learning:")
display(ml_ready_data.head())

print(f"\nShape: {ml_ready_data.shape}")
print(f"Features: {list(ml_ready_data.columns)}")
print(f"\nWin rate in this sample: {(ml_ready_data['wl'] == 'W').mean():.2%}")

### This DataFrame is Ready for ML!

Next steps would be:
1. Encode `wl` as binary (W=1, L=0)
2. Split into X (features) and y (target)
3. Train/test split
4. Train decision tree model
5. Predict game outcomes!

**We'll do this in upcoming lessons!**
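
As a preview, steps 1 to 3 can be sketched with pandas alone. The DataFrame below is a tiny hand-made stand-in for `ml_ready_data` with illustrative values, and the `win` column name is an assumption. In practice scikit-learn's `train_test_split` is the more usual tool for step 3; model training (steps 4 and 5) is left to the later lessons.

```python
import pandas as pd

# Tiny hand-made stand-in for ml_ready_data (values are illustrative only).
games = pd.DataFrame({
    "pts": [112, 98, 120, 101, 95, 130, 108, 99],
    "reb": [44, 39, 50, 41, 38, 52, 45, 40],
    "ast": [25, 20, 30, 22, 18, 33, 24, 21],
    "wl":  ["W", "L", "W", "L", "L", "W", "W", "L"],
})

# Step 1: encode the target as binary (W=1, L=0).
games["win"] = (games["wl"] == "W").astype(int)

# Step 2: split into X (features) and y (target).
X = games.drop(columns=["wl", "win"])
y = games["win"]

# Step 3: simple 75/25 train/test split using pandas sampling.
# random_state fixes the shuffle so the split is reproducible.
train_idx = X.sample(frac=0.75, random_state=0).index
X_train, y_train = X.loc[train_idx], y.loc[train_idx]
X_test, y_test = X.drop(train_idx), y.drop(train_idx)

print(X_train.shape, X_test.shape)
```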

---

## Part 9: Query Structure Summary

### Complete SQL Query Template:

```sql
SELECT column1, column2, column3    -- What columns to show
FROM table_name                     -- Which table to use
WHERE condition                     -- Filter rows (optional)
ORDER BY column                     -- Sort results (optional)
LIMIT number                        -- Limit results (optional)
```

### Order MATTERS!

1. SELECT (what to show)
2. FROM (which table)
3. WHERE (filter rows)
4. ORDER BY (sort)
5. LIMIT (limit results)

**You cannot put LIMIT before ORDER BY!**
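
To see what happens when the order is wrong, the sketch below runs a misordered query against an empty toy table and captures the error SQLite raises:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE teams (full_name TEXT, year_founded INTEGER)")

# Clauses in the wrong order: LIMIT placed before ORDER BY.
bad_query = "SELECT full_name FROM teams LIMIT 5 ORDER BY year_founded"

try:
    demo.execute(bad_query)
    error_message = None
except sqlite3.OperationalError as exc:
    # SQLite reports a syntax error at the misplaced keyword.
    error_message = str(exc)

print(error_message)

# The same clauses in the correct order run without complaint.
good_rows = demo.execute(
    "SELECT full_name FROM teams ORDER BY year_founded LIMIT 5"
).fetchall()

demo.close()
```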

---

## Part 10: SQL vs. Pandas Comparison

### Same Operations, Different Syntax:

| Operation | Pandas | SQL |
|-----------|--------|-----|
| Load all data | `pd.read_csv('file.csv')` | `SELECT * FROM table` |
| Select columns | `df[['col1', 'col2']]` | `SELECT col1, col2 FROM table` |
| Filter rows | `df[df['pts'] > 100]` | `WHERE pts > 100` |
| Sort | `df.sort_values('pts')` | `ORDER BY pts` |
| Sort descending | `df.sort_values('pts', ascending=False)` | `ORDER BY pts DESC` |
| First N rows | `df.head(10)` | `LIMIT 10` |
| Multiple conditions (AND) | `df[(df['pts'] > 100) & (df['wl'] == 'W')]` | `WHERE pts > 100 AND wl = 'W'` |
| Multiple conditions (OR) | `df[(df['state'] == 'CA') | (df['state'] == 'TX')]` | `WHERE state = 'CA' OR state = 'TX'` |

### Key Differences:

- SQL uses `=` for comparison (pandas uses `==`)
- SQL uses `AND` / `OR` (pandas uses `&` / `|`)
- SQL text needs single quotes: `'California'`
- SQL is declarative ("what you want"), pandas is procedural ("how to get it")

---

## Summary & Next Steps

In [None]:
# Always close your database connection when done!
conn.close()
print("✅ Database connection closed")
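
If you would rather not rely on remembering `conn.close()`, one option is `contextlib.closing` from the standard library. Note that `with sqlite3.connect(...) as conn:` on its own only manages the transaction (commit or rollback) and does not close the connection. A minimal sketch using a toy in-memory database:

```python
import sqlite3
from contextlib import closing

# closing() calls demo.close() when the block exits, even after an error.
with closing(sqlite3.connect(":memory:")) as demo:
    demo.execute("CREATE TABLE teams (full_name TEXT)")
    demo.execute("INSERT INTO teams VALUES ('Team A')")
    team_rows = demo.execute("SELECT full_name FROM teams").fetchall()

# The connection is closed here; using it again raises ProgrammingError.
try:
    demo.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(team_rows, still_open)
```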

### What You Learned Today:

✅ **Connect to SQLite databases**
✅ **SELECT** - Choose columns to display
✅ **FROM** - Specify which table
✅ **WHERE** - Filter rows based on conditions
✅ **ORDER BY** - Sort results (ASC/DESC)
✅ **LIMIT** - Control output size
✅ **Load SQL results into pandas** for ML

### SQL Query Pattern:

```python
# 1. Write SQL query as a string
query = """
SELECT columns
FROM table
WHERE conditions
ORDER BY column
LIMIT n
"""

# 2. Execute query and load into DataFrame
df = pd.read_sql(query, conn)

# 3. Now use pandas/sklearn for ML!
```

### Next Lesson:

- **Aggregate functions:** COUNT, SUM, AVG, MIN, MAX
- **GROUP BY:** Create summary statistics
- **Feature engineering** using SQL
- Build ML model using NBA data!

### Practice:

Complete **ai04a_Tasks.ipynb** to practice writing your own queries!

**Great work today!** 🏀📊🤖