# AI Lesson 04a Tasks: SQL for Machine Learning

**Course:** Applications of Artificial Intelligence  
**Lesson:** SQL Introduction with NBA Dataset  

---

## Learning Objectives

By completing this task, you will:
- Write SQL queries to extract data for machine learning
- Use SELECT, WHERE, ORDER BY, and LIMIT effectively
- Filter data to create better training datasets
- Load SQL results into pandas DataFrames
- Export data to Excel for analysis
- Understand how SQL fits into the ML pipeline

---

## The ML Context

**Why are we doing this?**

In real-world machine learning:
- Data usually lives in **databases**, not in CSV files
- You need to **filter** data before loading it (tables can hold millions of rows!)
- SQL lets you **select specific features** for your model
- SQL helps you **create training datasets** efficiently
- You export data to Excel for visualization and analysis

**Today's scenario:**
You're building an ML model to predict NBA team performance. First, you need to extract the right data from the database using SQL, then export it to Excel for analysis.

---

## Setup: Connect to Database

In [None]:
# Import necessary libraries
import pandas as pd
import sqlite3

# Connect to the NBA database
conn = sqlite3.connect('nba_5seasons.db')

print("✅ Connected to NBA database")
print("Ready to write SQL queries!")

---
## Part 1: Understanding the Database Structure

Before we can write good queries, we need to know what tables and columns we have.

### Task 1.1: List All Tables

Write a query to see what tables exist in our database.

**Hint:** Use the `sqlite_master` table and filter for type='table'
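
The catalog lookup from the hint can be tried on a throwaway database first. The sketch below assumes a hypothetical in-memory database; the `books` and `authors` tables are made up for illustration and are not part of the NBA database.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database, used only for illustration
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
demo.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")

# sqlite_master holds one row per schema object; keep only the tables
demo_tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    demo,
)
print(demo_tables)
demo.close()
```

The same pattern should carry over to any SQLite file, since `sqlite_master` is built into SQLite itself.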

In [None]:
# TODO: Write a query to list all tables
query = """
SELECT ________
FROM ________
WHERE ________ = 'table'
ORDER BY ________
"""

tables = pd.read_sql(query, conn)
print("Tables in the database:")
display(tables)

**Q1: How many tables are in the database?**

A: 

### Task 1.2: Explore the teams Table

Look at the first 10 rows of the `teams` table.

**Hint:** Use SELECT * and LIMIT

In [None]:
# TODO: Write a query to see the first 10 teams
query = """
SELECT ________
FROM ________
LIMIT ________
"""

teams_sample = pd.read_sql(query, conn)
print("First 10 teams:")
display(teams_sample)

**Q2: What columns are in the teams table?**

A: 

**Q3: How would you find your favorite team in this table?**

A: 

---
## Part 2: Basic SELECT Queries

Let's practice selecting specific columns - this is how you choose **features** for ML!

### Task 2.1: Select Specific Columns

Get just the team names and cities (the features we care about).

**Hint:** SELECT column1, column2 FROM table
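
To see what naming columns changes compared with `SELECT *`, here is a minimal sketch on a hypothetical `pets` table. The table and its values are invented for illustration only.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE pets (name TEXT, species TEXT, age INTEGER)")
demo.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [("Rex", "dog", 3), ("Tom", "cat", 5)],
)

all_cols = pd.read_sql("SELECT * FROM pets", demo)              # every column
two_cols = pd.read_sql("SELECT name, species FROM pets", demo)  # only the named ones

print(list(all_cols.columns))
print(list(two_cols.columns))
demo.close()
```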

In [None]:
# TODO: Select only full_name and city columns
query = """
SELECT ________, ________
FROM ________
"""

team_names = pd.read_sql(query, conn)
print("Team names and cities:")
display(team_names)

**Q4: Why might you want to select only specific columns instead of using SELECT * ?**

A: 

### Task 2.2: Teams by State

Let's see which states have NBA teams. Get the team name, city, and state, ordered by state and then by team name.

**Hint:** ORDER BY can take multiple columns
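
A quick sketch of multi-column sorting, again on a hypothetical `pets` table with invented rows. The second column in `ORDER BY` only breaks ties left by the first.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE pets (name TEXT, species TEXT, age INTEGER)")
demo.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [("Rex", "dog", 3), ("Tom", "cat", 5), ("Ace", "dog", 7), ("Kit", "cat", 2)],
)

# Sort by species first, then by name within each species
ordered = pd.read_sql(
    "SELECT name, species FROM pets ORDER BY species, name",
    demo,
)
print(ordered)
demo.close()
```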

In [None]:
# TODO: Show team names, cities, and states, ordered by state then name
query = """
SELECT ________, ________, ________
FROM teams
ORDER BY ________, ________
"""

teams_by_state = pd.read_sql(query, conn)
print("All teams by state:")
display(teams_by_state)

**Q5: Which state has the most NBA teams?**

A: 

---
## Part 3: WHERE Clause - Filtering for ML

In ML, you often want **specific subsets** of data. WHERE is how you filter!

### Task 3.1: Filter by Season

Get all games from the 2021-22 season. (If you're training a model on recent data, you'd start here!)

**Hint:** WHERE column = 'value' (use single quotes for text!)
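
The quoting rule in the hint is easy to check on a small example. This sketch uses a hypothetical `terms` table with made-up rows; it compares a quoted text value with the same characters left unquoted, which SQLite treats as arithmetic.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE terms (term TEXT, course TEXT)")
demo.executemany(
    "INSERT INTO terms VALUES (?, ?)",
    [("2020-21", "algebra"), ("2021-22", "biology"), ("2021-22", "chemistry")],
)

# Quoted: compared as the text '2021-22'
quoted = pd.read_sql(
    "SELECT course FROM terms WHERE term = '2021-22' ORDER BY course", demo
)

# Unquoted: evaluated as the subtraction 2021 minus 22
unquoted_value = demo.execute("SELECT 2021-22").fetchone()[0]
unquoted = pd.read_sql("SELECT course FROM terms WHERE term = 2021-22", demo)

print(quoted["course"].tolist(), unquoted_value, len(unquoted))
demo.close()
```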

In [None]:
# TODO: Filter for only 2021-22 season games
query = """
SELECT *
FROM ________
WHERE ________ = ________
LIMIT 10
"""

recent_games = pd.read_sql(query, conn)
print("Sample of 2021-22 season games:")
display(recent_games)

**Q6: Why use quotes around '2021-22'?**

A: 

### Task 3.2: High-Scoring Games

Find all games where a team scored 120 or more points.

**ML Context:** This could be your positive class for a "high-scoring game predictor" model!

**Hint:** Use >= for "greater than or equal to"
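
As a rough illustration of combining a threshold filter with sorting and `LIMIT`, the sketch below uses a hypothetical `scores` table. The names and numbers are invented and have nothing to do with the NBA data.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
demo.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 95), ("b", 120), ("c", 131), ("d", 120), ("e", 140)],
)

# Keep rows at or above the threshold, largest first, then cut to the top 2
top_two = pd.read_sql(
    "SELECT player, points FROM scores WHERE points >= 120 "
    "ORDER BY points DESC LIMIT 2",
    demo,
)
print(top_two)
demo.close()
```

The filter runs before the sort and the cut, so `LIMIT` only ever sees rows that already passed the `WHERE` test.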

In [None]:
# TODO: Find games with 120+ points, show top 20
query = """
SELECT season, game_date, team_id, matchup, pts, wl
FROM team_game_stats
WHERE ________ >= ________
ORDER BY ________ ________
LIMIT ________
"""

high_scoring = pd.read_sql(query, conn)
print("Top 20 highest-scoring games:")
display(high_scoring)

**Q7: Looking at the results, what was the highest score?**

A: 

**Q8: Did high-scoring games tend to be wins or losses?**

A: 

### 📊 Task 3.3: Export High-Scoring Games to Excel

Let's export ALL high-scoring games (120+ points) to Excel for analysis!

**You'll use this file later to practice Excel formulas and charts.**
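
Before exporting real query results, it may help to try the export on a tiny DataFrame. This is a minimal sketch with made-up values and a hypothetical file name (`demo_export.xlsx`). It assumes the `openpyxl` package is available as the Excel engine and skips the write if it is not.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Made-up values, for illustration only
demo_df = pd.DataFrame({"team": ["A", "B"], "pts": [121, 130]})

# Writing .xlsx files needs an Excel engine; this sketch assumes openpyxl
if importlib.util.find_spec("openpyxl") is not None:
    path = os.path.join(tempfile.mkdtemp(), "demo_export.xlsx")
    # index=False leaves the DataFrame's row numbers out of the sheet
    demo_df.to_excel(path, index=False, sheet_name="Demo", engine="openpyxl")
    round_trip = pd.read_excel(path, sheet_name="Demo")
else:
    round_trip = demo_df.copy()
    print("openpyxl is not installed; skipping the Excel write")

print(round_trip)
```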

In [None]:
# TODO: Get ALL high-scoring games (remove LIMIT) and export to Excel
query = """
SELECT season, game_date, team_id, matchup, pts, fgm, fga, fg3m, ftm, reb, ast, wl
FROM team_game_stats
WHERE ________ >= ________
ORDER BY pts DESC
"""

high_scoring_all = pd.read_sql(query, conn)

# TODO: Export to Excel (fill in the blanks)
high_scoring_all.to_excel(________, index=________, sheet_name=________)

print(f"✅ Exported {len(high_scoring_all)} high-scoring games to 'high_scoring_games.xlsx'")
print("\nPreview:")
display(high_scoring_all.head())

**Q9: How many high-scoring games were exported?**

A: 

---
## Part 4: ORDER BY and Multiple Conditions

### Task 4.1: Wins with Good Offense

Find all wins from the 2021-22 season where the team scored 115+ points.

**Hint:** Combine season = '2021-22', wl = 'W', and pts >= 115 using AND
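
Here is a small sketch of mixing a text condition with a numeric one. The `matches` table and its rows are hypothetical; only rows that satisfy every condition joined by `AND` are returned.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE matches (result TEXT, points INTEGER)")
demo.executemany(
    "INSERT INTO matches VALUES (?, ?)",
    [("W", 118), ("W", 101), ("L", 120), ("W", 115)],
)

# Text values take single quotes, numbers do not
big_wins = pd.read_sql(
    "SELECT result, points FROM matches "
    "WHERE result = 'W' AND points >= 115 "
    "ORDER BY points DESC",
    demo,
)
print(big_wins)
demo.close()
```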

In [None]:
# TODO: Find wins with 115+ points from 2021-22 season
query = """
SELECT game_date, team_id, matchup, pts, wl
FROM team_game_stats
WHERE ________ = ________
  AND ________ = ________
  AND ________ >= ________
ORDER BY ________ ________
LIMIT 25
"""

dominant_wins = pd.read_sql(query, conn)
print("Dominant wins (115+ points):")
display(dominant_wins)

**Q10: Why do we need quotes around 'W' but not around 115?**

A: 

**Q11: What does the AND operator do in SQL?**

A: 

### Task 4.2: Losses Despite Good Scoring

Find losses from the 2021-22 season where the team still scored 110+ points. (Lost a shootout!)

**Hint:** Similar to above, but wl = 'L' and pts >= 110

In [None]:
# TODO: Find losses with 110+ points (high-scoring losses)
query = """
SELECT game_date, team_id, matchup, pts, wl
FROM team_game_stats
WHERE ________ = ________
  AND ________ = ________
  AND ________ >= ________
ORDER BY ________ ________
LIMIT 20
"""

shootout_losses = pd.read_sql(query, conn)
print("High-scoring losses (110+ pts but still lost):")
display(shootout_losses)

**Q12: What does this tell you about offense vs defense in basketball?**

A: 

---
## Part 5: Player Season Stats

Now let's look at player data - this is where ML gets interesting!

### Task 5.1: Explore Player Stats

Look at the structure of the player_season_stats table.

In [None]:
# TODO: Get first 10 rows from player_season_stats for 2021-22
query = """
SELECT *
FROM ________
WHERE ________ = ________
LIMIT ________
"""

player_stats = pd.read_sql(query, conn)
print("Sample player season stats:")
display(player_stats)

**Q13: What columns represent a player's scoring ability?**

A: 

### Task 5.2: High Scorers

Find all players who averaged 20+ points per game in 2021-22.

**Hint:** Filter WHERE pts >= 20

In [None]:
# TODO: Find players averaging 20+ PPG
query = """
SELECT player_id, season, team_id, gp, pts, reb, ast
FROM player_season_stats
WHERE ________ = ________
  AND ________ >= ________
ORDER BY ________ ________
"""

star_scorers = pd.read_sql(query, conn)
print(f"Found {len(star_scorers)} players averaging 20+ PPG")
display(star_scorers.head(20))

**Q14: How many players averaged 20+ PPG?**

A: 

**Q15: Why might you want to filter by games played (gp) too?**

A: 

### 📊 Task 5.3: Export Top Scorers to Excel

Export top scorers (20+ PPG, 40+ games) to Excel for analysis.

In [None]:
# TODO: Get top scorers with at least 40 games played
query = """
SELECT player_id, season, team_id, gp, min, pts, reb, ast, stl, blk, fg_pct, fg3_pct, ft_pct
FROM ________
WHERE ________ = ________
  AND ________ >= ________
  AND ________ >= ________
ORDER BY ________ ________
"""

top_scorers = pd.read_sql(query, conn)

# TODO: Export to Excel
top_scorers.to_excel(________, index=________, sheet_name=________)

print(f"✅ Exported {len(top_scorers)} top scorers to 'top_scorers.xlsx'")
display(top_scorers.head(10))

**Q16: How many top scorers were exported?**

A: 

### Task 5.4: All-Around Players

Find players who contributed across multiple categories:
- 15+ points per game
- 5+ rebounds per game
- 5+ assists per game
- Played at least 40 games

**Hint:** You'll need 4 AND conditions!
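
To get a feel for how strict a chain of `AND` conditions is, this sketch runs the same two thresholds with `AND` and with `OR` on a hypothetical `players` table. All rows are invented.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE players (name TEXT, pts REAL, reb REAL)")
demo.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("p1", 20, 8), ("p2", 20, 2), ("p3", 5, 9), ("p4", 4, 1)],
)

# AND: a row must pass both tests
both = pd.read_sql(
    "SELECT name FROM players WHERE pts >= 15 AND reb >= 5 ORDER BY name", demo
)
# OR: a row needs to pass only one test
either = pd.read_sql(
    "SELECT name FROM players WHERE pts >= 15 OR reb >= 5 ORDER BY name", demo
)

print(both["name"].tolist(), either["name"].tolist())
demo.close()
```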

In [None]:
# TODO: Find all-around players (15+ pts, 5+ reb, 5+ ast, 40+ games)
query = """
SELECT player_id, season, team_id, gp, pts, reb, ast
FROM player_season_stats
WHERE season = '2021-22'
  AND ________ >= ________
  AND ________ >= ________
  AND ________ >= ________
  AND ________ >= ________
ORDER BY ________ ________
"""

all_around = pd.read_sql(query, conn)
print(f"All-around players: {len(all_around)} found")
display(all_around)

**Q17: Why use multiple AND conditions instead of OR?**

A: 

---
## Part 6: Win/Loss Records

Let's create a win/loss dataset that we can analyze in Excel!

### Task 6.1: Get All 2021-22 Game Results

Get all games from 2021-22 with key stats.

In [None]:
# TODO: Get all 2021-22 games with important stats
query = """
SELECT season, game_date, team_id, matchup, wl, pts, fgm, fga, fg3m, ftm, reb, ast, stl, blk, tov
FROM ________
WHERE ________ = ________
ORDER BY ________
"""

all_games = pd.read_sql(query, conn)
print(f"Total games: {len(all_games)}")
display(all_games.head(10))

**Q18: How many total game records are there?**

A: 

### 📊 Task 6.2: Export Win/Loss Records to Excel

Export this data - we'll use it for Excel analysis!

In [None]:
# TODO: Export all_games to Excel
all_games.to_excel(________, index=________, sheet_name=________)

print(f"✅ Exported {len(all_games)} game records to 'win_loss_records.xlsx'")
print("\nWin/Loss distribution:")
print(all_games['wl'].value_counts())

---
## Part 7: Creating ML-Ready Dataset

Now let's create the dataset we'll use to train a machine learning model!

### Task 7.1: Win Prediction Dataset

Create a dataset for predicting wins. Include features that might predict winning:
- Points scored
- Field goals made/attempted
- Three-pointers made
- Free throws made
- Rebounds
- Assists
- Steals
- Blocks
- Turnovers
- Win/Loss outcome
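
Once a dataset like this is loaded, a model typically needs the features and the outcome in separate objects. The sketch below shows one possible way to do that split; the DataFrame is made up, and the names `X` and `y` are only a common convention, not something this lesson requires.

```python
import pandas as pd

# Made-up rows standing in for query results; real values come from the database
demo_df = pd.DataFrame({
    "pts": [112, 98, 105, 120],
    "tov": [11, 17, 14, 9],
    "wl": ["W", "L", "L", "W"],
})

X = demo_df.drop(columns="wl")           # feature columns
y = (demo_df["wl"] == "W").astype(int)   # target: 1 for a win, 0 for a loss

print(X.shape, y.tolist())
# Share of each outcome, a quick check on class balance
print(demo_df["wl"].value_counts(normalize=True))
```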

In [None]:
# TODO: Select all the columns listed above from team_game_stats for 2021-22
query = """
SELECT 
    pts,
    ________,
    ________,
    ________,
    ________,
    ________,
    ________,
    ________,
    ________,
    ________,
    wl
FROM ________
WHERE ________ = ________
"""

ml_dataset = pd.read_sql(query, conn)
print(f"ML Dataset created: {ml_dataset.shape[0]} games, {ml_dataset.shape[1]} columns (features plus the wl target)")
print("\nFirst 10 rows:")
display(ml_dataset.head(10))

print("\nDataset info:")
ml_dataset.info()  # info() prints its report itself and returns None

print("\nTarget variable distribution:")
print(ml_dataset['wl'].value_counts())

**Q19: Is this dataset balanced (roughly equal wins and losses)?**

A: 

**Q20: Why is it important to include turnovers (tov) as a feature?**

A: 

### Task 7.2: Save ML Dataset

Save this dataset both as CSV (for Python ML) and Excel (for analysis).
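
The effect of the `index` argument is easiest to see by writing a CSV and reading it back. This is a minimal sketch on a made-up DataFrame. It writes to an in-memory string instead of a file, so nothing is left on disk.

```python
import io

import pandas as pd

# Made-up values, for illustration only
demo_df = pd.DataFrame({"pts": [110, 98], "wl": ["W", "L"]})

# With no path, to_csv returns the CSV text as a string
with_index = demo_df.to_csv()
without_index = demo_df.to_csv(index=False)

back_with = pd.read_csv(io.StringIO(with_index))
back_without = pd.read_csv(io.StringIO(without_index))

print(list(back_with.columns))     # includes an extra unnamed index column
print(list(back_without.columns))  # only the original columns
```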

In [None]:
# TODO: Save as CSV for machine learning
ml_dataset.to_csv(________, index=________)

# TODO: Also save as Excel for analysis
ml_dataset.to_excel(________, index=________, sheet_name=________)

print("✅ ML Dataset saved:")
print("   - nba_win_prediction.csv (for Python ML)")
print("   - nba_win_prediction.xlsx (for Excel analysis)")
print(f"   - {len(ml_dataset)} rows")
print(f"   - {len(ml_dataset.columns)} columns")

---
## Part 8: Summary of Excel Files Created

Let's verify all the Excel files we've created for later analysis.

In [None]:
import os

excel_files = [
    'high_scoring_games.xlsx',
    'top_scorers.xlsx',
    'win_loss_records.xlsx',
    'nba_win_prediction.xlsx'
]

print("📊 Excel Files Created for MOS Practice:\n")
for i, file in enumerate(excel_files, 1):
    if os.path.exists(file):
        size = os.path.getsize(file) / 1024  # Size in KB
        print(f"{i}. ✅ {file} ({size:.1f} KB)")
    else:
        print(f"{i}. ❌ {file} (NOT FOUND - did you complete that task?)")

print("\n💡 Next Steps:")
print("- Open these files in Excel")
print("- Practice formulas (SUM, AVERAGE, IF, COUNTIF)")
print("- Create charts and visualizations")
print("- Apply conditional formatting")
print("- Build PivotTables")

---
## Part 9: Reflection Questions

**Q21: How is using SQL different from loading an entire CSV file?**

A: 

**Q22: What's the connection between SQL and machine learning?**

A: 

**Q23: When would you use WHERE vs. filtering in pandas?**

A: 

**Q24: What features would you add to improve a win prediction model?**

A: 

**Q25: Why export to Excel instead of just working in Python?**

A: 

**Q26: How would SQL help if you had 50 million game records instead of 2,000?**

A: 

---
## Cleanup

In [None]:
# Always close your database connection when done
conn.close()
print("✅ Database connection closed")
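
One optional way to make sure a connection gets closed even when a query fails is `contextlib.closing` from the standard library. The sketch below uses a hypothetical in-memory database in place of the NBA file.

```python
import sqlite3
from contextlib import closing

# closing() calls demo.close() when the block ends, even after an error
with closing(sqlite3.connect(":memory:")) as demo:
    value = demo.execute("SELECT 1 + 1").fetchone()[0]

# The connection is now closed, so using it again raises an error
try:
    demo.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(value, still_open)
```

A plain `with sqlite3.connect(...)` block manages transactions but does not close the connection, which is why `closing` is used here.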

---
## Summary

**You practiced:**
- ✅ Connecting to databases
- ✅ Exploring table structures
- ✅ SELECT queries (choosing features)
- ✅ WHERE filters (selecting training examples)
- ✅ ORDER BY (finding extremes and patterns)
- ✅ LIMIT (controlling result size)
- ✅ Multiple conditions (AND)
- ✅ Loading SQL results into pandas
- ✅ Exporting data to Excel for analysis

**Excel Files Created:**
1. high_scoring_games.xlsx
2. top_scorers.xlsx
3. win_loss_records.xlsx
4. nba_win_prediction.xlsx

**Key Takeaway:** SQL is how you extract and prepare data for machine learning in the real world!

---

## Next Steps

**In the next lessons, you'll:**
- Open these Excel files
- Practice MOS certification skills (formulas, charts, formatting)
- Learn JOIN queries (combining tables)
- Learn GROUP BY (aggregating data)
- Build a win prediction model with your SQL data!

**Great work!**