# SQL for Data Analysis  
### Introductory Workshop
*D‑Lab, UC Berkeley*

### Welcome & Environment Check 

Estimated time: 8 minutes

## Table of Contents  
1. [Why SQL?](#why-sql)
2. [Relational Databases](#relational)
3. [Getting Started with SQLite](#sqlite)
4. [SQLite Data Types & NULLs](#types)
5. [SELECT & Derived Columns](#select)
6. [Filtering Rows with WHERE](#where)
7. [Aggregates & GROUP BY](#groupby)
8. [Sorting & Paging Results](#orderby)
9. [Key Points & Next Steps](#keypoints)

<div class="alert alert-success"> 
<b>Learning Goals</b><br><br>
By the end of this workshop you will be able to:
<ul>
<li>Understand why you would use SQL and how it complements pandas.</li>
<li>Write basic queries with <code>SELECT</code>, create aliases, and build derived columns.</li>
<li>Filter rows using <code>WHERE</code>, sort and paginate result sets with <code>ORDER BY</code>, <code>LIMIT</code>, <code>OFFSET</code>.</li>
<li>Summarise data with aggregates &nbsp;(<code>COUNT</code>, <code>SUM</code>, <code>AVG</code>, <code>MIN</code>, <code>MAX</code>)&nbsp; and <code>GROUP BY</code>.</li>
<li>Filter groups with <code>HAVING</code> and understand how it differs from <code>WHERE</code>.</li>
</ul>
</div>

<div class="alert alert-info">
<b>Prerequisites</b><br><br>
Before starting this workshop, you should:
<ul>
<li>Have basic familiarity with data analysis concepts</li>
<li>Have Python and Jupyter Notebook installed (optional, but helpful for following along)</li>
<li>Download the workshop materials (we'll provide these)</li>
<li>Ideally, know how to perform basic operations in pandas (or a similar library or tool). This is not strictly necessary, but it helps put the material in context.</li>
</ul>

No prior SQL experience is required!
</div>

<div class="alert alert-warning">
<b>Setup Note</b><br><br>
If you want to follow along with the exercises: <br>
1. Make sure you have the required Python packages installed:<br>
   - pandas<br>
   - sqlite3 (part of Python's standard library, so there is nothing to install)<br>
   - sqlalchemy<br>
2. Download the example database file we'll be using
</div>

### Setup

In [14]:
# Importing packages that we will need
import sqlite3
from sqlalchemy import create_engine
import pandas as pd         
import os, pathlib, sys

# First, we connect to our SQLite database, the file that stores the data we will be using.
# If no database file with this name exists, sqlite3 creates a new, empty one.
# Every SQL query will be executed through this `conn` object, which can be thought of as a channel to the SQLite file.
conn = sqlite3.connect('../Data/customers.sqlite')
# The SQLAlchemy engine points to the same file as `conn`
engine = create_engine('sqlite:///../Data/customers.sqlite')

# Load the customers table into a DataFrame for display


# Troubleshooting: run the checks below to confirm that sqlite3 can be imported and that the database file is in the expected location.

# 1️⃣ Check that the sqlite3 library is available
try:
    import sqlite3                        
    print("✅ The sqlite3 library is imported and available.")
except ImportError:
    print("❌ sqlite3 library not found.")

# 2️⃣ Check that the workshop DB file exists
from pathlib import Path
print("✅ The database file is available and in the appropriate folder" if Path("../Data/customers.sqlite").exists()
      else "❌ ../Data/customers.sqlite is missing—download the data bundle and place it there.")


✅ The sqlite3 library is imported and available.
✅ The database file is available and in the appropriate folder


### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
📝 **Poll:** A Zoom poll to help you learn!<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>
🙋 **Hands-Up**: Quick pulse-check or mini-quiz. Respond by choosing the option that matches how you feel or what you think.

<a id='why-sql'></a>
## 1 · Why SQL? 

Estimated time: 15 minutes

A very common question from people who first hear about SQL is: why bother learning it? Can't I just use pandas and its DataFrames?

### Learning Objective
Understand when and why to use SQL instead of pandas, and how these tools complement each other in data analysis workflows.

📝 **Poll 1:** What is the usual size of databases you use? How large do you think databases can get?

### Bookcase vs Desk Analogy  
Think of your computer’s RAM as the desk where you spread papers you’re actively working on, and the hard‑drive as the bookcase that stores all your books.

Working at your desk is fast, but the desk doesn't hold nearly as much material as the bookcase. So if you are working with very large datasets, you risk overloading your desk.

* **Pandas** is excellent for analysis of *smaller* data that fits comfortably on the desk.  
* **SQL** is the tool we use to selectively bring only the information we need from the bookcases to the desk.

🙋 **Hands-Up:** Did the analogy make sense? A. Yes B. Still fuzzy

**Example:** A 20 GB transaction table can be grouped and aggregated with a single SQL statement, whereas pandas would first need at least 20 GB of memory just to load the file.

### Rough size sweet-spots on a single machine

- **Pandas (all in RAM)** – ideal up to **≈ 5 million rows** (around **500 MB** of CSV). Beyond that, the in-memory overhead of Python objects (≈ 5–10× the raw file size) quickly exhausts memory.  
- **SQLite (single-file engine)** – stays responsive with **tens or even hundreds of millions of rows** (roughly **10–500 GB**). Because it reads pages from disk as needed, RAM is rarely the bottleneck. SQLite's own maximum database size is ≈ 281 TB.  
- **PostgreSQL (server engine)** – comfortably handles tables from **100 GB to several terabytes** on server-grade hardware; the technical cap is **32 TB per table**, while overall database size is **unlimited**.


### Complements, not Substitutes 
In practice we often:

1. Use SQL to slice/aggregate huge tables, producing a manageable result set  
2. Pull that into pandas for plotting or modelling

**💡Tip:** When working with large datasets, try to do as much filtering and aggregation as possible in SQL before pulling data into pandas. This can dramatically improve performance!
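
As a minimal sketch of this workflow, the snippet below builds a small throwaway in-memory table, lets SQLite do the aggregation, and keeps only the handful of summary rows that we would then hand to pandas. The `transactions` table and its columns are made up for illustration; they are not part of the workshop data.

```python
import sqlite3

# Hypothetical in-memory database; nothing is written to disk
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE transactions (region TEXT, amount REAL)")
demo_conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("North", 10.0), ("North", 15.0), ("South", 7.5), ("South", 2.5), ("West", 4.0)],
)

# SQL reduces the five raw rows to one row per region
summary_rows = demo_conn.execute(
    """
    SELECT region, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY region
    ORDER BY region
    """
).fetchall()

# Only this small result would need to travel to pandas
print(summary_rows)
demo_conn.close()
```

With a real table the raw rows could number in the millions, while the summary stays just as small.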

🔔 **Question:**  Have you ever dealt with a problem in which SQL was necessary? If not, can you think of one real life application that might require it?

<a id='relational'></a>
## 2 · Relational Databases 

Estimated time: 12 minutes

### Learning Objective
Grasp the fundamental concepts of relational databases, including tables, keys, and relationships between data.

## What is a Relational Database?

<div class="alert alert-primary">
<b>🔑 Key Concept: Relational Databases</b><br><br>
A relational database organizes data into tables (sometimes called relations), which consist of rows and columns:

- **Tables**: Collections of related data (e.g., Customers, Orders)
- **Columns**: Specific attributes (e.g., CustomerID, FirstName)
- **Rows**: Individual records in the table
- **Primary Keys**: Unique identifiers for each row
- **Foreign Keys**: References to primary keys in other tables that create relationships

</div>

![Database sample of Customers, Orders and OrderDetails tables, highlighting how relational data splits across tables.](../Images/relational-databases.svg)

Let's take a look at this example:
- Combined, these two tables form a database: a collection of data organized in tables.
- In this particular case, there are two tables, each storing one type of information, either about Customers or about Orders. But databases usually consist of many tables!
- Each table consists of rows, which identify individual records, and columns, which each hold one piece of information (a field) about every row.
- Both of these tables have primary keys: columns that uniquely identify each row, i.e., every row has one, and no two rows share the same primary key value. Notice, however, that the two tables do not have the same primary key!
- Further, the Orders table has a foreign key, "Customer ID": a column linking to the primary key of the Customers table. Each order is associated with a single customer through this relationship. But that doesn't mean each customer has just one order!
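
To make primary and foreign keys concrete, here is a small sketch that recreates the idea in a throwaway in-memory database. The table layouts and the sample rows are assumptions chosen for illustration and do not match the figure exactly. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on

# Hypothetical tables: each has its own primary key,
# and orders.customer_id is a foreign key pointing at customers
demo_conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
demo_conn.execute(
    """
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        product     TEXT
    )
    """
)
demo_conn.execute("INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben')")
demo_conn.execute("INSERT INTO orders VALUES (10, 1, 'Book'), (11, 1, 'Pen'), (12, 2, 'Lamp')")

# One customer can have several orders
ana_orders = demo_conn.execute(
    "SELECT product FROM orders WHERE customer_id = 1 ORDER BY order_id"
).fetchall()

# A repeated primary key value is rejected
try:
    demo_conn.execute("INSERT INTO customers VALUES (1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# An order that points at a non-existent customer is rejected too
try:
    demo_conn.execute("INSERT INTO orders VALUES (13, 99, 'Chair')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(ana_orders, duplicate_rejected, orphan_rejected)
demo_conn.close()
```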

📝 **Poll 2:** Can you think of a table that would not have a Primary Key?


<a id='sqlite'></a>
## 3 · Getting Started with SQLite 

Estimated time: 10 minutes

### Learning Objective
Understand how to set up SQLite, import data from CSV files, and establish a database connection for querying data.



## What is SQLite?

SQLite is a lightweight relational database. It is easy to install and use, and although it lacks some of the advanced capabilities of PostgreSQL or MySQL, it is a great way to start learning SQL. 

SQL queries are very similar (if not identical) across these database engines, so much of what you learn here transfers if you decide to move to another one in the future.

### When to Use Different Databases

| What You're Doing | SQLite | MySQL/PostgreSQL |
|-------------------|--------|------------------|
| Getting Started | Great for learning SQL | More complex to set up |
| Working Locally |  Just a file you can copy | Needs server installation |
| Small-Medium Projects |  Up to ~1TB of data | Any size |
| Data Science Projects |  Works great with Python | Needs more setup |
| Team Projects | Limited sharing options | Better for collaboration |

💡 **Real-World Examples**:
- **SQLite**: Your Jupyter notebooks, research projects, personal data analysis
- **MySQL/PostgreSQL**: Company databases, shared research data, production systems

The key takeaway: SQLite is perfect for learning SQL and doing your own data analysis. When you need to share data with a team or work on very large datasets, you can easily switch to MySQL or PostgreSQL - the SQL commands will be almost exactly the same!


📝 **Poll 3:** What's your primary reason for learning SQL today? 

### SQLite and Python

In this workshop, we'll work with SQLite directly in a Jupyter notebook using the `sqlite3` Python library, which comes built into Python.

We could create tables with `CREATE TABLE` statements, as well as add new data (`INSERT`), change existing data (`UPDATE`), and remove data (`DELETE`) directly in SQL, but in this workshop we will focus on querying data from a pre-existing database. The auxiliary file I used to create the database for this workshop includes an example of how this can be done, if you would like to explore it later!
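
For the curious, here is a rough sketch of what those four statements look like when run through `sqlite3`. It uses a made-up `pets` table in an in-memory database, so it never touches the workshop file, and it is not the auxiliary script mentioned above.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")  # throwaway database, kept only in RAM

# CREATE TABLE defines the columns (hypothetical table for illustration)
demo_conn.execute("CREATE TABLE pets (name TEXT, species TEXT, age INTEGER)")

# INSERT adds rows
demo_conn.execute(
    "INSERT INTO pets VALUES ('Rex', 'dog', 3), ('Tom', 'cat', 5), ('Nemo', 'fish', 1)"
)

# UPDATE changes existing rows
demo_conn.execute("UPDATE pets SET age = 4 WHERE name = 'Rex'")

# DELETE removes rows
demo_conn.execute("DELETE FROM pets WHERE species = 'fish'")

demo_conn.commit()  # make the changes permanent

remaining = demo_conn.execute(
    "SELECT name, species, age FROM pets ORDER BY name"
).fetchall()
print(remaining)
demo_conn.close()
```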


💡**Tip:** In this workshop we will be focusing on databases that contain a single table. In the next workshop, we will learn how to combine information from multiple tables. 

Let's load the `customers` table from our SQLite database into a pandas DataFrame:

customers_df = pd.read_sql_query('SELECT * FROM customers', conn)



🥊 **Challenge**: Can you use pandas to take a look at what this dataset looks like?

In [8]:
### SOLUTION:
customers_df.head()    

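|   | name | city | country | items_purchased | price_per_item | last_purchase | account_balance |
|---|------|------|---------|-----------------|----------------|---------------|-----------------|
| 0 | John Smith | New York | US | NaN | 56.28 | None | 945.55 |
| 1 | Maria Garcia | London | GB | 15.0 | NaN | 2024-08-18 | 905.34 |
| 2 | Li Wei | Tokyo | JP | 11.0 | 14.18 | 2024-02-10 | 638.11 |
| 3 | Emma Brown | Paris | FR | 8.0 | 64.68 | 2024-01-28 | 929.69 |
| 4 | Ahmed Hassan | Sydney | AU | 7.0 | 25.35 | 2024-05-14 | 179.64 |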


<a id='types'></a>
## 4 · SQLite Data Types & `NULL` 

Estimated time: 7 minutes

### Learning Objective
Understand SQLite's data types and how NULL values are handled in SQL databases.

SQLite has a flexible approach to data types. While each column is declared with a preferred data type (like TEXT or INTEGER), SQLite can generally store a value of any type in any column.

* **INTEGER** – whole numbers 
* **REAL** – floating‑point numbers  
* **TEXT** – strings    
* **BLOB** – raw binary data  
* **NULL** – missing / undefined  

Unlike PostgreSQL/MySQL, SQLite does not have a dedicated date or time type. Dates are usually stored as TEXT (for example `'2024-02-10'`), and built-in functions interpret those strings as dates.
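
The sketch below illustrates both points with SQLite's `typeof()` function, using a made-up `readings` table in an in-memory database. A column declared as INTEGER converts text that looks like a number, but it happily stores text that does not, and a date is just text until a date function interprets it.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE readings (value INTEGER)")  # hypothetical table

# Insert an integer, numeric-looking text, ordinary text, and a missing value
demo_conn.executemany(
    "INSERT INTO readings VALUES (?)",
    [(42,), ("42",), ("forty-two",), (None,)],
)

# typeof() reports how each value was actually stored
stored = demo_conn.execute(
    "SELECT value, typeof(value) FROM readings ORDER BY rowid"
).fetchall()
print(stored)

# A date is plain text until a date function interprets it
date_check = demo_conn.execute(
    "SELECT typeof('2024-02-10'), date('2024-02-10', '+30 days')"
).fetchone()
print(date_check)
demo_conn.close()
```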

<a id='select'></a>
## 5 · SELECT & Derived Columns 

Estimated time: 18 minutes

### Learning Objective
Master the fundamental SQL SELECT statement, including column selection, aliasing, and creating derived columns using basic SQL functions.

Real-World Application: Extract the relevant customer information for a marketing campaign

### The SELECT statement is the foundation of SQL queries. We'll learn it step by step:

1. Basic SELECT - Retrieving data from tables
2. Column Aliases - Giving friendly names to columns
3. Derived Columns - Creating new columns from existing ones
4. SQL Functions - Transforming data in useful ways

💡 Tip: We'll build from simple queries to more complex ones, making sure each concept is clear before moving on.

At its most basic version, a query looks like this:

```sql
SELECT column1, column2, …
FROM   table;
```

- One of the most confusing aspects of SQL, especially in the beginning, is understanding the order in which SQL processes the clauses of a query, which unfortunately is not the order in which they are written.
- Here it is better to think of `FROM` as the first step: it tells SQL which table to draw the information from.
- After that, we keep only the columns we need by listing them after the `SELECT` keyword.
- Notice that:
    - We don't need to put the names of columns and tables in quotes
    - We do need to separate columns with commas, but there is no comma after the last one!

🙋 **Hands-Up:** Which clause executes *first* in SQL’s logical order? A. `SELECT` B. `FROM` C. `WHERE`

💡 **Tip**: When selecting multiple columns, use line breaks after commas for better readability!

In [3]:
# Basic SELECT query
query = """
SELECT name, city, country, items_purchased
FROM customers
"""
result = pd.read_sql_query(query, conn)
print("Basic SELECT:\n")
print(result)

Basic SELECT:

                name         city country  items_purchased
0         John Smith     New York      US              NaN
1       Maria Garcia       London      GB             15.0
2             Li Wei        Tokyo      JP             11.0
3         Emma Brown        Paris      FR              8.0
4       Ahmed Hassan       Sydney      AU              7.0
5      Sarah Johnson       Berlin      DE             19.0
6   Carlos Rodriguez       Mumbai      IN             11.0
7      Anna Kowalski    São Paulo      BR             11.0
8       James Wilson      Toronto      CA              4.0
9        Yuki Tanaka     Shanghai      CN              8.0
10       Elena Popov       Madrid      ES              3.0
11     Michel Dubois       Moscow      RU              2.0
12      Sofia Santos        Dubai      AE             12.0
13     Lars Andersen         None    None              6.0
14       Aisha Patel  Mexico City      MX              2.0
15    Diego Martinez    Amsterdam      NL

⚠️ **Warning:** Common Mistake - Missing commas between columns
```sql
-- ❌ WRONG: Missing commas
SELECT name city country
FROM customers

-- ✅ CORRECT: Columns separated by commas
SELECT name, city, country
FROM customers
```

Queries are extremely flexible, however, and allow us to combine and summarize information in many ways.

Let's take a look at some examples of more interesting queries.

### Step‑by‑Step Examples

1. **Select everything:**


   If you want to keep all columns from a table, you can just use * instead of naming the columns.


   ```sql
   SELECT * 
   FROM customers;
   ```

In [4]:
# Select all columns example
select_all_query = """
SELECT * 
FROM customers
"""
result = pd.read_sql_query(select_all_query, conn)
print("SELECT * - All columns:\n")
print(result)

SELECT * - All columns:

                name         city country  items_purchased  price_per_item  \
0         John Smith     New York      US              NaN           56.28   
1       Maria Garcia       London      GB             15.0             NaN   
2             Li Wei        Tokyo      JP             11.0           14.18   
3         Emma Brown        Paris      FR              8.0           64.68   
4       Ahmed Hassan       Sydney      AU              7.0           25.35   
5      Sarah Johnson       Berlin      DE             19.0           15.85   
6   Carlos Rodriguez       Mumbai      IN             11.0           95.40   
7      Anna Kowalski    São Paulo      BR             11.0           96.91   
8       James Wilson      Toronto      CA              4.0           82.76   
9        Yuki Tanaka     Shanghai      CN              8.0           37.42   
10       Elena Popov       Madrid      ES              3.0           18.79   
11     Michel Dubois       Moscow      

🔔 **Question:** When might using SELECT * be problematic in a real database?

### 5.2 Column Aliases

Sometimes columns have very complicated, uninformative, or ambiguous names. We can rename the header of a column using an alias, which is a new name given to a column in our result. This is useful when:
- Making technical names more readable
- Clarifying the meaning of computed columns
- Creating reports for non-technical users

Syntax using AS (recommended for clarity):

   ```sql
   SELECT column AS new_name
   FROM   table;
   ```

Alternative syntax (implicit):

   ```sql
   SELECT column new_name
   FROM   table;
   ```

⚠️ **Warning:** If your new name contains spaces or is a reserved SQL keyword, you must wrap it in double quotes:

   ```sql
   SELECT column AS "Full Name"
   FROM   table;
   ```
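
A quick way to confirm what an alias does is to inspect the column names of the result. The sketch below assumes a made-up `people` table with deliberately cryptic column names. The aliases change only the headers of the result, not the table itself.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE people (nm TEXT, cty TEXT)")  # hypothetical cryptic names
demo_conn.execute("INSERT INTO people VALUES ('Ana', 'Lisbon')")

# One alias with a space (needs double quotes) and one plain alias
cursor = demo_conn.execute('SELECT nm AS "Full Name", cty AS city FROM people')

# cursor.description holds one entry per result column; the first item is its name
result_columns = [column[0] for column in cursor.description]
rows = cursor.fetchall()
print(result_columns, rows)
demo_conn.close()
```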

In [5]:
# Column aliases example
alias_query = """
SELECT 
    name AS customer_name,
    city AS location,
    items_purchased AS quantity,
    price_per_item AS unit_price
FROM customers
WHERE items_purchased IS NOT NULL
"""
result = pd.read_sql_query(alias_query, conn)
print("SELECT with Aliases: \n")
print(result)

SELECT with Aliases: 

       customer_name     location  quantity  unit_price
0       Maria Garcia       London      15.0         NaN
1             Li Wei        Tokyo      11.0       14.18
2         Emma Brown        Paris       8.0       64.68
3       Ahmed Hassan       Sydney       7.0       25.35
4      Sarah Johnson       Berlin      19.0       15.85
5   Carlos Rodriguez       Mumbai      11.0       95.40
6      Anna Kowalski    São Paulo      11.0       96.91
7       James Wilson      Toronto       4.0       82.76
8        Yuki Tanaka     Shanghai       8.0       37.42
9        Elena Popov       Madrid       3.0       18.79
10     Michel Dubois       Moscow       2.0       71.58
11      Sofia Santos        Dubai      12.0       49.61
12     Lars Andersen         None       6.0       20.98
13       Aisha Patel  Mexico City       2.0       54.57
14    Diego Martinez    Amsterdam       1.0       13.09
15         Lucy Chen        Cairo      12.0       91.84
16       Ivan Petrov    S

### 5.3 Derived Columns

Very commonly we want to create a "derived column": a column that modifies or combines information from columns of the original table. 

We usually do this with arithmetic or text operators, or with functions, whose syntax is `function(column_name)`. Inside the query, a derived column is treated just like a regular column.

Common Uses:
- Perform calculations
- Combine text
- Transform data

Basic syntax:
```sql
SELECT 
    original_column,
    expression AS new_column
FROM table;
```

Typical expressions:
1. Arithmetic: `price * quantity AS total`
2. Text concatenation: `first_name || ' ' || last_name AS full_name`
3. Simple calculations: `price * 1.2 AS price_with_tax`

![Workflow diagram showing SQL string concatenation that creates a “summary” column from name, price and quantity fields.](../Images/derivedcolumn.svg)
---

In [6]:
# Derived columns with string functions
select_derived_query = """
SELECT 
    LOWER(name) AS lower_name,
    UPPER(city) AS upper_city,
    items_purchased * 2 AS double_items
FROM customers
"""
print("SELECT Example with Derived Columns: \n")
print(pd.read_sql_query(select_derived_query, conn))

SELECT Example with Derived Columns: 

          lower_name   upper_city  double_items
0         john smith     NEW YORK           NaN
1       maria garcia       LONDON          30.0
2             li wei        TOKYO          22.0
3         emma brown        PARIS          16.0
4       ahmed hassan       SYDNEY          14.0
5      sarah johnson       BERLIN          38.0
6   carlos rodriguez       MUMBAI          22.0
7      anna kowalski    SãO PAULO          22.0
8       james wilson      TORONTO           8.0
9        yuki tanaka     SHANGHAI          16.0
10       elena popov       MADRID           6.0
11     michel dubois       MOSCOW           4.0
12      sofia santos        DUBAI          24.0
13     lars andersen         None          12.0
14       aisha patel  MEXICO CITY           4.0
15    diego martinez    AMSTERDAM           2.0
16         lucy chen        CAIRO          24.0
17       ivan petrov    STOCKHOLM          24.0
18     mary williams         None          34.0
1

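⚠️ **Warning:** Common Mistake - Using double quotes for string values

```sql
-- ❌ WRONG: Double quotes for string values
WHERE country = "USA"

-- ✅ CORRECT: Single quotes for strings
WHERE country = 'USA'
```

In standard SQL, double quotes are reserved for identifiers such as column names. SQLite sometimes tolerates double-quoted strings for backward compatibility, but relying on this is not portable.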

In [7]:
# Combining columns and calculations
select_combined_query = """
SELECT 
    name,
    city || ', ' || country AS full_location,
    items_purchased * price_per_item AS total_spent
FROM customers
"""
print("SELECT Example with Combined Columns: \n")
print(pd.read_sql_query(select_combined_query, conn))

SELECT Example with Combined Columns: 

                name    full_location  total_spent
0         John Smith     New York, US          NaN
1       Maria Garcia       London, GB          NaN
2             Li Wei        Tokyo, JP       155.98
3         Emma Brown        Paris, FR       517.44
4       Ahmed Hassan       Sydney, AU       177.45
5      Sarah Johnson       Berlin, DE       301.15
6   Carlos Rodriguez       Mumbai, IN      1049.40
7      Anna Kowalski    São Paulo, BR      1066.01
8       James Wilson      Toronto, CA       331.04
9        Yuki Tanaka     Shanghai, CN       299.36
10       Elena Popov       Madrid, ES        56.37
11     Michel Dubois       Moscow, RU       143.16
12      Sofia Santos        Dubai, AE       595.32
13     Lars Andersen             None       125.88
14       Aisha Patel  Mexico City, MX       109.14
15    Diego Martinez    Amsterdam, NL        13.09
16         Lucy Chen        Cairo, EG      1102.08
17       Ivan Petrov    Stockholm, SE     

🔔 **Question:** Why are we getting NaNs in the last column? How can we avoid this issue?

<div class="alert alert-info">
<b>💡 Best Practices</b><br><br>
1. Always explicitly list the columns you need <br>
2. Use meaningful aliases for clarity<br>
3. Format queries with proper indentation<br>
4. Add comments for complex queries<br>
5. Test queries with a subset of data first
</div>

### Commonly used Functions

We won't be able to cover all of them, but for future reference, here is a quick list of commonly used functions:

#### Text functions

| Function | Description | Example |
|----------|-------------|---------|
| `SUBSTR(text, start, len)` | Substring (1‑based index) | `SUBSTR('Market',1,3)` → `Mar` |
| `INSTR(text, pattern)` | Position (0 if not found) | `INSTR('abcdef','cd')` → `3` |
| `LOWER(text)` / `UPPER(text)` | Case conversion | `LOWER('SQL')` → `sql` |
| `REPLACE(text, old, new)` | Global substitution | `REPLACE('foo','o','0')` → `f00` |
| `TRIM(text)` | Strip leading/trailing spaces | `TRIM(' hi ')` → `hi` |
| `str1 \|\| ' ' \|\| str2` | Concatenates strings (here with a space as the merging character) | `'São' \|\| ' ' \|\| 'Paulo'` → `São Paulo` |
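
The functions in this table can be tried without any table at all, since `SELECT` also accepts plain values. The following sketch runs them on literal strings in an in-memory database. The last column shows why `UPPER(city)` printed `SãO PAULO` earlier: in SQLite builds without the optional ICU extension, `UPPER` and `LOWER` only convert ASCII letters.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")

text_results = demo_conn.execute(
    """
    SELECT SUBSTR('Market', 1, 3),      -- first three characters
           INSTR('abcdef', 'cd'),       -- position of 'cd' (1-based)
           LOWER('SQL'),
           REPLACE('foo', 'o', '0'),
           TRIM('  hi  '),
           'São' || ' ' || 'Paulo',     -- concatenation
           UPPER('São Paulo')           -- non-ASCII letters may stay unchanged
    """
).fetchone()
print(text_results)
demo_conn.close()
```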

#### Date / time Functions

| Function | What it does | Example |
|----------|--------------|---------|
| `DATE('now')` | Current date (UTC) | `DATE('now')` → `'2025-05-07'` |
| `DATETIME('now','localtime')` | Current local datetime | `DATETIME('now','localtime')` → `'2025-05-07 13:25:00'` |
| `STRFTIME(fmt, ts)` | Format timestamp → text | `STRFTIME('%Y-%m', OrderDate)` → `'1997-07'` |
| `JULIANDAY(ts)` | Days since noon on 24 November 4714 BC (the Julian day number) | `JULIANDAY('2025-05-07')` → `2460802.5` |
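
Here is a small sketch of these date functions applied to fixed date strings, so the results do not depend on when you run it. The dates themselves are arbitrary examples. Subtracting two `JULIANDAY` values is a common way to count the days between two dates.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")

date_results = demo_conn.execute(
    """
    SELECT DATE('2025-05-07', '+1 month'),                      -- shift a date
           STRFTIME('%Y-%m', '1997-07-16'),                     -- keep year and month only
           JULIANDAY('2025-05-08') - JULIANDAY('2025-05-07')    -- days between two dates
    """
).fetchone()
print(date_results)
demo_conn.close()
```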


#### Integer and Float Functions

| Expression | Result |
|------------|--------|
| `column1 + column2` | Add two numbers |
| `ROUND(total * 0.15, 2)` | Round to 2 decimal places |
| `COALESCE(price, 0)` | Replace NULL with 0 |

💡**Tip:** Need a float? Multiply an `INTEGER` by **`1.0`**.
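
The sketch below runs a few of these numeric expressions on literal values, so no table is needed. It shows integer division, the `1.0` trick, rounding, and how a single NULL turns a whole arithmetic expression into NULL unless `COALESCE` replaces it first.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")

numeric_results = demo_conn.execute(
    """
    SELECT 7 / 2,                    -- two integers: the result is truncated
           7 * 1.0 / 2,              -- multiplying by 1.0 forces a float
           ROUND(3.14159, 2),
           NULL * 5,                 -- NULL in, NULL out
           COALESCE(NULL, 0) * 5     -- NULL replaced by 0 before multiplying
    """
).fetchone()
print(numeric_results)
demo_conn.close()
```

Python shows SQL's NULL as `None`; pandas displays it as `NaN` in numeric columns.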

Now that we know the basics of querying, we can start dealing with more complex query commands - such as filtering, grouping, aggregate functions, sorting and paginating.

<a id='where'></a>
## 6 · Filtering Rows with `WHERE` 

Estimated time: 10 minutes

### Learning Objective
Learn how to filter data using WHERE clauses, including comparison operators, logical operators (AND, OR), and pattern matching with LIKE.

🔑 **Key Concept:** Filtering

Filtering is the process of selecting a subset of rows that match a certain condition. Think of it as answering the question "Which rows do I want to keep?"

Filtering happens before selecting. This is the crux of our desk-bookcase analogy: we only want to query the data we will actually need, so that it is manageable when it gets to our desk.


**📚 Real-World Applications**
- Finding transactions above a certain amount for audit <br>
- Filtering customer data by region for targeted marketing<br>
- Identifying high-value products for inventory management<br>
- Finding recent orders for shipping prioritization<br>



The basic filtering method is WHERE, and the syntax is as follows:

```sql
SELECT columns
FROM table
WHERE condition;
```

As we will see in a bit, the conditions we can impose are very flexible. As a first example, let's filter our customers table to keep only those customers who purchased more than one item.

In [8]:
# Customers who purchased more than one item
where_query = """
SELECT name, city, items_purchased, price_per_item
FROM customers
WHERE items_purchased > 1
"""
result = pd.read_sql_query(where_query, conn)
print("Customers who purchased more than one item:")
print(result)

Customers who purchased more than one item:
                name         city  items_purchased  price_per_item
0       Maria Garcia       London             15.0             NaN
1             Li Wei        Tokyo             11.0           14.18
2         Emma Brown        Paris              8.0           64.68
3       Ahmed Hassan       Sydney              7.0           25.35
4      Sarah Johnson       Berlin             19.0           15.85
5   Carlos Rodriguez       Mumbai             11.0           95.40
6      Anna Kowalski    São Paulo             11.0           96.91
7       James Wilson      Toronto              4.0           82.76
8        Yuki Tanaka     Shanghai              8.0           37.42
9        Elena Popov       Madrid              3.0           18.79
10     Michel Dubois       Moscow              2.0           71.58
11      Sofia Santos        Dubai             12.0           49.61
12     Lars Andersen         None              6.0           20.98
13       Aisha Pat

A useful property of the `WHERE` clause is that the column used to filter the rows does not have to appear in the `SELECT` list.

For instance, suppose we don't care exactly how many items a customer purchased; we only want the customers who are above a given threshold. We can filter on that column in `WHERE` without listing it in `SELECT`, so the information is used for filtering but not carried into the result, which saves memory.
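
As a small sketch of this idea, the query below filters on `items_purchased` but returns only `name`. The `shoppers` table and its three rows are made up for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE shoppers (name TEXT, items_purchased INTEGER)")
demo_conn.executemany(
    "INSERT INTO shoppers VALUES (?, ?)",
    [("Ana", 1), ("Ben", 12), ("Cleo", 7)],
)

# items_purchased is used in WHERE but is not part of the result
big_buyers = demo_conn.execute(
    """
    SELECT name
    FROM shoppers
    WHERE items_purchased > 5
    ORDER BY name
    """
).fetchall()
print(big_buyers)
demo_conn.close()
```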

🥊 **Challenge:** List the `name`, `country`, and `account_balance` of all customers with an **account balance above 500** who live in **Brazil** (country code `'BR'`).

In [None]:
# This query fails with a syntax error. Fix it!
bad_query = """
SELECT name country account_balance
FROM customers
WHERE account_balance > 500
  country = 'BR';
"""
pd.read_sql_query(bad_query, conn)

### 6.1 Filtering Methods

Here are some commonly used filtering methods:

1. **Equality / inequality**  
   ```sql
   … WHERE country = 'Germany';
   ```
⚠️ **Warning:** Be careful with string comparisons: they may be case-sensitive depending on the SQL engine and its settings. In SQLite, `=` is case-sensitive by default, so `'germany'` does not match `'Germany'`.

2. **Set membership**  
   ```sql
   … WHERE country IN ('USA','UK','Germany');
   ```

3. **Comparison**  
   ```sql
   … WHERE freight > 100;
   ```

📝 **Poll 4:** Which comparison operator would you use to find values in a specific range?

4. **NULL checks**  
   ```sql
   … WHERE fax IS NULL;
   ```

⚠️ **Warning:** When you compare anything to NULL, the result is neither TRUE nor FALSE; it is a third logical value called UNKNOWN. 

This feature can be quite confusing, especially in the beginning. A common mistake is to try to find rows with missing values by using 

`WHERE column = NULL`

For a row whose value is missing, this evaluates `NULL = NULL`, which yields *UNKNOWN* rather than TRUE, so the row is not returned.

In SQL, we should instead use the dedicated operators `IS NULL` and `IS NOT NULL`:

```sql
… WHERE fax IS NULL
… WHERE fax IS NOT NULL
```

<div class="alert alert-warning">
<b>⚠️ Other common WHERE mistakes</b><br><br>

1. String matching
   - ❌ Forgetting quotes: `WHERE name = John`
   - ✅ Using quotes: `WHERE name = 'John'`

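2. Date comparisons (when the column also stores a time of day)
   - ❌ `WHERE date >= '2024-01-01' AND date <= '2024-12-31'` (misses rows later on December 31, such as `'2024-12-31 15:30:00'`)
   - ✅ `WHERE date >= '2024-01-01' AND date < '2025-01-01'`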
</div>
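
To see several of the filtering methods from this section side by side, here is a sketch on a made-up `contacts` table in an in-memory database. The helper `names_where` is hypothetical and exists only to keep the example short. It also includes `LIKE`, which matches text patterns (`%` stands for any sequence of characters) and, in SQLite, ignores case for ASCII letters by default.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE contacts (name TEXT, country TEXT, fax TEXT)")
demo_conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Ana", "Germany", "030-123"),
        ("Ben", "USA", None),        # None is stored as NULL
        ("Cleo", "UK", "020-555"),
        ("Dmitri", "France", None),
    ],
)

def names_where(condition):
    """Return the names of the rows that satisfy the given WHERE condition."""
    sql = f"SELECT name FROM contacts WHERE {condition} ORDER BY name"
    return [row[0] for row in demo_conn.execute(sql)]

in_set = names_where("country IN ('USA', 'UK')")   # set membership
wrong_case = names_where("country = 'germany'")    # = is case-sensitive in SQLite
like_match = names_where("name LIKE 'c%'")         # names starting with c or C
is_null = names_where("fax IS NULL")               # correct NULL check
equals_null = names_where("fax = NULL")            # never TRUE, so no rows

print(in_set, wrong_case, like_match, is_null, equals_null)
demo_conn.close()
```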


In [9]:
# Demonstrating NULL handling with our customers data
# First, the incorrect way
incorrect_null_query = """
SELECT name, city
FROM customers
WHERE city = NULL
"""
print("Incorrect NULL check (city = NULL):")
print(pd.read_sql_query(incorrect_null_query, conn))

# Now the correct way
correct_null_query = """
SELECT name, city, country
FROM customers
WHERE city IS NULL
"""
print("Correct NULL check (city IS NULL):")
print(pd.read_sql_query(correct_null_query, conn))

Incorrect NULL check (city = NULL):
Empty DataFrame
Columns: [name, city]
Index: []
Correct NULL check (city IS NULL):
            name  city country
0  Lars Andersen  None    None
1  Mary Williams  None    None


### 6.2 Multiple Conditions
We can combine multiple conditions using logical operators:

- AND: Both conditions must be true
- OR: At least one condition must be true
- NOT: Reverses a condition

Use parentheses to make the order of operations clear:

```sql
WHERE (country = 'US' OR country = 'GB')
  AND account_balance > 200
```


💡 **Tip:** When you combine several logical expressions with AND/OR/NOT, use parentheses to make clear where one logical expression ends and the next begins.

⚠️ **Warning:** Without parentheses, AND takes precedence over OR.
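
The difference is easy to miss, so here is a sketch that runs the same condition with and without parentheses. The `accounts` table and its three rows are assumptions made up for the example.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE accounts (country TEXT, account_balance REAL)")
demo_conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("US", 100.0), ("GB", 300.0), ("FR", 500.0)],
)

# Without parentheses this reads: country = 'US' OR (country = 'GB' AND balance > 200)
without_parentheses = demo_conn.execute(
    """
    SELECT country FROM accounts
    WHERE country = 'US' OR country = 'GB' AND account_balance > 200
    ORDER BY country
    """
).fetchall()

# With parentheses the balance condition applies to both countries
with_parentheses = demo_conn.execute(
    """
    SELECT country FROM accounts
    WHERE (country = 'US' OR country = 'GB') AND account_balance > 200
    ORDER BY country
    """
).fetchall()

print(without_parentheses, with_parentheses)
demo_conn.close()
```

The US row has a balance of only 100, yet it slips through in the first query because the balance check is attached to the GB condition alone.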

<a id='groupby'></a>
## 7 · Aggregate Functions & `GROUP BY` 

Estimated time: 25 minutes

### Learning Objective
Master data aggregation and grouping using SQL's aggregate functions (COUNT, SUM, AVG, etc.) and GROUP BY clause to analyze and summarize data effectively.

### Real World Application  
Given a list of individual transactions, calculate total sales per region.

### 7.1 The Basic Idea

Another way of pre-processing data so that the end result is more manageable is to summarize it according to a given statistic.

One common example is the use of aggregate functions, combined with GROUP BY, to collapse many rows into one while keeping the information they contain, now summarized in a single row.

First, let's understand what the `GROUP BY` clause does: it splits the table into subsets of rows that are alike in a given way. The most common way of doing so is to pass a column, and SQL will then group the rows according to the values in that column.

Second, `GROUP BY` is used in conjunction with aggregate functions. By grouping rows according to a given column, we guarantee that the values of these rows match **for that particular column**. But what about the others? They might differ, in which case there is no obvious way of combining them. Aggregate functions do exactly this: they tell SQL what to do with the differing values inside the group, for example by counting the number of occurrences, summing, or taking averages:

```sql
SELECT country,
       COUNT(*)        AS n_orders,
       ROUND(AVG(freight),2) AS avg_freight
FROM   orders
GROUP  BY country
```


![Three-step graphic: raw sales rows, grouped by Department, then totals—illustrates how GROUP BY collapses data.](../Images/groupby.svg) 

Let's break down what is going on with the GROUP BY command.

- First, SQL looks at the column named in GROUP BY: `country` in the query above (`Department` in the figure).
- It then creates one bucket for each distinct value in this column and places every row in the bucket that matches its value.
- For each of these groups, it runs the aggregate functions: in the query above, COUNT(*), which counts how many rows each group has, and AVG(freight), which averages the freight column across all rows in the group.
- Finally, it returns the grouping value and the aggregate results for each group. Notice that each group has only one row in the resulting table: we have "collapsed" the table!
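
The steps above can be sketched in plain Python, without any database, to make the "bucket, then aggregate" idea concrete. The `orders_demo` rows are invented for illustration, and this is only a conceptual sketch, not how a database engine actually implements GROUP BY.

```python
from collections import defaultdict

# Hypothetical (country, freight) rows, invented for illustration.
orders_demo = [("US", 10.0), ("GB", 4.0), ("US", 20.0), ("GB", 6.0), ("FR", 8.0)]

# Steps 1-2: create one bucket per distinct value of the grouping column.
buckets = defaultdict(list)
for country, freight in orders_demo:
    buckets[country].append(freight)

# Steps 3-4: run the aggregates on each bucket and emit one row per group
# (the equivalent of COUNT(*) and ROUND(AVG(freight), 2)).
summary = [
    (country, len(values), round(sum(values) / len(values), 2))
    for country, values in sorted(buckets.items())
]
print(summary)
```

Five input rows become three output rows, one per country.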


🥊 **Challenge:** For each country, list the number of customers (COUNT(*)) and the total items purchased (SUM(items_purchased)).
Show only countries with at least 5 customers.

In [None]:
# Debug this intentional error

bad_query = """
SELECT country,
       COUNT(*),                       
       SUM(items_purchased)
FROM customers
WHERE COUNT(*) >= 5                    
GROUP BY country;
"""
pd.read_sql_query(bad_query, conn)

### 7.2 More Advanced Ideas
Another very important rule, which is a bit tough to get used to in the beginning, is that the SELECT statement can only include columns that are either used in GROUP BY or passed as inputs to aggregate functions. This follows from what we discussed previously: if the values differ between rows, SQL doesn't know which one to keep when it collapses all the rows of the group into a single one.

A bit more advanced: we can also pass more than one column to the GROUP BY statement, which creates groups in which rows have the same values for all of the columns passed.
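
As a small sketch of grouping by more than one column, the example below groups by `country` and `city` together. The table `customers_demo` and its rows are hypothetical and only loosely modelled on the workshop's `customers` table.

```python
import sqlite3

# Hypothetical table, invented for this illustration only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE customers_demo (country TEXT, city TEXT, items_purchased INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO customers_demo VALUES (?, ?, ?)",
    [("US", "Boston", 3), ("US", "Boston", 5), ("US", "Austin", 2), ("GB", "London", 7)],
)

# One output row per distinct (country, city) pair.
rows = demo_conn.execute("""
    SELECT country, city,
           COUNT(*)             AS n_customers,
           SUM(items_purchased) AS total_items
    FROM   customers_demo
    GROUP  BY country, city
    ORDER  BY country, city
""").fetchall()

for row in rows:
    print(row)
```

The two Boston rows collapse into one group, while Austin stays separate even though it shares the same country.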

💡 **Tip**: By using GROUP BY, we obtain a collapsed version of the table - we only retain information on the aggregate values. While this is very useful for summarizing information, sometimes we want to keep the detailed data and the summary statistics for more in-depth analysis. This is exactly the problem that Window Functions solve - and something we will be dealing with in the intermediate workshop!

In [10]:
# GROUP BY country with aggregations
group_by_query = """
SELECT 
    country,
    COUNT(*) AS customer_count,
    ROUND(AVG(items_purchased), 2) AS avg_items,
    ROUND(AVG(account_balance), 2) AS avg_balance,
    ROUND(MAX(account_balance), 2) AS max_balance
FROM customers
WHERE country IS NOT NULL
GROUP BY country
"""
result = pd.read_sql_query(group_by_query, conn)
print("Customer statistics by country:\n")
print(result)

Customer statistics by country:

   country  customer_count  avg_items  avg_balance  max_balance
0       AE               1       12.0       352.84       352.84
1       AU               1        7.0       179.64       179.64
2       BR               1       11.0       392.80       392.80
3       CA               1        4.0       449.81       449.81
4       CN               1        8.0       344.21       344.21
5       DE               1       19.0          NaN          NaN
6       EG               1       12.0       167.10       167.10
7       ES               1        3.0       845.86       845.86
8       FR               1        8.0       929.69       929.69
9       GB               1       15.0       905.34       905.34
10      HK               1       15.0       833.92       833.92
11      IN               1       11.0       140.70       140.70
12      IT               1       16.0       104.97       104.97
13      JP               1       11.0       638.11       638.11
14     

⚠️ Warning: Common Mistake - Using aggregate functions without GROUP BY

```sql
-- ❌ WRONG: Mixing aggregate and non-aggregate columns
SELECT country, COUNT(*), AVG(account_balance)
FROM customers;

-- ✅ CORRECT: Add GROUP BY for non-aggregate columns
SELECT country, COUNT(*), AVG(account_balance)
FROM customers
GROUP BY country;
```
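
One caveat for the SQLite setup used in this workshop: SQLite is more permissive than most engines and runs the "wrong" query without raising an error. It returns a single row whose `country` value is taken from an arbitrary row, which makes the mistake easy to miss. The sketch below illustrates this on a hypothetical three-row table (`customers_demo` is not the workshop dataset).

```python
import sqlite3

# Hypothetical table, invented for this illustration only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE customers_demo (country TEXT, account_balance REAL)")
demo_conn.executemany(
    "INSERT INTO customers_demo VALUES (?, ?)",
    [("US", 100.0), ("GB", 300.0), ("FR", 200.0)],
)

# No GROUP BY: the whole table is treated as one group, so only one row
# comes back. The country shown is picked from some row and is not meaningful.
rows = demo_conn.execute(
    "SELECT country, COUNT(*), AVG(account_balance) FROM customers_demo"
).fetchall()

print(len(rows))   # a single row for the whole table
print(rows[0])     # country is arbitrary; count and average cover all rows
```

Many other databases reject this query outright, so it is safer to treat the rule as mandatory even when SQLite lets it through.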

### 7.3 Summarizing

Key points:

- Every non-aggregated column in SELECT must be in GROUP BY
- GROUP BY comes after WHERE but before ORDER BY
- Can group by multiple columns
- Can use expressions in GROUP BY 

For future reference, here is a list of the most commonly used aggregate functions:

| Aggregate Function                                   | What it returns (typical usage)                    |
| ---------------------------------------------------- | -------------------------------------------------- |
| `COUNT(*)`                                           | Total number of rows in the group/query            |
| `SUM(col)`                                           | Arithmetic sum of a numeric column                 |
| `AVG(col)`                                           | Mean (average) of numeric values                   |
| `MAX(col)`                                           | Largest value (numeric *or* lexicographic)         |
| `MIN(col)`                                           | Smallest value (numeric *or* lexicographic)        |
| `STRING_AGG(col, ', ')` / `GROUP_CONCAT` / `LISTAGG` | Concatenates strings in the group with a delimiter |
| `COUNT(DISTINCT col)`                                | Count of unique, non-NULL values                   |

💡 **Tip**: Aggregate functions such as `AVG` and `SUM` ignore `NULL` values, which is usually what you want. `COUNT(column)` counts only non‑null values, whereas `COUNT(*)` counts *all* rows.
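
The NULL behaviour described in the tip can be checked with a short sketch. The table `customers_demo` is hypothetical and contains one missing balance on purpose.

```python
import sqlite3

# Hypothetical table with one NULL balance, invented for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE customers_demo (name TEXT, account_balance REAL)")
demo_conn.executemany(
    "INSERT INTO customers_demo VALUES (?, ?)",
    [("Ann", 100.0), ("Bob", None), ("Cy", 300.0)],
)

count_all, count_balance, total, average = demo_conn.execute("""
    SELECT COUNT(*),                -- counts every row, NULL or not
           COUNT(account_balance),  -- counts only non-NULL balances
           SUM(account_balance),    -- NULL is skipped
           AVG(account_balance)     -- divides by the number of non-NULL values
    FROM   customers_demo
""").fetchone()

print(count_all, count_balance, total, average)
```

Note that the average is 400 / 2, not 400 / 3: Bob's missing balance is left out of both the sum and the row count used for the mean.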

### 7.4 · Filtering Groups with `HAVING` 

Remember that when we discussed filtering, we used the WHERE clause, which runs before the ```GROUP BY```.

Sometimes, we want to filter on the result of an aggregate function instead. For example, we might want to keep only the groups (say, countries) whose average expenditure is larger than a certain amount.

`HAVING` is evaluated **after** grouping – it filters *groups*, whereas `WHERE` filters *rows*.

Two observations:
- We cannot use aggregate functions in a ```WHERE``` clause
- ```HAVING``` must come after the ```GROUP BY``` clause

```sql
SELECT country,
       COUNT(*) AS n_orders
FROM   orders
GROUP  BY country
HAVING n_orders > 20          -- aggregate in condition
ORDER  BY n_orders DESC;
```
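
To make the difference between `WHERE` and `HAVING` tangible, here is a minimal sketch on a hypothetical `orders_demo` table (the rows are invented). It runs `HAVING` alone, then `WHERE` plus `HAVING`, and finally shows that an aggregate inside `WHERE` is rejected.

```python
import sqlite3

# Hypothetical table, invented for this illustration only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE orders_demo (country TEXT, freight REAL)")
demo_conn.executemany(
    "INSERT INTO orders_demo VALUES (?, ?)",
    [("US", 5.0), ("US", 50.0), ("US", 60.0), ("GB", 70.0), ("GB", 1.0), ("FR", 80.0)],
)

# HAVING alone: keep countries (groups) with at least 2 orders.
having_only = demo_conn.execute("""
    SELECT country, COUNT(*) AS n_orders
    FROM   orders_demo
    GROUP  BY country
    HAVING COUNT(*) >= 2
    ORDER  BY country
""").fetchall()

# WHERE + HAVING: first drop cheap shipments (rows), then keep countries
# that still have at least 2 orders left (groups).
where_then_having = demo_conn.execute("""
    SELECT country, COUNT(*) AS n_orders
    FROM   orders_demo
    WHERE  freight > 10
    GROUP  BY country
    HAVING COUNT(*) >= 2
    ORDER  BY country
""").fetchall()

# An aggregate inside WHERE fails: WHERE runs before any group exists.
try:
    demo_conn.execute(
        "SELECT country FROM orders_demo WHERE COUNT(*) >= 2 GROUP BY country"
    )
    aggregate_in_where_failed = False
except sqlite3.Error as err:
    aggregate_in_where_failed = True
    print("SQLite refused the query:", err)

print(having_only)
print(where_then_having)
```

GB passes the first query with two orders, but drops out of the second because one of its orders was removed by `WHERE` before the groups were counted.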

In [11]:
# Countries with high average purchases using HAVING
having_query = """
SELECT 
    country,
    COUNT(*) AS customer_count,
    ROUND(AVG(items_purchased), 2) AS avg_items_purchased,
    ROUND(AVG(account_balance), 2) AS avg_balance
FROM customers
WHERE country IS NOT NULL 
  AND items_purchased IS NOT NULL
GROUP BY country
HAVING AVG(items_purchased) > 5 
   AND COUNT(*) >= 2
"""
result = pd.read_sql_query(having_query, conn)
print("Countries with average purchases > 5 and at least 2 customers:")
print(result)

Countries with average purchases > 5 and at least 2 customers:
Empty DataFrame
Columns: [country, customer_count, avg_items_purchased, avg_balance]
Index: []


![Split-path illustration contrasting filtering rows with WHERE before grouping versus HAVING after aggregation.](../Images/wherehaving.svg)

🙋 **Hands-Up:** Can you explain—in one sentence—what `HAVING` does that `WHERE` can’t? A. Yes B. Not yet

<a id='orderby'></a>
## 8 · Sorting & Paginating Results 

Estimated Time: 10 minutes

### Real World Application

Find the top 10 selling items in a given year.

After we have processed our data, we might want to start preparing it for visualization. In SQL, this is mostly done by sorting (ordering the data according to one or more columns) or paginating (retrieving only a fixed number of rows at a time).

`ORDER BY` is evaluated *after* `SELECT`.  

* Default sort is **ASC** (ascending).  
* Use **DESC** for descending order.  


In [12]:
# Simple ORDER BY example with one column
simple_order_query = """
SELECT 
   name,
   city,
   items_purchased
FROM customers
WHERE items_purchased IS NOT NULL
ORDER BY items_purchased DESC
"""

print("Simple ORDER BY Example - Customers by Items Purchased (highest first):\n")
result = pd.read_sql_query(simple_order_query, conn)
print(result)

Simple ORDER BY Example - Customers by Items Purchased (highest first):

                name         city  items_purchased
0      Sarah Johnson       Berlin             19.0
1           Jun Park        Seoul             19.0
2      Mary Williams         None             17.0
3       Hans Schmidt         Rome             16.0
4       Maria Garcia       London             15.0
5     Isabella Silva    Hong Kong             15.0
6       Sofia Santos        Dubai             12.0
7          Lucy Chen        Cairo             12.0
8        Ivan Petrov    Stockholm             12.0
9       Anna Ivanova      Bangkok             12.0
10            Li Wei        Tokyo             11.0
11  Carlos Rodriguez       Mumbai             11.0
12     Anna Kowalski    São Paulo             11.0
13         Raj Kumar  Los Angeles             10.0
14        Emma Brown        Paris              8.0
15       Yuki Tanaka     Shanghai              8.0
16      Ahmed Hassan       Sydney              7.0
17     Lars Anders

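You can order by *multiple* columns: the second acts as a tie‑breaker. The same rule applies to the `ORDER BY` inside a window function such as `ROW_NUMBER()` (window functions are covered in the intermediate workshop):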

```sql
SELECT
       name,
       country,
       total_spent,
       ROW_NUMBER() OVER (
           PARTITION BY country
           ORDER BY total_spent DESC,   -- primary
                    name               -- secondary tie-breaker
       ) AS spend_rank
FROM   customer_spending;
```


In [13]:
# Example showing ORDER BY with multiple columns
order_by_example = """
SELECT 
   name,
   country,
   account_balance
FROM customers
WHERE account_balance IS NOT NULL
ORDER BY country ASC, account_balance DESC
LIMIT 10;
"""

print("ORDER BY Example - Sorting by Country then Balance:")
result = pd.read_sql_query(order_by_example, conn)
print(result)

ORDER BY Example - Sorting by Country then Balance:
            name country  account_balance
0  Mary Williams    None           795.02
1  Lars Andersen    None           588.43
2   Sofia Santos      AE           352.84
3   Ahmed Hassan      AU           179.64
4  Anna Kowalski      BR           392.80
5   James Wilson      CA           449.81
6    Yuki Tanaka      CN           344.21
7      Lucy Chen      EG           167.10
8    Elena Popov      ES           845.86
9     Emma Brown      FR           929.69



### Pagination Pattern

LIMIT restricts how many rows we retrieve. <br>
OFFSET skips a given number of rows before returning the number of rows set by LIMIT.


```sql
SELECT company_name, country, city
FROM   customers
ORDER  BY country
LIMIT  n
OFFSET m;
```

`LIMIT` must appear *before* `OFFSET` in SQLite.

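⚠️ **Warning:** Without an `ORDER BY`, SQL does not guarantee the order in which rows are returned, so the rows picked by LIMIT/OFFSET can change from one run to the next. Remember to always ORDER BY before using LIMIT/OFFSET!

💡 **Tip**: `LIMIT` caps the number of rows returned, not necessarily the number of rows the database has to process. With an `ORDER BY` on a column that has no index, the engine generally still reads and sorts every matching row before it can return the first few. The difference is crucial to understand when thinking about factors such as speed, memory constraints and compute budgets!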
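
The `LIMIT n OFFSET m` pattern is often wrapped in a small helper that turns a page number into an offset. The sketch below is one possible way to do it; the function `fetch_page`, the table `customers_demo` and its seven rows are all hypothetical names introduced for this illustration.

```python
import sqlite3

# Hypothetical table with 7 customers, invented for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE customers_demo (name TEXT, account_balance REAL)")
demo_conn.executemany(
    "INSERT INTO customers_demo VALUES (?, ?)",
    [(f"customer_{i}", float(i * 10)) for i in range(1, 8)],
)

def fetch_page(conn, page, page_size):
    """Return one page of customers, highest balance first (pages start at 0)."""
    return conn.execute(
        "SELECT name, account_balance FROM customers_demo "
        "ORDER BY account_balance DESC, name "  # a stable order is required for paging
        "LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()

# Three pages of size 3 cover all 7 rows; the last page is shorter.
pages = [fetch_page(demo_conn, p, 3) for p in range(3)]
for number, page in enumerate(pages):
    print(number, page)
```

Sorting by `name` as a second column keeps the page boundaries stable even when two customers have the same balance.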

In [14]:
# Sorting and pagination example
pagination_query = """
SELECT 
    name,
    country,
    account_balance
FROM customers
WHERE account_balance IS NOT NULL
ORDER BY account_balance DESC
LIMIT 5
OFFSET 5;
"""
result = pd.read_sql_query(pagination_query, conn)
print("Customers ranked 6-10 by account balance:\n")
print(result)

Customers ranked 6-10 by account balance:

             name country  account_balance
0  Isabella Silva      HK           833.92
1  Diego Martinez      NL           821.98
2   Mary Williams    None           795.02
3    Anna Ivanova      TH           794.14
4        Jun Park      KR           756.11


🙋 **Hands-Up:** Which clause actually *sorts* the result set? A. `SELECT` B. `ORDER BY` C. `LIMIT`

⚠️ Warning: Common Mistake - Wrong clause order

```sql
-- ❌ WRONG: ORDER BY must come before LIMIT
SELECT * FROM customers
LIMIT 10
ORDER BY account_balance DESC;
```

```sql
-- ✅ CORRECT: Proper SQL clause order
SELECT * FROM customers
ORDER BY account_balance DESC
LIMIT 10;
```

## Putting it all together

We learned quite a few different commands for queries. Let's see one example that includes all of them:

```sql
SELECT
    Country,
    COUNT(OrderID)                AS total_orders,
    ROUND(AVG(Freight), 2)        AS avg_freight
FROM    orders
WHERE   Country IN ('USA','UK','Germany')
GROUP BY Country
HAVING   COUNT(OrderID) >= 10
ORDER BY total_orders DESC
LIMIT 5;
```

The diagram below shows the **logical query order**, that is, the order in which the clauses are actually evaluated, which differs from the order in which they are written.

![Horizontal flowchart listing SQL clause execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT](../Images/sql-execution-order.svg)
---




### A visualization of the order of query operations

Let's go through an example of what the order of query operations looks like in practice.

**FROM `customers`** – full table (7 rows).

| CustID | Country | Orders |
|-------|---------|--------|
| C1 | USA | 5 |
| C2 | USA | 7 |
| C3 | UK  | 3 |
| C4 | UK  | 7 |
| C5 | FRA | 15 |
| C6 | GER | 2 |
| C7 | CAN | 6 |

---

**WHERE `Country IN ('USA','UK','FRA','GER')`** – drop the Canadian row.

| CustID | Country | Orders |
|-------|---------|--------|
| C1 | USA | 5 |
| C2 | USA | 7 |
| C3 | UK  | 3 |
| C4 | UK  | 7 |
| C5 | FRA | 15 |
| C6 | GER | 2 |

---

**GROUP BY `Country`** – aggregate rows, summing `Orders`. `CustID` is dropped because it differs within each group.

| Country | TotalOrders |
|---------|-------------|
| USA | 12 |
| UK  | 10 |
| FRA | 15 |
| GER | 2  |

---

**HAVING `TotalOrders > 5`** – keep only groups with large order volume; Germany drops out.

| Country | TotalOrders |
|---------|-------------|
| USA | 12 |
| UK  | 10 |
| FRA | 15 |

---

**SELECT `Country, TotalOrders`** – project just the columns we care about (already those two).

| Country | TotalOrders |
|---------|-------------|
| USA | 12 |
| UK  | 10 |
| FRA | 15 |

---

**ORDER BY `TotalOrders DESC`** – rank countries by order volume.

| Country | TotalOrders |
|---------|-------------|
| FRA | 15 |
| USA | 12 |
| UK  | 10 |

---

**LIMIT 2** – return only the top two performers.

| Country | TotalOrders |
|---------|-------------|
| FRA | 15 |
| USA | 12 |
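
The walkthrough above can be reproduced end to end with a short sketch. It loads the same seven rows into an in-memory table (named `customers_demo` here to avoid clashing with the workshop's `customers` table) and runs all the clauses in a single query.

```python
import sqlite3

# The seven rows from the walkthrough, loaded into a throwaway in-memory table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE customers_demo (CustID TEXT, Country TEXT, Orders INTEGER)")
demo_conn.executemany(
    "INSERT INTO customers_demo VALUES (?, ?, ?)",
    [
        ("C1", "USA", 5), ("C2", "USA", 7), ("C3", "UK", 3), ("C4", "UK", 7),
        ("C5", "FRA", 15), ("C6", "GER", 2), ("C7", "CAN", 6),
    ],
)

top_two = demo_conn.execute("""
    SELECT   Country, SUM(Orders) AS TotalOrders   -- 5. project
    FROM     customers_demo                        -- 1. full table
    WHERE    Country IN ('USA','UK','FRA','GER')   -- 2. drop the Canadian row
    GROUP BY Country                               -- 3. collapse to one row per country
    HAVING   SUM(Orders) > 5                       -- 4. Germany drops out
    ORDER BY TotalOrders DESC                      -- 6. rank by order volume
    LIMIT    2                                     -- 7. keep the top two
""").fetchall()

print(top_two)
```

The numbered comments follow the logical evaluation order, which is why step 5 sits on the first line of the query.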





<a id='keypoints'></a>
<div class="alert alert-success">  
    
## 9 · Key Points & Next Steps 

Estimated Time: 2 minutes

* Use SQL to select and pre-process only the data you really need, then use this smaller dataset for analysis with pandas.
* SQLite is a zero‑install, single‑file engine that still speaks standard SQL.  
* Remember the logical query order to avoid confusion (`WHERE` vs `HAVING`).  

### What’s Next?  
In the **Intermediate SQL** workshop we will tackle:

* Creating & altering tables  
* `JOIN`s (```INNER```, ```LEFT```, ```RIGHT```, ```FULL```) and set operations
* `JOIN` as selection  
* Subqueries & Common Table Expressions (`WITH`)  
* Window functions (`ROW_NUMBER`, `LAG`, `LEAD`)  
* ```UNION```
* Pivoting

</div>

## 🎬 Demo — Customer Spending vs. Income Ranking  

💡 **Goal**  
Show how window functions, CTEs, and JOINs can be combined to answer a practical business question:

> *“How much does each customer spend relative to the income they report?”*

### What the query does
1. **`customer_spending` CTE**  
   *Calculates* each customer’s total spending (`items_purchased × price_per_item`).

2. **`income_ranking` CTE**  
   Aggregates income per customer (`SUM(amount) AS total_income`) **and** ranks customers by that total using  
   `RANK() OVER (ORDER BY SUM(amount) DESC)`.

3. **Main SELECT**  
   *Joins* the two CTEs on `name` and computes  
   **`spending_pct_of_income`** = spending ÷ income × 100.

Run the cell below and inspect the result.


In [15]:
# DEMO: Combining Window Functions with JOIN
income_df = pd.read_sql_query('SELECT * FROM income', conn)

demo_query = """
WITH customer_spending AS (
    SELECT 
        name,
        items_purchased * price_per_item AS total_spent
    FROM customers
    WHERE items_purchased IS NOT NULL 
      AND price_per_item IS NOT NULL
),
income_ranking AS (
    SELECT 
        name,
        SUM(amount) AS total_income,
        RANK() OVER (ORDER BY SUM(amount) DESC) AS income_rank
    FROM income
    WHERE amount IS NOT NULL
    GROUP BY name
)
SELECT 
    cs.name,
    cs.total_spent,
    ir.total_income,
    ir.income_rank,
    CASE 
        WHEN ir.total_income > 0 
        THEN ROUND(cs.total_spent * 100.0 / ir.total_income, 2)
        ELSE NULL 
    END AS spending_pct_of_income
FROM customer_spending cs
JOIN income_ranking ir ON cs.name = ir.name
ORDER BY ir.income_rank;
"""

print("🎬 DEMO: Customer Spending Analysis with Income Ranking\n")
display(pd.read_sql_query(demo_query, conn))


🎬 DEMO: Customer Spending Analysis with Income Ranking



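|   | name | total_spent | total_income | income_rank | spending_pct_of_income |
|---|------|-------------|--------------|-------------|------------------------|
| 0 | Ahmed Hassan | 177.45 | 5800.0 | 1 | 3.06 |
| 1 | Lars Andersen | 125.88 | 4500.0 | 2 | 2.8 |
| 2 | Yuki Tanaka | 299.36 | 3200.0 | 8 | 9.36 |
| 3 | Diego Martinez | 13.09 | 2700.0 | 10 | 0.48 |
| 4 | Elena Popov | 56.37 | 1800.0 | 12 | 3.13 |
| 5 | Mary Williams | 1183.71 | 1200.0 | 13 | 98.64 |
| 6 | Sarah Johnson | 301.15 | 400.0 | 14 | 75.29 |
| 7 | Li Wei | 155.98 | 200.0 | 15 | 77.99 |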


🙋 **Hands-Up:** How was that exercise? A. Easy B. Okay C. Tough