# SQL for Data Analysis  
### Introductory Workshop
*D‑Lab, UC Berkeley*

## Table of Contents  
1. [Why SQL?](#why-sql)
2. [Relational Databases](#relational)
3. [Getting Started with SQLite](#sqlite)
4. [SQLite Data Types & NULLs](#types)
5. [SELECT & Derived Columns](#select)
6. [Filtering Rows with WHERE](#where)
7. [Aggregates & GROUP BY](#groupby)
8. [Sorting & Paginating Results](#orderby)
9. [Filtering Groups with HAVING](#having)
10. [Key Points & Next Steps](#keypoints)

<div class="alert alert-success"> 
<b>Learning Goals</b><br><br>
By the end of this workshop you will be able to:
<ul>
<li>Explain why SQL is useful and how it complements pandas.</li>
<li>Write basic queries with <code>SELECT</code>, create aliases, and build derived columns.</li>
<li>Filter rows using <code>WHERE</code>, sort and paginate result sets with <code>ORDER BY</code>, <code>LIMIT</code>, <code>OFFSET</code>.</li>
<li>Summarise data with aggregates &nbsp;(<code>COUNT</code>, <code>SUM</code>, <code>AVG</code>, <code>MIN</code>, <code>MAX</code>)&nbsp; and <code>GROUP BY</code>.</li>
<li>Filter groups with <code>HAVING</code> and understand why it is different from <code>WHERE</code>.</li>
</ul>
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
📝 **Poll:** A Zoom poll to help you learn!<br>
🎬 **Demo**: Showing off something more advanced – so you know what SQL can be used for!<br>

<a id='why-sql'></a>
## 1 · Why SQL? 

📝 **Poll:** What is the usual size of databases you use? How large do you think databases can get?

### Bookcase vs Desk Analogy  
Think of your computer’s RAM as the desk where you spread papers you’re actively working on, and the hard‑drive as the bookcase that stores all your books.

Working at your desk is fast, but a desk doesn't hold nearly as much material as a bookcase. So if you work with very large datasets, you risk overloading your desk.

* **pandas** is excellent for analysis of *smaller* data that fits comfortably on the desk.  
* **SQL** is the tool we use to selectively bring only the information we need from the bookcases to the desk.

> **Example:** A 20 GB transaction table can be grouped and aggregated with a single SQL statement, whereas pandas would first need 20 GB of memory just to read the file.

### Complements, not Substitutes
In practice we often:

1. Use SQL to slice/aggregate huge tables, producing a manageable result set  
2. Pull that into pandas for plotting or modelling  
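
The two steps above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `transactions` table and its values are made up, and the table is tiny so the example runs anywhere. The point is that only the small summary, not the raw rows, is brought into Python.

```python
import sqlite3

# Hypothetical transactions table; imagine millions of rows stored on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("West", 10.0), ("West", 5.0), ("East", 7.5), ("East", 2.5), ("East", 1.0)],
)

# Step 1: the database aggregates the raw rows and returns one row per region.
summary = conn.execute(
    """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total
    FROM transactions
    GROUP BY region
    ORDER BY region
    """
).fetchall()

# Step 2: only this small summary reaches Python (the "desk").
print(summary)  # [('East', 3, 11.0), ('West', 2, 15.0)]
conn.close()
```

From here the summary rows could be handed to pandas (for example with `pd.DataFrame(summary)`) for plotting or modelling.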

> **Poll 🚀**  How comfortable are you with SQL?  
> - 0 – Never used it  
> - 1 – Use a couple of SELECT statements  
> - 2 – Can write basic filters  
> - 3 – Confident with GROUP BY

<a id='relational'></a>
## 2 · Relational Databases 

### What is a Relational Database?

A relational database organizes data into **tables** (sometimes called relations), which consist of rows and columns:

- **Tables**: Collections of related data (e.g., Customers, Orders)
- **Columns**: Specific attributes (e.g., CustomerID, FirstName)
- **Rows**: Individual records in the table
- **Primary Keys**: Unique identifiers for each row
- **Foreign Keys**: References to primary keys in other tables that create relationships

#### Example Schema

| Table      | Columns |
|----------- |-----------------------------|
| **Customers** | __CustomerID__ (PK), CompanyName, Country |
| **Orders**    | __OrderID__ (PK), OrderDate, **CustomerID** (FK) |

> **Primary Key vs Foreign Key**  
> • **Primary Key (PK)**: Unique identifier for a row: Each row has one, and no two rows share one.  
> • **Foreign Key (FK)**: References the Primary Key of a different table, used to make connections between tables.

**Sample rows**

```text
Customers
+------------+--------------+---------+
|CustomerID  |CompanyName   |Country  |
+------------+--------------+---------+
|ALFKI       |Alfreds Futter|Germany  |
|ANATR       |Ana Trujillo  |Mexico   |
+------------+--------------+---------+

Orders
+---------+------------+------------+
|OrderID  |OrderDate   |CustomerID  |
+---------+------------+------------+
|10248    |1996‑07‑04  |VINET       |
|10249    |1996‑07‑05  |TOMSP       |
+---------+------------+------------+
```

These tables connect on `CustomerID`: each order “belongs” to exactly one customer. Notice that `CustomerID` is the Primary Key of the `Customers` table, but a Foreign Key in the `Orders` table.
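
To see what these keys do in practice, here is a minimal sketch using Python's built-in `sqlite3` module. The table definitions mirror the example schema above, but the column types, the inserted order, and the `try_insert` helper are assumptions made for illustration. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on

conn.execute("""
    CREATE TABLE Customers (
        CustomerID  TEXT PRIMARY KEY,
        CompanyName TEXT,
        Country     TEXT
    )
""")
conn.execute("""
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        OrderDate  TEXT,
        CustomerID TEXT REFERENCES Customers(CustomerID)
    )
""")

conn.execute("INSERT INTO Customers VALUES ('ALFKI', 'Alfreds Futter', 'Germany')")
conn.execute("INSERT INTO Orders VALUES (1, '1996-07-04', 'ALFKI')")  # hypothetical order


def try_insert(sql):
    """Return 'ok' if the statement succeeds, otherwise the database's error message."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


# A second customer with the same primary key is refused.
duplicate_pk = try_insert("INSERT INTO Customers VALUES ('ALFKI', 'Another Co', 'France')")
# An order pointing at a customer that does not exist is refused too.
missing_fk = try_insert("INSERT INTO Orders VALUES (2, '1996-07-05', 'NOPE')")

print(duplicate_pk)
print(missing_fk)
```

Both inserts are rejected: the first violates the uniqueness of the Primary Key, the second references a `CustomerID` that has no matching row in `Customers`.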



📝 **Poll 2:** Can you think of a table that would not have a Primary Key?

<a id='sqlite'></a>
## 3 · Getting Started with SQLite



### What is SQLite?

SQLite is a lightweight relational database. It is easy to install and use, and although it lacks some of the advanced capabilities of PostgreSQL or MySQL, it is a great way to start learning SQL. Queries are very similar (often identical) across these systems, so most of what you learn here transfers if you move to another one later.

### SQLite vs. MySQL/PostgreSQL

| Feature | SQLite | MySQL/PostgreSQL |
|---------|--------|------------------|
| Deployment | Single file | Server installation |
| Configuration | None required | Requires setup |
| Concurrency | Limited | High |
| Size | Best for <1TB | Virtually unlimited |
| Use cases | Local applications, prototyping, testing | Enterprise applications, web services |

**Examples:** WhatsApp uses SQLite to store messages locally, and SQLite is built into the Android OS for local app storage. Netflix, Spotify and Instagram use PostgreSQL.

📝 **Poll:** What's your primary reason for learning SQL today? 


### Installing SQLite

In this workshop, we'll work with SQLite directly in a Jupyter notebook using the `sqlite3` Python library, which comes built into Python.

We could create tables with `CREATE TABLE` statements, and add (`INSERT`), change (`UPDATE`) or remove (`DELETE`) data directly in SQL, but in this workshop we will focus on querying data from a pre-existing database.

For now, let’s *import* a CSV file (`customers.csv`) so we have data to query.

`pandas.DataFrame.to_sql()` conveniently:

1. **Creates** a table matching the DataFrame’s columns  
2. **Inserts** all rows in bulk (not optimal for millions of rows, but fine for demos)

---
> In this workshop we will be focusing on databases that contain a single table. In the next workshop, we will learn how to combine information from multiple tables. 


In [15]:
# First let's import the packages we will be needing:

import sqlite3
from sqlalchemy import create_engine
import pandas as pd

# The first thing we need to do is connect to our SQLite database, which is the file that stores the information we will be using.
# If there is no pre-existing database with this name, SQLite automatically creates one.
# Every SQL query will be executed through this `conn` object, which can be thought of as a channel to the SQLite file.
conn = sqlite3.connect("first_workshop.sqlite")

# Load the CSV into a DataFrame and then into SQLite
df = pd.read_csv("customers.csv") # Reads the information from the CSV as a pandas DataFrame
df.to_sql("customers", conn, if_exists="replace", index=False) # Turns that DataFrame into a SQL table, replacing the table if it already exists.

# In recent pandas versions, df.to_sql also returns the number of rows inserted into the SQL table

93
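
After loading data it can be useful to check what actually ended up in the database. The sketch below is self-contained, so instead of `first_workshop.sqlite` it uses an in-memory database with a small stand-in `customers` table (its columns are assumptions). It relies on two SQLite features: the `sqlite_master` catalogue and `PRAGMA table_info`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for first_workshop.sqlite
conn.execute("CREATE TABLE customers (name TEXT, country TEXT, purchase_amount INTEGER)")
conn.execute("INSERT INTO customers VALUES ('John Smith', 'US', 3)")

# sqlite_master is SQLite's built-in catalogue of tables, indexes and views.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# PRAGMA table_info returns one row per column: (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customers)")]

# A row count is a quick sanity check that the import worked.
n_rows = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

print(tables)   # ['customers']
print(columns)  # [('name', 'TEXT'), ('country', 'TEXT'), ('purchase_amount', 'INTEGER')]
print(n_rows)   # 1
```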

<a id='types'></a>
## 4 · SQLite Data Types & `NULL` 

SQLite has a flexible approach to data types. Columns are declared with a preferred data type (like TEXT or INTEGER), but SQLite can actually store any type of value in almost any column. Each stored value belongs to one of these storage classes:

* **INTEGER** – whole numbers
* **REAL** – floating‑point numbers
* **TEXT** – strings
* **BLOB** – raw binary data
* **NULL** – missing / undefined

Unlike PostgreSQL/MySQL, SQLite does not have a dedicated date type. Dates are stored as TEXT, REAL or INTEGER values, and built-in date functions interpret them as dates.
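
The flexible typing described above can be observed with SQLite's `typeof()` function. This is a small sketch on a made-up `demo` table (the table and its values are assumptions): a text value is accepted in a column declared as INTEGER, and a date is simply stored as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (qty INTEGER, signup TEXT)")
conn.execute("INSERT INTO demo VALUES (3, '2025-05-07')")
conn.execute("INSERT INTO demo VALUES ('three', NULL)")  # text in an INTEGER column is accepted

# typeof() reports the storage class of each stored value, not the declared column type.
stored_types = conn.execute(
    "SELECT qty, typeof(qty), typeof(signup) FROM demo ORDER BY rowid"
).fetchall()

print(stored_types)
# [(3, 'integer', 'text'), ('three', 'text', 'null')]
```

The second row shows why it pays to validate data before loading it: SQLite will not stop a stray string from landing in a numeric column.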

<a id='select'></a>
## 5 · SELECT & Derived Columns 

Real-World Application: Extract the relevant customer information for a marketing campaign

At its most basic version, a query looks like this:

```sql
SELECT column1, column2, …
FROM   table;
```

- One of the most confusing aspects of SQL, especially in the beginning, is understanding the order in which SQL processes a query, which unfortunately is not the order in which the clauses are written.
- It helps to think of `FROM` as the first step: it tells SQL which table to draw the information from.
- After that, `SELECT` keeps only the columns we list after it.
- Notice that:
    - We don't need to put names of columns and tables between quotes.
    - We do need to separate columns with commas, but there is no comma after the last one!

💡 **Tip**: When selecting multiple columns, use line breaks after commas for better readability!

In [20]:
# Create a simple customers table
customers_data = {
    'name': ['John Smith', 'Maria Garcia', 'Ahmed Hassan', 'Li Wei', 'Emma Brown', 
             'Carlos Rodriguez', 'Sophie Martin', 'James Wilson', 'Olga Petrov', 'Raj Patel'],
    'city': ['New York', 'Madrid', 'Cairo', 'Beijing', 'London', 
             'Mexico City', 'Paris', 'Sydney', 'Moscow', 'Mumbai'],
    'country': ['US', 'ES', 'EG', 'CN', 'GB', 
                'MX', 'FR', 'AU', 'RU', 'IN'],
    'purchase_amount': [3, 1, 2, 5, 2, 
                       1, 4, 2, 3, 2],
    'purchase_price': [150.50, 49.99, 75.25, 250.00, 98.50, 
                      39.99, 199.75, 89.50, 120.25, 79.99]
}
customers_df = pd.DataFrame(customers_data)

# Create in-memory database
engine = create_engine('sqlite:///:memory:')
customers_df.to_sql('customers', engine, index=False)

print("Customers Table:")
print(customers_df)

# Basic SELECT example
basic_select_query = """
SELECT name, country, purchase_price
FROM customers
"""
print("\nBasic SELECT Example:")
print(pd.read_sql_query(basic_select_query, engine))

Customers Table:
               name         city country  purchase_amount  purchase_price
0        John Smith     New York      US                3          150.50
1      Maria Garcia       Madrid      ES                1           49.99
2      Ahmed Hassan        Cairo      EG                2           75.25
3            Li Wei      Beijing      CN                5          250.00
4        Emma Brown       London      GB                2           98.50
5  Carlos Rodriguez  Mexico City      MX                1           39.99
6     Sophie Martin        Paris      FR                4          199.75
7      James Wilson       Sydney      AU                2           89.50
8       Olga Petrov       Moscow      RU                3          120.25
9         Raj Patel       Mumbai      IN                2           79.99

Basic SELECT Example:
               name country  purchase_price
0        John Smith      US          150.50
1      Maria Garcia      ES           49.99
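2      Ahmed Hassan      EG           75.25
3            Li Wei      CN          250.00
4        Emma Brown      GB           98.50
5  Carlos Rodriguez      MX           39.99
6     Sophie Martin      FR          199.75
7      James Wilson      AU           89.50
8       Olga Petrov      RU          120.25
9         Raj Patel      IN           79.99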

However, queries are extremely flexible and allow us to combine/summarize information in many ways.

Let's take a look at some more interesting queries.

### Step‑by‑Step Examples

1. **Select everything:**


   If you want to keep all columns from a table, you can use `*` instead of naming the columns.


   ```sql
   SELECT * 
   FROM customers;
   ```
   <br>

2. **Select specific columns with aliases:**

   Sometimes columns have very complicated, uninformative or ambiguous names. We can rename the header of a column using an alias:


   ```sql
   SELECT company_name AS name,
          country
   FROM   customers;
   ```


In [22]:
# Original Table

print("Customers Table:")
print(customers_df)

# Selecting All columns
select_all_query = """
SELECT *
FROM customers
"""
print("\nSELECT Example with select ALL:")
print(pd.read_sql_query(select_all_query, engine))

# Using an Alias
select_alias_query = """
SELECT name, city AS location 
FROM customers
"""
print("\nSELECT Example with Alias:")
print(pd.read_sql_query(select_alias_query, engine))

Customers Table:
               name         city country  purchase_amount  purchase_price
0        John Smith     New York      US                3          150.50
1      Maria Garcia       Madrid      ES                1           49.99
2      Ahmed Hassan        Cairo      EG                2           75.25
3            Li Wei      Beijing      CN                5          250.00
4        Emma Brown       London      GB                2           98.50
5  Carlos Rodriguez  Mexico City      MX                1           39.99
6     Sophie Martin        Paris      FR                4          199.75
7      James Wilson       Sydney      AU                2           89.50
8       Olga Petrov       Moscow      RU                3          120.25
9         Raj Patel       Mumbai      IN                2           79.99

SELECT Example with select ALL:
               name         city country  purchase_amount  purchase_price
0        John Smith     New York      US                3          150.50

Very often we want to create a *derived column*: a column that modifies or combines information from columns of the original table.

We usually do this with functions or arithmetic operators. The syntax for a function is `function(column_name)`, and the result can be used like any regular column.

3. **Derived columns:**  
   ```sql
   SELECT LOWER(name) AS lower_name,
          UPPER(city) AS upper_city,
          purchase_amount*2 AS double_amount
   FROM   customers;
   ```
   <br>

4. **Combining columns:**  
   ```sql
   SELECT name,
          city || ', ' || country AS address,
          purchase_amount * purchase_price AS expenditure
   FROM   customers;
   ```

In [24]:
# Original Table

print("Customers Table:")
print(customers_df)

# Selecting Derived columns
select_derived_query = """
SELECT LOWER(name) AS lower_name,
       UPPER(city) AS upper_city,
       purchase_amount * 2 AS double_amount
FROM   customers;
"""
print("\nSELECT Example with Derived Columns:")
print(pd.read_sql_query(select_derived_query, engine))

# Combining Columns
select_combined_query = """
SELECT name,
       city || ', ' || country AS address,
       purchase_amount * purchase_price AS expenditure
FROM   customers;
"""
print("\nSELECT Example with Combined Columns:")
print(pd.read_sql_query(select_combined_query, engine))

Customers Table:
               name         city country  purchase_amount  purchase_price
0        John Smith     New York      US                3          150.50
1      Maria Garcia       Madrid      ES                1           49.99
2      Ahmed Hassan        Cairo      EG                2           75.25
3            Li Wei      Beijing      CN                5          250.00
4        Emma Brown       London      GB                2           98.50
5  Carlos Rodriguez  Mexico City      MX                1           39.99
6     Sophie Martin        Paris      FR                4          199.75
7      James Wilson       Sydney      AU                2           89.50
8       Olga Petrov       Moscow      RU                3          120.25
9         Raj Patel       Mumbai      IN                2           79.99

SELECT Example with Derived Columns:
         lower_name   upper_city  double_amount
0        john smith     NEW YORK              6
1      maria garcia       MADRID              2

We won't be able to cover all of them, but for future reference, here is a quick list of commonly used functions:

### Commonly used Functions

#### Text functions

| Function | Description | Example |
|----------|-------------|---------|
| `SUBSTR(text, start, len)` | Substring (1‑based index) | `SUBSTR('Market',1,3)` → `Mar` |
| `INSTR(text, pattern)` | Position (0 if not found) | `INSTR('abcdef','cd')` → `3` |
| `LOWER(text)` / `UPPER(text)` | Case conversion | `LOWER('SQL')` → `sql` |
| `REPLACE(text, old, new)` | Global substitution | `REPLACE('foo','o','0')` → `f00` |
| `TRIM(text)` | Strip leading/trailing spaces | `TRIM('  sql  ')` → `sql` |
| `str1 \|\| ' separator ' \|\| str2` | Concatenates strings | `'Madrid' \|\| ', ' \|\| 'ES'` → `Madrid, ES` |
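
The text functions in the table can be tried out without any table at all, since SQLite allows a `SELECT` with no `FROM` clause. A minimal sketch using Python's built-in `sqlite3` module (the literal strings are just illustrative inputs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

text_results = conn.execute("""
    SELECT SUBSTR('Market', 1, 3),     -- first three characters
           INSTR('abcdef', 'cd'),      -- 1-based position of 'cd'
           LOWER('SQL'),
           REPLACE('foo', 'o', '0'),
           TRIM('  padded  '),
           'Madrid' || ', ' || 'ES'    -- concatenation
""").fetchone()

print(text_results)
# ('Mar', 3, 'sql', 'f00', 'padded', 'Madrid, ES')
```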

#### Date / time Functions

| Function | What it does | Example |
|----------|--------------|---------|
| `DATE('now')` | Current date (UTC) | `DATE('now')` → `'2025-05-07'` |
| `DATETIME('now','localtime')` | Current local datetime | `DATETIME('now','localtime')` → `'2025-05-07 13:25:00'` |
| `STRFTIME(fmt, ts)` | Format timestamp → text | `STRFTIME('%Y-%m', OrderDate)` → `'1997-07'` |
| `JULIANDAY(ts)` | Fractional days since noon UTC on 24 November 4714 BC | `JULIANDAY('2025-05-07')` → `2460802.5` |
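
Because SQLite stores dates as plain values, date arithmetic goes through these functions. The sketch below uses fixed, made-up dates (rather than `'now'`) so the results are reproducible: it shifts a date with a modifier, extracts the year and month, and computes the number of days between two dates by subtracting Julian day numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

date_results = conn.execute("""
    SELECT DATE('1996-07-04', '+30 days'),                      -- shift a date
           STRFTIME('%Y-%m', '1997-07-16'),                     -- keep year and month only
           JULIANDAY('1996-07-05') - JULIANDAY('1996-07-04')    -- days between two dates
""").fetchone()

print(date_results)
# ('1996-08-03', '1997-07', 1.0)
```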


#### Integer and Float Functions

| Expression | What it does |
|------------|--------|
| `column1 + column2` | Add two numbers |
| `ROUND(total * 0.15, 2)` | Round to 2 decimal places |
| `COALESCE(price, 0)` | Replace NULL with 0 |

💡**Tip:** Need a float? Multiply an `INTEGER` by **`1.0`**.
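
The tip above matters because SQLite performs integer division when both operands are integers. A short sketch (the numbers are arbitrary) comparing the two cases, together with `ROUND` and `COALESCE` from the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

numeric_results = conn.execute("""
    SELECT 7 / 2,               -- integer division: the fraction is dropped
           7 * 1.0 / 2,         -- multiplying by 1.0 first forces a float result
           ROUND(3.14159, 2),   -- round to 2 decimal places
           COALESCE(NULL, 0)    -- first non-NULL argument
""").fetchone()

print(numeric_results)
# (3, 3.5, 3.14, 0)
```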

---


Now that we know the basics of querying, we can start dealing with more complex query commands - such as filtering, grouping, aggregate functions, sorting and paginating.

<a id='where'></a>
## 6 · Filtering Rows with `WHERE` 

Real World Application: Find all large transactions on a list of business expenses for auditing

Filtering is the process of selecting a subset of rows that match a certain condition. 

Filtering happens before selecting. This is at the crux of our desk-bookcase analogy: we only want to query the data we will actually need, so that it is manageable when it gets to our desk.

The basic filtering method is WHERE, and the syntax is as follows:

```sql
SELECT columns
FROM table
WHERE condition

```

As we will see in a bit, the conditions we can impose are very flexible. As a first example, let's filter our previous table down to the customers who purchased more than one item.



In [29]:
print("Customers Table:")
print(customers_df)

# Selecting filtered rows
select_filtered_query = """
SELECT name
FROM customers
WHERE purchase_amount > 1
"""
print("\nSELECT Example with Filtered Rows:")
print(pd.read_sql_query(select_filtered_query, engine))

Customers Table:
               name         city country  purchase_amount  purchase_price
0        John Smith     New York      US                3          150.50
1      Maria Garcia       Madrid      ES                1           49.99
2      Ahmed Hassan        Cairo      EG                2           75.25
3            Li Wei      Beijing      CN                5          250.00
4        Emma Brown       London      GB                2           98.50
5  Carlos Rodriguez  Mexico City      MX                1           39.99
6     Sophie Martin        Paris      FR                4          199.75
7      James Wilson       Sydney      AU                2           89.50
8       Olga Petrov       Moscow      RU                3          120.25
9         Raj Patel       Mumbai      IN                2           79.99

SELECT Example with Filtered Rows:
            name
0     John Smith
1   Ahmed Hassan
2         Li Wei
3     Emma Brown
4  Sophie Martin
5   James Wilson
6    Olga Petrov
7      Raj Patel

This example shows something very interesting about the `WHERE` clause: we don't need to keep the column we use to filter the rows.

In other words, suppose we don't care exactly how much each user purchased, only whether they purchased more than a threshold. We can filter on that column in `WHERE` without listing it in `SELECT`, so it never reaches our desk, which saves memory.

Here are some commonly used filtering methods:

1. **Equality / inequality**  
   ```sql
   … WHERE country = 'Germany';
   ```
⚠️ **Warning:** Be careful with string comparisons: whether they are case-sensitive depends on the SQL engine and its settings. In SQLite, `=` is case-sensitive by default, so `'germany'` does not match `'Germany'`.

2. **Set membership**  
   ```sql
   … WHERE country IN ('USA','UK','Germany');
   ```

> *Coming soon:* With **subqueries** we’ll use `IN (SELECT …)` to test *set membership* against *tables*, not just literal lists.

💡 **Tip:** You can combine multiple logical expressions using `AND`/`OR`/`NOT`. When doing so, use parentheses to make clear where one logical expression ends and the next begins.

3. **Comparison**  
   ```sql
   … WHERE freight > 100;
   ```

📝 **Poll:** Which comparison operator would you use to find values in a specific range?

4. **NULL checks**  
   ```sql
   … WHERE fax IS NULL;
   ```

⚠️ **Warning:** When you compare anything to NULL, the result isn't TRUE or FALSE; it is a third truth value called UNKNOWN.

This feature can be quite confusing, especially in the beginning. A common mistake is to try to find rows with missing values by using

`WHERE column = NULL`

For a missing value this evaluates `NULL = NULL`, which yields *UNKNOWN* rather than TRUE, so no rows are returned.

In SQL, we should instead use the dedicated operators `IS NULL` and `IS NOT NULL`:

```sql
… WHERE fax IS NULL
… WHERE fax IS NOT NULL
```

In [32]:
# Create a sample dataframe with NULL values
employee_data = {
    'employee_id': [101, 102, 103, 104, 105],
    'name': ['John Smith', 'Sarah Johnson', 'Michael Lee', 'Emma Davis', 'David Wilson'],
    'department': ['Engineering', 'Marketing', 'Finance', None, 'HR'],
    'manager_id': [201, 202, None, 204, 205]
}
employees_df = pd.DataFrame(employee_data)

# Create database
engine = create_engine('sqlite:///:memory:')
employees_df.to_sql('employees', engine, index=False)

print("Employees Table:")
print(employees_df)

incorrect_query = """
SELECT name, department
FROM employees
WHERE department = NULL
"""
print("""\n Results using \"department = NULL\":""")
print(pd.read_sql_query(incorrect_query, engine))


correct_query = """
SELECT name, department
FROM employees
WHERE department IS NULL
"""
print("""\n Results using \"department IS NULL\":""")
print(pd.read_sql_query(correct_query, engine))

Employees Table:
   employee_id           name   department  manager_id
0          101     John Smith  Engineering       201.0
1          102  Sarah Johnson    Marketing       202.0
2          103    Michael Lee      Finance         NaN
3          104     Emma Davis         None       204.0
4          105   David Wilson           HR       205.0

 Results using "department = NULL":
Empty DataFrame
Columns: [name, department]
Index: []

 Results using "department IS NULL":
         name department
0  Emma Davis       None
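
The earlier tip about parentheses is worth a concrete check, because `AND` binds more tightly than `OR`. The sketch below uses a small hypothetical `people` table (names and values are made up) and runs the same conditions with and without parentheses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, country TEXT, purchase_amount INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ana", "US", 1), ("Ben", "MX", 3), ("Cleo", "MX", 1), ("Dev", "EG", 5)],
)

# Without parentheses this reads as: country = 'US' OR (country = 'MX' AND purchase_amount > 1)
without_parens = [r[0] for r in conn.execute("""
    SELECT name FROM people
    WHERE country = 'US' OR country = 'MX' AND purchase_amount > 1
    ORDER BY name
""")]

# With parentheses the amount filter applies to both countries.
with_parens = [r[0] for r in conn.execute("""
    SELECT name FROM people
    WHERE (country = 'US' OR country = 'MX') AND purchase_amount > 1
    ORDER BY name
""")]

print(without_parens)  # ['Ana', 'Ben']
print(with_parens)     # ['Ben']
```

Ana bought only one item, yet she appears in the first result because the amount condition was attached to `'MX'` alone.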


<a id='groupby'></a>
## 7 · Aggregate Functions & `GROUP BY` 

Real World Application: Given a list of individual transactions, calculate total sales per region.

Another way of pre-processing data so that the end result is more manageable is to summarize it with a given statistic.

One common example is the use of aggregate functions, combined with `GROUP BY`, to collapse many rows into one: the information they contain is kept, but summarized in a single row.

First, let's understand what `GROUP BY` does: it splits the table into subsets of rows that are alike in some way. The most common way of doing so is to pass a column, and SQL will then group the rows according to the values in that column.

Second, `GROUP BY` is used in conjunction with aggregate functions. Grouping rows by a column guarantees that all rows in a group share the same value **for that particular column**. But what about the other columns? Their values may differ, in which case there is no obvious way of combining them. Aggregate functions do exactly this: they tell SQL how to reduce the differing values inside each group to a single value, for example by counting occurrences, summing, or taking averages:

```sql
SELECT country,
       COUNT(*)        AS n_orders,
       ROUND(AVG(freight),2) AS avg_freight
FROM   orders
GROUP  BY country
```

Another very important rule, which is a bit tough to get used to in the beginning, is that the `SELECT` list should only contain columns that are either used in `GROUP BY` or passed as inputs to aggregate functions. This follows from what we discussed previously: if the values differ between rows, SQL doesn't know which one to keep when it collapses the group into a single row.

⚠️ **Warning:** Most SQL engines raise an error when this rule is broken. SQLite does not: it returns the value from one row of the group, chosen arbitrarily in most cases, which is rarely what you want.

💡 **Tip**: Aggregates such as `AVG` and `SUM` ignore `NULL` values, which is usually what you want. `COUNT(column)` counts only non‑null values, whereas `COUNT(*)` counts *all* rows.

We can also pass more than one column to `GROUP BY`, which creates one group for each distinct combination of values in those columns.
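
Here is a small sketch of both points: grouping by two columns, and the difference between `COUNT(*)` and `COUNT(column)` when a value is missing. The `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, city TEXT, freight REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("US", "New York", 10.0),
        ("US", "New York", None),   # freight is missing for this order
        ("US", "Chicago", 20.0),
        ("ES", "Madrid", 5.0),
    ],
)

# One group per distinct (country, city) combination.
grouped = conn.execute("""
    SELECT country,
           city,
           COUNT(*)       AS n_rows,      -- counts every row in the group
           COUNT(freight) AS n_freight,   -- counts only non-NULL freight values
           AVG(freight)   AS avg_freight  -- NULL is ignored in the average
    FROM   orders
    GROUP  BY country, city
    ORDER  BY country, city
""").fetchall()

for row in grouped:
    print(row)
# ('ES', 'Madrid', 1, 1, 5.0)
# ('US', 'Chicago', 1, 1, 20.0)
# ('US', 'New York', 2, 1, 10.0)
```

For New York, `COUNT(*)` reports two orders while `COUNT(freight)` reports one, and the average is computed over the single known value.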

In [35]:
# Create a customer table with multiple people from the same countries
customers_data = {
    'name': ['John Smith', 'Lisa Johnson', 'Maria Garcia', 'Jose Martinez', 
             'Ahmed Hassan', 'Fatima Ali', 'Li Wei', 'Zhang Min', 
             'Emma Brown', 'William Jones'],
    'city': ['New York', 'Chicago', 'Madrid', 'Barcelona', 
             'Cairo', 'Alexandria', 'Beijing', 'Shanghai', 
             'London', 'Manchester'],
    'country': ['US', 'US', 'ES', 'ES', 
                'EG', 'EG', 'CN', 'CN', 
                'GB', 'GB'],
    'purchase_amount': [3, 2, 1, 3, 
                        2, 1, 5, 3, 
                        2, 4],
    'purchase_price': [150.50, 89.99, 49.99, 125.50, 
                       75.25, 60.00, 250.00, 175.25, 
                       98.50, 195.75]
}
customers_df = pd.DataFrame(customers_data)

# Create database
engine = create_engine('sqlite:///:memory:')
customers_df.to_sql('customers', engine, index=False)

# Display original data
print("Original Customers Table:")
print(customers_df)

# GROUP BY country query
group_by_country_query = """
SELECT 
    country,
    COUNT(*) AS customer_count,
    SUM(purchase_amount) AS total_items,
    SUM(purchase_price) AS total_revenue,
    ROUND(AVG(purchase_price), 2) AS avg_purchase
FROM 
    customers
GROUP BY 
    country
"""

# Execute the query
result = pd.read_sql_query(group_by_country_query, engine)
print("\nCustomers Grouped By Country:")
print(result)

Original Customers Table:
            name        city country  purchase_amount  purchase_price
0     John Smith    New York      US                3          150.50
1   Lisa Johnson     Chicago      US                2           89.99
2   Maria Garcia      Madrid      ES                1           49.99
3  Jose Martinez   Barcelona      ES                3          125.50
4   Ahmed Hassan       Cairo      EG                2           75.25
5     Fatima Ali  Alexandria      EG                1           60.00
6         Li Wei     Beijing      CN                5          250.00
7      Zhang Min    Shanghai      CN                3          175.25
8     Emma Brown      London      GB                2           98.50
9  William Jones  Manchester      GB                4          195.75

Customers Grouped By Country:
  country  customer_count  total_items  total_revenue  avg_purchase
0      CN               2            8         425.25        212.63
1      EG               2            

For future reference, here is a list of the most commonly used aggregate functions:

| Aggregate Function                                   | What it returns (typical usage)                    |
| ---------------------------------------------------- | -------------------------------------------------- |
| `COUNT(*)`                                           | Total number of rows in the group/query            |
| `SUM(col)`                                           | Arithmetic sum of a numeric column                 |
| `AVG(col)`                                           | Mean (average) of numeric values                   |
| `MAX(col)`                                           | Largest value (numeric *or* lexicographic)         |
| `MIN(col)`                                           | Smallest value (numeric *or* lexicographic)        |
| `GROUP_CONCAT(col, ', ')` (SQLite, MySQL) / `STRING_AGG` (PostgreSQL) / `LISTAGG` (Oracle) | Concatenates strings in the group with a delimiter |
| `COUNT(DISTINCT col)`                                | Count of unique, non-NULL values                   |

<a id='orderby'></a>
## 8 · Sorting & Paginating Results 

Real World Application: Find the top 10 selling items in a given year.

After we have processed our data, we might want to start preparing it for visualization. In SQL, this is done mostly through sorting (ordering the data according to one or more columns) or paginating (retrieving only a fixed number of rows at a time).

`ORDER BY` is evaluated *after* `SELECT`.  

* Default sort is **ASC** (ascending).  
* Use **DESC** for descending order.  
* You can order by *multiple* columns – the second acts as a tie‑breaker.

```sql
SELECT company_name, country, city
FROM   customers
ORDER  BY country ASC, company_name DESC;
```

### Pagination Pattern

* `LIMIT` restricts how many rows are returned.
* `OFFSET` skips a given number of rows before returning the rows allowed by `LIMIT`.


```sql
SELECT company_name, country, city
FROM   customers
ORDER  BY country
LIMIT  n
OFFSET m;
```

`LIMIT` must appear *before* `OFFSET` in SQLite.

Observation: Without `ORDER BY`, the order in which rows are returned is not guaranteed, and it is very hard to predict after filtering or grouping. So remember to always use `ORDER BY` before `LIMIT`/`OFFSET`!
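
A common way to use this pattern is to turn a page number into an offset: page `p` with `page_size` rows per page starts at offset `(p - 1) * page_size`. The sketch below assumes a tiny made-up `customers` table, and `fetch_page` is a hypothetical helper, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [(n,) for n in ["Emma", "Ahmed", "Li", "John", "Maria", "Zhang", "Lisa"]],
)


def fetch_page(page, page_size=3):
    """Return page number `page` (1-based) of customer names in alphabetical order."""
    offset = (page - 1) * page_size
    rows = conn.execute(
        # ORDER BY makes the pages stable; ? placeholders pass the numbers safely.
        "SELECT name FROM customers ORDER BY name LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
    return [r[0] for r in rows]


print(fetch_page(1))  # ['Ahmed', 'Emma', 'John']
print(fetch_page(2))  # ['Li', 'Lisa', 'Maria']
print(fetch_page(3))  # ['Zhang']
```

The last page is simply shorter: `LIMIT` is an upper bound, not a requirement.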

In [52]:
# Original Query
original_country_query = """
SELECT *
FROM customers
"""

# Execute the query
result = pd.read_sql_query(original_country_query, engine)
print("\n Original Table:")
print(result)


# Ordered Country Query
ordered_country_query = """
SELECT *
FROM customers
ORDER BY name
"""

# Execute the query
result = pd.read_sql_query(ordered_country_query, engine)
print("\n Ordered Table:")
print(result)

# LIMIT/OFFSET country query
limit_offset_country_query = """
SELECT *
FROM customers
ORDER BY name
LIMIT 3
OFFSET 3
"""

# Execute the query
result = pd.read_sql_query(limit_offset_country_query, engine)
print("\nRows 4-6 of the Ordered Table (LIMIT 3 OFFSET 3):")
print(result)


 Original Table:
            name        city country  purchase_amount  purchase_price
0     John Smith    New York      US                3          150.50
1   Lisa Johnson     Chicago      US                2           89.99
2   Maria Garcia      Madrid      ES                1           49.99
3  Jose Martinez   Barcelona      ES                3          125.50
4   Ahmed Hassan       Cairo      EG                2           75.25
5     Fatima Ali  Alexandria      EG                1           60.00
6         Li Wei     Beijing      CN                5          250.00
7      Zhang Min    Shanghai      CN                3          175.25
8     Emma Brown      London      GB                2           98.50
9  William Jones  Manchester      GB                4          195.75

 Ordered Table:
            name        city country  purchase_amount  purchase_price
0   Ahmed Hassan       Cairo      EG                2           75.25
1     Emma Brown      London      GB                2           98.50

<a id='having'></a>
## 9 · Filtering Groups with `HAVING` 

Real World Application: Find all product categories experiencing high growth over the last year.

Remember that when we discussed filtering, we used the `WHERE` clause, which runs before `GROUP BY`.

Sometimes we want to filter groups based on an aggregate value. For example, we might want to keep only the customers whose average expenditure is larger than a certain amount.

`HAVING` is evaluated **after** grouping – it filters *groups*, whereas `WHERE` filters *rows*.

Two observations:
- We cannot use aggregate functions in a `WHERE` clause.
- `HAVING` must come after the `GROUP BY` clause.

```sql
SELECT country,
       COUNT(*) AS n_orders
FROM   orders
GROUP  BY country
HAVING COUNT(*) > 20          -- aggregate in condition
ORDER  BY n_orders DESC;
```
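
The two observations above can be checked directly. The following is a minimal sketch, assuming a tiny made-up `orders` table (the rows are invented for illustration) and using the standard-library `sqlite3` module so it runs on its own. It shows that SQLite rejects an aggregate inside `WHERE`, while the same condition inside `HAVING` works.

```python
import sqlite3

# Hypothetical toy table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "US"), (2, "US"), (3, "US"), (4, "ES"), (5, "GB"), (6, "GB")],
)

# 1) An aggregate inside WHERE: SQLite refuses to run this query.
bad_query = """
SELECT country, COUNT(*) AS n_orders
FROM   orders
WHERE  COUNT(*) > 1
GROUP  BY country
"""
try:
    conn.execute(bad_query).fetchall()
    where_failed = False
except sqlite3.Error as err:
    where_failed = True
    print("WHERE with an aggregate failed:", err)

# 2) The same condition in HAVING, placed after GROUP BY, filters the groups.
good_query = """
SELECT country, COUNT(*) AS n_orders
FROM   orders
GROUP  BY country
HAVING COUNT(*) > 1
ORDER  BY n_orders DESC
"""
having_rows = conn.execute(good_query).fetchall()
print(having_rows)
```

Only countries with more than one order survive the `HAVING` filter; the single Spanish order is grouped first and then dropped as a group.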

# Create purchases data with multiple purchases per customer
purchases_data = {
    'customer_id': [101, 101, 101, 102, 102, 103, 103, 103, 103, 
                    104, 104, 105, 105, 105, 106, 106, 106, 106],
    'customer_name': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Charlie',
                     'David', 'David', 'Emma', 'Emma', 'Emma', 'Frank', 'Frank', 'Frank', 'Frank'],
    'purchase_amount': [120.50, 85.99, 34.20, 199.99, 45.50, 55.75, 62.30, 89.99, 75.25,
                        350.00, 290.75, 24.99, 19.99, 34.50, 125.99, 85.50, 95.25, 110.75]
}
purchases_df = pd.DataFrame(purchases_data)

# Create database
engine = create_engine('sqlite:///:memory:')
purchases_df.to_sql('purchases', engine, index=False)

# Display some sample data
print("Sample Purchases Data:")
print(purchases_df.head(10))
print("...")

# Query using HAVING to filter customers with average purchase > 100
having_query = """
SELECT 
    customer_id,
    customer_name,
    COUNT(*) AS purchase_count,
    SUM(purchase_amount) AS total_spent,
    ROUND(AVG(purchase_amount), 2) AS average_purchase
FROM 
    purchases
GROUP BY 
    customer_id, customer_name
HAVING 
    AVG(purchase_amount) > 100
ORDER BY 
    average_purchase DESC
"""

# Execute the query
result = pd.read_sql_query(having_query, engine)
print("\nCustomers with Average Purchase > $100:")
print(result)

Sample Purchases Data:
   customer_id customer_name  purchase_amount
0          101         Alice           120.50
1          101         Alice            85.99
2          101         Alice            34.20
3          102           Bob           199.99
4          102           Bob            45.50
5          103       Charlie            55.75
6          103       Charlie            62.30
7          103       Charlie            89.99
8          103       Charlie            75.25
9          104         David           350.00
...

Customers with Average Purchase > $100:
   customer_id customer_name  purchase_count  total_spent  average_purchase
0          104         David               2       640.75            320.38
1          102           Bob               2       245.49            122.75
2          106         Frank               4       417.49            104.37
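
`WHERE` and `HAVING` can also appear in the same query, each doing its own job. The sketch below is an illustration under stated assumptions: it rebuilds the same purchases data with the standard-library `sqlite3` module (rather than the notebook's `engine`), and the thresholds (purchases over $50, at least 3 of them) are arbitrary values chosen for the example. `WHERE` first discards small purchases row by row; `HAVING` then keeps only customers with enough remaining purchases.

```python
import sqlite3

# Same purchases as in the cell above: (customer_id, customer_name, purchase_amount).
rows = [
    (101, "Alice", 120.50), (101, "Alice", 85.99), (101, "Alice", 34.20),
    (102, "Bob", 199.99), (102, "Bob", 45.50),
    (103, "Charlie", 55.75), (103, "Charlie", 62.30),
    (103, "Charlie", 89.99), (103, "Charlie", 75.25),
    (104, "David", 350.00), (104, "David", 290.75),
    (105, "Emma", 24.99), (105, "Emma", 19.99), (105, "Emma", 34.50),
    (106, "Frank", 125.99), (106, "Frank", 85.50),
    (106, "Frank", 95.25), (106, "Frank", 110.75),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases "
    "(customer_id INTEGER, customer_name TEXT, purchase_amount REAL)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)

query = """
SELECT customer_name,
       COUNT(*) AS big_purchases
FROM   purchases
WHERE  purchase_amount > 50          -- row filter: runs before grouping
GROUP  BY customer_id, customer_name
HAVING COUNT(*) >= 3                 -- group filter: runs after grouping
ORDER  BY customer_name
"""
frequent_big_spenders = conn.execute(query).fetchall()
print(frequent_big_spenders)
```

Alice has three purchases in total, but only two exceed $50, so her group is removed by `HAVING`. Moving the row condition into `HAVING` would not work, because by then the individual purchase rows have been collapsed into groups.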



### Putting it all together

We have covered several query clauses. Here is one example that combines all of them:

```sql
SELECT
    Country,
    COUNT(OrderID)                AS total_orders,
    ROUND(AVG(Freight), 2)        AS avg_freight
FROM    customers 
WHERE   Country IN ('USA','UK','Germany')
GROUP BY Country
HAVING   COUNT(OrderID) >= 10
ORDER BY total_orders DESC
LIMIT 5;
```

The diagram shows the **logical query order**, that is, the order in which the database evaluates the clauses. This differs from the order in which the clauses are written in the query.

![SQL Execution Order](sql-execution-order.svg)

---




### A visualization of the order of query commands

**FROM `customers`** – full table (7 rows).

| CustID | Country | Orders |
|-------|---------|--------|
| C1 | USA | 5 |
| C2 | USA | 7 |
| C3 | UK  | 3 |
| C4 | UK  | 7 |
| C5 | FRA | 15 |
| C6 | GER | 2 |
| C7 | CAN | 6 |

---

**WHERE `Country IN ('USA','UK','FRA','GER')`** – drop the Canadian row.

| CustID | Country | Orders |
|-------|---------|--------|
| C1 | USA | 5 |
| C2 | USA | 7 |
| C3 | UK  | 3 |
| C4 | UK  | 7 |
| C5 | FRA | 15 |
| C6 | GER | 2 |

---

**GROUP BY `Country`** – aggregate rows, summing `Orders`.

| Country | TotalOrders |
|---------|-------------|
| USA | 12 |
| UK  | 10 |
| FRA | 15 |
| GER | 2  |

---

**HAVING `TotalOrders > 5`** – keep only groups with healthy order volume; Germany drops out. The rows are not sorted yet.

| Country | TotalOrders |
|---------|-------------|
| USA | 12 |
| UK  | 10 |
| FRA | 15 |

---

**SELECT `Country, TotalOrders`** – project just the columns we care about (already those two).

| Country | TotalOrders |
|---------|-------------|
| USA | 12 |
| UK  | 10 |
| FRA | 15 |

---

**ORDER BY `TotalOrders DESC`** – rank countries by order volume.

| Country | TotalOrders |
|---------|-------------|
| FRA | 15 |
| USA | 12 |
| UK  | 10 |

---

**LIMIT 2** – return only the top two performers.

| Country | TotalOrders |
|---------|-------------|
| FRA | 15 |
| USA | 12 |
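
The walkthrough above can be reproduced end to end. This is a minimal sketch using the standard-library `sqlite3` module and the seven rows from the first table. One assumption to note: the `HAVING` condition is written as `SUM(Orders) > 5` rather than with the `TotalOrders` alias, since not every database accepts a `SELECT` alias inside `HAVING`.

```python
import sqlite3

# The seven rows from the FROM step of the walkthrough.
customers = [
    ("C1", "USA", 5), ("C2", "USA", 7), ("C3", "UK", 3), ("C4", "UK", 7),
    ("C5", "FRA", 15), ("C6", "GER", 2), ("C7", "CAN", 6),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (CustID TEXT, Country TEXT, Orders INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)

# Clauses are written in this order, but evaluated as
# FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT.
query = """
SELECT   Country,
         SUM(Orders) AS TotalOrders
FROM     customers
WHERE    Country IN ('USA', 'UK', 'FRA', 'GER')
GROUP BY Country
HAVING   SUM(Orders) > 5
ORDER BY TotalOrders DESC
LIMIT    2
"""
top_two = conn.execute(query).fetchall()
print(top_two)
```

Removing the `LIMIT 2` line returns all three surviving groups (FRA, USA, UK), which matches the table shown at the `ORDER BY` step.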





<a id='keypoints'></a>
<div class="alert alert-success">  
    
## 11 · Key Points & Next Steps 

* Use SQL to select and pre-process only the data you really need, then analyze this smaller dataset with pandas.
* SQLite is a zero‑install, single‑file engine that still speaks standard SQL.  
* Remember the logical query order (`FROM` → `WHERE` → `GROUP BY` → `HAVING` → `SELECT` → `ORDER BY` → `LIMIT`) to avoid confusing `WHERE` with `HAVING`.  

### What’s Next?  
In the **Advanced SQL** workshop we will tackle:

* Creating & altering tables  
* `JOIN`s (INNER, LEFT, RIGHT, FULL) and set operations
* `JOIN` as selection  
* Subqueries & Common Table Expressions (`WITH`)  
* Window functions (`ROW_NUMBER`, `LAG`, `LEAD`)  
* UNION
* Pivoting

</div>