# Homework 1: SQL Foundations with DuckDB

**Name:** Bálint Décsi  
**Due:** 19 Oct, 1:30 pm  
**Total Points:** 100 (+ 10 bonus)

---

## Instructions

1. Complete all TODO sections below
2. Write SQL queries to answer each question
3. Add markdown explanations where requested
4. Before submitting: **Kernel → Restart & Run All Cells**
5. Verify all outputs are visible
6. Rename file to `hw1_[your_name].ipynb`

**Read the README.md for full assignment details, rubric, and tips!**

---

## Setup

Run these cells to set up your environment.

In [1]:
# Import libraries
import duckdb
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


In [2]:
# Connect to DuckDB
con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB!")

✅ Connected to DuckDB!


In [3]:
# Load the Online Retail dataset
# This creates a table called 'retail' that you'll query
con.execute("""
    CREATE TABLE retail AS 
    SELECT * FROM 'data/online_retail_hw1.csv'
""")

print("✅ Dataset loaded!")

✅ Dataset loaded!


### Dataset Exploration

Let's explore the data before starting the assignment.

In [4]:
# Check row count
con.execute("SELECT COUNT(*) as total_rows FROM retail").df()

Unnamed: 0,total_rows
0,525461


In [5]:
# View first few rows
con.execute("SELECT * FROM retail LIMIT 5").df()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [6]:
# Check for NULL values in each column
con.execute("""
    SELECT 
        COUNT(*) - COUNT(Invoice) AS invoice_nulls,
        COUNT(*) - COUNT(StockCode) AS stockcode_nulls,
        COUNT(*) - COUNT(Description) AS description_nulls,
        COUNT(*) - COUNT(Quantity) AS quantity_nulls,
        COUNT(*) - COUNT(InvoiceDate) AS date_nulls,
        COUNT(*) - COUNT(Price) AS price_nulls,
        COUNT(*) - COUNT("Customer ID") AS customerid_nulls,
        COUNT(*) - COUNT(Country) AS country_nulls
    FROM retail
""").df()

Unnamed: 0,invoice_nulls,stockcode_nulls,description_nulls,quantity_nulls,date_nulls,price_nulls,customerid_nulls,country_nulls
0,0,0,2928,0,0,0,107927,0


In [7]:
# Check date range
con.execute("""
    SELECT 
        MIN(InvoiceDate) as first_transaction,
        MAX(InvoiceDate) as last_transaction
    FROM retail
""").df()

Unnamed: 0,first_transaction,last_transaction
0,2009-12-01 07:45:00,2010-12-09 20:01:00


**Good!** Now you know:
- Total row count
- Which columns have NULLs (Customer ID and Description)
- Date range covered

Keep this in mind as you write queries!

---

## Part 1: Basic Queries (30 points)

This section tests: SELECT, WHERE, ORDER BY, NULL handling, LIKE

### Question 1.1: Guest Checkouts (8 points)

**Business question:** How many transactions were guest checkouts (no Customer ID)?

**Requirements:**
- Count transactions where Customer ID is NULL
- Also calculate what percentage of total transactions this represents
- Your result should have two columns: `guest_transactions` and `pct_of_total`

**Hint:** Remember to use `IS NULL`, not `= NULL`!

In [10]:
con.execute("""
    SELECT
        COUNT(*) - COUNT("Customer ID") AS guest_transactions,
        ROUND((guest_transactions / COUNT(*)) * 100, 2) AS pct_of_total
    FROM retail
""").df()

Unnamed: 0,guest_transactions,pct_of_total
0,107927,20.54


**TODO: Explain your result in 1-2 sentences:**

More than 20% of all transactions are guest checkouts; these rows currently have NULL in `Customer ID`, which needs to be handled in later queries.

---

### Question 1.2: High-Value Transactions (7 points)

**Business question:** Show the top 20 highest-value transactions (revenue = Quantity * Price).

**Requirements:**
- Calculate revenue as Quantity * Price
- Show: Invoice, Description, Quantity, Price, Revenue
- Only include rows where Quantity and Price are both positive
- Filter out NULL values appropriately
- Sort by revenue descending
- Limit to top 20

**Hint:** Use a calculated column and name it `revenue` with AS

In [14]:
con.execute("""
    SELECT
        Invoice,
        Description,
        Quantity,
        Price,
        Quantity * Price AS Revenue
    FROM retail
    WHERE TRUE
        AND Quantity > 0
        AND Price > 0
        AND Description IS NOT NULL  -- as filtering NULLs is a requirement
    ORDER BY revenue DESC
    LIMIT 20
""").df()

Unnamed: 0,Invoice,Description,Quantity,Price,Revenue
0,512771,Manual,1,25111.09,25111.09
1,530715,ROTATING SILVER ANGELS T-LIGHT HLDR,9360,1.69,15818.4
2,537632,AMAZON FEE,1,13541.33,13541.33
3,502263,Manual,1,10953.5,10953.5
4,502265,Manual,1,10953.5,10953.5
5,522796,Manual,1,10468.8,10468.8
6,524159,Manual,1,10468.8,10468.8
7,525399,Manual,1,10468.8,10468.8
8,496115,Manual,1,8985.6,8985.6
9,511465,PINK PAPER PARASOL,3500,2.55,8925.0


---

### Question 1.3: Product Search (7 points)

**Business question:** Find all products with "CHRISTMAS" in the description.

**Requirements:**
- Use LIKE with wildcard pattern matching
- Show: StockCode, Description
- Get distinct products only (no duplicates)
- Sort alphabetically by Description
- Limit to first 15 results

**Hint:** LIKE is case-sensitive in DuckDB (as in many databases); use ILIKE, or normalise case with UPPER(), for a case-insensitive match

In [18]:
con.execute("""
    SELECT DISTINCT
        StockCode,
        Description
    FROM retail
    WHERE TRUE
        AND UPPER(Description) LIKE '%CHRISTMAS%'
    ORDER BY Description
    LIMIT 15
""").df()

Unnamed: 0,StockCode,Description
0,35962,12 ASS ZINC CHRISTMAS DECORATIONS
1,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS
2,72815,3 WICK CHRISTMAS BRIAR CANDLE
3,22950,36 DOILIES VINTAGE CHRISTMAS
4,22731,3D CHRISTMAS STAMPS STICKERS
5,22731,3D STICKERS CHRISTMAS STAMPS
6,22733,3D STICKERS TRADITIONAL CHRISTMAS
7,22732,3D STICKERS VINTAGE CHRISTMAS
8,22733,3D TRADITIONAL CHRISTMAS STICKERS
9,22732,3D VINTAGE CHRISTMAS STICKERS


---

### Question 1.4: Multi-Country Orders (8 points)

**Business question:** Show transactions from France, Germany, or Spain, with quantity greater than 10.

**Requirements:**
- Use IN operator for country filtering
- Filter for Quantity > 10
- Show: Invoice, Country, Description, Quantity, Price
- Sort by Country, then Quantity descending
- Limit to 25 rows
- Handle NULLs appropriately

**Hint:** Combine IN with AND for multiple conditions

In [22]:
con.execute("""
    SELECT
        Invoice,
        Country,
        Description,
        Quantity,
        Price
    FROM retail
    WHERE TRUE
        AND UPPER(Country) IN ('FRANCE', 'GERMANY', 'SPAIN') 
        AND Quantity > 10
    ORDER BY Country, Quantity DESC
    LIMIT 25
""").df()

Unnamed: 0,Invoice,Country,Description,Quantity,Price
0,518505,France,SET/6 FRUIT SALAD PAPER CUPS,7128,0.08
1,518505,France,SET/6 FRUIT SALAD PAPER PLATES,7008,0.13
2,518505,France,POP ART PEN CASE & PENS,5184,0.08
3,518505,France,MULTICOLOUR SPRING FLOWER MUG,4992,0.1
4,518505,France,BLACK SILVER FLOWER T-LIGHT HOLDER,4752,0.07
5,518505,France,TEATIME PEN CASE & PENS,4608,0.08
6,518505,France,WHITE BIRD GARDEN DESIGN MUG,4320,0.13
7,518505,France,S/4 BLUE ROUND DECOUPAGE BOXES,3936,0.42
8,518505,France,THE KING GIFT BAG,3744,0.05
9,518505,France,RED SPOTTY PUDDING BOWL,3648,0.13


**TODO: Why did you include (or not include) NULL checks in this query?**

We are not aggregating, and the columns we filter and sort on (Country, Quantity) contain no NULLs, so NULLs cannot cause unintended behaviour here. Guest transactions should not be excluded either, since they are valid transactions, and I don't consider it a big problem to include products without a description.

---

## Part 2: Aggregations (40 points)

This section tests: COUNT, SUM, AVG, GROUP BY, HAVING, WHERE vs HAVING

### Question 2.1: Revenue by Country (10 points)

**Business question:** What's our total revenue and transaction count for each country?

**Requirements:**
- Calculate total revenue (Quantity * Price) per country
- Count transactions per country
- Only include positive quantities and non-NULL prices
- Show: Country, total_revenue, transaction_count
- Sort by total_revenue descending
- Show all countries

**Hint:** Use SUM() and COUNT() with GROUP BY

In [None]:
# TODO: Write your query here



**TODO: Which country generates the most revenue? Does this surprise you?**

[Your explanation here]

---

### Question 2.2: Popular Products (10 points)

**Business question:** Which products have been ordered more than 1,000 times?

**Requirements:**
- Group by StockCode and Description
- Count how many times each product appears
- Calculate total quantity sold for each product
- Filter to products with MORE than 1,000 transactions (use HAVING!)
- Show: StockCode, Description, transaction_count, total_quantity_sold
- Sort by transaction_count descending

**Hint:** This requires HAVING, not WHERE, because you're filtering on an aggregate

In [None]:
# TODO: Write your query here



**TODO: Explain why you used HAVING instead of WHERE for the >1000 filter:**

[Your explanation here]

---

### Question 2.3: High-Value Customers (10 points)

**Business question:** Which customers have spent more than £5,000 total?

**Requirements:**
- Calculate total spending (SUM of Quantity * Price) per customer
- Count their number of transactions
- Only include customers with Customer ID (exclude guest checkouts)
- Only include positive quantities and prices
- Filter to customers with total spending > 5000
- Show: Customer ID, total_spent, transaction_count
- Sort by total_spent descending

**Hint:** Use WHERE for row-level filtering (NULLs, positive values) and HAVING for aggregate filtering (>5000)

In [None]:
# TODO: Write your query here



---

### Question 2.4: Monthly Revenue Trend (10 points)

**Business question:** What's our revenue and transaction count by month?

**Requirements:**
- Extract month from InvoiceDate (use DATE_TRUNC('month', InvoiceDate))
- Calculate total revenue per month
- Count transactions per month
- Calculate average transaction value per month
- Only include positive quantities and prices
- Show: month, total_revenue, transaction_count, avg_transaction_value
- Sort by month chronologically

**Hint:** DATE_TRUNC('month', date_column) gives you the first day of each month

In [None]:
# TODO: Write your query here



**TODO: Do you see any seasonal patterns in the revenue?**

[Your explanation here]

---

## Part 3: Window Functions (30 points)

This section tests: ROW_NUMBER, LAG, moving averages

### Question 3.1: Latest Purchase Per Customer (8 points)

**Business question:** What was each customer's most recent purchase?

**Requirements:**
- Use ROW_NUMBER() to rank transactions per customer by date
- Partition by Customer ID
- Order by InvoiceDate descending (most recent first)
- Filter to only the most recent transaction (row_num = 1)
- Only include customers with Customer ID (no guest checkouts)
- Show: Customer ID, Invoice, InvoiceDate, Description, Quantity, Price
- Sort by InvoiceDate descending
- Show first 20 customers

**Hint:** You'll need a subquery - use ROW_NUMBER() in inner query, filter in outer query

In [None]:
# TODO: Write your query here
# Structure: 
# SELECT ... FROM (
#     SELECT ..., ROW_NUMBER() OVER (...) as row_num
#     FROM retail
# )
# WHERE row_num = 1



**TODO: Why did you use a window function instead of GROUP BY for this question?**

[Your explanation here]

---

### Question 3.2: Week-over-Week Revenue Change (12 points)

**Business question:** How is our weekly revenue changing week-over-week?

**Requirements:**
- First, aggregate to weekly level (use DATE_TRUNC('week', InvoiceDate))
- Calculate total revenue per week
- Use LAG() to get previous week's revenue
- Calculate the change (current week - previous week)
- Calculate percent change
- Only include positive quantities and prices
- Show: week, weekly_revenue, prev_week_revenue, revenue_change, pct_change
- Sort by week chronologically
- Show all weeks

**Hint:** Build this incrementally - first get weekly totals, then add LAG, then calculate changes

In [None]:
# TODO: Write your query here
# Consider using a WITH clause (CTE) to make it cleaner:
# WITH weekly AS (
#     SELECT ... GROUP BY week
# )
# SELECT ..., LAG(...) OVER (ORDER BY week) FROM weekly



**TODO: Which week had the biggest increase in revenue? What might explain this?**

[Your explanation here]

---

### Question 3.3: 7-Day Moving Average (10 points)

**Business question:** What's the 7-day moving average of daily revenue?

**Requirements:**
- First, aggregate to daily level (DATE_TRUNC('day', InvoiceDate) or just InvoiceDate::DATE)
- Calculate total revenue per day
- Use window function with ROWS BETWEEN to calculate 7-day moving average
- The moving average should include current day + 6 days before
- Only include positive quantities and prices
- Show: date, daily_revenue, moving_avg_7day
- Sort by date
- Show first 30 days

**Hint:** ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives you 7 days total

In [None]:
# TODO: Write your query here
# WITH daily AS (
#     SELECT date, SUM(...) as daily_revenue
#     FROM retail
#     GROUP BY date
# )
# SELECT 
#     date,
#     daily_revenue,
#     AVG(daily_revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_avg
# FROM daily



**TODO: Why is a moving average useful for analyzing daily revenue?**

[Your explanation here]

---

## Bonus Question (10 points)

This tests: Synthesis of multiple concepts (window functions + GROUP BY)

### Bonus: Top Product Per Country (10 points)

**Business question:** What's the #1 best-selling product (by revenue) in each country?

**Requirements:**
- Calculate total revenue per product per country
- Rank products within each country by revenue
- Show only the #1 product for each country
- Only include positive quantities and prices
- Show: Country, StockCode, Description, total_revenue, rank
- Sort by Country

**Strategy:**
1. First: GROUP BY country and product to get revenue per product per country
2. Then: Use ROW_NUMBER() to rank products within each country
3. Finally: Filter to rank = 1

**Hint:** This combines aggregation (GROUP BY) with window functions (ROW_NUMBER)

In [None]:
# TODO: Write your query here (BONUS)
# This is challenging! Break it into steps:
# 1. Inner query: GROUP BY country and product
# 2. Middle query: Add ROW_NUMBER() partitioned by country
# 3. Outer query: Filter to row_num = 1



**TODO: (Bonus) Explain your approach to this question:**

[Your explanation here]

---

## Submission Checklist

Before submitting, verify:

- [ ] All TODO sections completed
- [ ] All queries produce results (no errors)
- [ ] All query outputs are visible
- [ ] All markdown explanations completed
- [ ] SQL formatted nicely (uppercase keywords, indented)
- [ ] NULL values handled appropriately (IS NULL, not = NULL)
- [ ] **CRITICAL:** Kernel → Restart & Run All Cells (no errors)
- [ ] File renamed to `hw1_[your_name].ipynb`

---

## Reflection (Optional but Recommended)

**What was the most challenging part of this assignment?**

[Your answer here]

**What concept do you feel most confident about now?**

[Your answer here]

**What would you like more practice with?**

[Your answer here]

---

**Great work! 🎉** You've completed queries on 525,000 rows of real data. This is the kind of work data professionals do every day. Be proud!