This project provides a hands-on learning environment for students to practice data warehouse concepts by executing SQL queries on CSV data using DuckDB. The project allows you to:
- Load CSV files as database tables
- Execute SQL queries including advanced operations like CUBE, JOINs, and aggregations
- Capture and log query results automatically
- Practice essential data warehouse querying techniques
Key Learning Objectives:
- Understanding fact and dimension table relationships
- Writing complex SQL queries with multiple JOINs
- Using CUBE operations for multi-dimensional analysis
- Performing aggregations and analytical functions
- Working with real-world data warehouse scenarios
data_warehouse_lab/
│
├── run_sql_script.py                # Main Python script for executing SQL queries
├── requirements.txt                 # Python dependencies
├── README.md                        # This documentation file
│
├── data/                            # Data storage folder
│   └── sample_schema/               # Sample dataset for learning
│       ├── dim_customer.csv         # Customer dimension table
│       ├── dim_product.csv          # Product dimension table
│       └── fact_sales.csv           # Sales fact table
│
├── queries/                         # SQL query examples and storage
│   └── sample_schema/               # Queries for the sample schema dataset
│       ├── cube_example.sql         # Demonstrates CUBE operations
│       ├── join_example.sql         # Shows JOIN operations
│       └── aggregation_example.sql  # Aggregation functions example
│
├── outputs/                         # Query results (auto-generated)
│   └── sample_schema/               # Results organized by schema
│       └── (CSV output files will be created here)
│
└── logs/                            # Query execution logs (auto-generated)
    └── (log files will be created here)
- data/: Contains CSV files, organized by schema, that will be loaded as database tables
- queries/: Store your SQL query files here, organized by schema (files must have a .sql extension)
- outputs/: Query results are automatically saved here as CSV files, maintaining the schema organization
- logs/: Execution logs and query result previews are saved here
- run_sql_script.py: The main script that orchestrates everything (a conceptual sketch follows)
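Conceptually, run_sql_script.py ties these folders together roughly as shown below. This is a simplified sketch for orientation only; the real script's function names, logging format, and error handling will differ.

```python
# Simplified sketch of the workflow (not the actual implementation).
import sys
from pathlib import Path

import duckdb


def run_query(sql_path: str, data_dir: str, log_path: str) -> None:
    con = duckdb.connect()  # in-memory DuckDB database

    # Load every CSV in the data folder as a table named after the file
    for csv_file in sorted(Path(data_dir).glob("*.csv")):
        con.execute(
            f"CREATE TABLE {csv_file.stem} AS "
            f"SELECT * FROM read_csv_auto('{csv_file}')"
        )

    # Run the query and fetch the result as a pandas DataFrame
    result = con.execute(Path(sql_path).read_text()).df()

    # Save the full result under outputs/<schema>/<query>_output.csv
    out_dir = Path("outputs") / Path(data_dir).name
    out_dir.mkdir(parents=True, exist_ok=True)
    result.to_csv(out_dir / f"{Path(sql_path).stem}_output.csv", index=False)

    # Write execution details and a preview of the results to the log file
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    Path(log_path).write_text(
        f"Executed: {sql_path}\nRows returned: {len(result)}\n\n"
        f"{result.head(10).to_string()}\n"
    )


if __name__ == "__main__":
    run_query(sys.argv[1], sys.argv[2], sys.argv[3])
```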
For Windows:
python -m venv .venv
.venv\Scripts\activate
For Mac/Linux:
python3 -m venv .venv
source .venv/bin/activate
Then install the dependencies:
pip install -r requirements.txt
This will install:
- pandas: For data manipulation and CSV handling
- duckdb: For in-memory SQL query execution (a quick check follows)
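Optionally, you can verify that the installation worked with a one-off check (this is not part of the project scripts):

```python
# Quick sanity check that both dependencies import and DuckDB can run a query.
import duckdb
import pandas as pd

print("duckdb:", duckdb.__version__)
print("pandas:", pd.__version__)
print(duckdb.sql("SELECT 42 AS answer").df())
```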
Execute one of the example queries:
python run_sql_script.py queries/sample_schema/cube_example.sql data/sample_schema/ logs/cube_log.txt
After successful execution, you'll find:
- Log file: logs/cube_log.txt - Contains execution details and a preview of the query results
- CSV output: outputs/sample_schema/cube_example_output.csv - Full query results in CSV format
python run_sql_script.py <path_to_sql_file> <path_to_data_folder> <path_to_log_file>
- <path_to_sql_file>: Path to your SQL query file (must end with .sql)
- <path_to_data_folder>: Folder containing the CSV files to load as tables
- <path_to_log_file>: Where to save execution logs and results
# Run the CUBE example
python run_sql_script.py queries/sample_schema/cube_example.sql data/sample_schema/ logs/cube_analysis.txt
# Run the JOIN example
python run_sql_script.py queries/sample_schema/join_example.sql data/sample_schema/ logs/join_analysis.txt
# Run the aggregation example
python run_sql_script.py queries/sample_schema/aggregation_example.sql data/sample_schema/ logs/agg_analysis.txt
This section explains how students can set up the project for new assignments, different datasets, or their own projects.
Create a new subfolder inside data/ for your specific assignment or dataset:
# Example: For Assignment 2
mkdir data/assignment2_schema
# Or for a specific project
mkdir data/retail_analysis_schema
mkdir data/healthcare_schema
Place your CSV data files inside your new schema folder:
data/
└── assignment2_schema/        # Your new assignment folder
    ├── customers.csv          # Your dimension tables
    ├── products.csv
    ├── orders.csv             # Your fact tables
    └── order_items.csv
Important:
- CSV files must have headers in the first row
- Table names in your SQL queries will match the CSV filename (without .csv extension)
- Example: customers.csv becomes table customers in your SQL (see the sketch below)
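A minimal illustration of this naming convention using DuckDB directly; the file path is hypothetical and follows the assignment2_schema example above:

```python
import duckdb

con = duckdb.connect()
# data/assignment2_schema/customers.csv becomes a table called "customers"
con.execute(
    "CREATE TABLE customers AS "
    "SELECT * FROM read_csv_auto('data/assignment2_schema/customers.csv')"
)
print(con.execute("SELECT * FROM customers LIMIT 5").df())
```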
Create a matching folder in queries/ with the exact same name:
# Must match your data folder name exactly
mkdir queries/assignment2_schema
Create your .sql files inside your queries schema folder:
queries/
└── assignment2_schema/        # Matches your data folder name
    ├── customer_analysis.sql  # Your custom queries
    ├── sales_report.sql
    └── monthly_trends.sql
Example SQL query (queries/assignment2_schema/customer_analysis.sql):
-- Customer Analysis for Assignment 2
SELECT
c.customer_name,
COUNT(o.order_id) as total_orders,
SUM(oi.quantity * oi.unit_price) as total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY c.customer_id, c.customer_name
ORDER BY total_spent DESC;
Use the same command pattern with your new schema name:
# Run your custom query
python run_sql_script.py queries/assignment2_schema/customer_analysis.sql data/assignment2_schema/ logs/assignment2_results.txt
Your results will be automatically organized:
- Log file: logs/assignment2_results.txt (execution details + preview)
- CSV output: outputs/assignment2_schema/customer_analysis_output.csv (full results)
# 1. Create folders
mkdir data/your_schema_name
mkdir queries/your_schema_name
# 2. Add your CSV files to data/your_schema_name/
# 3. Create SQL queries in queries/your_schema_name/
# 4. Run your analysis
python run_sql_script.py queries/your_schema_name/your_query.sql data/your_schema_name/ logs/your_log.txt
You can work on multiple assignments simultaneously:
data_warehouse_lab/
├── data/
│   ├── sample_schema/            # Provided examples
│   ├── assignment1_schema/       # Your Assignment 1
│   ├── assignment2_schema/       # Your Assignment 2
│   └── final_project_schema/     # Your final project
│
├── queries/
│   ├── sample_schema/            # Example queries
│   ├── assignment1_schema/       # Assignment 1 queries
│   ├── assignment2_schema/       # Assignment 2 queries
│   └── final_project_schema/     # Final project queries
│
└── outputs/
    ├── sample_schema/            # Example results
    ├── assignment1_schema/       # Assignment 1 results
    ├── assignment2_schema/       # Assignment 2 results
    └── final_project_schema/     # Final project results
- Consistent Naming: Always use the same name for your data and queries folders
- Descriptive Names: Use clear folder names like assignment2_schema or retail_analysis
- CSV Headers: Ensure your CSV files have column headers in the first row
- SQL Table Names: Reference tables using the CSV filename without the .csv extension
- Test Early: Start with simple queries to verify your data loads correctly
The project includes a sample retail dataset with three tables:
dim_customer (customer dimension):
- customer_id: Unique customer identifier
- customer_name: Customer's full name
- city, region, country: Geographic information
- age_group: Customer age demographic
- segment: Business segment (Consumer, Corporate, Home Office)
dim_product (product dimension):
- product_id: Unique product identifier
- product_name: Product name
- category, subcategory: Product classification
- brand: Product brand
- unit_price, cost: Pricing information
fact_sales (sales fact table):
- sale_id: Unique transaction identifier
- customer_id, product_id: Foreign keys to dimension tables
- sale_date: Transaction date
- quantity: Number of items sold
- total_amount: Total transaction value
- discount_percent: Applied discount
- sales_rep: Sales representative
- region: Sales region
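To get a feel for these tables before writing queries, you can inspect their schemas and row counts directly. This is a quick exploration sketch, assuming you run it from the project root with the default data/sample_schema/ layout:

```python
import duckdb
from pathlib import Path

con = duckdb.connect()
for csv_file in Path("data/sample_schema").glob("*.csv"):
    con.execute(
        f"CREATE TABLE {csv_file.stem} AS SELECT * FROM read_csv_auto('{csv_file}')"
    )

# Print row count and column names for each table in the sample schema
for table in ("dim_customer", "dim_product", "fact_sales"):
    rows = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    columns = [col[0] for col in con.execute(f"DESCRIBE {table}").fetchall()]
    print(f"{table}: {rows} rows, columns = {columns}")
```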
cube_example.sql: Demonstrates multi-dimensional analysis using the CUBE operator (see the sketch after this list):
- Analyzes sales across region, category, and age group
- Shows subtotals and grand totals for all combinations
- Perfect for understanding OLAP cube concepts
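For instance, a CUBE query in this style can be run directly against the sample data. This is a hedged sketch of the pattern; the actual contents of cube_example.sql may differ:

```python
import duckdb

con = duckdb.connect()
for name in ("dim_customer", "dim_product", "fact_sales"):
    con.execute(
        f"CREATE TABLE {name} AS SELECT * FROM read_csv_auto('data/sample_schema/{name}.csv')"
    )

# CUBE generates every combination of the grouping columns, including
# subtotals (NULL in the rolled-up columns) and a single grand-total row.
cube_sql = """
SELECT
    c.region,
    p.category,
    c.age_group,
    SUM(f.total_amount) AS revenue
FROM fact_sales f
JOIN dim_customer c ON f.customer_id = c.customer_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY CUBE (c.region, p.category, c.age_group)
ORDER BY c.region, p.category, c.age_group
"""
print(con.execute(cube_sql).df())
```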
join_example.sql: Shows how to combine fact and dimension tables (see the sketch after this list):
- Joins all three tables for comprehensive reporting
- Calculates profit margins and other derived metrics
- Demonstrates typical data warehouse reporting patterns
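A typical star-schema report of this kind might look like the following. It is illustrative only and not necessarily identical to join_example.sql; the profit calculation assumes cost is a per-unit figure:

```python
import duckdb

con = duckdb.connect()
for name in ("dim_customer", "dim_product", "fact_sales"):
    con.execute(
        f"CREATE TABLE {name} AS SELECT * FROM read_csv_auto('data/sample_schema/{name}.csv')"
    )

# Join the fact table to both dimensions and derive a profit estimate
join_sql = """
SELECT
    c.customer_name,
    c.region,
    p.product_name,
    p.category,
    f.quantity,
    f.total_amount,
    f.total_amount - (p.cost * f.quantity) AS estimated_profit
FROM fact_sales f
JOIN dim_customer c ON f.customer_id = c.customer_id
JOIN dim_product p ON f.product_id = p.product_id
ORDER BY estimated_profit DESC
LIMIT 10
"""
print(con.execute(join_sql).df())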
aggregation_example.sql: Focuses on various aggregation functions (see the sketch after this list):
- COUNT, SUM, AVG, MIN, MAX operations
- GROUP BY with HAVING clauses
- Business metrics calculation
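An aggregation query in this spirit is sketched below; it is not the exact contents of aggregation_example.sql, and the HAVING threshold is an arbitrary example value:

```python
import duckdb

con = duckdb.connect()
con.execute(
    "CREATE TABLE fact_sales AS "
    "SELECT * FROM read_csv_auto('data/sample_schema/fact_sales.csv')"
)

# Core aggregates with GROUP BY and HAVING: keep only regions with enough sales
agg_sql = """
SELECT
    region,
    COUNT(*)          AS num_sales,
    SUM(total_amount) AS total_revenue,
    AVG(total_amount) AS avg_sale,
    MIN(total_amount) AS smallest_sale,
    MAX(total_amount) AS largest_sale
FROM fact_sales
GROUP BY region
HAVING COUNT(*) > 5
ORDER BY total_revenue DESC
"""
print(con.execute(agg_sql).df())
```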
- Create a new SQL file in the appropriate schema folder:
-- queries/sample_schema/my_analysis.sql
SELECT
    c.region,
    COUNT(*) as total_orders,
    SUM(f.total_amount) as revenue
FROM fact_sales f
JOIN dim_customer c ON f.customer_id = c.customer_id
GROUP BY c.region
ORDER BY revenue DESC;
- Run your query:
python run_sql_script.py queries/sample_schema/my_analysis.sql data/sample_schema/ logs/my_results.txt
- Add CSV files to the data/sample_schema/ folder (or create a new folder)
- Ensure proper CSV format with headers in the first row
- Reference tables in your SQL using the filename without extension
- Example: sales_data.csv becomes table sales_data
Error: "SQL file not found"
- Check that the path to your SQL file is correct
- Ensure the file has a .sql extension
Error: "No CSV files found"
- Verify the data folder path is correct
- Ensure CSV files are in the specified folder
- Check that files have a .csv extension
Error: "Query execution failed"
- Check your SQL syntax
- Verify table names match CSV filenames (without extension)
- Review the log file for detailed error messages
Error: "Failed to load CSV"
- Ensure CSV files are properly formatted
- Check for encoding issues (should be UTF-8)
- Verify CSV headers are present
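If a CSV keeps failing to load, a quick pandas check often pinpoints the problem; the path below is a placeholder for whichever file is misbehaving:

```python
import pandas as pd

path = "data/your_schema_name/your_table.csv"  # placeholder: the file that fails to load

# Raises UnicodeDecodeError if the file is not UTF-8 encoded
df = pd.read_csv(path, encoding="utf-8")

print("Columns (these become the table's column names):", df.columns.tolist())
print(df.head())
```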
- Check the log file - It contains detailed error information
- Review SQL syntax - Ensure proper DuckDB SQL syntax
- Verify table names - Must match CSV filenames exactly
- Test with sample queries - Start with provided examples
- Modify the aggregation example to analyze different dimensions
- Create a simple query to find top-selling products
- Write a query to analyze sales by month
- Create a query using window functions (ROW_NUMBER, RANK)
- Implement a ROLLUP operation for hierarchical totals
- Build a query to calculate year-over-year growth
- Create complex CTEs (Common Table Expressions)
- Implement advanced analytical functions
- Build a comprehensive sales dashboard query
- Keep SQL files focused on specific analysis topics
- Use descriptive names for queries and log files
- Document your queries with comments
- Organize data into logical folders by subject area
- Regular cleanup of log files to manage disk space
Happy Learning!
This project is designed to give you hands-on experience with data warehouse concepts. Start with the example queries, then experiment with your own analyses to deepen your understanding of SQL and data warehousing principles.