# H.01 | Introduction to SQL and OLAP

H.01 is a simple introduction to SQL and OLAP (online analytical processing). It covers the basic concepts and provides a foundation for working with databases and performing data analysis using SQL. We'll use Snowflake as our database and TPC-H as our dataset.

## Snowflake

Snowflake is a cloud-based data warehousing platform that provides a powerful and flexible environment for storing, processing, and analyzing large volumes of data. It is designed to handle complex data workloads and offers features such as scalability, high performance, and ease of use. Snowflake supports SQL as its primary query language, making it accessible to users familiar with SQL. We will be using Snowflake in H.01 to demonstrate the concepts of SQL and OLAP.

Please ensure the `.env` file in the root of this folder is filled out; it is required to connect to Snowflake.
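
The contents of the `.env` file are not shown here, so the following is only a minimal sketch of how such a file is commonly read, assuming plain `KEY=VALUE` lines. The key names (`SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`) and the `parse_env` helper are hypothetical; check the course template for the real keys.

```python
def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


# Hypothetical contents; the real key names may differ.
EXAMPLE_ENV = """
# Snowflake credentials
SNOWFLAKE_ACCOUNT=my_account
SNOWFLAKE_USER="my_user"
SNOWFLAKE_PASSWORD=change-me
"""

settings = parse_env(EXAMPLE_ENV)
required = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD")
missing = [key for key in required if not settings.get(key)]
print("missing keys:", missing)
```

A check like this, run before connecting, turns a confusing authentication error into a clear message about which value was left blank.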

## TPC-H 

TPC-H is a decision-support benchmark whose dataset simulates a real-world business environment and has a relatively simple schema. It is widely used for testing and comparing the performance of different database systems, and we will use it to demonstrate the concepts of SQL and OLAP. The dataset consists of several tables, each representing a different aspect of the business. The tables are related through foreign keys, which let us join them and run complex queries.

<div style="align: center; justify-content: center; display: flex;">
    <img src="https://docs.snowflake.com/en/_images/sample-data-tpch-schema.png" alt="Snowflake Schema" width="400" height="400" style = "border-radius: 10px">
</div>
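
To make the foreign-key idea concrete, here is a small sketch that joins two toy tables shaped like TPC-H's `CUSTOMER` and `ORDERS`, where `O_CUSTKEY` refers to `C_CUSTKEY`. It uses Python's built-in `sqlite3` as a stand-in so it runs without Snowflake credentials. The rows are made up for illustration and are not TPC-H data.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (
        o_orderkey   INTEGER PRIMARY KEY,
        o_custkey    INTEGER REFERENCES customer (c_custkey),
        o_totalprice REAL
    );
    INSERT INTO customer VALUES (1, 'Customer#1'), (2, 'Customer#2');
    INSERT INTO orders VALUES (10, 1, 150.0), (11, 1, 50.0), (12, 2, 75.0);
""")

# Match each order to its customer through the foreign key, then aggregate.
join_rows = db.execute("""
    SELECT c.c_name, COUNT(*) AS n_orders, SUM(o.o_totalprice) AS total_spent
    FROM customer AS c
    JOIN orders AS o ON o.o_custkey = c.c_custkey
    GROUP BY c.c_name
    ORDER BY c.c_name
""").fetchall()
print(join_rows)
```

The same `JOIN ... ON` pattern applies to the real tables in Snowflake; only the connection differs.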

## Snowflake Connection

To connect to Snowflake, we will use the `snowflake-connector-python` library. This library provides a simple and efficient way to connect to Snowflake and execute SQL queries. We will also use the `pandas` library to load the data into a DataFrame for easy viewing.

In [None]:
from scripts.connection import connect_to_snowflake, as_dataframe
conn = connect_to_snowflake(database="SNOWFLAKE_SAMPLE_DATA", schema="TPCH_SF1")
cursor = conn.cursor()
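
The helpers `connect_to_snowflake` and `as_dataframe` live in `scripts/connection.py`, which is not shown here. As a rough sketch of what a helper like `as_dataframe` might do, the snippet below pairs fetched rows with the column names in `cursor.description`, an attribute the Python DB-API defines for cursors. It uses `sqlite3` as a stand-in cursor and a hypothetical `rows_as_records` function. This is an assumption about the approach, not the course's actual implementation.

```python
import sqlite3


def rows_as_records(rows, cursor):
    """Return (column_names, list_of_dicts) for rows fetched from a DB-API cursor."""
    # Each entry of cursor.description is a sequence whose first item is the column name.
    columns = [col[0] for col in cursor.description]
    return columns, [dict(zip(columns, row)) for row in rows]


demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute("SELECT 1 AS part_key, 'bolt' AS part_name")
demo_rows = demo_cursor.fetchall()

columns, records = rows_as_records(demo_rows, demo_cursor)
print(columns)
print(records)
```

With pandas installed, `pd.DataFrame(demo_rows, columns=columns)` would turn the same inputs into a DataFrame.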

# Exercise 1 | Basic SQL Practice

This part of the homework will focus on using the Snowflake Connector for Python to connect to a Snowflake database and perform basic operations.

## Exercise 1.1

Retrieve the names and account balances of all customers whose balance is greater than 5000.

In [None]:
QUERY = """ """
cursor.execute(QUERY)
rows = cursor.fetchall()
as_dataframe(rows, cursor)

|  | C_NAME | C_ACCTBAL |
|---|---|---|
| 0 | Customer#000060001 | 9957.56 |
| 1 | Customer#000060004 | 7975.22 |
| 2 | Customer#000060006 | 9051.40 |
| 3 | Customer#000060007 | 6017.17 |
| 4 | Customer#000060008 | 5621.44 |
| ... | ... | ... |
| 67984 | Customer#000104992 | 6705.85 |
| 67985 | Customer#000104993 | 8245.59 |
| 67986 | Customer#000104996 | 8180.80 |
| 67987 | Customer#000104998 | 9792.23 |


## Exercise 1.2

List all orders with an order date in January 1995.

In [None]:
QUERY = """ """
cursor.execute(QUERY)
rows = cursor.fetchall()
as_dataframe(rows, cursor)

|  | O_ORDERKEY | O_ORDERDATE |
|---|---|---|
| 0 | 4200263 | 1995-01-20 |
| 1 | 4200421 | 1995-01-01 |
| 2 | 4200673 | 1995-01-05 |
| 3 | 4200930 | 1995-01-06 |
| 4 | 4200998 | 1995-01-14 |
| ... | ... | ... |
| 19467 | 1196800 | 1995-01-26 |
| 19468 | 1197057 | 1995-01-29 |
| 19469 | 1197856 | 1995-01-31 |
| 19470 | 1198468 | 1995-01-10 |


## Exercise 1.3

Find the total number of parts.

In [None]:
QUERY = """ """
cursor.execute(QUERY)
rows = cursor.fetchall()
as_dataframe(rows, cursor)

## Exercise 1.4

List the top 5 suppliers with the highest account balance.

In [None]:
QUERY = """ """
cursor.execute(QUERY)
rows = cursor.fetchall()
as_dataframe(rows, cursor)

## Exercise 1.5

Calculate the average order price across all orders.

In [None]:
QUERY = """ """
cursor.execute(QUERY)
rows = cursor.fetchall()
as_dataframe(rows, cursor)

## Exercise 1.6

Retrieve the names of parts supplied by suppliers from nation number 3.

In [None]:
QUERY = """ """
cursor.execute(QUERY)
rows = cursor.fetchall()
as_dataframe(rows, cursor)

## Exercise 1.7

Find the total extended price for each order. Return the order key and total extended price.

In [None]:
QUERY = """ """
cursor.execute(QUERY)
rows = cursor.fetchall()
as_dataframe(rows, cursor)

## Exercise 1.8

Retrieve the names of customers who have orders with a total price greater than 100,000.

In [None]:
QUERY = """ """
cursor.execute(QUERY)
rows = cursor.fetchall()
as_dataframe(rows, cursor)

# Exercise 2 | Demonstration of Serverless OLAP Speed

Let's demonstrate the power of Snowflake and modern OLAP systems.

The `TPCH_SF1` schema contains a table named `lineitem` with about **6 million rows**. Its `l_quantity` column records the quantity of items sold on each line item. Let's explore two different approaches to computing the average quantity sold.

**Note**: A difference of a few seconds may not seem like a big deal. If you change `TPCH_SF1` to `TPCH_SF100` (whose `lineitem` table has about 600 million rows), the difference becomes far more apparent. That run also costs more, so I don't recommend it unless you're willing to burn a few dollars.

### Slow Approach

The easiest approach to calculating the average quantity is to download the entire table and then calculate the average using pandas. This is a **slow** and **expensive** approach, as it requires downloading all 6 million rows of data. For larger datasets it becomes infeasible, because the table will no longer fit in the memory of your local setup.

In [7]:
import pandas as pd

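# Pull every l_quantity value to the client, then average locally with pandas.
SLOW_QUERY = "select l_quantity from lineitem;"
df = pd.read_sql(SLOW_QUERY, conn)
# Snowflake returns unquoted column names in upper case.
avg_quantity = float(df["L_QUANTITY"].mean())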

# Format into a nice table.
df = pd.DataFrame({"AVG_QUANTITY": [avg_quantity]})
df

|  | AVG_QUANTITY |
|---|---|
| 0 | 25.507967 |
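
A back-of-the-envelope estimate shows why this stops scaling. The sketch below assumes each value occupies 8 bytes once loaded (a `float64` column) and uses the approximate row counts quoted in this notebook. Real usage is higher because of connector and DataFrame overhead, and higher still if the values arrive as Python objects.

```python
BYTES_PER_VALUE = 8  # assumption: one float64 per l_quantity value

# Approximate lineitem row counts quoted in this notebook.
ROW_COUNTS = {"TPCH_SF1": 6_000_000, "TPCH_SF100": 600_000_000}

estimates_mb = {
    schema: rows * BYTES_PER_VALUE / 1_000_000 for schema, rows in ROW_COUNTS.items()
}
for schema, megabytes in estimates_mb.items():
    print(f"{schema}: ~{megabytes:,.0f} MB for a single numeric column")
```

That is for one column only; downloading every column of `lineitem` multiplies the figure accordingly.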


### Fast Approach

The fast approach is to let the Snowflake SQL engine calculate the average quantity sold. This is much faster and more efficient: the computation runs on the (scalable) Snowflake server, and only a single-row result is downloaded. This does incur a cost, but it is often cheaper to operate than a large local server.

In [8]:
FAST_QUERY = "select AVG(l_quantity) as avg_quantity from lineitem;"
df = pd.read_sql(FAST_QUERY, conn)
df

|  | AVG_QUANTITY |
|---|---|
| 0 | 25.507967 |
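
The same contrast can be reproduced locally as a sketch, using `sqlite3` in place of Snowflake and a small made-up `lineitem` table. Both patterns return the same average, but the first transfers every row to Python while the second transfers a single row.

```python
import sqlite3

local_db = sqlite3.connect(":memory:")
local_db.execute("CREATE TABLE lineitem (l_quantity REAL)")
# Made-up quantities cycling through 1..50; not TPC-H data.
local_db.executemany(
    "INSERT INTO lineitem VALUES (?)", [(float(q % 50 + 1),) for q in range(10_000)]
)

# Slow pattern: fetch every row, then aggregate on the client.
all_rows = local_db.execute("SELECT l_quantity FROM lineitem").fetchall()
client_avg = sum(row[0] for row in all_rows) / len(all_rows)

# Fast pattern: let the database engine aggregate and return one row.
agg_rows = local_db.execute("SELECT AVG(l_quantity) FROM lineitem").fetchall()
server_avg = agg_rows[0][0]

print(len(all_rows), "rows fetched vs", len(agg_rows))
print(client_avg, server_avg)
```

On a local in-memory database the time difference is small; against a remote warehouse, the rows in the slow pattern also have to cross the network, which is where most of the extra time goes.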


# Submit

For your convenience, you can just submit your homework by running the cell below. An input box will pop up at the top of the notebook. Respond "y" to submit your homework.

In [2]:
response = input("Are you sure you want to submit H.01? (y/n): ")
if response.lower() != "y":
    print("Submission cancelled.")
else:
    print("Submitting H.01...")
    !python scripts/submit.py --homework ./H01.ipynb
    print("H.01 submitted successfully.")

Submitting H.01...