<a href="https://colab.research.google.com/github/brendanpshea/database_sql/blob/main/Database_04_AdvancedSelect.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Super Select: Advanced Retrieval With Mario Brothers Plumbing

## Mario Brothers Plumbing: Database Schema
In this lesson, we'll be working with a database that represents the plumbing business of the famous "Mario Brothers". Let's get started.

First, we'll load the database and display the basic schema.

In [1]:
# download database
!wget https://github.com/brendanpshea/database_sql/raw/main/data/mario_bros_plumbing.db -q -nc

# Load sql magic and connect
%load_ext sql
%sql sqlite:///mario_bros_plumbing.db

In [2]:
%%sql
-- display schema (SQLite)
SELECT * FROM sqlite_master WHERE type='table';

 * sqlite:///mario_bros_plumbing.db
Done.


type,name,tbl_name,rootpage,sql
table,Customers,Customers,2,"CREATE TABLE Customers (  customer_id INTEGER PRIMARY KEY,  first_name VARCHAR(255),  last_name VARCHAR(255),  address JSON,  phone_number VARCHAR(20) )"
table,Employees,Employees,3,"CREATE TABLE Employees (  employee_id INTEGER PRIMARY KEY,  first_name VARCHAR(255),  last_name VARCHAR(255),  job_title VARCHAR(255),  hire_date DATE )"
table,ServiceTypes,ServiceTypes,4,"CREATE TABLE ServiceTypes (  service_type_id INTEGER PRIMARY KEY,  service_type_name VARCHAR(255),  description VARCHAR(255) )"
table,Services,Services,5,"CREATE TABLE Services (  service_id INTEGER PRIMARY KEY,  service_type_id INTEGER,  service_name VARCHAR(255),  description VARCHAR(255),  price DECIMAL(10,2),  FOREIGN KEY (service_type_id) REFERENCES ServiceTypes (service_type_id) )"
table,Orders,Orders,6,"CREATE TABLE Orders (  -- Keeps track of a customer's orders  order_id INTEGER PRIMARY KEY,  customer_id INTEGER,  employee_id INTEGER,  order_date DATE,  total_amount DECIMAL(10,2),  FOREIGN KEY (customer_id) REFERENCES Customers (customer_id),  FOREIGN KEY (employee_id) REFERENCES Employees (employee_id) )"
table,Order_Items,Order_Items,7,"CREATE TABLE Order_Items (  -- Keeps track of a customer's order items  -- This is one line on an invoice  order_item_id INTEGER PRIMARY KEY,  order_id INTEGER,  service_id INTEGER,  quantity INTEGER,  FOREIGN KEY (order_id) REFERENCES Orders (order_id),  FOREIGN KEY (service_id) REFERENCES Services (service_id) )"


## Database Overview
The "Mario Brothers Plumbing" database consists of six interconnected tables designed to manage a plumbing business:

1.  **Customers**: Stores customer information, including a JSON field for address.
2.  **Employees**: Stores employee information, including job title and hire date.
3.  **ServiceTypes**: Stores service type information, including name and description.
4.  **Services**: Stores service information, including name, description, and price (DECIMAL).
5.  **Orders**: Stores order information, including customer, employee, date, and total amount (DECIMAL).
6.  **Order_Items**: Stores order item information, including order, service, and quantity.

### Data Types: JSON and DECIMAL
Two notable data types used in this database are JSON and DECIMAL.

JSON (JavaScript Object Notation) is a lightweight data interchange format that allows for flexible and structured data representation. It can store complex data types like objects and arrays. In this database, JSON is used to store customer addresses, as it provides a convenient way to store and retrieve structured address data without the need for separate address-related tables. Later in this chapter, we'll see how to use SQLite to query this data.

DECIMAL is a data type used to store precise numeric values, with a specified precision and scale. It is suitable for storing monetary values, such as prices and total amounts, where exactness is crucial. In this database, DECIMAL(10,2) is used, allowing for prices and total amounts up to 99,999,999.99.
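
One SQLite-specific detail is worth checking here. SQLite treats a declared type such as `DECIMAL(10,2)` as a hint (a "type affinity") rather than a strict fixed-point type, which is likely why prices appear later as `50` rather than `50.00`. The sketch below uses Python's built-in `sqlite3` module and a made-up `DemoPrices` table (not part of the Mario Brothers database) to show how values in such a column are actually stored.

```python
import sqlite3

# In-memory database so nothing touches mario_bros_plumbing.db
conn = sqlite3.connect(":memory:")

# Hypothetical table that mirrors the declared type of Services.price
conn.execute("CREATE TABLE DemoPrices (price DECIMAL(10,2))")
conn.executemany("INSERT INTO DemoPrices (price) VALUES (?)", [(50,), (49.99,)])

# typeof() reports the storage class SQLite actually used for each value
rows = conn.execute(
    "SELECT price, typeof(price) FROM DemoPrices ORDER BY rowid"
).fetchall()
storage_types = [storage for _, storage in rows]
print(rows)
```

Whole-number prices are stored as integers and fractional ones as floating-point values, so formatting (for example with `PRINTF`) is needed whenever two decimal places should be displayed.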

### Relationships

The tables in this database are related through one-to-many relationships, established using foreign key constraints:

-   **Customers** and **Orders**: A customer can have multiple orders, but an order belongs to only one customer. This is a one-to-many relationship, with the `customer_id` foreign key in the Orders table referencing the `customer_id` primary key in the Customers table.
-   **Employees** and **Orders**: An employee can handle multiple orders, but an order is handled by only one employee. This is a one-to-many relationship, with the `employee_id` foreign key in the Orders table referencing the `employee_id` primary key in the Employees table.
-   **Orders** and **Order_Items**: An order can have multiple order items, but an order item belongs to only one order. This is a one-to-many relationship, with the `order_id` foreign key in the Order_Items table referencing the `order_id` primary key in the Orders table.
-   **Services** and **Order_Items**: A service can be included in multiple order items, but an order item includes only one service. This is a one-to-many relationship, with the `service_id` foreign key in the Order_Items table referencing the `service_id` primary key in the Services table.
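
As a rough illustration of what a foreign key constraint does, the following sketch builds cut-down versions of `Customers` and `Orders` in an in-memory database (the rows are invented for the example). It assumes Python's `sqlite3` module; note that SQLite only enforces foreign keys on a connection after `PRAGMA foreign_keys = ON` has been run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is switched on
conn.execute("PRAGMA foreign_keys = ON")

# Simplified versions of two tables from the schema
conn.execute(
    "CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, first_name VARCHAR(255))"
)
conn.execute("""
    CREATE TABLE Orders (
      order_id INTEGER PRIMARY KEY,
      customer_id INTEGER,
      FOREIGN KEY (customer_id) REFERENCES Customers (customer_id)
    )
""")

conn.execute("INSERT INTO Customers VALUES (1, 'Peach')")
conn.execute("INSERT INTO Orders VALUES (1, 1)")  # valid: customer 1 exists

try:
    # Invalid: there is no customer 99, so this order would be an orphan
    conn.execute("INSERT INTO Orders VALUES (2, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print("Orphan order rejected:", orphan_rejected)
```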

### Sub-type Relationship

In this database, Services are a **sub-type** of ServiceTypes. This means that each service belongs to a specific service type, and the service type provides a way to categorize and group related services.

The sub-type relationship is encoded in the relational database using a one-to-many relationship between the ServiceTypes and Services tables. The `service_type_id` foreign key in the Services table references the `service_type_id` primary key in the ServiceTypes table. This relationship ensures that each service is associated with a valid service type and allows for efficient querying and data integrity maintenance.

By using a sub-type relationship, the database can store common attributes of service types in the ServiceTypes table, while specific details of individual services are stored in the Services table. This design promotes data normalization, reduces data redundancy, and allows for easier management and extension of the service catalog.

## An ERD for Mario Brothers Plumbing
Now, let's take a look at the entity-relationship diagram for this database.

In [3]:
import base64
from IPython.display import Image, display, HTML

def mm(graph):
    graphbytes = graph.encode("utf8")
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))

mm("""
classDiagram
    Customers "1" -- "*" Orders
    Employees "1" -- "*" Orders
    Orders "1" -- "*" Order_Items
    Services "1" -- "*" Order_Items
    ServiceTypes <|-- Services

    class Customers {
        +customer_id: INTEGER PK
        +first_name: VARCHAR
        +last_name: VARCHAR
        +address: JSON
        +phone_number: VARCHAR
    }

    class Employees {
        +employee_id: INTEGER PK
        +first_name: VARCHAR
        +last_name: VARCHAR
        +job_title: VARCHAR
        +hire_date: DATE
    }

    class ServiceTypes {
        +service_type_id: INTEGER PK
        +service_type_name: VARCHAR
        +description: VARCHAR
    }

    class Services {
        +service_id: INTEGER PK
        +service_type_id: INTEGER FK
        +service_name: VARCHAR
        +description: VARCHAR
        +price: DECIMAL
    }

    class Orders {
        +order_id: INTEGER PK
        +customer_id: INTEGER FK
        +employee_id: INTEGER FK
        +order_date: DATE
        +total_amount: DECIMAL
    }

    class Order_Items {
        +order_item_id: INTEGER PK
        +order_id: INTEGER FK
        +service_id: INTEGER FK
        +quantity: INTEGER
    }
""")

The above diagram uses the Unified Modeling Language (UML). This is similar to the Crow's foot style we saw before, but with a few key differences.
1. In UML, entities are represented as classes, which are depicted as rectangles.
2. The attributes of an entity are listed inside the rectangle, below the entity name. Each attribute name is followed by a colon (:) and its data type.
3. Primary key attributes are marked with PK, indicating that they uniquely identify each record in the entity.
4. Foreign key attributes are marked with FK, indicating that they establish relationships with other entities.
5. Relationships between entities are represented by lines connecting the rectangles. The cardinality of a relationship is indicated at each end of the line.
6. In this diagram, a `1` marks the "one" side of a relationship, while an asterisk (`*`) marks the "many" side.
  - For example, the line between Customers and Orders with 1 on the Customers end and * on the Orders end indicates that one customer can have multiple orders (a one-to-many relationship).
7.  In UML, **inheritance** is represented by a line with a hollow arrowhead pointing from the subclass to the superclass.
  - In this diagram, the inheritance relationship is shown between Services and ServiceTypes, with Services inheriting from ServiceTypes.

## A Quick Look at the Data
Now, let's take a quick look at the data in each table.

In [4]:
%%sql
SELECT * FROM employees LIMIT 5;

 * sqlite:///mario_bros_plumbing.db
Done.


employee_id,first_name,last_name,job_title,hire_date
1,Super,Mario,Master Plumber,2000-09-13
2,Super,Luigi,Journeyman Plumber,2003-02-20
3,Princess,Peach,Project Manager,2005-06-10
4,Cat,Peach,Apprentice Plumber,2014-11-05
5,Tanuki,Mario,Plumbing Technician,2011-04-28


In [5]:
%%sql
SELECT * FROM customers LIMIT 5;

 * sqlite:///mario_bros_plumbing.db
Done.


customer_id,first_name,last_name,address,phone_number
1,Peach,Toadstool,"{""street"": ""Mushroom Castle"", ""city"": ""Toad Town""}",(555) 123-4567
2,Yoshi,Dino,"{""street"": ""24 Egg Island"", ""city"": ""Dinosaur Land"", ""apartment"": ""A""}",(555) 987-6543
3,Daisy,Sarasa,"{""street"": ""10 Sarasaland Way"", ""city"": ""Chai Kingdom""}",(555) 456-7890
4,Toadette,Toadstool,"{""street"": ""15 Mushroom St"", ""city"": ""Toad Town"", ""apartment"": ""2B""}",(555) 789-0123
5,Bowser,Koopa,"{""street"": ""1 Bowser Castle"", ""city"": ""Dark Land""}",(555) 654-3210


In [6]:
%%sql
SELECT * FROM serviceTypes LIMIT 5;

 * sqlite:///mario_bros_plumbing.db
Done.


service_type_id,service_type_name,description
1,Repair,Services related to fixing and repairing plumbing issues
2,Installation,Services related to installing new plumbing fixtures and systems
3,Inspection,Services related to inspecting and assessing plumbing systems


In [7]:
%%sql
SELECT * FROM services LIMIT 5;

 * sqlite:///mario_bros_plumbing.db
Done.


service_id,service_type_id,service_name,description,price
1,1,Pipe Repair,Fix leaky or broken pipes,50
2,1,Drain Cleaning,Clear clogged drains and pipes,75
3,2,Toilet Installation,Install a new toilet,150
4,2,Sink Replacement,Replace an old or damaged sink,200
5,1,Water Heater Repair,Fix issues with water heaters,120


In [8]:
%%sql
SELECT * FROM orders LIMIT 5;

 * sqlite:///mario_bros_plumbing.db
Done.


order_id,customer_id,employee_id,order_date,total_amount
1,3,1,2001-10-14,925
2,2,1,2010-05-29,825
3,1,3,2008-03-14,1025
4,6,3,2008-04-16,1140
5,2,1,2016-03-03,750


In [9]:
%%sql
SELECT * FROM order_items LIMIT 5;

 * sqlite:///mario_bros_plumbing.db
Done.


order_item_id,order_id,service_id,quantity
1,1,1,1
2,1,4,2
3,1,1,3
4,1,2,3
5,1,1,2


## Using GROUP BY in SQL

The `GROUP BY` clause in SQL is used to group rows in a result set based on one or more columns. It is often used in combination with aggregate functions like `COUNT()`, `SUM()`, `AVG()`, `MIN()`, and `MAX()` to perform calculations on grouped data.

The basic syntax of `GROUP BY` is as follows:

```sql
SELECT
 column1,
 column2,
 ...,
 aggregate_function(column) -- Ex: SUM(), COUNT(), AVG()
FROM
 table_name
GROUP BY column1, column2, ...;
```

When using `GROUP BY`, the `SELECT` statement should only include columns that are either listed in the `GROUP BY` clause or used with an aggregate function. The `GROUP BY` clause comes after the `FROM` and `WHERE` clauses but before the `ORDER BY` clause.

Let's explore some examples using the "Mario Brothers Plumbing" database to understand how GROUP BY can be used in practice.

### Example: Counting Orders per Customer (with Table Aliases)

Suppose we want to count the number of orders placed by each customer. We can use `GROUP BY` with the `COUNT()` aggregate function to achieve this.

In [10]:
%%sql
--Number of orders by each customer
SELECT
  c.customer_id AS "customer_id",
  c.first_name,
  c.last_name,
  COUNT(o.order_id) AS order_count
FROM
  Customers c -- We use a Table alias "c" for "Customers"
  -- Table alias "o" for orders
  JOIN Orders o ON c.customer_id = o.customer_id
-- We group by all columns in the select clause, but NOT the COUNT
GROUP BY c.customer_id, c.first_name, c.last_name
LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


customer_id,first_name,last_name,order_count
1,Peach,Toadstool,63
2,Yoshi,Dino,57
3,Daisy,Sarasa,67
4,Toadette,Toadstool,27
5,Bowser,Koopa,27
6,Wario,Wario,32
7,Waluigi,Wario,2
8,Donkey,Kong,1
9,Diddy,Kong,1
11,Cappy,Bonneter,1


A few things happen in this query:

1. First, **table aliases** are used to give a table, or a subquery in a FROM clause, a temporary name. They can make queries easier to write and to read by shortening the names of tables. In this query, 'c' is an alias for the 'Customers' table, and 'o' is an alias for the 'Orders' table. This allows us to refer to these tables using these shorter names throughout the query.

2.  The `GROUP BY` clause groups rows that have the same values in the specified columns. Here we group by `c.customer_id`, `c.first_name`, and `c.last_name`. Because `customer_id` is unique, the result set has one row for each customer that appears in the joined result (that is, each customer with at least one order).

3. The query also uses an aggregate function, `COUNT()`. `COUNT(column)` returns the number of rows in each group where that column is not NULL. In this case, it counts the `o.order_id` values in each customer's group.

So, the overall result of this query will be a list of customers (with their customer_id, first_name, and last_name), along with the number of orders that each customer has made.
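
Because the query uses an inner `JOIN`, a customer with no orders would not appear in the result at all. The sketch below, which uses Python's `sqlite3` module and a few made-up rows rather than the real database, contrasts `JOIN` with `LEFT JOIN`, which keeps such customers and reports a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented sample data: the third customer has placed no orders
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO Customers VALUES (1, 'Peach'), (2, 'Yoshi'), (3, 'Rosalina');
INSERT INTO Orders VALUES (1, 1), (2, 1), (3, 2);
""")

query = """
SELECT c.first_name, COUNT(o.order_id) AS order_count
FROM Customers c
  {join_type} Orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name
ORDER BY c.customer_id
"""

# Inner join: only customers with at least one matching order
inner_counts = conn.execute(query.format(join_type="JOIN")).fetchall()
# Left join: every customer; COUNT(o.order_id) skips the NULL order_id
left_counts = conn.execute(query.format(join_type="LEFT JOIN")).fetchall()

print(inner_counts)
print(left_counts)
```

The `LEFT JOIN` version works because `COUNT(o.order_id)` ignores NULLs; `COUNT(*)` would instead report 1 for a customer with no orders.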

### Example: Calculating Total Order Sales per Employee (with PRINTF)

Let's say we want to calculate the total amount of sales handled by each employee. We can use `GROUP BY` with the `SUM()` aggregate function.

In [11]:
%%sql
SELECT
  e.employee_id,
  e.first_name,
  e.last_name,
  -- We can use printf to format the way currency appears
  PRINTF("$%.2f", SUM(o.total_amount)) AS total_sales_usd
FROM
  Employees e
  JOIN Orders o ON e.employee_id = o.employee_id
-- One row per employee id
GROUP BY e.employee_id;

 * sqlite:///mario_bros_plumbing.db
Done.


employee_id,first_name,last_name,total_sales_usd
1,Super,Mario,$68485.00
2,Super,Luigi,$38815.00
3,Princess,Peach,$33755.00
4,Cat,Peach,$11635.00
5,Tanuki,Mario,$25855.00
6,Fire,Luigi,$17430.00
7,Toad,Toadstool,$7185.00


Here, the `GROUP BY` clause groups the rows by `employee_id`, so the query produces one row per unique `employee_id` value. As a general rule, any column in the `SELECT` list that is not inside an aggregate function (like `SUM()`) should also appear in the `GROUP BY` clause. Strictly speaking, that applies to `first_name` and `last_name` here. However, since `employee_id` is the primary key and uniquely identifies each employee, grouping by `employee_id` alone gives the same result, and SQLite accepts the shorter form. (Some other database systems are stricter and would require all three columns to be listed.)

The `PRINTF()` function is used to format the total sales amount as a currency string. It takes two arguments: a format string and a value.
  -   The format string `"$%.2f"` specifies that the output should start with a dollar sign (`$`), followed by the value with two decimal places (`%.2f`).
  -   The value passed to `PRINTF()` is the result of `SUM(o.total_amount)`, which calculates the sum of `total_amount` for each employee. Since we are grouping by `employee_id`, the `SUM()` function will calculate the total sales for each employee.
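
To experiment with the format string on its own, `PRINTF()` can be called without any table. This is a minimal sketch using Python's `sqlite3` module; the numbers are just sample inputs. It writes the format string in single quotes, which is the standard SQL way to write a string literal (SQLite also tolerates the double quotes used above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# printf() pads whole numbers to two decimals and rounds longer fractions
formatted = conn.execute(
    "SELECT printf('$%.2f', 68485), printf('$%.2f', 123.456)"
).fetchone()

print(formatted)
```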

### Finding the Most Popular Service Type

To find the most popular service type based on the number of order items, we can use `GROUP BY` with the `COUNT()` function and an `ORDER BY` clause.

In [12]:
%%sql
SELECT
  st.service_type_name AS most_popular_service,
  COUNT(oi.order_item_id) AS order_item_count
FROM
  ServiceTypes st
  JOIN Services s ON st.service_type_id = s.service_type_id
  JOIN Order_Items oi ON s.service_id = oi.service_id
GROUP BY st.service_type_name
-- biggest first
ORDER BY order_item_count DESC
-- top result only
LIMIT 1;

 * sqlite:///mario_bros_plumbing.db
Done.


most_popular_service,order_item_count
Repair,582


In this example, we join the ServiceTypes, Services, and Order_Items tables to connect the service types with their corresponding order items. We group the results by `service_type_name` and count the number of order items for each service type using `COUNT()`. The `ORDER BY` clause is used to sort the results in descending order based on the `order_item_count`, and the `LIMIT` clause is used to retrieve only the top result, which represents the most popular service type.

## Using HAVING in SQL

The `HAVING` clause in SQL is used to filter the results of an aggregate function based on a specified condition. It is similar to the `WHERE` clause, but while `WHERE` filters individual rows before grouping, `HAVING` filters grouped rows after the `GROUP BY` clause has been applied.

The basic syntax of `HAVING` is as follows:

```sql
SELECT
  column1,
  column2,
  ...,
  aggregate_function(column)
FROM table_name
(WHERE condition) -- A WHERE comes before
GROUP BY column1, column2, ...
-- HAVING comes after a group by
HAVING condition;
```

The `HAVING` clause comes after the `GROUP BY` clause and before the `ORDER BY` clause. The condition in the `HAVING` clause typically involves an aggregate function and can use comparison operators like `=`, `>`, `<`, `>=`, `<=`, and `<>`.
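
The difference between `WHERE` and `HAVING` is easiest to see when both appear in one query. The following sketch uses Python's `sqlite3` module with a small, invented `Orders` table: `WHERE` first discards orders placed before 2005, and `HAVING` then keeps only the employees whose remaining orders total more than 500.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up orders for three employees
conn.executescript("""
CREATE TABLE Orders (
  order_id INTEGER PRIMARY KEY,
  employee_id INTEGER,
  order_date DATE,
  total_amount DECIMAL(10,2)
);
INSERT INTO Orders VALUES
  (1, 1, '2001-10-14', 900),
  (2, 1, '2010-05-29', 800),
  (3, 2, '2008-03-14', 1000),
  (4, 2, '2012-01-01', 300),
  (5, 3, '2011-06-01', 200);
""")

results = conn.execute("""
SELECT employee_id, SUM(total_amount) AS total_sales
FROM Orders
WHERE order_date >= '2005-01-01'   -- row filter: runs before grouping
GROUP BY employee_id
HAVING SUM(total_amount) > 500     -- group filter: runs after grouping
ORDER BY employee_id
""").fetchall()

print(results)
```

Employee 1's 2001 order is removed by `WHERE` before the sum is taken, and employee 3 is removed by `HAVING` because the group total is only 200.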

### Example: Filtering Employees by Total Sales Amount
Suppose we want to find employees who have achieved total sales greater than $20,000. We can use HAVING with the SUM() aggregate function to filter the grouped results.

In [13]:
%%sql
SELECT
  e.employee_id,
  e.first_name,
  e.last_name,
  PRINTF("$%.2f", SUM(o.total_amount)) AS total_sales_usd
FROM
  Employees e
  JOIN Orders o ON e.employee_id = o.employee_id
GROUP BY e.employee_id
HAVING SUM(o.total_amount) > 20000;

 * sqlite:///mario_bros_plumbing.db
Done.


employee_id,first_name,last_name,total_sales_usd
1,Super,Mario,$68485.00
2,Super,Luigi,$38815.00
3,Princess,Peach,$33755.00
5,Tanuki,Mario,$25855.00


In this example, we join the Employees and Orders tables, group the results by `employee_id`, and calculate the total sales for each employee using `SUM()`. The `HAVING` clause then filters the grouped results to include only employees whose total sales exceed $20,000.

### Example: Filtering Service Types by Average Price

Let's say we want to find service types whose average price is greater than $100. We can use `HAVING` with the `AVG()` aggregate function to filter the grouped results.

In [14]:
%%sql
SELECT
  st.service_type_name,
  PRINTF("$%.2f", AVG(s.price)) AS average_price
FROM
  ServiceTypes st
  JOIN Services s ON st.service_type_id = s.service_type_id
GROUP BY st.service_type_name
HAVING AVG(s.price) > 100;

 * sqlite:///mario_bros_plumbing.db
Done.


service_type_name,average_price
Installation,$175.00
Repair,$123.75


Here, we join the ServiceTypes and Services tables, group the results by `service_type_name`, and calculate the average price for each service type using `AVG()`. The `HAVING` clause then filters the grouped results to include only service types whose average price is greater than $100.

## Subqueries in SQL
A **subquery**, also known as a nested query or inner query, is a query within another query. It allows you to use the results of one query as input for another query. Subqueries can be used in various parts of an SQL statement, such as `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses.

The basic syntax of a subquery is as follows:

```sql
SELECT ... -- Start of "outer query"
FROM ...
WHERE column_name operator (
  -- Start of subquery ("inner query")
    SELECT ...
    FROM ...
    WHERE ...
);
```

The subquery is enclosed in parentheses and placed within the outer query. The outer query uses the results of the subquery to perform further operations or filtering.
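
As a concrete instance of this pattern, the sketch below finds services that cost more than the average service. It uses Python's `sqlite3` module and an in-memory copy of just the five `Services` rows shown earlier in this lesson (the full table may contain more), so the numbers are for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the five sample rows displayed earlier, not the full Services table
conn.executescript("""
CREATE TABLE Services (service_id INTEGER PRIMARY KEY, service_name TEXT, price DECIMAL(10,2));
INSERT INTO Services VALUES
  (1, 'Pipe Repair', 50),
  (2, 'Drain Cleaning', 75),
  (3, 'Toilet Installation', 150),
  (4, 'Sink Replacement', 200),
  (5, 'Water Heater Repair', 120);
""")

above_average = conn.execute("""
SELECT service_name, price          -- outer query
FROM Services
WHERE price > (
  SELECT AVG(price) FROM Services   -- inner query: a single value (119 here)
)
ORDER BY price
""").fetchall()

print(above_average)
```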

Let's take a look at a few examples.

### Subquery in the SELECT Clause
Suppose we want to retrieve the customer details along with the difference between each customer's total order amount and the average amount of a single order (taken across all orders). We can use a subquery in the SELECT clause to calculate that average.

In [15]:
%%sql
SELECT
  c.customer_id,
  c.first_name,
  c.last_name,
  SUM(o.total_amount) AS total_order_amount,
  (SUM(o.total_amount) -
    -- We subtract the results of this subquery
    (SELECT AVG(total_amount) FROM Orders)
  ) AS difference_from_average
FROM
  Customers c
  JOIN  Orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


customer_id,first_name,last_name,total_order_amount,difference_from_average
1,Peach,Toadstool,37110,36450.38961038961
2,Yoshi,Dino,35635,34975.38961038961
3,Daisy,Sarasa,46210,45550.38961038961
4,Toadette,Toadstool,20450,19790.38961038961
5,Bowser,Koopa,18275,17615.38961038961
6,Wario,Wario,21465,20805.38961038961
7,Waluigi,Wario,2295,1635.3896103896104
8,Donkey,Kong,75,-584.6103896103896
9,Diddy,Kong,1065,405.3896103896104
11,Cappy,Bonneter,570,-89.61038961038957


In this example, the subquery `(SELECT AVG(total_amount) FROM Orders)` calculates the average `total_amount` across all orders. The result of the subquery is then used in the outer query to calculate the difference between each customer's total order amount and that average. This calculation would be difficult to achieve without a subquery.

We can use PRINTF to clean up the presentation of this data as follows:

In [16]:
%%sql
SELECT
  c.customer_id,
  c.first_name,
  c.last_name,
  PRINTF("$%.2f", SUM(o.total_amount)) AS total_order_amount,
  PRINTF("$%.2f",
    (
    SUM(o.total_amount) -
    -- We subtract the results of this subquery
    (SELECT AVG(total_amount) FROM Orders)
    )
  ) AS difference_from_average
FROM
  Customers c
  JOIN Orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


customer_id,first_name,last_name,total_order_amount,difference_from_average
1,Peach,Toadstool,$37110.00,$36450.39
2,Yoshi,Dino,$35635.00,$34975.39
3,Daisy,Sarasa,$46210.00,$45550.39
4,Toadette,Toadstool,$20450.00,$19790.39
5,Bowser,Koopa,$18275.00,$17615.39
6,Wario,Wario,$21465.00,$20805.39
7,Waluigi,Wario,$2295.00,$1635.39
8,Donkey,Kong,$75.00,$-584.61
9,Diddy,Kong,$1065.00,$405.39
11,Cappy,Bonneter,$570.00,$-89.61


###  Subquery in the `WHERE` or `HAVING` Clauses

Let's say we want to find the employees who have greater than average sales per order.

In [17]:
%%sql
SELECT
  e.employee_id,
  e.first_name,
  e.last_name,
  PRINTF("$%.2f", SUM(o.total_amount) / COUNT(o.total_amount)) AS avg_sales_order
FROM
  Employees e
  JOIN Orders o ON e.employee_id = o.employee_id
GROUP BY e.employee_id
HAVING (SUM(o.total_amount) / COUNT(o.total_amount)) >
  (SELECT AVG(total_amount) FROM Orders);

 * sqlite:///mario_bros_plumbing.db
Done.


employee_id,first_name,last_name,avg_sales_order
1,Super,Mario,$678.00
3,Princess,Peach,$675.00
5,Tanuki,Mario,$698.00
7,Toad,Toadstool,$718.00


The query retrieves employee details and their average sales per order, but only for employees whose average sales per order is greater than the overall average order amount. It joins the `Employees` and `Orders` tables, groups the results by `employee_id`, calculates the average sales per order using `SUM` and `COUNT`, and filters the results using a `HAVING` clause that compares each employee's average sales with the overall average calculated by a subquery.
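
One detail to watch with `SUM() / COUNT()` in SQLite: when both values are integers, the division discards the fractional part, which is probably why every average in the output above ends in `.00`. The sketch below uses Python's `sqlite3` module and two invented orders to compare that expression with the built-in `AVG()` function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, total_amount DECIMAL(10,2))"
)
# Two made-up whole-number amounts, stored by SQLite as integers
conn.executemany("INSERT INTO Orders (total_amount) VALUES (?)", [(10,), (15,)])

manual_avg, builtin_avg = conn.execute("""
SELECT
  SUM(total_amount) / COUNT(total_amount),  -- integer / integer truncates
  AVG(total_amount)                         -- always a floating-point result
FROM Orders
""").fetchone()

print(manual_avg, builtin_avg)
```

If the fractional part matters, `AVG(o.total_amount)` (or multiplying the sum by `1.0` before dividing) avoids the truncation.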

## Case Study: Subqueries and Big O in the Mushroom Kingdom
Welcome to the Mushroom Kingdom, where Birdo and Yoshi have taken up SQL programming in their spare time! As they dive into the world of databases, they quickly realize the importance of understanding query efficiency and performance. This is where Big O notation comes into play.

### Introduction to Big O Notation
Big O notation is a mathematical notation used to describe the performance or complexity of an algorithm. In the context of databases and SQL queries, Big O notation helps us analyze how the running time of a query changes as the size of the input data grows.

The "O" in Big O stands for "order of," and it describes the upper bound of the growth rate of a function. In simpler terms, it tells us how fast the running time of a query increases as the amount of data it processes increases.

Some common Big O notations and their meanings:

-   O(1): Constant time - The query's running time remains constant, regardless of the input size.
-   O(log n): Logarithmic time - The query's running time grows logarithmically with the input size.
-   O(n): Linear time - The query's running time grows linearly with the input size.
-   O(n log n): Linearithmic time - The query's running time grows in proportion to n times log n, which is slightly faster than linear growth.
-   O(n^2): Quadratic time - The query's running time grows quadratically with the input size.

Now, let's look at some simple database queries and their corresponding Big O notations.

### Example 1: Constant Time - O(1)

Birdo has a query that retrieves a single record from the `Toads` table based on a specific `id`:

```sql
SELECT * FROM Toads WHERE id = 1;
```

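This query is treated as O(1) because it retrieves a single record by its **primary key**, so the running time stays roughly constant no matter how large the `Toads` table gets. (Strictly speaking, most databases, including SQLite, locate the row through a B-tree, which takes O(log n) steps. That grows so slowly that it is often treated as constant.) Note that if we filtered on a column that is neither the primary key nor indexed, the database would have to scan the whole table, which is O(n).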
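
One way to see how SQLite intends to find rows is `EXPLAIN QUERY PLAN`. The sketch below uses Python's `sqlite3` module and an empty, hypothetical `Toads` table with the columns used in this case study. The exact wording of the plan differs between SQLite versions, so the code only looks for the keywords `SEARCH` (a direct lookup) and `SCAN` (reading every row).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table matching the columns used in the case study
conn.execute(
    "CREATE TABLE Toads (id INTEGER PRIMARY KEY, name TEXT, num_power_stars INTEGER)"
)

def query_plan(sql):
    """Return the human-readable detail text of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the description
    return " | ".join(row[-1] for row in rows)

pk_plan = query_plan("SELECT * FROM Toads WHERE id = 1")
name_plan = query_plan("SELECT * FROM Toads WHERE name = 'Toad'")

print("By primary key:", pk_plan)
print("By unindexed name:", name_plan)
```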

### Example 2: Linear Time - O(n)

Yoshi has a query that retrieves all the records from the `Toads` table:

```sql
SELECT * FROM Toads;
```

This query has a Big O notation of O(n), where n is the number of records in the `Toads` table. As the number of records grows, the running time of the query increases linearly.

### Example 3: Linearithmic Time - O(n log n)

Birdo has a query that sorts the `Toads` table based on the `name` column:

```sql
SELECT * FROM Toads ORDER BY name;
```

This query has a Big O notation of O(n log n) because comparison-based sorting algorithms typically take O(n log n) time. (If an index on `name` already exists, the database may be able to read the rows in order and skip the sort.)

### Example 4: Subquery with Quadratic Time - O(n^2)
Birdo has a query that retrieves the names of all Toads who have collected more than the average number of Power Stars:

```sql
SELECT name
FROM Toads
WHERE num_power_stars > (
  SELECT AVG(num_power_stars)
  FROM Toads
);
```

In this query, the subquery calculates the average number of Power Stars collected by all Toads. The outer query then compares each Toad's `num_power_stars` against this average. In the worst case, this query has a Big O notation of O(n^2) because:
 1. The inner query needs to scan the entire `Toads` table to calculate the average. This is O(n).
 2. The outer query is also O(n), because it needs to compare each Toad's `num_power_stars` against the average.

If the database re-ran the inner query for every row of the outer query, the overall cost would be O(n * n) = O(n^2), which could be pretty slow with large datasets. In practice, this subquery does not refer to the outer row (it is *uncorrelated*), so most database engines, SQLite included, compute it once and reuse the result, which brings the cost down to roughly O(n). After Birdo and Yoshi discuss it a while, they discover they can make this "compute it once" step explicit with **common table expressions**, which we will learn about later.
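
For readers who want to experiment, here is a small sketch using Python's `sqlite3` module and three invented Toads. It runs Birdo's query as written, then an equivalent two-step version that computes the average once and passes it in as a parameter. Both return the same rows; the two-step version simply makes the "compute the average a single time" idea visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up Toads; the average number of Power Stars is 7
conn.executescript("""
CREATE TABLE Toads (id INTEGER PRIMARY KEY, name TEXT, num_power_stars INTEGER);
INSERT INTO Toads VALUES (1, 'Toad', 10), (2, 'Toadette', 4), (3, 'Toadsworth', 7);
""")

# Version 1: the subquery from the example above
with_subquery = [row[0] for row in conn.execute("""
SELECT name
FROM Toads
WHERE num_power_stars > (SELECT AVG(num_power_stars) FROM Toads)
ORDER BY name
""")]

# Version 2: compute the average once, then reuse it for every comparison
average = conn.execute("SELECT AVG(num_power_stars) FROM Toads").fetchone()[0]
two_step = [row[0] for row in conn.execute(
    "SELECT name FROM Toads WHERE num_power_stars > ? ORDER BY name", (average,)
)]

print(average, with_subquery, two_step)
```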

### Example 5: Subquery with Cubic Time - O(n^3)

Yoshi has a query that retrieves the names of all Toads who have collected more Power Stars than the average number of Power Stars collected by their friends:

```sql
SELECT name
FROM Toads t1
WHERE num_power_stars > (
  SELECT AVG(num_power_stars)
  FROM Toads t2
  WHERE t2.id IN (
    SELECT friend_id
    FROM Friends
    WHERE toad_id = t1.id
  )
);
```

In this query, we have a table called `Friends` that stores the friendship relationships between Toads. The subquery first finds the friends of each Toad by querying the `Friends` table. Then, for each Toad, it calculates the average number of Power Stars collected by their friends using another subquery. Finally, the outer query compares each Toad's `num_power_stars` against the average calculated for their friends.

The innermost subquery (`SELECT friend_id FROM Friends WHERE toad_id = t1.id`) has a Big O notation of O(n), where n is the number of friendship records in the `Friends` table. This subquery is executed for each Toad in the outer query.

The middle subquery (`SELECT AVG(num_power_stars) FROM Toads t2 WHERE t2.id IN (...)`), which calculates the average number of Power Stars for each Toad's friends, has a Big O notation of O(m), where m is the average number of friends per Toad. This subquery is executed for each Toad in the outer query.

The outer query (`SELECT name FROM Toads t1 WHERE num_power_stars > (...)`), which compares each Toad's `num_power_stars` against the average calculated for their friends, has a Big O notation of O(n), where n is the number of Toads in the `Toads` table.

Combining these three factors, the overall Big O notation of this query is O(n * m * n), which simplifies to O(n^3), assuming that the average number of friends per Toad (m) is proportional to the total number of Toads (n).

This query demonstrates a scenario where the use of nested subqueries can lead to a cubic time complexity, which can be very inefficient for large datasets. In such cases, it's essential to consider alternative approaches, such as optimizing the database schema, using joins instead of subqueries, or breaking down the query into smaller, more efficient parts.
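
As one possible version of the "joins instead of subqueries" idea, the sketch below rewrites Yoshi's query using two joins, `GROUP BY`, and `HAVING`. It uses Python's `sqlite3` module with invented `Toads` and `Friends` rows, and it assumes the `Friends` table contains no duplicate pairs (duplicates would be counted twice by the join but only once by `IN`). Whether the join version is actually faster depends on the database engine and on which indexes exist, so treat this as a starting point for measurement rather than a guaranteed speed-up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up data: Captain Toad has no friends listed
conn.executescript("""
CREATE TABLE Toads (id INTEGER PRIMARY KEY, name TEXT, num_power_stars INTEGER);
CREATE TABLE Friends (toad_id INTEGER, friend_id INTEGER);
INSERT INTO Toads VALUES
  (1, 'Toad', 10), (2, 'Toadette', 4), (3, 'Toadsworth', 8), (4, 'Captain Toad', 1);
INSERT INTO Friends VALUES (1, 2), (1, 3), (2, 1), (3, 1), (3, 2);
""")

# Original form: nested, correlated subqueries
nested_sql = """
SELECT name
FROM Toads t1
WHERE num_power_stars > (
  SELECT AVG(num_power_stars)
  FROM Toads t2
  WHERE t2.id IN (SELECT friend_id FROM Friends WHERE toad_id = t1.id)
)
ORDER BY name
"""

# Rewritten form: join each Toad to its friends, then aggregate per Toad
joined_sql = """
SELECT t1.name
FROM Toads t1
  JOIN Friends f ON f.toad_id = t1.id
  JOIN Toads t2 ON t2.id = f.friend_id
GROUP BY t1.id, t1.name, t1.num_power_stars
HAVING t1.num_power_stars > AVG(t2.num_power_stars)
ORDER BY t1.name
"""

nested_names = [row[0] for row in conn.execute(nested_sql)]
joined_names = [row[0] for row in conn.execute(joined_sql)]
print(nested_names, joined_names)
```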

Birdo and Yoshi are surprised by the complexity of this query and the potential performance impact it can have. They realize that understanding Big O notation is crucial for writing efficient SQL queries, especially when dealing with large datasets and complex relationships between tables.

(Brendan's Note: Figuring out how to query huge, complex social network data, of the sort in the last example, efficiently is a major area of research. In the real world, database systems like SQLite try to "optimize" queries in various ways to make them more efficient, and provide various utilities for measuring their real-world run times. Later in the book, we'll find out more about this.)


## JSON and SQL
In chapter 1, we discussed different logical "data models", or ways of organizing information. The two most prominent data models in use today are:

1. The **relational model** that organizes data into related "tables". SQL is designed for databases organized on this model. SQLite follows this model, as do most other leading databases (MySQL, Oracle, SQL Server, Postgres).
2. The **document model**, which stores data in **key-value** structures such as **JSON**. Databases built on this model are often called **document databases**. MongoDB is the most widely used database of this type.

As it turns out, most modern relational databases have added the capacity to deal natively with JSON data. (And so, they are technically "hybrid" databases.) We'll briefly take a look at how this works, using the `address` column of the `Customers` table in our Mario Bros plumbing database.

### What is JSON?

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is commonly used to transmit data between a server and a web application, as an alternative to XML.

JSON represents data in key-value pairs and supports various data types such as strings, numbers, booleans, arrays, and nested objects. Here's an example of a JSON object representing an address:

```javascript
// An example of a JSON object
{
  "street": "Mushroom Castle",
  "city": "Toad Town",
  "zip_code": "12345",
  "country": "Mushroom Kingdom"
}
```

In this example, the JSON object contains key-value pairs where the keys are "street", "city", "zip_code", and "country", and their corresponding values are "Mushroom Castle", "Toad Town", "12345", and "Mushroom Kingdom".
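
To make the key-value idea concrete, here is a minimal sketch that parses the same address with Python's standard `json` module (Python is the language this notebook runs on). The variable names are only for illustration. Strict JSON does not allow comments, so the `//` line from the example is left out of the text being parsed.

```python
import json

# The address from the example above, written as JSON text.
address_text = """
{
  "street": "Mushroom Castle",
  "city": "Toad Town",
  "zip_code": "12345",
  "country": "Mushroom Kingdom"
}
"""

# json.loads parses JSON text into a Python dict.
address = json.loads(address_text)
print(address["city"])         # look up a value by its key
print(sorted(address.keys()))  # list the keys

# json.dumps goes the other way: Python object -> JSON text.
round_trip = json.dumps(address)
```

The point to notice is that each value is reached through its key, which is the same idea the path expressions in the SQL queries rely on.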

### Storing JSON in SQLite

SQLite provides support for storing JSON data directly in database columns. SQLite stores JSON as ordinary text rather than in a dedicated column type, and its JSON functions parse that text when a query runs. In our database, the `Customers` table has a column named `address` (declared as `JSON` in the schema above) that stores each customer's address as JSON text. Let's see what this looks like:

In [18]:
%%sql
SELECT address FROM Customers LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


address
"{""street"": ""Mushroom Castle"", ""city"": ""Toad Town""}"
"{""street"": ""24 Egg Island"", ""city"": ""Dinosaur Land"", ""apartment"": ""A""}"
"{""street"": ""10 Sarasaland Way"", ""city"": ""Chai Kingdom""}"
"{""street"": ""15 Mushroom St"", ""city"": ""Toad Town"", ""apartment"": ""2B""}"
"{""street"": ""1 Bowser Castle"", ""city"": ""Dark Land""}"
"{""street"": ""100 Gold Coin Blvd"", ""city"": ""Diamond City""}"
"{""street"": ""101 Silver Coin Ave"", ""city"": ""Diamond City"", ""unit"": ""5C""}"
"{""street"": ""50 Banana Jungle"", ""city"": ""DK Island"", ""house"": ""Treehouse""}"
"{""street"": ""51 Banana Jungle"", ""city"": ""DK Island"", ""floor"": ""Ground""}"
"{""street"": ""Comet Observatory"", ""city"": ""Space""}"

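As a rough check of how this storage works, the sketch below builds a throwaway in-memory database with Python's `sqlite3` module. The table `DemoCustomers` is hypothetical (it is not part of the Mario Bros database), and the example assumes your SQLite build includes the JSON functions, which is the default in recent versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical demo table; the column is declared JSON, as in the lesson's schema.
conn.execute(
    "CREATE TABLE DemoCustomers (customer_id INTEGER PRIMARY KEY, address JSON)"
)
conn.execute(
    "INSERT INTO DemoCustomers (address) VALUES (?)",
    ('{"street": "Mushroom Castle", "city": "Toad Town"}',),
)

# typeof() reports how the value is stored; json_valid() returns 1 for well-formed JSON.
storage_class, is_valid = conn.execute(
    "SELECT typeof(address), json_valid(address) FROM DemoCustomers"
).fetchone()
print(storage_class, is_valid)

# json_valid() returns 0 for text that is not JSON.
not_json = conn.execute("SELECT json_valid('not json')").fetchone()[0]
print(not_json)
```

Because the column holds plain text, SQLite will not stop you from inserting malformed JSON. A check such as `json_valid()` is one way to catch that.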

### Querying JSON Data in SQLite
To extract a specific value from a JSON column, you can use the `json_extract()` function in the SELECT statement. The `json_extract()` function takes the JSON column and a path expression as arguments. For example, to retrieve the street address for each customer:

In [19]:
%%sql
SELECT
  customer_id,
  -- address is the (JSON) column, and street is a "key"
  json_extract(address, '$.street') AS street
FROM Customers
LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


customer_id,street
1,Mushroom Castle
2,24 Egg Island
3,10 Sarasaland Way
4,15 Mushroom St
5,1 Bowser Castle
6,100 Gold Coin Blvd
7,101 Silver Coin Ave
8,50 Banana Jungle
9,51 Banana Jungle
10,Comet Observatory


This query will return the `customer_id` and the value of the "street" key from the JSON object stored in the `address` column. The `$` symbol represents the root of the JSON object, and `.street` specifies the path to the "street" key.
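
Path expressions can also reach into nested objects and arrays. The sketch below uses a made-up document (the `DemoDocs` table, the `phones` array, and the `nickname` key are all hypothetical and do not appear in our database) to show a few common path shapes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DemoDocs (body TEXT)")  # hypothetical table
conn.execute(
    "INSERT INTO DemoDocs VALUES (?)",
    ('{"name": "Peach", "address": {"city": "Toad Town"}, '
     '"phones": ["555-0100", "555-0199"]}',),
)

row = conn.execute(
    """
    SELECT
      json_extract(body, '$.name'),          -- top-level key
      json_extract(body, '$.address.city'),  -- key inside a nested object
      json_extract(body, '$.phones[1]'),     -- array element; indexes start at 0
      json_extract(body, '$.nickname')       -- missing key: SQL NULL
    FROM DemoDocs
    """
).fetchone()
print(row)
```

A path that does not exist gives NULL (shown as `None` in Python) rather than an error, which is why a misspelled key tends to produce empty results instead of a failed query.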

You can extract several values from a JSON column by calling `json_extract()` once for each key and giving each result its own alias. For example, to retrieve the street and city for each customer:

In [20]:
%%sql
SELECT
  customer_id,
  last_name,
  json_extract(address, '$.street') AS street,
  json_extract(address, '$.city') AS city
FROM Customers
LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


customer_id,last_name,street,city
1,Toadstool,Mushroom Castle,Toad Town
2,Dino,24 Egg Island,Dinosaur Land
3,Sarasa,10 Sarasaland Way,Chai Kingdom
4,Toadstool,15 Mushroom St,Toad Town
5,Koopa,1 Bowser Castle,Dark Land
6,Wario,100 Gold Coin Blvd,Diamond City
7,Wario,101 Silver Coin Ave,Diamond City
8,Kong,50 Banana Jungle,DK Island
9,Kong,51 Banana Jungle,DK Island
10,Cosmic,Comet Observatory,Space
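
A related detail is worth a quick sketch: `json_extract()` also accepts several paths in a single call, but in that case it returns one JSON array (as text) rather than separate columns. The example below compares the two forms on a single address copied from the output above. The variable names are only illustrative.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
address = '{"street": "24 Egg Island", "city": "Dinosaur Land", "apartment": "A"}'

# One call per key: each result is an ordinary column value.
street, city = conn.execute(
    "SELECT json_extract(?, '$.street'), json_extract(?, '$.city')",
    (address, address),
).fetchone()

# One call with two paths: the result is a single JSON array, returned as text.
combined = conn.execute(
    "SELECT json_extract(?, '$.street', '$.city')", (address,)
).fetchone()[0]

print(street, city)
print(json.loads(combined))  # parse the array text back into a Python list
```

For a report with one column per value, the one-call-per-key form used in the query above is usually what you want.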


You can use JSON values in the WHERE clause to filter rows based on specific conditions. The `json_extract()` function can be used to extract the desired JSON value for comparison. For example, to find customers who live in the city "Toad Town":

In [21]:
%%sql
SELECT
  customer_id,
  first_name,
  last_name,
  json_extract(address, '$.city') AS city
FROM Customers
WHERE json_extract(address, '$.city') = 'Toad Town'
LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


customer_id,first_name,last_name,city
1,Peach,Toadstool,Toad Town
4,Toadette,Toadstool,Toad Town
23,Toad,Toadstool,Toad Town
28,Toadsworth,Toadstool,Toad Town
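
If the city to search for comes from a program rather than being typed into the query, the same filter can be written with a `?` placeholder. This is a minimal sketch using Python's `sqlite3` module. The `DemoCustomers` table and the `customers_in_city` helper are hypothetical names used only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DemoCustomers "
    "(customer_id INTEGER PRIMARY KEY, first_name TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO DemoCustomers (first_name, address) VALUES (?, ?)",
    [
        ("Peach", '{"street": "Mushroom Castle", "city": "Toad Town"}'),
        ("Yoshi", '{"street": "24 Egg Island", "city": "Dinosaur Land"}'),
        ("Toadette", '{"street": "15 Mushroom St", "city": "Toad Town"}'),
    ],
)

def customers_in_city(city):
    """Return first names of customers whose address JSON has the given city."""
    rows = conn.execute(
        "SELECT first_name FROM DemoCustomers "
        "WHERE json_extract(address, '$.city') = ? "
        "ORDER BY customer_id",
        (city,),  # the value is bound safely instead of pasted into the SQL text
    ).fetchall()
    return [name for (name,) in rows]

print(customers_in_city("Toad Town"))
```

Binding the value as a parameter keeps the SQL text fixed and avoids quoting problems when a city name contains an apostrophe.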


To check if a specific key exists in a JSON column, you can use the `json_type()` function in the WHERE clause. The `json_type()` function returns the data type of the value at the specified path. If the key exists, it will return the data type; otherwise, it will return NULL. For example, to find customers who have an "apartment" key in their address:

In [22]:
%%sql
SELECT
  customer_id,
  first_name,
  last_name,
  json_extract(address, '$.apartment') AS apartment
FROM Customers
WHERE json_type(address, '$.apartment') IS NOT NULL
LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


customer_id,first_name,last_name,apartment
2,Yoshi,Dino,A
4,Toadette,Toadstool,2B
27,Lakitu,Cloud,Skyview
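
Why use `json_type()` here instead of just testing `json_extract()` for NULL? The sketch below, on made-up rows in a hypothetical `DemoCustomers` table, shows the one case where they differ: a key that is present but holds a JSON `null`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DemoCustomers (first_name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO DemoCustomers VALUES (?, ?)",
    [
        ("Yoshi", '{"city": "Dinosaur Land", "apartment": "A"}'),  # key present
        ("Bowser", '{"city": "Dark Land"}'),                       # key missing
        ("Rosalina", '{"city": "Space", "apartment": null}'),      # key present, JSON null
    ],
)

rows = conn.execute(
    """
    SELECT first_name,
           json_type(address, '$.apartment'),
           json_extract(address, '$.apartment')
    FROM DemoCustomers
    ORDER BY rowid
    """
).fetchall()
for row in rows:
    print(row)
```

For the third row, `json_type()` reports the string `'null'` (the key exists), while `json_extract()` gives SQL NULL, just as it does for the missing key in the second row. Only `json_type()` can tell those two situations apart.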


We can also combine JSON with what we've been learning about GROUP BY in this chapter. For example, here is a query that gives a count of how many residents live in each city.

In [23]:
%%sql
-- Count of residents in each city
SELECT
  json_extract(address, '$.city') AS city,
  COUNT(*) AS resident_count
FROM Customers
GROUP BY city
LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


city,resident_count
Acorn Plains,1
Cap Kingdom,1
Chai Kingdom,1
DK Island,4
Dark Land,4
Diamond City,2
Dinosaur Land,1
Eagleland,1
Hyrule Kingdom,2
Kanto Region,1
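
Here is a small, self-contained sketch of the same grouping on a handful of made-up rows, so the counts are easy to verify by eye. The `DemoCustomers` table is hypothetical, and an `ORDER BY` is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DemoCustomers (first_name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO DemoCustomers VALUES (?, ?)",
    [
        ("Peach", '{"city": "Toad Town"}'),
        ("Toadette", '{"city": "Toad Town"}'),
        ("Wario", '{"city": "Diamond City"}'),
        ("Donkey", '{"city": "DK Island"}'),
        ("Diddy", '{"city": "DK Island"}'),
    ],
)

# The extracted city is computed for every row, then rows with equal values are grouped.
rows = conn.execute(
    "SELECT json_extract(address, '$.city') AS city, COUNT(*) AS resident_count "
    "FROM DemoCustomers "
    "GROUP BY city "
    "ORDER BY city"
).fetchall()
for row in rows:
    print(row)
```

Note that SQLite's default text ordering compares characters by their byte values, so "DK Island" sorts before "Diamond City" (an uppercase `K` comes before a lowercase `i`), just as "DK Island" precedes "Dark Land" in the output above.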


Or, a bit more ambitiously, we could find out some summary statistics about the orders in each city:

In [25]:
%%sql
SELECT
  json_extract(c.address, '$.city') AS city,
  COUNT(*) AS order_count,
  MAX(total_amount) AS max_order_amount,
  MIN(total_amount) AS min_order_amount,
  AVG(total_amount) AS avg_order_amount
FROM
  Customers c
  JOIN Orders o ON c.customer_id = o.customer_id
GROUP BY city
LIMIT 10;

 * sqlite:///mario_bros_plumbing.db
Done.


city,order_count,max_order_amount,min_order_amount,avg_order_amount
Acorn Plains,3,1575,225,800.0
Cap Kingdom,1,570,570,570.0
Chai Kingdom,67,1485,75,689.7014925373135
DK Island,4,1065,75,553.75
Dark Land,31,1700,75,677.258064516129
Diamond City,34,1620,50,698.8235294117648
Dinosaur Land,57,2210,50,625.1754385964912
Hyrule Kingdom,4,1025,150,640.0
Kanto Region,3,850,300,483.3333333333333
Lycia,1,775,775,775.0
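
To see how the join and the aggregates interact, here is a minimal sketch on toy data. The `DemoCustomers` and `DemoOrders` tables and all amounts are invented for illustration. One customer deliberately has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DemoCustomers (customer_id INTEGER PRIMARY KEY, address TEXT);
    CREATE TABLE DemoOrders (
      order_id INTEGER PRIMARY KEY,
      customer_id INTEGER,
      total_amount DECIMAL(10,2)
    );
    INSERT INTO DemoCustomers VALUES
      (1, '{"city": "Toad Town"}'),
      (2, '{"city": "Toad Town"}'),
      (3, '{"city": "Dark Land"}'),
      (4, '{"city": "Space"}');           -- this customer has no orders
    INSERT INTO DemoOrders (customer_id, total_amount) VALUES
      (1, 100), (1, 300), (2, 200), (3, 75);
    """
)

rows = conn.execute(
    """
    SELECT
      json_extract(c.address, '$.city') AS city,
      COUNT(*) AS order_count,
      MAX(o.total_amount) AS max_order_amount,
      MIN(o.total_amount) AS min_order_amount,
      AVG(o.total_amount) AS avg_order_amount
    FROM DemoCustomers c
      JOIN DemoOrders o ON c.customer_id = o.customer_id
    GROUP BY city
    ORDER BY city
    """
).fetchall()
for row in rows:
    print(row)
```

Two things to notice: orders from both Toad Town customers are pooled into one group, and the city "Space" does not appear at all, because an inner join drops customers who have no matching orders.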


### Putting It All Together
Finally, let's put everything (well, not everything, but a bunch of things) we've learned into a single query.

In [35]:
%%sql
SELECT
  json_extract(c.address, '$.city') AS city,
  COUNT(*) AS order_count,
  PRINTF('$%.2f', MAX(total_amount)) AS max_order_amount,
  PRINTF('$%.2f', MIN(total_amount)) AS min_order_amount,
  PRINTF('$%.2f', AVG(total_amount)) AS avg_order_amount
FROM
  Customers c
  JOIN Orders o ON c.customer_id = o.customer_id
WHERE c.first_name NOT IN ('Wario', 'Bowser')
GROUP BY city
HAVING order_count > 10
ORDER BY order_count DESC;


 * sqlite:///mario_bros_plumbing.db
Done.


city,order_count,max_order_amount,min_order_amount,avg_order_amount
Toad Town,94,$1600.00,$75.00,$637.71
Chai Kingdom,67,$1485.00,$75.00,$689.70
Dinosaur Land,57,$2210.00,$50.00,$625.18


This query demonstrates various SQL concepts and techniques that we have learned in the course so far. Let's go through each part:

1.  `SELECT` clause:
    -   `json_extract(c.address, '$.city') AS city`: Extracts the value of the "city" key from the JSON object stored in the `address` column of the `Customers` table (aliased as `c`) and aliases it as `city`.
    -   `COUNT(*) AS order_count`: Counts the number of rows in each group and aliases the result as `order_count`.
    -   `PRINTF('$%.2f', MAX(total_amount)) AS max_order_amount`: Calculates the maximum value of `total_amount` for each group, formats it as a currency string using `PRINTF`, and aliases the result as `max_order_amount`.
    -   `PRINTF('$%.2f', MIN(total_amount)) AS min_order_amount`: Calculates the minimum value of `total_amount` for each group, formats it as a currency string using `PRINTF`, and aliases the result as `min_order_amount`.
    -   `PRINTF('$%.2f', AVG(total_amount)) AS avg_order_amount`: Calculates the average value of `total_amount` for each group, formats it as a currency string using `PRINTF`, and aliases the result as `avg_order_amount`.
2.  `FROM` clause:
    -   `Customers c`: Specifies the `Customers` table as the main table for the query and assigns it the alias `c`.
3.  `JOIN` clause:
    -   `JOIN Orders o ON c.customer_id = o.customer_id`: Performs an inner join between the `Customers` table (aliased as `c`) and the `Orders` table (aliased as `o`) based on the matching `customer_id` column. This join retrieves rows from both tables where the `customer_id` values match.
4.  `WHERE` clause:
    -   `WHERE c.first_name NOT IN (...)`: Keeps only the joined rows whose customer `first_name` is neither Wario nor Bowser. Orders placed by customers with those first names are excluded before any grouping happens.
5.  `GROUP BY` clause:
    -   `GROUP BY city`: Groups the rows based on the `city` column extracted from the JSON address object. This clause ensures that the aggregate functions (`COUNT`, `MAX`, `MIN`, `AVG`) are applied to each group of rows with the same city value.
6.  `HAVING` clause:
    -   `HAVING order_count > 10`: Filters the grouped rows based on the condition that the `order_count` (calculated by `COUNT(*)`) is greater than 10. This clause removes cities with 10 or fewer orders from the result set.
7.  `ORDER BY` clause:
    -   `ORDER BY order_count DESC`: Sorts the result set in descending order based on the `order_count` column. Cities with the highest number of orders will appear first in the result set.

This query showcases the use of JSON extraction, joining tables, aggregate functions, grouping, filtering with `WHERE` and `HAVING` clauses, and ordering the results.
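
To watch each clause do its work, here is a scaled-down sketch of the same query on toy data. The `DemoCustomers` and `DemoOrders` tables, the names, and the amounts are all invented, and the `HAVING` threshold is lowered from 10 to 1 so that a few rows are enough to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DemoCustomers (
      customer_id INTEGER PRIMARY KEY, first_name TEXT, address TEXT
    );
    CREATE TABLE DemoOrders (customer_id INTEGER, total_amount DECIMAL(10,2));
    INSERT INTO DemoCustomers VALUES
      (1, 'Peach',    '{"city": "Toad Town"}'),
      (2, 'Toadette', '{"city": "Toad Town"}'),
      (3, 'Bowser',   '{"city": "Dark Land"}'),
      (4, 'Yoshi',    '{"city": "Dinosaur Land"}'),
      (5, 'Daisy',    '{"city": "Chai Kingdom"}');
    INSERT INTO DemoOrders VALUES
      (1, 100), (1, 250.5), (2, 75),   -- Toad Town: 3 orders
      (3, 500), (3, 600), (3, 700),    -- Dark Land: removed by WHERE
      (4, 50), (4, 150),               -- Dinosaur Land: 2 orders
      (5, 300);                        -- Chai Kingdom: removed by HAVING
    """
)

rows = conn.execute(
    """
    SELECT
      json_extract(c.address, '$.city') AS city,
      COUNT(*) AS order_count,
      PRINTF('$%.2f', MAX(o.total_amount)) AS max_order_amount,
      PRINTF('$%.2f', MIN(o.total_amount)) AS min_order_amount,
      PRINTF('$%.2f', AVG(o.total_amount)) AS avg_order_amount
    FROM DemoCustomers c
      JOIN DemoOrders o ON c.customer_id = o.customer_id
    WHERE c.first_name NOT IN ('Wario', 'Bowser')  -- filters rows before grouping
    GROUP BY city
    HAVING order_count > 1                         -- filters groups after grouping
    ORDER BY order_count DESC
    """
).fetchall()
for row in rows:
    print(row)
```

Dark Land has the most orders in the toy data, yet it never reaches the grouping step because `WHERE` removes Bowser's rows first. Chai Kingdom survives `WHERE` but forms a group of only one order, so `HAVING` removes it afterward.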