# **SQL 102 - Intermediate Queries and Combining Data**
Author: Martin Arroyo

### **About this notebook**

All of your queries will be written using preloaded databases that are available only in this notebook. Our "RDBMS" and SQL dialect is called `duckdb`, a new and popular Python library that provides the framework to make our queries possible. You can find [the documentation for `duckdb` here](https://duckdb.org/docs/sql/introduction) - you will want to keep the documentation handy.

`teachdb`, which provides the data that you will be working with, is a Python library written by The Freestack Initiative, a group of COOP alumni who want to empower the community to learn and improve their technical skills by providing materials and resources at low (or no) cost.

## **How to use this notebook**

First, we'll do a quick tutorial on how to use the notebook with these tools, then we'll dive into more SQL!

### **Step 1: Press the play button below to set up the database and notebook**

You will see a checkmark appear when the database is finished setting up.

In [62]:
%%capture
# @title Press Play { display-mode: "form" }

# This code is used to set up the notebook by installing the libraries we need, configuring extensions to
# make displays for our queries look nice, and connecting to our relational database so that you can write
# queries in code cells using the %%sql magic tag. 

# Install `teachdb` if it's not in the system already
!pip install --quiet --upgrade git+https://github.com/freestackinitiative/teachingdb.git
import pandas as pd
from teachdb.teachdb import connect_teachdb
# Set configurations for notebook
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated)
# Load data
con = connect_teachdb(databases=["core", "restaurant"])
%sql con

# Check out the Freestack Initiative @https://github.com/freestackinitiative

### **Step 2: Run a query**

To run SQL queries against the database, create a new code cell and write `%%sql` at the top. This tells the notebook that the cell is being used to query the database. Write your query underneath the `%%sql` line, then run it by pressing the play button of the cell or by selecting the cell and using `CTRL + Shift + Enter` on Windows (`CMD + Shift + Return` on Mac).

Go ahead and try it by executing the query in the cell below:

In [None]:
%%sql

SELECT *
FROM Customers
LIMIT 5

## **Single Table Query Review and `CASE` Statements**

In this section, we'll warm up with some review of the basics we learned in SQL 101. Then you'll learn a new query technique - `CASE` statements - which are a special kind of conditional statement that lets us create custom column values based on conditions we specify. 

### **Review - Single Table Queries and Aggregation**

Let's get warmed up by writing a query using what we learned in SQL 101!

Write a query that shows the top 5 customers in the `Reservations` table that have the most reservations. Additionally, show the average party size for each of those customers. 

`Expected Output:`

| CustomerID | total_reservations | avg_party_size |
|------------|--------------------|----------------|
| 6          | 34                | 3.882353     |
| 80         | 30                | 4.033333     |
| 31         | 27                | 3.259259     |
| 41         | 27                | 3.962963     |
| 44         | 27                | 3.777778     |

<br/>
<details>
<summary>Click here to reveal answer</summary>

```sql
SELECT CustomerID, COUNT(ReservationID) AS total_reservations, AVG(PartySize) AS avg_party_size
FROM Reservations
GROUP BY CustomerID
ORDER BY total_reservations DESC
LIMIT 5
```

</details>

In [30]:
%%sql

SELECT CustomerID, COUNT(ReservationID) AS total_reservations, AVG(PartySize) AS avg_party_size
FROM Reservations
GROUP BY CustomerID
ORDER BY total_reservations DESC
LIMIT 5

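|   | CustomerID | total_reservations | avg_party_size |
|---|------------|--------------------|----------------|
| 0 | 6          | 34                 | 3.882353       |
| 1 | 80         | 30                 | 4.033333       |
| 2 | 31         | 27                 | 3.259259       |
| 3 | 41         | 27                 | 3.962963       |
| 4 | 44         | 27                 | 3.777778       |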


Great work with that first query! Now that we're warmed up, it's time to take your querying skills to the next level!

### **`CASE` Statements**

A popular restaurant reviewer is using a pricing scale that they invented to rate how affordable dishes at restaurants are. Restaurants that have too many dishes considered "Pricey" typically have lower ratings, while those with more dishes in the "Average" range do best. We don't want our restaurant to have a low rating, so we have to find out how our menu does on this scale. 

Here is the reviewer's pricing scale:

>\$4 or less - `Inexpensive`
>
>Between \$4 and \$8 - `Average`
>
>Above \$8 - `Pricey`

How can we convert the price of a dish to one of these three values based on the price? This is a perfect use for `CASE` statements!


#### **How to use `CASE` statements**

`CASE` statements are very similar to the `IF` function in Excel. They allow you to specify "If/Then/Else" logic in your queries. 

Here is the general form of a `CASE` statement:

```sql
CASE
    WHEN {some condition to check} THEN {value if the condition is true}
    ELSE {value if none of the conditions are true}
END
```

If you need to check more than one condition, simply add another `WHEN`/`THEN` clause. Conditions are checked from top to bottom, and the first one that is true determines the value. The `ELSE` always comes last, since it covers the case where none of the other conditions are true. 

To see this in action, check out the query we use to show the pricing scale for our menu.

##### **`CASE` Statement Example**

Here is the query to check the pricing scale:
```sql
SELECT Name
    , Price
    , Type
    , CASE 
        WHEN Price <= 4.0 THEN 'Inexpensive' -- Check the first condition
        WHEN Price BETWEEN 4.0 AND 8.0 THEN 'Average' -- Check the second condition
        ELSE 'Pricey' -- Anything over 8.0 is Pricey
      END AS PriceRating -- We end our CASE statement and give the resulting column a name using an alias
FROM Dishes
```

Here are the first five results:

| Name                         | Price | Type      | PriceRating |
|------------------------------|-------|-----------|-------------|
| Parmesan Deviled Eggs        | 8.00  | Appetizer | Average     |
| Artichokes with Garlic Aioli | 9.00  | Appetizer | Pricey      |
| French Onion Soup            | 7.00  | Main      | Average     |
| Mini Cheeseburgers           | 8.00  | Main      | Average     |
| Panko Stuffed Mushrooms      | 7.00  | Appetizer | Average     |

##### Breakdown - `CASE` Statement query

Our query looks pretty simple aside from the `CASE` statement that we added. We'll focus on breaking that down line-by-line:

```sql
    CASE WHEN Price <= 4.0 THEN 'Inexpensive' 
```

Every `CASE` statement begins with the word `CASE`. After that, we check conditions using the `WHEN`/`THEN` keywords. The `WHEN` looks at the condition that you specify and the `THEN` defines what happens when that condition is true. In this case, we're checking if the price is \$4 or less; if it is, we're telling SQL that the value we want is "Inexpensive".

```sql
    WHEN Price BETWEEN 4.0 AND 8.0 THEN 'Average'
```

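Since we need to check more than one condition to apply the price scale, we add another `WHEN`/`THEN` pair to check the next condition. This one says, <em>"Use the word `Average` here if the `Price` is between \$4 and \$8."</em> `BETWEEN` includes both endpoints, but a dish that costs exactly \$4 never reaches this line because the first condition has already matched it.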

```sql
    ELSE 'Pricey'
```

The `ELSE` portion of the `CASE` statement determines what to do if all of the conditions before it are false. This is saying, <em>"If the price of a dish is not `Inexpensive` or `Average`, then it is `Pricey`."</em>

```sql
    END AS PriceRating
```

We use the `END` keyword to close all `CASE` statements. This says, <em>"We are done with our statement."</em> Since we are creating a column that doesn't otherwise exist in the database, the RDBMS will give it a default name unless we provide one. It is best practice to name the result of a `CASE` statement using an alias. Here, we name the resulting column `PriceRating`.
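To see how the order of the `WHEN` clauses plays out at the boundaries of the scale, here is a minimal, self-contained sketch. It uses Python's built-in `sqlite3` module as a stand-in for the notebook's `duckdb` connection, and the dish names and prices are made up for illustration rather than taken from the real `Dishes` table.

```python
import sqlite3

# Hypothetical in-memory table; these dishes and prices are invented for the example.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Dishes (Name TEXT, Price REAL)")
con.executemany(
    "INSERT INTO Dishes VALUES (?, ?)",
    [("Tea", 4.0), ("Soup", 7.0), ("Sliders", 8.0), ("Steak", 9.0)],
)

query = """
SELECT Name
    , Price
    , CASE
        WHEN Price <= 4.0 THEN 'Inexpensive'            -- checked first
        WHEN Price BETWEEN 4.0 AND 8.0 THEN 'Average'   -- only reached if the first check failed
        ELSE 'Pricey'                                   -- nothing above matched
      END AS PriceRating
FROM Dishes
ORDER BY Price
"""

ratings = con.execute(query).fetchall()
for row in ratings:
    print(row)
```

The \$4 dish comes back as `Inexpensive` even though it also satisfies the `BETWEEN` condition, because the first matching `WHEN` wins. The \$8 dish is `Average` because `BETWEEN` includes its upper endpoint.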

##### **Action Item - Using the `CASE` Statement**

The restaurant reviewer recently published an update to their price scale! The new scale is below:

> **Restaurant Reviewer NEW Pricing Scale:**
>
>\$3 or less - `Super Cheap`
>
>Between \$4 and \$5 - `Inexpensive`
>
>Between \$5 and \$8 - `Average`
>
>Above \$8 - `Pricey`

Write a query using the `Dishes` table that shows the updated price scale. Your results should include the name of the dish, the price, and its type, along with the new `PriceRating`. Order your results by `Price` so that we see the lowest prices first and limit your results to just the first five rows. 

`Expected Output:`

| Name                   | Price | Type     | PriceRating |
|------------------------|-------|----------|-------------|
| Pomegranate Iced Tea   | 4.00  | Beverage | Inexpensive |
| Apple Pie              | 5.00  | Dessert  | Inexpensive |
| Chocolate Chip Brownie | 6.00  | Dessert  | Average     |
| Tropical Blue Smoothie | 6.00  | Beverage | Average     |
| Cafe Latte             | 6.00  | Beverage | Average     |

<br/>
<details>
    <summary>Click here to reveal the answer</summary>

```sql
SELECT Name
    , Price
    , Type
    , CASE 
        WHEN Price <= 3.0 THEN 'Super Cheap'
        WHEN Price BETWEEN 4.0 AND 5.0 THEN 'Inexpensive' 
        WHEN Price BETWEEN 5.0 AND 8.0 THEN 'Average'
        ELSE 'Pricey' -- Anything over 8.0 is Pricey
    END AS PriceRating
FROM Dishes
ORDER BY Price
LIMIT 5
```

</details>

In [29]:
%%sql

SELECT Name
    , Price
    , Type
    , CASE 
        WHEN Price <= 3.0 THEN 'Super Cheap'
        WHEN Price BETWEEN 4.0 AND 5.0 THEN 'Inexpensive' 
        WHEN Price BETWEEN 5.0 AND 8.0 THEN 'Average'
        ELSE 'Pricey' -- Anything over 8.0 is Pricey
      END AS PriceRating
FROM Dishes
ORDER BY Price
LIMIT 5

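|   | Name                   | Price | Type     | PriceRating |
|---|------------------------|-------|----------|-------------|
| 0 | Pomegranate Iced Tea   | 4.0   | Beverage | Inexpensive |
| 1 | Apple Pie              | 5.0   | Dessert  | Inexpensive |
| 2 | Chocolate Chip Brownie | 6.0   | Dessert  | Average     |
| 3 | Tropical Blue Smoothie | 6.0   | Beverage | Average     |
| 4 | Cafe Latte             | 6.0   | Beverage | Average     |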


In [65]:
%%sql

SELECT *
FROM NewDishes
WHERE Type IN ('Appetizer', 'Dessert')
UNION
SELECT *
FROM Dishes
WHERE Type IN ('Main', 'Beverage')

|   | DishID | Name | Description | Price | Type |
|---|--------|------|-------------|-------|------|
| 0 | 1 | Avocado Pesto Bruschetta | Sliced and toasted sourdough topped with a spread of creamy avocado pesto, cherry tomatoes, and basil. | 8.0 | Appetizer |
| 1 | 2 | Zucchini Fritters | Golden brown zucchini fritters, seasoned with garlic and herbs. Served with tzatziki sauce. | 9.0 | Appetizer |
| 2 | 5 | Stuffed Bell Peppers | Bell peppers stuffed with a mixture of quinoa, black beans, corn, and melted cheese. Topped with avocado slices. | 7.0 | Appetizer |
| 3 | 13 | Pumpkin Hummus Platter | A seasonal twist on hummus featuring pumpkin. Served with an assortment of vegetables and pita. | 9.99 | Appetizer |
| 4 | 14 | Berry Medley Parfait | Layers of fresh mixed berries, Greek yogurt, and granola. A light and healthy dessert option. | 9.0 | Dessert |
| 5 | 15 | Banana Nut Bread Pudding | Warm banana nut bread pudding topped with a scoop of vanilla ice cream and caramel sauce. | 9.0 | Dessert |
| 6 | 16 | Mini Fruit Tarts | A selection of mini tarts filled with lemon curd, chocolate ganache, and raspberry jam. | 6.0 | Dessert |
| 7 | 17 | Strawberry Rhubarb Pie | A perfect balance of sweet and tart, our strawberry rhubarb pie is a seasonal favorite. | 5.0 | Dessert |
| 8 | 3 | French Onion Soup | Caramelized onions slow cooked in a savory broth, topped with sourdough and a provolone cheese blend. Served with sourdough bread. | 7.0 | Main |
| 9 | 4 | Mini Cheeseburgers | These mini cheeseburgers are served on a fresh baked pretzel bun with lettuce, tomato, avocado, and your choice of cheese. | 8.0 | Main |


In [68]:
%%sql

SELECT C.State, COUNT(R.ReservationID) AS total_reservations -- We are counting the number of reservations by each state
FROM Reservations AS R -- The 'Reservations' table is on the "left-side" of the join. We give it an alias of 'R'
INNER JOIN Customers AS C -- The 'Customers' table is on the "right-side" of the join. We give it an alias of 'C'
ON R.CustomerID=C.CustomerID -- This is our constraint saying "Only give me the rows where the customer id in 'Reservations' matches the id in 'Customers'" 
GROUP BY C.State -- We're summarizing our results by what state our customers are from
ORDER BY total_reservations DESC -- Since we want to know which states have the most reservations, we order the counts from greatest to least
LIMIT 5 -- We only want to see the 5 states with the most reservations, so we limit our results to the first 5 rows

|   | State | total_reservations |
|---|-------|--------------------|
| 0 | CA    | 331                |
| 1 | TX    | 171                |
| 2 | VA    | 161                |
| 3 | FL    | 142                |
| 4 | NY    | 138                |


In [72]:
%%sql



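|   | ReservationID | CustomerID | Date                | PartySize |
|---|---------------|------------|---------------------|-----------|
| 0 | 1             | 74         | 2018-06-01 15:30:00 | 6         |
| 1 | 2             | 67         | 2018-06-02 13:30:00 | 2         |
| 2 | 3             | 16         | 2018-06-04 08:00:00 | 4         |
| 3 | 4             | 87         | 2018-06-04 19:30:00 | 5         |
| 4 | 5             | 29         | 2018-06-06 13:00:00 | 1         |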


In [71]:
%%sql

WITH C AS (
SELECT FirstName, LastName, State
FROM Customers
LIMIT 5
), R AS (
    SELECT *
    FROM Reservations
    LIMIT 5
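)
-- A WITH clause needs a final query; selecting from C matches the output shown below
SELECT *
FROM C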

|   | FirstName | LastName | State |
|---|-----------|----------|-------|
| 0 | Maggi     | Domney   | CA    |
| 1 | Javier    | Dawks    | CT    |
| 2 | Aleen     | Fasey    | FL    |
| 3 | Taylor    | Jenkins  | FL    |
| 4 | Imogen    | Kabsch   | SC    |


# Practice Section - Joins and Multiple Tables

As an example, we'll join the `company` and `foods` tables together. There is already a primary key/foreign key relationship between the two tables established by the `company_id` column. 

Both tables have the `company_id` column. The primary key in this relationship is the `company_id` column from the `company` table, since it ensures that each row of `company` is unique. And, as you probably guessed, the foreign key is the `company_id` column from the `foods` table, which establishes the relationship with the `company` table.


### Inner Joins

An `INNER JOIN` is the most common type of join you will use. Let's apply what we just learned to the `company` and `foods` tables. Join the two tables and show all of the columns between them.

As a reminder, here is a representation of how the data will be matched, record by record, between the two tables:

![Inner Join](assets/inner-join-company-foods.png)
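The record-by-record matching can also be sketched in code. The example below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module instead of the notebook's `duckdb` connection, and the column names `company_name` and `item_name` as well as every row are hypothetical, chosen to show one company with no matching food.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- company_id is the primary key of company and a foreign key in foods
CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE foods (
    food_id INTEGER PRIMARY KEY,
    item_name TEXT,
    company_id INTEGER REFERENCES company (company_id)
);
-- Hypothetical rows: company 19 has no entry in foods
INSERT INTO company VALUES (16, 'Acme Snacks'), (17, 'Foodies'), (19, 'sip-n-Bite');
INSERT INTO foods VALUES (1, 'Chex Mix', 16), (2, 'Cheez-It', 17), (3, 'BN Biscuit', 17);
""")

inner_rows = con.execute("""
SELECT *
FROM company
INNER JOIN foods
ON company.company_id = foods.company_id
ORDER BY foods.food_id
""").fetchall()

for row in inner_rows:
    print(row)
```

Each output row carries the columns of both tables side by side, and only companies that appear in `foods` are present. Company 17 shows up twice because it matches two foods.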


### **Now it's your turn!**

In [None]:
%%sql

SELECT C.State, COUNT(R.ReservationID) AS total_reservations, AVG(R.PartySize) AS avg_partysize
FROM Reservations AS R
INNER JOIN Customers AS C
ON R.CustomerID=C.CustomerID
GROUP BY C.State
ORDER BY total_reservations DESC
LIMIT 5

### Left Joins

A `LEFT JOIN` is the second most common join after `INNER JOIN`. The difference between the two is how the records are matched in the query result. While an `INNER JOIN` includes only the records that match on both sides of the join, a `LEFT JOIN` keeps all the records from the left-side of the join and only those that match from the right-side. When a left-side record has no match, it is not dropped as it would be with an `INNER JOIN`; instead, the columns that would have come from the right-side are set to `null`. Visually, the matched records will look like this:

![Left Join](assets/left-join-company-foods.png)

If you're new to `LEFT JOIN`, you're probably wondering what was meant before by "left-side of the join." Visually, we can see that there is a table on the left side that has all of the records included with only the records at the intersection included from the right. But how does that translate to an actual query?

```SQL

SELECT *
FROM company -- This is the table on the "left-side" of the join
LEFT JOIN foods -- This table is on the "right-side" of the join
ON company.company_id=foods.company_id
```

Simply put, the table on the "left" of a join type is the one used in the `FROM` clause and the table on the "right" is the one specified after the join type (`LEFT JOIN` in this case.) Columns and constraints otherwise work the same as the `INNER JOIN`. The syntax is virtually identical between the two joins, but it's important to know how they work and the caveats for all join types.

Run the `LEFT JOIN` query that we wrote above in the cell below and compare the results with the query we ran earlier using an `INNER JOIN`. Can you see the difference?

### **Now it's your turn!**

In [None]:
%%sql

### `INNER JOIN` vs `LEFT JOIN` - Caveats & Using them in Practice

#### Caveats

Looking at the result of the two queries, you should notice that the `LEFT JOIN` query gave you an extra record that was missing from the `INNER JOIN` query. The `sip-n-Bite` company is missing from our first query. Why?

Well, we know that we joined the two tables together based on matching `company_id`. We also know that `INNER JOIN` only keeps records from both tables that match. Since we know that the record exists in the `company` table, that must mean that there is no record for the `sip-n-Bite` company in the `foods` table. You can confirm this by looking at the `foods` table and querying for `company_id=19`, which will return no result since it doesn't exist.
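The reasoning above can be checked with a small sketch. It assumes hypothetical data (only the `sip-n-Bite` name and `company_id=19` come from the text; the other rows and the `company_name` and `item_name` columns are invented) and uses Python's built-in `sqlite3` module rather than the notebook's `duckdb` connection.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE foods (food_id INTEGER PRIMARY KEY, item_name TEXT, company_id INTEGER);
-- Hypothetical rows: company 19 has no entry in foods
INSERT INTO company VALUES (16, 'Acme Snacks'), (17, 'Foodies'), (19, 'sip-n-Bite');
INSERT INTO foods VALUES (1, 'Chex Mix', 16), (2, 'Cheez-It', 17), (3, 'BN Biscuit', 17);
""")

# LEFT JOIN keeps every company, filling the foods columns with NULL when nothing matches
left_rows = con.execute("""
SELECT company.company_id, company.company_name, foods.item_name
FROM company
LEFT JOIN foods
ON company.company_id = foods.company_id
ORDER BY company.company_id, foods.food_id
""").fetchall()

# Filtering on a NULL right-side key isolates the companies with no match at all
unmatched = con.execute("""
SELECT company.company_name
FROM company
LEFT JOIN foods
ON company.company_id = foods.company_id
WHERE foods.company_id IS NULL
""").fetchall()

print(left_rows)
print(unmatched)
```

The `WHERE foods.company_id IS NULL` filter is a handy way to list exactly the records an `INNER JOIN` would have dropped.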

#### When do I use one over the other?

Whether to use an `INNER JOIN` or a `LEFT JOIN` is something you must consider for your particular use case. Do you only want to consider the records that match between your tables? Then choose an `INNER JOIN`. Want to make sure that records are kept from the left side of the join? Then - you guessed it - use a `LEFT JOIN`.

#### Practical Usage

By and large, the majority of your joins in practice will either be an `INNER JOIN` or a `LEFT JOIN`. It is worth it to learn them well and become really comfortable with using them, as well as knowing when to use them. The other joins mentioned are not used as much in practice, but it's good to know about them - especially for technical interviews!


### Right Joins

As mentioned earlier, `RIGHT JOIN` is rarely used in practice. This is because you can do the same thing using just a `LEFT JOIN`, so there aren't many (if any) use cases where you would want to exclusively use it. However, it is a join type to be aware of and is commonly asked about in interviews, so let's cover it.

The opposite of the `LEFT JOIN`, `RIGHT JOIN` includes all the records from the "right-side" of the join and only records that match from the "left-side". As with `LEFT JOIN`, when a record on the kept side has no match, the columns from the other side are set to `null` and the record is still included in our query results. Visually, the resulting matches look like this:

![Right Join](assets/right-join-company-foods.png)



Here is the query breakdown:

```SQL

SELECT *
FROM company -- This is the table on the "left-side" of the join
RIGHT JOIN foods -- This table is on the "right-side" of the join
ON foods.company_id=company.company_id
```

Syntactically, it is almost identical to the other joins. Let's run a `RIGHT JOIN` query and see the results.


### **Now it's your turn!**

Write a `RIGHT JOIN` query with the `foods` table on the "right-side" of the join and the `company` table on the "left-side" of the join.

In [None]:
%%sql

From the results of the `RIGHT JOIN`, we can see that it indeed kept all of the records from the `foods` table (since it's on the right-side of the join) and gave null values in the records from the `company` table that didn't match.
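The earlier point that a `LEFT JOIN` can do the same job can be sketched by swapping the two tables. This example is an illustration under assumptions: it runs on Python's built-in `sqlite3` module (older SQLite builds have no `RIGHT JOIN`, which is why only the swapped `LEFT JOIN` form is executed), and all rows plus the `company_name` and `item_name` columns are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE foods (food_id INTEGER PRIMARY KEY, item_name TEXT, company_id INTEGER);
-- Hypothetical rows: one food has no company, and one company has no food
INSERT INTO company VALUES (16, 'Acme Snacks'), (19, 'sip-n-Bite');
INSERT INTO foods VALUES (1, 'Chex Mix', 16), (2, 'Mystery Jam', NULL);
""")

# Same rows as: FROM company RIGHT JOIN foods ON foods.company_id = company.company_id
swapped = con.execute("""
SELECT company.company_name, foods.item_name
FROM foods            -- foods moves to the left-side so all of its rows are kept
LEFT JOIN company
ON foods.company_id = company.company_id
ORDER BY foods.food_id
""").fetchall()

print(swapped)
```

Every food is kept, the food without a company gets a null `company_name`, and the company without any food does not appear.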

### Full Outer Joins

`FULL OUTER JOIN` is another join type that isn't used as often as left or inner joins in practice, but it is much more common than the `RIGHT JOIN`. We use `FULL OUTER JOIN` when we want to include all the records from both sides of the join, showing the records that match between the two and otherwise giving null values where there isn't a match between the tables. A `FULL OUTER JOIN` is like a combination of both the left and right join types.

Here is how the matching looks visually:

![Outer Join](assets/full-outer-join-company-foods.png)

The query syntax is pretty much identical to the others, aside from specifying the join type itself:

```SQL
SELECT *
FROM company
FULL OUTER JOIN foods
ON company.company_id=foods.company_id
```

### **Now it's your turn!**

Let's write a query using the `FULL OUTER JOIN` with `company` on the left-side of the join and `foods` on the right-side:

In [None]:
%%sql


As you can see, the `FULL OUTER JOIN` gave us a result that is essentially a combination of the results from the right and left joins. 
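One way to make the "combination of left and right" idea concrete is to build the same result by hand. The sketch below does not use `FULL OUTER JOIN` itself; it swaps in an emulation (a `LEFT JOIN` plus the unmatched right-side rows, glued together with `UNION ALL`) so that it also runs on SQLite builds without full outer join support. It uses Python's built-in `sqlite3` module, and the data and the `company_name` and `item_name` columns are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE foods (food_id INTEGER PRIMARY KEY, item_name TEXT, company_id INTEGER);
-- Hypothetical rows: one food has no company, and one company has no food
INSERT INTO company VALUES (16, 'Acme Snacks'), (19, 'sip-n-Bite');
INSERT INTO foods VALUES (1, 'Chex Mix', 16), (2, 'Mystery Jam', NULL);
""")

full_rows = con.execute("""
-- Part 1: every company, with its foods where they exist (the LEFT JOIN result)
SELECT company.company_name, foods.item_name
FROM company
LEFT JOIN foods
ON company.company_id = foods.company_id
UNION ALL
-- Part 2: only the foods that matched no company (what a RIGHT JOIN would add)
SELECT company.company_name, foods.item_name
FROM foods
LEFT JOIN company
ON foods.company_id = company.company_id
WHERE company.company_id IS NULL
""").fetchall()

for row in full_rows:
    print(row)
```

The result has three rows: the matched pair, the company with no food, and the food with no company. In the notebook itself the plain `FULL OUTER JOIN` syntax shown earlier is the simpler choice.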

#### Summary of Joins
To sum things up, we use joins to combine data from different sources to add context to our analysis. When we join two tables, the records are matched based on a constraint that we specify, with the most common being that the value on one side of the join is equal to the value on the other. Joins also have types, which affect what rows are returned from a query. The four primary types are `INNER`, `LEFT`, `RIGHT`, and `FULL OUTER` joins. Of these four, the two most common are `INNER` and `LEFT`. 

# Aggregates - Summarizing Data with SQL

### Clauses to know:

- `GROUP BY` - Allows you to aggregate data by a single value or group of values.
- `HAVING` - Allows you to filter your query using the value of an aggregate function. Think of this as a `WHERE` clause for aggregate functions.
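As a rough sketch of how these two clauses work together, the example below groups a tiny, made-up `Reservations` table by customer and then keeps only the groups with at least two reservations. It uses Python's built-in `sqlite3` module as a stand-in for the notebook's `duckdb` connection, and the rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Reservations ("
    "ReservationID INTEGER PRIMARY KEY, CustomerID INTEGER, PartySize INTEGER)"
)
# Hypothetical rows: customer 6 has three reservations, 80 has two, 31 has one
con.executemany(
    "INSERT INTO Reservations VALUES (?, ?, ?)",
    [(1, 6, 4), (2, 6, 2), (3, 6, 3), (4, 80, 5), (5, 80, 3), (6, 31, 2)],
)

frequent = con.execute("""
SELECT CustomerID, COUNT(ReservationID) AS total_reservations
FROM Reservations
GROUP BY CustomerID                  -- one output row per customer
HAVING COUNT(ReservationID) >= 2     -- filter on the aggregated value
ORDER BY total_reservations DESC
""").fetchall()

print(frequent)
```

A `WHERE` clause could not express this filter, because `WHERE` runs before the rows are grouped and the count does not exist yet at that point.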

### Common aggregate functions:

- `COUNT(column)`: Counts how many non-null values are in a particular column (or how many rows are in the table if you use '*' - e.g. `COUNT(*)`).
- `MIN(column)`: Gives you the smallest value found for the given column.
- `MAX(column)`: Gives you the largest value found for the given column.
- `AVG(column)`: Gives you the average for all values in the given column.
- `SUM(column)`: Gives you the sum of all the values in the given column.
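Here is a small sketch that runs all five functions in one query. It assumes a hypothetical `Reservations` table in which one `PartySize` is missing, to show how null values are treated, and it uses Python's built-in `sqlite3` module rather than the notebook's `duckdb` connection.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Reservations (ReservationID INTEGER PRIMARY KEY, PartySize INTEGER)"
)
# Hypothetical rows; reservation 3 has no recorded party size
con.executemany(
    "INSERT INTO Reservations VALUES (?, ?)",
    [(1, 2), (2, 4), (3, None), (4, 6)],
)

summary = con.execute("""
SELECT COUNT(*)         AS total_rows      -- counts every row
    , COUNT(PartySize)  AS known_sizes     -- skips the null
    , MIN(PartySize)    AS smallest
    , MAX(PartySize)    AS largest
    , AVG(PartySize)    AS average         -- averages only the non-null values
    , SUM(PartySize)    AS total_guests
FROM Reservations
""").fetchone()

print(summary)
```

`COUNT(*)` reports four rows while `COUNT(PartySize)` reports three, and the average is taken over the three known values rather than all four rows.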

[Click here to see code examples of aggregate functions in SQL](https://martinmarroyo.github.io/sqlcheatsheetandresources-coop/#aggregates)


### **Now it's your turn!**

Using the `employees` table, find the longest time that an employee has been at the studio:

In [None]:
%%sql

## More resources for further practice

- [SQL Bolt](https://sqlbolt.com/): The lessons here are a great introduction to SQL and you know the platform already!
- [Mode](https://mode.com/sql-tutorial/): A comprehensive SQL tutorial from beginner all the way to advanced SQL. There's even a data analytics with SQL tutorial. This is a great resource to learn about SQL in depth and practice what you learn in their online database.
- [StrataScratch](https://platform.stratascratch.com/coding): Practice coding questions geared toward data analysts and data scientists. You can solve coding problems used by real companies for technical interviews using PostgreSQL, Python, R, or MySQL. It's free to sign up!
- [Codecademy - Free Learn SQL Course](https://www.codecademy.com/learn/learn-sql): Codecademy is another great resource to learn SQL as well as most other languages. There are a lot of free resources here that can help you learn SQL, Python, R, and many other languages.
- [Socratica SQL (YouTube)](https://www.youtube.com/watch?v=nWyyDHhTxYU&list=PLih4ch-U2DiBbMoFK4ML9faT3k3MM2UQY): This is a great playlist that will get you started learning SQL with one of the most popular relational databases - Postgres.
- [DB Fiddle](https://dbfiddle.uk/): This site is like a SQL scratch pad. You can use it to practice doing stuff like creating tables and inserting data into them, and all sorts of other stuff that you might not be able to do so freely in a live database. It's a sandbox, basically. Here are a couple of links to fiddles with some data in them to play with: [fiddle 1](https://dbfiddle.uk/?rdbms=postgres_13&fiddle=366b683701596d3f7459b0411c15acd1) and [fiddle 2](https://dbfiddle.uk/?rdbms=postgres_13&fiddle=dfffc1939f629d9286c55d732fb656c5).


And don't forget to keep your [SQL Cheatsheet](https://martinmarroyo.github.io/sqlcheatsheetandresources-coop/) handy!