# SQL 102 - The (unofficial) Workbook
Author: Martin Arroyo

## How to use this notebook

**Step 1: Run the cell below to set up the database and notebook**

Don't worry about the code below. It is just setting up the notebook for this lesson:

In [1]:
# Install `teachdb` if it's not in the system already
from importlib.util import find_spec
if not find_spec('teachdb'):
    print("Installing `teachdb` and its dependencies...")
    !pip install --upgrade --quiet git+https://github.com/freestackinitiative/teachingdb.git
    print("Successfully installed `teachdb`")

import duckdb
from teachdb.teachdb import connect_teachdb

# Set configurations for notebook
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# Load data
con = duckdb.connect(":memory:")
connect_teachdb(con)

%sql con

Connected to `teachdb` from the Freestack Initiative


**Step 2: Run a query**

To run SQL queries against the database, make sure that the cell you are writing in has `%%sql` written at the top. You can write your queries underneath that and run the cell to execute them. 

Go ahead and try it with the cell below:

In [2]:
%%sql

SELECT *
FROM foods

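| | item_id | item_name | item_unit | company_id |
|---|---|---|---|---|
| 0 | 1 | Chex Mix | Pcs | 16.0 |
| 1 | 6 | Cheez-It | Pcs | 15.0 |
| 2 | 2 | BN Biscuit | Pcs | 15.0 |
| 3 | 3 | Mighty Munch | Pcs | 17.0 |
| 4 | 4 | Pot Rice | Pcs | 15.0 |
| 5 | 5 | Jaffa Cakes | Pcs | 18.0 |
| 6 | 7 | Salt n Shake | Pcs | |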


# Theory Section - Joins and multiple tables

If you feel confident in your understanding of joins, or you want to try some practical exercises first before diving into theory, skip ahead to the practice section below this. Otherwise, read on to learn about what joins are, why we use them, and how to include them in your queries.

### What are joins and why do we use them?

When we join data, we are merging two or more related tables into a single view, typically to add context to some analysis. This is accomplished by using common attributes (aka *columns*) to establish relationships between the tables, along with constraints that control how records (aka *rows*) are matched between them. There are four primary join types, and each has its own effect on query results.

### A hypothetical join scenario

For example, let's say you have two tables, one with `company` data and another with `product` data. It would be useful to see a combined view of companies and their products for an analysis or report. 

To do this, we would join the `company` and `product` tables together using a common field (or fields) that establishes a relationship between them. In a relational database, this is generally done using a primary key/foreign key relationship. Then we would establish a constraint to determine how the rows should be matched between the two tables in the resulting query. The most common constraint is to specify that only matching values between the tables be included in the join (Ex: `A.column_a = B.column_a`).

### Working through an example of the most common join - the `INNER JOIN`

Joins can be a difficult topic, so let's practice doing some together and then we'll explain what is happening as we go.

The first type of join we'll look at is the `INNER JOIN`. When we use an `INNER JOIN`, we are saying that we only want the values that match between the two tables based on our constraints. 

Let's assume we have two tables, `A` and `B`, that we want to join together and only include the records that are common to both tables. 

Here is what the query would look like:

**Query**
```SQL
SELECT
    A.column1,
    B.column1
FROM A -- This is the table on the "left-side" of the join
-- Specifying the join type
INNER JOIN B -- This is the table on the "right-side" of the join
-- Specifying the join column (`column1`) and constraint (`=`)
ON A.column1=B.column1 
```

This is a visual representation of which results will be returned between the two tables after the join:

**What gets matched between the two tables**

![Inner Join](https://www.codeproject.com/KB/database/Visual_SQL_Joins/INNER_JOIN.png)

As you can see, we're keeping only the results between the two tables that match based on our join columns and constraints. These results will only include those rows that matched and will drop data from both sides of the join that don't match. 
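
To make the matching concrete, here is a minimal sketch that runs the generic query above from Python. It assumes the built-in `sqlite3` module as a stand-in for the notebook's database connection, and the values placed in `A` and `B` are made up for illustration.

```python
import sqlite3

# Toy tables named A and B, matching the generic query above.
# The values are invented purely to show which rows survive the join.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE A (column1 INTEGER);
CREATE TABLE B (column1 INTEGER);
INSERT INTO A VALUES (1), (2), (3);
INSERT INTO B VALUES (2), (3), (4);
""")

inner_rows = con.execute("""
    SELECT A.column1, B.column1
    FROM A
    INNER JOIN B
    ON A.column1 = B.column1
    ORDER BY A.column1
""").fetchall()

print(inner_rows)  # [(2, 2), (3, 3)]
```

The value `1` only exists in `A` and the value `4` only exists in `B`, so neither appears in the result.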

For the syntax of the join, we always specify the join type, column(s), and constraint(s) together. Joins come after the `FROM` clause and before the `WHERE` and `GROUP BY` clauses in our queries. 

To specify the type of join, we use the following form: 

`{Join Type} {name of table to join}`

**Example:** `INNER JOIN company`

To specify the column and constraint, we use the following form:

`{First Table.Column to join on} {constraint} {Second Table.Column to join on}`

**Example:** `foods.company_id=company.company_id`

Putting it all together, writing the join would look like:

```SQL
INNER JOIN company
ON foods.company_id=company.company_id
```

>*Note: We typically use dot (`.`) notation when we join two or more tables in SQL. This is so that we don’t confuse the SQL engine when two tables have columns with the same name. Here’s the syntax: `table_name.column_name`*
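
Here is a small sketch of what can go wrong without dot notation. It assumes Python's built-in `sqlite3` module rather than the notebook's DuckDB connection (other engines report a similar error with different wording), and it uses a trimmed copy of `company` and `foods` with rows taken from the outputs in this notebook.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE foods (item_id INTEGER PRIMARY KEY, item_name TEXT, company_id INTEGER);
INSERT INTO company VALUES
    (15, 'Jack Hill Ltd'), (16, 'Akas Foods'), (17, 'Foodies.'),
    (18, 'Order All'), (19, 'sip-n-Bite');
INSERT INTO foods VALUES
    (1, 'Chex Mix', 16), (2, 'BN Biscuit', 15), (3, 'Mighty Munch', 17),
    (4, 'Pot Rice', 15), (5, 'Jaffa Cakes', 18), (6, 'Cheez-It', 15),
    (7, 'Salt n Shake', NULL);
""")

# Both tables have a `company_id` column, so the bare name is ambiguous.
try:
    con.execute("""
        SELECT company_id
        FROM foods
        INNER JOIN company
        ON foods.company_id = company.company_id
    """)
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)

print(ambiguous_error)  # the message mentions an ambiguous column name

# Qualifying each column with its table name removes the ambiguity.
qualified_rows = con.execute("""
    SELECT foods.item_name, company.company_id
    FROM foods
    INNER JOIN company
    ON foods.company_id = company.company_id
    ORDER BY foods.item_id
""").fetchall()

print(qualified_rows[0])  # ('Chex Mix', 16)
```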

Overall, there are **four join types** that you should be aware of (and which we will cover):

- `INNER JOIN`: Only keep the records that match the constraint between the two tables
- `LEFT JOIN`: Keep all the records from the left-side of the join, and only show values for the records on the right-side that matched. Any values from the right-side that weren't matched will be assigned a `null` value.
- `RIGHT JOIN`: The opposite of the `LEFT JOIN` - keep all the records from the right-side of the join and only show values for the records on the left-side that matched. Any values from the left-side that weren't matched will be assigned a `null` value.
- `FULL OUTER JOIN`: Keep all the records from both tables. Records that match the constraint are combined into a single row, and records without a match are still included, with the columns from the other table set to `null`.
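
One way to compare the four types side by side is a toy model in plain Python. This is only a sketch of the matching rules, not how a database engine implements joins: the helper `join` is hypothetical, it works on simple lists of key values, and `None` stands in for SQL `null`.

```python
def join(left, right, how="inner"):
    """Pair up equal keys from two lists according to a join type.

    Returns (left_key, right_key) tuples. None plays the role of SQL null
    on the side that has no match. A key of None never matches anything,
    mirroring how null = null is not true in SQL.
    """
    pairs = []
    matched_right = set()
    for left_key in left:
        matches = [
            i for i, right_key in enumerate(right)
            if left_key is not None and left_key == right_key
        ]
        for i in matches:
            pairs.append((left_key, right[i]))
            matched_right.add(i)
        # LEFT and FULL keep left-side rows even without a match.
        if not matches and how in ("left", "full"):
            pairs.append((left_key, None))
    # RIGHT and FULL keep right-side rows that nothing matched.
    if how in ("right", "full"):
        for i, right_key in enumerate(right):
            if i not in matched_right:
                pairs.append((None, right_key))
    return pairs


A = [1, 2, 3]
B = [2, 3, 4]

inner = join(A, B, "inner")  # [(2, 2), (3, 3)]
left = join(A, B, "left")    # [(1, None), (2, 2), (3, 3)]
right = join(A, B, "right")  # [(2, 2), (3, 3), (None, 4)]
full = join(A, B, "full")    # [(1, None), (2, 2), (3, 3), (None, 4)]
```

The row order here is just a side effect of the loops. In SQL, row order is not guaranteed unless you add an `ORDER BY`.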

We will dive a bit deeper into the details of each of these join types as we work through some practical examples in the next section of this notebook. Also, these are only four join types out of many. There are some more advanced joins, like [cross-joins, natural joins, and self-joins](https://www.linkedin.com/pulse/what-difference-between-natural-joincross-join-self-madhu-mitha-k) that you should eventually become familiar with as you enhance your skills and understanding.

## Practical considerations for selecting attributes/columns for joins

Joining datasets usually depends on primary and foreign key relationships, but these aren't always clear (or available), especially with unfamiliar data. In such cases, understanding the data's origin, purpose, and contents can help. This can be done by finding documentation like data dictionaries or entity relationship diagrams (ERDs) or consulting with subject matter experts.

If there's no documentation or experts, exploratory data analysis becomes crucial. This involves understanding column data types, summary statistics, and identifying missing values. Aim to identify unique identifier columns, or potential primary keys for joins, which should be unique and non-null. [The information_schema](https://en.wikipedia.org/wiki/Information_schema), found in many relational databases, can also provide helpful metadata.

The last step in joining datasets is connecting tables using the keys identified earlier. If key relationships are known, this is simple. Otherwise, analyze the candidate keys, look for columns that the tables have in common, and try joining on them. Unexpected duplicate rows after a join often mean that the join column is not unique where you expected it to be, or that you joined on the wrong column. This illustrates why it is crucial for you to understand your data.
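
The "unique and non-null" check described above can be sketched with three counts. This example assumes Python's built-in `sqlite3` module as a stand-in for the notebook's database, uses a trimmed copy of `company` and `foods` with rows taken from this notebook's outputs, and the helpers `key_profile` and `is_candidate_key` are hypothetical names.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company (company_id INTEGER, company_name TEXT);
CREATE TABLE foods (item_id INTEGER, item_name TEXT, company_id INTEGER);
INSERT INTO company VALUES
    (15, 'Jack Hill Ltd'), (16, 'Akas Foods'), (17, 'Foodies.'),
    (18, 'Order All'), (19, 'sip-n-Bite');
INSERT INTO foods VALUES
    (1, 'Chex Mix', 16), (2, 'BN Biscuit', 15), (3, 'Mighty Munch', 17),
    (4, 'Pot Rice', 15), (5, 'Jaffa Cakes', 18), (6, 'Cheez-It', 15),
    (7, 'Salt n Shake', NULL);
""")


def key_profile(table, column):
    """Return (total rows, non-null values, distinct values) for a column.

    Table and column names cannot be passed as query parameters, so they
    are interpolated directly. Only call this with names you trust.
    """
    query = (
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}) "
        f"FROM {table}"
    )
    return con.execute(query).fetchone()


def is_candidate_key(profile):
    """A column can act as a key only if every row has a distinct, non-null value."""
    total, non_null, distinct = profile
    return total == non_null == distinct


company_profile = key_profile("company", "company_id")  # (5, 5, 5)
foods_profile = key_profile("foods", "company_id")      # (7, 6, 4)

print(is_candidate_key(company_profile))  # True
print(is_candidate_key(foods_profile))    # False: one null and repeated values
```

Here `company.company_id` passes the check, while `foods.company_id` does not, which is what we would expect from a foreign key that points at `company`.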

## -----------------------------------------------------------------------------

# Practice Section - Joins and Multiple Tables

As an example, we'll join the `company` and `foods` tables together. There is already a primary key/foreign key relationship between the two tables established by the `company_id` column.

Both tables have the `company_id` column. The primary key in this relationship is the `company_id` column from the `company` table, since it ensures that each row of `company` is unique. And, as you probably guessed, the foreign key is the `company_id` column from the `foods` table, which establishes the relationship with the `company` table.
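
If you are curious what a primary key/foreign key relationship looks like when it is declared, here is a minimal sketch. It assumes Python's built-in `sqlite3` module, not the notebook's actual table definitions (which are not shown here). The rows `Duplicate Co` and `Mystery Snack` are hypothetical and exist only to trigger the constraints.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE company (
    company_id INTEGER PRIMARY KEY,   -- primary key: unique per company
    company_name TEXT
);
CREATE TABLE foods (
    item_id INTEGER PRIMARY KEY,
    item_name TEXT,
    company_id INTEGER REFERENCES company (company_id)  -- foreign key
);
INSERT INTO company VALUES (15, 'Jack Hill Ltd'), (16, 'Akas Foods');
-- A null foreign key is allowed: the food simply has no company.
INSERT INTO foods VALUES (1, 'Chex Mix', 16), (7, 'Salt n Shake', NULL);
""")


def try_insert(sql):
    """Run an INSERT and return the constraint error message, or None."""
    try:
        con.execute(sql)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


# A second company with company_id 15 would break uniqueness.
duplicate_key_error = try_insert(
    "INSERT INTO company VALUES (15, 'Duplicate Co')"
)
# A food pointing at company_id 99 has no company to refer to.
orphan_row_error = try_insert(
    "INSERT INTO foods VALUES (8, 'Mystery Snack', 99)"
)

print(duplicate_key_error)
print(orphan_row_error)
```

Both inserts are rejected, which is how the database keeps the relationship between the two tables trustworthy enough to join on.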


### Inner Joins

An `INNER JOIN` is the most common type of join you will use. Let's apply what we just learned to the `company` and `foods` tables. Join the two tables and show all of the columns between them.

As a reminder, here is a representation of how the data will be matched, record by record, between the two tables:

![Inner Join](assets/inner-join-company-foods.png)


### **Now it's your turn!**

In [3]:
%%sql

SELECT *
FROM company
INNER JOIN foods
ON company.company_id=foods.company_id

| | company_id | company_name | company_city | item_id | item_name | item_unit | company_id.1 |
|---|---|---|---|---|---|---|---|
| 0 | 16 | Akas Foods | Delhi | 1 | Chex Mix | Pcs | 16.0 |
| 1 | 15 | Jack Hill Ltd | London | 6 | Cheez-It | Pcs | 15.0 |
| 2 | 15 | Jack Hill Ltd | London | 2 | BN Biscuit | Pcs | 15.0 |
| 3 | 17 | Foodies. | London | 3 | Mighty Munch | Pcs | 17.0 |
| 4 | 15 | Jack Hill Ltd | London | 4 | Pot Rice | Pcs | 15.0 |
| 5 | 18 | Order All | Boston | 5 | Jaffa Cakes | Pcs | 18.0 |


### Left Joins

A `LEFT JOIN` is the other most common join after `INNER JOIN`. The difference between the two is how the records are matched in the query result. While an `INNER JOIN` includes only the records that match on both sides of the join, a `LEFT JOIN` will keep all the records from the left-side of the join and only those that match from the right-side. When a left-side record has no match, it is not dropped as it would be with an `INNER JOIN`; instead, the right-side columns for that record are set to `null`. Visually, the matched records will look like this:

![Left Join](assets/left-join-company-foods.png)

If you're new to `LEFT JOIN`, you're probably wondering what was meant before by "left-side of the join." Visually, we can see that there is a table on the left side that has all of the records included with only the records at the intersection included from the right. But how does that translate to an actual query?

```SQL

SELECT *
FROM company -- This is the table on the "left-side" of the join
LEFT JOIN foods -- This table is on the "right-side" of the join
ON company.company_id=foods.company_id
```

Simply put, the table on the "left" of a join type is the one used in the `FROM` clause and the table on the "right" is the one specified after the join type (`LEFT JOIN` in this case). Columns and constraints otherwise work the same as the `INNER JOIN`. The syntax is virtually identical between the two joins, but it's important to know how they work and the caveats for all join types.

Run the `LEFT JOIN` query that we wrote above in the cell below and compare the results with the query we ran earlier using an `INNER JOIN`. Can you see the difference?

### **Now it's your turn!**

In [None]:
%%sql

SELECT *
FROM company
LEFT JOIN foods
ON company.company_id=foods.company_id

### `INNER JOIN` vs `LEFT JOIN` - Caveats & Using them in Practice

#### Caveats

Looking at the result of the two queries, you should notice that the `LEFT JOIN` query gave you an extra record that was missing from the `INNER JOIN` query. The `sip-n-Bite` company is missing from our first query. Why?

Well, we know that we joined the two tables together based on matching `company_id`. We also know that `INNER JOIN` only keeps records from both tables that match. Since we know that the record exists in the `company` table, that must mean that there is no record for the `sip-n-Bite` company in the `foods` table. You can confirm this by looking at the `foods` table and querying for `company_id=19`, which will return no result since it doesn't exist.
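
Both checks can be sketched in a few lines. This example assumes Python's built-in `sqlite3` module as a stand-in for the notebook's database and a trimmed copy of the two tables, with rows taken from the outputs in this notebook.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE foods (item_id INTEGER PRIMARY KEY, item_name TEXT, company_id INTEGER);
INSERT INTO company VALUES
    (15, 'Jack Hill Ltd'), (16, 'Akas Foods'), (17, 'Foodies.'),
    (18, 'Order All'), (19, 'sip-n-Bite');
INSERT INTO foods VALUES
    (1, 'Chex Mix', 16), (2, 'BN Biscuit', 15), (3, 'Mighty Munch', 17),
    (4, 'Pot Rice', 15), (5, 'Jaffa Cakes', 18), (6, 'Cheez-It', 15),
    (7, 'Salt n Shake', NULL);
""")

# Check 1: is there any food that belongs to company 19?
foods_for_19 = con.execute(
    "SELECT * FROM foods WHERE company_id = 19"
).fetchall()
print(foods_for_19)  # []

# Check 2: use the LEFT JOIN itself to list every company without foods.
# Unmatched companies have null in all `foods` columns, so we filter on that.
companies_without_foods = con.execute("""
    SELECT company.company_id, company.company_name
    FROM company
    LEFT JOIN foods
    ON company.company_id = foods.company_id
    WHERE foods.item_id IS NULL
""").fetchall()
print(companies_without_foods)  # [(19, 'sip-n-Bite')]
```

The second query is a handy pattern whenever you want to find the records that an `INNER JOIN` would have dropped.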

#### When do I use one over the other?

Whether to use an `INNER JOIN` or a `LEFT JOIN` is something you must consider for your particular use case. Do you only want to consider the records that match between your tables? Then choose an `INNER JOIN`. Want to make sure that records are kept from the left side of the join? Then - you guessed it - use a `LEFT JOIN`.

#### Practical Usage

By and large, the majority of your joins in practice will either be an `INNER JOIN` or a `LEFT JOIN`. It is worth it to learn them well and become really comfortable with using them, as well as knowing when to use them. The other joins mentioned are not used as much in practice, but it's good to know about them - especially for technical interviews!


### Right Joins

As mentioned earlier, `RIGHT JOIN` is rarely used in practice. This is because you can do the same thing using just a `LEFT JOIN`, so there aren't many (if any) use cases where you would want to exclusively use it. However, it is a join type to be aware of and is commonly asked about in interviews, so let's cover it.

The opposite of the `LEFT JOIN`, `RIGHT JOIN` includes all the records from the "right-side" of the join and only records that match from the "left-side". Also, similar to `LEFT JOIN`, values in records from the other side of the join that don't match are set to `null` and included in our query results. Visually, the resulting matches look like this:

![Right Join](assets/right-join-company-foods.png)

Here is the query breakdown:

```SQL

SELECT *
FROM company -- This is the table on the "left-side" of the join
RIGHT JOIN foods -- This table is on the "right-side" of the join
ON foods.company_id=company.company_id
```

Syntactically, it is almost identical to the other joins. Let's run a `RIGHT JOIN` query and see the results.


### **Now it's your turn!**

Write a `RIGHT JOIN` query with the `foods` table on the "right-side" of the join and the `company` table on the "left-side" of the join.

In [None]:
%%sql

-- Write your query below here --

SELECT *
FROM company
RIGHT JOIN foods
ON company.company_id=foods.company_id

From the results of the `RIGHT JOIN`, we can see that it indeed kept all of the records from the `foods` table (since it's on the right-side of the join) and gave null values in the `company` columns for any `foods` record without a match (`Salt n Shake`, which has no `company_id`).
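
To see why a `RIGHT JOIN` can always be rewritten as a `LEFT JOIN`, swap the two tables. The sketch below assumes Python's built-in `sqlite3` module as a stand-in for the notebook's database and a trimmed copy of the tables. Older SQLite versions do not support `RIGHT JOIN` at all (it was added in 3.39.0), so the direct comparison only runs when it is available.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE foods (item_id INTEGER PRIMARY KEY, item_name TEXT, company_id INTEGER);
INSERT INTO company VALUES
    (15, 'Jack Hill Ltd'), (16, 'Akas Foods'), (17, 'Foodies.'),
    (18, 'Order All'), (19, 'sip-n-Bite');
INSERT INTO foods VALUES
    (1, 'Chex Mix', 16), (2, 'BN Biscuit', 15), (3, 'Mighty Munch', 17),
    (4, 'Pot Rice', 15), (5, 'Jaffa Cakes', 18), (6, 'Cheez-It', 15),
    (7, 'Salt n Shake', NULL);
""")

# `foods` moves to the FROM clause, so a LEFT JOIN now keeps all of its rows.
left_version = con.execute("""
    SELECT company.company_name, foods.item_name
    FROM foods
    LEFT JOIN company
    ON company.company_id = foods.company_id
    ORDER BY foods.item_id
""").fetchall()

print(len(left_version))  # 7
print(left_version[-1])   # (None, 'Salt n Shake')

# Compare against a real RIGHT JOIN when this SQLite build supports it.
if sqlite3.sqlite_version_info >= (3, 39, 0):
    right_version = con.execute("""
        SELECT company.company_name, foods.item_name
        FROM company
        RIGHT JOIN foods
        ON company.company_id = foods.company_id
        ORDER BY foods.item_id
    """).fetchall()
    print(right_version == left_version)  # True
```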

### Full Outer Joins

`FULL OUTER JOIN` is another join type that isn't used as often as left or inner joins in practice, but it is much more common than the `RIGHT JOIN`. We use `FULL OUTER JOIN` when we want to include all the records from both sides of the join, showing the records that match between the two and otherwise giving null values where there isn't a match between the tables. A `FULL OUTER JOIN` is like a combination of both the left and right join types.

Here is how the matching looks visually:

![Outer Join](assets/full-outer-join-company-foods.png)

The query syntax is pretty much identical to the others, aside from specifying the join type itself:

```SQL
SELECT *
FROM company
FULL OUTER JOIN foods
ON company.company_id=foods.company_id
```

### **Now it's your turn!**

Let's write a query using the `FULL OUTER JOIN` with `company` on the left-side of the join and `foods` on the right-side:

In [None]:
%%sql

-- Write your query below this line --
SELECT *
FROM company
FULL OUTER JOIN foods
ON company.company_id=foods.company_id

As you can see, the `FULL OUTER JOIN` gave us a result that is essentially a combination of the results from the right and left joins. 
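
The "combination of left and right" idea can be sketched by building the same result out of two `LEFT JOIN` queries. This is an emulation under stated assumptions (Python's built-in `sqlite3` module and a trimmed copy of the tables), not a replacement for `FULL OUTER JOIN` in the notebook.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE foods (item_id INTEGER PRIMARY KEY, item_name TEXT, company_id INTEGER);
INSERT INTO company VALUES
    (15, 'Jack Hill Ltd'), (16, 'Akas Foods'), (17, 'Foodies.'),
    (18, 'Order All'), (19, 'sip-n-Bite');
INSERT INTO foods VALUES
    (1, 'Chex Mix', 16), (2, 'BN Biscuit', 15), (3, 'Mighty Munch', 17),
    (4, 'Pot Rice', 15), (5, 'Jaffa Cakes', 18), (6, 'Cheez-It', 15),
    (7, 'Salt n Shake', NULL);
""")

full_rows = con.execute("""
    -- Part 1: every company, with its foods where they exist.
    SELECT company.company_name, foods.item_name
    FROM company
    LEFT JOIN foods
    ON company.company_id = foods.company_id

    UNION ALL

    -- Part 2: only the foods that matched no company at all.
    SELECT company.company_name, foods.item_name
    FROM foods
    LEFT JOIN company
    ON company.company_id = foods.company_id
    WHERE company.company_id IS NULL
""").fetchall()

print(len(full_rows))                       # 8
print(("sip-n-Bite", None) in full_rows)    # True: company without foods
print((None, "Salt n Shake") in full_rows)  # True: food without a company
```

Six rows come from matches, one from the unmatched company, and one from the unmatched food.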

#### Summary of Joins
To sum things up, we use joins to combine data from different sources to add context to our analysis. When we join two tables, the records are matched based on a constraint that we specify, with the most common being that the value on one side of the join is equal to the value on the other. Joins also have types, which affect what rows are returned from a query. The four primary types are `INNER`, `LEFT`, `RIGHT`, and `FULL OUTER` joins. Of these four, the two most common are `INNER` and `LEFT`. 

# Aggregates - Summarizing Data with SQL

### Clauses to know:

- `GROUP BY` - Allows you to aggregate data by a single column or a group of columns.
- `HAVING` - Allows you to filter your query using the value of an aggregate function. Think of this as a `WHERE` clause for aggregate functions.
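
Here is a minimal sketch of the two clauses working together. It assumes Python's built-in `sqlite3` module as a stand-in for the notebook's database and a trimmed copy of the `foods` table, with rows taken from the output earlier in this notebook.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE foods (item_id INTEGER PRIMARY KEY, item_name TEXT, company_id INTEGER);
INSERT INTO foods VALUES
    (1, 'Chex Mix', 16), (2, 'BN Biscuit', 15), (3, 'Mighty Munch', 17),
    (4, 'Pot Rice', 15), (5, 'Jaffa Cakes', 18), (6, 'Cheez-It', 15),
    (7, 'Salt n Shake', NULL);
""")

# GROUP BY: one output row per distinct company_id (null forms its own group).
items_per_company = dict(con.execute("""
    SELECT company_id, COUNT(*) AS item_count
    FROM foods
    GROUP BY company_id
""").fetchall())
print(items_per_company[15])  # 3

# HAVING: keep only the groups whose aggregate passes the filter.
companies_with_many_items = con.execute("""
    SELECT company_id, COUNT(*) AS item_count
    FROM foods
    GROUP BY company_id
    HAVING COUNT(*) > 1
""").fetchall()
print(companies_with_many_items)  # [(15, 3)]
```

A `WHERE` clause could not do this filtering, because it runs before the rows are grouped and the counts exist.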

### Common aggregate functions:

- `COUNT(column)`: Counts how many rows have a non-null value in the given column (or all rows in the table if you use `*` - e.g. `COUNT(*)`).
- `MIN(column)`: Gives you the smallest value found for the given column.
- `MAX(column)`: Gives you the largest value found for the given column.
- `AVG(column)`: Gives you the average for all values in the given column.
- `SUM(column)`: Gives you the sum of all the values in the given column.
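
As a quick sketch, here are all five functions applied to one column. It assumes Python's built-in `sqlite3` module as a stand-in for the notebook's database and a trimmed copy of the `foods` table. Aggregating `company_id` is not meaningful analysis; it is just the only numeric column with a null in it, which makes the behavior easy to see.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE foods (item_id INTEGER PRIMARY KEY, item_name TEXT, company_id INTEGER);
INSERT INTO foods VALUES
    (1, 'Chex Mix', 16), (2, 'BN Biscuit', 15), (3, 'Mighty Munch', 17),
    (4, 'Pot Rice', 15), (5, 'Jaffa Cakes', 18), (6, 'Cheez-It', 15),
    (7, 'Salt n Shake', NULL);
""")

summary = con.execute("""
    SELECT
        COUNT(*),           -- every row in the table
        COUNT(company_id),  -- only rows where company_id is not null
        MIN(company_id),
        MAX(company_id),
        AVG(company_id),    -- nulls are skipped, not treated as zero
        SUM(company_id)
    FROM foods
""").fetchone()

print(summary)  # (7, 6, 15, 18, 16.0, 96)
```

Notice that `COUNT(*)` and `COUNT(company_id)` differ by one: the `Salt n Shake` row has no `company_id`, so it is counted as a row but not as a value.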

[Click here to see code examples of aggregate functions in SQL](https://martinmarroyo.github.io/sqlcheatsheetandresources-coop/#aggregates)


### **Now it's your turn!**

Using the `employees` table, find the longest time that an employee has been at the studio:

In [None]:
%%sql

-- Write your query below this line --

SELECT MAX(years_employed)
FROM employees

## More resources for further practice

- [SQL Bolt](https://sqlbolt.com/): The lessons here are a great introduction to SQL and you know the platform already!
- [Mode](https://mode.com/sql-tutorial/): A comprehensive SQL tutorial from beginner all the way to advanced SQL. There's even a data analytics with SQL tutorial. This is a great resource to learn about SQL in depth and practice what you learn in their online database.
- [StrataScratch](https://platform.stratascratch.com/coding): Practice coding questions geared toward data analysts and data scientists. You can solve coding problems used by real companies for technical interviews using PostgreSQL, Python, R, or MySQL. It's free to sign up!
- [Codecademy - Free Learn SQL Course](https://www.codecademy.com/learn/learn-sql): Codecademy is another great resource to learn SQL as well as most other languages. There are a lot of free resources here that can help you learn SQL, Python, R, and many other languages.
- [Socratica SQL (YouTube)](https://www.youtube.com/watch?v=nWyyDHhTxYU&list=PLih4ch-U2DiBbMoFK4ML9faT3k3MM2UQY): This is a great playlist that will get you started learning SQL with one of the most popular relational databases - Postgres.
- [DB Fiddle](https://dbfiddle.uk/): This site is like a SQL scratch pad. You can use it to practice doing stuff like creating tables and inserting data into them, and all sorts of other stuff that you might not be able to do so freely in a live database. It's a sandbox, basically. Here are a couple of links to fiddles with some data in them to play with: [fiddle 1](https://dbfiddle.uk/?rdbms=postgres_13&fiddle=366b683701596d3f7459b0411c15acd1) and [fiddle 2](https://dbfiddle.uk/?rdbms=postgres_13&fiddle=dfffc1939f629d9286c55d732fb656c5).


And don't forget to keep your [SQL Cheatsheet](https://martinmarroyo.github.io/sqlcheatsheetandresources-coop/) handy!