# Full Outer Join


In some cases, you might want to include unmatched rows from both tables being joined. You can do this with a `full outer join`.

```SQL

SELECT column_name(s)
FROM Table_A
FULL OUTER JOIN Table_B ON Table_A.column_name = Table_B.column_name;
```

<img src="https://video.udacity-data.com/topher/2017/November/5a147487_full-outer-join/full-outer-join.png" style="width:200px">


## Finding Unmatched Rows after `Full Outer Join`

`LEFT JOIN` and `RIGHT JOIN` each return unmatched rows from one of the tables — `FULL JOIN` **returns unmatched rows from both tables**. FULL JOIN is commonly used in conjunction with aggregations to understand the amount of overlap between two tables.
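
As a minimal sketch of the overlap idea, the query below counts how many keys appear only in one table, only in the other, or in both. The tables `table_a` and `table_b` and their values are made-up sample data defined inline, and PostgreSQL syntax is assumed.

```SQL
/* Hypothetical sample data: ids 2 and 3 overlap, 1 is only in A, 4 and 5 only in B */
WITH table_a (id) AS (
    VALUES (1), (2), (3)
),
table_b (id) AS (
    VALUES (2), (3), (4), (5)
)
SELECT COUNT(CASE WHEN b.id IS NULL THEN 1 END) AS only_in_a,
       COUNT(CASE WHEN a.id IS NULL THEN 1 END) AS only_in_b,
       COUNT(CASE WHEN a.id IS NOT NULL AND b.id IS NOT NULL THEN 1 END) AS in_both
FROM table_a a
FULL OUTER JOIN table_b b ON a.id = b.id;
/* Result for this sample: only_in_a = 1, only_in_b = 2, in_both = 2 */
```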

**If you wanted to return unmatched rows only**, which is useful for some cases of data assessment, you can isolate them by adding the following line to the end of the query:
```SQL
WHERE Table_A.column_name IS NULL OR Table_B.column_name IS NULL
```

<img src="https://video.udacity-data.com/topher/2017/November/5a147485_full-outer-join-if-null/full-outer-join-if-null.png" style="width:200px">


### A common application

A common application of `Full Outer Join` is when joining two tables on a timestamp. Let’s say you’ve got one table containing the number of item 1 sold each day, and another containing the number of item 2 sold. If a certain date, like January 1, 2018, exists in the left table but not the right, while another date, like January 2, 2018, exists in the right table but not the left:

- a left join would drop the row with January 2, 2018 from the result set
- a right join would drop January 1, 2018 from the result set

**The only way to make sure both January 1, 2018 and January 2, 2018 make it into the results is to do a full outer join**. A full outer join returns unmatched records in each table with null values for the columns that came from the opposite table.
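
The date scenario above can be sketched with two small inline tables. The names `item_1_sales` and `item_2_sales`, their columns, and the unit counts are assumptions made for illustration; PostgreSQL syntax is assumed.

```SQL
/* Hypothetical daily sales: January 1 exists only on the left, January 2 only on the right */
WITH item_1_sales (sale_date, units) AS (
    VALUES (DATE '2018-01-01', 5),
           (DATE '2018-01-03', 2)
),
item_2_sales (sale_date, units) AS (
    VALUES (DATE '2018-01-02', 7),
           (DATE '2018-01-03', 4)
)
SELECT COALESCE(i1.sale_date, i2.sale_date) AS sale_date,
       i1.units AS item_1_units,
       i2.units AS item_2_units
FROM item_1_sales i1
FULL OUTER JOIN item_2_sales i2 ON i1.sale_date = i2.sale_date
ORDER BY 1;
/* Returns three rows: 2018-01-01 (5, NULL), 2018-01-02 (NULL, 7), 2018-01-03 (2, 4) */
```

`COALESCE` is used here so that the date column is populated whichever side the row came from.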



## Quiz

Say you're an analyst at Parch & Posey and you want to see:

- each account that has a sales rep and each sales rep that has an account (all of the columns in these returned rows will be full)
- but also each account that does not have a sales rep and each sales rep that does not have an account (some of the columns in these returned rows will be empty)

This type of question is rare, but `FULL OUTER JOIN` is perfect for it.

```SQL
/* The WHERE clause keeps only the unmatched rows from either table */
SELECT *
FROM accounts a
FULL JOIN sales_reps s ON a.sales_rep_id = s.id
WHERE a.sales_rep_id IS NULL OR s.id IS NULL
```

The query returns no rows, which means that every account has a sales rep and every sales rep has at least one account.

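# Filtering Rows in JOINs with Comparison Operators

Filtering in the `ON` clause restricts which rows of the joined table can match before the join is made. With a `LEFT JOIN`, left-table rows without a match are still returned, with nulls in the joined table's columns. Filtering in the `WHERE` clause is applied after the join and removes the non-matching rows from the result entirely.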

```SQL
SELECT a.name account,
       a.primary_poc contact,
       s.name rep
FROM accounts a
LEFT JOIN sales_reps s
ON a.sales_rep_id = s.id
/* Match a rep only when the account's contact name comes before the rep's name alphabetically */
AND a.primary_poc < s.name
```

#### Rows where the `LEFT JOIN` found a match:
```SQL
WHERE s.name IS NOT NULL
LIMIT 10
```

|account|contact|rep|
|---|---|---|
|Johnson Controls|Cammy Sosnowski|Samuel Racine|
|Ingram Micro|Chanelle Keach|Samuel Racine|
|Freddie Mac|Elayne Grunewald|Samuel Racine|
|Express Scripts Holding|Jewell Likes|Samuel Racine|
|Delta Air Lines|Enola Thoms|Eugena Esser|
|PepsiCo|Cathleen Delamater|Eugena Esser|
|Tesoro|Mammie Koff|Michel Averette|
|Nationwide|Henriette Dawes|Michel Averette|
|Tyson Foods|Ardelle Khoury|Michel Averette|
|United Technologies|Janett Wisecarver|Michel Averette|


#### Rows where the `LEFT JOIN` found no match:

```SQL
WHERE s.name IS NULL
LIMIT 10
```

|account|contact|rep|
|---|---|---|
|Walmart|Tamara Tuma||
|Exxon Mobil|Sung Shields||
|Berkshire Hathaway|Serafina Banda||
|UnitedHealth Group|Savanna Gayman||
|General Electric|Parker Hoggan||
|AmerisourceBergen|Tuan Trainer||
|Chevron|Paige Bartos||
|Fannie Mae|Terrilyn Kesler||
|Walgreens Boots Alliance|Esta Engelhardt||
|HP|Khadijah Riemann||
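
To contrast the two filter placements side by side, here is a small sketch that uses made-up inline data in place of the real `accounts` and `sales_reps` tables (the account and contact names are assumptions). PostgreSQL syntax is assumed.

```SQL
/* Hypothetical data: 'Alice' sorts before 'Mike', 'Zoe' does not */
WITH accounts (name, primary_poc, sales_rep_id) AS (
    VALUES ('Acme', 'Alice', 1),
           ('Zenith', 'Zoe', 1)
),
sales_reps (id, name) AS (
    VALUES (1, 'Mike')
)
/* Filter in ON: both accounts are kept, Zenith gets a NULL rep */
SELECT a.name AS account, s.name AS rep, 'filter in ON' AS variant
FROM accounts a
LEFT JOIN sales_reps s
  ON a.sales_rep_id = s.id
 AND a.primary_poc < s.name
UNION ALL
/* Filter in WHERE: Zenith is removed after the join */
SELECT a.name, s.name, 'filter in WHERE'
FROM accounts a
LEFT JOIN sales_reps s
  ON a.sales_rep_id = s.id
WHERE a.primary_poc < s.name;
/* Returns three rows: two for 'filter in ON' (one with a NULL rep), one for 'filter in WHERE' */
```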

# Self Join

A self join is useful when you want to show relationships between rows of the same table, such as the parent and child relationships within a family tree.
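
A minimal sketch of the family-tree case, assuming a hypothetical `people` table in which `parent_id` points at another row of the same table (the table, columns, and names are illustrative only). PostgreSQL syntax is assumed.

```SQL
/* Hypothetical family tree: Grace has no recorded parent */
WITH people (id, name, parent_id) AS (
    VALUES (1, 'Grace', CAST(NULL AS integer)),
           (2, 'Henry', 1),
           (3, 'Irene', 1),
           (4, 'Jonas', 2)
)
SELECT c.name AS child,
       p.name AS parent
FROM people c
/* The same table is joined to itself under a second alias */
LEFT JOIN people p ON c.parent_id = p.id
ORDER BY c.id;
/* Returns: Grace/NULL, Henry/Grace, Irene/Grace, Jonas/Henry */
```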


## Understand the logic with an example


### Use `LEFT JOIN`
```SQL
SELECT o1.id AS o1_id, o1.account_id AS o1_account_id, o1.occurred_at AS o1_occurred_at,
       o2.id AS o2_id, o2.account_id AS o2_account_id, o2.occurred_at AS o2_occurred_at
FROM orders o1
LEFT JOIN orders o2
ON o1.account_id = o2.account_id
AND o1.occurred_at < o2.occurred_at
AND o2.occurred_at <= o1.occurred_at + INTERVAL '28 days'
ORDER BY o1.account_id, o1.occurred_at
```

The query above pairs each order (`o1`) with any later order (`o2`) placed by the same account within the following 28 days. Details of the earlier order appear in the `o1` columns, and those of the later order in the `o2` columns. Because it is a `LEFT JOIN`, the results also include orders that have no later order within 28 days; their `o2` columns are null.

|o1_id|o1_account_id|o1_occurred_at|o2_id|o2_account_id|o2_occurred_at|
|---|---|---|---|---|---|
|1|1001|2015-10-06T17:31:14.000Z||||
|4307|1001|2015-11-05T03:25:21.000Z|2|1001|2015-11-05T03:34:33.000Z|
|2|1001|2015-11-05T03:34:33.000Z||||
|4308|1001|2015-12-04T04:01:09.000Z|3|1001|2015-12-04T04:21:55.000Z|
|3|1001|2015-12-04T04:21:55.000Z||||
|4309|1001|2016-01-02T00:59:09.000Z|4|1001|2016-01-02T01:18:24.000Z|
|4|1001|2016-01-02T01:18:24.000Z||||
|4310|1001|2016-02-01T19:07:32.000Z|5|1001|2016-02-01T19:27:27.000Z|
|5|1001|2016-02-01T19:27:27.000Z||||
|6|1001|2016-03-02T15:29:32.000Z|4311|1001|2016-03-02T15:40:29.000Z|


### Use INNER JOIN

`INNER JOIN` returns only those orders that have another order placed after them within 28 days.

```SQL
SELECT o1.id AS o1_id, o1.account_id AS o1_account_id, o1.occurred_at AS o1_occurred_at,
       o2.id AS o2_id, o2.account_id AS o2_account_id, o2.occurred_at AS o2_occurred_at
FROM orders o1
JOIN orders o2
ON o1.account_id = o2.account_id
AND o1.occurred_at < o2.occurred_at
AND o2.occurred_at <= o1.occurred_at + INTERVAL '28 days'
ORDER BY o1.account_id, o1.occurred_at
```

|o1_id|o1_account_id|o1_occurred_at|o2_id|o2_account_id|o2_occurred_at|
|---|---|---|---|---|---|
|4307|1001|2015-11-05T03:25:21.000Z|2|1001|2015-11-05T03:34:33.000Z|
|4308|1001|2015-12-04T04:01:09.000Z|3|1001|2015-12-04T04:21:55.000Z|
|4309|1001|2016-01-02T00:59:09.000Z|4|1001|2016-01-02T01:18:24.000Z|
|4310|1001|2016-02-01T19:07:32.000Z|5|1001|2016-02-01T19:27:27.000Z|
|6|1001|2016-03-02T15:29:32.000Z|4311|1001|2016-03-02T15:40:29.000Z|
|4312|1001|2016-04-01T11:15:27.000Z|7|1001|2016-04-01T11:20:18.000Z|
|4313|1001|2016-05-01T15:40:04.000Z|8|1001|2016-05-01T15:55:51.000Z|
|4314|1001|2016-05-31T21:09:48.000Z|9|1001|2016-05-31T21:22:48.000Z|
|4315|1001|2016-07-30T03:21:57.000Z|11|1001|2016-07-30T03:26:30.000Z|
|4316|1001|2016-08-28T06:50:58.000Z|12|1001|2016-08-28T07:13:39.000Z|



## Quiz

> As you may have noticed in the previous example, using inequalities in conjunction with **self JOINs** is common.


Modify the query from the previous example to perform the same interval analysis on the `web_events` table. Also:

- change the interval to 1 day to find those web events that occurred after, but not more than 1 day after, another web event
- add a column for the channel variable in both instances of the table in your query

```SQL
SELECT w1.id AS w1_id,
       w1.account_id AS w1_account_id,
       w1.occurred_at AS w1_occurred_at,
       w1.channel AS w1_channel,
       w2.id AS w2_id,
       w2.account_id AS w2_account_id,
       w2.occurred_at AS w2_occurred_at,
       w2.channel AS w2_channel
FROM web_events w1
LEFT JOIN web_events w2
ON w1.account_id = w2.account_id
AND w1.occurred_at < w2.occurred_at
AND w2.occurred_at <= w1.occurred_at + INTERVAL '1 day'
ORDER BY w1.occurred_at, w2.occurred_at
LIMIT 20
```

|w1_id|w1_account_id|w1_occurred_at|w1_channel|w2_id|w2_account_id|w2_occurred_at|w2_channel|
|---|---|---|---|---|---|---|---|
|2471|2861|2013-12-04T04:18:29.000Z|direct|6994|2861|2013-12-04T18:22:04.000Z|facebook|
|4193|4311|2013-12-04T04:44:58.000Z|direct|8825|4311|2013-12-04T08:27:55.000Z|adwords|
|8825|4311|2013-12-04T08:27:55.000Z|adwords|||||
|6994|2861|2013-12-04T18:22:04.000Z|facebook|||||
|294|1281|2013-12-05T20:17:50.000Z|direct|4728|1281|2013-12-05T21:22:29.000Z|adwords|
|4728|1281|2013-12-05T21:22:29.000Z|adwords|||||
|1998|2481|2013-12-06T02:03:07.000Z|direct|6499|2481|2013-12-06T11:48:58.000Z|direct|
|1998|2481|2013-12-06T02:03:07.000Z|direct|6500|2481|2013-12-06T13:11:15.000Z|direct|
|7587|3251|2013-12-06T07:52:09.000Z|organic|2996|3251|2013-12-06T12:52:53.000Z|direct|
|6499|2481|2013-12-06T11:48:58.000Z|direct|6500|2481|2013-12-06T13:11:15.000Z|direct|
|3186|3431|2013-12-06T12:31:12.000Z|direct|7796|3431|2013-12-06T23:07:01.000Z|twitter|
|2996|3251|2013-12-06T12:52:53.000Z|direct|||||
|6500|2481|2013-12-06T13:11:15.000Z|direct|||||
|7870|3491|2013-12-06T16:57:12.000Z|facebook|3317|3491|2013-12-06T23:32:08.000Z|direct|
|7796|3431|2013-12-06T23:07:01.000Z|twitter|||||



# Appending Data with `UNION`


## `UNION` 


The `UNION` operator is used to combine the result sets of 2 or more `SELECT` statements. 

Typically, the use case for leveraging the `UNION` command in SQL is when a user wants to **pull together distinct values of specified columns that are spread across multiple tables**.

- Example 1: a chef wants to pull together the ingredients and respective aisle across three separate meals that are maintained in different tables.


- Example 2: a school wants to determine all late reasons among students, where each late reason is maintained in a table corresponding to the grade the student is in.

### Details on `UNION`
- There must be the same number of columns (fields) in every `SELECT` statement.
- The corresponding columns must have the same (or compatible) data types.
- Column names, however, don't need to be the same to append two tables.


> `UNION` removes duplicate rows; `UNION ALL` does not remove duplicate rows.
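
As a small sketch of these rules, the query below appends two hypothetical late-reason tables whose column names differ (`grade_9_late`, `grade_10_late`, and their values are made up for illustration). PostgreSQL syntax is assumed.

```SQL
/* Hypothetical per-grade tables: 'overslept' appears in both */
WITH grade_9_late (reason) AS (
    VALUES ('bus delay'), ('overslept')
),
grade_10_late (late_reason) AS (
    VALUES ('overslept'), ('traffic')
)
SELECT reason FROM grade_9_late
UNION
SELECT late_reason FROM grade_10_late
ORDER BY reason;
/* Returns three rows: 'bus delay', 'overslept', 'traffic' */
```

The output column takes its name from the first `SELECT`. Replacing `UNION` with `UNION ALL` would return four rows, because the duplicate 'overslept' would be kept.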


### Quiz

1. The SQL will return 351 results (all rows in the accounts table)

```SQL

SELECT * FROM accounts 

UNION 

SELECT * FROM accounts
```

2. The SQL will return `351 * 2 = 702` results 

```SQL

SELECT * FROM accounts 

UNION ALL

SELECT * FROM accounts
```


3. Add a WHERE clause to each of the tables that you unioned in the query above, filtering the first table where name equals Walmart and filtering the second table where name equals Disney. 

```SQL
SELECT * FROM accounts 
WHERE name = 'Walmart'

/* The two result sets share no rows, so UNION and UNION ALL return the same result */
UNION

SELECT * FROM accounts
WHERE name = 'Disney'
```
The result is equivalent to the following SQL command:

```SQL
SELECT * FROM accounts 
WHERE name='Walmart' OR name='Disney'
```

4. Performing Operations on a Combined Dataset

   Perform the `UNION ALL` from the second query in a common table expression named `double_accounts`. Then `COUNT` the number of times each name appears in `double_accounts`. If you do this correctly, your query results should have a count of 2 for each name.
   
```SQL
/* Returns 351 rows, each showing a count of 2 for that account name */
WITH double_accounts AS (
SELECT * FROM accounts 
UNION ALL
SELECT * FROM accounts)

SELECT name, COUNT(*)
FROM double_accounts
GROUP BY 1
```

# Performance Tuning

Even though SQL can work with massive amounts of data, a query can sometimes take hours to return its result. Performance tuning can help ease the issue.


## Factors to consider

Reduce the number of calculations that need to be performed. Some of the high-level things that will affect the number of calculations a given query will make include:

- Table size
- Joins
- Aggregations

Query runtime is also dependent on some things that you can’t really control related to the database itself:
- Other users running queries concurrently
- Database software and optimization


## Methods

1. Filter data to a smaller subset.
    - use `WHERE` clause
    - use subquery
   
   **`LIMIT` is applied after an aggregation has already run, so it does NOT speed up an aggregate query. To aggregate over fewer rows, apply `LIMIT` in a subquery first.**


2. Make **joins** as simple as possible by reducing the table sizes before joining them.
    - For example, perform aggregation first and use the aggregated results table for join:
    ```SQL
    SELECT a.name,
           sub.num_events
    FROM (
        SELECT account_id,
               COUNT(*) num_events
        FROM web_events
        GROUP BY 1
    ) sub
    JOIN accounts a
    ON a.id = sub.account_id
    ```

3. Use `EXPLAIN` to see how a query is run, and then modify the steps that are most expensive.
    ```SQL
    EXPLAIN
    SELECT a.name,
           sub.num_events
    FROM (
        SELECT account_id,
               COUNT(*) num_events
        FROM web_events
        GROUP BY 1
    ) sub
    JOIN accounts a
    ON a.id = sub.account_id
    ```
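
Related to the first method, one way to try out an aggregate query on less data is to apply `LIMIT` inside a subquery, so that the aggregation only sees the limited rows. The sketch below uses a tiny made-up stand-in for `web_events` (the ids and account ids are assumptions), and PostgreSQL syntax is assumed.

```SQL
/* Hypothetical stand-in for the web_events table */
WITH web_events (id, account_id) AS (
    VALUES (1, 1001), (2, 1001), (3, 1011), (4, 1011), (5, 1021)
)
SELECT account_id,
       COUNT(*) AS num_events
FROM (
    /* LIMIT runs here, before the aggregation in the outer query */
    SELECT *
    FROM web_events
    LIMIT 3
) sub
GROUP BY 1;
```

The counts come from a sample only, and without an `ORDER BY` the sampled rows are not guaranteed, so this is suited to checking query logic rather than producing final numbers.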
    
## A Useful Example of Joining Subqueries to Improve Performance

Say we'd like to get the following metrics day by day:
- number of active sales reps 
- number of orders placed
- number of web visits 

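To do that, we could join the `orders` and `web_events` tables on the date and then aggregate. However, joining before aggregating pairs every order with every web event that occurred on the same day, which multiplies the number of rows and makes the query slow. Instead, we can pre-aggregate each table in a subquery and then join the much smaller results: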

```SQL
/* The joined results include dates that appear in only one of the two subqueries.
   For those rows one date column is NULL, so COALESCE takes the date from the other side. */
SELECT COALESCE(orders.date, web_events.date) AS date,
       orders.active_sales_rep,
       orders.orders,
       web_events.web_visits
FROM (
    SELECT DATE_TRUNC('day', o.occurred_at) AS date,
           /* DISTINCT so that each sales rep is counted once per day */
           COUNT(DISTINCT a.sales_rep_id) AS active_sales_rep,
           COUNT(o.id) AS orders
    FROM accounts a
    JOIN orders o
      ON o.account_id = a.id
    GROUP BY 1
) orders

/* Use FULL JOIN to keep dates from both subqueries even when there is no matching date */
FULL JOIN (
    SELECT DATE_TRUNC('day', w.occurred_at) AS date,
           COUNT(w.id) AS web_visits
    FROM web_events w
    GROUP BY 1
) web_events
ON orders.date = web_events.date
ORDER BY 1, 2, 3
```
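
To see why joining before aggregating is expensive, the sketch below joins two tiny made-up tables on the day. The inline `orders` and `web_events` data are assumptions for illustration, and PostgreSQL syntax is assumed.

```SQL
/* Hypothetical data: 2 orders and 3 web events on the same day */
WITH orders (id, occurred_at) AS (
    VALUES (1, TIMESTAMP '2016-01-01 09:00'),
           (2, TIMESTAMP '2016-01-01 15:00')
),
web_events (id, occurred_at) AS (
    VALUES (10, TIMESTAMP '2016-01-01 08:00'),
           (11, TIMESTAMP '2016-01-01 12:00'),
           (12, TIMESTAMP '2016-01-01 20:00')
)
SELECT DATE_TRUNC('day', o.occurred_at) AS date,
       COUNT(*) AS joined_rows,
       COUNT(DISTINCT o.id) AS orders,
       COUNT(DISTINCT w.id) AS web_visits
FROM orders o
JOIN web_events w
  ON DATE_TRUNC('day', o.occurred_at) = DATE_TRUNC('day', w.occurred_at)
GROUP BY 1;
/* Result for this sample: joined_rows = 6, orders = 2, web_visits = 3 */
```

The join produces 2 x 3 = 6 rows for a single day, and `COUNT(DISTINCT ...)` is needed to recover the true counts. On real tables with many rows per day this multiplication is what makes the join-first approach slow.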



# Resources


[Understanding `UNION` syntax and examples](https://www.techonthenet.com/sql/union.php).


Practices

- [HackerRank](https://www.hackerrank.com/domains/sql) 
- [ModeAnalytics](https://community.modeanalytics.com/sql/tutorial/sql-business-analytics-training/). We strongly recommend these case study projects. You should have a strong handle on SQL before attempting these problems.
- The skill test by [AnalyticsVidhya](https://www.analyticsvidhya.com/blog/2017/01/46-questions-on-sql-to-test-a-data-science-professional-skilltest-solution/) is a fun test to take too.