# Lesson 9 - SQL Aggregations - Part 2 of 2

## `DISTINCT`

`DISTINCT` is used in `SELECT` statements, and it returns the unique rows across all columns listed in the `SELECT` clause. Therefore, you write `DISTINCT` only once in any particular `SELECT` statement, directly after the `SELECT` keyword.

You could write:

`SELECT DISTINCT column1, column2, column3
FROM table1;`
which would return the unique (or DISTINCT) rows across all three columns.

You would not write:

`SELECT DISTINCT column1, DISTINCT column2, DISTINCT column3
FROM table1;`

You can think of `DISTINCT` the same way you might think of the word "unique".
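To make the "unique rows across all columns" behavior concrete, here is a minimal sketch in PostgreSQL syntax. The inline `VALUES` rows are hypothetical stand-ins for `table1`; they are not part of the Parch & Posey data.

```sql
-- Hypothetical inline rows standing in for table1.
WITH table1 (column1, column2) AS (
    VALUES ('a', 1),
           ('a', 1),   -- exact duplicate of the first row
           ('a', 2),   -- same column1, different column2
           ('b', 1)
)
SELECT DISTINCT column1, column2
FROM table1
ORDER BY column1, column2;
-- Returns 3 rows: ('a', 1), ('a', 2), ('b', 1)
```

Only the exact duplicate is dropped. The rows `('a', 1)` and `('a', 2)` both survive because uniqueness is judged on the combination of columns, not on each column separately.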

### `DISTINCT` - Expert Tip

It’s worth noting that using `DISTINCT`, particularly in aggregations, can slow your queries down quite a bit.
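For reference, this is roughly what `DISTINCT` inside an aggregation looks like. It is a small sketch on hypothetical inline rows (the name `web_events_sample` is made up for illustration), and it shows the syntax and result only, not the performance cost.

```sql
-- Hypothetical sample of web events; not the real web_events table.
WITH web_events_sample (account_id, channel) AS (
    VALUES (1, 'direct'),
           (1, 'direct'),
           (1, 'facebook'),
           (2, 'direct')
)
SELECT account_id,
       COUNT(channel)          AS num_events,    -- counts every row
       COUNT(DISTINCT channel) AS num_channels   -- counts each channel once
FROM web_events_sample
GROUP BY account_id
ORDER BY account_id;
-- account 1: 3 events, 2 channels; account 2: 1 event, 1 channel
```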

*Examples:*

<img src="../SQL/ERD DAND.jpg" width="600" height="400">

Use DISTINCT to test if there are any accounts associated with more than one region.

    The two queries below return the same number of rows (351), so we know that every account is associated with only one region. If any account were associated with more than one region, the first query would return more rows than the second query.

    `SELECT a.id as "account id", r.id as "region id", 
    a.name as "account name", r.name as "region name"
    FROM accounts a
    JOIN sales_reps s
    ON s.id = a.sales_rep_id
    JOIN region r
    ON r.id = s.region_id;`
    
    and

    `SELECT DISTINCT id, name
    FROM accounts;`


Have any sales reps worked on more than one account?

Yes, all sales reps have worked on more than one account. The fewest accounts any sales rep has is 3.

`SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id, s.name
ORDER BY num_accounts;`

This query confirms that all sales reps are accounted for (there are 50 of them):

`SELECT DISTINCT id, name
FROM sales_reps;`




## `HAVING`

`HAVING` works like a `WHERE` clause, except that its condition can use aggregate functions. A `WHERE` clause cannot, because it filters individual rows before they are grouped and aggregated.

`HAVING` is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery. Essentially, any time you want to perform a `WHERE` on an element of your query that was created by an aggregate, you need to use `HAVING` instead.
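The two approaches mentioned here can be compared side by side. The sketch below uses hypothetical inline rows (`orders_sample` is an illustrative name, not a real table) and PostgreSQL syntax; both statements should return the same accounts.

```sql
-- Version 1: filter the aggregate directly with HAVING.
WITH orders_sample (account_id, total_amt_usd) AS (
    VALUES (1, 500.00), (1, 700.00), (2, 300.00), (3, 2000.00)
)
SELECT account_id, SUM(total_amt_usd) AS total_spent
FROM orders_sample
GROUP BY account_id
HAVING SUM(total_amt_usd) > 1000
ORDER BY account_id;

-- Version 2: aggregate in a subquery, then filter its output with WHERE.
WITH orders_sample (account_id, total_amt_usd) AS (
    VALUES (1, 500.00), (1, 700.00), (2, 300.00), (3, 2000.00)
)
SELECT account_id, total_spent
FROM (
    SELECT account_id, SUM(total_amt_usd) AS total_spent
    FROM orders_sample
    GROUP BY account_id
) AS totals
WHERE total_spent > 1000
ORDER BY account_id;
-- Both versions return accounts 1 (total 1200) and 3 (total 2000).
```

In the second version `WHERE` is legal because, from the outer query's point of view, `total_spent` is an ordinary column rather than an aggregate.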




Key takeaways - `HAVING` vs `WHERE`:

- `WHERE` subsets the returned data based on a logical condition.

- `WHERE` appears **after** the `FROM`, `JOIN`, and `ON` clauses, but **before** `GROUP BY`.

- `HAVING` appears **after** the `GROUP BY` clause, but **before** the `ORDER BY` clause.

- `HAVING` is like `WHERE`, but it works on logical statements involving aggregations.
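The clause order in these takeaways can be seen in one query that uses both filters. This is a minimal sketch on hypothetical inline rows (`web_events_sample` is an assumed name for illustration), written for PostgreSQL.

```sql
-- Hypothetical sample of web events; not the real web_events table.
WITH web_events_sample (account_id, channel) AS (
    VALUES (1, 'facebook'), (1, 'facebook'), (1, 'direct'),
           (2, 'facebook'), (2, 'direct'),   (2, 'direct')
)
SELECT account_id, COUNT(*) AS facebook_events
FROM web_events_sample
WHERE channel = 'facebook'      -- row-level filter, applied before grouping
GROUP BY account_id
HAVING COUNT(*) > 1             -- group-level filter, applied after aggregation
ORDER BY facebook_events DESC;
-- Returns one row: account 1 with 2 facebook events
```

Account 2 is dropped by `HAVING` because only one of its rows survives the `WHERE` filter.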


*Examples:*

How many of the sales reps have more than 5 accounts that they manage?

`SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id, s.name
HAVING COUNT(*) > 5
ORDER BY num_accounts;`


How many accounts have more than 20 orders?

`SELECT a.id, a.name, COUNT(*) num_orders
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING COUNT(*) > 20
ORDER BY num_orders;`


Which account has the most orders?

`SELECT a.id, a.name, COUNT(*) num_orders
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY num_orders DESC
LIMIT 1;`


How many accounts spent more than 30,000 usd total across all orders?

204 results.

`SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING SUM(o.total_amt_usd) > 30000
ORDER BY total_spent;`


How many accounts spent less than 1,000 usd total across all orders?

3 results.

`SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING SUM(o.total_amt_usd) < 1000
ORDER BY total_spent;`


Which account has spent the most with us?

EOG Resources

`SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY total_spent DESC
LIMIT 1;`


Which account has spent the least with us?

Nike

`SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY total_spent
LIMIT 1;`


Which accounts used facebook as a channel to contact customers more than 6 times?

`SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
HAVING COUNT(*) > 6 AND w.channel = 'facebook'
ORDER BY use_of_channel;`


Which account used facebook most as a channel? 

Gilead Sciences

`SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE w.channel = 'facebook'
GROUP BY a.id, a.name, w.channel
ORDER BY use_of_channel DESC
LIMIT 1;`


Which channel was most frequently used by most accounts?

All the top 10 are direct.

`SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
ORDER BY use_of_channel DESC
LIMIT 10;`

## Working with Dates

A comprehensive guide on DATE/TIME functions can be found here: https://www.postgresql.org/docs/9.1/static/functions-datetime.html

## `DATE_TRUNC`

Allows you to truncate a date-time value to a chosen level of precision. Common levels are `day`, `month`, and `year`; every part of the value smaller than the chosen level is reset to its starting value.


<img src="../SQL/DATE_TRUNC examples.png" width="600" height="400">


Read this article for more information: https://blog.modeanalytics.com/date-trunc-sql-timestamp-function-count-on/
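As a quick illustration of truncation, here is a small sketch that can be run in PostgreSQL without any tables. The timestamp literal is an arbitrary example value.

```sql
-- Truncate one example timestamp to three different levels.
SELECT DATE_TRUNC('day',   TIMESTAMP '2017-04-15 12:15:01') AS to_day,    -- 2017-04-15 00:00:00
       DATE_TRUNC('month', TIMESTAMP '2017-04-15 12:15:01') AS to_month,  -- 2017-04-01 00:00:00
       DATE_TRUNC('year',  TIMESTAMP '2017-04-15 12:15:01') AS to_year;   -- 2017-01-01 00:00:00
```

The result is still a full timestamp, which is why `DATE_TRUNC('month', ...)` keeps the year and so distinguishes April 2016 from April 2017.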


## `DATE_PART`

Allows you to extract a specific portion of a date. Here are examples:

<img src="../SQL/DATE_PART examples.png" width="600" height="400">


Another example is `DOW` which is "Day of the Week". 0 = Sunday and 6 = Saturday.
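The same arbitrary example timestamp can be used to sketch `DATE_PART` in PostgreSQL. Unlike `DATE_TRUNC`, each call returns a single number rather than a timestamp.

```sql
-- Pull individual parts out of one example timestamp.
SELECT DATE_PART('year',  TIMESTAMP '2017-04-15 12:15:01') AS yr,   -- 2017
       DATE_PART('month', TIMESTAMP '2017-04-15 12:15:01') AS mon,  -- 4
       DATE_PART('dow',   TIMESTAMP '2017-04-15 12:15:01') AS dow;  -- 6 (a Saturday)
```

Because the year is discarded, `DATE_PART('month', ...)` groups every April together regardless of year.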


You can reference the columns in your select statement in `GROUP BY` and `ORDER BY` clauses with numbers that follow the order they appear in the `SELECT` statement. For example:

    SELECT standard_qty, COUNT(*)
    FROM orders
    GROUP BY 1  -- 1 refers to standard_qty, the first column in the SELECT statement
    ORDER BY 1; -- likewise refers to standard_qty


__*Examples of DATE / TIME functions:*__

Find the sales in terms of total dollars for all orders in each year, ordered from greatest to least. Do you notice any trends in the yearly sales totals?

`SELECT DATE_PART('year', occurred_at) ord_year,  SUM(total_amt_usd) total_spent
 FROM orders
 GROUP BY 1
 ORDER BY 2 DESC;`
 
Note: In the yearly totals, 2013 and 2017 are much smaller than all other years. The monthly data shows why: each of these years has only one month of sales (month 12, December, for 2013 and month 1, January, for 2017), so neither year is evenly represented. Sales have otherwise been increasing year over year, with 2016 being the largest to date. At this rate, we might expect 2017 to have the largest sales.


Which month did Parch & Posey have the greatest sales in terms of total dollars? Are all months evenly represented by the dataset?

    The greatest sales amounts occur in December (12). 

`SELECT DATE_PART('month', occurred_at) AS Month, SUM(total_amt_usd) 
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY 2 DESC;`


Which year did Parch & Posey have the greatest sales in terms of total number of orders? Are all years evenly represented by the dataset?

    Again, 2016 has by far the most orders, but again 2013 and 2017 are not evenly represented compared with the other years in the dataset.


`SELECT DATE_PART('year', occurred_at) ord_year, COUNT(*) total_sales
FROM orders
GROUP BY 1
ORDER BY 2 DESC;`



Which month did Parch & Posey have the greatest sales in terms of total number of orders? Are all months evenly represented by the dataset?

    December still has the most sales, but interestingly, November has the second most sales (but not the most dollar sales). To make a fair comparison from one month to another, the 2017 and 2013 data were removed.

`SELECT DATE_PART('month', occurred_at) ord_month, COUNT(*) total_sales
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY 2 DESC;`



In which month of which year did Walmart spend the most on gloss paper in terms of dollars?

    May 2016.
    
`SELECT DATE_TRUNC('month', o.occurred_at) ord_date, SUM(o.gloss_amt_usd) tot_spent
FROM orders o 
JOIN accounts a
ON a.id = o.account_id
WHERE a.name = 'Walmart'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;`


## `CASE` and Derived Columns

This is just like doing an if-then-else in programming. Here are some tips:

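- The `CASE` statement usually goes in the `SELECT` clause, as in every example here, although SQL accepts a `CASE` expression anywhere an expression is valid.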

- `CASE` must include the following components: `WHEN`, `THEN`, and `END`. `ELSE` is an optional component to catch cases that didn’t meet any of the other previous CASE conditions.

- You can write any conditional statement between `WHEN` and `THEN`, using any conditional operator you would use in a `WHERE` clause. This includes stringing together multiple conditional statements using `AND` and `OR`.

- You can include multiple `WHEN` statements, followed by an `ELSE` statement to deal with any unaddressed conditions.

- A `WHERE` clause can reproduce the result of a single `CASE` condition, but only one condition at a time, because it filters rows instead of labeling them. If there are multiple conditions, use a `CASE`.
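The tips above can be sketched in one small query. It runs in PostgreSQL on hypothetical inline rows; the name `orders_sample`, the `qty` column, and the size thresholds are all made up for illustration.

```sql
-- Hypothetical sample orders; thresholds are arbitrary.
WITH orders_sample (id, total_amt_usd, qty) AS (
    VALUES (1, 3500.00, 600),
           (2,  900.00, 150),
           (3,   50.00,  10)
)
SELECT id,
       CASE WHEN total_amt_usd > 3000 OR qty > 500 THEN 'large'   -- first matching WHEN wins
            WHEN total_amt_usd > 500 AND qty > 100 THEN 'medium'
            ELSE 'small' END AS order_size                        -- ELSE catches everything else
FROM orders_sample
ORDER BY id;
-- id 1 -> 'large', id 2 -> 'medium', id 3 -> 'small'
```

A `WHERE total_amt_usd > 3000 OR qty > 500` filter would return only the 'large' order, so getting all three labels that way would take three separate queries.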

_**Illustrated example with ZERO DIVISION ERROR:**_

Create a column that divides the `standard_amt_usd` by the `standard_qty` to find the unit price for standard paper for each order. Limit the results to the first 10 orders, and include the `id` and `account_id` fields. NOTE - dividing directly throws a division-by-zero error, which is why the query uses a `CASE` statement to return 0 whenever `standard_qty` is 0 or `NULL`.

`SELECT id, account_id, CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
                        ELSE standard_amt_usd/standard_qty END AS unit_price
FROM orders
LIMIT 10;`
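To see the guard working on the problem rows specifically, here is a minimal PostgreSQL sketch. The inline rows are hypothetical (`orders_sample` is not a real table) and were chosen to include a zero and a `NULL` quantity.

```sql
-- Hypothetical sample orders, including the two awkward cases.
WITH orders_sample (id, standard_qty, standard_amt_usd) AS (
    VALUES (1, 100,  499.00),
           (2, 0,    0.00),    -- would raise division by zero if divided directly
           (3, NULL, NULL)     -- would produce NULL if divided directly
)
SELECT id,
       CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
            ELSE standard_amt_usd / standard_qty END AS unit_price
FROM orders_sample
ORDER BY id;
-- id 1 gets 499.00 / 100 = 4.99; ids 2 and 3 get 0 instead of an error or NULL
```

The division is only evaluated for rows that reach the `ELSE` branch, so the zero-quantity row never triggers the error.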


_**Other Examples:**_


<img src="../SQL/ERD DAND.jpg" width="600" height="400">

We would like to understand 3 different levels of customers based on the amount associated with their purchases. The top branch includes anyone with a Lifetime Value (total sales of all orders) greater than 200,000 usd. The second branch is between 200,000 and 100,000 usd. The lowest branch is anyone under 100,000 usd. Provide a table that includes the level associated with each account. You should provide the account name, the total sales of all orders for the customer, and the level. Order with the top spending customers listed first.

`SELECT a.name, SUM(total_amt_usd) total_spent, 
     CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
     WHEN  SUM(total_amt_usd) > 100000 THEN 'middle'
     ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id 
GROUP BY a.name
ORDER BY 2 DESC;`



We would now like to perform a similar calculation to the first, but we want to obtain the total amount spent by customers only in 2016 and 2017. Keep the same levels as in the previous question. Order with the top spending customers listed first.

`SELECT a.name, SUM(total_amt_usd) total_spent, 
     CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
     WHEN  SUM(total_amt_usd) > 100000 THEN 'middle'
     ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
WHERE occurred_at > '2015-12-31' 
GROUP BY 1
ORDER BY 2 DESC;`


We would like to identify top performing sales reps, which are sales reps associated with more than 200 orders. Create a table with the sales rep name, the total number of orders, and a column with top or not depending on if they have more than 200 orders. Place the top sales people first in your final table.

`SELECT s.name, COUNT(*) num_ords,
     CASE WHEN COUNT(*) > 200 THEN 'top'
     ELSE 'not' END AS sales_rep_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id 
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name
ORDER BY 2 DESC;`


The previous query didn't account for the middle, nor the dollar amount associated with the sales. Management decides they want to see these characteristics represented as well. We would like to identify top performing sales reps, which are sales reps associated with more than 200 orders or more than 750000 in total sales. The middle group has any rep with more than 150 orders or 500000 in sales. Create a table with the sales rep name, the total number of orders, total sales across all orders, and a column with top, middle, or low depending on these criteria. Place the top sales people based on dollar amount of sales first in your final table. You might see a few upset sales people by these criteria!

`SELECT s.name, COUNT(*), SUM(o.total_amt_usd) total_spent, 
     CASE WHEN COUNT(*) > 200 OR SUM(o.total_amt_usd) > 750000 THEN 'top'
     WHEN COUNT(*) > 150 OR SUM(o.total_amt_usd) > 500000 THEN 'middle'
     ELSE 'low' END AS sales_rep_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id 
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name
ORDER BY 3 DESC;`