# Dealing with Data Spring 2020 – Class 6

---

# GROUP BY

```sql
SELECT A1, aggregation_function(attr)  -- e.g. count(*), sum(attr), avg(attr), min(attr), max(attr)
FROM T1, T2, ... Tm
WHERE condition
GROUP BY A1
```

`count(*)` counts the number of rows in the group <br> 
`count(attr)` counts the number of rows in the group with non-null values for the attribute <br> 
`count(DISTINCT attr)` counts the number of distinct, non-null values of the attribute in the group <br> 
`max(attr)` returns the maximum value of the attribute in the group <br>
`min(attr)` returns the minimum value of the attribute in the group <br>
`sum(attr)` sums the non-null values of the attribute in the group <br>
`avg(attr)` computes the average of the non-null values of the attribute in the group
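
To see how these aggregates treat NULLs, here is a minimal sketch using Python's built-in `sqlite3` module. The `permits` table, its columns, and its rows are made up for illustration and are not the course database schema.

```python
import sqlite3

# Toy data only: table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (borough TEXT, category TEXT, days INTEGER)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?, ?)",
    [
        ("Brooklyn", "Film", 2),
        ("Brooklyn", "Film", 4),
        ("Brooklyn", "Television", None),  # NULL days
        ("Queens", "Television", 3),
        ("Queens", None, 5),               # NULL category
    ],
)

query = """
SELECT borough,
       count(*)                 AS n_rows,
       count(days)              AS n_days_known,
       count(DISTINCT category) AS n_categories,
       min(days)                AS min_days,
       max(days)                AS max_days,
       sum(days)                AS total_days,
       avg(days)                AS avg_days
FROM permits
GROUP BY borough
ORDER BY borough
"""
group_rows = conn.execute(query).fetchall()
for row in group_rows:
    print(row)
```

In this toy data, Brooklyn has three rows but only two known `days` values, so `count(*)` and `count(days)` differ, and `avg(days)` is taken over the two non-null values only.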

---

# Exercise 1: 

> Find the most popular Category for NYC filming permits

In [None]:
# your code here

---

# Exercise 2: 

>  Find the most popular Borough by Category for NYC filming permits.

In [None]:
# your code here

---

# Exercise 3: 

> Find the year and zipcode with the most tax returns in NYC

---

# HAVING

```sql 
SELECT A1, Aggregation Function
FROM T1, T2, ... Tm
WHERE condition
GROUP BY A1
HAVING Aggregation Function Condition
```
<br>

`WHERE` applies to rows _before_ computing the aggregate <br>
`HAVING` applies to the aggregate value only

`AKA, the WHERE clause applies its condition to individual rows before they are collected into groups by the GROUP BY clause. HAVING, meanwhile, applies its condition to each group after the rows have been grouped.`
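
The difference between the two filters can be sketched on a toy table. The `permits` table and its values below are assumptions for illustration, not the course data.

```python
import sqlite3

# Toy data only: table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (category TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [
        ("Film", 2018), ("Film", 2019), ("Film", 2019),
        ("Television", 2019), ("Television", 2019), ("Television", 2019),
        ("Commercial", 2018), ("Commercial", 2019),
    ],
)

query = """
SELECT category, count(*) AS n_permits
FROM permits
WHERE year = 2019        -- filters individual rows before grouping
GROUP BY category
HAVING count(*) >= 2     -- filters whole groups after aggregation
ORDER BY category
"""
having_rows = conn.execute(query).fetchall()
print(having_rows)
```

Here "Commercial" has two rows overall, but only one of them survives the `WHERE` filter, so the group fails the `HAVING` test and is dropped.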

---

# Exercise 6: 

> Find the subcategories with more than 5,000 NYC filming permits in the database.

In [None]:
# your code here

---

# Exercise 7: 

> Find the year where "Commercials" had fewer than 700 filming permits.



In [None]:
# your code here

---

# Conditional Construction: CASE

```SQL
CASE
    WHEN condition THEN result
    WHEN condition2 THEN result2
    ELSE result
END
```
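
A minimal sketch of `CASE` in a query, run through Python's `sqlite3`. The `tax_returns` table, the thresholds, and the bucket names are all made up for illustration. Conditions are checked top to bottom and the first true one wins, so the NULL check comes first.

```python
import sqlite3

# Toy data only: table, columns, and thresholds are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tax_returns (zipcode TEXT, return_count INTEGER)")
conn.executemany(
    "INSERT INTO tax_returns VALUES (?, ?)",
    [("10001", 50), ("10002", 500), ("10003", 5000), ("10004", None)],
)

query = """
SELECT zipcode,
       CASE
           WHEN return_count IS NULL THEN 'unknown'
           WHEN return_count < 100   THEN 'low'
           WHEN return_count < 1000  THEN 'medium'
           ELSE 'high'
       END AS bucket
FROM tax_returns
ORDER BY zipcode
"""
case_rows = conn.execute(query).fetchall()
print(case_rows)
```

Without the explicit `IS NULL` branch, a NULL `return_count` would fail both comparisons and fall through to `ELSE`, which is rarely what you want.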

---

# Exercise 8:

> Map each zipcode and year's return counts to a "return count" bucket

In [None]:
# your code here

---

# Exercise 9: 

> Make the results more human friendly by giving a short name for the agi_map_id field results.

In [None]:
# your code here

---

# Exercise 10: 

> Flatten the IRS database by putting the return_count in a separate column for each agi_map_id.

In [None]:
# your code here

---

# SQL JOINS

First and foremost, any join requires a common data field found in both tables that enables the combination. Those fields don't need to share a name, but they should hold comparable values, ideally of the same data type!

NB: LEFT and RIGHT are determined by the order in which the tables appear after the FROM keyword!

<img src="https://i.stack.imgur.com/VQ5XP.png">

# INNER JOIN

> This is the default SQL JOIN, thus you don't even need to specify "INNER". <br> <br>

1) A new table is created with the columns of both tables you're trying to combine, <br><br>
2) It then looks to match values between the columns you specify in your 'ON' clause, <br><br>
3) Conceptually, SQL will start with the first value of the specified column in the first table, then look through every value in the specified column of the second table for a match, <br><br>
4) If there is a match, SQL will copy the data from both the row of the first table and the row of the second table and put it in the newly created table. It won't add any rows that didn't have a match.

<br>

`AKA an INNER JOIN is going to return only the rows where the comparison fields (in the "ON" clause) match in BOTH tables.`
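
The matching behavior can be sketched with two small tables. The `tax_returns` and `buckets` tables, their columns, and the labels are hypothetical and chosen only so that each table has one row without a partner. Note that the join columns (`agi_map_id` and `id`) have different names.

```python
import sqlite3

# Toy data only: tables, columns, and labels are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tax_returns (zipcode TEXT, agi_map_id INTEGER, return_count INTEGER)"
)
conn.execute("CREATE TABLE buckets (id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO tax_returns VALUES (?, ?, ?)",
    [("10001", 1, 120), ("10001", 2, 80), ("10002", 9, 40)],  # id 9 has no bucket
)
conn.executemany(
    "INSERT INTO buckets VALUES (?, ?)",
    [(1, "bucket A"), (2, "bucket B"), (3, "bucket C")],  # id 3 is never used
)

query = """
SELECT r.zipcode, b.label, r.return_count
FROM tax_returns AS r
JOIN buckets AS b            -- plain JOIN means INNER JOIN
  ON r.agi_map_id = b.id
ORDER BY r.zipcode, b.id
"""
inner_rows = conn.execute(query).fetchall()
print(inner_rows)
```

The row with `agi_map_id = 9` and the bucket with `id = 3` both disappear from the result because neither has a match on the other side.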

In [None]:
from IPython.display import HTML
HTML('<img src="https://dataschool.com/assets/images/how-to-teach-people-sql/innerJoin/innerJoin_3.gif">')

---

# Exercise 11: 

> Combine the IRS tax returns with the human-readable form of the income buckets.

In [None]:
# your code here

---

# Exercise 12: 

> Find if there's a potential relationship between total filming permits and annual average tax returns for the top 10 zip codes for filming permits. 

In [None]:
# your code here

---

# LEFT JOIN

> LEFT refers to the first table, or the table you'll be joining to. <br> <br>
> For any rows in the LEFT table that don't have a match in the RIGHT table, it still adds that row to the new table but puts nulls for the missing columns.

<br> 

`AKA a LEFT JOIN is going to return all rows from the LEFT table, along with the matching rows from the RIGHT table (or NULLs where there is no match)`
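
A minimal sketch of the NULL-filling behavior, assuming two made-up tables `zips` and `permits` (not the course schema):

```python
import sqlite3

# Toy data only: tables and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zips (zipcode TEXT)")
conn.execute("CREATE TABLE permits (permit_id INTEGER, zipcode TEXT)")
conn.executemany("INSERT INTO zips VALUES (?)", [("10001",), ("10002",), ("10003",)])
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [(1, "10001"), (2, "10001"), (3, "10003")],  # no permits in 10002
)

query = """
SELECT z.zipcode, p.permit_id
FROM zips AS z                 -- LEFT table: every row is kept
LEFT JOIN permits AS p         -- RIGHT table: NULLs where nothing matches
  ON z.zipcode = p.zipcode
ORDER BY z.zipcode, p.permit_id
"""
left_rows = conn.execute(query).fetchall()
print(left_rows)
```

Zipcode 10002 still appears in the result, with `None` (SQL NULL) in the `permit_id` column.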

In [None]:
from IPython.display import HTML
HTML('<img src="https://dataschool.com/assets/images/how-to-teach-people-sql/leftJoin/leftJoin_1.gif">')

---

# Exercise 13: 

> Find the bottom 20 zipcodes for filming permits.

In [None]:
# your code here

---

# Exercise 14: 

> Find the percentage of children living in the top 5 filming permit zipcodes over all years in the database.

In [None]:
# your code here

---

# RIGHT JOIN

> RIGHT refers to the second table, or the table you'll be joining in. <br> <br>
> A RIGHT JOIN can be rewritten as a LEFT JOIN by swapping the order of the two tables, and is thus much rarer in practice.
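
The rewrite can be sketched as follows. Tables `a` and `b` are hypothetical. SQLite only gained `RIGHT JOIN` in version 3.39.0, so this sketch always runs the LEFT JOIN form and runs the native form only when the installed SQLite supports it.

```python
import sqlite3

# Toy data only: tables and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (k INTEGER, a_val TEXT)")
conn.execute("CREATE TABLE b (k INTEGER, b_val TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(1, "a1"), (2, "a2")])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(2, "b2"), (3, "b3")])

# "a RIGHT JOIN b" keeps every row of b; so does "b LEFT JOIN a".
right_join = """
SELECT a.a_val, b.b_val FROM a RIGHT JOIN b ON a.k = b.k ORDER BY b.k
"""
left_rewrite = """
SELECT a.a_val, b.b_val FROM b LEFT JOIN a ON a.k = b.k ORDER BY b.k
"""

rewrite_rows = conn.execute(left_rewrite).fetchall()
print(rewrite_rows)

# Only attempt the native syntax where SQLite is new enough to parse it.
if sqlite3.sqlite_version_info >= (3, 39, 0):
    native_rows = conn.execute(right_join).fetchall()
    print(native_rows == rewrite_rows)
```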

---

# FULL OUTER JOIN

> FULL OUTER keeps the rows of both tables, whether or not they have a match (in this case, the Facebook table is the first table since it comes before LinkedIn in the query). <br> <br>
> After completing a LEFT join of the data, it basically performs a RIGHT join, checking whether each row of the second table is already present in the joined table. If it is not, SQL will add this row to the new table and put nulls for the columns from the other table.

`AKA, a FULL OUTER JOIN returns all rows from both tables (subject to the query's WHERE clause), and when the ON condition can't be satisfied for a row, NULLs are put in for the columns of the other table`
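
A sketch of this behavior on toy data. The `facebook` and `linkedin` tables and their contents are made up. Because SQLite versions before 3.39.0 lack `FULL OUTER JOIN`, the sketch emulates it with a LEFT JOIN plus a `UNION ALL` of the unmatched rows from the second table, and runs the native syntax only where it is available.

```python
import sqlite3

# Toy data only: tables, columns, and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facebook (name TEXT, friends INTEGER)")
conn.execute("CREATE TABLE linkedin (name TEXT, connections INTEGER)")
conn.executemany("INSERT INTO facebook VALUES (?, ?)", [("Matt", 300), ("Lisa", 500)])
conn.executemany("INSERT INTO linkedin VALUES (?, ?)", [("Lisa", 200), ("Sarah", 100)])

# Emulation: all facebook rows (matched or not), plus the linkedin rows
# that found no partner in facebook.
emulated = """
SELECT f.name AS fb_name, f.friends, l.name AS li_name, l.connections
FROM facebook AS f LEFT JOIN linkedin AS l ON f.name = l.name
UNION ALL
SELECT f.name, f.friends, l.name, l.connections
FROM linkedin AS l LEFT JOIN facebook AS f ON f.name = l.name
WHERE f.name IS NULL
"""
native = """
SELECT f.name AS fb_name, f.friends, l.name AS li_name, l.connections
FROM facebook AS f FULL OUTER JOIN linkedin AS l ON f.name = l.name
"""

emulated_rows = conn.execute(emulated).fetchall()
for row in emulated_rows:
    print(row)

# Native syntax needs SQLite 3.39.0 or newer; compare as sets since
# neither query specifies a row order.
if sqlite3.sqlite_version_info >= (3, 39, 0):
    native_rows = conn.execute(native).fetchall()
    print(set(native_rows) == set(emulated_rows))
```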

In [None]:
from IPython.display import HTML
HTML('<img src="https://dataschool.com/assets/images/how-to-teach-people-sql/fullOuter/fullOuter_1.gif">')

---

# UNION

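`UNION` combines the result sets of two or more SELECT statements into one, removing duplicate rows (`UNION ALL` keeps them). Each SELECT must return the same number of columns.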
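
A minimal sketch of stacking two result sets, assuming two made-up tables `t1` and `t2` that share one value:

```python
import sqlite3

# Toy data only: tables and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (name TEXT)")
conn.execute("CREATE TABLE t2 (name TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?)", [("Ann",), ("Bob",)])
conn.executemany("INSERT INTO t2 VALUES (?)", [("Bob",), ("Cy",)])

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT name FROM t1 UNION SELECT name FROM t2 ORDER BY name"
).fetchall()

# UNION ALL keeps every row, duplicates included.
union_all_rows = conn.execute(
    "SELECT name FROM t1 UNION ALL SELECT name FROM t2 ORDER BY name"
).fetchall()

print(union_rows)
print(union_all_rows)
```

Unlike a join, which adds columns side by side, `UNION` appends rows, so "Bob" shows up once with `UNION` and twice with `UNION ALL`.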

In [None]:
from IPython.display import HTML
HTML('<img src="https://dataschool.com/assets/images/how-to-teach-people-sql/union/union_2.gif">')

---

# CROSS JOIN

`CROSS JOIN` joins every row from Table 1 with every row from Table 2. 

`AKA, a CROSS JOIN returns all possible combinations of all rows.`
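
As a small sketch, assuming two made-up lookup tables `years` and `boroughs` with two rows each, a cross join yields 2 × 2 = 4 combinations:

```python
import sqlite3

# Toy data only: tables and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE years (year INTEGER)")
conn.execute("CREATE TABLE boroughs (borough TEXT)")
conn.executemany("INSERT INTO years VALUES (?)", [(2018,), (2019,)])
conn.executemany("INSERT INTO boroughs VALUES (?)", [("Brooklyn",), ("Queens",)])

query = """
SELECT y.year, b.borough
FROM years AS y
CROSS JOIN boroughs AS b     -- no ON clause: every pairing is produced
ORDER BY y.year, b.borough
"""
cross_rows = conn.execute(query).fetchall()
print(cross_rows)
```

The row count is the product of the two table sizes, so cross joins on large tables grow very quickly.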

In [None]:
from IPython.display import HTML
HTML('<img src="https://dataschool.com/assets/images/how-to-teach-people-sql/crossJoin/crossJoin_1.gif">')

---

# Subqueries / Nested Queries

A subquery is a SELECT statement nested inside another query. Placed where a table would normally go (for example, in the FROM clause), it acts as a temporary table, which allows for deeper analysis with SQL.
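
A minimal sketch of a subquery used in place of a table. The `permits` table and its rows are hypothetical. The inner query counts permits per zipcode, and the outer query then averages those counts.

```python
import sqlite3

# Toy data only: table and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (permit_id INTEGER, zipcode TEXT)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [
        (1, "10001"), (2, "10001"), (3, "10001"),
        (4, "10002"),
        (5, "10003"), (6, "10003"),
    ],
)

query = """
SELECT avg(n_permits) AS avg_permits_per_zip
FROM (
    -- inner query: one row per zipcode with its permit count
    SELECT zipcode, count(*) AS n_permits
    FROM permits
    GROUP BY zipcode
) AS per_zip
"""
avg_permits = conn.execute(query).fetchone()[0]
print(avg_permits)
```

The outer query cannot be written as a single `GROUP BY`, because it aggregates over values that are themselves aggregates; the subquery supplies them as an intermediate table.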

---

> Find the top 10 zipcodes with the largest percentage of persons of Latino/Hispanic descent, and calculate the percentage of adults and percentage of six-figure income earners in those zipcodes. 

---

# Exercise 15: 

> What year had the most return counts from the Borough of Brooklyn?

In [None]:
# your code here

---

# Exercise 16: 

> In zipcodes where the majority of film permits have the sub category "Commercial", what is the predominant age group?  

In [None]:
# your code here

---