# Dealing with Data Spring 2022 – Class 10

---

In [None]:
import sqlite3 # this is how we will import the sqlite3 functionality needed to proceed

In [None]:
con = sqlite3.connect('class10_data.db') # this creates (or opens) our database file,
                                         # called 'class10_data.db'

                                         # note that if the db doesn't exist, this will create it; otherwise, it will connect to the existing file

# "con" stands for "connection": it tells SQLite which database to use


In [None]:
import pandas as pd

irs_agi_map = pd.read_csv("./irs_agi_map.csv")
irs_nyc_tax_returns = pd.read_csv("./irs_nyc_tax_returns.csv")
nyc_census_data = pd.read_csv("./nyc_census_data.csv")
nyc_film_permits = pd.read_csv("./nyc_film_permits.csv")

In [None]:
irs_agi_map.head()

In [None]:
irs_nyc_tax_returns.head()

In [None]:
nyc_census_data.head()

In [None]:
nyc_film_permits.head()

In [None]:
irs_agi_map.to_sql(name='irs_agi_map', con=con, if_exists='replace') # 'replace' lets this cell be re-run without a "table already exists" error

In [None]:
check = pd.read_sql("SELECT * FROM irs_agi_map LIMIT 3", con=con)
check

In [None]:
irs_nyc_tax_returns.to_sql(name='irs_nyc_tax_returns', con=con, if_exists='replace') # 'replace' lets this cell be re-run without a "table already exists" error

In [None]:
check = pd.read_sql("SELECT * FROM irs_nyc_tax_returns LIMIT 3", con=con)
check

In [None]:
nyc_census_data.to_sql(name='nyc_census_data', con=con, if_exists='replace') # 'replace' lets this cell be re-run without a "table already exists" error

In [None]:
check = pd.read_sql("SELECT * FROM nyc_census_data LIMIT 3", con=con)
check

In [None]:
nyc_film_permits.to_sql(name='nyc_film_permits', con=con, if_exists='replace') # 'replace' lets this cell be re-run without a "table already exists" error

In [None]:
check = pd.read_sql("SELECT * FROM nyc_film_permits LIMIT 3", con=con)
check

---

# ⭕ **QUESTIONS?**

---

# GROUP BY

```sql
SELECT A1, Aggregation Function [count(*), sum(attr), avg(attr), min(attr), etc.]
FROM T1, T2, ... Tm
WHERE condition
GROUP BY A1
```

`count(*)` counts the number of rows in the group <br>
`count(attr)` counts the number of rows in the group with non-null values for the attribute <br>
`count(DISTINCT attr)` counts the number of distinct, non-null values of the attribute in the group <br>
`max(attr)` returns the maximum value of the attribute in the group <br>
`min(attr)` returns the minimum value of the attribute in the group <br>
`sum(attr)` sums the non-null values of the attribute over the rows in the group <br>
`avg(attr)` computes the average of the non-null values of the attribute in the group
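
The difference between the three `count` variants is easiest to see on a tiny table. The sketch below uses a separate in-memory database and a made-up `permits` table (both are assumptions for illustration, not part of the class data), so it does not touch `class10_data.db`.

```python
import sqlite3

# Hypothetical toy table in a throwaway in-memory database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE permits (borough TEXT, category TEXT)")
demo.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [
        ("Brooklyn", "Film"),
        ("Brooklyn", "Film"),
        ("Brooklyn", None),       # NULL category
        ("Queens", "Television"),
    ],
)

rows = demo.execute("""
    SELECT borough,
           COUNT(*)                 AS n_rows,      -- every row in the group
           COUNT(category)          AS n_non_null,  -- rows where category is not NULL
           COUNT(DISTINCT category) AS n_distinct   -- distinct non-NULL categories
    FROM permits
    GROUP BY borough
    ORDER BY borough
""").fetchall()

print(rows)  # [('Brooklyn', 3, 2, 1), ('Queens', 1, 1, 1)]
```

Brooklyn has three rows, only two of which have a category, and those two share a single distinct value.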

# Exercise 1: 

> Find the most popular Category for NYC filming permits.

In [None]:
# your code here

# Solution

In [None]:
check = pd.read_sql("SELECT Category, COUNT(DISTINCT EventID) as film_permits FROM nyc_film_permits GROUP BY Category ORDER BY film_permits DESC", con=con)
check

# Exercise 2: 

>  Find the most popular Borough by Category for NYC filming permits.

In [None]:
# your code here

# Solution

In [None]:
check = pd.read_sql("SELECT Category, Borough, COUNT(DISTINCT EventID) as film_permits FROM nyc_film_permits GROUP BY Category, Borough ORDER BY Category ASC, film_permits DESC", con=con)
check

# Exercise 3: 

> Find the year and zipcode with the most tax returns in NYC

In [None]:
# your code here

# Solution

In [None]:
check = pd.read_sql("SELECT zipcode, year, SUM(return_count) as total_returns FROM irs_nyc_tax_returns GROUP BY zipcode, year ORDER BY total_returns DESC", con=con)
check

---

# ⭕ **QUESTIONS?**

---

# HAVING

```sql 
SELECT A1, Aggregation Function
FROM T1, T2, ... Tm
WHERE condition
GROUP BY A1
HAVING Aggregation Function Condition
```
<br>

`WHERE` applies to rows _before_ computing the aggregate <br>
`HAVING` applies to groups _after_ computing the aggregate

In other words, the `WHERE` clause applies its condition to individual rows before they are summarized into groups by the `GROUP BY` clause. `HAVING`, meanwhile, applies its condition to each group after the rows have been grouped.
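
A minimal sketch of that difference, assuming a made-up `returns` table in a separate in-memory database (the table and its values are hypothetical, not the IRS data loaded above):

```python
import sqlite3

# Hypothetical toy table in a throwaway in-memory database
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE returns (zipcode INTEGER, year INTEGER, return_count INTEGER)"
)
demo.executemany(
    "INSERT INTO returns VALUES (?, ?, ?)",
    [
        (10001, 2012, 100),
        (10001, 2013, 300),
        (10002, 2012, 50),
        (10002, 2013, 60),
    ],
)

# WHERE filters individual rows first, so only the 2013 rows are summed
where_rows = demo.execute("""
    SELECT zipcode, SUM(return_count) AS total
    FROM returns
    WHERE year = 2013
    GROUP BY zipcode
    ORDER BY zipcode
""").fetchall()

# HAVING filters whole groups after the sum has been computed
having_rows = demo.execute("""
    SELECT zipcode, SUM(return_count) AS total
    FROM returns
    GROUP BY zipcode
    HAVING SUM(return_count) > 200
    ORDER BY zipcode
""").fetchall()

print(where_rows)   # [(10001, 300), (10002, 60)]
print(having_rows)  # [(10001, 400)]
```

With `WHERE`, both zipcodes survive but their totals only include 2013. With `HAVING`, the totals include every year, and zipcode 10002 (total 110) is dropped as a group.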

---

# ⭕ **QUESTIONS?**

---

# Exercise 6: 

> Find the subcategories with more than 5,000 NYC filming permits in the database.

In [None]:
# your code here

# Solution

In [None]:
check = pd.read_sql("""
    SELECT Category, SubCategoryName, COUNT(DISTINCT EventID) as permit_count 
    FROM nyc_film_permits
    GROUP BY Category, SubCategoryName
    HAVING permit_count > 5000
    ORDER BY permit_count DESC
""", con=con)
check

# Exercise 7: 

> Find the years in which the "Commercial" category had fewer than 700 filming permits.



In [None]:
# your code here

# Solution

```SQL 
SELECT STRFTIME('%Y',StartDateTime) as permit_year, COUNT(DISTINCT EventID) as permit_count
FROM nyc_film_permits
WHERE LOWER(Category) = 'commercial'
GROUP BY permit_year
HAVING permit_count < 700
ORDER BY permit_count DESC;
```

In [None]:
check = pd.read_sql("""
    SELECT STRFTIME('%Y',StartDateTime) as permit_year, COUNT(DISTINCT EventID) as permit_count
    FROM nyc_film_permits
    WHERE LOWER(Category) = 'commercial'
    GROUP BY permit_year
    HAVING permit_count < 700
    ORDER BY permit_count DESC;
""", con=con)
check

---

# Conditional Construction: CASE

```SQL
CASE
    WHEN condition THEN result
    WHEN condition2 THEN result2
    ELSE result
END
```
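
As a small illustration of how `CASE` is evaluated, the sketch below uses a made-up `scores` table in an in-memory database (hypothetical names and values, chosen only for the example). The `WHEN` branches are tested from top to bottom and the first one that matches wins, so range thresholds need to be listed in a consistent order.

```python
import sqlite3

# Hypothetical toy table in a throwaway in-memory database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
demo.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 95), ("b", 72), ("c", 40)],
)

rows = demo.execute("""
    SELECT name,
           CASE
               WHEN score >= 90 THEN 'high'    -- checked first
               WHEN score >= 60 THEN 'medium'  -- only reached if score < 90
               ELSE 'low'                      -- everything else
           END AS band
    FROM scores
    ORDER BY name
""").fetchall()

print(rows)  # [('a', 'high'), ('b', 'medium'), ('c', 'low')]
```

If the two `WHEN` lines were swapped, a score of 95 would match `score >= 60` first and be labeled 'medium'.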

# Exercise 8:

> Map each zipcode and year's return counts to a "return count" bucket

In [None]:
# your code here

# Solution

In [None]:
check = pd.read_sql("""
    SELECT year, zipcode,
      CASE
        WHEN SUM(return_count) < 5000 THEN 'under 5k returns'
        WHEN SUM(return_count) < 10000 THEN '5k to 10k returns'
        WHEN SUM(return_count) < 25000 THEN '10k to 25k returns'
        WHEN SUM(return_count) < 50000 THEN '25k to 50k returns'
        WHEN SUM(return_count) < 100000 THEN '50k to 100k returns'
        ELSE 'over 100k returns'
      END AS return_count_bucket
    FROM irs_nyc_tax_returns
    GROUP BY year, zipcode
""", con=con)
check

# Exercise 9: 

> Make the results more human-friendly by mapping each agi_map_id value to a short, descriptive label.

In [None]:
# your code here

# Solution

In [None]:
check = pd.read_sql("""
    SELECT year, zipcode, 
      CASE agi_map_id
        WHEN 1 THEN 'under 25k'
        WHEN 2 THEN '25k to 50k'
        WHEN 3 THEN '50k to 75k'
        WHEN 4 THEN '75k to 100k'
        WHEN 5 THEN '100k to 200k'
        ELSE 'over 200k'
      END AS income_level, return_count
    FROM irs_nyc_tax_returns
    WHERE zipcode = 10128 AND year = 2013
""", con=con)
check

---

# ⭕ **QUESTIONS?**

---