# Dealing with Data Spring 2022 – Class 9

---

### Why Databases?

> Size <br>
> Scale <br>
> Security <br>
> Easy to Make Insertions, Deletions, and Updates

### Building a Data Pipeline – In-Class Example

**Users**

Our user is the [Commissioner for NYC's Media & Entertainment Agency](https://www1.nyc.gov/site/mome/index.page).

**Decision Problem**

When making daily decisions around reviewing film permit requests in NYC, the office now needs to consider how to best create a balanced media representation of NYC's diversity. 

**Data Assistance Requirements**

Describe neighborhood characteristics of past film permits to best inform future decisions.

**Application Type**

`Descriptive`: the agency needs to start with an understanding of the neighborhoods shown in previously issued permits

### Our Data

Our `nyc_film_permits.csv` originated from the City of New York [(link here)](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p) and was transformed for ease of use in our class. 

Our `irs_nyc_tax_returns.csv` originated from the IRS [(link here)](https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi) and was transformed for ease of use in our class. 

Our `nyc_census_data.csv` originated from Census.gov [(link here)](https://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t) and was transformed for ease of use in our class. 

Our `irs_agi_map.csv` was produced manually by Guthrie Collin after reading IRS documentation. 

### sqlite

In [None]:
import sqlite3 # this is how we will import the sqlite3 functionality needed to proceed

[SQLite](https://www.sqlite.org/index.html) is a library that allows us to create, populate, and query a SQL database. It's also serverless, meaning we don't need to connect to a separate server where our data is stored; instead, we read and write the database file directly. We can even store that database as a file in our Colab environment and query it from there.

In [None]:
# Connect to the database file 'class9_data.db'.
# If the file doesn't exist yet, this creates it; otherwise, it connects to the existing one.
con = sqlite3.connect('class9_data.db')

# "con" stands for "connection": it tells SQLite (and, later, pandas) which database to use
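To see the create-or-connect behavior in isolation, here is a minimal sketch that uses only the standard library. The temporary directory, the file name `demo.db`, and the table `t` are placeholders for illustration and are not part of the class materials.

```python
import os
import sqlite3
import tempfile

# Hypothetical scratch location so the demo does not touch class9_data.db
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.db")

existed_before = os.path.exists(demo_path)   # no file yet

demo_con = sqlite3.connect(demo_path)        # the file does not exist, so it is created
demo_con.execute("CREATE TABLE t (x INTEGER)")
demo_con.execute("INSERT INTO t VALUES (1)")
demo_con.commit()                            # save the changes to the file
demo_con.close()

existed_after = os.path.exists(demo_path)

# Connecting a second time opens the same file, and the data is still there
demo_con = sqlite3.connect(demo_path)
rows = demo_con.execute("SELECT x FROM t").fetchall()
demo_con.close()

print(existed_before, existed_after, rows)
```

Because the whole database lives in that one file, downloading the file is enough to keep a copy of your work.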


It's important to note that the database we just created was created _inside this Colab environment_, which will be reset when we log off.

Thus, when you're done with your database, I recommend downloading it directly to your machine.

You can see the database by clicking on the file icon on the far-left side of the window.

In [None]:
import pandas as pd

irs_agi_map = pd.read_csv("./irs_agi_map.csv")
irs_nyc_tax_returns = pd.read_csv("./irs_nyc_tax_returns.csv")
nyc_census_data = pd.read_csv("./nyc_census_data.csv")
nyc_film_permits = pd.read_csv("./nyc_film_permits.csv")

In [None]:
irs_agi_map.head()

In [None]:
irs_nyc_tax_returns.head()

In [None]:
nyc_census_data.head()

In [None]:
nyc_film_permits.head()

---

In [None]:
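# if_exists='replace' lets us re-run this cell without a "table already exists" error
irs_agi_map.to_sql(name='irs_agi_map', con=con, if_exists='replace')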

In [None]:
check = pd.read_sql("SELECT * FROM irs_agi_map LIMIT 3", con=con)
check

In [None]:
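# if_exists='replace' lets us re-run this cell without a "table already exists" error
irs_nyc_tax_returns.to_sql(name='irs_nyc_tax_returns', con=con, if_exists='replace')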

In [None]:
check = pd.read_sql("SELECT * FROM irs_nyc_tax_returns LIMIT 3", con=con)
check

In [None]:
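# if_exists='replace' lets us re-run this cell without a "table already exists" error
nyc_census_data.to_sql(name='nyc_census_data', con=con, if_exists='replace')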

In [None]:
check = pd.read_sql("SELECT * FROM nyc_census_data LIMIT 3", con=con)
check

In [None]:
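# if_exists='replace' lets us re-run this cell without a "table already exists" error
nyc_film_permits.to_sql(name='nyc_film_permits', con=con, if_exists='replace')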

In [None]:
check = pd.read_sql("SELECT * FROM nyc_film_permits LIMIT 3", con=con)
check
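Besides peeking at each table, we can ask SQLite which tables exist by querying its built-in `sqlite_master` catalog. The sketch below uses a throwaway in-memory database, and the column names in the two `CREATE TABLE` statements are made up for illustration.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")  # throwaway database that lives only in memory

# Hypothetical columns; only the table names mirror the ones used in class
demo_con.execute("CREATE TABLE irs_agi_map (id INTEGER, label TEXT)")
demo_con.execute("CREATE TABLE nyc_film_permits (event_id INTEGER, zipcode INTEGER)")

# sqlite_master holds one row per table (and index) in the database
table_names = [row[0] for row in demo_con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

print(table_names)
```

In the notebook, the same query should work through pandas, for example `pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", con=con)`.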

---

# ⭕ **QUESTIONS?**

---

# SELECT and LIMIT

In [None]:
check = pd.read_sql("SELECT * FROM irs_agi_map LIMIT 3", con=con)
check

In [None]:
check = pd.read_sql("SELECT * FROM irs_nyc_tax_returns LIMIT 15", con=con)
check

In [None]:
check = pd.read_sql("SELECT state, zipcode FROM irs_nyc_tax_returns", con=con)
check

In [None]:
check = pd.read_sql("SELECT year, state, zipcode, return_count FROM irs_nyc_tax_returns LIMIT 5", con=con)
check
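As a quick recap, `SELECT` controls which columns come back and `LIMIT` caps how many rows come back. Here is a small sketch against a toy table; the table name `demo_returns` and its four rows are invented for illustration.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE demo_returns (year INTEGER, state TEXT, zipcode INTEGER, return_count INTEGER)")
demo_con.executemany("INSERT INTO demo_returns VALUES (?, ?, ?, ?)", [
    (2012, "NY", 10001, 500),   # made-up values
    (2012, "NY", 10002, 300),
    (2013, "NY", 10001, 450),
    (2013, "NY", 10128, 700),
])

# * returns every column; LIMIT 2 returns at most two rows
all_columns = demo_con.execute("SELECT * FROM demo_returns LIMIT 2").fetchall()

# Naming columns returns only those columns; with no LIMIT, every row comes back
two_columns = demo_con.execute("SELECT state, zipcode FROM demo_returns").fetchall()

print(len(all_columns), len(all_columns[0]))   # rows returned, columns per row
print(len(two_columns), len(two_columns[0]))
```

Note that without an `ORDER BY`, SQL makes no promise about *which* rows `LIMIT` keeps, only how many.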

---

# ⭕ **QUESTIONS?**

---

# AS

Sometimes we want to rename a column to give it a more descriptive name in the results.

In [None]:
check = pd.read_sql("SELECT year, state, zipcode, return_count as tax_returns FROM irs_nyc_tax_returns", con=con)
check

In [None]:
check = pd.read_sql("SELECT year, state, return_count as 'tax returns' FROM irs_nyc_tax_returns LIMIT 5", con=con)
check
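An alias only changes the column's name in the results; the table itself is untouched. The sketch below checks this on a toy table (`demo_returns` and its values are invented) by reading the result's column names from `cursor.description`.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE demo_returns (year INTEGER, return_count INTEGER)")
demo_con.execute("INSERT INTO demo_returns VALUES (2012, 500)")  # made-up row

# cursor.description lists the column names of the result set
cur = demo_con.execute("SELECT year, return_count AS tax_returns FROM demo_returns")
aliased_names = [col[0] for col in cur.description]

# An alias that contains a space has to be quoted
cur = demo_con.execute('SELECT year, return_count AS "tax returns" FROM demo_returns')
spaced_names = [col[0] for col in cur.description]

# The table still uses the original column name
cur = demo_con.execute("SELECT * FROM demo_returns")
original_names = [col[0] for col in cur.description]

print(aliased_names, spaced_names, original_names)
```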

# ORDER BY 

Used to sort the result rows based on attribute values.

In [None]:
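# Sort by the underlying column: ORDER BY 'tax returns' (single quotes) is read as a text constant, not as the alias
check = pd.read_sql("SELECT year, state, return_count as 'tax returns' FROM irs_nyc_tax_returns ORDER BY return_count ASC", con=con)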
check

In [None]:
check = pd.read_sql("SELECT year, return_count as 'tax returns' FROM irs_nyc_tax_returns ORDER BY year DESC", con=con)
check
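One SQLite detail to watch for: in `ORDER BY`, a name in single quotes is treated as a text constant, so the rows are not sorted by that column. Sorting by the original column name, or by the alias in double quotes, behaves as expected. The sketch below shows both working forms on a toy table (`demo_returns` and its values are invented).

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE demo_returns (year INTEGER, return_count INTEGER)")
demo_con.executemany("INSERT INTO demo_returns VALUES (?, ?)",
                     [(2012, 500), (2013, 300), (2014, 700)])  # made-up values

# Sort by the original column name, smallest first
ascending = demo_con.execute(
    "SELECT year, return_count FROM demo_returns ORDER BY return_count ASC").fetchall()

# Sort by the alias, written as a double-quoted identifier, largest first
descending_by_alias = demo_con.execute(
    'SELECT year, return_count AS "tax returns" FROM demo_returns '
    'ORDER BY "tax returns" DESC').fetchall()

print(ascending)
print(descending_by_alias)
```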

---

# ⭕ **QUESTIONS?**

---

# WHERE

Defines which rows will appear in the results.

`attr = value` means 'attribute is equal to value', where value is either text in single quotes (`'text'`) or a number (SQLite also accepts `==`) <br>
`attr != value` or `attr <> value` means 'attribute is *not equal to* value' <br>
`attr > value` means 'attribute is greater than value' <br>
`attr < value` means 'attribute is less than value' <br>
`attr >= value` means 'attribute is greater than or equal to value' <br>
`attr <= value` means 'attribute is less than or equal to value' <br>
`attr IN (x1,x2,x3,...)` means 'attribute value is either x1, x2, or x3, or ...' <br>
`attr NOT IN (x1,x2,x3,...)` means 'attribute value is not x1, nor x2, nor x3, ...' <br>
`condition1 AND condition2` means 'both conditions should hold' <br>
`condition1 OR condition2` means 'at least one of the conditions should hold' <br>
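The operators above can be tried out on a toy table. In this sketch the table `demo_returns`, its values, and the helper `select_where` are all invented for illustration.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE demo_returns (year INTEGER, zipcode INTEGER, return_count INTEGER)")
demo_con.executemany("INSERT INTO demo_returns VALUES (?, ?, ?)", [
    (2012, 10001, 500),   # made-up values
    (2012, 10002, 300),
    (2013, 10001, 450),
    (2013, 10128, 700),
])

def select_where(condition):
    """Hypothetical helper: (year, zipcode) pairs matching a WHERE condition, in sorted order."""
    sql = f"SELECT year, zipcode FROM demo_returns WHERE {condition} ORDER BY year, zipcode"
    return demo_con.execute(sql).fetchall()

equal = select_where("zipcode = 10001")
not_equal = select_where("zipcode <> 10001")
at_least = select_where("return_count >= 500")
in_list = select_where("zipcode IN (10002, 10128)")
not_in_list = select_where("zipcode NOT IN (10001, 10002)")

print(equal)
print(not_equal)
print(at_least)
print(in_list)
print(not_in_list)
```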


In [None]:
check = pd.read_sql("SELECT * FROM irs_nyc_tax_returns WHERE (zipcode == 10001)", con=con)
check

In [None]:
check = pd.read_sql("SELECT * FROM irs_nyc_tax_returns WHERE (zipcode == 10001 AND return_count > 4000) OR (agi_map_id = 2 AND year = 14)", con=con)
check
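The parentheses matter when `AND` and `OR` appear together, because SQL evaluates `AND` before `OR`. The sketch below runs the same three conditions with and without grouping on a toy table (`demo_returns` and its values are invented).

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE demo_returns (year INTEGER, zipcode INTEGER)")
demo_con.executemany("INSERT INTO demo_returns VALUES (?, ?)", [
    (2012, 10001),   # made-up values
    (2013, 10001),
    (2013, 10128),
])

# No parentheses: read as  zipcode = 10001 OR (zipcode = 10128 AND year = 2012)
ungrouped = demo_con.execute(
    "SELECT year, zipcode FROM demo_returns "
    "WHERE zipcode = 10001 OR zipcode = 10128 AND year = 2012 "
    "ORDER BY year, zipcode").fetchall()

# Parentheses force the OR to be evaluated first
grouped = demo_con.execute(
    "SELECT year, zipcode FROM demo_returns "
    "WHERE (zipcode = 10001 OR zipcode = 10128) AND year = 2012 "
    "ORDER BY year, zipcode").fetchall()

print(ungrouped)
print(grouped)
```

Writing the parentheses explicitly, as in the class queries, keeps the intended grouping visible to the reader.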

In [None]:
check = pd.read_sql("SELECT * FROM irs_nyc_tax_returns WHERE (zipcode == 10128 AND year == 2012)", con=con)
check

In [None]:
check = pd.read_sql("SELECT * FROM irs_nyc_tax_returns WHERE zipcode NOT IN (10001,10002,10128)", con=con)
check

In [None]:
check = pd.read_sql("SELECT * FROM irs_nyc_tax_returns WHERE (zipcode == 10128 AND year == 2012) ORDER BY agi_map_id DESC;", con=con)
check

# DISTINCT

Used to eliminate duplicate rows in the results.

In [None]:
check = pd.read_sql("SELECT DISTINCT year FROM irs_nyc_tax_returns", con=con)
check

In [None]:
check = pd.read_sql("SELECT DISTINCT year, zipcode as 'zip code' FROM irs_nyc_tax_returns", con=con)
check
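`DISTINCT` applies to the whole selected row, so with two columns it keeps each unique *combination* of values. A small sketch on a toy table (`demo_returns` and its values are invented):

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE demo_returns (year INTEGER, zipcode INTEGER)")
demo_con.executemany("INSERT INTO demo_returns VALUES (?, ?)", [
    (2012, 10001),   # made-up values; this pair appears twice
    (2012, 10001),
    (2012, 10002),
    (2013, 10001),
])

total_rows = demo_con.execute("SELECT COUNT(*) FROM demo_returns").fetchone()[0]

# One column: each year appears once
distinct_years = demo_con.execute(
    "SELECT DISTINCT year FROM demo_returns ORDER BY year").fetchall()

# Two columns: each (year, zipcode) pair appears once
distinct_pairs = demo_con.execute(
    "SELECT DISTINCT year, zipcode FROM demo_returns ORDER BY year, zipcode").fetchall()

print(total_rows, distinct_years, distinct_pairs)
```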

# Other Operators

`IS NULL` returns rows that have null values for a specified attribute <br>
`IS NOT NULL` returns rows that do not have null values for a specified attribute <br>
`attr BETWEEN low AND high` returns rows whose attribute value falls between `low` and `high`, with both endpoints included
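Here is a short sketch of these operators on a toy table (`demo_returns` and its values are invented). Python's `None` is stored as SQL `NULL`, and comparing with `= NULL` never matches, which is why `IS NULL` exists.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE demo_returns (zipcode INTEGER, return_count INTEGER)")
demo_con.executemany("INSERT INTO demo_returns VALUES (?, ?)", [
    (10001, 500),    # made-up values
    (10002, None),   # None becomes NULL in the database
    (10003, 300),
    (10128, 700),
])

def zipcodes_where(condition):
    """Hypothetical helper: zip codes matching a WHERE condition, in sorted order."""
    sql = f"SELECT zipcode FROM demo_returns WHERE {condition} ORDER BY zipcode"
    return [row[0] for row in demo_con.execute(sql)]

missing = zipcodes_where("return_count IS NULL")
present = zipcodes_where("return_count IS NOT NULL")
equals_null = zipcodes_where("return_count = NULL")            # matches nothing
in_range = zipcodes_where("return_count BETWEEN 300 AND 500")  # endpoints included

print(missing, present, equals_null, in_range)
```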


---

# ⭕ **QUESTIONS?**

---