# Dealing with Data Spring 2020 – Class 5

---

# Why Databases?

> Size <br>
> Scale <br>
> Security <br>
> Easy to Make Insertions, Deletions, and Updates

# Entity Relationship Diagrams (ERDs): 

...graphically present the relationships between entities. For example:

![ERD Example](https://online.visual-paradigm.com/repository/images/73785425-47d6-4273-8b89-b2b9622af30f.png)

An `entity` is a collection of objects with the same properties (e.g., 'Student' or 'Examination'). The individual objects in that collection are its `instances`, each described by its own attribute values (e.g., 'John Doe', '1833449', 'Business').

A `primary key` is an attribute whose value is unique for each instance (e.g., 'SID').

A `composite primary key` is a primary key that consists of two or more attributes, whose values together (but not separately) are unique for each instance of an entity (e.g., SectionNum and CourseNo).
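To make the "together, but not separately" idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module and a throwaway in-memory database. The `Section` table, the `Room` column, and the course codes are made up for illustration; only `SectionNum` and `CourseNo` come from the example above.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical Section table: neither column is unique on its own,
# but each (CourseNo, SectionNum) pair may appear only once.
con_demo.execute("""
    CREATE TABLE Section (
        CourseNo   varchar(20),
        SectionNum int,
        Room       varchar(20),
        PRIMARY KEY (CourseNo, SectionNum)
    );
""")

con_demo.execute("INSERT INTO Section VALUES ('INFO-101', 1, 'Room A');")
con_demo.execute("INSERT INTO Section VALUES ('INFO-101', 2, 'Room B');")  # same course, new section: allowed
con_demo.execute("INSERT INTO Section VALUES ('INFO-202', 1, 'Room A');")  # same section number, new course: allowed

try:
    # Repeats an existing (CourseNo, SectionNum) pair, so SQLite rejects it
    con_demo.execute("INSERT INTO Section VALUES ('INFO-101', 1, 'Room C');")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

Repeating a course number or a section number by itself is fine; only repeating the pair triggers the constraint.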

A `relationship` describes an association between entities (e.g., a Student takes a Course). 

`Cardinalities` describe the number of instances that participate in a relationship. For example: 

- A student may take 0, 1, or more ('many') courses
- A course can be taken by 0, 1, or more ('many') students

<br> 

![Cardinality Notation](https://user-images.githubusercontent.com/2719310/29148000-7c8d3dd8-7d1f-11e7-9e91-2caf5074f6af.png)

---

# From Narrative to ERD

1. Identify `entities` and `attributes`
> look for nouns that describe people, places, things (those are your entities) 
> then, look for details about those entities (those are your attributes) 
2. Define the `primary keys`
> these should be stable, and each entity should only have one primary key
3. Identify `relationships` and determine `cardinalities`
4. Refine and iterate



---

> An `entity` maps to a `table` <br> <br>
> Each `attribute` maps to a `column` in that table <br> <br>
> The `primary key` of the entity maps to the `primary key` of the table
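As a rough sketch of this mapping, the Student and Course entities from the ERD discussion could become two tables, with the many-to-many "takes" relationship stored in a third table. The table name `Takes`, the columns `Name` and `Title`, and all sample values other than 'John Doe' and '1833449' are assumptions made for illustration.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database

con_demo.executescript("""
    -- each entity becomes a table, each attribute a column
    CREATE TABLE Student (SID int PRIMARY KEY, Name varchar(250));
    CREATE TABLE Course  (CourseNo varchar(20) PRIMARY KEY, Title varchar(250));

    -- the many-to-many relationship "Student takes Course" gets its own table
    CREATE TABLE Takes (
        SID      int         REFERENCES Student(SID),
        CourseNo varchar(20) REFERENCES Course(CourseNo),
        PRIMARY KEY (SID, CourseNo)
    );
""")

con_demo.executemany("INSERT INTO Student VALUES (?, ?);",
                     [(1833449, "John Doe"), (1833450, "Jane Roe")])
con_demo.executemany("INSERT INTO Course VALUES (?, ?);",
                     [("C1", "Databases"), ("C2", "Statistics")])
# John takes two courses; Jane takes none (the "0" end of the cardinality)
con_demo.executemany("INSERT INTO Takes VALUES (?, ?);",
                     [(1833449, "C1"), (1833449, "C2")])

courses_for_john = [row[0] for row in con_demo.execute(
    "SELECT CourseNo FROM Takes WHERE SID = ? ORDER BY CourseNo;", (1833449,))]
jane_count = con_demo.execute(
    "SELECT COUNT(*) FROM Takes WHERE SID = ?;", (1833450,)).fetchone()[0]

print(courses_for_john)  # ['C1', 'C2']
print(jane_count)        # 0
```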

---

# In-Class Example – CitiBike Data

In [5]:
import sqlite3

[SQLite](https://www.sqlite.org/index.html) is a library that allows us to create, populate, and query a SQL database. It's also serverless, meaning we don't need to access a separate server where we're storing our data – instead, we can directly access our database. The database is stored as a single file on our local machine, and we query that file directly. 

In [6]:
con = sqlite3.connect('citibikeData.db') # this is how we are going to create our database, 
                                         # calling it 'citibikeData.db'

# "con" stands for "connection" – this is telling SQLite what database to use

Now, let's check out the API we'll be working with: https://streamdata.io/developers/api-gallery/new-york-citibike-api/

---

First, we'll request the json from the CitiBike API URL and just print it out to get a quick glimpse

In [7]:
import json 
import urllib.request # https://docs.python.org/3/library/urllib.request.html

with urllib.request.urlopen("https://feeds.citibikenyc.com/stations/stations.json") as url:
    data = json.loads(url.read().decode())
    print(data)

{'executionTime': '2020-01-27 01:25:23 PM', 'stationBeanList': [{'id': 304, 'stationName': 'Broadway & Battery Pl', 'availableDocks': 15, 'totalDocks': 33, 'latitude': 40.70463334, 'longitude': -74.01361706, 'statusValue': 'In Service', 'statusKey': 1, 'availableBikes': 18, 'stAddress1': 'Broadway & Battery Pl', 'stAddress2': '', 'city': '', 'postalCode': '', 'location': '', 'altitude': '', 'testStation': False, 'lastCommunicationTime': '2020-01-27 01:24:00 PM', 'landMark': ''}, {'id': 359, 'stationName': 'E 47 St & Park Ave', 'availableDocks': 35, 'totalDocks': 64, 'latitude': 40.75510267, 'longitude': -73.97498696, 'statusValue': 'In Service', 'statusKey': 1, 'availableBikes': 29, 'stAddress1': 'E 47 St & Park Ave', 'stAddress2': '', 'city': '', 'postalCode': '', 'location': '', 'altitude': '', 'testStation': False, 'lastCommunicationTime': '2020-01-27 01:24:21 PM', 'landMark': ''}, {'id': 367, 'stationName': 'E 53 St & Lexington Ave', 'availableDocks': 30, 'totalDocks': 34, 'latitud

As you can see, the json is a dictionary of lists and other dictionaries containing information about CitiBike stations across New York City. 

For our purposes we're interested in the information contained within the 'stationBeanList' list, seen in the first line of the json above:

In [8]:
stations = data['stationBeanList'] # pull out the list of station dictionaries stored under the 'stationBeanList' key
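To see the shape of the feed without making a network request, here is a small sketch that parses a shortened, hand-copied sample (two stations and three fields each, taken from the output above). The variable names `sample_text`, `sample`, and `sample_stations` are just for this illustration.

```python
import json

# A shortened, hand-copied sample of the feed (two stations, a few fields each)
sample_text = """
{"executionTime": "2020-01-27 01:25:23 PM",
 "stationBeanList": [
   {"id": 304, "stationName": "Broadway & Battery Pl", "availableBikes": 18},
   {"id": 359, "stationName": "E 47 St & Park Ave", "availableBikes": 29}
 ]}
"""
sample = json.loads(sample_text)

print(type(sample).__name__)   # the top level is a dict
print(list(sample.keys()))     # its keys include 'stationBeanList'

sample_stations = sample["stationBeanList"]                  # a list ...
print(type(sample_stations).__name__, len(sample_stations))

first = sample_stations[0]                                   # ... of dicts, one per station
print(first["id"], first["stationName"])
```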

In [9]:
import pandas as pd # we'll use pandas just to visualize our data, NOT to query it

df_stations = pd.DataFrame(stations) # create a new dataframe called 'df_stations' 
df_stations.head() # check the first five station entries

| | id | stationName | availableDocks | totalDocks | latitude | longitude | statusValue | statusKey | availableBikes | stAddress1 | stAddress2 | city | postalCode | location | altitude | testStation | lastCommunicationTime | landMark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 304 | Broadway & Battery Pl | 15 | 33 | 40.704633 | -74.013617 | In Service | 1 | 18 | Broadway & Battery Pl | | | | | | False | 2020-01-27 01:24:00 PM | |
| 1 | 359 | E 47 St & Park Ave | 35 | 64 | 40.755103 | -73.974987 | In Service | 1 | 29 | E 47 St & Park Ave | | | | | | False | 2020-01-27 01:24:21 PM | |
| 2 | 367 | E 53 St & Lexington Ave | 30 | 34 | 40.758281 | -73.970694 | In Service | 1 | 4 | E 53 St & Lexington Ave | | | | | | False | 2020-01-27 01:23:49 PM | |
| 3 | 402 | Broadway & E 22 St | 24 | 39 | 40.740343 | -73.989551 | In Service | 1 | 15 | Broadway & E 22 St | | | | | | False | 2020-01-27 01:23:39 PM | |
| 4 | 3443 | W 52 St & 6 Ave | 38 | 41 | 40.76133 | -73.97982 | In Service | 1 | 2 | W 52 St & 6 Ave | | | | | | False | 2020-01-27 01:21:46 PM | |


So, we have our data from the CitiBike feed, and it looks pretty good! Now we need to create a table within our database (the one we named citibikeData.db). We do that using the 'CREATE TABLE IF NOT EXISTS' statement seen below. 

In that statement, the 'IF NOT EXISTS' tells SQLite to create the table called 'StationsData' only if it is not already there. That way, if we run that cell again, the statement is simply skipped: it neither raises a "table already exists" error nor touches the work we've previously done. 

Note that at this point we aren't adding any data to our table. All we're doing is telling SQLite that we want to create a new table, and providing it with a) the column names and b) the data type those columns should be expecting.

In [10]:
# station_id is declared as the primary key: each station appears once,
# which is what allows INSERT OR IGNORE to skip a station that is already stored
sql = """CREATE TABLE IF NOT EXISTS StationsData (
    station_id int PRIMARY KEY, stationName varchar(250),
    availableDocks int, totalDocks int, latitude float, longitude float,
    statusValue varchar(250), statusKey int, availableBikes int,
    stAddress1 varchar(250), stAddress2 varchar(250), city varchar(250),
    postalCode varchar(250), location varchar(250), altitude varchar(250),
    testStation bool, lastCommunicationTime date, landMark varchar(250));"""

con.execute(sql)
con.commit()
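One way to confirm what a `CREATE TABLE IF NOT EXISTS` statement did is to ask SQLite for the table's columns with `PRAGMA table_info`. The sketch below uses a shortened three-column version of the table in a separate in-memory database (`con_demo`), so it does not touch citibikeData.db.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database

create_sql = ("CREATE TABLE IF NOT EXISTS StationsData "
              "(station_id int, stationName varchar(250), totalDocks int);")
con_demo.execute(create_sql)
con_demo.execute(create_sql)  # running it again is a no-op, not an error

# Without IF NOT EXISTS, creating an existing table raises an error
try:
    con_demo.execute("CREATE TABLE StationsData (station_id int);")
    plain_create_failed = False
except sqlite3.OperationalError as err:
    plain_create_failed = True
    print("Plain CREATE TABLE failed:", err)

# Each row of table_info is (cid, name, type, notnull, dflt_value, pk)
column_names = [row[1] for row in con_demo.execute("PRAGMA table_info(StationsData);")]
print(column_names)  # ['station_id', 'stationName', 'totalDocks']
```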

Now that we have our database and our table, we want to insert our data. 

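Below, we create a "query template" where we "INSERT OR IGNORE INTO" our table (StationsData) the values associated with each of our columns. The `?` placeholders are filled in from a tuple of parameters, and "OR IGNORE" tells SQLite to skip any row that would violate a constraint (such as a duplicate primary key) instead of raising an error. 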

We define those values by parsing through the CitiBike json we got earlier, and for each "row" of that json, we create a new row in our SQLite table. 

In [11]:
query_template = """INSERT OR IGNORE INTO StationsData(station_id, stationName, availableDocks, totalDocks, latitude, \
longitude, statusValue, statusKey, availableBikes, stAddress1, stAddress2, city, postalCode, location, altitude, \
testStation, lastCommunicationTime, landMark) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);"""

for entry in stations: # for every station entry in the json 
    station_id = int(entry['id']) # find and set station_id
    stationName = str(entry['stationName'])
    availableDocks = int(entry['availableDocks'])
    totalDocks = int(entry['totalDocks'])
    latitude = float(entry['latitude'])   # the table declares latitude/longitude as float
    longitude = float(entry['longitude'])
    statusValue = str(entry['statusValue'])
    statusKey = int(entry['statusKey'])
    availableBikes = int(entry['availableBikes'])
    stAddress1 = str(entry['stAddress1'])
    stAddress2 = str(entry['stAddress2'])
    city = str(entry['city'])
    postalCode = str(entry['postalCode'])
    location = str(entry['location'])
    altitude = str(entry['altitude'])
    testStation = bool(entry['testStation'])
    lastCommunicationTime = entry['lastCommunicationTime']
    landMark = str(entry['landMark'])
                           
    print("Inserting Station:", station_id, stationName, availableDocks, totalDocks, latitude, longitude, statusValue, statusKey, availableBikes, stAddress1, stAddress2, city, postalCode, location, altitude, testStation, lastCommunicationTime, landMark) 
    
    query_parameters = (station_id, stationName, availableDocks, totalDocks, latitude, longitude, statusValue, statusKey, availableBikes, stAddress1, stAddress2, city, postalCode, location, altitude, testStation, lastCommunicationTime, landMark) 
    
    con.execute(query_template, query_parameters)
    
con.commit()

Inserting Station: 304 Broadway & Battery Pl 15 33 40.70463334 -74.01361706 In Service 1 18 Broadway & Battery Pl      False 2020-01-27 01:24:00 PM 
Inserting Station: 359 E 47 St & Park Ave 35 64 40.75510267 -73.97498696 In Service 1 29 E 47 St & Park Ave      False 2020-01-27 01:24:21 PM 
Inserting Station: 367 E 53 St & Lexington Ave 30 34 40.75828065 -73.97069431 In Service 1 4 E 53 St & Lexington Ave      False 2020-01-27 01:23:49 PM 
Inserting Station: 402 Broadway & E 22 St 24 39 40.7403432 -73.98955109 In Service 1 15 Broadway & E 22 St      False 2020-01-27 01:23:39 PM 
Inserting Station: 3443 W 52 St & 6 Ave 38 41 40.76132983124814 -73.97982001304626 In Service 1 2 W 52 St & 6 Ave      False 2020-01-27 01:21:46 PM 
Inserting Station: 72 W 52 St & 11 Ave 22 55 40.76727216 -73.99392888 In Service 1 33 W 52 St & 11 Ave      False 2020-01-27 01:22:34 PM 
Inserting Station: 79 Franklin St & W Broadway 5 33 40.71911552 -74.00666661 In Service 1 27 Franklin St & W Broadway      Fals

Inserting Station: 389 Broadway & Berry St 23 27 40.71044554 -73.96525063 In Service 1 4 Broadway & Berry St      False 2020-01-27 01:24:56 PM 
Inserting Station: 390 Duffield St & Willoughby St 11 31 40.69221589 -73.9842844 In Service 1 20 Duffield St & Willoughby St      False 2020-01-27 01:23:15 PM 
Inserting Station: 391 Clark St & Henry St 31 31 40.69760127 -73.99344559 In Service 1 0 Clark St & Henry St      False 2020-01-27 01:22:59 PM 
Inserting Station: 392 Jay St & Tech Pl 11 35 40.695065 -73.987167 In Service 1 23 Jay St & Tech Pl      False 2020-01-27 01:24:08 PM 
Inserting Station: 393 E 5 St & Avenue C 37 37 40.72299208 -73.97995466 In Service 1 0 E 5 St & Avenue C      False 2020-01-27 01:24:45 PM 
Inserting Station: 394 E 9 St & Avenue C 32 34 40.72521311 -73.97768752 In Service 1 2 E 9 St & Avenue C      False 2020-01-27 01:24:03 PM 
Inserting Station: 396 Lefferts Pl & Franklin Ave 25 25 40.680342423 -73.9557689392 In Service 1 0 Lefferts Pl & Franklin Ave      False 

Inserting Station: 3050 Putnam Ave & Throop Ave 21 21 40.6851532 -73.94111 In Service 1 0 Putnam Ave & Throop Ave      False 2020-01-27 01:24:44 PM 
Inserting Station: 3052 Lewis Ave & Madison St 22 23 40.686312 -73.935775 In Service 1 0 Lewis Ave & Madison St      False 2020-01-27 01:21:53 PM 
Inserting Station: 3053 Marcy Ave & Lafayette Ave 23 23 40.6900815 -73.947915 In Service 1 0 Marcy Ave & Lafayette Ave      False 2020-01-27 01:22:21 PM 
Inserting Station: 3054 Greene Ave & Throop Ave 18 19 40.6894932 -73.942061 In Service 1 1 Greene Ave & Throop Ave      False 2020-01-27 01:24:15 PM 
Inserting Station: 3055 Greene Ave & Nostrand Ave 20 23 40.6883337 -73.950916 In Service 1 3 Greene Ave & Nostrand Ave      False 2020-01-27 01:23:13 PM 
Inserting Station: 3056 Kosciuszko St & Nostrand Ave 22 23 40.69072549 -73.95133465 In Service 1 0 Kosciuszko St & Nostrand Ave      False 2020-01-27 01:21:49 PM 
Inserting Station: 3057 Kosciuszko St & Tompkins Ave 18 18 40.69128258 -73.9452416 

Inserting Station: 3256 Pier 40 - Hudson River Park 11 23 40.7277140777778 -74.01129573583603 In Service 1 12 Pier 40 - Hudson River Park      False 2020-01-27 01:23:11 PM 
Inserting Station: 3259 9 Ave & W 28 St 13 27 40.74937024193277 -73.99923384189606 In Service 1 14 9 Ave & W 28 St      False 2020-01-27 01:23:50 PM 
Inserting Station: 3260 Mercer St & Bleecker St 1 45 40.72706363348306 -73.99662137031554 In Service 1 44 Mercer St & Bleecker St      False 2020-01-27 01:24:08 PM 
Inserting Station: 3263 Cooper Square & Astor Pl 35 59 40.72951496224949 -73.99075269699097 In Service 1 23 Cooper Square & Astor Pl      False 2020-01-27 01:23:54 PM 
Inserting Station: 3267 Morris Canal 13 14 40.7124188237569 -74.03852552175522 In Service 1 1 Morris Canal      False 2020-01-27 01:23:45 PM 
Inserting Station: 3268 Lafayette Park 10 14 40.71346382669195 -74.06285852193832 In Service 1 3 Lafayette Park      False 2020-01-27 01:22:17 PM 
Inserting Station: 3269 Brunswick & 6th 14 14 40.726011

Inserting Station: 3412 Pacific St & Nevins St 6 18 40.6853761 -73.98302136 In Service 1 12 Pacific St & Nevins St      False 2020-01-27 01:23:02 PM 
Inserting Station: 3414 Bergen St & Flatbush Ave 4 33 40.680944723477296 -73.97567331790923 In Service 1 29 Bergen St & Flatbush Ave      False 2020-01-27 01:21:32 PM 
Inserting Station: 3415 Prospect Pl & 6 Ave 19 25 40.6793307 -73.97519523 In Service 1 6 Prospect Pl & 6 Ave      False 2020-01-27 01:22:25 PM 
Inserting Station: 3416 7 Ave & Park Pl 10 25 40.6776147 -73.97324283 In Service 1 15 7 Ave & Park Pl      False 2020-01-27 01:22:28 PM 
Inserting Station: 3417 Baltic St & 5 Ave 9 27 40.6795766 -73.97854971 In Service 1 18 Baltic St & 5 Ave      False 2020-01-27 01:25:11 PM 
Inserting Station: 3418 Plaza St West & Flatbush Ave 22 33 40.6750207 -73.97111473 In Service 1 11 Plaza St West & Flatbush Ave      False 2020-01-27 01:22:45 PM 
Inserting Station: 3419 Douglass St & 4 Ave 9 27 40.6792788 -73.98154004 In Service 1 18 Douglass 

Inserting Station: 3609 Vernon Blvd & 31 Ave 18 23 40.7692475 -73.9354504 In Service 1 4 Vernon Blvd & 31 Ave      False 2020-01-27 01:23:48 PM 
Inserting Station: 3610 Vernon Blvd & 30 Rd 10 27 40.770845 -73.934171 In Service 1 17 Vernon Blvd & 30 Rd      False 2020-01-27 01:23:48 PM 
Inserting Station: 3611 Vernon Blvd & 47 Rd 2 31 40.7449067 -73.9534573 In Service 1 28 Vernon Blvd & 47 Rd      False 2020-01-27 01:24:31 PM 
Inserting Station: 3612 30 Ave & 21 St 12 15 40.7703743 -73.9286078 In Service 1 3 30 Ave & 21 St      False 2020-01-27 01:24:45 PM 
Inserting Station: 3613 Center Blvd & 48 Ave 12 25 40.745038 -73.957539 In Service 1 13 Center Blvd & 48 Ave      False 2020-01-27 01:22:16 PM 
Inserting Station: 3614 Crescent St & 30 Ave 23 23 40.768692 -73.9249574 In Service 1 0 Crescent St & 30 Ave      False 2020-01-27 01:24:55 PM 
Inserting Station: 3615 44 Dr & 21 St 0 21 40.748 -73.9460927 In Service 1 20 44 Dr & 21 St      False 2020-01-27 01:21:45 PM 
Inserting Station: 361

Inserting Station: 3827 Halsey St & Broadway 20 20 40.68565 -73.91564 In Service 1 0 Halsey St & Broadway      False 2020-01-27 01:24:42 PM 
Inserting Station: 3828 Eldert St & Bushwick Ave 22 22 40.68652 -73.91321 In Service 1 0 Eldert St & Bushwick Ave      False 2020-01-27 01:21:53 PM 
Inserting Station: 3829 Central Ave & Decatur St 24 24 40.6882 -73.90798 In Service 1 0 Central Ave & Decatur St      False 2020-01-27 01:22:18 PM 
Inserting Station: 3830 Halsey St & Evergreen Ave 22 22 40.68858 -73.91227 In Service 1 0 Halsey St & Evergreen Ave      False 2020-01-27 01:25:08 PM 
Inserting Station: 3831 Broadway & Hancock St 35 35 40.68663 -73.9168 In Service 1 0 Broadway & Hancock St      False 2020-01-27 01:22:48 PM 
Inserting Station: 3832 Central Ave & Weirfield St 20 21 40.69055 -73.91181 In Service 1 1 Central Ave & Weirfield St      False 2020-01-27 01:23:30 PM 
Inserting Station: 3833 Madison St & Evergreen Ave 18 18 40.69122 -73.91693 In Service 1 0 Madison St & Evergreen Av
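
The effect of `INSERT OR IGNORE` is easiest to see in isolation. The sketch below is a simplified two-column stand-in for the table, kept in a separate in-memory database, and it assumes `station_id` is declared as a primary key (without some constraint, there is nothing for `OR IGNORE` to ignore). It also uses `executemany`, which runs the same template once per tuple, as an alternative to looping by hand.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database
con_demo.execute(
    "CREATE TABLE StationsData (station_id int PRIMARY KEY, stationName varchar(250));")

rows = [(304, "Broadway & Battery Pl"), (359, "E 47 St & Park Ave")]
insert_sql = "INSERT OR IGNORE INTO StationsData (station_id, stationName) VALUES (?, ?);"

con_demo.executemany(insert_sql, rows)
con_demo.executemany(insert_sql, rows)  # rerun: both ids already exist, so both rows are skipped
con_demo.commit()

row_count = con_demo.execute("SELECT COUNT(*) FROM StationsData;").fetchone()[0]
print(row_count)  # 2
```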

Now, we can use [pd.read_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html) to check that we are properly connected to our database, and the StationsData table within that database:

In [12]:
check = pd.read_sql("SELECT * FROM StationsData LIMIT 5", con=con)
check

| | station_id | stationName | availableDocks | totalDocks | latitude | longitude | statusValue | statusKey | availableBikes | stAddress1 | stAddress2 | city | postalCode | location | altitude | testStation | lastCommunicationTime | landMark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 304 | Broadway & Battery Pl | 15 | 33 | 40.704633 | -74.013617 | In Service | 1 | 18 | Broadway & Battery Pl | | | | | | 0 | 2020-01-27 01:24:00 PM | |
| 1 | 359 | E 47 St & Park Ave | 35 | 64 | 40.755103 | -73.974987 | In Service | 1 | 29 | E 47 St & Park Ave | | | | | | 0 | 2020-01-27 01:24:21 PM | |
| 2 | 367 | E 53 St & Lexington Ave | 30 | 34 | 40.758281 | -73.970694 | In Service | 1 | 4 | E 53 St & Lexington Ave | | | | | | 0 | 2020-01-27 01:23:49 PM | |
| 3 | 402 | Broadway & E 22 St | 24 | 39 | 40.740343 | -73.989551 | In Service | 1 | 15 | Broadway & E 22 St | | | | | | 0 | 2020-01-27 01:23:39 PM | |
| 4 | 3443 | W 52 St & 6 Ave | 38 | 41 | 40.76133 | -73.97982 | In Service | 1 | 2 | W 52 St & 6 Ave | | | | | | 0 | 2020-01-27 01:21:46 PM | |


---

# SELECT

In [15]:
check = pd.read_sql("SELECT station_id, stationName FROM StationsData LIMIT 5", con=con)
check

| | station_id | stationName |
|---|---|---|
| 0 | 304 | Broadway & Battery Pl |
| 1 | 359 | E 47 St & Park Ave |
| 2 | 367 | E 53 St & Lexington Ave |
| 3 | 402 | Broadway & E 22 St |
| 4 | 3443 | W 52 St & 6 Ave |


# AS

Sometimes we want to rename a column to provide a more descriptive name in the results

In [17]:
check = pd.read_sql("SELECT station_id, stationName, stAddress1 as main_address FROM StationsData LIMIT 5", con=con)
check

| | station_id | stationName | main_address |
|---|---|---|---|
| 0 | 304 | Broadway & Battery Pl | Broadway & Battery Pl |
| 1 | 359 | E 47 St & Park Ave | E 47 St & Park Ave |
| 2 | 367 | E 53 St & Lexington Ave | E 53 St & Lexington Ave |
| 3 | 402 | Broadway & E 22 St | Broadway & E 22 St |
| 4 | 3443 | W 52 St & 6 Ave | W 52 St & 6 Ave |


# DISTINCT

Used to eliminate duplicates in results

In [19]:
check = pd.read_sql("SELECT COUNT(DISTINCT station_id) as num_stations FROM StationsData", con=con)
check

| | num_stations |
|---|---|
| 0 | 935 |
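
As a small illustration of what `DISTINCT` removes, the sketch below compares a plain row count with a distinct count on a three-row sample. It uses plain `sqlite3` on an in-memory database rather than `pd.read_sql`, and the 'Not In Service' status is a made-up value for the example.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database
con_demo.execute("CREATE TABLE StationsData (station_id int, statusValue varchar(250));")
con_demo.executemany("INSERT INTO StationsData VALUES (?, ?);", [
    (304, "In Service"),
    (359, "In Service"),
    (367, "Not In Service"),  # hypothetical status, for illustration only
])

total_rows = con_demo.execute(
    "SELECT COUNT(statusValue) FROM StationsData;").fetchone()[0]
distinct_count = con_demo.execute(
    "SELECT COUNT(DISTINCT statusValue) FROM StationsData;").fetchone()[0]
distinct_statuses = [row[0] for row in con_demo.execute(
    "SELECT DISTINCT statusValue FROM StationsData ORDER BY statusValue;")]

print(total_rows, distinct_count)  # 3 2
print(distinct_statuses)           # ['In Service', 'Not In Service']
```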


# ORDER BY 

Used to sort the result rows based on attribute values (ascending by default; add `DESC` for descending)
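
A minimal sketch of sorting, using the five stations shown earlier loaded into an in-memory table (only three of the columns are kept, and plain `sqlite3` is used instead of `pd.read_sql` so the example stands alone):

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database
con_demo.execute(
    "CREATE TABLE StationsData (station_id int, stationName varchar(250), availableBikes int);")
con_demo.executemany("INSERT INTO StationsData VALUES (?, ?, ?);", [
    (304, "Broadway & Battery Pl", 18),
    (359, "E 47 St & Park Ave", 29),
    (367, "E 53 St & Lexington Ave", 4),
    (402, "Broadway & E 22 St", 15),
    (3443, "W 52 St & 6 Ave", 2),
])

# Ascending is the default sort direction
ascending = con_demo.execute(
    "SELECT station_id, availableBikes FROM StationsData ORDER BY availableBikes;").fetchall()
# DESC reverses it
descending = con_demo.execute(
    "SELECT station_id, availableBikes FROM StationsData ORDER BY availableBikes DESC;").fetchall()

print(ascending)   # [(3443, 2), (367, 4), (402, 15), (304, 18), (359, 29)]
print(descending)  # [(359, 29), (304, 18), (402, 15), (367, 4), (3443, 2)]
```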

# LIMIT

Limits the number of rows in the result
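
`LIMIT` is often paired with `ORDER BY` to get a "top N" result. The sketch below, again on a small in-memory copy of the five stations shown earlier, asks for the two stations with the most available bikes:

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database
con_demo.execute(
    "CREATE TABLE StationsData (station_id int, stationName varchar(250), availableBikes int);")
con_demo.executemany("INSERT INTO StationsData VALUES (?, ?, ?);", [
    (304, "Broadway & Battery Pl", 18),
    (359, "E 47 St & Park Ave", 29),
    (367, "E 53 St & Lexington Ave", 4),
    (402, "Broadway & E 22 St", 15),
    (3443, "W 52 St & 6 Ave", 2),
])

# Sort first, then keep only the first two rows of the sorted result
top_two = con_demo.execute(
    "SELECT station_id, stationName, availableBikes FROM StationsData "
    "ORDER BY availableBikes DESC LIMIT 2;").fetchall()

print(top_two)  # [(359, 'E 47 St & Park Ave', 29), (304, 'Broadway & Battery Pl', 18)]
```

Without an `ORDER BY`, `LIMIT` still caps the number of rows, but which rows come back is not guaranteed.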

# WHERE

Defines which rows will appear in the results

# Conditions for WHERE Clauses:

`attr = 'text'/number` means 'attribute is equal to' (either a text value or numerical value) <br>
`attr != value` or `attr <> value` means 'attribute is *not equal to* value' <br>
`attr > value` means 'attribute is greater than value' <br>
`attr < value` means 'attribute is less than value' <br>
`attr >= value` means 'attribute is greater than or equal to value' <br>
`attr <= value` means 'attribute is less than or equal to value' <br>
`attr IN (x1,x2,x3,...)` means 'attribute value is either x1, x2, or x3, or ...' <br> 
`attr NOT IN (x1,x2,x3,...)` means 'attribute value is not x1, nor x2, nor x3,...' <br>
`condition1 AND condition2` means 'both conditions should hold' <br>
`condition1 OR condition2` means 'at least one of the conditions should hold' <br>
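
A few of these conditions can be tried out on the five stations shown earlier. This is a sketch on an in-memory table with three of the columns; the helper `station_ids` is a hypothetical convenience function written for this example, and it formats fixed demo strings into the query (values coming from users should go through `?` placeholders instead).

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database
con_demo.execute(
    "CREATE TABLE StationsData (station_id int, totalDocks int, availableBikes int);")
con_demo.executemany("INSERT INTO StationsData VALUES (?, ?, ?);", [
    (304, 33, 18), (359, 64, 29), (367, 34, 4), (402, 39, 15), (3443, 41, 2),
])

def station_ids(where_clause):
    """Return the station ids matching a WHERE clause, sorted by id (demo helper)."""
    sql = f"SELECT station_id FROM StationsData WHERE {where_clause} ORDER BY station_id;"
    return [row[0] for row in con_demo.execute(sql)]

print(station_ids("availableBikes >= 15"))                     # [304, 359, 402]
print(station_ids("station_id IN (304, 402)"))                 # [304, 402]
print(station_ids("station_id NOT IN (304, 402)"))             # [359, 367, 3443]
print(station_ids("totalDocks > 35 AND availableBikes < 20"))  # [402, 3443]
print(station_ids("availableBikes < 5 OR totalDocks > 60"))    # [359, 367, 3443]
```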


# Other Operators

`AS` is used to change the name of a column in the result <br> 
`DISTINCT` means 'no duplicate rows' <br>
`ORDER BY` lets you sort by column(s) in ascending or descending order <br>
`*` means 'select all columns' <br>
`IS NULL` returns rows that have null values for a specified attribute <br>
`IS NOT NULL` returns rows that do not have null values for a specified attribute <br>
`BETWEEN` is written as `attr BETWEEN low AND high` and returns rows whose attribute value lies between the two values, endpoints included
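
A short sketch of `IS NULL`, `IS NOT NULL`, `BETWEEN`, and `*` on a three-row in-memory sample. Two assumptions are worth flagging: the postal code '10017' is made up, and the missing postal codes are stored here as `None` (SQL `NULL`). In the CitiBike feed those fields are empty strings, which are not `NULL` and would not match `IS NULL`.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database
con_demo.execute(
    "CREATE TABLE StationsData (station_id int, availableBikes int, postalCode varchar(250));")
con_demo.executemany("INSERT INTO StationsData VALUES (?, ?, ?);", [
    (304, 18, None),     # None is stored as NULL
    (359, 29, "10017"),  # hypothetical postal code, for illustration only
    (367, 4, None),
])

def ids(where_clause):
    """Return matching station ids sorted by id (demo helper for fixed strings)."""
    sql = f"SELECT station_id FROM StationsData WHERE {where_clause} ORDER BY station_id;"
    return [row[0] for row in con_demo.execute(sql)]

print(ids("postalCode IS NULL"))               # [304, 367]
print(ids("postalCode IS NOT NULL"))           # [359]
print(ids("availableBikes BETWEEN 4 AND 18"))  # [304, 367] (both endpoints included)

# '*' selects every column of the matching rows
first_row = con_demo.execute(
    "SELECT * FROM StationsData ORDER BY station_id LIMIT 1;").fetchone()
print(first_row)  # (304, 18, None)
```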


---

# Adding a Second Database

In [22]:
# https://data.cityofnewyork.us/resource/i4gi-tjb9.json

# from https://data.cityofnewyork.us/Transportation/Real-Time-Traffic-Speed-Data/qkm5-nuaq