## Designing the Database

Each Citi Bike file records every trip taken during a single month. There is one file per month, starting from June 2013, and all files share the same format. The columns, in order, are:
- Trip Duration (seconds): The length of the trip in seconds
- Start Date & Time: The start time of the trip MM-DD-YYYY HH:MM:SS
- End Date & Time: The end time of the trip MM-DD-YYYY HH:MM:SS
- Start Station ID: The ID for the station where the trip started
- Start Station Name: The name of the station where the trip started
- Start Station Latitude: The latitude of the station where the trip started
- Start Station Longitude: The longitude of the station where the trip started
- End Station ID: The ID for the station where the trip ended
- End Station Name: The name of the station where the trip ended
- End Station Latitude: The latitude of the station where the trip ended
- End Station Longitude: The longitude of the station where the trip ended
- Bike ID: The ID for the bike that was used in the trip
- User Type: What type of user took the trip (Subscriber or Customer)
- Year of Birth: The year that the user was born
- Gender: The gender of the user (Male - 1, Female - 2, None - 0)
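
As a rough illustration of this layout, the sketch below names the fifteen columns and reads one row with the standard `csv` module. The short column names and their order follow the staging table created later in this notebook. The sample row is invented for the example and is not taken from a real file.

```python
import csv
from io import StringIO

# Short column names, in the order used by the staging table in this notebook.
COLUMNS = [
    "tripduration", "starttime", "endtime",
    "startID", "startname", "start_lat", "start_long",
    "endID", "endname", "end_lat", "end_long",
    "bikeID", "usertype", "birthyear", "gender",
]

# Hypothetical sample row (all values invented for illustration only).
sample = StringIO(
    '634,07-01-2013 00:00:00,07-01-2013 00:10:34,'
    '1,"Example St & 1 Ave",40.75,-73.97,'
    '2,"Example St & 2 Ave",40.73,-73.98,'
    '16950,Customer,,0\n'
)

# Passing fieldnames explicitly lets the reader label a header-less row.
reader = csv.DictReader(sample, fieldnames=COLUMNS)
row = next(reader)

# An empty birth year arrives as an empty string and needs handling downstream.
print(row["tripduration"], row["usertype"], repr(row["birthyear"]))
```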

<img src="DatabaseDiagram.png" width="600" height="800" align="center"/>

## Connecting to the Database

In [32]:
pip install psycopg2-binary;

Collecting psycopg2-binary
  Using cached psycopg2_binary-2.8.6-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.8.6
Note: you may need to restart the kernel to use updated packages.


In [33]:
import psycopg2

In [34]:
# Connection settings: fill in the database name and password before running
PGHOST = 'tripdatabase.cmaaautpgbsf.us-east-2.rds.amazonaws.com'
PGDATABASE = ''
PGUSER = 'postgres'
PGPASSWORD = 'Josh1234'
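
Credentials written directly into a notebook are easy to leak when the notebook is shared. One common alternative, sketched here, is to read them from environment variables. The helper `read_pg_settings` is hypothetical, and the variable names are an assumption chosen to match the constants above.

```python
import os

def read_pg_settings(env=os.environ):
    """Collect connection settings from environment variables.

    Hypothetical helper: it raises if the password is missing instead of
    falling back to a value stored in the notebook.
    """
    password = env.get("PGPASSWORD")
    if not password:
        raise RuntimeError("Set PGPASSWORD before running this cell")
    return {
        "host": env.get("PGHOST", "localhost"),
        "database": env.get("PGDATABASE", ""),
        "user": env.get("PGUSER", "postgres"),
        "password": password,
        "port": env.get("PGPORT", "5432"),
    }

# Example with an explicit mapping so the sketch runs without real secrets.
settings = read_pg_settings(
    {"PGHOST": "db.example.com", "PGPASSWORD": "not-a-real-password"}
)
print(settings["host"], settings["user"])
```

The resulting dictionary uses the same keyword names as the connection call in the next cell, so it could be passed as `psycopg2.connect(**settings)`.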

In [35]:
try:   
    # Set up a connection to the postgres server.    
    conn = psycopg2.connect(user = PGUSER,
                            port = "5432",
                            password = PGPASSWORD,
                            host = PGHOST,
                            database = PGDATABASE)
    # Create a cursor object
    cursor = conn.cursor()   
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("Connection Success:", record,"\n")

except (Exception, psycopg2.Error) as error:
    print("Error while connecting to PostgreSQL", error)

Connection Success: ('PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit',) 



## Populating the Staging Table

In [14]:
pip install s3fs

Collecting s3fs
  Using cached s3fs-0.5.1-py3-none-any.whl (21 kB)
Collecting aiobotocore>=1.0.1
  Using cached aiobotocore-1.1.2-py3-none-any.whl (45 kB)
Collecting fsspec>=0.8.0
  Using cached fsspec-0.8.4-py3-none-any.whl (91 kB)
Collecting aiohttp>=3.3.1
  Using cached aiohttp-3.7.3-cp37-cp37m-manylinux2014_x86_64.whl (1.3 MB)
Collecting botocore<1.17.45,>=1.17.44
  Using cached botocore-1.17.44-py2.py3-none-any.whl (6.5 MB)
Collecting aioitertools>=0.5.1
  Using cached aioitertools-0.7.1-py3-none-any.whl (20 kB)
Collecting yarl<2.0,>=1.0
  Using cached yarl-1.6.3-cp37-cp37m-manylinux2014_x86_64.whl (294 kB)
Collecting multidict<7.0,>=4.5
  Using cached multidict-5.0.2-cp37-cp37m-manylinux2014_x86_64.whl (142 kB)
Collecting typing-extensions>=3.6.5
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting async-timeout<4.0,>=3.0
  Using cached async_timeout-3.0.1-py3-none-any.whl (8.2 kB)
ERROR: boto3 1.16.21 has requirement botocore<1.20.0,>=1.19.21, but yo

In [15]:
import pandas as pd
import s3fs
import os
from io import StringIO

In [5]:
ACCESS_KEY_ID = 'AKIARJEUISD2VILSZ6HM'
ACCESS_SECRET_KEY = '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'
bucket = "s3://williams-citibike/TripData/"

fs = s3fs.S3FileSystem(anon=False, key = ACCESS_KEY_ID, secret= ACCESS_SECRET_KEY)
trip_filenames = fs.ls(bucket)[1:]

In [10]:
staging_table_query = """
           CREATE TABLE IF NOT EXISTS staging (
               tripduration INTEGER, 
               starttime TIMESTAMP,
               endtime TIMESTAMP,
               startID NUMERIC,
               startname VARCHAR(64),
               start_lat REAL,
               start_long REAL,
               endID NUMERIC,
               endname VARCHAR(64),
               end_lat REAL,
               end_long REAL,
               bikeID INTEGER,
               usertype VARCHAR(16),
               birthyear REAL,
               gender SMALLINT                
          );
          """
cursor.execute("rollback;")
cursor.execute(staging_table_query)
conn.commit()

In [11]:
def populate_stage(datafile: str) -> None:
    """Grabs the data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    datastream = StringIO()
    
    with fs.open("s3://"+datafile, 'r') as file:
        data = pd.read_csv(file, na_values ="") 
        data.fillna(-1, inplace=True) # Missing values (e.g. an empty birthyear) become -1 so every field holds a value
        
        # Some station names contain commas, which copy_from would read as extra field separators
        data.iloc[:, 4] = data.iloc[:, 4].str.replace(',','_')
        data.iloc[:, 8] = data.iloc[:, 8].str.replace(',','_')
        
        # data.iloc[:, 3] = data.iloc[:, 3].astype('int32')
        # data.iloc[:, 7] = data.iloc[:, 7].astype('int32')
        
        data.to_csv(datastream, index=False, header = False)
        datastream.seek(0)

        cursor.copy_from(datastream,'staging',sep=',')
        conn.commit()
    
    datastream.close()
    print(f"Finished Uploading to Staging Table: {datafile}")
    return None
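
The comma replacement in `populate_stage` matters because `copy_from` splits each line on the separator and does not interpret CSV quoting. The stand-alone sketch below shows the effect on the field count, using a made-up station name (all sample values are hypothetical).

```python
import csv
from io import StringIO

# Hypothetical row: trip duration, station name containing a comma, bike ID.
row = [634, "Example Plaza, North", 16950]

# A CSV writer quotes the name, but a plain split on ',' (which is how a
# separator-based text load sees the line) still finds an extra field.
buffer = StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue().strip()
print(line, "->", len(line.split(",")), "fields")

# Replacing the comma first keeps the field count at three.
cleaned = [row[0], row[1].replace(",", "_"), row[2]]
buffer = StringIO()
csv.writer(buffer).writerow(cleaned)
clean_line = buffer.getvalue().strip()
print(clean_line, "->", len(clean_line.split(",")), "fields")
```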

In [12]:
"""
cursor.execute("rollback;")
for file in trip_filenames:
    populate_stage(file)
"""

Finished Uploading to Raw: williams-citibike/TripData/2013-07 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-08 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-09 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-10 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-11 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-12 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/201306-citibike-tripdata.csv


## Populating the Trip Table

In [19]:
trip_table_query = """
            CREATE TABLE IF NOT EXISTS trip (
                starttime TIMESTAMP,
                endtime TIMESTAMP,
                tripduration INTEGER,
                startID NUMERIC,
                endID NUMERIC,
                usertype VARCHAR(16),
                birthyear REAL,
                gender SMALLINT
            );
            """
cursor.execute("rollback;")
cursor.execute(trip_table_query)
conn.commit()

In [20]:
insert_query2 = """
        INSERT INTO trip
        SELECT starttime, endtime, tripduration, startid, endid, usertype, birthyear, gender
          FROM staging
         ORDER BY starttime, endtime;
        """

cursor.execute("rollback;")
cursor.execute(insert_query2)
conn.commit()

## Populating the Station Table (Without the Neighborhood Code)

In [14]:
station_table_query = """
               CREATE TABLE IF NOT EXISTS station (
                   stationID NUMERIC PRIMARY KEY,
                   name VARCHAR(64) NOT NULL,
                   latitude REAL,
                   longitude REAL
                );
                
                """
cursor.execute("rollback;")
cursor.execute(station_table_query)
conn.commit()

In [15]:
insert_query = """
        INSERT INTO station
        SELECT DISTINCT ON(endid) endid, endname, end_lat, end_long 
          FROM staging 
         ORDER BY endid;
        """

cursor.execute("rollback;")
cursor.execute(insert_query)
conn.commit()

## Prepping the Neighborhood Table I - Without the Spatial Data

In [9]:
from bs4 import BeautifulSoup
import requests

In [10]:
# Attempt connection to the URL
HoodURL = "https://furmancenter.org/neighborhoods"
try:
    r2 = requests.get(HoodURL)
    r2.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print(errh)

In [11]:
soup = BeautifulSoup(r2.content, "html.parser")

# The website has a dropdown with all the neighborhood codes and names
hood_code_names = []

#Instead of creating a dictionary like before, we create a list of tuples so that we can make a df
for code in soup.find_all('option')[1:]:
    hood_code_names.append((code.text[:4], code.text[6:].replace("/","-").replace(" ","_")))
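
The slicing in the loop assumes that each dropdown entry starts with a four-character code, followed by a two-character separator and then the neighborhood name. The sketch below applies the same slicing to a made-up entry. The exact separator on the live page is an assumption here, and `parse_option` is a hypothetical helper.

```python
def parse_option(text):
    """Split a dropdown label into (code, cleaned name) using fixed offsets."""
    code = text[:4]
    # Same clean-up as the loop above: '/' becomes '-', spaces become '_'.
    name = text[6:].replace("/", "-").replace(" ", "_")
    return code, name

# Hypothetical label in the assumed "CODE: Name" layout.
parsed = parse_option("BK02: Fort Greene/Brooklyn Heights")
print(parsed)
```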

In [131]:
hood_df = pd.DataFrame(hood_code_names, columns=["code", "hoodname"])

In [133]:
borough = {
        "BK": "Brooklyn", 
        "BX": "Bronx",
        "MN": "Manhattan",
        "QN": "Queens",
        "SI": "Staten Island"
        }

hood_df["borough"] = hood_df["code"].str[0:2].map(borough)

In [219]:
hood_df.head()

| | code | hoodname | borough |
|---|---|---|---|
| 0 | BK01 | Greenpoint-Williamsburg | Brooklyn |
| 1 | BK02 | Fort_Greene-Brooklyn_Heights | Brooklyn |
| 2 | BK03 | Bedford_Stuyvesant | Brooklyn |
| 3 | BK04 | Bushwick | Brooklyn |
| 4 | BK05 | East_New_York-Starrett_City | Brooklyn |


## Prepping the Neighborhood Table II - Adding the Spatial Data

In [1]:
pip install geopandas

Collecting geopandas
  Using cached geopandas-0.8.1-py2.py3-none-any.whl (962 kB)
Collecting shapely
  Using cached Shapely-1.7.1-cp37-cp37m-manylinux1_x86_64.whl (1.0 MB)
Collecting pyproj>=2.2.0
  Using cached pyproj-3.0.0.post1-cp37-cp37m-manylinux2010_x86_64.whl (6.4 MB)
Collecting fiona
  Using cached Fiona-1.8.18-cp37-cp37m-manylinux1_x86_64.whl (14.8 MB)
Collecting munch
  Using cached munch-2.5.0-py2.py3-none-any.whl (10 kB)
Collecting click-plugins>=1.0
  Using cached click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Collecting cligj>=0.5
  Using cached cligj-0.7.1-py3-none-any.whl (7.1 kB)
Installing collected packages: shapely, pyproj, munch, click-plugins, cligj, fiona, geopandas
Successfully installed click-plugins-1.1.1 cligj-0.7.1 fiona-1.8.18 geopandas-0.8.1 munch-2.5.0 pyproj-3.0.0.post1 shapely-1.7.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install descartes

Collecting descartes
  Using cached descartes-1.1.0-py3-none-any.whl (5.8 kB)
Installing collected packages: descartes
Successfully installed descartes-1.1.0
Note: you may need to restart the kernel to use updated packages.


In [71]:
import geopandas as gpd

In [224]:
districts = gpd.read_file('Community_Districts.geojson')

In [225]:
districts.head()

| | boro_cd | shape_area | shape_leng | geometry |
|---|---|---|---|---|
| 0 | 311 | 103177785.365 | 51549.5578567 | MULTIPOLYGON (((-73.97299 40.60881, -73.97259 ... |
| 1 | 313 | 88195686.2748 | 65821.875577 | MULTIPOLYGON (((-73.98372 40.59582, -73.98305 ... |
| 2 | 312 | 99525500.0655 | 52245.8304843 | MULTIPOLYGON (((-73.97140 40.64826, -73.97121 ... |
| 3 | 206 | 42664311.3238 | 35875.7111725 | MULTIPOLYGON (((-73.87185 40.84376, -73.87192 ... |
| 4 | 226 | 50566410.6415 | 32820.3983295 | MULTIPOLYGON (((-73.86790 40.90294, -73.86796 ... |


In [188]:
"""
ax = districts.plot(figsize=(30,15))

for long,lat, label in zip(districts.centroid.x, districts.centroid.y, districts.boro_cd):
    ax.annotate(label, xy=(long,lat))
""";

The codes from the Furman Center correspond to the codes seen on the map, with one difference: on the map the first digit represents the borough, whereas the Furman Center codes use a two-letter borough abbreviation. The map codes therefore have to be reverse engineered using a mapping.
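
As a plain-Python sketch of that mapping (the helper name `to_furman_code` is hypothetical), the leading borough digit of a community district code is swapped for its two-letter abbreviation and the remaining two digits are kept:

```python
BOROUGH_NUM_TO_ABBR = {"1": "MN", "2": "BX", "3": "BK", "4": "QN", "5": "SI"}

def to_furman_code(boro_cd):
    """Convert a code such as '311' into the 'BK11' style used by hood_df."""
    boro_cd = str(boro_cd)
    return BOROUGH_NUM_TO_ABBR[boro_cd[0]] + boro_cd[1:]

for cd in ["311", "206", "101"]:
    print(cd, "->", to_furman_code(cd))
```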

In [86]:
borough_num_to_abr = {
        "3": "BK", 
        "2": "BX",
        "1": "MN",
        "4": "QN",
        "5": "SI"
        }

districts["boro_cd"] = districts["boro_cd"].str[0].map(borough_num_to_abr) + districts['boro_cd'].str[1:]

In [189]:
"""
ax = districts.plot(figsize=(30,15))

for long,lat, label in zip(districts.centroid.x, districts.centroid.y, districts.boro_cd):
    ax.annotate(label, xy=(long,lat))
""";

In [87]:
districts = districts[['boro_cd','geometry']]

In [134]:
hood_spatial = hood_df.merge(districts, left_on='code', right_on='boro_cd', how='left').loc[:,['code', 'hoodname', 'borough', 'geometry']]

In [137]:
hood_spatial.sort_values(by='code', inplace=True)

In [139]:
hood_spatial = gpd.GeoDataFrame(hood_spatial)

In [226]:
hood_spatial.head()

| | code | hoodname | borough | geometry |
|---|---|---|---|---|
| 0 | BK01 | Greenpoint-Williamsburg | Brooklyn | MULTIPOLYGON (((-73.92406 40.71411, -73.92404 ... |
| 1 | BK02 | Fort_Greene-Brooklyn_Heights | Brooklyn | MULTIPOLYGON (((-73.96929 40.70709, -73.96839 ... |
| 2 | BK03 | Bedford_Stuyvesant | Brooklyn | MULTIPOLYGON (((-73.91805 40.68721, -73.91800 ... |
| 3 | BK04 | Bushwick | Brooklyn | MULTIPOLYGON (((-73.89647 40.68234, -73.89653 ... |
| 4 | BK05 | East_New_York-Starrett_City | Brooklyn | MULTIPOLYGON (((-73.86841 40.69473, -73.86868 ... |


## Populating the Neighborhood Table

In [215]:
neighborhood_table_query = """
            CREATE TABLE IF NOT EXISTS neighborhood (
                code CHAR(4) PRIMARY KEY,
                hoodname VARCHAR NOT NULL,
                borough VARCHAR(16) NOT NULL,
                geometry GEOGRAPHY(MULTIPOLYGON,4326)
            );
            """
cursor.execute("rollback;")
cursor.execute(neighborhood_table_query)
conn.commit()

In [213]:
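def populate_hood(code, hoodname, borough, geometry):
    """Insert one neighborhood row; the geometry is sent as its WKT string."""
    insert_hood_query = """
            INSERT INTO neighborhood
            VALUES (%s, %s, %s, %s);
            """

    cursor.execute('rollback;')
    # Let psycopg2 bind the values so quotes in a name cannot break the SQL
    cursor.execute(insert_hood_query, (code, hoodname, borough, str(geometry)))
    conn.commit()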

In [217]:
hood_spatial.head()

| | code | hoodname | borough | geometry |
|---|---|---|---|---|
| 0 | BK01 | Greenpoint-Williamsburg | Brooklyn | MULTIPOLYGON (((-73.92406 40.71411, -73.92404 ... |
| 1 | BK02 | Fort_Greene-Brooklyn_Heights | Brooklyn | MULTIPOLYGON (((-73.96929 40.70709, -73.96839 ... |
| 2 | BK03 | Bedford_Stuyvesant | Brooklyn | MULTIPOLYGON (((-73.91805 40.68721, -73.91800 ... |
| 3 | BK04 | Bushwick | Brooklyn | MULTIPOLYGON (((-73.89647 40.68234, -73.89653 ... |
| 4 | BK05 | East_New_York-Starrett_City | Brooklyn | MULTIPOLYGON (((-73.86841 40.69473, -73.86868 ... |


In [218]:
hood_spatial.apply(lambda row: populate_hood(row['code'], row['hoodname'], row['borough'], row['geometry']), axis=1);
