## Designing the Database

Every Citi Bike trip file has the same format. The columns in each file are:
- Trip Duration (seconds)
- Start Date & Time
- End Date & Time
- Start Station ID
- Start Station Name
- Start Station Latitude
- Start Station Longitude
- End Station ID
- End Station Name
- End Station Latitude
- End Station Longitude
- Bike ID
- User Type
- Year of Birth
- Gender
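
Because the load into the staging table is positional, a row with the wrong number of fields will fail or land in the wrong columns. The following is a minimal sketch of a pre-load check, assuming each row should carry 15 distinct fields; `EXPECTED_FIELDS` and `malformed_rows` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import csv
from io import StringIO

# Assumption: every row should have 15 fields, one per distinct column in the trip files.
EXPECTED_FIELDS = 15


def malformed_rows(text: str) -> list:
    """Return the 1-based line numbers of rows whose field count is not EXPECTED_FIELDS."""
    reader = csv.reader(StringIO(text))
    return [n for n, row in enumerate(reader, start=1) if len(row) != EXPECTED_FIELDS]
```

The `csv` module honors quoted fields, so a quoted station name that contains a comma still counts as a single field.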

<img src="DatabaseDiagram.png" width="600" height="800" align="center"/>

## Connecting to the Database

In [1]:
pip install psycopg2-binary;

Note: you may need to restart the kernel to use updated packages.


In [2]:
import psycopg2

In [3]:
# Fill in the database name and password before connecting
PGHOST = 'tripdatabase.cmaaautpgbsf.us-east-2.rds.amazonaws.com'
PGDATABASE = ''
PGUSER = 'postgres'
PGPASSWORD = ''

In [4]:
try:   
    # Set up a connection to the postgres server.    
    conn = psycopg2.connect(user = PGUSER,
                            port = "5432",
                            password = PGPASSWORD,
                            host = PGHOST,
                            database = PGDATABASE)
    # Create a cursor object
    cursor = conn.cursor()   
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("Connection Success:", record,"\n")

except (Exception, psycopg2.Error) as error:
    print("Error while connecting to PostgreSQL", error)

Connection Success: ('PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit',) 
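
One way to keep the password out of the notebook is to read the connection settings from environment variables. This is only a sketch: the helper name `connection_params` is hypothetical, and it assumes the variables `PGHOST`, `PGDATABASE`, and `PGPASSWORD` have been exported before the kernel starts.

```python
import os


def connection_params(env=os.environ) -> dict:
    """Build keyword arguments for psycopg2.connect from environment variables."""
    return {
        "host": env["PGHOST"],
        "port": env.get("PGPORT", "5432"),      # fall back to the default PostgreSQL port
        "dbname": env["PGDATABASE"],
        "user": env.get("PGUSER", "postgres"),  # same default user as in the cell above
        "password": env["PGPASSWORD"],
    }


# Usage (requires a reachable server):
# conn = psycopg2.connect(**connection_params())
```

A missing required variable raises `KeyError` immediately, which is easier to diagnose than a failed login.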



## Populating the Staging Table

In [4]:
pip install s3fs;

Collecting s3fs
  Using cached s3fs-0.5.1-py3-none-any.whl (21 kB)
Collecting aiobotocore>=1.0.1
  Using cached aiobotocore-1.1.2-py3-none-any.whl (45 kB)
Collecting fsspec>=0.8.0
  Using cached fsspec-0.8.4-py3-none-any.whl (91 kB)
Collecting aioitertools>=0.5.1
  Using cached aioitertools-0.7.1-py3-none-any.whl (20 kB)
Collecting botocore<1.17.45,>=1.17.44
  Using cached botocore-1.17.44-py2.py3-none-any.whl (6.5 MB)
Collecting aiohttp>=3.3.1
  Using cached aiohttp-3.7.3-cp37-cp37m-manylinux2014_x86_64.whl (1.3 MB)
Collecting typing_extensions>=3.7
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting yarl<2.0,>=1.0
  Using cached yarl-1.6.3-cp37-cp37m-manylinux2014_x86_64.whl (294 kB)
Collecting multidict<7.0,>=4.5
  Using cached multidict-5.0.2-cp37-cp37m-manylinux2014_x86_64.whl (142 kB)
Collecting async-timeout<4.0,>=3.0
  Using cached async_timeout-3.0.1-py3-none-any.whl (8.2 kB)
ERROR: boto3 1.16.13 has requirement botocore<1.20.0,>=1.19.13, but you'

In [5]:
import pandas as pd
import s3fs
import os
from io import StringIO

In [6]:
# Read the AWS credentials from the environment rather than hard-coding them in the notebook
ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
ACCESS_SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
bucket = "s3://williams-citibike/TripData/"
fs = s3fs.S3FileSystem(anon=False, key=ACCESS_KEY_ID, secret=ACCESS_SECRET_KEY)
# [1:] skips the first entry returned by the listing
trip_filenames = fs.ls(bucket)[1:]
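
The `[1:]` slice relies on the first entry of the listing being something other than a data file. A more explicit alternative, sketched here with a hypothetical helper `csv_files`, is to keep only the paths that end in `.csv`.

```python
def csv_files(paths):
    """Keep only the paths that end in .csv, preserving their order."""
    return [p for p in paths if p.lower().endswith(".csv")]


# Usage with the listing from the cell above:
# trip_filenames = csv_files(fs.ls("s3://williams-citibike/TripData/"))
```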

In [10]:
stagingtable = """
           CREATE TABLE IF NOT EXISTS staging (
               tripduration INTEGER, 
               starttime TIMESTAMP,
               endtime TIMESTAMP,
               startID NUMERIC,
               startname VARCHAR(64),
               start_lat REAL,
               start_long REAL,
               endID NUMERIC,
               endname VARCHAR(64),
               end_lat REAL,
               end_long REAL,
               bikeID INTEGER,
               usertype VARCHAR(16),
               birthyear REAL,
               gender SMALLINT                
          );
          """
cursor.execute("rollback;")
cursor.execute(stagingtable)
conn.commit()

In [11]:
def populate_stage(datafile: str) -> None:
    """Grabs the data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    datastream = StringIO()
    
    with fs.open("s3://"+datafile, 'r') as file:
        data = pd.read_csv(file, na_values ="") 
        data.fillna(-1, inplace=True) # Fill empty cells with -1; copy_from would otherwise reject blanks in numeric columns such as birthyear
        
        # Some stations have commas in their names, causing copy_from to register extra data fields
        data.iloc[:, 4] = data.iloc[:, 4].str.replace(',', '_', regex=False)
        data.iloc[:, 8] = data.iloc[:, 8].str.replace(',', '_', regex=False)
        
        # data.iloc[:, 3] = data.iloc[:, 3].astype('int32')
        # data.iloc[:, 7] = data.iloc[:, 7].astype('int32')
        
        data.to_csv(datastream, index=False, header = False)
        datastream.seek(0)

        cursor.copy_from(datastream,'staging',sep=',')
        conn.commit()
    
    datastream.close()
    print(f"Finished Uploading to Staging Table: {datafile}")
    return None
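
Replacing commas in station names works, but it alters the data. An alternative, sketched below under the assumption that the rows are available as Python lists, is to let PostgreSQL parse real CSV quoting through `cursor.copy_expert` and `COPY ... WITH (FORMAT csv)`. The helper names `rows_to_csv_buffer` and `copy_rows` are hypothetical and not part of the notebook.

```python
import csv
from io import StringIO

# COPY in CSV mode understands quoted fields, so embedded commas survive intact.
COPY_SQL = "COPY staging FROM STDIN WITH (FORMAT csv)"


def rows_to_csv_buffer(rows):
    """Write rows to an in-memory CSV buffer, quoting fields that contain commas."""
    buffer = StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    buffer.seek(0)  # rewind so the consumer reads from the start
    return buffer


def copy_rows(cursor, rows):
    """Stream rows into the staging table using a psycopg2 cursor."""
    cursor.copy_expert(COPY_SQL, rows_to_csv_buffer(rows))
```

With a DataFrame, the same idea applies: `data.to_csv(datastream, index=False, header=False)` already quotes fields that contain commas, so only the copy call would need to change.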

In [12]:
"""
cursor.execute("rollback;")
for file in trip_filenames:
    populate_stage(file)
"""

Finished Uploading to Raw: williams-citibike/TripData/2013-07 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-08 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-09 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-10 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-11 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/2013-12 - Citi Bike trip data.csv
Finished Uploading to Raw: williams-citibike/TripData/201306-citibike-tripdata.csv


## Populating the Station Table (Without the Neighborhood Code)

In [14]:
stationtable = """
               CREATE TABLE IF NOT EXISTS station (
                   stationID NUMERIC PRIMARY KEY,
                   name VARCHAR(64) NOT NULL,
                   latitude REAL,
                   longitude REAL
                );
                
                """
cursor.execute("rollback;")
cursor.execute(stationtable)
conn.commit()

In [15]:
insert_query = """
               INSERT INTO station
               SELECT DISTINCT ON(endid) endid, endname, end_lat, end_long 
                FROM staging 
               ORDER BY endid;
               """

cursor.execute("rollback;")
cursor.execute(insert_query)
conn.commit()
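
The query above collects stations from the end-station columns only, so a station that appears solely as a starting point would be missed. A possible variant, offered as an untested sketch rather than the notebook's method, draws from both sets of columns; the name `station_query` and the aliases `id`, `station_name`, `lat`, and `lon` are introduced here for illustration.

```python
# Sketch: gather stations from both the start and the end columns of staging.
station_query = """
    INSERT INTO station (stationID, name, latitude, longitude)
    SELECT DISTINCT ON (id) id, station_name, lat, lon
      FROM (
            SELECT startid AS id, startname AS station_name,
                   start_lat AS lat, start_long AS lon
              FROM staging
            UNION ALL
            SELECT endid, endname, end_lat, end_long
              FROM staging
           ) AS stations
     ORDER BY id
    ON CONFLICT (stationID) DO NOTHING;
"""

# Usage, with the cursor and connection from the cells above:
# cursor.execute(station_query)
# conn.commit()
```

`ON CONFLICT ... DO NOTHING` lets the statement run again without failing on stations that are already present.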

## Populating the Trip Table

In [19]:
triptable = """
            CREATE TABLE IF NOT EXISTS trip (
                starttime TIMESTAMP,
                endtime TIMESTAMP,
                tripduration INTEGER,
                startID NUMERIC,
                endID NUMERIC,
                usertype VARCHAR(16),
                birthyear REAL,
                gender SMALLINT
            );
            """
cursor.execute("rollback;")
cursor.execute(triptable)
conn.commit()

In [20]:
insert_query2 = """
                INSERT INTO trip
                SELECT starttime, endtime, tripduration, startid, endid, usertype, birthyear, gender
                  FROM staging
                 ORDER BY starttime, endtime;
                """

cursor.execute("rollback;")
cursor.execute(insert_query2)
conn.commit()

## Populating the Neighborhood Table

In [20]:
from bs4 import BeautifulSoup
import requests

In [21]:
# Attempt connection to the URL
HoodURL = "https://furmancenter.org/neighborhoods"
try:
    r2 = requests.get(HoodURL)
    r2.raise_for_status()
except requests.exceptions.RequestException as err:
    # RequestException also covers connection failures and timeouts, not only HTTP error statuses
    print(err)

In [31]:
soup = BeautifulSoup(r2.content, "html.parser")

# The website has a dropdown with all the neighborhood codes and names
hood_code_names = []

# Collect (code, name) tuples so that the list can be turned into a DataFrame
for code in soup.find_all('option')[1:]:
    hood_code_names.append((code.text[:4], code.text[6:].replace("/","-").replace(" ","_")))
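
The fixed slices `[:4]` and `[6:]` assume every option reads as a four-character code, a two-character separator, and then the name. The sketch below splits on the separator instead. It assumes the separator is `": "`, which is a guess based on those slice positions, and `parse_option` is a hypothetical helper.

```python
def parse_option(text: str) -> tuple:
    """Split an option such as 'BK01: Greenpoint/Williamsburg' into (code, name)."""
    # Assumption: the code and the name are separated by ': '.
    code, name = text.split(": ", 1)
    return code.strip(), name.strip().replace("/", "-").replace(" ", "_")
```

A line without the separator raises `ValueError`, which surfaces an unexpected page layout instead of silently producing a wrong code.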

In [34]:
hood_df = pd.DataFrame(hood_code_names, columns=["Code", "Name"])

In [40]:
borough = {
        "BK": "Brooklyn", 
        "BX": "Bronx",
        "MN": "Manhattan",
        "QN": "Queens",
        "SI": "Staten Island"
        }

hood_df["Borough"] = hood_df["Code"].str[0:2].map(borough)

In [48]:
hoodtable = """
            CREATE TABLE IF NOT EXISTS neighborhood (
                code CHAR(4) PRIMARY KEY,
                hoodname VARCHAR NOT NULL,
                borough VARCHAR(16) NOT NULL
            );
            """
cursor.execute("rollback;")
cursor.execute(hoodtable)
conn.commit()

In [50]:
hoodstream = StringIO()

hood_df.to_csv(hoodstream, index=False, header = False)
hoodstream.seek(0)

cursor.copy_from(hoodstream,'neighborhood',sep=',')
conn.commit()
    
hoodstream.close()

## TESTING: Creating a Neighborhood Profile Table

In [222]:
hood_filenames = fs.ls("s3://williams-citibike/HoodData/")[1:]

In [223]:
# Importing the file from the bucket
cols_lst = [0,3,8]
names_lst = ["code", "indicator", "2018"]

file = "s3://" + hood_filenames[0]
data = pd.read_excel(file, sheet_name=1, usecols = cols_lst, names = names_lst)

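# Prep the '2018' column so that it can be used as the value argument in pivot_table.
# regex=False treats '$' as a literal character rather than the end-of-string anchor.
data['2018'] = data['2018'].str.replace('$', "", regex=False)
data['2018'] = data['2018'].str.replace(',', "", regex=False)

# Values that are percents get turned into decimals
for index, value in data['2018'].items():
    if isinstance(value, str) and value.endswith('%'):
        # .loc avoids chained assignment, which may not write back to the DataFrame
        data.loc[index, '2018'] = float(value.strip('%')) / 100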

In [224]:
data['2018'] = pd.to_numeric(data['2018'])
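
The cleaning steps above (strip `$` and `,`, turn percents into decimals, convert to a number) can also be gathered into one function and applied per value. This is a sketch only; `to_number` is a hypothetical helper, and it assumes every string in the column is a number with optional `$`, `,`, or `%` decoration.

```python
def to_number(value):
    """Convert strings such as '$1,250' or '85%' to floats; pass other values through."""
    if not isinstance(value, str):
        return value  # already numeric (or missing), leave as-is
    cleaned = value.replace("$", "").replace(",", "").strip()
    if cleaned.endswith("%"):
        return float(cleaned[:-1]) / 100  # percent to decimal fraction
    return float(cleaned)


# Usage on the column from the cells above:
# data['2018'] = data['2018'].map(to_number)
```

Strings that are not numeric after cleaning, such as an empty string, raise `ValueError` rather than being converted silently.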

In [225]:
data = data.pivot_table(index=['code'],values='2018', columns='indicator')

In [226]:
data = data.rename_axis(None, axis=1).reset_index()
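
To make the reshaping concrete, here is the same pivot and axis cleanup applied to a tiny long-form frame. The `BK01` values are taken from the output shown at the end of this section; the `BK02` row values are made up purely for illustration.

```python
import pandas as pd

# Long form: one row per (neighborhood, indicator) pair.
long_form = pd.DataFrame({
    "code": ["BK01", "BK01", "BK02", "BK02"],
    "indicator": ["Unemployment rate", "Homeownership rate"] * 2,
    "2018": [0.0245, 0.158, 0.031, 0.204],  # BK02 values are hypothetical
})

# Wide form: one row per neighborhood, one column per indicator.
wide = long_form.pivot_table(index=["code"], values="2018", columns="indicator")

# Drop the 'indicator' label from the column axis and turn 'code' back into a column.
wide = wide.rename_axis(None, axis=1).reset_index()
```

After these steps `wide` has the columns `code`, `Homeownership rate`, and `Unemployment rate`, with the indicator columns in alphabetical order.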

In [227]:
# Strip spaces from every code, not only the first row's
data['code'] = data['code'].str.replace(" ", "", regex=False)

In [228]:
data.head()

Unnamed: 0,code,Born in New York State,Car-free commute (% of commuters),Disabled population,FHA/VA-backed home purchase loans (% of home purchase loans),Foreign-born population,"Home purchase loan rate (per 1,000 properties)",Home purchase loans in LMI tracts (% of home purchase loans),Home purchase loans to LMI borrowers (% of home purchase loans),Homeownership rate,...,Severely rent-burdened households,"Severely rent-burdened households, low income","Severely rent-burdened households, moderate income",Single-person households,"Students performing at grade level in English language arts, 4th grade","Students performing at grade level in math, 4th grade","Total housing code violations (per 1,000 privately owned rental units)",Unemployment rate,Units authorized by new residential building permits,Units issued new certificates of occupancy
0,BK01,0.518,0.85,0.059,0.002,0.2,21.7,0.242,0.024,0.158,...,0.267,0.485,0.084,0.273,0.518,0.493,173.6,0.0245,1097.0,2472.0
