# Building the Database

Each Citi Bike file records every trip taken during a single month. There is one file per month, starting from June 2013, and every file has the same format. The columns, in order, are:
- Trip Duration (seconds): The length of the trip in seconds
- Start Date & Time: The start time of the trip MM-DD-YYYY HH:MM:SS
- End Date & Time: The end time of the trip MM-DD-YYYY HH:MM:SS
- Start Station ID: The ID for the station where the trip started
- Start Station Name: The name of the station where the trip started
- Start Station Latitude: The latitude of the station where the trip started
- Start Station Longitude: The longitude of the station where the trip started
- End Station ID: The ID for the station where the trip ended
- End Station Name: The name of the station where the trip ended
- End Station Latitude: The latitude of the station where the trip ended
- End Station Longitude: The longitude of the station where the trip ended
- Bike ID: The ID for the bike that was used in the trip
- User Type: What type of user took the trip (Subscriber or Customer)
- Gender: The gender of the user (Male - 1, Female - 2, None - 0)
- Year of Birth: The year that the user was born
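
As a quick illustration of the layout above, the sketch below maps one CSV row onto these fifteen columns using only the standard library. The short column keys and the sample row are made up for illustration; real files may spell their headers differently.

```python
import csv
from io import StringIO

# Hypothetical short keys for the fifteen columns listed above, in file order.
CITI_COLUMNS = [
    "tripduration", "starttime", "stoptime",
    "start_id", "start_name", "start_lat", "start_long",
    "end_id", "end_name", "end_lat", "end_long",
    "bike_id", "user_type", "gender", "birth_year",
]

def parse_trip(line: str) -> dict:
    """Map one CSV line onto the column names above."""
    values = next(csv.reader(StringIO(line)))
    if len(values) != len(CITI_COLUMNS):
        raise ValueError(f"expected {len(CITI_COLUMNS)} fields, got {len(values)}")
    return dict(zip(CITI_COLUMNS, values))

# A made-up row; the quoted station name contains a comma.
sample = ('634,06-01-2013 00:00:01,06-01-2013 00:10:35,'
          '444,"Broadway, 24 St",40.74,-73.98,'
          '434,9 Ave & W 18 St,40.74,-74.00,19678,Subscriber,1,1983')
trip = parse_trip(sample)
print(trip["start_name"], trip["tripduration"])
```

Every value comes back as a string here; type conversion is left to whatever loads the data afterwards.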

<img src="./Data/Images/DatabaseDiagramW.png" width="600" height="800" align="center"/>

*Note: If you cannot see the label names, edit the markdown source (double-click the diagram) and change the src from DatabaseDiagramW.png to DatabaseDiagramB.png.*

### Connecting to the Database

In [1]:
pip install psycopg2-binary;

Note: you may need to restart the kernel to use updated packages.


In [2]:
import psycopg2

In [3]:
# Read the password from the environment rather than hard-coding it in the notebook
import os

PGHOST = 'tripdatabase2.cmaaautpgbsf.us-east-2.rds.amazonaws.com'
PGDATABASE = ''
PGUSER = 'postgres'
PGPASSWORD = os.environ['PGPASSWORD']

In [4]:
# Database Context Manager
try:   
    # Set up a connection to the postgres server.    
    conn = psycopg2.connect(user = PGUSER,
                            port = "5432",
                            password = PGPASSWORD,
                            host = PGHOST,
                            database = PGDATABASE)
    # Create a cursor object
    cursor = conn.cursor()   
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("Connection Success:", record,"\n")

except (Exception, psycopg2.Error) as error:
    print("Error while connecting to PostgreSQL", error)

Connection Success: ('PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit',) 
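
The cell above opens a connection but never closes it. A minimal sketch of an actual context manager is shown below, assuming any DB-API style connection; `open_connection` is a hypothetical helper, and `connect` would be `psycopg2.connect` in this notebook.

```python
from contextlib import contextmanager

@contextmanager
def open_connection(connect, **params):
    """Yield a connection from `connect(**params)`, then commit or roll back and close.

    `connect` is any callable that returns a DB-API style connection,
    for example psycopg2.connect (assumed here, not imported).
    """
    conn = connect(**params)
    try:
        yield conn
        conn.commit()      # Commit only if the block finished without errors
    except Exception:
        conn.rollback()    # Undo partial work before re-raising
        raise
    finally:
        conn.close()       # Always release the connection
```

With psycopg2 this might be used as `with open_connection(psycopg2.connect, host=PGHOST, user=PGUSER, password=PGPASSWORD, database=PGDATABASE) as conn:`.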



## Database Construction I - Creating the Staging Tables

#### A - Installs, Imports, Functions, etc.

In [5]:
pip install s3fs;

Note: you may need to restart the kernel to use updated packages.


In [6]:
import pandas as pd
import numpy as np
import s3fs
import os
from io import StringIO
import Queries
from urllib.parse import urlencode
import requests

In [7]:
# The S3 Bucket that will be used to store the data should be created beforehand
# Read the AWS credentials from the environment rather than hard-coding them
ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
ACCESS_SECRET_KEY = os.environ['AWS_SECRET_ACCESS_KEY']

fs = s3fs.S3FileSystem(anon=False, key = ACCESS_KEY_ID, secret= ACCESS_SECRET_KEY)

In [8]:
api_key = os.environ['GOOGLE_API_KEY']   # Google Geocoding API key; the variable name is an assumption

In [9]:
def upload_data(conn, data, table: str, sep = ','):
    """Uploads dataframe to the table in the database
    
    Parameters
    ----------
    conn: psycopg2.extensions.connection
        The connection to the database
    data: pandas.DataFrame
        The dataframe to be uploaded
    table: str
        The name of the table where the data will be stored
    sep: str
        The separator to use when saving the dataframe to a csv
    
    Returns
    -------
    None:
        If executed properly the data should be in specified table of the database
    """
    
    cursor = conn.cursor()
    datastream = StringIO()
    
    data.to_csv(datastream, sep=sep, index=False, header=False)
    datastream.seek(0)
    
    cursor.execute('rollback;')
    cursor.copy_from(datastream, table, sep=sep)
    conn.commit()
    
    return None    
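
`upload_data` relies on `copy_from`, which uses PostgreSQL's text format and does no CSV quoting. One possible alternative, sketched below, is `copy_expert` with `FORMAT csv`, so that quoted fields containing the separator survive. The names `build_copy_sql` and `upload_csv_buffer` are hypothetical, and the table name is interpolated directly into the statement, so it should only ever come from trusted code. Note also that psycopg2 2.9 and later quote the table name passed to `copy_from`, so a schema-qualified name such as `staging.bay_trip` may require `copy_expert` anyway.

```python
from io import StringIO

def build_copy_sql(table: str, sep: str = ",") -> str:
    """Build a COPY statement that reads CSV-formatted rows from STDIN."""
    return f"COPY {table} FROM STDIN WITH (FORMAT csv, DELIMITER '{sep}')"

def upload_csv_buffer(conn, buffer, table: str, sep: str = ",") -> None:
    """Stream an in-memory CSV buffer into `table` using cursor.copy_expert."""
    buffer.seek(0)   # Make sure the buffer is read from the start
    with conn.cursor() as cursor:
        cursor.copy_expert(build_copy_sql(table, sep), buffer)
    conn.commit()

print(build_copy_sql("staging.bay_trip"))
```

The buffer would be filled the same way as in `upload_data`, with `data.to_csv(buffer, index=False, header=False)`.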

In [10]:
staging_schema_query = """CREATE SCHEMA IF NOT EXISTS staging;"""
cursor.execute("rollback;")
cursor.execute(staging_schema_query)

#### Creating the BayWheels Staging Table

In [11]:
bay_filenames = fs.ls("s3://williams-citibike/TripData/BayWheels")

In [12]:
# Tables module. One function for all the tables. 
bay_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.bay_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID VARCHAR,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID VARCHAR,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL             
              );
              """
cursor.execute("rollback;")
cursor.execute(bay_staging_query)
conn.commit()

In [13]:
def populate_bay_staging(datafile: str) -> None:
    """Grabs the baywheels data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    columns = ['start_time','end_time',
               'start_station_id', 'start_station_name', 
               'start_station_latitude', 'start_station_longitude', 
               'end_station_id', 'end_station_name',
               'end_station_latitude', 'end_station_longitude']


    altcols = ['started_at','ended_at',
               'start_station_id', 'start_station_name',
               'start_lat', 'start_lng',
               'end_station_id', 'end_station_name',
               'end_lat', 'end_lng']
        
    na_fills = {'start_lat': -1,'start_lng': -1,
               'end_lat': -1, 'end_lng': -1}
    
    with fs.open("s3://"+datafile, 'r') as file:
        try:
            data = pd.read_csv(file, usecols = columns, na_values="")[columns]
        except ValueError:   # First layout not found; fall back to the alternate column names
            file.seek(0)
            data = pd.read_csv(file, usecols = altcols, na_values="")[altcols]
            data.fillna(value=na_fills, inplace=True)
        
        #Some stations have commas in their name causing the copy_from to register extra data fields
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        upload_data(conn, data, 'staging.bay_trip')

    print(f"Finished Uploading to Bay Staging Table: {datafile}")
    return None
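
The try/except fallback in `populate_bay_staging` copes with files whose headers changed over time. A sketch of a more explicit alternative is to read only the header row and pick the first known layout whose columns are all present. `pick_layout` is a hypothetical helper, the layouts are abbreviated versions of the lists in the function, and the sample header is made up.

```python
import csv
from io import StringIO

# Two abbreviated header layouts of the kind used in populate_bay_staging.
OLD_LAYOUT = ["start_time", "end_time", "start_station_id", "start_station_name"]
NEW_LAYOUT = ["started_at", "ended_at", "start_station_id", "start_station_name"]

def pick_layout(file, layouts):
    """Return the first layout whose columns all appear in the file's header row."""
    header = set(next(csv.reader(file)))
    file.seek(0)   # Rewind so the caller can read the file from the start
    for layout in layouts:
        if set(layout) <= header:
            return layout
    raise ValueError(f"no known layout matches header: {sorted(header)}")

sample = StringIO("ride_id,started_at,ended_at,start_station_id,start_station_name\n1,a,b,c,d\n")
print(pick_layout(sample, [OLD_LAYOUT, NEW_LAYOUT]))
```

The chosen layout could then be passed to `pd.read_csv` as `usecols`, which avoids reading the whole file once just to discover that the headers differ.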

In [14]:
for file in bay_filenames:
    populate_bay_staging(file)

Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/2017-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201801-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201802-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201803-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201804-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201805-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201806-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201807-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201808-fordgobike-tripdata.csv
Finished Uploading to

#### Creating the BlueBike Staging Table

In [15]:
blue_filenames = fs.ls("s3://williams-citibike/TripData/BlueBike")

In [16]:
# Tables module. One function for all the tables. 
blue_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.blue_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL              
              );
              """
cursor.execute("rollback;")
cursor.execute(blue_staging_query)
conn.commit()

In [17]:
def populate_blue_staging(datafile: str) -> None:
    """Grabs the blue bike data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
      
    columns = ['starttime','stoptime',
               'start station id', 'start station name',
               'start station latitude', 'start station longitude',
               'end station id', 'end station name',
               'end station latitude', 'end station longitude']
    
    with fs.open("s3://"+datafile, 'r') as file:
        data = pd.read_csv(file, usecols=columns, na_values = "")[columns]
        
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        upload_data(conn, data,'staging.blue_trip')
    
    print(f"Finished Uploading to Blue Staging Table: {datafile}")
    return None

In [18]:
# Data starts from 2015; files from before then don't have location data
for file in blue_filenames[5:]:
    populate_blue_staging(file)

Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201501-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201502-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201503-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201504-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201505-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201506-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201507-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201508-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201509-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citi

#### Creating the Capital Staging Table

In [19]:
capital_filenames = fs.ls("s3://williams-citibike/TripData/CapitalBike")

In [20]:
# Tables module. One function for all the tables. 
capital_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.capital_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL              
              );
              """
cursor.execute("rollback;")
cursor.execute(capital_staging_query)
conn.commit()

In [21]:
def populate_capital_staging(datafile: str) -> None:
    """Grabs the capital bike data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    columns = ['Start date', 'End date',
               'Start station number', 'Start station',
               'End station number', 'End station']
    
    altcolumns = ['started_at','ended_at',
                  'start_station_id', 'start_station_name',
                  'start_lat', 'start_lng',
                  'end_station_id', 'end_station_name',
                  'end_lat', 'end_lng']
    
    with fs.open("s3://"+datafile, 'r') as file:
        try:   
            data = pd.read_csv(file, usecols=columns, na_values = "")[columns]
            data.insert(4,'start_lat', -1)
            data.insert(5,'start_lng',-1)

            data.insert(8,'end_lat', -1)
            data.insert(9,'end_lng',-1)
        except ValueError:   # First layout not found; fall back to the alternate column names
            file.seek(0)
            data = pd.read_csv(file, usecols=altcolumns, na_values = "")[altcolumns]
            data.fillna({'start_station_id': -1, 'end_station_id':-1, 
                         'start_lat': -1, 'start_lng': -1,
                         'end_lat': -1, 'end_lng': -1}, inplace=True)
        
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')

        upload_data(conn,data,'staging.capital_trip')
    
    print(f"Finished Uploading to Capital Staging Table: {datafile}")
    return None

In [22]:
for file in capital_filenames:
    populate_capital_staging(file)


Finished Uploading to Capital Staging Table: williams-citibike/TripData/CapitalBike/2010-capitalbikeshare-tripdata.csv
Finished Uploading to Capital Staging Table: williams-citibike/TripData/CapitalBike/2011-capitalbikeshare-tripdata.csv
Finished Uploading to Capital Staging Table: williams-citibike/TripData/CapitalBike/2012Q1-capitalbikeshare-tripdata.csv
Finished Uploading to Capital Staging Table: williams-citibike/TripData/CapitalBike/2012Q2-capitalbikeshare-tripdata.csv
Finished Uploading to Capital Staging Table: williams-citibike/TripData/CapitalBike/2012Q3-capitalbikeshare-tripdata.csv
Finished Uploading to Capital Staging Table: williams-citibike/TripData/CapitalBike/2012Q4-capitalbikeshare-tripdata.csv
Finished Uploading to Capital Staging Table: williams-citibike/TripData/CapitalBike/2013Q1-capitalbikeshare-tripdata.csv
Finished Uploading to Capital Staging Table: williams-citibike/TripData/CapitalBike/2013Q2-capitalbikeshare-tripdata.csv
Finished Uploading to Capital Staging Table: williams-citibike/

#### Creating the CitiBike Staging Table

In [23]:
citi_filenames = fs.ls("s3://williams-citibike/TripData/CitiBike")

The columns from bikeID through gender are not needed, so only the first eleven columns of each file are loaded.

In [24]:
# Tables module. One function for all the tables. 
citi_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.citi_trip (
                   tripduration NUMERIC, 
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL              
              );
              """
cursor.execute("rollback;")
cursor.execute(citi_staging_query)
conn.commit()

In [25]:
def populate_citi_staging(datafile: str) -> None:
    """Grabs the citi bike data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
       
    with fs.open("s3://"+datafile, 'r') as file:
        data = pd.read_csv(file, na_values ="", usecols=list(range(0,11)))   # Can't use the C engine to speed this up
        data.fillna(-1, inplace=True)   # Empty spaces need to be integers for birthyear REAL type in database
        
        #Some stations have commas in their name causing the copy_from to register extra data fields
        data.iloc[:, 4] = data.iloc[:, 4].str.replace(',','_')
        data.iloc[:, 8] = data.iloc[:, 8].str.replace(',','_')
        
        data.iloc[:, 3] = data.iloc[:, 3].astype('int32')
        data.iloc[:, 7] = data.iloc[:, 7].astype('int32')
        
        upload_data(conn,data,'staging.citi_trip')
        
    print(f"Finished Uploading to Citi Staging Table: {datafile}")
    return None

In [26]:
for file in citi_filenames:
    populate_citi_staging(file)

Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-07 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-08 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-09 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-10 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-11 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-12 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/201306-citibike-tripdata.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2014-01 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2014-02 - Citi Bike trip data.c

In [27]:
drop_tripduration_query = """
        ALTER TABLE staging.citi_trip
        DROP COLUMN tripduration
        """

cursor.execute("rollback;")
cursor.execute(drop_tripduration_query)
conn.commit()

#### Creating the Divvy Staging Table

In [11]:
divvy_filenames = fs.ls("s3://williams-citibike/TripData/DivvyBike")

In [12]:
# Tables module. One function for all the tables. 
divvy_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.divvy_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID VARCHAR,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID VARCHAR,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL             
              );
              """
cursor.execute("rollback;")
cursor.execute(divvy_staging_query)
conn.commit()

In [13]:
def populate_divvy_staging(datafile: str) -> None:
    """Grabs the divvy bike data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    columns = ['started_at', 'ended_at',
               'start_station_id', 'start_station_name',
               'start_lat', 'start_lng',
               'end_station_id', 'end_station_name',
               'end_lat', 'end_lng']
    
    altcolumns = ['starttime', 'stoptime',
                  'from_station_id', 'from_station_name',
                  'to_station_id','to_station_name']
    
    alt3 = ['start_time', 'end_time',
            'from_station_id', 'from_station_name',
            'to_station_id','to_station_name']
    
    names = ['starttime', 'endtime','startid','startname','endid','endname']
    
    with fs.open("s3://"+datafile, 'r') as file:
        try:
            data = pd.read_csv(file, usecols=columns, na_values="", parse_dates=[0,1])[columns]
            data.fillna({'start_station_id': -1, 'end_station_id':-1, 
                         'start_lat': -1, 'start_lng': -1,
                         'end_lat': -1, 'end_lng': -1}, inplace=True)            
        except ValueError:
            file.seek(0)
            try:
                data = pd.read_csv(file, usecols=altcolumns, na_values = "", parse_dates=[0,1])[altcolumns]
                data.columns = names
            except ValueError:
                file.seek(0)
                try:
                    data = pd.read_csv(file, usecols=alt3, na_values = "", parse_dates=[0,1])[alt3]
                    data.columns = names
                except ValueError:   # No named layout matched; fall back to column positions
                    file.seek(0)
                    data = pd.read_csv(file, usecols=[1,2,5,6,7,8], na_values="", parse_dates=[0,1])
                    data.columns = names
        
            data.insert(4,'start_lat', -1)
            data.insert(5,'start_lng',-1)

            data.insert(8,'end_lat', -1)
            data.insert(9,'end_lng',-1)
            
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        
        upload_data(conn,data,'staging.divvy_trip')
        
        
    print(f"Finished Uploading to Divvy Staging Table: {datafile}")
    return None

In [14]:
for file in divvy_filenames:
    populate_divvy_staging(file)

Finished Uploading to Divvy Staging Table: williams-citibike/TripData/DivvyBike/202004-divvy-tripdata.csv
Finished Uploading to Divvy Staging Table: williams-citibike/TripData/DivvyBike/202005-divvy-tripdata.csv
Finished Uploading to Divvy Staging Table: williams-citibike/TripData/DivvyBike/202006-divvy-tripdata.csv


KeyboardInterrupt: 

## Database Construction II - Deriving the Station Tables

There isn't an explicit list of stations for each service. However, each trip record embeds its starting and ending station information, so SQL can be used to derive the unique stations in each service. The same steps are followed for each service:

- Determine the unique list of stations from the staging table and save it as a dataframe. Stations that aren't consumer stations are dropped in the retrieval process.
- Convert the dataframe into a geodataframe by combining the latitude and longitude coordinates into a point geometry.
- Get the zip code for each station using the Google API. Some stations require a manual zip code entry.
- Create the table in the database and upload the data to it.
- Add a column to the table that identifies which service the stations are associated with.
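
The first step, deriving unique stations from trip rows, can be illustrated without a database. The sketch below keeps one record per station ID and prefers a record with real coordinates over one carrying the -1 placeholder used in the staging tables. `unique_stations` and the sample rows are hypothetical; the notebook itself performs this step in SQL.

```python
def unique_stations(trips, drop_ids=()):
    """Collapse trip rows into one (name, lat, long) record per end-station ID."""
    stations = {}
    for station_id, name, lat, long in trips:
        if station_id in drop_ids:
            continue   # Skip depots and other non-consumer stations
        known = stations.get(station_id)
        # Keep the first record seen, but replace a placeholder once real coordinates show up
        if known is None or (known[1] < 0 < lat):
            stations[station_id] = (name, lat, long)
    return stations

# Made-up trip rows: (end station id, end station name, latitude, longitude)
trips = [
    (72, "W 52 St & 11 Ave", -1, -1),            # Older file without coordinates
    (72, "W 52 St & 11 Ave", 40.767, -73.994),
    (79, "Franklin St & W Broadway", 40.719, -74.007),
    (-1, "Unknown", -1, -1),
]
stations = unique_stations(trips, drop_ids={-1})
print(stations)
```

Treating a positive latitude as "has coordinates" mirrors the `end_lat > 0` filter used in the station query, and only makes sense for services in the northern hemisphere.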

In [24]:
pip install geopandas

Note: you may need to restart the kernel to use updated packages.


In [25]:
import geopandas as gpd
import shapely

In [26]:
def get_stations(conn, service: str, drop_indices: list=[]) -> pd.DataFrame:
    """Derives the unique stations from the trip data in the staging schema
    
    Parameters
    ----------
    conn: psycopg2.extensions.connection
        The connection to the database
    service : str
        One of the five bikeshare services of interest
    drop_indices: list
        A list of indices to drop before returning the stations. Used for stations that aren't actual stations
    
    Returns
    -------
    pd.DataFrame:
        Returns a dataframe containing the stations information
    """
    
    station_query = f"""
            SELECT DISTINCT ON(endid) endid, endname, end_lat, end_long 
              FROM staging.{service}_trip
             WHERE end_lat > 0
             UNION
            SELECT DISTINCT ON(endid) endid, endname, end_lat, end_long
              FROM staging.{service}_trip
            ORDER BY endid, end_lat
            """
    
    station = pd.read_sql(station_query, conn)
    station.dropna(inplace=True)
    station.drop_duplicates(subset=['endid'], keep='last', inplace=True)
    
    if len(drop_indices) > 0:
        station = station.set_index('endid').drop(drop_indices).reset_index()
    
    return station

In [27]:
def add_bike_service_name(conn, table: str, name: str, schema: str):
    """Adds a bikeshare column in the table where every value is the name passed
    
    Parameters
    ----------
    conn: psycopg2.extensions.connection
        The connection to the database
    table : str
        The name of the table to be altered
    name: str
        The value that will fill the new column
    schema: str
        The schema that contains the table
    
    Returns
    -------
    None:
        If executed properly the table will have a new column called bikeshare that is populated with the value of name
    """
    
    cursor = conn.cursor()
    cursor.execute('rollback;')
   
    add_name_query = f"""
            alter table {schema}.{table}
            add column bikeshare varchar(18);

            update {schema}.{table}
            set bikeshare = '{name}';
            """
          
    
    cursor.execute(add_name_query)
    conn.commit()
    
    return None
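
`add_bike_service_name` formats the schema, table, and value straight into the SQL string. A minimal sketch of a guard is to validate the identifiers before building the statements and to leave the value as a query parameter. `build_service_queries` is a hypothetical helper, and the pattern is a deliberately strict assumption (unquoted lowercase identifiers only).

```python
import re

# Strict assumption: only unquoted, lowercase identifiers are accepted.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")

def build_service_queries(schema: str, table: str):
    """Return (alter, update) statements for adding and filling a bikeshare column."""
    for part in (schema, table):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    alter = f"ALTER TABLE {schema}.{table} ADD COLUMN bikeshare varchar(18);"
    # %s is the psycopg2 placeholder; the value is passed separately to execute()
    update = f"UPDATE {schema}.{table} SET bikeshare = %s;"
    return alter, update

alter, update = build_service_queries("stations", "bay_station")
print(alter)
print(update)
```

The statements would then be run as `cursor.execute(alter)` followed by `cursor.execute(update, (name,))`.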

#### Geocoding Functions


In [29]:
def get_address_components(address: str, state_initials = "") -> list:
    """Uses the name of the station to get the zipcode via the Google API
    
    Parameters
    ----------
    address: str
        The name of the station
    state_initials : str
        The 2 CHAR initials of the state to use with component filtering
        
    Returns
    -------
    List of Dicts:
        A list of the different components of the address
    """
    
    endpoint = 'https://maps.googleapis.com/maps/api/geocode/json'
    params = {'address': address, 'key': api_key}
    url_params = urlencode(params)
    url = f"{endpoint}?{url_params}" + f"&components=administrative_area:{state_initials}|country:US"
    
    r = requests.get(url)
    if r.status_code not in range(200, 300):
        return -1

    try:
        return r.json()['results'][0]['address_components']
    except IndexError:
        return -1


def get_latlong_components(lat: float, long: float) -> list:
    """Uses the coordinates of the station to get the zipcode via the Google API
    
    Parameters
    ----------
    lat: float
        The latitude coordinate of the station
    long: float
        The longitude coordinate of the station
        
    Returns
    -------
    List of Dicts:
        A list of the different components of the address
    """
        
    url = f"https://maps.googleapis.com/maps/api/geocode/json?latlng={lat},{long}&key={api_key}"
    
    r = requests.get(url)
    if r.status_code not in range(200, 300):
        return -1
   
    try:
        return r.json()['results'][0]['address_components']
    except IndexError:
        return -1


def extract_zipcode(components: list) -> int:
    """Iterates through the address components to find the postal code
    
    Parameters
    ----------
    components: list
        The address components returned by the Google API, or -1 if the lookup failed
        
    Returns
    -------
    str or int:
        The zip code as a string, or -1 if the lookup failed.
        None is returned if no postal code component is present
    """
    
    if components == -1: 
        return -1
    
    for comp in components:
        if comp.get('types')[0] == 'postal_code':
            return comp.get('long_name')
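
To make the shape of the data concrete, the sketch below builds a geocoding URL with the component filter URL-encoded along with the other parameters, and pulls the postal code out of a made-up `address_components` list. `build_geocode_url` and `find_zipcode` are hypothetical variants of the functions above, not replacements for them, and no request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address: str, state: str, key: str) -> str:
    """Build a geocoding URL with the component filter URL-encoded as well."""
    params = {
        "address": address,
        "components": f"administrative_area:{state}|country:US",
        "key": key,
    }
    return f"{ENDPOINT}?{urlencode(params)}"

def find_zipcode(components, default=-1):
    """Return the long_name of the postal_code component, or `default` if absent."""
    if not isinstance(components, list):
        return default          # A failed lookup is signalled with -1 in this notebook
    for comp in components:
        if "postal_code" in comp.get("types", []):
            return comp.get("long_name", default)
    return default

# Made-up components in the shape the Geocoding API returns
sample = [
    {"long_name": "San Francisco", "short_name": "SF", "types": ["locality", "political"]},
    {"long_name": "94103", "short_name": "94103", "types": ["postal_code"]},
]
print(find_zipcode(sample))
print(build_geocode_url("Market St", "CA", "YOUR_KEY"))
```

Checking for `postal_code` anywhere in `types`, and returning a default when nothing matches, avoids the implicit `None` that a loop without a final `return` produces.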

In [30]:
def input_zipcode(df, state_initials = ""):
    """Uses a station's row data to determine its zip code
    
    Parameters
    ----------
    df: pandas.Series
        A single station row, as passed by DataFrame.apply with axis=1
    state_initials: str
        The 2 CHAR state initial to use for component filtering
        
    Returns
    -------
    str or int:
        The zip code of the station, or -1 (or None) if it could not be determined
    """
    
    if df.end_lat < 10:   # No coordinate data means we use the address
        return extract_zipcode(get_address_components(address = df.endname, state_initials = state_initials))
    else:   # Use the coordinates
        return extract_zipcode(get_latlong_components(df.end_lat, df.end_long))

In [31]:
def manual_zip_entry(geodf, entries: list):
    """Manually inputs zip code data into the dataframe based on the entries passed
    
    Parameters
    ----------
    geodf: geopandas.geodataframe.GeoDataFrame
        The dataframe that the function will be applied on
    entries: list of tuples
        A list of tuples where each tuple is in the form (stationid, zipcode)
        
    Returns
    -------
    geopandas.geodataframe.GeoDataFrame:
        A geodataframe with all the missing zipcodes inputted
    """
    
    geodf = geodf.set_index('endid')
    
    for entry in entries:
        geodf.loc[entry[0], 'zipcode'] = entry[1]
    
    return geodf.reset_index()

In [28]:
stations_schema_query = """CREATE SCHEMA IF NOT EXISTS stations;"""
cursor.execute("rollback;")
cursor.execute(stations_schema_query)

#### Creating the BayWheels Station Table

In [51]:
bay_remove = ['', '449', '449.0', '420.0', '408', '408.0', '484.0', 
              '16th Depot Bike Station', '16th St Depot', 'San Jose Depot', 
              'SF Depot', 'SF Depot-2 (Minnesota St Outbound)']

In [52]:
bay_station = get_stations(conn, 'bay', bay_remove)

In [53]:
def drop_decimal(x):
    """Drops the .0 from a string, if it has it"""
    
    if x.endswith('.0'):
        return(x[:-2])
    else: return x

In [50]:
# Stations that share an ID were relocated, so keep only the last entry for each ID
bay_station['endid'] = bay_station.endid.apply(drop_decimal)
bay_station.drop_duplicates(subset=['endid'], keep='last', inplace=True)

In [51]:
bay_spatial = gpd.GeoDataFrame(bay_station, geometry=gpd.points_from_xy(bay_station.end_long, bay_station.end_lat), crs="EPSG:4326")

In [56]:
bay_spatial['zipcode'] = bay_spatial.apply(input_zipcode, axis=1).fillna(-1)

In [57]:
# Tables module
bay_station_query = """
               CREATE TABLE IF NOT EXISTS stations.bay_station (
                   stationID VARCHAR,
                   name VARCHAR(64) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
cursor.execute("rollback;")
cursor.execute(bay_station_query)
conn.commit()

Manual Zip Code Entry

In [58]:
manual_zipcodes = [('98', 94103)]

In [59]:
bay_spatial = manual_zip_entry(bay_spatial, manual_zipcodes)

Database Upload

In [60]:
upload_data(conn, bay_spatial, 'stations.bay_station', sep='\t')

In [61]:
add_bike_service_name(conn, 'bay_station','bay', schema = 'stations')

#### Creating the BlueBike Station Table

In [12]:
blue_remove = [153, 158, 164, 223, 229, 230, 308, 382]

In [68]:
blue_station = get_stations(conn, 'blue', blue_remove)

In [39]:
blue_spatial = gpd.GeoDataFrame(blue_station, geometry=gpd.points_from_xy(blue_station.end_long, blue_station.end_lat), crs="EPSG:4326")

In [40]:
blue_spatial['zipcode'] = blue_spatial.apply(input_zipcode, axis=1)

In [44]:
# Tables module
blue_station_query = """
               CREATE TABLE IF NOT EXISTS stations.blue_station (
                   stationID VARCHAR,
                   name VARCHAR(128) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
cursor.execute("rollback;")
cursor.execute(blue_station_query)
conn.commit()

In [45]:
upload_data(conn, blue_spatial, 'stations.blue_station', sep='\t')

In [143]:
add_bike_service_name(conn, 'blue_station','blue', schema = 'stations')

#### Creating the Capital Station Table

In [1]:
# -1 is excluded because it isn't a "stationary" station, so distance calculations involving it would be incorrect
capital_remove = [-1]

In [19]:
capital_station = get_stations(conn, 'capital', capital_remove)

In [20]:
capital_spatial = gpd.GeoDataFrame(capital_station, geometry=gpd.points_from_xy(capital_station.end_long, capital_station.end_lat), crs="EPSG:4326")

In [21]:
# Tables module
capital_station_query = """
               CREATE TABLE IF NOT EXISTS stations.capital_station (
                   stationID VARCHAR,
                   name VARCHAR(128) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
cursor.execute("rollback;")
cursor.execute(capital_station_query)
conn.commit()

In [22]:
def capital_input_zipcode(df):
    """Determine a station's zip code from its row data (Capital Bikeshare only).
    
    Parameters
    ----------
    df: pandas.Series
        A single row of the dataframe (the function is applied with axis=1)
        
    Returns
    -------
    The zip code found for the station. If the station has no coordinates and
    the address lookup fails for every state, nothing is returned (None).
    """
    
    # Try each of the three states in turn until the address lookup finds a match
    state_initials = ['DC', 'VA', 'MD']
    
    if df.end_lat < 10:   # No coordinates
        for state in state_initials:
            zip_code = extract_zipcode(get_address_components(address = df.endname, state_initials = state))
            if isinstance(zip_code, str):   # If get_address fails it returns -1
                return zip_code
    else:
        return extract_zipcode(get_latlong_components(df.end_lat, df.end_long))
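
The state-by-state fallback in `capital_input_zipcode` can be tried out without a geocoder. The sketch below is only an illustration: `first_zip_by_state` and `fake_lookup` are hypothetical names, the address and zip code are made up, and the only assumption carried over from the cell above is that a failed lookup returns a non-string value such as -1.

```python
from typing import Callable, Optional, Sequence


def first_zip_by_state(name: str,
                       states: Sequence[str],
                       lookup: Callable[[str, str], object]) -> Optional[str]:
    """Try each state in order and return the first zip code that is a string."""
    for state in states:
        zip_code = lookup(name, state)
        if isinstance(zip_code, str):  # a failed lookup is assumed to return -1
            return zip_code
    return None  # no state matched; the caller can fill this in with -1


def fake_lookup(name: str, state: str):
    """Hypothetical stand-in for the geocoder: it knows one made-up address."""
    known = {("Example Ave & 1st St", "VA"): "00001"}
    return known.get((name, state), -1)


states = ["DC", "VA", "MD"]
found = first_zip_by_state("Example Ave & 1st St", states, fake_lookup)
missing = first_zip_by_state("Nowhere Rd", states, fake_lookup)
```

Because the function returns `None` when every state fails, the `.fillna(-1)` applied in the next cell is what turns those misses into -1.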

In [23]:
capital_spatial['zipcode'] = capital_spatial.apply(capital_input_zipcode, axis=1).fillna(-1)

Database Upload

In [None]:
upload_data(conn, capital_spatial, 'stations.capital_station', sep='\t')

In [None]:
add_bike_service_name(conn, 'capital_station','capital', schema = 'stations')

#### Creating the CitiBike Station Table

In [31]:
citi_remove = [-1, 3036, 3650, 3247, 3248, 3446, 3480, 3488, 3633]

In [69]:
citi_station = get_stations(conn, 'citi', citi_remove)

In [70]:
citi_spatial = gpd.GeoDataFrame(citi_station, geometry=gpd.points_from_xy(citi_station.end_long, citi_station.end_lat), crs="EPSG:4326")

In [80]:
# Tables module
citi_station_query = """
               CREATE TABLE IF NOT EXISTS stations.citi_station (
                   stationID VARCHAR,
                   name VARCHAR(128) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
cursor.execute("rollback;")
cursor.execute(citi_station_query)
conn.commit()

In [75]:
citi_spatial['zipcode'] = citi_spatial.apply(input_zipcode, axis=1).fillna(-1)

Manual Zip Code Entry

In [80]:
manual_zipcodes = [(152, 10007)]

In [81]:
citi_spatial = manual_zip_entry(citi_spatial, manual_zipcodes)

Database Upload

In [81]:
upload_data(conn, citi_spatial, 'stations.citi_station', sep='\t')

In [82]:
add_bike_service_name(conn, 'citi_station','citi', schema = 'stations')

#### Creating the Divvy Station Table

In [36]:
divvy_remove =  ['-1', '1', '360', '361', '363', '512']

In [37]:
divvy_station = get_stations(conn, 'divvy', divvy_remove)

In [20]:
# Stations that share an ID are relocations of the same station, so keep only the latest entry
divvy_station['endid'] = divvy_station.endid.apply(drop_decimal)
divvy_station.drop_duplicates(subset=['endid'], keep='last', inplace=True)
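
As a rough pure-Python illustration of the "normalize the ID, then keep the last duplicate" step, the sketch below uses a hypothetical version of `drop_decimal` (the real helper is defined elsewhere and may differ) and made-up rows. It only mirrors the idea; the order of the surviving rows may differ from what pandas returns.

```python
def drop_decimal(station_id) -> str:
    """Hypothetical helper: turn '5.0' into '5' and leave other IDs unchanged."""
    text = str(station_id)
    return text[:-2] if text.endswith(".0") else text


def keep_last_by_id(rows):
    """Keep one row per normalized 'endid'; later rows overwrite earlier ones."""
    latest = {}
    for row in rows:
        normalized = dict(row, endid=drop_decimal(row["endid"]))
        latest[normalized["endid"]] = normalized
    return list(latest.values())


# Made-up rows: station 5 appears twice because it was relocated.
rows = [
    {"endid": "5.0", "endname": "Old location"},
    {"endid": "7", "endname": "Another station"},
    {"endid": "5", "endname": "New location"},
]
deduplicated = keep_last_by_id(rows)
```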

In [43]:
divvy_spatial = gpd.GeoDataFrame(divvy_station, geometry=gpd.points_from_xy(divvy_station.end_long, divvy_station.end_lat), crs="EPSG:4326")

In [44]:
# Tables module
divvy_station_query = """
               CREATE TABLE IF NOT EXISTS stations.divvy_station (
                   stationID VARCHAR,
                   name VARCHAR(128) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
cursor.execute("rollback;")
cursor.execute(divvy_station_query)
conn.commit()

In [45]:
divvy_spatial['zipcode'] = divvy_spatial.apply(input_zipcode, state_initials='IL',axis=1).fillna(-1)

Missing Zip Codes - Manual Entry

In [47]:
manual_zipcodes = [
    ('606', 60302), ('609', 60305), ('610', 60302), ('613', 60302), 
    ('614',60302), ('617', 60304), ('669', 60611) 
]

In [48]:
divvy_spatial = manual_zip_entry(divvy_spatial, manual_zipcodes)

Database Upload

In [49]:
upload_data(conn, divvy_spatial, 'stations.divvy_station', sep='\t')

In [50]:
add_bike_service_name(conn, 'divvy_station','divvy', schema = 'stations')

## Database Construction III - Creating the Trip Tables

In [75]:
cursor.execute("CREATE SCHEMA IF NOT EXISTS trips")
conn.commit()

### Functions for this Section

In [30]:
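def trip_from_staging(conn, service):
    """Create trips.{service}_trip from staging.{service}_trip.
    
    Adds the duration (minutes), the station-to-station distance (miles)
    and the resulting speed (miles per hour) to every trip.
    """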
    cursor = conn.cursor()
    cursor.execute('rollback;')
    
    trip_from_staging_query = f"""
            CREATE TABLE trips.{service}_trip as (
                SELECT 
                *, 
                CASE WHEN 
                     duration > 0 
                   THEN ROUND(distance/(duration / 60), 2) 
                END AS speed
                FROM (
                    SELECT 
                      starttime, 
                      endtime, 
                      ROUND((EXTRACT(epoch FROM (endtime - starttime))/60)::NUMERIC, 2) AS duration, 
                      startid, 
                      startname, 
                      endid, 
                      endname,
                      CASE WHEN 
                            s1.latitude > 0 AND s2.latitude > 0 
                           THEN ROUND(CAST(ST_Distance(s1.geometry, s2.geometry)*0.000621371 AS NUMERIC),2)
                      END AS distance
                    FROM staging.{service}_trip AS {service}
                    LEFT JOIN stations.{service}_station AS s1
                      ON {service}.startid = s1.stationid::NUMERIC
                    LEFT JOIN stations.{service}_station AS s2
                      ON {service}.endid = s2.stationid::NUMERIC
                ) AS {service}_table
            );
            """
    
    cursor.execute(trip_from_staging_query)
    conn.commit()
    
    return None
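
The unit conversions inside the query can be checked by hand. The sketch below repeats the same arithmetic in Python with the same constant, 0.000621371 miles per meter; `trip_metrics` is a hypothetical name, and Python's `round` may differ from PostgreSQL's `ROUND` on exact ties.

```python
from datetime import datetime

METERS_TO_MILES = 0.000621371  # same factor as in the SQL query


def trip_metrics(distance_m: float, start: datetime, end: datetime):
    """Return (duration in minutes, distance in miles, speed in mph)."""
    duration_min = round((end - start).total_seconds() / 60, 2)
    distance_mi = round(distance_m * METERS_TO_MILES, 2)
    # As in the CASE expression, speed is left empty when duration is not positive.
    speed_mph = round(distance_mi / (duration_min / 60), 2) if duration_min > 0 else None
    return duration_min, distance_mi, speed_mph


start = datetime(2020, 1, 1, 8, 0, 0)
half_hour = trip_metrics(8046.72, start, datetime(2020, 1, 1, 8, 30, 0))
zero_length = trip_metrics(8046.72, start, start)
```

Dividing the distance by `duration / 60` converts minutes to hours, which is why the `speed` column is in miles per hour.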

In [31]:
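def delete_non_trips(conn, service: str, drop_indices: list):
    """Delete every trip in trips.{service}_trip that starts or ends at one of the IDs in drop_indices."""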
    cursor = conn.cursor()
    cursor.execute('rollback;')
    
    drop_indices = [str(element) for element in drop_indices]
    drop_indices = '(' + ",".join(drop_indices) + ')'
    
    delete_non_trips_query = f"""
            DELETE FROM trips.{service}_trip
            WHERE startid IN {drop_indices}
                OR endid IN {drop_indices}
            """
    
    cursor.execute(delete_non_trips_query)
    conn.commit()
    return None
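
`delete_non_trips` pastes the ID list directly into the SQL text. A possible alternative, sketched below, lets psycopg2 fill in the values: it adapts a Python tuple into a form that works with `IN`. `build_delete_non_trips` is a hypothetical helper, not part of the notebook, and the table name still has to be formatted in by hand, so it is checked first.

```python
def build_delete_non_trips(service: str, drop_indices):
    """Return (query, params) for deleting trips that touch the given station IDs."""
    if not service.isidentifier():
        raise ValueError(f"unexpected service name: {service!r}")
    ids = tuple(drop_indices)
    if not ids:
        raise ValueError("drop_indices must not be empty")  # 'IN ()' is invalid SQL
    query = f"""
            DELETE FROM trips.{service}_trip
            WHERE startid IN %s
                OR endid IN %s
            """
    return query, (ids, ids)


query, params = build_delete_non_trips("blue", [153, 158, 164])
```

With an open connection this would be run as `cursor.execute(query, params)` followed by `conn.commit()`.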

#### BayWheels Trip Table
The BayWheels table doesn't fit the generic format that the two functions above expect (its station IDs are stored as text, some with a trailing '.0'), so its queries have to be hardcoded.

In [62]:
bay_station_table_query = """
        CREATE TABLE trips.bay_trip AS (
            SELECT 
              *, 
              CASE WHEN 
                     duration > 0 
                   THEN ROUND(distance/(duration / 60), 2) 
              END AS speed
            FROM (
                SELECT 
                  bay.starttime, 
                  bay.endtime, 
                  ROUND((EXTRACT(epoch FROM (endtime - starttime))/60)::NUMERIC, 2) AS duration, 
                  replace(bay.startid, '.0','') as startid, 
                  startname, 
                  replace(bay.endid, '.0','') as endid, 
                  endname,
                  CASE WHEN 
                        s1.latitude > 0 AND s2.latitude > 0
                       THEN ROUND(CAST(ST_Distance(s1.geometry, s2.geometry)*0.000621371 AS NUMERIC),2)
                  END AS distance
                FROM staging.bay_trip AS bay
                LEFT JOIN stations.bay_station AS s1
                  ON replace(bay.startid,'.0','') = s1.stationid
                LEFT JOIN stations.bay_station AS s2
                  ON replace(bay.endid, '.0','') = s2.stationid
            ) AS bay_table 
        );
        """

Queries.execute_query(conn, bay_station_table_query)

In [63]:
delete_bay_non_trips_query = """
            DELETE FROM trips.bay_trip
            WHERE startid IN ('449', '449.0', '420.0', '408', '408.0', '484.0', 
                              '16th Depot Bike Station', '16th St Depot', 'San Jose Depot', 
                              'SF Depot', 'SF Depot-2 (Minnesota St Outbound)'
                          )
            
              OR endid IN ('449', '449.0', '420.0', '408', '408.0', '484.0', 
                           '16th Depot Bike Station', '16th St Depot', 'San Jose Depot', 
                           'SF Depot', 'SF Depot-2 (Minnesota St Outbound)'
                       );
            """
Queries.execute_query(conn, delete_bay_non_trips_query)

In [17]:
add_bike_service_name(conn, 'bay_trip','bay', schema='trips')

#### BlueBike Trips Derivation

In [None]:
trip_from_staging(conn, 'blue')

In [None]:
delete_non_trips(conn, 'blue', blue_remove)

In [18]:
add_bike_service_name(conn, 'blue_trip','blue', schema='trips')

#### CapitalBike Trips Derivation

In [32]:
trip_from_staging(conn, 'capital')

In [19]:
add_bike_service_name(conn, 'capital_trip','capital', schema='trips')

#### CitiBike Trips Derivation

In [None]:
trip_from_staging(conn, 'citi')

In [None]:
delete_non_trips(conn, 'citi', citi_remove)

In [20]:
add_bike_service_name(conn, 'citi_trip','citi', schema='trips')

#### DivvyBike Trips Derivation

In [None]:
trip_from_staging(conn, 'divvy')

Some entries should not be stations, and the trips involving them should not be kept either. For example, DIVVY PARTS TESTING is not a real station, and the trips that start or end there are not consumer trips.

Other entries should not be stations, but their trips should still be kept. For example, a blank start name (startid = -1) means the station information was recorded incorrectly, yet the trip itself is valid. The blank name is left out of the stations table, but its trips stay in the trip table. This is why the call below passes divvy_remove[1:], which skips the leading '-1' entry.

In [43]:
delete_non_trips(conn, 'divvy', divvy_remove[1:])
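
The slice works only because '-1' happens to be the first element of `divvy_remove`. The sketch below repeats the list from the station section and shows a filter that does not depend on position; `trip_remove_by_filter` is just an illustrative name.

```python
# Same list as in the Divvy station section.
divvy_remove = ['-1', '1', '360', '361', '363', '512']

# The slice used above: drop the first element, which is '-1'.
trip_remove_by_slice = divvy_remove[1:]

# Position-independent alternative: keep every ID except '-1'.
trip_remove_by_filter = [station_id for station_id in divvy_remove if station_id != '-1']
```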

In [21]:
add_bike_service_name(conn, 'divvy_trip','divvy', schema='trips')

After all of the updates and deletes, we should run a full vacuum to reclaim the space left by the removed rows before moving forward.

In [None]:
Queries.VACUUM_FULL(conn)