# **Building the Database**

In the last notebook we wrangled ten data sets totaling more than 350 files. To work with this much data, we build a database that can be queried throughout the rest of the project. The database built in this notebook holds most of the data needed to complete the project. Four schemas organize the database:
- The Staging Schema: Has all the raw trip data for each of the different bike share services
- The Trips Schema: Has all the filtered and edited trip data for each of the different bike share services
- The Stations Schema: Has all the stations for each of the different bike share services
- The Neighborhoods Schema: Has all the information for different zipcodes within the United States

### **Connecting to the Database**

Using the AWS Management Console, an RDS database running PostgreSQL 12.5 on a db.t3.micro instance was created. In this section, we connect to it using the credentials that were generated at creation time.

In [1]:
pip install psycopg2-binary;

Collecting psycopg2-binary
  Using cached psycopg2_binary-2.8.6-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.8.6
Note: you may need to restart the kernel to use updated packages.


In [2]:
import psycopg2

In [3]:
# Fill in the database name and password before running
PGHOST = 'tripdatabase2.cmaaautpgbsf.us-east-2.rds.amazonaws.com'
PGDATABASE = ''
PGUSER = 'postgres'
PGPASSWORD = 'Josh1234'

In [4]:
# Connect to the database and confirm the connection by printing the server version
try:   
    # Set up a connection to the postgres server.    
    conn = psycopg2.connect(user = PGUSER,
                            port = "5432",
                            password = PGPASSWORD,
                            host = PGHOST,
                            database = PGDATABASE)
    # Create a cursor object
    cursor = conn.cursor()   
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("Connection Success:", record,"\n")

except (Exception, psycopg2.Error) as error:
    print("Error while connecting to PostgreSQL", error)

Connection Success: ('PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit',) 



## **Database Construction I - Creating the Staging Tables**

Each trip file contains more or less the same information about every trip taken during some period of the year; most files cover a single month. Every table in the staging schema is standardized to the same ten columns:
- Start Date & Time: The start time of the trip MM-DD-YYYY HH:MM:SS
- End Date & Time: The end time of the trip MM-DD-YYYY HH:MM:SS
- Start Station ID: The ID for the station where the trip started
- Start Station Name: The name of the station where the trip started
- Start Station Latitude: The latitude of the station where the trip started
- Start Station Longitude: The longitude of the station where the trip started
- End Station ID: The ID for the station where the trip ended
- End Station Name: The name of the station where the trip ended
- End Station Latitude: The latitude of the station where the trip ended
- End Station Longitude: The longitude of the station where the trip ended
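
As a minimal sketch of what this standardization involves, the snippet below maps one raw row onto the ten staging columns and fills the missing coordinates with the -1 sentinel used later in this notebook. The `STAGING_COLUMNS` list, the `standardize_row` helper, and the sample row are illustrative assumptions and not part of the project code; the header names come from the older Capital Bikeshare layout handled further down. It uses plain dictionaries to stay dependency-free, whereas the notebook itself works with pandas.

```python
# Hypothetical names: nothing here comes from the Queries package.
STAGING_COLUMNS = [
    "starttime", "endtime",
    "startid", "startname", "start_lat", "start_long",
    "endid", "endname", "end_lat", "end_long",
]

# Mapping from one source layout (older Capital Bikeshare headers) to staging names
CAPITAL_MAPPING = {
    "Start date": "starttime",
    "End date": "endtime",
    "Start station number": "startid",
    "Start station": "startname",
    "End station number": "endid",
    "End station": "endname",
}

def standardize_row(raw: dict, mapping: dict, missing=-1) -> dict:
    """Rename the keys of one raw row and fill absent staging columns with a sentinel."""
    renamed = {mapping[key]: value for key, value in raw.items() if key in mapping}
    return {column: renamed.get(column, missing) for column in STAGING_COLUMNS}

# Made-up sample row, for illustration only
sample = {
    "Start date": "2020-01-01 08:00:00",
    "End date": "2020-01-01 08:15:00",
    "Start station number": "1",
    "Start station": "Station A",
    "End station number": "2",
    "End station": "Station B",
}
row = standardize_row(sample, CAPITAL_MAPPING)
print(row)
```

Layouts without coordinates end up with -1 in the four latitude and longitude columns, which the geocoding step later treats as "no coordinate data".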

In [5]:
pip install s3fs;

Collecting botocore<1.19.53,>=1.19.52
  Using cached botocore-1.19.52-py2.py3-none-any.whl (7.2 MB)
ERROR: boto3 1.17.34 has requirement botocore<1.21.0,>=1.20.34, but you'll have botocore 1.19.52 which is incompatible.
ERROR: awscli 1.19.34 has requirement botocore==1.20.34, but you'll have botocore 1.19.52 which is incompatible.
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.20.34
    Uninstalling botocore-1.20.34:
      Successfully uninstalled botocore-1.20.34
Successfully installed botocore-1.19.52
Note: you may need to restart the kernel to use updated packages.


In [6]:
from io import StringIO
import numpy as np
import pandas as pd
import os
import requests
import s3fs
import sys
from urllib.parse import urlencode

#### **Introducing the Queries Package**

The Queries package is a custom package of helper functions that run against our database. The most used function is execute_query, which takes a SQL statement and executes it. For queries that return results, execute_query can return them either as a pandas dataframe or as a two-element tuple of (column_names, data).
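
The source of the Queries package is not shown in this notebook, so the following is only a sketch of how a function like execute_query might behave, assuming a DB-API 2.0 connection. It runs against an in-memory `sqlite3` database so that no PostgreSQL server is needed. The name `execute_query_sketch` and the `fetch` flag are assumptions, and the pandas return option is left out.

```python
import sqlite3

def execute_query_sketch(conn, query: str, fetch: bool = False):
    """Run a SQL statement; optionally return (column_names, rows)."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        result = None
        if fetch:
            # cursor.description holds one tuple per result column; the name comes first
            column_names = [desc[0] for desc in cursor.description]
            result = (column_names, cursor.fetchall())
        conn.commit()
        return result
    except Exception:
        conn.rollback()   # leave the connection usable after a failed statement
        raise
    finally:
        cursor.close()

# Demonstration with made-up data
demo = sqlite3.connect(":memory:")
execute_query_sketch(demo, "CREATE TABLE trip (startid INTEGER, startname TEXT);")
execute_query_sketch(demo, "INSERT INTO trip VALUES (1, 'Main St');")
columns, rows = execute_query_sketch(demo, "SELECT * FROM trip;", fetch=True)
print(columns, rows)
```

Rolling back on failure matters with PostgreSQL in particular, because a failed statement leaves the transaction in an aborted state until it is rolled back.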

In [7]:
sys.path.append(os.path.join(os.getcwd(),'Data','Scripts'))
import Queries

#### **Access Keys for AWS | GCP**

The bucket that was created is private, so credentials are needed to access it. If the bucket were public, no credentials would be needed. The Google Cloud API key isn't strictly necessary: the geocoding portion of the project could get very expensive without the free credits that Google offers, so the Data folder of the repository includes the post-geocoding data to avoid those charges.
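
One common way to keep keys like these out of a notebook is to read them from environment variables. The sketch below assumes that approach; `read_credentials` is a hypothetical helper, `GOOGLE_API_KEY` is an assumed variable name, and the demonstration uses a stand-in mapping with placeholder values rather than the real environment.

```python
import os

def read_credentials(environ=os.environ) -> dict:
    """Collect the three keys from environment variables and fail loudly if any is missing."""
    names = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "GOOGLE_API_KEY"]
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}

# Stand-in mapping with placeholder values, used instead of the real environment
fake_env = {
    "AWS_ACCESS_KEY_ID": "placeholder-id",
    "AWS_SECRET_ACCESS_KEY": "placeholder-secret",
    "GOOGLE_API_KEY": "placeholder-google",
}
creds = read_credentials(fake_env)
print(sorted(creds))
```

The returned values could then be passed to `s3fs.S3FileSystem` through its `key` and `secret` arguments in the same way as the literal strings.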

In [8]:
# The S3 bucket is private and to connect to it we need a valid AWS Access Key
ACCESS_KEY_ID = 'AKIARJEUISD2VILSZ6HM'
ACCESS_SECRET_KEY = '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'
fs = s3fs.S3FileSystem(anon=False, key = ACCESS_KEY_ID, secret= ACCESS_SECRET_KEY)

# For the geocoding we need a valid GCP Api Key 
google_api_key = 'AIzaSyCrG_VK47xMKjER4zpHyd3FJNLFn2weNFY'

#### **Subsection Structure** - Each of the following five subsections has the same structure
<ol>
    <li> Create the staging.*_trips table in the database.
    <li> Create a custom function to pull the data from S3 and upload it to the table created
    <li> Iterate through all the files and call the custom function
</ol>

In [10]:
staging_schema_query = """CREATE SCHEMA IF NOT EXISTS staging;"""
Queries.execute_query(conn, staging_schema_query)

<img src="./Data/Images/ERD-Staging.png" width="200" height="266" align="center"/>
<p style = "text-align:center"> The Entity Relationship Diagram for the 5 Trip Tables in the Staging Schema </p>
<p style = "text-align:center;font-style:italic"> The start/endID could either both be VARCHAR OR NUMERIC depending on the service </p>

#### **Creating the BayWheels Staging Table**

In [11]:
bay_filenames = fs.ls("s3://williams-citibike/TripData/BayWheels")

In [12]:
# Tables module. One function for all the tables. 
bay_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.bay_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID VARCHAR,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID VARCHAR,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL             
              );
              """

Queries.execute_query(conn, bay_staging_query)

In [13]:
def populate_bay_staging(datafile: str) -> None:
    """Grabs the baywheels data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    columns = ['start_time','end_time',
               'start_station_id', 'start_station_name', 
               'start_station_latitude', 'start_station_longitude', 
               'end_station_id', 'end_station_name',
               'end_station_latitude', 'end_station_longitude']


    altcols = ['started_at','ended_at',
               'start_station_id', 'start_station_name',
               'start_lat', 'start_lng',
               'end_station_id', 'end_station_name',
               'end_lat', 'end_lng']
       
    na_fills = {'start_lat': -1,'start_lng': -1,
               'end_lat': -1, 'end_lng': -1}
    
    # The purpose of the with block is to process and upload the files to the database
    with fs.open("s3://"+datafile, 'r') as file:
        # The try-except block falls back to the alternate column layout when the first one doesn't match
        try:
            data = pd.read_csv(file, usecols = columns, na_values="")[columns]
        except ValueError:    # raised by read_csv when usecols doesn't match the file's header
            file.seek(0)
            data = pd.read_csv(file, usecols = altcols, na_values="")[altcols]
            data.fillna(value=na_fills, inplace=True)
        
        # Some stations have commas in their name causing the copy_from to register extra data fields
        # In hindsight we could have used the tab separator
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        Queries.upload_data(conn, data, 'staging.bay_trip')

    print(f"Finished Uploading to Bay Staging Table: {datafile}")
    return None
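
The comment about the tab separator can be made concrete. The sketch below serializes rows into PostgreSQL's COPY text format with tab delimiters, so commas in station names need no replacement. `rows_to_copy_buffer` and `escape_copy_field` are hypothetical helpers, not part of the Queries package, the sample row is made up, and the final `copy_from` call appears only as a comment because it needs a live connection.

```python
from io import StringIO

def escape_copy_field(value) -> str:
    """Render one value for COPY's text format: None becomes \\N, control characters are escaped."""
    if value is None:
        return r"\N"
    text = str(value)
    # Backslash first, so the escapes added afterwards are not escaped again
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

def rows_to_copy_buffer(rows) -> StringIO:
    """Build an in-memory, tab-separated buffer from an iterable of row tuples."""
    buffer = StringIO()
    for row in rows:
        buffer.write("\t".join(escape_copy_field(v) for v in row) + "\n")
    buffer.seek(0)
    return buffer

# Made-up row with a comma in the station name and a missing value
buffer = rows_to_copy_buffer([("2018-01-01 08:00:00", "Market St, at 10th St", None)])
print(repr(buffer.getvalue()))

# With a live psycopg2 cursor, the buffer could then be loaded with something like:
# cursor.copy_from(buffer, table_name, sep='\t', null='\\N')
```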

In [14]:
for file in bay_filenames:
    populate_bay_staging(file)

Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/2017-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201801-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201802-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201803-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201804-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201805-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201806-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201807-fordgobike-tripdata.csv
Finished Uploading to Bay Staging Table: williams-citibike/TripData/BayWheels/201808-fordgobike-tripdata.csv
Finished Uploading to

#### **Creating the BlueBike Staging Table**

In [15]:
blue_filenames = fs.ls("s3://williams-citibike/TripData/BlueBike")

In [16]:
# Tables module. One function for all the tables. 
blue_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.blue_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL              
              );
              """
Queries.execute_query(conn, blue_staging_query)

In [17]:
def populate_blue_staging(datafile: str) -> None:
    """Grabs the blue bike data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
      
    columns = ['starttime','stoptime',
               'start station id', 'start station name',
               'start station latitude', 'start station longitude',
               'end station id', 'end station name',
               'end station latitude', 'end station longitude']

    # The purpose of the with block is to process and upload the files to the database
    with fs.open("s3://"+datafile, 'r') as file:
        data = pd.read_csv(file, usecols=columns, na_values = "")[columns]
        
        # Some stations have commas in their name causing the copy_from to register extra data fields
        # In hindsight we could have used the tab separator
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        Queries.upload_data(conn, data,'staging.blue_trip')
    
    print(f"Finished Uploading to Blue Staging Table: {datafile}")
    return None

In [18]:
# Data starts from 2015; files from before then don't have location data
for file in blue_filenames[5:]:
    populate_blue_staging(file)

Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201501-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201502-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201503-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201504-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201505-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201506-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201507-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201508-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/BlueBike/201509-hubway-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citi

#### **Creating the Capital Staging Table**

In [19]:
capital_filenames = fs.ls("s3://williams-citibike/TripData/CapitalBike")

In [20]:
# Tables module. One function for all the tables. 
capital_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.capital_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL              
              );
              """
Queries.execute_query(conn, capital_staging_query)

In [21]:
def populate_capital_staging(datafile: str) -> None:
    """Grabs the capital bike data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    columns = ['Start date', 'End date',
               'Start station number', 'Start station',
               'End station number', 'End station']
    
    altcolumns = ['started_at','ended_at',
                  'start_station_id', 'start_station_name',
                  'start_lat', 'start_lng',
                  'end_station_id', 'end_station_name',
                  'end_lat', 'end_lng']
    
    # The purpose of the with block is to process and upload the files to the database
    with fs.open("s3://"+datafile, 'r') as file:
        # The try-except block falls back to the alternate column layout when the first one doesn't match
        try:   
            data = pd.read_csv(file, usecols=columns, na_values = "")[columns]
            data.insert(4,'start_lat', -1)
            data.insert(5,'start_lng',-1)

            data.insert(8,'end_lat', -1)
            data.insert(9,'end_lng',-1)
        except ValueError:   # raised by read_csv when usecols doesn't match the file's header
            file.seek(0)
            data = pd.read_csv(file, usecols=altcolumns, na_values = "")[altcolumns]
            data.fillna({'start_station_id': -1, 'end_station_id':-1, 
                         'start_lat': -1, 'start_lng': -1,
                         'end_lat': -1, 'end_lng': -1}, inplace=True)
        
        # Some stations have commas in their name causing the copy_from to register extra data fields
        # In hindsight we could have used the tab separator
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')

        Queries.upload_data(conn,data,'staging.capital_trip')
    
    print(f"Finished Uploading to Capital Staging Table: {datafile}")
    return None

In [22]:
for file in capital_filenames:
    populate_capital_staging(file)


Finished Uploading to Blue Staging Table: williams-citibike/TripData/CapitalBike/2010-capitalbikeshare-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/CapitalBike/2011-capitalbikeshare-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/CapitalBike/2012Q1-capitalbikeshare-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/CapitalBike/2012Q2-capitalbikeshare-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/CapitalBike/2012Q3-capitalbikeshare-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/CapitalBike/2012Q4-capitalbikeshare-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/CapitalBike/2013Q1-capitalbikeshare-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/TripData/CapitalBike/2013Q2-capitalbikeshare-tripdata.csv
Finished Uploading to Blue Staging Table: williams-citibike/

#### **Creating the CitiBike Staging Table**

In [16]:
citi_filenames = fs.ls("s3://williams-citibike/TripData/CitiBike")

In [17]:
# Tables module. One function for all the tables. 
citi_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.citi_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL              
              );
              """
Queries.execute_query(conn, citi_staging_query)

In [18]:
def populate_citi_staging(datafile: str) -> None:
    """Grabs the citi bike data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
       
    with fs.open("s3://"+datafile, 'r') as file:
        data = pd.read_csv(file, na_values ="", usecols=list(range(1,11)))   # Can't use the C engine to speed this up
        data.fillna(-1, inplace=True)
        
        # Some stations have commas in their name causing the copy_from to register extra data fields
        # In hindsight we could have used the tab separator
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        Queries.upload_data(conn,data,'staging.citi_trip')
        
    print(f"Finished Uploading to Citi Staging Table: {datafile}")
    return None

In [19]:
for file in citi_filenames:
    populate_citi_staging(file)

Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-07 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-08 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-09 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-10 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-11 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2013-12 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/201306-citibike-tripdata.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2014-01 - Citi Bike trip data.csv
Finished Uploading to Citi Staging Table: williams-citibike/TripData/CitiBike/2014-02 - Citi Bike trip data.c

#### **Creating the Divvy Staging Table**

In [11]:
divvy_filenames = fs.ls("s3://williams-citibike/TripData/DivvyBike")

In [12]:
# Tables module. One function for all the tables. 
divvy_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.divvy_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID VARCHAR,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID VARCHAR,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL             
              );
              """
Queries.execute_query(conn, divvy_staging_query)

In [13]:
def populate_divvy_staging(datafile: str) -> None:
    """Grabs the divvy bike data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    columns = ['started_at', 'ended_at',
               'start_station_id', 'start_station_name',
               'start_lat', 'start_lng',
               'end_station_id', 'end_station_name',
               'end_lat', 'end_lng']
    
    altcolumns = ['starttime', 'stoptime',
                  'from_station_id', 'from_station_name',
                  'to_station_id','to_station_name']
    
    alt3 = ['start_time', 'end_time',
            'from_station_id', 'from_station_name',
            'to_station_id','to_station_name']
    
    names = ['starttime', 'endtime','startid','startname','endid','endname']
    
    
    # The purpose of the with block is to process and upload the files to the database
    with fs.open("s3://"+datafile, 'r') as file:
        # The nested try-except blocks try each known column layout in turn
        try:
            data = pd.read_csv(file, usecols=columns, na_values="", parse_dates=[0,1])[columns]
            data.fillna({'start_station_id': -1, 'end_station_id':-1, 
                         'start_lat': -1, 'start_lng': -1,
                         'end_lat': -1, 'end_lng': -1}, inplace=True)            
        except ValueError:
            file.seek(0)
            try:
                data = pd.read_csv(file, usecols=altcolumns, na_values = "", parse_dates=[0,1])[altcolumns]
                data.columns = names
            except ValueError:
                file.seek(0)
                try:
                    data = pd.read_csv(file, usecols=alt3, na_values = "", parse_dates=[0,1])[alt3]
                    data.columns = names
                except ValueError:
                    file.seek(0)
                    data = pd.read_csv(file, usecols=[1,2,5,6,7,8], na_values="", parse_dates=[0,1])
                    data.columns = names
        
            data.insert(4,'start_lat', -1)
            data.insert(5,'start_lng',-1)

            data.insert(8,'end_lat', -1)
            data.insert(9,'end_lng',-1)
        
        # Some stations have commas in their name causing the copy_from to register extra data fields
        # In hindsight we could have used the tab separator
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        Queries.upload_data(conn,data,'staging.divvy_trip')
        
    print(f"Finished Uploading to Divvy Staging Table: {datafile}")
    return None
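
The nested try-except blocks in the Divvy loader could be flattened by inspecting the header row first and choosing the layout up front. The following is a sketch of that idea under the assumption that each file starts with a header row; `pick_layout` is a hypothetical helper, the column lists are copied from the function above, and the sample header is made up.

```python
import csv
from io import StringIO

# Candidate header layouts, tried in order (copied from the Divvy loader)
LAYOUTS = [
    ['started_at', 'ended_at', 'start_station_id', 'start_station_name',
     'start_lat', 'start_lng', 'end_station_id', 'end_station_name',
     'end_lat', 'end_lng'],
    ['starttime', 'stoptime', 'from_station_id', 'from_station_name',
     'to_station_id', 'to_station_name'],
    ['start_time', 'end_time', 'from_station_id', 'from_station_name',
     'to_station_id', 'to_station_name'],
]

def pick_layout(file, layouts=LAYOUTS):
    """Return the first layout whose columns all appear in the header row, else None."""
    header = next(csv.reader(file))
    file.seek(0)   # rewind so the caller can read the file from the start
    present = {name.strip() for name in header}
    for layout in layouts:
        if set(layout) <= present:
            return layout
    return None

# Made-up header line standing in for a file opened from S3
sample = StringIO(
    "trip_id,starttime,stoptime,bikeid,"
    "from_station_id,from_station_name,to_station_id,to_station_name\n"
)
layout = pick_layout(sample)
print(layout)
```

The chosen list could then be passed to `pd.read_csv` as `usecols`, with a `None` result signalling that the positional fallback is needed.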

In [None]:
for file in divvy_filenames:
    populate_divvy_staging(file)

## **Database Construction II - Deriving the Station Tables**

There isn't an explicit list of stations for each service. However, every trip embeds its starting and ending station information, so SQL can be used to derive the unique stations in each service.
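
The source of Queries.get_stations is not shown here, so the query below is only an assumed illustration of the idea: select the distinct end-station fields from a trip table and exclude unwanted IDs. It runs against an in-memory `sqlite3` table with made-up rows rather than the PostgreSQL staging schema.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE trip (endid TEXT, endname TEXT, end_lat REAL, end_long REAL);")
demo.executemany(
    "INSERT INTO trip VALUES (?, ?, ?, ?);",
    [
        ("10", "Station A", 37.77, -122.41),
        ("10", "Station A", 37.77, -122.41),   # second trip ending at the same station
        ("11", "Station B", 37.78, -122.40),
        ("449", "Depot", 37.76, -122.39),      # non-consumer station to exclude
    ],
)

remove = ["449"]
placeholders = ", ".join("?" for _ in remove)   # one bound parameter per excluded ID
query = f"""
    SELECT DISTINCT endid, endname, end_lat, end_long
    FROM trip
    WHERE endid NOT IN ({placeholders})
    ORDER BY endid;
"""
stations = demo.execute(query, remove).fetchall()
print(stations)
```

Passing the excluded IDs as bound parameters keeps station names with quotes or commas from breaking the statement.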

In [36]:
pip install geopandas

Collecting geopandas
  Using cached geopandas-0.9.0-py2.py3-none-any.whl (994 kB)
Collecting pyproj>=2.2.0
  Using cached pyproj-3.0.1-cp37-cp37m-manylinux2010_x86_64.whl (6.5 MB)
Collecting fiona>=1.8
  Using cached Fiona-1.8.18-cp37-cp37m-manylinux1_x86_64.whl (14.8 MB)
Collecting shapely>=1.6
  Using cached Shapely-1.7.1-cp37-cp37m-manylinux1_x86_64.whl (1.0 MB)
Collecting munch
  Using cached munch-2.5.0-py2.py3-none-any.whl (10 kB)
Collecting click-plugins>=1.0
  Using cached click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Collecting cligj>=0.5
  Using cached cligj-0.7.1-py3-none-any.whl (7.1 kB)
Installing collected packages: pyproj, munch, click-plugins, cligj, fiona, shapely, geopandas
Successfully installed click-plugins-1.1.1 cligj-0.7.1 fiona-1.8.18 geopandas-0.9.0 munch-2.5.0 pyproj-3.0.1 shapely-1.7.1
Note: you may need to restart the kernel to use updated packages.


In [37]:
import geopandas as gpd
import shapely

#### **Geocoding Functions**


In [14]:
def get_address_components(address: str, state_initials = "") -> list:
    """Uses the name of the station to get the zipcode via the Google API with state region filtering
    
    Parameters
    ----------
    address: str
        The name of the station
    state_initials : str
        The 2 CHAR initials of the state to use with component filtering
        
    Returns
    -------
    List of Dicts:
        A list of the different components of the address, or -1 if the request fails or has no results
    """
    
    endpoint = 'https://maps.googleapis.com/maps/api/geocode/json'
    params = {'address': address, 'key': google_api_key}
    url_params = urlencode(params)
    url = f"{endpoint}?{url_params}" + f"&components=administrative_area:{state_initials}|country:US"
    
    r = requests.get(url)
    if r.status_code not in range(200, 300):
        return -1

    try:
        return r.json()['results'][0]['address_components']
    except IndexError:
        return -1

In [14]:
def get_latlong_components(lat: float, long: float) -> list:
    """Uses the coordinates of the station to get the zipcode via the Google API
    
    Parameters
    ----------
    lat: float
        The latitude coordinate of the station
    long: float
        The longitude coordinate of the station
        
    Returns
    -------
    List of Dicts:
        A list of the different components of the address, or -1 if the request fails or has no results
    """
        
    url = f"https://maps.googleapis.com/maps/api/geocode/json?latlng={lat},{long}&key={google_api_key}"
    
    r = requests.get(url)
    if r.status_code not in range(200, 300):
        return -1
   
    try:
        return r.json()['results'][0]['address_components']
    except IndexError:
        return -1

In [14]:
def extract_zipcode(components: list) -> int:
    """Iterates through the address components to find the postal code
    
    Parameters
    ----------
    components: list
        The address components returned by one of the geocoding functions, or -1 if the lookup failed
        
    Returns
    -------
    str or int:
        The postal code from the address components, -1 if the lookup failed, or None if no postal code is present
    """
    
    if components == -1: 
        return -1
    
    for comp in components:
        if comp.get('types')[0] == 'postal_code':
            return comp.get('long_name')
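
To make the lookup concrete, here is a small self-contained variant run on made-up input. The dictionaries follow the `long_name` / `short_name` / `types` shape of the Geocoding API's address components, but the values are invented, and `find_postal_code` is an illustrative name. Unlike the function above, it checks every entry in `types` rather than only the first one, which is an assumption about what is more robust and not a behavior the notebook relies on.

```python
def find_postal_code(components):
    """Return the long_name of the first postal_code component, or None if there is none."""
    if components == -1:      # the geocoding functions return -1 on failure
        return -1
    for comp in components:
        if "postal_code" in comp.get("types", []):
            return comp.get("long_name")
    return None

# Invented address components in the shape returned by the Geocoding API
sample_components = [
    {"long_name": "Market Street", "short_name": "Market St", "types": ["route"]},
    {"long_name": "San Francisco", "short_name": "SF", "types": ["locality", "political"]},
    {"long_name": "94103", "short_name": "94103", "types": ["postal_code"]},
]

print(find_postal_code(sample_components))       # postal code found
print(find_postal_code(sample_components[:2]))   # no postal code component
print(find_postal_code(-1))                      # failed lookup
```

The postal code comes back as a string, so a cast is needed before it is stored in an INTEGER column.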

In [15]:
def input_zipcode(df, state_initials = ""):
    """Uses the df data to determine a station's zip code
    
    Parameters
    ----------
    df: pandas.Series
        A single row of the station dataframe, as passed by DataFrame.apply with axis=1
    state_initials: str
        The 2 CHAR state initial to use for component filtering
        
    Returns
    -------
    str or int:
        The zip code found for the station, or -1 if the lookup failed
    """
    
    if df.end_lat < 10:   # No coordinate data means we use the address
        return extract_zipcode(get_address_components(address = df.endname, state_initials = state_initials))
    else:   # Use the coordinates
        return extract_zipcode(get_latlong_components(df.end_lat, df.end_long))

In [16]:
def manual_zip_entry(geodf, entries: list):
    """Manually inputs zip code data into the dataframe based on the entries passed
    
    Parameters
    ----------
    geodf: geopandas.geodataframe.GeoDataFrame
        The dataframe that the function will be applied on
    entries: list of tuples
        A list of tuples where each tuple is in the form (stationid, zipcode)
        
    Returns
    -------
    geopandas.geodataframe.GeoDataFrame:
        A geodataframe with all the missing zipcodes inputted
    """
    
    geodf = geodf.set_index('endid')
    
    for entry in entries:
        geodf.loc[entry[0], 'zipcode'] = entry[1]
    
    return geodf.reset_index()

#### **Subsection Structure** - Each of the following five subsections has the same structure
<ol>
    <li> Create a table in the stations schema for the specific service.
    <li> Derive the unique list of stations from the staging table and save it as a dataframe. Stations that aren't consumer stations are dropped in the retrieval process.
<li> Convert dataframe into a geodataframe by combining the latitude and longitude coordinates into a point geometry.
<li> Get the zipcode data for each station using the Google Cloud Platform API. Some stations require a manual zipcode entry.
<li> Upload the data to the table. 
<li> Add a column to the table that identifies which service the stations are associated with.
</ol>

In [28]:
stations_schema_query = """CREATE SCHEMA IF NOT EXISTS stations;"""
Queries.execute_query(conn, stations_schema_query)

<img src="Data/Images/ERD-Staging-Stations.png" width="500" height="666" align="center"/>
<p style = "text-align:center"> The Entity Relationship Diagram for the 5 Station Tables in the Stations Schema </p>
<p style = "text-align:center;font-style:italic"> The stationID could either be VARCHAR OR NUMERIC depending on the service </p>

#### **Creating the BayWheels Station Table**

In [61]:
bay_station_query = """
               CREATE TABLE IF NOT EXISTS stations.bay_station (
                   stationID VARCHAR,
                   name VARCHAR(128) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
Queries.execute_query(conn, bay_station_query)

**Station Derivation**

In [9]:
bay_remove = ['', '449', '449.0', '420.0', '408', '408.0', '484.0', 
              '16th Depot Bike Station', '16th St Depot', 'San Jose Depot', 
              'SF Depot', 'SF Depot-2 (Minnesota St Outbound)']

In [31]:
bay_station = Queries.get_stations(conn, 'bay', bay_remove)

In [32]:
def drop_decimal(x):
    """Drops the .0 from a string, if it has it"""
    
    if x.endswith('.0'):
        return x[:-2]
    return x

In [34]:
# Stations that share an ID were relocated, so only the last entry for each ID is kept
bay_station['endid'] = bay_station.endid.apply(drop_decimal)
bay_station.drop_duplicates(subset=['endid'], keep='last', inplace=True)

**Geocoding**

In [38]:
bay_spatial = gpd.GeoDataFrame(bay_station, geometry=gpd.points_from_xy(bay_station.end_long, bay_station.end_lat), crs="EPSG:4326")

In [60]:
bay_spatial['zipcode'] = bay_spatial.apply(input_zipcode, axis=1).fillna(-1)

**Manual Zip Code Entry**

In [62]:
manual_zipcodes = [('98', 94103)]

In [63]:
bay_spatial = manual_zip_entry(bay_spatial, manual_zipcodes)

**Database Upload**

In [64]:
Queries.upload_data(conn, bay_spatial, 'stations.bay_station', sep='\t')

In [65]:
Queries.add_bike_service_name(conn, 'bay_station','bay', schema = 'stations')

#### **Creating the BlueBike Station Table**

In [21]:
# Tables module
blue_station_query = """
               CREATE TABLE IF NOT EXISTS stations.blue_station (
                   stationID VARCHAR,
                   name VARCHAR(128) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
Queries.execute_query(conn, blue_station_query)

**Station Derivation**

In [10]:
blue_remove = [153, 158, 164, 223, 229, 230, 308, 382]

In [18]:
blue_station = Queries.get_stations(conn, 'blue', blue_remove)

**Geocoding**

In [19]:
blue_spatial = gpd.GeoDataFrame(blue_station, geometry=gpd.points_from_xy(blue_station.end_long, blue_station.end_lat), crs="EPSG:4326")

In [20]:
blue_spatial['zipcode'] = blue_spatial.apply(input_zipcode, axis=1)

**Database Upload**

In [22]:
Queries.upload_data(conn, blue_spatial, 'stations.blue_station', sep='\t')

In [23]:
Queries.add_bike_service_name(conn, 'blue_station','blue', schema = 'stations')

#### **Creating the Capital Station Table**

In [27]:
# Tables module
capital_station_query = """
               CREATE TABLE IF NOT EXISTS stations.capital_station (
                   stationID VARCHAR,
                   name VARCHAR(128) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
Queries.execute_query(conn, capital_station_query)

**Station Derivation**

In [24]:
# -1 is excluded because it isn't a "stationary" station, so distances calculated from it wouldn't be correct
capital_remove = [-1]

In [25]:
capital_station = Queries.get_stations(conn, 'capital', capital_remove)

**Geocoding**

In [26]:
capital_spatial = gpd.GeoDataFrame(capital_station, geometry=gpd.points_from_xy(capital_station.end_long, capital_station.end_lat), crs="EPSG:4326")

In [28]:
def capital_input_zipcode(df):
    """Uses a station's data to determine its zip code for capital bike only
    
    Parameters
    ----------
    df: pandas.Series
        A single row of the dataframe, as passed by DataFrame.apply with axis=1
        
    Returns
    -------
    str or None:
        The station's zip code if one is found; None if the station has no
        coordinates and no state produces a match
    """
    
    # Try each of the three states in turn until one returns a match
    state_initials = ['DC', 'VA', 'MD']
    
    if df.end_lat < 10:   # No coordinates
        for state in state_initials:
            zip_code = extract_zipcode(get_address_components(address = df.endname, state_initials = state))
            if isinstance(zip_code, str):   # If get_address fails it returns -1
                return zip_code
    else:
        return extract_zipcode(get_latlong_components(df.end_lat, df.end_long))

In [29]:
capital_spatial['zipcode'] = capital_spatial.apply(capital_input_zipcode, axis=1).fillna(-1)
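
The fallback used for stations without coordinates can be isolated as a small sketch. Here `first_zipcode`, `fake_lookup` and `fake_index` are hypothetical names introduced for illustration, and the lookup table is made up; the real notebook calls `get_address_components` and `extract_zipcode` instead.

```python
def first_zipcode(station_name, states, lookup):
    """Return the first zip code that `lookup` finds for the station name.

    `lookup(name, state)` is a hypothetical stand-in for the geocoding helpers
    used above; it should return a zip code string or None.
    """
    for state in states:
        zip_code = lookup(station_name, state)
        if isinstance(zip_code, str):
            return zip_code
    return None   # no state matched; the notebook later fills this with -1


# Toy lookup table standing in for a real geocoder (values are made up).
fake_index = {('Example St & 1st St', 'VA'): '22201'}


def fake_lookup(name, state):
    return fake_index.get((name, state))


print(first_zipcode('Example St & 1st St', ['DC', 'VA', 'MD'], fake_lookup))
print(first_zipcode('Unknown Station', ['DC', 'VA', 'MD'], fake_lookup))
```

The order of the state list decides which match wins when a street name exists in more than one state, so it is worth listing the most likely state first.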

**Database Upload**

In [30]:
Queries.upload_data(conn, capital_spatial, 'stations.capital_station', sep='\t')

In [31]:
Queries.add_bike_service_name(conn, 'capital_station','capital', schema = 'stations')

#### **Creating the CitiBike Station Table**

In [35]:
# Tables module
citi_station_query = """
               CREATE TABLE IF NOT EXISTS stations.citi_station (
                   stationID VARCHAR,
                   name VARCHAR(128) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
Queries.execute_query(conn, citi_station_query)

**Station Derivation**

In [52]:
citi_remove = [-1, 3036, 3650, 3247, 3248, 3446, 3480, 3488, 3633]

In [33]:
citi_station = Queries.get_stations(conn, 'citi', citi_remove)

**Geocoding**

In [34]:
citi_spatial = gpd.GeoDataFrame(citi_station, geometry=gpd.points_from_xy(citi_station.end_long, citi_station.end_lat), crs="EPSG:4326")

In [36]:
citi_spatial['zipcode'] = citi_spatial.apply(input_zipcode, axis=1).fillna(-1)

Manual Zip Code Entry

In [37]:
manual_zipcodes = [(152, 10007)]

In [38]:
citi_spatial = manual_zip_entry(citi_spatial, manual_zipcodes)

**Database Upload**

In [39]:
Queries.upload_data(conn, citi_spatial, 'stations.citi_station', sep='\t')

In [40]:
Queries.add_bike_service_name(conn, 'citi_station','citi', schema = 'stations')

#### **Creating the Divvy Station Table**

In [44]:
# Tables module
divvy_station_query = """
               CREATE TABLE IF NOT EXISTS stations.divvy_station (
                   stationID VARCHAR,
                   name VARCHAR(128) NOT NULL,
                   latitude REAL,
                   longitude REAL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL,
                   zipcode INTEGER
                );
                
                """
Queries.execute_query(conn, divvy_station_query)

**Station Derivation**

In [36]:
divvy_remove =  ['-1', '1', '360', '361', '363', '512']

In [37]:
divvy_station = Queries.get_stations(conn, 'divvy', divvy_remove)

In [20]:
# Stations that share the same ID were relocated, so keep only the last entry for each ID
divvy_station['endid'] = divvy_station.endid.apply(drop_decimal)
divvy_station.drop_duplicates(subset=['endid'], keep='last', inplace=True)

**Geocoding**

In [43]:
divvy_spatial = gpd.GeoDataFrame(divvy_station, geometry=gpd.points_from_xy(divvy_station.end_long, divvy_station.end_lat), crs="EPSG:4326")

In [45]:
divvy_spatial['zipcode'] = divvy_spatial.apply(input_zipcode, state_initials='IL',axis=1).fillna(-1)

Manual Zip Code Entry

In [47]:
manual_zipcodes = [
    ('606', 60302), ('609', 60305), ('610', 60302), ('613', 60302), 
    ('614',60302), ('617', 60304), ('669', 60611) 
]

In [48]:
divvy_spatial = manual_zip_entry(divvy_spatial, manual_zipcodes)

**Database Upload**

In [49]:
Queries.upload_data(conn, divvy_spatial, 'stations.divvy_station', sep='\t')

In [50]:
Queries.add_bike_service_name(conn, 'divvy_station','divvy', schema = 'stations')

## **Database Construction III - Creating the Trip Tables**

We will use the staging tables along with the station tables to process all the data into the trip tables that will be used for the remainder of the project. Remember, when deriving the station tables we removed stations that aren't consumer stations. The trips involving those stations are still in the staging tables, and we need to drop them before we do anything else.

The drop lists for stations and trips aren't exactly the same. Some entries shouldn't be stations AND shouldn't be included in the trips. For example, DIVVY PARTS TESTING shouldn't be a station, and the trips that involve that 'station' aren't consumer trips. Other entries shouldn't be stations but should STILL be included in the trips. For example, a blank start name (startid = -1) corresponds to either an error recording the station information or a "floating bike", but it is still a valid trip. A blank name shouldn't be in the stations table, but we shouldn't remove that trip from the trip table.

Once the fake trips are dropped, we are going to manufacture four additional columns:

<ol>
    <li> Duration: How long each trip took in minutes (subtract the starttime from the endtime)
    <li> Distance (Miles): The distance of the trip (distance between the station geometries)
    <li> Speed (MPH): How fast the rider was going (distance divided by duration)
    <li> Bikeshare: The name of the service the trip was in (a constant value for every entry in a table)
</ol>      

<p style="text-align:center;font-style:italic"> All of the functions to process the raw data are in the Queries package </p>
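
To make the three derived numeric columns concrete, here is a minimal sketch in plain Python. It assumes a great-circle (haversine) distance between the start and end coordinates; the Queries package works on the station geometries inside the database and may compute distance differently. The function names and the sample coordinates are invented for illustration.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def trip_metrics(starttime, endtime, start, end):
    """Return (duration in minutes, distance in miles, speed in mph)."""
    duration_min = (endtime - starttime).total_seconds() / 60
    distance_mi = haversine_miles(*start, *end)
    # Duration is in minutes, so convert to hours before dividing
    speed_mph = distance_mi / (duration_min / 60) if duration_min > 0 else 0.0
    return duration_min, distance_mi, speed_mph


# Invented trip: 15 minutes between two made-up points
duration, distance, speed = trip_metrics(
    datetime(2020, 1, 1, 8, 0), datetime(2020, 1, 1, 8, 15),
    (40.7128, -74.0060), (40.7306, -73.9866),
)
print(duration, round(distance, 2), round(speed, 2))
```

Because the distance is measured station to station, a round trip that starts and ends at the same station has a distance and speed of zero, however long the ride lasted.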

In [41]:
trips_schema_query = """CREATE SCHEMA IF NOT EXISTS trips;"""
Queries.execute_query(conn, trips_schema_query)

<img src="./Data/Images/ERD-Stations-Trips.png" width="600" height="800" align="center" />
<p style = "text-align:center"> The Entity Relationship Diagram for the 5 Trip Tables in the Trips Schema </p>
<p style = "text-align:center;font-style:italic"> The start/endID could either be VARCHAR OR NUMERIC depending on the service </p>

#### **BayWheels Trip Derivation**
The BayWheels table doesn't fit the generic format for the Queries.delete_non_trips function, so its delete query has to be hardcoded.

In [43]:
Queries.trip_from_staging(conn, 'bay', id_type = 'VARCHAR')

In [44]:
delete_bay_non_trips_query = """
            DELETE FROM trips.bay_trip
            WHERE startid IN ('449', '420', '408', '484', 
                              '16th Depot Bike Station', '16th St Depot', 'San Jose Depot', 
                              'SF Depot', 'SF Depot-2 (Minnesota St Outbound)'
                          )
            
               OR endid IN ('449', '420', '408', '484', 
                           '16th Depot Bike Station', '16th St Depot', 'San Jose Depot', 
                           'SF Depot', 'SF Depot-2 (Minnesota St Outbound)'
                       );
            """
Queries.execute_query(conn, delete_bay_non_trips_query)
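
The hardcoded statement above repeats the same ID list twice. As a sketch of an alternative, the statement could be generated from a Python list with one `%s` placeholder per ID, which is the placeholder style psycopg2 uses. `build_delete_query` is a hypothetical helper, not part of the Queries package, and the table name is assumed to be a trusted constant because it is interpolated directly into the SQL text.

```python
def build_delete_query(table, ids):
    """Build a DELETE statement with one %s placeholder per ID.

    Returns the SQL text and the parameter tuple. The ID list is passed
    twice, once for the startid clause and once for the endid clause.
    """
    placeholders = ', '.join(['%s'] * len(ids))
    query = (
        f"DELETE FROM {table} "
        f"WHERE startid IN ({placeholders}) "
        f"OR endid IN ({placeholders});"
    )
    return query, tuple(ids) * 2


bay_trip_remove = ['449', '420', '408', '484',
                   '16th Depot Bike Station', '16th St Depot', 'San Jose Depot',
                   'SF Depot', 'SF Depot-2 (Minnesota St Outbound)']

query, params = build_delete_query('trips.bay_trip', bay_trip_remove)
print(query)
print(len(params))
```

The pair would then be passed to a psycopg2 cursor as `cursor.execute(query, params)`, which leaves the quoting of the ID values to the driver.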

In [45]:
Queries.add_bike_service_name(conn, 'bay_trip','bay', schema='trips')

#### **BlueBike Trips Derivation**

In [46]:
Queries.trip_from_staging(conn, 'blue')

In [48]:
Queries.add_bike_service_name(conn, 'blue_trip','blue', schema='trips')

#### **CapitalBike Trips Derivation**

In [53]:
Queries.trip_from_staging(conn, 'capital')

In [54]:
Queries.add_bike_service_name(conn, 'capital_trip','capital', schema='trips')

#### **CitiBike Trips Derivation**

In [None]:
Queries.trip_from_staging(conn, 'citi')

In [None]:
Queries.delete_non_trips(conn, 'citi', citi_remove)

In [None]:
Queries.add_bike_service_name(conn, 'citi_trip','citi', schema='trips')

#### **DivvyBike Trips Derivation**
The DivvyBike table, like BayWheels, doesn't fit the generic format for the Queries.delete_non_trips function, so its delete query has to be hardcoded.

In [None]:
Queries.trip_from_staging(conn, 'divvy', id_type='VARCHAR')

In [None]:
delete_divvy_non_trips_query = """
            DELETE FROM trips.divvy_trip
            WHERE startid IN ('1', '360', '361', '363', '512')
              OR endid IN ('1', '360', '361', '363', '512');
            """
Queries.execute_query(conn, delete_divvy_non_trips_query)

In [None]:
Queries.add_bike_service_name(conn, 'divvy_trip','divvy', schema='trips')

## **Database Construction IV - Preparing the Neighborhood Titles Tables**

This section stores the neighborhood names and their boundary geometries for New York City and San Francisco.

In [12]:
neighborhoods_schema_query = """CREATE SCHEMA IF NOT EXISTS neighborhoods;"""
Queries.execute_query(conn, neighborhoods_schema_query)

#### **NYC Neighborhoods I - The Neighborhood Information**

In [13]:
from bs4 import BeautifulSoup

In [14]:
# Connecting to Furman to get the code-names of neighborhoods
NYCHoodUrl = 'https://furmancenter.org/neighborhoods'

try:
    r2 = requests.get(NYCHoodUrl)
    r2.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print(errh)

In [15]:
soup = BeautifulSoup(r2.content, 'html.parser')

hood_code_names = [] # list of tuples: (code, neighborhood)

for code in soup.find_all('option')[1:]:
    hood_code_names.append((code.text[:4], code.text[6:].replace('/','-').replace(" ", "_")))

hood_df = pd.DataFrame(hood_code_names, columns=['code', 'hoodname'])
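
The slicing in the loop is easier to follow on a single label. The sketch below assumes each option's text is a 4-character code, a 2-character separator, and then the neighborhood name (for example `BK01: Greenpoint/Williamsburg`); that label format is inferred from the slice positions rather than confirmed from the page, and `parse_option` is a hypothetical helper.

```python
def parse_option(text):
    """Split an option label into (code, hoodname) the same way as the loop above."""
    code = text[:4]
    name = text[6:].replace('/', '-').replace(' ', '_')
    return code, name


# Assumed label format: 4-character code, 2-character separator, then the name
print(parse_option('BK01: Greenpoint/Williamsburg'))
print(parse_option('BK05: East New York/Starrett City'))
```

Replacing the slashes and spaces keeps the neighborhood names safe to use later as file or column names.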

In [16]:
borough = {
    "BK": "Brooklyn",
    "BX": "Bronx",
    "MN": "Manhattan",
    "QN": "Queens",
    "SI": "Staten"
}

hood_df['borough'] = hood_df['code'].str[0:2].map(borough)

In [17]:
hood_df.head()

| | code | hoodname | borough |
|---|---|---|---|
| 0 | BK01 | Greenpoint-Williamsburg | Brooklyn |
| 1 | BK02 | Fort_Greene-Brooklyn_Heights | Brooklyn |
| 2 | BK03 | Bedford_Stuyvesant | Brooklyn |
| 3 | BK04 | Bushwick | Brooklyn |
| 4 | BK05 | East_New_York-Starrett_City | Brooklyn |


#### **NYC Neighborhoods II - The GeoSpatial Data**

In [18]:
geofile = "s3://williams-citibike/GeoSpatial/NYC-Neighborhoods.geojson"

with fs.open(geofile, 'rb') as file:
    nyc_spatial = gpd.read_file(file)

In [19]:
nyc_spatial.head()

| | boro_cd | shape_area | shape_leng | geometry |
|---|---|---|---|---|
| 0 | 311 | 103177785.365 | 51549.5578567 | MULTIPOLYGON (((-73.97299 40.60881, -73.97259 ... |
| 1 | 313 | 88195686.2748 | 65821.875577 | MULTIPOLYGON (((-73.98372 40.59582, -73.98305 ... |
| 2 | 312 | 99525500.0655 | 52245.8304843 | MULTIPOLYGON (((-73.97140 40.64826, -73.97121 ... |
| 3 | 206 | 42664311.3238 | 35875.7111725 | MULTIPOLYGON (((-73.87185 40.84376, -73.87192 ... |
| 4 | 226 | 50566410.6415 | 32820.3983295 | MULTIPOLYGON (((-73.86790 40.90294, -73.86796 ... |


In [20]:
borough_num_to_abr = {
        "3": "BK", 
        "2": "BX",
        "1": "MN",
        "4": "QN",
        "5": "SI"
        }

nyc_spatial["boro_cd"] = nyc_spatial["boro_cd"].str[0].map(borough_num_to_abr) + nyc_spatial['boro_cd'].str[1:]
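
The conversion above swaps the leading borough digit of each community district number for the two-letter prefix used in `hood_df`. A plain-Python sketch of the same mapping on individual values follows; `to_hood_code` is a hypothetical helper written for illustration.

```python
borough_num_to_abr = {"3": "BK", "2": "BX", "1": "MN", "4": "QN", "5": "SI"}


def to_hood_code(boro_cd):
    """Turn a community district number such as '311' into a code such as 'BK11'."""
    return borough_num_to_abr[boro_cd[0]] + boro_cd[1:]


for boro_cd in ['311', '206', '101']:
    print(boro_cd, '->', to_hood_code(boro_cd))
```

One difference from the pandas version: an unknown leading digit raises a `KeyError` here, whereas `Series.map` would quietly produce `NaN` for that row.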

In [21]:
nyc_spatial = hood_df.merge(nyc_spatial, left_on='code', right_on='boro_cd', how='left').loc[:, ['code','hoodname','borough','geometry']]

In [22]:
nyc_spatial.head()

| | code | hoodname | borough | geometry |
|---|---|---|---|---|
| 0 | BK01 | Greenpoint-Williamsburg | Brooklyn | MULTIPOLYGON (((-73.92406 40.71411, -73.92404 ... |
| 1 | BK02 | Fort_Greene-Brooklyn_Heights | Brooklyn | MULTIPOLYGON (((-73.96929 40.70709, -73.96839 ... |
| 2 | BK03 | Bedford_Stuyvesant | Brooklyn | MULTIPOLYGON (((-73.91805 40.68721, -73.91800 ... |
| 3 | BK04 | Bushwick | Brooklyn | MULTIPOLYGON (((-73.89647 40.68234, -73.89653 ... |
| 4 | BK05 | East_New_York-Starrett_City | Brooklyn | MULTIPOLYGON (((-73.86841 40.69473, -73.86868 ... |


#### **NYC Neighborhoods III - Upload to the Database**

In [25]:
nychood_table_query = """
        CREATE TABLE IF NOT EXISTS neighborhoods.nyc_hood (
            code CHAR(4) PRIMARY KEY,
            hoodname VARCHAR NOT NULL,
            borough VARCHAR(16) NOT NULL,
            geometry GEOGRAPHY(MULTIPOLYGON,4326) NOT NULL
        );
        """
Queries.execute_query(conn, nychood_table_query)
Queries.upload_data(conn, nyc_spatial, 'neighborhoods.nyc_hood', sep='\t')

#### **San Francisco Neighborhoods I - The GeoSpatial Data**

In [47]:
geofile = "s3://williams-citibike/GeoSpatial/San-Francisco-Neighborhoods.geojson"

with fs.open(geofile, 'rb') as file:
    sanfran_spatial = gpd.read_file(file)

In [48]:
sanfran_spatial.head()

| | nhood | geometry |
|---|---|---|
| 0 | Bayview Hunters Point | MULTIPOLYGON (((-122.38158 37.75307, -122.3815... |
| 1 | Bernal Heights | MULTIPOLYGON (((-122.40361 37.74934, -122.4037... |
| 2 | Castro/Upper Market | MULTIPOLYGON (((-122.42656 37.76948, -122.4269... |
| 3 | Chinatown | MULTIPOLYGON (((-122.40623 37.79756, -122.4055... |
| 4 | Excelsior | MULTIPOLYGON (((-122.42398 37.73155, -122.4239... |


#### **San Francisco Neighborhoods II - Upload to the Database**

In [49]:
sanfran_table_query = """
        CREATE TABLE IF NOT EXISTS neighborhoods.sanfran_hood (
            hoodname VARCHAR NOT NULL,
            geometry GEOGRAPHY(MULTIPOLYGON,4326) NOT NULL
        );
        """
Queries.execute_query(conn, sanfran_table_query)
Queries.upload_data(conn, sanfran_spatial, 'neighborhoods.sanfran_hood', sep='\t')

## **Database Construction V - The Neighborhood Statistics Tables**

#### **NYC Neighborhoods I - Lookup Table**

In [3]:
hood_filenames = fs.ls("s3://williams-citibike/HoodData/")[1:-4]

In [38]:
# Tables module
lookup_table_query = """
                CREATE TABLE IF NOT EXISTS neighborhoods.nyc_lookup (
                    alias VARCHAR(5) PRIMARY KEY,
                    indicator VARCHAR,
                    description VARCHAR
                );
                """

Queries.execute_query(conn, lookup_table_query)

In [31]:
cols_lst = [2,3,4]
names_lst = ["indicator_category", "indicator", "description"]
lookup = pd.read_excel("s3://" + hood_filenames[0], sheet_name=1, usecols = cols_lst, names = names_lst)

In [32]:
lookup = lookup.sort_values(by=["indicator_category",'indicator'])

In [33]:
alias = {
    'Demographics': 'DEM',
    'Housing Market and Conditions': 'HSC',
    'Land Use and Development': 'LUD',
    'Neighborhood Services and Conditions': 'NSC',
    'Renters': 'RNT'
}

In [34]:
lookup['indicator_category'] = lookup["indicator_category"].map(alias)

In [35]:
lookup = lookup.rename(columns={'indicator_category':'alias'})

In [36]:
indicator_group_order = lookup.groupby("alias").cumcount()+1

In [37]:
lookup['alias'] = lookup['alias'] + indicator_group_order.astype(str)
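
The last few cells build each alias from a three-letter category prefix plus the indicator's position within its category. The sketch below reproduces that numbering in plain Python on invented input; `number_within_category` is a hypothetical helper, and the category list is shortened for the example.

```python
from collections import defaultdict


def number_within_category(categories, alias):
    """Append a 1-based running count to each category's alias.

    Equivalent in spirit to groupby('alias').cumcount() + 1 on sorted rows.
    """
    seen = defaultdict(int)
    result = []
    for category in categories:
        seen[category] += 1
        result.append(f"{alias[category]}{seen[category]}")
    return result


alias = {'Demographics': 'DEM', 'Renters': 'RNT'}
categories = ['Demographics', 'Demographics', 'Demographics', 'Renters', 'Renters']
print(number_within_category(categories, alias))
```

With a three-letter prefix, the aliases fit the `VARCHAR(5)` primary key as long as no category has more than 99 indicators.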

In [39]:
Queries.upload_data(conn, lookup, 'neighborhoods.nyc_lookup', sep='\t')

#### **NYC Neighborhoods II - The Profile Dataframe**

In [40]:
def flatten_hood_data(datafile: str) -> pd.DataFrame:
    """Grabs the data from the s3 bucket and flattens it to a single row consisting of the neighborhood attributes
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    pd.DataFrame:
        A single row DataFrame that contains the attributes of the neighborhood
    """
    cols_lst = [0,2,3,8]
    names_lst = ["code", "indicator category", "indicator", "2018"]

    # This function is a mess
    
    with fs.open("s3://"+datafile, 'rb') as file:
        data = pd.read_excel(file, sheet_name=1, usecols = cols_lst, names = names_lst)
       
        #In the previous section we did all the alias work, now we can simply input it into the df from lookup['alias']
        data = data.sort_values(by=['indicator category','indicator'])
        data.insert(1, 'alias', lookup['alias'])
        data = data.drop(columns = ['indicator category', 'indicator'])

        # Prep the '2018' column so that it can be used as the value argument in the pivot_table 
        data['2018'] = data['2018'].str.replace('$', "", regex=False)
        data['2018'] = data['2018'].str.replace(',', "", regex=False)

        # Values that are percents get turned into decimals
        for index, value in data['2018'].items():
            if isinstance(value,str):
                if value[-1] == '%':
                    data.loc[index, '2018'] = float(value.strip('%')) / 100

        data['2018'] = pd.to_numeric(data['2018'])

        # The pivot_table alphabetizes the columns, but we want to maintain the original order
        column_order = ['code'] + list(data['alias'])

        data = data.pivot_table(index=['code'],values='2018', columns='alias', dropna=False)
        data = data.rename_axis(None, axis=1).reset_index()   # The pivot creates a unnecessary column axis
        data['code'] = data['code'][0].replace(" ","")
        data = data.reindex(column_order, axis=1)

    return data
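
The value clean-up inside `flatten_hood_data` (strip `$` and `,`, turn percentages into fractions) can be sketched as a standalone function that handles one cell at a time. `parse_indicator_value` is a hypothetical helper, and the sample values are invented. Unlike `pd.to_numeric`, which raises on unparseable input by default, this sketch returns `None` for such values.

```python
def parse_indicator_value(value):
    """Convert a raw cell such as '$1,234', '50%' or 12.5 to a float.

    Percentages become fractions; values that cannot be parsed become None.
    """
    if value is None:
        return None
    if not isinstance(value, str):
        return float(value)
    cleaned = value.replace('$', '').replace(',', '').strip()
    try:
        if cleaned.endswith('%'):
            return float(cleaned[:-1]) / 100
        return float(cleaned)
    except ValueError:
        return None


for raw in ['$1,234', '50%', 12.5, 'n/a']:
    print(repr(raw), '->', parse_indicator_value(raw))
```

Handling non-string cells explicitly matters because a spreadsheet column often mixes formatted text with plain numbers.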

In [41]:
# This only works if the HoodData folder contains those specific neighborhood Excel files
# pd.concat is used because DataFrame.append was removed in pandas 2.0
hood_profile = pd.concat([flatten_hood_data(hood) for hood in hood_filenames])

In [42]:
hood_profile = hood_profile.dropna(axis=1, how='all')

In [43]:
hood_profile = hood_profile.fillna(-1)   # We need to fill NaN with -1 so they can be put into the database

#### **NYC Neighborhoods III - Uploading the Profile Table into the Database**

In [48]:
# Tables Module
profile_table_query = """
                CREATE TABLE IF NOT EXISTS neighborhoods.nyc_profile(
                );
                """
Queries.execute_query(conn, profile_table_query)

In [49]:
for name in hood_profile.columns:
    if name == 'code':
        import_column_query = f"""
                    ALTER TABLE neighborhoods.nyc_profile
                    ADD COLUMN {name} CHAR(4) PRIMARY KEY;
                    """
    else:
        import_column_query = f"""
                    ALTER TABLE neighborhoods.nyc_profile
                    ADD COLUMN {name} REAL;
                    """
    
    Queries.execute_query(conn, import_column_query)
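
The loop above issues one `ALTER TABLE` per column. As a sketch of an alternative, the whole table definition could be generated in a single `CREATE TABLE` statement from the column list. `build_create_table` is a hypothetical helper, the column names are assumed to be trusted, valid identifiers (they are interpolated directly), and the key column is typed `VARCHAR` here for simplicity rather than `CHAR(4)`.

```python
def build_create_table(table, columns, text_columns=(), primary_key=None):
    """Build one CREATE TABLE statement instead of one ALTER TABLE per column.

    Columns named in `text_columns` become VARCHAR and everything else REAL,
    mirroring the type choices made in the loop above.
    """
    definitions = []
    for name in columns:
        col_type = 'VARCHAR' if name in text_columns else 'REAL'
        suffix = ' PRIMARY KEY' if name == primary_key else ''
        definitions.append(f'    {name} {col_type}{suffix}')
    body = ',\n'.join(definitions)
    return f'CREATE TABLE IF NOT EXISTS {table} (\n{body}\n);'


ddl = build_create_table(
    'neighborhoods.nyc_profile',
    ['code', 'DEM1', 'DEM2'],       # shortened, illustrative column list
    text_columns=('code',),
    primary_key='code',
)
print(ddl)
```

A single statement also avoids the situation where rerunning the cell fails because the columns were already added by an earlier run.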

In [50]:
Queries.upload_data(conn, hood_profile, 'neighborhoods.nyc_profile', sep='\t')

#### **San Francisco Neighborhoods I -  Population Dataframe**

In [19]:
pip install PyPDF2

Processing /root/.cache/pip/wheels/80/1a/24/648467ade3a77ed20f35cfd2badd32134e96dd25ca811e64b3/PyPDF2-1.26.0-py3-none-any.whl
Installing collected packages: PyPDF2
Successfully installed PyPDF2-1.26.0
Note: you may need to restart the kernel to use updated packages.


In [23]:
import PyPDF2

In [51]:
pdfile = "s3://williams-citibike/HoodData/SX01_SanFran-Neighborhoods-Data.pdf"
population_list = []

In [52]:
with fs.open(pdfile, 'rb') as file:
    pdfread = PyPDF2.PdfFileReader(file)
    
    for page in range(13,94,2):
        data = pdfread.getPage(page)
        text = data.extractText()

        first_line = text.split('\n')[0]
        neighborhood = ''.join([i for i in first_line.split('Demographics')[0] if not i.isdigit()])
        population = ''.join([i for i in first_line.split()[-1] if i.isdigit()])

        if population == '':
            population = ''.join([i for i in first_line.split()[3] if i.isdigit()])

        population_list.append((neighborhood, population))

population_df = pd.DataFrame(population_list, columns= ['hoodname', 'population'])
population_df = population_df.astype({'population':'int32'})



In [54]:
population_df.head()

| | hoodname | population |
|---|---|---|
| 0 | Bayview Hunters Point | 37600 |
| 1 | Bernal Heights | 26140 |
| 2 | Castro/Upper Market | 21090 |
| 3 | Chinatown | 14820 |
| 4 | Excelsior | 39340 |


#### **San Francisco Neighborhoods II - Uploading the Populations into Database**

In [55]:
sanfranhood_table_query = """
        CREATE TABLE IF NOT EXISTS neighborhoods.sanfran_profile (
            hoodname VARCHAR NOT NULL,
            population INTEGER
        );
        """
Queries.execute_query(conn, sanfranhood_table_query)
Queries.upload_data(conn, population_df, 'neighborhoods.sanfran_profile', sep='\t')

#### **Zip Codes I - Creating the Zip Code Profile Table**

In [67]:
zipcode_file = "s3://williams-citibike/HoodData/ZX01_Zipcodes-USA.csv"

with fs.open(zipcode_file, 'r') as file:
    zipcodes = pd.read_csv(file, sep=',', low_memory = False)
    zipcodes = zipcodes[zipcodes.country_name == 'United States']
    zipcodes.drop(columns = ['cities_postalcode_id', 'country_name',
                         'area_land_sq_miles', 'area_water_sq_miles',
                         'units_in_structure_housing_units_total_housing_units'
                        ], inplace=True)

In [68]:
column_renames = ['zipcode', 'cbsa', 'state_name', 'state_code', 'ziptype', 'land_area_sqm',
                  'water_area_sqm', 'total_population', 'total_population_18_over',
                  'median_age', 'pct_labor_force_16_over', 
                  'pct_labor_force_unemployed_16_over', 'pct_armed_force_16_over', 
                  'pct_labor_force_employed_16_over', 'median_household_income',
                  'family_income_per_capita', 'total_housing_units',
                  'pct_vacant_housing', 'occupied_housing_units',
                  'pct_no_vehicle_of_occupied_housing', 'median_price_owner_occupied_units',
                  'median_rent_occupied_units_paying_rent', 'pct_1_unit_attached',
                  'pct_1_unit_detached', 'pct_2_units', 'pct_3_4_units', 'pct_5_9_units',
                  'pct_10_19_units', 'pct_20_over_units', 'avg_household_size', 'pct_bachelors_over_25_over',
                  'pct_diff_housing_from_last_year', 'pct_same_housing_from_last_year',
                  'resident_since_last_year', 'population_in_college_grad', 'population_density', 
                  'pct_population_in_college_grad', 'zipcode_segment'
                 ]

zipcodes.columns = column_renames
zipcodes.zipcode = zipcodes.zipcode.str.zfill(5)
zipcodes.loc[:,'land_area_sqm':'pct_population_in_college_grad'] = zipcodes.loc[:,'land_area_sqm':'pct_population_in_college_grad'].fillna(-1)
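
Zip codes that begin with zero lose their leading digits whenever they are handled as numbers, which is why the column is padded back to five characters. A small illustration of `str.zfill` on invented sample values:

```python
raw_zipcodes = ['210', '2134', '60611']

# Left-pad each code with zeros up to five characters
padded = [z.zfill(5) for z in raw_zipcodes]
print(padded)
```

Values that are already five characters long are returned unchanged, so the padding is safe to apply to the whole column.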

In [69]:
zipcodes.head()

Unnamed: 0,zipcode,cbsa,state_name,state_code,ziptype,land_area_sqm,water_area_sqm,total_population,total_population_18_over,median_age,...,pct_20_over_units,avg_household_size,pct_bachelors_over_25_over,pct_diff_housing_from_last_year,pct_same_housing_from_last_year,resident_since_last_year,population_in_college_grad,population_density,pct_population_in_college_grad,zipcode_segment
0,210,,New Hampshire,NH,,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,
1,211,,New Hampshire,NH,,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,
2,212,,New Hampshire,NH,,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,
3,213,,New Hampshire,NH,,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,
4,214,,New Hampshire,NH,,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,


In [70]:
# Tables Module
profile_table_query = """
                CREATE TABLE IF NOT EXISTS neighborhoods.zipcodes_profile(
                );
                """
Queries.execute_query(conn, profile_table_query)

In [71]:
varchar_list = ['zipcode', 'cbsa', 'state_name', 'state_code', 'ziptype', 'zipcode_segment']

for name in zipcodes.columns:
    if name in varchar_list:
        import_column_query = f"""
                    ALTER TABLE neighborhoods.zipcodes_profile
                    ADD COLUMN {name} VARCHAR;
                    """
    else:
        import_column_query = f"""
                    ALTER TABLE neighborhoods.zipcodes_profile
                    ADD COLUMN {name} REAL;
                    """
    
    Queries.execute_query(conn, import_column_query)

In [72]:
Queries.upload_data(conn, zipcodes, 'neighborhoods.zipcodes_profile', sep='\t')