## Designing the Database

Each Citi Bike file records every trip taken during a single month. There are files for each month starting from June 2013, and every file has the same format. The order and description of the columns are as follows:
- Trip Duration (seconds): The length of the trip in seconds
- Start Date & Time: The start time of the trip MM-DD-YYYY HH:MM:SS
- End Date & Time: The end time of the trip MM-DD-YYYY HH:MM:SS
- Start Station ID: The ID for the station where the trip started
- Start Station Name: The name of the station where the trip started
- Start Station Latitude: The latitude of the station where the trip started
- Start Station Longitude: The longitude of the station where the trip started
- End Station ID: The ID for the station where the trip ended
- End Station Name: The name of the station where the trip ended
- End Station Latitude: The latitude of the station where the trip ended
- End Station Longitude: The longitude of the station where the trip ended
- Bike ID: The ID for the bike that was used in the trip
- User Type: What type of user took the trip (Subscriber or Customer)
- Year of Birth: The year that the user was born
- Gender: The gender of the user (Male - 1, Female - 2, None - 0)
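
As an illustration of the fields described above, the following sketch reads one row into typed Python values. The header names (`tripduration`, `starttime`, and so on) and the sample row are assumptions made for the example rather than values taken from a real file, and only a subset of the fields is included. The timestamp format follows the MM-DD-YYYY HH:MM:SS description in the list.

```python
import csv
from datetime import datetime
from io import StringIO

TIME_FORMAT = "%m-%d-%Y %H:%M:%S"   # MM-DD-YYYY HH:MM:SS, as described in the list
GENDER = {0: "None", 1: "Male", 2: "Female"}

# Hypothetical header names and values covering a subset of the fields
sample = StringIO(
    "tripduration,starttime,stoptime,usertype,gender\n"
    "634,06-01-2013 00:00:01,06-01-2013 00:10:35,Subscriber,1\n"
)

def parse_trip(row: dict) -> dict:
    """Convert the string fields of one CSV row into typed values."""
    return {
        "tripduration": int(row["tripduration"]),
        "starttime": datetime.strptime(row["starttime"], TIME_FORMAT),
        "stoptime": datetime.strptime(row["stoptime"], TIME_FORMAT),
        "usertype": row["usertype"],
        "gender": GENDER[int(row["gender"])],
    }

trips = [parse_trip(row) for row in csv.DictReader(sample)]

# The gap between the two timestamps should agree with the recorded duration
elapsed = (trips[0]["stoptime"] - trips[0]["starttime"]).total_seconds()
print(trips[0]["gender"], trips[0]["tripduration"], elapsed)
```

Because the fields are looked up by name through `csv.DictReader`, the sketch does not depend on the position of any column.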

<img src="./Data/Images/DatabaseDiagramW.png" width="600" height="800" align="center"/>

*Note: If you cannot see the label names, try editing the markdown (double-click the diagram) and changing the src from DatabaseDiagramW.png to DatabaseDiagramB.png.*

## Connecting to the Database

In [None]:
pip install psycopg2-binary;

In [None]:
import psycopg2

In [None]:
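# The password is read from the PGPASSWORD environment variable rather than stored in the notebook
import os

PGHOST = 'tripdatabase2.cmaaautpgbsf.us-east-2.rds.amazonaws.com'
PGDATABASE = ''
PGUSER = 'postgres'
PGPASSWORD = os.environ.get('PGPASSWORD', '')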

In [None]:
# Connect to the database and confirm the connection with a version query
try:   
    # Set up a connection to the postgres server.    
    conn = psycopg2.connect(user = PGUSER,
                            port = "5432",
                            password = PGPASSWORD,
                            host = PGHOST,
                            database = PGDATABASE)
    # Create a cursor object
    cursor = conn.cursor()   
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("Connection Success:", record,"\n")

except (Exception, psycopg2.Error) as error:
    print("Error while connecting to PostgreSQL", error)
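
Connection settings can also be gathered from environment variables before they are passed to `psycopg2.connect`. The sketch below is one way to do that. The helper name `pg_settings` and the choice of the `PG*` variable names are assumptions for illustration, not part of the original notebook.

```python
import os

def pg_settings(env=os.environ) -> dict:
    """Build keyword arguments for psycopg2.connect from a mapping (hypothetical helper)."""
    missing = [name for name in ("PGHOST", "PGPASSWORD") if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env["PGHOST"],
        "port": env.get("PGPORT", "5432"),
        "user": env.get("PGUSER", "postgres"),
        "password": env["PGPASSWORD"],
        "database": env.get("PGDATABASE", ""),
    }

# Example with an explicit mapping instead of the real environment
settings = pg_settings({"PGHOST": "localhost", "PGPASSWORD": "example"})
print(sorted(settings))

# With psycopg2 installed, the result could be used as:
# conn = psycopg2.connect(**settings)
```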

## Database Construction I - Creating the BayWheels Staging Table

In [None]:
pip install s3fs;

In [None]:
import pandas as pd
import numpy as np
import s3fs
import os
from io import StringIO
import Queries

In [None]:
# The S3 bucket that will be used to store the data should be created beforehand.
# The credentials are read from the environment rather than stored in the notebook.
ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
ACCESS_SECRET_KEY = os.environ['AWS_SECRET_ACCESS_KEY']

fs = s3fs.S3FileSystem(anon=False, key=ACCESS_KEY_ID, secret=ACCESS_SECRET_KEY)

In [None]:
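def upload_data(conn, data: pd.DataFrame, table: str) -> None:
    """Copies the rows of a DataFrame into an existing database table"""
    datastream = StringIO()
    cursor = conn.cursor()
    
    # Serialize the frame as headerless CSV so that copy_from can read it as a stream
    data.to_csv(datastream, index=False, header=False)
    datastream.seek(0)
    
    cursor.execute('rollback;')   # Clear any aborted transaction left by an earlier cell
    cursor.copy_from(datastream,table,sep=',')
    conn.commit()
    
    return None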
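
The staging functions in the following sections replace commas in station names because `copy_from` treats every comma as a field separator. An alternative, sketched here under the assumption that the upload would instead go through psycopg2's `cursor.copy_expert` with a `COPY ... WITH (FORMAT csv)` statement, is to keep the commas and let CSV quoting protect them. The helper name `to_csv_stream` and the sample rows are hypothetical, and only the stream-building half is shown so that it runs without a database.

```python
import csv
from io import StringIO

def to_csv_stream(rows) -> StringIO:
    """Write rows to an in-memory CSV stream, quoting fields that contain commas."""
    stream = StringIO()
    writer = csv.writer(stream, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    stream.seek(0)   # Rewind so a reader starts at the first row
    return stream

# Hypothetical rows; the second station name contains a comma
rows = [
    ("2020-01-01 00:00:00", "12", "Main St"),
    ("2020-01-01 00:05:00", "34", "Broadway, North Side"),
]
stream = to_csv_stream(rows)

# A CSV-aware reader recovers three fields per row despite the embedded comma
parsed = list(csv.reader(stream))
stream.seek(0)
print(parsed[1])

# With a live connection, the stream could then be passed to, for example:
# cursor.copy_expert("COPY staging.bay_trip FROM STDIN WITH (FORMAT csv)", stream)
```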

In [None]:
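staging_schema_query = """CREATE SCHEMA IF NOT EXISTS staging;"""
cursor.execute("rollback;")
cursor.execute(staging_schema_query)
conn.commit()   # Without a commit, the next "rollback;" would undo the schema creation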

In [None]:
bay_filenames = fs.ls("s3://williams-citibike/TripData/BayWheels")

In [None]:
# Tables module. One function for all the tables.
bay_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.bay_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID VARCHAR,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID VARCHAR,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL             
              );
              """
cursor.execute("rollback;")
cursor.execute(bay_staging_query)
conn.commit()

In [None]:
def populate_bay_staging(datafile: str) -> None:
    """Grabs the data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    columns = ['start_time','end_time',
               'start_station_id', 'start_station_name', 
               'start_station_latitude', 'start_station_longitude', 
               'end_station_id', 'end_station_name',
               'end_station_latitude', 'end_station_longitude']


    altcols = ['started_at','ended_at',
               'start_station_id', 'start_station_name',
               'start_lat', 'start_lng',
               'end_station_id', 'end_station_name',
               'end_lat', 'end_lng']
        
    na_fills = {'start_lat': -1,'start_lng': -1,
               'end_lat': -1, 'end_lng': -1}
    
    with fs.open("s3://"+datafile, 'r') as file:
        try:
            data = pd.read_csv(file, usecols = columns, na_values="")[columns]
        except ValueError:   # Raised by read_csv when the expected columns are missing
            file.seek(0)
            data = pd.read_csv(file, usecols = altcols, na_values="")[altcols]
            data.fillna(value=na_fills, inplace=True)
        
        #Some stations have commas in their name causing the copy_from to register extra data fields
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        upload_data(conn, data, 'staging.bay_trip')

    print(f"Finished Uploading to Bay Staging Table: {datafile}")
    return None
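
The function above discovers a file's layout by attempting one `read_csv` call and falling back to another when it fails. A sketch of a different approach, inspecting the header row first and then choosing the column list, is shown below. The helper name `pick_columns` and the sample header are assumptions for illustration.

```python
import csv
from io import StringIO

OLD_COLUMNS = ['start_time', 'end_time',
               'start_station_id', 'start_station_name',
               'start_station_latitude', 'start_station_longitude',
               'end_station_id', 'end_station_name',
               'end_station_latitude', 'end_station_longitude']

NEW_COLUMNS = ['started_at', 'ended_at',
               'start_station_id', 'start_station_name',
               'start_lat', 'start_lng',
               'end_station_id', 'end_station_name',
               'end_lat', 'end_lng']

def pick_columns(file, candidates):
    """Return the first candidate column list fully present in the file's header row."""
    header = set(next(csv.reader(file)))
    file.seek(0)   # Rewind so the file can be read again from the top
    for columns in candidates:
        if set(columns) <= header:
            return columns
    raise ValueError("No known column layout matches this file")

# Hypothetical header row in the newer naming style
sample = StringIO(
    "ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,"
    "end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual\n"
)
columns = pick_columns(sample, [OLD_COLUMNS, NEW_COLUMNS])
print(columns[:2])
```

The returned list could then be passed as `usecols` to a single `pd.read_csv` call, so that an unrelated parsing error is not mistaken for a layout mismatch.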

In [None]:
for file in bay_filenames:
    populate_bay_staging(file)

## Database Construction II - Creating the BlueBike Staging Table

In [None]:
blue_filenames = fs.ls("s3://williams-citibike/TripData/BlueBike")

In [None]:
# Tables module. One function for all the tables.
blue_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.blue_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL              
              );
              """
cursor.execute("rollback;")
cursor.execute(blue_staging_query)
conn.commit()

In [None]:
def populate_blue_staging(datafile: str) -> None:
    """Grabs the data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
      
    columns = ['starttime','stoptime',
               'start station id', 'start station name',
               'start station latitude', 'start station longitude',
               'end station id', 'end station name',
               'end station latitude', 'end station longitude']
    
    with fs.open("s3://"+datafile, 'r') as file:
        data = pd.read_csv(file, usecols=columns, na_values = "")[columns]
        
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        upload_data(conn,data,'staging.blue_trip')
    
    print(f"Finished Uploading to Blue Staging Table: {datafile}")
    return None

In [None]:
# Data starts from 2015, any data before data doesn't have location data
for file in blue_filenames[5:]:
    populate_blue_staging(file)

## Database Construction III - Creating the Capital Staging Table

In [None]:
capital_filenames = fs.ls("s3://williams-citibike/TripData/CaptialBike")

In [None]:
# Tables module. One function for all the tables.
capital_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.capital_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL              
              );
              """
cursor.execute("rollback;")
cursor.execute(capital_staging_query)
conn.commit()

In [None]:
def populate_capital_staging(datafile: str) -> None:
    """Grabs the data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    columns = ['Start date', 'End date',
               'Start station number', 'Start station',
               'End station number', 'End station']
    
    altcolumns = ['started_at','ended_at',
                  'start_station_id', 'start_station_name',
                  'start_lat', 'start_lng',
                  'end_station_id', 'end_station_name',
                  'end_lat', 'end_lng']
    
    with fs.open("s3://"+datafile, 'r') as file:
        try:   
            data = pd.read_csv(file, usecols=columns, na_values = "")[columns]
            data.insert(4,'start_lat', -1)
            data.insert(5,'start_lng',-1)

            data.insert(8,'end_lat', -1)
            data.insert(9,'end_lng',-1)
        except ValueError:   # Raised by read_csv when the expected columns are missing
            file.seek(0)
            data = pd.read_csv(file, usecols=altcolumns, na_values = "")[altcolumns]
            data.fillna({'start_station_id': -1, 'end_station_id':-1, 
                         'start_lat': -1, 'start_lng': -1,
                         'end_lat': -1, 'end_lng': -1}, inplace=True)
        
        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')

        upload_data(conn,data,'staging.capital_trip')
    
    print(f"Finished Uploading to Capital Staging Table: {datafile}")
    return None
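
The positions passed to `data.insert` above are chosen so that the six-column layout ends up in the same order as the staging table. Python's `list.insert` follows the same positional rule, so the effect can be traced on the column names alone in this small sketch.

```python
columns = ['Start date', 'End date',
           'Start station number', 'Start station',
           'End station number', 'End station']

# Same positions as the DataFrame.insert calls: each insert shifts later columns right
columns.insert(4, 'start_lat')
columns.insert(5, 'start_lng')
columns.insert(8, 'end_lat')   # Position 8 is the end of the list at this point
columns.insert(9, 'end_lng')

for position, name in enumerate(columns):
    print(position, name)
```

Positions 3 and 7 hold the station names after the inserts, which is why the comma replacement targets those two columns.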

In [None]:
for file in capital_filenames:
    populate_capital_staging(file)


## Database Construction IV - Creating the CitiBike Staging Table

In [None]:
citi_filenames = fs.ls("s3://williams-citibike/TripData/CitiBike")

Get rid of bikeID:gender

In [None]:
# Tables module. One function for all the tables.
citi_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.citi_trip (
                   tripduration NUMERIC, 
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL,
                   bikeID INTEGER,
                   usertype VARCHAR(16),
                   birthyear REAL,
                   gender SMALLINT                
              );
              """
cursor.execute("rollback;")
cursor.execute(citi_staging_query)
conn.commit()

In [None]:
def populate_citi_staging(datafile: str) -> None:
    """Grabs the data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
       
    with fs.open("s3://"+datafile, 'r') as file:
        data = pd.read_csv(file, na_values ="")   # Can't use the C engine to speed this up
        data.fillna(-1, inplace=True)   # Missing values must be numeric so that columns such as birthyear (REAL) can be loaded
        
        #Some stations have commas in their name causing the copy_from to register extra data fields
        data.iloc[:, 4] = data.iloc[:, 4].str.replace(',','_')
        data.iloc[:, 8] = data.iloc[:, 8].str.replace(',','_')
        
        data.iloc[:, 3] = data.iloc[:, 3].astype('int32')
        data.iloc[:, 7] = data.iloc[:, 7].astype('int32')
        
        upload_data(conn,data,'staging.citi_trip')
        
    print(f"Finished Uploading to Citi Staging Table: {datafile}")
    return None

In [None]:
"""
cursor.execute("rollback;")
for file in citi_filenames:
    populate_citi_staging(file)
"""

## Database Construction V - Creating the Divvy Staging Table

In [479]:
divvy_filenames = fs.ls("s3://williams-citibike/TripData/DivvyBike")

In [463]:
# Tables module. One function for all the tables.
divvy_staging_query = """
               CREATE TABLE IF NOT EXISTS staging.divvy_trip (
                   starttime TIMESTAMP,
                   endtime TIMESTAMP,
                   startID NUMERIC,
                   startname VARCHAR(128),
                   start_lat REAL,
                   start_long REAL,
                   endID NUMERIC,
                   endname VARCHAR(128),
                   end_lat REAL,
                   end_long REAL             
              );
              """
cursor.execute("rollback;")
cursor.execute(divvy_staging_query)
conn.commit()

In [464]:
def populate_divvy_staging(datafile: str) -> None:
    """Grabs the data from the s3 bucket and edits it so that it can be uploaded to the staging table
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    None:
        If executed properly the database should now have rows corresponding to the rows in the data
    """
    
    columns = ['started_at', 'ended_at',
               'start_station_id', 'start_station_name',
               'start_lat', 'start_lng',
               'end_station_id', 'end_station_name',
               'end_lat', 'end_lng']
    
    altcolumns = ['starttime', 'stoptime',
                  'from_station_id', 'from_station_name',
                  'to_station_id','to_station_name']
    
    alt3 = ['start_time', 'end_time',
            'from_station_id', 'from_station_name',
            'to_station_id','to_station_name']
    
    names = ['starttime', 'endtime','startid','startname','endid','endname']
    
    with fs.open("s3://"+datafile, 'r') as file:
        try:
            data = pd.read_csv(file, usecols=columns, na_values="", parse_dates=[0,1])[columns]
            data.fillna({'start_station_id': -1, 'end_station_id':-1, 
                         'start_lat': -1, 'start_lng': -1,
                         'end_lat': -1, 'end_lng': -1}, inplace=True)            
        except ValueError:
            file.seek(0)
            try:
                data = pd.read_csv(file, usecols=altcolumns, na_values = "", parse_dates=[0,1])[altcolumns]
                data.columns = names
            except ValueError:
                file.seek(0)
                try:
                    data = pd.read_csv(file, usecols=alt3, na_values = "", parse_dates=[0,1])[alt3]
                    data.columns = names
                except:
                    file.seek(0)
                    data = pd.read_csv(file, usecols=[1,2,5,6,7,8], na_values="", parse_dates=[0,1])
                    data.columns = names
        
            data.insert(4,'start_lat', -1)
            data.insert(5,'start_lng',-1)

            data.insert(8,'end_lat', -1)
            data.insert(9,'end_lng',-1)
            
            data.fillna({'startid': -1, 'endid':-1}, inplace=True)

        data.iloc[:, 3] = data.iloc[:, 3].str.replace(',','_')
        data.iloc[:, 7] = data.iloc[:, 7].str.replace(',','_')
        
        
        upload_data(conn,data,'staging.divvy_trip')
        
        
    print(f"Finished Uploading to Divvy Staging Table: {datafile}")
    return None

In [480]:
for file in divvy_filenames:
    populate_divvy_staging(file)

williams-citibike/TripData/DivvyBike/Divvy_Trips_2013.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2014-Q3-07.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2014-Q3-0809.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2014-Q4.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2014_Q1Q2.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2015-Q1.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2015-Q2.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2015_07.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2015_08.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2015_09.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2015_Q4.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2016_04.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2016_05.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2016_06.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2016_Q1.csv
williams-citibike/TripData/DivvyBike/Divvy_Trips_2016_Q3.csv
williams-citibike

In [None]:
altcolumns = ['starttime', 'stoptime',
              'from_station_id', 'from_station_name',
              'to_station_id','to_station_name']

names = ['starttime', 'endtime','startid','startname','endid','endname']

data = pd.read_csv('/root/Citi-Bike-Expansion/DivvyData/Divvy_Trips_2013.csv', usecols=altcolumns, parse_dates=[0,1])[altcolumns]
data.columns = names

In [None]:
populate_divvy_staging('williams-citibike/TripData/DivvyBike/Divvy_Trips_2013.csv')

## Database Construction II - Creating the Citi Trip Table

In [None]:
# Tables module
citi_trip_table_query = """
            CREATE TABLE IF NOT EXISTS citi_trip (
                starttime TIMESTAMP,
                endtime TIMESTAMP,
                tripduration NUMERIC,
                startID NUMERIC,
                endID NUMERIC,
                usertype VARCHAR(16),
                age REAL,
                gender SMALLINT
            ) PARTITION BY RANGE (starttime);
            """
cursor.execute("rollback;")
cursor.execute(citi_trip_table_query)
conn.commit()

In [None]:
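def create_partition(year: int, month: int) -> None: #Tables
    """Creates the partition of citi_trip that holds the trips starting in the given month
    
    Parameters
    ----------
    year : int
        The four digit year of the partition
    month : int
        The month of the partition (1-12)
    """
    nxt_month = month+1
    nxt_year = year   # Always the same as current year unless the month is December
    
    if month == 12:   # If December, the upper bound is January of the next year
        nxt_month = 1
        nxt_year = year+1
    
    month = str(month).zfill(2)
    nxt_month = str(nxt_month).zfill(2)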
    
    # Move this to the Tables module
    # ----- This can use Queries.execute_query(conn, partition_query)
    partition_query = f"""
            CREATE TABLE cititrip_y{year}m{month} PARTITION OF citi_trip
            FOR VALUES FROM ('{year}-{month}-01') TO ('{nxt_year}-{nxt_month}-01');
            """
    
    cursor.execute("rollback;")
    cursor.execute(partition_query)
    conn.commit()
    # --------------------------
    return None
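
The month arithmetic in `create_partition` can be checked on its own without a database. The sketch below uses `datetime.date` to compute the same lower and upper bounds. The helper names `partition_bounds` and `partition_name` are hypothetical.

```python
from datetime import date

def partition_bounds(year: int, month: int):
    """Return the (inclusive start, exclusive end) dates of a monthly partition."""
    start = date(year, month, 1)
    # December rolls over to January of the following year
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

def partition_name(year: int, month: int) -> str:
    """Mirror the naming pattern used in the partition query."""
    return f"cititrip_y{year}m{str(month).zfill(2)}"

start, end = partition_bounds(2013, 12)
print(partition_name(2013, 12), start.isoformat(), end.isoformat())
```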
    

In [None]:
yearlist13 = [2013]
monthlist13 = [6, 7, 8, 9, 10, 11, 12]

for year in yearlist13:
    for month in monthlist13:
        create_partition(year, month)

In [None]:
yearlist14_20 = [2014, 2015, 2016, 2017, 2018, 2019,2020]
monthlist14_20 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

for year in yearlist14_20:
    for month in monthlist14_20:
        create_partition(year, month)

**The following code converts tripduration from seconds to minutes and birthyear to age. On a db.t3.micro RDS instance it takes about 3.3 hours to execute.**
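
As a plain-Python sketch of the two conversions, the function below mirrors the expressions in the insert query: the duration is divided by 60 and rounded to two places, and the age is computed against the year 2020 only when the birth year is positive. The function name `convert_trip` is hypothetical, and the rounding of exact ties may differ from PostgreSQL's `ROUND`.

```python
def convert_trip(tripduration_s: float, birthyear: float, reference_year: int = 2020):
    """Return (duration in minutes, age), leaving non-positive placeholder birth years unchanged."""
    duration_min = round(tripduration_s / 60, 2)
    age = reference_year - birthyear if birthyear > 0 else birthyear
    return duration_min, age

print(convert_trip(634, 1983))
print(convert_trip(90, -1))   # A birth year of -1 marks a missing value and is passed through
```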

*Style using CSS*

In [None]:
"""
# Tables module
insert_query2 = '''
        INSERT INTO citi_trip
        SELECT DISTINCT starttime, endtime, ROUND(tripduration/60,2) as duration, startid, endid, usertype, 
               CASE WHEN birthyear > 0 THEN 2020 - birthyear
                    ELSE birthyear
                    END AS age,
               gender
          FROM staging.citi_trip
         ORDER BY starttime, endtime;
        '''

cursor.execute("rollback;")
cursor.execute(insert_query2)
conn.commit()
"""

The DISTINCT clause filters out trips that are exact duplicates. We treat exact duplicates as rows that were accidentally recorded twice: if two rows differ in even a single value, they represent different trips. For example, two friends may ride between the same stations at exactly the same time, but one may be male and the other female.

*Note: In reality, two separate trips could have exactly the same data. However, that would require two people of the same age and gender to start and stop at the same stations at exactly the same time (down to the second). Additionally, removing duplicates dropped only 0.004% of trips. So even if none of the 4,797 counted duplicates were true duplicates, we removed a minuscule amount of data from our dataset.*

*Note 2: Our trip table doesn't include the bikeid, so there is a chance that those 4,797 duplicates aren't errors. Two people of the same age and gender who start and stop at the same stations at exactly the same time (down to the second) might be on different bikes.*
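
The effect of `DISTINCT` can be seen on a toy table. The sketch below uses Python's built-in `sqlite3` module in place of PostgreSQL, with made-up rows, purely to show that only rows identical in every column collapse into one.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE trip (starttime TEXT, startid INTEGER, endid INTEGER, age INTEGER, gender INTEGER)"
)
db.executemany(
    "INSERT INTO trip VALUES (?, ?, ?, ?, ?)",
    [
        ("2020-01-01 08:00:00", 72, 79, 30, 1),
        ("2020-01-01 08:00:00", 72, 79, 30, 1),   # Exact duplicate of the first row
        ("2020-01-01 08:00:00", 72, 79, 30, 2),   # Same ride details, different gender
    ],
)

distinct_rows = db.execute("SELECT DISTINCT * FROM trip ORDER BY gender").fetchall()
print(len(distinct_rows))
```

The exact duplicate is dropped, while the row that differs only in gender is kept as a separate trip.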

## Preparing the Neighborhood Table I - Without the Spatial Data

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
# Attempt connection to the URL
HoodURL = "https://furmancenter.org/neighborhoods"
try:
    r2 = requests.get(HoodURL)
    r2.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print(errh)

In [None]:
soup = BeautifulSoup(r2.content, "html.parser")

# The website has a dropdown with all the neighborhood codes and names
hood_code_names = []

#Instead of creating a dictionary like before, we create a list of tuples so that we can make a df
for code in soup.find_all('option')[1:]:
    hood_code_names.append((code.text[:4], code.text[6:].replace("/","-").replace(" ","_")))

In [None]:
hood_df = pd.DataFrame(hood_code_names, columns=["code", "hoodname"])

In [None]:
borough = {
        "BK": "Brooklyn", 
        "BX": "Bronx",
        "MN": "Manhattan",
        "QN": "Queens",
        "SI": "Staten Island"
        }

hood_df["borough"] = hood_df["code"].str[0:2].map(borough)
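
The slicing and mapping steps above can be traced on a single string. The option text used here is a hypothetical example whose layout (a four-character code, a two-character separator, then the name) is assumed from the slice positions `[:4]` and `[6:]`; the real dropdown text may differ.

```python
BOROUGH = {"BK": "Brooklyn", "MN": "Manhattan"}   # Subset of the borough mapping, for illustration

def parse_option(text: str):
    """Split dropdown text into (code, hoodname, borough) using the same slices as the scraper."""
    code = text[:4]
    hoodname = text[6:].replace("/", "-").replace(" ", "_")
    return code, hoodname, BOROUGH.get(code[:2])

# Hypothetical option texts
print(parse_option("BK01: Greenpoint/Williamsburg"))
print(parse_option("MN03: Lower East Side/Chinatown"))
```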

In [None]:
hood_df.head()

## Preparing the Neighborhood Table II - Adding the Spatial Data

In [None]:
pip install geopandas

In [None]:
import geopandas as gpd
import shapely

In [None]:
geofile = "s3://williams-citibike/Community_Districts.geojson"

with fs.open(geofile, 'rb') as file:
    districts = gpd.read_file(file)

In [None]:
districts.head()

The Furman Center codes correspond to the values in the boro_cd column, except that boro_cd encodes the borough as a leading digit rather than a two-letter abbreviation. To recover the Furman codes seen in hood_df, the leading digit has to be mapped back to its abbreviation. Once the mapping is complete, the two dataframes can be merged together.

In [None]:
borough_num_to_abr = {
        "3": "BK", 
        "2": "BX",
        "1": "MN",
        "4": "QN",
        "5": "SI"
        }

districts["boro_cd"] = districts["boro_cd"].str[0].map(borough_num_to_abr) + districts['boro_cd'].str[1:]
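
For a single value, the conversion works as in this sketch. The helper name `to_furman_code` is hypothetical, and the sample values assume three-digit `boro_cd` strings such as `301`.

```python
BOROUGH_NUM_TO_ABR = {"3": "BK", "2": "BX", "1": "MN", "4": "QN", "5": "SI"}

def to_furman_code(boro_cd: str) -> str:
    """Swap the leading borough digit for its two-letter abbreviation."""
    return BOROUGH_NUM_TO_ABR[boro_cd[0]] + boro_cd[1:]

for value in ("301", "112", "503"):
    print(value, "->", to_furman_code(value))
```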

In [None]:
districts = districts[['boro_cd','geometry']]

In [None]:
hood_spatial = hood_df.merge(districts, left_on='code', right_on='boro_cd', how='left').loc[:,['code', 'hoodname', 'borough', 'geometry']]

In [None]:
hood_spatial.sort_values(by='code', inplace=True)

In [None]:
hood_spatial = gpd.GeoDataFrame(hood_spatial)

In [None]:
hood_spatial.head()

## Database Construction III - Creating the Neighborhood Table

In [None]:
# Tables module
neighborhood_table_query = """
        CREATE TABLE IF NOT EXISTS neighborhood (
            code CHAR(4) PRIMARY KEY,
            hoodname VARCHAR NOT NULL,
            borough VARCHAR(16) NOT NULL,
            geometry GEOGRAPHY(MULTIPOLYGON,4326) NOT NULL
        );
        """
cursor.execute("rollback;")
cursor.execute(neighborhood_table_query)
conn.commit()

In [None]:
# Replace with the new function
hoodstream = StringIO()

hood_spatial.to_csv(hoodstream,sep='\t', index=False, header=False)
hoodstream.seek(0)

cursor.copy_from(hoodstream,'neighborhood',sep='\t')
conn.commit()

## Preparing the Station Table I - Querying from the Database

In [None]:
# Endid has more distinct values than startid
# Tables module
stations_query = """
        SELECT DISTINCT ON(endid) endid, endname, end_lat, end_long 
          FROM staging.citi_trip 
         ORDER BY endid;
        """

In [None]:
stations = pd.read_sql(stations_query, conn) # Expect long execution times

In [None]:
stations_spatial = gpd.GeoDataFrame(stations, geometry=gpd.points_from_xy(stations.end_long, stations.end_lat), crs="EPSG:4326")

## Preparing the Station Table II - SJoining the Neighborhood Spatial Data

In [None]:
# The inner join will remove stations that aren't in NYC (some stations are in NJ).
# Additionally it will remove the handful of stations that didn't have information other than the ID

stations_spatial = gpd.sjoin(stations_spatial, hood_spatial, how='inner', predicate='within')   # GeoPandas releases before 0.10 use op= instead of predicate=

In [None]:
stations_spatial = stations_spatial[['endid','endname','code','geometry']].rename(columns={'endid':'stationID','endname':'name'})

In [None]:
stations_spatial['name'] = stations_spatial['name'].str.replace("'","")

In [None]:
stations_spatial.head()

## Database Construction IV - Creating the Station Table

In [None]:
# Tables module
station_table_query = """
               CREATE TABLE IF NOT EXISTS station (
                   stationID NUMERIC PRIMARY KEY,
                   name VARCHAR(64) NOT NULL,
                   code CHAR(4) NOT NULL,
                   geometry GEOGRAPHY(POINT,4326) NOT NULL
                );
                
                """
cursor.execute("rollback;")
cursor.execute(station_table_query)
conn.commit()

In [None]:
# Replace with function
stationstream = StringIO()
stations_spatial.to_csv(stationstream,sep='\t', index=False, header=False)
stationstream.seek(0)

cursor.copy_from(stationstream,'station',sep='\t')
conn.commit()

## Database Construction V - Creating the Lookup Table

In [None]:
hood_filenames = fs.ls("s3://williams-citibike/HoodData/")[1:]

In [None]:
# Tables module
lookup_table_query = """
                CREATE TABLE IF NOT EXISTS lookup(
                    alias VARCHAR(5) PRIMARY KEY,
                    indicator VARCHAR,
                    description VARCHAR
                );
                """

cursor.execute("rollback;")
cursor.execute(lookup_table_query)
conn.commit()

In [None]:
cols_lst = [2,3,4]
names_lst = ["indicator_category", "indicator", "description"]
lookup = pd.read_excel("s3://" + hood_filenames[0], sheet_name=1, usecols = cols_lst, names = names_lst)

In [None]:
lookup = lookup.sort_values(by=["indicator_category",'indicator'])

In [None]:
alias = {
    'Demographics': 'DEM',
    'Housing Market and Conditions': 'HSC',
    'Land Use and Development': 'LUD',
    'Neighborhood Services and Conditions': 'NSC',
    'Renters': 'RNT'
}

In [None]:
lookup['indicator_category'] = lookup["indicator_category"].map(alias)

In [None]:
lookup = lookup.rename(columns={'indicator_category':'alias'})

In [None]:
indicator_group_order = lookup.groupby("alias").cumcount()+1

In [None]:
lookup['alias'] = lookup['alias'] + indicator_group_order.astype(str)
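
The `groupby("alias").cumcount()+1` step numbers the indicators within each category, so each alias comes out as the category prefix followed by a running count. A plain-Python sketch of the same numbering, using made-up indicator names, is:

```python
from collections import defaultdict

ALIAS = {'Demographics': 'DEM', 'Renters': 'RNT'}   # Subset of the alias mapping, for illustration

# Hypothetical (category, indicator) pairs, already sorted as in the lookup table
rows = [
    ('Demographics', 'Indicator A'),
    ('Demographics', 'Indicator B'),
    ('Renters', 'Indicator C'),
]

counts = defaultdict(int)
aliases = []
for category, _indicator in rows:
    counts[category] += 1   # Equivalent to cumcount() + 1 within the category
    aliases.append(f"{ALIAS[category]}{counts[category]}")

print(aliases)
```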

In [None]:
# replace with function
lookupstream = StringIO()

lookup.to_csv(lookupstream,sep='\t', index=False, header=False)
lookupstream.seek(0)

cursor.copy_from(lookupstream,'lookup',sep='\t')
conn.commit()

## Creating the Neighborhood Profile Table

In [None]:
def flatten_hooddata(datafile: str) -> pd.DataFrame:
    """Grabs the data from the s3 bucket and flattens it to a single row consisting of the neighborhood attributes
    
    Parameters
    ----------
    datafile : str
        The name of a file in the s3 bucket without the s3:// prefix

    Returns
    -------
    pd.DataFrame:
        A single row DataFrame that contains the attributes of the neighborhood
    """
    cols_lst = [0,2,3,8]
    names_lst = ["code", "indicator category", "indicator", "2018"]

    # This function is a mess
    
    with fs.open("s3://"+datafile, 'rb') as file:
        data = pd.read_excel(file, sheet_name=1, usecols = cols_lst, names = names_lst)
       
        #In the previous section we did all the alias work, now we can simply input it into the df from lookup['alias']
        data = data.sort_values(by=['indicator category','indicator'])
        data.insert(1, 'alias', lookup['alias'])
        data = data.drop(columns = ['indicator category', 'indicator'])

        # Prep the '2018' column so that it can be used as the value argument in the pivot_table 
        data['2018'] = data['2018'].str.replace('$', "", regex=False)   # regex=False treats '$' as a literal character
        data['2018'] = data['2018'].str.replace(',', "", regex=False)

        # Values that are percents get turned into decimals
        for index, value in data['2018'].items():
            if isinstance(value,str):
                if value.endswith('%'):
                    data.loc[index, '2018'] = float(value.strip('%')) / 100

        data['2018'] = pd.to_numeric(data['2018'])

        # The pivot_table alphabetizes the columns, but we want to maintain the original order
        column_order = ['code'] + list(data['alias'])

        data = data.pivot_table(index=['code'],values='2018', columns='alias', dropna=False)
        data = data.rename_axis(None, axis=1).reset_index()   # The pivot creates an unnecessary column axis
        data['code'] = data['code'][0].replace(" ","")
        data = data.reindex(column_order, axis=1)

    return data
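
The cleaning rules applied to the `2018` column can be summarized per value: strip `$` and `,`, and divide percentages by 100. The sketch below applies them to individual strings. The helper name `clean_value` and the sample inputs are assumptions for illustration.

```python
def clean_value(value):
    """Convert strings such as '$1,234' or '45.5%' to floats; pass numbers through."""
    if not isinstance(value, str):
        return float(value)
    text = value.replace('$', '').replace(',', '')
    if text.endswith('%'):
        return float(text.strip('%')) / 100   # Percentages become decimals
    return float(text)

for raw in ('$1,234', '45.5%', '0.82', 7):
    print(raw, '->', clean_value(raw))
```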

In [None]:
# This only works successfully if there are those specific neighborhood excel files in the HoodData folder
# DataFrame.append was removed in pandas 2.0, so the flattened rows are combined with pd.concat
hood_profile = pd.concat([flatten_hooddata(hood) for hood in hood_filenames])

In [None]:
hood_profile = hood_profile.dropna(axis=1, how='all')

In [None]:
hood_profile = hood_profile.fillna(-1)   # We need to fill NaN with -1 so they can be put into the database

## Database Construction VI - Importing the Neighborhood Profiles into Database

In [None]:
# Tables Module
profile_table_query = """
                CREATE TABLE IF NOT EXISTS profile(
                );
                """
cursor.execute("rollback;")
cursor.execute(profile_table_query)
conn.commit()

In [None]:
for name in hood_profile.columns:
    if name == 'code':
        import_column_query = f"""
                    ALTER TABLE profile
                    ADD COLUMN {name} CHAR(4) PRIMARY KEY;
                    """
    else:
        import_column_query = f"""
                    ALTER TABLE profile
                    ADD COLUMN {name} REAL;
                    """
        
    cursor.execute("rollback;")
    cursor.execute(import_column_query)
    conn.commit()

In [None]:
# Can use the function
profilestream = StringIO()

hood_profile.to_csv(profilestream,sep='\t', index=False, header=False)
profilestream.seek(0)

cursor.copy_from(profilestream,'profile',sep='\t')
conn.commit()

## Database Construction VII - Purging the Database: Removing Trips that aren't Contained in NYC

When the neighborhood data was inner joined to the station data, the stations that were not in NYC were dropped. Although those stations were removed from the station table, the trip table still contains trips that use them. In this section the goal is to remove the trips that are not fully contained within NYC. 

*Note: "Not in NYC" is defined as a trip either starting or ending at a station that is not in NYC.*
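
One way to express this rule in SQL is sketched below with Python's built-in `sqlite3` module and made-up station IDs. It is not the implementation of `Queries.deleteNJTrips`, which is not shown in this notebook; it only illustrates deleting trips whose start or end station is missing from the station table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE station (stationid INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE trip (startid INTEGER, endid INTEGER)")
db.executemany("INSERT INTO station VALUES (?)", [(72,), (79,)])
db.executemany(
    "INSERT INTO trip VALUES (?, ?)",
    [
        (72, 79),      # Both stations are in the station table: kept
        (72, 3183),    # Ends at a station that was dropped: removed
        (3183, 79),    # Starts at a station that was dropped: removed
    ],
)

db.execute(
    """
    DELETE FROM trip
     WHERE startid NOT IN (SELECT stationid FROM station)
        OR endid NOT IN (SELECT stationid FROM station)
    """
)
remaining = db.execute("SELECT startid, endid FROM trip").fetchall()
print(remaining)
```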

**Before we drop the trips that involve New Jersey (NJ), let's see how much of the market share NJ is gathering over time.**

*Note: There are other important questions that could be asked about the NJ data, however, this project is focused on NYC data. For now, more complex NJ based questions are out of scope.*

In [None]:
import Queries # This is actually going to be the Analyze module in the Queries package

In [None]:
# Counts the number of trips per year
all_trips_df = Queries.countYearlyTrips(conn)    # Query-0001 in file # How to use the context manager in the function

In [None]:
NJ_trips_df = Queries.countYearlyNJTrips(conn)   # Query-0002 in file

In [None]:
market_share = NJ_trips_df.merge(all_trips_df, on='year',suffixes=['_nj','_all'])

In [None]:
market_share['nj_percent'] = round(market_share['trips_nj'] / market_share['trips_all'], 4)* 100
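
With made-up counts, the calculation works out as in this plain-Python sketch. The numbers are illustrative only and are not results from the database.

```python
# Hypothetical yearly trip counts
all_trips = {2016: 1000, 2017: 2000}
nj_trips = {2016: 25, 2017: 80}

# Same arithmetic as the cell above: round the ratio to four places, then scale to percent
nj_percent = {
    year: round(nj_trips[year] / all_trips[year], 4) * 100
    for year in nj_trips
}
print(nj_percent)
```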

In [None]:
market_share # Diagram

In [None]:
# Deleting the NJ data
Queries.deleteNJTrips(conn)