## Connecting to the Database

In [1]:
pip install psycopg2-binary;

Collecting psycopg2-binary
  Using cached psycopg2_binary-2.8.6-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.8.6
Note: you may need to restart the kernel to use updated packages.


In [2]:
import psycopg2

In [3]:
# Connection settings: fill in the database name and password before running
PGHOST = 'tripdatabase.cmaaautpgbsf.us-east-2.rds.amazonaws.com'
PGDATABASE = ''
PGUSER = 'postgres'
PGPASSWORD = 'Josh1234'

In [4]:
try:   
    # Set up a connection to the postgres server.    
    conn = psycopg2.connect(user = PGUSER,
                            port = "5432",
                            password = PGPASSWORD,
                            host = PGHOST,
                            database = PGDATABASE)
    # Create a cursor object
    cursor = conn.cursor()   
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("Connection Success:", record,"\n")

except psycopg2.Error as error:
    print("Error while connecting to PostgreSQL", error)

Connection Success: ('PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit',) 
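
Hardcoding credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the settings are exported as environment variables before the kernel starts. The helper name `load_pg_settings` is hypothetical, and the variable names simply mirror the constants defined above:

```python
import os

def load_pg_settings(env=os.environ):
    """Collect connection settings from environment variables (hypothetical helper)."""
    return {
        "host": env["PGHOST"],                  # required; raises KeyError if missing
        "database": env.get("PGDATABASE", ""),  # optional, empty as in the cell above
        "user": env.get("PGUSER", "postgres"),
        "password": env["PGPASSWORD"],          # required; never stored in the notebook
        "port": env.get("PGPORT", "5432"),
    }

# Usage (needs a reachable server, so it is not run here):
# conn = psycopg2.connect(**load_pg_settings())
```

The keyword names match the ones passed to `psycopg2.connect` in the cell above, so the dictionary can be unpacked straight into the call.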



## Creating the Indexes

In the previous notebook, the data was partitioned by year-month to speed up queries. To reduce query execution time further, we will create indexes on both the startid and endid columns of the trip table.

In [5]:
startid_idx_query = "CREATE INDEX idx_trip_startid ON trip(startid);"
conn.rollback()  # clear any aborted transaction left by an earlier failed statement
cursor.execute(startid_idx_query)
conn.commit()

DuplicateTable: relation "idx_trip_startid" already exists


In [None]:
endid_idx_query = "CREATE INDEX idx_trip_endid ON trip(endid)"
conn.rollback()  # clear any aborted transaction left by an earlier failed statement
cursor.execute(endid_idx_query)
conn.commit()
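
The `DuplicateTable` error shown earlier appears because the index already existed from a previous run. One way to make these cells safe to re-run is `CREATE INDEX IF NOT EXISTS`, which PostgreSQL supports from version 9.5. A minimal sketch, where `ensure_index` is a hypothetical helper and an in-memory SQLite database stands in for the PostgreSQL connection only so the example runs on its own (SQLite accepts the same clause):

```python
import sqlite3

def ensure_index(conn, table, column):
    """Create idx_<table>_<column> if it does not exist yet (hypothetical helper).

    The table and column names are trusted constants here, not user input.
    """
    name = f"idx_{table}_{column}"
    cur = conn.cursor()
    cur.execute(f"CREATE INDEX IF NOT EXISTS {name} ON {table}({column});")
    conn.commit()
    return name

# Stand-in database so the sketch is self-contained.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE trip (startid INTEGER, endid INTEGER);")

for col in ("startid", "endid"):
    ensure_index(demo, "trip", col)
    ensure_index(demo, "trip", col)  # second call is a no-op instead of an error

index_names = sorted(
    row[0]
    for row in demo.execute("SELECT name FROM sqlite_master WHERE type = 'index';")
)
print(index_names)
```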

## Cleaning the Database I - Removing Trips that aren't Contained in NYC

In the previous notebook, when the neighborhood data was inner joined to the station data, the stations that were not in NYC were dropped. Although those stations were removed from the station table, the trip table still contains trips that start or end at them. The goal of this section is to remove the trips that are not fully contained within NYC.

*Note: A trip is considered not in NYC if it either starts or ends at a station that is not in NYC.*

**Before we drop the trips that involve New Jersey (NJ), let's see how NJ's market share has changed over time.**

*Note: There are other important questions that could be asked about the NJ data; however, this project is focused on NYC data. For now, more complex NJ-based questions are out of scope.*

In [6]:
import pandas as pd
import Queries

In [7]:
# Counts the number of trips per year
colnames, data = Queries.countYearlyTrips(conn)    # Query-0001 in file
all_trips_df = pd.DataFrame(data, columns=colnames)

In [8]:
colnames, data = Queries.countYearlyNJTrips(conn)   # Query-0002 in file
NJ_trips_df = pd.DataFrame(data, columns=colnames)
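
The `Queries` module is not shown in this notebook, so the exact SQL behind these two calls is unknown. As a rough sketch of the "not in NYC" definition, the query below counts, per year, the trips that start or end at a station missing from the station table. The `starttime` column and the sample rows are hypothetical, and SQLite stands in for PostgreSQL so the example runs on its own (in PostgreSQL the year would more likely come from something like `EXTRACT(YEAR FROM starttime)`):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE station (stationid INTEGER PRIMARY KEY);
    CREATE TABLE trip (startid INTEGER, endid INTEGER, starttime TEXT);
    INSERT INTO station VALUES (1), (2);
    INSERT INTO trip VALUES
        (1, 2, '2019-05-01'),
        (1, 99, '2019-06-01'),
        (98, 2, '2020-01-15'),
        (2, 1, '2020-02-20');
""")

# Stations 98 and 99 are absent from the station table, so the two trips
# that touch them count as "not in NYC".
outside_nyc_per_year = demo.execute("""
    SELECT strftime('%Y', starttime) AS year, COUNT(*) AS trips
    FROM trip
    WHERE startid NOT IN (SELECT stationid FROM station)
       OR endid   NOT IN (SELECT stationid FROM station)
    GROUP BY year
    ORDER BY year;
""").fetchall()
print(outside_nyc_per_year)
```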

In [12]:
market_share = NJ_trips_df.merge(all_trips_df, on='year',suffixes=['_nj','_all'])

In [13]:
# Share of all trips that involve NJ, as a percentage rounded to two decimals
market_share['nj_percent'] = (market_share['trips_nj'] / market_share['trips_all'] * 100).round(2)

In [14]:
market_share

| | year | trips_nj | trips_all | nj_percent |
|---|---|---|---|---|
| 0 | 2013.0 | 67094 | 5614874 | 1.19 |
| 1 | 2014.0 | 55277 | 8081195 | 0.68 |
| 2 | 2015.0 | 183470 | 9936604 | 1.85 |
| 3 | 2016.0 | 618237 | 13843677 | 4.47 |
| 4 | 2017.0 | 948903 | 16363238 | 5.8 |
| 5 | 2018.0 | 1010194 | 17548339 | 5.76 |
| 6 | 2019.0 | 1128002 | 20551697 | 5.49 |
| 7 | 2020.0 | 1086898 | 16681224 | 6.52 |


In [None]:
# Delete the NJ trips, i.e. every trip that starts or ends at a station outside NYC
Queries.deleteNJTrips(conn)
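
The body of `deleteNJTrips` is not shown here. A minimal sketch of what such a delete could look like, assuming it removes every trip whose start or end station is missing from the station table. The sample rows are hypothetical, and SQLite again stands in for PostgreSQL so the example is self-contained:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE station (stationid INTEGER PRIMARY KEY);
    CREATE TABLE trip (startid INTEGER, endid INTEGER);
    INSERT INTO station VALUES (1), (2);
    INSERT INTO trip VALUES (1, 2), (1, 99), (98, 2), (2, 1);
""")

# Remove trips that touch a station no longer present in the station table.
cur = demo.execute("""
    DELETE FROM trip
    WHERE startid NOT IN (SELECT stationid FROM station)
       OR endid   NOT IN (SELECT stationid FROM station);
""")
deleted = cur.rowcount  # number of rows removed by the DELETE
demo.commit()

remaining = demo.execute(
    "SELECT startid, endid FROM trip ORDER BY startid;"
).fetchall()
print(deleted, remaining)
```

Checking the deleted row count against the NJ totals computed earlier is a cheap sanity check before committing a destructive statement.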

## Creating the Foreign Keys

Up to this point, the tables in the database have not been connected to each other through relationships. With the NJ trips removed, every remaining trip should reference a station that still exists in the station table, so we can now put foreign keys in place.

- The startID and endID columns in the Trip table reference the PK stationID column in the Station table
- The code column (neighborhood code) in the Station table references the PK code column in the Neighborhood table
- The PK code column in the Profile table references the PK code column in the Neighborhood table

In [None]:
fk_startID_query = """
        ALTER TABLE trip 
        ADD CONSTRAINT fk_start_stationID 
        FOREIGN KEY (startid) 
        REFERENCES station(stationid);
        """

conn.rollback()  # clear any aborted transaction left by an earlier failed statement
cursor.execute(fk_startID_query)
conn.commit()

In [10]:
fk_endID_query = """
        ALTER TABLE trip 
        ADD CONSTRAINT fk_end_stationID 
        FOREIGN KEY (endid) 
        REFERENCES station(stationid);
        """

conn.rollback()  # clear any aborted transaction left by an earlier failed statement
cursor.execute(fk_endID_query)
conn.commit()

In [11]:
fk_code_station_hood_query = """
        ALTER TABLE station
        ADD CONSTRAINT fk_stationCode_hoodCode
        FOREIGN KEY (code)
        REFERENCES neighborhood(code);
        """

conn.rollback()  # clear any aborted transaction left by an earlier failed statement
cursor.execute(fk_code_station_hood_query)
conn.commit()

In [12]:
fk_code_profile_hood_query = """
        ALTER TABLE profile
        ADD CONSTRAINT fk_profileCode_hoodCode
        FOREIGN KEY (code)
        REFERENCES neighborhood(code);
        """
conn.rollback()  # clear any aborted transaction left by an earlier failed statement
cursor.execute(fk_code_profile_hood_query)
conn.commit()
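
The four cells above follow the same pattern, so the statements could also be generated from a small list of relationships. A sketch under that assumption: the table, column, and constraint names are copied from the cells above, while `FOREIGN_KEYS`, `fk_statement`, and `add_foreign_keys` are hypothetical names. `add_foreign_keys` is defined but not called here because it needs a live connection.

```python
# (table, column, referenced table, referenced column, constraint name)
FOREIGN_KEYS = [
    ("trip", "startid", "station", "stationid", "fk_start_stationID"),
    ("trip", "endid", "station", "stationid", "fk_end_stationID"),
    ("station", "code", "neighborhood", "code", "fk_stationCode_hoodCode"),
    ("profile", "code", "neighborhood", "code", "fk_profileCode_hoodCode"),
]

def fk_statement(table, column, ref_table, ref_column, name):
    """Build one ALTER TABLE ... ADD CONSTRAINT statement (hypothetical helper)."""
    return (
        f"ALTER TABLE {table} "
        f"ADD CONSTRAINT {name} "
        f"FOREIGN KEY ({column}) "
        f"REFERENCES {ref_table}({ref_column});"
    )

def add_foreign_keys(conn):
    """Run all statements in one transaction; roll back if any of them fails."""
    cur = conn.cursor()
    try:
        for fk in FOREIGN_KEYS:
            cur.execute(fk_statement(*fk))
        conn.commit()
    except Exception:
        conn.rollback()
        raise

statements = [fk_statement(*fk) for fk in FOREIGN_KEYS]
print(statements[0])
```

Running the statements in a single transaction means a failure on one constraint leaves none of them half-applied.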