## Connecting to the Database

In [1]:
%pip install psycopg2-binary

Note: you may need to restart the kernel to use updated packages.


In [2]:
import psycopg2

In [3]:
# Connection settings: put the password in PGPASSWORD below
PGHOST = 'tripdatabase2.cmaaautpgbsf.us-east-2.rds.amazonaws.com'
PGDATABASE = ''
PGUSER = 'postgres'
PGPASSWORD = 'Josh1234'

In [4]:
# Connect to the database and verify the connection with a test query
try:   
    # Set up a connection to the postgres server.    
    conn = psycopg2.connect(user = PGUSER,
                            port = "5432",
                            password = PGPASSWORD,
                            host = PGHOST,
                            database = PGDATABASE)
    # Create a cursor object
    cursor = conn.cursor()   
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("Connection Success:", record,"\n")

except (Exception, psycopg2.Error) as error:
    print("Error while connecting to PostgreSQL", error)

Connection Success: ('PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit',) 



## Trip Table Cleaning I: Removing Outliers - Trip Duration
Before we remove outliers, we have to define what an outlier is. First, let's get a feel for how the trip duration values are distributed. The trip tables range from 7M to 111M trips, which means we can't query the entire table every time we want to analyze something. For each table we will take a sample of 1M rows. Our sampling needs to fulfill three criteria:
- The sampling needs to be random
- The sampling needs to be large enough
- The query needs to be decently fast 

### The Sampling Procedure

Unfortunately, PostgreSQL's Bernoulli sampling, which would be ideal because it meets the randomness criterion, is far too slow for our needs. System sampling is fast enough but isn't truly random, because it picks whole blocks of rows rather than individual rows. To overcome these limitations, we will use System sampling for speed and then randomize which of the sampled rows we keep. The sampling process is as follows:
- System sample 1% of the data from a table
- Order the results of that sample randomly (ORDER BY RANDOM())
- Take the first 20,000 rows (100,000 for CitiBike)
- Repeat the previous steps 50 times (10 for CitiBike)

*Note: because each repetition samples independently, the same row can appear more than once in the final sample. The more trips a service has, the less likely duplicates are.*
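
The steps above can be sketched roughly as follows. The implementation of `Queries.get_random_rows` is not shown in this notebook, so the helper names, the `SELECT *` column list, and the table name `bay_trip` are assumptions made purely for illustration.

```python
def build_sample_query(table, percent=1, limit=20000):
    """Build one batch query: SYSTEM-sample `percent`% of the table,
    shuffle the sampled rows, and keep the first `limit` of them."""
    # Table names cannot be passed as bound parameters, so only
    # format in trusted, hard-coded names here.
    return (
        f"SELECT * FROM {table} "
        f"TABLESAMPLE SYSTEM ({percent}) "
        f"ORDER BY RANDOM() "
        f"LIMIT {limit};"
    )


def batches_needed(samples, limit):
    """Number of times the batch query has to be repeated."""
    return -(-samples // limit)  # ceiling division


# Hypothetical table name; 20,000 rows per batch for a 1M-row sample.
query = build_sample_query("bay_trip", percent=1, limit=20000)
n_batches = batches_needed(1_000_000, 20000)
print(query)
print(n_batches)
```

Each batch would then be executed `n_batches` times and the results concatenated into a single DataFrame.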

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import Queries
import importlib
import time

In [6]:
importlib.reload(Queries) # Delete this for the publish

<module 'Queries' from '/root/Citi-Bike-Expansion/Queries.py'>

### Samples

In [7]:
services = ['bay', 'blue', 'capital', 'citi', 'divvy']

In [9]:
bay_sample = Queries.get_random_rows(conn, 'bay', samples = 1000000) 

In [10]:
blue_sample = Queries.get_random_rows(conn, 'blue', samples = 1000000) 

In [11]:
capital_sample = Queries.get_random_rows(conn, 'capital', samples = 1000000) 

In [12]:
citi_sample = Queries.get_random_rows(conn, 'citi', samples = 1000000) 

In [13]:
divvy_sample = Queries.get_random_rows(conn, 'divvy', samples = 1000000) 

### Sampling Statistics

In [32]:
pd.DataFrame([bay_sample.duration.describe(), 
              blue_sample.duration.describe(),
              capital_sample.duration.describe(),
              citi_sample.duration.describe(),
              divvy_sample.duration.describe()
             ], index = ['bay','blue','capital','citi','divvy'])

| service | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| bay | 1000000.0 | 15.894836 | 179.677429 | -28611.720703 | 6.12 | 9.89 | 15.84 | 104612.75 |
| blue | 1000000.0 | 27.167665 | 694.015686 | -49.380001 | 6.98 | 11.8 | 19.77 | 320467.15625 |
| capital | 1000000.0 | 19.211084 | 270.017792 | -28950.429688 | 6.7 | 11.48 | 19.58 | 117703.226562 |
| citi | 1000000.0 | 16.648346 | 167.44841 | -50.5 | 6.3 | 10.6 | 18.450001 | 73588.578125 |
| divvy | 1000000.0 | 19.962648 | 234.937729 | -48.0 | 7.0 | 12.0 | 20.98 | 120301.382812 |


All of our trip duration distributions look the same: trip duration is drastically skewed to the right. Because of the skew, the mean and standard deviation aren't reliable measures for outlier detection. The standard deviations range from about 2.8 hours to 11.6 hours, yet the medians are all 12 minutes or less and the 75th percentiles all sit at or below roughly 21 minutes.

Quantiles look like a better basis for defining outliers, since they are more representative of how people actually ride. However, each service has its own distribution, so there isn't a one-size-fits-all quantile to use. The 99th percentile may be 66m for one service but 160m for another. So the question becomes: ***Is there a universal duration cutoff value that can be used for all bike share services?***

To find the answer to that question, we first need to assess whether 'Bike Share Behavior' is universal. ***Do people who use Bike Share services behave the same way, regardless of the service?***

### Universal Behavior I - Trip Duration Permutation Test
**Note: We are making a major assumption that the sample median is representative of the population median**
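
The notebook does not show the test itself, so the following is only a minimal sketch of a generic two-sample permutation test on the difference of medians. The function name, the number of permutations, and the toy data are all assumptions; the real analysis would pass in two services' `duration` samples.

```python
import random
import statistics


def median_permutation_test(a, b, n_permutations=1000, seed=0):
    """Two-sided permutation test for a difference in medians.

    Returns the observed absolute difference and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(statistics.median(a) - statistics.median(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = abs(
            statistics.median(pooled[:n_a]) - statistics.median(pooled[n_a:])
        )
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy data (hypothetical durations in minutes), not real trip samples.
toy_a = list(range(50))
toy_b = list(range(100, 150))
observed_diff, p_value = median_permutation_test(toy_a, toy_b)
print(observed_diff, p_value)
```

A small p-value suggests the two medians differ by more than relabeling the trips at random would explain.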

## Trip Table Cleaning II: Removing Time Errors - Start Time After End Time

Any trip with a starttime later than its endtime is an error and will be deleted. It is possible that the two values were simply swapped, but fixing the swap costs more than these trips are worth, and they make up such a small share of the data that it's better to just remove them. Swapping them back would require the following operations:
- Find the incorrect values
- Move them to a temporary table with the correct order
- Delete them from the original table
- Reinsert them 

In [None]:
for service in services:
    Queries.delete_time_swaps(conn, service)
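
`Queries.delete_time_swaps` is not shown here, so the following is only a sketch of the kind of delete it presumably issues. To keep it runnable without the database, it uses an in-memory SQLite table instead of PostgreSQL; the table name `trip_demo` and the sample rows are hypothetical, while the `starttime` and `endtime` column names follow the text above.

```python
import sqlite3

# In-memory stand-in for one service's trip table (hypothetical data).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE trip_demo (id INTEGER, starttime TEXT, endtime TEXT)")
demo.executemany(
    "INSERT INTO trip_demo VALUES (?, ?, ?)",
    [
        (1, "2020-01-01 08:00:00", "2020-01-01 08:15:00"),  # valid
        (2, "2020-01-01 09:30:00", "2020-01-01 09:10:00"),  # start after end
        (3, "2020-01-02 17:00:00", "2020-01-02 17:05:00"),  # valid
    ],
)

# ISO-8601 timestamps stored as text compare chronologically.
deleted = demo.execute(
    "DELETE FROM trip_demo WHERE starttime > endtime"
).rowcount
demo.commit()

remaining = [row[0] for row in demo.execute("SELECT id FROM trip_demo ORDER BY id")]
print(deleted, remaining)
```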

## Trip Table Cleaning III: Handling Outliers - Speed
In this section we are going to modify trips whose speeds (MPH) are physically unlikely. Luckily, the services themselves give us a reference for what a speed outlier might look like: according to them, their pedal-assisted e-bikes can go up to 18 MPH. In the sport of cycling, although not on a pedal-assisted bike, a reasonably experienced cyclist can reach over 19 MPH. Rounding up to a nicer whole number, we'll use 20 MPH as a conservative cutoff, and anything over that will be considered an outlier.

We've already taken the "delete data" path multiple times in this project: we deleted all the NJ trips, the trip duration outliers, and the invalid swaps. Although we still have tons of data after all those deletes (111M down to 105M), let's make an effort to conserve data. To handle speed outliers, we are going to cap the speed at 20 MPH: for any trip with a speed over 20 MPH, we adjust the trip duration so that the recalculated speed comes out to exactly 20 MPH. With distance in miles and duration in minutes:

$$\frac{Distance}{\frac{Duration}{60}} = 20$$

$$\frac{60 \times Distance}{20} = 3 \times Distance = Duration$$

Based on the formula, we find the trips whose speed is above 20 MPH, set the speed to 20 MPH, and update tripduration to 3 times the distance. **Before we can do any of this we need to actually create the distance and speed columns.** The 'trip' table doesn't have them, so they have to be computed. It's more efficient to recreate the 'trip' table from scratch with the new columns than to update the preexisting table. The new table with the distance and speed columns will be called 'trip_ds'.
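
As a quick check of the arithmetic, here is a small sketch of the capping rule applied to single trips. The function names and example trips are illustrative only, and distance is assumed to be in miles and duration in minutes.

```python
def speed_mph(distance_miles, duration_minutes):
    """Average speed in MPH for a trip with a positive duration."""
    return distance_miles * 60 / duration_minutes


def cap_speed(distance_miles, duration_minutes, max_mph=20):
    """Return (duration, speed) after capping the speed at max_mph.

    Trips above the cap keep their distance; the duration is stretched
    so that the recomputed speed equals max_mph.
    """
    if speed_mph(distance_miles, duration_minutes) > max_mph:
        duration_minutes = 60 * distance_miles / max_mph  # 3 * distance at 20 MPH
    return duration_minutes, speed_mph(distance_miles, duration_minutes)


fast = cap_speed(2.0, 4.0)   # 2 miles in 4 minutes is 30 MPH, so it gets capped
slow = cap_speed(2.0, 12.0)  # 2 miles in 12 minutes is 10 MPH, left unchanged
print(fast, slow)
```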

In [None]:
Queries.recreate_trip()

In [None]:
speed_outliers_query = """
            UPDATE trip_ds
               SET speed = 20,
                   tripduration = distance * 3
             WHERE speed > 20;
             """
Queries.execute_query(conn,speed_outliers_query)

We dealt with outliers on the upper end of speed. Now let's deal with values on the lower end. The lower end is tricky because of round trips, where a rider starts and ends at the same station. Those trips' distances are calculated as zero miles, which results in a speed of zero as well. To determine whether a round trip was valid, we have to take the duration of the trip into consideration. To fix these outliers, two assumptions will be made.
- Assumption 1: A valid round trip has a minimum distance of 0.5 miles (0.25 each way). 
- Assumption 2: The rider was traveling at the mean speed value of 6 MPH **(mean after filtering out the 0 MPH trips)**

Combining both assumptions, a valid round trip should have a minimum trip duration of 5 minutes (0.5 miles at 6 MPH). For any trip that has a distance of 0 and a tripduration of less than 5 minutes, we will set the tripduration to 5, the speed to 6 MPH, and the distance to 0.5 miles.
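
The 5-minute figure follows directly from the two assumptions, as this short sketch shows. The function names are hypothetical and only mirror the rule described above.

```python
def min_round_trip_minutes(min_distance_miles=0.5, assumed_speed_mph=6):
    """Shortest plausible round trip under the two assumptions."""
    return 60 * min_distance_miles / assumed_speed_mph


def fix_round_trip(distance, tripduration, speed):
    """Apply the round-trip rule to one trip; other trips pass through."""
    threshold = min_round_trip_minutes()
    if distance == 0 and tripduration < threshold:
        return 0.5, threshold, 6
    return distance, tripduration, speed


threshold = min_round_trip_minutes()
short_round_trip = fix_round_trip(0, 2, 0)   # too short, gets adjusted
long_round_trip = fix_round_trip(0, 30, 0)   # long enough, left as-is
print(threshold, short_round_trip, long_round_trip)
```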

In [None]:
roundtrip_outliers_query = """
                UPDATE trip_ds
                   SET tripduration = 5,
                       distance = 0.5,
                       speed = 6
                 WHERE distance = 0 
                   AND tripduration < 5;
                 """
Queries.execute_query(conn, roundtrip_outliers_query)

## Station Table Update I: Station Status
The set of stations in the system is always changing. CitiBike adds new stations, removes stations, and occasionally moves stations to nearby locations. In this section we are going to add two new columns to the 'station' table called birth and death. The birth column represents the date of the first trip taken from the station, and the death column represents the date of the last trip taken from the station. A station is considered dead if no trip was taken from it within the last month of the data. Any station that is still active will have a null value in the death column.
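
To make the birth and death definitions concrete, here is a minimal pure-Python sketch on made-up trips. `Queries.birth_certificate` is not shown in this notebook, so the function name, the 31-day window used for "the last month", and the sample data are all assumptions.

```python
from datetime import datetime, timedelta


def station_lifespans(trips, data_end, inactive_after=timedelta(days=31)):
    """Map station_id -> (birth, death) from (station_id, start_time) pairs.

    death is None when the station's last trip falls within
    `inactive_after` of the end of the data, i.e. it is still active.
    """
    first, last = {}, {}
    for station_id, start in trips:
        if station_id not in first or start < first[station_id]:
            first[station_id] = start
        if station_id not in last or start > last[station_id]:
            last[station_id] = start
    lifespans = {}
    for station_id in first:
        still_active = data_end - last[station_id] <= inactive_after
        death = None if still_active else last[station_id]
        lifespans[station_id] = (first[station_id], death)
    return lifespans


# Hypothetical trips: station "A" is still in use, station "B" went quiet.
sample_trips = [
    ("A", datetime(2019, 5, 1, 8, 0)),
    ("A", datetime(2020, 12, 20, 18, 30)),
    ("B", datetime(2018, 3, 10, 9, 15)),
    ("B", datetime(2020, 6, 15, 12, 0)),
]
lifespans = station_lifespans(sample_trips, data_end=datetime(2020, 12, 31, 23, 59, 59))
print(lifespans)
```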

In [None]:
add_columns_query = """
            ALTER TABLE station
            ADD COLUMN birth TIMESTAMP,
            ADD COLUMN death TIMESTAMP;
            """
Queries.execute_query(conn, add_columns_query)

In [None]:
Queries.birth_certificate(conn)

In [None]:
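set_death_null = """
        -- Stations whose last trip falls in December 2020 are still active.
        -- death is a TIMESTAMP, so a half-open range is used to include
        -- trips that happened during the day on Dec 31.
        UPDATE station
           SET death = NULL
         WHERE death >= '2020-12-01'
           AND death < '2021-01-01';
        """
Queries.execute_query(conn, set_death_null)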