## Connecting to the Database

In [1]:
pip install psycopg2-binary;

Note: you may need to restart the kernel to use updated packages.


In [2]:
import psycopg2

In [3]:
# Put the password in PGPASSWORD
PGHOST = 'tripdatabase2.cmaaautpgbsf.us-east-2.rds.amazonaws.com'
PGDATABASE = ''
PGUSER = 'postgres'
PGPASSWORD = 'Josh1234'
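
Hardcoding a password in a notebook makes it easy to publish by accident. As a minimal sketch of one alternative, the settings could be read from environment variables instead. The helper name `get_db_settings` and the fallback defaults are assumptions for illustration; the variable names `PGHOST`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`, and `PGPORT` are the standard ones used by PostgreSQL client tools.

```python
import os

def get_db_settings(env=os.environ):
    """Collect connection settings from environment variables (hypothetical helper)."""
    return {
        "host": env.get("PGHOST", "localhost"),
        "database": env.get("PGDATABASE", ""),
        "user": env.get("PGUSER", "postgres"),
        # No default for the password: raise KeyError if it is missing.
        "password": env["PGPASSWORD"],
        "port": env.get("PGPORT", "5432"),
    }

# Demonstration with a stand-in mapping instead of the real environment.
settings = get_db_settings({"PGPASSWORD": "example-only"})
print(settings["user"], settings["port"])
```

The resulting dictionary could then be unpacked into `psycopg2.connect(**settings)`.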

In [4]:
# Connect to the database and confirm the connection with a version query
try:   
    # Set up a connection to the postgres server.    
    conn = psycopg2.connect(user = PGUSER,
                            port = "5432",
                            password = PGPASSWORD,
                            host = PGHOST,
                            database = PGDATABASE)
    # Create a cursor object
    cursor = conn.cursor()   
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("Connection Success:", record,"\n")

except (Exception, psycopg2.Error) as error:
    print("Error while connecting to PostgreSQL", error)

Connection Success: ('PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit',) 



## Trip Table Cleaning I: Handling Outliers - Speed
In this section we are going to modify trips whose speeds (MPH) are physically unlikely. Luckily, the bike share services themselves give us a reference for what a speed outlier might look like: according to them, their pedal-assisted e-bikes can go up to 18 MPH. In the sport of cycling, although not on a pedal-assisted bike, a reasonably experienced cyclist can reach over 19 MPH. Rounding up to a whole number, we'll use 20 MPH as a conservative cutoff, and anything over that will be considered an outlier.

We've already taken the "delete data" path multiple times in this project. Although there are still millions of rows left after all those deletes, let's make an effort to conserve data. To handle speed outliers, we are going to cap the speed at 20 MPH: for any trip with a speed over 20 MPH, we adjust the trip duration so that the recalculated speed is exactly 20 MPH. With distance in miles and duration in minutes:

$$\frac{Distance}{\frac{Duration}{60}} = 20$$

$$\frac{60 \times Distance}{20} = Duration$$

$$3 \times Distance = Duration$$
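
The derivation above can be checked on a single trip with a small Python sketch. The function name `cap_speed` and the example values are assumptions for illustration; the function mirrors the rule of setting the speed to 20 MPH and the duration to three times the distance, and assumes a duration greater than zero.

```python
def cap_speed(distance_miles, duration_minutes, max_mph=20.0):
    """Return (speed_mph, duration_minutes) with the speed capped at max_mph."""
    speed = distance_miles / (duration_minutes / 60)
    if speed > max_mph:
        # Stretch the duration so that distance / (duration / 60) == max_mph.
        duration_minutes = 60 * distance_miles / max_mph
        speed = max_mph
    return speed, duration_minutes

# A 5-mile trip logged as 10 minutes is too fast, so its duration is stretched.
print(cap_speed(5.0, 10.0))
# A 2-mile trip logged as 12 minutes is under the cap and is left alone.
print(cap_speed(2.0, 12.0))
```

With the default cap of 20 MPH, the adjusted duration is always `3 * distance_miles`, which is what the `UPDATE` statement applies in SQL.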

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import Queries
import importlib
import time

In [6]:
importlib.reload(Queries) # Delete this for the publish

<module 'Queries' from '/root/Citi-Bike-Expansion/Queries.py'>

In [7]:
services = ['bay', 'blue', 'capital', 'citi', 'divvy']

In [11]:
for service in services:
    speed_outliers_query = f"""
            UPDATE trips.{service}_trip
               SET speed = 20,
                   duration = distance * 3
             WHERE speed > 20;
             """
    
    Queries.execute_query(conn,speed_outliers_query)

We were able to cap the speed column on the upper end because we are fairly certain that the bikes can't go over 20 MPH, but what about the lower end? Specifically, what about trips with a speed of 0 MPH, which are the result of round trips? A round trip occurs when a rider starts and ends at the same station, resulting in a distance of zero miles and therefore a speed of zero. We deal with these almost the same way as the upper end: we set the speed to the average of 6 MPH. This time, instead of solving for the duration, we solve for the distance.
$$\frac{Distance}{\frac{Duration}{60}} = 6$$

$$Distance = \frac{6 \times Duration}{60}$$

$$Distance = \frac{Duration}{10}$$



***Note on Average Speed:*** *The average was computed after the 20 MPH adjustment and with the 0 MPH trips excluded; it came out to 6 MPH for every bike service.*
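
As a quick check of the round-trip rule, here is a minimal Python sketch for a single trip. The function name `impute_round_trip` and the example values are assumptions for illustration; the function assumes a duration in minutes greater than zero.

```python
def impute_round_trip(distance_miles, duration_minutes, avg_mph=6.0):
    """Return (speed_mph, distance_miles), estimating the distance of round trips."""
    if distance_miles == 0:
        # Solve distance / (duration / 60) == avg_mph for the distance.
        return avg_mph, avg_mph * duration_minutes / 60
    return distance_miles / (duration_minutes / 60), distance_miles

# A 30-minute round trip is assigned 6 MPH and a distance of duration / 10.
print(impute_round_trip(0, 30))
```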

In [12]:
for service in services:
    roundtrip_outliers_query = f"""
                UPDATE trips.{service}_trip
                   SET speed = 6,
                       distance = duration / 10
                 WHERE distance = 0;
                 """
    Queries.execute_query(conn, roundtrip_outliers_query)

## Trip Table Cleaning II: Removing Time Errors - Start Time After End Time

Any trip with a starttime later than its endtime is an error and will be deleted. It is possible that the two values were simply swapped, but fixing the swap costs more than the trips are worth, and so few rows are affected that it is simpler to remove them. Repairing the swap would require the following operations:
- Find the incorrect values
- Move them to a temporary table with the correct order
- Delete them from the original table
- Reinsert them 
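
Deleting is a single statement by comparison. The body of `Queries.delete_time_swaps` is not shown here, so the following is only a sketch of what such a query could look like. The helper name `build_delete_time_swaps_query` is hypothetical, and the `endtime` column name is assumed from the prose above.

```python
SERVICES = {'bay', 'blue', 'capital', 'citi', 'divvy'}

def build_delete_time_swaps_query(service):
    """Build a DELETE for trips whose start is after their end (hypothetical helper)."""
    # Table names cannot be bound as query parameters, so validate against a whitelist.
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    return f"""
        DELETE FROM trips.{service}_trip
         WHERE starttime > endtime;
    """

print(build_delete_time_swaps_query('bay'))
```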

In [None]:
for service in services:
    Queries.delete_time_swaps(conn, service)

## Trip Table Cleaning III: Removing Outliers - Trip Duration
Before we remove outliers, we have to define what an outlier is. First let's get a feeling for how the trip duration values are distributed. The trip tables range from 7M to 111M trips, which means we can't query the entire table every time we want to analyze something. For each table we will take a sample of 1M rows. Our sampling needs to fulfill three criteria:
- The sampling needs to be random
- The sampling needs to be large enough
- The query needs to be decently fast 

### The Sampling Procedure

Unfortunately, Bernoulli sampling in PostgreSQL meets the randomness condition but is far too slow for our needs, because it scans the whole table. System sampling is fast enough but isn't truly random, because it picks whole storage blocks and so returns rows in clusters. To overcome these limitations, we use System sampling for speed and then randomize the selection of rows. The sampling process is as follows:
- System sample 1% of the data from a table
- Order the results of that sample randomly (ORDER BY RANDOM())
- Take the first 20,000 rows (100,000 for CitiBike)
- Repeat the previous steps 50 times (10 for CitiBike)

*Note: the sample may contain duplicate rows, since the same row can be drawn in more than one repetition. The more trips a service has, the less likely repeats are.*
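
The steps above can be sketched as a query builder. The body of `Queries.get_random_rows` is not shown here, so the helper names `build_sample_query` and `sampling_plan` are hypothetical and only illustrate the procedure. `TABLESAMPLE SYSTEM` is PostgreSQL's block-level sampling clause.

```python
def build_sample_query(service, batch_rows=20_000, percent=1):
    """One sampling step: system-sample the table, shuffle, keep batch_rows rows."""
    return (
        f"SELECT * FROM trips.{service}_trip "
        f"TABLESAMPLE SYSTEM ({percent}) "
        f"ORDER BY RANDOM() LIMIT {batch_rows};"
    )

def sampling_plan(service, total_rows=1_000_000):
    """List the queries needed to collect total_rows rows for one service."""
    # CitiBike is the largest table, so it uses bigger batches and fewer repeats.
    batch_rows = 100_000 if service == 'citi' else 20_000
    repeats = total_rows // batch_rows
    return [build_sample_query(service, batch_rows) for _ in range(repeats)]

print(len(sampling_plan('bay')), len(sampling_plan('citi')))
print(sampling_plan('bay')[0])
```

Each query in the plan would be executed separately and the results concatenated, which is why the same row can appear more than once.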

### Samples

In [8]:
bay_sample = Queries.get_random_rows(conn, 'bay', samples = 1000000) 

In [9]:
blue_sample = Queries.get_random_rows(conn, 'blue', samples = 1000000) 

In [10]:
capital_sample = Queries.get_random_rows(conn, 'capital', samples = 1000000) 

In [11]:
citi_sample = Queries.get_random_rows(conn, 'citi', samples = 1000000) 

In [12]:
divvy_sample = Queries.get_random_rows(conn, 'divvy', samples = 1000000) 

### Sampling Statistics

In [13]:
pd.DataFrame([bay_sample.duration.describe(), 
              blue_sample.duration.describe(),
              capital_sample.duration.describe(),
              citi_sample.duration.describe(),
              divvy_sample.duration.describe()
             ], index = ['bay','blue','capital','citi','divvy'])

| service | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| bay | 1000000.0 | 15.991732 | 186.108139 | 0.02 | 6.12 | 9.9 | 15.82 | 90228.796875 |
| blue | 1000000.0 | 26.624014 | 752.10498 | 1.02 | 6.98 | 11.8 | 19.82 | 392706.40625 |
| capital | 1000000.0 | 19.65163 | 338.997864 | 0.02 | 6.68 | 11.43 | 19.549999 | 214555.421875 |
| citi | 1000000.0 | 16.872683 | 179.344574 | 1.0 | 6.35 | 10.63 | 18.469999 | 46147.011719 |
| divvy | 1000000.0 | 20.709711 | 365.92218 | 0.02 | 7.0 | 12.0 | 21.0 | 192769.125 |


All of our duration distributions look the same: trip duration is drastically skewed to the right. Because of the skew, the mean and standard deviation aren't reliable measures for outlier detection. In these samples the standard deviations range from about 3 hours to about 12.5 hours, yet the median values are all 12 minutes or less and only one 75th percentile (Divvy's, at 21 minutes) exceeds the 20-minute mark.
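
To see why a handful of extreme trips distorts the mean and standard deviation but not the median, consider this small sketch. The durations are made-up values for illustration, not rows from the trip tables.

```python
import statistics

# Durations in minutes; the last value mimics a bike that was never docked.
durations = [6, 8, 10, 12, 16, 20, 90000]

print("mean:  ", statistics.mean(durations))
print("median:", statistics.median(durations))
print("stdev: ", statistics.stdev(durations))
```

A single extreme value pulls the mean far above every typical trip, while the median stays at a typical trip length.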

Quantiles appear to be a better measure for determining outliers and seem more representative of real life. However, each service has its own distribution, so there isn't a one-size-fits-all quantile to use: the 99th percentile may be 66m for one service but 160m for another. So the question becomes: ***Is there a universal duration cutoff value that can be used for all bike share services?***

To find the answer to that question, we first need to assess whether "Bike Share Behavior" is universal. ***Do people who use Bike Share services behave the same way, regardless of the service?***

### Bike Share Behavior I - Duration Statistical Tests

When we ask whether riders of bike share services behave the same way, what we are asking is whether the underlying distributions are the same across all the services. We are looking at one dependent variable (duration) across five different services, with distributions that aren't normal. Given those three characteristics, we can use a Kruskal-Wallis test to see if the underlying distributions are the same.

- The Null Hypothesis: $H_0:$ The samples all originate from the same distribution and have the same median values
- The Alternative Hypothesis: $H_1:$ At least one sample originates from a different distribution and has a different median
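
To make the test less of a black box, here is a minimal sketch of how the H statistic is built from ranks. The function name `kruskal_h` and the toy groups are assumptions for illustration. The sketch assumes no tied values, whereas `scipy.stats.kruskal` also applies a tie correction, so it is not a substitute for the library call.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic, assuming no tied values across the groups."""
    pooled = sorted(value for group in groups for value in group)
    # Rank 1 goes to the smallest pooled value (valid only without ties).
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_total = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    rank_term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# Clearly separated groups give a large H; interleaved groups give a small one.
h_separated = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
h_interleaved = kruskal_h([1, 4, 7], [2, 5, 8], [3, 6, 9])
print(h_separated, h_interleaved)
```

Under the null hypothesis, H approximately follows a chi-square distribution with one fewer degree of freedom than the number of groups, which is where the p-value comes from.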

### Kruskal-Wallis H Test

In [24]:
from scipy import stats

In [27]:
stats.kruskal(bay_sample.duration, blue_sample.duration, capital_sample.duration, citi_sample.duration, divvy_sample.duration)

KruskalResult(statistic=42476.35929180635, pvalue=0.0)

With a p-value reported as 0.0 (too small to represent in floating point), there is enough evidence, at any conventional significance level, to reject the null hypothesis that the samples all originate from the same distribution and have the same median values.

### Pairwise Mann-Whitney U Test
From the Kruskal-Wallis test, we only know that at least one sample originates from a different distribution; we don't know how the bike services compare pairwise. In this section we will compare each pair of bike share services with a Mann-Whitney U test. The null hypothesis for each pair is the same as in the Kruskal-Wallis test: both samples originate from the same distribution.
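
Five services give ten unordered pairs. As a small sketch, the pairs can be enumerated with `itertools.combinations` rather than written out by hand; the variable name `pairs` is an assumption for illustration.

```python
from itertools import combinations

services = ['bay', 'blue', 'capital', 'citi', 'divvy']

# Every unordered pair of services, in the same order as the comparisons below.
pairs = list(combinations(services, 2))
for first, second in pairs:
    print(f"{first} vs {second}")
```

Each pair could then be passed to `stats.mannwhitneyu` with the two corresponding sample columns.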

BayWheels Comparisons

In [35]:
stats.mannwhitneyu(bay_sample.duration, blue_sample.duration), \
stats.mannwhitneyu(bay_sample.duration, capital_sample.duration), \
stats.mannwhitneyu(bay_sample.duration, citi_sample.duration), \
stats.mannwhitneyu(bay_sample.duration, divvy_sample.duration)

(MannwhitneyuResult(statistic=432718288960.5, pvalue=0.0),
 MannwhitneyuResult(statistic=444568916420.5, pvalue=0.0),
 MannwhitneyuResult(statistic=469393838548.0, pvalue=0.0),
 MannwhitneyuResult(statistic=427988173571.0, pvalue=0.0))

BlueBike Comparisons

In [36]:
stats.mannwhitneyu(blue_sample.duration, capital_sample.duration), \
stats.mannwhitneyu(blue_sample.duration, citi_sample.duration), \
stats.mannwhitneyu(blue_sample.duration, divvy_sample.duration)

(MannwhitneyuResult(statistic=489600684066.5, pvalue=1.962683214484485e-143),
 MannwhitneyuResult(statistic=464290382392.0, pvalue=0.0),
 MannwhitneyuResult(statistic=493846914443.0, pvalue=1.2376110963922516e-51))

CapitalBike Comparisons

In [37]:
stats.mannwhitneyu(capital_sample.duration, citi_sample.duration), \
stats.mannwhitneyu(capital_sample.duration, divvy_sample.duration)

(MannwhitneyuResult(statistic=475259769860.5, pvalue=0.0),
 MannwhitneyuResult(statistic=483904096488.5, pvalue=0.0))

CitiBike Comparisons

In [38]:
stats.mannwhitneyu(citi_sample.duration, divvy_sample.duration)

MannwhitneyuResult(statistic=458944167547.5, pvalue=0.0)

Every null hypothesis was rejected: no two bike share service samples appear to come from the same underlying distribution.

## Station Table Update I: Station Status
The ecosystem of stations in a bike share service is always changing. Services add new stations, remove stations, and occasionally move stations to nearby locations. In this section we are going to add two new columns, birth and death, to each station table in the stations schema.

The birth column will represent the date of the first trip that was taken from the station. The death column represents the date of the last trip that was taken from the station. Stations are considered dead if there wasn't a trip within the last month of 2020. **Any station that is still active will have a null value in the death column.**
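
The birth and death rule can be illustrated on a handful of trips in plain Python. The function name `station_lifespans`, the station ids, and the dates are made up for illustration; the actual update is done in SQL.

```python
from datetime import date

def station_lifespans(trips, cutoff=date(2020, 12, 1)):
    """trips: iterable of (station_id, trip_date). Returns {station_id: (birth, death)}."""
    first, last = {}, {}
    for station_id, day in trips:
        first[station_id] = min(first.get(station_id, day), day)
        last[station_id] = max(last.get(station_id, day), day)
    # A station with a trip on or after the cutoff is still active (death is None).
    return {s: (first[s], None if last[s] >= cutoff else last[s]) for s in first}

trips = [
    ('A', date(2015, 5, 1)), ('A', date(2020, 12, 15)),  # still active
    ('B', date(2016, 3, 10)), ('B', date(2019, 7, 4)),   # no trips after mid-2019
]
lifespans = station_lifespans(trips)
print(lifespans)
```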

In [13]:
for service in services:
    add_columns_query = f"""
            ALTER TABLE stations.{service}_station
            ADD COLUMN birth TIMESTAMP,
            ADD COLUMN death TIMESTAMP;
            """
    Queries.execute_query(conn, add_columns_query)

Once again the Bay Wheels data requires a format that is slightly different from the one the function expects, so its query is written out by hand.

In [21]:
birth_certificate_query = """
                WITH timestamps AS (
                    SELECT DISTINCT startid, 
                                    MIN(DATE_TRUNC('day',starttime)::date) over w AS birth, 
                                    MAX(DATE_TRUNC('day',starttime)::date) over w AS death
                      FROM trips.bay_trip
                    WINDOW w as (PARTITION BY startid)
                )
            
                UPDATE stations.bay_station AS s
                   SET birth = ts.birth,
                       death = ts.death
                  FROM timestamps AS ts
                 WHERE s.stationid = ts.startid;
                """  
Queries.execute_query(conn, birth_certificate_query)

In [17]:
Queries.birth_certificate(conn, 'blue')

In [18]:
Queries.birth_certificate(conn, 'capital')

In [19]:
Queries.birth_certificate(conn, 'citi')

In [20]:
Queries.birth_certificate(conn, 'divvy')

In [22]:
for service in services:
    set_death_null = f"""
        UPDATE stations.{service}_station
           SET death = NULL
         WHERE death <= '2020-12-31'
           AND death >= '2020-12-01';
        """
    Queries.execute_query(conn, set_death_null)