## Connecting to the Database

In [1]:
pip install psycopg2-binary;

Collecting psycopg2-binary
  Using cached psycopg2_binary-2.8.6-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.8.6
Note: you may need to restart the kernel to use updated packages.


In [2]:
import psycopg2

In [3]:
# Put the password in 
PGHOST = 'tripdatabase.cmaaautpgbsf.us-east-2.rds.amazonaws.com'
PGDATABASE = ''
PGUSER = 'postgres'
PGPASSWORD = 'Josh1234'

In [4]:
# Open a connection to the database and confirm it works
try:   
    # Set up a connection to the postgres server.    
    conn = psycopg2.connect(user = PGUSER,
                            port = "5432",
                            password = PGPASSWORD,
                            host = PGHOST,
                            database = PGDATABASE)
    # Create a cursor object
    cursor = conn.cursor()   
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("Connection Success:", record,"\n")

except psycopg2.Error as error:
    print("Error while connecting to PostgreSQL", error)

Connection Success: ('PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit',) 
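
Hardcoding credentials in a notebook cell makes them easy to leak. Below is a minimal sketch of one alternative, assuming the settings are exported as environment variables named `PGHOST`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`, and `PGPORT`. The helper name `load_pg_settings` and the fallback defaults are hypothetical and not part of this project.

```python
import os

def load_pg_settings(env=None):
    """Collect connection keyword arguments from environment-style variables."""
    env = os.environ if env is None else env
    return {
        "host": env["PGHOST"],
        # Fallback values below are assumptions for illustration only.
        "dbname": env.get("PGDATABASE", "postgres"),
        "user": env.get("PGUSER", "postgres"),
        "password": env["PGPASSWORD"],
        "port": int(env.get("PGPORT", "5432")),
    }

# Example with a stand-in mapping; real values would come from os.environ.
settings = load_pg_settings({"PGHOST": "localhost", "PGPASSWORD": "example"})
print(sorted(settings))
```

The resulting dictionary could then be passed along as `psycopg2.connect(**settings)`, which keeps the password out of the notebook itself.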



## Trip Table Cleaning I: Removing Outliers - Trip Duration
We will be using a basic definition of an outlier in this project: any value that is more than 3 standard deviations away from the mean. The trip table has 111M rows, so unfortunately we cannot directly take the mean and std of the tripduration column because the query would take hours to execute. To overcome this, we will use a sampling distribution to estimate the population mean and std.
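
As a rough illustration of the sampling idea, the sketch below draws repeated random samples from a synthetic, right-skewed population and averages the per-sample statistics. The lognormal population, the sample sizes, and all variable names are assumptions made for the example; the real notebook samples rows from the trip table instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for the tripduration column (minutes).
population = rng.lognormal(mean=2.3, sigma=0.8, size=200_000)

# Draw several random samples and record each sample's mean and std.
sample_means, sample_stds = [], []
for _ in range(10):
    sample = rng.choice(population, size=20_000, replace=False)
    sample_means.append(sample.mean())
    sample_stds.append(sample.std(ddof=1))

# Average the per-sample statistics to estimate the population values.
est_mean = float(np.mean(sample_means))
est_std = float(np.mean(sample_stds))
print(est_mean, est_std)
```

On this synthetic data the estimates land close to the full-population values at a fraction of the cost, which is the property the notebook relies on.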

In [5]:
# Possible Question: Evolution of ride length over time

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import Queries
import importlib
importlib.reload(Queries)

<module 'Queries' from '/root/Citi-Bike-Expansion/Queries.py'>

To get a feeling for how the trip duration values are distributed, let's sample 1M rows (10 draws of 100K rows each).

In [50]:
sample_data = Queries.get_random_100k_rows(conn,shuffles=10)   # Get random samples 100K rows at a time

In [14]:
sample_data['tripduration'].describe()

count    1000000.000000
mean          16.173446
std          216.786921
min            1.000000
25%            6.000000
50%           10.000000
75%           18.000000
max       134941.000000
Name: tripduration, dtype: float64

The mean trip duration is about 16m, but the standard deviation is about 217m, which is over 3.5 hours. Additionally, the 75th percentile of our data is 18m. It is clear that a small number of extremely long trips are skewing our data. To put it in perspective, if we use our original definition of an outlier, then only trips longer than roughly 11 hours (16m + 3 × 217m ≈ 667m) would be considered outliers. Because the data is so heavily skewed, the standard deviation is unreliable for finding outliers.
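
The cutoff implied by the 3-standard-deviation rule can be checked directly from the `describe()` output above. This is only a worked calculation with the printed sample statistics; the variable names are illustrative.

```python
# Values copied from the describe() output above (minutes).
mean_minutes = 16.173446
std_minutes = 216.786921

# Upper cutoff under the "mean + 3 standard deviations" rule.
cutoff_minutes = mean_minutes + 3 * std_minutes
cutoff_hours = cutoff_minutes / 60
print(round(cutoff_minutes, 1), round(cutoff_hours, 1))
```

For this sample the cutoff comes out to about 667 minutes, or roughly 11 hours, far above the 18m that covers three quarters of all trips.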

Our original outlier identification idea was based on the empirical rule, which states that 99.7% of data following a normal distribution lies within 3 standard deviations of the mean. Since our standard deviation is unreliable, we will borrow the underlying idea behind the empirical rule but use quantiles instead, since they are far less affected by extreme values. Using quantiles, 99% of our data lies between the 0.5th and the 99.5th percentile. Anything above or below those percentile values will be considered an outlier.

In [15]:
sample_data.tripduration.quantile(0.005), sample_data.tripduration.quantile(0.995)

(1.0, 94.0)
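
A minimal sketch of how quantile cutoffs like these could be turned into a filter is shown below. The helper `quantile_bounds` and the tiny example frame are hypothetical; the actual deletion in this notebook is done in SQL by `Queries.delete_duration_outliers`, whose implementation is not shown here.

```python
import pandas as pd

def quantile_bounds(values, lower=0.005, upper=0.995):
    """Return the (low, high) cutoffs that keep the central share of the data."""
    return values.quantile(lower), values.quantile(upper)

# Tiny illustrative frame; the real sample has 1M rows.
trips = pd.DataFrame({"tripduration": [1, 5, 6, 8, 10, 12, 18, 25, 40, 5000]})

# Wider tails than 0.5%/99.5% so the effect is visible on ten rows.
low, high = quantile_bounds(trips["tripduration"], lower=0.05, upper=0.95)
kept = trips[trips["tripduration"].between(low, high)]
print(low, high, len(kept))
```

With only ten rows the example uses the 5th and 95th percentiles; on the full sample the default 0.5th and 99.5th percentiles would apply.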

Things to Mention
- Trips slightly above 1.5 hrs aren't necessarily invalid. However, the trips most likely to be invalid are the ones above 1.5 hrs. Why? Citibike isn't a service targeted at long-distance rides, which can be seen in its pricing model. The annual membership allows unlimited rides capped at 45m each. The day pass allows unlimited rides capped at 30m each. And a single ride costs 3 dollars for 30m. With these financial thresholds in place, it makes sense that the majority of trips fall beneath them. In fact, there is no incentive for a person to keep a bike out significantly longer than the threshold. If you need a bike for 90m, then as an annual rider you can park the bike at 45m and take another bike for the remaining 45m without paying the extra 6.75 dollars (unless you rode somewhere with no stations to park at, or the stations have very few bikes and you don't want to risk your bike being taken and being left stranded).
- The data is 'sensitive' to quantile changes. For example, the 99.5th percentile is 94m, the 99th is 62m, and the 98th is 46m.
- *Although only one sample of 1M rows is shown here, no matter how many times you sample from all the trips, the lower quantile is 1 and the upper quantile is close to 96. I omitted the repeated samples because sampling takes a long time to execute.*

In [9]:
Queries.delete_duration_outliers(conn)

## Trip Table Cleaning II: Removing Time Errors - Start Time After End Time

Any trip with a starttime later than its endtime is an error and will be deleted. It is possible that the two values were simply swapped, but fixing the swap costs more than it is worth, so it's better to just remove these rows. The swap would involve these operations:
- Find the incorrect values
- Move them to a temporary table with the correct order
- Delete them from the original table
- Reinsert them 
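
For intuition, the same find-and-reorder logic can be sketched on a small pandas DataFrame. This is an in-memory analogue with made-up rows, not the SQL that would run against the trip table, and it collapses the temporary-table steps into a single reassignment.

```python
import pandas as pd

trips = pd.DataFrame({
    "starttime": pd.to_datetime(["2019-01-01 10:00", "2019-01-01 12:30"]),
    "endtime": pd.to_datetime(["2019-01-01 10:15", "2019-01-01 12:05"]),
})

# Step 1: find rows whose start is later than their end.
swapped = trips["starttime"] > trips["endtime"]

# Remaining steps collapsed: rebuild both columns in the correct order.
fixed = trips.assign(
    starttime=trips["starttime"].where(~swapped, trips["endtime"]),
    endtime=trips["endtime"].where(~swapped, trips["starttime"]),
)
print(int(swapped.sum()), bool((fixed["starttime"] <= fixed["endtime"]).all()))
```

On a 100M+ row table the equivalent SQL would have to scan, copy, delete, and reinsert, which is where the cost comes from.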

In [74]:
data = Queries.find_time_swaps(conn)

In [75]:
data.shape

(201, 8)

The first step returns barely over 200 incorrectly timed trips, so continuing to the other steps isn't worth it for so little data. It's less expensive to delete them.

In [12]:
Queries.delete_time_swaps(conn)

## Trip Table Cleaning III: Handling Outliers - Speed
In this section we are going to handle trips whose speeds (MPH) are physically unlikely. Luckily, we have a reference from CitiBike on what a speed outlier might look like. According to them, their pedal-assisted e-bikes can go up to 18 MPH. In the sport of cycling, although not on a pedal-assisted bike, a reasonably experienced cyclist can reach over 19 MPH. Rounding up to a nicer whole number, we'll use 20 MPH as a conservative cutoff, and anything over that will be considered an outlier.

In [93]:
importlib.reload(Queries)
sample_data = Queries.get_random_100k_rows(conn, speed=True, distance=True, shuffles=10)

We've already taken the "delete data" path multiple times in this project: we deleted all the NJ trips, the trip duration outliers, and the invalid swaps. Even after all those deletes we still have plenty of data (down from 111M to 105M rows), but let's make an effort to conserve it. We are going to cap the speed at 20 MPH: for any trip with a speed over 20 MPH, we adjust the trip duration so that the speed works out to exactly 20 MPH. With distance in miles and duration in minutes:

$$\frac{Distance}{\frac{Duration}{60}} = 20 \text{ MPH}$$

$$Duration = \frac{60 \times Distance}{20} = 3 \times Distance$$
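
The capping rule can be sketched in pandas as follows. The two example rows and the constant name `MAX_MPH` are illustrative assumptions; the first row mirrors a 4.39-mile, 10-minute trip like the ones flagged in the sample.

```python
import pandas as pd

MAX_MPH = 20

# Two rows shaped like the sample: distance in miles, tripduration in minutes.
trips = pd.DataFrame({"distance": [4.39, 1.00], "tripduration": [10.0, 12.0]})
trips["MPH"] = trips["distance"] / (trips["tripduration"] / 60)

# For trips faster than the cap, lengthen the duration to 60 * distance / 20.
too_fast = trips["MPH"] > MAX_MPH
trips.loc[too_fast, "tripduration"] = 60 * trips.loc[too_fast, "distance"] / MAX_MPH

# Recompute speed from the adjusted durations.
trips["MPH"] = trips["distance"] / (trips["tripduration"] / 60)
print(trips)
```

Only the duration of the over-limit trip changes; trips already at or below 20 MPH are left untouched.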

In [97]:
sample_data[(sample_data.distance ==0) & (sample_data.tripduration < 5)] # any round trip under 5m is getting updated

Unnamed: 0,starttime,endtime,tripduration,startid,endid,usertype,age,gender,distance,MPH
31,2019-02-10 20:55:32.444,2019-02-10 20:59:46.870,4.0,150,150,Subscriber,29.0,1,0.0,0.0
252,2018-09-14 20:13:27.138,2018-09-14 20:14:40.253,1.0,531,531,Subscriber,50.0,1,0.0,0.0
479,2014-04-17 15:05:20.000,2014-04-17 15:07:29.000,2.0,341,341,Subscriber,36.0,2,0.0,0.0
491,2020-06-04 11:01:29.255,2020-06-04 11:03:02.550,1.0,284,284,Subscriber,54.0,1,0.0,0.0
512,2016-03-11 07:37:25.000,2016-03-11 07:39:06.000,1.0,3145,3145,Subscriber,32.0,1,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
999552,2019-06-15 12:07:14.364,2019-06-15 12:08:49.307,1.0,297,297,Subscriber,36.0,2,0.0,0.0
999589,2016-07-12 07:07:47.000,2016-07-12 07:09:40.000,1.0,3245,3245,Subscriber,28.0,1,0.0,0.0
999662,2019-09-05 16:40:04.670,2019-09-05 16:41:17.527,1.0,3664,3664,Subscriber,41.0,1,0.0,0.0
999667,2020-11-08 18:36:31.729,2020-11-08 18:38:36.529,2.0,3922,3922,Subscriber,27.0,1,0.0,0.0


In [101]:
sample_data[sample_data.MPH > 20] # cap speed and change duration

Unnamed: 0,starttime,endtime,tripduration,startid,endid,usertype,age,gender,distance,MPH
390,2013-06-15 22:36:19,2013-06-15 22:46:40,10.0,72,415,Customer,-1.0,0,4.39,26.340000
839,2013-07-23 16:19:23,2013-07-23 16:22:31,3.0,303,458,Subscriber,33.0,1,1.94,38.800000
992,2013-07-11 18:34:47,2013-07-11 18:43:09,8.0,285,323,Subscriber,52.0,1,2.92,21.900000
2336,2013-12-05 10:42:53,2013-12-05 10:48:27,5.0,319,505,Subscriber,55.0,1,2.69,32.280000
3202,2014-01-15 16:25:19,2014-01-15 16:34:16,8.0,268,450,Subscriber,36.0,2,3.04,22.800000
...,...,...,...,...,...,...,...,...,...,...
998037,2014-01-05 01:42:53,2014-01-05 01:45:40,2.0,469,379,Subscriber,35.0,1,1.09,32.700000
998555,2013-10-04 18:00:47,2013-10-04 18:03:59,3.0,458,345,Subscriber,43.0,1,1.11,22.200000
998958,2013-06-25 13:58:39,2013-06-25 14:05:42,7.0,483,419,Subscriber,42.0,1,2.64,22.628571
999090,2013-11-11 14:33:41,2013-11-11 14:41:44,8.0,243,510,Subscriber,55.0,1,4.99,37.425000


We dealt with outliers on the upper end of speed. Now let's deal with values on the lower end of speed. The lower end is tricky because of round trips, where a rider starts and ends at the same station. For those trips the distance is calculated as 0, which results in a speed of zero. To determine whether a round trip was valid, we have to look at the trip's duration and make some assumptions. Assumption 1: A valid round trip has a minimum distance of 0.5 miles (0.25 each way). Assumption 2: The rider was traveling at the mean value of 6 MPH.
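
The minimum duration implied by these two assumptions is a one-line calculation (time = distance / speed, converted to minutes). The constant names below are illustrative only.

```python
MIN_ROUND_TRIP_MILES = 0.5   # Assumption 1: 0.25 miles each way
ASSUMED_SPEED_MPH = 6        # Assumption 2: mean riding speed

# time = distance / speed, expressed in minutes
min_duration_minutes = 60 * MIN_ROUND_TRIP_MILES / ASSUMED_SPEED_MPH
print(min_duration_minutes)
```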

With these assumptions then a valid round trip should have a minimum trip duration of 5 minutes. As before we are going to 