### Idea: Test whether the duration of a Citi Bike ride is related to the time of day (night or day)

### Null Hypothesis: Citi Bike rides at night are longer in duration than rides during the day
### Alternative Hypothesis: Citi Bike rides at night are shorter than or the same duration as rides during the day

In [1]:
import pylab as pl
import pandas as pd
import numpy as np
import os
import json
import scipy.stats
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
os.environ["PUIDATA"] = "{}/PUIdata".format(os.getenv("HOME"))
puidata = os.environ["PUIDATA"]

In [3]:
!curl -O https://s3.amazonaws.com/tripdata/201710-citibike-tripdata.csv.zip

!unzip 201710-citibike-tripdata.csv.zip -d $PUIDATA

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64.4M  100 64.4M    0     0  54.0M      0  0:00:01  0:00:01 --:--:-- 54.0M
Archive:  201710-citibike-tripdata.csv.zip
replace /nfshome/cb4102/PUIdata/201710-citibike-tripdata.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [4]:
citi_1710 = pd.read_csv(os.getenv("PUIDATA") + "/" + "201710-citibike-tripdata.csv")

In [5]:
citi_1710.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,457,2017-10-01 00:00:00,2017-10-01 00:07:38,479,9 Ave & W 45 St,40.760193,-73.991255,478,11 Ave & W 41 St,40.760301,-73.998842,30951,Subscriber,1985.0,1
1,6462,2017-10-01 00:00:20,2017-10-01 01:48:03,279,Peck Slip & Front St,40.707873,-74.00167,307,Canal St & Rutgers St,40.714275,-73.9899,14809,Customer,,0
2,761,2017-10-01 00:00:27,2017-10-01 00:13:09,504,1 Ave & E 16 St,40.732219,-73.981656,350,Clinton St & Grand St,40.715595,-73.98703,28713,Subscriber,1992.0,1
3,1193,2017-10-01 00:00:29,2017-10-01 00:20:22,3236,W 42 St & Dyer Ave,40.758985,-73.9938,3233,E 48 St & 5 Ave,40.757246,-73.978059,16008,Customer,1992.0,2
4,2772,2017-10-01 00:00:32,2017-10-01 00:46:44,2006,Central Park S & 6 Ave,40.765909,-73.976342,469,Broadway & W 53 St,40.763441,-73.982681,14556,Customer,,0


In [6]:
citi_1710['date'] = pd.to_datetime(citi_1710["starttime"])

In [7]:
citi_1710['hour'] = pd.DatetimeIndex(citi_1710['starttime']).hour


In [12]:
citi_1710.columns

Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'birth year', 'gender', 'date', 'hour'],
      dtype='object')

In [8]:
# Select only the needed columns.
# NOTE: 'start station longitude' is listed twice (latitude was probably intended),
# so the outputs below show that column twice.
# .copy() gives an independent DataFrame, so adding a column later does not trigger
# a SettingWithCopyWarning.
citi_1710_2 = citi_1710[['tripduration','starttime','start station longitude','start station longitude','usertype','birth year','gender','hour']].copy()

In [9]:
def night_day_label(row):
    """Return 1 for a night ride (start hour before 5 or from 20 onward), else 0."""
    if row['hour'] < 5 or row['hour'] >= 20:
        return 1
    return 0

# Apply night_day_label row by row to build the night_flag column
citi_1710_2['night_flag'] = citi_1710_2.apply(night_day_label, axis=1)
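The row-by-row `apply` above can be slow on a month of trips. A minimal sketch of a vectorized alternative, assuming the same night definition (before 05:00 or from 20:00 onward). The `rides` frame and its timestamps are hypothetical toy data, not the Citi Bike file.

```python
import pandas as pd

# Hypothetical toy data standing in for the Citi Bike 'starttime' column
rides = pd.DataFrame({
    "starttime": ["2017-10-01 00:00:00", "2017-10-01 04:59:59",
                  "2017-10-01 05:00:33", "2017-10-01 19:59:59",
                  "2017-10-01 20:00:00", "2017-10-01 23:30:00"],
})

# Parse the timestamps once, then read the hour through the .dt accessor
rides["hour"] = pd.to_datetime(rides["starttime"]).dt.hour

# Night is before 05:00 or from 20:00 onward, matching night_day_label
is_night = (rides["hour"] < 5) | (rides["hour"] >= 20)
rides["night_flag"] = is_night.astype(int)

print(rides["night_flag"].tolist())  # [1, 1, 0, 0, 1, 1]
```

The boolean mask is computed on the whole column at once, so no Python function is called per row.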


In [10]:
citi_night = citi_1710_2[citi_1710_2['night_flag'] == 1] #create data frame for night time data 
citi_day = citi_1710_2[citi_1710_2['night_flag'] == 0] #create data frame for day time data

In [11]:
citi_night.head()

Unnamed: 0,tripduration,starttime,start station longitude,start station longitude.1,usertype,birth year,gender,hour,night_flag
0,457,2017-10-01 00:00:00,-73.991255,-73.991255,Subscriber,1985.0,1,0,1
1,6462,2017-10-01 00:00:20,-74.00167,-74.00167,Customer,,0,0,1
2,761,2017-10-01 00:00:27,-73.981656,-73.981656,Subscriber,1992.0,1,0,1
3,1193,2017-10-01 00:00:29,-73.9938,-73.9938,Customer,1992.0,2,0,1
4,2772,2017-10-01 00:00:32,-73.976342,-73.976342,Customer,,0,0,1


In [12]:
citi_day.head()

Unnamed: 0,tripduration,starttime,start station longitude,start station longitude.1,usertype,birth year,gender,hour,night_flag
1808,183,2017-10-01 05:00:33,-73.998842,-73.998842,Subscriber,1978.0,1,5,0
1809,220,2017-10-01 05:05:05,-73.999234,-73.999234,Subscriber,1993.0,1,5,0
1810,340,2017-10-01 05:06:13,-74.002116,-74.002116,Subscriber,1976.0,1,5,0
1811,1130,2017-10-01 05:07:16,-73.983035,-73.983035,Subscriber,1977.0,1,5,0
1812,258,2017-10-01 05:07:46,-73.947084,-73.947084,Subscriber,1987.0,1,5,0


In [13]:
ks = scipy.stats.ks_2samp(citi_day['tripduration'], citi_night['tripduration'])
print(ks)

Ks_2sampResult(statistic=0.04472519943096126, pvalue=0.0)
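To make the reported statistic concrete: the two-sample KS statistic is the largest vertical gap between the two empirical CDFs. The sketch below computes it by hand and compares it with `scipy.stats.ks_2samp`. The `day` and `night` arrays are synthetic lognormal draws chosen for illustration only, and `ks_statistic` is a hypothetical helper name.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for day and night trip durations (seconds); not real data
day = rng.lognormal(mean=6.5, sigma=0.7, size=2000)
night = rng.lognormal(mean=6.6, sigma=0.7, size=1500)

def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of a and b."""
    a = np.sort(a)
    b = np.sort(b)
    grid = np.concatenate([a, b])
    # Fraction of each sample that is <= every observed value
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

manual = ks_statistic(day, night)
reference = scipy.stats.ks_2samp(day, night).statistic
print(manual, reference)
```

A statistic of about 0.045, as in the output above, means the two empirical CDFs are never more than about 4.5 percentage points apart.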


In [25]:
# The KS test is a two-sided test of whether two samples are drawn from the same distribution.
# Here its null hypothesis is that trip durations have the same distribution for day and night rides.

# Because the p-value is effectively zero, we can reject the null hypothesis that the two distributions are the same.
# The test shows only that the distributions differ; it does not say which group has longer rides.
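The KS test is two-sided, while the hypotheses at the top are about direction (night rides longer or not). One possible way to test direction, which is not part of the original analysis, is a one-sided Mann-Whitney U test. The sketch below assumes that substitution and uses synthetic data in which night rides are deliberately made slightly longer.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(1)
# Synthetic durations in seconds; night is made slightly longer on purpose
day = rng.lognormal(mean=6.5, sigma=0.7, size=5000)
night = rng.lognormal(mean=6.6, sigma=0.7, size=5000)

# alternative="greater" asks whether the first sample (night) tends to be larger
result = scipy.stats.mannwhitneyu(night, day, alternative="greater")
print(result.statistic, result.pvalue)

# The difference in medians gives a rough sense of the size of the shift
print(np.median(night) - np.median(day))
```

On the real data, the same call would take `citi_night['tripduration']` and `citi_day['tripduration']` as its first two arguments.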


In [15]:
# Now we test on a subset of the data frame.

citi_day_sample = citi_day.sample(n=40000, random_state=1)
citi_night_sample = citi_night.sample(n=40000, random_state=1)

In [17]:
#retest the KS statistic on a random sample of data

ks1 = scipy.stats.ks_2samp(citi_day_sample['tripduration'], citi_night_sample['tripduration'])
print(ks1)

# Given the p-value, we would again reject the null hypothesis, this time on the random sample.

Ks_2sampResult(statistic=0.044250000000000012, pvalue=1.6899585113796641e-34)


In [20]:
# Test the Pearson correlation on the sub-samples

# The Pearson correlation measures the strength of the linear relationship between x and y; its significance test
# assumes the data are approximately normally distributed.
# Here each sample is sorted first, so the i-th shortest day ride is paired with the i-th shortest night ride.
# This is the correlation of a quantile-quantile (Q-Q) comparison: a value near 1 means the two sets of quantiles
# are close to linearly related, i.e. the distributions have a similar shape up to a shift and rescaling.

# Null hypothesis of the test: there is no linear correlation between the paired (sorted) durations
# Alternative hypothesis: the paired (sorted) durations are linearly correlated

# Because sorting makes both series non-decreasing, a high correlation and a tiny p-value are expected for almost any
# two samples, so the result does not show that the distributions are the same or that either one is normal.

pearson2 = scipy.stats.pearsonr(citi_day_sample['tripduration'].sort_values(), citi_night_sample['tripduration'].sort_values())
print("the pearson correlation is:", pearson2)

# Given the high correlation and a p-value of 0.0, we can reject the null hypothesis of no correlation.

the pearson correlation is: (0.97974306964365243, 0.0)
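When reading a Pearson coefficient computed on sorted samples, it helps to see how much of the correlation comes from the sorting itself. The sketch below uses two independent synthetic samples from clearly different distributions (normal and exponential, chosen only for illustration) and compares the correlation before and after sorting.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)
# Two independent samples from clearly different distributions (synthetic)
a = rng.normal(loc=0.0, scale=1.0, size=1000)
b = rng.exponential(scale=1.0, size=1000)

# Pearson r on the samples as drawn, then on the sorted samples
r_raw = scipy.stats.pearsonr(a, b)[0]
r_sorted = scipy.stats.pearsonr(np.sort(a), np.sort(b))[0]
print(r_raw, r_sorted)
```

The unsorted coefficient should sit near zero and the sorted one should be high, even though the two distributions differ in shape.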


In [22]:
# Test Spearman's rank correlation

# Spearman's correlation coefficient tests for a monotonic relationship between x and y, i.e. whether y tends to increase (or decrease) as x increases.
# It works on ranks, so it does not test whether the relationship is linear.

# Null hypothesis of the test: there is no monotonic association between the paired (sorted) durations
# Alternative hypothesis: the paired (sorted) durations are monotonically associated

# Because both samples are sorted before pairing, their ranks match almost exactly, so a correlation near 1 is expected
# regardless of how similar the underlying distributions are.

spearman1 = scipy.stats.spearmanr(citi_day_sample['tripduration'].sort_values(), citi_night_sample['tripduration'].sort_values())
print(spearman1)

# Given the high correlation and a p-value of 0.0, we can reject the null hypothesis of no monotonic association.

SpearmanrResult(correlation=0.99999930268280213, pvalue=0.0)
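The same check applies to the Spearman coefficient. Sorting both samples makes their ranks line up, so the coefficient is essentially 1 whatever the distributions look like. The sketch below uses synthetic, deliberately dissimilar samples. Small departures from 1, as in the output above, would plausibly come from ties in the integer-second durations, though that is an assumption.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(3)
a = rng.normal(size=500)        # symmetric
b = rng.exponential(size=500)   # strongly right-skewed

# After sorting, the i-th value of a is paired with the i-th value of b
rho = scipy.stats.spearmanr(np.sort(a), np.sort(b))[0]
print(rho)
```

A Q-Q plot of the two sorted samples, or the KS statistic, is a more informative way to compare the shapes of two distributions.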
