### Author: Adriano Yoshino - amy324@nyu.edu

## Assignment 1: Compare Tests for Goodness of fit (on real data)

### Test whether a Gaussian model N($\mu$, $\sigma$) for the age distribution of Citibike riders is a sensible model, or if you can find a better fit with another distribution.

- Use 2 of these tests to do this: KS, AD, KL, chisq (even though we have not talked about it in detail yet).

- Test the Normal and at least one other distribution (e.g. Poisson, Binomial, Chisq, Lognormal, ...)

In [7]:
import pylab as pl
import pandas as pd
import numpy as np
import os
import scipy.stats

# Using Dr. Bianco's function to download the Citibike data
# import the downloader
from getCitiBikeCSV import getCitiBikeCSV

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
# As the data is already downloaded, I will not run this code this time
# datestring = '201503'
# getCitiBikeCSV(datestring)

('Downloading', '201507')
file in place, you can continue


In [4]:
datestring = '201503'

In [5]:
df = pd.read_csv(os.getenv("PUIDATA") + "/" + datestring + '-citibike-tripdata.csv')
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,669,3/1/2015 0:00,3/1/2015 0:11,164,E 47 St & 2 Ave,40.753231,-73.970325,477,W 41 St & 8 Ave,40.756405,-73.990026,21409,Subscriber,1987.0,1
1,750,3/1/2015 0:01,3/1/2015 0:14,258,DeKalb Ave & Vanderbilt Ave,40.689407,-73.968855,436,Hancock St & Bedford Ave,40.682166,-73.95399,19397,Subscriber,1968.0,1
2,663,3/1/2015 0:01,3/1/2015 0:12,497,E 17 St & Broadway,40.73705,-73.990093,477,W 41 St & 8 Ave,40.756405,-73.990026,20998,Customer,,0
3,480,3/1/2015 0:02,3/1/2015 0:10,470,W 20 St & 8 Ave,40.743453,-74.00004,491,E 24 St & Park Ave S,40.740964,-73.986022,21565,Subscriber,1983.0,1
4,1258,3/1/2015 0:02,3/1/2015 0:23,345,W 13 St & 6 Ave,40.736494,-73.997044,473,Rivington St & Chrystie St,40.721101,-73.991925,14693,Subscriber,1970.0,1


In [6]:
# df is the dataframe where the content of the csv file is stored
df['date'] = pd.to_datetime(df['starttime'])
# note that with dataframes I can refer to variables as dictionary keys, 
# i.e. df['starttime'] or as attributes: df.starttime. 
df.head(3)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,date
0,669,3/1/2015 0:00,3/1/2015 0:11,164,E 47 St & 2 Ave,40.753231,-73.970325,477,W 41 St & 8 Ave,40.756405,-73.990026,21409,Subscriber,1987.0,1,2015-03-01 00:00:00
1,750,3/1/2015 0:01,3/1/2015 0:14,258,DeKalb Ave & Vanderbilt Ave,40.689407,-73.968855,436,Hancock St & Bedford Ave,40.682166,-73.95399,19397,Subscriber,1968.0,1,2015-03-01 00:01:00
2,663,3/1/2015 0:01,3/1/2015 0:12,497,E 17 St & Broadway,40.73705,-73.990093,477,W 41 St & 8 Ave,40.756405,-73.990026,20998,Customer,,0,2015-03-01 00:01:00


In [8]:
df.columns

Index([u'tripduration', u'starttime', u'stoptime', u'start station id',
       u'start station name', u'start station latitude',
       u'start station longitude', u'end station id', u'end station name',
       u'end station latitude', u'end station longitude', u'bikeid',
       u'usertype', u'birth year', u'gender', u'date'],
      dtype='object')

In [10]:
# Age in 2015 of subscribers: all, male (gender == 1), and female (gender == 2)
df['age'] = 2015 - df['birth year'][(df['usertype'] == 'Subscriber')]
df['ageM'] = 2015 - df['birth year'][(df['usertype'] == 'Subscriber') & (df['gender'] == 1)]
df['ageF'] = 2015 - df['birth year'][(df['usertype'] == 'Subscriber') & (df['gender'] == 2)]

In [12]:
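# dropping NaN values
# Caution: this acts on the selected column object, not on df itself;
# df keeps its NaN rows. A more dependable pattern is ages = df['age'].dropna().
df['age'].dropna(inplace= True)
df['ageM'].dropna(inplace= True)
df['ageF'].dropna(inplace= True)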
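
The age columns built above can also be kept as separate NaN-free Series, which avoids relying on in-place changes to a column selection. The following is only a minimal sketch on a made-up toy table (`toy` and its values are hypothetical, not the Citibike data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Citibike table (values are made up for illustration)
toy = pd.DataFrame({
    'usertype': ['Subscriber', 'Subscriber', 'Customer', 'Subscriber'],
    'birth year': [1987.0, 1968.0, np.nan, np.nan],
    'gender': [1, 2, 0, 1],
})

is_subscriber = toy['usertype'] == 'Subscriber'

# Keep each age sample as its own Series and drop missing birth years right away
age = (2015 - toy.loc[is_subscriber, 'birth year']).dropna()
ageM = (2015 - toy.loc[is_subscriber & (toy['gender'] == 1), 'birth year']).dropna()
ageF = (2015 - toy.loc[is_subscriber & (toy['gender'] == 2), 'birth year']).dropna()

print(age.tolist(), ageM.tolist(), ageF.tolist())
```

With this layout the tests can be run on `age`, `ageM`, and `ageF` directly, and the original table is left untouched.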

In [13]:
df.head(3)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,date,age,ageM,ageF
0,669,3/1/2015 0:00,3/1/2015 0:11,164,E 47 St & 2 Ave,40.753231,-73.970325,477,W 41 St & 8 Ave,40.756405,-73.990026,21409,Subscriber,1987.0,1,2015-03-01 00:00:00,28.0,28.0,
1,750,3/1/2015 0:01,3/1/2015 0:14,258,DeKalb Ave & Vanderbilt Ave,40.689407,-73.968855,436,Hancock St & Bedford Ave,40.682166,-73.95399,19397,Subscriber,1968.0,1,2015-03-01 00:01:00,47.0,47.0,
2,663,3/1/2015 0:01,3/1/2015 0:12,497,E 17 St & Broadway,40.73705,-73.990093,477,W 41 St & 8 Ave,40.756405,-73.990026,20998,Customer,,0,2015-03-01 00:01:00,,,


### Applying KS tests

#### Normal distribution

In [28]:
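# Note: 'norm' without args is the standard normal N(0, 1), so positive ages give statistic = 1.0 by construction
scipy.stats.kstest(df['age'], 'norm')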

KstestResult(statistic=1.0, pvalue=0.0)

In [27]:
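# Sanity check: compare age with a normal sample of the same mean and standard deviation.
# Fixed: the original call on df['ageM'] added the mean inside randn(...), i.e. to the sample size; the output below predates this fix.
scipy.stats.ks_2samp(df['age'].dropna(), df.age.std() * np.random.randn(1000) + df.age.mean())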

Ks_2sampResult(statistic=0.9595517561856773, pvalue=0.0)

- Null hypothesis: the sample is drawn from the reference distribution (here, a normal distribution).
- The p-value (0.00) is below the chosen significance level (0.05), so we reject the null hypothesis: the ages of Citibike subscribers are not normally distributed.
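
To make the role of the distribution parameters concrete, here is a small sketch on synthetic data (the `sample` array is hypothetical and only mimics a right-skewed age distribution). It contrasts a KS test against the default standard normal with one against a normal whose `loc` and `scale` are taken from the sample. Because the parameters are estimated from the same data, the resulting p-value should be read as approximate.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(0)
# Synthetic, right-skewed "ages" (hypothetical data, not the Citibike sample)
sample = 16 + rng.gamma(shape=4.0, scale=6.0, size=5000)

mu, sigma = sample.mean(), sample.std()

# Without args, 'norm' means N(0, 1): every value here lies far in its upper tail
stat_std, p_std = scipy.stats.kstest(sample, 'norm')

# With args=(loc, scale), the reference distribution is N(mu, sigma)
stat_fit, p_fit = scipy.stats.kstest(sample, 'norm', args=(mu, sigma))

print(stat_std, stat_fit, p_fit)
```

The second statistic is the informative one: it measures how far the sample is from a normal distribution with matching mean and spread, rather than from N(0, 1).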

#### Poisson distribution

In [37]:
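# KS test of age against a Poisson distribution whose mean parameter is the sample mean.
# Caveat: scipy's poisson takes (mu, loc), so the second argument shifts the distribution by df.age.std(); it does not set its spread.
scipy.stats.kstest(df['age'], 'poisson', args=(df.age.mean(),df.age.std()))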

KstestResult(statistic=0.54626444757662995, pvalue=0.0)

- Null hypothesis: the sample is drawn from the reference distribution (here, a Poisson distribution).
- The p-value (0.00) is below the chosen significance level (0.05), so we reject the null hypothesis: the ages of Citibike subscribers are not Poisson distributed.
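
As a hedged illustration of how `args` is interpreted for the Poisson case, the sketch below uses a made-up integer sample (`sample` is hypothetical). In `scipy.stats.poisson` the first parameter is the mean `mu` and the second is the location shift `loc`; there is no separate spread parameter, since the variance of a Poisson distribution equals its mean. The KS test is also designed for continuous distributions, so p-values against a discrete reference should be read with caution.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(1)
# Hypothetical integer sample drawn from a Poisson distribution with mean 38
sample = rng.poisson(lam=38, size=2000)
mu = sample.mean()

# One-parameter call: reference distribution is Poisson(mu)
stat_mu, p_mu = scipy.stats.kstest(sample, 'poisson', args=(mu,))

# A second positional argument is loc: here the whole distribution is shifted by 10
stat_shift, p_shift = scipy.stats.kstest(sample, 'poisson', args=(mu, 10))

print(stat_mu, stat_shift)
```

The shifted reference sits far from the data, so its statistic is much larger even though the sample really is Poisson.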

### Applying Anderson-Darling (AD) tests

#### Normal Distribution

In [38]:
scipy.stats.anderson(df['age'], 'norm')

AndersonResult(statistic=3256.2673628629418, critical_values=array([ 0.576,  0.656,  0.787,  0.918,  1.092]), significance_level=array([ 15. ,  10. ,   5. ,   2.5,   1. ]))

- Null hypothesis: the sample is drawn from a population that follows a particular distribution (in this case, normal).
- The statistic (3256.27) is far larger than 0.787, the critical value at the 5% significance level, so we reject the null hypothesis: the sample is not drawn from a normally distributed population.
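
The comparison against the critical value can also be done in code instead of by eye. This is a minimal sketch on synthetic data (`sample` is hypothetical); it looks up the critical value that matches the 5% entry of `significance_level`, which `scipy.stats.anderson` reports in percent.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(2)
# Hypothetical right-skewed sample standing in for the ages
sample = 16 + rng.gamma(shape=4.0, scale=6.0, size=5000)

result = scipy.stats.anderson(sample, 'norm')

# significance_level is given in percent; find the position of the 5% entry
idx = int(np.flatnonzero(np.isclose(result.significance_level, 5.0))[0])
critical_5 = result.critical_values[idx]

# Reject normality at the 5% level when the statistic exceeds the critical value
reject = result.statistic > critical_5
print(result.statistic, critical_5, reject)
```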

#### Exponential Distribution

In [39]:
# scipy's anderson does not support the Poisson distribution, so I chose a different one: exponential
scipy.stats.anderson(df['age'], 'expon')

AndersonResult(statistic=76318.754837987304, critical_values=array([ 0.922,  1.078,  1.341,  1.606,  1.957]), significance_level=array([ 15. ,  10. ,   5. ,   2.5,   1. ]))

- Null hypothesis: the sample is drawn from a population that follows a particular distribution (in this case, exponential).
- The statistic (76318.75) is far larger than 1.341, the critical value at the 5% significance level, so we reject the null hypothesis: the sample is not drawn from an exponentially distributed population.