# Investigation of citibike ridership with income

# Objective: relate the citibike ridership to income.
### 1. Download two months of citibike data 201501 and 201601
### 2. Download income information from the IRS for NYC
### 3. Find the zipcodes of citibike stations by reverse geocoding the coordinates
### 4. Find the number of rides per zipcode in the 2015/01 data
### 5. Fit a line to ridership (number of rides over one month of the citibike data downloaded, starting with 2015/01) vs income (total Adjusted gross income for the zip) and median income per person for that zip
### 6. Improve the fit by removing 2 suspected outliers and quantify the improvement
### 7. Fit a 2nd degree polynomial to the same data
### 8. Compare FORMALLY the line and the 2nd degree polynomial fit with LR test to assess which is a better fit to your chosen significance level.

## Extra Credit

### 1. Compare the income to the income per person as exogenous variable, i.e. redo your best fit with income per person instead of area income and compare the fits. How do the fits compare? If it is better, what does this say? If it is worse, what does this say? Discuss why that may be, and if you have an explanation, describe how you would test it. If you have time, go ahead and test it too!

### 2. Repeat the analysis with another dataset. Are the results consistent?


## 1. Download two months of citibike data: 201501 and 201601. Begin working with 201501, and if you have time (for extra credit) you will repeat the analysis for the other month, to see if your conclusions are robust. 

In [None]:
cb2016.head()

## 2. Downloading income data from IRS 

### Find income data per zipcode in NYC: you can get it from the IRS; the file name is 14zp33ny
### and the IRS site is https://www.irs.gov/pub/irs-soi/?C=N;O=D

Use the Adjusted gross income (AGI) for every zipcode. Additionally identify the columns indicating the number of returns, the number of dependents, the number of joint returns. Together they indicate the size of the family unit, allowing you to obtain the income per person in that zipcode, from the income of the whole zipcode. 

Convert the zip to numeric values (with pd.to_numeric)

For every zipcode the adjusted median income is the first valid row associated with that zipcode. 
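
A minimal sketch of the two steps above on a toy table. The column names `zipcode` and `agi` and all values are made up for illustration and are not the actual header of the IRS file. It coerces the zip column to numbers, drops rows that fail to parse, and keeps the first row for each zipcode.

```python
import pandas as pd

# Toy stand-in for the IRS table; column names and values are illustrative only.
raw = pd.DataFrame({
    "zipcode": ["10001", "10001", "10002", "Total", "10002"],
    "agi": [100.0, 40.0, 250.0, 999.0, 90.0],
})

# Entries that are not numbers become NaN and are then dropped.
raw["zipcode"] = pd.to_numeric(raw["zipcode"], errors="coerce")
valid = raw.dropna(subset=["zipcode"]).copy()
valid["zipcode"] = valid["zipcode"].astype(int)

# Keep only the first row encountered for every zipcode.
first_rows = valid.drop_duplicates(subset="zipcode", keep="first").set_index("zipcode")
print(first_rows)
```

Indexing by zipcode afterwards makes lookups such as `first_rows.loc[10001]` straightforward.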

If you need help look here [...]

Store the income data in a dataframe with (at least) the columns

**zipcodes,	income, N,	incomePC**

where zipcodes are the zipcodes, income is the AGI, N the number of returns, incomePC the AGI for the zipcode divided by (N + N dependents + N joint returns)

In [None]:
incomeByZip.head()


In [None]:
#extract the right entry with iloc[0]: e.g.
print ("Adjusted gross income (AGI)  for zipcode 10001:", 
       incomeByZip.loc[[10001]]["Adjusted gross income (AGI) [3]"].iloc[0])

Create a new dataframe with the value of income per zipcode, and income per person per zipcode 
(income per person = income / (Nreturns + Njoint returns + Ndependents))
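
As a sanity check on the formula, here is a small sketch that applies it to the rows for zipcodes 10001 and 10002 from the reference output shown in this notebook. The dataframe name `zipincome_demo` is hypothetical; the column names follow the ones requested above.

```python
import pandas as pd

# Rows for zipcodes 10001 and 10002, copied from the reference output in this notebook.
zipincome_demo = pd.DataFrame({
    "zipcodes": [10001, 10002],
    "income": [2363960, 2215542],
    "N": [14080, 43370],
    "Njoint": [2410, 11040],
    "Ndeps": [3250, 19160],
})

# Estimated number of people: returns + joint returns + dependents.
people = zipincome_demo["N"] + zipincome_demo["Njoint"] + zipincome_demo["Ndeps"]
zipincome_demo["incomePC"] = zipincome_demo["income"] / people
print(zipincome_demo)
```

The resulting `incomePC` values should agree with the reference table to the printed precision.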

In [8]:
zipincome = pd.DataFrame()
zipincome['zipcodes'] = ...

In [9]:
# compare your dataframe with mine to know you are on the right track

In [10]:
zipincome.head()

Unnamed: 0,zipcodes,income,N,Njoint,Ndeps,incomePC
0,0,766646080,9397410,2942890,5539120,42.878688
1,99999,14338084,88940,28130,43810,89.122849
2,10001,2363960,14080,2410,3250,119.754813
3,10002,2215542,43370,11040,19160,30.114748
4,10003,6910992,29810,5460,4790,172.516026


## 3. Find the zipcodes of citibike stations by reverse geocoding the coordinates
You can use the Google API, passing the longitude and latitude of each station. CAREFUL!! You do not need a separate API query per ride, just one per citibike station. You have a limit of 2500 requests/day, so you cannot submit a request per ride. (You can use pd.DataFrame.drop_duplicates, for example, to identify identical coordinate pairs or identical station ids, so as not to repeat queries for the same station.)
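
A short sketch of the deduplication idea, assuming the citibike column names that appear in this notebook. The station ids and coordinates are taken from the sample rows shown here; the repeated rides are made up.

```python
import pandas as pd

# Toy rides table: five rides starting from three distinct stations.
rides = pd.DataFrame({
    "start station id": [128, 438, 128, 438, 383],
    "start station latitude": [40.727103, 40.727791, 40.727103, 40.727791, 40.735238],
    "start station longitude": [-74.002971, -73.985649, -74.002971, -73.985649, -74.000271],
})

station_cols = ["start station id", "start station latitude", "start station longitude"]

# One row per station: this is the list of coordinates to send to the geocoder.
stations = rides[station_cols].drop_duplicates().reset_index(drop=True)
print(len(rides), "rides ->", len(stations), "stations to geocode")
```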

https://developers.google.com/maps/documentation/geocoding/intro

If you do not have an API key for googlemaps you can get one instantly here
https://developers.google.com/maps/documentation/geocoding/get-api-key


Once you have the zip for a lat/lon pair (lat, lon) you can use a condition like
```
(cb['start station latitude'] == lat) & (cb['start station longitude'] == lon)
```
as an index to identify the rows of the citibike dataframe that contain those coordinates and are associated with that zipcode
```
cb.loc[(cb['start station latitude'] == lat) & (cb['start station longitude'] == lon), 'zipcodes'] = thatzipcode
```
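
The same masking idea can be applied to every station in turn. This is a sketch under assumptions: `station_zip` is a hypothetical lookup that the geocoding step would produce, and `cb_demo` is a toy dataframe, not the real citibike data.

```python
import pandas as pd

# Toy rides table with two distinct start stations.
cb_demo = pd.DataFrame({
    "start station latitude": [40.727791, 40.735238, 40.727791],
    "start station longitude": [-73.985649, -74.000271, -73.985649],
})

# Hypothetical result of the geocoding step: (lat, lon) -> zipcode.
station_zip = {
    (40.727791, -73.985649): 10003,
    (40.735238, -74.000271): 10011,
}

cb_demo["zipcodes"] = -1  # placeholder meaning "not geocoded yet"
for (lat, lon), zipcode in station_zip.items():
    mask = ((cb_demo["start station latitude"] == lat)
            & (cb_demo["start station longitude"] == lon))
    # .loc with a boolean mask writes into the original dataframe, not a copy.
    cb_demo.loc[mask, "zipcodes"] = zipcode
print(cb_demo)
```

Assigning through `.loc` is one way to avoid the "value is trying to be set on a copy" warning that chained indexing can trigger.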

If you need help, go here [a link]. It describes a function I created that, given a URL, returns the zipcode. You can use it as
revgeo = getJsonParsedData(url)["results"][0]['address_components'][-1]['long_name']
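
The sketch below shows one way to build the request URL and pull the zipcode out of an already parsed response. Both helper names (`reverse_geocode_url`, `extract_zipcode`) are hypothetical and are not the `getJsonParsedData` function mentioned above, whose code is not shown here. The `sample` dictionary is hand-written to mimic the shape of a geocoding response and is not real API output. Instead of taking the last address component, the sketch searches for the component typed `postal_code`, since the last component may be something else (a postal code suffix, for instance), which could be one source of invalid zipcodes.

```python
def reverse_geocode_url(lat, lon, api_key):
    """Build a reverse-geocoding request URL for one station."""
    return ("https://maps.googleapis.com/maps/api/geocode/json"
            "?latlng={},{}&key={}".format(lat, lon, api_key))


def extract_zipcode(parsed):
    """Return the first postal code in a parsed geocoding response, or None."""
    for result in parsed.get("results", []):
        for component in result.get("address_components", []):
            if "postal_code" in component.get("types", []):
                return component["long_name"]
    return None


# Hand-written stand-in for a parsed response (assumption, not real output).
sample = {
    "results": [{
        "address_components": [
            {"long_name": "New York", "short_name": "New York",
             "types": ["locality", "political"]},
            {"long_name": "10003", "short_name": "10003",
             "types": ["postal_code"]},
            {"long_name": "1234", "short_name": "1234",
             "types": ["postal_code_suffix"]},
        ]
    }]
}

print(reverse_geocode_url(40.727791, -73.985649, "YOUR_KEY"))
print(extract_zipcode(sample))
```

Returning `None` when no postal code is found makes failed lookups easy to spot and drop later.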

If you are not up for using the API, you can download the zipcode of each citibike station here
http://cosmo.nyu.edu/~fb55/UI_CUSP_2015/data/stationzips.json
However, this will cost you 0.5/10 points.


In [12]:
# you can compare your dataframe with mine to check that you are on track
cb2015.head()


Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,zipcodes
0,1338,6/1/2015 0:00,6/1/2015 0:22,128,MacDougal St & Prince St,40.727103,-74.002971,2021,W 45 St & 8 Ave,40.759291,-73.988597,20721,Subscriber,1984.0,1,2926
1,290,6/1/2015 0:00,6/1/2015 0:05,438,St Marks Pl & 1 Ave,40.727791,-73.985649,312,Allen St & E Houston St,40.722055,-73.989111,21606,Subscriber,1997.0,1,10003
2,634,6/1/2015 0:01,6/1/2015 0:11,383,Greenwich Ave & Charles St,40.735238,-74.000271,388,W 26 St & 10 Ave,40.749718,-74.00295,16595,Subscriber,1993.0,1,10011
3,159,6/1/2015 0:01,6/1/2015 0:04,361,Allen St & Hester St,40.716059,-73.991908,531,Forsyth St & Broome St,40.718939,-73.992663,16949,Subscriber,1981.0,1,5416
4,1233,6/1/2015 0:02,6/1/2015 0:22,382,University Pl & E 14 St,40.734927,-73.992005,532,S 5 Pl & S 4 St,40.710451,-73.960876,17028,Customer,,0,4510


In [13]:
# you can compare your dataframe with mine to check that you are on track

#grouping and counting
cbgroup = cb2015.groupby .... count()
cbgroup.head()

Unnamed: 0,zipcodes,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,914,3675,3675,3675,3675,3675,3675,3675,3675,3675,3675,3675,3675,3675,3369,3675
1,10001,51691,51691,51691,51691,51691,51691,51691,51691,51691,51691,51691,51691,51691,45571,51691
2,10002,45970,45970,45970,45970,45970,45970,45970,45970,45970,45970,45970,45970,45970,40795,45970
3,10003,74663,74663,74663,74663,74663,74663,74663,74663,74663,74663,74663,74663,74663,67536,74663
4,10004,13698,13698,13698,13698,13698,13698,13698,13698,13698,13698,13698,13698,13698,9850,13698
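
One possible way to do the grouping, shown on a toy table (the rows are made up; only the `zipcodes` column name follows this notebook). It uses `size()`, which counts rows per group. `count()`, by contrast, counts non-null values per column, which is why the `birth year` column in the output above has smaller numbers than the others.

```python
import pandas as pd

# Toy rides table: one row per ride, already tagged with a zipcode.
rides = pd.DataFrame({
    "zipcodes": [10001, 10003, 10003, 914, 10003],
    "tripduration": [1338, 290, 634, 159, 1233],
})

# Number of rides per zipcode as a two-column dataframe.
nrides = rides.groupby("zipcodes").size().rename("Nrides").reset_index()
print(nrides)
```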


# MERGE
Notice there may be lots of invalid zipcodes from bad reverse geocoding!! Drop all data with zipcodes < 10000

In [19]:
cbincome = pd.merge(...

In [20]:
cbincome.head()

Unnamed: 0,zipcodes,income,N,Njoint,Ndeps,incomePC,Nrides
0,10001,2363960,14080,2410,3250,119.754813,51691
1,10002,2215542,43370,11040,19160,30.114748,45970
2,10003,6910992,29810,5460,4790,172.516026,74663
3,10004,925417,2540,840,1130,205.192239,13698
4,10005,5545849,5890,1340,1340,647.123571,9631
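
A sketch of the cleaning and merge step, assuming that valid NYC zipcodes are five-digit numbers (10000 or above). The numbers are copied from the reference outputs in this notebook; the dataframe names are hypothetical.

```python
import pandas as pd

# Ride counts per zipcode, including one invalid zipcode (914).
nrides = pd.DataFrame({
    "zipcodes": [914, 10001, 10002],
    "Nrides": [3675, 51691, 45970],
})

# Income table, including the 0 and 99999 rows seen in the reference output.
zipincome_demo = pd.DataFrame({
    "zipcodes": [0, 99999, 10001, 10002],
    "income": [766646080, 14338084, 2363960, 2215542],
})

# Drop zipcodes that cannot be valid (assumption: anything below 10000).
valid = nrides[nrides["zipcodes"] >= 10000]

# An inner merge keeps only zipcodes present in both tables.
merged = pd.merge(zipincome_demo, valid, on="zipcodes", how="inner")
print(merged)
```

The inner merge also discards income rows with no citibike station, such as the 0 and 99999 entries.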


# 5. Plot and fit the data
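
A minimal sketch of the line fit only (no plot), using the five rows of the merged reference output above as a tiny example. With so few points the numbers are not meaningful; the point is the mechanics. Income is divided by 1e6 purely for numerical convenience, which is a choice made for this sketch.

```python
import numpy as np

# income and Nrides for zipcodes 10001 to 10005, from the reference output.
income = np.array([2363960, 2215542, 6910992, 925417, 5545849], dtype=float) / 1e6
nrides = np.array([51691, 45970, 74663, 13698, 9631], dtype=float)

# Least-squares line: nrides ~ slope * income + intercept.
slope, intercept = np.polyfit(income, nrides, deg=1)
predicted = slope * income + intercept

# Coefficient of determination as a simple goodness-of-fit measure.
ss_res = np.sum((nrides - predicted) ** 2)
ss_tot = np.sum((nrides - nrides.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print("slope={:.1f}, intercept={:.1f}, R^2={:.3f}".format(slope, intercept, r_squared))
```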

# 6. Choose two high leverage points that may be outliers, fit the data without them and compare the fit
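
One way to pick candidate points is by leverage. The sketch below runs on synthetic data (generated here, not the citibike data) with two far-out points added on purpose. For a line with intercept, the leverage of point $i$ is $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. Whether a high-leverage point is really an outlier still needs a judgment call; this only shows the mechanics of dropping two points and comparing $R^2$.

```python
import numpy as np

# Synthetic data: a clean linear trend plus two distant points.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=30)
y = 10.0 * x + rng.normal(0.0, 2.0, size=30)
x = np.append(x, [20.0, 25.0])
y = np.append(y, [40.0, 30.0])


def r_squared(x, y):
    """R^2 of a least-squares line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - resid.dot(resid) / np.sum((y - y.mean()) ** 2)


# Leverage of each point for simple linear regression with an intercept.
leverage = 1.0 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Indices of the two highest-leverage points, then a mask that excludes them.
drop = np.argsort(leverage)[-2:]
keep = np.ones(len(x), dtype=bool)
keep[drop] = False

r2_all = r_squared(x, y)
r2_trimmed = r_squared(x[keep], y[keep])
print("R^2 with all points: {:.3f}, without the two: {:.3f}".format(r2_all, r2_trimmed))
```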


# 7. Fit a 2nd degree polynomial and assess if the addition of the extra parameter is justified by the data
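
A sketch of the likelihood-ratio comparison on synthetic data (generated here, not the citibike data). It assumes Gaussian errors, in which case the statistic for nested least-squares fits reduces to $LR = n \ln(RSS_{line} / RSS_{quad})$. Under the null hypothesis that the line is adequate, $LR$ follows a $\chi^2$ distribution with one degree of freedom (one extra parameter), and for one degree of freedom the tail probability is $\mathrm{erfc}(\sqrt{LR/2})$. The significance level `alpha = 0.05` is only an example choice.

```python
import math
import numpy as np

# Synthetic data with genuine curvature.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 + 1.5 * x + 0.8 * x ** 2 + rng.normal(0.0, 3.0, size=x.size)


def rss(x, y, degree):
    """Residual sum of squares of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    return float(resid.dot(resid))


n = x.size
# Likelihood-ratio statistic for the line (restricted) vs the quadratic (full).
lr_stat = n * math.log(rss(x, y, 1) / rss(x, y, 2))

# Chi-squared tail probability for exactly 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

alpha = 0.05  # example significance level
prefer_quadratic = p_value < alpha
print("LR={:.2f}, p={:.3g}, prefer quadratic: {}".format(lr_stat, p_value, prefer_quadratic))
```

If the models differed by more than one parameter, the `erfc` shortcut would no longer apply and a general chi-squared survival function would be needed.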

# EC:  Fit the rides to the income per person, discuss the result

# EC:  Test with 2016, discuss the results