# NYC Automated Bicycle Counts
June 29, 2020; revised July 22, 2020
Alice Friedman

This code downloads, cleans, and summarizes data collected in NYC by automated bike counters and published on the NYC Open Data portal. The count data and location data come from two separate tables, which are joined in this code.

Below is a demonstration that the same code, pulling from the Open Data API and from the Eco-Counter API, produces *different results*.


## Section 1: Setup

In [56]:
# make sure to install these packages before running:
import urllib.request, json, requests, certifi
import pandas as pd
from datetime import datetime
from sodapy import Socrata

import matplotlib.pyplot as plt

## Section 2:  Download Location Data from Open Data

Automated counter location names, ids, and other data are stored in a table available here.
 
 * https://data.cityofnewyork.us/Transportation/Bicycle-Counters/smn3-rzf9

For the purposes of this analysis we will only use the table to match location names to ids, which is the key in the bike count table. Other data, such as lat/long, is also available.

For locations with multiple counters or where multiple counters have been used over a period of years (e.g. Manhattan Bridge), a summary count (e.g. counts in both directions and for all periods counted) is stored in an id with `sens==0`.  The list of locations with these complete counts is then used to call to the API to download counts, which are collected in 15-minute increments, here:

* https://data.cityofnewyork.us/Transportation/Bicycle-Counts/uczf-rk3c

Counts are then cleaned to assign relevant data types (e.g. dates are stored as timestamps rather than text) and then summed by month.

Finally, partial-year data (the first year any counter is available, as well as the current year) is removed from the data set.
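The monthly summing and partial-year trimming described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own implementation: the function name `monthly_complete_years`, the `current_year` parameter, and the one-bike-per-interval series are all assumptions made for the example, and it handles a single counter rather than the full table.

```python
import pandas as pd

def monthly_complete_years(counts, current_year):
    """Sum 15-minute counts by month, then drop the first and the current calendar year.

    Hypothetical helper: `counts` is a Series indexed by timestamp for one counter.
    """
    monthly = counts.resample("MS").sum()  # "MS" labels each bin by its month start
    first_year = monthly.index.year.min()
    keep = (monthly.index.year > first_year) & (monthly.index.year < current_year)
    return monthly[keep]

# Synthetic 15-minute series (one bike per interval) from mid-2018 to mid-2020.
idx = pd.date_range("2018-06-14", "2020-06-14", freq="15min")
counts = pd.Series(1, index=idx, name="counts")

monthly = monthly_complete_years(counts, current_year=2020)
print(monthly.head())
```

Only the twelve months of 2019 survive here, because 2018 is the first (partial) year and 2020 is treated as the current year.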

### 2.1 Download & Format Locations Table

In [57]:
#from open data
locations_url = 'https://data.cityofnewyork.us/resource/smn3-rzf9.csv'
locations_raw = pd.read_csv(locations_url)

In [58]:
#create & clean table of counter locations
locations = locations_raw[['name', 'id', 'sens', 'counter']]
locations = locations[locations['sens']==0] #includes just the sum of all counts at a location
locations = locations[~locations['name'].str.contains("Interference")] #selects out calibration counters
locations = locations[locations['counter'].notnull()] #selects only active counters
locations['site'] = locations['id'].astype(str)

#exclude 1st Ave (known to have a lot of interference)
locations = locations[locations.name != '1st Avenue - 26th St N']

#set index as id
locations = locations.set_index('id')

print(len(locations))
print(locations.dtypes)
locations

13
name       object
sens        int64
counter    object
site       object
dtype: object


id,name,sens,counter,site
100057316,8th Ave at 50th St.,0,Y2H18055363,100057316
100010019,Kent Avenue Bike Path,0,Y2H13094302,100010019
100009425,Prospect Park West,0,Y2H13094304,100009425
100009428,Ed Koch Queensboro Bridge Shared Path,0,Y2H19111445,100009428
100057320,Columbus Ave at 86th St.,0,Y2H18055356,100057320
100047029,Manhattan Bridge Display Bike Counter,0,Y2H17062567,100047029
100010017,Staten Island Ferry,0,Y2H13094300,100010017
100009426,Manhattan Bridge Ped Path,0,Y2H13074107,100009426
100057318,Broadway at 50th St,0,Y2H18055362,100057318
100010022,Brooklyn Bridge Bike Path,0,Y2H13074106,100010022


## Section 3: Load Count Data from Open Data API

This section creates and runs a function to page through the Open Data API to get all counts for the locations in the table above. The counts for each location are collected in a list of dataframes, so that each data frame can be filtered separately, and then concatenated.
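The paging pattern can be sketched independently of the API. The example below is a hedged illustration: `fetch_all`, `get_page`, and `fake_get_page` are hypothetical names, and the fake page function stands in for a real call such as sodapy's `client.get(data_id, where=..., limit=..., offset=...)` so that the loop can be run without network access.

```python
import pandas as pd

def fetch_all(get_page, limit):
    """Request pages of `limit` rows until an empty page comes back."""
    frames = []
    offset = 0
    while True:
        rows = get_page(limit=limit, offset=offset)
        if not rows:  # an empty page means everything has been read
            break
        frames.append(pd.DataFrame.from_records(rows))
        offset += limit
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Stand-in for the API (assumption): 7 fake records served in pages.
FAKE_ROWS = [{"id": "100010019", "counts": str(i)} for i in range(7)]

def fake_get_page(limit, offset):
    return FAKE_ROWS[offset:offset + limit]

df = fetch_all(fake_get_page, limit=3)
print(len(df))
```

Checking for the empty page before appending keeps empty frames out of the list passed to `pd.concat`.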

In [59]:
client = Socrata("data.cityofnewyork.us", None) #none refers to token -- none required for public data
data_id = "uczf-rk3c" #dataset id for Bicycle Counts data

#function to page through data and load data based on id
def load_OD(loc_id):
    l = [] #empty list
    
    n=0 #set counter
    loc = 'id=' + str(loc_id)
    lim=500000 #limit on API

    while True:
        # Up to 500000 results per request (max), returned as JSON from the API and
        # converted to a Python list of dictionaries by sodapy.
        results = client.get(data_id, limit=lim, offset=lim*n, where=loc)
        frame = pd.DataFrame.from_records(results)
        #print(frame[0:1])
        l.append(frame)
        #print("n="+str(n))
        #print("length of l="+str(len(l)))
        n = n + 1
        if len(frame)<1:
            break
    df = pd.concat(l)
    
    return(df)

def loop(locations):
    dataList = [] #second empty list   
    
    for loc_id in locations.index:
        print("loading data for location " + str(loc_id))
        dataList.append(load_OD(loc_id))
    df = pd.concat(dataList)
    return (df)

counts_OD_raw = loop(locations[:2])



loading data for location 100057316
loading data for location 100010019


In [60]:
index=pd.to_datetime(counts_OD_raw['date'], infer_datetime_format=True)
counts_OD = counts_OD_raw.set_index(pd.DatetimeIndex(index))
counts_OD['counts'] = counts_OD['counts'].astype('int')

In [61]:
counts_OD_day = counts_OD.groupby('id')['counts'].resample('D').sum().reset_index().rename(columns={"id":"site"})
print(counts_OD_day.head())
len(counts_OD_day)

        site       date  counts
0  100010019 2016-11-21       0
1  100010019 2016-11-22     561
2  100010019 2016-11-23     624
3  100010019 2016-11-24     587
4  100010019 2016-11-25    1052


2035

## Section 4:  Download and Clean Data from EcoCounter API


### 4.1 Authorization

In [63]:
#username and pw are stored in a separate file
with open('pw.json') as json_file:
    f = json.load(json_file)
user = f['user']
pw = f['pw']

In [64]:
token_headers = {
    'Authorization': 'Basic MWJRWWJPdUdOMXdsaktNMXNKNmZtOEdLczNvYTpINW9fNF8yQWtNOUc0SlRHa1JWakdDS0NKQTBh',
    'Content-Type': 'application/x-www-form-urlencoded',
}

login_data = {
  'grant_type': 'password',
  'username': user,
  'password': pw
}

response = requests.post('https://apieco.eco-counter-tools.com/token', headers=token_headers, data=login_data)
token_dict = json.loads(response.content.decode('utf-8'))
auth = 'Bearer '+ token_dict['access_token']

### 4.2 Download Bicycle Counts from EcoCounter API

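Setting `step="15 min"` will match the 15-minute counts from Open Data; the cell below uses `step="day"` so that the results can be compared with the daily sums computed in Section 3.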

In [80]:
def load_data_EcoCounter_API(site, step):
    #authorization
    ### POST Request to acquire token
    t_headers = {
        'Authorization': 'Basic MWJRWWJPdUdOMXdsaktNMXNKNmZtOEdLczNvYTpINW9fNF8yQWtNOUc0SlRHa1JWakdDS0NKQTBh',
        'Content-Type': 'application/x-www-form-urlencoded',
    }
    t_data = {
      'grant_type': 'password',
      'username': user,
      'password': pw
    }
    t_response = requests.post('https://apieco.eco-counter-tools.com/token', headers=t_headers, data=t_data)
    token_dict = json.loads(t_response.content.decode('utf-8'))
    auth = 'Bearer ' + token_dict['access_token']

    ###GET Request to use token to download data
    end = 'https://apieco.eco-counter-tools.com/api/1.0/data/site/'
    url = end + str(site) + '?step='+ step
    headers = {
        'Accept': 'application/json',
        'Authorization': auth,
    }
    response = requests.get(url, headers=headers)
    data_dict = json.loads(response.content.decode('utf-8'))
    
    df = pd.DataFrame(data_dict)
    df = df.assign(site=site)

    return (df)

dataList_EC = []

step="day"

for site in locations.index[:2]:
    print("loading data for location " + str(site))
    dataList_EC.append(load_data_EcoCounter_API(site, step))

loading data for location 100057316
loading data for location 100010019


In [81]:
counts_EC = pd.concat(dataList_EC).set_index(['site', 'date'])['counts'].reset_index()
counts_EC.head()

,site,date,counts
0,100057316,2018-06-14T00:00:00+0000,0
1,100057316,2018-06-15T00:00:00+0000,0
2,100057316,2018-06-16T00:00:00+0000,0
3,100057316,2018-06-17T00:00:00+0000,0
4,100057316,2018-06-18T00:00:00+0000,0


## Section 5: Summarize Data and Compare Results

In [70]:
print(counts_EC.dtypes)
print(counts_OD_day.dtypes)

site       int64
date      object
counts     int64
dtype: object
site              object
date      datetime64[ns]
counts             int64
dtype: object


In [71]:
counts_EC['date'] = pd.to_datetime(counts_EC['date'], infer_datetime_format=True)
counts_OD_day['site'] = counts_OD_day['site'].astype(int)

In [72]:
print(counts_EC.dtypes)
print(counts_OD_day.dtypes)

site               int64
date      datetime64[ns]
counts             int64
dtype: object
site               int64
date      datetime64[ns]
counts             int64
dtype: object


In [73]:
print(counts_EC.shape)
print(counts_OD_day.shape)

(2111, 3)
(2035, 3)


Why are there 76 extra rows in the EcoCounter data?
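One way to approach this question is an outer join on site and date, which labels each row by the table it came from. The sketch below uses small made-up frames rather than the real downloads. The helper name `rows_by_source` and the sample values are assumptions for illustration, and it presumes both tables share the column names and dtypes shown in the cell above.

```python
import pandas as pd

def rows_by_source(ec, od):
    """Outer-join two daily tables on site and date, marking where each row came from."""
    return ec.merge(od, on=["site", "date"], how="outer",
                    suffixes=("_ec", "_od"), indicator=True)

# Made-up daily counts: the EcoCounter table has one extra day and one differing value.
ec = pd.DataFrame({
    "site": [1, 1, 1],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "counts": [5, 6, 7],
})
od = pd.DataFrame({
    "site": [1, 1],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "counts": [5, 9],
})

merged = rows_by_source(ec, od)
only_ec = merged[merged["_merge"] == "left_only"]  # days present only in EcoCounter
mismatch = merged[(merged["_merge"] == "both")
                  & (merged["counts_ec"] != merged["counts_od"])]  # shared days that disagree
print(len(only_ec), len(mismatch))
```

Splitting the result this way separates rows that exist in only one source from rows that exist in both but carry different counts.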

In [74]:
print(counts_EC.date.max())
print(counts_OD_day.date.max())

2020-07-23 00:00:00
2020-06-14 00:00:00


In [75]:
print(counts_EC.date.min())
print(counts_OD_day.date.min())

2016-11-22 00:00:00
2016-11-21 00:00:00


In [76]:
print(counts_EC[counts_EC['date'].dt.year > 2018][:10])
print(counts_OD_day[counts_OD_day['date'].dt.year > 2018][:10])

          site       date  counts
201  100057316 2019-01-01       0
202  100057316 2019-01-02       0
203  100057316 2019-01-03       0
204  100057316 2019-01-04       0
205  100057316 2019-01-05       0
206  100057316 2019-01-06       0
207  100057316 2019-01-07       0
208  100057316 2019-01-08       0
209  100057316 2019-01-09       0
210  100057316 2019-01-10       0
          site       date  counts
771  100010019 2019-01-01     911
772  100010019 2019-01-02    1318
773  100010019 2019-01-03    1386
774  100010019 2019-01-04    1486
775  100010019 2019-01-05     746
776  100010019 2019-01-06    1387
777  100010019 2019-01-07    1218
778  100010019 2019-01-08    1297
779  100010019 2019-01-09    1415
780  100010019 2019-01-10    1108


In [84]:
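#note: counts_OD holds only the two locations loaded in Section 3, so this slice is not limited to the Ed Koch Queensboro Bridge counter (id 100009428)
EdKoch_OpenData_Jan2019 = counts_OD.loc['2019-01']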
EdKoch_OpenData_Jan2019

date,counts,date,id,status
2019-01-01 00:00:00,0,2019-01-01T00:00:00.000,100057316,0
2019-01-01 00:15:00,0,2019-01-01T00:15:00.000,100057316,0
2019-01-01 00:30:00,0,2019-01-01T00:30:00.000,100057316,0
2019-01-01 00:45:00,0,2019-01-01T00:45:00.000,100057316,0
2019-01-01 01:00:00,0,2019-01-01T01:00:00.000,100057316,0
2019-01-01 01:15:00,0,2019-01-01T01:15:00.000,100057316,0
2019-01-01 01:30:00,0,2019-01-01T01:30:00.000,100057316,0
2019-01-01 01:45:00,0,2019-01-01T01:45:00.000,100057316,0
2019-01-01 02:00:00,0,2019-01-01T02:00:00.000,100057316,0
2019-01-01 02:15:00,0,2019-01-01T02:15:00.000,100057316,0


In [94]:
#Compare to manual download

EdKoch_EcoCounter_Jan2019 = pd.read_csv("EdKoch_2019-01_15min.csv")
date = pd.to_datetime(EdKoch_EcoCounter_Jan2019['Date'])
EdKoch_EcoCounter_Jan2019 = EdKoch_EcoCounter_Jan2019.set_index(date)
EdKoch_EcoCounter_Jan2019

Date,Date,Ed Koch Queensboro Bridge Shared Path
2019-01-01 00:00:00+00:00,2019-01-01-00-00-00,5.0
2019-01-01 00:00:00+00:00,2019-01-01-00-15-00,5.0
2019-01-01 00:00:00+00:00,2019-01-01-00-30-00,3.0
2019-01-01 00:00:00+00:00,2019-01-01-00-45-00,4.0
2019-01-01 01:00:00+00:00,2019-01-01-01-00-00,5.0
2019-01-01 01:00:00+00:00,2019-01-01-01-15-00,3.0
2019-01-01 01:00:00+00:00,2019-01-01-01-30-00,5.0
2019-01-01 01:00:00+00:00,2019-01-01-01-45-00,5.0
2019-01-01 02:00:00+00:00,2019-01-01-02-00-00,8.0
2019-01-01 02:00:00+00:00,2019-01-01-02-15-00,3.0
