# NYC Automated Bicycle Counts
June 29, 2020; revised July 22, 2020
Alice Friedman

This code downloads, cleans, and summarizes data collected in NYC by automated bike counters and published on the NYC Open Data portal. The count data and location data come from two separate tables, which are joined in this code.

The cells below demonstrate that pulling the same counts from the Open Data API and from the Eco-Counter API yields *different results*.


## Section 1: Setup

In [2]:
# make sure to install these packages before running:
import urllib.request, json, requests, certifi
import pandas as pd
from datetime import datetime
from sodapy import Socrata

import matplotlib.pyplot as plt

## Section 2:  Download Location Data from Open Data

Automated counter location names, ids, and other data are stored in a table available here.
 
 * https://data.cityofnewyork.us/Transportation/Bicycle-Counters/smn3-rzf9

For the purposes of this analysis we use the table only to match location names to ids, which are the key in the bike count table. Other data, such as lat/long, are also available.

For locations with multiple counters, or where multiple counters have been used over a period of years (e.g. Manhattan Bridge), a summary count (e.g. counts in both directions and for all periods counted) is stored under an id with `sens==0`. The list of locations with these complete counts is then used to call the API to download counts, which are collected in 15-minute increments, here:

* https://data.cityofnewyork.us/Transportation/Bicycle-Counts/uczf-rk3c

Counts are then cleaned to assign relevant data types (e.g. dates are stored as timestamps rather than text) and then summed by month.

Finally, partial-year data (the first year any counter is available, as well as the current year) is removed from the data set.
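The monthly summing and partial-year removal described above can be sketched on toy data. This is a minimal illustration, not the notebook's own code: the site id, the made-up counts, and the choice to treat the latest year in the data as the "current year" are all assumptions.

```python
import pandas as pd

# Toy 15-minute counts for one hypothetical site, covering parts of 2018 and 2020
# and all of 2019. Every interval gets a count of 1 so the sums are easy to check.
idx = pd.date_range("2018-11-01", "2020-02-29 23:45", freq="15min", name="date")
counts = pd.DataFrame({"site": 1, "counts": 1}, index=idx)

# Sum the 15-minute counts by calendar month for each site.
monthly = (
    counts.groupby("site")["counts"]
    .resample("MS")  # "MS" labels each bin by the first day of the month
    .sum()
    .reset_index()
)

# Drop partial years: the first year in the data and (by assumption) the latest one.
years = monthly["date"].dt.year
full_years = monthly[(years > years.min()) & (years < years.max())]

print(full_years)
```

Only the twelve months of 2019 remain, since 2018 and 2020 are incomplete in the toy data.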

### 2.1 Download & Format Locations Table

In [3]:
#from open data
locations_url = 'https://data.cityofnewyork.us/resource/smn3-rzf9.csv'
locations_raw = pd.read_csv(locations_url)

In [4]:
#create & clean table of counter locations
locations = locations_raw[['name', 'id', 'sens', 'counter']]
locations = locations[locations['sens']==0] #keeps only the summary row (sum of all counts) for each location
locations = locations[~locations['name'].str.contains("Interference")] #excludes calibration counters
locations = locations[locations['counter'].notnull()].copy() #keeps only active counters; copy avoids SettingWithCopyWarning
locations['site'] = locations['id'].astype(str)

#exclude 1st Ave (known to have a lot of interference)
locations = locations[locations.name != '1st Avenue - 26th St N']

#set index as id
locations = locations.set_index('id')

print(len(locations))
print(locations.dtypes)
locations

13
name       object
sens        int64
counter    object
site       object
dtype: object


| id | name | sens | counter | site |
| --- | --- | --- | --- | --- |
| 100009428 | Ed Koch Queensboro Bridge Shared Path | 0 | Y2H19111445 | 100009428 |
| 100057320 | Columbus Ave at 86th St. | 0 | Y2H18055356 | 100057320 |
| 100047029 | Manhattan Bridge Display Bike Counter | 0 | Y2H17062567 | 100047029 |
| 100010017 | Staten Island Ferry | 0 | Y2H13094300 | 100010017 |
| 100009426 | Manhattan Bridge Ped Path | 0 | Y2H13074107 | 100009426 |
| 100057318 | Broadway at 50th St | 0 | Y2H18055362 | 100057318 |
| 100010022 | Brooklyn Bridge Bike Path | 0 | Y2H13074106 | 100010022 |
| 100010018 | Pulaski Bridge | 0 | Y2H13094301 | 100010018 |
| 100057319 | Amsterdam Ave at 86th St. | 0 | Y2H18055357 | 100057319 |
| 100009427 | Williamsburg Bridge Bike Path | 0 | Y2H13074108 | 100009427 |


## Section 3: Load Count Data from Open Data API

This section defines and runs a function that pages through the Open Data API to get all counts for the
locations in the table above. Each page is collected as a data frame in a list, and the list is then concatenated into a single data frame per location.
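The paging pattern itself (request a page, advance the offset, stop on an empty page) can be shown without touching the network. In this sketch `fetch_all` and `fake_get` are hypothetical names; `fake_get` only stands in for a `limit`/`offset` style request and is not part of sodapy.

```python
def fetch_all(get_page, page_size):
    """Collect rows from an offset-paged source until an empty page comes back."""
    rows = []
    offset = 0
    while True:
        page = get_page(limit=page_size, offset=offset)
        if not page:  # an empty page means everything has been read
            break
        rows.extend(page)
        offset += page_size
    return rows


# Hypothetical stand-in for the API: a small in-memory table served in slices.
fake_table = [{"id": 100009428, "counts": n} for n in range(7)]


def fake_get(limit, offset):
    return fake_table[offset:offset + limit]


all_rows = fetch_all(fake_get, page_size=3)
print(len(all_rows))
```

With a page size of 3 the seven rows arrive in three non-empty pages, and a fourth, empty page ends the loop.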

In [5]:
client = Socrata("data.cityofnewyork.us", None) #None refers to the app token -- none required for public data
data_id = "uczf-rk3c" #dataset id for the Bicycle Counts table

#function to page through data and load data based on id
def load_OD(loc_id):
    l = [] #empty list
    
    n=0 #set counter
    loc = 'id=' + str(loc_id)
    lim=500000 #limit on API

    while True:
        # Request up to 500000 results (max) per page, returned as JSON from the API
        # and converted to a Python list of dictionaries by sodapy.
        results = client.get(data_id, limit=lim, offset=lim*n, where=loc)
        frame = pd.DataFrame.from_records(results)
        #print(frame[0:1])
        l.append(frame)
        #print("n="+str(n))
        #print("length of l="+str(len(l)))
        n = n + 1
        if len(frame)<1:
            break
    df = pd.concat(l)
    
    return(df)

def loop(locations):
    dataList = [] #second empty list   
    
    for loc_id in locations.index:
        print("loading data for location " + str(loc_id))
        dataList.append(load_OD(loc_id))
    df = pd.concat(dataList)
    return (df)

counts_OD_raw = loop(locations[:2])



loading data for location 100009428
loading data for location 100057320


In [None]:
index=pd.to_datetime(counts_OD_raw['date'], infer_datetime_format=True)
counts_OD = counts_OD_raw.set_index(pd.DatetimeIndex(index))
counts_OD['counts'] = counts_OD['counts'].astype('int')

In [None]:
counts_OD_day = counts_OD.groupby('id')['counts'].resample('D').sum().reset_index().rename(columns={"id":"site"})
print(counts_OD_day.head())
len(counts_OD_day)

## Section 4:  Download and Clean Data from EcoCounter API


### 4.1 Authorization

In [None]:
#username and pw are stored in a separate file
with open('pw.json') as json_file:
    f = json.load(json_file)
user = f['user']
pw = f['pw']

In [None]:
token_headers = {
    'Authorization': 'Basic MWJRWWJPdUdOMXdsaktNMXNKNmZtOEdLczNvYTpINW9fNF8yQWtNOUc0SlRHa1JWakdDS0NKQTBh',
    'Content-Type': 'application/x-www-form-urlencoded',  # separate header, not part of the Authorization value
}

login_data = {
  'grant_type': 'password',
  'username': user,
  'password': pw
}

response = requests.post('https://apieco.eco-counter-tools.com/token', headers=token_headers, data=login_data)
token_dict = json.loads(response.content.decode('utf-8'))
auth = 'Bearer '+ token_dict['access_token']

### 4.2 Download Bicycle Counts from EcoCounter API

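Setting `step="15 min"` will match the 15-minute resolution of the Open Data counts. The cell below uses `step="day"`, which matches the daily totals computed from the Open Data counts above.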

In [None]:
def load_data_EcoCounter_API(site, step):
    #authorization
    ### POST Request to acquire token
    t_headers = {
        'Authorization': 'Basic MWJRWWJPdUdOMXdsaktNMXNKNmZtOEdLczNvYTpINW9fNF8yQWtNOUc0SlRHa1JWakdDS0NKQTBh',
        'Content-Type': 'application/x-www-form-urlencoded',
    }
    t_data = {
      'grant_type': 'password',
      'username': user,
      'password': pw
    }
    t_response = requests.post('https://apieco.eco-counter-tools.com/token', headers=t_headers, data=t_data)
    token_dict = json.loads(t_response.content.decode('utf-8'))
    auth = 'Bearer ' + token_dict['access_token']

    ###GET Request to use token to download data
    end = 'https://apieco.eco-counter-tools.com/api/1.0/data/site/'
    url = end + str(site) + '?step='+ step
    headers = {
        'Accept': 'application/json',
        'Authorization': auth,
    }
    response = requests.get(url, headers=headers)
    data_dict = json.loads(response.content.decode('utf-8'))
    
    df = pd.DataFrame(data_dict)
    df = df.assign(site=site)

    return (df)

dataList_EC = []

step="day"

for site in locations.index[:2]:
    print("loading data for location " + str(site))
    dataList_EC.append(load_data_EcoCounter_API(site, step))

In [None]:
counts_EC = pd.concat(dataList_EC).set_index(['site', 'date'])['counts'].reset_index()
counts_EC.head()

## Section 5: Summarize Data and Compare Results

In [None]:
print(counts_EC.dtypes)
print(counts_OD_day.dtypes)

In [None]:
counts_EC['date'] = pd.to_datetime(counts_EC['date'], infer_datetime_format=True)
counts_OD_day['site'] = counts_OD_day['site'].astype(int)

In [None]:
print(counts_EC.dtypes)
print(counts_OD_day.dtypes)

In [None]:
print(counts_EC.shape)
print(counts_OD_day.shape)

Why are there an extra 80 rows in the EcoCounter data?
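One way to locate rows that exist in only one source is an outer merge with pandas' `indicator=True` option. The sketch below uses made-up toy frames (`ec` and `od` are hypothetical stand-ins for `counts_EC` and `counts_OD_day`) and is not a result from the real data.

```python
import pandas as pd

# Toy daily counts: the "EC" frame has five days, the "OD" frame is missing two.
dates = pd.date_range("2019-01-01", periods=5, freq="D")
ec = pd.DataFrame({"site": 1, "date": dates, "counts": [5, 6, 7, 8, 9]})
od = ec.iloc[[0, 1, 3]].copy()

# The outer merge keeps every (site, date) pair; the _merge column records
# whether each pair was found in the left frame, the right frame, or both.
merged = ec.merge(
    od, on=["site", "date"], how="outer", suffixes=("_EC", "_OD"), indicator=True
)
only_ec = merged[merged["_merge"] == "left_only"]

print(only_ec[["site", "date", "counts_EC"]])
```

Applied to the real frames, the `left_only` rows would list the site and date of each extra EcoCounter row.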

In [None]:
print(counts_EC.date.max())
print(counts_OD_day.date.max())

In [None]:
print(counts_EC.date.min())
print(counts_OD_day.date.min())

In [None]:
print(counts_EC[counts_EC['date'].dt.year > 2018][:10])
print(counts_OD_day[counts_OD_day['date'].dt.year > 2018][:10])

As an example, compare the 2019 counts for the Ed Koch Queensboro Bridge (QBB) from Open Data against a manual pull from Eco-Counter.

In [57]:
#EdKoch Bridge ID is 100009428
#Reload data from Open Data API for QBB
EdKoch_OpenData = load_OD(100009428)
EdKoch_OpenData['counts'] = EdKoch_OpenData['counts'].astype(int)
EdKoch_OpenData['date'] = pd.to_datetime(EdKoch_OpenData['date'], infer_datetime_format=True) #EdKoch_OpenData_raw is never defined in this notebook
EdKoch_OpenData['year'] = EdKoch_OpenData['date'].dt.year
EdKoch_OpenData_2019 = EdKoch_OpenData[EdKoch_OpenData['year']==2019].sort_values('date').drop('status', axis=1)
EdKoch_OpenData_2019 = EdKoch_OpenData_2019.reset_index(drop=True) 

print(EdKoch_OpenData_2019.dtypes)
EdKoch_OpenData_2019.head()

counts             int64
date      datetime64[ns]
id                object
year               int64
dtype: object


|  | counts | date | id | year |
| --- | --- | --- | --- | --- |
| 0 | 1 | 2019-01-01 00:00:00 | 100009428 | 2019 |
| 1 | 2 | 2019-01-01 00:15:00 | 100009428 | 2019 |
| 2 | 7 | 2019-01-01 00:30:00 | 100009428 | 2019 |
| 3 | 6 | 2019-01-01 00:45:00 | 100009428 | 2019 |
| 4 | 10 | 2019-01-01 01:00:00 | 100009428 | 2019 |


In [59]:
#Import data from a csv created by a manual pull from Eco-Counter for EdKoch
EdKoch_EcoCounter_2019 = pd.read_csv("EdKoch_2019-01_15min.csv")
EdKoch_EcoCounter_2019['date'] = pd.to_datetime(EdKoch_EcoCounter_2019['Date'], infer_datetime_format=True) #EdKoch_EcoCounter_2019_raw is never defined in this notebook
EdKoch_EcoCounter_2019['year'] = EdKoch_EcoCounter_2019['date'].dt.year
EdKoch_EcoCounter_2019 = EdKoch_EcoCounter_2019[EdKoch_EcoCounter_2019['year']==2019].sort_values('date').drop('Date', axis=1)
print(EdKoch_EcoCounter_2019.dtypes)
EdKoch_EcoCounter_2019.head()

Ed Koch Queensboro Bridge Shared Path           float64
date                                     datetime64[ns]
year                                              int64
dtype: object


|  | Ed Koch Queensboro Bridge Shared Path | date | year |
| --- | --- | --- | --- |
| 0 | 5.0 | 2019-01-01 00:00:00 | 2019 |
| 1 | 5.0 | 2019-01-01 00:15:00 | 2019 |
| 2 | 3.0 | 2019-01-01 00:30:00 | 2019 |
| 3 | 4.0 | 2019-01-01 00:45:00 | 2019 |
| 4 | 5.0 | 2019-01-01 01:00:00 | 2019 |


In [60]:
#make dataframe comparison
QBB_2019_dict = {'date_OD':  EdKoch_OpenData_2019['date'], 
                 'date_EC': EdKoch_EcoCounter_2019['date'], 
                 'counts_OD': EdKoch_OpenData_2019['counts'], 
                 'counts_EC': EdKoch_EcoCounter_2019['Ed Koch Queensboro Bridge Shared Path']}

In [61]:
compare_QBB_2019 = pd.DataFrame(QBB_2019_dict)

print(compare_QBB_2019.shape)
print(compare_QBB_2019.info())

compare_QBB_2019.head()

(35040, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35040 entries, 0 to 35039
Data columns (total 4 columns):
date_OD      35040 non-null datetime64[ns]
date_EC      35040 non-null datetime64[ns]
counts_OD    35040 non-null int64
counts_EC    34924 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 1.1 MB
None


|  | date_OD | date_EC | counts_OD | counts_EC |
| --- | --- | --- | --- | --- |
| 0 | 2019-01-01 00:00:00 | 2019-01-01 00:00:00 | 1 | 5.0 |
| 1 | 2019-01-01 00:15:00 | 2019-01-01 00:15:00 | 2 | 5.0 |
| 2 | 2019-01-01 00:30:00 | 2019-01-01 00:30:00 | 7 | 3.0 |
| 3 | 2019-01-01 00:45:00 | 2019-01-01 00:45:00 | 6 | 4.0 |
| 4 | 2019-01-01 01:00:00 | 2019-01-01 01:00:00 | 10 | 5.0 |


In [70]:
compare_QBB_2019['date_OD'] != compare_QBB_2019['date_EC']
compare_QBB_2019[compare_QBB_2019['date_OD'] != compare_QBB_2019['date_EC']]

|  | date_OD | date_EC | counts_OD | counts_EC |
| --- | --- | --- | --- | --- |
| 6536 | 2019-03-10 03:00:00 | 2019-03-10 02:00:00 | 7 | 0.0 |
| 6537 | 2019-03-10 03:15:00 | 2019-03-10 02:15:00 | 3 | 0.0 |
| 6538 | 2019-03-10 03:30:00 | 2019-03-10 02:30:00 | 4 | 0.0 |
| 6539 | 2019-03-10 03:45:00 | 2019-03-10 02:45:00 | 5 | 0.0 |
| 6540 | 2019-03-10 04:00:00 | 2019-03-10 03:00:00 | 2 | 2.0 |
| 6541 | 2019-03-10 04:15:00 | 2019-03-10 03:15:00 | 4 | 5.0 |
| 6542 | 2019-03-10 04:30:00 | 2019-03-10 03:30:00 | 6 | 3.0 |
| 6543 | 2019-03-10 04:45:00 | 2019-03-10 03:45:00 | 8 | 4.0 |
| 6544 | 2019-03-10 05:00:00 | 2019-03-10 04:00:00 | 4 | 2.0 |
| 6545 | 2019-03-10 05:15:00 | 2019-03-10 04:15:00 | 0 | 2.0 |


In [86]:
#Are there duplicates?
compare_QBB_2019['date_OD'].duplicated().sum()

4

In [85]:
compare_QBB_2019['date_EC'].duplicated().sum()

0
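The first mismatch falls on 2019-03-10, the date US daylight saving time began in 2019, and the Open Data timestamps contain exactly four duplicates. One possible explanation, offered here as an assumption rather than something the data sources confirm, is that the Open Data timestamps are local wall-clock times with the time zone dropped. The sketch below builds one year of 15-minute steps in the `America/New_York` zone and strips the zone to show what that would look like.

```python
import pandas as pd

# 35040 consecutive 15-minute steps starting at local midnight on 2019-01-01.
local = pd.date_range(
    "2019-01-01", periods=35040, freq="15min", tz="America/New_York"
)

# Drop the time zone but keep the local wall-clock reading of each step.
wall_clock = local.tz_localize(None)

# Fall back (2019-11-03): the 01:00 hour occurs twice on the wall clock.
n_duplicates = int(wall_clock.duplicated().sum())

# Spring forward (2019-03-10): the 02:00 hour never occurs on the wall clock.
has_2am = pd.Timestamp("2019-03-10 02:00") in wall_clock

print(n_duplicates)
print(has_2am)
print(wall_clock[6536])
```

In this sketch the wall-clock series skips from 01:45 to 03:00 at position 6536 and repeats four timestamps in November, which is the same pattern seen in `date_OD` above. If that explanation holds, comparing the two sources by row position will misalign every row after the spring transition, and joining on a timezone-aware timestamp would be a safer comparison.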