# NYC Automated Bicycle Counts
June 29, 2020
Alice Friedman

This code will download, summarize, and clean data collected in NYC via automatated bike counteres and available to the public via NYC Open Data portal. The count data and location data are from two seperate tables, which are joined in this code.

In [1]:
import urllib.request, json, requests
import pandas as pd

## Method

Automated counter location names, ids, and other data are stored in a table available here.
 
 * https://data.cityofnewyork.us/Transportation/Bicycle-Counters/smn3-rzf9

For the purposes of this analysis we will only use the table to match location names to ids, which is the key in the bike count table.

For locations with multiple counters or where multiple counters have been used over a period of years (e.g. Manhattan Bridge), a summary count (e.g. counts in both directions and for all periods counted) is stored in an id with `sens==0`.  The list of locations with these complete counts is then used to call to the API to download counts, which are collected in 15-minute increments, here:

* https://data.cityofnewyork.us/Transportation/Bicycle-Counts/uczf-rk3c

Counts are then cleaned to assign relevant data types (e.g. dates are stored as timestamps rather than text) and then summed by month.

Finally, partial year data (the first year any counter is available as well as h 

In [2]:
#from open data
locations_url = 'https://data.cityofnewyork.us/resource/smn3-rzf9.csv'
locations_raw = pd.read_csv(locations_url)

In [3]:
#create & clean table of counter locations
locations = locations_raw[['name', 'id', 'sens', 'counter']]
locations = locations[locations['sens']==0] #includes just the sum of all counts at a location
locations = locations[~locations['name'].str.contains("Interference")] #selects out calibration counters
locations = locations[locations['counter'].notnull()] #selects only active counters
locations['id'] = locations['id'].astype(str)

print(locations.dtypes)
locations

name       object
id         object
sens        int64
counter    object
dtype: object


Unnamed: 0,name,id,sens,counter
5,8th Ave at 50th St.,100057316,0,Y2H18055363
7,Kent Avenue Bike Path,100010019,0,Y2H13094302
10,Prospect Park West,100009425,0,Y2H13094304
14,Ed Koch Queensboro Bridge Shared Path,100009428,0,Y2H19111445
15,1st Avenue - 26th St N,100010020,0,Y2H18044984
16,Columbus Ave at 86th St.,100057320,0,Y2H18055356
17,Manhattan Bridge Display Bike Counter,100047029,0,Y2H17062567
18,Staten Island Ferry,100010017,0,Y2H13094300
19,Manhattan Bridge Ped Path,100009426,0,Y2H13074107
20,Broadway at 50th St,100057318,0,Y2H18055362


In [4]:
#this section creates and runs a function to page through the OpenDate API to get all counts for the 
#locations in the table above

# make sure to install these packages before running:
# pip install sodapy

from sodapy import Socrata

client = Socrata("data.cityofnewyork.us", None) #none refers to token -- none required for public data


#functon to page through data and load data based on id
#returns a data frame  containing all available counts for given id

def load_data(loc_id):
    l = [] #empty dataframe

    n=0 #set counter
    loc = 'id=' + str(loc_id)
    lim=500000 #limit on API

    while True:
    # First 500000 results (max), returned as JSON from API / converted to Python list of
    # dictionaries by sodapy.
        results = client.get("uczf-rk3c", limit=lim, offset=lim*n, where=loc)
        frame = pd.DataFrame.from_records(results)
        #print(frame[0:1])
        l.append(frame)
        #print("n="+str(n))
        #print("length of l="+str(len(l)))
        n = n + 1
        if len(frame)<1:
            break
    
    return (pd.concat(l))

# use function to create list of dataframes for each id

dataList = []
for loc_id in locations['id']:
    print("loading data for location " + str(loc_id))
    dataList.append(load_data(loc_id))  



loading data for location 100057316
loading data for location 100010019
loading data for location 100009425
loading data for location 100009428
loading data for location 100010020
loading data for location 100057320
loading data for location 100047029
loading data for location 100010017
loading data for location 100009426
loading data for location 100057318
loading data for location 100010022
loading data for location 100010018
loading data for location 100057319
loading data for location 100009427


In [5]:
#load interference dates (mannually entered as CSV from metadata in Open Data)
#pull from GitHub
#filter out data from date 
calibration_date_raw = pd.read_csv('https://raw.githubusercontent.com/aliceafriedman/BikeCounters/master/FilteredLoc.csv')

#table of dates for locations with known calibration starts
calibration_date = pd.DataFrame(calibration_date_raw.dropna())
#calibration_date['id'] = calibration_date['id'].astype(str)

c_date = pd.to_datetime(calibration_date['filterBefore'], infer_datetime_format=True)

c_dict = dict(zip(calibration_date['id'], c_date))

print(c_dict)

{100010020: Timestamp('2016-11-01 00:00:00'), 100057320: Timestamp('2019-12-05 00:00:00'), 100047029: Timestamp('2018-08-23 00:00:00'), 100057318: Timestamp('2019-12-05 00:00:00'), 100057319: Timestamp('2019-12-05 00:00:00'), 100057316: Timestamp('2019-12-05 00:00:00'), 100010019: Timestamp('2016-12-13 00:00:00')}


In [6]:
# filters out data before calibration date, if applicable, before concatenating data from each location
#doing this with a list because different locations have different filterBefore dates
filtered_counts = []
for i in range(len(dataList)):
    k = dataList[i]['id'][0]
    if k in c_dict:
        f_date = c_dict[k]
        dataList[i]['date'] = pd.to_datetime(dataList[i]['date'], infer_datetime_format=True)
        cond = dataList[i]['date'] > f_date
        filtered_counts.append(dataList[i][cond])
        #dataList[i] = dataList[i][]
    else:
        filtered_counts.append(dataList[i])
        
calibrated_counts_raw = pd.concat(filtered_counts)

In [7]:
#more data cleaning

#set index to pandas DateTime format
date = pd.to_datetime(calibrated_counts_raw['date'], infer_datetime_format=True)
counts = calibrated_counts_raw.set_index(date)

#select counts, id
counts = counts[['id', 'counts']]
#counts['date'] = date

#correct data type to int
counts['counts'] = counts['counts'].astype(int)
#counts['id'] = counts['id'].astype(int) #to match locations dtype

In [8]:
counts.head()

Unnamed: 0_level_0,id,counts
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2020-06-14 19:45:00,100057316,12
2020-06-14 19:30:00,100057316,8
2020-06-14 19:15:00,100057316,13
2020-06-14 19:00:00,100057316,20
2020-06-14 18:45:00,100057316,22


In [9]:
#resample and sum by month
m_counts = counts.groupby('id').resample('M').sum().reset_index()
m_counts= m_counts.set_index(pd.to_datetime(m_counts['date'], infer_datetime_format=True))
print(m_counts.dtypes)
m_counts.head()

id                object
date      datetime64[ns]
counts             int64
dtype: object


Unnamed: 0_level_0,id,date,counts
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2016-11-30,100009425,2016-11-30,38272
2016-12-31,100009425,2016-12-31,35955
2017-01-31,100009425,2017-01-31,32039
2017-02-28,100009425,2017-02-28,36430
2017-03-31,100009425,2017-03-31,35263


In [10]:
#reset indices to id
monthly_counts = m_counts.set_index('id')
locs = locations.set_index(locations['id'])
locs = locs[['name']]

#join to locations table get location name
monthly_counts = pd.concat([monthly_counts, locs], axis=1, join='inner')
monthly_counts['date'] = monthly_counts['date'].dt.to_period('M')

monthly_counts.head()

Unnamed: 0_level_0,date,counts,name
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
100009425,2016-11,38272,Prospect Park West
100009425,2016-12,35955,Prospect Park West
100009425,2017-01,32039,Prospect Park West
100009425,2017-02,36430,Prospect Park West
100009425,2017-03,35263,Prospect Park West


In [11]:
#write table
monthly_counts.to_csv("monthly_counts_clean.csv")

# Remove partial years of data
This section removes partial years of data by removing the first (always partial) year of data for each location as well as the current year.

There are counts missing for part of 2014 at Brooklyn Bridge... not sure what the story is there.

In [18]:
from datetime import datetime

def remove_first_yr_and_current_yr(df):
    print (str(len(df)) + " rows in initial data")
    allNames = df['name'].unique() #list of unique names in DF
    l = [] #empty list
    i = 0 #set counter
    for name in allNames:
        l.append(df[df['name']==name]) #seperate dataframe into list by name
        data = l[i]
        first_year = data['date'].min().year #stores first year of data for each location
        
        #condition
        remove_first_yr = data['date'].dt.year > first_year
        remove_current_year = data['date'].dt.year < datetime.today().year
        
        #filter each dataframe for conditions
        l[i] = data[remove_first_yr & remove_current_year]        
        #print("removing partial data from " + name + " for year " + str(first_year))
    
        i += 1 #increment counter
    
    result = pd.concat(l) # recombine filtered lists into df
    print(str(len(result)) + " rows returned")
    
    result = result.sort_values(by=['date']) #sort ascending
    
    return result # returns dataframe


full_yr_monthly_counts = remove_first_yr_and_current_yr(monthly_counts) 

print(full_yr_monthly_counts.head())
print(full_yr_monthly_counts.tail())

651 rows in initial data
516 rows returned
              date  counts                                   name
id                                                               
100009428  2014-01   27048  Ed Koch Queensboro Bridge Shared Path
100010022  2014-01   14579              Brooklyn Bridge Bike Path
100010020  2014-01       0                 1st Avenue - 26th St N
100009427  2014-01   41692          Williamsburg Bridge Bike Path
100009426  2014-01    1215              Manhattan Bridge Ped Path
              date  counts                                   name
id                                                               
100009428  2019-12   57951  Ed Koch Queensboro Bridge Shared Path
100010020  2019-12       0                 1st Avenue - 26th St N
100009425  2019-12   41080                     Prospect Park West
100010018  2019-12   22734                         Pulaski Bridge
100057316  2019-12   75841                    8th Ave at 50th St.


Code by Alice Friedman 06-26-2020