# NYC Automated Bicycle Counts
June 29, 2020
Alice Friedman

This code will download, summarize, and clean data collected in NYC via automatated bike counteres and available to the public via NYC Open Data portal. The count data and location data are from two seperate tables, which are joined in this code.

In [11]:
# make sure to install these packages before running:

import urllib.request, json, requests, certifi
import pandas as pd
from datetime import datetime
from sodapy import Socrata

import certifi


## Method

Automated counter location names, ids, and other data are stored in a table available here.
 
 * https://data.cityofnewyork.us/Transportation/Bicycle-Counters/smn3-rzf9

For the purposes of this analysis we will only use the table to match location names to ids, which is the key in the bike count table. Other data, such as lat/long, is also available.

For locations with multiple counters or where multiple counters have been used over a period of years (e.g. Manhattan Bridge), a summary count (e.g. counts in both directions and for all periods counted) is stored in an id with `sens==0`.  The list of locations with these complete counts is then used to call to the API to download counts, which are collected in 15-minute increments, here:

* https://data.cityofnewyork.us/Transportation/Bicycle-Counts/uczf-rk3c

Counts are then cleaned to assign relevant data types (e.g. dates are stored as timestamps rather than text) and then summed by month.

Finally, partial year data (the first year any counter is available as well as the current year) is removed from teh data set.

In [55]:
user = 'nyc_dot_api'
pw ='MLkF$bhP0s%z'

token_headers = {
    'Authorization': 'Basic MWJRWWJPdUdOMXdsaktNMXNKNmZtOEdLczNvYTpINW9fNF8yQWtNOUc0SlRHa1JWakdDS0NKQTBh, \
     Content-Type: application/x-www-form-urlencoded',
}

login_data = {
  'grant_type': 'password',
  'username': user,
  'password': pw
}

response = requests.post('https://apieco.eco-counter-tools.com/token', headers=token_headers, data=login_data)
token_dict = json.loads(response.content.decode('utf-8'))
auth = 'Bearer '+ token_dict['access_token']

'ed4ed16ecb1289abe9aac589ddf8ed7'

### Locations table

In [57]:
headers = {
    'Accept': 'application/json',
    'Authorization': auth,
}

response = requests.get('https://apieco.eco-counter-tools.com/api/1.0/counter', headers=headers)


print(response.content)

b'[{"serial":"COM19050190","gsm":"+337000020256620","iccid":"89332401000011626484","articleCode":"1600","softVersion":"02.05","hardVersion":"Cc","expiditionDate":"2019-05-27T00:00:00-0400"},{"serial":"REC19050200","gsm":null,"iccid":null,"articleCode":"1599","softVersion":"01.00","hardVersion":"Bb","expiditionDate":"2019-05-27T00:00:00-0400"},{"serial":"Y2G12072545","gsm":"+33612303497","iccid":"89331042110036755502","articleCode":null,"softVersion":"2.2xd","hardVersion":"G_x2_2d","expiditionDate":"2012-07-30T00:00:00-0400"},{"serial":"Y2H13074105","gsm":"+33671472685","iccid":"2040300145004","articleCode":null,"softVersion":"2.9xf","hardVersion":"i","expiditionDate":"2013-07-29T00:00:00-0400"},{"serial":"Y2H13074106","gsm":"+33671472378","iccid":"2040300145012","articleCode":null,"softVersion":"2.9xf","hardVersion":"i","expiditionDate":"2013-07-29T00:00:00-0400"},{"serial":"Y2H13074107","gsm":"+33671472303","iccid":"2040300145020","articleCode":null,"softVersion":"2.9xf","hardVersion"

In [73]:
#from open data
locations_url = 'https://data.cityofnewyork.us/resource/smn3-rzf9.csv'
locations_raw = pd.read_csv(locations_url)

In [74]:
#create & clean table of counter locations
locations = locations_raw[['name', 'id', 'sens', 'counter']]
locations = locations[locations['sens']==0] #includes just the sum of all counts at a location
locations = locations[~locations['name'].str.contains("Interference")] #selects out calibration counters
locations = locations[locations['counter'].notnull()] #selects only active counters
locations['id'] = locations['id'].astype(str)

#exclude 1st Ave (known to haev a lot of interference)
locations = locations[locations.name != '1st Avenue - 26th St N']

#set index as id
locations = locations.set_index('id')

print(len(locations))
print(locations.dtypes)
locations

13
name       object
sens        int64
counter    object
dtype: object


Unnamed: 0_level_0,name,sens,counter
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
100057316,8th Ave at 50th St.,0,Y2H18055363
100010019,Kent Avenue Bike Path,0,Y2H13094302
100009425,Prospect Park West,0,Y2H13094304
100009428,Ed Koch Queensboro Bridge Shared Path,0,Y2H19111445
100057320,Columbus Ave at 86th St.,0,Y2H18055356
100047029,Manhattan Bridge Display Bike Counter,0,Y2H17062567
100010017,Staten Island Ferry,0,Y2H13094300
100009426,Manhattan Bridge Ped Path,0,Y2H13074107
100057318,Broadway at 50th St,0,Y2H18055362
100010022,Brooklyn Bridge Bike Path,0,Y2H13074106


### Bicycle Counts from API

In [104]:
user = 'nyc_dot_api'
pw = 'MLkF$bhP0s%z'

### POST Request to acquire token
t_headers = {
    'Authorization': 'Basic MWJRWWJPdUdOMXdsaktNMXNKNmZtOEdLczNvYTpINW9fNF8yQWtNOUc0SlRHa1JWakdDS0NKQTBh, Content-Type: application/x-www-form-urlencoded',
}

t_data = {
  'grant_type': 'password',
  'username': user,
  'password': pw
}

t_response = requests.post('https://apieco.eco-counter-tools.com/token', headers=t_headers, data=t_data)

token_dict = json.loads(t_response.content.decode('utf-8'))

auth = 'Bearer ' + token_dict['access_token']

###GET Request to use token to download data


                      
def load_data(site, step):
    l = [] #empty dataframe

    n=0 #set counter
    loc = 'id=' + str(loc_id)
    lim=500000 #limit on API

    end = 'https://apieco.eco-counter-tools.com/api/1.0/data/site/'
    step = step #sum by (choices include day, month, year)
    site = site

    url = end+site+'?step='+step

    headers = {
        'Accept': 'application/json',
        'Authorization': auth,
    }


    response = requests.get(url, headers=headers)

    data_dict = json.loads(response.content.decode('utf-8'))
    
    df = pd.DataFrame(data_dict)
    df = df.assign(site=site)

    return (df)

# use function to create list of dataframes for each id


#df= load_data(site='100010017', step='month')
#print(df)

dataList = []
for site in locations.index:
    print("loading data for location " + str(site))
    dataList.append(load_data(site, 'day'))  

loading data for location 100057316
loading data for location 100010019
loading data for location 100009425
loading data for location 100009428
loading data for location 100057320
loading data for location 100047029
loading data for location 100010017
loading data for location 100009426
loading data for location 100057318
loading data for location 100010022
loading data for location 100010018
loading data for location 100057319
loading data for location 100009427


### Filter data prior to calibration

Certain locations have experienced known electircal intereference and were manually calibrated on a certain date. This data is located in the data dictionary for [Bicycle Counts on Open Data](https://data.cityofnewyork.us/Transportation/Bicycle-Counts/uczf-rk3c). I have manually created a table of this data which is linked on my [Bicycle Counters Repository on GitHub](https://raw.githubusercontent.com/aliceafriedman/BikeCounters).

In [105]:
#load interference dates (mannually entered as CSV from metadata in Open Data)
#pull from GitHub
#store as dict
calibration_date_raw = pd.read_csv('https://raw.githubusercontent.com/aliceafriedman/BikeCounters/master/FilteredLoc.csv')

#table of dates for locations with known calibration starts
calibration_date = pd.DataFrame(calibration_date_raw.dropna())
#calibration_date['id'] = calibration_date['id'].astype(str)

c_date = pd.to_datetime(calibration_date['filterBefore'], infer_datetime_format=True)

c_dict = dict(zip(calibration_date['id'], c_date))

print(c_dict)

{100010020: Timestamp('2016-11-01 00:00:00'), 100057320: Timestamp('2019-12-05 00:00:00'), 100047029: Timestamp('2018-08-23 00:00:00'), 100057318: Timestamp('2019-12-05 00:00:00'), 100057319: Timestamp('2019-12-05 00:00:00'), 100057316: Timestamp('2019-12-05 00:00:00'), 100010019: Timestamp('2016-12-13 00:00:00')}


### Additional data cleaning

Data is further cleaned to correct data types and select relevant fields. The table `counts` is a cleaned version of the 14 locations in 15-minute increments to which a 'day_of_week' column has been added.

In [106]:
# filters out data before calibration date, if applicable, before concatenating data from each location
#doing this with a list because different locations have different filterBefore dates
filtered_counts = []
for i in range(len(dataList)):
    k = dataList[i]['site'][0]
    if k in c_dict:
        f_date = c_dict[k]
        dataList[i]['date'] = pd.to_datetime(dataList[i]['date'], infer_datetime_format=True)
        cond = dataList[i]['date'] > f_date
        filtered_counts.append(dataList[i][cond])
        #dataList[i] = dataList[i][]
    else:
        filtered_counts.append(dataList[i])

In [107]:
counts = pd.concat(filtered_counts)

#correct data types
#counts['counts'] = counts['counts'].astype(int)
counts['datetime'] = pd.to_datetime(counts['date'], infer_datetime_format=True)
counts['date'] = counts['datetime']
counts = counts.set_index(pd.DatetimeIndex(counts['datetime']))

#drop unwanted columns
counts = counts.drop(['datetime', 'status'], axis=1)

#add day_of_week 0 = Monday 6 = Sunday
counts['day_of_week'] = counts['date'].dt.weekday

isweekday = []
for day in counts['day_of_week']:
    if day < 5:
        isweekday.append(True)
    else: 
        isweekday.append(False)
counts['isweekday'] = isweekday 


print(counts.index)
print(counts.dtypes)
counts.tail()

DatetimeIndex(['2018-06-14', '2018-06-15', '2018-06-16', '2018-06-17',
               '2018-06-18', '2018-06-19', '2018-06-20', '2018-06-21',
               '2018-06-22', '2018-06-23',
               ...
               '2020-06-26', '2020-06-27', '2020-06-28', '2020-06-29',
               '2020-06-30', '2020-07-01', '2020-07-02', '2020-07-03',
               '2020-07-04', '2020-07-05'],
              dtype='datetime64[ns]', name='datetime', length=17338, freq=None)
counts                float64
date           datetime64[ns]
site                   object
day_of_week             int64
isweekday                bool
dtype: object


Unnamed: 0_level_0,counts,date,site,day_of_week,isweekday
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020-07-01,7510.0,2020-07-01,100009427,2,True
2020-07-02,9486.0,2020-07-02,100009427,3,True
2020-07-03,6694.0,2020-07-03,100009427,4,True
2020-07-04,8524.0,2020-07-04,100009427,5,False
2020-07-05,6813.0,2020-07-05,100009427,6,False


In [108]:
counts.head()

Unnamed: 0_level_0,counts,date,site,day_of_week,isweekday
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2018-06-14,0.0,2018-06-14,100057316,3,True
2018-06-15,0.0,2018-06-15,100057316,4,True
2018-06-16,0.0,2018-06-16,100057316,5,False
2018-06-17,0.0,2018-06-17,100057316,6,False
2018-06-18,0.0,2018-06-18,100057316,0,True


### Sum by month

In [129]:
#create function to sum by any period

def sum_by_period(counts, period):
    p_counts_total = counts[['site', 'counts']].groupby('site').resample(period).sum().reset_index()
    index = pd.MultiIndex.from_tuples(zip(p_counts_total['site'], p_counts_total['datetime']))
    p_counts_total = p_counts_total.set_index(index)
    print(p_counts_total.dtypes)
    print(p_counts_total.head(10))
    
    
    return p_counts_total
 
m_counts_total = sum_by_period(counts, "M")

site                object
datetime    datetime64[ns]
counts             float64
dtype: object
                           site   datetime   counts
100009425 2016-11-30  100009425 2016-11-30  38213.0
          2016-12-31  100009425 2016-12-31  35927.0
          2017-01-31  100009425 2017-01-31  32093.0
          2017-02-28  100009425 2017-02-28  36401.0
          2017-03-31  100009425 2017-03-31  35274.0
          2017-04-30  100009425 2017-04-30  63333.0
          2017-05-31  100009425 2017-05-31  68600.0
          2017-06-30  100009425 2017-06-30  83469.0
          2017-07-31  100009425 2017-07-31  85577.0
          2017-08-31  100009425 2017-08-31  81196.0


### Sum weekend & weekday by month

In [111]:
wkend_counts = counts[counts['isweekday'] == False]
wday_counts = counts[counts['isweekday'] == True]

m_counts_wkend = wkend_counts[['site', 'counts']].rename(columns={'counts':'weekend_counts'}).groupby('site').resample('M').sum()
m_counts_wday = wday_counts[['site', 'counts']].rename(columns={'counts':'weekday_counts'}).groupby('site').resample('M').sum()

monthly_counts = pd.concat([m_counts_total, m_counts_wkend, m_counts_wday], axis=1, join='inner')
monthly_counts.head()

Unnamed: 0,Unnamed: 1,site,datetime,counts,weekend_counts,weekday_counts
100010017,2020-06-30,100010017,2020-06-30,13006.0,3297.0,9709.0
100009428,2020-01-31,100009428,2020-01-31,74596.0,14512.0,60084.0
100009428,2016-10-31,100009428,2016-10-31,114531.0,25618.0,88913.0
100009428,2015-04-30,100009428,2015-04-30,104776.0,26821.0,77955.0
100010018,2018-02-28,100010018,2018-02-28,26789.0,4447.0,22342.0


### Join to `locations` to add location name

In [112]:
m_counts = monthly_counts.set_index('site')
monthly_counts_named = pd.concat([m_counts, locations], axis=1, join='inner')

print(monthly_counts_named.dtypes)
monthly_counts_named.head()

datetime          datetime64[ns]
counts                   float64
weekend_counts           float64
weekday_counts           float64
name                      object
sens                       int64
counter                   object
dtype: object


Unnamed: 0,datetime,counts,weekend_counts,weekday_counts,name,sens,counter
100010017,2020-06-30,13006.0,3297.0,9709.0,Staten Island Ferry,0,Y2H13094300
100009428,2020-01-31,74596.0,14512.0,60084.0,Ed Koch Queensboro Bridge Shared Path,0,Y2H19111445
100009428,2016-10-31,114531.0,25618.0,88913.0,Ed Koch Queensboro Bridge Shared Path,0,Y2H19111445
100009428,2015-04-30,104776.0,26821.0,77955.0,Ed Koch Queensboro Bridge Shared Path,0,Y2H19111445
100010018,2018-02-28,26789.0,4447.0,22342.0,Pulaski Bridge,0,Y2H13094301


# Remove partial years of data
This section removes partial years of data by removing the first (always partial) year of data for each location as well as the current year.

There are counts missing for part of 2014 at Brooklyn Bridge and 0s for some of 8th Ave... not sure what the story is there, but removing 8th Ave 0's results in no full years of data for that location.


Note: This code would remove a row in error if any location started counting in January, but that turns out not to be the case.

In [114]:
#remove 0 counts
has_counts = monthly_counts_named[monthly_counts_named['counts'] != 0]

def remove_first_yr_and_current_yr(df):
    print (str(len(df)) + " rows in initial data")
    allNames = df['name'].unique() #list of unique names in DF
    l = [] #empty list
    i = 0 #set counter
    for name in allNames:
        l.append(df[df['name']==name]) #seperate dataframe into list by name
        data = l[i]
        first_year = data['datetime'].min().year #stores first year of data for each location
        
        #condition
        remove_first_yr = data['datetime'].dt.year > first_year
        remove_current_year = data['datetime'].dt.year < datetime.today().year
        
        #filter each dataframe for conditions
        l[i] = data[remove_first_yr & remove_current_year]        
        #print("removing partial data from " + name + " for year " + str(first_year))
    
        i += 1 #increment counter
    
    result = pd.concat(l) # recombine filtered lists into df
    print(str(len(result)) + " rows returned")
    
    result = result.sort_values(by=['datetime']) #sort ascending
    
    return result # returns dataframe


full_yr_monthly_counts = remove_first_yr_and_current_yr(has_counts) 

print(full_yr_monthly_counts.head())
print(full_yr_monthly_counts.tail())

566 rows in initial data
430 rows returned
            datetime   counts  weekend_counts  weekday_counts  \
100009427 2014-01-31  41642.0          7386.0         34256.0   
100009428 2014-01-31  27011.0          4710.0         22301.0   
100009426 2014-01-31   1219.0           277.0           942.0   
100010022 2014-01-31  14584.0          2032.0         12552.0   
100009427 2014-02-28  34232.0         11615.0         22617.0   

                                            name  sens      counter  
100009427          Williamsburg Bridge Bike Path     0  Y2H13074108  
100009428  Ed Koch Queensboro Bridge Shared Path     0  Y2H19111445  
100009426              Manhattan Bridge Ped Path     0  Y2H13074107  
100010022              Brooklyn Bridge Bike Path     0  Y2H13074106  
100009427          Williamsburg Bridge Bike Path     0  Y2H13074108  
            datetime   counts  weekend_counts  weekday_counts  \
100010017 2019-12-31   7101.0          1834.0          5267.0   
100010019 2019-1

## Conclusions
The table below includes bicycle counts for all locations for which there is an active bike counter in NYC, including a monthly total, monthly total of weekdays, and monthly total of weekend days.

Known issues:

* Missing data for November and December of 2014 at Brooklyn Bridge (counters broken at that time)
* Something seems off about the Open Data numbers--they do not match what pulls straight from EcoCounter
* This just uses a single Manhattan counter -- MN dispaly

In [115]:
#write table
full_yr_monthly_counts.to_csv("full_yr_monthly_counts_clean.csv")