## Data Merging

This notebook covers how we gathered and cleaned our data from the various sources (New York MTA, Google Maps data). The code here is also included in **clean2.py** so that the data can be readied easily for analysis in other Jupyter Notebooks.

The intent here is to prepare this dataset for visualization, particularly to create our map figures.

This file:
- adds the ``lat`` and ``lon`` columns to our dataFrames (from ``df_turnstiles``).
- adds ``zipcode`` to our dataFrames
- pulls adjusted gross income (AGI) by zipcode from IRS data, filters it to NYC zipcodes, and inserts it into ``df_turnstiles`` as ``ZIPCODE_AGI``
- adds in the necessary geopandas shape columns to ``df_turnstiles`` that will allow the geodataframe to be visualized as a map of New York City

It also pulls geolocation data from Google (using its ``geocode api``). This allows us to add ``zipcode`` to our data in order to filter by **AGI (Adjusted Gross Income)**, which is pulled from the IRS's zipcode-level income data.

In [144]:
import numpy as np
import pandas as pd
import datetime as dt
import googlemaps
import requests

In [145]:
# this cell not working - can't import the file we're trying to generate

# from clean2 import *
# df_turnstiles, df_ampm, df_dailytraffic = data_wrangling()

Download MTA Data

In [146]:
url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"

# gather the week numbers of the data we want to pull from above urls
week_nums=[191228, 191221, 191214, 191207, 191130, 191123, 191116, 191109]
dfs = []

for week_num in week_nums:
    file_url = url.format(week_num)
    dfs.append(pd.read_csv(file_url, parse_dates=[["DATE", "TIME"]], keep_date_col=True))
    
df_turnstiles = pd.concat(dfs)
df_turnstiles.head(5)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,2019-12-21 03:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,12/21/2019,03:00:00,REGULAR,7318040,2480587
1,2019-12-21 07:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,12/21/2019,07:00:00,REGULAR,7318049,2480598
2,2019-12-21 11:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,12/21/2019,11:00:00,RECOVR AUD,7318101,2480680
3,2019-12-21 15:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,12/21/2019,15:00:00,REGULAR,7318263,2480763
4,2019-12-21 19:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,12/21/2019,19:00:00,REGULAR,7318559,2480823
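The week numbers in the download cell are dates in YYMMDD form, seven days apart. As a minimal sketch (the variable names here are ours, not part of the notebook), the same list could be generated instead of typed by hand:

```python
import pandas as pd

# Most recent file date used in the download cell (191228), written as a full date.
last_week = "2019-12-28"
n_weeks = 8

# Eight dates ending at last_week, spaced seven days apart.
week_dates = pd.date_range(end=last_week, periods=n_weeks, freq="7D")

# Format as YYMMDD integers, newest first, to match the hand-written list.
week_nums_generated = [int(d.strftime("%y%m%d")) for d in week_dates][::-1]
print(week_nums_generated)
```

Changing ``n_weeks`` or ``last_week`` would then be enough to pull a different window of files.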


An important note about this dataFrame:
- The ``ENTRIES`` and ``EXITS`` columns are **cumulative** counter readings. In order for us to analyze AM/PM traffic, we need to convert these columns from cumulative totals to per-interval counts (in 4-hour increments, since that's how the data is formatted).
    - Before we can do that, we need to remove some rows that were added as a result of an audit (think of these as corrections). There are relatively few of them (about 7.1k of 1.65M rows using the above weeks).
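To make "cumulative" concrete, here is a small sketch on made-up readings for two turnstiles (the ``INTERVAL_ENTRIES`` column name is hypothetical). It uses ``groupby(...).diff()`` as a shortcut; the notebook itself builds explicit ``PREV_`` columns later on.

```python
import pandas as pd

# Toy cumulative readings for two turnstiles, already in time order.
readings = pd.DataFrame({
    "SCP": ["02-00-00"] * 3 + ["02-00-01"] * 3,
    "ENTRIES": [7318040, 7318049, 7318263, 100, 130, 190],
})

# Difference from the previous reading of the same turnstile.
# The first reading of each turnstile has no predecessor, so it becomes NaN.
readings["INTERVAL_ENTRIES"] = readings.groupby("SCP")["ENTRIES"].diff()
print(readings)
```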


In [147]:
mask = df_turnstiles.DESC == 'RECOVR AUD'
df_turnstiles[mask].shape, df_turnstiles.shape

((7102, 12), (1649098, 12))

Now let's get rid of the duplicate readings introduced by the rows where ``df_turnstiles.DESC == 'RECOVR AUD'``. To do this, we sort the rows by turnstile and timestamp, then apply the ``drop_duplicates`` dataFrame method so that only one row is kept per turnstile per timestamp. Note that the sort does not include ``DESC``, so a 'RECOVR AUD' row that is the only reading for its timestamp is kept.

Then, since we no longer need ``DESC``, we can drop the column entirely.

In [148]:
df_turnstiles.sort_values(
            ["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"],
            inplace=True,
            ascending=False)

# keeps top row, deletes others
df_turnstiles.drop_duplicates(
    subset=["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=True
)

# remove DESC column
df_turnstiles = df_turnstiles.drop(["DESC"], axis=1, errors="ignore")

df_turnstiles.head(5)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,ENTRIES,EXITS
206706,2019-12-27 20:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,20:00:00,5554,420
206705,2019-12-27 16:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,16:00:00,5554,420
206704,2019-12-27 12:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,12:00:00,5554,420
206703,2019-12-27 08:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,08:00:00,5554,420
206702,2019-12-27 04:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,04:00:00,5554,420


Great, now we can see that there is no more DESC column. 
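The de-duplication step above does not sort on ``DESC``, so it does not control which of two same-timestamp rows survives. If we wanted to prefer the 'REGULAR' reading explicitly, one option is to add ``DESC`` to the sort. This is a sketch on invented rows, not what the notebook ran:

```python
import pandas as pd

keys = ["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"]

# Invented rows: two readings share the 07:00 timestamp, one stands alone at 11:00.
demo = pd.DataFrame({
    "C/A": ["A002"] * 3,
    "UNIT": ["R051"] * 3,
    "SCP": ["02-00-00"] * 3,
    "STATION": ["59 ST"] * 3,
    "DATE_TIME": pd.to_datetime(
        ["2019-12-21 07:00", "2019-12-21 07:00", "2019-12-21 11:00"]
    ),
    "DESC": ["RECOVR AUD", "REGULAR", "RECOVR AUD"],
    "ENTRIES": [7318050, 7318049, 7318101],
})

# In descending order 'REGULAR' sorts before 'RECOVR AUD', so within a
# duplicated timestamp the 'REGULAR' row comes first and is the one kept.
deduped = (
    demo.sort_values(keys + ["DESC"], ascending=False)
        .drop_duplicates(subset=keys)
)
print(deduped[["DATE_TIME", "DESC", "ENTRIES"]])
```

The lone 11:00 'RECOVR AUD' row is still kept, because it is the only reading for that timestamp.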

As you can see in the output of the below cell, the ``EXITS`` column has a lot of spaces after it. Let's fix that so we can easily select this column in the future.

In [149]:
df_turnstiles.columns

Index(['DATE_TIME', 'C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION',
       'DATE', 'TIME', 'ENTRIES',
       'EXITS                                                               '],
      dtype='object')

In [150]:
df_turnstiles.rename(columns={"EXITS                                                               ": "EXITS"},
            inplace=True)
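The rename above depends on typing the exact number of trailing spaces. A sketch of a more forgiving alternative, shown on a stand-in dataFrame, is to strip whitespace from every column name at once:

```python
import pandas as pd

# Stand-in frame with the same kind of padded column name.
raw = pd.DataFrame(columns=["ENTRIES", "EXITS          "])

# Remove leading/trailing whitespace from all column names.
raw.columns = raw.columns.str.strip()
print(list(raw.columns))
```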

Now, let's add the AMPM column.

In [151]:
df_turnstiles["AMPM"] = (pd.DatetimeIndex(df_turnstiles["TIME"]).strftime("%r").str[-2:])
df_turnstiles.head(3)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,ENTRIES,EXITS,AMPM
206706,2019-12-27 20:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,20:00:00,5554,420,PM
206705,2019-12-27 16:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,16:00:00,5554,420,PM
206704,2019-12-27 12:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,12:00:00,5554,420,PM
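The AM/PM label above is taken from the last two characters of a ``%r``-formatted time string. A sketch that derives the same label from the hour alone, which avoids relying on that format code, could look like this (toy times, our own variable names):

```python
import numpy as np
import pandas as pd

times = pd.Series(["03:00:00", "11:59:59", "12:00:00", "20:00:00"])

# Parse the clock time and keep only the hour (0-23).
hours = pd.to_datetime(times, format="%H:%M:%S").dt.hour

# Hours 0-11 are AM, hours 12-23 are PM.
ampm = pd.Series(np.where(hours < 12, "AM", "PM"), index=times.index)
print(ampm.tolist())
```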


And day name.

In [152]:
df_turnstiles["DAY_NAME"] = pd.to_datetime(df_turnstiles["DATE"]).dt.day_name()
df_turnstiles.head(3)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,ENTRIES,EXITS,AMPM,DAY_NAME
206706,2019-12-27 20:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,20:00:00,5554,420,PM,Friday
206705,2019-12-27 16:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,16:00:00,5554,420,PM,Friday
206704,2019-12-27 12:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,12/27/2019,12:00:00,5554,420,PM,Friday


We eventually want to add the adjusted gross income for each station's zipcode. To get that, we need the zipcode, and to get the zipcode we will take the station names from MTA's station data and pass them to Google's **geocode** API.

In [153]:
# Read in MTA station metadata (station name, latitude, longitude)
mta_station_info = pd.read_csv("http://web.mta.info/developers/data/nyct/subway/Stations.csv")
mta_station_info.rename(columns={'Stop Name': 'STATION', 'GTFS Latitude': 'Lat', 'GTFS Longitude': 'Lon'}, inplace=True)

mta_station_info.head()

Unnamed: 0,Station ID,Complex ID,GTFS Stop ID,Division,Line,STATION,Borough,Daytime Routes,Structure,Lat,Lon,North Direction Label,South Direction Label,ADA,ADA Notes
0,1,1,R01,BMT,Astoria,Astoria-Ditmars Blvd,Q,N W,Elevated,40.775036,-73.912034,,Manhattan,0,
1,2,2,R03,BMT,Astoria,Astoria Blvd,Q,N W,Elevated,40.770258,-73.917843,Ditmars Blvd,Manhattan,1,
2,3,3,R04,BMT,Astoria,30 Av,Q,N W,Elevated,40.766779,-73.921479,Astoria - Ditmars Blvd,Manhattan,0,
3,4,4,R05,BMT,Astoria,Broadway,Q,N W,Elevated,40.76182,-73.925508,Astoria - Ditmars Blvd,Manhattan,0,
4,5,5,R06,BMT,Astoria,36 Av,Q,N W,Elevated,40.756804,-73.929575,Astoria - Ditmars Blvd,Manhattan,0,


Nice! Now we have the station data...but still no zipcode. Let's get the station names and let Google do the rest.

**NOTE**: This will take some time. Go get yourself a drink!

In [121]:
import os
import googlemaps

# You will need your own API key; get one at:
# https://developers.google.com/maps/documentation/geocoding/start
# The key is read from an environment variable so it is not stored in the notebook
# (the variable name GOOGLE_MAPS_API_KEY is our choice, set it before running).
gmaps = googlemaps.Client(key=os.environ["GOOGLE_MAPS_API_KEY"])

#initialize dictionary to store zipcodes in
station_zips = {}
mta_station_names = list(mta_station_info.STATION.unique())

for station in mta_station_names:
    address = station + ' Station New York City, NY'
    geocode_result = gmaps.geocode(address)
    try:
        zipcode = geocode_result[0]['address_components'][6]['long_name']
        if len(zipcode) == 5:
            station_zips[station.upper()] = str(zipcode) 
    except (IndexError, KeyError):
        # no result, or the result has fewer address components than expected
        continue
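The loop above reads the zipcode from a fixed position (index 6) in ``address_components``. Each component in a geocode result also carries a ``types`` list, so a sketch that looks for the ``postal_code`` type instead of a position might recover more stations. The helper name and the sample response below are made up for illustration; the sample is only shaped like a real response.

```python
def extract_zipcode(geocode_result):
    """Return the first 5-digit postal code in a geocode result list, or None."""
    for result in geocode_result:
        for component in result.get("address_components", []):
            if "postal_code" in component.get("types", []):
                zipcode = component.get("long_name", "")
                if len(zipcode) == 5 and zipcode.isdigit():
                    return zipcode
    return None


# Hand-written sample, not actual API output.
sample_response = [{
    "address_components": [
        {"long_name": "Times Sq-42 St", "types": ["transit_station"]},
        {"long_name": "New York", "types": ["locality", "political"]},
        {"long_name": "10018", "types": ["postal_code"]},
    ]
}]

print(extract_zipcode(sample_response))
print(extract_zipcode([]))
```

Inside the loop this would replace the indexed lookup, for example ``zipcode = extract_zipcode(geocode_result)`` followed by a check for ``None``.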

In [186]:
station_zips

{'ASTORIA-DITMARS BLVD': '11105',
 '30 AV': '11102',
 '36 AV': '11106',
 '39 AV-DUTCH KILLS': '11101',
 'LEXINGTON AV/59 ST': '10065',
 '5 AV/59 ST': '10019',
 '57 ST-7 AV': '10106',
 '49 ST': '10019',
 'TIMES SQ-42 ST': '10018',
 '34 ST-HERALD SQ': '10001',
 '28 ST': '10001',
 '23 ST': '10011',
 '14 ST-UNION SQ': '10003',
 '8 ST-NYU': '10003',
 'PRINCE ST': '10012',
 'CANAL ST': '10013',
 'CITY HALL': '10013',
 'CORTLANDT ST': '10007',
 'RECTOR ST': '10006',
 'COURT ST': '11201',
 'JAY ST-METROTECH': '11201',
 'DEKALB AV': '11217',
 'UNION ST': '11215',
 '4 AV-9 ST': '11215',
 'PROSPECT AV': '11215',
 '36 ST': '10012',
 '53 ST': '10022',
 'BAY RIDGE AV': '11220',
 'BAY RIDGE-95 ST': '11209',
 '7 AV': '10019',
 'PARKSIDE AV': '11226',
 'CHURCH AV': '11226',
 'BEVERLEY RD': '11226',
 'CORTELYOU RD': '11226',
 'NEWKIRK PLAZA': '11226',
 'AVENUE H': '11230',
 'AVENUE J': '11230',
 'AVENUE M': '11230',
 'KINGS HWY': '11229',
 'AVENUE U': '11223',
 'NECK RD': '11229',
 'SHEEPSHEAD BAY': '11

``station_zips`` now contains the zip codes for the stations we could resolve. Some are missing: the loop above only keeps a result when the seventh address component is a 5-character value, and it skips any station whose lookup returns nothing or a differently shaped result.

Now let's add that to our dataFrames!

In [204]:
mta_station_info['ZIPCODE'] = mta_station_info['STATION'].str.upper().map(station_zips)
df_turnstiles['ZIPCODE'] = df_turnstiles['STATION'].map(station_zips)
df_turnstiles.sample(10)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,ENTRIES,EXITS,AMPM,DAY_NAME,ZIPCODE,ZIPCODE_AGI
168643,2019-12-02 23:22:00,R262A,R195,04-00-03,161/YANKEE STAD,4BD,IRT,12/02/2019,23:22:00,107441,94120,PM,Monday,,
204007,2019-11-19 23:00:00,S101,R070,00-00-06,ST. GEORGE,1,SRT,11/19/2019,23:00:00,2096947,8380,PM,Tuesday,,
85939,2019-11-15 16:00:00,N333,R141,01-00-01,FOREST HILLS 71,EFMR,IND,11/15/2019,16:00:00,612633,195642,PM,Friday,,
140227,2019-12-19 07:00:00,R145,R032,00-06-02,TIMES SQ-42 ST,1237ACENQRSW,IRT,12/19/2019,07:00:00,707418,508512,AM,Thursday,10018.0,3964454.0
146921,2019-12-23 00:00:00,R176,R169,00-00-00,137 ST CITY COL,1,IRT,12/23/2019,00:00:00,11030589,6867518,AM,Monday,,
203246,2019-11-07 20:00:00,R637,R451,00-00-02,WINTHROP ST,25,IRT,11/07/2019,20:00:00,18325387,755473,PM,Thursday,11225.0,1755176.0
95457,2019-11-05 12:00:00,N414A,R316,01-06-00,FLUSHING AV,G,IND,11/05/2019,12:00:00,1752338,1637807,PM,Tuesday,,
157596,2019-12-06 08:00:00,R231,R176,00-00-01,33 ST,6,IRT,12/06/2019,08:00:00,4482580,2414849,AM,Friday,10001.0,2906435.0
36657,2019-11-30 23:00:00,H035,R348,00-00-02,ATLANTIC AV,L,BMT,11/30/2019,23:00:00,1764032,961138,PM,Saturday,11207.0,1650855.0
42125,2019-12-22 03:00:00,J035,R008,00-00-03,111 ST,J,BMT,12/22/2019,03:00:00,6954058,5062010,AM,Sunday,11418.0,840143.0
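Many of the empty ``ZIPCODE`` values in the sample above come from station names in the turnstile data that have no matching key in ``station_zips`` (for example, the turnstile files spell it ``JAY ST-METROTEC``, while the dictionary key is ``JAY ST-METROTECH``). A small sketch for listing the unmatched names, using stand-in data and hypothetical variable names:

```python
import pandas as pd

# Stand-ins for station_zips and df_turnstiles.
station_zips_demo = {"TIMES SQ-42 ST": "10018", "JAY ST-METROTECH": "11201"}
turnstiles_demo = pd.DataFrame({
    "STATION": ["TIMES SQ-42 ST", "JAY ST-METROTEC", "161/YANKEE STAD", "TIMES SQ-42 ST"]
})

turnstiles_demo["ZIPCODE"] = turnstiles_demo["STATION"].map(station_zips_demo)

# Station names that did not find a zipcode, each listed once.
unmatched = (
    turnstiles_demo.loc[turnstiles_demo["ZIPCODE"].isna(), "STATION"]
    .unique()
    .tolist()
)
print(unmatched)
```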


In [227]:
df_turnstiles['ZIPCODE'].iloc[213431]

'10472'

#### Adjusted gross income, AGI - first retrieve for the entire United States by zipcode

In [247]:
us_zips_agi = pd.read_csv("https://www.irs.gov/pub/irs-soi/18zpallagi.csv")
us_zips_agi.rename(columns={'A00100':'adj_gross_inc'}, inplace=True) # in 18zpallagi.csv, A00100 stands for AGI
us_zips_agi = us_zips_agi[['zipcode','adj_gross_inc']].groupby('zipcode').sum() # group by zipcode and sum AGI

We can filter this zipcode/AGI data down to NYC zipcodes by joining it with a list of NYC zipcodes (``data/ny_zips.csv``).

In [258]:
nyc_zips = pd.read_csv("data/ny_zips.csv")
nyc_zips.dropna(axis=1, how='all', inplace=True)
nyc_agi_by_zip = nyc_zips.join(us_zips_agi, how='inner', on='zipcode')

# capitalize the column names and cast ZIPCODE to str so it matches the string zip codes in df_turnstiles
nyc_agi_by_zip.columns = nyc_agi_by_zip.columns.str.upper()
nyc_agi_by_zip['ZIPCODE'] = nyc_agi_by_zip.ZIPCODE.astype(str)

nyc_agi_by_zip.head(10)


Unnamed: 0,ZIPCODE,AREA,ADJ_GROSS_INC
0,10001,Manhattan,2906435.0
1,10002,Manhattan,2718913.0
2,10003,Manhattan,8191737.0
3,10004,Manhattan,944925.0
4,10005,Manhattan,2603668.0
5,10006,Manhattan,577145.0
6,10007,Manhattan,2910802.0
7,10009,Manhattan,2948597.0
8,10010,Manhattan,4542337.0
9,10011,Manhattan,9331779.0
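The two steps above (sum AGI per zipcode, then inner-join against the NYC list) can be seen on a toy example. All numbers here are invented; the point is that ``join(..., on='zipcode')`` matches the left frame's ``zipcode`` column against the right frame's index, and ``how='inner'`` drops zipcodes that are not in both.

```python
import pandas as pd

# Invented AGI rows: one zipcode can appear on several rows.
agi_rows = pd.DataFrame({
    "zipcode": [10001, 10001, 10002, 90210],
    "adj_gross_inc": [100.0, 50.0, 70.0, 999.0],
})
agi_by_zip = agi_rows.groupby("zipcode").sum()  # index is now zipcode

# Invented city list containing only the zipcodes we care about.
city_zips = pd.DataFrame({
    "zipcode": [10001, 10002],
    "area": ["Manhattan", "Manhattan"],
})

# Left column 'zipcode' is matched against the index of agi_by_zip.
joined = city_zips.join(agi_by_zip, how="inner", on="zipcode")
print(joined)
```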


In [252]:
nyc_agi_by_zip[['ZIPCODE', 'ADJ_GROSS_INC']].set_index('ZIPCODE').to_dict()['ADJ_GROSS_INC']

{'10001': 2906435.0,
 '10002': 2718913.0,
 '10003': 8191737.0,
 '10004': 944925.0,
 '10005': 2603668.0,
 '10006': 577145.0,
 '10007': 2910802.0,
 '10009': 2948597.0,
 '10010': 4542337.0,
 '10011': 9331779.0,
 '10012': 3646355.0,
 '10013': 7947938.0,
 '10014': 6795181.0,
 '10016': 8000771.0,
 '10017': 4417331.0,
 '10018': 3964454.0,
 '10019': 8005583.0,
 '10021': 12798847.0,
 '10022': 14226340.0,
 '10023': 12391069.0,
 '10024': 13806195.0,
 '10025': 8028810.0,
 '10026': 1440633.0,
 '10027': 1972417.0,
 '10028': 10128983.0,
 '10029': 2022052.0,
 '10030': 645026.0,
 '10031': 1401061.0,
 '10032': 1270384.0,
 '10033': 1360083.0,
 '10034': 1019289.0,
 '10035': 833400.0,
 '10036': 4568142.0,
 '10037': 570270.0,
 '10038': 1953888.0,
 '10039': 586050.0,
 '10040': 1037660.0,
 '10044': 472585.0,
 '10069': 1777529.0,
 '10128': 12570244.0,
 '10162': 203692.0,
 '10280': 1047685.0,
 '10282': 1442366.0,
 '10301': 1369609.0,
 '10302': 458236.0,
 '10303': 594095.0,
 '10304': 1515252.0,
 '10305': 1278252

Great! Now we can add AGI to our turnstiles DataFrames. Let's do that now.

In [257]:
# THIS CELL WORKS IN JNB- do not change
zipcode_agis = nyc_agi_by_zip[['ZIPCODE', 'ADJ_GROSS_INC']].set_index('ZIPCODE').to_dict()['ADJ_GROSS_INC']
df_turnstiles['ZIPCODE_AGI'] = df_turnstiles['ZIPCODE'].map(zipcode_agis)
df_turnstiles.sample(10)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,ENTRIES,EXITS,AMPM,DAY_NAME,ZIPCODE,ZIPCODE_AGI
8868,2019-12-27 04:00:00,A049,R088,02-03-00,CORTLANDT ST,RNW,BMT,12/27/2019,04:00:00,1691836,271118,AM,Friday,10007,2910802.0
70310,2019-11-06 07:00:00,N137,R354,00-06-01,104 ST,A,IND,11/06/2019,07:00:00,1681109400,978345590,AM,Wednesday,11418,840143.0
69208,2019-12-02 04:00:00,N131,R383,00-00-02,80 ST,A,IND,12/02/2019,04:00:00,3773395,1066443,AM,Monday,11417,724600.0
47112,2019-12-16 23:00:00,N025,R102,01-00-00,125 ST,ACBD,IND,12/16/2019,23:00:00,61329,55450,PM,Monday,,
48931,2019-11-18 15:00:00,N043,R186,00-06-00,86 ST,BC,IND,11/18/2019,15:00:00,373509,99580,PM,Monday,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
164087,2019-11-10 07:00:00,R249,R179,01-00-06,86 ST,456,IRT,11/10/2019,07:00:00,1277087,2985517,AM,Sunday,,
20094,2019-11-05 03:00:00,C003,R089,00-00-02,JAY ST-METROTEC,R,BMT,11/05/2019,03:00:00,3390143,1327564,AM,Tuesday,,
25326,2019-12-08 08:00:00,D002,R390,00-06-01,8 AV,N,BMT,12/08/2019,08:00:00,1,50,AM,Sunday,,
95870,2019-11-22 15:00:00,N500,R020,00-06-00,47-50 STS ROCK,BDFM,IND,11/22/2019,15:00:00,31139,34868,PM,Friday,,


Cool, now it's time to convert the ENTRIES and EXITS columns from cumulative totals to the change from the previous value.

We do this by shifting the previous values forward, then subtracting the previous values from the current ones.
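As a sketch of the shift-then-subtract idea on toy data (the ``DELTA`` column name is ours), each row is given the previous row's values from the same turnstile, and rows with no predecessor are dropped:

```python
import pandas as pd

daily = pd.DataFrame({
    "SCP": ["A", "A", "A", "B", "B"],
    "DATE": ["11/02/2019", "11/03/2019", "11/04/2019", "11/02/2019", "11/03/2019"],
    "ENTRIES": [100, 250, 400, 1000, 1300],
})

# Shift within each turnstile so a row never sees another turnstile's values.
grouped = daily.groupby("SCP")
daily["PREV_DATE"] = grouped["DATE"].shift(1)
daily["PREV_ENTRIES"] = grouped["ENTRIES"].shift(1)

# Current cumulative reading minus the previous one.
daily["DELTA"] = daily["ENTRIES"] - daily["PREV_ENTRIES"]

# The first row of each turnstile has nothing to compare against.
result = daily.dropna(subset=["PREV_DATE"])
print(result)
```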

In [219]:
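# NOTE: this maps station names to zip codes, so it overwrites ZIPCODE_AGI with
# zip codes (compare the last two columns in the output below). The AGI mapping
# itself is the ``zipcode_agis`` cell above.
df_turnstiles['ZIPCODE_AGI'] = df_turnstiles['STATION'].map(station_zips)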
df_turnstiles.sample(40)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,ENTRIES,EXITS,AMPM,DAY_NAME,ZIPCODE,ZIPCODE_AGI
117278,2019-11-29 18:06:46,PTH04,R551,00-00-01,GROVE STREET,1,PTH,11/29/2019,18:06:46,72548,83487,PM,Friday,,
107697,2019-11-11 04:00:00,N555,R423,00-00-02,AVENUE N,F,IND,11/11/2019,04:00:00,1179021932,1551581,AM,Monday,11230.0,11230.0
120430,2019-11-15 04:51:43,PTH11,R545,00-00-00,14TH STREET,1,PTH,11/15/2019,04:51:43,113674,4936,AM,Friday,,
169987,2019-12-26 19:00:00,R287,R244,00-00-00,BURNSIDE AV,4,IRT,12/26/2019,19:00:00,355731,270019,PM,Thursday,10453.0,10453.0
2835,2019-11-18 19:00:00,A021,R032,01-00-01,TIMES SQ-42 ST,ACENQRS1237W,BMT,11/18/2019,19:00:00,3804229,4416391,PM,Monday,10018.0,10018.0
30861,2019-11-16 19:00:00,G011,R312,00-00-01,W 8 ST-AQUARIUM,FQ,BMT,11/16/2019,19:00:00,3168601,2593014,PM,Saturday,,
66657,2019-12-19 23:00:00,N112A,R284,01-00-01,CLINTON-WASH AV,C,IND,12/19/2019,23:00:00,678433,2532480,PM,Thursday,,
112599,2019-11-10 04:00:00,N700,R570,00-03-01,72 ST-2 AVE,Q,IND,11/10/2019,04:00:00,1625349,1870989,AM,Sunday,,
144726,2019-12-26 20:00:00,R166,R167,02-00-02,86 ST,1,IRT,12/26/2019,20:00:00,2857308,3348640,PM,Thursday,,
154883,2019-12-13 16:00:00,R221,R170,01-06-03,14 ST-UNION SQ,456LNQRW,IRT,12/13/2019,16:00:00,387579,172098,PM,Friday,10003.0,10003.0


In [203]:
# group data by AMPM, taking the maximum entries/exits for each date 
ampm_station_group = df_turnstiles.groupby(["C/A", "UNIT", "SCP", "STATION", "ZIPCODE", "ZIPCODE_AGI", "DATE", "AMPM", "DAY_NAME"],
    as_index=False)

df_ampm = ampm_station_group.ENTRIES.max()
ampm_station_exits = ampm_station_group.EXITS.max()
df_ampm["EXITS"] = ampm_station_exits["EXITS"]

# create prev_date and prev_entries cols by shifting these columns forward one day
# if shifting date and entries, don't group by date
df_ampm[["PREV_DATE", "PREV_ENTRIES", "PREV_EXITS"]] = df_ampm.groupby(
    ["C/A", "UNIT", "SCP", "STATION", "ZIPCODE", "ZIPCODE_AGI", "AMPM"]
)[["DATE", "ENTRIES", "EXITS"]].shift(1)

# Drop the rows for the earliest date in the df, which are now NaNs for prev_date and prev_entries cols
df_ampm.dropna(subset=["PREV_DATE"], axis=0, inplace=True)

df_ampm.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,ZIPCODE,ZIPCODE_AGI,DATE,AMPM,DAY_NAME,ENTRIES,EXITS,PREV_DATE,PREV_ENTRIES,PREV_EXITS
2,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/03/2019,AM,Sunday,4099092,7058848,11/02/2019,4097957.0,7057072.0
3,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/03/2019,PM,Sunday,4099979,7060287,11/02/2019,4098923.0,7058586.0
4,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/04/2019,AM,Monday,4100128,7061222,11/03/2019,4099092.0,7058848.0
5,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/04/2019,PM,Monday,4101641,7063433,11/03/2019,4099979.0,7060287.0
6,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/05/2019,AM,Tuesday,4101898,7064433,11/04/2019,4100128.0,7061222.0


Alright, as we can see above, we now have ``PREV_ENTRIES`` and ``PREV_EXITS`` columns, and they seem mostly correct. However, there is a lot of data, and we noticed that sometimes ``PREV_ENTRIES`` > ``ENTRIES``. That should never happen with a cumulative counter; in fact, we determined that some counters were counting in reverse! In other cases, the difference from the previous count was in the hundreds of thousands, sometimes millions. We decided to treat 200,000 as the largest believable difference: a larger jump is assumed to be a counter reset, so we take the smaller of the two readings instead, and if that is still above 200,000 we record 0.

The function below takes care of this for both ENTRIES and EXITS.

In [199]:
def add_counts(row, max_counter, column_name):
    """
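    Return the entries/exits counted in one row, i.e. the difference between the
    current and previous counter readings, with implausible values corrected.

    Takes:
        - row: a dataFrame row holding ``column_name`` and ``PREV_<column_name>``
        - max_counter: the maximum difference between entries/exits & their prev. row
          values that we will allow
        - column_name (string): which column to count ('ENTRIES' or 'EXITS')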
    """

    counter = row[column_name] - row[f"PREV_{column_name}"]
    if counter < 0:
        # Maybe counter is reversed?
        counter = -counter

    if counter > max_counter:
        # Maybe counter was reset to 0?
        # take the lower value as the counter for this row
        counter = min(row[column_name], row[f"PREV_{column_name}"])

    if counter > max_counter:
        # Check it again to make sure we're not still giving a counter that's too big
        return 0

    return counter
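Applying a Python function row by row can be slow on a dataFrame this size. Below is a vectorized sketch that is meant to follow the same three rules as ``add_counts`` (flip negatives, fall back to the smaller reading, zero out anything still too large). The function name and toy values are ours, and it has not been checked against the full dataset.

```python
import numpy as np
import pandas as pd


def add_counts_vectorized(df, max_counter, column_name):
    """Column-wise version of the add_counts rules; returns a Series."""
    cur = df[column_name]
    prev = df[f"PREV_{column_name}"]

    # Rule 1: a negative difference is treated as a reversed counter.
    counter = (cur - prev).abs()

    # Rule 2: too large a jump is treated as a reset; use the smaller reading.
    counter = counter.mask(counter > max_counter, np.minimum(cur, prev))

    # Rule 3: if it is still too large, give up and record 0.
    counter = counter.mask(counter > max_counter, 0)
    return counter


toy = pd.DataFrame({
    "ENTRIES":      [1100,  900,      50, 5_000_000],
    "PREV_ENTRIES": [1000, 1000, 3_000_000, 1_000_000],
})
toy["TMP_ENTRIES"] = add_counts_vectorized(toy, max_counter=200000, column_name="ENTRIES")
print(toy)
```

The four toy rows cover a normal increase, a reversed counter, a suspected reset, and a value that is discarded.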
    

Alright, now let's apply this function to our dataFrames.

In [200]:
# we will use a 200k counter - anything more seems incorrect.
df_ampm["TMP_ENTRIES"] = df_ampm.apply(
    add_counts, axis=1, max_counter=200000, column_name='ENTRIES')

df_ampm["TMP_EXITS"] = df_ampm.apply(
    add_counts, axis=1, max_counter=200000, column_name='EXITS')

In [201]:
df_ampm.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,DATE,AMPM,DAY_NAME,ENTRIES,EXITS,PREV_DATE,PREV_ENTRIES,PREV_EXITS,TMP_ENTRIES,TMP_EXITS
2,A002,R051,02-00-00,59 ST,11/03/2019,AM,Sunday,7257271,2459060,11/02/2019,7256298.0,2458759.0,973.0,301.0
3,A002,R051,02-00-00,59 ST,11/03/2019,PM,Sunday,7258068,2459211,11/02/2019,7256982.0,2458965.0,1086.0,246.0
4,A002,R051,02-00-00,59 ST,11/04/2019,AM,Monday,7258268,2459570,11/03/2019,7257271.0,2459060.0,997.0,510.0
5,A002,R051,02-00-00,59 ST,11/04/2019,PM,Monday,7259609,2459759,11/03/2019,7258068.0,2459211.0,1541.0,548.0
6,A002,R051,02-00-00,59 ST,11/05/2019,AM,Tuesday,7259800,2460102,11/04/2019,7258268.0,2459570.0,1532.0,532.0
