# 1.3 | Data Acquisition: DAILY data
* [01 API Data Requests](01_API_pulls.ipynb)
* [01.1 Additional BART Data](01_v2_bart.ipynb.ipynb)
* [01.3 Daily BART Data](01_v3_bart.ipynb.ipynb)
* [02 Initial EDA](02_EDA.ipynb)
* [03 First Model: PROPHET](03_prophet.ipynb)
---

### <b>Daily</b> BART ridership

Pre-processing the large yearly CSVs from `bart.gov`:
* collapse from HOURLY to DAILY counts (sum)

The files have no header row. Columns appear in this order:
date | hour (of day, 24hr) | origin station | destination station | riders
---  |---                  | ---            | ---                   | ---

<br>

> for `datetime`, `pandas.DatetimeIndex.dayofweek` returns day of week, with `0 = Monday` and `6 = Sunday`. 
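
A quick check of the weekday numbering, as a minimal sketch. The two dates are chosen only for illustration: 2019-01-07 fell on a Monday and 2019-01-13 on a Sunday.

```python
import pandas as pd

# Two illustrative dates: a Monday and the following Sunday.
idx = pd.DatetimeIndex(["2019-01-07", "2019-01-13"])

# dayofweek numbers the days 0 (Monday) through 6 (Sunday).
weekday_numbers = idx.dayofweek.tolist()
print(weekday_numbers)  # [0, 6]

# day_name() gives a readable label for the same dates.
weekday_names = idx.day_name().tolist()
print(weekday_names)  # ['Monday', 'Sunday']
```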

* Initial modeling will look at _daily_, _system-wide_ ridership. 
* Subsequent analysis will consider _hourly_ ridership.
* More granular analysis of fuel prices will consider only trips `>10mi`, to assess long-distance _commuter_ sensitivity to fuel prices without the intra-city _urban_ rides.
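
The daily, system-wide collapse can be sketched on a tiny in-memory stand-in for one yearly file. The rows below are made up for illustration, and the column names (`dt`, `hour`, `origin`, `exit`, `riders`) are the ones assigned later in this notebook. This sketch does not filter same-station trips.

```python
import io
import pandas as pd

# Made-up rows in the headerless layout: date, hour, origin, exit, riders.
raw = io.StringIO(
    "2019-01-01,0,12TH,POWL,3\n"
    "2019-01-01,1,12TH,MONT,2\n"
    "2019-01-01,1,POWL,POWL,1\n"
    "2019-01-02,7,MONT,12TH,5\n"
)

cols = ["dt", "hour", "origin", "exit", "riders"]
# header=None keeps the first data row from being read as a header.
hourly = pd.read_csv(raw, header=None, names=cols, parse_dates=["dt"])

# Sum over all hours and station pairs for each calendar day.
daily_total = hourly.groupby("dt")["riders"].sum()
print(daily_total)
```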

In [1]:
##### BASIC IMPORTS
import glob
import os
import pandas as pd

import gcutsoms as gf

In [2]:
path = '../data/raw/bart/hourly/'
files = os.listdir(path)

# Print file list to verify file types, count 
files

['.DS_Store',
 'date-hour-soo-dest-2019.csv',
 'date-hour-soo-dest-2018.csv',
 'date-hour-soo-dest-2022.csv',
 'date-hour-soo-dest-2020.csv',
 'date-hour-soo-dest-2021.csv',
 'date-hour-soo-dest-2011.csv',
 'date-hour-soo-dest-2013.csv',
 'date-hour-soo-dest-2012.csv',
 'date-hour-soo-dest-2016.csv',
 'date-hour-soo-dest-2017.csv',
 'date-hour-soo-dest-2015.csv',
 'date-hour-soo-dest-2014.csv']

---
This function iterates through the directory holding the yearly files:
* eliminates same-station exits (`origin == destination`)
* output is a single `dataframe` with one row per date and exit station: `dt` = date column, `exit` = exit station, `sum` = ridership (the date index is set in a later step)
* rider count is aggregated by:
  * date & exit station
  * _by date & by exit station_ * add this to analysis
  * _by weekly sum per weekday_ * add this to analysis
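
The two aggregations still marked for the analysis could look roughly like the sketch below. The input rows are made up, shaped like the function's output (`dt`, `exit`, `sum`). Reading "weekly sum per weekday" as a per-weekday total plus a per-week total is an assumption about what is intended.

```python
import pandas as pd

# Made-up daily exits in the shape the function returns: dt, exit, sum.
daily = pd.DataFrame({
    "dt": ["2019-01-07", "2019-01-07", "2019-01-08", "2019-01-14"],
    "exit": ["12TH", "POWL", "12TH", "12TH"],
    "sum": [10, 20, 5, 7],
})
daily["dt"] = pd.to_datetime(daily["dt"])

# Total exits per station over all dates.
by_exit = daily.groupby("exit")["sum"].sum()

# Total exits per weekday (0 = Monday ... 6 = Sunday).
by_weekday = daily.groupby(daily["dt"].dt.dayofweek)["sum"].sum()

# Total exits per calendar week (pandas "W" bins end on Sunday).
by_week = daily.groupby(pd.Grouper(key="dt", freq="W"))["sum"].sum()

print(by_exit, by_weekday, by_week, sep="\n\n")
```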

In [3]:
def agg_station_day(path_name):
    """Sum hourly riders per date and exit station across all yearly CSVs."""

    frames = []

    for file in glob.glob(path_name + '*.csv'):
        print(file)

        # files have no header row, so supply column names on read
        # (otherwise the first data row is consumed as the header)
        df = pd.read_csv(file, header=None,
                         names=['dt', 'hour', 'origin', 'exit', 'riders'])

        # filter out origin = destination rides
        df = df[df['origin'] != df['exit']]

        # group / sum riders for each day BY EXIT STATION
        df = df.groupby(['dt', 'exit'])['riders'].agg(['sum']).reset_index()

        # add each year to running list
        frames.append(df)

    # combine all years; 'dt' stays a plain string column here and is
    # converted to a datetime index in a later step
    df1 = pd.concat(frames)
    df1.sort_index(inplace=True)

    return df1

In [4]:
df_daily = agg_station_day(path)

../data/raw/bart/hourly/date-hour-soo-dest-2019.csv
../data/raw/bart/hourly/date-hour-soo-dest-2018.csv
../data/raw/bart/hourly/date-hour-soo-dest-2022.csv
../data/raw/bart/hourly/date-hour-soo-dest-2020.csv
../data/raw/bart/hourly/date-hour-soo-dest-2021.csv
../data/raw/bart/hourly/date-hour-soo-dest-2011.csv
../data/raw/bart/hourly/date-hour-soo-dest-2013.csv
../data/raw/bart/hourly/date-hour-soo-dest-2012.csv
../data/raw/bart/hourly/date-hour-soo-dest-2016.csv
../data/raw/bart/hourly/date-hour-soo-dest-2017.csv
../data/raw/bart/hourly/date-hour-soo-dest-2015.csv
../data/raw/bart/hourly/date-hour-soo-dest-2014.csv


In [5]:
df_daily.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193689 entries, 0 to 18235
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   dt      193689 non-null  object
 1   exit    193689 non-null  object
 2   sum     193689 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 5.9+ MB


In [6]:
df_daily.head(12)

Unnamed: 0,dt,exit,sum
0,2019-01-01,12TH,2098
0,2022-01-01,12TH,798
0,2020-01-01,12TH,2345
0,2021-01-01,12TH,382
0,2011-01-01,12TH,2582
0,2013-01-01,12TH,3179
0,2016-01-01,12TH,3138
0,2017-01-01,12TH,2641
0,2015-01-01,12TH,3147
0,2014-01-01,12TH,3129


In [16]:
# current name of column holding date 
col_title = 'dt'
# sets date as time index
df_daily2 = gf.dt_index(df_daily, col_title)
# rename date column to either fb prophet or linkedin greykite format 
# df_daily.dt 
df_daily2.head()

date,dt,exit,sum
2011-01-01,2011-01-01,POWL,13640
2011-01-01,2011-01-01,24TH,5179
2011-01-01,2011-01-01,NBRK,2120
2011-01-01,2011-01-01,MONT,4155
2011-01-01,2011-01-01,MLBR,3739


### Write out the merged, clean CSV

In [11]:
df_out = df_daily[['dt', 'exit', 'sum']]

df_out.head()

date,dt,exit,sum
2019-01-01,2019-01-01,12TH,2098
2022-01-01,2022-01-01,12TH,798
2020-01-01,2020-01-01,12TH,2345
2021-01-01,2021-01-01,12TH,382
2011-01-01,2011-01-01,12TH,2582


In [12]:
df_daily.to_csv('../data/processed/bart_daily_station.csv', index = False)
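
For the Prophet step, the per-station file would need collapsing to one system-wide row per day, with the columns renamed to `ds` and `y` as Prophet expects. A minimal sketch on made-up rows follows. In practice the input would be read back from `bart_daily_station.csv`, and the variable names here are illustrative only.

```python
import pandas as pd

# Made-up rows in the shape of the saved file: dt, exit, sum.
station_daily = pd.DataFrame({
    "dt": ["2019-01-01", "2019-01-01", "2019-01-02"],
    "exit": ["12TH", "POWL", "12TH"],
    "sum": [2098, 100, 50],
})

prophet_df = (
    station_daily
    .assign(dt=pd.to_datetime(station_daily["dt"]))
    # Sum across exit stations to get system-wide daily ridership.
    .groupby("dt", as_index=False)["sum"].sum()
    # Prophet expects the date column as 'ds' and the target as 'y'.
    .rename(columns={"dt": "ds", "sum": "y"})
)
print(prophet_df)
```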