# 2 | Importing Data for Initial EDA, Visualizations
---
* [01 API Data Requests](01_API_pulls.ipynb)
* [01.1 Additional BART Data](01_v2_bart.ipynb.ipynb)
* _[02 Initial EDA](02_EDA.ipynb)_
* [03 First Model: Prophet](03_prophet.ipynb)
---

### Data Discussion

* [BART](bart.gov) publishes monthly reports with daily ridership for that month, based on faregate counts of entries and exits.
* [EIA](https://www.eia.gov/opendata/qb.php?category=240839&sdid=PET.EMM_EPM0_PTE_SCA_DPG.M) publishes monthly and weekly fuel prices.
* [CA Energy](https://www.energy.ca.gov/data-reports/energy-almanac/zero-emission-vehicle-and-infrastructure-statistics/vehicle-population) publishes vehicle counts annually. DMV and CA Data only provide annual counts.
* [Fed Reserve](federalreserve.gov) publishes yearly consumer debt.
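
Because the sources arrive at different frequencies (daily, weekly, monthly, annual), they need a common time step before they can be compared. Below is a minimal sketch of one way to do that with pandas. The numbers are made up and the column names `riders` and `cars` are placeholders; the notebooks in this series may align the series differently.

```python
import numpy as np
import pandas as pd

# Toy daily ridership (made-up values): 1000 riders every day, Jan-Feb 2020
days = pd.date_range('2020-01-01', '2020-02-29', freq='D')
daily = pd.DataFrame({'riders': np.full(len(days), 1000)}, index=days)

# Downsample daily counts to month-start totals
monthly_riders = daily['riders'].resample('MS').sum()

# Toy annual vehicle counts (made-up values), stamped on January 1
annual = pd.Series([100, 200],
                   index=pd.to_datetime(['2020-01-01', '2021-01-01']),
                   name='cars')

# Upsample annual counts to month-start steps, carrying the last value forward
monthly_cars = annual.resample('MS').ffill()

print(monthly_riders)
print(monthly_cars.head(3))
```

Summing suits counts such as ridership, while forward-filling suits slow-moving stock figures such as registered vehicles; prices would more naturally be averaged.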

In [None]:
# pip install ipywidgets

In [1]:
##### BASIC IMPORTS 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly

# Local helper module: import it by module name (no .py suffix) once it is on the Python path
# import gcustoms

In [1]:
# CUSTOM IMPORTS AND SETTINGS 

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)  # for plots to render in jupyter notebook
pd.options.display.max_columns = 90                     # view settings
pd.options.display.max_rows = 100

path = '../data/processed/'

In [10]:
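def date_index(df):
    """Index df by a datetime 'date' column built from 'ds'; rename 'ridership' to 'y'."""
    df['date'] = pd.to_datetime(df['ds'])   # note: this also adds 'date' to the caller's frame
    df = df.set_index('date')
    df = df.rename(columns={'ridership': 'y'})

    return df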
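
A quick check of what `date_index` does, on a three-row toy frame with made-up values. The function is repeated here only so the snippet runs on its own; the column names `ds` and `ridership` are the ones the function expects.

```python
import pandas as pd

def date_index(df):
    df['date'] = pd.to_datetime(df['ds'])
    df = df.set_index('date')
    df = df.rename(columns={'ridership': 'y'})
    return df

# Made-up monthly values for illustration only
toy = pd.DataFrame({'ds': ['2019-12-01', '2020-01-01', '2020-02-01'],
                    'ridership': [10, 20, 30]})
out = date_index(toy)

# With a DatetimeIndex in place, rows can be sliced by date string
recent = out.loc['2020-01-01':]
print(recent)
```

This is why the later cells can write `.loc['2010-01-01':]` directly on the returned frames.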

In [11]:
# FUNCTION DRAWS A PLOTLY LINE TRACE
# TAKES 3 ARGUMENTS: (dataframe, y column name, and title for plot)
import plotly.graph_objects as go   # provides go.Scatter

def plot_traces(df, y, title):
    y_trace = go.Scatter(
                    x = df.index,
                    y = df[y], 
                    name = y + ' trace',
                    line = dict(color = 'blue'),
                    opacity = 0.4)

    layout = dict(title = title)

    fig = dict(data=[y_trace], layout=layout)
    iplot(fig)
    print('done')   # the figure is drawn as a side effect; the function returns None

> <br>
>
> 1. BART Ridership
> 
> <br>

In [26]:
filenameA = path + 'bart_daily_station.csv'
bartA = pd.read_csv(filenameA)
bartA.columns = ['dt', 'exit', 'riders']

bartA.head()



,dt,exit,riders
0,2019-01-01,12TH,2098
1,2022-01-01,12TH,798
2,2020-01-01,12TH,2345
3,2021-01-01,12TH,382
4,2011-01-01,12TH,2582


In [38]:
bartA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193689 entries, 0 to 193688
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   dt      193689 non-null  object
 1   exit    193689 non-null  object
 2   riders  193689 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.4+ MB
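
`info()` reports `dt` as `object`, so the dates are still strings, and the `head()` output shows the rows are not in date order. Below is a sketch of parsing and sorting on made-up rows that mimic the layout above; the counts are illustrative, not taken from the file.

```python
import pandas as pd

# Made-up rows in the same (dt, exit, riders) layout, deliberately out of order
toy = pd.DataFrame({
    'dt':     ['2019-01-01', '2022-01-01', '2020-01-01'],
    'exit':   ['12TH', '12TH', '12TH'],
    'riders': [5, 7, 6],
})

# Parse the date strings, then order by station and date
toy['dt'] = pd.to_datetime(toy['dt'])
toy = toy.sort_values(['exit', 'dt']).reset_index(drop=True)

print(toy.dtypes)
print(toy)
```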


In [37]:

# Total riders per exit station: group the full frame first, then select 'riders'
df_all2 = bartA.groupby('exit')['riders'].sum()

# Per-day totals across all stations would be:
# df_all = bartA.groupby('dt')['riders'].sum()

df_all2.head()
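
Grouping has to happen on the full frame, before the `riders` column is selected: a Series holding only `riders` has no `exit` or `dt` labels to group by. A minimal sketch on made-up rows (station codes and counts are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    'dt':     ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'exit':   ['12TH', '16TH', '12TH', '16TH'],
    'riders': [100, 50, 120, 70],
})

# Total riders per exit station
by_station = toy.groupby('exit')['riders'].sum()

# Total riders per day, across all stations
by_day = toy.groupby('dt')['riders'].sum()

print(by_station)
print(by_day)
```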

In [12]:
filename = path + 'bart.csv'
bart = pd.read_csv(filename)
# bart = date_index(bart)

In [13]:
# bart2 = bart.loc['2011-01-01':]
bart_plot = plot_traces(bart, 'ridership', 'BART Monthly Ridership, 2011 - 2022')

done


> <br>
>
> 2. Fuel Prices
> 
> <br>

In [14]:
filename = path + 'fuel_w.csv'
fuel_w = pd.read_csv(filename)

fuel_w = date_index(fuel_w)
fuel_w.tail()

date,ds,fuel_w
2022-04-11,2022-04-11,5.715
2022-04-18,2022-04-18,5.641
2022-04-25,2022-04-25,5.609
2022-05-02,2022-05-02,5.629
2022-05-09,2022-05-09,5.748


In [15]:
fuel2 = fuel_w.loc['2010-01-01':]
fuel_plot2 = plot_traces(fuel2, 'fuel_w', 'Weekly Average Gas Price ($), California: 2010 - 2022')

done


In [16]:
filename = path + 'fuel_m.csv'
fuel_m = pd.read_csv(filename)

fuel_m = date_index(fuel_m)
fuel_m.tail()

date,ds,fuel_m
2021-12-01,2021-12-01,4.597
2022-01-01,2022-01-01,4.584
2022-02-01,2022-02-01,4.66
2022-03-01,2022-03-01,5.655
2022-04-01,2022-04-01,5.692


In [17]:
fuel3 = fuel_m.loc['2010-01-01':]
fuel_plot3 = plot_traces(fuel3, 'fuel_m', 'Monthly Average Gas Price ($), California: 2010 - 2022')

done


> <br>
>
> 3. Manipulating 'REGISTERED VEHICLES' file: 
> 
> <br>

In [18]:
filename = path + 'vehs.csv'
vehs = pd.read_csv(filename)

vehs = date_index(vehs)
vehs.tail()

date,ds,cars
2017-01-01,2017-01-01,28418039
2018-01-01,2018-01-01,28681493
2019-01-01,2019-01-01,29029787
2020-01-01,2020-01-01,28665934
2021-01-01,2021-01-01,29942517


In [19]:
vehs2 = vehs.loc['2010-01-01':]
cars_plot = plot_traces(vehs2, 'cars', 'Estimated Count of Registered Cars CA: 2010 - 2021')

done


> <br>
>
> 4. Manipulating 'CONSUMER DEBT' file: 
> 
> <br>

In [20]:
filename = path + 'debt.csv'
debt = pd.read_csv(filename)
debt['ds'] = debt['date']

debt = date_index(debt)
debt.tail()

date,debt,ds
2021-11-01,4408.96983,2021-11-01
2021-12-01,4431.91715,2021-12-01
2022-01-01,4448.88285,2022-01-01
2022-02-01,4486.57969,2022-02-01
2022-03-01,4539.01445,2022-03-01


In [21]:
debt2 = debt.loc['2010-01-01':]
debt_plot = plot_traces(debt2, 'debt', 'Consumer Debt ($) 2010 - 2022 (not adjusted, Federal Reserve)')

done
