# EDA Demo July 26

## Intro: Notebooks

* What is a notebook?
* Different notebooks / how to install
    * [Google Colab](https://colab.research.google.com)
        * [How to use Google Colab with GitHub via Google Drive](https://medium.com/analytics-vidhya/how-to-use-google-colab-with-github-via-google-drive-68efb23a42d)
    * Jupyter Notebook / Jupyter Lab
        * [Project Jupyter | Installing Jupyter](https://jupyter.org/install)
        * [How to set up Anaconda and Jupyter Notebook the right way](https://towardsdatascience.com/how-to-set-up-anaconda-and-jupyter-notebook-the-right-way-de3b7623ea4a)
    * [R Markdown](https://rmarkdown.rstudio.com/)


## Loading data

In [None]:
# !pip install markupsafe==2.0.1
# !pip install pandas-profiling

In [None]:
# import packages
# s3fs is required for pandas to read CSVs from S3 (s3:// paths)
import s3fs

import pandas as pd
import pendulum
# note that I had to apply this fix before the import below worked: https://stackoverflow.com/a/70689489
from pandas_profiling import ProfileReport

In [None]:
# construct date range that we can use to load in each daily file
start_date = '2022-05-20'
end_date = '2022-07-16'

# pendulum.period is the pendulum 2.x API; pendulum 3 renamed it to pendulum.interval
date_range = list(pendulum.period(
    pendulum.from_format(start_date, 'YYYY-MM-DD'),
    pendulum.from_format(end_date, 'YYYY-MM-DD')).range('days'))

In [None]:
date_range
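
If pendulum is not available, the same list of daily dates could probably be built with pandas alone. The sketch below is an alternative, not what this notebook uses; `date_strs` is a hypothetical name, and it holds plain strings, so the loading loop would use each value directly instead of calling `to_date_string()`.

```python
import pandas as pd

start_date = '2022-05-20'
end_date = '2022-07-16'

# one 'YYYY-MM-DD' string per calendar day, both endpoints included
date_strs = (
    pd.date_range(start=start_date, end=end_date, freq='D')
    .strftime('%Y-%m-%d')
    .tolist()
)

print(len(date_strs), date_strs[0], date_strs[-1])
```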

In [None]:
# adapted from: https://github.com/chihacknight/chn-ghost-buses/blob/main/data_analysis/compare_scheduled_and_rt.ipynb
# collect each day's dataframe in a list, then concatenate once at the end
daily_frames = []

# loop through each date with data
for day in date_range:
    # format the date as a string like YYYY-MM-DD -- this is the format of the filenames
    date_str = day.to_date_string()
    # print message to monitor progress
    print(f"Processing {date_str} at {pendulum.now().to_datetime_string()}")
    # use pandas read_csv method to load each file
    daily_data = pd.read_csv(f's3://chn-ghost-buses-public/bus_hourly_summary_v2/{date_str}.csv')
    daily_frames.append(daily_data)

# use concat: https://stackoverflow.com/a/15822811 -- DataFrame.append is deprecated,
# and concatenating once avoids re-copying the accumulated data on every iteration
data_raw = pd.concat(daily_frames)

## Explore!

### Profile

In [None]:
# work on a copy so data_raw stays untouched and the files don't need to be reloaded
data = data_raw.copy()

In [None]:
# pandas-profiling report
profile = ProfileReport(data, title="Profile")

In [None]:
profile.to_notebook_iframe()
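
When the full profile report is slow or unavailable, a few plain pandas calls give a rough first look at the same things (types, missing values, distributions). This is a minimal sketch on made-up rows; the `toy` frame and its values are assumptions, and only the column names mirror this notebook.

```python
import pandas as pd

# made-up rows that only mimic the column names used in this notebook
toy = pd.DataFrame({
    'rt': ['22', '22', '9', '9'],
    'data_hour': [7, 8, 7, 8],
    'trip_count': [5, 6, None, 4],
})

dtypes = toy.dtypes                 # type of each column
missing_share = toy.isna().mean()   # fraction of missing values per column
n_unique = toy.nunique()            # distinct non-missing values per column
numeric_summary = toy.describe()    # count, mean, std, quartiles for numeric columns

print(missing_share)
print(numeric_summary)
```

The same calls should work unchanged on `data` in place of `toy`.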

### Relationships between variables 

In [None]:
# examine discrepancies between vehicle, block, and trip counts
data['vh_block_diff'] = data.vh_count - data.block_count
data['vh_trip_diff'] = data.vh_count - data.trip_count
data['block_trip_diff'] = data.block_count - data.trip_count

In [None]:
data.vh_block_diff.describe()

In [None]:
# rows where vehicle and block counts disagree, largest difference first
# sort by temporary column: https://stackoverflow.com/a/38663354
data[data.vh_count != data.block_count].sort_values(by='vh_block_diff', ascending=False)
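
To make the difference-and-filter pattern above easier to follow, here is a small sketch on made-up counts. The `toy` frame, its values, and the `share_mismatched` name are assumptions for illustration; only the column names come from the data used here.

```python
import pandas as pd

# made-up counts; only the column names come from the notebook
toy = pd.DataFrame({
    'rt': ['22', '22', '9'],
    'vh_count': [5, 4, 3],
    'block_count': [5, 2, 4],
    'trip_count': [6, 4, 3],
})

toy['vh_block_diff'] = toy['vh_count'] - toy['block_count']

# keep only rows where the two counts disagree, largest difference first
mismatched = toy[toy['vh_block_diff'] != 0].sort_values(by='vh_block_diff', ascending=False)

# fraction of rows with any disagreement
share_mismatched = (toy['vh_block_diff'] != 0).mean()

print(mismatched)
print(share_mismatched)
```

A share like this is a quick way to judge whether the mismatches are rare exceptions or a common pattern before looking at individual rows.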

### Summarize by routes

In [None]:
# average trip_count for each combination of route and data_hour
avg_hrly_trips_by_route = data.groupby(by=['rt', 'data_hour'])['trip_count'].mean().reset_index()

In [None]:
# plot the hourly averages for route 22
avg_hrly_trips_by_route[avg_hrly_trips_by_route.rt == '22'].plot(x='data_hour', y='trip_count', kind='line')
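
To compare several routes at once rather than filtering to one, the grouped averages can be pivoted so each route becomes a column. A minimal sketch on made-up rows follows; the `toy` values and the `wide` name are assumptions, and the grouping mirrors the cell above.

```python
import pandas as pd

# made-up rows; only the column names come from the notebook
toy = pd.DataFrame({
    'rt': ['22', '22', '22', '9', '9', '9'],
    'data_hour': [7, 7, 8, 7, 8, 8],
    'trip_count': [4, 6, 8, 2, 3, 5],
})

# same grouping as above: mean trip_count per route and hour
avg = toy.groupby(by=['rt', 'data_hour'])['trip_count'].mean().reset_index()

# one row per hour, one column per route
wide = avg.pivot(index='data_hour', columns='rt', values='trip_count')

print(wide)
```

Calling `wide.plot(kind='line')` on a frame shaped like this should draw one line per route on shared axes.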