# Get Data

In [None]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [None]:
import os
from glob import glob

import boto3
import geopandas as gpd
import pandas as pd
import pandera as pa
from dotenv import find_dotenv, load_dotenv

In [None]:
%aimport src.aggregate_data
import src.aggregate_data as ad

%aimport src.city_neighbourhoods
import src.city_neighbourhoods as cn

%aimport src.city_pub_data
import src.city_pub_data as cpd

%aimport src.process_trips
from src.process_trips import process_trips_data

%aimport src.stations_metadata
from src.stations_metadata import get_stations_metadata, transform_metadata

%aimport src.trips
import src.trips as bt

%aimport src.utils
from src.utils import (
    log_prefect,
    summarize_df,
    save_data_to_parquet_file,
)

%aimport src.validators
from src.validators import validate_data_download_status, validate_merged_data

<a id='table-of-contents'></a>

## [Table of Contents](#table-of-contents)

1. [About](#about)
   - 1.1. [Overview](#overview)
   - 1.2. [Data Retrieval Workflow](#data-retrieval-workflow)
   - 1.3. [Constraints](#constraints)
2. [User Inputs](#user-inputs)
3. [Get Supplementary Datasets](#get-supplementary-datasets)
   - 3.1. [Stations Metadata](#stations-metadata)
   - 3.2. [Neighbourhood Boundary and Land Area Data](#neighbourhood-boundary-and-land-area-data)
   - 3.3. [Colleges and Universities](#colleges-and-universities)
   - 3.4. [Aggregations by Neighbourhood - Get Neighbourhood Data for Supplementary Datasets](#aggregations-by-neighbourhood---get-neighbourhood-data-for-supplementary-datasets)
     - 3.4.1. [Colleges and Universities](#colleges-and-universities)
   - 3.5. [Aggregations by Neighbourhood - Merge Neighbourhood Aggregations with GeoData](#aggregations-by-neighbourhood---merge-neighbourhood-aggregations-for-supplememtary-datasets)
   - 3.6. [Aggregations by Neighbourhood - Merge Stations Metadata with Aggregated Neighbourhood Stats](#aggregations-by-neighbourhood---merge-stations-metadata-with-aggregated-neighbourhood-stats)
4. [Get Bikeshare Trips Data](#get-bikeshare-trips-data)
   - 4.1. [Get URLs for Raw Trips Data Files](#get-urls-for-raw-trips-data-files)
   - 4.2. [Download Raw Trips Data Files](#download-raw-trips-data-files)
   - 4.3. [Load Single-Month of Trips Data](#load-single-month-of-trips-data)
   - 4.4. [Process Single-Month of Trips Data](#process-single-month-of-trips-data)
5. [Data Download End-to-End Workflow](#data-download-end-to-end-workflow)
   - 5.1. [Download Supplementary Datasets](#download-supplementary-datasets)
   - 5.2. [Download Bikeshare Trips Data](#download-bikeshare-trips-data)
6. [To Be Done](#to-be-done)

<a id='about'></a>

## 1. [About](#about)

Retrieve data needed for building the dashboard. Aggregate the trips data and save to a `.parquet` file which can be loaded and used by the dashboard application.

<a id='overview'></a>

### 1.1. [Overview](#overview)

Retrieve the following datasets:
- Toronto Bikeshare [trips (ridership) data](https://open.toronto.ca/dataset/bike-share-toronto-ridership-data/)
- three supplementary datasets
  - [city neighbourhood boundaries](https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/neighbourhood-profiles/about-toronto-neighbourhoods/)
  - number of colleges/universities per city neighbourhood
  - [bikeshare stations metadata](https://open.toronto.ca/dataset/bike-share-toronto/)

Then perform the following steps:
- combine all datasets
- aggregate bikeshare trips in the combined data by
  - city neighbourhood
  - type of bikeshare user
  - `datetime` attributes (year, month, weekday, hour)

<a id='data-retrieval-workflow'></a>

### 1.2. [Data Retrieval Workflow](#data-retrieval-workflow)

This notebook covers two workflows to retrieve the trips data
- interactive workflow
  - only retrieves data for a single month
- end-to-end workflow
  - retrieves data for multiple months

If trips (ridership) data for a single year has been previously retrieved, then the workflow must check if that data is outdated before re-downloading trips data for the same year. This functionality is implemented in the end-to-end workflow in this notebook.

<a id='constraints'></a>

### 1.3. [Constraints](#constraints)

1. Currently, **only one** of the following workflows in this notebook can be run at a time:
   - interactive workflow
     - Get Supplementary Datasets
     - Get Bikeshare Trips Data
   - end-to-end workflow
     - Data Download End-to-End Workflow
2. The notebook has only been verified to run with a single year of trips data at a time (a single year in `years_wanted = [...]`). Retrieving multiple years of data simultaneously (e.g. `years_wanted = [2020, 2021]`) has not been verified.

<a id='user-inputs'></a>

## 2. [User Inputs](#user-inputs)

In [None]:
# Datasets
# # City's Open Data Portal
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show"
# # Bikeshare Ridership
trips_params = {"id": "7e876c24-177c-4605-9cef-e50dd74c617f"}
years_wanted = [2021]
# # Neighbourhood Boundaries
neigh_boundary_params = {"id": "4def3f65-2a65-4a4f-83c4-b2a4aed72d46"}
# # Stations Metadata
about_params = {"id": "2b44db0d-eea9-442d-b038-79335368ad5a"}
stations_cols_wanted = [
    "station_id",
    "name",
    "physical_configuration",
    "lat",
    "lon",
    "altitude",
    "address",
    "capacity",
    "physicalkey",
    "transitcard",
    "creditcard",
    "phone",
]

# Neighbourhood boundary columns to keep
neigh_cols_to_show = [
    "AREA_ID",
    "AREA_SHORT_CODE",
    "AREA_LONG_CODE",
    "AREA_NAME",
    "Shape__Area",
    "AREA_LATITUDE",
    "AREA_LONGITUDE",
    "geometry",
]

# Ridership datetime columns
date_cols = ["Start Time", "End Time"]

# Ridership Columns in which to drop missing values
nan_cols = [
    "START_STATION_ID",
    "START_STATION_NAME",
]

# Ridership Columns with duplicates, in which to drop rows
duplicated_cols = ["TRIP_ID", "START_TIME"]

# Geodata columns to use when checking if a point lies
# within a neighbourhood
geo_cols = ["AREA_NAME", "geometry", "Shape__Area"]

# Directories in which to store data
raw_data_dir = "data/raw"
processed_data_dir = "data/processed"

# Name of .parquet file that will be created with the aggregated data
parquet_filename = "agg_data.parquet.gzip"

ci_run = "no"

In [None]:
raw_data_filepath = os.path.join(raw_data_dir, parquet_filename)
updated_data_filepath = os.path.join(processed_data_dir, parquet_filename)

# Ridership dtypes dict
dtypes_dict_trips = {
    "Trip Id": pd.Int64Dtype(),
    "Trip Duration": pd.Int64Dtype(),
    "Start Station Id": pd.Int64Dtype(),
    "Start Station Name": pd.StringDtype(),
    "User Type": pd.StringDtype(),
}
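As a minimal sketch of why the nullable pandas dtypes above are useful: an integer column read with `pd.Int64Dtype()` can hold missing values without being converted to floats. The CSV text below is made up for illustration, and how `bt.get_single_ridership_data_file` actually applies `dtypes_dict_trips` is not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical CSV extract; the second trip has no start station ID
csv_text = (
    "Trip Id,Trip Duration,Start Station Id,Start Time,User Type\n"
    "1,600,7000,2021-12-06 08:15,Annual Member\n"
    "2,300,,2021-12-06 09:00,Casual Member\n"
)

dtypes_demo = {
    "Trip Id": pd.Int64Dtype(),
    "Trip Duration": pd.Int64Dtype(),
    "Start Station Id": pd.Int64Dtype(),
    "User Type": pd.StringDtype(),
}

# Nullable integer columns keep their integer dtype despite the missing value
df_demo = pd.read_csv(
    io.StringIO(csv_text), dtype=dtypes_demo, parse_dates=["Start Time"]
)
print(df_demo.dtypes)
print(df_demo["Start Station Id"].isna().tolist())
```

Rows with a missing start station ID can then be dropped explicitly, which is the role `nan_cols` appears to play in the user inputs.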

<a id='get-supplementary-datasets'></a>

## 3. [Get Supplementary Datasets](#get-supplementary-datasets)

<a id='stations-metadata'></a>

### 3.1. [Stations Metadata](#stations-metadata)

In [None]:
%%time
df_stations = get_stations_metadata(url, about_params)
df_stations = transform_metadata(df_stations, stations_cols_wanted)
display(df_stations.head(2))
display(df_stations.dtypes.rename("dtype").to_frame())

<a id='neighbourhood-boundary-and-land-area-data'></a>

### 3.2. [Neighbourhood Boundary and Land Area Data](#neighbourhood-boundary-and-land-area-data)

In [None]:
%%time
neigh_boundary_params = {"id": "4def3f65-2a65-4a4f-83c4-b2a4aed72d46"}
gdf = cpd.get_neighbourhood_boundary_land_area_data(url, neigh_boundary_params, neigh_cols_to_show)
gdf[
    gdf["AREA_NAME"].str.contains(
        "Wychwood|Yonge-Eglinton|Yonge-St.|York Univ|Yorkdale-Glen"
    )
].sort_values(by=["AREA_NAME"])

To calculate neighbourhood areas in square km, the geodata must be in a suitable CRS. As a first step, check the current CRS of the geodata and its EPSG code ([link](https://epsg.io/4326)).

In [None]:
print(gdf.crs)
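A rough sketch of why the CRS matters: in a geographic CRS such as EPSG:4326, coordinates are in degrees, so geometries are usually reprojected to a projected CRS with metre units before computing areas. The example below assumes UTM zone 17N (EPSG:32617), which covers Toronto, and uses a made-up rectangle instead of the real neighbourhood boundaries. The CRS used by the project's own area calculation is not shown in this notebook.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical 0.01 x 0.01 degree rectangle near downtown Toronto (lon/lat)
gdf_demo = gpd.GeoDataFrame(
    {"AREA_NAME": ["Demo Area"]},
    geometry=[box(-79.40, 43.65, -79.39, 43.66)],
    crs="EPSG:4326",
)

# Assumption: EPSG:32617 (UTM zone 17N, units of metres) is suitable here
gdf_proj = gdf_demo.to_crs(epsg=32617)

# After reprojection .area is in square metres, so convert to square km
gdf_demo["area_sq_km"] = (gdf_proj.geometry.area / 1_000_000).to_numpy()
print(gdf_demo[["AREA_NAME", "area_sq_km"]])
```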

<a id='neighbourhood-boundary-and-land-area-data'></a>

### 3.3. [Colleges and Universities](#colleges-and-universities)

In [None]:
df_coll_univ = cpd.get_coll_univ_locations(False)

<a id='aggregations-by-neighbourhood---get-neighbourhood-data-for-supplementary-datasets'></a>

### 3.4. [Aggregations by Neighbourhood - Get Neighbourhood Data for Supplementary Datasets](#aggregations-by-neighbourhood---get-neighbourhood-data-for-supplementary-datasets)

<a id='colleges-and-universities'></a>

#### 3.4.1. [Colleges and Universities](#colleges-and-universities)

In [None]:
%%time
df_coll_univ_new = pa.check_output(ad.coll_univ_schema_new)(cn.get_data_with_neighbourhood)(
    gdf[geo_cols],
    df_coll_univ,
    "lat",
    "lon",
    "institution_id",
)
display(df_coll_univ_new.head(2))
display(df_coll_univ_new.dtypes.to_frame())

<a id='aggregations-by-neighbourhood---merge-neighbourhood-aggregations-for-supplememtary-datasets'></a>

### 3.5. [Aggregations by Neighbourhood - Merge Neighbourhood Aggregations with GeoData](#aggregations-by-neighbourhood---merge-neighbourhood-aggregations-for-supplememtary-datasets)

In [None]:
%%time
df_neigh_stats = ad.combine_neigh_stats_v2(
    gdf,
    df_coll_univ_new,
)
display(df_neigh_stats.head())
display(df_neigh_stats.dtypes.to_frame())

<a id='aggregations-by-neighbourhood---merge-stations-metadata-with-aggregated-neighbourhood-stats'></a>

### 3.6. [Aggregations by Neighbourhood - Merge Stations Metadata with Aggregated Neighbourhood Stats](#aggregations-by-neighbourhood---merge-stations-metadata-with-aggregated-neighbourhood-stats)

In [None]:
display(df_stations.head(2))
display(gdf[geo_cols].head(2))

Append the neighbourhood containing each bikeshare station to the station metadata

In [None]:
%%time
df_stations_new = pa.check_output(ad.stations_schema_merged)(cn.get_data_with_neighbourhood)(
    gdf[geo_cols],
    df_stations,
    "lat",
    "lon",
    "station_id",
)
display(df_stations_new.head(2))
display(df_stations_new.dtypes.rename("dtype").to_frame())
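The implementation of `cn.get_data_with_neighbourhood` is not shown in this notebook. To illustrate the underlying idea (finding the polygon that contains a point), here is a pure-Python ray-casting sketch. It is a swapped-in technique for illustration only, and the neighbourhood outlines, station coordinates and helper name are all hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Return True if (lon, lat) lies inside polygon, a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Check whether a horizontal ray from the point crosses this edge
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical neighbourhood outlines (simple rectangles)
neighbourhoods = {
    "Neighbourhood A": [(-79.40, 43.64), (-79.38, 43.64), (-79.38, 43.66), (-79.40, 43.66)],
    "Neighbourhood B": [(-79.38, 43.64), (-79.36, 43.64), (-79.36, 43.66), (-79.38, 43.66)],
}

# Hypothetical stations; the third lies outside both outlines
stations = [
    {"station_id": 1, "lat": 43.65, "lon": -79.39},
    {"station_id": 2, "lat": 43.65, "lon": -79.37},
    {"station_id": 3, "lat": 43.70, "lon": -79.39},
]

# Attach the name of the containing neighbourhood (None if there is no match)
for station in stations:
    station["AREA_NAME"] = next(
        (
            name
            for name, poly in neighbourhoods.items()
            if point_in_polygon(station["lon"], station["lat"], poly)
        ),
        None,
    )

print([(s["station_id"], s["AREA_NAME"]) for s in stations])
```

With real boundary geometries, a spatial library would normally do this work, since it also handles multi-part polygons and holes.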

Merge the modified stations metadata with the neighbourhood stats

In [None]:
display(df_stations_new.head(2))
display(df_neigh_stats.head(2))

In [None]:
%%time
df_stations_new = ad.combine_stations_metadata_neighbourhood_v2(df_stations_new, df_neigh_stats)
print(df_stations_new.shape)
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    display(df_stations_new.head(4))
display(df_stations_new.dtypes.rename("dtype").to_frame())

Show the frequency of stations based on a subset of station characteristics

In [None]:
for col in [
    "CAPACITY",
    "PHYSICAL_CONFIGURATION",
    "PHYSICALKEY",
    "TRANSITCARD",
    "CREDITCARD",
    "PHONE",
]:
    display(df_stations_new[col].value_counts().to_frame())

**Observations**
1. We will keep all stations. Trips will not be filtered out based on any of these characteristics of their departure station.

<a id='get-bikeshare-trips-data'></a>

## 4. [Get Bikeshare Trips Data](#get-bikeshare-trips-data)

<a id='get-urls-for-raw-trips-data-files'></a>

### 4.1. [Get URLs for Raw Trips Data Files](#get-urls-for-raw-trips-data-files)

In [None]:
%%time
df_all_urls = bt.get_file_urls(url, trips_params, years_wanted)
df_all_urls
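The CKAN `package_show` action returns JSON in which `result["resources"]` lists the downloadable files of a dataset. The sketch below filters a hand-written response of that shape by year, without making a network request. The resource names, URLs, timestamps and the helper function are hypothetical, and this is not necessarily how `bt.get_file_urls` works.

```python
import re

# Hand-written stand-in for the JSON returned by `package_show`
# (resource names, URLs and timestamps are made up for illustration)
sample_response = {
    "success": True,
    "result": {
        "resources": [
            {
                "name": "bikeshare-ridership-2020",
                "url": "https://example.com/bikeshare-ridership-2020.zip",
                "last_modified": "2021-01-15T10:00:00",
            },
            {
                "name": "bikeshare-ridership-2021",
                "url": "https://example.com/bikeshare-ridership-2021.zip",
                "last_modified": "2022-01-20T09:30:00",
            },
        ]
    },
}


def select_resources_by_year(response, years):
    """Keep resources whose name contains one of the wanted years (hypothetical helper)."""
    selected = []
    for resource in response["result"]["resources"]:
        match = re.search(r"(20\d{2})", resource["name"])
        if match and int(match.group(1)) in years:
            selected.append(resource)
    return selected


urls_2021 = [r["url"] for r in select_resources_by_year(sample_response, [2021])]
print(urls_2021)
```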

<a id='download-raw-trips-data-files'></a>

### 4.2. [Download Raw Trips Data Files](#download-raw-trips-data-files)

Check whether a local `.parquet` file with trips data exists. The outcome determines what happens next:
- if the `.parquet` file exists, check whether the trips data in that file is outdated compared to the data in the zip file at the above URL
- if the `.parquet` file does not exist (as on the first run of this notebook), the contents of the zip file will be extracted
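The decision described above can be sketched as a small function. This is a hypothetical helper based on the rule stated in the notes at the end of this notebook (a file is outdated if it exists and its data was last modified earlier than the copy on the Open Data portal). It is not the code inside `bt.get_data_zip_file_download_status`.

```python
from datetime import datetime
from typing import Optional


def needs_download(
    parquet_file_exists: bool,
    parquet_data_last_modified: Optional[datetime],
    last_modified_opendata: datetime,
) -> bool:
    """Return True if the zip file should be downloaded and extracted."""
    if not parquet_file_exists:
        # First run: no local data yet, so the zip file is always needed
        return True
    # File exists: download only if the portal's copy is newer
    return parquet_data_last_modified < last_modified_opendata


portal_ts = datetime(2022, 1, 20, 9, 30)
print(needs_download(False, None, portal_ts))
print(needs_download(True, datetime(2021, 12, 1), portal_ts))
print(needs_download(True, datetime(2022, 1, 20, 9, 30), portal_ts))
```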

In [None]:
%%time
df_status = bt.get_data_zip_file_download_status(df_all_urls, raw_data_dir, False)
display(df_status)
display(df_status.dtypes.rename("dtype").to_frame())

Get a list of all CSV files that were extracted

In [None]:
csvs = bt.get_local_csv_list(raw_data_dir, years_wanted)
csvs

<a id='load-single-month-of-trips-data'></a>

### 4.3. [Load Single-Month of Trips Data](#load-single-month-of-trips-data)

Select a single month of trips data (a single CSV file) for processing and get the filepath for that CSV file

In [None]:
csv_filepath = csvs[11]

Get the name of the CSV file from the full filepath

In [None]:
csv_file = os.path.basename(csv_filepath)

Extract attributes from the check (for outdated trips data) that was performed above

In [None]:
cols = [
    "trips_file_name",
    "downloaded_file",
    "parquet_file_exists",
    "parquet_file_outdated",
    "last_modified_opendata",
]
zip_file, downloaded_file, has_parquet, parquet_file_outdated_check, last_mod = (
    df_status[cols].squeeze().tolist()
)
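The unpacking above relies on `df_status` having exactly one row. A minimal sketch with made-up values shows why: `.squeeze()` turns a one-row `DataFrame` into a `Series`, but leaves a multi-row `DataFrame` unchanged, in which case `.tolist()` is not available. This is consistent with the single-year constraint stated in the About section.

```python
import pandas as pd

# Hypothetical single-row status table (values are made up)
df_status_demo = pd.DataFrame(
    {
        "trips_file_name": ["bikeshare-ridership-2021"],
        "downloaded_file": [True],
        "parquet_file_exists": [False],
    }
)
cols_demo = ["trips_file_name", "downloaded_file", "parquet_file_exists"]

# One row: squeeze() returns a Series, so tolist() gives one value per column
zip_file_demo, downloaded_demo, has_parquet_demo = (
    df_status_demo[cols_demo].squeeze().tolist()
)
print(zip_file_demo, downloaded_demo, has_parquet_demo)

# Two rows: squeeze() leaves a DataFrame, which has no tolist() method
df_two_rows = pd.concat([df_status_demo, df_status_demo], ignore_index=True)
is_still_dataframe = isinstance(df_two_rows[cols_demo].squeeze(), pd.DataFrame)
print(is_still_dataframe)
```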

Read in the trips data from the CSV for the selected month

In [None]:
df = bt.get_single_ridership_data_file(
    csv_filepath,
    dtypes_dict_trips,
    date_cols,
    nan_cols,
    duplicated_cols,
)

<a id='process-single-month-of-trips-data'></a>

### 4.4. [Process Single-Month of Trips Data](#process-single-month-of-trips-data)

Perform data processing, including extracting `datetime` attributes from the trips data

In [None]:
%%time
df = process_trips_data(df)
summarize_df(df)
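As an illustration of extracting `datetime` attributes, the sketch below derives year, month, weekday and hour from a start time column using the pandas `.dt` accessor. The trips are made up, the output column names follow those listed in the notes at the end of this notebook, and whether `process_trips_data` stores the weekday as a name or a number is an assumption.

```python
import pandas as pd

# Hypothetical trips with a parsed start time column
df_demo = pd.DataFrame(
    {"START_TIME": pd.to_datetime(["2021-12-06 08:15:00", "2021-12-12 23:40:00"])}
)

# Derive the grouping attributes from the departure timestamp
df_demo["START_year"] = df_demo["START_TIME"].dt.year
df_demo["START_month"] = df_demo["START_TIME"].dt.month
df_demo["START_weekday"] = df_demo["START_TIME"].dt.day_name()
df_demo["START_hour"] = df_demo["START_TIME"].dt.hour
print(df_demo)
```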

In [None]:
display(df.head(2))
display(df_stations_new.head(2))
display(df.dtypes.rename("dtype").to_frame())
display(df_stations_new.dtypes.rename("dtype").to_frame())

Combine metadata and trips data

In [None]:
%%time
df_merged = ad.merge_trips_neighbourhood_stats(df, df_stations_new)
validate_merged_data(df_merged, False)
display(df_merged.head(2))
display(df_merged.dtypes.rename("dtype").to_frame())

Count bikeshare trips by neighbourhood, user type and `datetime` attributes

In [None]:
%%time
df_agg = ad.aggregate_merged_data(df_merged, zip_file, downloaded_file, csv_file, last_mod)
display(df_agg.head(3))
display(df_agg.dtypes.rename("dtype").to_frame())
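A minimal sketch of this kind of aggregation, assuming one row per trip and the column names listed in the notes at the end of this notebook. The data and the user type labels are made up, and `ad.aggregate_merged_data` may compute additional columns or work differently.

```python
import pandas as pd

# Hypothetical merged trips (one row per trip); all values are made up
df_merged_demo = pd.DataFrame(
    {
        "AREA_NAME": ["A", "A", "A", "B"],
        "USER_TYPE": ["Annual Member", "Annual Member", "Casual Member", "Annual Member"],
        "START_year": [2021] * 4,
        "START_month": [12] * 4,
        "START_weekday": ["Monday"] * 4,
        "START_hour": [8, 8, 8, 9],
        "START_STATION_ID": [7000, 7001, 7000, 7100],
        "TRIP_ID": [1, 2, 3, 4],
        "TRIP_DURATION": [600, 300, 900, 1200],
    }
)

group_cols = [
    "AREA_NAME",
    "USER_TYPE",
    "START_year",
    "START_month",
    "START_weekday",
    "START_hour",
]

# Count trips, total their duration and count distinct departure stations
df_agg_demo = df_merged_demo.groupby(group_cols, as_index=False).agg(
    NUM_TRIPS=("TRIP_ID", "count"),
    TRIP_DURATION=("TRIP_DURATION", "sum"),
    NUM_STATIONS=("START_STATION_ID", "nunique"),
)
print(df_agg_demo)
```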

Delete the extracted CSV files and the zip files in the raw data directory

In [None]:
for f in csvs + glob(os.path.join(raw_data_dir, "*.zip")):
    os.remove(f)

<a id='data-download-end-to-end-workflow'></a>

## 5. [Data Download End-to-End Workflow](#data-download-end-to-end-workflow)

<a id='download-supplementary-datasets'></a>

### 5.1. [Download Supplementary Datasets](#download-supplementary-datasets)

In [None]:
%%time
# Get metadata about bikeshare station locations
df_stations = get_stations_metadata(url, about_params)
df_stations = transform_metadata(df_stations, stations_cols_wanted)

# Get neighbourhood boundary metadata
gdf = cpd.get_neighbourhood_boundary_land_area_data(url, neigh_boundary_params, neigh_cols_to_show)

# Get colleges and universities within the city
df_coll_univ = cpd.get_coll_univ_locations()

# Get neighbourhood containing college and university locations
df_coll_univ_new = pa.check_output(ad.coll_univ_schema_new)(cn.get_data_with_neighbourhood)(
    gdf[geo_cols],
    df_coll_univ,
    "lat",
    "lon",
    "institution_id",
)

# Combine aggregated statistics about colleges and universities per neighbourhood with other
# neighbourhood attributes
df_neigh_stats = ad.combine_neigh_stats_v2(
    gdf,
    df_coll_univ_new,
)

# Get neighbourhood containing bikeshare station locations
df_stations_new = pa.check_output(ad.stations_schema_merged)(cn.get_data_with_neighbourhood)(
    gdf[geo_cols],
    df_stations,
    "lat",
    "lon",
    "station_id",
)

# Merge bikeshare station locations with combined neighbourhood statistics
df_stations_new = ad.combine_stations_metadata_neighbourhood_v2(df_stations_new, df_neigh_stats)

**Notes**
1. The supplementary data is small (fewer than 1,000 rows), so loading it is not time-consuming and does not fill up local memory. The set of bikeshare stations may change between updates (as the network expands), so the stations metadata should be retrieved during every update to get the latest set of stations. For these reasons, the supplementary datasets will always be retrieved when updating the data used to build the dashboard.

<a id='download-bikeshare-trips-data'></a>

### 5.2. [Download Bikeshare Trips Data](#download-bikeshare-trips-data)

In [None]:
%%time
df_all_urls = bt.get_file_urls(url, trips_params, years_wanted, False)
df_status = bt.get_data_zip_file_download_status(df_all_urls, raw_data_dir, False)
display(df_status)

cols = [
    "trips_file_name",
    "downloaded_file",
    "parquet_file_exists",
    "parquet_file_outdated",
    "last_modified_opendata",
]
zip_file, downloaded_file, has_parquet, parquet_file_outdated_check, last_mod = (
    df_status[cols].squeeze().tolist()
)

if not has_parquet or parquet_file_outdated_check:
    log_prefect("Loading updated trips data...", True, False)
    csvs = bt.get_local_csv_list(raw_data_dir, years_wanted, False)
    dfs_agg = []
    for k, csv_filepath in enumerate(csvs):
        # NOTE: only the first two CSV files (k = 0 and k = 1) are processed
        if k > 1:
            break

        csv_file = os.path.basename(csv_filepath)
        # year = csv_filepath.split("-")[0][-4:]
        # print(year, csv_filepath, zip_file, downloaded_file, last_mod)

        df = bt.get_single_ridership_data_file(
            csv_filepath,
            dtypes_dict_trips,
            date_cols,
            nan_cols,
            duplicated_cols,
            False
        )

        df = process_trips_data(df, False)

        df_merged = ad.merge_trips_neighbourhood_stats(df, df_stations_new, False)
        validate_merged_data(df_merged, False)

        df_agg = ad.aggregate_merged_data(
            df_merged, zip_file, downloaded_file, csv_file, last_mod, False
        )
        dfs_agg.append(df_agg)
    log_prefect("Loaded updated trips data.", False, False)
else:
    dfs_agg = []
    log_prefect("Trips data is up-to-date. Did not load file.", True, False)
df_agg_downloaded = (
    pd.concat(dfs_agg, ignore_index=True)
    if dfs_agg
    else pd.DataFrame(columns=list(ad.agg_schema.columns))
)
validate_data_download_status(df_agg_downloaded, df_status)
display(df_agg_downloaded.head(3))

# If updated trips data was downloaded and aggregated, then
# update existing parquet file or create new file
if not df_agg_downloaded.empty:
    # Update contents of current parquet file
    if has_parquet:
        ad.update_parquet_file_data(df_agg_downloaded, raw_data_filepath, updated_data_filepath)
    # Export to parquet file
    else:
        pa.check_io(df=ad.agg_schema)(save_data_to_parquet_file)(df_agg_downloaded, raw_data_filepath)

Delete all local data files

In [None]:
files_by_dir = [
    glob(os.path.join(fdir, f"*{ft}"))
    for ft in [".zip", ".gzip", ".csv"]
    for fdir in ["data/raw", "data/processed"]
]
for f in [f for fdir in files_by_dir for f in fdir]:
    os.remove(f)

**Notes**
1. `df_status` will be used to determine if trips data is outdated and it contains the following columns
   - `trips_file_name`
     - name of the trips data zip file available on the city's Open Data portal
   - `last_modified_opendata`
     - timestamp when trips data zip file available on the city's Open Data portal was last updated (on the data portal)
   - `parquet_file_exists`
     - boolean column
     - indicates whether `parquet` file exists with trips data
   - `parquet_file_data_last_modified`
     - timestamp indicating when the trips data zip file, whose contents are in the current `parquet` file, was last updated
   - `parquet_file_outdated`
     - boolean column
     - indicating whether `.parquet` file's contents are outdated
       - contents are outdated if
         - `parquet` file exists
         - `parquet_file_data_last_modified` is earlier than `last_modified_opendata`
   - `downloaded_file`
     - boolean column
     - whether the trips data zip file was downloaded (only `True` if current `parquet` file's contents are outdated)
   
   As mentioned in the [About](#about) section, `df_status` **has only been verified for a single year's trips data** (in the `.zip` file `trips_file_name`).
2. If no trips data newer than the current contents of the `parquet` file is available, then `df_agg_downloaded` will be empty. If newer trips data is available, then `df_agg_downloaded` will contain the following columns
   - `AREA_NAME`
     - name of city neighbourhood ([1](https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/neighbourhood-profiles/about-toronto-neighbourhoods/), [2](https://open.toronto.ca/dataset/neighbourhoods/))
   - `USER_TYPE`
     - type of bikeshare user
   - `START_year`
     - trip departure year
   - `START_month`
     - trip departure month
   - `START_weekday`
     - trip departure day of week (starts at Monday and ends at Sunday)
   - `START_hour`
     - trip departure hour of day (range from 0 to 23)
   - `TRIP_DURATION`
     - total trip duration for departures (trips)
   - `NUM_STATIONS`
     - number of stations from which departures occurred
   - `NUM_DOCKS`
     - total number of available bike docks
     - note that this is not the number of docks from which departures occurred
   - `NUM_TRIPS`
     - number of departures (trips)
   - `NUM_TRIPS_PHYSICALKEY`
     - number of trips from stations that accept payment using a physical key ([link](https://bikesharetoronto.com/faq/))
   - `NUM_TRIPS_TRANSITCARD`
     - number of trips from stations that accept payment using a [transit card](https://dailyhive.com/toronto/presto-discount-bike-share-toronto-february-2019)
   - `NUM_TRIPS_CREDITCARD`
     - number of trips from stations that accept payment using a credit card
   - `NUM_TRIPS_PHONE`
     - number of trips from stations that accept payment using a [mobile phone app](https://bikesharetoronto.com/app/)
   - `NEIGH_COLLEGES_UNIVS`
     - number of major colleges and universities
   - `zip_file`
     - name of the trips data zip file available on the city's Open Data portal
   - `csv_file`
     - name of monthly CSV file containing trips data whose contents are populated in current `parquet` file
   - `downloaded_file`
     - boolean column
     - whether the trips data zip file was downloaded (only `True` if current `parquet` file's contents are outdated)
   - `last_modified_timestamp`
     - timestamp when trips data zip file available on the city's Open Data portal was last updated (on the data portal)

   where all columns with the `NUM_` prefix are per neighbourhood per user type per year per month per weekday per hour.
3. In `df_agg_downloaded`
   - `NUM_STATIONS` is the unique number of stations within each grouping of neighbourhood, user type, year, month, weekday and hour, so it cannot be aggregated further - it can only be shown as is
   - `NUM_DOCKS` should be reported as is as it is not intuitive to interpret further aggregations of this column
   - Since all reported institutions existed before the year in which the bikeshare trips occurred, the value of `NEIGH_COLLEGES_UNIVS` for a neighbourhood does not change across user type, year, month, weekday or hour
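To see why `NUM_STATIONS` cannot be aggregated further, consider a sketch with made-up station IDs: a station that has departures in more than one hour is counted once per hour, so summing the hourly unique counts overstates the number of distinct stations.

```python
# Hypothetical station IDs with departures, per hour, in one neighbourhood
stations_by_hour = {
    8: {7000, 7001},
    9: {7001, 7002},
}

# Summing the hourly unique counts counts station 7001 twice
sum_of_hourly_counts = sum(len(ids) for ids in stations_by_hour.values())

# The true number of unique stations across both hours
true_unique_count = len(set().union(*stations_by_hour.values()))

print(sum_of_hourly_counts, true_unique_count)
```

Additive columns such as `NUM_TRIPS` do not have this problem, since each trip belongs to exactly one group.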

<a id='to-be-done'></a>

## 6. [To Be Done](#to-be-done)

1. Export `parquet` file to cloud storage, for use in dashboard
   - if contents of `parquet` file are outdated, then
     - export the updated contents locally to `updated_data_filepath`
     - delete file from cloud storage
     - upload local file (with updated contents) to cloud storage
2. In `src/trips/get_ridership_data.py`, read from cloud filepath by replacing local filepath with cloud storage filepath.

---

<span style="float:left;">
    2022 | <a href="https://github.com/elsdes3/bikeshare-dash">@elsdes3</a> (MIT)
</span>

<span style="float:right;">
    <a href="./02_transform_bikeshare_data.ipynb">02 - Transform Bikeshare Data >></a>
</span>