# Notebook Preamble

## IPython Magic

In [1]:
%load_ext autoreload
%autoreload 3

## Notebook Imports

In [2]:
# Standard Library Imports
import logging
import os
import pathlib
import sys
from typing import List
from pathlib import Path

# 3rd Party Imports:
import numpy as np
import pandas as pd
import sqlalchemy as sa
import pyarrow as pa
import pyarrow.parquet as pq
from intake import open_catalog

# Local Imports
import pudl
from pudl.output.pudltabl import PudlTabl
from pudl.metadata.classes import Resource
from pudl.output.epacems import year_state_filter

## Set up a logger

In [3]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter("%(message)s")
handler.setFormatter(formatter)
logger.handlers = [handler]

## Set up standard PUDL DB connections

In [4]:
pudl_settings = pudl.workspace.setup.get_defaults()
ferc1_engine = sa.create_engine(pudl_settings["ferc1_db"])
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])

pudl_out_raw = pudl.output.pudltabl.PudlTabl(pudl_engine=pudl_engine)
pudl_out = pudl_out_raw

pudl_settings

{'pudl_in': '/home/zane/code/catalyst/pudl-work',
 'data_dir': '/home/zane/code/catalyst/pudl-work/data',
 'settings_dir': '/home/zane/code/catalyst/pudl-work/settings',
 'pudl_out': '/home/zane/code/catalyst/pudl-work',
 'sqlite_dir': '/home/zane/code/catalyst/pudl-work/sqlite',
 'parquet_dir': '/home/zane/code/catalyst/pudl-work/parquet',
 'ferc1_db': 'sqlite:////home/zane/code/catalyst/pudl-work/sqlite/ferc1.sqlite',
 'pudl_db': 'sqlite:////home/zane/code/catalyst/pudl-work/sqlite/pudl.sqlite',
 'censusdp1tract_db': 'sqlite:////home/zane/code/catalyst/pudl-work/sqlite/censusdp1tract.sqlite'}

# Potential benefits of Intake catalogs:
**Expose metadata:** The Intake catalog doesn't currently contain any column-level metadata, but I think it could. That would let a user see which columns are available, check their types, and read their descriptions before querying the large dataset.

**Local data caching:** Local file caching is not yet available, because `fsspec`-based file caching hasn't been implemented in the `intake-parquet` library. Once it is, we would expect it to make repeated access to many small files more efficient, since each file would only need to be transmitted over the network once.

**Data packaging:** The catalog can be packaged and versioned using conda to manage its dependencies on other software packages and ensure compatibility. With remote access or automatic local file caching, the user also doesn't need to think about where the data is being stored, or putting it in the "right place" -- Intake would manage that.

**Uniform API:** All the data sources of a given type (parquet, SQL) would have the same interface, reducing the number of things a user needs to remember to access the data.

**Decoupling data storage location:** As with DNS, we can change / update the location where the data is being stored without impacting the user directly, since the catalog acts as a decoupling reference.
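
The decoupling idea can be sketched with a toy stand-in for a catalog. This is not Intake's API: the `CATALOG` dictionary and the `resolve()` helper are hypothetical, and the local root `/tmp/pudl-test` is a placeholder. The `INTAKE_PATH` environment variable, the source names, and the `gs://` URL are the ones used later in this notebook.

```python
import os

# Hypothetical mapping from source name to a path relative to the data root.
CATALOG = {
    "epacems_one_file": "hourly_emissions_epacems.parquet",
    "epacems_multi_file": "hourly_emissions_epacems",
}


def resolve(name: str) -> str:
    """Return the full location of a named source under the current root."""
    root = os.environ["INTAKE_PATH"].rstrip("/")
    return f"{root}/{CATALOG[name]}"


# The caller asks for the same name; only the environment changes.
os.environ["INTAKE_PATH"] = "/tmp/pudl-test"
local_url = resolve("epacems_one_file")

os.environ["INTAKE_PATH"] = "gs://catalyst.coop/intake/test"
remote_url = resolve("epacems_one_file")
```

User code only ever refers to `"epacems_one_file"`, so moving the data means changing the root, not the code that reads it.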


# Test Intake & Parquet Functionality & Performance

This notebook demonstrates several different ways of organizing and accessing the same EPA CEMS data:
* Local storage on disk vs. remote storage in Google Cloud Storage buckets
* Directly accessing the data via `pandas.read_parquet()` vs. an Intake catalog.
* Using one big Parquet file for all data vs. separate small files for each combination of state & year.

In [42]:
EPACEMS_DIR = pudl_settings["parquet_dir"] + "/epacems"

TEST_FILTERS = year_state_filter(years=[2019, 2020], states=["CO", "TX", "ID"])
INTAKE_PATH_LOCAL = Path(os.getcwd())

# Authenticated URL:
INTAKE_PATH_REMOTE = "gs://catalyst.coop/intake/test"
# Publicly visible URL:
#INTAKE_PATH_REMOTE = "https://storage.googleapis.com/catalyst.coop/intake/test"

local_single_file = str(INTAKE_PATH_LOCAL / "hourly_emissions_epacems.parquet")
local_multi_file = str(INTAKE_PATH_LOCAL / "hourly_emissions_epacems")
remote_single_file = INTAKE_PATH_REMOTE + "/hourly_emissions_epacems.parquet"
remote_multi_file = INTAKE_PATH_REMOTE + "/hourly_emissions_epacems"

pudl_catalog_path = str(INTAKE_PATH_LOCAL / "pudl-catalog.yml")
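
`TEST_FILTERS` is passed to `read_parquet()` as its `filters` argument, which takes a list of lists of tuples: the inner lists are ANDed together and the outer list is ORed. A minimal sketch of building such a filter, assuming columns named `year` and `state`. The helper `build_year_state_filter()` is hypothetical and is not PUDL's `year_state_filter()`.

```python
from itertools import product


def build_year_state_filter(years, states):
    """Build a DNF filter: OR over every (year AND state) combination."""
    return [
        [("year", "=", year), ("state", "=", state.upper())]
        for year, state in product(years, states)
    ]


filters = build_year_state_filter(years=[2019, 2020], states=["CO", "TX", "ID"])
```

Two years and three states give six AND-clauses, one per state-year combination.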

## PUDL Baseline
Read the test data from your local EPA CEMS outputs directly.
* On an SSD this should take less than 10 seconds.
* The only `string` type columns should be `unitid` and `unit_id_epa`.
* The dataframe should take about 1.4 GB of memory and have ~8M rows.

In [6]:
%%time
pudl_epacems = pd.read_parquet(
    EPACEMS_DIR,
    engine="pyarrow",
    filters=TEST_FILTERS,
    use_nullable_dtypes=True,
)

CPU times: user 4.34 s, sys: 1.26 s, total: 5.61 s
Wall time: 3.6 s


In [7]:
display(pudl_epacems.info(show_counts=True, memory_usage="deep"))
del pudl_epacems

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8006424 entries, 0 to 8006423
Data columns (total 19 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   state                      8006424 non-null  category           
 1   plant_id_eia               8006424 non-null  Int32              
 2   unitid                     8006424 non-null  string             
 3   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 4   operating_time_hours       8003928 non-null  float32            
 5   gross_load_mw              8006424 non-null  float32            
 6   steam_load_1000_lbs        33252 non-null    float32            
 7   so2_mass_lbs               3586052 non-null  float32            
 8   so2_mass_measurement_code  3586052 non-null  category           
 9   nox_rate_lbs_mmbtu         3716001 non-null  float32            
 10  nox_rate_measurement_code  3716001 non-nul

None

## Single File Local
For the single file local tests, download [this file](https://storage.googleapis.com/catalyst.coop/intake/test/hourly_emissions_epacems.parquet) into the same directory as this notebook.

### Direct access with `read_parquet()`
* This takes 3-4 seconds

In [8]:
%%time
df = pd.read_parquet(
    local_single_file,
    engine="pyarrow",
    filters=TEST_FILTERS,
    use_nullable_dtypes=True,
)

CPU times: user 4.04 s, sys: 1.19 s, total: 5.23 s
Wall time: 4.12 s


In [9]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8006424 entries, 0 to 8006423
Data columns (total 19 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   state                      8006424 non-null  category           
 1   plant_id_eia               8006424 non-null  Int32              
 2   unitid                     8006424 non-null  string             
 3   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 4   operating_time_hours       8003928 non-null  float32            
 5   gross_load_mw              8006424 non-null  float32            
 6   steam_load_1000_lbs        33252 non-null    float32            
 7   so2_mass_lbs               3586052 non-null  float32            
 8   so2_mass_measurement_code  3586052 non-null  category           
 9   nox_rate_lbs_mmbtu         3716001 non-null  float32            
 10  nox_rate_measurement_code  3716001 non-nul

None

### Via Intake Catalog
* This takes 10-15 seconds

In [22]:
%%time
os.environ["INTAKE_PATH"] = str(INTAKE_PATH_LOCAL)
pudl_cat = open_catalog(pudl_catalog_path)
source = pudl_cat.epacems_one_file(
    filters=TEST_FILTERS,
    engine="pyarrow",
    use_nullable_dtypes=True,
)
dd = source.to_dask()
df = dd.compute()

CPU times: user 10.9 s, sys: 1.95 s, total: 12.9 s
Wall time: 11.9 s


In [23]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8006424 entries, 0 to 8006423
Data columns (total 19 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   state                      8006424 non-null  category           
 1   plant_id_eia               8006424 non-null  Int32              
 2   unitid                     8006424 non-null  string             
 3   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 4   operating_time_hours       8003928 non-null  Float32            
 5   gross_load_mw              8006424 non-null  Int64              
 6   steam_load_1000_lbs        33252 non-null    Int64              
 7   so2_mass_lbs               3586052 non-null  Float32            
 8   so2_mass_measurement_code  3586052 non-null  category           
 9   nox_rate_lbs_mmbtu         3716001 non-null  Float32            
 10  nox_rate_measurement_code  3716001 non-nul

None

## Single File Remote

### Direct access with `read_parquet()`
* Using the authenticated `gs://` URL it takes **20 seconds**
* Using the public `https://` URL this takes **10+ minutes**

In [29]:
%%time
df = pd.read_parquet(
    remote_single_file,
    engine="pyarrow",
    filters=TEST_FILTERS,
    use_nullable_dtypes=True,
)

CPU times: user 5.71 s, sys: 1.52 s, total: 7.23 s
Wall time: 19.4 s


In [27]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8006424 entries, 0 to 8006423
Data columns (total 19 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   state                      8006424 non-null  category           
 1   plant_id_eia               8006424 non-null  Int32              
 2   unitid                     8006424 non-null  string             
 3   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 4   operating_time_hours       8003928 non-null  float32            
 5   gross_load_mw              8006424 non-null  float32            
 6   steam_load_1000_lbs        33252 non-null    float32            
 7   so2_mass_lbs               3586052 non-null  float32            
 8   so2_mass_measurement_code  3586052 non-null  category           
 9   nox_rate_lbs_mmbtu         3716001 non-null  float32            
 10  nox_rate_measurement_code  3716001 non-nul

None

### Via Intake Catalog
* With the `gs://` URL this takes **1 minute**.
* With the `https://` URL it downloads a huge amount of data and then times out.

In [33]:
%%time
os.environ["INTAKE_PATH"] = INTAKE_PATH_REMOTE
pudl_cat = open_catalog(pudl_catalog_path)
source = pudl_cat.epacems_one_file(
    filters=TEST_FILTERS,
    engine="pyarrow",
)
dd = source.to_dask()
df = dd.compute()

FSTimeoutError: 

In [32]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8006424 entries, 0 to 8006423
Data columns (total 19 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   state                      8006424 non-null  category           
 1   plant_id_eia               8006424 non-null  int32              
 2   unitid                     8006424 non-null  object             
 3   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 4   operating_time_hours       8003928 non-null  float32            
 5   gross_load_mw              8006424 non-null  float32            
 6   steam_load_1000_lbs        33252 non-null    float32            
 7   so2_mass_lbs               3586052 non-null  float32            
 8   so2_mass_measurement_code  3586052 non-null  category           
 9   nox_rate_lbs_mmbtu         3716001 non-null  float32            
 10  nox_rate_measurement_code  3716001 non-nul

None

## Multi File Local

For the multi-file local tests download [this tarball](https://storage.googleapis.com/catalyst.coop/intake/test/hourly_emissions_epacems.tar) and extract it in the same directory as this notebook.

### Direct access with `read_parquet()`
* This takes 5 seconds, and results in an excessively large 3GB dataframe because I generated these parquet files before fixing the string-to-categorical type issue.
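
The memory cost of leaving low-cardinality columns as strings can be seen in a small sketch. The three code values and the row count below are made up for illustration; the point is only that a column with a few repeated values is much smaller as `category`.

```python
import pandas as pd

# Toy stand-in for a measurement-code column: 3 distinct values, 300,000 rows.
codes = pd.Series(["Measured", "Calculated", "Substitute"] * 100_000, dtype="string")
as_category = codes.astype("category")

# deep=True counts the memory used by the string values themselves.
string_bytes = codes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
```

The categorical version stores each distinct value once plus a small integer code per row, which is why `category_bytes` comes out far smaller than `string_bytes`.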

In [34]:
%%time
df = pd.read_parquet(
    local_multi_file,
    engine="pyarrow",
    filters=TEST_FILTERS,
    use_nullable_dtypes=True,
)

CPU times: user 7.27 s, sys: 1.43 s, total: 8.7 s
Wall time: 5.58 s


In [35]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8006424 entries, 0 to 8006423
Data columns (total 19 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   plant_id_eia               8006424 non-null  Int32              
 1   unitid                     8006424 non-null  string             
 2   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 3   operating_time_hours       8003928 non-null  float32            
 4   gross_load_mw              8006424 non-null  float32            
 5   steam_load_1000_lbs        33252 non-null    float32            
 6   so2_mass_lbs               3586052 non-null  float32            
 7   so2_mass_measurement_code  3586052 non-null  string             
 8   nox_rate_lbs_mmbtu         3716001 non-null  float32            
 9   nox_rate_measurement_code  3716001 non-null  string             
 10  nox_mass_lbs               3716549 non-nul

None

### Via Intake Catalog
* This takes about 15 seconds, and results in the 3GB dataframe as above.

In [36]:
%%time
os.environ["INTAKE_PATH"] = str(INTAKE_PATH_LOCAL)
pudl_cat = open_catalog(pudl_catalog_path)
source = pudl_cat.epacems_multi_file(
    filters=TEST_FILTERS,
    engine="pyarrow",
)
dd = source.to_dask()
df = dd.compute()

CPU times: user 16.2 s, sys: 2.86 s, total: 19.1 s
Wall time: 13.1 s


In [37]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8006424 entries, 0 to 3320279
Data columns (total 17 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   plant_id_eia               8006424 non-null  Int64              
 1   unitid                     8006424 non-null  string             
 2   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 3   operating_time_hours       8003928 non-null  float32            
 4   gross_load_mw              8006424 non-null  float32            
 5   steam_load_1000_lbs        33252 non-null    float32            
 6   so2_mass_lbs               3586052 non-null  float32            
 7   so2_mass_measurement_code  3586052 non-null  string             
 8   nox_rate_lbs_mmbtu         3716001 non-null  float32            
 9   nox_rate_measurement_code  3716001 non-null  string             
 10  nox_mass_lbs               3716549 non-nul

None

## Multi File Remote

### Direct access with `read_parquet()`
* With the `gs://` URL this takes **1 minute** and downloads minimal data.
* With the `https://` URL this results in a 403 Forbidden error.

In [43]:
%%time
df = pd.read_parquet(
    remote_multi_file,
    engine="pyarrow",
    filters=TEST_FILTERS,
    use_nullable_dtypes=True,
)

CPU times: user 25.7 s, sys: 3.92 s, total: 29.7 s
Wall time: 59.5 s


In [45]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8006424 entries, 0 to 3320279
Data columns (total 17 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   plant_id_eia               8006424 non-null  Int64              
 1   unitid                     8006424 non-null  string             
 2   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 3   operating_time_hours       8003928 non-null  float32            
 4   gross_load_mw              8006424 non-null  float32            
 5   steam_load_1000_lbs        33252 non-null    float32            
 6   so2_mass_lbs               3586052 non-null  float32            
 7   so2_mass_measurement_code  3586052 non-null  string             
 8   nox_rate_lbs_mmbtu         3716001 non-null  float32            
 9   nox_rate_measurement_code  3716001 non-null  string             
 10  nox_mass_lbs               3716549 non-nul

None

### Via Intake Catalog
* With the `gs://` URL this takes **3 minutes** and downloads a small amount of data steadily over the whole run.
* With the `https://` URL this results in a 403 Forbidden error.

In [44]:
%%time
os.environ["INTAKE_PATH"] = INTAKE_PATH_REMOTE
pudl_cat = open_catalog(pudl_catalog_path)
source = pudl_cat.epacems_multi_file(
    filters=TEST_FILTERS,
    engine="pyarrow",
)
dd = source.to_dask()
df = dd.compute()

CPU times: user 1min 12s, sys: 8.69 s, total: 1min 20s
Wall time: 2min 47s


In [None]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

# Notes
* Unsurprisingly, local access is blazing fast regardless of whether it's a single file or many, and while the Intake catalog access takes around 3x as long, it seems fast enough to be plenty usable.
* Remote performance using a single file, the `gs://` protocol, and `read_parquet()` was surprisingly fast: about 5x as long as direct local access (19.4 s vs. 4.1 s wall time).
* Remote performance over `https://` was painfully slow, to the point of being unusable in every Intake case. It also seemed to transmit far more data than the `gs://` case.
* Basically none of the `https://` cases were usable. The only one that completed (single-file `read_parquet()`) took more than 10 minutes.
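* Both remote Intake catalog cases worked over `gs://`, and (as with the local catalogs) each took about 3x as long as the corresponding `read_parquet()` case: about 1 minute vs. 20 seconds for the single file, and about 3 minutes vs. 1 minute for the multi-file dataset.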
* Over `https://` it seems like we can't use directories or wildcards -- we have to enumerate each filename specifically.
* Some of these issues must be due to network speed, but I have 50-100 Mbit/s download speeds, and the amount of data being transmitted varied widely between the different cases.
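* Data type issues remain in the Intake cases. For example, `unitid` came back as `object` rather than `string` in the remote single-file case, and `gross_load_mw` came back as `Int64` rather than `float32` in the local single-file case.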
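
The "about 3x" figures can be checked against the wall times recorded by the `%%time` cells. The sketch below only redoes that arithmetic. The remote single-file Intake case is left out because its cell output shows a timeout rather than a measured time.

```python
# Wall times in seconds, copied from the %%time outputs in this notebook.
wall_times = {
    "local single-file": {"read_parquet": 4.12, "intake": 11.9},
    "local multi-file": {"read_parquet": 5.58, "intake": 13.1},
    "remote multi-file (gs://)": {"read_parquet": 59.5, "intake": 167.0},
}

# How many times longer the Intake route took than read_parquet().
slowdown = {
    case: round(times["intake"] / times["read_parquet"], 1)
    for case, times in wall_times.items()
}

# Remote vs. local single-file read_parquet() (19.4 s vs. 4.12 s).
remote_vs_local = round(19.4 / 4.12, 1)
```

The Intake route comes out at roughly 2.3x to 2.9x the `read_parquet()` time in these three cases, and remote single-file `read_parquet()` at about 4.7x the local time.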