# Notebook Preamble

## IPython Magic

In [None]:
%load_ext autoreload
%autoreload 3

## Notebook Imports

In [None]:
# Standard Library Imports
import logging
import os
import pathlib
import sys
from typing import List
from pathlib import Path

# 3rd Party Imports:
import numpy as np
import pandas as pd
import sqlalchemy as sa
import pyarrow as pa
import pyarrow.parquet as pq
from intake import open_catalog

# Local Imports
import pudl
from pudl.output.pudltabl import PudlTabl
from pudl.metadata.classes import Resource
from pudl.output.epacems import year_state_filter

## Set up a logger

In [None]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter("%(message)s")
handler.setFormatter(formatter)
logger.handlers = [handler]

## Set up standard PUDL DB connections

In [None]:
pudl_settings = pudl.workspace.setup.get_defaults()
ferc1_engine = sa.create_engine(pudl_settings["ferc1_db"])
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])

pudl_out_raw = pudl.output.pudltabl.PudlTabl(pudl_engine=pudl_engine)
pudl_out = pudl_out_raw

pudl_settings

# Potential benefits of Intake catalogs
**Expose metadata:** The Intake catalog doesn't contain any column-level metadata, but I think it could. This would allow a user to see what columns were available, what their types were, and read their descriptions before querying the large dataset.

**Local data caching:** Local file caching is not yet available. We would expect caching to make repeated access to many small files more efficient, since each file would only need to be transmitted over the network once. However, `fsspec`-based file caching hasn't yet been implemented in the `intake-parquet` library.

**Data packaging:** The catalog can be packaged and versioned using conda to manage its dependencies on other software packages and ensure compatibility. With remote access or automatic local file caching, the user also doesn't need to think about where the data is being stored, or putting it in the "right place" -- Intake would manage that.

**Uniform API:** All the data sources of a given type (parquet, SQL) would have the same interface, reducing the number of things a user needs to remember to access the data.

**Decoupling data storage location:** As with DNS, we can change / update the location where the data is being stored without impacting the user directly, since the catalog acts as a decoupling reference.
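
The decoupling idea can be illustrated without Intake itself. The following is only a sketch: `SOURCE_PATHS` and `resolve_source()` are hypothetical names, not part of Intake or PUDL, and `/tmp/intake` is a made-up local directory. It mimics how this notebook switches between local and remote data by changing only a base location (the `INTAKE_PATH` environment variable), while the source names stay the same.

```python
import os
from typing import Optional

# Hypothetical mapping from catalog source names to paths relative to a base
# location. A real Intake catalog stores this kind of information in YAML.
SOURCE_PATHS = {
    "epacems_one_file": "hourly_emissions_epacems.parquet",
    "epacems_multi_file": "hourly_emissions_epacems",
}


def resolve_source(name: str, base: Optional[str] = None) -> str:
    """Return the full location of a named source under a base path or URL."""
    if base is None:
        # Fall back to the environment variable this notebook sets later on.
        base = os.environ.get("INTAKE_PATH", ".")
    return f"{base.rstrip('/')}/{SOURCE_PATHS[name]}"


# The same source name resolves to different storage locations.
print(resolve_source("epacems_one_file", base="/tmp/intake"))
print(resolve_source("epacems_one_file", base="gs://catalyst.coop/intake/test"))
```

User code only ever refers to `epacems_one_file`; moving the data means changing the base location in one place.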

## Intake References
* [Intake Documentation](https://intake.readthedocs.io/en/latest/start.html)
* [Intake Examples](https://github.com/intake/intake-examples)
* [CarbonPlan Data Catalogs](https://github.com/carbonplan/data)
* [AnacondaCon Presentation Video](https://www.youtube.com/watch?v=oyZJrROQzUs)

# Test Intake & Parquet Functionality & Performance

This notebook demonstrates several different ways of organizing and accessing the same EPA CEMS data:
* Local storage on disk vs. remote storage in Google Cloud Storage buckets
* Directly accessing the data via `pandas.read_parquet()` vs. an Intake catalog.
* Using one big Parquet file for all data vs. separate small files for each combination of state & year.

In [None]:
EPACEMS_DIR = pudl_settings["parquet_dir"] + "/epacems"

TEST_FILTERS = year_state_filter(years=[2019, 2020], states=["CO", "TX", "ID"])
INTAKE_PATH_LOCAL = Path(os.getcwd())
os.environ["INTAKE_PATH"] = str(INTAKE_PATH_LOCAL)

# Authenticated URL:
INTAKE_PATH_REMOTE = "gs://catalyst.coop/intake/test"
# Publicly visible URL:
#INTAKE_PATH_REMOTE = "https://storage.googleapis.com/catalyst.coop/intake/test"

local_single_file = str(INTAKE_PATH_LOCAL / "hourly_emissions_epacems.parquet")
local_multi_file = str(INTAKE_PATH_LOCAL / "hourly_emissions_epacems")
remote_single_file = INTAKE_PATH_REMOTE + "/hourly_emissions_epacems.parquet"
remote_multi_file = INTAKE_PATH_REMOTE + "/hourly_emissions_epacems"

pudl_catalog_path = str(INTAKE_PATH_LOCAL / "pudl-catalog.yml")
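
`TEST_FILTERS` is passed to PyArrow as a row filter. PyArrow accepts filters in disjunctive normal form: a list of AND-groups, each a list of `(column, op, value)` tuples, that are ORed together. The sketch below shows one way such a filter could be built for every year/state combination. `build_year_state_filter()` is a hypothetical stand-in; it is an assumption that `year_state_filter()` returns something of this shape and that the filter columns are named `year` and `state`.

```python
import itertools
from typing import List, Sequence, Tuple

Predicate = Tuple[str, str, object]


def build_year_state_filter(
    years: Sequence[int], states: Sequence[str]
) -> List[List[Predicate]]:
    """Build a DNF filter selecting every (year, state) combination."""
    return [
        [("year", "=", year), ("state", "=", state.upper())]
        for year, state in itertools.product(years, states)
    ]


filters = build_year_state_filter(years=[2019, 2020], states=["CO", "TX", "ID"])
print(len(filters))  # 2 years x 3 states = 6 AND-groups
print(filters[0])
```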

## Examine the Catalog

In [None]:
pudl_cat = open_catalog(pudl_catalog_path)
list(pudl_cat)

In [None]:
pudl_cat.metadata

### Source level description
* Mostly taken from the YAML file, plus a few extra fields: `container`, `direct_access`, and `user_parameters`.

In [None]:
pudl_cat.epacems_one_file.describe()

### Source internals
* Categorical values show up as integers.
* String values show up as objects.
* The shape has no length (row count), but does report 19 columns.
* `npartitions` apparently refers to files, not row-group-based partitions.

In [None]:
pudl_cat.epacems_one_file.discover()
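
When a loaded dataframe ends up with `object` strings or non-nullable integers, one possible workaround is to coerce the dtypes explicitly after loading. This is a minimal sketch on a toy dataframe; the columns and values are assumptions chosen for illustration, not the real EPA CEMS schema, and it does not address why the catalog reports the types it does.

```python
import pandas as pd

# Toy stand-in for a dataframe loaded through the catalog.
raw = pd.DataFrame(
    {
        "state": ["CO", "TX", "CO"],
        "unitid": ["1", "2A", "1"],
        "year": [2019, 2019, 2020],
    }
)

# Explicitly request categorical, string, and nullable integer dtypes.
typed = raw.astype({"state": "category", "unitid": "string", "year": "Int64"})
print(typed.dtypes)
```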

### The other source internals
* Here we have nullable ints, but they're 64-bit?
* Categories show up as `category`, not integers.
* Strings show up as `string`, not `object`.
* `npartitions` is referring to the separate files. How do we get information about how the partitions are structured in here?
* Somehow in `shape` we've lost a couple of columns! There's no state or year. Probably this is because of how I converted from the old hive partitioned version of epacems.

In [None]:
pudl_cat.epacems_multi_file.discover()
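
In a Hive-partitioned dataset, partition columns such as year and state are stored in directory names rather than inside the Parquet files, so a reader that ignores the directory structure would not see them. The sketch below parses partition values out of a path. The `year=.../state=...` layout and the file name are assumptions for illustration, not the actual layout of the test data, and `parse_hive_partitions()` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Dict


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Extract key=value partition pairs from the directory parts of a path."""
    partitions = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        # Only directory names of the form key=value count as partitions.
        if sep and key and value:
            partitions[key] = value
    return partitions


example = "hourly_emissions_epacems/year=2019/state=CO/part-0.parquet"
print(parse_hive_partitions(example))
```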

## PUDL Baseline
Read the test data from your local EPA CEMS outputs directly.
* On an SSD this should take less than 10 seconds.
* The only `string` type columns should be `unitid` and `unit_id_epa`
* The dataframe should take about 1.4 GB of memory and have ~8M rows.

In [None]:
%%time
pudl_epacems = pd.read_parquet(
    EPACEMS_DIR,
    engine="pyarrow",
    filters=TEST_FILTERS,
    use_nullable_dtypes=True,
)

In [None]:
display(pudl_epacems.info(show_counts=True, memory_usage="deep"))
del pudl_epacems

## Single File Local
For the single file local tests, download [this file](https://storage.googleapis.com/catalyst.coop/intake/test/hourly_emissions_epacems.parquet) into the same directory as this notebook.

### Direct access with `read_parquet()`
* This takes 3-4 seconds

In [None]:
%%time
df = pd.read_parquet(
    local_single_file,
    engine="pyarrow",
    filters=TEST_FILTERS,
    use_nullable_dtypes=True,
)

In [None]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

### Via Intake Catalog
* This takes 3-4 seconds

In [None]:
%%time
os.environ["INTAKE_PATH"] = str(INTAKE_PATH_LOCAL)
pudl_cat = open_catalog(pudl_catalog_path)
source = pudl_cat.epacems_one_file(filters=TEST_FILTERS)
dd = source.to_dask()
df = dd.compute()

In [None]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

## Single File Remote

### Direct access with `read_parquet()`
* Using the authenticated `gs://` URL it takes **20 seconds**
* Using the public `https://` URL this takes **10+ minutes**

In [None]:
%%time
df = pd.read_parquet(
    remote_single_file,
    engine="pyarrow",
    filters=TEST_FILTERS,
)

In [None]:
display(df.info(show_counts=True, memory_usage="deep"))
del df
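
A rough upper bound on how much data each case could have transferred follows from the elapsed time and the connection speed. This sketch assumes the 50-100 Mbit/s download speed mentioned in the Notes at the end of this notebook, and treats the reported times as pure transfer time, which overstates the real volume. `max_megabytes()` is a hypothetical helper.

```python
def max_megabytes(seconds: float, mbit_per_s: float) -> float:
    """Upper bound on megabytes transferred in `seconds` at `mbit_per_s`."""
    return seconds * mbit_per_s / 8  # 8 bits per byte


for label, seconds in [("gs:// (20 s)", 20), ("https:// (10 min)", 600)]:
    low = max_megabytes(seconds, 50)
    high = max_megabytes(seconds, 100)
    print(f"{label}: at most {low:.0f}-{high:.0f} MB")
```

Under these assumptions the `gs://` read cannot have moved more than about 250 MB, while the `https://` read had time to move several GB.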

### Via Intake Catalog
* With `gs://` URL this takes **1 minute**
* With `https://` URL it downloads a huge amount of data and then times out.

In [None]:
%%time
os.environ["INTAKE_PATH"] = INTAKE_PATH_REMOTE
pudl_cat = open_catalog(pudl_catalog_path)
source = pudl_cat.epacems_one_file(filters=TEST_FILTERS)
dd = source.to_dask()
df = dd.compute()

In [None]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

## Multi File Local

For the multi-file local tests download [this tarball](https://storage.googleapis.com/catalyst.coop/intake/test/hourly_emissions_epacems.tar) and extract it in the same directory as this notebook.

### Direct access with `read_parquet()`
* This takes 5 seconds, and results in an excessively large 3 GB dataframe because I generated these parquet files before fixing the string-to-categorical type issue.

In [None]:
%%time
df = pd.read_parquet(
    local_multi_file,
    engine="pyarrow",
    filters=TEST_FILTERS,
    use_nullable_dtypes=True,
)

In [None]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

### Via Intake Catalog
* This takes about 15 seconds, and results in the 3 GB dataframe as above.

In [None]:
%%time
os.environ["INTAKE_PATH"] = str(INTAKE_PATH_LOCAL)
pudl_cat = open_catalog(pudl_catalog_path)
source = pudl_cat.epacems_multi_file(
    filters=TEST_FILTERS,
    engine="pyarrow",
)
dd = source.to_dask()
df = dd.compute()

In [None]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

## Multi File Remote

### Direct access with `read_parquet()`
* With the `gs://` URL this takes **1 minute** and downloads minimal data.
* With the `https://` URL this results in a 403 Forbidden error.

In [None]:
%%time
df = pd.read_parquet(
    remote_multi_file,
    engine="pyarrow",
    filters=TEST_FILTERS,
    use_nullable_dtypes=True,
)

In [None]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

### Via Intake Catalog
* With the `gs://` URL this takes **3 minutes** and downloads a little bit of data across the whole time.
* With the `https://` URL this results in a 403 Forbidden error.

In [None]:
%%time
os.environ["INTAKE_PATH"] = INTAKE_PATH_REMOTE
pudl_cat = open_catalog(pudl_catalog_path)
source = pudl_cat.epacems_multi_file(
    filters=TEST_FILTERS,
    engine="pyarrow",
)
dd = source.to_dask()
df = dd.compute()

In [None]:
display(df.info(show_counts=True, memory_usage="deep"))
del df

# Notes
* Unsurprisingly, local access is blazing fast regardless of whether it's a single file or many. Intake catalog access took up to about 3x as long (15 vs. 5 seconds in the multi-file case), but it still seems fast enough to be plenty usable.
* Remote performance using a single file, the `gs://` protocol, and `read_parquet()` was shockingly fast. It took less than 10x as long as direct local access.
* Remote performance over `https://` was painfully slow, to the point of being unusable in all uses of Intake. It also seemed to be transmitting far, far more data than in the `gs://` case.
* Basically none of the `https://` cases were usable. The only one that completed took 10 minutes.
* The remote Intake catalog cases only completed over `gs://`. The single-file case took about 1 minute and the multi-file case about 3 minutes, each roughly 3x as long as the corresponding `read_parquet()` call.
* Over `https://` it seems like we can't use directories or wildcards -- we have to enumerate each filename specifically.
* Some of the issues here must be due to network speed, but I have 50-100 Mbit/s download speeds, and the amount of data being transmitted varied widely between the different cases.
* There are still data type issues in all of the Intake cases: strings get turned into objects.
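
The relative slowdowns can be recomputed from the timings reported in the sections above. This is just arithmetic on those numbers; the only assumption is that "3-4 seconds" is represented by its midpoint.

```python
# Elapsed times in seconds as reported in the sections above.
# "3-4 seconds" is represented by its midpoint, 3.5 (an assumption).
timings = {
    "single file, local": {"read_parquet": 3.5, "intake": 3.5},
    "single file, gs://": {"read_parquet": 20, "intake": 60},
    "multi file, local": {"read_parquet": 5, "intake": 15},
    "multi file, gs://": {"read_parquet": 60, "intake": 180},
}

# How much longer Intake took than read_parquet() in each case.
ratios = {case: t["intake"] / t["read_parquet"] for case, t in timings.items()}
for case, ratio in ratios.items():
    print(f"{case}: Intake took {ratio:.1f}x as long as read_parquet()")

# How much longer remote single-file access took than local.
remote_vs_local = (
    timings["single file, gs://"]["read_parquet"]
    / timings["single file, local"]["read_parquet"]
)
print(f"single file, gs:// vs. local read_parquet(): {remote_vs_local:.1f}x")
```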

# Questions
* Can non-authenticated users access publicly readable data using `gs://` URLs?
* How do we add column-level metadata to the catalog appropriately? Can we get the embedded descriptions to show up?
* How do we add information about what's in the different partitions (i.e. how the data is split by year and state, and which values are allowable)?
* Why are we getting jumbled nullable/non-nullable, ints/categories, strings/objects in the types?
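
On the first question, one thing to try: `gcsfs` (the `fsspec` backend for `gs://` URLs) accepts `token="anon"` for anonymous access, and `pandas.read_parquet()` forwards `storage_options` to it. The sketch below only builds the keyword arguments. Whether anonymous reads actually succeed depends on the bucket's permissions, which has not been verified here, and `anonymous_read_kwargs()` is a hypothetical helper.

```python
from typing import Any, Dict


def anonymous_read_kwargs(filters=None) -> Dict[str, Any]:
    """Keyword arguments for an unauthenticated pandas.read_parquet() call."""
    return {
        "engine": "pyarrow",
        "filters": filters,
        # Passed through fsspec to gcsfs; "anon" requests anonymous access.
        "storage_options": {"token": "anon"},
    }


kwargs = anonymous_read_kwargs(filters=[[("year", "=", 2019), ("state", "=", "CO")]])
print(kwargs["storage_options"])
# Intended use (requires pandas, pyarrow, gcsfs, and network access):
# df = pd.read_parquet(remote_single_file, **kwargs)
```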