# Notebook Preamble

## IPython Magic

In [1]:
%load_ext autoreload
%autoreload 3

## Notebook Imports

In [2]:
# Standard Library Imports
import logging
import os
import sys
from pathlib import Path

# We need to set these environment variables prior to importing our intake catalog.
# You can also set them in your own shell environment instead.
os.environ["PUDL_INTAKE_CACHE"] = str(Path.home() / ".cache/intake-pudl")

# The fastest remote data source. Requires authentication for now.
os.environ["PUDL_INTAKE_PATH"] = "gcs://catalyst.coop/intake/test"

# Available to the anonymous public, but not yet working
# os.environ["PUDL_INTAKE_PATH"] = "https://storage.googleapis.com/catalyst.coop/intake/test"

# Local data if you've got it!
# os.environ["PUDL_INTAKE_PATH"] = str(Path.cwd().parent / "data")

# 3rd Party Imports:
import intake
import pandas as pd
from pudl_catalog.helpers import year_state_filter

TEST_YEARS = [2019, 2020]
TEST_STATES = ["ID", "CO", "TX"]

## Set up a logger

In [3]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter("%(message)s")
handler.setFormatter(formatter)
logger.handlers = [handler]

# Potential benefits of Intake catalogs:
**Expose metadata:** The Intake catalog doesn't contain any column-level metadata, but I think it could. This would allow a user to see what columns were available, what their types were, and read their descriptions before querying the large dataset.

**Local data caching:** We would expect local file caching to make repeated access to many small files more efficient, since each file would only need to be transmitted over the network once. However, local file caching is not available yet: `fsspec`-based file caching hasn't been implemented in the `intake-parquet` library.

**Data packaging:** The catalog can be packaged and versioned using conda to manage its dependencies on other software packages and ensure compatibility. With remote access or automatic local file caching, the user also doesn't need to think about where the data is being stored, or putting it in the "right place" -- Intake would manage that.

**Uniform API:** All the data sources of a given type (parquet, SQL) would have the same interface, reducing the number of things a user needs to remember to access the data.

**Decoupling data storage location:** As with DNS, we can change / update the location where the data is being stored without impacting the user directly, since the catalog acts as a decoupling reference.
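As a rough illustration of this decoupling, the catalog entry shown later in this notebook stores its `urlpath` as a template containing `{{ env(PUDL_INTAKE_PATH) }}`. The sketch below mimics that substitution with a small regular expression. It is not Intake's own implementation: `ENV_PATTERN` and `expand_env` are hypothetical names, and `/path/to/data` is a placeholder local directory.

```python
import os
import re

# Hypothetical stand-in for the catalog's template expansion.
# Intake's real templating code is not reproduced here.
ENV_PATTERN = re.compile(r"\{\{\s*env\((\w+)\)\s*\}\}")


def expand_env(template, environ=None):
    """Replace each `{{ env(NAME) }}` placeholder with the value of NAME."""
    environ = os.environ if environ is None else environ
    return ENV_PATTERN.sub(lambda match: environ[match.group(1)], template)


# The template as it appears in the catalog's `describe()` output.
template = "simplecache::{{ env(PUDL_INTAKE_PATH) }}/hourly_emissions_epacems.parquet"

# The same catalog entry resolves to different storage locations depending
# only on the environment, so user code does not need to change.
remote = expand_env(template, {"PUDL_INTAKE_PATH": "gcs://catalyst.coop/intake/test"})
local = expand_env(template, {"PUDL_INTAKE_PATH": "/path/to/data"})  # placeholder path
print(remote)
print(local)
```

With the remote setting, the result matches the resolved `urlpath` that the catalog entry displays.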

## Intake References
* [Intake Documentation](https://intake.readthedocs.io/en/latest/start.html)
* [Intake Examples](https://github.com/intake/intake-examples)
* [CarbonPlan Data Catalogs](https://github.com/carbonplan/data)
* [AnacondaCon Presentation Video](https://www.youtube.com/watch?v=oyZJrROQzUs)

# Test Intake & Parquet Functionality & Performance

This notebook demonstrates several different ways of organizing and accessing the same EPA CEMS data:
* Local storage on disk vs. remote storage in Google Cloud Storage buckets.
* Directly accessing the data via `pandas.read_parquet()` vs. an Intake catalog.
* Using one big Parquet file for all data vs. separate small files for each combination of state & year.

## Data for local catalog testing
Download these files and place them in the `data` directory at the top level of the repo. Make sure you extract the tarball after downloading it.
* Single Parquet file: https://storage.googleapis.com/catalyst.coop/intake/test/hourly_emissions_epacems.parquet
* Year-state partitioned data: https://storage.googleapis.com/catalyst.coop/intake/test/hourly_emissions_epacems.tar

## What Intake data sources are installed?

In [4]:
list(intake.cat)

['hourly_emissions_epacems',
 'hourly_emissions_epacems_partitioned',
 'pudl_cat']

In [5]:
pudl_cat = intake.cat.pudl_cat
list(pudl_cat)

['hourly_emissions_epacems', 'hourly_emissions_epacems_partitioned']

In [6]:
pudl_cat

pudl_cat:
  args:
    path: /home/zane/code/catalyst/pudl-data-catalog/src/pudl_catalog/pudl_catalog.yaml
  description: A catalog of open energy system data for use by climate advocates,
    policymakers, journalists, researchers, and other members of civil society.
  driver: intake.catalog.local.YAMLFileCatalog
  metadata:
    creator:
      email: pudl@catalyst.coop
      path: https://catalyst.coop
      title: Catalyst Cooperative


In [7]:
pudl_cat.hourly_emissions_epacems

hourly_emissions_epacems:
  args:
    engine: pyarrow
    storage_options:
      simplecache:
        cache_storage: /home/zane/.cache/intake-pudl
    urlpath: simplecache::gcs://catalyst.coop/intake/test/hourly_emissions_epacems.parquet
  description: Hourly pollution emissions and plant operational data reported via
    Continuous Emissions Monitoring Systems (CEMS) as required by 40 CFR Part 75.
    Includes CO2, NOx, and SO2, as well as the heat content of fuel consumed and gross
    power output. Hourly values reported by US EIA ORISPL code and emissions unit
    (smokestack) ID.
  driver: intake_parquet.source.ParquetSource
  metadata:
    catalog_dir: /home/zane/code/catalyst/pudl-data-catalog/src/pudl_catalog/
    license:
      name: CC-BY-4.0
      path: https://creativecommons.org/licenses/by/4.0
      title: Creative Commons Attribution 4.0
    path: https://ampd.epa.gov/ampd
    provider: US Environmental Protection Agency Air Markets Program
    title: Continuous Emission

In [8]:
pudl_cat.hourly_emissions_epacems.describe()

{'name': 'hourly_emissions_epacems',
 'container': 'dataframe',
 'plugin': ['parquet'],
 'driver': ['parquet'],
 'description': 'Hourly pollution emissions and plant operational data reported via Continuous Emissions Monitoring Systems (CEMS) as required by 40 CFR Part 75. Includes CO2, NOx, and SO2, as well as the heat content of fuel consumed and gross power output. Hourly values reported by US EIA ORISPL code and emissions unit (smokestack) ID.',
 'direct_access': 'forbid',
 'user_parameters': [],
 'metadata': {'title': 'Continuous Emissions Monitoring System (CEMS) Hourly Data',
  'type': 'application/parquet',
  'provider': 'US Environmental Protection Agency Air Markets Program',
  'path': 'https://ampd.epa.gov/ampd',
  'license': {'name': 'CC-BY-4.0',
   'title': 'Creative Commons Attribution 4.0',
   'path': 'https://creativecommons.org/licenses/by/4.0'}},
 'args': {'engine': 'pyarrow',
  'urlpath': 'simplecache::{{ env(PUDL_INTAKE_PATH) }}/hourly_emissions_epacems.parquet',
  

### Parquet metadata with `discover()`
* Categorical values showing up as integers.
* String values showing up as objects.
* No length in the shape, but 19 columns.
* `npartitions` apparently counts files, not row groups.

In [9]:
pudl_cat.hourly_emissions_epacems.discover()

{'dtype': {'plant_id_eia': 'int32',
  'unitid': 'object',
  'operating_datetime_utc': 'datetime64[ns, UTC]',
  'year': 'int32',
  'state': 'int64',
  'facility_id': 'int32',
  'unit_id_epa': 'object',
  'operating_time_hours': 'float32',
  'gross_load_mw': 'float32',
  'heat_content_mmbtu': 'float32',
  'steam_load_1000_lbs': 'float32',
  'so2_mass_lbs': 'float32',
  'so2_mass_measurement_code': 'int64',
  'nox_rate_lbs_mmbtu': 'float32',
  'nox_rate_measurement_code': 'int64',
  'nox_mass_lbs': 'float32',
  'nox_mass_measurement_code': 'int64',
  'co2_mass_tons': 'float32',
  'co2_mass_measurement_code': 'int64'},
 'shape': (None, 19),
 'npartitions': 1,
 'metadata': {'title': 'Continuous Emissions Monitoring System (CEMS) Hourly Data',
  'type': 'application/parquet',
  'provider': 'US Environmental Protection Agency Air Markets Program',
  'path': 'https://ampd.epa.gov/ampd',
  'license': {'name': 'CC-BY-4.0',
   'title': 'Creative Commons Attribution 4.0',
   'path': 'https://creat

### Partitioned metadata with `discover()`
* Same issues as above.
* How do we get information about how the partitions are structured in here? How long they each are?
* For some reason this cell takes a very long time to run and generates a lot of network traffic.

In [10]:
pudl_cat.hourly_emissions_epacems_partitioned.discover()

{'dtype': {'plant_id_eia': 'int32',
  'unitid': 'object',
  'operating_datetime_utc': 'datetime64[ns, UTC]',
  'year': 'int32',
  'state': 'int64',
  'facility_id': 'int32',
  'unit_id_epa': 'object',
  'operating_time_hours': 'float32',
  'gross_load_mw': 'float32',
  'heat_content_mmbtu': 'float32',
  'steam_load_1000_lbs': 'float32',
  'so2_mass_lbs': 'float32',
  'so2_mass_measurement_code': 'int64',
  'nox_rate_lbs_mmbtu': 'float32',
  'nox_rate_measurement_code': 'int64',
  'nox_mass_lbs': 'float32',
  'nox_mass_measurement_code': 'int64',
  'co2_mass_tons': 'float32',
  'co2_mass_measurement_code': 'int64'},
 'shape': (None, 19),
 'npartitions': 1274,
 'metadata': {'title': 'Continuous Emissions Monitoring System (CEMS) Hourly Data',
  'type': 'application/parquet',
  'provider': 'US Environmental Protection Agency Air Markets Program',
  'path': 'https://ampd.epa.gov/ampd',
  'license': {'name': 'CC-BY-4.0',
   'title': 'Creative Commons Attribution 4.0',
   'path': 'https://cr

## Normal usage

In [11]:
%%time
print(f"Reading data from {os.getenv('PUDL_INTAKE_PATH')}")
filters = year_state_filter(
    years=TEST_YEARS,
    states=TEST_STATES,
)
display(filters)
epacems_df = (
    pudl_cat.hourly_emissions_epacems(filters=filters)
    .to_dask().compute()
)

Reading data from gcs://catalyst.coop/intake/test


[[('year', '=', 2019), ('state', '=', 'ID')],
 [('year', '=', 2019), ('state', '=', 'CO')],
 [('year', '=', 2019), ('state', '=', 'TX')],
 [('year', '=', 2020), ('state', '=', 'ID')],
 [('year', '=', 2020), ('state', '=', 'CO')],
 [('year', '=', 2020), ('state', '=', 'TX')]]

CPU times: user 4.05 s, sys: 3.11 s, total: 7.15 s
Wall time: 7 s
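
The filter list displayed above is in disjunctive normal form: each inner list is a set of conditions that must all hold (AND), and a row is kept if it matches any inner list (OR). pyarrow-based Parquet readers accept filters in this shape. A minimal sketch of how such a list could be built follows. The function name `year_state_filter_sketch` is hypothetical, and the real `year_state_filter` in `pudl_catalog.helpers` may be implemented differently.

```python
from itertools import product


def year_state_filter_sketch(years, states):
    """Build one AND-clause per (year, state) pair; the outer list is an OR."""
    return [
        [("year", "=", year), ("state", "=", state)]
        for year, state in product(years, states)
    ]


filters = year_state_filter_sketch(years=[2019, 2020], states=["ID", "CO", "TX"])
for clause in filters:
    print(clause)
```

Two years and three states give six clauses, one for each combination.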


In [12]:
epacems_df

Unnamed: 0,plant_id_eia,unitid,operating_datetime_utc,year,state,facility_id,unit_id_epa,operating_time_hours,gross_load_mw,heat_content_mmbtu,steam_load_1000_lbs,so2_mass_lbs,so2_mass_measurement_code,nox_rate_lbs_mmbtu,nox_rate_measurement_code,nox_mass_lbs,nox_mass_measurement_code,co2_mass_tons,co2_mass_measurement_code
0,469,4,2019-01-01 07:00:00+00:00,2019,CO,79,298,1.0,203.0,2146.199951,,1.3,Measured,0.061,Measured,130.917999,Calculated,127.199997,Measured
1,469,4,2019-01-01 08:00:00+00:00,2019,CO,79,298,1.0,203.0,2152.699951,,1.3,Measured,0.061,Measured,131.315002,Calculated,127.599998,Measured
2,469,4,2019-01-01 09:00:00+00:00,2019,CO,79,298,1.0,204.0,2142.199951,,1.3,Measured,0.061,Measured,130.673996,Calculated,127.000000,Measured
3,469,4,2019-01-01 10:00:00+00:00,2019,CO,79,298,1.0,204.0,2129.199951,,1.3,Measured,0.061,Measured,129.880997,Calculated,126.199997,Measured
4,469,4,2019-01-01 11:00:00+00:00,2019,CO,79,298,1.0,204.0,2160.600098,,1.3,Measured,0.061,Measured,131.796997,Calculated,128.100006,Measured
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8006419,61242,CT2,2021-01-01 01:00:00+00:00,2020,TX,8435,91229,0.0,0.0,0.000000,,,,,,,,,
8006420,61242,CT2,2021-01-01 02:00:00+00:00,2020,TX,8435,91229,0.0,0.0,0.000000,,,,,,,,,
8006421,61242,CT2,2021-01-01 03:00:00+00:00,2020,TX,8435,91229,0.0,0.0,0.000000,,,,,,,,,
8006422,61242,CT2,2021-01-01 04:00:00+00:00,2020,TX,8435,91229,0.0,0.0,0.000000,,,,,,,,,


In [13]:
epacems_df.info(show_counts=True, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8006424 entries, 0 to 8006423
Data columns (total 19 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   plant_id_eia               8006424 non-null  int32              
 1   unitid                     8006424 non-null  object             
 2   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 3   year                       8006424 non-null  int32              
 4   state                      8006424 non-null  category           
 5   facility_id                8006424 non-null  int32              
 6   unit_id_epa                8006424 non-null  object             
 7   operating_time_hours       8003928 non-null  float32            
 8   gross_load_mw              8006424 non-null  float32            
 9   heat_content_mmbtu         8006424 non-null  float32            
 10  steam_load_1000_lbs        33252 non-null 

## Test Performance of different sources

In [14]:
from pudl_catalog.hourly_emissions_epacems import TestEpaCemsParquet
epacems_tester = TestEpaCemsParquet()

In [15]:
epacems_tester.test_direct(years=TEST_YEARS, states=TEST_STATES, verify_df=False)

read_parquet, protocol='local', partition=False, years=[2019, 2020], states=['ID', 'CO', 'TX']:
    elapsed time: 3.27s
read_parquet, protocol='local', partition=True, years=[2019, 2020], states=['ID', 'CO', 'TX']:
    elapsed time: 3.57s
read_parquet, protocol='gcs', partition=False, years=[2019, 2020], states=['ID', 'CO', 'TX']:
    elapsed time: 18.27s
read_parquet, protocol='gcs', partition=True, years=[2019, 2020], states=['ID', 'CO', 'TX']:
    elapsed time: 42.11s


In [16]:
epacems_tester.test_intake(years=TEST_YEARS, states=TEST_STATES, verify_df=False)

read_parquet, protocol='local', partition=False, years=[2019, 2020], states=['ID', 'CO', 'TX']:
    elapsed time: 4.66s
intake, protocol='local', partition=False, years=[2019, 2020], states=['ID', 'CO', 'TX']:
    elapsed time: 8.17s
intake, protocol='local', partition=True, years=[2019, 2020], states=['ID', 'CO', 'TX']:
    elapsed time: 29.94s
intake, protocol='gcs', partition=False, years=[2019, 2020], states=['ID', 'CO', 'TX']:
    elapsed time: 10.72s
intake, protocol='gcs', partition=True, years=[2019, 2020], states=['ID', 'CO', 'TX']:
    elapsed time: 35.96s
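
To make the elapsed times above easier to compare, the sketch below expresses each one as a multiple of the local, single-file `read_parquet` time reported by the same cell (3.27s for `test_direct`, 4.66s for `test_intake`). The numbers are copied from the output above and come from a single run, so the ratios are only indicative. The helper name `slowdown` is hypothetical.

```python
# Elapsed times in seconds, keyed by (protocol, partitioned), copied from the
# test_direct and test_intake output in this notebook.
direct_times = {
    ("local", False): 3.27,
    ("local", True): 3.57,
    ("gcs", False): 18.27,
    ("gcs", True): 42.11,
}
intake_times = {
    ("local", False): 8.17,
    ("local", True): 29.94,
    ("gcs", False): 10.72,
    ("gcs", True): 35.96,
}


def slowdown(times, baseline):
    """Return each elapsed time as a multiple of the baseline time."""
    return {key: value / baseline for key, value in times.items()}


# Each cell printed its own local, single-file read_parquet baseline.
direct_ratios = slowdown(direct_times, baseline=3.27)
intake_ratios = slowdown(intake_times, baseline=4.66)

for label, ratios in (("read_parquet", direct_ratios), ("intake", intake_ratios)):
    for (protocol, partitioned), ratio in ratios.items():
        print(f"{label:12s} {protocol:5s} partition={partitioned!s:5s} {ratio:5.1f}x")
```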


# Notes
* Unsurprisingly, local access is blazing fast regardless of whether it's a single file or many, and while the Intake catalog access takes around 3x as long, it seems fast enough to be plenty usable.
* Remote performance using a single file, the `gs://` protocol, and `read_parquet()` was shockingly fast. It took less than 10x as long as direct local access.
* Remote performance over `https://` was painfully slow, to the point of being unusable in all uses of Intake. It also seemed to be transmitting far, far more data than in the `gs://` case.
* It seems like directories and wildcards can't be used over `https://`? Do we have to enumerate each filename specifically?
* My first thought is that some of the issues are related to network speed, but I have download speeds of 50-100 Mbit/s, and the amount of data being transmitted varied widely between the different cases.
* There are still some data type issues with Intake. Strings get turned into objects, categoricals are integers, nullability isn't preserved. What we see in `source.discover()` doesn't reflect what we get in the output dataframe eventually.
* `simplecache::` is working, but it caches files even when we already have them locally. Is there some way to avoid that?
* `simplecache::` also ends up caching the **entire** dataset on the first access when trying to filter, because every file has to be examined to find out which ones have the relevant row groups. This kind of defeats the purpose of doing the file partitioning and per-file caching. Is this how it's supposed to work?
* Working with the partitioned data seems to result in a pretty big performance hit, especially remotely, but not locally. What's up with that? Can it not scan only the relevant row groups when working remotely?
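
The whole-file caching behavior described in the notes above can be illustrated with a toy model. This is not how `fsspec` or `simplecache::` is implemented. It assumes, as the notes suggest, that the only way to learn which year and state a file contains is to open it, and that opening a file caches it whole. All names here (`WholeFileCache`, `read_filtered`, the `part-NN.parquet` files) are hypothetical.

```python
# Toy "remote" store: 12 partition files, each holding one (year, state) key
# that can only be learned by opening the file.
keys = [(year, state) for year in (2018, 2019, 2020) for state in ("CO", "ID", "TX", "WY")]
remote_files = {f"part-{i:02d}.parquet": key for i, key in enumerate(keys)}


class WholeFileCache:
    """Copy a file into a local dict the first time it is opened."""

    def __init__(self, remote):
        self.remote = remote
        self.local = {}
        self.downloads = 0

    def open(self, name):
        if name not in self.local:
            self.local[name] = self.remote[name]  # simulated full download
            self.downloads += 1
        return self.local[name]


def read_filtered(cache, wanted):
    """Open every file to inspect its key, keeping only the wanted ones."""
    return [name for name in cache.remote if cache.open(name) in wanted]


cache = WholeFileCache(remote_files)
wanted = {(2019, "CO"), (2020, "CO")}

first = read_filtered(cache, wanted)
downloads_after_first = cache.downloads
second = read_filtered(cache, wanted)

print(first, downloads_after_first, cache.downloads)
```

In this model the first filtered read downloads every file even though only two match, and the second read triggers no further downloads. A reader that could prune files from a separate index, or from partition information in the catalog, would avoid the initial full download.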

# Questions
* Can non-authenticated users access publicly readable data using `gs://` URLs?
* How do we add column-level metadata to the catalog appropriately? Can we get the embedded descriptions to show up?
* How do we add information about what's in the different partitions (i.e. how they are split by year and state, and the allowable values)?
* Why are we getting jumbled nullable/non-nullable, ints/categories, strings/objects in the types?
* Can we explicitly add Pandas metadata to the Parquet files when we write them? Why isn't that happening already, given that they're being written from pandas dataframes?