
EPA CEMS Intake Catalog #1564

Open
12 of 15 tasks
zaneselvans opened this issue Mar 30, 2022 · 3 comments
Labels
epacems Integration and analysis of the EPA CEMS dataset. epic Any issue whose primary purpose is to organize other issues into a group. intake Issues related to intake data catalogs. parquet Issues related to the Apache Parquet file format which we use for long tables.

Comments


zaneselvans commented Mar 30, 2022

Description

Create a full-featured Intake Catalog for distributing the EPA CEMS hourly emissions data stored as Parquet files. This follows some exploration in #1155. See also notes in #1495 and PR #1563.

Billing

This work should be billed under our Sloan Foundation "Data Distribution" sub-project.

Goals

  • Allow anonymous / public users to easily work with this relatively large data set (~1 billion rows, 50GB uncompressed).
  • Make it fast and easy to query subsets of the dataset remotely, without having to download the whole thing.
  • Provide human- and machine-readable metadata alongside the data itself, so that users can understand what they're working with.
  • Ensure that Parquet file names are human- and machine-readable, and compatible with many different filesystems and Parquet access methods.
  • Ensure that our EPA CEMS ETL outputs data that's appropriate for distribution as a data catalog.
  • Avoid unnecessary data storage and egress fees.

Tasks / Issues tracked by this Epic

Phase 1:

Get a functional intake catalog deployed for demonstration & feedback.

Phase 2:

Flesh out metadata and improve performance.

Out of Scope

@zaneselvans added the epic label on Mar 30, 2022, and the epacems, intake, and parquet labels on Mar 31, 2022.

katie-lamb commented Mar 31, 2022

I'll do a review on the Intake catalog PR, but here are a few comments from a first pass at the notebook:

Questions

  • Can non-authenticated users access publicly readable data using gs:// URLs?
    • Unless I'm doing something wrong, no. I set the GOOGLE_CLOUD_PROJECT project_id to catalyst-cooperative-pudl as well, but it seems like you need authorization. The https:// links seem unworkably slow as well.
  • How do we add column-level metadata to the catalog appropriately? Can we get the embedded descriptions to show up?
    • Hopefully this column-level metadata can be formatted in pudl-catalog.yml the way it is in metadata.yml for Datasette.
  • How do we add information about what's in the different partitions (i.e. split by year and state, allowable values)?
    • Can this be put in the Intake metadata?
  • Why are we getting jumbled nullable/non-nullable, ints/categories, strings/objects in the types?
    • The dtypes from the catalog (pudl_cat.epacems_one_file.discover()) don't seem to be the same for me as they are for you (unless I'm misunderstanding your comment).
    • Specifically, "Categorical values showing up as integers. String values showing up as objects.". I'm getting both categorical values as well as strings. This is what the dtypes are for me for both epacems_one_file and epacems_multi_file.
    • Additionally, when I use pd.read_parquet() on the whole EPA CEMS directory, unit_id_epa is an Int32 instead of a string; on a single file it is a string; and on a remote file it's an object. When read in with the intake catalog it is back to an Int32.
    • So it seems like types are potentially inconsistent between pandas and Intake, between local and remote files, and possibly between operating systems.
'dtype': {'plant_id_eia': 'int32',
  'unitid': 'string',
  'operating_datetime_utc': 'datetime64[ns, UTC]',
  'operating_time_hours': 'float32',
  'gross_load_mw': 'float32',
  'steam_load_1000_lbs': 'float32',
  'so2_mass_lbs': 'float32',
  'so2_mass_measurement_code': 'category',
  'nox_rate_lbs_mmbtu': 'float32',
  'nox_rate_measurement_code': 'category',
  'nox_mass_lbs': 'float32',
  'nox_mass_measurement_code': 'category',
  'co2_mass_tons': 'float32',
  'co2_mass_measurement_code': 'category',
  'heat_content_mmbtu': 'float32',
  'facility_id': 'Int32',
  'unit_id_epa': 'Int32'},
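One way to sidestep the dtype drift described above, regardless of whether the data arrived via a local file, a remote file, or the catalog, is to coerce every DataFrame to a single canonical dtype mapping after reading. A sketch (the mapping is a subset of the schema quoted above; the `standardize` helper is hypothetical, not an existing PUDL function):

```python
import pandas as pd

# Canonical dtypes for a few of the CEMS columns quoted in the comment above.
CANONICAL_DTYPES = {
    "plant_id_eia": "int32",
    "unitid": "string",
    "unit_id_epa": "Int32",
    "gross_load_mw": "float32",
    "so2_mass_measurement_code": "category",
}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce whatever dtypes pandas inferred into the canonical ones."""
    return df.astype(
        {col: dt for col, dt in CANONICAL_DTYPES.items() if col in df.columns}
    )

# Simulate a read that came back with drifted types:
# object instead of string, plain int64 instead of nullable Int32.
raw = pd.DataFrame({
    "plant_id_eia": [3, 7],
    "unitid": ["1", "2"],
    "unit_id_epa": [10, 20],
    "gross_load_mw": [100.0, 50.0],
    "so2_mass_measurement_code": ["Measured", "Measured"],
})
clean = standardize(raw)
print(clean.dtypes)
```

This doesn't explain *why* the sources disagree, but it makes the disagreement harmless to downstream code.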

Other nits

  • At one point I had to install intake_parquet after getting a somewhat cryptic error. Should intake and intake_parquet be added to the pudl-dev environment?

@zaneselvans (Member Author) commented:

Whoops yes I forgot to add the intake requirements. I had them installed in my local environment.

The dtypes you've got listed there seem to be the correct ones. But you have no year or state columns. I'm still confused as to why those aren't showing up, given that the data is definitely stored in the files.

I really don't understand how the source-specific metadata works. My suspicion is that the allowable year/state values and the column/table descriptions can go in there, but I don't see any documentation on how to do it appropriately.
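Since Intake treats a source's `metadata:` block as a free-form mapping, one plausible place for the allowable partition values and column descriptions is a layout like the sketch below. To be clear, the source name, urlpath, and the `partitions`/`fields` key names are assumptions for illustration, not a documented Intake convention:

```yaml
sources:
  hourly_emissions_epacems:                           # hypothetical source name
    description: EPA CEMS hourly emissions data.
    driver: parquet
    args:
      urlpath: "gs://example-bucket/epacems/*.parquet"  # hypothetical path
    metadata:
      partitions:
        year: [2019, 2020, 2021]    # abridged; full allowable range in practice
        state: [AL, AK, AZ]         # abridged
      fields:
        unit_id_epa:
          type: Int32
          description: EPA-assigned unit identifier.
        gross_load_mw:
          type: float32
          description: Gross power output in MW.
```

Nothing in Intake would interpret these keys automatically; they'd just be surfaced via `source.metadata` for our own tooling and docs to read.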

@zaneselvans (Member Author) commented:

Hey @martindurant thanks so much for your comment on #1496! I got simplecache working and created a basic installable catalog, and have been experimenting with different setups for our open US energy data catalog over in the pudl-catalog repo. I've collected a bunch of outstanding issues in the issue above, which points at the individual pudl-catalog issues, and I was wondering whether we might be able to get some advice from you on how best to set things up. I'm not sure which of these problems are just me not understanding how to configure the catalog correctly and which reflect deeper constraints.

Do you happen to have a list of publicly visible intake catalogs that use Parquet data sources? I've tried searching GitHub but haven't been very successful. The CarbonPlan Data repo is the best I've seen, but they have a very simple configuration.

Once the EPA CEMS Hourly Emissions data source is finished, we also want to look at writing an intake-sqlite driver (#1156) to manage the distribution of versioned SQLite databases, which will download and cache the database file locally, and then use the intake-sql driver to access it. Does that seem like a reasonable approach? Thanks for any pointers you can offer!
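The simplecache mechanism mentioned above is just fsspec URL chaining: prefixing `simplecache::` to any URL keeps a local copy, so a remote SQLite or Parquet file is only downloaded once. A self-contained sketch against a local `file://` URL (in the catalog the inner URL would be a `gs://` or `https://` path; the file contents here are invented):

```python
import os
import tempfile

import fsspec

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "data.txt")
with open(src, "w") as f:
    f.write("hello cems")

cache_dir = os.path.join(tmpdir, "cache")

# "simplecache::" chains a caching layer in front of the inner protocol;
# per-protocol options are passed as keyword arguments named after each
# protocol in the chain.
url = f"simplecache::file://{src}"
with fsspec.open(url, "rt", simplecache={"cache_storage": cache_dir}) as f:
    contents = f.read()
print(contents)
```

Subsequent opens of the same URL are served from `cache_dir` without touching the original location, which is the behavior we'd want for a cached SQLite download feeding intake-sql.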
