
EPA CEMS Intake Catalog #1564

Open
12 of 15 tasks
zaneselvans opened this issue Mar 30, 2022 · 3 comments
Labels
epacems Integration and analysis of the EPA CEMS dataset. epic Any issue whose primary purpose is to organize other issues into a group. intake Issues related to intake data catalogs. parquet Issues related to the Apache Parquet file format which we use for long tables.

Comments


zaneselvans commented Mar 30, 2022

Description

Create a full-featured Intake Catalog for distributing the EPA CEMS hourly emissions data stored as Parquet files. This follows some exploration in #1155. See also notes in #1495 and PR #1563.

Billing

This work should be billed under our Sloan Foundation "Data Distribution" sub-project.

Goals

  • Allow anonymous / public users to easily work with this relatively large data set (~1 billion rows, 50GB uncompressed).
  • Make it fast and easy to query subsets of the dataset remotely, without having to download the whole thing.
  • Provide human- and machine-readable metadata alongside the data itself, so that users can understand what they're working with.
  • Ensure that Parquet file names are human- and machine-readable, and compatible with many different filesystems and Parquet access methods.
  • Ensure that our EPA CEMS ETL outputs data that's appropriate for distribution as a data catalog.
  • Avoid unnecessary data storage and egress fees.

Tasks / Issues tracked by this Epic

Phase 1:

Get a functional intake catalog deployed for demonstration & feedback.

Phase 2:

Flesh out metadata and improve performance.

Out of Scope

@zaneselvans added the epic label on Mar 30, 2022, and the epacems, intake, and parquet labels on Mar 31, 2022.

katie-lamb commented Mar 31, 2022

I'll do a review on the Intake catalog PR, but here are a few comments from a first pass at the notebook:

Questions

  • Can non-authenticated users access publicly readable data using gs:// URLs?
    • Unless I'm doing something wrong, no. I set the GOOGLE_CLOUD_PROJECT project_id to catalyst-cooperative-pudl as well, but it seems like you need authorization. The https:// links seem unworkably slow as well.
  • How do we add column-level metadata to the catalog appropriately? Can we get the embedded descriptions to show up?
    • Hopefully this column-level metadata can be formatted in pudl-catalog.yml the way it is in metadata.yml for Datasette.
  • How do we add information about what's in the different partitions (i.e. split by year and state, allowable values)?
    • Can this be put in the Intake metadata?
  • Why are we getting jumbled nullable/non-nullable, ints/categories, strings/objects in the types?
    • The dtypes from the catalog (pudl_cat.epacems_one_file.discover()) don't seem to be the same for me as they are for you (unless I'm misunderstanding your comment).
    • Specifically, "Categorical values showing up as integers. String values showing up as objects.". I'm getting both categorical values as well as strings. This is what the dtypes are for me for both epacems_one_file and epacems_multi_file.
    • Additionally, when I use pd.read_parquet() on the whole EPA CEMS directory, unit_id_epa is an Int32 instead of a string; on a single file it is a string; and on a remote file it's an object. When read in with the intake catalog it is back to an Int32.
    • So it seems like types are potentially inconsistent between pandas and Intake, between local and remote files, and possibly between operating systems.
'dtype': {'plant_id_eia': 'int32',
  'unitid': 'string',
  'operating_datetime_utc': 'datetime64[ns, UTC]',
  'operating_time_hours': 'float32',
  'gross_load_mw': 'float32',
  'steam_load_1000_lbs': 'float32',
  'so2_mass_lbs': 'float32',
  'so2_mass_measurement_code': 'category',
  'nox_rate_lbs_mmbtu': 'float32',
  'nox_rate_measurement_code': 'category',
  'nox_mass_lbs': 'float32',
  'nox_mass_measurement_code': 'category',
  'co2_mass_tons': 'float32',
  'co2_mass_measurement_code': 'category',
  'heat_content_mmbtu': 'float32',
  'facility_id': 'Int32',
  'unit_id_epa': 'Int32'},
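One way to sidestep the dtype drift described above, regardless of whether the data arrived via a local file, a remote file, or the catalog, is to coerce every DataFrame to a single canonical dtype mapping after reading. A sketch (the mapping is a subset of the schema quoted above; the `standardize` helper is hypothetical, not an existing PUDL function):

```python
import pandas as pd

# Canonical dtypes for a few of the CEMS columns quoted in the comment above.
CANONICAL_DTYPES = {
    "plant_id_eia": "int32",
    "unitid": "string",
    "unit_id_epa": "Int32",
    "gross_load_mw": "float32",
    "so2_mass_measurement_code": "category",
}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce whatever dtypes pandas inferred into the canonical ones."""
    return df.astype(
        {col: dt for col, dt in CANONICAL_DTYPES.items() if col in df.columns}
    )

# Simulate a read that came back with drifted types:
# object instead of string, plain int64 instead of nullable Int32.
raw = pd.DataFrame({
    "plant_id_eia": [3, 7],
    "unitid": ["1", "2"],
    "unit_id_epa": [10, 20],
    "gross_load_mw": [100.0, 50.0],
    "so2_mass_measurement_code": ["Measured", "Measured"],
})
clean = standardize(raw)
print(clean.dtypes)
```

This doesn't explain *why* the sources disagree, but it makes the disagreement harmless to downstream code.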

Other nits

  • At one point I had to install intake_parquet after getting a somewhat cryptic error. Should intake and intake_parquet be added to the pudl-dev environment?

@zaneselvans (Member Author) commented:

Whoops yes I forgot to add the intake requirements. I had them installed in my local environment.

The dtypes you've got listed there seem to be the correct ones. But you have no year or state columns. I'm still confused as to why those aren't showing up, given that the data is definitely stored in the files.

I really don't understand how the source-specific metadata works. My suspicion is that the allowable year/state values and the column/table descriptions can go in there, but I don't see any documentation on how to do it appropriately.
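Since Intake treats a source's `metadata:` block as a free-form mapping, one plausible place for the allowable partition values and column descriptions is a layout like the sketch below. To be clear, the source name, urlpath, and the `partitions`/`fields` key names are assumptions for illustration, not a documented Intake convention:

```yaml
sources:
  hourly_emissions_epacems:                           # hypothetical source name
    description: EPA CEMS hourly emissions data.
    driver: parquet
    args:
      urlpath: "gs://example-bucket/epacems/*.parquet"  # hypothetical path
    metadata:
      partitions:
        year: [2019, 2020, 2021]    # abridged; full allowable range in practice
        state: [AL, AK, AZ]         # abridged
      fields:
        unit_id_epa:
          type: Int32
          description: EPA-assigned unit identifier.
        gross_load_mw:
          type: float32
          description: Gross power output in MW.
```

Nothing in Intake would interpret these keys automatically; they'd just be surfaced via `source.metadata` for our own tooling and docs to read.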

@zaneselvans (Member Author) commented:

Hey @martindurant thanks so much for your comment on #1496! I got simplecache working and created a basic installable catalog, and have been experimenting with different setups for our open US energy data catalog over in the pudl-catalog repo. I've collected a bunch of outstanding issues in the issue above, which points at the individual pudl-catalog issues, and I was wondering whether we might be able to get some advice from you on how best to set things up. I'm not sure which of these problems are just me not understanding how to configure the catalog correctly and which reflect deeper constraints.

Do you happen to have a list of publicly visible intake catalogs that use Parquet data sources? I've tried searching GitHub but haven't been very successful. The CarbonPlan Data repo is the best I've seen, but they have a very simple configuration.

Once the EPA CEMS Hourly Emissions data source is finished, we also want to look at writing an intake-sqlite driver (#1156) to manage the distribution of versioned SQLite databases, which will download and cache the database file locally, and then use the intake-sql driver to access it. Does that seem like a reasonable approach? Thanks for any pointers you can offer!
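The simplecache mechanism mentioned above is just fsspec URL chaining: prefixing `simplecache::` to any URL keeps a local copy, so a remote SQLite or Parquet file is only downloaded once. A self-contained sketch against a local `file://` URL (in the catalog the inner URL would be a `gs://` or `https://` path; the file contents here are invented):

```python
import os
import tempfile

import fsspec

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "data.txt")
with open(src, "w") as f:
    f.write("hello cems")

cache_dir = os.path.join(tmpdir, "cache")

# "simplecache::" chains a caching layer in front of the inner protocol;
# per-protocol options are passed as keyword arguments named after each
# protocol in the chain.
url = f"simplecache::file://{src}"
with fsspec.open(url, "rt", simplecache={"cache_storage": cache_dir}) as f:
    contents = f.read()
print(contents)
```

Subsequent opens of the same URL are served from `cache_dir` without touching the original location, which is the behavior we'd want for a cached SQLite download feeding intake-sql.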
