Clean up CEMS handling of datatypes #3221

Open
e-belfer opened this issue Jan 8, 2024 · 1 comment
Labels
data-types: Dtype conversions, standardization and implications of data types
epacems: Integration and analysis of the EPA CEMS dataset.
metadata: Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
parquet: Issues related to the Apache Parquet file format which we use for long tables.

Comments

@e-belfer
Member

e-belfer commented Jan 8, 2024

Is your feature request related to a problem? Please describe.
Right now, CEMS datatypes are not handled the way FERC or EIA datatypes are. They are not defined in codes.py or fields.py; instead, a few constraints are imposed using apply_pudl_dtypes. As of #3187, we apply some dtypes when reading in raw CEMS data to reduce memory usage, but we don't enforce a fixed set of categories for categorical columns or define metadata for these fields.

Describe the solution you'd like
Reconfigure CEMS column dtype handling to match the other datasets more closely. In particular, categoricals should be defined in codes.py. Because CEMS data is stored only as Parquet, column-level descriptions may not be required.
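
As a rough illustration of what enforcing a fixed set of categories could look like, here is a minimal pandas sketch. The code list, the `MEASUREMENT_CODES` name, and the `enforce_categories` helper are all hypothetical stand-ins and are not taken from codes.py or from the PUDL codebase.

```python
import pandas as pd

# Hypothetical allowed values, standing in for an entry that would live in codes.py.
MEASUREMENT_CODES = ["Measured", "Calculated", "Substitute"]

# A categorical dtype with an explicit, fixed set of categories.
measurement_dtype = pd.CategoricalDtype(categories=MEASUREMENT_CODES)


def enforce_categories(series: pd.Series, dtype: pd.CategoricalDtype) -> pd.Series:
    """Cast to a fixed categorical, raising on values outside the allowed set."""
    unexpected = set(series.dropna().unique()) - set(dtype.categories)
    if unexpected:
        raise ValueError(f"Unexpected categories: {sorted(unexpected)}")
    return series.astype(dtype)


raw = pd.Series(["Measured", "Calculated", None, "Measured"])
clean = enforce_categories(raw, measurement_dtype)
```

Raising on unexpected values is one possible design choice. A plain `astype` with a fixed `CategoricalDtype` would instead convert unknown values to missing values without raising an error, which may or may not be what we want here.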

Describe alternatives you've considered
Currently, columns are assigned dtypes on read-in in pudl.extract.epacems, using a dictionary. See #3187.
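
For readers unfamiliar with the dictionary-based approach, the general pattern looks roughly like the sketch below. It assumes CSV input, and the column names and dtypes are made up for illustration; they are not the actual dictionary in pudl.extract.epacems.

```python
import io

import pandas as pd

# Hypothetical column-to-dtype mapping applied at read time to reduce memory usage.
RAW_DTYPES = {
    "state": "category",
    "facility_id": "int32",
    "gross_load_mw": "float32",
}

# Small in-memory CSV standing in for a raw CEMS file.
csv_text = "state,facility_id,gross_load_mw\nCO,1,10.5\nCO,2,\nTX,3,7.25\n"

df = pd.read_csv(io.StringIO(csv_text), dtype=RAW_DTYPES)

# With dtype="category", the categories are inferred from whatever values
# appear in the file; nothing constrains them to a predefined list.
inferred_categories = list(df["state"].cat.categories)
```

This saves memory, but the set of categories depends on the data that happens to be read, which is the gap described above.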

@e-belfer e-belfer added epacems Integration and analysis of the EPA CEMS dataset. metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. data-types Dtype conversions, standardization and implications of data types labels Jan 8, 2024
@zaneselvans
Member

There's a complication here in that we have a standardized way of applying PUDL dtypes to tables when we write to or read from SQLite, but IIRC we aren't currently using an IO Manager for the EPA CEMS Parquet outputs, and that's where we would need to apply these dtypes.

Also, because we're writing to Parquet, we probably want to implement these dtypes through the Resource.to_pyarrow() method, which would read in whatever metadata (e.g. ENUM constraints, nullability) has been associated with the fields and table, and translate it into a valid PyArrow schema. This is also something that we need to do for #3102.

@zaneselvans zaneselvans added the parquet Issues related to the Apache Parquet file format which we use for long tables. label Jan 8, 2024