Output Parquet files as well as SQLite in PUDL ETL #3296

Merged
merged 68 commits into from Feb 6, 2024


@zschira zschira commented Jan 25, 2024

Background

This PR branches off #3232. Most of the changes come from that branch, with a few additions that attempt to answer unresolved questions from that ongoing PR. Specific changes here include: refactoring where the check_foreign_keys logic lives; reverting the use of unittest for consistency (I don't feel too strongly either way, but consistency is nice); and removing the check for extra columns from enforce_schema, which was causing CI to fail.
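
For readers unfamiliar with what a foreign key check on a SQLite output involves, here is a minimal sketch using only the standard library. It is not the PUDL implementation: the function name `find_fk_violations` and the `plants`/`generators` tables are hypothetical, and the only real API assumed is SQLite's `PRAGMA foreign_key_check`.

```python
import sqlite3


def find_fk_violations(conn: sqlite3.Connection) -> list[tuple]:
    """Return one row per foreign key violation; an empty list means none.

    Hypothetical helper for illustration, not PUDL's check_foreign_keys.
    Each row is (child_table, rowid, parent_table, fk_index).
    """
    return conn.execute("PRAGMA foreign_key_check").fetchall()


def demo() -> list[tuple]:
    # In-memory database with made-up tables. Foreign key enforcement is
    # off by default in SQLite, so the orphaned row below is accepted on
    # insert and only surfaces when the check is run afterwards.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE plants (plant_id INTEGER PRIMARY KEY);
        CREATE TABLE generators (
            generator_id INTEGER PRIMARY KEY,
            plant_id INTEGER REFERENCES plants(plant_id)
        );
        INSERT INTO plants VALUES (1);
        INSERT INTO generators VALUES (10, 1);  -- valid reference
        INSERT INTO generators VALUES (11, 2);  -- no plant with id 2
        """
    )
    violations = find_fk_violations(conn)
    conn.close()
    return violations


if __name__ == "__main__":
    for child, rowid, parent, fk_index in demo():
        print(f"{child} row {rowid} has no matching row in {parent}")
```

Running the check after all tables are loaded, rather than enforcing constraints on every insert, means tables can be written in any order.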

rousik and others added 25 commits January 6, 2024 17:12
Use pyarrow.parquet for writing/reading, add BaseSettings to
configure input/output behaviors. By default, let both modes
be disabled but allow overrides via PUDL_WRITE_TO_PARQUET and
PUDL_READ_FROM_PARQUET env variables.
Conversion to pyarrow table was necessary before writing to parquet.
This makes more sense as the io manager should be format independent.
This encapsulation makes it much nicer than passing around fixtures.
@zschira zschira self-assigned this Jan 25, 2024
@zschira zschira linked an issue Jan 25, 2024 that may be closed by this pull request
@zaneselvans

@bendnorman I think we're just waiting to see if your comments were all addressed.

@zaneselvans zaneselvans added this pull request to the merge queue Feb 6, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 6, 2024
@zaneselvans zaneselvans added this pull request to the merge queue Feb 6, 2024
@zaneselvans zaneselvans changed the title Parquet outputs Output Parquet files as well as SQLite in PUDL ETL Feb 6, 2024
@zaneselvans zaneselvans removed this pull request from the merge queue due to a manual request Feb 6, 2024
@zaneselvans zaneselvans added this pull request to the merge queue Feb 6, 2024
@zaneselvans zaneselvans added this to the v2024.02 milestone Feb 6, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 6, 2024
@zaneselvans zaneselvans added this pull request to the merge queue Feb 6, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 6, 2024
@zaneselvans zaneselvans added this pull request to the merge queue Feb 6, 2024
Merged via the queue into main with commit 776b3e7 Feb 6, 2024
13 checks passed
@zaneselvans zaneselvans deleted the parquet_outputs branch February 6, 2024 16:08
@MichaelTiemannOSC

Awesome work!

Labels
output: Exporting data from PUDL into other platforms or interchange formats.
parquet: Issues related to the Apache Parquet file format which we use for long tables.
Development

Successfully merging this pull request may close these issues.

Output PUDL as Parquet as well as SQLite
5 participants