GCE Deploy #1627

Merged
merged 147 commits into from
Jun 21, 2022

zaneselvans
Member

Set up automated nightly builds on GCP.

@bendnorman
Member

Great idea to make this a squash merge. A couple of thoughts and questions before I wrap this up:

  • @zaneselvans Should the container create the CEMS partitioned directory of parquet files and the single parquet file for the intake catalogs? If it needs to create both, do I need to run the CEMS pipeline twice?
  • The GitHub workflow is now a single job because GitHub does not let you pass env vars between jobs. The two previous jobs needed the same values for CHECKOUT_BRANCH, GCE_INSTANCE, and ACTION_SHA.
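
For context on the single-job constraint: within one job, a step can hand a value to later steps of the same job by appending a `NAME=value` line to the file named by the `GITHUB_ENV` environment variable. Values set this way are not visible in a separate job. A minimal Python sketch of that mechanism follows; the helper name `export_to_github_env` is hypothetical and is not taken from this PR's workflow.

```python
import os
from pathlib import Path


def export_to_github_env(name: str, value: str) -> None:
    """Append NAME=value to the file that GitHub Actions reads after a step.

    Hypothetical helper for illustration; it only handles single-line values.
    """
    if "\n" in value:
        raise ValueError("multi-line values are not handled by this sketch")
    env_file = Path(os.environ["GITHUB_ENV"])
    # Append rather than overwrite so values from earlier steps survive.
    with env_file.open("a", encoding="utf-8") as f:
        f.write(f"{name}={value}\n")
```

A step would call, for example, `export_to_github_env("CHECKOUT_BRANCH", "dev")`, and later steps in the same job would then see `CHECKOUT_BRANCH` in their environment.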

@zaneselvans
Member Author

On the partitioned vs. monolithic CEMS issue, the pudl-catalog currently expects both to exist, but I don't think that's the long-term plan. We want a single version which is fast and space efficient, which I think will be the partitioned version, but we need to play around with how the metadata is stored to make that work. Right now I'm using the monolithic version in the examples because querying it is faster, and the caching mechanism downloads every file from the partitioned version whenever anything is queried because it has to look at the metadata inside each file to know where the right data is.

So either I should disable the partitioned version of the data in the catalog for the moment, or we should generate both outputs.

The epacems_to_parquet script should be able to generate the partitioned output with the --partition option after the normal ETL has run.
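
To illustrate the download problem described above: if the only way to know which file holds the requested rows is to read metadata stored inside each file, every file has to be fetched first. One common alternative (an assumption here, not something this PR implements) is to encode the partition keys in the directory names, hive-style, so that files can be ruled out from their paths alone. The `year=`/`state=` layout and the function names below are hypothetical.

```python
from pathlib import PurePosixPath


def partition_keys(path: str) -> dict[str, str]:
    """Parse hive-style `key=value` directory names out of a file path."""
    return dict(
        part.split("=", 1)
        for part in PurePosixPath(path).parts
        if "=" in part
    )


def prune(paths: list[str], **wanted: str) -> list[str]:
    """Keep only the files whose partition keys match every requested value.

    No file is opened: the decision is made from the path string alone.
    """
    return [
        p
        for p in paths
        if all(partition_keys(p).get(k) == v for k, v in wanted.items())
    ]
```

With paths such as `epacems/year=2019/state=CO/data.parquet`, calling `prune(paths, year="2019", state="CO")` selects the single matching file without downloading the others.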

@bendnorman
Member

Ok, I'll just run epacems_to_parquet --partition after the full ETL is run so that we have both the partitioned and monolithic versions.

@bendnorman
Member

Changes:

  • Create a Docker image that installs PUDL and runs the ETL.
  • Create a GitHub Action that builds the PUDL image, pushes it to Docker Hub, runs the ETL on a GCP VM, and copies the outputs to the intake GCS buckets. The action is triggered on a schedule and by tags.
  • Add a --loglevel arg to the package entrypoint commands.
  • Add GoogleCloudStorageCache support to ferc1_to_sqlite and censusdp1tract_to_sqlite commands and pytest.
  • Allow users to create monolithic and partitioned EPA CEMS outputs without having to clobber or move any existing CEMS outputs.
  • Add requester pays support to GoogleCloudStorageCache.
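
As a rough illustration of the --loglevel item above, here is a minimal sketch of how a command-line entrypoint can accept such an option with `argparse` and apply it to a logger. The function names, the logger name `example_etl`, and the default of `INFO` are assumptions for this sketch, not the code added in this PR.

```python
import argparse
import logging


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command line of a hypothetical entrypoint."""
    parser = argparse.ArgumentParser(
        description="Hypothetical entrypoint with a --loglevel option."
    )
    parser.add_argument(
        "--loglevel",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="Set the logging verbosity (default: INFO).",
    )
    return parser.parse_args(argv)


def configure_logging(loglevel: str) -> logging.Logger:
    """Return a logger whose level matches the name given on the command line."""
    logger = logging.getLogger("example_etl")
    logger.setLevel(getattr(logging, loglevel))
    return logger
```

Restricting `choices` to the standard level names means a typo such as `--loglevel DEBG` fails at argument parsing rather than later inside the ETL.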

@bendnorman bendnorman marked this pull request as ready for review June 14, 2022 22:51
@bendnorman
Member

Ok! The YAML formatting has been sorted out and the unit and CI tests have been re-enabled. I think this is good to merge in.

Member Author

@zaneselvans zaneselvans left a comment

  • Can you add a page to the documentation in the Development section that explains how the nightly builds and data deployment work in general and what the moving parts are, so we can all have a shared understanding of it?
  • It would also be good to get your bulletized summary of the code changes into the release notes.
  • How are local_pudl_etl.sh and gcp_pudl_etl.sh different? When do we use the local one?
  • In some places we're using the abbreviation GCE (GCE_INSTANCE), elsewhere it's GCP (gcp_pudl_etl.sh), and else-elsewhere it's GCLOUD (GCLOUD_BILLING_PROJECT), and it's not always clear to me why it's one and not the other. I think it'll be easier to remember these names without looking them up if we make them as consistent as possible. Or maybe I just don't understand what's differentiating them?
  • github.ref_name / $GITHUB_REF will be whatever the branch or tag is, right? But the only case in which it runs on a branch is for dev, because of the on: push trigger.
  • We wanted to condense all these commits into a single squash merge, right? That still seems like a good idea to me.
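
On the github.ref_name / $GITHUB_REF question: `GITHUB_REF` holds the fully qualified ref (for example `refs/heads/dev` for a branch or `refs/tags/<tag>` for a tag), while `github.ref_name` is the short name with that prefix removed. A small Python sketch of the relationship follows; the function name `split_github_ref` is hypothetical.

```python
def split_github_ref(ref: str) -> tuple[str, str]:
    """Split a fully qualified Git ref into (kind, short name).

    Branch and tag refs are recognized; anything else (for example a pull
    request merge ref) is returned unchanged with kind "other".
    """
    prefixes = {"refs/heads/": "branch", "refs/tags/": "tag"}
    for prefix, kind in prefixes.items():
        if ref.startswith(prefix):
            return kind, ref[len(prefix):]
    return "other", ref
```

For a run triggered by a push to dev, `split_github_ref("refs/heads/dev")` gives the kind `"branch"` and the short name `"dev"`, which is the value a workflow would see in `github.ref_name`.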

@zaneselvans
Member Author

Looks good to me. Weird that you apparently can't transfer ownership of a PR to someone else. Since I created the initial PR I can't "approve".

@bendnorman
Member

Hmm weird. I'll squash and merge it in.

@bendnorman bendnorman merged commit b8fb80e into dev Jun 21, 2022
@zaneselvans zaneselvans deleted the gce-deploy branch October 26, 2022 14:58