
Rename s3 bucket #2793

Merged
merged 12 commits into dev on Sep 13, 2023

Conversation


@bendnorman bendnorman commented Aug 16, 2023

PR Overview

This PR:

  • Updates nightly build script to write outputs to s3://pudl.catalyst.coop and gs://pudl.catalyst.coop. The build script still writes outputs to s3://intake.catalyst.coop for backward compatibility for folks still using intake.catalyst.coop. We'll deprecate it at a later date.
  • Updates references to intake.catalyst.coop in the docs to point at pudl.catalyst.coop.
  • Only copies outputs to the distribution buckets if the build passed and was kicked off by a tag or a scheduled build on dev. Previously, manually triggered builds that succeeded created new, unwanted output directories in the distribution buckets.
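The gating described in the last bullet can be sketched as shell logic. This is a hypothetical sketch: the function and variable names are illustrative, and the real nightly build script may structure this differently.

```shell
# Hypothetical sketch of the distribution gating logic.
should_distribute() {
    build_succeeded="$1"   # "true" if the nightly build passed
    github_ref="$2"        # e.g. "refs/tags/v2023.09.13" or "dev"
    event_name="$3"        # e.g. "schedule" or "workflow_dispatch"

    # Never distribute a failed build.
    if [ "$build_succeeded" != "true" ]; then
        echo "skip"
        return 0
    fi
    case "$github_ref" in
        refs/tags/*) echo "distribute" ;;   # tagged release
        dev)
            # Only scheduled builds on dev distribute; manual runs do not.
            if [ "$event_name" = "schedule" ]; then
                echo "distribute"
            else
                echo "skip"
            fi
            ;;
        *) echo "skip" ;;
    esac
}
```

In a GitHub Actions workflow, the ref and trigger would come from the `GITHUB_REF` and `GITHUB_EVENT_NAME` environment variables that Actions provides.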

Things that changed in the cloud:

  • Created new buckets called s3://pudl.catalyst.coop and gs://pudl.catalyst.coop.
  • Created a new log bucket for s3://pudl.catalyst.coop called pudl.catalyst.coop-logs. The AWS Open Data Program requires a one-month retention policy on the log bucket, so I created a GCP Storage Transfer job that copies the s3 logs daily to a GCS bucket called pudl-s3-logs.catalyst.coop, letting us retain logs for longer than a month.
  • We now have 1 TB of storage in s3!
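The one-month retention requirement on the S3 log bucket corresponds to a lifecycle rule along these lines. This is an illustrative sketch, not the exact policy applied to the bucket:

```json
{
  "Rules": [
    {
      "ID": "expire-access-logs",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 30}
    }
  ]
}
```

A rule like this can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket pudl.catalyst.coop-logs --lifecycle-configuration file://lifecycle.json`.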

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests.
  • Include unit tests for new functions and classes.
  • Add defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@bendnorman bendnorman marked this pull request as draft August 16, 2023 23:04

codecov bot commented Aug 17, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (b12967a) 88.5% compared to head (f86c4f0) 88.5%.
Report is 1 commit behind head on dev.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2793   +/-   ##
=====================================
  Coverage   88.5%   88.5%           
=====================================
  Files         90      90           
  Lines      10139   10139           
=====================================
  Hits        8982    8982           
  Misses      1157    1157           


@bendnorman
Member Author

I keep getting this certificate error when I run the gsutil rsync command. I'm able to sync the two buckets on my local machine, so I think it might be some GitHub Actions authentication weirdness:

Caught non-retryable exception while listing s3://pudl.catalyst.coop-logs/: Host pudl.catalyst.coop-logs.s3.amazonaws.com returned an invalid certificate (remote hostname "pudl.catalyst.coop-logs.s3.amazonaws.com" does not match certificate): ***'subject': ((('commonName', '*.s3.amazonaws.com'),),), 'issuer': ((('countryName', 'US'),), (('organizationName', 'Amazon'),), (('commonName', 'Amazon RSA 2048 M01'),)), 'version': 3, 'serialNumber': '013E1B83BC89D354ED4263B9FD2161BA', 'notBefore': 'Mar 21 00:00:00 2023 GMT', 'notAfter': 'Dec 19 23:59:59 2023 GMT', 'subjectAltName': (('DNS', '*.s3.amazonaws.com'), ('DNS', 's3.amazonaws.com')), 'OCSP': ('http://ocsp.r2m01.amazontrust.com/',), 'caIssuers': ('http://crt.r2m01.amazontrust.com/r2m01.cer',), 'crlDistributionPoints': ('http://crl.r2m01.amazontrust.com/r2m01.crl',)***
CommandException: Caught non-retryable exception - aborting rsync
Error: Process completed with exit code 1.
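The failure above is consistent with how TLS wildcard certificates work: `*.s3.amazonaws.com` matches only a single DNS label, so a virtual-hosted-style request for a bucket name containing dots (`pudl.catalyst.coop-logs.s3.amazonaws.com`) fails hostname verification. One commonly suggested workaround, untested here, is to force path-style S3 requests in the `~/.boto` config that gsutil reads:

```ini
; ~/.boto -- untested workaround sketch: force path-style S3 requests so the
; request hostname is plain s3.amazonaws.com, which the wildcard cert covers.
[s3]
calling_format = boto.s3.connection.OrdinaryCallingFormat
```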

I decided to set up GCP Storage Transfer Service to copy the data from s3 to the GCS bucket.

@bendnorman bendnorman linked an issue Aug 17, 2023 that may be closed by this pull request
@bendnorman bendnorman self-assigned this Aug 28, 2023
@bendnorman bendnorman marked this pull request as ready for review September 11, 2023 15:01
@bendnorman
Member Author

Ah, running into the ogr failure being debugged in #2849

@zaneselvans
Member

Is there a reason not to use a valid domain name in the pudl.catalyst.coop-logs bucket name? Do S3 and GCS treat bucket names that are domain names differently?

@bendnorman
Member Author

The AWS open data startup script creates log buckets using the {bucket-name}-logs format, so I just stuck with it. I could update it to pudl-logs.catalyst.coop if you'd prefer. I'm not sure if AWS and GCS handle domain-named buckets differently. What are the benefits of domain-named buckets other than ensuring uniqueness?

Member

@jdangerx jdangerx left a comment


Looks good to me, though I can't see the Storage Transfer job in the Google Cloud console due to permissions stuff. It would be nice to set that up through Terraform, too, instead of manually, but that shouldn't block this already long-suffering PR.

Looks like the storage transfer job is exposed through the Terraform provider: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_transfer_job

echo "Copying outputs to GCP intake bucket"
gsutil -m -u $GCP_BILLING_PROJECT cp -r "$PUDL_OUTPUT/*" "gs://intake.catalyst.coop/$GITHUB_REF"
function copy_outputs_to_distribution_bucket() {
echo "Copying outputs to GCP distribution bucket"
Member


"distribution" means "user-facing", right?

Member Author


Yes!

Member


nit: There are still a few references to "PUDL Intake catalogs" in the non-code sections of this doc; should we rename them?

Member Author


Ah good point. I'll remove the references since we aren't supporting the intake catalogs.

@bendnorman
Copy link
Member Author

Sounds good! Yeah, I wanted to use Terraform but thought it would drag the PR on for longer :/

@bendnorman bendnorman merged commit 38d0b5e into dev Sep 13, 2023
11 checks passed
@bendnorman bendnorman deleted the rename-s3-bucket branch September 13, 2023 21:12
@jdangerx jdangerx mentioned this pull request Sep 18, 2023

Successfully merging this pull request may close these issues.

Rename AWS bucket
3 participants