
Archive TSVs saved on S3 at lower (cheaper) access level (original #376) #1787

Closed
obulat opened this issue Apr 21, 2021 · 10 comments
Labels
πŸ’» aspect: code (Concerns the software code in the repository)
🧰 goal: internal improvement (Improvement that benefits maintainers, not users)
🟨 priority: medium (Not blocking but should be addressed soon)
🧱 stack: catalog (Related to the catalog and Airflow DAGs)
πŸ”§ tech: airflow (Involves Apache Airflow)
🐍 tech: python (Involves Python)
Comments

obulat (Contributor) commented Apr 21, 2021

This issue has been migrated from the CC Search Catalog repository

Author: mathemancer
Date: Tue Apr 28 2020
Labels: ✨ goal: improvement, πŸ™… status: discontinued

Current Situation

We currently save the TSV resulting from each Provider API Script run to S3, in a bucket using the Standard storage class.

Suggested Improvement

After loading the data into the DB, we should archive the TSV in a Standard-IA (Infrequent Access) storage class bucket.

This should be done via a node in the loader_workflow Apache Airflow DAG (source file at src/cc_catalog_airflow/loader_workflow.py).
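
A minimal sketch of what such a task could look like, assuming boto3 and the modern Airflow TaskFlow API (the function, parameter, bucket, and key names here are hypothetical, not the DAG's actual code):

```python
import boto3
from airflow.decorators import task


@task
def archive_loaded_tsv(bucket: str, tsv_key: str) -> None:
    """Move an already-loaded TSV to the cheaper Standard-IA storage class."""
    s3 = boto3.client("s3")
    # S3 has no in-place "change storage class" call; the object is copied
    # onto itself with the new storage class instead.
    s3.copy_object(
        Bucket=bucket,
        Key=tsv_key,
        CopySource={"Bucket": bucket, "Key": tsv_key},
        StorageClass="STANDARD_IA",
        MetadataDirective="COPY",
    )
```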

Benefit

We don't currently need to re-access these files very much (or at all under normal circumstances). Thus, we could save money by moving them to a lower-access storage tier.

@AetherUnbound AetherUnbound added 🐍 tech: python Involves Python πŸ”§ tech: airflow Involves Apache Airflow labels Jan 24, 2022
@krysal krysal added 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon labels Nov 18, 2022
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@sarayourfriend sarayourfriend self-assigned this Feb 7, 2024
@sarayourfriend sarayourfriend changed the title [Infrastructure] Archive TSVs saved on S3 at lower (cheaper) access level (original #376) Archive TSVs saved on S3 at lower (cheaper) access level (original #376) Feb 7, 2024
@sarayourfriend sarayourfriend added the πŸ’» aspect: code Concerns the software code in the repository label Feb 7, 2024
sarayourfriend (Contributor) commented:

Implementing this would cut our catalog data storage costs by at least half, if not more, depending on whether we consider the data "recreatable". There are two infrequent access classes, a multi-zone option and a single-zone option. The multi-zone option is only 0.00625 USD more expensive per GB than the single-zone option. Single-zone is only useful if we consider the data recreatable, because there's no disaster redundancy if the AZ goes out. I think that would be a safe option, but cannot make that call myself, so asking @WordPress/openverse-catalog to give input there.

There are two stages to implement this:

  1. The easiest part, updating the storage class of the object when we are finished with it
  2. The harder part, backfilling the storage class of previous objects

The harder part might actually not be harder, depending on how the objects are used. For example, if it's safe to do a one-time conversion of every existing object in the bucket to the new storage class then it's very easy, and can be done through the AWS CLI:

aws s3 cp s3://<bucket-name>/ s3://<bucket-name>/ --recursive --storage-class <storage_class>

The relevant bucket does not have versioning enabled, so this is a safe way to change the storage class. If versioning were enabled, this would just create a new version with the new storage class, but the previous version would still exist in the Standard storage class and we'd be paying double.
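
For anyone double-checking, the versioning state can be confirmed with boto3 (bucket name hypothetical):

```python
import boto3

s3 = boto3.client("s3")
# "Status" is "Enabled" or "Suspended" if versioning has ever been configured,
# and is absent if it never has been.
response = s3.get_bucket_versioning(Bucket="openverse-catalog")
print(response.get("Status", "versioning never configured"))
```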

Whether we can do that operation depends entirely on when those TSVs get put into the S3 bucket. Do the objects get read from the bucket during a provider ingestion, i.e. do we put the object there and then immediately retrieve it for usage later in the ingestion flow? If not, then we can just update the code to use the new storage class when it uploads the object and then run the S3 CLI command above. It must be in that order, to prevent objects being added in the Standard class after we convert the existing objects but before we have a chance to update the code.
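
The code-side half of that is a small change. Assuming the upload goes through boto3 (or an S3 hook that passes extra args through to it), it would look roughly like this, with hypothetical paths and names:

```python
import boto3

s3 = boto3.client("s3")
# Setting StorageClass at upload time means new TSVs land directly in the
# cheaper tier, so only pre-existing objects need the one-off conversion.
s3.upload_file(
    Filename="/tmp/provider_data.tsv",   # hypothetical local TSV path
    Bucket="openverse-catalog",          # hypothetical bucket name
    Key="provider_data.tsv",             # hypothetical object key
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)
```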

sarayourfriend (Contributor) commented:

For context on cost: there's currently 553.2 GB in the catalog bucket. At the standard rate, we are paying 12.7236 USD a month or just over 152 USD a year. Switching to the multi-AZ infrequent access storage class would essentially cut that in half to 6.915 USD a month or 82.98 USD a year.

Switching to the single-AZ class would cut it to 5.532 USD a month or 66.384 USD a year. The near-term cost savings of the single-AZ class are trivial compared to the switch to multi-AZ from standard, so if we are unsure of whether single-AZ is right for us, we can do the multi-AZ option and consider this again later if we want to try to squeeze those additional savings out.

While looking at this, though, I noticed we have some much bigger buckets where changing the storage class would save far more money, specifically the image analysis bucket. That bucket is 2.3 TB, making up over 2/3 of our total S3 usage. We pay 52.9 USD a month for that bucket, or 634.8 USD a year. Cutting it to the multi-zone IA class (which I'm confident is the right option for that bucket) would almost halve that amount, at 28.75 USD a month or 345 USD a year.

But for that bucket, I wonder if the Glacier-level classes are actually more appropriate. We haven't touched that data in any meaningful way in years. We plan to with #1126, for example, but until we get around to implementing whatever plan we come up with, we could save a substantial sum on our AWS bill by switching to S3 Glacier Flexible Retrieval, which would have us paying 8.28 USD a month or 99.36 USD a year. The only caveat is that it would take between minutes and a few hours to access the data once we need it again (entirely tolerable, from my perspective).
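
For reference, a quick script that reproduces the figures above; the per-GB monthly rates are assumptions taken from the public pricing page at the time and may have changed:

```python
# Per-GB monthly storage rates (USD), assumed from https://aws.amazon.com/s3/pricing/
RATES = {
    "Standard": 0.023,
    "Standard-IA (multi-AZ)": 0.0125,
    "One Zone-IA (single-AZ)": 0.01,
    "Glacier Flexible Retrieval": 0.0036,
}

BUCKETS_GB = {"catalog TSVs": 553.2, "image analysis": 2300}

for bucket, size_gb in BUCKETS_GB.items():
    for storage_class, rate in RATES.items():
        monthly = size_gb * rate
        print(f"{bucket}: {storage_class}: {monthly:.2f} USD/month, {monthly * 12:.2f} USD/year")
```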

@AetherUnbound can you give a πŸ‘ here if you think it's alright for us to switch the image analysis bucket to glacier flexible retrieval or at least the multi-AZ infrequent access?

AetherUnbound (Contributor) commented:

Thanks for looking into this Sara!

do we put the object there and then immediately retrieve it for usage later in the ingestion flow?

Yes, the object is read in the next step in the ingestion workflow and is immediately loaded from S3 into a temp table, then upserted into the primary table in the catalog. So it would be straightforward to:

  1. Update the DAG to specify the storage class, then
  2. Recursively apply that storage class to all items already in that bucket (e.g. with the CLI command above, or the boto3 sketch below).
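
Step 2 could also be done with boto3 instead of the CLI, if that's easier to run from our tooling; a rough equivalent of the recursive copy (bucket name hypothetical):

```python
import boto3

s3 = boto3.client("s3")
bucket = "openverse-catalog"  # hypothetical bucket name

# Copy each object onto itself with the new storage class; this is the
# programmatic equivalent of the recursive `aws s3 cp` command above.
# (Objects over 5 GB would need a multipart copy instead of copy_object.)
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=bucket,
            Key=obj["Key"],
            CopySource={"Bucket": bucket, "Key": obj["Key"]},
            StorageClass="STANDARD_IA",
            MetadataDirective="COPY",
        )
```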

The multi-AZ IA tier does seem most appropriate for that, and I'm fine with moving forward on that! I'll go ahead and add this to our TODOs, since the sooner we can start getting that savings the better.

As for the image analysis bucket, we should definitely move it to multi-AZ IA as well, at the very least. My only concern with the Glacier Flexible Retrieval level is whether there would be any difficulty in converting that data back to the standard storage level once we start on the Rekognition data incorporation. If that shouldn't be an issue, then we should move that bucket over too!

sarayourfriend (Contributor) commented:

My only concern with the Glacier Flexible Retrieval level is whether there would be any difficulty in converting that data back to the standard storage level once we start on the Rekognition data incorporation. If that shouldn't be an issue, then we should move that bucket over too!

From what I understand (not much!), the caveat is in access times. We can run the same command above to convert back from the Glacier-level storage to IA or Standard; there'd just be a delay between doing that and when the objects would actually be accessible. Best to look deeper into it when the time comes, though.
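
If we did need the Glacier-stored data back, retrieval involves an explicit per-object restore request before the data becomes readable (or copyable) again; roughly, with hypothetical names:

```python
import boto3

s3 = boto3.client("s3")
# Stage a Glacier-archived object for access. The "Standard" tier typically
# completes in a few hours; "Expedited" and "Bulk" trade cost for speed.
s3.restore_object(
    Bucket="openverse-image-analysis",  # hypothetical bucket name
    Key="analysis/some-object.json",    # hypothetical object key
    RestoreRequest={
        "Days": 7,  # how long the temporary restored copy remains available
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)
```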

Sounds good for multi-AZ on the others!

AetherUnbound (Contributor) commented:

@sarayourfriend do you feel that merging #3810 resolves this issue?

sarayourfriend (Contributor) commented Feb 26, 2024

No. As I said in the PR, we still need to run the S3 command to backfill the IA storage class onto existing objects. Only then will this issue be resolved and will we actually see cost savings. Otherwise only new objects would be in IA, which would net us almost no benefit in the near term.

From the PR:

This does not close the issue, as we still need to run the s3 command noted in the issue to backfill the storage class onto all existing objects.

I'll move this into the "Fixes" section to make it clearer.

sarayourfriend (Contributor) commented:

I was working on getting the existing objects updated to the new storage class, but when I was looking at the costs before, I missed that transitioning objects between storage classes costs money as well. Every 1,000 requests to transition objects to Standard-IA cost 0.01 USD.

There are 13199 objects in the bucket right now, so transitioning all objects in the bucket would cost us a one-time expense of 131.99 USD. Copying the objects (rather than using a lifecycle rule) would incur the cost of LIST in the Standard class and COPY/PUT in the Standard-IA class, which cost 0.005 USD / 1k requests and 0.01 USD / 1k requests respectively. In other words, using the aws s3 cp approach would incur an additional cost for listing the objects, increasing the total one-time cost of updating the storage class on existing objects to 131.99 + 66 USD (rounding up from 65.995) rather than just 131.99.

Assuming we use the lifecycle transition to avoid the cost of the LIST operations, the first year after this change would actually see an increase in costs for the bucket, because the cost of the lifecycle transitions outweighs the storage savings for that first year. In storage costs we would save 69.02 USD in the first year, but paying the transition cost this year (a one-time cost) makes the first year cost 62.97 USD more (152 - 82.98 - 131.99 = -62.97). Next year we will not pay that transition cost, and will net a savings of just over 6 USD (-62.97 + 69.02). Every year after the second, we net the full 69.02 USD in storage savings.

We would incur the full 131.99 USD within 30 days, but that could spread across more than one billing cycle, depending on when we activate the rule and how fast the existing objects transition. For the purposes of approval, assume we would incur the 131.99 USD within a two-month period, weighted heavily toward the first month (say, 128 USD the first month and the rest the second month).
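
To spell the arithmetic out (this just reproduces the figures above, taking the per-object transition charge used in those totals at face value; please verify against the pricing page):

```python
objects = 13_199

transition_cost = objects * 0.01   # one-time lifecycle transition cost: 131.99 USD
list_cost = objects * 0.005        # extra LIST cost if we copied instead: ~66 USD

annual_standard = 152.00           # current annual storage cost (rounded)
annual_standard_ia = 82.98
annual_savings = annual_standard - annual_standard_ia  # 69.02 USD per year

year_one = annual_savings - transition_cost        # about -62.97 USD (a net loss)
year_two_cumulative = year_one + annual_savings    # just over +6 USD
print(transition_cost, list_cost, year_one, year_two_cumulative)
```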

@AetherUnbound and @zackkrida (I'm not sure who to ask, as the team leadership transitions) can you:

  1. Confirm that your understanding of the cost of this transition matches mine, after reviewing S3's pricing structure (https://aws.amazon.com/s3/pricing/). In particular, confirm that using a lifecycle rule is indeed the least expensive method of transitioning the objects to the new storage class, and check my cost calculations.
  2. Assuming 1 is "yes", confirm that the one-time cost of 131.99 USD and the near-term loss of 62.97 USD are worth the long-term (2+ years) savings.

I've drafted a PR to implement the lifecycle rule in Terraform: https://github.com/WordPress/openverse-infrastructure/pull/808

If the answer to 2 is "yes", undraft that PR and request reviews on it to kick off getting it merged and applied. Otherwise, close the PR.
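
A lifecycle rule like the one described would look roughly like this if expressed with boto3 instead of Terraform (bucket name and rule ID are hypothetical, and S3 only accepts Standard-IA transitions at 30 days or more after object creation):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="openverse-catalog",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "transition-tsvs-to-standard-ia",  # hypothetical rule ID
                "Filter": {"Prefix": ""},                # apply to every object
                "Status": "Enabled",
                # 30 days is the minimum age S3 allows for a Standard-IA transition.
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            }
        ]
    },
)
```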

zackkrida (Member) commented:

@sarayourfriend my understanding of the pricing matches yours and I think the long term savings are worth it. I will defer to @AetherUnbound for now though :)

AetherUnbound (Contributor) commented:

Agreed, let's work towards the long term improvements 😊 Thanks for running the calculations for it!

sarayourfriend (Contributor) commented:

Thanks. I've marked the PR ready for review and updated the description. @dhruvkb and @stacimc are assigned reviewers, but really anyone could review it rather quickly; it's a very small PR.
