Archive TSVs saved on S3 at lower (cheaper) access level (original #376) #1787
Implementing this would cut our catalog data storage costs by at least half, if not more, depending on whether we consider the data "recreatable". There are two infrequent access classes, a multi-zone option and a single-zone option. The multi-zone option is only 0.00625 USD more expensive per GB than the single-zone option. Single-zone is only useful if we consider the data recreatable, because there's no disaster redundancy if the AZ goes out. I think that would be a safe option, but I cannot make that call myself, so I'm asking @WordPress/openverse-catalog to give input there. There are two stages to implement this:
The harder part might actually not be harder, depending on how the objects are used. For example, if it's safe to do a one-time conversion of every existing object in the bucket to the new storage class then it's very easy, and can be done through the AWS CLI:
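The command itself was not preserved in the migrated thread. A typical in-place conversion looks like the following (the bucket name is hypothetical, and the flags should be double-checked against the current AWS CLI docs before running):

```shell
# In-place recursive copy that rewrites every object with the new storage
# class. Because the bucket is unversioned, each object is replaced rather
# than duplicated. Bucket name below is a placeholder, not the real bucket.
aws s3 cp s3://openverse-catalog-tsvs/ s3://openverse-catalog-tsvs/ \
  --recursive \
  --storage-class STANDARD_IA \
  --metadata-directive COPY
```

Note that this issues one COPY request per object, which is billed; the lifecycle-rule approach discussed later avoids the LIST cost but not the transition cost.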
The relevant bucket does not have versioning enabled, so this is a safe way to change the storage class. If versioning were enabled, the copy would just create a new version with the new storage class while the previous version remained in the Standard class, and we'd be paying double. Whether we can do that operation depends entirely on when those TSVs get put to the S3 bucket. Do the objects get read from the bucket during a provider ingestion, i.e., do we put the object there and then immediately retrieve it for use later in the ingestion flow? If not, then we can just update the code to use the new storage class when it uploads the object and then run the S3 CLI command above. It must be done in that order, to prevent objects from being added after we convert the existing objects but before the code change lands.
For context on cost: there's currently 553.2 GB in the catalog bucket. At the Standard rate, we are paying 12.7236 USD a month, or just over 152 USD a year. Switching to the multi-AZ infrequent access storage class would essentially cut that in half, to 6.915 USD a month or 82.98 USD a year. Switching to the single-AZ class would cut it to 5.532 USD a month or 66.384 USD a year. The near-term cost savings of the single-AZ class are trivial compared to the switch from Standard to multi-AZ, so if we are unsure whether single-AZ is right for us, we can do the multi-AZ option and revisit this later if we want to squeeze out those additional savings.

While looking at this, though, I noticed we have some much bigger buckets where changing the storage class would save us far more money, specifically the image analysis bucket. That bucket is 2.3 TB, making up over 2/3 of our total S3 usage. We pay 52.9 USD a month for that bucket, or 634.8 USD a year. Cutting it to the multi-zone IA class (which I think is the right option for that bucket, to be sure) would almost halve that amount, at 28.75 USD a month or 345 USD a year. But for that bucket, I wonder if the Glacier-level classes are actually more appropriate. We literally haven't touched that data in any meaningful way in years. We plan to with #1126, for example, but until we get around to implementing whatever plan we come up with, we could save a substantial sum on our AWS bill by switching to S3 Glacier Flexible Retrieval, which would have us paying 8.28 USD a month or 99.36 USD a year, with the only caveat being that it would take between minutes and a few hours to access the data once we need it again (entirely tolerable, from my perspective).

@AetherUnbound can you give a 👍 here if you think it's alright for us to switch the image analysis bucket to Glacier Flexible Retrieval, or at least multi-AZ infrequent access?
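The arithmetic behind these figures can be checked against the per-GB monthly rates the comment implies (the rate values below are assumptions reconstructed from the thread's totals, not quoted from AWS):

```python
# Approximate S3 storage rates (USD per GB-month) implied by the thread's
# figures; these are assumptions, check current AWS pricing before relying
# on them.
RATES = {
    "standard": 0.023,
    "standard_ia": 0.0125,     # multi-AZ infrequent access
    "one_zone_ia": 0.01,       # single-AZ infrequent access
    "glacier_flexible": 0.0036,
}

def monthly_cost(size_gb: float, storage_class: str) -> float:
    """Storage-only monthly cost; ignores request and retrieval fees."""
    return size_gb * RATES[storage_class]

catalog_gb = 553.2
analysis_gb = 2300.0  # ~2.3 TB image analysis bucket

print(round(monthly_cost(catalog_gb, "standard"), 4))          # 12.7236
print(round(monthly_cost(catalog_gb, "standard_ia"), 3))       # 6.915
print(round(monthly_cost(catalog_gb, "one_zone_ia"), 3))       # 5.532
print(round(monthly_cost(analysis_gb, "standard"), 1))         # 52.9
print(round(monthly_cost(analysis_gb, "glacier_flexible"), 2)) # 8.28
```

These reproduce the monthly numbers quoted above; the yearly figures are simply these times twelve.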
Thanks for looking into this Sara!
Yes, the object is read in the next step in the ingestion workflow and is immediately loaded from S3 into a temp table, then upserted into the primary table in the catalog. So it would be straightforward to:
The multi-AZ IA tier does seem most appropriate for that, and I'm fine with moving forward on it! I'll go ahead and add this to our TODOs, since the sooner we can start getting those savings the better. As for the image analysis bucket, we should definitely move it to multi-AZ IA as well, at the very least. My only concern with the Glacier Flexible Retrieval level is whether there would be any difficulty converting that data back to the standard storage level once we start on the Rekognition data incorporation. If that shouldn't be an issue, then we should move that bucket over too!
From what I understand (not much!) the caveat is in access times. We can run the same command above to convert back from the Glacier-level storage to IA or Standard; there'd just be a delay between doing that and when the objects would actually be accessible. Best to look deeper into it, though. Sounds good for multi-AZ on the others!
@sarayourfriend do you feel that merging #3810 resolves this issue?
No. As I said in the PR, we still need to run the S3 command to backfill the IA storage class onto existing objects. Only then will this issue be resolved (and will we actually see cost savings). Otherwise only new objects would be in IA, which would net us almost no benefit in the near term. From the PR:
I'll move this into the "Fixes" section to make it clearer.
I was working on getting the existing objects updated to the new storage class, but when I looked at the costs before, I missed that transitioning objects between storage classes costs money as well. Each request to transition an object to Standard-IA costs 0.01 USD. There are 13,199 objects in the bucket right now, so transitioning all objects in the bucket would cost us a one-time expense of 131.99 USD. Copying the objects ourselves (rather than using a lifecycle rule) would additionally incur the cost of LIST requests in the Standard class and COPY/PUT requests in the Standard-IA class, at 0.005 USD per 1k requests and 0.01 USD per 1k requests respectively.

Assuming we use the lifecycle transition to avoid the cost of the LIST operation, the first year after this change we would see an increase in costs for the bucket, because the one-time transition cost outweighs the first year's storage savings. In storage costs, we would save 69.02 USD in the first year, but paying the transition cost this year (only a one-time cost to us) will make this first year cost 62.97 USD more (152 - 82.98 - 131.99 = -62.97). Next year, we will not pay that transition cost, and will net a savings of just over 6 USD (-62.97 + 69.02). Every year after the second we net the full 69.02 USD in storage savings.

We would incur the full 131.99 USD within 30 days, but that could spread across more than one billing cycle, depending on when we activate the rule and how fast the existing objects transition. For the purposes of approval, assume we would incur the 131.99 USD within a two-month period, weighted heavily toward the first month (say, 128 USD the first month and the rest the second). @AetherUnbound and @zackkrida (I'm not sure who to ask, as the team leadership transitions) can you:
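The break-even arithmetic above can be reproduced directly (figures are the thread's own; the thread's totals imply a per-object transition cost of 0.01 USD):

```python
# Figures from the thread; the stated total (13,199 objects -> 131.99 USD)
# implies a per-object transition cost of 0.01 USD.
objects = 13199
transition_cost = objects * 0.01                 # one-time, ~131.99 USD

standard_yearly = 152.00                         # current yearly storage cost
standard_ia_yearly = 82.98                       # yearly cost after the switch
yearly_savings = standard_yearly - standard_ia_yearly     # ~69.02 USD/year

first_year_net = yearly_savings - transition_cost         # ~-62.97 (a net cost)
second_year_cumulative = first_year_net + yearly_savings  # ~6.05 (break even)

print(round(transition_cost, 2),
      round(first_year_net, 2),
      round(second_year_cumulative, 2))
```

So the change only pays for itself partway through the second year, and yields the full storage savings every year after that.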
I've drafted a PR to implement the lifecycle rule in Terraform: https://github.com/WordPress/openverse-infrastructure/pull/808 If the answer to 2 is "yes", undraft that PR and request reviews on it to kick off getting it merged and applied. Otherwise, close the PR. |
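For reference, a lifecycle rule of this kind in Terraform would look roughly like the sketch below. This is not the content of the linked PR; resource and bucket names are hypothetical, and the 30-day minimum before a Standard-IA transition is an AWS constraint worth verifying against current docs:

```hcl
# Sketch only: names are placeholders, not the actual infrastructure code.
resource "aws_s3_bucket_lifecycle_configuration" "catalog_tsvs" {
  bucket = aws_s3_bucket.catalog_tsvs.id

  rule {
    id     = "transition-to-standard-ia"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket.
    filter {}

    transition {
      # AWS requires objects to be at least 30 days old before they can
      # transition to STANDARD_IA.
      days          = 30
      storage_class = "STANDARD_IA"
    }
  }
}
```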
@sarayourfriend my understanding of the pricing matches yours and I think the long term savings are worth it. I will defer to @AetherUnbound for now though :) |
Agreed, let's work towards the long term improvements 👍 Thanks for running the calculations for it!
This issue has been migrated from the CC Search Catalog repository
Current Situation
We now save the TSV resulting from each Provider API Script run in S3, in a Standard Access Object Storage Class bucket.
Suggested Improvement
After loading the data into the DB, we should archive it in a Standard-IA (Infrequent Access) class bucket.
This should be done via a node in the loader_workflow Apache Airflow DAG (source file at src/cc_catalog_airflow/loader_workflow.py).
Benefit
We don't currently need to re-access these files very much (or at all under normal circumstances). Thus, we could save money by putting them in a lower-access tier storage bucket.
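A minimal sketch of what the archiving step could look like. The helper and object names below are hypothetical, not the actual DAG code; it only builds the keyword arguments a boto3 `copy_object` call would receive to rewrite an uploaded TSV in place with the Standard-IA class:

```python
def archive_copy_args(bucket: str, key: str) -> dict:
    """Build copy_object kwargs that rewrite an object in place with the
    Standard-IA storage class (hypothetical helper for the DAG node)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "CopySource": {"Bucket": bucket, "Key": key},
        "StorageClass": "STANDARD_IA",
        "MetadataDirective": "COPY",  # keep the object's existing metadata
    }

# Example: the real DAG node would pass these to boto3's
# s3_client.copy_object(**args) after the DB load succeeds.
args = archive_copy_args("openverse-catalog", "flickr_20200101.tsv")
print(args["StorageClass"])  # STANDARD_IA
```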