Save cleaned data of Ingestion Server to AWS S3 #4163
Conversation
I ran the data refresh locally and saw the saved data in the MinIO bucket. I'll look through the code again later, but wanted to add a comment here now.
When running locally, I got a "bucket does not exist" error. Do you think it's better to reuse an existing bucket, or to add the new one to the MinIO docker template?
BUCKETS_TO_CREATE=openverse-storage,openverse-airflow-logs
@@ -2,7 +2,7 @@ PYTHONUNBUFFERED="0"

 #AWS_REGION="us-east-1"

-#ENVIRONMENT="local"
+ENVIRONMENT="local"
I think it would be better to add a default value instead of uncommenting this var here.
Now, when `ENVIRONMENT` is not defined, you get an error when running the data refresh:
[2024-04-19 12:49:29,717 - root - 186][ERROR] ENVIRONMENT not found. Declare it as envvar or define a default value.
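A minimal sketch of the suggested fix, using the stdlib `os.environ.get` (the ingestion server itself uses decouple's `config`, whose `default` keyword works the same way):

```python
import os

# Instead of requiring ENVIRONMENT to be set (and failing with "ENVIRONMENT
# not found. Declare it as envvar or define a default value."), fall back to
# a local-friendly default. With decouple this would be
# config("ENVIRONMENT", default="local"); os.environ.get is the stdlib analog.
ENVIRONMENT = os.environ.get("ENVIRONMENT", "local")
```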
Agreed, I ran into this while trying to test this locally myself.
I was thinking about it, and the thing is, I often have a conflict with default env vars for the catalog. I'm used to using production values as defaults, but we prefer dev/testing values there. I guess using `local` is not so bad here, and since we want to get rid of the ingestion server anyway, these defaults won't last long.
Having checked S3, it looks like the
I'm excited about this! There are a few things that will need to be adjusted, in addition to my comments above about file size before upload.
Additionally, the testing instructions appear to no longer be accurate, since that initialization based on the environment appears to be a part of the code now.
Lastly, I adjusted the default bucket name and removed some HTTP schemes from the sample data as I noted below. I was able to get the upload to work, but the CSV which was uploaded had multiple copies of each identifier (the screenshot below has the rows sorted just to show the duplications, they were not in that order in the originally produced file). It appears that some of the logic is causing duplicate file writing.
@@ -364,7 +413,7 @@ def clean_image_data(table):
     log.info(f"Starting {len(jobs)} cleaning jobs")

     for result in pool.starmap(_clean_data_worker, jobs):
-        batch_cleaned_counts = save_cleaned_data(result)
+        batch_cleaned_counts = data_uploader.save(result)
Just noting this because I had to talk myself through it - I was worried that the data uploader step was happening in each process as part of the multiprocessing, but it looks like each result that comes out of `pool.starmap` is processed serially, so we don't need to worry about multiple processes stepping on each other.
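To illustrate the point (a standalone sketch, not the project's code - `_clean_data_worker` here is a dummy): `pool.starmap` returns only after all workers finish, handing an ordered result list back to the parent process, so the loop that calls the uploader runs serially in a single process.

```python
import multiprocessing
import os

def _clean_data_worker(batch_id):
    # Dummy stand-in for the real cleaning worker; runs in a pool process.
    return batch_id, os.getpid()

def process_batches(n_jobs=4):
    with multiprocessing.Pool(processes=2) as pool:
        # starmap blocks until every worker is done, then returns the full,
        # ordered result list to the parent. This loop therefore runs in one
        # process, so an uploader called here can't race against itself.
        uploaded = []
        for batch_id, _worker_pid in pool.starmap(
            _clean_data_worker, [(i,) for i in range(n_jobs)]
        ):
            uploaded.append(batch_id)  # serial "upload" step in the parent
        return uploaded
```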
    }

    def __init__(self):
        bucket_name = config("AWS_S3_BUCKET", default="openverse-catalog")
As @obulat noted, while we use `openverse-catalog` in production, the default bucket locally is `openverse-storage`:
Gotcha, I added the existing production bucket to `BUCKETS_TO_CREATE`. This way we have fewer differences between environments. I don't know why we would want to keep `openverse-storage`. Let me know if there is a reason for this default bucket to be different between environments.
I would prefer to replace the local `openverse-storage` with `openverse-catalog` here:
openverse/catalog/env.template, line 106 in 0ed4071:
OPENVERSE_BUCKET=openverse-storage
What do you think, @AetherUnbound?
Otherwise, we now have two different buckets locally: one keeps the cleanup data and the other the data from provider scripts.
If we do decide to use `openverse-catalog` everywhere locally, then it's best to fall back to the `OPENVERSE_BUCKET` env variable.
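That fallback chain could look like this (a sketch using stdlib `os.environ` in place of decouple's `config`; the variable names are the ones discussed in this thread):

```python
import os

def get_bucket_name():
    # Prefer an explicit AWS_S3_BUCKET, then the catalog-wide
    # OPENVERSE_BUCKET, then the production default, so that local and
    # production setups converge on a single bucket name.
    return os.environ.get(
        "AWS_S3_BUCKET",
        os.environ.get("OPENVERSE_BUCKET", "openverse-catalog"),
    )
```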
I changed it to use `OPENVERSE_BUCKET`!
It looks like `openverse-catalog` is still being added to the `BUCKETS_TO_CREATE` list; shouldn't that be removed if we're deferring to `OPENVERSE_BUCKET`?
@AetherUnbound `OPENVERSE_BUCKET` defaults to "openverse-catalog" in the Ingestion Server.
Force-pushed from d24d5c4 to 314c96f
Force-pushed from 7e6c8e1 to a9d9894
@obulat @AetherUnbound Thanks for your valuable suggestions. I was too quick to mark this as ready before, but I'm still glad I did! @AetherUnbound Regarding the file size, I set it to 850 MB, so only one file is created for each field except tags; 1 GB is on the big side for downloading without a high-speed connection. About the duplicated rows in the file, I'm not sure what could have happened there; I don't see those results in my tests 🤔 Could you try again? I applied the rest of the suggested changes.
This works great now, @krysal !
I left several comments for improvements (non-blocking).
Force-pushed from 92b4eec to 98d3334
This is unfortunately still producing duplicate rows for me locally 😕 Here's everything I did to get this working:

- Add `OPENVERSE_BUCKET=openverse-storage` because my existing minio environment file didn't have the new bucket, and the new bucket defaults to `openverse-catalog`
- Remove `https://` from the `foreign_landing_url` for the first 10 records of `sample_image.csv`
- Run `just down -v`
- Run `just c` to start the catalog (and get S3 running)
- Run `just api/init` (this performs the image data refresh for us)
- Download the produced TSV from localhost:5011

Although the logs say 10 records were written for `foreign_landing_url`, the CSV I download has 40 records (4 full copies of each of the 10 rows).
Force-pushed from 98d3334 to dc3a9f3
@AetherUnbound After following those instructions, I also get some duplicates. It's weird, but I think it might be related to the retries of the image data refresh process locally (see lines 130 to 138 in feb8e00).
I added a specific folder for the output files, which gets recreated every time the process begins. I don't get any more duplicates with these changes; I hope it works for you too! Let me know.
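The fix described above can be sketched like this (`prepare_output_dir` is a hypothetical name, not the PR's actual code): recreating the output folder at the start of each run means TSV chunks left over from a retried run can't be picked up and uploaded again.

```python
import shutil
from pathlib import Path

def prepare_output_dir(base_dir):
    # Recreate the run's output folder from scratch, so leftover files from
    # a previous (possibly retried) data refresh can't produce duplicates.
    out = Path(base_dir)
    if out.exists():
        shutil.rmtree(out)
    out.mkdir(parents=True)
    return out
```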
Looks like tests for the API & ingestion server are failing on this.
Force-pushed from 6e63a5c to 89b3063
Force-pushed from c65e66f to 81a5995
Force-pushed from 3d955b1 to 0bf4f81
Force-pushed from 0bf4f81 to c2550ae
The client authenticates using the instance profile. The instance profile is not configured with permissions for S3, so this change will fail when deployed unless those permissions are added to the ingestion server IAM policy. You'll need to add those new policies to this Terraform resource block: https://github.com/WordPress/openverse-infrastructure/blob/610a8207a50c08e38955bdf069ee0e721ab630f1/modules/services/ingestion-server/iam.tf#L26
If there is a new bucket, please also add the new bucket in Terraform and ensure we control and own the bucket. There's a trend of scanning GitHub for bucket name references and hijacking ones that aren't reserved.
Fixes
Fixes #3912 by @krysal
Description
This PR changes the Ingestion Server's cleanup step to temporarily save the cleaned data to disk and upload the files to an S3 bucket once they reach a certain size (10 MB, chosen somewhat arbitrarily, to keep them manageable). S3 does not support appending to files; you can only replace them, so the file for each column to modify is split into chunks and uploaded as they are generated.
Looking at this code line in the same Ingestion Server makes me guess the credentials are taken from the environment variables and nothing else is needed in production to connect to S3, but I can't guarantee it. Let me know your thoughts on this approach.
openverse/ingestion_server/ingestion_server/distributed_reindex_scheduler.py
Line 26 in 9ac2586
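The chunk-and-upload flow described above can be sketched as follows (a simplified illustration, not the PR's code: `upload` stands in for a real S3 client call such as boto3's `put_object`, and the key names are made up):

```python
import csv
import io

def upload_in_chunks(rows, upload, chunk_size=10 * 1024 * 1024):
    # S3 objects can't be appended to, so rather than growing one file we
    # buffer cleaned rows and ship a new, numbered object each time the
    # buffer crosses the size threshold (~10 MB by default, as in the PR).
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t")
    part = 0
    for row in rows:
        writer.writerow(row)
        if buf.tell() >= chunk_size:
            upload(f"cleaned_data/part-{part:05d}.tsv", buf.getvalue())
            buf = io.StringIO()
            writer = csv.writer(buf, delimiter="\t")
            part += 1
    if buf.tell():  # flush the final partial chunk
        upload(f"cleaned_data/part-{part:05d}.tsv", buf.getvalue())
```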
Testing Instructions
To test this locally with MinIO, create the `openverse-catalog` bucket if it's not automatically created at start. Go to http://localhost:5011/ (username: `test_user` & password: `test_secret`). Then make some rows of the `image` table in the catalog dirty by removing the protocol in one or several of the URLs (`url`, `creator_url`, or `foreign_landing_url`), or modify the tags for low accuracy. Run the data refresh from the Airflow UI, wait for it to finish, and check the bucket in MinIO: http://localhost:5011/browser/openverse-catalog/
The data refresh should continue even if the upload to S3 fails for whatever reason. Try shutting down the S3 container and clearing the `ingest_upstream` step in the DAG to confirm it continues despite the failure in the upload.
Checklist
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin