
Save cleaned data of Ingestion Server to AWS S3 #4163

Draft · @krysal wants to merge 14 commits into main from ing_server_save_to_s3

Conversation

@krysal (Member) commented Apr 19, 2024

Fixes

Fixes #3912 by @krysal

Description

This PR changes the Ingestion Server's cleanup step to temporarily save the cleaned data to disk and upload the files to an S3 bucket once they reach a certain size (I chose 10 MB somewhat arbitrarily, to keep them manageable). S3 does not support appending to files; you can only replace them, so the file for each column to modify is split into chunks that are uploaded as they are generated.

Looking at this code line in the same Ingestion Server, my guess is that the credentials are taken from environment variables and nothing else is needed in production to connect to S3, but I can't guarantee it. Let me know your thoughts on this approach.

client = boto3.client("ec2", region_name=config("AWS_REGION", default="us-east-1"))
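
For reference, a minimal sketch of the chunk-and-upload idea, assuming boto3 resolves credentials (AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY or an instance profile) from the environment and using the same decouple-style config helper as the line above; the helper name, size constant, and key prefix are hypothetical:

import boto3
from decouple import config

# Hypothetical sketch: cleaned rows for each column are buffered on disk and flushed
# to S3 once a size threshold is crossed. S3 objects can't be appended to, so each
# flushed chunk becomes its own object key.
CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB, chosen arbitrarily to keep files manageable

s3 = boto3.client("s3", region_name=config("AWS_REGION", default="us-east-1"))
bucket = config("AWS_S3_BUCKET", default="openverse-catalog")

def upload_chunk(field: str, part_number: int, local_path: str) -> None:
    # The key prefix is illustrative only; the real layout may differ.
    key = f"shared/data-refresh-cleaned-data/{field}_{part_number}.tsv"
    s3.upload_file(local_path, bucket, key)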

Testing Instructions

To test this locally with MinIO, create the openverse-catalog bucket if it's not created automatically at startup. Go to http://localhost:5011/ (username: test_user & password: test_secret).

just api/up && just catalog/up

Then make some rows of the image table in the catalog dirty by removing the protocol from one or several of the URLs (url, creator_url, or foreign_landing_url), or by modifying the tags to have low accuracy. Run the data refresh from the Airflow UI, wait for it to finish, and check the bucket in MinIO.

http://localhost:5011/browser/openverse-catalog/

The data refresh should continue even if the upload to S3 fails for whatever reason. Try shutting down the S3 container and clearing the ingest_upstream step in the DAG to confirm it continues despite the upload failure.
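
A rough sketch of that failure behaviour (names are hypothetical): the upload is wrapped so an S3 error is logged rather than raised, letting the data refresh proceed.

import logging

import boto3
from botocore.exceptions import BotoCoreError, ClientError

log = logging.getLogger(__name__)
s3 = boto3.client("s3")

def upload_without_blocking(local_path: str, bucket: str, key: str) -> None:
    # Hypothetical wrapper: an unreachable S3/MinIO should not abort the data refresh.
    try:
        s3.upload_file(local_path, bucket, key)
    except (BotoCoreError, ClientError):
        log.exception(f"Upload of {key} failed; continuing the data refresh without it.")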

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@krysal krysal requested review from a team as code owners April 19, 2024 01:46
@krysal krysal requested review from fcoveram, AetherUnbound and obulat and removed request for fcoveram April 19, 2024 01:46
@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Apr 19, 2024
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Apr 19, 2024
@obulat (Contributor) left a comment

I ran the data refresh locally, and saw the saved data in the minio bucket. I'll look through the code again later, but wanted to add a comment here now.

When running locally, I got a "bucket does not exist" error. Do you think it's better to reuse an existing bucket, or to add the new one to the MinIO docker template:
BUCKETS_TO_CREATE=openverse-storage,openverse-airflow-logs

@@ -2,7 +2,7 @@ PYTHONUNBUFFERED="0"

 #AWS_REGION="us-east-1"

-#ENVIRONMENT="local"
+ENVIRONMENT="local"
Contributor:

I think it would be better to add a default value instead of uncommenting this var here.
Currently, when ENVIRONMENT is not defined, you get an error when running the data refresh: [2024-04-19 12:49:29,717 - root - 186][ERROR] ENVIRONMENT not found. Declare it as envvar or define a default value.
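
A minimal sketch of that suggestion, assuming the decouple config helper used elsewhere in the ingestion server; the default value shown is illustrative:

from decouple import config

# Give ENVIRONMENT a default so the ingestion server starts without the env var set.
ENVIRONMENT = config("ENVIRONMENT", default="local")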

Contributor:

Agreed, I ran into this while trying to test this locally myself.

Member Author (@krysal):

I was thinking about it, and the thing is, I often run into conflicts with the default env vars for the catalog. I'm used to using production values as defaults, but we prefer dev/testing values there. I guess using local is not so bad here, and since we want to get rid of the ingestion server, these defaults won't last long anyway.

@AetherUnbound (Contributor):
> When running locally, I got a "bucket does not exist" error. Do you think it's better to reuse an existing bucket, or to add the new one to the MinIO docker template:
> BUCKETS_TO_CREATE=openverse-storage,openverse-airflow-logs

I checked in S3 and we've used s3://${OPENVERSE_BUCKET}/shared/data-refresh-cleaned-data/ for previous versions of these files, so I think that would be okay to use!

> upload the files to an S3 bucket once they reach a certain size (I chose 10 MB somewhat arbitrarily, to keep them manageable)

Having checked S3, it looks like the creator_url.tsv and foreign_landing_url.tsv (the largest ones) are 700-800MB (IIRC tags.tsv was too large to even save and upload). We have 16GB of storage available on the primary ingestion server and 8GB available on each indexer worker. Given that a 10MB limit would produce 80 files for creator_url.tsv for example, I think setting the limit to something like 1GB might make more sense!

@AetherUnbound (Contributor) left a comment

I'm excited about this! There are a few things that will need to be adjusted, in addition to my comments above about file size before upload.

Additionally, the testing instructions appear to no longer be accurate, since that initialization based on the environment appears to be a part of the code now.

Lastly, I adjusted the default bucket name and removed some HTTP schemes from the sample data as I noted below. I was able to get the upload to work, but the CSV which was uploaded had multiple copies of each identifier (the screenshot below has the rows sorted just to show the duplications; they were not in that order in the originally produced file). It appears that some of the logic is causing duplicate file writing.
[screenshot: uploaded CSV with each identifier repeated several times]

(five review threads on ingestion_server/ingestion_server/cleanup.py, since resolved and outdated)
@@ -364,7 +413,7 @@ def clean_image_data(table):
     log.info(f"Starting {len(jobs)} cleaning jobs")

     for result in pool.starmap(_clean_data_worker, jobs):
-        batch_cleaned_counts = save_cleaned_data(result)
+        batch_cleaned_counts = data_uploader.save(result)
Contributor:

Just noting this because I had to talk myself through it - I was worried that the data uploader step was happening in each process as part of the multiprocessing, but it looks like each result that comes out of pool.starmap is processed serially, so we don't need to worry about multiple processes stepping on each other.
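
A small standalone illustration of that point (a toy example, not the PR's code): Pool.starmap collects its results in the parent process, so the loop that saves/uploads them runs sequentially.

from multiprocessing import Pool

def _toy_worker(start: int, end: int) -> list[tuple[int, int]]:
    # Runs in a child process; it only returns data, it does not upload anything itself.
    return [(i, i * 2) for i in range(start, end)]

if __name__ == "__main__":
    jobs = [(0, 5), (5, 10), (10, 15)]
    with Pool(processes=3) as pool:
        # starmap gathers all results in the parent; this loop is sequential,
        # so an uploader called here never runs in two processes at once.
        for result in pool.starmap(_toy_worker, jobs):
            print(len(result))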

}

def __init__(self):
    bucket_name = config("AWS_S3_BUCKET", default="openverse-catalog")
Contributor:

As @obulat noted, while we use openverse-catalog in production, the default bucket locally is openverse-storage:

https://github.com/WordPress/openverse/blob/51fd23537b695114b203ace195041eaec0d7f8b4/catalog/env.template#L106-L105

Member Author (@krysal):

Gotcha, I added the bucket that exists in production to BUCKETS_TO_CREATE. This way we have fewer differences between environments. I don't know why we would want to keep openverse-storage. Let me know if there is a reason for this default bucket to differ between environments.

Contributor:

I would prefer to replace the local openverse-storage with openverse-catalog here:

OPENVERSE_BUCKET=openverse-storage

What do you think, @AetherUnbound?
Otherwise, we now have two different buckets locally: one keeps the cleanup data and the other keeps the data from the provider scripts.

Contributor:

If we do decide to use openverse-catalog everywhere locally, then it's best to fall back to the OPENVERSE_BUCKET env variable.

Member Author (@krysal):

I changed it to use OPENVERSE_BUCKET!
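
Presumably something along these lines (a sketch, not the exact diff), given that OPENVERSE_BUCKET defaults to "openverse-catalog" in the Ingestion Server:

from decouple import config

# Fall back to the shared OPENVERSE_BUCKET variable, keeping the production bucket
# name as the default when it is not set.
bucket_name = config("OPENVERSE_BUCKET", default="openverse-catalog")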

Contributor:

It looks like openverse-catalog is still being added to the BUCKETS_TO_CREATE list; shouldn't that be removed if we're deferring to OPENVERSE_BUCKET?

Member Author (@krysal):

@AetherUnbound OPENVERSE_BUCKET defaults to "openverse-catalog" in the Ingestion Server.

@krysal krysal marked this pull request as draft April 20, 2024 16:23
@krysal krysal force-pushed the ing_server_save_to_s3 branch 3 times, most recently from 7e6c8e1 to a9d9894 Compare April 21, 2024 03:28
@krysal (Member, Author) commented Apr 21, 2024

@obulat @AetherUnbound Thanks for your valuable suggestions. I was too quick to mark this as ready before, but I'm still glad I did!

@AetherUnbound Regarding the file size, I set it to 850 MB, so only one file is created for each field except tags. 1 GB is on the big side for downloading without a high-speed connection. About the duplicated rows in the file, I'm not sure what could have happened there; I don't see those results in my tests 🤔 Could you try again? I applied the rest of the suggested changes.

@krysal krysal marked this pull request as ready for review April 21, 2024 03:57
@obulat (Contributor) left a comment

This works great now, @krysal!

I left several comments for improvements (non-blocking).

(two review threads on ingestion_server/ingestion_server/cleanup.py, since resolved)

@krysal krysal force-pushed the ing_server_save_to_s3 branch 2 times, most recently from 92b4eec to 98d3334 Compare April 22, 2024 21:04
@AetherUnbound (Contributor) left a comment

This is unfortunately still producing duplicate rows for me locally 😕 here's everything I did to get this working:

  1. Add OPENVERSE_BUCKET=openverse-storage because my existing minio environment file didn't have the new bucket, and the new bucket defaults to openverse-catalog
  2. Remove https:// from the foreign_landing_url for the first 10 records of sample_image.csv
  3. Run just down -v
  4. Run just c to start the catalog (and get S3 running)
  5. Run just api/init (this performs the image data refresh for us)
  6. Download the produced TSV from localhost:5011

Although the logs say 10 records were written for foreign_landing_url, the CSV I download has 40 records (4 full copies of each of the 10 rows).

@krysal (Member, Author) commented Apr 25, 2024

@AetherUnbound After following those instructions, I also get some duplicates. It's weird, but I think it might be related to the retries for the image data refresh process locally.

# Image ingestion is flaky; but usually works on the next attempt
set +e
while true; do
  just ingestion_server/ingest-upstream "image" "init"
  if just docker/es/wait-for-index "image-init"; then
    break
  fi
  ((c++)) && ((c == 3)) && break
done

I added a specific folder for the output files, which gets recreated every time the process begins. I no longer get duplicates with these changes; I hope it works for you too! Let me know.
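
Roughly along these lines (a sketch; the directory path and function name are hypothetical): the output directory is wiped and recreated when a cleanup run starts, so chunks left over from a retried run can't be uploaded again.

import shutil
from pathlib import Path

# Hypothetical location for the temporary cleaned-data files.
OUTPUT_DIR = Path("/tmp/cleaned_data")

def reset_output_dir() -> None:
    # Remove anything left behind by a previous (possibly retried) run, then recreate.
    shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)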

@AetherUnbound (Contributor):
Looks like tests for the API & ingestion server are failing on this

@krysal krysal marked this pull request as draft April 26, 2024 16:15
@krysal krysal force-pushed the ing_server_save_to_s3 branch 4 times, most recently from 3d955b1 to 0bf4f81 Compare April 26, 2024 22:28
@sarayourfriend (Contributor) commented May 1, 2024

> Looking at this code line in the same Ingestion Server, my guess is that the credentials are taken from environment variables and nothing else is needed in production to connect to S3, but I can't guarantee it. Let me know your thoughts on this approach.

The client authenticates using the instance profile. The instance profile is not configured with permissions for s3, so this change will fail when deployed unless those permissions are added to the ingestion server IAM policy. You'll need to add those new policies to this Terraform resource block: https://github.com/WordPress/openverse-infrastructure/blob/610a8207a50c08e38955bdf069ee0e721ab630f1/modules/services/ingestion-server/iam.tf#L26

If there is a new bucket, please also add the new bucket in Terraform and ensure we control and own the bucket. There's a trend of scanning GitHub for bucket name references and hijacking ones that aren't reserved.

Successfully merging this pull request may close: Upload Ingestion Server's TSV files to AWS S3 and save fixed tags