This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

iNaturalist in-SQL loading #745

Merged
51 commits merged on Jan 13, 2023

Conversation

rwidom
Collaborator

@rwidom rwidom commented Sep 28, 2022

Related to

This relates to WordPress/openverse#1456 in that my understanding is that we have not yet completed a full load of iNaturalist data, and if we were going to start loading the data incrementally, we would probably want a full load to start from. In practice, it also runs the cleaning and upsert steps on smaller batches, rather than all at once across 120+ million records.

It also integrates Catalog of Life data to address WordPress/openverse#1452.

Description

The existing iNaturalist DAG essentially follows these steps: load normalized data from S3 into PostgreSQL, pull integrated JSON out in batches of 10,000 joined records, and then follow the rest of the steps common to any provider DAG to reload the JSON into PostgreSQL.

The initial load of normalized data takes on the order of 10-15 minutes for approximately 120,000,000 records in `inaturalist.photos`, with somewhat fewer records in the other inaturalist tables. However, the DAG has timed out after several days on the step where the joined data is pulled back out of PostgreSQL.

This PR moves all of the processing to SQL. The latest local run timed out after 24 hours, with 87,943,312 records loaded to `public.image`. I increased the timeout to 48 hours, though we could consider going longer; I estimate that it should complete within 36 hours. More on that is on the chart 12-14-2022 tab in this Google Sheet. The workbook also contains visualizations for other test runs during this development process.
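For a sense of the shape of the change, here is a minimal, hypothetical sketch of the batched in-SQL pattern (table names, column list, URL patterns, and the conflict handling are simplified illustrations, not the PR's actual templates):

```sql
-- Sketch only: process one bounded photo_id range per task, so no single
-- INSERT ... SELECT has to join and upsert all ~120M photos at once.
INSERT INTO public.image
    (foreign_identifier, foreign_landing_url, url, creator, license, provider)
SELECT
    p.photo_id::text,
    'https://www.inaturalist.org/photos/' || p.photo_id,
    'https://inaturalist-open-data.s3.amazonaws.com/photos/'
        || p.photo_id || '/medium.' || p.extension,
    COALESCE(o.login, p.observer_id::text),
    p.license,
    'inaturalist'
FROM inaturalist.photos p
LEFT JOIN inaturalist.observers o ON o.observer_id = p.observer_id
WHERE p.photo_id BETWEEN 0 AND 9999999   -- batch bounds supplied per task
ON CONFLICT DO NOTHING;                  -- stand-in for the real upsert logic
```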

Issues I will make and add to comments for future work

  • Improve Airflow reporting; right now it only reports the number of records loaded.
  • Consider adding indices to the temp loading table after loading data into it.
  • Figure out a better way to store the batch limit, for DRYer code.
  • Keep copies of the most recent source files on S3.

Testing Instructions

I increased the disk image size in my local Docker to 192 GB. You should be able to run the whole thing with the very small sample of test data in the repo, and I made some changes to tests/dags/providers/provider_api_scripts/resources/inaturalist/pull_sample_records.py to make it easier/faster to pull a test sample of your choice, and then poke around for data quality/formatting issues. Instructions to download the iNaturalist data.

I could really use a second set of eyes to double-check that I have implemented all of the Python data cleaning steps in the SQL (initial attempt on the "image table ddl" tab here). After running the large majority of the dataset, the log for the Python cleaning step looks promising to me, but without more on WordPress/openverse#1331, a lot of this has to happen manually.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot openverse-bot added this to In progress in Openverse PRs Sep 28, 2022
@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Sep 29, 2022
@AetherUnbound
Contributor

This is some really exciting work! I took a look at your first commit, and I think that would probably be the most ideal route. Ultimately that was what we had discussed early on, I believe - circumventing the Python-based provider data ingestion steps entirely and only doing the operations in SQL. The license table you've set up is probably something we'd want to hard code for all SQL-only loads, and I wonder what other cleaning operations/checks we could standardize in SQL-land which would match what we currently have in Python-land too. Maintaining that parity will probably be the most difficult piece going forward, but it's worth it if that is the best way of doing bulk imports IMO 🙂

@AetherUnbound AetherUnbound changed the title Feature/inaturalist performance iNaturalist in-SQL loading Oct 6, 2022
@rwidom
Collaborator Author

rwidom commented Oct 11, 2022

Thanks @AetherUnbound ! To help me make sure that all of the data quality / processing steps in Python get moved to SQL here, I made this sheet to try to list out where cleaning is happening in the catalog and API repos. Maybe it would help with WordPress/openverse#244 as well. I wonder if you and/or @krysal would have a second to take a look at it and give me feedback? I haven't finished the API part, but it's a start there. I'm also thinking about the performance issues that @obulat raised, and the impact of having so many "TOASTed" fields in such a large table. I'm wondering how to navigate the trade-offs between complete data (e.g. repeating the provider name for each and every tag in the intermediate table) and performance of the load. More to learn there!

Contributor

@AetherUnbound AetherUnbound left a comment

Whew, so much of this SQL magic is way over my head 🤯 I'm excited to see you're making awesome progress on this though!! Please feel free to ping us if there's anything you'd like us to look at or test specifically as you chug along 😄 🚋 🔬

@@ -72,7 +72,7 @@ def create_loading_table(
     columns_definition = f"{create_column_definitions(loading_table_columns)}"
     table_creation_query = dedent(
         f"""
-        CREATE TABLE public.{load_table}(
+        CREATE UNLOGGED TABLE public.{load_table}(
Contributor

Verrrrrry cool 😮

Collaborator Author

Yay!!! Right???

Contributor

TIL! Would this ever be an issue in the future if we use log-based replication? Is that something the catalogue would ever need? Maybe not something we need to worry about if we think we might move towards Parquet or some other data store than a relational DB?

Contributor

According to the documentation, the biggest downsides to an unlogged table are that they 1) are not crash resistant and will be truncated on an unclean shutdown and 2) are not replicated. Since this is a transient table (and we don't do replication anyway), we should be able to recover it if postgres shuts down by re-running this task. I don't think there's any additional concern about having this one be unlogged, even if we don't move to some other data storage.
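For reference, and not something this change needs: if replication ever did become a requirement, Postgres (9.5+) can switch an unlogged table back to a regular, WAL-logged one in place. A sketch with a made-up table name:

```sql
-- Hypothetical table name; the real loading table name is generated per run.
ALTER TABLE public.provider_data_image_inaturalist SET LOGGED;
```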

Collaborator Author

Yeah, great question, @sarayourfriend , I was kind of concerned at first when I saw what a big difference it made, but then... well, what @AetherUnbound said. :)

Comment on lines 148 to 150
# Tried importing common.loader.paths.STAGING_DIRECTORY but it didn't work in
# local environment, TO DO: find a better place for this process.
DATA_DIR = Path(__file__).parents[4]
Contributor

Ah, we no longer have STAGING_DIRECTORY there! Best place for it might be OUTPUT_DIR? Although would we want to make sure this gets cleaned up, or have it stick around for the lifetime of the container (e.g. however long it is until the next deployment)? I see you have a note about getting this into S3, would that be more resilient?

Collaborator Author

@rwidom rwidom Oct 25, 2022

Yeah, I'm not sure. I did a project once that would download tables via an API (kind of like the Catalog of Life data) and store them on S3 under a dated prefix, just because I was a hard core data hoarder. Redshift would then pretty much only have the current day's data, except in very special cases where, if we wanted to know when something changed, it wasn't impossible to go back through the archive on S3. But I think that might be overkill for here. (?) Maybe it would make sense to have something like a last_inaturalist_data_load/ prefix in the bucket where the TSV files normally go, where the raw source files could go? And yeah, changing this to OUTPUT_DIR sounds great.

Contributor

I could see that working! Especially with the cadence this DAG will run at, putting this where we'd store the other data (either TSV or otherwise) sounds like a good call 🙂

f"No download from Catalog of Life. {DATA_DIR}/{local_zip_file} exists."
)
else:
with requests.get(COL_URL, stream=True) as r:
Contributor

Ooo, this is a new request pattern! Unfortunately we don't have a way to handle streaming yet in the DelayedRequester machinery. Something to think about down the line maybe!

Collaborator Author

Totally! I thought about trying to use DelayedRequester, and I think it does take kwargs, so technically it could just pass along the stream=True, but then it wouldn't have the rest of the pattern. It seems like it's more designed for a bunch of smaller requests than the one big one, but maybe I should at least be more explicit about retries? Or maybe it would be better to break this into smaller tasks and let airflow handle it?

Contributor

I think this is solid as is, because even with DelayedRequester's retries we'd still need to change a LOT in order to get the context manager + stream to work. Something to look at again if it's a pattern we notice though!

Contributor

It looks like this would be possible outside the context manager, but we would need to ensure the connection is closed once the operation is complete:

r = None
try:
    # Stream the response so the (large) zip is written to disk in chunks
    # rather than held in memory.
    r = self.delayed_requester.get(COL_URL, stream=True)
    with open(DATA_DIR / local_zip_file, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
finally:
    # Release the connection even if the download or write fails partway.
    if r is not None:
        r.close()

Comment on lines 140 to 149
(1, 'Animals'),
(47118, 'Spider'),
(47124, 'Flowers'),
(47126, 'Plants'),
(47157, 'Butterflies and Moths'),
(47158, 'Insect'),
(47167, 'Mushroom'),
(47434, 'Grass'),
(47604, 'Daisy'),
(48662, 'Monarch Butterfly');
Contributor

Is there any concern these taxon_ids could change? Would it be better to map the Latin to the English here and then link that to IDs via a join?

Collaborator Author

That is so much a better idea!!!
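Something along these lines, perhaps (a rough sketch; the seed values and the taxa column names are illustrative, not final code):

```sql
-- Sketch: seed the top-level labels by scientific name and resolve the numeric
-- taxon_id with a join, so hard-coded iNaturalist IDs can't silently go stale.
WITH seed (scientific_name, label) AS (
    VALUES
        ('Animalia', 'Animals'),
        ('Araneae', 'Spider'),
        ('Plantae', 'Plants'),
        ('Lepidoptera', 'Butterflies and Moths'),
        ('Insecta', 'Insect'),
        ('Fungi', 'Mushroom'),
        ('Poaceae', 'Grass'),
        ('Danaus plexippus', 'Monarch Butterfly')
)
SELECT t.taxon_id, s.label
FROM seed s
JOIN inaturalist.taxa t ON t.name = s.scientific_name;
```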

FROM inaturalist.col_name_usage n
INNER JOIN inaturalist.col_vernacular v on v.taxonid = n.id
where length(n.id) <= 10
group by cast(md5(n.scientificname) as uuid)
Contributor

Simple SQL literacy question: could this just be `group by 1`?

Collaborator Author

This is such a great question! Yeah, I feel like there are some dialects or configurations where that doesn't work and others where it's more performant, but I think our postgres is one of the latter, so yeah, I'll totally change it.

Contributor

Nice! I was mostly thinking about not having to declare the same functions on the column, but it'd be even better if that is more performant 🏎️
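For concreteness, the ordinal form would look something like this (same shape as the hunk above; the aggregated column name is illustrative):

```sql
-- GROUP BY 1 refers to the first item in the SELECT list, so the md5/uuid
-- expression only has to be written once.
SELECT
    cast(md5(n.scientificname) as uuid) AS name_uuid,
    string_agg(v.name, '|') AS vernacular_names   -- column name illustrative
FROM inaturalist.col_name_usage n
INNER JOIN inaturalist.col_vernacular v ON v.taxonid = n.id
WHERE length(n.id) <= 10
GROUP BY 1;
```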

,'|')) as TAGS,
taxa_enriched.title,
LICENSE_CODES.license_url_metadata as META_DATA,
taxa_enriched.tags,
Contributor

The tag enrichment is exactly the kind of behavior I was thinking we'd need to have SQL equivalents for 😅 This is great for now, but definitely as we start to have more SQL-only ingestions we'll want some mechanism for abstracting this! Maybe a set of defined functions?

Collaborator Author

Yeah... I worry about navigating the costs of context switching with user-defined functions in SQL, particularly for functions that will be run many times over a table, but that might just be something I need to learn to do better in Postgres. dbt lets you write Jinja macros (e.g. pivot) for this kind of thing so that you're always running something in SQL, but your code can still be DRY.

And I guess for this one in particular, I have a bit of a fantasy that we will make a separate tags table (e.g. with media_type, provider, identifier, tag_name and maybe authority if that makes sense) and stop trying to handle them in JSON at all.

But yeah, at a high level, I totally agree that this is not ideal for the longer term. I'm adding it to the list at the top of the PR for now.
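Purely as a sketch of that idea (nothing like this exists in the catalog today; all names are made up):

```sql
-- Speculative: a normalized tags table instead of packing tags into JSON on
-- the media record.
CREATE TABLE tag (
    media_type varchar(80)  NOT NULL,
    provider   varchar(80)  NOT NULL,
    identifier uuid         NOT NULL,   -- the media record's identifier
    name       varchar(255) NOT NULL,
    authority  varchar(255),            -- e.g. 'inaturalist taxonomy', if useful
    PRIMARY KEY (media_type, provider, identifier, name)
);
```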

@rwidom
Collaborator Author

rwidom commented Jan 2, 2023

I'm tracking my test runs here. I haven't done a full load to image locally yet, and I'm not 100% sure what's going on with Docker that made Airflow come down on this most recent run. But, this last commit has a couple of improvements that might help:

  • I realized that the Catalog of Life data does have an indicator for which country/ies use a particular vernacular name. It's only populated about 2/3 of the time, but I'm guessing / hoping that the number of countries is a reasonable proxy for the popularity/importance of the vernacular name.
  • Given that, I felt more comfortable limiting the titles to 255 characters rather than 5,000, which, depending on how many photos the longer-titled taxa (and their child taxa) have, could save a lot of space.

But, on this test run, Airflow completely crashed before the DB had a chance to run out of space. :/ Maybe you can test it locally again and/or offer advice @AetherUnbound ?

@AetherUnbound
Contributor

Ah okay, that's all helpful! I had assumed that it was the Airflow disk space being filled up because of the TSVs, but on second glance that seems incorrect since we're loading directly from PG->PG now! The production database has plenty of space so that shouldn't be a problem 😄 It's odd that Airflow went down on that last run. Would you be able to try another and see what happens? If the scheduler crashes again, I'd love to see the last few lines of output of `just logs scheduler` to see what might have caused the scheduler to crash.

@rwidom
Collaborator Author

rwidom commented Jan 3, 2023

Would you be able to try another and see what happens? If the scheduler crashes again, I'd love to see the last few lines of output of `just logs scheduler` to see what might have caused the scheduler to crash.

Yeah, I can give it a try this weekend; it would just have to wait until then, because I need to kind of leave my computer alone for a couple of days while it's running locally. I did try `just logs` but I didn't add `scheduler`, so it might have been confused. I think I used to have all of Airflow in one container, and I wouldn't think that would matter here, but maybe it does?

@krysal
Member

krysal commented Jan 3, 2023

I just started reviewing this, and I have to say, this is such an incredible work of investigation and improvement! I haven't read the whole thread or code yet, but observing the following makes it very promising:

  • switching the chunks to fixed photo ID ranges (with some variation in number of records), rather than asking postgres to figure out how to pull an exact number of records for each chunk;
  • moving the dataset between postgres and s3 just once in compressed form, rather than once compressed, and then twice more in a massively expanded json format; and
  • switching the temp table to nologging;

I ran the DAG as is and I got 14 rows in the image table, which I assume is the current test data. Could you add more instructions on how to test it with a bigger/more up-to-date iNaturalist dataset? I downloaded inaturalist-open-data-latest.tar.gz. Should I copy the files into /tests/s3-data/inaturalist-open-data as specified in #549? Apologies for the wait; a lot of urgent things came up near the end of last year's holidays. Thank you a lot for the hard work put in here, I'm eager to dig through it!

@AetherUnbound
Contributor

I did try `just logs` but I didn't add `scheduler`, so it might have been confused. I think I used to have all of Airflow in one container, and I wouldn't think that would matter here, but maybe it does?

It used to be all in one container! But now it's spread across webserver and scheduler as part of the changes from #874. When Docker spits out its logs, I've found that it tends to split them out sometimes by service rather than chronologically (which can be very frustrating for debugging!). `just logs scheduler` makes sure that only that one service's logs are shown, which helps a bit in piecing things apart (if the problem is actually the scheduler, that is 😅).

@rwidom
Collaborator Author

rwidom commented Jan 3, 2023

I ran the DAG as is and I got 14 rows in the image table, which I assume is the current test data. Could you add more instructions on how to test it with a bigger/more updated iNaturalist dataset? I downloaded inaturalist-open-data-latest.tar.gz. Should I copy the files into the /tests/s3-data/inaturalist-open-data as specified in #549?

Yeah, this has been one of the key challenges of this work. I've been downloading four separate zipped files (observations.csv.gz, observers.csv.gz, photos.csv.gz, and taxa.csv.gz) from s3://inaturalist-open-data and putting them in that test folder. Just be aware that it will take a good long while to run locally if you do the whole dataset. I've also used tests/dags/providers/provider_api_scripts/resources/inaturalist/pull_sample_records.py to pull a more mid-sized sample dataset from the raw data, and you can play around with that if you'd like before committing to the full load.

Truly apologize for the wait; lot of urgent things appeared near the end of last year's holidays, but thank you a lot for the hard work put in here, I'm eager to deep through it!

No apologies necessary at all, and I'm psyched to hear/read your thoughts @krysal !

@rwidom
Collaborator Author

rwidom commented Jan 4, 2023

I added comments with more details on testing. And then it occurred to me that I sometimes comment out the post-ingestion parts of the DAG for testing, so that I can compare the raw iNaturalist and Catalog of Life tables against the target image table. Maybe I'll take a look at how to set up a runtime variable in Airflow to skip those steps for testing. Hmmmm... Overkill? Helpful? What do you think @krysal and @AetherUnbound ?

@AetherUnbound
Contributor

Ah, good question! While that would be possible certainly, I think it's okay to just add that as an additional comment to the most recent testing documentation you added 🙂

@rwidom
Collaborator Author

rwidom commented Jan 5, 2023

I started looking at how to use a runtime parameter today, just because I was curious, and in the process realized that I had commented out the code to delete the Catalog of Life downloads. :/ So this uses the parameter for both the Catalog of Life files and the iNaturalist DB schema. But it also raises the question -- Catalog of Life could be its own DAG in some ways. Anyway, I don't know, I can roll back this commit. I have some concerns about side effects given that I had to set it at the DAG level rather than the task level. Or maybe instead I should move the COL file removal into its own separate post-ingestion task, so that the parameter can just be for the ShortCircuit step rather than the whole DAG... ???

@AetherUnbound
Contributor

AetherUnbound commented Jan 5, 2023

Ooo, very cool! I think your current implementation looks great - defaulting to prod behavior (removing the files) was all I wanted to assure and it seems that's the case 🙂

@rwidom
Collaborator Author

rwidom commented Jan 9, 2023

I finished a full local run with data downloaded on 1/1/2023. Some details and visualizations are here.

[2023-01-09, 11:30:46 UTC] {python.py:177} INFO - Done. Returned value was: 
*DAG*: `inaturalist_workflow`
*Date range*: _all_
*Duration of data pull tasks*: 1 day, 21 hours, 57 mins, 46 secs
*Number of records upserted per media type*:
  - `image`: 152,187,801

Reflections:

  • The run paused twice because the database ran out of space. Each time, I did `just down` (no `-v`!!!), went into Docker resources to increase space, then `just up`, and it worked great (64GB --> 192GB --> 296GB). Pretty awesome.
  • I ran this super helpful SQL to get a sense of how much space the iNaturalist data would take up in the prod database (top row of output below; a rough equivalent of the query is sketched at the end of this comment). Total space is ~200GB, or on the order of 20 times the size of the raw (normalized, compressed) files.
+--------------------+--------------------------+--------------+------------+------------+------------+------------+------------------------+
| table_schema       | table_name               | row_estimate | total      | index      | toast      | table      | total_size_share       |
|--------------------+--------------------------+--------------+------------+------------+------------+------------+------------------------|
| public             | image                    | 143581550.0  | 198 GB     | 50 GB      | 8192 bytes | 148 GB     | 0.9999561837655186     |
  • I'm not sure that I really understand how Postgres is counting TOAST data here or in production.
  • This was still starting from an empty image table. But I waited a bit and then ran just the small test files to see how an update would perform, and it seems really good:
_Duration is the sum of the duration for each data pull task. It does not include loading time and does not account for data pulls that may happen concurrently._
[2023-01-09, 12:58:02 UTC] {python.py:177} INFO - Done. Returned value was: 
*DAG*: `inaturalist_workflow`
*Date range*: _all_
*Duration of data pull tasks*: 8 secs
*Number of records upserted per media type*:
  - `image`: 14
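A rough equivalent of the kind of per-table size query mentioned above (not the exact one linked; written from memory against the Postgres catalogs):

```sql
-- Approximate per-table sizes, split into heap, index, and everything else
-- (TOAST plus free-space/visibility maps), largest tables first.
SELECT
    n.nspname                                     AS table_schema,
    c.relname                                     AS table_name,
    c.reltuples                                   AS row_estimate,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total,
    pg_size_pretty(pg_indexes_size(c.oid))        AS "index",
    pg_size_pretty(pg_total_relation_size(c.oid)
        - pg_indexes_size(c.oid)
        - pg_relation_size(c.oid))                AS toast_and_other,
    pg_size_pretty(pg_relation_size(c.oid))       AS "table"
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC;
```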

Contributor

@AetherUnbound AetherUnbound left a comment

That info is so helpful, thank you @rwidom!! I'm feeling good about this based on your data, and your changes to the tests look great! Note that there are likely to be some conflicts with #939 because of the executor changes. I think I'm going to merge that PR shortly so if it's OK with you I'll also rebase this PR and address any merge conflicts that might come up with that one.

Thank you for all your continued efforts on this PR! It's exciting to think how much we'll be increasing our catalog's data size 💥

@AetherUnbound
Contributor

Okay, I think I've done the merge correctly and gotten this in line with main! I was able to run the DAG with the test data successfully 🙂 Please give it a double-check @rwidom in case I missed anything!

Member

@krysal krysal left a comment

After taking the time to review this in detail, I have to say this is impressive! Seeing all the parts involved, like importing from the S3 CSV files, the communication between different tasks in the DAG, the use of SQL templates, and even more (I won't be able to name them all)! Seriously huge kudos for transforming iNaturalist into an almost purely SQL-load provider DAG 🎉

I left a few comments on the fields used that are quick to fix, so I don't want to hold onto this PR any longer! Just being able to finish a DAG run is amazing 😄 and, as you already noted, we can work on improvements in separate issues/PRs. For example, one of the things we introduced a while ago was standardized filetypes; it would be nice to port that to SQL and avoid having a mix of jpg, jpeg, and JPG, but it's not super necessary right now.

Comment on lines 42 to 43
COALESCE(INATURALIST.OBSERVERS.LOGIN, INATURALIST.PHOTOS.OBSERVER_ID::text)
as CREATOR,
Member

The name of the observer will be preferred over their username.

Suggested change
-    COALESCE(INATURALIST.OBSERVERS.LOGIN, INATURALIST.PHOTOS.OBSERVER_ID::text)
-        as CREATOR,
+    COALESCE(INATURALIST.OBSERVERS.NAME, INATURALIST.OBSERVERS.LOGIN)
+        as CREATOR,

Collaborator Author

Ooh! Great catch! I think I must have had a left join to observers in a prior version and maybe I was worried about some names being truncated or something? I don't know. Hopefully my fix doesn't over-complicate things.

Openverse PRs automation moved this from Needs review to Reviewer approved Jan 11, 2023
rwidom and others added 5 commits January 13, 2023 07:34
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon
Projects
No open projects
Openverse PRs: Merged!
Development

Successfully merging this pull request may close these issues.

Add translations for top level taxonomies in inaturalist
5 participants