This repository has been archived by the owner on Aug 4, 2023. It is now read-only.
iNaturalist in-SQL loading #745
Merged
51 commits
5aeac20
cleaning and temp table in pg
rwidom 39d1a36
sketch of full dag NOT TESTED
rwidom 9f195e9
inaturalist dag without tests or reporting (yet)
rwidom 952f2cf
complete dag, 25 mill recs in 5.5 hours local test
rwidom 5570f09
Merge branch 'main' into feature/inaturalist-performance
rwidom 3162ef5
Add passwords for s3 testing with new docker
rwidom e297cbf
make temp loading table UNLOGGED to load it faster
rwidom 8650940
inat with translation 75 million recs in 8 hrs
rwidom e588b00
using OUTPUT_DIR for API files
rwidom 0ad51ba
clarify delayed requester vs requester
rwidom ee65714
DRYer approach to tags TO DO
rwidom 4f343ad
comments on taxa transformation
rwidom 19ae673
scientific names not ids for manual translation
rwidom bf8846f
TO DO comment clean-up
rwidom 1f7e98f
fix name insert syntax
rwidom c1ebae4
Merge 'main' into feature/inaturalist-performance
rwidom ca2b444
add clarity on batch limit override
rwidom 6daa0a5
missing piece of merge from main
rwidom 4397e9c
limit to 20 tags per photo
rwidom 249770a
add option to use alternate dag creation for sql
rwidom c786d19
adjust tests see issue #898
rwidom a8c76af
slightly faster way to pull medium test sample
rwidom b10c7ee
Note another data source for vernacular names
rwidom 65eec9b
remove unnecessary test code
rwidom 2b1f56b
clean and upsert one batch at a time
rwidom 7764ca1
log parsing resource doc
rwidom 5437d7b
Merge branch 'main' into feature/inaturalist-performance
rwidom 0812fb4
use common.constants.IMAGE instead of MEDIA_TYPE
rwidom d6818d9
Merge branch 'main' into feature/inaturalist-performance
rwidom 1af20c5
add explanation of ancestry joins and taxa tags
rwidom 7e25262
use existing clean_intermediate_table_data
rwidom d3ae9f6
remove unnecessary env vars from load_to_s3
rwidom 4f01d3b
declarative doc string for file update check
rwidom cb01384
update iNaturalist description
rwidom 8f057c8
remove message to Staci :)
rwidom f562556
use dynamically generated load subtasks
rwidom 2ac1562
clarify taxa comments and include languages
rwidom c7af0f9
consolidate consolidation code
rwidom 2f03c71
add testing for consolidated metrics
rwidom 40becc6
separate ti_mock instances per test
rwidom 4c0b036
test get batches
rwidom a1fac6f
shorter titles to save space
rwidom 6ff4a14
add better testing instructions
rwidom f975c81
dag parameter to manage post-ingestion deletions
rwidom cc6ba28
Merge branch 'main' into feature/inaturalist-performance
AetherUnbound 893c48b
Add kwargs to get_response_json call
AetherUnbound d354bc9
get_media_type can be static method
rwidom fc60f53
link to original inaturalist photo, rather than medium
rwidom 9f71a6c
prefer creator name over login
rwidom 8c6b56f
remove unused constants
rwidom 2b52c06
add to do for extension cleanup
rwidom
418 changes: 347 additions & 71 deletions
openverse_catalog/dags/providers/provider_api_scripts/inaturalist.py
87 changes: 0 additions & 87 deletions
..._catalog/dags/providers/provider_csv_load_scripts/inaturalist/export_to_json.template.sql
This file was deleted.
Verrrrrry cool 😮
Yay!!! Right???
TIL! Would this ever be an issue in the future if we use log-based replication? Is that something the catalogue would ever need? Maybe it's not something we need to worry about if we think we might move towards parquet or some data storage other than a relational DB?
According to the documentation, the biggest downsides to unlogged tables are that they 1) are not crash-safe and will be truncated after a crash or unclean shutdown and 2) are not replicated. Since this is a transient table (and we don't do replication anyway), we should be able to rebuild it by re-running this task if Postgres shuts down. I don't think there's any additional concern about having this one be unlogged, even if we don't move to some other data storage.
Yeah, great question, @sarayourfriend, I was kind of concerned at first when I saw what a big difference it made, but then... well, what @AetherUnbound said. :)
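For readers following the unlogged-table discussion, here is a minimal sketch of how a transient loading table could be declared `UNLOGGED`. It is not the code from this PR: the helper `build_unlogged_table_sql`, the table name, and the column names are all hypothetical, and the function only builds the SQL string rather than running it against Postgres.

```python
import re

# Conservative check for plain SQL identifiers (illustrative only).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_unlogged_table_sql(table_name: str, columns: dict[str, str]) -> str:
    """Return a CREATE UNLOGGED TABLE statement for a transient loading table.

    Hypothetical helper: the names here are assumptions, not taken from the PR.
    Column types are passed through unvalidated, so they must come from trusted code.
    """
    for name in (table_name, *columns):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    column_sql = ", ".join(
        f"{name} {sql_type}" for name, sql_type in columns.items()
    )
    # UNLOGGED skips write-ahead logging: writes are faster, but the table is
    # truncated after a crash or unclean shutdown and is not replicated.
    return f"CREATE UNLOGGED TABLE IF NOT EXISTS {table_name} ({column_sql});"


# Hypothetical table and column names, for illustration.
print(
    build_unlogged_table_sql(
        "temp_inaturalist_load",
        {"foreign_identifier": "text", "url": "text"},
    )
)
```

Because the table is truncated after an unclean shutdown, this pattern only makes sense when the data can be rebuilt by re-running the load task, which is the situation described in the thread above.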