This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

iNaturalist in-SQL loading #745

Merged
merged 51 commits into main from feature/inaturalist-performance on Jan 13, 2023
51 commits
5aeac20
cleaning and temp table in pg
rwidom Sep 26, 2022
39d1a36
sketch of full dag NOT TESTED
rwidom Sep 28, 2022
9f195e9
inaturalist dag without tests or reporting (yet)
rwidom Oct 1, 2022
952f2cf
complete dag, 25 mill recs in 5.5 hours local test
rwidom Oct 4, 2022
5570f09
Merge branch 'main' into feature/inaturalist-performance
rwidom Oct 15, 2022
3162ef5
Add passwords for s3 testing with new docker
rwidom Oct 17, 2022
e297cbf
make temp loading table UNLOGGED to load it faster
rwidom Oct 24, 2022
8650940
inat with translation 75 million recs in 8 hrs
rwidom Oct 24, 2022
e588b00
using OUTPUT_DIR for API files
rwidom Oct 29, 2022
0ad51ba
clarify delayed requester vs requester
rwidom Dec 1, 2022
ee65714
DRYer approach to tags TO DO
rwidom Dec 1, 2022
4f343ad
comments on taxa transformation
rwidom Dec 2, 2022
19ae673
scientific names not ids for manual translation
rwidom Dec 2, 2022
bf8846f
TO DO comment clean-up
rwidom Dec 2, 2022
1f7e98f
fix name insert syntax
rwidom Dec 6, 2022
c1ebae4
Merge 'main' into feature/inaturalist-performance
rwidom Dec 10, 2022
ca2b444
add clarity on batch limit override
rwidom Dec 10, 2022
6daa0a5
missing piece of merge from main
rwidom Dec 10, 2022
4397e9c
limit to 20 tags per photo
rwidom Dec 11, 2022
249770a
add option to use alternate dag creation for sql
rwidom Dec 11, 2022
c786d19
adjust tests see issue #898
rwidom Dec 12, 2022
a8c76af
slightly faster way to pull medium test sample
rwidom Dec 14, 2022
b10c7ee
Note another data source for vernacular names
rwidom Dec 14, 2022
65eec9b
remove unnecessary test code
rwidom Dec 14, 2022
2b1f56b
clean and upsert one batch at a time
rwidom Dec 16, 2022
7764ca1
log parsing resource doc
rwidom Dec 16, 2022
5437d7b
Merge branch 'main' into feature/inaturalist-performance
rwidom Dec 16, 2022
0812fb4
use common.constants.IMAGE instead of MEDIA_TYPE
rwidom Dec 20, 2022
d6818d9
Merge branch 'main' into feature/inaturalist-performance
rwidom Dec 20, 2022
1af20c5
add explanation of ancestry joins and taxa tags
rwidom Dec 21, 2022
7e25262
use existing clean_intermediate_table_data
rwidom Dec 21, 2022
d3ae9f6
remove unnecessary env vars from load_to_s3
rwidom Dec 21, 2022
4f01d3b
declarative doc string for file update check
rwidom Dec 21, 2022
cb01384
update iNaturalist description
rwidom Dec 24, 2022
8f057c8
remove message to Staci :)
rwidom Dec 24, 2022
f562556
use dynamically generated load subtasks
rwidom Dec 25, 2022
2ac1562
clarify taxa comments and include languages
rwidom Dec 25, 2022
c7af0f9
consolidate consolidation code
rwidom Dec 26, 2022
2f03c71
add testing for consolidated metrics
rwidom Dec 26, 2022
40becc6
separate ti_mock instances per test
rwidom Dec 30, 2022
4c0b036
test get batches
rwidom Dec 30, 2022
a1fac6f
shorter titles to save space
rwidom Jan 2, 2023
6ff4a14
add better testing instructions
rwidom Jan 4, 2023
f975c81
dag parameter to manage post-ingestion deletions
rwidom Jan 5, 2023
cc6ba28
Merge branch 'main' into feature/inaturalist-performance
AetherUnbound Jan 11, 2023
893c48b
Add kwargs to get_response_json call
AetherUnbound Jan 11, 2023
d354bc9
get_media_type can be static method
rwidom Jan 13, 2023
fc60f53
link to original inaturalist photo, rather than medium
rwidom Jan 13, 2023
9f71a6c
prefer creator name over login
rwidom Jan 13, 2023
8c6b56f
remove unused constants
rwidom Jan 13, 2023
2b52c06
add to do for extension cleanup
rwidom Jan 13, 2023
3 changes: 3 additions & 0 deletions docker-compose.override.yml
@@ -50,6 +50,9 @@ services:
image: minio/mc:latest
env_file:
- .env
environment:
MINIO_ROOT_USER: ${AWS_ACCESS_KEY}
MINIO_ROOT_PASSWORD: ${AWS_SECRET_KEY}
depends_on:
- s3
volumes:
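This mapping assumes AWS_ACCESS_KEY and AWS_SECRET_KEY are already defined in the shared .env file; a minimal sketch with placeholder values (only the variable names come from the diff above):

# .env (placeholder values for local testing)
AWS_ACCESS_KEY=test_key
AWS_SECRET_KEY=test_secret
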
19 changes: 18 additions & 1 deletion openverse_catalog/dags/common/loader/sql.py
@@ -71,7 +71,7 @@ def create_loading_table(
columns_definition = f"{create_column_definitions(loading_table_columns)}"
table_creation_query = dedent(
f"""
-    CREATE TABLE public.{load_table}(
+    CREATE UNLOGGED TABLE public.{load_table}(
Contributor: Verrrrrry cool 😮

Collaborator (author): Yay!!! Right???

Contributor: TIL! Would this ever be an issue in the future if we use log-based replication? Is that something the catalogue would ever need? Maybe not something we need to worry about if we think we might move towards parquet or some other data storage than a relational DB?

Contributor: According to the documentation, the biggest downsides to an unlogged table are that they 1) are not crash resistant and will be truncated on an unclean shutdown and 2) are not replicated. Since this is a transient table (and we don't do replication anyway), we should be able to recover it if postgres shuts down by re-running this task. I don't think there's any additional concern about having this one be unlogged, even if we don't move to some other data storage.

Collaborator (author): Yeah, great question, @sarayourfriend, I was kind of concerned at first when I saw what a big difference it made, but then... well, what @AetherUnbound said. :)

{columns_definition});
"""
)
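For reference, nothing locks the table into this state; a minimal SQL sketch, with an illustrative table name, for checking a table's persistence and converting it back to logged (available since PostgreSQL 9.5) should replication ever become a requirement:

-- Sketch only, not part of this PR; the table name is illustrative.
-- relpersistence is 'u' for unlogged and 'p' for permanent (logged) tables.
SELECT relname, relpersistence
FROM pg_class
WHERE relname = 'provider_data_image';

-- Rewrites the table and re-enables WAL logging (and thus replication).
ALTER TABLE public.provider_data_image SET LOGGED;
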
@@ -128,6 +128,23 @@ def load_local_data_to_intermediate_table(
_clean_intermediate_table_data(postgres, load_table)


def clean_transformed_provider_s3_data(
    postgres_conn_id, identifier, media_type="image"
):
    """
    Apply standard cleaning where data has been loaded from provider files on S3
    and transformed in SQL.

    TO DO: if this process is ever used for a provider other than inaturalist,
    with media types other than image, consider adapting `_extract_media_type`.
    """
    postgres = PostgresHook(postgres_conn_id)
    load_table = _get_load_table_name(identifier, media_type=media_type)
    missing_columns, foreign_id_dup = _clean_intermediate_table_data(
        postgres, load_table
    )
    # TO DO: return missing_columns and foreign_id_dup to XCOMs and reporting


def _handle_s3_load_result(cursor) -> int:
"""
Handle the results of the aws_s3.table_import_from_s3 function. Locally this will
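The remaining TO DO in clean_transformed_provider_s3_data could eventually look something like this hypothetical sketch; the ti argument and the XCom keys are assumptions, not part of this PR:

# Hypothetical sketch only (argument and key names are assumptions).
from airflow.providers.postgres.hooks.postgres import PostgresHook


def clean_transformed_provider_s3_data(
    postgres_conn_id, identifier, media_type="image", ti=None
):
    postgres = PostgresHook(postgres_conn_id)
    load_table = _get_load_table_name(identifier, media_type=media_type)
    missing_columns, foreign_id_dup = _clean_intermediate_table_data(
        postgres, load_table
    )
    if ti is not None:
        # Push cleaning counts so a downstream reporting task can read them.
        ti.xcom_push(key="missing_columns", value=missing_columns)
        ti.xcom_push(key="duplicate_foreign_ids", value=foreign_id_dup)
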
300 changes: 251 additions & 49 deletions openverse_catalog/dags/providers/provider_api_scripts/inaturalist.py

Large diffs are not rendered by default.

@@ -2,3 +2,167 @@ CREATE SCHEMA IF NOT EXISTS inaturalist;
COMMIT;
SELECT schema_name
FROM information_schema.schemata WHERE schema_name = 'inaturalist';

/*
LICENSE LOOKUP
Every Creative Commons license on iNaturalist is at version 4.0, except CC0,
which is version 1.0. License versions below are hard-coded from the
inaturalist source:
https://github.com/inaturalist/inaturalist/blob/d338ba76d82af83d8ad0107563015364a101568c/app/models/shared/license_module.rb#L5
*/

DROP TABLE IF EXISTS inaturalist.license_codes;
COMMIT;

/*
_enrich_metadata calls for both license_url and raw_license_url, but there is no
raw license_url here; it's all calculated.
https://github.com/WordPress/openverse-catalog/blob/337ea7aede228609cbd5031e3a501f22b6ccc482/openverse_catalog/dags/common/storage/media.py#L247
*/
CREATE TABLE inaturalist.license_codes (
inaturalist_code varchar(50),
license_name varchar(255),
license_url_metadata jsonb,
openverse_code varchar(50),
license_version varchar(25)
);
COMMIT;

INSERT INTO inaturalist.license_codes
(inaturalist_code, license_name, license_url_metadata, openverse_code, license_version)
VALUES
('CC-BY-NC-SA', 'Creative Commons Attribution-NonCommercial-ShareAlike License', jsonb_build_object('license_url', 'http://creativecommons.org/licenses/by-nc-sa/4.0/'), 'by-nc-sa', '4.0'),
('CC-BY-NC', 'Creative Commons Attribution-NonCommercial License', jsonb_build_object('license_url', 'http://creativecommons.org/licenses/by-nc/4.0/'), 'by-nc', '4.0'),
('CC-BY-NC-ND', 'Creative Commons Attribution-NonCommercial-NoDerivs License', jsonb_build_object('license_url', 'http://creativecommons.org/licenses/by-nc-nd/4.0/'), 'by-nc-nd', '4.0'),
('CC-BY', 'Creative Commons Attribution License', jsonb_build_object('license_url', 'http://creativecommons.org/licenses/by/4.0/'), 'by', '4.0'),
('CC-BY-SA', 'Creative Commons Attribution-ShareAlike License', jsonb_build_object('license_url', 'http://creativecommons.org/licenses/by-sa/4.0/'), 'by-sa', '4.0'),
('CC-BY-ND', 'Creative Commons Attribution-NoDerivs License', jsonb_build_object('license_url', 'http://creativecommons.org/licenses/by-nd/4.0/'), 'by-nd', '4.0'),
('PD', 'Public domain', jsonb_build_object('license_url', 'http://en.wikipedia.org/wiki/Public_domain'), 'pdm', ''),
('GFDL', 'GNU Free Documentation License', jsonb_build_object('license_url', 'http://www.gnu.org/copyleft/fdl.html'), 'gfdl', ''),
('CC0', 'Creative Commons CC0 Universal Public Domain Dedication', jsonb_build_object('license_url', 'http://creativecommons.org/publicdomain/zero/1.0/'), 'cc0', '1.0');
COMMIT;
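Presumably this lookup is joined to the photo records on the raw iNaturalist license string; a hypothetical sketch (photos.license comes from the iNaturalist open data DDL, but the join itself is illustrative):

-- Illustrative only: resolve each photo's raw license string to the
-- Openverse code, version, and license_url metadata.
SELECT p.photo_id, lc.openverse_code, lc.license_version, lc.license_url_metadata
FROM inaturalist.photos p
JOIN inaturalist.license_codes lc ON lc.inaturalist_code = p.license;
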

/*
SPECIES NAMES
The Catalog of Life (COL) has data on vernacular names which we use to optimize titles
and tags based on iNaturalist taxon information. But there are a few very common
taxon_ids that do not have matches in the COL, so I am hard-coding them here.

Another option would be the Integrated Taxonomic Information System
https://www.itis.gov/dwca_format.html which also has vernacular names / synonyms.
*/

DROP TABLE IF EXISTS inaturalist.col_vernacular;
COMMIT;

CREATE TABLE inaturalist.col_vernacular (
taxonID varchar(5),
sourceID decimal,
taxon_name varchar(2000),
transliteration text,
name_language varchar(3),
country varchar(3),
area varchar(2000),
sex decimal,
referenceID decimal
);
COMMIT;

DROP TABLE IF EXISTS inaturalist.col_name_usage;
COMMIT;

CREATE TABLE inaturalist.col_name_usage (
ID varchar(50),
alternativeID decimal,
nameAlternativeID decimal,
sourceID decimal,
parentID varchar(5),
basionymID varchar(5),
status varchar(22),
scientificName varchar(76),
authorship varchar(255),
rank varchar(21),
notho varchar(13),
uninomial varchar(50),
genericName varchar(50),
infragenericEpithet varchar(25),
specificEpithet varchar(50),
infraspecificEpithet varchar(50),
cultivarEpithet varchar(50),
namePhrase varchar(80),
nameReferenceID varchar(36),
publishedInYear decimal,
publishedInPage varchar(255),
publishedInPageLink varchar(255),
code varchar(10),
nameStatus varchar(15),
accordingToID varchar(36),
accordingToPage decimal,
accordingToPageLink decimal,
referenceID text,
scrutinizer varchar(149),
scrutinizerID decimal,
scrutinizerDate varchar(10),
extinct boolean,
temporalRangeStart varchar(15),
temporalRangeEnd varchar(15),
environment varchar(38),
species decimal,
section decimal,
subgenus decimal,
genus decimal,
subtribe decimal,
tribe decimal,
subfamily decimal,
taxon_family decimal,
superfamily decimal,
suborder decimal,
taxon_order decimal,
subclass decimal,
taxon_class decimal,
subphylum decimal,
phylum decimal,
kingdom decimal,
sequenceIndex decimal,
branchLength decimal,
link varchar(240),
nameRemarks decimal,
remarks text
);
COMMIT;
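The column lists above mirror the Catalog of Life ColDP exports (VernacularName and NameUsage). The load step is not shown in this hunk; a hypothetical sketch, assuming the TSVs were staged locally, might look like:

-- Hypothetical load step; file paths and staging are assumptions, not in this diff.
COPY inaturalist.col_vernacular
FROM '/tmp/col/VernacularName.tsv'
WITH (FORMAT csv, DELIMITER E'\t', HEADER true);

COPY inaturalist.col_name_usage
FROM '/tmp/col/NameUsage.tsv'
WITH (FORMAT csv, DELIMITER E'\t', HEADER true);
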

DROP TABLE IF EXISTS inaturalist.manual_name_additions;
COMMIT;

CREATE TABLE inaturalist.manual_name_additions (
md5_scientificname uuid,
vernacular_name varchar(100)
);
INSERT INTO inaturalist.manual_name_additions
    (md5_scientificname, vernacular_name)
VALUES
    (cast(md5('Animalia') as uuid), 'Animals'),
    (cast(md5('Araneae') as uuid), 'Spider'),
    (cast(md5('Magnoliopsida') as uuid), 'Flowers'),
    (cast(md5('Plantae') as uuid), 'Plants'),
    (cast(md5('Lepidoptera') as uuid), 'Butterflies and Moths'),
    (cast(md5('Insecta') as uuid), 'Insect'),
    (cast(md5('Agaricales') as uuid), 'Mushroom'),
    (cast(md5('Poaceae') as uuid), 'Grass'),
    (cast(md5('Asteraceae') as uuid), 'Daisy'),
    (cast(md5('Danaus plexippus') as uuid), 'Monarch Butterfly');
COMMIT;
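The uuid-typed md5 column only works if the COL side is hashed the same way; a hypothetical sketch of how the manual additions could back-fill names that are missing from COL (the join keys are assumptions):

-- Hypothetical sketch: prefer the COL vernacular name and fall back to the
-- manual additions, matching on an md5 hash of the scientific name.
SELECT
    nu.scientificName,
    coalesce(v.taxon_name, m.vernacular_name) AS vernacular_name
FROM inaturalist.col_name_usage nu
LEFT JOIN inaturalist.col_vernacular v
    ON v.taxonID = nu.ID
LEFT JOIN inaturalist.manual_name_additions m
    ON m.md5_scientificname = cast(md5(nu.scientificName) as uuid);
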

select distinct table_schema
from information_schema.tables
where table_schema='inaturalist';

This file was deleted.

@@ -6,8 +6,6 @@ PHOTOS
constraint on observer_id in order to save load time
-- photo_id is not unique. There are ~130,000 duplicate photo_ids (~0.1% of photos).
Both records are saved to the TSV and only one is loaded back into to postgres.
-   -- TO DO: See https://github.com/WordPress/openverse-catalog/issues/685 for more on
-   handling duplicate photo ids.

Taking DDL from
https://github.com/inaturalist/inaturalist-open-data/blob/main/Metadata/structure.sql
@@ -41,4 +39,14 @@ SELECT aws_s3.table_import_from_s3('inaturalist.photos',
-- more here: https://www.postgresql.org/docs/current/indexes-ordering.html
CREATE INDEX ON INATURALIST.PHOTOS USING btree (PHOTO_ID);

DROP TABLE IF EXISTS inaturalist.photo_dupes;
CREATE TABLE inaturalist.photo_dupes as (
SELECT PHOTO_ID, count(*) PHOTO_RECORDS
FROM INATURALIST.PHOTOS
GROUP BY PHOTO_ID
HAVING COUNT(*)>1
);
ALTER TABLE inaturalist.photo_dupes ADD PRIMARY KEY (PHOTO_ID);
COMMIT;
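Since only one record per duplicated photo_id is loaded back into postgres (per the note above), a hypothetical dedup sketch; photo_uuid is assumed from the iNaturalist open data DDL and the tie-breaking order is arbitrary:

-- Hypothetical sketch: keep a single record per photo_id.
SELECT *
FROM (
    SELECT p.*,
        row_number() OVER (PARTITION BY p.photo_id ORDER BY p.photo_uuid) AS rn
    FROM inaturalist.photos p
) ranked
WHERE rn = 1;
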

SELECT count(*) FROM inaturalist.photos;