
Filter out duplicates from raw_tags in the catalog #3927

Closed
krysal wants to merge 12 commits into main from fix/duplicated_tags

Conversation

krysal
Member

@krysal krysal commented Mar 14, 2024

Fixes

Fixes #3926 by @krysal

Description

This PR mainly aims to convert the raw_tags field to a set to ensure we're not letting any duplicates through. The number of files affected might be daunting, but the commit history can help track the changes.
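
A minimal sketch of the idea, assuming a provider's get_record_data builds the record dictionary (the names here are illustrative, not the actual Openverse code):

    def get_record_data(data: dict) -> dict | None:
        # Building raw_tags as a set discards duplicates at the source,
        # before the record ever reaches the MediaStore.
        raw_tags = {tag["name"] for tag in data.get("tags", [])}
        return {
            "foreign_identifier": data["id"],
            "raw_tags": raw_tags,
        }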

This PR also:

  • renames variables for denied tags
  • adds terms to denied tags lists
  • homogenizes the signatures of the add_item method of the ImageStore and AudioStore classes

Something to consider while validating tags is sorting them right before enriching them. This could reportedly be beneficial for "consistency between runs, saving on inserts into the DB later." The Flickr DAG was doing this in addition to filtering out duplicates.
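
A small sketch of that sorting step, assuming raw_tags is already a deduplicated set of strings (the enriched shape follows the {"name": ..., "provider": ...} format discussed later in this thread):

    # Sorting the deduplicated tags gives a deterministic order across runs,
    # which can save on database inserts later.
    raw_tags = {"sky", "bird", "nature"}
    tags = [{"name": tag, "provider": "flickr"} for tag in sorted(raw_tags)]
    # -> [{'name': 'bird', ...}, {'name': 'nature', ...}, {'name': 'sky', ...}]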


I also identified that the following provider DAGs aren't collecting any tags for media:

  • Auckland Museum
  • Brooklyn Museum
  • Cleveland Museum
  • Europeana
  • iNaturalist
  • Museum Victoria
  • Phylopic
  • Science Museum
  • SMK
  • Wikimedia

Testing Instructions

Run any DAGs you'd like to test and confirm the tags are saved without duplicates; run at least one audio and one image provider. Setting the AIRFLOW_VAR_INGESTION_LIMIT variable to a low number, e.g. 100, is suggested for quick checks.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@krysal krysal requested a review from a team as a code owner March 14, 2024 23:29
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 🧰 goal: internal improvement Improvement that benefits maintainers, not users 💻 aspect: code Concerns the software code in the repository 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Mar 14, 2024
@github-actions github-actions bot added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Mar 14, 2024
@krysal krysal changed the title Rename variables for denied tags Filter out duplicates from raw_tagsin the catalog Mar 14, 2024
@krysal krysal removed the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Mar 14, 2024
@krysal krysal changed the title Filter out duplicates from raw_tagsin the catalog Filter out duplicates from raw_tags in the catalog Mar 14, 2024
@krysal krysal marked this pull request as draft March 15, 2024 00:41
@krysal krysal marked this pull request as ready for review March 15, 2024 01:41
Contributor

@obulat obulat left a comment


When the code to clean up was originally written, it was used to clean up the old TSVs as well as the code from provider scripts. This is why there are checks to see if the tags are a dict or a list of strings. Now, we only use this with the MediaStore, so I think we assume that the tags come as a list of strings.

With this assumption, we can simplify the tag enrichment and formatting by removing the checks for the shape of the tags (I added code suggestions inline where appropriate).

        return [  # opening line reconstructed; the diff context began mid-expression
            self._format_raw_tag(tag)
            for tag in raw_tags
            if not self._tag_denylisted(tag)
        ]

    def _format_raw_tag(self, tag):

Suggested change

-    def _format_raw_tag(self, tag):
+    def _format_raw_tag(self, tag):
+        return {"name": tag, "provider": self.provider}


We can safely remove the isinstance checks since the provider scripts always provide a list of strings.
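
For illustration, a sketch of what the simplified enrichment could look like once the shape checks are gone (my reconstruction based on the snippets in this review, not the exact repository code):

    def _enrich_tags(self, raw_tags) -> list | None:
        # Assumes provider scripts now always supply an iterable of strings.
        if not raw_tags:
            return None
        return [
            self._format_raw_tag(tag)
            for tag in raw_tags
            if not self._tag_denylisted(tag)
        ]

    def _format_raw_tag(self, tag):
        return {"name": tag, "provider": self.provider}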

@AetherUnbound AetherUnbound linked an issue Mar 18, 2024 that may be closed by this pull request
@obulat obulat marked this pull request as draft March 20, 2024 04:46
@obulat
Contributor

obulat commented Mar 20, 2024

@krysal, I converted this PR to draft to prevent more pings.

Contributor

@AetherUnbound AetherUnbound left a comment


I think that changing the type of raw_tags to a set and all the downstream changes necessary in the provider scripts to support that is not the appropriate approach here. It brings on a greater maintenance burden for each of the individual scripts when it's something we can centralize the logic for. Additionally, I worry that the error which is raised if a contributor adds a provider DAG which returns a list instead of a set may be more opaque and may make contribution harder as a result. I changed the NYPL DAG to return a list of tags instead of a set as an example, and this was the error it raised when testing locally:

[2024-03-20, 22:33:28 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: Expected set, got <class 'list'>: ['Branch libraries', 'Libraries'].
query_params: {"q": "CC_0", "field": "use_rtxt_s", "page": 1, "per_page": 250}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 200, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 217, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/catalog/dags/providers/factory_utils.py", line 55, in pull_media_wrapper
    data = ingester.ingest_records()
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 276, in ingest_records
    raise error from ingestion_error
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 241, in ingest_records
    self.record_count += self.process_batch(batch)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 473, in process_batch
    store.add_item(**record)
  File "/opt/airflow/catalog/dags/common/storage/image.py", line 146, in add_item
    image = self._get_image(**image_data)
  File "/opt/airflow/catalog/dags/common/storage/image.py", line 153, in _get_image
    image_metadata = self.clean_media_metadata(**kwargs)
  File "/opt/airflow/catalog/dags/common/storage/media.py", line 157, in clean_media_metadata
    media_data["tags"] = self._enrich_tags(media_data.pop("raw_tags", None))
  File "/opt/airflow/catalog/dags/common/storage/media.py", line 299, in _enrich_tags
    raise TypeError(f"Expected set, got {type(raw_tags)}: {raw_tags}.")
TypeError: Expected set, got <class 'list'>: ['Branch libraries', 'Libraries'].

Given that we can just perform the de-duplication on the MediaStore class and prevent this from occurring entirely, I'd rather go with that approach.
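
For comparison, a sketch of the centralized approach being argued for here (hypothetical code; the real method may differ):

    def _enrich_tags(self, raw_tags) -> list | None:
        if not raw_tags:
            return None
        # dict.fromkeys deduplicates while preserving first-seen order,
        # so providers can keep passing plain lists of strings.
        return [
            {"name": tag, "provider": self.provider}
            for tag in dict.fromkeys(raw_tags)
            if not self._tag_denylisted(tag)
        ]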

@krysal
Member Author

krysal commented Mar 28, 2024

I dispute that "it brings on a greater maintenance burden for each of the individual scripts"; instead, each script is changed to use the correct data type. Keeping two possible types around in fact creates a greater maintenance burden down the line.

Additionally, I worry that the error which is raised if a contributor adds a provider DAG which returns a list instead of a set may be more opaque and may make contribution harder as a result. I changed the NYPL DAG to return a list of tags instead of a set as an example, and this was the error it raised when testing locally: ...

This is precisely what I'm proposing we avoid: using a list for raw_tags. Why would you want to continue doing that? @AetherUnbound, you're one of the people who usually proposes using set over list whenever possible, partly for performance reasons. On the other hand, that error is pretty clear to me. How could it be more explicit? It mentions the problematic field and what it is expected to be. It seems to me that we are underestimating the contributors in this regard. Sets are not a complex data structure or something that comes from an obscure, esoteric third-party library.

If you see anything else that could be problematic with the use of set here, I'd like to know and discuss it; for now, I'd rather apply the changes already made here.

@krysal krysal marked this pull request as ready for review March 28, 2024 15:29
@zackkrida
Member

@krysal and @AetherUnbound, I think this might be a good point in the conversation to solicit advice from @obulat and @stacimc on this particular issue concerning lists and sets. Since there is a clear difference of opinion here it might be useful to see where the majority opinion is held and go from there.

@obulat
Contributor

obulat commented Mar 29, 2024

If I were writing this PR, I would not change the provider scripts and would convert the lists in raw_tags to sets inside the MediaStore instead. This would mean fewer lines of code changed in this PR. However, it's just a preference, and using sets in provider scripts is okay with me.

Additionally, I worry that the error which is raised if a contributor adds a provider DAG which returns a list instead of a set may be more opaque and may make contribution harder as a result. I changed the NYPL DAG to return a list of tags instead of a set as an example, and this was the error it raised when testing locally

I think this error would be caught during the first run of the new (or updated) DAG, and is pretty easy to fix. We can catch it during the code review. A better solution would be to use a dataclass for the records here:

if not (record_data := self.get_record_data(data)):
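
A rough sketch of what that dataclass idea could look like (entirely hypothetical field names, included only to illustrate the suggestion):

    from dataclasses import dataclass, field

    @dataclass
    class RecordData:
        foreign_identifier: str
        foreign_landing_url: str
        raw_tags: set[str] = field(default_factory=set)

        def __post_init__(self):
            # Coerce whatever iterable the provider returns into a set, so
            # duplicates never survive construction and the error surface
            # for contributors disappears.
            self.raw_tags = set(self.raw_tags)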

@stacimc
Contributor

stacimc commented Mar 29, 2024

it's the constraint that we want no duplicates in tags.

We could require the providers to do all the cleanup that we do in the MediaStore by the same logic, and then just raise errors in the MediaStore if it was done incorrectly. That isn't a bad thing; I just don't see a particular benefit when it's easy and convenient to apply the constraints in a centralized place. I would feel differently if there were any ambiguity in how the constraint should be applied.

I don't think enforcing this in the provider scripts is a huge hurdle to developing them, but it is a hurdle without significant benefit, IMO, and also a departure from how we use the MediaStore. If we were going to do more of this down the road, Olga's suggestion of using a dataclass sounds promising to me, but I think that would require a separate issue and discussion. This discussion has raised some interesting points!

@krysal
Member Author

krysal commented Mar 29, 2024

We could require the providers to do all the cleanup that we do in the MediaStore by the same logic...

I'm not advocating for that change, so I see no point in discussing it. I agree it could be interesting as a separate issue if you all wish.

Understandably, we have different ways of approaching the problem. It is part of what it means to be a diverse team with different perspectives. It's unfortunate, though, that we disagree on the relationship between benefits and inconveniences, but we can still move forward.

Can I get some approval for this PR if we see no risk involved? It's beyond me why this has been delayed just for personal preferences until now.

Contributor

@obulat obulat left a comment


The code changes look good.

My main requested change is the update to the _format_raw_tag function.

Since we are updating the shape of tags in the provider scripts, we should also update _format_raw_tag to stop handling tags that are not a set of strings (I added a comment on the isinstance checks inline).

catalog/dags/providers/provider_api_scripts/nypl.py (outdated, resolved)
catalog/dags/providers/provider_api_scripts/nypl.py (outdated, resolved)
catalog/dags/providers/provider_api_scripts/rawpixel.py (outdated, resolved)
catalog/dags/providers/provider_api_scripts/stocksnap.py (outdated, resolved)
license_version="4.0",
license_url="https://license/url",
),
license_info=BY_LICENSE_INFO,

Nice :)

Contributor

@AetherUnbound AetherUnbound left a comment


Understandably, we have different ways of approaching the problem. It is part of what it means to be a diverse team with different perspectives. It's unfortunate, though, that we disagree on the relationship between benefits and inconveniences, but we can still move forward.

Can I get some approval for this PR if we see no risk involved? It's beyond me why this has been delayed just for personal preferences until now.

To be clear, my rationale for disagreeing with this approach is not one of personal preference. When we initially inherited the project, each provider script implemented nearly every step necessary for normalizing the data and preparing it for insertion into the database. Since then, we've taken several steps to centralize those efforts rather than have all that logic live in the providers themselves, namely with the MediaStore and ProviderDataIngester classes. These efforts sought to move more of the shared normalization and functionality from the individual provider scripts into a common location. Doing this traded some simplicity in each script for more abstraction, but it gave us an easier way to modify the behavior of all provider scripts without having to touch each one individually.

The approach set out in this PR would be a step backwards from the ways we've worked to centralize the data normalization in the past. In the same way that we're requiring types for the tags to be enforced on each provider script, we could require that each provider script provide valid and fully qualified URLs as well. Instead, we've aggregated that logic in a single place within the MediaStore class so that it does not need to be present in each provider. Regardless of the accuracy of the type at the provider level, deduplicating tags seems like exactly the sort of normalization logic that should also exist at the MediaStore level along with all the other normalization steps we're taking there. I do not think that enforcing types at the provider level is a strong enough reason to motivate enforcing an individual validation everywhere when it could be done in one place for all providers.

Perhaps I have the wrong impression or there are other elements of this I'm not considering. Do you mind sharing why you feel this validation differs significantly enough from the other centralized validation we're doing that it warrants being on every provider script?

@krysal
Member Author

krysal commented Apr 1, 2024

@AetherUnbound I understand better where your reasoning comes from; thank you for explaining it. It seems we differ on whether this change makes the provider scripts more complex. I find it simpler to use the correct data type, one that does the work we want (deduplication) at instantiation and even has performance benefits, than to check for lists and do type conversions later in the media store. URL validations are, of course, more complex; they cannot easily be solved with a change of type, which is why there is a dedicated module for them. But I truly don't see the harm in changing to set when we could also save the CPU cost of conversions and maybe let people see this data type in action. The tags, a collection of non-duplicate strings (which is also unlikely to change), are the perfect use for it.

I'm laser-focused on the case we have here, but I get the general preference to concentrate all the validation in the media stores. Before implementing something like this in the future, I will consult with the team about possible options.
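
The dedupe-at-instantiation behavior being described is, concretely (a plain-Python illustration, not project code):

    # Duplicates are discarded the moment the set is built, and membership
    # checks are O(1) on average versus O(n) for a list.
    raw_tags = set(["cat", "cat", "animal"])
    assert raw_tags == {"animal", "cat"}
    assert "cat" in raw_tags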

@openverse-bot
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@stacimc
@obulat
@AetherUnbound
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Contributor

@stacimc stacimc left a comment


We could require the providers to do all the cleanup that we do in the MediaStore by the same logic...

I'm not advocating for that change, so I see no point in discussing it. I agree it could be interesting as a separate issue if you all wish.

I wasn't suggesting you were advocating for this, @krysal, and for what it's worth neither am I (I would even be opposed to that idea). I was trying to illustrate that what we're doing in this PR goes against the existing convention that we have for all of the other cleanup in the MediaStore (as @AetherUnbound expanded upon in her later comment).

Can I get some approval for this PR if we see no risk involved? It's beyond me why this has been delayed just for personal preferences until now.

This was delayed primarily because there was an explicit ask earlier in the thread to get feedback for a majority opinion. I don't think the views expressed were simple style preferences or anything of that nature.

Reiterating my take for example: I do not think we should add (even minor) complexity to the provider scripts and break with our established conventions of normalizing in the MediaStore without a clear and significant benefit. I do not think deduplicating tags in the provider scripts has any meaningful performance improvement over doing it in the MediaStore, and I don't see particular benefit to provider script authors being made to think about tag duplication specifically, when that is so trivial to normalize at a higher level. So while I do not think the approach in this PR breaks anything, I disagree with the approach.

Given that the group consensus that was asked for is in favor of deduplicating in the MediaStore, can you please change the implementation to match the established patterns?

@krysal
Member Author

krysal commented Apr 3, 2024

I'm sorry we don't agree on how we view this, but I commit to the group consensus. It will be easier to make the changes in a new PR, so I'm closing this one.

@krysal krysal closed this Apr 3, 2024
@AetherUnbound AetherUnbound deleted the fix/duplicated_tags branch April 16, 2024 20:07
Successfully merging this pull request may close these issues.

  • Rename TAG_BLACKLIST to TAG_EXCLUDELIST and add more tags
  • Update raw_tags to avoid duplicates in the catalog