
Filter out duplicates from raw_tags in the catalog #3927

Closed
krysal wants to merge 12 commits into main from fix/duplicated_tags

Conversation

krysal
Member

@krysal krysal commented Mar 14, 2024

Fixes

Fixes #3926 by @krysal

Description

This PR mainly aims to convert the raw_tags field to a set to ensure we're not letting any duplicates through. The number of files affected might be daunting, but the commit history can help track the changes.
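
A minimal sketch of the idea, assuming a provider's get_record_data builds the record dictionary (the names here are illustrative, not the actual Openverse code):

    def get_record_data(data: dict) -> dict | None:
        # Building raw_tags as a set discards duplicates at the source,
        # before the record ever reaches the MediaStore.
        raw_tags = {tag["name"] for tag in data.get("tags", [])}
        return {
            "foreign_identifier": data["id"],
            "raw_tags": raw_tags,
        }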

This PR also:

  • renames variables for denied tags
  • adds terms to denied tags lists
  • homogenizes the signatures of the add_item method of the ImageStore and AudioStore classes

Something to consider while validating tags is sorting them right before enriching them. This could reportedly be beneficial for "consistency between runs, saving on inserts into the DB later." The Flickr DAG was doing this in addition to filtering out duplicates.
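
A small sketch of that sorting step, assuming raw_tags is already a deduplicated set of strings (the enriched shape follows the {"name": ..., "provider": ...} format discussed later in this thread):

    # Sorting the deduplicated tags gives a deterministic order across runs,
    # which can save on database inserts later.
    raw_tags = {"sky", "bird", "nature"}
    tags = [{"name": tag, "provider": "flickr"} for tag in sorted(raw_tags)]
    # -> [{'name': 'bird', ...}, {'name': 'nature', ...}, {'name': 'sky', ...}]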


I also identified that the following provider DAGs aren't collecting any tags for media:

  • Auckland Museum
  • Brooklyn Museum
  • Cleveland Museum
  • Europeana
  • iNaturalist
  • Museum Victoria
  • Phylopic
  • Science Museum
  • SMK
  • Wikimedia

Testing Instructions

Run any DAGs you'd like to test and confirm the tags are saved without duplicates; run at least one audio and one image provider. Setting the AIRFLOW_VAR_INGESTION_LIMIT variable to a low number, e.g. 100, is suggested for quick checks.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@krysal krysal requested a review from a team as a code owner March 14, 2024 23:29
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 🧰 goal: internal improvement Improvement that benefits maintainers, not users 💻 aspect: code Concerns the software code in the repository 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Mar 14, 2024
@github-actions github-actions bot added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Mar 14, 2024
@krysal krysal changed the title Rename variables for denied tags Filter out duplicates from raw_tagsin the catalog Mar 14, 2024
@krysal krysal removed the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Mar 14, 2024
@krysal krysal changed the title Filter out duplicates from raw_tagsin the catalog Filter out duplicates from raw_tags in the catalog Mar 14, 2024
@krysal krysal marked this pull request as draft March 15, 2024 00:41
@krysal krysal marked this pull request as ready for review March 15, 2024 01:41
Contributor

@obulat obulat left a comment


When the code to clean up was originally written, it was used to clean up the old TSVs as well as the code from provider scripts. This is why there are checks to see if the tags are a dict or a list of strings. Now, we only use this with the MediaStore, so I think we assume that the tags come as a list of strings.

With this assumption, we can simplify the tag enrichment and formatting by removing the checks for the shape of the tags (I added code suggestions inline where appropriate).

        return [  # opening line reconstructed; the diff context began mid-expression
            self._format_raw_tag(tag)
            for tag in raw_tags
            if not self._tag_denylisted(tag)
        ]

    def _format_raw_tag(self, tag):

Suggested change

-    def _format_raw_tag(self, tag):
+    def _format_raw_tag(self, tag):
+        return {"name": tag, "provider": self.provider}


We can safely remove the isinstance checks since the provider scripts always provide a list of strings.
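
For illustration, a sketch of what the simplified enrichment could look like once the shape checks are gone (my reconstruction based on the snippets in this review, not the exact repository code):

    def _enrich_tags(self, raw_tags) -> list | None:
        # Assumes provider scripts now always supply an iterable of strings.
        if not raw_tags:
            return None
        return [
            self._format_raw_tag(tag)
            for tag in raw_tags
            if not self._tag_denylisted(tag)
        ]

    def _format_raw_tag(self, tag):
        return {"name": tag, "provider": self.provider}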

@AetherUnbound AetherUnbound linked an issue Mar 18, 2024 that may be closed by this pull request
@obulat obulat marked this pull request as draft March 20, 2024 04:46
@obulat
Contributor

obulat commented Mar 20, 2024

@krysal, I converted this PR to draft to prevent more pings.

Contributor

@AetherUnbound AetherUnbound left a comment


I think that changing the type of raw_tags to a set and all the downstream changes necessary in the provider scripts to support that is not the appropriate approach here. It brings on a greater maintenance burden for each of the individual scripts when it's something we can centralize the logic for. Additionally, I worry that the error which is raised if a contributor adds a provider DAG which returns a list instead of a set may be more opaque and may make contribution harder as a result. I changed the NYPL DAG to return a list of tags instead of a set as an example, and this was the error it raised when testing locally:

[2024-03-20, 22:33:28 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: Expected set, got <class 'list'>: ['Branch libraries', 'Libraries'].
query_params: {"q": "CC_0", "field": "use_rtxt_s", "page": 1, "per_page": 250}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 200, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 217, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/catalog/dags/providers/factory_utils.py", line 55, in pull_media_wrapper
    data = ingester.ingest_records()
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 276, in ingest_records
    raise error from ingestion_error
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 241, in ingest_records
    self.record_count += self.process_batch(batch)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 473, in process_batch
    store.add_item(**record)
  File "/opt/airflow/catalog/dags/common/storage/image.py", line 146, in add_item
    image = self._get_image(**image_data)
  File "/opt/airflow/catalog/dags/common/storage/image.py", line 153, in _get_image
    image_metadata = self.clean_media_metadata(**kwargs)
  File "/opt/airflow/catalog/dags/common/storage/media.py", line 157, in clean_media_metadata
    media_data["tags"] = self._enrich_tags(media_data.pop("raw_tags", None))
  File "/opt/airflow/catalog/dags/common/storage/media.py", line 299, in _enrich_tags
    raise TypeError(f"Expected set, got {type(raw_tags)}: {raw_tags}.")
TypeError: Expected set, got <class 'list'>: ['Branch libraries', 'Libraries'].

Given that we can just perform the de-duplication on the MediaStore class and prevent this from occurring entirely, I'd rather go with that approach.
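
For comparison, a sketch of the centralized approach being argued for here (hypothetical code; the real method may differ):

    def _enrich_tags(self, raw_tags) -> list | None:
        if not raw_tags:
            return None
        # dict.fromkeys deduplicates while preserving first-seen order,
        # so providers can keep passing plain lists of strings.
        return [
            {"name": tag, "provider": self.provider}
            for tag in dict.fromkeys(raw_tags)
            if not self._tag_denylisted(tag)
        ]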

@krysal
Member Author

krysal commented Mar 28, 2024

I dispute that "it brings on a greater maintenance burden for each of the individual scripts"; instead, each script is changed to use the correct data type. Keeping two possible types around in fact creates a greater maintenance burden down the line.

Additionally, I worry that the error which is raised if a contributor adds a provider DAG which returns a list instead of a set may be more opaque and may make contribution harder as a result. I changed the NYPL DAG to return a list of tags instead of a set as an example, and this was the error it raised when testing locally: ...

This is precisely what I'm proposing we avoid: using a list for raw_tags. Why would you want to continue doing that? @AetherUnbound, you're one of the people who usually proposes using set over list whenever possible, partly for performance reasons. On the other hand, that error is pretty clear to me. How could it be more explicit? It mentions the problematic field and what it is expected to be. It seems to me that we are underestimating the contributors in this regard. Sets are not a complex data structure or something that comes from an obscure, esoteric third-party library.

If you see anything else that could be problematic with the use of set here, I'd like to know and discuss it; for now, I'd rather apply the changes already made here.

@krysal krysal marked this pull request as ready for review March 28, 2024 15:29
@zackkrida
Member

@krysal and @AetherUnbound, I think this might be a good point in the conversation to solicit advice from @obulat and @stacimc on this particular issue concerning lists and sets. Since there is a clear difference of opinion here it might be useful to see where the majority opinion is held and go from there.

@obulat
Contributor

obulat commented Mar 29, 2024

If I were writing this PR, I would not change the provider scripts and would convert the lists in raw_tags to sets inside the MediaStore instead. This would mean fewer lines of code changed in this PR. However, it's just a preference, and using sets in provider scripts is okay with me.

Additionally, I worry that the error which is raised if a contributor adds a provider DAG which returns a list instead of a set may be more opaque and may make contribution harder as a result. I changed the NYPL DAG to return a list of tags instead of a set as an example, and this was the error it raised when testing locally

I think this error would be caught during the first run of the new (or updated) DAG, and is pretty easy to fix. We can catch it during the code review. A better solution would be to use a dataclass for the records here:

if not (record_data := self.get_record_data(data)):
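
A rough sketch of what that dataclass idea could look like (entirely hypothetical field names, included only to illustrate the suggestion):

    from dataclasses import dataclass, field

    @dataclass
    class RecordData:
        foreign_identifier: str
        foreign_landing_url: str
        raw_tags: set[str] = field(default_factory=set)

        def __post_init__(self):
            # Coerce whatever iterable the provider returns into a set, so
            # duplicates never survive construction and the error surface
            # for contributors disappears.
            self.raw_tags = set(self.raw_tags)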

@stacimc
Contributor

stacimc commented Mar 29, 2024

it's the constraint that we want no duplicates in tags.

We could require the providers to do all the cleanup that we do in the MediaStore by the same logic, and then just raise errors in the MediaStore if it was done incorrectly. That isn't a bad thing; I just don't see a particular benefit when it's easy and convenient to apply the constraints in a centralized place. I would feel differently if there were any ambiguity in how the constraint should be applied.

I don't think enforcing this in the provider scripts is a huge hurdle to developing them, but it is a hurdle without significant benefit, IMO, and also a departure from how we use the MediaStore. If we were going to do more of this down the road, Olga's suggestion of using a dataclass sounds promising to me, but I think that would require a separate issue and discussion. This discussion has raised some interesting points!

@krysal
Member Author

krysal commented Mar 29, 2024

We could require the providers to do all the cleanup that we do in the MediaStore by the same logic...

I'm not advocating for that change, so I see no point in discussing it. I agree it could be interesting as a separate issue if you all wish.

Understandably, we have different ways of approaching the problem. It is part of what it means to be a diverse team with different perspectives. It's unfortunate, though, that we disagree on the relationship between benefits and inconveniences, but we can still move forward.

Can I get some approval for this PR if we see no risk involved? It's beyond me why this has been delayed just for personal preferences until now.

Contributor

@obulat obulat left a comment


The code changes look good.

My main requested change is the update to the _format_raw_tag function.

Since we are updating the shape of tags in the provider scripts, we should also update _format_raw_tag to stop handling tags that are not a set of strings (I added a comment on the isinstance checks inline).

catalog/dags/providers/provider_api_scripts/nypl.py (outdated, resolved)
catalog/dags/providers/provider_api_scripts/nypl.py (outdated, resolved)
catalog/dags/providers/provider_api_scripts/rawpixel.py (outdated, resolved)
catalog/dags/providers/provider_api_scripts/stocksnap.py (outdated, resolved)
license_version="4.0",
license_url="https://license/url",
),
license_info=BY_LICENSE_INFO,

Nice :)

Contributor

@AetherUnbound AetherUnbound left a comment


Understandably, we have different ways of approaching the problem. It is part of what it means to be a diverse team with different perspectives. It's unfortunate, though, that we disagree on the relationship between benefits and inconveniences, but we can still move forward.

Can I get some approval for this PR if we see no risk involved? It's beyond me why this has been delayed just for personal preferences until now.

To be clear, my rationale for disagreeing with this approach is not one of personal preference. When we initially inherited the project, each provider script implemented nearly every step necessary for normalizing the data and preparing it for insertion into the database. Since then, we've taken several steps to centralize those efforts rather than have all that logic live in the providers themselves, namely with the MediaStore and ProviderDataIngester classes. These efforts sought to move more of the shared normalization and functionality from the individual provider scripts into a common location. Doing this traded some simplicity in each script for more abstraction, but it gave us an easier way to modify the behavior of all provider scripts without having to touch each one individually.

The approach set out in this PR would be a step backwards from the ways we've worked to centralize the data normalization in the past. In the same way that we're requiring types for the tags to be enforced on each provider script, we could require that each provider script provide valid and fully qualified URLs as well. Instead, we've aggregated that logic in a single place within the MediaStore class so that it does not need to be present in each provider. Regardless of the accuracy of the type at the provider level, deduplicating tags seems like exactly the sort of normalization logic that should also exist at the MediaStore level along with all the other normalization steps we're taking there. I do not think that enforcing types at the provider level is a strong enough reason to motivate enforcing an individual validation everywhere when it could be done in one place for all providers.

Perhaps I have the wrong impression or there are other elements of this I'm not considering. Do you mind sharing why you feel this validation differs significantly enough from the other centralized validation we're doing that it warrants being on every provider script?

@krysal
Member Author

krysal commented Apr 1, 2024

@AetherUnbound I understand better where your reasoning comes from; thank you for explaining it. It seems we differ on whether this change makes the provider scripts more complex. I find it simpler to use the correct data type, one that does the work we want (deduplication) at instantiation and even has performance benefits, than to check for lists and do type conversions later in the media store. URL validations are, of course, more complex; they cannot easily be solved with a change of type, which is why there is a dedicated module for them. But I truly don't see the harm in changing to set when we could also save the CPU cost of conversions and maybe let people see this data type in action. The tags, a collection of non-duplicate strings (which is also unlikely to change), are the perfect use for it.

I'm laser-focused on the case we have here, but I get the general preference to concentrate all the validation in the media stores. Before implementing something like this in the future, I will consult with the team about possible options.
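
The dedupe-at-instantiation behavior being described is, concretely (a plain-Python illustration, not project code):

    # Duplicates are discarded the moment the set is built, and membership
    # checks are O(1) on average versus O(n) for a list.
    raw_tags = set(["cat", "cat", "animal"])
    assert raw_tags == {"animal", "cat"}
    assert "cat" in raw_tags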

@openverse-bot
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@stacimc
@obulat
@AetherUnbound
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Contributor

@stacimc stacimc left a comment


We could require the providers to do all the cleanup that we do in the MediaStore by the same logic...

I'm not advocating for that change, so I see no point in discussing it. I agree it could be interesting as a separate issue if you all wish.

I wasn't suggesting you were advocating for this, @krysal, and for what it's worth neither am I (I would even be opposed to that idea). I was trying to illustrate that what we're doing in this PR goes against the existing convention that we have for all of the other cleanup in the MediaStore (as @AetherUnbound expanded upon in her later comment).

Can I get some approval for this PR if we see no risk involved? It's beyond me why this has been delayed just for personal preferences until now.

This was delayed primarily because there was an explicit ask earlier in the thread to get feedback for a majority opinion. I don't think the views expressed were simple style preferences or anything of that nature.

Reiterating my take for example: I do not think we should add (even minor) complexity to the provider scripts and break with our established conventions of normalizing in the MediaStore without a clear and significant benefit. I do not think deduplicating tags in the provider scripts has any meaningful performance improvement over doing it in the MediaStore, and I don't see particular benefit to provider script authors being made to think about tag duplication specifically, when that is so trivial to normalize at a higher level. So while I do not think the approach in this PR breaks anything, I disagree with the approach.

Given that the group consensus that was asked for is in favor of deduplicating in the MediaStore, can you please change the implementation to match the established patterns?

@krysal
Member Author

krysal commented Apr 3, 2024

I'm sorry we don't agree on how we view this, but I commit to the group consensus. It will be easier to make the changes in a new PR, so I'm closing this one.

@krysal krysal closed this Apr 3, 2024
@AetherUnbound AetherUnbound deleted the fix/duplicated_tags branch April 16, 2024 20:07
Successfully merging this pull request may close these issues.

  • Rename TAG_BLACKLIST to TAG_EXCLUDELIST and add more tags
  • Update raw_tags to avoid duplicates in the catalog