Filter out duplicates from `raw_tags` in the catalog #3927

Conversation
Force-pushed from b382a56 to 9b9cdf4
When the tag-cleaning code was originally written, it was also used to clean up the old TSVs as well as the output of the provider scripts. This is why there are checks for whether the tags are a dict or a list of strings. Now we only use this with the `MediaStore`, so I think we can assume the tags come as a list of strings.

With this assumption, we can simplify the tag enrichment and formatting by removing the checks for the shape of the tags (I added code suggestions inline where appropriate).
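To make the suggestion concrete, here is a minimal, hypothetical sketch of what the simplified enrichment could look like once `raw_tags` is assumed to always be a list of strings. The method names come from the snippets quoted in this PR, but the class body, `provider` value, and denylist contents are illustrative placeholders, not the actual `MediaStore` implementation.

```python
class MediaStoreSketch:
    """Stand-in for the tag-handling portion of a media store.

    Assumes raw_tags always arrives as a list of strings, so no
    isinstance checks on the tag shape are needed.
    """

    # Placeholder values for illustration only.
    provider = "example_provider"
    TAG_DENYLIST = {"no known copyright restrictions"}

    def _tag_denylisted(self, tag):
        # With string-only input, the denylist check is a plain lookup.
        return tag.lower() in self.TAG_DENYLIST

    def _format_raw_tag(self, tag):
        # No dict-vs-string branching: the tag is a plain string.
        return {"name": tag, "provider": self.provider}

    def _enrich_tags(self, raw_tags):
        if not raw_tags:
            return None
        return [
            self._format_raw_tag(tag)
            for tag in raw_tags
            if not self._tag_denylisted(tag)
        ]
```

Dropping the shape checks keeps the enrichment path a single list comprehension, which is the simplification the comment above is proposing.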
catalog/dags/common/storage/media.py (outdated)

```python
        self._format_raw_tag(tag)
        for tag in raw_tags
        if not self._tag_denylisted(tag)
    ]

def _format_raw_tag(self, tag):
```
Suggested change:

```python
def _format_raw_tag(self, tag):
    return {"name": tag, "provider": self.provider}
```
We can safely remove the `isinstance` checks since the provider scripts always provide a list of strings.

@krysal, I converted this PR to draft to prevent more pings.
I think that changing the type of `raw_tags` to a set, plus all the downstream changes necessary in the provider scripts to support that, is not the appropriate approach here. It brings a greater maintenance burden to each of the individual scripts when it's something we can centralize the logic for. Additionally, I worry that the error raised when a contributor adds a provider DAG that returns a list instead of a set may be more opaque and may make contribution harder as a result. As an example, I changed the NYPL DAG to return a list of tags instead of a set, and this was the error it raised when testing locally:
```
[2024-03-20, 22:33:28 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: Expected set, got <class 'list'>: ['Branch libraries', 'Libraries'].
query_params: {"q": "CC_0", "field": "use_rtxt_s", "page": 1, "per_page": 250}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 200, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 217, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/catalog/dags/providers/factory_utils.py", line 55, in pull_media_wrapper
    data = ingester.ingest_records()
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 276, in ingest_records
    raise error from ingestion_error
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 241, in ingest_records
    self.record_count += self.process_batch(batch)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 473, in process_batch
    store.add_item(**record)
  File "/opt/airflow/catalog/dags/common/storage/image.py", line 146, in add_item
    image = self._get_image(**image_data)
  File "/opt/airflow/catalog/dags/common/storage/image.py", line 153, in _get_image
    image_metadata = self.clean_media_metadata(**kwargs)
  File "/opt/airflow/catalog/dags/common/storage/media.py", line 157, in clean_media_metadata
    media_data["tags"] = self._enrich_tags(media_data.pop("raw_tags", None))
  File "/opt/airflow/catalog/dags/common/storage/media.py", line 299, in _enrich_tags
    raise TypeError(f"Expected set, got {type(raw_tags)}: {raw_tags}.")
TypeError: Expected set, got <class 'list'>: ['Branch libraries', 'Libraries'].
```
Given that we can simply perform the de-duplication in the `MediaStore` class and prevent this from occurring entirely, I'd rather go with that approach.
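As an illustration of the centralized alternative being argued for here, deduplication in the store can be a one-liner that preserves the order tags arrive in. This is a hypothetical sketch, not the actual `MediaStore` code; it relies on `dict.fromkeys` keeping insertion order (guaranteed since Python 3.7).

```python
def dedupe_tags(raw_tags):
    """Drop duplicate tags while preserving first-seen order.

    dict.fromkeys builds a dict whose keys are the unique tags in
    insertion order, so providers can keep returning plain lists and
    the store normalizes them in one place.
    """
    return list(dict.fromkeys(raw_tags))
```

Compared with converting `raw_tags` to a set in every provider script, this keeps the constraint in one location and avoids the opaque `TypeError` shown in the traceback above.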
I dispute that "it brings on a greater maintenance burden for each of the individual scripts"; instead, the scripts are changed to use the correct data type. Keeping two possible different types in fact creates a greater maintenance burden down the line.

This is precisely what I'm proposing we should avoid: not using a `set`. If you see something else that currently can be problematic with the use of a `set`, let me know.
Force-pushed from 9b9cdf4 to 54ea78e
@krysal and @AetherUnbound, I think this might be a good point in the conversation to solicit advice from @obulat and @stacimc on this particular issue concerning lists and sets. Since there is a clear difference of opinion here, it might be useful to see where the majority opinion lies and go from there.
Force-pushed from c84d288 to ff53645
If I were writing this PR, I would not change the provider scripts, and would instead convert the lists to sets in the `MediaStore`.

I think this error would be caught during the first run of the new (or updated) DAG, and is pretty easy to fix. We can catch it during code review. A better solution would be to use a dataclass for the records here.
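The dataclass idea floated above could look something like the sketch below, where normalization happens at construction time. The class and field names here are illustrative only; the catalog does not currently define such a record type, and the real field set would be much larger.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Hypothetical record type for a single ingested image.

    Normalization runs in __post_init__, so provider scripts can pass
    lists (or any iterable) and still end up with deduplicated tags.
    """

    foreign_identifier: str
    url: str
    raw_tags: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Deduplicate while preserving first-seen order.
        self.raw_tags = list(dict.fromkeys(self.raw_tags))
```

A dataclass like this would centralize type expectations the way the `MediaStore` does today, while also documenting the record shape explicitly; as noted later in the thread, adopting it would be a separate issue and discussion.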
We could require the providers to do all the cleanup that we do in the MediaStore by the same logic, and then just raise errors in the MediaStore if it was done incorrectly; that isn't a bad thing, I just don't see a particular benefit when it's easy and convenient to apply the constraints in a centralized place. I would feel differently if there were any ambiguity in how the constraint should be applied.

I don't think it's a huge hurdle to developing provider scripts to enforce this there, but it is a hurdle without significant benefit IMO, and also a departure from how we use the MediaStore. If we were going to do more of this down the road, Olga's suggestion of using a dataclass sounds promising to me, but I think that would require a separate issue and discussion. This discussion has raised some interesting points!
I'm not advocating for that change, so I see no point in discussing it. Agreed that it can be interesting as a separate issue if you all wish. Understandably, we have different ways of approaching the problem; it is part of what it means to be a diverse team with different perspectives. It's unfortunate, though, that we disagree on the relationship between benefits and inconveniences, but we can still move forward.

Can I get some approval for this PR if we see no risk involved? It's beyond me why this has been delayed just for personal preferences until now.
The code changes look good.

My main request for changes is the update to the `_format_raw_tag` function. Since we are updating the shape of tags in the provider scripts, we should also update `_format_raw_tag` to stop handling tags that are not a set of strings (I added a comment on the `isinstance` checks inline).
```diff
-    license_version="4.0",
-    license_url="https://license/url",
-),
+license_info=BY_LICENSE_INFO,
```
Nice :)
Force-pushed from ff53645 to f1a8aab
> Understandably, we have different ways of approaching the problem. It is part of what it means to be a diverse team with different perspectives. It's unfortunate, though, that we disagree on the relationship between benefits and inconveniences, but we can still move forward.
>
> Can I get some approval for this PR if we see no risk involved? It's beyond me why this has been delayed just for personal preferences until now.
To be clear, my rationale for disagreeing with this approach is not one of personal preference. When we initially inherited the project, each provider script implemented nearly every step necessary for normalizing the data and preparing it for insertion into the database. Since then, we've taken several steps to centralize those efforts rather than have all that logic live in the providers themselves, namely with the `MediaStore` and `ProviderDataIngester` classes. These efforts moved more of the shared normalization and functionality from the individual provider scripts into a common location. Doing this traded some simplicity in each script for more abstraction, but it gave us an easier way to modify the behavior of all provider scripts without having to touch each individual one.

The approach set out in this PR would be a step backwards from the ways we've worked to centralize the data normalization in the past. In the same way that we're requiring types for the tags to be enforced in each provider script, we could require that each provider script provide valid and fully qualified URLs as well. Instead, we've aggregated that logic in a single place within the `MediaStore` class so that it does not need to be present in each provider. Regardless of the accuracy of the type at the provider level, deduplicating tags seems like exactly the sort of normalization logic that should also exist at the `MediaStore` level along with all the other normalization steps we're taking there. I do not think that enforcing types at the provider level is a strong enough reason to motivate enforcing an individual validation everywhere when it could be done in one place for all providers.

Perhaps I have the wrong impression or there are other elements of this I'm not considering. Do you mind sharing why you feel this validation differs significantly enough from the other centralized validation we're doing that it warrants being on every provider script?
@AetherUnbound I understand better where your reasoning comes from, thank you for explaining it. It seems that we differ on whether this change makes the provider scripts more complex. I find it simpler to use the correct data type, which does the work we want (deduplication) at instantiation and even has performance benefits, than to check for lists and do type conversions later in the media store. URL validations are, of course, more complex, which we cannot easily solve with a change of type, and that's why there is a dedicated module for them. But I truly don't see the harm in changing to `set`.

I'm laser-focused on the case we have here, but I get the general preference to concentrate all the validation in the media stores. Before implementing something else in future opportunities, I will consult with the team about possible options.
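For readers following the debate, the two approaches can be sketched side by side. The function names and record shapes below are illustrative only; they are not the actual provider-script API.

```python
# Approach A (this PR): the provider script builds a set, so duplicates
# never leave the script. Dedup happens at instantiation of the set.
def get_record_data_with_set(item):
    return {"raw_tags": {t["name"] for t in item.get("tags", [])}}


# Approach B (group consensus): the provider returns a plain list and
# the media store deduplicates centrally, preserving first-seen order.
def get_record_data_with_list(item):
    return {"raw_tags": [t["name"] for t in item.get("tags", [])]}


def media_store_dedupe(raw_tags):
    return list(dict.fromkeys(raw_tags))
```

Both end up with unique tags; the disagreement is about where the constraint lives (each provider script versus one shared store) and whether a set's unordered, duplicate-free semantics at the source outweigh centralizing the normalization.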
Force-pushed from f1a8aab to 9d16254
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @stacimc

Excluding weekend days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s).

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
> We could require the providers to do all the cleanup that we do in the MediaStore by the same logic...

> I'm not advocating for that change so I see no point in discussing it. Agree that it can be interesting as a separate issue if you all wish.
I wasn't suggesting you were advocating for this, @krysal, and for what it's worth neither am I (I would even be opposed to that idea). I was trying to illustrate that what we're doing in this PR goes against the existing convention that we have for all of the other cleanup in the MediaStore (as @AetherUnbound expanded upon in her later comment).
> Can I get some approval for this PR if we see no risk involved? It's beyond me why this has been delayed just for personal preferences until now.
This was delayed primarily because there was an explicit ask earlier in the thread to get feedback for a majority opinion. I don't think the views expressed were simple style preferences or anything of that nature.
Reiterating my take for example: I do not think we should add (even minor) complexity to the provider scripts and break with our established conventions of normalizing in the MediaStore without a clear and significant benefit. I do not think deduplicating tags in the provider scripts has any meaningful performance improvement over doing it in the MediaStore, and I don't see particular benefit to provider script authors being made to think about tag duplication specifically, when that is so trivial to normalize at a higher level. So while I do not think the approach in this PR breaks anything, I disagree with the approach.
Given that the group consensus that was asked for is in favor of deduplicating in the MediaStore, can you please change the implementation to match the established patterns?
I'm sorry we don't agree on how we view this, but I commit to the group consensus. It will be easier to make the changes in a new PR, so I'm closing this one.
Fixes

Fixes #3926 by @krysal

Description

This PR mainly aims to convert the `raw_tags` field to a set to ensure we're not letting any duplicates pass. The number of files affected might be daunting, but the commit history can help track the changes.

This PR also:

- Updates the `add_item` method of the `ImageStore` and `AudioStore` classes

Something to consider while validating tags is sorting them right before enhancing them. This could allegedly be beneficial for "consistency between runs, saving on inserts into the DB later." The Flickr DAG was doing it in addition to filtering out the duplicates.
I also identified that the following provider DAGs aren't collecting any tags for media:
Testing Instructions
Run any number of DAGs you desire to test and confirm the tags are saved without duplicates. Run at least one audio and one image provider. Setting the `AIRFLOW_VAR_INGESTION_LIMIT` variable to a low number, e.g. 100, is suggested for quick checks.
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin