Change search query approach to include only available providers #4238
I like this approach, it makes perfect sense to me and I agree it is much clearer than the multiple steps of reasoning required for the other approach.
I've left one comment regarding the terminology used, but that's not as important to me as clarifying whether we need to use `must` or `filter`. Based on my understanding, I think `filter` is the right choice, but if it isn't, then we should document why we use `must`, considering its effect on document scoring (and conversely, we should document why we use `filter`, if we do, namely so that it doesn't affect scoring).
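To make the `must` versus `filter` question concrete, here is a minimal sketch of the two query shapes written as plain Elasticsearch query DSL dictionaries. The function name `build_restricted_query`, the `title` match clause, and the example values are assumptions for illustration, not the PR's code. The only point is where the `terms` clause sits: under `filter` it restricts the matches without contributing to `_score`, while under `must` it takes part in scoring.

```python
def build_restricted_query(search_term, field, allowed_values, clause="filter"):
    """Build a bool query restricted to documents whose `field` is in `allowed_values`.

    Hypothetical helper: the restriction goes into either the `filter` or the
    `must` section of the bool query, depending on `clause`.
    """
    if clause not in ("filter", "must"):
        raise ValueError("clause must be 'filter' or 'must'")

    # The same terms clause is used in both cases; only its position changes.
    restriction = {"terms": {field: list(allowed_values)}}

    # The user's search term is always scored (illustrative `title` field).
    bool_body = {"must": [{"match": {"title": search_term}}]}

    if clause == "filter":
        # Filter context: yes/no matching, no effect on the relevance score.
        bool_body["filter"] = [restriction]
    else:
        # Query context: the clause also contributes to the relevance score.
        bool_body["must"].append(restriction)

    return {"query": {"bool": bool_body}}


filtered = build_restricted_query("dog", "source", ["flickr", "nasa"], clause="filter")
scored = build_restricted_query("dog", "source", ["flickr", "nasa"], clause="must")
```

Both dictionaries select the same set of documents; they can differ only in how those documents are scored, which is why the choice is worth documenting either way.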
This is such a smart approach!
I agree with Sara on the naming, something with the meaning of `enabled` or `included` would be better.
I'm happy to change the terminology and, of course, the clause for the query. I had similar thoughts regarding the names, but it seemed to me that they could fit with the existing environment variables, and I didn't want to give it much thought since it could also be decided in the review 😄 This is back to draft in the meantime.
0c24148 to 00c8914
@krysal @obulat do y'all know whether the source/provider distinction is potentially getting muddled in this new approach? I can't remember if `ContentProvider` only has true "providers" and not sources. If it also has sources, do we need to be careful about how it's used in relation to the `source` filters (both the include and exclude parameters)?
This LGTM otherwise, but I wanted to make sure I understood where the distinction was before approving.
@sarayourfriend |
The problem with the implementation, if there is one (if sources are in To be clear, if this is a problem now, then it was a problem before with the exclusions, but will be worse now because we wouldn't be including all the sources, only those which happen to also be providers. If Maybe this isn't an issue and I am just confusing myself. But if sources are also in |
> The problem with the implementation, if there is one (if sources are in `ContentProvider` in addition to providers) is we only use the entries in `ContentProvider` to filter on the provider field of the ES documents, not source, whereas the query parameters refer to source.
> To be clear, if this is a problem now, then it was a problem before with the exclusions, but will be worse now because we wouldn't be including all the sources, only those which happen to also be providers.

I SEE, wow, I think you're right. We're mixing sources with providers, I hadn't paid much attention to it because I thought this was covered since the "new" parameters work 😅

> If `ContentProvider` really does have sources, not just providers, then my hunch is that we actually have a complex issue here when it comes to providers like smithsonian with many sources. Do we have an individual content provider entry for all smithsonian sources?

I don't think we have those; some sources are too small (<100 items), so we decided to skip them. But the ugly problem is that there shouldn't be a `provider` entry equivalent to all `source` entries (not sure if I'm explaining myself here, but hopefully so!). The provider indicates where we obtain the data from, regardless of which site is the source.
There could be a simpler solution though, proposed below.
if filtered_providers:
    return Q("terms", provider=filtered_providers)
if enabled_providers:
    return Q("terms", provider=enabled_providers)
That's along the lines of what I'm thinking... but if and only if all active sources have a `ContentProvider` entry.
Which is to say, exclusion was certainly safer for sources (though the code was already problematic in that we couldn't disable a source).
Clarifying whether `ContentProvider` is just providers or if it's also sources (and if not, then how sources factor into the data model) would be really helpful.
If it's just providers, then this code would work fine, if all we want is to include providers that have a `ContentProvider` and are not disabled. However, if we also need to be able to selectively enable/disable sources, then either: (a) we need to overload the conception of `ContentProvider`, and probably stick to only explicitly excluding sources rather than including them based on `ContentProvider` (otherwise we'd have to make a `ContentProvider` for all the Smithsonian sources); or, (b) we need to create a separate `Source` model to track sources, and then probably only track source exclusions, considering how many sources there are and how tedious it would be to manage that by hand... though that tediousness is maybe just this first go-round and we could automatically create sources during the data refresh when a new one is detected.
All to say: this is pretty complicated but so long as we are not currently relying on `ContentProvider` to exclude a source then it should work. If that's the case, then great, we can move forward with the PR as is, and then circle back to this question of how to (and whether to) manage source inclusion/exclusion. Otherwise, we probably need to stick to exclusion.
We definitely have sources in `ContentProvider` (finnish_satakunnan_museum and finnish_heritage_agency, the Smithsonian sources, others too), but I can't at a glance tell which are excluded (need to do a db query for that).
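The difference between the two strategies discussed above can be sketched with plain Python lists, without Elasticsearch or Django. Everything here is hypothetical: the document shape, the `registered` mapping (standing in for `ContentProvider` rows and their `filter_content` flag), and the source identifiers. It only illustrates why inclusion is riskier when some active source has no registered entry.

```python
def visible_by_exclusion(documents, hidden_sources):
    """Keep every document whose source is not explicitly hidden."""
    hidden = set(hidden_sources)
    return [doc for doc in documents if doc["source"] not in hidden]


def visible_by_inclusion(documents, enabled_sources):
    """Keep only documents whose source is explicitly enabled."""
    enabled = set(enabled_sources)
    return [doc for doc in documents if doc["source"] in enabled]


# Hypothetical indexed documents.
documents = [
    {"id": 1, "source": "flickr"},
    {"id": 2, "source": "ccmixter"},
    {"id": 3, "source": "unregistered_source"},  # no registered entry at all
]

# Hypothetical registry: source identifier -> filter_content flag.
registered = {"flickr": False, "ccmixter": True}
hidden = [name for name, filtered in registered.items() if filtered]
enabled = [name for name, filtered in registered.items() if not filtered]

excluded_view = [doc["id"] for doc in visible_by_exclusion(documents, hidden)]
included_view = [doc["id"] for doc in visible_by_inclusion(documents, enabled)]
```

With exclusion, the unregistered source stays visible (`excluded_view` is `[1, 3]`); with inclusion it silently disappears (`included_view` is `[1]`). That is the "if and only if all active sources have an entry" condition in concrete form.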
These are the only hidden ones:
deploy@localhost:openledger> select provider_name from content_provider where filter_content;
+---------------------+
| provider_name       |
|---------------------|
| Flora-On            |
| ccMixter            |
| Science Museum – UK |
+---------------------+
SELECT 3
Time: 0.222s
All of these are "provider-sources", so it should be okay for now. But if we make this change to the code, we really need to circle back and reconsider the meaning of `ContentProvider` and the provider/source relationship in that model.
It would probably be a good idea to have separate models, `Provider` and `Source`, where `Source` has a foreign key relationship to `Provider`, rather than overloading the meaning of `ContentProvider`. Even just reasoning through the limitations of the current approach to query building and the meaning of these models would be so much easier if that distinction was clear in absolute terms.
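As a rough illustration of the separate-models idea, the sketch below uses plain dataclasses rather than Django models so that it stays self-contained. The field names, the `enabled` flag, and the example identifiers are all assumptions; nothing like this exists in the codebase as far as this thread shows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Where the data is obtained from (hypothetical model)."""

    identifier: str


@dataclass(frozen=True)
class Source:
    """The site a work originally comes from (hypothetical model)."""

    identifier: str
    provider: Provider  # stands in for a foreign key to Provider
    enabled: bool = True


def enabled_source_identifiers(sources):
    """Return the identifiers a search query would be allowed to include."""
    return sorted(source.identifier for source in sources if source.enabled)


smithsonian = Provider("smithsonian")
sources = [
    Source("smithsonian_example_museum_a", smithsonian),
    Source("smithsonian_example_museum_b", smithsonian, enabled=False),
    Source("flickr", Provider("flickr")),
]

allowed = enabled_source_identifiers(sources)
```

With the relationship explicit, a provider with many sources and a "provider-source" such as Flickr are represented the same way, and enabling or disabling happens per source.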
Originally, we only had `source`. Then, the `provider` field was added to denote the provider of the metadata about the image. The PR that added the `provider` field didn't touch the `ContentProvider` model, and this model wasn't updated later.
`ContentProvider` currently only handles sources, not providers. You can check this if you go to https://openverse.org/sources - you'll see NASA in the table. This uses the `stats` endpoint that queries `sources`, not `providers`.
All this is to say that I think we can proceed with this PR, and exactly because we rely on `ContentProvider` to filter sources. I'll review the code again to check that this is true.
> All to say: this is pretty complicated but so long as we are not currently relying on `ContentProvider` to exclude a source then it should work. If that's the case, then great, we can move forward with the PR as is, and then circle back to this question of how to (and whether to) manage source inclusion/exclusion. Otherwise, we probably need to stick to exclusion.

Could we open a new issue (maybe even a Project proposal request?) to
- Rename the `ContentProvider` model to `ContentSource`
- Create a new `ContentProvider` model for providers
- Surface provider stats in the API
- Use this view to add provider links to the frontend single result page

There is some additional info in the cc-archive issues and PRs (https://github.com/cc-archive/cccatalog-api/pull/548/files#r444933371, Expose provider field cc-archive/cccatalog-api#560 (comment))
By the way, some of the issues mention that a source can appear in several providers. I think we have a project for implementing this; it would reduce the number of duplicates we have, but we would need to find a way to handle a work that has several providers.
> All this is to say that I think we can proceed with this PR, and exactly because we rely on `ContentProvider` to filter sources. I'll review the code again to check that this is true.

If `ContentProvider` is only for sources then we need to change the filter here to `source`, not `provider`.
> All this is to say that I think we can proceed with this PR, and exactly because we rely on `ContentProvider` to filter sources. I'll review the code again to check that this is true.

> If `ContentProvider` is only for sources then we need to change the filter here to `source`, not `provider`.

This is exactly what I proposed in the first comment of this thread, which is... deleted?? I don't understand what happened 😳 Anyway, I 100% agree that the current naming is a mess and very confusing between what actually is a source and what is a provider. I'll apply the minimal changes required to fulfill the issue requirement and keep the PR simple. As @obulat suggests, the renaming sure warrants its own issue, maybe even a small project.

> Originally, we only had source. Then, the provider field was added to denote the provider of the metadata about the image.

That is a curious fact. I thought the `provider` field came before the `source` given that the model's name is `ContentProvider`.
If we only care about sources and not providers, then we should "just" (in quotes because maybe it isn't trivial to do) rename `ContentProvider` to `ContentSource` (although, I don't know why the "content" prefix is necessary) and the variables/serializers/documentation to clarify it (e.g. in the stats endpoint).
We don't need to add structure that is never used, so if `ContentProvider` only ever really means to reference the sources, then we can just forget about representing providers as an exclusive category at the API database level altogether, at least until we actually have a reason to do that.

> This uses the stats endpoint that queries sources, not providers

@obulat the stats endpoint queries `ContentProvider`, so I don't think this is in any way obvious through any of the current implementation, on the name level, certainly not by just reading the API code... and my intuition would be to assume there was a confusion between source and provider, rather than that the difference between the references on the frontend and all the references in the API was intentional.
Just want to clarify that this is messy and needs to be cleaned up, otherwise we're relying on latent, undocumented understandings of how the usage of that particular model changed, without ever having made the code reflect that, not even at the public documentation level:
https://api.openverse.org/v1/#tag/images/operation/images_stats
The documentation there now says that `display_name` is "The name of content provider, e.g. Flickr", contrasted to `source_name`, which is "The source of the media, e.g. flickr". The difference is unexplained, and there's no way to know that `ContentProvider` actually represents sources based on how the code uses it.
openverse/api/api/views/media_views.py
Lines 206 to 217 in c92d484
@action(detail=False, serializer_class=ProviderSerializer, pagination_class=None)
def stats(self, *_, **__):
    source_counts = search_controller.get_sources(self.default_index)
    context = self.get_serializer_context() | {
        "source_counts": source_counts,
    }
    providers = ContentProvider.objects.filter(
        media_type=self.default_index, filter_content=False
    )
    serializer = self.get_serializer(providers, many=True, context=context)
    return Response(serializer.data)
Note that the serializer, model, and variable name of the retrieved data are all provider, not source.
Really just wanting to make it clear that this isn't clear in the code, and I think we should consider treating resolving that with greater urgency than we have thus far. The source/provider distinction is terribly tedious, and anything we do to clarify it even for ourselves is worth the effort.
I agree 100% with you @sarayourfriend, so I raised the issue's priority.
Just requesting changes to clarify the PR is under discussion while we decide what steps we want to take now.
My request for changes, in light of the sources/provider discussion, is to replace `provider` with `source` everywhere it's necessary, and add a comment on the `ContentProvider` model to say that it's actually a content source, not provider (until we rename the model).
-    if enabled_providers:
-        return Q("terms", provider=enabled_providers)
+    if enabled_sources:
+        return Q("terms", source=enabled_sources)
00c8914 to 2e0c437
2e0c437 to f9b0d0e
I'll approve after testing this PR locally.
PROVIDER = "provider"
QUERY_SPECIAL_CHARACTER_ERROR = "Unescaped special characters are not allowed."
ENABLED_SOURCES_CACHE_KEY = "enabled_sources"
ENABLED_SOURCES_CACHE_VERSION = 2
Thank you for removing the unused constants from 5 years ago :)
I didn't even know they were so old, wow! What is strange is that there should be a pre-commit lint step that fails due to unused variables 🤔 Not sure what happened with that...
You can't reliably lint for unused variables declared at the module scope in Python, because all top-level variables are automatically exported. It'd be like trying to lint an unused `export const whatever;` in JavaScript. It can't be done unless you statically analyse all references to the module. I don't know if mypy supports something like that, but certainly ruff never could.
Same with "unused imports" in Python, because importing in Python always execs the module (assuming the default loader is used) and so it can have side effects. I think there are even further caveats with name shadowing if you use star imports but I haven't refreshed my memory on that in a while!
All of which is just to say, it's not possible to lint for unused module-level declarations without something like mypy, which we don't use (hopefully one day we will, though).
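A small, self-contained sketch of the two points above: a top-level name that is never referenced inside its own module is still reachable from outside, and running a module body can have side effects. The module source, its names, and the `load_module` helper are made up for illustration; `load_module` only approximates what happens the first time a module is imported.

```python
import types

# Hypothetical module body: UNUSED_HERE is never referenced inside the module.
MODULE_SOURCE = '''
SIDE_EFFECTS = []
UNUSED_HERE = "never referenced inside this module"
SIDE_EFFECTS.append("module body executed")
'''


def load_module(name, source):
    """Execute `source` as the body of a fresh module object and return it."""
    module = types.ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    return module


constants = load_module("example_constants", MODULE_SOURCE)

# Any other module could do this, so a linter that only sees the defining
# file cannot prove the name is unused.
value_seen_from_outside = constants.UNUSED_HERE

# Executing the module body ran its top-level statements.
recorded_side_effects = constants.SIDE_EFFECTS
```

A tool would need to look at every potential importer, not just the defining file, before declaring `UNUSED_HERE` dead.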
Everything works well locally 🎉
I think we do need to change the variable name in api/test/unit/controllers/test_search_controller.py
* Change search approach to include only available providers
* Replace `get_excluded_providers_query` with `get_filtered_providers_query`

Co-authored-by: Olga Bulat <obulat@gmail.com>
Co-authored-by: sarayourfriend <git@sarayourfriend.pictures>
f9b0d0e to f3677c5
I had to rebase it with |
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @dhruvkb
Excluding weekend days, this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s).
@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
I'll re-review this today 👍
LGTM!
Fixes
Fixes #4076 by @obulat
Description
This PR proposes an alternative solution to the one proposed in the issue: the search controller should limit the results queried to those associated with existing and not hidden providers. IMO, this is a simpler approach that takes advantage of the existing code structure. It fits the "filtered provider" concept in the sense that it queries for valid providers instead of skipping the "excluded." What I don't know is if we want to update the value of the `FILTERED_PROVIDERS_CACHE_VERSION` variable in this case.
Testing Instructions
`just a`, and go to http://localhost:50280/v1/images/ to confirm the search is working with the usual providers
`ContentProvider` entry for Flickr.