Change search query approach to include only available providers #4238

Merged
krysal merged 6 commits into main from fix/search_with_sources_without_ContentProvider
May 21, 2024

Conversation

Member

@krysal krysal commented Apr 30, 2024

Fixes

Fixes #4076 by @obulat

Description

This PR proposes an alternative solution to the one proposed in the issue: The search controller should limit the results queried to those associated with existing and not hidden providers. IMO, this is a simpler approach that takes advantage of the existing code structure. It fits the "filtered provider" concept in the sense that it queries for valid providers instead of skipping the "excluded." What I don't know is if we want to update the value of the FILTERED_PROVIDERS_CACHE_VERSION variable in this case.
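
On the cache-version question: the concern is presumably that a value cached under the old meaning (providers to exclude) could be read back under the new meaning (providers to include). A minimal sketch of why bumping the version sidesteps that, using a plain dict in place of Django's cache. The `FakeCache` class, the key, and the sample values are illustrative assumptions, not code from this PR:

```python
class FakeCache:
    """Tiny stand-in for a versioned key-value cache (hypothetical)."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, version=1):
        # Entries are stored per (key, version) pair.
        self._store[(key, version)] = value

    def get(self, key, default=None, version=1):
        return self._store.get((key, version), default)


cache = FakeCache()

# Old code path: cached the providers to *exclude* under version 1.
cache.set("filtered_providers", ["stocksnap"], version=1)

# New code path: expects the providers to *include*. Reading version 1 would
# silently flip the meaning; reading a bumped version misses, so the value
# gets recomputed with the new semantics.
stale = cache.get("filtered_providers", version=1)
fresh = cache.get("filtered_providers", version=2)
```

With a bumped version the old entry is simply never read again and expires on its own.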

Testing Instructions

  1. Spin up the API with `just a`, then go to http://localhost:50280/v1/images/ to confirm the search is working with the usual providers.
  2. Hide one of the image providers via Django admin, e.g. StockSnap.
  3. Confirm the results for the hidden provider are not returned http://localhost:50280/v1/images/
  4. Show the results of StockSnap again and delete the ContentProvider entry for Flickr.
  5. Confirm the results for the deleted provider are not returned http://localhost:50280/v1/images/
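
Steps 3 and 5 can also be checked with a small helper instead of by eye. This sketch assumes the search response is a JSON object with a `results` list whose items carry a `source` field. The helper name and the sample payload are made up for illustration:

```python
def sources_in_results(payload):
    """Return the set of `source` values found in a search response payload."""
    return {item["source"] for item in payload.get("results", [])}


# Hypothetical payload, shaped like the assumed response.
sample = {
    "results": [
        {"id": "a", "source": "flickr"},
        {"id": "b", "source": "wikimedia"},
    ]
}

hidden = "stocksnap"
# True would mean the hidden source still shows up in results.
leaked = hidden in sources_in_results(sample)
```

In practice the payload would come from the local endpoint in the steps above; only one page of results is inspected here, so a clean result is a spot check rather than proof.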

Checklist

  • My pull request has a descriptive title (not a vague title like `Update index.md`).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (only applicable for catalog PRs).

Developer Certificate of Origin

Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or
(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or
(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.
(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@krysal krysal requested a review from a team as a code owner April 30, 2024 21:16
@krysal krysal requested review from dhruvkb and stacimc April 30, 2024 21:16
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 30, 2024
@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label Apr 30, 2024
@krysal krysal removed the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Apr 30, 2024
Contributor

@sarayourfriend sarayourfriend left a comment

I like this approach, it makes perfect sense to me and I agree it is much clearer than the multiple steps of reasoning required for the other approach.

I've left one comment regarding the terminology used, but that's not as important to me as clarifying whether we need to use must or filter. Based on my understanding, I think filter is the right choice, but if it isn't, then we should document why we use must, considering its effect on document scoring (and conversely, we should document why we use filter, if we do, namely so that it doesn't affect scoring).
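
For reference, the difference between the two clauses in an Elasticsearch bool query: clauses under `must` contribute to the relevance score, while clauses under `filter` only include or exclude documents. A minimal sketch that builds the raw query body as a plain dict. The function name, the `title` field, and the provider values are placeholders, not the controller's actual code:

```python
def build_query(term, enabled, clause="filter"):
    """Build a bool query restricting results to the enabled providers.

    `clause` is "filter" (no effect on scoring) or "must" (affects scoring).
    """
    if clause not in ("filter", "must"):
        raise ValueError("clause must be 'filter' or 'must'")
    restriction = {"terms": {"provider": enabled}}
    text_match = {"match": {"title": term}}
    bool_body = {"must": [text_match]}
    # Appending to `must` makes the restriction part of the score;
    # a separate `filter` list keeps it out of scoring.
    bool_body.setdefault(clause, []).append(restriction)
    return {"query": {"bool": bool_body}}


as_filter = build_query("cat", ["flickr", "stocksnap"])
as_must = build_query("cat", ["flickr", "stocksnap"], clause="must")
```

Both bodies match the same documents; only the scoring differs, which is the reason for documenting whichever clause is chosen.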

api/api/controllers/search_controller.py Outdated
api/api/controllers/search_controller.py Outdated
api/api/controllers/search_controller.py Outdated
api/api/controllers/search_controller.py Outdated
api/test/unit/controllers/elasticsearch/test_related.py Outdated
Contributor

@obulat obulat left a comment

This is such a smart approach!
I agree with Sara on the naming, something with the meaning of enabled or included would be better.

api/api/controllers/search_controller.py Outdated
api/api/controllers/search_controller.py Outdated
api/api/controllers/elasticsearch/related.py Outdated
Member Author

krysal commented May 1, 2024

I'm happy to change the terminology and, of course, the clause for the query. I had similar thoughts regarding the names, but it seemed to me that they could fit with the existing environment variables, and I didn't want to give it much thought since it could also be decided in the review 😄

This is back to draft in the meantime.

@krysal krysal marked this pull request as draft May 1, 2024 16:19
@krysal krysal force-pushed the fix/search_with_sources_without_ContentProvider branch 5 times, most recently from 0c24148 to 00c8914 Compare May 14, 2024 19:43
@krysal krysal marked this pull request as ready for review May 14, 2024 19:57
Contributor

@sarayourfriend sarayourfriend left a comment

@krysal @obulat do y'all know whether the source/provider distinction is potentially getting muddled in this new approach? I can't remember if ContentProvider only has true "providers" and not sources. If it also has sources, do we need to be careful about how it's used in relation to the source filters (both the include and exclude parameters)?

This LGTM otherwise, but I wanted to make sure I understood where the distinction was before approving.

Member Author

krysal commented May 15, 2024

@sarayourfriend ContentProvider contains sources, and I don't see any issue with the functionalities. I tried hiding Freesound locally, for example, and I requested audio exclusively from it, http://localhost:50280/v1/audio/?source=freesound, and got zero rows as expected (since the source is disabled). Let me know if you or @obulat find a conflict.

Contributor

sarayourfriend commented May 15, 2024

The problem with the implementation, if there is one (if sources are in ContentProvider in addition to providers) is we only use the entries in ContentProvider to filter on the provider field of the ES documents, not source, whereas the query parameters refer to source.

To be clear, if this is a problem now, then it was a problem before with the exclusions, but will be worse now because we wouldn't be including all the sources, only those which happen to also be providers.

If ContentProvider really does have sources, not just providers, then my hunch is that we actually have a complex issue here when it comes to providers like smithsonian with many sources. Do we have an individual content provider entry for all smithsonian sources? I didn't think we did (I can check later), but if we don't, and also need to filter in source based on ContentProvider, then there's some kind of strange intersection issue. We need an or in there. "Include documents if provider is in the list of enabled ContentProvider OR if source is in the list of enabled ContentProvider".

Maybe this isn't an issue and I am just confusing myself. But if sources are also in ContentProvider, then there is a big risk to the current implementation that only filters in based on provider.
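
The OR condition described in this comment could be expressed as a bool query with two `should` clauses, at least one of which must match. A sketch as a plain dict, assuming `provider` and `source` are both exact-match fields on the documents. The function names and the sample values are illustrative only:

```python
def provider_or_source_filter(enabled):
    """Query matching documents whose provider OR source is in `enabled`."""
    return {
        "bool": {
            "should": [
                {"terms": {"provider": enabled}},
                {"terms": {"source": enabled}},
            ],
            "minimum_should_match": 1,
        }
    }


def matches(doc, enabled):
    """Pure-Python equivalent of the query, for reasoning about cases."""
    return doc["provider"] in enabled or doc["source"] in enabled


enabled = ["smithsonian", "flickr"]
# Made-up document: the provider has an entry, the specific source does not.
doc = {"provider": "smithsonian", "source": "smithsonian_example_unit"}
query = provider_or_source_filter(enabled)
```

Under this condition the sample document is still included through its provider, even though its source has no entry of its own.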

Member Author

@krysal krysal left a comment

> The problem with the implementation, if there is one (if sources are in ContentProvider in addition to providers) is we only use the entries in ContentProvider to filter on the provider field of the ES documents, not source, whereas the query parameters refer to source.
>
> To be clear, if this is a problem now, then it was a problem before with the exclusions, but will be worse now because we wouldn't be including all the sources, only those which happen to also be providers.

I SEE, wow, I think you're right. We're mixing sources with providers, I hadn't paid much attention to it because I thought this was covered since the "new" parameters work 😅

> If ContentProvider really does have sources, not just providers, then my hunch is that we actually have a complex issue here when it comes to providers like smithsonian with many sources. Do we have an individual content provider entry for all smithsonian sources?

I don't think we have those; some sources are too small (<100 items), so we decided to skip them. But the ugly problem is that there shouldn't be a provider entry equivalent to all source entries (not sure if I'm explaining myself here, but hopefully so!). The provider indicates where we obtain the data from, regardless of which site is the source.

There could be a simpler solution tho, proposed below.

-    if filtered_providers:
-        return Q("terms", provider=filtered_providers)
+    if enabled_providers:
+        return Q("terms", provider=enabled_providers)
Contributor

That's along the lines of what I'm thinking... but if and only if all active sources have a ContentProvider entry.

Which is to say, exclusion was certainly safer for sources (though the code was already problematic in that we couldn't disable a source).

Clarifying whether ContentProvider is just providers or if it's also sources (and if not, then how sources factor into the data model) would be really helpful.

If it's just providers, then this code would work fine, if all we want is to include providers that have a ContentProvider and are not disabled. However, if we also need to be able to selectively enable/disable sources, then either: (a) we need to overload the conception of ContentProvider, and probably stick to only explicitly excluding sources rather than including them based on ContentProvider (otherwise we'd have to make a ContentProvider for all the Smithsonian sources); or, (b) we need to create a separate Source model to track sources, and then probably only track source exclusions, considering how many sources there are and how tedious it would be to manage that by hand... though that tediousness is maybe just this first go-round and we could automatically create sources during the data refresh when a new one is detected.

All to say: this is pretty complicated but so long as we are not currently relying on ContentProvider to exclude a source then it should work. If that's the case, then great, we can move forward with the PR as is, and then circle back to this question of how to (and whether to) manage source inclusion/exclusion. Otherwise, we probably need to stick to exclusion.

We definitely have sources in ContentProvider (finnish_satakunnan_museum and finnish_heritage_agency, the Smithsonian sources, others too), but I can't at a glance tell which are excluded (need to do a db query for that).

These are the only hidden ones:

deploy@localhost:openledger> select provider_name from content_provider where filter_content;
+---------------------+
| provider_name       |
|---------------------|
| Flora-On            |
| ccMixter            |
| Science Museum – UK |
+---------------------+
SELECT 3
Time: 0.222s

All of these are "provider-sources", so it should be okay for now. But if we make this change to the code, we really need to circle back and reconsider the meaning of ContentProvider and the provider/source relationship in that model.

It would probably be a good idea to have separate models, Provider and Source, where Source has a foreign key relationship to Provider, rather than overloading the meaning of ContentProvider. Even just reasoning through the limitations of the current approach to query building and the meaning of these models would be so much easier if that distinction was clear in absolute terms.
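
To make the proposed split concrete, here is a rough sketch of the two-model shape using plain dataclasses rather than Django models. All class and field names are hypothetical, and the reference from `Source` to `Provider` stands in for the foreign key:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Where the metadata is obtained from."""

    identifier: str


@dataclass(frozen=True)
class Source:
    """The site or collection a work comes from, reached via a provider."""

    identifier: str
    provider: Provider  # stand-in for a foreign key
    enabled: bool = True


def enabled_source_ids(sources):
    """Identifiers of the sources that should appear in search."""
    return [s.identifier for s in sources if s.enabled]


smithsonian = Provider("smithsonian")
sources = [
    Source("example_museum_a", smithsonian),
    Source("example_museum_b", smithsonian, enabled=False),
]
```

With this shape, disabling one source leaves its sibling sources under the same provider untouched, which is the case the single overloaded model cannot express cleanly.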

Contributor

Originally, we only had source. Then, the provider field was added to denote the provider of the metadata about the image [1]. The PR that added the provider field didn't touch the ContentProvider model, and this model wasn't updated later.

ContentProvider currently only handles sources, not providers. You can check this if you go to https://openverse.org/sources - you'll see NASA in the table. This uses the stats endpoint that queries sources, not providers.

All this is to say that I think we can proceed with this PR, and exactly because we rely on ContentProvider to filter sources. I'll review the code again to check that this is true.

> All to say: this is pretty complicated but so long as we are not currently relying on ContentProvider to exclude a source then it should work. If that's the case, then great, we can move forward with the PR as is, and then circle back to this question of how to (and whether to) manage source inclusion/exclusion. Otherwise, we probably need to stick to exclusion.

Could we open a new issue (maybe even a Project proposal request?) to

By the way, some of the issues mention that a source can appear in several providers. I think we have a project for implementing this, this would reduce the number of duplicates we have, but we would need to find a way to handle a work that has several providers.

Footnotes

  1. https://github.com/cc-archive/cccatalog-api/issues/531

Contributor

> All this is to say that I think we can proceed with this PR, and exactly because we rely on ContentProvider to filter sources. I'll review the code again to check that this is true.

If ContentProvider is only for sources then we need to change the filter here to source, not provider.

Member Author

> All this is to say that I think we can proceed with this PR, and exactly because we rely on ContentProvider to filter sources. I'll review the code again to check that this is true.

> If ContentProvider is only for sources then we need to change the filter here to source, not provider.

This is exactly what I proposed in the first comment of this thread, which is... deleted?? I don't understand what happened 😳 Anyway, I 100% agree that the current naming is a mess and makes it very confusing to tell what is actually a source and what is a provider. I'll apply the minimal changes required to fulfill the issue requirement and keep the PR simple. As @obulat suggests, the renaming surely warrants its own issue, maybe even a small project.

> Originally, we only had source. Then, the provider field was added to denote the provider of the metadata about the image [1].

That is a curious fact. I thought the provider field came before the source given that the model's name is ContentProvider.

Contributor

If we only care about sources and not providers, then we should "just" (in quotes because maybe it isn't trivial to do) rename ContentProvider to ContentSource (although, I don't know why the "content" prefix is necessary) and the variables/serializers/documentation to clarify it (e.g. in the stats endpoint).

We don't need to add structure that is never used, so if ContentProvider only ever really means to reference the sources, then we can just forget about representing providers as an exclusive category at the API database level altogether, at least until we actually have a reason to do that.

> This uses the stats endpoint that queries sources, not providers

@obulat the stats endpoint queries ContentProvider, so I don't think this is in any way obvious from the current implementation at the naming level, and certainly not from just reading the API code... my intuition would be to assume there was a confusion between source and provider, rather than that the difference between how the frontend and the API refer to these was intentional.

Just want to clarify that this is messy and needs to be cleaned up, otherwise we're relying on latent, undocumented understandings of how the usage of that particular model changed, without ever having made the code reflect that, not even at the public documentation level:

https://api.openverse.org/v1/#tag/images/operation/images_stats

The documentation there now says that display_name is "The name of content provider, e.g. Flickr", contrasted to source_name, which is "The source of the media, e.g. flickr". The difference is unexplained, and there's no way to know that ContentProvider actually represents sources based on how the code uses it.

@action(detail=False, serializer_class=ProviderSerializer, pagination_class=None)
def stats(self, *_, **__):
    source_counts = search_controller.get_sources(self.default_index)
    context = self.get_serializer_context() | {
        "source_counts": source_counts,
    }
    providers = ContentProvider.objects.filter(
        media_type=self.default_index, filter_content=False
    )
    serializer = self.get_serializer(providers, many=True, context=context)
    return Response(serializer.data)

Note that the serializer, model, and variable name of the retrieved data are all provider, not source.

Really just wanting to make it clear that this isn't clear in the code, and I think we should consider treating resolving that with greater urgency than we have thus far. The source/provider distinction is terribly tedious, and anything we do to clarify it even for ourselves is worth the effort.

Contributor

Thanks for opening the issue, by the way, @krysal! For visibility, it is here:

#4346

Member Author

I agree 100% with you @sarayourfriend, so I raised the issue's priority.

Contributor

@sarayourfriend sarayourfriend left a comment

Just requesting changes to clarify the PR is under discussion while we decide what steps we want to take now.

@WordPress WordPress deleted a comment from krysal May 16, 2024
Contributor

@obulat obulat left a comment

My request for changes, in light of the sources/provider discussion, is to replace provider with source everywhere it's necessary, and add a comment on the ContentProvider model to say that it's actually a content source, not provider (until we rename the model)

Comment on lines 200 to 201
    if enabled_providers:
        return Q("terms", provider=enabled_providers)
Member Author

Suggested change
-    if enabled_providers:
-        return Q("terms", provider=enabled_providers)
+    if enabled_providers:
+        return Q("terms", source=enabled_sources)

@krysal krysal force-pushed the fix/search_with_sources_without_ContentProvider branch from 00c8914 to 2e0c437 Compare May 16, 2024 19:42
@krysal krysal force-pushed the fix/search_with_sources_without_ContentProvider branch from 2e0c437 to f9b0d0e Compare May 16, 2024 19:57
Contributor

@obulat obulat left a comment

I'll approve after testing this PR locally.

PROVIDER = "provider"
QUERY_SPECIAL_CHARACTER_ERROR = "Unescaped special characters are not allowed."
ENABLED_SOURCES_CACHE_KEY = "enabled_sources"
ENABLED_SOURCES_CACHE_VERSION = 2
Contributor

Thank you for removing the unused constants from 5 years ago :)

Member Author

I didn't even know they were so old, wow! What is strange is that there should be a pre-commit lint step that fails due to unused variables 🤔 Not sure what happened with that...

Contributor

You can't reliably lint for unused variables declared at the module scope in Python, because all top-level variables are automatically exported. It'd be like trying to lint an unused export const whatever; in JavaScript. It can't be done unless you statically analyse all references to the module. I don't know if mypy supports something like that, but certainly ruff never could.

Same with "unused imports" in Python, because importing in Python always execs the module (assuming the default loader is used) and so it can have side effects. I think there are even further caveats with name shadowing if you use star imports but I haven't refreshed my memory on that in a while!

All of which is just to say, it's not possible to lint for unused module-level declarations without something like mypy, which we don't use (hopefully one day we will, though).
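
A small illustration of the limitation: a single-file check can only list module-level names that are never read within that same file. It cannot tell whether another module imports them, so its output is a list of candidates at best. The sketch below uses the standard `ast` module; the sample source and the function name are made up:

```python
import ast


def unused_candidates(source):
    """Module-level assigned names that are never read inside the same module."""
    tree = ast.parse(source)
    assigned = set()
    for node in tree.body:  # top-level statements only
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned.add(target.id)
    # Every name that is read anywhere in the module, at any nesting depth.
    loaded = {
        n.id
        for n in ast.walk(tree)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    }
    return assigned - loaded


sample = '''
USED = 1
MAYBE_IMPORTED_ELSEWHERE = 2

def f():
    return USED
'''
candidates = unused_candidates(sample)
```

The second constant is flagged, yet nothing in this file can establish that no other module imports it, which is why such a check needs whole-project analysis to be reliable.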

api/test/unit/controllers/test_search_controller.py Outdated
Contributor

@obulat obulat left a comment

Everything works well locally 🎉

I think we do need to change the variable name in api/test/unit/controllers/test_search_controller.py

krysal and others added 5 commits May 20, 2024 15:20
* Change search approach to include only available providers
* Replace `get_excluded_providers_query` with `get_filtered_providers_query`
Co-authored-by: Olga Bulat <obulat@gmail.com>
Co-authored-by: sarayourfriend <git@sarayourfriend.pictures>
@krysal krysal force-pushed the fix/search_with_sources_without_ContentProvider branch from f9b0d0e to f3677c5 Compare May 20, 2024 19:22
Member Author

krysal commented May 20, 2024

I had to rebase it with main because the API image wasn't being built correctly. Other than that, the variable in the test is updated ✔️

@openverse-bot
Collaborator

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@dhruvkb
@stacimc
@sarayourfriend
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)[2].

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@sarayourfriend
Contributor

I'll re-review this today 👍

Contributor

@sarayourfriend sarayourfriend left a comment

LGTM!

@krysal krysal merged commit 3d7fa09 into main May 21, 2024
50 checks passed
@krysal krysal deleted the fix/search_with_sources_without_ContentProvider branch May 21, 2024 19:54
AetherUnbound added a commit that referenced this pull request May 28, 2024
AetherUnbound added a commit that referenced this pull request May 28, 2024
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API
Projects
Status: 🤝 Merged
Development

Successfully merging this pull request may close these issues.

Exclude media from sources without ContentProvider record from search
4 participants