Skip to content
This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Use terms instead of term #764

Closed
wants to merge 4 commits into from
Closed

Conversation

PrajwalBorkar
Copy link

Fixes

Fixes #698

Description

Updated term to terms

Testing Instructions

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@PrajwalBorkar PrajwalBorkar requested a review from a team as a code owner June 21, 2022 06:23
@openverse-bot openverse-bot added this to Needs review in Openverse PRs Jun 21, 2022
@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟩 priority: low Low priority and doesn't need to be rushed labels Jun 21, 2022
@sarayourfriend
Copy link
Contributor

sarayourfriend commented Jun 21, 2022

Thanks for the PR @PrajwalBorkar! Welcome to Openverse as well 🎉

If my understanding of this is correct I think we can also get rid of the for loop, we don't need to build an array of queries anymore, just the single "terms" query. @dhruvkb can hopefully confirm this though. I'm not very familiar with this.

Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @sarayourfriend is correct. Changing from term to terms removes the need to build an array and we can directly specify multiple queries.

api/catalog/api/controllers/search_controller.py Outdated Show resolved Hide resolved
@PrajwalBorkar
Copy link
Author

Thanks for the PR @PrajwalBorkar! Welcome to Openverse as well 🎉

If my understanding of this is correct I think we can also get rid of the for loop, we don't need to build an array of queries anymore, just the single "terms" query. @dhruvkb can hopefully confirm this though. I'm not very familiar with this.

I'm not familiar with the code and functions so it is difficult to understand. I made the changes as per my knowledge please review.

@sarayourfriend
Copy link
Contributor

sarayourfriend commented Jun 23, 2022

No worries. I've broken down the code in question and I'll try to walk through the discrete parts for you and explain why they need to change. Hopefully this helps. If you have any questions about it or still don't understand, please do not hesitate to ask.

def _apply_filter(
    s: Search,
    search_params: MediaSearchRequestSerializer,
    serializer_field: str,
    es_field: Optional[str] = None,
    behaviour: Literal["filter", "exclude"] = "filter",
):
    if serializer_field in search_params.data:
        filters = []
        for arg in search_params.data[serializer_field].split(","):
            _param = es_field or serializer_field
            args = {"name_or_query": "term", _param: arg}
            filters.append(Q(**args))
        method = getattr(s, behaviour)
        return method("bool", should=filters)
    else:
        return s

Let's start with the parameters.

  • s: Search. This is a stateful Search object that represents the current search we would like to execute. When we call methods on Search we manipulate the underlying search configuration. This allows us to programatically build up a search query before we actually send it to Elasticsearch for execution. You can see this in action in these two lines quoted below. They retrive one of the search manipulation methods (either exclude or include) off of the search object and then call it with the filters configured based on the search parameters passed to the _apply_filter function. Don't worry if you don't understand the exact implementation yet, as it uses some intermediate and abstract Python techniques to make the code more flexible. The important thing to know is that the search is being updated with the filters.
method = getattr(s, behaviour)
return method("bool", should=filters)
  • search_params: MediaSearchRequestSerializer. This parameter represents the search query that was sent by the requester over HTTP to the API. This will include things like the search term, potential license filters, amongst other things. For audio searches it will be an AudioSearchRequestSerializer. For image searches it will be an ImageSearchRequestSerializer. In this case we don't need to worry about which we have because we're going to apply a specific search filter. Which brings us to the next parameter...
  • searializer_field: str. This parameter corresponds to one of the fields defined on the serializer passed as search_params. For example, it could be license, size (for images), or duration (for audio), amongst others. This field name may also correspond to the field name of the indexed Elasticsearch documents. For historical reasons, sometimes they do not correspond directly, which brings us to the next parameter...
  • es_field: Optional[str]. This parameter is used when the name serializer_field name does not match the Elasticsearch index property name for the attribute being filtered against. For example, the license query param (present on the serializer as license) actually corresponds to filtering against the license__keyword property in the indexed Elasticsearch documents.
  • Finally, behaviour: Literal["filter", "exclude"]. This parameter defines whether we will include or exclude documents from Elasticsearch when they match the values passed through in the serializer. For example, we set this to filter for license when we want to include only records that have a particular license. We set this to exclude for sources when excluded_source is passed to the API to remove results from the specified provider(s). The reason we use filter instead of the potentially clearer include is because this parameter also corresponds to the name of the method on the Search object that we are passing the query configuration to.

Okay, now that we have an explanation of the parameters, let's look at how we use them.

The first thing we check is whether serializer_field is in search_params.data. Recall that search_params is a Django Rest Framework Serializer instance. data is a Python dict that holds key value pairs of the search query params where the key is a query param name and the values are the values passed for that query param. If we call the API with provider=flickr,nasa then search_params.data could look like this:

{
    "provider": "flickr,nasa"
}

Notice that if the requester does not pass a particular parameter, it will not be present in the serializer's data field. This means we can skip applying the filter altogether because we don't have any values to tell Elasticsearch to filter or exclude against anyway. That's why you see the else: return s at the end of the method. Another way to write this that would be potentially clearer would be to use an early return:

if serializer_field not in search_params.data:
    return s

# otherwise apply the filter

This is a matter of taste/preference though. I share this just to give a different perspective on the code in case it helps you to understand it better.

Once we have established that we do have data for the serializer field in question, we then build a list of filters to eventually pass to Elasticsearch. In the example I provided above for the provider=flickr,nasa example, we will build two provider filters, one for each of the values of flickr and nasa. This, precisely, is the area of the code that needs to change for the issue's description.

For the current behavior, we first define an empty filters list. This will hold each of the filter Query objects that we create that will eventually be passed to the Search object's filter or exclude method.

The Query object requires us to pass the specific Elasticsearch document property name. This is where the es_field parameter comes in. If it is defined, then we use that. Otherwise, we fall back to the serializer_field and assume that the serializer field has the same name as the Elasticsearch document property. Then we create the list of arguments to pass to the Query constructor. Query, or Q for short, mirrors the Django ORM's filter API, wherein it accepts Python keyword arguments where the key is the Elasticsearch document property and the value is the value to query for. Additionally, the Q factory function for Query takes a name_or_query parameter. This parameter is kind of confusing and (in my opinion) has a clumsy name. The important think to know about it though, is that in our use-case, whatever you pass to it needs to correspond to the type of Query DSL object you want the Q factory function to return. Underneath the hood, the Q function retrieves the DSL class that corresponds to the value you pass for name_or_query and then instantiates that class for you, giving you a Query part object to represent the part of the query. It uses the rest of the keyword arguments you pass to configure that query. In our case, we use the term DSL class and give it a single term (or Elasticsearch document property name) and a single value to match for that term.

This is where the issue's requested changes come in. In our current approach, we create a single term query part for each of the values of the serializer field by iterating over the values. Instead, we can short-cut the iteration process and use the terms query part which allows for multiple, comma separated values to be present in the query values! For example, instead of building a query that looks like this (this is pseudocode, not actually representative of what's going on under the hood with Elasticsearch's DSL!):

query = [
  { "type": "term", "provider": "flickr" },
  { "type": "term", "provider": "nasa" },
]

we can do the following:

query = { "type": "terms", "provider": "flickr,nasa" }

Much cleaner! When we consider that we can have dozens of values to iterate over, it becomes much clearer to use the terms version. It also removes some code (namely the for loop) and the best code to maintain is code that doesn't exist!

So, in the end, I think the code that needs to change goes something like this:

diff --git a/api/catalog/api/controllers/search_controller.py b/api/catalog/api/controllers/search_controller.py
index 0a9fb980..0a3fa8a5 100644
--- a/api/catalog/api/controllers/search_controller.py
+++ b/api/catalog/api/controllers/search_controller.py
@@ -168,17 +168,15 @@ def _apply_filter(
     :return: the input ``Search`` object with the filters applied
     """
 
-    if serializer_field in search_params.data:
-        filters = []
-        for arg in search_params.data[serializer_field].split(","):
-            _param = es_field or serializer_field
-            args = {"name_or_query": "term", _param: arg}
-            filters.append(Q(**args))
-        method = getattr(s, behaviour)
-        return method("bool", should=filters)
-    else:
+    arguments = search_params.data.get(serializer_field)
+    if arguments is None:
         return s
 
+    parameter = es_field or serializer_field
+    query = Q("terms", **{parameter: arguments})
+    method = getattr(s, behaviour)
+    return method("bool", should=query)
+
 
 def _exclude_filtered(s: Search):
     """

Note: I'm not sure if this is the exact final code. I haven't tested this and I'm not personally familiar enough with Elasticsearch to confirm at a glance if this is correct. However, it's a good starting point. If you apply these changes locally and run the application, see if you can read the logs and debug any potential issues that come up. I would start this process by sending some queries to your local API that use both single and multiple values.

@PrajwalBorkar
Copy link
Author

Thanks a ton for a great explanation. I made the changes accordingly please review and test for any changes :)

Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To test this PR, you can use the license query parameter like so:
http://localhost:8000/v1/images/?q=cat&license=by-sa,cc0
It should return all the items that have CC BY_SA licenses, as well as the CC0. In this PR, it returns
"detail": "RequestError(400, 'x_content_parse_exception', '[terms] query does not support [license.keyword]')"
I'm not sure what needs to be updated, but I can help you figure it out if you want.

Co-authored-by: Olga Bulat <obulat@gmail.com>
@PrajwalBorkar
Copy link
Author

To test this PR, you can use the license query parameter like so: http://localhost:8000/v1/images/?q=cat&license=by-sa,cc0 It should return all the items that have CC BY_SA licenses, as well as the CC0. In this PR, it returns "detail": "RequestError(400, 'x_content_parse_exception', '[terms] query does not support [license.keyword]')" I'm not sure what needs to be updated, but I can help you figure it out if you want.

Thanks for the suggestions. Also I will need help to test this.

@sarayourfriend
Copy link
Contributor

@PrajwalBorkar have you been able to get the API running locally on your computer? We have instructions for how to do it here: https://wordpress.github.io/openverse-api/guides/quickstart.html

Then you should be able to visit http://localhost:8000/v1/images/?q=cat&license=by-sa,cc0 in your browser on your computer as Olga suggested.

@krysal
Copy link
Member

krysal commented Jul 29, 2022

What a nice detailed instructions @sarayourfriend! Is anyone interested in resuming this PR? We haven't heard from the author for a long time.

@AetherUnbound
Copy link
Contributor

We are going to close this, @ramadanomar will attempt the approach in a new PR 🙂

Openverse PRs automation moved this from Needs review to Closed Aug 23, 2022
@ramadanomar ramadanomar mentioned this pull request Aug 28, 2022
7 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed
Projects
No open projects
Openverse PRs
  
Closed
Development

Successfully merging this pull request may close these issues.

Use terms instead of term
7 participants