Disable unused simple query string features #3327

Closed
sarayourfriend opened this issue Nov 8, 2023 · 2 comments · Fixed by #3360
Labels
🕹 aspect: interface Concerns end-users' experience with the software 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API 🧱 stack: frontend Related to the Nuxt frontend

Comments

@sarayourfriend
Contributor

Problem

Simple query string makes it easy for us to accept user input to directly pass as a query term to Elasticsearch without worrying about messy or bad inputs causing unexpected behaviour for users. However, simple query string is expensive, so if it's possible to turn off any of the features that (a) we don't see people using (when we check logs) or (b) don't want people using (fuzzy, wildcard, etc?) then we can gain some performance benefits.

Description

This is a two-part issue. For each of the simple query string features that can be turned off via the limit operators (the flags parameter), investigate whether it has been used in the last month by querying the API logs in CloudWatch. Any feature used fewer than 100 times in the last month is worth turning off: that is roughly 3 requests per day, a statistically meaningless number in the context of several hundred thousand requests per day.

My guess is we can probably turn off the following:

  • Fuzzy
  • Near
  • Prefix
  • Slop
  • Escape

Note: This also requires modifying the frontend "search syntax" guide to remove mention of features we no longer support after this change, hence the "frontend" label alongside the API label.

Additional context

Suggested to us by Greg who works on Jetpack.

@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents 🕹 aspect: interface Concerns end-users' experience with the software 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API 🧱 stack: frontend Related to the Nuxt frontend labels Nov 8, 2023
@obulat obulat self-assigned this Nov 14, 2023
@obulat
Contributor

obulat commented Nov 15, 2023

We are currently using the default flags for the simple query string, which is identical to "flags": "AND|ESCAPE|FUZZY|SLOP|NEAR|NOT|OR|PHRASE|PRECEDENCE|PREFIX|WHITESPACE"

I used a query like this to investigate how much each feature was used in the last 4 weeks:

filter @logStream like "nginx"
| parse request /(?<httpMethod>(GET|POST|PUT|DELETE)) \/v1\/(?<mediaType>(images|audio))\/(?<queryParams>\?.*)?/
| parse queryParams /q=(?<qValue>[^&]*)/ 
| fields status, upstream_response_time
| filter status == 200 and qValue like "~"

I also noticed that many of the requests are the same as the ones on the https://openverse.org/search-help page. For many features, most of the requests with them are for the examples from the Search help page. This probably means that the users try them out, but don't end up using them consistently, and these features can be turned off.

So, for each feature I selected the responses that include the feature but don't include the example from the search help page, like | filter status == 200 and qValue like "~" and not qValue like "theatre~1"
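The per-feature filter described above follows one pattern, so it could be generated rather than typed by hand. This is a minimal sketch in Python; `build_filter` is a hypothetical helper for illustration and is not part of the Openverse codebase. It only builds the query text and does not talk to CloudWatch.

```python
from typing import Optional


def build_filter(operator: str, help_page_example: Optional[str] = None) -> str:
    """Build a Logs Insights filter line for requests whose q value contains `operator`.

    `help_page_example` is the example from the search help page that should be
    excluded from the count (hypothetical parameter name).
    """
    clause = f'| filter status == 200 and qValue like "{operator}"'
    if help_page_example is not None:
        clause += f' and qValue not like "{help_page_example}"'
    return clause


# Filter for the fuzzy/slop operator, excluding the help page example.
print(build_filter("~", "theatre~1"))
# | filter status == 200 and qValue like "~" and qValue not like "theatre~1"
```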

Flags we can remove

~: FUZZY and SLOP|NEAR

FUZZY
Enables the ~N operator after a word, where N is an integer denoting the allowed edit distance for matching. See Fuzziness.
NEAR or SLOP (these are synonyms)
Enables the ~N operator after a phrase, where N is the maximum number of positions allowed between matching tokens.

There are 1,837 records for this query in total for the last 4 weeks, but if we filter out theatre~1, which is the example we have on the search syntax page, using | filter status == 200 and qValue like "~" and qValue not like "theatre~1", we get 59 records (out of 65 million).

*: PREFIX

If we exclude the example from the search syntax page (net*), we get 220 records in the logs (out of 65 million).

\: ESCAPE

This is not described on the search help page. The Logs Insights query | filter qValue like "\" or qValue like /%5C/ matches more than 1000 records, but these appear to be the result of users accidentally pressing the \ key instead of the Enter key at the end of the search term, because the backslash appears as the last character. E.g.: GET /v1/images/?q=%5C+basketball&license_type=commercial,modification
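When triaging such requests, decoding the logged q value first makes it easier to see where the backslash sits in the search term. This is a small sketch using only the Python standard library; `backslash_position` is a hypothetical helper name.

```python
from urllib.parse import unquote_plus


def backslash_position(raw_q: str) -> str:
    """Classify where a backslash sits in a URL-encoded q value."""
    decoded = unquote_plus(raw_q)  # "+" becomes a space, "%5C" becomes "\"
    if "\\" not in decoded:
        return "none"
    if decoded.startswith("\\"):
        return "start"
    if decoded.endswith("\\"):
        return "end"
    return "middle"


# The q value from the example request above.
print(repr(unquote_plus("%5C+basketball")))  # '\\ basketball'
print(backslash_position("%5C+basketball"))  # start
```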

(): PRECEDENCE

The query | filter qValue like /%28.*%29/ returns 500 results for the last 4 weeks.
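To put the counts above in proportion, here is a short Python sketch that computes each feature's share of traffic. The counts and the 65 million total are the figures quoted in this comment (the total is a rounded number); ESCAPE is left out because only a lower bound ("more than 1000") was reported. `traffic_share` is a hypothetical helper.

```python
TOTAL_REQUESTS = 65_000_000  # approximate total for the last 4 weeks, as quoted above

# Requests per feature over the same period, as quoted above.
counts = {
    "FUZZY / SLOP|NEAR (~)": 59,
    "PREFIX (*)": 220,
    "PRECEDENCE (())": 500,
}


def traffic_share(count: int, total: int = TOTAL_REQUESTS) -> float:
    """Return the fraction of all requests that used a feature."""
    return count / total


for flag, count in counts.items():
    print(f"{flag}: {count} requests, {traffic_share(count):.6%} of traffic")
```

Even the most used of the three stays below one thousandth of a percent of requests.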

Flags we shouldn't remove

+: AND

All of the spaces in the frontend requests are converted to +, so it's very widely used.

": PHRASE

We have some special handling for the quotes. There are 8892 requests with quotes in the last 4 weeks (queried using the escaped value filter qValue like /%22/).

-: NOT

I am not certain about this one. We do have more than 4000 queries containing +- (a space followed by a dash). Some of these requests suggest that the space was added accidentally and the user got unexpected results. E.g., old+high+-rise+building did not match any documents with high-rise, although the user might have wanted to match it exactly but accidentally added a space before the dash.
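The effect of the stray space can be illustrated by decoding the q value and looking for terms that start with a dash. This Python sketch uses a plain whitespace split as a simplification; it is not how Elasticsearch parses the query, and `negated_terms` is a hypothetical helper.

```python
from typing import List
from urllib.parse import unquote_plus


def negated_terms(raw_q: str) -> List[str]:
    """Return the terms that a leading dash would mark for negation."""
    decoded = unquote_plus(raw_q)  # "+" in the URL decodes to a space
    return [t[1:] for t in decoded.split() if t.startswith("-") and len(t) > 1]


print(unquote_plus("old+high+-rise+building"))   # old high -rise building
print(negated_terms("old+high+-rise+building"))  # ['rise']
print(negated_terms("old+high-rise+building"))   # []
```

With the space, "rise" becomes a negated term; without it, the dash stays inside the word "high-rise" and nothing is negated.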

WHITESPACE

Enables splitting terms by whitespace.

As a result, I think we should enable only the following flags for the simple query string: "AND|NOT|PHRASE|WHITESPACE"
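As a rough illustration of the proposal, the Python sketch below lists which flags would be switched off relative to the defaults quoted earlier, and shows what a raw simple_query_string request body with the restricted flags might look like. The helper names are hypothetical, and this is not how the Openverse API actually builds its queries.

```python
from typing import List

DEFAULT_FLAGS = "AND|ESCAPE|FUZZY|SLOP|NEAR|NOT|OR|PHRASE|PRECEDENCE|PREFIX|WHITESPACE"
PROPOSED_FLAGS = "AND|NOT|PHRASE|WHITESPACE"


def disabled_flags(current: str, proposed: str) -> List[str]:
    """Return the flags present in `current` but absent from `proposed`."""
    keep = set(proposed.split("|"))
    return [flag for flag in current.split("|") if flag not in keep]


def build_query(term: str, flags: str = PROPOSED_FLAGS) -> dict:
    """Build a raw Elasticsearch request body limited to the given flags."""
    return {"query": {"simple_query_string": {"query": term, "flags": flags}}}


print(disabled_flags(DEFAULT_FLAGS, PROPOSED_FLAGS))
# ['ESCAPE', 'FUZZY', 'SLOP', 'NEAR', 'OR', 'PRECEDENCE', 'PREFIX']
print(build_query('"high-rise" building'))
```

Note that the difference also includes OR, which the analysis above does not examine separately.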

@sarayourfriend
Contributor Author

Fantastic research, Olga! 👏 👏 👏
