New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create alternative mechanism for school search API #21717
Conversation
dashboard/lib/autocomplete_helper.rb
Outdated
@@ -15,6 +15,12 @@ def self.format_limit(limit) | |||
return [MIN_LIMIT, [limit.to_i, MAX_LIMIT].min].max | |||
end | |||
|
|||
def self.get_query_terms(query) | |||
query.strip.split(/\s+/).map do |w| | |||
w.gsub(/\W/, '').upcase.presence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're stripping word characters from query terms here but I don't see a matching operation on the other side of the query - are all the names we're matching against also limited to word characters?
My concern is that we won't match, or will at least be biased against, schools with non-word characters in their names:
- Hyphens (Jonesboro-Hodge High School)
- Periods (St. Martin's Episcopal School)
- Apostrophes (Waiʻanae High School)
Based on a little reading it looks like extended-set characters will be handled properly by the \W
class, but probably worth having a test for. (Nānākuli High)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I also see you just moved this functionality, didn't add it. Still interested in your take though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had not thought about this. There are 17,744 schools with a non-word, non-whitespace character in the name or city. These are mostly -
or '
with a few others thrown in(,
, &
, @
,/
, etc.) Poking around it seems like these are inconsistent in the data. For example, sometimes we have "WINSTON-SALEM" and sometimes "WINSTON SALEM" and sometimes I see "CHILDREN'S" and sometimes "CHILDRENS". That means that we won't get the right results if we just leave those non-word characters as literals in the search. Instead, it seems reasonable to split on non-word character as well as on whitespace. That means that if you search for "WINSTON-SALEM" we'll look for "WINSTON" and for "SALEM" which will match both "WINSTON-SALEM" and "WINSTON SALEM" Since this is an issue even in the original search, I am changing it in both and adding a few tests.
Any other comments on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. LGTM!
I've been experimenting with ways to improve the school search API and this is what I came up with. It allows matches on any subset of the search terms and matches against the school name, city, and zip code. The results are sorted by the number of terms that matched. Just playing around it seemed to die reasonable results. Even so, I didn't want to replace the current behavior. Instead I parameterized the API so that you can selectively use the old or the new approach. We may want to do an A/B test to see if the new version results in fewer 'school not found' selections without an negative impact on other metrics.
I haven't had a chance to do any investigation into the performance impact of this change. That should be investigated before an significant use in prod.
I tried a few other things before landing here. The call to
MATCH(...) AGAINST ...
returns the MySQL relevance score for the match. Initially, I tried just removing the+
's from the query to allow any terms to match and sorting by relevance. That didn't work well since it would double count terms. This was especially problematic when the same term appears in both the school name and the city, which is very common. That term get over weighted and the sort order is not very good. I tried mitigating this by creating indexes on just name and just city and matching those separately. In the end I determined that the relevance score isn't going to do what we want. In particular, it often gives different weights to different terms.