Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create alternative mechanism for school search API #21717

Merged
merged 3 commits into from Apr 12, 2018

Conversation

drewsamnick
Copy link
Contributor

I've been experimenting with ways to improve the school search API and this is what I came up with. It allows matches on any subset of the search terms and matches against the school name, city, and zip code. The results are sorted by the number of terms that matched. Just playing around it seemed to die reasonable results. Even so, I didn't want to replace the current behavior. Instead I parameterized the API so that you can selectively use the old or the new approach. We may want to do an A/B test to see if the new version results in fewer 'school not found' selections without an negative impact on other metrics.

I haven't had a chance to do any investigation into the performance impact of this change. That should be investigated before an significant use in prod.

I tried a few other things before landing here. The call to MATCH(...) AGAINST ... returns the MySQL relevance score for the match. Initially, I tried just removing the +'s from the query to allow any terms to match and sorting by relevance. That didn't work well since it would double count terms. This was especially problematic when the same term appears in both the school name and the city, which is very common. That term get over weighted and the sort order is not very good. I tried mitigating this by creating indexes on just name and just city and matching those separately. In the end I determined that the relevance score isn't going to do what we want. In particular, it often gives different weights to different terms.

@@ -15,6 +15,12 @@ def self.format_limit(limit)
return [MIN_LIMIT, [limit.to_i, MAX_LIMIT].min].max
end

def self.get_query_terms(query)
query.strip.split(/\s+/).map do |w|
w.gsub(/\W/, '').upcase.presence
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're stripping word characters from query terms here but I don't see a matching operation on the other side of the query - are all the names we're matching against also limited to word characters?

My concern is that we won't match, or will at least be biased against, schools with non-word characters in their names:

  • Hyphens (Jonesboro-Hodge High School)
  • Periods (St. Martin's Episcopal School)
  • Apostrophes (Waiʻanae High School)

Based on a little reading it looks like extended-set characters will be handled properly by the \W class, but probably worth having a test for. (Nānākuli High)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I also see you just moved this functionality, didn't add it. Still interested in your take though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not thought about this. There are 17,744 schools with a non-word, non-whitespace character in the name or city. These are mostly - or ' with a few others thrown in(,, &, @,/, etc.) Poking around it seems like these are inconsistent in the data. For example, sometimes we have "WINSTON-SALEM" and sometimes "WINSTON SALEM" and sometimes I see "CHILDREN'S" and sometimes "CHILDRENS". That means that we won't get the right results if we just leave those non-word characters as literals in the search. Instead, it seems reasonable to split on non-word character as well as on whitespace. That means that if you search for "WINSTON-SALEM" we'll look for "WINSTON" and for "SALEM" which will match both "WINSTON-SALEM" and "WINSTON SALEM" Since this is an issue even in the original search, I am changing it in both and adding a few tests.

@drewsamnick
Copy link
Contributor Author

Any other comments on this?

Copy link
Contributor

@islemaster islemaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. LGTM!

@drewsamnick drewsamnick merged commit 6337783 into staging Apr 12, 2018
@drewsamnick drewsamnick deleted the school-search-improvement branch April 12, 2018 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants