Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ranking #941

Closed
MartinKolarik opened this issue Apr 22, 2022 · 27 comments
Closed

Improve ranking #941

MartinKolarik opened this issue Apr 22, 2022 · 27 comments

Comments

@MartinKolarik
Copy link
Collaborator

After including package downloads within the search results, we noticed a very common problem with relevance. I believe it's caused by ignoring "js" and probably typo tolerance, and I wonder if there's a way to improve this. A few examples:

  1. https://www-jsdelivr-com-pr-400.onrender.com/?query=vue
    First match "vue" (correct), then "vue.js" (garbage), then "vuejs" (garbage, even marked as deprecated), then relevant vue-related packages.

  2. https://www-jsdelivr-com-pr-400.onrender.com/?query=react
    First match "react" (correct), then "reactjs" (garbage), then re-act (garbage), then relevant react-related packages.

  3. https://www-jsdelivr-com-pr-400.onrender.com/?query=bootstrap
    First match "bootstrap" (correct), then "bootstrapjs" (garbage).

I know that in some cases, it's the other way around and "package.js" is the correct popular name while "package" is garbage, so this isn't easy, but at the same time, npm has very often multiple packages with similar names, so this happens for the majority of searches. To get correct results, we'd need to allow the "relaxed" matching (js ignore and typos) only if a popular match isn't found.

I'm not sure how doable this is but one idea I got so far - maybe instead of ignoring "js" globally, only set the other version as alternate name for packages that seem relevant (downloads above some threshold, not deprecated, have readme, ...). Non-relevant packages wouldn't get this relaxed matching.

@Haroenv
Copy link
Collaborator

Haroenv commented Apr 25, 2022

Technically I'm not sure how this could be done, but we could indeed have a certain download number threshold before adding alternative names maybe? That way at least those three examples would not show up anymore, it bothered me too.

The one that will be harder to avoid is re-act as Algolia sees that as a single typo, which still may list higher than other matches usually would.

@bodinsamuel
Copy link
Contributor

just ordering by download should help no?

@Haroenv
Copy link
Collaborator

Haroenv commented Apr 25, 2022

The problem is that custom ranking comes after word relevance, unless you mean resorting in the frontend @bodinsamuel

@MartinKolarik
Copy link
Collaborator Author

Downloads are already used as part of the ranking but as I understand they have lower priority than the match on project name, so they don't help here.

Something I'm not 100% sure about though - I would expect that "ignored words" apply only to whole words, not parts of them so in "vuejs" the "js" suffix would not be ignored. But based on the search results, it either is ignored even when it's a suffix (in which case my proposed solution would help) or it's some other algolia feature that makes this rank higher (in which case the threshold idea may not help).

@Haroenv
Copy link
Collaborator

Haroenv commented Apr 25, 2022

The js is not an ignored word, but it's listed as an alternativeName, which is listed after the official name in searchableAttributes

@bodinsamuel
Copy link
Contributor

no frontend reordering :p I thought putting popular sooner in the ranking would help but it's the rule that promote those package indeed.

we could remove the rule for promoting alternativeName? (I'm not entirely when it's relevant)
and also add a rule to demote deprecated packages.
wdyt @Haroenv?

@MartinKolarik
Copy link
Collaborator Author

MartinKolarik commented Apr 25, 2022

Oh I see, I previously looked only at the index config but there's a whole bunch of logic that adds alternative names in the code. Then my previous suggestion stands - do this only if the package seems relevant based on other metadata.

@Haroenv
Copy link
Collaborator

Haroenv commented Apr 25, 2022

deprecated should already show up lower according to ranking, but if you spell it correctly, it should still be the first result, as sometimes you would be looking for that.

I think we can add a certain popularity threshold before packages get alternative names. Will look into a PR

@MartinKolarik
Copy link
Collaborator Author

MartinKolarik commented Apr 25, 2022

I agree you should get deprecated packages when spelled correctly but not sure about alternative names - there it's a similar problem that if the name is similar enough, deprecation has too low priority to push the result down. My idea was to use multiple attributes for this because downloads may not always be enough, e.g. add it only if all:

  1. Is not deprecated.
  2. Has 1k npm downloads or 10k jsDelivr downloads.
  3. Has non-empty description field or a readme file or a source repository linked or homepage.

If you didn't want to use the other rules then I suppose a higher popularity threshold.

@Haroenv
Copy link
Collaborator

Haroenv commented Apr 25, 2022

#951 should actually fullfill most of those use cases, at the cost of not showing popular alternative names that match in a first or second position. I think those cases are fairly small though, only one I could think of is places.js, hogan.js or floating.js (small one) but they all still eventually show up

@MartinKolarik
Copy link
Collaborator Author

Any update on the reindex progress? Are we able to test the new attributes now?

@Haroenv
Copy link
Collaborator

Haroenv commented Jun 13, 2022

I don't think we started a full reindex, as it stops the synchronisation, and would take a very long time to catch up, while we don't yet have code for handling multiple instances.

@bodinsamuel
Copy link
Contributor

yup sorry with all the holidays I did not had time to look into it.
What we did in the past is to pop a new instance that will create a new index and once it's ready swap the index.

I'll try to setup that today or tomorrow, full indexing takes more than a week.

@MartinKolarik
Copy link
Collaborator Author

@bodinsamuel did you have time to set up the other instance?

@bodinsamuel
Copy link
Contributor

Yes it's currently running ☺️

@bodinsamuel
Copy link
Contributor

NB: you can test what you want using the tmp index named npm-search-bootstrap-6 it will be renamed npm-search-6 at some point. And then will replace the main index.

@MartinKolarik
Copy link
Collaborator Author

The index doesn't use the newly added field yet because we wanted to test it first after all packages are updated (0e5194d) without affecting the main index. Considering the bootstrap runs in a separate index anyway, what would be the best course of action? I suppose we could update the config right away?

@bodinsamuel
Copy link
Contributor

ah true, forgot about that. Since we did not merge, all manual changes will be erased after some time by the script so not good. We could merge the PR in master and do not deploy the main process

@bodinsamuel
Copy link
Contributor

I have swapped the index since it had fully reindexed.
Duplicated it with the name npm-search-test and manually applied your changes from #994 on it

@MartinKolarik
Copy link
Collaborator Author

Awesome, it works really well! I tested all packages from my initial post here as well as those from comments in #951 and all of them return relevant results now.

@MartinKolarik
Copy link
Collaborator Author

@bodinsamuel when you have time please check and swap the indexes if possible 😄

@MartinKolarik
Copy link
Collaborator Author

@bodinsamuel @Haroenv ping, can we get this live, please? 😄

@bodinsamuel
Copy link
Contributor

I'll merge 👌🏻

@MartinKolarik
Copy link
Collaborator Author

Also not sure if the indexes were swapped yet.

@bodinsamuel
Copy link
Contributor

Applied on the main index 👍🏻

@MartinKolarik
Copy link
Collaborator Author

@bodinsamuel
Copy link
Contributor

Sorry for the delay, it's hard to allocate time internally 🙇🏻 Thanks a lot for your patience

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants