Poor search results not replicable; something wrong on server? #721

Closed
KingMob opened this issue May 19, 2019 · 14 comments
@KingMob

KingMob commented May 19, 2019

E.g., a search for "re-frame" has re-frame appearing on the second page of search results. Ditto for searching for "component"; com.stuartsierra/component isn't shown until the second page. This seems pretty undesirable to me.

@tobias
Member

tobias commented May 19, 2019

I agree, this isn't desirable behavior. I don't know much about Lucene, but if someone who does know more about it wants to look at solving this, the search code lives in https://github.com/clojars/clojars-web/blob/master/src/clojars/search.clj

@KingMob
Author

KingMob commented May 20, 2019

Maybe someone can coordinate with the cljdoc people? They're planning to update their Lucene as well. Since cljdoc currently uses clojars for search, anything not on the front page is unsearchable on cljdoc.

I might be able to take a look later.

@KingMob
Author

KingMob commented May 21, 2019

I'm beginning to suspect something else is wrong, maybe in the data/server, and not the code, because I can't replicate the problem locally. I was originally considering using Fieldable.setBoost() to prioritize artifact-ids over other fields, but now I'm wondering if something else is wrong with the existing index/cache/scoring/sorting/scaling/paging/etc.
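To illustrate the field-boost idea mentioned above, here is a minimal sketch in Python rather than Lucene itself. The weights in `FIELD_BOOSTS`, the helper names, and both sample documents (including their descriptions) are hypothetical; the point is only that weighting artifact-id hits more heavily can outrank a document that merely repeats the term in its description.

```python
import re

def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# Hypothetical per-field weights; real values would need tuning.
FIELD_BOOSTS = {"artifact-id": 3.0, "group-id": 1.5, "description": 1.0}

def boosted_score(query, doc):
    """Sum the matching tokens in each field, weighted by that field's boost."""
    terms = set(tokenize(query))
    score = 0.0
    for field, boost in FIELD_BOOSTS.items():
        hits = sum(1 for tok in tokenize(doc.get(field, "")) if tok in terms)
        score += boost * hits
    return score

# Illustrative documents; the descriptions are made up for this sketch.
exact = {"artifact-id": "component", "group-id": "com.stuartsierra",
         "description": "lifecycle management"}
mention = {"artifact-id": "widgets", "group-id": "example",
           "description": "a component library with component helpers"}

print(boosted_score("component", exact))
print(boosted_score("component", mention))
```

Without the boosts, the second document would win on raw hit count (two description hits against one artifact-id hit).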

I forked clojars, set it up, imported the entire clojars mvn with rsync, and made sure to overwrite the random all.edn with latest clojars version for correct stats... but I can't seem to generate the bad search results that clojars.org does. A local search for "re-frame" turns up first re-frame, then re-frame-10x, which seems pretty good. Whereas on clojars.org, how does "re-frame-re-play", a repo with only 18 downloads, appear first in the search results?

@KingMob
Author

KingMob commented May 21, 2019

May be related to #719. This is starting to feel more like an error somewhere, and less like a suboptimal scoring algorithm.

@KingMob
Author

KingMob commented May 22, 2019

I'm going to update the issue title to reflect recent investigation.

@KingMob KingMob changed the title from "Some exact matches not on first page of search results" to "Poor search results not replicable; something wrong on server?" May 22, 2019
@tobias
Member

tobias commented May 22, 2019

Interesting. I started a full re-index of all of the jars - we'll see if that fixes things. It will take a while to complete, so I'll report back when it is done.

@tobias
Member

tobias commented May 22, 2019

The index rebuild has finished, but I don't see that the results are any better.

@KingMob
Author

KingMob commented May 22, 2019

Interesting. It's actually gotten worse, if anything; re-frame is now on the third page of search results.

What else could it be? The stats/all.edn itself still seems good, and the code itself hasn't changed much in years. But it's not affecting most packages. A search for create-react-class, which has been DLed 420,000+ times, shows it as the first result out of 590 matches. Ditto for other popular packages like leinjacker and humane-test-output.

You know what it might be? I think Lucene is splitting words on the "-", either when indexing, parsing the query, or both. E.g., searches for "re-frame" and "re frame" produce identical results. A search for just "re" makes "re-frame-re-play" the number one hit. So the reason re-frame-re-play has the top score is that it has "re" twice, and the reason so many packages have higher scores than re-frame is that they mention re-frame in the description, boosting their score relative to re-frame itself.
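A minimal sketch of that hypothesis, assuming the analyzer lowercases and splits on any non-alphanumeric character. The `tokenize` function is a hypothetical stand-in, not the analyzer clucy or Lucene actually uses.

```python
import re

def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters (assumed behavior)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# Under this assumption the hyphenated and spaced queries are indistinguishable.
print(tokenize("re-frame"))
print(tokenize("re frame"))

# The artifact name contributes the term "re" twice.
print(tokenize("re-frame-re-play"))
```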

Unfortunately, this doesn't explain why I can't replicate the issue locally. rsync failed on downloading some packages, but I thought I had enough to reasonably reproduce the problem.

@holyjak

holyjak commented May 29, 2019

@KingMob How did you test the search locally: through the web UI, i.e. in the same way we search http://clojars.org/? Perhaps a maintainer could share the generated Lucene index from the server so we could try searching against it locally and compare it with a manually generated one? If the search path is the same and the index doesn't make a difference, what could it be? Could clojars.org be running different versions of dependencies?

@holyjak

holyjak commented May 29, 2019

BTW I see that clucy is 6-year-old abandonware, using Lucene 4 (the latest is 8). Perhaps we should consider switching to a different library or just using Lucene directly?

@holyjak

holyjak commented May 29, 2019

FYI, this is the cljdoc issue regarding search: cljdoc/cljdoc#85

And this is my work-in-progress Clojure -> Lucene 8 artifact search: https://github.com/holyjak/tmp-clj-artifact-search

@KingMob
Author

KingMob commented May 30, 2019

@holyjak For local usage, I set it up as I described above in #721 (comment). I tried to pull in the public hosted all.edn file to get the stats right, but my rsync call didn't pull the whole maven repository. I think I got maybe 2/3 of all the packages available. Then, I ran it with lein repl and (go).

I noticed clucy was old. I couldn't locate javadocs for that version of Lucene, and unfortunately, the archived .zip file has some errors that prevented building. That alone suggests it's worth upgrading, though I'm not sure if clucy is the issue here.

@tobias tobias added the search label Dec 22, 2019
@renatoalencar
Contributor

renatoalencar commented Nov 22, 2020

I've been going through this for the past few days, and here is what I found. The people involved in this issue probably know by now what's happening, but I guess it's good to document things here.

Why do re-frame results have re-frame-re-play first?

What seems to be happening is that Lucene handles characters like - as punctuation (as @KingMob pointed out above), so for both indexing and searching re-frame becomes two separate terms, re and frame. Lucene's inverted index maps terms to documents and scores them based on the number of matches it finds, using TF-IDF. For an artifact like re-frame-re-play, that is 6 matches: three for the artifact id and three for the group id.
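The match counting described above can be sketched as follows. This is a simplified illustration, not Lucene's scoring: it counts raw token matches only, the helper names are hypothetical, and the group ids are assumed to be identical to the artifact ids.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters (assumed behavior)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def match_count(query, fields):
    """Count indexed tokens, across all fields, that equal some query term."""
    terms = set(tokenize(query))
    counts = Counter(tok for value in fields for tok in tokenize(value))
    return sum(counts[t] for t in terms)

# (group id, artifact id) pairs; group ids are assumed equal to the artifact ids.
artifacts = {
    "re-frame": ("re-frame", "re-frame"),
    "re-frame-re-play": ("re-frame-re-play", "re-frame-re-play"),
}

ranked = sorted(artifacts,
                key=lambda name: match_count("re-frame", artifacts[name]),
                reverse=True)
for name in ranked:
    print(name, match_count("re-frame", artifacts[name]))
```

Under these assumptions re-frame-re-play collects 6 matches against 4 for re-frame, which is enough to rank it first when nothing else, such as download counts, feeds into the score.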

Why can't we find clj-concordion by searching for clj-concord (#719)?

This search is mapped to the terms clj and concord. Lucene searches for the exact term concord and, in this specific implementation, does not allow partial or fuzzy matches. Also, clj is itself a very common term.
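A small sketch of the difference between exact term matching and prefix matching, assuming the same hyphen-splitting tokenization. `exact_hits` and `prefix_hits` are hypothetical helpers written for this illustration, not Lucene calls.

```python
import re

def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters (assumed behavior)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def exact_hits(query, name):
    """Query terms that equal an indexed term exactly."""
    indexed = set(tokenize(name))
    return [t for t in tokenize(query) if t in indexed]

def prefix_hits(query, name):
    """Query terms that are a prefix of some indexed term."""
    indexed = tokenize(name)
    return [t for t in tokenize(query) if any(tok.startswith(t) for tok in indexed)]

print(exact_hits("clj-concord", "clj-concordion"))
print(prefix_hits("clj-concord", "clj-concordion"))
```

With exact matching only the very common term clj hits, so clj-concordion has nothing to distinguish it from every other clj-* artifact.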

Why are less popular jars being prioritized over more popular ones?

You can increase the number of matches for a particular term by repeating a bunch of terms several times. I managed to overtake create-react-class for the first position in search by doing that, also repeating the terms in the description. The jar never needs to be downloaded at all.

I couldn't work out exactly how TF-IDF interferes with this, but I guess it's probably something we could both leverage and tweak here to improve the quality of search results.
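For reference, the textbook form of TF-IDF is given below. This is the generic definition, stated as an assumption about what is relevant here; the Lucene version in use may apply additional normalization factors on top of it.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

Here tf(t, d) is the number of times term t occurs in document d, N is the total number of documents, and df(t) is the number of documents containing t. A very common term such as clj has a large df(t) and so a small weight, while repeating a term inside one document raises tf(t, d), which is consistent with the repetition trick described above.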

@tobias
Member

tobias commented Jan 23, 2022

I've rewritten search, and the search results look much better now (see #806 (comment)). I'm closing this issue.

@tobias tobias closed this as completed Jan 23, 2022
4 participants