use MongoDB full text search #149

Merged 2 commits on May 19, 2016

MartinNowak
Member

  • native/indexed db support for searching
  • supports full search syntax of MongoDB, i.e.
    multiple words, "phrase search", word -exclusion
  • higher weights for name and description
  • stemming of search terms (uses English by default, can be overridden on
    a per-package basis, by adding a language field)
  • ignores stopwords
  • does not perform proximity search (typos), but our
    levenshtein search returned a lot of nonsense
  • can't currently index the readme b/c it's not in the db, so people
    will have to optimize their short description to be nicely searchable
  • don't index categories which should be an orthogonal search filter

@MartinNowak
Member Author

Not perfect but a lot better than the current search which returns almost arbitrary results.

- use db.packages.dropIndex('searchTerms_1') and
  db.packages.update({}, {$unset: {searchTerms: ''}}, {multi: true})
  to actually remove the data
@MartinNowak
Member Author

As a long-term solution and for autocompleted search we could use the SOLR adapter of mongo-connector.
https://github.com/mongodb-labs/mongo-connector/blob/67efbf1bef74b7c042a7b4f7dd0023d2fa21409d/mongo_connector/doc_managers/solr_doc_manager.py

@s-ludwig
Member

Sounds/looks good. I'll merge and switch live together with the other changes. @wilzbach: Would be good to have it on alpha.dub.pm for testing.

I guess there could be a migration statement to clear the "searchTerms" field, but I can do that once from the CLI, too.

> but our levenshtein search returned a lot of nonsense

Indeed, plain Levenshtein turned out not to be really suited to this. It would at least have to weight the edit operations depending on their context, the position of the keys on the keyboard, and linguistic meaning/similarity.
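
To illustrate one of those ideas, here is a rough sketch of an edit distance where substituting a neighbouring key is cheaper than an arbitrary substitution. It only covers the keyboard aspect, not context or linguistic similarity. The QWERTY rows, the 0.5 neighbour cost, and all function names are assumptions made for the example; nothing like this is part of the PR.

```javascript
// Keyboard rows used to approximate key adjacency (row stagger is ignored).
const ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"];

// Returns [row, column] of a lowercase letter, or null if it is not on the map.
function keyPosition(ch) {
  for (let r = 0; r < ROWS.length; r++) {
    const c = ROWS[r].indexOf(ch);
    if (c !== -1) return [r, c];
  }
  return null;
}

// 0 for equal characters, 0.5 for neighbouring keys, 1 otherwise.
function substitutionCost(a, b) {
  if (a === b) return 0;
  const pa = keyPosition(a);
  const pb = keyPosition(b);
  if (pa && pb && Math.abs(pa[0] - pb[0]) <= 1 && Math.abs(pa[1] - pb[1]) <= 1) {
    return 0.5; // assumed cost for a likely typo
  }
  return 1;
}

// Standard two-row dynamic programme with the weighted substitution cost.
function weightedLevenshtein(s, t) {
  // prev[j]: distance between the first i-1 chars of s and the first j chars of t.
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const cur = [i];
    for (let j = 1; j <= t.length; j++) {
      cur[j] = Math.min(
        prev[j] + 1, // deletion
        cur[j - 1] + 1, // insertion
        prev[j - 1] + substitutionCost(s[i - 1], t[j - 1]) // substitution
      );
    }
    prev = cur;
  }
  return prev[t.length];
}
```

With these weights, a slip onto an adjacent key ("jsin" for "json") ranks closer than an unrelated substitution ("jsan" for "json"), which plain Levenshtein cannot distinguish.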

@wilzbach
Member

> Sounds/looks good. I'll merge and switch live together with the other changes. @wilzbach: Would be good to have it on alpha.dub.pm for testing.

There you go: http://alpha.dub.pm/

I just registered a couple of packages, but if you send me the package dump, I can put it there too ;-)

@wilzbach
Member

> but if you send me the package dump,

Thanks a lot. It took quite a while, though, until it refreshed its cache. It had to re-check the latest version of every package. Should be online now ;-)

@s-ludwig
Member

Thanks! Makes a good impression. I'll merge now.

@s-ludwig merged commit 4e27cdd into dlang:master on May 19, 2016
@MartinNowak deleted the text_index branch on May 19, 2016 08:43
@MartinNowak
Member Author

> I guess there could be a migration statement to clear the "searchTerms" field, but I can do that once from the CLI, too.

It's in the commit message; it was too much hassle to check whether the index still exists. With WiredTiger as the storage backend you can't read system.indexes, and runCommand with listIndexes returns a weird format.
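
For a one-off cleanup from the mongo shell, as opposed to the registry's own code, the existence check could be sketched with the shell's `getIndexes()` helper. The function name `clearSearchTerms` is hypothetical, and `coll` is assumed to behave like a mongo shell collection such as `db.packages`.

```javascript
// Sketch: drop the old index only if it is still present, then unset the field.
function clearSearchTerms(coll) {
  // getIndexes() returns an array of index documents, each with a `name`.
  var hasIndex = coll.getIndexes().some(function (ix) {
    return ix.name === "searchTerms_1";
  });
  if (hasIndex) {
    coll.dropIndex("searchTerms_1");
  }
  // Same update as in the commit message: remove searchTerms from every document.
  coll.update({}, { $unset: { searchTerms: "" } }, { multi: true });
  return hasIndex;
}
```

Calling `clearSearchTerms(db.packages)` a second time would skip the `dropIndex` call and only re-run the (then no-op) update.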
