Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full-text search with multiple words #281

Closed
PalmerAL opened this issue Jun 28, 2016 · 16 comments
Closed

Full-text search with multiple words #281

PalmerAL opened this issue Jun 28, 2016 · 16 comments
Labels

Comments

@PalmerAL
Copy link

In the fullTextSearch.js example, searching is done using this query:

 db.emails.where("messageWords").startsWithIgnoreCase("v").distinct().toArray(function (a) {

However, it is only possible to search for one word in messageWords with this query. Is there a way to use this for multi-word search queries? I'm trying to search very large documents, so Collection.filter is not fast enough.

@dfahlander
Copy link
Collaborator

dfahlander commented Jun 28, 2016

Just use startsWithAnyOf(["word1", "word2", ...]) or startsWithAnyOfIgnoreCase(["word1", "word2", ...]). And if you want, you can populate the messageWords field with the words from multiple properties instead of just a single one. That way you can have a single query search for all documents containing any of the searched words in any of the attributes.

@PalmerAL
Copy link
Author

Sorry for not being more clear, I'm trying to limit the result set to items that contain all of the words, so that I could run something like db.emails.where("messageWords").allOf(["a", "b", "c"]).distinct().toArray() and find records where messageWords contains the items "a", "b", and "c". Is there a way to do this?

@dfahlander
Copy link
Collaborator

dfahlander commented Jun 29, 2016

Ok. Since your documents are large, filter() is not ideal as you've also concluded. I would suggest the following:

function find (prefixes) {
    return db.transaction('r', db.docs, function*() {
        // Parallell search for all prefixes - just select resulting primary keys
        const results = yield Dexie.Promise.all (prefixes.map(prefix =>
            db.docs
                .where('words')
                .startsWith(prefix)
                .primaryKeys()));

        // Intersect result set of primary keys
        const reduced = results
           .reduce ((a, b) => {
                const set = new Set(b);
                return a.filter(k => set.has(k));
            });

        // Finally select entire documents from intersection
        return yield db.docs.where(':id').anyOf (reduced).toArray();
    });
}

// Sample use of the function:
find(['animal', 'food', 'nutrition']).then(docs => {
    console.table(docs);
}).catch (err => {
    console.error (err.stack || err);
});

This sample will query on indexes and prinary keys only and take out the intersection of all searched words. Finally when the intersection is known, the actual documents are accessed. It also utilize the new indexeddb getAllKeys () api method so it should be maximum performant. You could change startsWith to startsWithIgnoreCase also, but then getAllKeys() wont be possible to use internally. Don't know how much this would matter performance-wise, but I believe it's better to stick to just startsWith() or equals() and instead make sure you are storing 'words' in lowercase version and always to toLowerCase() before calling find().

Note that the result will be ordered by primary key. If primary key is incremented, this means you'll receive the oldest documents first. To reverse that, just add .reverse() to the final document selection query. Optionally with a limit() as well.

@PalmerAL
Copy link
Author

Thanks, that's really useful!

@negamaxi
Copy link

@dfahlander still recommended approach?

@dfahlander
Copy link
Collaborator

@negamaxi It's still a logical and databaseish way of searching for multiple words in a full-text manner. There are other strategies as well and I haven't yet found out which one is most performant. It might be the case that it's better to avoid roundtrips as much as possible, even though we're talking to to a local database (as event dispatching interrupts the GUI thread). Or, it could be best to do these request-intensive stuff from a Worker thread. I'm planning to create some performance tests for various scenarios (join operations, anyOf() and full-text) to find out which strategy is the best. This will hopefully result in some recommendations and / or maybe some new built-in features or addons.

One thing that I will be going to test, is whether it's faster to just query them all like this:

  const docs = flatten(await Promise.all(
    prefixes.map(prefix =>
      db.docs.where('words').startsWith(prefix)
    )
  ));

..and from there, filter in-memory out the result. Might occupy more memory though.

@PalmerAL
Copy link
Author

To add on to this, I've been using this approach for a while now, and it seems to work pretty well. For ~3k documents, with ~1500 tokens per document (so around 4 million tokens total), most searches take a couple hundred ms, which seems like a reasonable amount of time. When I originally implemented this, equals turned out to be significantly faster than startsWith, so you might want to try that as well (although it's possible that it's changed since then).

@cliftonlabrum
Copy link

@dfahlander What is that flatten() function you are using in your comment above? Is that the same as flat() as described here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/flat

@dfahlander
Copy link
Collaborator

yes

@aislanmaia
Copy link

How to implement a search not for a prefix or suffix, but for a pattern in middle of a string, like relational databases do it with %pattern% ?
We would like to make this full text search through 3 different stores like: categories, filters, filterOptions

@dfahlander
Copy link
Collaborator

dfahlander commented Dec 22, 2020

A database engine that allows %pattern% needs to do a full table scan to find the result. In Dexie you can do

db.friends.filter(f => /pattern/.test(f.name))

or

db.friends.toArray().then(result => result.filter(f => /pattern/.test(f.name))

...which is basically the same but a bit faster since the raw query can use IDBObjectStore.getAll() rather than IDBObjectStore.openCursor() which is slower.

But if you really need performance on searches in large tables, there's an algorithm for that trigram search. There are people that have built trigram index on top of dexie but of my knowledge no ready-to-use library or addon. The idea is to use a multiEntry index and instead of indexing all the words as we are discussing in this issue, index each three-letter combination.

To eliminate the few false positives that may occur in the result from a trigram search based on multiEntry index in Dexie, you can add a manual filter on the result.

Some databases have built-in support for this, or modules that does it (like pg_trm for postgresql)

@aislanmaia
Copy link

I'm grateful for your response @dfahlander . It helps so much!!
That's what I was looking for!
I think the both approachs will be sufficient in terms of performance, because our db is not so big, it's like 30k of items.

@cristiano-belloni
Copy link

cristiano-belloni commented Jul 7, 2021

@dfahlander what if I want to search for a literal phrase instead of a single word? Like, for example, searching for "one and two" that matches "one and two and three" but not "one and three and two"?
I guess my only options are either storing all the n-grams in a phrase (whose number explodes when n is high) or searching for the first word exactly, then brute-force filtering the results? Or maybe intersecting the n words then filtering on the intersection?

@dfahlander
Copy link
Collaborator

@cristiano-belloni If using trigrams, you should still only index each three-letter combination. Never larger than three. What you do is getting a lot of false positives that you filter out in JS from the end result. If using the typical full-text-search approach where you index each word instead (as suggested in the beginning of this issue), you will do the same - get some false positives but do a stricter JS based filter at the end.

@cristiano-belloni
Copy link

cristiano-belloni commented Jul 8, 2021

@dfahlander I guess searching for one word then filtering is good enough, given some heuristics. Trigram is attractive, but doesn't really apply to exact search, since as far as I understand it's mainly used for fuzzy search?

@dfahlander
Copy link
Collaborator

Yes, my understanding is that the best full text search for the purpose of searching human readable text, is to index the base form of each word and ignore "stop-words" like "is" "and" etc (can use lunr for this) using a multientry array. Could also complement it with a synonym lookup and a dictionary of typical wrong-spellings and it would be possible to build a really nice text search engine.

Fuzzy search are good for finding things in other sources such as logs or code, IMHO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants