Full-text search with multiple words #281

Closed
PalmerAL opened this Issue Jun 28, 2016 · 7 comments

@PalmerAL

PalmerAL commented Jun 28, 2016

In the fullTextSearch.js example, searching is done using this query:

 db.emails.where("messageWords").startsWithIgnoreCase("v").distinct().toArray(function (a) {

However, it is only possible to search for one word in messageWords with this query. Is there a way to use this for multi-word search queries? I'm trying to search very large documents, so Collection.filter is not fast enough.

@dfahlander

Owner

dfahlander commented Jun 28, 2016

Just use startsWithAnyOf(["word1", "word2", ...]) or startsWithAnyOfIgnoreCase(["word1", "word2", ...]). And if you want, you can populate the messageWords field with the words from multiple properties instead of just a single one. That way you can have a single query search for all documents containing any of the searched words in any of the attributes.
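
For illustration, a rough sketch of that approach could look like the following (the subject and body field names and the tokenize() helper are assumptions for the example, not part of the original fullTextSearch.js sample):

// Hypothetical tokenizer - split a string into lowercase words
function tokenize(text) {
    return text.toLowerCase().split(/\W+/).filter(Boolean);
}

// Populate messageWords from several properties before adding the record
// (assumes a multiEntry index on messageWords, as in the fullTextSearch.js example)
function addEmail(email) {
    email.messageWords = tokenize(email.subject + ' ' + email.body);
    return db.emails.add(email);
}

// Search for documents containing any of the given prefixes, in any attribute
db.emails
    .where('messageWords')
    .startsWithAnyOfIgnoreCase(['word1', 'word2'])
    .distinct()
    .toArray(emails => console.log(emails));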

@PalmerAL

PalmerAL commented Jun 29, 2016

Sorry for not being clearer: I'm trying to limit the result set to items that contain all of the words, so that I could run something like db.emails.where("messageWords").allOf(["a", "b", "c"]).distinct().toArray() and find records where messageWords contains the items "a", "b", and "c". Is there a way to do this?


@dfahlander

Owner

dfahlander commented Jun 29, 2016

Ok. Since your documents are large, filter() is not ideal as you've also concluded. I would suggest the following:

function find(prefixes) {
    return db.transaction('r', db.docs, function*() {
        // Parallel search for all prefixes - just select the resulting primary keys
        const results = yield Dexie.Promise.all(prefixes.map(prefix =>
            db.docs
                .where('words')
                .startsWith(prefix)
                .primaryKeys()));

        // Intersect the result sets of primary keys
        const reduced = results
            .reduce((a, b) => {
                const set = new Set(b);
                return a.filter(k => set.has(k));
            });

        // Finally select the entire documents from the intersection
        // (':id' addresses the primary key)
        return yield db.docs.where(':id').anyOf(reduced).toArray();
    });
}

// Sample use of the function:
find(['animal', 'food', 'nutrition']).then(docs => {
    console.table(docs);
}).catch(err => {
    console.error(err.stack || err);
});

This sample queries on indexes and primary keys only and takes the intersection of all searched words. Only when the intersection is known are the actual documents accessed. It also utilizes the new IndexedDB getAllKeys() API method, so it should be maximally performant. You could change startsWith to startsWithIgnoreCase as well, but then getAllKeys() won't be possible to use internally. I don't know how much this would matter performance-wise, but I believe it's better to stick to just startsWith() or equals() and instead make sure you are storing 'words' in lowercase and always call toLowerCase() before calling find().

Note that the result will be ordered by primary key. If the primary key is auto-incremented, this means you'll receive the oldest documents first. To reverse that, just add .reverse() to the final document selection query, optionally with a limit() as well.
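
As a quick illustration of those two notes (userQuery is just a placeholder here, and the .reverse().limit(20) line is only a sketch of the modification described above):

// Lowercase and tokenize the query before calling find(), since 'words' is stored lowercased
const tokens = userQuery.toLowerCase().split(/\s+/).filter(Boolean);
find(tokens).then(docs => console.table(docs));

// To get the newest documents first instead (with an auto-incremented primary key),
// the final selection inside find() could be changed to something like:
//   return yield db.docs.where(':id').anyOf(reduced).reverse().limit(20).toArray();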


@dfahlander dfahlander added the question label Jun 29, 2016

@PalmerAL

PalmerAL commented Jun 29, 2016

Thanks, that's really useful!


@negamaxi

negamaxi commented Jun 26, 2018

@dfahlander Is this still the recommended approach?


@dfahlander

Owner

dfahlander commented Jun 26, 2018

@negamaxi It's still a logical and database-ish way of searching for multiple words in a full-text manner. There are other strategies as well, and I haven't yet found out which one is the most performant. It might be the case that it's better to avoid roundtrips as much as possible, even though we're talking to a local database (as event dispatching interrupts the GUI thread). Or it could be best to do this request-intensive work from a Worker thread. I'm planning to create some performance tests for various scenarios (join operations, anyOf() and full-text) to find out which strategy is the best. This will hopefully result in some recommendations and/or maybe some new built-in features or addons.

One thing I will be testing is whether it's faster to just query them all like this:

  // flatten: any array-flattening helper, e.g. lodash's flatten or arrays => [].concat(...arrays)
  const docs = flatten(await Promise.all(
    prefixes.map(prefix =>
      db.docs.where('words').startsWith(prefix).toArray()
    )
  ));

...and from there, filter out the result in memory. It might occupy more memory, though.
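
A minimal sketch of that in-memory filtering, continuing from the docs array above (the doc.id field name is just an assumption here; 'words' is the indexed field from the snippet):

// Dedupe by primary key, then keep only documents whose own 'words' array
// contains a word starting with every searched prefix.
const byId = new Map(docs.map(doc => [doc.id, doc]));
const matchingAll = [...byId.values()].filter(doc =>
    prefixes.every(prefix =>
        doc.words.some(word => word.startsWith(prefix))));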


@PalmerAL

PalmerAL commented Jun 26, 2018

To add on to this, I've been using this approach for a while now, and it seems to work pretty well. For ~3k documents, with ~1500 tokens per document (so around 4 million tokens total), most searches take a couple hundred ms, which seems like a reasonable amount of time. When I originally implemented this, equals turned out to be significantly faster than startsWith, so you might want to try that as well (although it's possible that it's changed since then).
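
For reference, a sketch of what that equals() variant could look like, swapping the prefix match in find() above for an exact word match (whole, lowercased words must then be passed in instead of prefixes; this is an illustration, not code from the thread):

function findExact(words) {
    return db.transaction('r', db.docs, function*() {
        // Exact word match on the multiEntry 'words' index instead of a prefix match
        const results = yield Dexie.Promise.all(words.map(word =>
            db.docs.where('words').equals(word).primaryKeys()));

        // Intersect the key sets and fetch the matching documents
        const reduced = results.reduce((a, b) => {
            const set = new Set(b);
            return a.filter(k => set.has(k));
        });
        return yield db.docs.where(':id').anyOf(reduced).toArray();
    });
}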

