Full-Text Indexes #1871

Hydrocharged · 2023-07-10T13:59:30Z

This is a partial implementation of Full-Text indexes. There is still quite a bit to finish on the GMS side (as can be seen from the copious amount of TODOs), but this shows the broad strokes of how it's implemented, along with most of the "difficult" design choices being implemented. The major choice that has not yet been finalized is how to deal with FTS_DOC_ID, as it's an AUTO_INCREMENT column in MySQL, but that would not play well with Dolt merging. I already have ideas on how to handle that (taking into account remotes, etc.), but that would come from a later PR.

https://docs.google.com/document/d/1nGyYg461AhxQjFLzhEEj01XMz0VaTBaBaA44WNu0fc4/edit

Quite a few things have changed from the initial design doc, mostly based on feedback during the meeting, however some of it was post-meeting. There are three tables instead of 1: Config (stores table-specific information shared across all indexes), WordToPos (maps words to an ID and position, not fully used in the default search), and Count (used to calculate relevancy, also not fully used in the default search). I was planning on converting MATCH ... AGAINST ... to a join between the tables, which would work when fetching results, but MATCH ... AGAINST ... may also be used as a result, which necessitated writing all of the functionality anyway, so the join plan was dropped.

Last thing to mention, is that I'm pretty sure that Full-Text indexes actually do a full table scan. It seems weird, but AFAICT the indexes are used to quickly calculate relevancy for each search mode. It seems that, for overly large tables, the search time increases even when other index operations continue to operate nearly instantaneously.

I've tagged two people for review to make it a bit easier. Of course, feel free to take a look at more if you desire.

@reltuk The sql/fulltext/fulltext.go file is an expansion of the file you've previously reviewed (all still kept to a single file for now). To complement it and see how it'll be implemented on the Dolt side, you can look at memory/table.go. Dolt's table editor will be similar, and the merge paths will only use the FulltextEditor, which special logic to interface with it from those paths.

@max-hoffman Take a look at the analyzer changes, along with the sql/plan/ddl.go file. You'll probably need to reference sql/fulltext/fulltext.go as well.

max-hoffman

I am maybe confused, but after reading docs about full text indexes I was expecting a workflow pattern of 1) convert target text to lexemes, 2) inverse lookup from lexemes -> row pks -> rows, 3) filter rows for false positives. Is the workflow pattern we're targeting just "regex but maybe faster for some workloads"? I'm also skeptical that extra table lookups will be faster than in-memory regex DFAs.

sql/analyzer/indexes.go

reltuk

Took a pass at the draft. The editor piece is looking in the right direction AFAICT, but had some comments. As evident from my comments, I had some questions on the current sketchy of the execution piece.

sql/analyzer/match_against.go

sql/expression/matchagainst.go

sql/fulltext/fulltext.go

Hydrocharged · 2023-07-19T15:25:28Z

I've removed the dependency on the FTS_DOC_ID column, and I've implemented three strategies in its place. The first is to use the primary key if it exists, considering FTS_DOC_ID was just being used to uniquely identify a row. The second is similar to the first, and that's to use a unique key if a primary key doesn't exist. This is mainly for keyless tables that may still have a unique key even though it doesn't have a primary.

The third deals with raw keyless, and I decided that we can just store the row hash along with an occurence count. This will work fine for insertions, and when calculating the relevancy. The one place that this breaks down is when we want to use the index to reduce the number of rows read. In this case, we'll just do a full table scan, which will still end up with the correct results. For the simplicity, I think this is a fair trade-off, as otherwise we'd need to return a unique identifier from the integrator. For Dolt, we do have a way to identify a specific row, but that's not going to be true for other integrators, including GMS's memory implementation. The additional interfaces would not be ergonomic as well, as we'd need to have a different table editor interface that can return values, among other interfaces. Also, I haven't implemented the raw keyless path just yet, so you won't find the code here.

Lastly, the FIRST_DOC_ID and LAST_DOC_ID columns seem like display artifacts, rather than values that are actually stored in the tables. In addition, MySQL stores half of the relevancy calculation alongside the word data to reduce the number of lookups. Since our relevancy calculation is subject to change, I'm storing the word's count within a document, and also how many documents contain the word. That puts us at 3 specific tables per index (up from 2), and one shared table for each table with fulltext indexes.

Hydrocharged · 2023-07-19T15:26:39Z

All of the new changes have been grouped into the 2nd commit.

max-hoffman

This looks a lot simpler, still missing some of the execution boilerplate, editor changes are a bit complex it's probably better in GMS than in Dolt. One concern about resolving the MatchAgainst expression. Couple other small comments about dividing planning and execution code between plan and rowexec and some naming things.

sql/analyzer/fulltext_filter.go

sql/parse/parse.go

max-hoffman · 2023-07-19T16:02:21Z

enginetest/memory_engine_test.go

@@ -753,6 +753,10 @@ func TestForeignKeys(t *testing.T) {
 	enginetest.TestForeignKeys(t, enginetest.NewDefaultMemoryHarness())
 }

+func TestFulltextIndexes(t *testing.T) {


would be helpful to add this to test the new planbuilder name resolution

func TestFulltextIndexes_Exp(t *testing.T) { enginetest.TestFulltextIndexes(t, enginetest.NewDefaultMemoryHarness().WithVersion(sql.VersionExperimental)) }

also some TestTextIndexPlans that mirror TestSpatialIndexPlans would be helpful to make sure we apply indexes in the right places

sql/plan/ddl.go

sql/fulltext/fulltext.go

sql/analyzer/generate_index_scans.go

Hydrocharged · 2023-07-25T19:58:16Z

I'm a bit tripped up on why some Query Plans are failing, but I don't want to hold this PR up any longer while trying to find and fix that when it's probably something relatively small. I'm also still missing quite a few tests (due to Query Plans), but I'm not anticipating any issues arising from them as I was doing a bit of testing while developing to make sure things worked as expected.

max-hoffman

Mostly LGTM, thank you for helping me on the analyzer/resolve side. I looked less closely at the editor side because I remember that looking pretty good last Friday. I think most of my comments are suggestions around clarifying the lifecycle of text indexes (1) on the analyzer side and (2) on the execution side. A couple things that help are (i) targeted doc comments (e.g. how execution lifecycle components cooperate), (ii) teasing out component relationships where relevant (ex: match normalization vs relevancy stats), and (iii) forward looking unit tests when we want to add edge cases (ex: primary key/key generation selection; normalization, ...).

sql/analyzer/match_against.go

sql/expression/matchagainst.go

sql/fulltext/default_parser.go

max-hoffman · 2023-07-25T22:46:40Z

sql/fulltext/fulltext.go

+)
+
+// HashRow returns a 64 character lowercase hexadecimal hash of the given row. This is intended for use with keyless tables.
+func HashRow(row sql.Row) (string, error) {


would this benefit from some mini unit tests in-case we need at add/edit type handling in the future?

sql/fulltext/schema.go

sql/rowexec/fulltext_filter.go

sql/expression/matchagainst.go

Hydrocharged requested review from reltuk and max-hoffman July 10, 2023 13:59

max-hoffman reviewed Jul 10, 2023

View reviewed changes

reltuk reviewed Jul 11, 2023

View reviewed changes

sql/analyzer/indexes.go Outdated Show resolved Hide resolved

reltuk reviewed Jul 11, 2023

View reviewed changes

sql/analyzer/match_against.go Outdated Show resolved Hide resolved

sql/expression/matchagainst.go Show resolved Hide resolved

sql/fulltext/fulltext.go Outdated Show resolved Hide resolved

sql/fulltext/fulltext.go Outdated Show resolved Hide resolved

Hydrocharged requested review from max-hoffman and reltuk July 19, 2023 15:08

max-hoffman reviewed Jul 19, 2023

View reviewed changes

Hydrocharged force-pushed the daylon/fulltext branch 2 times, most recently from 7b3eec9 to 92b3941 Compare July 25, 2023 19:55

Hydrocharged changed the title ~~[DRAFT] Full-Text Indexes~~ Full-Text Indexes Jul 25, 2023

Hydrocharged marked this pull request as ready for review July 25, 2023 19:55

max-hoffman approved these changes Jul 25, 2023

View reviewed changes

Hydrocharged force-pushed the daylon/fulltext branch from 92b3941 to ca0b602 Compare July 26, 2023 21:50

Implemented Full-Text indexes

53bb534

Hydrocharged force-pushed the daylon/fulltext branch from ca0b602 to 53bb534 Compare July 27, 2023 05:59

Hydrocharged merged commit b6f7f09 into main Jul 27, 2023
6 checks passed

Hydrocharged deleted the daylon/fulltext branch July 27, 2023 06:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Full-Text Indexes #1871

Full-Text Indexes #1871

Hydrocharged commented Jul 10, 2023

max-hoffman left a comment

reltuk left a comment

Hydrocharged commented Jul 19, 2023

Hydrocharged commented Jul 19, 2023

max-hoffman left a comment

max-hoffman Jul 19, 2023

max-hoffman Jul 19, 2023

Hydrocharged commented Jul 25, 2023

max-hoffman left a comment

max-hoffman Jul 25, 2023

Full-Text Indexes #1871

Full-Text Indexes #1871

Conversation

Hydrocharged commented Jul 10, 2023

max-hoffman left a comment

Choose a reason for hiding this comment

reltuk left a comment

Choose a reason for hiding this comment

Hydrocharged commented Jul 19, 2023

Hydrocharged commented Jul 19, 2023

max-hoffman left a comment

Choose a reason for hiding this comment

max-hoffman Jul 19, 2023

Choose a reason for hiding this comment

max-hoffman Jul 19, 2023

Choose a reason for hiding this comment

Hydrocharged commented Jul 25, 2023

max-hoffman left a comment

Choose a reason for hiding this comment

max-hoffman Jul 25, 2023

Choose a reason for hiding this comment