Full-Text Indexes #1871
Conversation
I may be confused, but after reading the docs about full-text indexes I was expecting a workflow pattern of 1) convert target text to lexemes, 2) inverse lookup from lexemes -> row pks -> rows, 3) filter rows for false positives. Is the workflow pattern we're targeting just "regex but maybe faster for some workloads"? I'm also skeptical that extra table lookups will be faster than in-memory regex DFAs.
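For readers unfamiliar with the pattern described in steps 1-3, here is a toy sketch of an inverted index. The names (`invertedIndex`, `add`, `lookup`) are illustrative, not anything from this PR:

```go
package main

import (
	"fmt"
	"strings"
)

// invertedIndex maps a lexeme (here, just a lowercased word) to the set
// of row primary keys whose text contains that word.
type invertedIndex map[string]map[int]struct{}

// add tokenizes text into lexemes and records the row's pk under each.
func (idx invertedIndex) add(pk int, text string) {
	for _, word := range strings.Fields(strings.ToLower(text)) {
		if idx[word] == nil {
			idx[word] = map[int]struct{}{}
		}
		idx[word][pk] = struct{}{}
	}
}

// lookup returns candidate pks for a query word. Callers must still
// re-check the fetched rows to filter false positives (step 3).
func (idx invertedIndex) lookup(word string) []int {
	var pks []int
	for pk := range idx[strings.ToLower(word)] {
		pks = append(pks, pk)
	}
	return pks
}

func main() {
	idx := invertedIndex{}
	idx.add(1, "the quick brown fox")
	idx.add(2, "the lazy dog")
	fmt.Println(idx.lookup("quick")) // only row 1 contains "quick"
}
```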
Took a pass at the draft. The editor piece looks like it's headed in the right direction AFAICT, but I had some comments. As is evident from those comments, I had some questions about the current sketch of the execution piece.
I've removed the dependency. The third deals with raw keyless tables, and I decided that we can just store the row hash along with an occurrence count. This will work fine for insertions and when calculating relevancy. The one place this breaks down is when we want to use the index to reduce the number of rows read; in that case, we'll just do a full table scan, which will still end up with the correct results. For simplicity, I think this is a fair trade-off, as otherwise we'd need to return a unique identifier from the integrator. For Dolt, we do have a way to identify a specific row, but that's not going to be true for other integrators, including GMS's memory implementation. The additional interfaces would not be ergonomic either, as we'd need a different table editor interface that can return values, among other interfaces. Also, I haven't implemented the raw keyless path just yet, so you won't find the code here.
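The "row hash plus occurrence count" idea for keyless tables could be sketched roughly like this. `hashKeylessRow` and the map shape are assumptions for illustration; the real editor would hash SQL values, not strings:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashKeylessRow is a stand-in for hashing a keyless row's values.
func hashKeylessRow(vals ...string) string {
	h := sha256.New()
	for _, v := range vals {
		h.Write([]byte(v))
		h.Write([]byte{0}) // separator so ("ab","c") != ("a","bc")
	}
	return hex.EncodeToString(h.Sum(nil)) // 64 lowercase hex chars
}

func main() {
	// occurrences tracks how many duplicates of each keyless row exist,
	// which is enough for insertions and relevancy math, but not for
	// seeking back to a specific row (hence the full-table-scan fallback
	// when the index would otherwise reduce the rows read).
	occurrences := map[string]uint64{}
	occurrences[hashKeylessRow("a", "b")]++
	occurrences[hashKeylessRow("a", "b")]++
	occurrences[hashKeylessRow("c", "d")]++
	fmt.Println(occurrences[hashKeylessRow("a", "b")]) // 2
}
```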
All of the new changes have been grouped into the 2nd commit.
This looks a lot simpler. It's still missing some of the execution boilerplate, and the editor changes are a bit complex; they're probably better in GMS than in Dolt. One concern about resolving the MatchAgainst expression. A couple of other small comments about dividing planning and execution code between plan and rowexec, and some naming things.
@@ -753,6 +753,10 @@ func TestForeignKeys(t *testing.T) {
	enginetest.TestForeignKeys(t, enginetest.NewDefaultMemoryHarness())
}

func TestFulltextIndexes(t *testing.T) {
would be helpful to add this to test the new planbuilder name resolution
func TestFulltextIndexes_Exp(t *testing.T) {
enginetest.TestFulltextIndexes(t, enginetest.NewDefaultMemoryHarness().WithVersion(sql.VersionExperimental))
}
also, some TestTextIndexPlans that mirror TestSpatialIndexPlans would be helpful to make sure we apply indexes in the right places
I'm a bit tripped up on why some Query Plans are failing, but I don't want to hold this PR up any longer while trying to find and fix what's probably something relatively small. I'm also still missing quite a few tests (due to the Query Plans), but I'm not anticipating any issues arising from them, as I did a bit of testing while developing to make sure things worked as expected.
Mostly LGTM; thank you for helping me on the analyzer/resolve side. I looked less closely at the editor side because I remember it looking pretty good last Friday. Most of my comments are suggestions around clarifying the lifecycle of text indexes (1) on the analyzer side and (2) on the execution side. A couple of things that would help are (i) targeted doc comments (e.g. how the execution lifecycle components cooperate), (ii) teasing out component relationships where relevant (e.g. match normalization vs. relevancy stats), and (iii) forward-looking unit tests where we want to add edge cases (e.g. primary key/key generation selection; normalization, ...).
)

// HashRow returns a 64 character lowercase hexadecimal hash of the given row. This is intended for use with keyless tables.
func HashRow(row sql.Row) (string, error) {
would this benefit from some mini unit tests, in case we need to add or edit type handling in the future?
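A self-contained sketch of what such mini tests could pin down. `hashRow` here is a simplified stand-in for the PR's HashRow (it hashes the string form of each value; the real function would need the per-type handling these tests would guard):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashRow is a simplified stand-in for HashRow: it hashes the printed
// form of each value with a separator byte between values.
func hashRow(row []interface{}) (string, error) {
	h := sha256.New()
	for _, v := range row {
		fmt.Fprintf(h, "%v\x00", v)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	// Properties worth asserting: equal rows hash equal, different rows
	// differ, and the output is always 64 lowercase hex characters.
	a, _ := hashRow([]interface{}{int64(1), "abc"})
	b, _ := hashRow([]interface{}{int64(1), "abc"})
	c, _ := hashRow([]interface{}{int64(2), "abc"})
	fmt.Println(len(a) == 64, a == b, a == c)
}
```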
This is a partial implementation of Full-Text indexes. There is still quite a bit to finish on the GMS side (as can be seen from the copious amount of TODOs), but this shows the broad strokes of how it's implemented, along with most of the "difficult" design choices. The major choice that has not yet been finalized is how to deal with FTS_DOC_ID, as it's an AUTO_INCREMENT column in MySQL, but that would not play well with Dolt merging. I already have ideas on how to handle that (taking into account remotes, etc.), but that will come in a later PR.

https://docs.google.com/document/d/1nGyYg461AhxQjFLzhEEj01XMz0VaTBaBaA44WNu0fc4/edit
Quite a few things have changed from the initial design doc, mostly based on feedback during the meeting, though some of it came afterward. There are three tables instead of one: Config (stores table-specific information shared across all indexes), WordToPos (maps words to an ID and position; not fully used in the default search), and Count (used to calculate relevancy; also not fully used in the default search). I was planning on converting MATCH ... AGAINST ... to a join between the tables, which would work when fetching results, but MATCH ... AGAINST ... may also be used as a result expression, which necessitated writing all of the functionality anyway, so the join plan was dropped.

Last thing to mention is that I'm pretty sure Full-Text indexes actually do a full table scan. It seems weird, but AFAICT the indexes are used to quickly calculate relevancy for each search mode. For overly large tables, the search time increases even when other index operations continue to operate nearly instantaneously.
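To make the three-table layout above concrete, here is a rough model of what each table's rows might hold. The field names are illustrative guesses, not this PR's actual schema:

```go
package main

import "fmt"

// configRow models the Config table: table-wide settings shared by
// every Full-Text index on the parent table.
type configRow struct {
	Key   string
	Value string
}

// wordToPosRow models the WordToPos table: a word mapped to the
// document it appears in and its position within that document.
type wordToPosRow struct {
	Word     string
	DocID    uint64
	Position uint64
}

// countRow models the Count table: per-word occurrence counts, which
// feed the relevancy calculation.
type countRow struct {
	Word  string
	DocID uint64
	Count uint64
}

func main() {
	// A tiny document "the cat sat" with doc id 1:
	positions := []wordToPosRow{
		{Word: "the", DocID: 1, Position: 0},
		{Word: "cat", DocID: 1, Position: 1},
		{Word: "sat", DocID: 1, Position: 2},
	}
	// Derive the Count rows from the position rows.
	counts := map[string]uint64{}
	for _, p := range positions {
		counts[p.Word]++
	}
	fmt.Println(len(positions), counts["cat"])
}
```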
I've tagged two people for review to make it a bit easier. Of course, feel free to take a look at more if you desire.
@reltuk The sql/fulltext/fulltext.go file is an expansion of the file you've previously reviewed (all still kept to a single file for now). To complement it and see how it'll be implemented on the Dolt side, you can look at memory/table.go. Dolt's table editor will be similar, and the merge paths will only use the FulltextEditor, which has special logic to interface with it from those paths.

@max-hoffman Take a look at the analyzer changes, along with the sql/plan/ddl.go file. You'll probably need to reference sql/fulltext/fulltext.go as well.