Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full-Text Indexes #1871

Merged
merged 1 commit into from Jul 27, 2023
Merged

Full-Text Indexes #1871

merged 1 commit into from Jul 27, 2023

Conversation

Hydrocharged
Copy link
Contributor

This is a partial implementation of Full-Text indexes. There is still quite a bit to finish on the GMS side (as can be seen from the copious amount of TODOs), but this shows the broad strokes of how it's implemented, along with most of the "difficult" design choices being implemented. The major choice that has not yet been finalized is how to deal with FTS_DOC_ID, as it's an AUTO_INCREMENT column in MySQL, but that would not play well with Dolt merging. I already have ideas on how to handle that (taking into account remotes, etc.), but that would come from a later PR.

https://docs.google.com/document/d/1nGyYg461AhxQjFLzhEEj01XMz0VaTBaBaA44WNu0fc4/edit

Quite a few things have changed from the initial design doc, mostly based on feedback during the meeting, however some of it was post-meeting. There are three tables instead of 1: Config (stores table-specific information shared across all indexes), WordToPos (maps words to an ID and position, not fully used in the default search), and Count (used to calculate relevancy, also not fully used in the default search). I was planning on converting MATCH ... AGAINST ... to a join between the tables, which would work when fetching results, but MATCH ... AGAINST ... may also be used as a result, which necessitated writing all of the functionality anyway, so the join plan was dropped.

Last thing to mention, is that I'm pretty sure that Full-Text indexes actually do a full table scan. It seems weird, but AFAICT the indexes are used to quickly calculate relevancy for each search mode. It seems that, for overly large tables, the search time increases even when other index operations continue to operate nearly instantaneously.

I've tagged two people for review to make it a bit easier. Of course, feel free to take a look at more if you desire.

@reltuk The sql/fulltext/fulltext.go file is an expansion of the file you've previously reviewed (all still kept to a single file for now). To complement it and see how it'll be implemented on the Dolt side, you can look at memory/table.go. Dolt's table editor will be similar, and the merge paths will only use the FulltextEditor, which special logic to interface with it from those paths.

@max-hoffman Take a look at the analyzer changes, along with the sql/plan/ddl.go file. You'll probably need to reference sql/fulltext/fulltext.go as well.

Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am maybe confused, but after reading docs about full text indexes I was expecting a workflow pattern of 1) convert target text to lexemes, 2) inverse lookup from lexemes -> row pks -> rows, 3) filter rows for false positives. Is the workflow pattern we're targeting just "regex but maybe faster for some workloads"? I'm also skeptical that extra table lookups will be faster than in-memory regex DFAs.

sql/analyzer/indexes.go Outdated Show resolved Hide resolved
Copy link
Contributor

@reltuk reltuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a pass at the draft. The editor piece is looking in the right direction AFAICT, but had some comments. As evident from my comments, I had some questions on the current sketchy of the execution piece.

sql/analyzer/match_against.go Outdated Show resolved Hide resolved
sql/expression/matchagainst.go Show resolved Hide resolved
sql/fulltext/fulltext.go Outdated Show resolved Hide resolved
sql/fulltext/fulltext.go Outdated Show resolved Hide resolved
@Hydrocharged
Copy link
Contributor Author

I've removed the dependency on the FTS_DOC_ID column, and I've implemented three strategies in its place. The first is to use the primary key if it exists, considering FTS_DOC_ID was just being used to uniquely identify a row. The second is similar to the first, and that's to use a unique key if a primary key doesn't exist. This is mainly for keyless tables that may still have a unique key even though it doesn't have a primary.

The third deals with raw keyless, and I decided that we can just store the row hash along with an occurence count. This will work fine for insertions, and when calculating the relevancy. The one place that this breaks down is when we want to use the index to reduce the number of rows read. In this case, we'll just do a full table scan, which will still end up with the correct results. For the simplicity, I think this is a fair trade-off, as otherwise we'd need to return a unique identifier from the integrator. For Dolt, we do have a way to identify a specific row, but that's not going to be true for other integrators, including GMS's memory implementation. The additional interfaces would not be ergonomic as well, as we'd need to have a different table editor interface that can return values, among other interfaces. Also, I haven't implemented the raw keyless path just yet, so you won't find the code here.

Lastly, the FIRST_DOC_ID and LAST_DOC_ID columns seem like display artifacts, rather than values that are actually stored in the tables. In addition, MySQL stores half of the relevancy calculation alongside the word data to reduce the number of lookups. Since our relevancy calculation is subject to change, I'm storing the word's count within a document, and also how many documents contain the word. That puts us at 3 specific tables per index (up from 2), and one shared table for each table with fulltext indexes.

@Hydrocharged
Copy link
Contributor Author

All of the new changes have been grouped into the 2nd commit.

Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a lot simpler, still missing some of the execution boilerplate, editor changes are a bit complex it's probably better in GMS than in Dolt. One concern about resolving the MatchAgainst expression. Couple other small comments about dividing planning and execution code between plan and rowexec and some naming things.

sql/analyzer/fulltext_filter.go Outdated Show resolved Hide resolved
sql/parse/parse.go Show resolved Hide resolved
@@ -753,6 +753,10 @@ func TestForeignKeys(t *testing.T) {
enginetest.TestForeignKeys(t, enginetest.NewDefaultMemoryHarness())
}

func TestFulltextIndexes(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be helpful to add this to test the new planbuilder name resolution

func TestFulltextIndexes_Exp(t *testing.T) {
	enginetest.TestFulltextIndexes(t, enginetest.NewDefaultMemoryHarness().WithVersion(sql.VersionExperimental))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also some TestTextIndexPlans that mirror TestSpatialIndexPlans would be helpful to make sure we apply indexes in the right places

sql/plan/ddl.go Show resolved Hide resolved
sql/fulltext/fulltext.go Outdated Show resolved Hide resolved
sql/fulltext/fulltext.go Outdated Show resolved Hide resolved
sql/fulltext/fulltext.go Outdated Show resolved Hide resolved
sql/fulltext/fulltext.go Outdated Show resolved Hide resolved
sql/fulltext/fulltext.go Outdated Show resolved Hide resolved
sql/analyzer/generate_index_scans.go Outdated Show resolved Hide resolved
@Hydrocharged Hydrocharged force-pushed the daylon/fulltext branch 2 times, most recently from 7b3eec9 to 92b3941 Compare July 25, 2023 19:55
@Hydrocharged Hydrocharged changed the title [DRAFT] Full-Text Indexes Full-Text Indexes Jul 25, 2023
@Hydrocharged Hydrocharged marked this pull request as ready for review July 25, 2023 19:55
@Hydrocharged
Copy link
Contributor Author

I'm a bit tripped up on why some Query Plans are failing, but I don't want to hold this PR up any longer while trying to find and fix that when it's probably something relatively small. I'm also still missing quite a few tests (due to Query Plans), but I'm not anticipating any issues arising from them as I was doing a bit of testing while developing to make sure things worked as expected.

Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, thank you for helping me on the analyzer/resolve side. I looked less closely at the editor side because I remember that looking pretty good last Friday. I think most of my comments are suggestions around clarifying the lifecycle of text indexes (1) on the analyzer side and (2) on the execution side. A couple things that help are (i) targeted doc comments (e.g. how execution lifecycle components cooperate), (ii) teasing out component relationships where relevant (ex: match normalization vs relevancy stats), and (iii) forward looking unit tests when we want to add edge cases (ex: primary key/key generation selection; normalization, ...).

sql/analyzer/match_against.go Outdated Show resolved Hide resolved
sql/expression/matchagainst.go Show resolved Hide resolved
sql/expression/matchagainst.go Show resolved Hide resolved
sql/fulltext/default_parser.go Show resolved Hide resolved
)

// HashRow returns a 64 character lowercase hexadecimal hash of the given row. This is intended for use with keyless tables.
func HashRow(row sql.Row) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this benefit from some mini unit tests in-case we need at add/edit type handling in the future?

sql/fulltext/schema.go Outdated Show resolved Hide resolved
sql/rowexec/fulltext_filter.go Show resolved Hide resolved
sql/rowexec/fulltext_filter.go Show resolved Hide resolved
sql/expression/matchagainst.go Outdated Show resolved Hide resolved
sql/expression/matchagainst.go Show resolved Hide resolved
@Hydrocharged Hydrocharged merged commit b6f7f09 into main Jul 27, 2023
6 checks passed
@Hydrocharged Hydrocharged deleted the daylon/fulltext branch July 27, 2023 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants