Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Completely refactor the fulltext operations #1093

Merged
merged 162 commits into from
Jan 18, 2024

Conversation

NickG-1
Copy link
Contributor

@NickG-1 NickG-1 commented Sep 14, 2023

As of this commit, the fulltext index (triggered by ql:contains-word and ql:contains-entity) uses two basic operations:

  1. TextIndexScanForWord: For a given word or prefix, return all text records that contain the word, (possibly together with the matched word in the case of a prefix, and the score of the match).
  2. TextIndexScanForEntity: For a given word or prefix, return a superset of all pairs of (text, entity) where the entity is contained in the text according to ql:contains-entity and the text contains the word. For technical reasons this is a superset: We always have to scan the complete block from the half-inverted index which might belong to a shorter prefix.

The general processing is then as follows:

  • For each word or prefix that appears as part of the object of a ql:contains-word triple, a TextIndexScanForWord is created.
  • For each entity or variable that appears as the object of a ql:contains-entity triple, a TextIndexScanForEntity is created.
  • The rest of the query processing is handled by the "ordinary" query planner using the normal operations like JOIN that are also used to process standard SPARQL queries.

This is much cleaner than the old TextOperationWith[out]Filter operations which combined the functionality of the above scan operations with JOIN operations, because the old approach lead to a lot of code duplication (the code for a join of two tables was duplicated for the fulltext module) and because the new approach makes queries easier to optimize and to reason about because the runtime information trees become much clearer if the scans and joins are represented separately.

Copy link

codecov bot commented Jan 3, 2024

Codecov Report

Attention: 69 lines in your changes are missing coverage. Please review.

Comparison is base (f7c2c32) 84.34% compared to head (c7e5855) 85.25%.

Files Patch % Lines
src/engine/QueryPlanner.cpp 78.31% 31 Missing and 5 partials ⚠️
src/index/IndexImpl.Text.cpp 81.00% 15 Missing and 4 partials ⚠️
src/engine/QueryPlanner.h 64.70% 4 Missing and 2 partials ⚠️
src/parser/sparqlParser/SparqlQleverVisitor.cpp 89.65% 0 Missing and 3 partials ⚠️
src/engine/TextIndexScanForEntity.h 95.45% 0 Missing and 2 partials ⚠️
src/index/Vocabulary.cpp 71.42% 0 Missing and 2 partials ⚠️
src/index/FTSAlgorithms.cpp 96.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1093      +/-   ##
==========================================
+ Coverage   84.34%   85.25%   +0.90%     
==========================================
  Files         304      308       +4     
  Lines       29100    29385     +285     
  Branches     3446     3464      +18     
==========================================
+ Hits        24544    25051     +507     
+ Misses       3153     2938     -215     
+ Partials     1403     1396       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A second round of reviews, this is already much much much cleaner.

src/engine/QueryPlanner.cpp Show resolved Hide resolved
src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved
src/engine/QueryPlanner.cpp Show resolved Hide resolved
src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved
src/engine/QueryPlanner.h Outdated Show resolved Hide resolved
test/QueryPlannerTest.cpp Outdated Show resolved Hide resolved
test/engine/TextIndexScanForWordTest.cpp Outdated Show resolved Hide resolved
test/engine/TextIndexScanForEntityTest.cpp Show resolved Hide resolved
test/engine/TextIndexScanForEntityTest.cpp Outdated Show resolved Hide resolved
test/engine/TextIndexScanTestHelpers.h Show resolved Hide resolved
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly very minor stuff.

src/engine/QueryPlanner.h Outdated Show resolved Hide resolved
src/engine/TextIndexScanForEntity.cpp Outdated Show resolved Hide resolved
src/engine/TextIndexScanForEntity.h Outdated Show resolved Hide resolved
src/engine/TextIndexScanForEntity.h Outdated Show resolved Hide resolved
src/engine/TextIndexScanForEntity.h Outdated Show resolved Hide resolved
src/engine/TextIndexScanForWord.h Outdated Show resolved Hide resolved
src/parser/data/Variable.h Outdated Show resolved Hide resolved
src/parser/data/VariableToColumnMapPrinters.cpp Outdated Show resolved Hide resolved
test/QueryPlannerTestHelpers.h Outdated Show resolved Hide resolved
test/QueryPlannerTestHelpers.h Outdated Show resolved Hide resolved
Copy link

sonarcloud bot commented Jan 18, 2024

Quality Gate Passed Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

13 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much,
This makes the text index code much much cleaner.

@joka921 joka921 changed the title Added WordIndexScan Completely refactor the fulltext operations Jan 18, 2024
@joka921 joka921 merged commit 8f9b13a into ad-freiburg:master Jan 18, 2024
18 checks passed
@NickG-1 NickG-1 deleted the wordIndexScan branch January 18, 2024 15:09
hannahbast pushed a commit that referenced this pull request Jan 23, 2024
)

Since #1093 we use a much simpler approach for answering full-text queries that contain `ql:contains-word` and `ql:contains-entity`. That PR made a lot of old code for the text index obsolete. This code is now deleted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants