Improve SPARQL parser by using proper token lexing #271
Conversation
@@ -78,7 +78,7 @@ set(USE_OPENMP OFF CACHE BOOL "Don't use OPENMP as default" FORCE)
 add_subdirectory(third_party/stxxl)
 # apply STXXL CXXFLAGS
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${STXXL_CXX_FLAGS}")
-include_directories(${STXXL_INCLUDE_DIRS})
+include_directories(SYSTEM ${STXXL_INCLUDE_DIRS})
Good idea, on GCC 9.x the warnings were getting a bit annoying.
I've decided to take another, closer look at this split-out PR, as it's more intrusive than the earlier changes and we already had an infinite loop problem. I've opened the PR to pushes from other maintainers, so you can check this out (e.g. with the CLI instructions next to the merge button) and then commit further changes.
That said, I feel like the parser in its current state has definitely already passed the bar of being better than the previous state of affairs.
I'll still go over this a bit more, but I already want to get some comments/questions out there.
src/engine/Filter.cpp
Outdated
// our vocabulary storing iris with the greater than and
// literals with their quotation marks.
if (rhs_string.find('<') != std::string::npos &&
    rhs_string.back() == '"' && rhs_string.size() > 1) {
I'm a little worried that this check isn't strict enough; maybe we should only strip the quotes for `"<…`, with no space between the `"` and the `<`.
That seems like a good idea. I've made the check more strict.
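A stricter check along the lines suggested above could look like the following. This is a hypothetical sketch, not the actual code in Filter.cpp; the function name `looksLikeQuotedIri` is invented for illustration:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: only treat the value as a quoted IRI if the
// opening quote is *immediately* followed by '<' (no space in between)
// and the string ends with a closing quote.
bool looksLikeQuotedIri(const std::string& s) {
  return s.size() > 3 && s[0] == '"' && s[1] == '<' && s.back() == '"';
}
```

Requiring `s[1] == '<'` rules out values like `" <iri>"` where a space separates the quote from the angle bracket, which the looser `find('<')` check would have accepted.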
src/engine/Filter.cpp
Outdated
  return "<" + s.substr(1, r - 1) + ">";
}

std::string Filter::stringRemoveTrailingQuotationMark(const std::string& s) {
I can't seem to find any call sites for these? Did you intend to use these functions in the hack above but then decide not to?
I was originally trying to properly implement the handling of URIs and strings and wrote those functions for that, but then realized that a proper implementation would require more significant changes. I've deleted them now.
src/engine/Filter.cpp
Outdated
  return s;
}

std::string Filter::uriRemoveTrailingGreaterThan(const std::string& s) {
no call site for this either
@@ -0,0 +1,187 @@
// Copyright 2019, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Florian Kramer (florian.kramer@neptun.uni-freiburg.de)
One thing that is different about this compared to a classic generated lexer, which constructs FSAs over all token definitions, is that you need to handle it manually if at some point in the parser one token can match a prefix of another token. In a traditional lexer (as I understand it), the lexer would keep trying to extend the token and only emit the longest match. I haven't found any situation in our SPARQL parser where this would be a problem, but I think we should be aware of it and add a comment so anyone coming after us is also aware of this quirk.
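The prefix-token pitfall described here can be shown with a toy first-match lexer. This is an illustrative sketch, not QLever's actual lexer, and the token set is invented for the example: if `<` were tried before `<=`, the input `a <= b` would be mis-lexed as `<` followed by a stray `=`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy first-match lexer: candidates must be ordered so that longer
// tokens ("<=") are tried before their prefixes ("<"). A generated
// lexer avoids this ordering requirement by always emitting the
// longest match over all token definitions.
std::vector<std::string> lex(const std::string& in) {
  static const std::vector<std::string> tokens = {"<=", ">=", "<", ">", "="};
  std::vector<std::string> out;
  size_t pos = 0;
  while (pos < in.size()) {
    if (in[pos] == ' ') { ++pos; continue; }
    bool matched = false;
    for (const auto& t : tokens) {  // longest candidates listed first
      if (in.compare(pos, t.size(), t) == 0) {
        out.push_back(t);
        pos += t.size();
        matched = true;
        break;
      }
    }
    if (!matched) {  // fall back: consume a run of non-operator characters
      size_t end = in.find_first_of(" <>=", pos);
      if (end == std::string::npos) end = in.size();
      out.push_back(in.substr(pos, end - pos));
      pos = end;
    }
  }
  return out;
}
```

With the ordering above, `lex("a <= b")` yields the three tokens `a`, `<=`, `b`; swapping `"<"` ahead of `"<="` in the candidate list would instead produce four tokens.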
I've added a comment to the `readNext` method, but I think the code should be quite self-explanatory as to how multiple matches are handled (given that the core function that extracts tokens is mostly a long `else if` chain).
}
// Assume the token is a predicate path. This will be verified
// separately later.
_lexer.expandNextUntilWhitespace();
Are we sure there is actually whitespace after this? I assume this is covered by the `_re_string` being completely consumed, right?
If the query is valid, there has to be whitespace. Otherwise we will fail later on (either we reach the end of the input, or the property path is not valid, or there is a dot in place of a variable / iri).
src/engine/QueryPlanner.h
Outdated
@@ -68,13 +68,35 @@ class QueryPlanner {

Node& operator=(const Node& other) = default;

// Returns true if tje two nodes equal apart from the id
typo
fixed
This now looks great to me; however, because they use the wrong string literal syntax in their queries (which was accepted by the old parser), this breaks the autocompletion UI. I'll still have to figure out how to fix this. @jbuerklin can you make sure you only use |
@floriankramer I just looked this up in the SPARQL Spec and it looks like using |
@niklas88 The SPARQL grammar actually contains four types of strings. It supports both |
@floriankramer I think the old parser did somehow accept |
@niklas88 Interestingly the lexer actually already accepted |
@floriankramer very nice work. I can confirm the |
This is the new lexer and parser part of @floriankramer's original PR #258. It replaces ad-hoc searches in the parser with proper regex-based lexing. In the process, parsing becomes much more resilient and many broken corner cases are fixed. It also makes the parser much more maintainable, more easily extendable, and a lot easier to reason about. Thanks @floriankramer for the amazing work.
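As a rough illustration of the general technique (regex-based lexing instead of ad-hoc string searches), a toy tokenizer might look like the following. The token names and patterns here are invented for the example; they are not QLever's actual token definitions:

```cpp
#include <cassert>
#include <cctype>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Token {
  std::string type;
  std::string text;
};

// Try each token regex anchored at the current position and emit the
// first rule that matches (toy example; real token definitions differ).
std::vector<Token> tokenize(const std::string& input) {
  static const std::vector<std::pair<std::string, std::regex>> rules = {
      {"IRIREF", std::regex(R"(<[^<>\s]*>)")},
      {"VAR", std::regex(R"(\?[A-Za-z_][A-Za-z0-9_]*)")},
      {"KEYWORD", std::regex(R"(SELECT|WHERE)", std::regex::icase)},
      {"PUNCT", std::regex(R"([{}.])")},
  };
  std::vector<Token> out;
  size_t pos = 0;
  while (pos < input.size()) {
    if (std::isspace(static_cast<unsigned char>(input[pos]))) {
      ++pos;
      continue;
    }
    std::smatch m;
    bool matched = false;
    for (const auto& [type, re] : rules) {
      std::string rest = input.substr(pos);
      // match_continuous anchors the match at the start of `rest`.
      if (std::regex_search(rest, m, re,
                            std::regex_constants::match_continuous)) {
        out.push_back({type, m.str()});
        pos += static_cast<size_t>(m.length());
        matched = true;
        break;
      }
    }
    if (!matched) throw std::runtime_error("unexpected character");
  }
  return out;
}
```

The payoff of this style over ad-hoc `find`/`substr` calls is that each token class is defined in exactly one place, so adding or tightening a token is a local change and malformed input fails fast instead of being silently misinterpreted.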
This builds on top of #270