
Improve SPARQL parser by using proper token lexing #271

Merged
merged 15 commits into ad-freiburg:master on Aug 6, 2019

Conversation

niklas88
Member

@niklas88 niklas88 commented Jul 19, 2019

This is the new lexer and parser part of @floriankramer's original PR #258. It replaces ad-hoc searches in the parser with proper regex-based lexing. In the process, parsing becomes much more resilient and many broken corner cases are fixed. It also makes the parser much more maintainable, more easily extendable, and a lot easier to reason about. Thanks @floriankramer for the amazing work.

This builds on top of #270

@niklas88 niklas88 removed the request for review from floriankramer July 19, 2019 15:23
@@ -78,7 +78,7 @@ set(USE_OPENMP OFF CACHE BOOL "Don't use OPENMP as default" FORCE)
add_subdirectory(third_party/stxxl)
# apply STXXL CXXFLAGS
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${STXXL_CXX_FLAGS}")
include_directories(${STXXL_INCLUDE_DIRS})
include_directories(SYSTEM ${STXXL_INCLUDE_DIRS})
Member Author

Good idea; on GCC 9.x the warnings were getting a bit annoying.

Member Author

@niklas88 niklas88 left a comment

I've decided to take another, closer look at this split-out PR since it is more intrusive than the earlier changes and we already had an infinite loop problem. I've opened the PR to pushes from other maintainers, so you can check it out (e.g. with the CLI instructions next to the merge button) and then commit further changes.

That said, I feel that in its current state the parser has definitely already cleared the bar of being better than the current state of affairs.

I'll still go over this a bit more, but I already want to get some comments/questions out there.

// our vocabulary storing iris with the greater than and
// literals with their quotation marks.
if (rhs_string.find('<') != std::string::npos &&
rhs_string.back() == '"' && rhs_string.size() > 1) {
Member Author

I'm a little worried that this check isn't strict enough; maybe we should only strip the quotes for "<…" with no space between " and <.

Member

That seems like a good idea. I've made the check more strict.
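
For illustration, here is a minimal sketch of the stricter variant discussed above. The helper name looksLikeQuotedIri is hypothetical and this is not necessarily the committed code; it only accepts values whose '<' directly follows the opening quote, with no space in between.

```cpp
#include <string>

// Hypothetical helper, not necessarily the committed check: only treat the
// value as a quoted IRI if the '<' immediately follows the opening '"'
// (no space in between) and the value is also terminated by a '"'.
bool looksLikeQuotedIri(const std::string& rhs_string) {
  return rhs_string.size() > 2 && rhs_string.front() == '"' &&
         rhs_string[1] == '<' && rhs_string.back() == '"';
}
```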

return "<" + s.substr(1, r - 1) + ">";
}

std::string Filter::stringRemoveTrailingQuotationMark(const std::string& s) {
Member Author

I can't seem to find any call sites for these? Did you intend to use these functions in the hack above but then decide not to?

Member

I was originally trying to properly implement the handling of URIs and strings and wrote those functions for that, but then realized that a proper implementation would require more significant changes. I've deleted them now.

return s;
}

std::string Filter::uriRemoveTrailingGreaterThan(const std::string& s) {
Member Author

No call site for this one either.

@@ -0,0 +1,187 @@
// Copyright 2019, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Florian Kramer (florian.kramer@neptun.uni-freiburg.de)
Member Author

One thing that is different here compared to a classic generated lexer, which constructs FSAs over all token definitions, is that you need to handle it manually if at some point in the parser one token can match a prefix of another token. In a traditional lexer (as I understand it) the lexer would keep trying to extend the token and only emit the longest match. I haven't found any situation in our SPARQL parser where this is actually a problem, but I think we should be aware of it and add a comment so that anyone coming after us is also aware of this quirk.

Member

I've added a comment to the readNext method, but I think the code should be quite self-explanatory as to how multiple matches are handled (given that the core function that extracts tokens is mostly a long else-if chain).
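
To make the quirk concrete, here is a small, self-contained sketch of an else-if style, first-match lexer; it is purely illustrative and not this PR's actual code. Because the rule for < is tried before the rule for <=, the shorter prefix always wins and <= is split into two tokens; trying longer alternatives first (or using longest-match semantics, as a generated lexer would) avoids this.

```cpp
#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical first-match lexer: every rule is tried in order at the start
// of the remaining input and the first match wins, with no attempt to extend
// it into a longer token.
std::vector<std::string> lex(std::string rest) {
  // Deliberately broken ordering: "<" comes before "<=", so the shorter
  // prefix always wins and "<=" can never be emitted as a single token.
  static const std::vector<std::pair<std::string, std::regex>> rules = {
      {"WS", std::regex(R"(^\s+)")},
      {"LT", std::regex(R"(^<)")},
      {"LE", std::regex(R"(^<=)")},  // never reached
      {"EQ", std::regex(R"(^=)")},
      {"INT", std::regex(R"(^[0-9]+)")},
  };
  std::vector<std::string> tokens;
  while (!rest.empty()) {
    std::smatch match;
    bool matched = false;
    for (const auto& [name, re] : rules) {
      if (std::regex_search(rest, match, re)) {
        if (name != "WS") tokens.push_back(match.str());
        rest = match.suffix();
        matched = true;
        break;
      }
    }
    if (!matched) break;  // unknown input; a real lexer would raise an error
  }
  return tokens;
}

int main() {
  // Prints [1] [<] [=] [2] instead of the intended [1] [<=] [2]; swapping
  // the LT and LE rules (longest alternative first) fixes it.
  for (const auto& token : lex("1 <= 2")) std::cout << '[' << token << "] ";
  std::cout << '\n';
}
```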

}
// Assume the token is a predicate path. This will be verified
// separately later.
_lexer.expandNextUntilWhitespace();
Member Author

Are we sure there is actually whitespace after this? I assume this is covered by the _re_string being completely consumed, right?

Member

If the query is valid there has to be whitespace. Otherwise we will fail later on (either we reach the end of the input, or the property path is not valid, or there is a dot in place of a variable/IRI).
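
As a side note, here is a hypothetical sketch of the idea behind expandNextUntilWhitespace, assuming the lexer keeps the not-yet-consumed input in a string; the function and parameter names are made up, and the PR's actual implementation may differ.

```cpp
#include <cctype>
#include <string>

// Hypothetical sketch, not the PR's implementation: grow the current token
// by consuming characters from the remaining input until whitespace or the
// end of the input is reached. If no whitespace is left, the rest of the
// input is swallowed and a later parsing step rejects the incomplete query,
// which matches the behaviour described in the reply above.
void expandUntilWhitespace(std::string& currentToken, std::string& remaining) {
  size_t pos = 0;
  while (pos < remaining.size() &&
         !std::isspace(static_cast<unsigned char>(remaining[pos]))) {
    ++pos;
  }
  currentToken += remaining.substr(0, pos);
  remaining.erase(0, pos);
}
```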

@@ -68,13 +68,35 @@ class QueryPlanner {

Node& operator=(const Node& other) = default;

// Returns true if tje two nodes equal apart from the id
Member Author

typo

Member

fixed

@niklas88
Member Author

This now looks great to me. However, the auto completion UI uses a string literal syntax in its queries that was only accepted by the old parser, so this change breaks it. I'll still have to figure out how to fix that. @jbuerklin can you make sure you only use " quotes for literals?

@niklas88
Member Author

niklas88 commented Aug 5, 2019

@floriankramer I just looked this up in the SPARQL spec and it looks like using ' quotes is actually legal. I think your literal regex is missing this support, right?

@floriankramer
Member

@niklas88 The SPARQL grammar actually contains four types of strings. It supports both ' as well as ", plus additional versions with three opening and closing marks (e.g. """) that don't need escaping of quotation marks inside them (as far as I remember). I had decided to stick to " for now for simplicity's sake and because the old parser did not accept anything else either.
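
For reference, here are the four string forms from the SPARQL 1.1 grammar (STRING_LITERAL1, STRING_LITERAL2, and their LONG triple-quoted variants) written as simplified regular expressions. These are illustrative only, not the patterns used by this PR's lexer.

```cpp
#include <regex>

// Simplified sketches of the four SPARQL 1.1 string literal forms; the ECHAR
// escapes (\t \b \n \r \f \\ \" \') are written out inline. Illustrative
// only, not the regexes used by this PR.

// 'single-quoted': no unescaped ', \, LF or CR inside.
const std::regex kStringLiteral1(R"rgx('(?:[^'\\\n\r]|\\[tbnrf\\"'])*')rgx");

// "double-quoted": no unescaped ", \, LF or CR inside.
const std::regex kStringLiteral2(R"rgx("(?:[^"\\\n\r]|\\[tbnrf\\"'])*")rgx");

// '''long form''': may contain runs of one or two quotes and raw newlines.
const std::regex kStringLiteralLong1(
    R"rgx('''(?:(?:'|'')?(?:[^'\\]|\\[tbnrf\\"']))*''')rgx");

// """long form""": may contain runs of one or two quotes and raw newlines.
const std::regex kStringLiteralLong2(
    R"rgx("""(?:(?:"|"")?(?:[^"\\]|\\[tbnrf\\"']))*""")rgx");
```

For lexing, these would additionally be anchored at the current input position, and, as discussed earlier, the triple-quoted forms would have to be tried before the single-quoted ones so that the longer delimiter wins.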

@niklas88
Member Author

niklas88 commented Aug 5, 2019

@floriankramer I think the old parser did somehow accept ' in regexes or something; at least the UI people use it for the prefix search. But yes, I see why you only supported " in this version, and I would have done the same.

@floriankramer
Member

@niklas88 Interestingly, the lexer actually already accepted ' as a string delimiter; an old method in the parser was throwing an exception, though. It should be fixed now.

@niklas88
Member Author

niklas88 commented Aug 6, 2019

@floriankramer Very nice work. I can confirm that the ' fix resolves the issue with the QLever UI and that the TransitivePath change fixes the test query we talked about.

@niklas88 niklas88 merged commit b82dfa8 into ad-freiburg:master Aug 6, 2019