Improve SPARQL parser by using proper token lexing #271
Conversation
@@ -78,7 +78,7 @@ set(USE_OPENMP OFF CACHE BOOL "Don't use OPENMP as default" FORCE)
 add_subdirectory(third_party/stxxl)
 # apply STXXL CXXFLAGS
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${STXXL_CXX_FLAGS}")
-include_directories(${STXXL_INCLUDE_DIRS})
+include_directories(SYSTEM ${STXXL_INCLUDE_DIRS})
Good idea, on GCC 9.x the warnings were getting a bit annoying.
I've decided to take another, closer look at this split-out PR, as it's more intrusive than the earlier changes and we already had an infinite loop problem. I've opened the PR to pushes from other maintainers, so you can check this out (e.g. with the CLI instructions next to the merge button) and then commit further changes.
That said, I feel like the parser in its current state has definitely already passed the bar of being better than the previous state of affairs.
I'll still go over this a bit more, but I already want to get some comments/questions out there.
src/engine/Filter.cpp
Outdated
// our vocabulary storing iris with the greater than and
// literals with their quotation marks.
if (rhs_string.find('<') != std::string::npos &&
    rhs_string.back() == '"' && rhs_string.size() > 1) {
I'm a little worried that this check isn't strict enough; maybe we should only strip the quotes for `"<…`, with no space between the `"` and the `<`.
That seems like a good idea. I've made the check more strict.
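A stricter check along the lines suggested above could look like the following. This is a hypothetical sketch, not the actual code in Filter.cpp; the function name `looksLikeQuotedIri` is invented for illustration:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: only treat the value as a quoted IRI if the
// opening quote is *immediately* followed by '<' (no space in between)
// and the string ends with a closing quote.
bool looksLikeQuotedIri(const std::string& s) {
  return s.size() > 3 && s[0] == '"' && s[1] == '<' && s.back() == '"';
}
```

Requiring `s[1] == '<'` rules out values like `" <iri>"` where a space separates the quote from the angle bracket, which the looser `find('<')` check would have accepted.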
src/engine/Filter.cpp
Outdated
  return "<" + s.substr(1, r - 1) + ">";
}

std::string Filter::stringRemoveTrailingQuotationMark(const std::string& s) {
I can't seem to find any call sites for these? Did you intend to use these functions in the hack above but then decide not to?
I was originally trying to properly implement the handling of URIs and strings and wrote those functions for that, but then realized that a proper implementation would require more significant changes. I've deleted them now.
src/engine/Filter.cpp
Outdated
  return s;
}

std::string Filter::uriRemoveTrailingGreaterThan(const std::string& s) {
no call site for this either
@@ -0,0 +1,187 @@
// Copyright 2019, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Florian Kramer (florian.kramer@neptun.uni-freiburg.de)
One thing that is different about this compared to a classic generated lexer, which constructs FSAs over all token definitions, is that you need to handle it manually if at some point in the parser one token can match a prefix of another token. In a traditional lexer (as I understand it), the lexer would keep trying to extend the token and only emit the longest match. I haven't found any situation in our SPARQL parser where this would be a problem, but I think we should be aware of it and add a comment so anyone coming after us is also aware of this quirk.
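The prefix-token pitfall described here can be shown with a toy first-match lexer. This is an illustrative sketch, not QLever's actual lexer, and the token set is invented for the example: if `<` were tried before `<=`, the input `a <= b` would be mis-lexed as `<` followed by a stray `=`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy first-match lexer: candidates must be ordered so that longer
// tokens ("<=") are tried before their prefixes ("<"). A generated
// lexer avoids this ordering requirement by always emitting the
// longest match over all token definitions.
std::vector<std::string> lex(const std::string& in) {
  static const std::vector<std::string> tokens = {"<=", ">=", "<", ">", "="};
  std::vector<std::string> out;
  size_t pos = 0;
  while (pos < in.size()) {
    if (in[pos] == ' ') { ++pos; continue; }
    bool matched = false;
    for (const auto& t : tokens) {  // longest candidates listed first
      if (in.compare(pos, t.size(), t) == 0) {
        out.push_back(t);
        pos += t.size();
        matched = true;
        break;
      }
    }
    if (!matched) {  // fall back: consume a run of non-operator characters
      size_t end = in.find_first_of(" <>=", pos);
      if (end == std::string::npos) end = in.size();
      out.push_back(in.substr(pos, end - pos));
      pos = end;
    }
  }
  return out;
}
```

With the ordering above, `lex("a <= b")` yields the three tokens `a`, `<=`, `b`; swapping `"<"` ahead of `"<="` in the candidate list would instead produce four tokens.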
I've added a comment to the `readNext` method, but I think the code should be quite self-explanatory as to how multiple matches are handled (given that the core function that extracts tokens is mostly a long `else if` chain).
}
// Assume the token is a predicate path. This will be verified
// separately later.
_lexer.expandNextUntilWhitespace();
Are we sure there is actually whitespace after this? I assume this is covered by the `_re_string` being completely consumed, right?
If the query is valid, there has to be whitespace. Otherwise we will fail later on (either we reach the end of the input, or the property path is not valid, or there is a dot in place of a variable / iri).
src/engine/QueryPlanner.h
Outdated
@@ -68,13 +68,35 @@ class QueryPlanner {

Node& operator=(const Node& other) = default;

// Returns true if tje two nodes equal apart from the id
typo
fixed
This now looks great to me; however, because they use the wrong string literal syntax in their queries (which was accepted by the old parser), this breaks the autocompletion UI. I'll still have to figure out how to fix this. @jbuerklin can you make sure you only use |
@floriankramer I just looked this up in the SPARQL Spec and it looks like using |
@niklas88 The SPARQL grammar actually contains four types of strings. It supports both |
@floriankramer I think the old parser did somehow accept |
@niklas88 Interestingly the lexer actually already accepted |
@floriankramer very nice work. I can confirm the |
This is the new lexer and parser part of @floriankramer's original PR #258. It replaces ad-hoc searches in the parser with proper regex-based lexing. In the process, parsing becomes much more resilient and many broken corner cases are fixed. It also makes the parser much more maintainable, more easily extendable, and a lot easier to reason about. Thanks @floriankramer for the amazing work.
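As a rough illustration of the general technique (regex-based lexing instead of ad-hoc string searches), a toy tokenizer might look like the following. The token names and patterns here are invented for the example; they are not QLever's actual token definitions:

```cpp
#include <cassert>
#include <cctype>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Token {
  std::string type;
  std::string text;
};

// Try each token regex anchored at the current position and emit the
// first rule that matches (toy example; real token definitions differ).
std::vector<Token> tokenize(const std::string& input) {
  static const std::vector<std::pair<std::string, std::regex>> rules = {
      {"IRIREF", std::regex(R"(<[^<>\s]*>)")},
      {"VAR", std::regex(R"(\?[A-Za-z_][A-Za-z0-9_]*)")},
      {"KEYWORD", std::regex(R"(SELECT|WHERE)", std::regex::icase)},
      {"PUNCT", std::regex(R"([{}.])")},
  };
  std::vector<Token> out;
  size_t pos = 0;
  while (pos < input.size()) {
    if (std::isspace(static_cast<unsigned char>(input[pos]))) {
      ++pos;
      continue;
    }
    std::smatch m;
    bool matched = false;
    for (const auto& [type, re] : rules) {
      std::string rest = input.substr(pos);
      // match_continuous anchors the match at the start of `rest`.
      if (std::regex_search(rest, m, re,
                            std::regex_constants::match_continuous)) {
        out.push_back({type, m.str()});
        pos += static_cast<size_t>(m.length());
        matched = true;
        break;
      }
    }
    if (!matched) throw std::runtime_error("unexpected character");
  }
  return out;
}
```

The payoff of this style over ad-hoc `find`/`substr` calls is that each token class is defined in exactly one place, so adding or tightening a token is a local change and malformed input fails fast instead of being silently misinterpreted.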
This builds on top of #270