Better lexer, better parser, better tests and VALUES #258

Closed
floriankramer wants to merge 18 commits

Conversation

floriankramer
Member

This PR adds support for inline VALUES statements.

@niklas88
Member

niklas88 commented Jun 19, 2019

The following query on Wikidata currently seems to result in an infinite loop during parsing.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?city WHERE {
  VALUES ?citytype { wd:Q515 wd:Q262166}
  ?city wdt:P31 ?citytype .
}

Adding a space before the } in the VALUES clause, like so:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
SELECT ?city WHERE {
  VALUES ?citytype { wd:Q515 wd:Q262166 }
  ?city wdt:P31 ?citytype .
}

results in the following error:

BAD INPUT STRING (The word wd:Q515is not part of the vocabulary.; in /app/src/engine/Values.cpp, line 147, function static void Values::writeValues(IdTable*, const Index&, const SparqlValues&) [with long unsigned int I = 1; IdTable = IdTableStatic<>])

Sounds like you missed a prefix expansion step.
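For context, a minimal sketch (not the actual QLever code; all names here are made up) of what that expansion step does: every prefixed name in the VALUES clause has to be turned into its full IRI before the vocabulary lookup.

```cpp
// Sketch only: expand a prefixed name like wd:Q515 into the full IRI form
// that the vocabulary lookup expects. Not QLever code; names are hypothetical.
#include <iostream>
#include <map>
#include <string>

std::string expandPrefixedName(const std::string& name,
                               const std::map<std::string, std::string>& prefixes) {
  auto colon = name.find(':');
  if (colon == std::string::npos) return name;
  auto it = prefixes.find(name.substr(0, colon));
  if (it == prefixes.end()) return name;  // unknown prefix: leave unchanged
  return "<" + it->second + name.substr(colon + 1) + ">";
}

int main() {
  std::map<std::string, std::string> prefixes = {
      {"wd", "http://www.wikidata.org/entity/"}};
  // Prints <http://www.wikidata.org/entity/Q515>, the form the vocabulary knows.
  std::cout << expandPrefixedName("wd:Q515", prefixes) << std::endl;
}
```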

@niklas88
Member

@floriankramer correction: with the missing ' ' it not only loops infinitely but also slowly consumes memory until the process is killed.

@floriankramer
Member Author

@niklas88 Both the prefix expansion and the infinite loop when a VALUES clause is terminated without whitespace before the closing brace should be fixed now.

@niklas88
Member

@floriankramer something is still really broken in the Values operation.

(OT: 🤯 GitHub supports SPARQL syntax highlighting if you use ```SPARQL\n <code>```)

Consider the following queries:

  • For QLever
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?city ?cityLabel ?citytype WHERE {
  VALUES ?citytype { wd:Q515}
  ?city wdt:P31 ?citytype .
  ?city rdfs:label ?cityLabel .
  FILTER (lang(?cityLabel) = "en") .
}
  • And the equivalent for WQS
SELECT ?city ?cityLabel ?citytype WHERE {
  VALUES ?citytype { wd:Q515}
  ?city wdt:P31 ?citytype .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

On WQS this gives 9209 results; on QLever it matches only one.

  • From my understanding, the query above should be syntactic sugar for the following:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?city ?cityLabel ?citytype WHERE {
  ?city wdt:P31 wd:Q515 .
  ?city rdfs:label ?cityLabel .
  FILTER (lang(?cityLabel) = "en") .
}

However, this returns 7862 results, which is realistically the same as WQS given that our dataset is older.

Also, adding a space after wd:Q515, as in VALUES ?citytype {wd:Q515 }, results in a BAD_INPUT exception, so maybe it just does something weird with a single value?

@@ -403,7 +403,7 @@ queries:
 ?x ql:has-predicate ?predicate .
 }
 GROUP BY ?predicate
-HAVING (?predicate < <Z) (?predicate = <Religion>)
+HAVING (?predicate < <Z-Cars>) (?predicate = <Religion>)
Member

I think this is actually issue #249 and specifically the need to convert the predicate with str() to allow URLs to be treated as strings.

Member Author

@niklas88 If we implement the filters the way #249 suggests (using the same schema for all types of filters), we will run into a problem with the way our vocabulary is stored. In the following example we cannot return the correct value with a single greater-than filter:

SELECT ?a WHERE {
  VALUES ?a {<cake3> <cake>}
  FILTER(str(?a) > "cake")
}

The string <cake3> is lexicographically smaller than <cake>, but we want to return <cake3> and not <cake>, since "cake3" > "cake".
The same problem occurs with literals, as the '!' character is smaller than '"'.
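To make the mismatch concrete, a minimal sketch (plain C++, not the actual vocabulary code) showing that the stored form and the str() value order differently:

```cpp
// Illustration only (not QLever code): the vocabulary stores IRIs with their
// angle brackets, so the trailing '>' takes part in the comparison, while
// str() compares the bare value.
#include <cassert>
#include <string>

int main() {
  std::string storedCake3 = "<cake3>";  // stored vocabulary form
  std::string storedCake = "<cake>";
  std::string strCake3 = "cake3";       // str(<cake3>)
  std::string strCake = "cake";

  // '3' (0x33) sorts before '>' (0x3E), so the stored forms compare "the
  // wrong way around" relative to the str() values.
  assert(storedCake3 < storedCake);
  assert(strCake3 > strCake);
}
```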

Member Author

@niklas88 I did a quick and dirty fix of the problem by treating filters on literals that contain either < or " differently. While that is not standard-conformant, the proper fix seems to be a more significant change that might require changes to the way the vocabulary is stored and should probably be done in its own PR.

@floriankramer
Member Author

@niklas88 The parser and lexer should now support the full syntax the old parser supported (minus invalid literals without quotation marks). I've also fixed the QueryPlanner tests that depended on the iteration order of the HashMap, which caused them to fail on my system and prevented me from finding an actual bug in the test. This should fix all parsing problems with VALUES. I believe it is still possible that there is another bug (you mentioned getting fewer results using VALUES with a single value vs. a triple with the value directly), but I haven't looked into that yet.
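As an aside, a minimal sketch (made-up names, not the actual QueryPlanner test) of the kind of non-determinism meant here and the usual way to make such a check order-independent:

```cpp
// Sketch only: iteration order of a hash map is unspecified, so a test that
// compares output in map order can pass on one system and fail on another.
// Sorting before comparing makes the check deterministic.
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  std::unordered_map<std::string, int> planCosts = {
      {"join", 3}, {"scan", 1}, {"sort", 2}};

  std::vector<std::string> seen;
  for (const auto& entry : planCosts) {
    seen.push_back(entry.first);  // order depends on the hash implementation
  }

  // Fragile: asserting seen == {"join", "scan", "sort"} directly would depend
  // on the hash map's iteration order. Sort first, then compare.
  std::sort(seen.begin(), seen.end());
  assert((seen == std::vector<std::string>{"join", "scan", "sort"}));
}
```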

@niklas88
Member

@floriankramer oh wow, this is pretty amazing (as always), and that non-determinism fix is something that's really good long term. For one, we can finally pass that test using Abseil hash maps and figure out whether those are faster. I'll test your changes first thing tomorrow, but from what I've seen so far the code already looks great.

niklas88 changed the title from "Feat values" to "Better lexer, better parser, better tests and VALUES" on Jul 18, 2019
@niklas88
Member

I'm currently testing this with our Wikidata Full index and the queries from the crawled Wikidata examples that were previously working. It now fails for 6 of them, but it looks like all of those use features that we currently don't actually support; they just got through parsing and gave some result (I currently don't have code to check the results against Blazegraph). Interestingly, the following use of FILTER (lang(?var), ?langCode) doesn't seem to be valid anyway: the query itself times out on WQS, and I can't get this FILTER use working with another query, but WQS does parse it, so I'm not sure whether it's supposed to work.

    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX p: <http://www.wikidata.org/prop/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX psn: <http://www.wikidata.org/prop/statement/value-normalized/>

    SELECT ?city ?cityLabel ?country ?countryLabel ?lang ?langLabel ?langCode ?population WHERE {
     ?city wdt:P1082 ?population .
     FILTER (?population>1000000) .
     ?city wdt:P31 wd:Q515 .
     ?city wdt:P17 ?country .
     ?city rdfs:label ?cityLabel .
     ?country wdt:P37 ?lang .
     ?country rdfs:label ?countryLabel .
     ?lang wdt:P424 ?langCode .
     ?lang rdfs:label ?langLabel .
     FILTER (lang(?cityLabel)=?langCode) .
     FILTER (lang(?countryLabel)=?langCode) .
     FILTER (lang(?langLabel)=?langCode) .
    }
    LIMIT 100

As for the previously reported problem with

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?city ?cityLabel ?citytype WHERE {
  VALUES ?citytype { wd:Q515 }
  ?city wdt:P31 ?citytype .
  ?city rdfs:label ?cityLabel .
  FILTER (lang(?cityLabel) = "en") .
}

This no longer seems to be a parsing problem, but it still gives only one result on QLever versus >7000 results on WQS.

@niklas88
Member

@floriankramer OK, another finding: there seems to be a problem with SELECT * WHERE …; the following query sends QLever into an infinite loop:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX psn: <http://www.wikidata.org/prop/statement/value-normalized/>
SELECT * WHERE {
 ?disease ?p ?o .
 ?disease wdt:P699 ?doid .
 ?o owl:sameAs ?sa .
}

The following log output suggests that it hangs during parsing.

Thu Jul 18 10:32:14.280 - DEBUG: Parsing PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX psn: <http://www.wikidata.org/prop/statement/value-normalized/>
SELECT * WHERE {
 ?disease ?p ?o .
 ?disease wdt:P699 ?doid .
 ?o owl:sameAs ?sa .
}

@floriankramer
Member Author

floriankramer commented Jul 18, 2019

> This no longer seems to be a parsing problem, but it still gives only one result on QLever versus >7000 results on WQS.

@niklas88 There was a bug in the galloping join code that made it terminate after the last element of the right side of the join had been used once for computing the cross product. I've fixed that and also turned the galloping join into a (hopefully) proper galloping join, instead of just using a binary search in the remainder of the larger list.
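For reference, a minimal sketch (not the QLever implementation; names are made up) of what "proper galloping" means here: instead of binary-searching the whole remainder of the larger list, the probe distance grows exponentially from the current position, and only the last doubled window is binary-searched.

```cpp
// Exponential ("galloping") search sketch: find the first element >= needle
// in the sorted vector v, starting at index `from`. Assumes from <= v.size()
// and v sorted ascending.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

size_t gallop(const std::vector<int>& v, size_t from, int needle) {
  size_t bound = 1;
  // Double the probe distance until we overshoot the needle or run off the end.
  while (from + bound < v.size() && v[from + bound] < needle) {
    bound *= 2;
  }
  // Binary-search only inside the last doubled window.
  size_t lo = from + bound / 2;
  size_t hi = std::min(from + bound + 1, v.size());
  return std::lower_bound(v.begin() + lo, v.begin() + hi, needle) - v.begin();
}

int main() {
  std::vector<int> v = {1, 3, 5, 7, 9, 11};
  std::cout << gallop(v, 0, 7) << std::endl;  // prints 3
}
```

The payoff over restarting a binary search on the whole remainder is that advancing by a distance d only costs O(log d) probes, which is what makes the join cheap when the matching rows cluster close together.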

@floriankramer
Member Author

@niklas88 Regarding FILTER (lang(?var), ?langCode): as far as I can tell, the part in the brackets has to be a BracketedExpression, which does not allow for , as a comparator (based on the grammar).
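For reference, the relevant productions (paraphrased from the SPARQL 1.1 grammar, production numbers omitted):

```
Filter               ::= 'FILTER' Constraint
Constraint           ::= BrackettedExpression | BuiltInCall | FunctionCall
BrackettedExpression ::= '(' Expression ')'
```

If I read the spec correctly, an Expression has no top-level comma (commas only appear in argument lists), so FILTER (lang(?var), ?langCode) doesn't match any of these alternatives.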

} else {
  _lexer.accept();
  throw ParseException("Error in SELECT: unexpected token: " +
                       _lexer.current().raw);
Member

Argh, you beat me to it; my fix was almost the same ;-)

Member Author

@niklas88 I'm currently fixing whatever I broke in the galloping join, and also fixing the normal join, which had the same bug as the galloping case.

Member

How did you even find that?

Member Author

Something had to be broken, and the VALUES statement returned the correct number of rows, so the likely culprit was the join operation (as without the VALUES clause, the scan and join become a single scan).

return l[jc2] < needle;
}) -
l2.begin();
size_t step = 1;
Member

I guess you can remove the TODO then, if it actually gallops now.

}
if (!hasMatch) {
std::cout << asString() << std::endl;
std::cout << other.asString() << std::endl;
Member

as above

// This method is very verbose as it is currently only intended for
// testing
if (_nodeStorage.size() != other._nodeStorage.size()) {
std::cout << asString() << std::endl;
Member

Even for testing I would use the LOG macros.

ad-freiburg deleted a comment from floriankramer on Jul 18, 2019
@niklas88
Member

@floriankramer I added the @langtag@foo:bar and @langtag<foo> stuff and also removed the syntax hack for SCORE(?a|?b), as acked by @Buchhold. I also rebased on master. Since that last step worked well, I'd like to try to split this into at least two PRs we can merge separately; otherwise it's just too big for me to feel comfortable pulling it in one step. Other than the breadth of changes, I don't see any blockers for merging, since I followed along with the development.

@niklas88
Member

niklas88 commented Aug 6, 2019

This functionality was split into the separate PRs mentioned above, all of which have now been merged.

niklas88 closed this on Aug 6, 2019