Better lexer, better parser, better tests and VALUES #258

Closed
floriankramer wants to merge 18 commits

Conversation

floriankramer
Member

This PR adds support for inline VALUES statements.

@niklas88
Member

niklas88 commented Jun 19, 2019

The following query on Wikidata currently seems to result in an infinite loop during parsing.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?city WHERE {
  VALUES ?citytype { wd:Q515 wd:Q262166}
  ?city wdt:P31 ?citytype .
}

Adding a space before the } in the VALUES clause, like so:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
SELECT ?city WHERE {
  VALUES ?citytype { wd:Q515 wd:Q262166 }
  ?city wdt:P31 ?citytype .
}

results in the following error:

BAD INPUT STRING (The word wd:Q515is not part of the vocabulary.; in /app/src/engine/Values.cpp, line 147, function static void Values::writeValues(IdTable*, const Index&, const SparqlValues&) [with long unsigned int I = 1; IdTable = IdTableStatic<>])

Sounds like you missed a prefix expansion step.
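For context, a minimal sketch (not the actual QLever code; all names here are made up) of what that expansion step does: every prefixed name in the VALUES clause has to be turned into its full IRI before the vocabulary lookup.

```cpp
// Sketch only: expand a prefixed name like wd:Q515 into the full IRI form
// that the vocabulary lookup expects. Not QLever code; names are hypothetical.
#include <iostream>
#include <map>
#include <string>

std::string expandPrefixedName(const std::string& name,
                               const std::map<std::string, std::string>& prefixes) {
  auto colon = name.find(':');
  if (colon == std::string::npos) return name;
  auto it = prefixes.find(name.substr(0, colon));
  if (it == prefixes.end()) return name;  // unknown prefix: leave unchanged
  return "<" + it->second + name.substr(colon + 1) + ">";
}

int main() {
  std::map<std::string, std::string> prefixes = {
      {"wd", "http://www.wikidata.org/entity/"}};
  // Prints <http://www.wikidata.org/entity/Q515>, the form the vocabulary knows.
  std::cout << expandPrefixedName("wd:Q515", prefixes) << std::endl;
}
```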

@niklas88
Member

@floriankramer correction: with the missing ' ' it not only loops infinitely but also slowly consumes memory until the process is killed.

@floriankramer
Member Author

@niklas88 Both the prefix expansion and the infinite loop when a VALUES clause is terminated without whitespace before the closing brace should be fixed now.

@niklas88
Member

@floriankramer something is still really broken in the Values operation.

(OT: 🤯 GitHub supports SPARQL syntax highlighting if you use ```SPARQL\n <code>```)

Consider the following queries:

  • For QLever
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?city ?cityLabel ?citytype WHERE {
  VALUES ?citytype { wd:Q515}
  ?city wdt:P31 ?citytype .
  ?city rdfs:label ?cityLabel .
  FILTER (lang(?cityLabel) = "en") .
}
  • And the equivalent for WQS
SELECT ?city ?cityLabel ?citytype WHERE {
  VALUES ?citytype { wd:Q515}
  ?city wdt:P31 ?citytype .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

On WQS this gives 9209 results; on QLever it matches only one.

  • From my understanding, the query above should be syntactic sugar for the following:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?city ?cityLabel ?citytype WHERE {
  ?city wdt:P31 wd:Q515 .
  ?city rdfs:label ?cityLabel .
  FILTER (lang(?cityLabel) = "en") .
}

However, this returns 7862 results, which is realistically the same as WQS given that our dataset is older.

Also, adding a space after wd:Q515, as in VALUES ?citytype {wd:Q515 }, results in a BAD_INPUT exception, so maybe it just does something weird with a single value?

@@ -403,7 +403,7 @@ queries:
 ?x ql:has-predicate ?predicate .
 }
 GROUP BY ?predicate
-HAVING (?predicate < <Z) (?predicate = <Religion>)
+HAVING (?predicate < <Z-Cars>) (?predicate = <Religion>)
Member

I think this is actually issue #249 and specifically the need to convert the predicate with str() to allow URLs to be treated as strings.

Member Author

@niklas88 If we implement the filters the way #249 suggests (using the same schema for all types of filters), we will run into a problem with the way our vocabulary is stored. In the following example we cannot return the correct value with a single greater-than filter:

SELECT ?a WHERE {
  VALUES ?a {<cake3> <cake>}
  FILTER(str(?a) > "cake")
}

The string <cake3> is lexicographically smaller than <cake>, but we want to return <cake3> and not <cake>, since "cake3" > "cake".
The same problem occurs with literals, as the '!' character is smaller than '"'.
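To make the mismatch concrete, a minimal sketch (plain C++, not the actual vocabulary code) showing that the stored form and the str() value order differently:

```cpp
// Illustration only (not QLever code): the vocabulary stores IRIs with their
// angle brackets, so the trailing '>' takes part in the comparison, while
// str() compares the bare value.
#include <cassert>
#include <string>

int main() {
  std::string storedCake3 = "<cake3>";  // stored vocabulary form
  std::string storedCake = "<cake>";
  std::string strCake3 = "cake3";       // str(<cake3>)
  std::string strCake = "cake";

  // '3' (0x33) sorts before '>' (0x3E), so the stored forms compare "the
  // wrong way around" relative to the str() values.
  assert(storedCake3 < storedCake);
  assert(strCake3 > strCake);
}
```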

Member Author

@niklas88 I did a quick and dirty fix of the problem by treating filters on literals that contain either < or " differently. While that is not standard-conformant, the proper fix seems to be a more significant change that might require changes to the way the vocabulary is stored and should probably be done in its own PR.

@floriankramer
Member Author

@niklas88 The parser and lexer should now support the full syntax the old parser supported (minus invalid literals without quotation marks). I've also fixed the QueryPlanner tests that depended on the iteration order of the HashMap, which caused them to fail on my system and prevented me from finding an actual bug in the test. This should fix all parsing problems with VALUES. I believe it is still possible that there is another bug (you mentioned getting fewer results using VALUES with a single value vs. a triple with the value directly), but I haven't looked into that yet.
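As an aside, a minimal sketch (made-up names, not the actual QueryPlanner test) of the kind of non-determinism meant here and the usual way to make such a check order-independent:

```cpp
// Sketch only: iteration order of a hash map is unspecified, so a test that
// compares output in map order can pass on one system and fail on another.
// Sorting before comparing makes the check deterministic.
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  std::unordered_map<std::string, int> planCosts = {
      {"join", 3}, {"scan", 1}, {"sort", 2}};

  std::vector<std::string> seen;
  for (const auto& entry : planCosts) {
    seen.push_back(entry.first);  // order depends on the hash implementation
  }

  // Fragile: asserting seen == {"join", "scan", "sort"} directly would depend
  // on the hash map's iteration order. Sort first, then compare.
  std::sort(seen.begin(), seen.end());
  assert((seen == std::vector<std::string>{"join", "scan", "sort"}));
}
```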

@niklas88
Member

@floriankramer oh wow, this is pretty amazing (as always), and that non-determinism fix is something that's really good long term. For one, we can finally pass that test using Abseil hash maps and figure out whether those are faster. I'll test your changes first thing tomorrow, but from what I've seen so far the code already looks great.

niklas88 changed the title from "Feat values" to "Better lexer, better parser, better tests and VALUES" on Jul 18, 2019
@niklas88
Member

I'm currently testing this with our Wikidata Full index and the queries from the crawled Wikidata examples that were previously working. It now fails for 6 of them, but it looks like all of those use features that we currently don't actually support; they just got through parsing and gave some result (I currently don't have code to check the results against Blazegraph). Interestingly, the following use of FILTER (lang(?var), ?langCode) doesn't seem to be valid anyway: the query itself times out on WQS, and I can't get this FILTER use working with another query, but WQS does parse it, so I'm not sure whether it's supposed to work.

    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX p: <http://www.wikidata.org/prop/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX psn: <http://www.wikidata.org/prop/statement/value-normalized/>

    SELECT ?city ?cityLabel ?country ?countryLabel ?lang ?langLabel ?langCode ?population WHERE {
     ?city wdt:P1082 ?population .
     FILTER (?population>1000000) .
     ?city wdt:P31 wd:Q515 .
     ?city wdt:P17 ?country .
     ?city rdfs:label ?cityLabel .
     ?country wdt:P37 ?lang .
     ?country rdfs:label ?countryLabel .
     ?lang wdt:P424 ?langCode .
     ?lang rdfs:label ?langLabel .
     FILTER (lang(?cityLabel)=?langCode) .
     FILTER (lang(?countryLabel)=?langCode) .
     FILTER (lang(?langLabel)=?langCode) .
    }
    LIMIT 100

As for the previously reported problem with

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?city ?cityLabel ?citytype WHERE {
  VALUES ?citytype { wd:Q515 }
  ?city wdt:P31 ?citytype .
  ?city rdfs:label ?cityLabel .
  FILTER (lang(?cityLabel) = "en") .
}

This no longer seems to be a parsing problem, but it still gives only one result on QLever versus >7000 results on WQS.

@niklas88
Member

@floriankramer OK, another finding: there seems to be a problem with SELECT * WHERE …; the following query sends QLever into an infinite loop:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX psn: <http://www.wikidata.org/prop/statement/value-normalized/>
SELECT * WHERE {
 ?disease ?p ?o .
 ?disease wdt:P699 ?doid .
 ?o owl:sameAs ?sa .
}

The following log output suggests that it hangs during parsing.

Thu Jul 18 10:32:14.280 - DEBUG: Parsing PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX psn: <http://www.wikidata.org/prop/statement/value-normalized/>
SELECT * WHERE {
 ?disease ?p ?o .
 ?disease wdt:P699 ?doid .
 ?o owl:sameAs ?sa .
}

@floriankramer
Member Author

floriankramer commented Jul 18, 2019

> This no longer seems to be a parsing problem, but it still gives only one result on QLever versus >7000 results on WQS.

@niklas88 There was a bug in the galloping join code that made it terminate after the last element of the right side of the join had been used once for computing the cross product. I've fixed that and also turned the galloping join into a (hopefully) proper galloping join, instead of just using a binary search in the remainder of the larger list.
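For reference, a minimal sketch (not the QLever implementation; names are made up) of what "proper galloping" means here: instead of binary-searching the whole remainder of the larger list, the probe distance grows exponentially from the current position, and only the last doubled window is binary-searched.

```cpp
// Exponential ("galloping") search sketch: find the first element >= needle
// in the sorted vector v, starting at index `from`. Assumes from <= v.size()
// and v sorted ascending.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

size_t gallop(const std::vector<int>& v, size_t from, int needle) {
  size_t bound = 1;
  // Double the probe distance until we overshoot the needle or run off the end.
  while (from + bound < v.size() && v[from + bound] < needle) {
    bound *= 2;
  }
  // Binary-search only inside the last doubled window.
  size_t lo = from + bound / 2;
  size_t hi = std::min(from + bound + 1, v.size());
  return std::lower_bound(v.begin() + lo, v.begin() + hi, needle) - v.begin();
}

int main() {
  std::vector<int> v = {1, 3, 5, 7, 9, 11};
  std::cout << gallop(v, 0, 7) << std::endl;  // prints 3
}
```

The payoff over restarting a binary search on the whole remainder is that advancing by a distance d only costs O(log d) probes, which is what makes the join cheap when the matching rows cluster close together.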

@floriankramer
Member Author

@niklas88 Regarding FILTER (lang(?var), ?langCode): as far as I can tell, the part in the brackets has to be a BracketedExpression, which does not allow for , as a comparator (based on the grammar).
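For reference, the relevant productions (paraphrased from the SPARQL 1.1 grammar, production numbers omitted):

```
Filter               ::= 'FILTER' Constraint
Constraint           ::= BrackettedExpression | BuiltInCall | FunctionCall
BrackettedExpression ::= '(' Expression ')'
```

If I read the spec correctly, an Expression has no top-level comma (commas only appear in argument lists), so FILTER (lang(?var), ?langCode) doesn't match any of these alternatives.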

} else {
  _lexer.accept();
  throw ParseException("Error in SELECT: unexpected token: " +
                       _lexer.current().raw);
Member

Argh, you beat me to it; my fix was almost the same ;-)

Member Author

@niklas88 I'm currently fixing whatever I broke in the galloping join, and also fixing the normal join, which had the same bug as the galloping case.

Member

How did you even find that?

Member Author

Something had to be broken, and the VALUES statement returned the correct number of rows, so the likely culprit was the join operation (as without the VALUES clause, the scan and join become a single scan).

return l[jc2] < needle;
}) -
l2.begin();
size_t step = 1;
Member

I guess you can remove the TODO then, if it actually gallops now.

}
if (!hasMatch) {
std::cout << asString() << std::endl;
std::cout << other.asString() << std::endl;
Member

as above

// This method is very verbose as it is currently only intended for
// testing
if (_nodeStorage.size() != other._nodeStorage.size()) {
std::cout << asString() << std::endl;
Member

Even for testing I would use the LOG macros.

ad-freiburg deleted a comment from floriankramer on Jul 18, 2019
@niklas88
Member

@floriankramer I added the @langtag@foo:bar and @langtag<foo> stuff and also removed the syntax hack for SCORE(?a|?b), as acked by @Buchhold. I also rebased on master. Since that last step worked well, I'd like to try to split this into at least two PRs we can merge separately; otherwise it's just too big for me to feel comfortable pulling it in one step. Other than the breadth of changes, I don't see any blockers for merging, since I followed along with the development.

@niklas88
Member

niklas88 commented Aug 6, 2019

This functionality was split into the separate PRs mentioned above, all of which have now been merged.

niklas88 closed this on Aug 6, 2019