Handling Escaped characters (ECHAR in Sparql/Turtle Grammar) is wrong for the SparqlParser, the TurtleParser and the Regex Filter Parser #296

hannahbast · 2019-12-14T21:09:12Z

The following query takes 164 seconds on http://qlever.informatik.uni-freiburg.de/Wikidata_Full :

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <http://schema.org/>
SELECT ?x WHERE {
  ?x rdf:type schema:Article .
  FILTER regex(?x, "^<https://en.wikipedia.org/wiki/Albert_Ein")
}

Here are the specs from the execution tree:

FILTER ?x regex ^<https://en.wikipedia.org/wiki/Albert_Ein
Size: 29 x 1
Cols: ?x
Time: 162,932ms

INDEX SCAN ?x <Article>
Size: 67,677,598 x 1
Cols: ?x
Time: 248ms

I thought that a prefix regex FILTER is implemented by doing one or two binary searches on the sorted IDs and then materializing strings only for the result IDs (only 29 in this case).

However, the high query time indicates that the strings are looked up for all 67,677,598 IDs. Why?
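For illustration, the binary-search idea described above can be sketched as follows. This is a minimal sketch and not QLever's actual code: it assumes that IDs are positions in a lexicographically sorted vocabulary, and the names `prefix_upper_bound` and `prefix_range` are hypothetical.

```python
import bisect
import sys


def prefix_upper_bound(prefix):
    """Smallest string greater than every string starting with `prefix`.

    Returns None if no such bound exists (empty prefix, or a prefix made
    only of the maximal code point).
    """
    trimmed = prefix.rstrip(chr(sys.maxunicode))
    if not trimmed:
        return None
    # Increment the last remaining character by one code point.
    return trimmed[:-1] + chr(ord(trimmed[-1]) + 1)


def prefix_range(sorted_vocab, prefix):
    """Return (lo, hi) so that sorted_vocab[lo:hi] are exactly the entries
    starting with `prefix`. Uses two binary searches, no full scan."""
    lo = bisect.bisect_left(sorted_vocab, prefix)
    upper = prefix_upper_bound(prefix)
    if upper is None:
        return lo, len(sorted_vocab)
    hi = bisect.bisect_left(sorted_vocab, upper, lo)
    return lo, hi


# Hypothetical toy vocabulary; ID = position in the sorted list.
vocab = sorted([
    "<https://en.wikipedia.org/wiki/Albert_Camus>",
    "<https://en.wikipedia.org/wiki/Albert_Einstein>",
    "<https://en.wikipedia.org/wiki/Albert_Einstein_Medal>",
    "<https://en.wikipedia.org/wiki/Alberta>",
])
lo, hi = prefix_range(vocab, "<https://en.wikipedia.org/wiki/Albert_Ein")
matching_ids = range(lo, hi)  # only these IDs need their strings looked up
```

With such a range, a filter over 67,677,598 sorted IDs would only have to compare each ID against `lo` and `hi` instead of looking up its string.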

joka921 · 2019-12-24T11:22:29Z

Fixed via #295

hannahbast · 2019-12-24T18:18:32Z

@joka921 I don't understand how this is fixed by #295 (which was about the pattern trick not being used in some cases). The problem for the query above is that the prefix FILTER takes forever, although it could be fast.

I just tried the query again on the current version (where #295 has been incorporated) and the problem is still there.

joka921 · 2020-01-03T12:40:40Z

I had a look at this and found the following:

The actual problem is simple: "^<https://en.wikipedia.org/wiki/Albert_Ein" is not a simple prefix regex, because it contains a . which means "match any character". So the actual behavior in your case is correct.
You probably wanted to escape the . characters. To my understanding this should be done with two backslashes, one for Sparql and one for the regex engine:
FILTER regex(?x, "^<https://en\\.wikipedia\\.org/wiki/Albert_Ein")
This escaping is broken on many levels in the current parsing: the actual lexing regex is wrong, the handling of escapes in the regex filter parser is wrong, and the Sparql escape handling is currently nonexistent. I will have a closer look at this.
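The two escaping levels mentioned above can be sketched like this. It is a minimal sketch under stated assumptions, not the parser's actual code: `sparql_unescape` and `literal_prefix` are hypothetical helper names, the escape table follows the ECHAR rule of the SPARQL 1.1 grammar (a backslash followed by one of t, b, n, r, f, backslash, double quote, single quote), and the set of regex metacharacters is a conservative choice.

```python
# ECHAR escapes from the SPARQL 1.1 grammar.
_ECHAR = {"t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
          "\\": "\\", '"': '"', "'": "'"}

# Characters treated as regex metacharacters (conservative assumption).
_REGEX_META = set(".^$*+?()[]{}|\\")


def sparql_unescape(body):
    """Level 1: resolve ECHAR escapes in the body of a SPARQL string
    literal (surrounding quotes already stripped)."""
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            if i + 1 >= len(body) or body[i + 1] not in _ECHAR:
                raise ValueError(f"invalid escape at position {i}")
            out.append(_ECHAR[body[i + 1]])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


def literal_prefix(pattern):
    """Level 2: if the regex is '^' followed only by literal characters
    (possibly backslash-escaped), return that literal text, else None."""
    if not pattern.startswith("^"):
        return None
    out = []
    i = 1
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # An escaped metacharacter stands for itself; other escapes
            # (such as \d) are not literal text.
            if i + 1 < len(pattern) and pattern[i + 1] in _REGEX_META:
                out.append(pattern[i + 1])
                i += 2
                continue
            return None
        if ch in _REGEX_META:
            return None  # e.g. an unescaped '.', which matches any character
        out.append(ch)
        i += 1
    return "".join(out)


# The string body as written in the query text (raw string: two backslashes each).
escaped = sparql_unescape(r"^<https://en\\.wikipedia\\.org/wiki/Albert_Ein")
prefix = literal_prefix(escaped)
not_a_prefix = literal_prefix("^<https://en.wikipedia.org/wiki/Albert_Ein")
```

Here `prefix` is the plain string to search for, while `not_a_prefix` is `None` because the unescaped dots make the pattern a general regex. Note that a single backslash before the dot is not a valid ECHAR, so `sparql_unescape` rejects it.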

hannahbast assigned hannahbast, joka921 and floriankramer and unassigned hannahbast Dec 14, 2019

joka921 closed this as completed Dec 24, 2019

joka921 reopened this Dec 24, 2019

joka921 closed this as completed Dec 24, 2019

hannahbast reopened this Dec 24, 2019

joka921 changed the title ~~Prefix FILTER query takes very long although it shouldn't~~ Handling Escaped characters (ECHAR in Sparql/Turtle Grammar) is wrong for the SparqlParser, the TurtleParser and the Regex Filter Parser Jan 3, 2020

joka921 mentioned this issue Jan 3, 2020

Fixing Escaped Strings in the Sparql Parser #300

Merged

hannahbast closed this as completed May 28, 2023
