Handling Escaped characters (ECHAR in Sparql/Turtle Grammar) is wrong for the SparqlParser, the TurtleParser and the Regex Filter Parser #296

Closed
hannahbast opened this issue Dec 14, 2019 · 3 comments
@hannahbast

The following query takes 164 seconds on http://qlever.informatik.uni-freiburg.de/Wikidata_Full :

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <http://schema.org/>
SELECT ?x WHERE {
  ?x rdf:type schema:Article .
  FILTER regex(?x, "^<https://en.wikipedia.org/wiki/Albert_Ein")
}

Here are the specs from the execution tree:

FILTER ?x regex ^<https://en.wikipedia.org/wiki/Albert_Ein
Size: 29 x 1
Cols: ?x
Time: 162,932ms

INDEX SCAN ?x <Article>
Size: 67,677,598 x 1
Cols: ?x
Time: 248ms

I thought that a prefix regex FILTER is implemented by doing one or two binary searches on the sorted IDs and then manifesting strings only for the result IDs (only 29 in this case).

However, the high query time indicates that the strings are looked up for all 67,677,598 IDs. Why?
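
To make the expectation concrete, here is a minimal sketch of a prefix lookup via two binary searches on a sorted vocabulary. It is not QLever's implementation; the function name `prefix_range` and the toy vocabulary are assumptions made for illustration, and the sketch assumes the vocabulary is sorted by plain code point order.

```python
from bisect import bisect_left


def prefix_range(sorted_vocab, prefix):
    """Return (lo, hi) so that sorted_vocab[lo:hi] are the entries starting with prefix.

    Hypothetical helper: assumes sorted_vocab is sorted by code point order
    (Python's default string ordering) and that the last character of prefix
    is not the maximum code point.
    """
    if not prefix:
        return 0, len(sorted_vocab)
    # First binary search: first entry that is >= prefix.
    lo = bisect_left(sorted_vocab, prefix)
    # Smallest string that is greater than every string starting with prefix:
    # the prefix with its last character incremented by one.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    # Second binary search: first entry that is >= upper.
    hi = bisect_left(sorted_vocab, upper)
    return lo, hi


# Toy vocabulary (assumed data, not taken from the Wikidata index).
vocab = sorted([
    "<https://en.wikipedia.org/wiki/Albert_Camus>",
    "<https://en.wikipedia.org/wiki/Albert_Einstein>",
    "<https://en.wikipedia.org/wiki/Alberta>",
    "<https://en.wikipedia.org/wiki/Berlin>",
])
lo, hi = prefix_range(vocab, "<https://en.wikipedia.org/wiki/Albert_")
matches = vocab[lo:hi]
```

With the IDs assigned in vocabulary order, `lo` and `hi` would bound the matching ID range, so only the entries inside that range need their strings looked up.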

@joka921

joka921 commented Dec 24, 2019

Fixed via #295

@joka921 joka921 closed this as completed Dec 24, 2019
@hannahbast

hannahbast commented Dec 24, 2019

@joka921 I don't understand how this is fixed by #295 (which was about the pattern trick not being used in some cases). The problem for the query above is that the prefix FILTER takes forever, although it could be fast.

I just tried the query again on the current version (where #295 has been incorporated) and the problem is still there.

@hannahbast hannahbast reopened this Dec 24, 2019
@joka921

joka921 commented Jan 3, 2020

I had a look at this and found the following:

  • The actual problem is simple: "^<https://en.wikipedia.org/wiki/Albert_Ein" is not a simple prefix regex, because it contains unescaped . characters, and . means "match any character". So the actual behavior in your case is correct.

  • You probably wanted to escape the . characters. To my understanding, this is done with two backslashes, one for SPARQL and one for the regex engine:
    FILTER regex(?x, "^<https://en\\.wikipedia\\.org/wiki/Albert_Ein")

  • This escaping is broken on many levels in the current parsing: the actual lexing regex is wrong, the handling of the escapes in the regex filter parser is wrong, and the SPARQL escape handling is currently nonexistent. I will have a closer look at this.
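
The two escaping layers described above can be sketched as follows. This is an illustrative sketch, not the QLever parser: `unescape_sparql_string` and `literal_prefix` are hypothetical names, the first resolves the ECHAR escapes from the SPARQL grammar (a backslash followed by one of `t b n r f \ " '`), and the second decides whether the resulting regex is a plain prefix pattern, assuming a conservative set of regex metacharacters.

```python
# SPARQL ECHAR: a backslash followed by one of t b n r f \ " '
_ECHAR = {
    "t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
    "\\": "\\", '"': '"', "'": "'",
}

# Characters treated as regex metacharacters in this sketch (an assumption).
_REGEX_META = set(".^$*+?()[]{}|\\")


def unescape_sparql_string(body):
    """Layer 1: resolve ECHAR escapes in a string literal body (quotes already stripped)."""
    out = []
    i = 0
    while i < len(body):
        c = body[i]
        if c == "\\":
            if i + 1 >= len(body) or body[i + 1] not in _ECHAR:
                raise ValueError("invalid escape at position %d" % i)
            out.append(_ECHAR[body[i + 1]])
            i += 2
        else:
            out.append(c)
            i += 1
    return "".join(out)


def literal_prefix(pattern):
    """Layer 2: return the literal prefix if pattern is '^' plus literal or
    backslash-escaped metacharacters only; return None otherwise."""
    if not pattern.startswith("^"):
        return None
    out = []
    i = 1
    while i < len(pattern):
        c = pattern[i]
        if c == "\\":
            # Only an escaped metacharacter counts as a literal here;
            # classes such as \d or a trailing backslash are rejected.
            if i + 1 < len(pattern) and pattern[i + 1] in _REGEX_META:
                out.append(pattern[i + 1])
                i += 2
            else:
                return None
        elif c in _REGEX_META:
            return None  # unescaped metacharacter, e.g. '.'
        else:
            out.append(c)
            i += 1
    return "".join(out)


# The literal body exactly as written in the query text (two backslashes per dot).
written = r"^<https://en\\.wikipedia\\.org/wiki/Albert_Ein"
pattern = unescape_sparql_string(written)  # what the regex engine receives
prefix = literal_prefix(pattern)           # literal prefix for a range lookup
unescaped = literal_prefix("^<https://en.wikipedia.org/wiki/Albert_Ein")
```

Here `prefix` is the plain string with real dots, while the unescaped variant from the original query yields `None`, which is why it falls back to a full regex evaluation over all strings.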

@joka921 joka921 changed the title from "Prefix FILTER query takes very long although it shouldn't" to "Handling Escaped characters (ECHAR in Sparql/Turtle Grammar) is wrong for the SparqlParser, the TurtleParser and the Regex Filter Parser" Jan 3, 2020