Correctly handle Escapes in the Turtle input. #317

joka921 · 2020-02-27T10:15:25Z

Show the position where parsing fails.
Use the proper regex for pnameNS and pnameLN in the (conservative + correct) re2 parser.
Got rid of google::sparsehash completely.
Correct unescaping of "," or "\u00e4" etc.
CTRE parsing now has an additional constraint: there may be no escapes in IRIs like rdfsl:something; the warnings are updated to report this behavior.

hannahbast · 2020-03-20T18:15:24Z

@joka921: are these conflicts easy to fix?

niklas88

LGTM. Also, if it parses the relevant KBs, I think this is pretty low risk, and I also like the newly added tests.

joka921 · 2020-11-29T10:30:37Z

src/parser/Tokenizer.h

@@ -313,6 +373,20 @@ struct TurtleToken {
        case '\\':
          res.push_back('\\');
          break;
+        case 'u': {
+          AD_CHECK(pos + 5 <= literal.size());


The <= looks odd to me. Check and comment this method again and write unit tests for it,
since it is the core of this PR.

It was indeed a mistake, and is fixed now.
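
For readers following this thread, here is a minimal sketch of the kind of bounds check under discussion. It is not the code from Tokenizer.h: the names `unescapeSketch`, `hexValue` and `appendUtf8` are hypothetical, and it assumes that `pos` indexes the `u` itself, which the hunk above does not show. Under that assumption the four hex digits sit at `pos + 1` to `pos + 4`, so `pos + 4` must be a valid index. Which comparison is right in the real code depends on what `pos` refers to there.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: value of a single hex digit, throws otherwise.
inline std::uint32_t hexValue(char c) {
  if (c >= '0' && c <= '9') return static_cast<std::uint32_t>(c - '0');
  if (c >= 'a' && c <= 'f') return static_cast<std::uint32_t>(c - 'a' + 10);
  if (c >= 'A' && c <= 'F') return static_cast<std::uint32_t>(c - 'A' + 10);
  throw std::runtime_error("not a hex digit");
}

// Append the UTF-8 encoding of a code point below 0x10000.
inline void appendUtf8(std::string& out, std::uint32_t cp) {
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Hypothetical sketch: resolves an escaped backslash and the four-digit
// unicode escape in `literal`; every other escape is rejected.
inline std::string unescapeSketch(std::string_view literal) {
  std::string res;
  for (std::size_t pos = 0; pos < literal.size(); ++pos) {
    if (literal[pos] != '\\') {
      res.push_back(literal[pos]);
      continue;
    }
    if (pos + 1 >= literal.size()) {
      throw std::runtime_error("dangling backslash");
    }
    ++pos;  // pos now indexes the character after the backslash
    switch (literal[pos]) {
      case '\\':
        res.push_back('\\');
        break;
      case 'u': {
        // The hex digits occupy pos + 1 .. pos + 4 (assumption: pos is 'u').
        if (pos + 4 >= literal.size()) {
          throw std::runtime_error("truncated unicode escape");
        }
        std::uint32_t cp = 0;
        for (std::size_t i = 1; i <= 4; ++i) {
          cp = cp * 16 + hexValue(literal[pos + i]);
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
          throw std::runtime_error("surrogate code point");
        }
        appendUtf8(res, cp);
        pos += 4;  // skip the digits; the loop increment moves past them
        break;
      }
      default:
        throw std::runtime_error("unsupported escape");
    }
  }
  return res;
}
```

Unit tests for such a function would typically cover a complete escape, an escape that is cut off one digit early, and an escape at the very end of the literal, since those are the inputs where an off-by-one in the check shows up.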

joka921

Somebody else should also have a look at this.

joka921 · 2020-11-29T14:38:08Z


It was indeed a mistake, and is fixed now.

joka921 · 2020-11-29T14:44:43Z

@floriankramer
If you have time, you could have a look at this one.
It is relatively important for the people not using Wikidata, e.g. the OSM people.

- also apply the normalization of literals correctly during index build time
- Adapt the Index unit tests to "legal" knowledge bases
- Get rid of misleading warning in case of whitespace at the end of a TTL file.
  - Previously there was a "parsing of ttl has failed, but there is still content left" warning, although the remainder of the ttl input was only whitespace.
  - This was due to a bug in the Parser's skipWhitespace() function which failed if the input consisted of ONLY whitespace. This is now fixed.
- The case, where a prefix was used with an empty "content" (e.g. <a> wd: <b>) was broken before, luckily there was a unit test and this is now fixed.
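
The whitespace-only case described above can be illustrated with a small sketch. This is not the parser's actual skipWhitespace() and the name `skipWhitespaceSketch` is hypothetical. It assumes the remaining input is held in a `std::string_view`. One common way such a bug arises is that a "find the first non-whitespace character" search reports "not found" for all-whitespace input, and that result has to be treated as "consume everything" rather than as a failure.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not the PR's implementation: removes leading
// whitespace from `input` and returns the number of characters skipped.
inline std::size_t skipWhitespaceSketch(std::string_view& input) {
  constexpr std::string_view kWhitespace = " \t\r\n";
  std::size_t firstNonWs = input.find_first_not_of(kWhitespace);
  // find_first_not_of returns npos when every character is whitespace
  // (or the input is empty); in that case the whole input is consumed.
  if (firstNonWs == std::string_view::npos) {
    firstNonWs = input.size();
  }
  input.remove_prefix(firstNonWs);
  return firstNonWs;
}
```

With this shape, a caller can decide whether content is left simply by checking whether the view is empty after skipping, so trailing whitespace at the end of a file no longer looks like unparsed input.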

joka921

There is still quite some stuff to do.

src/index/PrefixHeuristic.cpp

src/index/PrefixHeuristic.h

src/index/VocabularyGeneratorImpl.h

src/index/VocabularyImpl.h

src/parser/Tokenizer.h

src/parser/TurtleParser.cpp

src/parser/TurtleParser.h

test/TokenTest.cpp

test/TurtleParserTest.cpp

There is still one (not too hard) Test failure.


hannahbast

Partial review in 1-1 with Johannes, thanks a lot!

src/parser/RdfEscaping.h

src/parser/TurtleParser.h

src/parser/Tokenizer.h

src/parser/TurtleParser.h

hannahbast

Finished review in 1-1 with Johannes, thank you very much!

test/IndexTest.cpp

test/TransitivePathTest.cpp

test/TurtleParserTest.cpp

hannahbast · 2021-05-06T19:52:58Z

src/parser/PropertyPathParser.cpp

+    if (!inside_iri && c == '\\') {
+      escaped = !escaped;
+    } else if (!inside_iri && DELIMITER_CHARS[(uint8_t)str[pos]] && escaped) {
+      escaped = false;
+    } else {
+      escaped = false;
+      if (!inside_iri && DELIMITER_CHARS[(uint8_t)str[pos]] &&
+          (pos != 0 || c != '?')) {
+        if (start != pos) {
+          // add the string up to but not including the new token


Can this maybe be simplified? If not, a brief comment explaining each case would be nice.
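
As a possible starting point for the requested simplification, here is a sketch that gives each case its own early exit and comment. It is not the PropertyPathParser code: the function name `tokenizeSketch` and the delimiter set are assumptions, and it leaves out the special handling of a leading `?` that the hunk above contains.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch: splits `str` at delimiter characters and emits each
// delimiter as its own token. Delimiters inside <...> or directly after a
// backslash are treated as ordinary characters.
inline std::vector<std::string> tokenizeSketch(
    std::string_view str, std::string_view delimiters = "/|*+?()^") {
  std::vector<std::string> tokens;
  std::size_t start = 0;   // start of the token currently being read
  bool insideIri = false;  // true between '<' and '>'
  bool escaped = false;    // true if the previous character was a backslash
                           // that was not itself escaped
  for (std::size_t pos = 0; pos < str.size(); ++pos) {
    const char c = str[pos];
    if (insideIri) {
      // Case 1: inside an IRI only the closing '>' is special.
      if (c == '>') insideIri = false;
      continue;
    }
    if (escaped) {
      // Case 2: the character after a backslash is always literal.
      escaped = false;
      continue;
    }
    if (c == '\\') {
      // Case 3: a backslash escapes the next character.
      escaped = true;
      continue;
    }
    if (c == '<') {
      // Case 4: an IRI starts here.
      insideIri = true;
      continue;
    }
    if (delimiters.find(c) != std::string_view::npos) {
      // Case 5: an unescaped delimiter outside of an IRI ends the token.
      if (start != pos) {
        // add the string up to but not including the delimiter
        tokens.emplace_back(str.substr(start, pos - start));
      }
      tokens.emplace_back(1, c);
      start = pos + 1;
    }
  }
  if (start < str.size()) {
    tokens.emplace_back(str.substr(start));
  }
  return tokens;
}
```

The escapes are kept in the tokens as they are; unescaping would be a separate step in this design.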

hannahbast · 2021-05-06T19:59:01Z

src/index/Vocabulary.cpp

+      auto str = RdfEscaping::unescapeNewlineAndBackslash(
+          expandPrefix(CompressedString::fromString(line)));
+
+      _words.push_back(compressPrefix(str));


Is this an efficiency problem?

Unfortunately, it currently has to remain like this: escape sequences might lie partly in the compressed part, etc.
This has to be solved properly when restructuring the Vocabulary for faster engine startup in general.
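
To make the constraint in this reply concrete, here is a minimal sketch of a newline/backslash escaping pair for a line-based vocabulary file. The functions are hypothetical stand-ins, not the routines from RdfEscaping.h, and the exact escape format is an assumption. The point is that unescaping has to look at two adjacent characters at once, so it only works on the fully expanded word: if a prefix boundary fell between the backslash and the character after it, neither part could be unescaped on its own.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical sketch: makes a word safe to store as one line of a file.
inline std::string escapeNewlineAndBackslashSketch(std::string_view s) {
  std::string out;
  for (char c : s) {
    if (c == '\n') {
      out += "\\n";  // newline becomes backslash + 'n'
    } else if (c == '\\') {
      out += "\\\\";  // backslash becomes two backslashes
    } else {
      out.push_back(c);
    }
  }
  return out;
}

// Hypothetical sketch: inverse of the function above.
inline std::string unescapeNewlineAndBackslashSketch(std::string_view s) {
  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] != '\\') {
      out.push_back(s[i]);
      continue;
    }
    if (i + 1 == s.size()) {
      throw std::runtime_error("dangling backslash");
    }
    const char next = s[++i];  // the escape is decided by this second char
    if (next == 'n') {
      out.push_back('\n');
    } else if (next == '\\') {
      out.push_back('\\');
    } else {
      throw std::runtime_error("unknown escape");
    }
  }
  return out;
}
```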

src/index/PrefixHeuristic.h

src/index/ExternalVocabulary.cpp

src/global/Constants.h

src/parser/RdfEscaping.h

src/parser/Tokenizer.h

The commenting/restructuring of the PropertyPathParser is still to do.

joka921 force-pushed the f.abseilHashSet branch 5 times, most recently from 255d0e0 to 48e06a7 Compare April 13, 2020 14:54

joka921 changed the title ~~DO NOT MERGE! Unfinished : Fixes for Turtle Parser~~ Correctly handle Escapes in the Turtle input. Apr 13, 2020

joka921 requested review from floriankramer, niklas88, Theresa93 and hannahbast July 30, 2020 14:09

joka921 force-pushed the f.abseilHashSet branch from 3de8948 to d5a6d9d Compare July 30, 2020 15:33

niklas88 approved these changes Aug 15, 2020

joka921 force-pushed the f.abseilHashSet branch from b2fc96d to 4d8c07e Compare November 29, 2020 10:17

joka921 commented Nov 29, 2020

joka921 force-pushed the f.abseilHashSet branch from 545edb4 to 312fa5c Compare December 15, 2020 16:05

joka921 force-pushed the f.abseilHashSet branch from 312fa5c to fcc127c Compare March 23, 2021 15:24

Simplified the interface and moved the escaping to the TurtleParser.

7de8449

joka921 commented Apr 26, 2021

joka921 added 8 commits April 26, 2021 18:16

Merged this whole business.

9b75ad9

There is still one (not too hard) Test failure.

Nonworking! Moved everything to the RdfEscaping.h file.

98e8849

Changed to the more explicit escaping routines.

4c2e3e2

Fixed the Unit Tests s.t. they compile and pass.

a2802df

Some small nitpicks from Self-Review.

468db6d

fixed stxxl submodule

8a05e23

Refactored the escaping stuff into a cpp file and added several comments.

2c9b75f

Refactored the escaping stuff into a cpp file and added several comments.

469ae8d

hannahbast requested changes Apr 30, 2021

joka921 force-pushed the master branch from b553831 to 6680993 Compare May 5, 2021 08:18

joka921 added 2 commits May 5, 2021 18:42

Merge branch 'master' into f.abseilHashSet

7c49326

clang-format

443936e

hannahbast requested changes May 6, 2021

joka921 commented May 7, 2021

src/parser/RdfEscaping.h (outdated)

src/parser/RdfEscaping.h (outdated)

src/parser/RdfEscaping.h (outdated)

src/parser/Tokenizer.h

joka921 added 6 commits May 7, 2021 11:03

Most of the changes requested by Hannah's review.

b56ccf4

The commenting/restructuring of the PropertyPathParser is still to do.

clang-format

9c119b0

Commented the (hacky) behavior of the property path parser.

66a1af6

Fixed a compilation bug.

6740a2c

Missing include.

6e9c9fa

removed unused... warning from clang.

78c428b

joka921 force-pushed the f.abseilHashSet branch from e361bfa to 78c428b Compare May 7, 2021 12:29

just a small comment to bump Travis.

3c32ed8

joka921 merged commit a3bb4c9 into ad-freiburg:master May 8, 2021

joka921 deleted the f.abseilHashSet branch May 8, 2021 07:24
