Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly handle Escapes in the Turtle input. #317

Merged
merged 19 commits into from May 8, 2021

Conversation

joka921
Copy link
Member

@joka921 joka921 commented Feb 27, 2020

  • Showing the position, where parsing fails.
  • Using the proper regex for pnameNS, pnameLN in the (conservative + correct) re2 parser
  • got rid completely of google::sparsehash
  • Correct unescaping of "," or "\u00e4" etc
  • CTRE parsing now has an additional constraint: there may be no escapes in iris like rdfsl:something, the warnings are updated to report this behavior.

@hannahbast
Copy link
Member

@joka921: are these conflicts easy to fix?

@joka921 joka921 force-pushed the f.abseilHashSet branch 5 times, most recently from 255d0e0 to 48e06a7 Compare April 13, 2020 14:54
@joka921 joka921 changed the title DO NOT MERGE! Unfinished : Fixes for Turtle Parser Correctly handle Escapes in the Turtle input. Apr 13, 2020
Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM also if it parses the relevant KBs I think this is pretty low risk and I also like the newly added tests.

@@ -313,6 +373,20 @@ struct TurtleToken {
case '\\':
res.push_back('\\');
break;
case 'u': {
AD_CHECK(pos + 5 <= literal.size());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The <= looks odd to myself, check and comment this method again + write Unit tests for it,
since it is the core of this PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was indeed a mistake, and is fixed now.

Copy link
Member Author

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somebody else should also have a look at this.

@@ -313,6 +373,20 @@ struct TurtleToken {
case '\\':
res.push_back('\\');
break;
case 'u': {
AD_CHECK(pos + 5 <= literal.size());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was indeed a mistake, and is fixed now.

@joka921
Copy link
Member Author

joka921 commented Nov 29, 2020

@floriankramer
If you have time, you could have a look at this one.
It is relatively important for the people not using Wikidata, e.g. the OSM people.

- also apply the normalization of literals correctly during index build time
- Adapt the Index unit tests to "legal" knowledge bases

- Get rid of misleading warning in case of whitespace at the
  end of a TTL file.
  - Previously there was a "parsing of ttl has failed, but there is still content left"
    warning, although the remainder of the ttl input was only whitespace.
  - This was due to a bug in the Parser's skipWhitespace() function which failed if the
    input consisted of ONLY whitespace. This is now fixed.

- The case, where a prefix was used with an empty "content" (e.g. <a> wd: <b>) was
  broken before, luckily there was a unit test and this is now fixed.
Copy link
Member Author

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite some stuff to do Yet

src/index/PrefixHeuristic.cpp Outdated Show resolved Hide resolved
src/index/PrefixHeuristic.cpp Outdated Show resolved Hide resolved
src/index/PrefixHeuristic.h Outdated Show resolved Hide resolved
src/index/VocabularyGeneratorImpl.h Outdated Show resolved Hide resolved
src/index/VocabularyImpl.h Outdated Show resolved Hide resolved
src/parser/Tokenizer.h Outdated Show resolved Hide resolved
src/parser/TurtleParser.cpp Show resolved Hide resolved
src/parser/TurtleParser.h Outdated Show resolved Hide resolved
test/TokenTest.cpp Outdated Show resolved Hide resolved
test/TurtleParserTest.cpp Outdated Show resolved Hide resolved
Copy link
Member

@hannahbast hannahbast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review in 1-1 with Johannes, thanks a lot!

src/parser/RdfEscaping.h Outdated Show resolved Hide resolved
src/parser/RdfEscaping.h Outdated Show resolved Hide resolved
src/parser/RdfEscaping.h Outdated Show resolved Hide resolved
src/parser/RdfEscaping.h Outdated Show resolved Hide resolved
src/parser/RdfEscaping.h Outdated Show resolved Hide resolved
src/parser/TurtleParser.h Outdated Show resolved Hide resolved
src/parser/TurtleParser.h Outdated Show resolved Hide resolved
src/parser/TurtleParser.h Show resolved Hide resolved
src/parser/Tokenizer.h Show resolved Hide resolved
src/parser/TurtleParser.h Outdated Show resolved Hide resolved
Copy link
Member

@hannahbast hannahbast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished review in 1-1 with Johannes, thank you very much!

test/IndexTest.cpp Show resolved Hide resolved
test/TransitivePathTest.cpp Outdated Show resolved Hide resolved
test/TurtleParserTest.cpp Outdated Show resolved Hide resolved
test/TurtleParserTest.cpp Outdated Show resolved Hide resolved
test/TurtleParserTest.cpp Outdated Show resolved Hide resolved
Comment on lines +65 to +74
if (!inside_iri && c == '\\') {
escaped = !escaped;
} else if (!inside_iri && DELIMITER_CHARS[(uint8_t)str[pos]] && escaped) {
escaped = false;
} else {
escaped = false;
if (!inside_iri && DELIMITER_CHARS[(uint8_t)str[pos]] &&
(pos != 0 || c != '?')) {
if (start != pos) {
// add the string up to but not including the new token
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this maybe be simplified? If not, a brief comment explaining each case would be nice.

Comment on lines 35 to 38
auto str = RdfEscaping::unescapeNewlineAndBackslash(
expandPrefix(CompressedString::fromString(line)));

_words.push_back(compressPrefix(str));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an efficiency problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it unfortunately has to remain like this. There might be escape sequences partly in the compressed part etc.
This has to be solved properly when restructuring the Vocabulary for faster engine-Startup in general.

src/index/PrefixHeuristic.h Outdated Show resolved Hide resolved
src/index/ExternalVocabulary.cpp Outdated Show resolved Hide resolved
src/global/Constants.h Outdated Show resolved Hide resolved
src/parser/RdfEscaping.h Outdated Show resolved Hide resolved
src/parser/RdfEscaping.h Outdated Show resolved Hide resolved
src/parser/RdfEscaping.h Outdated Show resolved Hide resolved
src/parser/Tokenizer.h Show resolved Hide resolved
@joka921 joka921 merged commit a3bb4c9 into ad-freiburg:master May 8, 2021
@joka921 joka921 deleted the f.abseilHashSet branch May 8, 2021 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants