Correctly handle Escapes in the Turtle input. #317
@joka921: are these conflicts easy to fix?
255d0e0 to 48e06a7
LGTM. Also, if it parses the relevant KBs, I think this is pretty low risk, and I like the newly added tests.
b2fc96d to 4d8c07e
src/parser/Tokenizer.h
Outdated
@@ -313,6 +373,20 @@ struct TurtleToken {
        case '\\':
          res.push_back('\\');
          break;
        case 'u': {
          AD_CHECK(pos + 5 <= literal.size());
The <= looks odd to me. Check and comment this method again, and write unit tests for it, since it is the core of this PR.
It was indeed a mistake, and is fixed now.
Somebody else should also have a look at this.
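For reference, a minimal sketch of bounds-checked decoding of the Turtle escapes \uXXXX and \UXXXXXXXX into UTF-8, the kind of logic the check discussed above guards. This is not the code from this PR: unescapeUnicode, parseHex and appendUtf8 are hypothetical names, exceptions stand in for AD_CHECK, and pos here points at the backslash, which may differ from the convention used in Tokenizer.h. Surrogate code points are not rejected in this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: append the code point `cp` to `out` as UTF-8.
inline void appendUtf8(std::string& out, std::uint32_t cp) {
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp <= 0x10FFFF) {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    throw std::runtime_error("code point out of range");
  }
}

// Hypothetical helper: parse exactly `n` hex digits of `s` beginning at
// index `start`. The first check is the bounds check: all `n` digits must
// lie inside `s`.
inline std::uint32_t parseHex(std::string_view s, std::size_t start,
                              std::size_t n) {
  if (start > s.size() || n > s.size() - start) {
    throw std::runtime_error("truncated unicode escape");
  }
  std::uint32_t value = 0;
  for (std::size_t i = 0; i < n; ++i) {
    const char c = s[start + i];
    value <<= 4;
    if (c >= '0' && c <= '9') {
      value |= static_cast<std::uint32_t>(c - '0');
    } else if (c >= 'a' && c <= 'f') {
      value |= static_cast<std::uint32_t>(c - 'a' + 10);
    } else if (c >= 'A' && c <= 'F') {
      value |= static_cast<std::uint32_t>(c - 'A' + 10);
    } else {
      throw std::runtime_error("invalid hex digit in unicode escape");
    }
  }
  return value;
}

// Replace \\, \uXXXX and \UXXXXXXXX in `literal`; other escapes are
// rejected to keep the sketch short.
inline std::string unescapeUnicode(std::string_view literal) {
  std::string res;
  for (std::size_t pos = 0; pos < literal.size(); ++pos) {
    if (literal[pos] != '\\') {
      res.push_back(literal[pos]);
      continue;
    }
    if (pos + 1 >= literal.size()) {
      throw std::runtime_error("dangling backslash");
    }
    switch (literal[pos + 1]) {
      case '\\':
        res.push_back('\\');
        pos += 1;  // skip the second backslash
        break;
      case 'u':
        appendUtf8(res, parseHex(literal, pos + 2, 4));
        pos += 5;  // skip 'u' and 4 hex digits
        break;
      case 'U':
        appendUtf8(res, parseHex(literal, pos + 2, 8));
        pos += 9;  // skip 'U' and 8 hex digits
        break;
      default:
        throw std::runtime_error("unsupported escape in this sketch");
    }
  }
  return res;
}
```

Keeping the length check in one helper means the off-by-one question has a single place to be reviewed and unit-tested, including the edge case where the escape ends exactly at the end of the literal.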
@floriankramer
545edb4 to 312fa5c
- Also apply the normalization of literals correctly during index build time.
- Adapt the Index unit tests to "legal" knowledge bases.
- Get rid of a misleading warning in case of whitespace at the end of a TTL file.
  - Previously there was a "parsing of ttl has failed, but there is still content left" warning, although the remainder of the TTL input was only whitespace.
  - This was due to a bug in the Parser's skipWhitespace() function, which failed if the input consisted of ONLY whitespace. This is now fixed.
- The case where a prefix was used with an empty "content" (e.g. <a> wd: <b>) was broken before. Luckily there was a unit test, and this is now fixed.
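To illustrate the whitespace-only edge case from the commit message, here is a minimal sketch of a skipWhitespace() that handles it. The old implementation is not shown in this thread, so the failure mode described in the comments (ignoring the npos result of find_first_not_of) is only an assumption about how such a bug typically arises. The free-function signature and the whitespace set are also assumptions, and Turtle comments are not skipped here.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical free function: strip leading whitespace from `input` and
// report whether anything was removed.
inline bool skipWhitespace(std::string_view& input) {
  constexpr std::string_view kWhitespace = " \t\r\n";
  const std::size_t first = input.find_first_not_of(kWhitespace);
  // find_first_not_of returns npos when the input is empty or consists
  // ONLY of whitespace. Treating npos as "nothing to skip" would leave
  // the whitespace in place, so map it to "skip everything" instead.
  const std::size_t count =
      (first == std::string_view::npos) ? input.size() : first;
  input.remove_prefix(count);
  return count > 0;
}
```

With this shape, a TTL input that ends in trailing whitespace leaves an empty remainder, so a "content left" warning would have nothing to report.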
312fa5c to fcc127c
Quite some stuff still to do.
There is still one (not too hard) test failure.
Partial review in 1-1 with Johannes, thanks a lot!
Finished review in 1-1 with Johannes, thank you very much!
if (!inside_iri && c == '\\') {
  escaped = !escaped;
} else if (!inside_iri && DELIMITER_CHARS[(uint8_t)str[pos]] && escaped) {
  escaped = false;
} else {
  escaped = false;
  if (!inside_iri && DELIMITER_CHARS[(uint8_t)str[pos]] &&
      (pos != 0 || c != '?')) {
    if (start != pos) {
      // add the string up to but not including the new token
Can this maybe be simplified? If not, a brief comment explaining each case would be nice.
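One possible shape for the per-case comments requested above, as a simplified standalone sketch. It is not the PR's tokenizer: splitAtDelimiters is a hypothetical name, the delimiter set is passed in rather than looked up in DELIMITER_CHARS, IRIs are assumed to be delimited by < and >, each delimiter is assumed to become its own token, and the special handling of a leading '?' is left out.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical simplification: split `str` at unescaped delimiter
// characters that occur outside of <...>.
inline std::vector<std::string> splitAtDelimiters(
    std::string_view str, std::string_view delimiters) {
  std::vector<std::string> tokens;
  bool insideIri = false;
  bool escaped = false;
  std::size_t start = 0;
  for (std::size_t pos = 0; pos < str.size(); ++pos) {
    const char c = str[pos];
    if (insideIri) {
      // Case 1: inside an IRI nothing is special except the closing '>'.
      if (c == '>') insideIri = false;
      continue;
    }
    if (c == '\\') {
      // Case 2: a backslash toggles the escape state, so two backslashes
      // in a row cancel out.
      escaped = !escaped;
      continue;
    }
    const bool isDelimiter = delimiters.find(c) != std::string_view::npos;
    if (isDelimiter && !escaped) {
      // Case 3: an unescaped delimiter ends the pending token (if any)
      // and is emitted as a token of its own.
      if (start != pos) tokens.emplace_back(str.substr(start, pos - start));
      tokens.push_back(std::string(1, c));
      start = pos + 1;
    } else if (c == '<' && !escaped) {
      // Case 4: an unescaped '<' opens an IRI.
      insideIri = true;
    }
    // Case 5: any other character, or an escaped delimiter, stays in the
    // current token. In all of cases 3 to 5 the escape is consumed.
    escaped = false;
  }
  if (start < str.size()) tokens.emplace_back(str.substr(start));
  return tokens;
}
```

Handling the inside-IRI state first with an early continue removes the repeated !inside_iri condition from the remaining branches, which may be one way to simplify the original chain.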
src/index/Vocabulary.cpp
Outdated
auto str = RdfEscaping::unescapeNewlineAndBackslash(
    expandPrefix(CompressedString::fromString(line)));

_words.push_back(compressPrefix(str));
Is this an efficiency problem?
Currently it unfortunately has to remain like this. There might be escape sequences partly in the compressed part etc.
This has to be solved properly when restructuring the Vocabulary for faster engine-Startup in general.
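As background for the expand, unescape, compress round trip above, here is a minimal sketch of a newline/backslash escaping pair. It assumes the vocabulary is stored as one word per line, which would explain why newlines must be escaped on disk. The function names mirror the call in the diff, but the bodies are illustrative guesses, not the implementation in RdfEscaping.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Sketch: make `s` safe to store on a single line.
inline std::string escapeNewlineAndBackslash(std::string_view s) {
  std::string out;
  for (const char c : s) {
    if (c == '\n') {
      out += "\\n";   // newline -> backslash followed by 'n'
    } else if (c == '\\') {
      out += "\\\\";  // backslash -> two backslashes
    } else {
      out.push_back(c);
    }
  }
  return out;
}

// Sketch: inverse of escapeNewlineAndBackslash. Unknown escapes and a
// trailing lone backslash are copied through unchanged.
inline std::string unescapeNewlineAndBackslash(std::string_view s) {
  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] == '\\' && i + 1 < s.size()) {
      const char next = s[i + 1];
      if (next == 'n') {
        out.push_back('\n');
        ++i;
        continue;
      }
      if (next == '\\') {
        out.push_back('\\');
        ++i;
        continue;
      }
    }
    out.push_back(s[i]);
  }
  return out;
}
```

An escape sequence is two characters long, so if prefix compression happens to cut a word between the backslash and the following character, the unescape step cannot be applied to the compressed parts separately. That is consistent with the reply above that the word currently has to be expanded first.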
The commenting/restructuring of the PropertyPathParser is still to do.
rdfsl:something, the warnings are updated to report this behavior.