Faster Turtle Parsing using Compile Time Regexes #307
Conversation
@joka921 Thanks a lot! Can you briefly comment on the expected speedup of this PR with and without the changes from PR 302 (pipelined index build)? That is:
@joka921 Sorry for asking again:
- Is this the only way to avoid the "vocabulary is not sorted in ascending order" errors we are currently experiencing on start-up?
- Is there any negative performance impact by using the IDENTICAL level here, and if yes, how much?
A few comments but in reality I just find your new work really epic!
@@ -79,6 +99,9 @@ void printUsage(char* execName) {
       << " The NTriples file to be Written to. If omitted, we will write to "
          "stdout"
       << endl;
  cout << "  " << std::setw(20) << "r, regex-engine" << std::setw(1) << " "
       << R"( The regex engine used for lexing. Must be one of "re2" or "ctre")"
You could add the information here that ctre is faster but only handles ASCII prefixes.
src/index/StringSortComparator.h
Outdated
@@ -545,6 +574,18 @@ class TripleComponentComparator {
    return _locManager;
  }

  /**
   * @brief Normalize a Utf8 string to a canonical representation.
I see a risk of this comment and the one for the method in LocaleManager getting out of sync; maybe just turn this into an // @see LocaleManager::normalizeUtf8()
src/parser/Tokenizer.h
Outdated
/// One entry for each Token in the Turtle Grammar. Used to create a unified
/// Interface to the two different Tokenizers
enum class TokId : uint16_t {
Is it important that this is a uint16_t? Otherwise I would just use int.
src/parser/Tokenizer.h
Outdated
// TODO<joka921>: Currently CTRE does not really support UTF-8. I tried to hack
// the UTF-8 byte patterns for this regex, see the file Utf8RegexTest.cpp, but
// this did increase compile time by an infeasible amount
Come on you can't put a statement like that here without giving the reader ballpark numbers to marvel at
test/StringSortComparatorTest.cpp
Outdated
@@ -41,6 +41,22 @@ TEST(LocaleManagerTest, Punctuation) {
  }
}

TEST(LocaleManagerTest, Normalization) {
  // é as single codepoints
  const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0};
this could use '\xC3', I think
- We now have two tokenizers, one using Google's RE2 and one using hanickadot's (Hana Dusikova's) CTRE (compile-time regex) library.
- The CTRE tokenizer is faster but currently only supports prefixes that contain only ASCII characters; for exactly this reason it has to be explicitly activated in the settings file.
- Added unit tests for the CTRE tokenizer. Some of them are commented out because they test the currently unsupported non-ASCII prefixes.
- Implemented the regexes for using correct prefixes in CTRE in the UTF8RegexTest.cpp file, but don't use them in actual code because they bloat the compile time by an unacceptable amount.
- Included the two parsing modes (CTRE with relaxed prefixes and Google RE2 as before) into the IndexBuilderMain.
Also use the ctre parser in the e2e tests.
Force-pushed from 6a7d1a4 to 511a711
@niklas88
LG(reat)TM
LOG(ERROR)
    << "Please specify a valid regex engine via the -r flag. "
       "Options are \"re2\" or \"ctre\" (The latter only works correct if "
       "prefix names only use ascii characters but is faster.\n";
Missing closing ).
if (ad_utility::startsWith(v, "#")) {
  auto pos = v.find("\n");
  if (pos == string::npos) {
    // TODO: this actually should yield a n error
"yield a n error" or "yield an error"?
// same for ctre
// TODO<joka921>: fix those regexes to the unicode stuff and test more
// extensively
Will this be another PR?
/// Helper function for ctre: concatenation of fixed_strings
template <size_t A, size_t B>
constexpr ctll::fixed_string<A + B - 1> operator+(
    const ctll::fixed_string<A>& a, const char (&b)[B]) {
"of fixed_strings": the second argument is not a ctll::fixed_string.
/// Helper function for ctre: concatenation of fixed_strings
template <size_t A, size_t B>
constexpr ctll::fixed_string<A + B - 1> operator+(
    const char (&a)[A], const ctll::fixed_string<B>& b) {
"of fixed_strings": the first argument is not a ctll::fixed_string.
Implement an alternative lexer for the Turtle parser using Hana Dusikova's compile-time regex engine (https://github.com/hanickadot/compile-time-regular-expressions).
This alternative tokenizer currently increases the parsing speed by 25% and seems to be required to actually speed up the index building pipeline.
Currently it only allows ASCII characters in prefixes or blank node labels. This suffices to handle the Wikidata dumps, but is stricter than the actual SPARQL standard, because the ctre library does not properly support UTF-8 as of now.
It is possible to optionally activate this alternative parsing mode in the settings.json, which will cause a warning to the user about the restrictions described above.
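A settings.json fragment toggling the CTRE-based mode might look like the following; the exact option key is an assumption for illustration and should be checked against the project's settings documentation:

```json
{
  "ascii-prefixes-only": true
}
```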