
Faster Turtle Parsing using Compile Time Regexes #307

Merged
merged 2 commits into ad-freiburg:master from f.TurtleCTREParser on Mar 20, 2020

Conversation

@joka921 (Member) commented Jan 15, 2020

  • Implement an alternative lexer for the Turtle parser using Hana Dusíková's compile-time regex engine (https://github.com/hanickadot/compile-time-regular-expressions).

  • This alternative tokenizer currently increases the parsing speed by 25% and seems to be required to actually speed up the index building pipeline.

  • Currently it only allows ASCII characters in prefixes and blank node labels. This suffices to handle the Wikidata dumps, but is stricter than the actual SPARQL standard. The reason is that the CTRE library does not properly support UTF-8 as of now.

  • This alternative parsing mode can optionally be activated in the settings.json, which causes a warning to the user about the restrictions described above.
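
For readers unfamiliar with CTRE, the following is a minimal, hypothetical sketch of what an ASCII-only compile-time-regex check for a Turtle prefix name can look like. It is not taken from this PR; the pattern and the function name are invented for illustration, and it uses CTRE's documented C++17 syntax of passing a static constexpr ctll::fixed_string as a template argument.

#include <ctre.hpp>
#include <string_view>

// ASCII-only approximation of the PNAME_NS production of the Turtle grammar:
// an optional prefix label followed by ':'. The real grammar also allows a
// large set of non-ASCII code points, which is exactly what this PR gives up.
static constexpr auto kPnameNs =
    ctll::fixed_string{"(?:[A-Za-z](?:[A-Za-z0-9_.-]*[A-Za-z0-9_-])?)?:"};

// Returns true iff `token` is a prefix name under the restricted ASCII rules.
// A real lexer would instead match at the current input position and advance
// past the accepted characters.
bool isAsciiPrefixName(std::string_view token) {
  return static_cast<bool>(ctre::match<kPnameNs>(token));
}

In the PR itself, the choice between this engine and RE2 is made behind a unified tokenizer interface (see the TokId discussion below).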

@hannahbast (Member) commented:

@joka921 Thanks a lot! Can you briefly comment on the expected speedup of this PR with and without the changes from PR 302 (pipelined index build)? That is:

  1. Which speedup, if any, do you expect when merging this with the current master, but without PR 302?

  2. Which speedup do you expect when merging this with the current master together with PR 302?

@joka921 (Member, Author) commented Jan 16, 2020

@hannahbast

  1. I don't expect any significant speedup from merging this PR alone, because parsing and the work that has to be done after parsing a triple are already performed in parallel, and currently those two stages seem to take about the same amount of time (the pipeline is bounded by the slower of the two overlapping stages, so speeding up only the lexer barely reduces the wall-clock time).
  2. My suggestion is to first review and merge this PR and then to refactor the pipeline branch to actually obtain a measurable speedup (I figured that we have to do the Unicode sorting a little differently to keep it from becoming a bottleneck).
  3. I had a somewhat representative prototype with everything I want to do merged together, which yielded a speed of about 2.5 billion triples per hour for the first pass over the file, more than twice the current speed. However, this was done very quick and dirty, just to figure out whether we can expect any speedup at all, and still has to be done properly. But I am optimistic that we can benefit from this quite a bit.

@hannahbast (Member) left a comment:

@joka921 Sorry for asking again:

  1. Is this the only way to avoid the "vocabulary is not sorted in ascending order" errors we are currently experiencing on start-up?

  2. Is there any negative performance impact from using the IDENTICAL level here, and if so, how much?

@niklas88 (Member) left a comment:

A few comments, but in reality I just find your new work really epic!

@@ -79,6 +99,9 @@ void printUsage(char* execName) {
<< " The NTriples file to be Written to. If omitted, we will write to "
"stdout"
<< endl;
cout << " " << std::setw(20) << "r, regex-engine" << std::setw(1) << " "
<< R"( The regex engine used for lexing. Must be one of "re2" or "ctre")"
Review comment:

You could add the information that ctre is faster but only handles ASCII prefixes here
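
A possible wording in the style of the surrounding printUsage lines (illustrative; not necessarily the text that was committed):

cout << "  " << std::setw(20) << "r, regex-engine" << std::setw(1) << " "
     << R"( The regex engine used for lexing. Must be one of "re2" or "ctre".)"
     << R"( "ctre" is faster, but only supports prefixes consisting of ASCII)"
     << " characters." << endl;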

@@ -545,6 +574,18 @@ class TripleComponentComparator {
return _locManager;
}

/**
* @brief Normalize a Utf8 string to a canonical representation.
Review comment:

I see a risk of this comment and the one for the method in LocaleManager getting out of sync; maybe just turn this into a // @see LocaleManager::normalizeUtf8()


/// One entry for each Token in the Turtle Grammar. Used to create a unified
/// Interface to the two different Tokenizers
enum class TokId : uint16_t {
Review comment:

Is it important that this is a uint16_t? Otherwise I would just use int.

// TODO<joka921>: Currently CTRE does not really support UTF-8. I tried to hack
// the UTF-8 byte patterns for this regex, see the file Utf8RegexTest.cpp, but
// this did increase compile time by an infeasible amount
Review comment:

Come on, you can't put a statement like that here without giving the reader ballpark numbers to marvel at.
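
For readers wondering what "hacking the UTF-8 byte patterns" means: a multi-byte UTF-8 character can be approximated by byte-level character classes, roughly like the following regex fragment (illustrative and simplified, not the exact pattern from Utf8RegexTest.cpp; it accepts some ill-formed sequences such as overlong encodings):

// Regex fragment matching "one UTF-8 encoded character", expressed purely in
// terms of byte ranges. Every Unicode range in a production like PN_CHARS_BASE
// needs a similar alternation, which is where the compile-time blow-up with
// CTRE comes from.
static constexpr char kUtf8CharApprox[] =
    "(?:[\\x00-\\x7F]"                  // 1 byte (ASCII)
    "|[\\xC2-\\xDF][\\x80-\\xBF]"       // 2 bytes
    "|[\\xE0-\\xEF][\\x80-\\xBF]{2}"    // 3 bytes
    "|[\\xF0-\\xF4][\\x80-\\xBF]{3})";  // 4 bytes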

@@ -41,6 +41,22 @@ TEST(LocaleManagerTest, Punctuation) {
}
}

TEST(LocaleManagerTest, Normalization) {
// é as single codepoints
const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0};
Review comment:

This could use '\xC3', I think.
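
That is, the same bytes written with character escapes, for example (a sketch; variable names as in the test above):

// é (U+00E9) as a single code point, UTF-8 encoded as the bytes 0xC3 0xA9.
const char a[] = {'\xC3', '\xA9', '\0'};
// or, even shorter, as a string literal (implicitly NUL-terminated):
const char a2[] = "\xC3\xA9";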

- We now have two tokenizers, one using Google's RE2 and one using hanickadot's
  (Hana Dusíková's) CTRE (compile-time regex) library.

- The CTRE tokenizer is faster but currently only supports prefixes that
  contain only ASCII characters.

- For exactly this reason it has to be explicitly activated in the settings file.

- Added unit tests for the CTRE tokenizer.
  Some of them are commented out, because they test the currently
  unsupported non-ASCII prefixes.

- Implemented the regexes for correct prefixes in CTRE in the UTF8RegexTest.cpp file,
  but don't use them in actual code because they bloat up the compile time by an
  unacceptable amount.

- Included the two parsing modes (CTRE with relaxed prefixes and Google RE2 as before)
  into the IndexBuilderMain. Also use the CTRE parser in the e2e tests.
@joka921 (Member, Author) commented Jan 30, 2020

@niklas88
I have rebased this PR to work with the pipeline. You had already given this a first review, and I have included the requested changes; maybe have another glance and deal with this next.

@niklas88 (Member) left a comment:

LG(reat)TM

LOG(ERROR)
<< "Please specify a valid regex engine via the -r flag. "
"Options are \"re2\" or \"ctre\" (The latter only works correct if "
"prefix names only use ascii characters but is faster.\n";
Review comment:

Missing closing ).
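
That is, something along the lines of the following (a sketch of the corrected message, not necessarily the committed wording):

LOG(ERROR)
    << "Please specify a valid regex engine via the -r flag. "
       "Options are \"re2\" or \"ctre\" (the latter only works correctly if "
       "prefix names use only ASCII characters, but is faster).\n";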

if (ad_utility::startsWith(v, "#")) {
auto pos = v.find("\n");
if (pos == string::npos) {
// TODO: this actually should yield a n error
Review comment:

yield a n error or yield an error?


// same for ctre
// TODO<joka921>: fix those regexes to the unicode stuff and test more
// extensively
Review comment:

Will this be another PR?

/// Helper function for ctre: concatenation of fixed_strings
template <size_t A, size_t B>
constexpr ctll::fixed_string<A + B - 1> operator+(
const ctll::fixed_string<A>& a, const char (&b)[B]) {
Review comment:

"of fixed_strings": the second argument is not a ctll::fixed_string.

/// Helper function for ctre: concatenation of fixed_strings
template <size_t A, size_t B>
constexpr ctll::fixed_string<A + B - 1> operator+(
const char (&a)[A], const ctll::fixed_string<B>& b) {
Review comment:

"of fixed_strings": the first argument is not a ctll::fixed_string.
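
For context, a hypothetical use of these two mixed-type overloads (names invented; assumes the helpers quoted above are in scope). The result size A + B - 1 accounts for keeping only one of the two terminating NUL characters:

// Assemble a larger compile-time pattern from a ctll::fixed_string fragment
// and plain string literals, using both operator+ overloads from above.
static constexpr ctll::fixed_string langtag{"@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*"};
static constexpr auto literalWithOptionalLangtag =
    "\"[^\"]*\"" + ("(?:" + langtag + ")?");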

@niklas88 merged commit 4bd364e into ad-freiburg:master on Mar 20, 2020
@joka921 deleted the f.TurtleCTREParser branch on May 8, 2021