Faster Turtle Parsing using Compile Time Regexes #307
Conversation
@joka921 Thanks a lot! Can you briefly comment on the expected speedup of this PR with and without the changes from PR 302 (pipelined index build)? That is:
@joka921 Sorry for asking again:
- Is this the only way to avoid the "vocabulary is not sorted in ascending order" errors we are currently experiencing on start-up?
- Is there any negative performance impact by using the IDENTICAL level here, and if yes, how much?
A few comments but in reality I just find your new work really epic!
@@ -79,6 +99,9 @@ void printUsage(char* execName) {
       << " The NTriples file to be Written to. If omitted, we will write to "
          "stdout"
       << endl;
  cout << "  " << std::setw(20) << "r, regex-engine" << std::setw(1) << " "
       << R"( The regex engine used for lexing. Must be one of "re2" or "ctre")"
You could add the information here that ctre is faster but only handles ASCII prefixes.
src/index/StringSortComparator.h
Outdated
@@ -545,6 +574,18 @@ class TripleComponentComparator {
    return _locManager;
  }

  /**
   * @brief Normalize a Utf8 string to a canonical representation.
I see a risk of this comment and the one for the method in LocaleManager getting out of sync; maybe just turn this into an // @see LocaleManager::normalizeUtf8()
src/parser/Tokenizer.h
Outdated
/// One entry for each Token in the Turtle Grammar. Used to create a unified
/// Interface to the two different Tokenizers
enum class TokId : uint16_t {
Is it important that this is a uint16_t? Otherwise I would just use int.
src/parser/Tokenizer.h
Outdated
// TODO<joka921>: Currently CTRE does not really support UTF-8. I tried to hack
// the UTF-8 byte patterns for this regex, see the file Utf8RegexTest.cpp, but
// this did increase compile time by an infeasible amount
Come on you can't put a statement like that here without giving the reader ballpark numbers to marvel at
test/StringSortComparatorTest.cpp
Outdated
@@ -41,6 +41,22 @@ TEST(LocaleManagerTest, Punctuation) {
  }
}

TEST(LocaleManagerTest, Normalization) {
  // é as single codepoints
  const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0};
this could use '\xC3', I think
- We now have two tokenizers, one using Google's RE2 and one using hanickadot's (Hana Dusikova's) CTRE (compile-time regex) library.
- The CTRE tokenizer is faster but currently only supports prefixes that contain only ASCII characters; for exactly this reason it has to be explicitly activated in the settings file.
- Added unit tests for the CTRE tokenizer. Some of them are commented out because they test the currently unsupported non-ASCII prefixes.
- Implemented the regexes for using correct prefixes in CTRE in the UTF8RegexTest.cpp file, but don't use them in actual code because they bloat the compile time by an unacceptable amount.
- Included the two parsing modes (CTRE with relaxed prefixes and Google RE2 as before) into the IndexBuilderMain.
Also use the ctre parser in the e2e tests.
Force-pushed from 6a7d1a4 to 511a711
@niklas88
LG(reat)TM
LOG(ERROR)
    << "Please specify a valid regex engine via the -r flag. "
       "Options are \"re2\" or \"ctre\" (The latter only works correct if "
       "prefix names only use ascii characters but is faster.\n";
Missing closing ).
if (ad_utility::startsWith(v, "#")) {
  auto pos = v.find("\n");
  if (pos == string::npos) {
    // TODO: this actually should yield a n error
"yield a n error" or "yield an error"?
// same for ctre
// TODO<joka921>: fix those regexes to the unicode stuff and test more
// extensively
Will this be another PR?
/// Helper function for ctre: concatenation of fixed_strings
template <size_t A, size_t B>
constexpr ctll::fixed_string<A + B - 1> operator+(
    const ctll::fixed_string<A>& a, const char (&b)[B]) {
"of fixed_strings": the second argument is not a ctll::fixed_string.
/// Helper function for ctre: concatenation of fixed_strings
template <size_t A, size_t B>
constexpr ctll::fixed_string<A + B - 1> operator+(
    const char (&a)[A], const ctll::fixed_string<B>& b) {
"of fixed_strings": the first argument is not a ctll::fixed_string.
Implement an alternative lexer for the Turtle parser using Hana Dusikova's compile-time regex engine (https://github.com/hanickadot/compile-time-regular-expressions).
This alternative tokenizer currently increases the parsing speed by 25% and seems to be required to actually speed up the index building pipeline.
Currently it only allows ASCII characters in prefixes or blank node labels. This suffices to handle the Wikidata dumps, but is stricter than the actual SPARQL standard, because the ctre library does not properly support UTF-8 as of now.
It is possible to optionally activate this alternative parsing mode in the settings.json, which will cause a warning to the user about the restrictions described above.
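A settings.json fragment toggling the CTRE-based mode might look like the following; the exact option key is an assumption for illustration and should be checked against the project's settings documentation:

```json
{
  "ascii-prefixes-only": true
}
```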