Parsing from (un)compressed Streams. #220
Conversation
This looks pretty awesome. Considering Blazegraph takes a week to load Wikidata even on SSDs, being able to build our index directly out of a `wget` stream and be done overnight is pretty epic.
```sh
@@ -59,7 +59,8 @@ if [ "$1" != "no-index" ]; then
pushd "$BINARY_DIR"
echo "Building index $INDEX"
./IndexBuilderMain -l -i "$INDEX" \
    -n "$INPUT.nt" \
    -F ttl \
```
Can you add this to the README and the two quickstart documents as well? Also, for the Wikidata Quickstart I think we should then remove the uncompression step. I would still save the compressed Wikidata to disk, though, just so it doesn't need to be downloaded again if something goes wrong.
I reworked the two documents. Can you take a look and review them? I hope I didn't make any mistakes.
```cpp
cout << " " << std::setw(20) << "F, file-format" << std::setw(1) << " "
     << " Specify format of the input file. Must be one of "
        "[tsv|nt|ttl|mmap]."
```
You mention `mmap` here but later check for `bzip2`, not `mmap`.
Fixed
```cpp
    << "(mmap assumes an on-disk turtle file that can be mmapped to memory)"
    << endl;
cout << " " << std::setw(20) << "i, input-file" << std::setw(1) << " "
     << " The file to be parsed from. If omitted, we will read from stdin"
```
We could also use the "`-` means stdin" convention, but omitting it is fine too.
We now have both options.
```cpp
// That many triples does the turtle parser have to buffer before the call to
// getline returns. This makes the Bzip-Parsing probably faster
static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 1000;
```
I assume if it reaches EOF before that it still works?
Commented to clarify.
src/index/ConstantsIndexCreation.h
Outdated
```cpp
static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 1000;

// When reading from a file, Chunks of this size will
// be fed to the parser at once
```
Maybe the comment should say that this is about 100 MB
fixed
src/TurtleParserMain.cpp
Outdated
```cpp
} else if (ad_utility::endsWith(inputFile, ".ttl")) {
  fileFormat = "ttl";
} else if (ad_utility::endsWith(inputFile, ".ttl.bz2")) {
  fileFormat = "bzip2";
```
Here it's `bzip2`, above it was `mmap`, and I don't see where we still link with the `bz2` lib.
fixed
src/parser/TurtleParser.h
Outdated
```cpp
// interface needed for convenient testing
// only for testing first, stores the triples that have been parsed
```
only for testing?
of course not.
src/parser/TurtleParser.h
Outdated
```cpp
bool resetStateAndRead(const TurtleParserBackupState state);

// used with compressed inputs (a certain number of bytes is uncompressed at
// once, might also be cut in the middle of a utf8 string. (bzip2 does not no
```
s/no/know/g
The whole comment did not make much sense without bzip2, so I reformulated it.
- Made BZIP2 parsing work efficiently (the decompression is done in parallel in chunks, so we have almost no overhead). This allows us to pipe in from wget, bzip2, etc.
- Allowed redirecting the LOG(...) macros to another stream. Useful if we want to use stdout for the triple output (piping); in this case we can redirect to stderr.
- IndexBuilder and TurtleParser now take two arguments: one for the input file and one for its format. If the input file is empty or "-" we parse from stdin; if the file format is empty, we try to decide based on the file extension, otherwise we assume turtle.
- Moved many of the longer parsing methods from the header to the cpp file to keep the structure clean.
- Completely got rid of BZIP2, since we can pipe in now.
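For illustration, a sketch of what this interface enables. The individual flags (`-l`, `-i`, `-F`, `-f`, and `-f -` for stdin) are taken from this thread, but the full command line is an assumption to check against the tool's help output, and the index prefix path is a placeholder:

```sh
# Stream the compressed dump straight into the index builder:
# wget writes to stdout (-O -), bzcat decompresses, and "-f -" tells
# IndexBuilderMain to read from stdin; -F ttl pins the input format,
# since there is no file extension to sniff on a pipe.
wget -O - https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 \
  | bzcat \
  | ./IndexBuilderMain -l -i /index/wikidata -F ttl -f -
```

Because the triple output can also go to stdout, redirecting the LOG(...) macros to stderr keeps the pipe clean.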
LGTM after a few small changes in the docs
README.md
Outdated
```diff
@@ -121,10 +121,9 @@ section](#running-qlever).
 If your input knowledge base is in the standard *NTriple* or *Turtle* format
 create the index with the following command

-IndexBuilderMain -a -l -i /index/<prefix> -n /input/knowledge_base.ttl
+IndexBuilderMain  -l -i /index/<prefix> -f /input/knowledge_base.ttl
```
remove one space before -l
docs/wikidata.md
Outdated
```diff
@@ -26,27 +26,23 @@ build the index under a different path.
 cd qlever
 docker build -t qlever .

-## Download and uncompress Wikidata
+## Download and Wikidata
```
remove "and"
docs/wikidata.md
Outdated
```diff
-Turtle format, you can skip this step. Otherwise we will download and uncompress
-it in this step.
+If you already downloaded Wikidata in the bzip2-compressed
+Turtle format, you can skip this step. Otherwise we will download in this step.
```
"download it in this step"
docs/wikidata.md
Outdated
```diff
-Together, the unpacked Wikidata Turtle file, created in this step, and the index
-we will create in the next section, will use up to about 2 TB.
+their servers are throttled. In case you are in a hurry you can directly pipe the file
+into the index-building pipeline (see next section). But in this case you'll have to redownload
```
This sounds too negative I think. It shouldn't go wrong after all. Maybe just remove that last sentence. We could say that it's always a good idea to keep the original input for comparisons and reproducibility, but that sounds almost condescending.
docs/wikidata.md
Outdated
```diff
 ## Build a QLever Index

-Now we can build a QLever Index from the `latest-all.ttl` Wikidata Turtle file.
+Now we can build a QLever Index from the `latest-all.ttl.bzip2` Wikidata Turtle file.
```
it's ".bz2" not ".bzip2"
```diff
@@ -17,7 +17,7 @@ RUN cmake -DCMAKE_BUILD_TYPE=Release -DLOGLEVEL=DEBUG -DUSE_PARALLEL=true .. &&

 FROM base as runtime
 WORKDIR /app
-RUN apt-get update && apt-get install -y wget python3-yaml unzip curl
+RUN apt-get update && apt-get install -y wget python3-yaml unzip curl bzip2
```
Good find
It changes the interface of the IndexBuilder, since we have to know which format a piped-in stream has.

It assumes that a certain buffer size is enough to parse some triples (two build-time constants): if `B` contiguous bytes are not enough to extract at least `N` triples from them, we get an exception (default: 1 GB for 1000 triples).

The code that does not have this constraint (completely mmapping an uncompressed .ttl file) is still present, but currently not linked to the index builder. Shall we keep it?

@lehmann-4178656ch can you try out if piping the parser works for you?

@niklas88 I'll have a close look tomorrow at stuff I find myself, so you don't need to hurry with a review.