Adapted the Documentation, Dockerfile and Quickstart guides to the changed interface.
joka921 committed Apr 4, 2019
1 parent fdf3723 commit d523959
Showing 7 changed files with 32 additions and 29 deletions.
2 changes: 1 addition & 1 deletion Dockerfile
@@ -17,7 +17,7 @@ RUN cmake -DCMAKE_BUILD_TYPE=Release -DLOGLEVEL=DEBUG -DUSE_PARALLEL=true .. &&

FROM base as runtime
WORKDIR /app
-RUN apt-get update && apt-get install -y wget python3-yaml unzip curl
+RUN apt-get update && apt-get install -y wget python3-yaml unzip curl bzip2
RUN apt-get update && apt-get install -y libgomp1

ARG UID=1000
9 changes: 4 additions & 5 deletions README.md
@@ -121,10 +121,9 @@ section](#running-qlever).
If your input knowledge base is in the standard *NTriple* or *Turtle* format,
create the index with the following command:

-IndexBuilderMain -a -l -i /index/<prefix> -n /input/knowledge_base.ttl
+IndexBuilderMain -l -i /index/<prefix> -f /input/knowledge_base.ttl

-Where `<prefix>` is the base name for all index files, `-a` enables certain
-queries using predicate variables and `-l` externalizes long literals to disk.
+Where `<prefix>` is the base name for all index files and `-l` externalizes long literals to disk.
If you use `index` as the prefix you can later skip the `-e
INDEX_PREFIX=<prefix>` flag.

@@ -134,11 +133,11 @@ To include a text collection, the wordsfile and docsfiles (see

Then the full command will look like this:

-IndexBuilderMain -l -a -i /index/<prefix> -n /input/knowledge_base.ttl \
+IndexBuilderMain -l -i /index/<prefix> -f /input/knowledge_base.ttl \
-w /input/wordsfile.tsv -d /input/docsfile.tsv

You can also add a text index to an existing knowledge base index by adding the
-`-A` flag and omitting the `-n` flag.
+`-A` flag and omitting the `-f` flag.

# Running QLever

2 changes: 1 addition & 1 deletion docs/quickstart.md
@@ -19,7 +19,7 @@ Base.
-v "$(pwd)/scientists:/input" \
-v "$(pwd)/index:/index" --entrypoint "bash" qlever
qlever@xyz:/app$ IndexBuilderMain -l -i /index/scientists \
-  -n /input/scientists.nt \
+  -f /input/scientists.nt \
-w /input/scientists.wordsfile.tsv \
-d /input/scientists.docsfile.tsv
qlever@xyz:/app$ exit
33 changes: 18 additions & 15 deletions docs/wikidata.md
@@ -26,27 +26,23 @@ build the index under a different path.
cd qlever
docker build -t qlever .

-## Download and uncompress Wikidata
+## Download Wikidata

-If you already downloaded **and decompressed** Wikidata to the uncompressed
-Turtle format, you can skip this step. Otherwise we will download and uncompress
-it in this step.
+If you already downloaded Wikidata in the bzip2-compressed
+Turtle format, you can skip this step. Otherwise we will download it in this step.

**Note:** This takes several hours as Wikidata is about 42 GB compressed and
-their servers are throttled.
-
-**This is the first step that needs significant amounts of storage.**
-Together, the unpacked Wikidata Turtle file, created in this step, and the index
-we will create in the next section, will use up to about 2 TB.
+their servers are throttled. If you are in a hurry, you can pipe the file directly
+into the index-building pipeline (see the next section), but then you will have to
+redownload it if something goes wrong during the index build.

mkdir wikidata-input
-wget -O - \
-  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 \
-  | bzcat > wikidata-input/latest-all.ttl
+wget -O wikidata-input/latest-all.ttl.bz2 \
+  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2

## Build a QLever Index

-Now we can build a QLever Index from the `latest-all.ttl` Wikidata Turtle file.
+Now we can build a QLever Index from the `latest-all.ttl.bz2` Wikidata Turtle file.
For the process of building an index we can tune some settings to the particular
Knowledge Base. The most important of these is a list of relations which can safely be
stored on disk, as their actual values are rarely accessed. For Wikidata these
@@ -65,12 +61,19 @@ inside the container e.g. by running `chmod -R o+rw ./index`
docker run -it --rm \
-v "$(pwd)/wikidata-input/:/input" \
-v "$(pwd)/index:/index" --entrypoint "bash" qlever
-qlever@xyz:/app$ IndexBuilderMain -a -l -i /index/wikidata-full \
-  -n /input/latest-all.ttl \
+qlever@xyz:/app$ bzcat /input/latest-all.ttl.bz2 | IndexBuilderMain -l -i /index/wikidata-full \
+  -f - \
-s /input/wikidata_settings.json
… wait for about half a day …
qlever@xyz:/app$ exit

+In case you don't want to download and save the `.ttl.bz2` file first, the second step becomes
+
+qlever@xyz:/app$ wget -O - https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 \
+  | bzcat \
+  | IndexBuilderMain -l -i /index/wikidata-full -f - \
+  -s /input/wikidata_settings.json
+
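The point of the piped variants in this section is that the uncompressed Turtle dump never has to be materialized on disk. The same streaming idea can be sketched in Python using the standard `bz2` module (a minimal illustration; `stream_turtle_lines` and the two-triple dump are made up for the demo, this is not QLever code):

```python
import bz2
import io

def stream_turtle_lines(compressed_stream):
    """Decompress a bz2 stream incrementally and yield text lines,
    so the uncompressed dump never has to be stored on disk."""
    with bz2.open(compressed_stream, mode="rt", encoding="utf-8") as f:
        for line in f:
            yield line

# Demo with an in-memory "dump"; a real pipeline would read sys.stdin.buffer.
data = b"<a> <b> <c> .\n<d> <e> <f> .\n"
compressed = bz2.compress(data)
lines = list(stream_turtle_lines(io.BytesIO(compressed)))
```

Feeding a parser line by line like this trades a little CPU for a large saving in intermediate storage, which is the same trade-off the `bzcat | IndexBuilderMain` pipe makes.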
## Run QLever

Finally we are ready to launch a QLever instance using the newly built Wikidata
10 changes: 5 additions & 5 deletions src/TurtleParserMain.cpp
@@ -136,17 +136,17 @@ int main(int argc, char** argv) {
filetypeDeduced = true;
} else {
LOG(WARN)
<< " Could not deduce the type of the input knowledge-base-file by "
"its extension. Assuming the input to be turtle. Please specify "
"--file-format (-F)\n";
<< " Could not deduce the type of the input knowledge-base-file by "
"its extension. Assuming the input to be turtle. Please specify "
"--file-format (-F)\n";
LOG(WARN) << "In case this is not correct \n";
}
if (filetypeDeduced) {
LOG(INFO) << "Assuming input file format to be " << fileFormat
<< " due to the input file's extension.\n";
LOG(INFO)
<< "If this is wrong, please manually specify the --file-format "
"(-F) flag.\n";
<< "If this is wrong, please manually specify the --file-format "
"(-F) flag.\n";
}
}

3 changes: 2 additions & 1 deletion src/index/ConstantsIndexCreation.h
@@ -26,7 +26,8 @@ static const int NUM_TRIPLES_PER_PARTIAL_VOCAB = 100000000;
static const size_t PARSER_BATCH_SIZE = 1000000;

// That many triples does the turtle parser have to buffer before the call to
-// getline returns (unless our input reaches EOF). This makes parsing from streams faster.
+// getline returns (unless our input reaches EOF). This makes parsing from
+// streams faster.
static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 1000;

// When reading from a file, Chunks of this size will
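`PARSER_MIN_TRIPLES_AT_ONCE` describes a batching contract: the call only returns once a minimum number of triples has been buffered, which amortizes per-call overhead when parsing from a stream. A rough Python model of that contract (hypothetical names, a simplified sketch rather than the actual C++ parser):

```python
PARSER_MIN_TRIPLES_AT_ONCE = 1000

def batched_triples(triple_iter, min_batch=PARSER_MIN_TRIPLES_AT_ONCE):
    """Collect at least `min_batch` triples before handing a batch back,
    so the caller pays the per-call overhead far less often."""
    batch = []
    for triple in triple_iter:
        batch.append(triple)
        if len(batch) >= min_batch:
            yield batch
            batch = []
    if batch:  # flush whatever remains at EOF
        yield batch

# 2500 dummy "triples" come back as two full batches plus one EOF remainder.
batches = list(batched_triples(iter(range(2500)), min_batch=1000))
```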
2 changes: 1 addition & 1 deletion src/parser/TurtleParser.h
@@ -83,7 +83,7 @@ class TurtleParser {
bool statement();
/* Data Members */

-//Stores the triples that have been parsed but not retrieved yet.
+// Stores the triples that have been parsed but not retrieved yet.
std::vector<std::array<string, 3>> _triples;

// if this is set, there is nothing else to parse and we will only
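The `_triples` member the comment describes acts as a hand-off buffer: triples accumulate as parsing proceeds and are handed out when the caller retrieves them. A toy Python model of that pattern (hypothetical class and method names, not QLever's actual API):

```python
class BufferingParser:
    """Toy model of a parser that stores parsed triples until the
    caller retrieves them, mirroring the role of `_triples`."""

    def __init__(self):
        self._triples = []

    def parse_line(self, line):
        # Naive whitespace split of a triple line; real Turtle parsing
        # is far more involved.
        parts = line.rstrip(" .\n").split()
        if len(parts) == 3:
            self._triples.append(tuple(parts))

    def retrieve(self):
        # Hand over everything buffered so far and reset the buffer.
        out, self._triples = self._triples, []
        return out

p = BufferingParser()
p.parse_line("<a> <b> <c> .\n")
p.parse_line("<d> <e> <f> .\n")
got = p.retrieve()
```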
