Adapted the Documentation, Dockerfile and Quickstart guides to the changed interface.
joka921 committed Apr 4, 2019
1 parent fdf3723 commit d523959
Showing 7 changed files with 32 additions and 29 deletions.
2 changes: 1 addition & 1 deletion Dockerfile
@@ -17,7 +17,7 @@ RUN cmake -DCMAKE_BUILD_TYPE=Release -DLOGLEVEL=DEBUG -DUSE_PARALLEL=true .. &&

FROM base as runtime
WORKDIR /app
-RUN apt-get update && apt-get install -y wget python3-yaml unzip curl
+RUN apt-get update && apt-get install -y wget python3-yaml unzip curl bzip2
RUN apt-get update && apt-get install -y libgomp1

ARG UID=1000
9 changes: 4 additions & 5 deletions README.md
@@ -121,10 +121,9 @@ section](#running-qlever).
If your input knowledge base is in the standard *NTriple* or *Turtle* format,
create the index with the following command:

-IndexBuilderMain -a -l -i /index/<prefix> -n /input/knowledge_base.ttl
+IndexBuilderMain -l -i /index/<prefix> -f /input/knowledge_base.ttl

-Where `<prefix>` is the base name for all index files, `-a` enables certain
-queries using predicate variables and `-l` externalizes long literals to disk.
+Where `<prefix>` is the base name for all index files and `-l` externalizes long literals to disk.
If you use `index` as the prefix you can later skip the `-e
INDEX_PREFIX=<prefix>` flag.

@@ -134,11 +133,11 @@ To include a text collection, the wordsfile and docsfiles (see

Then the full command will look like this:

-IndexBuilderMain -l -a -i /index/<prefix> -n /input/knowledge_base.ttl \
+IndexBuilderMain -l -i /index/<prefix> -f /input/knowledge_base.ttl \
-w /input/wordsfile.tsv -d /input/docsfile.tsv

You can also add a text index to an existing knowledge base index by adding the
-`-A` flag and omitting the `-n` flag.
+`-A` flag and omitting the `-f` flag.

# Running QLever

2 changes: 1 addition & 1 deletion docs/quickstart.md
@@ -19,7 +19,7 @@ Base.
-v "$(pwd)/scientists:/input" \
-v "$(pwd)/index:/index" --entrypoint "bash" qlever
qlever@xyz:/app$ IndexBuilderMain -l -i /index/scientists \
-  -n /input/scientists.nt \
+  -f /input/scientists.nt \
-w /input/scientists.wordsfile.tsv \
-d /input/scientists.docsfile.tsv
qlever@xyz:/app$ exit
33 changes: 18 additions & 15 deletions docs/wikidata.md
@@ -26,27 +26,23 @@ build the index under a different path.
cd qlever
docker build -t qlever .

-## Download and uncompress Wikidata
+## Download Wikidata

-If you already downloaded **and decompressed** Wikidata to the uncompressed
-Turtle format, you can skip this step. Otherwise we will download and uncompress
-it in this step.
+If you already downloaded Wikidata in the bzip2-compressed
+Turtle format, you can skip this step. Otherwise we will download it in this step.

**Note:** This takes several hours as Wikidata is about 42 GB compressed and
-their servers are throttled.
-
-**This is the first step that needs significant amounts of storage.**
-Together, the unpacked Wikidata Turtle file, created in this step, and the index
-we will create in the next section, will use up to about 2 TB.
+their servers are throttled. If you are in a hurry, you can pipe the file directly
+into the index-building pipeline (see the next section), but then you will have to
+redownload it if something goes wrong during the index build.

mkdir wikidata-input
-wget -O - \
-  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 \
-  | bzcat > wikidata-input/latest-all.ttl
+wget -O wikidata-input/latest-all.ttl.bz2 \
+  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2

## Build a QLever Index

-Now we can build a QLever Index from the `latest-all.ttl` Wikidata Turtle file.
+Now we can build a QLever Index from the `latest-all.ttl.bz2` Wikidata Turtle file.
For the process of building an index we can tune some settings to the particular
Knowledge Base. The most important of these is a list of relations which can safely be
stored on disk, as their actual values are rarely accessed. For Wikidata these
@@ -65,12 +61,19 @@ inside the container e.g. by running `chmod -R o+rw ./index`
docker run -it --rm \
-v "$(pwd)/wikidata-input/:/input" \
-v "$(pwd)/index:/index" --entrypoint "bash" qlever
-qlever@xyz:/app$ IndexBuilderMain -a -l -i /index/wikidata-full \
-  -n /input/latest-all.ttl \
+qlever@xyz:/app$ bzcat /input/latest-all.ttl.bz2 | IndexBuilderMain -l -i /index/wikidata-full \
+  -f - \
-s /input/wikidata_settings.json
… wait for about half a day …
qlever@xyz:/app$ exit

+In case you don't want to download and save the `.ttl.bz2` file first, the second step becomes
+
+qlever@xyz:/app$ wget -O - https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 \
+  | bzcat \
+  | IndexBuilderMain -l -i /index/wikidata-full -f - \
+  -s /input/wikidata_settings.json
+
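The point of the piped variants in this section is that the uncompressed Turtle dump never has to be materialized on disk. The same streaming idea can be sketched in Python using the standard `bz2` module (a minimal illustration; `stream_turtle_lines` and the two-triple dump are made up for the demo, this is not QLever code):

```python
import bz2
import io

def stream_turtle_lines(compressed_stream):
    """Decompress a bz2 stream incrementally and yield text lines,
    so the uncompressed dump never has to be stored on disk."""
    with bz2.open(compressed_stream, mode="rt", encoding="utf-8") as f:
        for line in f:
            yield line

# Demo with an in-memory "dump"; a real pipeline would read sys.stdin.buffer.
data = b"<a> <b> <c> .\n<d> <e> <f> .\n"
compressed = bz2.compress(data)
lines = list(stream_turtle_lines(io.BytesIO(compressed)))
```

Feeding a parser line by line like this trades a little CPU for a large saving in intermediate storage, which is the same trade-off the `bzcat | IndexBuilderMain` pipe makes.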
## Run QLever

Finally we are ready to launch a QLever instance using the newly built Wikidata
10 changes: 5 additions & 5 deletions src/TurtleParserMain.cpp
@@ -136,17 +136,17 @@ int main(int argc, char** argv) {
filetypeDeduced = true;
} else {
LOG(WARN)
<< " Could not deduce the type of the input knowledge-base-file by "
"its extension. Assuming the input to be turtle. Please specify "
"--file-format (-F)\n";
<< " Could not deduce the type of the input knowledge-base-file by "
"its extension. Assuming the input to be turtle. Please specify "
"--file-format (-F)\n";
LOG(WARN) << "In case this is not correct \n";
}
if (filetypeDeduced) {
LOG(INFO) << "Assuming input file format to be " << fileFormat
<< " due to the input file's extension.\n";
LOG(INFO)
<< "If this is wrong, please manually specify the --file-format "
"(-F) flag.\n";
<< "If this is wrong, please manually specify the --file-format "
"(-F) flag.\n";
}
}

3 changes: 2 additions & 1 deletion src/index/ConstantsIndexCreation.h
@@ -26,7 +26,8 @@ static const int NUM_TRIPLES_PER_PARTIAL_VOCAB = 100000000;
static const size_t PARSER_BATCH_SIZE = 1000000;

// That many triples does the turtle parser have to buffer before the call to
-// getline returns (unless our input reaches EOF). This makes parsing from streams faster.
+// getline returns (unless our input reaches EOF). This makes parsing from
+// streams faster.
static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 1000;

// When reading from a file, Chunks of this size will
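`PARSER_MIN_TRIPLES_AT_ONCE` describes a batching contract: the call only returns once a minimum number of triples has been buffered, which amortizes per-call overhead when parsing from a stream. A rough Python model of that contract (hypothetical names, a simplified sketch rather than the actual C++ parser):

```python
PARSER_MIN_TRIPLES_AT_ONCE = 1000

def batched_triples(triple_iter, min_batch=PARSER_MIN_TRIPLES_AT_ONCE):
    """Collect at least `min_batch` triples before handing a batch back,
    so the caller pays the per-call overhead far less often."""
    batch = []
    for triple in triple_iter:
        batch.append(triple)
        if len(batch) >= min_batch:
            yield batch
            batch = []
    if batch:  # flush whatever remains at EOF
        yield batch

# 2500 dummy "triples" come back as two full batches plus one EOF remainder.
batches = list(batched_triples(iter(range(2500)), min_batch=1000))
```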
2 changes: 1 addition & 1 deletion src/parser/TurtleParser.h
@@ -83,7 +83,7 @@ class TurtleParser {
bool statement();
/* Data Members */

-//Stores the triples that have been parsed but not retrieved yet.
+// Stores the triples that have been parsed but not retrieved yet.
std::vector<std::array<string, 3>> _triples;

// if this is set, there is nothing else to parse and we will only
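The `_triples` member the comment describes acts as a hand-off buffer: triples accumulate as parsing proceeds and are handed out when the caller retrieves them. A toy Python model of that pattern (hypothetical class and method names, not QLever's actual API):

```python
class BufferingParser:
    """Toy model of a parser that stores parsed triples until the
    caller retrieves them, mirroring the role of `_triples`."""

    def __init__(self):
        self._triples = []

    def parse_line(self, line):
        # Naive whitespace split of a triple line; real Turtle parsing
        # is far more involved.
        parts = line.rstrip(" .\n").split()
        if len(parts) == 3:
            self._triples.append(tuple(parts))

    def retrieve(self):
        # Hand over everything buffered so far and reset the buffer.
        out, self._triples = self._triples, []
        return out

p = BufferingParser()
p.parse_line("<a> <b> <c> .\n")
p.parse_line("<d> <e> <f> .\n")
got = p.retrieve()
```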
