Parsing from (un)compressed Streams. #220
Conversation
This looks pretty awesome. Considering Blazegraph takes a week to load Wikidata even on SSDs, being able to build our index directly out of a `wget` stream and be done overnight is pretty epic.
```sh
@@ -59,7 +59,8 @@ if [ "$1" != "no-index" ]; then
pushd "$BINARY_DIR"
echo "Building index $INDEX"
./IndexBuilderMain -l -i "$INDEX" \
    -n "$INPUT.nt" \
    -F ttl \
```
Can you add this to the README and the two quickstart documents as well? Also, for the Wikidata Quickstart I think we should then remove the uncompression step. I would still save the compressed Wikidata to disk, though, just so it doesn't need to be downloaded again if something goes wrong.
I reworked the two documents. Can you take a look and review them? I hope I didn't make any mistakes.
```cpp
cout << " " << std::setw(20) << "F, file-format" << std::setw(1) << " "
     << " Specify format of the input file. Must be one of "
        "[tsv|nt|ttl|mmap]."
```
You mention `mmap` here but later check for `bzip2`, not `mmap`.
Fixed
```cpp
    << "(mmap assumes an on-disk turtle file that can be mmapped to memory)"
    << endl;
cout << " " << std::setw(20) << "i, input-file" << std::setw(1) << " "
     << " The file to be parsed from. If omitted, we will read from stdin"
```
We could also use the "`-` means stdin" convention, but omitting it is fine too.
We now have both options.
```cpp
// That many triples does the turtle parser have to buffer before the call to
// getline returns. This makes the Bzip-Parsing probably faster
static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 1000;
```
I assume if it reaches EOF before that it still works?
Commented to clarify.
src/index/ConstantsIndexCreation.h
Outdated
```cpp
static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 1000;

// When reading from a file, Chunks of this size will
// be fed to the parser at once
```
Maybe the comment should say that this is about 100 MB
fixed
src/TurtleParserMain.cpp
Outdated
```cpp
} else if (ad_utility::endsWith(inputFile, ".ttl")) {
  fileFormat = "ttl";
} else if (ad_utility::endsWith(inputFile, ".ttl.bz2")) {
  fileFormat = "bzip2";
```
Here it's `bzip2`, above it was `mmap`, and I don't see where we still link with the `bz2` lib.
fixed
src/parser/TurtleParser.h
Outdated
```cpp
// interface needed for convenient testing
// only for testing first, stores the triples that have been parsed
```
only for testing?
of course not.
src/parser/TurtleParser.h
Outdated
```cpp
bool resetStateAndRead(const TurtleParserBackupState state);

// used with compressed inputs (a certain number of bytes is uncompressed at
// once, might also be cut in the middle of a utf8 string. (bzip2 does not no
```
s/no/know/g
The whole comment did not make much sense without bzip2, so I reformulated it.
- Made BZIP2 parsing work efficiently (the decompression is done in parallel in chunks, so we have almost no overhead). This allows us to pipe in from wget, bzip2, etc.
- Allowed redirecting the LOG(...) macros to another stream. Useful if we want to use stdout for the triple output (piping); in this case we can redirect to stderr.
- IndexBuilder and TurtleParser now take two arguments: one for the input file and one for its format. If the input file is empty or "-" we parse from stdin; if the file format is empty, we try to decide based on the file extension, otherwise we assume turtle.
- Moved many of the longer parsing methods from the header to the cpp file to keep the structure clean.
- Completely got rid of BZIP2, since we can pipe in now.
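For illustration, a sketch of what this interface enables. The individual flags (`-l`, `-i`, `-F`, `-f`, and `-f -` for stdin) are taken from this thread, but the full command line is an assumption to check against the tool's help output, and the index prefix path is a placeholder:

```sh
# Stream the compressed dump straight into the index builder:
# wget writes to stdout (-O -), bzcat decompresses, and "-f -" tells
# IndexBuilderMain to read from stdin; -F ttl pins the input format,
# since there is no file extension to sniff on a pipe.
wget -O - https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 \
  | bzcat \
  | ./IndexBuilderMain -l -i /index/wikidata -F ttl -f -
```

Because the triple output can also go to stdout, redirecting the LOG(...) macros to stderr keeps the pipe clean.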
LGTM after a few small changes in the docs
README.md
Outdated
```diff
@@ -121,10 +121,9 @@ section](#running-qlever).
 If your input knowledge base is in the standard *NTriple* or *Turtle* format
 create the index with the following command

-IndexBuilderMain -a -l -i /index/<prefix> -n /input/knowledge_base.ttl
+IndexBuilderMain  -l -i /index/<prefix> -f /input/knowledge_base.ttl
```
remove one space before -l
docs/wikidata.md
Outdated
```diff
@@ -26,27 +26,23 @@ build the index under a different path.
 cd qlever
 docker build -t qlever .

-## Download and uncompress Wikidata
+## Download and Wikidata
```
remove "and"
docs/wikidata.md
Outdated
```diff
-Turtle format, you can skip this step. Otherwise we will download and uncompress
-it in this step.
+If you already downloaded Wikidata in the bzip2-compressed
+Turtle format, you can skip this step. Otherwise we will download in this step.
```
"download it in this step"
docs/wikidata.md
Outdated
```diff
-Together, the unpacked Wikidata Turtle file, created in this step, and the index
-we will create in the next section, will use up to about 2 TB.
+their servers are throttled. In case you are in a hurry you can directly pipe the file
+into the index-building pipeline (see next section). But in this case you'll have to redownload
```
This sounds too negative I think. It shouldn't go wrong after all. Maybe just remove that last sentence. We could say that it's always a good idea to keep the original input for comparisons and reproducibility, but that sounds almost condescending.
docs/wikidata.md
Outdated
```diff
 ## Build a QLever Index

-Now we can build a QLever Index from the `latest-all.ttl` Wikidata Turtle file.
+Now we can build a QLever Index from the `latest-all.ttl.bzip2` Wikidata Turtle file.
```
it's ".bz2" not ".bzip2"
```diff
@@ -17,7 +17,7 @@ RUN cmake -DCMAKE_BUILD_TYPE=Release -DLOGLEVEL=DEBUG -DUSE_PARALLEL=true .. &&

 FROM base as runtime
 WORKDIR /app
-RUN apt-get update && apt-get install -y wget python3-yaml unzip curl
+RUN apt-get update && apt-get install -y wget python3-yaml unzip curl bzip2
```
Good find
It changes the interface of the IndexBuilder, since we have to know which format a piped-in stream has.

It assumes that a certain buffer size is enough to parse some triples (two build-time constants): if `B` contiguous bytes are not enough to extract at least `N` triples from them, we get an exception (default: 1 GB for 1000 triples).

The code that does not have this constraint (completely mmapping an uncompressed .ttl file) is still present, but currently not linked to the index builder. Shall we keep it?

@lehmann-4178656ch can you try out if piping the parser works for you?

@niklas88 I'll have a close look tomorrow at stuff I find myself, so you don't need to hurry with a review.