
Parsing from (un)compressed Streams. #220

Merged
merged 3 commits into ad-freiburg:master from f.bzipParsing on Apr 5, 2019

Conversation

Member

@joka921 joka921 commented Mar 30, 2019

  • Supports parsing from piped streams, so we can do
wget -O - hugeKB.ttl.bz2 | bunzip2 | IndexBuilderMain
wget -O - hugeKB.ttl.bz2 | bunzip2 | TurtleParserMain

It changes the interface of the IndexBuilder, since we have to know which format a piped-in stream has.

It assumes that a certain buffer size is enough to parse some triples (two build-time constants):
if B contiguous bytes are not enough to extract at least N triples from them, we get an exception
(default: 1 GB for 1000 triples).
The code that does not have this constraint (completely mmapping an uncompressed .ttl file) is still present, but currently not linked to the index builder; shall we keep it?
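
For illustration, the constraint boils down to a check like the following (a minimal sketch; the constant names and the checkProgress helper are hypothetical, not the actual QLever code):

#include <cstddef>
#include <stdexcept>
#include <string>

// Build-time constants as described above: B contiguous bytes must always
// yield at least N complete triples (defaults from this PR description).
static const size_t BUFFER_SIZE_B = size_t(1) << 30;  // 1 GB
static const size_t MIN_TRIPLES_N = 1000;

// Hypothetical helper, called after a parse pass over one buffer.
void checkProgress(size_t triplesParsed, bool atEndOfInput) {
  if (triplesParsed < MIN_TRIPLES_N && !atEndOfInput) {
    throw std::runtime_error("Fewer than " + std::to_string(MIN_TRIPLES_N) +
                             " triples in " + std::to_string(BUFFER_SIZE_B) +
                             " contiguous bytes; increase the buffer size.");
  }
}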

@lehmann-4178656ch can you try out whether piping the parser works for you?
@niklas88 I'll take a close look tomorrow at anything I can find myself, so you don't need to hurry with a review.

Member

@niklas88 niklas88 left a comment

This looks pretty awesome. Considering Blazegraph takes a week to load Wikidata even on SSDs, being able to build our index directly out of a wget stream and be done overnight is pretty epic.

@@ -59,7 +59,8 @@ if [ "$1" != "no-index" ]; then
pushd "$BINARY_DIR"
echo "Building index $INDEX"
./IndexBuilderMain -l -i "$INDEX" \
-n "$INPUT.nt" \
-F ttl \
Member

Can you add this to the README and the two quickstart documents as well? Also for the Wikidata Quickstart I think we should then remove the uncompression step. I would still save the compressed Wikidata to disk though, just so it doesn't need to be downloaded again if something goes wrong

Member Author

I reworked the two documents. Can you take a look and review them? I hope I didn't make any mistakes.


cout << " " << std::setw(20) << "F, file-format" << std::setw(1) << " "
<< " Specify format of the input file. Must be one of "
"[tsv|nt|ttl|mmap]."
Member

You mention mmap here but later check for bzip2, not mmap.

Member Author

Fixed

<< "(mmap assumes an on-disk turtle file that can be mmapped to memory)"
<< endl;
cout << " " << std::setw(20) << "i, input-file" << std::setw(1) << " "
<< " The file to be parsed from. If omitted, we will read from stdin"
Member

We could also use the convention that `-` means stdin, but omitting the argument is fine too.

Member Author

We now have both options.
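
For reference, the combined convention could look roughly like this (a sketch with a hypothetical openInput helper, not the actual code):

#include <fstream>
#include <iostream>
#include <string>

// Sketch: an empty or "-" input-file argument means stdin,
// anything else is opened as a regular file.
std::istream& openInput(const std::string& inputFile, std::ifstream& file) {
  if (inputFile.empty() || inputFile == "-") {
    return std::cin;
  }
  file.open(inputFile);
  return file;
}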


// The turtle parser has to buffer at least this many triples before a call
// to getline returns. This probably makes the bzip parsing faster.
static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 1000;
Member

I assume that if it reaches EOF before that, it still works?

Member Author

Commented to clarify.

static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 1000;

// When reading from a file, chunks of this size will
// be fed to the parser at once
Member

Maybe the comment should say that this is about 100 MB

Member Author

fixed
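
Roughly, the chunked reading works as sketched below (hypothetical names, and the actual parser call is elided; at EOF, gcount() simply reports a shorter read, which is why a short final chunk is not an error):

#include <cstddef>
#include <istream>
#include <vector>

static const size_t FILE_BUFFER_SIZE = 100 * 1024 * 1024;  // about 100 MB

// Sketch: read the input in chunks of FILE_BUFFER_SIZE bytes and hand each
// chunk to the parser; the last chunk before EOF may be shorter.
void feedParser(std::istream& in) {
  std::vector<char> buffer(FILE_BUFFER_SIZE);
  while (in) {
    in.read(buffer.data(), buffer.size());
    std::streamsize bytesRead = in.gcount();
    if (bytesRead == 0) break;
    // parseChunk(buffer.data(), bytesRead);  // placeholder for the parser
  }
}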

} else if (ad_utility::endsWith(inputFile, ".ttl")) {
fileFormat = "ttl";
} else if (ad_utility::endsWith(inputFile, ".ttl.bz2")) {
fileFormat = "bzip2";
Member

Here it's bzip2, above it was mmap, and I don't see where we still link with the bz2 lib.

Member Author

fixed


// interface needed for convenient testing
// only for testing first, stores the triples that have been parsed
Member

only for testing?

Member Author

of course not.

bool resetStateAndRead(const TurtleParserBackupState state);

// used with compressed inputs (a certain number of bytes is uncompressed at
// once, might also be cut in the middle of a utf8 string. (bzip2 does not no
Member

s/no/know/g

Member Author

The whole comment did not make much sense without bzip2, so I reformulated it.
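
To make the cut-in-the-middle problem concrete (a sketch of the general technique, not QLever's actual code): UTF-8 continuation bytes have the form 10xxxxxx, so a safe split point near the end of a chunk can be found and the incomplete tail carried over into the next chunk.

#include <cstddef>

// Sketch: number of leading bytes of `chunk` that are safe to hand to the
// parser; the remainder (a possibly incomplete UTF-8 code point) must be
// prepended to the next chunk. Conservative: a final *complete* multi-byte
// code point may also be carried over, which is harmless here.
std::size_t safePrefixLength(const char* chunk, std::size_t size) {
  std::size_t i = size;
  // Skip trailing continuation bytes (10xxxxxx)...
  while (i > 0 && (static_cast<unsigned char>(chunk[i - 1]) & 0xC0) == 0x80) {
    --i;
  }
  // ...and the lead byte of the possibly incomplete code point.
  if (i > 0 && (static_cast<unsigned char>(chunk[i - 1]) & 0x80) != 0) {
    --i;
  }
  return i;
}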

- Made BZIP2 parsing work efficiently (the decompression is done in parallel in chunks so we have almost no overhead)
  This allows us to pipe in from wget, bzip2, etc...
- Allowed redirecting the LOG(...) macros to another stream (see the sketch after this list).
  This is useful if we want to use stdout for the triple output (piping); in that case we can redirect the log to stderr.

- IndexBuilder and TurtleParser now take two arguments: one for the input file, and one for its format.
  If the input file is empty or "-", we parse from stdin; if the file format is empty, we try to decide
  based on the file extension and otherwise assume turtle.

- Moved many of the longer parsing methods from the header to the cpp file to keep the structure clean
- Completely got rid of BZIP2, since we can pipe in now.
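
A minimal sketch of the log redirection idea (the logStream pointer and the simplified LOG macro are placeholders for the real macros):

#include <iostream>

// Sketch: route all log output through a settable stream pointer, so that
// stdout stays free for the triple output when piping.
static std::ostream* logStream = &std::cout;
#define LOG(msg) (*logStream) << (msg) << std::endl

int main() {
  logStream = &std::cerr;  // stdout carries the triples; log goes to stderr
  LOG("parsing from stdin ...");
  return 0;
}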
Member

@niklas88 niklas88 left a comment

LGTM after a few small changes in the docs

README.md Outdated
@@ -121,10 +121,9 @@ section](#running-qlever).
If your input knowledge base is in the standard *NTriple* or *Turtle* format,
create the index with the following command

IndexBuilderMain -a -l -i /index/<prefix> -n /input/knowledge_base.ttl
IndexBuilderMain -l -i /index/<prefix> -f /input/knowledge_base.ttl
Member

remove one space before -l

docs/wikidata.md Outdated
@@ -26,27 +26,23 @@ build the index under a different path.
cd qlever
docker build -t qlever .

## Download and uncompress Wikidata
## Download and Wikidata
Member

remove "and"

docs/wikidata.md Outdated
Turtle format, you can skip this step. Otherwise we will download and uncompress
it in this step.
If you already downloaded Wikidata in the bzip2-compressed
Turtle format, you can skip this step. Otherwise we will download in this step.
Member

"download it in this step"

docs/wikidata.md Outdated
Together, the unpacked Wikidata Turtle file, created in this step, and the index
we will create in the next section, will use up to about 2 TB.
their servers are throttled. In case you are in a hurry you can directly pipe the file
into the index-building pipeline (see next section). But in this case you'll have to redownload
Member

This sounds too negative, I think. It shouldn't go wrong, after all. Maybe just remove that last sentence. We could say that it's always a good idea to keep the original input for comparisons and reproducibility, but that sounds almost condescending.

docs/wikidata.md Outdated

## Build a QLever Index

Now we can build a QLever Index from the `latest-all.ttl` Wikidata Turtle file.
Now we can build a QLever Index from the `latest-all.ttl.bzip2` Wikidata Turtle file.
Member

it's ".bz2" not ".bzip2"

@@ -17,7 +17,7 @@ RUN cmake -DCMAKE_BUILD_TYPE=Release -DLOGLEVEL=DEBUG -DUSE_PARALLEL=true .. &&

FROM base as runtime
WORKDIR /app
RUN apt-get update && apt-get install -y wget python3-yaml unzip curl
RUN apt-get update && apt-get install -y wget python3-yaml unzip curl bzip2
Member

Good find

@niklas88 niklas88 merged commit a963575 into ad-freiburg:master Apr 5, 2019
@joka921 joka921 deleted the f.bzipParsing branch April 5, 2019 07:44