
Speeding up the first phase of Index Building #302

Merged
merged 14 commits on Jan 30, 2020

Commits on Jan 27, 2020

  1. Speeding up the first phase of Index Building

    - Identified the writing of Triple elements to hash maps for deduplication as a bottleneck.
    - Sped this up by writing to multiple hash maps at once and merging them afterwards (a sketch
      of this idea follows below).
    - Also concurrently pipelined all the other steps of the first index building phase.
    - That way the Turtle parsing now becomes the bottleneck (teaser: it can be sped up by 30% using
      compile-time regexes, so this might become even faster in a later PR).

    - This was done by implementing an abstract, templated BatchedPipeline that abstracts
      over the creation and transformation of values in a pipeline and allows controlling
      the degree of concurrency used on each level and between the different levels.

    - This pipeline infrastructure was heavily unit-tested to ensure its correctness, since it internally
      uses quite a bit of template magic.

    - This commit also introduces absl::flat_hash_map, which is faster than the dense_hash_map used before.
      If in doubt I can also revert this change, since by now the parallelism helps us more
      than the faster hash maps. But I thought we wanted to try those out anyway.

    - This can already be reviewed, especially the BatchedPipeline.h file. Before merging this I would suggest
      merging the Unicode PR, because there is some merging work to be done (but not too much) between those PRs.
    joka921 committed Jan 27, 2020
    68196f3
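The deduplication scheme described in the commit above (several hash maps filled concurrently and merged afterwards) can be sketched roughly as follows. This is a minimal illustration of the idea under my own assumptions, not QLever's actual BatchedPipeline code; the function names and the pre-sharded input are made up for the example.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

#include "absl/container/flat_hash_map.h"

using PartialMap = absl::flat_hash_map<std::string, size_t>;

// Each worker thread deduplicates its own shard of the parsed words into a
// private hash map, so no locking is needed while the shards are processed.
std::vector<PartialMap> deduplicateInParallel(
    const std::vector<std::vector<std::string>>& shards) {
  std::vector<PartialMap> partialMaps(shards.size());
  std::vector<std::thread> workers;
  for (size_t i = 0; i < shards.size(); ++i) {
    workers.emplace_back([i, &shards, &partialMaps] {
      for (const auto& word : shards[i]) {
        // The mapped value is only a placeholder; real ids would be assigned
        // later, e.g. after the merged vocabulary has been sorted.
        partialMaps[i].emplace(word, size_t{0});
      }
    });
  }
  for (auto& worker : workers) worker.join();
  return partialMaps;
}

// Merging the partial maps afterwards removes duplicates that occurred in
// more than one shard.
PartialMap merge(std::vector<PartialMap> partialMaps) {
  PartialMap merged;
  for (auto& partial : partialMaps) {
    merged.insert(partial.begin(), partial.end());
  }
  return merged;
}
```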
  2. a8d47fb
  3. e712839
  4. Included Unicode Normalization to ensure the reproducibility of Index Builds

    TODO: this also appears in another branch; rewrite the history after merging?
    joka921 committed Jan 27, 2020
    8781f06
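A minimal sketch of what such a normalization step can look like with ICU, assuming NFC is the chosen normalization form; the function name is hypothetical and this is not QLever's actual code. Normalizing every token before it enters the vocabulary makes canonically equivalent inputs map to the same byte sequence, which keeps index builds reproducible.

```cpp
#include <string>

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Normalize a UTF-8 string to NFC so that canonically equivalent inputs
// (e.g. a precomposed "ä" vs. "a" plus combining diaeresis) end up as the
// same byte sequence.
std::string normalizeToNfc(const std::string& utf8Input) {
  UErrorCode status = U_ZERO_ERROR;
  const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
  if (U_FAILURE(status)) {
    return utf8Input;  // sketch only: fall back to the unnormalized input
  }
  icu::UnicodeString normalized =
      nfc->normalize(icu::UnicodeString::fromUTF8(utf8Input), status);
  std::string result;
  normalized.toUTF8String(result);
  return result;
}
```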
  5. Addressed the changes from Niklas' review.

    I would like to test the performance before merging this, to make sure it is at least somewhat fast.
    joka921 committed Jan 27, 2020
    0f3dfbc
  6. We don't collate using ICU directly, but create SortKeys for every element when we first see it

    - This boosts the speed of our parallel pipeline (the idea is sketched below).
    joka921 committed Jan 27, 2020
    0e9934d
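A rough sketch of the SortKey idea with ICU: compute a binary sort key once per word when it is first seen, and let all later comparisons be plain byte comparisons instead of collator calls. The locale, error handling, and function name are my own simplifications, not QLever's implementation.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

#include <unicode/coll.h>
#include <unicode/unistr.h>

// Turn a UTF-8 word into its binary collation key. Keys computed with the
// same collator compare with operator< exactly like the collator would
// compare the original words.
std::string computeSortKey(const icu::Collator& collator,
                           const std::string& utf8Word) {
  icu::UnicodeString word = icu::UnicodeString::fromUTF8(utf8Word);
  // First call only reports the required buffer size (including a final 0 byte).
  int32_t size = collator.getSortKey(word, nullptr, 0);
  std::vector<uint8_t> buffer(size);
  collator.getSortKey(word, buffer.data(), size);
  return std::string(buffer.begin(), buffer.end());
}

// Usage sketch:
// UErrorCode status = U_ZERO_ERROR;
// std::unique_ptr<icu::Collator> coll(
//     icu::Collator::createInstance(icu::Locale("en_US"), status));
// bool less = computeSortKey(*coll, "Äpfel") < computeSortKey(*coll, "Zebra");
```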
  7. Replace google::sparsehash with absl::flat_hash_map

    It is much faster, especially for large hash maps.
    Replaced the default implementation of ad_utility::HashMap.
    We no longer need a default-key provider.
    Abseil deliberately randomizes the iteration order of the hash map, so some unit tests had to be changed.
    joka921 committed Jan 27, 2020
    cc711ec
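A minimal sketch of what the swapped default can look like; the exact alias and template parameters in ad_utility may differ. The point is that absl::flat_hash_map needs no "empty key" / default-key provider the way google::dense_hash_map does, and that it deliberately randomizes its iteration order between runs.

```cpp
#include <cstddef>
#include <string>

#include "absl/container/flat_hash_map.h"

namespace ad_utility {
// Hypothetical new default: a drop-in alias for the project's hash map type.
template <typename K, typename V>
using HashMap = absl::flat_hash_map<K, V>;
}  // namespace ad_utility

// Usage: no set_empty_key() call is required anymore.
// ad_utility::HashMap<std::string, size_t> ids;
// ids["<http://example.org/a>"] = 42;
```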
  8. Only use 50M triples per partial vocabulary for the Wikidata build, to account for the overhead of the SortKeys.
    joka921 committed Jan 27, 2020
    d22e798

Commits on Jan 28, 2020

  1. Made the QueryPlanner return deterministic plans for the unit tests

    - In the unit tests there are sometimes Execution(Sub)Trees that are equivalent and return the same
      costEstimate(). The query planner must decide between them deterministically to make
      the current unit tests pass.
    - This is no longer guaranteed with the newly integrated absl::flat_hash_map, which purposely randomizes its iteration order.
    - With this commit, the QueryPlanner detects that it is in unit test mode (no ExecutionContext assigned)
      and then deterministically chooses the alternative with the smaller cache key on equal cost
      (a sketch of this tie-break follows below).

    - Some of the unit tests had to be adapted to match this behavior.
    joka921 committed Jan 28, 2020
    7a77acd
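The tie-breaking rule described in the commit above could look roughly like this. Plan, costEstimate, cacheKey, and hasExecutionContext are hypothetical stand-ins for the corresponding QLever internals, not the real interface.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a candidate execution (sub)tree.
struct Plan {
  size_t costEstimate;
  std::string cacheKey;  // canonical textual representation of the subtree
};

// Returns true if plan `a` should be preferred over plan `b`.
bool isBetterPlan(const Plan& a, const Plan& b, bool hasExecutionContext) {
  if (a.costEstimate != b.costEstimate) {
    return a.costEstimate < b.costEstimate;
  }
  // Unit-test mode (no ExecutionContext assigned): the randomized iteration
  // order of absl::flat_hash_map must not decide ties, so fall back to the
  // smaller cache key for a deterministic choice.
  if (!hasExecutionContext) {
    return a.cacheKey < b.cacheKey;
  }
  // With a real ExecutionContext, either plan is acceptable on equal cost.
  return false;
}
```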
  2. Fixed the build failure

    joka921 committed Jan 28, 2020
    72e1aef

Commits on Jan 29, 2020

  1. 2c68abc
  2. Also output the size estimate.

    joka921 committed Jan 29, 2020
    0d32451

Commits on Jan 30, 2020

  1. QueryPlannerTest is no longer implementation-dependent.

    Previously the size estimate dummies for Execution trees used std::hash<string>, which is implementation-defined.
    Thus the QueryPlannerTest failed on some platforms without indicating a bug in the QueryPlanner or QLever in general.

    Now we use deterministic estimates (a sketch of the idea follows below).
    joka921 committed Jan 30, 2020
    0cbcf8b
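To illustrate the difference, here is a hedged sketch of a dummy size estimate: the old variant depends on the standard library's std::hash<string>, while a fixed hash (FNV-1a, chosen only for this example) yields the same value on every platform. The actual deterministic estimates used by the tests are not shown in this excerpt.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Old variant: implementation-defined, may differ between libstdc++, libc++
// and MSVC's standard library, so tests relying on it are platform-dependent.
size_t dummySizeEstimateOld(const std::string& cacheKey) {
  return std::hash<std::string>{}(cacheKey) % 1000;
}

// New variant: a fixed 64-bit FNV-1a hash, identical on every platform.
size_t dummySizeEstimateNew(const std::string& cacheKey) {
  uint64_t h = 14695981039346656037ull;
  for (unsigned char c : cacheKey) {
    h ^= c;
    h *= 1099511628211ull;
  }
  return static_cast<size_t>(h % 1000);
}
```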
  2. QueryPlannerTest is no longer implementation-dependent.

    Previously the size estimate dummies for Execution trees used std::hash<string>, which is implementation-defined.
    Thus the QueryPlannerTest failed on some platforms without indicating a bug in the QueryPlanner or QLever in general.

    Now we use deterministic estimates.
    joka921 committed Jan 30, 2020
    3783f38