
Speeding up the first phase of Index Building #302

Merged
merged 14 commits on Jan 30, 2020

Commits on Jan 27, 2020

  1. Speeding up the first phase of Index Building

    - Identified the writing of Triple elements to hash maps for deduplication as a bottleneck.
    - Sped this up by writing to multiple hash maps at once and merging them afterwards (a sketch
      of this idea follows below).
    - Also concurrently pipelined all the other steps of the first index building phase.
    - That way the Turtle parsing now becomes the bottleneck (teaser: it can be sped up by 30% using
      compile-time regexes, so this might become even faster in a later PR).

    - This was done by implementing an abstract, templated BatchedPipeline that abstracts
      over the creation and transformation of values in a pipeline and allows controlling
      the degree of concurrency used on each level and between the different levels.

    - This pipeline infrastructure was heavily unit-tested to ensure its correctness, since it internally
      uses quite a bit of template magic.

    - This commit also introduces absl::flat_hash_map, which is faster than the dense_hash_map used before.
      If in doubt I can also revert this change, since by now the parallelism helps us more
      than the faster hash maps. But I thought we wanted to try those out anyway.

    - This can already be reviewed, especially the BatchedPipeline.h file. Before merging this I would suggest
      merging the Unicode PR, because there is some merging work to be done (but not too much) between those PRs.
    joka921 committed Jan 27, 2020
    68196f3
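The deduplication scheme described in the commit above (several hash maps filled concurrently and merged afterwards) can be sketched roughly as follows. This is a minimal illustration of the idea under my own assumptions, not QLever's actual BatchedPipeline code; the function names and the pre-sharded input are made up for the example.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

#include "absl/container/flat_hash_map.h"

using PartialMap = absl::flat_hash_map<std::string, size_t>;

// Each worker thread deduplicates its own shard of the parsed words into a
// private hash map, so no locking is needed while the shards are processed.
std::vector<PartialMap> deduplicateInParallel(
    const std::vector<std::vector<std::string>>& shards) {
  std::vector<PartialMap> partialMaps(shards.size());
  std::vector<std::thread> workers;
  for (size_t i = 0; i < shards.size(); ++i) {
    workers.emplace_back([i, &shards, &partialMaps] {
      for (const auto& word : shards[i]) {
        // The mapped value is only a placeholder; real ids would be assigned
        // later, e.g. after the merged vocabulary has been sorted.
        partialMaps[i].emplace(word, size_t{0});
      }
    });
  }
  for (auto& worker : workers) worker.join();
  return partialMaps;
}

// Merging the partial maps afterwards removes duplicates that occurred in
// more than one shard.
PartialMap merge(std::vector<PartialMap> partialMaps) {
  PartialMap merged;
  for (auto& partial : partialMaps) {
    merged.insert(partial.begin(), partial.end());
  }
  return merged;
}
```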
  2. a8d47fb
  3. e712839
  4. Included Unicode Normalization to ensure the reproducibility of Index Builds

    TODO: this also appears in another branch; rewrite the history after merging?
    joka921 committed Jan 27, 2020
    8781f06
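A minimal sketch of what such a normalization step can look like with ICU, assuming NFC is the chosen normalization form; the function name is hypothetical and this is not QLever's actual code. Normalizing every token before it enters the vocabulary makes canonically equivalent inputs map to the same byte sequence, which keeps index builds reproducible.

```cpp
#include <string>

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Normalize a UTF-8 string to NFC so that canonically equivalent inputs
// (e.g. a precomposed "ä" vs. "a" plus combining diaeresis) end up as the
// same byte sequence.
std::string normalizeToNfc(const std::string& utf8Input) {
  UErrorCode status = U_ZERO_ERROR;
  const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
  if (U_FAILURE(status)) {
    return utf8Input;  // sketch only: fall back to the unnormalized input
  }
  icu::UnicodeString normalized =
      nfc->normalize(icu::UnicodeString::fromUTF8(utf8Input), status);
  std::string result;
  normalized.toUTF8String(result);
  return result;
}
```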
  5. Addressed the changes from Niklas' review.

    I would like to test the performance before merging this, to make sure it is at least somewhat fast.
    joka921 committed Jan 27, 2020
    0f3dfbc
  6. We don't collate using ICU directly, but create SortKeys for every element when we first see it

    - This boosts the speed of our parallel pipeline (the idea is sketched below).
    joka921 committed Jan 27, 2020
    0e9934d
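A rough sketch of the SortKey idea with ICU: compute a binary sort key once per word when it is first seen, and let all later comparisons be plain byte comparisons instead of collator calls. The locale, error handling, and function name are my own simplifications, not QLever's implementation.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

#include <unicode/coll.h>
#include <unicode/unistr.h>

// Turn a UTF-8 word into its binary collation key. Keys computed with the
// same collator compare with operator< exactly like the collator would
// compare the original words.
std::string computeSortKey(const icu::Collator& collator,
                           const std::string& utf8Word) {
  icu::UnicodeString word = icu::UnicodeString::fromUTF8(utf8Word);
  // First call only reports the required buffer size (including a final 0 byte).
  int32_t size = collator.getSortKey(word, nullptr, 0);
  std::vector<uint8_t> buffer(size);
  collator.getSortKey(word, buffer.data(), size);
  return std::string(buffer.begin(), buffer.end());
}

// Usage sketch:
// UErrorCode status = U_ZERO_ERROR;
// std::unique_ptr<icu::Collator> coll(
//     icu::Collator::createInstance(icu::Locale("en_US"), status));
// bool less = computeSortKey(*coll, "Äpfel") < computeSortKey(*coll, "Zebra");
```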
  7. Replace google::sparsehash with absl::flat_hash_map

    It is much faster, especially for large hash maps.
    Replaced the default implementation of ad_utility::HashMap.
    We no longer need a default-key provider.
    Abseil deliberately randomizes the iteration order of the hash map, so some unit tests had to be changed.
    joka921 committed Jan 27, 2020
    cc711ec
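A minimal sketch of what the swapped default can look like; the exact alias and template parameters in ad_utility may differ. The point is that absl::flat_hash_map needs no "empty key" / default-key provider the way google::dense_hash_map does, and that it deliberately randomizes its iteration order between runs.

```cpp
#include <cstddef>
#include <string>

#include "absl/container/flat_hash_map.h"

namespace ad_utility {
// Hypothetical new default: a drop-in alias for the project's hash map type.
template <typename K, typename V>
using HashMap = absl::flat_hash_map<K, V>;
}  // namespace ad_utility

// Usage: no set_empty_key() call is required anymore.
// ad_utility::HashMap<std::string, size_t> ids;
// ids["<http://example.org/a>"] = 42;
```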
  8. Only use 50M triples per partial vocabulary for the Wikidata build, to account for the overhead of the SortKeys.
    joka921 committed Jan 27, 2020
    d22e798

Commits on Jan 28, 2020

  1. Made the QueryPlanner return deterministic plans for the unit tests

    - In the unit tests there are sometimes Execution(Sub)Trees that are equivalent and return the same
      costEstimate(). The query planner must decide between them deterministically to make
      the current unit tests pass.
    - This is no longer guaranteed with the newly integrated absl::flat_hash_map, which purposely randomizes its iteration order.
    - With this commit, the QueryPlanner detects that it is in unit test mode (no ExecutionContext assigned)
      and then deterministically chooses the alternative with the smaller cache key on equal cost
      (a sketch of this tie-break follows below).

    - Some of the unit tests had to be adapted to match this behavior.
    joka921 committed Jan 28, 2020
    7a77acd
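The tie-breaking rule described in the commit above could look roughly like this. Plan, costEstimate, cacheKey, and hasExecutionContext are hypothetical stand-ins for the corresponding QLever internals, not the real interface.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a candidate execution (sub)tree.
struct Plan {
  size_t costEstimate;
  std::string cacheKey;  // canonical textual representation of the subtree
};

// Returns true if plan `a` should be preferred over plan `b`.
bool isBetterPlan(const Plan& a, const Plan& b, bool hasExecutionContext) {
  if (a.costEstimate != b.costEstimate) {
    return a.costEstimate < b.costEstimate;
  }
  // Unit-test mode (no ExecutionContext assigned): the randomized iteration
  // order of absl::flat_hash_map must not decide ties, so fall back to the
  // smaller cache key for a deterministic choice.
  if (!hasExecutionContext) {
    return a.cacheKey < b.cacheKey;
  }
  // With a real ExecutionContext, either plan is acceptable on equal cost.
  return false;
}
```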
  2. Fixed the build failure

    joka921 committed Jan 28, 2020
    72e1aef

Commits on Jan 29, 2020

  1. 2c68abc
  2. Also output the size estimate.

    joka921 committed Jan 29, 2020
    0d32451

Commits on Jan 30, 2020

  1. QueryPlannerTest is no longer implementation-dependent.

    Previously the size estimate dummies for Execution trees used std::hash<string>, which is implementation-defined.
    Thus the QueryPlannerTest failed on some platforms without indicating a bug in the QueryPlanner or QLever in general.

    Now we use deterministic estimates (a sketch of the idea follows below).
    joka921 committed Jan 30, 2020
    0cbcf8b
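To illustrate the difference, here is a hedged sketch of a dummy size estimate: the old variant depends on the standard library's std::hash<string>, while a fixed hash (FNV-1a, chosen only for this example) yields the same value on every platform. The actual deterministic estimates used by the tests are not shown in this excerpt.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Old variant: implementation-defined, may differ between libstdc++, libc++
// and MSVC's standard library, so tests relying on it are platform-dependent.
size_t dummySizeEstimateOld(const std::string& cacheKey) {
  return std::hash<std::string>{}(cacheKey) % 1000;
}

// New variant: a fixed 64-bit FNV-1a hash, identical on every platform.
size_t dummySizeEstimateNew(const std::string& cacheKey) {
  uint64_t h = 14695981039346656037ull;
  for (unsigned char c : cacheKey) {
    h ^= c;
    h *= 1099511628211ull;
  }
  return static_cast<size_t>(h % 1000);
}
```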
  2. QueryPlannerTest is no longer implementation-dependent.

    Previously the size estimate dummies for Execution trees used std::hash<string>, which is implementation-defined.
    Thus the QueryPlannerTest failed on some platforms without indicating a bug in the QueryPlanner or QLever in general.

    Now we use deterministic estimates.
    joka921 committed Jan 30, 2020
    3783f38