Speeding up the first phase of Index Building #302
Conversation
@joka921 That sounds amazing! I have a few questions:
Ok, so for each triple we have to do the following steps:
In a pipeline this looks as follows:
Trick a): all of those levels happen at the same time for different batches (the pipeline principle). After we are finished, we have the problem that some words may have multiple Ids (if they occurred in different triples that were assigned to different hash maps). Unifying these to one Id and updating the triples happens after we are done. The speedup is such that the parser becomes the bottleneck; it runs at about 40 million triples per minute.
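The disjoint-Id scheme described above can be sketched roughly like this (a minimal illustration with `std::unordered_map`, not the actual QLever code; all names here are made up):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Each batch writes to its own hash map, and the maps hand out Ids from
// disjoint ranges because we know an upper bound on the number of triples
// per batch.
using Id = uint64_t;

struct BatchMap {
  Id nextId;  // first Id of this batch's disjoint range
  std::unordered_map<std::string, Id> map;

  // Assign a batch-local Id the first time we see a word.
  Id getId(const std::string& word) {
    auto [it, isNew] = map.try_emplace(word, nextId);
    if (isNew) ++nextId;
    return it->second;
  }
};

// After all batches are done, unify duplicate words to a single Id.
// Returns the mapping from old (per-batch) Ids to final global Ids,
// which is then used to rewrite the triples.
std::unordered_map<Id, Id> unifyIds(const std::vector<BatchMap>& batches) {
  std::unordered_map<std::string, Id> global;
  std::unordered_map<Id, Id> oldToNew;
  Id next = 0;
  for (const auto& batch : batches) {
    for (const auto& [word, oldId] : batch.map) {
      auto [it, isNew] = global.try_emplace(word, next);
      if (isNew) ++next;
      oldToNew[oldId] = it->second;
    }
  }
  return oldToNew;
}
```

A word that occurred in two batches gets two different Ids first; the unification step collapses them to one.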
@joka921 Thanks for the explanation! You write that "We make sure that the hashMaps hand out disjoint sets of Ids (this is easy because we know the maximum number of triples we deal with at once in this whole process)." Does this mean that with this PR there are gaps in the ID space, or do these gaps vanish in the ID unification process? I am slightly worried about code complexity and maintainability. Do I understand you correctly that all the complexity is hidden in the BatchedPipeline class? Is the code outside that class reasonably simple? For example, if someone else wants to extend the parser in some (not overly dramatic) way, will they be able to do it without understanding the intricacies of your batched pipeline?
if this is too dense, each of the inner lambdas can also be set up externally
@niklas88 The Travis build says that it has passed when I click on the
I just reran the build and this time it's propagated correctly. So yeah, looks like a Travis hiccup. I'll look into this in more detail soon, but it already sounds pretty amazing!
}
  if (i % 10000000 == 0) {
    LOG(INFO) << "Lines (from KB-file) processed: " << i << '\n';
  }
}
@joka921 Could you add a few well-placed and concise comments in this code block so that the structure becomes clearer and what the individual lambdas do?
I think you mentioned that you could also replace the lambdas with well-named functions. I think this might indeed be helpful here.
I just tried to build a version of the current master (includes the unicode upgrade) with this PR merged locally and encountered this error while building the docker container:
`git submodule init`
`git submodule update`
should download abseil.
Did the merge work without conflicts?
Hannah Bast <notifications@github.com> wrote on Sat., Jan. 11, 2020, 09:12:
CMake Error at CMakeLists.txt:60 (add_subdirectory):
The source directory
/app/third_party/abseil-cpp
does not contain a CMakeLists.txt file.
@joka921 In fact, it didn't, sorry for the confusion. Below you see the list of the conflicting files (and the GitHub page for the PR shows the exact same list of files). So I will wait until you have modified this pull request to be mergeable with the current master.
Force-pushed from 3069736 to 9c801b7.
Awesome, I have started a build with this code at 9:39 today, see my WA message. Fingers crossed!
@ 5dbe3fc
Didn't we say earlier that using the QUATERNARY level is enough and that using the IDENTICAL level would affect the performance negatively, or am I confusing something here? Here is the corresponding quote from the ICU documentation, which Niklas included in his comments on this PR on 2020-01-02 13:57, highlights added by me:
"Identical Level: When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the NFD form of each string are compared at this level, just in case there is no difference at levels 1-4 . For example, Hebrew cantillation marks are only distinguished at this level. This level should be used sparingly, as only code point values differences between two strings is an extremely rare occurrence. Using this level substantially decreases the performance for both incremental comparison and sort key generation (as well as increasing the sort key length). It is also known as level 5 strength."
@joka921 A minor detail: can you please call the intermediate files
Note that this entails three changes: (1) a dash before the final number, (2) fixed-width formatting of the number, so that the lexicographic order makes more sense when seeing them in a directory listing, (3) a dash instead of an underscore. If you feel that a fixed width of three is not enough, you can also make it four.
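For illustration, the fixed-width formatting from point (2) could look like this (a sketch; the base name and function name are made up, since the actual file name is not shown above):

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Build a file name with a dash before a zero-padded, fixed-width number,
// so that lexicographic order in a directory listing matches numeric order.
std::string partialFileName(const std::string& base, int i, int width = 3) {
  std::ostringstream os;
  os << base << '-' << std::setw(width) << std::setfill('0') << i;
  return os.str();
}
```

With a width of 3, file number 7 becomes `...-007`, which sorts before `...-010` lexicographically as well as numerically.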
I will make this optional. Without level 5 there can be literals that compare equal but are different. This leads to index builds that behave equally but produce different indices in case two of those equal literals are sorted in a different order in the vocabulary. I want to have reproducible index builds, at least optionally, as a check for changes in the index builder. Then I can simply run cmp on the index files to see if I have broken anything.
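To illustrate why the tiebreaker makes the order reproducible, here is a toy sketch (ASCII case-folding stands in for the weaker ICU collation levels; `compareWithTiebreak` is a made-up name, not ICU or QLever code):

```cpp
#include <cctype>
#include <string>

// Compare case-insensitively first; only on a tie, fall back to a byte-wise
// comparison, analogous to ICU's IDENTICAL (level 5) strength. The byte-wise
// tiebreak makes the total order deterministic, which is what makes index
// builds reproducible.
int compareWithTiebreak(const std::string& a, const std::string& b) {
  auto lower = [](const std::string& s) {
    std::string r = s;
    for (char& c : r)
      c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return r;
  };
  std::string la = lower(a), lb = lower(b);
  if (la != lb) return la < lb ? -1 : 1;  // decided at the weaker level
  // Tie at the weaker level: break it by exact bytes.
  return a == b ? 0 : (a.compare(b) < 0 ? -1 : 1);
}
```

Without the tiebreak, "abc" and "ABC" would compare equal, and their relative order in the vocabulary could differ between builds.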
@joka921 Travis complains that
I only did a first, very rough pass of this. Looks great so far, but I had a few comments and would like to look at this with my non-after-work eyes again. If you merge the CTRE PR, I think some of it is in here too, so it would become smaller, right?
src/index/ConstantsIndexCreation.h
Outdated
@@ -53,3 +53,7 @@ static const std::string PARTIAL_MMAP_IDS = ".partial-ids-mmap";

// ________________________________________________________________
static const std::string TMP_BASENAME_COMPRESSION = ".tmp.compression_index";

// _________________________________________________________________
// TODO: Comment
yeah I guess you're right
src/index/VocabularyGeneratorImpl.h
Outdated
} else {
  std::sort(begin(els), end(els), pred);

// ____________________________________________________________________________________________________________
absl::flat_hash_map<Id, Id> createInternalMapping(std::vector<std::pair<string, Id>>* elsPtr) {
I'd like to stick to just one hash map implementation in normal QLever code (i.e. in libraries it's ok). So I think we should make `util::HashMap` use `absl::flat_hash_map`. I think I once refactored it so that this shouldn't be too hard. What do you think?
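The suggestion amounts to keeping a single alias so that call sites never name the underlying implementation. A minimal sketch (with `std::unordered_map` standing in, since abseil is not assumed here; swapping to `absl::flat_hash_map` would then only change this one alias):

```cpp
#include <string>
#include <unordered_map>

// One project-wide alias: call sites write ad_utility::HashMap and never
// depend on the concrete hash map library behind it.
namespace ad_utility {
template <typename K, typename V>
using HashMap = std::unordered_map<K, V>;
}
```

Code written against `ad_utility::HashMap` keeps compiling unchanged when the alias target is switched.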
For this PR I removed absl again, since the parallelism takes care of my current performance issues and we can do the change in a clean, separate PR.
A few more comments but this is great work!
src/index/Index.cpp
Outdated
@@ -235,8 +256,11 @@ VocabularyData Index::passFileForVocabulary(const string& filename,
VocabularyMerger::VocMergeRes mergeRes;
{
  VocabularyMerger v;
  mergeRes =
      v.mergeVocabulary(_onDiskBase, numFiles, _vocab.getCaseComparator());
  auto identicalPred = [c = _vocab.getCaseComparator()](const auto& a,
can be const auto
I changed it, but can you point out any source where making lambdas `const auto` helps performance? Even the "all things constexpr" guys seem to always use plain `auto` for lambdas. The only thing that you can prevent that way is moving the lambda out, but typically compilers see through the lambda very well, and the call operator of lambdas is `const` by default, so I am not convinced that this substantially helps the code.
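A tiny illustration of the point being made (sketch only): a lambda's call operator is `const` by default, so a `const` copy is just as callable, and `const auto` mainly prevents reassigning or moving the closure object.

```cpp
// demo() is a made-up name for this illustration.
int demo() {
  auto f = [](int x) { return x + 1; };
  const auto g = f;  // still callable: operator() of the closure is const
  return g(41);
}
```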
Hmm, good point, I guess you're right.
src/index/Index.cpp
Outdated
ItemMap& map = *mapPtr;
// ____
// TODO<joka921: are those unused now and can be
// removed?>_______________________________________________________________________
Well that's not too hard to test in a compiled language :D
src/index/Index.h
Outdated
@@ -444,7 +445,11 @@ class Index {
  LOG(DEBUG) << "Scan done, got " << result->size() << " elements.\n";
}

using ItemMap = ad_utility::HashMap<string, Id>;
template <typename K, typename V>
using HashMap = absl::flat_hash_map<K, V>;
As in the other comment, if we can I'd prefer a single hash map library to be used and I think Abseil is definitely a great one and this would also kill dependence on order. I'd be fine with splitting this in another PR of course
src/index/Index.h
Outdated
template <class Map>
static Id assignNextId(Map* mapPtr, const string& key);

// TODO<joka921> This should also be unused
Then it should be removed ;-) removed code is bug free code
* @param input The String to be normalized. Must be UTF-8 encoded.
* @return The NFC canonical form of the input in UTF-8 encoding.
*/
std::string normalizeUtf8(std::string_view input) const {
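As background for why NFC normalization is needed at all, here is the motivating fact (not an implementation, just the byte-level observation; the variable names are made up):

```cpp
#include <string>

// The same visible string "é" has two UTF-8 encodings: the precomposed
// code point U+00E9, and "e" followed by the combining acute accent U+0301.
// NFC maps both to the precomposed form, so that byte-wise comparison of
// normalized strings behaves as expected.
const std::string precomposed = "\xC3\xA9";   // U+00E9, 2 bytes
const std::string decomposed = "e\xCC\x81";   // 'e' + U+0301, 3 bytes
```

Without normalization, these two byte sequences would sort as different vocabulary entries despite representing the same text.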
Can we move the LocaleManager to the `util/` folder, together with the other string helpers?
I opened Issue #313 for this so I don't forget it. I do not want to do this while I still have PRs open that modify it to not get into rebasing hell. Otherwise I agree.
src/util/BatchedPipeline.h
Outdated
return std::move(_buffer[_bufferPosition++]);
} else {
  // we can only reach this if the buffer is exhausted and there is nothing
  // more to parse
no parsing
src/util/BatchedPipeline.h
Outdated
/**
 * @brief setup a pipeline that efficiently creates and transforms values. The
 * Concurrency is used between the different levels
I think steps or stages would be more fitting words than levels, because that implies a hierarchy.
I chose stages.
namespace detail {
/* Implementation of setupTupleFromCallable (see below)
 * Needed because we need the Pattern matching on the index_sequence
 * TODO<joka921> In C++ 20 this could be done in place with templated
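The pattern the comment refers to looks roughly like this (a simplified sketch, not the actual `setupTupleFromCallable` from BatchedPipeline.h): call `f(I)` for each index in `0..N-1` and collect the results into a tuple, with a helper that pattern-matches on the `index_sequence`.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Helper that pattern-matches on the index_sequence.
template <typename F, std::size_t... I>
auto setupTupleImpl(F&& f, std::index_sequence<I...>) {
  return std::make_tuple(f(I)...);
}

template <std::size_t N, typename F>
auto setupTupleFromCallable(F&& f) {
  return setupTupleImpl(std::forward<F>(f), std::make_index_sequence<N>{});
}

// In C++20 the helper can be avoided with a templated lambda, e.g.:
//   [&]<std::size_t... I>(std::index_sequence<I...>) {
//     return std::make_tuple(f(I)...);
//   }(std::make_index_sequence<N>{});
```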
I hope we'll get a C++20 capable compiler with Ubuntu 20.04
I highly doubt it, since the standard is feature-frozen but not yet published.
test/BatchedPipelineTest.cpp
Outdated
}
return std::nullopt;
},
[a = int(0)](const auto& x) mutable {
The use of `int(0)` is inconsistent with the above use of `[i = 0]`.
Great work and thanks for addressing my questions!
Force-pushed from 903a42d to 90c946c.
- Identified the writing of triple elements to hash maps for deduplication as a bottleneck.
- Sped this up by writing to multiple hash maps at once and merging them afterwards.
- Also concurrently pipelined all the other steps in the first index building phase.
- That way the Turtle parsing now becomes the bottleneck (teaser: this can be sped up by 30% using compile-time regexes, so this might become even faster in a later PR).
- This was done by implementing an abstract, templated BatchedPipeline that abstracts over the creation and transformation of values in a pipeline and allows controlling the degree of concurrency used on each level and between the different levels.
- This pipeline infrastructure was heavily unit-tested to ensure its correctness, since it internally uses quite some template magic.
- This commit also introduces absl::flat_hash_map, which is faster than the dense_hash_map used before. If in doubt I can also revert this change, since in the meantime the parallelism is what helps us more than the faster hash maps. But I thought that we wanted to try those out anyway.
- This can already be reviewed, especially the BatchedPipeline.h file. Before merging I would suggest merging the Unicode PR, because there is some merging work to be done (but not too much) between those PRs.
…sure total ordering.
… Builds TODO: this also appears in another Branch, after merging, rewrite the history?
I would like to test the performance before merging this, as it should be at least somewhat fast.
…ement when we first see it - This boosts up the speed in our parallel pipeline.
It is much faster, especially for large hash maps. Replaced the default implementation of ad_utility::HashMap; we no longer need a default-key provider. absl strictly randomizes the iteration order of the hash map, so some unit tests had to be changed.
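The unit-test fix implied above is to stop depending on iteration order at all. A sketch (using `std::unordered_map`, whose order is merely unspecified; `absl::flat_hash_map` even randomizes it per run; `sortedEntries` is a made-up helper name):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Extract the map's entries and sort them, so that tests can assert against
// a deterministic view regardless of the map's iteration order.
std::vector<std::pair<std::string, int>> sortedEntries(
    const std::unordered_map<std::string, int>& m) {
  std::vector<std::pair<std::string, int>> v(m.begin(), m.end());
  std::sort(v.begin(), v.end());
  return v;
}
```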
Force-pushed from 96babbd to cc711ec.
…o account for the overhead of the SortKeys.
@hannahbast Sorry for the internal and meaningless commit messages, I used this branch to transport content from my local machine to Galera. @hannahbast @niklas88 I have decided to integrate two more changes into this PR; they are separate commits and thus should be easy to review separately.
All in all, maybe @hannahbast should try building an index using this PR, and @niklas88 could have a look at the two additional changes.
@joka921 the code looks good but the commit for replacing
@joka921 I actually started on fixing that damn
@niklas88 @hannahbast The issue of the QueryPlannerTest seemed fixable to me (whenever the QueryPlanner can choose between equivalent trees, we break the tie according to the CacheKey). However, implementing this I stumbled upon a probably serious issue: #314
- (Mostly) in the unit tests there sometimes are Execution(Sub)Trees that are equivalent and return the same costEstimate(). The query planner must deterministically decide between them to make the current unit tests pass.
- This is not the case with the newly integrated absl::flat_hash_map, which purposely randomizes its iteration order.
- With this commit, the QueryPlanner detects that it is in unit test mode (no ExecutionContext assigned) and then deterministically chooses the alternative with the smaller cache key on equal cost.
- Some of the unit tests had to be adapted to match this behavior.
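The tie-breaking rule described above can be sketched like this (illustrative types and names only, not the actual QueryPlanner code):

```cpp
#include <string>
#include <vector>

// On equal cost estimates, deterministically prefer the plan with the
// smaller cache key, so the choice no longer depends on hash-map
// iteration order.
struct Plan {
  int cost;
  std::string cacheKey;
};

Plan pickPlan(const std::vector<Plan>& candidates) {
  Plan best = candidates.front();
  for (const auto& p : candidates) {
    if (p.cost < best.cost ||
        (p.cost == best.cost && p.cacheKey < best.cacheKey)) {
      best = p;
    }
  }
  return best;
}
```

Whatever order the candidates arrive in, the same plan wins, which is exactly what the unit tests need.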
@niklas88 I applied a fix for the QueryPlannerTest business. Let me know what you think of it (last commit) and verify whether this indeed fixes the problem (I don't have that many machines at my disposal with different architectures).
@joka921 the build currently fails both on Travis and locally with GCC 9.2.x.
On the other hand absl uses
@niklas88 there is
Ok, so I'm still getting a test failure in the … The abseil CMake magic used …
I just spent some time running a debugger on the first of your failing unit tests.
@niklas88
test_fail.txt
Previously the size estimate dummies for execution trees used std::hash<string>, which is implementation-defined. Thus the QueryPlannerTest failed on some platforms without indicating a bug in the QueryPlanner or QLever in general. Now we use deterministic estimates.
@niklas88
Great work! The last commit now also fixes #294 and makes all tests pass on s390x, including the E2E tests.