Make externalization more configurable #207
Conversation
On Wikidata, extracting the labels of all entities is very slow because looking up the labels hits a lot of externalized literals (it is fast in Id space). It might be a lot faster on SSDs, but I think that with vocabulary compression we should be able to internalize longer literals.
Only one stylistic suggestion, otherwise this looks good and is a very useful feature.
bool Vocabulary<S>::shouldBeExternalized(const string& word) const {
  if constexpr (isEntity) {
As there is currently no use for this compile-time check, and I assume that the string operations will always dominate the runtime, I am fine with removing it (I currently don't remember the initial reason for it).
src/index/VocabularyImpl.h
Outdated
void Vocabulary<S>::initializeInternalizedLangs(const StringRange& s) {
  _internalizedLangs.clear();
  for (const auto& el : s) {
    _internalizedLangs.push_back(el);
Can't we write `_internalizedLangs.insert(_internalizedLangs.begin(), s.begin(), s.end());` instead of the loop, to be more expressive (and more efficient; that doesn't matter here, but it is a good pattern)?
@joka921 I'm currently testing the increase in maximum internal literal size by building Wikidata Full, and I've seen the memory usage go up to 67 GB. What do you think? We might have to dial this down again if we want to keep our 64 GB limit. Then again, being able to export e.g. all labels in reasonable time (currently about an hour) is an important use case, and I think this depends heavily on having fewer external literals.
@niklas88 Do I understand correctly that you are talking about 67 GB at index build time? In that case I would wait to see whether this stays the maximum and then check how much memory the server needs. The index build is currently a tradeoff between runtime and memory consumption, and we can tweak the memory again. Do you know at which step we consumed this much memory?
@joka921 It later settled below 40 GB, which I think is still fine. I'm not sure which step it was at, but I believe still during the partial vocabularies. That said, the final vocabulary is only 13 GB (vs. 12 GB before) even with longer literals. On top of that, storing literals of up to 128 bytes cuts the time to export all Wikidata literals from about 45 minutes to less than 1 minute, and it actually also improves our internal evaluation, so I think this is very valuable.
This enables index-build-time configuration of which languages to keep in the internal vocabulary. Also, the maximum number of bytes for internalized literals is turned into a constant. Finally, I want to test increasing it for Wikidata to speed up extracting all entity labels on Wikidata and other large mappings.