
Make externalization more configurable #207

Merged
merged 4 commits into from Mar 18, 2019

Conversation

niklas88
Member

This enables configuring, at index build time, which languages to keep in the internal vocabulary. The maximum number of bytes for internalized literals is also turned into a constant. Finally, I want to test increasing this constant for Wikidata to speed up extracting all entity labels and other large mappings.

On Wikidata, extracting the labels of all entities is very slow because
looking up the labels hits a lot of externalized literals (it is fast in
Id space). It might be a lot faster on SSDs, but I think with vocabulary
compression we should be able to internalize longer literals.

@joka921 joka921 left a comment


Only one stylistic suggestion, otherwise this looks good and is a very useful feature.

bool Vocabulary<S>::shouldBeExternalized(const string& word) const {
  if constexpr (isEntity) {

As there is currently no use for this compile-time check, and I assume that the string operations will always dominate the runtime, it is fine with me to remove it (I currently don't remember the initial reason for it).

void Vocabulary<S>::initializeInternalizedLangs(const StringRange& s) {
  _internalizedLangs.clear();
  for (const auto& el : s) {
    _internalizedLangs.push_back(el);

Can't we write `_internalizedLangs.insert(_internalizedLangs.begin(), s.begin(), s.end());` instead of the loop? It would be more expressive (and more efficient; that doesn't matter here, but it is probably a good pattern).

@niklas88 niklas88 merged commit ef4c7ec into ad-freiburg:master Mar 18, 2019
@niklas88
Member Author

@joka921 so I'm currently testing the increased maximum internal literal size by building Wikidata Full, and I've seen the memory usage go up to 67 GB. What do you think? We might have to dial this down again if we want to keep our 64 GB limit. Then again, being able to export e.g. all labels in reasonable time (it currently takes about an hour) is an important use case, and I think this depends strongly on having fewer external literals.

@joka921
Member

joka921 commented Mar 19, 2019

@niklas88 Do I understand correctly that you are talking about 67 GB at index build time? In that case I would wait to see whether this stays the maximum and then check how much memory the server needs.

The index build is currently a trade-off between runtime and memory consumption, and we can tweak the memory again. Do you know at which step we consumed this much memory?

@niklas88
Member Author

@joka921 it later settled below 40 GB, so I think this is still fine. I'm not sure which step it was at, but I think it was still during the partial vocabularies. That said, the final vocabulary is only 13 GB (vs. 12 GB before), even with the longer literals. On top of that, storing literals of up to 128 bytes cuts the time to export all Wikidata literals from about 45 minutes to less than 1 minute, and it actually also improves our internal evaluation, so I think this is very valuable.

@niklas88 niklas88 deleted the config_externalize branch October 1, 2019 15:53