Make externalization more configurable #207
Conversation
On Wikidata, extracting the labels of all entities is very slow because looking up the labels hits a lot of externalized literals (it is fast in Id space). It might be a lot faster on SSDs, but I think that with vocabulary compression we should be able to internalize longer literals.
Only one stylistic suggestion, otherwise this looks good and is a very useful feature.
bool Vocabulary<S>::shouldBeExternalized(const string& word) const {
  if constexpr (isEntity) {
As there is currently no use for this compile-time check, and I assume that the string operations will always dominate the runtime, I am fine with removing it (I currently don't remember the initial reason for it).
src/index/VocabularyImpl.h
Outdated
void Vocabulary<S>::initializeInternalizedLangs(const StringRange& s) {
  _internalizedLangs.clear();
  for (const auto& el : s) {
    _internalizedLangs.push_back(el);
Can't we write `_internalizedLangs.insert(_internalizedLangs.begin(), s.begin(), s.end());` instead of the loop, to be more expressive (and more efficient; that doesn't matter here, but it is a good pattern)?
@joka921 I'm currently testing the increase in maximum internal literal size by building Wikidata Full, and I've seen the memory usage go up to 67 GB. What do you think? We might have to dial this down again if we want to keep our 64 GB limit. Then again, being able to export e.g. all labels in reasonable time (currently about an hour) is an important use case, and I think this depends heavily on having fewer external literals.
@niklas88 Do I understand correctly that you are talking about 67 GB at index build time? In that case I would wait to see whether this stays the maximum and then check how much memory the server needs. The index build is currently a tradeoff between runtime and memory consumption, and we can tweak the memory again. Do you know at which step we consumed this much memory?
@joka921 It later settled below 40 GB, which I think is still fine. I'm not sure which step it was at, but I believe still during the partial vocabularies. That said, the final vocabulary is only 13 GB (vs. 12 GB before) even with longer literals. On top of that, storing literals of up to 128 bytes cuts the time to export all Wikidata literals from about 45 minutes to less than 1 minute, and it actually also improves our internal evaluation, so I think this is very valuable.
This enables index-build-time configuration of which languages to keep in the internal vocabulary. Also, the maximum number of bytes for internalized literals is turned into a constant. Finally, I want to test increasing it for Wikidata to speed up extracting all entity labels on Wikidata and other large mappings.