Skip to content

Conversation

@nfsantos
Copy link
Contributor

@nfsantos nfsantos commented Sep 5, 2024

Fix: when full, intern cache was no longer being used to lookup for already interned strings.

INTERN_CACHE.size() < MAX_INTERN_CACHE && part.length() < MAX_INTERNED_STRING_LENGTH) {
pathElements[i] = INTERN_CACHE.computeIfAbsent(part, String::intern);
if ((i < 3 || part.length() == 1 || part.startsWith("jcr:") || COMMON_PATH_WORDS.contains(part)) && part.length() < MAX_INTERNED_STRING_LENGTH) {
pathElements[i] = INTERN_CACHE.size() < MAX_INTERN_CACHE ?
Copy link
Member

@thomasmueller thomasmueller Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so we want to threat it as read-only once it is full... Yes that makes sense. But existing entries that are not popular are not replaced still... So if the "wrong" entries appear early on, then the cache is not effective.

I would check if a simple one-way set-associative (direct-mapped) cache, backed by an array, isn't more efficient...

Copy link
Contributor Author

@nfsantos nfsantos Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unpopular entries are not replaced, so this solution is not good in the general case. In our particular case, from the testing I have done, it is effective. I have tried with data from a real repository and after traversing and storing in memory 10M nodes, there were only 47 elements in the intern cache. And even with so few interned strings, the memory usage at the end of the test, which includes the space taken by the SortKeys of 10M entries, was 3GB lower with interning and sorting the array with the SortedKeys took 3 instead of 5 seconds. The names in the first three levels of the tree are heavily repeated, so interning them provides most of the gains.

Your proposed solution might be more efficient, I can investigate it in the future. For the time being, this PR is just to avoid the degenerate case where when the cache is full, we stop using the interned strings.

@nfsantos nfsantos merged commit e4ca451 into apache:trunk Sep 6, 2024
@nfsantos nfsantos deleted the OAK-11082 branch September 6, 2024 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants