This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

improve zap chunking #19

Merged
merged 2 commits into from
May 26, 2020

mschoch (Member) commented May 23, 2020

chunk mode 1025 made a significant improvement for indexes
with many unique fields (or very low cardinality)

here we introduce chunk mode 1026 which takes the same
approach as 1025, but extends it to improve chunking for
all other cardinalities

specifically, we try to create the fewest possible dense chunks;
chunk size is still capped at 1024 documents, as before

mschoch (Member, Author) commented May 23, 2020

For some reason this produces larger indexes.

sreekanth-cb (Contributor) commented

Tried indexing Wikipedia text, about 2.3M documents, with the default mapping.
Size in v13 format: 12.69GB, single-segment index.

Size in v14 format: 12.26GB, single-segment index.
So a size reduction of ~430MB was observed.

A size reduction was also noted with v14 when going from a multi-segment index to a single-segment index: 12.79GB => 12.26GB.

@mschoch, any context on the size increase you observed?

mschoch (Member, Author) commented May 26, 2020

@sreekanth-cb using bleve-blast to index 1 million documents I got the following:

$ du -sh bleve-scorch-zapv14-mil.bleve 
5.9G	bleve-scorch-zapv14-mil.bleve
$ du -sh bleve-scorch-zapv13-mil.bleve 
3.5G	bleve-scorch-zapv13-mil.bleve

It's not single segment, so probably a merge was in progress. I have to update bleve-blast to have an option to use the IndexBuilder.

mschoch (Member, Author) commented May 26, 2020

@sreekanth-cb never mind, my bleve-blast silently reverted to zapv11 again because we still have not merged this fix:

blevesearch/bleve#1401

mschoch (Member, Author) commented May 26, 2020

Confirmed small improvement with v14 with IndexBuilder on wiki dataset and confirmed the integration tests pass with v14, so will proceed to merge this, and do the bleve/zap release dance.

@mschoch mschoch merged commit 33840bf into master May 26, 2020
@mschoch mschoch deleted the improve-chunking-v14 branch May 26, 2020 16:12