Use group-varint to encode the positions #12842
base: main
Conversation
Thanks for looking. Unfortunately, the case I'm most interested in is when
Thanks for your suggestion, I'm thinking about that too. I will continue working on this.
Sorry for the late update! I spent some more time on another PR. I encoded the positions with group-varint; the simple benchmark summary:
Benchmark code:
public class SortedStringWriteBenchmark {
static class Benchark {
Random rand = new Random(0);
String randomString(int termsPerField, int freqPerTerm) {
List<String> values = new ArrayList<>();
for (int i = 0; i < termsPerField; ) {
String s = TestUtil.randomSimpleString(rand, 5, 10);
for (int j = 0; j < freqPerTerm; j++) {
values.add(s);
}
i += freqPerTerm;
}
Collections.shuffle(values);
String text = String.join(" ", values);
return text;
}
List<String> randomStrings(int max, int termsPerField, int freqPerTerm) {
List<String> values = new ArrayList<>();
for (int i = 0; i < max; i++) {
values.add(randomString(termsPerField, freqPerTerm));
}
return values;
}
long write() throws IOException {
List<String> terms = randomStrings(10000, 200, 100);
Path temp = Files.createTempDirectory(Paths.get("/Volumes/RamDisk"), "tmpDirPrefix");
Directory dir = MMapDirectory.open(temp);
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setIndexSort(new Sort(new SortField("sort", SortField.Type.LONG)));
config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
IndexWriter w = new IndexWriter(dir, config);
for (int i = 0; i < terms.size(); ++i) {
Document doc = new Document();
doc.add(new NumericDocValuesField("sort", rand.nextInt()));
doc.add(new TextField("field", terms.get(i), Field.Store.NO));
w.addDocument(doc);
}
long t0 = System.currentTimeMillis();
w.flush();
long took = System.currentTimeMillis() - t0;
w.close();
dir.close();
return took;
}
}
public static void main(final String[] args) throws Exception {
int iter = 50;
Benchark benchark = new Benchark();
List<Long> times = new ArrayList<>();
for (int i = 0; i < iter; i++) {
long took = benchark.write();
times.add(took);
System.out.println("iteration " + i + ",took(ms):" + took);
}
double avg = times.stream().skip(iter / 2).mapToLong(Number::longValue).average().getAsDouble();
long min = times.stream().mapToLong(Number::longValue).min().getAsLong();
System.out.println("best took(ms) avg:" + avg + ", min:" + min);
}
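For context on the technique under discussion, here is a minimal, self-contained sketch of group-varint encoding: each group of four values gets one flag byte whose two-bit fields give the byte length of each value, followed by the values written little-endian with only as many bytes as they need. This illustrates the general format only; it is not Lucene's actual GroupVIntWriter/GroupVIntReader implementation.

```java
import java.io.ByteArrayOutputStream;

// Simplified group-varint sketch: one flag byte per group of 4 ints (2 bits
// per value encode length-1), then each value little-endian in 1..4 bytes.
public class GroupVarintSketch {

  // number of bytes needed to hold v, treated as unsigned: 1..4
  static int numBytes(int v) {
    if ((v & 0xFFFFFF00) == 0) return 1;
    if ((v & 0xFFFF0000) == 0) return 2;
    if ((v & 0xFF000000) == 0) return 3;
    return 4;
  }

  static byte[] encode(int[] values) { // values.length must be a multiple of 4
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < values.length; i += 4) {
      int flags = 0;
      for (int j = 0; j < 4; j++) {
        flags |= (numBytes(values[i + j]) - 1) << (j * 2);
      }
      out.write(flags);
      for (int j = 0; j < 4; j++) {
        int v = values[i + j];
        for (int b = 0; b < numBytes(values[i + j]); b++) {
          out.write(v & 0xFF);
          v >>>= 8;
        }
      }
    }
    return out.toByteArray();
  }

  static int[] decode(byte[] bytes, int count) {
    int[] values = new int[count];
    int pos = 0;
    for (int i = 0; i < count; i += 4) {
      int flags = bytes[pos++] & 0xFF;
      for (int j = 0; j < 4; j++) {
        int len = ((flags >>> (j * 2)) & 3) + 1;
        int v = 0;
        for (int b = 0; b < len; b++) {
          v |= (bytes[pos++] & 0xFF) << (b * 8);
        }
        values[i + j] = v;
      }
    }
    return values;
  }

  public static void main(String[] args) {
    int[] input = {3, 300, 70000, 12, 1, 2, 3, 4};
    byte[] encoded = encode(input);
    // 2 groups: (1 flag + 1+2+3+1 bytes) + (1 flag + 4 bytes) = 13 bytes
    System.out.println(encoded.length);
    System.out.println(java.util.Arrays.toString(decode(encoded, input.length)));
  }
}
```

The point of the flag byte is that a decoder learns all four lengths from one branch-free table lookup, instead of testing a continuation bit on every byte as classic vint does.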
Hmm... there's still no significant performance improvement, possibly in the write path. JMH output:
JMH benchmark code for write:
public class GroupVIntBenchmark {
// Cumulative frequency for each number of bits per value used by doc deltas of tail postings on
// wikibigall.
private static final float[] CUMULATIVE_FREQUENCY_BY_BITS_REQUIRED =
new float[] {
0.0f,
0.01026574f,
0.021453038f,
0.03342156f,
0.046476692f,
0.060890317f,
0.07644147f,
0.093718216f,
0.11424741f,
0.13989712f,
0.17366524f,
0.22071244f,
0.2815692f,
0.3537585f,
0.43655503f,
0.52308f,
0.6104675f,
0.7047371f,
0.78155357f,
0.8671179f,
0.9740598f,
1.0f
};
final int maxSize = 256;
final long[] docs = new long[maxSize];
// benchmark for write
final ByteBuffersDataOutput byteBuffersDataOutput = new ByteBuffersDataOutput();
@Param({"64"})
public int size;
@Setup(Level.Trial)
public void init() throws Exception {
Random r = new Random(0);
for (int i = 0; i < maxSize; ++i) {
float randomFloat = r.nextFloat();
// Reproduce the distribution of the number of bits per values that we're observing for tail
// postings on wikibigall.
int numBits = 1 + Arrays.binarySearch(CUMULATIVE_FREQUENCY_BY_BITS_REQUIRED, randomFloat);
if (numBits < 0) {
numBits = -numBits;
}
docs[i] = r.nextInt(1 << (numBits - 1), 1 << numBits);
}
}
@Benchmark
public void benchByteBuffersIndexInput_writeGroupVInt(Blackhole bh) throws IOException {
byteBuffersDataOutput.reset();
byteBuffersDataOutput.writeGroupVInts(docs, size);
}
@Benchmark
public void benchByteBuffersIndexInput_writeVInt(Blackhole bh) throws IOException {
byteBuffersDataOutput.reset();
for (int i = 0; i < size; i++) {
byteBuffersDataOutput.writeVInt((int)docs[i]);
}
}
}
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
Sorry, this PR was no longer on my radar. It looks promising!
private void writePositionsWithOutOffsets(final PostingsEnum in, final DataOutput out, int freq)
    throws IOException {
  int previousPosition = 0;
  if (storePayloads) {
Maybe do this branch if storePayloads || storeOffsets (and re-introduce the logic for offsets below), since the common case is when positions are indexed, but not offsets and not payloads? This way we can remove writePositionsWithOffsets?
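A rough, self-contained sketch of the control flow this suggestion implies: keep the per-value path only when payloads or offsets are stored, and gather plain position deltas for bulk encoding otherwise. PostingsEnum/DataOutput are replaced here with plain arrays and a byte buffer so the sketch runs standalone; all names and the buffering are illustrative assumptions, not Lucene's actual code.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the proposed split: slow interleaved path for payloads/offsets,
// delta-gathering path for positions-only (where bulk group-varint could apply).
public class PositionDeltaSketch {

  // classic vint: 7 data bits per byte, high bit marks continuation
  static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  static byte[] writePositions(int[] positions, boolean storePayloads, boolean storeOffsets) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    if (storePayloads || storeOffsets) {
      // rare case: position deltas interleaved with payload/offset data,
      // written one value at a time
      int prev = 0;
      for (int pos : positions) {
        writeVInt(out, pos - prev);
        prev = pos;
        // ... payload length / offset deltas would be written here ...
      }
    } else {
      // common case: positions only; gather the deltas first so they can be
      // bulk-encoded (e.g. with group-varint) instead of one vint at a time
      int prev = 0;
      int[] deltas = new int[positions.length];
      for (int i = 0; i < positions.length; i++) {
        deltas[i] = positions[i] - prev;
        prev = positions[i];
      }
      for (int d : deltas) {
        writeVInt(out, d); // stand-in for a bulk group-varint encode
      }
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    // deltas 1, 4, 125, 870 -> vint sizes 1+1+1+2 = 5 bytes
    byte[] bytes = writePositions(new int[] {1, 5, 130, 1000}, false, false);
    System.out.println(bytes.length);
  }
}
```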
Thanks for looking at this! I will update it later next week.
Hi Adrien, for simplicity I changed the logic to use group-varint only when positions alone are stored (no offsets and no payloads). In addition, I wrote a new (ugly) JMH benchmark:
Code:
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 3)
@Measurement(iterations = 5, time = 5)
@Fork(
value = 1,
jvmArgsPrepend = {"--add-modules=jdk.unsupported"})
public class PosGroupVIntBenchmark {
Directory dir;
private Random rand = new Random(0);
PostingsEnum reuse;
TermsEnum termsEnum;
// copy from TestUtil
private String randomSimpleString(Random r, int minLength, int maxLength) {
final int end = r.nextInt(minLength, maxLength);
if (end == 0) {
// allow 0 length
return "";
}
final char[] buffer = new char[end];
for (int i = 0; i < end; i++) {
buffer[i] = (char) r.nextInt('a', 'z');
}
return new String(buffer, 0, end);
}
private String randomString(int termsPerField, int freqPerTerm) {
List<String> values = new ArrayList<>();
for (int i = 0; i < termsPerField; ) {
String s = randomSimpleString(rand, 5, 10);
for (int j = 0; j < freqPerTerm; j++) {
values.add(s);
}
i += freqPerTerm;
}
Collections.shuffle(values, rand);
String text = String.join(" ", values);
return text;
}
private List<String> randomStrings(int size, int termsPerField, int freqPerTerm) {
List<String> values = new ArrayList<>();
for (int i = 0; i < size; i++) {
values.add(randomString(termsPerField, freqPerTerm));
}
return values;
}
@Setup(Level.Trial)
public void init() throws Exception {
dir = new ByteBuffersDirectory();
List<String> terms = randomStrings(10, 200, 100);
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
Sort indexSort = new Sort(new SortField("sort", SortField.Type.LONG));
config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
IndexWriter w = new IndexWriter(dir, config);
for (int i = 0; i < terms.size(); ++i) {
Document doc = new Document();
doc.add(new NumericDocValuesField("sort", rand.nextInt()));
doc.add(new TextField("field", terms.get(i), Field.Store.NO));
w.addDocument(doc);
}
w.commit();
IndexReader r = DirectoryReader.open(w);
CodecReader cr = (CodecReader) r.leaves().get(0).reader();
CodecReader wrap = SortingCodecReader.wrap(cr, indexSort);
Terms fieldTerms = wrap.getPostingsReader().terms("field");
termsEnum = fieldTerms.iterator();
termsEnum.next();
reuse = termsEnum.postings(null, PostingsEnum.POSITIONS);
w.close();
}
@Benchmark
public void benchmark_addPositions() throws Exception {
termsEnum.postings(reuse, PostingsEnum.POSITIONS);
}
@Benchmark
public void benchmark_nextPositions() throws Exception {
termsEnum.postings(reuse, PostingsEnum.POSITIONS);
reuse.nextDoc();
int freq = reuse.freq();
for (int i = 0; i < freq; i++) {
reuse.nextPosition();
}
}
}
It looks like
Yeah, it looks like we can optimize
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
Thanks for the suggestion from @jpountz. As discussed in #12826, this PR uses group-varint to encode some vint values if storeOffsets is true. It still uses the classes GroupVIntReader and GroupVIntWriter; I will update it after #12841 is finished. Currently I don't use group-varint if (storeOffsets == false && storePayload == false), which means only the token position is stored, because I'm worried that it will use extra memory when bulk decoding. Feel free to correct me. I'll add benchmark results and file size changes next week.