
SstFileWriter performance #2668

Closed
oboxer opened this issue Jul 29, 2017 · 17 comments

@oboxer

oboxer commented Jul 29, 2017

I was testing SstFileWriter with 5.5.5. Below is a summary.

I have a sorted 1 GB text file; keys are 15 bytes and values are also 15 bytes.

This takes about 70 s of write time:
val sstFileWriter = new SstFileWriter(new EnvOptions, new Options())

This takes about 140 s of write time:
val sstFileWriter = new SstFileWriter(new EnvOptions, new Options(), new BytewiseComparator(new ComparatorOptions))

Rest of the code:

sstFileWriter.open(filePath)
var line = buffReader.readLine()
while (line != null) {
  val kv = line.splitAt(15)
  sstFileWriter.add(new Slice(kv._1), new Slice(kv._2))
  line = buffReader.readLine()
}
sstFileWriter.finish()

I understand there are parameter options we can set to optimize for write performance, but even with default parameters, I was expecting the write performance to be better than what I got. Am I doing something wrong?

@mikhail-antonov
Contributor

Yeah, I'd expect better write throughput too, especially given that without params we should be using cheap (CPU-wise) block compression.
Just to check - did you pre-load the entire test file into memory before starting the SstFileWriter, to make sure the speed of actually fetching the data into buffReader doesn't affect anything?

I'll try to reproduce that.
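
A minimal sketch of that check, assuming the 1 GB file fits in the heap - read everything first, then time only the writes:

// sketch: pre-load the input so disk reads don't show up in the SstFileWriter timing
import scala.io.Source
val lines = Source.fromFile(filePath).getLines().toArray // assumes the file fits in memory

val start = System.currentTimeMillis()
for (line <- lines) {
  val kv = line.splitAt(15)
  sstFileWriter.add(new Slice(kv._1), new Slice(kv._2))
}
sstFileWriter.finish()
println(s"write time: ${(System.currentTimeMillis() - start) / 1000.0} s")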

@adamretter
Collaborator

@h-1 Just to point out: implementing comparators in Java has quite some performance overhead, and we are aware of that. There are several improvements planned in that area. For the moment, if you need a custom comparator, you could implement just the comparator in C++ and set it from Java. I would also point out that BytewiseComparator should not be used here, as it is a Java reimplementation of the default C++ comparator.
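
For reference, a sketch of the two constructions from the original question - the default path avoids any Java comparator callbacks:

// uses the built-in C++ bytewise comparator - no JNI callback per key comparison
val fastWriter = new SstFileWriter(new EnvOptions, new Options)

// forces a Java-side comparator, invoked over JNI for every key comparison - avoid
val slowWriter = new SstFileWriter(new EnvOptions, new Options, new BytewiseComparator(new ComparatorOptions))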

@oboxer
Author

oboxer commented Jul 29, 2017

@mikhail-antonov Thanks for the quick reply. No, I haven't tried loading the test file into memory. This is just a test; the real data is much larger. And buffReader is a regular Java BufferedReader, so I don't think that's the bottleneck. I will try the same test in C++ when I get some time and will report back.

Thanks for the help!

@oboxer
Author

oboxer commented Jul 29, 2017

@adamretter Thanks for the tip. I read about the Java BytewiseComparator performance issue in some of the questions, here and here, but just thought I'd test it out.

@oboxer
Author

oboxer commented Aug 18, 2017

Hi @mikhail-antonov, I'm sure you're very busy, but have you had a chance to look into this? Thank you for the help.

@zawlazaw
Contributor

Using RocksDB 5.7.2, I ran a straightforward benchmark on generated data rather than reading from a file. The benchmark has three parts:

  • depth 1: generate key/value strings ⇒ ca. 6M records/sec (thus, reading the "input" is no issue here...)
  • depth 2: ...additionally create Slice instances ⇒ ca. 450k records/sec
  • depth 3: ...additionally write the slices to the sst file ⇒ ca. 400k records/sec

Since 1 GB of data items of 30 bytes each is ca. 34M records, I ended up with ca. 83 seconds for the depth-3 test on my laptop (2/4 physical/virtual cores, 10 GB RAM, SSD).

The most important insights:

  1. There is a significant overhead in creating Slice instances - probably due to JNI? Is there perhaps an alternative to creating a Slice instance per record?
  2. The sst file can be written (incl. creating slices) at a fairly constant rate of ca. 400k records per second. This is probably not that bad... but since it is quite close to the bound set by creating Slice instances, there is probably some room for improvement. How fast is it without JNI, directly in C++?

Here is the detailed test output:

####################################################################################################
depth 1: started...
depth 1: 1000000 of 34000000 (2.9%, records/sec overall: 2577320, current: 2577320
depth 1: 2000000 of 34000000 (5.9%, records/sec overall: 2439024, current: 2358491
depth 1: 3000000 of 34000000 (8.8%, records/sec overall: 2767528, current: 3787879
depth 1: 4000000 of 34000000 (11.8%, records/sec overall: 2919708, current: 3508772
depth 1: 5000000 of 34000000 (14.7%, records/sec overall: 3309067, current: 7092199
depth 1: 6000000 of 34000000 (17.6%, records/sec overall: 3625378, current: 6944444
depth 1: 7000000 of 34000000 (20.6%, records/sec overall: 3540718, current: 3105590
depth 1: 8000000 of 34000000 (23.5%, records/sec overall: 3777148, current: 7142857
depth 1: 9000000 of 34000000 (26.5%, records/sec overall: 3982301, current: 7042254
depth 1: 10000000 of 34000000 (29.4%, records/sec overall: 4163197, current: 7092199
depth 1: 11000000 of 34000000 (32.4%, records/sec overall: 4293521, current: 6250000
depth 1: 12000000 of 34000000 (35.3%, records/sec overall: 4078858, current: 2631579
depth 1: 13000000 of 34000000 (38.2%, records/sec overall: 4186795, current: 6172840
depth 1: 14000000 of 34000000 (41.2%, records/sec overall: 4315660, current: 7246377
depth 1: 15000000 of 34000000 (44.1%, records/sec overall: 4440497, current: 7462687
depth 1: 16000000 of 34000000 (47.1%, records/sec overall: 4551920, current: 7352941
depth 1: 17000000 of 34000000 (50.0%, records/sec overall: 4657534, current: 7462687
depth 1: 18000000 of 34000000 (52.9%, records/sec overall: 4753103, current: 7352941
depth 1: 19000000 of 34000000 (55.9%, records/sec overall: 4845703, current: 7518797
depth 1: 20000000 of 34000000 (58.8%, records/sec overall: 4926108, current: 7246377
depth 1: 21000000 of 34000000 (61.8%, records/sec overall: 5005959, current: 7407407
depth 1: 22000000 of 34000000 (64.7%, records/sec overall: 5077314, current: 7246377
depth 1: 23000000 of 34000000 (67.6%, records/sec overall: 5146565, current: 7352941
depth 1: 24000000 of 34000000 (70.6%, records/sec overall: 5209464, current: 7299270
depth 1: 25000000 of 34000000 (73.5%, records/sec overall: 5270926, current: 7407407
depth 1: 26000000 of 34000000 (76.5%, records/sec overall: 5327869, current: 7352941
depth 1: 27000000 of 34000000 (79.4%, records/sec overall: 5383848, current: 7407407
depth 1: 28000000 of 34000000 (82.4%, records/sec overall: 5432674, current: 7194245
depth 1: 29000000 of 34000000 (85.3%, records/sec overall: 5483078, current: 7407407
depth 1: 30000000 of 34000000 (88.2%, records/sec overall: 5498534, current: 5988024
depth 1: 31000000 of 34000000 (91.2%, records/sec overall: 5492558, current: 5347594
depth 1: 32000000 of 34000000 (94.1%, records/sec overall: 5507745, current: 6060606
depth 1: 33000000 of 34000000 (97.1%, records/sec overall: 5552751, current: 7518797
depth 1: 34000000 of 34000000 (100.0%, records/sec overall: 5591186, current: 7246377
depth 1: duration 6.08 secs, that is 5591186 records/sec overall
####################################################################################################
depth 2: started...
depth 2: 1000000 of 34000000 (2.9%, records/sec overall: 831255, current: 831255
depth 2: 2000000 of 34000000 (5.9%, records/sec overall: 420080, current: 281057
depth 2: 3000000 of 34000000 (8.8%, records/sec overall: 457736, current: 558036
depth 2: 4000000 of 34000000 (11.8%, records/sec overall: 404163, current: 299133
depth 2: 5000000 of 34000000 (14.7%, records/sec overall: 420911, current: 504541
depth 2: 6000000 of 34000000 (17.6%, records/sec overall: 420521, current: 418585
depth 2: 7000000 of 34000000 (20.6%, records/sec overall: 425558, current: 458505
depth 2: 8000000 of 34000000 (23.5%, records/sec overall: 427305, current: 439947
depth 2: 9000000 of 34000000 (26.5%, records/sec overall: 435077, current: 509424
depth 2: 10000000 of 34000000 (29.4%, records/sec overall: 440470, current: 496032
depth 2: 11000000 of 34000000 (32.4%, records/sec overall: 443584, current: 477327
depth 2: 12000000 of 34000000 (35.3%, records/sec overall: 447411, current: 494315
depth 2: 13000000 of 34000000 (38.2%, records/sec overall: 444050, current: 407332
depth 2: 14000000 of 34000000 (41.2%, records/sec overall: 447699, current: 501253
depth 2: 15000000 of 34000000 (44.1%, records/sec overall: 451726, current: 516796
depth 2: 16000000 of 34000000 (47.1%, records/sec overall: 452016, current: 456413
depth 2: 17000000 of 34000000 (50.0%, records/sec overall: 450713, current: 430849
depth 2: 18000000 of 34000000 (52.9%, records/sec overall: 450405, current: 445236
depth 2: 19000000 of 34000000 (55.9%, records/sec overall: 447881, current: 406835
depth 2: 20000000 of 34000000 (58.8%, records/sec overall: 449196, current: 475964
depth 2: 21000000 of 34000000 (61.8%, records/sec overall: 448775, current: 440529
depth 2: 22000000 of 34000000 (64.7%, records/sec overall: 448989, current: 453721
depth 2: 23000000 of 34000000 (67.6%, records/sec overall: 450442, current: 484966
depth 2: 24000000 of 34000000 (70.6%, records/sec overall: 451085, current: 466636
depth 2: 25000000 of 34000000 (73.5%, records/sec overall: 449075, current: 405844
depth 2: 26000000 of 34000000 (76.5%, records/sec overall: 449516, current: 461042
depth 2: 27000000 of 34000000 (79.4%, records/sec overall: 450759, current: 485909
depth 2: 28000000 of 34000000 (82.4%, records/sec overall: 451802, current: 481928
depth 2: 29000000 of 34000000 (85.3%, records/sec overall: 452425, current: 470588
depth 2: 30000000 of 34000000 (88.2%, records/sec overall: 453076, current: 472813
depth 2: 31000000 of 34000000 (91.2%, records/sec overall: 453734, current: 474383
depth 2: 32000000 of 34000000 (94.1%, records/sec overall: 454062, current: 464468
depth 2: 33000000 of 34000000 (97.1%, records/sec overall: 454320, current: 462749
depth 2: 34000000 of 34000000 (100.0%, records/sec overall: 454612, current: 464468
depth 2: duration 74.79 secs, that is 454612 records/sec overall
####################################################################################################
depth 3: started...
depth 3: 1000000 of 34000000 (2.9%, records/sec overall: 286944, current: 286944
depth 3: 2000000 of 34000000 (5.9%, records/sec overall: 314812, current: 348675
depth 3: 3000000 of 34000000 (8.8%, records/sec overall: 340252, current: 405844
depth 3: 4000000 of 34000000 (11.8%, records/sec overall: 354610, current: 406009
depth 3: 5000000 of 34000000 (14.7%, records/sec overall: 368026, current: 434405
depth 3: 6000000 of 34000000 (17.6%, records/sec overall: 377430, current: 432713
depth 3: 7000000 of 34000000 (20.6%, records/sec overall: 383226, current: 422119
depth 3: 8000000 of 34000000 (23.5%, records/sec overall: 388708, current: 431965
depth 3: 9000000 of 34000000 (26.5%, records/sec overall: 392927, current: 430293
depth 3: 10000000 of 34000000 (29.4%, records/sec overall: 396558, current: 432526
depth 3: 11000000 of 34000000 (32.4%, records/sec overall: 398984, current: 425170
depth 3: 12000000 of 34000000 (35.3%, records/sec overall: 398671, current: 395413
depth 3: 13000000 of 34000000 (38.2%, records/sec overall: 393439, current: 339905
depth 3: 14000000 of 34000000 (41.2%, records/sec overall: 394555, current: 410341
depth 3: 15000000 of 34000000 (44.1%, records/sec overall: 400342, current: 503778
depth 3: 16000000 of 34000000 (47.1%, records/sec overall: 400952, current: 410341
depth 3: 17000000 of 34000000 (50.0%, records/sec overall: 402081, current: 421230
depth 3: 18000000 of 34000000 (52.9%, records/sec overall: 403280, current: 424989
depth 3: 19000000 of 34000000 (55.9%, records/sec overall: 403466, current: 406835
depth 3: 20000000 of 34000000 (58.8%, records/sec overall: 403112, current: 396511
depth 3: 21000000 of 34000000 (61.8%, records/sec overall: 404398, current: 431965
depth 3: 22000000 of 34000000 (64.7%, records/sec overall: 405381, current: 427168
depth 3: 23000000 of 34000000 (67.6%, records/sec overall: 405422, current: 406339
depth 3: 24000000 of 34000000 (70.6%, records/sec overall: 405933, current: 418060
depth 3: 25000000 of 34000000 (73.5%, records/sec overall: 406795, current: 428633
depth 3: 26000000 of 34000000 (76.5%, records/sec overall: 407619, current: 429369
depth 3: 27000000 of 34000000 (79.4%, records/sec overall: 407430, current: 402576
depth 3: 28000000 of 34000000 (82.4%, records/sec overall: 406728, current: 388651
depth 3: 29000000 of 34000000 (85.3%, records/sec overall: 406989, current: 414422
depth 3: 30000000 of 34000000 (88.2%, records/sec overall: 406460, current: 391696
depth 3: 31000000 of 34000000 (91.2%, records/sec overall: 406030, current: 393546
depth 3: 32000000 of 34000000 (94.1%, records/sec overall: 406525, current: 422476
depth 3: 33000000 of 34000000 (97.1%, records/sec overall: 406740, current: 413736
depth 3: 34000000 of 34000000 (100.0%, records/sec overall: 406957, current: 414250
depth 3: duration 83.68 secs, that is 406295 records/sec overall
size of output file '/tmp/rocksdbsst': 213.60 MB

And here is the full (Scala) code for the benchmark:

import java.nio.file.{Files, Paths}
import java.util.Locale

import org.rocksdb._

object RocksSST extends App {

  Locale.setDefault(Locale.US)

  val sstFileWriter = new SstFileWriter(new EnvOptions, new Options)

  val filename = "/tmp/rocksdbsst"

  sstFileWriter.open(filename)

  val n = 34000000 // ca. number of records in 1 GB data of 30 bytes each

  for (depth <- 1 to 3) {
    val starttime = System.currentTimeMillis()
    var lastlogtime = System.currentTimeMillis()
    var logeach = 1000000

    println("#"*100 + s"\ndepth $depth: started...")

    for (id <- 1 to n) {
      val key = {val number = id.toString; "0" * (15 - number.length) + number} // 6x faster than "%015d".format(id)
      val value = key

      if (depth > 1) {
        val keySlice = new Slice(key.getBytes)
        val valueSlice = new Slice(value.getBytes)

        if (depth > 2) {
          sstFileWriter.add(keySlice, valueSlice)
        }
      }

      if (id % logeach == 0) {
        println(f"depth $depth%d: $id%d of $n%d (${id * 100.0 / n}%.1f%%, records/sec overall: ${id * 1000.0 / (System.currentTimeMillis - starttime)}%.0f, current: ${logeach * 1000.0 / (System.currentTimeMillis - lastlogtime)}%.0f")
        lastlogtime = System.currentTimeMillis
      }
    }

    if (depth > 2)
      sstFileWriter.finish()

    val duration = System.currentTimeMillis() - starttime
    println(f"depth $depth%d: duration ${duration / 1000.0}%.2f secs, that is ${n * 1000.0 / duration}%.0f records/sec overall")

    if (depth > 2)
      println(f"size of output file '$filename': ${Files.size(Paths.get(filename)) / 1048576.0}%.2f MB")
  }
}

@siying
Contributor

siying commented Aug 24, 2017

CC @scv119

@IslamAbdelRahman
Contributor

Thanks @zawlazaw for sharing such insights. @adamretter, are we expecting such a huge overhead from creating the slice? Is it because of memcpy?

@zawlazaw
Contributor

zawlazaw commented Aug 25, 2017

(EDIT: added direct-buffer variants)

I did another benchmark that focuses solely on creating slices - Slice and DirectSlice. It first creates a single large byte-array that concatenates 70 million 15-byte records. Then it proceeds in one of seven variants:

  • A) do nothing ⇒ ca. 136M records/sec
  • B) copy each record into an individual byte-array ⇒ ca. 28M records/sec (array-copy within Java)
  • C) copy into a byte-array and create a single Slice from it ⇒ ca. 1M records/sec (passing array data across JNI via Slice)
  • D) create a new ByteBuffer and copy the record into it ⇒ ca. 1.7M records/sec (direct-buffer creation within Java)
  • E) create a new ByteBuffer, copy the record into it, and create a new DirectSlice from it ⇒ ca. 950k records/sec
  • F) re-use an existing ByteBuffer, copy the record into it, and create a new DirectSlice from it ⇒ ca. 1.5M records/sec
  • G) re-use an existing ByteBuffer, copy the record into it, and re-use an existing DirectSlice ⇒ ca. 47M records/sec!

Comments on the non-direct approaches (A-C):
We create ca. 1M Slice objects per second (which fits well with the ca. 450k key-value pairs per second in the previous benchmark). In other words, it takes ca. 1 µs = 1000 ns to create a single Slice.

According to Stack Overflow, invoking a parameterless function via JNI takes ca. 6 ns on a reasonable machine; other sources report 10-40 ns. Thus, most of the time it takes to create a Slice comes from passing the data. Later in that topic, somebody passed "some arguments" via JNI within 400 ns.

Comments on the direct approaches (D-G):
As is well known, allocating a direct buffer is quite time-consuming. For that reason, D (copy into a new direct ByteBuffer) is much slower than B (copy into a new byte-array). Similarly, E (new DirectSlice on a new direct ByteBuffer) is slightly slower than C (new Slice on a new byte-array).
However, the intended use of direct ByteBuffers is to re-use them (usually via flip() rather than with explicit position & limit). This gives significantly better performance in F (new DirectSlice on a re-used ByteBuffer), and the tremendous performance of G (re-used DirectSlice on a re-used ByteBuffer).
Thus, G reduces the overhead to a minimum, since any Java-side write to the ByteBuffer immediately affects the data behind the always-the-same pointer on the C++ side. The only remaining problems are to communicate the position/limit of the current data, and to avoid modifying the same byte-buffer from multiple threads in parallel.
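
As a sketch, the standard java.nio reuse cycle looks like this:

// re-use one direct buffer: clear to (re)fill, flip to hand the data off for reading
val recordBytes = "example-record".getBytes    // stand-in for the current record
val buf = java.nio.ByteBuffer.allocateDirect(4096)
buf.clear()          // position = 0, limit = capacity: ready for writing
buf.put(recordBytes) // copy the current record into the buffer
buf.flip()           // position = 0, limit = end of data: ready for reading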

I already ran a variant of the very first benchmark above (coming soon...) that shows how using direct buffers this way can significantly improve the performance of sst writes. However, there are also two issues with the current implementation of DirectSlice that I will explain then.

Main Result:
The overhead of passing data is minimized by re-using a ByteBuffer plus a wrapping DirectSlice instance, so that both Java and C++ can immediately read/write the same memory location.

Here are the detailed results of the slice-only benchmark:

created data-array of length 1050000000
####################################################################################################
variant A: started...
variant A: 10000000 of 70000000 (14.3%, records/sec overall: 128205128, current: 128205128
variant A: 20000000 of 70000000 (28.6%, records/sec overall: 125786164, current: 135135135
variant A: 30000000 of 70000000 (42.9%, records/sec overall: 130434783, current: 142857143
variant A: 40000000 of 70000000 (57.1%, records/sec overall: 132890365, current: 140845070
variant A: 50000000 of 70000000 (71.4%, records/sec overall: 134408602, current: 140845070
variant A: 60000000 of 70000000 (85.7%, records/sec overall: 135440181, current: 142857143
variant A: 70000000 of 70000000 (100.0%, records/sec overall: 136452242, current: 142857143
variant A: duration 0.51 secs, that is 136452242 records/sec overall
####################################################################################################
variant B: started...
variant B: 10000000 of 70000000 (14.3%, records/sec overall: 24691358, current: 24691358
variant B: 20000000 of 70000000 (28.6%, records/sec overall: 26212320, current: 28011204
variant B: 30000000 of 70000000 (42.9%, records/sec overall: 25884383, current: 25252525
variant B: 40000000 of 70000000 (57.1%, records/sec overall: 27247956, current: 32362460
variant B: 50000000 of 70000000 (71.4%, records/sec overall: 27307482, current: 27548209
variant B: 60000000 of 70000000 (85.7%, records/sec overall: 27560864, current: 28985507
variant B: 70000000 of 70000000 (100.0%, records/sec overall: 28056112, current: 31446541
variant B: duration 2.50 secs, that is 28056112 records/sec overall
####################################################################################################
variant C: started...
variant C: 10000000 of 70000000 (14.3%, records/sec overall: 993838, current: 993838
variant C: 20000000 of 70000000 (28.6%, records/sec overall: 978953, current: 964599
variant C: 30000000 of 70000000 (42.9%, records/sec overall: 1008302, current: 1072731
variant C: 40000000 of 70000000 (57.1%, records/sec overall: 1050117, current: 1199328
variant C: 50000000 of 70000000 (71.4%, records/sec overall: 1049869, current: 1048878
variant C: 60000000 of 70000000 (85.7%, records/sec overall: 1063283, current: 1135847
variant C: 70000000 of 70000000 (100.0%, records/sec overall: 1067155, current: 1090988
variant C: duration 65.60 secs, that is 1067155 records/sec overall
####################################################################################################
variant D: started...
variant D: 10000000 of 70000000 (14.3%, records/sec overall: 1499925, current: 1499925
variant D: 20000000 of 70000000 (28.6%, records/sec overall: 1598338, current: 1710571
variant D: 30000000 of 70000000 (42.9%, records/sec overall: 1674761, current: 1852195
variant D: 40000000 of 70000000 (57.1%, records/sec overall: 1694915, current: 1758706
variant D: 50000000 of 70000000 (71.4%, records/sec overall: 1705611, current: 1749781
variant D: 60000000 of 70000000 (85.7%, records/sec overall: 1717623, current: 1780310
variant D: 70000000 of 70000000 (100.0%, records/sec overall: 1723289, current: 1758396
variant D: duration 40.62 secs, that is 1723289 records/sec overall
####################################################################################################
variant E: started...
variant E: 10000000 of 70000000 (14.3%, records/sec overall: 889601, current: 889601
variant E: 20000000 of 70000000 (28.6%, records/sec overall: 906331, current: 923702
variant E: 30000000 of 70000000 (42.9%, records/sec overall: 921886, current: 954654
variant E: 40000000 of 70000000 (57.1%, records/sec overall: 942174, current: 1008878
variant E: 50000000 of 70000000 (71.4%, records/sec overall: 945358, current: 958405
variant E: 60000000 of 70000000 (85.7%, records/sec overall: 950450, current: 976753
variant E: 70000000 of 70000000 (100.0%, records/sec overall: 948355, current: 935979
variant E: duration 73.81 secs, that is 948342 records/sec overall
####################################################################################################
variant F: started...
variant F: 10000000 of 70000000 (14.3%, records/sec overall: 1312164, current: 1312164
variant F: 20000000 of 70000000 (28.6%, records/sec overall: 1434514, current: 1582028
variant F: 30000000 of 70000000 (42.9%, records/sec overall: 1441961, current: 1457089
variant F: 40000000 of 70000000 (57.1%, records/sec overall: 1484781, current: 1630258
variant F: 50000000 of 70000000 (71.4%, records/sec overall: 1512402, current: 1633987
variant F: 60000000 of 70000000 (85.7%, records/sec overall: 1486510, current: 1369488
variant F: 70000000 of 70000000 (100.0%, records/sec overall: 1496014, current: 1555936
variant F: duration 46.79 secs, that is 1496014 records/sec overall
####################################################################################################
variant G: started...
variant G: 10000000 of 70000000 (14.3%, records/sec overall: 44642857, current: 44642857
variant G: 20000000 of 70000000 (28.6%, records/sec overall: 46728972, current: 49019608
variant G: 30000000 of 70000000 (42.9%, records/sec overall: 47244094, current: 48309179
variant G: 40000000 of 70000000 (57.1%, records/sec overall: 46189376, current: 43478261
variant G: 50000000 of 70000000 (71.4%, records/sec overall: 46511628, current: 48076923
variant G: 60000000 of 70000000 (85.7%, records/sec overall: 46620047, current: 47169811
variant G: 70000000 of 70000000 (100.0%, records/sec overall: 46885466, current: 48543689
variant G: duration 1.49 secs, that is 46885466 records/sec overall

And here is the (Scala) code:

import org.rocksdb._
import java.nio.ByteBuffer

object SliceBenchmark extends App {

  RocksDB.loadLibrary()

  val n = 70000000
  val recordLen = 15
  val data = Array.ofDim[Byte](n * recordLen)
  println("created data-array of length " + data.length)

  val reusedByteBuffer = ByteBuffer.allocateDirect(4096) // allow for up to 4kB of data (although not required, here)
  val reusedSlice = new DirectSlice(reusedByteBuffer, 15) // fix the length to 15 bytes (cannot be changed later, in the current implementation)

  for (variant <- 'A' to 'G') {

    println("#"*100 + s"\nvariant $variant: started...")

    val starttime = System.currentTimeMillis()
    var lastlogtime = starttime
    var logeach = 10000000

    for (id <- 1 to n) {
      variant match {
        case 'A' => // do simply nothing...

        case 'B' => // create a java-internal copy of the relevant fragment of the data
          System.arraycopy(data, (id-1) * recordLen, Array.ofDim[Byte](recordLen), 0, recordLen)

        case 'C' => // create a Slice from a java-internal copy
          val record = Array.ofDim[Byte](recordLen)
          System.arraycopy(data, (id-1) * recordLen, record, 0, recordLen)
          new Slice(record)

        case 'D' => // create a new ByteBuffer and copy data into it
          val newByteBuffer = ByteBuffer.allocateDirect(recordLen)
          newByteBuffer.put(data, (id-1) * recordLen, recordLen)

        case 'E' => // create a new ByteBuffer and a new DirectSlice of the desired size
          val newByteBuffer = ByteBuffer.allocateDirect(recordLen)
          newByteBuffer.put(data, (id-1) * recordLen, recordLen)
          new DirectSlice(newByteBuffer, recordLen)

        case 'F' => // re-use the existing ByteBuffer, but create a new DirectSlice of the desired size
          reusedByteBuffer.clear()
          reusedByteBuffer.put(data, (id-1) * recordLen, recordLen)
          new DirectSlice(reusedByteBuffer, recordLen)

        case 'G' => // re-use the existing ByteBuffer and the existing DirectSlice
          reusedByteBuffer.clear()
          reusedByteBuffer.put(data, (id-1) * recordLen, recordLen)
          reusedSlice // do something with it... (its content is affected by writing to reusedByteBuffer, however, in the current implementation its size cannot be changed !)
      }

      if (id % logeach == 0) {
        println(f"variant $variant: $id%d of $n%d (${id * 100.0 / n}%.1f%%, records/sec overall: ${id * 1000.0 / (System.currentTimeMillis - starttime)}%.0f, current: ${logeach * 1000.0 / (System.currentTimeMillis - lastlogtime)}%.0f")
        lastlogtime = System.currentTimeMillis
      }
    }

    val duration = System.currentTimeMillis() - starttime
    println(f"variant $variant%c: duration ${duration / 1000.0}%.2f secs, that is ${n * 1000.0 / duration}%.0f records/sec overall")
  }
}

@zawlazaw
Contributor

Note that I edited my post above on the Slice-creation benchmark to include DirectSlice.

As a follow-up, here is the final benchmark: writing sst files much faster than before, which solves the specific original issue. The trick is to use DirectSlice in the right way (which currently only works if all keys have the same length, and similarly for all values).

In order not to measure input-data creation, we first create an array of 35M key-value pairs (keys and values of 15 bytes each, giving ca. 1 GB in total). Then there are two variants:

A) non-direct: write the sst file by creating a new Slice for each key and each value ⇒ ca. 450k records/sec (same as in the very first benchmark)

B) direct: write the sst file by re-using a single ByteBuffer as well as a single DirectSlice for all keys (and similarly for values) ⇒ ca. 2.3M records/sec!

diff rocks.sst rocks-direct.sst shows that the two output sst files are exactly equal.

Thus, with the new approach the total time goes down from 79 seconds to 15 seconds. Speed is now probably no longer determined by JNI, and may be on par with C++?

Here is the output of the benchmark:

filling data-array of length 1050000000...done
####################################################################################################
variant A: started...
variant A: 5000000 of 35000000 (14.3%, records/sec overall: 466592, current: 466592
variant A: 10000000 of 35000000 (28.6%, records/sec overall: 566219, current: 720669
variant A: 15000000 of 35000000 (42.9%, records/sec overall: 484997, current: 376875
variant A: 20000000 of 35000000 (57.1%, records/sec overall: 446299, current: 360127
variant A: 25000000 of 35000000 (71.4%, records/sec overall: 439197, current: 412950
variant A: 30000000 of 35000000 (85.7%, records/sec overall: 464324, current: 650364
variant A: duration 78.53 secs, that is 445712 records/sec overall
####################################################################################################
variant B: started...
variant B: 5000000 of 35000000 (14.3%, records/sec overall: 2292526, current: 2292526
variant B: 10000000 of 35000000 (28.6%, records/sec overall: 2270663, current: 2249213
variant B: 15000000 of 35000000 (42.9%, records/sec overall: 2275831, current: 2286237
variant B: 20000000 of 35000000 (57.1%, records/sec overall: 2291476, current: 2340824
variant B: 25000000 of 35000000 (71.4%, records/sec overall: 2303723, current: 2354049
variant B: 30000000 of 35000000 (85.7%, records/sec overall: 2309291, current: 2337541
variant B: duration 15.24 secs, that is 2297191 records/sec overall

And here is the (Scala) code:

import org.rocksdb._
import java.nio.ByteBuffer

object DirectRocksSST extends App {

  val sstFileWriter = new SstFileWriter(new EnvOptions, new Options)
  val filename = "/tmp/rocks.sst"
  sstFileWriter.open(filename)

  val sstFileWriterDirect = new SstFileWriter(new EnvOptions, new Options)
  val filenameDirect = "/tmp/rocks-direct.sst"
  sstFileWriterDirect.open(filenameDirect)

  val n = 35000000 // ca. number of records (15 bytes per key and per value) in 1 GB data
  val data = Array.ofDim[Byte](n * 30)
  print("filling data-array of length " + data.length + "...")
  for (id <- 0 until n) {
    val kv = {val number = id.toString; "0" * (15 - number.length) + number}.getBytes // 6x faster than "%015d".format(id)
    System.arraycopy(kv, 0, data, 30*id, 15)
    System.arraycopy(kv, 0, data, 30*id+15, 15)
  }
  println("done")

  // re-use these ByteBuffers for DirectSlice
  val keyByteBuffer = ByteBuffer.allocateDirect(4096)
  val valueByteBuffer = ByteBuffer.allocateDirect(4096)

  // re-use these DirectSlices
  val directKeySlice = new DirectSlice(keyByteBuffer,15)
  val directValueSlice = new DirectSlice(valueByteBuffer,15)

  // re-use these byte-arrays for non-direct Slice
  val keyArray = Array.ofDim[Byte](15)
  val valueArray = Array.ofDim[Byte](15)

  for (variant <- 'A' to 'B') {
    val starttime = System.currentTimeMillis()
    var lastlogtime = System.currentTimeMillis()
    var logeach = 5000000

    println("#"*100 + s"\nvariant $variant: started...")

    for (id <- 0 until n) {
      variant match {
        case 'A' => // non-direct
          System.arraycopy(data, 30*id, keyArray, 0, 15)
          val keySlice = new Slice(keyArray)

          System.arraycopy(data, 30*id+15, valueArray, 0, 15)
          val valueSlice = new Slice(valueArray)

          sstFileWriter.put(keySlice, valueSlice)

        case 'B' => // direct
          keyByteBuffer.clear() // not really necessary, only for safety if C++ would modify position/limit
          keyByteBuffer.put(data, 30*id, 15)
          keyByteBuffer.flip()

          valueByteBuffer.clear() // not really necessary, only for safety if C++ would modify position/limit
          valueByteBuffer.put(data, 30*id+15, 15)
          valueByteBuffer.flip()

          sstFileWriterDirect.put(directKeySlice, directValueSlice)
      }

      if (id > 0 && id % logeach == 0) {
        println(f"variant $variant%c: $id%d of $n%d (${id * 100.0 / n}%.1f%%, records/sec overall: ${id * 1000.0 / (System.currentTimeMillis - starttime)}%.0f, current: ${logeach * 1000.0 / (System.currentTimeMillis - lastlogtime)}%.0f")
        lastlogtime = System.currentTimeMillis
      }
    }

    variant match {
      case 'A' => sstFileWriter.finish()
      case 'B' => sstFileWriterDirect.finish()
    }

    val duration = System.currentTimeMillis() - starttime
    println(f"variant $variant%c: duration ${duration / 1000.0}%.2f secs, that is ${n * 1000.0 / duration}%.0f records/sec overall")
  }
}

Although this solves the precise original issue, it also raises a question and suggests some future improvements to the Java classes Slice and DirectSlice.

First, a question, since I cannot fully trace the internal dataflow after calling put() on the SstFileWriter:

Question: Are the memory regions wrapped by the ByteBuffers that pass data to SstFileWriter.put(...) no longer accessed by C++ as soon as the call to SstFileWriter.put(...) returns? Or may there be some asynchronous later read access from C++ internals that would be affected if we modify the ByteBuffers from Java immediately after the call returns?
I suppose that somewhere internally the data is copied again, and thus it is safe - but can somebody confirm that?

A positive answer to this question is absolutely mandatory in order to guarantee that a single ByteBuffer and DirectSlice may be re-used.

As for improvements to the Java classes Slice and DirectSlice, I will create a separate issue to discuss them - once somebody gives a positive answer to the above question. Here, a teaser: DirectSlice should provide a way to modify its size parameter after creation (in order to re-use the same instance for multiple data transfers of different sizes), or even better, it should stick to the standard position/limit approach of ByteBuffer.
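
Purely as an illustration of that teaser - nothing like this exists in RocksJava today - such an API could look like:

// hypothetical sketch, not part of RocksJava: a DirectSlice that can be re-sized
trait ResizableDirectSlice {
  def setLength(len: Int): Unit                // change the wrapped length before each put, or
  def syncWith(buf: java.nio.ByteBuffer): Unit // adopt the buffer's current position/limit
}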

@zawlazaw
Contributor

Sorry for writing so much, but I just realized that the master branch lately contains SstFileWriter.put(byte[], byte[]) (as an alternative to creating a Slice or DirectSlice), thanks to PR 2746. An extended benchmark shows that this approach yields ca. 1.6M records/sec - while providing the flexibility of variable-length keys/values! It would be interesting to further add SstFileWriter.put(ByteBuffer, ByteBuffer) with proper position/limit handling, following on the JNI side the great PR 2283. Probably I'll find some time for that in the short term...
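
On such a master build the write loop needs no Slice objects at all; a sketch (assuming sstFileWriter is an open SstFileWriter):

// sketch using the byte[] overload from PR 2746: one put per record,
// variable-length keys/values, no per-record Slice allocation
val keyBytes   = "000000000000001".getBytes
val valueBytes = keyBytes
sstFileWriter.put(keyBytes, valueBytes)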

@zawlazaw
Contributor

zawlazaw commented Aug 27, 2017

Hello again... I just implemented SstFileWriter.put(ByteBuffer, ByteBuffer). Not surprisingly, it yields ca. 2.3M records/sec, the same as re-using the same DirectSlice in Java. However, the strong benefit over SstFileWriter.put(DirectSlice, DirectSlice) is that every key/value can have an individual length, and the ByteBuffers can be filled/flipped in the standard java.nio way. Let me summarize the pros and cons of all the above:

  • SstFileWriter.put(Slice, Slice) with a new Slice per put: ca. 450k records/sec; keys/values may have an individual length per put.
  • SstFileWriter.put(DirectSlice, DirectSlice) with the trick of re-using the underlying ByteBuffer and the DirectSlice: ca. 2.3M records/sec, but one cannot change the length of keys/values, and the ByteBuffers' position/limit etc. are ignored (a very unfamiliar modus operandi that differs from the standards of java.nio).
  • SstFileWriter.put(byte[], byte[]) as already merged into master (but not in version 5.7.2): ca. 1.6M records/sec; keys/values can have individual lengths per put (implicitly determined by the full array lengths - no explicit offset and length per array).
  • SstFileWriter.put(ByteBuffer, ByteBuffer) as just tested locally (I am preparing a PR): ca. 2.3M records/sec; keys/values can have individual lengths per put, and the byte-buffers behave the way somebody working with java.nio expects.

=> Suggestion to @h-1: since your keys/values are of fixed size, you can use the above workaround of re-using the DirectSlice until the other approaches are released (someday).
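
Condensed from the benchmark above, that workaround is (a sketch; 15-byte keys/values and an open sstFileWriter assumed, keyBytes/valueBytes standing in for the current record):

// allocate once, re-use for every put
val keyBuf   = ByteBuffer.allocateDirect(15)
val valueBuf = ByteBuffer.allocateDirect(15)
val keySlice   = new DirectSlice(keyBuf, 15)
val valueSlice = new DirectSlice(valueBuf, 15)

// per record: refill the buffers, then put the very same DirectSlice instances
keyBuf.clear();   keyBuf.put(keyBytes);     keyBuf.flip()
valueBuf.clear(); valueBuf.put(valueBytes); valueBuf.flip()
sstFileWriter.put(keySlice, valueSlice)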

Side remark: since the change from byte[] to ByteBuffer gives a significant speedup, we can expect similar gains on top of PR 2746 for all other standard operations (get/put/seek/...).

@zawlazaw
Contributor

zawlazaw commented Aug 27, 2017

Final comment & repeated question: the above way of working with a re-used ByteBuffer assumes that every C++-internal read access has finished by the time the JNI method invocation returns to Java (because then the buffer data may be changed immediately from Java code). Is it safe to assume this? Or is there some chance that a background thread accesses the same memory location of that C++ Slice again at a later point (and could then read corrupt/modified data)?

@scv119
Contributor

scv119 commented Aug 29, 2017

@zawlazaw Thanks for working on this. It's interesting to me that SstFileWriter.put(ByteBuffer, ByteBuffer) provides better throughput (1.3×) compared to SstFileWriter.put(byte[], byte[]). Do you have any insight into where this speedup comes from? I'd expect the throughputs to be very close.

Personally, if the ByteBuffer provides better throughput, I think we should switch over to it.

To answer your question: it is safe to change the memory once the SstFileWriter.put/get/merge call has finished.

@oboxer
Author

oboxer commented Aug 30, 2017

Thanks, @zawlazaw, for the extensive performance testing. I just gave a test case with fixed lengths to isolate issues; the actual data will have varying lengths. Looking at the Java performance, I think I will go with C++ for now. I still have not tested performance in C++, though; if you have any numbers on C++, please do share. Thanks again for working on this issue.

@zawlazaw
Contributor

@scv119 Here are some differences between byte[] and ByteBuffer that I am aware of (though I am no expert on this).

Access on the C side: if you pass a byte[] via JNI, then on the C side you access the memory via a call to GetByteArrayElements (then do something with it) and finally call ReleaseByteArrayElements. These two calls do a bit more than trivial work: in the worst case, GetByteArrayElements will copy the data across memory. Usually this is avoided and a pointer to the original on-heap data is returned, but then the Java object needs to be pinned (marked so that the GC does not collect or move it until the companion call to ReleaseByteArrayElements). In contrast, a direct ByteBuffer can be accessed from C immediately via GetDirectBufferAddress, which does nothing but return a pointer to memory. This is a plus for ByteBuffer.

Memory allocation type: I am not sure about the hardware details, but byte[] arrays are on-heap objects that stay inside the Java process memory, which seems to be treated differently by the OS than the type of memory where direct buffers live (their memory region is optimized for multi-process read/write and device I/O, e.g., able to memory-map files). Some more hints on that can be found, e.g., here.
So far, this is again a plus for ByteBuffer. However, the downside is that it takes significantly longer to create a new direct ByteBuffer (via off-heap system calls) than to create a new byte[]. The second benchmark shows (in B) that create&copy into a new byte[] yields 28M/sec, while (in D) creating a new ByteBuffer yields only 1.7M. Interestingly, this even slows down the latest benchmark above (where we write to an sst file at a rate of 1.6M records/sec with byte[] and 2.3M with a re-used ByteBuffer): I just tried another alternative, and we only achieve 550k records/sec if we create a new direct ByteBuffer for every single record.

So, a crucial aspect of using ByteBuffer for the most efficient I/O is to allocate it once (or at least rarely) and re-use it. The typical strategy in practice is to create a single ByteBuffer (per concurrent thread) large enough to hold any key and/or value (e.g., 1 MB or 100 MB), and then to re-use this ByteBuffer via put, flip, clear, etc. This gives the maximum performance. In this sense, byte[] is slightly more flexible and easier to use, so ByteBuffer should not be a replacement for byte[], but an additional interface.
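
A sketch of that strategy for variable-length records, assuming the put(ByteBuffer, ByteBuffer) overload described above (not yet released) and keyBytes/valueBytes standing in for the current record:

// one large direct buffer per thread, allocated once and re-used for any record length
val keyBuf   = ByteBuffer.allocateDirect(1 << 20) // sized for the largest expected key
val valueBuf = ByteBuffer.allocateDirect(1 << 20)

keyBuf.clear();   keyBuf.put(keyBytes);     keyBuf.flip()   // limit = this record's key length
valueBuf.clear(); valueBuf.put(valueBytes); valueBuf.flip() // limit = this record's value length
sstFileWriter.put(keyBuf, valueBuf)                         // hypothetical overload from the pending PR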

@h-1 Unfortunately, I have no C++ benchmark for this, but I am very interested in one, too - as well as in benchmarking the read performance of ByteBuffer vs. byte[]. I will continue working on this over the next few weeks. However, would the (currently non-existent) SstFileWriter.put(ByteBuffer, ByteBuffer) solution help you? I could create a PR for that within a few days.

@gfosco
Contributor

gfosco commented Jan 16, 2018

Closing this via automation due to lack of activity. If you'd like this to be re-opened, please just comment here and we'll open it back up.
