POC: add LMDB as a storage engine (instead of RocksDB) #5220
Obviously this is kind of low-priority at the moment, but I just want to point out that this is actually fully functional, with two small caveats: merges and stats computations (which have been pushed down to RocksDB-specific C++) have not yet been (re)implemented, and the custom comparator necessitates an encoding change to get the right sorting of on-disk keys. Our custom RocksDB comparator exists because we wanted to save on the encoding and use [...] I got very close to running all of the benchmarks when I last played with this a while ago, but the ordering issue above actually makes them crash or loop forever. |
I think the encoding
And now your ordering is incorrect. If you don't use a custom comparator, you need to use something like
|
I just checked and |
FYI, the custom comparator we use for RocksDB is entirely implemented on the C side. |
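To make the ordering concern in this thread concrete, here is a minimal sketch of why dropping the custom comparator forces an encoding change. It assumes keys are a (user key, timestamp) pair with newer versions sorting first, and that the store compares keys bytewise. The function names and the escape scheme (`0x00` becomes `0x00 0xff`, terminator `0x00 0x01`) are illustrative assumptions, not necessarily the encoding used in this PR.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// appendTS appends ts to dst as 8 big-endian bytes.
func appendTS(dst []byte, ts uint64) []byte {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], ts)
	return append(dst, buf[:]...)
}

// naiveEncode (hypothetical) glues the timestamp directly onto the raw key.
// This only sorts correctly with a comparator that knows where the key ends.
func naiveEncode(key []byte, ts uint64) []byte {
	return appendTS(append([]byte{}, key...), ts)
}

// escapedEncode (hypothetical) escapes 0x00 as 0x00 0xff and terminates the
// key with 0x00 0x01, so a key always sorts before any longer key sharing its
// prefix. The timestamp is bit-inverted so newer versions sort first.
func escapedEncode(key []byte, ts uint64) []byte {
	out := make([]byte, 0, len(key)+10)
	for _, b := range key {
		out = append(out, b)
		if b == 0x00 {
			out = append(out, 0xff)
		}
	}
	out = append(out, 0x00, 0x01)
	return appendTS(out, ^ts)
}

// less reports whether a sorts before b under plain bytewise comparison.
func less(a, b []byte) bool { return bytes.Compare(a, b) < 0 }

func main() {
	a, ab := []byte("a"), []byte("ab")
	ts := uint64(0xff00000000000000)
	// Logically "a" < "ab", whatever the timestamps are.
	fmt.Println("naive keeps a < ab:  ", less(naiveEncode(a, ts), naiveEncode(ab, 1)))
	fmt.Println("escaped keeps a < ab:", less(escapedEncode(a, ts), escapedEncode(ab, 1)))
	fmt.Println("newer version first: ", less(escapedEncode(a, 2), escapedEncode(a, 1)))
}
```

In the naive form the high timestamp byte `0xff` is compared against `'b'`, so `"a"` lands after `"ab"`. The price of the escaped form is a few extra bytes per key, which is the encoding overhead the custom comparator was meant to avoid.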
I am very interested in seeing this port work, and have experience running lmdb in production over the last few years. What can I do to help? |
LMDB, the Lightning Memory-Mapped Database, is a much simpler datastore than RocksDB. It uses memory-mapped files and a B+Tree, which is a read-optimized data structure.

Still interesting to play with it.

* In-memory stuff should be [much faster](http://symas.com/mdb/inmem/), in particular reads. But apparently writes are pretty fast too; after all, very little overhead in this design.
* seeing how RocksDB performs against a competitor in our benchmarks should be fun.

What's not so great:

* Uses C-bindings - batches actually need to be locked to the OS thread, which means same or worse loss of cheap concurrency as with RocksDB on writes.
* writes are serialized through a Mutex - easy to deadlock (just try to create two batches in the same goroutine); would need careful design to actually go into production.
* No prefix scans or other gimmicks; a lot of stuff which we pushed down into C++ for RocksDB would need to be reimplemented (`Merge`, `MVCCComputeStats`).

This is not yet fit for usage since merges and stats computations have not been hoisted up from the RocksDB C++ code yet (we pushed it down for performance at some point).

The encoding used is quite inefficient, but can be improved once bmatsuo/lmdb-go#61 lands. Some of the horrendous allocation and performance numbers likely go back to the encoding.

What's less clear is how the single-threaded model of LMDB can most efficiently be used with Cockroach. Essentially, mutations need to be serialized, and this is done through a single mutex. Additionally, transactions need to be pinned to an OS thread. The ideal usage would see one large environment per store, with individual databases per Replica (on which writes are already serialized anyway).
```
name  old time/op  new time/op  delta
MVCCScan10Versions1Row64Bytes-8  15.3µs ± 5%  8.0µs ± 4%  -47.85%  (p=0.000 n=10+10)
MVCCScan10Versions1Row512Bytes-8  19.0µs ± 7%  10.5µs ± 8%  -44.81%  (p=0.000 n=10+10)
MVCCScan10Versions10Rows8Bytes-8  42.9µs ± 9%  31.9µs ± 1%  -25.70%  (p=0.000 n=10+9)
MVCCScan10Versions10Rows64Bytes-8  53.2µs ± 3%  35.5µs ± 6%  -33.33%  (p=0.000 n=9+10)
MVCCScan10Versions10Rows512Bytes-8  112µs ± 4%  52µs ± 4%  -53.55%  (p=0.000 n=10+10)
MVCCScan10Versions100Rows8Bytes-8  324µs ± 4%  247µs ± 3%  -23.60%  (p=0.000 n=9+9)
MVCCScan10Versions100Rows64Bytes-8  390µs ± 2%  267µs ± 3%  -31.40%  (p=0.000 n=8+10)
MVCCScan10Versions100Rows512Bytes-8  922µs ± 4%  397µs ± 1%  -56.93%  (p=0.000 n=10+8)
MVCCScan10Versions1000Rows8Bytes-8  3.07ms ± 5%  2.20ms ± 1%  -28.35%  (p=0.000 n=9+10)
MVCCScan10Versions1000Rows64Bytes-8  3.70ms ± 4%  2.42ms ± 2%  -34.64%  (p=0.000 n=10+10)
MVCCScan10Versions1000Rows512Bytes-8  9.16ms ± 5%  3.82ms ± 2%  -58.23%  (p=0.000 n=10+10)
MVCCScan100Versions1Row512Bytes-8  53.1µs ± 5%  654.6µs ±13%  +1133.60%  (p=0.000 n=10+9)
MVCCScan100Versions10Rows512Bytes-8  297µs ± 3%  3500µs ± 5%  +1077.73%  (p=0.000 n=10+9)
MVCCScan100Versions100Rows512Bytes-8  2.65ms ± 6%  29.69ms ±20%  +1021.57%  (p=0.000 n=10+10)
MVCCScan100Versions1000Rows512Bytes-8  25.4ms ± 4%  279.7ms ±31%  +998.93%  (p=0.000 n=10+9)
MVCCGet1Version8Bytes-8  18.9µs ± 4%  8.4µs ± 2%  -55.61%  (p=0.000 n=10+8)
MVCCGet10Versions8Bytes-8  27.3µs ± 3%  9.1µs ± 2%  -66.75%  (p=0.000 n=10+10)
MVCCGet100Versions8Bytes-8  42.6µs ± 6%  9.0µs ± 5%  -78.80%  (p=0.000 n=10+10)
MVCCPut10-8  3.86µs ± 5%  189.90µs ± 8%  +4825.43%  (p=0.000 n=10+10)
MVCCPut100-8  4.04µs ± 7%  221.56µs ± 3%  +5382.69%  (p=0.000 n=10+8)
MVCCPut1000-8  6.25µs ± 8%  244.94µs ± 1%  +3822.07%  (p=0.000 n=10+7)
MVCCPut10000-8  23.8µs ± 1%  235.9µs ± 5%  +890.75%  (p=0.000 n=9+10)
MVCCConditionalPutCreate10-8  3.84µs ± 1%  185.22µs ± 1%  +4724.86%  (p=0.000 n=10+9)
MVCCConditionalPutCreate100-8  4.00µs ± 2%  211.05µs ± 5%  +5178.60%  (p=0.000 n=10+10)
MVCCConditionalPutCreate1000-8  6.24µs ± 2%  250.95µs ± 4%  +3922.77%  (p=0.000 n=8+9)
MVCCConditionalPutCreate10000-8  24.0µs ± 1%  235.6µs ± 3%  +883.37%  (p=0.000 n=10+9)
MVCCConditionalPutReplace10-8  5.42µs ± 2%  229.03µs ± 3%  +4127.67%  (p=0.000 n=10+9)
MVCCConditionalPutReplace100-8  5.70µs ± 2%  220.28µs ± 2%  +3765.44%  (p=0.000 n=10+10)
MVCCConditionalPutReplace1000-8  10.5µs ± 2%  234.1µs ± 4%  +2124.64%  (p=0.000 n=10+9)
MVCCConditionalPutReplace10000-8  48.9µs ± 3%  264.1µs ± 7%  +439.54%  (p=0.000 n=10+10)
MVCCBatch1Put10-8  5.53µs ±15%  180.55µs ± 3%  +3164.48%  (p=0.000 n=9+10)
MVCCBatch100Put10-8  3.66µs ± 6%  7.09µs ± 2%  +93.64%  (p=0.000 n=10+10)
MVCCBatch10000Put10-8  4.17µs ± 5%  3.97µs ± 1%  -4.79%  (p=0.001 n=10+8)
MVCCBatch100000Put10-8  4.18µs ± 5%  3.81µs ± 9%  -8.76%  (p=0.000 n=10+10)
MVCCDeleteRange1Version8Bytes-8  110ms ± 1%  1917ms ± 3%  +1648.32%  (p=0.000 n=10+10)
MVCCDeleteRange1Version32Bytes-8  79.5ms ± 2%  1338.7ms ± 4%  +1583.62%  (p=0.000 n=9+10)
MVCCDeleteRange1Version256Bytes-8  24.0ms ± 6%  365.7ms ± 5%  +1423.25%  (p=0.000 n=9+9)

name  old speed  new speed  delta
MVCCScan10Versions1Row64Bytes-8  4.19MB/s ± 5%  8.03MB/s ± 4%  +91.62%  (p=0.000 n=10+10)
MVCCScan10Versions1Row512Bytes-8  27.0MB/s ± 6%  48.9MB/s ± 9%  +81.36%  (p=0.000 n=10+10)
MVCCScan10Versions10Rows8Bytes-8  1.87MB/s ± 8%  2.51MB/s ± 1%  +34.57%  (p=0.000 n=10+9)
MVCCScan10Versions10Rows64Bytes-8  12.0MB/s ± 4%  18.1MB/s ± 5%  +50.82%  (p=0.000 n=10+10)
MVCCScan10Versions10Rows512Bytes-8  45.9MB/s ± 4%  98.9MB/s ± 4%  +115.26%  (p=0.000 n=10+10)
MVCCScan10Versions100Rows8Bytes-8  2.47MB/s ± 3%  3.23MB/s ± 3%  +30.77%  (p=0.000 n=9+9)
MVCCScan10Versions100Rows64Bytes-8  16.4MB/s ± 2%  23.9MB/s ± 3%  +45.81%  (p=0.000 n=8+10)
MVCCScan10Versions100Rows512Bytes-8  55.6MB/s ± 4%  128.9MB/s ± 1%  +132.02%  (p=0.000 n=10+8)
MVCCScan10Versions1000Rows8Bytes-8  2.59MB/s ± 7%  3.64MB/s ± 1%  +40.59%  (p=0.000 n=10+10)
MVCCScan10Versions1000Rows64Bytes-8  17.3MB/s ± 4%  26.5MB/s ± 2%  +52.97%  (p=0.000 n=10+10)
MVCCScan10Versions1000Rows512Bytes-8  56.0MB/s ± 5%  133.9MB/s ± 2%  +139.24%  (p=0.000 n=10+10)
MVCCScan100Versions1Row512Bytes-8  9.65MB/s ± 5%  0.78MB/s ±15%  -91.88%  (p=0.000 n=10+9)
MVCCScan100Versions10Rows512Bytes-8  17.2MB/s ± 3%  1.5MB/s ± 5%  -91.49%  (p=0.000 n=10+9)
MVCCScan100Versions100Rows512Bytes-8  19.4MB/s ± 5%  1.7MB/s ±23%  -91.00%  (p=0.000 n=10+10)
MVCCScan100Versions1000Rows512Bytes-8  20.1MB/s ± 4%  1.9MB/s ±25%  -90.68%  (p=0.000 n=10+9)
MVCCGet1Version8Bytes-8  422kB/s ± 4%  947kB/s ± 4%  +124.33%  (p=0.000 n=10+9)
MVCCGet10Versions8Bytes-8  294kB/s ± 5%  880kB/s ± 0%  +199.32%  (p=0.000 n=10+6)
MVCCGet100Versions8Bytes-8  187kB/s ± 4%  881kB/s ± 4%  +371.18%  (p=0.000 n=10+9)
MVCCPut10-8  2.60MB/s ± 5%  0.05MB/s ± 0%  -98.07%  (p=0.000 n=10+9)
MVCCPut100-8  24.8MB/s ± 7%  0.5MB/s ± 4%  -98.17%  (p=0.000 n=10+8)
MVCCPut1000-8  160MB/s ± 8%  4MB/s ± 1%  -97.46%  (p=0.000 n=10+7)
MVCCPut10000-8  420MB/s ± 1%  42MB/s ± 5%  -89.90%  (p=0.000 n=9+10)
MVCCConditionalPutCreate10-8  2.60MB/s ± 1%  0.05MB/s ± 0%  -98.08%  (p=0.000 n=10+9)
MVCCConditionalPutCreate100-8  25.0MB/s ± 2%  0.5MB/s ± 3%  -98.09%  (p=0.000 n=10+9)
MVCCConditionalPutCreate1000-8  160MB/s ± 2%  4MB/s ± 4%  -97.51%  (p=0.000 n=8+9)
MVCCConditionalPutCreate10000-8  417MB/s ± 1%  42MB/s ± 3%  -89.83%  (p=0.000 n=10+9)
MVCCConditionalPutReplace10-8  1.85MB/s ± 2%  0.04MB/s ± 0%  -97.83%  (p=0.000 n=10+10)
MVCCConditionalPutReplace100-8  17.5MB/s ± 2%  0.5MB/s ± 3%  -97.41%  (p=0.000 n=10+10)
MVCCConditionalPutReplace1000-8  95.0MB/s ± 2%  4.3MB/s ± 3%  -95.50%  (p=0.000 n=10+9)
MVCCConditionalPutReplace10000-8  204MB/s ± 3%  38MB/s ± 6%  -81.43%  (p=0.000 n=10+10)
MVCCBatch1Put10-8  1.82MB/s ±14%  0.06MB/s ±11%  -96.92%  (p=0.000 n=9+10)
MVCCBatch100Put10-8  2.74MB/s ± 6%  1.41MB/s ± 1%  -48.41%  (p=0.000 n=10+10)
MVCCBatch10000Put10-8  2.40MB/s ± 5%  2.52MB/s ± 1%  +5.04%  (p=0.001 n=10+8)
MVCCBatch100000Put10-8  2.40MB/s ± 5%  2.63MB/s ± 8%  +9.65%  (p=0.000 n=10+10)
MVCCDeleteRange1Version8Bytes-8  4.78MB/s ± 2%  0.28MB/s ± 2%  -94.25%  (p=0.000 n=10+10)
MVCCDeleteRange1Version32Bytes-8  6.59MB/s ± 2%  0.39MB/s ± 3%  -94.07%  (p=0.000 n=9+10)
MVCCDeleteRange1Version256Bytes-8  21.8MB/s ± 5%  1.4MB/s ± 4%  -93.43%  (p=0.000 n=9+9)

name  old alloc/op  new alloc/op  delta
MVCCScan10Versions1Row64Bytes-8  576B ± 0%  1124B ± 0%  +95.14%  (p=0.000 n=10+10)
MVCCScan10Versions1Row512Bytes-8  1.60kB ± 0%  2.87kB ± 0%  +79.10%  (p=0.000 n=10+10)
MVCCScan10Versions10Rows8Bytes-8  2.50kB ± 0%  4.31kB ± 0%  +72.40%  (p=0.000 n=10+8)
MVCCScan10Versions10Rows64Bytes-8  3.52kB ± 0%  6.26kB ± 0%  +77.72%  (p=0.000 n=10+10)
MVCCScan10Versions10Rows512Bytes-8  9.68kB ± 0%  19.61kB ± 0%  +102.65%  (p=0.000 n=10+10)
MVCCScan10Versions100Rows8Bytes-8  19.9kB ± 0%  35.3kB ± 0%  +77.19%  (p=0.000 n=9+10)
MVCCScan10Versions100Rows64Bytes-8  32.2kB ± 0%  56.9kB ± 0%  +76.59%  (p=0.000 n=10+10)
MVCCScan10Versions100Rows512Bytes-8  81.5kB ± 0%  178.3kB ± 0%  +118.85%  (p=0.000 n=10+10)
MVCCScan10Versions1000Rows8Bytes-8  164kB ± 0%  314kB ± 0%  +92.12%  (p=0.000 n=9+10)
MVCCScan10Versions1000Rows64Bytes-8  213kB ± 0%  458kB ± 1%  +115.23%  (p=0.000 n=10+9)
MVCCScan10Versions1000Rows512Bytes-8  672kB ± 0%  1632kB ± 2%  +142.78%  (p=0.000 n=10+10)
MVCCScan100Versions1Row512Bytes-8  1.60kB ± 0%  2.90kB ± 0%  +80.97%  (p=0.000 n=10+10)
MVCCScan100Versions10Rows512Bytes-8  9.68kB ± 0%  19.90kB ± 0%  +105.62%  (p=0.000 n=10+7)
MVCCScan100Versions100Rows512Bytes-8  81.5kB ± 0%  181.6kB ± 3%  +122.94%  (p=0.000 n=10+10)
MVCCScan100Versions1000Rows512Bytes-8  672kB ± 0%  1683kB ± 7%  +150.30%  (p=0.000 n=8+10)
MVCCGet1Version8Bytes-8  64.0B ± 0%  464.0B ± 0%  +625.00%  (p=0.000 n=10+10)
MVCCGet10Versions8Bytes-8  64.0B ± 0%  514.5B ± 0%  +703.91%  (p=0.000 n=10+10)
MVCCGet100Versions8Bytes-8  64.0B ± 0%  520.0B ± 0%  +712.50%  (p=0.000 n=10+10)
MVCCPut10-8  0.00B ±NaN%  560.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCPut100-8  0.00B ±NaN%  560.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCPut1000-8  0.00B ±NaN%  560.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCPut10000-8  0.00B ±NaN%  560.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutCreate10-8  0.00B ±NaN%  560.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutCreate100-8  0.00B ±NaN%  560.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutCreate1000-8  0.00B ±NaN%  560.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutCreate10000-8  0.00B ±NaN%  560.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutReplace10-8  16.0B ± 0%  584.0B ± 0%  +3550.00%  (p=0.000 n=10+10)
MVCCConditionalPutReplace100-8  112B ± 0%  777B ± 0%  +593.75%  (p=0.000 n=10+10)
MVCCConditionalPutReplace1000-8  1.02kB ± 0%  2.60kB ± 0%  +153.89%  (p=0.000 n=10+10)
MVCCConditionalPutReplace10000-8  10.5kB ± 0%  21.6kB ± 0%  +105.17%  (p=0.000 n=8+10)
MVCCBatch1Put10-8  48.0B ± 0%  488.0B ± 0%  +916.67%  (p=0.000 n=10+10)
MVCCBatch100Put10-8  0.00B ±NaN%  337.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCBatch10000Put10-8  0.00B ±NaN%  336.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCBatch100000Put10-8  0.00B ±NaN%  336.00B ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCDeleteRange1Version8Bytes-8  214kB ± 0%  3743kB ± 0%  +1645.87%  (p=0.000 n=10+10)
MVCCDeleteRange1Version32Bytes-8  312kB ± 0%  3204kB ± 0%  +925.76%  (p=0.000 n=10+10)
MVCCDeleteRange1Version256Bytes-8  476kB ± 0%  2066kB ± 0%  +333.60%  (p=0.000 n=9+9)

name  old allocs/op  new allocs/op  delta
MVCCScan10Versions1Row64Bytes-8  2.00 ± 0%  20.00 ± 0%  +900.00%  (p=0.000 n=10+10)
MVCCScan10Versions1Row512Bytes-8  3.00 ± 0%  21.00 ± 0%  +600.00%  (p=0.000 n=10+10)
MVCCScan10Versions10Rows8Bytes-8  6.00 ± 0%  89.00 ± 0%  +1383.33%  (p=0.000 n=10+10)
MVCCScan10Versions10Rows64Bytes-8  7.00 ± 0%  90.00 ± 0%  +1185.71%  (p=0.000 n=10+10)
MVCCScan10Versions10Rows512Bytes-8  9.00 ± 0%  92.00 ± 0%  +922.22%  (p=0.000 n=10+10)
MVCCScan10Versions100Rows8Bytes-8  11.0 ± 0%  747.1 ± 1%  +6691.82%  (p=0.000 n=10+10)
MVCCScan10Versions100Rows64Bytes-8  13.0 ± 0%  749.5 ± 0%  +5665.38%  (p=0.000 n=10+10)
MVCCScan10Versions100Rows512Bytes-8  16.7 ± 4%  754.8 ± 0%  +4419.76%  (p=0.000 n=10+10)
MVCCScan10Versions1000Rows8Bytes-8  18.4 ± 3%  7254.4 ± 1%  +39326.09%  (p=0.000 n=10+10)
MVCCScan10Versions1000Rows64Bytes-8  22.0 ± 0%  7318.1 ± 1%  +33164.14%  (p=0.000 n=10+9)
MVCCScan10Versions1000Rows512Bytes-8  57.0 ± 0%  7300.3 ± 3%  +12707.54%  (p=0.000 n=10+10)
MVCCScan100Versions1Row512Bytes-8  3.00 ± 0%  21.00 ± 0%  +600.00%  (p=0.000 n=10+10)
MVCCScan100Versions10Rows512Bytes-8  9.00 ± 0%  94.40 ± 2%  +948.89%  (p=0.000 n=10+10)
MVCCScan100Versions100Rows512Bytes-8  16.3 ± 4%  779.7 ± 5%  +4683.44%  (p=0.000 n=10+10)
MVCCScan100Versions1000Rows512Bytes-8  56.5 ± 1%  7680.7 ±12%  +13494.16%  (p=0.000 n=10+10)
MVCCGet1Version8Bytes-8  2.00 ± 0%  18.00 ± 0%  +800.00%  (p=0.000 n=10+10)
MVCCGet10Versions8Bytes-8  2.00 ± 0%  20.00 ± 0%  +900.00%  (p=0.000 n=10+10)
MVCCGet100Versions8Bytes-8  2.00 ± 0%  20.00 ± 0%  +900.00%  (p=0.000 n=10+10)
MVCCPut10-8  0.00 ±NaN%  25.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCPut100-8  0.00 ±NaN%  25.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCPut1000-8  0.00 ±NaN%  25.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCPut10000-8  0.00 ±NaN%  25.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutCreate10-8  0.00 ±NaN%  25.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutCreate100-8  0.00 ±NaN%  25.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutCreate1000-8  0.00 ±NaN%  25.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutCreate10000-8  0.00 ±NaN%  25.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCConditionalPutReplace10-8  1.00 ± 0%  26.00 ± 0%  +2500.00%  (p=0.000 n=10+10)
MVCCConditionalPutReplace100-8  1.00 ± 0%  26.00 ± 0%  +2500.00%  (p=0.000 n=10+10)
MVCCConditionalPutReplace1000-8  1.00 ± 0%  26.00 ± 0%  +2500.00%  (p=0.000 n=10+10)
MVCCConditionalPutReplace10000-8  1.00 ± 0%  26.00 ± 0%  +2500.00%  (p=0.000 n=10+10)
MVCCBatch1Put10-8  1.00 ± 0%  20.00 ± 0%  +1900.00%  (p=0.000 n=10+10)
MVCCBatch100Put10-8  0.00 ±NaN%  13.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCBatch10000Put10-8  0.00 ±NaN%  13.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCBatch100000Put10-8  0.00 ±NaN%  13.00 ± 0%  +Inf%  (p=0.000 n=10+10)
MVCCDeleteRange1Version8Bytes-8  20.0 ± 0%  187328.3 ± 0%  +936541.50%  (p=0.000 n=10+10)
MVCCDeleteRange1Version32Bytes-8  26.0 ± 0%  131152.9 ± 0%  +504334.23%  (p=0.000 n=10+10)
MVCCDeleteRange1Version256Bytes-8  39.8 ± 3%  34567.6 ± 0%  +86753.27%  (p=0.000 n=10+10)
```
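For readers unfamiliar with the write model in the description, here is a minimal sketch of "mutations serialized through a single mutex, batch pinned to an OS thread". It uses only the Go standard library; `store`, `update`, `tryUpdate` and the helper functions are hypothetical stand-ins, not the lmdb-go API or the code in this PR, and `sync.Mutex.TryLock` assumes Go 1.18 or newer.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// store is a hypothetical stand-in for one environment that allows only a
// single write batch at a time.
type store struct {
	mu   sync.Mutex // serializes all write batches
	data map[string]string
}

func newStore() *store { return &store{data: make(map[string]string)} }

// update runs fn as the sole write batch. The goroutine stays pinned to its
// OS thread for the duration, mirroring the thread affinity of a write
// transaction.
func (s *store) update(fn func(put func(k, v string))) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	s.mu.Lock()
	defer s.mu.Unlock()
	fn(func(k, v string) { s.data[k] = v })
}

// tryUpdate is like update but returns false instead of blocking when
// another batch is already open.
func (s *store) tryUpdate(fn func(put func(k, v string))) bool {
	if !s.mu.TryLock() {
		return false
	}
	defer s.mu.Unlock()
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	fn(func(k, v string) { s.data[k] = v })
	return true
}

// concurrentPuts starts n goroutines that each write one distinct key and
// returns the number of keys stored afterwards.
func concurrentPuts(s *store, n int) int {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.update(func(put func(k, v string)) { put(fmt.Sprintf("key-%d", i), "v") })
		}(i)
	}
	wg.Wait()
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.data)
}

// nestedBatchOpens reports whether a second batch can be opened while the
// first is still open in the same goroutine. With the blocking update call
// this would deadlock, which is the hazard noted in the description.
func nestedBatchOpens(s *store) bool {
	opened := false
	s.update(func(put func(k, v string)) {
		opened = s.tryUpdate(func(put func(k, v string)) {})
	})
	return opened
}

func main() {
	s := newStore()
	fmt.Println("keys after concurrent puts:", concurrentPuts(s, 100))
	fmt.Println("nested batch opened:", nestedBatchOpens(s))
}
```

Every writer queues on the same mutex regardless of which keys it touches, which may be part of why the small `MVCCPut` benchmarks above look so poor while large batches are competitive. One mutex per database (per Replica) inside a shared environment would be a natural variation, but that is speculation rather than something this sketch measures.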
Hi @erichocean, good timing - I just fixed this up yesterday (updated the commit/PR message). This can now be benchmarked. With a little more elbow grease, it could presumably be hooked up to start a real node - though some ingredients are still missing and the benchmarks indicate that there's some horrible stuff going on:
If you'd like to help, there are several areas in which that'd be useful (but note that this is still just a playground - adding LMDB support isn't anywhere on our roadmap, it's mostly an exploration I started on a Friday-induced whim):
|
@petermattis I'd like to get this in -- I can either leave it as is or extract the LMDB code to an external repo ( |
@tschottdorf Are there any problems with RocksDB that we expect lmdb to address? I agree it would be simpler to maintain this code if it is in the main repo, yet doing so adds a burden on maintaining this code as we adjust interfaces in |
@petermattis my intention in either case would be to keep the interface assertions/tests/benchmarks in the main repo, the reasoning being that it isn't to our detriment for low-level storage changes to keep in mind that RocksDB may not be our only choice forever (i.e. not saying we must upgrade LMDB just for fun indefinitely, but this way we'll have it on the radar and can weigh the pros and cons with each such change). Plus, there's at least some community interest (see above) and having the POC working doesn't cost us (me) much and could enable outside contributors to pick up some work.
I think there are some interesting avenues to be explored with regard to persistent memory as that becomes a thing, but obviously RocksDB is what we're going to actively use in the foreseeable future. |
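If "interface assertions" here means Go's compile-time interface check, keeping one in the main repo could look roughly like the sketch below. `Engine` is a hypothetical, heavily trimmed interface and `memEngine` a toy implementation; neither is the actual Cockroach storage interface nor the LMDB engine from this PR.

```go
package main

import "fmt"

// Engine is a hypothetical two-method storage interface for illustration.
type Engine interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
}

// memEngine stands in for an alternative engine implementation.
type memEngine struct{ m map[string][]byte }

func newMemEngine() *memEngine { return &memEngine{m: make(map[string][]byte)} }

// Put stores a copy of value under key.
func (e *memEngine) Put(key, value []byte) error {
	e.m[string(key)] = append([]byte(nil), value...)
	return nil
}

// Get returns the value stored under key, or an error if it is absent.
func (e *memEngine) Get(key []byte) ([]byte, error) {
	v, ok := e.m[string(key)]
	if !ok {
		return nil, fmt.Errorf("key %q not found", key)
	}
	return v, nil
}

// Compile-time assertion: the build fails as soon as *memEngine no longer
// satisfies Engine, so an interface change cannot silently leave it behind.
var _ Engine = (*memEngine)(nil)

func main() {
	var e Engine = newMemEngine()
	if err := e.Put([]byte("k"), []byte("v")); err != nil {
		panic(err)
	}
	v, err := e.Get([]byte("k"))
	fmt.Println(string(v), err)
}
```

The assertion costs nothing at runtime. Whoever changes the interface sees a compile error and can then decide whether to update the second engine or drop the assertion.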
@tschottdorf Yes, RocksDB may not be our only backend storage engine forever, but I don't see that as sufficient reason to support another storage engine at this time. I'm sure developing this change wasn't particularly pleasant, yet the result doesn't look particularly bad (you're primarily adding code). Seems like the existing |
I agree with @petermattis; I'd rather see this live in a separate branch than be merged into This is pretty well-contained so I don't think it would be a problem to maintain in a separate branch. It probably also wouldn't be too bad in master either, but a significant new cgo dependency is never fun (how would this interact with the jemalloc builds, for example?). I'd be fine with introducing some new code in |
Not "support", "keep in mind". Out of sight, out of mind applies here - why not keep the assertions and benchmarks in? The actual code can live somewhere else, I just don't want it to rot because nobody knows it exists.
As above, I think that's the right thing to do: if someone's breaking it, let them know it and they can still do it; after all, the engine isn't integrated and only used in MVCC tests. The prototype has garnered some interest (https://twitter.com/hyc_symas/status/712264431552569344), why not keep it working for folks to pick up if they feel like it. It doesn't cost us a thing except for some low-maintenance test grooming. |
Perhaps I'm not understanding what you're proposing to keep. What assertions and benchmarks do you want to see in the main repo? Not having another storage engine implementation certainly makes our current implementation more susceptible to creeping RocksDB-isms, but I don't think we're going to have a much tighter embrace of RocksDB than we do now (e.g. the merge operator). There is a strong cultural bias against breaking part of the code base and then disabling associated tests, even if it isn't used. See the |
I'm proposing to keep
I expect that we buy into RocksDB more without thinking much about it by On Mon, Apr 25, 2016 at 8:20 PM Peter Mattis notifications@github.com
-- Tobias |
Do not delete this branch - see #5220