forked from cruppstahl/upscaledb
-
Notifications
You must be signed in to change notification settings - Fork 0
/
TODO
962 lines (816 loc) · 39.1 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
I Am Legend:
Items are sorted by priority (highest on top).
o a pending TODO item (for the current release)
. a pending TODO item (for future releases)
x a finished TODO item
==============================================================================
------------- hamsterdb-erlang 2.1.9 ---------------------------------
o improve erlang integration
o integration for quickcheck-ci.com
x look into Thomas' mail
------------- hamsterdb pro 2.1.9 ------------------------------------
x documentation rewrite
x use github wiki, remove old cruft
x then create the initial repository:
https://github.com/blog/699-making-github-more-open-git-backed-wikis
x /introduction
x /evaluate & benchmark
x /faq (merge w/ performance)
x /tutorial
... (existing stuff)
x /pro
x /hola
x API pages
overview (description, status)
installation
usage (compiling, linking)
api functions (link to doxygen)
samples
x c
x c++
x java
x dotnet
x erlang
x python
x protobuf
x have the hamsterdb/documentation directory "link" to the git project
x get rid of HAM_API_REVISION, move HAM_VERSION_* to the header file, or
move version.h to include/ham
x refactoring: the ByteArray should be non-copyable, or at least
clearly define the copying semantics regarding ownership. Also
review its uses, try to figure out how it can be improved. Should it be a
DynamicArray<uint8_t>?
x Zint32: still has a few bugs when running unittests because
index->used_size is miscalculated
gdb --args ./test Zint*
b zint32.cpp:66
b hamsterdb::Zint32::Zint32KeyList::copy_to
x then once more check the monster tests
x remove dependency to malloc.h, use stdlib.h instead
x add patches from Thomas Fähnle
x refactoring: clean up the whole Database code; it's too much and too
complex; there must be an absolutely clear border between Btree and
Transaction related code.
-> should we move the transactional implementations to their own
database class, i.e. TransactionalDatabase?
x LocalDatabase::cursor_get_duplicate_position: delegate to Cursor class
x when to call cursor->set_lastop(Cursor::kLookupOrInsert)?
x LocalDatabase::cursor_insert -> calls LocalDatabase::insert
x remove BtreeCursor::insert, it does not have relevant code
x LocalDatabase::cursor_erase -> calls LocalDatabase::erase
x remove BtreeCursor::erase
x remove TransactionCursor::erase
x LocalDatabase::cursor_find -> calls LocalDatabase::find
x update BtreeCursor::find
x a lot of functions have to clear the changeset when they leave;
use ON_EXIT_SCOPE macro for this!
x run performance tests
x run monster tests
x #42: approx. matching: cursor returns wrong key when using transactions
x create a unittest to reproduce this
x the fix should be that ham_cursor_find calls ham_db_find(..., cursor)
x cherry-pick the following commits on the master branch
1447ba4eb217532e8fb49c4a84a0dc3b982a3ffe
6a8dd20ec9bd2ec718d1136db7667e0e58911003
x Thomas Fähnle: create new parameter for fadvise configuration
x implement for file-based (w and w/o memory mapping) devices
x need to check for the function in configure.ac
x return flag with ham_env_get_parameter (not documented)
x create a unittest to make sure that the parameter is not persisted
x create a new parameter for ham_bench
x A few things to refactor in the btree
x the BtreeAction classes do not need a Transaction pointer as an argument!
x actually NO part of the Btree should EVER access a Transaction...
x btree_visitor.h has a structure called ScanVisitor, and BtreeVisitor is
in btree_index.h. Can both be combined?
x Introduce new option for record numbers (HAM_RECORD_NUMBER32,
HAM_RECORD_NUMBER64), deprecate the old one. Use 32bit whenever possible!
x introduce a new flag; the old one uses 64bit. all tests must work
x implement (local)
x implement (server)
x implement (remote)
x verify with ham_dump/ham_info
x add unittests
x re-use existing tests
x also for remote!
x check for overflows!
x add to ham_bench
x fix documentation
x samples
x header file
x implement Cursor::get_record_size for remote access
x issue43: fix segfault
it seems that a node is removed which is not in the rb-tree. the node
used to be there, but got dropped when another node was removed.
x Remove requirement for "delete stability" when erasing keys
x Introduce generic BtreeUpdateAction which handles merges and splits
x when erasing a key: also expect splits; in case of a split the
BtreeEraseAction has to retry the operation - either with the old
or the new node!
x hamsterdb-erlang: add support for HAM_RECORD_NUMBER32
x Zint32: add compression based on bit-packing
-> https://github.com/lemire/simdcomp/blob/master/include/simdintegratedbitpacking.h
x add parameter to header file
x support in ham_bench
x update IndexFactory for the new KeyList
x move block/index related functionality to a base class BlockKeyList
(w/ template parameter for the block structure)
x integrate Daniel's sources as a library
x add a unittest to compress/decompress
x implement the compression
x don't use index->used_size, it's not required
x don't use index->block_size; it's always static
x use delta-encoding to have better compression
x fill in the implementation
x add unittests (similar to zint32)
x add monster tests (similar to zint32/varbyte)
x don't calculate bitwidth if index->bits is already maxxed out (32)
x there is no vacuumize prior to a split. The vacuumize is cancelled
immediately because it is always "internal".
x don't split blocks in the middle if the new key is appended
x don't split blocks in the middle if the new key is prepended
x test with bulk-erase and deltas of size 1, no records
-> no delete stability!
x erase: only calculate bits of the new delta, not of the whole block
x fix the remaining TODOs
x improve vacuumize_impl() performance by allocating multiple
blocks at once
x rewrite simdmaxbitsd1() to accept a "length" parameter
x Mail from Joel: approx. matching returns wrong key
ham_cursor_find(... HAM_FIND_GEQ_MATCH)
keys "aa", "bb", "cc", find "b": returns "cc" instead of "bb"
-> use existing sources as the basis
-> insert keys as suggested
x Impove test coverage of approx. matching w/ EQC
x need erlang function for approx. matching (ham_db_find w/ flags
-> returns status AND key)
x for all key types (incl. fixed length keys of size 5, 32, 48 and
variable length keys)
x generate keys based on key type/size
x store inserted keys
x verify the result of the find operation
x specify key flags for approx. matching and verify the result
x lt_match
x leq_match
x gt_match
x geq_match
x add inline records of max. size to increase splits
x only btree
x Update wiki documentation/web documentation
x ham_env_create: links to doxygen-reference is broken
x verify other links
x update doxygen documentation to 2.1.9
x improve HAM_PARAM_JOURNAL_SWITCH_THRESHOLD
x implement for ham_env_open
x implement for ham_env_get_parameters
x add unittest for ham_env_create and ham_env_open
x issue45: crash when env is closed
Journal::recover -> __abort_uncommitted_txns -> ham_txn_abort
-> env->get_txn_manager()->abort
crashes because env->m_txn_manager == 0
x reproduce in a unittest
x don't call ham_txn_abort but call txn->abort instead
x Mail from Joel: approx. matching returns wrong key
x increase test coverage A LOT!
x rewrite comparison logic
x reproduce with his test, fix the bug
x fix the regressions (see Joel's mail)
x reproduce crash with a unittest
btree: insert "A"
txn: erase "A"
txn: search "A" w/ GEQ
find_txn then moves to the next node, which is null
segfault when accessing *node
x PRO: add Group Varint-based integer encoding
x add parameter to header file
x support in ham_bench
x update IndexFactory for the new KeyList
x integrate Daniel's sources as a library
x implement the compression
x add unittests (similar to zint32)
x add monster tests (similar to zint32/varbyte)
x don't split blocks in the middle if the new key is appended
x don't split blocks in the middle if the new key is prepended
x tests fail if kMaxGroupVarintsPerBlock is set to 4
x also try larger values
x reduce index size by setting the correct bit widths
o Crash when opening env and Journal is applied
starting w/ key 1104669304804000000, the blob 729496 (and all following
60 blobs on page 720896) are missing (= their blob headers are empty).
-> index already has a key, therefore the key is overwritten during
recovery; but the assigned record is bogus (blob_header is not
initialized)
-> the journal has no changeset
x is the page of that blob created or modified during recovery?
=> no!
-> it's not modified in a changeset
-> the page is modified during recovery, but the affected record
is not (it's nulled out from the beginning)
x was this key the last one that was inserted?
-> no, there are more keys following; about 60 without blob, then
more with blobs
o which steps or failures could lead to this scenario?
1) a bug in the recovery process - no, because the file is already
broken prior to recovery
2) might be a crash during recovery which resulted in an invalid state
x Remove the changeset from the Environment, and see how it goes
x review: every public method in Environment, Database and
TransactionManager creates a context
x remove ALL functions that are not used in hamsterdb.cc from
the public interface!!
-> only 1 left: make txn_manager() private
x LocalDatabase: flushes/clears Context
x improve separation between logic and state; make state immutable and
thread-safe/exception safe. Ultimately, there should be a separation
between...
- mutable database state
btree nodes, blobs...
- immutable database state
static parameters
- operation state (parameters)
insert, erase...
- the code
Immutable state can be accessed concurrently, w/o locking and blocking.
Mutable state is synchronized on the lowest level, with very short locks
x Rewrite the code
- remove dependencies to upper layers, i.e. to the Environment
- change function names, if necessary
- move implementation to Impl namespace and separate file
- separate state from the code
1. Struct Impl::XXX -> has State
2. Struct XXX : public Impl::XXX -> hides State (private)
3. Struct XXXTest(XXX *) -> state() returns State
x PageManager
x Move lsn handling to its own object, clean up
x Cache
x Changeset
x Journal
x remove Journal::set_switch_threshold
x Environment - not virtual, has pimpl for local and remote
implementation
x Environment: has common code, takes care of exceptions etc
x *context_data -> move to LocalEnvironment
x move Mutex handling to env.cc (as much as possible)
x create separate EnvironmentTest class
x RemoteEnvironment
x make all other functions private
x LocalEnvironment
x clean up
x add Test interface
x remove "friend" declaration
x Database
x RemoteDatabase
x LocalDatabase
x rebase against v2!!
x Concurrency: move "purge cache" to background
-> Reorg the hash table. All buckets are in serial memory. This requires
a few allocations, but reduces cache misses because there's no
"jumping" from page to page. Such a bucket can be replaced
atomically with CAS (hazard pointer!) and appended w/o locking. And
it avoids frequent calls to remove_from_list/insert_to_list to push
the used pages to the front. As a consequence, a new purge mechanism
is required to identify "old" pages. These pages then could even be
sorted before they are flushed. (This was tested before w/o any effect,
but maybe back then the databases simply were too small?)
-> The new page "list" will support concurrent reads and writes. This
means there can be multiple threads performing I/O without
blocking each other.
x on a new branch! topic/thread
x Create a generic "PageCollection" class with atomic updates
x use static memory as default, but replace w/ dynamic memory if
capacity is not enough
x need a very fast spinlock for writing
x whenever the array grows:
1) grow into a copy
2) perform CAS with the old pointer
3) old pointer can be discarded after 1 second
x assert that there is only 1 writer at a time (i.e. use a mutex)
x when inserting: set the pointer, then atomically increment the counter
x when deleting: simply set to zero
x never use a lock when reading
x rewrite PageCollection
x use a single linked list (intrusive, just like now)
x use a Spinlock for reads *and* writes
. in the future, we can still add a bloom filter to check whether
a page exists or not
x Use PageCollection in the Changeset
x Make sure that extracting the pages is an atomic operation (locked!)
-> PageCollection::for_each(Visitor &visitor);
x Use this in the Changeset
x Use PageCollection for the Page list
x Use PageCollection for the Cache buckets
x must keep track of the *tail* as well!
x if the Cache is thread safe - should the PageCollection also be
thread safe?? - no, not required
x fix the TODOs
x Cache::m_cur_elements no longer required (use totallist.size)
x Cache must be thread-safe
x the PageManager must be thread-safe - no, it's enough if
PageManager::purge_cache is thread-safe (that's the case)
x Create a background thread (use boost::thread)
x make sure that configure.ac checks for boost::thread
x Join thread when closing the Environment
x thread should sleep till it's woken up, either by a new
message or by shutdown
x need a fast double-linked message queue
x wake up thread when new messages arrive
x thread skips all non-mandatory messages when shutting down
x Run performance tests
The cache evicts too fast, i.e. 2.1.9 page_count_fetched 0 -> 9998088
x --seed=12345 --stop-ops=10000000 --key=uint16
x --seed=12345 --stop-ops=50000 --key=uint16
x --seed=1380279291 --stop-ops=1000000 --distribution=ascending
x --seed=1380279291 --stop-ops=1000000 --distribution=ascending
x --seed=1380279291 --stop-ops=1000000 --key=uint16
x move page flush in background
-> A page is in use when it's added to a Changeset. However, pages that
are fetched read-only are never added. And if recovery is disabled
then it's not possible to figure out when a page is no longer used.
x wrap Changeset::flush and Changeset::clear into
LocalEnvironment::finalize_operation(abort/commit)
x make sure this is called after every operation (exception safe!)
-> use BOOST_SCOPE_EXIT
x every operation stores its state in its own Changeset
x ham_db_find/ham_cursor_find
x unify Database::find and Database::cursor_find,
remove cursor_find
x move creation of temp. txn to Database::find, then call
(virtual) LocalDatabase::find_impl()
x find_impl should not have cursor-specific logic
x LocalDatabase::find will then take care of the changeset
and call finalize()
x when creating a local txn: does it have to be stored in the
cursor?
x merge remote requests
x merge ham_db_erase/ham_cursor_erase
x merge ham_db_insert/ham_cursor_insert
x always add a page to the Changeset, even if in-memory or
recovery is disabled
x when going out of scope: clear the changeset or flush it
(cont below...)
x merge topic/perfect-world with v2
x then remove the threading code from v2; so far no thread is required
x PRO: rebase to v2
x regression: ./ham_bench --use-berkeleydb=true --reopen=true --use-berkeleydb=true --reopen=true --use-transactions=tmp --use-cursors --duplicate=last --fullcheck=find ext_052.tst
x ... then compare PRO vs APL sources and make sure they don't diverge
too much!
x simdcomp-tests sometimes fail. valgrind has a huge amount of errors
x PRO: add more simdcomp related functions for blocks < 128 integers
x update Daniel's code - the repository has new patches (create a fork)
git submodule add git@github.com:cruppstahl/simdcomp.git
x add 'export "C"' directives for c++
x add Makefile.am for Automake
x new function for calculating the delta
x add tests
x send push requests to Daniel
x PRO: retrieve better statistics regarding compression/overhead/payload
x for each node retrieve these statistics and accumulate them
as min, max, avg:
- number of keys stored
- actually used storage for the keys
- static overhead per page
- overhead per index
- number of indices
- unused storage
x each node has a fill_metrics() method
x extend ham_info
x by default, no new stats are printed
x otherwise call fill_metrics() and print them
x PRO: simdcomp should use binary search for decompressed blocks
(currently uses linear search)
-> see mail from Daniel for more information, code
x Simdcomp: find (line 152)
x Grouped Varint: find (line 157)
x Simdcomp: insert (line 589)
x Grouped Varint: insert (line 570)
x PRO: integrate Nathan's sources, or start a new KeyList
x add parameter to header file
x add sources to 3rdparty
x implement the encoding
x add unittests
x add support for ham_bench
x add monster tests (pro)
x make sure the library and 3rdparty are compiled with -O3 by default
x various issues for refactoring...
x merge cache_impl with cache.h
x PageManagerFactory is no longer required
x PageManager::alloc calls Page::allocate (rename to "alloc"!), which
calls DeviceDisk::read_page. Why? it's enough to assign the address
(unless the storage is mmapped)
x move 4worker/worker.h to 2worker/worker.h
x blob performance improvements
x there's no need to copy the BlobHeader; this can be returned as a pointer
instead (verify! the pointer must not be modified - make it const!)
-> or is it possible that the header is spread over two pages??
x If mmap is enabled then keys and records don't have to be copied as long
as they point into the mapped area!
x Concurrency: move "purge cache" to background (cont...)
x Right now only the Cache uses a mutex. But we need to know
whether a page is in use (= in a changeset) or if it's purged.
x add a spinlock to the page
x PageManager: lock the page after adding it to the ChangeSet
x afterwards: fetch the page if pers == 0
x how to unlock in case of an exception?
x PageManager: unlock the page when removing it from the Changeset
x when writing to disk: acquire the lock; after flushing: set the
pers-pointer to null
x make sure that ONLY page->pers is modified by the separate
thread, not the next/prev pointers!
x periodically remove empty stubs from the list
(or remove them immediately after the page is purged, and the
lock is re-acquired again
x fix TODO in cache_impl.h
x wake up thread if cache is full
x fix all regressions
x main thread must remove empty pages
x profile mutex contention with
http://0pointer.de/blog/projects/mutrace.html
The cache mutex is a problem. Verify that it's not held while pages
are flushed!
#id 5462574 288644 222691 147887.835 0.027 59.274
x some messages should only exist once, i.e. purging the cache.
no need to add them to the queue multiple times
x use valgrind to check for race conditions etc
x use Spinlock instead of Mutex for the Page
x have a special build mode where Mutex is used instead of Spinlock
(HAM_ENABLE_HELGRIND)
x tests with excessive use of pages are terminated by the OOM-killer
because purges are not fast enough
x ./ham_bench --use-berkeleydb=true --reopen=true --key=binary --keysize=64 --pagesize=1024 --recsize=0 --bulk-erase --distribution=ascending --extkey-threshold=20 --recsize=0 --stop-ops=500000
x find_first_reverse always returns the same page!
x check if there's a race condition between worker thread
and PageManager::close_database()
x run monster tests
x run performance tests
x Change copyright to 2015
x all sources
x webpage footer
x also for PRO!
x zint32: stream vbyte: use avx only if it's supported by the cpu!
(note that os.cc currently returns wrong value if __AVX__ is not set; but
we want to check the runtime capabilities and not the compile time setting)
x zint32: investigate performance problems of memmove (w/o compression!)
x look into assembler code of memmove; is it using a loop?
- yes it is (sse w/ prefetch)
x track statistics of memmove; how many bytes are moved? also track the
time to get a total MB/SEC figure
- 13.2 gb/sec
x write a micro-benchmark
- no, actually memmove is faster than memcpy. wtf?? check again
with Vtune. Is this a profiling error with valgrind/callgrind? -> no
x zint32: new KeyList for Masked VByte encoding (use StreamVbyte as a
template and compress/uncompress blockwise)
x add unittests
x add monster tests
o zint32: would it make sense to have a block-based "compression" without
compression?
o blog post: compare performance of
- Spinlock linux vs Mutex linux (1, 5, 10, 20 mio keys)
- Spinlock win32 vs Mutex win32 (same machine)
- leveldb performance (100 mio keys)
- hamsterdb performance (100 mio keys)
- leveldb performance (1, 5, 10, 20, 50, 100 mio integer keys)
- hamsterdb performance (w/o compression)
- hamsterdb-pro performance (w/ compression)
o Erlang: investigate failure in ham_eqc2
o check recovery.pl - extended_tests are failing on PRO because the
error inducer hits too early
o documentation related issues for 2.1.10
o HAM_RECORD_NUMBER32: add/update tutorial
o record data points into mapped storage
o PRO: zint32 - if vacuumize calculates that 10% of the space are wasted then
perform similar to Simdcomp: uncompress all blocks into a large array,
compress them again and build up the index from scratch
o PRO: zint32 currently does not work if pagesize != 16kb. Document this
limitation and return an error if a different pagesize is used
. When flushing pages (PageManagerWorker): use writev and flush all pages
in a single call (win32 has no writev, therefore os_win32.cc should
sort the array first)
. currently, code uses linear search over small ranges. This is not efficient.
If the range is in the cache then binary is still faster.
o replace binary search implementation with std::lower_bound
o PRO: only use linear search if SIMD is enabled (really? does it make
sense?)
. Refactoring: rewrite more classes to improve the whole structure
o HeaderPage - can it be treated like any other page, and moved
to the PageManager/Cache??
o remove LocalEnvironment::mark_header_page_dirty
o transform to struct instead of class
o move header Page ownership from EnvironmentHeader to PageManager
o EnvironmentHeader
o BlobManager
o disk: pass PageManager during initialization
o inmem: pass InMemoryDevice during initialization
o remove dependency to env
o Device
o BtreeActions
o Btree
. Refactoring: all unittest fixtures should derive from a BaseFixture,
which creates an Environment, creates a list of databases (w/ parameters),
and if required also a cursor, a transaction and a context
o include additional management functions like lenv(), ldb(), ltxn(),
page_manager(), cache(), context()...
o what else?
o then reorganize the tests
- public API
- internal modules
o Refactoring: would it make sense if each Database has its own BlobManager?
Then ham_env_erase_db would be much faster, and if each Database has its
own lock then the blob pages would not block each other
o are there other modules that should be "per database" and not
"per environment"?
- PageManager
- Device
- ...?
. LocalEnvironment::open creates a "fakepage" to find out the page_size;
just assume the default page size of 16kb and read the first page. Then
discard if the real page_size is a different one
o deal with databases < 16kb?
. migrate to libuv 1.0, it has a stable API
http://docs.libuv.org/en/latest/migration_010_100.html
o also for windows!
o More things to refactor in the btree
o EraseAction uses duplicate_index + 1, InsertAction uses duplicate_index
-> use a common behaviour/indexing
o EraseAction line 71: if the node is empty then it should be merged and
moved to the freelist!
o when splitting and HAM_HINT_APPEND is set, the new page is appended.
do the same for prepend!
o refactoring: db_local.cc has so many TODOs!
o refactoring: improve the code separation of cursor, btree_cursor
and txn_cursor, i.e. in db_local::cursor_get_record_size (wtf - is_null(0)?)
etc
o the whole cursor state is messed up. there should be 3 states:
- nil
- coupled to btree
- coupled to txn
and nothing else!
o there should be a key() and a record() method; the caller should
not get access to txn_cursor/btree_cursor etc
o cursor.cc has so many TODOs!
o internal nodes should use 32bit page IDs!
-> but should still work with 64bit IDs for in-memory environments
o delta updates managed in the BtreeNode
Updates are attached to a node. A background thread can merge them.
At runtime, they're consolidated. An Update can be part of a Transaction.
The existing TransactionOperation structure is replaced. The whole
Transaction handling will be rewritten. Instead of using separate
TransactionTrees, the Updates are attached to the node.
Updates are a sorted (!) linked list. Prior to merge, those Updates that
delete keys will be applied.
The Update structure should be parameterized, just like the Btree
(i.e. use the identical type and compare function).
When pages are merged or split then the delta updates are moved to the
new page. They are not merged immediately.
The Update structure has a pointer to its Transaction. If the Transaction
is committed or aborted, then the Update state is set to "committed"
(or "aborted"). When merging, aborted Updates are removed. The Transaction
pointer is only followed if the state of the Transaction is not clear (i.e.
either committed or aborted).
-> Flushes/Merges happen in background
-> Transactions, Commits and Btree operations are completely decoupled.
This requires a new recovery/logging strategy!
-> The whole btree will borrow ideas from an LSM since it is running
compactions in the background
-> Needs more research...
o create an Update struct
o template parmeters for key and compare function
o key, record, ...
o single linked "next" pointer
o The BtreeNodeProxy manages a (single) linked list of Updates
o This linked list is always sorted
o Add a flag to force-flush immediately (disable queueing)
o Merge the updates (naive implementation: insert one by one)
o ... before flushing the page
o ... before all other operations
o When splitting the page: re-distribute the updates to the
new node
o Don't merge pages as long as delta-updates are attached
o implementation checkpoint - all tests must run
o unittests
o monster tests
o performance tests
o Then consolidate the updates piece by piece
o for lookups
o for inserts
o for erase
o for traversal
o for cursors
o for scans
o ... and with duplicates
. investigate "pointer swizzling": internal btree nodes should store "hints"
to the actual Page, not the Page IDs, as soon as the Page was loaded for
the first time. This could circumvent the buffer cache and increase
performance.
How to invalidate those hints or discover that a page was evicted from
cache?
- Eviction could only free the persistent part of a page, and not the
stub.
- Could also use reference counting for the page
. improve the webpage documentation
o document the various btree formats on the webpage, with images
o variable length keys (w/ extended keys)
o POD keys
o default records
o inline records
o fixed-length records
o duplicates (w/ overflow tables)
o PRO: compressed keys
o PRO: compressed records
o PRO: Group Varint-related improvements
o will currently not work with pages > 16kb
o vacuumize (internal == false): merge two blocks if they're underfilled?
we have to reduce the number of indices
o also for varbyte?
o PRO: allow compression of 32bit record numbers
o PRO: use zint32 compression for internal nodes
-> requires 32bit page IDs
o PRO: prefix compression for variable-length keys
use an indirection for the prefixes and suffixes; store each
part in a slot. the keys themselves have then fixed length (2 slot id's)
==> supports efficient binary search!
==> is efficient for random read/writes AND linear scans
however, it's very complex to figure out how to optimally split the
strings into prefixes and suffixes
==> prefixes and suffixes can be stored as extended keys if they become
too large
see indexcompression2009.pdf - Efficient index compression in DB2 LUW
o look for more research papers
o PRO: look for a better compression for DefaultRecordList, i.e.
- Each group is a GroupedVarInt w/ 4 bits per entry; a 64bit
number can then hold flags for 16 numbers
-> (but maybe increase this to hold at least 32 or 64 numbers, to
reduce the overhead ratio)
o create a GroupedVarInt<Max, T> class, where |Max| is the maximum number
of elements that are grouped, and T is the type of these elements
(i.e. uint64_t)
-> memory is managed by the caller
-> the state (i.e. used block size etc) is stored externally, and
managed by the caller
o append a key
o prepend a key
o insert a key in the middle
o grow blocks
o split blocks
o can perform copies w/o re-compressing
o try to move the Zint32 index to a base class
o Use small index which stores offset + bits for each group
o a separate bit is used to signal whether the (compressed) number is
a record id
o avoid ripple effects by growing/splitting the block
o PRO: use compression also for duplicate records
i.e. use GroupedVarint for inline duplicates
. hola - next steps
o support java api
o support .net api
o support erlang api
o lua-functions as callbacks - then remote marshalling will work
o PRO: compile callbacks with clang remotely
o add remote support where it makes sense (only for PRO?)
. architecture for a new webpage
o pick an awesome design
i.e. similar to http://foundationdb.com, http://laravel.com,
http://rethinkdb.com, http://www.orientechnologies.com
o use make/m4/markdown to generate static pages:
https://github.com/datagrok/m4-bakery
https://developer.github.com/v3/markdown/
o come up with the full site structure/contents
http://sidekiq.org/pro/
o include full documentation, one page per API
o ... and for all languages
o keep the documentation in the source tree, not in -www?
o documentation comments are hosted on disqus
o blog comments are hosted on disqus, articles are also written in markup
o Makefile can "scp -r" everything to the servers (staging or production)
. client area with (low priority)
o authentication
o collection of files
o analytics (who downloads what and when?)
. admin area with (even lower priority)
o authentication
o customer database
o implementing business processes
o sending out release emails
o importing new releases
o etc
. hola: use sum-prefix-trees to precalculate partial sums/results?
they could be stored in a btree, and used to dynamically recalculate
requested values
https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
o QuickCheck: automatically test the recovery feature by invoking "crashes"
o QuickCheck: create a new property for testing duplicates; the key is
always the same. The number of duplicate keys is tracked and
periodically checked with the API. A cursor can be used to remove a
specific duplicate, or to fetch a specific duplicate.
. use cache-oblivious b-tree layout
-> http://supertech.csail.mit.edu/cacheObliviousBTree.html
o see roadmap document for more information
o this feature is *per database*
o calculate number of reqd pages based on estimated keys from the user
(but what about the blobs??)
o make sure that this is not reverted when "reduce file size" feature
(above) is enabled
o the new pages are not managed by the freelist! therefore the freelist
will not need any modifications
o after resize: mmap the whole file area. this is actually important because
mmap is much faster than r/w; but when the database is created, the
original mapping already exists. therefore we might have to handle
more than one mapping in the file
o PageManager: when allocating a new page then use the distribution
function to fetch a page from the reserved storage
. try to batch allocations; when new pages are required then don't just
allocate one but multiple pages (if the database is big enough)
-> could create a second memory mapping for the next chunk
o PRO: hot backups (vacuumizes to a different file)
really only for PRO?
http://sqlite.org/c3ref/backup_finish.html
- make sure that all transactions are closed
- perform ham_env_flush
- then copy the file
- if compaction is enabled: copies keys w/ iterator
(later: performs bulk updates)
--> think this through; how to deal with delta updates? -> merge them
what if only a few databases should be backed up?
what if i want to back up in a logical format (i.e. csv)?
o "hola" - olap functions that operate directly on the btree data
-> see wiki
-> see java8 stream API:
http://download.java.net/jdk8/docs/api/java/util/stream/Stream.html
-> see supersonic:
https://code.google.com/p/supersonic/
-> see fast bitmap indices
http://code.google.com/p/lemurbitmapindex/
o create a design
o operations on compressed data (COUNT(), MIN(), MAX(), ...)?
o use async operations or futures/promises
o deprecate ham_db_get_key_count() (tutorial, documentation)
- bloom filter -> PRO
- concurrency -> PRO
. clean up approx. matching
o ONLY for cursors
o Flags: HAM_FIND_LT_MATCH | HAM_FIND_GT_MATCH | HAM_FIND_EQ_MATCH (default)
o lookup: the cursor is coupled to the key, even if the lookup fails
then perform a lookup:
found_key == requested_key:
HAM_FIND_EQ_MATCH: ok
HAM_FIND_LT_MATCH: return move_prev()
HAM_FIND_GT_MATCH: return move_next()
found_key < requested_key:
HAM_FIND_LT_MATCH: ok
HAM_FIND_GT_MATCH: return move_next()
HAM_FIND_EQ_MATCH: key not found
found_key > requested_key:
HAM_FIND_GT_MATCH: ok
HAM_FIND_LT_MATCH: return move_prev()
HAM_FIND_EQ_MATCH: key not found
o must work with transactions
o do not store key flags; the caller has to compare the key
o remove ham_key_set_intflags, ham_key_get_intflags, key->_flags (?)
. win32: need a release-v2.pl which fully automates the various release steps
o delete all generated protobuf files
o build for msvc 2008
o run unittests for debug and release
o run samples
o delete all generated protobuf files
o build for msvc 2010
o run unittests for debug and release
o run samples
o build release package
. also remove locking from C# and Java APIs
------------------- idea soup ---------------------------------------------
. PRO: should we have a separate "recsize == 0" RecordList for duplicates?
they could only store the duplicate count (but should be able to deal
with duplicates that are > 256!)
-> requires grouped varints
o asynchronous prefetching of pages
-> see posix_fadvise, libprefetch
o when recovering, give users the choice if active transactions should be
aborted (default behavior) or re-created
o needs a function to enumerate them
o A new transactional mode: read-only transactions can run "in the past" - only
on committed transactions. therefore they avoid conflicts and will always
succeed.
o need a function to get the txn of a conflict (same as in v2)
ham_status_t ham_txn_get_conflicting_txn(ham_txn_t *txn, ham_txn_t **other);
oder: txn-id zurückgeben? sonst gibt's ne race condition wenn ein anderer
thread "other" committed/aborted
o also add to c++ API
o add documentation (header file)
o add documentation (wiki)
. new test case for cursors
insert (1, a)
insert (1, b) (duplicate of 1)
move (last) (-> 1, b)
insert (1, c)
move (last) (-> 1, c)? is the dupecache updated correctly?
. there are a couple of areas where a btree cursor is uncoupled, just to
retrieve the key and to couple the txn-key. that's not efficient
db.c:__btree_cursor_points_to
db.c:__compare_cursors
txn_cursor.c:cursor_sync
txn_cursor.c:cursor_overwrite
o move to a separate function
o try to optimize
. add tests to verify that the cursor is not modified if an operation fails!
(in cursor.cpp:LongTxnCursorTest are some wrapper functions to move or
insert the cursor; that's a good starting point)