most of Ger's stuff is now merged

git-svn-id: svn+ssh://hamsterdb.com/home/chris/repos/hamsterdb/trunk@1303 66673985-dd14-0410-9433-caba4d15c716
commit 6f34dd400a10549ad16568ea19e67c508756674d 1 parent 43f98e3
@cruppstahl authored
6 CREDITS
@@ -0,0 +1,6 @@
+
+Jul 20, 2009
+ham_env_get_parameters, ham_db_get_parameters and functions for approximate
+matching, minor bugfixes and performance improvements plus documentation
+improvements were written by Ger Hobbelt, http://www.hobbelt.com,
+http://www.hebbut.net - THANKS!
291 INSTALL
@@ -0,0 +1,291 @@
+Installation Instructions
+*************************
+
+Copyright (C) 1994, 1995, 1996, 1999, 2000, 2001, 2002, 2004, 2005,
+2006, 2007, 2008 Free Software Foundation, Inc.
+
+ This file is free documentation; the Free Software Foundation gives
+unlimited permission to copy, distribute and modify it.
+
+Basic Installation
+==================
+
+ Briefly, the shell commands `./configure; make; make install' should
+configure, build, and install this package. The following
+more-detailed instructions are generic; see the `README' file for
+instructions specific to this package.
+
+ The `configure' shell script attempts to guess correct values for
+various system-dependent variables used during compilation. It uses
+those values to create a `Makefile' in each directory of the package.
+It may also create one or more `.h' files containing system-dependent
+definitions. Finally, it creates a shell script `config.status' that
+you can run in the future to recreate the current configuration, and a
+file `config.log' containing compiler output (useful mainly for
+debugging `configure').
+
+ It can also use an optional file (typically called `config.cache'
+and enabled with `--cache-file=config.cache' or simply `-C') that saves
+the results of its tests to speed up reconfiguring. Caching is
+disabled by default to prevent problems with accidental use of stale
+cache files.
+
+ If you need to do unusual things to compile the package, please try
+to figure out how `configure' could check whether to do them, and mail
+diffs or instructions to the address given in the `README' so they can
+be considered for the next release. If you are using the cache, and at
+some point `config.cache' contains results you don't want to keep, you
+may remove or edit it.
+
+ The file `configure.ac' (or `configure.in') is used to create
+`configure' by a program called `autoconf'. You need `configure.ac' if
+you want to change it or regenerate `configure' using a newer version
+of `autoconf'.
+
+The simplest way to compile this package is:
+
+ 1. `cd' to the directory containing the package's source code and type
+ `./configure' to configure the package for your system.
+
+ Running `configure' might take a while. While running, it prints
+ some messages telling which features it is checking for.
+
+ 2. Type `make' to compile the package.
+
+ 3. Optionally, type `make check' to run any self-tests that come with
+ the package.
+
+ 4. Type `make install' to install the programs and any data files and
+ documentation.
+
+ 5. You can remove the program binaries and object files from the
+ source code directory by typing `make clean'. To also remove the
+ files that `configure' created (so you can compile the package for
+ a different kind of computer), type `make distclean'. There is
+ also a `make maintainer-clean' target, but that is intended mainly
+ for the package's developers. If you use it, you may have to get
+ all sorts of other programs in order to regenerate files that came
+ with the distribution.
+
+ 6. Often, you can also type `make uninstall' to remove the installed
+ files again.
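+
+   For example, a complete build-and-install session for this package
+might look like this (paths and output will vary):
+
+     ./configure
+     make
+     make check
+     make install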
+
+Compilers and Options
+=====================
+
+ Some systems require unusual options for compilation or linking that
+the `configure' script does not know about. Run `./configure --help'
+for details on some of the pertinent environment variables.
+
+ You can give `configure' initial values for configuration parameters
+by setting variables in the command line or in the environment. Here
+is an example:
+
+ ./configure CC=c99 CFLAGS=-g LIBS=-lposix
+
+ *Note Defining Variables::, for more details.
+
+Compiling For Multiple Architectures
+====================================
+
+ You can compile the package for more than one kind of computer at the
+same time, by placing the object files for each architecture in their
+own directory. To do this, you can use GNU `make'. `cd' to the
+directory where you want the object files and executables to go and run
+the `configure' script. `configure' automatically checks for the
+source code in the directory that `configure' is in and in `..'.
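+
+   For example (an illustrative session), with GNU `make' you might
+build in a separate `build' subdirectory of the source tree like this:
+
+     mkdir build
+     cd build
+     ../configure
+     make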
+
+ With a non-GNU `make', it is safer to compile the package for one
+architecture at a time in the source code directory. After you have
+installed the package for one architecture, use `make distclean' before
+reconfiguring for another architecture.
+
+ On MacOS X 10.5 and later systems, you can create libraries and
+executables that work on multiple system types--known as "fat" or
+"universal" binaries--by specifying multiple `-arch' options to the
+compiler but only a single `-arch' option to the preprocessor. Like
+this:
+
+ ./configure CC="gcc -arch i386 -arch x86_64 -arch ppc -arch ppc64" \
+ CXX="g++ -arch i386 -arch x86_64 -arch ppc -arch ppc64" \
+ CPP="gcc -E" CXXCPP="g++ -E"
+
+   This is not guaranteed to produce working output in all cases; you
+may have to build one architecture at a time and combine the results
+using the `lipo' tool if you have problems.
+
+Installation Names
+==================
+
+ By default, `make install' installs the package's commands under
+`/usr/local/bin', include files under `/usr/local/include', etc. You
+can specify an installation prefix other than `/usr/local' by giving
+`configure' the option `--prefix=PREFIX'.
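+
+   For example, to install everything under your home directory instead
+of `/usr/local', you might run:
+
+     ./configure --prefix=$HOME/local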
+
+ You can specify separate installation prefixes for
+architecture-specific files and architecture-independent files. If you
+pass the option `--exec-prefix=PREFIX' to `configure', the package uses
+PREFIX as the prefix for installing programs and libraries.
+Documentation and other data files still use the regular prefix.
+
+ In addition, if you use an unusual directory layout you can give
+options like `--bindir=DIR' to specify different values for particular
+kinds of files. Run `configure --help' for a list of the directories
+you can set and what kinds of files go in them.
+
+ If the package supports it, you can cause programs to be installed
+with an extra prefix or suffix on their names by giving `configure' the
+option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'.
+
+Optional Features
+=================
+
+ Some packages pay attention to `--enable-FEATURE' options to
+`configure', where FEATURE indicates an optional part of the package.
+They may also pay attention to `--with-PACKAGE' options, where PACKAGE
+is something like `gnu-as' or `x' (for the X Window System). The
+`README' should mention any `--enable-' and `--with-' options that the
+package recognizes.
+
+ For packages that use the X Window System, `configure' can usually
+find the X include and library files automatically, but if it doesn't,
+you can use the `configure' options `--x-includes=DIR' and
+`--x-libraries=DIR' to specify their locations.
+
+Particular systems
+==================
+
+ On HP-UX, the default C compiler is not ANSI C compatible. If GNU
+CC is not installed, it is recommended to use the following options in
+order to use an ANSI C compiler:
+
+ ./configure CC="cc -Ae"
+
+and if that doesn't work, install pre-built binaries of GCC for HP-UX.
+
+ On OSF/1 a.k.a. Tru64, some versions of the default C compiler cannot
+parse its `<wchar.h>' header file. The option `-nodtk' can be used as
+a workaround. If GNU CC is not installed, it is therefore recommended
+to try
+
+ ./configure CC="cc"
+
+and if that doesn't work, try
+
+ ./configure CC="cc -nodtk"
+
+Specifying the System Type
+==========================
+
+ There may be some features `configure' cannot figure out
+automatically, but needs to determine by the type of machine the package
+will run on. Usually, assuming the package is built to be run on the
+_same_ architectures, `configure' can figure that out, but if it prints
+a message saying it cannot guess the machine type, give it the
+`--build=TYPE' option. TYPE can either be a short name for the system
+type, such as `sun4', or a canonical name which has the form:
+
+ CPU-COMPANY-SYSTEM
+
+where SYSTEM can have one of these forms:
+
+ OS KERNEL-OS
+
+ See the file `config.sub' for the possible values of each field. If
+`config.sub' isn't included in this package, then this package doesn't
+need to know the machine type.
+
+ If you are _building_ compiler tools for cross-compiling, you should
+use the option `--target=TYPE' to select the type of system they will
+produce code for.
+
+ If you want to _use_ a cross compiler, that generates code for a
+platform different from the build platform, you should specify the
+"host" platform (i.e., that on which the generated programs will
+eventually be run) with `--host=TYPE'.
+
+Sharing Defaults
+================
+
+ If you want to set default values for `configure' scripts to share,
+you can create a site shell script called `config.site' that gives
+default values for variables like `CC', `cache_file', and `prefix'.
+`configure' looks for `PREFIX/share/config.site' if it exists, then
+`PREFIX/etc/config.site' if it exists. Or, you can set the
+`CONFIG_SITE' environment variable to the location of the site script.
+A warning: not all `configure' scripts look for a site script.
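+
+   For example, a minimal `config.site' might look like this (the
+values shown are purely illustrative):
+
+     # config.site: defaults shared by all `configure' runs
+     CC=gcc
+     CFLAGS="-O2 -g"
+     prefix=/opt/local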
+
+Defining Variables
+==================
+
+ Variables not defined in a site shell script can be set in the
+environment passed to `configure'. However, some packages may run
+configure again during the build, and the customized values of these
+variables may be lost. In order to avoid this problem, you should set
+them in the `configure' command line, using `VAR=value'. For example:
+
+ ./configure CC=/usr/local2/bin/gcc
+
+causes the specified `gcc' to be used as the C compiler (unless it is
+overridden in the site shell script).
+
+Unfortunately, this technique does not work for `CONFIG_SHELL' due to
+an Autoconf bug. Until the bug is fixed you can use this workaround:
+
+ CONFIG_SHELL=/bin/bash /bin/bash ./configure CONFIG_SHELL=/bin/bash
+
+`configure' Invocation
+======================
+
+ `configure' recognizes the following options to control how it
+operates.
+
+`--help'
+`-h'
+ Print a summary of all of the options to `configure', and exit.
+
+`--help=short'
+`--help=recursive'
+ Print a summary of the options unique to this package's
+ `configure', and exit. The `short' variant lists options used
+ only in the top level, while the `recursive' variant lists options
+ also present in any nested packages.
+
+`--version'
+`-V'
+ Print the version of Autoconf used to generate the `configure'
+ script, and exit.
+
+`--cache-file=FILE'
+ Enable the cache: use and save the results of the tests in FILE,
+ traditionally `config.cache'. FILE defaults to `/dev/null' to
+ disable caching.
+
+`--config-cache'
+`-C'
+ Alias for `--cache-file=config.cache'.
+
+`--quiet'
+`--silent'
+`-q'
+ Do not print messages saying which checks are being made. To
+ suppress all normal output, redirect it to `/dev/null' (any error
+ messages will still be shown).
+
+`--srcdir=DIR'
+ Look for the package's source code in directory DIR. Usually
+ `configure' can determine that directory automatically.
+
+`--prefix=DIR'
+ Use DIR as the installation prefix. *Note Installation Names::
+ for more details, including other options available for fine-tuning
+ the installation locations.
+
+`--no-create'
+`-n'
+ Run the configure checks, but stop before creating any output
+ files.
+
+`configure' also accepts some other, not widely useful, options. Run
+`configure --help' for more details.
+
532 include/ham/hamsterdb_stats.h
@@ -0,0 +1,532 @@
+/**
+ * Copyright (C) 2005-2008 Christoph Rupp (chris@crupp.de).
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * See files COPYING.* for License information.
+ *
+ *
+ * \file hamsterdb_stats.h
+ * \brief Internal hamsterdb Embedded Storage statistics gathering and
+ * hinting functions.
+ * \author Ger Hobbelt, ger@hobbelt.com
+ *
+ */
+
+#ifndef HAM_HAMSTERDB_STATS_H__
+#define HAM_HAMSTERDB_STATS_H__
+
+#include <ham/hamsterdb.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+struct ham_statistics_t;
+
+
+/**
+ * function prototype for the hamsterdb-specified @ref ham_statistics_t cleanup
+ * function.
+ *
+ * @sa HAM_PARAM_GET_STATISTICS
+ * @sa ham_statistics_t
+ * @sa ham_clean_statistics_datarec
+ */
+typedef void ham_free_statistics_func_t(struct ham_statistics_t *self);
+
+/**
+ * The upper bound value which will trigger a statistics data rescale operation
+ * to be initiated in order to prevent integer overflow in the statistics data
+ * elements.
+ */
+#define HAM_STATISTICS_HIGH_WATER_MARK 0x7FFFFFFF /* could be 0xFFFFFFFF */
+
+/**
+ * As we [can] support record sizes up to 4Gb, at least theoretically,
+ * we can express this size range as a spanning DB_CHUNKSIZE size range:
+ * 1..N, where N = log2(4Gb) - log2(DB_CHUNKSIZE). As we happen to know
+ * DB_CHUNKSIZE == 32, at least for all regular hamsterdb builds, our
+ * biggest power-of-2 for the freelist slot count ~ 32-5 = 27, where 0
+ * represents slot size = 1 DB_CHUNKSIZE, 1 represents size of 2
+ * DB_CHUNKSIZEs, 2 ~ 4 DB_CHUNKSIZEs, and so on.
+ *
+ * EDIT:
+ * In order to cut down on statistics management cost due to overhead
+ * caused by having to keep up with the latest for VERY large sizes, we
+ * cut this number down to support sizes up to a maximum size of 64Kb ~
+ * 2^16, meaning all requests for more than 64Kb/CHUNKSIZE chunks
+ * share the same statistics.
+ *
+ */
+#define HAM_FREELIST_SLOT_SPREAD (16-5+1) /* 1 chunk .. 2^(SPREAD-1) chunks */
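+
+/*
+ * Worked example (illustrative only): with DB_CHUNKSIZE == 32 the
+ * spread above yields 16-5+1 = 12 buckets. Per the description above,
+ * bucket 0 represents requests of 1 chunk (32 bytes), bucket 1 of
+ * 2 chunks, and bucket 11 of up to 2^11 chunks (64Kb); any larger
+ * request shares bucket 11.
+ */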
+
+/* -- equivalents of the statistics.h internal PERSISTED data structures -- */
+
+/**
+ * We keep track of VERY first free slot index + free slot index
+ * pointing at last (~ supposed largest) free range + 'utilization' of the
+ * range between FIRST and LAST as a ratio of number of free slots in
+ * there vs. total number of slots in that range (giving us a 'fill'
+ * ratio) + a fragmentation indication, determined by counting the number
+ * of freelist slot searches that FAILed vs. SUCCEEDed within the
+ * first..last range, when the search begun at the 'first' position
+ * (a FAIL here meaning the freelist scan did not deliver a free slot
+ * WITHIN the first..last range, i.e. it has scanned this entire range
+ * without finding anything suitably large).
+ *
+ * Note that the free_fill in here is AN ESTIMATE.
+ */
+typedef struct ham_freelist_slotsize_stats_t
+{
+ ham_u32_t first_start;
+
+ /* reserved: */
+ ham_u32_t free_fill;
+ ham_u32_t epic_fail_midrange;
+ ham_u32_t epic_win_midrange;
+
+ /** number of scans per size range */
+ ham_u32_t scan_count;
+
+ ham_u32_t ok_scan_count;
+
+ /** summed cost ('duration') of all scans per size range */
+ ham_u32_t scan_cost;
+ ham_u32_t ok_scan_cost;
+
+} ham_freelist_slotsize_stats_t;
+
+/**
+ * freelist statistics as they are persisted on disc.
+ *
+ * Stats are kept with each freelist entry record, but we also keep
+ * some derived data in the nonpermanent space with each freelist:
+ * it's not required to keep a freelist page in cache just so the
+ * statistics + our operational mode combined can tell us it's a waste
+ * of time to go there.
+ */
+typedef struct ham_freelist_page_statistics_t
+{
+ ham_freelist_slotsize_stats_t per_size[HAM_FREELIST_SLOT_SPREAD];
+
+ /**
+ * (bit) offset which tells us which free slot is the EVER LAST
+ * created one; after all, freelistpage:maxbits is a scandalously
+ * optimistic lie: all it tells us is how large the freelist page
+ * _itself_ can grow, NOT how many free slots we actually have
+ * _alive_ in there.
+ *
+ * 0: special case, meaning: not yet initialized...
+ */
+ ham_u32_t last_start;
+
+ /**
+ * total number of available bits in the page ~ all the chunks which
+ * actually represent a chunk in the DB storage space.
+ *
+ * (Note that a freelist can be larger (_max_bits) than the actual
+ * number of storage pages currently sitting in the database file.)
+ *
+ * The number of chunks already in use in the database therefore ~
+ * persisted_bits - _allocated_bits.
+ */
+ ham_u32_t persisted_bits;
+
+ /**
+ * count the number of insert operations where this freelist page
+ * played a role
+ */
+ ham_u32_t insert_count;
+ ham_u32_t delete_count;
+ ham_u32_t extend_count;
+ ham_u32_t fail_count;
+ ham_u32_t search_count;
+
+ ham_u32_t rescale_monitor;
+
+} ham_freelist_page_statistics_t;
+
+/* -- end of equivalents of the statistics.h internal PERSISTED data
+ * structures -- */
+
+/**
+ * global freelist algorithm specific run-time info: per cache
+ */
+typedef struct ham_runtime_statistics_globdata_t
+{
+ /** number of scans per size range */
+ ham_u32_t scan_count[HAM_FREELIST_SLOT_SPREAD];
+ ham_u32_t ok_scan_count[HAM_FREELIST_SLOT_SPREAD];
+
+ /** summed cost ('duration') of all scans per size range */
+ ham_u32_t scan_cost[HAM_FREELIST_SLOT_SPREAD];
+ ham_u32_t ok_scan_cost[HAM_FREELIST_SLOT_SPREAD];
+
+ /** count the number of insert operations for this DB */
+ ham_u32_t insert_count;
+ ham_u32_t delete_count;
+ ham_u32_t extend_count;
+ ham_u32_t fail_count;
+ ham_u32_t search_count;
+
+ ham_u32_t insert_query_count;
+ ham_u32_t erase_query_count;
+ ham_u32_t query_count;
+
+ ham_u32_t first_page_with_free_space[HAM_FREELIST_SLOT_SPREAD];
+
+ /**
+ * Note: counter/statistics value overflow management:
+ *
+ * As the 'cost' numbers will be the fastest growing numbers of
+ * them all, it is sufficient to check cost against a suitable
+ * high water mark, and once it reaches that mark, to rescale
+ * all statistics.
+ *
+ * Of course, we could have done without the rescaling by using
+ * 64-bit integers for all statistics elements, but 64-bit
+ * integers are not native to all platforms and incur a (minor)
+ * run-time penalty when used. It is felt that slower machines,
+ * which are often 32-bit only, benefit from a compare plus
+ * once-in-a-while rescale, as this overhead can be amortized
+ * over a large multitude of statistics updates.
+ *
+ * How does the rescaling work?
+ *
+ * The statistics all are meant to represent relative numbers,
+ * so uniformly scaling these numbers will not produce worse
+ * results from the hinters -- as long as the scaling does not produce
+ * edge values (0 or 1) which destroy the significance of the numbers
+ * gathered thus far.
+ *
+ * I believe a rescale by a factor of 256 (2^8) is quite safe
+ * when the high water mark is near the maxint (2^32) edge, even
+ * when the cost number can be 100 times as large as the other
+ * numbers in some regular use cases. Meanwhile, a division by
+ * 256 will reduce the collected numeric values so much that
+ * there is ample headroom again for the next 100K+ operations;
+ * at an average monitored cost increase of 10-20 per
+ * insert/delete trial and, for very large databases using an
+ * overly conservative freelist management setting, ~50-200 trials
+ * per insert/delete API invocation (which should be a hint to the
+ * user that another DAM mode is preferred; after all, 'classical'
+ * is only there for backwards compatibility, and in the old
+ * days, hamsterdb was a snail when you'd be storing 1M+ records
+ * in a single DB table), the resulting statistics additive step
+ * is a nominal worst case of 20 * 200 = 4000 cost points per
+ * insert/delete.
+ *
+ * Assuming a high water mark for signed int, i.e. 2^31 ~ 2.14
+ * billion, dividing ('rescaling') that number down to 2^(31-8) ~ 8M
+ * produces a headroom of ~ 2.13 billion points, which,
+ * assuming the nominal worst case of a cost addition of 4000
+ * points per insert/delete, implies new headroom for ~ 500K
+ * insert/delete API operations.
+ *
+ * Which, in my book, is ample space. This also means that the
+ * costs incurred by the rescaling can be amortized over 500K+
+ * operations, resulting in an - on average - negligible overhead.
+ *
+ * So we can use 32-bits for all statistics counters quite
+ * safely. Assuming our 'cost is the fastest riser' position holds for
+ * all use cases, that is.
+ *
+ * A quick analysis shows this to be probably true, even for
+ * fringe cases (a mathematical proof would be nicer here, but
+ * alas):
+ * let's assume worst case, where we have a lot of trials
+ * (testing each freelist page entry in a very long freelist,
+ * i.e. a huge database table) which all fail. 'Cost' is
+ * calculated EVERY TIME the innermost freelist search method is
+ * invoked, i.e. when the freelist bitarray is inspected, and
+ * both fail and success costs are immediately fed into the
+ * statistics, so our worst case for the 'cost-is-fastest' lemma
+ * would be a long trace of fail trials, which do NOT test the
+ * freelist bitarrays, i.e. fails which are discarded in the outer layers,
+ * thanks to the hinters (global and per-entry) kicking in and preventing
+ * those freelist bitarray scans. Assume then that all counters
+ * have the same value, which would mean that the number of
+ * fails has to be higher than the final cost, repeatedly.
+ *
+ * This _can_ happen when the number of fail trials at the
+ * per-entry outer level is higher than the cost of the final
+ * (and only) freelist bitarray scan, which clocks in at a
+ * nominal 4-10 points for success cases. However, those _outer_
+ * fail trials are NOT counted and fed to the statistics, so
+ * this case will only register a single, successful or failing,
+ * trial - with cost.
+ *
+ * As long as the code is not changed to count those
+ * hinter-induced fast rounds in the outer layers when searching
+ * for a slot in the freelist, the lemma 'cost grows fastest'
+ * holds, as any other possible 'worst case' will either
+ * succeed quite quickly or fail through a bitarray scan, which
+ * results in such fail rounds having associated non-zero, 1+
+ * costs associated with them.
+ *
+ * To be on the safe side of it all, we accumulate all costs in
+ * a special statistics counter, which is specifically designed
+ * to be used for the high water mark monitoring and subsequent
+ * decision to rescale: rescale_monitor.
+ */
+ ham_u32_t rescale_monitor;
+
+} ham_runtime_statistics_globdata_t;
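+
+/*
+ * A minimal sketch (illustrative; the authoritative checks live in
+ * statistics.c) of how rescale_monitor is meant to guard these
+ * counters against overflow:
+ *
+ *     if (globalstats->rescale_monitor >=
+ *             HAM_STATISTICS_HIGH_WATER_MARK - cost) {
+ *         rescale_global_statistics(db); // divide all counters by 256
+ *     }
+ *     globalstats->rescale_monitor += cost;
+ */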
+
+
+/**
+ * @defgroup ham_operation_types hamsterdb Database Operation Types
+ * @{
+ *
+ * Indices into find/insert/erase specific statistics
+ *
+ * @sa ham_statistics_t
+ * @sa ham_runtime_statistics_opdbdata_t
+ */
+#define HAM_OPERATION_STATS_FIND 0
+#define HAM_OPERATION_STATS_INSERT 1
+#define HAM_OPERATION_STATS_ERASE 2
+
+/** The number of operations defined for the statistics gathering process */
+#define HAM_OPERATION_STATS_MAX 3
+
+/**
+ * @}
+ */
+
+
+/**
+ * Statistics gathered specific per operation (find, insert, erase)
+ */
+typedef struct ham_runtime_statistics_opdbdata_t
+{
+ ham_u32_t btree_count;
+ ham_u32_t btree_fail_count;
+ ham_u32_t btree_cost;
+ ham_u32_t btree_fail_cost;
+
+ ham_offset_t btree_last_page_addr;
+
+ /**
+ * number of consecutive times that this last page was produced as
+ * an answer ('sequential hits')
+ */
+ ham_u32_t btree_last_page_sq_hits;
+
+ ham_u32_t query_count;
+
+ ham_u32_t btree_hinting_fail_count;
+ ham_u32_t btree_hinting_count;
+
+ ham_u32_t aging_tracker;
+
+} ham_runtime_statistics_opdbdata_t;
+
+typedef struct ham_runtime_statistics_dbdata_t
+{
+ /* find/insert/erase */
+ ham_runtime_statistics_opdbdata_t op[HAM_OPERATION_STATS_MAX];
+
+ /**
+ * common rescale tracker as the rescaling is done on all operations data
+ * at once, so these remain 'balanced'.
+ *
+ * Fringe case consideration: when there's, say, a lot of FIND going
+ * on with a few ERASE operations in between, is it A Bad Thing that
+ * the ERASE stats risk getting rescaled to almost nil then? Answer: NO.
+ * Because there's a high probability that the last ERASE btree leaf
+ * node isn't in cache anymore anyway -- unless it's the same one
+ * as used by FIND.
+ *
+ * The reason we keep track of 3 different leaf nodes is only so we
+ * can supply good hinting in scenarios where FIND, INSERT and/or
+ * ERASE are mixed in reasonable ratios; keeping track of only a single
+ * btree leaf would deny us some good hinting for the other operations.
+ */
+ ham_u32_t rescale_tracker;
+
+ /**
+ * Remember the upper and lower bound keys for this database; update them
+ * when we insert a new key, maybe even update them when we delete/erase
+ * a key.
+ *
+ * These bounds are collected on the fly while searching (find()): they are
+ * stored in here as soon as a find() operation hits either the lower or
+ * upper bound of the key range stored in the database.
+ *
+ * The purpose of storing these bounds is to speed up out-of-bounds
+ * key searches significantly: by comparing incoming keys with these
+ * bounds, we can immediately tell whether a key will have a chance of
+ * being found or not, thus precluding the need to traverse the btree -
+ * which would produce the same answer in the end anyhow.
+ *
+ * WARNING: having these key (copies) in here means we'll need to
+ * clean them up when we close the database connection, or we'll risk
+ * leaking memory in the key->data here.
+ *
+ * NOTE #1: this is the humble beginning of what in a more sophisticated
+ * database server system would be called a 'histogram' (Oracle, etc.).
+ * Here we don't spend the effort to collect data for a full histogram,
+ * but merely collect info about the extremes of our stored key range.
+ *
+ * NOTE #2: I'm pondering whether this piece of statistics gathering
+ * should be allowed to be turned off by the user 'because he knows best'
+ * where premium run-time performance requirements are at stake.
+ * Yet... The overhead here is a maximum of two key comparisons plus
+ * 2 key copies (which can be significant when we're talking about
+ * extended keys!) when we're producing find/insert/erase results which
+ * access a btree leaf node which is positioned at the upper/lower edge of
+ * the btree key range.
+ *
+ * Hence, worst case happens for sure with tiny databases, as those will
+ * have ONE btree page only (root=leaf!) and the worst case is reduced
+ * to 1 key comparison + 1 key copy for any larger database, which spans
+ * two btree pages or more.
+ *
+ * To further reduce the worst case overhead, we also store the
+ * within-btree-node index of the upper/lower bound key: when this does
+ * not change, there is no need to compare the key - unless the key is
+ * overwritten, which is a special case of the insert operation.
+ *
+ * @warning
+ * The @a key data is allocated using the @ref ham_db_t allocator and the
+ * key data must be freed before the related @ref ham_db_t handle
+ * is closed or deleted.
+ */
+ ham_key_t lower_bound;
+ ham_u32_t lower_bound_index;
+ ham_offset_t lower_bound_page_address;
+ ham_bool_t lower_bound_set;
+ ham_key_t upper_bound;
+ ham_u32_t upper_bound_index;
+ ham_offset_t upper_bound_page_address;
+ ham_bool_t upper_bound_set;
+
+} ham_runtime_statistics_dbdata_t;
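+
+/*
+ * Illustrative sketch of the out-of-bounds fast path these bound
+ * fields enable (not the actual implementation; the comparison helper
+ * name is assumed):
+ *
+ *     if (dbdata->lower_bound_set
+ *             && compare_keys(db, key, &dbdata->lower_bound) < 0)
+ *         return HAM_KEY_NOT_FOUND; // below the smallest stored key
+ *     if (dbdata->upper_bound_set
+ *             && compare_keys(db, key, &dbdata->upper_bound) > 0)
+ *         return HAM_KEY_NOT_FOUND; // above the largest stored key
+ */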
+
+/**
+ * This structure is a @e READ-ONLY data structure returned through invoking
+ * @ref ham_env_get_parameters or @ref ham_get_parameters with a
+ * @ref HAM_PARAM_GET_STATISTICS @ref ham_parameter_t entry.
+ *
+ * @warning
+ * The content of this structure will be subject to change with each hamsterdb
+ * release; having it available in the public interface does @e not mean one
+ * can assume the data layout and/or content of the @ref ham_statistics_t
+ * structure to remain constant over multiple release version updates of
+ * hamsterdb.
+ *
+ * Also note that the data is exported to aid very advanced uses of hamsterdb
+ * only and is to be accessed in an exclusively @e read-only fashion.
+ *
+ * The structure includes a function pointer which may be set by
+ * hamsterdb upon invoking @ref ham_env_get_parameters or
+ * @ref ham_get_parameters. The caller should invoke this function to
+ * release all memory allocated by hamsterdb in the
+ * @ref ham_statistics_t structure. This @e MUST be done @e before the
+ * related @a env and/or @a db handles are closed or deleted, whichever
+ * comes first in your application's run-time flow.
+ *
+ * The easiest way to invoke this @ref ham_clean_statistics_datarec function
+ * (when it is set), is to use the provided @ref ham_free_statistics() macro.
+ *
+ * @sa HAM_PARAM_GET_STATISTICS
+ * @sa ham_clean_statistics_datarec
+ * @sa ham_get_parameters
+ * @sa ham_env_get_parameters
+ */
+typedef struct ham_statistics_t
+{
+ /** Number of freelist pages (and statistics records) known to hamsterdb */
+ ham_size_t freelist_record_count;
+
+ /** Number of freelist statistics records allocated in this structure */
+ ham_size_t freelist_stats_maxalloc;
+
+ /** The @a freelist_stats_maxalloc freelist statistics records */
+ ham_freelist_page_statistics_t *freelist_stats;
+
+ /** The @ref ham_db_t specific statistics */
+ ham_runtime_statistics_dbdata_t db_stats;
+
+ /** The @ref ham_env_t statistics, a.k.a. 'global statistics' */
+ ham_runtime_statistics_globdata_t global_stats;
+
+ /**
+ * [input] Whether the freelist statistics should be gathered (this is
+ * a relatively costly operation)
+ * [output] will be reset when the freelist statistics have been gathered
+ */
+ unsigned dont_collect_freelist_stats: 1;
+
+ /**
+ * [input] Whether the @ref ham_db_t specific statistics should be gathered
+ * [output] will be reset when the db specific statistics have been gathered
+ */
+ unsigned dont_collect_db_stats: 1;
+
+ /**
+ * [input] Whether the @ref ham_env_t statistics (a.k.a. 'global
+ * statistics') should be gathered
+ * [output] will be reset when the global statistics have been gathered
+ */
+ unsigned dont_collect_global_stats: 1;
+
+ /**
+ * A reference to a hamsterdb-specified @e optional data cleanup function.
+ *
+ * @warning
+ * The user @e MUST call this cleanup function when it is set by
+ * hamsterdb, preferably through invoking
+ * @ref ham_clean_statistics_datarec() as that function will check if
+ * this callback has been set or not before invoking it.
+ *
+ * @sa ham_clean_statistics_datarec
+ */
+ ham_free_statistics_func_t *_free_func;
+
+ /*
+ * internal use: this element is set by hamsterdb and to be used by the
+ * @a _free_func callback.
+ */
+ void *_free_func_internal_arg;
+
+} ham_statistics_t;
+
+/**
+ * Invoke the optional @ref ham_statistics_t content cleanup function.
+ *
+ * This function will check whether the @ref ham_statistics_t free/cleanup
+ * callback has been set or not before invoking it.
+ *
+ * @param s A pointer to a valid @ref ham_statistics_t data structure. 'Valid'
+ * means you must call this function @e after having
+ * called @ref ham_env_get_parameters or @ref ham_get_parameters with
+ * a @ref HAM_PARAM_GET_STATISTICS @ref ham_parameter_t entry which had this
+ * @ref ham_statistics_t reference @a s attached and @e before either the
+ * related @ref ham_db_t or @ref ham_env_t handles are closed (@ref
+ * ham_env_close/@ref ham_close) or deleted (@ref ham_env_delete/@ref
+ * ham_delete).
+ *
+ * @return @ref HAM_SUCCESS upon success
+ * @return @ref HAM_INV_PARAMETER if the @a s pointer is NULL
+ *
+ * @sa HAM_PARAM_GET_STATISTICS
+ * @sa ham_clean_statistics_datarec
+ * @sa ham_get_parameters
+ * @sa ham_env_get_parameters
+ */
+HAM_EXPORT ham_status_t HAM_CALLCONV
+ham_clean_statistics_datarec(ham_statistics_t *s);
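+
+/*
+ * Illustrative usage sketch (it is assumed here that the statistics
+ * reference is passed through the ham_parameter_t value, cast to
+ * ham_offset_t):
+ *
+ *     ham_statistics_t stats = {0};
+ *     ham_parameter_t params[] = {
+ *         { HAM_PARAM_GET_STATISTICS, (ham_offset_t)&stats },
+ *         { 0, 0 }
+ *     };
+ *
+ *     if (ham_env_get_parameters(env, params) == HAM_SUCCESS) {
+ *         // ... inspect stats, strictly read-only ...
+ *         ham_clean_statistics_datarec(&stats);
+ *     }
+ */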
+
+
+#ifdef __cplusplus
+} // extern "C"
+#endif
+
+#endif /* HAM_HAMSTERDB_STATS_H__ */
+
436 src/fraction.c
@@ -0,0 +1,436 @@
+/**
+ * Copyright (C) 2005-2008 Christoph Rupp (chris@crupp.de).
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * See files COPYING.* for License information.
+ *
+ */
+
+#include "config.h"
+
+#include <string.h>
+#include <math.h>
+#include <float.h>
+#include <ham/hamsterdb.h>
+#include "fraction.h"
+#include "error.h"
+
+
+
+/* fraction code ripped from my patch for meGUI */
+/**
+* Code according to info found here: http://mathforum.org/library/drmath/view/51886.html
+*
+*
+* Date: 06/29/98 at 13:12:44
+*
+* From: Doctor Peterson
+*
+* Subject: Re: Decimal To Fraction Conversion
+*
+*
+* The algorithm I am about to show you has an interesting history. I
+* recently had a discussion with a teacher in England who had a
+* challenging problem he had given his students, and wanted to know what
+* others would do to solve it. The problem was to find the fraction
+* whose decimal value he gave them, which is essentially identical to
+* your problem! I wasn't familiar with a standard way to do it, but
+* solved it by a vaguely remembered Diophantine method. Then, my
+* curiosity piqued, I searched the Web for information on the
+* problem and didn't find it mentioned in terms of finding the fraction
+* for an actual decimal, but as a way to approximate an irrational by a
+* fraction, where the continued fraction method was used.
+*
+*
+* I wrote to the teacher, and he responded with a method a student of
+* his had come up with, which uses what amounts to a binary search
+* technique. I recognized that this produced the same sequence of
+* approximations that continued fractions gave, and was able to
+* determine that it is really equivalent, and that it is known to some
+* mathematicians (or at least math historians).
+*
+*
+* After your request made me realize that this other method would be
+* easier to program, I thought of an addition to make it more efficient,
+* which to my knowledge is entirely new. So we're either on the cutting
+* edge of computer technology or reinventing the wheel, I'm not sure
+* which!
+*
+*
+* Here's the method, with a partial explanation for how it works:
+*
+*
+* We want to approximate a value m (given as a decimal) between 0 and 1,
+* by a fraction Y/X. Think of fractions as vectors (denominator,
+* numerator), so that the slope of the vector is the value of the
+* fraction. We are then looking for a lattice vector (X, Y) whose slope
+* is as close as possible to m. This picture illustrates the goal, and
+* shows that, given two vectors A and B on opposite sides of the desired
+* slope, their sum A + B = C is a new vector whose slope is between the
+* two, allowing us to narrow our search:
+*
+* <pre>
+* num
+* ^
+* |
+* + + + + + + + + + + +
+* |
+* + + + + + + + + + + +
+* | slope m=0.7
+* + + + + + + + + + + + /
+* | /
+* + + + + + + + + + + D &lt;--- solution
+* | /
+* + + + + + + + + + /+ +
+* | /
+* + + + + + + + C/ + + +
+* | /
+* + + + + + + /+ + + + +
+* | /
+* + + + + B/ + + + + + +
+* | /
+* + + + /A + + + + + + +
+* | /
+* + +/ + + + + + + + + +
+* | /
+* +--+--+--+--+--+--+--+--+--+--+--&gt; denom
+* </pre>
+*
+*
+* Here we start knowing the goal is between A = (3,2) and B = (4,3), and
+* formed a new vector C = A + B. We test the slope of C and find that
+* the desired slope m is between A and C, so we continue the search
+* between A and C. We add A and C to get a new vector D = A + 2*B, which
+* in this case is exactly right and gives us the answer.
+*
+*
+* Given the vectors A and B, with slope(A) &lt; m &lt; slope(B),
+* we can find consecutive integers M and N such that
+* slope(A + M*B) &lt; m &lt; slope(A + N*B) in this way:
+*
+*
+* If A = (b, a) and B = (d, c), with a/b &lt; m &lt; c/d, solve
+*
+* <pre>
+* a + x*c
+* ------- = m
+* b + x*d
+* </pre>
+*
+*
+* to give
+*
+* <pre>
+* b*m - a
+* x = -------
+* c - d*m
+* </pre>
+*
+*
+* If this is an integer (or close enough to an integer to consider it
+* so), then A + x*B is our answer. Otherwise, we round it down and up to
+* get integer multipliers M and N respectively, from which new lower and
+* upper bounds A' = A + M*B and B' = A + N*B can be obtained. Repeat the
+* process until the slopes of the two vectors are close enough for the
+* desired accuracy. The process can be started with vectors (0,1), with
+* slope 0, and (1,1), with slope 1. Surprisingly, this process produces
+* exactly what continued fractions produce, and therefore it will
+* terminate at the desired fraction (in lowest terms, as far as I can
+* tell) if there is one, or when it is correct within the accuracy of
+* the original data.
+*
+*
+* For example, for the slope 0.7 shown in the picture above, we get
+* these approximations:
+*
+*
+* Step 1: A = 0/1, B = 1/1 (a = 0, b = 1, c = 1, d = 1)
+*
+* <pre>
+* 1 * 0.7 - 0 0.7
+* x = ----------- = --- = 2.3333
+* 1 - 1 * 0.7 0.3
+*
+* M = 2: lower bound A' = (0 + 2*1) / (1 + 2*1) = 2 / 3
+* N = 3: upper bound B' = (0 + 3*1) / (1 + 3*1) = 3 / 4
+* </pre>
+*
+*
+* Step 2: A = 2/3, B = 3/4 (a = 2, b = 3, c = 3, d = 4)
+*
+* <pre>
+* 3 * 0.7 - 2 0.1
+* x = ----------- = --- = 0.5
+* 3 - 4 * 0.7 0.2
+*
+* M = 0: lower bound A' = (2 + 0*3) / (3 + 0*4) = 2 / 3
+* N = 1: upper bound B' = (2 + 1*3) / (3 + 1*4) = 5 / 7
+* </pre>
+*
+*
+* Step 3: A = 2/3, B = 5/7 (a = 2, b = 3, c = 5, d = 7)
+*
+* <pre>
+* 3 * 0.7 - 2 0.1
+* x = ----------- = --- = 1
+* 5 - 7 * 0.7 0.1
+*
+* N = 1: exact value A' = B' = (2 + 1*5) / (3 + 1*7) = 7 / 10
+* </pre>
+*
+*
+* which of course is obviously right.
+*
+*
+* In most cases you will never get an exact integer, because of rounding
+* errors, but can stop when one of the two fractions is equal to the
+* goal to the given accuracy.
+*
+*
+* [...] Just to keep you up to date, I tried out my newly invented algorithm
+* and realized it lacked one or two things. Specifically, to make it
+* work right, you have to alternate directions, first adding A + N*B and
+* then N*A + B. I tested my program for all fractions with up to three
+* digits in numerator and denominator, then started playing with the
+* problem that affects you, namely how to handle imprecision in the
+* input. I haven't yet worked out the best way to allow for error, but
+* here is my C++ function (a member function in a Fraction class
+* implemented as { short num; short denom; } ) in case you need to go to
+* this algorithm.
+*
+*
+* [Edit [i_a]: tested a few stop criteria and precision settings;
+* found that you can easily allow the algorithm to use the full integer
+* value span: worst case iteration count was 21 - for very large prime
+* numbers in the denominator and a precision set at double.Epsilon.
+* Part of the code was stripped, then reinvented as I was working on a
+* proof for this system. For one, the reason to 'flip' the A/B treatment
+* (i.e. the 'i&1' odd/even branch) is this: the factor N, which will
+* be applied to the vector addition A + N*B is (1) an integer number to
+* ensure the resulting vector (i.e. fraction) is rational, and (2) is
+* determined by calculating the difference in direction between A and B.
+* When the target vector direction is very close to A, the difference
+* in *direction* (sort of an 'angle') is tiny, resulting in a tiny N
+* value. Because the value is rounded down, A will not change. B will,
+* but the number of iterations necessary to arrive at the final result
+* increase significantly when the 'odd/even' processing is not included.
+* Basically, odd/even processing ensures that once every second iteration
+* there will be a major change in direction for any target vector M.]
+*
+*
+* Edit [i_a]: further testing finds the empirical maximum
+* precision to be ~ 1.0E-13, IFF you use the new high/low precision
+* checks (simpler, faster) in the code (old checks have been commented out).
+* Higher precision values cause the code to produce huge fractions
+* which clearly show the effect of limited floating point accuracy.
+* Nevertheless, this is an impressive result.
+*
+* I also changed the loop: no more odd/even processing but now we're
+* looking for the biggest effect (i.e. change in direction) during EVERY
+* iteration: see the new x1:x2 comparison in the code below.
+* This will lead to a further reduction in the maximum number of iterations
+* but I haven't checked that number now. Should be less than 21,
+* I hope. ;-)
+*/
+
+
+double fract2dbl(const ham_fraction_t *src)
+{
+ ham_assert(src->denom != 0, (0));
+ return src->num / (double)src->denom;
+}
+
+void to_fract_w_prec(ham_fraction_t *dst, double val, double precision)
+{
+ ham_fraction_t low = {0, 1}; // "A" = 0/1 (a/b)
+ ham_fraction_t high = {1, 1}; // "B" = 1/1 (c/d)
+
+ // find nearest fraction
+ ham_u32_t intPart = (ham_u32_t)val;
+ val -= intPart;
+
+ for (;;)
+ {
+ double testLow;
+ double testHigh;
+ double x1;
+ double x2;
+
+ ham_assert(fract2dbl(&low) <= val, (0));
+ ham_assert(fract2dbl(&high) >= val, (0));
+
+ // b*m - a
+ // x = -------
+ // c - d*m
+ testLow = low.denom * val - low.num;
+ testHigh = high.num - high.denom * val;
+
+ // test for match:
+ //
+ // m - a/b < precision
+ //
+ // ==>
+ //
+ // b * m - a < b * precision
+ //
+ // which is happening here: check both the current A and B fractions.
+ //if (testHigh < high.denom * Precision)
+ if (testHigh < precision) // [i_a] speed improvement; this is even better for irrational 'val'
+ {
+ break; // high is answer
+ }
+ //if (testLow < low.denom * Precision)
+ if (testLow < precision) // [i_a] speed improvement; this is even better for irrational 'val'
+ {
+ // low is answer
+ high = low;
+ break;
+ }
+
+ x1 = testHigh / testLow;
+ x2 = testLow / testHigh;
+
+ // always choose the path where we find the largest change in direction:
+ if (x1 > x2)
+ {
+ ham_u32_t n;
+ ham_u32_t h_num;
+ ham_u32_t h_denom;
+ ham_u32_t l_num;
+ ham_u32_t l_denom;
+
+ //double x1 = testHigh / testLow;
+ // safety checks: are we going to be out of integer bounds?
+ if ((x1 + 1) * low.denom + high.denom >= (double)0xFFFFFFFF)
+ {
+ break;
+ }
+
+ n = (ham_u32_t)x1; // lower bound for m
+ //int m = n + 1; // upper bound for m
+
+ // a + x*c
+ // ------- = m
+ // b + x*d
+ h_num = n * low.num + high.num;
+ h_denom = n * low.denom + high.denom;
+
+ //ham_u32_t l_num = m * low.num + high.num;
+ //ham_u32_t l_denom = m * low.denom + high.denom;
+ l_num = h_num + low.num;
+ l_denom = h_denom + low.denom;
+
+ low.num = l_num;
+ low.denom = l_denom;
+ high.num = h_num;
+ high.denom = h_denom;
+ }
+ else
+ {
+ ham_u32_t n;
+ ham_u32_t h_num;
+ ham_u32_t h_denom;
+ ham_u32_t l_num;
+ ham_u32_t l_denom;
+
+ //double x2 = testLow / testHigh;
+ // safety checks: are we going to be out of integer bounds?
+ if (low.denom + (x2 + 1) * high.denom >= (double)0x7FFFFFFF)
+ {
+ break;
+ }
+
+ n = (ham_u32_t)x2; // lower bound for m
+ //ham_u32_t m = n + 1; // upper bound for m
+
+ // a + x*c
+ // ------- = m
+ // b + x*d
+ l_num = low.num + n * high.num;
+ l_denom = low.denom + n * high.denom;
+
+ //ham_u32_t h_num = low.num + m * high.num;
+ //ham_u32_t h_denom = low.denom + m * high.denom;
+ h_num = l_num + high.num;
+ h_denom = l_denom + high.denom;
+
+ high.num = h_num;
+ high.denom = h_denom;
+ low.num = l_num;
+ low.denom = l_denom;
+ }
+ ham_assert(fract2dbl(&low) <= val, (0));
+ ham_assert(fract2dbl(&high) >= val, (0));
+ }
+
+ high.num += high.denom * intPart;
+
+ *dst = high;
+}
+
+
+void to_fract(ham_fraction_t *dst, double val)
+{
+ to_fract_w_prec(dst, val, 1.0E-13 /* float.Epsilon */ );
+}
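+
+/*
+ * Usage sketch: feeding the 0.7 example from the comment block above
+ * through to_fract() should yield the fraction 7/10:
+ *
+ *     ham_fraction_t f;
+ *     to_fract(&f, 0.7);
+ *     // expected: f.num == 7, f.denom == 10 (within the 1.0E-13 precision)
+ */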
+
+#if 0
+
+void TestFraction(void)
+{
+ ham_fraction_t ret;
+ double vut;
+
+ vut = 0.1;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 0.99999997;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = (0x40000000 - 1.0) / (0x40000000 + 1.0);
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 1.0 / 3.0;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 1.0 / (0x40000000 - 1.0);
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 320.0 / 240.0;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 6.0 / 7.0;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 320.0 / 241.0;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 720.0 / 577.0;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 2971.0 / 3511.0;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 3041.0 / 7639.0;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 1.0 / sqrt(2);
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+ vut = 3.1415926535897932384626433832795 /* M_PI */;
+ to_fract(&ret, vut);
+ ham_assert(fabs(vut - fract2dbl(&ret)) < 1E-9, (0));
+}
+
+#endif
+
+
+
+
+
+
+
39 src/fraction.h
@@ -0,0 +1,39 @@
+/**
+ * Copyright (C) 2005-2008 Christoph Rupp (chris@crupp.de).
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * See files COPYING.* for License information.
+ */
+
+#ifndef HAM_FRACTION_H__
+#define HAM_FRACTION_H__
+
+#include <ham/types.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+typedef struct
+{
+ ham_u32_t num;
+ ham_u32_t denom;
+} ham_fraction_t;
+
+
+extern double fract2dbl(const ham_fraction_t *src);
+
+extern void to_fract_w_prec(ham_fraction_t *dst, double val, double precision);
+
+extern void to_fract(ham_fraction_t *dst, double val);
+
+#ifdef __cplusplus
+} // extern "C"
+#endif
+
+#endif
92 src/freelist_v2.c
@@ -0,0 +1,92 @@
+/**
+ * Copyright (C) 2005-2008 Christoph Rupp (chris@crupp.de).
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * See files COPYING.* for License information.
+ *
+ */
+
+#define IMPLEMENT_MODERN_FREELIST32
+
+#include "freelist.c"
+
+
+
+ham_status_t
+__freel_flush_stats32(ham_db_t *db)
+{
+ ham_assert(!(db_get_rt_flags(db)&HAM_IN_MEMORY_DB), (0));
+ ham_assert(db_get_freelist_cache(db), (0));
+
+ /*
+ do not update the statistics in a READ ONLY database!
+ */
+ if (!(db_get_rt_flags(db) & HAM_READ_ONLY))
+ {
+ freelist_cache_t *cache;
+ freelist_entry_t *entries;
+
+ cache=db_get_freelist_cache(db);
+ ham_assert(cache, (0));
+
+ entries = freel_cache_get_entries(cache);
+
+ if (entries && freel_cache_get_count(cache) > 0
+ && !db_is_mgt_mode_set(db_get_data_access_mode(db),
+ HAM_DAM_ENFORCE_PRE110_FORMAT))
+ {
+ /*
+ only persist the statistics when we're using a v1.1.0+ format DB
+ */
+ ham_size_t i;
+
+ for (i = freel_cache_get_count(cache); i-- > 0; )
+ {
+ freelist_entry_t *entry = entries + i;
+
+ if (freel_entry_statistics_is_dirty(entry))
+ {
+ freelist_payload_t *fp;
+ freelist_page_statistics_t *pers_stats;
+
+ if (!freel_entry_get_page_id(entry))
+ {
+ /* header page */
+ fp = db_get_freelist(db);
+ db_set_dirty(db);
+ }
+ else
+ {
+ /*
+ * otherwise just fetch the page from the cache or the disk
+ */
+ ham_page_t *page = db_fetch_page(db, freel_entry_get_page_id(entry), 0);
+ if (!page)
+ return (db_get_error(db));
+ fp = page_get_freelist(page);
+ ham_assert(freel_get_start_address(fp) != 0, (0));
+ page_set_dirty(page);
+ }
+
+ ham_assert(fp->_s._s32._zero == 0, (0));
+
+ pers_stats = freel_get_statistics32(fp);
+
+ ham_assert(sizeof(*pers_stats) == sizeof(*freel_entry_get_statistics(entry)), (0));
+ memcpy(pers_stats, freel_entry_get_statistics(entry), sizeof(*pers_stats));
+
+ /* and we're done persisting/flushing this entry */
+ freel_entry_statistics_reset_dirty(entry);
+ }
+ }
+ }
+ }
+
+ return (0);
+}
+
+
27 src/freelist_v2.h
@@ -0,0 +1,27 @@
+/**
+ * Copyright (C) 2005-2008 Christoph Rupp (chris@crupp.de).
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * See files COPYING.* for License information.
+ *
+ *
+ * freelist structures, functions and macros
+ *
+ */
+
+#ifndef HAM_FREELIST32_H__
+#define HAM_FREELIST32_H__
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#ifdef __cplusplus
+} // extern "C"
+#endif
+
+#endif /* HAM_FREELIST32_H__ */
2,716 src/statistics.c
@@ -0,0 +1,2716 @@
+/**
+ * Copyright (C) 2005-2008 Christoph Rupp (chris@crupp.de).
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * See files COPYING.* for License information.
+ *
+ */
+
+#include "config.h"
+
+#include <string.h>
+#include <ham/hamsterdb.h>
+#include <ham/hamsterdb_stats.h>
+#include "db.h"
+#include "endian.h"
+#include "freelist.h"
+#include "error.h"
+#include "btree_cursor.h"
+#include "btree.h"
+#include "util.h"
+#include "statistics.h"
+
+
+
+
+
+
+/*
+ * TODO statistics gatherer/hinter:
+ *
+ * keep track of two areas' 'utilization':
+ *
+ * 1) for fast/uberfast mode, keep track of the LAST free zone, i.e.
+ * the free zone at the end;
+ * ONLY move the start marker for that BACKWARDS when we get a freeing
+ * op just before it OR when we specifically scan backwards to find the
+ * adjusted start after lots of fragmented delete ops and we're not in
+ * turbo-fast mode: this would save space.
+ *
+ * 2) keep track of the marker where the FIRST free chunk just was,
+ * i.e. before which point there definitely is NO free space. Use this
+ * marker as the start for a free-space-search when in
+ * space-saving/classic mode; use the other 'start of free space at
+ * the end of the page' marker as the starting point for (uber-)fast
+ * searches.
+ *
+ * 'utilization': keep track of the number of free chunks and allocated
+ * chunks in the middle zone ~ the zone between FIRST and LAST marker:
+ * the ratio is a measure of the chance we expect to have when searching
+ * this zone for a free spot - by not coding/designing to cover a
+ * specific pathological case (add+delete @ start & end of store and
+ * repeating this cycle over and over causing the DB to 'jump all over
+ * the place' in classic mode; half of the free slot searches would be
+ * 'full scans' of the freelist then. Anyway, we do not wish to code for
+ * this specific pathological case, as such code will certainly
+ * introduce another pathological case instead, which should be fixed,
+ * resulting in expanded code and yet another pathological case fit for
+ * the new code situation, etc.etc. ad nauseam. Instead, we use
+ * statistical measures to express an estimate, i.e. the chance that we
+ * might need to scan a large portion of the freelist when we
+ * run in classic spacesaving insert mode, and apply that statistical
+ * data to the hinter, using the current mode.
+ *
+ * -- YES, that also means we are able to switch freelist scanning
+ * mode, and thus speed- versus storage-consumption hints, on a
+ * per-insert basis: a single database can mix slow but space-saving
+ * record inserts for those times / tables when we do not need the
+ * extra oomph, while other inserts can be told (using the flags in
+ * the API calls) to act optimized for
+ * - none (classic) --> ~ storage space saving
+ * - storage space saving
+ * - insertion speed
+ *
+ * By using 2 bits: 1 for speed and one for
+ * uber/turbo or regular, we can have 3 or 4 modes, where a 'speedy
+ * space saving' mode might imply we're using those freelist stats to
+ * decide whether to start the scan at the end or near the start of the
+ * freelist, in order to arrive at a 'reasonable space utilization'
+ * while keeping up the speed, at least when determined over multiple
+ * inserts.
+ *
+ * And mode 4 can be used to enforce full scan or something like that:
+ * this can be used to improve the statistics as those are not persisted
+ * on disc.
+ *
+ *
+ * The stats gatherer is delivering the most oomph, especially for tiny
+ * keys and records, where Boyer-Moore is not really effective (or even
+ * counter productive); gathering stats about the free slots and
+ * occupied slots helps us speed up multiple inserts, even while the
+ * data is only alive for 1 run-time open-close period of time.
+ *
+ *
+ * Make the cache counter code indirect, so we can switch and test
+ * various cache aging systems quickly.
+ *
+ *
+ *
+ * When loading a freelist page, we can use sampling to get an idea of
+ * where the LAST zone starts and ends (2 bsearches: one assuming the
+ * freelist is sorted in descending order --> last 1 bit, one assuming
+ * the freelist is sorted in ascending order (now that we 'know' the
+ * last free bit, this will scan the range 0..last-1-bit to find the
+ * first 1 bit in there)).
+ *
+ * Making sure we limit our # of samples irrespective of freelist page
+ * size, so we can use the same stats gatherer for classic and modern
+ * modes.
+ *
+ *
+ * perform such sampling using semi-random intervals: prevent being
+ * sensitive to a particular pathologic case this way.
+ */
+
+
+
+#define rescale_256(val) \
+ val = ((val + 256 - 1) >> 8) /* make sure non-zero numbers remain non-zero: roundup(x) */
+
+#define rescale_2(val) \
+ val = ((val + 2 - 1) >> 1) /* make sure non-zero numbers remain non-zero: roundup(x) */
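+
+/*
+ * Worked examples (illustrative): rescale_256(0) stays 0,
+ * rescale_256(1) stays 1 (the roundup keeps non-zero values non-zero)
+ * and rescale_256(1000) yields (1000 + 255) >> 8 == 4.
+ */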
+
+
+
+/**
+ * inline function (must be fast!) which calculates the smallest
+ * encompassing power-of-2 for the given value. The integer equivalent
+ * of roundup(log2(value)).
+ *
+ * Returned value range: 0..64
+ */
+static __inline ham_u16_t ham_log2(ham_u64_t v)
+{
+
+ // which would be faster? Duff style unrolled loop or (CPU cached) loop?
+#if 0
+
+ register ham_u64_t value = v;
+ register ham_u16_t power = !!value;
+
+#if 0
+#define HAM_LOG2_ONE_STAGE(value, power) \
+ value >>= 1; \
+ power+=!!value; /* no branching required; extra cost: always same \
+ * # of rounds --> quad+ amount of extra rounds --> \
+ * much slower! */
+#else
+#define HAM_LOG2_ONE_STAGE(value, power) \
+ value >>= 1; \
+ if (!value) break; \
+ power++;
+#endif
+
+#define HAM_LOG2_16_STAGES(value, power) \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power); \
+ HAM_LOG2_ONE_STAGE(value, power)
+
+ do
+ {
+ HAM_LOG2_16_STAGES(value, power);
+#if 0
+ HAM_LOG2_16_STAGES(value, power);
+ HAM_LOG2_16_STAGES(value, power);
+ HAM_LOG2_16_STAGES(value, power);
+#endif
+ } while (value);
+
+ return power;
+
+#else /* if 0 */
+
+ if (v)
+ {
+ register ham_u16_t power = 64;
+ register ham_s64_t value = (ham_s64_t)v;
+
+ /*
+ * test top bit by checking two's complement sign.
+ *
+ * This LOG2 is crafted to spend the least number of
+ * rounds inside the BM freelist bitarray scans.
+ */
+ while (!(value < 0))
+ {
+ power--;
+ value <<= 1;
+ }
+ return power;
+ }
+ return 0;
+
+#endif /* if 0 */
+}
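+
+/*
+ * Worked examples (illustrative): ham_log2(0) == 0, ham_log2(1) == 1,
+ * ham_log2(1000) == 10 and ham_log2(4096) == 13, i.e. the number of
+ * significant bits; ham_bitcount2bucket_index() below relies on
+ * exactly this behaviour.
+ */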
+
+/**
+ * inline function (must be fast!) which calculates the smallest
+ * encompassing power-of-16 for the given value. The integer equivalent
+ * of roundup(log16(value)).
+ *
+ * Returned value range: 0..16
+ */
+static __inline ham_u16_t ham_log16(ham_size_t v)
+{
+ register ham_size_t value = v;
+ register ham_u16_t power = !!value;
+
+ if (value)
+ {
+ do
+ {
+ power++;
+ } while (value >>= 4);
+ }
+
+ return power;
+}
+
+static __inline ham_u16_t ham_bitcount2bucket_index(ham_size_t size)
+{
+ ham_u16_t bucket = ham_log2(size);
+ if (bucket >= HAM_FREELIST_SLOT_SPREAD)
+ {
+ bucket = HAM_FREELIST_SLOT_SPREAD - 1;
+ }
+ return bucket;
+}
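+
+/*
+ * Worked values (assuming HAM_FREELIST_SLOT_SPREAD is large enough
+ * that no clamping occurs): size 0 -> bucket 0, 1 -> 1, 2..3 -> 2,
+ * 4..7 -> 3, 8..15 -> 4; anything beyond the spread lands in the
+ * last bucket.
+ */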
+
+/**
+ * inline function (must be fast!) which calculates the inverse of the
+ * ham_log2() above:
+ * converting a bucket index number to the maximum possible size for
+ * that bucket.
+ */
+static __inline ham_size_t ham_bucket_index2bitcount(ham_u16_t bucket)
+{
+ return (1U << bucket) - 1;
+}
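+
+/*
+ * Worked values: bucket 0 -> 0, 1 -> 1, 2 -> 3, 3 -> 7, 4 -> 15;
+ * every size that ham_bitcount2bucket_index() maps into bucket b is
+ * at most (1 << b) - 1 bits wide.
+ */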
+
+
+
+
+
+
+static void
+rescale_global_statistics(ham_db_t *db)
+{
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+ ham_u16_t b;
+
+ for (b = 0; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ rescale_256(globalstats->scan_count[b]);
+ rescale_256(globalstats->ok_scan_count[b]);
+ rescale_256(globalstats->scan_cost[b]);
+ rescale_256(globalstats->ok_scan_cost[b]);
+ //rescale_256(globalstats->first_page_with_free_space[b]);
+ }
+
+ rescale_256(globalstats->insert_count);
+ rescale_256(globalstats->delete_count);
+ rescale_256(globalstats->extend_count);
+ rescale_256(globalstats->fail_count);
+ rescale_256(globalstats->search_count);
+ rescale_256(globalstats->insert_query_count);
+ rescale_256(globalstats->erase_query_count);
+ rescale_256(globalstats->query_count);
+ rescale_256(globalstats->rescale_monitor);
+}
+
+
+static void
+rescale_freelist_page_stats(freelist_cache_t *cache, freelist_entry_t *entry)
+{
+ ham_u16_t b;
+ freelist_page_statistics_t *entrystats = freel_entry_get_statistics(entry);
+
+ for (b = 0; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ //rescale_256(entrystats->per_size[b].first_start);
+ //rescale_256(entrystats->per_size[b].free_fill);
+ rescale_256(entrystats->per_size[b].epic_fail_midrange);
+ rescale_256(entrystats->per_size[b].epic_win_midrange);
+ rescale_256(entrystats->per_size[b].scan_count);
+ rescale_256(entrystats->per_size[b].ok_scan_count);
+ rescale_256(entrystats->per_size[b].scan_cost);
+ rescale_256(entrystats->per_size[b].ok_scan_cost);
+ }
+
+ //rescale_256(entrystats->last_start);
+ //rescale_256(entrystats->persisted_bits);
+ rescale_256(entrystats->insert_count);
+ rescale_256(entrystats->delete_count);
+ rescale_256(entrystats->extend_count);
+ rescale_256(entrystats->fail_count);
+ rescale_256(entrystats->search_count);
+ rescale_256(entrystats->rescale_monitor);
+
+ freel_entry_statistics_set_dirty(entry);
+}
+
+void
+db_update_freelist_stats_fail(ham_db_t *db, freelist_entry_t *entry,
+ freelist_payload_t *f,
+ freelist_hints_t *hints)
+{
+ freelist_cache_t *cache = db_get_freelist_cache(db);
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+ freelist_page_statistics_t *entrystats = freel_entry_get_statistics(entry);
+ ham_size_t cost = hints->cost;
+
+ ham_u16_t bucket = ham_bitcount2bucket_index(hints->size_bits);
+ ham_u32_t position = entrystats->persisted_bits;
+
+ // should NOT use freel_get_max_bitsXX(f) here!
+ ham_assert(bucket < HAM_FREELIST_SLOT_SPREAD, (0));
+ ham_assert(!(db_get_rt_flags(db)&HAM_IN_MEMORY_DB), (0));
+ ham_assert(db_get_freelist_cache(db), (0));
+
+ freel_entry_statistics_set_dirty(entry);
+
+ if (globalstats->rescale_monitor >= HAM_STATISTICS_HIGH_WATER_MARK - cost)
+ {
+ /* rescale cache numbers! */
+ rescale_global_statistics(db);
+ }
+ globalstats->rescale_monitor += cost;
+
+ globalstats->fail_count++;
+ globalstats->search_count++;
+ globalstats->scan_cost[bucket] += cost;
+ globalstats->scan_count[bucket]++;
+
+ if (entrystats->rescale_monitor >= HAM_STATISTICS_HIGH_WATER_MARK - cost)
+ {
+ /* rescale cache numbers! */
+ rescale_freelist_page_stats(cache, entry);
+ }
+ entrystats->rescale_monitor += cost;
+
+ if (hints->startpos < entrystats->last_start)
+ {
+ /* we _did_ look in the midrange, but clearly we were not lucky there */
+ entrystats->per_size[bucket].epic_fail_midrange++;
+ }
+ entrystats->fail_count++;
+ entrystats->search_count++;
+ entrystats->per_size[bucket].scan_cost += cost;
+ entrystats->per_size[bucket].scan_count++;
+
+ /*
+ * only upgrade the fail-based start position to the very edge of
+ * the freelist page's occupied zone, when the edge is known
+ * (initialized).
+ */
+ if (!hints->aligned && position)
+ {
+ ham_u16_t b;
+ /*
+ * adjust the position to point at a free slot within the
+ * occupied zone which would produce such an outcome, i.e. a
+ * slot followed by too few free slots to satisfy this request.
+ *
+ * Hence we're saying there _is_ space (even when there may be
+ * none at all), but also that this free space is not large
+ * enough to suit us.
+ *
+ * Why this weird juggling? Because when the freelist is
+ * expanded as new (free) pages become registered, we will then
+ * have (a) sufficient free space and, most importantly, (b)
+ * made sure that the next search for available slots does NOT
+ * skip/ignore those last few free bits we still _may_ have in
+ * this preceding zone, which is a WIN when we're trying to
+ * save disc space.
+ */
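+ /*
+ * Worked example (derived from the code below): with
+ * persisted_bits = 1000, size_bits = 64 and _allocated_bits of
+ * at least 64, offset is clamped to 64 and position rewinds to
+ * 1000 - 63 = 937, the last spot where a 64-bit request could
+ * still have succeeded inside the occupied zone.
+ */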
+ ham_u32_t offset = entry->_allocated_bits;
+ if (offset > hints->size_bits)
+ {
+ offset = hints->size_bits;
+ }
+ if (position > offset - 1)
+ {
+ position -= offset - 1;
+ }
+ /*
+ * now we are at the first position within the freelist page
+ * where the reported FAIL for the given size_bits would happen,
+ * guaranteed.
+ */
+ for (b = bucket; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ if (entrystats->per_size[b].first_start < position)
+ {
+ entrystats->per_size[b].first_start = position;
+ }
+ /* also update buckets for larger chunks at the same time */
+ }
+
+ if (entrystats->last_start < position)
+ {
+ entrystats->last_start = position;
+ }
+ for (b = 0; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ ham_assert(entrystats->last_start >= entrystats->per_size[b].first_start, (0));
+ }
+ }
+}
+
+
+void
+db_update_freelist_stats(ham_db_t *db, freelist_entry_t *entry,
+ freelist_payload_t *f,
+ ham_u32_t position,
+ freelist_hints_t *hints)
+{
+ ham_u16_t b;
+ ham_size_t cost = hints->cost;
+
+ freelist_cache_t *cache = db_get_freelist_cache(db);
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+ freelist_page_statistics_t *entrystats = freel_entry_get_statistics(entry);
+
+ ham_u16_t bucket = ham_bitcount2bucket_index(hints->size_bits);
+ ham_assert(bucket < HAM_FREELIST_SLOT_SPREAD, (0));
+ ham_assert(!(db_get_rt_flags(db)&HAM_IN_MEMORY_DB), (0));
+ ham_assert(db_get_freelist_cache(db), (0));
+
+ freel_entry_statistics_set_dirty(entry);
+
+ if (globalstats->rescale_monitor >= HAM_STATISTICS_HIGH_WATER_MARK - cost)
+ {
+ /* rescale cache numbers! */
+ rescale_global_statistics(db);
+ }
+ globalstats->rescale_monitor += cost;
+
+ globalstats->search_count++;
+ globalstats->ok_scan_cost[bucket] += cost;
+ globalstats->scan_cost[bucket] += cost;
+ globalstats->ok_scan_count[bucket]++;
+ globalstats->scan_count[bucket]++;
+
+ if (entrystats->rescale_monitor >= HAM_STATISTICS_HIGH_WATER_MARK - cost)
+ {
+ /* rescale cache numbers! */
+ rescale_freelist_page_stats(cache, entry);
+ }
+ entrystats->rescale_monitor += cost;
+
+ if (hints->startpos < entrystats->last_start)
+ {
+ if (position < entrystats->last_start)
+ {
+ /* we _did_ look in the midrange, but clearly we were not lucky there */
+ entrystats->per_size[bucket].epic_fail_midrange++;
+ }
+ else
+ {
+ entrystats->per_size[bucket].epic_win_midrange++;
+ }
+ }
+ entrystats->search_count++;
+ entrystats->per_size[bucket].ok_scan_cost += cost;
+ entrystats->per_size[bucket].scan_cost += cost;
+ entrystats->per_size[bucket].ok_scan_count++;
+ entrystats->per_size[bucket].scan_count++;
+
+ /*
+ * since we get called here when we just found a suitably large
+ * free slot, that slot will be _gone_ for the next search, so we
+ * bump up our 'free slots to be found starting here'
+ * offset by size_bits, skipping the current space.
+ */
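+ /*
+ * Worked example: a hit at bit 100 for a size_bits of 8 bumps the
+ * hint to bit 108, just past the chunk we are about to hand out.
+ */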
+ position += hints->size_bits;
+
+ for (b = bucket; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ if (entrystats->per_size[b].first_start < position)
+ {
+ entrystats->per_size[b].first_start = position;
+ }
+ /* also update buckets for larger chunks at the same time */
+ }
+
+ if (entrystats->last_start < position)
+ {
+ entrystats->last_start = position;
+ }
+ for (b = 0; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ ham_assert(entrystats->last_start >= entrystats->per_size[b].first_start, (0));
+ }
+
+ if (entrystats->persisted_bits < position)
+ {
+ /* overflow? reset this marker! */
+ ham_assert(entrystats->persisted_bits == 0, ("Should not get here when not invoked from the [unit]tests!"));
+ if (hints->size_bits > entry->_allocated_bits)
+ {
+ entrystats->persisted_bits = position;
+ }
+ else
+ {
+ /* extra HACKY safety margin */
+ entrystats->persisted_bits = position - hints->size_bits + entry->_allocated_bits;
+ }
+ }
+}
+
+
+/*
+ * No need to check for rescaling in here; see the notes that go with
+ * 'cost_monitor' to know that these counter increments will always
+ * remain below the current high water mark and hence do not risk
+ * introducing integer overflow here.
+ *
+ * This applies to the edit, no_hit, and query stat update routines
+ * below.
+ */
+
+void
+db_update_freelist_stats_edit(ham_db_t *db, freelist_entry_t *entry,
+ freelist_payload_t *f,
+ ham_u32_t position,
+ ham_size_t size_bits,
+ ham_bool_t free_these,
+ ham_u16_t mgt_mode)
+{
+ freelist_cache_t *cache = db_get_freelist_cache(db);
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+ freelist_page_statistics_t *entrystats = freel_entry_get_statistics(entry);
+
+ ham_u16_t bucket = ham_bitcount2bucket_index(size_bits);
+ ham_assert(bucket < HAM_FREELIST_SLOT_SPREAD, (0));
+ ham_assert(!(db_get_rt_flags(db)&HAM_IN_MEMORY_DB), (0));
+ ham_assert(db_get_freelist_cache(db), (0));
+
+ freel_entry_statistics_set_dirty(entry);
+
+ if (free_these)
+ {
+ /*
+ * addition of free slots: delete, transaction abort or DB
+ * extend operation
+ *
+ * differentiate between them by checking if the new free zone
+ * is an entirely fresh addition or sitting somewhere in already
+ * used (recorded) space: extend or not?
+ */
+ ham_u16_t b;
+
+ ham_assert(entrystats->last_start >= entrystats->per_size[bucket].first_start, (0));
+ for (b = 0; b <= bucket; b++)
+ {
+ if (entrystats->per_size[b].first_start > position)
+ {
+ entrystats->per_size[b].first_start = position;
+ }
+ /* also update buckets for smaller chunks at the same time */
+ }
+
+ /* if we just freed the chunk immediately BEFORE 'last_start', why
+ * not merge them, eh? */
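+ /*
+ * Worked example: freeing bits 92..99 (position = 92,
+ * size_bits = 8) while last_start == 100 pulls last_start back
+ * to 92, fusing the freed chunk with the trailing free zone.
+ */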
+ if (entrystats->last_start == position + size_bits)
+ {
+ entrystats->last_start = position;
+
+ /* when we can adjust the last chunk, we should also adjust
+ * the start for bigger chunks... */
+ for (b = bucket + 1; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ if (entrystats->per_size[b].first_start > position)
+ {
+ entrystats->per_size[b].first_start = position;
+ }
+ /* also update buckets for smaller chunks at the same
+ * time */
+ }
+ }
+ for (b = 0; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ ham_assert(entrystats->last_start >= entrystats->per_size[b].first_start, (0));
+ }
+
+ position += size_bits;
+
+ /* if this is a 'free' for a newly created page, we'd need to
+ * adjust the outer edge */
+ if (entrystats->persisted_bits < position)
+ {
+ globalstats->extend_count++;
+
+ ham_assert(entrystats->last_start < position, (0));
+ entrystats->persisted_bits = position;
+ }
+ else
+ {
+ //ham_assert(entrystats->last_start >= position, (0));
+ globalstats->delete_count++;
+ }
+
+ ham_assert(entrystats->persisted_bits >= position, (0));
+
+ {
+ ham_u32_t entry_index = (ham_u32_t)(entry - freel_cache_get_entries(cache));
+
+ ham_assert(entry_index >= 0, (0));
+ ham_assert(entry_index < freel_cache_get_count(cache), (0));
+
+ for (b = 0; b <= bucket; b++)
+ {
+ if (globalstats->first_page_with_free_space[b] > entry_index)
+ {
+ globalstats->first_page_with_free_space[b] = entry_index;
+ }
+ /* also update buckets for smaller chunks at the same
+ * time */
+ }
+ }
+ }
+ else
+ {
+ ham_u16_t b;
+
+ /*
+ * occupation of free slots: insert or similar operation
+ */
+ position += size_bits;
+
+ for (b = bucket; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ if (entrystats->per_size[b].first_start < position)
+ {
+ entrystats->per_size[b].first_start = position;
+ }
+ /* also update buckets for larger chunks at the same time */
+ }
+
+ globalstats->insert_count++;
+
+ if (entrystats->last_start < position)
+ {
+ entrystats->last_start = position;
+ }
+ for (b = 0; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ ham_assert(entrystats->last_start >= entrystats->per_size[b].first_start, (0));
+ }
+
+ if (entrystats->persisted_bits < position)
+ {
+ /*
+ * the next is really a HACKY HACKY stop-gap measure:
+ * we see that the last_ever_seen offset has not been
+ * initialized (or was incorrectly initialized) up to now, so
+ * we guesstimate where it is, guessing on the safe side: we
+ * assume all free bits are situated past the current
+ * location, and shift the last_ever_seen position up
+ * accordingly
+ */
+ //globalstats->extend_count++;
+
+ ham_assert(entrystats->persisted_bits == 0, ("Should not get here when not invoked from the [unit]tests!"));
+ entrystats->persisted_bits = position + size_bits + entry->_allocated_bits;
+ }
+
+ /*
+ * maxsize within given bucket must still fit in the page, or
+ * it's useless checking this page again.
+ */
+ if (ham_bucket_index2bitcount(bucket) > freel_entry_get_allocated_bits(entry))
+ {
+ ham_u32_t entry_index = (ham_u32_t)(entry - freel_cache_get_entries(cache));
+
+ ham_assert(entry_index >= 0, (0));
+ ham_assert(entry_index < freel_cache_get_count(cache), (0));
+
+ /*
+ * We can update this number ONLY WHEN we have an
+ * allocation in the edge page; this is because we have
+ * modes where the freelist is checked in random order,
+ * and blindly updating the lower bound here would
+ * jeopardize the utilization of the DB.
+ *
+ * This applies to INCREMENTING the lower bound like we do
+ * here; we can ALWAYS DECREMENT the lower bound, as we do
+ * in the 'free_these' branch above.
+ */
+ if (globalstats->first_page_with_free_space[bucket] == entry_index)
+ {
+ for (b = bucket; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ if (globalstats->first_page_with_free_space[b] <= entry_index)
+ {
+ globalstats->first_page_with_free_space[b] = entry_index + 1;
+ }
+ /* also update buckets for smaller chunks at the
+ * same time */
+ }
+ }
+ }
+ }
+}
+
+
+
+
+void
+db_update_freelist_globalhints_no_hit(ham_db_t *db, freelist_entry_t *entry, freelist_hints_t *hints)
+{
+ freelist_cache_t *cache = db_get_freelist_cache(db);
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+
+ ham_u16_t bucket = ham_bitcount2bucket_index(hints->size_bits);
+ ham_u32_t entry_index = (ham_u32_t)(entry - freel_cache_get_entries(cache));
+
+ ham_assert(entry_index >= 0, (0));
+ ham_assert(entry_index < freel_cache_get_count(cache), (0));
+
+ ham_assert(hints->page_span_width >= 1, (0));
+
+ /*
+ * We can update this number ONLY WHEN we have an allocation in the
+ * edge page; this is because we have modes where the freelist is
+ * checked in random order, and blindly updating the lower bound
+ * here would jeopardize the utilization of the DB.
+ */
+ if (globalstats->first_page_with_free_space[bucket] == entry_index)
+ {
+ ham_u16_t b;
+
+ for (b = bucket; b < HAM_FREELIST_SLOT_SPREAD; b++)
+ {
+ if (globalstats->first_page_with_free_space[b] <= entry_index)
+ {
+ globalstats->first_page_with_free_space[b] = entry_index + hints->page_span_width;
+ }
+ /* also update buckets for smaller chunks at the same time */
+ }
+ }
+}
+
+
+
+void
+db_update_global_stats_find_query(ham_db_t *db, ham_size_t key_size)
+{
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+ ham_runtime_statistics_opdbdata_t *opstats = db_get_op_perf_data(db, HAM_OPERATION_STATS_FIND);
+
+ ham_u16_t bucket = ham_bitcount2bucket_index(key_size / DB_CHUNKSIZE);
+ ham_assert(bucket < HAM_FREELIST_SLOT_SPREAD, (0));
+ ham_assert(!(db_get_rt_flags(db)&HAM_IN_MEMORY_DB), (0));
+ ham_assert(db_get_freelist_cache(db), (0));
+
+ globalstats->query_count++;
+
+ opstats->query_count++;
+}
+
+
+void
+db_update_global_stats_insert_query(ham_db_t *db, ham_size_t key_size, ham_size_t record_size)
+{
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+ ham_runtime_statistics_opdbdata_t *opstats = db_get_op_perf_data(db, HAM_OPERATION_STATS_INSERT);
+
+ ham_u16_t bucket = ham_bitcount2bucket_index(key_size / DB_CHUNKSIZE);
+ ham_assert(bucket < HAM_FREELIST_SLOT_SPREAD, (0));
+ ham_assert(!(db_get_rt_flags(db)&HAM_IN_MEMORY_DB), (0));
+ ham_assert(db_get_freelist_cache(db), (0));
+
+ globalstats->insert_query_count++;
+
+ opstats->query_count++;
+}
+
+
+void
+db_update_global_stats_erase_query(ham_db_t *db, ham_size_t key_size)
+{
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+ ham_runtime_statistics_opdbdata_t *opstats = db_get_op_perf_data(db, HAM_OPERATION_STATS_ERASE);
+
+ ham_u16_t bucket = ham_bitcount2bucket_index(key_size / DB_CHUNKSIZE);
+ ham_assert(bucket < HAM_FREELIST_SLOT_SPREAD, (0));
+ ham_assert(!(db_get_rt_flags(db)&HAM_IN_MEMORY_DB), (0));
+ ham_assert(db_get_freelist_cache(db), (0));
+
+ globalstats->erase_query_count++;
+
+ opstats->query_count++;
+}
+
+
+
+
+/**
+ * This call assumes the 'dst' hint values have already been filled
+ * with some sane values before; this routine will update those values
+ * where it deems necessary.
+ *
+ * This function is called once for each operation that requires the
+ * use of the freelist: it gives hints about where in the ENTIRE
+ * FREELIST you'd wish to start searching. This means this hinter
+ * differs from the 'per entry' hinter below in that it provides
+ * freelist page indices instead of offsets: that last bit is the job
+ * of the 'per entry' hinter; our job here is to cut down on the
+ * number of freelist pages visited.
+ */
+void
+db_get_global_freelist_hints(freelist_global_hints_t *dst, ham_db_t *db)
+{
+ freelist_cache_t *cache = db_get_freelist_cache(db);
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+
+ ham_u32_t offset;
+ ham_u16_t bucket = ham_bitcount2bucket_index(dst->size_bits);
+ ham_assert(bucket < HAM_FREELIST_SLOT_SPREAD, (0));
+ ham_assert(dst, (0));
+ ham_assert(dst->skip_init_offset == 0, (0));
+ ham_assert(dst->skip_step == 1, (0));
+
+ {
+ static int c = 0;
+ c++;
+ if (c % 100000 == 999)
+ {
+ /*
+ what is our ratio fail vs. search?
+
+ Since we know search >= fail, we'll calculate the
+ reciprocal in integer arithmetic, as that one will be >= 1.0
+ */
+ if (globalstats->fail_count)
+ {
+ ham_u64_t fail_reciprocal_ratio = globalstats->search_count;
+ fail_reciprocal_ratio *= 1000;
+ fail_reciprocal_ratio /= globalstats->fail_count;
+
+ ham_trace(("GLOBAL FAIL/SEARCH ratio: %f", 1000.0/fail_reciprocal_ratio));
+ }
+ /*
+ and how about our scan cost per scan? and per good scan?
+ */
+ if (globalstats->scan_count[bucket])
+ {
+ ham_u64_t cost_per_scan = globalstats->scan_cost[bucket];
+ cost_per_scan *= 1000;
+ cost_per_scan /= globalstats->scan_count[bucket];
+
+ ham_trace(("GLOBAL COST/SCAN ratio: %f", cost_per_scan/1000.0));
+ }
+ if (globalstats->ok_scan_count[bucket])
+ {
+ ham_u64_t ok_cost_per_scan = globalstats->ok_scan_cost[bucket];
+ ok_cost_per_scan *= 1000;
+ ok_cost_per_scan /= globalstats->ok_scan_count[bucket];
+
+ ham_trace(("GLOBAL 'OK' COST/SCAN ratio: %f", ok_cost_per_scan/1000.0));
+ }
+ if (globalstats->erase_query_count
+ + globalstats->insert_query_count)
+ {
+ ham_u64_t trials_per_query = 0;
+ int i;
+
+ for (i = 0; i < HAM_FREELIST_SLOT_SPREAD; i++)
+ {
+ trials_per_query += globalstats->scan_count[i];
+ }
+ trials_per_query *= 1000;
+ trials_per_query /= globalstats->erase_query_count
+ + globalstats->insert_query_count;
+
+ ham_trace(("GLOBAL TRIALS/QUERY (INSERT + DELETE) ratio: %f", trials_per_query/1000.0));
+ }
+ }
+ }
+
+
+ /*
+ * improve our start position, when we know there's nothing to be
+ * found before a given minimum offset
+ */
+ offset = globalstats->first_page_with_free_space[bucket];
+ if (dst->start_entry < offset)
+ {
+ dst->start_entry = offset;
+ }
+
+ /*
+ if we are looking for space for a 'huge blob', i.e. a size which spans multiple
+ pages, we should let the caller know: round up the number of full pages that we'll
+ need for this one.
+ */
+ dst->page_span_width = (dst->size_bits + dst->freelist_pagesize_bits - 1) / dst->freelist_pagesize_bits;
+ ham_assert(dst->page_span_width >= 1, (0));
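+ /*
+ * e.g. a request of size_bits = 20000 against a
+ * freelist_pagesize_bits of 8192 yields a page_span_width of
+ * (20000 + 8191) / 8192 = 3 freelist pages.
+ */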
+
+ /*
+ * NOW that we have the range and everything to say things we are
+ * certain about, we can further improve things by introducing a bit
+ * of heuristics a.k.a. statistical mumbojumbo:
+ *
+ * when we're in UBER/FAST mode and SEQUENTIAL to boot, we only
+ * wish to look at the last chunk of free space and ignore the rest.
+ *
+ * When we're in UBER/FAST mode, CLASSIC style, we don't feel like
+ * wading through an entire freelist every time when we know already
+ * that utilization is such that our chances at finding a match are
+ * low, which means we'd rather turn this thing into SEQUENTIAL
+ * mode, maybe even SEQUENTIAL+UBER/FAST, for as long as the
+ * utilization is such that our chance at finding a match is still
+ * rather low.
+ */
+ switch (dst->mgt_mode & (HAM_DAM_SEQUENTIAL_INSERT
+ | HAM_DAM_RANDOM_WRITE_ACCESS
+ | HAM_DAM_FAST_INSERT))
+ {
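+ /*
+ * a note on the 'if (0) { case ...: }' construct below: each case
+ * label jumps straight to its own max_rounds assignment and then
+ * falls through into the shared tail of the switch; the dead
+ * 'if (0)' guards merely stop a preceding path from re-assigning
+ * max_rounds on its way down.
+ */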
+ /*
+ * SEQ+RANDOM_ACCESS: impossible mode; nasty trick for testing to
+ * help the Overflow4 unittest pass: disables global hinting, but
+ * does do a reverse scan for a bit of speed
+ */
+ case HAM_DAM_RANDOM_WRITE_ACCESS | HAM_DAM_SEQUENTIAL_INSERT:
+ dst->max_rounds = freel_cache_get_count(cache);
+ dst->mgt_mode &= ~HAM_DAM_RANDOM_WRITE_ACCESS;
+ if (0)
+ {
+ default:
+ // dst->max_rounds = freel_cache_get_count(cache);
+ dst->max_rounds = 32; /* speed up 'classic' for LARGE databases anyhow! */
+ }
+ if (0)
+ {
+ /*
+ * here's where we get fancy:
+ *
+ * We allow ourselves a bit of magic: for larger freelists, we
+ * cut down on the number of pages we'll probe during each
+ * operation, thus cutting down on freelist scanning/hinting
+ * work out there.
+ *
+ * The 'sensible' heuristic here is:
+ * for 'non-UBER/FAST' modes: a limit of 8 freelist pages,
+ *
+ * for 'UBER/FAST' modes: a limit of 3 freelist pages tops.
+ */
+ case HAM_DAM_SEQUENTIAL_INSERT:
+ case HAM_DAM_RANDOM_WRITE_ACCESS:
+ dst->max_rounds = 8;
+ }
+ if (0)
+ {
+ case HAM_DAM_FAST_INSERT:
+ case HAM_DAM_RANDOM_WRITE_ACCESS | HAM_DAM_FAST_INSERT:
+ case HAM_DAM_SEQUENTIAL_INSERT | HAM_DAM_FAST_INSERT:
+ dst->max_rounds = 3;
+ }
+ if (dst->max_rounds >= freel_cache_get_count(cache))
+ {
+ dst->max_rounds = freel_cache_get_count(cache);
+ }
+ else
+ {
+ /*
+ * and to facilitate an 'even distribution' of the freelist
+ * entries being scanned, we hint the scanner should use a
+ * SRNG (semi random number generator) approach by using the
+ * principle of a prime-modulo SRNG, where the next value is
+ * calculated using an increment which is mutually prime with
+ * the freelist entry count, followed by a modulo operation.
+ *
+ * _WE_ need to tweak that a bit, as looking at any freelist
+ * entries before the starting index is useless: we already
+ * know those entries don't carry sufficient free space
+ * anyhow. Nevertheless we don't need to be very mindful
+ * about it; we'll be using a large prime for the semi-random
+ * generation of the next freelist entry index, so all we've
+ * got to do is make sure we've got our 'size' MODULO correct
+ * when we use this hinting data.
+ *
+ * 295075153: we happen to have this large prime, which we'll
+ * assume will be larger than any sane freelist entry list
+ * will ever get in this millennium ;-) so using it for the
+ * mutually-prime increment in here will be fine.
+ *
+ * (Incidentally, we say 'multiplier', but we really use it
+ * as an adder, which is perfectly fine: any (A+B) MOD C
+ * operation cycles through all C values when B is mutually
+ * prime to C, assuming a constant A. Applied repeatedly, the
+ * resulting numbers therefore deliver a rather flat
+ * distribution over C, particularly when B is suitably large
+ * compared to C; that last bit is not mandatory, but it
+ * generally makes for a more semi-random skipping pattern.)
+ */
+ dst->skip_step = 295075153;
+ /*
+ * The init_offset is just a number to break the repetitiveness
+ * of the generated pattern; in SRNG terms, this is the seed.
+ *
+ * We re-use the statistics counts here as a 'noisy source' for
+ * our seed. Note that we use the fail_count only: all this
+ * randomization is fine and dandy, but we don't want it to help
+ * thrash the page cache, so the freelist page entry probe
+ * pattern should remain the same until a probe FAILs; only then
+ * do we really need to change the pattern.
+ */
+ dst->skip_init_offset = globalstats->fail_count;
+ }
+ break;
+ }
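+
+ /*
+ * A minimal sketch of how a scanner might consume these hints;
+ * 'probe_entry_sketch' and 'count' are hypothetical and not part
+ * of this function. Since the prime skip_step is mutually prime
+ * to any sane entry count, every entry index in 0..count-1 is
+ * visited at most once per cycle:
+ *
+ * ham_u32_t i = dst->skip_init_offset % count;
+ * ham_u32_t n;
+ * for (n = 0; n < dst->max_rounds; n++) {
+ * probe_entry_sketch(i);
+ * i = (i + dst->skip_step) % count;
+ * }
+ */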
+
+ /*
+ To accommodate 'huge blob' free space searches which span multiple
+ freelist entries, we set up the init and step here to match that of
+ a Boyer-Moore search method.
+
+ Yes, this means this code has intimate knowledge of the 'huge blob
+ free space search' caller, i.e. the algorithm used when
+
+ dst->page_span_width > 1
+
+ and I agree it's nasty, but this way the outer call's code is more
+ straightforward in handling both the regular, BM-assisted full scan
+ of the freelist AND the faster 'skipping' mode(s) possible here
+ (e.g. the UBER-FAST search mode where only part of the freelist
+ will be sampled for each request).
+ */
+ if (dst->skip_step < dst->page_span_width)
+ {
+ /*
+ set up for BM: init = one step ahead minus 1, as we check the LAST
+ entry of a span instead of the FIRST, and skip = span so we jump
+ over the freelist according to the BM plan: no hit on the sample
+ means the next possible spot starts at sample current + span.
+ */
+ dst->skip_init_offset = dst->page_span_width - 1;
+ dst->skip_step = dst->page_span_width;
+ }
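+ /*
+ * e.g. a page_span_width of 4 yields skip_init_offset = 3 and
+ * skip_step = 4: the scanner samples entries 3, 7, 11, ..., i.e.
+ * the LAST entry of each candidate span, Boyer-Moore style.
+ */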
+}
+
+
+
+/*
+ * This call assumes the 'dst' hint values have already been filled
+ * with some sane values before; this routine will update those values
+ * where it deems necessary.
+ */
+void
+db_get_freelist_entry_hints(freelist_hints_t *dst, ham_db_t *db, freelist_entry_t *entry)
+{
+ ham_runtime_statistics_globdata_t *globalstats = db_get_global_perf_data(db);
+ freelist_page_statistics_t *entrystats = freel_entry_get_statistics(entry);
+
+ ham_u32_t offset;
+ ham_u16_t bucket = ham_bitcount2bucket_index(dst->size_bits);
+ ham_assert(bucket < HAM_FREELIST_SLOT_SPREAD, (0));
+ ham_assert(dst, (0));
+
+ /*
+ * we can decide to 'up' the skip/probe_step size in the hints when
+ * we find out we're running into a lot of fragmentation, i.e.
+ * lots of free slot hints which don't lead to a perfect hit.
+ *
+ * By bumping up the probe_step distance, we can also 'upgrade' our
+ * start offset to come from the next bucket: the one meant for the
+ * bigger boys out there.
+ */
+
+ {
+ static int c = 0;
+ c++;
+ if (c % 100000 == 999)
+ {
+ /*
+ what is our ratio fail vs. search?
+
+ Since we know search >= fail, we'll calculate the
+ reciprocal in integer arithmetic, as that one will be >= 1.0
+ */
+ if (globalstats->fail_count)
+ {
+ ham_u64_t fail_reciprocal_ratio = globalstats->search_count;
+ fail_reciprocal_ratio *= 1000;
+ fail_reciprocal_ratio /= globalstats->fail_count;
+
+ ham_trace(("FAIL/SEARCH ratio: %f", 1000.0/fail_reciprocal_ratio));
+ }
+ /*
+ and how about our scan cost per scan? and per good scan?
+ */
+ if (globalstats->scan_count[bucket])
+ {
+ ham_u64_t cost_per_scan = globalstats->scan_cost[bucket];
+ cost_per_scan *= 1000;
+ cost_per_scan /= globalstats->scan_count[bucket];
+
+ ham_trace(("COST/SCAN ratio: %f", cost_per_scan/1000.0));
+ }
+ if (globalstats->ok_scan_count[bucket])
+ {
+ ham_u64_t ok_cost_per_scan = globalstats->ok_scan_cost[bucket];
+ ok_cost_per_scan *= 1000;
+ ok_cost_per_scan /= globalstats->ok_scan_count[bucket];
+
+ ham_trace(("'OK' COST/SCAN ratio: %f", ok_cost_per_scan/1000.0));
+ }
+ if (globalstats->erase_query_count
+ + globalstats->insert_query_count)
+ {
+ ham_u64_t trials_per_query = 0;
+ int i;
+
+ for (i = 0; i < HAM_FREELIST_SLOT_SPREAD; i++)
+ {
+ trials_per_query += globalstats->scan_count[i];
+ }
+ trials_per_query *= 1000;
+ trials_per_query /= globalstats->erase_query_count
+ + globalstats->insert_query_count;
+
+ ham_trace(("TRIALS/QUERY (INSERT + DELETE) ratio: %f", trials_per_query/1000.0));
+ }
+
+
+ /*
+ what is our FREELIST PAGE's ratio fail vs. search?
+
+ Since we know search >= fail, we'll calculate the
+ reciprocal in integer arithmetic, as that one will be >= 1.0
+ */
+ if (entrystats->fail_count)
+ {
+ ham_u64_t fail_reciprocal_ratio = entrystats->search_count;
+ fail_reciprocal_ratio *= 1000;
+ fail_reciprocal_ratio /= entrystats->fail_count;
+
+ ham_trace(("PAGE FAIL/SEARCH ratio: %f", 1000.0/fail_reciprocal_ratio));
+ }
+ /*
+ and how about our scan cost per scan? and per good scan?
+ */
+ if (entrystats->per_size[bucket].scan_count)
+ {
+ ham_u64_t cost_per_scan = entrystats->per_size[bucket].scan_cost;
+ cost_per_scan *= 1000;
+ cost_per_scan /= entrystats->per_size[bucket].scan_count;
+
+ ham_trace(("PAGE COST/SCAN ratio: %f", cost_per_scan/1000.0));
+ }
+ if (entrystats->per_size[bucket].ok_scan_count)
+ {
+ ham_u64_t ok_cost_per_scan = entrystats->per_size[bucket].ok_scan_cost;
+ ok_cost_per_scan *= 1000;
+ ok_cost_per_scan /= entrystats->per_size[bucket].ok_scan_count;
+
+ ham_trace(("PAGE 'OK' COST/SCAN ratio: %f", ok_cost_per_scan/1000.0));
+ }
+ }
+ }
+
+ ham_assert(entrystats->last_start >= entrystats->per_size[bucket].first_start, (0));
+ ham_assert(entrystats->persisted_bits >= entrystats->last_start, (0));
+
+ /*
+ * improve our start position, when we know there's nothing to be
+ * found before a given minimum offset
+ */
+ offset = entrystats->per_size[bucket].first_start;
+ if (dst->startpos < offset)
+ {
+ dst->startpos = offset;
+ }
+
+ offset = entrystats->persisted_bits;