Document index structure #2

pmezard · 2015-10-10T13:22:51Z

Here is a sketch of my understanding of the index structure. As mentioned in the commit message, I usually avoid document internals but core data structures help understanding and communicating a lot.

As usual I second-guessed everything so please correct any mistake. Search algorithms are not mentioned either as I did not check the code exhaustively, only just enough to avoid talking complete nonsense.

While I am there, a couple of questions, some I asked on IRC already:

Is there any gain in duplicating the field index and position in term frequency vectors? The field index is stored in the row key and the position might be deduced from the element index right? Or maybe there is no gain in removing them either.
What is the purpose of PrefixCodedInt64? It is reminiscent to stuff I understood lucene was doing to index and query numeric range, by refining requests precision while searching, but that is not completely clear to me. Also, the

nChars := (((63 - shift) * 37) >> 8) + 1

is cryptic to me. I would have expected something like:

nChars := ((63 - shift + 6)/7

since the high order bit is discarded and remaining ones split on 7-bits boundaries.

I usually avoid documenting project internals but I believe input and output data structures shape everything, speed up understanding of the code and help communication. Also, they do not change too often and such changes big events, which makes it less likely to forget updating this document.

mschoch · 2015-10-10T22:59:13Z

content/docs/Index-Structure.md

+table. The store must be able to enumerate its entries starting at a given key,
+in lexicographic byte order.
+
+Index key and values are handled as byte arrays and are called rows (see `index/upside_down/row.go` for details). To store different row types in a single table, bleve prefixes their keys with a single byte, for instance the term frequency keys start with a 'f'. The following sections describe the data structures stored in the index with pseudo-Go code for value layout.


term frequencies rows are prefixed with 't' not 'f'. field definition rows are prefixed with 'f'.

woops, fixed (don't worry if you read "fixed" without a fix, I use that as a reminder, the commit will come eventually).

mschoch · 2015-10-10T23:15:23Z

The reason we also store field in the term vectors, even though its duplicated from information already in the key is because of composite fields. So for example, when we search without specifying a field, we might get a match for the field "_all". So in the term frequency row, the key contains the numeric value corresponding to that field. But that is not useful if we want to display/highlight the results, we still need to know which field the term usage actually occurred in. So, in the term vector information, we keep the actual field it occurred in and not "_all".

mschoch · 2015-10-10T23:21:10Z

Regarding the PrefixCodedInt64, yes its used to encode a single number as multiple terms, at different precisions. The allow a numeric range search to scan fewer rows, though the tuning of this is not very optimal at the moment (I think we way-over-generate terms currently, but I haven't measured it).

Regarding the exact formula in question, it was a port from this Lucene function: https://github.com/apache/lucene-solr/blob/d31c9525b9607efd32cd92d47d1a6e6457bb84d5/lucene/core/src/java/org/apache/lucene/util/NumericUtils.java#L144-L161

pmezard · 2015-10-11T08:43:18Z

I corrected the bit about composite fields (learning what they are at the same time...). Thanks for PrefixCodedInt link.

I do not know how you handle pull-request adjustments. I know I like to review intermediate changes rather than reevaluating everything at each round, at the price of cluttering history. So if you prefer I amend the first changeset I can do that, or if you want me to squash everything when we are done, no problem (you can even do it yourself when merging).

pmezard · 2015-10-11T08:47:00Z

// i/7 is the same as (i*37)>>8 for i in 0..63

Yeah right, I mostly figured that except for the part that it is exact on 0..63. Still, I wonder how useful is this kind of trick. What I know is it is completely obscure.

mschoch · 2015-10-12T17:46:57Z

I agree it isn't obvious, seems we should either include the comment or switch to the /7 version.

At the moment I'm in favor of switching it to /7 as when I benchmark it I got the following:

with benchtime=1s

benchmark                      old ns/op     new ns/op     delta
BenchmarkTestPrefixCoded-4     13750         13755         +0.04%

with benchtime=10s

benchmark                      old ns/op     new ns/op     delta
BenchmarkTestPrefixCoded-4     13965         13692         -1.95%

and with benchtime=1m

benchmark                      old ns/op     new ns/op     delta
BenchmarkTestPrefixCoded-4     13659         13585         -0.54%

I'm not sold its really faster, but it doesn't seem like its slower either.

Also, this was with: nChars := ((63 - shift) / 7) + 1

but you had proposed: nChars := ((63 - shift + 6)/7

Can you explain?

pmezard · 2015-10-12T18:14:25Z

I assume this line computes the number of "7-bits" bytes required to store (63 - shift) bits. If this is the case, current code is incorrect as shown by:

https://play.golang.org/p/zcAtpPNdJG

See the case "bits: 7", where the current version allocates 2 bytes instead of 1.

But :

I can be wrong on what is being computed
Correctness does not matter anymore, changing the code would break compatibity with existing data.

Your version is equivalent to the current one.

mschoch · 2015-10-12T18:43:42Z

I think the issue is that the original starts with 63, not 64. If you have a 64 bit value, and shift 63 bits, you still need 1 byte to store the result. In your case nchars is 0:

    bits:  0, current:  1, marty:  1, patrick:  0

If I change yours to patrick := (64 - shift + 6) / 7 it matches for all values.

pmezard · 2015-10-12T19:11:42Z

Yes, I was confused by:

sortableBits := int64(uint64(in) ^ 0x8000000000000000)

I read that like a &^ discarding the high-order bit for some unfathomable reason. I suppose flipping it ensures bit lexicographic order is preserved for negative values.

based on discussion here: blevesearch/blevesearch.github.io-hugo#2

Document index structure

pmezard added 2 commits October 9, 2015 16:43

Add .gitignore and ignore generated public/ directory

6ef2004

mschoch reviewed Oct 10, 2015
View reviewed changes

docs: fix index doc errors, clarify term vectors usage

53c90f5

mschoch added a commit to blevesearch/bleve that referenced this pull request Oct 12, 2015

simplify prefix coding

f43fa42

based on discussion here: blevesearch/blevesearch.github.io-hugo#2

mschoch added a commit that referenced this pull request Oct 12, 2015

Merge pull request #2 from pmezard/document-index-structure

478003f

Document index structure

mschoch merged commit 478003f into blevesearch:master Oct 12, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document index structure #2

Document index structure #2

pmezard commented Oct 10, 2015

mschoch Oct 10, 2015

pmezard Oct 11, 2015

mschoch commented Oct 10, 2015

mschoch commented Oct 10, 2015

pmezard commented Oct 11, 2015

pmezard commented Oct 11, 2015

mschoch commented Oct 12, 2015

pmezard commented Oct 12, 2015

mschoch commented Oct 12, 2015

pmezard commented Oct 12, 2015

Document index structure #2

Document index structure #2

Conversation

pmezard commented Oct 10, 2015

mschoch Oct 10, 2015

Choose a reason for hiding this comment

pmezard Oct 11, 2015

Choose a reason for hiding this comment

mschoch commented Oct 10, 2015

mschoch commented Oct 10, 2015

pmezard commented Oct 11, 2015

pmezard commented Oct 11, 2015

mschoch commented Oct 12, 2015

pmezard commented Oct 12, 2015

mschoch commented Oct 12, 2015

pmezard commented Oct 12, 2015