Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document index structure #2

Merged
merged 3 commits into from
Oct 12, 2015

Conversation

pmezard
Copy link
Contributor

@pmezard pmezard commented Oct 10, 2015

Here is a sketch of my understanding of the index structure. As mentioned in the commit message, I usually avoid document internals but core data structures help understanding and communicating a lot.

As usual I second-guessed everything so please correct any mistake. Search algorithms are not mentioned either as I did not check the code exhaustively, only just enough to avoid talking complete nonsense.

While I am there, a couple of questions, some I asked on IRC already:

  • Is there any gain in duplicating the field index and position in term frequency vectors? The field index is stored in the row key and the position might be deduced from the element index right? Or maybe there is no gain in removing them either.
  • What is the purpose of PrefixCodedInt64? It is reminiscent to stuff I understood lucene was doing to index and query numeric range, by refining requests precision while searching, but that is not completely clear to me. Also, the
nChars := (((63 - shift) * 37) >> 8) + 1 

is cryptic to me. I would have expected something like:

nChars := ((63 - shift + 6)/7

since the high order bit is discarded and remaining ones split on 7-bits boundaries.

I usually avoid documenting project internals but I believe input and
output data structures shape everything, speed up understanding of the
code and help communication. Also, they do not change too often and such
changes big events, which makes it less likely to forget updating this
document.
table. The store must be able to enumerate its entries starting at a given key,
in lexicographic byte order.

Index key and values are handled as byte arrays and are called rows (see `index/upside_down/row.go` for details). To store different row types in a single table, bleve prefixes their keys with a single byte, for instance the term frequency keys start with a 'f'. The following sections describe the data structures stored in the index with pseudo-Go code for value layout.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

term frequencies rows are prefixed with 't' not 'f'. field definition rows are prefixed with 'f'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops, fixed (don't worry if you read "fixed" without a fix, I use that as a reminder, the commit will come eventually).

@mschoch
Copy link
Member

mschoch commented Oct 10, 2015

The reason we also store field in the term vectors, even though its duplicated from information already in the key is because of composite fields. So for example, when we search without specifying a field, we might get a match for the field "_all". So in the term frequency row, the key contains the numeric value corresponding to that field. But that is not useful if we want to display/highlight the results, we still need to know which field the term usage actually occurred in. So, in the term vector information, we keep the actual field it occurred in and not "_all".

@mschoch
Copy link
Member

mschoch commented Oct 10, 2015

Regarding the PrefixCodedInt64, yes its used to encode a single number as multiple terms, at different precisions. The allow a numeric range search to scan fewer rows, though the tuning of this is not very optimal at the moment (I think we way-over-generate terms currently, but I haven't measured it).

Regarding the exact formula in question, it was a port from this Lucene function: https://github.com/apache/lucene-solr/blob/d31c9525b9607efd32cd92d47d1a6e6457bb84d5/lucene/core/src/java/org/apache/lucene/util/NumericUtils.java#L144-L161

@pmezard
Copy link
Contributor Author

pmezard commented Oct 11, 2015

I corrected the bit about composite fields (learning what they are at the same time...). Thanks for PrefixCodedInt link.

I do not know how you handle pull-request adjustments. I know I like to review intermediate changes rather than reevaluating everything at each round, at the price of cluttering history. So if you prefer I amend the first changeset I can do that, or if you want me to squash everything when we are done, no problem (you can even do it yourself when merging).

@pmezard
Copy link
Contributor Author

pmezard commented Oct 11, 2015

// i/7 is the same as (i*37)>>8 for i in 0..63

Yeah right, I mostly figured that except for the part that it is exact on 0..63. Still, I wonder how useful is this kind of trick. What I know is it is completely obscure.

@mschoch
Copy link
Member

mschoch commented Oct 12, 2015

I agree it isn't obvious, seems we should either include the comment or switch to the /7 version.

At the moment I'm in favor of switching it to /7 as when I benchmark it I got the following:

with benchtime=1s

benchmark                      old ns/op     new ns/op     delta
BenchmarkTestPrefixCoded-4     13750         13755         +0.04%

with benchtime=10s

benchmark                      old ns/op     new ns/op     delta
BenchmarkTestPrefixCoded-4     13965         13692         -1.95%

and with benchtime=1m

benchmark                      old ns/op     new ns/op     delta
BenchmarkTestPrefixCoded-4     13659         13585         -0.54%

I'm not sold its really faster, but it doesn't seem like its slower either.

Also, this was with: nChars := ((63 - shift) / 7) + 1

but you had proposed: nChars := ((63 - shift + 6)/7

Can you explain?

@pmezard
Copy link
Contributor Author

pmezard commented Oct 12, 2015

I assume this line computes the number of "7-bits" bytes required to store (63 - shift) bits. If this is the case, current code is incorrect as shown by:

https://play.golang.org/p/zcAtpPNdJG

See the case "bits: 7", where the current version allocates 2 bytes instead of 1.

But :

  • I can be wrong on what is being computed
  • Correctness does not matter anymore, changing the code would break compatibity with existing data.

Your version is equivalent to the current one.

@mschoch
Copy link
Member

mschoch commented Oct 12, 2015

I think the issue is that the original starts with 63, not 64. If you have a 64 bit value, and shift 63 bits, you still need 1 byte to store the result. In your case nchars is 0:

    bits:  0, current:  1, marty:  1, patrick:  0

If I change yours to patrick := (64 - shift + 6) / 7 it matches for all values.

@pmezard
Copy link
Contributor Author

pmezard commented Oct 12, 2015

Yes, I was confused by:

sortableBits := int64(uint64(in) ^ 0x8000000000000000)

I read that like a &^ discarding the high-order bit for some unfathomable reason. I suppose flipping it ensures bit lexicographic order is preserved for negative values.

mschoch added a commit to blevesearch/bleve that referenced this pull request Oct 12, 2015
mschoch added a commit that referenced this pull request Oct 12, 2015
@mschoch mschoch merged commit 478003f into blevesearch:master Oct 12, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants