Document index structure #2
I usually avoid documenting project internals, but I believe input and output data structures shape everything, speed up understanding of the code, and help communication. Also, they do not change often, and such changes are big events, which makes it less likely that someone forgets to update this document.
table. The store must be able to enumerate its entries starting at a given key,
in lexicographic byte order.
Index keys and values are handled as byte arrays and are called rows (see `index/upside_down/row.go` for details). To store different row types in a single table, bleve prefixes their keys with a single byte, for instance the term frequency keys start with a 'f'. The following sections describe the data structures stored in the index with pseudo-Go code for value layout.
term frequencies rows are prefixed with 't' not 'f'. field definition rows are prefixed with 'f'.
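To make the prefixing scheme concrete, here is a minimal sketch using the prefix bytes from the correction above ('t' for term frequency rows, 'f' for field definition rows). The helper names `prefixedKey` and `scanFrom` and the key bodies are hypothetical and are not bleve's API; the in-memory sort only imitates a store that enumerates entries from a given key in lexicographic byte order.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Prefix bytes as stated in the review comment. The constant names
// are hypothetical and chosen for this sketch only.
const (
	prefixField         byte = 'f'
	prefixTermFrequency byte = 't'
)

// prefixedKey returns a new key made of the row-type byte followed by body.
func prefixedKey(prefix byte, body []byte) []byte {
	key := make([]byte, 0, len(body)+1)
	key = append(key, prefix)
	return append(key, body...)
}

// scanFrom returns the keys >= start in lexicographic byte order,
// imitating the enumeration the store is expected to provide.
func scanFrom(keys [][]byte, start []byte) [][]byte {
	sorted := append([][]byte(nil), keys...)
	sort.Slice(sorted, func(i, j int) bool {
		return bytes.Compare(sorted[i], sorted[j]) < 0
	})
	i := sort.Search(len(sorted), func(i int) bool {
		return bytes.Compare(sorted[i], start) >= 0
	})
	return sorted[i:]
}

func main() {
	// Key bodies are placeholders, not the real row layouts.
	keys := [][]byte{
		prefixedKey(prefixTermFrequency, []byte("beer")),
		prefixedKey(prefixField, []byte{0, 0}),
		prefixedKey(prefixTermFrequency, []byte("ale")),
	}
	// Starting the scan at "t" skips the field rows entirely.
	for _, k := range scanFrom(keys, []byte{prefixTermFrequency}) {
		fmt.Printf("%q\n", k)
	}
}
```

Because all rows of one type share the same first byte, they sit next to each other in the store, so a scan starting at the prefix visits only that row type until the prefix changes.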
woops, fixed (don't worry if you read "fixed" without a fix, I use that as a reminder, the commit will come eventually).
The reason we also store the field in the term vectors, even though it duplicates information already in the key, is composite fields. For example, when we search without specifying a field, we might get a match for the field "_all". In the term frequency row, the key then contains the numeric value corresponding to that field. But that is not useful if we want to display or highlight the results: we still need to know which field the term actually occurred in. So, in the term vector information, we keep the actual field it occurred in and not "_all".
Regarding the PrefixCodedInt64, yes, it's used to encode a single number as multiple terms, at different precisions. These allow a numeric range search to scan fewer rows, though the tuning of this is not very optimal at the moment (I think we way over-generate terms currently, but I haven't measured it). Regarding the exact formula in question, it was a port from this Lucene function: https://github.com/apache/lucene-solr/blob/d31c9525b9607efd32cd92d47d1a6e6457bb84d5/lucene/core/src/java/org/apache/lucene/util/NumericUtils.java#L144-L161
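As an illustration of "one number, several terms at different precisions", here is a rough sketch modeled on the linked Lucene function rather than on bleve's actual code. The function names, the header base `0x20`, and the 4-bit precision step are assumptions made for this example.

```go
package main

import "fmt"

// shiftStart is the base of the header byte. Assumption: same value
// as in the linked Lucene code.
const shiftStart byte = 0x20

// prefixCoded encodes v with its `shift` lowest bits dropped: one
// header byte recording the shift, then the remaining bits, 7 per
// byte, most significant group first. shift must be in 0..63.
func prefixCoded(v int64, shift uint) []byte {
	nChars := (63-shift)/7 + 1
	out := make([]byte, nChars+1)
	out[0] = shiftStart + byte(shift)
	// Flip the sign bit so that byte order matches numeric order.
	bits := (uint64(v) ^ 0x8000000000000000) >> shift
	for i := nChars; i > 0; i-- {
		out[i] = byte(bits & 0x7f)
		bits >>= 7
	}
	return out
}

// terms returns one encoding per precision level, from exact (shift 0)
// to coarsest, using a hypothetical step of 4 bits.
func terms(v int64) [][]byte {
	const step = 4
	var out [][]byte
	for shift := uint(0); shift < 64; shift += step {
		out = append(out, prefixCoded(v, shift))
	}
	return out
}

func main() {
	a, b := terms(1000), terms(1001)
	for i := range a {
		fmt.Printf("shift %2d: same term = %v\n", i*4, string(a[i]) == string(b[i]))
	}
}
```

Here 1000 and 1001 differ only at shift 0 and share every coarser term. That sharing is what lets a range query match a whole block of values through a single low-precision term instead of scanning each exact value.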
I corrected the bit about composite fields (learning what they are at the same time...). Thanks for the PrefixCodedInt link. I do not know how you handle pull-request adjustments. I know I like to review intermediate changes rather than reevaluating everything at each round, at the price of cluttering history. So if you prefer I amend the first changeset I can do that, or if you want me to squash everything when we are done, no problem (you can even do it yourself when merging).
Yeah right, I mostly figured that out, except for the part that it is exact on 0..63. Still, I wonder how useful this kind of trick is. What I do know is that it is completely obscure.
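For what it's worth, the "exact on 0..63" claim is easy to check by brute force. The sketch below assumes the trick in question is the multiply-and-shift form `(i*37)>>8` used in the linked Lucene function as a stand-in for `i/7`; the function names are made up for this example.

```go
package main

import "fmt"

// divBy7Shift approximates i/7 with a multiply and a shift.
// Assumption: this is the form used in the linked Lucene function.
func divBy7Shift(i uint) uint { return (i * 37) >> 8 }

// firstMismatch returns the smallest i in [0, limit] for which the
// approximation differs from i/7, or -1 if there is none.
func firstMismatch(limit uint) int {
	for i := uint(0); i <= limit; i++ {
		if divBy7Shift(i) != i/7 {
			return int(i)
		}
	}
	return -1
}

func main() {
	fmt.Println(firstMismatch(63)) // no mismatch on 0..63
}
```

The approximation only works because 37/256 is slightly larger than 1/7 and the inputs are small. It first goes wrong at i = 90, well outside the 0..63 range that `63 - shift` can take.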
I agree it isn't obvious; it seems we should either include the comment or switch to the /7 version. At the moment I'm in favor of switching to /7, since when I benchmarked it I got the following:
I'm not sold it's really faster, but it doesn't seem like it's slower either. Also, this was with: but you had proposed: Can you explain?
I assume this line computes the number of 7-bit bytes required to store (63 - shift) bits. If this is the case, the current code is incorrect, as shown by: https://play.golang.org/p/zcAtpPNdJG See the case "bits: 7", where the current version allocates 2 bytes instead of 1. But:
Your version is equivalent to the current one.
I think the issue is that the original starts with 63, not 64. If you have a 64-bit value and shift by 63 bits, you still need 1 byte to store the result. In your case nchars is 0:
If I change yours to
Yes, I was confused by:
I read that like a
based on discussion here: blevesearch/blevesearch.github.io-hugo#2
Here is a sketch of my understanding of the index structure. As mentioned in the commit message, I usually avoid documenting internals, but core data structures help a lot with understanding and communication.
As usual, I second-guessed everything, so please correct any mistakes. Search algorithms are not mentioned either, as I did not check the code exhaustively, only just enough to avoid talking complete nonsense.
While I am there, a couple of questions, some I asked on IRC already:
is cryptic to me. I would have expected something like:
since the high order bit is discarded and remaining ones split on 7-bits boundaries.
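The byte count at the heart of this question can be sketched as follows, under the assumption made in the discussion above that all 64 bits are kept before shifting, so `64 - shift` bits remain to be split into 7-bit groups. `nChars` is the "/7" form starting from 63; `ceilGroups` is a hypothetical helper that spells out the same count as a ceiling division.

```go
package main

import "fmt"

// nChars is the "/7" form: the number of 7-bit groups needed once
// `shift` low bits are dropped from a 64-bit value (shift in 0..63).
func nChars(shift uint) uint { return (63-shift)/7 + 1 }

// ceilGroups computes ceil((64-shift)/7) directly.
// Hypothetical helper, for comparison only.
func ceilGroups(shift uint) uint { return (64 - shift + 6) / 7 }

func main() {
	for _, shift := range []uint{0, 56, 57, 63} {
		fmt.Printf("shift=%2d remaining bits=%2d nChars=%d ceil=%d\n",
			shift, 64-shift, nChars(shift), ceilGroups(shift))
	}
}
```

Starting from 63 rather than 64 is what turns the floor division into a ceiling: for n >= 1, ceil(n/7) equals (n-1)/7 + 1. With shift = 63, one bit remains and one byte is still needed; with shift = 56, eight bits remain and need two bytes.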