This cool idea to generalize numeric/trie fields came from Adrien:
Today, when we index a numeric field (LongField, etc.) we pre-compute
(via NumericTokenStream) outside of indexer/codec which prefix terms
should be indexed.
But this can be inefficient: you set a static precisionStep, and
always add those prefix terms regardless of how the terms in the field
are actually distributed. Yet typically in real world applications
the terms have a non-random distribution.
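To make the cost of a static precisionStep concrete, here is a minimal sketch in plain Java (no Lucene dependency) of the shifting scheme described above. The class and method names are hypothetical, and the real NumericTokenStream also encodes the shift into the term bytes, which this sketch skips. It only shows that every value yields the same number of prefix terms, no matter how the field's values are distributed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Lucene code.
class StaticPrefixTerms {
    /**
     * Returns one {shift, prefix} pair per precision level for a 64-bit value.
     * shift=0 is the full-precision term; each later entry drops precisionStep
     * more low-order bits.
     */
    static List<long[]> prefixTerms(long value, int precisionStep) {
        List<long[]> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(new long[] {shift, value >>> shift});
        }
        return terms;
    }

    public static void main(String[] args) {
        // With precisionStep=16 every long produces 4 terms; with 4 it produces 16.
        for (long[] t : prefixTerms(0x12345678L, 16)) {
            System.out.println("shift=" + t[0] + " prefix=0x" + Long.toHexString(t[1]));
        }
    }
}
```

A smaller precisionStep means more terms per value (more index space) but fewer terms visited per range query; the point of this issue is that one fixed setting cannot fit both dense and sparse regions of the term space.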
So, it should be better if instead the terms dict decides where it
makes sense to insert prefix terms, based on how dense the terms are
in each region of term space.
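As a rough sketch of what "deciding based on density" could look like: count how many terms share each prefix and keep only prefixes that cover at least some minimum number of terms. This is not the algorithm from the patch; the class name, method name, and the minTermsPerPrefix parameter are all assumptions for illustration, and it works on strings in memory, whereas a real terms dict would work on sorted bytes while writing blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the patch's implementation.
class DensityPrefixSketch {
    /**
     * Returns, in sorted order, every proper prefix shared by at least
     * minTermsPerPrefix of the given terms. Sparse regions get no prefix terms.
     */
    static List<String> choosePrefixes(List<String> terms, int minTermsPerPrefix) {
        // Count how many terms start with each proper prefix.
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            for (int len = 1; len < term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        // Keep only the prefixes that are dense enough.
        List<String> prefixes = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTermsPerPrefix) {
                prefixes.add(e.getKey());
            }
        }
        return prefixes;
    }
}
```

For example, with the terms aa1, aa2, aa3, aa4, zq9 and a minimum of 3, only the prefixes a and aa are kept; the lone term under z gets no prefix term at all, which is where the space saving over a static precisionStep would come from.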
This way we can speed up both term queries (e.g. infix
suggester) and numeric range queries, and it should also let us use
less index space.
This would also mean that min/maxTerm for a numeric field would now be
correct, vs today where the externally computed prefix terms are
placed after the full precision terms, causing hairy code like
NumericUtils.getMaxInt/Long. So optimizations like #6922 become
feasible.
The terms dict can also do tricks not possible if you must live on top
of its APIs, e.g. to handle the adversarial/over-constrained case when a
given prefix has too many terms following it but finer prefixes
have too few (what block tree calls "floor term blocks").
Migrated from LUCENE-5879 by Michael McCandless (@mikemccand), 2 votes, resolved Apr 07 2015
Attachments: LUCENE-5879.patch (versions: 14)