implement PositionLengthAttribute for all tokenstreams where it's appropriate [LUCENE-3843] #4916

@asfimport

Description

#4840 introduces PositionLengthAttribute, which extends the tokenstream API
from a sausage to a real graph.

Currently, tokenstreams such as WordDelimiterFilter and SynonymFilter theoretically
work at a graph level, but then serialize themselves to a sausage, for example:

wi-fi with WDF creates:
wi(posinc=1), fi(posinc=1), wifi(posinc=0)

The lossiness is that 'wifi' is simply stacked on top of 'fi', so a consumer can no
longer tell that it actually covers both 'wi' and 'fi'.

PositionLengthAttribute fixes this by allowing a token to declare how far it "spans",
so we don't lose any information.
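
To make the difference concrete, here is a minimal, self-contained sketch that resolves position increments and position lengths into arcs between positions. It does not use the Lucene API: the `Token` class and the `arcs` helper are hypothetical stand-ins, and the "graph" ordering (emitting 'wifi' first, at the position of 'wi', with a position length of 2) is an assumption about how a fixed filter might encode the example, not something this issue specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenGraphSketch {

    /** Hypothetical stand-in for a token's term, position increment and position length. */
    static final class Token {
        final String term;
        final int posInc; // distance from the previous token's start position
        final int posLen; // number of positions this token spans

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Resolves each token to a "term[from->to]" arc in the position graph. */
    static List<String> arcs(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // so that a first token with posInc=1 starts at position 0
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(t.term + "[" + pos + "->" + (pos + t.posLen) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        // Sausage output quoted above: every token implicitly spans one position.
        List<Token> sausage = Arrays.asList(
            new Token("wi", 1, 1), new Token("fi", 1, 1), new Token("wifi", 0, 1));
        // Assumed graph output: 'wifi' starts where 'wi' starts and spans two positions.
        List<Token> graph = Arrays.asList(
            new Token("wifi", 1, 2), new Token("wi", 0, 1), new Token("fi", 1, 1));
        System.out.println(arcs(sausage)); // [wi[0->1], fi[1->2], wifi[1->2]]
        System.out.println(arcs(graph));   // [wifi[0->2], wi[0->1], fi[1->2]]
    }
}
```

In the sausage form 'wifi' occupies exactly the same arc as 'fi'; with a position length it gets its own arc across both positions.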

While the indexer currently can only support sausages anyway (and for performance reasons,
this is probably just fine!), other tokenstream consumers such as queryparsers and suggesters
such as #4915 can actually make use of this information for better behavior.
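
As a rough illustration of what such a consumer gains, the sketch below enumerates every path through a small position graph, which is roughly what a query parser would need in order to generate alternative phrasings. It is plain Java rather than Lucene code; `Arc`, `paths` and `walk` are hypothetical names, and the arcs are hand-written from the wi-fi example under the assumption that a graph-aware filter would give 'wifi' a span of two positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GraphPathsSketch {

    /** Hypothetical arc: a term leading from one position to a later one. */
    static final class Arc {
        final String term;
        final int from;
        final int to; // must be greater than 'from' so the walk terminates

        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** Returns every term sequence that leads from position 0 to the last position. */
    static List<String> paths(List<Arc> arcs) {
        int end = 0;
        for (Arc a : arcs) {
            end = Math.max(end, a.to);
        }
        List<String> out = new ArrayList<>();
        walk(arcs, 0, end, "", out);
        return out;
    }

    // Depth-first walk that follows every arc leaving the current position.
    private static void walk(List<Arc> arcs, int pos, int end, String prefix, List<String> out) {
        if (pos == end) {
            out.add(prefix.trim());
            return;
        }
        for (Arc a : arcs) {
            if (a.from == pos) {
                walk(arcs, a.to, end, prefix + " " + a.term, out);
            }
        }
    }

    public static void main(String[] args) {
        // Sausage: 'wifi' is stacked on 'fi', so it looks like an alternative to 'fi' only.
        List<Arc> sausage = Arrays.asList(
            new Arc("wi", 0, 1), new Arc("fi", 1, 2), new Arc("wifi", 1, 2));
        // Graph (assumed encoding): 'wifi' spans both positions.
        List<Arc> graph = Arrays.asList(
            new Arc("wifi", 0, 2), new Arc("wi", 0, 1), new Arc("fi", 1, 2));
        System.out.println(paths(sausage)); // [wi fi, wi wifi]
        System.out.println(paths(graph));   // [wifi, wi fi]
    }
}
```

The sausage yields the bogus phrase 'wi wifi', while the graph yields the two intended readings, 'wifi' and 'wi fi'.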

So I think it's ideal if the TokenStream API doesn't reflect the lossiness of the index format,
but instead keeps all information, and after #4840 is committed we should fix tokenstreams
to preserve this information for consumers that can use it.


Migrated from LUCENE-3843 by Robert Muir (@rmuir), 6 votes, updated May 09 2016