Make WordDelimiterGraphFilter a Tokenizer [LUCENE-8516] #9562

@asfimport

Description

Being able to split tokens at arbitrary points in a filter chain, in effect adding a second round of tokenization, can cause any number of problems when trying to keep tokenstreams consistent with their contracts. The most common offender here is WordDelimiterGraphFilter, which can produce broken offsets in a wide range of situations.

We should make WDGF a Tokenizer in its own right, which should preserve all the functionality we need, but make reasoning about the resulting tokenstream much simpler.
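To illustrate the offset problem (a minimal, Lucene-free sketch; the class and method names here are hypothetical, not the WDGF API): a tokenizer sees the raw input, so offsets it computes for sub-tokens always point back into that input. A filter splitting tokens later in the chain only sees the current token text, which an earlier stage may already have rewritten, so offsets computed from it can point outside the original input entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of splitting a token on '-' and assigning offsets.
// As a tokenizer, the split runs over the raw input and offsets are correct;
// as a filter, it may run over rewritten text and produce bogus offsets.
public class DelimiterSplitDemo {
    // Split 'text' on '-' and return [term, startOffset, endOffset] triples,
    // with offsets relative to the original input starting at 'base'.
    static List<String[]> split(String text, int base) {
        List<String[]> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= text.length(); i++) {
            if (i == text.length() || text.charAt(i) == '-') {
                if (i > start) {
                    out.add(new String[] {
                        text.substring(start, i),
                        String.valueOf(base + start),
                        String.valueOf(base + i)
                    });
                }
                start = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokenizer position: splitting the raw input "Wi-Fi" yields
        // "Wi" [0,2) and "Fi" [3,5), both valid offsets into the input.
        for (String[] t : split("Wi-Fi", 0)) {
            System.out.println(t[0] + " [" + t[1] + "," + t[2] + ")");
        }
        // Filter position: suppose an earlier filter rewrote the token text
        // to "WiFi-802". Splitting that yields "802" [5,8), an offset range
        // that does not exist in the 5-character original input.
        for (String[] t : split("WiFi-802", 0)) {
            System.out.println(t[0] + " [" + t[1] + "," + t[2] + ")");
        }
    }
}
```

Moving the split into the tokenizer removes the second case by construction, which is the simplification this issue proposes.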


Migrated from LUCENE-8516 by Alan Woodward (@romseygeek), updated Oct 04 2018
Attachments: LUCENE-8516.patch
