Punctuation handling in StandardTokenizer (and WikipediaTokenizer) [LUCENE-1161]

It would be useful, in the StandardTokenizer, to have more control over how in-word punctuation is handled. For instance, it is not always desirable to split on dashes or other punctuation. In other cases, one may want to output the split tokens plus a collapsed version of the token that removes the punctuation.

For example, Solr's WordDelimiterFilter provides some nice capabilities here, but it can't do its job when used with the StandardTokenizer, because the StandardTokenizer has already decided how to handle the punctuation without giving the user any choice.

I think, in JFlex, we can have a back-compatible way of letting users make decisions about punctuation that occurs inside of a token, such as e-bay or i-pod, thus allowing for matches on iPod and eBay.
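
To make the requested behavior concrete, here is a rough sketch of the three options described above: keep the token whole, split on in-word punctuation, or split and also emit a collapsed form. It is plain Java for illustration only, not a JFlex grammar and not Lucene or Solr API; the class name `InWordPunctuationSketch`, the `Mode` enum, and the `expand` method are all hypothetical names, and treating every non-letter, non-digit character as punctuation is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class InWordPunctuationSketch {

    /** Hypothetical policy names; not part of Lucene or Solr. */
    public enum Mode { KEEP_WHOLE, SPLIT, SPLIT_AND_COLLAPSE }

    /** Expands one raw token according to the chosen punctuation policy. */
    public static List<String> expand(String token, Mode mode) {
        List<String> out = new ArrayList<>();
        if (mode == Mode.KEEP_WHOLE) {
            out.add(token); // leave in-word punctuation untouched
            return out;
        }
        StringBuilder part = new StringBuilder();      // current sub-token
        StringBuilder collapsed = new StringBuilder(); // token with punctuation removed
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                part.append(c);
                collapsed.append(c);
            } else if (part.length() > 0) {
                // Assumption: any other character ends the current sub-token.
                out.add(part.toString());
                part.setLength(0);
            }
        }
        if (part.length() > 0) {
            out.add(part.toString());
        }
        // Add the collapsed form only when the token really was split.
        if (mode == Mode.SPLIT_AND_COLLAPSE && out.size() > 1) {
            out.add(collapsed.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("e-bay", Mode.KEEP_WHOLE));         // [e-bay]
        System.out.println(expand("e-bay", Mode.SPLIT));              // [e, bay]
        System.out.println(expand("i-pod", Mode.SPLIT_AND_COLLAPSE)); // [i, pod, ipod]
    }
}
```

With the collapsed form in the index, a lowercased query for iPod would match the `ipod` token produced from i-pod. A real implementation would also need to decide position increments and offsets for the extra token, which this sketch ignores.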



---
Migrated from [LUCENE-1161](https://issues.apache.org/jira/browse/LUCENE-1161) by Grant Ingersoll (@gsingers), 1 vote, resolved Jan 26 2011

