---
uid: Lucene.Net.Analysis.Standard
summary: *content
---

Fast, general-purpose grammar-based tokenizers.

The <xref:Lucene.Net.Analysis.Standard> package contains three fast grammar-based tokenizers constructed with JFlex:

* <xref:Lucene.Net.Analysis.Standard.StandardTokenizer>: as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, URLs and email addresses are not tokenized as single tokens, but are instead split up into tokens according to the UAX#29 word break rules (see the sketch after this list).

  StandardAnalyzer includes StandardTokenizer, StandardFilter, LowerCaseFilter and StopFilter. When the LuceneVersion specified in the constructor is lower than 3.1, the ClassicTokenizer implementation is invoked.

* ClassicTokenizer: this class was formerly (prior to Lucene 3.1) named StandardTokenizer. (Its tokenization rules are not based on the Unicode Text Segmentation algorithm.) ClassicAnalyzer includes ClassicTokenizer, StandardFilter, LowerCaseFilter and StopFilter.

* UAX29URLEmailTokenizer: implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

  UAX29URLEmailAnalyzer includes UAX29URLEmailTokenizer, StandardFilter, LowerCaseFilter and StopFilter.
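The following is a minimal sketch (not part of the original package documentation) of how the three analyzers could be compared on the same input to see the tokenization differences described above. The field name `"f"`, the sample text, and the `LuceneVersion.LUCENE_48` match version are illustrative assumptions.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class TokenizerComparison
{
    // Prints the tokens an analyzer produces for the given text.
    private static void PrintTokens(Analyzer analyzer, string text)
    {
        // The field name "f" is arbitrary; analyzers in this package ignore it.
        using (TokenStream stream = analyzer.GetTokenStream("f", new StringReader(text)))
        {
            var termAtt = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
            {
                Console.Write("[" + termAtt.ToString() + "] ");
            }
            stream.End();
            Console.WriteLine();
        }
    }

    public static void Main()
    {
        const string text = "Contact admin@example.com or visit https://lucene.apache.org";

        // StandardAnalyzer splits URLs and email addresses into word tokens
        // according to the UAX#29 word break rules.
        PrintTokens(new StandardAnalyzer(LuceneVersion.LUCENE_48), text);

        // ClassicAnalyzer applies the older, pre-3.1 grammar (not UAX#29-based).
        PrintTokens(new ClassicAnalyzer(LuceneVersion.LUCENE_48), text);

        // UAX29URLEmailAnalyzer keeps URLs and email addresses as single tokens.
        PrintTokens(new UAX29URLEmailAnalyzer(LuceneVersion.LUCENE_48), text);
    }
}
```

All three analyzers also apply StandardFilter, LowerCaseFilter and StopFilter, so the printed terms are lowercased and stop words are removed.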