PragmaticSegmenterNet

This project is a direct port of Pragmatic Segmenter, which provides rule-based sentence boundary detection.

Usage

The Segmenter class provides the Segment method, which in its simplest form takes a single string:

using PragmaticSegmenterNet;

IReadOnlyList<string> result = Segmenter.Segment("One Sentence. And another sentence.");

// ["One Sentence.", "And another sentence."]

IReadOnlyList<string> result2 = Segmenter.Segment("Anything.", Language.Italian);

// ["Anything."]

The Segment method has a number of optional parameters:

IReadOnlyList<string> Segment(string text, Language language = Language.English, bool cleanText = true, DocumentType documentType = DocumentType.Any)
  • Language - An enum representing the supported languages. Defaults to English. See the Languages list below for the currently supported values.
  • CleanText - A boolean indicating whether the input text should be cleaned prior to segmentation. Cleaning removes extra newlines and whitespace. Defaults to true.
  • DocumentType - Used by the text cleaning process to determine which reformatting to apply. For PDFs this handles newlines in the middle of a sentence, whereas for HTML documents it handles HTML tags. Defaults to Any, which does not apply any special formatting.
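
As a minimal sketch of how the optional parameters might be combined, the call below uses C# named arguments so that each override is explicit. It relies only on the signature and enum members shown above (Language.English, DocumentType.Any). The choice to turn cleaning off is purely illustrative, and the example assumes the PragmaticSegmenterNet package is already referenced by the project.

```csharp
using System;
using System.Collections.Generic;
using PragmaticSegmenterNet;

// Named arguments make it clear which optional parameter is being overridden.
// Here cleaning is disabled (an illustrative choice), so the input is segmented as given.
IReadOnlyList<string> sentences = Segmenter.Segment(
    "One Sentence. And another sentence.",
    language: Language.English,
    cleanText: false,
    documentType: DocumentType.Any);

// Print each detected sentence on its own line.
foreach (string sentence in sentences)
{
    Console.WriteLine(sentence);
}
```

Because every parameter after text has a default, any of the named arguments can be dropped; for example, passing only cleanText: false keeps the English language and the Any document type.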

Languages

  • English = 0 (default)
  • Amharic = 1
  • Arabic = 2
  • Armenian = 3
  • Bulgarian = 4
  • Burmese = 5
  • Chinese = 6
  • Danish = 7
  • Dutch = 8
  • French = 9
  • German = 10
  • Greek = 11
  • Hindi = 12
  • Italian = 13
  • Japanese = 14
  • Kazakh = 15 (partial support, potentially only for the Cyrillic form of the alphabet)
  • Persian = 16
  • Polish = 17
  • Russian = 18
  • Spanish = 19
  • Urdu = 20

Credit

This project wouldn't be possible without the work done by the Pragmatic Segmenter team. Any bugs in the code are entirely my fault.
