PragmaticSegmenterNet

This project is a direct port of Pragmatic Segmenter, which provides rule-based sentence boundary detection.

Usage

The Segmenter class provides the Segment method, which in its simplest form takes a single string:

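using System.Collections.Generic;
using PragmaticSegmenterNet;

// The language defaults to English.
IReadOnlyList<string> result = Segmenter.Segment("One Sentence. And another sentence.");

// ["One Sentence.", "And another sentence."]

// Pass a Language value to segment text in another supported language.
IReadOnlyList<string> result2 = Segmenter.Segment("Anything.", Language.Italian);

// ["Anything."]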

The Segment method has a number of optional parameters:

IReadOnlyList<string> Segment(string text, Language language = Language.English, bool cleanText = true, DocumentType documentType = DocumentType.Any)
  • language - An enum value selecting the language of the input text. Defaults to Language.English; see the Languages list below for the currently supported values.
  • cleanText - A boolean indicating whether the input text should be cleaned prior to segmentation. Cleaning removes extra newlines and whitespace. Defaults to true.
  • documentType - Used by the text cleaning process to determine which reformatting to apply. For PDFs this handles newlines in the middle of a sentence, whereas for HTML documents it handles HTML tags. Defaults to DocumentType.Any, which does not apply any special formatting.
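
As a minimal sketch of how the optional parameters combine, the snippet below calls Segment with its defaults, with every default spelled out, and with only cleanText overridden through a named argument. The parameter names and default values are taken from the signature above. The top-level statement form (C# 9 or later) and the variable names are illustrative choices, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using PragmaticSegmenterNet;

string text = "One Sentence. And another sentence.";

// All optional parameters left at their defaults.
IReadOnlyList<string> defaults = Segmenter.Segment(text);

// The same call with every default from the signature spelled out.
IReadOnlyList<string> explicitDefaults = Segmenter.Segment(
    text,
    language: Language.English,
    cleanText: true,
    documentType: DocumentType.Any);

// A named argument skips the cleaning step without restating the language.
IReadOnlyList<string> uncleaned = Segmenter.Segment(text, cleanText: false);

foreach (string sentence in defaults)
{
    Console.WriteLine(sentence);
}
```

Named arguments keep call sites readable when only one of the later parameters differs from its default.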

Languages

  • English = 0 (default)
  • Amharic = 1
  • Arabic = 2
  • Armenian = 3
  • Bulgarian = 4
  • Burmese = 5
  • Chinese = 6
  • Danish = 7
  • Dutch = 8
  • French = 9
  • German = 10
  • Greek = 11
  • Hindi = 12
  • Italian = 13
  • Japanese = 14
  • Kazakh = 15 (partial support, potentially only for the Cyrillic form of the alphabet)
  • Persian = 16
  • Polish = 17
  • Russian = 18
  • Spanish = 19
  • Urdu = 20
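
If the language arrives as text, for example from configuration, one possible approach is to map it onto the Language enum with the standard Enum.TryParse method and fall back to English when the name is not in the list above. This is only a sketch: the requested variable and the fallback policy are assumptions made for illustration, and the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using PragmaticSegmenterNet;

// Hypothetical input, e.g. read from configuration or a command-line argument.
string requested = "Italian";

// Enum.TryParse also accepts numeric strings, so Enum.IsDefined guards
// against values that fall outside the supported list.
Language language =
    Enum.TryParse(requested, ignoreCase: true, out Language parsed)
    && Enum.IsDefined(typeof(Language), parsed)
        ? parsed
        : Language.English;

IReadOnlyList<string> sentences = Segmenter.Segment("Anything.", language);

Console.WriteLine($"{language} = {(int)language}, {sentences.Count} sentence(s)");
```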

Credit

This project wouldn't be possible without the work done by the Pragmatic Segmenter team.