Adds functions supporting some string algorithms in the Unicode standard. For example:
-
The Unicode Case Folding algorithm to provide case-independent equality checking irrespective of language or script with
Unicode.String.fold/2
andUnicode.String.equals_ignoring_case?/2
-
The Unicode Segmentation algorithm to detect, break, split or stream strings into grapheme clusters, words, sentences and line break points.
-
The Unicode Line Breaking algorithm to determine line breaks (as in breaks where word-wrapping would be acceptable).
The Unicode Case Folding algorithm defines how to perform case folding. This allows comparison of strings in a case-insensitive fashion. It does not define the means to compare ignoring diacritical marks (accents). Some examples follow, for details see:
Unicode.String.fold/2
Unicode.String.equals_ignoring_case?/3
iex> Unicode.String.equals_ignoring_case? "ABC", "abc"
true
iex> Unicode.String.equals_ignoring_case? "beißen", "beissen"
true
iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen"
false
The Unicode Segmentation annex details the algorithm to be applied with segmenting text (Elixir strings) into words, sentences, graphemes and line breaks. Some examples follow, for details see:
Unicode.String.split/2
Unicode.String.break?/2
Unicode.String.break/2
Unicode.String.splitter/2
Unicode.String.next/2
Unicode.String.stream/2
# Split text at a word boundary.
iex> Unicode.String.split "This is a sentence. And another.", break: :word
["This", " ", "is", " ", "a", " ", "sentence", ".", " ", "And", " ", "another", "."]
# Split text at a word boundary but omit any whitespace
iex> Unicode.String.split "This is a sentence. And another.", break: :word, trim: true
["This", "is", "a", "sentence", ".", "And", "another", "."]
# Split text at a sentence boundary.
iex> Unicode.String.split "This is a sentence. And another.", break: :sentence
["This is a sentence. ", "And another."]
# By default, common abbreviations are suppressed (ie
# the do not cause a break)
iex> Unicode.String.split "No, I don't have a Ph.D. but I don't think it matters.", break: :word, trim: true
["No", ",", "I", "don't", "have", "a", "Ph.D", ".", "but", "I", "don't",
"think", "it", "matters", "."]
iex> Unicode.String.split "No, I don't have a Ph.D. but I don't think it matters.", break: :sentence, trim: true
["No, I don't have a Ph.D. but I don't think it matters."]
# Sentence Break suppressions are locale sensitive.
iex> Unicode.String.Segment.known_locales
["de", "el", "en", "en-US", "en-US-POSIX", "es", "fi", "fr", "it", "ja", "pt",
"root", "ru", "sv", "zh", "zh-Hant"]
iex> Unicode.String.split "Non, c'est M. Dubois.", break: :sentence, trim: true, locale: "fr"
["Non, c'est M. Dubois."]
# Note that break: :line does NOT mean split the string
# at newlines. It splits the string where a line break would be
# acceptable. This is very useful for calculating where
# to perform word-wrap on some text.
iex> Unicode.String.split "This is a sentence. And another.", break: :line
["This ", "is ", "a ", "sentence. ", "And ", "another."]
Segmentation can also be streamed using Unicode.String.stream/2
. For large strings this may improve memory usage since the intermediate segments will be garbage collected when they fall out of scope.
iex> Enum.to_list Unicode.String.stream("this is a set of words", trim: true) ["this", "is", "a", "set", "of", "words"]
iex> Enum.map Unicode.String.stream("this is a set of words", trim: true),
...> fn word -> %{word: word, length: String.length(word)} end
[
%{length: 4, word: "this"},
%{length: 2, word: "is"},
%{length: 1, word: "a"},
%{length: 3, word: "set"},
%{length: 2, word: "of"},
%{length: 5, word: "words"}
]
The package can be installed by adding :unicode_string
to your list of dependencies in mix.exs
:
def deps do
[
{:unicode_string, "~> 1.0"}
]
end