Unicode eXtension for Erlang (Strings, Collation)

Unicode eXtension

License: Apache License, Version 2.0

Alternative license: LGPLv3

Author: Uvarov Michael (arcusfelis@gmail.com)

Unidata version: 6.1.0

Read edoc documentation

Library for working with strings. A string is a flat list of Unicode characters.

All operations on Unicode text follow the Unicode Standard.

This library implements only the following documents:

  • UAX 15 Unicode Normalization Forms
  • UTS 10 Unicode Collation Algorithm

and parts of:

  • UAX 44 Unicode Character Database

Structure of the library

ux_string uses ux_char and ux_unidata.

ux_uca uses ux_char and ux_unidata.

ux_char uses ux_unidata.

ux_unidata is for internal data access.

ux_string.erl: String Functions for lists of Unicode characters.

This module provides functions that operate on strings using UNIDATA, the data about Unicode characters.

Functions for working with Unicode Normalization Forms (UNF)

  • to_nfc/1
  • to_nfd/1
  • to_nfkd/1
  • to_nfkc/1
  • is_nfc/1
  • is_nfd/1
  • is_nfkc/1
  • is_nfkd/1
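
To make the two basic forms concrete, here is a minimal sketch that uses only OTP's own unicode module (OTP 20 or later) rather than ux, so it runs without this library. The module name nf_demo is made up for the illustration. It is reasonable to expect ux_string:to_nfd/1 and ux_string:to_nfc/1 to agree with it on this input, but that is an assumption and is not checked here.

```erlang
-module(nf_demo).
-export([roundtrip/0]).

%% Illustration only: relies on OTP's unicode module, not on ux.
roundtrip() ->
    %% "é" as a single precomposed code point (U+00E9).
    Composed = [16#E9],
    %% NFD splits it into "e" followed by COMBINING ACUTE ACCENT (U+0301).
    Decomposed = unicode:characters_to_nfd_list(Composed),
    [$e, 16#301] = Decomposed,
    %% NFC puts the two code points back together.
    Composed = unicode:characters_to_nfc_list(Decomposed),
    {Composed, Decomposed}.
```

Calling nf_demo:roundtrip() returns {[233],[101,769]}: the same text, once as one code point and once as two.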

Functions from stdlib for Unicode strings

  • to_lower/1
  • to_upper/1

Functions for processing strings as groups of graphemes

A grapheme is a letter together with its modifiers.

  • length/1
  • reverse/1
  • first/2
  • last/2

Examples

Code:

(ux@delta)11> ux_string:length("FF g̈").
4
(ux@delta)12> string:len("FF g̈").
5
(ux@delta)13> ux_string:to_graphemes("FF g̈").
["F","F"," ",[103,776]]
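
Why grapheme-aware functions matter is easiest to see with reversal. The sketch below is a cross-check that uses only OTP's string module (the grapheme-aware API added in OTP 20), not ux; the module name grapheme_demo is hypothetical. It contrasts counting and reversing by grapheme cluster with doing the same by code point.

```erlang
-module(grapheme_demo).
-export([lengths/0, reversed/0]).

%% "FF g̈": F, F, space, g, COMBINING DIAERESIS (U+0308).
str() -> [$F, $F, $\s, $g, 16#308].

%% {grapheme clusters, code points}
lengths() ->
    {string:length(str()), erlang:length(str())}.

%% Reversing by grapheme cluster keeps the diaeresis after the "g";
%% reversing by code point moves it in front of the "g".
reversed() ->
    {lists:flatten(string:reverse(str())), lists:reverse(str())}.
```

Here lengths() returns {4,5}, and reversed() returns {[103,776,32,70,70],[776,103,32,70,70]}. In the second list the combining mark no longer follows its base letter.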

"PHP-style" string functions

  • explode/2,3
  • html_special_chars/1 (htmlspecialchars in PHP)
  • strip_tags/1,2

Examples

Code:

ux_string:explode(["==", "++", "|"], "+++-+=|==|==|=+-+++").

Result:

[[],"+-+=",[],[],[],[],"=+-","+"]
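
The result above is consistent with a left-to-right scan that, at each position, tries the delimiters in the order given and cuts at the first one that matches. The following is a sketch of that reading, not the library's implementation; the module name explode_sketch is made up, and it assumes every delimiter is a non-empty string.

```erlang
-module(explode_sketch).
-export([explode/2]).

%% Split Str at every occurrence of any delimiter in Delims.
%% Assumption: no delimiter is the empty string.
explode(Delims, Str) ->
    explode(Delims, Str, [], []).

%% Cur is the current token (reversed), Acc the finished tokens (reversed).
explode(_Delims, [], Cur, Acc) ->
    lists:reverse([lists:reverse(Cur) | Acc]);
explode(Delims, [C | Rest] = Str, Cur, Acc) ->
    case strip_delim(Delims, Str) of
        {ok, Tail} -> explode(Delims, Tail, [], [lists:reverse(Cur) | Acc]);
        nomatch    -> explode(Delims, Rest, [C | Cur], Acc)
    end.

%% Try each delimiter in order; return the rest of Str after the first match.
strip_delim([], _Str) ->
    nomatch;
strip_delim([D | Ds], Str) ->
    case lists:prefix(D, Str) of
        true  -> {ok, lists:nthtail(length(D), Str)};
        false -> strip_delim(Ds, Str)
    end.
```

With the same arguments as above, explode_sketch:explode(["==", "++", "|"], "+++-+=|==|==|=+-+++") gives [[],"+-+=",[],[],[],[],"=+-","+"].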

Code:

ux_html:strip_tags("<b>bold text</b>").

Result:

"bold text"

Type functions

A type is a Unicode General Category.

Code:

%% Adjacent string literals are concatenated, so Str contains no newlines.
Str = "Erlang created the field of telephone "
      "networks analysis. His early work in scrutinizing the use of local, exchange "
      "and trunk telephone line usage in a small community, to understand the "
      "theoretical requirements of an efficient network led to the creation of the "
      "Erlang formula, which became a foundational element of present day "
      "telecommunication network studies.",
ux_string:explode_types(['Zs', 'Lu'], Str).

Result:

[[],"rlang","created","the","field","of","telephone",
 "networks","analysis.",[],"is","early","work","in",
 "scrutinizing","the","use","of","local,","exchange","and",
 "trunk","telephone","line","usage","in","a","small",
 [...]|...]

Code:

ux_string:types(Str).

Result:

['Lu','Ll','Ll','Ll','Ll','Ll','Zs','Ll','Ll','Ll','Ll',
 'Ll','Ll','Ll','Zs','Ll','Ll','Ll','Zs','Ll','Ll','Ll','Ll',
 'Ll','Zs','Ll','Ll','Zs','Ll'...]

Here the atom 'Lu' means Letter, Uppercase, and 'Ll' means Letter, Lowercase. See the description of ux_char:type/1 for more about types.

Code:

ux_string:delete_types(['Ll'], Str).

Result:

"E       . H        ,          ,                E ,           ."

ux_char.erl: Char Functions

Code:

ux_char:type($ ).

Result:

'Zs'

List of types

  • Normative Categories:
    • Lu Letter, Uppercase
    • Ll Letter, Lowercase
    • Lt Letter, Titlecase
    • Mn Mark, Non-Spacing
    • Mc Mark, Spacing Combining
    • Me Mark, Enclosing
    • Nd Number, Decimal Digit
    • Nl Number, Letter
    • No Number, Other
    • Zs Separator, Space
    • Zl Separator, Line
    • Zp Separator, Paragraph
    • Cc Other, Control
    • Cf Other, Format
    • Cs Other, Surrogate
    • Co Other, Private Use
    • Cn Other, Not Assigned (no characters in the file have this property)
  • Informative Categories:
    • Lm Letter, Modifier
    • Lo Letter, Other
    • Pc Punctuation, Connector
    • Pd Punctuation, Dash
    • Ps Punctuation, Open
    • Pe Punctuation, Close
    • Pi Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
    • Pf Punctuation, Final quote (may behave like Ps or Pe depending on usage)
    • Po Punctuation, Other
    • Sm Symbol, Math
    • Sc Symbol, Currency
    • Sk Symbol, Modifier
    • So Symbol, Other

ux_uca.erl: Unicode Collation Algorithm

See Unicode Technical Standard #10.

Functions

  • compare/2,3
  • sort/1,2
  • sort_key/1,2
  • sort_array/1,2
  • search/2,3,4

Examples

Code from the Erlang shell:

1> ux_uca:sort_key("a").
<<21,163,0,0,32,0,0,2,0,0,255,255>>

2> ux_uca:sort_key("abc").
<<21,163,21,185,21,209,0,0,34,0,0,4,0,0,255,255,255,255,
  255,255>>

3> ux_uca:sort_key("abcd").
<<21,163,21,185,21,209,21,228,0,0,35,0,0,5,0,0,255,255,
  255,255,255,255,255,255>>

Code:

ux_uca:compare("a", "a").
ux_uca:compare("a", "b").
ux_uca:compare("c", "b").

Result:

equal
lower
greater
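
Because the comparison returns lower, equal or greater rather than a boolean, it needs a small adapter before it can be passed to lists:sort/2. Below is a minimal sketch of such an adapter; the module name cmp_sort and the function sort_with/2 are hypothetical and not part of ux. Any fun that returns one of the three atoms shown above can be plugged in.

```erlang
-module(cmp_sort).
-export([sort_with/2]).

%% Compare is a fun(A, B) returning lower | equal | greater.
%% lists:sort/2 expects a fun that is true when A sorts before or equal to B.
sort_with(Compare, Strings) ->
    lists:sort(fun(A, B) -> Compare(A, B) =/= greater end, Strings).
```

With ux loaded, the call would look like cmp_sort:sort_with(fun ux_uca:compare/2, ["c", "a", "b"]).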

Code:

Options = ux_uca_options:get_options([
        {natural_sort, false},
        {strength, 3},
        {alternate, shifted}
    ]),
InStrings = ["erlang", "esl", "nitrogen", "epm", "mochiweb", "rebar", "eunit"],
OutStrings = ux_uca:sort(Options, InStrings),
[io:format("~ts~n", [S]) || S <- OutStrings],

SortKeys = [{Str, ux_uca:sort_key(Options, Str)} || Str <- OutStrings],
[io:format("~ts ~w~n", [S, K]) || {S, K} <- SortKeys],

ok.

Result:

epm
erlang
esl
eunit
mochiweb
nitrogen
rebar
epm [5631,5961,5876,0,32,32,32,0,2,2,2]
erlang [5631,6000,5828,5539,5890,5700,0,32,32,32,32,32,32,0,2,2,2,2,2,2]
esl [5631,6054,5828,0,32,32,32,0,2,2,2]
eunit [5631,6121,5890,5760,6089,0,32,32,32,32,32,0,2,2,2,2,2]
mochiweb [5876,5924,5585,5735,5760,6180,5631,5561,0,32,32,32,32,32,32,32,32,0,2,2,2,2,2,2,2,2]
nitrogen [5890,5760,6089,6000,5924,5700,5631,5890,0,32,32,32,32,32,32,32,32,0,2,2,2,2,2,2,2,2]
rebar [6000,5631,5561,5539,6000,0,32,32,32,32,32,0,2,2,2,2,2]
ok

Searching

Code:

(ux@delta)30> ux_uca:search("The quick brown fox jumps over the lazy dog.",
"fox").
{"The quick brown ","fox"," jumps over the lazy dog."}

(ux@delta)33> ux_uca:search("The quick brown fox jumps over the lazy dog.",
"cat").
false

Searching and Strength

Code:

(ux@delta)20> CF = fun(S) -> ux_uca_options:get_options([{strength,S}]) end.
#Fun<erl_eval.6.80247286>

(ux@delta)32> ux_uca:search(CF(2), "The quick brown fox jumps over the lazy dog.",
"dog", maximal).
{"The quick brown fox jumps over the lazy"," dog.",[]}

(ux@delta)21> ux_uca:search(CF(2), "fF", "F").
{[],"f","F"}

(ux@delta)22> ux_uca:search(CF(3), "fF", "F").
{"f","F",[]}

Searching and Match-Style

Code:

(ux@delta)20> CF = fun(S) -> ux_uca_options:get_options([{strength,S}]) end.
#Fun<erl_eval.6.80247286>

(ux@delta)27> ux_uca:search(CF(3), "! F   ?S?", "! F !", 'minimal').
{"! ","F","   ?S?"}

(ux@delta)28> ux_uca:search(CF(3), "! F   ?S?", "! F !", 'maximal').
{[],"! F   ?","S?"}

(ux@delta)29> ux_uca:search(CF(3), "! F   ?S?", "! F !", 'medium').
{[],"! F ","  ?S?"}

ux_unidata.erl

Stores UNIDATA information. For internal use only.

Data loading

ux_unidata_filelist:set_source(Level, ParserType, ImportedDataTypes,
FromFile).

For example:

ux_unidata_filelist:set_source(process, blocks, all, code:priv_dir(ux) ++ "/UNIDATA/Blocks.txt").

This call loads data about Unicode blocks from priv/UNIDATA/Blocks.txt.

This way, different processes can use their own unidata dictionaries.

Level is process, application or node.

Parsers are located in the ux_unidata_parser_* modules.

Default unidata files are loaded when the application tries to access them.
