# Text segmentation: Myanmar (မြန်မာစာ) example

Tokenisation, also known as break iteration or text segmentation, is a key step in Natural Language Processing, lexicographic analysis, text layout and many other text-based processes. Many NLP libraries and other packages provide tokenisation support for a core set of languages. However, more specialist tools may be necessary, depending on the language or script (writing system) being processed.

Text segmentation or tokenisation can occur at codepoint, grapheme, syllable (phonemic or orthographic), word, line or sentence boundaries.

For this example we will be looking at grapheme, syllable and word segmentation of Myanmar (Burmese) text. Like a number of other writing systems, Myanmar does not use whitespace to separate words. Whitespace is used at phrasal and sentence boundaries; within a phrase, words run together without any indication of a boundary.
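A quick check with the standard library illustrates the point. This is a minimal sketch that uses the phrase adopted as the test string later in this notebook, written with escape sequences so the code points are explicit. The variable names are illustrative only.

```python
# The test phrase (University of Computer Studies, Yangon), spelled out
# as escapes and grouped by the three words it should ideally resolve to.
phrase = (
    "\u101b\u1014\u103a\u1000\u102f\u1014\u103a"              # ရန်ကုန်
    "\u1000\u103d\u1014\u103a\u1015\u103b\u1030\u1010\u102c"  # ကွန်ပျူတာ
    "\u1010\u1000\u1039\u1000\u101e\u102d\u102f\u101c\u103a"  # တက္ကသိုလ်
)

# str.split() only breaks on whitespace, so the three words stay fused.
tokens = phrase.split()
print(len(tokens))  # 1
```

Whitespace splitting returns the whole phrase as a single token, which is why the rest of this example relies on script-aware segmentation.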

Ideal tokenisation and line breaking for Myanmar text occurs at word boundaries, but identification of word boundaries requires a dictionary lookup or language specific tokenisation tools. A secondary break iteration point is at syllable boundaries. We will use a mix of language specific and generic tools to illustrate Myanmar segmentation.

## Setup

In [1]:
import os, sys
libpaths = [os.path.expanduser('../utils'), os.path.expanduser('../data/rbbi')]
for libpath in libpaths:
    if libpath not in sys.path:
        sys.path.append(libpath)
from elle import uu
import el_utils as elu

import pyidaungsu as pds
from myanmar.encodings import UnicodeEncoding
from myanmar.language import MorphoSyllableBreak, PhonemicSyllableBreak
from icu import BreakIterator, Locale, RuleBasedBreakIterator
import grapheme
import regex as re


LANG = "my-MM"
LOC = Locale.createCanonical(LANG)

## Helper functions

In [2]:
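def iterate_breaks(text, break_iterator):
    """Yield the segments of text delimited by an ICU break iterator."""
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        # -1 signals that no further boundaries remain
        if next_boundary == -1:
            return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary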

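SEP = "\u2009·\u2009"  # thin space, middle dot, thin space

def results(l, sep="|", pymyan=False):
    """Print the token count and the tokens joined by sep.

    Set pymyan=True when l comes from the python-myanmar module, whose
    items carry each token under the 'syllable' key.
    """
    print("Number of tokens: ", str(len(l)))
    r = sep.join(list(s['syllable'] for s in l)) if pymyan else sep.join(l)
    print("Segmentation boundaries: " + r)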
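One caveat with `iterate_breaks`: ICU works on UTF-16 internally, so the boundaries it reports are UTF-16 code unit offsets, whereas Python slices strings by code point. The two agree for the test string in this notebook, which contains only Basic Multilingual Plane characters, but they diverge once a supplementary character appears. The helper below is a hypothetical addition, not part of PyICU, sketching how such offsets could be mapped back to Python indices.

```python
def utf16_to_codepoint_offsets(text, utf16_offsets):
    """Map UTF-16 code unit offsets onto Python str (code point) indices."""
    mapping = {}
    units = 0
    for index, ch in enumerate(text):
        mapping[units] = index
        # Characters above U+FFFF occupy two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    mapping[units] = len(text)
    return [mapping[offset] for offset in utf16_offsets]

# 'a', one supplementary character, 'b': boundaries after each character.
demo = "a\U0001F600b"
print(utf16_to_codepoint_offsets(demo, [1, 3, 4]))  # [1, 2, 3]
```

For text that stays within the Basic Multilingual Plane the mapping is the identity, so the helper as written in this notebook needs no change for the examples that follow.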


## Test string

We will use the string __ရန်ကုန်ကွန်ပျူတာတက္ကသိုလ်__ (University of Computer Studies, Yangon) which should ideally resolve to three words.

In [3]:
s = "ရန်ကုန်ကွန်ပျူတာတက္ကသိုလ်"  # University of Computer Studies, Yangon
uu(s, LANG).lengthData()

String: ရန်ကုန်ကွန်ပျူတာတက္ကသိုလ်
Codepoints: U+101B U+1014 U+103A U+1000 U+102F U+1014 U+103A U+1000 U+103D U+1014 U+103A U+1015 U+103B U+1030 U+1010 U+102C U+1010 U+1000 U+1039 U+1000 U+101E U+102D U+102F U+101C U+103A
+-------------+---------+
| Component   |   Count |
+=============+=========+
| Characters  |      25 |
+-------------+---------+
| Bytes       |      75 |
+-------------+---------+
| Graphemes   |      14 |
+-------------+---------+
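The character and byte counts can be cross-checked with the standard library alone. This is a minimal sketch that rebuilds the string from the code points listed above; the variable names are illustrative and are not part of the `elle` helper.

```python
codepoints = (
    "U+101B U+1014 U+103A U+1000 U+102F U+1014 U+103A U+1000 U+103D "
    "U+1014 U+103A U+1015 U+103B U+1030 U+1010 U+102C U+1010 U+1000 "
    "U+1039 U+1000 U+101E U+102D U+102F U+101C U+103A"
)

# Strip the "U+" prefix and convert each hexadecimal value to a character.
text = "".join(chr(int(cp[2:], 16)) for cp in codepoints.split())

n_chars = len(text)                  # Python counts code points
n_bytes = len(text.encode("utf-8"))  # each of these code points needs 3 bytes
print(n_chars, n_bytes)  # 25 75
```

The grapheme count is not available from `len()`; it requires one of the segmentation approaches covered in the next section.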


## Using pyidaungsu package

In [4]:
# Using pyidaungsu

print(pds.detect(s))
# syllable level (the lang parameter defaults to 'mm')
print(pds.tokenize(s))
# equivalent to: pds.tokenize(s, lang="mm")
# word level
print(pds.tokenize(s, form="word"))

uni
['ရန်', 'ကုန်', 'ကွန်', 'ပျူ', 'တာ', 'တက္က', 'သိုလ်']
['ရန်ကုန်', 'ကွန်ပျူတာ', 'တက္ကသိုလ်']


## Grapheme segmentation

We will illustrate three approaches to grapheme segmentation:

1. A regular expression using the _[regex](https://github.com/mrabarnett/mrab-regex)_ package with the regex pattern `r'\X'`. _Regex_ does not support language specific tailoring for text segmentation.
2. The _[grapheme](https://github.com/alvinlindstam/grapheme)_ package for working with user perceived characters. It implements [Unicode Standard Annex #29](http://unicode.org/reports/tr29/). _Grapheme_ does not support language specific tailoring for text segmentation.
3. _[PyICU](https://gitlab.pyicu.org/main/pyicu)'s_ `BreakIterator.createCharacterInstance()`. _PyICU_ is a Python extension wrapping the ICU C/C++ libraries.
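To give a feel for what these tools compute, the sketch below is a deliberately simplified approximation of extended grapheme clusters that uses only `unicodedata`. It assumes a cluster is a base character followed by nonspacing or enclosing marks, plus spacing marks other than those UAX #29 excludes from `SpacingMark`. The exclusion set covers the Myanmar block only and is transcribed from the specification as an assumption, so check it against the current annex before reuse. The sketch ignores the other UAX #29 rules, so it is not a substitute for the three packages above.

```python
import unicodedata

# Myanmar-block spacing marks that UAX #29 excludes from SpacingMark
# (assumption: transcribed by hand, verify against the specification).
EXCLUDED_SPACING_MARKS = (
    {0x102B, 0x102C, 0x1038, 0x1083, 0x108F}
    | set(range(0x1062, 0x1065))
    | set(range(0x1067, 0x106E))
    | set(range(0x1087, 0x108D))
    | set(range(0x109A, 0x109D))
)

def extends_cluster(ch):
    """Return True if ch should attach to the preceding cluster."""
    category = unicodedata.category(ch)
    if category in ("Mn", "Me"):
        return True
    return category == "Mc" and ord(ch) not in EXCLUDED_SPACING_MARKS

def approximate_graphemes(text):
    """Group each base character with the marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and extends_cluster(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# The notebook's test string, written as escapes.
sample = (
    "\u101b\u1014\u103a\u1000\u102f\u1014\u103a"
    "\u1000\u103d\u1014\u103a\u1015\u103b\u1030\u1010\u102c"
    "\u1010\u1000\u1039\u1000\u101e\u102d\u102f\u101c\u103a"
)
clusters = approximate_graphemes(sample)
print(len(clusters))  # 14
```

The vowel sign U+102C is a spacing mark that UAX #29 excludes, which is why it forms a cluster of its own rather than attaching to the preceding consonant.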


### Regular expressions

In [6]:
chars_regex = re.findall(r'\X', s)
results(chars_regex, sep=SEP)

Number of tokens:  14
Segmentation boundaries: ရ · န် · ကု · န် · ကွ · န် · ပျူ · တ · ာ · တ · က္ · က · သို · လ်


### Grapheme

At the time of writing, grapheme supports Unicode 13.0.

In [7]:
# Graphemes
chars_grapheme = list(grapheme.graphemes(s))
results(chars_grapheme, sep=SEP)

Number of tokens:  14
Segmentation boundaries: ရ · န် · ကု · န် · ကွ · န် · ပျူ · တ · ာ · တ · က္ · က · သို · လ်


### PyICU

In [8]:
cbi = BreakIterator.createCharacterInstance(LOC)
chars_icu = list(iterate_breaks(s, cbi))

results(chars_icu, sep=SEP)

Number of tokens:  14
Segmentation boundaries: ရ · န် · ကု · န် · ကွ · န် · ပျူ · တ · ာ · တ · က္ · က · သို · လ်


## Syllable segmentation

Three options are available for syllable segmentation:

1. PyICU using rule-based break iteration.
2. Two Myanmar-specific packages:
    1. [pyidaungsu](https://github.com/kaunghtetsan275/pyidaungsu). Pyidaungsu provides syllable level segmentation for Myanmar (Burmese), S'gaw Karen, Shan and Mon.
    2. [python-myanmar](https://github.com/trhura/python-myanmar). This package provides both morphological and phonemic syllable breaks for Myanmar (Burmese) text.

### PyICU

_PyICU_ does not provide a syllable based break iterator, but it is possible to write a custom rule set for the break iterator which would perform syllable level segmentation.

We are using `MyanmarSyllable.rbbi` from the [Lucene](https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/data/uax29/MyanmarSyllable.rbbi) project.

In [9]:
with open('../data/rbbi/MyanmarSyllable.rbbi', encoding='utf-8') as f:
    rbbi = f.read()

sbi = RuleBasedBreakIterator(rbbi)
syl_icu = list(iterate_breaks(s, sbi))  
results(syl_icu, sep=SEP)


Number of tokens:  8
Segmentation boundaries: ရန် · ကုန် · ကွန် · ပျူ · တာ · တ · က္က · သိုလ်


### Pyidaungsu

In [10]:
syl_pyidaungsu = pds.tokenize(s)
# pds.tokenize() can take a lang parameter, which defaults to 'mm'
# i.e. pds.tokenize(s, lang="mm")

results(syl_pyidaungsu, sep=SEP)

Number of tokens:  7
Segmentation boundaries: ရန် · ကုန် · ကွန် · ပျူ · တာ · တက္က · သိုလ်


### Python-myanmar

The python-myanmar module supports both phonemic and morphological syllables.

In [11]:
# Myanmar (Burmese) phonemic syllables - using python-myanmar
syl_py_myan_phonemic = list(PhonemicSyllableBreak(s, UnicodeEncoding()))
results(syl_py_myan_phonemic, pymyan=True, sep=SEP)

# Myanmar (Burmese) morphological syllables - using python-myanmar
syl_py_myan_morpho = list(MorphoSyllableBreak(s, UnicodeEncoding()))
results(syl_py_myan_morpho, pymyan=True, sep=SEP)

Number of tokens:  8
Segmentation boundaries: ရန် · ကုန် · ကွန် · ပျူ · တာ · တက္ · က · သိုလ်
Number of tokens:  12
Segmentation boundaries: ရ · န် · ကု · န် · ကွ · န် · ပျူ · တာ · တ · က္က · သို · လ်


### Regular expression syllabification



In [44]:
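def parse_regex(text, pattern):
    # Mark each syllable start with a middle dot, then split on the marker
    r = re.sub(pattern, r"·\1", text)
    # startswith() is safe for empty input, unlike indexing r[0]
    if r.startswith("·"):
        r = r[1:]
    return r.split("·")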

# Regex pattern from ReSegment: https://github.com/swanhtet1992/ReSegment
resegment_pattern = r'(?:(?<!္)([က-ဪဿ၊-၏]|[၀-၉]+|[^က-၏]+)(?![ှျ]?[့္်]))'
syl_resegment = parse_regex(s, resegment_pattern)
results(syl_resegment, sep=SEP)

Number of tokens:  7
Segmentation boundaries: ရန် · ကုန် · ကွန် · ပျူ · တာ · တက္က · သိုလ်
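To make the logic of the pattern easier to follow, the sketch below restates it with explicit `\u` escapes and collects match positions instead of inserting a marker character. It is an illustrative rewrite that uses the standard library `re` module rather than _regex_, on the assumption that both treat the pattern identically for this input. The names `SYLLABLE_START` and `split_syllables` are hypothetical and not part of ReSegment.

```python
import re  # standard library module, not the third-party regex package

SYLLABLE_START = re.compile(
    "(?<!\u1039)"                              # not stacked under a virama
    "(?:[\u1000-\u102A\u103F\u104A-\u104F]"    # consonant, independent vowel, great sa or punctuation
    "|[\u1040-\u1049]+"                        # or a run of Myanmar digits
    "|[^\u1000-\u104F]+)"                      # or a run of non-Myanmar characters
    "(?![\u103E\u103B]?[\u1037\u1039\u103A])"  # not followed by dot below, virama or asat
)

def split_syllables(text):
    """Slice text at every position where a syllable is judged to start."""
    if not text:
        return []
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]

sample = (
    "\u101b\u1014\u103a\u1000\u102f\u1014\u103a"
    "\u1000\u103d\u1014\u103a\u1015\u103b\u1030\u1010\u102c"
    "\u1010\u1000\u1039\u1000\u101e\u102d\u102f\u101c\u103a"
)
syllables = split_syllables(sample)
print(len(syllables))  # 7
```

Working from match positions avoids reserving the middle dot as a separator, so input that already contains that character is sliced correctly.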


### Variation in syllable algorithms



|Technique  |Count  |Segmentation |
|---------- |------ |------------ |
|ICU custom rules |8 |ရန် · ကုန် · ကွန် · ပျူ · တာ · တ · က္က · သိုလ် |
|Pyidaungsu |7 |ရန် · ကုန် · ကွန် · ပျူ · တာ · တက္က · သိုလ် |
|python-myanmar (phonemic) |8 |ရန် · ကုန် · ကွန် · ပျူ · တာ · တက္ · က · သိုလ် |
|python-myanmar (morphological) |12 |ရ · န် · ကု · န် · ကွ · န် · ပျူ · တာ · တ · က္က · သို · လ် |
|ReSegment |7 |ရန် · ကုန် · ကွန် · ပျူ · တာ · တက္က · သိုလ် |
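
One way to see exactly where the algorithms disagree is to convert each token list into a set of boundary offsets and compare the sets. The sketch below does this for two rows of the table, with the tokens written as escape sequences; `boundaries` is a hypothetical helper name.

```python
from itertools import accumulate

def boundaries(tokens):
    """Return the set of code point offsets at which each token ends."""
    return set(accumulate(len(t) for t in tokens))

# The first five syllables and the last one are identical in both rows.
shared = [
    "\u101b\u1014\u103a",        # ရန်
    "\u1000\u102f\u1014\u103a",  # ကုန်
    "\u1000\u103d\u1014\u103a",  # ကွန်
    "\u1015\u103b\u1030",        # ပျူ
    "\u1010\u102c",              # တာ
]
last = "\u101e\u102d\u102f\u101c\u103a"  # သိုလ်

icu_rules = shared + ["\u1010", "\u1000\u1039\u1000", last]  # တ · က္က
pyidaungsu = shared + ["\u1010\u1000\u1039\u1000", last]     # တက္က

# Boundaries present in one segmentation but not the other.
disputed = sorted(boundaries(icu_rules) ^ boundaries(pyidaungsu))
print(disputed)  # [17]
```

The single disputed boundary at offset 17 falls between တ and the stacked က္က, which is the only point on which these two rows differ.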

## Word segmentation

### PyICU

In [12]:
wbi = BreakIterator.createWordInstance(LOC)
words_icu = list(iterate_breaks(s, wbi))

results(words_icu, sep=SEP)

Number of tokens:  4
Segmentation boundaries: ရန် · ကုန် · ကွန်ပျူတာ · တက္ကသိုလ်


### Pyidaungsu

In [13]:
word_pyidaungsu = pds.tokenize(s, form="word")

results(word_pyidaungsu, sep=SEP)

Number of tokens:  3
Segmentation boundaries: ရန်ကုန် · ကွန်ပျူတာ · တက္ကသိုလ်


### Variation in word segmentation algorithms

|Technique  |Count  |Segmentation |
|---------- |------ |------------ |
|PyICU |4 |ရန် · ကုန် · ကွန်ပျူတာ · တက္ကသိုလ် |
|Pyidaungsu |3 |ရန်ကုန် · ကွန်ပျူတာ · တက္ကသိုလ် |

The ideal word segmentation for `s` would be `ရန်ကုန် · ကွန်ပျူတာ · တက္ကသိုလ်`, where __ရန်ကုန်__ is the endonym for the city Yangon.

_ICU_ uses a dictionary-based method for Myanmar word segmentation, and the results are dependent on the quality of the dictionary.
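
To illustrate why dictionary coverage matters, here is a toy greedy longest-match segmenter that merges syllables into words using a lexicon. It is a sketch only: it is not the algorithm that ICU or Pyidaungsu implements, and the lexicon contents and the name `longest_match` are invented for the example.

```python
def longest_match(syllables, lexicon, max_len=4):
    """Greedily merge syllables into the longest words found in lexicon."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

syllables = [
    "\u101b\u1014\u103a", "\u1000\u102f\u1014\u103a",                    # ရန် ကုန်
    "\u1000\u103d\u1014\u103a", "\u1015\u103b\u1030", "\u1010\u102c",    # ကွန် ပျူ တာ
    "\u1010\u1000\u1039\u1000", "\u101e\u102d\u102f\u101c\u103a",        # တက္က သိုလ်
]
yangon = "".join(syllables[0:2])
computer = "".join(syllables[2:5])
university = "".join(syllables[5:7])

full_lexicon = {yangon, computer, university}    # hypothetical entries
partial_lexicon = {computer, university}         # same lexicon without ရန်ကုန်

print(len(longest_match(syllables, full_lexicon)))     # 3
print(len(longest_match(syllables, partial_lexicon)))  # 4
```

Removing a single entry from the lexicon changes the result from three words to four, which is the kind of sensitivity to dictionary quality described above.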