# Lao word tokenisation (segmentation)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enabling-languages/libraries-transformation/blob/main/notebooks/Segment_Lao.ipynb)


The code below uses ICU and LaoNLP to perform word segmentation, so that the output of the two engines can be compared.

## Setup


In [1]:
%%capture

try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

if IN_COLAB:
  !pip install laonlp
  !pip install pyicu
  !pip install -U git+https://github.com/enabling-languages/el_internationalisation.git#egg=el_internationalisation
  !pip install -U git+https://github.com/enabling-languages/el_utilities.git#egg=el_utilities

from laonlp.tokenize import word_tokenize as lao_wt
from icu import BreakIterator, Locale, RuleBasedBreakIterator
import regex as re
import el_utilities as elu
import el_internationalisation as eli


In [3]:
DEFAULT_NORMALISATION_FORM: str = "NFM"

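def laonlp_tokenise(s, sep):
    s = sep.join(lao_wt(s))
    # Collapse runs of whitespace created by joining the tokens.
    s = re.sub(r"\s{2,}", " ", s)
    # Reattach sentence punctuation and quotes to the preceding token.
    s = re.sub(r'\s([?.!"](?:\s|$))', r'\1', s)
    # Reattach hyphens, e.g. in year ranges such as 2008-2009.
    s = re.sub(r'\s([\-])(?:\s|$)', r'\1', s)
    return s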

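def iterate_breaks(text, bi):
    """Yield the substrings of text that lie between successive boundaries of bi."""
    bi.setText(text)
    lastpos = 0
    while True:
        next_boundary = bi.nextBoundary()
        # nextBoundary() returns -1 (BreakIterator.DONE) once the text is exhausted.
        if next_boundary == -1:
            return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary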

bi = BreakIterator.createWordInstance(Locale('lo_LA'))
def icu_tokenise(s, sep, iterator=bi):
    s = sep.join(list(iterate_breaks(s, iterator)))
    s = re.sub(r"\s{2,}", " ", s)
    s = re.sub(r'\s([?.!\"](?:\s|$))', r'\1', s)
    s = re.sub(r'\s([\-])(?:\s|$)', r'\1', s)
    return s

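def segment_words(text, engine="icu", sep="\u0020"):
    engine = engine.lower()
    if engine == "icu":
        return icu_tokenise(text, sep)
    elif engine == "laonlp":
        return laonlp_tokenise(text, sep)
    # Fail loudly instead of silently returning None for an unknown engine.
    raise ValueError(f"Unsupported engine: {engine!r}")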
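
Both tokenisers finish with the same three regular-expression substitutions. The sketch below isolates that clean-up step so its effect can be seen without installing ICU or LaoNLP. It is an illustration only: `tidy_spacing` is a hypothetical name, the token list is made up, and the standard library `re` module stands in for the `regex` package used in the notebook (these particular patterns should behave the same in both).

```python
import re

def tidy_spacing(s):
    # Collapse runs of two or more whitespace characters into one space.
    s = re.sub(r"\s{2,}", " ", s)
    # Drop the space before ? . ! or " when that mark ends a token.
    s = re.sub(r'\s([?.!"](?:\s|$))', r"\1", s)
    # Drop the whitespace on both sides of a free-standing hyphen.
    s = re.sub(r"\s([\-])(?:\s|$)", r"\1", s)
    return s

# Made-up token list of the kind a word breaker may return:
# punctuation and whitespace come back as separate tokens.
tokens = ["2008", "-", "2009", " ", "report", "."]
joined = " ".join(tokens)
result = tidy_spacing(joined)
print(repr(joined))
print(repr(result))
```

Joining with a space puts extra spaces around every punctuation token, which is why the hyphen in a year range has to be glued back to its neighbours afterwards.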

In [4]:
# SEP = "\u2009·\u2009"
ZWSP = "\u200B"
with open('Lao.rbbi') as f:
    rbbi = f.read()
sbi = RuleBasedBreakIterator(rbbi)

def lao_syllabification(lao_text, bi, sep="|"):
    r = []
    for item in lao_text.split():
        # Only items containing Lao characters are segmented; others pass through.
        if re.search(r'\p{Lao}', item):
            r.append(sep.join(iterate_breaks(item, bi)))
        else:
            r.append(item)
    return " ".join(r)
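
The word and syllable functions above both depend on `iterate_breaks`, which converts a stream of boundary offsets into substrings. The sketch below shows that slicing logic in isolation. `FixedBoundaries` is a hypothetical stand-in that replays hand-picked offsets through the same `setText` / `nextBoundary` methods the notebook calls; it is not an ICU class, and the offsets are chosen by hand rather than computed.

```python
class FixedBoundaries:
    """Hypothetical stand-in for a break iterator; replays preset offsets."""

    def __init__(self, boundaries):
        self._boundaries = list(boundaries)
        self._index = 0

    def setText(self, text):
        # A real iterator would analyse the text; here we only rewind.
        self._index = 0

    def nextBoundary(self):
        # Return -1 when no boundaries remain, as the notebook code expects.
        if self._index >= len(self._boundaries):
            return -1
        pos = self._boundaries[self._index]
        self._index += 1
        return pos


def iterate_breaks(text, bi):
    bi.setText(text)
    lastpos = 0
    while True:
        next_boundary = bi.nextBoundary()
        if next_boundary == -1:
            return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary


text = "ບົດສະຫລຸບ"
# Hand-picked code point offsets after the 3rd, 5th and 9th characters.
pieces = list(iterate_breaks(text, FixedBoundaries([3, 5, 9])))
print("|".join(pieces))
```

Each yielded piece runs from the previous boundary up to the next one, so joining the pieces with an empty string reproduces the input.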

## Enter text

Sample text for testing: ບົດສະຫລຸບການຈັດຕັ້ງປະຕິບັດວຽກງານຮອບດ້ານສິກປີ 2008-2009 ແລະ ທິດທາງແຜນການສິກປີ 2009-2010

In [5]:
lao_data = input("Enter Lao text: ")

## Word segmentation with ICU

In [6]:
icu_seg = segment_words(lao_data, engine="icu", sep=" ")
print(icu_seg)
print(elu.el_transliterate(icu_seg, lang="lo", dir="forward", nf=DEFAULT_NORMALISATION_FORM))

ບົດ ສະຫລຸບ ການ ຈັດຕັ້ງ ປະຕິບັດ ວຽກງານ ຮອບ ດ້ານ ສິກ ປີ 2008-2009 ແລະ ທິດທາງ ແຜນການ ສິກ ປີ 2009-2010
Bot Sǭະຫລຸບ kangນ Chattang patibat vīakngān hǭp dān Sǭິກ pī 2008-2009 læ ທິດThāng Phǣnkān Sǭິກ pī 2009-2010


## Word segmentation with LaoNLP

In [7]:
laonlp_seg = segment_words(lao_data, engine="laonlp", sep=" ")
print(laonlp_seg)
print(elu.el_transliterate(laonlp_seg, lang="lo", dir="forward", nf=DEFAULT_NORMALISATION_FORM))

ບົດ ສະຫລຸບ ການຈັດຕັ້ງ ປະຕິບັດ ວຽກງານ ຮອບດ້ານ ສິກ ປີ 2008-2009 ແລະ ທິດທາງ ແຜນການສິກ ປີ 2009-2010
Bot Sǭະຫລຸບ kangນChattang patibat vīakngān hǭpdān Sǭິກ pī 2008-2009 læ ທິດThāng PhǣnkānSǭິກ pī 2009-2010


## Syllable segmentation with ICU

This section uses PyICU's rule-based break iteration.

PyICU does not provide a syllable break iterator, but a custom rule set can be supplied to `RuleBasedBreakIterator` to perform syllable-level segmentation.

We are using [Lao.rbbi](https://github.com/apache/solr/blob/e8e4245d9b36123446546ff15967ac95429ea2b0/lucene/analysis/icu/src/data/uax29/Lao.rbbi) from an older version of Solr.

The first cell inserts a zero width space (ZWSP, U+200B) at syllable boundaries. The result can then be used for syllable-based romanisation.

In [8]:
print(lao_syllabification(lao_data, sbi, sep=ZWSP))

ບົດ​ສະ​ຫລຸບ​ການ​ຈັດ​ຕັ້ງ​ປະ​ຕິບ​ັ​ດວຽກ​ງານ​ຮອບ​ດ້ານ​ສິກ​ປີ 2008-2009 ແລະ ທິດ​ທາງ​ແຜນ​ການ​ສິກ​ປີ 2009-2010


Visible syllable boundaries:

In [9]:
print(lao_syllabification(lao_data, sbi))

ບົດ|ສະ|ຫລຸບ|ການ|ຈັດ|ຕັ້ງ|ປະ|ຕິບ|ັ|ດວຽກ|ງານ|ຮອບ|ດ້ານ|ສິກ|ປີ 2008-2009 ແລະ ທິດ|ທາງ|ແຜນ|ການ|ສິກ|ປີ 2009-2010
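
Once syllable boundaries are marked with ZWSP, a later step such as syllable-based romanisation needs to read them back. A minimal sketch of that step, assuming the text has already been marked: `split_syllables` is a hypothetical helper, not part of PyICU or LaoNLP, and the sample string is built by hand rather than produced by the break iterator.

```python
ZWSP = "\u200B"

def split_syllables(marked_text, sep=ZWSP):
    """Return one list of syllables per whitespace-delimited word."""
    # str.split() with no arguments splits on whitespace only; ZWSP is a
    # format character, not whitespace, so it survives this first split.
    return [
        [syllable for syllable in word.split(sep) if syllable]
        for word in marked_text.split()
    ]

# Hand-built sample: three Lao syllables joined by ZWSP, then a year range.
marked = "ບົດ" + ZWSP + "ສະ" + ZWSP + "ຫລຸບ" + " 2008-2009"
words = split_syllables(marked)
print(words)
```

Keeping the syllables grouped by word preserves the visible spaces of the original text, which matter when the romanised output is reassembled.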
