# Lao word tokenisation (segmentation)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enabling-languages/libraries-transformation/blob/main/notebooks/Segment_Lao.ipynb)


The code below segments the same Lao text into words with two engines, ICU (through PyICU) and LaoNLP, so that their output can be compared.

## Setup


In [11]:
%%capture
!pip install laonlp
!pip install pyicu

from laonlp.tokenize import word_tokenize as lao_wt
from icu import BreakIterator, Locale
import regex as re  # third-party "regex" package; the patterns used below are also valid for the standard-library re

In [8]:
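def laonlp_tokenise(s, sep):
    # Join the LaoNLP tokens with the chosen separator.
    s = sep.join(lao_wt(s))
    # Collapse runs of whitespace into a single space.
    s = re.sub(r"\s{2,}", " ", s)
    # Drop the whitespace before ? . ! " when the mark ends a word or the string.
    s = re.sub(r'\s([?.!"](?:\s|$))', r'\1', s)
    # Drop the whitespace around a spaced hyphen: "2008 - 2009" becomes "2008-2009".
    s = re.sub(r'\s([\-])(?:\s|$)', r'\1', s)
    return s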

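def iterate_breaks(text, bi):
    """Yield the pieces of text that lie between successive ICU boundaries."""
    # Note: ICU is understood to report offsets in UTF-16 code units; these line up
    # with Python string indices for text in the Basic Multilingual Plane, such as Lao.
    bi.setText(text)
    lastpos = 0
    while True:
        next_boundary = bi.nextBoundary()
        # -1 signals that the iterator has run out of boundaries.
        if next_boundary == -1:
            return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary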

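# Word-level break iterator for the Lao (Laos) locale, shared by default across calls.
bi = BreakIterator.createWordInstance(Locale('lo_LA'))
def icu_tokenise(s, sep, iterator=bi):
    # Join the ICU segments with the chosen separator.
    s = sep.join(iterate_breaks(s, iterator))
    # Same clean-up as laonlp_tokenise: collapse whitespace, then reattach
    # closing punctuation and spaced hyphens.
    s = re.sub(r"\s{2,}", " ", s)
    s = re.sub(r'\s([?.!"](?:\s|$))', r'\1', s)
    s = re.sub(r'\s([\-])(?:\s|$)', r'\1', s)
    return s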

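def segment_words(text, engine="icu", sep="\u0020"):
    """Segment text with the named engine ("icu" or "laonlp") and join the words with sep."""
    engine = engine.lower()
    if engine == "icu":
        return icu_tokenise(text, sep)
    elif engine == "laonlp":
        return laonlp_tokenise(text, sep)
    # Fail loudly instead of silently returning None for an unknown engine.
    raise ValueError(f"Unsupported engine: {engine!r} (expected 'icu' or 'laonlp')")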
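
Both tokenisers finish with the same three regular-expression clean-up steps. As a rough illustration only, the sketch below applies those steps to plain ASCII strings so that their effect is easier to see. `tidy` is a hypothetical helper name that is not part of the notebook, and it uses the standard-library `re` rather than `regex` so that it runs without extra installs.

```python
import re


def tidy(s):
    """Apply the three clean-up substitutions used by both tokenisers (illustrative helper)."""
    s = re.sub(r"\s{2,}", " ", s)                  # collapse runs of whitespace
    s = re.sub(r'\s([?.!"](?:\s|$))', r'\1', s)    # reattach closing punctuation
    s = re.sub(r'\s([\-])(?:\s|$)', r'\1', s)      # close up a spaced hyphen
    return s


print(tidy("one   two"))      # one two
print(tidy("really ?"))       # really?
print(tidy("2008 - 2009"))    # 2008-2009
```

The last case suggests why year ranges such as 2008-2009 come back as a single unit in the results below, even if an engine emits the hyphen as a separate token.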

## Enter text

In [3]:
lao_data = input("Enter Lao text: ")

Enter Lao text: ບົດສະຫລຸບການຈັດຕັ້ງປະຕິບັດວຽກງານຮອບດ້ານສິກປີ 2008-2009 ແລະ ທິດທາງແຜນການສິກປີ 2009-2010


## Segmentation with ICU

In [9]:
segment_words(lao_data, engine="icu", sep=" ")

'ບົດ ສະຫລຸບ ການ ຈັດຕັ້ງ ປະຕິບັດ ວຽກງານ ຮອບ ດ້ານ ສິກ ປີ 2008-2009 ແລະ ທິດທາງ ແຜນ ການ ສິກ ປີ 2009-2010'

## Segmentation with LaoNLP

In [10]:
segment_words(lao_data, engine="laonlp", sep=" ")

'ບົດ ສະຫລຸບ ການຈັດຕັ້ງ ປະຕິບັດ ວຽກງານ ຮອບດ້ານ ສິກ ປີ 2008-2009 ແລະ ທິດທາງ ແຜນການສິກ ປີ 2009-2010'
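
Since the aim is to compare the two engines, it can help to list only the tokens on which they disagree. The following is a minimal sketch under stated assumptions, not part of the original notebook: `token_spans` and `compare_segmentations` are hypothetical helper names, and the approach assumes both outputs contain exactly the same characters once the separator is removed.

```python
def token_spans(segmented, sep=" "):
    """Return (start, end, token) triples; offsets are counted in the text without separators."""
    spans, pos = [], 0
    for token in segmented.split(sep):
        if token:  # skip empty strings produced by repeated separators
            spans.append((pos, pos + len(token), token))
            pos += len(token)
    return spans


def compare_segmentations(a, b, sep=" "):
    """Return the tokens of a, and of b, whose spans the other segmentation does not share."""
    if a.replace(sep, "") != b.replace(sep, ""):
        raise ValueError("the two strings do not segment the same text")
    spans_a, spans_b = token_spans(a, sep), token_spans(b, sep)
    shared = set(spans_a) & set(spans_b)
    only_a = [tok for start, end, tok in spans_a if (start, end, tok) not in shared]
    only_b = [tok for start, end, tok in spans_b if (start, end, tok) not in shared]
    return only_a, only_b


# The two results shown above, copied verbatim.
icu_out = 'ບົດ ສະຫລຸບ ການ ຈັດຕັ້ງ ປະຕິບັດ ວຽກງານ ຮອບ ດ້ານ ສິກ ປີ 2008-2009 ແລະ ທິດທາງ ແຜນ ການ ສິກ ປີ 2009-2010'
laonlp_out = 'ບົດ ສະຫລຸບ ການຈັດຕັ້ງ ປະຕິບັດ ວຽກງານ ຮອບດ້ານ ສິກ ປີ 2008-2009 ແລະ ທິດທາງ ແຜນການສິກ ປີ 2009-2010'

only_icu, only_laonlp = compare_segmentations(icu_out, laonlp_out)
print("ICU only:   ", only_icu)
print("LaoNLP only:", only_laonlp)
```

On the two outputs above, the LaoNLP-only list should contain ການຈັດຕັ້ງ, ຮອບດ້ານ and ແຜນການສິກ, each of which ICU splits into shorter tokens. Comparing spans rather than bare token strings keeps repeated words such as ສິກ from being matched against the wrong occurrence.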