# MS to EN Noisy

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/noisy-ms-en-translation](https://github.com/huseinzol05/Malaya/tree/master/example/noisy-ms-en-translation).
    
</div>

<div class="alert alert-warning">

This module was trained on standard language and augmented local language structures; proceed with caution.
    
</div>

In [1]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)

CPU times: user 6.16 s, sys: 1.33 s, total: 7.49 s
Wall time: 10.5 s


### List available Transformer models

In [2]:
malaya.translation.ms_en.available_transformer()

INFO:malaya.translation.ms_en:tested on 100k MS-EN test set generated from teacher semisupervised model, https://huggingface.co/datasets/mesolitica/ms-en
INFO:malaya.translation.ms_en:tested on FLORES200 MS-EN (zsm_Latn-eng_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200


| Model | Size (MB) | Quantized Size (MB) | BLEU | SacreBLEU Verbose | SacreBLEU-chrF++-FLORES200 | Suggested length |
|---|---|---|---|---|---|---|
| small | 42.7 | 13.4 | 59.874731 | 80.6/64.3/54.1/46.3 (BP = 0.998 ratio = 0.998 ... | 59.64 | 256 |
| base | 234.0 | 82.7 | 71.687583 | 86.2/74.8/67.2/61.0 (BP = 1.000 ratio = 1.005 ... | 63.24 | 256 |
| bigbird | 246.0 | 63.7 | 59.548257 | 79.6/63.8/53.8/46.0 (BP = 1.000 ratio = 1.026 ... | 62.49 | 1024 |
| small-bigbird | 50.4 | 13.1 | 55.967145 | 77.4/60.5/49.9/41.9 (BP = 1.000 ratio = 1.026 ... | 60.57 | 1024 |
| noisy-base | 234.0 | 82.7 | 71.725493 | 86.3/74.8/67.2/61.0 (BP = 1.000 ratio = 1.002 ... | 63.31 | 256 |


### Load Transformer models

```python
def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load Transformer encoder-decoder model to translate MS-to-EN.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'large'`` - Transformer LARGE parameters.
        * ``'bigbird'`` - BigBird BASE parameters.
        * ``'small-bigbird'`` - BigBird SMALL parameters.
        * ``'noisy-base'`` - Transformer BASE parameters trained on noisy dataset.

    quantized : bool, optional (default=False)
        If True, load the 8-bit quantized model.
        A quantized model is not necessarily faster; this depends entirely on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `bigbird` in model, return `malaya.model.bigbird.Translation`.
        * else, return `malaya.model.tf.compat.v1.Translation`.
    """
```

In [22]:
transformer = malaya.translation.ms_en.transformer()

In [None]:
transformer_noisy = malaya.translation.ms_en.transformer(model = 'noisy-base')

### Translate

#### Using greedy decoder

```python
def greedy_decoder(self, strings: List[str]):
    """
    translate list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
```

#### Using beam decoder

```python
def beam_decoder(self, strings: List[str], beam_size: int = 3, temperature: float = 0.5):
    """
    translate list of strings using beam decoder.
    Currently only `noisy` models support the `beam_size` and `temperature` parameters.

    Parameters
    ----------
    strings : List[str]
    beam_size: int, optional (default=3)
    temperature: float, optional (default=0.5)

    Returns
    -------
    result: List[str]
    """
```
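
As a rough illustration of how the signature above might be called, the sketch below wraps `beam_decoder` in a small helper. `translate_beam` and `EchoModel` are hypothetical names introduced here, not part of Malaya; the stand-in model only echoes its input so the snippet runs without downloading weights, and the `beam_size >= 1` check is an assumption rather than documented library behaviour.

```python
from typing import List


def translate_beam(model, strings: List[str], beam_size: int = 3,
                   temperature: float = 0.5) -> List[str]:
    """Call `model.beam_decoder` with explicit keyword arguments.

    `model` is any object exposing the `beam_decoder` signature shown above.
    The lower bound on `beam_size` is an assumption made for this sketch.
    """
    if beam_size < 1:
        raise ValueError('beam_size must be at least 1')
    return model.beam_decoder(strings, beam_size=beam_size, temperature=temperature)


class EchoModel:
    """Hypothetical stand-in; it does not translate, it only tags its input."""

    def beam_decoder(self, strings: List[str], beam_size: int = 3,
                     temperature: float = 0.5) -> List[str]:
        return [f'{s} [beam_size={beam_size}, temperature={temperature}]' for s in strings]


result = translate_beam(EchoModel(), ['nak gi mana tuu'], beam_size=5, temperature=0.7)
print(result)
```

With a loaded model, the same helper would presumably be called as `translate_beam(transformer_noisy, strings)`, since the docstring states that only the noisy models honour `beam_size` and `temperature`.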

**For better results, always split the input into sentences before translating**.
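
One way to split text at sentence boundaries is a small regex helper, sketched below. `split_sentences` is a hypothetical name, not a Malaya function, and the rule it uses (split after `.`, `!` or `?` when followed by whitespace and an uppercase letter or digit) is a naive assumption that will mis-split abbreviations such as `Dr.`.

```python
import re
from typing import List

# Split on whitespace that follows ., ! or ? and precedes an uppercase letter or digit.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9])')


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; returns non-empty, stripped sentences."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


text = 'Saya suka makan nasi. Dia pergi ke pasar! Apa khabar?'
sentences = split_sentences(text)
print(sentences)

# With a loaded model, the sentences could then be translated in one batch, e.g.
# translated = transformer_noisy.greedy_decoder(sentences)
```

Because the decoders accept a list of strings, the split sentences can be passed directly and joined back together afterwards if a single paragraph is needed.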

In [6]:
from pprint import pprint

In [7]:
# https://www.sinarharian.com.my/article/89678/BERITA/Politik/Saya-tidak-mahu-sentuh-isu-politik-Muhyiddin

string_news1 = 'TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.'
pprint(string_news1)

('TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh '
 'mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal '
 'kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas '
 'berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika '
 'berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan '
 'Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.')


In [8]:
# https://www.sinarharian.com.my/article/90021/BERITA/Politik/Tun-Mahathir-Anwar-disaran-bersara-untuk-selesai-kemelut-politik

string_news2 = 'ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.'
pprint(string_news2)

('ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila '
 'masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. '
 'Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya '
 'mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun '
 'Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri '
 'Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.')


In [9]:
string_news3 = 'Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan lanjutan tempoh.'
pprint(string_news3)

('Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, '
 'kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi '
 'mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat '
 'asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) '
 'pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan '
 'lanjutan tempoh.')


In [10]:
# https://qcikgubm.blogspot.com/2018/02/contoh-soalan-dan-jawapan-karangan.html

string_karangan = 'Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat membantu memberikan pengetahuan am tentang kerjaya ini'
pprint(string_karangan)

('Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang '
 'akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di '
 'Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang '
 'masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar '
 'berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi '
 'masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar '
 'disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki '
 'sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. '
 'Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk '
 'menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat '
 'membantu memberikan pengetahuan am tentang kerjaya ini')


In [11]:
%%time

pprint(transformer_noisy.greedy_decoder([string_news1, string_news2, string_news3, string_karangan]))

['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at the moment, instead focusing on the welfare of the '
 "people and efforts to revitalize the affected country's economy following "
 'the Covid-19 pandemic. The prime minister explained the matter when speaking '
 "at the Leader Meeting with the People's Assembly (DUN) leaders at the Bukit "
 'Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political crisis has not ended when it '
 'failed to finalize a mutually agreed Prime Minister. Sik Member of '
 'Parliament Ahmad Tarmizi Sulaiman said he had suggested former United '
 "Nations Indigenous Party (UN) chairman Tun Dr Mahathir Mohamad and People's "
 'Justice Party (PKR) president Datuk Seri Anwar Ibrahim resign from politics '
 'as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government realized the problems they were '
 'facing in renewing th

### Compare results using local language structure

In [1]:
strings = [
    'ak tak paham la',
    'jam 8 di pasar KK memang org ramai 😂, pandai dia pilih tmpt.',
    'Jadi haram jadah😀😃🤭',
    'nak gi mana tuu',
    'Macam nak ambil half day',
    "Bayangkan PH dan menang pru-14. Pastu macam-macam pintu belakang ada. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. Sumpah dah fk up dah.",
]

In [20]:
%%time

pprint(transformer_noisy.greedy_decoder(strings))

["I don't understand.",
 'At 8 in the KK market it is very crowded, he is good at choosing a place.',
 "So it's illegal",
 'Where is that?',
 "It's like taking half a day.",
 'Imagine PH and won 14. Then there are all kinds of back doors. Ismail '
 "Sabri's last time went up. That's why I don't give a fk about politics "
 'anymore. The oath has been poured up.']
CPU times: user 10.7 s, sys: 2.08 s, total: 12.8 s
Wall time: 2.78 s


In [21]:
%%time

pprint(transformer.greedy_decoder(strings))

["I don't understand it",
 "At 8 o'clock in the KK market, he is good at choosing tmpt.",
 "So it's illegal",
 'Where to go',
 'Like taking half day',
 'Imagine PH and winning pru-14. There are so many back doors available. '
 "Last-last Ismail Sabri went up. That's why I don't give a fk about politics "
 'anymore. The swear is fk up.']
CPU times: user 10.8 s, sys: 1.35 s, total: 12.1 s
Wall time: 2.71 s
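
To make the two outputs easier to compare, the translations can be lined up under each source string. The helper below is a minimal sketch; `side_by_side` is a hypothetical name, and the sample translations are copied from the `noisy-base` and `base` outputs above rather than produced by running the models.

```python
from typing import Dict, List


def side_by_side(sources: List[str], outputs: Dict[str, List[str]]) -> str:
    """Format each source string followed by one line per model translation."""
    lines = []
    for i, src in enumerate(sources):
        lines.append(f'source: {src}')
        for name, translations in outputs.items():
            lines.append(f'  {name}: {translations[i]}')
    return '\n'.join(lines)


sources = ['ak tak paham la', 'nak gi mana tuu']
outputs = {
    'noisy-base': ["I don't understand.", 'Where is that?'],
    'base': ["I don't understand it", 'Where to go'],
}
report = side_by_side(sources, outputs)
print(report)
```

In a live session, the dictionary values would presumably come from `transformer_noisy.greedy_decoder(strings)` and `transformer.greedy_decoder(strings)`.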


### Compare with Google Translate using googletrans

Install it with:

```bash
pip3 install googletrans==4.0.0rc1
```

In [3]:
from googletrans import Translator

translator = Translator()

In [6]:
for t in strings:
    r = translator.translate(t, src='ms', dest = 'en')
    print(r.text)

I don't understand
At 8 o'clock in the KK market is a lot of people 😂, he's good at choosing TMPT.
So it's illegal to make it
Where are you going
It's like taking half day
Imagine PH and won the GE-14.There must be all kinds of back doors.Last-last Ismail Sabri went up.That's why I don't give a fk about politics anymore.I swear it's up.
