# TURJUMAN
Turjuman is a neural machine translation toolkit that translates from 20 languages into Modern Standard Arabic (MSA). It is described in the paper [**TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation**](https://arxiv.org/abs/2206.03933). Turjuman builds on our [AraT5 model](https://github.com/UBC-NLP/araT5), which gives it a strong ability to decode into Arabic. The toolkit supports several decoding methods, which also makes it suitable for generating paraphrases of its MSA translations.

---


https://github.com/UBC-NLP/turjuman

## (1) Install Turjuman

In [52]:
!pip install git+https://github.com/UBC-NLP/turjuman.git --q

## (2) Initialize the Turjuman object

In [53]:
import logging
import os
from turjuman import turjuman

In [55]:
logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=os.environ.get("LOGLEVEL", "INFO").upper(),
)
logger = logging.getLogger("turjuman.translate")
cache_dir="/content/mycache"


In [56]:
torj = turjuman.turjuman(logger, cache_dir)

2022-05-18 19:00:11 | INFO | turjuman.translate | Loading model from UBC-NLP/turjuman


## (2) Translate using beam search (default)


- **Beam search** is the *default* generation method in Turjuman.
- Beam search default settings:
  - **seq_length**: The maximum sequence length (*default value is 300*)
  - **max_outputs**: The maximum number of output translations (*default value is 1*)
  - **num_beams**: The number of beams (*default value is 1*)
  - **no_repeat_ngram_size**: The size of n-grams that may not appear twice in the output (*default value is 2*)
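
To make the `no_repeat_ngram_size` setting more concrete, the sketch below shows the general idea behind n-gram blocking on a toy token list. It is an illustration only and not Turjuman's internal code; the helper name `banned_next_tokens` and the example tokens are hypothetical.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n - 1:
        return set()
    # The last n-1 tokens are the prefix of the n-gram that the next token completes.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier n-gram starts with the same prefix, its last token is blocked.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


history = ["the", "cat", "sat", "on", "the"]
print(banned_next_tokens(history, 2))  # the bigram ("the", "cat") may not repeat
print(banned_next_tokens(history, 3))  # no trigram would be repeated yet
```

With a size of 2, as in the default above, a decoder that applies this rule cannot emit the same two-word sequence twice, which tends to reduce looping output.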

In [57]:
beam_options = {"search_method":"beam", "seq_length": 300, "num_beams":5, "no_repeat_ngram_size":2, "max_outputs":1}
target = torj.translate("As US reaches one million COVID deaths, how are Americans coping?",**beam_options)
print (target)

2022-05-18 19:02:59 | INFO | turjuman.translate | Using beam search
2022-05-18 19:03:02 | INFO | turjuman.translate | Extract outputs


{'source': 'As US reaches one million COVID deaths, how are Americans coping?', 'target': ['وبينما تصل الولايات المتحدة إلى مليون حالة وفاة من فيروس كوفيد-19 ، كيف يتعامل الأمريكيون مع ذلك ؟']}


## (3) Translate using greedy search
- Greedy search default settings:
  - **seq_length**: The maximum sequence length (*default value is 300*)
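
The difference between greedy and beam search can be seen on a toy example. The sketch below is not Turjuman's decoder: `TOY_MODEL` is a made-up table of next-token probabilities, and `greedy_decode` and `beam_decode` are hypothetical helpers written only to show why keeping several beams can find a more probable sequence than always taking the single best next token.

```python
import math

# Hypothetical toy "model": next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}


def greedy_decode(model, steps):
    """Pick the single most probable next token at every step."""
    prefix, logp = (), 0.0
    for _ in range(steps):
        token, p = max(model[prefix].items(), key=lambda kv: kv[1])
        prefix, logp = prefix + (token,), logp + math.log(p)
    return prefix, math.exp(logp)


def beam_decode(model, steps, num_beams):
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), logp + math.log(p))
            for prefix, logp in beams
            for tok, p in model[prefix].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    best_prefix, best_logp = beams[0]
    return best_prefix, math.exp(best_logp)


print(greedy_decode(TOY_MODEL, steps=2))               # commits to "A" first
print(beam_decode(TOY_MODEL, steps=2, num_beams=2))    # also explores "B"
```

In this toy table, greedy search commits to `A` and ends with a sequence of probability about 0.33, while two beams recover `B x` with probability about 0.36. With a single beam, the beam procedure reduces to greedy search.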

In [59]:
greedy_options = {"search_method":"greedy", "seq_length": 300}
target = torj.translate("As US reaches one million COVID deaths, how are Americans coping?",**greedy_options)
print (target)

2022-05-18 19:04:37 | INFO | turjuman.translate | Using greedy search
2022-05-18 19:04:39 | INFO | turjuman.translate | Extract outputs


{'source': 'As US reaches one million COVID deaths, how are Americans coping?', 'target': ['وبما أن الولايات المتحدة تصل إلى مليون حالة وفاة من فيروس كوفيد-19 ، كيف يمكن للولايات المتحدة أن تتصدى لهذا ؟']}


## (4) Translate using sampling search
- Sampling search default settings:
  - **seq_length**: The maximum sequence length (*default value is 300*)
  - **max_outputs**: The maximum number of output translations (*default value is 1*)
  - **top_k**: Sample from the K most likely next words instead of all words (*default value is 50*)
  - **top_p**: Sample from the smallest set of next words whose cumulative probability mass exceeds p (*default value is 0.95*)
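
The two sampling filters described above can be sketched on a toy distribution. This is a simplified illustration of top-k followed by top-p (nucleus) filtering, not the code Turjuman runs; the function `top_k_top_p_filter` and the probabilities in `toy_probs` are hypothetical.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)  # renormalize after the top-k cut
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / k_total
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # Renormalize so the surviving candidates form a proper distribution.
    return {token: p / total for token, p in kept}


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)

# Draw the next word from the filtered distribution (seeded for repeatability).
rng = random.Random(0)
next_word = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, next_word)
```

Here `top_k=3` drops `d`, and `top_p=0.8` then drops `c`, so the next word is drawn only from `a` and `b`. Because the draw is random, repeated calls can yield different translations, which is what makes sampling useful for paraphrases.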

In [61]:
sampling_options = {"search_method":"sampling", "seq_length": 300, "max_outputs":1, "top_p":0.95, "top_k":50}
target = torj.translate("As US reaches one million COVID deaths, how are Americans coping?",**sampling_options)
print (target)

2022-05-18 19:09:12 | INFO | turjuman.translate | Using sampling search
2022-05-18 19:09:14 | INFO | turjuman.translate | Extract outputs


{'source': 'As US reaches one million COVID deaths, how are Americans coping?', 'target': ['وبما أن الولايات المتحدة تصل إلى مليون حالات وفاة بسبب كوفيد-19 ، كيف يعالج الأميركيون الأمر ؟']}


## (5) Read and translate text from file
- **input_file**: The file to read the source text from. The translations are saved to a JSON file.
- **batch_size**: The maximum number of source examples processed in one iteration (*default value is 25*)
- **gen_options**: Generation options
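
As a rough picture of what `batch_size` controls, the sketch below splits a list of source sentences into chunks of at most `batch_size` items. The helper `make_batches` is hypothetical, and Turjuman's own batching may be implemented differently; the point is only how the number of batches follows from the number of input lines.

```python
def make_batches(lines, batch_size=25):
    """Split `lines` into consecutive chunks of at most `batch_size` items."""
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]


# Nine placeholder sentences, the same count as the rows in the sample output.
sources = [f"sentence {i}" for i in range(9)]
print(len(make_batches(sources, batch_size=25)))  # everything fits in one batch
print([len(b) for b in make_batches(list(range(60)), batch_size=25)])
```

A file with nine lines and `batch_size=25` therefore needs a single batch, while 60 lines would be split into batches of 25, 25, and 10.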

In [62]:
!wget https://raw.githubusercontent.com/UBC-NLP/turjuman/main/examples/samples.txt

--2022-05-18 19:10:23--  https://raw.githubusercontent.com/UBC-NLP/turjuman/main/examples/samples.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 732 [text/plain]
Saving to: ‘samples.txt’


2022-05-18 19:10:23 (45.6 MB/s) - ‘samples.txt’ saved [732/732]



In [65]:
gen_options = {"search_method":"beam", "seq_length": 300, "num_beams":5, "no_repeat_ngram_size":2, "max_outputs":1}
torj.translate_from_file("samples.txt", batch_size=25, **gen_options)

2022-05-18 19:12:42 | INFO | turjuman.translate | Using beam search
2022-05-18 19:12:42 | INFO | turjuman.translate | Loading source text from file (samples.txt)


  0%|          | 0/1 [00:00<?, ?it/s]

2022-05-18 19:12:43 | INFO | turjuman.translate | Running tokenizer on source text


  0%|          | 0/1 [00:00<?, ?ba/s]


translate:   0%|          | 0/1 [00:00<?, ?it/s]
2022-05-18 19:12:46 | INFO | turjuman.translate | Translating with batch_size 25 and #batches = 1

translate: 100%|██████████| 1/1 [00:15<00:00, 15.35s/it]
2022-05-18 19:13:01 | INFO | turjuman.translate | Extract outputs
2022-05-18 19:13:01 | INFO | turjuman.translate | The translation are saved on samples_Turjuman_translate.json


In [66]:
#read the output file
import pandas as pd
pd.set_option('display.max_columns', None)  
pd.set_option('display.max_colwidth', None)
df = pd.read_json("samples_Turjuman_translate.json", orient='records', lines=True)
df

| | source | target |
|---|---|---|
| 0 | As US reaches one million COVID deaths, how are Americans coping? | وبينما تصل الولايات المتحدة إلى مليون حالة وفاة من فيروس كوفيد-19 ، كيف يتعامل الأمريكيون مع ذلك ؟ |
| 1 | Here is what you need to know. | إليكم ما تحتاجون إلى معرفته. |
| 2 | Это список суверенных государств и зависимых территорий в Азии . | هذه قائمة الدول ذات السيادة والأقاليم التابعة في آسيا. |
| 3 | U-901 è un sottomarino tedesco . | يو-901 هي غواصة ألمانية. |
| 4 | Όλες οι πτήσεις προς τα Νησιά Ανταμάν και Νικομπάρ διεξάγονται στο Διεθνές Αεροδρόμιο Βιρ Σαβαρκάρ . | جميع الرحلات إلى جزر عدن و نيكبار تتم عبر مطار فير سافاركار الدولي. |
| 5 | Bir tür sözel olmayan iletişim biçimidir ve sosyal davranış üzerinde büyük etkisi olduğu düşünülmektedir . | وهو نوع من التواصل غير الرسمي ، ويعتقد أنه له تأثير كبير على السلوك الاجتماعي. |
| 6 | Jeg kan betale for din datters behandling . | يمكنني أن أدفع ثمن علاج ابنتك |
| 7 | Strefa przemysłowa dla inwestycji zagranicznych . | قطاع الصناعات التحويلية للاستثمار الأجنبي. |
| 8 | क्या तुम्हें यकीन है कि वही है ? | هل أنت واثق من ذلك ؟ |
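
If pandas is not available, the same kind of output can be read with the standard library. The sketch below assumes the file holds one JSON object per line with `source` and `target` keys, which is inferred from the `lines=True` argument and the column names above rather than from a documented format. To stay self-contained, it parses an in-memory sample built from two of the rows shown instead of the real file.

```python
import io
import json

# In-memory stand-in for the output file (assumed JSON Lines layout).
sample = io.StringIO(
    '{"source": "Here is what you need to know.", '
    '"target": "إليكم ما تحتاجون إلى معرفته."}\n'
    '{"source": "U-901 è un sottomarino tedesco .", '
    '"target": "يو-901 هي غواصة ألمانية."}\n'
)

# Parse each non-empty line as one record.
records = [json.loads(line) for line in sample if line.strip()]

for record in records:
    print(record["source"], "->", record["target"])
```

To read the real output, replace `sample` with `open("samples_Turjuman_translate.json", encoding="utf-8")`, provided the file follows the layout assumed here.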
