# TURJUMAN
TURJUMAN is a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). It builds on the recently introduced text-to-text Transformer [AraT5 model](https://github.com/UBC-NLP/araT5) (Nagoudi et al., 2022), which gives it a strong ability to decode into Arabic. The toolkit supports several decoding methods, which also makes it suitable for generating paraphrases of the MSA translations. To train TURJUMAN, we sample from publicly available parallel data and apply a simple semantic similarity method to ensure data quality.

---


https://github.com/UBC-NLP/turjuman

## Install requirements

In [1]:
!pip install -U git+https://github.com/UBC-NLP/turjuman.git --q

  Building wheel for turjuman (setup.py) ... done


## Turjuman Command Line Interface (CLI)
The Turjuman CLI supports two types of input:
- **-t** or **--text**: Write your input text directly on the command line. The translation is printed to the terminal.
- **-f** or **--input_file**: Read the input text from a file. The translations are saved to a JSON file.
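
The two input modes can also be driven from a script. The following is a minimal sketch, assuming the `turjuman_translate` executable is on the `PATH`. The helper name `build_turjuman_command` is hypothetical; it only assembles an argument list from the two flags listed above, and the rule that exactly one input mode must be given is a choice made for this sketch, not documented CLI behaviour.

```python
def build_turjuman_command(text=None, input_file=None):
    """Assemble the argument list for turjuman_translate (hypothetical helper)."""
    # This sketch insists on exactly one of the two input modes.
    if (text is None) == (input_file is None):
        raise ValueError("Pass exactly one of text or input_file")
    if text is not None:
        return ["turjuman_translate", "--text", text]
    return ["turjuman_translate", "--input_file", input_file]


cmd = build_turjuman_command(text="This has got to be one of the best stores in the world !")
print(cmd)
# To run it for real (the model is downloaded on first use):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with punctuation in the input sentence.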

### (1) Usage and Arguments


In [2]:
 !turjuman_translate -h

usage: turjuman_translate [-h] [-t TEXT] [-f INPUT_FILE] [-m SEARCH_METHOD]
                          [-s SEQ_LENGTH] [-o MAX_OUTPUTS] [-b NUM_BEAMS]
                          [-n NO_REPEAT_NGRAM_SIZE] [-k TOP_K] [-p TOP_P]
                          [-c CACHE_DIR] [-l LOGGING_FILE]

Turjuman Translate Command Line Interface (CLI)

optional arguments:
  -h, --help            show this help message and exit
  -t TEXT, --text TEXT  Translate the input text
  -f INPUT_FILE, --input_file INPUT_FILE
                        Translate the input file
  -m SEARCH_METHOD, --search_method SEARCH_METHOD
                        Turjuman translation search method should be one of
                        the follows ['greedy', 'beam', 'sampling'], default
                        value is beam search
  -s SEQ_LENGTH, --seq_length SEQ_LENGTH
                        The maximum sequence length value, default vlaue is
                        512
  -o MAX_OUTPUTS, --max_outputs MAX_OUTPUTS
                

### (2) Translate using beam search (default)
- **Beam search** is the *default* generation method in Turjuman.
- Beam search default settings:
  - **-s** or **--seq_length**: The maximum sequence length (*default value is 512*)
  - **-o** or **--max_outputs**: The maximum number of output translations (*default value is 1*)
  - **-b** or **--num_beams**: Number of beams (*default value is 1*)
  - **-n** or **--no_repeat_ngram_size**: Size of the n-grams that may not appear twice in the output (*default value is 2*)

In [3]:
 # Beam search is the default generation method on Turjuman
 !turjuman_translate --text "This has got to be one of the best stores in the world !"

2022-05-13 01:49:37 | INFO | turjuman.translate_cli | Turjuman Translate Command Line Interface
2022-05-13 01:49:37 | INFO | turjuman.translate_cli | Translate from input sentence
2022-05-13 01:49:37 | INFO | turjuman.translate_cli | Loading model from UBC-NLP/turjuman
Downloading: 100% 1.85k/1.85k [00:00<00:00, 1.20MB/s]
Downloading: 100% 565/565 [00:00<00:00, 317kB/s]
Downloading: 100% 2.32M/2.32M [00:01<00:00, 1.66MB/s]
Downloading: 100% 1.74k/1.74k [00:00<00:00, 1.30MB/s]
Downloading: 100% 565/565 [00:00<00:00, 416kB/s]
Downloading: 100% 1.05G/1.05G [00:56<00:00, 20.0MB/s]
2022-05-13 01:50:57 | INFO | turjuman.translate_cli | Using beam search
source: This has got to be one of the best stores in the world !
target: لقد أصبح هذا واحد من أفضل المتاجر في العالم!
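
To illustrate what the `--no_repeat_ngram_size` option controls, here is a small sketch of how a no-repeat n-gram constraint is commonly enforced during decoding. It is an illustration of the general idea, not Turjuman's own code, and the function name `banned_next_tokens` is made up for this example. Given the tokens generated so far, it returns the tokens that would complete an n-gram that has already occurred.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the n-gram the next token would complete.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier n-gram starts with the same prefix, its last token is banned.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


history = ["the", "best", "of", "the"]
print(banned_next_tokens(history, 2))
```

With an n-gram size of 2, the bigram "the best" has already been produced, so "best" may not follow the second "the".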


### (3) Translate using greedy search
- Greedy search default settings:
  - **-s** or **--seq_length**: The maximum sequence length (*default value is 512*)

In [4]:
!turjuman_translate --search_method greedy --text "This has got to be one of the best stores in the world !"

2022-05-13 01:51:06 | INFO | turjuman.translate_cli | Turjuman Translate Command Line Interface
2022-05-13 01:51:06 | INFO | turjuman.translate_cli | Translate from input sentence
2022-05-13 01:51:06 | INFO | turjuman.translate_cli | Loading model from UBC-NLP/turjuman
2022-05-13 01:51:21 | INFO | turjuman.translate_cli | Using greedy search
source: This has got to be one of the best stores in the world !
target: هذا كان من أفضل المتاجر في العالم!
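
Greedy search commits to the single most probable token at each step, while beam search keeps several partial hypotheses alive and can therefore recover a sequence with a higher overall probability. The toy sketch below shows the difference. The next-token table `TOY_MODEL` and its probabilities are invented for illustration and are unrelated to the Turjuman model; in this toy, greedy search is simply beam search with one beam.

```python
import math

# Invented next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"z": 0.9, "w": 0.1},
}


def beam_search(model, start, steps, num_beams):
    """Keep the `num_beams` best partial sequences by log-probability at each step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Sort all extensions by score and keep only the best ones.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


greedy_tokens, greedy_score = beam_search(TOY_MODEL, "<s>", steps=2, num_beams=1)[0]
beam_tokens, beam_score = beam_search(TOY_MODEL, "<s>", steps=2, num_beams=2)[0]
print("greedy:", greedy_tokens, round(math.exp(greedy_score), 2))
print("beam:  ", beam_tokens, round(math.exp(beam_score), 2))
```

Greedy search follows "A" because it is the most probable first token and ends with probability 0.33, whereas two beams keep "B" alive and find the sequence ending in "z" with probability 0.36.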


### (4) Translate using sampling search
- Sampling search default settings:
  - **-s** or **--seq_length**: The maximum sequence length (*default value is 512*)
  - **-o** or **--max_outputs**: The maximum number of output translations (*default value is 1*)
  - **-k** or **--top_k**: Sample from the K most likely next words instead of all words (*default value is 50*)
  - **-p** or **--top_p**: Sample the next word from the smallest set of words whose cumulative probability mass exceeds p (*default value is 0.95*)

In [5]:
!turjuman_translate --search_method sampling --text "This has got to be one of the best stores in the world !"

2022-05-13 01:51:29 | INFO | turjuman.translate_cli | Turjuman Translate Command Line Interface
2022-05-13 01:51:29 | INFO | turjuman.translate_cli | Translate from input sentence
2022-05-13 01:51:29 | INFO | turjuman.translate_cli | Loading model from UBC-NLP/turjuman
2022-05-13 01:51:44 | INFO | turjuman.translate_cli | Using sampling search
source: This has got to be one of the best stores in the world !
target: - و لهذا هذا واحد من أفضل الدكاكين في العالم
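
The effect of `--top_k` and `--top_p` can be sketched on a toy distribution. The code below is an illustration of top-k followed by top-p (nucleus) filtering in plain Python, not Turjuman's implementation; the function name `top_k_top_p_filter` and the word probabilities are invented for this example.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass exceeds top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass > top_p:
            break
    # Renormalise so the surviving probabilities sum to 1.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


toy = {"store": 0.5, "shop": 0.3, "market": 0.15, "mall": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.7)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
choice = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, choice)
```

Here `top_k=3` drops "mall", and `top_p=0.7` then keeps only "store" and "shop", whose combined mass of 0.8 is the first to exceed 0.7. Because the next word is drawn at random from the surviving set, repeated runs of sampling can produce different translations.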


### (5) Read and translate text from file
- **-f** or **--input_file**: Read the input text from a file. The translations are saved to a JSON file.

In [6]:
!wget https://raw.githubusercontent.com/UBC-NLP/turjuman/main/examples/samples.txt

--2022-05-13 01:51:48--  https://raw.githubusercontent.com/UBC-NLP/turjuman/main/examples/samples.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278 [text/plain]
Saving to: ‘samples.txt’


2022-05-13 01:51:48 (4.72 MB/s) - ‘samples.txt’ saved [278/278]



In [7]:
 # Translate sentences imported from a file using the default beam search
 !turjuman_translate --input_file samples.txt

2022-05-13 01:51:53 | INFO | turjuman.translate_cli | Turjuman Translate Command Line Interface
2022-05-13 01:51:53 | INFO | turjuman.translate_cli | Translate from input file samples.txt
2022-05-13 01:51:53 | INFO | turjuman.translate_cli | Loading model from UBC-NLP/turjuman
2022-05-13 01:52:08 | INFO | turjuman.translate_cli | Using beam search
2022-05-13 01:52:17 | INFO | turjuman.translate_cli | The translation are saved on samples_Turjuman_translate.json


In [8]:
# Read the output file
import pandas as pd
pd.set_option('display.max_columns', None)  
pd.set_option('display.max_colwidth', None)
df = pd.read_json("samples_Turjuman_translate.json", orient='records', lines=True)
df

| | source | target |
|---|---|---|
| 0 | This has got to be one of the best stores in the world ! | لقد أصبح هذا واحد من أفضل المتاجر في العالم! |
| 1 | C’ ́etait un travail rapide, propre et de qualit ́e. pr ́evoyez de revenir. | كان هناك عمل سريع ، نظيف ونوعي. |
| 2 | direk burası berbat. | تَركَبُ مِن قِبل |
| 3 | Uma estrela, eu nunca lidei com pessoas t ̃ao insistentes, rudes e manipuladoras. | نجم واحد ، لم أحب أبدا أناسا مُصرين ووقحين |
| 4 | 날씨가 좋은데 산에 가자. | يُمكنك أن تُخبرنا كيف نُصبح مُ |

In [9]:
 # Translate sentences imported from a file using the default beam search, requesting 3 translations per sentence
 !turjuman_translate --input_file samples.txt --max_outputs 3

2022-05-13 01:52:25 | INFO | turjuman.translate_cli | Turjuman Translate Command Line Interface
2022-05-13 01:52:25 | INFO | turjuman.translate_cli | Translate from input file samples.txt
2022-05-13 01:52:25 | INFO | turjuman.translate_cli | Loading model from UBC-NLP/turjuman
2022-05-13 01:52:40 | INFO | turjuman.translate_cli | Using beam search
2022-05-13 01:52:49 | INFO | turjuman.translate_cli | The translation are saved on samples_Turjuman_translate.json


In [10]:
df = pd.read_json("samples_Turjuman_translate.json", orient='records', lines=True)
df

| | source | 3_targets |
|---|---|---|
| 0 | This has got to be one of the best stores in the world ! | [لقد أصبح هذا واحد من أفضل المتاجر في العالم!, لقد أصبح هذا واحدا من أفضل المتاجر في العالم!, لقد أصبح هذا من أفضل المتاجر في العالم!] |
| 1 | C’ ́etait un travail rapide, propre et de qualit ́e. pr ́evoyez de revenir. | [كان هناك عمل سريع ، نظيف ونوعي., كان هناك عمل سريع ، نظيف ومن نوع ما., كان هناك عمل سريع ، نظيف وجيد.] |
| 2 | direk burası berbat. | [تَركَبُ مِن قِبل , تَحمَلُ مِن قِبل , تَحمَلُهُ مِن قِ] |
| 3 | Uma estrela, eu nunca lidei com pessoas t ̃ao insistentes, rudes e manipuladoras. | [نجم واحد ، لم أحب أبدا أناسا مُصرين ووقحين, نجم واحد ، لم أحب أبدا أناسا مُصرّين ومُ, نجم واحد ، لم أحب أبدا أناسا مُصرّين ووقح] |
| 4 | 날씨가 좋은데 산에 가자. | [يُمكنك أن تُخبرنا كيف نُصبح مُ, يُمكنك أن تُخبرنا عن هذا., يُمكن أن تُستخدم كأدوات.] |
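
When several outputs are requested, each record carries a list of candidate translations, which can serve as paraphrases of one another. The sketch below flattens such records into one row per candidate. The key name `3_targets` is taken from the column shown above; the helper `flatten_targets` is hypothetical, and the record is copied from the first row of the output rather than read from the file.

```python
def flatten_targets(records, key):
    """Turn each record's list of translations into (source, rank, target) rows."""
    rows = []
    for record in records:
        for rank, target in enumerate(record[key], start=1):
            rows.append((record["source"], rank, target))
    return rows


records = [
    {
        "source": "This has got to be one of the best stores in the world !",
        "3_targets": [
            "لقد أصبح هذا واحد من أفضل المتاجر في العالم!",
            "لقد أصبح هذا واحدا من أفضل المتاجر في العالم!",
            "لقد أصبح هذا من أفضل المتاجر في العالم!",
        ],
    }
]
rows = flatten_targets(records, key="3_targets")
for source, rank, target in rows:
    print(rank, target)
```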
