<a href="https://colab.research.google.com/github/Kabongosalomon/RDC-Mobongoli/blob/main/jw300_utils/building_global_test_set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Setting Up the data

Downloading the global test set is simple,
we need to set english and your target as source and target language, then we find the intersection of the english test set with the target corpus after that we get the corresponding target sentences from the target corpus.

In [2]:
%%capture
!pip install opustools-pkg

# SET THE LANGUAGE CODE and other variables.

You need to change the value below for your language!

The language codes from the [JW300 corpus website](https://object.pouta.csc.fi/OPUS-JW300/v1/languages.json) are: 
```
{
    "language": "French - Français",
    "language_en": "French",
    "language_native": "Français",
    "language_short": "fr",
    "url": "https://wol.jw.org/fr/wol/pref/r30/lp-f?newrsconf=r30&newlib=lp-f&url="
}, 

{
    "language": "Lingala - Lingala",
    "language_en": "Lingala",
    "language_native": "Lingala",
    "language_short": "ln",
    "url": "https://wol.jw.org/ln/wol/pref/r126/lp-li?newrsconf=r126&newlib=lp-li&url="
},

{
    "language": "Tshiluba - Tshiluba",
    "language_en": "Tshiluba",
    "language_native": "Tshiluba",
    "language_short": "lua",
    "url": "https://wol.jw.org/lua/wol/pref/r477/lp-sh?newrsconf=r477&newlib=lp-sh&url="
},

{
    "language": "Kikongo - Kikongo",
    "language_en": "Kikongo",
    "language_native": "Kikongo",
    "language_short": "kwy",
    "url": "https://wol.jw.org/kwy/wol/pref/r128/lp-kg?newrsconf=r128&newlib=lp-kg&url="
},

{
    "language": "Swahili (Congo) - Kiswahili (Congo)",
    "language_en": "Swahili (Congo)",
    "language_native": "Kiswahili (Congo)",
    "language_short": "swc",
    "url": "https://wol.jw.org/swc/wol/pref/r143/lp-zs?newrsconf=r143&newlib=lp-zs&url="
  },
```
Already-created test sets: https://raw.githubusercontent.com/ai-drc/RDC-Mobongoli/main/jw300_utils/test/

In [3]:
import os
source_language = "ln"
target_language = "lua" # TODO: CHANGE THIS TO YOUR LANGUAGE! "ha" is hausa. See the language codes at https://opus.nlpl.eu/JW300.php
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.|
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# No need to use gdrive since we are training on gcp
!mkdir -p "$src-$tgt-$tag"
os.environ["gdrive_path"] = "%s-%s-%s" % (source_language, target_language, tag) # saving directly on the vm

In [4]:
!echo $gdrive_path

ln-lua-baseline


#### Downloading the corpus data

for precaution , am removing the old data

In [5]:
!rm -f w300.$src jw300.$tgt JW300_latest_xml_$src-$tgt.xml.gz JW300_latest_xml_$src-$tgt.xml JW300_latest_xml_$src.zip  JW300_latest_xml_$tgt.zip test.fr-any.fr

In [6]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q

# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt.xml.gz


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/ln-lua.xml.gz not found. The following files are available for downloading:

   3 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/ln-lua.xml.gz
  60 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/ln.zip
  32 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/lua.zip

  95 MB Total size
./JW300_latest_xml_ln-lua.xml.gz ... 100% of 3 MB
./JW300_latest_xml_ln.zip ... 100% of 60 MB
./JW300_latest_xml_lua.zip ... 100% of 32 MB


In [7]:
! wget https://raw.githubusercontent.com/ai-drc/RDC-Mobongoli/main/jw300_utils/test/test.$src-any.$src

  
# And the specific test set for this language pair.
os.environ["trg"] = target_language 
os.environ["src"] = source_language 

--2021-07-20 15:31:15--  https://raw.githubusercontent.com/ai-drc/RDC-Mobongoli/main/jw300_utils/test/test.ln-any.ln
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 259242 (253K) [text/plain]
Saving to: ‘test.ln-any.ln’


2021-07-20 15:31:15 (4.62 MB/s) - ‘test.ln-any.ln’ saved [259242/259242]



In [8]:
# Read the test data to filter from train and dev splits.
# Store english portion in set for quick filtering checks.
en_test_sents = set()
filter_test_sents = f"test.{source_language}-any.{source_language}"
j = 0
with open(filter_test_sents) as f:
  for line in f:
    en_test_sents.add(line.strip())
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 2933 global test sentences to filter from the training/dev data.


In [9]:
!ls

drive			     JW300_latest_xml_lua.zip  ln-lua-baseline
JW300_latest_xml_ln-lua.xml  jw300.ln		       sample_data
JW300_latest_xml_ln.zip      jw300.lua		       test.ln-any.ln


#### Building the corpus

In the 2 cells below you can check if the 2 datasets are aligned. Even if you don't speak the language you can get a sense, especially with similar words, punctuation, and so forth.

In [10]:
! head -5 jw300.$src

Nsango malamu esengeli kosakolama
LIBOSO bábimisa telegrame , bansango ya mosika ezalaki koumela mingi mpo ekóma , mpe ezalaki kokóma na mpasi , na kotalela ntaka mpe lolenge ya nzela .
Tózwa ndakisa ya Ba - Incas , oyo bazalaki na Ampire moko monene na Amerika ya Sudi .
Na nsuka ya bambula 1400 mpe na ebandeli ya bambula ya 1500 , ntango Ampire yango ekómaki na nguya mingi , mokili na yango esangisaki bisika oyo ezali lelo oyo Argentine , Bolivie , Chili , Colombie , Équateur mpe Pérou , epai Cuzco , mboka mokonzi ya Ampire yango ezalaki .
Bangomba milaimilai , bazamba ya mineneminene mpe bantaka milaimilai ezalaki kokómisa mibembo mpasi .


In [11]:
! head -5 jw300.$tgt

Mukenji udi anu ne bua kufika !
KUMPALA kua kupatulabu biamu bia telefone ya kale , kuvua lutatu lukole bua kufikisha mukenji kampanda kudi bantu ba miaba ya kule pa lukasa bua mishindu ivua bantu benza ngendu ne bua njila .
Tshilejilu , mona lutatu luvua nalu bena Inca mu buloba buabu bunene bua mu Amerike wa ku Sud .
Ku ndekelu kua bidimu bia 1400 ne ku ntuadijilu kua bidimu bia 1500 , dîba divua bukalenge bua bena Inca butante bikole , buvua bukuate miaba idi lelu ditunga dia Argentine , dia Bolivie , dia Chili , dia Colombie , dia Équateur ne dia Pérou , muaba uvua tshimenga tshiabu tshikulu tshia Cuzco .
Bantu bavua ne lutatu lua kuendakana , bualu njila ivua ne mikuna mipite bule milondangane , ne metu malabale ne ntanta ivua mipite bule .


In [12]:
import pandas as pd

# TMX file to dataframe
source_file = 'jw300.' + source_language  ## source language is english
target_file = 'jw300.' + target_language ## Target is whatever you set. For our example it was ha, so jw300.ha
target_test = {}
source = []
target = []
english_sentences_in_global_test_set = {}  # Collect the line numbers of the source portion to skip the same lines for the target portion.
with open(source_file) as src_f:
    for i, line in enumerate(src_f):
        # Skip sentences that are contained in the test set and add them into the new frencg test
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            # Here is the intersection with the global test set
            english_sentences_in_global_test_set[i] = line.strip()           
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in english_sentences_in_global_test_set.keys():
            target.append(line.strip())
        else:
            #Collecting the aligned test sentences
            target_test[j] = line.strip()
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(english_sentences_in_global_test_set.keys()), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
# if you get TypeError: data argument can't be an iterator is because of your zip version run this below
#df = pd.DataFrame(list(zip(source, target)), columns=['source_sentence', 'target_sentence'])
df.head(10)

Loaded data and skipped 8170/318244 lines since contained in test set.


Unnamed: 0,source_sentence,target_sentence
0,Nsango malamu esengeli kosakolama,Mukenji udi anu ne bua kufika !
1,"LIBOSO bábimisa telegrame , bansango ya mosika...",KUMPALA kua kupatulabu biamu bia telefone ya k...
2,"Tózwa ndakisa ya Ba - Incas , oyo bazalaki na ...","Tshilejilu , mona lutatu luvua nalu bena Inca ..."
3,Na nsuka ya bambula 1400 mpe na ebandeli ya ba...,Ku ndekelu kua bidimu bia 1400 ne ku ntuadijil...
4,"Bangomba milaimilai , bazamba ya mineneminene ...","Bantu bavua ne lutatu lua kuendakana , bualu n..."
5,"Longola yango , Ba - Incas bazalaki na banyama...",Kabidi pa kumbusha nyama ya Iama ( mienze bu m...
6,Kasi ndenge nini bazalaki kotindelana bansango...,Kadi mmunyi muvuabu bafika ku dimanyishangana ...
7,"Ba - Incas bakómisaki monɔkɔ na bango , Quechu...",Bena Inca bakavuija muakulu wa Quechua muakulu...
8,Basalaki mpe babalabala ebele .,Bakenza kabidi njila ya bungi .
9,"Balabala na bango ya monene , oyo elekaki na m...",Njila wabu mutambe bunene uvua ne bule bua kil...


## Check a random item!
Let's pick one of the keys in the dictionary at random and check it. 

In [13]:
import random
keys_in_target_test = list(target_test.keys())
print(type(keys_in_target_test))
random_key = random.choice(keys_in_target_test)
print(f"The random key we picked was {random_key}")

<class 'list'>
The random key we picked was 313135


In [14]:
target_test[random_key]

'Mmunyi mutudi mu buobumue patudi tuyisha lumu luimpe ?'

In [15]:
english_sentences_in_global_test_set[random_key]

'Ndenge nini kosakola nsango malamu esalaka ete tózala na bomoko ?'

Do the two look like they line up? 

## Check several rows at the tail end

Let's get a sample from the end of the dataset

In [17]:
target_test_set = pd.DataFrame(zip(target_test.values(), english_sentences_in_global_test_set.values()), columns=[f'{target_language}_equivalent', f'{source_language}_equivalent'])

In [18]:
target_test_set = target_test_set.reset_index()

In [19]:
target_test_set = target_test_set.set_index("index")

In [20]:
target_test_set.tail()

Unnamed: 0_level_0,lua_equivalent,ln_equivalent
index,Unnamed: 1_level_1,Unnamed: 2_level_1
8165,"13 , 14 .","13 , 14 ."
8166,( Bala Musambu wa 1 : 1 - 3 . ),( Tángá Nzembo 1 : 1 - 3 . )
8167,Ke bualu kayi bidi ne mushinga bua baledi kutu...,"Yango wana , ezali na ntina mingi ete baboti b..."
8168,Enza bua bana bebe bamone muudi wangata Yehowa...,Tiká bana na yo bámona ete Yehova azali mpenza...
8169,Ebi mbikoleshe ditabuja diende kudi Nzambi ne ...,Yango ekómisaki makasi kondima na ye epai ya N...


Removing duplicates from english and target set

In [21]:
target_test_set = target_test_set.drop_duplicates(subset=f'{target_language}_equivalent')

In [22]:
target_test_set = target_test_set.drop_duplicates(subset=f'{source_language}_equivalent')

In [23]:
target_test_set.head()

Unnamed: 0_level_0,lua_equivalent,ln_equivalent
index,Unnamed: 1_level_1,Unnamed: 2_level_1
0,© 2017 Watch Tower Bible and Tract Society of ...,© 2017 Watch Tower Bible and Tract Society of ...
1,Dipatuka diende ngumue wa ku midimu idi yenzek...,Ebimisami mpo na koteya bato Biblia na mokili ...
2,"Bua kufila makuta , buela mu adrese wa www.jw....","Soki olingi kopesa likabo , kɔtá na www.jw.org ."
3,"Padibu kabayi baleje Bible mukuabu , mvese yon...","Soki liyebisi ezali te , mikapo ya Makomami eu..."
4,Tshikebelu,Etanda ya makambo ezali na kati


In [24]:
target_test_set.shape

(2673, 2)

In [25]:
target_test_set.loc[~target_test_set[f'{source_language}_equivalent'].isin(en_test_sents)]

Unnamed: 0_level_0,lua_equivalent,ln_equivalent
index,Unnamed: 1_level_1,Unnamed: 2_level_1


## Write out target-language test set file
In our example, we should have `test.ln-any.ln`, but it will be different for you if you picked a different code.

In [26]:
target_test_filename = f"test.{target_language}-any.{target_language}"
print(target_test_filename)

test.lua-any.lua


## Write out English-language test set file
In our example, we should have `test.fr-ln.fr`, but it will be different for you if you picked a different code.

**Make sure the data lines up in the two files!**
The first line of each file should be translations of each other.


In [29]:

with open(target_test_filename, "w") as test_tgt_any_tgt:
    test_tgt_any_tgt.write("\n".join(target_test_set[f'{target_language}_equivalent']))

In [30]:
!head -5 test.$tgt-any.$tgt

© 2017 Watch Tower Bible and Tract Society of Pennsylvania
Dipatuka diende ngumue wa ku midimu idi yenzeka pa buloba bujima ne makuta adi bantu bafila ku budisuile bua kulongeshangana Bible .
Bua kufila makuta , buela mu adrese wa www.jw.org .
Padibu kabayi baleje Bible mukuabu , mvese yonso mmiangatshila mu Bible — Nkudimuinu wa bulongolodi bupiabupia udi ne ngakuilu wa matuku aa .
Tshikebelu


In [31]:
source_test_filename = f"test.{source_language}-{target_language}.{source_language}"
print(f"saving english aligned sentences to {source_test_filename}")
with open(source_test_filename, "w") as test_en_tgt_en:
    test_en_tgt_en.write("\n".join(target_test_set[f'{target_language}_equivalent']))
!ls -al

saving english aligned sentences to test.ln-lua.ln
total 172504
drwxr-xr-x 1 root root     4096 Jul 20 15:39 .
drwxr-xr-x 1 root root     4096 Jul 20 15:27 ..
drwxr-xr-x 4 root root     4096 Jul 16 13:19 .config
drwx------ 6 root root     4096 Jul 20 15:29 drive
-rw-r--r-- 1 root root 18328475 Jul 20 15:29 JW300_latest_xml_ln-lua.xml
-rw-r--r-- 1 root root 61546014 Jul 20 15:29 JW300_latest_xml_ln.zip
-rw-r--r-- 1 root root 32608568 Jul 20 15:29 JW300_latest_xml_lua.zip
-rw-r--r-- 1 root root 31573127 Jul 20 15:31 jw300.ln
-rw-r--r-- 1 root root 31797537 Jul 20 15:31 jw300.lua
drwxr-xr-x 2 root root     4096 Jul 20 15:29 ln-lua-baseline
drwxr-xr-x 1 root root     4096 Jul 16 13:20 sample_data
-rw-r--r-- 1 root root   259242 Jul 20 15:31 test.ln-any.ln
-rw-r--r-- 1 root root   242017 Jul 20 15:39 test.ln-lua.ln
-rw-r--r-- 1 root root   242017 Jul 20 15:37 test.lua-any.lua


In [33]:
!head -5 test.$src-$tgt.$src

© 2017 Watch Tower Bible and Tract Society of Pennsylvania
Dipatuka diende ngumue wa ku midimu idi yenzeka pa buloba bujima ne makuta adi bantu bafila ku budisuile bua kulongeshangana Bible .
Bua kufila makuta , buela mu adrese wa www.jw.org .
Padibu kabayi baleje Bible mukuabu , mvese yonso mmiangatshila mu Bible — Nkudimuinu wa bulongolodi bupiabupia udi ne ngakuilu wa matuku aa .
Tshikebelu


## One last check to see if the two files are aligned

Let's just get one more sample! Let's take from the end this time

In [34]:
!echo "test.$src-$tgt.$src"
!tail -5 test.$src-$tgt.$src
!echo
!echo "**********************"
!echo "test.$tgt-any.$tgt"
!echo "**********************"
!tail -5 test.$tgt-any.$tgt

test.ln-lua.ln
Mukungulu wa ndeke ne miadi ya nyunyi bidiku mushindu umue anyi ?
Nunku nnganyi udi ne meji a bungi , mmuenji wa ndeke anyi ? Peshi mMufuki wa nyunyi ? ”
Ke bualu kayi bidi ne mushinga bua baledi kutungunuka ne kulongesha bana babu . ”
Enza bua bana bebe bamone muudi wangata Yehowa bu muntu mulelela .
Ebi mbikoleshe ditabuja diende kudi Nzambi ne kudi Bible . ”
**********************
test.lua-any.lua
**********************
Mukungulu wa ndeke ne miadi ya nyunyi bidiku mushindu umue anyi ?
Nunku nnganyi udi ne meji a bungi , mmuenji wa ndeke anyi ? Peshi mMufuki wa nyunyi ? ”
Ke bualu kayi bidi ne mushinga bua baledi kutungunuka ne kulongesha bana babu . ”
Enza bua bana bebe bamone muudi wangata Yehowa bu muntu mulelela .
Ebi mbikoleshe ditabuja diende kudi Nzambi ne kudi Bible . ”