<a href="https://colab.research.google.com/github/Kabongosalomon/RDC-Mobongoli/blob/main/jw300_utils/building_global_test_set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import os

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Setting Up the data

Building the global test set is straightforward. We set the source language (French in this notebook) and your target language, find the intersection of the source-language global test set with the source side of the parallel corpus, and then take the corresponding target sentences from the target side of the corpus.
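The idea can be sketched on toy data before touching the real corpus. The helper name `split_by_test_set` is hypothetical, and the three sentence pairs are borrowed from outputs shown later in this notebook; the actual cells further down work on the downloaded JW300 files instead.

```python
def split_by_test_set(source_lines, target_lines, test_sentences):
    """Split an aligned corpus into (train_pairs, test_pairs).

    A pair goes to the test split when its source sentence appears in the
    global test set; the target sentence on the same line follows it.
    """
    train_pairs, test_pairs = [], []
    for src, tgt in zip(source_lines, target_lines):
        pair = (src.strip(), tgt.strip())
        if pair[0] in test_sentences:
            test_pairs.append(pair)
        else:
            train_pairs.append(pair)
    return train_pairs, test_pairs


# Toy data: three aligned lines and a one-sentence "global test set".
toy_source = ["Pourquoi ?", "Non .", "Sommaire"]
toy_target = ["Kwa nini ?", "La .", "Ukurasa wa pili"]
toy_test = {"Non ."}

train, test = split_by_test_set(toy_source, toy_target, toy_test)
print(train)
print(test)
```

Because the split is decided line by line, it only works when line `i` of the source file is the translation of line `i` of the target file.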

In [2]:
%%capture
!pip install opustools-pkg

# SET THE LANGUAGE CODE and other variables.

You need to change the value below for your language!

The language codes from the [JW300 corpus website](https://object.pouta.csc.fi/OPUS-JW300/v1/languages.json) are: 
```
{
    "language": "French - Français",
    "language_en": "French",
    "language_native": "Français",
    "language_short": "fr",
    "url": "https://wol.jw.org/fr/wol/pref/r30/lp-f?newrsconf=r30&newlib=lp-f&url="
}, 

{
    "language": "Lingala - Lingala",
    "language_en": "Lingala",
    "language_native": "Lingala",
    "language_short": "ln",
    "url": "https://wol.jw.org/ln/wol/pref/r126/lp-li?newrsconf=r126&newlib=lp-li&url="
},

{
    "language": "Tshiluba - Tshiluba",
    "language_en": "Tshiluba",
    "language_native": "Tshiluba",
    "language_short": "lua",
    "url": "https://wol.jw.org/lua/wol/pref/r477/lp-sh?newrsconf=r477&newlib=lp-sh&url="
},

{
    "language": "Kikongo - Kikongo",
    "language_en": "Kikongo",
    "language_native": "Kikongo",
    "language_short": "kwy",
    "url": "https://wol.jw.org/kwy/wol/pref/r128/lp-kg?newrsconf=r128&newlib=lp-kg&url="
},

{
    "language": "Swahili (Congo) - Kiswahili (Congo)",
    "language_en": "Swahili (Congo)",
    "language_native": "Kiswahili (Congo)",
    "language_short": "swc",
    "url": "https://wol.jw.org/swc/wol/pref/r143/lp-zs?newrsconf=r143&newlib=lp-zs&url="
},
```
Already-created test sets: https://raw.githubusercontent.com/ai-drc/RDC-Mobongoli/main/jw300_utils/test/

In [3]:
source_language = "fr"
target_language = "swc" # TODO: CHANGE THIS TO YOUR LANGUAGE! "swc" is Swahili (Congo). See the language codes at https://opus.nlpl.eu/JW300.php
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# No need to use gdrive since we are training on gcp
!mkdir -p "$src-$tgt-$tag"
os.environ["gdrive_path"] = "%s-%s-%s" % (source_language, target_language, tag) # saving directly on the vm

In [4]:
!echo $gdrive_path

fr-swc-baseline


#### Downloading the corpus data

As a precaution, remove any old data first.

In [5]:
!rm -f jw300.$src jw300.$tgt JW300_latest_xml_$src-$tgt.xml.gz JW300_latest_xml_$src-$tgt.xml JW300_latest_xml_$src.zip JW300_latest_xml_$tgt.zip test.$src-any.$src

In [6]:
! ls

fr-swc-baseline  sample_data


In [7]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q -ln -S 1 -T 1
# ! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q -ln -S 2-4 -T 2-4


# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt.xml.gz


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/fr-swc.xml.gz not found. The following files are available for downloading:

   6 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/fr-swc.xml.gz
 278 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/fr.zip
  54 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/swc.zip

 339 MB Total size
./JW300_latest_xml_fr-swc.xml.gz ... 100% of 6 MB
./JW300_latest_xml_fr.zip ... 100% of 278 MB
./JW300_latest_xml_swc.zip ... 100% of 54 MB


In [8]:
df_src = pd.read_csv(f"jw300.{source_language}", sep='\t', names=['source_sentence'])
df_tgt = pd.read_csv(f"jw300.{target_language}", sep='\t', names=['target_sentence'])

display(df_src.tail())
display(df_tgt.tail())

Unnamed: 0,source_sentence
497617,Mais c’est plutôt en travaillant avec foi et d...
497618,"De plus , des milliers de personnes dans le mo..."
497619,Cette étude a motivé beaucoup d’entre elles à ...
497620,L’effet que « la parole de Dieu » a sur ces pe...
497621,Les déclarations de Jéhovah sur son projet qui...


Unnamed: 0,target_sentence
497617,"Lakini , kwa sababu ya imani yetu kwa Mungu , ..."
497618,Mamilioni ya watu katika dunia yote wameanza p...
497619,Hilo limechochea wengi kati yao wafanye mabadi...
497620,Namna wanaendelea kufanya mabadiliko inaonyesh...
497621,Mambo yenye Mungu amefunua katika Biblia juu y...


In [9]:
# ! wget https://raw.githubusercontent.com/ai-drc/RDC-Mobongoli/main/jw300_utils/test/test.$src-any.$src
! wget https://raw.githubusercontent.com/masakhane-io/masakhane-mt/master/jw300_utils/test/test.$src-any.$src

# And the specific test set for this language pair.
os.environ["trg"] = target_language 
os.environ["src"] = source_language 

--2021-07-23 16:32:12--  https://raw.githubusercontent.com/masakhane-io/masakhane-mt/master/jw300_utils/test/test.fr-any.fr
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 377235 (368K) [text/plain]
Saving to: ‘test.fr-any.fr’


2021-07-23 16:32:12 (25.6 MB/s) - ‘test.fr-any.fr’ saved [377235/377235]



In [10]:
# Read the test data to filter from train and dev splits.
# Store the source-language portion in a set for quick filtering checks.
src_test_sents = set()
filter_test_sents = f"test.{source_language}-any.{source_language}"
j = 0
with open(filter_test_sents) as f:
  for line in f:
    src_test_sents.add(line.strip())
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 3974 global test sentences to filter from the training/dev data.


In [11]:
!ls

fr-swc-baseline		     JW300_latest_xml_fr.zip   sample_data
jw300.fr		     JW300_latest_xml_swc.zip  test.fr-any.fr
JW300_latest_xml_fr-swc.xml  jw300.swc


#### Building the corpus

In the 2 cells below you can check if the 2 datasets are aligned. Even if you don't speak the language you can get a sense, especially with similar words, punctuation, and so forth.

In [12]:
! head -5 jw300.$src

Sommaire
8 janvier 2000
Médecine et chirurgie sans transfusion : une discipline en plein essor
De plus en plus de patients optent pour la chirurgie sans transfusion .
Pourquoi , et quels sont les résultats ?


In [13]:
! head -5 jw300.$tgt

Ukurasa wa pili
Mwezi wa 8 , 2000
Tiba na Upasuaji Bila Damu Uhitaji Unaoongezeka 3 - 11
Sasa tiba na upasuaji bila damu ni wa kawaida zaidi kuliko wakati mwingine wowote .
Kwa nini unahitajiwa sana namna hiyo ?
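Reading the two `head` outputs separately makes it easy to lose track of which line pairs with which. A minimal sketch of a side-by-side preview follows; `preview_pairs` is a hypothetical helper, and it assumes both files are UTF-8 text with one sentence per line.

```python
from itertools import islice


def preview_pairs(source_path, target_path, n=5):
    """Return the first n (source, target) line pairs from two aligned files."""
    with open(source_path, encoding="utf-8") as src_f, \
         open(target_path, encoding="utf-8") as tgt_f:
        # zip stops at the shorter file; islice keeps only the first n pairs.
        return [(s.strip(), t.strip()) for s, t in islice(zip(src_f, tgt_f), n)]


# Possible usage in this notebook:
# for s, t in preview_pairs(f"jw300.{source_language}", f"jw300.{target_language}"):
#     print(s, "|||", t)
```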


In [14]:
# Moses-format files to dataframe
source_file = 'jw300.' + source_language  # Source side of the corpus; here jw300.fr
target_file = 'jw300.' + target_language  # Target side; whatever you set above, here jw300.swc
target_test = {}
source = []
target = []
english_sentences_in_global_test_set = {}  # Collect the line numbers of the source portion to skip the same lines for the target portion.
with open(source_file) as src_f:
    for i, line in enumerate(src_f):
        # Skip sentences that are contained in the test set and add them to the new French test set
        if line.strip() not in src_test_sents:
            source.append(line.strip())
        else:
            # Here is the intersection with the global test set
            english_sentences_in_global_test_set[i] = line.strip()           
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in english_sentences_in_global_test_set.keys():
            target.append(line.strip())
        else:
            #Collecting the aligned test sentences
            target_test[j] = line.strip()
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(english_sentences_in_global_test_set.keys()), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
# If you get `TypeError: data argument can't be an iterator`, your pandas version does not accept a zip iterator; run the line below instead.
#df = pd.DataFrame(list(zip(source, target)), columns=['source_sentence', 'target_sentence'])
df.head(10)

Loaded data and skipped 6292/497621 lines since contained in test set.


Unnamed: 0,source_sentence,target_sentence
0,8 janvier 2000,"Mwezi wa 8 , 2000"
1,Médecine et chirurgie sans transfusion : une d...,Tiba na Upasuaji Bila Damu Uhitaji Unaoongezek...
2,De plus en plus de patients optent pour la chi...,Sasa tiba na upasuaji bila damu ni wa kawaida ...
3,"Pourquoi , et quels sont les résultats ?",Kwa nini unahitajiwa sana namna hiyo ?
4,12 Voulez - ​ vous apprendre une langue étrang...,"Je , ni njia badala iliyo salama ya utiaji - d..."
5,Le monde étonnant des insectes 15,Wadudu Wenye Kustaajabisha 15
6,Plutôt que d’écraser tous ceux qui croisent vo...,Badala ya kupondaponda kila mdudu unayekutana ...
7,Un point de vue équilibré sur les coutumes 26,Maoni Yaliyosawazika juu ya Desturi Zinazopend...
8,De nombreuses coutumes reposent sur des supers...,Desturi nyingi zinategemea mawazo ya kishiriki...
9,Comment un chrétien devrait - ​ il considérer ...,Mkristo apaswa kuonaje mazoea hayo ?


## Check a random item!
Let's pick one of the keys in the dictionary at random and check it. 

In [15]:
import random
keys_in_target_test = list(target_test.keys())
print(type(keys_in_target_test))
random_key = random.choice(keys_in_target_test)
print(f"The random key we picked was {random_key}")

<class 'list'>
The random key we picked was 491250


In [16]:
target_test[random_key]

'Kumutegemea Yehova kunatupatia uhodari ao nguvu ya kuvumilia majaribu .'

In [17]:
english_sentences_in_global_test_set[random_key]

'La confiance en Jéhovah nous donne le courage de supporter toutes sortes d’épreuves .'

Do the two look like they line up? 
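A single random pair is a thin check. As a sketch, several pairs can be sampled at once; `sample_aligned_pairs` is a hypothetical helper that assumes both dictionaries are keyed by line number, as `target_test` and `english_sentences_in_global_test_set` are above.

```python
import random


def sample_aligned_pairs(source_by_line, target_by_line, k=5, seed=42):
    """Pick up to k line numbers present in both dicts and return the pairs."""
    shared_keys = sorted(set(source_by_line) & set(target_by_line))
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    chosen = rng.sample(shared_keys, min(k, len(shared_keys)))
    return [(key, source_by_line[key], target_by_line[key]) for key in chosen]


# Possible usage in this notebook:
# for key, src, tgt in sample_aligned_pairs(english_sentences_in_global_test_set, target_test):
#     print(key, src, tgt, sep="\n")
```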

## Check several rows at the tail end

Let's get a sample from the end of the dataset

In [18]:
target_test_set = pd.DataFrame(zip(target_test.values(), english_sentences_in_global_test_set.values()), columns=[f'{target_language}_equivalent', f'{source_language}_equivalent'])

In [19]:
target_test_set = target_test_set.reset_index()

In [20]:
target_test_set = target_test_set.set_index("index")

In [21]:
target_test_set.tail()

Unnamed: 0_level_0,swc_equivalent,fr_equivalent
index,Unnamed: 1_level_1,Unnamed: 2_level_1
6287,Ni nani mwenye kuwa na akili sana ; mutu mweny...,"Alors , qui est le plus intelligent : le créat..."
6288,"13 , 14 .","13 , 14 ."
6289,Baba mumoja alisema hivi : “ Usichoke hata kid...,Un père conseille : « Ne vous fatiguez jamais ...
6290,Ndiyo sababu ni jambo la maana wazazi wasiache...,C’est pour cela que c’est important que les pa...
6291,Acha watoto wako watambue kama unaona Yehova k...,Montre - ​ leur que Jéhovah est vraiment réel ...


Remove duplicates from the source and target columns

In [22]:
target_test_set = target_test_set.drop_duplicates(subset=f'{target_language}_equivalent')
target_test_set = target_test_set.drop_duplicates(subset=f'{source_language}_equivalent')
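To see what the two calls do together, here is a pure-Python sketch of the same logic on placeholder strings. It assumes the pandas default `keep="first"`; `drop_duplicate_pairs` is a hypothetical name and the `t1`/`s1` values are not real corpus sentences.

```python
def drop_duplicate_pairs(pairs):
    """Keep the first pair per target sentence, then the first per source sentence.

    `pairs` is a list of (target, source) tuples. This is meant to mirror two
    successive drop_duplicates(subset=...) calls with keep="first".
    """
    def first_by(items, position):
        seen, kept = set(), []
        for item in items:
            if item[position] not in seen:
                seen.add(item[position])
                kept.append(item)
        return kept

    return first_by(first_by(pairs, 0), 1)


# Placeholder strings, not real corpus sentences.
toy_pairs = [("t1", "s1"), ("t1", "s2"), ("t2", "s3"), ("t3", "s3")]
print(drop_duplicate_pairs(toy_pairs))
```

Whole rows are dropped in each pass, so the surviving source and target sentences stay paired. A row can disappear even when only one of its two sides is a repeat, which may be one reason the row count shrinks noticeably.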

In [23]:
# target_test_set = target_test_set.drop_duplicates()

In [24]:
target_test_set.tail()

Unnamed: 0_level_0,swc_equivalent,fr_equivalent
index,Unnamed: 1_level_1,Unnamed: 2_level_1
6286,Na namna gani unaweza kulinganisha mulio wa av...,Et qu’est - ​ ce qui est plus joli : le bruit ...
6287,Ni nani mwenye kuwa na akili sana ; mutu mweny...,"Alors , qui est le plus intelligent : le créat..."
6289,Baba mumoja alisema hivi : “ Usichoke hata kid...,Un père conseille : « Ne vous fatiguez jamais ...
6290,Ndiyo sababu ni jambo la maana wazazi wasiache...,C’est pour cela que c’est important que les pa...
6291,Acha watoto wako watambue kama unaona Yehova k...,Montre - ​ leur que Jéhovah est vraiment réel ...


In [25]:
target_test_set.shape

(2792, 2)

In [26]:
target_test_set.loc[~target_test_set[f'{source_language}_equivalent'].isin(src_test_sents)]

Unnamed: 0_level_0,swc_equivalent,fr_equivalent
index,Unnamed: 1_level_1,Unnamed: 2_level_1


## Write out target-language test set file
In our example, we should have `test.swc-any.swc`, but it will be different for you if you picked a different code.

In [27]:
target_test_filename = f"test.{target_language}-any.{target_language}"
print(target_test_filename)

test.swc-any.swc


## Write out source-language test set file
In our example, we should have `test.fr-swc.fr`, but it will be different for you if you picked a different code.

**Make sure the data lines up in the two files!**
The first line of each file should be translations of each other.
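Once both files have been written by the cells below, a quick sanity check is to compare their line counts. This is only a sketch: `count_lines` and `same_length` are hypothetical helpers, and they assume UTF-8 text files.

```python
def count_lines(path):
    """Count lines in a text file; a final line without a newline still counts."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def same_length(path_a, path_b):
    """Return True when both files have the same number of lines."""
    return count_lines(path_a) == count_lines(path_b)


# Possible usage in this notebook, after the files exist:
# assert same_length(
#     f"test.{source_language}-{target_language}.{source_language}",
#     f"test.{target_language}-any.{target_language}",
# )
```

Equal counts do not prove the files are aligned; the check only catches gross problems such as a missing or extra line on one side.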


In [28]:
with open(target_test_filename, "w") as test_tgt_any_tgt:
    test_tgt_any_tgt.write("\n".join(target_test_set[f'{target_language}_equivalent']))

In [29]:
!head -5 test.$tgt-any.$tgt

Ukurasa wa pili
Kwa nini ?
Sasa gazeti Amkeni !
La .
Baadhi ya majina yamebadilishwa .


In [30]:
source_test_filename = f"test.{source_language}-{target_language}.{source_language}"
print(f"saving english aligned sentences to {source_test_filename}")
with open(source_test_filename, "w") as test_src_tgt_src:
    test_src_tgt_src.write("\n".join(target_test_set[f'{source_language}_equivalent']))
!ls -al

saving english aligned sentences to test.fr-swc.fr
total 456264
drwxr-xr-x 1 root root      4096 Jul 23 16:32 .
drwxr-xr-x 1 root root      4096 Jul 23 16:27 ..
drwxr-xr-x 4 root root      4096 Jul 16 13:19 .config
drwxr-xr-x 2 root root      4096 Jul 23 16:29 fr-swc-baseline
-rw-r--r-- 1 root root  48329463 Jul 23 16:32 jw300.fr
-rw-r--r-- 1 root root  33329795 Jul 23 16:29 JW300_latest_xml_fr-swc.xml
-rw-r--r-- 1 root root 285126296 Jul 23 16:29 JW300_latest_xml_fr.zip
-rw-r--r-- 1 root root  55358133 Jul 23 16:29 JW300_latest_xml_swc.zip
-rw-r--r-- 1 root root  44154279 Jul 23 16:32 jw300.swc
drwxr-xr-x 1 root root      4096 Jul 16 13:20 sample_data
-rw-r--r-- 1 root root    377235 Jul 23 16:32 test.fr-any.fr
-rw-r--r-- 1 root root    251866 Jul 23 16:32 test.fr-swc.fr
-rw-r--r-- 1 root root    241462 Jul 23 16:32 test.swc-any.swc


In [31]:
!head -5 test.$src-$tgt.$src

Sommaire
Pourquoi ?
Réveillez - vous !
Non .
Par souci d’anonymat , certains noms ont été changés .


In [32]:
# Use a separate variable so source_test_filename keeps pointing at the source-side file.
aligned_target_filename = f"test.{source_language}-{target_language}.{target_language}"
print(f"saving english aligned sentences to {aligned_target_filename}")
with open(aligned_target_filename, "w") as test_src_tgt_tgt:
    test_src_tgt_tgt.write("\n".join(target_test_set[f'{target_language}_equivalent']))
!ls -al

saving english aligned sentences to test.fr-swc.swc
total 456500
drwxr-xr-x 1 root root      4096 Jul 23 16:32 .
drwxr-xr-x 1 root root      4096 Jul 23 16:27 ..
drwxr-xr-x 4 root root      4096 Jul 16 13:19 .config
drwxr-xr-x 2 root root      4096 Jul 23 16:29 fr-swc-baseline
-rw-r--r-- 1 root root  48329463 Jul 23 16:32 jw300.fr
-rw-r--r-- 1 root root  33329795 Jul 23 16:29 JW300_latest_xml_fr-swc.xml
-rw-r--r-- 1 root root 285126296 Jul 23 16:29 JW300_latest_xml_fr.zip
-rw-r--r-- 1 root root  55358133 Jul 23 16:29 JW300_latest_xml_swc.zip
-rw-r--r-- 1 root root  44154279 Jul 23 16:32 jw300.swc
drwxr-xr-x 1 root root      4096 Jul 16 13:20 sample_data
-rw-r--r-- 1 root root    377235 Jul 23 16:32 test.fr-any.fr
-rw-r--r-- 1 root root    251866 Jul 23 16:32 test.fr-swc.fr
-rw-r--r-- 1 root root    241462 Jul 23 16:32 test.fr-swc.swc
-rw-r--r-- 1 root root    241462 Jul 23 16:32 test.swc-any.swc


In [33]:
!head -5 test.$src-$tgt.$tgt

Ukurasa wa pili
Kwa nini ?
Sasa gazeti Amkeni !
La .
Baadhi ya majina yamebadilishwa .


## One last check to see if the two files are aligned

Let's just get one more sample! Let's take from the end this time

In [34]:
!echo "test.$src-$tgt.$src"
!tail -5 test.$src-$tgt.$src
!echo
!echo "**********************"
!echo "test.$src-$tgt.$tgt"
!tail -5 test.$src-$tgt.$tgt
!echo
!echo "**********************"
!echo "test.$tgt-any.$tgt"
!echo "**********************"
!tail -5 test.$tgt-any.$tgt

test.fr-swc.fr
Et qu’est - ​ ce qui est plus joli : le bruit d’un avion ou le chant d’un oiseau ?
Alors , qui est le plus intelligent : le créateur des avions ou le Créateur des oiseaux ?
Un père conseille : « Ne vous fatiguez jamais d’essayer de nouvelles méthodes pour reparler de vieux sujets .
C’est pour cela que c’est important que les parents n’arrêtent jamais d’enseigner .
Montre - ​ leur que Jéhovah est vraiment réel pour toi .
**********************
test.fr-swc.swc
Na namna gani unaweza kulinganisha mulio wa avion na wimbo wa ndege ?
Ni nani mwenye kuwa na akili sana ; mutu mwenye alitengeneza avion ao Muumbaji wa ndege ? ”
Baba mumoja alisema hivi : “ Usichoke hata kidogo kutumia njia za mupya ili kufasiria habari za zamani . ”
Ndiyo sababu ni jambo la maana wazazi wasiache kufundisha watoto wao . ”
Acha watoto wako watambue kama unaona Yehova kuwa mutu wa kweli kabisa .
**********************
test.swc-any.swc
**********************
Na namna gani unaweza kulinganisha mulio wa 