<a href="https://colab.research.google.com/github/Kabongosalomon/RDC-Mobongoli/blob/main/jw300_utils/building_french_global_test_set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install opus-tools
! pip install opustools-pkg

Collecting opustools-pkg
  Downloading opustools_pkg-0.0.52-py3-none-any.whl (80 kB)
Installing collected packages: opustools-pkg
Successfully installed opustools-pkg-0.0.52


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Setting Up the data

Building the Lingala global test set is simple: we set French and Lingala as the source and target languages, find the intersection of the French global test set with the French side of the corpus, and then take the corresponding Lingala sentences from the Lingala side of the corpus.
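
The idea can be sketched on a few in-memory sentences before touching the real files. This is only an illustration under stated assumptions: the helper name `split_by_test_set` and the toy sentence pairs are hypothetical and are not part of the notebook, which works on the `jw300.*` files directly.

```python
def split_by_test_set(source_lines, target_lines, test_sentences):
    """Split aligned sentence pairs into (corpus_pairs, test_pairs).

    A pair goes to the test split when its source sentence appears in
    test_sentences; otherwise it stays in the training corpus.
    """
    corpus_pairs, test_pairs = [], []
    for src, tgt in zip(source_lines, target_lines):
        pair = (src.strip(), tgt.strip())
        if pair[0] in test_sentences:
            test_pairs.append(pair)
        else:
            corpus_pairs.append(pair)
    return corpus_pairs, test_pairs


# Toy aligned data, made up for illustration only.
fr_lines = ["Pourquoi ?", "Bonjour .", "Non ."]
ln_lines = ["Mpo na nini ?", "Mbote .", "Te ."]
fr_test = {"Pourquoi ?", "Non ."}

corpus, test = split_by_test_set(fr_lines, ln_lines, fr_test)
print(corpus)
print(test)
```

Because the two files are aligned line by line, the position of a matched source sentence is enough to recover its translation.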

In [3]:
import os
source_language = "fr"
target_language = "ln" # ln is the language code of lingala 
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# No need to use gdrive since we are training on gcp
!mkdir -p "$src-$tgt-$tag"
os.environ["gdrive_path"] = "%s-%s-%s" % (source_language, target_language, tag) # saving directly on the vm

In [4]:
!echo $gdrive_path

fr-ln-baseline


#### Downloading the corpus data

As a precaution, I remove any old data first.

In [5]:
!rm -f jw300.$src jw300.$tgt JW300_latest_xml_$src-$tgt.xml.gz JW300_latest_xml_$src-$tgt.xml JW300_latest_xml_$src.zip  JW300_latest_xml_$tgt.zip

In [6]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q

# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt.xml.gz


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/fr-ln.xml.gz not found. The following files are available for downloading:

   6 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/fr-ln.xml.gz
 278 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/fr.zip
  60 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/ln.zip

 345 MB Total size
./JW300_latest_xml_fr-ln.xml.gz ... 100% of 6 MB
./JW300_latest_xml_fr.zip ... 100% of 278 MB
./JW300_latest_xml_ln.zip ... 100% of 60 MB


In [7]:
! wget https://raw.githubusercontent.com/ai-drc/RDC-Mobongoli/main/jw300_utils/test/test.$src-any.$src
  
# Export the language codes again so that later shell cells can use them.
os.environ["trg"] = target_language 
os.environ["src"] = source_language 

--2021-07-20 15:21:17--  https://raw.githubusercontent.com/ai-drc/RDC-Mobongoli/main/jw300_utils/test/test.fr-any.fr
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 377235 (368K) [text/plain]
Saving to: ‘test.fr-any.fr’


2021-07-20 15:21:17 (33.4 MB/s) - ‘test.fr-any.fr’ saved [377235/377235]



In [8]:
# Read the test data to filter from train and dev splits.
# Store the source-language (here French) sentences in a set for quick filtering checks.
en_test_sents = set()
filter_test_sents = f"test.{source_language}-any.{source_language}"
j = 0
with open(filter_test_sents) as f:
  for line in f:
    en_test_sents.add(line.strip())
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 3974 global test sentences to filter from the training/dev data.
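
Note that the count printed above is the number of lines read, which can be larger than the number of distinct sentences held in the set. A minimal sketch that reports both, assuming a one-sentence-per-line UTF-8 file; the helper `load_test_sentences` and the temporary demo file are hypothetical:

```python
import os
import tempfile


def load_test_sentences(path):
    """Return (line_count, unique_sentences) for a one-sentence-per-line file."""
    sentences = set()
    line_count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentences.add(line.strip())
            line_count += 1
    return line_count, sentences


# Demo on a throwaway file that contains one repeated sentence.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.demo")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Pourquoi ?\nNon .\nPourquoi ?\n")
    n_lines, unique = load_test_sentences(demo_path)

print(n_lines, len(unique))
```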


In [9]:
!ls

drive		JW300_latest_xml_fr-ln.xml  jw300.ln
fr-ln-baseline	JW300_latest_xml_fr.zip     sample_data
jw300.fr	JW300_latest_xml_ln.zip     test.fr-any.fr


#### Building the corpus

For readers who know French: the two cells below check that the two datasets are aligned.

In [10]:
! head -5 jw300.$src

Qui veut être millionnaire ?
IL SEMBLE que ce soit là le désir de tout un chacun , ou presque .
Or la solution la plus simple , dans l’esprit du public , est de gagner à la loterie ou au loto sportif * .
Flattant les désirs du grand nombre — et convoitant les excédents qui reviendront à l’État — , de Moscou à Madrid , de Manille à Mexico , les gouvernements parrainent des loteries d’État qui peuvent faire gagner l’équivalent de plusieurs centaines de millions de francs français .
Quelques joueurs deviennent effectivement millionnaires .


In [11]:
! head -5 jw300.$tgt

Nani alingi kozala milionere ?
EYANO emonani lokola ete , wana ezali mposa ya moto na moto to pene na bato nyonso .
Nzokande , na makanisi ya bato , nzela ya pɛtɛɛ mpo na kozwa yango ezali kolónga na loterie to na momekano ya kosakola liboso équipe ya ndembo oyo ekolónga .
Kolamusáká mposa ya bato mingi ​ — mpe koluláká kozwa misolo oyo Leta akozwa likoló ​ — kolongwa Moscou kino Madrid , kolongwa Manille kino Mexico , baguvernema bazali kopesa lisungi na loterie esalemi na Leta , kati na yango balóngi bakoki kozwa nkámá mingi ya bamilió ya badolare .
Mwa babɛti na yango bazali mpenza kokóma bamilionere .
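
Reading the first few lines is a manual check. A cheap automatic complement is to compare the line counts of the two files; equal counts are necessary for line-by-line alignment but do not prove it. The sketch below uses hypothetical helper names and throwaway demo files:

```python
import os
import tempfile


def count_lines(path):
    """Count the lines of a text file without loading it into memory."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def have_same_length(source_path, target_path):
    """Return True when both files have the same number of lines."""
    return count_lines(source_path) == count_lines(target_path)


# Demo on two small throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "demo.fr")
    tgt_path = os.path.join(tmp, "demo.ln")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("Pourquoi ?\nNon .\n")
    with open(tgt_path, "w", encoding="utf-8") as f:
        f.write("Mpo na nini ?\nTe .\n")
    aligned = have_same_length(src_path, tgt_path)

print(aligned)
```

In this notebook the call would be `have_same_length("jw300.fr", "jw300.ln")`.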


In [12]:
import pandas as pd

# Moses-format text files to dataframe
source_file = 'jw300.' + source_language  # source language is French
target_file = 'jw300.' + target_language  # target language is Lingala
french_test = {}
source = []
target = []
english_sentences_in_global_test_set = {}  # Collect the line numbers of the source portion to skip the same lines for the target portion.
with open(source_file) as src_f:
    for i, line in enumerate(src_f):
        # Skip sentences that are contained in the test set and add them to the new Lingala test set
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            # This is the intersection with the global test set.
            english_sentences_in_global_test_set[i] = line.strip()           
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in english_sentences_in_global_test_set.keys():
            target.append(line.strip())
        else:
            # Collect the aligned test sentences.
            french_test[j] = line.strip()
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(english_sentences_in_global_test_set.keys()), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
# If you get "TypeError: data argument can't be an iterator", your pandas version does not accept a zip iterator; run the line below instead.
#df = pd.DataFrame(list(zip(source, target)), columns=['source_sentence', 'target_sentence'])
df.tail(10)

Loaded data and skipped 6707/590525 lines since contained in test set.


,source_sentence,target_sentence
583809,"Comme les chrétiens hébreux , nous pouvons étu...","Lokola bakristo Baebre , tokoki kotánga makamb..."
583810,"Pour montrer que cette promesse est biblique ,...",Mpo na komonisa ete elaka yango euti na Makoma...
583811,Nous sommes touchés de savoir que « la promess...,Koyeba ete “ elaka ya kokɔta na kopema [ ya Nz...
583812,Nous sommes convaincus que c’est possible d’en...,Tondimaka ete makambo oyo Biblia eteyaka na oy...
583813,"Pas en obéissant à la Loi de Moïse , ni en fai...",Tosalaka yango te mpo na koluka kotosa Mibeko ...
583814,Mais c’est plutôt en travaillant avec foi et d...,"Kasi , lokola tondimelaka Nzambe , tosepelaka ..."
583815,"De plus , des milliers de personnes dans le mo...",Ebele ya bato na mokili mobimba babandá mpe ko...
583816,Cette étude a motivé beaucoup d’entre elles à ...,Yango esalisaki mingi na bango bábongola bomoi...
583817,L’effet que « la parole de Dieu » a sur ces pe...,Ndenge oyo bazali kobongwana emonisi polele et...
583818,Les déclarations de Jéhovah sur son projet qui...,Makambo oyo Nzambe amonisá na Biblia mpo na mo...


In [13]:
# french_test

In [14]:
# english_sentences_in_global_test_set[6794]

In [15]:
french_test_set = pd.DataFrame(zip(french_test.values(), english_sentences_in_global_test_set.values()), columns=[f'{target_language}_equivalent', f'{source_language}_equivalent'])

In [16]:
french_test_set = french_test_set.reset_index()

In [17]:
french_test_set = french_test_set.set_index("index")

In [18]:
french_test_set.tail()

index,ln_equivalent,fr_equivalent
6702,"Sikoyo , nani aleki na mayele : moto oyo asalá...","Alors , qui est le plus intelligent : le créat..."
6703,"13 , 14 .","13 , 14 ."
6704,"Yango wana , ezali na ntina mingi ete baboti b...",C’est pour cela que c’est important que les pa...
6705,Tiká bana na yo bámona ete Yehova azali mpenza...,Montre - ​ leur que Jéhovah est vraiment réel ...
6706,Yango ekómisaki makasi kondima na ye epai ya N...,C’est excellent pour sa foi en Dieu et en la B...


Removing duplicates from the Lingala and French sides of the test set

In [19]:
french_test_set = french_test_set.drop_duplicates(subset=f'{target_language}_equivalent')

In [20]:
french_test_set = french_test_set.drop_duplicates(subset=f'{source_language}_equivalent')

In [21]:
french_test_set.head()

index,ln_equivalent,fr_equivalent
0,Mpo na nini ?,Pourquoi ?
1,Oyo ezali mobeko moleki monene mpe oyo na libo...,C’est là le plus grand et le premier commandem...
6,Lamuká !,Réveillez - vous !
11,Sapolsky mpe E .,Non .
18,Oyo ezali mobeko moleki monene mpe ya liboso .,’ C’est là le plus grand et le premier command...


In [22]:
french_test_set.shape

(2933, 2)
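
The two `drop_duplicates` calls run one after the other, each keeping the first occurrence (the pandas default, `keep='first'`). A plain-Python sketch of that behaviour on toy pairs, given only to make the order of operations explicit; `drop_duplicate_pairs` and the placeholder strings are hypothetical:

```python
def drop_duplicate_pairs(pairs):
    """Keep the first pair per target sentence, then the first pair per source.

    pairs is a list of (source, target) tuples; the original order is kept.
    """
    # Pass 1: unique target sentences (cell 19 in the notebook).
    seen_targets = set()
    after_target_pass = []
    for src, tgt in pairs:
        if tgt not in seen_targets:
            seen_targets.add(tgt)
            after_target_pass.append((src, tgt))

    # Pass 2: unique source sentences among the survivors (cell 20).
    seen_sources = set()
    result = []
    for src, tgt in after_target_pass:
        if src not in seen_sources:
            seen_sources.add(src)
            result.append((src, tgt))
    return result


# Placeholder (source, target) pairs, not real sentences.
toy_pairs = [("A", "x"), ("A", "x"), ("B", "x"), ("A", "y"), ("C", "z")]
deduplicated = drop_duplicate_pairs(toy_pairs)
print(deduplicated)
```

A pair is dropped as soon as either side repeats an earlier one, so the result can be smaller than the number of distinct pairs.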

In [24]:
french_test_set.loc[~french_test_set[f'{source_language}_equivalent'].isin(en_test_sents)]

index,ln_equivalent,fr_equivalent


In [25]:
with open(f"test.{target_language}-any.{target_language}", "w") as test_file:
    test_file.write("\n".join(french_test_set[f'{target_language}_equivalent']))

In [26]:
!head -5 test.$tgt-any.$tgt

Mpo na nini ?
Oyo ezali mobeko moleki monene mpe oyo na liboso .
Lamuká !
Sapolsky mpe E .
Oyo ezali mobeko moleki monene mpe ya liboso .
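
Joining with `"\n"` leaves the last line without a trailing newline, so line-counting tools such as `wc -l` report one line fewer than the number of sentences. If that matters downstream, one option is to write each sentence with its own newline. This is a sketch with hypothetical helper names and a throwaway demo file, not what the notebook does:

```python
import os
import tempfile


def write_sentences(path, sentences):
    """Write one sentence per line, each terminated by a newline."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence + "\n")


def read_sentences(path):
    """Read the file back as a list of sentences without line endings."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Round-trip demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.demo")
    write_sentences(demo_path, ["Mpo na nini ?", "Lamuká !"])
    round_trip = read_sentences(demo_path)
    with open(demo_path, encoding="utf-8") as f:
        ends_with_newline = f.read().endswith("\n")

print(round_trip, ends_with_newline)
```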
