# Leveraging Bitext Mining and COMET-QE to Improve Parallel Data Selection for Low-Resource Machine Translation
<a href="https://colab.research.google.com/github/emmanuelayanful/AIMS-NLP-Project/blob/main/Data_preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# # Linking to drive
# from google.colab import drive
# drive.mount("/content/drive")

In [2]:
# Installing package to retrieve datasets
! pip install opustools



In [3]:
import os
import re

In [4]:
# os.chdir("/content/drive/MyDrive/AIMS PROJECT")

In [5]:
def clean_text(text):
    """Normalize a line of text and split it into one sentence per line.

    Note: periods and colons are replaced by newlines, so a single input line
    can expand into several output lines. Cleaned source and target files may
    therefore end up with different line counts.
    """
    text = text.lower()  # Lowercase the text
    text = re.sub(r'\(.*?\)', '', text)  # Remove text within parentheses (including the parentheses)
    text = re.sub(r'\[.*?\]', '', text)  # Remove text within square brackets (including the brackets)
    text = re.sub(r'[\'"]', '', text)  # Remove both single and double quotes
    text = re.sub(r'\s+,', ',', text)  # Remove whitespace before a comma
    text = re.sub(r'\.\s*', '\n', text)  # Replace each period and any following whitespace with a newline
    text = re.sub(r':\s*', '\n', text)  # Replace each colon and any following whitespace with a newline
    return text
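
Because `clean_text` turns every period and colon into a newline, one verse can expand into several lines, and the expansion need not be the same on the source and target side. The sketch below shows one possible alternative, assuming the goal is to keep exactly one output line per input line. The name `clean_line` is hypothetical and not part of the original notebook; it applies the same normalization steps but skips the sentence splitting.

```python
import re

def clean_line(text):
    """Hypothetical variant of clean_text that preserves the line count."""
    text = text.lower()                      # Lowercase the text
    text = re.sub(r'\(.*?\)', '', text)      # Drop parenthesized text
    text = re.sub(r'\[.*?\]', '', text)      # Drop bracketed text
    text = re.sub(r'[\'"]', '', text)        # Drop single and double quotes
    text = re.sub(r'\s+,', ',', text)        # Remove whitespace before a comma
    # Collapse every whitespace run (including stray newlines) into one space,
    # then terminate the line with exactly one newline.
    text = re.sub(r'\s+', ' ', text).strip()
    return text + '\n'

verse = "Abraham became the father of Isaac. Isaac became the father of Jacob.\n"
cleaned = clean_line(verse)
print(repr(cleaned))
```

With this variant the sentence-final punctuation is kept, so any sentence splitting would have to happen in a later step that treats both sides of a pair together.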

In [6]:


def create_corpus(src, tgt, file_dir, corpora="bible-uedin"):
    """Download an OPUS corpus for src paired with each target language, then clean it."""
    if isinstance(tgt, str):
        tgt = [tgt]

    for tg in tgt:
        download_dir = os.path.join(file_dir, f"{src}-{tg}")
        os.makedirs(download_dir, exist_ok=True)
        source_file = os.path.join(download_dir, f"{corpora}.{src}-{tg}.{src}")
        target_file = os.path.join(download_dir, f"{corpora}.{src}-{tg}.{tg}")

        # Download the pair in Moses format and write each side to its own file
        command = f"opus_read -d {corpora} -s {src} -t {tg} -p moses -dl {download_dir} -w {source_file} {target_file} -q"
        os.system(command)

        # Read the raw files written by opus_read
        with open(source_file, 'r', encoding='utf-8') as src_file, open(target_file, 'r', encoding='utf-8') as tgt_file:
            src_lines = src_file.readlines()
            tgt_lines = tgt_file.readlines()

        # Apply cleaning to both source and target data
        cleaned_src_lines = [clean_text(line) for line in src_lines]
        cleaned_tgt_lines = [clean_text(line) for line in tgt_lines]

        # Write cleaned data to new files with a .txt suffix; the raw files are kept
        with open(f'{source_file}.txt', 'w', encoding='utf-8') as src_file, open(f'{target_file}.txt', 'w', encoding='utf-8') as tgt_file:
            src_file.writelines(cleaned_src_lines)
            tgt_file.writelines(cleaned_tgt_lines)

        print(f"\nCleaned data saved for {src}-{tg} corpus.\n")

    print("\nDone!!!")

create_corpus(src='en', tgt=['ss', 'ee', 'zu', 'so', 'am', 'wo'], corpora="bible-uedin", file_dir='corpus')


Only moses write_mode is supported for moses preprocessing. Ignoring write_mode normal.


The following files are available for downloading:

   1 MB https://object.pouta.csc.fi/OPUS-bible-uedin/v1/moses/en-ss.txt.zip | 15699 alignment pairs, 351344 source tokens, 277931 target tokens (id 23526)

   1 MB Total size
corpus/en-ss/bible-uedin_latest_moses_en-ss.txt.zip ... 100% of 1 MB

Cleaned data saved for en-ss corpus.

The following files are available for downloading:



Only moses write_mode is supported for moses preprocessing. Ignoring write_mode normal.


   1 MB https://object.pouta.csc.fi/OPUS-bible-uedin/v1/moses/ee-en.txt.zip | 16001 alignment pairs, 440448 source tokens, 357784 target tokens (id 23318)

   1 MB Total size
corpus/en-ee/bible-uedin_latest_moses_ee-en.txt.zip ... 100% of 1 MB

Cleaned data saved for en-ee corpus.

The following files are available for downloading:



Only moses write_mode is supported for moses preprocessing. Ignoring write_mode normal.


   1 MB https://object.pouta.csc.fi/OPUS-bible-uedin/v1/moses/en-zu.txt.zip | 15907 alignment pairs, 355745 source tokens, 193005 target tokens (id 23541)

   1 MB Total size
corpus/en-zu/bible-uedin_latest_moses_en-zu.txt.zip ... 100% of 1 MB

Cleaned data saved for en-zu corpus.

The following files are available for downloading:



Only moses write_mode is supported for moses preprocessing. Ignoring write_mode normal.


   5 MB https://object.pouta.csc.fi/OPUS-bible-uedin/v1/moses/en-so.txt.zip | 62195 alignment pairs, 1550431 source tokens, 1455125 target tokens (id 23523)

   5 MB Total size
corpus/en-so/bible-uedin_latest_moses_en-so.txt.zip ... 100% of 5 MB

Cleaned data saved for en-so corpus.



Only moses write_mode is supported for moses preprocessing. Ignoring write_mode normal.


The following files are available for downloading:

   5 MB https://object.pouta.csc.fi/OPUS-bible-uedin/v1/moses/am-en.txt.zip | 61084 alignment pairs, 828535 source tokens, 1525058 target tokens (id 21512)

   5 MB Total size
corpus/en-am/bible-uedin_latest_moses_am-en.txt.zip ... 100% of 5 MB

Cleaned data saved for en-am corpus.



Only moses write_mode is supported for moses preprocessing. Ignoring write_mode normal.


The following files are available for downloading:

   1 MB https://object.pouta.csc.fi/OPUS-bible-uedin/v1/moses/en-wo.txt.zip | 15833 alignment pairs, 354451 source tokens, 342902 target tokens (id 23538)

   1 MB Total size
corpus/en-wo/bible-uedin_latest_moses_en-wo.txt.zip ... 100% of 1 MB

Cleaned data saved for en-wo corpus.


Done!!!


In [7]:
src, tgt = 'en', 'ss'

In [8]:
! head -n 20 corpus/{src}-{tgt}/bible-uedin.{src}-{tgt}.*

==> corpus/en-ss/bible-uedin.en-ss.en <==
The book of the genealogy of Jesus Christ , the son of David, the son of Abraham.
Abraham became the father of Isaac. Isaac became the father of Jacob. Jacob became the father of Judah and his brothers.
Judah became the father of Perez and Zerah by Tamar. Perez became the father of Hezron. Hezron became the father of Ram.
Ram became the father of Amminadab. Amminadab became the father of Nahshon. Nahshon became the father of Salmon.
Salmon became the father of Boaz by Rahab. Boaz became the father of Obed by Ruth. Obed became the father of Jesse.
Jesse became the father of King David. David became the father of Solomon by her who had been the wife of Uriah.
Solomon became the father of Rehoboam. Rehoboam became the father of Abijah. Abijah became the father of Asa.
Asa became the father of Jehoshaphat. Jehoshaphat became the father of Joram. Joram became the father of Uzziah.
Uzziah became the father of Jotham. Jotham became the father of Ahaz.

In [9]:
! wc -l corpus/{src}-{tgt}/bible-uedin.{src}-{tgt}.*

   15699 corpus/en-ss/bible-uedin.en-ss.en
   20206 corpus/en-ss/bible-uedin.en-ss.en.txt
   15699 corpus/en-ss/bible-uedin.en-ss.ss
   22128 corpus/en-ss/bible-uedin.en-ss.ss.txt
   73732 total
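
The counts above show that the raw files are parallel (15699 lines each), while the cleaned `.txt` files are not (20206 versus 22128 lines), so line `i` of one cleaned file no longer corresponds to line `i` of the other. A minimal sketch of a check that could be run before using a file pair as parallel data is given below. The helper names `count_lines` and `is_parallel` are hypothetical, and the demo uses toy temporary files rather than the corpus itself.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it all into memory."""
    with open(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)

def is_parallel(src_path, tgt_path):
    """Return True when both files have the same number of lines."""
    return count_lines(src_path) == count_lines(tgt_path)

# Self-contained demo on toy data (not taken from the corpus)
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'toy.en')
    tgt_path = os.path.join(tmp, 'toy.xx')
    with open(src_path, 'w', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')
    with open(tgt_path, 'w', encoding='utf-8') as f:
        f.write('only one line\n')
    aligned = is_parallel(src_path, tgt_path)
    print(aligned)
```

Equal line counts are a necessary condition for alignment, not a sufficient one, so this only catches the kind of mismatch visible in the `wc -l` output.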
