# Preprocessing

- Cleaning parallel corpus
- BPE tokenization

Cleaning uses the [Bicleaner hard rules](https://github.com/bitextor/bicleaner-hardrules).


RULES:
- no_empty,	Sentence is empty
- not_too_long,	Sentence is more than 1024 characters long
- not_too_short,	Sentence is less than 3 words long
- length_ratio,	The length ratio between the source sentence and target sentence (in bytes) is too low or too high
- no_identical,	Alphabetic content in source sentence and target sentence is identical
- no_literals,  Unwanted literals: "Re:","{{", "%s", "}}", "+++", "***", '=\"'
- no_only_symbols,	The ratio of non-alphabetic characters in source sentence is more than 90%
- no_only_numbers,	The ratio of numeric characters in source sentence is too high
- no_urls,	There are URLs (disabled by default)
- no_breadcrumbs,	There are more than 2 breadcrumb characters in the sentence
- no_glued_words,	There are words in the sentence containing too many uppercased characters between lowercased characters
- no_repeated_words, There are words repeated consecutively
- no_unicode_noise,	Too many characters from unwanted unicode in source sentence
- no_space_noise,	Too many consecutive single characters separated by spaces in the sentence (excludes digits)
- no_paren,	Too many parentheses or brackets in sentence
- no_escaped_unicode,	There are escaped unicode characters in sentence
- no_bad_encoding,	Source sentence or target sentence contains mojibake
- no_titles,	All words in source sentence or target sentence are uppercased or in titlecase
- no_wrong_language,	Sentence is not in the desired language
- no_porn,	Source sentence or target sentence contains text identified as porn
- no_number_inconsistencies,	Sentence contains different numbers in source and target (disabled by default)
- no_script_inconsistencies,	Source or target sentence contains characters from different scripts/writing systems (disabled by default)
- lm_filter,	The sentence pair has low fluency score from the language model
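
To make a few of these rules concrete, here is a minimal Python sketch of four of them. It is an illustrative reimplementation, not Bicleaner's own code: the function name `failed_rule` is hypothetical, only the two thresholds (1024 characters, 3 words) come from the list above, and applying the length checks to each side separately is an assumption.

```python
MAX_CHARS = 1024  # threshold from the not_too_long description
MIN_WORDS = 3     # threshold from the not_too_short description


def _letters(text):
    """Keep only alphabetic characters, lowercased (assumed comparison key)."""
    return "".join(ch for ch in text if ch.isalpha()).lower()


def failed_rule(src, tgt):
    """Return the name of the first rule a pair violates, or None if it passes.

    Hypothetical helper for illustration; not part of bicleaner-hardrules.
    """
    for side in (src, tgt):
        if not side.strip():
            return "no_empty"
        if len(side) > MAX_CHARS:
            return "not_too_long"
        if len(side.split()) < MIN_WORDS:
            return "not_too_short"
    # Pairs whose alphabetic content is the same on both sides carry no translation.
    if _letters(src) == _letters(tgt):
        return "no_identical"
    return None
```

A pair that passes every check returns `None`; the real tool applies many more rules than these four, so a `None` here does not mean Bicleaner would keep the pair.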

In [1]:
#install dependency libraries
!apt install libhunspell-dev
!apt-get install hunspell-en-us
# hunspell-en-med ??
!apt-get install hunspell-de-de
!pip install hunspell

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  dictionaries-common hunspell-en-us libhunspell-1.7-0 libtext-iconv-perl
Suggested packages:
  ispell | aspell | hunspell wordlist hunspell openoffice.org-hunspell | openoffice.org-core
The following NEW packages will be installed:
  dictionaries-common hunspell-en-us libhunspell-1.7-0 libhunspell-dev libtext-iconv-perl
0 upgraded, 5 newly installed, 0 to remove and 29 not upgraded.
Need to get 896 kB of archives.
After this operation, 3,130 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libtext-iconv-perl amd64 1.7-7build3 [14.3 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 dictionaries-common all 1.28.14 [185 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 hunspell-en-us all 1:2020.12.07-2 [280 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/main amd64 libhun

In [2]:
# install KenLM (built with a maximum n-gram order of 7), a prerequisite for bicleaner-hardrules
!pip install --config-settings="--build-option=--max_order=7" https://github.com/kpu/kenlm/archive/master.zip

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip (553 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 553.6/553.6 kB 2.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... done
  Created wheel for kenlm: filename=kenlm-0.2.0-cp311-cp311-linux_x86_64.whl size=3187002 sha256=3291eb21f6148b592817c8ef1bd01e87dcae782095b2ec4229af33c228c4c638
  Stored in directory: /tmp/pip-ephem-wheel-cache-2q69_c78/wheels/4e/ca/6a/e5da175b1396483f6f410cdb4cfe8bc8fa5e12088e91d60413
Successfully built kenlm
Installing collected packages: kenlm
Successfully installed kenlm-0.2.0


In [3]:
!pip list > requirements.txt
!cat requirements.txt   # to show versions of all libraries

Package                            Version
---------------------------------- -------------------
absl-py                            1.4.0
accelerate                         1.3.0
aiohappyeyeballs                   2.6.1
aiohttp                            3.11.13
aiosignal                          1.3.2
alabaster                          1.0.0
albucore                           0.0.23
albumentations                     2.0.5
ale-py                             0.10.2
altair                             5.5.0
annotated-types                    0.7.0
anyio                              3.7.1
argon2-cffi                        23.1.0
argon2-cffi-bindings               21.2.0
array_record                       0.7.1
arviz                              0.20.0
astropy                            7.0.1
astropy-iers-data                  0.2025.3.10.0.29.26
astunparse                         1.6.3
atpublic                           4.1.0
attrs                              25.3.0
audioread          

In [4]:
# clone the repo from GitHub and change to the working directory
!git clone https://github.com/fubotz/BMT_2025S
%cd BMT_2025S/week2_files/Basic-MT_week2_files

fatal: destination path 'BMT_2025S' already exists and is not an empty directory.
/content/BMT_2025S/week2_files/Basic-MT_week2_files


In [5]:
#load parallel corpus
#check number of lines
!wc -l dev*

   500 dev.en-de
   500 dev.en-de.de
   500 dev.en-de.en
  1500 total


In [6]:
# bicleaner requires the parallel data in a single file with tab-separated columns (en, then de)
!paste dev.en-de.en dev.en-de.de > dev.en-de
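
`paste` silently pads the shorter file when the two sides have different lengths, which would misalign the corpus. The following is a minimal Python sketch of the same step with an explicit alignment check. The function name `paste_parallel` is hypothetical, and rejecting sentences that contain a tab is an assumption about what is safe for a tab-separated file.

```python
from itertools import zip_longest


def paste_parallel(src_path, tgt_path, out_path):
    """Write 'source<TAB>target' lines and return the number of pairs written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        for s, t in zip_longest(src, tgt):
            # zip_longest yields None once one file is exhausted before the other.
            if s is None or t is None:
                raise ValueError(f"files differ in length after {count} lines")
            s, t = s.rstrip("\n"), t.rstrip("\n")
            # A tab inside a sentence would shift the columns of the output.
            if "\t" in s or "\t" in t:
                raise ValueError(f"tab inside sentence on line {count + 1}")
            out.write(f"{s}\t{t}\n")
            count += 1
    return count
```

Calling `paste_parallel("dev.en-de.en", "dev.en-de.de", "dev.en-de")` would produce the same file as the `paste` command for well-formed input.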

In [7]:
#check output
!head dev.en-de

Yevonde's most famous work was inspired by a theme party held on 5 March 1935, where guests dressed as Roman and Greek gods and goddesses.	Besonders bekannt wurden ihre Aufnahmen von einem Fest 1935, zu dem Gäste als griechische Götter und Göttinnen verkleidet kamen.
Mora is working on a trilogy about the IT specialist Darius Kopp, of which band I "The Only Man on the Continent" and Volume II "The Monster" have already appeared.	Terézia Mora arbeitet an einer Trilogie um den IT-Spezialisten Darius Kopp, von der Band I „Der einzige Mann auf dem Kontinent“ und Band II „Das Ungeheuer“ bereits erschienen sind.
The first person to enter this section was Günther J. Wolf with seven members of his ice course.	Eine erste Befahrung dieses Abschnitts gelang Günther J. Wolf mit sieben Teilnehmern seines Eiskurses.
They were renumbered in 1970 to 100 903 and 904, and in 1973 to 199 003 and 004.	Sie wurden 1970 in 100 903 und 904, 1973 in 199 003 und 004 umgenummert.
The grave is probably a disturbe

In [8]:
!pip install cyhunspell

Collecting cyhunspell
  Downloading CyHunspell-1.3.4.tar.gz (2.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.7/2.7 MB 7.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting cacheman>=2.0.6 (from cyhunspell)
  Downloading CacheMan-2.2.0-py2.py3-none-any.whl.metadata (5.8 kB)
Downloading CacheMan-2.2.0-py2.py3-none-any.whl (13 kB)
Building wheels for collected packages: cyhunspell
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for cyhunspell (setup.py) ... error
  ERROR: Failed building wheel for cyhunspell
  Running setup.py clean for cyhunspell
Failed to build cyhunspell
ERROR: ERROR: Failed t

In [1]:
!pip install numpy==1.24



In [2]:
!pip install bicleaner-hardrules

Collecting bicleaner-hardrules
  Downloading bicleaner_hardrules-2.10.6-py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (14 kB)
Collecting toolwrapper<=3,>=1.0 (from bicleaner-hardrules)
  Downloading toolwrapper-2.1.0.tar.gz (3.2 kB)
  Preparing metadata (setup.py) ... done
Collecting sacremoses==0.0.53 (from bicleaner-hardrules)
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 880.6/880.6 kB 11.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting fasttext-wheel==0.9.2 (from bicleaner-hardrules)
  Downloading fasttext_wheel-0.9.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting fastspell==0.11.1 (from bicleaner-hardrules)
  Downloading fastspell-0.11.1-py3-none-any.whl.metadata (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.6/53.6 kB 3.6 MB/s eta 0:00:00


In [6]:
#apply bicleaner
!bicleaner-hardrules  \
        -s en -t de \
        dev.en-de  \
        dev.en-de.classified

2025-03-18 13:02:09,923 - INFO - LM filtering disabled.
2025-03-18 13:02:09,923 - INFO - Porn removal disabled.
2025-03-18 13:02:09,937 - INFO - Executing main program...
2025-03-18 13:02:09,938 - INFO - Starting process
2025-03-18 13:02:09,938 - INFO - Running 1 workers at 10000 rows per block
2025-03-18 13:02:09,948 - INFO - Start mapping
2025-03-18 13:02:09,954 - INFO - End mapping
2025-03-18 13:02:12,117 - INFO - Hard rules applied. Output available in dev.en-de.classified
2025-03-18 13:02:12,124 - INFO - Finished
2025-03-18 13:02:12,124 - INFO - Total: 500 rows
2025-03-18 13:02:12,124 - INFO - Elapsed time 2.19 s
2025-03-18 13:02:12,124 - INFO - Troughput: 228 rows/s
2025-03-18 13:02:12,124 - INFO - Program finished


In [7]:
#check file
!head dev.en-de.classified

Yevonde's most famous work was inspired by a theme party held on 5 March 1935, where guests dressed as Roman and Greek gods and goddesses.	Besonders bekannt wurden ihre Aufnahmen von einem Fest 1935, zu dem Gäste als griechische Götter und Göttinnen verkleidet kamen.	1
Mora is working on a trilogy about the IT specialist Darius Kopp, of which band I "The Only Man on the Continent" and Volume II "The Monster" have already appeared.	Terézia Mora arbeitet an einer Trilogie um den IT-Spezialisten Darius Kopp, von der Band I „Der einzige Mann auf dem Kontinent“ und Band II „Das Ungeheuer“ bereits erschienen sind.	1
The first person to enter this section was Günther J. Wolf with seven members of his ice course.	Eine erste Befahrung dieses Abschnitts gelang Günther J. Wolf mit sieben Teilnehmern seines Eiskurses.	1
They were renumbered in 1970 to 100 903 and 904, and in 1973 to 199 003 and 004.	Sie wurden 1970 in 100 903 und 904, 1973 in 199 003 und 004 umgenummert.	1
The grave is probably a 

In [None]:
# keep only the sentence pairs labelled 1 (they passed the hard rules)
!grep '1$' dev.en-de.classified >  dev.en-de.clean

In [None]:
# collect the discarded pairs (label 0) for inspection
!grep '0$' dev.en-de.classified >  dev.en-de.filter
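
The two `grep` commands above match on the last character of each line. A slightly stricter variant reads the final tab-separated field instead, so a row with an unexpected label is reported rather than silently dropped. This is a minimal sketch: the function name `split_by_label` is hypothetical, and it assumes the last column is always `0` or `1`, as in the output shown earlier.

```python
def split_by_label(classified_path, clean_path, filtered_path):
    """Route each row by its final 0/1 column; return (kept, discarded)."""
    kept = discarded = 0
    with open(classified_path, encoding="utf-8") as rows, \
         open(clean_path, "w", encoding="utf-8") as clean, \
         open(filtered_path, "w", encoding="utf-8") as filtered:
        for row in rows:
            # The label is whatever follows the last tab on the line.
            label = row.rstrip("\n").rsplit("\t", 1)[-1]
            if label == "1":
                clean.write(row)
                kept += 1
            elif label == "0":
                filtered.write(row)
                discarded += 1
            else:
                raise ValueError(f"unexpected label {label!r}")
    return kept, discarded
```

Like `grep`, this keeps the label column in both output files, so the later `cut -f1` and `cut -f2` steps work unchanged.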

In [None]:
#check files
!wc -l dev.en-de.classified
!wc -l dev.en-de.clean

500 dev.en-de.classified
466 dev.en-de.clean


In [None]:
!head -n 50 dev.en-de.clean

Yevonde's most famous work was inspired by a theme party held on 5 March 1935, where guests dressed as Roman and Greek gods and goddesses.	Besonders bekannt wurden ihre Aufnahmen von einem Fest 1935, zu dem Gäste als griechische Götter und Göttinnen verkleidet kamen.	1
Mora is working on a trilogy about the IT specialist Darius Kopp, of which band I "The Only Man on the Continent" and Volume II "The Monster" have already appeared.	Terézia Mora arbeitet an einer Trilogie um den IT-Spezialisten Darius Kopp, von der Band I „Der einzige Mann auf dem Kontinent“ und Band II „Das Ungeheuer“ bereits erschienen sind.	1
The first person to enter this section was Günther J. Wolf with seven members of his ice course.	Eine erste Befahrung dieses Abschnitts gelang Günther J. Wolf mit sieben Teilnehmern seines Eiskurses.	1
They were renumbered in 1970 to 100 903 and 904, and in 1973 to 199 003 and 004.	Sie wurden 1970 in 100 903 und 904, 1973 in 199 003 und 004 umgenummert.	1
The grave is probably a 

In [None]:
!head -n 50 dev.en-de.filter

He was an editor of the journals: Zeitschrift für Tropenmedizin, the Zentralblatt für Bakteriologie and the Zeitschrift für Parasitenkunde.	Ferner war er Herausgeber der Zeitschrift für Tropenmedizin, dem Zentralblatt für Bakteriologie und der Zeitschrift für Parasitenkunde.	0
"Das Himmelreich zu Erlangen – offen aus Tradition?"	Das Himmelreich zu Erlangen – offen aus Tradition?	0
"Wörterbuch zur Sprache und Kultur der Twareg".	Prasse: Wörterbuch zur Sprache und Kultur der Twareg.	0
Sensors and Actuators B: Chemical.	In: Sensors and Actuators B: Chemical.	0
The Daily Courier.	In: The Daily Courier.	0
Competitivitat de l´economia catalana en l´horitzó 2010: Effectes macroeconòmics del dèfiit fiscal amb l´Estat espanyol (Competitivity of the Catalan economy in the horizon 2010: Macroeconomic effects of the fiscal deficit with the Spanish State) - 2003 Polítiques públiques: Una visió renovada (Public politics: An updated perspective) - 2004 L´espoli fiscal.	Competitivitat de l´economia ca

In [None]:
# split file into columns
!cut -f1 dev.en-de.clean > dev.en-de.clean.en
!cut -f2 dev.en-de.clean > dev.en-de.clean.de

In [None]:
#check files
!wc -l dev.en-de.clean.en
!wc -l dev.en-de.clean.de

466 dev.en-de.clean.en
466 dev.en-de.clean.de


# TODO BICLEAN
  - Prepare the data: 500k training, 5k dev, and 5k test sentence pairs
  - Clean it with the hard rules
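
If the corpus is not already divided, the split sizes above could be produced along these lines. This is only a sketch under assumptions: the TODO does not say how the split is made, so the random shuffle, the fixed seed, and the function name `split_parallel` are all hypothetical choices.

```python
import random


def split_parallel(pairs, n_train, n_dev, n_test, seed=0):
    """Shuffle (source, target) pairs and cut them into train/dev/test lists."""
    needed = n_train + n_dev + n_test
    if len(pairs) < needed:
        raise ValueError(f"need {needed} pairs, got {len(pairs)}")
    shuffled = list(pairs)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:needed]
    return train, dev, test
```

For the sizes in the TODO the call would be `split_parallel(pairs, 500_000, 5_000, 5_000)`. Keeping source and target together as tuples ensures the two sides stay aligned through the shuffle.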


Paper: https://aclanthology.org/2020.eamt-1.31.pdf



# BPE

Using [subword-nmt](https://github.com/rsennrich/subword-nmt).

In [None]:
#install subword nmt
!pip install subword-nmt #==0.3.8


Collecting subword-nmt
  Downloading subword_nmt-0.3.8-py3-none-any.whl.metadata (9.2 kB)
Collecting mock (from subword-nmt)
  Downloading mock-5.2.0-py3-none-any.whl.metadata (3.1 kB)
Downloading subword_nmt-0.3.8-py3-none-any.whl (27 kB)
Downloading mock-5.2.0-py3-none-any.whl (31 kB)
Installing collected packages: mock, subword-nmt
Successfully installed mock-5.2.0 subword-nmt-0.3.8


In [None]:
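# learn a joint BPE model (16,000 merge operations) on both sides of the training data
# and write one vocabulary file per language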
!subword-nmt learn-joint-bpe-and-vocab --input train.en-de.en train.en-de.de -s 16000 -o train.bpe --write-vocabulary train.vocab.en train.vocab.de

100% 16000/16000 [01:17<00:00, 205.48it/s]
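
The progress bar counts merge operations: BPE learning repeatedly finds the most frequent adjacent symbol pair and merges it into a new symbol. The toy sketch below illustrates that loop on a tiny word-frequency dictionary. It is not subword-nmt's implementation, which is optimized and handles word boundaries differently; the function names and the `</w>` end-of-word marker are illustrative assumptions.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every standalone occurrence of 'left right' with 'leftright'."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the list of merges, most frequent pair first at each step."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


# Words are pre-split into characters; values are corpus frequencies.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
print(learn_bpe(toy_vocab, 3))
```

In the command above, `-s 16000` plays the role of `num_merges`, and the learned merges are what gets written to `train.bpe`.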


In [None]:
# apply BPE to the source (English) side of the dev set
!subword-nmt apply-bpe -c train.bpe < dev.en-de.en > dev.en-de.bpe.en

In [None]:
# inspect the BPE-segmented output
!head dev.en-de.bpe.en

Y@@ ev@@ on@@ de@@ 's most famous work was inspired by a theme party held on 5 March 193@@ 5, where gu@@ ests d@@ ressed as Roman and Greek god@@ s and god@@ dess@@ es.
Mor@@ a is working on a tr@@ il@@ og@@ y about the IT special@@ ist D@@ ari@@ us K@@ opp@@ , of which band I "The Only Man on the Contin@@ ent@@ " and Vol@@ ume II "The Mon@@ ster@@ " have already appear@@ ed.
The first person to enter this section was Gün@@ ther J. Wol@@ f with seven members of his ice cour@@ se.
They were ren@@ um@@ ber@@ ed in 1970 to 100 90@@ 3 and 90@@ 4, and in 1973 to 19@@ 9 00@@ 3 and 00@@ 4.
The grave is probably a dist@@ ur@@ bed arrange@@ ment, which was covered earlier with wood or ston@@ es.
Per@@ sec@@ u@@ tions ended following John@@ 's death on 23 May 167@@ 7, at the age of 7@@ 4.
In celeb@@ ration he wrote a book entitled Three Vis@@ its to Mad@@ ag@@ as@@ car (185@@ 8).
Berlin@@ ale Tal@@ ents and Per@@ spek@@ tive Deutsch@@ es Kin@@ o have joined forces to award the inaug@@ ural “@@ K

# TODO BPE

- Train bpe model with the training data
- Apply on training, dev, and test
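
One way to cover all six files is to generate the `apply-bpe` command for every split and language. The sketch below only builds the argument lists. The `train.*` and `dev.*` names follow the cells above, while the `test.en-de.*` names and the helper `bpe_commands` are assumptions. It uses the `-i`/`-o` options of `apply-bpe` instead of shell redirection.

```python
def bpe_commands(codes="train.bpe", splits=("train", "dev", "test"), langs=("en", "de")):
    """Build one 'subword-nmt apply-bpe' argument list per split and language."""
    commands = []
    for split in splits:
        for lang in langs:
            src = f"{split}.en-de.{lang}"      # assumed input naming scheme
            out = f"{split}.en-de.bpe.{lang}"  # mirrors dev.en-de.bpe.en above
            commands.append(["subword-nmt", "apply-bpe", "-c", codes, "-i", src, "-o", out])
    return commands
```

Each list can then be run with `subprocess.run(cmd, check=True)`, or joined with spaces and pasted into a `!` cell.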

**NOTE:** to restore the original segmentation (undo BPE), use:


```
!sed -r 's/(@@ )|(@@ ?$)//g' < file_in > file_out
```
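
The same substitution can be done in Python when the text is already in memory. This sketch reuses the regular expression from the `sed` command; the function name `remove_bpe` is hypothetical.

```python
import re

# "@@ " joins a subword to the next one; a trailing "@@" at end of line is dropped.
_BPE_JOIN = re.compile(r"(@@ )|(@@ ?$)")


def remove_bpe(line):
    """Merge BPE subwords back into whole words."""
    return _BPE_JOIN.sub("", line)


print(remove_bpe("Y@@ ev@@ on@@ de@@ 's most famous work"))
```

Applied to the first segmented line shown earlier, this restores `Yevonde's` and leaves words that were never split untouched.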

