<a href="https://colab.research.google.com/github/VincentCCL/MTAT/blob/main/notebooks/MTAT26_DataPreparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 2. Data Preparation
High-quality data preparation is essential for any machine translation model. Neural models are sensitive to noise and inconsistencies in the training data: even small preprocessing problems can degrade translation quality or hinder reproducibility.

This chapter introduces a practical data preparation workflow using parallel data from the **OPUS-Tatoeba corpus** (English–Dutch). The workflow serves as an example: the same steps can be used to prepare other datasets for MT.

## 2.1 The OPUS collection
[OPUS](https://opus.nlpl.eu/) (Tiedemann, 2012) is a large open collection of parallel corpora compiled from multilingual resources on the web. The project, initiated by Jörg Tiedemann, aggregates data from sources such as movie subtitles, parliamentary proceedings, software documentation, religious
texts, TED talks, GNOME and KDE localisations, and many others.

OPUS provides:
* unified formats (e.g. Moses, TMX),
* automatic sentence alignment,
* language identification and metadata,
* downloadable parallel sentence pairs for thousands of language pairs,
* stable, versioned releases for reproducible experiments.

Because OPUS handles alignment and cleaning, it is widely used for MT training and evaluation.

## 2.2 The Tatoeba Project
[Tatoeba](https://tatoeba.org/) is a volunteer-driven project that collects example sentences translated into many languages. It began in 2006 and has grown into a large multilingual collection. The original dataset is organised as a many-to-many graph of translations, not as a clean parallel corpus.

OPUS releases a processed version of Tatoeba that:
* extracts verified translation pairs,
* normalises encoding,
* provides sentence-aligned files per language pair.



## 2.3 Overview of the Pipeline
We follow the standard MT data preparation workflow:
1. Download parallel data from OPUS-Tatoeba.
2. Inspect the corpus.
3. Clean and normalise text.
4. Apply filtering (length, ratio, noise).
5. Shuffle reproducibly.
6. Tokenise at word level.
7. Split into train/dev/test sets.
8. Save files for model training.

### 2.3.1 Downloading the corpus
We use the Moses-format release for English–Dutch from OPUS-Tatoeba:

In [None]:
!wget https://object.pouta.csc.fi/OPUS-Tatoeba/v2023-04-12/moses/en-nl.txt.zip
!unzip *.zip

--2026-02-19 14:15:28--  https://object.pouta.csc.fi/OPUS-Tatoeba/v2023-04-12/moses/en-nl.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2368008 (2.3M) [application/zip]
Saving to: ‘en-nl.txt.zip’


2026-02-19 14:15:31 (1.62 MB/s) - ‘en-nl.txt.zip’ saved [2368008/2368008]

Archive:  en-nl.txt.zip
  inflating: README                  
  inflating: LICENSE                 
  inflating: Tatoeba.en-nl.en        
  inflating: Tatoeba.en-nl.nl        
  inflating: Tatoeba.en-nl.xml       


* The exclamation mark `!` tells Google Colab to run the rest of the line as a Linux shell command rather than as Python code.
* `wget https://...` is the Linux command to download a file from a URL.
* `unzip filename` is the Linux command to unpack a zip archive; here `*.zip` matches the archive we just downloaded.


### 2.3.2 Loading the Data

In [None]:
source = "Tatoeba.en-nl.en"
target = "Tatoeba.en-nl.nl"


In its Moses-format releases, OPUS names files following the pattern `Corpus.src-tgt.src` and `Corpus.src-tgt.tgt`. The lines in both files are aligned:
line i in the English file corresponds to line i in the Dutch file.

The cell above stores these two filenames in the Python variables `source` and `target`.


Next, we load the aligned files into a pandas DataFrame:

In [None]:
import pandas as pd
import csv

df_source = pd.read_csv(
    source, names=["Source"], sep="\0",
    quoting=csv.QUOTE_NONE, engine="python"
)

df_target = pd.read_csv(
    target, names=["Target"], sep="\0",
    quoting=csv.QUOTE_NONE, engine="python"
)

df = pd.concat([df_source, df_target], axis=1)
print("Initial size:", df.shape)

Initial size: (79541, 2)


* `import pandas as pd`: We import the `pandas` library, which provides high-level data structures for working with
tabular data. By convention, it is imported under the short alias `pd`.
* `import csv`: We import Python's built-in csv module in order to access predefined constants that control how text files are parsed, such as how quotation marks are handled.
* `df_source = pd.read_csv(...)`: We read the source-language file into a pandas DataFrame. Each line of the file corresponds to one sentence and becomes one row in the DataFrame.
* `source`: This variable contains the path to the source-language text file.
* `names=["Source"]`: Because the input file does not contain a header row, we explicitly assign a column name. The single column is named `Source`.
* `sep="\0"`: We specify a null character as the separator. Since this character does not occur in normal text, pandas treats each entire line as a single field, even if the sentence itself contains spaces or punctuation.
* `quoting=csv.QUOTE_NONE`: We disable quotation handling entirely. This ensures that quotation marks inside sentences are treated as normal characters rather than as delimiters.
* `engine="python"`: We explicitly select the Python parsing engine, which supports custom separators such as the null character.
* `df_target = pd.read_csv(...)`: We repeat the same procedure for the target-language file, storing the result in a separate `DataFrame`.
* `names=["Target"]`: The single column of this `DataFrame` is named `Target`, corresponding to the target-language sentence.
* `df = pd.concat([df_source, df_target], axis=1)`: We concatenate the source and target DataFrames column-wise (`axis=1`), so that each row of `df` holds one aligned sentence pair.
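
As a cross-check on the loading step, the following sketch reads two line-aligned files with plain Python and verifies that both contain the same number of lines. The helper name `read_parallel` and the toy files are assumptions made for illustration; they are not part of the notebook. By default, `pd.read_csv` skips blank lines and interprets strings such as `NA` as missing values, so a plain-Python reader like this one may be useful as a reference when row counts look suspicious.

```python
import os
import tempfile


def read_parallel(src_path, tgt_path):
    """Read two line-aligned files and return a list of (source, target) pairs."""
    with open(src_path, encoding="utf-8") as sf:
        src = [line.rstrip("\n") for line in sf]
    with open(tgt_path, encoding="utf-8") as tf:
        tgt = [line.rstrip("\n") for line in tf]
    # Aligned files must have exactly one target line per source line.
    if len(src) != len(tgt):
        raise ValueError(
            f"Line count mismatch: {len(src)} source vs {len(tgt)} target lines"
        )
    return list(zip(src, tgt))


# Toy files (hypothetical content) written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "toy.en")
tgt_path = os.path.join(tmp_dir, "toy.nl")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("I have to go to sleep.\nNA\n")
with open(tgt_path, "w", encoding="utf-8") as f:
    f.write("Ik moet gaan slapen.\nn.v.t.\n")

pairs = read_parallel(src_path, tgt_path)
print(len(pairs), pairs[1])
```

Here the sentence `NA` is kept as an ordinary string instead of being turned into a missing value.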

In [None]:
df

Unnamed: 0,Source,Target
0,Let's try something.,Laten we iets proberen!
1,Let's try something.,Laat ons iets proberen.
2,I have to go to sleep.,Ik moet gaan slapen.
3,Today is June 18th and it is Muiriel's birthday!,Vandaag is het 18 juni en het is de verjaardag...
4,Muiriel is 20 now.,Muiriel is nu 20 jaar oud.
...,...,...
79536,"Sugar is sweet, butter is fat.","Suiker is zoet, boter is vet."
79537,Does the hen eat eggs?,Eet de kip eieren?
79538,The bee is an insect and it eats honey.,De bij is een insect en eet honing.
79539,The fat cow has good milk.,De dikke koe heeft goede melk.


## 2.4 Basic Cleaning
We apply simple but essential cleaning steps.

In [None]:
df = df.dropna()
df = df.drop_duplicates()
df = df[df["Source"] != df["Target"]].reset_index(drop=True)

df

Unnamed: 0,Source,Target
0,Let's try something.,Laten we iets proberen!
1,Let's try something.,Laat ons iets proberen.
2,I have to go to sleep.,Ik moet gaan slapen.
3,Today is June 18th and it is Muiriel's birthday!,Vandaag is het 18 juni en het is de verjaardag...
4,Muiriel is 20 now.,Muiriel is nu 20 jaar oud.
...,...,...
79477,"Sugar is sweet, butter is fat.","Suiker is zoet, boter is vet."
79478,Does the hen eat eggs?,Eet de kip eieren?
79479,The bee is an insect and it eats honey.,De bij is een insect en eet honing.
79480,The fat cow has good milk.,De dikke koe heeft goede melk.


* `df = df.dropna()`: We remove all rows that contain missing values (`NaN`) in either the source or target column. In a parallel corpus, incomplete sentence pairs are not useful for training and must be discarded.
* `df = df.drop_duplicates()`: We remove duplicate rows from the `DataFrame`. This eliminates repeated source–target sentence pairs, which could otherwise bias the model during training by over-representing certain examples.
* `df = df[df["Source"] != df["Target"]].reset_index(drop=True)`: We filter out sentence pairs where the source and target sentences are identical. Such pairs typically indicate noise (e.g. untranslated sentences) and provide no useful learning signal for a translation model. After filtering, we reset the `DataFrame` index and drop the old index so that row numbering remains consecutive.
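
To see how many rows each cleaning step removes, it can help to record the size of the `DataFrame` after every operation. The sketch below does this on a small toy `DataFrame`; the example sentences and the `sizes` dictionary are assumptions for illustration only.

```python
import pandas as pd

# Toy corpus: one duplicate pair, one identical pair, one missing source.
toy = pd.DataFrame({
    "Source": ["Hello.", "Hello.", "OK", None, "Good night."],
    "Target": ["Hallo.", "Hallo.", "OK", "Dag.", "Goedenacht."],
})

sizes = {"initial": len(toy)}

toy = toy.dropna()                      # remove incomplete pairs
sizes["after dropna"] = len(toy)

toy = toy.drop_duplicates()             # remove repeated pairs
sizes["after drop_duplicates"] = len(toy)

toy = toy[toy["Source"] != toy["Target"]].reset_index(drop=True)
sizes["after identical-pair filter"] = len(toy)

for step, n in sizes.items():
    print(f"{step}: {n}")
```

Each step removes exactly one row from the toy data, leaving two sentence pairs. The same bookkeeping applied to a real corpus gives the raw and cleaned sizes that a processing report would typically mention.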

Next, we remove residual markup (rare in Tatoeba, but included for consistency).

In [None]:
import re

clean_re = r"<.*?>|&?(amp|nbsp|quot);"

df["Source"] = (
    df["Source"].replace(clean_re, " ", regex=True)
                 .replace(r"  +", " ", regex=True)
                 .str.strip()
)

df["Target"] = (
    df["Target"].replace(clean_re, " ", regex=True)
                 .replace(r"  +", " ", regex=True)
                 .str.strip()
)
df

Unnamed: 0,Source,Target
0,Let's try something.,Laten we iets proberen!
1,Let's try something.,Laat ons iets proberen.
2,I have to go to sleep.,Ik moet gaan slapen.
3,Today is June 18th and it is Muiriel's birthday!,Vandaag is het 18 juni en het is de verjaardag...
4,Muiriel is 20 now.,Muiriel is nu 20 jaar oud.
...,...,...
79477,"Sugar is sweet, butter is fat.","Suiker is zoet, boter is vet."
79478,Does the hen eat eggs?,Eet de kip eieren?
79479,The bee is an insect and it eats honey.,De bij is een insect en eet honing.
79480,The fat cow has good milk.,De dikke koe heeft goede melk.


* `import re`: We import Python's built-in re module, which provides support for regular expressions. Regular expressions allow us to search for and remove structured patterns such as markup and encoded symbols.
* `clean_re = r"<.*?>|&?(amp|nbsp|quot);"`: We define a regular expression pattern that matches common types of markup and HTML entities:
  * `<.*?>` matches anything that looks like an HTML or XML tag.
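  * `&?(amp|nbsp|quot);` matches common HTML entities such as `&amp;`, `&nbsp;`, and `&quot;`. The `&?` makes the leading ampersand optional, so leftovers such as `amp;` are matched as well.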
* `df["Source"] = (...)`: We apply a sequence of text-cleaning operations to the source-language column of the `DataFrame`.
* `replace(clean_re, " ", regex=True)`: All substrings that match the regular expression pattern are replaced by a single space. This removes markup while preserving word boundaries.
* `replace(r"  +", " ", regex=True)`: We collapse runs of two or more consecutive spaces into a single space. This prevents the introduction of irregular spacing as a side effect of the previous replacement step.
* `str.strip()`: We remove leading and trailing whitespace from each sentence.
* `df["Target"] = (...)`: We repeat the same sequence of cleaning operations for the target-language column. Applying identical preprocessing to both languages is important for consistency in a parallel corpus.
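
The effect of the pattern can be checked on individual strings with `re.sub`, which is assumed here to behave like the regex-based `replace` used on the pandas columns. The helper name `clean_text` and the two test strings are hypothetical.

```python
import re

clean_re = r"<.*?>|&?(amp|nbsp|quot);"


def clean_text(text):
    """Apply the same three steps as the cell above to a single string."""
    text = re.sub(clean_re, " ", text)   # replace tags and entities by a space
    text = re.sub(r"  +", " ", text)     # collapse runs of spaces
    return text.strip()                  # trim leading/trailing whitespace


print(clean_text("<b>Hello</b> &amp; welcome&nbsp;home"))
print(clean_text("Visit the camp; it is fun."))
```

The first call yields `Hello welcome home`. The second yields `Visit the c it is fun.`: because the ampersand is optional in the pattern, the `amp;` inside `camp;` is also matched. On a corpus where words followed by a semicolon are common, a stricter pattern that requires the `&` may be preferable.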

## 2.5 Sentence-Level Filtering
Even after basic cleaning, parallel corpora may still contain problematic sentence pairs.
Some typical issues are:
* one side being much longer than the other (misalignment),
* unusually long sentences that are hard for the model to handle,
* residual noise that slipped through earlier steps.

Tatoeba sentences are generally short, but we include filters here both for good practice and for demonstration. We implement two common types of filters:
1. a maximum length filter, which removes sentences that are too long on either side;
2. a length ratio filter, which removes sentence pairs where one side is much longer than the other.

Let `|s|` be the number of tokens on the source (English) side and `|t|` the number of tokens on the target (Dutch) side. We define a maximum allowed ratio `r` and discard pairs that violate
$\frac{\max(|s|,|t|)}{\min(|s|,|t|)} \leq r.$

In practice, we approximate |s| and |t| using the number of spaces plus one (assuming one space
between tokens), and we set `r = max_ratio`.

In [None]:
max_spaces = 50
max_ratio  = 2.5

src_len = df["Source"].str.count(" ") + 1
tgt_len = df["Target"].str.count(" ") + 1

mask = (
    (src_len > tgt_len * max_ratio) |
    (tgt_len > src_len * max_ratio) |
    (src_len > max_spaces)          |
    (tgt_len > max_spaces)
)

df = df[~mask].reset_index(drop=True)


* `max_spaces` enforces an absolute upper bound on sentence length. Despite its name, it is compared against the approximate token count (spaces plus one), so pairs with more than 50 tokens on either side are removed;
* `max_ratio` enforces that source and target lengths are not too different (by more than a factor of 2.5 in either direction);
* the boolean `mask` selects all sentence pairs that violate these constraints, and we remove them with `df = df[~mask]`.
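
The two filters can also be written as a per-pair decision function, which makes the arithmetic easy to follow. This is a minimal sketch: the names `approx_len` and `keep_pair` are assumptions, and the example pairs are toy data.

```python
def approx_len(sentence):
    """Approximate token count: number of spaces plus one."""
    return sentence.count(" ") + 1


def keep_pair(src, tgt, max_len=50, max_ratio=2.5):
    """Return True if the pair passes both the length and the ratio filter."""
    s, t = approx_len(src), approx_len(tgt)
    if s > max_len or t > max_len:
        return False
    # Equivalent to: not (s > t * max_ratio or t > s * max_ratio)
    return max(s, t) / min(s, t) <= max_ratio


examples = [
    ("I have to go to sleep.", "Ik moet gaan slapen."),  # 6 vs 4 tokens
    ("Thanks!", "Dank je wel!"),                         # 1 vs 3 tokens
]
for src, tgt in examples:
    print(approx_len(src), approx_len(tgt), keep_pair(src, tgt))
```

The first pair has a ratio of 6/4 = 1.5 and is kept. The second has a ratio of 3/1 = 3.0 and is discarded even though it is a valid translation, which shows that ratio filters can be harsh on very short sentences.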

In [None]:
df

Unnamed: 0,Source,Target
0,Let's try something.,Laten we iets proberen!
1,Let's try something.,Laat ons iets proberen.
2,I have to go to sleep.,Ik moet gaan slapen.
3,Today is June 18th and it is Muiriel's birthday!,Vandaag is het 18 juni en het is de verjaardag...
4,Muiriel is 20 now.,Muiriel is nu 20 jaar oud.
...,...,...
79376,"Sugar is sweet, butter is fat.","Suiker is zoet, boter is vet."
79377,Does the hen eat eggs?,Eet de kip eieren?
79378,The bee is an insect and it eats honey.,De bij is een insect en eet honing.
79379,The fat cow has good milk.,De dikke koe heeft goede melk.


## 2.6 Shuffling the Corpus

In [None]:
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

* `df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)`: We randomly shuffle the rows of the `DataFrame`. Shuffling ensures that sentence pairs are presented to the model in a random order, which helps prevent undesired ordering effects during training.
* `frac=1.0`: The parameter `frac` specifies the fraction of rows to sample. A value of `1.0` means that all sentence pairs are included, but in a randomized order.
* `random_state=42`: We fix the random seed used for shuffling. This makes the shuffling process reproducible: running the code again with the same seed will result in the same order of sentence pairs.
* `reset_index(drop=True)`: After shuffling, we reset the `DataFrame` index and discard the old index. This ensures that row indices remain consecutive and do not reflect the original ordering of the corpus.
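
A quick way to convince yourself that the shuffle is reproducible and keeps sentence pairs together is to shuffle a toy `DataFrame` twice with the same seed. The toy data below (lowercase letters paired with their uppercase forms) is an assumption for illustration.

```python
import pandas as pd

# Toy "corpus": each source letter is paired with its uppercase form.
toy = pd.DataFrame({"Source": list("abcdef"), "Target": list("ABCDEF")})

shuffled_1 = toy.sample(frac=1.0, random_state=42).reset_index(drop=True)
shuffled_2 = toy.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Same seed, same order.
same_order = shuffled_1.equals(shuffled_2)
# Whole rows are shuffled, so each source still sits next to its target.
pairs_intact = bool((shuffled_1["Source"].str.upper() == shuffled_1["Target"]).all())

print(same_order, pairs_intact)
```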

## 2.7 Tokenisation
We apply the Moses tokeniser to both sides. Its effect is immediately visible: for example, the raw English sentence `I can't believe it's true!` is transformed into `I can 't believe it 's true !`, where punctuation is separated and contractions are split at the apostrophe in a consistent way. Similarly, Dutch-specific rules handle quotation marks and punctuation in a language-aware way. Such normalization reduces sparsity in the training data by ensuring that the same linguistic patterns are represented consistently across sentences.

Historically, Moses tokenization originates from the Moses statistical machine translation toolkit (Koehn et al., 2007) and became a de facto standard in MT research for many years. Even in neural machine translation, Moses-style tokenization remains widely used as a baseline because it is robust, language-aware, and well understood.

In [None]:
!pip install -q sacremoses
from sacremoses import MosesTokenizer

tok_en = MosesTokenizer(lang="en")
tok_nl = MosesTokenizer(lang="nl")

df["Source_Tok"] = df["Source"].apply(lambda s: tok_en.tokenize(s, return_str=True, escape=False))
df["Target_Tok"] = df["Target"].apply(lambda s: tok_nl.tokenize(s, return_str=True, escape=False))


* `!pip install -q sacremoses`: We install the `sacremoses` package, which is a Python reimplementation of the tokenization and preprocessing tools from the Moses statistical machine translation toolkit. The `-q` flag suppresses verbose installation output.
* `from sacremoses import MosesTokenizer`: We import the `MosesTokenizer` class, which provides language-aware tokenization rules commonly used in machine translation.
* `tok_en = MosesTokenizer(lang="en")`: We create a tokenizer configured for English. This tokenizer applies English-specific rules for splitting text into tokens, handling punctuation, contractions, and special symbols.
* `tok_nl = MosesTokenizer(lang="nl")`: We create a tokenizer configured for Dutch. Using a language-specific tokenizer ensures that tokenization respects the conventions of the target language.
* `df["Source_Tok"] = df["Source"].apply(...)`: We apply the English tokenizer to each source-language sentence in the `DataFrame` and store the tokenized result in a new column called `Source_Tok`.
* `lambda s: tok_en.tokenize(s, return_str=True, escape=False)`: For each sentence `s`, we tokenize it using the Moses tokenizer. The parameter `return_str=True` ensures that the output is a single space-separated string of tokens rather than a list. `escape=False` indicates that special characters should be left as they are, so apostrophes are not changed into `&apos;`.
* `df["Target_Tok"] = df["Target"].apply(...)`: We repeat the same procedure for the target-language sentences, storing the tokenized output in the column `Target_Tok`.
* `lambda s: tok_nl.tokenize(s, return_str=True, escape=False)`: Each Dutch sentence is tokenized according to Dutch-specific rules, ensuring consistent and language-appropriate preprocessing.

In [None]:
df

Unnamed: 0,Source,Target,Source_Tok,Target_Tok
0,She is as clever as she is beautiful.,Ze is even intelligent als mooi.,She is as clever as she is beautiful .,Ze is even intelligent als mooi .
1,Be realistic.,Wees realistisch.,Be realistic .,Wees realistisch .
2,Carl looked very happy.,Carl zag er erg blij uit.,Carl looked very happy .,Carl zag er erg blij uit .
3,"Show me your passport, please.",Wilt u me uw paspoort even laten zien alstubli...,"Show me your passport , please .",Wilt u me uw paspoort even laten zien alstubli...
4,She was really impressed.,Ze was echt onder de indruk.,She was really impressed .,Ze was echt onder de indruk .
...,...,...,...,...
79376,Let's play soccer.,Laten we voetbal spelen.,Let 's play soccer .,Laten we voetbal spelen .
79377,Some German words are extremely difficult for ...,Sommige Duitse woorden zijn uiterst moeilijk u...,Some German words are extremely difficult for ...,Sommige Duitse woorden zijn uiterst moeilijk u...
79378,I have fully come to terms with it.,Ik heb me er volledig bij neergelegd.,I have fully come to terms with it .,Ik heb me er volledig bij neergelegd .
79379,For once in my life I'm doing a good deed... A...,Doe ik ook eens een keer een goede daad... haa...,For once in my life I 'm doing a good deed ......,Doe ik ook eens een keer een goede daad ... ha...


## 2.8 Train/Dev/Test Split
Splitting the data into training, development, and test sets is a fundamental principle of experimental machine translation. The training set is used to learn model parameters, while the development set provides an unbiased signal for monitoring progress, tuning hyperparameters, and applying techniques such as early stopping. Crucially, the test set is kept completely separate and is only used once the model design is finalized. This separation prevents overfitting and ensures that reported evaluation scores reflect the model's ability to generalize to unseen data rather than its ability to memorize the training corpus. By fixing random seeds and clearly defining the splits, we also ensure reproducibility, allowing experiments to be repeated and compared
in a scientifically sound manner.

We extract small dev/test sets; the remainder becomes training data.

In [None]:
num_dev = 1000
num_test = 1000

df_dev = df.sample(n=num_dev, random_state=1)
df_train = df.drop(df_dev.index)

df_test = df_train.sample(n=num_test, random_state=2)
df_train = df_train.drop(df_test.index)

print("Train/dev/test sizes:",
      len(df_train), len(df_dev), len(df_test))

Train/dev/test sizes: 77381 1000 1000


* `num_dev = 1000`: We define the number of sentence pairs to be used for the development (validation) set. This set is used during training to monitor model performance and tune hyperparameters.
* `num_test = 1000`: We define the number of sentence pairs to be used for the test set. This set is kept separate and is only used for the final evaluation of the trained model.
* `df_dev = df.sample(n=num_dev, random_state=1)`: We randomly sample `num_dev` sentence pairs from the full `DataFrame` to create the development set. Fixing the random seed ensures that the same sentences are selected each time the code is run.
* `df_train = df.drop(df_dev.index)`: We remove the development sentences from the original `DataFrame`. The remaining sentence pairs form a temporary training pool.
* `df_test = df_train.sample(n=num_test, random_state=2)`: From the remaining data, we randomly sample `num_test` sentence pairs to create the test set. A different random seed is used to ensure an independent selection from the
development set.
* `df_train = df_train.drop(df_test.index)`: We remove the test sentences from the training pool. The remaining sentence pairs constitute the final training set.
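* `print("Train/dev/test sizes:", ...)`: We print the number of sentence pairs in each split. This allows us to verify that the splits have the intended sizes and that they add up to `len(df)`. The splits are mutually exclusive by construction, because each sampled subset is dropped from the pool before the next one is drawn.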
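
Printing the sizes does not by itself show that the splits are disjoint. A minimal sketch of an explicit check on a toy `DataFrame` is given below; the toy data and all `toy_*` names are assumptions. The last check looks at the sentence text rather than the row index: in a corpus such as Tatoeba, where one source sentence can have several translations, the same source sentence may end up in more than one split even though the rows differ.

```python
import pandas as pd

toy = pd.DataFrame({
    "Source": [f"s{i}" for i in range(20)],
    "Target": [f"t{i}" for i in range(20)],
})

# Same procedure as in the cell above, with smaller split sizes.
toy_dev = toy.sample(n=3, random_state=1)
toy_train = toy.drop(toy_dev.index)
toy_test = toy_train.sample(n=3, random_state=2)
toy_train = toy_train.drop(toy_test.index)

train_idx = set(toy_train.index)
dev_idx = set(toy_dev.index)
test_idx = set(toy_test.index)

# Row-level checks: no shared rows, and nothing lost.
disjoint = (
    train_idx.isdisjoint(dev_idx)
    and train_idx.isdisjoint(test_idx)
    and dev_idx.isdisjoint(test_idx)
)
complete = len(train_idx | dev_idx | test_idx) == len(toy)

# Text-level check: source sentences that occur in both train and test.
shared_sources = set(toy_train["Source"]) & set(toy_test["Source"])

print(disjoint, complete, len(shared_sources))
```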

In [None]:
df_dev

Unnamed: 0,Source,Target,Source_Tok,Target_Tok
74751,Nobody reads about my country.,Niemand leest over mijn land.,Nobody reads about my country .,Niemand leest over mijn land .
37854,I sleep in my car.,Ik slaap in mijn auto.,I sleep in my car .,Ik slaap in mijn auto .
46729,There're clean sheets under the bed.,Er liggen schone lakens onder het bed.,There 're clean sheets under the bed .,Er liggen schone lakens onder het bed .
27043,I have a donkey.,Ik heb een ezel.,I have a donkey .,Ik heb een ezel .
16425,Betty drives fast.,Betty rijdt snel.,Betty drives fast .,Betty rijdt snel .
...,...,...,...,...
17861,I love Norwegian!,Ik ben gek op Noors!,I love Norwegian !,Ik ben gek op Noors !
48843,We've got to do something about this problem s...,We moeten snel iets aan dit probleem doen.,We 've got to do something about this problem ...,We moeten snel iets aan dit probleem doen .
1570,She is wearing a hat.,Ze heeft een hoed op.,She is wearing a hat .,Ze heeft een hoed op .
22611,Would you like to see a UFO?,Zou je een ufo willen zien?,Would you like to see a UFO ?,Zou je een ufo willen zien ?


## 2.9 Saving the Final Output Files


### 2.9.1 Saving files locally in Colab

In [None]:
corpus_prefix="tatoeba-en-nl."

def write_split(df_split, srcfile, tgtfile):
    with open(srcfile, "w", encoding="utf-8") as sf:
        sf.write("\n".join(df_split["Source_Tok"]) + "\n")
    with open(tgtfile, "w", encoding="utf-8") as tf:
        tf.write("\n".join(df_split["Target_Tok"]) + "\n")
    print(f"Wrote {srcfile}, {tgtfile}")

write_split(df_train, corpus_prefix+"train.en", corpus_prefix+"train.nl")
write_split(df_dev,   corpus_prefix+"dev.en",   corpus_prefix+"dev.nl")
write_split(df_test,  corpus_prefix+"test.en",  corpus_prefix+"test.nl")



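Wrote tatoeba-en-nl.train.en, tatoeba-en-nl.train.nl
Wrote tatoeba-en-nl.dev.en, tatoeba-en-nl.dev.nl
Wrote tatoeba-en-nl.test.en, tatoeba-en-nl.test.nl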


* `def write_split(df_split, srcfile, tgtfile):`: We define a function that writes a single data split (training, development, or test) to disk
as two separate text files: one for the source language and one for the target language.
* `df_split`: This argument contains a DataFrame representing one data split. Each row corresponds to one parallel sentence pair.
* `srcfile, tgtfile`: These arguments specify the filenames for the source-language and target-language output files.
* `with open(srcfile, "w", encoding="utf-8") as sf:`: We open the source-language output file in write mode using UTF-8 encoding, which
ensures correct handling of multilingual text.
* `sf.write("\n".join(df_split["Source_Tok"]) + "\n")`: We write all tokenized source-language sentences to the file, one sentence per line. Sentences
are joined using newline characters, and a final newline is added to comply with common MT data format conventions.
* `with open(tgtfile, "w", encoding="utf-8") as tf:`: We open the target-language output file in the same way.
* `tf.write("\n".join(df_split["Target_Tok"]) + "\n")`: We write the tokenized target-language sentences to the target file, again using one sentence
per line and preserving sentence alignment with the source file.
* `write_split(df_train, corpus_prefix+"train.en", corpus_prefix+"train.nl")`: We write the training split to disk, producing the files `tatoeba-en-nl.train.en` and `tatoeba-en-nl.train.nl`.
* `write_split(df_dev, corpus_prefix+"dev.en", corpus_prefix+"dev.nl")`: We write the development split to disk.
* `write_split(df_test, corpus_prefix+"test.en", corpus_prefix+"test.nl")`: We write the test split to disk.
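
After writing, it is worth confirming that the source and target files of a split contain the same number of lines, since a stray newline inside a sentence would silently break the alignment. The sketch below writes a toy split in the same format and counts the lines; the helper `count_lines` and the toy file names are assumptions.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


# Toy tokenized split, written in the same way as write_split does.
toy_src = ["Be realistic .", "I have a donkey ."]
toy_tgt = ["Wees realistisch .", "Ik heb een ezel ."]

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "toy.en")
tgt_path = os.path.join(tmp_dir, "toy.nl")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("\n".join(toy_src) + "\n")
with open(tgt_path, "w", encoding="utf-8") as f:
    f.write("\n".join(toy_tgt) + "\n")

n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
print(n_src, n_tgt, n_src == n_tgt)
```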


### 2.9.2 Copying the files to Google Drive
Now the files have been written in our local directory on Google Colab. In order to keep them for reuse later, we have to copy them to our Google Drive.

We first need to connect to our Google Drive. Then we need to make sure we create a dedicated target directory.

In [None]:
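from google.colab import drive
drive.mount("/content/drive")  # connect the Colab session to Google Drive

!mkdir -p /content/drive/MyDrive/MTAT/
!cp *.en /content/drive/MyDrive/MTAT/
!cp *.nl /content/drive/MyDrive/MTAT/

* `drive.mount("/content/drive")` connects the Colab session to our Google Drive; Colab asks for authorisation when this is run;
* `mkdir -p path` creates the path if it doesn't exist yet;
* `cp *.en target_path` copies all files ending with `.en` to `target_path`;
* `cp *.nl target_path` does the same for the files ending with `.nl`.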

## 2.10 Discussion and Variants of the Pipeline

The data preparation steps presented in this chapter correspond to a standard and widely used MT preprocessing pipeline. Variants of this pipeline (with minor modifications) have been used for many years in both statistical and neural machine translation research and remain a solid default choice for controlled experiments.

The core principles illustrated here are broadly applicable:
* careful cleaning of parallel data,
* removal of noise and misalignments,
* consistent preprocessing on source and target sides,
* explicit and reproducible train/dev/test splits.

At the same time, it is important to be aware that several steps in the pipeline are design choices
rather than absolute requirements. Two common variations are discussed below.

### 2.10.1 Lowercasing

In many traditional MT pipelines, all text is converted to lowercase during preprocessing. Lowercasing
reduces vocabulary size and sparsity by collapsing word forms such as “House” and “house”
into a single token. This was particularly important in phrase-based and early neural MT systems,
where vocabulary size had a strong impact on model capacity and training stability.

Lowercasing can still be useful when:
* training models on very limited data,
* working with highly noisy text,
* prioritising robustness over orthographic fidelity.

However, modern neural MT systems are generally able to handle case distinctions well, especially
when trained on sufficient data. For this reason, many contemporary MT pipelines preserve the
original casing, as we have done in this chapter. Case preservation allows models to learn proper
capitalization, sentence-initial casing, and named entity conventions directly from the data.
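
The effect of lowercasing on vocabulary size can be illustrated with a few tokenized toy sentences. This is only a sketch: the sentences and the helper `vocabulary` are assumptions, and real corpora will show much smaller relative differences.

```python
from collections import Counter

# Toy tokenized sentences (hypothetical).
sentences = [
    "The house is big .",
    "the house is old .",
    "House prices are high .",
]


def vocabulary(sents, lowercase=False):
    """Count word types in whitespace-tokenized sentences."""
    counts = Counter()
    for sent in sents:
        if lowercase:
            sent = sent.lower()
        counts.update(sent.split())
    return counts


vocab_cased = vocabulary(sentences)
vocab_lower = vocabulary(sentences, lowercase=True)

print(len(vocab_cased), len(vocab_lower))
print(vocab_cased["house"], vocab_cased["House"], vocab_lower["house"])
```

With casing preserved there are 11 distinct types; after lowercasing there are 9, because `The`/`the` and `House`/`house` collapse into single entries.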

### 2.10.2 Tokenisation vs. Subword Segmentation
We used Moses tokenization in this chapter because it:
* makes preprocessing steps explicit and transparent,
* provides language-aware normalization,
* is easy to inspect and understand for educational purposes.

In current neural MT practice, however, explicit word-level tokenization is often replaced—or
complemented—by subword segmentation methods such as Byte Pair Encoding (BPE) or SentencePiece
(unigram or BPE models). Subwording addresses the open-vocabulary problem by
decomposing rare or unseen words into smaller units, allowing models to generalize better to new
word forms.

When subwording is used, the pipeline typically changes as follows:
* minimal normalization is applied (often just cleaning and optional lowercasing),
* Moses tokenization may be skipped entirely,
* a shared or language-specific subword model is learned on the training data and applied
consistently to train, dev, and test sets.

In such setups, subword segmentation implicitly performs much of the work that tokenization used
to handle, including punctuation separation and morphological variation. As a result, modern
MT systems often operate directly on subword units rather than on word tokens.
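
To make the idea of subword segmentation more concrete, the following is a minimal sketch of how BPE merge operations can be learned from word frequencies. It is a simplified teaching example, not the implementation used by any particular toolkit; the function name `learn_bpe`, the end-of-word marker `</w>`, and the toy word frequencies are assumptions.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair (ties: first one encountered).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy word frequencies (hypothetical).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(word_freqs, num_merges=3)

print(merges)
for symbols, freq in segmented.items():
    print(" ".join(symbols), freq)
```

After three merges the frequent ending `est</w>` has become a single unit, so `newest` is represented as `n e w est</w>`. In practice, thousands of merges are learned on the training data and then applied unchanged to the dev and test sets.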

## 2.11 Exercise: Process your own dataset
For a language pair you are familiar with, perform the following steps:
* Select a dataset from OPUS.
* Check the data manually: does it contain full sentences?
* Check the size of the dataset:
  * if it is smaller than Tatoeba, it may be too small;
  * if it is bigger than a few million sentences per language, it may be too big for practical processing: it may not fit into the memory of your Google Colab session, or it may take a very long time to process. On the other hand, a larger dataset may lead to better translations once your model is trained on it.
* Process it with the steps provided in this Colab notebook.
* Write the final version to your Google Drive.
* Write a short report on your dataset processing, containing the raw dataset size, the cleaned dataset size, and the exact cleaning procedure that you applied.