<h1 align="center"><strong>Machine translation</strong></h1>

Machine translation (MT) is the study of how to use computers to translate text from one language into another. Methodologically, MT approaches fall mainly into two categories: rule-based methods and corpus-based methods.

In this short notebook, a dataset containing Japanese and English text is loaded and prepared for a machine translation task.  
The preparation is limited to extracting the sentence pairs and splitting them into a training set and a test set.

# Import 

In [29]:
import kagglehub
from pathlib import Path
import pandas as pd
import xml.etree.ElementTree as ET
from pprint import pprint
from collections import Counter
from sklearn.model_selection import train_test_split 


# Data

In [35]:
# Download latest version
root=Path(kagglehub.dataset_download("team-ai/japaneseenglish-bilingual-corpus"))
print(root)


C:\Users\laran\.cache\kagglehub\datasets\team-ai\japaneseenglish-bilingual-corpus\versions\3


In [36]:
files=sorted(p for p in root.rglob("*") if p.is_file())
pprint(files[:10])

[WindowsPath('C:/Users/laran/.cache/kagglehub/datasets/team-ai/japaneseenglish-bilingual-corpus/versions/3/BDS00389.xml'),
 WindowsPath('C:/Users/laran/.cache/kagglehub/datasets/team-ai/japaneseenglish-bilingual-corpus/versions/3/fonts-japanese-gothic.ttf'),
 WindowsPath('C:/Users/laran/.cache/kagglehub/datasets/team-ai/japaneseenglish-bilingual-corpus/versions/3/kyoto_lexicon.csv'),
 WindowsPath('C:/Users/laran/.cache/kagglehub/datasets/team-ai/japaneseenglish-bilingual-corpus/versions/3/readme.pdf'),
 WindowsPath('C:/Users/laran/.cache/kagglehub/datasets/team-ai/japaneseenglish-bilingual-corpus/versions/3/wiki_corpus_2.01/BDS/BDS00001.xml'),
 WindowsPath('C:/Users/laran/.cache/kagglehub/datasets/team-ai/japaneseenglish-bilingual-corpus/versions/3/wiki_corpus_2.01/BDS/BDS00002.xml'),
 WindowsPath('C:/Users/laran/.cache/kagglehub/datasets/team-ai/japaneseenglish-bilingual-corpus/versions/3/wiki_corpus_2.01/BDS/BDS00003.xml'),
 WindowsPath('C:/Users/laran/.cache/kagglehub/datasets/team-

Most of the files are XML. To use the parallel English/Japanese text, the XML files must be parsed first.
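A quick way to back up the observation about the file types is to count file suffixes. The sketch below does this on a small temporary directory whose file names are made up for illustration; the helper `count_suffixes` is hypothetical, and in the notebook the same counting could be applied to the `files` list built above.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_suffixes(paths):
    """Count files per lowercase suffix, e.g. '.xml' or '.csv'."""
    return Counter(p.suffix.lower() for p in paths)

# Hypothetical mini-directory that loosely mimics the layout of the dataset
with tempfile.TemporaryDirectory() as tmp:
    tmp_root = Path(tmp)
    (tmp_root / "corpus" / "BDS").mkdir(parents=True)
    for name in ["readme.pdf", "lexicon.csv",
                 "corpus/BDS/BDS00001.xml", "corpus/BDS/BDS00002.xml"]:
        (tmp_root / name).write_text("placeholder", encoding="utf-8")

    demo_files = sorted(p for p in tmp_root.rglob("*") if p.is_file())
    suffix_counts = count_suffixes(demo_files)

print(suffix_counts.most_common())
```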

### 1) Check the tags

In [None]:
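base = root / "wiki_corpus_2.01" / "BDS"     # folder holding the BDS*.xml articles
xml_file = sorted(base.glob("BDS*.xml"))[0]  # first article, used as a sample
# Note: from here on `root` is the parsed XML root element, no longer the dataset path
root = ET.parse(xml_file).getroot()

for el in root.iter():
    print(el.tag, el.attrib)
    # Stop at the first element that has children (the root itself),
    # so only the top-level tag is printed
    if len(list(el)) > 0:
        break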


art {'orl': 'ja', 'trl': 'en'}


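The root element `<art>` carries the attributes `orl="ja"` and `trl="en"`, where 'ja' stands for Japanese and 'en' for English (presumably the original language and the translation language).  
The structure is explored further below in order to extract the Japanese-English pairs properly.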

### 2) Inspect the structure

In [20]:
assert root.tag.endswith("art") and root.attrib.get("orl")=="ja" and root.attrib.get("trl")=="en"

# List first-level children under <art>
lvl1 = [c.tag for c in root]

print("Level-1 child tags:", Counter(lvl1))

# Peek deeper: for each distinct lvl1 tag, show its distinct children and sample text
def strip_ns(tag): 
    return tag.split("}",1)[1] if "}" in tag else tag

seen = set()
for c in root:
    t = strip_ns(c.tag)
    if t in seen: 
        continue
    seen.add(t)
    sub = [strip_ns(x.tag) for x in list(c)]
    print(f"\n<{t}> children:", Counter(sub))
    # print a couple of leaf texts
    for leaf in c.iter():
        if len(list(leaf))==0 and (leaf.text or "").strip():
            txt = leaf.text.strip().replace("\n"," ")[:120]
            print(f"  sample leaf <{strip_ns(leaf.tag)}>: {txt}")
            break


Level-1 child tags: Counter({'sec': 6, 'par': 5, 'inf': 1, 'tit': 1, 'copyright': 1})

<inf> children: Counter()
  sample leaf <inf>: jawiki-20080607-pages-articles.xml

<tit> children: Counter({'e': 3, 'cmt': 3, 'j': 1})
  sample leaf <j>: 雪舟

<par> children: Counter({'sen': 2})
  sample leaf <j>: 雪舟（せっしゅう、1420年（応永27年） - 1506年（永正3年））は号で、15世紀後半室町時代に活躍した水墨画家・禅僧で、画聖とも称えられる。

<sec> children: Counter({'par': 3, 'tit': 1})
  sample leaf <j>: 生涯

<copyright> children: Counter()
  sample leaf <copyright>: copyright (c) 2010 Avanzare(id:34657), Kanejan(id:78613), Tommy6(id:51773), Nnh(id:474), Suguri F(id:11127), FREEZA(id:6


**Interpretation**:
- The root is `<art orl="ja" trl="en">`.
- Its first-level children are `sec`, `par`, `tit`, `inf` and `copyright`.
- The language tags are `<j>` for Japanese and `<e>` for English.


With this information we can build the extractor.
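Before running on the real files, the structure found above can be tried out on a tiny hand-written XML fragment. This is a minimal sketch: the fragment and the helper `extract_pairs` are hypothetical and only mirror the tag names observed in the sample file (`art`, `tit`, `par`, `sen`, `j`, `e`).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that imitates the observed layout; not taken from the corpus
SAMPLE = """
<art orl="ja" trl="en">
  <tit><j>雪舟</j><e>Sesshu (draft)</e><e>Sesshu</e></tit>
  <par>
    <sen><j>日本の水墨画を一変させた。</j><e>He revolutionized the Japanese ink painting.</e></sen>
    <sen><j>英訳のない文。</j></sen>
  </par>
</art>
"""

def extract_pairs(xml_text):
    """Return (ja, en) tuples for every node that has both a <j> and an <e> child."""
    pairs = []
    for node in ET.fromstring(xml_text).iter():
        ja = node.find("j")               # first direct <j> child, or None
        en_versions = node.findall("e")   # all direct <e> children
        if ja is None or not en_versions:
            continue
        ja_text = (ja.text or "").strip()
        en_text = (en_versions[-1].text or "").strip()  # keep the last <e> version
        if ja_text and en_text:
            pairs.append((ja_text, en_text))
    return pairs

demo_pairs = extract_pairs(SAMPLE)
for ja_text, en_text in demo_pairs:
    print(ja_text, "->", en_text)
```

Keeping the last `<e>` when several are present is a choice made for this sketch; whether the last version is the preferred translation in the real corpus should be checked against the dataset's readme.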

### 3) Build the extractor and the paired dataframe

In [None]:
# `base` (the BDS folder) was defined in step 1 and is reused here;
# `root` now holds a parsed XML element, so it cannot be used to rebuild the path.

TAG_PAIR      = None         # No wrapper available
TAG_ORIGINAL  = "j"          # japanese tag
TAG_TRANSL    = "e"          # english tag

def strip_ns(tag):
    return tag.split("}",1)[1] if "}" in tag else tag

def text_or_none(el):
    return (el.text or "").strip() if el is not None and el.text else None

pairs = []
for xf in sorted(base.glob("BDS*.xml")):
    root = ET.parse(xf).getroot()

    if TAG_PAIR:
        # Option 1: a wrapper tag encloses each pair (not our case, since TAG_PAIR is None)
        for node in root.findall(f".//{TAG_PAIR}"):
            ja = text_or_none(node.find(TAG_ORIGINAL))
            en = text_or_none(node.find(TAG_TRANSL))
            if ja and en:
                pairs.append((ja, en))
    else:
        # Option 2: no explicit wrapper (our case), so look for any node with both a <j> and an <e> child
        for node in root.iter():
            children = list(node)
            if not children: 
                continue
            # Map child tag -> element; if a tag repeats (e.g. several <e> children), the last one wins
            tagmap = {strip_ns(c.tag).lower(): c for c in children}
            if TAG_ORIGINAL.lower() in tagmap and TAG_TRANSL.lower() in tagmap:
                ja = text_or_none(tagmap[TAG_ORIGINAL.lower()])
                en = text_or_none(tagmap[TAG_TRANSL.lower()])
                if ja and en:
                    pairs.append((ja, en))

df = pd.DataFrame(pairs, columns=["ja", "en"]).dropna()   # dataframe and cleaning


### 4) Inspect the newly created dataframe

Does it make sense? Were the tags extracted correctly?

In [27]:
df.head(5)

|   | ja | en |
|---|---|---|
| 0 | 雪舟 | Sesshu |
| 1 | 雪舟（せっしゅう、1420年（応永27年） - 1506年（永正3年））は号で、15世紀後半... | Known as Sesshu (1420 - 1506), he was an ink p... |
| 2 | 日本の水墨画を一変させた。 | He revolutionized the Japanese ink painting. |
| 3 | 諱は「等楊（とうよう）」、もしくは「拙宗（せっしゅう）」と号した。 | He was given the posthumous name "Toyo" or "Se... |
| 4 | 備中国に生まれ、京都・相国寺に入ってから周防国に移る。 | Born in Bicchu Province, he moved to Suo Provi... |


In [28]:
print("Total pairs:", len(df))

Total pairs: 28384
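Beyond looking at the first rows, a few automatic checks can flag suspicious pairs. The following is a minimal sketch on a hypothetical toy list of `(ja, en)` tuples; the helper `check_pairs` and the length-ratio threshold are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter

# Hypothetical (ja, en) pairs; in the notebook they would come from the `pairs` list
toy_pairs = [
    ("雪舟", "Sesshu"),
    ("雪舟", "Sesshu"),   # exact duplicate
    ("日本の水墨画を一変させた。", "He revolutionized the Japanese ink painting."),
    ("生涯", "   "),      # English side is blank
]

def check_pairs(pairs, max_ratio=10.0):
    """Return simple quality counts for a list of (source, target) pairs."""
    counts = Counter(pairs)
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    blank = sum(1 for ja, en in pairs if not ja.strip() or not en.strip())
    # Flag pairs whose character lengths differ by more than `max_ratio`
    # (an arbitrary threshold chosen for this sketch)
    lopsided = sum(
        1
        for ja, en in pairs
        if ja.strip() and en.strip()
        and max(len(ja), len(en)) / min(len(ja), len(en)) > max_ratio
    )
    return {"total": len(pairs), "duplicates": duplicates,
            "blank": blank, "lopsided": lopsided}

report = check_pairs(toy_pairs)
print(report)
```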


### 5) Prepare the data for machine translation

In [30]:
X = df['ja']   # original 
y = df['en']   # translated

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [33]:
# checks
print("----Overview----")
print()
print(f"Training set size: {len(X_train)} entries")
print(f"Testing set size: {len(X_test)} entries")

----Overview----

Training set size: 22707 entries
Testing set size: 5677 entries
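The two sizes follow from applying `test_size=0.2` to the 28384 pairs. To the best of our understanding, scikit-learn rounds the test fraction up and gives the remainder to the training set; the sketch below reproduces that arithmetic with the standard library only (the helper `split_sizes` is hypothetical, not a scikit-learn function).

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of a fractional split: test size rounded up, training set gets the rest."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(28384, 0.2)
print(n_train, n_test)  # 22707 5677
```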


## Vectorization & embedding

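Given the nature of the dataset (parallel sentences in which word order matters), a bag-of-words vectorization such as `CountVectorizer` would not be relevant, because it discards word order.  
Instead, the MT task will be approached with sequence tokenization and a seq2seq model.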
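To make this concrete, the sketch below first shows that word counts cannot tell apart two sentences that use the same words in a different order, then encodes sentences as padded integer sequences. It is only an illustration under stated assumptions: character-level tokens for Japanese (which is written without spaces between words), naive whitespace tokens for English, and hypothetical helper names (`build_vocab`, `encode`). A real pipeline would more likely rely on a dedicated tokenizer.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

def tokenize_ja(text):
    """Character-level tokens (assumption: no word segmenter is available)."""
    return [ch for ch in text if not ch.isspace()]

def tokenize_en(text):
    """Lowercased whitespace tokens (a deliberately naive choice)."""
    return text.lower().split()

def build_vocab(token_lists):
    """Map each token to an integer id; ids 0 and 1 are reserved for PAD and UNK."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, max_len):
    """Turn tokens into ids, truncated or right-padded to `max_len`."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

# 1) Word counts lose the order: both sentences yield identical counts
same_counts = Counter(tokenize_en("dog bites man")) == Counter(tokenize_en("man bites dog"))

# 2) Sequence encoding keeps the order
en_tokens = [tokenize_en(s) for s in ["dog bites man", "man bites dog"]]
en_vocab = build_vocab(en_tokens)
en_ids = [encode(t, en_vocab, max_len=5) for t in en_tokens]

ja_tokens = tokenize_ja("雪舟")
ja_ids = encode(ja_tokens, build_vocab([ja_tokens]), max_len=4)

print(same_counts)  # True
print(en_ids)       # [[2, 3, 4, 0, 0], [4, 3, 2, 0, 0]]
print(ja_ids)       # [2, 3, 0, 0]
```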