### Translation
We will fine-tune a Marian model pretrained to translate from English to Korean (since a lot of Hugging Face employees speak both of those languages) on the KDE4 dataset, a collection of localized files for the KDE apps. KDE is an international free software community that develops free and open-source software. The model we will use has been pretrained on a large corpus of Korean and English texts taken from the OPUS dataset, which itself contains the KDE4 dataset.

This work covers translation from English to Korean using a pretrained model checkpoint by Jörg Tiedemann, professor in the Department of Digital Humanities at the University of Helsinki.

![Jörg Tiedemann](https://researchportal.helsinki.fi/files-asset/56125518/Tiedemann.png?w=160&f=webp)

#### MarianMT
Models were originally trained by Jörg Tiedemann using the Marian C++ library, which supports fast training and translation.

Since Marian models are smaller than many other translation models available in the library, they can be useful for fine-tuning experiments and integration tests.

#### Multilingual Models

- All model names use the following format: Helsinki-NLP/opus-mt-{src}-{tgt}.
- If a model can output multiple languages, you should specify a language code by prepending the desired output language to the src_text.
- You can see a model's supported language codes in its model card, under target constituents, as in opus-mt-en-roa.
- Note that if a model is only multilingual on the source side, like Helsinki-NLP/opus-mt-roa-en, no language codes are required.
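
The naming and language-code rules above can be sketched with two small helpers. Both function names are hypothetical and exist only for illustration; the `>>code<<` token format follows the MarianMT documentation, and `fra` is just an example target code.

```python
from typing import List


def marian_model_name(src: str, tgt: str) -> str:
    """Build a checkpoint name in the Helsinki-NLP/opus-mt-{src}-{tgt} format."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def add_target_language(src_text: List[str], lang_code: str) -> List[str]:
    """Prepend a >>lang<< token to each source sentence (hypothetical helper)."""
    return [f">>{lang_code}<< {sentence}" for sentence in src_text]


print(marian_model_name("en", "roa"))
print(add_target_language(["This is a sentence in English."], "fra"))
```

For a model that is multilingual only on the source side, the second helper would simply not be called.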

In [None]:
from google.colab import drive
drive.mount('/gdrive',force_remount=True)

In [None]:
!pip install accelerate -U
# Restart the runtime after this installation.

In [None]:
!pip install transformers datasets evaluate sentencepiece

In [None]:
!pip install huggingface_hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

### OPUS
OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and the corpus is also delivered as an open content package. We used several tools to compile the current collection. All pre-processing is done automatically. No manual corrections have been carried out.
#### Sub-corpora
KDE4 dataset

In [1]:
from datasets import load_dataset
raw_ds = load_dataset("kde4",lang1="en",lang2="ko")

Found cached dataset kde4 (C:/Users/clee/.cache/huggingface/datasets/kde4/en-ko-lang1=en,lang2=ko/0.0.0/243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac)


In [2]:
raw_ds

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 76708
    })
})

In [3]:
raw_ds["train"][:20]

{'id': ['0',
  '1',
  '2',
  '3',
  '4',
  '5',
  '6',
  '7',
  '8',
  '9',
  '10',
  '11',
  '12',
  '13',
  '14',
  '15',
  '16',
  '17',
  '18',
  '19'],
 'translation': [{'en': 'An Introduction to & kde;', 'ko': '& kde; 소개'},
  {'en': 'The & kde; Team', 'ko': 'kde; 팀'},
  {'en': 'ROLES_OF_TRANSLATORS',
   'ko': 'Cedna sptcedna@ gmail. com KDE 4. 4 문서 번역 박 신조 peremen@ gmail. com 과거 문서 정리 및 번역'},
  {'en': 'The & kde; Team', 'ko': 'kde; 팀'},
  {'en': 'An introduction to the K Desktop Environment', 'ko': 'K 데스크톱 환경 소개'},
  {'en': 'Quick Start Guide to & kde;', 'ko': '& kde; 빠른 시작 가이드'},
  {'en': 'KDE', 'ko': 'KDE'},
  {'en': 'quick start', 'ko': '빠른 시작'},
  {'en': 'introduction', 'ko': '소개'},
  {'en': 'Introduction', 'ko': '소개'},
  {'en': 'This document is a brief introduction to the K Desktop Environment. It will familiarize you with some of the basic features of & kde;.',
   'ko': '이 문서는 K 데스크톱 환경을 간단히 소개합니다. 이 문서를 읽으면 & kde; 의 기본적인 기능에 익숙해질 것입니다.'},
  {'en': 'This guide is far from 

In [4]:
split_datasets = raw_ds["train"].train_test_split(train_size=0.9,seed=20)
split_datasets

Loading cached split indices for dataset at C:\Users\clee\.cache\huggingface\datasets\kde4\en-ko-lang1=en,lang2=ko\0.0.0\243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac\cache-7a2aba6b6520f180.arrow and C:\Users\clee\.cache\huggingface\datasets\kde4\en-ko-lang1=en,lang2=ko\0.0.0\243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac\cache-3e75064839895ede.arrow


DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 69037
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 7671
    })
})
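
As a quick check on the sizes above, the arithmetic below reproduces them from the 76,708 rows and `train_size=0.9`. This is only a sketch: rounding the training share down and giving the remainder to the held-out split is an assumed convention that happens to match the printed counts, not a statement of how `train_test_split` is implemented.

```python
total_rows = 76708   # num_rows of raw_ds["train"]
train_size = 0.9     # fraction passed to train_test_split

# Assumed rounding: floor the training share, give the rest to the held-out split.
n_train = int(total_rows * train_size)
n_held_out = total_rows - n_train

print(n_train, n_held_out)
```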

In [5]:
split_datasets["validation"] = split_datasets.pop("test")

In [6]:
split_datasets["train"][1]["translation"]

{'en': 'Please add the output filename (%f) to the command line.',
 'ko': '명령 라인에 출력될 파일 이름 (% f) 을( 를) 추가하십시오.'}

### Change of pretrained model
The intended model is Helsinki-NLP/opus-mt-en-ko, but it was not working at the time of writing, so it is replaced here
by circulus/kobart-trans-en-ko-v2, the only pretrained English-to-Korean model we could find online.
Its model card provides no description.

In [7]:
from transformers import AutoTokenizer
model_ckpt = "circulus/kobart-trans-en-ko-v2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [8]:
from transformers import pipeline
translator = pipeline("translation", model=model_ckpt)
translator("Default to expanded threads")

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.


[{'translation_text': '기본 확장된 실로'}]

In [9]:
translator(
    "I want to work on natiral language processing for translation."
)

Your input_length: 32 is bigger than 0.9 * max_length: 20. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)


[{'translation_text': '나는 번역을 위한 원어민 처리 작업을 하고 싶습니다'}]

In [14]:
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format.",max_length=400
)

[{'translation_text': 'OFX 수입자 플러그인을 사용하여 α1을 수입할 수 없음 · 이 파일은 올바른 형식이 아닙니다 ·'}]

In [15]:
en_sentence = split_datasets["train"][1]["translation"]["en"]
ko_sentence = split_datasets["train"][1]["translation"]["ko"]

inputs = tokenizer(en_sentence, text_target=ko_sentence)
inputs

{'input_ids': [15073, 16203, 296, 17223, 1700, 18914, 299, 21235, 1700, 17884, 25088, 23124, 17254, 17761, 15265, 17941, 300, 14338, 236, 301, 240, 27141, 21235, 1700, 15190, 20461, 19465, 25674, 29144, 245], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [19899, 14560, 18288, 14175, 10314, 9910, 29504, 14897, 14338, 236, 17254, 240, 15309, 239, 15715, 240, 14927, 13586, 20108, 245]}

In [16]:
wrong_targets = tokenizer(ko_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))
print(tokenizer.convert_ids_to_tokens(inputs["labels"]))

['▁명령', '▁라', '인에', '▁출', '력', '될', '▁파일', '▁이름', '▁(', '%', '▁f', ')', '▁을', '(', '▁를', ')', '▁추가', '하', '십시오', '.']
['▁명령', '▁라', '인에', '▁출', '력', '될', '▁파일', '▁이름', '▁(', '%', '▁f', ')', '▁을', '(', '▁를', ')', '▁추가', '하', '십시오', '.']


In [17]:
max_length = 128
# Open question: can max_length be made larger, and does its value affect GPU memory use?

def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["ko"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs
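
On the question raised in the comment above: a larger `max_length` can raise GPU memory use, because the self-attention score tensor grows with the square of the sequence length. Since the data collator pads each batch only to its longest sequence, `max_length` acts as an upper bound rather than a fixed cost. The back-of-the-envelope sketch below counts only the attention scores of a single layer; the head count of 16 and the batch size of 32 are illustrative assumptions, not values read from this model's configuration, and `attention_score_elements` is a hypothetical helper.

```python
def attention_score_elements(batch_size: int, num_heads: int, seq_len: int) -> int:
    """Number of entries in one layer's self-attention score tensor."""
    return batch_size * num_heads * seq_len * seq_len


# Doubling the sequence length quadruples the number of attention scores.
for seq_len in (128, 256, 512):
    n = attention_score_elements(batch_size=32, num_heads=16, seq_len=seq_len)
    print(seq_len, n)
```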

In [18]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

In [19]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.


In [20]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [22]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [23]:
batch["labels"]

tensor([[19899, 14560, 18288, 14175, 10314,  9910, 29504, 14897, 14338,   236,
         17254,   240, 15309,   239, 15715,   240, 14927, 13586, 20108,   245],
        [26755, 11973, 26052,   299, 23590, 15555,   315, 16203, 14338,   307,
         24508,   316, 19650, 20280,  -100,  -100,  -100,  -100,  -100,  -100]])

In [24]:
batch["decoder_input_ids"]

tensor([[    1, 19899, 14560, 18288, 14175, 10314,  9910, 29504, 14897, 14338,
           236, 17254,   240, 15309,   239, 15715,   240, 14927, 13586, 20108],
        [    1, 26755, 11973, 26052,   299, 23590, 15555,   315, 16203, 14338,
           307, 24508,   316, 19650, 20280,     3,     3,     3,     3,     3]])
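
The relationship between `labels` and `decoder_input_ids` in the two outputs above can be reproduced by hand: the decoder inputs are the labels shifted one position to the right, with a start token in front and `-100` replaced by the padding ID. The sketch below is plain Python rather than the collator's own code. The start-token ID `1` and padding ID `3` are read off the printed tensors and are assumptions about this tokenizer, and `shift_right` is a hypothetical helper name.

```python
from typing import List

IGNORE_INDEX = -100    # label value used to pad the labels
START_TOKEN_ID = 1     # assumed from the printed decoder_input_ids
PAD_TOKEN_ID = 3       # assumed from the printed decoder_input_ids


def shift_right(labels: List[int]) -> List[int]:
    """Prepend the start token, drop the last label, and replace -100 with padding."""
    shifted = [START_TOKEN_ID] + labels[:-1]
    return [PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in shifted]


# Second row of batch["labels"] from the output above.
labels_row = [26755, 11973, 26052, 299, 23590, 15555, 315, 16203, 14338, 307,
              24508, 316, 19650, 20280, -100, -100, -100, -100, -100, -100]
decoder_inputs = shift_right(labels_row)
print(decoder_inputs)
```

The result matches the second row of `batch["decoder_input_ids"]`: the 14 real tokens move one slot to the right and the remaining positions hold the padding ID.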

In [25]:
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])

[19899, 14560, 18288, 14175, 10314, 9910, 29504, 14897, 14338, 236, 17254, 240, 15309, 239, 15715, 240, 14927, 13586, 20108, 245]
[26755, 11973, 26052, 299, 23590, 15555, 315, 16203, 14338, 307, 24508, 316, 19650, 20280]


In [None]:
!pip install sacrebleu

In [26]:
import evaluate

metric = evaluate.load("sacrebleu")

In [27]:
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}

In [28]:
predictions = ["This This This This"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 1.683602693167689,
 'counts': [1, 0, 0, 0],
 'totals': [4, 3, 2, 1],
 'precisions': [25.0, 16.666666666666668, 12.5, 12.5],
 'bp': 0.10539922456186433,
 'sys_len': 4,
 'ref_len': 13}
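
The two results above show how the pieces of a BLEU score fit together: the score is the geometric mean of the four n-gram precisions multiplied by a brevity penalty (`bp`) that punishes outputs shorter than the reference. The sketch below recomputes both scores from the reported statistics. It takes the precisions as printed, including the smoothed values for the zero counts in the second example, and does not reimplement sacreBLEU's tokenization or smoothing; `brevity_penalty` and `bleu_from_stats` are hypothetical helpers.

```python
import math
from typing import List


def brevity_penalty(sys_len: int, ref_len: int) -> float:
    """Return 1.0 if the output is at least as long as the reference, else exp(1 - ref/sys)."""
    if sys_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / sys_len)


def bleu_from_stats(precisions: List[float], sys_len: int, ref_len: int) -> float:
    """Combine percentage n-gram precisions and lengths into a BLEU score (0-100)."""
    log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    return 100.0 * brevity_penalty(sys_len, ref_len) * math.exp(log_mean)


# Statistics copied from the two metric outputs above.
first = bleu_from_stats([11 / 12 * 100, 6 / 11 * 100, 40.0, 3 / 9 * 100], 12, 13)
second = bleu_from_stats([25.0, 100 / 6, 12.5, 12.5], 4, 13)
print(round(first, 2), round(second, 2))
```

In the second example the brevity penalty alone is about 0.105, which is why a four-word output scores so low even before the poor n-gram overlap is taken into account.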

In [29]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

In [30]:
# fp16 appears to be crucial for reducing GPU memory use; without it, the job fails on Colab.
# The other hyperparameters that affect GPU memory use are the per-device
# batch sizes for training and evaluation.
# Here they are set to 32 and 64, respectively.

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-ko",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)
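
To get a feel for what these settings imply, the sketch below estimates the number of optimizer steps from the training-set size, the batch size, and the epoch count used above. It assumes a single device and no gradient accumulation, neither of which this notebook states explicitly.

```python
import math

num_train_examples = 69037          # size of the training split
per_device_train_batch_size = 32
num_train_epochs = 3

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```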

In [31]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co/chunwoolee0/marian-finetuned-kde4-en-to-ko into local empty directory.


In [None]:
trainer.train()

In [None]:
trainer.evaluate(max_length=max_length)

In [None]:
trainer.push_to_hub(tags="translation", commit_message="Training complete")

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="chunwoolee0/circulus-kobart-en-to-ko")
translator("This course is produced by Hugging Face.")

In [None]:
translator("I ate the breakfast.")