## Download and Preprocess

In [1]:
from utils.dataset import JESC

JESC.info()
JESC.create_csv()

Webpage: https://nlp.stanford.edu/projects/jesc/
Paper  : https://arxiv.org/abs/1710.10639
Summary: Japanese-English Subtitle Corpus (2.8M sentences)
skipped: jesc.csv file already exists!


In [2]:
from utils.dataset import WikiCorpus

WikiCorpus.info()
WikiCorpus.create_csv()

Webpage : https://github.com/venali/BilingualCorpus/
Summary : a large scale corpus of manually translated Japanese sentences
          extracted from Wikipedia's Kyoto Articles (~500k sentences)
skipped: wiki_corpus.csv file already exists!


In [3]:
from utils.dataset import Tatoeba

Tatoeba.info()
Tatoeba.create_csv()

Webpage    : https://opus.nlpl.eu/Tatoeba.php
Webpage(HF): https://huggingface.co/datasets/tatoeba
Summary    : a collection of sentences from https://tatoeba.org/en/, contains
             over 400 languages ([en-ja] 200k sentences)
skipped: tatoeba.csv file already exists!


In [4]:
from utils.dataset import SnowSimplified

SnowSimplified.info()
SnowSimplified.create_csv()

Webpage: https://huggingface.co/datasets/snow_simplified_japanese_corpus
Summary: Japanese-English sentence pairs, all Japanese sentences have
         a simplified counterpart (85k(x2) sentences)
skipped: snow_simplified.csv file already exists!


In [5]:
from utils.dataset import MassiveTranslation

MassiveTranslation.info()
MassiveTranslation.create_csv()

Webpage: https://huggingface.co/datasets/Amani27/massive_translation_dataset
Summary: dataset derived from AmazonScience/MASSIVE for translation
         (16k sentences in 10 languages)
skipped: massive_translation.csv file already exists!


## Combining Datasets

In [6]:
from tokenizers import processors
from transformers import AutoTokenizer

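# translation direction: "ja" translates Japanese -> English, "en" the reverse
source_lng = "ja"

if source_lng == "en":
    target_lng = "ja"
    encoder = "bert-base-uncased"          # English encoder
    decoder = "rinna/japanese-gpt2-small"  # Japanese decoder
else:
    target_lng = "en"
    encoder = "cl-tohoku/bert-base-japanese-v3"  # Japanese encoder
    decoder = "gpt2"                             # English decoder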

encoder_tokenizer = AutoTokenizer.from_pretrained(encoder, use_fast=True)
decoder_tokenizer = AutoTokenizer.from_pretrained(decoder, use_fast=True)
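# some decoder tokenizers (e.g. "gpt2") define no pad token; reuse EOS for padding
if decoder_tokenizer.pad_token_id is None:
    decoder_tokenizer.pad_token_id = decoder_tokenizer.eos_token_id

# append the EOS token to every encoded sequence via the backend tokenizer's post-processor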
decoder_tokenizer._tokenizer.post_processor = processors.TemplateProcessing(
    single="$A " + decoder_tokenizer.eos_token,
    special_tokens=[(decoder_tokenizer.eos_token, decoder_tokenizer.eos_token_id)],
)
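To make the effect of the `$A <eos>` template concrete, here is a minimal pure-Python sketch of what the post-processor does to an encoded sequence. The helper `append_eos` is hypothetical and only mimics the behavior; the real work happens inside the `tokenizers` library. The input ids are illustrative values, not the result of an actual tokenizer call.

```python
from typing import List

GPT2_EOS_ID = 50256  # EOS token id of the "gpt2" tokenizer


def append_eos(token_ids: List[int], eos_id: int = GPT2_EOS_ID) -> List[int]:
    """Hypothetical helper: mimic the "$A <eos>" template by appending EOS."""
    return list(token_ids) + [eos_id]


# illustrative ids only, not produced by a real tokenizer call
encoded = [9099, 705, 83]
with_eos = append_eos(encoded)
print(with_eos)
```

Because the EOS id ends every target sequence, the decoder sees an explicit stop signal during training.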

In [7]:
from utils.dataset import EnJaDatasetMaker, EnJaDatasetSample

dataset = EnJaDatasetMaker.prepare_dataset(
    "ja-en-test-1",
    [
        # range bounds: lower is inclusive, upper is exclusive, so (0, 32) covers 0..31
        EnJaDatasetSample(SnowSimplified    ,  124, (0, 64)),
        EnJaDatasetSample(MassiveTranslation,   50, (0, 32)),
    ],
    source_language=source_lng,
    encoder_tokenizer=encoder_tokenizer,
    decoder_tokenizer=decoder_tokenizer,
    num_proc=6,
    seed=42
)

Map (num_proc=6):   0%|          | 0/100000 [00:00<?, ? examples/s]

Filter (num_proc=6):   0%|          | 0/100000 [00:00<?, ? examples/s]

sampling: 124 out of 100000


Map (num_proc=6):   0%|          | 0/2801388 [00:00<?, ? examples/s]

Filter (num_proc=6):   0%|          | 0/2801388 [00:00<?, ? examples/s]

sampling: using all data (27)



Saving the dataset (0/1 shards):   0%|          | 0/151 [00:00<?, ? examples/s]
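The log lines above suggest a filter step followed by sampling, with a fallback to "using all data" when fewer rows survive than were requested. The sketch below illustrates that pattern in plain Python. It is not the code of `EnJaDatasetMaker`: the function `filter_and_sample` is hypothetical, and treating the range tuple as a bound on token length is an assumption.

```python
import random
from typing import List, Sequence, Tuple


def filter_and_sample(
    lengths: Sequence[int],
    n_samples: int,
    length_range: Tuple[int, int],
    seed: int = 42,
) -> List[int]:
    """Hypothetical helper: keep rows with lower <= length < upper, then sample."""
    lower, upper = length_range
    kept = [i for i, n in enumerate(lengths) if lower <= n < upper]
    if len(kept) <= n_samples:
        # not enough rows to sample from, so return everything that passed
        print(f"sampling: using all data ({len(kept)})")
        return kept
    print(f"sampling: {n_samples} out of {len(kept)}")
    rng = random.Random(seed)  # seeded for reproducible selection
    return sorted(rng.sample(kept, n_samples))


lengths = [5, 12, 31, 32, 40, 8]                    # made-up token counts
picked = filter_and_sample(lengths, 2, (0, 32))     # samples 2 of 4 rows
all_rows = filter_and_sample(lengths, 10, (0, 32))  # falls back to all 4 rows
```

Fixing the seed makes repeated runs pick the same rows, which keeps a saved dataset reproducible.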

In [8]:
dataset = EnJaDatasetMaker.load_dataset("ja-en-test-1")
dataset

Dataset({
    features: ['target', 'source', 'length', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 151
})

In [10]:
dataset[0]

{'target': "don 't rely too much on others .",
 'source': 'あまり他人には頼ってはいけない。',
 'length': tensor(12),
 'input_ids': tensor([[    2, 13903, 17606,   461,   465, 26960,   456,   465, 16562, 12494,
            385,     3]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'labels': tensor([[ 9099,   705,    83,  8814,  1165,   881,   319,  1854,   764, 50256]])}
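The record above can be sanity-checked with a few structural assertions. This sketch copies the values from the output into plain lists instead of tensors so that it runs without PyTorch. The helper `check_sample` is hypothetical, and reading `length` as the number of encoder input tokens is an assumption based on this single record.

```python
EOS_ID = 50256  # EOS token id of the "gpt2" tokenizer

# values copied from the dataset[0] output, as plain lists
sample = {
    "length": 12,
    "input_ids": [[2, 13903, 17606, 461, 465, 26960, 456, 465, 16562, 12494, 385, 3]],
    "attention_mask": [[1] * 12],
    "labels": [[9099, 705, 83, 8814, 1165, 881, 319, 1854, 764, 50256]],
}


def check_sample(row, eos_id=EOS_ID):
    """Hypothetical helper: report basic consistency checks for one record."""
    ids = row["input_ids"][0]
    mask = row["attention_mask"][0]
    labels = row["labels"][0]
    return {
        "length_matches_input": row["length"] == len(ids),
        "mask_matches_input": len(mask) == len(ids),
        "labels_end_with_eos": labels[-1] == eos_id,
    }


checks = check_sample(sample)
print(checks)
```

Running the same checks over a handful of rows is a cheap way to confirm that the EOS post-processor was applied before training starts.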