## Download and Preprocess

In [1]:
from utils.dataset import JESCDataset

JESCDataset.info()
JESCDataset.create_csv()

Webpage: https://nlp.stanford.edu/projects/jesc/
Paper  : https://arxiv.org/abs/1710.10639
Summary: Japanese-English Subtitle Corpus (2.8M sentences)
skipped: jesc.csv file already exists!


In [2]:
from utils.dataset import WikiCorpusDataset

WikiCorpusDataset.info()
WikiCorpusDataset.create_csv()

Webpage : https://github.com/venali/BilingualCorpus/
Summary : a large scale corpus of manually translated Japanese sentences
          extracted from Wikipedia's Kyoto Articles (~500k sentences)
skipped: wiki_corpus.csv file already exists!


In [3]:
from utils.dataset import TatoebaDataset

TatoebaDataset.info()
TatoebaDataset.create_csv()

Webpage    : https://opus.nlpl.eu/Tatoeba.php
Webpage(HF): https://huggingface.co/datasets/tatoeba
Summary    : a collection of sentences from https://tatoeba.org/en/, contains
             over 400 languages ([en-ja] 200k sentences)
skipped: tatoeba.csv file already exists!


In [4]:
from utils.dataset import SnowSimplifiedDataset

SnowSimplifiedDataset.info()
SnowSimplifiedDataset.create_csv()

Webpage: https://huggingface.co/datasets/snow_simplified_japanese_corpus
Summary: Japanese-English sentence pairs, all Japanese sentences have
         a simplified counterpart (85k(x2) sentences)
skipped: snow_simplified.csv file already exists!


In [5]:
from utils.dataset import MassiveTranslationDataset

MassiveTranslationDataset.info()
MassiveTranslationDataset.create_csv()

Webpage: https://huggingface.co/datasets/Amani27/massive_translation_dataset
Summary: dataset derived from AmazonScience/MASSIVE for translation
         (16k sentences in 10 languages)
skipped: massive_translation.csv file already exists!
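
Every `create_csv()` call above prints `skipped: ... file already exists!`, which suggests the conversion is skipped when the target CSV is already on disk. The implementation in `utils.dataset` is not shown here, so the following is only a minimal sketch of that idempotent pattern. The helper name `create_csv_if_missing` and the `ja`/`en` column names are hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterable, Tuple


def create_csv_if_missing(path: Path, pairs: Iterable[Tuple[str, str]]) -> bool:
    """Write (ja, en) sentence pairs to `path` unless the file already exists.

    Hypothetical helper, not the notebook's actual `create_csv()`.
    Returns True if the file was written, False if the write was skipped.
    """
    if path.exists():
        print(f"skipped: {path.name} file already exists!")
        return False

    # Create the parent directory on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["ja", "en"])  # assumed column names
        writer.writerows(pairs)
    return True
```

With this pattern, re-running the cell is cheap: the download and conversion only happen on the first run, and later runs just report the skip.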


## Dataset Statistics

In [6]:
from transformers import AutoTokenizer

ja_en = (
    AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3", use_fast=True),
    AutoTokenizer.from_pretrained("gpt2", use_fast=True)
)
en_ja = (
    AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True),
    AutoTokenizer.from_pretrained("rinna/japanese-gpt2-small", use_fast=True)
)

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


In [12]:
JESCDataset.stats(*ja_en, num_proc=8)
JESCDataset.stats(*en_ja, num_proc=8)

Showing statistic for 2,801,388 sentences:
	en[tokens] : Avg. 18.85 | Min.     3 | Max.   157 | >32. 204,645 | >64.  6990 | >128.    17 | >256.     0
	ja[tokens] : Avg. 19.91 | Min.     1 | Max.   199 | >32. 328,914 | >64. 11251 | >128.    29 | >256.     0


Showing statistic for 2,801,388 sentences:
	en[tokens] : Avg. 11.52 | Min.     3 | Max.    92 | >32.  12,634 | >64.    60 | >128.     0 | >256.     0
	ja[tokens] : Avg. 10.09 | Min.     2 | Max.    94 | >32.   4,131 | >64.    14 | >128.     0 | >256.     0
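
The columns in these reports (average, minimum, maximum, and the number of sentences longer than 32, 64, 128, and 256 tokens) can be reproduced from a list of per-sentence token counts. The code behind `stats()` is not shown, so this is a sketch under assumptions: `token_lengths` and `length_stats` are hypothetical names, and whether the notebook counts special tokens is unknown. The counts use a strict `>` comparison, which matches rows such as `Max. 64 | >64. 0` in the reports.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def token_lengths(texts: Iterable[str], tokenize: Callable[[str], Sequence]) -> List[int]:
    """Return the number of tokens in each text for a given tokenize function."""
    return [len(tokenize(text)) for text in texts]


def length_stats(
    lengths: Sequence[int],
    thresholds: Sequence[int] = (32, 64, 128, 256),
) -> Dict[str, float]:
    """Summarize token counts: average, min, max, and counts above each threshold."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    stats: Dict[str, float] = {
        "avg": sum(lengths) / len(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }
    for t in thresholds:
        # Strictly greater than the threshold.
        stats[f">{t}"] = sum(1 for n in lengths if n > t)
    return stats


# Whitespace splitting stands in for a real tokenizer in this example.
example = length_stats(token_lengths(["a b c", "d e"], str.split))
```

With a Hugging Face tokenizer, `tokenizer.tokenize` could be passed as the `tokenize` argument in place of `str.split`.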


In [8]:
WikiCorpusDataset.stats(*ja_en, num_proc=8)
WikiCorpusDataset.stats(*en_ja, num_proc=8)

Showing statistic for 457,687 sentences:
	en[tokens] : Avg. 58.88 | Min.     3 | Max.   906 | >32. 318,507 | >64. 164503 | >128. 30829 | >256.  1257
	ja[tokens] : Avg. 60.15 | Min.     1 | Max.  1253 | >32. 318,059 | >64. 174246 | >128. 34057 | >256.  1330


Showing statistic for 457,687 sentences:
	en[tokens] : Avg. 34.02 | Min.     3 | Max.   826 | >32. 197,378 | >64. 46236 | >128.  2747 | >256.    81
	ja[tokens] : Avg. 23.24 | Min.     2 | Max.   654 | >32. 102,738 | >64.  9818 | >128.   253 | >256.    12


In [9]:
TatoebaDataset.stats(*ja_en, num_proc=8)
TatoebaDataset.stats(*en_ja, num_proc=8)

Showing statistic for 208,861 sentences:
	en[tokens] : Avg. 19.47 | Min.     4 | Max.   232 | >32.  11,217 | >64.   722 | >128.    42 | >256.     0
	ja[tokens] : Avg. 25.32 | Min.     3 | Max.   420 | >32.  37,387 | >64.  2081 | >128.   138 | >256.    10


Showing statistic for 208,861 sentences:
	en[tokens] : Avg. 11.56 | Min.     4 | Max.   126 | >32.   1,035 | >64.    48 | >128.     0 | >256.     0
	ja[tokens] : Avg. 11.07 | Min.     3 | Max.   153 | >32.     703 | >64.    46 | >128.     2 | >256.     0


In [10]:
SnowSimplifiedDataset.stats(*ja_en, num_proc=2)
SnowSimplifiedDataset.stats(*en_ja, num_proc=2)

Showing statistic for 100,000 sentences:
	en[tokens] : Avg. 16.32 | Min.     7 | Max.    35 | >32.      30 | >64.     0 | >128.     0 | >256.     0
	ja[tokens] : Avg. 21.12 | Min.     5 | Max.    62 | >32.   2,502 | >64.     0 | >128.     0 | >256.     0


Showing statistic for 100,000 sentences:
	en[tokens] : Avg. 10.03 | Min.     6 | Max.    21 | >32.       0 | >64.     0 | >128.     0 | >256.     0
	ja[tokens] : Avg.  9.33 | Min.     4 | Max.    21 | >32.       0 | >64.     0 | >128.     0 | >256.     0


In [11]:
MassiveTranslationDataset.stats(*ja_en, num_proc=2)
MassiveTranslationDataset.stats(*en_ja, num_proc=2)

Showing statistic for 16,521 sentences:
	en[tokens] : Avg. 17.02 | Min.     3 | Max.   155 | >32.     528 | >64.     7 | >128.     1 | >256.     0
	ja[tokens] : Avg. 20.81 | Min.     1 | Max.   192 | >32.   1,773 | >64.    20 | >128.     1 | >256.     0


Showing statistic for 16,521 sentences:
	en[tokens] : Avg.  9.48 | Min.     3 | Max.    64 | >32.       4 | >64.     0 | >128.     0 | >256.     0
	ja[tokens] : Avg.  9.27 | Min.     2 | Max.    51 | >32.       4 | >64.     0 | >128.     0 | >256.     0
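
Because the corpora differ in size by two orders of magnitude, the raw `>32` to `>256` counts are easier to compare as fractions of each corpus. The sketch below does that arithmetic for two rows copied from the reports above. `share_above` is a hypothetical helper, not part of `utils.dataset`.

```python
def share_above(count_above: int, total: int) -> float:
    """Fraction of sentences whose token count exceeds a threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count_above / total


# (label, sentences above 32 tokens, total sentences), taken from the
# en[tokens] row of the first stats call for each dataset.
rows = [
    ("JESC en >32", 204_645, 2_801_388),
    ("WikiCorpus en >32", 318_507, 457_687),
]

for label, above, total in rows:
    print(f"{label}: {share_above(above, total):.2%}")
```

Read this way, a length cutoff that drops only a small share of one corpus may drop most of another, so it is worth checking the share per dataset before fixing a single maximum sequence length.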
