## Download and Preprocess

In [1]:
from utils.dataset import JESCDataset

JESCDataset.info()
JESCDataset.create_csv()

Webpage: https://nlp.stanford.edu/projects/jesc/
Paper  : https://arxiv.org/abs/1710.10639
Summary: Japanese-English Subtitle Corpus (2.8M sentences)
skipped: jesc.csv file already exists!


In [2]:
from utils.dataset import WikiCorpusDataset

WikiCorpusDataset.info()
WikiCorpusDataset.create_csv()

Webpage : https://github.com/venali/BilingualCorpus/
Summary : a large scale corpus of manually translated Japanese sentences
          extracted from Wikipedia's Kyoto Articles (~500k sentences)
skipped: wiki_corpus.csv file already exists!


In [3]:
from utils.dataset import TatoebaDataset

TatoebaDataset.info()
TatoebaDataset.create_csv()

Webpage    : https://opus.nlpl.eu/Tatoeba.php
Webpage(HF): https://huggingface.co/datasets/tatoeba
Summary    : a collection of sentences from https://tatoeba.org/en/, contains
             over 400 languages ([en-ja] 200k sentences)
skipped: tatoeba.csv file already exists!


In [4]:
from utils.dataset import SnowSimplifiedDataset

SnowSimplifiedDataset.info()
SnowSimplifiedDataset.create_csv()

Webpage: https://huggingface.co/datasets/snow_simplified_japanese_corpus
Summary: Japanese-English sentence pairs, all Japanese sentences have
         a simplified counterpart (85k(x2) sentences)
skipped: snow_simplified.csv file already exists!


In [5]:
from utils.dataset import MassiveTranslationDataset

MassiveTranslationDataset.info()
MassiveTranslationDataset.create_csv()

Webpage: https://huggingface.co/datasets/Amani27/massive_translation_dataset
Summary: dataset derived from AmazonScience/MASSIVE for translation
         (16k sentences in 10 languages)
skipped: massive_translation.csv file already exists!
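
Every `create_csv()` call above prints `skipped: ... file already exists!`, which suggests the preprocessing step is skipped when its output CSV is already on disk. The internals of `utils.dataset` are not shown here, so the following is only a minimal sketch of that pattern. The `create_csv` helper, its signature, and the `en`/`ja` column names are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def create_csv(path, rows, header=("en", "ja")):
    """Write rows to a CSV file unless the file is already there.

    Hypothetical helper: the name, signature, and column names are
    assumptions for illustration, not the code in utils.dataset.
    """
    path = Path(path)
    if path.exists():
        # Same message shape as the notebook output above.
        print(f"skipped: {path.name} file already exists!")
        return False
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return True


# Usage: the first call writes the file, the second one is skipped.
demo_path = Path(tempfile.mkdtemp()) / "demo.csv"
pairs = [("Hello.", "こんにちは。"), ("Thank you.", "ありがとう。")]
first = create_csv(demo_path, pairs)
second = create_csv(demo_path, pairs)
```

Checking for the file before doing any work makes the cell safe to re-run, at the cost of never refreshing a stale CSV unless it is deleted by hand.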


## Dataset Statistics

In [6]:
from transformers import AutoTokenizer

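# Tokenizer pair for Japanese -> English (as the name suggests):
# Japanese BERT tokenizer first, GPT-2 tokenizer second.
ja_en = (
    AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3", use_fast=True),
    AutoTokenizer.from_pretrained("gpt2", use_fast=True)
)
# Tokenizer pair for English -> Japanese (as the name suggests):
# English BERT tokenizer first, Japanese GPT-2 tokenizer second.
en_ja = (
    AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True),
    AutoTokenizer.from_pretrained("rinna/japanese-gpt2-small", use_fast=True)
)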

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


In [7]:
JESCDataset.stats(*ja_en, num_proc=8)
JESCDataset.stats(*en_ja, num_proc=8)

Showing statistic for 2,801,388 sentences:
	en[tokens] : Avg. 18.85 | Min.     3 | Max.   157 | >16. 1,492,905 | >32. 204645 | >64.  6990 | >128.    17
	ja[tokens] : Avg. 19.91 | Min.     1 | Max.   199 | >16. 1,573,742 | >32. 328914 | >64. 11251 | >128.    29
Showing statistic for 2,801,388 sentences:
	en[tokens] : Avg. 11.52 | Min.     3 | Max.    92 | >16. 386,317 | >32. 12634 | >64.    60 | >128.     0
	ja[tokens] : Avg. 10.09 | Min.     2 | Max.    94 | >16. 241,813 | >32.  4131 | >64.    14 | >128.     0
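
The implementation of `stats()` is not shown in this notebook. As a rough sketch, the figures in each row (average, minimum, maximum, and the number of entries longer than 16, 32, 64, and 128 tokens) could be computed from a list of per-sentence token counts as below. The function names are hypothetical, and the strict `>` comparison is an assumption, although it agrees with rows such as `Max. 64` paired with `>64. 0`.

```python
def length_stats(lengths, thresholds=(16, 32, 64, 128)):
    """Summarise token counts: average, min, max, and counts above each threshold."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return {
        "avg": sum(lengths) / len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        # Assumption: ">16." means strictly more than 16 tokens.
        "over": {t: sum(1 for n in lengths if n > t) for t in thresholds},
    }


def format_stats(name, stats):
    """Render one row in a layout similar to the output above."""
    over = " | ".join(f">{t}. {c:,}" for t, c in stats["over"].items())
    return (
        f"{name}[tokens] : Avg. {stats['avg']:.2f} | "
        f"Min. {stats['min']} | Max. {stats['max']} | {over}"
    )


# Toy token counts, not taken from any of the datasets.
toy_lengths = [3, 10, 17, 33, 40, 130]
toy_stats = length_stats(toy_lengths)
print(format_stats("en", toy_stats))
```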


In [8]:
WikiCorpusDataset.stats(*ja_en, num_proc=8)
WikiCorpusDataset.stats(*en_ja, num_proc=8)

Showing statistic for 457,687 sentences:
	en[tokens] : Avg. 58.88 | Min.     3 | Max.   906 | >16. 399,174 | >32. 318507 | >64. 164503 | >128. 30829
	ja[tokens] : Avg. 60.15 | Min.     1 | Max.  1253 | >16. 389,866 | >32. 318059 | >64. 174246 | >128. 34057
Showing statistic for 457,687 sentences:
	en[tokens] : Avg. 34.02 | Min.     3 | Max.   826 | >16. 345,465 | >32. 197378 | >64. 46236 | >128.  2747
	ja[tokens] : Avg. 23.24 | Min.     2 | Max.   654 | >16. 276,755 | >32. 102738 | >64.  9818 | >128.   253


In [9]:
TatoebaDataset.stats(*ja_en, num_proc=8)
TatoebaDataset.stats(*en_ja, num_proc=8)

Showing statistic for 208,861 sentences:
	en[tokens] : Avg. 19.47 | Min.     4 | Max.   232 | >16. 126,111 | >32. 11217 | >64.   722 | >128.    42
	ja[tokens] : Avg. 25.32 | Min.     3 | Max.   420 | >16. 173,528 | >32. 37387 | >64.  2081 | >128.   138
Showing statistic for 208,861 sentences:
	en[tokens] : Avg. 11.56 | Min.     4 | Max.   126 | >16.  18,052 | >32.  1035 | >64.    48 | >128.     0
	ja[tokens] : Avg. 11.07 | Min.     3 | Max.   153 | >16.  15,203 | >32.   703 | >64.    46 | >128.     2


In [10]:
SnowSimplifiedDataset.stats(*ja_en, num_proc=2)
SnowSimplifiedDataset.stats(*en_ja, num_proc=2)

Showing statistic for 100,000 sentences:
	en[tokens] : Avg. 16.32 | Min.     7 | Max.    35 | >16.  44,654 | >32.    30 | >64.     0 | >128.     0
	ja[tokens] : Avg. 21.12 | Min.     5 | Max.    62 | >16.  79,057 | >32.  2502 | >64.     0 | >128.     0
Showing statistic for 100,000 sentences:
	en[tokens] : Avg. 10.03 | Min.     6 | Max.    21 | >16.     586 | >32.     0 | >64.     0 | >128.     0
	ja[tokens] : Avg.  9.33 | Min.     4 | Max.    21 | >16.     127 | >32.     0 | >64.     0 | >128.     0


In [11]:
MassiveTranslationDataset.stats(*ja_en, num_proc=2)
MassiveTranslationDataset.stats(*en_ja, num_proc=2)

Showing statistic for 16,521 sentences:
	en[tokens] : Avg. 17.02 | Min.     3 | Max.   155 | >16.   7,666 | >32.   528 | >64.     7 | >128.     1
	ja[tokens] : Avg. 20.81 | Min.     1 | Max.   192 | >16.  10,496 | >32.  1773 | >64.    20 | >128.     1
Showing statistic for 16,521 sentences:
	en[tokens] : Avg.  9.48 | Min.     3 | Max.    64 | >16.     763 | >32.     4 | >64.     0 | >128.     0
	ja[tokens] : Avg.  9.27 | Min.     2 | Max.    51 | >16.     650 | >32.     4 | >64.     0 | >128.     0
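
One possible use of these counts is to estimate how many entries would be cut off at a given maximum sequence length. The sketch below takes the `>64.` figures from the `en[tokens]` row of the first `stats()` call (the `ja_en` pair) for each dataset and converts them to fractions. Choosing 64 as the cutoff is an assumption for illustration, not a setting taken from this notebook.

```python
# (total sentences, en entries with more than 64 tokens), copied from the
# first stats() call of each dataset above.
over_64_en = {
    "JESC": (2_801_388, 6_990),
    "WikiCorpus": (457_687, 164_503),
    "Tatoeba": (208_861, 722),
    "SnowSimplified": (100_000, 0),
    "MassiveTranslation": (16_521, 7),
}


def fraction_over(total, count):
    """Share of entries that exceed the threshold."""
    return count / total


fractions = {name: fraction_over(*v) for name, v in over_64_en.items()}
for name, frac in fractions.items():
    print(f"{name:<20} {frac:.2%} of en entries exceed 64 tokens")
```

On these numbers the Wikipedia corpus stands apart from the rest: roughly a third of its entries exceed 64 tokens, while every other dataset stays well below one percent.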
