<a id='top'></a><a name='top'></a>
# Chapter 5: Natural Language Generation and Conversion with Transformer

## 5.3 Kana-Kanji Conversion with Transformer

<table align="left">
  <td>
    <a href="link.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

* [Imports and Setup](#setup)
* [5.3 Kana-Kanji Conversion with Transformer](#5.3)
    - [5.3.1 Sequence to Sequence (Seq2Seq) Models](#5.3.1)
    - [5.3.2 Converting from Kanji-Kana majiribun into Romaji](#5.3.2)
    - [5.3.3 Training and Tokenizing with SentencePiece](#5.3.3)
    - [5.3.4 Training a Conversion Model with Fairseq](#5.3.4)
    - [5.3.5 Checking created artifacts](#5.3.5)

---
<a name='setup'></a><a id='setup'></a>
# Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
# Option to use downloaded/pre-trained data (assumes Colab platform)
USE_GD_DATA = False

if USE_GD_DATA:
    try:
        from google.colab import drive
        drive.mount('/content/drive', force_remount=True)
        print("Creating a local copy of chp05_03...")
        # Assumes prepared data is stored as 'My Drive/chp05_03' on Google Drive 
        !cp -r /content/drive/MyDrive/chp05_03 /content/chp05_03
        print()
        !ls -l /content/chp05_03
    except Exception as e:
        print(f"Error: {e}")

In [2]:
import pathlib
from pathlib import Path

data_root = Path("chp05_03")
req_file = data_root / "requirements_5_5_3.txt"

if not data_root.is_dir():
    data_root.mkdir()
else:
    print(f"{data_root} exists.")

In [3]:
%%writefile {req_file}
cutlet==0.1.19
fugashi[unidic]==1.2.1
sentencepiece==0.1.97
fairseq==0.12.2
tensorboardX==2.5.1
watermark==2.3.1

Writing chp05_03/requirements_5_5_3.txt


In [4]:
import os
import sys
check1 = ('google.colab' in sys.modules)
check2 = (os.environ.get('CLOUDSDK_CONFIG')=='/content/.config')
IS_COLAB = True if (check1 or check2) else False

# Need fugashi for cutlet, and need unidic for fugashi
if IS_COLAB:
    print("Installing packages")
    !pip install --quiet -r {req_file}
    !apt-get install tree &> /dev/null
    !python -m unidic download
    !sudo apt-get install ack -qq
    print()
    print("** Need to restart runtime after installing sentencepiece **")
    print("> Runtime > Restart runtime ...")
else:
    print("Running locally.")

Installing packages

In [1]:
# Standard Library imports
from itertools import chain
import logging
import os
import pathlib
from pathlib import Path
import shlex
import shutil
import subprocess
import sys

# Third-party imports
import cutlet
import fairseq
import sentencepiece as spm
import torch
from tqdm import tqdm
from watermark import watermark

# Suppress TensorFlow log messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

check1 = ('google.colab' in sys.modules)
check2 = (os.environ.get('CLOUDSDK_CONFIG')=='/content/.config')
IS_COLAB = True if (check1 or check2) else False
print(f"IS_COLAB: {IS_COLAB}")

katsu = cutlet.Cutlet()

_ = torch.manual_seed(42)

def HR():
    print("-"*50)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device:{device}")
HR()

packages_check="cutlet,fairseq,fugashi,sentencepiece,tensorboardX,torch,watermark"
print(watermark(packages=packages_check, python=True,machine=True))

IS_COLAB: True
device:cuda
--------------------------------------------------
Python implementation: CPython
Python version       : 3.8.15
IPython version      : 7.9.0

cutlet       : 0.1.19
fairseq      : 0.12.2
fugashi      : 1.2.1
sentencepiece: 0.1.97
tensorboardX : 2.5.1
torch        : 1.12.1+cu113
watermark    : 2.3.1

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.10.133+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [2]:
data_file = "sentences20210924.tar.bz2"
data_url = f"janlpbook.s3.amazonaws.com/{data_file}"
data_root = Path("chp05_03") # redefine after runtime restart
data_dir = data_root /"data"
data_src = data_dir / data_file
data_path = data_dir / "sentences.csv"

data_file2 = "cc100.ja.mod1k.txt"
data_url2 = f"janlpbook.s3.amazonaws.com/{data_file2}"
data_path2 = data_dir / data_file2

print(f"""
data_root:\t{data_root}
data_file:\t{data_file}
data_url:\t{data_url}
data_dir:\t{data_dir}
data_src:\t{data_src}
data_path:\t{data_path}

data_file2:\t{data_file2}
data_url2:\t{data_url2}
data_path2:\t{data_path2}
""")


data_root:	chp05_03
data_file:	sentences20210924.tar.bz2
data_url:	janlpbook.s3.amazonaws.com/sentences20210924.tar.bz2
data_dir:	chp05_03/data
data_src:	chp05_03/data/sentences20210924.tar.bz2
data_path:	chp05_03/data/sentences.csv

data_file2:	cc100.ja.mod1k.txt
data_url2:	janlpbook.s3.amazonaws.com/cc100.ja.mod1k.txt
data_path2:	chp05_03/data/cc100.ja.mod1k.txt



---
<a name='5.3'></a><a id='5.3'></a>
# 5.3 Kana-Kanji Conversion with Transformer
<a href="#top">[back to top]</a>

<a name='5.3.1'></a><a id='5.3.1'></a>
## 5.3.1 Sequence to Sequence (Seq2Seq) Models
<a href="#top">[back to top]</a>

* For Kana-Kanji conversion, we use a Transformer-based Seq2Seq model.
* A Seq2Seq model converts one sequence into another sequence.
* It consists of two subcomponents, each of which is usually a full neural network with multiple layers:
    - An encoder
    - A decoder
* The encoder converts the input into an internal representation, similar to word embeddings.
* The decoder takes the representation produced by the encoder and generates the output text.
* Fairseq implements a Transformer Seq2Seq model, and we will use one of its default Transformer configurations.


<a name='5.3.2'></a><a id='5.3.2'></a>
## 5.3.2 Converting from Kanji-Kana majiribun into Romaji
<a href="#top">[back to top]</a>

* Create the parallel corpus for Kana-Kanji conversion.
* Use cutlet to convert Kanji-Kana majiribun (mixed kanji and kana text) into Romaji.

In [3]:
# Hello-world example
katsu = cutlet.Cutlet()
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")

'Katsu karee wa oishii'

* Collect a large corpus of raw Japanese text, from which we build a parallel (Romaji to Japanese) corpus.
* Use a combination of the Tatoeba and CC-100 datasets.
* For CC-100, use a 1/1000 sample.

In [4]:
if not (data_dir).is_dir():
    print(f"Creating: {data_dir}")
    data_dir.mkdir(parents=True, exist_ok=False)
else:
    print(f"{data_dir} exists.")

Creating: chp05_03/data


In [5]:
# Download and prep the Tatoeba datasets
if not data_src.is_file():
    print(f"Downloading {data_url} to {data_src}")
    subprocess.run(shlex.split(f"wget -q -O {data_src} {data_url}"))
    print("Done.")
else:
    print(f"{data_src} exists.")

HR()

if not data_path.is_file():
    print(f"Extracting file {data_src}")
    shutil.unpack_archive(data_src, data_dir)
    print("Done.")
else:
    print(f"{data_path} exists")

Downloading janlpbook.s3.amazonaws.com/sentences20210924.tar.bz2 to chp05_03/data/sentences20210924.tar.bz2
Done.
--------------------------------------------------
Extracting file chp05_03/data/sentences20210924.tar.bz2
Done.


In [6]:
# Download the CC-100 datasets
if not data_path2.is_file():
    print(f"Downloading {data_url2} to {data_path2}")
    subprocess.run(shlex.split(f"wget -q -O {data_path2} {data_url2}"))
    print("Done.")
else:
    print(f"{data_path2} exists.")

Downloading janlpbook.s3.amazonaws.com/cc100.ja.mod1k.txt to chp05_03/data/cc100.ja.mod1k.txt
Done.


In [7]:
# Check
!head -n 5 {data_path}

1	cmn	我們試試看！
2	cmn	我该去睡觉了。
3	cmn	你在干什麼啊？
4	cmn	這是什麼啊？
5	cmn	今天是６月１８号，也是Muiriel的生日！


In [8]:
data_file3 = "sentences.jpn"
data_path3 = data_dir / data_file3
data_path3

PosixPath('chp05_03/data/sentences.jpn')

In [9]:
# OSX: grep changed from grep (GNU grep) 2.5.1 in 10.7 to grep 2.5.1-FreeBSD in OSX 10.8
# Accordingly, the FreeBSD grep version no longer supports -P, --perl-regexp
# Instead, we use 'ack' to ensure cross-compatibility across OSX and Linux.

print("Test ack compatibility with --perl-regexp on both OSX and Linux:")
ack_test = !ack -1 '\tjpn\t' chp05_03/data/sentences.csv | cut -f 3
print(ack_test[0])
assert ack_test[0] == "きみにちょっとしたものをもってきたよ。", "Problem with ack"

Test ack compatibility with --perl-regexp on both OSX and Linux:
きみにちょっとしたものをもってきたよ。


In [10]:
# Remove extra fields from sentences.csv
# Without `cut field 3`, the output is "1297 jpn きみにちょっとしたものをもってきたよ。"
# With `cut field 3`, the output is "きみにちょっとしたものをもってきたよ。"
if not Path(data_dir/"sentences.jpn").is_file():
    print(f"Creating {data_dir}/sentences.jpn")
    !ack '\tjpn\t' {data_dir}/sentences.csv | cut -f 3 > {data_dir}/sentences.jpn
else:
    print(f"{data_dir}/sentences.jpn exists.")

HR()

!head -n 5 {data_dir}/sentences.jpn
HR()
!du -h {data_dir}/sentences.jpn

Creating chp05_03/data/sentences.jpn
--------------------------------------------------
きみにちょっとしたものをもってきたよ。
何かしてみましょう。
私は眠らなければなりません。
何してるの？
今日は６月１８日で、ムーリエルの誕生日です！
--------------------------------------------------
12M	chp05_03/data/sentences.jpn


* Read the two source files (`sentences.jpn` and `cc100.ja.mod1k.txt`) and write two output files: a Kanji-Kana majiribun file with the original text, and a Romaji file with the text converted by cutlet.

* Use [`itertools.chain()`](https://docs.python.org/3/library/itertools.html#itertools.chain) to read both files as one stream. It makes an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted, so consecutive sequences can be treated as a single sequence.
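
As a small illustration of how `chain()` behaves with file-like objects, the sketch below uses two in-memory `StringIO` buffers as stand-ins for the two corpus files (the sample lines are made up for the example):

```python
from io import StringIO
from itertools import chain

# Two in-memory "files" standing in for sentences.jpn and cc100.ja.mod1k.txt
fin1 = StringIO("line a\nline b\n")
fin2 = StringIO("line c\n")

# chain() yields every line of fin1 first, then every line of fin2
lines = [line.strip() for line in chain(fin1, fin2)]
print(lines)
```

Because `chain()` is lazy, neither file has to be loaded into memory in full, which matters for a corpus of several hundred thousand lines.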

## Error:

* `cutlet.Cutlet().romaji()` raises an exception (`substring not found`) on this text from cc100.ja.mod1k.txt, line 446182:
    
```
ｺﾝﾊﾞﾝﾊｰヾ(･∀･`o)ﾉ))明細書については、今日中に一通り終わらせないと先行きが不安ですが、普通にムリだろうなと諦めています。
```

In [11]:
def test_odoriji():
    # cc100.ja.mod1k.txt, line 446182
    text = "ｺﾝﾊﾞﾝﾊｰヾ(･∀･`o)ﾉ))明細書については、今日中に一通り終わらせないと先行きが不安ですが、普通にムリだろうなと諦めています。"
    print(text)
    HR()

    # The unmodified text makes cutlet raise an exception
    try:
        print(cutlet.Cutlet().romaji(text))
    except Exception as e:
        print(f"Error: {e}")
    HR()

    # Hack: remove the odoriji (iteration mark) ヾ.
    # The other odoriji ゝゞヽ do not seem to cause any errors.
    text = text.replace("ヾ", "")
    print(text)
    HR()

    try:
        print(cutlet.Cutlet().romaji(text))
    except Exception as e:
        print(f"Error: {e}")

test_odoriji()

ｺﾝﾊﾞﾝﾊｰヾ(･∀･`o)ﾉ))明細書については、今日中に一通り終わらせないと先行きが不安ですが、普通にムリだろうなと諦めています。
--------------------------------------------------
Error: substring not found
--------------------------------------------------
ｺﾝﾊﾞﾝﾊｰ(･∀･`o)ﾉ))明細書については、今日中に一通り終わらせないと先行きが不安ですが、普通にムリだろうなと諦めています。
--------------------------------------------------
Konbanhaa  )) meisaisho ni tsuite wa, kyoujuu ni hitotoori owarasenaito sakiyuki ga fuan desuga, futsuu ni muri darou na to akiramete imasu.


In [12]:
def line_count(filename):
    # Purposely use a blocking (synchronous) subprocess call to `wc -l`
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

* GB: For consistency, the output filenames use underscores, e.g. `tatoeba_cc100.kan`.

In [13]:
fin1_n = line_count(f"{data_dir}/sentences.jpn")
fin2_n = line_count(f"{data_dir}/cc100.ja.mod1k.txt")
fin_chain_n = fin1_n + fin2_n
print(f"Total lines: {fin_chain_n:,}")

Total lines: 671,370


In [14]:
raw_text = data_dir / "raw_text"

if not raw_text.is_dir():
    print(f"Creating {raw_text}")
    raw_text.mkdir()
else:
    print(f"{raw_text} exists.")

Creating chp05_03/data/raw_text


In [15]:
if not (Path(f"{raw_text}/tatoeba_cc100.kan").is_file() and Path(f"{raw_text}/tatoeba_cc100.rom").is_file()):
    
    with open(f"{data_dir}/sentences.jpn") as fin1, \
        open(f"{data_dir}/cc100.ja.mod1k.txt") as fin2, \
        open(f"{raw_text}/tatoeba_cc100.kan", mode='w') as f_kan, \
        open(f"{raw_text}/tatoeba_cc100.rom", mode='w') as f_rom:

        for line in tqdm(chain(fin1, fin2), total=fin_chain_n):
            sent_kan = line.strip()
            if not sent_kan:
                continue
            if len(sent_kan) > 256:
                # skip long sentences
                continue

            try:
                # Hack to remove odoriji ヾ, otherwise it crashes cutlet.Cutlet().romaji()
                # str.replace() should be fast enough here.
                sent_kan = sent_kan.replace("ヾ", "")             
                sent_rom = katsu.romaji(sent_kan)
            except Exception as e:
                print(f"Error: {e}")
                continue

            sent_rom = sent_rom.replace(' ', '') 
            f_kan.write(sent_kan + '\n')
            f_rom.write(sent_rom + '\n')
            
else:
    print(f"{raw_text}/tatoeba_cc100.rom exists.")
    print(f"{raw_text}/tatoeba_cc100.kan exists.")                            

100%|██████████| 671370/671370 [03:46<00:00, 2960.62it/s]


In [16]:
!head -n 10 {raw_text}/tatoeba_cc100.kan {raw_text}/tatoeba_cc100.rom

==> chp05_03/data/raw_text/tatoeba_cc100.kan <==
きみにちょっとしたものをもってきたよ。
何かしてみましょう。
私は眠らなければなりません。
何してるの？
今日は６月１８日で、ムーリエルの誕生日です！
お誕生日おめでとうムーリエル！
ムーリエルは２０歳になりました。
パスワードは「Muiriel」です。
すぐに戻ります。
知らない。

==> chp05_03/data/raw_text/tatoeba_cc100.rom <==
Kiminichottoshitamonowomottekitayo.
Nankashitemimashou.
Watakushiwanemuranakerebanarimasen.
Nanshiteruno?
Kyouwa6tsuki18kade,Muurierunotanjouhidesu!
OtanjouhiomedetouMuurieru!
Muurieruwa20saininarimashita.
Pasuwaadowa"Muiriel"desu.
Sugunimodorimasu.
Shiranai.


In [17]:
# Check the result of replacing odoriji ヾ. Note that hankaku-katakana still remains.
odoriji_check = !grep -n --binary-files=text 明細書については {raw_text}/tatoeba_cc100.kan | cut -d ':' -f1
!sed -n '{odoriji_check[0]}p' {raw_text}/tatoeba_cc100.kan
!sed -n '{odoriji_check[0]}p' {raw_text}/tatoeba_cc100.rom

ｺﾝﾊﾞﾝﾊｰ(･∀･`o)ﾉ))明細書については、今日中に一通り終わらせないと先行きが不安ですが、普通にムリだろうなと諦めています。 葛西りいちさんの作品をできれば電子書籍で読みたいんです。スキャナで自炊するのは面倒 … [Read more…]
Konbanhaa))meisaishonitsuitewa,kyoujuunihitotooriowarasenaitosakiyukigafuandesuga,futsuunimuridarounatoakirameteimasu.KasaiRiichisannosakuhinwodekirebadenshishosekideyomitaindesu.sukyanadejisuisurunowamendou...[Readmore...]
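
The hankaku (half-width) katakana left in the line above could be folded into full-width characters with Unicode NFKC normalization. The notebook's pipeline does not do this before calling cutlet, so the following is only a minimal sketch using the standard library; the helper name `normalize_nfkc` is introduced here for illustration:

```python
import unicodedata

def normalize_nfkc(text):
    """Apply Unicode NFKC normalization to a string."""
    return unicodedata.normalize("NFKC", text)

# Half-width katakana become full-width katakana
print(normalize_nfkc("ｺﾝﾊﾞﾝﾊｰ"))
# Full-width digits become ASCII digits
print(normalize_nfkc("６月１８日"))
```

Whether to normalize the `.kan` side is a design choice: the model can only learn to produce the character forms that appear in its training targets.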


<a name='5.3.3'></a><a id='5.3.3'></a>
## 5.3.3 Training and Tokenizing with SentencePiece (subword tokenization)
<a href="#top">[back to top]</a>

We are using the SentencePiece Python wrapper.

Training is performed by passing the `spm_train` parameters to the `SentencePieceTrainer.train()` function.

Training API (quick example):

```
% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
```
* `--input`: one-sentence-per-line **raw** corpus file. There is no need to run a
  tokenizer, normalizer, or preprocessor. By default, SentencePiece normalizes
  the input with Unicode NFKC. You can pass a comma-separated list of files.
* `--model_prefix`: output model name prefix. `<model_name>.model` and `<model_name>.vocab` are generated.
* `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000.
* `--character_coverage`: fraction of characters covered by the model. Good defaults are `0.9995` for languages with a rich character set, such as Japanese or Chinese, and `1.0` for languages with a small character set.
* `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentences must be pretokenized when using the `word` type.
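
To build some intuition for `--character_coverage`, the sketch below counts how many distinct characters, taken from most to least frequent, are needed to cover a given fraction of all character occurrences in a corpus. This only illustrates the idea and is not SentencePiece's own implementation; the function name and the toy corpus are assumptions made for the example:

```python
from collections import Counter

def chars_needed_for_coverage(lines, coverage):
    """Count the most frequent characters needed to reach `coverage`."""
    counts = Counter(ch for line in lines for ch in line.strip())
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= coverage:
            return n
    return len(counts)

# Toy corpus: 'a' occurs 6 times, 'b' and 'c' once each
toy_corpus = ["aaab", "aaac"]
print(chars_needed_for_coverage(toy_corpus, 0.75))
print(chars_needed_for_coverage(toy_corpus, 1.0))
```

With a coverage below `1.0`, the rarest characters fall outside the covered set, which is one way rare kanji can end up as `<unk>` pieces after tokenization.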

Resources:

* https://github.com/google/sentencepiece
* https://github.com/google/sentencepiece/blob/master/python/README.md
* https://colab.research.google.com/github/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb
* https://github.com/google/sentencepiece/blob/master/doc/options.md#training-options

In [18]:
!head -n 5 {raw_text}/tatoeba_cc100.kan

きみにちょっとしたものをもってきたよ。
何かしてみましょう。
私は眠らなければなりません。
何してるの？
今日は６月１８日で、ムーリエルの誕生日です！


In [19]:
sp_tokenizer_models = data_dir / "sp_tokenizer_models"

if not sp_tokenizer_models.is_dir():
    print(f"Creating SentencePiece models in {sp_tokenizer_models}.")
    sp_tokenizer_models.mkdir()
else:
    print(f"{sp_tokenizer_models} exists.")

Creating SentencePiece models in chp05_03/data/sp_tokenizer_models.


In [20]:
if not Path(sp_tokenizer_models / "tatoeba_cc100.kan.spm.model").is_file():
    
    print(f"Create SentencePiece tokenizer models for kanji data in {sp_tokenizer_models}.")
    spm.SentencePieceTrainer.train(
        input=f'{raw_text}/tatoeba_cc100.kan',
        model_prefix=f'{sp_tokenizer_models}/tatoeba_cc100.kan.spm',
        vocab_size=10_000,
        input_sentence_size=100_000,
        shuffle_input_sentence=True,
        minloglevel=1
    )
    print("Done.")
else:
    print(f"File {sp_tokenizer_models}/tatoeba_cc100.kan.spm.model exists.")

Create SentencePiece tokenizer models for kanji data in chp05_03/data/sp_tokenizer_models.
Done.


In [21]:
if not Path(sp_tokenizer_models / "tatoeba_cc100.rom.spm.model").is_file():
    
    print(f"Create SentencePiece tokenizer models for romaji data in {sp_tokenizer_models}.")
    spm.SentencePieceTrainer.train(
        input=f'{raw_text}/tatoeba_cc100.rom',
        model_prefix=f'{sp_tokenizer_models}/tatoeba_cc100.rom.spm',
        vocab_size=1_000,
        input_sentence_size=100_000,
        shuffle_input_sentence=True,
        minloglevel=1
    )
    print("Done.")
else:
    print(f"File {sp_tokenizer_models}/tatoeba_cc100.rom.spm.model exists.")

Create SentencePiece tokenizer models for romaji data in chp05_03/data/sp_tokenizer_models.
Done.


---
* Test the trained tokenization models on a few examples.

In [22]:
# Create a segmenter instance and load the model file
sp_kan = spm.SentencePieceProcessor(
    model_file=f"{sp_tokenizer_models}/tatoeba_cc100.kan.spm.model"
)
sp_kan

<sentencepiece.SentencePieceProcessor; proxy of <Swig Object of type 'sentencepiece::SentencePieceProcessor *' at 0x7f7f0965fd80> >

In [23]:
# Encode the text into ids, then map the ids back to subword pieces
print(sp_kan.id_to_piece(sp_kan.encode("これはテストです。")))

['▁これは', 'テスト', 'です', '。']


In [24]:
# Rare kanji cause more problems: several characters come out as <unk>
print(sp_kan.id_to_piece(sp_kan.encode("魑魅魍魎が跋扈する")))

['▁', '<unk>', '魅', '<unk>', 'が', '<unk>', 'する']


In [25]:
sp_rom = spm.SentencePieceProcessor(
    model_file=f"{sp_tokenizer_models}/tatoeba_cc100.rom.spm.model"
)

print(sp_rom.id_to_piece(sp_rom.encode("Korewatesutodesu.")))

['▁Kore', 'wa', 'te', 'suto', 'desu', '.']


In [26]:
tokenized_corpus = data_dir / "tokenized_corpus"

if not tokenized_corpus.is_dir():
    print(f"Tokenize the entire corpus in {tokenized_corpus}.")
    tokenized_corpus.mkdir()
else:
    print(f"{tokenized_corpus} exists.")

Tokenize the entire corpus in chp05_03/data/tokenized_corpus.


In [27]:
def tokenize_corpus_fn(spp, file_source, file_target):
    fin_n = line_count(file_source)
    print(f"Writing to {file_target}")

    with open(file_source) as fin, open(file_target, mode="w") as fout:
        for line in tqdm(fin, total=fin_n):
            sent_jpn = line.strip()
            tokens = spp.id_to_piece(spp.encode(sent_jpn))
            # Join all items into a single string, with space character as separator.
            fout.write(" ".join(tokens) + "\n")

In [28]:
# Tokenize the kanji corpus
file_source_kan = f"{raw_text}/tatoeba_cc100.kan"
file_target_kan = f"{tokenized_corpus}/tatoeba_cc100.tok.kan"

if not Path(f"{tokenized_corpus}/tatoeba_cc100.tok.kan").is_file():
    tokenize_corpus_fn(sp_kan, file_source_kan, file_target_kan)
else:
    print(f"{tokenized_corpus}/tatoeba_cc100.tok.kan exists.")

Writing to chp05_03/data/tokenized_corpus/tatoeba_cc100.tok.kan


100%|██████████| 603144/603144 [00:25<00:00, 23431.78it/s]


In [29]:
# Tokenize the romaji corpus
file_source_rom = f"{raw_text}/tatoeba_cc100.rom"
file_target_rom = f"{tokenized_corpus}/tatoeba_cc100.tok.rom"

if not Path(f"{tokenized_corpus}/tatoeba_cc100.tok.rom").is_file():
    tokenize_corpus_fn(sp_rom, file_source_rom, file_target_rom)
else:
    print(f"{tokenized_corpus}/tatoeba_cc100.tok.rom exists.")

Writing to chp05_03/data/tokenized_corpus/tatoeba_cc100.tok.rom


100%|██████████| 603144/603144 [00:29<00:00, 20341.40it/s]


In [30]:
# Examine the first 10 lines from each file
!head -n 10 {tokenized_corpus}/tatoeba_cc100.tok.kan {tokenized_corpus}/tatoeba_cc100.tok.rom

==> chp05_03/data/tokenized_corpus/tatoeba_cc100.tok.kan <==
▁ き み に ちょっとした ものを も ってきた よ 。
▁何か してみましょう 。
▁私は 眠 ら なければなりません 。
▁何 してる の ?
▁今日は 6 月 18 日 で 、 ム ー リ エル の 誕生日 です !
▁お 誕生日 お め で と う ム ー リ エル !
▁ ム ー リ エル は 20 歳 になりました 。
▁ パスワード は 「 M u i ri el 」 です 。
▁ すぐに 戻り ます 。
▁ 知らない 。

==> chp05_03/data/tokenized_corpus/tatoeba_cc100.tok.rom <==
▁Kimi ni chotto shita monowo mo ttekita yo .
▁Nan ka shite mi mashou .
▁Watakushiwa ne mu ra nakereba n arimasen .
▁Nan shite runo ?
▁Kyou wa 6 tsuki 18 ka de , M u u ri eru no tan jou hi desu !
▁O tan jou hi o me de tou M u u ri eru !
▁Mu u ri eru wa 20 sai ninarimashita .
▁ P a su wa a do wa " M u i ri e l " desu .
▁Su gu nimo do rimasu .
▁Shi ranai .


In [31]:
def create_validation_set(file, target):
    if not Path(tokenized_corpus / file).is_file():
        print("Create validation dataset.")
        # NR is awk's current record (line) number
        # Keep only every 100th line
        !awk 'NR%100==0' {target} > {tokenized_corpus}/{file}
    else:
        print(f"{tokenized_corpus}/{file} exists.")

    print(f"line count: {line_count(f'{tokenized_corpus}/{file}'):,}")

In [32]:
# Create Kanji validation set
create_validation_set("tatoeba_cc100.tok.valid.kan", file_target_kan)

Create validation dataset.
line count: 6,031


In [33]:
# Create Romaji validation set
create_validation_set("tatoeba_cc100.tok.valid.rom", file_target_rom)

Create validation dataset.
line count: 6,031


In [34]:
def create_training_set(file, target):
    if not Path(tokenized_corpus / file).is_file():
        print("Create training dataset.")
        # NR is awk's current record (line) number
        # Keep every line except every 100th
        !awk 'NR%100!=0' {target} > {tokenized_corpus}/{file}
    else:
        print(f"{tokenized_corpus}/{file} exists.")

    print(f"line count: {line_count(f'{tokenized_corpus}/{file}'):,}")

In [35]:
# Create Kanji training set
create_training_set("tatoeba_cc100.tok.train.kan", file_target_kan)

Create training dataset.
line count: 597,113


In [36]:
# Create Romaji training set
create_training_set("tatoeba_cc100.tok.train.rom", file_target_rom)

Create training dataset.
line count: 597,113
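
The two `awk` filters above send every 100th line to the validation set and all other lines to the training set. Because the same rule is applied to both the `.kan` and `.rom` files, line N of one file still pairs with line N of the other. A minimal Python sketch of the same rule follows; the helper name `split_every_nth` is introduced here and is not part of the notebook:

```python
def split_every_nth(lines, n=100):
    """Send every n-th item (1-based) to validation, the rest to training."""
    train, valid = [], []
    for line_number, line in enumerate(lines, start=1):
        if line_number % n == 0:
            valid.append(line)
        else:
            train.append(line)
    return train, valid

# Use line numbers as stand-ins for the 603,144 tokenized sentences
train, valid = split_every_nth(range(1, 603_145))
print(f"train: {len(train):,}  valid: {len(valid):,}")
```

For 603,144 input lines this yields 597,113 training lines and 6,031 validation lines, the same counts as printed by the cells above.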


<a name='5.3.4'></a><a id='5.3.4'></a>
## 5.3.4 Training a Conversion Model with Fairseq
<a href="#top">[back to top]</a>

* Fairseq is a sequence modeling toolkit by Meta AI.
* It implements major sequence models, such as the Transformer-based Seq2Seq model.
* We first need to convert the tokenized text corpus into a binary format with `fairseq-preprocess`.
* This creates binary files for each language and data split.

Fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines, as well as fast mixed-precision training and inference on modern GPUs.

Reference:

* https://github.com/facebookresearch/fairseq
* https://aclanthology.org/N19-4009.pdf
* https://www.youtube.com/watch?v=OtgDdWtHvto

In [37]:
bin = data_dir / "bin"
bin

PosixPath('chp05_03/data/bin')

In [38]:
# rom: Romaji
# kan: Kanji
if not bin.is_dir():
    print("Converting raw text corpus to a binary format.")
    HR() 
    !fairseq-preprocess --source-lang rom --target-lang kan \
        --trainpref {tokenized_corpus}/tatoeba_cc100.tok.train \
        --validpref {tokenized_corpus}/tatoeba_cc100.tok.valid \
        --destdir {bin} \
        --workers 4
    HR()
    print("Done")
else:
    print(f"{bin} exists")

Converting raw text corpus to a binary format.
--------------------------------------------------
2022-12-07 10:36:16 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='chp05_03/data/bin', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quant

* Run fairseq-train to start the training process.

In [39]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [40]:
models = data_dir / "models"
models

PosixPath('chp05_03/data/models')

In [53]:
if models.is_dir():
    print(f"fairseq-train model exists at {models}")

else:

    if IS_COLAB:
        print("Start training with fairseq-train on COLAB")
        HR()
        !fairseq-train \
            {bin} \
            --max-tokens 16384 \
            --arch transformer \
            --encoder-layers 4 \
            --decoder-layers 4 \
            --encoder-embed-dim 512 \
            --decoder-embed-dim 512 \
            --encoder-ffn-embed-dim 2048 \
            --decoder-ffn-embed-dim 2048 \
            --encoder-attention-heads 8 \
            --decoder-attention-heads 8 \
            --optimizer adam --lr 2e-4 \
            --lr-scheduler inverse_sqrt \
            --warmup-updates 4000 \
            --save-dir {models} \
            --max-epoch 2 \
            --reset-optimizer \
            --no-epoch-checkpoints \
            --fp16

            # --max-epoch 10 \
            
    else:

        print("Start training with fairseq-train locally (non-GPU)")
        HR()
        !fairseq-train \
            {bin} \
            --max-tokens 16384 \
            --arch transformer \
            --encoder-layers 4 \
            --decoder-layers 4 \
            --encoder-embed-dim 512 \
            --decoder-embed-dim 512 \
            --encoder-ffn-embed-dim 2048 \
            --decoder-ffn-embed-dim 2048 \
            --encoder-attention-heads 8 \
            --decoder-attention-heads 8 \
            --optimizer adam --lr 2e-4 \
            --lr-scheduler inverse_sqrt \
            --warmup-updates 4000 \
            --save-dir {models} \
            --max-epoch 1 \
            --no-epoch-checkpoints \
            --cpu

    HR()
    print("Done.")

Start training with fairseq-train on COLAB
--------------------------------------------------
2022-12-07 10:55:25 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes
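
The training command combines `--lr 2e-4`, `--lr-scheduler inverse_sqrt`, and `--warmup-updates 4000`. As a rough sketch of what such a schedule looks like, the function below ramps the learning rate linearly during warmup and then decays it in proportion to the inverse square root of the update number. It assumes warmup starts from a learning rate of 0 and is an illustration, not fairseq's own code; the function name is hypothetical:

```python
def inverse_sqrt_lr(step, peak_lr=2e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at a given update (assumed warmup start: warmup_init_lr)."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to peak_lr
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup: decay proportionally to 1 / sqrt(step)
    return peak_lr * (warmup_updates / step) ** 0.5

for step in (0, 2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
```

Under these assumptions the rate peaks at update 4000 and has halved again by update 16000, so a short run of one or two epochs may spend a large share of its updates in the warmup phase.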

In [42]:
sp_tatoeba = spm.SentencePieceProcessor(model_file=f"{sp_tokenizer_models}/tatoeba_cc100.rom.spm.model")

test_sentences = [
    'Kishaninoru.',
    'Shinbunkisha.',
    'Nunodefuku.',
    'Furuutowofuku.',
    'Nunodefuruutowofuku.',
    'Kikaigakushuuwobenkyousurumatatonaikikai.',
    'Jinkouchinounikyouminoarujinkougafueteiru.',
    'Kougakubunogakuhiwahijounikougakuninatta.',
    'Reizoukonishougaganainarashouganai.',
    'Chikatetsunarimasueki.',
    'Karumenmenyoripaeriasuki.'
]

rom_tokenized = [' '.join(sp_tatoeba.id_to_piece(sp_tatoeba.encode(rom))) for rom in test_sentences]

In [43]:
!echo "{'\n'.join(rom_tokenized)}"

▁Ki sha ni no ru .
▁Shi n bun ki sha .
▁ N u node fuku .
▁Fu ru u to wo fuku .
▁ N u node fu ru u to wo fuku .
▁Ki kai gaku shuu wo benkyou suru mata to nai ki kai .
▁Ji n kou chi nou ni kyou mi no aru jin kou ga fu e teiru .
▁Kou gaku bu no gaku hi wa hijouni kou gaku ninatta .
▁ R ei zou ko ni shou ga ganai nara shou ganai .
▁Chi ka te tsu n arimasu eki .
▁Ka ru men men yori pa e ri a suki .


In [44]:
!echo "{'\n'.join(rom_tokenized)}" | fairseq-interactive \
{bin} \
--path {models}/checkpoint_best.pt \
--source-lang rom \
--target-lang kan

2022-12-07 10:40:50 | INFO | fairseq_cli.interactive | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_na

In [45]:
!echo "{'\n'.join(rom_tokenized)}" | fairseq-interactive \
{bin} \
--path {models}/checkpoint_best.pt \
--source-lang rom \
--target-lang kan \
--beam 10 2> /dev/null | grep 'H-' | cut -f3

^C


<a name='5.3.5'></a><a id='5.3.5'></a>
## 5.3.5 Checking created artifacts
<a href="#top">[back to top]</a>

In [46]:
!tree chp05_03

chp05_03
├── data
│   ├── bin
│   │   ├── dict.kan.txt
│   │   ├── dict.rom.txt
│   │   ├── preprocess.log
│   │   ├── train.rom-kan.kan.bin
│   │   ├── train.rom-kan.kan.idx
│   │   ├── train.rom-kan.rom.bin
│   │   ├── train.rom-kan.rom.idx
│   │   ├── valid.rom-kan.kan.bin
│   │   ├── valid.rom-kan.kan.idx
│   │   ├── valid.rom-kan.rom.bin
│   │   └── valid.rom-kan.rom.idx
│   ├── cc100.ja.mod1k.txt
│   ├── models
│   ├── raw_text
│   │   ├── tatoeba_cc100.kan
│   │   └── tatoeba_cc100.rom
│   ├── sentences20210924.tar.bz2
│   ├── sentences.csv
│   ├── sentences.jpn
│   ├── sp_tokenizer_models
│   │   ├── tatoeba_cc100.kan.spm.model
│   │   ├── tatoeba_cc100.kan.spm.vocab
│   │   ├── tatoeba_cc100.rom.spm.model
│   │   └── tatoeba_cc100.rom.spm.vocab
│   └── tokenized_corpus
│       ├── tatoeba_cc100.tok.kan
│       ├── tatoeba_cc100.tok.rom
│       ├── tatoeba_cc100.tok.train.kan
│       ├── tatoeba_cc100.tok.train.rom
│       ├── tatoeba_cc100.tok.valid.kan
│       └── tatoeba_cc1

<a name='5.3.6'></a><a id='5.3.6'></a>
## 5.3.6 Saving artifacts to Google Drive
<a href="#top">[back to top]</a>

https://drive.google.com/drive/my-drive

In [56]:
# If need to start with a clean directory on Google Drive
# !rm -fr /content/drive/MyDrive/chp05_03

In [61]:
# Set PUSH_TO_GD to True if you want to push chp05_03 to Google Drive 
PUSH_TO_GD = True
if IS_COLAB and PUSH_TO_GD:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    HR()

    print("Overwriting /content/drive/MyDrive/chp05_03")
    !cp -R /content/chp05_03  /content/drive/MyDrive
    HR()

    # Only keep the checkpoint_best.pt model
    !rm -fr /content/drive/MyDrive/chp05_03/data/models/checkpoint_last.pt
    !du -ha /content/drive/MyDrive/chp05_03/data/models
    HR() 

    print("Check contents of chp05_03:")
    !du -ah /content/drive/MyDrive/chp05_03 --max-depth=2 | sort -h

Mounted at /content/drive
--------------------------------------------------
Overwriting /content/drive/MyDrive/chp05_03
--------------------------------------------------
total 618642
-rw------- 1 root root 633488835 Dec  7 11:21 checkpoint_best.pt
--------------------------------------------------
Check contents of chp05_03:
512	/content/drive/MyDrive/chp05_03/requirements_5_5_3.txt
4.0K	/content/drive/MyDrive/chp05_03/data/.ipynb_checkpoints
830K	/content/drive/MyDrive/chp05_03/data/sp_tokenizer_models
12M	/content/drive/MyDrive/chp05_03/data/sentences.jpn
72M	/content/drive/MyDrive/chp05_03/data/cc100.ja.mod1k.txt
85M	/content/drive/MyDrive/chp05_03/data/bin
139M	/content/drive/MyDrive/chp05_03/data/raw_text
147M	/content/drive/MyDrive/chp05_03/data/sentences20210924.tar.bz2
349M	/content/drive/MyDrive/chp05_03/data/tokenized_corpus
515M	/content/drive/MyDrive/chp05_03/data/sentences.csv
605M	/content/drive/MyDrive/chp05_03/data/models
1.9G	/content/drive/MyDrive/chp05_03
1.9G	/con

In [60]:
if PUSH_TO_GD:
    # Test that we can run inference with the model stored on Google Drive
    !echo "{'\n'.join(rom_tokenized)}" | fairseq-interactive \
    /content/drive/MyDrive/chp05_03/data/bin \
    --path /content/drive/MyDrive/chp05_03/data/models/checkpoint_best.pt \
    --source-lang rom \
    --target-lang kan \
    --beam 10 2> /dev/null | grep 'H-' | cut -f3

▁ 医者 に の る 。
▁ 新聞 新聞 者 。
▁ の ので 服 。
▁ フル ー ル と 服 を 服 。
▁ の ので フル ー と 服 を 服 。
▁ 機械 学 学習 を 勉強 する と 機械 。
▁ 工事 地 地 の 脳 に 興味 のある 人口 が 増えている 。
▁ 高額 の 学 費 は非常に 高額 になった 。
▁ 冷蔵庫 に 証拠 がない なら 賞 がない 。
▁ 地下鉄 くなります 。
▁ カル 面 面 より パパ 好き 。
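
The hypotheses above are still sequences of SentencePiece pieces separated by spaces, with `▁` marking an original word boundary. A minimal post-processing sketch that turns one such line back into plain text is shown below. It uses plain string operations; the helper name `detokenize` is introduced here, and the SentencePiece processor's own decoding methods would be the more robust route:

```python
def detokenize(pieces_line):
    """Join space-separated pieces and turn the ▁ marker back into a space."""
    joined = "".join(pieces_line.split())
    return joined.replace("▁", " ").strip()

# Two hypothesis lines copied from the output above
print(detokenize("▁ 医者 に の る 。"))
print(detokenize("▁ 高額 の 学 費 は非常に 高額 になった 。"))
```

Reading the converted sentences without piece boundaries makes it easier to compare them against the intended Kanji-Kana text.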
