# Yomikata: Disambiguating Japanese Heteronyms

A step-by-step guide to training Yomikata's word disambiguation model.

# Word pronunciation lists

Cleaning the datasets requires a list of Japanese words and their pronunciations.

We build it by parsing the UniDic, Sudachi, and KANJIDIC dictionaries. These scripts are slow, but they only need to be run once.

In [None]:
from yomikata.dataset.unidic import unidic_data

unidic_data()

In [None]:
from yomikata.dataset.sudachi import sudachi_data

sudachi_data()

In [None]:
from yomikata.dataset.kanjidic import kanjidic_data

kanjidic_data()

In [None]:
from yomikata.dataset.pronunciations import pronunciation_data

pronunciation_data()

In [None]:
from pathlib import Path

import pandas as pd
from yomikata.config import config

df = pd.read_csv(Path(config.READING_DATA_DIR, "all.csv"))
df.sample(10)

# Corpora of annotated sentences

The model is trained on sentences that already carry furigana annotations. We process four data sources here. These scripts are slow, but they only need to be run once.

[Corpus of titles of works in the National Diet Library](https://github.com/ndl-lab/huriganacorpus-ndlbib)

In [None]:
# from yomikata.dataset.ndlbib import ndlbib_data

# ndlbib_data()

[Aozora Bunko book corpus](https://github.com/ndl-lab/huriganacorpus-aozora)

In [None]:
from yomikata.dataset.aozora import aozora_data

aozora_data()

[Kyoto University document leads corpus](https://github.com/ku-nlp/KWDLC)

In [None]:
from yomikata.dataset.kwdlc import kwdlc_data

kwdlc_data()

[Search results for our heteronyms in the BCCWJ corpus](https://chunagon.ninjal.ac.jp/bccwj-nt/search)

In [None]:
from yomikata.dataset.bccwj import bccwj_data

bccwj_data()

In [None]:
from pathlib import Path
from yomikata.config import config, logger
from yomikata import utils

input_files = [
    Path(config.SENTENCE_DATA_DIR, "aozora.csv"),
    Path(config.SENTENCE_DATA_DIR, "kwdlc.csv"),
    Path(config.SENTENCE_DATA_DIR, "bccwj.csv"),
    # Path(config.SENTENCE_DATA_DIR, "ndlbib.csv"),
]

utils.merge_csvs(input_files, Path(config.SENTENCE_DATA_DIR, "all.csv"), n_header=1)
logger.info("✅ Merged sentence data!")

Filter out duplicate sentences.

In [None]:
from pathlib import Path
import pandas as pd

df = pd.read_csv(Path(config.SENTENCE_DATA_DIR, "all.csv"))
df_no_duplicates = df.drop_duplicates(subset=['sentence'], keep='first')
df_no_duplicates.to_csv(Path(config.SENTENCE_DATA_DIR, "all_filtered.csv"), index=False)
logger.info("✅ Filtered out duplicate sentences!")

# Splitting furigana in the corpus

First we generate a dictionary that expresses longer furigana in terms of shorter furigana that appear in the corpus. For example, `{引出/ひきだ}` is broken down into `{引/ひ}` and `{出/きだ}`. The algorithm looks for a set of shorter furigana whose concatenated surfaces and readings exactly match the whole of the longer furigana, or at least its beginning. It prefers more granular representations (a larger number of shorter furigana), and if two equally granular representations are possible, it picks the one composed of furigana with the largest combined frequency in the corpus. It also translates the "ー" character into the specific hiragana it stands for.

In [None]:
from yomikata.dataset.breakdown import generate_breakdown_dictionary

generate_breakdown_dictionary()
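
The selection rule described above can be sketched as a small recursive search. This is only an illustration under stated assumptions, not the code behind `generate_breakdown_dictionary`: the `break_down` function and the frequency table are hypothetical, the counts are made up, and only exact matches of the whole furigana are handled (the prefix-match fallback and the "ー" translation are left out).

```python
from functools import lru_cache

# Hypothetical frequency table of short furigana seen in a corpus (made-up counts).
FREQ = {
    ("日本", "にほん"): 100,
    ("日", "に"): 50,
    ("本", "ほん"): 60,
    ("語", "ご"): 70,
    ("大", "だい"): 90,
    ("人気", "にんき"): 80,
    ("大人", "だいにん"): 1,
    ("気", "き"): 30,
}


def break_down(surface, reading, freq=FREQ):
    """Return the preferred exact decomposition of (surface, reading), or None.

    Preference order: more pieces first, then larger combined frequency.
    """

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Best (piece_count, total_frequency, pieces) covering surface[i:] and reading[j:].
        if i == len(surface) and j == len(reading):
            return (0, 0, ())
        best = None
        for (s, r), count in freq.items():
            if surface.startswith(s, i) and reading.startswith(r, j):
                rest = solve(i + len(s), j + len(r))
                if rest is None:
                    continue
                candidate = (rest[0] + 1, rest[1] + count, ((s, r),) + rest[2])
                if best is None or candidate[:2] > best[:2]:
                    best = candidate
        return best

    result = solve(0, 0)
    return list(result[2]) if result else None


print(break_down("日本語", "にほんご"))  # three pieces beat the two-piece split
print(break_down("大人気", "だいにんき"))  # tie on piece count, decided by frequency
```

With this toy table, the first call prefers the three-piece split over `日本` + `語`, and the second call picks `大` + `人気` because its combined count (170) is larger than that of `大人` + `気` (31).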

Next we use this dictionary to replace all long furigana in the corpus with shorter furigana. By decomposing furigana we get a corpus that allows us to determine readings of kanji surfaces in a more granular way.

In [None]:
from pathlib import Path
from yomikata.dataset import split
from yomikata import utils
from yomikata.config import config, logger

split_dict = utils.load_dict(Path(config.BREAKDOWN_DATA_DIR, "translations.json"))
logger.info("Starting decomposition process, this may take a while...")
split.decompose_furigana(
    Path(config.SENTENCE_DATA_DIR, "all_filtered.csv"),
    Path(config.SENTENCE_DATA_DIR, "all_broken_down.csv"),
    split_dict,
)
logger.info("✅ Decomposed furigana!")
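
As a rough picture of what the replacement step does, the sketch below swaps long furigana tokens for their shorter equivalents with a regular expression. It is a minimal sketch under assumptions: the `{surface/reading}` notation is taken from the text above, but the dictionary layout (a long token mapped to its replacement string) and the `decompose` helper are hypothetical and may not match the format of `translations.json` or the behaviour of `split.decompose_furigana`.

```python
import re

# Matches one furigana token of the form {surface/reading}.
FURI_TOKEN = re.compile(r"\{([^{}/]+)/([^{}/]+)\}")


def decompose(sentence, split_dict):
    """Replace every furigana token that has an entry in split_dict."""

    def swap(match):
        token = match.group(0)
        # Tokens without an entry are left untouched.
        return split_dict.get(token, token)

    return FURI_TOKEN.sub(swap, sentence)


# Hypothetical dictionary entry, using the example from the text.
toy_split_dict = {"{引出/ひきだ}": "{引/ひ}{出/きだ}"}
print(decompose("{引出/ひきだ}しを{開/あ}ける", toy_split_dict))
```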

# Making a list of heteronyms

We start from the list in [Sato et al. 2022](https://aclanthology.org/2022.lrec-1.770.pdf). To these we add further heteronyms picked from the corpus according to how frequently the MeCab tokenizer mispredicts their readings, arriving at the following list:

In [None]:
# heteronyms = "ÂõΩÁ´ã|‰ªÆÂêç|ÈÅ∫Ë®Ä|Âè£ËÖî|‰∏ÄÈÄî|ÊúÄ‰∏≠|‰∏ÄË°å|‰∏ÄÂ§ú|‰∏ãÈáé|Ëä±ÂºÅ|Â±±Èô∞|‰∏ä‰∏ã|‰∏ñË´ñ|ÁâßÂ†¥|‰∏ÄÂë≥|ÊñΩË°å|ÊñΩÂ∑•|Ëª¢Áîü|Ê∏ÖÊµÑ|ËøΩÂæì|Â¢ìÁü≥|Êº¢Êõ∏|‰ΩúÊ≥ï|ÈªíÂ≠ê|Á´∂Â£≤|ÈñãÁúº|Ê±ÇÈÅì|ÊñΩÊ•≠|ÂÄüÂÆ∂|È¢®Ëªä|ËÉåÁ≠ã|ÈÄÜÊâã|ÁîüËä±|‰∏ÄÂØ∏|‰∏ÄÂàÜ|‰∏ÄÊñá|Ê∞óÈ™®|Á¥∞ÁõÆ|ËàπÂ∫ï|Áõ∏‰πó|Ê¢ÖÈõ®|È¢®Á©¥|Â§úË©±|ÈáéÂÖé|ÂÜ∑Ê∞¥|Áø°Áø†|ÂçÅÂÖ´Áï™|Áü≥Á∂ø|ÂÖ¨Êñá|Ë™≠Êú¨|Âè§Êú¨"
heteronyms = "Âπ¥‰∏≠|Ê∞óÂë≥|Êùü|ÂΩ±Èüø|Â§è|Áà∂|ÁêÜ|ÂÆùÂ°ö|Êàë|ÁßÅ|Êµ©|ÁΩÆ|ÂΩº|Êâã|Âè≥|Êòì|Êü±|Ê¨°|ÊòØ|ÁôΩÈ´™|ÊñáÂ≠ó|ÂçöÂ£´|ÈÄ†‰Ωú|Áõ∏|‰ª•|Ââ≤|Âºï|Â∫∑|Âçö|Âëä|‰∏âÈáç|Ââ≤ÂΩì|Â§™Èºì|Êúü|ËøëÂπ≥|Áúº|Ë®±|Âº•|Á¥†|ÂùÇ|ÂÆâ|Á∂¥|Ë°Ä|È°é|ÁåÆ|Êâã‰Ωú|Ê±ü|Áïë|‰∏ÉÂÖµË°õ|Èáé|Â£Å|Ë±ä|ÂºòÊ≥ï|ÁîüËä±|È°éÈ™®|Áπî|Ê¢Ö|Âåó|Âºò|ÂØæ|ÂÆ¥|Ê≤¢|Ê≥ïË°£|Á™Å|Á≤ó|Â•áÊÄ™|ÂøÉË°Ä|ÈáéÂÖé|È£õÈ®®|Â°µ|Â†±|Ë∫´‰Ωì|Â≥∂Ê¥•|Ëàπ|ÊÄù|Âπ∏|ÂÖàÂàª|‰ø∫|Ê†π|Áâõ|Êñ∞|Âäõ|Á®≤|Êº¢Êõ∏|Êüì|Á∑í|ÁõÆ|Êò•|‰∏é|Á≤í|ËçâÂ≠ê|Á§ºÊãù|ÈªÑ|ËàûÈ∂¥|ÈõªÁÅØ|Â®ò|ÊâÄ|ÈÅìÊ®ô|Êéå|ÁôΩ|ÈÉΩ|Ë≤ùÂ°ö|Â≤©|ÂçöÊñá|Áº∂Ë©∞|ÈÄ†|‰∫§Êèõ|‰∏ÉÂçÅ|ËÅû|ËúÇ|Êûï|Âèã|Â≠ê|Èãº|Ëä±|Á∂ö|Âåñ|Ë£úÁ∂¥|Ê∑±|Ë°£|ÊâãË°ì|Âøç|ËÉé|Ê≠Ø|È´òÂ±±|Á•û|Âêâ|È¶¨Èà¥ËñØ|ÂúüÂô®|Ëç∑ÂΩπ|Êò®Â§ú|Ê≠¢|Âª∂|Á¥∞|Âèó|ÈºªËÖî|Â¶æ|ÊÄßÈ™®|Â¶Ç‰Ωï|ÂÖµÊ≥ï|ËÑ±Ê∞¥|Â±±|ÁÆ±|Â∏ñ|Èõ®|ÂÖ±|‰∏çË∂≥|È¢®ÂëÇ|Èô∏Â••|Ëàà|Ë¶™|‰∏äÈáé|Âªª|Êûú|Â≠¶|‰∏äÊâã|Èºª|Èáù|‰∏°Áúº|Ëóç|Ê≥ïÂ∏´|Â∫ú|Âá∫|ÂÖ±Â≠ò|Â≥∂|Â∑ù|Â§âÊèõ|Ê≤≥|È≥•|Â∞º|‰∏ÄÂ§ú|Áâà|Ê∏ÖÊµÑ|‰Ωç|Êó•ÊöÆ|‰πùÂçÅ|È¢®Âúü|Ë®¥|‰ªä|Èõ®Ê∞¥|ÁôΩË°Ä|Áñ±Áò°|Âú∞|ÈÄü|Èªí|Êñá|Áæ§|Á´π|ÂΩ©|Áõ¥|Âè∞|Ê©ã|Âè≥Ë°õÈñÄ|Ëõô|Âá¶|‰æã|ÂÆø|Ë∑°|Ê∂ô|Ê≥ïËèØ|ÊØÖ|ÂÖà|Â••|Êòü|ËøΩÂæì|Âüé|Âçµ|ÈâÑ|ËèØ|Â§™Â§´|Êéõ|Êü≥|‰∏°ÂÅ¥|Âàù|Êö¶|‰∏ç|Ë∫Ø|ÂøÉ|Âæ°|Â∞æ|Ê†ºÂ≠ê|Ê≥ä|‰ªäÊòî|ÊΩÆ|Êü≥Áî∞|ËÖî|Â∏ÇÂ†¥|Â§ú‰∏≠|‰∏ã|Ë£ï|Â°©|‰ΩïÂàÜ|ÂêàÊà¶|Âª∫|Ê≤≥Â≤∏|Ë∫´|ÊµÖ|ÂÆ∂|Êπñ|Â§´|Â§©|ÁéâÊâã|Âêë|ËñÑ|ÈÉ°|Áí∞|Ë©±|ÁîüÂëΩ|Êùø|Á¨ë|Êáê|Ëóª|ÊπØ|Ëªä|Áì¶|ËÄÉ|ÈÅì|ÂáΩ|ËÄÅÂ≠ê|Á¥∞„ÄÖ|‰∏ÄÊúà|‰∏ÄÊó•|ÈÉ∑|ÈÅìÁ®ã|Á¥∞Â∑•|ÊñáÁßë|Ê≠§|ËÑÜ|Âã¢|Âúí|Âê´|ÁúºÈè°|ÂÆà|Â¶ñ|Ëº™|Â©¶|ÂΩ±|‰π≥Êàø|Ê¥ã|Ëâ∂|ÁµÇ|‰Ωú|Ê≥∞|Áä¨|ÂÖâ|Êù±|ËäΩËÖ´|Ë¶ãÁâ©|Á´Ø|ËÑÇ|Èâ§|‰∏â|Âàª|‰ªÅ|Ëâ≤|ËàπÂ∫ï|Â§ñ|ÂíΩÂñâ|‰∏ãÈ°é|ÁâπÈõÜ|Áî∞|Êßã|Â´åÊ∞ó|È°î|ÂÆö|Èï∑|Â≠¶Ê†°|‰∏¶|ËÇù|ËâØ|Âèñ|‰ΩèÂ±Ö|Ëàü|Êûó|Áõ∏Êí≤|ÈÄ≤|ÂÖ≠|È£õÊ≤´|Ë¶ã|Âêå‰∫∫|ËêΩ|Á∂¥Êñπ|Â§ú|Ë≤†|Ê≥ï|Áπ∞|ÂàÉ|Áâá|Êï∞|Â±Ö|Â∑±|‰∏ÄÊùØ|Êòé|‰º¥|ÂØå|Ê∞óË≥™|ÊúÄ‰∏≠|‰ø°Â§´|Ê∫Ä|Êñº|Ê∞ó|Â∏Ç|Êà∏|Êó¨|Âàë‰∫ã|ËÉΩÁôª|‰∏ÄÂ£∞|ÊÆ∫|Êäò|‰∫îÂçÅ|Êïè|‰∏ÄÂØ∏|Âêç|ÂÖ¨Êñá|ÈöÜ|Âà§‰æã|ÁÑº|ÊïÖÈÉ∑|Èùí|Á´ã|Á•≠|Á∂±|Â∞èÂ±ã|Ê≤≥Âè£|Âçó|Áæé|Â∑•Â≠¶|ÈôΩÂ≠ê|ÂÆÆ|ÂçÉÈáå|Âà•|Âπ¥‰∏ä|‰Ωì|ÂåªÂ≠¶|Â≠ò|ÊòéÊó•|Ëëâ|Á≤â|ÊüÑ|È†∏|ÂÆó|Ê°Ç|ÁÅ∞|‰Ωø|Êó•‰∏≠|Ê≠©|ÁßëÂ≠¶|ËÅñ‰∫∫|Â§ßÊâã|Êûù|Âà§|Âêπ|ÊñΩÂ∑•|Âº∑|ÂÆø‰∏ª|Ê≥ïÂ≠¶|Â•Ω|ËÜ
ù|‰ªã|Ê≠©ÂÖµ|‰∏ÄÊò®Âπ¥|Âº∑Âäõ|Êó©|‰∏âÂõΩ|ÂÖ´Âπ°|Â∞∫|ÂÆöÂÆ∂|Ââç|È∂è|‰∏ÄÂë≥|ÈÖî|ÂàÜÂà•|Èõ£Ê≤ª|Ê∞∑|Â∞èÂ≠¶|ÂâçÈßÜ|Ë£Ç|È¢®|ÂçäÊúà|ÂàÜÈñì|È∫ó|ËÜ†|Á´úÈ¶¨|Á¥∞ÁõÆ|ÂÖ•|Ê†Ñ|ÂÖê|Á≤ã|ÂÖµË°õ|ÁØÄ|ÁÆ°|È¢®Ëªä|È†≠Êï∞|Èõ≤|Èú≤|Ë¶ãÈÄè|Êúà|ÈÄÜÊâã|È¶ô|ÊåØ|Â±±Âüé|Èõ™|Ëä±ÂºÅ|‰∏≠|Ê≤π|ÂÖÑ|Èõë|ÁµåÁ∑Ø|Âè§Êú¨|ÂêàÊ≥ï|Â§ßËîµ|Áßã|Ê∞è|Êó•Âêë|‰∏ãÊâã|Ë®é|‰∏≠Èñì|‰∏É|Âì≤|È≠Ç|Ë°®|‰∫ã|Ë™†|Êó•Êú¨|Á±≥|ÂïèÂ±ã|‰∏ä|Ê∏ÖÊ∞¥|È´òÈáé|Áâß|Áâ©|Âè£|Âúü|Ëá≥|Ê≠£|ËäΩ|ÂØõ|Â≠´|Áü≥|Âä©|ÊÅã|Á≠â|Â¢ìÁü≥|ÊµÅ|Ë∂ä|Ê°ê|Èáç|Êï¨|‰Ωï|Ëµ∑|ÊôÇË®à|ÂëΩ|Èöõ|Êµ∑|ÂåñÂ≠¶|Â§™|ÈÖí|Â∫ä|Áê¥|ÁîòËó∑|ÊûØ|Â£∞|ÁÇé|Â±±Ê≤≥|Âô®|ÈÄ£‰∏≠|ÁöÆ|‰∏Ä|Êßò|‰ºù|Á¥Ä|ÈñãÁúº|ÂÆù|ÈäÄÊùè|Âãù|Â¢É|Á†Ç|Â§ßÂ±±|ÊÄß|Ëô´|ÂÅ¥|Êúâ|È™®|Ê≠å|ÂÆ§|ÊôÇ|ËÄ≥|ÈßøÊ≤≥|Èñì|ÂåóÊñπ|Áé©ÂÖ∑|ÂÖÉ|‰∫åÂçÅ|‰∏à|‰∏á|‰π≥|ÈÄÅ|Ë°õÈñÄ|Á©Ç|Êò≠|Èõ∂|Âì≤ÈÉé|Ë™ø|Èù¢|Â∫ïÂäõ|ÈÄö|ÂñÑ|ÈñãÁô∫|‰ºöÊ¥•|Ê∞¥Èù¢|Á°ùÂ≠ê|Êò®Êó•|Á∑ëËâ≤|Â©Ü|ÁõõÂúü|Ë®Ä|Âêà|Á∑®|Â¢®|ÊºÅÂ†¥|Èô∞|Ê∫ê|Âí≥|Á∏Å|‰∏ÄË°å|Ëã±Êñá|ÊòéÊ∏Ö|‰∫åÈáç|ÁùÄ|Êù•|Á≠Ü|ÂÄüÂÆ∂|‰ø°|Âºµ|‰∏ÄÊôÇ|Ë™∞|Áï∞|Èùô|‰æùÂ≠ò|Ë°ÄÁóá|Êú´|Ê≥ïÂÖ∏|Â≤≥|ÂΩì|ÈõªÂ†¥|Ê¢ÖÈõ®|Êé¢|Êâì|Â¢≥|Áõ∏‰πó|Áø°Áø†|Êúõ|‰∏äÈ°é|È≠ö|Ëç∑|Ë™û|Êä±|È¶≥|Ê•µ|Ê∏Ö|Â∑å|ËÅñ|ÊäÄ|Ê£Æ|‰æç|ÁêÉ|Â•≥|ÁæΩ|Âùä|Êïô|ËèñËí≤|Âæπ|‰∏äÊñπ|ÂæÄ|ÂΩ¶|Á∑ë|ÂÄô|‰∏âËßí|Âõ∫|Âπº|‰ªè|Âèä|‰∏ãÈáé|ÂÆÖ|Ê≠¶|ÈÅ∫Ë®Ä|‰πù|Â§ßÂã¢|Á¶èÂ≥∂|Áøº|ÈªíÂ≠ê|Âæ©|Á∑ëÂåñ|ÊâãÁ∂ö|Â≠ù|Ê∞ë|Ëºù|Ëµ§Ë°Ä|ÁóÖ|‰øÇ|ÂàÜ|‰∏ÄÊò®Êó•|ÊØç|ÂÜÖ|Â†±Âëä|ÊöÆ|‰∫∫|‰∏ñ|È¨º|Ê±∫|Â§ßÂíå|Áúü|‰πÖ|Âãá|ÂÖµÈ¶¨|‰ªñ|Â∞èÂà§|Â∫¶|Â†§|Âéö|Â´Ç|‰ªäÊó•|Áôª|Â∞èÂÖ≠|Âè§|Á®Æ|Êòé‰ª£|Â∑£|ÂÖ∂|ÁÅ´|‰∏ÄË®Ä|ÂÆè|Âπ¥|ÁöÜ|Âêõ|Ââõ|ÈõÖ|Ëä±Â¥ó|Â§âÂåñ|ÂêæÂ¶ª|Ëµ§|Ë¢ã|Èáå|‰Ωô|Ê∏Ø|Ê∑≥|ÁçÖÂ≠ê|Âëâ|ÂÜ∑Ê∞¥|ÊâÄË¨Ç|Èëë|Èáë|ÈãºÊùø|Áô∫Ë∂≥|Â∏∏|Ëª¢Áîü|Ëçâ|ÁñæÈ¢®|Ëæ∫|Ê±†|Â¢ì|Â∑ª|Á∂ø|Â∞èÂΩ¢|Ëßí|Ê†º‰ªò|ÂçÅÂÖ´Áï™|Ê≤ª|Á≥∏|Â∏É|Ë°ó|Ë¶≥|Á¥ô|Ê∞¥|ÊÅµ|ÊÑõ|ÂÇç|Êúù|Ë≤´|ÁÑ°|ÈÉ®Â±ã|Êùë|Êó•|ÂõΩÁ´ã|Âè§‰ªä|Ê°ú|ÈªÑËâ≤|‰øÆ|Â∞è|Âæå|È°ç|ÈÖíÈ°û|Êåá|Á©∫|Ê≥â|Áãº|Ë¶Å|Ë≤ù|ÂõõÂçÅ|‰ªî|Ëñ¨|Â∫É|Èö†Â≤ê|ËÉå|Âõõ|Á†îÁ©∂|‰∏ÄÈÄî|Áéâ|Á´•|Ê≠¶Ëîµ|Áü≥Â∑ª|ÂàÄ|È†≠Ëìã|Èü≥|Âô∫|Êú¨|ÊãçÂ≠ê|ÂÖ¨|ÂØ∫|ÂãïÂäõ|È°ûËÅö|ÊÆø|È§®|Ë∂≥Ë∑°|Èçº|ËÖπ|Áîª|ÈÅî|Âåπ|Êõ∏|ÊØõ|Èßï|Âá∫Â±ï|ÂÅΩ|‰∏ä‰∏ã|ÁÇ∫|ÂÆü|Áî∑|ÁáÉ|Â†¥|ÊïôÂåñ|Âßâ|Ê≠™|Èè°|ËÉ∏|Âç∞|ÈôÑ|ÂÉç|Áú∏|ÂØíÊ∞ó|Ë•øÈÉ∑|Âè∏|ËèìÂ≠ê|Á®ã|Ê∞óÈ™®|‰∏ñË´ñ|Êú´Êúü|‰∫∫Â¶ª
|Ë∞∑Èñì|ËçâÁ¥ô|ÂØø|Ê≠≥|Âü∫|Â§ßÁ§æ|Ê®™|ÁµÑ|Â±±Êùë|ÁÅØ|Êú¨Êõ∏|Âøó|ÊÇ™|‰º∏Â≠ê|Ê±ÇÈÅì|Â∫ï|ÂøÉËÇ∫|È´ò|Ëîµ|Êà¶|Â§ß‰∫∫|‰ºö|È¶¨|ËêΩËëâ|ÈõÑ|È†É|Ë®≥|Á´∂Â£≤|‰∫∫Ê∞ó|ËåÇ|‰∫å‰∫∫|Áî∫|ÊÇ≤|Âéü|‰πã|Âπ≥|‰øÆÊ•≠|Â§ßÂàÜ|Áßò|Âè≤Â≠¶|Êú®|ÊùØ|‰Ωê|ÂüéË∑°|‰ªÆÂêç|Â§´Â©¶|Êäú|ÂïèÈ°å|‰∫å|Â≥∞|‰∏ª|Â≠êË¶è|Á¥ÖËëâ|ÂΩºÊñπ|Á©∫Âäõ|Ë°å|ÁôΩÁü≥|ÁÜ±Âäõ|Ë≤ß|‰ªò|ÂãïÂ≠¶|‰∏ÄÊñá|ÊòéÂæåÊó•|ÊâãÊåá|Âõ†|ÊâãÂ°ö|ËÄÖ|È¢®Á©¥|Âπ≥Èáé|ÊµÆ|Â≠î|Ë≠ú|Â§ß‰∫ã|‰πæ|Ê•Ω|Â•¥|Áïô|Ââµ|ÈôΩ|Â±±Èô∞|Áîü|ËÉé‰ªî|ÂõΩ|‰∏âÂçÉ|Á¥Ö|Áã¨|Ë∑Ø|Ë∂≥|ÂÄâ|ÂìÅ|Ë™≠|Âêæ|ÂåÖ|Áß¶|Ê≤ºÊ¥•|ÂãïÂêë|ÂæíÁÑ∂Ëçâ|Êñπ|Ê†Ñ‰∏âÈÉé|ÂãïÈùô|Áµå|ËÅñÂæ≥|Êó•Èñì|ÊñΩÊ•≠|‰øù|Áô∫|Á≠ã|Êàø|Ë£è|È†≠|Ê≤¢Â∫µ|Â¢ó|Èä≠|Ëä≥|Â§úË©±|Â¶Ç|Ê†πÊú¨|Âè£ËÖî|Âà©Áõä|Â∫ó|Á∂≤|Âö•‰∏ã|Â¶ª|Áôæ|Ê¥ª|Ê®©|Êú≠|‰ΩïÊôÇ|Áèæ‰∏ñ|Ë™≠Êú¨|Âûã|Â§ßÂÆ∂|ÂçÅ|‰ª£|Ë∞∑|ÊñáÊõ∏|È∫ª|Ê•≠|ÂΩ¢|‰ΩúÊ≥ï|Âæó|Áî∫ÂÆ∂|Ë≤¥Â•≥|Èô∞ÈôΩ|Êú®Ë≥™|Ëå∂ÈÅì|Ë±ö|Ëöï|Â∏Ø|ÂçÉ|‰∏ÄÊñπ|ÂÜ¨|Êµ™Êº´|ÈÇ¶|Ê≥¢|ÂøÉ‰∏≠|Âë≥|‰æø|È´òÊùë|ÁâßÂ†¥|Ë©©|Âàá|Ê¥≤|Áü≥Á∂ø|Â§¢|‰øä|Ááï|Âπª|Ê£ü|Êï∑|Ê¢Å|ÁîüÁâ©|Ê†πÊ≤ª|ÈáëËâ≤|ËÉåÁ≠ã|Â§ß|Â°ö|Èõ∑|Èñ¢|ÊÆãÂ≠ò|Á´ú|ÁÜ±|Ê®π|ÁøÅ|ÂÜ†|ÊñΩË°å|Èò≤ÈåÜ|‰∏ÄÁõÆ|Êçß|Â∑¶|ÂÖ´|Âïè|Ë•ø|‰∏Å|Â§ßË∞∑|Â∞èÂÄâ|ËçâÂú∞|Á¨†|Á≠î|ÊñáÂ≠¶|‰∏ÄÂàÜ|Êí≠"

We search our sentence data for these known heteronyms.

In [None]:
import pandas as pd
from pathlib import Path
from yomikata.config import config, logger

full_df = pd.read_csv(Path(config.SENTENCE_DATA_DIR, "all_broken_down.csv"))
len(full_df)

In [None]:
%%time
df = full_df[
    full_df["sentence"].str.contains(heteronyms)
]
len(df)
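
The filter above works because pandas `str.contains` treats its argument as a regular expression by default, so the `|`-separated heteronym string acts as an alternation. A minimal standard-library sketch of the same idea follows; the three-entry pattern and the sample sentences are made up for illustration and stand in for the full list and corpus.

```python
import re

# Hypothetical short list standing in for the full heteronym string.
heteronym_pattern = "表|今日|身体"
matcher = re.compile(heteronym_pattern)

sentences = [
    "その表を見せてください",
    "猫が好きです",
    "今日の世界情勢は",
]

# Keep only the sentences that contain at least one heteronym.
matching = [s for s in sentences if matcher.search(s)]
print(matching)
```

If a surface ever contained a regex metacharacter, wrapping each entry in `re.escape` before joining with `|` would keep the match literal.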

In [None]:
from yomikata import utils
from collections import Counter
import pandas as pd
from pathlib import Path
from yomikata.config import config, logger
import random

heteronym_dict = {}
dictionary_df = pd.read_csv(Path(config.READING_DATA_DIR, "all.csv"))
dictionary_set = set(dictionary_df.itertuples(index=False, name=None))

# for heteronym in ["有"]:
for heteronym in heteronyms.split("|"):
    furis = df.loc[df["sentence"].str.contains(heteronym), "furigana"].values
    readings = []
    for furi in furis:
        reading_list = utils.get_all_surface_readings(heteronym, furi)
        readings += reading_list
        # readings += [string for string in reading_list if "ー" not in string and (heteronym, string) in dictionary_set]
    ms = Counter(readings)
    ms = {k: v for k, v in sorted(ms.items(), key=lambda item: item[1], reverse=True)}
    print(heteronym)
    print(ms)
    heteronym_dict[heteronym] = ms
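
The counting loop above relies on `utils.get_all_surface_readings`. As a rough stand-in to show the idea, the hypothetical `surface_readings` below only collects furigana tokens whose surface equals the heteronym exactly; the real helper may well handle more cases, so treat this as a sketch rather than a description of its behaviour. The sample furigana strings are made up.

```python
import re
from collections import Counter

# Matches one furigana token of the form {surface/reading}.
FURI_TOKEN = re.compile(r"\{([^{}/]+)/([^{}/]+)\}")


def surface_readings(surface, furigana):
    """Return the readings of every token whose surface is exactly `surface`."""
    return [r for s, r in FURI_TOKEN.findall(furigana) if s == surface]


toy_furigana = [
    "{表/おもて}に{出/で}る",
    "{表/ひょう}を{見/み}る",
    "{家/いえ}の{表/おもて}",
]

counts = Counter()
for furi in toy_furigana:
    counts.update(surface_readings("表", furi))

# Readings sorted from most to least frequent, as in the loop above.
print(counts.most_common())
```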

We give up on identifying readings for which we have 40 or fewer examples, and we keep only heteronyms that still have more than one reading after this cut.

In [None]:
ncut = 40
heteronym_dict_cut = {
    k: {k2: v2 for (k2, v2) in v.items() if v2 > ncut}
    for (k, v) in heteronym_dict.items()
}
heteronym_dict_cut = {k: v for (k, v) in heteronym_dict_cut.items() if len(v) > 1}
print(len(heteronym_dict_cut))
heteronym_dict_cut
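
To make the effect of the cut concrete, here is the same two-step filter applied to a small made-up dictionary. The `cut_rare_readings` wrapper and the counts are hypothetical; the comprehension logic mirrors the cell above.

```python
def cut_rare_readings(heteronym_dict, ncut):
    """Drop readings with ncut or fewer examples, then drop words left with one reading."""
    kept = {
        surface: {reading: n for reading, n in readings.items() if n > ncut}
        for surface, readings in heteronym_dict.items()
    }
    return {surface: readings for surface, readings in kept.items() if len(readings) > 1}


# Made-up counts for illustration only.
toy_counts = {
    "表": {"おもて": 900, "ひょう": 450, "あらわ": 3},
    "猫": {"ねこ": 5000, "びょう": 12},
    "今日": {"きょう": 800, "こんにち": 40},
}

toy_cut = cut_rare_readings(toy_counts, ncut=40)
print(toy_cut)
```

Only `表` survives here: `猫` loses its rare reading and is left with a single one, and `今日` is dropped because a count of exactly 40 does not pass the strict `> ncut` comparison.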

In [None]:
utils.save_dict(heteronym_dict_cut, Path(config.CONFIG_DIR, "heteronyms.json"))

# Prepare augmented dataset

In [None]:
from pathlib import Path
from yomikata.config import config, logger
from yomikata import utils

input_files = [
    # Path(config.SENTENCE_DATA_DIR, "aozora.csv"),
    # Path(config.SENTENCE_DATA_DIR, "kwdlc.csv"),
    # Path(config.SENTENCE_DATA_DIR, "bccwj.csv"),
    Path(config.SENTENCE_DATA_DIR, "ndlbib.csv"),
]

utils.merge_csvs(input_files, Path(config.SENTENCE_DATA_DIR, "augmentation.csv"), n_header=1)
logger.info("✅ Merged sentence data!")

In [None]:
from pathlib import Path
import pandas as pd

df = pd.read_csv(Path(config.SENTENCE_DATA_DIR, "augmentation.csv"))
df_no_duplicates = df.drop_duplicates(subset=['sentence'], keep='first')
df_no_duplicates.to_csv(Path(config.SENTENCE_DATA_DIR, "augmentation_filtered.csv"), index=False)
logger.info("✅ Filtered out duplicate sentences!")

In [None]:
from pathlib import Path
import pandas as pd

augmentation_filtered_path = Path(config.SENTENCE_DATA_DIR, "augmentation_filtered.csv")
test_optimized_path = Path(config.SENTENCE_DATA_DIR, "test/test_optimized_strict_heteronyms.csv")
augmentation_filtered_cleaned_path = Path(config.SENTENCE_DATA_DIR, "augmentation_filtered_cleaned.csv")

df_augmented = pd.read_csv(augmentation_filtered_path)
df_test_optimized = pd.read_csv(test_optimized_path)

initial_row_count = len(df_augmented)

df_cleaned = df_augmented[~df_augmented['sentence'].isin(df_test_optimized['sentence'])]

final_row_count = len(df_cleaned)

logger.info(f"🔢 Number of rows removed: {initial_row_count - final_row_count}")

df_cleaned.to_csv(augmentation_filtered_cleaned_path, index=False)

logger.info("✅ Cleaned augmentation_filtered.csv and saved to augmentation_filtered_cleaned.csv!")
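
The point of the cell above is to keep test sentences out of the augmentation data so that evaluation is not contaminated. The same idea can be sketched without pandas; `drop_test_overlap` and the sample rows are hypothetical and only illustrate the set-membership check.

```python
def drop_test_overlap(augmentation_rows, test_sentences):
    """Return the augmentation rows whose sentence does not appear in the test set."""
    held_out = set(test_sentences)  # set lookup keeps the check fast
    return [row for row in augmentation_rows if row["sentence"] not in held_out]


# Made-up rows for illustration only.
toy_augmentation = [
    {"sentence": "その表を見せてください", "furigana": "その{表/ひょう}を{見/み}せてください"},
    {"sentence": "今日の世界情勢は", "furigana": "{今日/こんにち}の{世界/せかい}{情勢/じょうせい}は"},
]
toy_test = ["今日の世界情勢は"]

toy_cleaned = drop_test_overlap(toy_augmentation, toy_test)
print(len(toy_augmentation) - len(toy_cleaned), "row(s) removed")
```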

In [None]:
from pathlib import Path

from yomikata.config import config, logger
from yomikata.dataset.split import (
    check_data,
    filter_dictionary,
    filter_simple,
    optimize_furigana,
    remove_other_readings,
    split_data,
)

logger.info("Rough filtering for sentences with heteronyms")
filter_simple(
    Path(config.SENTENCE_DATA_DIR, "augmentation_filtered_cleaned.csv"),
    Path(config.SENTENCE_DATA_DIR, "augmentation_filtered_cleaned_have_heteronyms.csv"),
    config.HETERONYMS.keys(),
)

In [None]:
from pathlib import Path
from yomikata.dataset import split
from yomikata import utils
from yomikata.config import config, logger

split_dict = utils.load_dict(Path(config.BREAKDOWN_DATA_DIR, "translations.json"))
logger.info("Starting decomposition process, this may take a while...")
split.decompose_furigana(
    Path(config.SENTENCE_DATA_DIR, "augmentation_filtered_cleaned_have_heteronyms.csv"),
    Path(config.SENTENCE_DATA_DIR, "augmentation_filtered_cleaned_have_heteronyms_broken_down.csv"),
    split_dict,
)
logger.info("✅ Decomposed furigana!")

In [None]:
from pathlib import Path

from yomikata.config import config, logger
from yomikata.dataset.split import (
    check_data,
    filter_dictionary,
    filter_simple,
    optimize_furigana,
    remove_other_readings,
    split_data,
)
logger.info("Removing heteronyms with unexpected readings")
remove_other_readings(
    Path(config.SENTENCE_DATA_DIR, "augmentation_filtered_cleaned_have_heteronyms_broken_down.csv"),
    Path(config.SENTENCE_DATA_DIR, "augmentation_filtered_cleaned_have_heteronyms_broken_down_strict.csv"),
    config.HETERONYMS,
)

In [None]:
from pathlib import Path
from yomikata.config import config, logger
from yomikata import utils
import pandas as pd

df1 = pd.read_csv(Path(config.SENTENCE_DATA_DIR, "augmentation_filtered_cleaned_have_heteronyms_broken_down_strict.csv"))
df1 = df1[['sentence', 'furigana']]
temp_file = Path(config.SENTENCE_DATA_DIR, "temp_filtered.csv")
df1.to_csv(temp_file, index=False)

input_files = [
    temp_file,
    Path(config.SENTENCE_DATA_DIR, "train/train_optimized_strict_heteronyms.csv"),
]
utils.merge_csvs(input_files, Path(config.SENTENCE_DATA_DIR, "train/train_optimized_strict_heteronyms_augmented.csv"), n_header=1)

temp_file.unlink()

# Process and split data

In [None]:
from pathlib import Path

from yomikata.config import config, logger
from yomikata.dataset.split import (
    check_data,
    filter_dictionary,
    filter_simple,
    optimize_furigana,
    remove_other_readings,
    split_data,
)
from yomikata.dictionary import Dictionary

We extract from the dataset the sentences which include our heteronyms.

In [None]:
logger.info("Rough filtering for sentences with heteronyms")
filter_simple(
    Path(config.SENTENCE_DATA_DIR, "all_broken_down.csv"),
    Path(config.SENTENCE_DATA_DIR, "have_heteronyms_simple.csv"),
    config.HETERONYMS.keys(),
)

In [None]:
logger.info("Use sudachi to filter out heteronyms in known compounds")
filter_dictionary(
    Path(config.SENTENCE_DATA_DIR, "have_heteronyms_simple.csv"),
    Path(config.SENTENCE_DATA_DIR, "have_heteronyms_simple.csv"),
    config.HETERONYMS.keys(),
    Dictionary("sudachi"),
)

Finally, we remove sentences that contain only heteronyms with readings we are not trying to predict.

In [None]:
logger.info("Removing heteronyms with unexpected readings")
remove_other_readings(
    Path(config.SENTENCE_DATA_DIR, "have_heteronyms_simple.csv"),
    Path(config.SENTENCE_DATA_DIR, "optimized_strict_heteronyms.csv"),
    config.HETERONYMS,
)

After checking that our data makes sense, we do a train/val/test split.

In [None]:
test_result = check_data(
    Path(config.SENTENCE_DATA_DIR, "optimized_strict_heteronyms.csv")
)
logger.info("Performing train/val/test split")
split_data(Path(config.SENTENCE_DATA_DIR, "optimized_strict_heteronyms.csv"))

logger.info("Data splits successfully generated!")
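
For readers unfamiliar with the idea, a train/val/test split can be sketched as a seeded shuffle followed by slicing. This is not how `split_data` is implemented: the `split_rows` helper, the 80/10/10 ratios, and the seed are all assumptions chosen for illustration.

```python
import random


def split_rows(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows reproducibly and cut them into train, val, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


toy_rows = [f"sentence {i}" for i in range(100)]
toy_train, toy_val, toy_test_rows = split_rows(toy_rows)
print(len(toy_train), len(toy_val), len(toy_test_rows))
```

Every row lands in exactly one of the three lists, which is the property that matters for keeping the test set unseen during training.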

# DBERT

We train a BERT classifier model to disambiguate the heteronyms in our data. 

## Dataset Info

Before training, we run some simple checks on the dataset using the BERT tokenizer.

In [None]:
from pathlib import Path

from yomikata.config import config, logger
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={
        "train": str(Path(config.TRAIN_DATA_DIR, "train_optimized_strict_heteronyms.csv")),
        "val": str(Path(config.VAL_DATA_DIR, "val_optimized_strict_heteronyms.csv")),
        "test": str(Path(config.TEST_DATA_DIR, "test_optimized_strict_heteronyms.csv")),
    },
)
from yomikata.dbert import dBert

reader = dBert()

dataset = dataset.map(
    reader.batch_preprocess_function, batched=True, fn_kwargs={"pad": False}
)
dataset = dataset.filter(
    lambda entry: any(label != -100 for label in entry["labels"])
)
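
The filter above keeps only examples that have at least one real label. In the usual token-classification convention, a label of -100 marks positions the loss should ignore (it is the default `ignore_index` of PyTorch's cross-entropy loss), so here it marks tokens that are not heteronyms. A tiny sketch with made-up entries, where the label values themselves are hypothetical:

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def has_heteronym(entry):
    """True if at least one token in the entry carries a real class label."""
    return any(label != IGNORE_INDEX for label in entry["labels"])


# Made-up entries; the class index 7 is a placeholder.
toy_entries = [
    {"sentence": "その表を見せてください", "labels": [-100, -100, 7, -100]},
    {"sentence": "猫が好きです", "labels": [-100, -100, -100]},
]

toy_kept = [entry for entry in toy_entries if has_heteronym(entry)]
print([entry["sentence"] for entry in toy_kept])
```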

In [None]:
import numpy as np
from collections import Counter
from tqdm import tqdm

labels = []
for key in dataset.keys():
    print(f"{key} dataset has {len(dataset[key])} members")
    have_labels = [i for i in dataset[key] if np.max(i["labels"]) != -100]
    print(f"{len(have_labels)} actually contain heteronyms")
    key_length = len(dataset[key])
    for i in tqdm(range(key_length), desc="Counting labels"):
        labels += [value for value in dataset[key][i]["labels"] if value != -100]
    print("--")

label_counter = Counter(labels)

In [None]:
from collections import defaultdict
heteronyms = defaultdict(dict)

for label in label_counter:
    label_class = reader.label_encoder.index_to_class[label]
    (surface, reading) = label_class.split(":")
    heteronyms[surface][reading] = label_counter[label]

for heteronym in reader.heteronyms:
    print("heteronym:", heteronym)
    total = 0
    for reading in heteronyms[heteronym]:
        print(reading, heteronyms[heteronym][reading])
        total += heteronyms[heteronym][reading]
    print("total:", total)
    print("------------------------------")


## Train 

To train the model in the notebook:

In [None]:
from yomikata.dbert import dBert
from datasets import load_dataset
from yomikata.config import config, logger
from pathlib import Path

reader = dBert(reinitialize=True)

dataset = load_dataset(
    "csv",
    data_files={
        "train": str(Path(config.TRAIN_DATA_DIR, "train_optimized_strict_heteronyms.csv")),
        "val": str(Path(config.VAL_DATA_DIR, "val_optimized_strict_heteronyms.csv")),
        "test": str(Path(config.TEST_DATA_DIR, "test_optimized_strict_heteronyms.csv")),
    },
)

reader.train(dataset)

Alternatively, to get MLflow integration with experiment tracking and metrics, run the following from the command line:

```
source yomikata/venv/bin/activate

python yomikata/yomikata/main.py yomikata/config/dbert-train-args.json
```

## Use 

In [None]:
from pathlib import Path

from yomikata.config import config, logger
from yomikata.dbert import dBert
# from yomikata.main import get_artifacts_dir_from_run

# artifacts_dir = get_artifacts_dir_from_run("e392694b345e4ca19fd97f6a872ced98")
# reader = dBert(artifacts_dir)
reader = dBert()

from yomikata.dictionary import Dictionary

dictreader = Dictionary()

In [None]:
text = "知って備える新型インフルエンザ職場・家庭で今日からすべきこと"  # 知[し]って備[そな]える新型[しんがた]インフルエンザ職場[しょくば]・家庭[かてい]で今日[きょう]からすべきこと
print(dictreader.furigana(reader.furigana(text)))
print(dictreader.furigana(text))

In [None]:
text = "身体--我々自身がそれであるところの自然"  # 身体[しんたい]--我々[われわれ]自身[じしん]がそれであるところの自然[しぜん]
print(dictreader.furigana(reader.furigana(text)))
print(dictreader.furigana(text))

In [None]:
text = "気がついたものかそれとも偶然からか、狙われた団七がふと首をすくめたので、危うく鉄扇がその身体の上を通り越しながら、丁度並行して大坪流の秘術をつくしつつあった右側向うの、黒住団七ならぬ古高新兵衛の脇腹に、はッしと命中いたしました。"  # ,気[き]がついたものかそれとも偶然[ぐうぜん]からか、狙[ねら]われた団七[だんしち]がふと首[くび]をすくめたので、危[あや]うく鉄扇[てっせん]がその身体[からだ]の上[うえ]を通[とお]り越[こ]しながら、丁度[ちょうど]並行[へいこう]して大坪流[おおつぼりゅう]の秘術[ひじゅつ]をつくしつつあった右側[みぎがわ]向[むこ]うの、黒住[くろずみ]団七[だんしち]ならぬ古高[ふるたか]新兵衛[しんべえ]の脇腹[わきばら]に、はッしと命中[めいちゅう]いたしました。
print(dictreader.furigana(reader.furigana(text)))
print(dictreader.furigana(text))

In [None]:
text = "InterviewÈõ≤Áî∞„ÅØ„Çã„Åì:BL„Åã„Çâ„ÄéÊò≠ÂíåÂÖÉÁ¶ÑËêΩË™ûÂøÉ‰∏≠„Äè„Åæ„Åß‰∫∫Èñì„ÅÆÂÄãÊÄß„ÇíË¶ã„Å§„ÇÅ„ÇãÁ®Ä‰ª£„ÅÆÊèè„ÅçÊâã"  # ,InterviewÈõ≤Áî∞[„ÅÜ„Çì„Åß„Çì]„ÅØ„Çã„Åì:BL„Åã„Çâ„ÄéÊò≠Âíå[„Åó„Çá„ÅÜ„Çè]ÂÖÉÁ¶Ñ[„Åí„Çì„Çç„Åè]ËêΩË™û[„Çâ„Åè„Åî]ÂøÉ‰∏≠[„Åó„Çì„Åò„ÇÖ„ÅÜ]„Äè„Åæ„Åß‰∫∫Èñì[„Å´„Çì„Åí„Çì]„ÅÆÂÄãÊÄß[„Åì„Åõ„ÅÑ]„ÇíË¶ã[„Åø]„Å§„ÇÅ„ÇãÁ®Ä‰ª£[„Åç„Åü„ÅÑ]„ÅÆÊèè[„Åà„Åå]„ÅçÊâã[„Å¶]
reader.furigana(text)

In [None]:
text = "特集生成する身体"  # ,特集[とくしゅう]生成[せいせい]する身体[しんたい]
print(dictreader.furigana(reader.furigana(text)))
print(dictreader.furigana(text))

In [None]:
text = "今日の世界情勢は"
print(dictreader.furigana(reader.furigana(text)))
print(dictreader.furigana(text))

In [None]:
text = "あの力士には金星はどれぐらいある？"
print(dictreader.furigana(reader.furigana(text)))
print(dictreader.furigana(text))

In [None]:
text = "黄色と黒の組み合わせは、危険であることを表す"
reader.furigana(text)  # 表 is in the heteronym list, but 表す is correctly parsed as unambiguous

In [None]:
text = "表参道に行きます"
reader.furigana(
    text
)  # 表 is in the heteronym list, but because it appears in the compound 表参道 it is correctly recognized as something to look up in a dictionary

In [None]:
text = "その表を見せてください"
reader.furigana(text)  # Correct

In [None]:
text = "あの家の表は綺麗です"
reader.furigana(text)  # Correct

In [None]:
text = "建築表を見せてください"
reader.furigana(text)  # Failed?

# Code structure

In [None]:
from yomikata import utils
from yomikata.dbert import dBert

In [None]:
reader = dBert()

In [None]:
test_sentence = 'そして、{畳/たたみ}の{表/おもて}は、すでに{幾/いく}{年/ねん}{前/まえ}に{換/か}えられたのか{分/わか}らなかった'

In [None]:
%%time
disambiguated_sentence = reader.furigana(utils.remove_furigana(test_sentence))
print(disambiguated_sentence)

In [None]:
from yomikata.dictionary import Dictionary

dictreader = Dictionary()
dictreader.furigana(utils.remove_furigana(test_sentence))

In [None]:
dictreader.furigana(disambiguated_sentence)

In [None]:
dictreader.furigana(disambiguated_sentence) == dictreader.furigana(
    utils.remove_furigana(test_sentence)
)
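
The cells above strip the gold furigana from `test_sentence` with `utils.remove_furigana` before asking the model and the dictionary to annotate it again. A minimal regex sketch of what such stripping might look like, assuming the `{surface/reading}` notation used in `test_sentence`; `strip_furigana` is a hypothetical name and not the library function.

```python
import re

# Matches one furigana token of the form {surface/reading}.
FURI_TOKEN = re.compile(r"\{([^{}/]+)/([^{}/]+)\}")


def strip_furigana(annotated):
    """Replace each {surface/reading} token with its bare surface."""
    return FURI_TOKEN.sub(lambda match: match.group(1), annotated)


print(strip_furigana("{畳/たたみ}の{表/おもて}は"))
```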

## Test on datasets

In [None]:
from pathlib import Path

import torch
from yomikata.config import config
from yomikata.dbert import dBert
from yomikata.main import get_artifacts_dir_from_run

# artifacts_dir = get_artifacts_dir_from_run("e392694b345e4ca19fd97f6a872ced98")
# artifacts_dir = Path(
#    get_artifacts_dir_from_run("4d19dfb0d0b64b518d8e5506e3f6a726"), "checkpoint-10200"
# )

reader = dBert()

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={
        "train": str(
            Path(config.TRAIN_DATA_DIR, "train_optimized_strict_heteronyms.csv")
        ),
        "val": str(Path(config.VAL_DATA_DIR, "val_optimized_strict_heteronyms.csv")),
        "test": str(Path(config.TEST_DATA_DIR, "test_optimized_strict_heteronyms.csv")),
    },
)

dataset = dataset.map(
    reader.batch_preprocess_function, batched=True, fn_kwargs={"pad": False}
)
dataset = dataset.filter(
    lambda entry: any(label != -100 for label in entry["labels"])
)

In [None]:
from transformers import Trainer
from yomikata.custom_bert import CustomDataCollatorForTokenClassification
import evaluate

data_collator = CustomDataCollatorForTokenClassification(
    tokenizer=reader.tokenizer, padding=True
)

accuracy_metric = evaluate.load("accuracy")
recall_metric = evaluate.load("recall")

def compute_metrics(p):
    predictions, labels = p  # predictions are already the argmax of logits
    true_predictions = [pred for prediction, label in zip(predictions, labels) for pred, lab in zip(prediction, label) if lab != -100]
    true_labels = [lab for prediction, label in zip(predictions, labels) for pred, lab in zip(prediction, label) if lab != -100]
    return {"accuracy": accuracy_metric.compute(references=true_labels, predictions=true_predictions)["accuracy"], "recall": recall_metric.compute(references=true_labels, predictions=true_predictions, average="macro", zero_division=0)["recall"]}

trainer = Trainer(
    model=reader.model,
    tokenizer=reader.tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=lambda logits, _: torch.argmax(logits, dim=-1)
)
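
What `compute_metrics` does can be shown without any libraries: drop every position labelled -100, then compare the remaining predictions with the references. The sketch below is an assumption-laden illustration, with hypothetical helper names and made-up class indices; its macro recall averages only over classes that occur in the references, which may differ in edge cases from the `evaluate` metric used above.

```python
def flatten(predictions, labels, ignore_index=-100):
    """Pair up predictions and labels, skipping positions marked with ignore_index."""
    pairs = [
        (pred, lab)
        for pred_row, label_row in zip(predictions, labels)
        for pred, lab in zip(pred_row, label_row)
        if lab != ignore_index
    ]
    return [p for p, _ in pairs], [l for _, l in pairs]


def accuracy(preds, refs):
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def macro_recall(preds, refs):
    """Unweighted mean of per-class recall over the classes present in refs."""
    recalls = []
    for cls in sorted(set(refs)):
        total = sum(r == cls for r in refs)
        hits = sum(p == cls and r == cls for p, r in zip(preds, refs))
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Made-up class indices; -100 marks tokens that are not heteronyms.
toy_predictions = [[0, 3, 1], [3, 2, 0]]
toy_labels = [[-100, 3, 3], [3, 2, -100]]

flat_preds, flat_refs = flatten(toy_predictions, toy_labels)
print(accuracy(flat_preds, flat_refs), macro_recall(flat_preds, flat_refs))
```

Macro recall weights every class equally, so a rare reading that is always missed pulls the score down even when overall accuracy is high.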

In [None]:
%%time
import numpy as np
from yomikata.config import logger

reader.model.eval()
full_performance = {}
# for key in dataset.keys():
for key in ["test"]:
    max_evals = min(1000000, len(dataset[key]))
    # max_evals = len(dataset[key])
    logger.info(f"getting predictions for {key}")
    subset = dataset[key].shuffle().select(range(max_evals))
    prediction_output = trainer.predict(subset)
    logger.info(f"processing predictions for {key}")
    metrics = prediction_output[2]
    labels = prediction_output[1]

    logger.info("processing performance")
    performance = {
        heteronym: {
            "n": 0,
            "readings": {
                reading: {
                    "n": 0,
                    "found": {readingprime: 0 for readingprime in list(reader.heteronyms[heteronym].keys())}
                }
                for reading in list(reader.heteronyms[heteronym].keys())
            },
        }
        for heteronym in reader.heteronyms.keys()
    }

    flattened_logits = [
        logit
        for sequence_logits, sequence_labels in zip(prediction_output[0], labels)
        for (logit, l) in zip(sequence_logits, sequence_labels) if l != -100
    ]  # predictions were already argmaxed in preprocess_logits_for_metrics, so this is a flat list of class indices; valid_mask processing in CustomBertForTokenClassification.forward takes care of zeroing out irrelevant logits

    true_labels = [
        str(reader.label_encoder.index_to_class[l])
        for label in labels
        for l in label if l != -100
    ]

    for i, true_label in enumerate(true_labels):
        (true_surface, true_reading) = true_label.split(":")
        performance[true_surface]["n"] += 1
        performance[true_surface]["readings"][true_reading]["n"] += 1
        predicted_label = reader.label_encoder.index_to_class[flattened_logits[i]]
        predicted_reading = predicted_label.split(":")[1]
        performance[true_surface]["readings"][true_reading]["found"][predicted_reading] += 1

    for surface in performance:
        for true_reading in performance[surface]["readings"]:
            true_count = performance[surface]["readings"][true_reading]["n"]
            predicted_count = performance[surface]["readings"][true_reading]["found"][true_reading]
            performance[surface]["readings"][true_reading]["accuracy"] = predicted_count / true_count if true_count > 0 else "NaN"
        correct_count = sum(performance[surface]["readings"][true_reading]["found"][true_reading] for true_reading in performance[surface]["readings"])
        all_count = performance[surface]["n"]
        performance[surface]["accuracy"] = correct_count / all_count if all_count > 0 else "NaN"

    performance = {
        "metrics": metrics,
        "heteronym_performance": performance,
    }

    full_performance[key] = performance

full_performance
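
The nested `performance` dictionary built above is essentially one small confusion table per heteronym. A compact sketch of that bookkeeping, assuming class names of the form `surface:reading` as in the cell above; the helper names and the sample labels are hypothetical.

```python
from collections import defaultdict


def confusion_by_surface(true_labels, predicted_labels):
    """Count, per surface, how often each true reading was predicted as each reading."""
    table = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for true_label, predicted_label in zip(true_labels, predicted_labels):
        surface, true_reading = true_label.split(":")
        predicted_reading = predicted_label.split(":")[1]
        table[surface][true_reading][predicted_reading] += 1
    return table


def surface_accuracy(table, surface):
    """Fraction of occurrences of a surface whose reading was predicted correctly."""
    rows = table[surface]
    total = sum(sum(found.values()) for found in rows.values())
    correct = sum(found.get(reading, 0) for reading, found in rows.items())
    return correct / total if total else None


# Made-up labels for illustration only.
toy_true = ["表:おもて", "表:ひょう", "表:おもて", "今日:きょう"]
toy_pred = ["表:おもて", "表:おもて", "表:おもて", "今日:こんにち"]

toy_table = confusion_by_surface(toy_true, toy_pred)
print(surface_accuracy(toy_table, "表"), surface_accuracy(toy_table, "今日"))
```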

# Performance for dictionary 

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
from yomikata.config import config, logger
from speach.ttlig import RubyFrag, RubyToken
from yomikata import utils
from yomikata.dictionary import Dictionary
from yomikata.dataset import breakdown
from yomikata.dataset.split import replace_furigana

reader = Dictionary("sudachi")
heteronyms = config.HETERONYMS

In [None]:
filename = Path(config.TEST_DATA_DIR, "test_optimized_strict_heteronyms.csv")
max_evals = 1000000
df = pd.read_csv(filename, header=0)
df = df.sample(frac=1)
if max_evals is not None:
    max_evals = max(max_evals, 1)
    max_evals = min(max_evals, len(df))
    df = df.head(max_evals)

df["furigana_found"] = df.apply(
    lambda x: reader.furigana(utils.standardize_text(x["sentence"])), axis=1
)

sentences = df["furigana_found"].tolist()
sentences += df["furigana"].tolist()
(split_dict, no_translation) = breakdown.sentence_list_to_breakdown_dictionary(sentences)

In [None]:
df["furigana_found"] = df["furigana_found"].apply(
    lambda s: replace_furigana(s, split_dict)
)

In [None]:
from tqdm import tqdm
performance = {
    heteronym: {
        "n": 0,
        "readings": {
            reading: {
                "n": 0,
                "found": {
                    readingprime: 0
                    for readingprime in list(heteronyms[heteronym].keys()) + ["<OTHER>"]
                },
            }
            for reading in list(heteronyms[heteronym].keys())
        },
    }
    for heteronym in heteronyms.keys()
}
failures = 0
for i, row in tqdm(df.iterrows(), total=df.shape[0], desc="processing performance"):
    matches = utils.find_all_substrings(row["sentence"], heteronyms.keys())
    furis_true = utils.get_furis(row["furigana"])
    furis_found = utils.get_furis(row["furigana_found"])
    failure = False
    for location in matches:
        surface = matches[location]
        reading_true = utils.get_reading_from_furi(location, len(surface), furis_true)
        if not reading_true:
            continue
        reading_found = utils.get_reading_from_furi(location, len(surface), furis_found)
        if not reading_found:
#            print(location, surface, row["furigana"], row["furigana_found"], row["sentence"])
            failure = True
        performance[surface]["n"] += 1
        if (reading_true in performance[surface]["readings"].keys()):
            found_reading = reading_found if reading_found in performance[surface]["readings"].keys() else "<OTHER>"
            performance[surface]["readings"][reading_true]["n"] += 1
            performance[surface]["readings"][reading_true]["found"][found_reading] += 1
    if failure:
        failures += 1
n = 0
correct = 0
for surface in performance.keys():
    for true_reading in performance[surface]["readings"].keys():
        performance[surface]["readings"][true_reading]["accuracy"] = np.round(
            performance[surface]["readings"][true_reading]["found"][true_reading]
            / np.array(performance[surface]["readings"][true_reading]["n"]),
            3,
        )

    performance[surface]["accuracy"] = np.round(
        sum(
            performance[surface]["readings"][true_reading]["found"][true_reading]
            for true_reading in performance[surface]["readings"].keys()
        )
        / np.array(performance[surface]["n"]),
        3,
    )

    correct += sum(
        performance[surface]["readings"][true_reading]["found"][true_reading]
        for true_reading in performance[surface]["readings"].keys()
    )
    n += performance[surface]["n"]

In [None]:
print(failures, len(df))

In [None]:
print("Total accuracy:", correct/n)

In [None]:
print({key: performance[key]["accuracy"] for key in performance.keys()})

In [None]:
performance

# Details of classifying based on textual embeddings

With the T5 model I am fine-tuning the whole encoder-decoder architecture to encode the embeddings and then output the correct reading for every token. This essentially assumes that every token can be ambiguous and can take any possible reading.

The Amazon paper does something simpler. It takes the BERT encodings as input and, for ambiguous tokens only, trains a small classifier to choose between two or three readings. Let's look at some tokenizations to see whether this approach is feasible for Japanese.
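
As a rough illustration of the simpler approach, the sketch below treats the contextual embedding of an ambiguous token as a feature vector and picks between two readings. It uses a nearest-centroid rule as a stand-in for a trained classifier. The function names and the three-dimensional toy vectors are illustrative assumptions, not the paper's method, not Yomikata's code, and not real BERT output.

```python
def fit_centroids(vectors, readings):
    """Average the training vectors of each reading into one centroid per reading."""
    sums = {}
    counts = {}
    for vec, reading in zip(vectors, readings):
        if reading not in sums:
            sums[reading] = [0.0] * len(vec)
            counts[reading] = 0
        sums[reading] = [s + v for s, v in zip(sums[reading], vec)]
        counts[reading] += 1
    return {r: [s / counts[r] for s in total] for r, total in sums.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_reading(vector, centroids):
    """Pick the reading whose centroid lies closest to the vector."""
    return min(centroids, key=lambda r: squared_distance(vector, centroids[r]))


# Toy stand-ins for contextual embeddings of 金星 (hypothetical values).
train_vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
train_readings = ["きんせい", "きんせい", "きんぼし", "きんぼし"]

centroids = fit_centroids(train_vectors, train_readings)
print(predict_reading([0.7, 0.3, 0.0], centroids))
```

A classifier like this only works if the embeddings for the different readings actually separate, which is what the rest of this section checks.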

## Proof of concept: Do contextual embeddings significantly differ for heteronyms?

In [None]:
word = "金星"

In [None]:
text1 = "金星は太陽系で太陽に近い方から2番目の惑星。"
text2 = "金星とは、大相撲で、平幕の力士が横綱と取組をして勝利することである。"
texts = [text1, text2]

In [None]:
from yomikata.dictionary import DictionaryReader

DicReader = DictionaryReader()
for text in texts:
    print(DicReader.tagger(text))

In [None]:
# Based on the tokenizer results below, 平幕 appears to be in unidic but not in unidic_lite

In [None]:
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")

In [None]:
%%time
for text in texts:
    text_encoded = tokenizer(
        text,
        add_special_tokens=False,
    )
    input_ids = text_encoded["input_ids"]
    input_mask = text_encoded["attention_mask"]
    print(input_ids)
    print(tokenizer.convert_ids_to_tokens(input_ids))
    print(tokenizer.decode(input_ids))

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("cl-tohoku/bert-base-japanese-v2")
model.eval();

In [None]:
for text in texts:
    text_encoded = tokenizer(
        text,
        max_length=16,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
        add_special_tokens=False,
    )  # needs to be pytorch tensors
    input_ids = text_encoded["input_ids"]
    input_mask = text_encoded["attention_mask"]

    print(input_ids.shape)

    outputs = model(input_ids=input_ids, attention_mask=input_mask)

    print(outputs.last_hidden_state)
    print(outputs.last_hidden_state.shape)
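
One way to quantify whether the two contexts produce different embeddings is to compare the vectors at the heteronym's position using cosine similarity. This is a minimal sketch, assuming the two vectors have already been pulled out of `outputs.last_hidden_state`. The short lists below are placeholder values, not real model output, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for the embedding of 金星 in text1 and text2.
vec_planet = [0.2, 0.7, -0.1, 0.4]
vec_sumo = [0.6, -0.3, 0.5, 0.1]

similarity = cosine_similarity(vec_planet, vec_sumo)
print(f"cosine similarity: {similarity:.3f}")
```

A value close to 1 would suggest the model barely distinguishes the two contexts; a clearly lower value would suggest the contexts are encoded differently.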

## Embedding visualization

In [None]:
import numpy as np
from transformers import BertJapaneseTokenizer, BertModel

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")
model = BertModel.from_pretrained("cl-tohoku/bert-base-japanese-v2")
model.eval()

words = np.array(list(tokenizer.vocab.keys()))
wordembs = model.embeddings.word_embeddings.weight

In [None]:
print(wordembs.shape)  # 32768 is the vocab size and 768 the embedding dimension

In [None]:
wordembs = wordembs.detach().numpy()

In [None]:
# Determine vocabulary to use for t-SNE/visualization. The indices are hard-coded based partially on inspection:
char_indices_to_use = np.arange(851, 1063, 1)
voc_indices_to_plot = np.append(char_indices_to_use, np.arange(23000, 27000, 1))
voc_indices_to_use = np.append(char_indices_to_use, np.arange(17000, 27000, 1))

In [None]:
print(len(voc_indices_to_plot))
print(len(voc_indices_to_use))

In [None]:
# list(words[voc_indices_to_use])

In [None]:
wordembs_to_use = wordembs[voc_indices_to_use]

In [None]:
from sklearn.manifold import TSNE

# Run t-SNE on the BERT vocabulary embeddings we selected:
mytsne_words = TSNE(n_components=2, early_exaggeration=12, metric="cosine", init="pca")
wordembs_to_use_tsne = mytsne_words.fit_transform(wordembs_to_use)

In [None]:
wordembs_to_use.shape

In [None]:
wordembs_to_use

In [None]:
words_to_plot = words[voc_indices_to_plot]
print(len(words_to_plot))

In [None]:
# Plot the transformed BERT vocabulary embeddings:
import japanize_matplotlib
import matplotlib.pyplot as plt

plt.rcParams["font.family"] = "VL Gothic"

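fig = plt.figure(figsize=(100, 60))
# The t-SNE rows follow voc_indices_to_use, so look up the row of each plotted
# vocabulary index instead of assuming row i belongs to words_to_plot[i].
tsne_rows = np.flatnonzero(np.isin(voc_indices_to_use, voc_indices_to_plot))
alltexts = list()
for row, txt in zip(tsne_rows, words_to_plot):
    plt.scatter(wordembs_to_use_tsne[row, 0], wordembs_to_use_tsne[row, 1], s=0)
    currtext = plt.text(wordembs_to_use_tsne[row, 0], wordembs_to_use_tsne[row, 1], txt)
    alltexts.append(currtext)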


# Save the plot before adjusting.
plt.savefig("japanese-viz-bert-voc-noadj.pdf", format="pdf")
# print('now running adjust_text')
# Using autoalign often works better in my experience, but it can be very slow for this case, so it's false by default below:
# numiters = adjust_text(alltexts, autoalign=True, lim=50)
# from adjustText import adjust_text
# numiters = adjust_text(alltexts, autoalign=False, lim=50)
# print('done adjust text, num iterations: ', numiters)
# plt.savefig('japanese-viz-bert-voc-tsne10k-viz4k-adj50.pdf', format='pdf')

# plt.show()

In [None]:
### Sample code for listing the available fonts
# import matplotlib.pyplot as plt
# import matplotlib.font_manager as fm
# import numpy as np

# fonts = list(np.unique([f.name for f in fm.fontManager.ttflist]))

# fig = plt.figure(figsize=(8, 100))
# ax = fig.add_subplot(1, 1, 1)
# ax.set_ylim([-1, len(fonts)])
# ax.set_yticks(np.arange(0, len(fonts), 10))

# for i, f in enumerate(fonts):
#     ax.text(0.2, i,  '日本語強 {}'.format(f), fontdict={'family': f, 'fontsize': 14})

# plt.show()

In [None]:
from pathlib import Path

import pandas as pd
from yomikata.config import config, logger

df = pd.read_csv(Path(config.SENTENCE_DATA_DIR, "aozora.csv"))

In [None]:
# Candidate heteronyms with their readings; only the last assignment takes effect.
word = "市場"
word_classes = ["しじょう", "いちば"]
word = "礼拝"
word_classes = ["れいはい", "らいはい"]
word = "今日"
word_classes = ["きょう", "こんにち"]
word = "表"
word_classes = ["ひょう", "おもて"]
word = "仮名"
word_classes = ["かな", "かめい"]
word = "変化"
word_classes = ["へんか", "へんげ"]

In [None]:
from yomikata.heteronyms import heteronyms

print(heteronyms[heteronyms["surface"] == word])

from pathlib import Path

pronunciation_df = pd.read_csv(Path(config.PRONUNCIATION_DATA_DIR, "all.csv"))
print(pronunciation_df[pronunciation_df["surface"] == word]["pronunciations"].values)

In [None]:
df_keyword = df[df["sentence"].str.contains(word)]
df_keyword = df_keyword.reset_index(drop=True)
window_size = 128


def context_window(sentence):
    # Keep up to window_size characters before the keyword and window_size from its start.
    idx = sentence.index(word)
    return sentence[max(0, idx - window_size) : idx + window_size]


df_keyword["sentence-shorter"] = df_keyword["sentence"].apply(context_window)
print(len(df_keyword))

In [None]:
def reading_matcher(furigana, word, word_classes):
    try:
        shifted_furigana = furigana[furigana.index(word) :]
    except ValueError:
        print(word)
        print(furigana)
        return -1
    found_reading = shifted_furigana[
        shifted_furigana.index("[") + 1 : shifted_furigana.index("]")
    ]
    # print(found_reading)
    for reading in word_classes:
        if found_reading.find(reading) != -1:
            return reading
    return -1

In [None]:
df_keyword["reading"] = df_keyword["furigana"].apply(
    lambda sentence: reading_matcher(sentence, word, word_classes)
)

In [None]:
# TODO: Improve the code for classifying words with furigana into one of the reading classes.
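
One possible direction for this TODO is to match the annotation with a regular expression instead of string offsets, so that every occurrence of the word is checked and a missing bracket no longer raises. The sketch below assumes the same `word[reading]` bracket notation that `reading_matcher` relies on. The helper name `find_readings` and the example sentence are hypothetical.

```python
import re


def find_readings(furigana, word, word_classes):
    """Return the reading of every `word[reading]` annotation in the sentence.

    Assumes the reading follows the word directly in square brackets,
    e.g. 変化[へんか]. Readings outside word_classes are reported as None.
    """
    pattern = re.compile(re.escape(word) + r"\[([^\]]+)\]")
    readings = []
    for match in pattern.finditer(furigana):
        reading = match.group(1)
        readings.append(reading if reading in word_classes else None)
    return readings


# Hypothetical annotated sentence containing the word twice.
example = "妖怪[ようかい]変化[へんげ]の話[はなし]と気温[きおん]の変化[へんか]"
print(find_readings(example, "変化", ["へんか", "へんげ"]))
```

Unlike `reading_matcher`, this version skips compounds such as 変化球[へんかきゅう], where the reading belongs to a longer word. Whether that is desirable depends on how the corpus annotates compounds.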

In [None]:
for word_class in word_classes:
    print(f"{word_class} {len(df_keyword[df_keyword['reading'] == word_class])}")
print("failures", len(df_keyword[df_keyword["reading"] == -1]))
df_keyword[df_keyword["reading"] == -1]

In [None]:
df_keyword = df_keyword[df_keyword["reading"] != -1]

In [None]:
word_id = tokenizer.encode(word, add_special_tokens=False)[0]
pad_size = 32
df_keyword["sentence-encoded"] = df_keyword["sentence-shorter"].apply(
    lambda sentence: tokenizer.encode(
        sentence,
        add_special_tokens=False,
        max_length=pad_size,
        truncation=True,
        padding="max_length",
    )
)
df_keyword["encoding-success"] = df_keyword["sentence-encoded"].apply(
    lambda encoding: word_id in encoding
)
print(len(df_keyword[~df_keyword["encoding-success"]]), "encoding failures")
df_keyword = df_keyword[df_keyword["encoding-success"]]
df_keyword = df_keyword.reset_index(drop=True)
df_keyword["keyword-index"] = df_keyword["sentence-encoded"].apply(
    lambda encoding: encoding.index(word_id)
)

In [None]:
encoding_stack = np.vstack(df_keyword["sentence-encoded"])

In [None]:
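import torch

input_ids = torch.tensor(encoding_stack)
attention_mask = (input_ids != tokenizer.pad_token_id).long()  # ignore padding
with torch.no_grad():
    forward_pass = model(input_ids=input_ids, attention_mask=attention_mask)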

In [None]:
np.shape(forward_pass[0])

In [None]:
embs = []
for i in range(len(df_keyword)):
    embs.append(forward_pass[0][i][df_keyword.at[i, "keyword-index"]].detach().numpy())
embs = np.array(embs)

In [None]:
from sklearn.manifold import TSNE

# Run t-SNE on the contextualized embeddings:
mytsne_tokens = TSNE(
    n_components=2,
    early_exaggeration=12,
    verbose=2,
    metric="cosine",
    init="pca",
    n_iter=2000,  # this argument is named max_iter in scikit-learn >= 1.5
)
embs_tsne = mytsne_tokens.fit_transform(embs)

In [None]:
# Plot the contextual embeddings as points, colored by reading.
import japanize_matplotlib
import matplotlib.pyplot as plt

plt.rcParams["font.family"] = "VL Gothic"

colors = ["red", "black", "blue", "green"]
classes = list(df_keyword["reading"].unique())

cs = [
    colors[classes.index(df_keyword["reading"].iloc[i])] for i in range(len(df_keyword))
]

fig = plt.figure(figsize=(6, 4))
plt.scatter(embs_tsne[:, 0], embs_tsne[:, 1], s=1, color=cs)

plt.savefig("japanese-viz-bert-ctx-points-" + word + ".pdf", format="pdf")
plt.savefig("japanese-viz-bert-ctx-points-" + word + ".png", format="png")

plt.show()

In [None]:
# Plot the keyword+context strings.
# import matplotlib.pyplot as plt
# import japanize_matplotlib

# plt.rcParams["font.family"] = "VL Gothic"

# colors = ['red', 'black']
# classes = list(df_keyword['reading'].unique())
# fig = plt.figure(figsize=(50, 30))
# alltexts = list()
# for i, txt in enumerate(df_keyword['sentence-shorter']):
#     if i % 100 == 0:
#         print(i)
#     plt.scatter(embs_tsne[i,0], embs_tsne[i,1], s=0)
#     c = colors[classes.index(df_keyword['reading'].iloc[i])]
#     currtext = plt.text(embs_tsne[i,0], embs_tsne[i,1], txt, color=c)
#     #alltexts.append(currtext)

# plt.savefig('japanese-viz-bert-ctx-text-'+word+'.pdf', format='pdf')
# # print('now running adjust_text')
# #numiters = adjust_text(alltexts, autoalign=True, lim=50)
# #numiters = adjust_text(alltexts, autoalign=False, lim=50)
# #print('done adjust text, num iterations: ', numiters)
# #plt.savefig('viz-bert-ctx-values-viz750-adj.pdf', format='pdf')

# plt.show()

## Handling out of vocab heteronyms

In [None]:
text = "その力士には金星が多くて大人気。"
text = "一時"  # overrides the sentence above; comment out to inspect the full sentence

In [None]:
from yomikata.dictionary import DictionaryReader

DicReader = DictionaryReader()
DicReader.tagger(text)

Here we see a problem: the ambiguous word 大人気 is split into two tokens. Does BERT use the same tokenizer? (It uses unidic-lite.)
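
When a heteronym is split like this, anything that classifies it needs to know which tokens cover it. The sketch below locates that span in a WordPiece-style token list by rebuilding character offsets. The token list is a hand-written assumption about how the sentence might be split, not actual tokenizer output, and `find_token_span` is a hypothetical helper.

```python
def find_token_span(tokens, word):
    """Return (start, end) token indices covering the first occurrence of word.

    `end` is exclusive. Subword pieces are assumed to carry a leading "##",
    as in WordPiece vocabularies. Returns None if the word is not found or
    does not align with token boundaries.
    """
    pieces = [t[2:] if t.startswith("##") else t for t in tokens]
    text = "".join(pieces)
    char_start = text.find(word)
    if char_start == -1:
        return None
    char_end = char_start + len(word)

    # Character offset at which each token begins, plus the final end offset.
    offsets = [0]
    for piece in pieces:
        offsets.append(offsets[-1] + len(piece))

    if char_start not in offsets or char_end not in offsets:
        return None
    return offsets.index(char_start), offsets.index(char_end)


# Assumed tokenization of the example sentence (for illustration only).
tokens = ["その", "力士", "に", "は", "金星", "が", "多く", "て", "大", "人気", "。"]
print(find_token_span(tokens, "大人気"))
```

With a span in hand, the embeddings of the covered tokens could be pooled (for example averaged) before classification. The alternative explored below is to add the word to the vocabulary so it becomes a single token.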

In [None]:
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")

In [None]:
%%time
text_encoded = tokenizer(
    text,
    add_special_tokens=False,
)
input_ids = text_encoded["input_ids"]
input_mask = text_encoded["attention_mask"]
print(input_ids)
print(tokenizer.convert_ids_to_tokens(input_ids))
tokenizer.decode(input_ids)

In [None]:
"一時" in list(tokenizer.vocab.keys())

In [None]:
tokenizer.vocab["一時"]

In [None]:
tokenizer.encode("一時")

In [None]:
len(tokenizer)

In [None]:
tokenizer.add_tokens(["一時"])

In [None]:
tokenizer.decode(tokenizer.encode(["一時"], add_special_tokens=False))

In [None]:
len(tokenizer)

Note that this is not a contextual embedding yet. Let's look at it after contextualizing.

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("cl-tohoku/bert-base-japanese-v2")
model.eval();

In [None]:
text_encoded = tokenizer(
    text,
    # max_length=4,
    # truncation=True,
    # padding="max_length",
    return_tensors="pt",
    add_special_tokens=False,
)  # needs to be pytorch tensors
input_ids = text_encoded["input_ids"]
input_mask = text_encoded["attention_mask"]

print(input_ids.shape)

outputs = model(input_ids=input_ids, attention_mask=input_mask)

print(outputs.last_hidden_state)
print(outputs.last_hidden_state.shape)

Now let's add a word to the vocabulary.

In [None]:
tokenizer.add_tokens(["大人気"])
model.resize_token_embeddings(
    len(tokenizer)
)  # Resize the dictionary size of the embedding layer

In [None]:
len(tokenizer)

In [None]:
%%time
text_encoded = tokenizer(
    text,
    add_special_tokens=False,
)
input_ids = text_encoded["input_ids"]
input_mask = text_encoded["attention_mask"]
print(input_ids)
print(tokenizer.convert_ids_to_tokens(input_ids))
tokenizer.decode(input_ids)

In [None]:
text_encoded = tokenizer(
    text,
    # max_length=4,
    # truncation=True,
    # padding="max_length",
    return_tensors="pt",
    add_special_tokens=False,
)  # needs to be pytorch tensors
input_ids = text_encoded["input_ids"]
input_mask = text_encoded["attention_mask"]

print(input_ids.shape)

outputs = model(input_ids=input_ids, attention_mask=input_mask)

print(outputs.last_hidden_state)
print(outputs.last_hidden_state.shape)