<a id='top'></a><a name='top'></a>
# Chapter 2: Tokenization, Morphological Analysis, and Dependency Parsing

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/gbih/nlp/blob/main/ja_nlp_book/chp02_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

* [2.0 Imports and Setup](#2.0)
* [2.1 An Introduction to fugashi](#2.1)
    - [2.1.1 Setup](#2.1.1)
    - [2.1.2 Morphological Analysis Mini Project: Automatic Fuseji](#2.1.2)
    - [2.1.3 Censoring Unknown Words](#2.1.3)
    - [2.1.4 Use Readings to Censor only Part of Words](#2.1.4)
* [2.2 Improving Tokenization Quality with a User Dictionary](#2.2)
    - [2.2.1 Why Make a Custom Tokenizer Dictionary?](#2.2.1)
    - [2.2.2 Generating a MeCab User Dictionary](#2.2.2)
    - [2.2.3 Creating a SudachiPy User Dictionary](#2.2.3)
    - [2.2.4 Sourcing Your Own Data](#2.2.4)
    - [2.2.5 Sourcing Internet Data](#2.2.5)

---
<a name='2.0'></a><a id='2.0'></a>
# 2.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
from pathlib import Path

data_root = Path("chp02")
req_file = data_root / "requirements_2.txt"

if not data_root.is_dir():
    data_root.mkdir()
else:
    print(f"{data_root} exists.")

In [2]:
%%writefile {req_file}
fugashi[unidic]==1.2.1
watermark==2.3.1

Writing chp02/requirements_2.txt


In [3]:
# unidic==1.1.0
import sys
import os
check1 = ('google.colab' in sys.modules)
check2 = (os.environ.get('CLOUDSDK_CONFIG')=='/content/.config')
IS_COLAB = check1 or check2

if IS_COLAB:
    print("Installing packages")
    !pip install --quiet -r {req_file}
    !python -m unidic download
    print("Packages installed.")
else:
    print("Running locally.")

Installing packages
  Building wheel for unidic (setup.py) ... done
download url: https://cotonoha-dic.s3-ap-northeast-1.amazonaws.com/unidic-3.1.0.zip
Dictionary version: 3.1.0+2021-08-31
Downloading UniDic v3.1.0+2021-08-31...
unidic-3.1.0.zip: 100% 526M/526M [00:27<00:00, 19.4MB/s]
Finished download.
Downloaded UniDic v3.1.0+2021-08-31 to /usr/local/lib/python3.8/dist-packages/unidic/dicdir
Packages installed.


In [4]:
# Standard library imports
from importlib.metadata import version
from random import sample
import os
import sys

# Third-party imports
import fugashi
from fugashi import Tagger
from watermark import watermark

def HR():
    print("-"*50)

# Examine all imported packages
print(watermark(iversions=True, globals_=globals(),python=True, machine=True))

Python implementation: CPython
Python version       : 3.8.16
IPython version      : 7.9.0

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.10.133+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

sys    : 3.8.16 (default, Dec  7 2022, 01:12:13) 
[GCC 7.5.0]
fugashi: 1.2.1



In [5]:
assert version('fugashi') == '1.2.1'

print("Successfully imported specified packages.")

Successfully imported specified packages.


---
<a name='2.1'></a><a id='2.1'></a>
# 2.1 An Introduction to fugashi
<a href="#top">[back to top]</a>

Adapted from [2.1-fugashi-fuseji.ipynb](https://github.com/octanove/janlpbook-code/blob/main/en/2.1-fugashi-fuseji.ipynb) by Paul O'Leary McCann and Masato Hagiwara 


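fugashi is a wrapper for MeCab and needs a MeCab-format dictionary to work. Four dictionaries are commonly used with it:

1. JumanDic
2. UniDic
3. unidic-lite
4. IPAdic

UniDic and unidic-lite can be installed as extras of the fugashi package (`fugashi[unidic]`, `fugashi[unidic-lite]`). This notebook uses the full UniDic.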

**Reference**

https://github.com/polm/fugashi

<a name='2.1.1'></a><a id='2.1.1'></a>
## 2.1.1 Setup
<a href="#top">[back to top]</a>

With the packages installed as above, test fugashi from the command line:

In [6]:
!echo "毎年東麻布ではかかし祭りが開催されます" | fugashi -O wakati

毎年 東 麻布 で は かかし祭り が 開催 さ れ ます


In [7]:
tagger = Tagger()

text = "形態素解析をやってみた"
words = tagger(text)
print(words)
HR()

for word in words:
    print(word.surface, word.feature.lemma, word.feature.kana, sep="\t")

[形態, 素, 解析, を, やっ, て, み, た]
--------------------------------------------------
形態	形態	ケイタイ
素	素	ソ
解析	解析	カイセキ
を	を	ヲ
やっ	遣る	ヤッ
て	て	テ
み	見る	ミ
た	た	タ


<a name='2.1.2'></a><a id='2.1.2'></a>
## 2.1.2 Morphological Analysis Mini Project: Automatic Fuseji
<a href="#top">[back to top]</a>

In [8]:
tagger = Tagger()

def fuseji_node(text, ratio=1.0):
    """Replace a random subset of the characters in a token's text with filler characters."""
    ll = len(text)
    idxs = sample(range(ll), max(1, int(ratio * ll)))
    out = []
    for ii, cc in enumerate(text):
        out.append("◯" if ii in idxs else cc)
    return "".join(out)


def fuseji_text(text, ratio=1.0):
    """Given an input string, apply fuseji to proper nouns."""
    out = []
    for node in tagger(text):
        # Normal Japanese text doesn't use white space, but this is necessary
        # if you include Latin text, for example.
        out.append(node.white_space)
        if node.feature.pos2 != "固有名詞":
            out.append(node.surface)
        else:
            # Pass ratio through so the parameter actually takes effect.
            out.append(fuseji_node(node.surface, ratio))
    return "".join(out)

print(fuseji_text("犯人はヤス"))
print(fuseji_text("東京タワーの高さは333m"))

犯人は◯◯
◯◯タワーの高さは333m
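
How many characters `fuseji_node` masks follows from `max(1, int(ratio * ll))`: at least one character is always replaced, and the count is rounded down. The sketch below restates that logic as a standalone function so the effect of `ratio` can be checked without a tokenizer. The name `mask_chars`, the `filler` and `rng` parameters, and the clamp for ratios above 1.0 are illustrative additions, not part of the original notebook.

```python
import random

def mask_chars(text, ratio=1.0, filler="◯", rng=random):
    """Replace a fraction of the characters in `text` with `filler`.

    Follows the logic of fuseji_node: at least one character is masked
    and the count is rounded down. Unlike fuseji_node, the count is also
    clamped to len(text) so that ratio > 1.0 does not raise an error.
    """
    n = len(text)
    if n == 0:
        return text
    count = min(n, max(1, int(ratio * n)))
    masked = set(rng.sample(range(n), count))
    return "".join(filler if i in masked else ch for i, ch in enumerate(text))

rng = random.Random(0)  # seeded so repeated runs pick the same positions
for ratio in (0.0, 0.5, 1.0):
    print(ratio, mask_chars("ヒラケゴマ", ratio, rng=rng))
```

For a five-character word this masks one, two, and five characters respectively. Which positions are chosen depends on the random generator.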


In [9]:
!echo "毎年東麻布ではかかし祭りが開催されます" | fugashi

毎年	名詞,普通名詞,副詞可能,,,,マイトシ,毎年,毎年,マイトシ,毎年,マイトシ,混,"","","","","","",体,マイトシ,マイトシ,マイトシ,マイトシ,"0","C2","",9737558477120000,35425
東	名詞,普通名詞,一般,,,,ヒガシ,東,東,ヒガシ,東,ヒガシ,和,"","","","","","",体,ヒガシ,ヒガシ,ヒガシ,ヒガシ,"0,3","C2","",8566303715631616,31164
麻布	名詞,固有名詞,地名,一般,,,アザブ,アザブ,麻布,アザブ,麻布,アザブ,固,"","","","","","",地名,アザブ,アザブ,アザブ,アザブ,"0","","",163560978260480,595
で	助詞,格助詞,,,,,デ,で,で,デ,で,デ,和,"","","","","","",格助,デ,デ,デ,デ,"","動詞%F2@0,名詞%F1","",7014343053025792,25518
は	助詞,係助詞,,,,,ハ,は,は,ワ,は,ワ,和,"","","","","","",係助,ハ,ハ,ハ,ハ,"","動詞%F2@0,名詞%F1,形容詞%F2@-1","",8059703733133824,29321
かかし祭り	名詞,普通名詞,一般,,,,カカシマツリ,案山子祭り,かかし祭り,カカシマツリ,かかし祭り,カカシマツリ,和,"","","","","","",体,カカシマツリ,カカシマツリ,カカシマツリ,カカシマツリ,"4","C1","",76478189161030144,278226
が	助詞,格助詞,,,,,ガ,が,が,ガ,が,ガ,和,"","","","","","",格助,ガ,ガ,ガ,ガ,"","動詞%F2@0,名詞%F1","",2168520431510016,7889
開催	名詞,普通名詞,サ変可能,,,,カイサイ,開催,開催,カイサイ,開催,カイサイ,漢,"","","","","","",体,カイサイ,カイサイ,カイサイ,カイサイ,"0","C2","",65579280150700544,238576
さ	動詞,非自立可能,,,サ行変格,未然形-サ,スル,為る,さ,サ,する,スル,和,"","","","","","",用,サ,スル,サ,スル,"0","C5",
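
The raw output above is one token per line: the surface form, a tab, then a comma-separated feature list whose first two entries correspond to the `pos1` and `pos2` values checked in the fuseji code. As a rough sketch of how such a line could be taken apart without fugashi, the following uses the standard `csv` module, because some fields (for example `"0,3"`) are quoted and contain commas themselves. The helper name `parse_mecab_line` is hypothetical, and the two sample lines are copied from the output above.

```python
import csv

def parse_mecab_line(line):
    """Split one line of MeCab-style output into (surface, feature list)."""
    surface, _, feature_str = line.partition("\t")
    # csv handles quoted fields such as "0,3" that contain commas themselves.
    features = next(csv.reader([feature_str]))
    return surface, features

lines = [
    '東\t名詞,普通名詞,一般,,,,ヒガシ,東,東,ヒガシ,東,ヒガシ,和,"","","","","","",体,ヒガシ,ヒガシ,ヒガシ,ヒガシ,"0,3","C2","",8566303715631616,31164',
    '麻布\t名詞,固有名詞,地名,一般,,,アザブ,アザブ,麻布,アザブ,麻布,アザブ,固,"","","","","","",地名,アザブ,アザブ,アザブ,アザブ,"0","","",163560978260480,595',
]

for line in lines:
    surface, features = parse_mecab_line(line)
    pos1, pos2 = features[0], features[1]
    is_proper = pos1 == "名詞" and pos2 == "固有名詞"
    print(surface, pos1, pos2, len(features), is_proper, sep="\t")
```

Splitting on commas with `str.split` would break the `"0,3"` field in two, which is why a CSV parser is used here.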

<a name='2.1.3'></a><a id='2.1.3'></a>
## 2.1.3 Censoring Unknown Words
<a href="#top">[back to top]</a>

In [10]:
def should_hide(node):
    """Check if this node should be hidden or not. """
    if node.is_unk:
        return True
    ff = node.feature
    if ff.pos1 == "名詞" and ff.pos2 == "固有名詞":
        return True
    return False

def fuseji_text(text, ratio=1.0):
    """Given an input string, apply fuseji. """
    out = []
    for node in tagger(text):
        out.append(node.white_space)
        word = fuseji_node(node.surface, ratio) if should_hide(node) else node.surface
        out.append(word)
    return "".join(out)

texts = [
    "犯人はヤス",
    "魔法の言葉はヒラケゴマ",
    "『さかしま』（仏: À rebours）は、フランスの作家ジョリス＝カルル・ユイスマンスによる小説",
    "鈴木爆発で最初に解体する爆弾はみかんの形をしている",
]

for text in texts:
    print(fuseji_text(text))

犯人は◯◯
魔法の言葉は◯◯◯◯◯
『さかしま』（仏: ◯ ◯◯◯◯◯◯◯）は、◯◯◯◯の作家◯◯◯◯＝◯◯◯・◯◯◯◯◯◯による小説
◯◯爆発で最初に解体する爆弾はみかんの形をしている
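
In the third example the space between the two masked Latin words survives, because `fuseji_text` appends `node.white_space` before each token. The sketch below illustrates that reassembly step with stand-in tokens. The `Token` tuple, the `hide` flag, and the tokenization itself are hand-written assumptions for illustration, not actual tagger output.

```python
from collections import namedtuple

# Stand-in for a tokenizer node. fugashi nodes expose `surface` and
# `white_space` attributes; `hide` is a hypothetical flag used only here.
Token = namedtuple("Token", ["white_space", "surface", "hide"])

# Hypothetical tokenization of "仏: À rebours" (not actual tagger output).
tokens = [
    Token("", "仏", False),
    Token("", ":", False),
    Token(" ", "À", True),
    Token(" ", "rebours", True),
]

def rebuild(tokens, filler="◯"):
    """Reassemble text, masking hidden tokens but keeping their leading spaces."""
    out = []
    for tok in tokens:
        out.append(tok.white_space)
        out.append(filler * len(tok.surface) if tok.hide else tok.surface)
    return "".join(out)

print(rebuild(tokens))
```

Dropping the `white_space` step would run the tokens together, which is harmless for pure Japanese text but loses word boundaries in mixed text.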


<a name='2.1.4'></a><a id='2.1.4'></a>
## 2.1.4 Use Readings to Censor only Part of Words
<a href="#top">[back to top]</a>

In [11]:
def fuseji_text(text, ratio=0.5):
    """Given an input string, apply fuseji to part of each hidden word."""
    out = []
    for node in tagger(text):
        out.append(node.white_space)
        # Unknown words may lack a reading, so fall back to the surface form.
        node_text = node.surface if node.is_unk else node.feature.kana
        word = fuseji_node(node_text, ratio=ratio) if should_hide(node) else node.surface
        out.append(word)
    return "".join(out)

texts = [
    "黒幕の正体はガーランド",
]

for text in texts:
    print(fuseji_text(text))

黒幕の正体はガーランド
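
Nothing is masked in the output above, which means `should_hide` returned `False` for every token: with this dictionary, ガーランド was apparently neither unknown nor tagged as a proper noun. To show what partial masking of a reading looks like regardless of dictionary contents, here is a minimal sketch that uses hand-written stand-in nodes. The attribute names follow the ones used in this notebook, the reading アザブ is taken from the tagger output in 2.1.2, and `partial_fuseji` is a hypothetical helper.

```python
import random
from types import SimpleNamespace

def partial_fuseji(node, ratio=0.5, filler="◯", rng=random):
    """Mask part of a node's reading (or its surface if it is unknown)."""
    text = node.surface if node.is_unk else node.feature.kana
    count = max(1, int(ratio * len(text)))
    masked = set(rng.sample(range(len(text)), count))
    return "".join(filler if i in masked else ch for i, ch in enumerate(text))

# Hand-written stand-ins, not real tagger output objects.
known = SimpleNamespace(surface="麻布", is_unk=False,
                        feature=SimpleNamespace(kana="アザブ"))
unknown = SimpleNamespace(surface="ヒラケゴマ", is_unk=True, feature=None)

rng = random.Random(0)
print(partial_fuseji(known, rng=rng))    # one of the three kana is masked
print(partial_fuseji(unknown, rng=rng))  # two of the five characters are masked
```

Note that for a known word the result is a partly masked reading in katakana, not a partly masked surface form.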


---
<a name='2.2'></a><a id='2.2'></a>
# 2.2 Improving Tokenization Quality with a User Dictionary
<a href="#top">[back to top]</a>


<a name='2.2.1'></a><a id='2.2.1'></a>
## 2.2.1 Why Make a Custom Tokenizer Dictionary?
<a href="#top">[back to top]</a>

No source code

<a name='2.2.2'></a><a id='2.2.2'></a>
## 2.2.2 Generating a MeCab User Dictionary
<a href="#top">[back to top]</a>

In [12]:
# The Minimal Approach
pos = "名詞,固有名詞,一般,*".split(",")
words = ["ドロッチェ", "デデデ", "水しょう"]
empty = "*"

for word in words:
    # pos is four fields, so (26 - 4) == 22
    entry = [word, "", "", "100"] + pos + (22 * [empty])
    print(",".join(entry))

ドロッチェ,,,100,名詞,固有名詞,一般,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
デデデ,,,100,名詞,固有名詞,一般,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
水しょう,,,100,名詞,固有名詞,一般,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
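
Each printed line is one row of a user-dictionary CSV: surface form, left and right context IDs (left blank here), a cost, and then 26 feature columns. As a sketch of the next step, the rows can be written to a file and checked for the expected column count before being compiled with MeCab's dictionary tools such as `mecab-dict-index` (the compile step is not shown). The file name `user_dict.csv` and the helper `minimal_entry` are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

POS = ["名詞", "固有名詞", "一般", "*"]
N_FEATURES = 26  # number of feature columns used in the example above

def minimal_entry(word, cost=100):
    """Build one CSV row: surface, left ID, right ID, cost, then 26 features."""
    return [word, "", "", str(cost)] + POS + ["*"] * (N_FEATURES - len(POS))

words = ["ドロッチェ", "デデデ", "水しょう"]
rows = [minimal_entry(w) for w in words]

csv_path = Path(tempfile.mkdtemp()) / "user_dict.csv"  # hypothetical file name
with csv_path.open("w", encoding="utf-8", newline="") as f:
    csv.writer(f, lineterminator="\n").writerows(rows)

# Read the file back and confirm every row has 4 + 26 columns.
with csv_path.open(encoding="utf-8", newline="") as f:
    for row in csv.reader(f):
        assert len(row) == 4 + N_FEATURES, row

print(csv_path.read_text(encoding="utf-8"))
```

Checking the column count up front is cheap and catches off-by-one mistakes in the padding before the dictionary compiler sees the file.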


### The Thorough Approach

In [13]:
from fugashi import UnidicFeatures26

# field names come from fugashi
words = [("水しょう", {"pron": "スイショー", "lemma": "水晶"})]
fields = UnidicFeatures26._fields

for word, data in words:
    entry = {}
    for field in fields:
        entry[field] = data.get(field, "*")

    # assume pos is hard-coded
    entry["pos1"] = "名詞"
    entry["pos2"] = "固有名詞"
    entry["pos3"] = "一般"
    print(",".join(entry.values()))

名詞,固有名詞,一般,*,*,*,*,水晶,*,スイショー,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
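
Filling fields by name is less error-prone than counting columns, but a mistyped field name would be silently ignored by `data.get`. One possible refinement, sketched below, is a helper that rejects unknown field names and also prepends the surface, context ID, and cost columns used in the minimal approach. The helper `make_entry` is hypothetical, and the `FIELDS` list is an assumption meant to mirror `UnidicFeatures26._fields`; with fugashi installed, it is safer to read the names from there.

```python
# Field names assumed to match fugashi's UnidicFeatures26._fields;
# with fugashi installed, prefer reading them from the library.
FIELDS = (
    "pos1 pos2 pos3 pos4 cType cForm lForm lemma orth pron orthBase pronBase "
    "goshu iType iForm fType fForm kana kanaBase form formBase "
    "iConType fConType aType aConType aModType"
).split()

def make_entry(surface, cost=100, **features):
    """Return one user-dictionary CSV line, rejecting unknown field names."""
    unknown = set(features) - set(FIELDS)
    if unknown:
        raise KeyError(f"unknown feature fields: {sorted(unknown)}")
    values = [features.get(field, "*") for field in FIELDS]
    return ",".join([surface, "", "", str(cost)] + values)

print(make_entry("水しょう", pos1="名詞", pos2="固有名詞", pos3="一般",
                 pron="スイショー", lemma="水晶"))
```

Because the features are passed as keyword arguments, Python itself also rejects a field name that is given twice in the same call.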


### The Extra Approach

To-do

If the path to the system dictionary has to be specified explicitly, it is available as `unidic.DICDIR`.
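
As a rough sketch of how that path might be combined with a user dictionary, the helper below builds a MeCab argument string using MeCab's `-d` (system dictionary directory) and `-u` (user dictionary) options. The helper name `mecab_args`, the placeholder paths, and the choice to wrap paths in double quotes are assumptions. The commented usage has not been verified against a specific fugashi version, and some setups may also need a `-r` option pointing at a mecabrc file.

```python
def mecab_args(dicdir, user_dic=None):
    """Build a MeCab argument string.

    -d selects the system dictionary directory,
    -u selects a compiled user dictionary.
    """
    args = [f'-d "{dicdir}"']
    if user_dic is not None:
        args.append(f'-u "{user_dic}"')
    return " ".join(args)

# Placeholder paths for illustration only.
print(mecab_args("/path/to/unidic/dicdir", "user.dic"))

# With fugashi and unidic installed, the string could be used like this:
#   import unidic
#   from fugashi import Tagger
#   tagger = Tagger(mecab_args(unidic.DICDIR, "user.dic"))
```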

<a name='2.2.3'></a><a id='2.2.3'></a>
## 2.2.3 Creating a SudachiPy User Dictionary
<a href="#top">[back to top]</a>

No source code

<a name='2.2.4'></a><a id='2.2.4'></a>
## 2.2.4 Sourcing Your Own Data
<a href="#top">[back to top]</a>

No source code

<a name='2.2.5'></a><a id='2.2.5'></a>
## 2.2.5 Sourcing Internet Data
<a href="#top">[back to top]</a>

No source code