# Fugashi with Unidic-Lite Tokenizer-Dictionary System

**started 11/20/2024**

website link: https://www.dampfkraft.com/nlp/how-to-tokenize-japanese.html

In [1]:
# installing tokenizer: 

!pip install fugashi[unidic-lite]

# note: this can take a long time to download

# it might be worth installing locally for future runs of this notebook:
# just run the same command (without the leading "!") in your command prompt
# (assuming pip is installed and on your PATH)

Collecting fugashi[unidic-lite]
  Downloading fugashi-1.4.0-cp39-cp39-win_amd64.whl (512 kB)
     -------------------------------------- 512.6/512.6 kB 1.1 MB/s eta 0:00:00
Collecting unidic-lite
  Downloading unidic-lite-1.0.8.tar.gz (47.4 MB)
     -------------------------------------- 47.4/47.4 MB 788.5 kB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: unidic-lite
  Building wheel for unidic-lite (setup.py): started
  Building wheel for unidic-lite (setup.py): finished with status 'done'
  Created wheel for unidic-lite: filename=unidic_lite-1.0.8-py3-none-any.whl size=47658818 sha256=cf67886cc1c203d08c6cfedbcbe5646cf4aa6b395378e7bf26dd5e503366d3d6
  Stored in directory: c:\users\alica\appdata\local\pip\cache\wheels\56\9c\4f\2c115e896b4b6c584039ca19de3581d333856782ef108cdc5c
Successfully built unidic-lite
Installing collected packages: unidic-lite, fugashi
Successfully install

In [3]:
import fugashi

In [5]:
# the tagger "holds state about the dictionary",
# which I think just means it is tied to the dictionary we are currently using,
# in case we choose to change dictionaries later

tagger = fugashi.Tagger()

In [7]:
sample_text = '真夜中のドアをたたき。帰らないでと泣いた。あの季節が今目の前'

In [16]:
words = [word.surface for word in tagger(sample_text)]
print(*words)    # plain "print(words)" prints the list itself,
                 # but adding * prints the tokens separated by spaces
                 # (each space marks a token boundary)

真 夜中 の ドア を たたき 。 帰ら ない で と 泣い た 。 あの 季節 が 今 目 の 前
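
A small sketch of what the `*` is doing in the cell above, using a shortened, hard-coded token list so it runs without the tagger. The `spaced` variable is only there for illustration.

```python
# First few tokens from the output above, hard-coded for illustration
words = ["真", "夜中", "の", "ドア"]

print(words)             # prints the list itself, brackets and quotes included
print(*words)            # unpacks the list: same as print("真", "夜中", "の", "ドア")

# str.join builds the same space-separated string explicitly
spaced = " ".join(words)
print(spaced)

# print's sep argument changes the boundary marker if spaces are hard to see
print(*words, sep=" | ")
```

Both `print(*words)` and `" ".join(words)` give 真 夜中 の ドア. The `join` version is handy when the string needs to be stored rather than just displayed.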


_Notice that it doesn't split every hiragana character into its own token. It also seems able to tell a conjugation from a particle, so that's good! I am unsure how it handles words written in katakana, so let's see how it does on カエル (frog)._

In [22]:
frog = "カエルはあそのリンゴを食べたい。"
print(*[word.surface for word in tagger(frog)])

カエル は あ その リンゴ を 食べ たい 。


_cool liking it so far_

In [24]:
# now let's see how it does on one of our sample texts:

sample = '1 二世も - 一一世 ^^ 心せょ " 米國鄉軍は顔る公平ね 0-'
print(*[word.surface for word in tagger(sample)])

1 二 世 も - 一 一 世 ^^ 心 せ ょ " 米 國 鄉軍 は 顔 る 公平 ね 0 -


_So it doesn't break when wrong characters are added, but it also split up issei (一世) and nisei (二世), so I might have to modify the dictionary to count those as words._
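
Until the dictionary is modified, one possible workaround is to re-join the pieces after tokenizing. The sketch below is not a fugashi feature: `merge_known_compounds` and the `COMPOUNDS` set are hypothetical names made up for this notebook, and the token list is a hard-coded prefix of the output above.

```python
from typing import Iterable, List, Set


def merge_known_compounds(tokens: Iterable[str], compounds: Set[str]) -> List[str]:
    """Re-join adjacent tokens whose concatenation is a known compound."""
    tokens = list(tokens)
    max_len = max((len(c) for c in compounds), default=0)
    merged = []
    i = 0
    while i < len(tokens):
        candidate = ""
        match_end = None
        # Grow the candidate one token at a time and remember the longest match
        for j in range(i, len(tokens)):
            candidate += tokens[j]
            if len(candidate) > max_len:
                break
            if j > i and candidate in compounds:
                match_end = j
        if match_end is None:
            merged.append(tokens[i])          # no compound starts here
            i += 1
        else:
            merged.append("".join(tokens[i:match_end + 1]))
            i = match_end + 1
    return merged


# Hypothetical compound list for this project (issei and nisei)
COMPOUNDS = {"一世", "二世"}

# First part of the tagger output above, hard-coded
tokens = ["1", "二", "世", "も", "-", "一", "一", "世"]
print(*merge_known_compounds(tokens, COMPOUNDS))
# 1 二世 も - 一 一世
```

This only fixes the surface tokens. The lemma and part-of-speech features would still come from the separate pieces, so a real dictionary entry is probably still the better long-term fix.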

## Using Lemmas to Avoid Ambiguity

Japanese has many homophones, and the same word can often be written in either kanji or kana, so the surface form of a token can be ambiguous. This tokenizer can return the lemma of a word: if the word is written in hiragana, it will try to interpret its meaning and return its dictionary form (usually in kanji), so there is little ambiguity about its meaning.

_Take なく → 鳴く as an example._

In [25]:
# example: 

# my mother is very tall, all written in hiragana: 

hiragana_text = 'わたしのはははせがたかいです。'

print(*[word.surface for word in tagger(hiragana_text)])

わたし の ははは せ が たかい です 。


In [26]:
# note that it didn't split はは from は

# now taking the lemma: 

for word in tagger(hiragana_text):
    print(word.surface, word.feature.lemma, sep = '\t')

わたし	私
の	の
ははは	ははは
せ	背
が	が
たかい	高い
です	です
。	。


In [28]:
# I bet it would do better with okaasan!

hiragana_text2 = 'わたしのおかあさんはせがたかいです。'

for word in tagger(hiragana_text2):
    print(word.surface, word.feature.lemma, sep = '\t')

わたし	私
の	の
お	御
かあ	母
さん	さん
は	は
せ	背
が	が
たかい	高い
です	です
。	。


_Note that even though practically no one would write the honorific お as its lemma 御, using the lemma reduces ambiguity: it keeps this お from being read as something else, or from being given the same meaning as a different お that shows up in the same text, or in another text once we begin building training sets._

In [29]:
verb_string = "食べ食べたい食べます食べなくて食べないたべた。"
# testing how it chooses the lemma for the same verb in different conjugations

for word in tagger(verb_string):
    print(word.surface, word.feature.lemma, sep = '\t')

食べ	食べる
食べ	食べる
たい	たい
食べ	食べる
ます	ます
食べ	食べる
なく	ない
て	て
食べ	食べる
ない	ない
たべ	食べる
た	た
。	。


_Okay, so far I'm satisfied with this result: it finds the lemma correctly and keeps the part of the conjugation that adds context, such as wanting to do something or being in the past tense._

_This means that one word is split into multiple tokens, where the inflection is separated from the stem. An English example would be looked = look + ed, which makes sense: look carries the meaning, and ed marks the past tense. The same goes for たべた = 食べる + た._
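
To see why the lemma matters for counting, here is a small sketch that compares counts by surface form with counts by lemma. The `pairs` list is copied by hand from the tagger output above so the cell runs on its own. With a live tagger it could presumably be built as `[(w.surface, w.feature.lemma) for w in tagger(verb_string)]`.

```python
from collections import Counter

# (surface, lemma) pairs copied from the tagger output above
pairs = [
    ("食べ", "食べる"), ("食べ", "食べる"), ("たい", "たい"),
    ("食べ", "食べる"), ("ます", "ます"), ("食べ", "食べる"),
    ("なく", "ない"), ("て", "て"), ("食べ", "食べる"),
    ("ない", "ない"), ("たべ", "食べる"), ("た", "た"), ("。", "。"),
]

surface_counts = Counter(surface for surface, _ in pairs)
lemma_counts = Counter(lemma for _, lemma in pairs)

# By surface form, the kanji and hiragana spellings are counted separately
print(surface_counts["食べ"], surface_counts["たべ"])   # 5 1

# By lemma, every form of the verb lands in the same bucket
print(lemma_counts["食べる"])                            # 6

# The same happens for the negative auxiliary: なく and ない share one lemma
print(lemma_counts["ない"])                              # 2
```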

## Computing Power

Tagging takes a lot of computing power, so vectorize when you can. This is easy to do with data frames such as those in pandas, so it shouldn't be difficult to implement.

Creating a new tagger is much more expensive than reusing the same tagger in a list comprehension, a vectorized operation, or a for loop.

Basically, don't reassign the tagger: use the one defined at the beginning as "tagger" instead of calling fugashi.Tagger() again.
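
A minimal sketch of that reuse pattern. `tokenize_all` is a hypothetical helper name, and the pandas column name "text" in the comments is an assumption, not something from our data yet. The helper takes the tagger as an argument so that the same one is shared across every text.

```python
from typing import Callable, Iterable, List


def tokenize_all(texts: Iterable[str], tagger: Callable) -> List[List[str]]:
    """Tokenize every text with one shared tagger and return the surface forms."""
    return [[word.surface for word in tagger(text)] for text in texts]


# Usage with fugashi: build the tagger once, outside any loop, then reuse it.
#
#   import fugashi
#   tagger = fugashi.Tagger()
#   token_lists = tokenize_all([sample_text, frog], tagger)
#
# What to avoid: calling fugashi.Tagger() inside the loop, which would reload
# the dictionary for every single text.
#
# With a pandas DataFrame df that has a (hypothetical) "text" column,
# the same shared tagger can be applied to the whole column:
#
#   df["tokens"] = df["text"].apply(lambda s: [w.surface for w in tagger(s)])
```

Note that `apply` still calls the tagger once per row, so the main saving here comes from not rebuilding the tagger rather than from true vectorization.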