# Fugashi with Unidic-Lite Tokenizer-Dictionary System

**started 11/20/2024**

website link: https://www.dampfkraft.com/nlp/how-to-tokenize-japanese.html

**BibTeX Citation (double-click this cell to get the correct formatting for putting in a LaTeX document)**

@inproceedings{mccann-2020-fugashi,
    title = "fugashi, a Tool for Tokenizing {J}apanese in Python",
    author = "McCann, Paul",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.7",
    pages = "44--51",
    abstract = "Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.",
}



_Alicia Roberts Fall 2024_

In [87]:
# installing tokenizer: 

# !pip install fugashi[unidic-lite] 

# note this can take a long time to download 

# might benefit from installing locally for future runs of this notebook
# just run this in your command prompt connected to your path to install
# (assuming pip is also installed on your path)

In [88]:
# libraries:

import fugashi # tokenizer


In [89]:
# tagger "holds state about the dictionary" 
# which I think just means it is our currently used dictionary 
# if we choose to change later

tagger = fugashi.Tagger()

In [90]:
sample_text = '真夜中のドアをたたき。帰らないでと泣いた。あの季節が今目の前'

In [91]:
words = [word.surface for word in tagger(sample_text)]
print(*words)    # "print(words)" would show the list itself;
                 # unpacking with * prints the tokens separated by spaces
                 # (each space marks a token boundary)

真 夜中 の ドア を たたき 。 帰ら ない で と 泣い た 。 あの 季節 が 今 目 の 前


_Notice that it doesn't split every hiragana character into its own token. It also seems able to tell a conjugation from a particle, so that's good! I am unsure how it handles words written in katakana, so let's see how it does on カエル (frog)._

In [92]:
frog = "カエルはあそのリンゴを食べたい。"
print(*[word.surface for word in tagger(frog)])

カエル は あ その リンゴ を 食べ たい 。


_cool liking it so far_

In [93]:
# now let's see how it does on one of our sample texts:

sample = '1 二世も - 一一世 ^^ 心せょ " 米國鄉軍は顔る公平ね 0-'
print(*[word.surface for word in tagger(sample)])

1 二 世 も - 一 一 世 ^^ 心 せ ょ " 米 國 鄉軍 は 顔 る 公平 ね 0 -


_So it doesn't break when wrong characters are present, but it also splits 一世 (issei) and 二世 (nisei) into separate characters, so I might have to modify the dictionary to count them as words._

## Using Lemma to Avoid Ambiguity

Japanese has many words that share the same hiragana spelling but mean different things, so this tokenizer can return the lemma of a word: if a word is written in hiragana, it tries to interpret the meaning from context and returns the kanji form, which leaves little ambiguity about the meaning.

_Take なく → 鳴く as an example._

In [94]:
# example: 

# my mother is very tall, all written in hiragana: 

hiragana_text = 'わたしのはははせがたかいです。'

print(*[word.surface for word in tagger(hiragana_text)])

わたし の ははは せ が たかい です 。


In [95]:
# see that it didn't split はは from は

# now taking the lemma: 

for word in tagger(hiragana_text):
    print(word.surface, word.feature.lemma, sep = '\t')

わたし	私
の	の
ははは	ははは
せ	背
が	が
たかい	高い
です	です
。	。


In [96]:
# i bet it would do better with okaasan!

hiragana_text2 = 'わたしのおかあさんはせがたかいです。'

for word in tagger(hiragana_text2):
    print(word.surface, word.feature.lemma, sep = '\t')

わたし	私
の	の
お	御
かあ	母
さん	さん
は	は
せ	背
が	が
たかい	高い
です	です
。	。


_Even though practically no one writes the honorific お as 御, using its lemma reduces ambiguity: it won't be mistaken for something else, and it won't be treated as the same token as a different お elsewhere in this text or in other texts once we begin building training sets._

In [97]:
verb_string = "食べ食べたい食べます食べなくて食べないたべた。"
# testing how it chooses the lemma for the same verb in different conjugations

for word in tagger(verb_string):
    print(word.surface, word.feature.lemma, sep = '\t')

食べ	食べる
食べ	食べる
たい	たい
食べ	食べる
ます	ます
食べ	食べる
なく	ない
て	て
食べ	食べる
ない	ない
たべ	食べる
た	た
。	。


_Okay, so far I'm satisfied with this result: it finds the lemma correctly and keeps the part of the conjugation that adds context, such as wanting to do something or past tense._

_This means one word is split into multiple tokens, with the inflection separated from the stem. An English analogy is looked = look + ed: "look" carries the meaning and "ed" marks past tense. It is the same for たべた = 食べる + た._

## Computing Power

Tagging is computationally expensive, so vectorize when you can. This is easy to do with data frames like those in pandas, so it shouldn't be difficult to implement.

Creating a new tagger is much more expensive than reusing the same tagger in a list comprehension, a vectorized call, or a for loop.

Basically, don't reassign the tagger: keep using the one defined at the beginning as `tagger` instead of calling `fugashi.Tagger()` again.
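
As a minimal sketch of the reuse pattern described above: the tagger is passed in as an argument, so every string goes through the same object. The helper name `tokenize_all` is made up for this illustration, and `char_tagger` is only a stand-in (an assumption, so the cell runs without the dictionary download) that imitates the `.surface` attribute of fugashi's tokens by returning one token per character.

```python
from types import SimpleNamespace


def tokenize_all(texts, tagger):
    """Tokenize every string in TEXTS with the one TAGGER passed in."""
    return [[word.surface for word in tagger(text)] for text in texts]


def char_tagger(text):
    """Stand-in tagger (assumption): one token per character, with only a .surface attribute."""
    return [SimpleNamespace(surface=ch) for ch in text]


texts = ['一世', '二世界']
token_lists = tokenize_all(texts, char_tagger)  # one tagger, reused for every string
print(token_lists)
```

With fugashi installed, passing the `tagger` defined at the top of this notebook in place of `char_tagger` should work the same way, for example `tokenize_all(data['text'], tagger)`.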

## Testing it on a Sample Data set

Given my small pre-data set for training this model, let's see how it does on splitting up the strings of yes and nos.

Will it be able to keep issei as one word or will it be split up into ichi + sei? 

My _hope_ is that it will be able to distinguish from context when it is a generational term or just gibberish, which can be tested a lot of different ways, but let's see how this method goes

In [98]:
# importing libraries and data

import pandas as pd
import numpy as np

data = pd.read_csv('issei_training_data - Sheet1.csv')

data

Unnamed: 0,article link,Date,classification,text,comments
0,https://hojishinbun.hoover.org/en/newspapers/n...,1940/02/16,1,會员大募集運動市協活動準備第一世諸氏の援助協力を希望,"seems good to me, is using it as a generationa..."
1,https://hojishinbun.hoover.org/en/newspapers/k...,1940/10/06,1,一世行進曲 | ’ ，， 常石芝靑作,"needs to be verified, but seems related to poe..."
2,https://hojishinbun.hoover.org/en/newspapers/k...,1940/10/18,-1,"1 二世も - 一一世 ^^ 心せょ "" 米國鄉軍は顔る公平ね 0-",OCR read 二 as 一一 resulting in 二世 looking like ...
3,https://hojishinbun.hoover.org/en/newspapers/k...,1940/10/18,1,しズ 0 t 家 * に纖されねぱな -^ a* 今や 19 始時代から永らく奮 H を續け...,
4,https://hojishinbun.hoover.org/en/newspapers/k...,1940/10/18,1,此第一世の遺,"""this first generation's legacy"""
5,https://hojishinbun.hoover.org/en/newspapers/k...,1940/10/18,1,故に一世 1 二世备 < も在,translation is VERY wrong: 1 is supposed to be...
6,https://hojishinbun.hoover.org/en/newspapers/k...,1940/10/18,1,大統領遺舉に付き第一世に訴ふ,
7,Daijūkyūseiki Shinbun 1893.02.04: Page 2,1893/02/04,0,二世界,these are from other students' work!
8,Sōkō Hyōron 1893.12.03: Page 14,1893/12/03,0,一人,these are from other students' work!
9,Shin Sekai 1895.08.28: Page 1,1895/08/28,-1,一世に雄飛せるのみ,these are from other students' work!


In [99]:
def split_text(s, tagger):
    '''given a string S from a text, a sample, or whatever, split it up into its tokens using Fugashi's TAGGER'''
    words = [word.surface for word in tagger(s)]
    return words
    

In [100]:
iterations = 3 # just to run this a few times

for i in range(iterations):
    samples = data.sample(3)['text'].values # take a random sample of 3 data points from our data set
    splits = [split_text(s, tagger) for s in samples] # split each sample into its tokens based on our tagger
    for tokens in splits:
        # list membership: True only if the tokenizer kept 一世 together as a single token
        print('Is 一世 in str?:', '一世' in tokens, '\nstr: ', *tokens, '\n')

Is 一世 in str?: False 
str:  一 世 に 雄飛 せる のみ 

Is 一世 in str?: False 
str:  一人 

Is 一世 in str?: False 
str:  大統領 遺 舉 に 付き 第 一 世 に 訴 ふ 

Is 一世 in str?: False 
str:  會 员大 募集 運動 市 協 活動 準備 第 一 世 諸氏 の 援助 協力 を 希望 

Is 一世 in str?: False 
str:  大統領 遺 舉 に 付き 第 一 世 に 訴 ふ 

Is 一世 in str?: False 
str:  一 世紀 

Is 一世 in str?: False 
str:  大統領 遺 舉 に 付き 第 一 世 に 訴 ふ 

Is 一世 in str?: False 
str:  一人 

Is 一世 in str?: False 
str:  一番 



So it seems that issei is not in this dictionary, at least not in any way I can tell




In [194]:
# tagger?

The documentation (https://pypi.org/project/fugashi/) says that you can use any dictionary you want, so I might have to look into dictionaries that contain issei, or manually add it to a copy of an existing dictionary.
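
Until a suitable dictionary turns up, one possible workaround is to re-join adjacent tokens after tokenizing. This is a rough sketch and not a fugashi feature: the helper name `merge_compounds` is made up here, and it works purely on the list of surface strings, so it will also join 一 + 世 in places where the characters are not actually the generational term.

```python
def merge_compounds(tokens, compounds):
    """Re-join adjacent TOKENS whenever they spell out one of COMPOUNDS."""
    merged = []
    i = 0
    while i < len(tokens):
        for compound in compounds:
            # glue tokens together starting at i until we have enough characters
            j, joined = i, ''
            while j < len(tokens) and len(joined) < len(compound):
                joined += tokens[j]
                j += 1
            if joined == compound:
                merged.append(compound)
                i = j  # skip past the tokens that were merged
                break
        else:
            # no compound starts here, keep the token as it is
            merged.append(tokens[i])
            i += 1
    return merged


# tokens taken from the tokenizer output shown earlier in this notebook
tokens = ['大統領', '遺', '舉', 'に', '付き', '第', '一', '世', 'に', '訴', 'ふ']
print(*merge_compounds(tokens, ['一世', '二世']))
```

In the OCR-damaged sample from earlier, `一 一 世` would become `一 一世`, so the misread character is still visible to later steps.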

## next steps: 

1. explore OCR and further applications 
2. narrow down the methods I want to do 
3. implement better tokenization and find more documentation 

## Implementation idea: remove issei and nisei, then tokenize. 

What it fixes: issei and nisei are not recognized as compound words. If we know that each string in our training set contains issei or nisei, and we can check future test strings for their presence, we can remove the word from the string and rely on the remaining characters for context. We shouldn't lose context from the surrounding characters if the use of issei or nisei is correct (i.e., not a misreading on the OCR's part).

What issues it might cause: 
If it is _not_ a hit, and is instead an OCR misreading or a match that picks up on neighboring words that share characters, then we lose information that is important, which can confuse the model.


all in all, I think it's worth trying

**with the small data set we are working with:**

Let's try vectorizing with a one-hot encoding method and use PCA to reduce the computational cost of our model.

**steps**


1. Remove the hit (either issei, nisei, or any other word whose usage you want to find) from the string. 
> Don't change the original string. You can store the result in a new data frame or add a column to the imported data set. 
> Also record _where_ in the string the hit occurs so you can focus your tokenizing in that area. Assume that the context that helps determine the word's usage decreases as characters get further away from the hit. 

2. Tokenize the string
>this can be done using whatever tokenizer you want, this notebook is using the fugashi tokenizer. 
> I will also be returning the _lemma_, so that words that share a hiragana spelling but mean different things are not treated as the same word (take hana as flower vs. hana as nose). 
> the fugashi tokenizer does a fairly good job picking up on the kanji meaning from context, but this could also be an area of error to account for.  

3. remove stop words
> these are words that don't add value, such as OCR mishaps (characters like |, ^, /, { that show up in the transcription) and particles like を、が、は、で, etc. . .
> you can also choose to remove all but the stems of verbs if you don't care about the conjugations and only the presence of the verb. 
 


4. clean the data to only include the neighboring characters/words, so pick a size (say, 20 characters on either side of the hit) to reduce the storage cost of these training points. 
> If there are multiple hits, split the string into multiple data points, one per hit. This should already be done in the csv file itself, but just in case, run a scan to see if there are multiple hits in one data point. 
> Be sure that overlap between the shortened strings won't matter. _Don't change the original data, only a copy of it._

5. Expand the string (which is now an array of tokens) so that each distinct token gets its own column. This is done by pivoting the table and using a count as the values in the table. 
> This is what I refer to as One Hot Encoding: we assign either 1 or 0 to a characteristic, and in this case the characteristic is the existence of a word in our string. This can be very costly, as there can be MANY different words across all of our strings. 

6. PCA - Principal Component Analysis 
> PCA finds the combinations of characteristics (columns) along which the strings vary the most. It does not look at the classification itself; the hope is that the directions with the most variance are also the ones that help decide whether issei/nisei is being used the way we want for it to count as a hit. 
> We then keep the top N components that give us the most variance (just variation in our options) to train the model on. Consider that if every string has the same word in it, that word's column has variance 0. 
> We can make no distinction between one meaning of our target word and another based on a word that every string contains, so it contributes nothing to the analysis; leaving it out reduces computing cost and also makes the model run faster.  
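
The zero-variance point in step 6 can be checked by hand on a toy table. Note that this sketch is plain per-column variance filtering, which is simpler than PCA (PCA combines columns into new components rather than picking existing ones). The table values and the helper names `column_variances` and `top_variance_columns` are made up for illustration.

```python
def column_variances(rows):
    """Population variance of each column in a table given as a list of rows."""
    n = len(rows)
    variances = []
    for column in zip(*rows):  # zip(*rows) walks the table column by column
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / n)
    return variances


def top_variance_columns(rows, names, n_keep):
    """Names of the N_KEEP columns with the highest variance, skipping zero-variance ones."""
    ranked = sorted(zip(column_variances(rows), names), key=lambda pair: pair[0], reverse=True)
    return [name for variance, name in ranked[:n_keep] if variance > 0]


# toy one-hot table: one row per string, one column per word (values are invented)
names = ['の', '一世', '新聞', '世界']
rows = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
]

print(dict(zip(names, column_variances(rows))))
print(top_variance_columns(rows, names, 2))
```

Here の appears in every row, so its column has variance 0 and is never selected, no matter how many columns are requested.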

In [120]:
# sample string to use for testing the functions 

sample_text = '真夜中のドアをたたき。帰らないでと泣いた。あの季節が今目の前'

### Step 1: removing the hit from the string

To do this, I wrote a function `remove_hit`, which removes every occurrence of the hit from the string and returns the index of the first character of the hit. If there are multiple hits, it returns one string (with all hits removed) and a list of index values. If there is no hit, it returns an empty string and n = -1. This will be useful for cleaning later: you can filter out all rows that have n = -1 so they are not used in training or testing.

In [171]:
# step 1: removing the hit

def remove_hit(string, hit):
    '''Remove every occurrence of HIT from STRING.
    STRING: any string
    HIT: any non-empty word
    returns:
    STRING without HIT, N: the index in the original STRING of the first character of HIT.
    N is an integer for one hit, a list of integers for several hits, and -1 (with an empty string) for no hit'''
    if not hit: # an empty HIT would match everywhere, so treat it as "no hit"
        return '', -1

    size_hit = len(hit) # how many characters one occurrence covers
    ns = [] # start index of every occurrence; stays empty if HIT never occurs

    n = string.find(hit) # str.find returns -1 when HIT is not found
    while n != -1:
        ns.append(n)
        n = string.find(hit, n + size_hit) # keep searching after this occurrence

    # if HIT is not in STRING, return empty string and -1 (to be removed later):
    if not ns:
        return '', -1

    mod_string = string.replace(hit, '') # drops all occurrences, including one at the very end of STRING
    if len(ns) == 1:
        return mod_string, ns[0] # single hit: return the index itself
    return mod_string, ns

    
    
# test case:
hit = '季節'
print('STRING without HIT: ',remove_hit(sample_text, hit)[0])
print('index of first character of HIT in STRING: ',remove_hit(sample_text, hit)[1])

print('the hit based on remove_hit index:',sample_text[23:23 +len(hit)])

# second test: multiple hits: 

multiple_hits = 'おはよう、お母さん。どこお母さんですか？'

hit = 'お母さん'

print('\nSTRING without HIT: ',remove_hit(multiple_hits, hit)[0])
print('index of first character of HIT in STRING: ',remove_hit(multiple_hits, hit)[1])
print('the hits based on remove_hit index:',multiple_hits[5:5 +len(hit)], multiple_hits[12:12+len(hit)] )


# third test: no hit:

hit = '山田'
print('\nSTRING without HIT: ',remove_hit(sample_text, hit)[0])
print('index of first character of HIT in STRING: ',remove_hit(sample_text, hit)[1])
print('the hit based on remove_hit index:',sample_text[-1:-1 + len(hit)])

STRING without HIT:  真夜中のドアをたたき。帰らないでと泣いた。あのが今目の前
index of first character of HIT in STRING:  23
the hit based on remove_hit index: 季節

STRING without HIT:  おはよう、。どこですか？
index of first character of HIT in STRING:  [5, 12]
the hits based on remove_hit index: お母さん お母さん

STRING without HIT:  
index of first character of HIT in STRING:  -1
the hit based on remove_hit index: 


### Step 2: Tokenizing 

Use the function `tokenize` to turn a string into its tokens. The tokens are returned in lemma form, so they might not resemble the original string, but this avoids ambiguity when training on words that share a hiragana spelling. 

In [172]:
# step 2: tokenizing 

cleaned_string, n_hit = remove_hit(sample_text, hit)

def tokenize(string):
    '''given a string STRING, return the tokens in lemma form in the form of a numpy array'''
    tokens = np.array([word.feature.lemma for word in tagger(string)]) # store the lemma of each token
    return tokens


# test case: 

sample_tokens = tokenize(sample_text)

print('original text:\t  ' ,sample_text, '\ntokenized version:',  *sample_tokens) # looks good! 

original text:	   真夜中のドアをたたき。帰らないでと泣いた。あの季節が今目の前 
tokenized version: 真 夜中 の ドア-door を 叩き 。 返る ない て と 泣く た 。 彼の 季節 が 今 目 の 前


### Step 3: Remove Stop Words 

Using a list of common Japanese stop words from this GitHub repo: https://github.com/stopwords-iso/stopwords-ja/tree/master, I created a list of stop words combined with anticipated extras such as misread characters and punctuation. There could very well be more that I haven't anticipated; these can just be added to the end of the `add_words` list in the next cell.

**NOTE**

This assumes the strings are in JAPANESE, so any Latin-based characters will NOT be removed.
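
If stray Latin letters and digits from the OCR do turn out to be a problem, one option is an extra pass that drops tokens made up entirely of ASCII characters. This is only a sketch: the helper name `drop_ascii_tokens` is made up, and the token list below is invented to resemble the noisy samples in the data set.

```python
def drop_ascii_tokens(tokens):
    """Return TOKENS without the ones that consist only of ASCII characters."""
    # str.isascii() is True for letters, digits and symbols in the basic Latin range
    return [token for token in tokens if not token.isascii()]


# invented tokens that resemble the OCR noise in the samples
noisy_tokens = ['しズ', '0', 't', '家', '*', 'a*', '19', 'H', 'ドア-door', '，']
print(drop_ascii_tokens(noisy_tokens))
```

A lemma such as `ドア-door` is kept because it is only partly ASCII, and full-width punctuation such as `，` is also kept, so that would still need to go in the stop word list.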

In [182]:
# step 3: remove extra characters and stop words:

# read in stop words library: 

stop_words_data = pd.read_csv('stopwords-ja.txt', sep = ' ', header = None)

# I think we need to convert these to LEMMA first -- so they are actually removed if it is a stop word 

# adding new ones that I anticipate seeing: 

add_words = ['、','。','・','!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '-', '+', '=', 'I', '{', '}', '[', ']', '0','1','2','3','4','5','6','7','8','9']

stop_words = np.append(stop_words_data.values, add_words)

In [175]:
# now let's remove stop words: 

def remove_stopwords(tokens):
    '''given a list of tokens TOKENS, remove all stop words'''
    new_tokens = [] # to store non-stop words 
    for t in tokens: 
        if t not in stop_words: 
            new_tokens.append(t)
    return np.array(new_tokens) # to keep as an np array

In [187]:
print('Original list of tokens', sample_tokens)
print('\nNew cleaned list of tokens:', remove_stopwords(sample_tokens))


sample_tokens_cleaned = remove_stopwords(sample_tokens)
# note this removes verb stems, so if you want to keep verb stems, you would have to remove that from the list. I recommend doing this BEFORE turning the dataframe into a list

Original list of tokens ['真' '夜中' 'の' 'ドア-door' 'を' '叩き' '。' '返る' 'ない' 'て' 'と' '泣く' 'た' '。' '彼の'
 '季節' 'が' '今' '目' 'の' '前']

New cleaned list of tokens: ['真' '夜中' 'ドア-door' '叩き' '返る' '泣く' '彼の' '季節' '今' '目' '前']


### Step 4: Clean Data by Shortening Strings 

This is for long arrays, or just long strings of tokens. This requires using the index location from the `remove_hit` function to identify where to center your shortened string. For this method, the actual position of the characters does not matter, but if this were extended into a BERT NN, the order would have to be preserved. 

Choose a value N that determines the number of characters to keep on either side of the hit index (this is the parameter `d` in the function below). If N extends beyond the length of the array, it will be reduced on the side that exceeds the range. This means that if you choose N = a million or something, you should get the original array returned. 

In [223]:
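def shorten_tokens(tokens, index, d=None):
    '''given an array of TOKENS, return the array as a shortened version centered at INDEX where there are D tokens on either side of the index.
    NOTE: unfinished. For now this only returns the single token at INDEX, and D is not used yet'''
    # known error: INDEX from remove_hit is a character position in the original string,
    # and removing the hit and the stop words shifts positions, so the index has to change as we go.
    # The caller cuts the index down by hand for now.
    if d is None:
        d = len(tokens) // 2 # default window: half the array (a default argument cannot refer to TOKENS)
    return tokens[index]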

In [225]:
hit = '季節'
index = remove_hit(sample_text, hit)[1]

print(sample_text)
print(shorten_tokens(sample_tokens_cleaned, index//4))


真夜中のドアをたたき。帰らないでと泣いた。あの季節が今目の前
泣く


In [221]:
index

23

_This step still needs work, but I think we can make our table we want first to get things rolling_
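
One possible way to finish this step, as a rough sketch rather than the final method: convert the character index from `remove_hit` into a token index by adding up token lengths, then slice D tokens on either side. The helper names `char_to_token_index` and `window` are made up for this illustration. It assumes the tokens are the unmodified surface forms of the same string the index refers to, so it would have to run before lemmas replace the surfaces and before stop words are removed.

```python
def char_to_token_index(surfaces, char_index):
    """Position of the token in SURFACES that contains character CHAR_INDEX."""
    end = 0
    for i, surface in enumerate(surfaces):
        end += len(surface)  # END is the character offset just past token i
        if char_index < end:
            return i
    return len(surfaces) - 1  # CHAR_INDEX is past the end: clamp to the last token


def window(tokens, center, d):
    """Up to D tokens on either side of CENTER, clipped at both ends of TOKENS."""
    start = max(center - d, 0)  # don't run off the left edge
    return tokens[start:center + d + 1]  # slicing clips the right edge by itself


# surface tokens from the first tokenizing example in this notebook
surfaces = ['真', '夜中', 'の', 'ドア', 'を', 'たたき', '。', '帰ら', 'ない', 'で', 'と',
            '泣い', 'た', '。', 'あの', '季節', 'が', '今', '目', 'の', '前']

center = char_to_token_index(surfaces, 23)  # 23 is the index remove_hit gives for 季節
print(center, surfaces[center])
print(*window(surfaces, center, 2))
```

Choosing a very large D returns the whole list, which matches the behavior described above for N = a million.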

### Step 5: make the training table!

To do this, have all data samples with their tokens as a column and then pivot the table so the column values are the new column labels 
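
The target shape of that table can be sketched in plain Python first: one row per sample, one column per distinct token, and a 1 wherever the token occurs in the sample. The helper name `one_hot_table` and the token lists are made up for illustration.

```python
def one_hot_table(token_lists):
    """Return (vocab, rows): the column labels and one 0/1 row per token list."""
    vocab = sorted({token for tokens in token_lists for token in tokens})
    rows = []
    for tokens in token_lists:
        present = set(tokens)  # set lookup is faster than scanning the list for each column
        rows.append([1 if word in present else 0 for word in vocab])
    return vocab, rows


# invented token lists standing in for three cleaned samples
token_lists = [
    ['第', '一世', '援助', '協力'],
    ['一世', '行進', '曲'],
    ['二', '世界'],
]

vocab, rows = one_hot_table(token_lists)
print(vocab)
for row in rows:
    print(row)
```

The result can be handed to pandas with `pd.DataFrame(rows, columns=vocab)`. Using counts instead of 0/1 values would turn this into a bag-of-words table.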

In [226]:
sample_tokens_cleaned

array(['真', '夜中', 'ドア-door', '叩き', '返る', '泣く', '彼の', '季節', '今', '目', '前'],
      dtype='<U7')

In [243]:
data_table = pd.DataFrame(data = {'test':sample_tokens_cleaned})

In [257]:
data_table.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
test,真,夜中,ドア-door,叩き,返る,泣く,彼の,季節,今,目,前


In [248]:
data_table

Unnamed: 0,test
0,真
1,夜中
2,ドア-door
3,叩き
4,返る
5,泣く
6,彼の
7,季節
8,今
9,目
