<a href="https://www.kaggle.com/code/saurabh8112/nlp-japanese-morphological-tokenization?scriptVersionId=172766400" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Morphological splitting of Japanese sentences into words

In contrast to English, Japanese (like Chinese) does not put spaces between the words of a sentence. This poses a significant challenge for natural language processing tasks, particularly for tokenization.

Tokenization is the process of breaking down text into individual words or subwords, which is a crucial step for various NLP models like BERT or GPT that rely on sub-word tokenizers such as **WordPiece or Byte Pair Encoding (BPE)**.

Consider the Japanese sentence: **彼女は日本語を勉強しています (She is studying Japanese)**. In English we can split a sentence into words at the spaces, but Japanese sentences require more sophisticated segmentation methods. This sentence can be split, for example, in these two ways:

* 彼女 (she) / は (topic marker) / 日本語 (Japanese) / を (object marker) / 勉強 (study) / しています (is doing)
* 彼女 (she) / は (topic marker) / 日本語を (Japanese + object marker) / 勉強しています (is studying)
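To make the problem concrete, here is a minimal sketch using only Python's built-in `str.split()`. The English sentence is the translation given above; nothing else is assumed.

```python
text = "彼女は日本語を勉強しています"
english = "She is studying Japanese"

# Whitespace splitting works for English because words are space-delimited.
english_tokens = english.split()
print(english_tokens)

# The Japanese sentence contains no spaces, so it comes back as one "word".
japanese_tokens = text.split()
print(japanese_tokens)
```

The Japanese sentence is returned as a single element, so whitespace alone cannot recover either of the segmentations listed above.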

How do we split this sentence into words? This is where morphological analysis comes into the picture. In this notebook I give a high-level overview of some popular morphological analyzers and show how to use them.

Note: *Segmenting Japanese text into its constituent units is an essential first step before training models like BERT (embedding generation) or GPT (text generation). Without a reliable tokenization strategy, these models cannot effectively process Japanese language data.*

# Tokenization with MeCab and fugashi

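MeCab is a popular tool for morphological analysis of text. fugashi is a Cython wrapper for MeCab.

Let's install fugashi and mecab-python3 (Python bindings for MeCab), together with two dictionaries: unidic-lite and ipadic (IPAdic is the dictionary traditionally distributed with MeCab).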

In [1]:
!pip install fugashi
!pip install unidic-lite

Collecting fugashi
  Downloading fugashi-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Downloading fugashi-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (600 kB)
Installing collected packages: fugashi
Successfully installed fugashi-1.3.2
Collecting unidic-lite
  Downloading unidic-lite-1.0.8.tar.gz (47.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: unidic-lite
  Building wheel for unidic-lite (setup.py) ... done
  Created wheel for unidic-lite: filename=unidic_lite-1.0.8-py3-none-any.whl size=47658817 sha256=dc660439fe3d2880afcbdd19a906021ec06d3261c6604cc0ac4edd817a7f0cce
  Stored in directory: 

In [2]:
!pip install ipadic
!pip install mecab-python3

Collecting ipadic
  Downloading ipadic-1.0.0.tar.gz (13.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ipadic
  Building wheel for ipadic (setup.py) ... done
  Created wheel for ipadic: filename=ipadic-1.0.0-py3-none-any.whl size=13556704 sha256=edabc509d12c48109b44af35d8a24f0b7f511a199adcdaadd560ade27ea8b430
  Stored in directory: /root/.cache/pip/wheels/5b/ea/e3/2f6e0860a327daba3b030853fce4483ed37468bbf1101c59c3
Successfully built ipadic
Installing collected packages: ipadic
Successfully installed ipadic-1.0.0
Collecting mecab-python3
  Downloading mecab_python3-1.0.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.1 kB)
Downloading mecab_python3-1.0.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (581 kB)

# Morphological tokenization!
Let's tokenize the sentence with MeCab

In [3]:
import MeCab

text = "彼女は日本語を勉強しています"
# Default output format: every token comes with its dictionary features
tagger = MeCab.Tagger()
tagger.parse(text).split()

['彼女',
 'カノジョ',
 'カノジョ',
 '彼女',
 '代名詞',
 '1',
 'は',
 'ワ',
 'ハ',
 'は',
 '助詞-係助詞',
 '日本',
 'ニッポン',
 'ニッポン',
 '日本',
 '名詞-固有名詞-地名-国',
 '3',
 '語',
 'ゴ',
 'ゴ',
 '語',
 '名詞-普通名詞-一般',
 '1',
 'を',
 'オ',
 'ヲ',
 'を',
 '助詞-格助詞',
 '勉強',
 'ベンキョー',
 'ベンキョウ',
 '勉強',
 '名詞-普通名詞-サ変可能',
 '0',
 'し',
 'シ',
 'スル',
 '為る',
 '動詞-非自立可能',
 'サ行変格',
 '連用形-一般',
 '0',
 'て',
 'テ',
 'テ',
 'て',
 '助詞-接続助詞',
 'い',
 'イ',
 'イル',
 '居る',
 '動詞-非自立可能',
 '上一段-ア行',
 '連用形-一般',
 '0',
 'ます',
 'マス',
 'マス',
 'ます',
 '助動詞',
 '助動詞-マス',
 '終止形-一般',
 'EOS']

## Wait that's too much information

MeCab provides a detailed linguistic analysis of each token in the input sentence: the surface form, reading, base form, part of speech, and other features. We don't need all of that if we only want to prepare Japanese text for text generation or embedding generation.
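If some of those features are occasionally useful, the raw output can be parsed line by line instead of being flattened with `.split()`. The sketch below uses only the standard library and a hand-built sample string that imitates the output above. The tab-separated layout and the column positions (surface form in column 0, part of speech in column 4) are assumptions inferred from that output, and they depend on the dictionary in use. `parse_rows` is a hypothetical helper, not part of MeCab.

```python
# Hand-built sample imitating the tagger output above: one token per line,
# tab-separated columns, terminated by "EOS". The column layout is an
# assumption and varies with the dictionary.
raw = (
    "彼女\tカノジョ\tカノジョ\t彼女\t代名詞\t\t\t1\n"
    "は\tワ\tハ\tは\t助詞-係助詞\t\t\t\n"
    "日本\tニッポン\tニッポン\t日本\t名詞-固有名詞-地名-国\t\t\t3\n"
    "EOS\n"
)


def parse_rows(output):
    """Return (surface, part_of_speech) pairs from tab-separated tagger output."""
    rows = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        fields = line.split("\t")
        # Column 0: surface form; column 4: assumed part-of-speech column.
        rows.append((fields[0], fields[4]))
    return rows


rows = parse_rows(raw)
print(rows)
```

Splitting on tabs rather than on all whitespace keeps empty columns in place, so each field stays at a fixed index even when a token has no value for it.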



## Let's focus only on the essentials
If we only need the tokens themselves, we can pass the `-Owakati` flag.

Passing `-Owakati` to the MeCab tagger sets the output format to the space-separated tokens only, without additional linguistic information such as readings, part-of-speech tags, or other features.

In [4]:
text = "彼女は日本語を勉強しています"
wakati = MeCab.Tagger('-Owakati')
wakati.parse(text).split()

['彼女', 'は', '日本', '語', 'を', '勉強', 'し', 'て', 'い', 'ます']

## With fugashi!

In [5]:

import fugashi

text = "彼女は日本語を勉強しています"
tagger = fugashi.Tagger()
tokens = [word.surface for word in tagger(text)]
print(tokens)


['彼女', 'は', '日本', '語', 'を', '勉強', 'し', 'て', 'い', 'ます']


# Bonus: Tokenization with Sudachi

Sudachi is a morphological analyzer based on the double-array trie structure, allowing for efficient dictionary lookup and morphological analysis of Japanese text. 

It supports multiple dictionaries, including a system dictionary and user-defined dictionaries, and offers features like unknown word handling and customizable tokenization rules.



## Install SudachiPy and SudachiDict-core

In [6]:
!pip install sudachipy
!pip install sudachidict_core

Collecting sudachipy
  Downloading SudachiPy-0.6.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading SudachiPy-0.6.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB)
Installing collected packages: sudachipy
Successfully installed sudachipy-0.6.8
Collecting sudachidict_core
  Downloading SudachiDict_core-20240409-py3-none-any.whl.metadata (2.5 kB)
Downloading SudachiDict_core-20240409-py3-none-any.whl (72.0 MB)
Installing collected packages: sudachidict_core
Successfully installed sudachidict_core-20240409


## Split sentence into words

In [7]:
from sudachipy import tokenizer, dictionary

text = "彼女は日本語を勉強しています"
tokenizer_obj = dictionary.Dictionary().create()
tokens = [m.surface() for m in tokenizer_obj.tokenize(text)]
print(tokens)


['彼女', 'は', '日本語', 'を', '勉強', 'し', 'て', 'い', 'ます']


## Subtle difference between Sudachi and MeCab
It's important to note that different morphological tokenizers sometimes split the same sentence slightly differently. Such differences mostly reflect the underlying dictionaries and their segmentation conventions.

For instance, the sentence "彼女は日本語を勉強しています" was split into:
* ['彼女', 'は', '日本', '語', 'を', '勉強', 'し', 'て', 'い', 'ます'] by MeCab
* ['彼女', 'は', '日本語', 'を', '勉強', 'し', 'て', 'い', 'ます'] by Sudachi

The word 日本語 can be split into:
* '日本' (Japan) and '語' (language)
* '日本語' (Japanese)


And both make equal sense!
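To pinpoint where two segmentations disagree, one option is to compare the character offsets at which tokens end. This is a minimal standard-library sketch using the two token lists shown above; `boundaries` is a hypothetical helper introduced only for illustration.

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    offsets, position = set(), 0
    for token in tokens:
        position += len(token)
        offsets.add(position)
    return offsets


mecab_tokens = ['彼女', 'は', '日本', '語', 'を', '勉強', 'し', 'て', 'い', 'ます']
sudachi_tokens = ['彼女', 'は', '日本語', 'を', '勉強', 'し', 'て', 'い', 'ます']

# Both segmentations must cover the same text for the comparison to make sense.
assert "".join(mecab_tokens) == "".join(sudachi_tokens)

only_mecab = boundaries(mecab_tokens) - boundaries(sudachi_tokens)
only_sudachi = boundaries(sudachi_tokens) - boundaries(mecab_tokens)
print(sorted(only_mecab), sorted(only_sudachi))  # [5] []
```

The single extra boundary at offset 5 is the cut between 日本 and 語. Everywhere else the two tokenizers agree on this sentence.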