# Japanese NLP Analysis: Comparative Study of UniDic-based Approaches

This notebook implements and compares two approaches for Japanese morphological analysis with BCCWJ frequency matching:

- **Plan A**: MeCab (fugashi) + UniDic direct pipeline
- **Plan B**: GiNZA (Sudachi) + UniDic alignment pipeline

Each approach is designed for reproducible setup, implementation, validation, and operational use.

## 1. Environment Setup & Verification

First, let's verify and set up our environment with all required packages.

In [10]:
# Environment verification and setup
import sys
import subprocess
from pathlib import Path

print(f"Python version: {sys.version}")
print(f"Working directory: {Path.cwd()}")

# Required packages
required_packages = [
    'fugashi', 'unidic', 'unidic-lite', 'spacy', 'ginza', 
    'ja-ginza', 'sudachipy', 'pandas', 'numpy', 'matplotlib', 'collections'
]

print("\nChecking package availability:")
for package in required_packages:
    try:
        if package == 'collections':
            import collections
            print(f"✓ {package} (built-in)")
        else:
            __import__(package)
            print(f"✓ {package}")
    except ImportError:
        print(f"✗ {package} - NOT FOUND")

Python version: 3.12.2 (main, Feb 25 2024, 03:55:42) [Clang 17.0.6 ]
Working directory: /Users/eguchi/Dropbox/teaching/Tohoku-2025/linguistic-data-analysis-I/2025/notebooks

Checking package availability:
✓ fugashi
✓ unidic
✗ unidic-lite - NOT FOUND
✓ spacy
✓ ginza
✗ ja-ginza - NOT FOUND
✓ sudachipy
✓ pandas
✓ numpy
✓ matplotlib
✓ collections (built-in)
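
The two ✗ lines above do not necessarily mean the packages are missing. `__import__` expects a module name, while `unidic-lite` and `ja-ginza` are distribution names; their importable modules are `unidic_lite` and `ja_ginza`. A minimal sketch of a check that keeps the two apart, assuming a hand-written mapping (`IMPORT_NAMES` and `is_importable` are illustrative names, not part of any library):

```python
import importlib.util

# Distribution name -> importable module name (assumed mapping for this notebook).
IMPORT_NAMES = {
    "unidic-lite": "unidic_lite",
    "ja-ginza": "ja_ginza",
}

def is_importable(package: str) -> bool:
    """Return True if the module behind `package` can be located (nothing is imported)."""
    module_name = IMPORT_NAMES.get(package, package)
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False

for package in ["fugashi", "unidic-lite", "ja-ginza", "collections"]:
    status = "found" if is_importable(package) else "NOT FOUND"
    print(f"{package}: {status}")
```

Because `find_spec` only locates the module, this check avoids the import-time cost of loading spaCy or a dictionary just to confirm that it is installed.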


In [11]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import time
import warnings
from typing import List, Tuple, Dict, Optional

# Japanese NLP libraries
import fugashi
import unidic
import spacy
from spacy.tokens import Token, Doc

# Statistical analysis
try:
    from scipy.stats import spearmanr
    scipy_available = True
except ImportError:
    print("scipy not available - will use numpy for correlation")
    scipy_available = False

print("All imports successful!")
warnings.filterwarnings('ignore')

scipy not available - will use numpy for correlation
All imports successful!


In [12]:
# Check UniDic installation and download if needed
try:
    print(f"UniDic directory: {unidic.DICDIR}")
    print("UniDic is properly installed")
except Exception as e:
    print(f"UniDic issue: {e}")
    print("You may need to run: python -m unidic download")

# Test basic fugashi functionality
try:
    tagger = fugashi.Tagger(f'-d "{unidic.DICDIR}"')
    test_result = list(tagger("テスト"))
    print(f"Fugashi + UniDic test successful: {test_result[0].surface}")
except Exception as e:
    print(f"Fugashi test failed: {e}")

UniDic directory: /Users/eguchi/Dropbox/teaching/Tohoku-2025/linguistic-data-analysis-I/.venv/lib/python3.12/site-packages/unidic/dicdir
UniDic is properly installed
Fugashi + UniDic test successful: テスト


## 2. Sample Data Preparation

Let's create realistic Japanese text samples for testing our pipelines.

In [13]:
# Sample Japanese texts for testing
sample_texts = [
    "彼は日ごろから本を読むのが好きです。",
    "ひごろの勉強が大切だと思います。",
    "日頃の努力が実を結ぶでしょう。",
    "彼女は書きあらわすことが得意です。",
    "その問題を書き表すのは難しい。",
    "今日は東京オリンピックについて話しましょう。",
    "コーヒーを飲んで、呑み込んで、また飲んでしまった。",
    "国際的な協力が必要不可欠です。",
    "機械学習の技術が進歩している。",
    "自然言語処理は興味深い分野だ。"
]

print("Sample texts prepared:")
for i, text in enumerate(sample_texts, 1):
    print(f"{i:2d}. {text}")

# Create a larger corpus by repeating the sample texts three times.
# Repetition scales every count equally; it does not change relative frequencies.
extended_corpus = sample_texts * 3
print(f"\nExtended corpus: {len(extended_corpus)} texts")

Sample texts prepared:
 1. 彼は日ごろから本を読むのが好きです。
 2. ひごろの勉強が大切だと思います。
 3. 日頃の努力が実を結ぶでしょう。
 4. 彼女は書きあらわすことが得意です。
 5. その問題を書き表すのは難しい。
 6. 今日は東京オリンピックについて話しましょう。
 7. コーヒーを飲んで、呑み込んで、また飲んでしまった。
 8. 国際的な協力が必要不可欠です。
 9. 機械学習の技術が進歩している。
10. 自然言語処理は興味深い分野だ。

Extended corpus: 30 texts


In [14]:
# Create mock BCCWJ frequency data for testing
# In real usage, this would be loaded from an actual BCCWJ frequency file

mock_bccwj_data = [
    ('日頃', 'ヒゴロ', '名詞', 1250),
    ('本', 'ホン', '名詞', 8500),
    ('読む', 'ヨム', '動詞', 3200),
    ('好き', 'スキ', '形容動詞', 2100),
    ('勉強', 'ベンキョウ', '名詞', 4200),
    ('大切', 'タイセツ', '形容動詞', 1800),
    ('思う', 'オモウ', '動詞', 9500),
    ('努力', 'ドリョク', '名詞', 2200),
    ('実', 'ミ', '名詞', 1100),
    ('結ぶ', 'ムスブ', '動詞', 800),
    ('書く', 'カク', '動詞', 4100),
    ('表す', 'アラワス', '動詞', 1500),
    ('得意', 'トクイ', '形容動詞', 1300),
    ('問題', 'モンダイ', '名詞', 6200),
    ('難しい', 'ムズカシイ', '形容詞', 3800),
    ('今日', 'キョウ', '名詞', 5500),
    ('東京', 'トウキョウ', '名詞', 4800),
    ('話す', 'ハナス', '動詞', 3600),
    ('飲む', 'ノム', '動詞', 2400),
    ('呑む', 'ノム', '動詞', 150),
    ('国際', 'コクサイ', '名詞', 2800),
    ('協力', 'キョウリョク', '名詞', 1900),
    ('必要', 'ヒツヨウ', '形容動詞', 4500),
    ('技術', 'ギジュツ', '名詞', 3900),
    ('進歩', 'シンポ', '名詞', 1100)
]

# Create DataFrame
df_bccwj = pd.DataFrame(mock_bccwj_data, columns=['lemma', 'reading', 'pos', 'freq_bccwj'])
df_bccwj['key'] = list(zip(df_bccwj.lemma, df_bccwj.reading, df_bccwj.pos))

print("Mock BCCWJ frequency data:")
print(df_bccwj.head(10))
print(f"\nTotal entries: {len(df_bccwj)}")

Mock BCCWJ frequency data:
  lemma reading   pos  freq_bccwj               key
0    日頃     ヒゴロ    名詞        1250     (日頃, ヒゴロ, 名詞)
1     本      ホン    名詞        8500       (本, ホン, 名詞)
2    読む      ヨム    動詞        3200      (読む, ヨム, 動詞)
3    好き      スキ  形容動詞        2100    (好き, スキ, 形容動詞)
4    勉強   ベンキョウ    名詞        4200   (勉強, ベンキョウ, 名詞)
5    大切    タイセツ  形容動詞        1800  (大切, タイセツ, 形容動詞)
6    思う     オモウ    動詞        9500     (思う, オモウ, 動詞)
7    努力    ドリョク    名詞        2200    (努力, ドリョク, 名詞)
8     実       ミ    名詞        1100        (実, ミ, 名詞)
9    結ぶ     ムスブ    動詞         800     (結ぶ, ムスブ, 動詞)

Total entries: 25
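
For real data, the mock list would be replaced by rows read from a frequency file. The sketch below assumes a tab-separated file with a header row and columns named `lemma`, `lForm`, `pos`, and `frequency`; these column names are an assumption and should be checked against the header of the file actually used. `load_frequency_rows` is an illustrative helper, and the two sample rows reuse values from the mock data.

```python
import csv
import io
from typing import Dict, Tuple

def load_frequency_rows(handle) -> Dict[Tuple[str, str, str], int]:
    """Read a TSV frequency list into a {(lemma, reading, pos): frequency} dict."""
    table: Dict[Tuple[str, str, str], int] = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["lemma"], row["lForm"], row["pos"])
        # Sum duplicates in case the same key appears on several rows.
        table[key] = table.get(key, 0) + int(row["frequency"])
    return table

# In-memory stand-in for a file on disk (column names are assumed).
sample_tsv = io.StringIO(
    "lemma\tlForm\tpos\tfrequency\n"
    "日頃\tヒゴロ\t名詞\t1250\n"
    "読む\tヨム\t動詞\t3200\n"
)
freq_table = load_frequency_rows(sample_tsv)
print(freq_table[("日頃", "ヒゴロ", "名詞")])  # 1250
```

With a file on disk, `open(path, encoding="utf-8", newline="")` would take the place of `io.StringIO`.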


## 3. Plan A: MeCab (fugashi) + UniDic Direct Pipeline

### A-1 to A-3: Setup and Configuration

BCCWJ is annotated with UniDic short-unit words, so UniDic's lemma, lemma reading, and part-of-speech fields can be matched against BCCWJ frequency lists without an intermediate conversion step.

In [14]:
# A-3: Initialize fugashi with UniDic
print("Initializing Plan A: fugashi + UniDic pipeline")

# Initialize tagger with explicit UniDic path
tagger_a = fugashi.Tagger(f'-d "{unidic.DICDIR}"')
print(f"Tagger initialized with UniDic dictionary: {unidic.DICDIR}")

# Test the tagger
test_text = "日ごろから勉強している。"
tokens = list(tagger_a(test_text))
print(f"\nTest analysis of '{test_text}':")
for token in tokens:
    # token.pos is already a comma-joined string (pos1,pos2,pos3,pos4)
    print(f"  {token.surface} -> {token.feature.lemma} [{token.pos}]")

Initializing Plan A: fugashi + UniDic pipeline
Tagger initialized with UniDic dictionary: /Users/eguchi/Dropbox/teaching/Tohoku-2025/linguistic-data-analysis-I/.venv/lib/python3.12/site-packages/unidic/dicdir

Test analysis of '日ごろから勉強している。':
  日ごろ -> 日頃 [名詞,普通名詞,副詞可能,*]
  から -> から [助詞,格助詞,*,*]
  勉強 -> 勉強 [名詞,普通名詞,サ変可能,*]
  し -> 為る [動詞,非自立可能,*,*]
  て -> て [助詞,接続助詞,*,*]
  いる -> 居る [動詞,非自立可能,*,*]
  。 -> 。 [補助記号,句点,*,*]


In [18]:
# A-4: Morphological field extraction function (first attempt)
def iter_lemma_keys_plan_a(text: str, tagger) -> List[Tuple[str, str, str]]:
    """
    Extract (form, reading, pos) tuples from text using UniDic.

    Fields are read by position here. In the 29-field UniDic layout,
    index 10 is orthBase and index 11 is pronBase, so the tuples hold the
    written base form and its pronunciation, not lemma and lForm.

    Args:
        text: Input Japanese text
        tagger: fugashi Tagger instance

    Returns:
        List of (orthBase, pronBase, first character of POS) tuples
    """
    keys = []
    for m in tagger(text):
        if m.surface.strip():  # Skip empty tokens
            # m.pos is a comma-joined string, so m.pos[0] is its first
            # character (e.g. '名'), not the major category '名詞'
            pos_major = m.pos[0] if m.pos else 'UNKNOWN'
            lemma = m.feature[10] if m.feature[10] else m.surface   # orthBase
            reading = m.feature[11] if m.feature[11] else ''        # pronBase
            keys.append((lemma, reading, pos_major))
    return keys

# Test the extraction function
test_keys = iter_lemma_keys_plan_a(test_text, tagger_a)
print(f"Extracted keys from '{test_text}':")
for lemma, reading, pos in test_keys:
    print(f"  ({lemma}, {reading}, {pos})")

Extracted keys from '日ごろから勉強している。':
  (日ごろ, ヒゴロ, 名)
  (から, カラ, 助)
  (勉強, ベンキョー, 名)
  (する, スル, 動)
  (て, テ, 助)
  (いる, イル, 動)
  (。, *, 補)


In [19]:
# Second attempt with attribute fallbacks (labelled "fixed" in the output below)
def iter_lemma_keys_fixed(text: str, tagger) -> List[Tuple[str, str, str]]:
    """
    Extract (form, reading, pos) tuples from text using UniDic.

    Adds fallbacks for attribute access. The fields are still positional:
    index 10 is orthBase and index 9 is pron (surface pronunciation).
    """
    keys = []
    for m in tagger(text):
        if m.surface.strip():  # Skip empty tokens
            # m.pos is a comma-joined string; [0] takes its first character
            pos_major = m.pos[0] if m.pos else 'UNKNOWN'

            # fugashi nodes expose no `lemma` attribute, so this falls back
            # to feature[10] (orthBase)
            try:
                lemma = m.lemma if hasattr(m, 'lemma') else m.feature[10]
            except Exception:
                lemma = m.surface  # fallback

            # feature[9] is pron: the pronunciation of the inflected surface
            try:
                reading = m.feature[9] if len(m.feature) > 9 else ''
            except Exception:
                reading = ''  # fallback

            keys.append((lemma, reading, pos_major))
    return keys

# Use the fixed function
iter_lemma_keys_plan_a = iter_lemma_keys_fixed

# Test the fixed function
test_keys = iter_lemma_keys_plan_a(test_text, tagger_a)
print(f"Extracted keys from '{test_text}' (fixed version):")
for lemma, reading, pos in test_keys:
    print(f"  ({lemma}, {reading}, {pos})")

Extracted keys from '日ごろから勉強している。' (fixed version):
  (日ごろ, ヒゴロ, 名)
  (から, カラ, 助)
  (勉強, ベンキョー, 名)
  (する, シ, 動)
  (て, テ, 助)
  (いる, イル, 動)
  (。, *, 補)
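
Both versions above return 日ごろ rather than 日頃 as the lemma, and a single character as the POS. This is consistent with reading the wrong positions: in the 29-field layout shown in the `UnidicFeatures29` dump at the end of this notebook, index 10 is `orthBase`, index 9 is `pron`, and the lemma sits at index 7. The sketch below rebuilds the first twelve fields as a plain `namedtuple` (a stand-in named `Fields`; no fugashi call is made) using values copied from that dump.

```python
from collections import namedtuple

# First twelve fields of the 29-field layout, in dump order (stand-in type).
Fields = namedtuple(
    "Fields",
    "pos1 pos2 pos3 pos4 cType cForm lForm lemma orth pron orthBase pronBase",
)
feature = Fields("名詞", "普通名詞", "副詞可能", "*", "*", "*",
                 "ヒゴロ", "日頃", "日ごろ", "ヒゴロ", "日ごろ", "ヒゴロ")

# Positions read by the two functions above.
print(feature[10] == feature.orthBase)  # True
print(feature[9] == feature.pron)       # True

# Positions of the fields that resemble a (lemma, reading) key.
print(Fields._fields.index("lemma"), Fields._fields.index("lForm"))  # 7 6

# Indexing a comma-joined POS string with [0] yields a single character.
pos_string = ",".join([feature.pos1, feature.pos2, feature.pos3, feature.pos4])
print(len(pos_string[0]), pos_string.split(",")[0] == feature.pos1)  # 1 True
```

Accessing fields by name (`feature.lemma`, `feature.lForm`, `feature.pos1`) sidesteps the index bookkeeping and keeps working if a dictionary with a different field count is loaded, provided it defines those names.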


In [20]:
# A-5: Frequency analysis with BCCWJ matching
def analyze_corpus_plan_a(corpus: List[str], tagger, bccwj_df: pd.DataFrame) -> pd.DataFrame:
    """Analyze corpus using Plan A and match with BCCWJ frequencies."""
    freq = Counter()
    
    print(f"Analyzing {len(corpus)} texts with Plan A...")
    for text in corpus:
        for key in iter_lemma_keys_plan_a(text, tagger):
            freq[key] += 1
    
    # Convert to DataFrame
    rows = []
    for (lemma, reading, pos), count in freq.items():
        rows.append((lemma, reading, pos, count))
    
    df_local = pd.DataFrame(rows, columns=['lemma', 'reading', 'pos', 'freq_local'])
    df_local['key'] = list(zip(df_local.lemma, df_local.reading, df_local.pos))
    
    # Merge with BCCWJ data
    merged = df_local.merge(bccwj_df[['key', 'freq_bccwj']], on='key', how='left')
    
    return merged.sort_values('freq_local', ascending=False)

# Run Plan A analysis
results_a = analyze_corpus_plan_a(extended_corpus, tagger_a, df_bccwj)
print(f"\nPlan A Results (top 15):")
print(results_a.head(15)[['lemma', 'reading', 'pos', 'freq_local', 'freq_bccwj']])

Analyzing 30 texts with Plan A...

Plan A Results (top 15):
   lemma reading pos  freq_local  freq_bccwj
11     。       *   補          30         NaN
8      が       ガ   助          18         NaN
1      は       ワ   助          15         NaN
7      の       ノ   助          15         NaN
5      を       オ   助          12         NaN
10    です      デス   助           9         NaN
42     で       デ   助           9         NaN
15     だ       ダ   助           6         NaN
37     て       テ   助           6         NaN
41    飲む      ノン   動           6         NaN
43     、       *   補           6         NaN
48    国際    コクサイ   名           3         NaN
47     た       タ   助           3         NaN
46   しまう     シマッ   動           3         NaN
0      彼      カレ   代           3         NaN
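
Every `freq_bccwj` value above is NaN because the merge compares whole tuples, and a key matches only when all three components are equal. A small sketch of a component-wise comparison, using the 飲む row from the output above and the 飲む entry from the mock list (`diff_key` is an illustrative helper):

```python
from typing import List, Tuple

Key = Tuple[str, str, str]
FIELD_NAMES = ("lemma", "reading", "pos")

def diff_key(local: Key, reference: Key) -> List[str]:
    """Return the names of the components in which two keys differ."""
    return [name for name, a, b in zip(FIELD_NAMES, local, reference) if a != b]

local_key = ("飲む", "ノン", "動")        # as produced by Plan A above
reference_key = ("飲む", "ノム", "動詞")  # as stored in the mock BCCWJ list

print(diff_key(local_key, reference_key))  # ['reading', 'pos']
```

Here the lemma agrees, but the reading is the pronunciation of the inflected surface form and the POS is truncated to one character, so the tuple as a whole is not found.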


In [21]:
# A-6: Evaluation metrics for Plan A
def calculate_metrics(df: pd.DataFrame) -> Dict[str, float]:
    """Calculate coverage and correlation metrics."""
    # Type coverage: percentage of distinct local keys found in BCCWJ
    matched = df.dropna(subset=['freq_bccwj'])
    coverage = len(matched) / len(df) * 100
    
    # Token coverage (by frequency)
    total_tokens = df['freq_local'].sum()
    matched_tokens = matched['freq_local'].sum()
    token_coverage = matched_tokens / total_tokens * 100
    
    # Spearman correlation for matched items
    if len(matched) > 1:
        if scipy_available:
            correlation, p_value = spearmanr(matched['freq_local'], matched['freq_bccwj'])
        else:
            correlation = np.corrcoef(matched['freq_local'].rank(), matched['freq_bccwj'].rank())[0,1]
            p_value = None
    else:
        correlation, p_value = None, None
    
    return {
        'type_coverage': coverage,
        'token_coverage': token_coverage,
        'correlation': correlation,
        'p_value': p_value,
        'total_types': len(df),
        'matched_types': len(matched),
        'total_tokens': total_tokens,
        'matched_tokens': matched_tokens
    }

metrics_a = calculate_metrics(results_a)
print("Plan A Evaluation Metrics:")
for key, value in metrics_a.items():
    if isinstance(value, float):  # None and integer counts fall through to the plain print
        print(f"  {key}: {value:.3f}")
    else:
        print(f"  {key}: {value}")

Plan A Evaluation Metrics:
  type_coverage: 0.000
  token_coverage: 0.000
  correlation: None
  p_value: None
  total_types: 66
  matched_types: 0
  total_tokens: 297
  matched_tokens: 0
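
Type coverage counts distinct keys, while token coverage weights each key by its local frequency, so the two can differ widely once matches exist. The sketch below works through both on made-up counts, together with a rank correlation computed without scipy. All numbers and helper names (`average_ranks`, `pearson`) are illustrative and are not results from this notebook.

```python
from math import sqrt
from typing import List, Optional, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upward, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation; None when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

# Made-up data: four local types; the most frequent one has no reference entry.
local = [10, 3, 2, 1]
reference = [None, 4800, 2400, 150]

matched = [(l, r) for l, r in zip(local, reference) if r is not None]
type_coverage = len(matched) / len(local) * 100
token_coverage = sum(l for l, _ in matched) / sum(local) * 100

# Spearman's rho is the Pearson correlation of the ranks.
rho = pearson(average_ranks([l for l, _ in matched]),
              average_ranks([r for _, r in matched]))

print(type_coverage, token_coverage, rho)  # 75.0 37.5 1.0
```

In this example a single unmatched high-frequency type leaves type coverage at 75% but pulls token coverage down to 37.5%, which is why both figures are worth reporting.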


# Using Fugashi

In [17]:
import fugashi, unidic
from spacy.tokens import Token
# Point the tagger at the full UniDic dictionary explicitly
tagger = fugashi.Tagger(f'-d "{unidic.DICDIR}"')

if not Token.has_extension("unidic_lemmas"):
    Token.set_extension("unidic_lemmas", default=None)

def enrich_with_unidic(doc):
    text = doc.text
    # GiNZA token start index -> token
    start_map = {tok.idx: tok for tok in doc}
    cursor = 0
    for m in tagger(text):
        surf = m.surface
        start = text.find(surf, cursor)
        if start < 0:
            continue
        cursor = start + len(surf)
        tok = start_map.get(start)
        if tok:
            if tok._.unidic_lemmas is None:
                tok._.unidic_lemmas = []
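            tok._.unidic_lemmas.append(
                # m.pos is a comma-joined string, so m.pos[0] is its first character
                (m.feature.lemma, m.feature.pos1, m.pos[0])
            )
    return doc

# `doc` must already exist: a GiNZA Doc produced by nlp(...) in another cell
doc = enrich_with_unidic(doc)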
for t in doc:
    print(t.text, t._.unidic_lemmas)

彼 [('彼', '代名詞', '代')]
は [('は', '助詞', '助')]
日ごろ [('日頃', '名詞', '名')]
本 [('本', '名詞', '名')]
を [('を', '助詞', '助')]
読む [('読む', '動詞', '動')]
。 [('。', '補助記号', '補')]
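
`enrich_with_unidic` attaches a UniDic token only when its start offset coincides with the start of a GiNZA token, so a UniDic token that begins inside a longer GiNZA token is skipped. A sketch of an alternative that groups by span containment instead, written over plain surface lists so that it runs without either tokenizer. `char_spans` and `group_by_containment` are illustrative names, and the two segmentations are made up for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]

def char_spans(text: str, surfaces: List[str]) -> List[Span]:
    """Locate each surface in text, scanning left to right."""
    spans = []
    cursor = 0
    for surf in surfaces:
        start = text.find(surf, cursor)
        if start < 0:
            raise ValueError(f"surface {surf!r} not found after offset {cursor}")
        end = start + len(surf)
        spans.append((start, end))
        cursor = end
    return spans

def group_by_containment(coarse: List[Span], fine: List[Span]) -> List[List[int]]:
    """For each coarse span, list the indices of fine spans lying fully inside it."""
    groups: List[List[int]] = [[] for _ in coarse]
    for j, (fine_start, fine_end) in enumerate(fine):
        for i, (coarse_start, coarse_end) in enumerate(coarse):
            if coarse_start <= fine_start and fine_end <= coarse_end:
                groups[i].append(j)
                break
    return groups

text = "機械学習の技術"
coarse_surfaces = ["機械学習", "の", "技術"]      # made-up coarse segmentation
fine_surfaces = ["機械", "学習", "の", "技術"]   # made-up finer segmentation

groups = group_by_containment(char_spans(text, coarse_surfaces),
                              char_spans(text, fine_surfaces))
print(groups)  # [[0, 1], [2], [3]]
```

A fine token that straddles a coarse boundary is contained in no coarse span and is left unassigned by this sketch; such cases would need an explicit policy, for example assigning by largest overlap.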


In [5]:
text = "日頃からの日ごろをてっていする。"

In [6]:
import spacy
from fugashi import Tagger
import unidic   # or unidic_lite

nlp = spacy.load("ja_ginza")
tagger = Tagger(f'-d "{unidic.DICDIR}"')  # full UniDic
doc = nlp(text)
mecab_tokens = list(tagger(text))
# Next step: align by character offset and attach UniDic information to the doc's tokens

In [7]:
mecab_tokens

[日頃, から, の, 日ごろ, を, てってい, する, 。]

In [8]:
print(tagger)

<fugashi.fugashi.Tagger object at 0x1183bad80>


In [9]:
import unidic
print("Using unidic at:", unidic.DICDIR)

Using unidic at: /Users/eguchi/Dropbox/teaching/Tohoku-2025/linguistic-data-analysis-I/.venv/lib/python3.12/site-packages/unidic/dicdir


In [10]:
sample = next(iter(tagger("テスト")))
print("feature_len:", len(sample.feature))
# 17 = unidic-lite (2.1.2), around 29 = full UniDic 3.x

feature_len: 29


In [11]:
print([a for a in dir(tagger) if 'dic' in a.lower()])

['dictionary_info']


In [12]:
import fugashi
from fugashi import Tagger

tagger = Tagger()  # first, with no options
m = next(iter(tagger("日ごろ")))
print("Available attrs:", [a for a in dir(m) if not a.startswith('_')][:25])

Available attrs: ['char_type', 'feature', 'feature_raw', 'is_unk', 'length', 'pos', 'posid', 'rlength', 'stat', 'surface', 'white_space']


In [13]:
import fugashi
t = fugashi.Tagger()
print("Tagger repr:", t)   # a hint such as 'ipa' or 'unidic' may appear here (not in this run)

w = next(iter(t("日ごろ")))
print("surface:", w.surface)
print("feature_len:", len(w.feature))
print("raw feature:", w.feature)          # for a single word first

Tagger repr: <fugashi.fugashi.Tagger object at 0x13f33b5c0>
surface: 日ごろ
feature_len: 29
raw feature: UnidicFeatures29(pos1='名詞', pos2='普通名詞', pos3='副詞可能', pos4='*', cType='*', cForm='*', lForm='ヒゴロ', lemma='日頃', orth='日ごろ', pron='ヒゴロ', orthBase='日ごろ', pronBase='ヒゴロ', goshu='和', iType='*', iForm='*', fType='*', fForm='*', iConType='*', fConType='*', type='体', kana='ヒゴロ', kanaBase='ヒゴロ', form='ヒゴロ', formBase='ヒゴロ', aType='0', aConType='C2', aModType='*', lid='8605061500510720', lemma_id='31305')
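
The dump above exposes `lemma`, `lForm`, and `pos1` as named fields, which line up with the (lemma, reading, pos) columns of the mock BCCWJ table more closely than the positional indices used in Plan A. A sketch of a key builder on that basis follows. `bccwj_key`, `count_keys`, and `stand_in` are illustrative names, and the tokens are `SimpleNamespace` stand-ins so that the block runs without fugashi: the first carries values from the dump above, the second is an assumed analysis of the kanji spelling.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable, Tuple

def bccwj_key(token) -> Tuple[str, str, str]:
    """Build a (lemma, lemma reading, major POS) key from named UniDic fields."""
    f = token.feature
    return (f.lemma, f.lForm, f.pos1)

def count_keys(tokens: Iterable) -> Counter:
    """Count keys over tokens, skipping whitespace-only surfaces."""
    return Counter(bccwj_key(t) for t in tokens if t.surface.strip())

def stand_in(surface: str, lemma: str, lForm: str, pos1: str) -> SimpleNamespace:
    """Mimic the attributes of a tagger token that the functions above read."""
    feature = SimpleNamespace(lemma=lemma, lForm=lForm, pos1=pos1)
    return SimpleNamespace(surface=surface, feature=feature)

tokens = [
    stand_in("日ごろ", "日頃", "ヒゴロ", "名詞"),  # values from the dump above
    stand_in("日頃", "日頃", "ヒゴロ", "名詞"),    # assumed analysis
]
counts = count_keys(tokens)
print(counts[("日頃", "ヒゴロ", "名詞")])  # 2
```

With a real tagger the call would be `count_keys(tagger(text))`. UniDic lemmas for some words (for example 為る for し in the test output earlier) and its POS labels may still differ from the conventions of a given frequency list, so matches should be verified rather than assumed.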
