<a href="https://colab.research.google.com/github/exphon/exphon2026/blob/main/MFA_Korean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Forced-align Korean read speech using Montreal Forced Aligner (MFA)


**Note**: The notebook takes about 20 minutes to run.

**DATA**: https://www.korean.go.kr/

Modu Corpus (모두의 말뭉치) -> 'Seoul Read-Speech Corpus' (서울말 낭독체 발화 말뭉치)

Expected results:
![kfalign.png](https://github.com/exphon/exphon2026/blob/main/fig/korean_mfa.png?raw=1)



# STEP 1: Write install_mfa.sh to install Miniconda

https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner

In [None]:
%%writefile install_mfa.sh
#!/bin/bash

## a script to install Montreal Forced Aligner (MFA)

root_dir=${1:-/tmp/mfa}

# Clean up previous installation
if [ -d "$root_dir" ]; then
    echo "Removing existing MFA installation at $root_dir"
    rm -rf "$root_dir"
fi
mkdir -p "$root_dir"
cd "$root_dir" || exit 1

# download miniconda3
wget -q --show-progress https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $root_dir/miniconda3 -f

# Initialize conda for the current shell to enable 'conda activate' etc.
eval "$($root_dir/miniconda3/bin/conda shell.bash hook)"

# Accept Conda Terms of Service
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create MFA environment with a specific Python version (e.g., 3.9) for compatibility
conda create -n aligner python=3.9 -y

# Activate the environment
conda activate aligner

# Install Montreal Forced Aligner into the activated environment
conda install -c conda-forge montreal-forced-aligner -y

echo -e "\n======== DONE =========="
echo -e "\nTo activate MFA, run: source $root_dir/miniconda3/bin/activate aligner"
echo -e "\nTo delete MFA, run: rm -rf $root_dir"
echo -e "\nSee https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html for how to use MFA"

### Run install_mfa.sh

In [None]:
# download and install mfa
INSTALL_DIR="/tmp/mfa" # path to install directory

!bash ./install_mfa.sh {INSTALL_DIR}
# The mfa command must run in a shell where the conda environment is activated.
# `bash -c "source ... && mfa ..."` keeps the activation and the command in one shell.
!bash -c "source {INSTALL_DIR}/miniconda3/bin/activate aligner && export MPLBACKEND=Agg && mfa align --help"
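
The single-shell pattern above can be wrapped in a small Python helper so that other `mfa` commands are easier to launch from a notebook cell. This is a minimal sketch: `in_aligner_env` and `run_in_aligner_env` are hypothetical helper names, and the code assumes the install directory and the `aligner` environment created by install_mfa.sh.

```python
import shlex
import subprocess

def in_aligner_env(command, install_dir="/tmp/mfa"):
    """Build an argv that activates the 'aligner' env, then runs `command` in the same shell."""
    activate = f"source {shlex.quote(install_dir)}/miniconda3/bin/activate aligner"
    return ["bash", "-c", f"{activate} && export MPLBACKEND=Agg && {command}"]

def run_in_aligner_env(command, install_dir="/tmp/mfa"):
    """Run `command` inside the env and return the CompletedProcess (assumes MFA is installed)."""
    return subprocess.run(in_aligner_env(command, install_dir), capture_output=True, text=True)

# Example (only meaningful after install_mfa.sh has finished):
# print(run_in_aligner_env("mfa version").stdout)
```

Returning the argument list separately from running it makes the generated command easy to inspect before anything is executed.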

# STEP 2: Upload the Korean data
  - Upload the wav & txt files using the `Files` explorer
  - Move the uploaded files into the data folder using the `Terminal`
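
The move into the data folder can also be done from Python, which makes it easy to check that every recording has a transcript with the same base name. This is a minimal sketch: `collect_corpus` is a hypothetical helper name, and it assumes the uploaded files sit directly in the source directory with no unrelated .txt or .wav files next to them.

```python
from pathlib import Path
import shutil

def collect_corpus(src=".", dest="data"):
    """Move *.wav and *.txt from src into dest; return base names lacking a wav/txt partner."""
    src, dest = Path(src), Path(dest)
    dest.mkdir(exist_ok=True)
    for pattern in ("*.wav", "*.txt"):
        # list() so the directory is not modified while it is being iterated
        for f in list(src.glob(pattern)):
            shutil.move(str(f), str(dest / f.name))
    wavs = {p.stem for p in dest.glob("*.wav")}
    txts = {p.stem for p in dest.glob("*.txt")}
    return sorted(wavs ^ txts)  # names present as only one of the two file types

# Example (run from the upload directory):
# print("Unpaired files:", collect_corpus(".", "data"))
```

An empty list means every wav file has a matching txt file and vice versa.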

In [None]:
# Sample Korean Data

# /content# mkdir data
# /content# mv *.txt data
# /content# mv *.wav data



In [None]:
!cat ./data/fv01_t01_s05.txt

from IPython.display import Audio
Audio("./data/fv01_t01_s05.wav")


## STEP 3: Install sox

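This code converts the .wav files with sox, an audio processing tool. `--norm=-3` normalizes the volume to -3 dBFS so that the level stays consistent across clips. `-r 16k` sets the sampling rate to 16 kHz, and `-c 1` converts the audio to a single (mono) channel. Finally, the output path `` `pwd`/wav/{} `` saves each processed file in the ./wav directory.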
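
Once the conversion cell below has run, the result can be checked by reading each file header with Python's standard `wave` module. This is a minimal sketch: `describe_wav` is a hypothetical helper name, and it assumes the converted files are plain PCM WAV files in ./wav.

```python
import glob
import wave

def describe_wav(path):
    """Return (sample_rate_hz, n_channels, duration_sec) read from a WAV header."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return rate, w.getnchannels(), w.getnframes() / rate

# Flag any clip that is not 16 kHz mono
for path in sorted(glob.glob("./wav/*.wav")):
    rate, channels, seconds = describe_wav(path)
    status = "ok" if (rate, channels) == (16000, 1) else "CHECK"
    print(f"{status}\t{path}\t{rate} Hz\t{channels} ch\t{seconds:.2f} s")
```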

In [None]:
# install sox tool
!sudo apt install -q -y sox
# convert to 16k audio clips
!mkdir -p ./wav
!echo "normalize audio clips to sample rate of 16k"
!find ./data -name "*.wav" -type f -execdir sox --norm=-3 {} -r 16k -c 1 `pwd`/wav/{} \;
!echo "Number of clips" $(ls ./wav/ | wc -l)

## STEP 4: Run Korean MFA in the terminal, then download the zip file

In [None]:
# Activating the virtual environment inside the notebook is difficult, so run the steps below in the terminal

"""
! source /tmp/mfa/miniconda3/bin/activate aligner


(aligner) /content# mfa version
(aligner) /content# mfa model download acoustic korean_mfa
(aligner) /content# mfa model download dictionary korean_mfa
(aligner) /content# mfa model inspect acoustic korean_mfa
(aligner) /content# mfa align data/ korean_mfa korean_mfa korean/
(aligner) /content# pip install python-mecab-ko jamo
(aligner) /content# mfa align data/ korean_mfa korean_mfa korean/
(aligner) /content# zip kalign.zip korean/*
"""


## Note: OOV words need to be resolved.
- Out-of-vocabulary (OOV) words are tagged as spn (spoken noise)

Expected results:
![spn_example.png](https://github.com/exphon/exphon2026/blob/main/fig/korean_mfa_spn.png?raw=1)
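
A rough way to list likely OOV words before aligning is to compare the transcript tokens against the first column of the pronunciation dictionary. The sketch below assumes whitespace tokenization with edge punctuation stripped, which may differ from MFA's own tokenization, so treat the output as an estimate. `load_dictionary_words` and `find_oov` are hypothetical helper names; the dictionary path is the one used later in this notebook.

```python
import glob
import os
import string

def load_dictionary_words(dict_path):
    """Collect the first whitespace-delimited field of every non-empty dictionary line."""
    words = set()
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if fields:
                words.add(fields[0])
    return words

def find_oov(transcript_paths, known_words):
    """Return sorted whitespace-split tokens (edge punctuation stripped) not in known_words."""
    oov = set()
    for path in transcript_paths:
        with open(path, encoding="utf-8") as f:
            for token in f.read().split():
                token = token.strip(string.punctuation)
                if token and token not in known_words:
                    oov.add(token)
    return sorted(oov)

# Assumed location of the downloaded dictionary; skipped if it is not there
DICT_PATH = "/root/Documents/MFA/pretrained_models/dictionary/korean_mfa.dict"
if os.path.exists(DICT_PATH):
    print(find_oov(glob.glob("data/*.txt"), load_dictionary_words(DICT_PATH)))
```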


# G2P

In [None]:
%%writefile mccab_wordlist.py
from mecab import MeCab
import glob

mecab = MeCab()
words = set()

for fname in glob.glob("data/*.txt"):
    with open(fname, encoding="utf-8") as f:
        text = f.read().strip()
        for morph in mecab.morphs(text):
            words.add(morph)

with open("wordlist.txt", "w", encoding="utf-8") as f:
    for w in sorted(words):
        f.write(w + "\n")

In [None]:
"""
python mccab_wordlist.py
mfa model download g2p korean_mfa
mfa model inspect g2p korean_mfa
mfa g2p wordlist.txt korean_mfa g2p.dict
! cat /root/Documents/MFA/pretrained_models/dictionary/korean_mfa.dict g2p.dict \
 | sort -u > korean_mfa_extended.dict
"""

In [None]:
# see: https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/pull/480
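import re

# Normalize each entry to "word<TAB>remaining fields separated by single spaces"
with open("korean_mfa_extended.dict", encoding="utf-8") as f:
    lexicon = f.readlines()
with open("modified_korean_mfa_extended.dict", "w", encoding="utf-8") as f:
    for line in lexicon:
        if not line.strip():
            continue  # skip blank lines
        word, *phonemes = re.split(r"\s+", line.strip())
        phonemes = " ".join(phonemes)
        f.write(f"{word}\t{phonemes}\n")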

In [None]:
# mfa align data/ g2p.dict korean_mfa korean_g2p_aligned/ --clean

In [None]:
#(aligner) /content# zip kalign_new.zip korean_g2p_aligned/*


In [None]:
from google.colab import files
files.download('kalign_new.zip')

# References
- https://gist.github.com/NTT123/12264d15afad861cb897f7a20a01762e
- [A Gentle Guide to Montreal Forced Aligner by Dr. Chenzi Xu](https://chenzixu.rbind.io/resources/1forcedalignment/fa6/)