<a href="https://colab.research.google.com/github/ejpark78/codelab/blob/master/bert/huggingface_konlpy/01_huggingface_konlpy_usage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Source: https://github.com/lovit/huggingface_konlpy/blob/master/tutorials/01_huggingface_konlpy_usage.ipynb

## KoNLPy as pre-tokenizer

In [2]:
from huggingface_konlpy.tokenizers_konlpy import KoNLPyPreTokenizer
from konlpy.tag import Komoran

sent_ko = '신종 코로나바이러스 감염증(코로나19) 사태가 심각합니다'
komoran_pretok = KoNLPyPreTokenizer(Komoran())

print(komoran_pretok(sent_ko))

신종 코로나바이러스 감염증 ( 코로나 19 ) 사태 가 심각 하 ㅂ니다
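
The output above is the input sentence split into morphemes and re-joined with single spaces. The sketch below illustrates that idea only. It is not the `KoNLPyPreTokenizer` implementation: `StubTagger` and `SpaceJoinPreTokenizer` are hypothetical names, and the morpheme analysis is hard-coded from the tail of the Komoran output so the example runs without KoNLPy or Java. The only assumption about KoNLPy is that a tagger exposes `morphs(phrase)`, which returns a list of morpheme strings.

```python
class StubTagger:
    """Hypothetical stand-in for a KoNLPy tagger; only `morphs` is mimicked."""

    # Hard-coded analysis copied from the tail of the Komoran output above
    # (illustration only, no real morphological analysis happens here).
    _table = {"사태가 심각합니다": ["사태", "가", "심각", "하", "ㅂ니다"]}

    def morphs(self, phrase):
        return self._table[phrase]


class SpaceJoinPreTokenizer:
    """Sketch of a pre-tokenizer: analyze morphemes, then join with spaces."""

    def __init__(self, tagger):
        self.tagger = tagger

    def __call__(self, sentence):
        return " ".join(self.tagger.morphs(sentence))


pretok = SpaceJoinPreTokenizer(StubTagger())
result = pretok("사태가 심각합니다")
print(result)
```

Because the result is a plain space-separated string, any whitespace-based WordPiece trainer can consume it unchanged, which is what the next cells rely on.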


In [3]:
!mkdir -p ./model/KomoranBertWordPieceTokenizer/ ./model/BertStyleMecab/

In [4]:
from huggingface_konlpy.tokenizers_konlpy import KoNLPyPretokBertWordPieceTokenizer
from huggingface_konlpy.transformers_konlpy import KoNLPyPretokBertTokenizer


# Train a BERT WordPiece vocabulary on text pre-tokenized by Komoran.
komoran_bertwordpiece_tokenizer = KoNLPyPretokBertWordPieceTokenizer(
    konlpy_pretok=komoran_pretok)

komoran_bertwordpiece_tokenizer.train(
    files=['./data/2020-07-29_covid_news_sents.txt'],
    vocab_size=3000)

# Save the vocabulary as covid-vocab.txt in the model directory.
komoran_bertwordpiece_tokenizer.save_model(
    directory='./model/KomoranBertWordPieceTokenizer/',
    name='covid')

# Load the trained vocabulary into the BERT tokenizer.
komoran_pretok_berttokenizer = KoNLPyPretokBertTokenizer(
    konlpy_pretok=komoran_pretok,
    vocab_file='./model/KomoranBertWordPieceTokenizer/covid-vocab.txt')

In [5]:
from huggingface_konlpy import compose

# Encode to vocabulary ids, then map each id back to its token string.
indices = komoran_pretok_berttokenizer.encode(sent_ko)
tokens = [komoran_pretok_berttokenizer.ids_to_tokens[idx] for idx in indices]

print(' '.join(compose(tokens)))

[CLS] 신종 코로나바이러스 감염증 ( 코로나 19 ) 사태 가 심 ##각 하 [UNK] [SEP]
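
The trailing `[UNK]` in the output above stands where the pre-tokenizer produced `ㅂ니다`. Presumably no piece of that morpheme made it into the 3000-entry vocabulary. The sketch below shows the general BERT-style greedy longest-match-first WordPiece lookup on a toy vocabulary. `wordpiece` and `toy_vocab` are hypothetical names for illustration; this is neither the library's code nor the trained covid vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes [UNK].
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not the trained one).
toy_vocab = {"사태", "가", "심", "##각", "하"}

pretokenized = "사태 가 심각 하 ㅂ니다".split()
tokens = [p for w in pretokenized for p in wordpiece(w, toy_vocab)]
print(" ".join(tokens))
```

With this toy vocabulary, `심각` is split into `심 ##각` and `ㅂ니다` collapses to `[UNK]`, the same pattern as the tail of the output above. Training with a larger `vocab_size` is one way to reduce such cases.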


## KoNLPy WordPiece Tokenizer

### with and without tag

In [6]:
from huggingface_konlpy.tokenizers_konlpy import KoNLPyWordPieceTokenizer
from konlpy.tag import Mecab

mecab_wordpiece_notag = KoNLPyWordPieceTokenizer(Mecab(), use_tag=False)
print(' '.join(mecab_wordpiece_notag.tokenize(sent_ko)))

신종 코로나 ##바이러스 감염증 ##( ##코로나 ##19 ##) 사태 ##가 심각 ##합니다


In [7]:
mecab_wordpiece_usetag = KoNLPyWordPieceTokenizer(Mecab(), use_tag=True)
print(' '.join(mecab_wordpiece_usetag.tokenize(sent_ko)))

신종/NNG 코로나/NNP ##바이러스/NNG 감염증/NNG ##(/SSO ##코로나/NNP ##19/SN ##)/SSC 사태/NNG ##가/JKS 심각/XR ##합니다/XSA+EC
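
Comparing the two outputs above suggests the marking rule: the first morpheme of each space-delimited word (eojeol) is left bare, every later morpheme in the same word gets the `##` prefix, and with `use_tag=True` the part-of-speech tag is appended after a slash. The sketch below reproduces that rule on a hard-coded analysis. `mark_continuations` and `analysis` are hypothetical names, and the analysis is copied from the Mecab output rather than computed, so this is an illustration and not the `KoNLPyWordPieceTokenizer` implementation.

```python
def mark_continuations(eojeol_morphs, use_tag=False):
    """Prefix non-initial morphemes of each eojeol with ##, optionally add tags."""
    tokens = []
    for morphs in eojeol_morphs:
        for i, (morph, tag) in enumerate(morphs):
            piece = morph if i == 0 else "##" + morph
            if use_tag:
                piece = f"{piece}/{tag}"
            tokens.append(piece)
    return tokens


# One inner list per eojeol; (morpheme, tag) pairs copied from the output above.
analysis = [
    [("신종", "NNG")],
    [("코로나", "NNP"), ("바이러스", "NNG")],
    [("사태", "NNG"), ("가", "JKS")],
]

notag = mark_continuations(analysis)
usetag = mark_continuations(analysis, use_tag=True)
print(" ".join(notag))
print(" ".join(usetag))
```

Keeping the tag inside the token means the same surface form with different tags becomes different vocabulary entries, so a tagged vocabulary tends to need more entries to cover the same text.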


In [8]:
from huggingface_konlpy.tokenizers_konlpy import KoNLPyBertWordPieceTrainer

mecab_wordpiece_notag_trainer = KoNLPyBertWordPieceTrainer(Mecab(), use_tag=False)

mecab_wordpiece_notag_trainer.train(files=['./data/2020-07-29_covid_news_sents.txt'])

mecab_wordpiece_notag_trainer.save_model('./model/BertStyleMecab/', 'notag')

Initialize alphabet 1/1: 100%|██████████| 70964/70964 [00:00<00:00, 98441.64it/s]
Train vocab 1/1: 100%|██████████| 70964/70964 [00:10<00:00, 6504.49it/s]


[/home/jovyan/bert/huggingface_konlpy/model/BertStyleMecab/notag-vocab.txt]


In [9]:
from huggingface_konlpy.transformers_konlpy import KoNLPyBertTokenizer

konlpy_bert_notag = KoNLPyBertTokenizer(
    konlpy_wordpiece=KoNLPyWordPieceTokenizer(Mecab(), use_tag=False),
    vocab_file='./model/BertStyleMecab/notag-vocab.txt'
)

print(' '.join(konlpy_bert_notag.tokenize(sent_ko)))

신종 코로나 ##바이러스 감염증 ##( ##코로나 ##19 ##) 사태 ##가 심각 ##합니다


In [10]:
# Same steps as above, but each morpheme keeps its part-of-speech tag.
mecab_wordpiece_usetag_trainer = KoNLPyBertWordPieceTrainer(Mecab(), use_tag=True)

mecab_wordpiece_usetag_trainer.train(files=['./data/2020-07-29_covid_news_sents.txt'])
mecab_wordpiece_usetag_trainer.save_model('./model/BertStyleMecab/', 'usetag')

konlpy_bert_usetag = KoNLPyBertTokenizer(
    konlpy_wordpiece=KoNLPyWordPieceTokenizer(Mecab(), use_tag=True),
    vocab_file='./model/BertStyleMecab/usetag-vocab.txt'
)

# Encode to vocabulary ids, then map each id back to its token string.
indices = konlpy_bert_usetag.encode(sent_ko)
tokens = [konlpy_bert_usetag.ids_to_tokens[idx] for idx in indices]

print(' '.join(compose(tokens)))

Initialize alphabet 1/1: 100%|██████████| 70964/70964 [00:00<00:00, 105826.48it/s]
Train vocab 1/1: 100%|██████████| 70964/70964 [00:10<00:00, 6705.41it/s]


[/home/jovyan/bert/huggingface_konlpy/model/BertStyleMecab/usetag-vocab.txt]
[CLS] 신종/NNG 코로나/NNP ##바이러스/NNG 감염증/NNG ##(/SSO ##코로나/NNP ##19/SN ##)/SSC 사태/NNG ##가/JKS 심각/XR 합 니 다 [SEP]
