# Tokenization Based on SoyNLP

Basic data preprocessing before training soynlp:
* Remove punctuation and special characters
* Join all sentences into a single corpus
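As a minimal sketch of these two steps on toy input, the snippet below applies the same character filter with the standard-library `re` module instead of pandas. The sample strings and the helper name `clean` are made up for illustration and are not part of the notebook.

```python
import re

# Keep only ASCII letters, digits, and precomposed Hangul syllables (가-힣);
# every other character (punctuation, symbols, newlines, tabs) becomes a space.
NON_TEXT = re.compile(r'[^A-Za-z0-9가-힣]')

def clean(text: str) -> str:
    """Hypothetical helper mirroring the character filter used in this notebook."""
    return NON_TEXT.sub(' ', text)

# Toy inputs (assumed for illustration only).
samples = ["\n\t제1, 2, 3 석유류 30만ℓ 보관", "GDP 3.2% 성장."]
cleaned = [clean(s) for s in samples]

# Join the cleaned texts into one corpus string, with no separator added.
corpus = "".join(cleaned)
print(cleaned)
```

Because each removed character is replaced by exactly one space, the text length is unchanged, and a punctuation mark followed by a space turns into a double space. Note that the range `가-힣` covers only precomposed syllables, so standalone jamo such as `ㅋ` are also replaced.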

In [1]:
import pandas as pd

# Path to the raw article CSV (this file is read, not written)
data_path = '../data/조선경제_2019.09.01_2021.08.31.csv'
df = pd.read_csv(data_path)
df['body'].head()

0    \n\n\n\n\n\n\n                         1일 오후 1...
1    \n\n\n\n\n\n\n                         80대 노모와...
2    \n\t\t\t\t\t\t\t\t.\n                         ...
3    \n\t\t\t\t\t\t\t\t주말과 휴일 동안 강원지역에서 산악사고와 교통사고가...
4    \n\t\t\t\t\t\t\t\t제1, 2, 3 석유류 30만ℓ 보관 탱크에 119...
Name: body, dtype: object

In [2]:
# Replace every character that is not an ASCII letter, digit, or Hangul syllable with a space
df['body_prep'] = df['body'].str.replace(pat=r'[^A-Za-z0-9가-힣]', repl=' ', regex=True)

In [3]:
df['body_prep'].head()

0                                    1일 오후 10시 31분께...
1                                    80대 노모와 지체 장애를...
2                                                  ...
3             주말과 휴일 동안 강원지역에서 산악사고와 교통사고가 잇따랐다    ...
4             제1  2  3 석유류 30만  보관 탱크에 119구조대장 기어가 ...
Name: body_prep, dtype: object

In [4]:
# Concatenate all preprocessed articles into one string (no separator is added between articles)
corpus = "".join(df['body_prep'])

In [8]:
# Write the corpus to disk; the context manager closes the file even if the write fails
with open("../data/chosun_corpus_0928.txt", 'w', encoding='utf-8') as corpus_file:
    corpus_file.write(corpus)

## Word Extraction

In [11]:
from soynlp.utils import DoublespaceLineCorpus
from soynlp.word import WordExtractor

corpus_path = "../data/chosun_corpus_0928.txt"
sents = DoublespaceLineCorpus(corpus_path, iter_sent=True)


word_extractor = WordExtractor(
    min_frequency=100,
    min_cohesion_forward=0.05,
    min_right_branching_entropy=0.0,
)
word_extractor.train(sents)  # accepts a list of str or a similar iterable of sentences
words = word_extractor.extract()  # dict mapping each candidate word to its scores

training was done. used memory 1.995 Gb
all cohesion probabilities was computed. # words = 22273
all branching entropies was computed # words = 354305
all accessor variety was computed # words = 354305
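The tokenizer in the next section relies on the `cohesion_forward` score of each extracted word. As a rough illustration of what that score measures, the sketch below computes it from prefix counts on a toy word list, assuming the definition given in the soynlp documentation: (count(c1..cn) / count(c1)) ** (1 / (n - 1)). The word list and the function name `forward_cohesion` are assumptions for illustration; this is not soynlp's implementation.

```python
from collections import Counter

def forward_cohesion(word, prefix_counts):
    """Hypothetical helper: (count(word) / count(first char)) ** (1 / (len(word) - 1))."""
    n = len(word)
    if n < 2 or prefix_counts[word[0]] == 0:
        return 0.0  # undefined for single characters or unseen prefixes
    return (prefix_counts[word] / prefix_counts[word[0]]) ** (1 / (n - 1))

# Toy eojeols (space-separated units); assumed data, not taken from the corpus.
eojeols = ["사고가", "사고는", "사고를", "사람이"]

# Count every left-side substring (prefix) of every eojeol.
prefix_counts = Counter(e[:i] for e in eojeols for i in range(1, len(e) + 1))

for candidate in ["사고", "사고가", "사람"]:
    print(candidate, forward_cohesion(candidate, prefix_counts))
```

In this toy data `사고` follows `사` in three of four eojeols, so it scores higher than `사람` or the longer `사고가`, which is the behavior that lets a cohesion score separate a noun from its attached particle.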


## Tokenizer

In [14]:
from soynlp.tokenizer import LTokenizer

# Use each extracted word's forward cohesion as its score for the L (left) part
cohesion_score = {word:score.cohesion_forward for word, score in words.items()}
tokenizer = LTokenizer(scores=cohesion_score)

In [21]:
df['tokenized'] = [tokenizer.tokenize(body) for body in df['body_prep']]
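To make the tokenization step more concrete, here is a simplified sketch of the idea behind L-tokenization: each space-separated eojeol is split into a left part (L) and a remainder (R), choosing the left substring with the highest score. The function `l_tokenize` and the score values are hypothetical, and the sketch leaves out options that soynlp's `LTokenizer` provides, so treat it as an illustration rather than the library's algorithm.

```python
def l_tokenize(sentence, scores):
    """Hypothetical simplified L-tokenizer: best-scoring prefix becomes L, the rest R."""
    tokens = []
    for eojeol in sentence.split():
        # Try every cut position; prefer the highest score, then the longer prefix on ties.
        best_cut = max(
            range(1, len(eojeol) + 1),
            key=lambda e: (scores.get(eojeol[:e], 0.0), e),
        )
        tokens.append(eojeol[:best_cut])
        if best_cut < len(eojeol):
            tokens.append(eojeol[best_cut:])  # remainder (R part), if any
    return tokens

# Made-up scores for illustration; in the notebook these come from cohesion_score.
scores = {"사고": 0.75, "사고가": 0.5, "강원": 0.6}
result = l_tokenize("강원지역에서 사고가 잇따랐다", scores)
print(result)
```

An eojeol with no scored prefix is kept whole, since every cut ties at zero and the longest prefix wins.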

In [23]:
df.to_csv("../data/joongang_accident_token_df.csv")