# Natural Language Processing
## 4️⃣ Korean Natural Language Processing

### Characteristics of Korean

- In general, Natural Language Processing models predict the meaning and characteristics of a sentence from the **words** it contains.
- However, the **criteria for what counts as a word in Korean are not clear-cut**.
- A Korean word typically **combines a semantic part (the stem) with a grammatical part (an ending or particle)**, e.g. 먹- + -다, 먹- + -었다, 먹- + -는.
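
The stem-plus-ending structure described in the list above can be illustrated with a small sketch. It only concatenates strings; the variable names are made up for illustration, and slicing off the stem works here only because the forms were built by hand. In practice a morphological analyzer performs this split.

```python
# One semantic part (stem) combined with several grammatical parts (endings)
stem = "먹"
endings = ["다", "었다", "는"]

# Each combination is a different surface form in running text
surface_forms = [stem + ending for ending in endings]
print(surface_forms)

# A model that treats every surface form as its own word sees three entries,
# while a stem-level vocabulary sees only one
surface_vocab = set(surface_forms)
stem_vocab = {form[:len(stem)] for form in surface_forms}
print(len(surface_vocab), len(stem_vocab))
```

This is why splitting on whitespace alone tends to inflate the vocabulary for Korean text.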

### KoNLPy

**KoNLPy** is a Python library that **extracts Korean words (morphemes)** from a sentence using several dictionary-based Korean morphological analyzers: Hannanum (한나눔), Kkma (꼬꼬마), Komoran, Mecab, and Open Korean Text (Okt).

In [2]:
from konlpy.tag import Kkma

sent = "안녕 나는 우혁이야 반가워. 너의 이름은 뭐야?"

kkma = Kkma()

print(kkma.nouns(sent))
print(kkma.pos(sent)) # returns each morpheme with its part-of-speech tag
print(kkma.sentences(sent)) # splits the text into sentences

['안녕', '우혁', '너', '이름', '뭐']
[('안녕', 'NNG'), ('나', 'VV'), ('는', 'ETD'), ('우혁', 'UN'), ('이야', 'JX'), ('반갑', 'VA'), ('어', 'ECS'), ('.', 'SF'), ('너', 'NP'), ('의', 'JKG'), ('이름', 'NNG'), ('은', 'JX'), ('뭐', 'NP'), ('야', 'JX'), ('?', 'SF')]
['안녕 나는 우혁이야 반가워. 너의 이름은 뭐야?']


In [3]:
from konlpy.tag import Okt

okt = Okt()

print(okt.nouns(sent))
print(okt.pos(sent))
print(okt.pos(sent, stem=True)) # stem=True reduces each token to its dictionary form (e.g. 반가워 -> 반갑다)

['안녕', '나', '우혁', '너', '이름', '뭐']
[('안녕', 'Noun'), ('나', 'Noun'), ('는', 'Josa'), ('우혁', 'Noun'), ('이야', 'Josa'), ('반가워', 'Adjective'), ('.', 'Punctuation'), ('너', 'Noun'), ('의', 'Josa'), ('이름', 'Noun'), ('은', 'Josa'), ('뭐', 'Noun'), ('야', 'Josa'), ('?', 'Punctuation')]
[('안녕', 'Noun'), ('나', 'Noun'), ('는', 'Josa'), ('우혁', 'Noun'), ('이야', 'Josa'), ('반갑다', 'Adjective'), ('.', 'Punctuation'), ('너', 'Noun'), ('의', 'Josa'), ('이름', 'Noun'), ('은', 'Josa'), ('뭐', 'Noun'), ('야', 'Josa'), ('?', 'Punctuation')]
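
Once a sentence is tagged, downstream code usually keeps only some parts of speech. The sketch below filters the `okt.pos(sent)` output copied from the cell above, so it runs without KoNLPy installed. `filter_by_tag` is a hypothetical helper written for this example, not part of KoNLPy.

```python
# Tagged tokens copied from the okt.pos(sent) output above
tagged = [('안녕', 'Noun'), ('나', 'Noun'), ('는', 'Josa'), ('우혁', 'Noun'),
          ('이야', 'Josa'), ('반가워', 'Adjective'), ('.', 'Punctuation'),
          ('너', 'Noun'), ('의', 'Josa'), ('이름', 'Noun'), ('은', 'Josa'),
          ('뭐', 'Noun'), ('야', 'Josa'), ('?', 'Punctuation')]

def filter_by_tag(tagged_tokens, keep_tags):
    """Keep the tokens whose part-of-speech tag is in keep_tags (hypothetical helper)."""
    return [token for token, tag in tagged_tokens if tag in keep_tags]

# Nouns only; for this sentence the result matches okt.nouns(sent) above
nouns = filter_by_tag(tagged, {'Noun'})
print(nouns)

# Content words: drop particles (Josa) and punctuation
content_words = filter_by_tag(tagged, {'Noun', 'Adjective', 'Verb'})
print(content_words)
```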


### soynlp

Dictionary-based analyzers such as those in KoNLPy can run into the **out-of-vocabulary (OOV)** problem: a word missing from the dictionary is split incorrectly, as shown below.

In [4]:
from konlpy.tag import Kkma

sent = "보코하람 테러로 소말리아에서 전쟁이 있었어요."

kkma = Kkma()

print(kkma.nouns(sent))

['보', '보코', '코', '테러', '소말리', '전쟁']


To address the out-of-vocabulary problem, **soynlp** infers word boundaries from patterns that occur frequently in the training data.

In [None]:
from soynlp.utils import DoublespaceLineCorpus
from soynlp.word import WordExtractor
from soynlp.noun import LRNounExtractor_v2

train_data = DoublespaceLineCorpus('/data.txt')

noun_extractor = LRNounExtractor_v2()
nouns = noun_extractor.train_extract(train_data) # returns nouns extracted from the data

word_extractor = WordExtractor()
word_extractor.train(train_data)
words = word_extractor.extract() # returns words extracted from the data
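
To give a feel for what "patterns that frequently occur" can mean, here is a simplified sketch of a cohesion-style score computed from prefix counts. It is an illustration under stated assumptions, not soynlp's implementation: the toy corpus, `prefix_counts`, and `cohesion` are all made up for this example.

```python
from collections import Counter

def prefix_counts(eojeols, max_len=6):
    """Count every prefix (up to max_len characters) of every space-separated unit."""
    counts = Counter()
    for eojeol in eojeols:
        for i in range(1, min(len(eojeol), max_len) + 1):
            counts[eojeol[:i]] += 1
    return counts

def cohesion(word, counts):
    """Score how strongly the characters of `word` stick together.

    Uses (count(word) / count(first character)) ** (1 / (len(word) - 1)).
    Returns 0.0 for single characters and unseen words.
    """
    if len(word) < 2 or counts[word] == 0:
        return 0.0
    return (counts[word] / counts[word[0]]) ** (1 / (len(word) - 1))

# Toy corpus (an assumption): the unknown name appears with several particles
corpus = ["보코하람이", "보코하람은", "보코하람의", "보코하람", "보도가", "보았다"]
counts = prefix_counts(corpus)

# Pick the prefix of one unit with the highest score as the word candidate
eojeol = "보코하람이"
prefixes = [eojeol[:i] for i in range(2, len(eojeol) + 1)]
best = max(prefixes, key=lambda p: cohesion(p, counts))
print(best)
```

Because 보코하람 keeps recurring while the character after it varies, the score peaks at the word boundary even though no dictionary contains the word.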

In [1]:
from soynlp.utils import DoublespaceLineCorpus
from soynlp.noun import LRNounExtractor_v2

sent = '트와이스 아이오아이 좋아여 tt가 저번에 1위 했었죠?'

# The training data is loaded from corpus_path into train_data.
corpus_path = 'articles.txt'
train_data = DoublespaceLineCorpus(corpus_path)
print("Number of training documents: %d" % len(train_data))

# Use an LRNounExtractor_v2 object to extract candidate nouns from train_data and store them in nouns.
noun_extractor = LRNounExtractor_v2()
nouns = noun_extractor.train_extract(train_data)

# Check the number of extracted nouns.
print(len(nouns))

# Using the extracted nouns, collect the nouns that appear in sent into the sent_nouns list.
sent_nouns = []
for word in sent.split():
    if word in nouns:
        sent_nouns.append(word)

print(sent_nouns)

Number of training documents: 30091
[Noun Extractor] use default predictors
[Noun Extractor] num features: pos=3929, neg=2321, common=107
[Noun Extractor] counting eojeols
[EojeolCounter] n eojeol = 403896 from 30091 sents. mem=0.172 Gb                    
[Noun Extractor] complete eojeol counter -> lr graph
[Noun Extractor] has been trained. #eojeols=4434442, mem=0.918 Gb
[Noun Extractor] batch prediction was completed for 119705 words
[Noun Extractor] checked compounds. discovered 70639 compounds
[Noun Extractor] postprocessing detaching_features : 109312 -> 92205
[Noun Extractor] postprocessing ignore_features : 92205 -> 91999
[Noun Extractor] postprocessing ignore_NJ : 91999 -> 90643
[Noun Extractor] 90643 nouns (70639 compounds) with min frequency=1
[Noun Extractor] flushing was done. mem=0.976 Gb                    
[Noun Extractor] 76.63 % eojeols are covered
90643
['트와이스', '아이오아이', '1위']
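
The loop above keeps a token only when the whole token is a known noun, so a noun followed by a particle (such as `tt가`) cannot match. One possible refinement is a longest-prefix lookup, sketched below. `toy_nouns` is a hypothetical stand-in for the trained `nouns`; whether `tt` is actually among the extracted nouns is not known from the output above.

```python
def longest_noun_prefix(eojeol, noun_set):
    """Return the longest prefix of eojeol found in noun_set, or None."""
    for end in range(len(eojeol), 0, -1):
        if eojeol[:end] in noun_set:
            return eojeol[:end]
    return None

# Hypothetical noun set standing in for the trained `nouns`
toy_nouns = {'트와이스', '아이오아이', '1위', 'tt'}
sent = '트와이스 아이오아이 좋아여 tt가 저번에 1위 했었죠?'

sent_nouns = []
for word in sent.split():
    noun = longest_noun_prefix(word, toy_nouns)
    if noun is not None:
        sent_nouns.append(noun)

print(sent_nouns)
```

Prefix matching can also over-match when a short noun happens to begin a longer unrelated word, so it is a trade-off rather than a strict improvement.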


### Similarity between sentences

#### Jaccard Similarity

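$$\text{Jaccard Similarity between } Sent_1 \text{ and } Sent_2 = \frac{\text{number of distinct words shared by both sentences}}{\text{number of distinct words appearing in either sentence}} = \frac{|W_1 \cap W_2|}{|W_1 \cup W_2|}$$

Here $W_1$ and $W_2$ are the sets of words in $Sent_1$ and $Sent_2$.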

In [2]:
import nltk

sent_1 = "오늘 중부지방을 중심으로 소나기가 예상됩니다"
sent_2 = "오늘 전국이 맑은 날씨가 예상됩니다"

def tokenize(sent):
    # Split on whitespace and keep the distinct tokens
    return set(sent.split())

# Implement a function that returns the Jaccard similarity
def cal_jaccard_sim(sent1, sent2):

    words_sent1 = tokenize(sent1)
    words_sent2 = tokenize(sent2)
    
    intersection = len(words_sent1 & words_sent2)
    
    union = len(words_sent1 | words_sent2)

    return float(intersection / union)

# Check the result of cal_jaccard_sim().
print(cal_jaccard_sim(sent_1, sent_2))

# Compute the same value with jaccard_distance() from nltk (similarity = 1 - distance).
set1 = tokenize(sent_1)
set2 = tokenize(sent_2)
nltk_jaccard_sim = 1 - nltk.jaccard_distance(set1, set2)

# This gives the same value as cal_jaccard_sim().
print(nltk_jaccard_sim)

0.25
0.25
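
The printed value can be checked by hand. With whitespace tokenization, write $W_1$ and $W_2$ for the word sets of `sent_1` and `sent_2` (notation used here for illustration). Each sentence has 5 distinct tokens, and the two share 오늘 and 예상됩니다:

$$W_1 \cap W_2 = \{\text{오늘}, \text{예상됩니다}\}, \qquad |W_1 \cup W_2| = 5 + 5 - 2 = 8, \qquad \frac{|W_1 \cap W_2|}{|W_1 \cup W_2|} = \frac{2}{8} = 0.25$$

Note that particles stay attached under whitespace tokenization, so two tokens that share a stem but carry different particles count as different words.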


#### Cosine Similarity

**Cosine Similarity** is the **cosine of the angle between two sentence vectors**: it is 1 when the vectors point in the same direction and 0 when they are orthogonal.

$$\text{Cosine Similarity between } A \text{ and } B = \frac{A \cdot B}{\|A\| \, \|B\|}$$

In [4]:
from numpy import sqrt, dot
from scipy.spatial import distance
from sklearn.metrics import pairwise

sent_1 = [0.3, 0.2, 0.2133, 0.3891, 0.8852, 0.586, 1.244, 0.777, 0.882]
sent_2 = [0.03, 0.223, 0.1, 0.4, 2.931, 0.122, 0.5934, 0.8472, 0.54]
sent_3 = [0.13, 0.83, 0.827, 0.92, 0.1, 0.32, 0.28, 0.34, 0]

# Implement a function that returns the cosine similarity
def cal_cosine_sim(v1, v2):
    return dot(v1, v2) / (sqrt(dot(v1, v1)) * sqrt(dot(v2, v2)))
    
print(cal_cosine_sim(sent_1, sent_2))

# Use distance.cosine() from scipy to calculate cosine similarity.
scipy_cosine_sim = 1 - distance.cosine(sent_1, sent_2)

print(scipy_cosine_sim)

# Use pairwise.cosine_similarity() from scikit-learn to calculate cosine similarity.
all_sent = [sent_1, sent_2, sent_3]
scikit_learn_cosine_sim = pairwise.cosine_similarity(all_sent)

print(scikit_learn_cosine_sim)

0.7137224896052109
0.7137224896052109
[[1.         0.71372249 0.4876509 ]
 [0.71372249 1.         0.2801926 ]
 [0.4876509  0.2801926  1.        ]]
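
To connect the similarity values back to angles, the following sketch recomputes cosine similarity with the standard library only and converts it to degrees. The two-dimensional vectors and the function names are chosen for illustration; they are not taken from the cells above.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between a and b in degrees, recovered from the cosine similarity."""
    # Clamp to [-1, 1] to guard against floating-point rounding
    sim = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(sim))

# Vectors 45 degrees apart: similarity is 1 / sqrt(2), roughly 0.707
print(cosine_similarity([1, 0], [1, 1]), angle_degrees([1, 0], [1, 1]))

# Orthogonal vectors: similarity 0, angle 90 degrees
print(cosine_similarity([1, 0], [0, 1]), angle_degrees([1, 0], [0, 1]))

# Same direction, different length: similarity stays at about 1
print(cosine_similarity([1, 2], [2, 4]))
```

The last line shows that cosine similarity ignores vector length, which is why it is often preferred when sentences differ in length.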
