# I. Basic Concepts
---

<table class="table"><tbody><tr><th>English</th><th>Korean</th><th>Description</th></tr><tr><td>Document</td><td>문서</td><td>A single unit of text</td></tr><tr><td>Corpus</td><td>말뭉치</td><td>A set of documents</td></tr><tr><td>Token</td><td>토큰</td><td>A meaningful element of a text, such as a word, phrase, or symbol</td></tr><tr><td>Morpheme</td><td>형태소</td><td>The smallest meaningful unit in a language</td></tr><tr><td>POS</td><td>품사</td><td>Part of speech (e.g., noun)</td></tr></tbody></table>

- Source: https://www.lucypark.kr/courses/2015-dm/text-mining.html
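
To make the terms in the table concrete, here is a minimal sketch using only the standard library. The two-document corpus is made up for illustration, and whitespace splitting is a stand-in for real tokenization, which is covered later in this notebook.

```python
from collections import Counter

# A corpus is a set of documents; here each document is a short string.
# (Illustrative data, not taken from any real corpus.)
corpus = [
    "Mr. Kim, How are you?",
    "I'm fine. Thank you, and you?",
]

# Crude tokenization: split each document on whitespace.
# Punctuation stays attached to words, which real tokenizers avoid.
tokenized_corpus = [document.split() for document in corpus]

# Count how often each token occurs across the whole corpus.
token_counts = Counter(token for tokens in tokenized_corpus for token in tokens)

print(len(corpus))            # number of documents
print(tokenized_corpus[0])    # tokens of the first document
print(token_counts["you?"])   # occurrences of the token "you?"
```

Because punctuation stays glued to the words, `"you?"` and `"you,"` are counted as different tokens, which is one reason a dedicated tokenizer is worth using.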

# II. NLP in English

In [7]:
import nltk

In [27]:
texts = "Mr. Kim, How are you? I'm fine. Thank you, and you?"
# How many sentences are there in total?

### 1. Sentence Splitting; End of Sentence (EOS) Detection

In [30]:
# A naive approach: split on periods...
texts.split(".")

['Mr', " Kim, How are you? I'm fine", ' Thank you, and you?']

In [31]:
# NLTK already provides a sentence tokenizer (it needs the Punkt tokenizer models to be downloaded)
sentences = nltk.tokenize.sent_tokenize(texts)
print(sentences)

['Mr. Kim, How are you?', "I'm fine.", 'Thank you, and you?']
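
The naive split breaks on the period in "Mr." and ignores "?" entirely. The sketch below shows the kind of rule that fixes both problems: treat ".", "?" and "!" as sentence ends unless the word is a known abbreviation. It is a toy illustration, not how `sent_tokenize` works internally; the `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical names introduced here.

```python
# Hypothetical, deliberately tiny abbreviation list for illustration only.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def split_sentences(text):
    """Split text into sentences at '.', '?' or '!', skipping known abbreviations."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        # A word ends a sentence if it ends in terminal punctuation
        # and is not one of the listed abbreviations.
        if word[-1] in ".?!" and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without terminal punctuation
        sentences.append(" ".join(current))
    return sentences

texts = "Mr. Kim, How are you? I'm fine. Thank you, and you?"
print(split_sentences(texts))
```

On this input the rule gives the same three sentences as the NLTK output above, but a fixed abbreviation list does not generalize, which is why a trained model such as Punkt is preferable in practice.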


### 2. Tokenization
- Split each sentence into tokens.

In [42]:
# Word tokenization
tokens_list = [nltk.word_tokenize(sent) for sent in sentences]
print(tokens_list)

[['Mr.', 'Kim', ',', 'How', 'are', 'you', '?'], ['I', "'m", 'fine', '.'], ['Thank', 'you', ',', 'and', 'you', '?']]
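
For comparison, a rough regular-expression tokenizer can be written with the standard library alone. This is a simplified sketch, not NLTK's algorithm: the `simple_tokenize` helper is a name introduced here, and it splits off every punctuation mark, so it handles "Mr." and "I'm" differently from `nltk.word_tokenize`.

```python
import re

# A token is either a run of word characters or a single
# non-space, non-word character (i.e., a punctuation mark).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(sentence):
    """Return word and punctuation tokens found in the sentence."""
    return TOKEN_PATTERN.findall(sentence)

print(simple_tokenize("Mr. Kim, How are you?"))
print(simple_tokenize("I'm fine."))
```

The regex version yields `'Mr'` and `'.'` as separate tokens and breaks "I'm" into `'I'`, `"'"`, `'m'`, whereas the NLTK output above keeps `'Mr.'` together and produces the clitic `"'m"`.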


### 3. Part-of-Speech (POS) Tagging
- Assign part-of-speech information to each token.
- [Tag reference](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [43]:
# POS tagging
pos_tagged_tokens = [nltk.pos_tag(tokens) for tokens in tokens_list]
print(pos_tagged_tokens)

[[('Mr.', 'NNP'), ('Kim', 'NNP'), (',', ','), ('How', 'NNP'), ('are', 'VBP'), ('you', 'PRP'), ('?', '.')], [('I', 'PRP'), ("'m", 'VBP'), ('fine', 'JJ'), ('.', '.')], [('Thank', 'NNP'), ('you', 'PRP'), (',', ','), ('and', 'CC'), ('you', 'PRP'), ('?', '.')]]
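
Tagged output is a list of `(token, tag)` pairs per sentence, so ordinary Python tools are enough to inspect it. The sketch below copies the output printed above into a literal (so it runs without NLTK) and counts tags and extracts pronouns; the variable names are introduced here for illustration.

```python
from collections import Counter

# The tagged sentences, copied from the output above.
pos_tagged_tokens = [
    [('Mr.', 'NNP'), ('Kim', 'NNP'), (',', ','), ('How', 'NNP'),
     ('are', 'VBP'), ('you', 'PRP'), ('?', '.')],
    [('I', 'PRP'), ("'m", 'VBP'), ('fine', 'JJ'), ('.', '.')],
    [('Thank', 'NNP'), ('you', 'PRP'), (',', ','), ('and', 'CC'),
     ('you', 'PRP'), ('?', '.')],
]

# Flatten the per-sentence lists into one list of (token, tag) pairs.
flat = [pair for sentence in pos_tagged_tokens for pair in sentence]

# Count how many tokens received each tag.
tag_counts = Counter(tag for _, tag in flat)

# Collect every token tagged as a personal pronoun (PRP).
pronouns = [word for word, tag in flat if tag == 'PRP']

print(tag_counts['NNP'], tag_counts['PRP'])
print(pronouns)
```

Counting tags this way also makes questionable labels easy to spot: "How" and "Thank" are tagged `NNP` (proper noun) here, most likely because they are capitalized at the start of a sentence.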


### 4. Parsing

### 5. Chunking

In [48]:
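# Run named-entity chunking on each POS-tagged sentence
for chunk in nltk.ne_chunk_sents(pos_tagged_tokens):
    chunk.draw()  # opens a separate window showing the chunk tree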
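
The basic idea of chunking, grouping adjacent tokens by their tags, can be sketched without NLTK. The example below is a much simpler stand-in for `ne_chunk_sents`, not a reimplementation of it: it merges consecutive `NNP` tokens from the tagged output shown earlier into candidate name chunks. The `proper_noun_runs` helper is a hypothetical name introduced here.

```python
from itertools import groupby

# The tagged sentences, copied from the POS tagging output above.
pos_tagged_tokens = [
    [('Mr.', 'NNP'), ('Kim', 'NNP'), (',', ','), ('How', 'NNP'),
     ('are', 'VBP'), ('you', 'PRP'), ('?', '.')],
    [('I', 'PRP'), ("'m", 'VBP'), ('fine', 'JJ'), ('.', '.')],
    [('Thank', 'NNP'), ('you', 'PRP'), (',', ','), ('and', 'CC'),
     ('you', 'PRP'), ('?', '.')],
]

def proper_noun_runs(tagged_sentence):
    """Join each run of consecutive NNP tokens into one candidate chunk."""
    runs = []
    # groupby yields maximal runs of adjacent pairs sharing the same key.
    for is_nnp, group in groupby(tagged_sentence, key=lambda pair: pair[1] == 'NNP'):
        if is_nnp:
            runs.append(' '.join(word for word, _ in group))
    return runs

for sentence in pos_tagged_tokens:
    print(proper_noun_runs(sentence))
```

"Mr. Kim" comes out as one chunk, but so do "How" and "Thank", because the chunks are only as good as the tags they are built on.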

# III. NLP in Korean

In [50]:
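# KoNLPy depends on JPype1 and a Java runtime; the import fails if they are missing
import konlpy
import nltk
from pprint import pprint  # used below to print the analysis results

# Morphological analysis with the Kkma and Twitter morphological analyzers
from konlpy.tag import Kkma
from konlpy.tag import Twitter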

ImportError: No module named 'jpype'

In [None]:
kkma = Kkma()
twitter = Twitter()

In [51]:
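# "Hey friend, did you eat? I already ate."
sentence_ver1 = "친구야, 밥먹었니? 난 먹었단다."
# The same question in informal chat style, with "ㅋㅋㅋ" (laughter, like "lol") fused onto the words
sentence_ver2 = "얔ㅋㅋㅋ 밥먹었냨ㅋㅋㅋㅋ"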

In [None]:
pprint(kkma.pos(sentence_ver1))

In [None]:
pprint(kkma.pos(sentence_ver2))

In [52]:
# "We had an after-school study session at Toz Tower, Exit 1 of Gangnam Station."
sentence2 = "강남역 1번 출구 토즈타워에서 방과 후 스터디를 했습니다."
words = konlpy.tag.Twitter().pos(sentence2)

NameError: name 'konlpy' is not defined

In [53]:
# Define a chunk grammar, or chunking rules, then chunk
# A noun phrase (NP) is a run of consecutive nouns optionally followed by a suffix;
# verb phrases (VP) and adjective phrases (AP) are defined in the same way.
grammar = """
NP: {<N.*>*<Suffix>?}   # Noun phrase
VP: {<V.*>*}            # Verb phrase
AP: {<A.*>*}            # Adjective phrase
"""

In [None]:
parser = nltk.RegexpParser(grammar)

In [None]:
chunks = parser.parse(words)
print("# Print whole tree")
chunks.pprint()  # pprint() prints the tree itself and returns None

In [None]:
chunks.pprint()

In [None]:
print("\n# Print noun phrases only")
for subtree in chunks.subtrees():
    if subtree.label()=='NP':
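        # Each leaf is a (word, tag) pair; join the words of the noun phrase
        print(' '.join(e[0] for e in subtree))
        subtree.pprint()  # pprint() prints the subtree itself and returns None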