# Tutorial 8-1. Natural Language Processing (NLP)

**GOAL**: Get a first taste of processing natural language text with Python!

### 0. Installation

Before starting this tutorial, install the following two packages:
- English: `nltk` package
- Korean: `konlpy` package

You can install them by running the following lines in Anaconda Prompt:
```
>> conda install nltk
>> pip install konlpy
```

If you are using Windows, you also have to install a JDK, because KoNLPy depends on Java (see slides).
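To confirm the installation worked before moving on, a quick check like the one below may help. This is a minimal sketch using only the standard library; the helper name `check_installed` is made up for illustration and is not part of `nltk` or `konlpy`.

```python
import importlib.util
import shutil


def check_installed(package_names):
    """Map each package name to True if Python can locate it, else False."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in package_names
    }


status = check_installed(["nltk", "konlpy"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")

# KoNLPy relies on Java; shutil.which returns None when no `java`
# executable is on the PATH.
java_path = shutil.which("java")
print("java:", java_path if java_path else "not found on PATH")
```

If a package is reported as missing, rerun the matching install command in the same environment that runs the notebook.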

### 1. English NLP using NLTK package

[NLTK](https://www.nltk.org/) is a pioneering NLP package built in Python.

First, import the `nltk` package and download a tokenizer and a POS tagger.

In [1]:
import nltk

nltk.download('punkt') # tokenizer
nltk.download('averaged_perceptron_tagger') # POS tagger

[nltk_data] Downloading package punkt to /home/gugu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/gugu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Try your own sentences. You can tokenize, POS tag, or extract nouns.

In [2]:
sentence = "By the end of this course I will be a great data scientist! 100% sure!"

tokens = nltk.word_tokenize(sentence.lower()) # lowercase
print(tokens)

tagged = nltk.pos_tag(tokens)
print(tagged)

nouns = [word for word, tag in tagged if tag[:2] == 'NN']
print(nouns)

['by', 'the', 'end', 'of', 'this', 'course', 'i', 'will', 'be', 'a', 'great', 'data', 'scientist', '!', '100', '%', 'sure', '!']
[('by', 'IN'), ('the', 'DT'), ('end', 'NN'), ('of', 'IN'), ('this', 'DT'), ('course', 'NN'), ('i', 'NN'), ('will', 'MD'), ('be', 'VB'), ('a', 'DT'), ('great', 'JJ'), ('data', 'NNS'), ('scientist', 'NN'), ('!', '.'), ('100', 'CD'), ('%', 'NN'), ('sure', 'JJ'), ('!', '.')]
['end', 'course', 'i', 'data', 'scientist', '%']
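The noun list above still contains `'i'` and `'%'`, because the filter only looks at the tag prefix. One possible way to tidy this up is to also require alphabetic tokens of a minimum length. The sketch below works on the tagged output copied from the previous cell, so it runs without NLTK; `extract_nouns` and its `min_length` parameter are hypothetical names chosen for this example.

```python
# Tagged tokens copied from the output of the previous cell.
tagged = [
    ('by', 'IN'), ('the', 'DT'), ('end', 'NN'), ('of', 'IN'),
    ('this', 'DT'), ('course', 'NN'), ('i', 'NN'), ('will', 'MD'),
    ('be', 'VB'), ('a', 'DT'), ('great', 'JJ'), ('data', 'NNS'),
    ('scientist', 'NN'), ('!', '.'), ('100', 'CD'), ('%', 'NN'),
    ('sure', 'JJ'), ('!', '.'),
]


def extract_nouns(tagged_tokens, min_length=2):
    """Keep tokens whose tag starts with 'NN' and that are alphabetic
    and at least `min_length` characters long."""
    return [
        word
        for word, tag in tagged_tokens
        if tag.startswith('NN') and word.isalpha() and len(word) >= min_length
    ]


print(extract_nouns(tagged))
# ['end', 'course', 'data', 'scientist']
```

The length and `isalpha` checks are a simple heuristic, not a general rule: they would also drop legitimate one-letter or hyphenated nouns, so adjust them to your data.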


### 2. Korean NLP using KoNLPy package

[KoNLPy](https://konlpy.org/) is a Python package for Korean NLP.

Its POS taggers include Hannanum (`Hannanum`), Kkma (`Kkma`), and Open Korean Text (`Okt`, formerly known as Twitter).

Here, we will use the Okt tagger. Let's load it.

In [3]:
from konlpy.tag import Okt

okt = Okt()

Try your own sentence. You can tokenize, POS tag, or extract nouns and phrases.

In [4]:
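# English gloss: "The data analysis class is so much fun that I'm getting dizzy..."
sentence = '데이터 분석 수업이 넘나 재밌어서 현기증이 나요...ㅋ_ㅠ'
# English gloss: "The more I study, the more I realize how much I don't know. I learned a lot, but... I forgot it all? lol. Still, I keep studying. Because it's fun!"
# sentence = '공부를 하면할수록 모르는게 많다는 것을 알게 됩니다. 배운건 많았는데... 다 까먹어버렸네요? ㅋㅋ 그래도 계속 공부합니다. 재밌으니까!'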

tokens = okt.morphs(sentence)
print(tokens)

tagged = okt.pos(sentence)
print(tagged)

nouns = okt.nouns(sentence)
print(nouns)

phrases = okt.phrases(sentence)
print(phrases)

['데이터', '분석', '수업', '이', '넘나', '재밌어서', '현기증', '이', '나요', '...', 'ㅋ', '_', 'ㅠ']
[('데이터', 'Noun'), ('분석', 'Noun'), ('수업', 'Noun'), ('이', 'Josa'), ('넘나', 'Verb'), ('재밌어서', 'Adjective'), ('현기증', 'Noun'), ('이', 'Josa'), ('나요', 'Eomi'), ('...', 'Punctuation'), ('ㅋ', 'KoreanParticle'), ('_', 'Punctuation'), ('ㅠ', 'KoreanParticle')]
['데이터', '분석', '수업', '현기증']
['데이터', '데이터 분석', '데이터 분석 수업', '현기증', '분석', '수업']
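Because `okt.pos` returns `(token, tag)` pairs, you can regroup its output by tag to see at a glance which tokens fall under each category. The sketch below uses the tagged output copied from the cell above, so it runs without KoNLPy or Java; `group_by_tag` is a hypothetical helper written for this example.

```python
from collections import defaultdict

# Tagged tokens copied from the output of the previous cell.
tagged = [
    ('데이터', 'Noun'), ('분석', 'Noun'), ('수업', 'Noun'), ('이', 'Josa'),
    ('넘나', 'Verb'), ('재밌어서', 'Adjective'), ('현기증', 'Noun'),
    ('이', 'Josa'), ('나요', 'Eomi'), ('...', 'Punctuation'),
    ('ㅋ', 'KoreanParticle'), ('_', 'Punctuation'), ('ㅠ', 'KoreanParticle'),
]


def group_by_tag(tagged_tokens):
    """Collect tokens under their POS tag, preserving sentence order."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)


groups = group_by_tag(tagged)
print(groups['Noun'])  # ['데이터', '분석', '수업', '현기증']
print(groups['Josa'])  # ['이', '이']
```

The `'Noun'` group matches what `okt.nouns` returned for this sentence, which is a handy sanity check when you try your own sentences.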
