# Morphological Analyzers

## NLTK
- One of the oldest natural language processing libraries
- Does not support Korean

In [1]:
import nltk 
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [5]:
sentence= """
At eight o'clock on Thursday morning
Arthur didn't feel very good."""
sentence

"\nAt eight o'clock on Thursday morning\nArthur didn't feel very good."

- Tokenization

In [6]:
tokens = nltk.word_tokenize(sentence)
tokens

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 'Arthur',
 'did',
 "n't",
 'feel',
 'very',
 'good',
 '.']
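
The split of `didn't` into `did` and `n't` follows the Penn Treebank convention of detaching contractions. The sketch below is a toy regex tokenizer built on the standard library only; it is not NLTK's actual algorithm. The name `toy_tokenize` and both patterns are assumptions made for illustration, and only the `n't` contraction is handled.

```python
import re

def toy_tokenize(text):
    """Toy Treebank-style tokenizer (illustrative only, not NLTK's implementation)."""
    # Detach the contraction "n't" from the word it is attached to: didn't -> did n't
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Match, in order: a bare "n't", a word with optional inner apostrophes,
    # or a single punctuation mark
    return re.findall(r"n't|\w+(?:'\w+)*|[^\w\s]", text)

sentence = "\nAt eight o'clock on Thursday morning\nArthur didn't feel very good."
tokens = toy_tokenize(sentence)
print(tokens)
```

For this sentence the toy version yields the same token list as the cell above, but real text has many more cases (other contractions, quotes, abbreviations), which is why a library tokenizer is preferable in practice.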

- Apply part-of-speech tags to the words in the sentence

In [7]:
tagged = nltk.pos_tag(tokens)
tagged

[('At', 'IN'),
 ('eight', 'CD'),
 ("o'clock", 'NN'),
 ('on', 'IN'),
 ('Thursday', 'NNP'),
 ('morning', 'NN'),
 ('Arthur', 'NNP'),
 ('did', 'VBD'),
 ("n't", 'RB'),
 ('feel', 'VB'),
 ('very', 'RB'),
 ('good', 'JJ'),
 ('.', '.')]
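
The tags above come from the Penn Treebank tagset. As a reading aid, the sketch below maps the tags that appear in this output to short descriptions. `PENN_TAGS` and `describe` are hypothetical names, not part of NLTK, and the dictionary covers only the tags seen here rather than the full tagset.

```python
# Short descriptions for the Penn Treebank tags seen in the output above
# (a partial glossary written for this example, not an NLTK API).
PENN_TAGS = {
    "IN": "preposition or subordinating conjunction",
    "CD": "cardinal number",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
    "VB": "verb, base form",
    "RB": "adverb",
    "JJ": "adjective",
    ".": "sentence-final punctuation",
}

def describe(tagged):
    """Attach a readable description to each (token, tag) pair."""
    return [(token, tag, PENN_TAGS.get(tag, "unknown")) for token, tag in tagged]

tagged_sample = [("Arthur", "NNP"), ("did", "VBD"), ("n't", "RB"), ("feel", "VB")]
for token, tag, meaning in describe(tagged_sample):
    print(token, tag, meaning, sep="\t")
```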

In [8]:
# Keep only the nouns and verbs (tags starting with "N" or "V")
[token for token, pos in tagged if pos.startswith(("N", "V"))]

["o'clock", 'Thursday', 'morning', 'Arthur', 'did', 'feel']

## spaCy
- Python-based library
- Supports many languages

In [9]:
import spacy
from spacy.lang.en.examples import sentences

- Besides the part-of-speech tag, each token exposes further attributes such as lemma, dependency label, and shape

In [10]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentences[0])
print(doc.text)
print('-' * 80)
print("text", "lemma", "pos", "tag", "dep", "shape", "is_alpha", "is_stop", sep="\t")
for token in doc:
    print(
        token.text # surface form
        , token.lemma_ # lemma (base form)
        , token.pos_ # coarse part of speech
        , token.tag_ # fine-grained tag
        , token.dep_ # dependency relation
        , token.shape_ # orthographic shape
        , token.is_alpha # alphabetic characters only?
        , token.is_stop # stop word?
        , sep='\t')

Apple is looking at buying U.K. startup for $1 billion
--------------------------------------------------------------------------------
text	lemma	pos	tag	dep	shape	is_alpha	is_stop
Apple	Apple	PROPN	NNP	nsubj	Xxxxx	True	False
is	be	AUX	VBZ	aux	xx	True	True
looking	look	VERB	VBG	ROOT	xxxx	True	False
at	at	ADP	IN	prep	xx	True	True
buying	buy	VERB	VBG	pcomp	xxxx	True	False
U.K.	U.K.	PROPN	NNP	dobj	X.X.	False	False
startup	startup	NOUN	NN	dobj	xxxx	True	False
for	for	ADP	IN	prep	xxx	True	True
$	$	SYM	$	quantmod	$	False	False
1	1	NUM	CD	compound	d	False	False
billion	billion	NUM	CD	pobj	xxxx	True	False
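
The shape column abstracts each token's characters: uppercase letters become `X`, lowercase letters `x`, digits `d`, and runs of the same mark are capped at four. The function below is a standard-library approximation of that behavior, written to reproduce the values in the table above. `approx_shape` is a hypothetical name, and spaCy's own implementation may differ in edge cases.

```python
def approx_shape(text):
    """Approximate spaCy's token.shape_ (illustrative sketch, not spaCy's code)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            mark = "X" if char.isupper() else "x"
        elif char.isdigit():
            mark = "d"
        else:
            mark = char  # punctuation and symbols are kept as-is
        # Track how long the current run of identical marks is
        run = run + 1 if mark == last else 1
        last = mark
        if run <= 4:  # runs longer than four marks are truncated
            shape.append(mark)
    return "".join(shape)

for word in ["Apple", "looking", "U.K.", "$", "1", "billion"]:
    print(word, approx_shape(word), sep="\t")
```

Because the shape discards the actual letters, it can serve as a compact feature for spotting patterns such as capitalized names or abbreviations.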


### Korean

In [12]:
import locale
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [13]:
!python -m spacy download ko_core_news_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ko-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/downloa

In [14]:
import spacy 
from spacy.lang.ko.examples import sentences

In [15]:
nlp = spacy.load("ko_core_news_sm")
doc = nlp(sentences[0])
print(doc.text)
print('-'*80)
print("text", "lemma", "pos", "tag", "dep", "shape", "is_alpha", "is_stop", sep="\t")
for token in doc:
    print(
        token.text # surface form
        , token.lemma_ # lemma (base form)
        , token.pos_ # coarse part of speech
        , token.tag_ # fine-grained tag
        , token.dep_ # dependency relation
        , token.shape_ # orthographic shape
        , token.is_alpha # alphabetic characters only?
        , token.is_stop # stop word?
        , sep='\t')

# for token in doc:
#     print(token.text, token.pos_, token.dep_)

애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다.
--------------------------------------------------------------------------------
text	lemma	pos	tag	dep	shape	is_alpha	is_stop
애플이	애플이	CCONJ	nq+jcj	dislocated	xxx	True	False
영국의	영국+의	PROPN	nq+jcm	nmod	xxx	True	False
스타트업을	스타트업+을	NOUN	ncn+jcs	dislocated	xxxx	True	False
10억	10+억	NUM	nnc+nnc	nummod	ddx	False	False
달러에	달러+에	ADV	nbu+jca	obl	xxx	True	False
인수하는	인수+하+는	VERB	ncpa+xsv+etm	acl	xxxx	True	False
것을	것+을	NOUN	nbn+jco	obj	xx	True	False
알아보고	알+아+보+고	AUX	pvg+ecx+px+ecx	ROOT	xxxx	True	False
있다	있+다	AUX	px+ef	aux	xx	True	False
.	.	PUNCT	sf	punct	.	False	False
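
In the Korean output, the lemma and tag columns join morphemes and their tags with `+`. Assuming that format, a small helper can pair each morpheme with its tag. `split_morphemes` is a hypothetical name, and rows whose two columns have a different number of parts (such as `애플이` above) are returned unsplit rather than guessed at.

```python
def split_morphemes(lemma, tag):
    """Pair '+'-joined morphemes with '+'-joined tags, as printed in the table above."""
    morphemes = lemma.split("+")
    tags = tag.split("+")
    if len(morphemes) != len(tags):
        # The columns do not line up, so keep the row whole instead of guessing
        return [(lemma, tag)]
    return list(zip(morphemes, tags))

print(split_morphemes("인수+하+는", "ncpa+xsv+etm"))
print(split_morphemes("애플이", "nq+jcj"))
```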


## KoNLPy
- A library dedicated to Korean

In [16]:
!git clone https://github.com/SOMJANG/Mecab-ko-for-Google-Colab.git
!bash /content/Mecab-ko-for-Google-Colab/install_mecab-ko_on_colab190912.sh

Cloning into 'Mecab-ko-for-Google-Colab'...
remote: Enumerating objects: 115, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 115 (delta 11), reused 10 (delta 3), pack-reused 91
Receiving objects: 100% (115/115), 1.27 MiB | 5.08 MiB/s, done.
Resolving deltas: 100% (50/50), done.
Installing konlpy.....
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Collecting JPype1>=0.7.0
  Downloading JPype1-1.4.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (465 kB)
Installing collected packages: JPype1, konlpy
Successfully installed JPype1-1.4.1 k

### Okt 

In [17]:
from konlpy.tag import Okt

In [18]:
okt = Okt()

In [19]:
txt = '아버지가방에들어가신다.'

In [20]:
okt.pos(txt)

[('아버지', 'Noun'),
 ('가방', 'Noun'),
 ('에', 'Josa'),
 ('들어가신다', 'Verb'),
 ('.', 'Punctuation')]

In [21]:
[token[0] for token in okt.pos(txt) if token[1][0] in 'NVJ']

['아버지', '가방', '에', '들어가신다']

In [22]:
okt.nouns(txt)

['아버지', '가방']

### Mecab
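- Robust on text written without spaces: it reads the sample as 아버지 / 가 / 방 (father, subject particle, room), where Okt above produced 아버지 / 가방 (father, bag)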

In [24]:
from konlpy.tag import Mecab

In [25]:
mec = Mecab()

In [26]:
txt = '아버지가방에들어가신다.'

In [27]:
mec.pos(txt)

[('아버지', 'NNG'),
 ('가', 'JKS'),
 ('방', 'NNG'),
 ('에', 'JKB'),
 ('들어가', 'VV'),
 ('신다', 'EP+EF'),
 ('.', 'SF')]
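
Mecab uses a Sejong-style tagset, and a single token can carry a compound tag such as `EP+EF`. The sketch below expands each tag into a short description. `MECAB_TAGS` and `explain` are hypothetical names, not part of KoNLPy, and the glossary lists only the tags that appear in this output.

```python
# Partial glossary for the tags in the output above (written for this example)
MECAB_TAGS = {
    "NNG": "common noun",
    "JKS": "subject case particle",
    "JKB": "adverbial case particle",
    "VV": "verb",
    "EP": "pre-final ending",
    "EF": "sentence-final ending",
    "SF": "period, question mark, or exclamation mark",
}

def explain(tag):
    """Expand a possibly compound tag like 'EP+EF' into readable parts."""
    return " + ".join(MECAB_TAGS.get(part, "unknown") for part in tag.split("+"))

pairs = [("아버지", "NNG"), ("가", "JKS"), ("방", "NNG"), ("에", "JKB"),
         ("들어가", "VV"), ("신다", "EP+EF"), (".", "SF")]
for token, tag in pairs:
    print(token, tag, explain(tag), sep="\t")
```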

In [28]:
[token[0] for token in mec.pos(txt) if token[1][0] in 'NVJ']

['아버지', '가', '방', '에', '들어가']

In [29]:
mec.nouns(txt)

['아버지', '방']
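
Filtering on the first letter of the tag works for both taggers here, but it is loose: Okt also has a `Number` tag, which starts with `N` and would pass the same check. A stricter variant, sketched below with the tagger outputs from the cells above pasted in as literals, matches on explicit tag prefixes. `keep_by_prefix` is a hypothetical helper, and the prefix choices are assumptions made for this example.

```python
def keep_by_prefix(pairs, prefixes):
    """Return tokens whose tag starts with any of the given prefixes."""
    return [token for token, tag in pairs if tag.startswith(prefixes)]

# Outputs of okt.pos(txt) and mec.pos(txt), copied from the cells above
okt_pairs = [("아버지", "Noun"), ("가방", "Noun"), ("에", "Josa"),
             ("들어가신다", "Verb"), (".", "Punctuation")]
mecab_pairs = [("아버지", "NNG"), ("가", "JKS"), ("방", "NNG"), ("에", "JKB"),
               ("들어가", "VV"), ("신다", "EP+EF"), (".", "SF")]

# str.startswith accepts a tuple, so several prefixes can be tested at once
okt_words = keep_by_prefix(okt_pairs, ("Noun", "Verb"))    # nouns and verbs from Okt
mecab_words = keep_by_prefix(mecab_pairs, ("NN", "VV"))    # nouns and verbs from Mecab
print(okt_words)
print(mecab_words)
```

Keeping the prefixes as a parameter lets the same helper serve both tagsets, which differ in naming (`Noun` versus `NNG`) even though they describe the same sentence.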