<a href="https://colab.research.google.com/github/ean0418/ean0418/blob/main/Aug06_2_Text_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Preprocessing

Text preprocessing means processing text in advance so that it fits the problem you are trying to solve.



In [1]:
import nltk # package for natural language processing
from nltk.tokenize import word_tokenize
from nltk.tokenize import WordPunctTokenizer
from tensorflow.keras.preprocessing.text import text_to_word_sequence
nltk.download('punkt') # pre-trained Punkt model for sentence boundary detection

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Tokenization

The process of splitting a sentence into words (tokens), usually followed by cleaning and normalization.

- Punctuation
  * period, comma, question mark, exclamation mark, semicolon, ...

In [2]:
text ="Slow and steady win the game. A friend in need is a friend indeed. The first property is the health."

In [4]:
print(word_tokenize(text))
# word_tokenize: splits Don't into Do and n't, and you're into you and 're
print()
print(WordPunctTokenizer().tokenize(text))
# WordPunctTokenizer: splits punctuation into separate tokens
print()
print(text_to_word_sequence(text))
# Keras text_to_word_sequence: converts all letters to lowercase,
#                              removes punctuation,
#                              keeps apostrophes in words like you're, don't, ain't


['Slow', 'and', 'steady', 'win', 'the', 'game', '.', 'A', 'friend', 'in', 'need', 'is', 'a', 'friend', 'indeed', '.', 'The', 'first', 'property', 'is', 'the', 'health', '.']

['Slow', 'and', 'steady', 'win', 'the', 'game', '.', 'A', 'friend', 'in', 'need', 'is', 'a', 'friend', 'indeed', '.', 'The', 'first', 'property', 'is', 'the', 'health', '.']

['slow', 'and', 'steady', 'win', 'the', 'game', 'a', 'friend', 'in', 'need', 'is', 'a', 'friend', 'indeed', 'the', 'first', 'property', 'is', 'the', 'health']
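
The comments above describe how contractions are handled, but the sample text contains none. The following is a minimal sketch, using only the standard library, that approximates two of the behaviours on a sentence with contractions. `wordpunct_like` and `keras_like` are hypothetical helper names, not library functions; the regular expression and the filter string are assumed to match the defaults of `WordPunctTokenizer` and `text_to_word_sequence`.

```python
import re

def wordpunct_like(text):
    # Runs of word characters, or runs of punctuation, become separate tokens
    return re.findall(r"\w+|[^\w\s]+", text)

def keras_like(text):
    # Assumed default Keras filter set: note that the apostrophe is not in it
    filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    table = str.maketrans({ch: " " for ch in filters})
    # Lowercase, replace filtered characters with spaces, split on whitespace
    return text.lower().translate(table).split()

sample = "Don't hurry, you're doing fine."
print(wordpunct_like(sample))
# ['Don', "'", 't', 'hurry', ',', 'you', "'", 're', 'doing', 'fine', '.']
print(keras_like(sample))
# ["don't", 'hurry', "you're", 'doing', 'fine']
```

Which behaviour is preferable depends on the task: keeping `don't` intact preserves the contraction as one vocabulary item, while splitting it exposes the apostrophe as its own token.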


# Sentence Tokenization

In [1]:
sentence ="""ROMEO.
He jests at scars that never felt a wound.

Juliet appears above at a window.

But soft, what light through yonder window breaks?
It is the east, and Juliet is the sun!
Arise fair sun and kill the envious moon,
Who is already sick and pale with grief,
That thou her maid art far more fair than she.
Be not her maid since she is envious;
Her vestal livery is but sick and green,
And none but fools do wear it; cast it off.
It is my lady, O it is my love!
O, that she knew she were!
She speaks, yet she says nothing. What of that?
Her eye discourses, I will answer it.
I am too bold, ’tis not to me she speaks.
Two of the fairest stars in all the heaven,
Having some business, do entreat her eyes
To twinkle in their spheres till they return.
What if her eyes were there, they in her head?
The brightness of her cheek would shame those stars,
As daylight doth a lamp; her eyes in heaven
Would through the airy region stream so bright
That birds would sing and think it were not night.
See how she leans her cheek upon her hand.
O that I were a glove upon that hand,
That I might touch that cheek."""
sentence

'ROMEO.\nHe jests at scars that never felt a wound.\n\nJuliet appears above at a window.\n\nBut soft, what light through yonder window breaks?\nIt is the east, and Juliet is the sun!\nArise fair sun and kill the envious moon,\nWho is already sick and pale with grief,\nThat thou her maid art far more fair than she.\nBe not her maid since she is envious;\nHer vestal livery is but sick and green,\nAnd none but fools do wear it; cast it off.\nIt is my lady, O it is my love!\nO, that she knew she were!\nShe speaks, yet she says nothing. What of that?\nHer eye discourses, I will answer it.\nI am too bold, ’tis not to me she speaks.\nTwo of the fairest stars in all the heaven,\nHaving some business, do entreat her eyes\nTo twinkle in their spheres till they return.\nWhat if her eyes were there, they in her head?\nThe brightness of her cheek would shame those stars,\nAs daylight doth a lamp; her eyes in heaven\nWould through the airy region stream so bright\nThat birds would sing and think it 

In [7]:
from nltk.tokenize import sent_tokenize
sent_tokenize(sentence)
# NLTK does not split sentences simply at every period
# Abbreviations such as Dr., Mrs., and Mr. are not treated as sentence boundaries, so they are handled correctly

['ROMEO.',
 'He jests at scars that never felt a wound.',
 'Juliet appears above at a window.',
 'But soft, what light through yonder window breaks?',
 'It is the east, and Juliet is the sun!',
 'Arise fair sun and kill the envious moon,\nWho is already sick and pale with grief,\nThat thou her maid art far more fair than she.',
 'Be not her maid since she is envious;\nHer vestal livery is but sick and green,\nAnd none but fools do wear it; cast it off.',
 'It is my lady, O it is my love!',
 'O, that she knew she were!',
 'She speaks, yet she says nothing.',
 'What of that?',
 'Her eye discourses, I will answer it.',
 'I am too bold, ’tis not to me she speaks.',
 'Two of the fairest stars in all the heaven,\nHaving some business, do entreat her eyes\nTo twinkle in their spheres till they return.',
 'What if her eyes were there, they in her head?',
 'The brightness of her cheek would shame those stars,\nAs daylight doth a lamp; her eyes in heaven\nWould through the airy region stream so 
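
The Romeo and Juliet passage has no abbreviations, so the point about Dr. and Mrs. is not visible in the output above. The sketch below illustrates the problem with standard-library code only. It is not how NLTK's Punkt model works (Punkt learns abbreviations from data); `ABBREVIATIONS`, `naive_split`, and `abbreviation_aware_split` are hypothetical names used only for illustration.

```python
import re

# Tiny hand-written list, for illustration only
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms."}

def naive_split(text):
    # Split after every ., ! or ? that is followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def abbreviation_aware_split(text):
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        # End a sentence only if the word is not a known abbreviation
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

sample = "Dr. Kim met Mrs. Lee today. They talked about NLP!"
print(naive_split(sample))
# ['Dr.', 'Kim met Mrs.', 'Lee today.', 'They talked about NLP!']
print(abbreviation_aware_split(sample))
# ['Dr. Kim met Mrs. Lee today.', 'They talked about NLP!']
```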

In [8]:
# KSS(Korean Sentence Splitter)
!pip install kss

Collecting kss
  Downloading kss-6.0.4.tar.gz (1.1 MB)
  Preparing metadata (setup.py) ... done
Collecting emoji==1.2.0 (from kss)
  Downloading emoji-1.2.0-py3-none-any.whl.metadata (4.3 kB)
Collecting pecab (from kss)
  Downloading pecab-1.0.8.tar.gz (26.4 MB)
  Preparing metadata (setup.py) ... done
Collecting jamo (from kss)
  Downloading jamo-0.4.1-py3-none-any.whl.metadata (2.3 kB)
Collecting hangul-jamo (from kss)
  Downloading hangul_jamo-1.0.1-py3-none-any.whl.metadata 

In [9]:
import kss

In [11]:
kor = "오늘부터 AI 시작이에요. 텍스트 전처리는 한국어가 영어보다 훨씬 난이도가 높아요. 한번 경험해봅시다"
kor

'오늘부터 AI 시작이에요. 텍스트 전처리는 한국어가 영어보다 훨씬 난이도가 높아요. 한번 경험해봅시다'

In [14]:
print(kss.split_sentences(kor))




['오늘부터 AI 시작이에요.', '텍스트 전처리는 한국어가 영어보다 훨씬 난이도가 높아요.', '한번 경험해봅시다']


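# Korean is an agglutinative language (root + affix)

Korean has particles (josa, 조사)

- A particle attaches directly to the end of a word, with no space in between
- Morpheme
  -> the smallest meaningful unit of language
    - Free morphemes: nouns, pronouns, numerals, determiners, adverbs, ...
    - Bound morphemes: must combine with another morpheme, e.g. stems, endings, affixes, particles, ...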
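
Because particles attach to words without a space, splitting Korean on whitespace leaves the particle glued to the noun. The sketch below shows a deliberately naive suffix-stripping approach and where it breaks. `JOSA` is a tiny illustrative subset and `strip_josa` is a hypothetical helper, not part of any library.

```python
# A few common particles; real Korean has many more
JOSA = ("은", "는", "이", "가", "을", "를")

def strip_josa(eojeol):
    # Return (stem, particle) if the word ends in a listed particle
    if len(eojeol) > 1:
        for josa in JOSA:
            if eojeol.endswith(josa):
                return eojeol[:-len(josa)], josa
    return eojeol, None

for word in "오늘은 날씨가 좋다".split():
    print(strip_josa(word))
# ('오늘', '은')
# ('날씨', '가')
# ('좋다', None)

# The rule misfires on nouns that merely end in the same syllable:
print(strip_josa("고양이"))  # ('고양', '이'), although 고양이 (cat) is a single noun
```

The last case is the reason a dictionary-backed morphological analyzer is normally used instead of suffix rules.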

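# Part-of-speech tagging: assigning a part of speech to each token (word) produced by word tokenization

Homonyms: the same surface form can have different meanings and parts of speech

mean : verb] to signify / noun] average / adjective] nasty, unkind

연패 (yeonpae) : a losing streak / a winning streak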



In [15]:
# NLTK / KoNLPy

In [16]:
nltk.download('averaged_perceptron_tagger') # pre-trained model used for POS tagging

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [17]:
from nltk.tag import pos_tag

In [18]:
text ="Slow and steady win the game. A friend in need is a friend indeed. The first property is the health."
text

'Slow and steady win the game. A friend in need is a friend indeed. The first property is the health.'

In [20]:
tokenized_sentence = word_tokenize(text)
print(tokenized_sentence)
print(pos_tag(tokenized_sentence))

['Slow', 'and', 'steady', 'win', 'the', 'game', '.', 'A', 'friend', 'in', 'need', 'is', 'a', 'friend', 'indeed', '.', 'The', 'first', 'property', 'is', 'the', 'health', '.']
[('Slow', 'NNP'), ('and', 'CC'), ('steady', 'JJ'), ('win', 'VB'), ('the', 'DT'), ('game', 'NN'), ('.', '.'), ('A', 'DT'), ('friend', 'NN'), ('in', 'IN'), ('need', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('friend', 'NN'), ('indeed', 'RB'), ('.', '.'), ('The', 'DT'), ('first', 'JJ'), ('property', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('health', 'NN'), ('.', '.')]


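# PRP : personal pronoun
# RB : adverb
# DT : determiner (includes articles)
# VBP : verb, present tense, not third-person singular
# VBZ : verb, present tense, third-person singular
# W~ : wh-words (WDT, WP, WRB)
# JJ : adjective
# NN : singular noun
# NNS : plural noun
# MD : modal verb
# VB : verb, base form
# VBD : verb, past tense
# VBG : gerund or present participle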

# Korean NLP: the KoNLPy Python package

Morphological analyzers available in KoNLPy
- Okt (Open Korean Text)
- Komoran
- Kkma
- Mecab
- Hannanum


In [21]:
!pip install konlpy

Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl.metadata (1.9 kB)
Collecting JPype1>=0.7.0 (from konlpy)
  Downloading JPype1-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Downloading JPype1-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (488 kB)
Installing collected packages: JPype1, konlpy
Successfully installed JPype1-1.5.0 konlpy-0.6.0


In [23]:
from konlpy.tag import Okt
from konlpy.tag import Kkma

In [24]:
okt = Okt()

print(okt.morphs("오늘은 화요일이고요. 내일은 없습니다!"))
# morphs: morphological analysis, i.e. splitting each word unit (eojeol) into morphemes, the smallest units of meaning
print(okt.pos("오늘은 화요일이고요. 내일은 없습니다!"))
# pos: part-of-speech tagging
print(okt.nouns("오늘은 화요일이고요. 내일은 없습니다!"))
# nouns: noun extraction

['오늘', '은', '화요일', '이', '고요', '.', '내일', '은', '없습니다', '!']
[('오늘', 'Noun'), ('은', 'Josa'), ('화요일', 'Noun'), ('이', 'Josa'), ('고요', 'Noun'), ('.', 'Punctuation'), ('내일', 'Noun'), ('은', 'Josa'), ('없습니다', 'Adjective'), ('!', 'Punctuation')]
['오늘', '화요일', '고요', '내일']


In [25]:
kkma = Kkma()
print(kkma.morphs("오늘은 화요일이고요. 내일은 없습니다!"))
# morphs: morphological analysis, i.e. splitting each word unit (eojeol) into morphemes, the smallest units of meaning
print(kkma.pos("오늘은 화요일이고요. 내일은 없습니다!"))
# pos: part-of-speech tagging
print(kkma.nouns("오늘은 화요일이고요. 내일은 없습니다!"))
# nouns: noun extraction

['오늘', '은', '화요일', '이', '고요', '.', '내일', '은', '없', '습니다', '!']
[('오늘', 'NNG'), ('은', 'JX'), ('화요일', 'NNG'), ('이', 'VCP'), ('고요', 'EFN'), ('.', 'SF'), ('내일', 'NNG'), ('은', 'JX'), ('없', 'VA'), ('습니다', 'EFN'), ('!', 'SF')]
['오늘', '화요일', '내일']
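
The two analyzers disagree on the same sentence. The sketch below makes the differences explicit by comparing the outputs printed above, copied in as literal lists so that it runs without KoNLPy or a Java runtime; the variable names are illustrative.

```python
# Outputs copied from the Okt and Kkma cells above
okt_morphs = ['오늘', '은', '화요일', '이', '고요', '.', '내일', '은', '없습니다', '!']
kkma_morphs = ['오늘', '은', '화요일', '이', '고요', '.', '내일', '은', '없', '습니다', '!']
okt_nouns = ['오늘', '화요일', '고요', '내일']
kkma_nouns = ['오늘', '화요일', '내일']

# Morphemes produced by one analyzer but not the other
print([m for m in okt_morphs if m not in kkma_morphs])   # ['없습니다']
print([m for m in kkma_morphs if m not in okt_morphs])   # ['없', '습니다']

# Nouns reported by Okt only
print([n for n in okt_nouns if n not in kkma_nouns])     # ['고요']
```

In this example Okt keeps 없습니다 as one token and labels 고요 a noun, while Kkma separates the stem 없 from the ending 습니다 and tags 고요 as an ending (EFN). Which segmentation is more useful depends on the downstream task.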


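# Corpus

A corpus is the body of text to be analyzed: usually sentences made up of many words, documents, or a dataset.

Splitting a corpus into tokens that suit the task is called tokenization; cleaning and normalization are needed along with it.

- Cleaning: removing noisy data from the corpus
- Normalization: unifying words that are written differently so they are treated as the same word
  1. Merge variant spellings according to rules
  ex) US USA us U.S.A
  2. Unify upper and lower case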
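
The two normalization steps above can be sketched in a few lines. `CANONICAL` is a hypothetical hand-written mapping and `normalize` a hypothetical helper; a real project would need a much larger rule set. One caveat is built into the example itself: lowercasing makes the country name `US` indistinguishable from the pronoun `us`, so this mapping is only safe when the corpus is known to use `us` for the country.

```python
# Hypothetical mapping from variant spellings to one canonical form
CANONICAL = {"us": "usa", "usa": "usa"}

def normalize(token):
    # Step 2: unify case; also drop periods so U.S.A matches USA
    key = token.lower().replace(".", "")
    # Step 1: merge known variants; otherwise just return the lowercased token
    return CANONICAL.get(key, token.lower())

print([normalize(t) for t in "US USA us U.S.A".split()])
# ['usa', 'usa', 'usa', 'usa']
print(normalize("Game"))
# game
```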