# Text Preprocessing
---
- Package installation
    * NLTK : pip install nltk
    * KoNLPy : pip install konlpy

## [1] Tokenization
---
- Splitting a sentence or document into small, meaningful units
- Each resulting unit is called a token
- Types
    * Sentence tokenization
    * Word tokenization
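
The difference between the two types can be sketched with the standard library alone. This is a simplified, regex-based stand-in for NLTK's `sent_tokenize` and `word_tokenize`, not their actual algorithm; the helper names `simple_sent_tokenize` and `simple_word_tokenize` and the sample text are made up for illustration.

```python
import re

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when whitespace follows (naive rule, assumption).
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(text):
    # A token is either a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Wiki supports hyperlinks. It has a simple text syntax."

sentences = simple_sent_tokenize(text)       # sentence tokenization
words = simple_word_tokenize(sentences[0])   # word tokenization of the first sentence

print(sentences)
print(words)
```

A rule this naive breaks on abbreviations such as "Dr." and needs whitespace after the period, which is why a trained tokenizer is normally preferred.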

In [43]:
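# Both tokenizers need the Punkt data: nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab')
from nltk.tokenize import sent_tokenize, word_tokenize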

In [44]:
import nltk

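In [45]:
# Note: each trailing backslash joins the next line with no space in between,
# so the tokenizer later produces fused tokens such as 'work.Wiki' and 'fly.Wiki'.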
sent='Wiki is in Ward is original description: The simplest online database that could possibly work.\
Wiki is a piece of server software that allows users to freely create and edit Web page content using any Web browser. Wiki supports hyperlinks and has a simple text syntax for creating new pages and crosslinks between internal pages on the fly.\
Wiki is unusual among group communication mechanisms in that it allows the organization of contributions to be edited in addition to the content itself.Like many simple concepts, "open editing" has some profound and subtle effects on Wiki usage. Allowing everyday users to create and edit any page in a Web site is exciting in that it encourages democratic use of the Web and promotes content composition by nontechnical users.'

In [46]:
# Word-level tokenization
wordTokens = word_tokenize(sent)

In [47]:
print(wordTokens, len(wordTokens))

['Wiki', 'is', 'in', 'Ward', 'is', 'original', 'description', ':', 'The', 'simplest', 'online', 'database', 'that', 'could', 'possibly', 'work.Wiki', 'is', 'a', 'piece', 'of', 'server', 'software', 'that', 'allows', 'users', 'to', 'freely', 'create', 'and', 'edit', 'Web', 'page', 'content', 'using', 'any', 'Web', 'browser', '.', 'Wiki', 'supports', 'hyperlinks', 'and', 'has', 'a', 'simple', 'text', 'syntax', 'for', 'creating', 'new', 'pages', 'and', 'crosslinks', 'between', 'internal', 'pages', 'on', 'the', 'fly.Wiki', 'is', 'unusual', 'among', 'group', 'communication', 'mechanisms', 'in', 'that', 'it', 'allows', 'the', 'organization', 'of', 'contributions', 'to', 'be', 'edited', 'in', 'addition', 'to', 'the', 'content', 'itself.Like', 'many', 'simple', 'concepts', ',', '``', 'open', 'editing', "''", 'has', 'some', 'profound', 'and', 'subtle', 'effects', 'on', 'Wiki', 'usage', '.', 'Allowing', 'everyday', 'users', 'to', 'create', 'and', 'edit', 'any', 'page', 'in', 'a', 'Web', 'site', 

## [2] Cleaning & Normalization
---
- Stopword removal => removes noise
- Making the text uniform
    * Convert everything to uppercase or to lowercase
    * Sentence length
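
Case unification can be illustrated with a few hand-picked tokens. This is a minimal sketch; the token list is an assumption chosen so that the same word appears in two casings.

```python
# Sample tokens (illustrative assumption): 'The'/'the' and 'Web'/'web' differ only in case.
tokens = ['The', 'simplest', 'online', 'database', 'the', 'Web', 'web']

# Lowercase every token so that case variants collapse into one form.
lowered = [t.lower() for t in tokens]

unique_before = len(set(tokens))
unique_after = len(set(lowered))

print(unique_before, unique_after)
```

Fewer distinct tokens after lowercasing means a smaller vocabulary, at the cost of losing distinctions such as proper nouns.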

### [2-1] Stopwords

In [48]:
# Requires the stopwords corpus: nltk.download('stopwords')
en_stopwords = nltk.corpus.stopwords.words('english')

In [49]:
print(en_stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [50]:
# Remove stopwords from the word tokens (case-sensitive match, so capitalized 'The' is kept)
words=[]
for word in wordTokens:
    if word not in en_stopwords:
        words.append(word)
        
print(words, len(words))

['Wiki', 'Ward', 'original', 'description', ':', 'The', 'simplest', 'online', 'database', 'could', 'possibly', 'work.Wiki', 'piece', 'server', 'software', 'allows', 'users', 'freely', 'create', 'edit', 'Web', 'page', 'content', 'using', 'Web', 'browser', '.', 'Wiki', 'supports', 'hyperlinks', 'simple', 'text', 'syntax', 'creating', 'new', 'pages', 'crosslinks', 'internal', 'pages', 'fly.Wiki', 'unusual', 'among', 'group', 'communication', 'mechanisms', 'allows', 'organization', 'contributions', 'edited', 'addition', 'content', 'itself.Like', 'many', 'simple', 'concepts', ',', '``', 'open', 'editing', "''", 'profound', 'subtle', 'effects', 'Wiki', 'usage', '.', 'Allowing', 'everyday', 'users', 'create', 'edit', 'page', 'Web', 'site', 'exciting', 'encourages', 'democratic', 'use', 'Web', 'promotes', 'content', 'composition', 'nontechnical', 'users', '.'] 85
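
In the output above, `'The'` and punctuation such as `':'` survive because the stopword list is all lowercase and the comparison is case-sensitive. One possible variant, sketched here under assumptions, lowercases each token before the lookup and also drops non-alphanumeric tokens. The `stopword_set` below is a small hand-written subset of the English list, not the full NLTK list.

```python
# Small illustrative subset of the English stopword list (assumption; the NLTK list is longer).
stopword_set = {'is', 'in', 'the', 'that', 'a', 'of', 'to', 'and'}

# First few tokens from the word tokenization result shown earlier.
tokens = ['Wiki', 'is', 'in', 'Ward', 'is', 'original', 'description',
          ':', 'The', 'simplest', 'online', 'database']

# Compare in lowercase and keep only purely alphanumeric tokens.
filtered = [t for t in tokens
            if t.lower() not in stopword_set and t.isalnum()]

print(filtered)
```

Using a `set` instead of a list also makes each membership check constant time on average, which matters for long documents.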

