# Tokenization (word segmentation)

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "Supreme Court won't hear Obamacare case before election"
res1 = [str(token) for token in nlp(text.lower())]
print(res1)

['supreme', 'court', 'wo', "n't", 'hear', 'obamacare', 'case', 'before', 'election']


In [2]:
# Tokenizer designed for tweets
from nltk.tokenize import TweetTokenizer
tweet = "Tell both of your senators to vote against the #GOPCoverup aimed at protecting Trump."
tokenizer = TweetTokenizer()
res2 = tokenizer.tokenize(tweet.lower())
print(res2)


['tell', 'both', 'of', 'your', 'senators', 'to', 'vote', 'against', 'the', '#gopcoverup', 'aimed', 'at', 'protecting', 'trump', '.']
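The output above keeps `#gopcoverup` as a single token. The following is a minimal regex sketch of that idea, using only the standard library. It is not NLTK's actual implementation; the pattern and the function name `simple_tweet_tokenize` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: hashtags/mentions first, then words (with an optional
# apostrophe part), then any single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"[#@]\w+|\w+(?:'\w+)?|[^\w\s]")

def simple_tweet_tokenize(text):
    """Lowercase the text and split it into hashtag, mention, word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tweet_tokenize("Vote against the #GOPCoverup, @senators!"))
# ['vote', 'against', 'the', '#gopcoverup', ',', '@senators', '!']
```

Putting the hashtag/mention alternative first matters: otherwise `#` would be matched as punctuation and the tag would be split in two.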


# N-gram

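+ An N-gram model is a technique based on statistical language modeling. The basic idea is to slide a window of size N over the text (here, over the list of tokens), producing a sequence of fragments that each have length N.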

In [3]:
def n_gram(text,n):
    return [text[i:i+n] for i in range(len(text)-n+1)]
print(n_gram(res1,3))

[['supreme', 'court', 'wo'], ['court', 'wo', "n't"], ['wo', "n't", 'hear'], ["n't", 'hear', 'obamacare'], ['hear', 'obamacare', 'case'], ['obamacare', 'case', 'before'], ['case', 'before', 'election']]
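Since N-grams are typically used for statistics, a natural next step is to count how often each one occurs. The sketch below assumes a small made-up token list (not data from this notebook) and returns tuples instead of lists so the N-grams can be used as `Counter` keys.

```python
from collections import Counter

def n_gram(tokens, n):
    """Return all contiguous windows of n tokens as tuples (hashable, so countable)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy token list, not taken from the notebook's data.
tokens = ["the", "cat", "sat", "on", "the", "cat", "mat"]

bigram_counts = Counter(n_gram(tokens, 2))
print(bigram_counts.most_common(1))
# [(('the', 'cat'), 2)]
```

If `n` is larger than the number of tokens, the range is empty and the function returns an empty list.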


# Text Normalization

+ Used to prepare text, words, and documents for further processing.

![](https://i.imgur.com/iF4tIwv.png)

# 1. Lemmatization

+ Lemma (base form) and wordform (inflected form)
+ "Cat" and "cats" share the same lemma but are different wordforms.

https://spacy.io/api/annotation#lemmatization

In [4]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("he was running late")
for token in doc:
    print(token,'--->',token.lemma_)
# -PRON- = personal pronoun (the placeholder lemma spaCy v2 assigns to pronouns)

he ---> -PRON-
was ---> be
running ---> run
late ---> late


# 2. Stemming

+ A stem (root) is the part of a word to which affixes such as -ed, -ize, -s, de-, and mis- are attached. Stems are created by removing the suffixes or prefixes of a word, so stemming a word or sentence may produce strings that are not actual words.

1. cats > cat  
2. effective > effect
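To make the idea concrete, here is a toy suffix-stripping sketch. It is not the Porter or Lancaster algorithm; the suffix list, the function name `toy_stem`, and the `min_stem_len` parameter are assumptions chosen only to reproduce the two examples above.

```python
# Hypothetical suffix list for illustration only; real stemmers use far richer rule sets.
SUFFIXES = ("ships", "ship", "ive", "ize", "ing", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for word in ["cats", "effective", "friendships", "destabilize"]:
    print(word, "->", toy_stem(word))
# cats -> cat
# effective -> effect
# friendships -> friend
# destabilize -> destabil
```

The last result, `destabil`, shows why a stem is not guaranteed to be a real word.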

https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

In [5]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer


In [6]:
# Create the stemmer instances (imported in the previous cell)
porter = PorterStemmer()
lancaster = LancasterStemmer()

# A list of words to be stemmed
word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football"]
print("{0:20}{1:20}{2:20}".format("Word","Porter Stemmer","Lancaster Stemmer"))
for word in word_list:
    print("{0:20}{1:20}{2:20}".format(word,porter.stem(word),lancaster.stem(word)))

+ Differences:  
    **Porter**: simply strips suffixes. It is fast and convenient, but sometimes produces odd, non-word stems.  
    **Lancaster**: The LancasterStemmer (Paice-Husk stemmer) is an iterative algorithm with rules saved externally.

# 3. POS Tagging

+ Classifying words into categories  
+ POS = part of speech  
+ Part-of-speech analysis

The 8 parts of speech:  
http://www.butte.edu/departments/cas/tipsheets/grammar/parts_of_speech.html

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("John Roberts Has More Power Than Mitch McConnell Would Like You to Think. But Will He Use It?")
for token in doc:
    print(token,'---->',token.pos_)

# 4. Chunking

+ Splitting text into chunks (here, named entities and noun phrases)

https://spacy.io/usage/linguistic-features#pos-tagging

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
print([(X.text, X.label_) for X in doc.ents])
print([(X.text, X.label_) for X in doc.noun_chunks])

# 5. Sentence Structure