# Learning goal: try out NLTK and spaCy to learn the basics of English NLP

### Read the file, lowercase, remove punctuation

In [5]:
import string
import nltk

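# Read the whole article; the context manager closes the file afterwards
with open('bcc_news_01-a.txt', encoding='utf-8') as f:
    text = f.read()

# Convert to lowercase
lower_case = text.lower()

print('\n----< Read file, lowercase, remove punctuation >-----\n')
# Remove ASCII punctuation (string.punctuation); other symbols are left untouched
cleaned_text = lower_case.translate(str.maketrans('', '', string.punctuation))
print(cleaned_text)


----< Read file, lowercase, remove punctuation >-----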

soaring over the delhi skyline  at 240 ft  the qutub minar is one of the capitals most iconic and stunning monuments now a court will decide whether temples demolished centuries ago in the complex surrounding the monument should be restored

the world heritage site was built as tower of victory  possibly inspired by afghan minarets  by qutbuddin aibak the first sultan of delhi after defeating the hindu rulers in 1192 the redandbuff sandstone monument contains some of the earliest structures of muslim rule in the country it was expanded upwards and renovated by three successors  it is now five storeys tall and 379 steps lead to the top

historian william dalrymple noted that the qutub minar tower which looked like a fully extended telescope placed lens down on a plateau in delhis aravalli hills was a boastful and triumphant statement of arrival

the fortified complex housing the minaret has a chequered history twentyseven hindu and jain temples located there
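
Deleting punctuation outright also glues hyphenated words together, which is why `redandbuff` and `twentyseven` appear in the output above. The sketch below compares that approach with one possible alternative that replaces punctuation with spaces. The helper names `strip_punctuation` and `punctuation_to_spaces` are made up for illustration and are not part of NLTK.

```python
import string

def strip_punctuation(text):
    """Delete every ASCII punctuation character (the approach used above)."""
    return text.translate(str.maketrans('', '', string.punctuation))

def punctuation_to_spaces(text):
    """Replace each ASCII punctuation character with a space,
    then collapse runs of whitespace (including newlines) into one space."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

sample = "the red-and-buff sandstone monument"
print(strip_punctuation(sample))       # the redandbuff sandstone monument
print(punctuation_to_spaces(sample))   # the red and buff sandstone monument
```

Which variant is preferable depends on the task: the second keeps `red`, `and` and `buff` as separate tokens, but it also splits possessives such as `capital's` into `capital` and `s`.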

### Tokenization

In [6]:
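print('\n----< Tokenization >-----\n')
# cleaned_text has no ASCII punctuation left, so word_tokenize mostly splits on whitespace here
# (word_tokenize needs NLTK's Punkt tokenizer data)
word_tokens = nltk.word_tokenize(cleaned_text)
print(*word_tokens, sep=", ")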

soaring, over, the, delhi, skyline, at, 240, ft, the, qutub, minar, is, one, of, the, capitals, most, iconic, and, stunning, monuments, now, a, court, will, decide, whether, temples, demolished, centuries, ago, in, the, complex, surrounding, the, monument, should, be, restored, the, world, heritage, site, was, built, as, tower, of, victory, possibly, inspired, by, afghan, minarets, by, qutbuddin, aibak, the, first, sultan, of, delhi, after, defeating, the, hindu, rulers, in, 1192, the, redandbuff, sandstone, monument, contains, some, of, the, earliest, structures, of, muslim, rule, in, the, country, it, was, expanded, upwards, and, renovated, by, three, successors, it, is, now, five, storeys, tall, and, 379, steps, lead, to, the, top, historian, william, dalrymple, noted, that, the, qutub, minar, tower, which, looked, like, a, fully, extended, telescope, placed, lens, down, on, a, plateau, in, delhis, aravalli, hills, was, a, boastful, and, triumphant, statement, of, arrival, the, fort

### Remove stop words

In [3]:
print('\n----< Remove stop words >-----\n')
nltk.download('stopwords')
nltk_stopwords = nltk.corpus.stopwords.words('english')
print('nltk_stopwords.count = ', len(nltk_stopwords))
# print('The first five stop words are {}'.format(list(nltk_stopwords)[:5]))
# A set makes each membership test O(1) instead of scanning the whole list
stopword_set = set(nltk_stopwords)
filtered_sentence = [word for word in word_tokens if word not in stopword_set]
print(filtered_sentence)



----< Remove stop words >-----

nltk_stopwords.count =  179
['soaring', 'delhi', 'skyline', '240', 'ft', 'qutub', 'minar', 'one', 'capitals', 'iconic', 'stunning', 'monuments', 'court', 'decide', 'whether', 'temples', 'demolished', 'centuries', 'ago', 'complex', 'surrounding', 'monument', 'restored', 'world', 'heritage', 'site', 'built', 'tower', 'victory', 'possibly', 'inspired', 'afghan', 'minarets', 'qutbuddin', 'aibak', 'first', 'sultan', 'delhi', 'defeating', 'hindu', 'rulers', '1192', 'redandbuff', 'sandstone', 'monument', 'contains', 'earliest', 'structures', 'muslim', 'rule', 'country', 'expanded', 'upwards', 'renovated', 'three', 'successors', 'five', 'storeys', 'tall', '379', 'steps', 'lead', 'top', 'historian', 'william', 'dalrymple', 'noted', 'qutub', 'minar', 'tower', 'looked', 'like', 'fully', 'extended', 'telescope', 'placed', 'lens', 'plateau', 'delhis', 'aravalli', 'hills', 'boastful', 'triumphant', 'statement', 'arrival', 'fortified', 'complex', 'housing', 'minaret', 'chequered'

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/eddiehua/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
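
NLTK's English stop words are all lowercase, so the filter above only works because the text was lowercased first. A minimal sketch of the difference, using a tiny hand-written stand-in for the real list (the four-word set below is an assumption for illustration, not NLTK's list):

```python
# Tiny stand-in for NLTK's English stop word list (illustration only)
demo_stopwords = {'the', 'is', 'of', 'a'}

tokens = ['The', 'Qutub', 'Minar', 'is', 'one', 'of', 'the', 'monuments']

# Comparing raw tokens misses capitalised stop words such as 'The'
case_sensitive = [t for t in tokens if t not in demo_stopwords]
# Lowercasing each token before the lookup catches them
case_insensitive = [t for t in tokens if t.lower() not in demo_stopwords]

print(case_sensitive)    # ['The', 'Qutub', 'Minar', 'one', 'monuments']
print(case_insensitive)  # ['Qutub', 'Minar', 'one', 'monuments']
```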


### N-gram
- Count how often each sequence of consecutive words occurs in the text
- Frequent n-grams can be used to predict keywords

In [7]:
from collections import Counter
from nltk.util import ngrams
# Five most frequent 1-grams (best computed after removing stop words)
news_unigrams = ngrams(filtered_sentence, 1)
news_unigrams_freq = Counter(news_unigrams)
print("Top 5 unigrams:\n{}".format(news_unigrams_freq.most_common(5)))

# Five most frequent 2-grams
news_bigrams = ngrams(filtered_sentence, 2)
news_bigrams_freq = Counter(news_bigrams)
print("Top 5 bigrams:\n{}".format(news_bigrams_freq.most_common(5)))

# Five most frequent 3-grams
news_trigrams = ngrams(filtered_sentence, 3)
news_trigrams_freq = Counter(news_trigrams)
print("Top 5 trigrams:\n{}".format(news_trigrams_freq.most_common(5)))


Top 5 unigrams:
[(('monument',), 4), (('delhi',), 3), (('qutub',), 3), (('minar',), 3), (('temples',), 3)]
Top 5 bigrams:
[(('qutub', 'minar'), 3), (('soaring', 'delhi'), 1), (('delhi', 'skyline'), 1), (('skyline', '240'), 1), (('240', 'ft'), 1)]
Top 5 trigrams:
[(('soaring', 'delhi', 'skyline'), 1), (('delhi', 'skyline', '240'), 1), (('skyline', '240', 'ft'), 1), (('240', 'ft', 'qutub'), 1), (('ft', 'qutub', 'minar'), 1)]
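
To make the counting step concrete, here is a rough sketch of the same sliding-window idea using only the standard library. `simple_ngrams` is a hypothetical helper written for this note, not the NLTK implementation, and the token list is a shortened made-up sample.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (tokens must be a list)."""
    # zip stops at the shortest slice, so incomplete windows at the end are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['qutub', 'minar', 'tower', 'qutub', 'minar']
bigram_freq = Counter(simple_ngrams(tokens, 2))
print(bigram_freq.most_common(1))   # [(('qutub', 'minar'), 2)]
```

Because the n-grams above are built from `filtered_sentence`, words that were separated by stop words become neighbours: `('soaring', 'delhi')` comes from "soaring over the delhi".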


### Sentence segmentation

In [19]:
from nltk.tokenize import sent_tokenize  
sent_tokens = sent_tokenize(text)

for i, sentence in enumerate(sent_tokens, start=1):
    print('[{}] {}'.format(i, sentence))

[1] Soaring over the Delhi skyline - at 240 ft - the Qutub Minar is one of the capital's most iconic and stunning monuments.
[2] Now a court will decide whether temples demolished centuries ago in the complex surrounding the monument should be restored.
[3] The World Heritage site was built as tower of victory - possibly inspired by Afghan minarets - by Qutbuddin Aibak, the first sultan of Delhi, after defeating the Hindu rulers in 1192.
[4] The red-and-buff sandstone monument contains some of the earliest structures of Muslim rule in the country.
[5] It was expanded upwards and renovated by three successors - it is now five storeys tall and 379 steps lead to the top.
[6] Historian William Dalrymple noted that the Qutub Minar tower, which looked like a "fully extended telescope placed lens down on a plateau in [Delhi's] Aravalli hills" was a "boastful and triumphant statement of arrival".
[7] The fortified complex housing the minaret has a chequered history.
[8] Twenty-seven Hindu and 
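
`sent_tokenize` relies on NLTK's Punkt model, which is designed to cope with abbreviations and similar cases. For contrast, the sketch below uses a naive rule (split after `.`, `!` or `?` followed by whitespace). It is not the Punkt algorithm; `naive_sent_split` and the sample text are invented for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace. Not the Punkt algorithm."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Made-up sample with an abbreviation in the middle of a sentence
sample = "The tower is 240 ft. tall. It was built in 1192."
for i, sentence in enumerate(naive_sent_split(sample), start=1):
    print('[{}] {}'.format(i, sentence))
# [1] The tower is 240 ft.
# [2] tall.
# [3] It was built in 1192.
```

The rule breaks the first sentence at `ft.`, which shows why a plain punctuation split is usually not enough for sentence segmentation.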

### Part-of-speech (POS) tagging

By default, `nltk.pos_tag` labels English tokens with Penn Treebank tags:

- CC: coordinating conjunction
- CD: cardinal number
- DT: determiner
- EX: existential "there"
  - Example: "there is" … think of it like "there exists"
- FW: foreign word
- IN: preposition or subordinating conjunction
- JJ: adjective
- JJR: adjective, comparative
- JJS: adjective, superlative
- LS: list item marker, e.g. "1)"
- MD: modal verb, e.g. will, should
- NN: noun, singular or mass
- NNS: noun, plural
- NNP: proper noun, singular
- NNPS: proper noun, plural
- PDT: predeterminer, e.g. all, both
- POS: possessive ending, e.g. parent's
- PRP: personal pronoun, e.g. I, he, she
- PRP$: possessive pronoun, e.g. my, his, her
- RB: adverb, e.g. very, silently
- RBR: adverb, comparative, e.g. better
- RBS: adverb, superlative, e.g. best
- RP: particle, e.g. "up" in "give up"
- TO: the word "to", as in go "to" ...
- UH: interjection, e.g. errrrrrrrm
- VB: verb, base form, e.g. take
- VBD: verb, past tense, e.g. took
- VBG: verb, gerund or present participle, e.g. taking
- VBN: verb, past participle, e.g. taken
- VBP: verb, non-3rd person singular present, e.g. take
- VBZ: verb, 3rd person singular present, e.g. takes
- WDT: wh-determiner
  - Relative: which, that
  - Interrogative: what, which
- WP: wh-pronoun, e.g. who, whom, what
- WP$: possessive wh-pronoun, e.g. whose
- WRB: wh-adverb, e.g. how, where, when
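
Many of the tags in this list share a prefix (`NN*` for nouns, `VB*` for verbs, `JJ*` for adjectives, `RB*` for adverbs), which makes it easy to fold them into coarse word classes. A small sketch of that idea follows; `coarse_pos` and its four class names are a hypothetical simplification, not an NLTK function.

```python
def coarse_pos(tag):
    """Map a Penn Treebank tag to a coarse word class by its prefix."""
    prefixes = [('NN', 'noun'), ('VB', 'verb'), ('JJ', 'adjective'), ('RB', 'adverb')]
    for prefix, name in prefixes:
        if tag.startswith(prefix):
            return name
    return 'other'   # everything else, e.g. DT, IN, MD, RP

for tag in ['NNS', 'VBD', 'JJS', 'RBR', 'DT']:
    print(tag, '->', coarse_pos(tag))
```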

In [15]:
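from nltk import RegexpParser  # used in the chunking cell below
# Requires the averaged_perceptron_tagger resource.
# Note: these tokens are lowercased with stop words removed, so the tagger has
# little context and some tags in the output (e.g. 'minar' as VBP) are unreliable.
pos_tagged_sent = nltk.pos_tag(filtered_sentence)
print(*pos_tagged_sent, sep=", ")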

('soaring', 'VBG'), ('delhi', 'JJ'), ('skyline', 'NN'), ('240', 'CD'), ('ft', 'NN'), ('qutub', 'NN'), ('minar', 'VBP'), ('one', 'CD'), ('capitals', 'NNS'), ('iconic', 'JJ'), ('stunning', 'JJ'), ('monuments', 'NNS'), ('court', 'NN'), ('decide', 'VBP'), ('whether', 'IN'), ('temples', 'NNS'), ('demolished', 'VBN'), ('centuries', 'NNS'), ('ago', 'IN'), ('complex', 'JJ'), ('surrounding', 'VBG'), ('monument', 'NN'), ('restored', 'VBD'), ('world', 'NN'), ('heritage', 'NN'), ('site', 'NN'), ('built', 'VBD'), ('tower', 'JJR'), ('victory', 'NN'), ('possibly', 'RB'), ('inspired', 'VBD'), ('afghan', 'JJ'), ('minarets', 'NNS'), ('qutbuddin', 'VBP'), ('aibak', 'IN'), ('first', 'JJ'), ('sultan', 'JJ'), ('delhi', 'NN'), ('defeating', 'VBG'), ('hindu', 'NN'), ('rulers', 'NNS'), ('1192', 'CD'), ('redandbuff', 'NN'), ('sandstone', 'NN'), ('monument', 'NN'), ('contains', 'VBZ'), ('earliest', 'JJS'), ('structures', 'NNS'), ('muslim', 'JJ'), ('rule', 'JJ'), ('country', 'NN'), ('expanded', 'VBD'), ('upwards'

### Noun phrase chunking

In [18]:
tokenised_sent = nltk.word_tokenize(sent_tokens[0])
pos_tagged_sent = nltk.pos_tag(tokenised_sent)
# grammar of a noun phrase, written as "grammar_name: {RegEx}":
# an optional determiner, any number of adjectives, then one noun tag (NN, NNS, NNP)
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN.?>}"
np_chunk_parser = RegexpParser(np_chunk_grammar)
# chunk parsing a sentence
np_chunked_sent = np_chunk_parser.parse(pos_tagged_sent)
print(np_chunked_sent)
# visualising parsing result
#np_chunked_sent.draw()

(S
  Soaring/VBG
  over/IN
  (NP the/DT Delhi/NNP)
  (NP skyline/NN)
  -/:
  at/IN
  240/CD
  ft/JJ
  -/:
  (NP the/DT Qutub/NNP)
  (NP Minar/NNP)
  is/VBZ
  one/CD
  of/IN
  (NP the/DT capital/NN)
  's/POS
  most/RBS
  iconic/JJ
  and/CC
  (NP stunning/JJ monuments/NNS)
  ./.)
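
The pattern `<DT>?<JJ>*<NN.?>` ends with exactly one noun tag, which is why `Qutub` and `Minar` land in separate NP chunks above. The sketch below imitates the matching for this one pattern in plain Python so the behaviour is easier to follow. `np_chunks` is a hypothetical helper, not how `RegexpParser` is implemented, and the tagged pairs are copied from parts of the parse above.

```python
import re

def np_chunks(tagged):
    """Greedy left-to-right match of <DT>?<JJ>*<NN.?> over (word, tag) pairs."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == 'DT':                          # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == 'JJ':   # any number of adjectives
            j += 1
        if j < len(tagged) and re.fullmatch(r'NN.?', tagged[j][1]):  # exactly one noun tag
            chunks.append(' '.join(word for word, _ in tagged[i:j + 1]))
            i = j + 1
        else:
            i += 1                                        # no chunk starts at this token
    return chunks

tagged = [('over', 'IN'), ('the', 'DT'), ('Delhi', 'NNP'), ('skyline', 'NN'),
          ('and', 'CC'), ('stunning', 'JJ'), ('monuments', 'NNS')]
print(np_chunks(tagged))   # ['the Delhi', 'skyline', 'stunning monuments']
```

If multi-word names should stay together, a grammar that allows several noun tags in a row (for example one ending in `<NN.?>+`) is probably worth trying, though that variant was not run here.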


### Find the most frequent words

In [21]:
from nltk.probability import FreqDist
fdist = FreqDist(filtered_sentence)
# List the top 10
fdist1 = fdist.most_common(10)
print(fdist1)

[('monument', 4), ('delhi', 3), ('qutub', 3), ('minar', 3), ('temples', 3), ('hindu', 3), ('one', 2), ('demolished', 2), ('complex', 2), ('site', 2)]


### Stemming (stripping word suffixes)

In [26]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

for word in filtered_sentence:
    word_stem = ps.stem(word)
    # Only print words that the stemmer actually changed
    if word != word_stem:
        print(word, '->', word_stem)


soaring -> soar
skyline -> skylin
capitals -> capit
iconic -> icon
stunning -> stun
monuments -> monument
decide -> decid
temples -> templ
demolished -> demolish
centuries -> centuri
surrounding -> surround
restored -> restor
heritage -> heritag
victory -> victori
possibly -> possibl
inspired -> inspir
minarets -> minaret
defeating -> defeat
rulers -> ruler
sandstone -> sandston
contains -> contain
structures -> structur
country -> countri
expanded -> expand
upwards -> upward
renovated -> renov
successors -> successor
storeys -> storey
steps -> step
dalrymple -> dalrympl
noted -> note
looked -> look
fully -> fulli
extended -> extend
telescope -> telescop
placed -> place
lens -> len
delhis -> delhi
aravalli -> arav
hills -> hill
boastful -> boast
arrival -> arriv
fortified -> fortifi
housing -> hous
chequered -> chequer
history -> histori
temples -> templ
located -> locat
demolished -> demolish
debris -> debri
used -> use
delhis -> delhi
mosque -> mosqu
temples -> templ
retained -> reta
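
Stems such as `len` (from `lens`) or `centuri` come from purely mechanical suffix rules. As a rough illustration, the sketch below implements only step 1a of Porter's original algorithm, the plural-suffix rules. It is not a full stemmer: NLTK's `PorterStemmer` applies several further steps and, in its default mode, some extensions. `porter_step_1a` is a name chosen for this note.

```python
def porter_step_1a(word):
    """Step 1a of Porter's original algorithm: plural suffixes only."""
    if word.endswith('sses'):
        return word[:-2]      # caresses -> caress
    if word.endswith('ies'):
        return word[:-2]      # centuries -> centuri
    if word.endswith('ss'):
        return word           # caress stays caress
    if word.endswith('s'):
        return word[:-1]      # steps -> step
    return word

for word in ['centuries', 'steps', 'hills', 'lens', 'temples']:
    print(word, '->', porter_step_1a(word))
```

Here `temples` only becomes `temple`; the `templ` in the output above is produced by a later step of the full algorithm that drops the final `e`.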

### NER (Named Entity Recognition)

In [29]:
import spacy

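ner_def = {"PERSON": "people",
           "NORP": "nationalities, religious or political groups",
           "FAC": "buildings, airports, highways, bridges",
           "ORG": "companies, agencies, institutions",
           "GPE": "countries, cities, states",
           "LOC": "mountain ranges, bodies of water",
           "DATE": "dates",
           "TIME": "times shorter than a day",
           "EVENT": "hurricanes, wars, sports events, etc.",
           "LAW": "legal documents",
           "LANGUAGE": "languages",
           "PERCENT": "percentages",
           "MONEY": "monetary values",
           "QUANTITY": "measurements",
           "CARDINAL": "cardinal numbers",
           "ORDINAL": "ordinal numbers"
          }

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

for entity in doc.ents:
    # Fall back to the raw label when no description is defined
    label_desc = ner_def.get(entity.label_, entity.label_)
    print(entity.text, ":", label_desc)


Delhi : countries, cities, states
240 : cardinal numbers
Afghan : nationalities, religious or political groups
Qutbuddin Aibak : companies, agencies, institutions
first : ordinal numbers
Delhi : countries, cities, states
Hindu : nationalities, religious or political groups
1192 : dates
Muslim : nationalities, religious or political groups
three : cardinal numbers
five : cardinal numbers
379 : cardinal numbers
William Dalrymple : people
Qutub Minar : people
Aravalli : people
Twenty-seven : cardinal numbers
Hindu : nationalities, religious or political groups
Delhi : countries, cities, states
1926 : dates
JA Page : people
the Archaeological Survey : companies, agencies, institutions
India : countries, cities, states
Hindu : nationalities, religious or political groups
Hanuman Chalisa : people
Vishnu Stambh' : people
May 10, 2022 : dates
New Delhi : countries, cities, states
India : countries, cities, states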
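
The flat list above repeats entities and makes mislabels hard to spot (the small model tags `Qutub Minar` and `Aravalli` as people, for instance). One way to review the result is to group the entity texts by label. The sketch below assumes a hypothetical helper `group_entities`; the sample pairs are a handful of entities from the output above, written with spaCy's raw labels.

```python
from collections import defaultdict

def group_entities(pairs):
    """Group (text, label) pairs by label, keeping first-seen order without duplicates."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

# A few (text, label) pairs corresponding to the output above
sample = [("Delhi", "GPE"), ("Afghan", "NORP"), ("Delhi", "GPE"),
          ("William Dalrymple", "PERSON"), ("Qutub Minar", "PERSON"), ("India", "GPE")]
for label, texts in group_entities(sample).items():
    print(label, "->", texts)
```

With the `doc` object from the cell above, the same helper could be called as `group_entities((ent.text, ent.label_) for ent in doc.ents)`. spaCy also ships `spacy.explain(label)`, which returns a short English description of a label and may replace a hand-written dictionary.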


In [31]:
from spacy import displacy

# displacy.render(doc, style="dep")
displacy.render(doc, style="ent")