# Week 4 Example Code: Processing English Text with nltk

News source: https://edition.cnn.com/2023/03/05/asia/ukraine-war-us-pacific-alliances-intl-hnk/index.html

In [None]:
Title_CNN = 'Ukraine war has made it easier for US to isolate China in the Pacific'
Content_CNN = '''A year after Russia invaded Ukraine, Xi Jinping’s backing of Vladimir Putin has opened the door for the United States and partners in the Pacific to shore up sometimes frayed relationships to the detriment of Beijing.

In the past few months alone, Japan has pledged to double defense spending and acquire long-range weapons from the US; South Korea has acknowledged that stability in the Taiwan Strait is essential to its security; the Philippines has announced new US base access rights and is talking about joint patrols of the South China Sea with Australia, Japan and the United States.

Those might be the biggest initiatives, but they are far from the only events that have left China increasingly isolated in its own backyard as it refuses to condemn the invasion of a sovereign country by its partner in Moscow while keeping military pressure on the self-ruled island of Taiwan.

Analysts say all these things would have likely happened without the war in Ukraine, but the war, and China’s backing of Russia, has helped grease the skids to get these projects done.

Take the situation of Japan, a country limited in its post-World War II constitution to “self-defense” forces. Now it’s going to buy long-range Tomahawk cruise missiles from the US, weapons that could strike well inside China.

“I myself have a strong sense of urgency that Ukraine today may be East Asia tomorrow,” Japanese Prime Minister Fumio Kishida told a major defense conference in Singapore last summer.

In December, Kishida followed that up with a plan to double Tokyo’s defense spending while acquiring weapons with ranges well outside Japanese territory.
'''


In [None]:
# The tokenizer models and corpora must be downloaded first
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Basic unit: single words

In [None]:
from nltk.tokenize import word_tokenize

word_result = word_tokenize(Content_CNN)

# check result
print(word_result)

['A', 'year', 'after', 'Russia', 'invaded', 'Ukraine', ',', 'Xi', 'Jinping', '’', 's', 'backing', 'of', 'Vladimir', 'Putin', 'has', 'opened', 'the', 'door', 'for', 'the', 'United', 'States', 'and', 'partners', 'in', 'the', 'Pacific', 'to', 'shore', 'up', 'sometimes', 'frayed', 'relationships', 'to', 'the', 'detriment', 'of', 'Beijing', '.', 'In', 'the', 'past', 'few', 'months', 'alone', ',', 'Japan', 'has', 'pledged', 'to', 'double', 'defense', 'spending', 'and', 'acquire', 'long-range', 'weapons', 'from', 'the', 'US', ';', 'South', 'Korea', 'has', 'acknowledged', 'that', 'stability', 'in', 'the', 'Taiwan', 'Strait', 'is', 'essential', 'to', 'its', 'security', ';', 'the', 'Philippines', 'has', 'announced', 'new', 'US', 'base', 'access', 'rights', 'and', 'is', 'talking', 'about', 'joint', 'patrols', 'of', 'the', 'South', 'China', 'Sea', 'with', 'Australia', ',', 'Japan', 'and', 'the', 'United', 'States', '.', 'Those', 'might', 'be', 'the', 'biggest', 'initiatives', ',', 'but', 'they', '

In [None]:
freq_word = nltk.FreqDist(word_result)

topk = 20
print(f'Top {topk} word frequencies:', freq_word.most_common(topk))


Top 20 word frequencies: [('the', 19), (',', 11), ('to', 9), ('of', 8), ('.', 8), ('in', 7), ('has', 5), ('and', 5), ('that', 5), ('a', 5), ('’', 4), ('s', 4), ('its', 4), ('China', 4), ('Ukraine', 3), ('Japan', 3), ('defense', 3), ('weapons', 3), ('from', 3), ('US', 3)]
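
`nltk.FreqDist` subclasses Python's `collections.Counter`, so the same counting can be reproduced with the standard library alone. The sketch below is only an illustration: `sample_tokens` is a hypothetical, hand-copied slice of the first sentence's tokens, not the full `word_result` list.

```python
from collections import Counter

# Hypothetical sample: tokens hand-copied from the start of the output above.
sample_tokens = ['A', 'year', 'after', 'Russia', 'invaded', 'Ukraine', ',',
                 'Xi', 'Jinping', '’', 's', 'backing', 'of', 'Vladimir', 'Putin',
                 'has', 'opened', 'the', 'door', 'for', 'the', 'United', 'States']

# Counter maps each distinct token to the number of times it occurs.
sample_freq = Counter(sample_tokens)

# most_common(k) returns the k highest counts as (token, count) pairs.
top_sample = sample_freq.most_common(1)
print(top_sample)
```

Note that punctuation marks are counted like any other token, which is also why `,` and `.` rank so high in the full result.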


Basic unit: two words

In [None]:
bigrm_result = nltk.bigrams(word_result)
for bi in bigrm_result:
    print(bi)

('A', 'year')
('year', 'after')
('after', 'Russia')
('Russia', 'invaded')
('invaded', 'Ukraine')
('Ukraine', ',')
(',', 'Xi')
('Xi', 'Jinping')
('Jinping', '’')
('’', 's')
('s', 'backing')
('backing', 'of')
('of', 'Vladimir')
('Vladimir', 'Putin')
('Putin', 'has')
('has', 'opened')
('opened', 'the')
('the', 'door')
('door', 'for')
('for', 'the')
('the', 'United')
('United', 'States')
('States', 'and')
('and', 'partners')
('partners', 'in')
('in', 'the')
('the', 'Pacific')
('Pacific', 'to')
('to', 'shore')
('shore', 'up')
('up', 'sometimes')
('sometimes', 'frayed')
('frayed', 'relationships')
('relationships', 'to')
('to', 'the')
('the', 'detriment')
('detriment', 'of')
('of', 'Beijing')
('Beijing', '.')
('.', 'In')
('In', 'the')
('the', 'past')
('past', 'few')
('few', 'months')
('months', 'alone')
('alone', ',')
(',', 'Japan')
('Japan', 'has')
('has', 'pledged')
('pledged', 'to')
('to', 'double')
('double', 'defense')
('defense', 'spending')
('spending', 'and')
('and', 'acquire')
('acq
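
A bigram list is just every token paired with its right-hand neighbor. As a minimal sketch of that idea, assuming a short hand-picked token list (`sample_tokens` is hypothetical, taken from the first sentence), the pairing can be written with `zip`:

```python
# Hypothetical sample: a short run of tokens from the first sentence.
sample_tokens = ['for', 'the', 'United', 'States', 'and',
                 'partners', 'in', 'the', 'Pacific']

# Pair each token with the one that follows it; zip stops at the shorter
# sequence, so the last token never starts a pair.
sample_bigrams = list(zip(sample_tokens, sample_tokens[1:]))

for pair in sample_bigrams:
    print(pair)
```

Unlike this list, `nltk.bigrams` returns a generator, so `bigrm_result` can be looped over only once; wrap it in `list(...)` if the pairs are needed again.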

In [None]:
# What about three words?
trigrm_result = nltk.trigrams(word_result)
for tri in trigrm_result:
    print(tri)

('A', 'year', 'after')
('year', 'after', 'Russia')
('after', 'Russia', 'invaded')
('Russia', 'invaded', 'Ukraine')
('invaded', 'Ukraine', ',')
('Ukraine', ',', 'Xi')
(',', 'Xi', 'Jinping')
('Xi', 'Jinping', '’')
('Jinping', '’', 's')
('’', 's', 'backing')
('s', 'backing', 'of')
('backing', 'of', 'Vladimir')
('of', 'Vladimir', 'Putin')
('Vladimir', 'Putin', 'has')
('Putin', 'has', 'opened')
('has', 'opened', 'the')
('opened', 'the', 'door')
('the', 'door', 'for')
('door', 'for', 'the')
('for', 'the', 'United')
('the', 'United', 'States')
('United', 'States', 'and')
('States', 'and', 'partners')
('and', 'partners', 'in')
('partners', 'in', 'the')
('in', 'the', 'Pacific')
('the', 'Pacific', 'to')
('Pacific', 'to', 'shore')
('to', 'shore', 'up')
('shore', 'up', 'sometimes')
('up', 'sometimes', 'frayed')
('sometimes', 'frayed', 'relationships')
('frayed', 'relationships', 'to')
('relationships', 'to', 'the')
('to', 'the', 'detriment')
('the', 'detriment', 'of')
('detriment', 'of', 'Beijing'

In [None]:
# What about four words?
fourgrm_result = nltk.ngrams(word_result, 4)
for fourG in fourgrm_result:
    print(fourG)

# Five or more works the same way!

('A', 'year', 'after', 'Russia')
('year', 'after', 'Russia', 'invaded')
('after', 'Russia', 'invaded', 'Ukraine')
('Russia', 'invaded', 'Ukraine', ',')
('invaded', 'Ukraine', ',', 'Xi')
('Ukraine', ',', 'Xi', 'Jinping')
(',', 'Xi', 'Jinping', '’')
('Xi', 'Jinping', '’', 's')
('Jinping', '’', 's', 'backing')
('’', 's', 'backing', 'of')
('s', 'backing', 'of', 'Vladimir')
('backing', 'of', 'Vladimir', 'Putin')
('of', 'Vladimir', 'Putin', 'has')
('Vladimir', 'Putin', 'has', 'opened')
('Putin', 'has', 'opened', 'the')
('has', 'opened', 'the', 'door')
('opened', 'the', 'door', 'for')
('the', 'door', 'for', 'the')
('door', 'for', 'the', 'United')
('for', 'the', 'United', 'States')
('the', 'United', 'States', 'and')
('United', 'States', 'and', 'partners')
('States', 'and', 'partners', 'in')
('and', 'partners', 'in', 'the')
('partners', 'in', 'the', 'Pacific')
('in', 'the', 'Pacific', 'to')
('the', 'Pacific', 'to', 'shore')
('Pacific', 'to', 'shore', 'up')
('to', 'shore', 'up', 'sometimes')
('s
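
To make the "same way" concrete, here is a rough standard-library sketch of a sliding window of any size. The helper name `sliding_ngrams` and the sample list are hypothetical and used only for illustration; in the notebook itself, `nltk.ngrams(word_result, n)` does this job.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list of tuples."""
    # Build n copies of the list, each shifted one position further left,
    # then zip them so position i lines up tokens i, i+1, ..., i+n-1.
    shifted = (tokens[i:] for i in range(n))
    return list(zip(*shifted))

# Hypothetical sample: the first seven tokens of the article.
sample_tokens = ['A', 'year', 'after', 'Russia', 'invaded', 'Ukraine', ',']

five_grams = sliding_ngrams(sample_tokens, 5)
for gram in five_grams:
    print(gram)
```

A list of `k` tokens yields `k - n + 1` n-grams, so the seven sample tokens give three 5-grams; if `n` is larger than the list, the result is empty.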

### Quiz-5: After running everything so far, does anything seem odd? Discuss.

In [None]:
# What seems odd?

In [None]:
from nltk.tokenize import sent_tokenize

sent_result = sent_tokenize(Content_CNN)

# check result
print(sent_result)

['A year after Russia invaded Ukraine, Xi Jinping’s backing of Vladimir Putin has opened the door for the United States and partners in the Pacific to shore up sometimes frayed relationships to the detriment of Beijing.', 'In the past few months alone, Japan has pledged to double defense spending and acquire long-range weapons from the US; South Korea has acknowledged that stability in the Taiwan Strait is essential to its security; the Philippines has announced new US base access rights and is talking about joint patrols of the South China Sea with Australia, Japan and the United States.', 'Those might be the biggest initiatives, but they are far from the only events that have left China increasingly isolated in its own backyard as it refuses to condemn the invasion of a sovereign country by its partner in Moscow while keeping military pressure on the self-ruled island of Taiwan.', 'Analysts say all these things would have likely happened without the war in Ukraine, but the war, and C
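
One possible use of a sentence list, offered here as a guess rather than something the notebook states, is to build n-grams one sentence at a time so that no pair spans a sentence boundary. The sketch below makes two assumptions: `sample_sentences` holds shortened, hypothetical stand-ins for entries of `sent_result`, and `str.split()` is a crude substitute for `word_tokenize`.

```python
# Hypothetical sample: two sentences shortened from the article for illustration.
sample_sentences = [
    'Japan has pledged to double defense spending.',
    'Those might be the biggest initiatives.',
]

per_sentence_bigrams = []
for sentence in sample_sentences:
    # Whitespace splitting stands in for word_tokenize (assumption);
    # it leaves punctuation attached to the neighboring word.
    tokens = sentence.split()
    # Pair neighbors within this sentence only.
    per_sentence_bigrams.extend(zip(tokens, tokens[1:]))

for pair in per_sentence_bigrams:
    print(pair)
```

Because each sentence is paired separately, the end of the first sentence is never joined to the start of the second.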