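# Environment Setup

## NLTK

* English sentence segmentation
* English word tokenization
* Punctuation removal
* Stop-word removal
* Text normalization
* English part-of-speech tagging
* English named entity recognition (NER)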

In [None]:
# Install the NLTK package
!pip install nltk



## spaCy

* Chinese sentence segmentation
* Chinese part-of-speech tagging
* Chinese named entity recognition

In [None]:
# Install the spaCy package
!pip install spacy
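
Before running the later cells, it can help to confirm which third-party packages are importable in the current runtime. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and the list of import names is an assumption based on the packages this notebook installs.

```python
from importlib.util import find_spec

# Import names assumed from the packages installed in this notebook.
REQUIRED = ["nltk", "spacy", "langid", "emoji",
            "language_tool_python", "pycorrector", "opencc", "jieba"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing packages:", missing_packages(REQUIRED))
```

An empty list suggests every package is installed; anything listed still needs a `pip install`.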



In [None]:
# Download the English and Chinese language models
# spaCy models repository: https://github.com/explosion/spacy-models

# Download the English model (sm = small)
!python -m spacy download en_core_web_sm

# Download the Chinese model (md = medium)
!python -m spacy download zh_core_web_md

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting zh-core-web-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_md-3.7.0/zh_core_web_md-3.7.0-py3-none-any.whl (78.0 MB)
✔ Download and installation success

## Sample Text

In [None]:
chinese_text = """
在這個資訊爆炸的時代，自然語言處理(Natural Language Processing, NLP)技術正在迅速發展，成為人工智慧領域的一個重要分支。NLP的應用範圍廣泛，包括語音識別、機器翻譯、情感分析、文本摘要等。隨著深度學習技術的進步，NLP的研究和應用正在不斷突破傳統的限制，為人們的生活和工作帶來了許多便利。

自然語言處理涉及到多個步驟，其中包括語言檢測、文本清理、文本標準化、斷句、斷詞、詞性標註等。這些前處理步驟對於後續的分析和模型訓練至關重要。例如，文本清理可以去除無用的標點符號和特殊字符，斷詞則將句子分割成單個的詞彙，便於進一步的處理。

在NLP的應用中，一個常見的任務是情感分析，即判斷文本中所表達的情感是正面還是負面。這在社交媒體分析、市場研究等領域有著廣泛的應用。另一個重要的應用是機器翻譯，隨著全球化的發展，準確高效的翻譯工具變得越來越重要。

然而，NLP也面臨著許多挑戰，其中之一是語言的多樣性和複雜性。每種語言都有其獨特的語法結構和詞彙規則，這使得開發通用的NLP模型變得困難。此外，不同語境下同一詞彙的含義可能會有所不同，這就需要模型能夠理解上下文資訊。

總的來說，自然語言處理是一個充滿挑戰和機遇的領域。隨著技術的不斷進步，我們有理由相信，NLP將在未來發揮更大的作用，為人類社會帶來更多的便利和進步。
"""

# Language Detection

* [ISO 639 language codes](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes)

| Language Code | Language Name    | Language Code | Language Name    |
|---------------|------------------|---------------|------------------|
| zh            | Chinese          | ko            | Korean           |
| en            | English          | ru            | Russian          |
| es            | Spanish          | pt            | Portuguese       |
| fr            | French           | it            | Italian          |
| de            | German           | ar            | Arabic           |
| ja            | Japanese         | tr            | Turkish          |
| nl            | Dutch            | pl            | Polish           |
| sv            | Swedish          | da            | Danish           |
| fi            | Finnish          | no            | Norwegian        |
| el            | Greek            | he            | Hebrew           |
| th            | Thai             | id            | Indonesian       |
| vi            | Vietnamese       |               |                  |
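
The codes in the table above are what a language detector returns, so a small lookup can turn a raw result into something readable. This is a minimal sketch: `describe_language` is a hypothetical helper, the dictionary holds only a subset of the table, and the `(code, score)` tuple shape is assumed to match what `langid` returns in the cells that follow.

```python
# Subset of the ISO 639-1 codes listed in the table above.
LANGUAGE_NAMES = {
    "zh": "Chinese", "en": "English", "ja": "Japanese",
    "ko": "Korean", "fr": "French", "de": "German",
}


def describe_language(result):
    """Format a (code, score) tuple as a human-readable string."""
    code, score = result
    name = LANGUAGE_NAMES.get(code, "Unknown")
    return f"{name} ({code}), score={score:.4f}"


print(describe_language(("zh", 0.9999)))
```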

In [None]:
# Package installation
!pip install langid

Collecting langid
  Downloading langid-1.1.6.tar.gz (1.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langid
  Building wheel for langid (setup.py) ... done
  Created wheel for langid: filename=langid-1.1.6-py3-none-any.whl size=1941172 sha256=0110ddb6c1a1898c5a9b3872b5686ab53711d2c51a0ab854b93f39525e9908b1
  Stored in directory: /root/.cache/pip/wheels/23/c8/c6/eed80894918490a175677414d40bd7c851413bbe03d4856c3c
Successfully built

In [None]:
# Define the sample sentences
chinese_sentence = "你好！我的名字叫做李傑克！"
english_sentence = "Hello! My name is Jack Lee."

In [None]:
# Language detection (unnormalized score: a log-probability, hence negative)
import langid

result = langid.classify(chinese_sentence)
print(result)

('zh', -153.1489453315735)


In [None]:
# Language detection (score normalized to a probability between 0 and 1)
from langid.langid import LanguageIdentifier, model

detector = LanguageIdentifier.from_modelstring(model, norm_probs=True)

result = detector.classify(chinese_sentence)
print(result)

('zh', 0.9999999999667954)


In [None]:
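# Language detection (candidate languages + normalized score)
import langid
from langid.langid import LanguageIdentifier, model

# Restrict the candidate languages.
# Note: this module-level call only affects langid.classify(); to restrict the
# custom detector built below, call detector.set_languages(['en', 'zh']) instead.
langid.set_languages(['en','zh'])
# Build a custom detector that returns normalized scores
detector = LanguageIdentifier.from_modelstring(model, norm_probs=True)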

result = detector.classify(chinese_sentence)
print(result)

result = detector.classify(english_sentence)
print(result)

('zh', 0.9999999999667954)
('en', 0.9977290649556259)


In [None]:
# Mixed Chinese and English
zh_en_sen = 'wefnio溫看誒女母下用lbjknl'
result = detector.classify(zh_en_sen)
print(result)
result = detector.classify(english_sentence)
print(result)

('zh', 0.9999951222243043)
('en', 0.9977290649556259)
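
The detector returns a single label, so the mixed string above is reported as `zh` even though most of its characters are Latin letters. A rough per-script character count can flag such mixed input before detection. This is a minimal sketch with a hypothetical helper `script_counts`; it assumes that the basic CJK Unified Ideographs block and ASCII letters are enough for this example.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")   # basic CJK Unified Ideographs block only
LATIN = re.compile(r"[A-Za-z]")        # ASCII letters only


def script_counts(text):
    """Count CJK ideographs and ASCII letters in the text."""
    return {"cjk": len(CJK.findall(text)), "latin": len(LATIN.findall(text))}


print(script_counts("wefnio溫看誒女母下用lbjknl"))
```

If both counts are non-trivial, the text could be split by script and each part classified separately.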


# Sentence Segmentation

## English Sentence Segmentation

In [None]:
# Import NLTK
import nltk

# Download the Punkt sentence tokenizer models
# (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Split the text into sentences with sent_tokenize
from nltk.tokenize import sent_tokenize

text = "Hello, world! Welcome to the world of NLP. This is an example of sentence tokenization."
sentences = sent_tokenize(text)

print(sentences)

['Hello, world!', 'Welcome to the world of NLP.', 'This is an example of sentence tokenization.']


## Chinese Sentence Segmentation

In [None]:
import spacy

# Load the Chinese model
nlp = spacy.load("zh_core_web_md")

# Segment sentences with spaCy
doc = nlp(chinese_text)

# Drop blank lines and strip the surrounding newlines (\n)
sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Print the result
print(sentences)

['在這個資訊爆炸的時代，自然語言處理(Natural Language Processing, NLP)技術正在迅速發展，成為人工智慧領域的一個重要分支。', 'NLP的應用範圍廣泛，包括語音識別、機器翻譯、情感分析、文本摘要等。', '隨著深度學習技術的進步，NLP的研究和應用正在不斷突破傳統的限制，為人們的生活和工作帶來了許多便利。', '自然語言處理涉及到多個步驟，其中包括語言檢測、文本清理、文本標準化、斷句、斷詞、詞性標註等。', '這些前處理步驟對於後續的分析和模型訓練至關重要。', '例如，文本清理可以去除無用的標點符號和特殊字符，斷詞則將句子分割成單個的詞彙，便於進一步的處理。', '在NLP的應用中，一個常見的任務是情感分析，即判斷文本中所表達的情感是正面還是負面。', '這在社交媒體分析、市場研究等領域有著廣泛的應用。', '另一個重要的應用是機器翻譯，隨著全球化的發展，準確高效的翻譯工具變得越來越重要。', '然而，NLP也面臨著許多挑戰，其中之一是語言的多樣性和複雜性。', '每種語言都有其獨特的語法結構和詞彙規則，這使得開發通用的NLP模型變得困難。', '此外，不同語境下同一詞彙的含義可能會有所不同，這就需要模型能夠理解上下文資訊。', '總的來說，自然語言處理是一個充滿挑戰和機遇的領域。', '隨著技術的不斷進步，我們有理由相信，NLP將在未來發揮更大的作用，為人類社會帶來更多的便利和進步。']


In [None]:
# Mixed Chinese and English
zh_en_sen = '大家好，我是黃（huang）。很高興今天見到大家！hello everybody, my name is huang! Nice to meet u guys.'

result = detector.classify(english_sentence)
print(result)
sentences = sent_tokenize(zh_en_sen)
print(sentences, len(sentences))

result = detector.classify(zh_en_sen)
print(result)
doc = nlp(zh_en_sen)
sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
print(sentences, len(sentences))

('en', 0.9977290649556259)
['大家好，我是黃（huang）。很高興今天見到大家！hello everybody, my name is huang!', 'Nice to meet u guys.'] 2
('zh', 1.0)
['大家好，我是黃（huang）。', '很高興今天見到大家！', 'hello everybody, my name is huang! Nice to meet u guys.'] 3
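
In the output above, `sent_tokenize` does not split on the Chinese terminators, and the Chinese spaCy model leaves the two English sentences joined. For mixed text, a rule-based splitter that knows both sets of terminators may be a useful fallback. This is a minimal sketch, not a replacement for either library: `split_mixed_sentences` is a hypothetical helper, and the pattern treats an ASCII period as a terminator only when it is followed by whitespace or the end of the text.

```python
import re

# A sentence is the shortest run that ends with Chinese/English terminators,
# an ASCII period followed by whitespace or end of text, or the end of text.
SENTENCE = re.compile(r".+?(?:[。！？!?]+|\.(?=\s|$)|$)", re.S)


def split_mixed_sentences(text):
    """Split mixed Chinese/English text into sentences using punctuation rules."""
    return [m.strip() for m in SENTENCE.findall(text) if m.strip()]


sample = ("大家好，我是黃（huang）。很高興今天見到大家！"
          "hello everybody, my name is huang! Nice to meet u guys.")
for sentence in split_mixed_sentences(sample):
    print(sentence)
```

Abbreviations such as "Dr. Lee" would still be split incorrectly, which is the kind of case the trained tokenizers handle better.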


# Emoji Processing

In [None]:
# Install the emoji package
!pip install emoji

Collecting emoji
  Downloading emoji-2.11.1-py2.py3-none-any.whl (433 kB)
Installing collected packages: emoji
Successfully installed emoji-2.11.1


In [None]:
# Import the emoji package
import emoji

# Use demojize() to convert emoji into plain-text shortcodes
print(emoji.demojize('Python is ❤️'))

Python is :red_heart:
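
When the `emoji` package is not available, the standard library can give a rough equivalent by looking up Unicode character names. This is a sketch of a different technique, not the package's implementation: `name_symbols` is a hypothetical helper, it only handles characters in the Unicode "Symbol, other" category, and the names it produces are official Unicode names rather than the shortcodes shown above.

```python
import unicodedata

# Invisible characters that often accompany emoji: variation selector-16
# and the zero-width joiner.
SKIP = {"\ufe0f", "\u200d"}


def name_symbols(text):
    """Replace 'Symbol, other' characters with :unicode_name: placeholders."""
    parts = []
    for ch in text:
        if ch in SKIP:
            continue
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown").lower().replace(" ", "_")
            parts.append(f":{name}:")
        else:
            parts.append(ch)
    return "".join(parts)


print(name_symbols("Python is \u2764\ufe0f"))  # Python is :heavy_black_heart:
```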


# Spelling Check

## English Spelling Check

| Language   | Code | Language   | Code | Language   | Code |
|------------|------|------------|------|------------|------|
| English    | en   | Dutch      | nl   | French     | fr   |
| German     | de   | Spanish    | es   | Portuguese | pt   |
| Arabic     | ar   | Asturian   | ast  | Belarusian | be   |
| Breton     | br   | Catalan    | ca   | Chinese    | zh   |
| Danish     | da   | Esperanto  | eo   | Galician   | gl   |
| Greek      | el   | Irish      | ga   | Italian    | it   |
| Japanese   | ja   | Khmer      | km   | Norwegian  | no   |
| Persian    | fa   | Polish     | pl   | Romanian   | ro   |
| Russian    | ru   | Slovak     | sk   | Slovenian  | sl   |
| Swedish    | sv   | Tagalog    | tl   | Tamil      | ta   |
| Ukrainian  | uk   |            |      |            |      |



In [None]:
# Install the language-tool-python package
!pip install language-tool-python

Collecting language-tool-python
  Downloading language_tool_python-2.8-py3-none-any.whl (35 kB)
Installing collected packages: language-tool-python
Successfully installed language-tool-python-2.8


In [None]:
# Import the language_tool_python package
import language_tool_python

# Choose the language variant to check against
tool = language_tool_python.LanguageTool('en-US')

# Run the spelling check
text = 'Speling erors in sentense are anoying.'
matches = tool.check(text)

# Print the raw data structure of the check results
print(matches)

Downloading LanguageTool 6.4: 100%|██████████| 246M/246M [00:03<00:00, 74.6MB/s]
INFO:language_tool_python.download_lt:Unzipping /tmp/tmpuv3m8n9k.zip to /root/.cache/language_tool_python.
INFO:language_tool_python.download_lt:Downloaded https://www.languagetool.org/download/LanguageTool-6.4.zip to /root/.cache/language_tool_python.


[Match({'ruleId': 'MORFOLOGIK_RULE_EN_US', 'message': 'Possible spelling mistake found.', 'replacements': ['Spelling', 'Spewing', 'Spieling'], 'offsetInContext': 0, 'context': 'Speling erors in sentense are anoying.', 'offset': 0, 'errorLength': 7, 'category': 'TYPOS', 'ruleIssueType': 'misspelling', 'sentence': 'Speling erors in sentense are anoying.'}), Match({'ruleId': 'MORFOLOGIK_RULE_EN_US', 'message': 'Possible spelling mistake found.', 'replacements': ['errors', 'Eros', 'errs'], 'offsetInContext': 8, 'context': 'Speling erors in sentense are anoying.', 'offset': 8, 'errorLength': 5, 'category': 'TYPOS', 'ruleIssueType': 'misspelling', 'sentence': 'Speling erors in sentense are anoying.'}), Match({'ruleId': 'MORFOLOGIK_RULE_EN_US', 'message': 'Possible spelling mistake found.', 'replacements': ['sentence', 'sen tense'], 'offsetInContext': 17, 'context': 'Speling erors in sentense are anoying.', 'offset': 17, 'errorLength': 8, 'category': 'TYPOS', 'ruleIssueType': 'misspelling', '

In [None]:
# Apply the suggested corrections to the original text and print the result
corrected_text = language_tool_python.utils.correct(text, matches)
print(corrected_text)

Spelling errors in sentence are annoying.
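
Each match printed earlier carries an `offset`, an `errorLength`, and a list of `replacements`, which is enough to apply corrections by hand. The sketch below illustrates that idea with plain dictionaries. It is not the library's own implementation, `apply_first_suggestions` is a hypothetical helper, and only the three matches fully visible in the output above are included.

```python
def apply_first_suggestions(text, matches):
    """Replace each flagged span with its first suggestion.

    Matches are applied from the end of the text backwards so that
    earlier offsets remain valid after each replacement.
    """
    for match in sorted(matches, key=lambda m: m["offset"], reverse=True):
        if not match["replacements"]:
            continue  # nothing to suggest for this span
        start = match["offset"]
        end = start + match["errorLength"]
        text = text[:start] + match["replacements"][0] + text[end:]
    return text


# Fields copied from the first three matches in the output above.
matches = [
    {"offset": 0, "errorLength": 7, "replacements": ["Spelling", "Spewing", "Spieling"]},
    {"offset": 8, "errorLength": 5, "replacements": ["errors", "Eros", "errs"]},
    {"offset": 17, "errorLength": 8, "replacements": ["sentence", "sen tense"]},
]
print(apply_first_suggestions("Speling erors in sentense are anoying.", matches))
```

Taking the first suggestion blindly is a simplification; a real workflow might rank or review the candidates.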


## Chinese Spelling Check

In [None]:
# Install the pycorrector package
!pip install pycorrector

Collecting pycorrector
  Downloading pycorrector-1.0.4.tar.gz (4.4 MB)
  Preparing metadata (setup.py) ... done
Collecting pypinyin (from pycorrector)
  Downloading pypinyin-0.51.0-py2.py3-none-any.whl (1.4 MB)
Collecting datasets (from pycorrector)
  Downloading datasets-2.19.0-py3-none-any.whl (542 kB)
Collecting loguru (from pycorrector)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->pycorrector)
  Downloading dill-0.3.8-py3-none-any.whl (116

In [None]:
# Load the MacBERT-based corrector (a masked language model, MLM, used for correction)
from pycorrector import MacBertCorrector
corrector = MacBertCorrector()

# Chinese test sentences
error_sentences = [
    '今天新情很好',
    '我遇到一位老友跟我療天。',
    '他們只能有兩個選擇：接受降新或自動離職。'
]

# Run the spelling check
batch_results = corrector.correct_batch(error_sentences)

# Print the results (note: roughly 100 sentences per batch is about the limit)
for result in batch_results:
    print(result)
    print(result['source'])
    print(result['target'])
    print("----------------------------------")

2024-04-30 05:32:47.698 | DEBUG    | pycorrector.macbert.macbert_corrector:__init__:30 - Use device: cuda
2024-04-30 05:32:47.705 | DEBUG    | pycorrector.macbert.macbert_corrector:__init__:31 - Loaded macbert4csc model: shibing624/macbert4csc-base-chinese, spend: 11.592 s.


{'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]}
今天新情很好
今天心情很好
----------------------------------
{'source': '我遇到一位老友跟我療天。', 'target': '我遇到一位老友跟我聊天。', 'errors': [('療', '聊', 9)]}
我遇到一位老友跟我療天。
我遇到一位老友跟我聊天。
----------------------------------
{'source': '他們只能有兩個選擇：接受降新或自動離職。', 'target': '他們只能有兩個選擇：接受降薪或自動離職。', 'errors': [('新', '薪', 13)]}
他們只能有兩個選擇：接受降新或自動離職。
他們只能有兩個選擇：接受降薪或自動離職。
----------------------------------
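
Each result above lists its corrections as `(wrong, right, position)` tuples, where the position appears to be a 0-based character index into `source`. The sketch below re-applies those tuples to check that reading. `apply_errors` is a hypothetical helper, and the index interpretation is an assumption inferred from the three results shown.

```python
def apply_errors(source, errors):
    """Rebuild the corrected text from (wrong, right, position) tuples."""
    text = source
    # Apply from the end backwards so earlier positions stay valid.
    for wrong, right, pos in sorted(errors, key=lambda e: e[2], reverse=True):
        if text[pos:pos + len(wrong)] != wrong:
            raise ValueError(f"Expected {wrong!r} at position {pos}")
        text = text[:pos] + right + text[pos + len(wrong):]
    return text


# Values copied from the first result in the output above.
result = {"source": "今天新情很好", "target": "今天心情很好", "errors": [("新", "心", 2)]}
print(apply_errors(result["source"], result["errors"]) == result["target"])
```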


In [None]:
# Mixed Chinese and English
zh_en_sen = '大夾好，窩是黃（huang）。粉高興經天見到大家！hello everybudy, my nane is huang! Nice too meet u guys.'

result = detector.classify(english_sentence)
print(result)
sentences = sent_tokenize(zh_en_sen)
print(sentences, len(sentences))

tool = language_tool_python.LanguageTool('en-US')
matches = tool.check(zh_en_sen)
corrected_text = language_tool_python.utils.correct(zh_en_sen, matches)
print(corrected_text)


result = detector.classify(zh_en_sen)
print(result)
doc = nlp(zh_en_sen)
sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
print(sentences, len(sentences))

batch_results = corrector.correct_batch(sentences)
for result in batch_results:
    # print(result)
    print(result['source'])
    print(result['target'])
    print("----------------------------------")

('en', 0.9977290649556259)
['大夾好，窩是黃（huang）。粉高興經天見到大家！hello everybudy, my nane is huang!', 'Nice too meet u guys.'] 2
大夾好，窩是黃（huang）。粉高興經天見到大家！hello everybody, my name is Huang! Nice too meets u guys.
('zh', 1.0)
['大夾好，窩是黃（huang）。', '粉高興經天見到大家！', 'hello everybudy, my nane is huang! Nice too meet u guys.'] 3
大夾好，窩是黃（huang）。
大夾好，我是黃（huang）。ng）。
----------------------------------
粉高興經天見到大家！
粉高興今天見到大家！
----------------------------------
hello everybudy, my nane is huang! Nice too meet u guys.
hello everybudy, my nane is huang! Nice too meet u guys.et u guys.
----------------------------------


# Traditional/Simplified Chinese Conversion

In [None]:
# Install and import the OpenCC package
!pip install opencc
from opencc import OpenCC

Collecting opencc
  Downloading OpenCC-1.1.7-cp310-cp310-manylinux1_x86_64.whl (779 kB)
Installing collected packages: opencc
Successfully installed opencc-1.1.7


In [None]:
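# A helper dedicated to Traditional/Simplified Chinese conversion
def convert_chinese(text, target='simplified', dialect_convert=True):
    # Map (target, dialect_convert) to the matching OpenCC configuration file;
    # dialect_convert=True also converts regional phrasing, not just characters
    configurations = {
        ('simplified', True): 'tw2sp.json',
        ('simplified', False): 'tw2s.json',
        ('traditional', True): 's2twp.json',
        ('traditional', False): 's2tw.json'
    }
    configuration = configurations.get((target, dialect_convert))
    if configuration is None:
        raise ValueError(f"Unsupported target: {target!r}")

    # Initialize the matching converter
    converter = OpenCC(configuration)

    # Convert the text and return the result
    return converter.convert(text)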

In [None]:
# Call the conversion function and print the result
print(convert_chinese(chinese_text))


在这个信息爆炸的时代，自然语言处理(Natural Language Processing, NLP)技术正在迅速发展，成为人工智能领域的一个重要分支。NLP的应用范围广泛，包括语音识别、机器翻译、情感分析、文本摘要等。随着深度学习技术的进步，NLP的研究和应用正在不断突破传统的限制，为人们的生活和工作带来了许多便利。

自然语言处理涉及到多个步骤，其中包括语言检测、文本清理、文本标准化、断句、断词、词性标注等。这些前处理步骤对于后续的分析和模型训练至关重要。例如，文本清理可以去除无用的标点符号和特殊字符，断词则将句子分割成单个的词汇，便于进一步的处理。

在NLP的应用中，一个常见的任务是情感分析，即判断文本中所表达的情感是正面还是负面。这在社交媒体分析、市场研究等领域有着广泛的应用。另一个重要的应用是机器翻译，随着全球化的发展，准确高效的翻译工具变得越来越重要。

然而，NLP也面临着许多挑战，其中之一是语言的多样性和复杂性。每种语言都有其独特的语法结构和词汇规则，这使得开发通用的NLP模型变得困难。此外，不同语境下同一词汇的含义可能会有所不同，这就需要模型能够理解上下文信息。

总的来说，自然语言处理是一个充满挑战和机遇的领域。随着技术的不断进步，我们有理由相信，NLP将在未来发挥更大的作用，为人类社会带来更多的便利和进步。
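
In the converted text above, 資訊 becomes 信息 rather than the character-by-character 资讯, which reflects phrase-level conversion on top of character mapping. The toy sketch below illustrates that difference with a longest-match phrase lookup. It is not how OpenCC is implemented; `toy_convert` and both tiny tables are hypothetical and cover only this example.

```python
# Tiny illustrative tables; real conversion tables are far larger.
CHAR_MAP = {"資": "资", "訊": "讯", "時": "时"}
PHRASE_MAP = {"資訊": "信息"}


def toy_convert(text, char_map, phrase_map=None):
    """Convert text, preferring the longest matching phrase over single characters."""
    phrase_map = phrase_map or {}
    max_len = max((len(p) for p in phrase_map), default=0)
    out = []
    i = 0
    while i < len(text):
        for size in range(max_len, 0, -1):
            chunk = text[i:i + size]
            if chunk in phrase_map:
                out.append(phrase_map[chunk])
                i += len(chunk)
                break
        else:
            # No phrase matched here: fall back to character mapping.
            out.append(char_map.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(toy_convert("資訊時代", CHAR_MAP))              # characters only
print(toy_convert("資訊時代", CHAR_MAP, PHRASE_MAP))  # phrases first
```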



# Tokenization & Text Cleaning

* Text cleaning: removing punctuation and stop words

## English

In [None]:
# Import the required libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Download the NLTK tokenizer and stop-word resources
nltk.download('punkt')
nltk.download('stopwords')

# Get the English stop-word list
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
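# Tokenize, then remove punctuation and stop words
def tokenize_without_punctuation_and_stopwords(text):
    # Tokenize with NLTK's word_tokenize
    tokens = word_tokenize(text)
    punctuation_chars = set(string.punctuation)
    # Drop tokens made up entirely of punctuation (e.g. "," or "...") and stop words
    filtered_tokens = [
        token for token in tokens
        if not all(ch in punctuation_chars for ch in token)
        and token.lower() not in stop_words
    ]
    return filtered_tokens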

In [None]:
# Test text
text = "Hello, how are you doing today?"

# Tokenize and remove punctuation and stop words
tokens = tokenize_without_punctuation_and_stopwords(text)
print(tokens)

['Hello', 'today']


## Chinese

In [None]:
# Install Jieba
!pip install jieba

# Fetch a custom Traditional Chinese dictionary (optional; can be edited or extended)
import os
Dictionary_File = 'dict.txt.big'

if not os.path.isfile(Dictionary_File):
    os.system('wget https://raw.githubusercontent.com/cnchi/datasets/master/' + Dictionary_File)

# Fetch a Traditional Chinese stop-word list (e.g. 的, 了, ...)
StopWords_File = "stopWords_big5.txt"

if not os.path.isfile(StopWords_File):
    os.system('wget https://raw.githubusercontent.com/cnchi/datasets/master/' + StopWords_File)



In [None]:
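# Import Jieba
import jieba

# Load the custom Traditional Chinese dictionary (optional)
# jieba.set_dictionary(Dictionary_File)

# Build the set of characters to drop (punctuation, digits, whitespace) and the stop-word set
punctuation = set("$!&#%\\()+-*/_,. 　?:;'\"<=>^`|~[]{}’0123456789?_“”、。《》！，：；？「」（）")
stopWords = set()
with open(StopWords_File, "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()  # Remove surrounding whitespace, including the trailing \n
        stopWords.add(line)

# Feed the segmented sentences, one at a time, into tokenization and cleaning
# (`sentences` is the list produced by the most recent segmentation cell)
result = []
for sentence in sentences:
    # Tokenize this sentence
    seg_sentence = list(jieba.cut(sentence))
    # Remove punctuation
    seg_sentence_no_punct = [word for word in seg_sentence if word not in punctuation]
    # Remove stop words
    seg_sentence_no_stopwords = [word for word in seg_sentence_no_punct if word not in stopWords]

    # Append the cleaned tokens to the result list
    result.append(seg_sentence_no_stopwords)

# Print the result
print(result)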

[['大夾好', '窩', '黃', 'huang'], ['粉高興', '經天', '見'], ['hello', 'everybudy', 'my', 'nane', 'is', 'huang', 'Nice', 'too', 'meet', 'u', 'guys']]
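
To build intuition for dictionary-based Chinese tokenization, the sketch below implements forward maximum matching, one of the simplest segmentation techniques. It is not Jieba's algorithm, which combines a prefix dictionary, dynamic programming over candidate paths, and an HMM for unknown words. `forward_max_match` and the small vocabulary are hypothetical.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedily take the longest dictionary word at each position.

    Characters that start no known word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


vocabulary = {"自然語言", "自然", "語言", "處理", "技術"}
print(forward_max_match("自然語言處理技術", vocabulary))
```

Greedy matching fails on ambiguous strings where a shorter first word leads to a better overall split, which is one reason statistical segmenters are preferred in practice.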


# English Text Normalization

* Lowercasing
* Stemming
* Lemmatization

In [None]:
# Import the required packages
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Download the NLTK resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Initialize the stemmer and the lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
# Demonstrate lowercasing, stemming, and lemmatization
def process_text(text):
    # Lowercase the text
    text = text.lower()
    # Tokenize with NLTK's word_tokenize
    words = nltk.word_tokenize(text)

    # Stemming
    stemmed_words = [stemmer.stem(word) for word in words]

    # Lemmatization
    # For more accurate results, pass each word's part of speech
    # (get_wordnet_pos is defined in the next cell; run it before calling this function)
    lemmatized_words = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(word)) for word in words]

    return stemmed_words, lemmatized_words

In [None]:
# Map NLTK POS tags to WordNet POS constants
def get_wordnet_pos(word):
    """Map NLTK's part-of-speech tags to those used by WordNet Lemmatizer"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
# Test text
text = "The leaves on the tree have fallen due to strong winds."

# Process the text
stemmed, lemmatized = process_text(text)
print("Stemmed Words:", stemmed)
print("Lemmatized Words:", lemmatized)

Stemmed Words: ['the', 'leav', 'on', 'the', 'tree', 'have', 'fallen', 'due', 'to', 'strong', 'wind', '.']
Lemmatized Words: ['the', 'leaf', 'on', 'the', 'tree', 'have', 'fall', 'due', 'to', 'strong', 'wind', '.']
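
The output above shows the key difference: stemming chops suffixes by rule (`leaves` becomes `leav`), while lemmatization looks up a dictionary form (`leaves` becomes `leaf`). The toy sketch below contrasts the two ideas using only the standard library. It is not the Porter algorithm or WordNet; `toy_stem`, `toy_lemmatize`, and the tiny lookup table are hypothetical.

```python
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order

# Hand-written lookup standing in for a real lexical database.
LEMMA_TABLE = {"leaves": "leaf", "fallen": "fall", "winds": "wind"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)


for word in ["leaves", "fallen", "winds"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

Rule-based stems need not be real words, whereas lemmas always are, at the cost of needing a lexicon and often a part-of-speech tag.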


# Part-of-Speech (POS) Tagging

## English POS Tagging

* Uses the Penn Treebank tag set

| Tag  | Description                                | Example              | Tag   | Description                                | Example            |
|------|--------------------------------------------|----------------------|-------|--------------------------------------------|--------------------|
| CC   | Coordinating conjunction                   | and, but, or         | CD    | Cardinal number                            | one, two, 123      |
| DT   | Determiner                                 | the, a, some         | EX    | Existential there                          | there is, there was|
| FW   | Foreign word                               | hors d'oeuvre        | IN    | Preposition or subordinating conjunction   | in, of, like       |
| JJ   | Adjective                                  | big, happy           | JJR   | Adjective, comparative                     | bigger, happier    |
| JJS  | Adjective, superlative                     | biggest, happiest    | LS    | List item marker                           | 1, One:            |
| MD   | Modal                                      | can, could, will     | NN    | Noun, singular or mass                     | dog, car           |
| NNS  | Noun, plural                               | dogs, cars           | NNP   | Proper noun, singular                      | John, London       |
| NNPS | Proper noun, plural                        | Vikings, Americans   | PDT   | Predeterminer                              | all, both          |
| POS  | Possessive ending                          | 's                   | PRP   | Personal pronoun                           | I, you, he         |
| PRP\$| Possessive pronoun                         | my, your, his        | RB    | Adverb                                     | very, too, not     |
| RBR  | Adverb, comparative                        | better               | RBS   | Adverb, superlative                        | best               |
| RP   | Particle                                   | up, off              | SYM   | Symbol                                     | +, %, &            |
| TO   | to                                         | to                   | UH    | Interjection                               | ah, oops, ouch     |
| VB   | Verb, base form                            | eat, walk            | VBD   | Verb, past tense                           | ate, walked        |
| VBG  | Verb, gerund or present participle         | eating, walking      | VBN   | Verb, past participle                      | eaten, walked      |
| VBP  | Verb, non-3rd person singular present      | eat, walk            | VBZ   | Verb, 3rd person singular present          | eats, walks        |
| WDT  | Wh-determiner                              | which, whatever      | WP    | Wh-pronoun                                 | who, whoever       |
| WP\$ | Possessive wh-pronoun                      | whose                | WRB   | Wh-adverb                                  | where, when        |


In [None]:
# Import NLTK
import nltk
from nltk.tokenize import word_tokenize

# Download the NLTK tokenizer and POS tagger resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
# Define a function that performs POS tagging
def pos_tagging(text):
    # Tokenize
    words = word_tokenize(text)
    # Tag each token with its part of speech
    pos_tags = nltk.pos_tag(words)
    return pos_tags

In [None]:
# Test text
text = "The quick brown fox jumps over the lazy dog."

# Get the POS tagging result
tagged_text = pos_tagging(text)
print("POS Tagging Result:", tagged_text)

POS Tagging Result: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
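
Grouping the tagged tokens by tag makes the result easier to scan and helps spot questionable tags, such as `brown` landing under `NN` in the output above. This is a minimal sketch using only the standard library; `group_by_tag` is a hypothetical helper, and the input list is copied from that output.

```python
from collections import defaultdict


def group_by_tag(tagged_tokens):
    """Collect words under their POS tag, preserving the original order."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word)
    return dict(groups)


# Copied from the POS tagging output above.
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN'), ('.', '.')]
for tag, words in group_by_tag(tagged).items():
    print(tag, words)
```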


## Chinese POS Tagging

| Tag   | Description                 | Tag   | Description                   |
|-------|-----------------------------|-------|-------------------------------|
| ADJ   | adjective                   | ADP   | adposition                    |
| ADV   | adverb                      | AUX   | auxiliary                     |
| CONJ  | conjunction                 | CCONJ | coordinating conjunction      |
| DET   | determiner                  | INTJ  | interjection                  |
| NOUN  | noun                        | NUM   | numeral                       |
| PART  | particle                    | PRON  | pronoun                       |
| PROPN | proper noun                 | PUNCT | punctuation                   |
| SCONJ | subordinating conjunction   | SYM   | symbol                        |
| VERB  | verb                        | X     | other                         |
| SPACE | space                       |       |                               |

In [None]:
# Import spaCy
import spacy

# Load the Chinese model
nlp = spacy.load("zh_core_web_md")

# A Traditional Chinese sentence
text = "我來到位於台北的政治大學校園。"

# Process the text
doc = nlp(text)

# Print each token's text and POS tag
for token in doc:
    print(f"{token.text} ({token.pos_})")

# Named Entity Recognition (NER)


| Entity Type | Description                                                  |
|-------------|--------------------------------------------------------------|
| PERSON      | People, including fictional.                                 |
| NORP        | Nationalities or religious or political groups.              |
| FAC         | Buildings, airports, highways, bridges, etc.                 |
| ORG         | Companies, agencies, institutions, etc.                      |
| GPE         | Countries, cities, states.                                   |
| LOC         | Non-GPE locations, mountain ranges, bodies of water.         |
| PRODUCT     | Objects, vehicles, foods, etc. (Not services.)               |
| EVENT       | Named hurricanes, battles, wars, sports events, etc.        |
| WORK_OF_ART | Titles of books, songs, etc.                                 |
| LAW         | Named documents made into laws.                              |
| LANGUAGE    | Any named language.                                          |
| DATE        | Absolute or relative dates or periods.                       |
| TIME        | Times smaller than a day.                                    |
| PERCENT     | Percentage, including "%".                                   |
| MONEY       | Monetary values, including unit.                             |
| QUANTITY    | Measurements, as of weight or distance.                      |
| ORDINAL     | "first", "second", etc.                                      |
| CARDINAL    | Numerals that do not fall under another type.                |


## English

In [None]:
# Import spaCy
import spacy

# Load the small English NLP model
nlp = spacy.load("en_core_web_sm")



In [None]:
# Define a function that performs named entity recognition
def named_entity_recognition(text):
    # Process the text
    doc = nlp(text)
    # Extract the named entities as (text, label) pairs
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

In [None]:
# Test text
text = "Apple Inc. announced its new iPhone in San Francisco."

# Run named entity recognition
entities = named_entity_recognition(text)

# Print the recognized entities
print("Identified Entities:", entities)

Identified Entities: [('Apple Inc.', 'ORG'), ('iPhone', 'ORG'), ('San Francisco', 'GPE')]
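
The small model labels `iPhone` as `ORG` in the output above, which a reader might prefer to treat as `PRODUCT`. One lightweight workaround is to post-process the `(text, label)` pairs with a hand-maintained override table. This is a minimal sketch: `apply_label_overrides` and the override table are hypothetical, and this is plain post-processing rather than a spaCy feature.

```python
def apply_label_overrides(entities, overrides):
    """Replace the label of any entity whose text appears in the override table."""
    return [(text, overrides.get(text, label)) for text, label in entities]


# Copied from the NER output above.
entities = [('Apple Inc.', 'ORG'), ('iPhone', 'ORG'), ('San Francisco', 'GPE')]
# Hypothetical, hand-maintained corrections.
overrides = {'iPhone': 'PRODUCT'}

print(apply_label_overrides(entities, overrides))
```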


In [None]:
# try by myself
text = "At Apple, we’re demonstrating \
every day that business can and \
should be a force for good. And \
we’ve made important progress \
over the last year through our \
Environmental, Social, and \
Governance (ESG) initiatives. \
That would not be possible \
without the innovation and \
collaboration of teams across \
Apple, and the people and \
organizations we partner with. \
As we look ahead, we know \
there is more to be done. We’re \
committed to continue to build \
on our efforts and drive even \
greater impact in the years \
to come. "
# Run named entity recognition
entities = named_entity_recognition(text)

# Print the recognized entities
print("Identified Entities:", entities)

Identified Entities: [('we’re', 'PERSON'), ('demonstrating', 'PERSON')]
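
The passage above uses typographic apostrophes (`’`), and the only entities returned are `we’re` and `demonstrating`, while `Apple` is missed. One plausible contributing factor is the curly punctuation, so normalizing quotes before running NER may be worth trying; whether it changes the model's output here has not been verified. `normalize_quotes` is a hypothetical helper.

```python
# Map typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text):
    """Return the text with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_MAP)


print(normalize_quotes("At Apple, we\u2019re demonstrating every day"))
```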


## Chinese

In [None]:
# Import spaCy
import spacy

# Load the medium Chinese NLP model
nlp = spacy.load("zh_core_web_md")

# A Traditional Chinese sentence
text = "台北101是台灣的一個地標，位於台北市。"

# Process the text
doc = nlp(text)

# Print the recognized named entities
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

台北 (GPE)
101 (CARDINAL)
台灣 (GPE)
台北市 (GPE)


In [None]:
# Try it yourself
# A Traditional Chinese passage
text = "本報告書參照全球報告倡議組織(Global Reporting Initiatives，GRI)發布之永續性報告準則 2021 年版進行編製，同時也參考聯合國全球盟約 (UNGC)、氣候相關財務揭露 (TCFD)、SASB 準則、「上市公司編製與申報永續報告書作業辦法」、以及整合性報告書 (IR) 框架精神等永續規範進行編製。本報告書由第三方查證單位進 行保證，確認報告書內容符合 AA1000 AS v3 保證標準的第一類型中度保證等級。"

# Process the text
doc = nlp(text)

# Print the recognized named entities
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

CFD) (PERSON)
SASB (PERSON)
第三方 (ORG)
AS v3 (PRODUCT)
第一 (ORDINAL)
