<a href="https://colab.research.google.com/github/cheng1610/news-category/blob/main/news_category.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

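# Deep Learning-Based Automatic News Topic Classification System
### 1. Project Overview
- This project builds a system that automatically identifies the topic of a news article, assigning each article to one of several categories, for example:
  - World: international news
  - Sports: sports news
  - Business: business and economic news
  - Sci/Tech: science and technology news
- **Development environment**: Google Colab
- **Deep learning framework**: TensorFlow
- **Dataset**: the AG News dataset included in TensorFlow Datasets (a classic English news classification dataset, widely used for research and benchmarking in natural language processing and text classification)
- **Dataset size**: 120,000 articles in the training set and 7,600 articles in the test set
- **Data structure**: each example (article) in the training and test sets has two parts
   - text: the news text, stored as a TensorFlow Tensor (dtype=tf.string)
   - label: the news category as an integer from 0 to 3, corresponding in order to World, Sports, Business, and Sci/Tech, stored as a TensorFlow Tensor (dtype=tf.int64)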
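### 2. Network Architecture
1. **Data preprocessing**  
   - Use Tokenizer to convert text into integer sequences
   - Use pad_sequences to bring all sequences to the same length
   - Use an Embedding layer to map each integer to a dense vector
2. **Model training**  
   - Classify text with a model built from Embedding, LSTM, and Dense layers  
   - Set the loss function and optimizer, then train the model
3. **Model evaluation**  
   - Evaluate the model with accuracy and precision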
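
The two evaluation metrics named above can be sketched in plain Python. This is a minimal illustration rather than the notebook's own evaluation code: the helper names `accuracy` and `precision_per_class` and the toy labels are assumptions made for the example, and returning 0.0 for a class that is never predicted is one possible convention.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_per_class(y_true, y_pred, num_classes):
    """For each class c: correct predictions of c divided by all predictions of c."""
    result = []
    for c in range(num_classes):
        # True labels of every example the model assigned to class c
        predicted_c = [t for t, p in zip(y_true, y_pred) if p == c]
        if not predicted_c:
            result.append(0.0)  # assumption: report 0.0 when a class is never predicted
            continue
        true_positives = sum(1 for t in predicted_c if t == c)
        result.append(true_positives / len(predicted_c))
    return result


# Toy labels (hypothetical, not taken from AG News): 0=World, 1=Sports, 2=Business, 3=Sci/Tech
toy_true = [0, 1, 2, 3, 0, 1, 2, 3]
toy_pred = [0, 1, 2, 2, 0, 1, 3, 3]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_precision = precision_per_class(toy_true, toy_pred, num_classes=4)
print("accuracy:", toy_accuracy)
print("precision per class:", toy_precision)
```

Accuracy summarizes all four classes in one number, while per-class precision shows which categories the model tends to over-predict.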
     
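### 3. Expected Results
- The model classifies news text automatically with high accuracy (about 90% validation accuracy in the run recorded in this notebook)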


In [None]:
# Import the required packages
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import tensorflow_datasets as tfds
import numpy as np

In [None]:
# Download the AG News dataset and split it into training and test sets
dataset = tfds.load(
    "ag_news_subset",
    as_supervised=True
)

train_ds = dataset['train']
test_ds = dataset['test']



Dataset ag_news_subset downloaded and prepared to /root/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.


In [None]:
# Convert the news data from a TensorFlow Dataset object into Python lists for the Tokenizer step that follows

# Training data
train_texts = []   # train_texts = [news text, news text, ..., news text]
train_labels = []  # train_labels = [label, label, ..., label]

for text, label in train_ds:
    train_texts.append(text.numpy().decode('utf-8'))  # decode the bytes held by the tensor into a Python string
    train_labels.append(label.numpy())

# Test data
test_texts = []    # test_texts = [news text, news text, ..., news text]
test_labels = []   # test_labels = [label, label, ..., label]

for text, label in test_ds:
    test_texts.append(text.numpy().decode('utf-8'))   # decode the bytes held by the tensor into a Python string
    test_labels.append(label.numpy())

In [None]:
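# Use Tokenizer to convert the text into integer sequences the model can consume

vocab_size = 10000  # vocabulary size limit
max_len = 200       # maximum sequence length

# Set up the Tokenizer: words are numbered by how often they appear in the training set (most frequent first);
# only indices below vocab_size (10000) are kept, and every other word is mapped to <OOV>
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
# Build the word index from the training set only
tokenizer.fit_on_texts(train_texts)

# Convert the text into integer sequences
# Example (illustrative indices): "Apple releases new iPhone..." --> [523, 1840, 27, 3391, ...]
X_train = tokenizer.texts_to_sequences(train_texts)
X_test = tokenizer.texts_to_sequences(test_texts)

# Use pad_sequences to give every input the same length (max_len = 200); shorter sequences are padded with 0 at the front
# Example:
#   Original sequence 1: [5, 2, 1, ...] (say length 100)
#       after pad_sequences: [0, 0, ...0, 5, 2, 1,...] (length 200)
#   Original sequence 2: [7, 3, 8, 2, 9, 4, ...] (say length 250)
#       after pad_sequences: [7, 3, 8, 2, 9, 4, ...] (only the first 200 elements are kept)
x_train = pad_sequences(X_train, maxlen=max_len, padding='pre', truncating='post')
x_test = pad_sequences(X_test, maxlen=max_len, padding='pre', truncating='post')

# Convert the labels to NumPy arrays for easier computation
y_train = np.array(train_labels)
y_test = np.array(test_labels)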
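
The tokenizing and padding steps above can be illustrated without Keras. The sketch below is a simplified stand-in, not the Keras implementation: the helpers `build_word_index`, `texts_to_ids`, and `pad_pre_truncate_post` are hypothetical names, words are split on whitespace only, and it assumes the convention that index 0 is padding and index 1 is the out-of-vocabulary token.

```python
from collections import Counter

OOV_ID = 1  # assumption: 1 is reserved for out-of-vocabulary words, 0 for padding


def build_word_index(texts, vocab_size):
    """Number words by descending frequency, keeping only ids below vocab_size."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {}
    for rank, (word, _) in enumerate(counts.most_common(), start=OOV_ID + 1):
        if rank >= vocab_size:
            break
        word_index[word] = rank
    return word_index


def texts_to_ids(texts, word_index):
    """Replace each word with its id, or OOV_ID when the word is unknown."""
    return [[word_index.get(w, OOV_ID) for w in t.lower().split()] for t in texts]


def pad_pre_truncate_post(seq, max_len):
    """Keep the first max_len ids, then left-pad with zeros up to max_len."""
    kept = seq[:max_len]
    return [0] * (max_len - len(kept)) + kept


# Toy corpus (hypothetical) with a deliberately tiny vocabulary
toy_texts = ["stocks rise as markets rally", "markets fall", "team wins final"]
word_index = build_word_index(toy_texts, vocab_size=5)
ids = texts_to_ids(["markets rally again"], word_index)
padded = [pad_pre_truncate_post(s, max_len=5) for s in ids]
print(word_index)
print(padded)
```

Because the word index is built from the training texts only, any word first seen at test time falls back to the out-of-vocabulary id instead of receiving a new index.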

In [None]:
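# Build the model for news topic classification
embedding_dim = 64  # dimensionality of the Embedding layer: each word is represented as a 64-dimensional vector
# Example:
# If a text contains the word 'Apple', the Embedding layer maps it to a 64-dimensional vector such as:
# 'Apple' -> [0.12, -0.03, 0.45, ..., 0.08]  # 64 numbers in total
# Every word is mapped to a vector in the same way; the vectors are learned during training and carry semantic information

# Stack the layers with Sequential
# Layer 1, Embedding: turns word indices into vectors
# Layer 2, LSTM (long short-term memory): captures the contextual meaning of the words in the sequence
# Layer 3, Dense: 32 units combine the LSTM output features linearly and apply a ReLU non-linearity to extract higher-level features
# Layer 4, Dense: 4 units map the features to the 4 news categories; softmax outputs one probability per class, and the class with the highest probability is the prediction
model = Sequential([
    tf.keras.Input(shape=(max_len,)),  # fixed input length, so the model is built before summary()
    Embedding(vocab_size, embedding_dim),
    LSTM(64, return_sequences=False),
    Dense(32, activation='relu'),
    Dense(4, activation='softmax')  # 4 news categories
])

# Compile the model: set the loss function, optimizer, and evaluation metric
# sparse_categorical_crossentropy is the multi-class loss to use when the labels are integers (0 to 3) rather than one-hot vectors
# Adam is a gradient-based optimizer with adaptive learning rates; it usually converges quickly and stably
# Accuracy is computed during training and evaluation as a reference for model performance
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Print the full model architecture
model.summary()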
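
The parameter counts that `model.summary()` reports can be cross-checked by hand. The sketch below assumes the standard formulas for these layer types (an LSTM with four gates and a bias per gate, Dense layers with a bias) and the layer sizes used in this notebook; the variable names are chosen for the example.

```python
vocab_size = 10000
embedding_dim = 64
lstm_units = 64
dense_units = 32
num_classes = 4

# Embedding: one embedding_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense layers: weights plus one bias per unit
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params + output_params

print("Embedding:", embedding_params)  # 640000
print("LSTM:", lstm_params)            # 33024
print("Dense(32):", dense_params)      # 2080
print("Dense(4):", output_params)      # 132
print("Total:", total_params)          # 675236
```

Almost all of the parameters sit in the Embedding layer, so `vocab_size` and `embedding_dim` dominate the model size.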


In [None]:
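# To keep the model from learning the order of the examples rather than their content, the training data is shuffled; the padded sequences and labels are first wrapped in a TensorFlow Dataset so they can be shuffled and batched
batch_size = 64  # train and evaluate in batches of 64 examples; the 120,000 training examples give 1,875 batches (120000 / 64 = 1875)

# Turn x_train and y_train back into a TensorFlow Dataset object
train_ds_tf = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# shuffle(10000) keeps a buffer of 10,000 examples and draws from it at random, so the whole dataset never has to be shuffled at once; the result is then grouped into batches of 64
train_ds_tf = train_ds_tf.shuffle(10000).batch(batch_size)

# Turn x_test and y_test back into a TensorFlow Dataset object
test_ds_tf = tf.data.Dataset.from_tensor_slices((x_test, y_test))
test_ds_tf = test_ds_tf.batch(batch_size)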
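
The batching arithmetic and the idea behind a shuffle buffer can be sketched in plain Python. This is an illustration of the concept under simplifying assumptions, not the `tf.data` implementation; `num_batches` and `buffered_shuffle` are hypothetical helper names.

```python
import math
import random


def num_batches(num_examples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item
        buffer[idx] = item       # and replace it with the incoming one
    rng.shuffle(buffer)          # drain what is left in random order
    yield from buffer


train_batches = num_batches(120_000, 64)
test_batches = num_batches(7_600, 64)
shuffled = list(buffered_shuffle(range(20), buffer_size=5))

print("training batches:", train_batches)
print("test batches:", test_batches)
print("shuffled order:", shuffled)
```

With a buffer smaller than the dataset, an example can only move a limited distance forward from its original position, so the shuffle is approximate; a buffer at least as large as the dataset would give a uniform shuffle at the cost of more memory.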


In [None]:
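# Train the model and record the training history
epochs = 5  # the model sees the entire training set 5 times

# Train the model and record the training process
history = model.fit(
    train_ds_tf,                 # training data (already shuffled and batched)
    validation_data=test_ds_tf,  # validation data, evaluated after every epoch
    epochs=epochs                # total number of epochs
)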


Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 27s 11ms/step - accuracy: 0.7928 - loss: 0.5313 - val_accuracy: 0.9001 - val_loss: 0.2944
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 37s 11ms/step - accuracy: 0.9179 - loss: 0.2439 - val_accuracy: 0.9071 - val_loss: 0.2829
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 21s 11ms/step - accuracy: 0.9293 - loss: 0.2016 - val_accuracy: 0.9043 - val_loss: 0.2925
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 20s 11ms/step - accuracy: 0.9405 - loss: 0.1678 - val_accuracy: 0.8980 - val_loss: 0.3367
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 21s 11ms/step - accuracy: 0.9509 - loss: 0.1371 - val_accuracy: 0.8988 - val_loss: 0.3794

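In the training log, training accuracy keeps rising while validation loss is lowest at epoch 2 and increases afterwards, a pattern that usually points to overfitting. The sketch below picks the best epoch from the logged values; the two lists are copied from the log, and the variable names are chosen for the example.

```python
# Validation metrics copied from the five epochs in the training log
val_loss = [0.2944, 0.2829, 0.2925, 0.3367, 0.3794]
val_accuracy = [0.9001, 0.9071, 0.9043, 0.8980, 0.8988]

# Epochs are numbered from 1, list indices from 0
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1
best_val_loss = val_loss[best_epoch - 1]
best_val_accuracy = val_accuracy[best_epoch - 1]

print(f"Lowest validation loss at epoch {best_epoch}: {best_val_loss}")
print(f"Validation accuracy at that epoch: {best_val_accuracy}")
```

One way to act on this, if desired, is to pass a `tf.keras.callbacks.EarlyStopping` callback that monitors `val_loss` to `model.fit`, so training stops once validation loss stops improving. Since the test set doubles as the validation set in this notebook, a separate validation split would be needed for the final test score to stay unbiased.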

# Testing the Model on Real Articles
The Python package newspaper3k is used to fetch BBC news articles, which are then passed to the model for classification
## Articles used
### 1. Sports news
- URL: https://www.bbc.com/sport/basketball/articles/c8dyzyj9d88o

### 2. Technology news
- URL: https://www.bbc.com/news/articles/cd6xl3ql3v0o

### 3. Financial news
- URL: https://www.bbc.com/news/articles/cd74lyr094vo

### 4. Science news
- URL: https://www.bbc.com/future/article/20251023-how-hydrofoil-boats-could-cut-emissions-from-water-transport

### 5. World news (culture)
- URL: https://www.bbc.com/culture/article/20251223-the-salt-path-and-2025s-most-scandalous-books


In [None]:
!pip install newspaper3k
!pip install lxml[html_clean]


In [None]:
# Fetch the news articles
from newspaper import Article

urls = [
    'https://www.bbc.com/sport/basketball/articles/c8dyzyj9d88o',  # sports news
    'https://www.bbc.com/news/articles/cd6xl3ql3v0o',              # technology news
    'https://www.bbc.com/news/articles/cd74lyr094vo',              # financial news
    'https://www.bbc.com/future/article/20251023-how-hydrofoil-boats-could-cut-emissions-from-water-transport',  # science news
    'https://www.bbc.com/culture/article/20251223-the-salt-path-and-2025s-most-scandalous-books',  # world news (culture)
]

news = []    # article body text, one entry per URL
titles = []  # article titles, in the same order

for i, url in enumerate(urls, start=1):
    article = Article(url)
    article.download()
    article.parse()
    news.append(article.text)     # body text
    titles.append(article.title)
    print(f"text{i}:", article.text)

print(news)

text1: Nikola Jokic recorded a 56-point triple-double and broke a record set by Steph Curry as the Denver Nuggets beat the Minnesota Timberwolves 142-138 on Christmas Day.

The Serb hit 56 points, recorded 16 rebounds and 15 assists - becoming the first player in NBA history to hit at least 55 points, 15 rebounds and 15 assists in a triple-double.

Three-time MVP Jokic hit 18 of his 56 points in overtime, breaking Curry's record of 17 overtime points from 2016.

The Timberwolves took the game in Denver to overtime after clawing back a 15-point deficit in the final five minutes of the game.

Anthony Edwards top-scored for the Timberwolves with 44 points, including the game-tying three that took the game to overtime.

But the 24-year-old was ejected in the extra period for arguing over foul calls as the Nuggets claimed the win.

The Nuggets are third in the Western Conference, with the Timberwolves in fifth.
text2: One in three using AI for emotional support and conversation, UK says

18

In [None]:
# Tokenize and pad the articles the same way as the training data (padding at the front, truncation at the end)
seq = tokenizer.texts_to_sequences(news)
padded = pad_sequences(seq, maxlen=max_len, padding='pre', truncating='post')

# Run the model on the articles
pred = model.predict(padded)
labels = ['World', 'Sports', 'Business', 'Sci/Tech']

# print(pred)

# Print the classification results
for title, p in zip(titles, pred):
    pred_class = labels[np.argmax(p)]
    print(f"News: {title}")
    print(f"Predicted topic: {pred_class}\n")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
News: Nikola Jokic breaks Steph Curry record with historic triple-double in Denver Nuggets win
Predicted topic: Sports

News: One in three using AI for emotional support and conversation, UK says
Predicted topic: Sci/Tech

News: US pauses offshore wind projects over national security concerns
Predicted topic: Business

News: 'The sound completely changes': To electrify boats, make them fly
Predicted topic: Sci/Tech

News: The Salt Path and 2025's most scandalous books
Predicted topic: World
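
The results above report only the most likely class for each article. Because the model has just four outputs, every article lands in one of them, so looking at the full probability vector helps judge how confident a prediction is. The sketch below ranks the classes for one article; `rank_topics` is a hypothetical helper and the probabilities are made up for illustration, not taken from the run above.

```python
LABELS = ['World', 'Sports', 'Business', 'Sci/Tech']


def rank_topics(probabilities, labels=LABELS):
    """Return (label, probability) pairs sorted from most to least likely."""
    pairs = zip(labels, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical softmax output for a single article (values sum to 1)
example_probs = [0.55, 0.05, 0.10, 0.30]

ranking = rank_topics(example_probs)
top_label, top_prob = ranking[0]
runner_up_label, runner_up_prob = ranking[1]

print(f"Predicted topic: {top_label} ({top_prob:.0%})")
print(f"Runner-up: {runner_up_label} ({runner_up_prob:.0%})")
```

In the notebook, each row of `pred` could be passed to such a helper in place of `np.argmax(p)` to show the runner-up category alongside the prediction.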

