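# Text classification exercise
* Course: Text Mining
* Instructor: Prof. 吳政隆
* Task: classify PTT post titles by the board they were posted on
* Dataset: PTT posts from six boards: Gossiping (八卦), C_Chat (C洽), Baseball (棒球), Stock (股票), NBA, and HatePolitics (政黑)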

### References:
* https://sfhsu29.medium.com/nlp-%E5%85%A5%E9%96%80-1-text-classification-sentiment-analysis-%E6%A5%B5%E7%B0%A1%E6%98%93%E6%83%85%E6%84%9F%E5%88%86%E9%A1%9E%E5%99%A8-bag-of-words-naive-bayes-e40d61de9a7f
* https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92%E6%87%89%E7%94%A8-%E5%9E%83%E5%9C%BE%E8%A8%8A%E6%81%AF%E5%81%B5%E6%B8%AC-%E8%88%87-tf-idf%E4%BB%8B%E7%B4%B9-%E5%90%AB%E7%AF%84%E4%BE%8B%E7%A8%8B%E5%BC%8F-2cddc7f7b2c5

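# Importing and splitting the data
* Encode the board names as integer class labels (LabelEncoder)
* Segment each title into words with jieba so that it can be analyzed as text
* Split the dataset into training and test sets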

In [2]:
import pandas as pd
import jieba
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [8]:
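# Load the PTT posts; the first CSV column is used as the row index.
df = pd.read_csv("ptt文章.csv", index_col = 0)
# Drop replies: keep only rows whose title ('標題') does not contain 'Re:'.
df = df[~df['標題'].str.contains('Re:')]
df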

Unnamed: 0,標題,看板名稱
0,[問卦] 板上有復合成功幸福美滿的例子嗎？,Gosssiping
2,[新聞] 真愜意！裸上身按摩開會　亞航老闆自拍慘,Gosssiping
3,[問卦] 麥當勞 大蛋捲冰有賣嗎?,Gosssiping
4,[新聞] 走鐘獎挨轟難看、隨便！呱吉「千字文」,Gosssiping
5,[新聞] 15人獲文協獎章 盼促台本土文化,Gosssiping
...,...,...
370,[新聞] 快訊／民眾黨堅持全民調　金溥聰：若只,HatePolitics
372,[討論] 兼顧民主初選跟民調的好方法,HatePolitics
375,[討論] 想要阿北贏要選兩次好累,HatePolitics
379,[黑特] 藍白合到底是要演給誰看的,HatePolitics


In [9]:
df['LABEL'] = LabelEncoder().fit_transform(df['看板名稱'])
df['LABEL']

0      2
2      2
3      2
4      2
5      2
      ..
370    3
372    3
375    3
379    3
381    3
Name: LABEL, Length: 1885, dtype: int32
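
The integer codes above follow the sorted order of the distinct board names, which is how LabelEncoder numbers its classes. The sketch below reproduces that rule in plain Python on a few hypothetical board names (not the real 看板名稱 column), so the mapping can be inspected. In the notebook itself, keeping the fitted encoder in a variable would expose the same mapping through its `classes_` attribute and `inverse_transform` method.

```python
# Pure-Python sketch of LabelEncoder's numbering rule.
# `boards` is a hypothetical sample, not the actual 看板名稱 column.
boards = ["Stock", "NBA", "Gossiping", "NBA", "HatePolitics"]

# Classes are the distinct values in sorted order; the code is the position.
classes = sorted(set(boards))
to_code = {name: code for code, name in enumerate(classes)}

codes = [to_code[b] for b in boards]      # encode
decoded = [classes[c] for c in codes]     # decode back to names

print(to_code)
print(codes)
```

Because the notebook calls `LabelEncoder().fit_transform(...)` without storing the encoder, the code-to-board mapping has to be reconstructed this way when reading the predictions.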

In [10]:
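# Segment each title with jieba and re-join the tokens with spaces,
# the whitespace-delimited form that the scikit-learn vectorizers expect.
df['標題_SEG'] = df['標題'].apply(lambda sent: ' '.join(jieba.lcut(sent)))
df.head(10)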

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\CHARLIE\AppData\Local\Temp\jieba.cache
Loading model cost 0.599 seconds.
Prefix dict has been built successfully.


Unnamed: 0,標題,看板名稱,LABEL,標題_SEG
0,[問卦] 板上有復合成功幸福美滿的例子嗎？,Gosssiping,2,[ 問卦 ] 板上 有 復 合 成功 幸福 美滿 的 例子 嗎 ？
2,[新聞] 真愜意！裸上身按摩開會　亞航老闆自拍慘,Gosssiping,2,[ 新聞 ] 真 愜意 ！ 裸 上身 按摩 開會 亞航 老 闆 自 拍慘
3,[問卦] 麥當勞 大蛋捲冰有賣嗎?,Gosssiping,2,[ 問卦 ] 麥當勞 大蛋 捲 冰有 賣 嗎 ?
4,[新聞] 走鐘獎挨轟難看、隨便！呱吉「千字文」,Gosssiping,2,[ 新聞 ] 走 鐘獎 挨 轟難 看 、 隨便 ！ 呱吉 「 千字文 」
5,[新聞] 15人獲文協獎章 盼促台本土文化,Gosssiping,2,[ 新聞 ] 15 人 獲文協 獎章 盼 促台 本土 文化
6,[新聞] 稱口罩國家隊賺回扣遭3大公會譴責 李鴻源,Gosssiping,2,[ 新聞 ] 稱 口罩 國家隊 賺 回扣 遭 3 大公 會 譴責 李鴻源
7,[問卦] 爸爸過世了媽媽整天上班小公主該怎辦????,Gosssiping,2,[ 問卦 ] 爸爸 過世 了 媽媽 整天 上班 小 公主 該 怎辦 ? ? ? ?
8,[問卦] 圓山動物園的林旺怎麼來台灣的,Gosssiping,2,[ 問卦 ] 圓山動 物園 的 林旺 怎麼 來 台灣 的
10,[問卦] 以色列會變成下一個光明頂嗎？,Gosssiping,2,[ 問卦 ] 以色列 會變 成下 一個 光明 頂 嗎 ？
11,[問卦] 現股買進 能改帳成現股賣出嗎,Gosssiping,2,[ 問卦 ] 現股 買進 能 改帳 成現 股賣 出 嗎


In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    df['標題_SEG'], df['LABEL'], test_size=0.33, random_state=42)
print(y_train.shape, y_test.shape)

(1262,) (623,)
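
The sizes printed above follow from how `train_test_split` turns a fractional `test_size` into a row count: to my understanding, it rounds the test share up and gives the remaining rows to the training set. A quick arithmetic check, using the 1885 rows reported in the LABEL output:

```python
import math

n_samples = 1885   # rows left after dropping 'Re:' titles (from the LABEL output)
test_size = 0.33   # same fraction as in the notebook

# Test rows are rounded up; training gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Passing `stratify=df['LABEL']` to `train_test_split` would additionally keep the board proportions the same in both parts; the notebook does not do this.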


In [13]:
def show_result(predicted, predicted_proba, target):
    print('*'*50)
    print('predicted class of first 3 test data')
    print(predicted[:3])

    print('*'*50)
    print('predicted class proba. of first 3 test data')
    print(predicted_proba[:3])

    # accuracy = fraction of test titles whose predicted board matches the true board
    print('*'*50)
    print('accuracy performance on test data')
    print(np.mean(predicted == target))

# Feature vectorization
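* Convert the raw text into feature vectors, which also derives new features from the data. Either count vectors (CountVectorizer) or TF-IDF vectors (TfidfVectorizer) can serve as features.
* Bag of words: implemented here with CountVectorizer. Each title is represented by how often each word occurs in it, so it cannot express the "distance" between words.
* TF-IDF: TF is the term frequency (how often, or in what proportion, a word occurs in a document). IDF (inverse document frequency) gets smaller the more documents a word appears in, so a word that shows up almost everywhere is treated as less important.
* TF and IDF are combined into a score for every word in every document, and these scores form the feature vector.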
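
To make the scoring concrete, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings, as I understand them: raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The three toy documents are invented for illustration and are not taken from the dataset.

```python
import math
from collections import Counter

# Toy, already-segmented documents (hypothetical, not from the PTT data).
docs = [
    ["新聞", "球員", "交易"],
    ["新聞", "股票", "大漲"],
    ["新聞", "球員", "受傷"],
]

def tfidf(docs):
    n = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(token for doc in docs for token in set(doc))
    # Smoothed IDF: tokens found in every document get the minimum value 1.0.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})  # L2-normalized row
    return idf, rows

idf, rows = tfidf(docs)
for token in ("新聞", "球員", "交易"):
    print(token, round(idf[token], 4))
```

"新聞" occurs in all three documents and gets the lowest IDF, while "交易" occurs in only one and gets the highest, which is the behaviour described in the TF-IDF bullet.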

> ## CountVectorizer

In [15]:
# create feature vectors, keeping only the 1,000 most frequent tokens
count_vect = CountVectorizer(max_features=1000)
X_train_counts = count_vect.fit_transform(X_train)

#prints the train data shape
print('train data shape using CountVectorizer')
print(X_train_counts.shape)

#prints the test data shape
X_test_counts = count_vect.transform(X_test)
print('test data shape using CountVectorizer')
print(X_test_counts.shape)

train data shape using CountVectorizer
(1262, 1000)
test data shape using CountVectorizer
(623, 1000)
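
One caveat for Chinese text: CountVectorizer and TfidfVectorizer tokenize with the default `token_pattern` of `(?u)\b\w\w+\b`, which only keeps tokens of two or more word characters. Single-character words produced by jieba are therefore dropped, along with punctuation. The sketch below applies that regular expression with the standard `re` module to the first segmented title shown earlier; the alternative pattern is a suggestion, not something the notebook uses.

```python
import re

# First segmented title from the 標題_SEG column shown earlier.
segmented = "[ 問卦 ] 板上 有 復 合 成功 幸福 美滿 的 例子 嗎 ？"

default_pattern = r"(?u)\b\w\w+\b"  # scikit-learn's default token_pattern
keep_single = r"(?u)\b\w+\b"        # suggested alternative: keeps 1-character tokens

default_tokens = re.findall(default_pattern, segmented)
all_tokens = re.findall(keep_single, segmented)

print(default_tokens)
print(all_tokens)
```

Passing `token_pattern=r"(?u)\b\w+\b"` to either vectorizer would keep the single-character words; whether that improves accuracy on this dataset is untested here.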


> ## TfidfVectorizer

In [16]:
tfidf_vect = TfidfVectorizer(max_features=1000)
X_train_tfidf = tfidf_vect.fit_transform(X_train)

print('train data shape using TfidfVectorizer')
print(X_train_tfidf.shape)

X_test_tfidf = tfidf_vect.transform(X_test)
print('test data shape using TfidfVectorizer')
print(X_test_tfidf.shape)

train data shape using TfidfVectorizer
(1262, 1000)
test data shape using TfidfVectorizer
(623, 1000)


# Create classifier
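* With the feature vectors in place, train Naive Bayes, KNN, and SVM classifiers and compare them by prediction accuracy on the test set.
* SVM trained on TF-IDF features performs best, predicting the board of a test title correctly about 85% of the time (0.851). Naive Bayes (0.825 with counts, 0.831 with TF-IDF) and SVM with counts (0.830) come close; KNN trails at 0.713 with counts and 0.639 with TF-IDF.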
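
Accuracy is a single number, so it does not show which boards get mistaken for which. A confusion matrix does. The sketch below builds one in plain Python from hypothetical labels (three classes, six samples) rather than from the notebook's predictions; `sklearn.metrics.confusion_matrix` offers the same tabulation for real predictions.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, guess in zip(y_true, y_pred):
        matrix[truth][guess] += 1
    return matrix

# Hypothetical labels, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
# Accuracy is the diagonal (correct predictions) over all samples.
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)

for row in matrix:
    print(row)
print(accuracy)
```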

> ## Naive Bayes classifier with CountVectorizer

In [17]:
# Create classifier and use count vectors
MultinomialNB_clf = MultinomialNB()
print('*'*50)
print('MultinomialNB classifier with CountVectorizer')
print(MultinomialNB_clf)

# fit train data
MultinomialNB_clf.fit(X_train_counts, y_train)

# predict the class and class proba.
predicted = MultinomialNB_clf.predict(X_test_counts)
predicted_proba = MultinomialNB_clf.predict_proba(X_test_counts)

show_result(predicted, predicted_proba, y_test)

**************************************************
MultinomialNB classifier with CountVectorizer
MultinomialNB()
**************************************************
predicted class of first 3 test data
[5 5 5]
**************************************************
predicted class proba. of first 3 test data
[[7.97671432e-04 1.85564574e-03 5.25305409e-05 4.14088938e-05
  2.68219075e-03 9.94570553e-01]
 [3.09494138e-01 7.35416507e-03 3.12491535e-02 6.86943843e-03
  1.96084987e-02 6.25424607e-01]
 [4.72054153e-02 5.62236357e-02 4.82078607e-02 4.38093505e-02
  4.75992539e-02 7.56954484e-01]]
**************************************************
accuracy performance on test data
0.8250401284109149


> ## Naive Bayes classifier with TfidfVectorizer

In [18]:
# Create classifier and use tf-idf vectors
MultinomialNB_clf = MultinomialNB()
print('*'*50)
print('MultinomialNB classifier with TfidfVectorizer')
print(MultinomialNB_clf)

MultinomialNB_clf.fit(X_train_tfidf, y_train)

predicted = MultinomialNB_clf.predict(X_test_tfidf)
predicted_proba = MultinomialNB_clf.predict_proba(X_test_tfidf)

show_result(predicted, predicted_proba, y_test)

**************************************************
MultinomialNB classifier with TfidfVectorizer
MultinomialNB()
**************************************************
predicted class of first 3 test data
[5 5 5]
**************************************************
predicted class proba. of first 3 test data
[[0.06641456 0.09389482 0.03436574 0.03209355 0.08780072 0.68543062]
 [0.24768708 0.11371352 0.11235325 0.07159839 0.14278431 0.31186345]
 [0.07092355 0.07390103 0.06540822 0.06167969 0.07393306 0.65415444]]
**************************************************
accuracy performance on test data
0.8314606741573034


> ## KNN classifier with CountVectorizer

In [25]:
# Create classifier and use count vectors
KNeighborsClassifier_clf = KNeighborsClassifier(n_neighbors=3, weights='distance')
print('*'*50)
print('KNeighbors classifier with CountVectorizer')
print(KNeighborsClassifier_clf)

KNeighborsClassifier_clf.fit(X_train_counts, y_train)

predicted = KNeighborsClassifier_clf.predict(X_test_counts)
predicted_proba = KNeighborsClassifier_clf.predict_proba(X_test_counts)

show_result(predicted, predicted_proba, y_test)

**************************************************
KNeighbors classifier with CountVectorizer
KNeighborsClassifier(n_neighbors=3, weights='distance')
**************************************************
predicted class of first 3 test data
[5 0 5]
**************************************************
predicted class proba. of first 3 test data
[[0.         0.         0.         0.         0.         1.        ]
 [0.41421356 0.         0.         0.29289322 0.         0.29289322]
 [0.         0.         0.         0.         0.         1.        ]]
**************************************************
accuracy performance on test data
0.7126805778491172
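
The probabilities printed above are not simple vote fractions. With `weights='distance'`, each of the 3 neighbours votes with weight 1/distance, and `predict_proba` reports each class's share of the total weight. The sketch below reproduces the second printed row under an assumption: neighbour distances of 1, √2 and √2 belonging to classes 0, 3 and 5. Those distances are inferred from the output (any common scaling of them gives the same shares), not read from the fitted model.

```python
import math

n_classes = 6
# (distance, class) for the 3 nearest neighbours -- assumed values, see above.
neighbours = [(1.0, 0), (math.sqrt(2), 3), (math.sqrt(2), 5)]

# Each neighbour adds 1/distance to the weight of its class.
weights = [0.0] * n_classes
for distance, label in neighbours:
    weights[label] += 1.0 / distance

# Normalize so the class weights sum to 1.
total = sum(weights)
proba = [w / total for w in weights]

print([round(p, 8) for p in proba])
```

The closest neighbour's class (0) wins with about 0.414, which matches the predicted label 0 for that test title.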


> ## KNN classifier with TfidfVectorizer

In [32]:
# Create classifier and use tf-idf vectors
KNeighborsClassifier_clf = KNeighborsClassifier(n_neighbors=3, weights='distance')
print('*'*50)
print('KNeighbors classifier with TfidfVectorizer')
print(KNeighborsClassifier_clf)

KNeighborsClassifier_clf.fit(X_train_tfidf, y_train)

predicted = KNeighborsClassifier_clf.predict(X_test_tfidf)
predicted_proba = KNeighborsClassifier_clf.predict_proba(X_test_tfidf)

show_result(predicted, predicted_proba, y_test)

**************************************************
KNeighbors classifier with TfidfVectorizer
KNeighborsClassifier(n_neighbors=3, weights='distance')
**************************************************
predicted class of first 3 test data
[5 0 5]
**************************************************
predicted class proba. of first 3 test data
[[0.         0.2959052  0.         0.         0.         0.7040948 ]
 [0.34390499 0.         0.         0.         0.33095939 0.32513562]
 [0.         0.         0.         0.         0.         1.        ]]
**************************************************
accuracy performance on test data
0.6388443017656501


> ## SVM classifier with CountVectorizer

In [33]:
# Create classifier and use count vectors
SVC_clf = SVC(probability=True)
print('*'*50)
print('SVM classifier with CountVectorizer')
print(SVC_clf)

# fit train data
SVC_clf.fit(X_train_counts, y_train)

# predict the class and class proba.
predicted = SVC_clf.predict(X_test_counts)
predicted_proba = SVC_clf.predict_proba(X_test_counts)

show_result(predicted, predicted_proba, y_test)

**************************************************
SVM classifier with CountVectorizer
SVC(probability=True)
**************************************************
predicted class of first 3 test data
[5 5 5]
**************************************************
predicted class proba. of first 3 test data
[[0.02551504 0.02179138 0.00830757 0.00683324 0.03439352 0.90315923]
 [0.09546779 0.00743315 0.06511505 0.01403252 0.0074876  0.81046389]
 [0.00274804 0.02995261 0.00422909 0.0025458  0.00129117 0.95923329]]
**************************************************
accuracy performance on test data
0.8298555377207063


> ## SVM classifier with TfidfVectorizer

In [34]:
# Create classifier and use tf-idf vectors
SVC_clf = SVC(probability=True)
print('*'*50)
print('SVM classifier with TfidfVectorizer')
print(SVC_clf)

SVC_clf.fit(X_train_tfidf, y_train)

predicted = SVC_clf.predict(X_test_tfidf)
predicted_proba = SVC_clf.predict_proba(X_test_tfidf)

show_result(predicted, predicted_proba, y_test)

**************************************************
SVM classifier with TfidfVectorizer
SVC(probability=True)
**************************************************
predicted class of first 3 test data
[5 5 5]
**************************************************
predicted class proba. of first 3 test data
[[1.08837254e-02 3.15268795e-02 7.59456981e-03 4.63304754e-03
  1.96354517e-02 9.25726326e-01]
 [1.71602207e-01 1.67527760e-02 6.21936426e-02 1.65607201e-02
  1.63202163e-02 7.16570438e-01]
 [1.20135407e-04 9.65147686e-04 1.17765853e-03 3.76024504e-04
  4.57089325e-04 9.96903945e-01]]
**************************************************
accuracy performance on test data
0.8507223113964687
