# Natural Language Processing and Text Mining: Fundamental of Classification
## Homework, Week 07

### Name: 陳嬿伃
### Student ID: o902108008

## Task I: Hands-On Practice

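### Work through the exercises covered in the in-person class, the online sessions, and the course materials in order: write and run the code here, then discuss the results.
### Refer to "Fundamental of Classification.pdf" and complete the "LET'S CODE" practices.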

## Converting label and text

### Read Dataset, page 70.

In [1]:
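import xlrd  # note: xlrd 2.0+ reads only .xls; opening .xlsx requires xlrd < 2.0

def get_dataset(file_path):
    """Read the 'training' sheet and return (labels, texts)."""
    workbook = xlrd.open_workbook(file_path)
    booksheet = workbook.sheet_by_name('training')
    text = []
    label = []

    for i in range(booksheet.nrows):
        # column 0 holds the blog text, column 1 the author's gender
        gender = booksheet.cell(i, 1).value
        if gender == 'M':
            label.append("0")
        else:
            # any value other than exactly 'M' is labeled "1"
            label.append("1")
        text.append(booksheet.cell(i, 0).value)
    return label, text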
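
The function above labels every row whose gender cell is not exactly `'M'` as `"1"`, which would also catch a header row, an empty cell, or a value with stray whitespace. A minimal sketch of a stricter mapping follows. It assumes the sheet encodes gender as `'M'` and `'F'`; the `'F'` value, the helper name `encode_gender`, and the sample rows are all hypothetical and not taken from the course material.

```python
def encode_gender(raw_value):
    """Map a raw gender cell to "0" (male), "1" (female), or None (unknown)."""
    cleaned = str(raw_value).strip().upper()
    if cleaned == "M":
        return "0"
    if cleaned == "F":
        return "1"
    return None  # e.g. a header cell or an empty cell

# Hypothetical rows in (text, gender) order, mirroring columns 0 and 1 of the sheet.
rows = [("first post", "M"), ("second post", " f "), ("blog", "gender"), ("third post", "")]

labels, texts = [], []
for text, gender in rows:
    label = encode_gender(gender)
    if label is None:
        continue  # skip rows whose label cannot be interpreted
    labels.append(label)
    texts.append(text)

print(labels)
print(texts)
```

Whether skipping unreadable rows is the right policy depends on the data; the point is only that the fallback branch should be a deliberate choice.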

### The Implementation of Text Classifier, page 71

In [2]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold  # splits the data into k folds for cross-validation
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

labels, texts = get_dataset("./blog-gender-dataset.xlsx")  # stop words are not removed for now
tfidf_BOW = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b",
                           max_df=0.9, min_df=0.1)
tfidf_BOW.fit_transform(texts)  # converts the texts to vectors; the result is discarded because each fold refits below

predicted = []
expected = []
for train_index, test_index in KFold(n_splits = 10, shuffle = True).split(texts):
    x_train = np.array(texts)[train_index]
    y_train = np.array(labels)[train_index]

    x_test = np.array(texts)[test_index]
    y_test = np.array(labels)[test_index]

    vectors_training = tfidf_BOW.fit_transform(x_train)
    vectors_test = tfidf_BOW.transform(x_test)
    
    model = MultinomialNB(alpha = .01)
    model.fit(vectors_training, y_train)  # fit on the training fold
    
    expected.extend(y_test)
    predicted.extend(model.predict(vectors_test))
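
To make the splitting step concrete, here is a pure-Python sketch of the idea behind k-fold cross-validation: every sample lands in exactly one test fold and in the training set of all the others. The function name `kfold_indices` and its `seed` parameter are hypothetical; this is an illustration of the concept, not scikit-learn's implementation of `KFold`.

```python
import random

def kfold_indices(n_samples, n_splits, shuffle=True, seed=0):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # the first n_samples % n_splits folds get one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(n_samples=23, n_splits=10))
tested = sorted(i for _, test in folds for i in test)
print(len(folds), tested == list(range(23)))
```

Because the loop in the cell above refits the vectorizer on each training fold, vocabulary and IDF statistics from the test fold never leak into training.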

### Performance Evaluation, page 72

In [3]:
print("Macro-average: {0}".format(metrics.f1_score(expected, predicted, average='macro')))
print("Micro-average: {0}".format(metrics.f1_score(expected, predicted, average='micro')))
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Macro-average: 0.6325767123262274
Micro-average: 0.6395420792079208
              precision    recall  f1-score   support

           0       0.65      0.52      0.58      1547
           1       0.63      0.75      0.68      1685

    accuracy                           0.64      3232
   macro avg       0.64      0.63      0.63      3232
weighted avg       0.64      0.64      0.63      3232

[[ 811  736]
 [ 429 1256]]
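
The macro and micro averages printed above can be recomputed by hand from the confusion matrix, which makes the difference between them explicit: macro-F1 averages the per-class F1 scores, while micro-F1 pools all decisions (for single-label classification this equals accuracy). A short sketch using only the numbers in the output; the helper name `per_class_f1` is hypothetical.

```python
confusion = [[811, 736],
             [429, 1256]]  # rows: true class, columns: predicted class

def per_class_f1(matrix):
    """Return the F1 score of each class from a square confusion matrix."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                    # true class k, predicted otherwise
        fp = sum(row[k] for row in matrix) - tp     # predicted k, true class otherwise
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores

f1_scores = per_class_f1(confusion)
macro_f1 = sum(f1_scores) / len(f1_scores)
total = sum(sum(row) for row in confusion)
micro_f1 = sum(confusion[k][k] for k in range(len(confusion))) / total
print(macro_f1, micro_f1)
```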


## The 20 Newsgroups Dataset

### The 20 Newsgroups Dataset (2/3), page 75

In [1]:
# The 20 Newsgroups dataset (available through sklearn.datasets)
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset = 'train')
print(list(newsgroups_train.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [2]:
print(type(fetch_20newsgroups))
print(newsgroups_train.filenames.shape)
print(newsgroups_train.target.shape)
print(newsgroups_train.target[:10])
print(newsgroups_train.data[0])

<class 'function'>
(11314,)
(11314,)
[ 7  4  4  1 14 16 13  3  2  4]
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







### The 20 Newsgroups Dataset (3/3), page 76

In [6]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset = 'train')
# for convenience, select two categories as an example
cats = ['alt.atheism', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset = 'train', categories=cats)
print(list(newsgroups_train.target_names))
print(newsgroups_train.filenames.shape)
print(newsgroups_train.target.shape)
print(newsgroups_train.target[:10])

['alt.atheism', 'sci.space']
(1073,)
(1073,)
[0 1 1 1 0 1 1 0 0 0]
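
The integers in `target` are indices into `target_names`. A small sketch, using only the values printed above (copied in as literals rather than fetched), of how the first ten labels decode:

```python
from collections import Counter

target_names = ['alt.atheism', 'sci.space']
first_targets = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0]  # copied from the printed output

# each integer label is a position in target_names
first_names = [target_names[i] for i in first_targets]
name_counts = Counter(first_names)
print(first_names[:3])
print(name_counts)
```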


### Topic classification using Multinomial NB with TF-IDF representation (1/3), page 78

In [3]:
# Topic classification using Multinomial NB with TF-IDF representation
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
trainingData = fetch_20newsgroups(subset = 'train', categories = categories)
vectorizer = TfidfVectorizer()
vectors_training = vectorizer.fit_transform(trainingData.data)

#training process
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics  # once the text is vectorized, load the test data
testData = fetch_20newsgroups(subset = 'test', categories = categories)
vectors_test =  vectorizer.transform(testData.data)
model = MultinomialNB(alpha = .01)
model.fit(vectors_training, trainingData.target)

# testing process
predicted = model.predict(vectors_test)
print("Macro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='macro')))
print("Micro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='micro')))
print(metrics.classification_report(testData.target, predicted))
print(metrics.confusion_matrix(testData.target, predicted))

Macro-average: 0.8821359240272957
Micro-average: 0.893569844789357
              precision    recall  f1-score   support

           0       0.84      0.86      0.85       319
           1       0.95      0.95      0.95       389
           2       0.91      0.96      0.93       394
           3       0.85      0.76      0.80       251

    accuracy                           0.89      1353
   macro avg       0.89      0.88      0.88      1353
weighted avg       0.89      0.89      0.89      1353

[[274   2   9  34]
 [  5 368  16   0]
 [  2  15 377   0]
 [ 47   3  11 190]]
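
For intuition about what the TF-IDF representation contains, here is a pure-Python sketch. It assumes simple whitespace tokenization (the real `TfidfVectorizer` tokenizer differs) and uses the smoothed weighting `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the form scikit-learn's documentation describes for its default settings. The function name `tfidf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed idf: rarer terms get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = ["space shuttle launch", "space graphics card", "graphics card driver"]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "shuttle" appears in one document and "space" in two, so "shuttle" receives the larger weight even though both occur once.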


In [5]:
type(fetch_20newsgroups)

function

### Topic classification using Multinomial NB (2/3 ~ 3/3), page 79, 80

In [8]:
# Topic classification using Multinomial NB
# use all of the data, without filtering by category
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
trainingData = fetch_20newsgroups(subset = 'train')
vectorizer = TfidfVectorizer()
vectors_training = vectorizer.fit_transform(trainingData.data)

#training process
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
testData = fetch_20newsgroups(subset = 'test')
vectors_test =  vectorizer.transform(testData.data)
model = MultinomialNB(alpha = .01)
model.fit(vectors_training, trainingData.target)

# testing process
predicted = model.predict(vectors_test)
print("Macro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='macro')))
print("Micro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='micro')))
print(metrics.classification_report(testData.target, predicted))


Macro-average: 0.8290659644474043
Micro-average: 0.8352363250132767
              precision    recall  f1-score   support

           0       0.82      0.78      0.80       319
           1       0.69      0.75      0.72       389
           2       0.74      0.63      0.68       394
           3       0.65      0.75      0.69       392
           4       0.83      0.84      0.83       385
           5       0.84      0.78      0.81       395
           6       0.82      0.78      0.80       390
           7       0.89      0.90      0.90       396
           8       0.93      0.96      0.95       398
           9       0.95      0.94      0.95       397
          10       0.95      0.97      0.96       399
          11       0.89      0.93      0.91       396
          12       0.79      0.77      0.78       393
          13       0.89      0.84      0.86       396
          14       0.87      0.91      0.89       394
          15       0.82      0.95      0.88       398
          16 

In [9]:
print(metrics.confusion_matrix(testData.target, predicted))  # confusion matrix over all 20 categories

[[249   0   0   4   0   1   0   0   1   1   0   1   0   5   5  28   3   3
    1  17]
 [  0 290  15  14  10  23   6   0   0   3   0   4  12   0   7   2   0   2
    0   1]
 [  1  32 248  52   4  20   5   0   2   1   1   6   3   3   5   4   0   0
    4   3]
 [  0  11  26 293  22   1  11   1   0   1   0   1  21   0   4   0   0   0
    0   0]
 [  0   7  10  14 322   1   8   4   1   2   1   2   9   2   1   0   1   0
    0   0]
 [  0  40  14  11   6 307   3   1   2   0   0   3   2   1   4   0   1   0
    0   0]
 [  0   4   6  26   8   0 306  11   9   1   5   0   9   4   1   0   0   0
    0   0]
 [  0   1   1   5   1   0  10 358   6   1   0   0   6   3   1   0   2   0
    1   0]
 [  0   1   0   1   1   0   2   7 383   0   0   0   3   0   0   0   0   0
    0   0]
 [  0   0   0   0   1   0   3   4   0 373  11   1   0   0   2   0   0   2
    0   0]
 [  0   0   0   0   0   1   1   0   0   4 387   2   0   1   0   2   1   0
    0   0]
 [  1   3   1   2   2   1   3   3   0   0   0 370   1   3   2   0
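
The `alpha` argument of `MultinomialNB` is the additive smoothing constant. The sketch below is a hand-rolled multinomial Naive Bayes on raw term counts, meant only to show where `alpha` enters the likelihood estimate; the notebook itself feeds TF-IDF weights to scikit-learn's implementation. The function names and the four toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=0.01):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = sorted({t for d in docs for t in d.split()})
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())
    model = {}
    for label, n in class_docs.items():
        total = sum(term_counts[label].values())
        log_prior = math.log(n / len(docs))
        # alpha keeps unseen terms from getting zero probability
        log_like = {
            t: math.log((term_counts[label][t] + alpha) / (total + alpha * len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        # terms outside the training vocabulary are ignored
        scores[label] = log_prior + sum(log_like[t] for t in doc.split() if t in log_like)
    return max(scores, key=scores.get)

docs = ["orbit rocket launch", "rocket engine orbit",
        "pixel shader render", "render pixel image"]
labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics"]
model = train_multinomial_nb(docs, labels)
print(predict(model, "rocket launch"))
```

With a small `alpha` such as 0.01, a term never seen in a class is penalized heavily but not infinitely, so one unfamiliar word cannot veto a class outright.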

### Topic classification using NB with BOW representation (1/2 ~ 2/2), page 81, 82

In [10]:
# Topic classification using NB with BOW representation
# use all of the data, without filtering by category
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer  # changed here, to compare with the TF-IDF run above
trainingData = fetch_20newsgroups(subset = 'train')
vectorizer = CountVectorizer()  # changed here
vectors_training = vectorizer.fit_transform(trainingData.data)

#training process
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
testData = fetch_20newsgroups(subset = 'test')
vectors_test =  vectorizer.transform(testData.data)
model = MultinomialNB(alpha = .01)
model.fit(vectors_training, trainingData.target)

# testing process
predicted = model.predict(vectors_test)
print("Macro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='macro')))
print("Micro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='micro')))
print(metrics.classification_report(testData.target, predicted))


# performs worse than TF-IDF (macro-F1 0.785 vs. 0.829)

Macro-average: 0.7852092132952866
Micro-average: 0.8039033457249071
              precision    recall  f1-score   support

           0       0.80      0.83      0.82       319
           1       0.57      0.78      0.66       389
           2       0.75      0.04      0.07       394
           3       0.55      0.78      0.64       392
           4       0.74      0.83      0.78       385
           5       0.80      0.73      0.76       395
           6       0.79      0.85      0.82       390
           7       0.86      0.90      0.88       396
           8       0.91      0.96      0.94       398
           9       0.95      0.93      0.94       397
          10       0.96      0.96      0.96       399
          11       0.88      0.93      0.91       396
          12       0.77      0.76      0.76       393
          13       0.88      0.83      0.86       396
          14       0.87      0.89      0.88       394
          15       0.89      0.92      0.91       398
          16 

In [11]:
print(metrics.confusion_matrix(testData.target, predicted))  # confusion matrix over all 20 categories

[[265   1   0   4   0   0   0   0   2   2   0   2   0   2   2  14   2   3
    0  20]
 [  1 304   1  16  13  17   8   0   0   1   0   8  13   0   5   2   0   0
    0   0]
 [  1  87  15 146  39  44   9   3   4   1   1   6   9   8   6   2   2   0
    7   4]
 [  0   8   1 307  30   2  11   1   0   1   0   2  26   0   3   0   0   0
    0   0]
 [  0  11   1  16 320   1  15   2   2   2   0   1  11   2   1   0   0   0
    0   0]
 [  0  66   2  14   7 288   5   2   3   0   0   2   2   1   3   0   0   0
    0   0]
 [  0   6   0  20   9   0 330  10   3   1   3   0   4   1   2   0   1   0
    0   0]
 [  0   1   0   2   1   0  12 355  11   1   0   0   7   2   1   0   2   0
    1   0]
 [  0   0   0   1   1   0   3   9 382   0   0   0   2   0   0   0   0   0
    0   0]
 [  1   1   0   0   1   0   4   3   0 369   9   0   0   3   4   0   0   0
    2   0]
 [  0   0   0   0   0   1   1   1   0   5 385   1   0   1   0   1   0   0
    2   1]
 [  1   4   0   3   2   2   3   3   0   1   0 367   1   2   2   0
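
A bag-of-words vector is simply the per-term count for each document, with no weighting by how common a term is across the collection. A pure-Python sketch of that representation, assuming whitespace tokenization (which differs from `CountVectorizer`'s tokenizer); the helper names and toy documents are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Assign each distinct term a column index, in sorted order."""
    terms = sorted({t for d in docs for t in d.lower().split()})
    return {term: i for i, term in enumerate(terms)}

def count_vector(doc, vocabulary):
    """Return the raw term counts of doc as a dense list."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(doc.lower().split()).items():
        if term in vocabulary:  # terms unseen at fit time are dropped
            vector[vocabulary[term]] = count
    return vector

docs = ["the windows driver crashed", "the the the windows"]
vocabulary = fit_vocabulary(docs)
bow = [count_vector(d, vocabulary) for d in docs]
print(vocabulary)
print(bow)
```

In the second toy document the count for "the" is three times that of "windows"; raw counts carry no notion that "the" is uninformative, which is the kind of effect TF-IDF weighting is designed to dampen.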

### Simplified procedures using Pipeline, page 83

In [12]:
# Simplified procedures using Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline  # changed here
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

model = Pipeline([('tfidf', TfidfVectorizer()),('clf', MultinomialNB(alpha=.01))])  # changed here
trainingData = fetch_20newsgroups(subset = 'train')
testData = fetch_20newsgroups(subset = 'test')
model.fit(trainingData.data, trainingData.target)
predicted = model.predict(testData.data)
print("Macro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='macro')))
print("Micro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='micro')))
print(metrics.classification_report(testData.target, predicted))


Macro-average: 0.8290659644474043
Micro-average: 0.8352363250132767
              precision    recall  f1-score   support

           0       0.82      0.78      0.80       319
           1       0.69      0.75      0.72       389
           2       0.74      0.63      0.68       394
           3       0.65      0.75      0.69       392
           4       0.83      0.84      0.83       385
           5       0.84      0.78      0.81       395
           6       0.82      0.78      0.80       390
           7       0.89      0.90      0.90       396
           8       0.93      0.96      0.95       398
           9       0.95      0.94      0.95       397
          10       0.95      0.97      0.96       399
          11       0.89      0.93      0.91       396
          12       0.79      0.77      0.78       393
          13       0.89      0.84      0.86       396
          14       0.87      0.91      0.89       394
          15       0.82      0.95      0.88       398
          16 

In [13]:
print(metrics.confusion_matrix(testData.target, predicted))  # confusion matrix over all 20 categories

[[249   0   0   4   0   1   0   0   1   1   0   1   0   5   5  28   3   3
    1  17]
 [  0 290  15  14  10  23   6   0   0   3   0   4  12   0   7   2   0   2
    0   1]
 [  1  32 248  52   4  20   5   0   2   1   1   6   3   3   5   4   0   0
    4   3]
 [  0  11  26 293  22   1  11   1   0   1   0   1  21   0   4   0   0   0
    0   0]
 [  0   7  10  14 322   1   8   4   1   2   1   2   9   2   1   0   1   0
    0   0]
 [  0  40  14  11   6 307   3   1   2   0   0   3   2   1   4   0   1   0
    0   0]
 [  0   4   6  26   8   0 306  11   9   1   5   0   9   4   1   0   0   0
    0   0]
 [  0   1   1   5   1   0  10 358   6   1   0   0   6   3   1   0   2   0
    1   0]
 [  0   1   0   1   1   0   2   7 383   0   0   0   3   0   0   0   0   0
    0   0]
 [  0   0   0   0   1   0   3   4   0 373  11   1   0   0   2   0   0   2
    0   0]
 [  0   0   0   0   0   1   1   0   0   4 387   2   0   1   0   2   1   0
    0   0]
 [  1   3   1   2   2   1   3   3   0   0   0 370   1   3   2   0
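
The scores match the earlier TF-IDF run because the pipeline performs the same two steps, only chained. The sketch below shows the chaining idea with toy components: `fit` calls fit-then-transform on every step but the last, and `predict` only transforms before delegating to the final estimator. `MiniPipeline`, `Lowercase`, and `KeywordVote` are hypothetical classes written for this illustration, not scikit-learn's implementation.

```python
class Lowercase:
    """Toy transformer: stateless, but follows the fit/transform convention."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [doc.lower() for doc in X]

class KeywordVote:
    """Toy classifier: predicts the class sharing the most tokens with the document."""
    def fit(self, X, y):
        self.tokens_by_class = {}
        for doc, label in zip(X, y):
            self.tokens_by_class.setdefault(label, set()).update(doc.split())
        return self

    def predict(self, X):
        return [max(self.tokens_by_class,
                    key=lambda label: len(self.tokens_by_class[label] & set(doc.split())))
                for doc in X]

class MiniPipeline:
    """Chain (name, step) pairs; the last step is the estimator."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)   # fit on training data, then transform it
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)             # reuse what was learned in fit; no refitting
        return self.steps[-1][1].predict(X)

pipe = MiniPipeline([("lower", Lowercase()), ("clf", KeywordVote())])
pipe.fit(["Rocket ORBIT launch", "Pixel shader RENDER"], ["space", "graphics"])
predictions = pipe.predict(["orbit Launch window", "render a PIXEL"])
print(predictions)
```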

## The 20 Newsgroups Dataset: SVM

### 20Newsgroup Dataset using SVM, page 108, 109

In [6]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
trainingData = fetch_20newsgroups(subset = 'train', categories = categories)
vectorizer = TfidfVectorizer(stop_words = 'english')
vectors_training = vectorizer.fit_transform(trainingData.data)

#training process
from sklearn import svm  # changed line
from sklearn import metrics
testData = fetch_20newsgroups(subset = 'test', categories = categories)
vectors_test =  vectorizer.transform(testData.data)
model = svm.SVC(kernel = 'linear')  # changed line
model.fit(vectors_training, trainingData.target)

# testing process
predicted = model.predict(vectors_test)
print("Macro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='macro')))
print("Micro-average: {0}".format(metrics.f1_score(testData.target, predicted, average='micro')))
print(metrics.classification_report(testData.target, predicted))


Macro-average: 0.8849384370085389
Micro-average: 0.8965262379896525
              precision    recall  f1-score   support

           0       0.88      0.82      0.85       319
           1       0.90      0.98      0.94       389
           2       0.96      0.94      0.95       394
           3       0.81      0.79      0.80       251

    accuracy                           0.90      1353
   macro avg       0.89      0.88      0.88      1353
weighted avg       0.90      0.90      0.90      1353



In [7]:
print(metrics.confusion_matrix(testData.target, predicted))

[[262  10   5  42]
 [  0 380   5   4]
 [  1  20 372   1]
 [ 35  11   6 199]]
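
A linear SVM chooses, among the hyperplanes that separate the classes, the one whose nearest training point is as far away as possible. The sketch below only measures that distance (the geometric margin) for one given hyperplane; it does not train anything. The weights `w`, bias `b`, and the four labeled points are hypothetical.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

w, b = (1.0, 1.0), -3.0  # hypothetical separating line x1 + x2 = 3
points = [((1.0, 1.0), -1), ((0.0, 1.0), -1), ((3.0, 2.0), 1), ((2.0, 3.0), 1)]

# label * distance is positive only for points on the correct side
margins = [y * signed_distance(w, b, x) for x, y in points]
geometric_margin = min(margins)
print(geometric_margin)
```

Training an SVM amounts to searching over `w` and `b` for the values that make this minimum as large as possible, with a penalty for violations when the data are not perfectly separable.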


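## Summary:
1. `KFold` splits the data into k roughly equal folds for cross-validation.
2. scikit-learn supports many classifiers; a model is declared in one line, and switching to a different classifier takes only a few changed lines, which is very convenient.
3. For topic classification on the full 20 Newsgroups data, Multinomial NB with TF-IDF features outperformed the same model with plain bag-of-words counts (macro-F1 0.829 vs. 0.785).
4. A `Pipeline` works a bit like a factory line: the data passes through each step in order.
5. On the four-category subset, the linear SVM scored slightly higher than Multinomial NB (macro-F1 0.885 vs. 0.882); the SVM run also removed English stop words, so the two runs are not strictly like-for-like.
6. An SVM looks for the separating hyperplane with the largest margin; a large margin means the classes are well separated.
7. With kernel functions, an SVM can map the data into a higher-dimensional space to find a better solution.
8. Preparation before classification matters: when the data was filtered to four categories before modeling, both classifiers reached about 0.89 to 0.90 micro-F1, compared with 0.84 on all 20 categories.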
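
The kernel idea in the list above can be checked numerically. The notebook's `SVC` uses `kernel = 'linear'`, so the sketch below illustrates the point with a different, hypothetical choice: a degree-2 polynomial kernel on 2-D inputs. It shows that evaluating the kernel in the original space gives the same number as a dot product after an explicit mapping `phi` into three dimensions, so the higher-dimensional space never has to be built.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (x . z)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, z)
explicit_value = sum(a * b for a, b in zip(phi(x), phi(z)))
print(kernel_value, explicit_value)
```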