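### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify reviews by rating (text classification).

The data rates each review from 1 to 5. In this assignment we regroup the ratings into three classes: negative, neutral, and positive (1,2 --> 1 negative; 3 --> 2 neutral; 4,5 --> 3 positive).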

### Load packages

In [1]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Data preprocessing
The full dataset is large, so we use only the first 10,000 records for this exercise.

In [2]:
# Load the JSON Lines data (one review per line), keeping the first 10,000 records

all_reviews = []
with open("All_Beauty.json", 'r', encoding='utf-8') as file:
    for line in file:  # iterate lazily instead of reading the whole file
        if len(all_reviews) >= 10000:
            break
        all_reviews.append(json.loads(line))

all_reviews[0]

{'overall': 1.0,
 'verified': True,
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'asin': '0143026860',
 'reviewerName': 'theodore j bigham',
 'reviewText': 'great',
 'summary': 'One Star',
 'unixReviewTime': 1424304000}

In [3]:
# Parse labels (overall) and corpus (reviewText)
corpus = []
labels = []

# Map ratings to labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
rating_to_label = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
for review in all_reviews:
    if 'reviewText' in review and 'overall' in review:
        label = rating_to_label.get(review['overall'])
        if label is not None:  # keep corpus and labels aligned
            corpus.append(review['reviewText'])
            labels.append(label)
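
Before training, it helps to check how many reviews fall into each class, since star ratings are rarely evenly spread. A minimal sketch using `collections.Counter`; the `ratings` list is made-up illustrative data, and `to_label` is a hypothetical helper that mirrors the mapping above.

```python
from collections import Counter

def to_label(overall):
    """Map a 1-5 star rating to 1 (negative), 2 (neutral), or 3 (positive)."""
    if overall in (1, 2):
        return 1
    if overall == 3:
        return 2
    if overall in (4, 5):
        return 3
    return None  # rating outside 1-5

# Made-up ratings for illustration only (not taken from All_Beauty.json)
ratings = [5.0, 4.0, 1.0, 3.0, 5.0, 2.0, 5.0, 4.0]
toy_labels = [to_label(r) for r in ratings]

label_counts = Counter(toy_labels)
for label in sorted(label_counts):
    share = label_counts[label] / len(toy_labels)
    print(f"class {label}: {label_counts[label]} reviews ({share:.0%})")
```

Running the same count on the real `labels` list would show how imbalanced the three classes are before any model is trained.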



In [4]:
# Preprocess the data:
# remove email addresses, punctuation, and newline characters (\n)
def Punctuation(string):

    # punctuation marks (raw string, so the backslash is kept literally)
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

    # traverse the given string and delete every punctuation mark found
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")

    # return the string without punctuation
    return string


for idx in range(len(corpus)):
    # remove anything that looks like an email address
    # (note: `.*` is greedy, so text after the address on the same line is removed too)
    match = re.findall(r"\w\S*@.*\b", corpus[idx])
    for mail in match:
        corpus[idx] = corpus[idx].replace(mail, '')
    corpus[idx] = Punctuation(corpus[idx])
    corpus[idx] = corpus[idx].replace('\n', '')
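
One caveat about the email pattern in this cell: `.*` is greedy and stops only at the last word boundary on the line, so a match can swallow the words that follow the address. The sketch below compares it with a narrower pattern, `\S+@\S+`, which is an assumption of this example and not part of the original notebook; the sample sentence is made up.

```python
import re

sample = "contact me at bob@example.com if the cream arrives broken"

# Pattern used in the notebook cell: greedy `.*` runs to the last word boundary
greedy = re.findall(r"\w\S*@.*\b", sample)

# Narrower alternative (assumption): stop at the first whitespace after '@'
narrow = re.findall(r"\S+@\S+", sample)

print(greedy)
print(narrow)

# Removing only the address keeps the rest of the review text
cleaned = re.sub(r"\S+@\S+", "", sample)
print(cleaned)
```

Switching patterns changes the cleaned corpus, so the scores reported below would need to be recomputed after such a change.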
    

In [5]:
#split corpus and label into train and test
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=2, shuffle=True)

len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)
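
Because the classes in this dataset are far from balanced, a purely random split can leave the rare classes unevenly represented. One option, not used in the original notebook, is to pass `stratify` to `train_test_split` so that both splits keep roughly the same class proportions. The toy labels below are made up for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy data: 10 negative, 10 neutral, 80 positive
toy_corpus = [f"review {i}" for i in range(100)]
toy_labels = [1] * 10 + [2] * 10 + [3] * 80

# stratify=toy_labels keeps the class ratio the same in both splits
xs_train, xs_test, ys_train, ys_test = train_test_split(
    toy_corpus, toy_labels, test_size=0.2, random_state=2,
    shuffle=True, stratify=toy_labels)

print(Counter(ys_train))
print(Counter(ys_test))
```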

In [6]:
# Convert the corpus into vectors
# (you can use TF-IDF or BoW here; TfidfVectorizer was already imported in In [1])
vectorizer = TfidfVectorizer()
# Fit on x_train only, to build the vocabulary and the document-frequency (IDF) statistics
vectorizer.fit(x_train)


# transform the training and testing corpus into vector form
x_train = vectorizer.transform(x_train)
x_test = vectorizer.transform(x_test)
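
The cell comment mentions bag-of-words (BoW) as an alternative to TF-IDF. A minimal sketch with `CountVectorizer` on a made-up three-review corpus shows the same pattern of fitting on the training text and transforming both splits; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus for illustration
toy_train = ["great lotion great smell", "bad smell", "ok lotion"]
toy_test = ["great price"]  # 'price' never appears in the training corpus

bow = CountVectorizer()
bow.fit(toy_train)  # vocabulary is learned from the training corpus only

toy_train_vec = bow.transform(toy_train)
toy_test_vec = bow.transform(toy_test)

print(sorted(bow.vocabulary_))   # learned vocabulary
print(toy_train_vec.toarray())   # raw term counts per document
print(toy_test_vec.toarray())    # words unseen during fit are ignored
```

Fitting on the training split only means the test vectors never influence the vocabulary, which keeps the evaluation honest.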

### Training and prediction

In [7]:
#build classification model (decision tree, random forest, or adaboost)
#start training

# Decision tree
decision_tree_cls = DecisionTreeClassifier(criterion='entropy', max_depth=6,
                                           min_samples_split=500, min_samples_leaf=100)

decision_tree_cls.fit(x_train, y_train)

# RandomForestClassifier

rand_decision_tree_cls = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=6,
                                           min_samples_split=500, min_samples_leaf=100)

rand_decision_tree_cls.fit(x_train, y_train)

# AdaBoost
# (on scikit-learn >= 1.2 the `base_estimator` argument is named `estimator`)
adaboost_cls = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy', max_depth=6,
                                           min_samples_split=500, min_samples_leaf=100), n_estimators=100, learning_rate=0.8)

adaboost_cls.fit(x_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                         max_depth=6,
                                                         min_samples_leaf=100,
                                                         min_samples_split=500),
                   learning_rate=0.8, n_estimators=100)

In [8]:
#start inference
y_pred_decision = decision_tree_cls.predict(x_test)
y_pred_rand_decision = rand_decision_tree_cls.predict(x_test)
y_pred_adaboost = adaboost_cls.predict(x_test)


In [9]:
# calculate accuracy
def cal_acc(y_true, y_pred):
    # fraction of predictions that match the true labels
    # (y_pred is a NumPy array, so == compares element-wise)
    return sum(y_true == y_pred) / len(y_true)

print(f"Decision tree accuracy: {cal_acc(y_test, y_pred_decision)}")
print(f"Random forest accuracy: {cal_acc(y_test, y_pred_rand_decision)}")
print(f"AdaBoost accuracy: {cal_acc(y_test, y_pred_adaboost)}")


Decision tree accuracy: 0.8759379689844923
Random forest accuracy: 0.8759379689844923
AdaBoost accuracy: 0.8834417208604303
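
Accuracy on an imbalanced test set is easier to interpret next to a majority-class baseline, that is, the score of a classifier that always predicts the most frequent class. A small sketch using the test-set class counts from the classification report in In [11] (146, 102, and 1751); the helper name `majority_baseline` is hypothetical.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Test-set support per class, as reported by classification_report
support = {1: 146, 2: 102, 3: 1751}

baseline = majority_baseline(support)
print(f"Majority-class baseline: {baseline:.4f}")
```

The baseline is 1751/1999, which equals the decision tree and random forest accuracies above; only AdaBoost scores higher than always predicting the positive class.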


In [11]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred_adaboost))
print(confusion_matrix(y_test, y_pred_adaboost))

              precision    recall  f1-score   support

           1       0.61      0.38      0.47       146
           2       0.28      0.10      0.14       102
           3       0.91      0.97      0.94      1751

    accuracy                           0.88      1999
   macro avg       0.60      0.48      0.52      1999
weighted avg       0.85      0.88      0.86      1999

[[  55    3   88]
 [   8   10   84]
 [  27   23 1701]]
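
To see where the errors go, each row of the confusion matrix can be divided by its row total; the diagonal then gives the per-class recall. A small sketch using the matrix printed above (rows are true classes, columns are predicted classes, in label order 1, 2, 3):

```python
# Confusion matrix for AdaBoost, copied from the output above
cm = [[55, 3, 88],
      [8, 10, 84],
      [27, 23, 1701]]
class_names = ["negative", "neutral", "positive"]

row_shares = []
for name, row in zip(class_names, cm):
    total = sum(row)
    shares = [count / total for count in row]
    row_shares.append(shares)
    formatted = ", ".join(f"{s:.2f}" for s in shares)
    print(f"true {name:<8} -> predicted as [neg, neu, pos]: {formatted}")

# Diagonal entries of the row-normalized matrix are the per-class recalls
recalls = [row_shares[i][i] for i in range(3)]
```

The diagonal values round to the recall column of the classification report (0.38, 0.10, 0.97).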


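The report shows that the model does well on positive reviews (both precision and recall are high) but poorly on negative and neutral ones. According to the confusion matrix, most negative reviews (88 of 146) and most neutral reviews (84 of 102) are misclassified as positive, which is consistent with the strong class imbalance in the data.
Try applying the methods you have learned to improve the model's performance.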
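
One possible direction, offered as a sketch and not as a recommended recipe, is to reweight the classes so that errors on rare classes cost more. scikit-learn's tree classifiers accept `class_weight='balanced'`; the data below is synthetic (generated with `make_classification`) and only stands in for the review vectors.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced 3-class data standing in for the TF-IDF features
X, y = make_classification(n_samples=2000, n_features=20, n_informative=6,
                           n_classes=3, weights=[0.07, 0.05, 0.88],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Train the same tree with and without class reweighting
for weight in (None, "balanced"):
    tree = DecisionTreeClassifier(max_depth=6, class_weight=weight,
                                  random_state=0)
    tree.fit(X_tr, y_tr)
    per_class_recall = recall_score(y_te, tree.predict(X_te), average=None)
    print(f"class_weight={weight}: recall per class = {per_class_recall.round(2)}")
```

Whether this helps on the review data is an empirical question; compare the per-class recall before and after, not only the overall accuracy.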