### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify review ratings (a text classification task).

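The dataset rates each review from 1 to 5. In this assignment we regroup the ratings into three classes: negative, neutral, and positive (1, 2 --> 1 negative; 3 --> 2 neutral; 4, 5 --> 3 positive).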
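As a minimal sketch of the regrouping rule, the mapping can be written as a small pure function. The name `to_sentiment_class` is hypothetical and not part of the notebook; the notebook itself applies the same rule with an in-place loop.

```python
def to_sentiment_class(rating):
    """Map a 1-5 star rating to 1 (negative), 2 (neutral), or 3 (positive).

    Hypothetical helper for illustration only.
    """
    if rating <= 2:
        return 1
    if rating == 3:
        return 2
    return 3

print([to_sentiment_class(r) for r in (1.0, 2.0, 3.0, 4.0, 5.0)])
# prints [1, 1, 2, 3, 3]
```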

### Load packages

In [38]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
import numpy as np

### Data preprocessing
The text data is fairly large, so we take only the first 10,000 records for this exercise.
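If only the first records are needed, one option is to stop reading early instead of parsing the whole file and slicing afterwards. The sketch below assumes one JSON object per line; `load_first_reviews` is a hypothetical helper, and the sample lines are made up for the demonstration.

```python
import json
from itertools import islice

def load_first_reviews(lines, limit=10000):
    """Parse at most `limit` JSON lines into dicts without reading the rest."""
    return [json.loads(line) for line in islice(lines, limit)]

# With the file used in this notebook, the call would look like:
# with open('./All_Beauty_5.json') as f:
#     reviews = load_first_reviews(f)

# Made-up sample lines standing in for the real file.
sample = [
    '{"overall": 5.0, "reviewText": "As advertised"}',
    '{"overall": 1.0, "reviewText": "Broke quickly"}',
    '{"overall": 3.0, "reviewText": "It is fine"}',
]
print(len(load_first_reviews(sample, limit=2)))  # prints 2
```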

In [39]:
#load json data
all_reviews = []
with open('./All_Beauty_5.json') as f:
    for review in f:
        all_reviews.append(json.loads(review))
        
all_reviews[0]

{'overall': 5.0,
 'verified': True,
 'reviewTime': '09 1, 2016',
 'reviewerID': 'A3CIUOJXQ5VDQ2',
 'asin': 'B0000530HU',
 'style': {'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice Blue'},
 'reviewerName': 'Shelly F',
 'reviewText': 'As advertised. Reasonably priced',
 'summary': 'Five Stars',
 'unixReviewTime': 1472688000}

In [40]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

for review in all_reviews[:10000]:
    if review.get("reviewText", False) and review.get("overall", False):
        corpus.append(review['reviewText'])
        labels.append(review['overall'])
        
#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
for i, label in enumerate(labels):
    if label == 1 or label == 2:
        labels[i] = 1
    elif label == 3:
        labels[i] = 2
    else:
        labels[i] = 3


In [41]:
#preprocessing data
#remove email addresses, punctuation, and newline markers (\n)

pattern = r"\S*@\S*|\\n|[^a-zA-Z0-9]"

for i, review in enumerate(corpus):
    fil_review = [w for w in re.sub(pattern, " ", review).split(" ") if w != '']
    corpus[i] = ' '.join(fil_review)
    
# corpus = np.array(corpus).reshape(-1, 1)
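To see what this cleaning step does to a single review, here is a small sketch on a made-up sentence. It assumes the intent of the pattern is to drop whole email-like tokens (written here as `\S*@\S*`), literal `\n` sequences, and every non-alphanumeric character; `clean_review` is a hypothetical name.

```python
import re

# Assumed intent: drop email-like tokens, literal "\n" sequences, and non-alphanumerics.
CLEAN_PATTERN = re.compile(r"\S*@\S*|\\n|[^a-zA-Z0-9]")

def clean_review(text):
    """Replace unwanted spans with spaces, then collapse repeated whitespace."""
    return " ".join(CLEAN_PATTERN.sub(" ", text).split())

print(clean_review("Contact me@example.com!\nGreat product, didn't disappoint."))
# prints: Contact Great product didn t disappoint
```

Apostrophes are replaced by spaces as well, which is why contractions show up as split tokens such as `didn t` in the cleaned corpus.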

In [42]:
#split corpus and label into train and test
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=0)

len(x_train), len(x_test), len(y_train), len(y_test)

(4211, 1053, 4211, 1053)
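When the classes are very unevenly sized, a plain random split can leave a rare class with few or no test examples. One possible variation, not used in this notebook, is to pass `stratify` to `train_test_split` so each class keeps roughly the same share in both parts. The toy labels below are made up.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up toy data with an imbalanced label distribution (mostly class 3).
toy_labels = [3] * 80 + [1] * 10 + [2] * 10
toy_texts = [f"review {i}" for i in range(len(toy_labels))]

# stratify keeps the class proportions (80/10/10) in both train and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)
print(Counter(y_tr), Counter(y_te))
```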

In [43]:
x_train

['Perfect for a pleasant all day essence',
 'Used for years it has DHT blocker',
 'This gel is a genuine imported product from France Over the years I have bought my wife similar products by other very expensive brands from France Real high quality at a very affordable price',
 'If you have found the top off the tube of toothpaste like I have you know this pump is a great idea My grandkids sometimes forget to put the cap back on the toothpaste so this is perfect The pump works great and the kids love it',
 'I used its works I is good',
 'Best shampoo conditioner hands down',
 'I stopped using soap when I started working in the beauty industry about 15 years Having used a wide array of skin care products and in particular body wash I was delighted and surprised when I tried this one for the first time I didn t realize that I actually like the smell of Nutmeg itself when not baked into my mom s oatmeal raisin cookie Truly a unique experience in the shower and my skin loved every minute o

In [44]:
#change corpus into vector
#you can use tfidf or BoW here

tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(x_train)

#transform training and testing corpus into vector form
x_train = tfidf_vec.transform(x_train)
x_test = tfidf_vec.transform(x_test)
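The comment in this cell mentions that either TF-IDF or bag-of-words (BoW) can be used. The sketch below, on made-up documents, shows that `CountVectorizer` and `TfidfVectorizer` follow the same fit-on-train, transform-on-test pattern and produce matrices of the same shape; only the cell values differ (raw counts versus TF-IDF weights).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration.
train_docs = ["great shampoo", "great price great product"]
test_docs = ["great conditioner"]

for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    # Learn the vocabulary from the training text only, then reuse it for test text.
    train_matrix = vectorizer.fit_transform(train_docs)
    test_matrix = vectorizer.transform(test_docs)
    print(type(vectorizer).__name__, train_matrix.shape, test_matrix.shape)
```

Words that appear only in the test text ("conditioner" here) are not in the learned vocabulary and are ignored at transform time.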

### Training and prediction

In [45]:
#build classification model (decision tree, random forest, or adaboost)
#start training

tree = DecisionTreeClassifier(max_depth=6, min_samples_split=2)
tree.fit(x_train, y_train)

DecisionTreeClassifier(max_depth=6)
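The cell comment lists decision tree, random forest, and AdaBoost as options. All three expose the same `fit` / `predict` / `score` interface, so they can be swapped with one line. The sketch below uses made-up documents and labels, and the hyperparameters are illustrative assumptions rather than tuned values.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up toy corpus: 3 = positive, 1 = negative.
docs = ["love it", "great product", "terrible smell",
        "broke quickly", "love the smell", "terrible product"]
y = [3, 3, 1, 1, 3, 1]
X = TfidfVectorizer().fit_transform(docs)

# Illustrative settings only; random_state is fixed for repeatability.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=6, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))  # training accuracy on the toy data
```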

In [46]:
#start inference
y_pred = tree.predict(x_test)

In [47]:
#calculate accuracy
print(f"Accuracy: {tree.score(x_test,y_test)}")

Accuracy: 0.976258309591643


In [48]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.91      0.70      0.79        30
           2       1.00      0.07      0.12        15
           3       0.98      1.00      0.99      1008

    accuracy                           0.98      1053
   macro avg       0.96      0.59      0.64      1053
weighted avg       0.98      0.98      0.97      1053

[[  21    0    9]
 [   0    1   14]
 [   2    0 1006]]
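The numbers in the report can be recomputed by hand from the confusion matrix, which helps when reading it. In scikit-learn's convention, rows are true classes and columns are predicted classes. The sketch below copies the matrix from the output above.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = [[21, 0, 9],
      [0, 1, 14],
      [2, 0, 1006]]
classes = [1, 2, 3]

for k, label in enumerate(classes):
    true_positive = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    actually_k = sum(cm[k])                     # row sum (support)
    precision = true_positive / predicted_as_k
    recall = true_positive / actually_k
    print(label, round(precision, 2), round(recall, 2))

# Accuracy is the diagonal total divided by the number of test samples.
accuracy = sum(cm[k][k] for k in range(3)) / sum(map(sum, cm))
print(round(accuracy, 4))  # prints 0.9763
```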


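The results show that the model is accurate on positive reviews (precision 0.98, recall 1.00) but weaker on the two minority classes: recall on negative reviews is 0.70, and 14 of the 15 neutral reviews are misclassified as positive (recall 0.07). Because 1,008 of the 1,053 test reviews are positive, the overall accuracy of 0.976 mostly reflects the majority class.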
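To put the accuracy in context, it helps to compare it with two reference numbers: the accuracy of always predicting the majority class, and the unweighted mean of per-class recall (the "macro avg" recall in the report). The sketch below uses only the support and confusion-matrix counts printed above.

```python
# Test-set support and per-class recall, taken from the report and confusion matrix above.
support = {1: 30, 2: 15, 3: 1008}
recall = {1: 21 / 30, 2: 1 / 15, 3: 1006 / 1008}

# Accuracy of a trivial model that always predicts the largest class.
majority_baseline = max(support.values()) / sum(support.values())

# Unweighted mean of per-class recall; every class counts equally.
macro_recall = sum(recall.values()) / len(recall)

print(round(majority_baseline, 4))  # prints 0.9573
print(round(macro_recall, 4))       # prints 0.5882
```

The tree's 0.976 accuracy is only about two points above the always-positive baseline, while the macro recall of about 0.59 reflects how often the negative and neutral classes are missed.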