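### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify reviews by rating (a text classification task).

Each review in the data is rated 1, 2, 3, 4, or 5. For this assignment we regroup the ratings into bad, neutral, and good reviews (1,2 --> 1 bad; 3 --> 2 neutral; 4,5 --> 3 good).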
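
The regrouping rule above can be sketched as a small helper. The function name `to_sentiment_class` is hypothetical and does not appear in the notebook; it is only one way to write the rule.

```python
def to_sentiment_class(stars):
    """Map a 1-5 star rating to 1 (bad), 2 (neutral), or 3 (good)."""
    if stars in (1, 2):
        return 1
    if stars == 3:
        return 2
    if stars in (4, 5):
        return 3
    raise ValueError(f"unexpected rating: {stars!r}")

# Ratings in the data are floats such as 5.0; 5.0 == 5 in Python, so they match.
mapped = [to_sentiment_class(s) for s in (1.0, 2.0, 3.0, 4.0, 5.0)]
print(mapped)
```

Raising on an unexpected value, rather than silently skipping it, keeps the label list the same length as the corpus.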

### Load packages

In [9]:
import json
import re
import gzip
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

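### Data preprocessing
The full text data is fairly large. The exercise was planned around the first 10,000 records, but the file loaded below (`All_Beauty_5.json.gz`) contains only 5,269 reviews, so all of them are used.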

In [10]:
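def parse(path):
    # Stream one JSON object per line; the file is closed when the generator finishes
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)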

In [11]:
#load json data
all_reviews = [review for review in parse("All_Beauty_5.json.gz")]
all_reviews[0]

{'overall': 5.0,
 'verified': True,
 'reviewTime': '09 1, 2016',
 'reviewerID': 'A3CIUOJXQ5VDQ2',
 'asin': 'B0000530HU',
 'style': {'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice Blue'},
 'reviewerName': 'Shelly F',
 'reviewText': 'As advertised. Reasonably priced',
 'summary': 'Five Stars',
 'unixReviewTime': 1472688000}
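
Because `parse` calls `json.loads` on each line, the file is expected to hold one JSON object per line. A minimal round-trip sketch, using two made-up records and a temporary file (neither comes from the dataset), shows the layout:

```python
import gzip
import json
import os
import tempfile

def parse(path):
    # Yield one parsed JSON object per line of a gzip-compressed file
    with gzip.open(path, "rt", encoding="utf-8") as g:
        for line in g:
            yield json.loads(line)

# Two made-up records, written in the same one-object-per-line layout
records = [
    {"overall": 5.0, "reviewText": "As advertised."},
    {"overall": 2.0, "reviewText": "Stopped working."},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

loaded = list(parse(tmp_path))
print(len(loaded), loaded[0]["overall"])
```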

In [27]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

for reviews in all_reviews:
    if "reviewText" in reviews and "overall" in reviews:
        corpus.append(reviews["reviewText"])
        labels.append(reviews["overall"])
        
#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
transformed_labels = []
for label in labels:
    if label in (1,2):
        transformed_labels.append(1)
    elif label == 3:
        transformed_labels.append(2)
    elif label in (4,5):
        transformed_labels.append(3)

print(len(all_reviews))
print(len(labels))
print(len(transformed_labels))

5269
5264
5264
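
The counts show that 5 of the 5,269 records lack `reviewText` or `overall` and are skipped. Before training, it can also help to look at how the ratings are distributed, since a skewed distribution affects how accuracy should be read. A minimal sketch with `collections.Counter`, using made-up ratings in place of the notebook's `labels`:

```python
from collections import Counter

# Made-up ratings for illustration; in the notebook this would be `labels`
sample_labels = [5.0, 5.0, 4.0, 5.0, 1.0, 3.0, 5.0, 2.0, 5.0, 5.0]

counts = Counter(sample_labels)
total = len(sample_labels)
for rating in sorted(counts):
    # Print the count and the share of each rating
    print(f"{rating}: {counts[rating]} ({counts[rating] / total:.0%})")
```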


In [28]:
#preprocess the data
#remove email addresses, punctuation, and newline characters

pattern = r"\S*@\S*|\\n|[^a-zA-Z0-9 ]"

for i, review in enumerate(corpus):
    fil_review = [w for w in re.sub(pattern, " ", review).split(" ") if w != ""]
    corpus[i] = " ".join(fil_review)
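
To see what the pattern does, the sketch below applies the same substitution to one made-up review. The helper name `clean` is hypothetical. Note that in a raw string `\\n` matches a literal backslash followed by `n`; real newline characters are removed by the `[^a-zA-Z0-9 ]` class instead.

```python
import re

pattern = r"\S*@\S*|\\n|[^a-zA-Z0-9 ]"

def clean(review):
    # Replace every match with a space, then collapse repeated spaces
    words = [w for w in re.sub(pattern, " ", review).split(" ") if w != ""]
    return " ".join(words)

sample = "Great product!\nEmail me at jane.doe@example.com, it's 100% worth it."
cleaned = clean(sample)
print(cleaned)  # Great product Email me at it s 100 worth it
```

Two side effects are visible: the apostrophe splits "it's" into "it s", and case is left unchanged at this stage.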

In [30]:
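#split corpus and label into train and test
#note: `labels` holds the original 1-5 star ratings, so the model and the report below use five classes;
#pass the three-class labels built above instead to train on the bad/neutral/good scheme
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=2, shuffle=True)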

len(x_train), len(x_test), len(y_train), len(y_test)

(4211, 1053, 4211, 1053)

In [32]:
#change corpus into vector
#you can use tfidf or BoW here

tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(x_train)

#transform training and testing corpus into vector form
x_train = tfidf_vec.transform(x_train).toarray()
x_test = tfidf_vec.transform(x_test).toarray()
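
To show what the vectors contain, here is a pure-Python sketch of TF-IDF weighting on three made-up documents. It assumes the weighting that `TfidfVectorizer` uses by default as I understand it (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization) and ignores the vectorizer's tokenization and lowercasing details.

```python
import math
from collections import Counter

# Made-up documents, already tokenized by whitespace
docs = ["good product", "bad product", "good good value"]
tokenized = [d.split() for d in docs]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))
# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

def tfidf(tokens):
    # Term count times IDF, scaled so the vector has unit L2 norm
    counts = Counter(tokens)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
print(vectors[0])
```

Terms that appear in fewer documents ("bad", "value") receive a larger IDF than terms that appear in many ("good", "product").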

### Training and prediction

In [38]:
#build classification model (decision tree, random forest, or adaboost)
#start training

tree = DecisionTreeClassifier(max_depth=6, min_samples_split=2)
tree.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
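
The output shows `criterion='gini'`, meaning candidate splits are scored by Gini impurity. A minimal sketch of that calculation on made-up labels follows; the helper names `gini` and `split_gini` are hypothetical and this is not scikit-learn's internal code.

```python
from collections import Counter

def gini(labels):
    # 1 minus the sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    # Impurity of a split: child impurities weighted by child size
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [5, 5, 5, 5, 1, 1]
left, right = [5, 5, 5, 5], [1, 1]
print(gini(parent), split_gini(left, right))
```

A split that separates the classes perfectly, as here, brings the weighted impurity down to zero.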

In [42]:
#start inference
y_pred = tree.predict(x_test)
y_pred

array([5., 5., 4., ..., 5., 5., 5.])

In [45]:
#calculate accuracy
tree.score(x_test, y_test)

0.9012345679012346

In [46]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

         1.0       1.00      0.42      0.59        19
         2.0       0.75      0.17      0.27        18
         3.0       0.83      0.20      0.32        25
         4.0       0.69      0.16      0.27        67
         5.0       0.90      1.00      0.95       924

   micro avg       0.90      0.90      0.90      1053
   macro avg       0.84      0.39      0.48      1053
weighted avg       0.89      0.90      0.87      1053

[[  8   0   1   0  10]
 [  0   3   0   0  15]
 [  0   1   5   3  16]
 [  0   0   0  11  56]
 [  0   0   0   2 922]]
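
The figures in the report can be recomputed from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below copies the matrix printed above and derives per-class precision and recall and the overall accuracy:

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = [
    [8, 0, 1, 0, 10],
    [0, 3, 0, 0, 15],
    [0, 1, 5, 3, 16],
    [0, 0, 0, 11, 56],
    [0, 0, 0, 2, 922],
]
n = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

for i in range(n):
    support = sum(cm[i])                         # true samples of class i
    predicted = sum(cm[r][i] for r in range(n))  # samples predicted as class i
    recall = cm[i][i] / support
    precision = cm[i][i] / predicted
    print(f"class {i + 1}: precision={precision:.2f} recall={recall:.2f} support={support}")
print(f"accuracy={accuracy:.4f}")
```

Reading down the last column shows that 97 of the 129 reviews rated 1 to 4 were predicted as 5.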


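The report above covers the original 1-5 star ratings, because the split was made on `labels`. The model does well on 5-star reviews (precision 0.90, recall 1.00), but recall for every other rating is low (0.16 to 0.42): the confusion matrix shows that most 1- to 4-star reviews are predicted as 5 stars. The classes are also heavily imbalanced (924 of the 1,053 test reviews are 5-star), so the 0.90 accuracy mostly reflects the majority class.
Try the methods you have learned to improve the model's performance.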
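
As one possible starting point, the sketch below combines a stratified split with a class-weighted random forest. The reviews are made up (1 = bad, 2 = neutral, 3 = good), the hyperparameters are arbitrary, and there is no claim that this setup improves results on the real data; it only shows how the pieces fit together.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up reviews, not from the dataset: 1 = bad, 2 = neutral, 3 = good
texts = [
    "broke after one day", "terrible smell", "waste of money", "did not work",
    "it is okay", "average quality", "nothing special", "fine for the price",
    "love this product", "works great", "excellent value", "highly recommend",
]
targets = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

# stratify keeps the class proportions the same in the train and test sets
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, targets, test_size=0.25, random_state=2, stratify=targets
)

# Fit the vectorizer on the training texts only, then reuse it for the test texts
vec = TfidfVectorizer()
x_tr_vec = vec.fit_transform(x_tr)
x_te_vec = vec.transform(x_te)

# class_weight="balanced" weights classes inversely to their frequency
forest = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=2)
forest.fit(x_tr_vec, y_tr)
y_hat = forest.predict(x_te_vec)
print(len(y_hat))
```

The sparse matrices are passed to the forest directly, which avoids the memory cost of calling `.toarray()` on a large vocabulary.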