### Assignment objective: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify review ratings (a text classification task).

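The dataset scores each review from 1 to 5. In this assignment we regroup the scores into three classes: negative, neutral, and positive (1,2 --> 1 negative; 3 --> 2 neutral; 4,5 --> 3 positive).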

### Load packages

In [142]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Data preprocessing
The corpus is fairly large, so we use only the first 10,000 records for this exercise.

In [143]:
#load json data (one JSON object per line) and stop after the first 10000 records
from itertools import islice

with open(r'All_Beauty.json', 'rt') as f:
    all_reviews = [json.loads(line) for line in islice(f, 10000)]

all_reviews[0]

{'overall': 1.0,
 'verified': True,
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'asin': '0143026860',
 'reviewerName': 'theodore j bigham',
 'reviewText': 'great',
 'summary': 'One Star',
 'unixReviewTime': 1424304000}

In [144]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

for review in all_reviews:
    if 'reviewText' in review and 'overall' in review:
        corpus.append(review['reviewText'])
        labels.append(review['overall'])
        
#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
import numpy as np
labels = np.array(labels)
labels = np.select([labels <= 2, labels == 3, labels >= 4], [1,2,3])
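The regrouping rule can be sanity-checked on a few made-up ratings before running it on the real labels. The sketch below assumes nothing about the dataset; `sample_ratings` is a hypothetical array introduced only for illustration.

```python
from collections import Counter

import numpy as np

# Hypothetical ratings, not taken from the dataset
sample_ratings = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 4.0, 1.0])

# Same rule as above: 1,2 --> 1 ; 3 --> 2 ; 4,5 --> 3
mapped = np.select(
    [sample_ratings <= 2, sample_ratings == 3, sample_ratings >= 4],
    [1, 2, 3],
)

print(mapped.tolist())
# Counting the mapped labels shows how balanced (or not) the classes are
print(Counter(mapped.tolist()))
```

Running the same `Counter` on the real `labels` array is a quick way to see the class distribution before training.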

In [145]:
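#preprocessing data
#remove email addresses, punctuation, and newline characters (\n)
#note: the NLTK resources used below (tokenizer, POS tagger, stopwords, WordNet) must already be downloaded with nltk.download
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer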

def get_wordnet_pos(word):
    """Map the pos_tag result to the POS format expected by the lemmatizer."""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

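stopwords_english = set(stopwords.words('english'))  # build the set once for fast lookups
lemmatizer = WordNetLemmatizer()

for i, corp in enumerate(corpus):
    # email addresses (\S* instead of .* so the match cannot run to the last dot on the line)
    corp = re.sub(r'\S*@\S*\.\S*', ' ', corp)
    # newline characters
    corp = re.sub(r'\n', ' ', corp)
    # everything except letters and digits (A-z would also keep [ \ ] ^ _ and the backtick)
    corp = re.sub(r'[^A-Za-z0-9]', ' ', corp)
    corp_ws = word_tokenize(corp)
    corp_ws = [word for word in corp_ws if word not in stopwords_english]
    corpus[i] = ' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in corp_ws])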
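A note on the character class in the punctuation-stripping step: the range `A-z` spans every ASCII code point between `A` and `z`, which includes `[`, `\`, `]`, `^`, `_`, and the backtick, whereas `A-Za-z` covers letters only. The small check below illustrates the difference on a made-up string (`text` is a hypothetical example, not a review from the dataset).

```python
import re

# Hypothetical input containing characters that sit between 'Z' and 'a' in ASCII
text = "nice_product [5/5] ^^ works `ok`"

# A-z keeps [ ] ^ _ and the backtick, because they fall inside the range
loose = re.sub(r'[^A-z0-9]', ' ', text)

# A-Za-z keeps letters only (digits are kept by the separate 0-9 range)
strict = re.sub(r'[^A-Za-z0-9]', ' ', text)

print(repr(loose))
print(strict.split())
```

With the stricter class, only alphanumeric tokens reach the tokenizer.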

In [146]:
#split corpus and label into train and test
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size = 0.2)

len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)
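Since the three classes are far from balanced (see the support column in the classification report further down), one option worth considering is a stratified, seeded split. The sketch below uses toy data; `toy_corpus` and `toy_labels` are hypothetical stand-ins for `corpus` and `labels`, and the seed value is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the notebook's data, with a 10/10/80 class mix
toy_labels = [1] * 10 + [2] * 10 + [3] * 80
toy_corpus = [f"review {i}" for i in range(len(toy_labels))]

# stratify keeps the class proportions the same in both splits;
# random_state makes the split repeatable across runs
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_corpus, toy_labels, test_size=0.2, stratify=toy_labels, random_state=0
)

print(Counter(y_tr))
print(Counter(y_te))
```

Without `stratify`, a rare class can end up under-represented in the test set by chance, which makes per-class metrics noisier.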

In [147]:
#change corpus into vector
#you can use tfidf or BoW here
vectorizer = TfidfVectorizer(max_features=2000)
vectorizer.fit(x_train)


#transform training and testing corpus into vector form
x_train = vectorizer.transform(x_train)
x_test = vectorizer.transform(x_test)

### Training and prediction

In [166]:
#build classification model (decision tree, random forest, or adaboost)
#start training

clf = AdaBoostClassifier()
clf.fit(x_train, y_train)

AdaBoostClassifier()

In [167]:
#start inference
y_pred = clf.predict(x_test)

In [168]:
#calculate accuracy
'Accuracy: {}'.format(sum(y_pred == y_test) / len(y_test))

'Accuracy: 0.9074537268634317'

In [169]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.65      0.34      0.45       133
           2       0.29      0.03      0.05        80
           3       0.92      0.99      0.95      1786

    accuracy                           0.91      1999
   macro avg       0.62      0.45      0.48      1999
weighted avg       0.88      0.91      0.88      1999

[[  45    0   88]
 [  10    2   68]
 [  14    5 1767]]
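To put the accuracy figure in context, the confusion matrix above can be reduced by hand to a few summary numbers. The sketch below copies the matrix from the printed output and compares the accuracy with a majority-class baseline, meaning a model that always predicts class 3. The variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows = true class, columns = predicted class)
cm = [
    [45, 0, 88],     # true 1 (negative)
    [10, 2, 68],     # true 2 (neutral)
    [14, 5, 1767],   # true 3 (positive)
]

total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]

# Fraction of correct predictions (the diagonal)
accuracy = sum(cm[i][i] for i in range(3)) / total

# Accuracy of a model that always predicts the most frequent class
majority_baseline = max(support) / total

# Per-class recall and its unweighted mean (macro recall)
recall = [cm[i][i] / support[i] for i in range(3)]
macro_recall = sum(recall) / len(recall)

print(f"accuracy          = {accuracy:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
print(f"macro recall      = {macro_recall:.4f}")
```

The accuracy sits only slightly above the baseline, so macro-averaged metrics are the more informative numbers to track here.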


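From the results above, the model handles positive reviews well (both precision and recall are high) but performs poorly on negative reviews. Neutral reviews are almost never recognized: in the confusion matrix, most of them (68 of 80) are predicted as positive, and most of the rest as negative.
Try applying the methods you have learned to improve the model's performance.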
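One possible direction, offered as a sketch and not as the assignment's intended answer, is to reweight the classes so that errors on rare classes cost more. The example below computes scikit-learn's "balanced" weights for labels with the same class counts as the test-set support above (in practice the training labels would be used) and builds a `RandomForestClassifier` with that option. `y_demo` and `clf_balanced` are hypothetical names; the `fit` call is left out because it needs the vectorized training data from the earlier cells.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels with the class counts 133 / 80 / 1786 seen in the report
y_demo = np.repeat([1, 2, 3], [133, 80, 1786])

# "balanced" weight for class c = n_samples / (n_classes * count(c))
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([1, 2, 3]), y=y_demo
)
print(dict(zip([1, 2, 3], weights.round(3))))

# Tree-based classifier that applies those weights internally during training
clf_balanced = RandomForestClassifier(class_weight="balanced", random_state=0)
```

`AdaBoostClassifier` has no `class_weight` parameter, but its `fit` method accepts `sample_weight`, so a similar effect could be obtained by passing per-sample weights.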