### Assignment goal: text classification with tree-based models

This assignment uses [the All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify review ratings (a text classification task).

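The dataset rates each review from 1 to 5. In this assignment we regroup the ratings into three classes: negative, neutral, and positive (1, 2 --> 1 negative; 3 --> 2 neutral; 4, 5 --> 3 positive).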

### Load packages

In [3]:
import json
import pandas as pd
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Data preprocessing
The text data is fairly large, so we use only the first 10,000 records for this exercise.

In [21]:
#load json data (one JSON object per line); keep only the first 10000 records
from itertools import islice

all_reviews = []
with open('All_Beauty.json', 'r') as f:
    for review in islice(f, 10000):
        all_reviews.append(json.loads(review))

len(all_reviews), all_reviews[0]

(10000,
 {'overall': 1.0,
  'verified': True,
  'reviewTime': '02 19, 2015',
  'reviewerID': 'A1V6B6TNIC10QE',
  'asin': '0143026860',
  'reviewerName': 'theodore j bigham',
  'reviewText': 'great',
  'summary': 'One Star',
  'unixReviewTime': 1424304000})

In [2]:
#load json data
all_reviews = []
###<your code>###
        
all_reviews[0]

{'overall': 1.0,
 'verified': True,
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'asin': '0143026860',
 'reviewerName': 'theodore j bigham',
 'reviewText': 'great',
 'summary': 'One Star',
 'unixReviewTime': 1424304000}

In [52]:
type(all_reviews[0])

dict

In [64]:
corpus = []
labels = []
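for review in all_reviews:
    # skip reviews that lack a reviewText or overall field,
    # so corpus and labels always stay the same length
    if 'reviewText' in review and 'overall' in review:
        corpus.append(review['reviewText'])
        labels.append(review['overall'])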

In [65]:
# transform data
label_map = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
labels = [label_map[int(label)] for label in labels]
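After regrouping, it can help to check how many reviews fall into each class, since an imbalanced label distribution affects how the later metrics should be read. The following is a minimal sketch using a hypothetical `sample_ratings` list (not taken from the dataset) and the same mapping as above:

```python
from collections import Counter

# Same mapping as in the cell above: 1,2 -> 1 (negative), 3 -> 2 (neutral), 4,5 -> 3 (positive)
label_map = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

# Hypothetical ratings, stored as floats like the 'overall' field
sample_ratings = [5.0, 4.0, 1.0, 3.0, 5.0, 2.0, 5.0]
sample_labels = [label_map[int(rating)] for rating in sample_ratings]

# Count how many samples each class received
label_counts = Counter(sample_labels)
print(label_counts)
```

Running the same `Counter` call on the real `labels` list would show how strongly one class dominates.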

In [69]:
corpus

['great',
 "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you",
 'This book was very informative, covering all aspects of game.',
 'I am already a baseball fan and knew a bit about the Negro leagues, but I learned a lot more reading this book.',
 "This was a good story of the Black leagues. I bought the book to teach in my high school reading class. I found it very informative and exciting. I would recommend to anyone interested in the history of the black leagues. It is well written, unlike a book of facts. The McKissack's continue to write good books for young audiences that can also be enjoyed by adults!",
 'Today I gave a book about the Negro Leagues of Baseball to a traveling friend. Its a book I\'ve read more than once and felt that my friend would truly enjoy. It felt like giving a gift that you wanted to keep for yourself. I parted with the book knowing that

In [84]:
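# Preprocessing: lowercase, then replace every non-word character
# (punctuation, whitespace, newlines) with a space
word_dic = set()
processed_sentence = []
pattern = r'\W'  # \W already matches whitespace, so \s is not needed
for sentence in corpus:
    sentence = re.sub(pattern, ' ', sentence.lower())
    word_dic |= set(sentence.split())  # collect words, not individual characters
    processed_sentence.append(sentence)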
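The preprocessing above strips punctuation and newlines but does not treat email addresses specially, so an address would be split into separate word fragments. A minimal sketch of one way to drop addresses first is shown below. The helper name `clean_review` and the loose email pattern are assumptions for illustration, and unlike the cell above this version collapses runs of non-word characters into a single space:

```python
import re

# Assumption: a deliberately loose pattern, "something@something.something"
EMAIL_PATTERN = re.compile(r'\S+@\S+\.\S+')
NON_WORD_PATTERN = re.compile(r'\W+')

def clean_review(text):
    """Lowercase a review, drop email addresses, and replace punctuation/newlines with spaces."""
    text = text.lower()
    text = EMAIL_PATTERN.sub(' ', text)      # remove email addresses before punctuation is stripped
    text = NON_WORD_PATTERN.sub(' ', text)   # collapse punctuation, whitespace, and \n into one space
    return text.strip()

sample = "Great razor!\nContact me at someone@example.com, it's worth it."
print(clean_review(sample))
```

The email pattern has to run before the punctuation step, because once `@` and `.` are replaced with spaces the address can no longer be recognized.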

In [85]:
# split the data into training and test sets
x_train,x_test,y_train,y_test = train_test_split(processed_sentence, labels, test_size = 0.2, random_state = 1)

len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)

In [86]:
x_train

['so much more quality than i was expecting  it s held up great to numerous washes and an unruly and very messy little adventurer  my toddler is a huge peppa fan and absolutely adores this blouse  and its so much cuter than most of the character shirts you find ',
 'if you are looking for a stylish  sturdy ecinomical razor stand  then look no further  this is a quality stand and would make a perfect gift ',
 'a nice feature for the razor',
 'gave new life to my old razor  didn t realize how dull it had gotten until i put this on and got the cleanest  fastest shave i ve had in a long time ',
 'i use this to help me with my upkeep as i am pregnant and at 36 weeks it is impossible to see but this helps me stay tame ',
 'nice stand that arrived in perfect condition  goes nicely with other shaving gear ',
 'a solid  handsome looking stand ',
 'i ve owned this brush for a month and half now  i use it with the merkur heavy duty double edge razor  34c and the perfecto 100  pure badger shaving 

In [88]:
vectorizer = TfidfVectorizer(max_features = 2000)
# fit on the training set only, then reuse the learned vocabulary for the test set
tfidf_train = vectorizer.fit_transform(x_train)
tfidf_test = vectorizer.transform(x_test)

In [89]:
tfidf_train

<7996x2000 sparse matrix of type '<class 'numpy.float64'>'
	with 181405 stored elements in Compressed Sparse Row format>

In [90]:
decision_tree_cls = DecisionTreeClassifier(criterion='entropy', max_depth=5)
decision_tree_cls.fit(tfidf_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=5)

In [94]:
y_pred = decision_tree_cls.predict(tfidf_test)

In [97]:
sum(y_test==y_pred)/len(y_test)

0.9024512256128064

In [98]:
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.53      0.18      0.27       135
           2       0.00      0.00      0.00        64
           3       0.91      0.99      0.95      1800

    accuracy                           0.90      1999
   macro avg       0.48      0.39      0.41      1999
weighted avg       0.86      0.90      0.87      1999

[[  24    1  110]
 [   3    0   61]
 [  18    2 1780]]
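The accuracy of about 0.90 should be read against the class distribution. The sketch below recomputes accuracy and per-class recall from the confusion matrix printed above (rows are true classes, columns are predicted classes) and compares the accuracy with a baseline that always predicts the largest class. The variable names are illustrative only:

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = [
    [24, 1, 110],
    [3, 0, 61],
    [18, 2, 1780],
]

total = sum(sum(row) for row in cm)               # number of test samples
correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal = correct predictions
accuracy = correct / total

support = [sum(row) for row in cm]                # true samples per class
majority_baseline = max(support) / total          # accuracy of always predicting the largest class
recall = [cm[i][i] / support[i] for i in range(len(cm))]

print(f"accuracy          = {accuracy:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
print("recall per class  =", [round(r, 2) for r in recall])
```

The tree's accuracy is only marginally above the majority baseline, which is why the macro-averaged scores in the report give a more honest picture than accuracy alone.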


In [5]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

###<your code>###
      
#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3

###<your code>###

In [6]:
#preprocessing data
#remove email addresses, punctuation, and newline characters (\n)

###<your code>###

In [7]:
#split corpus and label into train and test
###<your code>###

len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)

In [8]:
#change corpus into vector
#you can use tfidf or BoW here

###<your code>###

#transform training and testing corpus into vector form
x_train = ###<your code>###
x_test = ###<your code>###

### Training and prediction

In [9]:
#build classification model (decision tree, random forest, or adaboost)
#start training

###<your code>###

DecisionTreeClassifier(max_depth=6)

In [10]:
#start inference
y_pred = ###<your code>###

In [13]:
#calculate accuracy
###<your code>###

Accuracy: 0.9054527263631816


In [17]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.71      0.22      0.33       134
           2       0.00      0.00      0.00        73
           3       0.91      0.99      0.95      1792

    accuracy                           0.91      1999
   macro avg       0.54      0.40      0.43      1999
weighted avg       0.87      0.91      0.88      1999

[[  29    4  101]
 [   3    0   70]
 [   9    2 1781]]


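The results above show that the model performs well on positive reviews (both precision and recall are high) but poorly on negative reviews, and it never classifies a neutral review correctly: most neutral reviews are predicted as positive.
Try applying the methods you have learned to improve the model's performance.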
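One direction worth trying, given how strongly the positive class dominates, is to reweight the classes during training. The sketch below is not the assignment's reference solution. It uses a hypothetical toy corpus and `RandomForestClassifier` with `class_weight="balanced"`, and it scores on the training data purely to show the API calls:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score

# Hypothetical toy data, deliberately imbalanced (3 = positive dominates)
toy_corpus = [
    "great product works well",
    "love it great quality",
    "excellent razor very sharp",
    "nice stand good gift",
    "perfect fit great price",
    "works great love the smell",
    "terrible broke after one day",
    "awful smell waste of money",
    "it is okay nothing special",
]
toy_labels = [3, 3, 3, 3, 3, 3, 1, 1, 2]

vectorizer = TfidfVectorizer()
toy_features = vectorizer.fit_transform(toy_corpus)

# class_weight="balanced" weights each class inversely to its frequency,
# so errors on rare classes cost more during training
forest_cls = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=1)
forest_cls.fit(toy_features, toy_labels)

toy_pred = forest_cls.predict(toy_features)
# macro F1 averages the per-class F1 scores, so minority classes count as much as the majority
macro_f1 = f1_score(toy_labels, toy_pred, average="macro")
print(macro_f1)
```

On the real data, the same two changes (a class-weighted model and macro F1 as the comparison metric, measured on the held-out test set) make it easier to see whether the negative and neutral classes actually improve.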