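### Assignment objective: Article classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify reviews by rating (text classification).

The dataset rates each review from 1 to 5. In this assignment we regroup the ratings into three classes: negative, neutral, and positive (1, 2 --> 1 negative; 3 --> 2 neutral; 4, 5 --> 3 positive).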

### Load packages

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd './drive/My Drive/NLP/day28'

/content/drive/My Drive/NLP/day28


In [3]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/All_Beauty.json.gz
!gunzip All_Beauty.json.gz

--2021-01-10 10:11:51--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/All_Beauty.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47350910 (45M) [application/octet-stream]
Saving to: ‘All_Beauty.json.gz’


2021-01-10 10:11:56 (11.8 MB/s) - ‘All_Beauty.json.gz’ saved [47350910/47350910]



In [3]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Data preprocessing
The full text dataset is fairly large, so we use only the first 10,000 records for this exercise.

In [4]:
# ref:https://blog.csdn.net/u011318077/article/details/88550775
#load json data (one JSON object per line)
all_reviews = []
filename = './All_Beauty.json'
# errors='ignore' drops bytes that cannot be decoded as UTF-8
with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
  for review in f:
    all_reviews.append(json.loads(review))

all_reviews[0]

{'asin': '0143026860',
 'overall': 1.0,
 'reviewText': 'great',
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'reviewerName': 'theodore j bigham',
 'summary': 'One Star',
 'unixReviewTime': 1424304000,
 'verified': True}

In [5]:
all_reviews = all_reviews[:10000]

In [6]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

for i in range(len(all_reviews)):
  corpus.append(all_reviews[i].get('reviewText'))
  labels.append(int(all_reviews[i].get('overall')))
        
#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
for i , label in enumerate(labels):
  if label == 1 or label == 2:
    labels[i] = 1
  elif label == 3:
    labels[i] = 2
  else:
    labels[i] = 3
    
print(labels[:5])

[1, 3, 3, 3, 3]
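
Before training, it can help to check how many reviews fall into each class after the remapping, since rating data is often skewed toward high scores. The following is a minimal sketch, assuming a `labels` list like the one built above; `class_distribution` and `toy_labels` are hypothetical names used only for illustration and are not part of the original notebook.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, fraction)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# Toy labels for illustration only (not taken from the dataset)
toy_labels = [1, 3, 3, 3, 3, 2, 3, 1, 3, 3]
dist = class_distribution(toy_labels)
for label, (n, frac) in dist.items():
    print(f"class {label}: {n} reviews ({frac:.0%})")
```

Calling `class_distribution(labels)` on the real labels would show whether one class dominates, which matters when interpreting accuracy later.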


In [7]:
print(corpus[:5])

['great', "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you", 'This book was very informative, covering all aspects of game.', 'I am already a baseball fan and knew a bit about the Negro leagues, but I learned a lot more reading this book.', "This was a good story of the Black leagues. I bought the book to teach in my high school reading class. I found it very informative and exciting. I would recommend to anyone interested in the history of the black leagues. It is well written, unlike a book of facts. The McKissack's continue to write good books for young audiences that can also be enjoyed by adults!"]


In [8]:
#preprocessing data
#remove email addresses, punctuation, and newline characters (\n)

# \w+@\w+\.\w+ matches an email address; \W+ matches any run of non-word characters
pattern = r"(\w+@\w+\.\w+)|\W+"

for i, contents in enumerate(corpus):
  if contents is None:
    # reviews without reviewText become empty strings so indices stay aligned
    corpus[i] = ''
    continue
  contents = re.sub(pattern, " ", contents)
  # split() with no argument also collapses repeated whitespace
  corpus[i] = ' '.join(contents.split())

print(corpus[2:4])

['This book was very informative covering all aspects of game', 'I am already a baseball fan and knew a bit about the Negro leagues but I learned a lot more reading this book']
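
To see what the cleaning step does to a single string, here is a small self-contained sketch. It assumes a pattern with the same intent as the one in the cell above, written with the dot in the email part escaped; `clean_text` and `sample` are hypothetical names, and the sample text is invented rather than taken from the dataset.

```python
import re

# Assumed pattern: an email address (dot escaped) or any run of non-word characters
pattern = r"(\w+@\w+\.\w+)|\W+"

def clean_text(text):
    """Replace emails and non-word runs with spaces, then collapse whitespace."""
    if text is None:  # some reviews may have no reviewText
        return ""
    return " ".join(re.sub(pattern, " ", text).split())

# Hypothetical sample text, not taken from the dataset
sample = "Contact me at john@example.com, thanks!\nGreat product."
cleaned = clean_text(sample)
print(cleaned)
```

Note that `\W+` also strips apostrophes, so a word such as `doesn't` is split into `doesn` and `t`; whether that matters depends on the vectorizer used afterwards.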


In [9]:
#split corpus and label into train and test
x_train , x_test , y_train , y_test = train_test_split(corpus , labels , test_size = 0.2)
len(x_train), len(x_test), len(y_train), len(y_test)

(8000, 2000, 8000, 2000)
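
The split above is random and unseeded, so class proportions and results can vary between runs. One possible variation, shown here as a sketch on toy data, is to pass `stratify` and `random_state` to `train_test_split`. The toy corpus, the variable names, and the seed value 42 are assumptions for illustration only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 80 positive, 10 neutral, 10 negative
toy_corpus = [f"review {i}" for i in range(100)]
toy_labels = [3] * 80 + [2] * 10 + [1] * 10

# stratify keeps the class proportions the same in both splits;
# random_state makes the split reproducible (42 is an arbitrary choice)
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_corpus, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

print(Counter(y_tr))
print(Counter(y_te))
```

With stratification, rare classes are guaranteed to appear in the test set in the same proportion as in the full data, which makes per-class metrics more stable.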

In [10]:
#change corpus into vector
#you can use tfidf or BoW here
cv = CountVectorizer()
cv.fit(x_train)

#transform training and testing corpus into vector form
x_train = cv.transform(x_train)
x_test = cv.transform(x_test)

In [48]:
"""
tfidf = TfidfVectorizer()
tfidf.fit(x_train)

x_train = tfidf.transform(x_train)
x_test = tfidf.transform(x_test)
"""
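
The commented-out TF-IDF cell is an alternative to the bag-of-words cell, not a follow-up: `TfidfVectorizer` expects raw text, so it would need to run in place of the `CountVectorizer` cell rather than on the already vectorized `x_train`. A minimal sketch on toy text, where `raw_train`, `raw_test`, and the sample sentences are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy raw-text corpora for illustration only
raw_train = ["great product love it", "terrible smell would not buy", "it is ok"]
raw_test = ["love the smell", "not great"]

tfidf = TfidfVectorizer()
# Fit on the raw training text only, so test vocabulary does not leak in
x_train_tfidf = tfidf.fit_transform(raw_train)
x_test_tfidf = tfidf.transform(raw_test)

print(x_train_tfidf.shape, x_test_tfidf.shape)
```

Words that appear only in the test text (such as `the` here) are ignored at transform time, because the vocabulary is fixed by the training data.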

### 訓練與預測

In [11]:
#build classification model (decision tree, random forest, or adaboost)
#start training
tree = DecisionTreeClassifier(max_depth=5,min_samples_leaf=5,min_samples_split=10)
tree.fit(x_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=5, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [12]:
#start inference
y_pred = tree.predict(x_test)

In [13]:
#calculate accuracy
tree.score(x_test , y_test)

0.8935

In [14]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.27      0.03      0.05       139
           2       0.00      0.00      0.00        69
           3       0.90      0.99      0.94      1792

    accuracy                           0.89      2000
   macro avg       0.39      0.34      0.33      2000
weighted avg       0.82      0.89      0.85      2000

[[   4    0  135]
 [   2    0   67]
 [   9    0 1783]]
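
The per-class precision and recall in the report can be recomputed directly from the confusion matrix, which makes it easier to see where the errors go. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class, in the order 1, 2, 3)
cm = np.array([[4, 0, 135],
               [2, 0, 67],
               [9, 0, 1783]])

true_pos = np.diag(cm)
# Recall: correct predictions divided by all true members of the class
recall = true_pos / cm.sum(axis=1)
pred_totals = cm.sum(axis=0)
# Class 2 is never predicted, so its precision is 0/0; report 0.0 in that case
precision = np.divide(true_pos, pred_totals,
                      out=np.zeros(3), where=pred_totals > 0)

for cls, p, r in zip((1, 2, 3), precision, recall):
    print(f"class {cls}: precision={p:.2f} recall={r:.2f}")
```

The row sums (139, 69, 1792) match the `support` column of the report, which is a quick check that the rows are indeed the true classes.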




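The results above show that the model performs well on positive reviews (both precision and recall are high) but poorly on negative reviews, where recall is only 0.03. Neutral reviews are never predicted at all: 67 of the 69 are classified as positive and the other 2 as negative. Most negative reviews (135 of 139) are also classified as positive, which reflects how heavily the test set is skewed toward positive reviews (1,792 of 2,000).
Try the methods you have learned so far to improve the model's performance.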
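
One possible direction, among many, is to counter the class imbalance with `class_weight='balanced'` and to compare models by macro F1 instead of accuracy. The sketch below uses `RandomForestClassifier` on invented toy data; the texts, labels, and hyperparameter values are all assumptions for illustration and say nothing about how such a model would score on the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score

# Toy imbalanced data for illustration only (not from the dataset)
train_texts = (["love it great product"] * 16
               + ["it is ok nothing special"] * 2
               + ["terrible broke fast"] * 2)
train_labels = [3] * 16 + [2] * 2 + [1] * 2
test_texts = ["great product", "nothing special", "terrible"]
test_labels = [3, 2, 1]

cv = CountVectorizer()
x_tr = cv.fit_transform(train_texts)
x_te = cv.transform(test_texts)

# class_weight='balanced' reweights samples inversely to class frequency;
# the other hyperparameters are arbitrary choices for this sketch
forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
forest.fit(x_tr, train_labels)
pred = forest.predict(x_te)

# Macro F1 weights all three classes equally, unlike accuracy
print(f1_score(test_labels, pred, average="macro"))
```

Other options worth trying include TF-IDF features, a deeper tree, `AdaBoostClassifier`, or resampling the minority classes.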