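Title: Classifying e-commerce product reviews as positive or negative with machine learning<br>
<br>
Description: The input files positive.review and negative.review are both XML files. We read them with BeautifulSoup,<br>
extract review_text, and build a custom tokenizer with NLTK. We first build a word-to-index map and then word-frequency vectors.<br>
Next we shuffle the data to create train/test splits, holding out 100 samples for testing. We then fit a Logistic Regression classifier<br>
and report the accuracy on the training and test sets. After that we inspect the positive or negative weight of each word and pick a threshold,<br>
for example an absolute value greater than 0.5, to confirm that the sentiment is pronounced. Finally we find the positive and negative<br>
reviews that the current model misclassifies most severely.<br>
<br>
Extension: try different tokenizers, different tokens_to_vector functions, and different ML classifiers, and compare the resulting accuracy.<br>
Finally, you can use your model to predict the contents of the unlabeled.review file.<br>
<br>
Sample program file name: sentiment_情緒分析.py, which performs sentiment analysis with LogisticRegression.<br>
Modules: sklearn, bs4, numpy, nltk<br>
Input files: stopwords.txt, plus positive.review and negative.review under /electronics<br>
Grading: classification accuracy (percent)<br>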
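
For the extension, one possible direction is to swap the hand-built vectors for scikit-learn's `CountVectorizer` and then score new, unlabeled text. The sketch below is only an illustration under assumptions: the training sentences, the labels, and the `unlabeled_texts` list are made-up toy data standing in for the real review files, and the pipeline is not the method used in the rest of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical, not taken from the review files).
train_texts = [
    "great sound and easy to use",
    "excellent quality, highly recommend",
    "love it, perfect price",
    "stopped working after a week, returned it",
    "bad support, waste of money",
    "useless, does not work",
]
train_labels = [1, 1, 1, 0, 0, 0]   # 1 = positive, 0 = negative

# CountVectorizer builds the vocabulary and the count vectors in one step.
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(train_texts, train_labels)

# Stand-in for text that would be extracted from unlabeled.review.
unlabeled_texts = ["easy to use and great quality", "returned it, bad product"]
unlabeled_preds = pipe.predict(unlabeled_texts)
unlabeled_probs = pipe.predict_proba(unlabeled_texts)[:, 1]   # p(y=1|x)

for text, pred, prob in zip(unlabeled_texts, unlabeled_preds, unlabeled_probs):
    print(pred, round(float(prob), 3), text)
```

With the real data, `unlabeled_texts` would presumably be filled from the `review_text` elements of unlabeled.review, read the same way as the labeled files.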

In [None]:
import nltk
nltk.download(['punkt', 'wordnet'])

import numpy as np
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
wordnet_lemmatizer = WordNetLemmatizer()

Stopword list from http://www.lextek.com/manuals/onix/stopwords1.html

In [None]:
stopwords = set(w.rstrip() for w in open('stopwords(作業數據).txt'))

Another source of stopwords

In [None]:
# from nltk.corpus import stopwords
# stopwords.words('english')

Read the positive and negative reviews<br>
Data courtesy of http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html

In [None]:
positive_reviews = BeautifulSoup(open('electronics/positive(作業數據).review', encoding='utf-8').read(), features="html5lib")
positive_reviews = positive_reviews.findAll('review_text')

negative_reviews = BeautifulSoup(open('electronics/negative(作業數據).review', encoding='utf-8').read(), features="html5lib")
negative_reviews = negative_reviews.findAll('review_text')

Build a custom tokenizer on top of NLTK

In [None]:
def my_tokenizer(s):
    s = s.lower()   # lowercase
    tokens = nltk.tokenize.word_tokenize(s)   # split the text into tokens
    tokens = [t for t in tokens if len(t) > 2]   # drop short tokens (2 characters or fewer)
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]   # lemmatize to the base form
    tokens = [t for t in tokens if t not in stopwords]   # remove stopwords
    return tokens
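
As a point of comparison for the "different tokenizer" extension, here is a minimal regex-based tokenizer that needs no NLTK data. It is a sketch, not a replacement for `my_tokenizer`: it does no lemmatization, and `DEMO_STOPWORDS` is a tiny hypothetical set used only to keep the example self-contained.

```python
import re

# Hypothetical stopword set, for illustration only.
DEMO_STOPWORDS = {"the", "and", "this", "for"}

def simple_tokenizer(s, stopwords=DEMO_STOPWORDS):
    """Lowercase, keep runs of letters/apostrophes, drop short tokens and stopwords."""
    tokens = re.findall(r"[a-z']+", s.lower())
    tokens = [t for t in tokens if len(t) > 2]          # drop short tokens
    return [t for t in tokens if t not in stopwords]    # remove stopwords

demo_tokens = simple_tokenizer("This cable works GREAT for the price!")
print(demo_tokens)   # ['cable', 'works', 'great', 'price']
```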

First build the word-to-index map, then the word-frequency vectors<br>
We also store the tokenized reviews so that tokenization does not have to be repeated later

In [None]:
word_index_map = {}
current_index = 0
positive_tokenized = []
negative_tokenized = []
orig_reviews = []

for review in positive_reviews:
    orig_reviews.append(review.text)
    tokens = my_tokenizer(review.text)
    positive_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1

for review in negative_reviews:
    orig_reviews.append(review.text)
    tokens = my_tokenizer(review.text)
    negative_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1

print("len(word_index_map):", len(word_index_map))

len(word_index_map): 11092
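
The two loops above do the same thing for each class: every new token receives the next free index. A small sketch of that idea as a reusable helper, run on toy token lists (the helper name `build_word_index_map` and the toy data are assumptions made for illustration):

```python
def build_word_index_map(tokenized_docs):
    """Assign each distinct token the next free integer index, in order of first appearance."""
    word_index_map = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in word_index_map:
                word_index_map[token] = len(word_index_map)
    return word_index_map

toy_docs = [["great", "price", "great"], ["bad", "price"]]
toy_map = build_word_index_map(toy_docs)
print(toy_map)   # {'great': 0, 'price': 1, 'bad': 2}
```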


Now let's create our input matrices

In [None]:
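def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_index_map) + 1)   # the last element holds the label
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    total = x.sum()
    if total > 0:   # a review with no remaining tokens would otherwise divide by zero
        x = x / total   # normalize counts to relative frequencies
    x[-1] = label
    return x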
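
To make the vector layout concrete, here is a worked toy example of the same idea with a three-word vocabulary. The function takes the map as a parameter so the block is self-contained; `toy_tokens_to_vector` and `toy_map` are illustrative names. It also guards against an empty token list, where the sum would be zero.

```python
import numpy as np

toy_map = {"great": 0, "price": 1, "bad": 2}

def toy_tokens_to_vector(tokens, label, word_index_map):
    x = np.zeros(len(word_index_map) + 1)   # last slot is reserved for the label
    for t in tokens:
        x[word_index_map[t]] += 1           # count occurrences
    total = x.sum()
    if total > 0:
        x = x / total                       # counts become relative frequencies
    x[-1] = label
    return x

vec = toy_tokens_to_vector(["great", "price", "great"], 1, toy_map)
print(vec)   # frequencies 2/3, 1/3, 0 followed by the label 1
```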

In [None]:
N = len(positive_tokenized) + len(negative_tokenized)
# (N x D+1) matrix: features and label are kept together so they can be shuffled together
data = np.zeros((N, len(word_index_map) + 1))
i = 0
for tokens in positive_tokenized:
    xy = tokens_to_vector(tokens, 1)
    data[i,:] = xy
    i += 1

for tokens in negative_tokenized:
    xy = tokens_to_vector(tokens, 0)
    data[i,:] = xy
    i += 1

Shuffle the data to create train/test splits<br>
Try this several times!

In [None]:
orig_reviews, data = shuffle(orig_reviews, data)

X = data[:,:-1]
Y = data[:,-1]

The last 100 rows are used for testing

In [None]:
Xtrain = X[:-100,]
Ytrain = Y[:-100,]
Xtest = X[-100:,]
Ytest = Y[-100:,]
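
An alternative to slicing by hand is scikit-learn's `train_test_split`, which can also keep the class balance the same in both parts via `stratify`. The sketch below uses small synthetic arrays (`X_demo`, `y_demo` are made up); with the real data one would presumably pass `X`, `Y` and `test_size=100`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.random((20, 3))              # 20 synthetic samples, 3 features
y_demo = np.array([0] * 10 + [1] * 10)    # balanced synthetic labels

# An integer test_size means "this many samples", matching the hold-out-100 idea.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=4, stratify=y_demo, random_state=0
)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes a single split reproducible; leaving it unset gives a different split on each run, as with the `shuffle` call above.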

In [None]:
model = LogisticRegression()
model.fit(Xtrain, Ytrain)
print('Train accuracy:', model.score(Xtrain, Ytrain))
print('Test accuracy:', model.score(Xtest, Ytest))

Train accuracy: 0.7768421052631579
Test accuracy: 0.8
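
With only 100 test rows, a single test accuracy moves noticeably from one shuffle to the next. One way to compare classifiers more steadily is k-fold cross-validation. The sketch below is an assumption-laden illustration: it uses synthetic count data rather than the review vectors, and `MultinomialNB` is just one example of an alternative classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=2.0, size=(60, 5)).astype(float)   # synthetic word counts
y_demo = np.array([0, 1] * 30)
X_demo[y_demo == 1, 0] += 3   # make feature 0 informative for class 1

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "nb": MultinomialNB(),
}
# 5-fold cross-validated accuracy for each candidate classifier.
cv_scores = {
    name: cross_val_score(clf, X_demo, y_demo, cv=5)
    for name, clf in candidates.items()
}
for name, scores in cv_scores.items():
    print(name, scores.mean())
```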


List the positive or negative weight of each word<br>
Try different threshold values!<br>

In [None]:
threshold = 0.5
for word, index in word_index_map.items():
    weight = model.coef_[0][index]
    if weight > threshold or weight < -threshold:
        print(word, weight)

unit -0.7225767382226405
bad -0.7590355255469529
cable 0.7991942864440522
time -0.7389171600022015
've 0.6691197623294667
month -0.8217497172388112
pro 0.5024137089076871
sound 0.8885182281196364
lot 0.7987790201174184
you 0.9014058014310883
n't -2.069762684449128
easy 1.7036181463185287
quality 1.4179503887593552
company -0.5359627707634567
item -1.0078918224343958
wa -1.4990551236759795
perfect 1.0019288880051633
fast 0.8259036307619256
ha 0.8116915236691564
price 2.710783810197265
value 0.5715639161795052
money -0.9403280877072396
memory 0.9487261566527143
picture 0.5374321460033797
buy -0.8442773205197956
bit 0.6442904103961374
happy 0.6042895453233155
pretty 0.7891636394141085
doe -1.297952225680561
highly 1.0256824909491744
recommend 0.6940276845229171
customer -0.676210878584801
support -0.889815712021811
little 0.8741606212532931
returned -0.7855742411429959
excellent 1.360673773944718
love 1.187474185393386
home 0.5297641857862257
useless -0.501477371377712
week -0.74913172144
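
The list above comes out in vocabulary order, which makes the strongest words hard to spot. A small sketch that sorts the selected words by weight instead; `top_weighted_words`, `toy_map`, and `toy_coef` are illustrative names with made-up values, and with the fitted model one would presumably pass `word_index_map` and `model.coef_[0]`.

```python
import numpy as np

def top_weighted_words(word_index_map, coef, threshold=0.5):
    """Return (word, weight) pairs with |weight| > threshold, most negative first."""
    pairs = [
        (word, float(coef[index]))
        for word, index in word_index_map.items()
        if abs(coef[index]) > threshold
    ]
    return sorted(pairs, key=lambda pair: pair[1])

toy_map = {"great": 0, "price": 1, "bad": 2, "cable": 3}
toy_coef = np.array([1.2, 0.3, -0.9, 0.6])   # hypothetical weights
ranked = top_weighted_words(toy_map, toy_coef)
for word, weight in ranked:
    print(word, weight)
```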

Find the misclassified examples

In [None]:
preds = model.predict(X)
P = model.predict_proba(X)[:,1]   # p(y=1|x)

Print only the worst ones

In [None]:
minP_whenYis1 = 1
maxP_whenYis0 = 0
wrong_positive_review = None
wrong_negative_review = None
wrong_positive_prediction = None
wrong_negative_prediction = None
for i in range(N):
    p = P[i]
    y = Y[i]
    if y == 1 and p < 0.5:
        if p < minP_whenYis1:
            wrong_positive_review = orig_reviews[i]
            wrong_positive_prediction = preds[i]
            minP_whenYis1 = p
    elif y == 0 and p > 0.5:
        if p > maxP_whenYis0:
            wrong_negative_review = orig_reviews[i]
            wrong_negative_prediction = preds[i]
            maxP_whenYis0 = p

print(f"Most wrong positive review (prob = {minP_whenYis1}, pred = {wrong_positive_prediction}):")
print(wrong_positive_review)
print(f"Most wrong negative review (prob = {maxP_whenYis0}, pred = {wrong_negative_prediction}):")
print(wrong_negative_review)

Most wrong positive review (prob = 0.3506766224268191, pred = 0.0):

A device like this either works or it doesn't.  This one happens to work

Most wrong negative review (prob = 0.6029946186758385, pred = 1.0):

The Voice recorder meets all my expectations and more
Easy to use, easy to transfer great results
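
The search loop above can also be written with NumPy index operations. This is a sketch of the same selection rule (lowest p among positives predicted below 0.5, highest p among negatives predicted above 0.5); `most_wrong_indices` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def most_wrong_indices(P, Y):
    """Return (worst misclassified positive index, worst misclassified negative index).

    An entry is None when that class has no misclassified example.
    """
    P = np.asarray(P, dtype=float)
    Y = np.asarray(Y)
    pos_wrong = np.where((Y == 1) & (P < 0.5))[0]   # positives scored as negative
    neg_wrong = np.where((Y == 0) & (P > 0.5))[0]   # negatives scored as positive
    worst_pos = int(pos_wrong[np.argmin(P[pos_wrong])]) if pos_wrong.size else None
    worst_neg = int(neg_wrong[np.argmax(P[neg_wrong])]) if neg_wrong.size else None
    return worst_pos, worst_neg

toy_P = [0.9, 0.35, 0.2, 0.6, 0.1, 0.45]   # hypothetical p(y=1|x)
toy_Y = [1, 1, 1, 0, 0, 0]
worst = most_wrong_indices(toy_P, toy_Y)
print(worst)   # (2, 3)
```

The returned indices can then be used to look up the matching entries in `orig_reviews`.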

