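# Sentiment analysis
## Sentiment (opinion) analysis uses natural language processing, text analysis, and semantic features to determine the subjective information in a sentence, an article, or an entire text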

# Standard classifier architecture
We bring in a machine learning model to teach the computer to judge sentiment.

<img src="http://www.nltk.org/images/supervised-classification.png" width="500" align="left">

<img src="https://www.dropbox.com/s/kxhf5eadds2jmzq/senti.png?dl=1" width="400" align="left">

<img src="http://www.python-course.eu/images/supervised_learning.png" width="500" align="left">

# The most common classifier
## Naive Bayes classifiers

###  Basic theory: Bayes' theorem

This is the conditional probability rule we all know.  

<img src="https://www.dropbox.com/s/9mwjf5h3e9o2bqx/bayes.png?dl=1" width="200" align="left">

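This relation can be used for classification.  
In plain language, the formula says:
the probability that a text belongs to a class, given that these words appear = (the probability that these words appear within that class) * (the probability of that class) / (the probability that these words appear)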

<img src="https://www.dropbox.com/s/o9xwjo4a2c5gk7j/nb2.png?dl=1" width="500" align="left">
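
As a minimal sketch of the formula above, the following computes the posterior for a single word from toy counts. The counts and the helper name `posterior` are made up for illustration; they do not come from the training data used later.

```python
from fractions import Fraction

def posterior(likelihood, prior, evidence):
    """P(class | words) = P(words | class) * P(class) / P(words)."""
    return likelihood * prior / evidence

# Hypothetical corpus: 40 negative and 60 positive sentences.
# The word appears in 20 negative and 6 positive sentences.
n_neg, n_pos = 40, 60
neg_with_word, pos_with_word = 20, 6
total = n_neg + n_pos

p_neg = Fraction(n_neg, total)                            # P(neg)
p_word_given_neg = Fraction(neg_with_word, n_neg)         # P(word | neg)
p_word = Fraction(neg_with_word + pos_with_word, total)   # P(word)

p_neg_given_word = posterior(p_word_given_neg, p_neg, p_word)
print(p_neg_given_word)  # 10/13
```

With these toy numbers, a sentence containing the word is negative with probability 10/13, which is simply 20 of the 26 sentences that contain it.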

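# Why we call it Naive

### The problem is...

<mark style='color:red'>The probability that these words appear within that class</mark>  
e.g. compute the probability that 好棒棒, 廠廠, 三寶, and 酸民 all appear together in a negative sentence.  
If the training data has no sentence containing <mark style='color:red'>好棒棒, 廠廠, 三寶, and 酸民</mark> together,  
the estimated probability that the sentence is negative is 0.  
The estimate for positive is 0 as well (I doubt a positive sentence would ever mention 三寶).  
The final decision then degenerates into a guess (accuracy approaches 0.5).  

The Naive Bayes classifier addresses this by assuming the words are conditionally independent given the class, so the joint probability becomes a product of per-word probabilities.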

<img src="https://www.dropbox.com/s/qyw4bcw22muz5c7/nb3.png?dl=1" width="500" align="left">

<img src="https://www.dropbox.com/s/9s7gtuj6568efht/nb4.png?dl=1" width="500" align="left">
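
The contrast between the two estimates can be sketched in a few lines. The toy sentences and the helper names `joint_estimate` and `naive_estimate` are hypothetical, and the add-one (Laplace) smoothing is a common choice we assume here, not something the slides above specify.

```python
from collections import Counter

# Hypothetical tokenized negative sentences (toy data).
neg_sentences = [
    ["好棒棒", "廠廠"],
    ["三寶", "酸民"],
    ["廠廠", "三寶"],
]
query = ["好棒棒", "廠廠", "三寶", "酸民"]

def joint_estimate(sentences, words):
    """Fraction of sentences that contain ALL the words at once."""
    hits = sum(1 for s in sentences if all(w in s for w in words))
    return hits / len(sentences)

def naive_estimate(sentences, words, vocab_size):
    """Product of per-word probabilities, with add-one smoothing (assumed)."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    prob = 1.0
    for w in words:
        prob *= (counts[w] + 1) / (total + vocab_size)
    return prob

vocab = {w for s in neg_sentences for w in s}
joint = joint_estimate(neg_sentences, query)
naive = naive_estimate(neg_sentences, query, len(vocab))
print(joint)      # 0.0, no sentence contains all four words
print(naive > 0)  # True, the product of per-word estimates stays positive
```

The joint estimate collapses to zero because no single sentence holds all four words, while the factored estimate still yields a usable, non-zero score.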

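## Implementation highlights (feature extraction)
#### create_Mainfeatures: computing chi-square
* Chain the positive and the negative sentences into flat word lists.
* Count how often each word occurs.
* Apply the chi-square formula: a word that occurs disproportionately often in the positive corpus or in the negative corpus is a sentiment word.
* Collect the sentiment words into a set and return it -> this is bestMainFeatures.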
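
To make the scoring step concrete, here is a plain-Python sketch of the standard Pearson chi-square statistic for a 2x2 table of word versus class, which is the kind of score `BigramAssocMeasures.chi_sq` is used for in this notebook. The helper `chi_sq_2x2` and the counts are hypothetical, written so the example runs without NLTK.

```python
def chi_sq_2x2(word_in_class, word_total, class_total, grand_total):
    """Pearson chi-square for the 2x2 table (word vs. class).

    word_in_class: occurrences of the word inside the class
    word_total:    occurrences of the word in both classes
    class_total:   number of tokens in the class
    grand_total:   number of tokens in both classes
    """
    a = word_in_class                 # the word, in the class
    b = word_total - word_in_class    # the word, in the other class
    c = class_total - word_in_class   # other words, in the class
    d = grand_total - a - b - c       # other words, in the other class
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return grand_total * (a * d - b * c) ** 2 / denom

# Toy counts: 1000 tokens per class, a word seen 32 times in total.
skewed = chi_sq_2x2(word_in_class=30, word_total=32, class_total=1000, grand_total=2000)
even = chi_sq_2x2(word_in_class=16, word_total=32, class_total=1000, grand_total=2000)
print(skewed > even)  # True
print(even)           # 0.0
```

A word split evenly between the two classes scores 0, while a word concentrated in one class scores high, so sorting by this score surfaces the sentiment-bearing words.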

In [None]:
import itertools, pickle, json, sys
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def create_Mainfeatures(pos_data, neg_data, BestFeatureVec, chi_sq):
    # Flatten the tokenized sentences into one word list per class.
    posWords = list(itertools.chain(*pos_data))
    negWords = list(itertools.chain(*neg_data))

    # Count each word overall and per class.
    word_fd = FreqDist()
    cond_word_fd = ConditionalFreqDist()
    for word in posWords:
        word_fd[word] += 1
        cond_word_fd['pos'][word] += 1
    for word in negWords:
        word_fd[word] += 1
        cond_word_fd['neg'][word] += 1

    pos_word_count = cond_word_fd['pos'].N()
    neg_word_count = cond_word_fd['neg'].N()
    total_word_count = pos_word_count + neg_word_count

    # Score every word: chi-square against each class, or raw frequency.
    word_features = {}
    for word, freq in word_fd.items():
        if chi_sq:
            pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
            neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count)
        else:
            pos_score = freq
            neg_score = freq
        word_features[word] = pos_score + neg_score

    def find_best_words(number):
        # Keep the `number` highest-scoring words.
        best = sorted(word_features.items(), key=lambda x: -x[1])[:number]
        return set(w for w, s in best)

    best = find_best_words(BestFeatureVec)
    return best

In [None]:
create_Mainfeatures(pos_data=json.load(open("pos.json", 'r')), neg_data=json.load(open("neg.json", 'r')), BestFeatureVec=10, chi_sq=True)

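## The classifier algorithm

We build a class called Swinger.  
Its methods are explained below.
1. load:
    * Loads the data, shuffles it, and keeps the first 90% of each class for training.
    * Uses the create_Mainfeatures function built earlier to find the best sentiment lexicon in the training data, the best main features.
    * Uses bestMainFeatures to strip each training sentence down to its sentiment words, then feeds the result to the classifier for training.
2. buildTestData:
    * Strips the test data down in the same way.
3. best_Mainfeatures:
    * The function that uses bestMainFeatures to strip a sentence down to its sentiment words.
4. score:
    * Evaluates the model on the test data and returns the ROC AUC.
5. swing:
    * The classification API: given a sentence, it uses the model to decide between pos and neg.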
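
The filtering step shared by `load`, `buildTestData`, and `swing` can be sketched on its own. The lexicon and sentence below are toy values, and `filter_features` is a hypothetical standalone version of the `best_Mainfeatures` method.

```python
def filter_features(word_list, lexicon):
    """Keep only the words found in the sentiment lexicon, as a feature dict."""
    return {word: True for word in word_list if word in lexicon}

lexicon = {"幸運", "星空", "停電"}           # toy stand-in for bestMainFeatures
sentence = ["夜晚", "幸運", "看到", "星空"]   # an already segmented sentence

features = filter_features(sentence, lexicon)
labeled = [features, "pos"]  # the [features, label] pair used as a training example
print(features)
```

Words outside the lexicon are dropped, so the classifier only ever sees the selected sentiment words.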

In [None]:
# -*- coding: utf-8 -*-
import nltk, json, pickle, sys, collections, jieba, os
from random import shuffle
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score
from nltk.metrics.scores import (accuracy, precision, recall, f_measure, log_likelihood, approxrand)


class Swinger(object):
    """Naive Bayes sentiment classifier wrapped in NLTK's SklearnClassifier."""
    classifier_table = {
        'MultinomialNB':MultinomialNB(),
        'BernoulliNB':BernoulliNB(),
    }
    
    def __init__(self, chi_sq):
        self.train = []
        self.test = []
        self.classifier = ''
        self.chi_sq = chi_sq

    def load(self, model, pos, neg, BestFeatureVec=700):
        BestFeatureVec = int(BestFeatureVec)
        self.pos_origin = json.load(open(pos, 'r'))
        self.neg_origin = json.load(open(neg, 'r'))
        shuffle(self.pos_origin)
        shuffle(self.neg_origin)
        poslen = len(self.pos_origin)
        neglen = len(self.neg_origin)

        # build train and test data.
        self.pos_review = self.pos_origin[:int(poslen*0.9)]
        self.pos_test = self.pos_origin[int(poslen*0.9):]
        self.neg_review = self.neg_origin[:int(neglen*0.9)]
        self.neg_test = self.neg_origin[int(neglen*0.9):]

        self.bestMainFeatures = create_Mainfeatures(pos_data=self.pos_review, neg_data=self.neg_review, BestFeatureVec=BestFeatureVec, chi_sq=self.chi_sq)
        print(self.bestMainFeatures)
        # build model
        print('start building {} model!!!'.format(model))

        self.classifier = SklearnClassifier(self.classifier_table[model]) 
        if len(self.train) == 0:
            print('build training data')
            posFeatures = self.emotion_features(self.best_Mainfeatures, self.pos_review, 'pos')
            negFeatures = self.emotion_features(self.best_Mainfeatures, self.neg_review, 'neg')
            self.train = posFeatures + negFeatures
        self.classifier.train(self.train) # train the classifier

    def buildTestData(self, pos_test, neg_test):
        pos_test = json.load(open(pos_test, 'r'))
        neg_test = json.load(open(neg_test, 'r'))
        posFeatures = self.emotion_features(self.best_Mainfeatures, pos_test, 'pos')
        negFeatures = self.emotion_features(self.best_Mainfeatures, neg_test, 'neg')
        return posFeatures + negFeatures

    def best_Mainfeatures(self, word_list):
        return {word:True for word in word_list if word in self.bestMainFeatures}

    def score(self, pos_test, neg_test):
        from sklearn.metrics import precision_recall_curve
        from sklearn.metrics import roc_curve
        from sklearn.metrics import auc
        # build test data set
        if len(self.test) == 0:
            self.test = self.buildTestData(pos_test, neg_test)

        test, test_tag = zip(*self.test)
        pred = list(map(lambda x:1 if x=='pos' else 0, self.classifier.classify_many(test))) # classify the test set and collect the predicted labels
        tag = list(map(lambda x:1 if x=='pos' else 0, test_tag))
        # ROC AUC
        fpr, tpr, _ = roc_curve(tag, pred, pos_label=1)
        print("ROC AUC:" + str(auc(fpr, tpr)))
        return auc(fpr, tpr)

    def emotion_features(self, feature_extraction_method, data, emo):
        return list(map(lambda x:[feature_extraction_method(x), emo], data)) # pair each feature dict with its label (emo is 'pos' or 'neg')

    def swing(self, sentence):
        sentence = self.best_Mainfeatures(CutAndrmStopWords(sentence))
        return self.classifier.classify(sentence)

In [None]:
import jieba.posseg as pseg
import jieba, os

BASEDIR = os.path.dirname('.')
stopwords = json.load(open(os.path.join(BASEDIR, 'stopwords', 'stopwords.json'), 'r'))
jieba.load_userdict(os.path.join(BASEDIR, 'dictionary', 'dict.txt.big.txt'))
jieba.load_userdict(os.path.join(BASEDIR, "dictionary", "NameDict_Ch_v2"))

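def CutAndrmStopWords(sentence):
    def condition(x):
        # x is a (word, flag) pair from jieba's part-of-speech tagger.
        x = list(x)
        word, flag = x[0], x[1]
        # Drop single characters, English tokens, numerals, numeral-classifier compounds, and stopwords.
        if len(word) > 1 and flag!='eng' and flag != 'm' and flag !='mq' and word not in stopwords:
            return True
        return False

    result = filter(condition, pseg.cut(sentence))
    result = map(lambda x:list(x)[0], result)
    return list(result)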

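### MultinomialNB vs. BernoulliNB
Both are variants of Naive Bayes.  
The difference is:
1. Multinomial counts how many times a word appears in a class.
2. Bernoulli only records whether the word appears or not.

Multinomial is usually the better fit for text classification.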
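
The two views of the same sentence can be sketched as follows. The tokens are toy data, and the snippet only illustrates the feature representation each model works from, not the scikit-learn estimators themselves.

```python
from collections import Counter

tokens = ["停電", "停電", "幸運", "停電"]  # toy segmented sentence

# Multinomial view: how many times each word occurs.
multinomial_view = dict(Counter(tokens))

# Bernoulli view: only whether each word occurs.
bernoulli_view = {word: True for word in set(tokens)}

print(multinomial_view)
print(bernoulli_view)
```

Repeating 停電 three times changes the multinomial view but leaves the Bernoulli view untouched.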

## First, use the data you crawled yourselves today

### Version without chi-square

In [None]:
s = Swinger(False)
s.load('MultinomialNB', pos='MyPos.json', neg='MyNeg.json', BestFeatureVec=10)
s.score(pos_test='MyPos.json', neg_test='MyNeg.json')

In [None]:
s.swing('大停電的夜晚，我很幸運看到了星空')

In [None]:
s.swing('XXX 停電害我不能打電動拉')

### Version with chi-square

In [None]:
s = Swinger(True)
s.load('MultinomialNB', pos='MyPos.json', neg='MyNeg.json', BestFeatureVec=10)
s.score(pos_test='pos.json', neg_test='neg.json')

In [None]:
s.swing('大停電的夜晚，我很幸運看到了星空')

In [None]:
s.swing('XXX 停電害我不能打電動拉')

## Using the training data we collected from the Hate board (黑特版), the Happy board (黑皮版), and so on

### Without chi-square

In [None]:
s = Swinger(False)
s.load('MultinomialNB', pos='pos.json', neg='neg.json', BestFeatureVec=50)
s.score(pos_test='pos.json', neg_test='neg.json')

In [None]:
s.swing('大停電的夜晚，我很幸運看到了星空')

In [None]:
s.swing('XXX 停電害我不能打電動拉')

### With chi-square

In [None]:
s = Swinger(True)
s.load('MultinomialNB', pos='pos.json', neg='neg.json', BestFeatureVec=50)
s.score(pos_test='pos.json', neg_test='neg.json')

In [None]:
s.swing('大停電的夜晚，我很幸運看到了星空')

In [None]:
s.swing('XXX 停電害我不能打電動拉')

## How does the number of features affect accuracy?
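
The score compared in this section is the value returned by `score`, a ROC AUC computed from hard `pos`/`neg` predictions. With a single operating point, that area reduces to the mean of the true positive rate and the true negative rate. A minimal sketch in plain Python, with toy labels and a hypothetical helper name `auc_from_hard_labels`:

```python
def auc_from_hard_labels(y_true, y_pred):
    """ROC AUC when predictions are hard 0/1 labels.

    The ROC curve then has a single interior point (FPR, TPR), and the
    trapezoid area under it equals (TPR + TNR) / 2.
    Assumes both classes are present in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return (tp / pos + tn / neg) / 2

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
print(auc_from_hard_labels(y_true, y_pred))  # 0.625
```

A value of 0.5 therefore means the classifier does no better than guessing, which matches the failure mode described earlier.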



In [None]:
import matplotlib.pyplot as plt

multi = []
bernou = []
for num in range(10, 50, 10):
    s = Swinger(True)
    s.load('MultinomialNB', pos='pos.json', neg='neg.json', BestFeatureVec=num)
    multi.append(s.score(pos_test='pos.json', neg_test='neg.json'))
    
    s.load('BernoulliNB', pos='pos.json', neg='neg.json', BestFeatureVec=num)
    bernou.append(s.score(pos_test='pos.json', neg_test='neg.json'))

plt.plot(range(10, 50, 10), multi, 'o-', color="y",label="Multinomial")
plt.plot(range(10, 50, 10), bernou, 'o-', color="r",label="Bernoulli")
plt.legend(loc='best')
plt.xlabel("number of features")
plt.ylabel("AUC")
plt.show()    