### Bayes Classifier

#### Theory

Bayes' theorem follows directly from the familiar definition of conditional probability.  

First:
![img](貝氏1.png)

The same holds with the roles reversed:
![img](貝氏2.png)

So the two expressions are related like this:
![img](貝氏3.png)

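This relation can be used for classification.  
In plain language, the formula says:
1. The probability that a text belongs to a class, given that these words appear = (the probability that these words appear within that class) * (the probability of that class) / (the probability that these words appear)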
![img](貝氏4.png)
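
As a rough illustration of the formula above, the sketch below plugs made-up counts into Bayes' rule. The counts and the function name `posterior` are hypothetical and are not taken from the dataset used in this notebook.

```python
from fractions import Fraction

def posterior(n_class_with_words, n_class, n_with_words, n_total):
    """P(class | words) = P(words | class) * P(class) / P(words)."""
    likelihood = Fraction(n_class_with_words, n_class)  # P(words | class)
    prior = Fraction(n_class, n_total)                  # P(class)
    evidence = Fraction(n_with_words, n_total)          # P(words)
    return likelihood * prior / evidence

# Hypothetical corpus: 100 sentences, 40 of them negative;
# the word occurs in 10 sentences, 8 of which are negative.
p_neg = posterior(n_class_with_words=8, n_class=40, n_with_words=10, n_total=100)
print(p_neg)  # 4/5
```

With these counts the likelihood is 1/5, the prior is 2/5, and the evidence is 1/10, so the posterior is 4/5.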

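### The problem is...

The hard part is <mark style='color:red'>the probability that these words appear within that class</mark>.  
e.g. compute the probability that 好棒棒, 廠廠, 三寶 and 酸民 all appear together in a negative sentence.  
If the training data contains no sentence in which <mark style='color:red'>好棒棒, 廠廠, 三寶 and 酸民</mark> all appear together,  
then the estimated probability that the sentence is negative is 0.  
It is 0 for the positive class as well (I doubt a positive sentence would ever mention 三寶).  
With both classes at 0, the final decision degenerates into a guess (so accuracy approaches 0.5).  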

![img](naiveB.png)
![img](naiveB1.png)

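So what if we drop the <mark style='color:red'>appearing together</mark> constraint?  
Assume that, within a class, the words occur independently of one another.  
Then the formula can be rewritten as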
![img](naiveB2.png)
![img](naiveB3.png)
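
The effect of the independence assumption can be sketched with a toy example. Everything below (the sentences, the query words, and the two function names) is hypothetical and only meant to contrast the two ways of estimating the likelihood.

```python
def joint_likelihood(words, sentences):
    """Fraction of sentences that contain ALL of the given words together."""
    hits = sum(1 for s in sentences if all(w in s for w in words))
    return hits / len(sentences)

def naive_likelihood(words, sentences):
    """Product of per-word fractions, i.e. the independence assumption."""
    prob = 1.0
    for w in words:
        prob *= sum(1 for s in sentences if w in s) / len(sentences)
    return prob

# Hypothetical tokenized negative sentences; no single sentence
# contains both query words.
neg_sentences = [
    {"三寶", "路上"},
    {"酸民", "留言"},
    {"三寶", "危險"},
    {"天氣", "很差"},
]
query = ["三寶", "酸民"]

print(joint_likelihood(query, neg_sentences))  # 0.0
print(naive_likelihood(query, neg_sentences))  # 0.125
```

A word that never occurs in a class would still zero out the product. Additive smoothing, exposed as the `alpha` parameter of scikit-learn's naive Bayes classes, is the usual remedy.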

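This is the NaiveBayes classifier we use today.

First we need to write two helper functions  
that preprocess the data for us:
1. create_Mainfeatures:
    * Chains the positive and the negative data into flat word lists
    * Counts how often each word occurs
    * Scores each word with the chi-squared statistic; a word that occurs disproportionately often in the positive corpus or in the negative corpus is treated as an emotional word
    * Collects the emotional words into a lexicon (a Python set) and returns it -> this is bestMainFeatures
2. CutAndrmStopWords:
    * Takes a sentence as input
    * Segments it with jieba
    * Removes stopwords (and also single-character tokens, English tokens, and numerals)
    * Returns the result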
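
To make the chi-squared step more concrete, here is a minimal sketch of the statistic for a 2x2 table (this word vs. other words, this class vs. the other class). It is a standalone reimplementation written for illustration, not NLTK's code; the function name `chi_sq_2x2` and all counts are hypothetical. The argument order mirrors how `create_Mainfeatures` passes its counts.

```python
def chi_sq_2x2(n_word_in_class, n_word_total, n_class_total, n_total):
    """Pearson chi-squared for the 2x2 table (word vs. rest) x (class vs. rest)."""
    a = n_word_in_class        # this word, in this class
    b = n_word_total - a       # this word, in the other class
    c = n_class_total - a      # other words, in this class
    d = n_total - a - b - c    # other words, in the other class
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denom

# Hypothetical counts: 1000 tokens per class, the word occurs 100 times overall.
skewed = chi_sq_2x2(n_word_in_class=90, n_word_total=100, n_class_total=1000, n_total=2000)
even = chi_sq_2x2(n_word_in_class=50, n_word_total=100, n_class_total=1000, n_total=2000)

print(skewed > even)  # True
print(even)           # 0.0
```

A word split evenly between the two classes scores 0, while a word concentrated in one class scores high, which is why the top-scoring words serve as the emotion lexicon.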

In [7]:
import itertools, pickle, json, sys
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def create_Mainfeatures(pos_data, neg_data, BestFeatureVec):
    posWords = list(itertools.chain(*pos_data)) # flatten the nested list into a one-dimensional list
    negWords = list(itertools.chain(*neg_data)) # likewise

    # bigram
    bigram_finder = BigramCollocationFinder.from_words(posWords)
    posBigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 5000)
    bigram_finder = BigramCollocationFinder.from_words(negWords)
    negBigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 5000)
    posWords += posBigrams # single words plus bigram collocations
    negWords += negBigrams

    word_fd = FreqDist() # frequency of every word across both corpora
    cond_word_fd = ConditionalFreqDist() # word frequencies in the positive and in the negative texts, kept separately
    for word in posWords:
        word_fd[word] += 1
        cond_word_fd['pos'][word] += 1
    for word in negWords:
        word_fd[word] += 1
        cond_word_fd['neg'][word] += 1

    pos_word_count = cond_word_fd['pos'].N() # number of positive tokens
    neg_word_count = cond_word_fd['neg'].N() # number of negative tokens
    total_word_count = pos_word_count + neg_word_count

    word_features = {}
    for word, freq in word_fd.items():
        pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count) # chi-squared score of the word for the positive class; other statistics such as mutual information could be used instead
        neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count) # likewise for the negative class
        word_features[word] = pos_score + neg_score

    def find_best_words(number):
        best = sorted(word_features.items(), key=lambda x: -x[1])[:number] # sort words by score, highest first; number is the feature dimension and can be tuned for the best result
        return set(w for w, s in best)

    best = find_best_words(BestFeatureVec)
    with open('bestMainFeatures.pickle.{}'.format(BestFeatureVec), 'wb') as f: # close the file once the set is written
        pickle.dump(best, f)
    return best

import jieba.posseg as pseg
import jieba, os

BASEDIR = os.path.dirname('.')
stopwords = json.load(open(os.path.join(BASEDIR, 'stopwords', 'stopwords.json'), 'r'))
jieba.load_userdict(os.path.join(BASEDIR, 'dictionary', 'dict.txt.big.txt'))
jieba.load_userdict(os.path.join(BASEDIR, "dictionary", "NameDict_Ch_v2"))
def CutAndrmStopWords(sentence):
    def condition(x):
        x = list(x)
        word, flag = x[0], x[1]
        if len(word) > 1 and flag!='eng' and flag != 'm' and flag !='mq' and word not in stopwords:
            return True
        return False

    result = filter(condition, pseg.cut(sentence))
    result = map(lambda x:list(x)[0], result)
    return list(result)


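## The classifier algorithm

Create a class called Swinger  
Its methods are explained below:
1. load:
    * Loads the training data
    * Uses create_Mainfeatures, defined earlier, to extract the best emotion lexicon (the best main features) from the training data
    * Uses bestMainFeatures to strip each training sentence down to its essential words, then feeds the result to the classifier for training
2. buildTestData:
    * Strips the test data down in the same way
3. best_Mainfeatures:
    * The function that uses bestMainFeatures to strip a sentence down to its essential words
4. score:
    * Evaluates the model on the test data (the code reports ROC AUC rather than plain accuracy)
5. swing:
    * The classification API: given a sentence, it uses the model to decide between pos and neg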
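
The feature-extraction step described above can be sketched in isolation. The function name `to_features`, the feature set, and the tokens below are hypothetical; the dict comprehension and the `[features, label]` pairing follow what `best_Mainfeatures` and `emotion_features` do in the class.

```python
def to_features(word_list, best_features):
    """Keep only the words found in best_features, as a {word: True} dict."""
    return {word: True for word in word_list if word in best_features}

# Hypothetical feature set and tokenized sentence.
best_features = {"噁心", "亂源", "幸運"}
tokens = ["今天", "幸運", "看到", "星空"]

features = to_features(tokens, best_features)
labeled = [features, "pos"]  # the [features, label] shape used for training
print(labeled)  # [{'幸運': True}, 'pos']
```

A sentence that contains none of the selected words yields an empty dict.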

In [8]:
# -*- coding: utf-8 -*-
import nltk, json, pickle, sys, collections, jieba, os
from random import shuffle
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from nltk.metrics.scores import (accuracy, precision, recall, f_measure, log_likelihood, approxrand)


class Swinger(object):
    """docstring for Swinger"""
    classifier_table = {
        'MultinomialNB':MultinomialNB(),
        'BernoulliNB':BernoulliNB(),
    }
    
    def __init__(self):
        self.train = []
        self.test = []
        self.classifier = ''

    def load(self, model, pos, neg, BestFeatureVec=700):
        BestFeatureVec = int(BestFeatureVec)

        print('load bestMainFeatures failed!!\nstart creating bestMainFeatures ...')

        self.pos_origin = json.load(open(pos, 'r'))
        self.neg_origin = json.load(open(neg, 'r'))
        shuffle(self.pos_origin)
        shuffle(self.neg_origin)
        poslen = len(self.pos_origin)
        neglen = len(self.neg_origin)

        # build train and test data.
        self.pos_review = self.pos_origin[:int(poslen*0.9)]
        self.pos_test = self.pos_origin[int(poslen*0.9):]
        self.neg_review = self.neg_origin[:int(neglen*0.9)]
        self.neg_test = self.neg_origin[int(neglen*0.9):]

        self.bestMainFeatures = create_Mainfeatures(pos_data=self.pos_review, neg_data=self.neg_review, BestFeatureVec=BestFeatureVec) # 使用詞和雙詞搭配作為特徵
        print(self.bestMainFeatures)
        # build model
        print('start building {} model!!!'.format(model))

        self.classifier = SklearnClassifier(self.classifier_table[model]) # NLTK's wrapper around scikit-learn classifiers
        if len(self.train) == 0:
            print('build training data')
            posFeatures = self.emotion_features(self.best_Mainfeatures, self.pos_review, 'pos')
            negFeatures = self.emotion_features(self.best_Mainfeatures, self.neg_review, 'neg')
            self.train = posFeatures + negFeatures
        self.classifier.train(self.train) # train the classifier
        pickle.dump(self.classifier, open('{}.pickle.{}'.format(model, BestFeatureVec),'wb'))

    def buildTestData(self, pos_test, neg_test):
        pos_test = json.load(open(pos_test, 'r'))
        neg_test = json.load(open(neg_test, 'r'))
        posFeatures = self.emotion_features(self.best_Mainfeatures, pos_test, 'pos')
        negFeatures = self.emotion_features(self.best_Mainfeatures, neg_test, 'neg')
        return posFeatures + negFeatures

    def best_Mainfeatures(self, word_list):
        return {word:True for word in word_list if word in self.bestMainFeatures}

    def score(self, pos_test, neg_test):
        from sklearn.metrics import precision_recall_curve
        from sklearn.metrics import roc_curve
        from sklearn.metrics import auc
        # build test data set
        if len(self.test) == 0:
            self.test = self.buildTestData(pos_test, neg_test)

        test, test_tag = zip(*self.test)
        pred = list(map(lambda x:1 if x=='pos' else 0, self.classifier.classify_many(test))) # classify the test set and map the predicted labels to 1/0
        tag = list(map(lambda x:1 if x=='pos' else 0, test_tag))
        # ROC AUC
        fpr, tpr, _ = roc_curve(tag, pred, pos_label=1)
        print("ROC AUC:" + str(auc(fpr, tpr)))
        return auc(fpr, tpr)

    def emotion_features(self, feature_extraction_method, data, emo):
        return list(map(lambda x:[feature_extraction_method(x), emo], data)) # pair each feature dict with its label ('pos' or 'neg')

    def swing(self, sentence):
        sentence = self.best_Mainfeatures(CutAndrmStopWords(sentence))
        return self.classifier.classify(sentence)

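### MultinomialNB vs. BernoulliNB
Both are variants of Naive Bayes.  
The difference is:
1. Multinomial counts how many times a word occurs in the class
2. Bernoulli only records whether the word occurs or not

Multinomial is usually the better fit for text classification.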
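
The difference between the two views of a sentence can be sketched with plain Python. The tokens and vocabulary below are hypothetical, and the vectors only illustrate the input each model is designed for, not scikit-learn's internals.

```python
from collections import Counter

tokens = ["好棒棒", "好棒棒", "酸民"]      # hypothetical tokenized sentence
vocabulary = ["好棒棒", "酸民", "三寶"]

counts = Counter(tokens)
multinomial_vector = [counts[w] for w in vocabulary]       # how many times each word occurs
bernoulli_vector = [int(w in counts) for w in vocabulary]  # whether each word occurs at all

print(multinomial_vector)  # [2, 1, 0]
print(bernoulli_vector)    # [1, 1, 0]
```

Note that `best_Mainfeatures` above emits `{word: True}` dicts, so a repeated word collapses into a single entry and both classifiers appear to receive presence-only features in this notebook.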

In [9]:
s = Swinger()
s.load('MultinomialNB', pos='MyPos.json', neg='MyNeg.json', BestFeatureVec=10)
s.score(pos_test='MyPos.json', neg_test='MyNeg.json')

load bestMainFeatures failed!!
start creating bestMainFeatures ...
{'完全', '酒店', '板主', '南部', '律師', '噁心', '亂源', '父母', '這種', '社會'}
start building MultinomialNB model!!!
build training data
ROC AUC:0.608679883946


0.60867988394584138

In [10]:
s.swing('大停電的夜晚，我很幸運看到了星空')

'pos'

In [11]:
s.swing('XXX 停電害我不能打電動拉')

'pos'

In [None]:
s = Swinger()
s.load('MultinomialNB', pos='pos.json', neg='neg.json', BestFeatureVec=50)
s.score(pos_test='pos.json', neg_test='neg.json')

load bestMainFeatures failed!!
start creating bestMainFeatures ...


In [6]:
s.swing('大停電的夜晚，我很幸運看到了星空')

'pos'

In [None]:
s.swing('XXX 停電害我不能打電動拉')

## How does the number of features affect accuracy?



In [None]:
import matplotlib.pyplot as plt

multi = []
bernou = []
for num in range(10, 50, 10):
    s = Swinger()
    s.load('MultinomialNB', pos='pos.json', neg='neg.json', BestFeatureVec=num)
    multi.append(s.score(pos_test='pos.json', neg_test='neg.json'))
    
    s.load('BernoulliNB', pos='pos.json', neg='neg.json', BestFeatureVec=num)
    bernou.append(s.score(pos_test='pos.json', neg_test='neg.json'))

plt.plot(range(10, 50, 10), multi, 'o-', color="y",label="Multinomial")
plt.plot(range(10, 50, 10), bernou, 'o-', color="r",label="Bernoulli")
plt.legend(loc='best')
plt.xlabel("number of features")
plt.ylabel("AUC")
plt.show()
    
    