## Take a break


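## Afternoon session

Contents:

1. Word segmentation
    * Installing jieba
    * Segmentation in brief
    * Downloading a dictionary
2. Bayes classifier
    * Theory
    * Chi-square
    * Sentiment dictionary
3. Implementation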


### Installing jieba

> pip install jieba



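### Segmentation in brief

First, estimate how likely words are to appear in a given context.  
Then decode an HMM with an algorithm such as Viterbi  
to find the segmentation with the highest probability.  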

![img](jieba_procedure.png)

![img](https://upload.wikimedia.org/wikipedia/commons/7/73/Viterbi_animated_demo.gif)

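To segment text, we need to know for every character:
1. The probability of each of four states: S (a single-character word), B (beginning of a word), M (middle), and E (end)

With these probabilities we can compute the most probable segmentation.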
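As a small illustration of the four states, the sketch below converts a segmented sentence into S/B/M/E tags and back. The helper names `to_bmes` and `from_bmes` are made up for this example and are not part of jieba.

```python
def to_bmes(words):
    """Tag every character of every word with S, B, M or E."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append('S')  # a single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(['B'] + ['M'] * (len(word) - 2) + ['E'])
    return tags


def from_bmes(text, tags):
    """Rebuild the word list from a character string and its tags."""
    words, current = [], ''
    for char, tag in zip(text, tags):
        current += char
        if tag in ('S', 'E'):  # S and E both close a word
            words.append(current)
            current = ''
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


words = ['吉林市', '長春', '藥店']
tags = to_bmes(words)
print(tags)
print(from_bmes(''.join(words), tags))
```

Segmenting a sentence is then the same problem as choosing the most probable tag sequence for its characters.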

![img](viterbi.png)
Image from [中文斷詞：斷句不要悲劇](http://s.itho.me/techtalk/2017/%E4%B8%AD%E6%96%87%E6%96%B7%E8%A9%9E%EF%BC%9A%E6%96%B7%E5%8F%A5%E4%B8%8D%E8%A6%81%E6%82%B2%E5%8A%87.pdf)

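### Viterbi demo based on the Wikipedia example (for reference only)

[Wikipedia: Viterbi algorithm](https://zh.wikipedia.org/wiki/%E7%BB%B4%E7%89%B9%E6%AF%94%E7%AE%97%E6%B3%95)

To use Viterbi,  
we first need the probability of moving from one state to the next,  
as well as the probability of each state itself (the start probability) and of each observation given a state (the emission probability).  
Wikipedia uses a doctor diagnosing patients as the example.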

In [3]:
states = ('Healthy', 'Fever')
 
observations = ('normal', 'cold', 'dizzy')
 
start_probability = {'Healthy': 0.6, 'Fever': 0.4}
 
transition_probability = {
   'Healthy' : {'Healthy': 0.7, 'Fever': 0.3},
   'Fever' : {'Healthy': 0.4, 'Fever': 0.6},
   }
 
emission_probability = {
   'Healthy' : {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
   'Fever' : {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},
   }

In [4]:
# Helps visualize the steps of Viterbi.
def print_dptable(V):
    print("    ")
    for i in range(len(V)):
        print("%8d" % i, end='')
    print()

    for y in V[0].keys():
        print("%.5s: " % y, end="")
        for t in range(len(V)):
            print("%.7s" % ("%f" % V[t][y]), end=" ")
        print()

def viterbi(obs, states, start_p, trans_p, emit_p):
    Pro = [{}]
    path = {}

    for s in states:
        Pro[0][s] = start_p[s] * emit_p[s][obs[0]]
        path[s] = [s]

    for index in range(1, len(obs)):
        Pro.append({})
        newPath = {}
        for newstate in states:
            prob, state = max([ (Pro[index-1][oldState] * trans_p[oldState][newstate] * emit_p[newstate][obs[index]], oldState) for oldState in states])

            Pro[index][newstate] = prob
            newPath[newstate] = path[state] + [newstate]
        path = newPath

    print_dptable(Pro)
    prob, state = max([(value, key) for key, value in Pro[-1].items()])
    return prob, path[state]

def example():
    return viterbi(observations,
                   states,
                   start_probability,
                   transition_probability,
                   emission_probability)
print(example())

    
       0       1       2
Fever: 0.04000 0.02700 0.01512 
Healt: 0.30000 0.08400 0.00588 
(0.01512, ['Healthy', 'Healthy', 'Fever'])
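The printed table can be checked by hand. Writing $\pi$ for `start_probability`, $a$ for `transition_probability`, and $b$ for `emission_probability` (symbols chosen here for illustration only), the recurrence that `viterbi` implements and the values it produces for the observations normal, cold, dizzy are:

```latex
\begin{aligned}
V_0(s) &= \pi_s \, b_s(o_0), \qquad
V_t(s) = \max_{s'} \; V_{t-1}(s') \, a_{s's} \, b_s(o_t) \\
V_0(\text{Healthy}) &= 0.6 \times 0.5 = 0.30, \qquad
V_0(\text{Fever}) = 0.4 \times 0.1 = 0.04 \\
V_1(\text{Healthy}) &= \max(0.30 \times 0.7,\; 0.04 \times 0.4) \times 0.4 = 0.084 \\
V_1(\text{Fever}) &= \max(0.30 \times 0.3,\; 0.04 \times 0.6) \times 0.3 = 0.027 \\
V_2(\text{Healthy}) &= \max(0.084 \times 0.7,\; 0.027 \times 0.4) \times 0.1 = 0.00588 \\
V_2(\text{Fever}) &= \max(0.084 \times 0.3,\; 0.027 \times 0.6) \times 0.6 = 0.01512
\end{aligned}
```

The largest final value, 0.01512, is reached through Healthy, Healthy, Fever, which is the path the function returns.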


### Segmentation demo

In [7]:
import jieba, os
print(jieba.lcut('吉林市長春藥店'))


Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\udic\AppData\Local\Temp\jieba.cache
Loading model cost 3.743 seconds.
Prefix dict has been built succesfully.


['吉林市', '長', '春藥店']


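### Downloading a dictionary 

The answer is not ~~春藥店~~,  
it is **長春** **藥店** (Changchun, pharmacy).  
But the default dictionary has not collected enough words,  
so the algorithm considers this combination very unlikely.  

To improve the result, we need an additional dictionary.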

In [8]:
jieba.load_userdict(os.path.join('', 'dictionary', 'dict.txt.big.txt'))
jieba.load_userdict(os.path.join('', "dictionary", "NameDict_Ch_v2"))
print(jieba.lcut('吉林市長春藥店'))

['吉林市', '長春', '藥店']
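To see why extra dictionary entries change the result, here is a toy maximum-probability segmenter over a word-frequency table. It is only a sketch of the idea, not jieba's implementation; the `segment` function and all frequencies are invented for illustration.

```python
import math


def segment(text, freq):
    """Return the segmentation with the highest product of word probabilities."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (log-probability of the best split of text[i:], end of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # unknown single characters get a pseudo-count of 1 so a path always exists
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count > 0:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


toy_freq = {'吉林市': 50}                          # hypothetical small dictionary
bigger_freq = {**toy_freq, '長春': 70, '藥店': 40}  # same dictionary plus extra words

before = segment('吉林市長春藥店', toy_freq)
after = segment('吉林市長春藥店', bigger_freq)
print(before)
print(after)
```

With the small table the unknown part falls apart into single characters; once 長春 and 藥店 are present, the path through them has the highest probability.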


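### Bayes classifier

#### Theory

Bayes' theorem follows directly from the familiar definition of conditional probability.  

First:
![img](貝氏1.png)

The same holds with the roles reversed:
![img](貝氏2.png)

So the two expressions are related as follows:
![img](貝氏3.png)

This relation can be used for classification.  
In plain language, the formula says:
1. The probability that a text belongs to a class, given that these words appear == (the probability that these words appear, given the class) * the probability of the class / the probability that these words appear
![img](貝氏4.png)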
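For readers without the images, the relation can be written out in its standard form. The symbols $c$ (a class) and $w$ (the observed words) are notation assumed here and may differ from the figures.

```latex
\begin{aligned}
P(c \cap w) &= P(c \mid w) \, P(w) \\
P(c \cap w) &= P(w \mid c) \, P(c) \\
P(c \mid w) &= \frac{P(w \mid c) \, P(c)}{P(w)}
\end{aligned}
```

Because $P(w)$ is the same for every class, comparing classes only requires the numerator $P(w \mid c) \, P(c)$.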

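### The problem is...

<mark style='color:red'>The probability that these words appear, given the class</mark>  
e.g. compute the probability that 好棒棒, 廠廠, 三寶, and 酸民 all appear together in a negative sentence.  
If the training data contains no sentence in which <mark style='color:red'>好棒棒, 廠廠, 三寶, and 酸民</mark> all appear together,  
the estimated probability for the negative class is 0.  
It is 0 for the positive class as well (I doubt a positive sentence would ever mention 三寶).  
The final decision then degenerates into a guess (accuracy drifts toward 0.5).  

![img](naiveB.png)
![img](naiveB1.png)

So what if we drop the <mark style='color:red'>appear together</mark> constraint?  
Assume that, within a class, the words occur independently of one another.  
Then the formula can be rewritten as
![img](naiveB2.png)
![img](naiveB3.png)

This is the Naive Bayes classifier used today.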
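The effect of the independence assumption can be seen on a few made-up sentences. This is a sketch with toy data, not the workshop corpus, and the function names are hypothetical. It compares the joint estimate, which needs a sentence containing every word at once, with the naive estimate, which multiplies per-word probabilities. The scores are unnormalized: the shared denominator is left out.

```python
# Toy, already-segmented training sentences (invented for illustration).
train = {
    'neg': [['好棒棒', '廠廠'], ['三寶', '酸民'], ['好棒棒', '三寶'], ['廠廠', '酸民', '好棒棒']],
    'pos': [['開心', '好棒棒'], ['幸運', '開心'], ['星空', '幸運']],
}
query = ['好棒棒', '廠廠', '三寶', '酸民']


def joint_likelihood(sentences, words):
    """P(all words appear together | class), counted over whole sentences."""
    hits = sum(1 for s in sentences if all(w in s for w in words))
    return hits / len(sentences)


def naive_likelihood(sentences, words):
    """Product of per-word P(word appears | class): the independence assumption."""
    result = 1.0
    for w in words:
        result *= sum(1 for s in sentences if w in s) / len(sentences)
    return result


total = sum(len(sentences) for sentences in train.values())
for label, sentences in train.items():
    prior = len(sentences) / total
    print(label,
          prior * joint_likelihood(sentences, query),
          prior * naive_likelihood(sentences, query))
```

The joint estimate is 0 for both classes, so it cannot separate them, while the naive estimate is positive for `neg` only. A word never seen in a class still zeroes the product, which is why implementations commonly add smoothing.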

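First we write two helper functions  
that preprocess the data for us:
1. create_Mainfeatures:
    * Flattens the positive data and the negative data into word lists
    * Counts how often each word appears
    * Applies the chi-square formula: a word that appears disproportionately often in the positive corpus or in the negative corpus is treated as a sentiment word
    * Collects the sentiment words into a dictionary and returns it -> this is bestMainFeatures
2. CutAndrmStopWords:
    * Takes a sentence as input
    * Segments it with jieba
    * Removes stopwords
    * Returns the result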
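The chi-square score in the steps above can be sketched in plain Python for a single word. This is the textbook statistic for a 2x2 table, written here for illustration; the function name and the counts are made up, and the arguments follow the same order as the `chi_sq` call in the code below (word count in the class, total word count, class size, corpus size).

```python
def chi_square(word_in_class, word_total, class_total, grand_total):
    """Chi-square statistic of the 2x2 table (word / other words) x (class / other class)."""
    a = word_in_class                 # the word, inside the class
    b = word_total - word_in_class    # the word, inside the other class
    c = class_total - word_in_class   # other words, inside the class
    d = grand_total - a - b - c       # other words, inside the other class
    numerator = grand_total * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# A word seen 32 times in a corpus of 2000 tokens, half of them positive.
skewed = chi_square(30, 32, 1000, 2000)  # 30 of the 32 occurrences are positive
even = chi_square(16, 32, 1000, 2000)    # occurrences split evenly between classes
print(skewed, even)
```

A word that is spread evenly across both classes scores 0, while a word concentrated in one class scores high, so sorting by this score surfaces the sentiment words.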

In [11]:
import itertools, pickle, json, sys
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def create_Mainfeatures(pos_data, neg_data, BestFeatureVec):
    posWords = list(itertools.chain(*pos_data)) # flatten the list of word lists into one flat list
    negWords = list(itertools.chain(*neg_data)) # same for the negative data

    # bigram
    bigram_finder = BigramCollocationFinder.from_words(posWords)
    posBigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 5000)
    bigram_finder = BigramCollocationFinder.from_words(negWords)
    negBigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 5000)
    posWords += posBigrams # single words plus bigram collocations
    negWords += negBigrams

    word_fd = FreqDist() # frequency of every word over the whole corpus
    cond_word_fd = ConditionalFreqDist() # word frequencies split into positive and negative text
    for word in posWords:
        word_fd[word] += 1
        cond_word_fd['pos'][word] += 1
    for word in negWords:
        word_fd[word] += 1
        cond_word_fd['neg'][word] += 1

    pos_word_count = cond_word_fd['pos'].N() # number of word occurrences in positive text
    neg_word_count = cond_word_fd['neg'].N() # number of word occurrences in negative text
    total_word_count = pos_word_count + neg_word_count

    word_features = {}
    for word, freq in word_fd.items():
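        pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count) # chi-square statistic of the word against the positive class; other measures such as mutual information could be used here instead
        neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count) # same for the negative class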
        word_features[word] = pos_score + neg_score

    def find_best_words(number):
        best = sorted(word_features.items(), key=lambda x: -x[1])[:number] # sort words by score, highest first; number is the feature dimension and can be tuned
        return set(w for w, s in best)

    best = find_best_words(BestFeatureVec)
    with open('bestMainFeatures.pickle.{}'.format(BestFeatureVec), 'wb') as f: # close the file once the features are saved
        pickle.dump(best, f)
    return best

import jieba.posseg as pseg
import jieba, os

BASEDIR = os.path.dirname('.')
stopwords = json.load(open(os.path.join(BASEDIR, 'stopwords', 'stopwords.json'), 'r', encoding='utf-8')) # read as UTF-8 regardless of the platform default
jieba.load_userdict(os.path.join(BASEDIR, 'dictionary', 'dict.txt.big.txt'))
jieba.load_userdict(os.path.join(BASEDIR, "dictionary", "NameDict_Ch_v2"))
def CutAndrmStopWords(sentence):
    def condition(x):
        x = list(x)
        word, flag = x[0], x[1]
        if len(word) > 1 and flag!='eng' and flag != 'm' and flag !='mq' and word not in stopwords:
            return True
        return False

    result = filter(condition, pseg.cut(sentence))
    result = map(lambda x:list(x)[0], result)
    return list(result)


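On Windows, where `open()` defaults to the cp950 codec, reading these UTF-8 files without `encoding='utf-8'` fails with:

UnicodeDecodeError: 'cp950' codec can't decode byte 0xe2 in position 2: illegal multibyte sequence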

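## The classifier algorithm

We build a class called Swinger.  
Its methods do the following:
1. load:
    * Loads the training data
    * Uses create_Mainfeatures, defined earlier, to extract the best sentiment dictionary (the best main features) from the training data
    * Uses bestMainFeatures to strip each training sentence down to its informative words, then feeds the result to the classifier for training
2. buildTestData:
    * Strips the test data down in the same way
3. best_Mainfeatures:
    * The function that uses bestMainFeatures to strip a sentence down to its informative words
4. score:
    * Evaluates the model on the test data (reported as ROC AUC)
5. swing:
    * The classification API: given a sentence, it uses the model to decide pos or neg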
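One detail of `score` is worth spelling out: it computes the ROC AUC from hard pos/neg predictions rather than from probabilities. With 0/1 predictions the ROC curve has a single interior point, so the area reduces to the average of the true positive rate and the true negative rate. The sketch below shows this in plain Python; the function name and the labels are invented for illustration.

```python
def auc_from_hard_labels(y_true, y_pred):
    """Area under the ROC curve when predictions are already 0 or 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn)  # true positive rate
    fpr = fp / (fp + tn)  # false positive rate
    # Area under the polyline (0, 0) -> (fpr, tpr) -> (1, 1):
    # a triangle plus a trapezoid.
    return fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
print(auc_from_hard_labels(y_true, y_pred))
```

Algebraically the result equals (1 + TPR - FPR) / 2, so a score of 0.5 means the classifier is no better than guessing.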

In [13]:
# -*- coding: utf-8 -*-
import nltk, json, pickle, sys, collections, jieba, os
from random import shuffle
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from nltk.metrics.scores import (accuracy, precision, recall, f_measure, log_likelihood, approxrand)


class Swinger(object):
    """docstring for Swinger"""
    classifier_table = {
        'MultinomialNB':MultinomialNB(),
        'BernoulliNB':BernoulliNB(),
    }
    
    def __init__(self):
        self.train = []
        self.test = []
        self.classifier = ''

    def load(self, model, pos, neg, BestFeatureVec=700):
        BestFeatureVec = int(BestFeatureVec)

        print('load bestMainFeatures failed!!\nstart creating bestMainFeatures ...')

        self.pos_origin = json.load(open(pos, 'r', encoding='utf-8')) # read as UTF-8 regardless of the platform default
        self.neg_origin = json.load(open(neg, 'r', encoding='utf-8'))
        shuffle(self.pos_origin)
        shuffle(self.neg_origin)
        poslen = len(self.pos_origin)
        neglen = len(self.neg_origin)

        # build train and test data.
        self.pos_review = self.pos_origin[:int(poslen*0.9)]
        self.pos_test = self.pos_origin[int(poslen*0.9):]
        self.neg_review = self.neg_origin[:int(neglen*0.9)]
        self.neg_test = self.neg_origin[int(neglen*0.9):]

        self.bestMainFeatures = create_Mainfeatures(pos_data=self.pos_review, neg_data=self.neg_review, BestFeatureVec=BestFeatureVec) # use single words and bigram collocations as features
        print(self.bestMainFeatures)
        # build model
        print('start building {} model!!!'.format(model))

        self.classifier = SklearnClassifier(self.classifier_table[model]) # NLTK's wrapper around scikit-learn estimators
        if len(self.train) == 0:
            print('build training data')
            posFeatures = self.emotion_features(self.best_Mainfeatures, self.pos_review, 'pos')
            negFeatures = self.emotion_features(self.best_Mainfeatures, self.neg_review, 'neg')
            self.train = posFeatures + negFeatures
        self.classifier.train(self.train) # train the classifier
        pickle.dump(self.classifier, open('{}.pickle.{}'.format(model, BestFeatureVec),'wb'))

    def buildTestData(self, pos_test, neg_test):
        pos_test = json.load(open(pos_test, 'r', encoding='utf-8')) # read as UTF-8 regardless of the platform default
        neg_test = json.load(open(neg_test, 'r', encoding='utf-8'))
        posFeatures = self.emotion_features(self.best_Mainfeatures, pos_test, 'pos')
        negFeatures = self.emotion_features(self.best_Mainfeatures, neg_test, 'neg')
        return posFeatures + negFeatures

    def best_Mainfeatures(self, word_list):
        return {word:True for word in word_list if word in self.bestMainFeatures}

    def score(self, pos_test, neg_test):
        from sklearn.metrics import precision_recall_curve
        from sklearn.metrics import roc_curve
        from sklearn.metrics import auc
        # build test data set
        if len(self.test) == 0:
            self.test = self.buildTestData(pos_test, neg_test)

        test, test_tag = zip(*self.test)
        pred = list(map(lambda x:1 if x=='pos' else 0, self.classifier.classify_many(test))) # classify the test set and map the predicted labels to 1/0
        tag = list(map(lambda x:1 if x=='pos' else 0, test_tag))
        # ROC AUC
        fpr, tpr, _ = roc_curve(tag, pred, pos_label=1)
        print("ROC AUC:" + str(auc(fpr, tpr)))
        return auc(fpr, tpr)

    def emotion_features(self, feature_extraction_method, data, emo):
        return list(map(lambda x:[feature_extraction_method(x), emo], data)) # pair each sample's features with its label, emo ('pos' or 'neg')

    def swing(self, sentence):
        sentence = self.best_Mainfeatures(CutAndrmStopWords(sentence))
        return self.classifier.classify(sentence)

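The imports above need numpy, which scikit-learn depends on. When it is not installed, the cell fails with:

ImportError: No module named 'numpy'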

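### MultinomialNB vs. BernoulliNB
Both are variants of Naive Bayes.  
The difference is:
1. Multinomial counts how many times a word appears in a class
2. Bernoulli only records whether the word appears at all

Multinomial is usually the better fit for text classification.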
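The difference between the two views can be shown on a single token list. This is only a sketch of the feature representations, with made-up tokens; it does not train either model.

```python
from collections import Counter

tokens = ['停電', '停電', '星空', '幸運', '停電']

# Multinomial view: how many times each word occurs.
multinomial_features = dict(Counter(tokens))

# Bernoulli view: only whether each word occurs.
bernoulli_features = {word: True for word in set(tokens)}

print(multinomial_features)
print(bernoulli_features)
```

Note that `best_Mainfeatures` in the Swinger class builds `{word: True}` dictionaries, so in this notebook both classifiers are given presence-only features.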

In [None]:
s = Swinger()
s.load('MultinomialNB', pos='pos.json', neg='neg.json', BestFeatureVec=10)
s.score(pos_test='pos.json', neg_test='neg.json')

In [None]:
s = Swinger()
s.load('BernoulliNB', pos='pos.json', neg='neg.json', BestFeatureVec=10)
s.score(pos_test='pos.json', neg_test='neg.json')

In [None]:
s.swing('大停電的夜晚，我很幸運看到了星空')

In [None]:
s.swing('XXX 停電害我不能打電動拉')

## How does the number of features affect accuracy?



In [None]:
import matplotlib.pyplot as plt

multi = []
bernou = []
for num in range(10, 50, 10):
    s = Swinger()
    s.load('MultinomialNB', pos='pos.json', neg='neg.json', BestFeatureVec=num)
    multi.append(s.score(pos_test='pos.json', neg_test='neg.json'))
    
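    s = Swinger() # fresh instance, so train/test features cached for the previous model are not reused
    s.load('BernoulliNB', pos='pos.json', neg='neg.json', BestFeatureVec=num)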
    bernou.append(s.score(pos_test='pos.json', neg_test='neg.json'))

plt.plot(range(10, 50, 10), multi, 'o-', color="y",label="Multinomial")
plt.plot(range(10, 50, 10), bernou, 'o-', color="r",label="Bernoulli")
plt.legend(loc='best')
plt.xlabel("features vectors")
plt.ylabel("AUC")
plt.show()
    
    