## Take a break


## Afternoon session

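Contents:

1. Word segmentation
    * Installing jieba
    * How segmentation works, briefly
    * Downloading a dictionary
2. Bayes classifier
    * Principle
    * Chi-square
    * Sentiment dictionary
3. Implementation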


### Installing jieba

> pip install jieba



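### How segmentation works, briefly

First, learn how likely words are to appear next to each other in context.  
Then decode a hidden Markov model (HMM) with an algorithm such as Viterbi  
to find the segmentation with the highest probability.  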

![img](jieba_procedure.png)

![img](https://upload.wikimedia.org/wikipedia/commons/7/73/Viterbi_animated_demo.gif)

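To segment text, we need to know, for every character, the probability of each of four positional states: S (a single-character word), B (the beginning of a word), M (the middle of a word), and E (the end of a word).

With these probabilities we can compute the most probable segmentation.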
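As a rough illustration of how the S/B/M/E states map back to words, the sketch below turns a tag sequence into a segmentation. The helper name `tags_to_words` and the hand-written tag strings are assumptions made for this example; in practice the tags would come from the Viterbi decoding step.

```python
def tags_to_words(chars, tags):
    """Group characters into words using one S/B/M/E tag per character."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        # S and E both close the current word.
        if tag in ("S", "E"):
            words.append(current)
            current = ""
    if current:
        # Keep any trailing characters if the tags end mid-word.
        words.append(current)
    return words


print(tags_to_words("吉林市長春藥店", "BMEBEBE"))  # ['吉林市', '長春', '藥店']
print(tags_to_words("吉林市長春藥店", "BMESBME"))  # ['吉林市', '長', '春藥店']
```

The same characters give different words under different tag sequences, so choosing the most probable tag sequence is what decides the segmentation.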

![img](viterbi.png)
Image source: [中文斷詞：斷句不要悲劇](http://s.itho.me/techtalk/2017/%E4%B8%AD%E6%96%87%E6%96%B7%E8%A9%9E%EF%BC%9A%E6%96%B7%E5%8F%A5%E4%B8%8D%E8%A6%81%E6%82%B2%E5%8A%87.pdf)

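### Viterbi demo based on the Wikipedia example (for reference only)

[Wikipedia: Viterbi algorithm](https://zh.wikipedia.org/wiki/%E7%BB%B4%E7%89%B9%E6%AF%94%E7%AE%97%E6%B3%95)

To run Viterbi, we first need to know  
the probability of moving from one state to the next (the transition probabilities),  
the probability of starting in each state, and the probability that each state produces each observation (the emission probabilities).  
Wikipedia uses a doctor diagnosing patients as its example.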

In [3]:
states = ('Healthy', 'Fever')
 
observations = ('normal', 'cold', 'dizzy')
 
start_probability = {'Healthy': 0.6, 'Fever': 0.4}
 
transition_probability = {
   'Healthy' : {'Healthy': 0.7, 'Fever': 0.3},
   'Fever' : {'Healthy': 0.4, 'Fever': 0.6},
   }
 
emission_probability = {
   'Healthy' : {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
   'Fever' : {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},
   }

In [4]:
# Helps visualize the steps of Viterbi.
def print_dptable(V):
    print("    ")
    for i in range(len(V)):
        print("%8d" % i, end='')
    print()

    for y in V[0].keys():
        print("%.5s: " % y, end="")
        for t in range(len(V)):
            print("%.7s" % ("%f" % V[t][y]), end=" ")
        print()

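def viterbi(obs, states, start_p, trans_p, emit_p):
    # Pro[t][s]: probability of the most likely state path that ends in state s at step t.
    Pro = [{}]
    # path[s]: the most likely state path that ends in state s.
    path = {}

    # Step 0: start probability times the probability of emitting the first observation.
    for s in states:
        Pro[0][s] = start_p[s] * emit_p[s][obs[0]]
        path[s] = [s]

    # Later steps: extend the best path from each possible previous state.
    for index in range(1, len(obs)):
        Pro.append({})
        newPath = {}
        for newstate in states:
            # Pick the previous state that maximizes path * transition * emission probability.
            prob, state = max(
                (Pro[index-1][oldState] * trans_p[oldState][newstate] * emit_p[newstate][obs[index]], oldState)
                for oldState in states)

            Pro[index][newstate] = prob
            newPath[newstate] = path[state] + [newstate]
        path = newPath

    print_dptable(Pro)
    # The answer is the path that ends in the most probable final state.
    prob, state = max((value, key) for key, value in Pro[-1].items())
    return prob, path[state]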

def example():
    return viterbi(observations,
                   states,
                   start_probability,
                   transition_probability,
                   emission_probability)
print(example())

    
       0       1       2
Fever: 0.04000 0.02700 0.01512 
Healt: 0.30000 0.08400 0.00588 
(0.01512, ['Healthy', 'Healthy', 'Fever'])
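To make the table easier to read, here is a hand calculation of each cell, using the same probabilities as the example. It is only a sketch for checking the numbers, and the variable names are made up for this purpose. Each cell is the best previous cell times the transition probability, multiplied by the emission probability of the current observation.

```python
# Column 0 (observation 'normal'): start probability * emission probability.
healthy0 = 0.6 * 0.5
fever0 = 0.4 * 0.1

# Column 1 (observation 'cold'): best way to arrive in each state, then emit 'cold'.
healthy1 = max(healthy0 * 0.7, fever0 * 0.4) * 0.4
fever1 = max(healthy0 * 0.3, fever0 * 0.6) * 0.3

# Column 2 (observation 'dizzy'): same rule, then emit 'dizzy'.
healthy2 = max(healthy1 * 0.7, fever1 * 0.4) * 0.1
fever2 = max(healthy1 * 0.3, fever1 * 0.6) * 0.6

print("Healthy: %.5f %.5f %.5f" % (healthy0, healthy1, healthy2))
print("Fever:   %.5f %.5f %.5f" % (fever0, fever1, fever2))
# Healthy: 0.30000 0.08400 0.00588
# Fever:   0.04000 0.02700 0.01512
```

The largest value in the last column is 0.01512 for Fever, and it was reached from Healthy at the previous step, which is why the returned path ends in Healthy, Fever.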


### Segmentation demo

In [7]:
import jieba, os
print(jieba.lcut('吉林市長春藥店'))


Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\udic\AppData\Local\Temp\jieba.cache
Loading model cost 3.743 seconds.
Prefix dict has been built succesfully.


['吉林市', '長', '春藥店']


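### Downloading a dictionary

The right answer is not ~~春藥店~~ (an aphrodisiac shop)  
but **長春** **藥店** (Changchun, pharmacy).  
The default dictionary does not contain enough words,  
so the algorithm considers this combination very unlikely.  

To improve the result, we need additional dictionaries.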
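The effect of a missing word can be reproduced with a toy model. The sketch below is not jieba's implementation: it is a simplified unigram segmenter with made-up word counts (`base_counts`, `extended_counts`) and a hypothetical helper `segment`, and it leaves out the HMM step that handles unknown words. It only shows that once 長春 is in the dictionary, the highest-probability split changes.

```python
import math


def segment(text, counts):
    """Return the most probable split of text under a unigram word model."""
    total = sum(counts.values())
    n = len(text)
    # best[i] = (log-probability of the best split of text[i:], end of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Unknown single characters are allowed with a count of 1.
            if word in counts or j == i + 1:
                log_p = math.log(counts.get(word, 1) / total)
                candidates.append((log_p + best[j][0], j))
        best[i] = max(candidates)
    # Walk forward through the stored word boundaries.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


# Made-up counts for illustration only.
base_counts = {'吉林市': 10, '長': 10, '春藥': 10, '藥店': 10, '店': 10}
extended_counts = {**base_counts, '長春': 10}

print(segment('吉林市長春藥店', base_counts))      # ['吉林市', '長', '春藥', '店']
print(segment('吉林市長春藥店', extended_counts))  # ['吉林市', '長春', '藥店']
```

With 長春 missing, every split that keeps 長春 together has to treat it as two separate pieces, so a split built around 春藥 wins. Adding one entry is enough to flip the result.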

In [8]:
jieba.load_userdict(os.path.join('', 'dictionary', 'dict.txt.big.txt'))
jieba.load_userdict(os.path.join('', "dictionary", "NameDict_Ch_v2"))
print(jieba.lcut('吉林市長春藥店'))

['吉林市', '長春', '藥店']
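If you only need to add a few words yourself, a user dictionary is a plain text file. According to the jieba README, each line holds a word, an optional frequency, and an optional part-of-speech tag, separated by spaces. The sketch below writes such a file using only the standard library; the helper `write_userdict`, the file name, and the frequencies are assumptions for illustration.

```python
import os
import tempfile


def write_userdict(entries, path):
    """Write (word, freq, tag) tuples, one entry per line, space-separated."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            fields = [word]
            if freq is not None:
                fields.append(str(freq))
            if tag is not None:
                fields.append(tag)
            f.write(" ".join(fields) + "\n")


# Hypothetical entries: the frequencies are made up for this example.
entries = [("長春", 100, "ns"), ("藥店", 100, "n")]
path = os.path.join(tempfile.gettempdir(), "my_userdict.txt")
write_userdict(entries, path)

with open(path, encoding="utf-8") as f:
    print(f.read())
```

The resulting file can then be passed to `jieba.load_userdict(path)`, the same call used in the cell above.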
