## Problem

It sounds like a silly question. Here are two sentences:

- 'He saw their was a football in the park.'   
- 'He saw there was a football in the park.'   

Which one is correct?

What I need to do is use a `First Order Markov Language Model` and a `Second Order Markov Language Model` to work out which sentence is correct.
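
For reference, the two models factor a sentence $w_1 \dots w_n$ as sketched below. These are the standard definitions, written with the symbol $q$ that this notebook uses later for the estimated conditional probability. No start or stop symbols are assumed, which matches how the code further down multiplies over `nltk.ngrams(tokens, 2)` and `nltk.ngrams(tokens, 3)`.

```latex
\begin{align*}
\text{first order:}\quad  & p(w_1,\dots,w_n) \approx \prod_{i=2}^{n} q(w_i \mid w_{i-1}) \\
\text{second order:}\quad & p(w_1,\dots,w_n) \approx \prod_{i=3}^{n} q(w_i \mid w_{i-2}, w_{i-1})
\end{align*}
```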

In [1]:
import nltk
# plotly is only needed if you want to draw the plots
import plotly
from plotly.graph_objs import Scatter, Layout

In [2]:
unigrams = [word.lower() for word in nltk.corpus.brown.words()]

In [3]:
unigrams_freq_dist = nltk.FreqDist(unigrams)
unigrams_counts = unigrams_freq_dist.most_common()

In [4]:
plotly.offline.init_notebook_mode(connected=True)

plotly.offline.iplot({
    "data": [Scatter(x=[k for (k,v) in unigrams_counts[:30] ], y=[v for (k,v) in unigrams_counts[:30] ])],
    "layout": Layout(title="Top 30 most frequent unigrams")
})

In [5]:
s1 = 'He saw their was a football in the park.'  
s2 = 'He saw there was a football in the park.'  

tokens_in_s1 = [word.lower() for word in nltk.tokenize.wordpunct_tokenize(s1) ]
tokens_in_s2 = [word.lower() for word in  nltk.tokenize.wordpunct_tokenize(s2) ]

print(tokens_in_s1)

['he', 'saw', 'their', 'was', 'a', 'football', 'in', 'the', 'park', '.']


## First Order Way

In [6]:
bigrams = nltk.ngrams(unigrams,2) # generator returned 

bigram_freq_dist = nltk.FreqDist(bigrams)
bigrams_counts = bigram_freq_dist.most_common()

In [None]:
plotly.offline.iplot({
    "data": [Scatter(x=[k[0]+' '+k[1] for (k,v) in bigrams_counts[:30] ], y=[v for (k,v) in bigrams_counts[:30] ])],
    "layout": Layout(title="Top 30 most frequent bigrams")
})

In [8]:
print(
    [(bigram, bigram_freq_dist.freq(bigram)/ unigrams_freq_dist.freq(bigram[0])) for bigram in nltk.ngrams(tokens_in_s2,2) ]
)

[(('he', 'saw'), 0.00974026812842305), (('saw', 'there'), 0.0), (('there', 'was'), 0.2100441691564777), (('was', 'a'), 0.07916461224050571), (('a', 'football'), 0.0002586766616558999), (('football', 'in'), 0.0), (('in', 'the'), 0.2823735852574503), (('the', 'park'), 0.0002286663586193746), (('park', '.'), 0.1170213773726854)]


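We hit a problem: quite a few `bigrams` have `c(w0,w1)/c(w0)` equal to 0, and a single zero factor makes the probability of the whole sentence 0.

Next, consider the `discounting method`, with `beta = 0.5` here:

For convenience, define two sets: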

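- `Set A(w0)`: `{w1 : c(w0,w1) > 0}`. Note first that this is a set of words, i.e. of `unigram`s, not of `bigram`s. The bigram under consideration is `w0 ?`: its first word is fixed, and the second word can be many different things. Every candidate second word for which the bigram has a count greater than zero in the training corpus belongs to this set.

- `Set B(w0)`: `{w1 : c(w0,w1) = 0}`. Again the bigram under consideration is `w0 ?`, with the first word fixed and many candidates for the second. Every candidate second word for which the bigram has a count of zero in the training corpus belongs to this set.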

Now for the `discounting method` itself:

- $q(w_1 \mid w_0) = \dfrac{c(w_0, w_1) - \beta}{c(w_0)}$, if $w_1 \in A(w_0)$.
- $q(w_1 \mid w_0) = \alpha(w_0) \cdot \dfrac{c(w_1)}{c(B(w_0))}$, if $w_1 \in B(w_0)$.
- $c(B(w_0))$ is the sum of the training-corpus counts of all unigrams in $B(w_0)$.
- $\alpha(w_0) = 1 - \sum_{w_1 \in A(w_0)} q(w_1 \mid w_0)$, the probability mass left over after discounting.

See [page 16 of Week1-Required-Reading](Week1-required-reading.pdf).
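
To make the definitions concrete, here is a minimal sketch of the discounting method on a tiny made-up corpus. The corpus and the name `discounted_bigram_prob` are assumptions for illustration only. As a further assumption, $c(w_0)$ is taken to be the number of bigrams that start with $w_0$, so that the probabilities sum to exactly 1.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
tokens = "the dog saw the cat . the cat saw the dog .".split()
beta = 0.5

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))


def discounted_bigram_prob(w0, w1):
    """q(w1 | w0) with absolute discounting and back-off to unigram counts."""
    # Words seen after w0, with their bigram counts: this is set A(w0).
    followers = {b: c for (a, b), c in bigram_counts.items() if a == w0}
    # Assumption: c(w0) is the number of bigrams starting with w0.
    c_w0 = sum(followers.values())
    if w1 in followers:                       # w1 in A(w0)
        return (followers[w1] - beta) / c_w0
    alpha = len(followers) * beta / c_w0      # missing probability mass
    # c(B(w0)): total unigram count of the words never seen after w0.
    c_B = sum(c for w, c in unigram_counts.items() if w not in followers)
    return alpha * unigram_counts[w1] / c_B   # w1 in B(w0)


vocab = sorted(unigram_counts)
total = sum(discounted_bigram_prob("the", w) for w in vocab)
print(discounted_bigram_prob("the", "dog"))  # seen bigram: (2 - 0.5) / 4
print(discounted_bigram_prob("the", "saw"))  # unseen bigram, now non-zero
print(total)                                 # should be 1 up to rounding
```

Every unseen continuation now gets a small non-zero probability, and the distribution over the second word still sums to 1.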

In [9]:
beta = 0.5
def get_bigram_probability(bigram):
    # bigram: (v,w)
    # set A : {w : c(v,w) > 0} 
    # set B : {w : c(v,w) = 0}
    v = bigram[0]
    w = bigram[1]
    c_v_w = bigram_freq_dist[bigram]
    
    if c_v_w > 0: 
        return (c_v_w - beta) / unigrams_freq_dist[v]
    else:
        # (v, *): how many distinct words can follow v?
        num_of_possible_word_after_v = 0
        for bi in bigram_freq_dist.keys(): 
            if bi[0] == v:
                num_of_possible_word_after_v += 1
        alpha = num_of_possible_word_after_v * beta / unigrams_freq_dist[v]
        
        set_A_total_counts = 0 
        for bi in bigram_freq_dist.keys(): 
            if bi[0] == v:
                set_A_total_counts += unigrams_freq_dist[bi[1]]
        
        return alpha*unigrams_freq_dist[w]/(unigrams_freq_dist.N() - set_A_total_counts)
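
Each call to the function above scans every key of `bigram_freq_dist` twice, which gets slow on a corpus the size of Brown. One possible speed-up is to build a lookup table once, mapping each context to the words that follow it. The sketch below uses toy data, and `build_follower_index` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter, defaultdict


def build_follower_index(ngram_counts):
    """Map each context (all words but the last) to {last word: count}."""
    index = defaultdict(dict)
    for ngram, count in ngram_counts.items():
        index[ngram[:-1]][ngram[-1]] = count
    return index


# Toy data (hypothetical). With NLTK one could pass bigram_freq_dist or
# trigram_freq_dist instead, since FreqDist is a Counter subclass.
toy_tokens = "he saw the park . he saw a football .".split()
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))
followers = build_follower_index(toy_bigrams)

# Set A(w0) is now one dictionary lookup instead of a full scan.
set_A = followers[("saw",)]
print(sorted(set_A))
print(len(set_A))  # number of distinct words seen after "saw"
```

Because the context is stored as a tuple, the same helper would also work for trigram counts, with a two-word context such as `("he", "saw")`.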
                

OK, let's see how things look now.

In [10]:
print(
    [(bigram, get_bigram_probability(bigram)) for bigram in nltk.ngrams(tokens_in_s1,2) ]
)

[(('he', 'saw'), 0.009687892752408882), (('saw', 'their'), 0.0014204545454545455), (('their', 'was'), 0.002881783360951596), (('was', 'a'), 0.07911360163015792), (('a', 'football'), 0.0002371200689803837), (('football', 'in'), 0.006179374849610485), (('in', 'the'), 0.28234990860945774), (('the', 'park'), 0.00022152034414257336), (('park', '.'), 0.11170212765957446)]


In [11]:
print(
    [(bigram, get_bigram_probability(bigram)) for bigram in nltk.ngrams(tokens_in_s2,2) ]
)

[(('he', 'saw'), 0.009687892752408882), (('saw', 'there'), 0.0006517597167688712), (('there', 'was'), 0.20986070381231672), (('was', 'a'), 0.07911360163015792), (('a', 'football'), 0.0002371200689803837), (('football', 'in'), 0.006179374849610485), (('in', 'the'), 0.28234990860945774), (('the', 'park'), 0.00022152034414257336), (('park', '.'), 0.11170212765957446)]


In [12]:
from operator import mul
from functools import reduce

In [13]:
p_s1 = reduce(mul,
    [v for (k,v) in  [(bigram, get_bigram_probability(bigram)) for bigram in nltk.ngrams(tokens_in_s1,2) ]]
    ,1)

p_s2 = reduce(mul,
    [v for (k,v) in  [(bigram, get_bigram_probability(bigram)) for bigram in nltk.ngrams(tokens_in_s2,2) ]]
    ,1)


In [14]:
print(p_s1)
print(p_s2)
if p_s1 < p_s2:
    print("Aha! The right sentence is '" + s2 + "'")
else:
    print("Aha! The right sentence is '" + s1 + "'")

3.211772068050607e-20
1.0731852279363508e-18
Aha! The right sentence is 'He saw there was a football in the park.'
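
The two products above are already around 1e-20 for a ten-token sentence. For longer sentences, a common alternative is to add log-probabilities instead of multiplying raw probabilities, which gives the same ranking without the risk of floating-point underflow. Here is a minimal sketch with made-up probabilities; `sentence_log_prob` and the numbers are assumptions for illustration.

```python
import math


def sentence_log_prob(probabilities):
    """Sum of natural logs, i.e. the log of the product of the inputs.

    Every probability must be > 0, otherwise math.log raises ValueError.
    """
    return sum(math.log(p) for p in probabilities)


# Hypothetical per-bigram probabilities for two candidate sentences.
probs_a = [0.01, 0.0014, 0.0029, 0.08]
probs_b = [0.01, 0.00065, 0.21, 0.08]

log_p_a = sentence_log_prob(probs_a)
log_p_b = sentence_log_prob(probs_b)
print(log_p_a, log_p_b)
print("b is more likely" if log_p_b > log_p_a else "a is more likely")
```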


## Second Order Way

In [15]:
trigrams = nltk.ngrams(unigrams,3) # generator returned 
trigram_freq_dist = nltk.FreqDist(trigrams) 
trigram_counts = trigram_freq_dist.most_common()

In [None]:
plotly.offline.iplot({
    "data": [Scatter(x=[k[0]+' '+k[1] + ' ' +k[2] for (k,v) in trigram_counts[:30] ], y=[v for (k,v) in trigram_counts[:30] ])],
    "layout": Layout(title="Top 30 most frequent trigrams")
})

![](plot1.png)

Ha, judging from the plot, leaving the punctuation in really does produce a lot of junk.
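
One order up, the discounted trigram estimate backs off to the discounted bigram estimate $q_D(w \mid v)$ rather than to raw unigram counts. Below is a sketch of the standard back-off form, with $A(u,v) = \{w : c(u,v,w) > 0\}$ and $B(u,v) = \{w : c(u,v,w) = 0\}$ as in the comments of the next cell:

```latex
q_D(w \mid u, v) =
\begin{cases}
\dfrac{c(u,v,w) - \beta}{c(u,v)} & \text{if } w \in A(u,v) \\[2ex]
\alpha(u,v)\,\dfrac{q_D(w \mid v)}{\sum_{w' \in B(u,v)} q_D(w' \mid v)} & \text{if } w \in B(u,v)
\end{cases}
\qquad
\alpha(u,v) = 1 - \sum_{w \in A(u,v)} \frac{c(u,v,w) - \beta}{c(u,v)}
```

Since $q_D(\cdot \mid v)$ sums to 1 over the vocabulary, the sum over $B(u,v)$ equals $1 - \sum_{w' \in A(u,v)} q_D(w' \mid v)$. This is the quantity the code computes as `denominator`.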

In [17]:
# Still using the same beta as above

def get_trigram_probability(trigram):
    # trigram: (u,v,w)
    # set A : {w : c(u,v,w) > 0} 
    # set B : {w : c(u,v,w) = 0} 
    u = trigram[0]
    v = trigram[1]
    w = trigram[2]
    counts_u_v_w = trigram_freq_dist[trigram]
    if counts_u_v_w > 0:      
        return (counts_u_v_w - beta) / bigram_freq_dist[(u,v)]     
    else: 
        # (u, v, *): how many distinct words * (i.e. candidates for w) can follow (u, v)?
        num_of_possible_word_after_uv = 0
        for tri in trigram_freq_dist.keys():
            if tri[0]==u and tri[1]==v :
                num_of_possible_word_after_uv += 1
        if bigram_freq_dist[(u,v)] == 0:
            alpha = 1 # back off to bigram 
        else:
            alpha = num_of_possible_word_after_uv * beta / bigram_freq_dist[(u,v)]  
        
        # FIXME: this is too slow; is there a faster way?
        sum_of_Q_d_over_A = 0
        for tri in trigram_freq_dist.keys():
            if tri[0]==u and tri[1]==v :
                sum_of_Q_d_over_A += get_bigram_probability((v,tri[2]))
        denominator = 1 - sum_of_Q_d_over_A
        
        
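        # Back off to the bigram estimate q(w | v), i.e. the bigram (v, w).
        # The results recorded below were produced with the arguments swapped
        # as (w, v), so rerunning this cell may give different numbers.
        return alpha*get_bigram_probability((v,w))/denominator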
            

In [18]:
print(
    [(tri, get_trigram_probability(tri)) for tri in nltk.ngrams(tokens_in_s1,3) ]
)

[(('he', 'saw', 'their'), 5.458284880267648e-05), (('saw', 'their', 'was'), 0.0002816077858380414), (('their', 'was', 'a'), 2.155636990730761e-05), (('was', 'a', 'football'), 0.0019305019305019305), (('a', 'football', 'in'), 1.278189402105604e-05), (('football', 'in', 'the'), 2.1437452658958712e-05), (('in', 'the', 'park'), 0.0005809128630705395), (('the', 'park', '.'), 0.09375)]


In [19]:
print(
    [(tri, get_trigram_probability(tri)) for tri in nltk.ngrams(tokens_in_s2,3) ]
)

[(('he', 'saw', 'there'), 9.679839314667383e-05), (('saw', 'there', 'was'), 0.0036169128884360672), (('there', 'was', 'a'), 0.2469458987783595), (('was', 'a', 'football'), 0.0019305019305019305), (('a', 'football', 'in'), 1.278189402105604e-05), (('football', 'in', 'the'), 2.1437452658958712e-05), (('in', 'the', 'park'), 0.0005809128630705395), (('the', 'park', '.'), 0.09375)]


In [20]:
p_s1 = reduce(mul,
    [v for (k,v) in  [(trigram, get_trigram_probability(trigram)) for trigram in nltk.ngrams(tokens_in_s1,3) ]]
    ,1)

p_s2 = reduce(mul,
    [v for (k,v) in  [(trigram, get_trigram_probability(trigram)) for trigram in nltk.ngrams(tokens_in_s2,3) ]]
    ,1)


In [21]:
print(p_s1)
print(p_s2)
if p_s1 < p_s2:
    print("Aha! The right sentence is '" + s2 + "'")
else:
    print("Aha! The right sentence is '" + s1 + "'")

9.545471484009647e-30
2.4907429831958207e-24
Aha! The right sentence is 'He saw there was a football in the park.'


## Discussion

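- If `s1` is replaced with `He saw the was a football in the park.`, `s1` is judged to be the correct sentence, so the `Markov Model` has its problems too. Sometimes the last word of a `trigram` or `bigram` happens to have little to do with the words before it; does that mean `grammars` need to be taken into account?
- What should be done when the `counts` of a `unigram` are 0 as well?
- Would treating a `trigram` as `bigram + unigram` lead to some neat ideas in the program design?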
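
On the second question, one common approach (not used anywhere in this notebook) is to replace rare training words with a placeholder token and map every unseen test word to that same token, so that no unigram count is ever 0. Here is a minimal sketch on toy data. The token `<unk>`, the threshold `min_count`, and both helper names are assumptions for illustration.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for rare or unseen words


def build_vocab(tokens, min_count=2):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}


def map_unknown(tokens, vocab):
    """Replace every out-of-vocabulary token with UNK."""
    return [w if w in vocab else UNK for w in tokens]


# Toy training data (hypothetical): "dog" and "bird" occur only once.
train = "the dog saw the cat . the cat saw the bird .".split()
vocab = build_vocab(train)
train_mapped = map_unknown(train, vocab)

# "football" never occurs in training, so it becomes UNK at test time.
test_mapped = map_unknown("the cat saw the football .".split(), vocab)
print(Counter(train_mapped)[UNK])  # UNK now has a non-zero training count
print(test_mapped)
```

The n-gram counts would then be collected from `train_mapped`, so any test word maps to a token whose unigram count is non-zero.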