## Problem

This may sound like a silly question, but here are two sentences:

- 'He saw the was a football in the park.'   
- 'He saw there was a football in the park.'   

Which one is correct?

My task is to use a `First Order Markov Language Model` and a `Second Order Markov Language Model` to decide which of the two sentences is correct.

In [1]:
import nltk
import plotly
from plotly.graph_objs import Scatter, Layout

In [2]:
unigrams = [word.lower() for word in nltk.corpus.brown.words()]

In [3]:
unigrams_freq_dist = nltk.FreqDist(unigrams)
unigrams_counts = unigrams_freq_dist.most_common()

In [4]:
plotly.offline.init_notebook_mode(connected=True)

plotly.offline.iplot({
    "data": [Scatter(x=[k for (k,v) in unigrams_counts[:30] ], y=[v for (k,v) in unigrams_counts[:30] ])],
    "layout": Layout(title="The 30 most frequent unigrams")
})

In [5]:
s1 = 'He saw the was a football in the park.'  
s2 = 'He saw there was a football in the park.'  

tokens_in_s1 = [word.lower() for word in nltk.tokenize.wordpunct_tokenize(s1) ]
tokens_in_s2 = [word.lower() for word in  nltk.tokenize.wordpunct_tokenize(s2) ]

print(tokens_in_s1)

['he', 'saw', 'the', 'was', 'a', 'football', 'in', 'the', 'park', '.']


## First Order Way

In [6]:
bigrams = nltk.ngrams(unigrams,2) # nltk.ngrams returns a generator

bigram_freq_dist = nltk.FreqDist(bigrams)
bigrams_counts = bigram_freq_dist.most_common()

In [7]:
plotly.offline.iplot({
    "data": [Scatter(x=[k[0]+' '+k[1] for (k,v) in bigrams_counts[:30] ], y=[v for (k,v) in bigrams_counts[:30] ])],
    "layout": Layout(title="The 30 most frequent bigrams")
})

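- I can't precompute the conditional probability of every bigram here, because there are too many of them.

The conditional probabilities are estimated by Maximum Likelihood:

> `q(w1|w0) = c(w0,w1)/c(w0)`

where:

- `c(w0,w1)`: the count of the `bigram w0 w1`
- `c(w0)`: the count of the `unigram w0`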
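As a quick check of the estimate above, here is a minimal sketch on a tiny hand-made token list. The corpus and the helper name `q_ml` are illustrative assumptions and are not part of this notebook. It divides raw counts directly, so the ratio is exactly `c(w0,w1)/c(w0)`.

```python
from collections import Counter

# Toy corpus, assumed purely for illustration.
tokens = "he saw the dog and he saw the cat".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def q_ml(w0, w1):
    """Maximum-likelihood estimate q(w1|w0) = c(w0,w1) / c(w0)."""
    if unigram_counts[w0] == 0:
        return 0.0  # w0 never seen: the ratio is undefined, so return 0 here
    return bigram_counts[(w0, w1)] / unigram_counts[w0]

print(q_ml("he", "saw"))   # c(he,saw) = 2, c(he) = 2
print(q_ml("the", "dog"))  # c(the,dog) = 1, c(the) = 2
print(q_ml("saw", "cat"))  # unseen bigram, so the estimate is 0
```

Any bigram that never occurs in the corpus gets probability 0 under this estimate, no matter how common its two words are.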

In [8]:
print(
    [(bigram, bigram_freq_dist.freq(bigram)/ unigrams_freq_dist.freq(bigram[0])) for bigram in nltk.ngrams(tokens_in_s2,2) ]
)

[(('he', 'saw'), 0.00974026812842305), (('saw', 'there'), 0.0), (('there', 'was'), 0.2100441691564777), (('was', 'a'), 0.07916461224050571), (('a', 'football'), 0.0002586766616558999), (('football', 'in'), 0.0), (('in', 'the'), 0.2823735852574503), (('the', 'park'), 0.0002286663586193746), (('park', '.'), 0.1170213773726854)]


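We hit a problem: several `bigrams` have `c(w0,w1)/c(w0)` equal to 0, which makes the probability of the whole sentence 0.

Next, consider a `discounting method`, with `beta = 0.5`:

- `q(w1|w0) = (c(w0,w1) - beta)/c(w0)`, if `c(w0,w1) > 0`.
- `q(w1|w0) = alpha(w0) * c(w1)/c()`, if `c(w0,w1) == 0`.
- `c()`: the total count of all `unigrams`.
- `alpha(w0)` is `1 - sum([q(w|w0) for all w with c(w0,w) > 0])`, the probability mass freed by discounting. It depends only on `w0`, and it works out to `(number of distinct w with c(w0,w) > 0) * beta / c(w0)`.

See [Question 4 in week1-question2.pdf](week1-question2.pdf).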

In [9]:
beta = 0.5
def get_bigram_probability(bigram):
    counts = bigram_freq_dist[bigram]
    if counts == 0:
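        # First compute alpha:
        # alpha = num_of_types_of_w1 * beta / c(w0)
        # NOTE: the loop below counts distinct bigrams that END with bigram[1];
        # the definition of alpha calls for distinct words that FOLLOW bigram[0].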
        num_of_types_of_w1 = 0 
        for bi in bigram_freq_dist.keys(): 
            if bi[1] == bigram[1]:
                num_of_types_of_w1 += 1
        alpha = num_of_types_of_w1 * beta / unigrams_freq_dist[bigram[0]]
        
        return alpha*unigrams_freq_dist.freq(bigram[1])
    else:
        return (counts - beta)/unigrams_freq_dist[bigram[0]]
        
        
                

OK, let's see how it does now.

In [10]:
print(
    [(bigram, get_bigram_probability(bigram)) for bigram in nltk.ngrams(tokens_in_s1,2) ]
)

[(('he', 'saw'), 0.009687892752408882), (('saw', 'the'), 0.18039772727272727), (('the', 'was'), 0.00015172523802936162), (('was', 'a'), 0.07911360163015792), (('a', 'football'), 0.0002371200689803837), (('football', 'in'), 1.5371267795889432), (('in', 'the'), 0.28234990860945774), (('the', 'park'), 0.00022152034414257336), (('park', '.'), 0.11170212765957446)]


In [11]:
print(
    [(bigram, get_bigram_probability(bigram)) for bigram in nltk.ngrams(tokens_in_s2,2) ]
)

[(('he', 'saw'), 0.009687892752408882), (('saw', 'there'), 0.0015217121716305311), (('there', 'was'), 0.20986070381231672), (('was', 'a'), 0.07911360163015792), (('a', 'football'), 0.0002371200689803837), (('football', 'in'), 1.5371267795889432), (('in', 'the'), 0.28234990860945774), (('the', 'park'), 0.00022152034414257336), (('park', '.'), 0.11170212765957446)]
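Note that `('football', 'in')` gets 1.537 in both outputs above, which cannot be a probability. A likely cause is that the loop in `get_bigram_probability` counts bigrams whose second word equals `w1`, whereas alpha is defined over the words seen after `w0`. The sketch below, on a toy corpus with hypothetical helper names (`followers`, `alpha`, `q_discounted`), counts the continuations of `w0` instead. It also renormalizes the unigram term over the words not seen after `w0`, as in Katz back-off, so that the probabilities for a given `w0` sum to 1. That renormalization goes beyond the formula stated in this notebook and is an assumption of the sketch. The sketch further assumes that `w0` occurs in the corpus.

```python
from collections import Counter, defaultdict

BETA = 0.5
# Toy corpus, assumed purely for illustration.
tokens = "he saw the dog and he saw the cat in the park".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# followers[w0] = set of distinct words observed right after w0
followers = defaultdict(set)
for w0, w1 in bigram_counts:
    followers[w0].add(w1)

def alpha(w0):
    """Mass freed by discounting: (#distinct followers of w0) * beta / c(w0)."""
    return len(followers[w0]) * BETA / unigram_counts[w0]

def q_discounted(w0, w1):
    """Discounted estimate of q(w1|w0) with back-off to unigram counts."""
    c01 = bigram_counts[(w0, w1)]
    if c01 > 0:
        return (c01 - BETA) / unigram_counts[w0]
    # Unigram mass of the words NOT seen after w0 (the back-off candidates)
    unseen_mass = sum(c for w, c in unigram_counts.items()
                      if w not in followers[w0])
    if unseen_mass == 0:
        return 0.0  # nothing to back off to
    return alpha(w0) * unigram_counts[w1] / unseen_mass

print(alpha("the"))  # 3 distinct followers * 0.5 / c(the) = 3, i.e. 0.5
print(sum(q_discounted("the", w) for w in unigram_counts))  # close to 1
```

With this version every value stays in `[0, 1]`. Running it on the Brown corpus would give numbers different from the outputs shown above.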


In [12]:
from operator import mul
from functools import reduce

In [13]:
p_s1 = reduce(mul,
    [get_bigram_probability(bigram) for bigram in nltk.ngrams(tokens_in_s1, 2)],
    1)

p_s2 = reduce(mul,
    [get_bigram_probability(bigram) for bigram in nltk.ngrams(tokens_in_s2, 2)],
    1)


In [14]:
print(p_s1)
print(p_s2)
# Pick the sentence with the higher probability (s1 on a tie)
best = s2 if p_s1 < p_s2 else s1
print("Aha! The right sentence is '" + best + "'")

5.3420762636186025e-17
6.232823572553894e-16
Aha! The right sentence is 'He saw there was a football in the park.'
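The two products above are already around 1e-16 for a ten-token sentence. For longer sentences a product of many small probabilities can underflow to 0.0, so a common workaround is to sum logarithms instead. The sketch below is self-contained: the probability lists `probs_a` and `probs_b` are made-up values rather than the notebook's outputs, and `log_prob` is a hypothetical helper.

```python
import math

def log_prob(probs):
    """Sum of log-probabilities; -inf if any factor is 0."""
    if any(p <= 0.0 for p in probs):
        return float("-inf")
    return sum(math.log(p) for p in probs)

# Made-up per-bigram probabilities for two candidate sentences.
probs_a = [0.01, 0.18, 0.0002, 0.08]
probs_b = [0.01, 0.0015, 0.21, 0.08]

# log is monotonic, so comparing sums of logs ranks the sentences
# the same way as comparing the products.
print(log_prob(probs_a) < log_prob(probs_b))

# 100 factors of 1e-5 multiply to 0.0 in floating point,
# while the log-space sum is still a finite number.
print(math.prod([1e-5] * 100), log_prob([1e-5] * 100))
```

Because the comparison only needs the ordering, there is no need to convert the sums back with `math.exp`.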


## Second Order Way

## Discussion

- What should be done if the `counts` of a `unigram` are also 0?
- Would treating a `trigram` as `bigram + unigram` lead to a neat program design?
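On the second point, one possible design (an assumption, not something this notebook implements) is to treat every n-gram as a `(context, word)` pair, where the context is a tuple of the preceding `n-1` words. A bigram then has a one-word context and a trigram has a two-word context, and the same counting code serves both orders. The sketch uses a toy corpus and hypothetical names (`ngram_counts`, `q_context`), and it does plain maximum likelihood without discounting.

```python
from collections import Counter

# Toy corpus, assumed purely for illustration.
tokens = "he saw there was a ball and he saw there was a park".split()

def ngram_counts(tokens, n):
    """Count n-grams as tuples of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def q_context(word, context, counts_n, counts_context):
    """ML estimate q(word | context) = c(context + word) / c(context)."""
    context = tuple(context)
    denom = counts_context[context]
    if denom == 0:
        return 0.0  # unseen context; a fuller model would back off here
    return counts_n[context + (word,)] / denom

unigram_c = ngram_counts(tokens, 1)
bigram_c = ngram_counts(tokens, 2)
trigram_c = ngram_counts(tokens, 3)

# First order: one-word context, c(he,saw) / c(he)
print(q_context("saw", ("he",), bigram_c, unigram_c))
# Second order: two-word context, c(was,a,ball) / c(was,a)
print(q_context("ball", ("was", "a"), trigram_c, bigram_c))
```

In this layout a second order model only needs `trigram_c` and `bigram_c` passed in place of `bigram_c` and `unigram_c`.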