# Bayesian Mini NLP Project

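## 1. Project Description
Imagine your boss has left you a message from a place with terrible reception. Several words are impossible to make out. You want to fill in the missing words based on recordings of messages he has left you in the past. To do this, we will use Bayes' rule to compute the probability that a given word fits in a blank, given some additional information about the message.
Recall Bayes' rule:
$$ P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} $$
Or, in our case:
$$ P(\text{a specific word} \mid \text{surrounding words}) = \frac{P(\text{surrounding words} \mid \text{a specific word}) \, P(\text{a specific word})}{P(\text{surrounding words})} $$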



## 2. Exercise 1
![title](./info_01.png)

In [1]:
sentence = 'So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?'
# Count the whitespace-separated words in the sentence.
N = len(sentence.split())
print('word count:', N)

word count: 22


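Since "you" appears once and "if" also appears once:
- What is the probability that the word "you" follows the word "if"?
    - "if" appears only once, and it is followed by "you", so the probability is 1.

- If we pick a word from the sentence at random, what is the probability of picking "you"?
    - $\frac{1}{22}$

- If we pick a word from the sentence at random, what is the probability of picking "if"?
    - $\frac{1}{22}$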
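
The three answers above can be checked by counting words and adjacent word pairs. The snippet below is a minimal sketch using only the standard library. The variable names (`unigrams`, `bigrams`, `p_you_given_if`) are illustrative and not part of the original exercise.

```python
from collections import Counter
from fractions import Fraction

sentence = ('So if you could just go ahead and pack up your stuff and move it '
            'down there, that would be terrific, OK?')
words = sentence.split()

# Unigram probabilities: the chance of drawing a given word at random.
unigrams = Counter(words)
p_you = Fraction(unigrams['you'], len(words))
p_if = Fraction(unigrams['if'], len(words))

# Bigram counts over (previous word, current word) pairs.
bigrams = Counter(zip(words, words[1:]))

# P(you | previous word is if) = count(if, you) / count(if followed by anything)
if_followed = sum(c for (prev, _), c in bigrams.items() if prev == 'if')
p_you_given_if = Fraction(bigrams[('if', 'you')], if_followed)

print(p_you, p_if, p_you_given_if)  # prints: 1/22 1/22 1
```

Exact fractions are used here so the results match the hand-computed answers without rounding.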

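## 3. Exercise 2: Maximum Likelihood
Maximum Likelihood Hypothesis

In this exercise, we find the word most likely to follow a given preceding word.

Implement `NextWordProbability` so that it takes a passage of text and a word and returns a dictionary. The dictionary's keys are the words that appear after the given word, and each value is the number of times that key follows it.

You can use the `.split()` method to split the text in `sample_memo` into words separated by whitespace.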

In [2]:
sample_memo = '''
Milt, we're gonna need to go ahead and move you downstairs into storage B. We have some new people coming in, and we need all the space we can get. So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?
Oh, and remember: next Friday... is Hawaiian shirt day. So, you know, if you want to, go ahead and wear a Hawaiian shirt and jeans.
Oh, oh, and I almost forgot. Ahh, I'm also gonna need you to go ahead and come in on Sunday, too...
Hello Peter, whats happening? Ummm, I'm gonna need you to go ahead and come in tomorrow. So if you could be here around 9 that would be great, mmmk... oh oh! and I almost forgot ahh, I'm also gonna need you to go ahead and come in on Sunday too, kay. We ahh lost some people this week and ah, we sorta need to play catch up.
'''

In [3]:

#   Maximum Likelihood Hypothesis
#
#   In this quiz we will find the maximum likelihood word based on the preceding word
#
#   Fill in the NextWordProbability procedure so that it takes in sample text and a word,
#   and returns a dictionary with keys the set of words that come after, whose values are
#   the number of times the key comes after that word.
#   
#   Just use .split() to split the sample_memo text into words separated by spaces.

def NextWordProbability(sampletext, word):
    # Split on whitespace; punctuation stays attached to words (e.g. 'know,').
    word_list = sampletext.split()
    dictionary = dict()
    # Walk over each adjacent (current, following) pair of words.
    for current, following in zip(word_list, word_list[1:]):
        if current == word:
            # Count how many times `following` comes right after `word`.
            dictionary[following] = dictionary.get(following, 0) + 1
    return dictionary

In [4]:
NextWordProbability(sample_memo, 'you')

{'downstairs': 1, 'could': 2, 'know,': 1, 'want': 1, 'to': 3}
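
The dictionary above holds raw counts, not probabilities. A minimal sketch of the remaining step, assuming we simply divide each count by the total and take the largest value as the maximum likelihood guess, could look like this. The helper name `next_word_probabilities` is hypothetical.

```python
def next_word_probabilities(counts):
    """Turn follower counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Counts taken from the output above for the word 'you'.
counts = {'downstairs': 1, 'could': 2, 'know,': 1, 'want': 1, 'to': 3}
probs = next_word_probabilities(counts)

# The maximum likelihood next word is the one with the highest probability.
best = max(probs, key=probs.get)
print(best, probs[best])  # prints: to 0.375
```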

## 4. Exercise 3: Optimal Classifier
![title](./info_02.png)
![title](./info_03.png)



In [30]:
#   Bayes Optimal Classifier
#
#   In this quiz we will compute the optimal label for a second missing word in a row
#   based on the possible words that could be in the first blank
#
#   Finish the procedure, LaterWords(), below
#
#   You may want to import your code from the previous programming exercise!

sample_memo = '''
Milt, we're gonna need to go ahead and move you downstairs into storage B. We have some new people coming in, and we need all the space we can get. So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?
Oh, and remember: next Friday... is Hawaiian shirt day. So, you know, if you want to, go ahead and wear a Hawaiian shirt and jeans.
Oh, oh, and I almost forgot. Ahh, I'm also gonna need you to go ahead and come in on Sunday, too...
Hello Peter, whats happening? Ummm, I'm gonna need you to go ahead and come in tomorrow. So if you could be here around 9 that would be great, mmmk... oh oh! and I almost forgot ahh, I'm also gonna need you to go ahead and come in on Sunday too, kay. We ahh lost some people this week and ah, we sorta need to play catch up.
'''

corrupted_memo = '''
Yeah, I'm gonna --- you to go ahead --- --- complain about this. Oh, and if you could --- --- and sit at the kids' table, that'd be --- 
'''

data_list = sample_memo.strip().split()

words_to_guess = ['ahead', 'could']

def LaterWords(sample, word, distance):
    '''
    @param sample: a sample of text to draw from
    @param word: a word occurring before a corrupted sequence
    @param distance: how many words later to estimate (i.e. 1 for the next word, 2 for the word after that)
    @returns: a single word which is the most likely possibility
    '''
    # Given a word, collect the relative probabilities of possible following words
    # from @sample, reusing NextWordProbability from the maximum likelihood exercise.
    counts = NextWordProbability(sample, word)
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}

    # For each distance beyond 1, evaluate the words that might come after each
    # candidate word, and combine them, weighting by relative probability.
    for _ in range(distance - 1):
        next_probs = {}
        for prev, p in probs.items():
            following = NextWordProbability(sample, prev)
            n = sum(following.values())
            for w, c in following.items():
                next_probs[w] = next_probs.get(w, 0) + p * c / n
        probs = next_probs

    # Return the single most likely word (None if nothing follows).
    return max(probs, key=probs.get) if probs else None

    
print(LaterWords(sample_memo,"ahead",2))

come
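
For `"ahead"`, every occurrence in the memo is followed by `and`, so there is only one candidate for the first blank. In general, the second word is scored by summing over every candidate first word, $P(w_2 \mid w_0) = \sum_{w_1} P(w_2 \mid w_1) \, P(w_1 \mid w_0)$, and this can disagree with greedily picking the single most likely first word. The sketch below illustrates the difference on a made-up toy corpus. The corpus and all function names are assumptions for illustration, not part of the original exercise.

```python
from collections import Counter, defaultdict

def follower_counts(tokens):
    """Map each token to a Counter of the tokens that directly follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def greedy_second_word(table, word):
    """Pick the most likely first word, then its most likely follower."""
    first = table[word].most_common(1)[0][0]
    return table[first].most_common(1)[0][0]

def weighted_second_word(table, word):
    """Sum P(second | first) * P(first | word) over all candidate first words."""
    total = sum(table[word].values())
    scores = Counter()
    for first, c in table[word].items():
        n = sum(table[first].values())
        for second, k in table[first].items():
            scores[second] += (c / total) * (k / n)
    return scores.most_common(1)[0][0]

# Toy corpus: after 's', 'x' is the single most common word (3 of 7),
# but 'y' and 'z' together (4 of 7) both lead to 'q'.
toy = "s x p . s x p . s x p . s y q . s y q . s z q . s z q .".split()
table = follower_counts(toy)

print(greedy_second_word(table, "s"))    # prints: p
print(weighted_second_word(table, "s"))  # prints: q
```

Here the greedy path commits to `x` and returns `p` with probability 3/7, while the weighted sum gives `q` a total of 4/7.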


## 5. Exercise 4: Word Mediation
Which set of words in a memo do you think could help predict what a missing word might be?  
What are some advantages and disadvantages of using more or fewer possible influences in prediction?

## 6. Exercise 5: Joint Distribution Analysis
If you wanted to measure the joint probability distribution of a missing word given its position relative to every other word in the document, how many probabilities would you need to measure? Say the document is N words long.

## 7. Exercise 6: Domain Knowledge Quiz
Given the corpus of text we have from our boss, we might like to identify some things he often says, and use that knowledge to make better predictions. 

What are some statements you see arising multiple times?
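
One rough way to look for recurring statements is to count word n-grams and keep those that appear more than once. The sketch below assumes whitespace tokenization and runs on a two-sentence excerpt of the memo. The function name `repeated_ngrams` and its parameters are hypothetical.

```python
from collections import Counter

def repeated_ngrams(text, n, min_count=2):
    """Return the n-word sequences that occur at least min_count times."""
    tokens = text.split()
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {' '.join(g): c for g, c in grams.items() if c >= min_count}

# Two sentences taken from the sample memo.
excerpt = ("Ahh, I'm also gonna need you to go ahead and come in on Sunday, too... "
           "Ummm, I'm gonna need you to go ahead and come in tomorrow.")

repeated = repeated_ngrams(excerpt, 5)
for phrase, count in repeated.items():
    print(count, phrase)
```

On this excerpt, the repeated 5-grams all lie inside the shared run "gonna need you to go ahead and come in". Because punctuation stays attached to tokens, phrases that differ only in punctuation are counted separately.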

## 8. Exercise 7: Domain Knowledge Fill-In
We've identified the following patterns in our boss's speech:
> 'Gonna need [you] to go ahead and '   
> 'So if you could ... that would be [great, terrific], [ok, okay, mmmk]',    
> 'Oh, and I almost forgot'

Trying to search all regular expressions of length up to 9 with multiple optional parts is computationally infeasible. But if we have these hypotheses to begin with, we can make extremely accurate guesses. For example, fill in the blanks in the following sentence:  

- Yeah, I'm gonna {} you to go {} {} not complain about this. Oh, and if you could {} ahead and sit at the kids' table, that'd be {} .


- gonna {need}
- go {ahead} {and}
- could {go} ahead
- be {great, terrific, ok, okay, mmmk}
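
The fill-in above can be sketched in code by sliding each known phrase across the sentence and filling blanks wherever the surrounding words agree. This is a simplified illustration, not a general pattern matcher. The phrase list is a hand-picked reduction of the patterns above (with `great` chosen arbitrarily among the listed options), `---` stands in for each blank, and the function name `fill_blanks` is hypothetical.

```python
def fill_blanks(tokens, phrases, blank='---'):
    """Fill blanks where a known phrase agrees with the visible words."""
    tokens = list(tokens)
    for phrase in phrases:
        words = phrase.split()
        for start in range(len(tokens) - len(words) + 1):
            window = tokens[start:start + len(words)]
            has_blank = blank in window
            # Require at least one visible word so an all-blank window never matches.
            has_context = any(t != blank for t in window)
            agrees = all(t == blank or t == w for t, w in zip(window, words))
            if has_blank and has_context and agrees:
                tokens[start:start + len(words)] = words
    return tokens

corrupted = ("Yeah, I'm gonna --- you to go --- --- not complain about this. "
             "Oh, and if you could --- ahead and sit at the kids' table, "
             "that'd be --- .").split()

# Assumed phrases, derived by hand from the patterns listed above.
phrases = ["gonna need you to go ahead and", "go ahead and", "be great"]

filled = fill_blanks(corrupted, phrases)
print(' '.join(filled))
```

Longer phrases are listed first so that they claim their blanks before shorter, more ambiguous phrases are tried.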