# Naive Bayes assignment

## The data


Let's start with some data that will give us examples
of the kind of classes we're looking for.

In [2]:
import nltk
nltk.download('senseval')

[nltk_data] Downloading package senseval to /Users/gawron/nltk_data...
[nltk_data]   Package senseval is already up-to-date!


True

In [3]:
from nltk.corpus import senseval

hard_data = [(i,i.senses[0]) for i in senseval.instances('hard.pos')]

len(hard_data)

4333

We have downloaded some data with over 4,000 examples!  Looking at the first one:

In [4]:
 hard_data[0]

(SensevalInstance(word='hard-a', position=20, context=[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'), (',', ','), ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'), ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'), ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'), ('.', '.'), ("''", "''")], senses=('HARD1',)),
 'HARD1')

That's a data structure (a Python object) with an attribute called `context` that contains an awkwardly represented sentence, because each word is paired with a part of speech tag. Here are the words.

In [12]:
print(' '.join([word for (word,tag) in hard_data[0][0].context]))

`` he may lose all popular support , but someone has to kill him to defeat him and that 's hard to do . ''


The way we created  `hard_data` was to write a Python **list comprehension** (a kind of 
abbreviated `for`-loop) that returned a list of pairs. As a result,
the first member of the list, `hard_data[0]`, is a **pair**. Let's  
look at the two objects in the pair:

In [14]:
ex0 = hard_data[0]
cls,si = ex0[1],ex0[0]
print(cls,si)
 

HARD1 SensevalInstance(word='hard-a', position=20, context=[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'), (',', ','), ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'), ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'), ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'), ('.', '.'), ("''", "''")], senses=('HARD1',))


`HARD1` is the class which we are trying to learn to predict in this classification
task.  The other member of the pair contains the data
we're going to try to classify.
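
The list comprehension that built `hard_data` is shorthand for an explicit loop. The sketch below shows the equivalence on toy data. `ToyInstance` and `toy_instances` are hypothetical stand-ins invented for illustration; they are not part of NLTK and only mimic the `senses` attribute.

```python
from collections import namedtuple

# Hypothetical stand-in for a Senseval instance; only `senses` matters here.
ToyInstance = namedtuple('ToyInstance', ['word', 'senses'])

toy_instances = [
    ToyInstance('hard-a', ('HARD1',)),
    ToyInstance('hard-a', ('HARD3',)),
]

# List comprehension, in the same shape as the one that built hard_data.
pairs_lc = [(i, i.senses[0]) for i in toy_instances]

# Equivalent explicit for-loop.
pairs_loop = []
for i in toy_instances:
    pairs_loop.append((i, i.senses[0]))

# Unpack the first pair into the instance and its class label.
inst, cls = pairs_lc[0]
print(cls)
```

Both forms build the same list of (instance, sense) pairs; the comprehension is just more compact.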

The name I chose, `si`, stands for "senseval instance".  This is a Python
**object** (a special user-defined custom data type) that stores information
appropriate to the data in this corpus.  An `si`-instance has
two attributes of interest:

In [7]:
si.context

[('``', '``'),
 ('he', 'PRP'),
 ('may', 'MD'),
 ('lose', 'VB'),
 ('all', 'DT'),
 ('popular', 'JJ'),
 ('support', 'NN'),
 (',', ','),
 ('but', 'CC'),
 ('someone', 'NN'),
 ('has', 'VBZ'),
 ('to', 'TO'),
 ('kill', 'VB'),
 ('him', 'PRP'),
 ('to', 'TO'),
 ('defeat', 'VB'),
 ('him', 'PRP'),
 ('and', 'CC'),
 ('that', 'DT'),
 ("'s", 'VBZ'),
 ('hard', 'JJ'),
 ('to', 'TO'),
 ('do', 'VB'),
 ('.', '.'),
 ("''", "''")]

In [8]:
si.word

'hard-a'

The attribute `context` holds a sentence which contains the word
`si.word` (`hard-a`, or the adjective "hard") somewhere in it
(the actual position of `si.word` is stored in `si.position`). Notice
the "words" in the sentence are actually pairs of a word and a part of
speech tag.  We will learn more about such tags later in the
course. For now `VB`, `VBZ`, `VBD`, and `VBG` are all kinds of verbs,
and `NN`, `NNS`, `NNP`, and `NNPS` are all kinds of nouns.  For the
purposes of this exercise, the difference between a word and a
word/part of speech pair does not matter.  We will be counting
word/part of speech pairs instead of words.
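
As a rough illustration of how `context` and `position` fit together, here is a sketch on a toy object. `ToyInstance` and `si_toy` are hypothetical stand-ins that only mirror the three attributes named above; the sentence is a fragment of the first corpus example.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the attributes discussed above.
ToyInstance = namedtuple('ToyInstance', ['word', 'position', 'context'])

si_toy = ToyInstance(
    word='hard-a',
    position=2,
    context=[('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'),
             ('to', 'TO'), ('do', 'VB'), ('.', '.')],
)

# The target token sits at index `position` (0-based) in `context`.
target = si_toy.context[si_toy.position]

# Context pairs with the target removed, in case it should not count as evidence.
others = [pair for (k, pair) in enumerate(si_toy.context) if k != si_toy.position]

# Plain words, with the tags dropped.
words = [w for (w, tag) in si_toy.context]
print(target, len(others), ' '.join(words))
```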

Our task is to learn how to predict the sense
of any token of the adjective *hard* based on the words
in the sentence it occurs in.  To make
a prediction, we will use all the words in the training set,
and we will use a Naive-Bayes classifier to choose a sense.
Let's store the list of senses we need to choose between
under the name `senses`.

In [54]:
senses = sorted(list(set(sense for (inst, sense) in hard_data)))
senses

['HARD1', 'HARD2', 'HARD3']

So there are only 3 senses.  Let's get an idea of what their definitions
are:

In [92]:
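def find_examples(data, sense, num_examples):
    # Collect up to num_examples instances labeled with the given sense.
    res = []
    for (inst, s) in data:
        if s == sense:
            res.append(inst)
        if len(res) == num_examples:
            break
    # Return what was found, even if fewer than num_examples exist.
    return res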

def print_examples (ex_list):
    for ex in ex_list:
        for word in ex.context:
          print('{0}_{1}'.format(word[0], word[1]), sep = ' ', end=' ')
        print('\n')

Grab some examples of each sense:

In [93]:
hard1_ex = find_examples(hard_data, 'HARD1', 3)
hard2_ex = find_examples(hard_data, 'HARD2', 3)
hard3_ex = find_examples(hard_data, 'HARD3', 3)

Print some examples of sense `HARD1`:

In [94]:
print_examples(hard1_ex)

``_`` he_PRP may_MD lose_VB all_DT popular_JJ support_NN ,_, but_CC someone_NN has_VBZ to_TO kill_VB him_PRP to_TO defeat_VB him_PRP and_CC that_DT 's_VBZ hard_JJ to_TO do_VB ._. ''_'' 

clever_NNP white_NNP house_NNP ``_`` spin_VB doctors_NNS ''_'' are_VBP having_VBG a_DT hard_JJ time_NN helping_VBG president_NNP bush_NNP explain_VB away_RB the_DT economic_JJ bashing_NN that_IN low-and_JJ middle-income_JJ workers_NNS are_VBP taking_VBG these_DT days_NNS ._. 

i_PRP find_VBP it_PRP hard_JJ to_TO believe_VB that_IN the_DT sacramento_NNP river_NNP will_MD ever_RB be_VB quite_RB the_DT same_JJ ,_, although_IN i_PRP certainly_RB wish_VBP that_IN i_PRP 'm_VBP wrong_JJ ._. 



So this sense could be paraphrased as "difficult".  And it looks like
the word `time` might be one good cue for this sense.

And now `HARD2`:

In [95]:
print_examples (hard2_ex)

keep_VB this_DT one_CD in_IN your_PRP$ drawer_NN for_IN the_DT next_JJ time_NN the_DT boss_NN gives_VBZ you_PRP a_DT hard_JJ time_NN ._. 

she_PRP recommends_VBZ continuing_VBG education_NN courses_NNS ,_, developing_VBG effective_JJ people_NNS skills_NNS and_CC hard_JJ work_NN ._. 

the_DT phrase_NN ``_`` consent_NN of_IN the_DT governed_VBN ''_'' needs_VBZ a_DT hard_JJ look_NN ._. 



Paraphrase this one as "stressful" or "intense".  And finally:

In [96]:
print_examples (hard3_ex)

my_PRP$ companion_NN enjoyed_VBD a_DT healthy_JJ slice_NN of_IN the_DT chocolate_NN mousse_NN cake_NN ,_, made_VBN with_IN a_DT hard_JJ chocolate_NN crust_NN ,_, topping_VBG a_DT sponge_NN cake_NN with_IN either_DT strawberry_NN or_CC raspberry_JJ on_IN the_DT bottom_NN ._. 

``_`` i_PRP feel_VBP that_IN the_DT hard_JJ court_NN is_VBZ my_PRP$ best_JJS surface_NN overall_JJ ,_, "_" courier_NNP said_VBD ._. 

water_NNP becomes_VBZ stiff_JJ and_CC hard_JJ as_IN clear_JJ stone_NN ._. 



 So this sense could be paraphrased "the opposite of **soft**".

## Training

We need to separate training and test data. 

First let's shuffle our data instances.  They're a little
too orderly and we need to make a fair test set:

In [97]:
from sklearn.utils import shuffle   
new_hard_data = shuffle(hard_data, random_state = 42)

This may not reproduce exactly across machines, but I'm
hoping your results will match mine more closely if you
use the same value for the `random_state` argument (42) as I do.


We're going to build a Naive Bayes model that
tells us what sense of **hard** is being used
in sentences in our corpus.  We need to separate
a bundle to train on and another smaller bundle to 
test on:

In [98]:
train_ind = int(9 * round(len(hard_data)/10.))
train_data = new_hard_data[:train_ind]
test_data = new_hard_data[train_ind:]

Here's the total amount of training data:

In [99]:
 len(train_data) 

3897

Nearly 4000 examples.
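
The arithmetic behind the split can be checked without the corpus. This sketch swaps in the standard library's `random` module for `sklearn.utils.shuffle`, so the shuffled order will not match the notebook's; only the cut-point calculation is the same. The integer list `items` is a hypothetical placeholder for the (instance, sense) pairs.

```python
import random

n = 4333                      # corpus size reported above
items = list(range(n))        # toy placeholders for the (instance, sense) pairs

rng = random.Random(42)       # fixed seed so the shuffle is repeatable
shuffled = items[:]           # copy, so the original order is preserved
rng.shuffle(shuffled)

# Same cut-point arithmetic as the cell above: 9 * round(433.3) = 9 * 433.
train_ind = int(9 * round(n / 10.))
train, test = shuffled[:train_ind], shuffled[train_ind:]
print(len(train), len(test))
```

Roughly nine tenths of the data goes to training, and the remaining 436 items are held out for testing.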

Now let's use a Python dictionary to
sort our training data into the three senses. These are
three **classes** that we are learning to recognize:

In [57]:
train_dict = dict((sense,[]) for sense in senses)
for (s_inst, sense) in  train_data:
    train_dict[sense].append(s_inst)

In [61]:
ctr = 0
for sense in senses:
    inc = len(train_dict[sense])
    ctr += inc
    print(sense, inc)
print('Total', ctr)

HARD1 3100
HARD2 459
HARD3 338
Total 3897


Since the target word *hard* occurs only once in each sentence,
we computed the number of instances of each sense by
counting the number of sentences in each data class.
By the same reasoning, the total number of tokens of the word
*hard* is the total number of sentences.

Here's how to start counting words in a single context example:

In [67]:
import nltk
si0 = train_dict[senses[0]][0]
fd = nltk.FreqDist()
fd.update(si0.context)

Now let's look at the results:

In [68]:
fd

FreqDist({('to', 'TO'): 2, ('it', 'PRP'): 1, ('was', 'VBD'): 1, ('so', 'RB'): 1, ('hard', 'JJ'): 1, ('shoot', 'VB'): 1, (',', ','): 1, ('we', 'PRP'): 1, ('lost', 'VBD'): 1, ('our', 'PRP$'): 1, ...})

Each word/tag pair in the context sentence now has a count: most occur once, and `('to', 'TO')` occurs twice.
Okay, now another sentence:

In [70]:
si1 = train_dict[senses[0]][1]
fd.update(si1.context)
fd

FreqDist({('to', 'TO'): 3, ('it', 'PRP'): 2, ('hard', 'JJ'): 2, (',', ','): 2, ('n', 'NN'): 2, ("'t", 'NN'): 2, ('of', 'IN'): 2, ('.', '.'): 2, ('was', 'VBD'): 1, ('so', 'RB'): 1, ...})

Crucially, counts are getting **incremented** by the `update`
method.  Now many words have counts of 2 and 3.
Based on that, here's a loop which computes, for each term
(word) $t_{i}$, the number of times $t_{i}$ occurs in
a sentence in the data for a particular sense:
In [85]:
fd_dict = dict()
for sense in senses:
    fd_dict[sense] = nltk.FreqDist()
    for si in train_dict[sense]:
        fd_dict[sense].update(si.context)

`fd_dict['HARD1']` is now a dictionary which,
for each word key `w`, returns the number of times 
that word `w` occurred in a sentence with sense `HARD1`.
To look up $\text{count}(t_{k}, s)$, where 
$t_{k} = (\text{time}, \text{NN})$ and
$s = \text{HARD1}$, execute:

In [86]:
fd_dict['HARD1'][('time','NN')]

293

The total number of tokens of the word "time" occurring (as a noun) in sentences with
sense `HARD1` is 293.  

The total number of word tokens occurring
in sentences with sense `HARD1` is:

In [88]:
fd_dict['HARD1'].N()

79228
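
The counting steps above can be sketched on toy data with `collections.Counter`, which `nltk.FreqDist` subclasses. The sentences in `toy_train` are hypothetical and are not taken from the corpus.

```python
from collections import Counter

# Toy training data: sense -> list of tagged sentences (hypothetical).
toy_train = {
    'HARD1': [[('hard', 'JJ'), ('to', 'TO'), ('do', 'VB')],
              [('a', 'DT'), ('hard', 'JJ'), ('time', 'NN')]],
    'HARD3': [[('a', 'DT'), ('hard', 'JJ'), ('crust', 'NN')]],
}

toy_counts = {}
for sense, sentences in toy_train.items():
    toy_counts[sense] = Counter()
    for sent in sentences:
        # update() increments existing counts rather than overwriting them.
        toy_counts[sense].update(sent)

# count(t, s) lookup; a Counter returns 0 for unseen keys.
print(toy_counts['HARD1'][('hard', 'JJ')])    # 2
print(toy_counts['HARD3'][('time', 'NN')])    # 0
# Total tokens seen with a sense (the quantity FreqDist.N() reports).
print(sum(toy_counts['HARD1'].values()))      # 6
```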

When you have done that, you basically have a trained
Naive Bayes model.  For each class $s$, you know
how to compute $P(s)$.

For example, to compute $P(\text{HARD1})$, the probability of the HARD1 sense,
you need to know the number of tokens of "hard" that
occurred with sense `HARD1` in the training data, and you need
to know the number of tokens of the word "hard" in all senses. Therefore,

$P(\text{HARD1}) = \frac{\text{count}(\text{HARD1},\, \text{"hard"})}{\sum_{s} \text{count}(s, \,\text{"hard"})}$
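
A minimal sketch of that ratio, using made-up counts for three hypothetical senses `S1`, `S2`, `S3` (these are not the corpus figures):

```python
# Hypothetical sense counts for a toy training set.
toy_sense_counts = {'S1': 5, 'S2': 2, 'S3': 1}

# Denominator: tokens of the target word across all senses.
total = sum(toy_sense_counts.values())

# Prior for each sense: its count divided by the total.
priors = {s: c / total for (s, c) in toy_sense_counts.items()}
print(priors)
print(sum(priors.values()))
```

The priors form a probability distribution over the senses, so they should sum to 1.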

The other Naive Bayes parameters are the conditional probabilities
of the words given the senses:  that is, for each sense $s$,
for each term $t_k$, we need $P(t_{k} \mid s)$.
For this, the additional count needed for an MLE estimate
is the count of each word with each sense, $\text{count}(t_{k}, s)$.
Then for each **word** $t_{k}$ and each sense $s$:

${P}_{\text{mle}}(t_{k} \mid s) = \frac{\text{count}(t_{k}, s)}{\sum_{j} \text{count}(t_{j}, s)}$

We just saw how to look up $\text{count}(t_{k}, s)$.  The denominator is a sum: it adds up $\text{count}(t_{j}, s)$ for every word $t_{j}$ in the
vocabulary, and that sum is just the total number of word tokens occurring
in sentences with sense $s$.  We computed that for `HARD1` above.
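
The MLE formula can be sketched on a toy count table. The counts in `toy_fd` and the helper `p_mle` are hypothetical, introduced only to illustrate the ratio.

```python
from collections import Counter

# Hypothetical word counts for one sense.
toy_fd = Counter({('time', 'NN'): 3, ('hard', 'JJ'): 4, ('a', 'DT'): 1})

def p_mle(term, fd):
    """MLE estimate: count(term, s) divided by the total token count for s."""
    return fd[term] / sum(fd.values())

print(p_mle(('time', 'NN'), toy_fd))   # 3 / 8 = 0.375
print(p_mle(('rock', 'NN'), toy_fd))   # unseen word: 0.0
```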

## Smoothing

The first step in smoothing is as simple as can be.  We basically
have three vocabularies compiled, one for each of three senses. 
We need a total vocabulary
that is the union of all three:

In [89]:
total_V = set(list(fd_dict['HARD1'].keys()) + list(fd_dict['HARD2'].keys()) + 
               list(fd_dict['HARD3'].keys()))

Now we do add-1 smoothing:

In [90]:
sm_fd_dict = dict() 
for sense in senses:
    sm_fd = nltk.FreqDist()
    fd = fd_dict[sense]
    sm_fd_dict[sense] = sm_fd
    for word in total_V:
          sm_fd[word] = fd[word] + 1

Whenever a word `t` did not occur with sense `HARD1`, `fd_dict['HARD1'][t]` will
return the count 0, and the smoothed count `sm_fd_dict['HARD1'][t]` will be 1; as a result no
word in `total_V` will have a count of 0 in `sm_fd_dict['HARD1']`.  It is now
easy to find the new **smoothed count** of the total number of words
"occurring" in sentences with sense `HARD1`.  That is

In [91]:
sm_fd_dict['HARD1'].N()

92735

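which has gone up because add-1 smoothing adds 1 to the count of
every word in `total_V`, including all the words that never occurred 
with sense `HARD1` in the training data. The increase,
92735 - 79228 = 13507, is exactly the size of `total_V`.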
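
Here is a small sketch of the same add-1 step on toy counts, to show that each smoothed total grows by the vocabulary size. The senses `S1`, `S2` and their counts are hypothetical.

```python
from collections import Counter

# Hypothetical per-sense word counts.
toy_fd = {
    'S1': Counter({'time': 3, 'hard': 4}),
    'S2': Counter({'crust': 1, 'hard': 2}),
}

# Total vocabulary: union of the per-sense vocabularies.
toy_V = set().union(*toy_fd.values())

# Add 1 to the count of every vocabulary item, seen or unseen.
toy_sm = {s: Counter({w: fd[w] + 1 for w in toy_V}) for (s, fd) in toy_fd.items()}

for s in toy_fd:
    before = sum(toy_fd[s].values())
    after = sum(toy_sm[s].values())
    # Every vocabulary item gains 1, so the total grows by len(toy_V).
    print(s, before, after, after - before == len(toy_V))
```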

As a result the total word counts and the probability of
each sense will change slightly. For example:

$P(\text{HARD1}) = \frac{\text{sm_count}(\text{HARD1},\, \text{"hard"})}{\sum_{s} \text{sm_count}(s, \,\text{"hard"})}$

## Questions

Turn in a copy of this notebook with answers to the following questions.  You 
should include, as an appendix, cells containing the code you executed
to get your answers.  The material above is intended to help
you produce that code.  Numbers given with
no evidence of how you got them will not be accepted.
That evidence cannot just be a count to be accepted on faith.
For every count you use in a probability calculation, smoothed
or unsmoothed, you must show how you got that count
from the data above.  You should use the code examples provided
above as your guide for how to look up and interpret the data.

Note that question 2 involves some multiplication of
really, really tiny probabilities to get
the joint probability of the sense and the whole sentence.  
If you know a little Python you
can write a very simple loop.  It will begin:

```
for word in t0:
    prob *= < some stuff >
```
That is, multiply the current value of `prob` by < some stuff >,
and update the value of `prob` to be equal to the result.  Each word
you look at introduces another factor; and since they will
all be less than 1, `prob` keeps getting smaller and smaller.
If you don't feel comfortable writing the Python loop, you can write
out the sequence of multiplications by hand.  I have chosen
very short sentences to allow you to do just that.
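
To make the shape of that loop concrete, here is a sketch on invented numbers. The prior `toy_prior`, the table `toy_cond`, and the document `toy_doc` are all hypothetical; they are not estimates from the corpus.

```python
# Hypothetical prior and conditional probabilities for one sense.
toy_prior = 0.5
toy_cond = {'it': 0.1, 'to': 0.2, 'watch': 0.01}

toy_doc = ['it', 'to', 'watch']

# Start from the prior, then multiply in one factor per word.
prob = toy_prior
for word in toy_doc:
    # Each word contributes one more factor below 1, so prob shrinks.
    prob *= toy_cond[word]
print(prob)
```

The result is the joint probability of the sense and the document under the Naive Bayes independence assumption.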

  1. Compute $\hat{P}(\text{HARD1})$.   Compute $\hat{P}(\text{HARD2})$. Compute $\hat{P}(\text{HARD3})$. Note:  
     $P(\text{HARD1})$ is the **prior probability** of sense `HARD1`
     (see slide 3 of the 
     [Naive Bayes lecture slides](http://gawron.sdsu.edu/compling/course_core/lectures/naive_bayes.pdf)).  $\hat{P}(\text{HARD1})$ is the estimated
     prior probability (see slide 7).

     One way to test that you are doing this right is to check: $\hat{P}(\text{HARD1}) + \hat{P}(\text{HARD2}) +\hat{P}(\text{HARD3})$

     You should get 1.0.

  2. For this question, use the definition of  the 
     probability of the sense and the document 
     on slide 6, illustrated in the example
     on slide 19: 
     
     $$\hat{P}(s) \prod_{1 \leq k \leq n_d} \hat{P}(t_{k} \mid s)$$

     Assume the document is:

In [None]:
t0 = [('it', 'PRP'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('watch', 'VB'),
      ('.', '.')]

  2. Here the document size $n_d$ will be 5 (because we're omitting the target word
     `('hard', 'JJ')`). Compute the following joint probability:
     
      $$\hat{P}(\text{HARD1}) \prod_{1 \leq k \leq 5} \hat{P}(t_{k} \mid \text{HARD1})$$
      
 
  3.  Same question as (2.) for HARD2 and HARD3. The results should provide
      motivation for answering the next question, which is about smoothing.  
      Explain what happened and determine what words in the context motivated 
      smoothing.  But despite the smoothing issues, based on the numbers
      you get, what is the maximum a posteriori class?  That is,
      what **sense** does  Naive Bayes choose?

  4.  Now create
      a smoothed model using add-1 smoothing, as defined in slide 12,
      and illustrated in slide 18.  Recompute the joint
      probability for the data you were given in question 2, 
      and report the classification decision made
      by the smoothed model.

  5. Estimate $\hat{P}(\text{HARD1} \mid t_{1,n_{d}})$ for the
     sentence in question (2) using the unsmoothed model.  You
     will need to normalize the probability.  See slide 20.
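
The normalization step can be sketched as follows, assuming it means dividing each sense's joint probability by the sum of the joints over all senses (the usual Bayes rule form; check this against the slide). The joint values in `toy_joint` are hypothetical.

```python
# Hypothetical joint probabilities P(s) * prod_k P(t_k | s), one per sense.
toy_joint = {'S1': 2e-9, 'S2': 1e-9, 'S3': 1e-9}

# P(document): the joints summed over all senses.
evidence = sum(toy_joint.values())

# Posterior P(s | document) for each sense.
posterior = {s: j / evidence for (s, j) in toy_joint.items()}

# Maximum a posteriori sense.
best = max(posterior, key=posterior.get)
print(posterior, best)
```

Because every joint is divided by the same number, normalizing changes the scale but not which sense wins.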