# Assignment 2 - Elementary Probability and Information Theory 
# Boise State University NLP - Dr. Kennington

### Instructions and Hints:

* This notebook loads some data into a `pandas` dataframe, then does a small amount of preprocessing. Make sure your data can load by stepping through all of the cells up until question 1. 
* Most of the questions require you to write some code. In many cases, you will write some kind of probability function like we did in class using the data. 
* Some of the questions only require you to write answers, so be sure to change the cell type to markdown or raw text.
* Don't worry about normalizing the text this time (e.g., lowercasing). Just focus on probabilities. 
* Most questions can be answered in a single cell, but you can make as many additional cells as you need. 
* When complete, please export as HTML. Follow the instructions on the corresponding assignment Trello card for submitting your assignment. 

<body>
    <section style="border:1px solid RoyalBlue;">
        <section style="background-color:White; font-family:Georgia;text-align:center">
            <h1 style="color:RoyalBlue">Natural Language Processing</h1>
            <h2 style="color:RoyalBlue">Dr. Casey Kennington</h2>
            <h2 style="font-family:Courier; text-align:center;">CS-536</h2>
            <br>
            <h2 style="font-family:Garamond;">Gerardo Caracas Uribe</h2>
            <h2 style="font-family:Garamond;">Student ID: 114104708</h2>
            <h2 style="font-family:Courier;">Assignment 2</h2>
            <hr/>
        </section>
    </section>
</body>

In [33]:
from client.api.notebook import Notebook
ok = Notebook('a2.ok')
import os
if not os.path.exists(os.path.join(os.path.expanduser("~"), ".config/ok/auth_refresh")):
    ok.auth(force=True)
else:
    ok.auth(inline=True)

Assignment: A2 Python and Jupyter
OK, version v1.13.11


Successfully logged in as gerardocaracasur@u.boisestate.edu


In [1]:
import pandas as pd 

data = pd.read_csv('pnp-train.txt',delimiter='\t',encoding='latin-1', # utf8 encoding didn't work for this
                  names=['type','name']) # supply the column names for the dataframe

# this next line creates a new column with the lower-cased first word
data['first_word'] = data['name'].map(lambda x: x.lower().split()[0])

In [2]:
data[:10]

| | type | name | first_word |
|---|---|---|---|
| 0 | drug | Dilotab | dilotab |
| 1 | movie | Beastie Boys: Live in Glasgow | beastie |
| 2 | person | Michelle Ford-Eriksson | michelle |
| 3 | place | Ramsbury | ramsbury |
| 4 | place | Market Bosworth | market |
| 5 | drug | Cyanide Antidote Package | cyanide |
| 6 | person | Bill Johnson | bill |
| 7 | place | Ettalong | ettalong |
| 8 | movie | The Suicide Club | the |
| 9 | place | Pézenas | pézenas |


In [3]:
data.describe()

| | type | name | first_word |
|---|---|---|---|
| count | 21001 | 21001 | 21001 |
| unique | 5 | 20992 | 13703 |
| top | movie | Cuba | the |
| freq | 6262 | 2 | 635 |


## 1. Write a probability function/distribution $P(T)$ over the types. 

Hints:

* The Counter library might be useful: `from collections import Counter`
* Write a function `def P(T='')` that returns the probability of the specific value for T
* You can access the types from the dataframe by calling `data['type']`

In [4]:
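from collections import Counter

def P(T=''):
    # Count how often each type occurs; a type that never occurs gets count 0.
    counts = Counter(data['type'])
    # Relative frequency: count of T divided by the total number of rows.
    return counts[T] / sum(counts.values())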

## 2. What is `P(T='movie')` ?

In [5]:
P(T='movie')

0.29817627732012764

## 3. Show that your probability distribution sums to one.

In [6]:
import numpy as np
np.sum(list(P(T=x) for x in set(data.type)))

1.0
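As a cross-check on the counting approach, the same kind of distribution can be obtained from pandas' `value_counts(normalize=True)`. The sketch below is only illustrative: `toy`, `p_t`, and `P_toy` are made-up names, and the six rows are hypothetical data, not the assignment dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for data['type']; not the assignment dataset.
toy = pd.DataFrame({"type": ["movie", "movie", "place", "drug", "person", "movie"]})

# Relative frequency of each type: an empirical estimate of P(T).
p_t = toy["type"].value_counts(normalize=True)

def P_toy(T=""):
    # Types that never occur get probability 0 instead of raising a KeyError.
    return float(p_t.get(T, 0.0))

# A proper distribution sums to one over all observed types.
total = float(p_t.sum())
print(P_toy("movie"), total)
```

Computing the normalized counts once and looking values up afterwards avoids rebuilding the counter on every call.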

## 4. Write a joint distribution using the type and the first word of the name

Hints:

* The function is $P2(T,W_1)$
* You will need to count up types AND the first words, for example: ('person','bill')
* Using the [itertools.product](https://docs.python.org/2/library/itertools.html#itertools.product) function was useful for me here

In [7]:
def P2(T='', W1=''):
    global data
    return float(len(data[(data.type == T) & 
                   (data.first_word == W1)]))/float(len(data))

## 5. What is P2(T='person', W1='bill')? What about P2(T='movie',W1='the')?

In [8]:
P2(T='person', W1='bill')

0.00047616780153326033

In [9]:
P2(T='movie', W1='the')

0.02747488214846912

## 6. Show that your probability distribution P(T,W1) sums to one.

In [11]:
from itertools import product

# Sum P2 over every (type, first word) pair, including pairs that never co-occur.
pairs = product(set(data.type), set(data.first_word))
np.sum([P2(T=t, W1=w) for t, w in pairs])

1.0
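The check above calls `P2` once for each of the 5 × 13703 (type, first word) pairs, and each call filters the whole dataframe, so it is slow. One possible alternative, sketched here on hypothetical toy data, is to build the full joint table once with `pd.crosstab(..., normalize=True)`. The names `toy`, `joint`, and `P2_toy` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical toy data with the same two columns used by P2; not the assignment dataset.
toy = pd.DataFrame({
    "type": ["movie", "movie", "person", "place", "movie", "person"],
    "first_word": ["the", "the", "bill", "market", "beastie", "bill"],
})

# normalize=True divides every cell by the grand total, giving the joint P(T, W1).
joint = pd.crosstab(toy["type"], toy["first_word"], normalize=True)

def P2_toy(T="", W1=""):
    # Pairs with an unseen type or word have probability 0.
    if T in joint.index and W1 in joint.columns:
        return float(joint.loc[T, W1])
    return 0.0

# The cells of a joint distribution sum to one.
joint_total = float(joint.to_numpy().sum())
print(P2_toy("movie", "the"), joint_total)
```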

## 7. Make a new function Q(T) from marginalizing over P(T,W1) and make sure that Q(T) sums to one.

Hints:

* Your Q function will call P(T,W1)
* Your check for the sum to one should be the same answer as Question 3, only it calls Q instead of P.

In [12]:
def Q(T=''):
    return sum([P2(T=T, W1=x) for x in set(data.first_word)])

In [13]:
Q(T='movie')

0.2981762773201276

In [14]:
np.sum(list(map(lambda x: Q(T=x), set(data.type))))

1.0000000000000284
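Written out, the marginalization that `Q` performs, and the reason it sums to one, are as follows, with $t$ ranging over the types and $w$ over the distinct first words. The small excess in the sum above (about $2.8 \times 10^{-14}$) is most likely floating-point rounding accumulated over the many additions, not a modeling error.

$$Q(T=t) = \sum_{w} P(T=t, W_1=w), \qquad \sum_{t} Q(T=t) = \sum_{t}\sum_{w} P(T=t, W_1=w) = 1$$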

## 8. What is the KL Divergence of your Q function and your P function for Question 1?

* Even if you know the answer, you still need to write code that computes it.

In [322]:
import math
np.sum(list(map(lambda x: P(T=x)*math.log(P(T=x)/Q(T=x)), set(data.type))))

-2.5651628708294765e-14
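In exact arithmetic the KL divergence is never negative, and it is zero when the two distributions are identical. Since `Q` is just `P` recovered by marginalization, the tiny negative value above is best read as zero up to floating-point rounding. The sketch below illustrates the formula $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$ on two hypothetical toy distributions. The helper `kl_divergence` and the dictionaries `p` and `q` are assumptions for illustration, not part of the assignment code.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two dicts mapping outcome -> probability."""
    # Terms with p(x) == 0 contribute 0 by convention, so they are skipped.
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical toy distributions over three types; not the assignment data.
p = {"movie": 0.5, "place": 0.25, "drug": 0.25}
q = {"movie": 0.25, "place": 0.25, "drug": 0.5}

same = kl_divergence(p, p)   # identical distributions
diff = kl_divergence(p, q)   # different distributions
print(same, diff)
```

Using `math.log` gives the result in nats, and swapping in `math.log2` would give bits.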

## 9. Convert from P(T,W1) to P(W1|T) 

Hints:

* Just write a comment cell, no code this time. 
* Note that $P(T,W1) = P(W1,T)$

## Using Bayes' theorem ##
#### First, our target, the quantity we want: ####
### $$P(W_1|T)$$ ###

#### Bayes' theorem, together with the definition of conditional probability (the first equality), states: ####
### $$P(A|B) = \frac{P(A,B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{\sum_i P(B|A_i)P(A_i)}  $$

#### The joint distribution is symmetric in its arguments: ####
### $$P(T,W_1) = P(W_1,T)$$

#### And we already have the marginal ####
### $$P(T)$$ ###

#### Setting $A = W_1$ and $B = T$ in the first equality therefore gives the final result ####
### $$P(W_1|T)= \frac{P(W_1,T)}{P(T)} $$ ###

## 10. Write a function `Pwt` (that calls the functions you already have) to compute $P(W_1|T)$.

* This will be something like the multiplication rule, but you may need to change something

In [15]:
def Pwt(W1='',T=''):
    return P2(T=T,W1=W1)/P(T=T)

## 11. What is P(W1='the'|T='movie')?

In [16]:
Pwt(W1='the',T='movie')

0.09214308527626956
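As a sanity check, this value is the ratio of two numbers already computed: the joint probability from Question 5 and the marginal from Question 2 (rounded here to five significant digits).

$$P(W_1=\text{the} \mid T=\text{movie}) = \frac{P(T=\text{movie}, W_1=\text{the})}{P(T=\text{movie})} \approx \frac{0.027475}{0.29818} \approx 0.0921$$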

## 12. Use Bayes' rule to convert from P(W1|T) to P(T|W1). Write a function Ptw to reflect this. 

Hints:

* Call your other functions.
* You may need to write a function for P(W1) and you may need a new counter for `data['first_word']`

In [19]:
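word = Counter(data['first_word'])

def P_W(W1=''):
    # Marginal probability of the first word: count of W1 / total number of rows.
    return word[W1] / len(data)

def Ptw(T='', W1=''):
    # Bayes' rule: P(T|W1) = P(W1|T) P(T) / P(W1).
    # Since P(W1|T) P(T) = P(T, W1), the numerator is computed directly with P2.
    return P2(T=T, W1=W1) / P_W(W1=W1)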
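To see that the explicit Bayes form and the joint-over-marginal form agree, here is a minimal sketch on hypothetical toy pairs. All names in it (`pairs`, `p_t`, `p_w`, `p_w_given_t`, `p_t_given_w_bayes`, `p_t_given_w_direct`) are made up for this illustration and the data is not the assignment dataset.

```python
from collections import Counter

# Hypothetical (type, first_word) pairs; not the assignment dataset.
pairs = [("movie", "the"), ("movie", "the"), ("company", "the"),
         ("movie", "beastie"), ("person", "bill"), ("place", "market")]
n = len(pairs)
joint = Counter(pairs)                  # counts of (type, word) pairs
types = Counter(t for t, _ in pairs)    # counts of types
words = Counter(w for _, w in pairs)    # counts of first words

def p_t(t):
    return types[t] / n

def p_w(w):
    return words[w] / n

def p_w_given_t(w, t):
    return joint[(t, w)] / types[t]

def p_t_given_w_bayes(t, w):
    # Bayes' rule: P(T|W1) = P(W1|T) P(T) / P(W1)
    return p_w_given_t(w, t) * p_t(t) / p_w(w)

def p_t_given_w_direct(t, w):
    # Definition of conditional probability: P(T, W1) / P(W1)
    return (joint[(t, w)] / n) / p_w(w)

print(p_t_given_w_bayes("movie", "the"), p_t_given_w_direct("movie", "the"))
```

Both routes give the same number up to rounding, because $P(W_1|T)\,P(T)$ is the joint probability $P(T,W_1)$.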

## 13.
### What is P(T='movie'|W1='the')?
### What about P(T='person'|W1='the')?
### What about P(T='drug'|W1='the')?
### What about P(T='place'|W1='the')?
### What about P(T='company'|W1='the')?

In [25]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [26]:
Ptw(T='person',W1='the')

0.0

In [27]:
Ptw(T='drug',W1='the')

0.0

In [28]:
Ptw(T='place',W1='the')

0.0015748031496062994

In [29]:
Ptw(T='company',W1='the')

0.08976377952755905
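Because P(T|W1='the') is a distribution over the five types, the five values above should sum to one. The short check below copies the printed outputs into a dictionary (the name `posteriors` is an assumption for this sketch) and verifies the sum up to floating-point tolerance.

```python
import math

# Values copied from the outputs of Ptw(T=..., W1='the') above.
posteriors = {
    "movie": 0.9086614173228347,
    "person": 0.0,
    "drug": 0.0,
    "place": 0.0015748031496062994,
    "company": 0.08976377952755905,
}

# A conditional distribution over types must sum to one.
total = sum(posteriors.values())
print(total, math.isclose(total, 1.0))
```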

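## 14. Given this, if the word 'the' is found in a name, what is the most likely type?

movie, because P(T='movie'|W1='the') ≈ 0.909 is the largest of the five probabilities from Question 13. Note that this model only looks at the first word of the name.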

## 15. Is Ptw(T='movie'|W1='the') the same as Pwt(W1='the'|T='movie')? Why or why not?

In [30]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [31]:
Pwt(W1='the', T='movie')

0.09214308527626956

No. Take the following expression:
### $$P(B|A) $$ ###
The conditional probability of an event B given A is the probability that B occurs given the knowledge that A has already occurred. 
The order of the arguments therefore matters, and we cannot simply swap them: the event after the bar is the one we condition on.
$P(A|B)$ and $P(B|A)$ answer different questions. Here, P(T='movie'|W1='the') ≈ 0.909 is the fraction of names starting with 'the' that are movies, while P(W1='the'|T='movie') ≈ 0.092 is the fraction of movie names that start with 'the'.

## 16. Do you think modeling Ptw(T|W1) would be better with a continuous function like a Gaussian? Why or why not?

- Answer in a markdown cell


I don't think so. We are dealing with categorical variables (we are counting discrete types and words), whereas a continuous function such as a Gaussian models continuous numerical data, which we do not have here.

In [34]:
ok.submit()

Saving notebook... Saved 'A2-probability-information-theory.ipynb'.
Submit... 100% complete
Submission successful for user: gerardocaracasur@u.boisestate.edu
URL: https://okpy.org/boisestate/cs4-533/sp19/a2/submissions/VP13o5

