# Maximum Entropy in Natural Language Processing

When faced with natural language processing tasks such as translation or prediction, the goal is often to find the model that is as uniform as possible while still satisfying constraints derived from prior knowledge or data. By choosing the model that is uniform wherever the constraints do not apply, we pick the most "general" description of the stochastic process behind the data set and avoid assuming anything the data does not support.

For example, if we know the word **water** is always followed by the word **bottle**, we can impose that tuple as a constraint and let every other word in our data set be equally likely to be followed by any word.

The quantity to maximize is the entropy of a conditional model $p(y|x)$, weighted by the empirical distribution $\tilde{p}(x)$ of the contexts. It is given by

$$ H(p) = -\sum_{x,y} \tilde{p}(x)p(y|x)\log p(y|x) $$
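To make the formula concrete, here is a minimal sketch that evaluates $H(p)$ for two hand-written conditional models. The toy vocabulary, the context probabilities, and the function name `conditional_entropy` are assumptions made for illustration only; they do not come from the texts used later.

```python
import math

def conditional_entropy(p_tilde_x, p_y_given_x):
    """H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x), using natural logs."""
    h = 0.0
    for x, px in p_tilde_x.items():
        for y, pyx in p_y_given_x[x].items():
            if pyx > 0:  # the term 0 * log 0 is taken to be 0
                h -= px * pyx * math.log(pyx)
    return h

# Hypothetical setup: two contexts x, three possible next words y.
p_tilde_x = {"water": 0.5, "the": 0.5}
uniform = {x: {"bottle": 1 / 3, "cat": 1 / 3, "dog": 1 / 3} for x in p_tilde_x}
constrained = {
    # "water" is always followed by "bottle"; "the" stays uniform
    "water": {"bottle": 1.0, "cat": 0.0, "dog": 0.0},
    "the": {"bottle": 1 / 3, "cat": 1 / 3, "dog": 1 / 3},
}

h_uniform = conditional_entropy(p_tilde_x, uniform)
h_constrained = conditional_entropy(p_tilde_x, constrained)
print(h_uniform, h_constrained)
```

The fully uniform model attains $\log 3$, the largest value possible with three outcomes. Forcing **water** to be followed by **bottle** removes all uncertainty in that context, so only the **the** context contributes and the entropy halves.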

However, $H(p)$ must be maximized subject to one constraint per feature $f_i$:

$$ p(f_i) = \tilde{p}(f_i) $$

where each $f_i$ is an indicator function of a tuple $(x,y)$, $p(f_i)$ is the expected value of $f_i$ with respect to the model $p(y|x)$, and $\tilde{p}(f_i)$ is the expected value of $f_i$ with respect to the empirical distribution $\tilde{p}(x,y)$. We will discuss these and similar functions and their definitions in further detail when implementing this method.
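The two sides of the constraint can be compared numerically. The sketch below assumes the usual form of the model expectation, $p(f) = \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x,y)$, takes $\tilde{p}(x)$ to be the relative frequency of $x$ as the first element of a pair, and uses a made-up six-token corpus. All function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus, not one of the NLTK texts.
tokens = ["water", "bottle", "water", "bottle", "the", "bottle"]
pairs = list(zip(tokens, tokens[1:]))      # consecutive (x, y) pairs
vocab = sorted(set(tokens))

n = len(pairs)
pair_counts = Counter(pairs)               # counts for p~(x, y)
ctx_counts = Counter(x for x, _ in pairs)  # counts for p~(x)

def f(pair, z=("water", "bottle")):
    """Indicator feature for the tuple z."""
    return 1 if pair == z else 0

def empirical_expectation(feature):
    """p~(f) = sum_{x,y} p~(x,y) f(x,y)."""
    return sum(c / n * feature(pair) for pair, c in pair_counts.items())

def model_expectation(feature, p_y_given_x):
    """p(f) = sum_{x,y} p~(x) p(y|x) f(x,y)."""
    return sum(ctx_counts[x] / n * p_y_given_x(y, x) * feature((x, y))
               for x in ctx_counts for y in vocab)

def uniform(y, x):
    return 1 / len(vocab)

def water_then_bottle(y, x):
    # uniform everywhere except after "water"
    if x == "water":
        return 1.0 if y == "bottle" else 0.0
    return 1 / len(vocab)

emp = empirical_expectation(f)
mod_uniform = model_expectation(f, uniform)
mod_fitted = model_expectation(f, water_then_bottle)
print(emp, mod_uniform, mod_fitted)
```

The uniform model underestimates how often the feature fires, so it violates the constraint. The second model matches the empirical expectation while staying uniform in the contexts the feature does not touch.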

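To avoid dealing with these constraints when finding the optimum, we instead solve the dual problem, which is unconstrained. With the sign convention used below, the dual function is maximized over $\lambda$ (equivalently, its negative is minimized). This dual function is given by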

$$ \psi(\lambda) = - \sum_{x} \tilde{p}(x) \log{(\sum_{y} \exp{(\sum_{i} \lambda_i f_i (x,y))})} + \sum_{i} \lambda_i \tilde{p}(f_i)$$

Because the constraints of the primal problem are equalities, the dual variables are unrestricted in sign, that is, each $\lambda_i \in \mathbb{R}$. This unconstrained problem should be considerably easier to solve than the original constrained primal problem.
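As a rough illustration of what evaluating $\psi(\lambda)$ involves, the sketch below computes it for a single indicator feature on a made-up corpus. The corpus, the feature tuple, and the helper names `score` and `dual` are assumptions for this example; $\tilde{p}(x)$ is taken over the first elements of the pairs, and the inner sum over $y$ runs over the whole vocabulary.

```python
import math
from collections import Counter

# Hypothetical toy corpus and a single hypothetical feature tuple.
tokens = ["water", "bottle", "water", "bottle", "the", "bottle"]
pairs = list(zip(tokens, tokens[1:]))      # consecutive (x, y) pairs
vocab = sorted(set(tokens))                # candidate next words y
features = [("water", "bottle")]           # z_i, one per lambda_i

n = len(pairs)
ctx_counts = Counter(x for x, _ in pairs)  # counts for p~(x)
pair_counts = Counter(pairs)               # counts for p~(x, y)

def score(lam, x, y):
    """sum_i lambda_i * f_i(x, y) with indicator features."""
    return sum(l for l, z in zip(lam, features) if (x, y) == z)

def dual(lam):
    """psi(lambda) as written above, with natural logarithms."""
    log_norm = sum(
        (c / n) * math.log(sum(math.exp(score(lam, x, y)) for y in vocab))
        for x, c in ctx_counts.items()
    )
    empirical = sum(l * pair_counts[z] / n for l, z in zip(lam, features))
    return -log_norm + empirical

psi_zero = dual([0.0])
psi_one = dual([1.0])
print(psi_zero, psi_one)
```

At $\lambda = 0$ every exponent is zero, so each inner sum equals the vocabulary size and $\psi(0) = -\log 3$ here. Raising the weight of the feature increases $\psi$ in this example, because the pair it marks really is frequent in the toy corpus.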

## Function Definitions

To start, we import the NLTK package, which contains useful functions as well as formatted text files.

In [60]:
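from __future__ import division  # must come before any other statement; true division on Python 2
import nltk

nltk.download()  # opens the interactive NLTK downloader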

In [1]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


Next we define intermediate functions that will combine to form the dual function.

$\tilde{p}(x)$ is the empirical distribution of a word $x$, and so we count the number of occurrences of $x$ and divide it by the number of tokens in the data set.

In [70]:
## p~(x): relative frequency of word x in text
def ptildex(text, x):
    return text.count(x) / len(text)  # use the argument, not the global txt

$f_i(x,y)$ is an indicator (and in our case, a feature) function corresponding to a tuple of words $(x,y)$. We will have it return $1$ if the input **tup** matches a specific tuple **z** we chose as a feature, and $0$ otherwise.

In [71]:
def f(tup, z):
    # indicator: 1 if the pair tup equals the feature tuple z, else 0
    return 1 if tup == z else 0

$\tilde{p}(f)$, as mentioned earlier, is the expected value of a feature function $f(x,y)$ with respect to $\tilde{p}(x,y)$, the empirical distribution of $(x,y)$. Specifically, it is defined as 

$$ \tilde{p}(f) = \sum_{x,y}\tilde{p}(x,y) f(x,y) $$
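Because each feature here is the indicator of a single tuple, the sum collapses to one term. Writing $z = (x_0, y_0)$ for the chosen feature tuple (notation introduced only for this remark), every pair other than $z$ contributes zero:

$$ \tilde{p}(f) = \sum_{x,y}\tilde{p}(x,y)\,\mathbb{1}[(x,y) = z] = \tilde{p}(x_0, y_0) $$

so $\tilde{p}(f)$ is simply the relative frequency of the bigram $z$ in the text.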

In [72]:
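## p~(x,y): relative frequency of the pair (x, y) among all bigrams of text
def ptildexy(text, tup):
    bg = list(bigrams(text))
    return bg.count(tup) / len(bg)

## p~(f): sum over distinct pairs of p~(x,y) * f(x,y)
def ptildef(bigramset, f):
    # bigramset is the list of bigrams of the text, so count/len equals p~(x,y)
    val = 0
    for pair in set(bigramset):  # visit each distinct pair once
        val += (bigramset.count(pair) / len(bigramset)) * f(pair, ('hey', 'no'))
        # the feature tuple for f is filled in for now
    return val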

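Finally, an important NLTK function we will be using is the bigrams function, which accepts a sequence of tokens (such as a list of words or one of the texts loaded above) and returns all pairs of consecutive tokens, in order.

For example, bigrams(['yes', 'no', 'maybe']) yields ('yes', 'no') and ('no', 'maybe'); wrapping the call in list() collects the pairs into a list.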
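For intuition, the same pairing can be sketched without NLTK by zipping a token list with a copy of itself shifted by one position. This only illustrates the behaviour described above and is not NLTK's implementation; the helper name `simple_bigrams` is made up for this example.

```python
def simple_bigrams(tokens):
    """Return the consecutive (x, y) pairs of a list of tokens."""
    return list(zip(tokens, tokens[1:]))

pairs = simple_bigrams(['yes', 'no', 'maybe'])
print(pairs)  # [('yes', 'no'), ('no', 'maybe')]
```

A text with $n$ tokens produces $n - 1$ pairs, which is why the bigram probabilities in the next cell are divided by 7 rather than 8.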

In [69]:
## bigrams function example
txt = ['hey', 'no', 'ucla', 'math', 'derivative', 'bus', 'computer', 'hello']
bg = list(bigrams(txt))
print(bg)

## p~(x): count occurrences of word x
def ptildex(text, x):
    return text.count(x) / len(text)  # use the argument, not the global txt
print(ptildex(txt, 'ucla'))

## p~(x,y): count occurrences of a pair (x, y)
def ptildexy(text, tup):
    bg = list(bigrams(text))
    return(bg.count(tup) / len(bg))
print(ptildexy(txt, ('no', 'ucla')))

#print(bg.count(('hey', 'no')))

## f(x,y) feature function, z is an element of a set of features (x,y)
def f(tup, z):
    if tup == z: return(1)
    else: return(0)     
print(f(bg[1], ('bus', 'computer')))

def ptildef(bigramset, f):
    val = 0
    for pair in set(bigramset):  # visit each distinct pair once
        val += (bigramset.count(pair) / len(bigramset)) * f(pair, ('hey', 'no'))
    return val
print(ptildef(bg, f))


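## sum_i lambda_i * f_i(x, y) for a single pair (x, y)
lam = [0.0]              # placeholder weights, one per feature (not given in the original)
feat = [('hey', 'no')]   # placeholder feature tuples z_i
x, y = 'hey', 'no'       # placeholder pair to evaluate
lambdaf = 0
for l, z in zip(lam, feat):
    lambdaf += l*f((x, y), z)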
    
## freq dist example
#txt2 = nltk.tokenize.word_tokenize(txt)
#dist = FreqDist(txt2)
#print(dist)

[('hey', 'no'), ('no', 'ucla'), ('ucla', 'math'), ('math', 'derivative'), ('derivative', 'bus'), ('bus', 'computer'), ('computer', 'hello')]
0.125
0.142857142857
0
0.142857142857
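Calling count inside a loop rescans the text for every pair, which becomes slow on the book-length texts loaded earlier. One possible alternative, sketched here assuming Python 3 and only the standard library, tallies all words and pairs once with collections.Counter. The `_fast` names are illustrative and not part of the original notebook.

```python
from collections import Counter

txt = ['hey', 'no', 'ucla', 'math', 'derivative', 'bus', 'computer', 'hello']
pairs = list(zip(txt, txt[1:]))  # same pairs as list(bigrams(txt))

word_counts = Counter(txt)       # one pass over the words
pair_counts = Counter(pairs)     # one pass over the pairs

def ptildex_fast(x):
    """p~(x): relative frequency of word x."""
    return word_counts[x] / len(txt)

def ptildexy_fast(tup):
    """p~(x,y): relative frequency of the pair tup."""
    return pair_counts[tup] / len(pairs)

def ptildef_fast(z):
    """p~(f) for the indicator feature of tuple z, summed over distinct pairs."""
    return sum(c / len(pairs) * (1 if pair == z else 0)
               for pair, c in pair_counts.items())

print(ptildex_fast('ucla'))
print(ptildexy_fast(('no', 'ucla')))
print(ptildef_fast(('hey', 'no')))
```

These calls compute the same three quantities as the cell above: one word out of eight, and one pair out of seven for each bigram query.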
