# Maximum Entropy in Natural Language Processing

When faced with natural language processing tasks such as translation or prediction, the goal is often to find a model that is the most uniform while still fulfilling preset constraints based on prior knowledge/data. By considering models that are uniformly distributed outside of constraints, the hope is to most accurately describe stochastic processes present in data sets by picking the most "general" model. 

For example, if we know the word **water** is always followed the word **bottle**, we can set that tuple as a constraint and have all other words in our data set be equally likely to be followed by any other word.

This maximum entropy function with respect to a conditional probability $p(y|x)$ is given by

$$ H(p) = -\sum_{x,y} \tilde{p}(x)p(y|x)\log p(y|x) $$

However, maximizing $H(p)$ also has the constraints 

$$ p(f_i) = \tilde{p}(f_i) $$

where $f$ is an indicator function with respect to a tuple $(x,y)$, $p(f_i)$ is the expected value of $f$ with respect to the empirical distribution of a tuple $(x,y)$, and $\tilde{p}(f_i)$ is the expected value of $f$ with respect to $p(y|x)$ . We will discuss these and similar functions and their definitions in further detail when implementing this method.

To avoid dealing with these constraints when finding the optimum, we instead minimize the dual of $H(p)$. This dual function is given by

$$ \psi(\lambda) = - \sum_{x} \tilde{p}(x) \log{(\sum_{y} \exp{(\sum_{i} \lambda_i f_i (x,y))})} + \sum_{i} \lambda_i \tilde{p}(f_i)$$

Because our primal function $H(p)$ is symmetric, our dual is therefore unbounded, that is $\lambda \in \rm I\!R$. This dual function should be relatively easier to minimize compared to our original primal problem. 

## Function Definitions

To start, we import the NLTK package, which contains useful functions as well as formatted text files.

In [67]:
import nltk
import numpy as np
from scipy.optimize import minimize 
from __future__ import division

# only download once
#nltk.download()

In [33]:
from nltk.book import *

Next we define intermediate functions that will combine to form the dual function.

$\tilde{p}(x)$ is the empirical distribution of a word $x$, and so we count the number of occurences of $x$ and divide it by the length of the dataset.

In [49]:
## p~(x): count occurences of word x
def ptildex(text, x):
    return(txt.count(x) / len(txt))

$f_i(x,y)$ is an indicator (and in our case, a feature) function corresponding to a tuple of words (x,y). We will have it return $1$ if the input **tup** matches a specific tuple **z** we chose as a feature. 

In [50]:
def f(tup, z):
    if tup == z: return(1)
    else: return(0)     

$\tilde{p}(f)$, as mentioned earlier, is the expected value of a feature function $f(x,y)$ with respect to $\tilde{p}(x,y)$, the empirical distribution of $(x,y)$. Specifically, it is defined as 

$$ \tilde{p}(f) = \sum_{x,y}\tilde{p}(x,y) f(x,y) $$

In [61]:
## p~(x,y): count occurences of a pair (x, y)
def ptildexy(text, tup):
    bg = list(nltk.bigrams(text))
    return(bg.count(tup) / len(bg))

def ptildef(text, bigramset, feat):
    val = float(0)
    for pair in bigramset:
        val += ptildexy(text, pair)*float(f(pair, feat))
    return(val)

It would also be convenient to define 
$$\log{(\sum_{y} \exp{(\sum_{i} \lambda_i f_i (x,y))})}$$

In [52]:
def logsum(x, lambdas, features):
    outersum = float(0)
    innersum = float(0)
    for y in sety:
        for lam, feat in zip(lambdas, features):
            innersum += lam * f((x,y), feat)
        outersum += np.exp(innersum)
    return(np.log(outersum))

Finally, an important NLTK function we will be using is the bigrams function, which accepts a text file input and returns a list of all sequential word pairs. For example, bigrams(yes, no, maybe) will return (yes, no), (no, maybe).

In [57]:
txt = ['hey', 'no', 'ucla', 'math', 'derivative', 'bus', 'computer', 'hello']
bg = list(nltk.bigrams(txt))
print(txt)
print(bg)

['hey', 'no', 'ucla', 'math', 'derivative', 'bus', 'computer', 'hello']
[('hey', 'no'), ('no', 'ucla'), ('ucla', 'math'), ('math', 'derivative'), ('derivative', 'bus'), ('bus', 'computer'), ('computer', 'hello')]


Now we can piece together the dual function. It takes a vector of initial lambda values and sums over all $x$ and $y$, which are the sets of unique words in our text file. In our case x and y are equal, as every word can either be predicted or predicted from. 

In [69]:
text = ['test']
setx = set(text)
sety = set(text)
bg = list(nltk.bigrams(text))

def dual(lambdas):
    firstsum = float(0)
    for x in setx:
        firstsum -= ptildex(text, x) * logsum(x, lambdas, features)
    secondsum = float(0)
    for lam, feat in zip(lambdas, features):
        secondsum += lam * ptildef(text, bigramset, feat)
    return (firstsum + secondsum)

To use a Quasi-Newton method we need to also provide a gradient, which we can achieve using simple calculus
$$ \dfrac{\partial \psi(\lambda)}{\partial{\lambda_i}} = -\sum_{x}\tilde{p}(x) \frac{\sum_{y}(\exp{(\sum_{i} \lambda_i f_i(x,y))}f_i (x,y))}{\sum_{y} \exp{(\sum_{i} \lambda_i f_i (x,y))}} + \tilde{p}(f_i)$$

since 

$$ \dfrac{\partial}{\partial \lambda_i} \sum_{i} \lambda_i f_i(x,y) = f_i(x,y) $$

and where $ \dfrac{\partial \psi(\lambda)}{\partial{\lambda_i}}$ is the ith component of our gradient vector.

In [84]:
def expterm(x, y, lambdas, features):
    innersum = float(0)
    for lam, feat in zip(lambdas, features):
        innersum += lam * f((x,y), feat)
    return(np.exp(innersum))

def grad(lambdas):
    firstsum = 0
    numerator = 0
    denominator = 0
    gradient = np.empty((1,1))
    for i in np.arange(0, len(lambdas)):
        for x in setx:
            for y in sety:
                numerator += expterm(x, y, lambdas, features)
            denominator = numerator
            numerator = numerator*f((x, y), features[i])
            firstsum -= (numerator/denominator) * ptildex(text, x)
        firstsum += ptildef(text, bigramset, features[i])
        gradient = np.append(gradient, firstsum)
    gradient = gradient[1:]
    return(gradient)

### Test functions for output

In [85]:
text = ['In', 'optimization', 'quasi-Newton', 'methods', 'are', 'algorithms', 'for', 'finding', 'local', 
        'maxima', 'and', 'minima', 'of', 'functions', 'In', 'quasi-Newton', 'methods', 'the', 'Hessian',
        'matrix', 'does', 'not', 'need', 'to', 'be', 'computed']
bigramset = list(nltk.bigrams(text))
setx = set(text)
sety = set(text)
features = [('quasi-Newton', 'methods'), ('local', 'maxima')]
lambdas = [4, 5]

print(dual(lambdas))
print(grad(lambdas))

0.8400000000000001
[0.16 0.2 ]


In [83]:
minimize(dual, lambdas, method='BFGS', jac=grad)

  rhok = 1.0 / (numpy.dot(yk, sk))
  sk[numpy.newaxis, :])
  return f(xk + alpha * pk, *args)


      fun: -6.185681452892673e+152
 hess_inv: array([[inf, inf],
       [inf, inf]])
      jac: array([0.16, 0.2 ])
  message: 'Desired error not necessarily achieved due to precision loss.'
     nfev: 3863
      nit: 51
     njev: 3862
   status: 2
  success: False
        x: array([-2.94556260e+153, -3.68195325e+153])

## Gradient Method

Another popular and computationally cheaper optimization method is a gradient method, which iteratively steps in a direction that minimizes the function(usually determined by the gradient). First we set $\lambda = 0$. Then for each $\lambda_i$ we increment it by $\Delta \lambda_i$ so that $\lambda_i = \lambda_i + \Delta \lambda_i$ where $\Delta \lambda_i$ is the solution to 

$$ \sum_{x,y} \tilde{p}(x) p(y|x) f_i(x,y) \exp{(\Delta \lambda_i \sum_{i} f_i(x,y))} = \tilde{p}(f_i)$$

We then check for convergence of $\lambda_i$ and repeat the increment of $\lambda_i$ if not.