# Character-level entropy: reproducing results from Shannon (1951)
In this notebook, we will explore the computation of various flavors of entropy using natural language as a test-bed.  We will perform these explorations at the level of characters (although an interesting analysis can also be done at the word level).

## Part 1: The baseline
As a point of comparison, and in order to get a sense of the magnitude of the entropy numbers that we expect to be dealing with, let's compute the entropy of the alphabet under the assumption that each character is independent and uniformly distributed.  As our alphabet, let's take the 26 letters (A-Z), the digits (0-9), and the space character, ignoring case and punctuation.  Derive the simplest expression that you can for the entropy of this distribution for a single character.  
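
As a hint at the shape of the calculation (a worked sketch; the symbol $n$ is introduced here for the alphabet size and is not part of the original statement): under a uniform distribution every term of the entropy sum is identical, so the sum collapses to a single logarithm,
$$
H(X) = -\sum_{i=1}^{n} \frac{1}{n} \lg \frac{1}{n} = \lg n.
$$
For the alphabet above, $n = 26 + 10 + 1 = 37$, which gives $\lg 37 \approx 5.21$ bits per character.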

## Part 2: Unigram entropy
Next, we will consider the character-level entropy of natural English.  In words, our task is as follows: given some character drawn from a corpus of English text, how many yes-or-no questions do I need to ask on average to figure out the character, assuming questions are asked optimally?  To compute the entropy, we need a text on which to base our estimates of frequency.  Python's [Natural Language Toolkit (NLTK)](https://www.nltk.org) is a nice tool for this purpose.  We'll look at a few different text corpora (the others being the Book of Genesis and a Spanish-language corpus), starting with the 'Brown dataset', which is around 1M words long and was compiled at Brown University from a large number of text samples spanning newspaper articles and other genres.  First, install nltk via pip (or whatever), then we can acquire the corpora as follows:  

In [None]:
import nltk
nltk.download('brown')
nltk.download('genesis')
nltk.download('cess_esp')

from nltk.corpus import brown,genesis,cess_esp
print(brown.sents())

For the sake of simplicity, we want to use the same alphabet as in the previous section - here is a function which strips out all punctuation and special characters, casts to lowercase, and then puts all of the words together as one big string:

In [None]:
import string
import re
import itertools

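def process_nltk_corpus(corpus):
    # Flatten the list of sentences into a single list of tokens
    tokens = list(itertools.chain.from_iterable(corpus.sents()))
    # Join tokens with spaces and drop ASCII punctuation
    text = ' '.join(tokens).translate(str.maketrans('', '', string.punctuation))
    # Collapse runs of whitespace, trim the ends, and cast to lowercase
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Remove anything outside the alphabet (letters, digits, space)
    char_string = re.sub('[^A-Za-z0-9 ]+', '', text)
    return char_string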

text_brown = process_nltk_corpus(brown)
print(text_brown[:500])

**Compute the empirical distribution over characters** (e.g. $p_a = \frac{n_a}{N}$, with $n_a$ the number of $a$'s that appear, and $N$ the total number of characters), then **compute the entropy of the distribution**.   
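
One possible way to organize this computation is sketched below. This is a minimal illustration on toy strings rather than a reference solution; the function name `unigram_entropy` is made up for this example, and you would pass in `text_brown` in place of the toy inputs.

```python
from collections import Counter
from math import log2

def unigram_entropy(text):
    """Entropy, in bits, of the empirical character distribution of `text`."""
    counts = Counter(text)            # n_a for every character a
    total = sum(counts.values())      # N, the total number of characters
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Toy checks: two equally likely symbols carry 1 bit, four carry 2 bits.
print(unigram_entropy("abab"))   # 1.0
print(unigram_entropy("abcd"))   # 2.0
```

Only characters that actually occur contribute to the sum, so there is no need to special-case $0 \lg 0$.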

Claude Shannon also computed this number (for a different corpus) and published it in a 1951 paper called [Prediction and Entropy of Printed English](https://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf).  His estimate can be found in the table on p. 54, under the column header $F_1$ (the outcome of the previous section would be under column header $F_0$).  Compare your result to his and comment on potential reasons for any deviations.  

## Part 3: Joint entropy over bigrams
Obviously the distribution over the usage of characters in English is less random than uniform - but ultimately language is determined by the relationships between letters rather than the letters in isolation.  As such, let's explore a bigram model, which is to say, we'll be dealing with the distribution over two-letter pairs:
$$P(X_1,X_2).$$
First, if we were to assume that $X_1$ and $X_2$ were independent of one another (of course they aren't in reality), what would be the joint entropy of $X_1$ and $X_2$? *Note that you shouldn't have to do much to get this - the answer is an immediate consequence of your response to Part 2.* Again, this serves as an upper bound on the true joint entropy.

Next, let's compute the actual joint entropy given the text corpus.  To compute this, you will need to first determine the empirical joint distribution over all possible bigrams (of which there are $37^2$).  How you organize these is up to you.  With this distribution in hand, compute the joint entropy $H(X_1,X_2)$.  How does this compare to the baseline assuming independence?
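
As one option for organizing the bigrams, a `Counter` keyed by character pairs avoids allocating the full $37^2$ table up front. The sketch below assumes overlapping bigrams (each character except the first and last appears in two pairs); the helper names `entropy` and `bigram_counts` are invented for this example.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Entropy in bits of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def bigram_counts(text):
    """Count overlapping adjacent character pairs (x1, x2)."""
    return Counter(zip(text, text[1:]))

toy = "abababab"
h_joint = entropy(bigram_counts(toy))   # only 'ab' and 'ba' ever occur
h_indep = 2 * entropy(Counter(toy))     # baseline assuming independence
print(h_joint < h_indep)                # True
```

In the toy string the second character is fully determined by the first, so the joint entropy falls well below the independent baseline.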

## Part 4: Conditional entropy over bigrams
The next natural question to ask is: how predictable is the next character ($X_2$) given knowledge of the previous one ($X_1$)?  This is precisely the answer given by the conditional entropy
$$ H(X_2|X_1) = \sum_{x_1\in\mathcal{X}} P(X_1=x_1) H(X_2 | X_1=x_1).$$
**Develop a method to compute the conditional entropy**.  Compare your result to Shannon, whose calculation is shown in his table under the column heading $F_2$.  

*Note that you can also check to ensure that your code is functioning properly by using the identity
$$
H(X_1,X_2) = H(X_2|X_1) + H(X_1).
$$
You computed both of the quantities on the right hand side previously, so it should be trivial to ensure that this equality holds.*
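
A sketch of one way to do this, grouping the bigram counts by their first character, is given below. The names are invented for illustration. Note one assumption: the check uses the marginal of $X_1$ taken from the bigrams themselves (every character except the last), for which the identity holds up to floating-point error; using the unigram distribution of the full text instead changes the counts by a single character, which is negligible for a large corpus.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(counts):
    """Entropy in bits of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def conditional_entropy(text):
    """H(X2 | X1) estimated from the overlapping bigrams of `text`."""
    followers = defaultdict(Counter)   # followers[x1][x2] = count of bigram x1 x2
    for x1, x2 in zip(text, text[1:]):
        followers[x1][x2] += 1
    n_bigrams = len(text) - 1
    # Weighted average: P(X1 = x1) * H(X2 | X1 = x1)
    return sum(
        (sum(c.values()) / n_bigrams) * entropy(c)
        for c in followers.values()
    )

toy = "the theory then"
h_joint = entropy(Counter(zip(toy, toy[1:])))
h_first = entropy(Counter(toy[:-1]))   # marginal of X1 taken from the bigrams
h_cond = conditional_entropy(toy)
print(abs(h_joint - (h_first + h_cond)) < 1e-9)   # True
```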

## Part 5: Mutual Information
Exactly how many bits of information does knowing the first letter provide me about the second letter?  Compute the mutual information in two ways: first, using the definition based on the Kullback-Leibler divergence:
$$I(X_1;X_2) = \sum_{x_1\in\mathcal{X}} \sum_{x_2\in\mathcal{X}} P(x_1,x_2) \lg \frac{P(x_1,x_2)}{P(x_1)P(x_2)} $$ 

and second using the identity 
$$ I(X_1;X_2) = H(X_2) - H(X_2 | X_1)$$ 
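
Both routes can be sketched in a few lines. In this illustration (function names are hypothetical) the marginals of $X_1$ and $X_2$ are obtained by summing the bigram counts, and $H(X_2|X_1)$ is obtained from the chain rule as $H(X_1,X_2) - H(X_1)$, so that the two answers agree up to floating-point error.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Entropy in bits of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def mutual_information(text):
    """Return I(X1; X2) computed via the KL form and via the entropy identity."""
    pairs = Counter(zip(text, text[1:]))
    n = sum(pairs.values())
    first, second = Counter(), Counter()   # marginals of X1 and X2
    for (x1, x2), c in pairs.items():
        first[x1] += c
        second[x2] += c
    mi_kl = sum(
        (c / n) * log2((c / n) / ((first[x1] / n) * (second[x2] / n)))
        for (x1, x2), c in pairs.items()
    )
    h_cond = entropy(pairs) - entropy(first)   # H(X2 | X1) via the chain rule
    mi_identity = entropy(second) - h_cond
    return mi_kl, mi_identity

mi_kl, mi_identity = mutual_information("the theory then")
print(abs(mi_kl - mi_identity) < 1e-9)   # True
```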

## Part 6: Kullback-Leibler Divergence
Recall that there is a close relationship between the entropy of a random variable and the most efficient way in which that random variable can be encoded as a binary sequence.  The Kullback-Leibler divergence
$$
D(P(X) || Q(X)) = \sum_{x\in\mathcal{X}} P(x) \lg \frac{P(x)}{Q(x)}
$$
measures the inefficiency (in extra bits) of encoding a random variable distributed as $P(X)$ with a code designed for $Q(X)$.  We have already seen the KL-divergence applied to answering the question "how much efficiency do we lose by assuming independence", but we can use this more generally.  In particular, please answer the question: "how many extra bits do I lose by encoding English characters using a code optimized for Spanish?"  Stated alternatively, **what is the KL-divergence between $P(X)$ - defined as the unigram distribution computed from the English corpus that we've already been working with - and $Q(X)$ - defined as the unigram distribution computed from a Spanish corpus** (the Spanish corpus `cess_esp` was already downloaded and imported in the first code snippet).  

Comment on your result, in particular whether it says anything about the universality of language.  Do you think your result would change if you considered joint distributions rather than univariate ones?
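
A minimal sketch of the divergence computation on toy counts follows. The function name and the choice to raise an error are assumptions made for this example: if some symbol occurs under $P$ but never under $Q$, the divergence is infinite, and you may prefer to handle that case differently (for instance, by smoothing the counts).

```python
from collections import Counter
from math import log2

def kl_divergence(p_counts, q_counts):
    """D(P || Q) in bits, from two Counters of symbol counts."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    total = 0.0
    for symbol, count in p_counts.items():
        if q_counts[symbol] == 0:
            # Q gives this symbol zero probability, so the divergence is infinite
            raise ValueError(f"Q assigns zero probability to {symbol!r}")
        p = count / p_total
        q = q_counts[symbol] / q_total
        total += p * log2(p / q)
    return total

p = Counter("aab")   # P: a = 2/3, b = 1/3
q = Counter("abb")   # Q: a = 1/3, b = 2/3
d = kl_divergence(p, q)   # (2/3) lg 2 + (1/3) lg (1/2) = 1/3 bit
print(d)
```

With the corpora, `p_counts` and `q_counts` would be the character counts of the processed English and Spanish strings. Keep in mind that the preprocessing function above removes accented characters, which affects the Spanish distribution.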

# Huffman Coding
### (no need to work on this part quite yet - we will get there soon)
We are already familiar with Huffman codes: they are the binary sequences of answers that optimally encode a random variable (optimal with respect to minimizing expected number of questions), and as such are deeply tied to entropy.  **Create a method that builds the Huffman coding tree given a sequence of characters.**  *You will want to build some simple test cases to ensure correct functionality*.  Once you are sure that your method is working, construct the Huffman coding tree for the Brown corpus described above.  
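
If you are unsure where to start, the sketch below shows one common way to build a Huffman code with a priority queue from Python's `heapq` module. It is an illustration under stated assumptions, not the required interface: the function name `huffman_codes` is made up, it returns a code table rather than a tree root, and it represents the tree as nested tuples (a leaf is a character, an internal node is a `(left, right)` pair).

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Return a dict mapping each character of `text` to a prefix-free bit string."""
    freqs = Counter(text)
    tiebreak = count()   # unique integers, so the heap never has to compare trees
    heap = [(n, next(tiebreak), ch) for ch, n in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate one-symbol alphabet
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)   # the two least frequent subtrees
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: recurse into children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: record the accumulated path
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("aaaabbc")
encoded = "".join(codes[ch] for ch in "aaaabbc")
print(codes, len(encoded))
```

In this toy example the most frequent character `a` receives a 1-bit code and the rarer `b` and `c` receive 2-bit codes.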

In [None]:
def build_huffman_tree(text):
    """
    Build a Huffman tree from the given text
    
    Args:
        text: Input string
        
    Returns:
        root: Root node of Huffman tree
    """

With this tree in hand, **encode the Brown corpus.**    

In [None]:
def huffman_encoding(text):
    """
    Perform Huffman coding on the input text
    
    Args:
        text: Input string containing lowercase letters, digits, or spaces
        
    Returns:
        encoded_text: Binary string of the encoded text
        huffman_tree: Root node of the Huffman tree (for decoding)
        codes: Dictionary mapping characters to their Huffman codes
    """
    
    return encoded_text, huffman_tree, codes

**Report the compression factor** (the ratio of bits required to represent the unencoded and encoded versions of the corpus).  

**Report the average number of bits used to encode each symbol in the corpus**.  Compare this to the entropy that you calculated previously.  How does your Huffman coding scheme compare to the entropy (which provides the theoretical lower limit on this quantity)? 
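
These two quantities can be computed directly from the code table. The sketch below uses a small hand-written prefix code rather than one built from the corpus, and it assumes that the unencoded text costs 8 bits per character; that baseline (the `raw_bits_per_char` parameter, a name invented here) is a choice you should state explicitly, since a fixed-width code for 37 symbols would need only 6 bits.

```python
from collections import Counter
from math import log2

def coding_stats(text, codes, raw_bits_per_char=8):
    """Return (average bits per symbol, compression factor, unigram entropy)."""
    encoded_bits = sum(len(codes[ch]) for ch in text)
    avg_bits = encoded_bits / len(text)
    compression = raw_bits_per_char * len(text) / encoded_bits
    counts = Counter(text)
    entropy = -sum(
        (n / len(text)) * log2(n / len(text)) for n in counts.values()
    )
    return avg_bits, compression, entropy

toy_text = "aaaabbcc"
toy_codes = {"a": "0", "b": "10", "c": "11"}   # hand-written prefix code
avg_bits, compression, h = coding_stats(toy_text, toy_codes)
print(avg_bits, h)   # 1.5 1.5
```

The toy probabilities are all powers of 1/2, so the average code length meets the entropy exactly; for a real corpus you should expect the Huffman average to sit somewhat above the entropy, by less than one bit per symbol.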