# Entropy and cross entropy of "High Flight"

"High Flight" is a famous poem. Read the contents of 'HighFlight.txt' as a list of characters, then calculate the following.

1) What is the entropy of the characters in this poem (including the special characters for simplicity)? 

2) If these characters appeared with equal probability, you would expect the entropy to be higher. What would the entropy of the set of characters in this poem be if they were uniformly distributed?

3) Let p be the frequency distribution of characters in the poem and q be the uniform distribution over those same characters. Calculate the cross entropies H(p,q) and H(q,p) and the KL divergences KL(p||q) and KL(q||p). Verify that H(p,q) = H(p) + KL(p||q).
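
For reference, the quantities in question 3 can be written out as follows. This is a sketch using the standard definitions, with sums over the symbols $i$ of the alphabet and the natural logarithm assumed (the code below uses `log`, so results are in nats):

$$
H(p) = -\sum_i p_i \log p_i, \qquad
H(p,q) = -\sum_i p_i \log q_i, \qquad
\mathrm{KL}(p\,\|\,q) = \sum_i p_i \log \frac{p_i}{q_i}
$$

Expanding the logarithm of the ratio gives the identity to be verified:

$$
H(p) + \mathrm{KL}(p\,\|\,q)
= -\sum_i p_i \log p_i + \sum_i p_i \log p_i - \sum_i p_i \log q_i
= H(p,q)
$$

Swapping the roles of p and q gives H(q,p) = H(q) + KL(q||p), so KL(q||p) is measured against the entropy of q, not of p.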

In [18]:

using Pkg
using StatsBase


In [21]:
# Here is some seed code to open the file, get the text, and collect the set of unique symbols

f = open("HighFlight.txt", "r")
contents = read(f, String)
close(f)

stList = collect(contents)
alphabet = Set(stList)  # set of symbols in the string
println("Alphabet of symbols in the string:")
println(alphabet)



Alphabet of symbols in the string:
Set(['E', 'i', 's', ',', 'O', '\'', 'W', 'T', 'y', 'k', 'S', 'w', 'r', ';', 'c', 'U', 'e', 'j', 'n', 'f', 'M', '-', 'b', '!', 'A', ' ', 'g', 'h', 'P', 'I', 'H', 'a', 'p', 'd', 'G', 't', '.', 'v', 'm', 'o', '\r', 'u', 'Y', 'l'])
44

In [22]:
poem_length = length(contents)
num_chars = length(alphabet)

44

In [23]:
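println("1. Entropy for characters in poem:")
counts = countmap(stList)                      # character => number of occurrences
p = collect(values(counts)) ./ poem_length     # empirical probabilities (sum to 1)
entropy = -sum(p .* log.(p))                   # natural log, so the result is in nats
println(entropy)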

1. Entropy for characters in poem:


3.1374749890781697
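
The value above is in nats because `log` is the natural logarithm. As a minimal sketch of the same calculation on a toy string (not the poem), the block below uses only Base Julia and also converts the result to bits. The helper name `char_entropy` and the sample text are assumptions made for illustration.

```julia
# Hypothetical helper: Shannon entropy (in nats) of the characters in a string.
function char_entropy(text::AbstractString)
    counts = Dict{Char,Int}()
    for c in text
        counts[c] = get(counts, c, 0) + 1      # tally each character
    end
    n = length(text)                           # total number of characters
    probs = [v / n for v in values(counts)]    # empirical probabilities
    return -sum(probs .* log.(probs))
end

sample = "aabb"                  # toy input: two symbols, equally frequent
h_nats = char_entropy(sample)    # should be close to log(2)
h_bits = h_nats / log(2)         # convert nats to bits
println("nats: ", h_nats, ", bits: ", h_bits)
```

Dividing by `log(2)` is the only change needed to report the poem's entropy in bits instead.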


In [24]:

println("2. Entropy if each character was equally likely:")
entropy_uni = - log(1/num_chars)
println(entropy_uni)
println()

2. Entropy if each character was equally likely:
3.784189633918261
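
The printed value follows directly from the definition of entropy. As a short worked derivation, with $n$ the number of distinct symbols and $q_i = 1/n$:

$$
H(q) = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n
$$

With $n = 44$ this gives $\log 44 \approx 3.784$ nats (about $5.46$ bits), which is the largest entropy any distribution over 44 symbols can have.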



In [29]:
# Cross entropies:
println("Cross entropies:")
q = ones(num_chars) ./ num_chars
H_pq = - sum(p .* log.(q))
H_qp = - sum(q .* log.(p))

println("H(p, q): ", H_pq)
println("H(q, p): ", H_qp)
println()

# KLs:
println("KL divergences:")
KL_pq = H_pq - entropy       # KL(p || q) = H(p, q) - H(p)
KL_qp = H_qp - entropy_uni   # KL(q || p) = H(q, p) - H(q), where H(q) is the uniform entropy

println("KL(p || q): ", KL_pq)
println("KL(q || p): ", KL_qp)
println()

# Compare the directly computed cross entropy with H(p) + KL(p||q)
println("From direct calculation we find:")
println("H(p, q) = ", H_pq)
println("Composing entropy and KL yields:")

# Verify that H(p,q) = H(p) + KL(p||q)
println("H(p) + KL(p||q) = ", entropy + KL_pq)

Cross entropies:
H(p, q): 3.784189633918261
H(q, p): 4.662679575931986

KL divergences:
KL(p || q): 0.6467146448400913
KL(q || p): 0.878489942013725

From direct calculation we find:
H(p, q) = 3.784189633918261
Composing entropy and KL yields:
H(p) + KL(p||q) = 3.784189633918261
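
The check above adds KL(p||q) back to H(p) after KL(p||q) was itself obtained as H(p,q) - H(p), so it holds by construction. A more independent check computes the KL divergence directly from its definition, the sum of p log(p/q). The block below is a minimal sketch of that idea on a toy string, using only Base Julia. The names `char_probs`, `shannon`, `cross_entropy`, `kl`, `p_toy`, and `q_toy` are hypothetical and chosen so they do not overwrite the notebook's variables.

```julia
# Empirical character distribution of a string (toy input, not the poem).
function char_probs(text::AbstractString)
    counts = Dict{Char,Int}()
    for c in text
        counts[c] = get(counts, c, 0) + 1
    end
    return [v / length(text) for v in values(counts)]
end

shannon(p) = -sum(p .* log.(p))             # H(p), in nats
cross_entropy(p, q) = -sum(p .* log.(q))    # H(p, q)
kl(p, q) = sum(p .* log.(p ./ q))           # KL(p || q), straight from the definition

p_toy = char_probs("abracadabra")
q_toy = fill(1 / length(p_toy), length(p_toy))   # uniform over the same symbols

# Both identities should hold up to floating-point rounding.
println(cross_entropy(p_toy, q_toy) ≈ shannon(p_toy) + kl(p_toy, q_toy))
println(cross_entropy(q_toy, p_toy) ≈ shannon(q_toy) + kl(q_toy, p_toy))
```

Because q is uniform, the order in which the dictionary returns its values does not matter here. For two non-uniform distributions, both vectors would need to be built in the same symbol order.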
