In [1]:
! wget http://www.gutenberg.org/files/2600/2600-0.txt

--2019-06-02 15:28:15--  http://www.gutenberg.org/files/2600/2600-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3359545 (3.2M) [text/plain]
Saving to: ‘2600-0.txt’


2019-06-02 15:28:17 (2.04 MB/s) - ‘2600-0.txt’ saved [3359545/3359545]



In [21]:
import re
with open("2600-0.txt", encoding="utf-8") as train_file:
    text = train_file.read()
    # Strip newlines and tabs so they are not counted as symbols
    text = re.sub(r'\n|\t', '', text)

## Parameter estimates

$p(s_i)$: Treat this as an estimate of a categorical distribution, such that a letter $\bar{s}$ (a one-hot encoded vector) is drawn from it: \begin{equation}\bar{s} \sim \textbf{Cat}(\bar{p}) \end{equation} with $\bar{p}$ a vector of category probabilities. The maximum likelihood estimate of $\bar{p}$ given observed draws is $p_i = \frac{c_i}{\sum_k c_k}$, i.e. the count of category $i$ divided by the total count. In small-data settings it is conventional to use a Dirichlet prior to ensure, for example, that none of the probabilities are zero. Here we have a very large training corpus and ignore this.

For the conditional distributions $p(s_n | s_{n-1})$ we use the same model, but parameterise a separate distribution for each value of $s_{n-1}$. Because far fewer observations are available per preceding letter, we use a Dirichlet prior on $p$ with $\alpha_i = 1$ and take the posterior mean: \begin{equation}p_{ij} = \frac{c_{ij} + 1}{\sum_k c_{kj} + N}\end{equation} with $N$ the number of unique characters and $c_{ij}$ the count of letter $s_i$ following letter $s_j$. The sum in the denominator runs over the following letter, so each column of $p_{ij}$ sums to one.
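As a quick check of the smoothed estimate, the following is a minimal sketch on a toy string. The function name `smoothed_transition_matrix` and the two-letter alphabet are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def smoothed_transition_matrix(text, alphabet):
    """Return P with P[i, j] = p(next = alphabet[i] | previous = alphabet[j])."""
    index = {ch: k for k, ch in enumerate(alphabet)}
    n = len(alphabet)
    counts = np.ones((n, n))  # one pseudo-count per cell (alpha = 1)
    for prev, cur in zip(text, text[1:]):
        counts[index[cur], index[prev]] += 1
    # Normalise over the next letter (rows) for each previous letter (column)
    return counts / counts.sum(axis=0, keepdims=True)

alphabet = ["a", "b"]  # hypothetical toy alphabet
P = smoothed_transition_matrix("aaab", alphabet)
print(P)
```

In `"aaab"` the letter `a` is followed by `a` twice and by `b` once, so $p(a|a) = (2+1)/(3+2) = 0.6$. The letter `b` is never followed by anything, so its column falls back to the uniform value $1/2$.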

In [41]:
from collections import Counter 
import numpy as np
import pandas as pd
counts = Counter(text)
count_total = sum(counts.values())
p_i = np.array([i / count_total for s, i in counts.items()])
letter_types = [s for s in counts.keys()]

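def generate_conditional_counts(text):
    ## build a 2-D array such that p(s_n = i | s_{n-1} = j) = arr[i, j]
    ## start from ones: the alpha = 1 Dirichlet pseudo-counts
    index = {s: k for k, s in enumerate(letter_types)}
    conditional_counts = np.ones([len(letter_types), len(letter_types)])
    for i in range(1, len(text)):
        ## row = current letter, column = preceding letter
        conditional_counts[index[text[i]], index[text[i - 1]]] += 1
    ## normalise each column so it sums to 1 over the current letter
    return conditional_counts / np.sum(conditional_counts, axis=0)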

p_ij = generate_conditional_counts(text)

## 1. Table with letter frequencies

In [32]:
pd.DataFrame.from_dict({"Letter": letter_types, "Frequency": p_i})

Index,Frequency,Letter
0,9.489101e-07,î
1,2.837241e-03,”
2,1.261766e-02,","
3,2.252080e-04,L
4,2.028042e-02,u
5,2.847679e-03,“
6,1.107062e-05,Q
7,5.197686e-02,i
8,4.001237e-04,Y
9,3.677975e-02,d


## 2. Are the latent variables independent?

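No. Imagine a 2-letter alphabet $\{0,1\}$. Because $\sigma$ is a bijection, $\sigma(0) = 0 \Rightarrow \sigma(1) = 1$: knowing one value of $\sigma$ constrains the others, so they cannot be independent.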
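The dependence can be verified by enumeration. This is a small sketch that assumes a uniform prior over all permutations of the alphabet; the notebook does not state the prior, so treat that as an assumption.

```python
from itertools import permutations

alphabet = [0, 1]
# Each tuple s is one bijection sigma, with s[k] the image of symbol k.
# Assumption: every bijection is equally likely a priori.
sigmas = list(permutations(alphabet))

# Marginal probability that sigma(1) = 1
p_sigma1_is_1 = sum(s[1] == 1 for s in sigmas) / len(sigmas)

# Same probability after conditioning on sigma(0) = 0
given = [s for s in sigmas if s[0] == 0]
p_sigma1_is_1_given = sum(s[1] == 1 for s in given) / len(given)

print(p_sigma1_is_1, p_sigma1_is_1_given)  # prints 0.5 1.0
```

The marginal and conditional probabilities differ, which is exactly what independence forbids.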



## 3. Joint probabilities

\begin{align}p(s_1, \ldots, s_n, e_1, \ldots, e_n \mid \sigma) &= p(e_1, \ldots, e_n \mid s_1, \ldots, s_n, \sigma)\, p(s_1, \ldots, s_n) \\&=\prod_i\delta_{\sigma(s_i),e_i}\, p(s_1, \ldots, s_n)\end{align}

where $\delta_{ij}$ is the Kronecker delta.

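Alternatively, let $\alpha = \sigma^{-1}$. Then, for the only configuration with non-zero probability, $s_i = \alpha(e_i)$,
\begin{align}p(s_1, \ldots, s_n, e_1, \ldots, e_n \mid \alpha) &= p(\alpha(e_1), \ldots, \alpha(e_n))\end{align}
which is well defined because $\sigma$ is bijective.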
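For long texts this probability underflows, so it is usually evaluated in log space. The sketch below assumes the first-order Markov model used for the parameter estimates, and it stores $\alpha$ as an integer index array rather than a matrix. The function name `log_likelihood` and the toy probabilities are illustrative assumptions.

```python
import numpy as np

def log_likelihood(encrypted_idx, alpha, log_p_i, log_p_ij):
    """log p(alpha(e_1), ..., alpha(e_n)) under a first-order Markov chain.

    encrypted_idx : integer array of ciphertext symbol indices
    alpha         : integer array, alpha[k] is the decoded index of symbol k
    log_p_ij[i, j]: log p(s_n = i | s_{n-1} = j)
    """
    decoded = alpha[encrypted_idx]
    # First symbol uses the marginal, every later one the conditional
    return log_p_i[decoded[0]] + log_p_ij[decoded[1:], decoded[:-1]].sum()

# Hypothetical two-symbol model; columns of p_ij sum to one
log_p_i = np.log(np.array([0.5, 0.5]))
log_p_ij = np.log(np.array([[0.9, 0.2],
                            [0.1, 0.8]]))
encrypted_idx = np.array([0, 0, 1])

ll_identity = log_likelihood(encrypted_idx, np.array([0, 1]), log_p_i, log_p_ij)
ll_swapped = log_likelihood(encrypted_idx, np.array([1, 0]), log_p_i, log_p_ij)
print(np.exp(ll_identity), np.exp(ll_swapped))
```

With the identity mapping the decoded sequence is `0, 0, 1`, giving $0.5 \times 0.9 \times 0.1 = 0.045$. With the swapped mapping it is `1, 1, 0`, giving $0.5 \times 0.8 \times 0.2 = 0.08$.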



## 4. Proposal and acceptance probabilities

$\alpha$ is effectively a permutation matrix that maps the encrypted symbols back to the decrypted ones. The proposal distribution amounts to applying to $\alpha$ a random permutation matrix that swaps any 2 rows. A swap is its own inverse and is chosen with the same probability in either direction, so the proposal is symmetric and the acceptance probability is \begin{equation} \min\left(1, \frac{p(\alpha_m(e_1), \ldots, \alpha_m(e_n)) }{ p(\alpha_{m-1}(e_1), \ldots, \alpha_{m-1}(e_n))}\right) \end{equation} where $\alpha_m$ is the proposed mapping and $\alpha_{m-1}$ the current one.
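The two ingredients can be sketched as follows, again with $\alpha$ stored as an index array instead of a matrix. The helper names `propose_swap` and `accept` are hypothetical, and the acceptance test is written in log space on the assumption that log-likelihoods are available.

```python
import numpy as np

def propose_swap(alpha, rng):
    """Return a copy of alpha with two distinct entries exchanged."""
    i, j = rng.choice(len(alpha), size=2, replace=False)
    proposal = alpha.copy()
    proposal[[i, j]] = proposal[[j, i]]
    return proposal

def accept(log_p_new, log_p_old, rng):
    """Metropolis rule for a symmetric proposal: accept with prob min(1, ratio)."""
    return np.log(rng.random()) < log_p_new - log_p_old

rng = np.random.default_rng(0)
alpha = np.arange(5)           # hypothetical 5-symbol identity mapping
new_alpha = propose_swap(alpha, rng)
print(alpha, new_alpha)
```

Comparing $\log u$ with the log-ratio is equivalent to comparing $u$ with the ratio itself, and a proposal that increases the likelihood is always accepted because $\log u \le 0$.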


## 5. Implementing the Metropolis-Hastings (MH) algorithm

In [51]:
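def decode(input_text, p_i, p_ij):
    # Initial guess for alpha: the identity permutation matrix
    alpha = np.eye(len(p_ij))
    print(alpha)

# NOTE: no ciphertext is loaded in this notebook; the training text is
# passed here only as a stand-in so that the call runs.
decode(text, p_i, p_ij)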
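The cell above only initialises $\alpha$. One way the sampling loop might be completed is sketched below. This is not the author's implementation: the name `mh_decode`, the index-array representation of $\alpha$, the iteration count, and the three-symbol toy model are all assumptions made for illustration.

```python
import numpy as np

def mh_decode(encrypted_idx, p_i, p_ij, n_iter=5000, seed=0):
    """Metropolis-Hastings over decoding maps; returns the best map seen."""
    rng = np.random.default_rng(seed)
    log_p_i, log_p_ij = np.log(p_i), np.log(p_ij)

    def log_lik(alpha):
        s = alpha[encrypted_idx]
        return log_p_i[s[0]] + log_p_ij[s[1:], s[:-1]].sum()

    alpha = np.arange(len(p_i))          # start from the identity mapping
    current = log_lik(alpha)
    best_alpha, best = alpha.copy(), current
    for _ in range(n_iter):
        # Symmetric proposal: swap the images of two distinct symbols
        i, j = rng.choice(len(alpha), size=2, replace=False)
        proposal = alpha.copy()
        proposal[[i, j]] = proposal[[j, i]]
        candidate = log_lik(proposal)
        # Accept with probability min(1, p(proposal) / p(current))
        if np.log(rng.random()) < candidate - current:
            alpha, current = proposal, candidate
            if current > best:
                best_alpha, best = alpha.copy(), current
    return best_alpha, best

# Hypothetical toy model: symbols tend to follow the cycle 0 -> 1 -> 2 -> 0
p_i = np.array([0.5, 0.3, 0.2])
p_ij = np.array([[0.05, 0.05, 0.90],
                 [0.90, 0.05, 0.05],
                 [0.05, 0.90, 0.05]])
encrypted_idx = np.array([2, 0, 1] * 10)
best_alpha, best_score = mh_decode(encrypted_idx, p_i, p_ij, n_iter=200)
print(best_alpha, best_score)
```

To apply this to real text, the characters would first be converted to indices into `letter_types`. Tracking the best mapping seen so far is a convenience for reporting a single decoding; the chain itself is defined by the accepted states.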