# Murder Mystery 1: Tamam Shud

In December 1948 the body of a man was found near Adelaide, Australia. To this day his identity is unknown and his death remains a mystery. One of the few clues to what happened is a small piece of paper with the words "Tamam Shud" written on it. The piece turned out to have been torn from the last page of the "Rubaiyat" by Omar Khayyam, and the police managed to find the copy of the book from which it was torn. This book had some letters written inside the cover:

**WRGOABABD**

**MLIAOI**

**WTBIMPANETP**

**MLIABOAIAQC**

**ITTMTSAMSTGAB**

The second line seems to have been crossed out. Its similarity to the penultimate line suggests that it was a mistake, so we leave it out of the analysis below.

What does this mean? Is it some kind of code?

![Tamam Shud](https://storage.googleapis.com/big-data-course-datasets/Actual-tamam-shud.jpg)

![Code](https://storage.googleapis.com/big-data-course-datasets/SomertonManCode.jpg)

In [1]:
code=["WRGOABABD",
      #"MLIAOI",  # crossed-out second line, excluded from the analysis
      "WTBIMPANETP",
      "MLIABOAIAQC",
      "ITTMTSAMSTGAB"]

One theory we can start with is that the letters are a shorthand code for a sentence, in which only the starting letter of each word is written down. We can test this theory by checking whether the distribution of letters in the code is compatible with the distribution of starting letters of words in the English language.
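To make the hypothesis concrete, here is a minimal sketch of the transformation it assumes. The helper name `initials` and the example sentence are hypothetical and chosen only for illustration; the sentence is not a proposed decoding.

```python
def initials(sentence):
    """Return the upper-cased first letter of every word in the sentence."""
    return "".join(word[0].upper() for word in sentence.split())

# Hypothetical example sentence, used only to show the transformation.
print(initials("the quick brown fox"))
```

If the code was produced this way, its letters should follow the distribution of word-initial letters rather than the distribution of letters in general.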

We first need to find the distribution of starting letters in English. For that we will use data from [Project Gutenberg](https://www.gutenberg.org/), which makes a large number of books available in digital form.

In [2]:
from scipy import stats
import numpy as np
import pandas as pd

books=spark.sparkContext.textFile("gs://big-data-course-datasets/gutenberg/").cache()

We can use Spark to find the counts of each starting letter.

In [3]:
import string

counts=books.flatMap(lambda x: x.replace(",", "").split()) \
    .filter(lambda x: len(x)>0) \
    .map(lambda x: x[0].upper()) \
    .filter(lambda x: x in string.ascii_uppercase) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x,y: x+y) \
    .collect()
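For readers without a Spark cluster at hand, the same counting logic can be sketched in plain Python on a small in-memory sample. This is an illustration under assumptions: the function name `starting_letter_counts` and the sample lines are hypothetical, and the sketch mirrors the pipeline steps (strip commas, split on whitespace, take the first character, keep only A to Z) without reading the Gutenberg data.

```python
import string
from collections import Counter

def starting_letter_counts(lines):
    """Count upper-cased first letters of words, mirroring the Spark steps."""
    counter = Counter()
    for line in lines:
        # str.split() never yields empty strings, so word[0] is always valid.
        for word in line.replace(",", "").split():
            first = word[0].upper()
            if first in string.ascii_uppercase:
                counter[first] += 1
    return counter

# Hypothetical sample text, standing in for the lines of the books.
sample_lines = ["The cat sat, then the dog ran.", "A bird sang."]
print(starting_letter_counts(sample_lines))
```

Words that begin with a character outside A to Z, such as an opening quotation mark, are dropped rather than cleaned, just as in the Spark version.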

In [4]:
counts

[('C', 73345),
 ('N', 48820),
 ('H', 151054),
 ('Y', 32791),
 ('X', 100),
 ('O', 130771),
 ('W', 137955),
 ('S', 162160),
 ('D', 55619),
 ('K', 14483),
 ('J', 13206),
 ('G', 40746),
 ('A', 242153),
 ('V', 12893),
 ('F', 78982),
 ('M', 91287),
 ('R', 34603),
 ('P', 53167),
 ('Q', 3571),
 ('B', 97838),
 ('T', 347977),
 ('L', 56497),
 ('I', 128541),
 ('E', 41718),
 ('U', 31154),
 ('Z', 1058)]

Similarly, we can count the letters in the code and compare this to the expected counts if the code is indeed starting letters from English sentences.

In [5]:
counts_by_letter=dict(counts)  # build the lookup once instead of on every iteration
book_frequencies=np.array([counts_by_letter.get(letter, 0) for letter in string.ascii_uppercase], dtype=float)

book_frequencies=book_frequencies/book_frequencies.sum()  # normalise counts to probabilities

In [6]:
code_frequencies=np.zeros(len(string.ascii_uppercase))
for letter in "".join(code):
    code_frequencies[string.ascii_uppercase.index(letter)]+=1


In [7]:
data=pd.DataFrame(data={"letter": list(string.ascii_uppercase), 
                   "book_freq": book_frequencies, 
                   "code_freq": code_frequencies})

data["expected_freq_book"]=data["code_freq"].sum()*data["book_freq"]

In [8]:
import seaborn as sns

ax = sns.barplot(x="letter", 
            y="val",
            hue="frequencies",
            data=data[["letter", "code_freq", "expected_freq_book"]].melt(id_vars=["letter"], var_name="frequencies", value_name="val"))

Plotting the expected counts next to the actual counts shows that the two distributions are not too far apart. We can also test this formally with a chi-squared test:

In [9]:
stats.chisquare(data["code_freq"], data["expected_freq_book"])

Power_divergenceResult(statistic=30.940002102352494, pvalue=0.1910131742960432)

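The p-value of the test is 19%. This means that if the code really consists of starting letters drawn from this distribution, there is roughly one chance in five that the letters would fit the distribution at least this poorly, so we cannot reject the hypothesis. Note that the code contains only 44 letters spread over 26 categories, so most expected counts are below 5 and the chi-squared approximation is rough; the p-value should be read as indicative rather than exact.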

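We can compare this to a uniform distribution of letters. The expected counts must add up to the same total as the observed counts, so the uniform probabilities are scaled by the number of letters in the code:

In [10]:
# Scale the uniform probabilities to expected counts; recent SciPy versions
# raise an error if the observed and expected totals do not match.
uniform_expected=data["code_freq"].sum()*np.ones(len(data))/len(data)
stats.chisquare(data["code_freq"], uniform_expected)

For the 44 letters of the code this gives a statistic of about 64.7 on 25 degrees of freedom, which corresponds to a p-value below 0.001. The uniform distribution is therefore a much worse fit.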
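The statistic behind the chi-squared test is the sum over all categories of (O - E)² / E, where O are the observed counts and E are the expected counts, scaled to the same total as the observed counts. As a cross-check for the uniform case, here is a minimal pure-Python sketch in which each of the 26 letters is expected 44/26 times. The helper name `chi_squared_statistic` is ours and is not part of SciPy.

```python
import string
from collections import Counter

code = ["WRGOABABD", "WTBIMPANETP", "MLIABOAIAQC", "ITTMTSAMSTGAB"]

def chi_squared_statistic(observed, expected):
    """Pearson's statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed count of each letter A..Z in the code (0 for letters not present).
letter_counts = Counter("".join(code))
observed = [letter_counts.get(letter, 0) for letter in string.ascii_uppercase]

total = sum(observed)                  # number of letters in the code
expected_uniform = [total / 26] * 26   # same total as the observed counts

print(total, round(chi_squared_statistic(observed, expected_uniform), 3))
# prints: 44 64.727
```

With 25 degrees of freedom the 0.1% critical value of the chi-squared distribution is about 52.6, so a statistic near 64.7 rejects the uniform distribution at that level.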

So the letter frequencies of the code are consistent with the starting letters of words in English. But perhaps the code is not made of starting letters, but of some mutation of words. How well does the code fit the general usage of letters in the English language?

To answer this, redo the calculation of the letter frequencies, but this time take all letters into account, not just the start of each word.
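One possible starting point is sketched here in plain Python on a hypothetical two-line sample rather than on the Gutenberg data: count every alphabetic character instead of only the first character of each word. The function name `all_letter_counts` and the sample lines are assumptions made for the illustration. In the Spark pipeline, the analogous change would be to emit every character of each word (for example with `flatMap`) instead of mapping each word to `x[0]`; that variant is not run here.

```python
import string
from collections import Counter

def all_letter_counts(lines):
    """Count every A-Z letter in the text, ignoring case and other characters."""
    counter = Counter()
    for line in lines:
        for ch in line.upper():
            if ch in string.ascii_uppercase:
                counter[ch] += 1
    return counter

# Hypothetical sample text, standing in for the lines of the books.
sample_lines = ["The cat sat.", "A bird sang."]
counts_all = all_letter_counts(sample_lines)

# Normalise to relative frequencies in alphabetical order, as for the
# starting-letter distribution.
n_letters = sum(counts_all.values())
letter_frequencies = [counts_all.get(letter, 0) / n_letters
                      for letter in string.ascii_uppercase]
print(counts_all.most_common(3))
```

The resulting frequencies can then take the place of `book_frequencies` in the comparison with the code.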