# Cryptanalysis Report
---

## Overview

This report breaks down the tools and methods used to decode a shift cipher and a substitution cipher. The plaintexts were revealed by comparing letter frequencies, recognizing patterns, and working through various phases of trial and (mostly) error. I also wrote Python code to manipulate and reorganize the data into useful metrics.

## Shift Cipher

Compared to the substitution cipher, breaking the shift cipher was fairly intuitive. If we examine the formula for decryption:

$d_K(y) = (y - K) \mod 26$

**where**

$K$ **is the key, drawn from the keyspace** $0 \le K \le 25$

$y$ **is a character of the encrypted message, encoded as an integer from 0 to 25**

$d_K$ **is the decryption function**

We can deduce that a single integer between 0 and 25 shifts every character.

Since

$d_K(e_K(x)) = x \quad \forall x \in P$

the same key $K$ that encrypted the plaintext also decrypts it, so only 26 candidate keys need to be tried. I wrote a `for` loop with 26 iterations that applied each shift to every character of the ciphertext and printed all 26 candidates. I then examined each candidate to see whether one was coherent or intelligible. One was: **"It is better to create than to learn. Creating is the essence of life"**.
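The brute-force loop described above can be sketched roughly as follows. This is a minimal illustration rather than the code used for the report: the function names `shift_decrypt` and `brute_force`, and the demo key of 3, are assumptions made for the example.

```python
import string

ALPHABET = string.ascii_uppercase

def shift_decrypt(ciphertext, key):
    """Apply d_K(y) = (y - K) mod 26 to each letter; other characters pass through."""
    out = []
    for ch in ciphertext.upper():
        if ch in ALPHABET:
            y = ALPHABET.index(ch)
            out.append(ALPHABET[(y - key) % 26])
        else:
            out.append(ch)
    return "".join(out)

def brute_force(ciphertext):
    """Return every candidate plaintext, keyed by the shift that produced it."""
    return {key: shift_decrypt(ciphertext, key) for key in range(26)}

# Hypothetical demo: decrypting with -3 is the same as encrypting with key 3.
sample = shift_decrypt("IT IS BETTER TO CREATE", -3)
for key, candidate in brute_force(sample).items():
    print(f"{key:2d}: {candidate}")
```

Only one of the 26 printed lines should read as English, which is the candidate to pick by eye.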


## Substitution Cipher

The substitution cipher was considerably more challenging, and required several days of continuous brainstorming before the code was *cracked*. The first step was converting some of the data provided in the English statistics PDF from Canvas, specifically the bigram matrix and the character probabilities, into a format that I could load into my program. I followed the same approach with the characters in the ciphertext, counting their occurrences to elicit any useful insights and deductions.

**Frequency Analysis**

Initially, I compared the frequencies and occurrence counts of individual letters between the English statistics and the ciphertext's statistics. At this point it was very much educated guessing: I'd choose pairs of characters, one from English and one from the cipher, whose frequencies made probabilistic sense together. Unfortunately, this wasn't returning anything tangibly useful. Since I was splicing in letters without being able to test and verify whether each one was correct, I realized I needed more information, or **a lot** of luck.
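As a rough sketch of the counting step, the relative frequency of each cipher letter could be computed as below. The function name `letter_frequencies` is hypothetical, and the demo string is just the short fragment **BYYSYYBGB** that appears later in this report, not the full ciphertext.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: relative frequency}, most frequent first. Non-letters are ignored."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        return {}
    total = len(letters)
    return {ch: n / total for ch, n in Counter(letters).most_common()}

# Demo on a short fragment; the real analysis would use the whole ciphertext.
for letter, freq in letter_frequencies("BYYSYYBGB").items():
    print(f"{letter}: {freq:.3f}")
```

The resulting ranking can then be set beside the English letter probabilities to propose candidate pairings, with the caveat noted above that single-letter frequencies alone rarely settle a mapping.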

**Bigrams**

Fortunately, I've previously taken a machine learning course, and I am concurrently taking a course in natural language processing (NLP). Recently, we started working with n-grams and co-occurrence matrices to predict text, and I realized there were common threads between NLP and this problem. NLP predictions are founded on the likelihood of an occurrence given that $x$ has already occurred, which is the conditional probability at the heart of Bayes' theorem.

The next step from here was to compile bigrams of the ciphertext. I already had the bigram data for the characters used in the English language from the PDF. This is where the first break came. I discovered in the cipher's bigram matrix that the character "**W**" had *zero* co-occurrence with any letter, save one: "**S**".
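A check like the one that surfaced **W** can be sketched as follows. It is an illustration under assumptions: the function names are hypothetical, `collections.Counter` stands in for the bigram matrix, and the demo string is a made-up sample rather than the actual ciphertext.

```python
from collections import Counter, defaultdict

def bigram_counts(text):
    """Count adjacent letter pairs. Non-letters are dropped first, so pairs
    may span word boundaries if the text contains spaces."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    return Counter(a + b for a, b in zip(letters, letters[1:]))

def exclusive_followers(text):
    """Map each letter that is only ever followed by one letter to that follower."""
    followers = defaultdict(set)
    for pair in bigram_counts(text):
        followers[pair[0]].add(pair[1])
    return {first: next(iter(s)) for first, s in followers.items() if len(s) == 1}

# Made-up sample for illustration only.
sample = "SWSKVSWSYVS"
print(bigram_counts(sample).most_common(3))
print(exclusive_followers(sample))
```

On a sample this short several letters look exclusive. On a full ciphertext, a letter that still has a single partner after many occurrences is the interesting signal.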

With W pairing with S four times yet never with another character, I knew it had to be a very uncommon letter that appeared only in tandem with one specific letter. This suggested the classic pairing of **QU**. But after further review, it seemed increasingly incorrect. Soon after, I realized that in almost all instances of **WS**, it was preceded by another **S**. After this discovery, I was confident **SWS** mapped to **EVE**. In turn, this demonstrated the value of trigrams.

**Trigrams**

I had to refactor and expand the initial bigram functionality to accommodate trigrams, and add a print function to show them. The row headings were the cipher's distinct characters, and the column headings were every bigram that occurred in the cipher. This generated a NumPy array with dimensions 22 x 150, which was too large for any meaningful interpretation. This challenge led to a "top $k$" function, which flattened and sorted the matrices and zipped the results into a Python dictionary of length $k$ that mapped each full trigram to its raw count or frequency.
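The idea behind the "top $k$" function can be sketched as below. Note that this version swaps in `collections.Counter` for the flattened NumPy matrix described above, so it is an equivalent shortcut rather than the implementation used here. The name `top_k_trigrams` and the demo string are assumptions.

```python
from collections import Counter

def top_k_trigrams(text, k=10):
    """Return a dict of the k most common trigrams mapped to their raw counts."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    trigrams = ("".join(t) for t in zip(letters, letters[1:], letters[2:]))
    return dict(Counter(trigrams).most_common(k))

# Made-up sample for illustration only.
print(top_k_trigrams("KVSYVSKVSVSG", k=3))
```

Dividing each count by the total number of trigrams would give frequencies instead of raw counts.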

To narrow my purview, I picked three trigrams from the top ten: **KVS**, **VSG**, and **YVS**. I picked them because they share characters, which let me cross-reference substitutions and use one trigram as a tool to solve another. I was determined to solve for **the** in the ciphertext, since **the** is the most common trigram in the English language.

I've maintained the assumption that **S** -> **e** and **W** -> **v**. The three trigrams **KVS**, **VSG**, and **YVS** all have an **S**. After applying the known substitutions in place:

**KVS** -> KV**e**

**VSG** -> V**e**G

**YVS** -> YV**e**

Homing in on **VS** in the ciphertext, it's probabilistically logical that **V** -> **h**, yielding the bigram **he**. With in-place substitutions, our trigrams are now:

**KVS** -> K**he** 

**VSG** -> **he**G 

**YVS** -> Y**he** 

Upon further review of **Khe** and **Yhe**, it's feasible that one of them maps to **she** and the other to **the**. The uncertainty was resolved when I noticed a peculiar sequence in the cipher: **YYSYY**. With current assumptions in place:

**YYSYY** -> YY**e**YY

if **Y** -> **t** then **YYSYY** -> **ttett**

if **Y** -> **s** then **YYSYY** -> **ssess**

I stewed on **ttett** to try and imagine plausible scenarios of its usage. With my limited capacity, I couldn't think of a plausible context for the arrangement **ttett**. But **ssess** *did* resemble accepted sequences. The arrangement **ssess** is also rare, appearing only in words like **possess** and **assess**. This resolved the **she** and **the** guesswork as well: **Y** maps to **s**, therefore **YVS** maps to **she**. With current assumptions in place:

**KVS** -> K**he**

**VSG** -> **he**G

**YVS** -> **she**

Since **YVS** -> **she**, the remaining candidate **KVS** must be **the**, so **K** likely maps to **t**. With current assumptions in place:

**KVS** -> **the**

**VSG** -> **he**G

**YVS** -> **she**

**YYSYY** -> **ssess**
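The **ttett** versus **ssess** comparison can also be mechanized. The sketch below substitutes each guess into the fragment and looks for it in a word list. The tiny `WORDS` list and the function names are hypothetical stand-ins for a real dictionary file.

```python
def substitute(fragment, mapping):
    """Replace every cipher letter that has a known mapping; leave the rest as-is."""
    return fragment.translate(str.maketrans(mapping))

def words_containing(pattern, words):
    """Return the words that contain the pattern as a substring."""
    return [w for w in words if pattern in w]

# Hypothetical mini word list; a real check would load a full dictionary.
WORDS = ["assess", "possess", "possessor", "better", "letter", "settle"]
known = {"S": "e"}

for guess in ("t", "s"):
    candidate = substitute("YYSYY", {**known, "Y": guess})
    print(candidate, words_containing(candidate, WORDS))
```

Because the partially decoded text appears to contain no spaces, a fragment could in principle straddle two words, so an empty match list is evidence against a guess rather than proof.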

At this point, major strides have been made. The logical next steps begin to appear as more letters are uncovered. For example, with **ssess** we can deduce that the letters just before the first **ss** and just after the second are most likely vowels. Widening the window around **YYSYY** in the ciphertext, we see **BYYSYYBGB**. With assumptions in place:

**BYYSYYBGB** -> B**ssess**BGB

I tested **B** mapping to **o**.

**BYYSYYBGB** -> **ossesso**G**o**

Now the word "possessor" seems to appear. After cross validating **G** mapping to **r** with the trigram **heG** (becoming **her**), I'm able to solve for another high frequency character in the ciphertext. My plaintext is now:

-O-O---SH-ETO---E-O-R-E------E---O---E-SE-THE-TO--HERTHE-OR--H-

-H-E-O--E-TO-O-------E----S--HERS-R-R-SESHETO---ETH-T-T--S---OS

S---E-ORSHE-E--EVE-HERSE--THEO----OSSESSORO-TH-T-OR--H--HSHE-E-

T--HER-E-OR-----H--HSHEH---EVER-R-TTE--O----O---H-VETO--HERTHET

R-TH--TO-----R--E-TSTR----ETOTE--HERTH-T--E--EH--REVE--E--TTO-E

Finally, I aim to solve for the last crucial vowel **a**. I scan the plaintext for sequences of **TH-T**, and test subbing in **a**.  Then **p** to complete **possessor**.  So on and so forth.
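The scan for **TH-T** can be sketched as follows: render the ciphertext with the mappings found so far, locate each occurrence of the pattern, and tally which cipher letter sits in the gap. The `KNOWN` dictionary reflects the substitutions derived in this report. The function names and the demo ciphertext (including the cipher letter **X**) are made up for illustration.

```python
from collections import Counter

# Cipher letter -> plaintext letter, as derived so far in this report.
KNOWN = {"S": "E", "W": "V", "V": "H", "Y": "S", "K": "T", "B": "O", "G": "R"}

def render(ciphertext, mapping):
    """Show known letters as plaintext and unknown ones as '-'."""
    return "".join(mapping.get(ch, "-") for ch in ciphertext)

def gap_letters(ciphertext, mapping, pattern):
    """Count the cipher letters found at the '-' position of each pattern match."""
    partial = render(ciphertext, mapping)
    gap = pattern.index("-")
    counts = Counter()
    start = partial.find(pattern)
    while start != -1:
        counts[ciphertext[start + gap]] += 1
        start = partial.find(pattern, start + 1)
    return counts

# Made-up ciphertext; "X" is a hypothetical, not-yet-solved cipher letter.
demo = "KVXKYVSKVXK"
print(render(demo, KNOWN))               # TH-TSHETH-T
print(gap_letters(demo, KNOWN, "TH-T"))  # Counter({'X': 2})
```

If one cipher letter dominates the gap position, it is a strong candidate for **a**, and the same helper can be reused for other partially solved patterns.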

## Conclusion

By analyzing character frequencies and letter patterns, and by iteratively refining character replacements, I was able to systematically break both ciphers. The methods employed allowed me to identify high-probability character mappings and validate them against common n-gram sequences. Relying on statistically likely substitutions minimized errors, and demonstrated the effectiveness of frequency analysis and pattern recognition in cryptanalysis.