## Stage 3

This cipher is a monoalphabetic substitution with homophones, which means that the same plaintext letter can be replaced by several cipher symbols.

In [1]:
import operator
from collections import Counter

In [2]:
# The ciphertext is a single line :)
with open("./cipher.txt") as f:
    lines = f.readlines()

This is the cipher text:

In [3]:
cipher_text = ''
for line in lines:
    if len(line) > 1:
        cipher_text += line.strip()
print(cipher_text)

IXDVMUFXLFEEFXSOQXYQVXSQTUIXWF*FMXYQVFJ*FXEFQUQXJFPTUFXMX*ISSFLQTUQXMXRPQEUMXUMTUIXYFSSFI*MXKFJF*FMXLQXTIEUVFXEQTEFXSOQXLQ*XVFWMTQTUQXTITXKIJ*FMUQXTQJMVX*QEYQVFQTHMXLFVQUVIXM*XEI*XLQ*XWITLIXEQTHGXJQTUQXSITEFLQVGUQX*GXKIEUVGXEQWQTHGXDGUFXTITXDIEUQXGXKFKQVXSIWQXAVPUFXWGXYQVXEQJPFVXKFVUPUQXQXSGTIESQTHGX*FXWFQFXSIWYGJTFXDQSFIXEFXGJPUFXSITXRPQEUGXIVGHFITXYFSSFI*CXC*XSCWWFTIXSOQXCXYQTCXYIESFCX*FXCKVQFXVFUQTPUFXQXKI*UCXTIEUVCXYIYYCXTQ*XWCUUFTIXLQFXVQWFXDCSQWWIXC*FXC*XDI**QXKI*IXEQWYVQXCSRPFEUCTLIXLC*X*CUIXWCTSFTIXUPUUQX*QXEUQ**QXJFCXLQX*C*UVIXYI*IXKQLQCX*CXTIUUQXQX*XTIEUVIXUCTUIXACEEIXSOQXTITXEPVJQCXDPIVXLQ*XWCVFTXEPI*IXSFTRPQXKI*UQXVCSSQEIXQXUCTUIXSCEEIX*IX*PWQXQVZXLFXEIUUIXLZX*ZX*PTZXYIFXSOQXTUVZUFXQVZKZWXTQX*Z*UIXYZEEIRPZTLIXTZYYZVKQXPTZXWITUZJTZXAVPTZXYQVX*ZXLFEUZTHZXQXYZVKQWFXZ*UZXUZTUIXRPZTUIXKQLPUZXTITXZKQZXZ*SPTZXTIFXSFXZ**QJVNWWIXQXUIEUIXUIVTIXFTXYFNTUIXSOQXLQX*NXTIKNXUQVVNXPTXUPVAIXTNSRPQXQXYQVSIEEQXLQ*X*QJTIXF*XYVFWIXSNTUIXUVQXKI*UQXF*XDQXJFVBVXSITXUPUUQX*BSRPQXBX*BXRPBVUBX*QKBVX*B

Let's count the letters.

In [4]:
letter_count = {}
for letter in cipher_text:
    if letter not in letter_count:
        letter_count[letter] = 0

    letter_count[letter] += 1
    
sorted_letter_count = sorted(letter_count.items(), key=operator.itemgetter(1))
sorted_letter_count.reverse()
sorted_letter_count

[('X', 207),
 ('Q', 103),
 ('I', 89),
 ('F', 79),
 ('T', 72),
 ('U', 70),
 ('*', 62),
 ('V', 56),
 ('E', 42),
 ('S', 42),
 ('P', 35),
 ('Y', 30),
 ('C', 30),
 ('Z', 30),
 ('W', 26),
 ('L', 22),
 ('K', 19),
 ('B', 19),
 ('J', 16),
 ('G', 14),
 ('M', 13),
 ('R', 11),
 ('D', 10),
 ('O', 8),
 ('N', 7),
 ('H', 6),
 ('A', 4)]
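
As a cross-check, the same tally can be sketched with `Counter.most_common`. The `sample` string below is just the first 28 symbols of the ciphertext, used as a stand-in so the snippet runs on its own; `sample_counts` and `top_three` are names made up for this example.

```python
from collections import Counter

# First 28 symbols of the ciphertext, used only as sample input.
sample = "IXDVMUFXLFEEFXSOQXYQVXSQTUIX"

# Counter tallies every symbol; most_common(n) returns the n most
# frequent symbols, sorted by descending count.
sample_counts = Counter(sample)
top_three = sample_counts.most_common(3)
print(top_three)
```

Even on this short prefix, `X` comes out well ahead of every other symbol.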

We have 27 distinct symbols, one more than a 26-letter alphabet, so in principle at least one of them is a homophone.

In [5]:
cipher_alphabet = 'XQIFTU*VESPCZYWLBKJGMRDONHA'
cipher_alphabet

'XQIFTU*VESPCZYWLBKJGMRDONHA'

Count the occurrences of double letters:

In [6]:
doubleletter_count = {}
for i, letter in enumerate(cipher_text[1:]):
    previous_letter = cipher_text[i]
    if previous_letter == letter:
        if letter not in doubleletter_count:
            doubleletter_count[letter] = 0

        doubleletter_count[letter] += 1
    
sorted_doubleletter_count = sorted(doubleletter_count.items(), key=operator.itemgetter(1))
sorted_doubleletter_count.reverse()
sorted_doubleletter_count

[('E', 5),
 ('U', 5),
 ('S', 4),
 ('W', 3),
 ('Y', 3),
 ('*', 3),
 ('X', 1),
 ('V', 1)]
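
A more compact way to count doubled symbols is to pair each symbol with its successor via `zip`. This is only a sketch; `sample` is the opening of the ciphertext and `doubles` is an illustrative name.

```python
from collections import Counter

# Opening of the ciphertext, used only as sample input.
sample = "IXDVMUFXLFEEFXSOQXYQVXSQTUIXWF*FMXYQVFJ*FXEFQUQXJFPTUFXMX*ISSFLQTUQ"

# zip(sample, sample[1:]) yields every pair of adjacent symbols;
# keep the pairs where both symbols are the same.
doubles = Counter(a for a, b in zip(sample, sample[1:]) if a == b)
print(doubles)
```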

In English, only a few letters are commonly doubled, and some doubles are very rare. The most common, in order, are `LL`, `SS`, `EE`, `OO`, `TT`, `FF` and `PP`.
    
Source: https://blogs.sas.com/content/iml/2014/10/03/double-letter-bigrams.html

In [7]:
def decipher(text, key):
    mapping = {cipher_letter: real_letter for (cipher_letter, real_letter) in zip(cipher_alphabet, key)}
    real_text = ''
    for cipher_letter in text:
        real_letter = mapping[cipher_letter] if cipher_letter in mapping else cipher_letter
        real_text += real_letter
    return real_text
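
For reference, the same substitution can be written with the built-in `str.maketrans` and `str.translate`. This is an alternative sketch, not the function used in the rest of the notebook; `decipher_translate` and `partial_key` are hypothetical names.

```python
cipher_alphabet = 'XQIFTU*VESPCZYWLBKJGMRDONHA'

def decipher_translate(text, key):
    # maketrans pairs the i-th cipher symbol with the i-th key character;
    # both strings must have the same length (27 here).
    table = str.maketrans(cipher_alphabet, key)
    # Symbols that are not in the table are left unchanged.
    return text.translate(table)

# Only two guesses so far: X -> space and Q -> e; the rest stay '_'.
partial_key = ' e' + '_' * 25
partial_text = decipher_translate("SOQXYQV", partial_key)
print(partial_text)
```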

After several failed attempts, I assumed that the most common symbol, `X`, is actually the space character. Its very high frequency is good evidence, although we can't take that for granted, since homophones can distort the frequencies.

In [8]:
# English frequency         ' eotahinsrdlcumfpgwbyvkxjqz'
# Cipher frequency:         'XQIFTU*VESPCZYWLBKJGMRDONHA'
english_letter_frequency1 = ' eo__s__lt__f__________h___'
english_letter_frequency2 = ' eoFTs*VltPCfYWLBKJGMRDhNHA'
print(decipher(cipher_text, english_letter_frequency1))
print()
print(decipher(cipher_text, english_letter_frequency2))

o ___s_ __ll_ the _e_ te_so _____ _e_____ l_ese ____s_ _ _ott__e_se _ __els_ s__so __tt_o__ _______ _e _ols__ le_l_ the _e_ _____e_se _o_ _o____se _e___ _el_e__e___ ___es_o __ lo_ _e_ _o__o le___ _e_se to_l__e__se __ _ols__ le_e___ __s_ _o_ _olse _ ___e_ to_e ___s_ __ _e_ le____ ___s_se e t__olte___ __ __e_ to______ _et_o l_ ___s_ to_ __els_ o____o_ __tt_o__ __ t_____o the _ _e__ _olt__ __ ___e_ __se__s_ e _o_s_ _ols__ _o___ _e_ __ss__o _e_ _e__ __te__o ___ __ _o__e _o_o le___e _t___ls___o ___ __so ___t__o s_sse _e lse__e ___ _e ___s_o _o_o _e_e_ __ _osse e _ _ols_o s__so __llo the _o_ l___e_ __o_ _e_ _____ l_o_o t____e _o_se __ttelo e s__so t_llo _o ___e e_f __ losso _f _f ___f _o_ the _s_fs_ e_f_f_ _e _f_so _fllo__f__o _f__f__e __f _o_sf__f ____f _e_ _f __lsf__f e _f__e__ f_sf sf_so __f_so _e__sf _o_ f_ef f_t__f _o_ t_ f__e_____o e solso so__o __ ____so the _e __ _o__ se___ __ s___o __t__e e _e_tolle _e_ _e__o __ ____o t__so s_e _o_se __ _e _____ to_ s_sse __t__e _ __ ____s_ _e___ __

Nevertheless, after trying to substitute some letters, nothing makes sense. Plus, there are seven distinct one-letter words. English has only two one-letter words, `a` and `I`. This cannot be English.

In [9]:
whatever_language = ' QIFTU*VESPCZYWLBKJGMRDONHA'
splitted = decipher(cipher_text, whatever_language)
Counter([w for w in splitted.split(' ') if len(w) == 1])

Counter({'*': 2, 'B': 1, 'C': 1, 'G': 1, 'I': 1, 'M': 2, 'Q': 9})

Seven distinct one-letter words.
Assuming they must all be vowels, although `u` is unlikely to be a word on its own.

* Portuguese: a, e, o
* Spanish: a, e, o, y
* Italian: a, e, o, i

More one letter words: https://jalu.ch/languages/one_letter_words.php

Guesses
* `Q`, also the most common letter -> Guess: `e`
* `I`, second most common letter -> Guess: `a`
* `*`, also appears as a double letter -> Guess: `o`
* M -> ?
* B -> ?
* C -> ?
* G -> ?

In [10]:
# Cipher frequency:  'XQIFTU*VESPCZYWLBKJGMRDONHA'
whatever_language1 = ' ea___o____________________'
whatever_language2 = ' eaFTUoVESPCZYWLBKJGMRDONHA'
print(decipher(cipher_text, whatever_language1))
print()
print(decipher(cipher_text, whatever_language2))

a _____ _____ __e _e_ _e__a __o__ _e___o_ __e_e ______ _ oa____e__e _ __e___ ____a _____ao_ ____o__ _e _a____ _e___ __e _eo _____e__e _a_ _a_o___e _e___ oe__e__e___ ___e__a _o _ao _eo _a__a _e___ _e__e _a____e___e o_ _a____ _e_e___ ____ _a_ _a__e _ ___e_ _a_e _____ __ _e_ _e____ ______e e ___a__e___ o_ __e_ _a______ _e__a __ _____ _a_ __e___ a____a_ _____ao_ _o ______a __e _ _e__ _a____ o_ ___e_ ___e____ e _ao__ _a____ _a___ _eo ______a _e_ _e__ ___e__a _o_ _o _aooe _aoa _e___e __________a __o o__a ______a ____e oe __eooe ___ _e o_o__a _aoa _e_e_ o_ _a__e e o _a___a ____a ____a __e _a_ ____e_ __a_ _eo _____ __aoa _____e _ao_e ____e_a e ____a ____a oa o__e e__ __ _a__a __ o_ o___ _a_ __e ______ e_____ _e o_o_a ____a_____a _______e ___ _a______ _____ _e_ o_ ________ e ____e__ _o__ ____a _____a _e____ _a_ __e_ _o____ _a_ __ _ooe_____a e _a__a _a__a __ _____a __e _e o_ _a__ _e___ __ ____a _____e e _e__a__e _eo oe__a _o ____a ____a __e _ao_e _o _e _____ _a_ ____e o____e _ o_ ______ oe___ o_

Sequences like `eooe` and `aooe` look bad. Assuming `Q -> e` is right, drop `I -> a` and `* -> o`. Test `I -> o`.
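
To see where a sequence like `eooe` comes from, here is a small sketch that applies a set of guesses to one fragment of the ciphertext (`UPUUQX*QXEUQ**Q`) and leaves everything else as `_`. `apply_guesses` is a helper invented for this illustration.

```python
# One fragment of the ciphertext that contains the doubled symbol '**'.
fragment = "UPUUQX*QXEUQ**Q"

def apply_guesses(text, guesses):
    # Replace each guessed symbol; show every unknown symbol as '_'.
    return ''.join(guesses.get(symbol, '_') for symbol in text)

# With '*' guessed as 'o', the awkward 'eooe' shows up.
with_o = apply_guesses(fragment, {'X': ' ', 'Q': 'e', 'I': 'a', '*': 'o'})
# With '*' left open, the same fragment stays undecided.
without_o = apply_guesses(fragment, {'X': ' ', 'Q': 'e', 'I': 'o'})
print(with_o)
print(without_o)
```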

In [11]:
# Cipher frequency:  'XQIFTU*VESPCZYWLBKJGMRDONHA'
whatever_language1 = ' eo________________________'
whatever_language2 = ' eoFTU*VESPCZYWLBKJGMRDONHA'
print(decipher(cipher_text, whatever_language1))
print()
print(decipher(cipher_text, whatever_language2))

o _____ _____ __e _e_ _e__o _____ _e_____ __e_e ______ _ _o____e__e _ __e___ ____o _____o__ _______ _e _o____ _e___ __e _e_ _____e__e _o_ _o_____e _e___ _e__e__e___ ___e__o __ _o_ _e_ _o__o _e___ _e__e _o____e___e __ _o____ _e_e___ ____ _o_ _o__e _ ___e_ _o_e _____ __ _e_ _e____ ______e e ___o__e___ __ __e_ _o______ _e__o __ _____ _o_ __e___ o____o_ _____o__ __ ______o __e _ _e__ _o____ __ ___e_ ___e____ e _o___ _o____ _o___ _e_ ______o _e_ _e__ ___e__o ___ __ _o__e _o_o _e___e __________o ___ ___o ______o ____e _e __e__e ___ _e _____o _o_o _e_e_ __ _o__e e _ _o___o ____o ____o __e _o_ ____e_ __o_ _e_ _____ __o_o _____e _o__e ____e_o e ____o ____o _o ___e e__ __ _o__o __ __ ____ _o_ __e ______ e_____ _e ____o ____o_____o _______e ___ _o______ _____ _e_ __ ________ e ____e__ ____ ____o _____o _e____ _o_ __e_ ______ _o_ __ ___e_____o e _o__o _o__o __ _____o __e _e __ _o__ _e___ __ ____o _____e e _e__o__e _e_ _e__o __ ____o ____o __e _o__e __ _e _____ _o_ ____e _____e _ __ ______ _e___ __

Looks better. Many of the three-letter words are `SOe`. Perhaps `que`?
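
One way to spot such repeated words, assuming `X` is the space, is to split the ciphertext on `X` and count the three-symbol words. A minimal sketch on the opening of the ciphertext (`words` and `three_letter` are illustrative names); note that it works on the raw cipher symbols, so `SOe` appears as `SOQ`:

```python
from collections import Counter

# Opening of the ciphertext, used only as sample input.
sample = (
    "IXDVMUFXLFEEFXSOQXYQVXSQTUIXWF*FMXYQVFJ*FXEFQUQXJFPTUFXMX*ISSFLQTUQ"
    "XMXRPQEUMXUMTUIXYFSSFI*MXKFJF*FMXLQXTIEUVFXEQTEFXSOQXLQ*XVFWMTQTUQ"
)

# Treat 'X' as the word separator and tally the three-symbol words.
words = sample.split('X')
three_letter = Counter(word for word in words if len(word) == 3)
print(three_letter.most_common())
```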

In [12]:
# Cipher frequency:  'XQIFTU*VESPCZYWLBKJGMRDONHA'
whatever_language1 = ' eo______q_____________u___'
whatever_language2 = ' eoFTU*VEqPCZYWLBKJGMRDuNHA'
print(decipher(cipher_text, whatever_language1))
print()
print(decipher(cipher_text, whatever_language2))

o _____ _____ que _e_ qe__o _____ _e_____ __e_e ______ _ _oqq__e__e _ __e___ ____o __qq_o__ _______ _e _o____ _e___ que _e_ _____e__e _o_ _o_____e _e___ _e__e__e___ ___e__o __ _o_ _e_ _o__o _e___ _e__e qo____e___e __ _o____ _e_e___ ____ _o_ _o__e _ ___e_ qo_e _____ __ _e_ _e____ ______e e q__o_qe___ __ __e_ qo______ _eq_o __ _____ qo_ __e___ o____o_ __qq_o__ __ q_____o que _ _e__ _o_q__ __ ___e_ ___e____ e _o___ _o____ _o___ _e_ ______o _e_ _e__ __qe__o ___ __ _o__e _o_o _e___e _q________o ___ ___o ___q__o ____e _e __e__e ___ _e _____o _o_o _e_e_ __ _o__e e _ _o___o ____o ____o que _o_ ____e_ __o_ _e_ _____ __o_o q____e _o__e __qqe_o e ____o q___o _o ___e e__ __ _o__o __ __ ____ _o_ que ______ e_____ _e ____o ____o_____o _______e ___ _o______ _____ _e_ __ ________ e ____e__ ____ ____o _____o _e____ _o_ __e_ __q___ _o_ q_ ___e_____o e _o__o _o__o __ _____o que _e __ _o__ _e___ __ ____o __q__e e _e_qo__e _e_ _e__o __ ____o q___o __e _o__e __ _e _____ qo_ ____e __q__e _ __ ______ _e___ __

`oqq` looks wrong. Italian `SOe -> che`? Italian has plenty of `cc`.

In [13]:
# Cipher frequency:  'XQIFTU*VESPCZYWLBKJGMRDONHA'
whatever_language1 = ' eo______c_____________h___'
whatever_language2 = ' eoFTU*VEcPCZYWLBKJGMRDhNHA'
print(decipher(cipher_text, whatever_language1))
print()
print(decipher(cipher_text, whatever_language2))

o _____ _____ che _e_ ce__o _____ _e_____ __e_e ______ _ _occ__e__e _ __e___ ____o __cc_o__ _______ _e _o____ _e___ che _e_ _____e__e _o_ _o_____e _e___ _e__e__e___ ___e__o __ _o_ _e_ _o__o _e___ _e__e co____e___e __ _o____ _e_e___ ____ _o_ _o__e _ ___e_ co_e _____ __ _e_ _e____ ______e e c__o_ce___ __ __e_ co______ _ec_o __ _____ co_ __e___ o____o_ __cc_o__ __ c_____o che _ _e__ _o_c__ __ ___e_ ___e____ e _o___ _o____ _o___ _e_ ______o _e_ _e__ __ce__o ___ __ _o__e _o_o _e___e _c________o ___ ___o ___c__o ____e _e __e__e ___ _e _____o _o_o _e_e_ __ _o__e e _ _o___o ____o ____o che _o_ ____e_ __o_ _e_ _____ __o_o c____e _o__e __cce_o e ____o c___o _o ___e e__ __ _o__o __ __ ____ _o_ che ______ e_____ _e ____o ____o_____o _______e ___ _o______ _____ _e_ __ ________ e ____e__ ____ ____o _____o _e____ _o_ __e_ __c___ _o_ c_ ___e_____o e _o__o _o__o __ _____o che _e __ _o__ _e___ __ ____o __c__e e _e_co__e _e_ _e__o __ ____o c___o __e _o__e __ _e _____ co_ ____e __c__e _ __ ______ _e___ __

* Let's fill in the other one-letter words with some guesses. What are `M`, `B`, `C`, `G`, `*`?
* Among the four letters, `C` is the most common, guess `a`.
* `*` appears as `_*`, as `*_`, on its own, and doubled as `**`. `* -> l`?
* `Lel -> del`?
* `YeV -> per`?
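
Typing the 27-character key by hand makes it easy to put a letter in the wrong column. As a sketch of an alternative, the key can be generated from a dictionary of guesses; `build_key` and `guesses` are hypothetical names, not part of the original notebook.

```python
cipher_alphabet = 'XQIFTU*VESPCZYWLBKJGMRDONHA'

def build_key(guesses, placeholder='_'):
    # Walk the cipher alphabet in order so that every guess lands
    # in the column of its own cipher symbol.
    return ''.join(guesses.get(symbol, placeholder) for symbol in cipher_alphabet)

# The guesses made so far: space, e, o and the word 'che'.
guesses = {'X': ' ', 'Q': 'e', 'I': 'o', 'S': 'c', 'O': 'h'}
key = build_key(guesses)
print(repr(key))
```

New guesses such as `'*': 'l'` can then be added to `guesses` one at a time without recounting columns.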

In [14]:
# Cipher frequency:  'XQIFTU*VESPCZYWLBKJGMRDONHA'
whatever_language1 = ' eo___lr_c_aZp_d_o_____h___'
whatever_language2 = ' eoFTUlrEcPCZpWdBKJGMRDONHA'
print(decipher(cipher_text, whatever_language1))
print()
print(decipher(cipher_text, whatever_language2))

o _r___ d____ che per ce__o __l__ per__l_ __e_e ______ _ locc_de__e _ __e___ ____o p_cc_ol_ o___l__ de _o__r_ _e___ che del r____e__e _o_ oo_l___e _e__r le_per_e___ d_re_ro _l _ol del _o_do _e___ _e__e co___der__e l_ oo__r_ _e_e___ ____ _o_ _o__e _ o_oer co_e _r___ __ per _e___r o_r___e e c__o_ce___ l_ __e_ co_p____ _ec_o __ _____ co_ __e___ or___o_ p_cc_ola al ca____o che a pe_a po_c_a l_ aore_ r__e____ e ool_a _o__ra poppa _el _a____o de_ re__ _ace__o al_ al _olle oolo _e_pre ac_____a_do dal la_o _a_c__o ____e le __elle __a de lal_ro polo oedea la _o__e e l _o__ro _a__o _a__o che _o_ __r_ea __or del _ar__ __olo c____e ool_e racce_o e _a__o ca__o lo l__e erZ d_ _o__o dZ lZ l__Z po_ che __rZ__ erZoZ_ _e lZl_o pZ__o__Z_do _ZppZroe __Z _o__Z__Z _r__Z per lZ d___Z__Z e pZroe__ Zl_Z _Z__o __Z__o oed__Z _o_ ZoeZ Zlc__Z _o_ c_ Zlle_r___o e _o__o _or_o __ p____o che de l_ _oo_ _err_ __ __r_o __c__e e perco__e del le__o _l pr__o c___o _re ool_e _l _e __r_r co_ ____e l_c__e _ l_ ___r__ leo_r l_

Next guesses in order:
* `percoEEe -> percosse`
* `deF -> dei`
* `richiPso -> richiuso`
* `peTa -> pena`
* `M -> a`
* `nosUri -> nostri`
* `considerGte -> considerate`
* `caWWino -> cammino`
* `compaJni -> compagni`
* `KiKer -> viver`
* `erZ -> era`
* `Ruesta -> cuesta` (standard Italian spells it `questa`, which would make `R -> q`; the key below keeps `c`)
* `allegrNmmo -> allegrammo`
* `canoscenHa -> canoscenza`

In [15]:
# Cipher frequency:  'XQIFTU*VESPCZYWLBKJGMRDONHA'
whatever_language1 = ' eointlrscuaapmdavgaacfhazb'
whatever_language2 = ' eointlrscuaapmdavgaacfhazb'
print(decipher(cipher_text, whatever_language1))
print()
print(decipher(cipher_text, whatever_language2))

o frati dissi che per cento milia perigli siete giunti a loccidente a cuesta tanto picciola vigilia de nostri sensi che del rimanente non vogliate negar lesperienza diretro al sol del mondo senza gente considerate la vostra semenza fati non foste a viver come bruti ma per seguir virtute e canoscenza li miei compagni fecio si aguti con cuesta orazion picciola al cammino che a pena poscia li avrei ritenuti e volta nostra poppa nel mattino dei remi facemmo ali al folle volo sempre accuistando dal lato mancino tutte le stelle gia de laltro polo vedea la notte e l nostro tanto basso che non surgea fuor del marin suolo cincue volte racceso e tanto casso lo lume era di sotto da la luna poi che ntrati eravam ne lalto passocuando napparve una montagna bruna per la distanza e parvemi alta tanto cuanto veduta non avea alcuna noi ci allegrammo e tosto torno in pianto che de la nova terra un turbo naccue e percosse del legno il primo canto tre volte il fe girar con tutte laccue a la cuarta levar la
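
Inverting the final key shows which plain letters ended up with more than one cipher symbol. The sketch below groups the cipher symbols by plain letter; `homophones`, `shared` and `corrected_key` are illustrative names. It also tries `R -> q` instead of `R -> c`, on the assumption that the intended spelling is the standard Italian `questa`.

```python
from collections import defaultdict

cipher_alphabet = 'XQIFTU*VESPCZYWLBKJGMRDONHA'
final_key       = ' eointlrscuaapmdavgaacfhazb'

# Group the cipher symbols by the plain letter they decode to.
homophones = defaultdict(list)
for symbol, letter in zip(cipher_alphabet, final_key):
    homophones[letter].append(symbol)

# Keep only the plain letters that have several cipher symbols.
shared = {letter: symbols for letter, symbols in homophones.items() if len(symbols) > 1}
print(shared)

# Assumption: R should decode to 'q' rather than 'c'.
r_index = cipher_alphabet.index('R')
corrected_key = final_key[:r_index] + 'q' + final_key[r_index + 1:]
questa = "RPQEUM".translate(str.maketrans(cipher_alphabet, corrected_key))
print(questa)
```

With the key as used above, `a` has six cipher symbols and `c` has two; under the `R -> q` assumption, `a` would be the only letter with homophones.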