A Python 3 script for decrypting text encrypted with monoalphabetic/simple substitution ciphers, where word boundaries remain in place, using a technique that I will call combined pattern deduction.
Install required packages using pip
:
pip install -r requirements.txt
Each encrypted word from an encrypted message will have zero or more matching plaintext words that share the same word pattern. This script attempts to decipher encrypted messages by reducing the number of plaintext matches for each encrypted word to one (ideally).
Example:
Encrypted Word | Pattern | No. Plaintext Matches |
---|---|---|
WSNNHGDK | 8-12334567 | 675 |
SJJQGSJB | 8-12234125 | 4 |
FTKJPAHG | 8-12345678 | 5762 |
This number of tuples in the 3-fold cartesian product of the plaintext matches for these 3 words is ~15,000,000.
However, by sequentially combining encrypted words and comparing the longer pattern with the pattern of each combined tuple in the cartesian product of the plaintext matches, the number of matches for each word can be reduced to one by examining ~50,000 tuples.
The decryption process utilises word patterns and combines words to create larger, more useful, patterns. Consider the following two encrypted words and their corresponding patterns:
Encrypted Word | Pattern |
---|---|
S JJQGS JB |
8-1 22341 25 |
WS NNHGDK |
8-12 334567 |
In the first word, S
is identified with a 1
. In the second word, S
is identified with a 2
. If we combine both words and create a new pattern, such as:
Encrypted Word | Pattern |
---|---|
S JJQGS JBWS NNHGDK |
16-1 22341 2561 77849A |
Then in the new pattern we can now see that S
is identified with a 1
for all three occurrences.
It is now possible to say that: any set in the cartesian product of the plaintext matches for each of the individual word patterns that doesn't have the same combined pattern, can be deducted from the original sets of plaintext matches.
We can keep building and comparing larger patterns as we iterate through the encrypted words in sequence, and through this process of deduction, we can determine a cyphertext alphabet to decrypt the message.
Note: You can run the script using the verbose (-v) flag and see this deduction in action
- PQACEIAMNSXU has 23 word pattern matches
- RWCTJSXSZA has 261 word pattern matches
By creating a combined pattern from PQACEIAMNSXU and RWCTJSXSZA and comparing it with a combined pattern for each tuple in the cartesian product of each word's word pattern matches, then there is only 1 possible match for each word:
- PQACEIAMNSXU has 1 word pattern matches
- RWCTJSXSZA has 1 word pattern matches
"OVERWHELMING SCRUTINIZE"
Total tuples compared: (23 x 261) = 6,003
- OCUBICBP has 53 word pattern matches
- KCXXPQUN has 675 word pattern matches
- PUBOMNV has 8585 word pattern matches
First iteration
- OCUBICBP has 53 word pattern match
- KCXXPQUN has 675 word pattern match
By creating a combined pattern from OCUBICBP and KCXXPQUN and comparing it with a combined pattern for each tuple in the cartesian product of each word's word pattern matches, then there are 6 possible matches for OCUBICBP and 7 possible matches for KCXXPQUN.
Second iteration after deductions
- OCUBICBP has 6 word pattern matches
- KCXXPQUN has 7 word pattern matches
- PUBOMNV has 8585 word pattern matches
By creating a combined pattern from OCUBICBP, KCXXPQUN and PUBOMNV, and comparing it with a combined pattern for each tuple in the cartesian product of each word's word pattern matches, then there is only 1 possible match for each word:
- OCUBICBP has 1 word pattern match
- KCXXPQUN has 1 word pattern match
- PUBOMNV has 1 word pattern match
"ENGLISH LANGUAGE PATTERNS"
Total tuples compared: (53 x 675) + (6 x 7 x 8585) = 396,345
The first step of the decryption process is to split the encrypted message into normalised alphabetic words. This can involve converting unicode characters to ascii, removing punctuation, transforming to uppercase, removing duplicates etc. We then need to consider the examination order of the words - taking into account that the larger the number of items in the cartesian product, then the more iterations/comparisons will be required.
There are 3 ordering options:
- LONGEST_TO_SHORTEST: Longest length word to the shortest. In general, combing the largest words first will result in a large number of deductions early in the process. This is the default ordering option.
- FEWEST_TO_MOST_MATCHES: Word with the fewest plaintext matches to the most. In general, LONGEST_TO_SHORTEST decrypts with the least iterations and is therefore quicker. However,
message_example_2.txt
is an example where FEWEST_TO_MOST_MATCHES is better. - MATCHES_DIVIDED_BY_LENGTH: Number of plaintext matches divided by the length of the word.
- Depends on a word list (dictionary).
- When a word isn't present in the dictionary, it could cause issues if the ordering places the word in the first two words to be examined.
- Can struggle with some shorter sentences where each word has a lot of pattern matches.
- Doesn't attempt to decrypt numbers.
-
Can often handle words not being present in the dictionary. For example:
Cyphertext:
Zozm Nzgsrhlm Gfirmt LYV UIH; 23 Qfmv 1912 – 7 Qfmv 1954) dzh zm Vmtorhs nzgsvnzgrxrzm, xlnkfgvi hxrvmgrhg, oltrxrzm, xibkgzmzobhg, ksrolhlksvi, zmw gsvlivgrxzo yrloltrhg.
Decrypts to:
Alan Mathison Turing OBE [F,K,Q]RS; 23 June 1912 – 7 June 1954) [k,v,w,f]as an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist.
This isn't perfect as
U
mapped toF, K, or Q
andd
mapped tok, v, w, or f
. This is becausefrs
,krs
,qrs
,kas
,vas
,was
, andfas
are all words present in thedictionary.txt
file, and there isn't enough information to determine a one-to-one mapping forU
andd
.However, the word
Mathison
is not present in the dictionary but we can still decrypt is as we have deciphered the individual characters from other words that are in the dictionary. -
Can be fast and very accurate.
usage: decrypt.py [-h] [-m MESSAGE | -f MESSAGE_FILE] [-v | -vv] [-s]
[-w [WORDS_FILES [WORDS_FILES ...]]]
[-o {LONGEST_TO_SHORTEST,FEWEST_TO_MOST_MATCHES,MATCHES_DIVIDED_BY_LENGTH}]
optional arguments:
-h, --help show this help message and exit
-m MESSAGE, --message MESSAGE
An encrypted message
-f MESSAGE_FILE, --message-file MESSAGE_FILE
An encrypted message text file
-v, --verbose-info Makes the decryptor output more verbose
-vv, --verbose-debug Makes the decryptor output even more verbose
-s, --suppress-encrypted-text-output
Suppresses the encrypted text from the output
-w [WORDS_FILES [WORDS_FILES ...]], --words-files [WORDS_FILES [WORDS_FILES ...]]
Dictionary (words) files. Example: -w
words/dictionary.txt words/names.txt
-o {LONGEST_TO_SHORTEST,FEWEST_TO_MOST_MATCHES,MATCHES_DIVIDED_BY_LENGTH}, --order {LONGEST_TO_SHORTEST,FEWEST_TO_MOST_MATCHES,MATCHES_DIVIDED_BY_LENGTH}
The word examine order
Example usage:
python3 decrypt.py -f message_examples/message_example_10.txt
Attempts to decrypt the contents of the message_examples/message_example_10.txt
file using default word ordering LONGEST_TO_SHORTEST.
python3 decrypt.py -f message_examples/message_example_10.txt -o FEWEST_TO_MOST_MATCHES
Attempts to decrypt the contents of the message_examples/message_example_10.txt` file using FEWEST_TO_MOST_MATCHES word ordering.
There are a number of encrypted message examples provided in this repository. The screenshots below show the output for each one.