# Frequency Analysis Tool

## Import and Reorganize the Text

Define two functions:

The function `extract` returns only the English letters (uppercase and lowercase) in a given text and drops everything else.

The function `print_text` prints the letters of a given text in groups of 5, with a fixed number of groups per line (5 by default).

In [3]:
import re                      # Use regular expressions

def extract(s):
  """Return only the English letters (a-z, A-Z) in s, in their original order."""
  pattern = r'[a-zA-Z]'
  extracted = ''.join(re.findall(pattern, s))
  return extracted

def print_text(s, columns=5):
  """Print the letters of s in groups of 5, with `columns` groups per line."""
  extracted = extract(s)
  index = 0
  while index < len(extracted):
    row = ""
    space_count = 0
    for i in range(index, index+5*columns):
      if i < len(extracted):
        row += extracted[i]
        space_count += 1
        if space_count == 5:        # End of a 5-letter group: insert a space
          row += ' '
          space_count = 0
    print(row)
    index += 5 * columns            # Move on to the letters of the next line
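
The grouping done by `print_text` can also be expressed with slicing. The following is a minimal sketch, not the notebook's own implementation: the function name `format_groups` and the `group_size` parameter are assumptions introduced for illustration, and the function returns the formatted string instead of printing it.

```python
def format_groups(letters, group_size=5, columns=5):
    """Split letters into groups of group_size and put `columns` groups on each line."""
    # Cut the string into consecutive groups; the last group may be shorter.
    groups = [letters[i:i + group_size] for i in range(0, len(letters), group_size)]
    # Join `columns` groups per row with single spaces, then join rows with newlines.
    rows = [' '.join(groups[i:i + columns]) for i in range(0, len(groups), columns)]
    return '\n'.join(rows)

print(format_groups("ABCDEFGHIJKLMNOPQRSTUVWXYZ", group_size=5, columns=2))
# ABCDE FGHIJ
# KLMNO PQRST
# UVWXY Z
```

Unlike `print_text`, this sketch expects text that already contains only letters and leaves no trailing space at the end of a full row.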

For example, the following ciphertext is taken from *Rubinstein-Salzedo 3.4 Problem 5*.

In [None]:
encrypted = "NBPFR KISOQ NFRDB FKJFD XNOIN OJXIX NZXSI\
      DJXIJ NYENO ISDSA SOFBY REJRK IKSKI PFRAR\
      DJZIJ RUSEE JXIZI KADFB JXIJK SODYI OGIOJ\
      SEJIK ADSOG UESOJ JXIAI VKPWX IKIPF RARDJ\
      ENIRU FOJXI GSNDN IDSOG GNDYF RKDIN OOFVI\
      EUXKS DIDFB PFRKY FAUEN YSJIG DJSJI FBANO\
      GJXIA ISONO ZGFID OJASJ JIKNB NJDFO EPNGE\
      IYXSJ JIKFB SJKSO DYIOG IOJSE LNOGS OGIVK\
      PFOIW NEEDS PSDPF RWSEL PFRKA PDJNY WSPNB\
      JXNDP FROZA SOIQU KIDDI DXNAD IEBNO JIKAD\
      JFFGI IUBFK AIWXP WXSJS VIKPD NOZRE SKEPG\
      IIUPF ROZAS OJXND GIIUP FROZA SOARD JCICI\
      IEFMR IOJNO UKSND IFBJX IVIKP GREEF EGGSP\
      DWXNY XXSVI EFOZD NOYIU SDDIG SWSPS OGYFO\
      VNOYI IANBP FRYSO JXSJJ XIKIN ZOFBZ FFGMR\
      IIOSO OIWSD YREJR KIDUS EANID JGSPF BYFRK\
      DIPFR WNEEU FFXUF FXWXS JIVIK DBKID XSOGO\
      IWSOG GIYES KINJD YKRGI SOGAI SOBFK SKJDJ\
      FUUIG DXFKJ NOJXI YREJN VSJIG YFRKJ FBJXI\
      IAUKI DDHFD IUXNO ISOGI VKPFO IWNEE DSPSD\
      PFRWS ELPFR KAPDJ NYWSP NBJXS JDOFJ ZFFGI\
      OFRZX BFKXN AWXNY XNDZF FGIOF RZXBF KAIWX\
      PWXSJ SVIKP YREJN VSJIG LNOGF BPFRJ XJXND\
      LNOGF BPFRJ XARDJ CIJXI OSDIO JNAIO JSEUS\
      DDNFO FBSVI ZIJSC EIBSD XNFOA RDJIQ YNJIP\
      FRKES OZRNG DUEII OSOSJ JSYXA IOJSE SUESJ\
      FBFKS CSDXB REPFR OZUFJ SJFFK SOFJJ FFBKI\
      OYXBK IOYXC ISOJX FRZXJ XIUXN ENDJN OIDAS\
      PHFDJ EIPFR WNEEK SOLSD SOSUF DJEIN OJXIX\
      NZXSI DJXIJ NYCSO GNBPF RWSEL GFWOU NYYSG\
      NEEPW NJXSU FUUPF KSENE PNOPF RKAIG NIVSE\
      XSOGS OGIVK PFOIW NEEDS PSDPF RWSEL PFRKB\
      EFWKP WSPNB XIDYF OJIOJ WNJXS VIZIJ SCEIE\
      FVIWX NYXWF REGYI KJSNO EPOFJ DRNJA IWXPW\
      XSJSA FDJUS KJNYR ESKEP URKIP FROZA SOJXN\
      DURKI PFROZ ASOAR DJCI"

print("Without Spaces:")
print(extract(encrypted))
print("\n5 letters per group and 30 groups per line:")
print_text(encrypted, 30)

The output of `extract(encrypted)` is also useful on its own, since it contains no whitespace or punctuation. You can copy and paste it into any text editor to visualize the occurrences of any substring.

I recommend using [regex101](https://regex101.com/), where you can search with regular expressions.
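
The occurrences of a substring can also be located directly in Python. The sketch below is one possible way to do it; the helper name `find_positions` is hypothetical, and the sample string is made up for illustration. It uses a lookahead so that overlapping occurrences are reported too.

```python
import re

def find_positions(text, substring):
    """Return the start index of every occurrence of substring, overlapping ones included."""
    # A lookahead matches at a position without consuming characters,
    # so the search can report occurrences that overlap each other.
    pattern = r'(?=' + re.escape(substring) + r')'
    return [m.start() for m in re.finditer(pattern, text)]

sample = "JXIABCJXIJXI"                  # made-up sample, not the ciphertext
print(find_positions(sample, "JXI"))     # [0, 6, 9]
```

In the notebook, the first argument would typically be the whitespace-free text produced by `extract`.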

## Frequency Analysis

The following function `freq_analyse(s, length=1)` returns a dictionary of frequencies (more precisely, the number of occurrences) of `length`-grams in the text `s`. By default, `length` is set to 1.

Set the second parameter to $n$ to get the frequencies of all $n$-grams in the text.

The frequencies are returned in descending order.

In [9]:
def freq_analyse(s, length = 1):
    pattern = r'[a-zA-Z]'
    extracted = ''.join(re.findall(pattern, s))
    freq_dict = {}

    for index in range(len(extracted)-length+1):
        if extracted[index:index+length] not in freq_dict:
            freq_dict[extracted[index:index+length]] = 0
        freq_dict[extracted[index:index+length]] += 1

    return (dict(sorted(freq_dict.items(), key = lambda item: item[1], reverse = True)))
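
The same counting can also be written with `collections.Counter` from the standard library. The following is a sketch of an alternative, not a replacement for the function above; the name `freq_analyse_counter` is an assumption made for illustration.

```python
import re
from collections import Counter

def freq_analyse_counter(s, length=1):
    """Count all overlapping n-grams of the given length among the letters of s."""
    letters = ''.join(re.findall(r'[a-zA-Z]', s))
    # Slide a window of `length` letters across the text, one position at a time.
    grams = (letters[i:i + length] for i in range(len(letters) - length + 1))
    # most_common() orders the n-grams from the highest count to the lowest.
    return dict(Counter(grams).most_common())

print(freq_analyse_counter("AB AB A", 2))   # {'AB': 2, 'BA': 2}
```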

## Test with the Text `encrypted`

In [None]:
print(freq_analyse(encrypted))
print(freq_analyse(encrypted, 2))
print(freq_analyse(encrypted, 3))
print(freq_analyse(encrypted, 4))
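
Since the dictionary holds raw counts, it can help to convert them to relative frequencies before comparing them with published letter-frequency tables for English. A minimal sketch follows; the helper name `to_relative` and the sample counts are made up for illustration and are not taken from the ciphertext.

```python
def to_relative(freq_dict):
    """Convert a dictionary of counts into a dictionary of shares that sum to 1."""
    total = sum(freq_dict.values())
    if total == 0:                      # Avoid dividing by zero on an empty result
        return {}
    return {gram: count / total for gram, count in freq_dict.items()}

counts = {"I": 6, "J": 3, "S": 1}       # made-up counts for illustration
for gram, share in to_relative(counts).items():
    print(f"{gram}: {share:.1%}")
# I: 60.0%
# J: 30.0%
# S: 10.0%
```

The order of the input dictionary is preserved, so a result of `freq_analyse` stays sorted in descending order.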

You can also get the frequency of a specific letter combination. For example, to find the frequency of `PFRWSEL`, we can use the following command. Note that $7$ is the length of `PFRWSEL`.

In [None]:
print(freq_analyse(encrypted, 7)["PFRWSEL"])
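
Indexing the dictionary raises a `KeyError` when the combination never occurs in the text. If that case matters, one option is to count the combination directly and return 0 when it is absent. The sketch below assumes a hypothetical helper named `count_ngram`; the sample strings are made up for illustration.

```python
def count_ngram(text, gram):
    """Count overlapping occurrences of gram among the English letters of text."""
    # Keep only ASCII letters, mirroring what the pattern [a-zA-Z] selects.
    letters = ''.join(ch for ch in text if ch.isascii() and ch.isalpha())
    n = len(gram)
    return sum(1 for i in range(len(letters) - n + 1) if letters[i:i + n] == gram)

print(count_ngram("PFRWS ELPFR WSEL", "PFRWSEL"))   # 2
print(count_ngram("PFRWS ELPFR WSEL", "XYZ"))       # 0
```

With the existing dictionary, `freq_analyse(encrypted, 7).get("PFRWSEL", 0)` gives the same protection against a missing key.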

## Substitute and Rerun Frequency Analysis

Suppose we guess that `I` represents `e`, since `I` has the highest frequency. We can substitute it as follows.

In [12]:
text_guess = encrypted    #Note that strings are immutable in Python, so we will not change `encrypted`
text_guess = text_guess.replace('I', 'e')   #This command will substitute all "I" with "e"

Then we can run `freq_analyse` again for the new text `text_guess` and proceed.

In [None]:
print(freq_analyse(text_guess))
print(freq_analyse(text_guess, 2))
print(freq_analyse(text_guess, 3))
print(freq_analyse(text_guess, 4))
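
As guesses accumulate, it may be convenient to keep them in one dictionary and apply them all at once. The following is a minimal sketch using `str.maketrans` and `str.translate`; the helper name `apply_guesses` and the second guess (`J` for `t`) are assumptions made only for illustration.

```python
def apply_guesses(ciphertext, guesses):
    """Replace each ciphertext letter that has a guess; leave all other characters as-is."""
    # str.maketrans accepts a dictionary that maps single characters to replacements.
    table = str.maketrans(guesses)
    return ciphertext.translate(table)

guesses = {"I": "e", "J": "t"}          # hypothetical guesses, uppercase -> lowercase
print(apply_guesses("JXI JXIK", guesses))   # tXe tXeK
```

Writing the guessed plaintext in lowercase keeps it visually distinct from the ciphertext letters that are still unsolved.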