In [None]:
## Assignment #2: Spamlet-Entropy-Monkeys
## PHYS481 FALL2020
## Yauheni Kalionau (30062335)

## Introduction:
Dealing with probability can be quite complicated because the concept is not intuitive for most people. That is why computers are so useful for the statistical analysis of data sets. Quite often, data follow a statistical distribution that can be written as a mathematical expression (Bernoulli, hypergeometric, Gaussian, etc.), from which statistically important quantities such as the standard deviation can be derived. This lab explores an interesting example from probability theory, the "infinite monkey theorem", which states that a monkey hitting keys at random for an infinite amount of time will almost surely type any given text. The purpose is to compare different probability distributions applied to spamlet ("Simplified Hamlet") and to calculate the entropy associated with those distributions.

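## Question #1:
Number of characters in spamlet and the entropy per character, assuming all 27 symbols (26 letters and the space) are equally likely.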


In [1]:
import urllib.request
import numpy as np
import mpmath
import math

# Downloading the file from the 'url' and keeping only letters and spaces, in lower case:
url = r'http://www.gutenberg.org/files/1524/1524-0.txt'
bytedata = urllib.request.urlopen( url ).read()
data = bytedata.decode()
charlist = [c.lower() for c in data if c.isalpha() or c==' ']
nchars = len(charlist)

# Probability of each letter/space occurring at any given position
# (26 letters + space, all equally likely):
nsymbols = 27
prob = 1.0/nsymbols

# A list with the terms under the "sum" sign in Shannon's equation,
# one term per symbol (not one per character of the text), with log base 2 for bits:
list_to_sum = []
for elem in range(nsymbols):
    list_to_sum.append(-np.log2(prob)*prob)

# Shannon's entropy = sum of all elements in the list above:
entropy = np.sum(list_to_sum)

print("Number of characters in spamlet (including spaces, but omitting punctuation signs):", nchars)
print("Entropy of spamlet in bits per character: %.4f" % entropy)


# The download might take longer than expected. Be patient.


Number of characters in spamlet (including spaces, but omitting punctuation signs): 181732
Entropy of spamlet in bits per character: 4.7549
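
For comparison, here is a minimal sketch of how the entropy could be estimated from the character frequencies that actually occur in a text, rather than from 27 equally likely symbols. The helper `entropy_bits` and the short `sample` string are hypothetical stand-ins introduced only for illustration; in the notebook the input would be `charlist`.

```python
import math
from collections import Counter

def entropy_bits(text):
    """Shannon entropy, in bits per character, of the characters in `text`."""
    counts = Counter(text)
    total = sum(counts.values())
    # H = -sum_i p_i * log2(p_i), with p_i the relative frequency of symbol i
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "to be or not to be that is the question"  # hypothetical stand-in for charlist
h_sample = entropy_bits(sample)
h_uniform = math.log2(27)  # upper bound: 27 equally likely symbols

print("entropy of the sample (bits/char):", h_sample)
print("uniform upper bound (bits/char)  :", h_uniform)
```

The uniform case is the largest entropy a 27-symbol alphabet can have, so any real text should come out below it.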


## Question #2: 
Total number of possible sequences with 167774 characters and the probability of typing any single one of them if you were to hit the keyboard 167774 times randomly.

In [25]:
# number of characters:
n = 167774

# probability of a single character occurring at any given position:
prob_single = mpmath.mpf(1.0)/mpmath.mpf(27.0)

# keystrokes are independent, so the probability of one specific sequence is
# the single-character probability raised to the power of the number of characters:
tot_prob = prob_single**n

# Total number of distinct sequences that can be made of "n" characters:
totcomb = mpmath.mpf(27.0)**n

print("Probability of typing a specific sequence of",n,"characters if you were to hit keyboard",n,"times randomly:", tot_prob)
print("There are", totcomb, "different sequences with", n,"characters.")


Probability of typing a specific sequence of 167774 characters if you were to hit keyboard 167774 times randomly: 2.37592096671345e-240146
There are 4.20889420986911e+240145 different sequences with 167774 characters.
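
The order of magnitude above can be cross-checked with ordinary floats by working with logarithms instead of the numbers themselves. This is only a sketch; the names `log10_totcomb`, `exponent` and `mantissa` are introduced here for illustration.

```python
import math

n = 167774        # number of characters, as in the cell above
nsymbols = 27     # 26 letters plus the space

# log10 of the number of sequences: log10(27**n) = n * log10(27)
log10_totcomb = n * math.log10(nsymbols)

# Split into an integer exponent and a mantissa between 1 and 10
exponent = math.floor(log10_totcomb)
mantissa = 10 ** (log10_totcomb - exponent)

print("27**n is about %.4fe+%d" % (mantissa, exponent))
print("log10 of (1/27)**n is about %.4f" % -log10_totcomb)
```

Because the probability of one specific sequence is the reciprocal of the number of sequences, its log10 is simply the negative of the same value.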


## Question #3: 
Probability of typing a specific sequence with 167774 characters if you were to hit the keyboard 167774 times randomly, but for the actual probability distribution of spamlet.


In [62]:
# Catalog with all possible characters from spamlet and number of times they occur:
catalog = {}
for symbol in charlist:
    if symbol in catalog:
        catalog[symbol] += 1  
    else:
        catalog[symbol] = 1 

# Deleting the non-English letters that c.isalpha() let through:
catalog.pop('æ')
catalog.pop('à')

# Weight of each character: (prob. to type any letter or space out of 27) x (relative frequency of that character among the 167774 characters)
prob = {}
for elem in catalog:
    prob[elem] = mpmath.mpf((1.0/27.0)*(catalog[elem]/n))

# Collecting the values of the dictionary "prob" in a list "prob_list" so that they can be combined:
prob_list = []
for value in prob:
    prob_list.append(prob[value])

# Sum of the weights, raised to the n-th power:
tot_prob = mpmath.mpf(np.sum(prob_list))**n

print("The probability of typing 167774 symbols in a specific sequence for an actual spamlet distribution is:",tot_prob)


The probability of typing 167774 symbols in a specific sequence for an actual spamlet distribution is: 2.90906943084119e-234324
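
One common way to phrase this question is: if each key is drawn independently with probability equal to that character's relative frequency in the text, the probability of typing the text exactly is the product of p_i raised to the number of times N_i each character occurs. The sketch below evaluates that product in log space. It is an alternative formulation under the stated independence assumption, not the calculation performed in the cell above, and `log10_prob_of_text` and `sample` are hypothetical names used only for illustration.

```python
import math
from collections import Counter

def log10_prob_of_text(text):
    """log10 of the probability of typing `text` exactly, when each key is
    drawn independently with the character frequencies of `text` itself."""
    counts = Counter(text)
    total = sum(counts.values())
    # P = product_i p_i ** N_i, so log10 P = sum_i N_i * log10(p_i)
    return sum(c * math.log10(c / total) for c in counts.values())

sample = "to be or not to be"   # hypothetical stand-in for charlist
log10_actual = log10_prob_of_text(sample)
log10_uniform = -len(sample) * math.log10(27)

print("log10 P (actual frequencies):", log10_actual)
print("log10 P (uniform, 27 keys)  :", log10_uniform)
```

With this formulation, the text is more probable under its own character frequencies than under the uniform distribution, because frequent characters are assigned a probability larger than 1/27.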


## Question #4: 
Probability of typing each 2-key sequence from spamlet.


In [78]:
# probability of typing two specific characters in a row. Spaces included.
prob = mpmath.mpf(1.0/27.0)**2

# probability of typing "n/2" specific sequences, each of length 2, in a row. Spaces included.
tot_prob = prob**(n/2)

print("The joint probability of each 2-key sequence from spamlet:", tot_prob)


The joint probability of each 2-key sequence from spamlet: 2.37592096672102e-240146


Now the same probability of typing "n/2" sequences, each of length 2, in a row, but for the actual distribution of spamlet:

In [79]:
# A dictionary where each key identifies a unique letter from spamlet, and
# each value is the squared weight of that letter for an actual spamlet distribution:
prob = {}
for elem in catalog:
    prob[elem] = (mpmath.mpf((1.0/27.0)*(catalog[elem]/n)))**2

# since the values of a dictionary cannot be multiplied directly,
# I transformed the dictionary into a list:
prob_list = []
for value in prob:
    prob_list.append(prob[value])

# Net probability is the product of all the values calculated above, raised to the
# power n/2, since the outcome of each trial is independent of the next one:
tot_prob = mpmath.mpf(np.prod(prob_list))**(n/2)

print("The joint probability of each 2-key sequence from spamlet for an actual spamlet distribution is:",tot_prob)

The joint probability of each 2-key sequence from spamlet for an actual spamlet distribution is: 5.560661280888e-14121187
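
A different way to approach 2-key sequences is to count the pairs of adjacent characters that actually occur in the text and turn the counts into relative frequencies. The sketch below does this for overlapping pairs; `digram_probabilities` and `sample` are hypothetical names, and counting overlapping rather than non-overlapping pairs is an assumption made here for simplicity.

```python
from collections import Counter

def digram_probabilities(chars):
    """Relative frequency of every pair of adjacent characters in `chars`."""
    # Overlapping pairs: (c0, c1), (c1, c2), (c2, c3), ...
    pairs = Counter(zip(chars, chars[1:]))
    total = sum(pairs.values())
    return {a + b: count / total for (a, b), count in pairs.items()}

sample = "to be or not to be"   # hypothetical stand-in for charlist
probs = digram_probabilities(sample)

# Show the three most frequent 2-key sequences
for pair, p in sorted(probs.items(), key=lambda kv: -kv[1])[:3]:
    print(repr(pair), round(p, 3))
```

The same function works on a list of single characters such as `charlist`, since `zip` and slicing behave the same way for lists and strings.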


## Question #5:
Entropy of 2-key sequences in spamlet

In [77]:
# A list with the terms under the "sum" sign in Shannon's equation
# (mpmath.log is the natural logarithm, so the result is in nats):
list_to_sum = []
for p in prob_list:
    list_to_sum.append(-mpmath.log(p)*p)

# Now, entropy is simply the sum of elements in the "list_to_sum":
entropy = np.sum(list_to_sum)

print("The entropy of 2-key sequences in spamlet is:",entropy)

The entropy of 2-key sequences in spamlet is: 0.00133710641595483
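
As a point of comparison, the entropy of 2-key sequences can also be estimated directly from the observed pair counts, using log base 2 so that the result is in bits per pair. This is a sketch under the assumption that overlapping pairs of adjacent characters are counted; `entropy_bits_from_counts` and `sample` are hypothetical names.

```python
import math
from collections import Counter

def entropy_bits_from_counts(counts):
    """Shannon entropy in bits of the distribution given by a Counter of counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "to be or not to be that is the question"  # hypothetical stand-in for charlist

single_counts = Counter(sample)                    # one entry per character
pair_counts = Counter(zip(sample, sample[1:]))     # one entry per adjacent pair

h_single = entropy_bits_from_counts(single_counts)
h_pair = entropy_bits_from_counts(pair_counts)

print("entropy per character (bits):", h_single)
print("entropy per 2-key pair (bits):", h_pair)
print("upper bound for pairs, 2*log2(27):", 2 * math.log2(27))
```

For independent keystrokes the pair entropy would be twice the single-character entropy; in real text, neighbouring letters are correlated, which is expected to lower it.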


## Discussion and conclusion:
Probabilities are not intuitive for humans, and thinking about them too rigorously or trying to visualize them can sometimes lead to more confusion rather than clarity. I should admit that I had difficulties with this assignment, not with the coding part, but with understanding how the probabilities of independent events combine. It was quite interesting to explore the limitations of the float data type in Python, which is quite relevant since the probabilities of highly unlikely events decrease extremely fast. I am grateful to the authors of the "mpmath" package for bringing those tools to the Python community. Finally, I think it's a great thought experiment to consider that any possible outcome will almost surely occur given that the number of trials and the timespan are infinite ("Infinite Monkey Theorem"). Also, I think it's crucial for any scientist to develop a comprehensive understanding of entropy, since it appears to be one of the most fundamental concepts describing how nature behaves. I would appreciate feedback on the points where I made mistakes.
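
The float limitation mentioned above can be seen in a few lines. The sketch below uses the standard-library `decimal` module as a stand-in for mpmath, purely so that the example runs without third-party packages; that substitution is an assumption of this example and not what the notebook itself uses.

```python
from decimal import Decimal

n = 167774   # number of characters, as in Question #2

# Ordinary floats underflow: the smallest positive float is around 5e-324,
# so a probability near 1e-240146 is rounded to exactly zero.
p_float = (1.0 / 27.0) ** n
print("float  :", p_float)

# Decimal stores the exponent separately and allows a much wider range,
# so the same power stays non-zero.
p_decimal = (Decimal(1) / Decimal(27)) ** n
print("Decimal:", p_decimal)
```

Arbitrary-precision types like this trade speed for range, which is usually acceptable when only a handful of extreme values need to be computed.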