# Statistics Notebook

This Jupyter Notebook contains some basic exercises in statistics. The field has applications across industry and research: data science, artificial intelligence, biological simulations of cells and statistical mechanics, to name just a few.

Let us begin with the classical example of throwing a six-faced die, supposing that each face is an equally likely outcome. Theory tells us to expect an equal probability for every number between 1 and 6: $ p_1 = p_2 = \dots = p_6 = \frac{1}{6} $. However, let us simulate this on a computer and see what we get. First, import the following library:

In [None]:
import numpy as np

You have already seen this library plenty of times. It defines arrays and matrices together with many properties and operations, so they are very convenient to work with. We also use numpy arrays because they can be plotted directly. For more documentation, type help(np) or go to this link: https://numpy.org/doc/stable/reference/index.html. In summary, numpy arrays and matrices can be multiplied in various ways (recall how matrices are multiplied) or used for statistical and linear algebra operations. For plotting, we also need the following library:

In [None]:
import matplotlib.pyplot as plt

Now we are set for the problem. Imagine we have a die and we want to throw it many times to visualize the outcomes. We can do this 10, 100 or 10000 times; after all, this is what computers are good at: working with huge amounts of data. We have already used this function before:

In [None]:
help(np.random.randint)

From the documentation, it seems we can quickly create an array with as many elements as we want. Of course, we can also do it manually with a loop (coding exercise left for you). This is already a significant advantage of working with numpy arrays:

In [None]:
stats_test = np.random.randint(6, size = 10) 
stats_test

Beware that passing the number 6 generates uniformly distributed random integers from 0 to 5, not from 1 to 6. To fix this, just add 1:

In [None]:
stats_test = stats_test + 1
stats_test
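
As an alternative sketch, np.random.randint also accepts an explicit lower and upper bound, where the upper bound is exclusive, so the faces 1 to 6 can be drawn in a single call. The name stats_alt is only illustrative and is not used elsewhere in this notebook.

```python
import numpy as np

# Draw 10 die faces directly from 1 to 6 (the upper bound 7 is excluded)
stats_alt = np.random.randint(1, 7, size=10)
print(stats_alt)
```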

Now plot it as a histogram: show the number of counts for each face on the vertical axis and the faces on the horizontal axis. Use the following matplotlib function. Run the last three cells several times to see how the results change.

In [None]:
plt.rcParams['figure.figsize'] = (10, 10)       # Set the figure size before the figure is created
plt.hist(stats_test, bins=np.arange(0.5, 7.5))  # Plot a histogram with one bin per face (edges at 0.5, 1.5, ..., 6.5)
plt.xlabel("Faces of the dice")              # Commands for setting the x and y axis labels
plt.ylabel("No of counts for each face")
plt.show()

Well, this is definitely not what we were looking for: we expected the faces to be equally distributed. Instead, we get significant fluctuations: at times one of the faces does not come up at all, at other times two faces have a very low number of occurrences (low <b> weight </b> in mathematical language). The statistical procedure is correct, and the theory is also correct, so what is wrong? Let us try with a bigger number of dice-throwing events (<b> trials </b>):

In [None]:
stats_test = np.random.randint(6, size = 1000) + 1 # Set directly all the number faces from 1 to 6

plt.rcParams['figure.figsize'] = (10, 10)    # Set the figure size before the figure is created
plt.hist(stats_test, bins=6)                 # Command for plotting a histogram
plt.xlabel("Faces of the dice")              # Commands for setting the x and y axis labels
plt.ylabel("No of counts for each face")
plt.show()

This indeed looks better: the faces are more evenly distributed, even though there are still noticeable differences in the face weights. Let us now try with an even bigger number of trials:

In [None]:
stats_test = np.random.randint(6, size = 10000) + 1 # Set directly all the number faces from 1 to 6

plt.rcParams['figure.figsize'] = (10, 10)    # Set the figure size before the figure is created
plt.hist(stats_test, bins=6)                 # Command for plotting a histogram
plt.xlabel("Faces of the dice")              # Commands for setting the x and y axis labels
plt.ylabel("No of counts for each face")
plt.show()

This looks even better. You can already see where this is going: all faces are now <b> almost equally </b> likely. Indeed, one face still prevails over the others, but only by a <b> small number </b>, which is in fact <b> astonishingly small </b> relative to the number of trials. It is so small that the statistical fluctuations around our expected distribution, an even distribution, can in fact be neglected. We have already worked out a key concept in statistics: to test an event which involves randomness (a <b> stochastic process </b>), you need a very high number of trials.
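
To put a number on these fluctuations, here is a minimal sketch that compares the relative frequency of each face with the expected value 1/6 for an increasing number of trials. It assumes numpy's newer Generator interface (np.random.default_rng) with a fixed seed chosen only for reproducibility; the names rng and max_deviation are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption: seed chosen only for reproducibility)

max_deviation = {}
for n_trials in (10, 1000, 100000):
    rolls = rng.integers(1, 7, size=n_trials)          # faces from 1 to 6
    counts = np.bincount(rolls, minlength=7)[1:]       # counts for faces 1..6 (index 0 is unused)
    freqs = counts / n_trials                          # relative frequency of each face
    max_deviation[n_trials] = np.abs(freqs - 1 / 6).max()
    print(n_trials, max_deviation[n_trials])
```

The largest deviation from 1/6 should shrink as the number of trials grows, roughly like one over the square root of the number of trials.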

Let us now try a slightly different example, with a different outcome. Suppose we roll two dice and we would like to visualize the sum: what is the probability of getting 4, 5, 6, ... overall? You already know the maths behind it: just count the combinations that give each sum from 2 to 12 (example: $ 3=1+2=2+1 $, $ 4=1+3=2+2=3+1 $, and so on). The probability distribution in this case is peaked and symmetric, centered on $x=7$ (for two dice it is a triangle, the first step towards a bell shape). Let us simulate this on a computer. We already know that for the simulation to work we should directly use a big number of trials, not 10 or 20:

In [None]:
test_two_dice = np.random.randint(6, size = 10000) + np.random.randint(6, size=10000) + 2 # We add 2, as 2 is the smallest sum
                                                                                          # possible: 1+1=2

plt.rcParams['figure.figsize'] = (10, 10)    # Set the figure size before the figure is created
plt.hist(test_two_dice, bins=11)             # Command for plotting a histogram: there are 11 numbers from 2 to 12, so 11 bins are due.
                                             # Try with a different number of bins to see what you get
plt.xlabel("Sum of the two dice")            # Commands for setting the x and y axis labels
plt.ylabel("No of counts for each sum")
plt.show()

Let us calculate the probability of getting 7 only from the simulation above. We can do this either with a loop, traditionally, or we can use the benefits of working with numpy arrays:

In [None]:
print(len(test_two_dice[ test_two_dice == 7 ]) / 10000)

Does the answer match the expected probability?
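
One way to obtain the expected probability is to enumerate all 36 equally likely outcomes and count those that sum to 7. The sketch below uses only the standard library; the names outcomes, favourable and p_seven are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die), each equally likely
outcomes = list(product(range(1, 7), repeat=2))

# Pairs whose sum is 7: (1, 6), (2, 5), ..., (6, 1)
favourable = [pair for pair in outcomes if sum(pair) == 7]

p_seven = Fraction(len(favourable), len(outcomes))
print(p_seven, float(p_seven))
```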

Let us now imagine that we roll three dice: what does the distribution of the sum look like?

In [None]:
test_three_dice = np.random.randint(6, size = 10000) + np.random.randint(6, size=10000) + np.random.randint(6, size = 10000) + 3 
                                             # We add 3, as 3 is the smallest sum possible: 1+1+1 = 3
plt.rcParams['figure.figsize'] = (10, 10)    # Set the figure size before the figure is created
plt.hist(test_three_dice, bins=16)           # Command for plotting a histogram: there are 16 numbers from 3 to 18, so 16 bins are due.
                                             # Try with a different number of bins to see what you get
plt.xlabel("Sum of the three dice")          # Commands for setting the x and y axis labels
plt.ylabel("No of counts for each sum")
plt.show()

Just for fun, let us now plot what happens when we roll 10 dice: clearly, we can write that sum out 10 times, or we can use a loop to save us some trouble:

In [None]:
test_ten_dices = np.zeros(10000)

# Add the 10 random events; each call returns values from 0 to 5
for i in range(10):
    test_ten_dices += np.random.randint(6, size=10000)

test_ten_dices += 10                         # Add 1 per die to get dice numbering from 1 to 6
# Plot the results: from 10 to 60 there are 51 possible sums, so use 51 bins, one per sum
plt.rcParams['figure.figsize'] = (10, 10)    # Set the figure size before the figure is created
plt.hist(test_ten_dices, bins=np.arange(9.5, 61.5))
plt.xlabel("Sum of the ten dice")            # Commands for setting the x and y axis labels
plt.ylabel("No of counts for each sum")
plt.show()

<b> Question: </b> We have seen the shape of the curve - the bell curve - but where is the number 60? 

In [None]:
print(len(test_ten_dices[test_ten_dices == 60]))

How come that out of 10000 trials we never got all ten dice showing 6? Time for some maths: the probability of getting 6 with a single die is $ p=\frac{1}{6} $. The dice are thrown all together, but the 10 events are independent of each other, so the probability of all 6s is $ p = \frac{1}{6^{10}} $, and the expected number of trials needed to see all 6s once is $ N = 6^{10} $.

In [None]:
print(6**10)

10000 trials is not even close to the required number of trials for getting all 6s, which is roughly 60 million trials. Let us try the simulation with the required number to see what we get:
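
A possible way to run such a large simulation is sketched below. Holding 60 million trials of 10 dice in memory at once is heavy, so the trials are processed in chunks. The helper count_all_sixes and its parameters are hypothetical names introduced for this sketch, and the demonstration call uses only 3 dice so that it finishes quickly.

```python
import numpy as np

def count_all_sixes(n_trials, n_dice=10, chunk_size=1_000_000, seed=None):
    """Count the trials in which every one of n_dice dice shows a 6.

    Hypothetical helper: trials are processed in chunks to keep memory bounded.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    remaining = n_trials
    while remaining > 0:
        n = min(chunk_size, remaining)
        # One row per trial, one column per die, faces from 1 to 6
        rolls = rng.integers(1, 7, size=(n, n_dice), dtype=np.uint8)
        hits += int(np.count_nonzero((rolls == 6).all(axis=1)))
        remaining -= n
    return hits

# Quick demonstration with 3 dice: all 6s is expected about once every 6**3 = 216 trials
demo_hits = count_all_sixes(216_000, n_dice=3, seed=0)
print(demo_hits)
```

Calling count_all_sixes(6**10) runs the full experiment. It draws about 600 million random numbers, so it can take a while. Even then the number of hits is random: the average is one, but getting zero hits is still quite likely.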

### Gaussian (normal) distribution

Apart from providing some more programming practice, the purpose of the exercises above was to show you how many stochastic processes can be described by Gaussians, the bell-shaped curve you have seen all these times. The proper mathematical proof will be set aside, but the concept will be discussed here. If many independent random events with identically distributed probability density functions are added together, the distribution of the sum approaches a Gaussian distribution (this is the central limit theorem). The operation that combines the distributions of the individual events into the distribution of their sum is called <b> convolution. </b> The Gaussian distribution is frequently met in nature: height and weight distributions, incomes and sales, and so on...(More examples here...)
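
As a small illustration of convolution, the exact distribution of a sum of dice can be computed without any random numbers by convolving the distribution of a single die with itself. This is a sketch using np.convolve; the array names are illustrative.

```python
import numpy as np

one_die = np.full(6, 1 / 6)                # probabilities of the faces 1..6
two_dice = np.convolve(one_die, one_die)   # probabilities of the sums 2..12

ten_dice = one_die
for _ in range(9):                         # convolve nine more times to reach ten dice
    ten_dice = np.convolve(ten_dice, one_die)   # probabilities of the sums 10..60

print(two_dice[7 - 2])     # probability that two dice sum to 7
print(ten_dice[60 - 10])   # probability that ten dice all show 6
```

Plotting ten_dice against the sums 10 to 60 gives the smooth bell-shaped curve that the histograms above approximate.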

### Some combinatorics exercises

Take, as before, the two dice being thrown, with equal chances of getting any face. Let us do some basic probability exercises:

<b> Q1: </b> What are the chances of getting a number lower than 3 for the first die <b> and </b> bigger than 4 for the second die?

The dice are not communicating with each other, or exchanging any information - the events of die throwing are independent: 
$ p_0 = p_1(<3) \cdot p_2(>4) = \frac{2}{6} \cdot \frac{2}{6} = \frac{4}{36} $. Let us try this simulation below, to see if the theoretical value is right:

In [None]:
n_checked = 0 # variable for trial which satisfies our condition

for i in range(10000): # work with big numbers to make sure we get the right distribution
    #breakpoint()
    a = np.random.randint(6, size=2) 
    if( (a[0] < 3) and (a[1] > 4) ): # These are the two dice we are throwing
        n_checked += 1
print(n_checked / 10000)

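Why does the simulated value differ from the theoretical one? The call np.random.randint(6, size=2) returns values from 0 to 5, not from 1 to 6. Solution: add np.ones(2) to shift every die by 1.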

<b> Q2: </b> What are the chances of getting a number lower than 3 for the first die <b> or </b> greater than 4 for the second die? Let us do this question in Python first, and then work on it more:

In [None]:
n_checked = 0 # variable for trial which satisfies our condition

for i in range(10000): # work with big numbers to make sure we get the right distribution
    #breakpoint()
    a = np.random.randint(6, size=2) + np.ones(2)  # add 1 to get faces from 1 to 6
    if( (a[0] < 3) or (a[1] > 4) ): # These are the two dice we are throwing
        n_checked += 1
print(n_checked / 10000)

Notice how adding the probabilities $ p(A \land B) + p(A \lor B) $ gives the sum $ p(A) + p(B) $, where $ p(A) = p(B) = \frac{2}{6} $
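
This relation can be checked exactly by enumerating the 36 outcomes of two dice. The sketch below uses only the standard library, with A meaning "first die lower than 3" and B meaning "second die greater than 4"; all variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all (first die, second die) pairs
total = len(outcomes)

p_a = Fraction(sum(1 for a, b in outcomes if a < 3), total)
p_b = Fraction(sum(1 for a, b in outcomes if b > 4), total)
p_and = Fraction(sum(1 for a, b in outcomes if a < 3 and b > 4), total)
p_or = Fraction(sum(1 for a, b in outcomes if a < 3 or b > 4), total)

print(p_and, p_or, p_and + p_or == p_a + p_b)
```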

<b> Q3: </b> What are the chances of throwing two dice and getting a sum greater than or equal to 6 <b> and </b> a number lower than 3 on the first die?

The product rule (the relation behind Bayes' Theorem) states that, in this case: $ p(A,B) = p(A) \cdot p(B|A) $:

For die A showing 1 we need die B to show either 5 or 6.

For die A showing 2 we need die B to show either 4, 5 or 6.

The theoretical answer is: $ p(A,B) = p(A=1) \cdot p(B=5,6) + p(A=2) \cdot p(B=4,5,6) = \frac{1}{6} \cdot \frac{2}{6} + \frac{1}{6} \cdot \frac{3}{6} = \frac{5}{36} $. Dividing by $ p(A<3) = \frac{2}{6} $ gives the conditional probability of a sum of at least 6 given that the first die shows less than 3, namely $ \frac{5}{12} $. Let us check the joint probability in Python:

In [None]:
n_checked = 0 # variable for trial which satisfies our condition

for i in range(10000): # work with big numbers to make sure we get the right distribution
    #breakpoint()
    a = np.random.randint(6, size=2) + np.ones(2)
    if( (a[0] + a[1]) >= 6 and a[0] < 3 ): # These are the two dice we are throwing
        n_checked += 1
print(n_checked / 10000)
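
The same experiment can also be written without a loop. The following sketch assumes numpy's Generator interface with a fixed seed (chosen only for reproducibility) and estimates both the joint probability and the conditional probability; the names low_first, big_sum, p_joint and p_cond are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption: seed chosen only for reproducibility)
n_trials = 100000

first = rng.integers(1, 7, size=n_trials)    # first die, faces 1..6
second = rng.integers(1, 7, size=n_trials)   # second die, faces 1..6

low_first = first < 3                # event A: first die lower than 3
big_sum = (first + second) >= 6      # event B: sum of at least 6

p_joint = np.mean(low_first & big_sum)   # estimate of p(A, B), theory 5/36
p_cond = np.mean(big_sum[low_first])     # estimate of p(B | A), theory 5/12
print(p_joint, p_cond)
```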

### Application: Estimation of $ \pi $

Many scientists have struggled to find more and more accurate formulas for $ \pi $, and as many digits as possible. We will attempt to estimate this number by numerical simulation through a simple procedure: take a square of side length 2 and a circle inscribed in it, so the circle has radius 1. Shoot as many random points as possible into the square (eventually they will cover the whole square) and mark those that fall inside the circle. At the end, take the ratio between the number of points inside the circle and the overall number of points.

In [None]:
import time

N = 10000 # Number of points to consider
plt.figure()
no_inside = 0

t1 = time.time()
for i in range(N):
    a = np.random.random(2) * 2 - np.ones(2)        # Create a random point with x and y coordinates between -1 and 1
    if(a[0]**2 + a[1]**2 <= 1):                     # Criterion for a point lying inside the circle
        plt.plot(a[0], a[1], 'ro')          # Mark it out as red
        no_inside += 1
    else:
        plt.plot(a[0], a[1], 'bo')          # Mark it as blue

plt.show()
t2 = time.time()
print("The estimated ratio is: " + str(no_inside/N))
print("The process took " + str(t2 - t1) + " seconds")

The ratio between the area of the circle and the area of the square is $ \frac{S_{circle}}{S_{square}} = \frac{ \pi \cdot r^2}{4} = \frac{\pi}{4} $, as $ r = 1 $. Multiplying the estimated ratio by 4 therefore gives an estimate for $ \pi $, even though it may be a bit inaccurate. Question: try out the algorithm with smaller or bigger numbers N.

In [None]:
print("The estimated number pi is: " + str(no_inside * 4/N))
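
Plotting every point one by one makes the loop above slow for large N. If only the estimate is needed, the points can be generated and tested all at once. This is a minimal sketch, assuming numpy's Generator interface with a fixed seed (chosen only for reproducibility); the names points, inside and pi_estimate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption: seed chosen only for reproducibility)
n_points = 1_000_000

# One row per point, columns are the x and y coordinates, each between -1 and 1
points = rng.uniform(-1.0, 1.0, size=(n_points, 2))

# A point lies inside the unit circle when x**2 + y**2 <= 1
inside = (points ** 2).sum(axis=1) <= 1.0

pi_estimate = 4 * inside.mean()
print(pi_estimate)
```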

### Application: Birthday Paradox

Did you know that among 23 randomly chosen people, there is about a $ 50\% $ chance that at least two of them share the same birthday? Take one person and assign a random day.

Take a second person and assign them a random day. The chance of the two having different birthdays is: $ 1 - \frac{1}{365} $

Take a third person: the chance that all three have different birthdays is the chance that the first two differ, times the chance that the third differs from both <b> given that the previous two already have different birthdays</b>: $ (1 - \frac{1}{365}) \cdot (1 - \frac{2}{365}) $ (this is nothing else but the product rule for conditional probabilities)

Take a fourth person: the chance that all four have different birthdays, using again the chance that the fourth differs from the other three <b> given that the previous three already have different birthdays</b>, is: $ (1 - \frac{1}{365}) \cdot (1 - \frac{2}{365}) \cdot (1 - \frac{3}{365}) $

How many people does it take for the chance of a shared birthday to reach $ 50\% $?

In [169]:
i = 1
p = 1
while(p > 0.5):
    p = p * (1 - i/365)
    i += 1
print("The required number of people to reach half-chances is: " + str(i))

The required number of people to reach half-chances is: 23
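
The same product can be evaluated for every group size at once, which shows how quickly the chance of a shared birthday grows. This is a sketch using np.cumprod; the names n_people, p_distinct, p_shared and first_above_half are illustrative.

```python
import numpy as np

n_people = np.arange(1, 61)   # group sizes from 1 to 60

# p_distinct[k-1] is the probability that k people all have different birthdays:
# the product of (1 - j/365) for j = 0, 1, ..., k-1
p_distinct = np.cumprod(1 - (n_people - 1) / 365)
p_shared = 1 - p_distinct     # probability that at least two people share a birthday

# Smallest group size for which a shared birthday is more likely than not
first_above_half = int(n_people[np.argmax(p_shared > 0.5)])
print(first_above_half, p_shared[first_above_half - 1])
```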


Now do the whole simulation in Python: take 23 variables and assign them random values out of 365 possible days. If two variables have the same value, we have detected a case. Repeat this a significant number of times to see what we get. <b> Hint: </b> we use the numpy function np.unique (we can implement it manually another time):

In [178]:
N = 10000 # Number of iterations

paradox_var = 0
for i in range(N):
    a = np.random.randint(365, size = 23)
    if(len(a) != len(np.unique(a))):
        paradox_var += 1
print(paradox_var/N)

0.5078
