# MACHINE LEARNING AND QUANTUM COMPUTERS
# ASSIGNMENT 1 (26/11/25)

## PROBLEM 3

<div class="alert alert-block alert-success">
<b>P3</b>. For which distributions does the 68–95–99.7 rule hold?
</div>

### Preliminaries

Let's start by importing all the libraries that we will need:

In [1]:
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import random

Also, let's check that all of those packages were correctly installed:

In [2]:
print(f"Numpy's version: {np.__version__}")
print(f"Matplot's version: {mpl.__version__}")
print(f"Scipy's version: {sp.__version__}")
print(f"Pandas's version: {pd.__version__}")

Numpy's version: 2.3.4
Matplot's version: 3.10.7
Scipy's version: 1.16.3
Pandas's version: 2.3.3


As we will use the *68-95-99.7* rule, we'll start by explaining its fundamentals. Then, we'll move on and compute it for the random data we generate.

### 68-95-99.7 rule

$$\text{Pr}(\mu-1\sigma\leq X\leq\mu+1\sigma)\approx 68.27\% $$
$$\text{Pr}(\mu-2\sigma\leq X\leq\mu+2\sigma)\approx 95.45\% $$
$$\text{Pr}(\mu-3\sigma\leq X\leq\mu+3\sigma)\approx 99.73\% $$

This rule, also known as the empirical rule (and sometimes abbreviated $3\sigma$), is a shorthand for remembering the percentage of values that lie within an interval around the mean of a normal distribution: approximately 68%, 95% and 99.7% of the values lie within one, two and three standard deviations of the mean, respectively.
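
As a quick check of the three percentages, the sketch below evaluates them from the normal cumulative distribution function. The helper name `normal_coverage` is an illustrative assumption, not part of the assignment; it relies only on `scipy.stats.norm.cdf`.

```python
from scipy.stats import norm

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean.

    The result does not depend on mu or sigma, so the standard normal is used.
    """
    return norm.cdf(k) - norm.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} sigma: {100 * normal_coverage(k):.2f}%")
# Within 1 sigma: 68.27%
# Within 2 sigma: 95.45%
# Within 3 sigma: 99.73%
```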

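It is, therefore, a rule that should only be expected to hold for normal (or approximately normal) data. Note that a large sample size alone does not make a dataset normal: the central limit theorem applies to sums and averages of many independent values, not to the individual values themselves, so a large sample drawn from, say, a uniform distribution remains uniform.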
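
The central limit theorem concerns averages of many independent draws. As a minimal sketch of that statement, the code below builds 5000 sample means, each over 30 uniform draws, and measures how many of those means fall within one, two and three standard deviations. The seed, the counts `n_means` and `n_per_mean`, and the name `clt_coverage` are illustrative assumptions, not values from the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Each row holds 30 draws from Uniform(0, 2); we keep the mean of every row.
n_means, n_per_mean = 5000, 30
means = rng.uniform(0, 2, size=(n_means, n_per_mean)).mean(axis=1)

# Theoretical mean and standard deviation of the mean of 30 Uniform(0, 2) draws.
mu = 1.0
sigma = (2 / np.sqrt(12)) / np.sqrt(n_per_mean)

# Percentage of sample means within k standard deviations of mu.
clt_coverage = {k: 100 * np.mean(np.abs(means - mu) <= k * sigma) for k in (1, 2, 3)}

for k, pct in clt_coverage.items():
    print(f"Sample means within {k} sigma: {pct:.2f}%")
```

The averages should land close to 68-95-99.7 even though the individual uniform values do not.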

### Generating random data

Let's generate random data from three different distributions: a normal, a uniform and a beta distribution. We'll start by defining a function that draws the samples and computes their means and standard deviations:

In [17]:
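def data_set(N,mu,sigma,a,b,alpha,beta):
    """Draw N samples from each distribution and return them with their sample statistics."""
    # Samples from Normal(mu, sigma), Uniform(a, b) and Beta(alpha, beta)
    x_normal = np.random.normal(mu,sigma,N)
    x_uniform = np.random.uniform(a,b,N)
    x_beta = np.random.beta(alpha,beta,N)
    # Evenly spaced grid on [-10, 10] (returned for convenience, e.g. for plotting)
    x = np.linspace(-10,10,N)

    # Sample mean and (population, ddof=0) standard deviation of each dataset
    mN = np.mean(x_normal)
    sN = np.std(x_normal)

    mU = np.mean(x_uniform)
    sU = np.std(x_uniform)

    mB = np.mean(x_beta)
    sB = np.std(x_beta)

    return(x_normal,x_uniform,x_beta,x,mN,sN,mU,sU,mB,sB)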

We'll draw each distribution at three sample sizes:
- 5000 elements
- 500 elements
- 50 elements

In [50]:
x_normal1, x_uniform1, x_beta1, x1, mN1, sN1, mU1, sU1, mB1, sB1 = data_set(N = 5000, mu = 1, sigma = 1.1, a = 0, b = 2, alpha = 4, beta = 4)
x_normal2, x_uniform2, x_beta2, x2, mN2, sN2, mU2, sU2, mB2, sB2 = data_set(N = 500, mu = 1, sigma = 1.1, a = 0, b = 2, alpha = 4, beta = 4)
x_normal3, x_uniform3, x_beta3, x3, mN3, sN3, mU3, sU3, mB3, sB3 = data_set(N = 50, mu = 1, sigma = 1.1, a = 0, b = 2, alpha = 4, beta = 4)

Let's see if our distributions follow the rule or not. To do so, we check, for each value $X_i$ of our dataset $X$, whether

$$|X_i-\mu|\leq k\cdot\sigma$$

for $k\in\{1, 2, 3\}$, where $\mu$ and $\sigma$ are the theoretical mean and standard deviation of each distribution. Once we have computed each $|X_i-\mu|$ for the different probability distributions (and sizes), we count what percentage of them fall inside the $k\cdot\sigma$ threshold. If the percentages we get are close to 68-95-99.7, the data passes the *test*. Let's see:
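
The check described above can be condensed into a single helper. This is only a sketch: the function name `coverage_percent` and its parameter `ks` (the multiples of $\sigma$ to test) are hypothetical and do not appear in the cell that follows, which spells out the same computation case by case.

```python
import numpy as np

def coverage_percent(x, mu, sigma, ks=(1, 2, 3)):
    """Percentage of the values in x that lie within k*sigma of mu, for each k in ks."""
    x = np.asarray(x)
    # The comparison gives a boolean array; its mean is the fraction of True entries.
    return [100 * np.mean(np.abs(x - mu) <= k * sigma) for k in ks]

# Tiny worked example: with mu = 2 and sigma = 1, the values 1, 2 and 3
# are within one sigma (3 out of 5), and all five are within two sigma.
example = coverage_percent([0, 1, 2, 3, 4], mu=2, sigma=1)
print([f"{p:.0f}%" for p in example])
```

With a helper like this, a call such as `coverage_percent(x_normal1, 1, 1.1)` would return the three Gaussian percentages for N=5000 at once.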

In [52]:
# Gauss
meanN = 1
sigmaN = 1.1

onestdN1 = np.sum(np.abs(x_normal1 - meanN) <= sigmaN) / 5000 * 100
twostdN1 = np.sum(np.abs(x_normal1 - meanN) <= 2 * sigmaN) / 5000 * 100
threestdN1 = np.sum(np.abs(x_normal1 - meanN) <= 3 * sigmaN) / 5000 * 100
onestdN2 = np.sum(np.abs(x_normal2 - meanN) <= sigmaN) / 500 * 100
twostdN2 = np.sum(np.abs(x_normal2 - meanN) <= 2 * sigmaN) / 500 * 100
threestdN2 = np.sum(np.abs(x_normal2 - meanN) <= 3 * sigmaN) / 500 * 100
onestdN3 = np.sum(np.abs(x_normal3 - meanN) <= sigmaN) / 50 * 100
twostdN3 = np.sum(np.abs(x_normal3 - meanN) <= 2 * sigmaN) / 50 * 100
threestdN3 = np.sum(np.abs(x_normal3 - meanN) <= 3 * sigmaN) / 50 * 100

print("[ GAUSSIAN 68-95-99.7 RULE ]")
print(f"Percentage within 1 standard deviation (G, N=5000): {onestdN1:.2f}%")
print(f"Percentage within 2 standard deviations (G, N=5000): {twostdN1:.2f}%")
print(f"Percentage within 3 standard deviations (G, N=5000): {threestdN1:.2f}%")
print()
print(f"Percentage within 1 standard deviation (G, N=500): {onestdN2:.2f}%")
print(f"Percentage within 2 standard deviations (G, N=500): {twostdN2:.2f}%")
print(f"Percentage within 3 standard deviations (G, N=500): {threestdN2:.2f}%")
print()
print(f"Percentage within 1 standard deviation (G, N=50): {onestdN3:.2f}%")
print(f"Percentage within 2 standard deviations (G, N=50): {twostdN3:.2f}%")
print(f"Percentage within 3 standard deviations (G, N=50): {threestdN3:.2f}%")
print()
print()

# Uniform
a = 0
b = 2
meanU = (a+b)/2
sigmaU = (b-a)/np.sqrt(12)

onestdU1 = np.sum(np.abs(x_uniform1 - meanU) <= sigmaU) / 5000 * 100
twostdU1 = np.sum(np.abs(x_uniform1 - meanU) <= 2 * sigmaU) / 5000 * 100
threestdU1 = np.sum(np.abs(x_uniform1 - meanU) <= 3 * sigmaU) / 5000 * 100
onestdU2 = np.sum(np.abs(x_uniform2 - meanU) <= sigmaU) / 500 * 100
twostdU2 = np.sum(np.abs(x_uniform2 - meanU) <= 2 * sigmaU) / 500 * 100
threestdU2 = np.sum(np.abs(x_uniform2 - meanU) <= 3 * sigmaU) / 500 * 100
onestdU3 = np.sum(np.abs(x_uniform3 - meanU) <= sigmaU) / 50 * 100
twostdU3 = np.sum(np.abs(x_uniform3 - meanU) <= 2 * sigmaU) / 50 * 100
threestdU3 = np.sum(np.abs(x_uniform3 - meanU) <= 3 * sigmaU) / 50 * 100

print("[ UNIFORM 68-95-99.7 RULE ]")
print(f"Percentage within 1 standard deviation (U, N=5000): {onestdU1:.2f}%")
print(f"Percentage within 2 standard deviations (U, N=5000): {twostdU1:.2f}%")
print(f"Percentage within 3 standard deviations (U, N=5000): {threestdU1:.2f}%")
print()
print(f"Percentage within 1 standard deviation (U, N=500): {onestdU2:.2f}%")
print(f"Percentage within 2 standard deviations (U, N=500): {twostdU2:.2f}%")
print(f"Percentage within 3 standard deviations (U, N=500): {threestdU2:.2f}%")
print()
print(f"Percentage within 1 standard deviation (U, N=50): {onestdU3:.2f}%")
print(f"Percentage within 2 standard deviations (U, N=50): {twostdU3:.2f}%")
print(f"Percentage within 3 standard deviations (U, N=50): {threestdU3:.2f}%")
print()
print()

# Beta
alpha = 4
beta = 4
meanB = alpha/(alpha+beta)
sigmaB = np.sqrt((alpha * beta)/((alpha+beta)**2 * (alpha + beta + 1)))

onestdB1 = np.sum(np.abs(x_beta1 - meanB) <= sigmaB) / 5000 * 100
twostdB1 = np.sum(np.abs(x_beta1 - meanB) <= 2 * sigmaB) / 5000 * 100
threestdB1 = np.sum(np.abs(x_beta1 - meanB) <= 3 * sigmaB) / 5000 * 100
onestdB2 = np.sum(np.abs(x_beta2 - meanB) <= sigmaB) / 500 * 100
twostdB2 = np.sum(np.abs(x_beta2 - meanB) <= 2 * sigmaB) / 500 * 100
threestdB2 = np.sum(np.abs(x_beta2 - meanB) <= 3 * sigmaB) / 500 * 100
onestdB3 = np.sum(np.abs(x_beta3 - meanB) <= sigmaB) / 50 * 100
twostdB3 = np.sum(np.abs(x_beta3 - meanB) <= 2 * sigmaB) / 50 * 100
threestdB3 = np.sum(np.abs(x_beta3 - meanB) <= 3 * sigmaB) / 50 * 100

print("[ BETA 68-95-99.7 RULE ]")
print(f"Percentage within 1 standard deviation (B, N=5000): {onestdB1:.2f}%")
print(f"Percentage within 2 standard deviations (B, N=5000): {twostdB1:.2f}%")
print(f"Percentage within 3 standard deviations (B, N=5000): {threestdB1:.2f}%")
print()
print(f"Percentage within 1 standard deviation (B, N=500): {onestdB2:.2f}%")
print(f"Percentage within 2 standard deviations (B, N=500): {twostdB2:.2f}%")
print(f"Percentage within 3 standard deviations (B, N=500): {threestdB2:.2f}%")
print()
print(f"Percentage within 1 standard deviation (B, N=50): {onestdB3:.2f}%")
print(f"Percentage within 2 standard deviations (B, N=50): {twostdB3:.2f}%")
print(f"Percentage within 3 standard deviations (B, N=50): {threestdB3:.2f}%")

[ GAUSSIAN 68-95-99.7 RULE ]
Percentage within 1 standard deviation (G, N=5000): 68.18%
Percentage within 2 standard deviations (G, N=5000): 95.12%
Percentage within 3 standard deviations (G, N=5000): 99.58%

Percentage within 1 standard deviation (G, N=500): 68.40%
Percentage within 2 standard deviations (G, N=500): 94.40%
Percentage within 3 standard deviations (G, N=500): 99.60%

Percentage within 1 standard deviation (G, N=50): 72.00%
Percentage within 2 standard deviations (G, N=50): 98.00%
Percentage within 3 standard deviations (G, N=50): 100.00%


[ UNIFORM 68-95-99.7 RULE ]
Percentage within 1 standard deviation (U, N=5000): 57.98%
Percentage within 2 standard deviations (U, N=5000): 100.00%
Percentage within 3 standard deviations (U, N=5000): 100.00%

Percentage within 1 standard deviation (U, N=500): 55.20%
Percentage within 2 standard deviations (U, N=500): 100.00%
Percentage within 3 standard deviations (U, N=500): 100.00%

Percentage within 1 standard deviation (U, N=50):

As we can see, the dataset that best reproduces the 68-95-99.7 rule is the Gaussian with N=5000 (68.18%, 95.12% and 99.58%, nearly identical to the rule), as expected. The N=500 Gaussian is also very close, while the N=50 case starts to deviate from the rule (72%, 98% and 100%). This is easily explained by the small number of elements in the sample: with only 50 values, each one accounts for 2 percentage points.

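Regarding the other two distributions, the Uniform dataset does not follow the rule: for N=5000 and N=500, roughly 55-58% of the values lie within one standard deviation and 100% within two, because $2\sigma_U = 2/\sqrt{3}\approx 1.15$ already exceeds the half-width of the interval $[0, 2]$. Increasing N does not help here: a larger sample only brings the percentages closer to the distribution's own theoretical values (for the Uniform, $1/\sqrt{3}\approx 57.7\%$ within one standard deviation), not closer to the Gaussian ones. The Beta distribution with $\alpha=\beta=4$ is symmetric and bell-shaped, so in general it follows the rule better than the Uniform, which yields the worst results. Its samples come reasonably close to the rule for N=5000 and N=500, although its bounded support means that every value lies within three standard deviations ($3\sigma_B = 0.5$).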
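
To see which percentages each distribution should approach as the sample grows, the exact values can be computed from the cumulative distribution functions. The sketch below uses frozen `scipy.stats` distributions with the parameters chosen earlier; the names `dists` and `exact_coverage` are illustrative assumptions. Note that `scipy.stats.uniform` is parameterised by `loc` and `scale`, with support `[loc, loc + scale]`.

```python
from scipy import stats

# Frozen distributions with the same parameters as the generated datasets.
dists = {
    "Normal(1, 1.1)": stats.norm(loc=1, scale=1.1),
    "Uniform(0, 2)": stats.uniform(loc=0, scale=2),
    "Beta(4, 4)": stats.beta(4, 4),
}

def exact_coverage(dist, k):
    """Exact percentage of probability mass within k standard deviations of the mean."""
    mu, sigma = dist.mean(), dist.std()
    return 100 * (dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma))

for name, dist in dists.items():
    values = ", ".join(f"{exact_coverage(dist, k):.2f}%" for k in (1, 2, 3))
    print(f"{name}: {values}")
```

Comparing these exact values with the empirical ones separates sampling noise (which shrinks as N grows) from a genuine departure from the rule (which does not).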