<h2> Simulating means and variances </h2>

<h3> Introduction </h3>

We're almost finished establishing some of our big techniques in simulation. We've learned how to simulate easy-to-describe events (like coin flips and dice rolls), and in the last notebook we learned how to generate exponentially distributed random numbers through a change of variables.

In this notebook, we'll take one last look at simulation for a while -- we'll keep generating random data and studying it. Our next step will be to study actual real-world data sets and analyze them statistically. But we need one last detail: how to describe the *summary statistics* of a data set so that we can give a high-level overview of the data. We'll also get more experience working with normally distributed data. Happily, Python has a way to do that: `random.gauss()`. It's called this because normal distributions are also referred to as *Gaussian,* after Carl Friedrich Gauss.

<h3> Generating data </h3>

To begin, let's generate some normally distributed data and compare it to our Table B.1 by estimating $P(z \ge 0.29)$. The table gives this value as $0.3859$; we'll generate $10,000$ random numbers and compare.

In [86]:
# Import the necessary function (Table B.1 is on pg. 432)
from random import gauss
# z ~ N(0, 1): 0 is the mean and 1 is the standard deviation of the random variable z
mu = 0 # mean
sigma = 1 # standard deviation
success = 0
for _ in range(10000):
    if gauss(mu, sigma) >= 0.29: #gauss(0,1)
        success += 1

print(f'Estimated probability: {success / 10000}')

Estimated probability: 0.3855


After one run, I got an estimated probability of $0.383$, which is pretty close to the actual value!
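
For a reference value that doesn't depend on a printed table, the tail probability can also be computed from the standard normal CDF. The following is a minimal sketch, assuming Python 3.8 or later (for `statistics.NormalDist`); the names `exact` and `estimate` are illustrative and not part of the cell above.

```python
from random import gauss
from statistics import NormalDist

# Exact tail probability: P(Z >= 0.29) = 1 - Phi(0.29)
exact = 1 - NormalDist(mu=0, sigma=1).cdf(0.29)

# Monte Carlo estimate, following the same idea as the cell above
trials = 10000
estimate = sum(gauss(0, 1) >= 0.29 for _ in range(trials)) / trials

print(f'Exact probability:     {exact:.4f}')
print(f'Estimated probability: {estimate:.4f}')
print(f'Absolute error:        {abs(estimate - exact):.4f}')
```

With $10,000$ trials the standard error of the estimate is roughly $\sqrt{p(1-p)/n} \approx 0.005$, so disagreements of a few thousandths between runs are expected.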

Let's continue: we can compute means and variances directly from their definitions. For the mean, we'll take the sum (`sum`) of the data set and divide by the number of trials. For the variance, we'll sum the squared deviations $(r - \mu)^2$ and then divide by the number of trials.

In [85]:
# Keep going: generate 10,000 more random numbers and compute the mean and variance of the data set.

randoms = [gauss(0,1) for _ in range(10000)]
computed_mean = sum(randoms) / 10000

# Keep track of the deviations from the mean: subtract and square
squared_deviations = [(r - computed_mean)**2 for r in randoms]
computed_variance = sum(squared_deviations) / 10000

print(f'Computed mean: {computed_mean}')
print(f'Computed variance: {computed_variance}')

Computed mean: 0.00021450803601562774
Computed variance: 1.005717531706127


After running this with 10,000 numbers, I came up with a computed mean of $0.00045$ and a variance of $1.0068$. These are both really close to the actual parameters of the distribution -- which are $0$ and $1$, respectively!
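
As a cross-check, the hand computation can be compared against Python's `statistics` module. This is a sketch, assuming Python 3.8 or later; `fmean` and `pvariance` are standard-library functions, while the variable names are illustrative. Note that `pvariance` divides by $N$ (as the hand computation does), whereas `statistics.variance` divides by $N - 1$.

```python
from math import isclose
from random import gauss
from statistics import fmean, pvariance

randoms = [gauss(0, 1) for _ in range(10000)]

# By hand: sum, then divide by the number of trials
manual_mean = sum(randoms) / len(randoms)
manual_variance = sum((r - manual_mean)**2 for r in randoms) / len(randoms)

# Library equivalents (population variance, i.e. dividing by N)
library_mean = fmean(randoms)
library_variance = pvariance(randoms, mu=library_mean)

print(f'Mean:     {manual_mean} (by hand) vs {library_mean} (statistics)')
print(f'Variance: {manual_variance} (by hand) vs {library_variance} (statistics)')
print(isclose(manual_variance, library_variance, abs_tol=1e-9))
```

The two approaches should agree up to floating-point rounding.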

In fact, this is a really important technique we'll use later: take some data from the real world, make a guess at the distribution it follows, estimate the parameters of that distribution, and then see how well our hypothesis fits the data. This is one of the core ideas of applied statistics. 
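
The workflow just described can be sketched in a few lines. Everything specific here is an assumption for illustration: the "real-world" data is a hypothetical stand-in generated with `gauss(170, 8)`, the cutoff of `180` is arbitrary, and comparing one observed frequency with the fitted model's prediction is only a rough check of fit, not a formal test.

```python
from random import gauss, seed
from statistics import NormalDist, fmean, pstdev

seed(4)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for measurements collected from the real world
data = [gauss(170, 8) for _ in range(5000)]

# Step 1: guess the family (normal). Step 2: estimate its parameters.
mu_hat = fmean(data)
sigma_hat = pstdev(data)
fitted = NormalDist(mu_hat, sigma_hat)

# Step 3: compare an observed frequency with what the fitted model predicts
cutoff = 180
observed = sum(x <= cutoff for x in data) / len(data)
predicted = fitted.cdf(cutoff)

print(f'Estimated mean: {mu_hat:.2f}, estimated std: {sigma_hat:.2f}')
print(f'Observed  P(X <= {cutoff}): {observed:.4f}')
print(f'Predicted P(X <= {cutoff}): {predicted:.4f}')
```

If the observed and predicted frequencies disagree badly at several cutoffs, that is a hint that the guessed distribution is a poor fit.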

<h3> Questions: </h3>

(1) A standard rule of thumb is the [68-95-99.7](https://en.wikipedia.org/wiki/68–95–99.7_rule) rule, which refers to the probabilities of $P(-1 \le Z \le 1)$, $P(-2 \le Z \le 2)$, and $P(-3 \le Z \le 3)$ for a random variable $Z \sim N(0, 1)$. 

Generate at least $100,000$ random numbers from a Gaussian distribution and **estimate these three probabilities**; how well does your computed data match the rule?

(2) **Estimate a threshold** $\alpha$ for which $P(Z \le \alpha) = 0.80$, again using at least $100,000$ trials. Compare your result to Table B.1 (or any other $z$-table!).  

(3) Many important distributions can be found by modifying the standard normal one. Two of these are the [chi-squared distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution) (which is the sum of squares of independent standard normal random variables) and the [folded normal distribution](https://en.wikipedia.org/wiki/Folded_normal_distribution), which is the absolute value $|Z|$ of a normal random variable. **Estimate the means and variances** of these two distributions (the linked articles have the computed values if you'd like to compare!).

Note: the absolute value in Python is given by `abs()`.

<h4> Submission instructions: </h4>

When you've finished your notebook, **save and export it as a pdf** (this is an option under the `file` menu). Upload it to Gradescope under the assignment "Weekly Jupyter 4." Do not submit any screenshots of your code.

In [88]:
# Question 1
from random import gauss

trials = 100000
success1 = 0
success2 = 0
success3 = 0
for _ in range(trials):
    z = gauss(0, 1)  # draw once and test the same sample against all three intervals
    if -1 <= z <= 1:
        success1 += 1
    if -2 <= z <= 2:
        success2 += 1
    if -3 <= z <= 3:
        success3 += 1

print(f'Estimated probability: {success1 / trials}')
print(f'Estimated probability: {success2 / trials}')
print(f'Estimated probability: {success3 / trials}')
# The computed data closely matches the rule.
# 68-95-99.7 are the actual values; my results below are about 68.6, 95.4, and 99.7 percent, so they are close to the actual values.

Estimated probability: 0.68569
Estimated probability: 0.9539
Estimated probability: 0.99746
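
For comparison with the estimates above, the three probabilities in the rule can be computed directly from the standard normal CDF. This is a sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the dictionary name `exact` is illustrative.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)

# P(-k <= Z <= k) = Phi(k) - Phi(-k) for k = 1, 2, 3
exact = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, p in exact.items():
    print(f'P(-{k} <= Z <= {k}) = {p:.5f}')
```

These come out to approximately $0.6827$, $0.9545$, and $0.9973$, which is where the rounded "68-95-99.7" figures come from.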


In [15]:
# Question 2
# Estimate a threshold 𝛼 for which 𝑃(𝑍≤𝛼)=0.80,
# again using at least 100,000 trials.
# Compare your result to Table B.1 (or any other 𝑧-table!).
# Approach: estimate P(Z <= alpha) for several candidate values of alpha and look for 0.80.

from random import gauss

trials = 100000
alphas = [0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89]
counts = [0] * len(alphas)
for _ in range(trials):
    for i, alpha in enumerate(alphas):
        # Each candidate gets its own independent draw,
        # so the estimates need not increase steadily with alpha.
        if gauss(0, 1) <= alpha:
            counts[i] += 1
print('if P (Z <= alpha) = 0.80, alpha = 0.83 to 0.89')
for alpha, count in zip(alphas, counts):
    print(f'Estimated probability alpha={alpha}: {count / trials}')
# The estimates are close to the table values.
# An online table gives p = 0.7995 for alpha = 0.84, p = 0.8023 for alpha = 0.85,
# p = 0.8051 for alpha = 0.86, and p = 0.8078 for alpha = 0.87.
# So P(Z <= alpha) = 0.80 for alpha between 0.84 and 0.85.
# The estimates below are close to the actual values when compared.

if P (Z <= alpha) = 0.80, alpha = 0.83 to 0.89
Estimated probability alpha=0.83: 0.79778
Estimated probability alpha=0.84: 0.80066
Estimated probability alpha=0.85: 0.80237
Estimated probability alpha=0.86: 0.80495
Estimated probability alpha=0.87: 0.80888
Estimated probability alpha=0.88: 0.80853
Estimated probability alpha=0.89: 0.81069
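
An alternative to scanning candidate thresholds is to read $\alpha$ straight off the sorted samples: the value below which $80\%$ of the draws fall is an empirical estimate of the threshold. This is a sketch of that idea, not the method used in the cell above; `NormalDist.inv_cdf` (Python 3.8 or later) supplies a reference value, and the variable names are illustrative.

```python
from random import gauss
from statistics import NormalDist

trials = 100000
samples = sorted(gauss(0, 1) for _ in range(trials))

# Empirical 80th percentile: 80% of the sorted samples lie below this index
alpha_estimate = samples[int(0.80 * trials)]

# Reference value from the inverse of the standard normal CDF
alpha_exact = NormalDist(0, 1).inv_cdf(0.80)

print(f'Estimated alpha: {alpha_estimate:.4f}')
print(f'Exact alpha:     {alpha_exact:.4f}')
```

The reference value is approximately $0.8416$, consistent with the $z$-table entries for $0.84$ and $0.85$.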


In [9]:
#Question number 3 correct answer
from random import gauss

samples = [gauss(0,1) for _ in range(100000)]
#print(samples[:10])

squares = [c**2 for c in samples]
folded_samples = [ abs(c) for c in samples]

mean_1 = sum(squares)/100000
mean_2 = sum(folded_samples)/100000
print(f'Computed mean of chi2: {mean_1}')
print(f'Computed mean of fold: {mean_2}')

squared_deviations1 = [(r - mean_1)**2 for r in squares]
variance1 = sum(squared_deviations1) / 100000
print(f'Computed variance of chi2: {variance1}')

squared_deviations2 = [(r - mean_2)**2 for r in folded_samples]
variance2 = sum(squared_deviations2) / 100000
print(f'Computed variance of fold: {variance2}')

# This shows that the estimates are close to the actual values when compared.
# Online, the variances are given as 2 (chi-squared) and about 0.36 (folded normal).
# Online, the means are given as 1 (chi-squared) and about 0.80 (folded normal).
# My estimates are close to these actual values.

Computed mean of chi2: 1.0028454083973817
Computed mean of fold: 0.7993730083760416
Computed variance of chi2: 2.024037619556017
Computed variance of fold: 0.36384820187722516
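
The reference values for this comparison follow from standard closed-form results: a chi-squared distribution with $k$ degrees of freedom has mean $k$ and variance $2k$ (here $k = 1$, since a single $Z^2$ is used), and $|Z|$ for $Z \sim N(0, 1)$ has mean $\sqrt{2/\pi}$ and variance $1 - 2/\pi$. A short sketch evaluating them, with illustrative variable names:

```python
from math import pi, sqrt

# Chi-squared with k = 1 degree of freedom: mean k, variance 2k
k = 1
chi2_mean = k
chi2_variance = 2 * k

# Folded standard normal |Z|: mean sqrt(2/pi), variance 1 - 2/pi
folded_mean = sqrt(2 / pi)
folded_variance = 1 - 2 / pi

print(f'Chi-squared:   mean {chi2_mean}, variance {chi2_variance}')
print(f'Folded normal: mean {folded_mean:.4f}, variance {folded_variance:.4f}')
```

The folded normal values are approximately $0.7979$ and $0.3634$.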


In [137]:
# Question 3 again, for fun, in case the above is wrong:
# sample directly from SciPy's chi-squared and folded normal distributions.
from scipy.stats import chi2, foldnorm

n = 10000

# Chi-squared with df = 1 is the distribution of Z**2 for Z ~ N(0, 1)
df = 1
randoms = chi2.rvs(df, size=n)  # array of n random draws
computed_mean = sum(randoms) / n

# Keep track of the deviations from the mean: subtract and square
squared_deviations = [(r - computed_mean)**2 for r in randoms]
computed_variance = sum(squared_deviations) / n
print(f'Computed mean of chi-squared: {computed_mean}')
print(f'Computed variance of chi-squared: {computed_variance}')

# Second portion: folded normal. Shape parameter c = 0 gives |Z| for Z ~ N(0, 1)
c = 0
randoms2 = foldnorm.rvs(c, size=n)  # array of n random draws
computed_mean2 = sum(randoms2) / n

# Keep track of the deviations from the mean: subtract and square
squared_deviations2 = [(r2 - computed_mean2)**2 for r2 in randoms2]
computed_variance2 = sum(squared_deviations2) / n
print(f'Computed mean of folded normal: {computed_mean2}')
print(f'Computed variance of folded normal: {computed_variance2}')

In [None]:
# They are close to the actual values.