# Question 2

In this question, you'll go over some of the core terms and concepts in statistics.

### Part A

Write a function, `variance`, which computes the variance of a list of numbers.

The function takes one argument: a list or 1D NumPy array of numbers. It returns one floating-point number: the variance of all the numbers.

Recall the formula for variance:

$$
\mathrm{variance} = \frac{1}{N - 1} \sum_{i = 1}^{N} (x_i - \mu_x)^2
$$

where $N$ is the number of numbers in your list, $x_i$ is the number at index $i$ in the list, and $\mu_x$ is the average value of all the $x$ values.
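
As a quick hand check of the formula, consider the three numbers 2, 4, and 6 (an illustrative list, not part of the assignment data). Here $N = 3$ and $\mu_x = 4$, so the right-hand side works out to:

$$
\frac{1}{3 - 1} \left[ (2 - 4)^2 + (4 - 4)^2 + (6 - 4)^2 \right] = \frac{4 + 0 + 4}{2} = 4
$$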

You can use the `numpy.array` and `numpy.mean` functions, but no other NumPy functions, and no built-in Python functions other than `range()`.

In [5]:
import numpy as np
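def variance(numbers):
    """Return the sample variance (N - 1 denominator) of `numbers`."""
    # First pass: accumulate the sum and the count to get the mean.
    count = 0
    total = 0

    for value in numbers:
        total += value
        count += 1

    if count > 1:
        average = total / count

        # Second pass: accumulate squared deviations from the mean.
        var = 0

        for value in numbers:
            var += (value - average) ** 2

        # Divide by N - 1, matching the formula above.
        var = var / (count - 1)

    else:
        # With fewer than two values the N - 1 denominator is zero or
        # negative, so the sample variance is undefined.
        var = None

    return var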

In [6]:
import numpy as np

np.random.seed(5987968)
x = np.random.random(8491)
v = x.var(ddof = 1)
np.testing.assert_allclose(v, variance(x))

In [7]:
np.random.seed(4159)
y = np.random.random(25)
w = y.var(ddof = 1)
np.testing.assert_allclose(w, variance(y))
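
The tests above compare against `x.var(ddof = 1)` rather than plain `x.var()`. The sketch below, on a small made-up array (the values are an assumption chosen for easy arithmetic), shows why: NumPy's default `ddof = 0` divides by $N$, while `ddof = 1` divides by $N - 1$ as in the formula from Part A.

```python
import numpy as np

# Illustrative data (not from the assignment): mean is 5, and the
# squared deviations 9 + 1 + 1 + 9 sum to 20.
data = np.array([2.0, 4.0, 6.0, 8.0])

population_var = data.var()         # default ddof=0: divides by N
sample_var = data.var(ddof=1)       # ddof=1: divides by N - 1

print("ddof=0:", population_var)
print("ddof=1:", sample_var)
```

The two results differ by a factor of $N / (N - 1)$, which matters for small samples and becomes negligible as $N$ grows.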

### Part B

The lecture on statistics mentions latent variables, specifically how you *cannot* know what the underlying process is that's generating your data; all you have is the data, on which you have to impose certain assumptions in order to derive hypotheses about what generated the data in the first place.

To illustrate this, the code provided below generates sample data from distributions whose mean and variance are *typically* not known to you. Put another way, **pretend you cannot see the mean (`loc`) and standard deviation (`scale`, the square root of the variance) in the code that generates these samples; all you usually can see are the data samples themselves.**

You'll use `numpy.mean` and the `variance` function you wrote in Part A to compute these statistics on the sample data itself and observe how they change as the sample size grows.

In the space provided, compute and print the mean and variance of each of the three samples:
 - `sample1`
 - `sample2`
 - `sample3`

You can just `print()` them out in the space provided. **Don't modify anything above where it says "DON'T MODIFY"**.

In [12]:
import numpy as np
np.random.seed(5735636)

sample1 = np.random.normal(loc = 10, scale = 5, size = 10)
sample2 = np.random.normal(loc = 10, scale = 5, size = 1000)
sample3 = np.random.normal(loc = 10, scale = 5, size = 1000000)

#########################
# DON'T MODIFY ANYTHING #
#   ABOVE THIS BLOCK    #
#########################

### BEGIN SOLUTION
print("Variance of Sample 1 =", variance(sample1))
print("Mean of Sample 1 =", np.mean(sample1))
print("Variance of Sample 2 =", variance(sample2))
print("Mean of Sample 2 =", np.mean(sample2))
print("Variance of Sample 3 =", variance(sample3))
print("Mean of Sample 3 =", np.mean(sample3))
### END SOLUTION

Variance of Sample 1 = 40.227985133290126
Mean of Sample 1 = 9.225549723298691
Variance of Sample 2 = 25.268584205085652
Mean of Sample 2 = 9.799833697772822
Variance of Sample 3 = 25.009857369536363
Mean of Sample 3 = 10.000880335689784
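
The output above suggests that the estimates move toward the generating parameters as the sample grows: the mean approaches `loc = 10`, and the variance approaches `scale ** 2 = 25`. The sketch below is one way to make that trend explicit by printing the absolute error at each sample size. It is an illustration under assumptions: it uses its own arbitrary seed (so the numbers will not match the output above), and it uses `ndarray.var(ddof=1)` in place of the Part A `variance` function to stay self-contained.

```python
import numpy as np

# Arbitrary seed for this sketch (an assumption, not the notebook's seed).
rng = np.random.default_rng(0)

true_mean = 10
true_std = 5  # the variance is true_std ** 2 = 25

results = []
for size in (10, 1000, 1000000):
    sample = rng.normal(loc=true_mean, scale=true_std, size=size)
    # Absolute error of each estimate relative to the generating parameter.
    mean_error = abs(sample.mean() - true_mean)
    var_error = abs(sample.var(ddof=1) - true_std ** 2)
    results.append((size, mean_error, var_error))
    print(size, mean_error, var_error)
```

Any single small sample can land close to the true values by chance, so the errors are not guaranteed to shrink at every step; the typical error, however, falls roughly in proportion to $1 / \sqrt{N}$.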
