
### Elements of Statistical Learning, Week 1

* Read Chapters 1 to 5 of Mathematical Statistics with Resampling and R.
* Summarize each chapter in your own words. Use ChatGPT to polish your work. Save your ChatGPT session.
* Read Chapter 1 of Introduction to Statistical Learning.
* As an appendix to this assignment, write out the first 12 fundamental proofs of the Foundation of Statistical Software.



## **Problem 1 (30 points)**
In the week 1 lecture, we developed a simple simulation to examine the mean of a sample as a random variable. Specifically, we repeatedly drew samples of size $N = 20$ from the same underlying normal distribution. To observe how the sample mean fluctuated from one experiment to the next, we created a histogram of the obtained mean values.

In this problem, we will:
1. **Characterize the Distribution**: Analyze the distribution of the sample means by calculating its standard deviation.
2. **Examine Spread with Increasing Sample Size**: Observe how the spread of the distribution decreases with increasing sample size. This aligns with the intuitive notion that larger samples tend to have means closer to the true mean of the underlying population from which the samples are drawn.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Different sample sizes we are going to try
sample_sizes = [3, 10, 50, 100, 500, 1000]

# We will use the list below to keep the SD of the distribution of the means at each given sample size
# (note that it's ok to initialize with an empty list - it will dynamically grow as we append results)
mean_sds = []

for N in sample_sizes:  # Try different sample sizes

    # Insert your code here (refer to the slides if needed).
    # 1) At each given N (i.e., in each iteration of the outer loop), you have to draw a large number
    #    (e.g., 1000) of samples of size N. Calculate the mean of each of those samples and save them all into
    #    a list called m.

    # 2) Now, with the list m in hand, we want to characterize how much the sample mean fluctuates
    #    from one experiment (experiment = taking a sample of N measurements) to the next. Instead of just
    #    drawing a histogram, this time we will calculate the standard deviation of the distribution
    #    represented by the list m. Use np.std().

    # 3) Save the result (SD of the distribution of the means for the current N) into the list mean_sds.
    #    You can use the append() method to add the new SEM value to the list.

    # One possible way to fill in steps 1-3, assuming a standard normal population (mean 0, SD 1),
    # which is what the theoretical 1/sqrt(N) curve plotted below presumes:
    m = [np.mean(np.random.normal(size=N)) for _ in range(1000)]
    mean_sds.append(np.std(m))


# At this point, you should have the list mean_sds filled. It should have length 6 and keep the values of
# the standard deviation of the mean (known as the standard error of the mean, SEM) at different sample sizes.
# (mean_sds[0] is the SEM at N=3, mean_sds[1] is the SEM at N=10, and so on)

# Let us now PLOT the SEM (i.e., the "typical" error we expect the sample mean to exhibit in any
# given experiment) as a function of the sample size, N.

plt.figure(figsize=(10, 6))
plt.plot(sample_sizes, mean_sds, 'o-', label='SEM')
plt.plot(sample_sizes, 1/np.sqrt(sample_sizes), 'b--', label='Theoretical SEM (1/√N)')
plt.xlabel('Sample Size')
plt.ylabel('SEM')
plt.title('SEM vs Sample Size')
plt.legend()
plt.grid(True)
plt.show()

### Explanation:

- **Sample Sizes**: We define a list of different sample sizes to test.

- **Mean SDs**: We initialize an empty list `mean_sds` to store the standard deviation of sample means for each sample size.

- **Loop**: For each sample size $N$, we draw many samples of size $N$, compute the mean of each sample, and store the standard deviation of those means (the SEM).

- **Plotting**: Finally, we plot the SEM as a function of sample size and compare it with the theoretical SEM curve $\frac{1}{\sqrt{N}}$.
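The steps listed above can also be sketched in vectorized form. This is only an illustrative sketch, not the assignment's reference solution: the function name `simulate_sem`, its parameters, and the fixed seed are hypothetical choices, and it assumes a normal population with mean 0 and standard deviation `sigma`.

```python
import numpy as np

def simulate_sem(sample_sizes, n_samples=1000, sigma=1.0, seed=0):
    """Return the SD of the sample means (the simulated SEM) for each sample size."""
    rng = np.random.default_rng(seed)
    sems = []
    for n in sample_sizes:
        # Each row is one experiment: n draws from a normal population with SD sigma
        samples = rng.normal(loc=0.0, scale=sigma, size=(n_samples, n))
        # Row means give n_samples sample means; their SD is the simulated SEM
        sems.append(samples.mean(axis=1).std())
    return np.array(sems)

sample_sizes = [3, 10, 50, 100, 500, 1000]
simulated = simulate_sem(sample_sizes)
theoretical = 1.0 / np.sqrt(sample_sizes)  # sigma / sqrt(N) with sigma = 1

for n, s, t in zip(sample_sizes, simulated, theoretical):
    print(f"N={n:5d}  simulated SEM={s:.4f}  sigma/sqrt(N)={t:.4f}")
```

With 1000 samples per sample size, the simulated values should track the theoretical curve to within a few percent, though the exact numbers depend on the seed.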


## Sampling and Observing Distribution Changes

All you need to do is provide the missing pieces of code. Remember that the sampling functions provided in Python do just what we need. For instance, `np.random.uniform(size=3)` draws 3 values independently from a uniform distribution. But that is exactly what measuring 3 i.i.d. uniformly distributed random variables is! So, to sample our N variables $X_1, \ldots, X_N$ in each experiment, we just need to call the sampling function with N as an argument (plus whatever other arguments that specific DISTR function requires). Do NOT use `np.random.uniform()` though; it is too dull. Use something very different from the uniform distribution. The exponential distribution (`np.random.exponential`) and the normal distribution (`np.random.normal`) are good candidates (see the help page for the distribution function you choose to find out which parameters it requires).

The code above uses N=1. In this case $S=X_1$ and obviously S is the same "process" as $X_1$ itself. So the histogram will in fact show you the distribution you have chosen for X. Change N and rerun the code a few times. See how the distribution of S (the histogram we draw) changes for N=2, N=5, ... Can you see how the distribution quickly becomes normal even though the distribution we are drawing with (the one you have seen at N=1) can be very different from normal?

### Submission for this problem:

1. **Python code** with missing pieces filled in, generating histogram plots at a few different N of your choosing:
   - For N=1 (i.e., the distribution you choose to sample from).
   - For N large enough that the distribution of S in the histogram looks very "normal".
   - For some intermediate N, such that the distribution of S has already visibly departed from the N=1 case but is still clearly non-normal.

2. **Histogram plots** showing the distributions for the above cases.

You can submit separate files or a combined document that includes both the code and the figures.
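Besides eyeballing histograms, one way to put a number on how "normal" the distribution of S looks is its sample skewness, which is 0 for a normal distribution. The sketch below is an illustration rather than part of the required submission: the helper `skewness_of_sum`, the choice of the exponential distribution with scale 1, and the number of experiments are all assumptions made here. For a sum of N such exponential variables the theoretical skewness is $2/\sqrt{N}$.

```python
import numpy as np

def skewness_of_sum(n_terms, n_exp=20000, seed=0):
    """Sample skewness of S = X_1 + ... + X_N for i.i.d. exponential X_i (scale 1)."""
    rng = np.random.default_rng(seed)
    # Each row holds the n_terms draws of one experiment; summing a row gives one value of S
    s = rng.exponential(scale=1.0, size=(n_exp, n_terms)).sum(axis=1)
    z = (s - s.mean()) / s.std()  # standardize the sums
    return np.mean(z ** 3)        # third standardized moment

for n in (1, 5, 50):
    print(f"N={n:3d}  sample skewness={skewness_of_sum(n):.3f}  2/sqrt(N)={2 / np.sqrt(n):.3f}")
```

The skewness shrinking toward 0 as N grows is the numerical counterpart of the histogram becoming more symmetric and bell-shaped.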


## **Problem 2 (30 points)**

There is a beautiful fact in statistics called the Central Limit Theorem (CLT). It states that the distribution of a sum of $N$ independent, identically distributed (i.i.d.) random variables $X_i$ approaches a normal distribution in the limit of large $N$, regardless of the distribution of the variables $X_i$ (under some very mild conditions, strictly speaking).

**Here is what it means in plain English:**

Suppose we have a distribution (and thus a random variable, since a random variable is a distribution: drawing a value from the distribution is what "measuring" a random variable amounts to!). Let's draw a value from that distribution, $x_1$. Then, let us draw another value $x_2$ from the same distribution, independently, i.e., without any regard to the value(s) we have drawn previously. Continue until we have drawn $N$ values $x_1, \ldots, x_N$.

Let us now calculate the sum $s = x_1 + \ldots + x_N$ and call this an "experiment".

Clearly, $s$ is a realization of some random variable.
If we repeat the experiment (i.e., draw $N$ random values from the distribution again), we will get a completely new realization $x_1, \ldots, x_N$, and the sum will thus take a new value too!

Using our notation, we can also describe the situation outlined above as
$$S = X_1 + X_2 + \ldots + X_N \quad (X_i \text{ i.i.d.})$$

The fact stated by this equation, that the random variable $S$ is the "sum of random variables," is just what we discussed above: The "process" $S$ is defined as measuring $N$ processes which are "independent and identically distributed" (i.e., drawn from the same distribution) and summing up the results.

We cannot predict what the sum is going to be until we do the actual measuring of $X_1, \ldots, X_N$, so $S$ is a random variable indeed!
It has some distribution associated with it (some values of this sum are more likely than others), and what the CLT tells us is that at large $N$, this distribution is bound to be (approximately) normal.
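
As a worked supplement to the statement above (the symbols $\mu$ and $\sigma^2$ are introduced here for illustration and are not part of the original text): if each $X_i$ has mean $\mu$ and finite variance $\sigma^2$, then linearity of expectation and independence give

$$E[S] = N\mu, \qquad \operatorname{Var}(S) = N\sigma^2,$$

and the precise form of the CLT is a statement about the standardized sum:

$$\frac{S - N\mu}{\sigma\sqrt{N}} \;\xrightarrow{d}\; \mathcal{N}(0, 1) \quad \text{as } N \to \infty.$$

In other words, for large $N$ the histogram of $S$ should look approximately like a normal curve centered at $N\mu$ with standard deviation $\sigma\sqrt{N}$.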

Here is initial code you will have to complete:


# Homework: Summing i.i.d. Random Variables

## Task Description

In this assignment, you will implement a simulation to understand the concept of summing i.i.d. random variables across multiple experiments. Follow the instructions below to complete the exercise.

### Instructions

1. **Define Parameters:**
   - `N`: the number of i.i.d. variables $X$ we are going to sum.
   - `n_exp`: how many times we are going to repeat the "experiment" (see the text above for what we call an experiment).

2. **Simulation Steps:**

   - Initialize a vector to store the sum of i.i.d. variables in each experiment.
     ```python
     import numpy as np
     import matplotlib.pyplot as plt

     N = 1  # the number of i.i.d. variables X we are going to sum

     # How many times we are going to repeat the "experiment" (see the text above for what we call an experiment):
     n_exp = 1000

     # Initialize an empty list to store the value of the sum in each experiment
     s_values = []
     ```
   
   - Repeat the experiment!
     ```python
     # Each pass through the loop is one experiment:
     for _ in range(n_exp):
         # Step 1: Draw the values x1, ..., xN of the random variables we are going to sum up.
         # Replace DISTR with a NumPy sampling function and "..." with its parameters.
         # Note: NumPy samplers take the number of draws as the keyword argument size,
         # e.g. np.random.normal(size=N) draws N standard normal values.
         x = np.random.DISTR(N, ...)  # placeholder: this line requires the student's input
         
         # Step 2: Calculate the sum of x1, ..., xN
         s = np.sum(x)
         
         # Step 3: Save the sum into the vector s_values:
         s_values.append(s)
     ```
     
   - We repeated the experiment `n_exp` times, so we have `n_exp` values sampled from the process $S$. This should be plenty for looking at their distribution:
     ```python
     # Draw a histogram of the n_exp sums stored in s_values
     plt.hist(s_values, bins=30, density=True, alpha=0.75)
     plt.xlabel('Sum of i.i.d. Variables')
     plt.ylabel('Density')
     plt.title('Distribution of Sum of i.i.d. Variables')
     plt.grid(True)
     plt.show()
     ```
     
### Requirements

- Students are required to provide answers at 3 specific steps in the code:
  1. `x = DISTR(N, ...)`
  2. `...???...`
  3. `...DRAW histogram of n_exp...`
  
- Replace `DISTR(N, ...)` with the appropriate distribution function and parameters.
- Fill in the code to compute the sum of `x1, ..., xN`.
- Fill in the code to draw the histogram of the `n_exp` sums stored in `s_values`.

### Submission

Submit your completed Python script with the filled-in code for evaluation.
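
For reference, the three steps of the template can be assembled into one runnable sketch. This is one possible completion under stated assumptions, not the expected answer: it picks the exponential distribution with scale 1, and the names `simulate_sums` and `plot_sums`, the seed, and the default values of N are hypothetical.

```python
import numpy as np

def simulate_sums(n_terms, n_exp=1000, seed=None):
    """Repeat the experiment n_exp times; return the n_exp sums of n_terms i.i.d. draws."""
    rng = np.random.default_rng(seed)
    s_values = []
    for _ in range(n_exp):
        # Step 1: draw x1, ..., xN (assumed here: exponential with scale 1)
        x = rng.exponential(scale=1.0, size=n_terms)
        # Step 2: compute the sum of x1, ..., xN
        s = np.sum(x)
        # Step 3: save the sum for this experiment
        s_values.append(s)
    return np.array(s_values)

def plot_sums(n_values=(1, 5, 50), n_exp=1000, seed=0):
    """Draw one histogram of S per value of N, side by side."""
    import matplotlib.pyplot as plt  # imported here so the simulation itself needs only NumPy
    fig, axes = plt.subplots(1, len(n_values), figsize=(4 * len(n_values), 3.5))
    for ax, n in zip(np.atleast_1d(axes), n_values):
        ax.hist(simulate_sums(n, n_exp, seed), bins=30, density=True, alpha=0.75)
        ax.set_title(f"N = {n}")
        ax.set_xlabel("Sum of i.i.d. variables")
        ax.set_ylabel("Density")
    fig.tight_layout()
    plt.show()

sums = simulate_sums(n_terms=50, n_exp=1000, seed=0)
print(f"mean of S = {sums.mean():.2f}, SD of S = {sums.std():.2f}")
```

Calling `plot_sums()` displays the histograms for N = 1, 5, and 50, covering the source distribution, an intermediate case, and a near-normal case.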



All you need to do is provide the missing pieces of code highlighted in red. Remember that the sampling functions provided in Python do just what we need. For instance, `np.random.normal(size=3)` will draw 3 values, independently, from the same normal distribution (with default mean=0 and sd=1 in this particular example). But that is exactly what sampling 3 i.i.d. normally distributed random variables is!

Therefore, to sample our N variables $X_1, \ldots, X_N$ in each experiment in Python, simply call the sampling function with N as an argument, along with any other required parameters specific to the chosen distribution function. Avoid using `np.random.normal()`, as it is too commonplace. Instead, consider using something distinctly different, such as the uniform distribution (`np.random.uniform()`) or the exponential distribution (`np.random.exponential()`). Check the help pages for the chosen distribution function to understand what parameters it may need.

The code above uses N=1. In this case $S=X_1$ and obviously S is the same "process" as $X_1$ itself. So the histogram will in fact show you the distribution you have chosen for X. Change N and rerun the code a few times. See how the distribution of S (the histogram we draw) changes for $N=2, N=5, \ldots$

Can you see how the distribution quickly becomes normal even though the distribution we are drawing from (the one you have seen at N=1) can be very different from normal?

Submission for this problem: the code with missing pieces filled in, plus histogram plots generated at a few different N of your choosing. For instance, use N=1 (i.e., the distribution you choose to sample from), an N large enough that the distribution of S in the histogram looks very "normal", and some intermediate N, such that the distribution of S has already visibly departed from the N=1 case but is still clearly non-normal.

You can submit separate files or a Word/PDF document that combines both the code and the figures.

Lastly, **for full credit** you should answer the following question (5 points): suppose you have an arbitrary distribution and take a sample of N measurements from it. You calculate the mean of your sample. As we discussed, the sample mean is of course a random variable. How is the sample mean distributed when N becomes large? HINT: look at the definition of the sample mean!
