## Question 1
Question 1 asks us to explore numerical methods for calculating the standard deviation and the errors that arise from each. We compute the standard deviation using two different methods, applied both to data from a professor at UCLA and to normal distributions generated with the NumPy package.

## Part a: Pseudocode to evaluate two different methods for estimating standard deviation
1. Import relevant modules 
2. Calculate the standard deviation using the NumPy package; this is designated the "true" value
3. Calculate the standard deviation using the equations 
\begin{equation}
\bar{x} = \frac{1}{n}\sum_{i = 1}^{n} x_i \qquad   (1)
\end{equation}

\begin{equation}
  \sigma = \sqrt{\frac{1}{n-1}\sum_{i = 1}^{n} ((x_i - \bar{x})^2)}  \qquad   (2)
\end{equation}
4. This is done by first calculating the length of the data array to find $n$
5. Then the sum of the data is initialized and computed in a for loop
6. The average is calculated using equation (1)
7. The sum of the squared deviations is initialized and computed in a separate for loop using the average
8. These values are then used in equation (2) to find $\sigma$
9. Calculate the standard deviation using the other equation
\begin{equation}
  \sigma = \sqrt{\frac{1}{n-1}\left(\sum_{i = 1}^{n} x_i^2 - n\bar{x}^2\right)}  \qquad   (3)
\end{equation}
10. This is done by first calculating the length of the data array to find $n$
11. Then the sum of the data and the sum of the squared data are initialized and computed in a single for loop
12. The average is computed using equation (1)
13. All of these values are then used in equation (3) to find $\sigma$
14. Compare the values from steps 3 and 9 to the value from step 2, using the relative error $\frac{x-y}{y}$ for some true value $y$
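
Equations (2) and (3) are algebraically equivalent. As a short check, expanding the square and using $\sum_{i = 1}^{n} x_i = n\bar{x}$ from equation (1) gives

\begin{equation}
\begin{aligned}
\sum_{i = 1}^{n} (x_i - \bar{x})^2 &= \sum_{i = 1}^{n} x_i^2 - 2\bar{x}\sum_{i = 1}^{n} x_i + n\bar{x}^2 \\
&= \sum_{i = 1}^{n} x_i^2 - n\bar{x}^2
\end{aligned}
\end{equation}

so any difference between the two methods comes from how round-off enters the arithmetic, not from the exact result.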

## Part b: Implementation of Pseudocode from Part a and evaluation of relative error to numpy.std method

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import time
#Imported packages
#Using Hungarian Notation

In [3]:
a_light_data = np.loadtxt('cdata.txt')      # gathered data
f_std = np.std(a_light_data) # our 'truth' value for the standard deviation of the dataset; note np.std defaults to ddof=0 (divides by n, not n-1)

In [4]:
def first_sigma(a_data):
    i_n_length = len(a_data)              # n in the formulas
    f_sum = 0
    f_std_sum = 0
    for i in range(i_n_length):                 # computing the sum as a forloop, basically np.sum
        f_sum += a_data[i] 
    f_avg =  f_sum/i_n_length                   # basically np.mean. Following formula 1 for calculating mean
   
    for i in range(i_n_length):                 # sum required in formula 2 for standard deviation
        f_std_sum += (a_data[i] - f_avg)**2
    f_first_sigma = np.sqrt(1/(i_n_length-1)*f_std_sum)   # calculating standard deviation
    return f_first_sigma


def second_sigma(a_data):
    i_n_length = len(a_data)             # n in the formulas
    f_sum = 0
    f_sum_squared = 0
    for i in range(i_n_length): 
        f_sum += a_data[i]                # computing the sum as a forloop, basically np.sum
        f_sum_squared += a_data[i]**2 
    f_avg = f_sum/i_n_length                # basically np.mean. Following formula 1 for calculating mean
    f_variance = (1/(i_n_length-1))*(f_sum_squared-i_n_length*f_avg**2)
    if f_variance < 0: # round-off can make this negative; np.sqrt would then return nan rather than raise an exception
        print('Warning: encountered square-root of a negative number.')
    f_sec_sigma = np.sqrt(f_variance)
    return f_sec_sigma

def RelativeErr(accepted, measured):
    return (measured-accepted)/accepted

In [5]:
f_first_sigma = first_sigma(a_light_data)
f_first_error = RelativeErr(f_std, f_first_sigma)
print('first error is ', f_first_error)

first error is  0.005037815259211947


In [6]:
f_sec_sigma = second_sigma(a_light_data)
f_sec_error = RelativeErr(f_std, f_sec_sigma)
print('second error is ', f_sec_error)

second error is  0.005037819018453159


In [7]:
f_diff = f_first_error - f_sec_error #difference in errors to compare size
print('The difference in the relative errors (first method error minus second method error) is: {0:.2E}.'.format(f_diff))

The difference in the relative errors (first method error minus second method error) is: -3.76E-09.


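The above print statement shows that the second error is larger than the first by a difference of order $10^{-9}$, which is negligible in any practical statistical analysis involving the standard deviation. Both methods share a relative error of about $5\times10^{-3}$; this common offset is consistent with `np.std` dividing by $n$ by default while equations (2) and (3) divide by $n-1$, which gives a ratio of $\sqrt{n/(n-1)} \approx 1.0050378$ for a dataset of 100 values.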
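
The effect of the divisor convention can be checked numerically. The following is a minimal sketch on hypothetical synthetic data (the array `a_demo` is an assumption and stands in for `cdata.txt`, which is not used here). `np.std` divides by $n$ unless `ddof=1` is passed, so the ratio of the two results depends only on $n$.

```python
import numpy as np

# Hypothetical stand-in data; the notebook's cdata.txt is not used here.
rng = np.random.default_rng(0)
a_demo = rng.normal(0.0, 1.0, 100)
i_n = len(a_demo)

f_std_population = np.std(a_demo)        # default ddof=0: divides by n
f_std_sample = np.std(a_demo, ddof=1)    # ddof=1: divides by n-1, as in equation (2)

# The ratio of the two depends only on n, not on the data values.
f_ratio = f_std_sample / f_std_population
f_expected_ratio = np.sqrt(i_n / (i_n - 1))
print(f_ratio - 1, f_expected_ratio - 1)
```

Passing `ddof=1` when computing the reference value would remove this offset and leave only the numerical differences between the methods.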

## Part c: Evaluate these methods on data sampled randomly from a Gaussian distribution

In [8]:
# Part C
f_test_std = 1. # the standard deviation given in the lab manual 
# defining the two normal distributions as given in the manual
a_seq_1 = np.random.normal(0.,f_test_std,2000)
a_seq_2 = np.random.normal(1e7,f_test_std,2000)

In [9]:
f_seq_firstsig_1 = first_sigma(a_seq_1)   # first method of standard deviation for first normal distribution
f_seq_secsig_1 = second_sigma(a_seq_1)    # second method of standard deviation for first normal distribution
f_first_error_1 = RelativeErr(f_test_std, f_seq_firstsig_1)  # error associated with first method 
f_sec_error_1 = RelativeErr(f_test_std, f_seq_secsig_1)      # error associated with second method 
print('first error is ',f_first_error_1)
print('second error is ',f_sec_error_1)
f_diff_normal_1 = f_first_error_1 - f_sec_error_1
print('The difference in the relative errors (first method error minus second method error) is: {0:.2E}.'.format(f_diff_normal_1))

first error is  0.008977750126145434
second error is  0.008977750126144768
The difference in the relative errors (first method error minus second method error) is: 6.66E-16.


In this case, the first method produces a higher error than the second by a difference of order $10^{-16}$, which is at the level of double-precision round-off. Again, this difference is negligible in any practical setting.
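
The relative errors in this part are measured against the population value of 1, so they include ordinary sampling fluctuation as well as numerical error. As a rough sketch (the repeat count of 500 and the variable names are assumptions), the scatter of the sample standard deviation for $n = 2000$ draws can be compared with the large-$n$ approximation $\sigma/\sqrt{2(n-1)}$:

```python
import numpy as np

rng = np.random.default_rng(2)
i_n = 2000
f_true_sigma = 1.0

# Draw many independent samples and record each sample standard deviation.
a_sample_stds = np.array([np.std(rng.normal(0.0, f_true_sigma, i_n), ddof=1)
                          for _ in range(500)])

f_scatter = np.std(a_sample_stds, ddof=1)               # observed spread of the estimates
f_predicted = f_true_sigma / np.sqrt(2 * (i_n - 1))     # large-n approximation, about 0.0158
print(f_scatter, f_predicted)
```

A relative error of around one percent against the population value is therefore expected from sampling alone, which is why the difference between the two methods is the more informative quantity.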

In [10]:
f_seq_firstsig_2 = first_sigma(a_seq_2) # first method of standard deviation for second normal distribution
f_seq_secsig_2 = second_sigma(a_seq_2)  # second method of standard deviation for second normal distribution
f_first_error_2 = RelativeErr(f_test_std, f_seq_firstsig_2)  # error associated with first method 
f_sec_error_2 = RelativeErr(f_test_std, f_seq_secsig_2)      # error associated with second method 
print('first error is ',f_first_error_2)
print('second error is ',f_sec_error_2)
f_diff_normal_2 = f_first_error_2 - f_sec_error_2
print('The difference in the relative errors (first method error minus second method error) is: {0:.2E}.'.format(f_diff_normal_2))
f_ratio_2 = abs(f_sec_error_2*100/f_first_error_2)
print('The second method produces a relative error {0:.3f} % the value of the first.'.format(f_ratio_2))

first error is  -0.034097823709330566
second error is  -0.07890054168615812
The difference in the relative errors (first method error minus second method error) is: 4.48E-02.
The second method produces a relative error 231.395 % the value of the first.


Clearly, the second method produces a much greater error than the first (about 2.3 times larger in magnitude in this run) when the mean of the random Gaussian sample is large. The same would hold for 'real' measurements of a continuous random variable whose values scatter narrowly around a large constant. We observe this stark difference here, and not in the earlier cases with comparatively small means, because of round-off in floating-point arithmetic: $\sum_{i = 1}^{n} x_i^2 \text{ and } n\bar{x}^2$ are each of order $2\times10^{17}$, while their difference is only of order $2\times10^{3}$. Double precision carries roughly 16 significant digits, so the rounding errors that accumulate at each step of the for-loop are no longer small compared with the difference, and most of the significant digits cancel in the subtraction.
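
The magnitudes involved can be checked directly. This is a rough sketch using the nominal values from Part c ($n = 2000$, mean $10^7$, $\sigma = 1$) rather than the actual random sample; the variable names are assumptions.

```python
import numpy as np

i_n = 2000
f_mean = 1e7
f_sigma = 1.0

# Approximate size of each term in equation (3) for data centred on 1e7.
f_sum_squared_scale = i_n * f_mean**2          # about 2e17
f_difference_scale = (i_n - 1) * f_sigma**2    # about 2e3, the quantity we actually want

# Gap between adjacent double-precision numbers near 2e17.
f_gap = np.spacing(f_sum_squared_scale)

# Adding 1 to a number of this size is lost entirely to rounding.
b_absorbed = (f_sum_squared_scale + 1.0) == f_sum_squared_scale

print(f_sum_squared_scale, f_difference_scale, f_gap, b_absorbed)
```

Each addition into the running sum of squares can be off by up to half of that gap, and roughly 2000 such roundings accumulate before the subtraction, so an error of several percent in a difference of about 2000 is plausible.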
## Question 1c
I postulate that shifting the dataset down by a value comparable to its mean would remove this large numerical error in method 2. We cannot simply subtract the mean itself, because computing it would require an additional loop and would defeat the purpose of this single-pass method. The value could instead be chosen as the mid-point between the largest and smallest values in the dataset (note that finding the minimum and maximum also requires scanning the data). In general this is of the same order as the mean, and it is better than shifting by the largest value, which may lie far from the smallest value (depending on the size of the standard deviation) and would leave the shifted data less well centred about zero. In other words, subtract the mid-point (sometimes referred to as the mid-range) of the dataset from each element of the data array, thus centering the Gaussian closer to zero. This in no way affects the standard deviation of the set: the standard deviation of a dataset sampled from a Gaussian distribution is effectively the 'width' of the Gaussian curve about its mean, and shifting the mean preserves this width.
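
A strictly single-pass variant of the same idea can be sketched as follows. It shifts by the first data point instead of the mid-point, which avoids any extra scan of the data. The function name `second_sigma_shifted`, the choice of shift, and the synthetic array `a_demo` are assumptions of this sketch, not part of the assignment.

```python
import numpy as np

def second_sigma_shifted(a_data):
    """Single-pass sample standard deviation with the data shifted by its first element."""
    i_n_length = len(a_data)
    f_shift = a_data[0]      # assumption: any value close to the data serves as a shift
    f_sum = 0.0
    f_sum_squared = 0.0
    for i in range(i_n_length):
        f_dev = a_data[i] - f_shift     # shifted value, small in magnitude
        f_sum += f_dev
        f_sum_squared += f_dev**2
    # n * mean**2 of the shifted data equals f_sum**2 / n
    f_variance = (f_sum_squared - f_sum**2 / i_n_length) / (i_n_length - 1)
    return np.sqrt(f_variance)

# Hypothetical test data with a large mean, similar in shape to Part c.
rng = np.random.default_rng(1)
a_demo = rng.normal(1e7, 1.0, 2000)
f_shifted = second_sigma_shifted(a_demo)
f_reference = np.std(a_demo, ddof=1)
print(f_shifted, f_reference)
```

Because the shifted values are of order 1, the two accumulated sums stay small and the subtraction no longer cancels most of the significant digits.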

In [14]:
def second_sigma_v2(a_data): # a new and improved version of method 2 of standard deviation calculation with data shifting to correct numerical errors
    i_n_length = len(a_data)             # n in the formulas
    f_sum = 0
    f_sum_squared = 0
    a_data = a_data - (min(a_data)+max(a_data))/2 #THIS IS THE FIX: shift the data
    for i in range(i_n_length): 
        f_sum += a_data[i]                # computing the sum as a forloop, basically np.sum
        f_sum_squared += a_data[i]**2 
    f_avg = f_sum/i_n_length                # basically np.mean. Following formula 1 for calculating mean
    f_variance = (1/(i_n_length-1))*(f_sum_squared-i_n_length*f_avg**2)
    if f_variance < 0: # round-off can make this negative; np.sqrt would then return nan rather than raise an exception
        print('Warning: encountered square-root of a negative number.')
    f_sec_sigma = np.sqrt(f_variance)
    return f_sec_sigma

f_seq_secsig_2_new = second_sigma_v2(a_seq_2)
f_sec_error_2_new = RelativeErr(f_test_std, f_seq_secsig_2_new)
print(f_seq_secsig_2_new,' corrected sigma')
print(f_seq_secsig_2, 'original sigma with larger error')
print('The relative error for the improved second method with data shifting is {0}.'.format(f_sec_error_2_new))
print('The relative error for the first method is {0}.'.format(f_first_error_2))
f_diff_new = f_first_error_2 - f_sec_error_2_new
print('The difference in the relative errors (first method error minus second method error) is: {0:.2E}.'.format(f_diff_new))


0.9659021762906691  corrected sigma
0.9210994583138419 original sigma with larger error
The relative error for the improved second method with data shifting is -0.0340978237093309.
The relative error for the first method is -0.034097823709330566.
The difference in the relative errors (first method error minus second method error) is: 3.33E-16.


This difference in relative error between the first and second methods is of order $10^{-16}$, as small as in the earlier zero-mean case where the difference was deemed negligible. This means our proposed 'fix' for the numerical error indeed worked!