# Machine Learning and Statistics - Tasks
Assignment Tasks for Machine Learning and Statistics, GMIT 2020

Lecturer: Dr Ian McLoughlin


>Author: **Andrzej Kocielski**  
>Github: [andkoc001](https://github.com/andkoc001/)  
>Email: G00376291@gmit.ie, and.koc001@gmail.com

Created: 16-11-2020

This Notebook should be read in conjunction with the corresponding `README.md` file at the project [repository](https://github.com/andkoc001/Machine-Learning-and-Statistics.git) at GitHub.

___
## Task 3 - Standard Deviation


### Objectives
__Simulate the Excel functions `STDEV.S` and `STDEV.P` using NumPy and explain the advantages of the former.__


### Standard deviation

_Standard deviation_ (SD) is a statistical concept, with a wide range of applications, that measures how the data is spread out around the mean. [Dictionary.com](https://www.dictionary.com/browse/standard-deviation) defines it as "a measure of dispersion in a frequency distribution, equal to the square root of the mean of the squares of the deviations from the arithmetic mean of the distribution."

In other words, the standard deviation is the square root of the average of the squared differences from the mean ([Mathsisfun.com](https://www.mathsisfun.com/data/standard-deviation.html)).

![Standard Deviation](https://upload.wikimedia.org/wikipedia/commons/f/f9/Comparison_standard_deviations.svg) Image source: [Wikipedia](https://simple.wikipedia.org/wiki/File:Comparison_standard_deviations.svg)

  

### Population and sample SD

There are two main methods of calculating the standard deviation: one refers to the entire population, and the other treats the data set as a sample of the population. For simplicity, only discrete values are considered in this notebook.

The **standard deviation of population** ($\sigma$) is a measure that can be calculated exactly if the values of the variable are known for every member of the population; it corresponds to the deviation of a random variable whose distribution is identical to the distribution in the population. When the same formula is applied to a sample, the result is often referred to as the uncorrected standard deviation.

The formula for the population standard deviation ([Mathsisfun.com](https://www.mathsisfun.com/data/standard-deviation-formulas.html)):
$$
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}
$$
where:  
$N$ is the size of the population,  
$x_i$ represents the observed value of the i-th member,  
$\mu$ denotes the population mean.

The **standard deviation of sample** ($s$) is a measure that estimates the standard deviation of a population based on the knowledge of only some of its members, i.e. a random sample. For practical reasons this method is often the only viable option. This kind of standard deviation is often referred to as corrected, because dividing by $N-1$ rather than $N$ (Bessel's correction) compensates for the bias introduced by using the sample mean in place of the unknown population mean.

The formula for the sample standard deviation ([Mathsisfun.com](https://www.mathsisfun.com/data/standard-deviation-formulas.html)):
$$
s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2}
$$
where $N$ is now the size of the sample and $\bar{x}$ denotes the sample mean.
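As a quick numerical check of the two formulas, the sketch below applies them to a small made-up data set (the values are purely illustrative and are not part of the assignment) and compares the results with the standard-library functions `statistics.pstdev` and `statistics.stdev`.

```python
import math
import statistics

# illustrative data only (an assumption, not taken from this notebook)
data = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(data)
mean = sum(data) / N

# sum of squared deviations from the mean, shared by both formulas
ss = sum((x - mean) ** 2 for x in data)

sigma = math.sqrt(ss / N)      # population formula, divisor N
s = math.sqrt(ss / (N - 1))    # sample formula, divisor N - 1

print(f"population SD: {sigma:.4f}")  # population SD: 2.0000
print(f"sample SD:     {s:.4f}")      # sample SD:     2.1381

# cross-check against Python's standard library
assert math.isclose(sigma, statistics.pstdev(data))
assert math.isclose(s, statistics.stdev(data))
```

The only difference between the two results is the divisor, so the sample value is always the larger of the two for the same data.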

Microsoft Excel's functions `STDEV.S` and `STDEV.P` calculate the standard deviation of a **sample** and of a **population** respectively.
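One possible way to mirror the two Excel functions in NumPy is a pair of thin wrappers around `numpy.std`. The names `stdev_p` and `stdev_s` are hypothetical helpers introduced here for illustration. The sketch assumes purely numeric input and does not attempt to reproduce Excel's handling of text, logical values or empty cells; raising an error for fewer than two values is a design choice of this sketch.

```python
import numpy as np

def stdev_p(values):
    """Population SD (divisor N), analogous to Excel's STDEV.P."""
    arr = np.asarray(values, dtype=float)
    return float(np.std(arr, ddof=0))

def stdev_s(values):
    """Sample SD (divisor N - 1), analogous to Excel's STDEV.S."""
    arr = np.asarray(values, dtype=float)
    if arr.size < 2:
        # the divisor N - 1 would be zero for a single value
        raise ValueError("at least two values are required")
    return float(np.std(arr, ddof=1))

values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(f"{stdev_p(values):.4f}")  # 2.8723
print(f"{stdev_s(values):.4f}")  # 3.0277
```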

### Standard deviation in NumPy 

The NumPy library for Python provides the function `numpy.std()` for calculating the standard deviation. The calculation is equivalent to the following expression:  
`std = sqrt(mean(abs(x - x.mean())**2))`  
where `x` is the array of observations.

NumPy allows for calculating the standard deviation of both a population and a sample. The correction is controlled by the function parameter `ddof` ("delta degrees of freedom"), which by default equals zero (standard deviation of population); setting `ddof=1` gives the standard deviation of a sample.
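The effect of `ddof` can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes a normally distributed population with a known standard deviation (`true_sigma`, chosen arbitrarily), draws many small random samples from it, and averages the variance estimates obtained with `ddof=0` and `ddof=1`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

true_sigma = 2.0     # assumed population SD, so the true variance is 4.0
sample_size = 5      # deliberately small, to make the effect visible
n_samples = 20_000   # number of independent samples

# each row is one random sample drawn from the population
samples = rng.normal(loc=0.0, scale=true_sigma, size=(n_samples, sample_size))

# average the per-sample variance estimates over all samples
mean_var_uncorrected = np.var(samples, axis=1, ddof=0).mean()
mean_var_corrected = np.var(samples, axis=1, ddof=1).mean()

print(f"true variance:         {true_sigma**2:.3f}")
print(f"mean variance, ddof=0: {mean_var_uncorrected:.3f}")
print(f"mean variance, ddof=1: {mean_var_corrected:.3f}")
```

With these settings the `ddof=0` average should settle near $4 \cdot \frac{4}{5} = 3.2$, systematically below the true variance, while the `ddof=1` average should be close to 4. This is the main advantage of the corrected form when only a sample is available.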

"The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se" ([NumPy](https://numpy.org/doc/stable/reference/generated/numpy.std.html)).

In [1]:
# import NumPy
import numpy as np

In [2]:
# define the dataset in the form of a two-dimensional array
a = np.arange(10).reshape((2, 5)) # integers from 0 to 9
a

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
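Because the data set is two-dimensional, it may help to note that `np.std` flattens the array by default and returns a single number for all ten values. The short sketch below, added for illustration, contrasts this with the optional `axis` argument.

```python
import numpy as np

a = np.arange(10).reshape((2, 5))

print(np.std(a))          # one SD over all ten values (array is flattened)
print(np.std(a, axis=0))  # one SD per column (five values)
print(np.std(a, axis=1))  # one SD per row (two values)
```

The calculations in this notebook treat all ten values as one data set, so the default behaviour is the one that applies.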

In [3]:
# number of elements in array
n = np.size(a)

In [4]:
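# calculate the SD of population 'manually', using formula:
# np.sqrt(np.sum((x - np.mean(x))**2)/len(x))

# mean of all elements, computed once rather than in every iteration
mean = np.mean(a)

# accumulator for the sum of squared deviations
# (named so that it does not shadow Python's built-in sum())
sum_sq = 0.0

# iterate over elements of the array
for x in np.nditer(a):
    sum_sq = sum_sq + (x - mean)**2

sd_p = np.sqrt(sum_sq/n)

# print out the result
print(f"Standard deviation of population (calculated manually): {sd_p:.4f}")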

Standard deviation of population (calculated manually): 2.8723


In [5]:
# Standard deviation of population, calculated using NumPy's std() function

# calculate the SD using NumPy's std() function
sd_p_np = np.std(a)
print(f"Standard deviation of population (calculated with NumPy): {sd_p_np:.4f}")

Standard deviation of population (calculated with NumPy): 2.8723


In [6]:
# calculate the SD of sample 'manually', using formula:
# np.sqrt(np.sum((x - np.mean(x))**2)/(len(x) - 1))

# mean of all elements, computed once rather than in every iteration
mean = np.mean(a)

# accumulator for the sum of squared deviations
sum_sq = 0.0

# iterate over elements of the array
for x in np.nditer(a):
    sum_sq = sum_sq + (x - mean)**2

# divide by n - 1 (Bessel's correction) instead of n
sd_s = np.sqrt(sum_sq/(n-1))

# print out the result
print(f"Standard deviation of sample (calculated manually): {sd_s:.4f}")

Standard deviation of sample (calculated manually): 3.0277


In [7]:
# Standard deviation of sample, calculated using NumPy's std() function

# calculate the SD; ddof=1 changes the divisor from n to n - 1
sd_s_np = np.std(a, ddof=1)
print(f"Standard deviation of sample (calculated with NumPy): {sd_s_np:.4f}")

Standard deviation of sample (calculated with NumPy): 3.0277
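The element-by-element loops used above can also be written in vectorised form. The following sketch is an alternative to the manual cells, not a replacement for them; the names `sd_p_vec` and `sd_s_vec` are introduced here for illustration only.

```python
import numpy as np

a = np.arange(10).reshape((2, 5))
n = a.size

# squared deviation of every element from the overall mean
sq_dev = (a - a.mean()) ** 2

sd_p_vec = np.sqrt(sq_dev.sum() / n)        # population: divisor n
sd_s_vec = np.sqrt(sq_dev.sum() / (n - 1))  # sample: divisor n - 1

# both should agree with NumPy's built-in function
print(np.isclose(sd_p_vec, np.std(a)))          # True
print(np.isclose(sd_s_vec, np.std(a, ddof=1)))  # True
```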


In [8]:
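# evaluate the relative difference between the two results,
# taking the population SD as the reference value
err = abs(sd_p - sd_s)/sd_p # a fraction; formatted as a percentage below
print(f"The error of the standard deviation of a sample is: {err:.2%}") 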

The error of the standard deviation of a sample is: 5.41%


### Conclusion

In the example above, the sample standard deviation (3.0277) is approximately 5.41% larger than the population standard deviation (2.8723) calculated from the same ten values.

Although the standard deviation of the entire population yields an accurate result (every observation is considered), it is often not viable for practical reasons (for example, it is hard to imagine measuring the height of every person).

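The standard deviation of a sample is only an estimate of the population value. With the $N-1$ correction (`STDEV.S` in Excel, `ddof=1` in NumPy) the variance estimate is unbiased, although, as the NumPy documentation quoted above notes, the standard deviation itself is still not strictly unbiased. This correction is the advantage of `STDEV.S` whenever the data are a sample rather than the whole population. For a representative sample (a large enough random sample), it is often a good estimate, and the larger the sample, the more accurate the estimate.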
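When both formulas are applied to the same $n$ values, they share the same sum of squared deviations, so their ratio is $s / \sigma = \sqrt{n/(n-1)}$ regardless of the data. The sketch below (the function name `relative_difference` is a hypothetical helper) shows how quickly the gap between the two results closes as $n$ grows; for $n = 10$ it reproduces the 5.41% obtained in this notebook.

```python
import math

def relative_difference(n):
    """Relative difference (s - sigma) / sigma for the same n values."""
    return math.sqrt(n / (n - 1)) - 1

# the gap shrinks as the number of observations grows
for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5}: {relative_difference(n):.2%}")
```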

___
## References and bibliography 

### General 

- Ian McLoughlin, Assignment Brief, 2020. [pdf] GMIT. Available at: <https://learnonline.gmit.ie/mod/url/view.php?id=102004> [Accessed October 2020].

### Task 3 related

- Wikipedia Contributors - Standard Deviation [online] Available at: <https://en.wikipedia.org/wiki/Standard_deviation> [Accessed November 2020]
- Tech Book Report - Standard Deviation In 30 Seconds [online] Available at: <http://www.techbookreport.com/tutorials/stddev-30-secs.html> [Accessed November 2020]
- Math is fun - Standard Deviation and Variance [online] Available at: <https://www.mathsisfun.com/data/standard-deviation.html> [Accessed November 2020]
- Microsoft support - STDEV.P function [online] Available at: <https://support.microsoft.com/en-us/office/stdev-p-function-6e917c05-31a0-496f-ade7-4f4e7462f285> [Accessed November 2020]
- Microsoft support - STDEV.S function [online] Available at: <https://support.microsoft.com/en-us/office/stdev-s-function-7d69cf97-0c1f-4acf-be27-f3e83904cc23> [Accessed November 2020]
- Exceltip - How To Use Excel STDEV.P Function [online] Available at: <https://www.exceltip.com/statistical-formulas/how-to-use-excel-stdev-p-function.html> [Accessed November 2020]
- Good Data - Standard Deviation Functions [online] Available at: <https://help.gooddata.com/doc/en/reporting-and-dashboards/maql-analytical-query-language/maql-expression-reference/aggregation-functions/statistical-functions/standard-deviation-functions> [Accessed November 2020]
- NumPy documentation - Standard Deviation (numpy.std) [online] Available at: <https://numpy.org/doc/stable/reference/generated/numpy.std.html> [Accessed November 2020]
- Stack Overflow contributors - STDEV.S and STDEV.P using numpy [online] Available at: <https://stackoverflow.com/questions/64884294/stdev-s-and-stdev-p-using-numpy> [Accessed November 2020]

___
Andrzej Kocielski