# Tasks 2020

***
These are the solutions to the 2020 Task Assessments for the Machine Learning & Statistics module.

The author of these tasks is Dervla Candon (G00283361@gmit.ie).
***


## Task 1
***
_Write a Python function called sqrt2 that calculates and prints to the screen the square root of 2 to 100 decimal places. Your code should not depend on any module from the standard library or otherwise. You should research the task first and include references and a description of your algorithm._
***


Newton's method is one of the most common methods for estimating the square root of a number[1]. When applied to square roots, it is also referred to as the Babylonian method[2].

The iterative equation for calculating x, the square root of a, is as follows:

\\[x_{n+1}=\frac{1}{2}(x_{n} + \frac{a}{x_{n}})\\]
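
As a brief sketch of where this equation comes from (the function f is notation introduced here for illustration): the square root of a is the positive root of \\(f(x) = x^{2} - a\\), and substituting f and its derivative \\(f'(x) = 2x\\) into the general Newton update gives

\\[x_{n+1} = x_{n} - \frac{f(x_{n})}{f'(x_{n})} = x_{n} - \frac{x_{n}^{2} - a}{2x_{n}} = \frac{1}{2}(x_{n} + \frac{a}{x_{n}})\\]

For a = 2 and a starting value of 1, the first estimates are 3/2, 17/12 and 577/408, which already agrees with the square root of 2 to five decimal places.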

In [1]:
# Reference value from the system calculator: 1.414213562373
# Used to confirm that the output is correct

def sqrt2():
    # initial guess for the square root of 2
    x = 1
    # iterate until the square of the estimate is within the tolerance of 2
    while abs(x**2 - 2) > 0.00000001:
        # use Newton's method to calculate a closer approximation
        x = 0.5*(x + 2/x)
    # format the estimate to 100 decimal places
    # x is a float, so digits beyond double precision are not meaningful
    string_x = "{:0.100f}".format(x)
    return string_x

In [2]:
sqrt2()

'1.4142135623746898698271934335934929549694061279296875000000000000000000000000000000000000000000000000'
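
The value returned above agrees with the true square root of 2 only to about 11 decimal places. A Python float carries roughly 16 significant digits, and the loop stops as soon as the tolerance is met, so the remaining printed digits are just the exact decimal expansion of that float. One possible way to reach 100 correct decimal places without importing anything is to run the same Newton iteration on integers, which Python stores with arbitrary precision. The following is a minimal sketch under that assumption, not part of the original solution; the name `sqrt2_digits` and its `places` parameter are hypothetical.

```python
def sqrt2_digits(places=100):
    # Scale the problem so that the answer is an integer:
    # floor(sqrt(2 * 10**(2*places))) holds sqrt(2) with `places` decimals.
    n = 2 * 10 ** (2 * places)
    # Starting guess; it must not be smaller than the true root.
    x = 2 * 10 ** places
    while True:
        # Newton's update x -> (x + n/x)/2, using integer division
        y = (x + n // x) // 2
        # The estimates decrease until the integer square root is reached
        if y >= x:
            break
        x = y
    digits = str(x)
    # Re-insert the decimal point after the leading "1"
    return digits[0] + "." + digits[1:]


print(sqrt2_digits())
```

The digits produced this way are truncated rather than rounded, since the integer square root is a floor.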

***

## Task 2
***
_The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article gives the table below as an example [4], stating the Chi-squared value based on it is approximately 24.6. Use scipy.stats to verify this value and calculate the associated p value. You should include a short note with references justifying your analysis in a markdown cell._
***

In [3]:
import scipy.stats as sp
import numpy as np
import pandas as pd

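# contingency table of observed counts
# each inner list holds the counts for one neighbourhood across the three occupation categories
# results will be the same whether the arrays are grouped by occupation or neighbourhood
x = np.array([[90,30,30],[60,50,40],[104,51,45],[95,20,35]])
# chi2_contingency only requires the observed table as input
# it returns 4 values: the test statistic, the p-value, the degrees of freedom and the expected frequencies
chi2,p,dof,expected = sp.chi2_contingency(x)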

In [4]:
chi2

24.571202858582602

In [5]:
p

0.0004098425861096692

In [6]:
dof

6

The computed Chi-squared value of ~24.57 agrees with the value of approximately 24.6 stated in the Wikipedia article. For the test data, the null hypothesis being tested is that occupation and home neighbourhood are independent of one another.

Using a significance level of 5%, a p-value of less than or equal to 0.05 means that, if the null hypothesis were true, counts at least as far from the expected values as those observed would occur with a probability of at most 5%. Such a result is treated as statistically significant and the null hypothesis is rejected.

Here, as the p-value is ~0.00041, the null hypothesis is rejected, resulting in the conclusion that neighbourhood and occupation category are not independent: the likelihood that a randomly selected person falls within a certain occupation category differs between neighbourhoods.
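
The statistic can also be reproduced from its definition, which shows what is being computed for this table: the expected count in each cell is (row total × column total) / grand total, the statistic is the sum of (observed - expected)² / expected over all cells, and the degrees of freedom are (rows - 1) × (columns - 1). A minimal sketch, assuming the same observed table as above; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# same observed counts as used with chi2_contingency above
observed = np.array([[90, 30, 30], [60, 50, 40], [104, 51, 45], [95, 20, 35]])

# expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic and degrees of freedom
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper tail of the Chi-squared distribution
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

print(chi2_manual, dof_manual, p_manual)
```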

***

## Task 3
***

*The standard deviation of an array of numbers x is calculated using numpy as `np.sqrt(np.sum((x - np.mean(x))**2)/len(x))`.
However, Microsoft Excel has two different versions of the standard deviation calculation, STDDEV.P and STDDEV.S . The STDDEV.P function performs the above calculation but in the STDDEV.S calculation the division is by len(x)-1 rather than len(x). Research these Excel functions, writing a note in a Markdown cell about the difference between them. Then use numpy to perform a simulation demonstrating that the STDDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample. Note that part of this task is to figure out the terminology in the previous sentence.*

***

For the Excel functions STDEV.P and STDEV.S, P stands for population [6] while S stands for sample [7]. STDEV.P should be used if the input data represents the entire population; if the values represent a sample from the population, STDEV.S is the appropriate choice.

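When a numerical array is generated randomly, it represents a sample drawn from the probability distribution used to generate it; thus I would expect the STDEV.S formula to give a better estimate of the standard deviation of that distribution (the population). Because the deviations are measured from the sample mean rather than the true population mean, dividing by n underestimates the spread on average, and dividing by n-1 compensates for this.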
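
In numpy, the two calculations correspond to the `ddof` argument of `np.std`, which sets the divisor to `len(x) - ddof`. The short sketch below uses a small illustrative array (not taken from the task) to check this against the explicit formulas.

```python
import numpy as np

# small illustrative data set
x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# equivalent of STDEV.P: divide by n
stdev_p = np.std(x, ddof=0)
# equivalent of STDEV.S: divide by n-1
stdev_s = np.std(x, ddof=1)

# the same values from the explicit formulas
manual_p = np.sqrt(np.sum((x - np.mean(x))**2) / len(x))
manual_s = np.sqrt(np.sum((x - np.mean(x))**2) / (len(x) - 1))

print(stdev_p, manual_p)
print(stdev_s, manual_s)
```

The n-1 version is always slightly larger than the n version, and the gap shrinks as the array gets longer.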

In [19]:
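# population of 1000 values and a sample of 100 drawn from it
a = np.random.normal(loc=0,scale=10,size=1000)
b = np.random.choice(a,size=100)
# STDEV.P divides by n, STDEV.S divides by n-1
STDEV_P = np.sqrt(np.sum((b - np.mean(b))**2)/len(b))
STDEV_S = np.sqrt(np.sum((b - np.mean(b))**2)/(len(b)-1))
print(STDEV_P,STDEV_S)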

10.564891819773086 10.618115792994663


In [9]:
scale1=1
trials=1000
results_S = []
results_P = []

for i in range(trials):
    x1 = np.random.normal(loc=0,scale=scale1,size=100)
    # STDEV.P divides by n
    STDEV_P = np.sqrt(np.sum((x1 - np.mean(x1))**2)/len(x1))
    results_P.append(STDEV_P)
    # STDEV.S divides by n-1
    STDEV_S = np.sqrt(np.sum((x1 - np.mean(x1))**2)/(len(x1)-1))
    results_S.append(STDEV_S)

# average estimate over all trials, to compare against the true value scale1
print(np.mean(results_S),np.mean(results_P))

0.9949202479811068 0.989933147664204


In [8]:
scale2=5
x2 = np.random.normal(loc=20,scale=scale2,size=100)

# STDEV.S divides by n-1, STDEV.P divides by n
STDEV_S2 = np.sqrt(np.sum((x2 - np.mean(x2))**2)/(len(x2)-1))
STDEV_P2 = np.sqrt(np.sum((x2 - np.mean(x2))**2)/len(x2))
# percentage error of each estimate relative to the true value scale2
percentage_error2 = [abs((scale2-STDEV_S2)/scale2)*100,abs((scale2-STDEV_P2)/scale2)*100]
print(STDEV_S2,STDEV_P2,percentage_error2)

5.332549128893826 5.305819391003206 [6.650982577876512, 6.116387820064126]
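
A single sample of 100 values can favour either formula by chance, since at that size the two results differ by only about 0.5%. The difference is easier to see with small samples averaged over many trials. The sketch below is one way to show this, assuming a sample size of 5 and 100,000 trials; these values, the seed and the variable names are illustrative choices rather than part of the cells above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

sigma = 5          # true standard deviation of the population
n = 5              # size of each sample
trials = 100_000   # number of samples drawn

# one row per trial, each row is a sample of n values
samples = rng.normal(loc=20, scale=sigma, size=(trials, n))

# STDEV.P equivalent (divide by n) and STDEV.S equivalent (divide by n-1)
stdev_p = np.std(samples, axis=1, ddof=0)
stdev_s = np.std(samples, axis=1, ddof=1)

# average standard deviation estimates, to compare with sigma
mean_p = stdev_p.mean()
mean_s = stdev_s.mean()

# average variance estimates, to compare with sigma**2
mean_var_p = (stdev_p**2).mean()
mean_var_s = (stdev_s**2).mean()

print(mean_p, mean_s)
print(mean_var_p, mean_var_s)
```

Under these assumptions the n-1 version should average closer to 5, and its squared value closer to 25. Neither average is expected to equal 5 exactly for the standard deviation, because the square root of an unbiased variance estimate is still biased slightly low.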


### References

[1] Newton's Method; Wikipedia; https://en.wikipedia.org/wiki/Newton%27s_method#Square_root

[2] Methods of Computing Square Root; Wikipedia; https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Babylonian_method

[3] Chi-Squared Test; Wikipedia; https://en.wikipedia.org/w/index.php?title=Chi-squared_test&oldid=983024096

[4] Chi2 Contingency; Scipy; https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

[5] p Values; statsdirect; https://www.statsdirect.com/help/basics/p_values.htm

[6] How To Use Excel STDEV.P Function; ExcelTip; https://www.exceltip.com/statistical-formulas/how-to-use-excel-stdev-p-function.html

[7] STDEV.S Function; Microsoft; https://support.microsoft.com/en-us/office/stdev-s-function-7d69cf97-0c1f-4acf-be27-f3e83904cc23