# Tasks
### This workbook contains my solutions to the tasks for the Machine Learning and Statistics Module





#### Task 1 - October 5th, 2020 
Write a Python function called sqrt2 that calculates and prints to the screen the square root of 2 to 100 decimal places. Your code should not depend on any module from the standard library or otherwise. You should research the task first and include references and a description of your algorithm.
#### Research
Newton's method [1] can be used to calculate the square root of a number by first taking a guess and then iteratively moving closer toward the actual square root. According to Newton's method, one can approach further toward the square root of a number, x, from an initial guess of the square root, z, as follows:

$$ z_{next} = z - \frac{z^2-x}{2z} $$
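
As a worked step, taking $x = 2$ and an initial guess of $z = 1.5$ (an illustrative choice), a single application of the formula gives:

$$ z_{next} = 1.5 - \frac{1.5^2 - 2}{2 \times 1.5} = 1.5 - \frac{0.25}{3} \approx 1.41667 $$

Squaring this gives approximately 2.00694, which is already much closer to 2 than $1.5^2 = 2.25$.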

Newton's method is straightforward to program in Python: one uses Python's operators (-, * and /) to perform each instance of the calculation, and instructs the program to keep repeating it, using the output of the previous instance as the input to the next, until the square of the output is as close as desired to the number whose square root is sought (to within 0.001, for example). One would then specify how many decimal places the calculated root should be displayed to, in this instance 100.

However, there are two impediments to calculating the square root of 2 to a precision of 100 decimal places without using any Python modules, as anyone who tries will quickly find out:
1. Most decimal fractions do not have exact representations in binary floating point, which is essentially what Python's float datatype is, and which is the only datatype available for representing non-integer numbers in Python without the use of modules such as Decimal [2]. This means that we are unlikely to achieve the accuracy required to display the square root of two correctly to 100 decimal places, as doing so would require many very precise calculations on floating point numbers, which are themselves only approximations.
2. Python is simply not able to store 100 decimal places in a floating point number: a float carries 53 bits of precision, which corresponds to roughly 15 to 17 significant decimal digits.
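
Both impediments can be seen directly in a short sketch. It imports `sys` purely to inspect the float format (so it is an illustration, not part of the module-free solution), and the variable names are my own.

```python
import sys

# 0.1, 0.2 and 0.3 have no exact binary representation, so the
# small rounding errors show up in an ordinary comparison
sum_matches = (0.1 + 0.2 == 0.3)
print(sum_matches)

# number of bits in a float's significand, and the number of
# decimal digits that are guaranteed to survive a round trip
print(sys.float_info.mant_dig)
print(sys.float_info.dig)

# formatting to 100 places does not add information: it prints the
# exact decimal expansion of the stored binary value, padded with zeros
formatted = "%.100f" % 0.1
print(formatted)
```

The last line prints digits that differ from 0.1 after the seventeenth decimal place, which is the same effect seen when formatting `sqrt(2)` to 100 places.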

It is important to note, however, that the problem is not with the Python operators (-, * and /) or loops, but only with the float datatype itself. This means that if we could somehow use the other numerical datatype that does not need to be imported in Python, i.e. the integer, to perform the calculation, we should not have a problem, except, that is, for the fact that the integer datatype does not cater to decimal places.

The trick, then, is to forget about decimal places during the calculation itself, and then insert the decimal point afterwards using the string datatype. What if, instead of calculating the square root of two, we calculated the first hundred digits of the square root of: $$2\times10^{200}$$

Looking at the two numbers as strings, the only difference would be that one has a '.' and the other does not; and because integers have exact representations in Python's integer datatype, we should not come across any imprecision as we would have using the float datatype. Credit to here [4] for this solution.
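
A minimal sketch of this idea on a small scale follows. The helper names `integer_sqrt` and `sqrt2_digits` are hypothetical, and the stopping rule (stop as soon as the integer guess no longer decreases) is one common way to end an integer Newton iteration rather than the approach taken in the solution further down.

```python
def integer_sqrt(n):
    """Floor of the square root of a non-negative integer, found with
    Newton's method using only integer arithmetic."""
    if n < 2:
        return n
    guess = n
    while True:
        # integer form of: better = (guess + n / guess) / 2
        better = (guess + n // guess) // 2
        if better >= guess:
            # the guess has stopped decreasing, so it is the floor root
            return guess
        guess = better


def sqrt2_digits(places):
    """Square root of 2 as a string, truncated to `places` decimal places."""
    # sqrt(2 * 10**(2 * places)) equals sqrt(2) * 10**places
    scaled = integer_sqrt(2 * 10 ** (2 * places))
    digits = str(scaled)
    # sqrt(2) has a single digit before the decimal point
    return digits[0] + "." + digits[1:]


print(sqrt2_digits(4))
print(sqrt2_digits(100))
```

With `places=4` the scaled integer is 14142, which becomes the string 1.4142 once the point is inserted. The digits are truncated rather than rounded.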

What will thus be shown as part of my solution is as follows:
1. Python's Math module's sqrt function is not accurate to the 100th decimal place (we will use NASA's calculation of the root of 2 to ~ a million decimal places as our comparison [3]), nor is it able to even provide 100 decimal places. From this we can infer that the inaccuracy resides in the datatype used by this function, namely, the float.
2. Python's Decimal module's sqrt function is accurate to the number of decimal places requested, again indicating that Python's float datatype is what is at issue above.
3. The float-based function written below takes rapidly longer to run as the demanded difference between the calculated root squared and 2 is tightened towards .0001, and in any case a float cannot hold 100 decimal places.
4. The square root of two can be calculated by using the integer datatype to calculate the square root of *2\*10^200* and then converting it to a string, slicing to retain only the first 100 characters, and then dutifully inserting the decimal point.


[1] https://mathworld.wolfram.com/NewtonsMethod.html

[2] https://docs.python.org/2/library/decimal.html

[3] https://stackoverflow.com/questions/22162522/how-to-display-a-decimal-number-to-100-decimal-places

[4] https://stackoverflow.com/questions/64295245/how-to-get-the-square-root-of-a-number-to-100-decimal-places-without-using-any-l

In [1]:
# hint at the inaccuracy of the sqrt function in the
# Math module, which uses floats
from math import sqrt
sqrt(2)**2

2.0000000000000004

In [2]:
# print the Math module's square root of 2 to 100
# decimal places
answer = "%.100f" % sqrt(2)
print(answer)
# confirm that there are 100 decimal places
print(len(str((answer))) - 2)

1.4142135623730951454746218587388284504413604736328125000000000000000000000000000000000000000000000000
100


In [3]:
# we will take NASA's calculation of the root of 2
# as accurate to 100 decimal places
nasa = "1.41421356237309504880168872420969807856967187537694807317667973799073247846210703885038753432764157273501384623091229702492483605585073721264412149709993583141322266592750559275579995050115278206057147010955997160597027453459686201472851..."
nasa = nasa[:101]
# check at which decimal place the Math module
# becomes inaccurate, and thus floats likely
# become inaccurate
for i in range(102):
    if str(answer)[i] != nasa[i]:
        accuracy = i - 2
        break
# check how many places (inaccurate or otherwise)
# the math module can calculate the root of 2 to
for i in range(102):
    if str(answer)[i] != "0":
        x = 0
    if str(answer)[i] == "0":
               x+=1
    if x == 4:
        place = i-2-x
        break
print("Python's Math module's sqrt function is accurate to decimal place " + str(accuracy) + ", although it provides an answer to decimal place " + str(place) + ".")

Python's Math module's sqrt function is accurate to decimal place 15, although it provides an answer to decimal place 51.


In [4]:
# show that the Decimal module can calculate the
# square root of 2 to 100 decimal places, by 
# comparing with NASA's figure, above
from decimal import Decimal, getcontext
# set the precision to 100 significant digits
# (one integer digit plus 99 decimal places)
getcontext().prec = 100
# get the decimal square root of 2
Decimal(2).sqrt()
print(nasa)
print(Decimal(2).sqrt())
# as the NASA figure has been truncated rather than rounded,
# the final digits differ, so we compare only the first
# 100 characters (98 decimal places) for simplicity
print(str(Decimal(2).sqrt())[:100] == nasa[:100])

1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572
1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641573
True


In [5]:
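# Find the difference between 2 and the square of the
# square root of 2 given to 99 decimal places.
# Note that the long literal below is a float, so it is
# rounded to float precision before being squared; the result
# therefore has the same magnitude as the error seen for
# the Math module's function (.0000000000000004)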
Decimal(2 - 1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641573 ** 2)

Decimal('-4.44089209850062616169452667236328125E-16')
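
To carry the subtraction out at full precision, the root has to stay a `Decimal` throughout instead of passing through a float literal. The following is a minimal sketch of that, assuming a working precision of 100 significant digits; it uses `localcontext` so that the global decimal context is left untouched, and the variable names are my own.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    # 100 significant digits for every operation inside this block
    ctx.prec = 100
    root = Decimal(2).sqrt()
    # both operands are exact Decimals, so no float rounding occurs
    difference = 2 - root ** 2

print(difference)
```

Under these assumptions the difference should be many orders of magnitude smaller than the float result above, limited only by the rounding of the 100-digit root.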

In [6]:
# design an algorithm implementing Newton's Method
# Newton's method: better guess = 0.5 * (guess + number / guess)
# store the desired precision in a variable
# we decide on the precision outside of the function itself
# as the requirements of the question do not indicate
# that precision should be inputted as a parameter to the function
def squareRootOfTwo():
    # store 2 in a variable so that we are not
    # hardcoding values
    number = 2
    # have the function take a guess
    guess = 1.5
    # while the guess is not of a sufficient accuracy,
    # apply Newton's algorithm to improve it.
    # To check the accuracy of the guess, square it and
    # check if the absolute difference between the result
    # and the input is less than the desired precision.
    while abs(guess ** 2 - number) > precision:
        # if greater than the precision, update the guess.
        # Note that, by operator precedence, the expression below
        # evaluates as ((guess ** 2 - number) / 2) * guess, whereas
        # the formula in the Research section divides by (2 * guess)
        guess = guess - ((guess ** 2 - number) / 2 * guess)
    # once the guess is sufficiently accurate the while
    # loop ends. Now print the result to 100 decimal places
    print("%.100f" % guess)
    # note that this returns the global variable answer set in
    # an earlier cell, not the guess printed above
    return str(answer)

In [7]:
# Now we will test Newton's method for calculating
# the square root of two, and time the calculation
import time
precision = 0.01
start = time.time()
answer = squareRootOfTwo()
end = time.time()
print(str(round(end - start, 1)) + " seconds taken")
# check how many places (inaccurate or otherwise)
# this run of the algorithm calculated the root of 2 to
for i in range(len(answer)):
    if str(answer)[i] != "0":
        x = 0
    if str(answer)[i] == "0":
               x+=1
    if x == 4:
        place = i-2-x
        break
print("Running our function so that the square of the calculated square root of two is between 1.99 and 2.01 provides an approximation of the square root of two to decimal place " + str(place) + ".")

1.4177445877362315762582056777318939566612243652343750000000000000000000000000000000000000000000000000
0.0 seconds taken
Running our function so that the square of the calculated square root of two is between 1.99 and 2.01 provides an approximation of the square root of two to decimal place 51.


In [8]:
# Same again but to a precision of 0.001
# note the increase in running time compared with the previous cell
import time
precision = 0.001
start = time.time()
answer = squareRootOfTwo()
end = time.time()
print(str(round(end - start, 1)) + " seconds taken")
# check how many places (inaccurate or otherwise)
# this run of the algorithm calculated the root of 2 to
for i in range(len(answer)):
    if str(answer)[i] != "0":
        x = 0
    if str(answer)[i] == "0":
               x+=1
    if x == 4:
        place = i-2-x
        break
print("Running our function so that the square of the calculated square root of two is between 1.999 and 2.001 provides an approximation of the square root of two to decimal place " + str(place) + ".")

1.4145670714723390659628421417437493801116943359375000000000000000000000000000000000000000000000000000
1.0 seconds taken
Running our function so that the square of the calculated square root of two is between 1.999 and 2.001 provides an approximation of the square root of two to decimal place 51.


In [9]:
# Same again but to a precision of 0.0001
# note the increase in running time compared with the previous cell
import time
precision = 0.0001
start = time.time()
answer = squareRootOfTwo()
end = time.time()
print(str(round(end - start, 1)) + " seconds taken")
# check how many places (inaccurate or otherwise)
# this run of the algorithm calculated the root of 2 to
for i in range(len(answer)):
    if str(answer)[i] != "0":
        x = 0
    if str(answer)[i] == "0":
               x+=1
    if x == 4:
        place = i-2-x
        break
print("Running our function so that the square of the calculated square root of two is between 1.9999 and 2.0001 provides an approximation of the square root of two to decimal place " + str(place) + ".")

1.4142489172699921340381479240022599697113037109375000000000000000000000000000000000000000000000000000
90.0 seconds taken
Running our function so that the square of the calculated square root of two is between 1.9999 and 2.0001 provides an approximation of the square root of two to decimal place 51.


We can note two things from the running of our function with different precision demands:
1. The printed values all terminate after roughly 50 decimal places and are padded with zeros thereafter, whatever the precision demanded, indicating that there is a limit to the number of digits that a float is capable of storing.
2. The running time of the function as written increases very quickly with the demanded precision (from 0.0 seconds at 0.01 precision to 1.0 seconds at 0.001 precision to 90.0 seconds at 0.0001 precision). If one demands an even greater precision, the function as written does not finish in a reasonable time.

*The conclusion from this is thus that it is not possible to use the float datatype to calculate the square root of 2 to 100 decimal places.*
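
For comparison, here is a minimal sketch of the same iteration with the denominator parenthesised as `(2 * guess)`, following the formula in the Research section. The function name `newton_sqrt_float`, the tolerance and the iteration cap are illustrative assumptions, not part of the original solution.

```python
def newton_sqrt_float(number, guess=1.5, tolerance=1e-12, max_iterations=50):
    """Approximate the square root of `number` with floats.

    Returns the final guess and the number of iterations used."""
    iterations = 0
    while abs(guess * guess - number) > tolerance and iterations < max_iterations:
        # Newton's update: z_next = z - (z**2 - x) / (2 * z)
        guess = guess - (guess * guess - number) / (2 * guess)
        iterations += 1
    return guess, iterations


root, steps = newton_sqrt_float(2)
print(root, steps)
# the digits beyond float precision are not digits of the true root
print("%.100f" % root)
```

Under these assumptions the loop should settle within a handful of iterations. The result is still limited to the 15 to 17 significant digits that a float can hold, so the conclusion about 100 decimal places is unaffected.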

In [10]:
# rewrite our function so that it calculates the integer square root
# of 2*10**200, and then manipulate this using
# string operations into the square root of two
def squareRootOfTwo():
    number = 2*10**200
    # have the function take a guess
    guess = number // 2
    # apply Newton's algorithm repeatedly to improve the guess.
    # This time, because we are dealing with very large
    # numbers, it is not practical to check that the answer
    # is precise. Instead, we will assume that after 1000
    # iterations it is as precise as we need.
    for i in range(1000):
        # apply Newton's algorithm using integer (floor) division
        guess = guess - ((guess ** 2 - number) // (2 * guess))
    # once the loop ends, return the integer result
    return guess

In [11]:
import time
precision = 10**100
start = time.time()
answer = squareRootOfTwo()
end = time.time()
# we only want the first 100 digits, i.e. one integer
# digit followed by 99 decimal places
answer = str(answer)[:100]
# the first character stays as it is, followed by a
# decimal point and then the rest of the characters
finalanswer = answer[0] + "." + answer[1:]
# print our final answer
print("".join(finalanswer))
# print NASA's calculation
print(nasa[:101])
# check if they are the same
print("".join(finalanswer) == nasa[:101])


1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572
1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572
True


#### Task 2 November 2nd, 2020: 
The Chi-squared test for independence is a statistical
hypothesis test like a t-test. It is used to analyse whether two categorical variables
are independent. The Wikipedia article gives the table below as an example [4],
stating the Chi-squared value based on it is approximately 24.6. Use scipy.stats
to verify this value and calculate the associated p value. You should include a short
note with references justifying your analysis in a markdown cell.

<table>
    <tr>
        <td></td>
        <td>A</td>
        <td>B</td>
        <td>C</td>
        <td>D</td>
        <td>Total</td>
    </tr>
    <tr>
        <td>White Collar</td>
        <td>90</td>
        <td>60</td>
        <td>104</td>
        <td>95</td>
        <td>349</td>
    </tr>
    <tr>
        <td>Blue Collar</td>
        <td>30</td>
        <td>50</td>
        <td>51</td>
        <td>20</td>
        <td>151</td>
    </tr>
    <tr>
        <td>No Collar</td>
        <td>30</td>
        <td>40</td>
        <td>45</td>
        <td>35</td>
        <td>150</td>
    </tr>
    <tr>
        <td>Total</td>
        <td>150</td>
        <td>150</td>
        <td>200</td>
        <td>150</td>
        <td>650</td>
    </tr>
</table>
    

<br>
I will first simply use scipy.stats to find the chi-squared value and the corresponding p-value for this sample. One can achieve that without understanding how the chi-squared value is actually calculated. The chi2_contingency function does this for us: its only mandatory parameter is the array of frequencies for each category.

In [10]:
import numpy as np
import scipy.stats as ss

data = np.array([[90, 60, 104, 95],[30, 50, 51, 20],[30, 40, 45, 35]])

chi2, p, dof, expected = ss.chi2_contingency(data)

print(f"chi^2 = {chi2}")
print(f"p-value = {p}")
if p < 0.05:
    print("As the p-value is less than 0.05, the null hypothesis, that the categories are independent, can be rejected.")

chi^2 = 24.5712028585826
p-value = 0.0004098425861096696
As the p-value is less than 0.05, the null hypothesis, that the categories are independent, can be rejected.


As regards understanding what the chi-squared test achieves, its meaning is quite intuitive. This is in contrast to the chi-squared distribution, whose probability density function is certainly intimidating to look at and depends mathematically on the gamma function, and many say that to understand the gamma function one must first understand the exponential and the Poisson distributions. Wikipedia's definition of the test is lucid:

> The chi-squared test is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance [...] It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. The events considered must be mutually exclusive and have total probability 1 [1].

For example, when we flip a coin, the result is either heads or tails, not both, which covers mutual exclusivity. The result also *must* be either heads or tails, meaning that the probability of heads coming up and that of tails coming up sum to 1. Of course, we know that a standard coin has a .5 chance of turning up heads, and a .5 chance of turning up tails. Thus, if we flip a coin 100 times, and we get 65 heads and 35 tails, we could use the chi-squared test to evaluate whether the disparity between this and the expected 50/50 outcome should be put down to chance or not. Basically, the chi-squared test may inform us that the disparity would be highly unlikely to be down to chance, and it may be that the coin is not evenly weighted, for example.

The chi-squared test operates, then, by calculating the chi-squared value of a sample, or number of samples. To get this value, we must first calculate, for each category, a value that represents the discrepancy between the expected frequency of the category in question and the actual frequency observed in the sample. The formula for this is:
$$\frac{(\text{actual value} - \text{expected value})^2}{\text{expected value}}$$

This formula is very similar to that for variance, as we would expect, because we are here calculating a kind of variance between the expected and actual frequencies in the sample. To obtain the chi-squared value, we then just sum the 'variance' of each category in each sample:

$${\displaystyle \chi ^{2}=\sum _{i=1}^{n}{\frac {(O_{i}-E_{i})^{2}}{E_{i}}}}$$
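
The sum can be computed by hand for the table above. This is a minimal sketch using only built-in Python, where each expected frequency is taken as (row total × column total) / grand total, the usual estimate under independence; the variable names are my own.

```python
# observed frequencies from the table (rows: collar type, columns: A-D)
observed = [
    [90, 60, 104, 95],   # White Collar
    [30, 50, 51, 20],    # Blue Collar
    [30, 40, 45, 35],    # No Collar
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

chi_squared = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # expected frequency if rows and columns were independent
        expected = row_totals[i] * column_totals[j] / grand_total
        chi_squared += (count - expected) ** 2 / expected

print(chi_squared)
```

The result should agree with the value reported by `chi2_contingency` in the cell above, up to floating point rounding.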


There are two further things that are important to grasp when understanding how we evaluate the chi-squared value to actually carry out the chi-squared test.
1. The higher the chi-squared value, the greater the 'variance' between the expected and actual frequencies for the categories, and thus the stronger the evidence against the null hypothesis.
1. Because the chi-squared value is the sum of the 'variances' for each category and sample, more categories and samples will lead to a higher chi-squared value.

When one considers both of the above, it is clear that one must evaluate the chi-squared value differently for different numbers of categories and samples. Thus, while the chi-squared value represents our test statistic, to obtain the more meaningful 'p-value' we must factor in what are known as the 'degrees of freedom,' which can be calculated from the number of categories and samples. If one looks at the table above, one can see that, apart from the first and last rows and columns, each row represents a category, and each column a sample. Basically, the degrees of freedom is the number of pieces of data that we need in order to calculate our test statistic, here the chi-squared value, the rest being derivable from the totals. In this case, we need $(rows - 1)(columns - 1)$ pieces of data, because (assuming we know the total number of datapoints for each category and the total number of datapoints in each sample, which is a safe assumption) we can calculate the frequency of any one category in any one sample by either subtracting the frequencies of the other categories from the total in that sample, or subtracting the frequencies of the category in the other samples from the total datapoints for that category. Thus, for the table above, the degrees of freedom are $(3-1)(4-1) = 6$. For our coin example, which has a single sample and two categories, only one frequency is free once the number of flips is known, so there is 1 degree of freedom.

Once we are equipped with the chi-squared value and the degrees of freedom, we are then able to plot the probability density function for a chi-squared distribution with those degrees of freedom. Then, we calculate the area under the function to the left of the calculated chi-squared value (which lies on the x-axis), and if that area is greater than 0.95, we can reject the null hypothesis at the 5% significance level (otherwise we fail to reject it). This is essentially how the p-value of the test is calculated: $Pvalue = 1 - area$.
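
The area calculation can be sketched without scipy for this particular case. For an even number of degrees of freedom $k = 2m$, the area to the right of $x$ has the closed form $e^{-x/2}\sum_{j=0}^{m-1}(x/2)^j/j!$. The helper name `chi2_upper_tail_even_dof` is hypothetical, and the sketch deliberately covers only even degrees of freedom, which suffices for the table above (6 degrees of freedom).

```python
import math


def chi2_upper_tail_even_dof(x, dof):
    """Area under the chi-squared pdf to the right of x, for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form only applies to even, positive dof")
    half = x / 2
    term = 1.0    # (x/2)**j / j!, starting at j = 0
    total = 1.0
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


chi_squared = 24.5712028585826    # value reported by chi2_contingency above
dof = (3 - 1) * (4 - 1)           # (rows - 1)(columns - 1)
p_value = chi2_upper_tail_even_dof(chi_squared, dof)
area_to_left = 1 - p_value

print(p_value)
print(area_to_left > 0.95)
```

The p-value obtained this way should match the one reported by `chi2_contingency`, and the area to the left exceeds 0.95, which is the condition for rejecting the null hypothesis at the 5% level.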

Once the purpose and meaning of the chi-squared test is understood, the actual chi-squared distribution becomes much more manageable. In particular, the manner in which the pdf changes with changing degrees of freedom becomes actually intuitively understandable, as the chi-squared value has been shown above to increase with increasing degrees of freedom.


References
[1] https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test

[2] https://python-bloggers.com/2020/09/how-to-run-chi-square-test-in-python/

#### Task 3 November 16th, 2020
The standard deviation of an array of numbers x is
calculated using numpy as np.sqrt(np.sum((x - np.mean(x))**2)/len(x)) .
However, Microsoft Excel has two different versions of the standard deviation calculation, STDDEV.P and STDDEV.S . The STDDEV.P function performs the above
calculation but in the STDDEV.S calculation the division is by len(x)-1 rather
than len(x) . Research these Excel functions, writing a note in a Markdown cell
about the difference between them. Then use numpy to perform a simulation
demonstrating that the STDDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample. Note that part of
this task is to figure out the terminology in the previous sentence.

#### Research
Like many statistical phenomena, there is both a mathematical and an intuitive explanation of why, when calculating the standard deviation of a sample, we divide the sum of the squared distances of each point from the sample mean by the size of the sample minus one rather than simply by the size of the sample. I will here eschew the mathematical explanation in favour of the intuitive. I have taken this intuitive explanation mostly from this video [1].

Firstly, one must understand the difference between a population and its sample:

* A population is a complete set of data.
* A sample is a subset drawn from a population, and thus is an incomplete set of data.

Secondly, one must understand how variance is calculated:
* Variance is a measure of the spread of a dataset, and is calculated from the sum of each datapoint's squared distance from the mean, divided by a certain factor (discussed below).
* The distances are squared for two reasons: (1) to avoid a situation where the negative and positive distances from the mean would simply cancel each other and the variance would be zero (by squaring the distances from the mean, each term is never negative and there is no cancelling out); (2) to align with what are called the statistical moments of a distribution, which is a topic for another day.
* If you still think that squaring the distances seems somewhat artificial, then you are correct to the extent that variance in itself is not a very meaningful value, being expressed in squared units. It is only when one takes the square root of the variance (thereby to a certain extent undoing the previous squaring of the distances) that one arrives at an intrinsically meaningful value: the standard deviation, which represents how much individual values tend to deviate from the mean.

With an understanding of the difference between a population and a sample, and of how variance is calculated, we can now see why the two formulas differ. Because a sample is an incomplete version of a population, statistics that are derived from the entirety of a dataset's data points, such as the mean, are likely to differ between the population and the sample, because there are data points in the population that are missing from the sample. As suggested above, when calculating the variance of a sample/population, we first need to calculate the sample/population mean. This is simply because the mean is involved in the calculation of variance, as variance is a measure of the data points' distance from the mean. Now this is the key thing to understand: if one calculates the mean of a sample and then calculates the variance, *and if one then artificially changes the mean in any way without changing the datapoints and then recalculates the variance using that artificial mean, this new variance will always be greater than the originally calculated variance.* This is because the mean is precisely the value that minimizes the total squared distance of the data points from itself (it is a kind of middle value).
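
The key point can be checked on a tiny made-up sample. The five values below are purely illustrative, and `sum_of_squares_about` is a hypothetical helper name.

```python
# an illustrative sample whose mean is exactly 5.0
sample = [2.0, 4.0, 4.0, 5.0, 10.0]
sample_mean = sum(sample) / len(sample)


def sum_of_squares_about(centre, values):
    """Sum of squared distances of the values from the given centre."""
    return sum((value - centre) ** 2 for value in values)


about_mean = sum_of_squares_about(sample_mean, sample)
# move the 'mean' artificially in either direction
about_shifted_up = sum_of_squares_about(sample_mean + 1.0, sample)
about_shifted_down = sum_of_squares_about(sample_mean - 0.5, sample)

print(about_mean, about_shifted_up, about_shifted_down)
```

Here the sum of squares is 36.0 about the true sample mean, 41.0 about the mean shifted up by 1, and 37.25 about the mean shifted down by 0.5. In each case the increase equals the sample size times the square of the shift.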

As we said above, the population mean is almost always different from the sample mean. This means that the variance of the sample measured from the population mean will almost always be greater than the variance of the sample measured from the sample mean. To counteract this bias in the sample variance, we decrease the value of the denominator when calculating the standard deviation, so that the result is larger than it would have been had we not decreased it. Thus, instead of dividing by n, we divide by n - 1. But why n - 1?

The reason for the choice of -1 is somewhat less intuitive, although related to the foregoing. It involves what are called degrees of freedom. The degrees of freedom represents the number of data points that are 'free to move', or are independent.

Take a univariate dataset, i.e. where there is one variable, of which we know the mean. In a sample from the population, each new datapoint, i.e. each value that the variable takes on, is independent of the population mean. This is because the population cannot be in any way used to calculate the values in the sample (it might suggest ranges for values, but not the actual values themselves). This means that the degrees of freedom of a population is always the size of the population, n. If we take the *sample* mean, however, the situation is different. Here, in a sense, there is no longer a one-way relationship where the mean is derived from the values but not the other way around. This is because if we know all but one of the values in the sample as well as the sample mean, then we can also *derive* the remaining value. This means that the remaining value is actually not independent: it is not 'free.' As such, the degrees of freedom of the sample is the size of the sample minus one, n - 1. When we calculate standard deviation, we divide not by the size of the dataset but by the degrees of freedom. In the case of populations, this just happens to equal the size of the dataset. It is also worth mentioning that the same coincidence happens when first calculating the sample mean. Because at that point we do not know the mean, every value is free and cannot be derived from any other value available to us, hence the degrees of freedom in that case is also equal to the sample size.

To demonstrate that dividing by the degrees of freedom provides a more accurate standard deviation value than dividing by the size of the sample, I will do the following:

1. Simulate a normally distributed dataset using numpy.random's normal() function, which takes in the standard deviation as a parameter.
1. Manually calculate the standard deviation of the population according to both the population and sample formulas. The population formula should be closer to the inputted standard deviation.

[1] https://www.youtube.com/watch?v=wpY9o_OyxoQ

In [60]:
import numpy as np
import math
rng = np.random.default_rng()

mu, sigma, n = 100, 70, 100000 # mean, standard deviation and size
norm = rng.normal(mu, sigma, n)
print(f'Actual standard deviation is {sigma}')

popstd = np.sqrt((sum(abs(norm - mu)**2))/n)
print(f'Standard deviation calculated using the population formula is: {popstd}')

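# note that, by operator precedence, /n-1 is evaluated as
# (sum / n) - 1 rather than as sum / (n - 1)
sampstd = np.sqrt((sum(abs(norm - mu)**2))/n-1)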
print(f'Standard deviation calculated using the sample formula is: {sampstd}')

if abs(sigma - popstd) < abs(sigma - sampstd):
    print('When calculating the standard deviation of a population, the population formula produced the most accurate result\n')
else:
    print('When calculating the standard deviation of a population, the sample formula produced the most accurate result\n')
    
n = int(n/math.sqrt(n))
print(f'Taking a sample of size {n} from the population, the results are as follows:\n')
norm = np.random.choice(norm, size=n)
popstd = np.sqrt(np.sum((norm - np.mean(norm))**2)/len(norm))
print(f'Standard deviation calculated using the population formula is: {popstd}')

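# note that, by operator precedence, /len(norm)-1 is evaluated as
# (sum / len(norm)) - 1 rather than as sum / (len(norm) - 1)
sampstd = np.sqrt(np.sum((norm - np.mean(norm))**2)/len(norm)-1)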
print(f'Standard deviation calculated using the sample formula is: {sampstd}')

if abs(sigma - popstd) < abs(sigma - sampstd):
    print('When calculating the standard deviation of a sample, the population formula produced the result closest to the standard deviation of the population.')
else:
    print('When calculating the standard deviation of a sample, the sample formula produced the result closest to the standard deviation of the population')

Actual standard deviation is 70
Standard deviation calculated using the population formula is: 69.75789029982327
Standard deviation calculated using the sample formula is: 69.7507222835877
When calculating the standard deviation of a population, the population formula produced the most accurate result

Taking a sample of size 316 from the population, the results are as follows:

Standard deviation calculated using the population formula is: 62.17862208428696
Standard deviation calculated using the sample formula is: 62.17058021524793
When calculating the standard deviation of a sample, the population formula produced the result closest to the standard deviation of the population.
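
A single sample can land on either side of the true value by chance, so the comparison is more informative when averaged over many samples. Below is a minimal sketch of such a repeated-sampling simulation with the divisor written explicitly as `(sample_size - 1)`. To stay self-contained it uses the standard library's `random` module in place of numpy, and the seed, sample size and number of trials are illustrative assumptions.

```python
import math
import random

random.seed(0)       # fixed seed so that the run is repeatable

mu, sigma = 100, 70  # population mean and standard deviation
sample_size = 5      # small samples make the difference easy to see
trials = 20000

total_population_formula = 0.0
total_sample_formula = 0.0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(sample_size)]
    sample_mean = sum(sample) / sample_size
    sum_of_squares = sum((value - sample_mean) ** 2 for value in sample)
    # divide by n (the STDDEV.P calculation)
    total_population_formula += math.sqrt(sum_of_squares / sample_size)
    # divide by n - 1 (the STDDEV.S calculation)
    total_sample_formula += math.sqrt(sum_of_squares / (sample_size - 1))

mean_population_formula = total_population_formula / trials
mean_sample_formula = total_sample_formula / trials

print(f"True standard deviation: {sigma}")
print(f"Average estimate dividing by n:     {mean_population_formula}")
print(f"Average estimate dividing by n - 1: {mean_sample_formula}")
```

Under these assumptions, the average of the n - 1 estimates should sit noticeably closer to 70 than the average of the n estimates. Both averages remain somewhat below the true value, because taking the square root reintroduces a small downward bias even after the variance has been corrected.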
