# Machine Learning and Statistics - Tasks 2020
***
### Task 1 
#### Write a Python function called `sqrt2` that calculates and prints to the screen the square root of 2 to 100 decimal places

* This program uses Newton's method (also known as the Newton-Raphson method) to calculate the square root of 2.
$$ z = z - \frac{z^2 - x}{2z}$$

* The task requires that the function prints the result to 100 decimal places without using any modules from the standard library. 
* Python, however, stores floating point numbers correctly only to around 16 or 17 digits [1]. This is due to space limitations when storing these numbers in binary on the machine. 
* In order to print the square root of 2 (an irrational number) to 100 decimal places, a different approach is needed. 
* The correct output could be displayed as a string, which would bypass the need to use a floating point number. Additionally it may be noted that integers in Python have arbitrary precision [2]. A combination of these two approaches may indeed be the route to completing this task to satisfaction.
* 


[1] http://anh.cs.luc.edu/python/hands-on/3.1/handsonHtml/float.html <br>
[2] https://mortada.net/can-integer-operations-overflow-in-python.html

In [1]:
def sqrt2():
    
    """
    This function calculates the square root of 2 using Newton's method
    """
    
    # Let the initial guess r be equal to 2 
    r = 2
    # The tolerance variable is set to a sufficiently low number for increased accuracy of returned value 
    tolerance = 10 ** (-10)
   
    # Loop until we reach desired accuracy
    while abs(2 - r * r) > tolerance:
        # Newton's method
        r = (r + 2 / r) / 2
        # Prints new guess on each iteration
        print(r)
        
    return r


In [2]:
# call function
sqrt2()

1.5
1.4166666666666665
1.4142156862745097
1.4142135623746899


1.4142135623746899

In [3]:
def sqrt(x):
    """
    A function to calculate the square root of a number x.
    """
   # Initial guess for the square root z.
    z = x / 2
    # z = x
    # Loop until we're happy with the accuracy.
    while abs(x - (z * z)) > 0.000001:
        # Calculate a better guess for the square root.
        z = z - ((z*z - x) / (2 * z))
        print(z)
    # Return the (approximate) square root of x.
    return z

In [4]:
sqrt(9)

3.25
3.0096153846153846
3.000015360039322
3.0000000000393214


3.0000000000393214

In [5]:
sqrt(2)

1.5
1.4166666666666667
1.4142156862745099
1.4142135623746899


1.4142135623746899

In [6]:
# the square root of 2*10**20 copy and pasted
s = 1414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572735013846230912297024924836055850737212644121497099935831

In [7]:
s**2

1999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999998831229803130709902367424280661790339879176106247403381180563057585771322144352423335120973526436038005690651132039220275069396901513948987145305184317660561

In [8]:
b = s**2

In [9]:
b

1999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999998831229803130709902367424280661790339879176106247403381180563057585771322144352423335120973526436038005690651132039220275069396901513948987145305184317660561

In [10]:
len(str(b))

313

In [11]:
import math

In [12]:
math.sqrt(2)

1.4142135623730951

In [13]:
b = 2*10**200

In [14]:
b

200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

In [15]:
len(str(b))

201

In [16]:
math.sqrt(b)

1.414213562373095e+100

In [17]:
c = 1999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999998831229803130709902367424280661790339879176106247403381180563057585771322144352423335120973526436038005690651132039220275069396901513948987145305184317660561

In [18]:
%%capture 
math.sqrt(c);

OverflowError: int too large to convert to float

In [19]:
s

1414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572735013846230912297024924836055850737212644121497099935831

In [20]:
len(str(s))

157

In [21]:
1/10

0.1

In [22]:
0.1 + 0.2

0.30000000000000004

In [23]:
3602879701896397 / 2 ** 55

0.1

In [24]:
0.3+0.3+0.3

0.8999999999999999

In [25]:
0.3* 3

0.8999999999999999

In [26]:
1/3

0.3333333333333333

In [27]:
(1/3)* 3

1.0

### Task 2 - Chi-squared test
#### This task uses `scipy.stats` to verify a known Chi-squared value of 24.6 based on the table below. It also calculates the p-value. The table is taken from the Wikipedia article on the Chi-squared test [1].

| 	        | A | B | C | D | total|
:-----------|:---:|:---:|:---:|:---:|:------:|
White collar| 90| 60|104| 95|349   |
Blue collar | 30| 50| 51| 20|151   |
No collar   | 30| 40| 45| 35|150   |
**Total**       |**150**|**150**|**200**|**150**|**650**   |


#### The Chi-squared formula below is integrated into the `chi2_contingency()` function used for the task:


$$ \chi^2 = \Sigma \frac{(O_i - E_i)^2}{E_i} $$



* Each column A, B, C and D represents a neighbourhood in a city with a population of 1,000,000. A random sample of 650 is taken and their occupation recorded as either White Collar, Blue Collar or No Collar.
* The null hypothesis is that each person's occupation classification is independent of the neighbourhood they live in
* Using the `chi2_contingency()` function, the Chi-squared value of 24.5712 is calculated. This verifies the the value provided by Wikipedia.
* The p-value is approximately 0.0004. Since this is less than the conventionally accepted significance level of 0.05 [2] we reject the null hypothesis: "For a Chi-square test, a p-value that is less than or equal to your significance level indicates there is sufficient evidence to conclude that the observed distribution is not the same as the expected distribution." [3]
* We can therefore accept the alternative hypothesis - that there *is* a relationship between occupation and neighbourhood in the city




#### **References**
[1] Wikipedia; Chi-squared test; https://en.wikipedia.org/wiki/Chi-squared_test <br>
[2] Eck, David and Ryan, Jim; The Chi Square Statistic; https://math.hws.edu/javamath/ryan/ChiSquare.html <br>
[3] Frost, Jim; Chi-Square Test of Independence and an Example; https://statisticsbyjim.com/hypothesis-testing/chi-square-test-independence-example/ 

In [28]:
# Import chi2_contingency function from the scipy.stats module
from scipy.stats import chi2_contingency
# Import numpy to generate array representing table values
import numpy as np

In [29]:
# Assign 3x4 array of observed values to variable obs
obs = np.array([[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]])

In [30]:
# Use chi2_contingency function to generate Chi-squared value, p-value, 
# degrees of freedom and array of expected frequencies
chi2, p, dof, ex = chi2_contingency(obs, correction=False)

In [31]:
# p-value
p

0.0004098425861096696

In [32]:
# Chi-squared value
chi2

24.5712028585826

In [33]:
# Degrees of freedom
dof

6

In [34]:
# Array of expected value
ex

array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
       [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
       [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]])

<br>

##### end

### Task 3 - Standard Deviation
***

#### Research the Microsoft Excel functions `STDEV.S` and `STDEV.P`, writing a note about the difference between them. Then use numpy to perform a simulation demonstrating that  the`STDEV.S` calculation is a better estimate than `STDEV.P` for the standard deviation of a population when performed on a sample.

<br>

Both `STDEV.S` and `STDEV.P`, calculate the standard deviation of an array of numbers. The difference between them lies in the nature of what these numbers (or data) represent. If the data represents a *population*, `STDEV.P` is used, while `STDEV.S` is used on a *sample* of the population. 

**x bar and mu!!**

In [35]:
import numpy as np


In [36]:
rng = np.random.default_rng()

In [37]:
mu = 20
sigma = 10
population = np.random.default_rng().normal(mu, sigma, 100000)

In [38]:
x = np.random.choice(population, 5000)

In [39]:
len(x)

5000

In [40]:
std_p = np.sqrt(np.sum((x - np.mean(x))**2)/len(x))

In [41]:
std_s = np.sqrt(np.sum((x - np.mean(x))**2)/len(x) - 1)

In [42]:
def stdev(population):
    std_list = []
    std2_list = []
    for i in range(10000):
        x = np.random.choice(population, 5000)
        std = np.sqrt(np.sum((x - np.mean(x))**2)/len(x))
        std2 = np.sqrt(np.sum((x - np.mean(x))**2)/len(x) -1)
        std_list.append(std)
        std2_list.append(std2)
    return std_list, std2_list

In [43]:
std_list, std2_list = stdev(population)

In [44]:
result1 = sum(std_list)/len(std_list)

In [45]:
result2 = sum(std2_list)/len(std2_list)

In [46]:
result1

10.033283411886718

In [47]:
result2

9.983319930312945

In [48]:
pop = abs(sigma - result1)

In [49]:
sample = abs(sigma - result2)

In [50]:
pop

0.03328341188671757

In [51]:
sample

0.016680069687055266