## Machine Learning and Statistics Module 52954
## Tasks
### Lecturer: Ian McLaughlin
### Student : Fiona O'Riordan
***

## Task 1
***

## Objective: 

October 5th, 2020: Write a Python function called sqrt2 that calculates and prints to the screen the square root of 2 to 100 decimal places. Your code should not depend on any module from the standard library or otherwise. You should research the task first and include references and a description of your algorithm.

## Background

### Methods for calculating square roots are approximations.

Methods for calculating square roots seek to identify the non-negative square root of a real number $S$, commonly denoted $\sqrt{S}$, $\sqrt[2]{S}$ or $S^{1/2}$ [1]. Real numbers comprise all positive and negative numbers and zero: the integers (whole numbers), the rational numbers (which can be expressed as a fraction of two integers) and the irrational numbers (which cannot, and whose decimal representation is infinite and non-repeating, e.g. 3.1415926535...). On a computer, real numbers with a fractional part are usually stored as floating point numbers, so called because the decimal point "floats" between the digits [2]. Irrational numbers, having an infinite decimal representation, can only be approximated by computers [2]. For most numbers the square root is irrational [1]. Moreover, even for a perfect square, whose square root has an exact finite representation, these methods generally return only a series of increasingly accurate approximations [1].


### Newton's Method

Analytical methods to calculate square roots are typically iterative and have two steps. First, an initial guess $s_0$ of the square root is provided, where $\sqrt{x}$ is the number sought. For the square root iteration described below, any positive number can serve as the guess. Second, each iteration refines the guess so that it is closer to $\sqrt{x}$, until the required accuracy is met or, for slowly converging algorithms, a predefined maximum number of iterations has been reached. The closer the initial guess is to $\sqrt{x}$, the fewer iterations are needed [1],[4].

Newton's method is one such approach and is used for this task because it is among the most widely used and is well suited to computation [1],[4]. Newton's method can be implemented to calculate the square root $s$ of a number $x$, where $x \gt 0$ and $s = \sqrt{x}$. Starting with an initial guess for the square root, $s_0$, the algorithm calculates a better guess using the formula
$$ s_{n+1} = s_n - \left ( \frac{s_n^2 - x}{2 s_n} \right ), $$
where $s_{n}$ is the previous estimate for the square root and $s_{n+1}$ is the updated approximation.

Therefore, to find the square root $s$ of a number $x$, we can iterate using the following equation [3, 4].
$$ s_{n+1} = s_n - \frac{s_n^2 - x}{2 s_n} $$
Rearranging the right-hand side gives the equivalent form $s_{n+1} = \frac{1}{2} \left ( s_n + \frac{x}{s_n} \right )$.
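As a minimal sketch of the update rule above, the following applies a handful of Newton steps to $x = 2$ using ordinary floats. The helper name `newton_step`, the starting guess of 1.0 and the choice of five steps are assumptions made purely for illustration.

```python
def newton_step(s, x):
    """Return the next Newton estimate of sqrt(x), given the current estimate s."""
    return s - (s * s - x) / (2 * s)


# Start from an illustrative guess and record each refined estimate.
s = 1.0
estimates = []
for _ in range(5):
    s = newton_step(s, 2)
    estimates.append(s)

for n, value in enumerate(estimates, start=1):
    print("step", n, "estimate", value)
```

Each step roughly doubles the number of correct digits, which is why only a few iterations are needed before float precision becomes the limiting factor.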

 

## Implement the function.


In [1]:
def sqrt(x):
    """
    A function to calculate the square root of a number.
    """
    # Tolerance: stop once s squared is within this distance of x.
    precision = 10**-10
    # Set the initial guess for the square root s.
    # True division is used so that the guess is never 0 for small x such as 1.
    s = x / 2
    # iteration number
    i = 1
    # Loop while the absolute difference between x and (s^2) is greater than the precision/accuracy required.
    while abs(x - (s ** 2)) > precision:
        # Calculate the next better guess for the square root.
        # First calculate the difference between the current guess and the next guess so that we can view it.
        diff = (s * s - x) / (2 * s)
        s -= diff
        print("Loop:", i, "approximation", s, "difference from previous value", diff)
        i += 1
    # Return the square root of x (approximation).
    return s


### Test the function


In [2]:
myans = sqrt(2)
print("Square root of 2: ", myans)
# Count the digits in the printed representation of the result.
num1 = sum(c.isdigit() for c in str(myans))
print("Number of digits returned using Newton's method is", num1)

Loop: 1 approximation 1.5 difference from previous value -0.5
Loop: 2 approximation 1.4166666666666667 difference from previous value 0.08333333333333333
Loop: 3 approximation 1.4142156862745099 difference from previous value 0.002450980392156932
Loop: 4 approximation 1.4142135623746899 difference from previous value 2.123899819940621e-06
Square root of 2:  1.4142135623746899
Number of digits returned using Newton's method is 17


### Evaluate the result

The call sqrt(2) returns a value that is displayed with only 17 significant digits. As the Python documentation [6] explains, and as discussed in the earlier section 'Methods for calculating square roots are approximations', programming languages store only an approximation of most numbers. In Python, at most 17 significant digits of that approximation are displayed [6].

According to the Python documentation [6], floating point numbers are represented as base 2 (binary) fractions. For example, the decimal (base 10) fraction 0.125 has the value  
0.125 = 1/10 + 2/100 + 5/1000  
Similarly, the binary (base 2) fraction 0.001 has the value  
0.001 = 0/2 + 0/4 + 1/8.

However, most decimal fractions cannot be represented exactly as binary fractions. Therefore decimal floating point numbers are only **approximated** by the binary floating point numbers actually stored by the computer [6].

The Python documentation gives the example of 1/10 and explains that the decimal value 0.1 cannot be represented exactly as a base 2 fraction, no matter how many base 2 digits are used. In base 2, the value 1/10 is an infinitely repeating fraction:  
0.0001100110011001100110011001100110011001100110011...

While 1/10 is displayed as 0.1 by default, the value actually stored is the nearest representable binary fraction. If we format 0.1 to 100 decimal places, we see that the stored approximation has 55 decimal digits before the remaining places are padded with zeros.

In [3]:
print(0.1)
# Format 0.1 to 100 decimal places, strip the zero padding and the leading "0.",
# then count the digits of the stored approximation.
mystring = format(0.1, ".100f").rstrip("0")[2:]
len(mystring)

0.1


55

The Python documentation [6] explains that a float is typically approximated by a binary fraction whose numerator uses the first 53 bits, starting with the most significant bit, and whose denominator is a power of two. For 0.1 the binary fraction is 3602879701896397 / 2 ** 55.

In [4]:
3602879701896397 / 2 ** 55

0.1
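The stored fraction can also be inspected directly. The sketch below uses `float.as_integer_ratio()` and `fractions.Fraction`, both from Python itself; it is included only to illustrate the point about approximation and is not part of the task solution, which must avoid library modules.

```python
from fractions import Fraction

# Ask Python for the exact numerator and denominator stored for 0.1.
num, den = (0.1).as_integer_ratio()
print(num, den)

# Compare the stored value with the true fraction 1/10 using exact arithmetic.
exact = Fraction(num, den)
print(exact == Fraction(1, 10))
print(float(exact - Fraction(1, 10)))
```

The comparison shows that the stored value is close to, but not equal to, one tenth.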

With this insight in mind, let's see whether we can view the square root of 2 to 100 places.

In [5]:
# Print the result to 100 decimal places.
print(format(myans, '.100f'))
 

1.4142135623746898698271934335934929549694061279296875000000000000000000000000000000000000000000000000


In [6]:
# Strip the zero padding and the decimal point, then count the remaining digits.
digits = format(myans, '.100f').rstrip('0').replace('.', '')
num2 = len(digits)
print("Number of significant digits in my answer:", num2)

Number of significant digits in my answer: 53


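We now see 53 significant digits, but this does not answer the question we were asked: these are the exact decimal digits of the stored binary approximation, not 100 correct decimal places of the square root of 2.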


## Modify the approach.


However, if we use the Python library 'decimal' and its 'sqrt' method with the precision set to 101 significant digits, we can see that Python is able to calculate the square root of 2 to 100 decimal places. This relies on a standard library module, so it does not satisfy the task, but it shows that writing a Python function called sqrt2 that calculates and prints the square root of 2 to 100 decimal places is feasible, and it gives us a value to compare against.


In [7]:
# https://docs.python.org/2/library/decimal.html [7]
from decimal import Decimal, getcontext
# Set precision to 101 because precision counts significant digits, not decimal places.
getcontext().prec = 101
sqrt2byDecimal = Decimal(2).sqrt()
print(sqrt2byDecimal)
print(type(sqrt2byDecimal))

# https://stackoverflow.com/questions/24878174/how-to-count-digits-letters-spaces-for-a-string-in-python [8]
import re
numofdigits = len(re.sub("[^0-9]", "", str(sqrt2byDecimal)))
print("Number of significant digits is ", numofdigits)


1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727
<class 'decimal.Decimal'>
Number of significant digits is  101


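So the square root of 2 to 100 decimal places can be computed in Python. Stack Overflow [10] suggests: "multiply your number by 10**200 and use Newton's method to get the integer square root. Then insert a decimal point in the right place". In other words, the integer square root of 2 * 10**200 is the square root of 2 scaled by 10**100 and truncated to a whole number, so its digits are the first 101 significant digits of the square root of 2. Because Python integers have arbitrary precision, no digits are lost, and the result can be printed as a string with the decimal point inserted after the first digit.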

In [17]:
def sqrt2():
    """
    A function to calculate the square root of 2 to 100 decimal places.
    """
    # Scale 2 by 10**200 so that its integer square root holds the first 101 digits of sqrt(2).
    x = 2 * 10**200
    # Set the initial guess for the square root; it is deliberately larger than the true root.
    s = x // 2
    # Loop while s squared is still greater than x.
    # No tolerance is needed now because the arithmetic is exact integer arithmetic.
    while s**2 > x:
        # Newton's update s - (s*s - x) / (2*s), rearranged to use integer (floor)
        # division only, in order to avoid the float overflow error.
        s = (s + x // s) // 2
    # Convert the integer square root of 2 * 10**200 to a string
    # and insert the decimal point after the first digit.
    s = str(s)
    return s[0] + "." + s[1:]

sqrt2()


'1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727'
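The same scaling idea can be sketched for other inputs. The helper below, `sqrt_digits`, is a hypothetical generalisation written for illustration only and is not part of the submitted solution: it assumes a non-negative integer `n` and `places >= 1`, starts the integer Newton iteration from the scaled value itself, and truncates rather than rounds the final digit.

```python
def sqrt_digits(n, places):
    """Return sqrt(n), truncated to `places` decimal places, as a string.

    Assumes n is a non-negative integer and places >= 1.
    """
    # Scale so that the integer square root carries the required decimal digits.
    x = n * 10 ** (2 * places)
    # Any starting value at or above the true root works; x itself is the simplest.
    s = x
    # Integer Newton iteration: stops at the largest s with s*s <= x.
    while s * s > x:
        s = (s + x // s) // 2
    # Pad with leading zeros so there is always a digit before the decimal point.
    digits = str(s).zfill(places + 1)
    return digits[:-places] + "." + digits[-places:]


print(sqrt_digits(2, 100))
print(sqrt_digits(9, 3))
```

Starting from `x` rather than `x // 2` costs extra iterations but keeps the sketch valid for small inputs such as 1.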

This Python function, sqrt2, calculates the square root of 2 to 100 decimal places and returns it as a string, which the notebook displays on the screen. Finally, I am going to benchmark the result.





## Benchmark the Result


Compare to NASA [9]:

In [23]:
# https://apod.nasa.gov/htmltest/gifcity/sqrt2.1mil [9]
sqrt2byNasa = "1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727"
# Check that I have 100 decimal places, i.e. 101 significant digits, by counting the digits.
print("number of significant digits", len(sqrt2byNasa.replace(".", "")))



number of significant digits 101


In [33]:
taskanswer = sqrt2()
if sqrt2byNasa == taskanswer:
    print("Square root of 2 to 100 decimal places is equal to NASA's number with 100 decimal places displayed.")

Square root of 2 to 100 decimal places is equal to NASA's number with 100 decimal places displayed.
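The result can also be checked without any external reference, using only integer arithmetic. This is a sketch of one possible self-check, not something the task requires: it takes the 101 digits shown above as an integer `s` and confirms that `s` squared does not exceed 2 * 10**200 while `(s + 1)` squared does, which is what it means for the digits to be the truncated square root.

```python
# The 100-decimal-place value shown above, copied as a string.
digits = "1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727"

# Remove the decimal point to recover the scaled integer square root.
s = int(digits.replace(".", ""))
target = 2 * 10**200

# s is the truncated root exactly when s*s <= target < (s+1)*(s+1).
is_truncated_root = s * s <= target < (s + 1) * (s + 1)
print("Digits are the truncated square root of 2:", is_truncated_root)
```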


## Conclusion

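A function based on Python floats cannot produce the square root of 2 to 100 decimal places, because a float stores only a binary approximation with 53 significant bits. Scaling 2 by 10**200 and applying Newton's method with integer arithmetic avoids this limit without importing any module. The resulting sqrt2 function returns a value that matches both the decimal module and the digits published by NASA [9] to 100 decimal places.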

### References: 
[1] Methods of computing square roots; Wikipedia; https://en.wikipedia.org/wiki/Methods_of_computing_square_roots  
[2] Real Number Definition; TechTerms; https://techterms.com/definition/realnumber  
[3] A Tour of Go; Exercise: Loops and Functions; https://tour.golang.org/flowcontrol/8  
[4] Newton's method; Wikipedia; https://en.wikipedia.org/wiki/Newton%27s_method  
[5] Find root of a number using Newton’s method; GeeksforGeeks; https://www.geeksforgeeks.org/find-root-of-a-number-using-newtons-method/  
[6] Floating Point Arithmetic: Issues and Limitations; Python Software Foundation; https://docs.python.org/3/tutorial/floatingpoint.html  
[7] decimal — Decimal fixed point and floating point arithmetic; The Python Standard Library, Numeric and Mathematical Modules; Python Software Foundation; https://docs.python.org/3/library/decimal.html  
[8] How to count digits, letters, spaces for a string in Python?; answer by Óscar López; Stack Overflow; https://stackoverflow.com/a/24878232  
[9] The square root of 2; NASA; https://apod.nasa.gov/htmltest/gifcity/sqrt2.1mil  
[10] How to get the square root of a number to 100 decimal places without using any libaries or modules; Stack Overflow; https://stackoverflow.com/questions/64295245/how-to-get-the-square-root-of-a-number-to-100-decimal-places-without-using-any-l



---

## Task 2

## Objective

November 2nd, 2020: The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article gives the table below as an example [4], stating the Chi-squared value based on it is approximately 24.6. Use scipy.stats to verify this value and calculate the associated p value. You should include a short note with references justifying your analysis in a markdown cell.

|              | A   | B   | C   | D   | Total |
|--------------|-----|-----|-----|-----|-------|
| White Collar | 90  | 60  | 104 | 95  | 349   |
| Blue Collar  | 30  | 50  | 51  | 20  | 151   |
| No Collar    | 30  | 40  | 45  | 35  | 150   |
| Total        | 150 | 150 | 200 | 150 | 650   |

## Background

###  The Chi Squared Test


The Chi Squared test of independence tests for dependence between categorical variables. The test is an omnibus test and therefore it tests "whether the explained variance in a set of data is significantly greater than the unexplained variance, overall". In other words, the Chi Squared test checks whether the distributions of categorical variables differ from one another.

The Chi Squared test uses a cross-tabulated table (as exemplified in [1]) and analyses the independence between the row variable and the column variable.


In theory, if the observed and expected values were equal then the Chi Squared statistic would be zero, but this is unlikely to happen in the real world. Determining whether a Chi Squared test statistic is large enough to indicate a statistically significant difference is not that straightforward [3].


### Wikipedia example to be calculated 

Suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification [1]. 

To estimate what proportion of the whole 1,000,000 live in neighborhood A, the sample proportion 150/650 is used, and similarly 349/650 is used to estimate the proportion of the 1,000,000 that are white collar workers [1].

As Wikipedia [1] explains, assuming independence under the null hypothesis, we would expect the number of white-collar workers in neighborhood A to be:
\begin{equation*} 
150 \times\frac{349}{650} \approx 80.54
\end{equation*}

Then in that "cell" of the table, we have

\begin{equation*}
\frac{(\text{observed} - \text{expected})^2}{\text{expected}} =
\frac{(90 - 80.54)^2}{80.54}
\approx 1.11
\end{equation*}

The sum of these quantities over all of the cells is the test statistic, which is approximately 24.6 [1]. Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is

\begin{equation*}
(\text{rows} - 1) \times (\text{columns} - 1) = (3 - 1) \times (4 - 1) = 2 \times 3 = 6
\end{equation*}
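The calculation described above can be sketched by hand in plain Python before turning to scipy.stats. The variable names below are illustrative; the counts are those from the table in the objective, and each expected count is taken as row total times column total divided by the sample size.

```python
# Observed counts: rows are White, Blue and No Collar; columns are neighborhoods A to D.
observed = [
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell.
chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (obs - exp) ** 2 / exp

dof = (len(observed) - 1) * (len(observed[0]) - 1)
print("Chi-squared statistic: %.1f" % chi2_stat)
print("Degrees of freedom:", dof)
```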
 


 


#### Null Hypothesis (H0):

The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification.

#### Alternative Hypothesis (H1):

The alternative hypothesis is therefore that a person's neighborhood of residence is dependent on the person's occupational classification.

## Verify Using scipy.stats

### Apply scipy.stats.chi2_contingency to the data

In [1]:
# Create a crosstab with one inner list per row of data.
crosstab = [[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]]


In [6]:
# import the package scipy.stats chi2_contingency
from scipy.stats import chi2_contingency


In [39]:
# Perform the test; the outputs are the test statistic, p value, degrees of freedom and expected counts.
teststat, p, dof, expected = chi2_contingency(crosstab)

### Compare results


#### Compare 'cell': white-collar workers in neighbourhood A test statistic.

In [32]:

print("As Wikipedia shows and scipy.stats concurs, assuming independence under the null hypothesis, the expected number of white-collar workers in neighborhood A in the sample is:", expected[0][0],
      "but the observed number in the sample table is", crosstab[0][0])
cell00stat = ((crosstab[0][0] - expected[0][0])**2) / expected[0][0]
print('As Wikipedia shows, and as we can deduce from the expected counts calculated by scipy.stats, the test statistic for this cell is: %.2f' % cell00stat)

As Wikipedia shows and scipy.stats concurs, assuming independence under the null hypothesis, the expected number of white-collar workers in neighborhood A in the sample is: 80.53846153846153 but the observed number in the sample table is 90
As Wikipedia shows, and as we can deduce from the expected counts calculated by scipy.stats, the test statistic for this cell is: 1.11


#### Compare the Chi Squared statistic (sum of the test statistics for each cell) from Wikipedia with scipy.stats

In [38]:
# Print the Chi Squared test statistic.
print('Chi Squared Test Statistic from scipy.stats is %.1f (rounded to one decimal place)' % teststat)
print('which is the same result from Wikipedia')
# Print the p value.
print('The p value = %.10f' % p)
# Print the degrees of freedom.
print('The degrees of freedom value = %d' % dof)

Chi Squared Test Statistic from scipy.stats is 24.6 (rounded to one decimal place)
which is the same result from Wikipedia
The p value = 0.0004098426
The degrees of freedom value = 6


In [40]:
# Assume a significance level (alpha).
alpha = 0.05
print('With p value of p=%.10f and alpha set to alpha=%.10f' % (p, alpha))
if p <= alpha:
    print("Reject the null hypothesis (H0)")
else:
    print('Do not reject the null hypothesis (H0)')

With p value of p=0.0004098426 and alpha set to alpha=0.0500000000
Reject the null hypothesis (H0)


### Evaluate the result

The p value is 0.0004098425861096696, which is lower than alpha = 0.05 (the significance level I have assumed here). In other words, if a person's neighborhood were independent of their occupation, there would be only about a 0.04% chance of observing a test statistic at least as large as this one. With 6 degrees of freedom, a Chi Squared test statistic of approximately 24.6 is improbably large according to the chi-squared distribution. Therefore we reject the null hypothesis that each person's neighborhood of residence is independent of the person's occupational classification.
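To make the role of the degrees of freedom concrete, the p value can be reproduced without scipy. This sketch relies on a standard closed form that holds only when the degrees of freedom are even: the upper tail probability of a chi-squared variable with $2m$ degrees of freedom at $x$ is $e^{-x/2} \sum_{i=0}^{m-1} (x/2)^i / i!$. The function name `chi2_sf_even_dof` is an assumption for illustration, and the statistic is recomputed from the table so that the block stands alone.

```python
import math


def chi2_sf_even_dof(stat, dof):
    """Upper tail probability of a chi-squared variable; valid for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive, even number of degrees of freedom")
    half = stat / 2
    term = 1.0   # (x/2)^i / i! for i = 0
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Recompute the test statistic from the observed table.
observed = [[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]]
rows = [sum(r) for r in observed]
cols = [sum(c) for c in zip(*observed)]
n = sum(rows)
stat = sum((observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
           for i in range(len(rows)) for j in range(len(cols)))

p_value = chi2_sf_even_dof(stat, 6)
print("p value = %.10f" % p_value)
```

The value printed should agree with the p value reported by scipy.stats above to the digits shown.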

### References

[1] Wikipedia contributors, “Chi-squared test — Wikipedia, the free encyclopedia,” 2020, [Online; accessed 1-November-2020]. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Chi-squaredtest&oldid=983024096  
[2] scipy.stats.chi2_contingency; Scipy Docs https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html  
[3] Chi-Square Statistic: How to Calculate It / Distribution;Statistics How to; https://www.statisticshowto.com/probability-and-statistics/chi-square/
