
================================================================================================================
## Module: Machine Learning
### Lecturer: Dr. Ian McLoughlin
### Student: Vitalis Smirnovs
### Student ID: G00317774
================================================================================================================


# Task1
## Computing Square Roots
#### October 5th, 2020: Write a Python function called sqrt2 that calculates and prints to the screen the square root of 2 to 100 decimal places.

===============================================================================================================

Methods of computing square roots are numerical analysis algorithms for finding the principal, or non-negative, square root (usually denoted $\sqrt{S}$, $\sqrt[2]{S}$, or $S^{1/2}$) of a real number. Arithmetically, it means given $S$, a procedure for finding a number which, when multiplied by itself, yields $S$; algebraically, it means a procedure for finding the non-negative root of the equation $x^2 - S = 0$; geometrically, it means given the area of a square, a procedure for constructing a side of the square.

Every positive real number has two square roots, one positive and one negative. The principal square root of most numbers is an irrational number with an infinite decimal expansion. As a result, the decimal expansion of any such square root can only be computed to some finite-precision approximation. However, even if we are taking the square root of a perfect square integer, so that the result does have an exact finite representation, the procedure used to compute it may only return a series of increasingly accurate approximations.<a href = "https://en.wikipedia.org/wiki/Square_root">[1]</a>
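
To make "finite-precision approximation" concrete, the sketch below compares Python's built-in float with the standard-library `decimal` module. The working precision of 50 significant digits is an arbitrary choice for illustration, and the variable names are assumptions of this example.

```python
import math
from decimal import Decimal, getcontext

# A binary float carries roughly 15 to 17 significant decimal digits.
float_root = math.sqrt(2)

# decimal lets us choose the working precision (50 significant digits here).
getcontext().prec = 50
decimal_root = Decimal(2).sqrt()

print(float_root)
print(decimal_root)

# Neither approximation squares back to exactly 2; the error only gets smaller.
print(float_root * float_root - 2)
print(decimal_root * decimal_root - 2)
```

Raising the precision shrinks the error but never removes it, which is why the task asks for a fixed number of decimal places.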

The continued fraction representation of a real number can be used instead of its decimal or binary expansion and this representation has the property that the square root of any rational number (which is not already a perfect square) has a periodic, repeating expansion, similar to how rational numbers have repeating expansions in the decimal notation system.
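
For √2 the periodic expansion is [1; 2, 2, 2, ...]. The following sketch evaluates truncations of that expansion as exact fractions; the function name `sqrt2_convergent` is hypothetical and chosen only for this example.

```python
from fractions import Fraction

def sqrt2_convergent(depth):
    """Return the value of [1; 2, 2, ..., 2] with `depth` twos as an exact fraction."""
    value = Fraction(0)
    # Build the nested fraction from the innermost term outwards.
    for _ in range(depth):
        value = 1 / (2 + value)
    return 1 + value

for depth in range(6):
    convergent = sqrt2_convergent(depth)
    print(depth, convergent, float(convergent))
```

Each extra term gives a better rational approximation of √2, alternating between slightly below and slightly above the true value.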

The most common analytical methods are iterative and consist of two steps: finding a suitable starting value, followed by iterative refinement until some termination criterion is met. The starting value can be any number, but fewer iterations will be required the closer it is to the final result. The most familiar such method, and the one most suited to programmatic calculation, is Newton's method, which is based on a property of the derivative in calculus. A few methods, like paper-and-pencil synthetic division and series expansion, do not require a starting value. In some applications, an integer square root is required, which is the square root rounded or truncated to the nearest integer (a modified procedure may be employed in this case).<a href = "https://stackoverflow.com/questions/3047012/how-to-perform-square-root-without-using-math-module">[2]</a>
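
The two steps, a starting value followed by refinement until a termination criterion is met, can be sketched with ordinary floats. This is a minimal illustration rather than the method used for the 100-digit result; the function name, default guess, tolerance and iteration cap are all assumptions of this example.

```python
def newton_sqrt(s, guess=1.0, tolerance=1e-12, max_iterations=50):
    """Approximate sqrt(s) for s > 0 with Newton's method applied to f(x) = x**2 - s."""
    x = guess
    for iteration in range(1, max_iterations + 1):
        # Newton update x - f(x) / f'(x), which simplifies to the average of x and s / x.
        next_x = 0.5 * (x + s / x)
        # Termination criterion: stop once successive guesses barely change.
        if abs(next_x - x) < tolerance:
            return next_x, iteration
        x = next_x
    return x, max_iterations

root, steps = newton_sqrt(2.0)
print(root, steps)
```

Starting closer to the true root reduces the number of iterations, as the paragraph above notes.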

Procedures for finding square roots (particularly the square root of 2) have been known since at least the period of ancient Babylon in the 17th century BCE. Heron's method from first-century Egypt was the first ascertainable algorithm for computing a square root. Modern analytic methods began to be developed after the introduction of the Arabic numeral system to western Europe in the early Renaissance. Today, nearly all computing devices have a fast and accurate square root function, either as a programming language construct, a compiler intrinsic or library function, or as a hardware operator, based on one of the described procedures. <a href = "https://stackoverflow.com/questions/64278117/is-there-a-way-to-create-more-decimal-points-on-python-without-importing-a-libra">[3]</a>


In [1]:
# adapted from https://stackoverflow.com/questions/64278117/is-there-a-way-to-create-more-decimal-points-on-python-without-importing-a-libra
def sqroot(number):
    # scale the number by 10**200 so its integer square root carries 100 decimal places
    s=number*10**200 
    # choose a first guess at or above the true root (holds whenever s >= 4)
    x=s//2
    # while the guess squared is still larger than the scaled number:
    while x**2>s:
        # apply Newton's update with integer (floor) division
        x=(x+s//x)//2
    # print the scaled integer result unformatted
    print("final x ", x)
    # print the final result with the decimal point restored
    print(f'{x // 10**100}.{x % 10**100:0100d}')
# call the sqroot function for the number 2
sqroot(2)


final x  14142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727
1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727
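
An independent check on any 100-digit value of √2 is the standard-library integer square root. The sketch below assumes Python 3.8 or later, where `math.isqrt` is available; note that it truncates the last digit rather than rounding it.

```python
import math

# floor(sqrt(2 * 10**200)) is sqrt(2) scaled by 10**100 and truncated to an integer.
scaled_root = math.isqrt(2 * 10**200)

# Restore the decimal point: one digit before it, 100 zero-padded digits after it.
digits = f"{scaled_root // 10**100}.{scaled_root % 10**100:0100d}"
print(digits)

# Sanity check: the scaled root brackets the true value.
print(scaled_root**2 <= 2 * 10**200 < (scaled_root + 1)**2)
```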


### References
[1] https://en.wikipedia.org/wiki/Square_root

[2] https://stackoverflow.com/questions/3047012/how-to-perform-square-root-without-using-math-module

[3] https://stackoverflow.com/questions/64278117/is-there-a-way-to-create-more-decimal-points-on-python-without-importing-a-libra

## Task2

### The Chi-squared test for independence


In this task I use the chi-squared test on categorical data. Suppose a city of 1,000,000 residents has four neighborhoods: A, B, C and D. A random sample of 650 residents is taken and each person's occupation is recorded as "white collar", "blue collar", or "no collar". The chi-squared test for independence is a statistical hypothesis test of whether two categorical variables are independent. scipy.stats is then used to verify the values calculated by hand.

The data are tabulated below. The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification:

|              	|  A  	|  B  	|  C  	|  D  	| total 	|
|:------------:	|:---:	|:---:	|:---:	|:---:	|:-----:	|
| White collar 	|  90 	|  60 	| 104 	|  95 	|   349 	|
| Blue collar  	|  30 	|  50 	|  51 	|  20 	|   151 	|
| No collar    	|  30 	|  40 	|  45 	|  35 	|   150 	|
| Total        	| 150 	| 150 	| 200 	| 150 	|  650  	|


We use the share of the sample living in neighborhood A, 150/650, to estimate what proportion of the whole 1,000,000 live in neighborhood A. Similarly, we use 349/650 to estimate what proportion of the 1,000,000 are white-collar workers. Under the null hypothesis of independence, we should "expect" the number of white-collar workers in neighborhood A to be



$$150 \times \frac{349}{650} \approx 80.54$$


Then in that "cell" of the table, we have 

$$\frac{(\text{observed} - \text{expected})^{2}}{\text{expected}}=\frac{(90 - 80.54)^{2}}{80.54}\approx 1.11$$
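
The arithmetic for this single cell can be reproduced directly. A minimal sketch using the totals from the table; the variable names are chosen here for illustration only.

```python
row_total = 349      # white-collar workers in the sample
column_total = 150   # sampled residents of neighborhood A
grand_total = 650    # total sample size
observed = 90        # white-collar workers observed in neighborhood A

# Expected count under independence: column total times the row's share of the sample.
expected = column_total * row_total / grand_total

# This cell's contribution to the chi-squared statistic.
contribution = (observed - expected) ** 2 / expected

print(round(expected, 2), round(contribution, 2))
```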


The sum of these quantities over all of the cells is the test statistic; in this case, $\approx 24.6$. Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is


$$(\text{number of rows} - 1)(\text{number of columns} - 1) = (3 - 1)(4 - 1) = 6$$


If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.
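
One common way to decide what counts as "improbably large" is to compare the statistic with a critical value of the chi-squared distribution. The sketch below assumes a significance level of 0.05, which the text above does not fix, and uses the rounded statistic of 24.6.

```python
from scipy.stats import chi2

alpha = 0.05       # significance level (an assumption of this example)
dof = 6            # (3 - 1) * (4 - 1)
statistic = 24.6   # rounded test statistic quoted in the text

# Critical value: the point that a chi-squared variable exceeds with probability alpha.
critical_value = chi2.ppf(1 - alpha, dof)

# Reject the null hypothesis of independence if the statistic exceeds the critical value.
reject_null = statistic > critical_value
print(critical_value, reject_null)
```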

A related issue is a test of homogeneity. Suppose that instead of giving every resident of each of the four neighborhoods an equal chance of inclusion in the sample, we decide in advance how many residents of each neighborhood to include. Then each resident has the same chance of being chosen as do all residents of the same neighborhood, but residents of different neighborhoods would have different probabilities of being chosen if the four sample sizes are not proportional to the populations of the four neighborhoods. In such a case, we would be testing "homogeneity" rather than "independence". The question is whether the proportions of blue-collar, white-collar, and no-collar workers in the four neighborhoods are the same. However, the test is done in the same way.<a href = "https://en.wikipedia.org/wiki/Chi-squared_test">[1]</a>
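
An informal first look at homogeneity is to compare the occupational shares within each neighborhood's sample. The sketch below only tabulates those shares from the observed counts; it is not a test in itself, and the dictionary layout is an assumption of this example.

```python
observed = {
    "A": {"White collar": 90, "Blue collar": 30, "No collar": 30},
    "B": {"White collar": 60, "Blue collar": 50, "No collar": 40},
    "C": {"White collar": 104, "Blue collar": 51, "No collar": 45},
    "D": {"White collar": 95, "Blue collar": 20, "No collar": 35},
}

shares_by_neighborhood = {}
for neighborhood, counts in observed.items():
    total = sum(counts.values())
    # Share of each occupation among the residents sampled in this neighborhood.
    shares_by_neighborhood[neighborhood] = {
        job: round(count / total, 3) for job, count in counts.items()
    }
    print(neighborhood, shares_by_neighborhood[neighborhood])
```

If the neighborhoods were homogeneous, the shares would be similar in every row, up to sampling noise.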






### Pearson's chi-squared test with Python: 


In [2]:
# python imported libraries
import numpy as np 
import pandas as pd
import scipy
from scipy.stats import chi2

In [3]:
# contingency table of observed counts (integers)
ar=np.array([[90,60,104,95],[30,50,51,20],[30,40,45,35]])    
df=pd.DataFrame(ar, columns=["A", "B", "C", "D"])
df.index=["White collar", "Blue collar", "No collar"] 
df

Unnamed: 0,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35


In [4]:
df2=df.copy() # contingency table with the marginal totals and the grand total. 
df2.loc['Column_Total']= df2.sum(numeric_only=True, axis=0)
df2.loc[:,'Row_Total'] = df2.sum(numeric_only=True, axis=1)
df2

Unnamed: 0,A,B,C,D,Row_Total
White collar,90,60,104,95,349
Blue collar,30,50,51,20,151
No collar,30,40,45,35,150
Column_Total,150,150,200,150,650


In [5]:

n=df2.at["Column_Total", "Row_Total"]  # grand total 
# new empty data frame to record expected values
exp=pd.DataFrame(columns=["A", "B", "C", "D"],index=["White collar", "Blue collar", "No collar"] ,dtype=float )

# loop over cells to calculate expected values
for x in exp.index:
    for y in exp.columns:
        # expected value = row total * column total / grand total
        exp.at[x,y]=float(df2.at[x, "Row_Total"] * df2.at["Column_Total", y] / n)

exp        


Unnamed: 0,A,B,C,D
White collar,80.538462,80.538462,107.384615,80.538462
Blue collar,34.846154,34.846154,46.461538,34.846154
No collar,34.615385,34.615385,46.153846,34.615385
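
The same table of expected counts can also be obtained without an explicit loop. A minimal sketch, assuming only NumPy and the observed counts from above: the outer product of the row and column totals, divided by the grand total, gives every cell at once.

```python
import numpy as np

observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])

row_totals = observed.sum(axis=1)      # one total per occupation
column_totals = observed.sum(axis=0)   # one total per neighborhood

# expected[i, j] = row_totals[i] * column_totals[j] / grand total
expected = np.outer(row_totals, column_totals) / observed.sum()
print(expected.round(6))
```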


In [6]:
print(df)
print(exp)
chi_sq=0  # running total; named chi_sq so that Python's built-in sum() is not shadowed
for x in exp.index:
    for y in exp.columns:
        # add this cell's contribution: (observed - expected)^2 / expected
        chi_sq = chi_sq + ((df.at[x,y] - exp.at[x,y])**2/(exp.at[x,y]) ) 
        
print('sum :',chi_sq)


               A   B    C   D
White collar  90  60  104  95
Blue collar   30  50   51  20
No collar     30  40   45  35
                      A          B           C          D
White collar  80.538462  80.538462  107.384615  80.538462
Blue collar   34.846154  34.846154   46.461538  34.846154
No collar     34.615385  34.615385   46.153846  34.615385
sum : 24.5712028585826


In [7]:
DOF = (len(df.columns)-1)*(len(df.index)-1) # degrees of freedom 
DOF

6

In [8]:
pval=1-chi2.cdf(chi_sq, DOF) # p-value: 1 minus the cumulative distribution function at the test statistic
pval

0.0004098425861096544
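
For an even number of degrees of freedom the upper-tail probability has a closed form, which gives a dependency-free check on this p-value. The sketch below is an illustration only; the helper name `chi2_sf_even_dof` is hypothetical. In practice `scipy.stats.chi2.sf` computes the same upper tail directly.

```python
import math

def chi2_sf_even_dof(x, dof):
    """Upper-tail probability P(X > x) for a chi-squared variable with even `dof`."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form only covers positive even degrees of freedom")
    half = x / 2
    # exp(-x/2) times the first dof/2 terms of the series for exp(x/2)
    return math.exp(-half) * math.fsum(half**i / math.factorial(i) for i in range(dof // 2))

p_value = chi2_sf_even_dof(24.5712028585826, 6)
print(p_value)
```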

In [9]:
from scipy.stats import chi2_contingency # Scipy's built-in function

tstat_scipy,pval_scipy,ddof_scipy,exp_scipy=chi2_contingency(df, correction=False) # "correction=False" means no Yates' correction is used 
print("Chi-squared test statistic without Yates correction (Scipy): " + str(tstat_scipy))
print("P-value without Yates correction (Scipy): " + str(pval_scipy))

Chi-squared test statistic without Yates correction (Scipy): 24.5712028585826
P-value without Yates correction (Scipy): 0.0004098425861096696
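
Beyond the overall statistic, the per-cell contributions show which combinations of occupation and neighborhood depart most from independence. A minimal NumPy sketch under the same observed counts; the variable names are assumptions of this example.

```python
import numpy as np

observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])
rows = ["White collar", "Blue collar", "No collar"]
columns = ["A", "B", "C", "D"]

expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Each cell's share of the chi-squared statistic.
contributions = (observed - expected) ** 2 / expected
statistic = contributions.sum()

# Locate the cell with the largest contribution.
i, j = np.unravel_index(contributions.argmax(), contributions.shape)
print(statistic, rows[i], columns[j], contributions[i, j])
```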


## Chi-squared test with Yates correction:

##### The steps are the same as above, except that the following adjusted formula is used to determine the test statistic:


$$\chi^2_{\text{Yates}} = \sum_{i} \frac{(|O_i - E_i|-0.5)^2}{E_i}$$

In [10]:
df

Unnamed: 0,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35


In [11]:
exp


Unnamed: 0,A,B,C,D
White collar,80.538462,80.538462,107.384615,80.538462
Blue collar,34.846154,34.846154,46.461538,34.846154
No collar,34.615385,34.615385,46.153846,34.615385


In [12]:
dof = (len(df.columns)-1)*(len(df.index)-1)
dof

6

In [13]:
# Apply Yates' correction by subtracting 0.5 from the absolute difference between observed and expected counts: 
tstat_yates= np.sum((((np.abs(df-exp)-0.5)**2)  / (exp)).values)
print("Chi-squared test statistic with Yates correction: " + str(tstat_yates))

pval=1-chi2.cdf(tstat_yates, dof)
print("P-value with Yates correction: " + str(pval))

Chi-squared test statistic with Yates correction: 22.630576336573963
P-value with Yates correction: 0.0009301351063994989


In [14]:

from scipy.stats import chi2_contingency
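# SciPy applies Yates' correction only when there is 1 degree of freedom; this table has 6,
# so correction=True has no effect and the values below match the uncorrected test
tstat_scipy,pval_scipy,ddof_scipy,exp_scipy=chi2_contingency(df, correction=True)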
print("Chi-squared test statistic with Yates correction (Scipy): " + str(tstat_scipy))
print("P-value with Yates correction (Scipy): " + str(pval_scipy))

Chi-squared test statistic with Yates correction (Scipy): 24.5712028585826
P-value with Yates correction (Scipy): 0.0004098425861096696
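
The SciPy values above equal the uncorrected ones because `chi2_contingency` only applies Yates' correction when the table has one degree of freedom. The sketch below illustrates this with the 3x4 table from this notebook and a made-up 2x2 table; the 2x2 counts are invented purely for the demonstration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# The 3x4 table from this notebook (dof = 6): the flag makes no difference.
table_3x4 = np.array([[90, 60, 104, 95],
                      [30, 50, 51, 20],
                      [30, 40, 45, 35]])
stat_off, _, dof_3x4, _ = chi2_contingency(table_3x4, correction=False)
stat_on, _, _, _ = chi2_contingency(table_3x4, correction=True)
print(dof_3x4, stat_off == stat_on)

# A made-up 2x2 table (dof = 1): here the correction lowers the statistic.
table_2x2 = np.array([[12, 5],
                      [7, 9]])
stat_2x2_off, _, dof_2x2, _ = chi2_contingency(table_2x2, correction=False)
stat_2x2_on, _, _, _ = chi2_contingency(table_2x2, correction=True)
print(dof_2x2, stat_2x2_off, stat_2x2_on)
```

This is why the manual Yates-adjusted statistic computed earlier differs from SciPy's output for this table.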


# References:
<a href = "https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/">[1]</a>
<a href = "https://pythonexamples.org/pandas-create-initialize-dataframe/">[2]</a>
<a href = "https://www.medcalc.org/manual/chi-square-table.php">[3]</a>
<a href = "https://github.com/BundleOfKent/Pearson-s-chi-squared-test-from-scratch/blob/master/Chi-squaredFromScratch_Medium.ipynb">[4]</a>
<a href = "https://medium.com/analytics-vidhya/pearsons-chi-squared-test-from-scratch-with-python-ba9e14d336c">[5]</a>
<a href = "https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html">[6]</a>

