# Tasks 2020

December 18th 2020

Eoin Lees - G00387888

--------------

In [1]:
# Import Functions

import numpy as np
import requests
from bs4 import BeautifulSoup
import pandas as pd
import sklearn.neighbors as nei
import sklearn.model_selection as mod


# Task 1: Calculate square root of 2

October 5th 2020
____________

Write a Python function called sqrt2 that calculates and prints to the screen the square root of 2 to 100 decimal places. Your code should not depend on any module from the standard library or otherwise.

----


## Introduction

Methods for calculating the square root of a number have long been studied in mathematics. Geometrically, the square root of 2 is the length of the hypotenuse of a right-angled triangle whose two other sides each have length 1. This triangle is shown below.

<img src="https://upload.wikimedia.org/wikipedia/commons/5/5c/Isosceles_right_triangle_with_legs_length_1.svg" alt="sqrt2" style="width: 200px;"/>

The square root can be calculated in many ways, such as:


* Babylonian Method
* Bakhshali Method
* Digit-by-digit calculation
* Taylor series
* Newton-Raphson Method

In this task we will look at the Newton-Raphson Method.



#### Newton-Raphson Method

"The Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function." [1,2]

To find the square root $z$ of a number $x$, we can iterate using the following equation.

$$ z_{next} = z - \frac{z^2 - x}{2z} $$
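As a rough illustration of how quickly this iteration converges, the sketch below records each successive value of $z$ for $x = 2$. The helper name `newton_sqrt_steps` and the fixed number of steps are assumptions made for this example only.

```python
def newton_sqrt_steps(x, steps=5):
    """Return the successive Newton-Raphson estimates of the square root of x."""
    z = x / 2  # initial guess: half of x
    history = [z]
    for _ in range(steps):
        # Apply the update z_next = z - (z^2 - x) / (2z)
        z = z - (z * z - x) / (2 * z)
        history.append(z)
    return history


estimates = newton_sqrt_steps(2.0)
for i, z in enumerate(estimates):
    print(i, z)
```

Each step roughly doubles the number of correct digits, so only a handful of iterations are needed before the limit of floating point precision is reached.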


https://en.wikipedia.org/wiki/Newton's_method

The code below implements this iteration:

In [2]:
# A function to calculate the square root of a number p

def sqrt1(p):
    '''
    Approximate the square root of p using the Newton-Raphson method.
    '''
    x = float(p)
    # Initial guess for square root z
    z = x / 2
    # Loop until z*z is within the tolerance of x
    while abs(x - (z*z)) > 0.000000000000001:
        # Newton-Raphson update
        z -= (z*z - x) / (2*z)
    # Return the approximate square root of x.
    return z



##### Tests of the function

The function was tested with some known values. 

In [3]:
# Test the function on 100.
sqrt1(100)

10.0

In [4]:
# Test the function on 2.
ans = sqrt1(2)
ans = str(ans)
ans

'1.4142135623730951'

In [5]:
# Check answer
print("The number of decimal places is:", (len(ans)-2) ) #Number of characters total - 2 = number of decimal places

The number of decimal places is: 16


With this method the result stops at 16 decimal places. This is a limit of Python's floating point numbers rather than of the Newton-Raphson method itself, so another approach must be investigated.
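The limit can be inspected directly from Python. The following is a minimal sketch using `sys.float_info` from the standard library, which is used here for illustration only and is not part of the solution to the task.

```python
import sys

# Number of bits in the significand of a Python float (53 for IEEE 754 doubles)
print(sys.float_info.mant_dig)

# Number of decimal digits a float is guaranteed to represent faithfully
print(sys.float_info.dig)

# A change far below the resolution of the float is simply lost
root = 1.4142135623730951
unchanged = (root + 1e-17 == root)
print(unchanged)
```

No amount of extra iteration can add digits beyond this resolution, which is why the next section avoids floats altogether.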

## Alternate Method

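A way to get around this limit is to work with integers, which Python can store to any size. The number we need the square root of is multiplied by $100^{100}$ (that is, $10^{200}$), so the integer square root of the result is the square root of the original number multiplied by $10^{100}$.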

The brief states that all that is required is that the algorithm `prints to the screen the square root of 2 to 100 decimal places`

Taking advantage of this statement we can convert the integer result of the algorithm into a string. Then using string manipulation we can insert a decimal point in the correct location and print the square root of 2 to 100 decimal places.
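The idea is easier to follow at a smaller scale. The sketch below applies the same scaling and string manipulation for an arbitrary number of decimal places; the function name `scaled_sqrt` and its `places` parameter are assumptions for illustration, and the input is assumed to be an integer of at least 1.

```python
def scaled_sqrt(n, places):
    """Return the square root of the integer n as a string with `places` decimals."""
    # Scale n so that its integer square root carries the required decimals
    x = n * 10 ** (2 * places)
    # Start from a guess that is at least as large as the square root
    z = x
    # Integer Newton-Raphson: the guess decreases until z*z <= x
    while z * z > x:
        z = (z + x // z) // 2
    digits = str(z)
    # Put the decimal point back `places` digits from the right
    return digits[:-places] + "." + digits[-places:]


print(scaled_sqrt(2, 5))
print(scaled_sqrt(100, 3))
```

For 5 decimal places, 2 is scaled to 20000000000, whose integer square root is 141421, which is then printed as 1.41421.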

https://stackoverflow.com/questions/64278117/is-there-a-way-to-create-more-decimal-points-on-python-without-importing-a-libra

https://leetcode.com/problems/sqrtx/discuss/169594/Python-Solution-based-on-shifting-nth-root-algorithm

https://stackoverflow.com/questions/29724907/limit-of-digit-by-digit-calculation-of-square-roots

https://stackoverflow.com/questions/64295245/how-to-get-the-square-root-of-a-number-to-100-decimal-places-without-using-any-l


In [6]:
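# Multiply the input by 100^100 so the integer square root carries 100 decimal places.

def sqrt(p):
    '''
    Return the square root of p as a string with 100 decimal places.
    Assumes the integer part of the square root is a single digit.
    '''
    # Take input number and multiply it by 100^100
    x = p * 100 ** 100

    # Initial guess for square root z
    z = x // 2

    # Loop while the guess is too large (z*z > x)
    while x - (z*z) < 0:
        # Integer Newton-Raphson step using floor division
        z = (z + x//z) // 2 #https://stackoverflow.com/questions/183853/what-is-the-difference-between-and-when-used-for-division
    z = str(z) # Convert into string

    # Insert the decimal point after the first digit and return the result
    return z[0] + "." + z[1:]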



In [7]:
# Test Function
sqrt(2)

'1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727'

In [8]:
# Verify answer
print("The number of decimal places is:", (len(sqrt(2))-2) )

The number of decimal places is: 100


### Importing modules to compare results

For comparison, the `math` and `decimal` modules from the standard library are used to check the results. As stated in the project brief, the function itself must not depend on any module, so these imports are purely for comparison.

In [9]:
# Test with imported  math function
import math
math.sqrt(2)

1.4142135623730951

In [10]:
# Verify answer
ans = str(math.sqrt(2)) 
print("The number of decimal places is:", (len(ans)-2))

The number of decimal places is: 16


In [11]:
# Check if the answers returned are the same. 
str(math.sqrt(2)) == sqrt(2)

False

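The `math` function returns the same number of decimal places as our first test, and the exact same result. Both work with floating point numbers, so both are limited to about 16 decimal places. The comparison with `sqrt(2)` returns `False` because that string holds 100 decimal places.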

In [12]:
# Get result using decimal module
from decimal import Decimal, getcontext
getcontext().prec = 101 # for 100 decimal places
ans = Decimal(2).sqrt()  # https://docs.python.org/3/library/decimal.html
ans

Decimal('1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727')

In [13]:
# Count number of decimal places here
ans = str(ans) 
print("The number of decimal places is:", (len(ans)-2))

The number of decimal places is: 100


In [14]:
# Test if results from function match results from decimal module
print(sqrt(2))
print(str(ans))
sqrt(2) == str(ans)

1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727
1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727


True

The `decimal` module returns the same answer as our `sqrt()` function, which verifies our results.
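The result can also be checked with integer arithmetic alone, without importing anything. The sketch below takes the 100 decimal place string printed above, removes the decimal point, and confirms that the resulting integer is the largest one whose square does not exceed $2 \times 10^{200}$. The variable names are illustrative.

```python
# The 100 decimal place result printed above
digits = "1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727"

# Remove the decimal point to recover the scaled integer
z = int(digits.replace(".", ""))
target = 2 * 10 ** 200

# z is correct to 100 decimal places if z^2 <= target < (z + 1)^2
is_floor_root = z * z <= target < (z + 1) ** 2
print(is_floor_root)
```

This check confirms that the digits are truncated rather than rounded: adding 1 to the last digit would overshoot the true value.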

-----------------


## Other Research

While researching this function I spent time looking into other methods of computing square roots:

* Arithmetic estimates: https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Arithmetic_estimates
* Babylonian method: https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Babylonian_method
* Digit-by-digit calculation: https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation

However, the method above fulfilled the brief successfully, so there was no need to investigate more complicated methods any further.

# Task 1: References


[1] Mastering Markdown; GitHub; https://guides.github.com/features/mastering-markdown/

[2] Python Tutorial; Python Software Foundation; https://docs.python.org/3/tutorial/controlflow.html#for-statements

[3] Methods of computing square roots; Wikipedia; https://en.wikipedia.org/wiki/Methods_of_computing_square_roots

[4] https://medium.com/@surajregmi/how-to-calculate-the-square-root-of-a-number-newton-raphson-method-f8007714f64

[5] A tour of go; Exercise: Loops and Functions; https://tour.golang.com/flowcontrol/8

[6] Newton's method; Wikipedia; https://en.wikipedia.org/wiki/Newton%27s_method

[7] https://www.mathjax.org/

###### Assignment 2

-------
    

# Task 2: Chi-squared test for independence


November 2nd 2020

-----------------

The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article gives the table below as an example, stating the Chi-squared value based on it is approximately 24.6. Use `scipy.stats` to verify this value and calculate the associated p value.

--------------------


## Introduction

#### Chi-squared test


https://en.wikipedia.org/wiki/Chi-squared_test

References:

* https://www.youtube.com/watch?v=ICXR9nDbudk&ab_channel=JieJenn
* https://stackoverflow.com/questions/50355577/scraping-wikipedia-tables-with-python-selectively
* https://medium.com/analytics-vidhya/web-scraping-a-wikipedia-table-into-a-dataframe-c52617e1f451

### Import table from Wikipedia using Beautiful Soup

In [15]:

# [8] [9]
URL = "https://en.wikipedia.org/w/index.php?title=Chi-squared_test&oldid=983024096"
table_class = "wikitable"

response = requests.get(URL)
soup = BeautifulSoup(response.text,'html.parser')

chisqr = soup.find("table", class_=table_class)
tables = pd.read_html(str(chisqr), skiprows=(5,6)) # Skip total at bottom

# Tidy up data [10]
# read_html returns a list of DataFrames, so take the first one
data = tables[0]
data_df = data.rename(columns={"Unnamed: 0": "Occupation", "total": "RowTotal"})
data_df




| | Occupation | A | B | C | D | RowTotal |
|---|---|---|---|---|---|---|
| 0 | White collar | 90 | 60 | 104 | 95 | 349 |
| 1 | Blue collar | 30 | 50 | 51 | 20 | 151 |
| 2 | No collar | 30 | 40 | 45 | 35 | 150 |
| 3 | Total | 150 | 150 | 200 | 150 | 650 |


To future-proof this notebook, the data was also entered manually as an array, so the notebook still works if the Wikipedia article changes. It also lets us input the data directly in the format required by `scipy.stats`.

### Manually input data

In [44]:
# Manually set data
data = np.array([[90,60,104,95],[30,50,51,20],[30,40,45,35]])

In [45]:
from scipy.stats import chi2_contingency

### chi2_contingency function

The function `chi2_contingency()` imported from `scipy.stats` computes a Chi-squared test of independence of the variables in a contingency table.

It is useful in this situation as the data to be analysed is already in the format this function needs. It takes the data as an array without the titles or indexes.

It returns the following:

* chi2 (float): The test statistic.

* p (float): The p-value of the test

* dof (int): Degrees of freedom

* expected (ndarray, same shape as observed): The expected frequencies, based on the marginal sums of the table.


"This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table observed. The expected frequencies are computed based on the marginal sums under the assumption of independence. The number of degrees of freedom is (expressed using numpy functions and attributes): `dof = observed.size - sum(observed.shape) + observed.ndim - 1`"

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
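To make the calculation behind the function visible, the sketch below reproduces the same quantities in plain Python for the table used in this task. It is not how `scipy` implements the test. The p-value uses a closed-form expression for the chi-squared survival function that only holds when the degrees of freedom are even, which is an assumption that happens to be satisfied here.

```python
import math

# Observed frequencies: rows are occupations, columns are neighbourhoods A to D
observed = [[90, 60, 104, 95],
            [30, 50, 51, 20],
            [30, 40, 45, 35]]

# Marginal sums of the table
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of each cell under independence
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
statistic = sum((o - e) ** 2 / e
                for obs_row, exp_row in zip(observed, expected)
                for o, e in zip(obs_row, exp_row))

# Degrees of freedom for a two-way table: (rows - 1) * (columns - 1)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Survival function of the chi-squared distribution (valid for even dof only)
half = statistic / 2
p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))

print(statistic, dof, p_value)
```

For this table the sketch gives a statistic of about 24.57 with 6 degrees of freedom, in line with the value quoted in the task brief.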



### Run chi squared test

In [46]:
# Run chi squared test
chi2_contingency(data)

(24.5712028585826,
 0.0004098425861096696,
 6,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]]))

In [50]:
# Tidy up results
chi2, p, dof, ex = chi2_contingency(data, correction=False)
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

# Keep a copy of the statistic, as the name chi2 is reused by a later import
stat = chi2

# Label the expected frequencies with the occupation names
dataChi2 = pd.DataFrame(ex, index = ['White collar', 'Blue collar', 'No collar']) # https://stackoverflow.com/questions/19851005/rename-pandas-dataframe-index
data_df = dataChi2.rename(columns={0: "A", 1: "B", 2: "C", 3: "D"})
# Add a row of column totals (DataFrame.append was removed in pandas 2.0)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html#pandas.Series.sum
data_df.loc['Total'] = data_df.sum()

print("--------------------------------------------------")
print("\n")
print("The test statistic is: ", chi2)
print("The p-value is: ", p)
print("The degrees of freedom are: ", dof)
print("\n")
print("--------------------------------------------------")
data_df




--------------------------------------------------


The test statistic is:  24.5712028585826
The p-value is:  0.0004098425861096696
The degrees of freedom are:  6


--------------------------------------------------


| | A | B | C | D |
|---|---|---|---|---|
| White collar | 80.538462 | 80.538462 | 107.384615 | 80.538462 |
| Blue collar | 34.846154 | 34.846154 | 46.461538 | 34.846154 |
| No collar | 34.615385 | 34.615385 | 46.153846 | 34.615385 |
| Total | 150.0 | 150.0 | 200.0 | 150.0 |


In [57]:
from scipy.stats import chi2 # This reuses the name chi2; the statistic is kept in stat
# interpret test-statistic (https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)
prob = 0.95
critical = chi2.ppf(prob, dof)
#print(critical) #To test result
if abs(stat) >= critical:
    print('Dependent (Reject Null Hypothesis)')
else:
    print('Independent (fail to reject Null Hypothesis)')

Dependent (Reject Null Hypothesis)


In [58]:
# interpret p-value
alpha = 1.0 - prob
if p <= alpha:
    print('Dependent (Reject Null Hypothesis)')
else:
    print('Independent (fail to reject Null Hypothesis)')

Dependent (Reject Null Hypothesis)


### Results

The table above gives the results of the chi-squared test for independence along with the associated p value. The values shown in the table are the expected frequencies of each cell under the assumption of independence.

What the results tell us:
* The test statistic of 24.5712028585826 matches the stated statistic of approximately 24.6 from Wikipedia.
* The degrees of freedom result of 6 also matches the Wikipedia article.
* Both the critical value comparison and the p value of approximately 0.0004 reject the null hypothesis at the 5% level.

What format is the table required to be in to use the function:
* The table must be a simple numpy array.
* Each occupation is in a separate list.
* Labels and occupation names are added after the calculation.


### Conclusions

In this specific set of data the null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification.

We can interpret the test statistic in the context of the chi-squared distribution with the requisite number of degrees of freedom as follows:

* If Statistic >= Critical Value: significant result, reject null hypothesis, dependent.
* If Statistic < Critical Value: not significant result, fail to reject null hypothesis, independent.

The calculation above rejects the null hypothesis.

In simple terms, it means that neighbourhood of residence and occupation are not independent in this data.
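The critical value rule and the p value rule always lead to the same decision. The sketch below illustrates this in plain Python for the statistic reported above. The helper names `chi2_sf_even_dof` and `critical_value` are assumptions for this example, and the closed-form survival function is only valid for an even number of degrees of freedom.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Survival function of the chi-squared distribution (even dof only)."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))


def critical_value(alpha, dof, lo=0.0, hi=100.0):
    """Find x such that sf(x) = alpha by bisection; sf decreases as x grows."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_sf_even_dof(mid, dof) > alpha:
            lo = mid  # too much probability in the tail: move right
        else:
            hi = mid
    return (lo + hi) / 2


statistic, dof, alpha = 24.5712028585826, 6, 0.05

critical = critical_value(alpha, dof)
p_value = chi2_sf_even_dof(statistic, dof)

reject_by_statistic = statistic >= critical
reject_by_p_value = p_value <= alpha
print(critical, reject_by_statistic, reject_by_p_value)
```

The critical value for 6 degrees of freedom at the 5% level is roughly 12.59, well below the observed statistic, so both rules reject the null hypothesis.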

https://stackoverflow.com/questions/64669448/understanding-scipy-stats-chisquare

https://machinelearningmastery.com/chi-squared-test-for-machine-learning/



---------------------------

# Task 2: References


[1] Mastering Markdown; GitHub; https://guides.github.com/features/mastering-markdown/


[8] https://www.youtube.com/watch?v=ICXR9nDbudk&ab_channel=JieJenn

[9] https://stackoverflow.com/questions/50355577/scraping-wikipedia-tables-with-python-selectively

[10] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html




-----------------------------------

# Task 3: Standard deviation

November 16th 2020

---------------

In [None]:
import numpy as np

import statsmodels.stats.weightstats as stat
import scipy.stats as ss
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("fivethirtyeight")



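# Example data (illustrative values); x must be an array for len(x) to work
x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population standard deviation: divide by len(x)
np.sqrt(np.sum((x - np.mean(x))**2)/len(x))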

## Excel Functions

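#### STDEV.P

https://support.microsoft.com/en-us/office/stdev-p-function-6e917c05-31a0-496f-ade7-4f4e7462f285

Calculates the standard deviation of an entire population: the sum of squared deviations from the mean is divided by `len(x)`.

#### STDEV.S

https://support.microsoft.com/en-us/office/stdev-s-function-7d69cf97-0c1f-4acf-be27-f3e83904cc23


Estimates the standard deviation based on a sample: the sum of squared deviations from the mean is divided by `len(x)-1`.

STDEV.P is appropriate when `x` holds every member of the population. STDEV.S is appropriate when `x` is only a sample, as dividing by `len(x)-1` (Bessel's correction) compensates for the deviations being measured from the sample mean rather than the true population mean.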
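The difference between the two divisors can be seen on a small example. The sketch below uses plain Python and the standard library `statistics` module; the data values are illustrative and are not taken from this notebook.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

mean = sum(x) / len(x)
squared_deviations = sum((value - mean) ** 2 for value in x)

# Divide by len(x): population standard deviation, as in STDEV.P
population_sd = math.sqrt(squared_deviations / len(x))

# Divide by len(x) - 1: sample standard deviation, as in STDEV.S
sample_sd = math.sqrt(squared_deviations / (len(x) - 1))

# The statistics module provides both versions for comparison
print(population_sd, statistics.pstdev(x))
print(sample_sd, statistics.stdev(x))
```

In numpy the same choice is made with the `ddof` argument: `np.std(x)` divides by `len(x)`, while `np.std(x, ddof=1)` divides by `len(x)-1`.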

### Numpy Function

https://numpy.org/doc/1.19/reference/generated/numpy.std.html?highlight=std#numpy.std

--------------

In [None]:
a = np.array([[1, 2], [3, 4]])
np.std(a)


In [None]:
np.std(a, axis=0)


In [None]:
np.std(a, axis=1)


Write up results


-------------------------------

# Task 3: References

[1] Mastering Markdown; GitHub; https://guides.github.com/features/mastering-markdown/


----------------------

# Task 4: scikit-learn: Iris Data Set

November 30th 2020

Use scikit-learn to apply k-means clustering to
Fisher’s famous Iris data set.
Explain in a Markdown cell how your code works and how accurate it might be, and then explain how your model could be used to make predictions of species of iris.

---------------

### Introduction

K-means algorithm

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to large number of samples and has been used across a large range of application areas in many different fields.

https://scikit-learn.org/stable/modules/clustering.html#k-means

The k-means algorithm divides a set of samples into separate clusters, each described by the mean of the samples in the cluster. The means are commonly called the cluster “centroids”.

The k-means algorithm aims to choose centroids that minimise the inertia.

Inertia:
* is a measure of how internally coherent clusters are
* responds poorly to elongated or irregular shapes
    
### Iris flower data set

The Iris flower data set is commonly used in machine learning. It was introduced in 1936 by the British statistician Ronald Fisher.

It comprises three species of Iris (setosa, virginica and versicolor) with 50 samples of each. Four separate features of each iris were measured: sepal length, sepal width, petal length and petal width.

It is useful as a tool in machine learning as it has 2 distinct groups of data. One group contains only setosa and is easily identified, while the other contains versicolor and virginica, which can be separated using a linear discriminant model developed by Fisher.

Further information can be found here:
https://en.wikipedia.org/wiki/Iris_flower_data_set
    

In [None]:
# Load the iris data set from a URL.
df = pd.read_csv("https://github.com/ianmcloughlin/datasets/raw/master/iris.csv")

In [None]:
# Display the data
df

In [None]:
# Load the seaborn package.
import seaborn as sns

# Plot the Iris data set with a pair plot.
sns.pairplot(df, hue="class")

Using the `pairplot` function in the seaborn module gives us a good initial view of the data. As the species are already labelled, we can see where the difficulty lies in separating them into clusters: versicolor and virginica overlap in every pair of features.

## Inputs and outputs


https://medium.com/@belen.sanchez27/predicting-iris-flower-species-with-k-means-clustering-in-python-f6e46806aaee

### Read article

### K-means Test

In [None]:
from sklearn.cluster import KMeans
import numpy as np


In [None]:
# Convert dataframe to array
x = df.iloc[:, [0, 1, 2, 3]].values
x

In [None]:
# Isolate the class and save in variable to compare results with
species = df["class"]
species

In [None]:
# find the optimum number of clusters
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
    
#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # within cluster sum of squares
plt.show()

In [None]:
# Pre specify 3 clusters 
kmeans = KMeans(n_clusters=3, random_state=0)

The elbow method allows us to pick the optimum number of clusters from the given data. In this case the graph elbows at around 3, which matches the known number of species and supports the effectiveness of the method. Adding further clusters does not give a meaningful reduction in the within-cluster sum of squares.

https://www.kaggle.com/tonzowonzo/simple-k-means-clustering-on-the-iris-dataset

In [None]:
# Fit object using iris data to train model
KMmodel = kmeans.fit(x)

In [None]:
# Check the classifications based on the model
KMmodel.labels_

The above array shows the cluster that the model has assigned to each sample. The labels are reasonably in order, but a quick look at the array suggests that a number of samples have been misclassified.

In [None]:
# Get the centres of the clusters
KMmodel.cluster_centers_

In [None]:
# Use pandas to crosstabulate predictions with true values
ct = pd.crosstab(species, KMmodel.labels_)
column_titles = [1,2,0]

ct.reindex(columns=column_titles)

From this comparison we can be certain that the k-means model clustered the setosa samples accurately. Referring back to the labels we can see all of the 1 values are in order and at the beginning of the array.
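A cross-tabulation like this can be turned into a single accuracy figure by mapping each cluster to the species that appears in it most often. The sketch below shows the idea in plain Python. The labels are a small made-up example rather than the output of this notebook, and the function name `cluster_accuracy` is an assumption.

```python
from collections import Counter, defaultdict


def cluster_accuracy(true_labels, cluster_ids):
    """Map each cluster to its most common true label and score the result."""
    counts = defaultdict(Counter)
    for true, cluster in zip(true_labels, cluster_ids):
        counts[cluster][true] += 1
    # The majority species in each cluster becomes that cluster's name
    mapping = {cluster: c.most_common(1)[0][0] for cluster, c in counts.items()}
    correct = sum(mapping[cluster] == true for true, cluster in zip(true_labels, cluster_ids))
    return mapping, correct / len(true_labels)


# Made-up labels for illustration: one versicolor sample falls in cluster 0
true_labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
cluster_ids = [1, 1, 1, 2, 2, 0, 0, 0, 0]

mapping, accuracy = cluster_accuracy(true_labels, cluster_ids)
print(mapping)
print(accuracy)
```

Deriving the mapping from the data avoids hard-coding which cluster number corresponds to which species, as the numbering can change between runs or library versions.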



https://www.youtube.com/watch?v=pT17z_PziZs&ab_channel=DragonflyStatistics

https://kobriendublin.wordpress.com/

In [None]:
# Select existing data
df.loc[75] 

In [None]:
df.loc[2]

In [None]:
# Predict existing using k means
kmeans.predict([[6.6, 3, 4.4, 1.4]])

In [None]:
# convert prediction to species
p = kmeans.predict([[4.7, 3.2, 1.3, 0.2], [6.6, 3, 4.4, 1.4]])
for i in p:
    if i == 1:
        print("setosa")
    elif i == 2:
        print("versicolor")
    else:
        print("virginica")

K-means correctly assigns the setosa and versicolor samples shown above.

In [None]:
# Select existing data
df.loc[149]

In [None]:
# convert prediction to species
p = kmeans.predict([[5.9, 3, 5.1, 1.8]])
for i in p:
    if i == 1:
        print("setosa")
    elif i == 2:
        print("versicolor")
    else:
        print("virginica")

Entering the data for entry 149 of the Iris data set, we see how k-means misclassifies it. It predicts versicolor, however we know it is in fact virginica.

It is a good example of how important it is to have a good understanding of your data set before undertaking an exercise like this one. The clusters defined by k-means overlap along the virginica and versicolor data, so further investigation is needed when receiving results from a prediction tool.
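A k-means prediction is simply the cluster whose centroid is nearest to the new sample, which helps explain the misclassification. The sketch below shows this in plain Python. The centroid values are approximate, illustrative figures of the kind a three-cluster fit on the Iris data produces; they are assumptions and not the output of this notebook, so the real values should be read from `cluster_centers_`.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two samples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Assumed, approximate centroids (sepal length, sepal width, petal length, petal width)
centroids = {
    "setosa": [5.01, 3.43, 1.46, 0.25],
    "versicolor": [5.90, 2.75, 4.39, 1.43],
    "virginica": [6.85, 3.07, 5.74, 2.07],
}


def predict(sample):
    """Return the name of the cluster whose centroid is closest to the sample."""
    return min(centroids, key=lambda name: squared_distance(sample, centroids[name]))


print(predict([4.7, 3.2, 1.3, 0.2]))  # entry 2, a setosa
print(predict([6.6, 3.0, 4.4, 1.4]))  # entry 75, a versicolor
print(predict([5.9, 3.0, 5.1, 1.8]))  # entry 149, a virginica
```

With these assumed centroids, entry 149 lies closer to the versicolor centroid than to the virginica one, mainly because its sepal length is well below the virginica average.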

-------------------

# Task 4: References

[1] Mastering Markdown; GitHub; https://guides.github.com/features/mastering-markdown/


-------------------