# Machine Learning and Statistics - Tasks
Assignment Tasks for Machine Learning and Statistics, GMIT 2020

Lecturer: Dr Ian McLoughlin


>Author: **Andrzej Kocielski**  
>Github: [andkoc001](https://github.com/andkoc001/)  
>Email: G00376291@gmit.ie, and.koc001@gmail.com

Created: 07-11-2020

This Notebook should be read in conjunction with the corresponding `README.md` file in the project [repository](https://github.com/andkoc001/Machine-Learning-and-Statistics.git) on GitHub.

___
## Task 2

### Objectives
__Verify the value of the $ {\chi}^2 $ (chi-squared) test for a sample dataset and calculate the associated $ p $ value__.



### The Chi-squared test

The chi-squared test is a statistical tool suitable for categorical data (for instance, colour, academic degree or dog breed). It has three main applications, which in essence test similar things: 1) the goodness of fit test (how well the observed category counts fit an expected distribution), 2) the test for homogeneity (the likelihood of different samples coming from the same population) and 3) the test of independence ([YouTube](https://www.youtube.com/watch?v=7_cs1YlZoug)).

For this task, the chi-squared **test of independence** applies. The test of independence asks whether there is a statistically significant association between membership of one category and membership of another. In other words, "the chi-square independence test is a procedure for testing if two categorical variables are related in some population." [SPSS Tutorials](https://www.spss-tutorials.com/chi-square-independence-test/)

The calculation compares the measured (observed) values against the values that would be expected if the null hypothesis were true.

The result of the chi-squared test is a single number: the larger it is, the more the observed values deviate from those expected under independence. It therefore indicates whether one variable can reasonably be regarded as independent of the other.

A generic form of the test statistic is as follows (from [Wikipedia](https://en.wikipedia.org/wiki/Chi-squared_test)):

$$ {\chi}^2 = \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}\ $$

where:  
$ n $ - number of categories,  
$ O_k $ - observed value (measurement) for category $ k $,  
$ E_k $ - expected value for category $ k $ if the null hypothesis is true.
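
As a minimal sketch of the formula above, the statistic can be computed directly with NumPy. The function name `chi_squared_statistic` and the toy counts are hypothetical, chosen only for illustration; they are not part of the task data.

```python
import numpy as np


def chi_squared_statistic(observed, expected):
    """Sum of (O_k - E_k)^2 / E_k over all categories k."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))


# Hypothetical toy counts for three categories (illustration only).
toy_observed = [18, 22, 20]
toy_expected = [20, 20, 20]

toy_statistic = chi_squared_statistic(toy_observed, toy_expected)
print(toy_statistic)
```

Each category contributes its own squared deviation, scaled by the expected count, so a category with a small expected value can dominate the sum.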

It is also worth noting that, for the independence test to be reliable, every expected value should satisfy $ E_k > 5 $. Although this threshold is somewhat arbitrary, it is a commonly used rule of thumb in practice.

The number of 'degrees of freedom' is also required to assess the independence test. For tabulated data, the general formula is as follows:

$$ d = (r-1)(c-1) $$
    
where:  
$ d $ - degrees of freedom,  
$ r $ - number of rows,  
$ c $ - number of columns.
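
The formula above can be sketched in a few lines of Python. The helper name `degrees_of_freedom` and the placeholder table are assumptions made for illustration; only the shape of the table matters here, and the table must exclude any 'Total' row or column.

```python
import numpy as np


def degrees_of_freedom(table):
    """Return (r - 1)(c - 1) for a two-dimensional contingency table."""
    rows, cols = np.asarray(table).shape
    return (rows - 1) * (cols - 1)


# Hypothetical 3 x 4 table of placeholder counts (illustration only).
placeholder_table = np.ones((3, 4))
print(degrees_of_freedom(placeholder_table))  # (3 - 1) * (4 - 1) = 6
```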

The number of degrees of freedom determines the shape of the chi-squared distribution. Sample plots for various degrees of freedom are shown below.


![Chi-squared distribution](https://saylordotorg.github.io/text_introductory-statistics/section_15/5a0c7bbacb4242555e8a85c9767c03ee.jpg) Image source: [Saylor Academy](https://saylordotorg.github.io/text_introductory-statistics/s15-01-chi-square-tests-for-independe.html)

"The value of the chi-square random variable $ {\chi}^2 $ with degree of freedom $ d = k $ that cuts off a right tail of area c is denoted $ {\chi}^2_c $ and is called a critical value." [Saylor Academy](https://saylordotorg.github.io/text_introductory-statistics/s15-01-chi-square-tests-for-independe.html)

![Chi-squared critical value](https://saylordotorg.github.io/text_introductory-statistics/section_15/34d06306c2e726f6d5cd7479d9736e5e.jpg) Image source: [Saylor Academy](https://saylordotorg.github.io/text_introductory-statistics/s15-01-chi-square-tests-for-independe.html)
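
A critical value of this kind can be looked up with `scipy.stats.chi2`. The sketch below assumes a right-tail area of 0.05, a common convention that the quoted text does not prescribe, and evaluates it for a few arbitrarily chosen degrees of freedom.

```python
from scipy.stats import chi2

alpha = 0.05  # assumed right-tail area (not specified in the text)

# ppf is the inverse of the CDF, so ppf(1 - alpha) leaves a right tail of area alpha.
critical_values = {d: chi2.ppf(1 - alpha, d) for d in (1, 2, 6)}

for d, value in critical_values.items():
    print(f"d = {d}: critical value = {value:.3f}")
```

A test statistic larger than the critical value for the relevant degrees of freedom falls in the right tail, which is where the null hypothesis would be rejected at the chosen level.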



### The problem

The task is to check the given example data against the chi-squared test result that is also already given. "The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia [article](https://en.wikipedia.org/wiki/Chi-squared_test) gives the table below as an example, stating the Chi-squared value based on it is approximately 24.6. Use `scipy.stats` to verify this value and calculate the associated p value. You should include a short note with references justifying your analysis in a markdown cell."

The Wikipedia page above describes the test scenario as follows. "Suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as 'white collar', 'blue collar', or 'no collar'. The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. The data are tabulated as:"

|              |  A  |  B  |  C  |  D  | Total |
|--------------|-----|-----|-----|-----|-------|
| White collar |  90 |  60 | 104 |  95 |  349  |
| Blue collar  |  30 |  50 |  51 |  20 |  151  |
| No collar    |  30 |  40 |  45 |  35 |  150  |
|              |     |     |     |     |       |
| Total        | 150 | 150 | 200 | 150 |  650  |

([Wikipedia](https://en.wikipedia.org/wiki/Chi-squared_test))
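
The expected counts for the table above can be sketched with NumPy. Under independence, each expected count is the product of its row total and column total divided by the sample size. The variable names are chosen here for illustration.

```python
import numpy as np

# Observed counts from the table above
# (rows: white, blue, no collar; columns: neighborhoods A, B, C, D).
observed = np.array([
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Under independence, E_ij = (row total i) * (column total j) / N.
expected = np.outer(row_totals, col_totals) / grand_total

print(np.round(expected, 2))
print("All expected counts above 5:", bool((expected > 5).all()))
```

The final line checks the rule of thumb that every expected count should exceed 5.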

The chi-squared test of independence checks whether any association between two categorical variables is statistically significant. The test sets a 'null hypothesis' against an opposing 'alternative hypothesis'.

For the given sample data, the hypotheses are as follows (from the Wikipedia article):

**Null hypothesis** $ H_0 $ - "each person's neighborhood of residence is independent of the person's occupational classification",

**Alternative hypothesis** $ H_a $ - there is such a dependency.

The Wikipedia article already gives the result of the test, $ {\chi}^2 \approx 24.6 $, as well as the number of degrees of freedom, $ d = 6 $.
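
As a rough check, the $ p $ value associated with the stated figures can be sketched with the survival function of `scipy.stats.chi2`. The input 24.6 is the rounded value quoted in the article, so the result is only approximate.

```python
from scipy.stats import chi2

stated_statistic = 24.6  # approximate value quoted in the Wikipedia article
dof = 6                  # degrees of freedom quoted in the article

# sf is the survival function (1 - CDF): the probability, under the null
# hypothesis, of a statistic at least as large as the one observed.
p_value = chi2.sf(stated_statistic, dof)
print(f"p = {p_value:.6f}")
```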

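To interpret the result, the statistic is compared against the chi-squared distribution with $ d = 6 $ degrees of freedom: the larger the value of $ {\chi}^2 $, the smaller the associated $ p $ value and the stronger the evidence against the null hypothesis.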


### Calculation

The chi-squared test of independence for the provided data is calculated using the statistical functions (`scipy.stats`) from the [SciPy](https://docs.scipy.org/doc/scipy/reference/stats.html) library for Python.
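
A minimal sketch of that calculation, assuming `scipy.stats.chi2_contingency` is the function of choice, is given below. The 'Total' row and column are left out, since the function derives the marginal totals itself.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the Wikipedia table
# (rows: white, blue, no collar; columns: neighborhoods A, B, C, D).
observed = np.array([
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
])

# Returns the test statistic, the p value, the degrees of freedom
# and the array of expected counts under independence.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-squared = {chi2_stat:.2f}")
print(f"p value     = {p_value:.6f}")
print(f"dof         = {dof}")
```

The statistic should come out close to the 24.6 quoted in the Wikipedia article, with 6 degrees of freedom.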



''' To do plan

+ chi-squared plots for different degrees of freedom
+ mention that Ek should be greater than 5
+ define the null hypothesis and the alternative hypothesis, eg.
    Null Hypothesis (from the Wikipedia article): "each person's neighborhood of residence is independent of the person's occupational classification".
    Alt Hypothesis: there is such a dependency (statistically meaningful)
- elaborate the steps of calculating the chi-squared for the sample data 
- apply the scipy.stats
- interpret the result from the calcs; compare to the results given in the Wikipedia
+ write a brief conclusion / findings

'''

### Conclusion 

For the survey data, the test statistic $ {\chi}^2 \approx 24.6 $ with $ d = 6 $ degrees of freedom corresponds to a $ p $ value of about 0.0004, well below the conventional 0.05 significance level. A deviation this large from the expected counts is very unlikely to arise from sampling variability alone, so the null hypothesis is rejected: the evidence indicates that a person's neighborhood of residence and occupational classification are not independent. In other words, the proportions of the 'collar colours' differ between the neighborhoods by more than chance would explain.

___
## References and bibliography 

### General 

- Ian McLoughlin, Assignment Brief, 2020. [pdf] GMIT. Available at: <https://learnonline.gmit.ie/mod/url/view.php?id=102004> [Accessed October 2020].
- Ian McLoughlin, Lecturer's notes on square root of 2, 2020. [pdf] GMIT. Available at: <https://learnonline.gmit.ie/mod/url/view.php?id=92022> [Accessed October 2020].

### Task 2 related

- Ian McLoughlin - Introduction to the tasks, 2020 [online]. Available at: <https://github.com/ianmcloughlin/playing-with-jupyter/blob/main/playing-with-jupyter.ipynb> [Accessed October 2020]
- Chi-squared test - Wikipedia contributors [online]. Available at: <https://en.wikipedia.org/wiki/Chi-squared_test> [Accessed October 2020].
- Chi-squared test - Wolfram MathWorld contributors. [online] Available at: <https://mathworld.wolfram.com/Chi-SquaredTest.html> [Accessed October 2020].
- Chi-squared test of independence - SPSS Tutorials [online]. Available at: <https://www.spss-tutorials.com/chi-square-independence-test/> [Accessed November 2020]
- Chi-squared test of independence - Stat Trek [online]. Available at: <https://stattrek.com/chi-square-test/independence.aspx> [Accessed November 2020]
- How the Chi-Squared Test of Independence Works - Statistics by Jim [online]. Available at: <https://statisticsbyjim.com/hypothesis-testing/chi-squared-independence/> [Accessed November 2020]
- Chi-kwadrat - Statystyka pomoc [online]. Available at: <http://statystyka-pomoc.com/Chi-kwadrat.html> [Accessed November 2020]  
- A Gentle Introduction to the Chi-Squared Test for Machine Learning [online]. Available at: <https://machinelearningmastery.com/chi-squared-test-for-machine-learning/> [Accessed November 2020]
- Saylor Academy - Introductory statistics, Chi-Square Tests for Independence [online]. Available at: <https://saylordotorg.github.io/text_introductory-statistics/s15-01-chi-square-tests-for-independe.html> [Accessed November 2020]
- CrashCourse - Chi-Square Tests: Crash Course Statistics (YouTube), [online] Available at: <https://www.youtube.com/watch?v=7_cs1YlZoug> [Accessed November 2020]
- Lisa Dierker - Chi-square test - Python (YouTube), [online] Available at: <https://www.youtube.com/watch?v=Pbo7VbHK9cY> [Accessed November 2020]
- Statistical functions (scipy.stats) - Scipy documentation [online]. Available at: <https://docs.scipy.org/doc/scipy/reference/stats.html> [Accessed November 2020]



___
Andrzej Kocielski