<a id='sections'></a>
**Sections**

- [**1. Methodology**](#methodology)
    - [**Odds-ratio**](#Odds-ratio)
    - [**Chi-square test**](#Chi-square-test)
    - [**Cramer's V coefficient**](#Cramer-s-V-coefficient)
    - [**Goodman-Kruskal's statistics**](#Goodman-Kruskal-s-statistics)
    - [**One-way ANOVA**](#One-way-ANOVA)
    - [**Pearson correlation**](#Pearson-correlation)
    - [**Spearman rank-order correlation**](#Spearman-rank-order-correlation)
    - [**Kendall's tau**](#Kendall-s-tau)
- [**2. Other Resources**](#resources)
- [**3. References**](#references)

<a id='methodology'></a>
# Methodology
[[back to top](#sections)]

<a id='Odds-ratio'></a>
## Odds-ratio
[[back to top](#sections)]

[Odds-ratio](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/) $OR$
- Purpose: Measure the association between <font color='blue'>exposure and outcome</font>. Good for case-control studies.
- Mathematics: 
 
Odds = ${\frac{p}{1-p}}$
   - $p$ is the probability of an event occurring.

$OR$ (Odds-ratio) = Odds that the event will occur with the exposure / Odds that the event will occur without the exposure = $\frac{a/b}{c/d} = \frac{ad}{bc}$, where $a$, $b$, $c$ and $d$ are the cell counts in the table below.

|   | Outcome Status +   | Outcome Status -  |
|------|------|------|
|**Exposure Status +**|a (Number of exposed cases)| b (Number of exposed non-cases)|
|**Exposure Status -**|c (Number of unexposed cases)| d (Number of unexposed non-cases) |

- Interpretation:
    - OR = 1 : Exposure does not affect the outcome.
    - OR > 1 : Exposure associated with higher odds of outcome.
    - OR < 1 : Exposure associated with lower odds of outcome. 
- Code example: [Scipy fisher_exact](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html)   
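
As a minimal sketch of the formula above, the snippet below computes the odds ratio by hand and compares it with the value returned by `scipy.stats.fisher_exact`. The counts in `table` are made up for illustration, and the row/column layout is assumed to follow the exposure/outcome table in this section.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table (illustrative counts only):
#                 outcome +   outcome -
# exposure +        a=20        b=80
# exposure -        c=10        d=90
table = [[20, 80], [10, 90]]
(a, b), (c, d) = table

# Odds ratio straight from the definition: (a/b) / (c/d) = ad / bc
odds_ratio_manual = (a / b) / (c / d)

# fisher_exact returns the sample odds ratio and the p-value of Fisher's exact test
odds_ratio_scipy, p_value = fisher_exact(table)

print(f"manual OR = {odds_ratio_manual:.2f}")
print(f"scipy  OR = {odds_ratio_scipy:.2f}, p-value = {p_value:.4f}")
```

Here the odds ratio is above 1, so in this made-up table exposure is associated with higher odds of the outcome.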

<a id='Chi-square-test'></a>
## Chi-square test
[[back to top](#sections)]

[Chi-square test of independence](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900058/#:~:text=The%20assumptions%20of%20the%20Chi,the%20variables%20are%20mutually%20exclusive) $\chi^2$
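- Purpose: Test if there is a statistically significant <font color='blue'>association between two categorical variables</font>, i.e. whether the distribution of one variable differs between the groups defined by the other. The test itself does not identify which group differs from the others.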
- Non-parametric
- Null Hypothesis: $H_0$: Variable_1 is independent of Variable_2. 
- Mathematics: $\chi^2 = \Sigma{\frac{(observed-expected)^2}{expected}} $
    - Degree of freedom: (r - 1) * (c - 1), r = number of rows, c = number of columns
    - P-value: $P(\chi^2 > x^2)$, the probability of observing a value at least as extreme as the test statistic $x^2$ under a chi-square distribution with (r - 1) * (c - 1) degrees of freedom. Compare the P-value with the predefined significance level $\alpha$ (for example, $\alpha$ = 0.05). If P-value < $\alpha$, then $H_0$ should be rejected.
- Requirements/Assumptions:
    - `Both independent variable and dependent variable are categorical variables`.
    - Each variable must have two or more mutually exclusive categories.
    - Observations are independent (thus it cannot be used for pre-test and post-test observations).
    - No cell should have an expected frequency below 1, and at least 80% of the cells should have an expected frequency of 5 or more.
- Note: It only assesses association. It does not support any causal inference.
- Code example: [A Gentle Introduction to the Chi-Squared Test for Machine Learning](https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)
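
The sketch below illustrates the statistic and the decision rule described above. It builds the expected counts by hand (row total times column total divided by the grand total, the usual definition under independence) and checks the result against `scipy.stats.chi2_contingency`. The `observed` counts are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table (rows: Variable_1, columns: Variable_2)
observed = np.array([[10, 20, 30],
                     [20, 20, 20]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy returns the statistic, P-value, degrees of freedom and expected counts
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
print(f"chi2 = {chi2:.4f} (manual {chi2_manual:.4f}), dof = {dof}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, so its statistic will differ from the plain formula unless `correction=False` is passed.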

<a id='Cramer-s-V-coefficient'></a>
## Cramer's V coefficient
[[back to top](#sections)]

[Cramer's V coefficient](http://www.acastat.com/statbook/chisqassoc.htm) $\varphi_c$
- Purpose: Determine the <font color='blue'>strength of association</font> **after Chi-square test** has determined significance. Useful for comparing multiple Chi-square test statistics.
- Non-parametric
- Mathematics: $\varphi_c = V = \sqrt{\frac{\chi^2}{n(q-1)}}$
    - $q$ is the smaller of the number of rows and the number of columns.
    - $n$ is the total number of observations.
- Interpretation: A general rule of interpreting the association strength:
    - $\varphi_c$ < 0.1 : weak
    - 0.1 ≤ $\varphi_c$ ≤ 0.3 : moderate
    - $\varphi_c$ > 0.3 : strong
- Requirements/Assumptions:
    - Can be used for `Nominal variables or higher (Ordinal variable and Discrete variable)`.
- Note: 
    - It's a symmetrical measure. Which variable is treated as dependent variable and which variable is treated as independent variable does not matter. Also, the order (arrangement) of rows and columns does not matter. 
    - [Phi-coefficient](http://www.people.vcu.edu/~pdattalo/702SuppRead/MeasAssoc/NominalAssoc.html#:~:text=Note%20measures%20of%20association%2C%20unlike,the%20coefficient%20to%20reach%201.0.) is a special case of Cramer's V for 2x2 table.
- Code example: [Cramer's coefficient](https://stackoverflow.com/questions/20892799/using-pandas-calculate-cram%C3%A9rs-coefficient-matrix)
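
A minimal sketch of the formula above, assuming the input is an r x c table of counts. The helper name `cramers_v` is hypothetical, and the chi-square statistic is taken from `scipy.stats.chi2_contingency` with `correction=False` so that it matches the uncorrected formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V for an r x c contingency table of counts (illustrative helper)."""
    table = np.asarray(table)
    # correction=False: plain chi-square statistic, without Yates' correction
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()           # total number of observations
    q = min(table.shape)      # smaller of the number of rows and columns
    return np.sqrt(chi2 / (n * (q - 1)))

print(cramers_v([[10, 0], [0, 10]]))    # perfectly associated 2x2 table
print(cramers_v([[10, 10], [10, 10]]))  # table with no association
```

Because the measure is symmetric, transposing the table should give the same value.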

<a id='Goodman-Kruskal-s-statistics'></a>
## Goodman-Kruskal's statistics
[[back to top](#sections)]

Goodman-Kruskal's statistics

- Non-parametric

(1) [Goodman-Kruskal's tau](https://support.minitab.com/en-us/minitab/19/help-and-how-to/statistics/tables/supporting-topics/other-statistics-and-tests/what-are-the-goodman-kruskal-statistics/)
- Purpose: Measure the strength of association between `two Nominal variables`. It measures the proportional improvement in predictability of the dependent variable (column or row) given the value of the independent variable (row or column). It ranges from 0 (no improvement) to 1 (perfect prediction), so it carries no direction.
- Mathematics: [Goodman and Kruskal’s tau](http://uregina.ca/~gingrich/gkt.pdf)

(2) [Goodman-Kruskal's lambda](https://en.wikipedia.org/wiki/Goodman_and_Kruskal%27s_lambda)
- Purpose: The same as Goodman-Kruskal's tau.

    
(3) [Goodman-Kruskal's gamma](https://en.wikipedia.org/wiki/Goodman_and_Kruskal%27s_gamma) $\gamma$
- Purpose: Test if there is a <font color='blue'>monotonic</font> relationship between `two Ordinal variables`. 
- Mathematics:  $\gamma = \frac{N_c-N_d}{N_c+N_d}$

    - $N_c$ is the total number of concordant pairs.
    - $N_d$ is the total number of discordant pairs.
- Interpretation:
    - $\gamma$ = +1 : Perfect positive association.  
    - $\gamma$ = -1 : Perfect negative association.
    - $\gamma$ = 0  : There is no association between the variables.
- Note:
    - It's a symmetrical measure. Which variable is treated as dependent variable and which one is treated as independent variable does not matter.
    - The data outliers do not affect the results very much (very useful method if the data has outliers).
    - Yule's Q is a special case for 2x2 table.
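
The gamma formula above can be sketched by counting concordant and discordant pairs directly. This is an illustrative pure-Python helper (the name `goodman_kruskal_gamma` is hypothetical), assuming the two ordinal variables are given as equally long sequences of numeric codes; pairs tied on either variable are ignored, as the formula implies.

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Gamma from paired ordinal scores; tied pairs are ignored (illustrative helper)."""
    n_c = n_d = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            n_c += 1      # concordant: both variables order the pair the same way
        elif product < 0:
            n_d += 1      # discordant: the variables order the pair oppositely
    if n_c + n_d == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (n_c - n_d) / (n_c + n_d)

# Hypothetical ordinal codes (e.g. 1 = low, 2 = medium, 3 = high)
print(goodman_kruskal_gamma([1, 2, 3, 4], [1, 2, 3, 4]))
print(goodman_kruskal_gamma([1, 2, 3, 4], [4, 3, 2, 1]))
```

The pairwise loop is quadratic in the number of observations, which is fine for small samples; for large tables, counting pairs from the cell frequencies would be more efficient.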

<a id='One-way-ANOVA'></a>
## One-way ANOVA
[[back to top](#sections)]

[One-way ANOVA](https://statistics.laerd.com/spss-tutorials/one-way-anova-using-spss-statistics.php)
- Purpose: Test if there is a significant <font color='blue'>difference between the means </font> of multiple groups.
- Null Hypothesis: $H_0$: $\mu_1$ = $\mu_2$ = $\dots$ = $\mu_k$
    - $\mu$ is the group mean, $k$ is the number of groups
- Requirements/Assumptions:
    - `The dependent variable is a Continuous variable. The independent variable must have two or more independent groups` (typically three or more; with exactly two groups an independent-samples t-test is more common).
    - The dependent variable should be approximately normally distributed for each independent variable group. 
    - The variance of each group should be the same (homogeneity of variance).
    - Observations are independent, and do not have significant outliers.
- [Mathematics:](https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test)
    - F-statistic = Variation between sample means / Variation within the samples
- [Interpretation](https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php)
    - Under $H_0$, the F-statistic should be close to 1.
    - A small F means that the variation between group means is small relative to the variation within groups, so there is no evidence that the group means differ.
    - A large F means that the variation between group means is large relative to the variation within groups. If the corresponding P-value is below the significance level, $H_0$ should be rejected.
- Drawbacks:
    - Cannot tell which groups differ significantly from the others. A post-hoc test is needed to find out which groups differ.
        - [Tukey’s Honest Significant Difference (HSD)](https://stats.libretexts.org/Bookshelves/Applied_Statistics/Book%3A_An_Introduction_to_Psychological_Statistics_(Foster_et_al.)/11%3A_Analysis_of_Variance/11.08%3A_Post_Hoc_Tests#:~:text=A%20post%20hoc%20test%20is,will%20give%20us%20similar%20answers.) is one of the commonly used post-hoc tests.
- Code example: [Scipy f_oneway](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html)
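
To make the F-ratio above concrete, the sketch below computes it by hand as the between-group mean square divided by the within-group mean square (the standard one-way ANOVA definition, with $k-1$ and $N-k$ degrees of freedom) and compares it with `scipy.stats.f_oneway`. The three groups are invented data.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical measurements for k = 3 independent groups
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.4f} (manual {f_manual:.4f}), p = {p_value:.4f}")
```

A significant result here only says that at least one group mean differs; a post-hoc test is still needed to locate the difference.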

<a id='Pearson-correlation'></a>
## Pearson correlation
[[back to top](#sections)]

[Pearson correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) $r$
- Purpose: Test if there is a <font color='blue'>linear</font> relationship between two variables, and the degree of linear relationship.
- Requirements/Assumptions:
    - `Both variables should be normally distributed`. 
    - There should be no significant outliers. Outliers may lead to a misleading result. 
    - Homoscedasticity (equal variance)
- Mathematics:
$r$ = $\frac{n\Sigma x_i y_i -\Sigma x_i \Sigma y_i }{\sqrt{n\Sigma x_i^2-(\Sigma x_i)^2}\sqrt{n\Sigma y_i^2-(\Sigma y_i)^2}}$
    - $x_i$ is the value of x for observation $i$.
    - $y_i$ is the value of y for observation $i$.
    - $n$ is the number of observations.
    - Degree of freedom: $n-2$

- If one of the variables is binary, use Point-biserial correlation instead.
- Code example: [Scipy pearsonr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html)
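
As a rough check of the computational formula, the sketch below evaluates $r$ from the raw sums, taking the denominator as the product of the two square-root terms, and compares the result with `scipy.stats.pearsonr`. The `x` and `y` values are made-up sample data.

```python
import math
from scipy.stats import pearsonr

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Numerator: n * sum(xy) - sum(x) * sum(y)
numerator = n * sum_xy - sum_x * sum_y
# Denominator: product of the two square-root terms
denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
r_manual = numerator / denominator

r_scipy, p_value = pearsonr(x, y)
print(f"r = {r_scipy:.4f} (manual {r_manual:.4f}), p = {p_value:.4f}")
```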

<a id='Spearman-rank-order-correlation'></a>
## Spearman rank-order correlation
[[back to top](#sections)]

[Spearman rank-order correlation](https://www.statisticssolutions.com/correlation-pearson-kendall-spearman/) $\rho_s$

- Non-parametric
- Purpose: Test if there is a <font color='blue'>monotonic</font> relationship between two variables, and the degree of monotonic relationship.
- Requirements/Assumptions:
    - `Both variables are measured on at least an Ordinal scale`.
    - Can be used for either ordinal variables or for continuous data that has failed the assumptions necessary for conducting the Pearson's product-moment correlation test.
- Null Hypothesis: $H_0$: There is no monotonic association between two variables.
- [Mathematics](https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php):
    - For data without tied ranks: $\rho_s$ = 1- $\frac{6\Sigma d_i^2}{n(n^2-1)}$
        - $d_i$ is the difference between the two ranks of observation $i$.
        - $n$ is the total number of observations.
    - For data with tied ranks: $\rho_s = \frac{\Sigma{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\Sigma(x_i - \bar{x})^2\Sigma(y_i - \bar{y})^2}}$ 
        - $x_i$ and $y_i$ are the ranks of the $i$-th pair of observations, and $\bar{x}$ and $\bar{y}$ are the mean ranks (i.e. Pearson's $r$ computed on the ranks).

- Interpretation:
    - $\rho_s$ = +1 : Perfect positive association of ranks.
    - $\rho_s$ = 0 : No association between ranks.
    - $\rho_s$ = -1 : Perfect negative association of ranks.
- Note:
    - Make a scatter plot first to check whether the relationship between the two variables looks monotonic. Then use $\rho_s$ to quantify the strength of that monotonic relationship.
- Code example: [Scipy spearmanr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr)
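
The no-ties formula can be sketched as follows: rank both variables with `scipy.stats.rankdata`, take the rank differences $d_i$, and compare the result with `scipy.stats.spearmanr`. The data are invented and deliberately contain no ties, so the shortcut formula applies.

```python
from scipy.stats import rankdata, spearmanr

# Hypothetical paired observations without tied values
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
n = len(x)

# Rank each variable, then take the per-observation rank difference d_i
rank_x = rankdata(x)
rank_y = rankdata(y)
d = rank_x - rank_y

# rho_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid only without ties
rho_manual = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

rho_scipy, p_value = spearmanr(x, y)
print(f"rho = {rho_scipy:.4f} (manual {rho_manual:.4f}), p = {p_value:.4f}")
```

With tied values the shortcut no longer holds exactly; `spearmanr` handles that case by correlating the (average) ranks.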

<a id='Kendall-s-tau'></a>
## Kendall's tau
[[back to top](#sections)]

[Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient)  $\tau$
- Non-parametric
- Purpose: Measure the correspondence between the ranking of x and y.
- Mathematics: $\tau$ = $\frac{n_c - n_d}{\frac{1}{2}n(n-1)}$

    - $n_c$ is the total number of concordant pairs.
    - $n_d$ is the total number of discordant pairs.
    - $n$ is the total number of observations. $\frac{1}{2}n(n-1)$ is the total number of possible pairs of observations.
- Interpretation:
    - $\tau$ = +1 : The two rankings are identical.
    - $\tau$ = 0  : There is no association between the two rankings.
    - $\tau$ = -1 : One variable rank is the perfect reverse of the other.
- Code example: [Scipy kendalltau](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html)
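
A small sketch of the pair-counting formula above, compared with `scipy.stats.kendalltau`. The data are made up and contain no ties; under that assumption SciPy's default variant (tau-b) should coincide with the formula given here.

```python
from itertools import combinations
from scipy.stats import kendalltau

# Hypothetical paired observations without tied values
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
n = len(x)

n_c = n_d = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    product = (x1 - x2) * (y1 - y2)
    if product > 0:
        n_c += 1      # concordant pair
    elif product < 0:
        n_d += 1      # discordant pair

# tau = (n_c - n_d) / (n * (n - 1) / 2)
tau_manual = (n_c - n_d) / (0.5 * n * (n - 1))

tau_scipy, p_value = kendalltau(x, y)
print(f"tau = {tau_scipy:.4f} (manual {tau_manual:.4f}), p = {p_value:.4f}")
```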

<a id='resources'></a>
## Other resources
[[back to top](#sections)]


- [Tied Values](https://v8doc.sas.com/sashtml/stat/chap47/sect13.htm#:~:text=Tied%20values%20occur%20when%20two,%2C%20however%2C%20ties%20often%20occur.)
- [What Statistical Test Do I Need?](http://www.mash.dept.shef.ac.uk/Resources/MASH-WhatStatisticalTestHandout.pdf)

<a id='references'></a>
## References
[[back to top](#sections)]

Please see the external links embedded in each section.