<a href="https://colab.research.google.com/github/adeeconometrics/literate-programming/blob/main/Statistical_Analyses.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statistical Analyses

This notebook discusses some of the most prominent statistical methods, their assumptions, and their applications in a research setting. It also documents their implementation in [SciPy (`scipy.stats`)](https://docs.scipy.org/doc/scipy/reference/stats.html). It is intended as a guide for understanding statistical methods and applying them right away.

### About SciPy
SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. It provides an array of computational tools written in [C](https://www.tutorialspoint.com/cprogramming/index.htm) and [Fortran](https://www.tutorialspoint.com/fortran/fortran_overview.htm) for high performance and exposed through a Python package. To install SciPy on a system that already has Python, run ``python -m pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose`` in your command-line interface (CLI). For other cases, [refer to the documentation](https://www.scipy.org/install.html).

You can explore SciPy's many libraries for advanced scientific computation [here](https://docs.scipy.org).
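
As a quick check that the installation worked, the following minimal sketch imports the packages and prints their versions. The sanity check at the end is only an illustration and is not part of any official installation procedure.

```python
import numpy as np
import scipy
from scipy import stats

# Print the installed versions to confirm that the packages import correctly.
print("NumPy:", np.__version__)
print("SciPy:", scipy.__version__)

# A tiny sanity check: the standard normal CDF evaluated at 0 is one half.
half = stats.norm.cdf(0.0)
print(half)
```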

----
## Content

- Probability Distributions
    - Normal Distribution
    - Binomial Distribution
    - T-distribution
    - Chi-square Distribution
    - F Distribution

- Parametric Statistics
    - T-test
        - One Sample T-Test
        - Two Sample T-Test
        - Paired T-test
    - Z-test
        - One Sample Z-test
        - Two Sample Z-test
    - Chi-square test
    - Levene's Test
    - ANOVA
        - One-way
        - Two-way
    - Shapiro-Wilk test

- Non-parametric Statistics


## Probability Distributions

A probability distribution is a mathematical function that gives the probabilities of occurrence of the different possible outcomes of an experiment; it is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events (subsets of the sample space).

The sample space, often denoted by $\Omega$ , is the set of all possible outcomes of a random phenomenon being observed; it may be any set: a set of real numbers, a set of vectors, a set of arbitrary non-numerical values, etc. 

To define probability distributions for the specific case of random variables (so the sample space can be seen as a numeric set), it is common to distinguish between **discrete** and **continuous** random variables.
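
To make the sample space, events, and a discrete random variable concrete, here is a small sketch using only the standard library. The two-dice experiment and the names `omega`, `event_sum_7`, and `X` are illustrative choices, not something prescribed by the text above.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of rolling two six-sided dice.
omega = set(product(range(1, 7), repeat=2))

# An event is a subset of the sample space; here, "the two faces sum to 7".
event_sum_7 = {outcome for outcome in omega if sum(outcome) == 7}

# With equally likely outcomes, P(E) = |E| / |Omega|.
p_sum_7 = Fraction(len(event_sum_7), len(omega))

# A discrete random variable maps each outcome to a number; here, the sum of the faces.
X = {outcome: sum(outcome) for outcome in omega}
values_of_X = sorted(set(X.values()))

print(len(omega), p_sum_7)  # 36 1/6
print(values_of_X)          # the integers 2 through 12
```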

### General Definition 
A probability distribution can be defined in several ways. One of the most general descriptions, which applies to both continuous and discrete variables, is by means of a probability function $P\colon \mathcal{A} \rightarrow \mathbb{R}$ whose input space $\mathcal{A}$ is related to the sample space, and which gives a probability as its output.

<!-- do I elaboration-->
It is important to note that a probability function only characterizes a probability distribution if it satisfies the Kolmogorov axioms, as follows:
1. ${\displaystyle P(X\in E)\geq 0\;\forall E\in {\mathcal {A}}}$, so the probability is non-negative;

2. ${\displaystyle \sup _{E\in {\mathcal {A}}}P(X\in E)=1}$, so no probability exceeds 1; and

3. ${\displaystyle P(X\in \bigsqcup _{i}E_{i})=\sum _{i}P(X\in E_{i})}$ for any disjoint family of sets  $\{E_{i}\}$.
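
The three axioms can be checked by brute force for a small discrete example. The sketch below assumes a fair six-sided die (a hypothetical example chosen for illustration) and uses exact fractions so that the equalities hold without rounding error.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical example: a fair six-sided die with equal probability on each face.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event):
    """P(X in E): add up the probability mass of the outcomes in the event."""
    return sum((pmf[x] for x in event), Fraction(0))

# Enumerate every event, i.e. every subset of the six outcomes (2**6 = 64 sets).
outcomes = list(pmf)
events = [
    frozenset(subset)
    for subset in chain.from_iterable(
        combinations(outcomes, r) for r in range(len(outcomes) + 1)
    )
]

# Axiom 1: every event has non-negative probability.
axiom_1 = all(prob(e) >= 0 for e in events)

# Axiom 2: the largest probability over all events is exactly 1.
axiom_2 = max(prob(e) for e in events) == 1

# Axiom 3: probabilities add over disjoint events (checked here for one pair).
a, b = frozenset({1, 2}), frozenset({5})
axiom_3 = prob(a | b) == prob(a) + prob(b)

print(axiom_1, axiom_2, axiom_3)  # True True True
```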

Probability distributions are generally divided into two classes: (1) Discrete Probability Distribution and (2) Continuous Probability Distribution. 

A probability distribution whose sample space is one-dimensional (e.g. $\mathbb{R}$, $\mathbb{N}$) is called *univariate*, while a distribution whose sample space is a vector space of dimension greater than 1 is called *multivariate*. A univariate distribution gives the probabilities of a single random variable taking on various alternative values; a multivariate distribution (a joint probability distribution) gives the probabilities of a random vector (a list of two or more random variables) taking on various combinations of values.

Commonly encountered Univariate Distributions:
- [Binomial Distribution](https://en.wikipedia.org/wiki/Binomial_distribution)
- [Hypergeometric Distribution](https://en.wikipedia.org/wiki/Hypergeometric_distribution)
- [Normal Distribution](https://en.wikipedia.org/wiki/Normal_distribution)

Commonly encountered Multivariate Distributions:
- [Multivariate Normal Distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)
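
The difference between a univariate and a multivariate distribution can be sketched with `scipy.stats.norm` and `scipy.stats.multivariate_normal`. The zero mean vector and identity covariance below are illustrative assumptions; with them, the two components are independent, so the joint density is the product of the two marginal densities.

```python
import numpy as np
from scipy import stats

# Univariate: a standard normal random variable takes values in R.
univariate = stats.norm(loc=0.0, scale=1.0)
density_1d = univariate.pdf(0.0)

# Multivariate: a random vector of two independent standard normals takes values in R^2.
bivariate = stats.multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
density_2d = bivariate.pdf([0.0, 0.0])

# Each univariate draw is a single number; each bivariate draw is a pair of numbers.
draws_1d = univariate.rvs(size=5, random_state=0)
draws_2d = bivariate.rvs(size=5, random_state=0)

print(density_1d, density_2d)
print(draws_1d.shape, draws_2d.shape)
```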

### Key Concepts

Functions for discrete variables
- **Probability Function** - describes the probability $P(X \in E)$ that the event $E$, from the sample space, occurs.

- **Probability Mass Function** - function that gives the probability that a discrete random variable is equal to some value.

- **Frequency Distribution** - a table that displays the frequency of various outcomes **in a sample**.

- **Relative Frequency Distribution** - a frequency distribution where each value has been divided (normalized) by the number of outcomes in the sample, i.e. the sample size.

- **Discrete Probability Distribution Function** - general term for the way the total probability of 1 is distributed over **all** possible outcomes (i.e. over the entire population) of a discrete random variable.

- **Cumulative Distribution Function** -  function evaluating the probability that $X$ will take a value less than or equal to $x$ for a discrete random variable.
- **Categorical Distribution** - a distribution for a discrete random variable with a finite set of values.
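
The discrete functions above can be illustrated with `scipy.stats.binom`. This is a minimal sketch; the choice of 10 tosses of a fair coin, the sample size of 1000, and the seed are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Illustrative choice: number of heads in 10 tosses of a fair coin.
n, p = 10, 0.5
binom = stats.binom(n, p)

# Probability mass function: P(X = 3).
pmf_3 = binom.pmf(3)

# Cumulative distribution function: P(X <= 3), the sum of the pmf over 0..3.
cdf_3 = binom.cdf(3)

# The pmf distributes a total probability of 1 over the support 0..n.
support = np.arange(n + 1)
total = binom.pmf(support).sum()

# Frequency distribution of a simulated sample: how often each value occurred.
sample = binom.rvs(size=1000, random_state=42)
values, counts = np.unique(sample, return_counts=True)

# Relative frequency distribution: counts divided by the sample size.
relative = counts / sample.size

print(pmf_3, cdf_3, total)
for value, count, rel in zip(values, counts, relative):
    print(value, count, rel)
```

The frequency and relative frequency distributions describe the sample, whereas the pmf and cdf describe the whole population; as the sample grows, the relative frequencies tend towards the pmf values.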

Functions for continuous variables

- **Probability Density Function** - function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.

- **Continuous Probability Distribution Function** - a term most often reserved for the distribution of a continuous random variable.

- **Cumulative Probability Distribution Function** - function evaluating the probability that $X$ will take a value less than or equal to $x$ for a continuous random variable.

- **Quantile function** - the inverse of the cumulative distribution function. Gives $x$ such that, with probability $q$, $X$ will not exceed $x$.
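
For a continuous variable, the same ideas map onto the `pdf`, `cdf`, and `ppf` methods of a `scipy.stats` distribution (`ppf`, the "percent point function", is SciPy's name for the quantile function). The sketch below uses the standard normal distribution as an illustrative choice.

```python
from scipy import stats

std_normal = stats.norm(loc=0.0, scale=1.0)

# Probability density function: a relative likelihood, not a probability.
pdf_0 = std_normal.pdf(0.0)

# A density can exceed 1, as with this narrow normal distribution.
narrow_pdf_0 = stats.norm(loc=0.0, scale=0.1).pdf(0.0)

# Cumulative distribution function: P(X <= 1.96).
cdf_196 = std_normal.cdf(1.96)

# Probabilities of intervals are differences of the cdf: P(-1 <= X <= 1).
p_interval = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

# Quantile function: the x such that P(X <= x) = 0.975.
q_975 = std_normal.ppf(0.975)

# The quantile function inverts the cdf, so the round trip returns the input.
round_trip = std_normal.cdf(std_normal.ppf(0.3))

print(pdf_0, narrow_pdf_0)
print(cdf_196, p_interval)
print(q_975, round_trip)
```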

----
Basic Terms 
- Mode
    - discrete random variable - the value with highest probability
    - continuous random variable - a location at which the probability density function has a local peak.

- Support - set of values that can be assumed with non-zero probability by the random variable

- Tail - the regions close to the bounds of the random variable, where the pmf or pdf is relatively low.

- Head -  the region where the pmf or pdf is relatively high.

- Expected value (or mean) - the weighted average of the possible values, using their probabilities as their weights; or the continuous analog thereof.

- Median - the value such that the set of values less than the median, and the set greater than the median, each have probabilities no greater than one-half.

- Variance - the second moment of the pmf or pdf about the mean; an important measure of the dispersion of the distribution.

- Standard Deviation - the square root of the variance

- Symmetry - a property of some distributions in which the portion of the distribution to the left of a specific value (usually the median) is a mirror image of the portion to its right.

- Skewness - a measure of the extent to which a pmf or pdf "leans" to one side of its mean. The third standardized moment of the distribution.

- Kurtosis - a measure of the "fatness" of the tails of a pmf or pdf. The fourth standardized moment of the distribution.
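
Most of these quantities are available as methods on a `scipy.stats` distribution. The sketch below uses the exponential distribution with scale 1 as an illustrative, right-skewed example. Note that SciPy reports *excess* kurtosis (the fourth standardized moment minus 3), so 3 is added back to match the definition given above.

```python
import numpy as np
from scipy import stats

# Illustrative choice: exponential distribution with scale 1 (right-skewed).
expon = stats.expon(scale=1.0)

# Mean, variance, skewness, and excess kurtosis in one call.
mean, var, skew, excess_kurt = (float(m) for m in expon.stats(moments="mvsk"))

# The fourth standardized moment is the excess kurtosis plus 3.
kurtosis = excess_kurt + 3.0

# Median and standard deviation.
median = expon.median()
std = expon.std()

# Support: the interval of values the variable can take.
lower, upper = expon.support()

print(mean, median, var, std)
print(skew, kurtosis)
print(lower, upper)
```

Here the mean exceeds the median and the skewness is positive, which is what a long right tail looks like in these summary measures.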