
#Resources
[NumPy Documentation](https://docs.scipy.org/doc/numpy-1.16.1/reference/)

[scipy.stats Documentation](https://docs.scipy.org/doc/scipy-1.2.1/reference/tutorial/stats.html)


#Definitions
**Set**: A collection of distinct entities regarded as a unit, being either individually specified or (more usually) satisfying specified conditions. See the Python class [`set()`](https://docs.python.org/3/library/stdtypes.html#set)

**Subset**: If all members of a set `A` are also members of another set `B`, then `A` is a subset of `B`

**Empty Set**: A set without any members. The empty set is a subset of every set

**Universal Set**: The set of all elements under consideration in a given context. Every other set in that context is a subset of it
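
The set definitions above map directly onto Python's built-in `set` type. The sketch below uses made-up example values; the variable names (`universe`, `evens`, `small_evens`, `empty`) are illustrative only, and the "universal set" is simply whatever set we choose to treat as the full context.

```python
# Hypothetical example values, chosen only to illustrate the definitions.
universe = set(range(1, 11))    # universal set for this example: integers 1..10
evens = {2, 4, 6, 8, 10}
small_evens = {2, 4}
empty = set()                   # note: {} creates a dict, not an empty set

print(small_evens <= evens)         # subset test with the <= operator
print(empty.issubset(small_evens))  # the empty set is a subset of every set
print(evens <= universe)            # every set here is a subset of the universe

# Complement of evens relative to the universal set
odds = universe - evens
print(sorted(odds))
```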

**Combination**: A selection of `r` elements from a set of `n` elements, without regard to their arrangement: $nCr = \frac{n!}{r!(n-r)!}$

**Permutation**: A selection of `r` elements from a set of `n` elements, where the ordering matters: $nPr = \frac{n!}{(n-r)!}$
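
As a quick check of the two formulas, the sketch below evaluates them for the example values `n = 5` and `r = 3` (chosen arbitrarily) and compares the results with `math.comb` and `math.perm`, which are available in Python 3.8 and later.

```python
import math

n, r = 5, 3  # arbitrary example values

# Direct evaluation of the formulas
ncr_formula = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))
npr_formula = math.factorial(n) // math.factorial(n - r)

# Standard-library equivalents (Python 3.8+)
ncr = math.comb(n, r)
npr = math.perm(n, r)

print(ncr, npr)  # 10 60
```

Each combination of 3 elements can be ordered in `3! = 6` ways, which is why the permutation count is 6 times the combination count here.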

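**Degrees of Freedom**: The number of independent values that are free to vary when computing a statistic. For example, estimating a variance from `n` observations leaves `n - 1` degrees of freedom, because the deviations from the sample mean must sum to zero.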

**Sample**: A group of individual observations drawn from a population, usually at random. The sample mean is used as an estimate of the population mean; the two are generally not equal, but the larger the sample size, the closer the sample mean tends to be to the population mean.
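
The effect of sample size can be seen with a small simulation. This is only a sketch: the population is synthetic (normally distributed with an assumed mean of 50 and standard deviation of 10), and the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic population; the distribution and its parameters are assumptions
population = rng.normal(loc=50, scale=10, size=100_000)

# Absolute gap between sample mean and population mean, per sample size
errors = {}
for n in (10, 1_000, 50_000):
    sample = rng.choice(population, size=n, replace=False)
    errors[n] = abs(sample.mean() - population.mean())
    print(n, errors[n])
```

Any single small sample can land close to the population mean by chance, so the gap is not guaranteed to shrink at every step; it shrinks on average as `n` grows.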

**Null Hypothesis**: A general statement or default position that there is no relationship between two measured phenomena, or no association among groups. The "boring" choice: nothing special is happening.

**Statistical Significance**: A result is statistically significant when data at least as extreme would be very unlikely if the null hypothesis were true (conventionally, a p-value < .05)

**Student's T-test**: A family of statistical hypothesis tests used when the data are approximately normally distributed (a very typical situation) but the population standard deviation is unknown and must be estimated from the sample (typical for many real-world situations). Under the null hypothesis, the test statistic follows a Student's t-distribution. It relaxes the Z-test, which assumes a known population standard deviation. The larger the absolute value of the t statistic, the more 'unusual' the result under the null hypothesis, while the p-value determines whether that result is considered statistically significant, in which case the null hypothesis is rejected (not disproved).

Two common t-tests are:
- A [one-sample test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html#scipy.stats.ttest_1samp) of whether the mean of a population has a value specified in a null hypothesis.
- A [two-sample test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind) of the null hypothesis that the means of two populations are equal.

In [0]:
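# One-sample t-test: does the subset's mean differ from the full data set's mean?
from scipy import stats

# df_subset and df are pandas DataFrames that share a 'feature' column
stats.ttest_1samp(df_subset['feature'], df['feature'].mean())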

In [0]:
# Two-sample t-test: do the two groups have different means?
from scipy import stats

stats.ttest_ind(df1['feature'], df2['feature'])
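
The cells above depend on DataFrames that are not defined in this notebook. As a self-contained sketch, the example below runs both tests on synthetic data; the group names, means, and sizes are assumptions made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Synthetic groups; the true means differ by 10 (an assumption for this demo)
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=110, scale=15, size=200)

# One-sample: is the mean of group_a consistent with a hypothesized value of 100?
one_sample = stats.ttest_1samp(group_a, popmean=100)
print(one_sample.statistic, one_sample.pvalue)

# Two-sample: do group_a and group_b have the same mean?
two_sample = stats.ttest_ind(group_a, group_b)
print(two_sample.statistic, two_sample.pvalue)
```

`ttest_ind` assumes equal population variances by default; passing `equal_var=False` switches to Welch's t-test, which drops that assumption.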

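**Chi Square Test**: Compares observed counts with expected counts, usually for categorical variables. `chisquare` runs a one-way goodness-of-fit test (null hypothesis: the observed frequencies match the expected ones), while `chi2_contingency` tests whether the row and column variables of a contingency table are independent (null hypothesis: they are independent).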
- [Chi Square Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html)
- [Chi Square Contingency Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)  Related: [Contingency Table](https://en.wikipedia.org/wiki/Contingency_table)

In [0]:
from scipy.stats import chisquare

# One-way goodness-of-fit test. The null hypothesis is that the observed
# frequencies match the expected ones (all categories equally likely by default).
# A large chi-square statistic (small p-value) is evidence against the null.
chisquare(observations, axis=None)

In [0]:
from scipy.stats import chi2_contingency

# Chi-square test of independence from a contingency table.
# Null hypothesis: the row and column variables are independent.
chi_squared, p_value, dof, expected = chi2_contingency(contingency_table)
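
To make the independence test concrete, here is a sketch on a made-up 2x2 contingency table (the counts are hypothetical). The `expected` array holds the counts that independence would predict from the row and column totals.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are two groups, columns are two outcomes
contingency_table = np.array([[30, 10],
                              [20, 40]])

chi_squared, p_value, dof, expected = chi2_contingency(contingency_table)

print(chi_squared, p_value)
print(dof)       # (rows - 1) * (columns - 1) = 1
print(expected)  # row total * column total / grand total, per cell
```

For 2x2 tables `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the statistic is slightly smaller than the uncorrected chi-square value.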

**Confidence Interval**: Given a sample, a confidence interval is a calculated range of plausible values for an unknown population parameter (such as the mean). A 95% confidence level means that, if sampling were repeated many times, about 95% of the intervals calculated this way would contain the true parameter; it does not mean that a single computed interval has a 95% chance of containing it.

In [0]:
# Confidence intervals
import numpy as np
from scipy import stats

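def confidence_interval(data, confidence=0.95):
  """Return (mean, lower bound, upper bound) of a t-based interval for the mean."""
  data = np.array(data)
  mean = np.mean(data)
  n = len(data)

  stderr = stats.sem(data)  # standard error of the mean
  # Margin of error: standard error times the two-sided t critical value
  interval = stderr * stats.t.ppf((1 + confidence) / 2., n - 1)

  return (mean, mean - interval, mean + interval)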
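
A quick simulation can check how often a t-based interval captures the true mean. This is a sketch under stated assumptions: the helper `mean_ci` is a hypothetical stand-alone function written for this example, and the data are synthetic draws from a normal distribution with a known mean of 5.

```python
import numpy as np
from scipy import stats

def mean_ci(data, confidence=0.95):
    """Hypothetical helper: t-based (lower, upper) interval for the mean."""
    data = np.asarray(data, dtype=float)
    mean = data.mean()
    margin = stats.sem(data) * stats.t.ppf((1 + confidence) / 2, len(data) - 1)
    return mean - margin, mean + margin

rng = np.random.default_rng(seed=1)
true_mean = 5.0   # known only because the data are simulated
trials = 2_000
hits = 0

for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=2.0, size=30)
    lower, upper = mean_ci(sample)
    hits += lower <= true_mean <= upper  # count intervals that contain the true mean

coverage = hits / trials
print(coverage)  # should land near the 0.95 confidence level
```

The fraction of intervals containing the true mean approximates the confidence level, which is the repeated-sampling sense in which a 95% interval is "95%".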

#Bayesian Inference

**Conditional Probability** - A measure of the probability of an event (some particular situation occurring) given that another event has occurred. It is defined whenever $P(B) > 0$: $$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
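
The definition can be checked by counting outcomes. The sketch below uses a single roll of a fair six-sided die, with two events made up for illustration: `A` is "the roll is greater than 3" and `B` is "the roll is even".

```python
from fractions import Fraction

outcomes = set(range(1, 7))               # one roll of a fair six-sided die
A = {x for x in outcomes if x > 3}        # {4, 5, 6}
B = {x for x in outcomes if x % 2 == 0}   # {2, 4, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# P(A|B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 2/3
```

Knowing the roll is even raises the probability of `A` from 1/2 to 2/3, because two of the three even outcomes are greater than 3.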

**Bayes Theorem** - Describes the probability of an event, based on prior knowledge of conditions that might be related to the event. $$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$  The probability of $A$ conditioned on $B$ is the probability of $B$ conditioned on $A$, times the probability of $A$, divided by the probability of $B$. Here $P(A)$ is the "prior belief" about $A$ before observing $B$, $P(B|A)$ is the likelihood of $B$ given $A$, and $P(A|B)$ is the "updated" (posterior) belief after observing $B$.
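
A worked example helps show how the update behaves. The numbers below are hypothetical (a condition with 1% prevalence and a test with a 95% true-positive rate and a 5% false-positive rate), and $P(B)$ is obtained from the law of total probability, which the text above does not state explicitly.

```python
# Hypothetical inputs, chosen only for illustration
p_disease = 0.01             # P(A): prior probability of the condition
p_pos_given_disease = 0.95   # P(B|A): probability of a positive test if ill
p_pos_given_healthy = 0.05   # P(B|not A): false-positive rate

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_pos, 4))                # 0.059
print(round(p_disease_given_pos, 4))  # 0.161
```

Even with a fairly accurate test, the posterior stays near 16% because the prior is so low: most positive results come from the much larger healthy group.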

In [0]:
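from scipy.stats import bayes_mvs

# Bayesian estimates of the mean, variance, and standard deviation,
# each with a 95% credible interval (alpha is the interval's probability)
bayes_mvs(df['feature'], alpha=.95)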
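
Since `df` is not defined in this notebook, here is a self-contained sketch of the same call on synthetic data (the distribution parameters are assumptions). `bayes_mvs` returns three results, for the mean, variance, and standard deviation, each with a `statistic` and a `minmax` interval.

```python
import numpy as np
from scipy.stats import bayes_mvs

rng = np.random.default_rng(seed=3)
data = rng.normal(loc=10.0, scale=2.0, size=500)  # synthetic data

mean_est, var_est, std_est = bayes_mvs(data, alpha=0.95)

lower, upper = mean_est.minmax  # 95% credible interval for the mean
print(mean_est.statistic, (lower, upper))
print(std_est.statistic, std_est.minmax)
```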