# DSI Notes

## Measures of Dispersion
---
<span style="color:green">Range</span> = difference between the maximum and minimum values  
<span style="color:green">Variance</span> = average squared distance of data from the mean
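
A minimal sketch of both measures, assuming a small made-up list `data` (not from these notes). The variance defined above divides by $n$ (population variance); the sample variance returned by `statistics.variance` divides by $n-1$ instead.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, for illustration only

# Range: difference between the maximum and minimum values
data_range = max(data) - min(data)

# Population variance: average squared distance from the mean (divide by n)
mean = statistics.fmean(data)
pop_var = sum((x - mean) ** 2 for x in data) / len(data)

# Sample variance divides by n - 1 rather than n
sample_var = statistics.variance(data)

print(data_range, pop_var, sample_var)
```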
## Covariance & Correlation
---
<span style="color:green">Covariance</span> - how two variables vary with respect to one another  
<span style="color:green">Correlation</span> - normalizes and removes units from covariance
- Correlation coefficients are **always** between -1 and 1
- Values close to -1 or 1 ==> strong linear relationship
- Values close to 0 ==> weak or no linear relationship (a non-linear relationship may still exist)
- Values < 0 ==> negative relationship
- Values > 0 ==> positive relationship    
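
The properties above can be checked by hand. This is a sketch only: `x` and `y` are made-up measurements, and `covariance` / `correlation` are illustrative helper names, not library functions.

```python
import math

x = [1, 2, 3, 4, 5]             # hypothetical measurements
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical, roughly 2 * x


def covariance(a, b):
    """Sample covariance (divides by n - 1)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


r = correlation(x, y)                            # close to 1: strong positive
r_neg = correlation(x, [-v for v in y])          # same strength, negative sign
r_rescaled = correlation([100 * v for v in x], y)  # changing units leaves r unchanged
```

Covariance changes when the units change (multiplying `x` by 100 multiplies the covariance by 100), while the correlation stays the same.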

## Central Limit Theorem & Confidence Intervals  
---
<span style="color:green">Standard Error</span> - Uncertainty of sample mean relative to true population mean  
$$SE = \frac{s}{\sqrt{n}}$$
**CLT Properties**
- If x is a normal random variable ==> the mean of a sample of x is normally distributed
- If x is not a normal random variable ==> the mean of a sample of x is approximately normally distributed as long as the sample size is large enough (rule of thumb: n >= 30)  
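
A small simulation can make the second property concrete. This sketch assumes an exponential distribution with rate 1 (strongly skewed, mean 1, standard deviation 1) as the non-normal source; the seed and the number of repeated samples are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

n = 30            # sample size (the rule-of-thumb threshold)
n_samples = 5000  # number of repeated samples (arbitrary)

# Draw many samples of size n and record each sample mean
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

# Spread of the sample means vs. the standard error sigma / sqrt(n), sigma = 1
observed_spread = statistics.stdev(sample_means)
predicted_se = 1.0 / n ** 0.5

print(statistics.fmean(sample_means), observed_spread, predicted_se)
```

The sample means cluster around the population mean, and their spread should land close to the standard error formula.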

**Confidence Interval**  
$$ CI = \bar{x} \pm t^* \frac{s}{\sqrt{n}} $$
This is an interval (or range) of values that we believe contains the "true" mean $\mu$. If we resampled the data many, many times and built a 95% confidence interval from each sample, about 95% of those intervals would contain the true mean.
- Range of possible values for the true mean based on a sample mean
- Uncertainty related to a sample

<span style="color:green"> Z-score </span> is chosen to establish the level of confidence (it stands in for $t^*$ when the sample is large)  
- 90% confidence => z = 1.645
- 95% confidence => z = 1.96
- 99% confidence => z = 2.576
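
The critical values can be recovered from the standard normal distribution. A minimal sketch, assuming a made-up `sample` and using z in place of $t^*$ (the standard library has no t quantile function, so for a sample this small the interval comes out slightly too narrow). `z_for_confidence` and `confidence_interval` are illustrative names.

```python
import statistics
from statistics import NormalDist


def z_for_confidence(level):
    """Two-sided critical value: leaves (1 - level) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def confidence_interval(sample, level=0.95):
    """Normal-approximation interval: x_bar +/- z * s / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5  # standard error
    z = z_for_confidence(level)
    return x_bar - z * se, x_bar + z * se


sample = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.1, 4.9]  # hypothetical data
low, high = confidence_interval(sample)
print(round(low, 3), round(high, 3))
```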
    

## Distributions
---
**Normal (Gaussian)** 
- Averages of samples of observations of random variables independently drawn from independent distributions become normally distributed when the number of observations is sufficiently large
- Things that are measured in nature

**Poisson**  
- the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event
- If you watch for a set duration, how many events do you expect to see?

**Exponential**  
- The probability distribution that describes the time between events in a Poisson point process, i.e. a process in which events occur continuously and independently at a constant average rate.
- Given some average rate of success, how long do you have to wait before the next success?
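
The Poisson and Exponential descriptions are two views of the same process: seeing zero events in a window of length $t$ is the same as waiting longer than $t$ for the next event. A sketch under an assumed rate of 3 events per hour (a made-up number); the function names are illustrative.

```python
import math


def poisson_pmf(k, rate, duration=1.0):
    """P(exactly k events in `duration`) when events arrive at `rate` per unit time."""
    lam = rate * duration  # expected number of events in the window
    return math.exp(-lam) * lam ** k / math.factorial(k)


def exponential_survival(t, rate):
    """P(waiting longer than t for the next event)."""
    return math.exp(-rate * t)


rate = 3.0  # hypothetical: 3 events per hour on average

p_zero_events = poisson_pmf(0, rate, duration=0.5)  # no events in half an hour
p_wait_longer = exponential_survival(0.5, rate)     # next event takes > half an hour

print(p_zero_events, p_wait_longer)
```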

**Binomial**  
- For the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed total number of independent occurrences
- Repetitions of Bernoulli trials

**Uniform/Flat**
- Values are equally likely to be observed
- Every outcome is equally probable

**Bernoulli**  
- The probability distribution of any single experiment that asks a yes–no question; the question results in a boolean-valued outcome
- A single trial with exactly two possible outcomes (success or failure)



## Frequentist Hypothesis Testing
---
**<span style="color:green">Null Hypothesis</span>** - has no effect  
$H_0$: The mean difference between treatment and control groups is zero
$$ H_0: P_A = P_B$$
**<span style="color:green">Alternative Hypothesis</span>** - has an effect  
$H_1$: The mean difference between treatment and control groups is different from zero
$$H_1: P_A \neq P_B$$
**<span style="color:green">t-test</span>** - checks whether a measured difference is large compared with the random variation in the data  
- The numerator: the difference between the group means. Recall that our assumed mean difference is 0 (our null hypothesis)
- The denominator: the square root of the pooled sample variance divided by the sample size (for two groups, $\sqrt{s_p^2 (1/n_1 + 1/n_2)}$, where $s_p^2$ is the pooled variance). This is the standard error of the difference in means

**<span style="color:green">t-statistic</span>** - measure of the degree by which our groups differ   
We have $\bar{x} \approx \mu$. Let's calculate a new statistic that measures how far our sample mean, $\bar{x}$, is from the population mean, $\mu$ (as assumed under $H_0$), normalized by our estimate of the standard error, $s/\sqrt{n}$:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
**<span style="color:green">p-value</span>** - metric that indicates how easily our measured difference could arise from random chance in the sampling of subjects
- The probability that, given the null hypothesis (H0) is true, we could have ended up with a statistic at least as extreme as the one measured from our random sample of data from the true population

In practice we use a t-distribution rather than normal distribution  
- Degrees of Freedom
    - One sample: dof = $n-1$
    - Two samples: dof = $n_1 + n_2 - 2$  
- More conservative than the normal distribution (fatter tails, so wider intervals for small samples)
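
The formulas and degrees of freedom above can be sketched as follows. The `treatment` and `control` values are made up, and the helper names are illustrative; the two-sample version assumes the pooled-variance form described earlier. Turning $t$ into a p-value needs the t-distribution CDF, which is not in the standard library, so that step is left out here.

```python
import math
import statistics


def one_sample_t(sample, mu):
    """t = (x_bar - mu) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.fmean(sample) - mu) / se, n - 1


def two_sample_pooled_t(a, b):
    """Pooled-variance t for two groups, with n1 + n2 - 2 degrees of freedom."""
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / dof
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # SE of the difference in means
    return (statistics.fmean(a) - statistics.fmean(b)) / se, dof


treatment = [12.1, 11.8, 12.6, 12.9, 12.3]  # hypothetical measurements
control = [11.2, 11.9, 11.5, 11.1, 11.8]    # hypothetical measurements

t_one, dof_one = one_sample_t(treatment, mu=12.0)
t_two, dof_two = two_sample_pooled_t(treatment, control)
print(t_one, dof_one, t_two, dof_two)
```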

### Signal and Noise
Another way to think about the t-statistic is in terms of the <span style="color:green">signal to noise ratio</span> in our data.

<span style="color:green"> Signal </span> - measured difference. This is our measured mean difference between groups minus the mean difference hypothesized under $H_0$.

<span style="color:green"> Noise </span> - variation in our data, how much our measurements vary across the groups. The t-distribution also imposes an additional penalty for smaller sample sizes by "fattening the tails" of the distribution when the number of observations is small.

# Bayesian Statistics
<span style="color:green"> Probability </span> - How likely is an outcome
- 1 - Will definitely occur
- 0 - Will not occur
- $1-p$ - Chance that an outcome with probability $p$ does not happen

<span style="color:green"> Non-negativity </span> 
$$P(A) \geq 0$$
<span style='color:green'>Unit Measure</span> - Over all possible events for a given case, total probability is 1  
<span style='color:green'>Additivity</span> - For mutually exclusive events, the probability of any one of them occurring = sum of their individual probabilities

## Probabilities of Two Independent Events
<span style="color:green"> Independence </span> - two events do not influence each other  
For independent events _A_ and _B_:  
How likely it is that both happen at the same time
$$p(A \cap B) = p(A) \cdot p(B)$$
How likely it is that at least one of them happens
$$p(A \cup B) = p(A) + p(B) - p(A \cap B)$$
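
Both formulas can be checked by brute-force counting. A sketch, assuming two fair six-sided dice as the example (not from these notes): event A is "first die shows 6" and event B is "second die is even". `prob` is an illustrative helper.

```python
from fractions import Fraction
from itertools import product

# Sample space: every (first die, second die) outcome, all equally likely
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event = favourable outcomes / total outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def a(o):
    return o[0] == 6      # A: first die shows 6


def b(o):
    return o[1] % 2 == 0  # B: second die is even


p_a, p_b = prob(a), prob(b)
p_both = prob(lambda o: a(o) and b(o))   # intersection
p_either = prob(lambda o: a(o) or b(o))  # union

print(p_a, p_b, p_both, p_either)
```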
## Probabilities of Two Dependent Events
If $p(A|B)=0$, events are mutually exclusive  
If $p(A|B) = p(A)$, events are independent. Chance of $A$ does not change with $B$
$$p(B|A) = \frac{p(B \cap A)}{p(A)}$$
In general (whether or not the events are independent), the joint probability is:
$$p(A \cap B) = p(A) \cdot p(B|A)$$

### Bayes Theorem
#### $$p(A|B) = \frac{p(A) \cdot p(B|A)}{p(B)}$$
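
A worked sketch of the theorem with made-up numbers: A = "has the condition" and B = "test is positive". The denominator $p(B)$ is expanded as $p(A) \cdot p(B|A) + p(\text{not } A) \cdot p(B|\text{not } A)$ (the law of total probability, which these notes do not state explicitly). `bayes_posterior` is an illustrative name.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """p(A|B) = p(A) * p(B|A) / p(B), with p(B) summed over A and not-A."""
    p_b = prior_a * p_b_given_a + (1 - prior_a) * p_b_given_not_a
    return prior_a * p_b_given_a / p_b


# Hypothetical numbers: 1% prevalence, 95% true-positive rate, 5% false-positive rate
posterior = bayes_posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(posterior)
```

Even with a fairly accurate test, the posterior stays well below one half here because the prior $p(A)$ is so small.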

## Combinatorics
|  | Combination | Permutation |
| :--- | :---:| :---: |
| **With Replacement** | $$ \frac{(n + k -1)!}{k! (n-1)!}$$ | $$n^k$$ |
| **Without Replacement** | $$\frac{n!}{(n-k)! \cdot k!}$$ | $$\frac{n!}{(n - k)!}$$ |

Here $n$ is the number of items to choose from and $k$ is the number chosen.  
<span style="color:green">Permutations </span> - every distinct ordering of results counts as a different outcome: **order matters**  
<span style='color:green'>Combinations</span> - different orderings of the same items count as one outcome: **order doesn't matter**
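
The four table entries can be cross-checked against brute-force enumeration with `itertools`. A sketch, assuming $n = 5$ and $k = 3$ as arbitrary example values.

```python
import math
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)

n, k = 5, 3  # hypothetical: choose 3 items from 5
items = range(n)

# Closed-form counts from the table
formulas = {
    "combination, with replacement": math.comb(n + k - 1, k),
    "permutation, with replacement": n ** k,
    "combination, without replacement": math.comb(n, k),
    "permutation, without replacement": math.perm(n, k),
}

# The same four counts by listing every possibility
enumerated = {
    "combination, with replacement": len(list(combinations_with_replacement(items, k))),
    "permutation, with replacement": len(list(product(items, repeat=k))),
    "combination, without replacement": len(list(combinations(items, k))),
    "permutation, without replacement": len(list(permutations(items, k))),
}

for name, count in formulas.items():
    print(name, count, enumerated[name])
```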

## Probability Distributions
**Probability**: asks how likely some event (or set of events) is to occur  
**Combinatorics**: enumerates how many different ways a set of things can be arranged   
A <span style="color:green">probability distribution </span> is the set of different cases or groups of events, together with how likely each one is  
<span style="color:green">Binomial Distribution </span> - gives the probability of an outcome described by:
- A given number of trials
- A given number of successes
- The **probability** of success on each trial

and, by iterating through every possible number of successes for that number of trials, we can see the full set of possibilities!
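
A minimal sketch of that iteration, assuming 10 fair coin flips ($n = 10$, $p = 0.5$) as a made-up example. The count of ways to get $k$ successes is the "combination without replacement" entry from the Combinatorics table; `binomial_pmf` is an illustrative name.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.5  # hypothetical: 10 fair coin flips

# Iterate over every possible number of successes to get the full distribution
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}
most_likely = max(distribution, key=distribution.get)

for k, prob_k in distribution.items():
    print(k, round(prob_k, 4))
```

The probabilities across all values of $k$ sum to 1, as required by the unit measure property.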