$\color{red}{\LARGE{\text{General Definitions}}}$
- **Random Sample:** A random sample is collected when each member of the sample is chosen from the population strictly by chance.
- **Representative Sample:** A representative sample is a subset of the population that accurately reflects the members of the entire population.



$\color{red}{\LARGE{\text{Descriptive Statistics}}}$

$\color{blue}{\Large{\text{Types of Data}}}$
### Categorical Data
- Examples: Car brands, gender, true or false

### Numerical Data
- Discrete
    - Examples: Number of children in a family, shoe size, number of objects
- Continuous
    - Examples: Height, weight, area, time

$\color{blue}{\Large{\text{Levels of Measurement}}}$

### Qualitative Data
- Nominal (cannot be ordered)
    - Examples: gender, car brands, seasons of the year
- Ordinal (can be ordered)
    - Examples: letter grades
    
### Quantitative Data
- Ratio (has a true 0)
    - Examples: Number of objects, distance, time
- Interval (does not have a true 0)
    - Examples: Temperature (if today is 5C = 41F and yesterday was 10C = 50F, the claim that it is twice as cold today only holds for the Celsius numbers, not the Fahrenheit ones; this stems from 0C and 0F not being **true zeros**, they are points on scales created by humans)
    - Note: 0 Kelvin is considered a true zero, so Kelvin is a ratio variable
    
$\color{blue}{\Large{\text{Types of Graphs}}}$

### Categorical Data
- Frequency distribution tables
- Pie Charts
- Bar Charts
- Pareto Diagrams 
    - This is a special type of bar chart where categories are shown in descending order of frequency; it also has a curve showing the cumulative frequency
    - It shows how subtotals change with each additional category and provides a better understanding of data

### Numerical Data
- Frequency distribution table
    - Data is grouped into intervals and then analysed; statisticians prefer working with 5-20 intervals so the summary is useful
    - Can also have relative frequency column
- Histogram
    - Similar to bar chart, but no spaces between bars to show continuation between intervals
    - Can have unequal interval width

$\color{red}{\LARGE{\text{Univariate Measures}}}$

$\color{blue}{\Large{\text{Measures of Central Tendency}}}$

## Mean
- Denoted $\mu$ for population mean and $\bar{x}$ for sample mean
- **Disadvantage:** Easily affected by outliers
- The mean is not enough to make definite conclusions

## Median
- Middle number in an ordered dataset (at position $\frac{n+1}{2}$, where $n$ is the number of elements in our dataset)
- Not affected by outliers

## Mode
- Value that occurs most often
- No mode if there is no most common value
- Can have multiple modes

Need all 3 measures (or as many as possible) to make best conclusions
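
As a small illustration of why the three measures are read together, here is a minimal sketch using Python's standard `statistics` module. The salary figures are made up for the example; the single large value plays the role of an outlier.

```python
import statistics

# Hypothetical salaries in thousands; the last value is an outlier.
salaries = [30, 32, 32, 35, 38, 40, 250]

mean = statistics.mean(salaries)        # pulled upward by the outlier
median = statistics.median(salaries)    # value at position (n + 1) / 2 = 4
modes = statistics.multimode(salaries)  # list of every most frequent value

print(f"mean={mean:.2f}, median={median}, modes={modes}")
```

The mean lands well above most of the data, while the median and mode stay near the bulk of the values.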

$\color{blue}{\Large{\text{Measures of Asymmetry}}}$

## Sample Skewness Formula
Formula:
![image.png](attachment:image.png)

Skewness indicates whether the data is concentrated on one side 

### Right Skewness means the outliers are to the right 
- Right Skewness is also known as Positive Skewness
- Mode $<$ Median $<$ Mean
### Left Skewness means the outliers are to the left
- Left Skewness is also known as Negative Skewness
- Mean $<$ Median $<$ Mode

If a distribution is not skewed left or right we say it is a symmetrical distribution
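
The sign of the skewness can be checked numerically. The sketch below assumes the moment-based convention $g_1 = m_3 / m_2^{3/2}$, which is only one of several definitions in use and may differ from the formula in the image above by a constant factor; the datasets are made up.

```python
import statistics

def skewness(data):
    """Moment-based skewness g1 = m3 / m2**1.5 (one of several conventions)."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # outlier on the right
left_skewed = [-x for x in right_skewed]    # mirrored: outlier on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed), skewness(left_skewed), skewness(symmetric))
```

Under this convention the right-skewed data gives a positive value, its mirror image a negative one, and the symmetric data gives zero.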

$\color{blue}{\Large{\text{Measures of Variability}}}$

## Variance
- Variance measures the dispersion of a set of data points around their mean
- Sample variance:
$$ s^2 = \frac{\sum^n_{i=1}(x_i - \bar{x})^2}{n-1}$$
- Population variance:
$$ \sigma^2 = \frac{\sum^N_{i=1}(x_i - \mu)^2}{N}$$

The formula for the sample variance is different because we have all the information when examining the population, whereas a sample only approximates it. Dividing by $n-1$ instead of $n$ accommodates the extra uncertainty when working with a sample.

- Disadvantages: The number is usually large, which makes it hard to compare, and its units are squared, which is usually hard to interpret physically.

## Standard Deviation
- To combat the above disadvantages we calculate the standard deviation instead, which is the square root of the variance.
- Standard deviation is the most common measure of variability for a **SINGLE DATASET**
- We use the following for comparing **TWO OR MORE** datasets

## Coefficient of Variation
- $\frac{standard \hspace{2pt} deviation}{mean}$
- It is also called the **relative standard deviation** as it is the standard deviation relative to the mean
- Population: $c_{v} = \frac{\sigma}{\mu}$, Sample: $\hat{c}_{v} = \frac{s}{\bar{x}}$
- This is a unitless number!
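
A minimal sketch of these three measures, using the standard `statistics` module. The two price lists are hypothetical: the second is the first multiplied by 20, as if the same prices were quoted in another currency.

```python
import statistics

# Hypothetical prices of the same items in two currencies.
prices_a = [1, 2, 3, 4, 5]
prices_b = [20, 40, 60, 80, 100]  # same prices on a larger scale

def summarize(data):
    s2 = statistics.variance(data)       # sample variance, divides by n - 1
    sigma2 = statistics.pvariance(data)  # population variance, divides by N
    s = statistics.stdev(data)           # sample standard deviation
    cv = s / statistics.mean(data)       # sample coefficient of variation
    return s2, sigma2, s, cv

for name, data in [("a", prices_a), ("b", prices_b)]:
    s2, sigma2, s, cv = summarize(data)
    print(name, s2, sigma2, round(s, 3), round(cv, 3))
```

The variance and standard deviation differ greatly between the two lists, but the coefficient of variation is the same, which is why it is the measure used to compare datasets.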

$\color{red}{\LARGE{\text{Bivariate Measures}}}$

## Covariance
- Can be positive, negative, or equal to 0
- $Cov(x,y) = Cov(y,x)$
- Sample:
$$s_{xy} = \frac{\sum^n_{i=1}(x_i-\bar{x})(y_i - \bar{y})}{n-1}$$
- Population:
$$\sigma_{xy} = \frac{\sum^N_{i=1}(x_i - \mu_x)(y_i - \mu_y)}{N}$$


- If the covariance is positive, the variables move together
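- If the covariance is 0, the variables have no linear relationship (independent variables have zero covariance, but zero covariance alone does not guarantee independence, e.g. $y = x^2$ with $x$ symmetric around 0)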
- If the covariance is negative, the two variables move in opposite directions


- Disadvantage: Can be difficult to interpret

## Correlation Coefficient (r)
- Correlation adjusts covariance, so that the relationship between the two variables becomes easy and intuitive to interpret
- Sample:
$$\frac{s_{xy}}{s_x s_y}$$
- Population:
$$\frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
- $-1 \leq r \leq 1$
- **r = 1**, is a perfect positive correlation, it means the entire variability of one variable is explained by the other, and as one increases so does the other
- **r = -1**, is a perfect negative correlation, again the entire variability of one variable is explained by the other, but this time as one increases the other decreases
- Correlation is symmetrical with respect to both variables (although the physical implication may not make sense both ways)


- A common practice is to treat a correlation between -0.2 and 0.2 as negligible and disregard it
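
The sample formulas above can be sketched directly in plain Python. The function names `sample_cov` and `sample_corr` and the house data are invented for this illustration.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance s_xy with the n - 1 denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample correlation r = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_cov(xs, xs))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (s_x * s_y)

# Hypothetical house sizes (square metres) and prices (thousands).
house_size = [50, 70, 90, 110, 130]
price = [150, 200, 260, 300, 340]

print(sample_cov(house_size, price))             # depends on the units used
print(round(sample_corr(house_size, price), 4))  # unitless, between -1 and 1
```

The covariance value changes if the units change, while the correlation stays the same, which is what makes it easier to interpret.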

### Causality
**CORRELATION $\neq$ CAUSATION**

Example: There are more murders in summer, and ice cream sales increase in summer, this does not mean an increase in ice cream sales results in more murder!!

$\color{red}{\LARGE{\text{Inferential Stats Fundamentals}}}$

A distribution is determined by its probabilities, a graph is just a visual representation of this.

Adding a value to (or subtracting it from) all data points will change the mean but not the standard deviation.
Dividing (or multiplying) all data points by a value will change both the mean and the standard deviation.
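
A quick numerical check of these two statements, as a sketch on a made-up dataset:

```python
import statistics

data = [2, 4, 4, 6, 9]             # hypothetical data points
shifted = [x + 10 for x in data]   # add a constant to every point
scaled = [x / 2 for x in data]     # divide every point by a constant

for name, values in [("data", data), ("shifted", shifted), ("scaled", scaled)]:
    print(name, statistics.mean(values), round(statistics.stdev(values), 4))
```

Shifting moves the mean by the constant and leaves the standard deviation untouched; dividing by 2 halves both.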

## Normal Distribution
- $X \sim N(\mu, \sigma^2)$
- $z = \frac{x - \mu}{\sigma}$, $Z \sim N(0, 1)$

## Central Limit Theorem
- Since the sample mean differs with each sample, we take a number of samples, compute the mean of each one, and build a distribution from those means; we call this a **sampling distribution of the mean**
- The mean of the sampling distribution of the mean is close to the **population mean** 

#### The sampling distribution of the mean is approximately normal for a sufficiently large sample size, even if the distribution the samples come from is not

If the original distribution has mean $\mu$ and variance $\sigma^2$, then the sampling distribution is approximately $N(\mu, \frac{\sigma^2}{n})$, where $n$ is the sample size. The larger the sample size, the more accurate the results will be.
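
The statement can be checked with a small simulation. This is only a sketch: it assumes an exponential population with rate 1 (mean 1, variance 1, strongly right-skewed), and the sample size, number of samples, and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

n = 40              # size of each sample
num_samples = 5000  # how many samples we draw

# Draw many samples from the skewed population and record each sample mean.
sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)  # should be close to mu = 1
sd_of_means = statistics.stdev(sample_means)    # should be close to sigma / sqrt(n)
expected_sd = 1.0 / math.sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_sd, 3))
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the population itself is far from normal.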

## Standard Error
- This is the standard deviation of the distribution formed by the sample means (recall the variance was $\frac{\sigma^2}{n}$)
$$E = \frac{\sigma}{\sqrt{n}}$$
- This represents the variability of the different sample means
- The standard error decreases as the sample size increases.

**Why is it important?**
- It is used in most statistical tests and it shows how well you approximated the true mean.

## Estimators and Estimates
- A population estimator is an approximation depending solely on sample information
- An estimate is a specific value
- There are two types: Point estimates and Confidence Intervals
- The two are related with the point estimate being at the centre of the confidence interval, however confidence intervals are much more useful
- Example: $\bar{x}$ is an estimator for $\mu$

Estimators can have bias and efficiency
- We want the most efficient unbiased estimator
- An **unbiased estimator** has an expected value equal to the population parameter
- The most efficient estimators have the least variability of outcomes

### Confidence Interval
- Created around a point estimate
- More accurate representation of reality
- $ 0 \leq \alpha \leq 1$
- Confidence level is $1-\alpha$
- Confidence interval: 
$$\left[\bar{x} - (\frac{\sigma}{\sqrt{n}}*z_{\alpha /2}), \hspace{6pt} \bar{x} + (\frac{\sigma}{\sqrt{n}}*z_{\alpha /2})\right]$$
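- $P(Z \leq z_{\alpha /2}) = 1- \alpha /2$, i.e. $z_{\alpha /2}$ is the value whose cumulative probability in the normal distribution table is $1 - \alpha /2$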
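
As a sketch of the interval formula above, the critical value can be taken from `statistics.NormalDist` instead of a table. The function name and the input numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, alpha):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    margin = z * sigma / math.sqrt(n)        # critical value times standard error
    return x_bar - margin, x_bar + margin

# Hypothetical sample: mean 100 from n = 36 observations, known sigma = 15.
low, high = z_confidence_interval(x_bar=100.0, sigma=15.0, n=36, alpha=0.05)
print(round(low, 2), round(high, 2))
```

With $\alpha = 0.05$ the critical value is about 1.96, so the interval is the sample mean plus or minus roughly 1.96 standard errors.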

# Student's T Distribution
This is used when the population variance is unknown

$$t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}, \quad t \sim t_{n-1}$$
Confidence interval:
$$\bar{x} \pm  t_{n-1, \alpha /2} * \frac{s}{\sqrt{n}}$$

**NOTE: In the t-distribution table when it says $\alpha = ...$ this is $\alpha /2$ in our calculations.** There might also be a CI row at the bottom of the table indicating the level of confidence. 
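
A minimal sketch of the t-based interval. The sample is made up, and the critical value $t_{9, 0.025} = 2.262$ is typed in from a t-table rather than computed, since the standard library has no t-distribution.

```python
import math
import statistics

# Hypothetical sample of n = 10 observations.
sample = [78, 85, 90, 72, 88, 95, 81, 79, 84, 88]
t_crit = 2.262  # t_{n-1, alpha/2} for n - 1 = 9 and alpha = 0.05, from a t-table

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)          # sample standard deviation (n - 1 denominator)
margin = t_crit * s / math.sqrt(n)    # critical value times standard error

low, high = x_bar - margin, x_bar + margin
print(round(low, 2), round(high, 2))
```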

# Margin of Error
The margin of error is part of our confidence interval formulas. In the case of the normal distribution the margin of error is:
$$z_{\alpha /2} * \frac{\sigma}{\sqrt{n}}$$
For the t-distribution it is:
$$t_{n-1, \alpha /2}*\frac{s}{\sqrt{n}}$$

A smaller margin of error means a narrower confidence interval. A smaller critical value (also known as the reliability factor) and a smaller standard deviation will lead to a smaller margin of error. Similarly, a larger sample size reduces the margin of error.
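
The effect of the sample size and the confidence level can be seen in a short sketch for the normal case. The value $\sigma = 15$ and the sample sizes are arbitrary.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, alpha=0.05):
    """Margin of error z_{alpha/2} * sigma / sqrt(n) for a known sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma / math.sqrt(n)

# Quadrupling the sample size halves the margin of error.
for n in (25, 100, 400):
    print(n, round(margin_of_error(sigma=15.0, n=n), 3))

# A higher confidence level (smaller alpha) gives a larger critical value.
print(round(margin_of_error(sigma=15.0, n=100, alpha=0.01), 3))
```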

$\color{blue}{\Large{\text{Two Populations}}}$

# Dependent Samples
There are several cases where we can get dependent samples, for example, when researching the same subject over time to measure changes in weight or blood samples. Another case is when investigating couples or families, e.g. habits of husbands and wives, these depend on each other. We can also get dependent samples when one sample affects the outcome of the other, such as SAT scores and college admittance. These are the same people in both samples.

There is a formula for confidence intervals for dependent samples and statistical methods such as regressions are used for analysis.

In biology normality is so often observed that we assume variables such as magnesium levels are normally distributed.

## Confidence Intervals
Confidence interval for difference of two means, dependent samples formula:
$$\bar{d} \pm t_{n-1, \alpha /2}*\frac{s_d}{\sqrt{n}}$$
where $\bar{d}$ is the mean of the differences and $s_d$ is the standard deviation of the differences.
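
A sketch of this interval for a made-up before/after study on the same five subjects. The critical value $t_{4, 0.025} = 2.776$ is taken from a t-table.

```python
import math
import statistics

# Hypothetical weights (kg) of the same five people before and after a programme.
before = [82, 91, 77, 68, 85]
after = [80, 88, 76, 69, 81]
t_crit = 2.776  # t_{n-1, alpha/2} for n - 1 = 4 and alpha = 0.05, from a t-table

differences = [b - a for b, a in zip(before, after)]
n = len(differences)
d_bar = statistics.mean(differences)   # mean of the differences
s_d = statistics.stdev(differences)    # standard deviation of the differences
margin = t_crit * s_d / math.sqrt(n)

low, high = d_bar - margin, d_bar + margin
print(round(low, 3), round(high, 3))
```

Because the interval contains 0 for this data, it would not support a claim that the mean weight changed.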

# Confidence Intervals: Independent Samples

There are 3 cases:
1. Known population variances
2. Unknown population variances but assumed equal
3. Unknown population variances but assumed different

### Variance of difference 
$$\sigma^2_{diff} = \frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}$$

## Type 1 - known variance
Confidence interval:
$$(\bar{x} - \bar{y}) \pm z_{\alpha /2} \sqrt{\frac{\sigma^2_x}{n_x} + \frac{\sigma^2_y}{n_y}}$$

We use z since we know the population variance. 

**BE MINDFUL WHEN LOOKING AT A DATASET IF YOU HAVE THE POP. VARIANCE OR THE POP. STANDARD DEVIATION!!**

## Type 2 - unknown equal variance
We want to estimate the population variance. To do this we use the unbiased estimator: **pooled sample variance** which is given by the formula below
$$s^2_p = \frac{(n_x - 1)s^2_x + (n_y -1)s^2_y}{n_x+n_y-2}$$

Confidence interval:
$$(\bar{x} - \bar{y}) \pm t_{n_x + n_y -2, \alpha /2}\sqrt{\frac{s^2_p}{n_x} + \frac{s^2_p}{n_y}}$$
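
The two formulas above can be sketched from summary statistics alone. The function name and all inputs are hypothetical, and $t_{16, 0.025} = 2.120$ is read from a t-table.

```python
import math

def pooled_ci(x_bar, y_bar, s2_x, s2_y, n_x, n_y, t_crit):
    """CI for the difference of means, unknown but assumed equal variances."""
    # Pooled sample variance: weighted average of the two sample variances.
    s2_p = ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)
    std_error = math.sqrt(s2_p / n_x + s2_p / n_y)
    diff = x_bar - y_bar
    return s2_p, diff - t_crit * std_error, diff + t_crit * std_error

# Hypothetical summary statistics; degrees of freedom = 10 + 8 - 2 = 16.
s2_p, low, high = pooled_ci(x_bar=20.0, y_bar=17.0, s2_x=4.0, s2_y=6.0,
                            n_x=10, n_y=8, t_crit=2.120)
print(s2_p, round(low, 3), round(high, 3))
```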

## Type 3 - unknown different variance
Confidence interval:
$$(\bar{x} - \bar{y}) \pm t_{\nu, \alpha /2} \sqrt{\frac{s^2_x}{n_x} + \frac{s^2_y}{n_y}}$$

The difficult part comes with estimating the degrees of freedom, but luckily we have a formula for this:
$$\nu = \frac{\left(\frac{s^2_x}{n_x} + \frac{s^2_y}{n_y}\right)^2}{\frac{(s^2_x\text{ / }n_x)^2}{(n_x - 1)} + \frac{(s^2_y\text{ / }n_y)^2}{(n_y - 1)}}$$
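
The degrees-of-freedom formula is easy to mistype by hand, so a small helper can be useful. This is a sketch with a made-up function name and made-up summary statistics.

```python
def welch_degrees_of_freedom(s2_x, s2_y, n_x, n_y):
    """Degrees of freedom nu for unknown, unequal population variances."""
    a = s2_x / n_x  # estimated variance of the first sample mean
    b = s2_y / n_y  # estimated variance of the second sample mean
    return (a + b) ** 2 / (a ** 2 / (n_x - 1) + b ** 2 / (n_y - 1))

# Hypothetical sample variances and sizes.
nu = welch_degrees_of_freedom(s2_x=4.0, s2_y=6.0, n_x=10, n_y=8)
print(round(nu, 3))
```

The result is generally not an integer; when looking up a critical value in a table it is common to round down, which is the more conservative choice.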


$\color{red}{\LARGE{\text{Hypothesis Testing}}}$

- Can have one or two tailed tests

## Significance Level
- Denoted by $\alpha$, it is the probability of rejecting the null hypothesis, if it is true. Typical values are 0.01, 0.05, and 0.1.
- Machinery has a higher level of accuracy and it is more important for specific measurements to hold true, therefore a lower significance level would be applied.
- Compared to this human or company behaviour can be expected to be more random, thus the significance level may be higher.

## Rejection Region
### z-test (two-sided test)
$$z = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}$$
- For a 95% confidence level ($\alpha = 0.05$), if $-1.96 \leq z \leq 1.96$ then we do not reject $H_0$
- This comes from part of the rejection region being in each tail, so we find $z_{\alpha /2}$

### One-sided test
- In this case the entire rejection region is on one side, so the critical value that bounds the rejection region is $z_{\alpha}$.
- Recall we find this from the table by requiring $P(Z \leq z_{\alpha}) = 1 - \alpha$. Since the normal distribution is symmetric, if $P(Z \leq z) = 0.4$ then $P(Z \leq -z)=0.6$
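
The critical values for both kinds of test can be looked up with `statistics.NormalDist` rather than a printed table. A minimal sketch, with $\alpha = 0.05$ chosen as an example:

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal N(0, 1)
alpha = 0.05

z_two_sided = std_normal.inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
z_one_sided = std_normal.inv_cdf(1 - alpha)      # all of alpha in one tail

print(round(z_two_sided, 3), round(z_one_sided, 3))

# Symmetry: the cumulative probabilities at z and -z add up to 1.
z = 0.5
print(round(std_normal.cdf(z) + std_normal.cdf(-z), 6))
```

The one-sided critical value is smaller than the two-sided one at the same $\alpha$, because the whole rejection region sits in a single tail.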

## Errors
### Type 1 
- This is when you reject a true null hypothesis
- Also called a false positive
- This occurs with probability $\alpha$

### Type 2
- This is when you fail to reject (accept) a false null hypothesis
- Also called a false negative
- This occurs with probability $\beta$
- $\beta$ depends on the sample size and variance

### Note: It is impossible to make both errors at the same time

## Power of the test
- The probability of rejecting a false null hypothesis is $1- \beta$, this is called the power of the test.
- To increase the power of the test researchers typically increase the sample size.
- Rejecting a false null hypothesis is the aim of the researcher

# Single Population Tests
## Test for Mean - Known Variance
$$Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}$$
where $\mu_0$ is the mean from the null hypothesis
- If the absolute value of this Z-score is greater than the critical value (found in the normal distribution table, also called the reliability factor, which we will denote with z) then we reject the null hypothesis
- That is, for a two-sided test we reject the null hypothesis if $|Z|>z$

### p-value
- This is the most common way to test hypotheses
- The p-value is the smallest level of significance at which we can still reject the null hypothesis, given the observed sample statistic
- You should reject the null hypothesis if the p-value $< \alpha$

One sided p-value:
- $1 - \Phi(|Z|)$, where $\Phi(|Z|)$ is the number from the normal distribution table for $|Z|$

Two sided p-value:
- $2(1 - \Phi(|Z|))$
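
A sketch of the known-variance test and both p-values, with the table lookup replaced by `statistics.NormalDist().cdf`. The function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_test(x_bar, mu_0, sigma, n):
    """Return the Z-score with its one-sided and two-sided p-values."""
    z_score = (x_bar - mu_0) / (sigma / math.sqrt(n))
    table_value = NormalDist().cdf(abs(z_score))  # the "number from the table"
    p_one_sided = 1 - table_value
    p_two_sided = 2 * (1 - table_value)
    return z_score, p_one_sided, p_two_sided

# Hypothetical test: sample mean 103 from n = 100, H0 mean 100, known sigma 15.
z_score, p_one, p_two = z_test(x_bar=103.0, mu_0=100.0, sigma=15.0, n=100)
alpha = 0.05
print(round(z_score, 2), round(p_one, 4), round(p_two, 4), p_two < alpha)
```

Here the two-sided p-value falls just below 0.05, so the null hypothesis would be rejected at the 5% level but not at the 1% level.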

## Test for Mean - Unknown Variance
- Use t-statistic here
$$T = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}$$
- We then get the critical value t from the distribution table using the degrees of freedom and level of significance
- If $|T|>t$ (two-sided test) we reject the null hypothesis

- Again if p-value < $\alpha$ then reject the null hypothesis

**NOTE: An online calculator or statistical software is required to find the p-value for t-statistics**

# Multiple Population Tests
## Dependent Samples
- We calculate the differences
- We then get the sample mean of the differences
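$$T = \frac{\bar{d} - \mu_0}{\text{St. Error}}$$
- Here the standard error is $\frac{s_d}{\sqrt{n}}$ and $\mu_0$ is the hypothesized mean difference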
- We then calculate the p-value to see if we reject the null hypothesis
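
A sketch of the dependent-samples statistic on made-up data. Since the standard library has no t-distribution, the decision here compares $|T|$ with a critical value typed in from a t-table ($t_{4, 0.025} = 2.776$) instead of computing the p-value.

```python
import math
import statistics

# Hypothetical measurements on the same five subjects.
before = [82, 91, 77, 68, 85]
after = [80, 88, 76, 69, 81]
mu_0 = 0        # H0: the mean difference is 0
t_crit = 2.776  # two-sided critical value for 4 degrees of freedom, alpha = 0.05

differences = [b - a for b, a in zip(before, after)]
n = len(differences)
d_bar = statistics.mean(differences)                  # sample mean of the differences
std_error = statistics.stdev(differences) / math.sqrt(n)

T = (d_bar - mu_0) / std_error
reject = abs(T) > t_crit
print(round(T, 3), reject)
```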

## Independent Samples - Known Variance
$$Z = \frac{(\bar{x} - \bar{y}) - \mu_0}{\sqrt{\frac{\sigma^2_x}{n_x} + \frac{\sigma^2_y}{n_y}}}$$
- Here $\mu_0$ is the hypothesized difference between the two population means
- Use this to find the p-value and decide if we should reject the null hypothesis
- If the Z-statistic is negative, the observed difference of the sample means is below the hypothesized difference, so the true difference is likely lower than hypothesized; similarly, if it is positive, the true difference is likely higher

## Independent Samples - Unknown Equal Variance
**Recall:**

**Pooled variance:**
$$s^2_p = \frac{(n_x - 1)s^2_x + (n_y -1)s^2_y}{n_x+n_y-2}$$

**Standard error of the difference of means:**
$$\sqrt{\frac{s^2_p}{n_x} + \frac{s^2_p}{n_y}}$$

**Degrees of freedom:**
$$n_x+n_y-2$$

**T-statistic:**
$$T = \frac{(\bar{x} - \bar{y}) - \mu_0}{\text{St. Error}}$$

Note: It is a rule of thumb to reject the null hypothesis when the absolute value of the T-score is bigger than 2

Note: Generally for Z and T, a value higher than 4 is extremely significant
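
Putting the recalled pieces together, a sketch of the T-statistic for two independent samples with unknown but assumed equal variances. The function name and the summary statistics are invented for the illustration.

```python
import math

def pooled_t_statistic(x_bar, y_bar, s2_x, s2_y, n_x, n_y, mu_0=0.0):
    """T-statistic and degrees of freedom for the equal-variance case."""
    s2_p = ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)  # pooled variance
    std_error = math.sqrt(s2_p / n_x + s2_p / n_y)
    dof = n_x + n_y - 2
    return ((x_bar - y_bar) - mu_0) / std_error, dof

# Hypothetical summary statistics; H0: the two population means are equal.
T, dof = pooled_t_statistic(x_bar=20.0, y_bar=17.0, s2_x=4.0, s2_y=6.0,
                            n_x=10, n_y=8)
print(round(T, 3), dof, abs(T) > 2)  # last value applies the rule of thumb
```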

