# Stats Basics



# Population and Parameter
A **population** is any large collection of objects or individuals, such as Americans, students, or trees, about which information is desired.

A **parameter** is any summary number, like an average or percentage, that describes the entire population.

>The population mean $\mu$ and the population proportion $p$ are two different population parameters.

>The problem is that almost all of the time we don't, or can't, know the real value of a population parameter. The best we can do is estimate the parameter!

---



# Sample and Statistic
A **sample** is a subset of the population, ideally drawn so that it is representative of the population.

A **statistic** is any summary number, like an average or percentage, that describes the sample.


| | Statistic from Sample | Parameter from Population |
|:------------ |:-------------|:-----|
| Average | $\bar{x}$ <br> is the avg. revenue from a random sample of 100 customers | $\mu$ <br> is the avg. revenue from all customers |
| Proportion | $\hat{p}$ <br> is the proportion of a random sample of 100 customers who are loyal | $p$ <br> is the proportion of all customers who are loyal |
| Use-case | Because samples are manageable in size, we can determine the actual value of any statistic. | We use the known value of the sample statistic to learn about the unknown value of the population parameter. |
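
The distinction in the table can be sketched with a small simulation. Everything here is made up for illustration: the population of 10,000 customers, the revenue distribution, and the 30% loyalty rate are assumptions, not data from these notes.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 customers: (revenue, is_loyal) pairs.
population = [(random.gauss(100, 20), random.random() < 0.3) for _ in range(10_000)]

# Parameters: computed from the whole population (rarely possible in practice).
mu = statistics.mean(revenue for revenue, _ in population)
p = sum(loyal for _, loyal in population) / len(population)

# Statistics: computed from a random sample of 100 customers.
sample = random.sample(population, 100)
x_bar = statistics.mean(revenue for revenue, _ in sample)
p_hat = sum(loyal for _, loyal in sample) / len(sample)

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
print(f"p  = {p:.3f}, p_hat = {p_hat:.3f}")
```

The sample values should land close to, but not exactly on, the population values, which is why a statistic is treated as an estimate of the parameter.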



---



# The Empirical Rule
Variability is most often described by the standard deviation. If the data we are working with are roughly bell-shaped and symmetrical (looking like a normal distribution), then we can use the standard deviation to tell approximately how much of the data will cluster around the mean.


The so-called empirical rule states that the bulk of the data cluster around the mean in a normal distribution. In fact:
- About 68% of values fall within $\pm 1$ standard deviation of the mean
- About 95% fall within $\pm 2$ standard deviations of the mean
- About 99.7% fall within $\pm 3$ standard deviations of the mean

It's called the empirical rule because experimenters have observed roughly these patterns over and over again in the data they collect.
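
For an exactly normal distribution the percentages can be computed rather than observed. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} SD: {share_within(k):.1%}")
# within ±1 SD: 68.3%
# within ±2 SD: 95.4%
# within ±3 SD: 99.7%
```

Real data that is only roughly bell-shaped will give percentages near these values, not exactly equal to them.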

# Z-Score
A z-score expresses how far a value lies from the mean, measured in standard deviations: $z = \dfrac{x - \mu}{\sigma}$. A z-score of 2 means the value is 2 standard deviations above the mean. A z-score of -1.5 means it is 1.5 standard deviations below the mean. A z-score of 0 means the value is equal to the mean.
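
As a minimal sketch, a z-score is the distance from the mean divided by the standard deviation, with the sign showing the direction. The exam scores (mean 70, standard deviation 10) are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical exam scores with mean 70 and standard deviation 10.
print(z_score(90, 70, 10))  # 2.0
print(z_score(55, 70, 10))  # -1.5
print(z_score(70, 70, 10))  # 0.0
```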


# Bias
Bias is the tendency of a statistic to overestimate or underestimate a parameter.
> Bias can seep into your results for a slew of reasons including sampling or measurement errors, or unrepresentative samples.
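
One well-known illustration of a biased statistic, chosen here as an example and not taken from these notes, is the variance computed with an $n$ divisor: on small samples it tends to underestimate the population variance, while the $n - 1$ divisor does not. The simulation settings below are assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_VARIANCE = 4.0  # population variance (sigma = 2), known because we simulate it
SAMPLE_SIZE = 5
TRIALS = 20_000

biased, unbiased = [], []
for _ in range(TRIALS):
    sample = [random.gauss(0, 2) for _ in range(SAMPLE_SIZE)]
    biased.append(statistics.pvariance(sample))   # divides by n
    unbiased.append(statistics.variance(sample))  # divides by n - 1

print(f"true variance:         {TRUE_VARIANCE}")
print(f"mean of n-divisor:     {statistics.mean(biased):.2f}")
print(f"mean of (n-1)-divisor: {statistics.mean(unbiased):.2f}")
```

Averaged over many samples, the $n$-divisor estimate should settle near 3.2 (that is, $4 \times \frac{4}{5}$) and the $n - 1$ version near 4.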

# Estimation

In statistics, estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.

**Point estimate** A point estimate of a population parameter is a single value of a statistic. 
>For example, the sample mean $\bar{x}$ is a point estimate of the population mean $\mu$. 

>Similarly, the sample proportion $\hat{p}$ is a point estimate of the population proportion $p$.

**Interval estimate** An interval estimate is defined by two numbers, between which a population parameter is said to lie. 
>For example, a < $\mu$ < b is an interval estimate of the population mean $\mu$. 

It indicates that the population mean is estimated to be greater than 'a' but less than 'b'.



# Confidence interval & level
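- A confidence interval (CI) is a type of interval estimate which is *likely* to include an unknown population parameter. The confidence level (for example, 95%) is the proportion of such intervals that would contain the parameter if the sampling were repeated many times.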

$CI = \bar{x} \pm z \cdot \dfrac{\sigma}{\sqrt{n}}$

>z is the value from the standard normal distribution for the selected confidence level 
(e.g., for a 95% confidence level, z=1.96)
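
A minimal sketch of this calculation, assuming the population standard deviation is known. The function name `z_interval` and the input numbers are hypothetical; the critical value comes from `statistics.NormalDist.inv_cdf` in the standard library.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population SD (sigma) is known."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical numbers: sample mean 50, known sigma 10, sample size 100.
low, high = z_interval(x_bar=50, sigma=10, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f})")  # 95% CI: (48.04, 51.96)
```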

# t-interval
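The population standard deviation $\sigma$ is usually not known. (If we knew it, we would likely also know the population average $\mu$ and have no need for an interval estimate.)

Hence we calculate a t-interval from the sample, replacing $\sigma$ with the sample standard deviation $s$ and the z value with a value $t$ from the t-distribution with $n - 1$ degrees of freedom.

$CI = \bar{x} \pm t \cdot \dfrac{s}{\sqrt{n}}$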
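
The sketch below works through an interval of this kind using a t critical value in place of z. The ten measurements are hypothetical. The standard library has no t-distribution, so the critical value for 9 degrees of freedom is hardcoded from a standard t-table; that is an assumption of this example, not a general approach.

```python
import math
import statistics

# Hypothetical sample of 10 measurements.
sample = [48, 52, 47, 55, 50, 49, 53, 51, 46, 49]

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample SD, divides by n - 1

# Two-sided 95% critical value for n - 1 = 9 degrees of freedom,
# copied from a standard t-table (assumption: hardcoded, not computed).
T_CRIT_95_DF9 = 2.262

margin = T_CRIT_95_DF9 * s / math.sqrt(n)
print(f"mean = {x_bar}, s = {s:.2f}")
print(f"95% t-interval: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
```

Because 2.262 is larger than 1.96, the interval is wider than a z-based interval on the same numbers would be, reflecting the extra uncertainty from estimating the standard deviation.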

# Margin of Error
The margin of error applies whenever a sample is taken and used to make an estimate about a larger population. It is half the width of the confidence interval. The smaller the sample, the more variable the estimate will be and the bigger the margin of error.
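
To see how sample size drives the margin of error, the sketch below holds the standard deviation fixed and varies only $n$. The helper `margin_of_error` and the standard deviation of 10 are hypothetical.

```python
import math

Z_95 = 1.96  # critical value for 95% confidence

def margin_of_error(sd, n, z=Z_95):
    """Half-width of the confidence interval for a mean."""
    return z * sd / math.sqrt(n)

# Hypothetical standard deviation of 10; only the sample size changes.
for n in (25, 100, 400):
    print(n, round(margin_of_error(10, n), 2))
# 25 3.92
# 100 1.96
# 400 0.98
```

Quadrupling the sample size halves the margin of error, because $n$ enters through its square root.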

# Effect of Population variability
As the variability of the population you're sampling from increases, the confidence interval computed from your sample gets wider (for a fixed sample size and confidence level).

Statisticians use summary measures to describe the amount of variability or spread in a set of data. The most common measures of variability are the range, the interquartile range (IQR), variance, and standard deviation.
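
These four measures can be computed with the standard library, as in the sketch below. The data set is hypothetical, and the quartiles use the default method of `statistics.quantiles`; other quartile conventions can give slightly different IQR values on small samples.

```python
import statistics

# Hypothetical data set.
data = [4, 7, 8, 10, 12, 15, 21]

data_range = max(data) - min(data)

# Quartiles via the stdlib default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

variance = statistics.variance(data)  # sample variance (n - 1 divisor)
sd = statistics.stdev(data)           # sample standard deviation

print(f"range = {data_range}, IQR = {iqr}, variance = {variance:.2f}, SD = {sd:.2f}")
```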


# Coefficient of variation ($C_v$)
$C_v = \dfrac{\sigma}{\mu}$
 - Also known as relative standard deviation (RSD)
 - Is a standardized measure of dispersion and is unitless
 
The higher the coefficient of variation, the greater the level of dispersion around the mean. It is generally expressed as a percentage. Without units, it allows for comparison between distributions of values whose scales of measurement are not comparable.

> For example, the expression “The standard deviation is 15% of the mean” is a CV. The CV is particularly useful when you want to compare results from two different surveys or tests that have different measures or values, such as two tests that have different scoring mechanisms.
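
A minimal sketch of such a comparison, using two hypothetical tests scored on different scales. Since only samples are available, the sample standard deviation and sample mean stand in for $\sigma$ and $\mu$, which is an assumption of this example.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical scores from two tests with different scoring scales.
test_a = [60, 70, 80, 90, 100]      # scored out of 100
test_b = [6.0, 6.5, 7.0, 7.5, 8.0]  # scored out of 10

print(f"CV of test A: {coefficient_of_variation(test_a):.1%}")  # CV of test A: 19.8%
print(f"CV of test B: {coefficient_of_variation(test_b):.1%}")  # CV of test B: 11.3%
```

The raw standard deviations (about 15.8 and 0.79) are not comparable because the scales differ, but the CVs show that test A's scores are relatively more dispersed.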


# Is Standard Deviation high for my data?
Standard deviations aren't "good" or "bad". They are indicators of how spread out your data is. Sometimes, in rating scales, we want a wide spread because it indicates that our questions/ratings cover the range of the group we are rating. Other times, we want a small SD because we want everyone to be "high".

For example, consider the following summary of order values:

- Min Order Value is Rs.8
- Max Order Value is Rs.284
- Average Order Value is Rs.28
- Standard Deviation is Rs.27
- CV is 96.4%

Here the standard deviation is almost as large as the mean, so the order values are widely spread relative to the average.



# Hypothesis Testing
- A hypothesis test is a statistical test that is used to determine whether there is enough evidence in a sample of data to infer that a certain condition is true for the entire population.
- A hypothesis test examines two opposing hypotheses about a population: the null hypothesis and the alternative hypothesis. 
- **The null hypothesis** ($H_0$) is the statement being tested. Usually the null hypothesis is a statement of "no effect" or "no difference". 
- **The alternative hypothesis** ($H_1$ or $H_a$) is the statement you want to be able to conclude is true.

# Critical value approach
- The critical value approach involves determining "likely" or "unlikely" by determining whether or not the observed test statistic is more extreme than would be expected if the null hypothesis were true. That is, it entails comparing the observed test statistic to some cutoff value, called the "critical value." If the test statistic is more extreme than the critical value, the null hypothesis is rejected.

# P-value approach
- The P-value approach involves determining "likely" or "unlikely" by determining the probability, assuming the null hypothesis were true, of observing a test statistic at least as extreme as the one observed, in the direction of the alternative hypothesis. If the P-value is small, say less than (or equal to) α, then it is "unlikely" and the null hypothesis is rejected. If the P-value is large, say more than α, then it is "likely" and the null hypothesis is not rejected.
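
Both approaches can be sketched on the same right-tailed one-sample z-test. The hypotheses, the known $\sigma$, the sample size, and the observed mean are all hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist

# Hypothetical setup: H0: mu = 100 versus H1: mu > 100 (right-tailed),
# with known sigma = 15, n = 36, and observed sample mean 105.
mu_0, sigma, n, x_bar = 100, 15, 36, 105
alpha = 0.05

z_stat = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Critical value approach: reject H0 if z_stat exceeds the cutoff.
z_crit = NormalDist().inv_cdf(1 - alpha)
reject_by_critical_value = z_stat > z_crit

# P-value approach: probability of a statistic at least this large under H0.
p_value = 1 - NormalDist().cdf(z_stat)
reject_by_p_value = p_value <= alpha

print(f"z = {z_stat:.2f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
# z = 2.00, critical value = 1.645, p-value = 0.0228
print(reject_by_critical_value, reject_by_p_value)  # True True
```

The two approaches agree here, as they do for any test at the same α: a test statistic beyond the critical value corresponds to a P-value below α.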

# Chi-square test of independence
The chi-square test of independence is a statistical test used to determine whether two variables are associated. Like the coefficient of determination, $R^2$, it is used to assess the relationship between two variables.
- However, the chi-square test is only applicable to categorical (nominal) data, while $R^2$ is only applicable to numeric data.
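
As a rough sketch, the chi-square statistic compares the observed counts in a contingency table with the counts expected if the two variables were independent. The 2x2 table below is hypothetical, no continuity correction is applied, and the critical value is hardcoded from a standard chi-square table, which is an assumption of this example.

```python
# Hypothetical 2x2 contingency table of observed counts.
# Rows: customer segment (A, B); columns: loyal (yes, no).
observed = [
    [30, 20],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Expected count if the two variables were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

# Critical value for alpha = 0.05 with (2-1)*(2-1) = 1 degree of freedom,
# copied from a standard chi-square table (assumption: hardcoded).
CHI2_CRIT_95_DF1 = 3.841

print(f"chi-square = {chi_square:.2f}")  # chi-square = 4.00
print("reject independence:", chi_square > CHI2_CRIT_95_DF1)  # reject independence: True
```

Every expected count is 25 here, so each cell contributes $5^2 / 25 = 1$ to the statistic, giving 4.0 in total.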

