# Chapter 1: Exploratory Data Analysis

Independent and identically distributed random variables - if each RV has the same probability distribution as the others and all are mutually independent. 

PDF (probability density function) is a statistical expression that defines a probability distribution for a continuous random variable. It tells us the probability of X lying between two values, say a and b; for any single value (X=x) the probability is 0. The formula is: $P(a<X<b) = \int_{a}^{b} f(x) dx $. For discrete variables, the PDF is replaced by the probability mass function (PMF), given by $P(X=x_{k})=f(x_{k})$ for $k=1,2,3,...$. The probabilities over all values must sum to 1: $\sum_{k} f(x_{k}) = 1$

CDF - cumulative distribution function. Tells us the probability that the variable takes a value less than or equal to x. That is $F(x) = Pr(X \le x)=\alpha$. For a continuous distribution, this can be expressed as: $F(x) = \int_{-\infty}^{x} f(u) du$ and for a discrete one: $F(x) = \sum_{x_{k} \le x} f(x_{k})$
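As a small illustration of the discrete case, the sketch below builds the PMF of a fair six-sided die (an assumed example, not taken from these notes) and accumulates it into a CDF.

```python
import numpy as np

# Assumed example: a fair six-sided die, so each face has probability 1/6.
faces = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

# The CDF at x is the running sum of the PMF over all values <= x.
cdf = np.cumsum(pmf)

# P(2 <= X <= 4) computed from the CDF: F(4) - F(1).
p_2_to_4 = cdf[3] - cdf[0]

print(pmf.sum())  # total probability
print(p_2_to_4)   # probability of rolling 2, 3, or 4
```

The same pattern (cumulative sum of a PMF) applies to any discrete distribution with a finite list of values.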



Data can be continuous, discrete, categorical, binary, or ordinal (for example, a numerical rating: 1, 2, 3, 4, 5).




Rectangular Data: 
 - A spreadsheet or database table
 - Essentially 2D Matrix with rows indicating records (cases) and columns indicating features 
 - Sample might mean a single row (for CS) or a collection of rows (for statisticians)
 
Non-Rectangular Data
 - Used in IoT
 - Object representation, the focus is an object (a house) and its spatial coordinates.
 - The field view, by contrast, focuses on small units of space and the value of a relevant metric 
 - Graph data structures are used to represent physical, social and abstract relationships
  
Mean - sum of all values divided by the number of values: $\mu = \frac {\sum_{i=1}^{n} x_{i}}{n}$ 
```python
import numpy as np
a = np.array([1, 2, 3, 4])
# Signature: np.mean(a, axis=None, dtype=None, out=None)
np.mean(a)  # arithmetic mean of all elements
```
  
Weighted Mean - sum of all values times a weight, divided by the sum of the weights: $\mu = \frac {\sum_{i=1}^{n} x_i \cdot w_i}{\sum_{i=1}^{n} w_i}$ 

Example: if the values 1, 2, 3, 4 have weights 0.1, 0.1, 0.7, 0.1 respectively (the weights sum to 1), then the weighted mean is 

$ 0.1\cdot 1 +0.1 \cdot 2 +0.7 \cdot 3 +0.1 \cdot 4 = 2.8$

```python
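import numpy as np
a = np.array([1, 2, 3, 4])
# Signature: np.average(a, axis=None, weights=None, returned=False)
np.average(a, weights=[0.1, 0.1, 0.7, 0.1])  # weighted mean, about 2.8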
```


Median - the value such that one-half of the data lies above it and one-half below. The data has to be sorted first. If the number of values is odd, the median is the middle value: 5 for $1, 2, 3, 4, 5, 6, 7, 8, 9 $. If the number of values is even, as in $1, 2, 3, 4, 5, 6, 7, 8, 9, 10$, then the median is the average of the two numbers in the middle: $0.5 \cdot (5+6) = 5.5$ 

Weighted Median - The value such that one-half of the sum of the weights lies above and below the sorted data
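NumPy has no dedicated weighted-median function that I am aware of, so the sketch below uses a hypothetical helper, `weighted_median`, which returns the first sorted value whose cumulative weight reaches half of the total weight. When the cumulative weight lands exactly on one-half, this picks the lower candidate; other conventions average the two neighbours.

```python
import numpy as np


def weighted_median(x, w):
    """Return the first sorted value whose cumulative weight reaches half the total."""
    order = np.argsort(x)
    x_sorted = np.asarray(x)[order]
    cum_w = np.cumsum(np.asarray(w, dtype=float)[order])
    idx = np.searchsorted(cum_w, 0.5 * cum_w[-1])
    return x_sorted[idx]


# Same values and weights as the weighted mean example above.
values = [1, 2, 3, 4]
weights = [0.1, 0.1, 0.7, 0.1]
print(weighted_median(values, weights))
```

With most of the weight sitting on the value 3, the weighted median is pulled to 3 even though the unweighted median of the same values is 2.5.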

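Trimmed mean - average of all values after dropping a fixed number of extreme values from each end of the sorted data. Eliminates the influence of extreme values. Also known as truncated mean 

$ \mu_{trimmed} = \frac {\sum_{i=p+1}^{n-p} x_{(i)}}{n-2\cdot p} $, where $x_{(i)}$ is the i-th smallest value and $p$ is the number of values dropped from each end.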

Example: Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99.
 - Trim the top and bottom 20% from the data (one value from each end) -> (81, 83, 91)
 - Find the mean of the remaining values: (81 + 83 + 91) / 3 = 85
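The worked example can be reproduced with a few lines of NumPy. The helper name `trimmed_mean` is hypothetical, and the number of values dropped per end is rounded down, which is one common convention rather than the only one.

```python
import numpy as np


def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    x = np.sort(np.asarray(data, dtype=float))
    p = int(proportion * len(x))  # values dropped per end, rounded down
    if 2 * p >= len(x):
        raise ValueError("proportion is too large: no values would remain")
    return x[p:len(x) - p].mean()


scores = [60, 81, 83, 91, 99]
print(trimmed_mean(scores, 0.20))  # drops 60 and 99, then averages 81, 83, 91
```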
 
  
Robust - not sensitive to extreme values


#### Estimates of Variability

Range - difference between largest and smallest values 

Deviations (errors, residuals) - differences between the observed values and the estimate of location. So if we have [1, 4, 4], then the mean is 3 and the median is 4. The deviations from the mean are: 1-3=-2; 4-3=1; 4-3=1. Those values tell us how dispersed the data is around the central value. 

Mean absolute deviation (l1 norm, Manhattan norm) - mean of the absolute values of the deviations from the mean. $ \text{mean absolute deviation} = \frac {\sum_{i=1}^{n} |x_i-\mu|}{n} $. For the example above: (2+1+1)/3 = 1.33

Median absolute deviation (MAD) - the median of the absolute deviations from the median; a robust estimate of variability. $MAD = \text{median}(|Y_{i} - \text{median}(Y)|)$. Numerical examples: https://www.statisticshowto.datasciencecentral.com/median-absolute-deviation/
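A short check of both deviation measures on the [1, 4, 4] example, written directly with NumPy primitives rather than any dedicated library routine:

```python
import numpy as np

x = np.array([1, 4, 4])

# Deviations from the mean (the mean is 3).
deviations = x - x.mean()

# Mean absolute deviation: average of |deviations|, here (2 + 1 + 1) / 3.
mean_abs_dev = np.mean(np.abs(deviations))

# Median absolute deviation: median of |x - median(x)|, here median of [3, 0, 0].
median_abs_dev = np.median(np.abs(x - np.median(x)))

print(mean_abs_dev, median_abs_dev)
```

The median absolute deviation comes out as 0 here because two of the three values equal the median, which shows how little a single extreme value influences it.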


Variance (mean-squared-error) - the expectation of the squared deviation of a random variable from its mean. If the data is continuous: $Var(X) = \sigma^{2} = \int (x-\mu)^{2} \cdot f(x) dx$. For a discrete set of $n$ equally likely values: $Var(X) = \frac{1}{n} \cdot \sum_{i=1}^{n} (x_{i} - \mu )^{2}$

For a sample, it is the sum of squared deviations from the sample mean divided by n-1, where n is the number of data values: $s^{2} = \frac {\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n-1}$


Standard deviation (l2 norm, Euclidean norm) - square root of the variance, denoted $\sigma$. Easier to interpret than variance because it is on the same scale as the original data. However, working with squared values is much more convenient than working with absolute values, especially for statistical models. $\sigma = \sqrt {\frac{1}{N} \sum_{i=1}^{N} (x_{i} - \mu)^{2}}$ . The z-score tells us how many standard deviations an observation is above or below the mean (a z-score of 1 means that the value is one standard deviation above the mean).
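The difference between dividing by n and by n-1 is controlled in NumPy by the `ddof` argument. The sketch below reuses the [1, 4, 4] example; the choice of the sample standard deviation for the z-scores is an assumption made for illustration.

```python
import numpy as np

x = np.array([1, 4, 4])

pop_var = np.var(x)             # divides by n:     6 / 3
sample_var = np.var(x, ddof=1)  # divides by n - 1: 6 / 2
sample_std = np.std(x, ddof=1)  # square root of the sample variance

# z-score: number of standard deviations each value lies from the mean.
z_scores = (x - x.mean()) / sample_std

print(pop_var, sample_var, sample_std)
print(z_scores)
```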


Median absolute deviation from the median - the median of the absolute value of the deviations from the median


###### Order statistics (ranks) - Statistics based on sorted (ranked) data. 

Range - difference between the largest and the smallest value in the data set. The most basic measure based on order statistics. Very sensitive to outliers. To avoid sensitivity to outliers, we can look at the range of the data _after_ dropping values from each end. Formally, these types of estimates are based on differences between percentiles. 

Percentile (quantile) - value such that P percent of the values take on this value or less and (100 - P) percent take on this value or more (the median is the 50th percentile). A quantile is the same as a percentile but expressed as a fraction (the 0.4 quantile is the 40th percentile). Quantiles are values that divide the distribution such that a given proportion of observations lies below the quantile.

```python
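import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
# Signature (abridged): np.percentile(a, q, axis=None, out=None, overwrite_input=False, ..., keepdims=False)
# a - input array; q - percentile(s) to compute, in the range 0-100
# out - alternative output array in which to place the result
# Note: newer NumPy versions use method='linear' in place of interpolation='linear'
np.percentile(a, 50)  # 50th percentile, i.e. the median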
```



Interquartile range (IQR) - difference between the 75th percentile and the 25th percentile. For very large datasets, calculating exact percentiles is expensive because it requires sorting all the values; machine learning and statistical software use special algorithms to get an approximate percentile.
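A minimal sketch of the IQR using `np.percentile` with its default linear interpolation, on the same 1 to 9 range used for the median example:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 25th, 50th and 75th percentiles in one call.
q25, q50, q75 = np.percentile(data, [25, 50, 75])

# Interquartile range: spread of the middle half of the data.
iqr = q75 - q25

print(q25, q50, q75, iqr)
```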





#### Exploring Data Distribution

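Boxplot (box and whiskers plot) - plot used to visualize the distribution of data. The top and bottom of the box are the 75th and 25th percentiles. The median is shown by the horizontal line inside the box. The lines extending from the box, known as whiskers, indicate the range for the bulk of the data; in matplotlib's default they reach the most extreme points within 1.5 times the IQR of the box, and points beyond them are drawn individually as outliers. If a boxplot is notched, the notch shows a confidence interval around the median; it does not by itself mean that the data is skewed.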

```
matplotlib.pyplot.boxplot (x, ... )

```

Frequency table - a tally of the count of numeric data values that fall into a set of intervals (bins) 

Histogram - a plot of the frequency table with the bins on the x-axis and count on y-axis

```
matplotlib.pyplot.hist (x, bins=None, range=None, density=None)

```

Location and variability are referred to as first and second moments of a distribution. Third and fourth moments are skewness and kurtosis. Skewness refers to whether the data is skewed to larger or smaller values and kurtosis indicates the tendency of the data to have extreme values


Density Plot - a smoothed version of the histogram, often based on a kernel density estimation. 



#### Exploring Binary and Categorical Data

Mode - the most commonly occurring category or value in a dataset
```
import statistics
statistics.mode(data)
```

Expected Value - when the categories can be associated with a numeric value, this gives an average value based on a category's probability of occurrence. For example, the expected value of a die roll is what we expect the average of the outcomes to be over a large number of rolls. Also known as the mean (or, more precisely, a weighted mean).


Bar charts - frequency or proportion for each category plotted as bars. A common visual tool for displaying categorical variables. Different from a histogram: in a histogram, the x-axis represents values of a single variable on a numeric scale, and the bars typically touch each other. In a bar chart, the x-axis represents different categories, and the bars are shown separate from one another. 
```
matplotlib.pyplot.bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)
```

Pie chart - frequency or proportion for each category plotted as wedges in a pie


#### Exploring two or more variables

Correlation coefficient (known as Pearson's correlation coefficient) gives us a measurement of linear correlation (-1 to 1):
$r=\frac{\sum (x_{i}-\mu_{x})\cdot (y_{i}-\mu_{y})}{\sqrt {SS_{x} \cdot SS_{y}}} $, where $SS_{x} = \sum (x_{i}-\mu_{x})^{2}$ and $SS_{y} = \sum (y_{i}-\mu_{y})^{2}$. Useful to compare two variables (bivariate analysis). If there are more than two variables, it is multivariate analysis.
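To make the formula concrete, the sketch below computes Pearson's r by hand on assumed toy data and compares it with `np.corrcoef`, which returns the full correlation matrix.

```python
import numpy as np

# Assumed toy data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x - x.mean()
dy = y - y.mean()

# r = sum(dx * dy) / sqrt(SS_x * SS_y)
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# NumPy's built-in: the off-diagonal entry of the 2x2 correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```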

Other correlation coefficients are Spearman's rho and Kendall's tau. 

Spearman's coefficient is appropriate for both continuous and discrete variables, including ordinal variables (variables where order is important). 

Kendall

#### Hexagonal Binning and Contours (Plotting numeric vs Numeric Data)

Scatterplots are fine when there is a relatively small number of data values. If that number is too large, then scatterplots will be too dense. 

In that case, a hexagon binning plot is useful. Rather than plotting points, which would appear as a dark cloud, we group the records into hexagonal bins and plot those bins
```
matplotlib.pyplot.hexbin(x, y, C=None, gridsize=100, bins=None, xscale='linear', yscale='linear', extent=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, edgecolors='face', reduce_C_function=np.mean, mincnt=None, marginals=False, *, data=None, **kwargs)
```

#### Two Categorical variables

A useful way to summarize two categorical variables is a contingency table - a table of counts by category. Pivot tables in Excel are the most common tool used to create contingency tables.
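Outside of Excel, a contingency table is just a count of each pair of categories. The sketch below uses only the standard library and assumed toy records (loan grade and loan status are hypothetical column names).

```python
from collections import Counter

# Assumed toy records: (loan grade, loan status) pairs.
records = [
    ("A", "paid"), ("A", "paid"), ("A", "late"),
    ("B", "paid"), ("B", "late"), ("B", "late"),
]

# Count how often each (grade, status) combination occurs.
counts = Counter(records)
rows = sorted({grade for grade, _ in records})
cols = sorted({status for _, status in records})

# Print the table of counts, one row per grade.
print("grade", *cols)
for grade in rows:
    print(grade, *(counts[(grade, status)] for status in cols))
```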

#### Categorical and numerical data

Boxplots are a simple way to compare the distributions of a numeric variable grouped according to a categorical variable. A violin plot is an enhancement to the boxplot and plots the density estimate with the density on the y-axis. The density is flipped over and resulting shape is filled in, creating an image that looks like a violin. It is more informative, as it shows the distribution of data (unlike in the boxplot).
```
matplotlib.pyplot.violinplot(dataset, positions=None, vert=True, widths=0.5, showmeans=False, showextrema=True, showmedians=False, points=100, bw_method=None, *, data=None)
```


Kernel density estimation - is the technique to estimate the underlying probability distribution of a random variable, based on a sample of points taken from that distribution.


Degrees of freedom is central to the principle of estimating statistics of population from samples of them. It is a mathematical restriction that needs to be put in place when estimating one statistic from an estimate of another.

http://onlinestatbook.com/2/estimation/df.html

# Chapter 2: Data and Sampling distributions

A population follows an unknown distribution. The only thing that is available is sample data and its empirical distribution. To get from a population to a smaller representation of it, _sampling_ techniques are used. 

#### Random Sampling and Sample Bias
 
A sample is a subset of data from a larger dataset (population). The population size is denoted N, whereas the sample size is n. _Random sampling_ is drawing elements into a sample at random. _Stratified sampling_ is dividing the population into strata and randomly sampling from each stratum (https://www.chegg.com/homework-help/definitions/stratified-sampling-31). Sample bias occurs when a sample _misrepresents_ the population.

Random sampling is the process in which each available member of the population being sampled has an equal chance of being chosen for the sample at each draw; the sample that results is called a simple random sample. Sampling can be done with replacement, in which observations are put back in the population after each draw for possible future reselection; or without replacement, in which case observations, once selected, are unavailable for future draws.

Sample bias means that the sample differs in some meaningful, nonrandom way from the larger population that it's meant to represent. The term nonrandom is important - hardly any sample will be exactly representative of the population, but in the case of sample bias, that difference is meaningful.
```
# Random sampling without replacement
import random
import numpy as np

population = list(range(100))
random.sample(population, k=10)

# Random sample with or without replacement
np.random.choice(population, size=10, replace=True, p=None)
# a - 1D array; size - output shape; replace - with/without replacement; p - 1D array of probabilities associated with each entry in a
```

##### Bias

Statistical bias refers to measurement or sampling errors that are systematic and produced by the measurement or sampling process. There is a crucial difference between errors due to random chance and errors due to bias. The first kind will produce errors, but they will be random and will not tend in any direction; errors due to bias tend consistently in one direction. 

In statistics, the observed sample mean is denoted as $\bar{x}$, and $\mu$ represents the mean of the population. The population mean is often inferred from smaller samples. 

##### Selection Bias

Selection bias refers to the practice of selectively choosing data in a way that leads to a conclusion that is misleading. Typical forms of selection bias in statistics include nonrandom sampling, cherry-picking data, or stopping an experiment when results look interesting.

Specifying a hypothesis, then collecting data following randomization and random sampling principles, ensures against bias.


##### Sampling Distribution of a Statistic

Sampling distribution refers to the distribution of some sample statistic over many samples drawn from the same population. 
We draw a sample with the goal of either measuring something or modeling something. Our estimate or model is based on the sample; as a result, it might vary depending on which sample is drawn from the population. Sampling variability is how much an estimate varies between samples. There is also a difference between the _data distribution_ and the _sampling distribution_: the former is the distribution of the individual observations in the data (for example, the scores of students taking a statistics course), and the latter is the distribution of a sample statistic over many samples. If we draw $N$ samples, each of size $n$, and compute the mean of each, we get $N$ sample means; their distribution is known as the sampling distribution of the sample mean.

Sampling distribution of the mean: probability distribution of means for ALL possible random samples of A GIVEN SIZE from some population (http://www.cogsci.ucsd.edu/~nunez/COGS14B_W17/W3b.pdf). In other words, if we keep taking samples and compute the mean of each, the average of those sample means approaches the mean of the whole population. It is used to construct confidence intervals for the mean and for significance testing. The formula to compute the standard deviation of the sampling distribution of sample means: $\sigma_{\bar{X}} = \frac {\sigma}{\sqrt{n}}$, where $\sigma$ is the population standard deviation and $n$ is the sample size.

Central Limit Theorem - tendency of the sampling distribution of the mean to take on a normal shape (Gaussian distribution) as sample size rises, even if the source population is not normally distributed (under the assumption that the sample size is large enough, usually n > 30; if n < 30, then the population itself should have a normal distribution). The central limit theorem allows normal-approximation formulas like the t-distribution to be used in calculating sampling distributions for inference - that is, confidence intervals and hypothesis tests. CLT is not so central in DS, because formal hypothesis tests and confidence intervals play a small role in DS. 

_Standard Error_ is a single metric that sums up the variability in the sampling distribution for a statistic. It is important because it tells us how much sampling fluctuation a statistic will show, that is, the accuracy with which a sample represents a population. The smaller the standard error, the more representative the sample will be of the overall population. The inferential statistics involved in the construction of confidence intervals and significance testing are based on SEs. SE can be estimated using the standard deviation _s_ of the sample values and the sample size _n_: $SE = \frac {s}{\sqrt{n}} $.

We can use the following approach to measure standard error:
 - Collect a number of new samples from the population
 - For each new sample, calculate the statistic (mean for example)
 - Calculate the standard deviation of the statistics computed in previous step; use this as your estimate of standard error
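The three steps above can be simulated when the population is available. The sketch below assumes an artificial exponential population (chosen only because it is clearly non-normal) and compares the empirical standard error with the $s/\sqrt{n}$ estimate from a single sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population for illustration: 10,000 draws from an exponential distribution.
population = rng.exponential(scale=2.0, size=10_000)
n = 50  # size of each sample

# Steps 1-2: draw many new samples and record the mean of each.
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(2000)
])

# Step 3: the standard deviation of those means estimates the standard error.
se_empirical = sample_means.std(ddof=1)

# Formula-based estimate from a single sample: s / sqrt(n).
one_sample = rng.choice(population, size=n, replace=False)
se_formula = one_sample.std(ddof=1) / np.sqrt(n)

print(se_empirical, se_formula)
```

The two numbers will not match exactly, since `se_formula` depends on the one sample that happened to be drawn, but they should be of similar size.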
 
The problem is that it's not always possible to collect new samples. Instead, it is possible to use bootstrap resamples. The bootstrap has become a standard in modern statistics, as it does not rely on the CLT. Standard error is different from standard deviation: SD measures the variability of individual data points, while SE measures the variability of a sample statistic.

 - Frequency distribution of a sample statistic tells us how that metric would turn out differently from sample to sample
 - This sampling distribution can be estimated either via the bootstrap, or via formulas that rely on the CLT
 - A key metric that sums up the variability of a sample statistic is its standard error.
 
 
##### Bootstrap

Bootstrap - sample taken with replacement from an observed dataset. Think of a bootstrap as replicating the original sample as many times as you want, so we have a hypothetical population that embodies the knowledge from the original sample. In practice, there is no need to replicate the sample a huge number of times; we just replace each observation after each draw (_sampling with replacement_). The bootstrap does not compensate for a small sample size; all it does is inform us about how lots of additional samples would behave when drawn from a population like the original sample. The bootstrap allows us to estimate sampling distributions for statistics where no mathematical approximation to the sampling distribution has been developed. It is a powerful tool for assessing the variability of a sample statistic. When applied to predictive models, aggregating multiple bootstrap sample predictions (bagging) often outperforms the use of a single model.

Resampling - a process of taking repeated samples from observed data; includes both bootstrap and permutation (shuffling) procedures. The algorithm for bootstrapping the mean of a sample of size _n_ can be summarized as:
 - Draw a sample value, record it, and replace it
 - Repeat _n_ times
 - Record the mean of the _n_ resampled values
 - Repeat Steps 1-3 _R_ times (_R_, the number of bootstrap iterations, is arbitrary; however, the more iterations we run, the more accurate the estimate of the standard error)
 - Use R results to: calculate their standard deviation (this estimates sample mean standard error); produce a histogram or boxplot; find a confidence interval
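The bootstrap steps listed above translate almost line by line into NumPy. The observed sample and the number of iterations below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed observed sample for illustration.
sample = np.array([60, 81, 83, 91, 99, 72, 88, 95, 67, 79], dtype=float)
n = len(sample)
R = 5000  # number of bootstrap iterations

# Steps 1-4: resample n values with replacement, record the mean, repeat R times.
boot_means = np.array([
    rng.choice(sample, size=n, replace=True).mean()
    for _ in range(R)
])

# Step 5: the standard deviation of the R means estimates the standard error of the mean.
boot_se = boot_means.std(ddof=1)
formula_se = sample.std(ddof=1) / np.sqrt(n)

print(boot_se, formula_se)
```

For the mean, the bootstrap estimate should land close to the formula-based $s/\sqrt{n}$; the value of the bootstrap is that the same loop works for statistics with no such formula, such as the median.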
 
##### Confidence Intervals

Confidence intervals are another way to understand the potential error in a sample estimate. A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data; in other words, confidence intervals are used to describe how accurate our estimate is likely to be. The width of a confidence interval depends on two things: the variation of the population and the sample size. If the population has low variance, then the sample will also tend to have low variance. Also, if we take small samples, then they will vary more from each other and carry less information, which leads to wider confidence intervals. CIs can be calculated via bootstrapping:
- Draw a random sample of size n with replacement from the data (a resample)
- Record the statistic of interest for the resample
- Repeat steps 1-2 many (R) times
- For a z% confidence interval, trim ((100 - z)/2)% of the R resample results from either end of the distribution
- The trim points are the endpoints of a z% bootstrap CI
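A percentile bootstrap interval following these steps can be sketched as below, trimming (100 - z)/2 percent from each end of the resampled means. The sample values, the 90% level, and the number of resamples are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed observed sample for illustration.
sample = np.array([60, 81, 83, 91, 99, 72, 88, 95, 67, 79], dtype=float)
R = 5000  # number of resamples
z = 90    # confidence level in percent

# Steps 1-3: resample with replacement and record the statistic (here the mean).
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(R)
])

# Steps 4-5: trim (100 - z) / 2 percent from each end; the trim points are the endpoints.
tail = (100 - z) / 2
ci_low, ci_high = np.percentile(boot_means, [tail, 100 - tail])

print(ci_low, ci_high)
```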

The z% is termed the level of confidence. The higher the level of confidence, the wider the interval. A smaller sample also means a wider interval.

Confidence intervals are constructed at a confidence level, such as 95 %, selected by the user. What does this mean? It means that if the same population is sampled on numerous occasions and interval estimates are made on each occasion, the resulting intervals would bracket the true population parameter in approximately 95 % of the cases. 

Example:


Suppose that a 90% confidence interval states that the population mean is greater than 100 and less than 200. How would you interpret this statement?


Some people think this means there is a 90% chance that the population mean falls between 100 and 200. This is incorrect. Like any population parameter, the population mean is a constant, not a random variable. It does not change. The probability that a constant falls within any given range is always 0.00 or 1.00.


The confidence level describes the uncertainty associated with a sampling method. Suppose we used the same sampling method to select different samples and to compute a different interval estimate for each sample. Some interval estimates would include the true population parameter and some would not. A 90% confidence level means that we would expect 90% of the interval estimates to include the population parameter (such as the mean); a 95% confidence level means that 95% of the intervals would include the parameter; and so on.
https://www.youtube.com/watch?v=tFWsuO9f74o

Small samples vary more from each other and carry less information, which leads to wider confidence intervals. Larger samples give more confidence. CIs can be calculated either through traditional normal-based formulas or through bootstrapping. The point is that from a frequentist perspective, a 95% CI implies that if the entire study were repeated identically ad infinitum, 95% of the confidence intervals formed in this manner would include the true value. 

#### Distributions

##### Normal Distribution

A standard normal distribution is one in which the units on the x-axis are expressed in terms of standard deviations away from the mean. To compare data to a standard normal distribution, you subtract the mean and then divide by the standard deviation; this is also known as _normalization or standardization_. The transformed value is a _z-score_, and the normal distribution is sometimes known as the _z-distribution_.

$z_{i} = \frac {x_{i} - \bar{x}}{s} $.

Z-score is the number of standard deviations a point lies from the mean. Z-scores are used to compare results from a test to a "normal" population (we standardize the data). 

QQ-Plot (quantile-quantile plot) is used to visually determine how close a sample distribution is to the normal distribution; it is also used to determine if two datasets come from populations with a common distribution. A QQ-Plot orders the _z-scores_ from low to high and plots each value's _z-score_ on the y-axis; the x-axis is the corresponding quantile of a normal distribution for that value's rank. If the points roughly fall on the diagonal line, then the sample distribution can be considered close to normal. 
![alt text](qq-plot.png "QQ-Plot")

```
# loc = mean; scale = standard deviation
numpy.random.normal(loc=0.0, scale=1.0, size=None)
```
###### Long-Tailed Distribution

Tails of a distribution correspond to the extreme values (small and large). 
```
# The log-normal distribution can be used as an example of a long-tailed distribution
numpy.random.lognormal(mean=0.0, sigma=1.0, size=None)
```

##### Student's t-distribution

Degrees of freedom is a parameter that allows the t-distribution to adjust to different sample sizes, statistics, and numbers of groups. 

According to the central limit theorem, the sampling distribution of a statistic (like a sample mean) will follow a normal distribution, as long as the sample size is sufficiently large. Therefore, when we know the standard deviation of the population, we can compute a z-score, and use the normal distribution to evaluate probabilities with the sample mean.

But sample sizes are sometimes small, and often we do not know the standard deviation of the population. When either of these problems occurs, statisticians rely on the distribution of the t statistic (also known as the t score). If the standard deviation is known or the sample size is large (>30), use the normal distribution. 

The formula to calculate the t statistic (also known as the t-score): $t = \frac {\bar{x} - \mu}{s/\sqrt{n}}$, where $\bar{x}$ is the sample mean, $\mu$ is the population mean, s is the standard deviation of the sample, and n is the sample size. 
```
numpy.random.standard_t(df,size=None)
```
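The t statistic formula can be evaluated with the standard library alone. The sample values and the hypothesized population mean `mu_0` below are assumptions for illustration.

```python
import math
import statistics

# Assumed sample and hypothesized population mean, for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
mu_0 = 5.0

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)

# t = (x_bar - mu) / (s / sqrt(n))
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1  # degrees of freedom for a one-sample t statistic

print(t_stat, df)
```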

The particular form of the t-distribution is determined by the degrees of freedom (DF). DF are the number of values in the final calculation of a statistic that are free to vary. In a regression setting, DF depend on the number of observations and the number of predictors: DF equals the number of observations minus the number of parameters estimated. For example, with 50 observations (states) and three predictors, we have 46 degrees of freedom: 50 minus three predictors minus one for the intercept ($n - k - 1$). Degrees of freedom thus combine how much data you have and how many parameters you need to estimate; put differently, DF are the number of independent pieces of information available for estimation. https://stats.stackexchange.com/questions/277009/why-are-the-degrees-of-freedom-for-multiple-regression-n-k-1-for-linear-reg


Properties of t-distribution: 
 - The mean of the distribution is 0
 - The variance is equal to df/(df-2), where df is the degrees of freedom and df > 2
 - The variance is always greater than 1, and approaches 1 as df grows
 


##### Binomial Distribution

Trial is an event with a discrete outcome. 

Success is the outcome of interest for a trial; _p_ is the probability of success, and the probability of failure is 1-p.

Binomial trial - a trial with two outcomes

Binomial distribution is the distribution of the number of successes in n trials. The Bernoulli distribution is the special case of a single trial (n = 1).

With large n, and provided _p_ (probability of success) is not close to 0 or 1, the binomial distribution can be approximated by the normal distribution

```
numpy.random.binomial(n,p,size=None)
```
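For small n the binomial probabilities can be computed exactly from the standard formula $\binom{n}{k} p^{k} (1-p)^{n-k}$. The helper name `binomial_pmf` and the values n = 5, p = 0.1 are assumptions for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 5, 0.1
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance from the PMF; they match n * p and n * p * (1 - p).
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(pmf[0])          # probability of zero successes
print(mean, variance)
```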

##### Poisson and Related Distributions

Many processes produce events randomly at a given overall rate - visitors arriving at a website, cars arriving at a toll plaza. The Poisson distribution tells us the distribution of events per unit of time or space when we sample many such units.

_Lambda_ - the rate (per unit of time or space) at which events occur

_Poisson Distribution_ - frequency distribution of the number of events in sampled units of time or space

_Exponential distribution_ - frequency distribution of the time or distance from one event to the next event

_Weibull Distribution_ - a generalized version of the exponential, in which the event rate is allowed to shift over time

The key parameter in Poisson is $\lambda$, which is the mean number of events that occur in a specified interval of time or space. The variance is also $\lambda$.

```
numpy.random.poisson(lam,size=None)
```
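A quick simulation illustrates that both the mean and the variance of Poisson counts are close to $\lambda$. The rate of 2 events per interval and the sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

lam = 2.0  # assumed rate: an average of 2 events per interval
counts = rng.poisson(lam, size=100_000)

print(counts.mean())  # close to lam
print(counts.var())   # also close to lam
```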

###### Exponential Distribution

Using the same parameter $\lambda$ that we used in Poisson distribution, we can also model the distribution of the time between events: time between visits to a website or between cars arriving at a toll plaza. Also used in engineering to model time to failure. The key assumption here or in Poisson distribution is that the rate $\lambda$ will remain constant over the period being considered. Even though that doesn't happen in global sense, time periods can be divided into segments that are sufficiently homogeneous so that analysis or simulation within those periods is valid.

The event rate $\lambda$ can be estimated from prior data. That doesn't work for rare events though. If there is some data but not enough to provide a precise, reliable estimate of the rate, a goodness-of-fit test (Chi-square) can be applied to various rates to determine how well they fit the observed data

###### Weibull Distribution

In many cases, the event rate does not remain constant over time. If the event rate changes over the time of the interval, the Poisson (or exponential) distribution is no longer useful. The Weibull distribution (WD) is an extension of the exponential distribution in which the event rate is allowed to change, as specified by a shape parameter, $\beta$. If $\beta >1$, the probability of an event increases over time; if $\beta<1$, it decreases. Because the WD is used with time-to-failure analysis instead of event rate, the second parameter is expressed in terms of characteristic life rather than in terms of the rate of events per interval. It is called the scale parameter, denoted $\eta$. 

# Chapter 3: Statistical Experiments and Significance testing

The goal is to design an experiment in order to confirm or reject a hypothesis. An experiment (it might be an A/B test) is designed to test the hypothesis. The term inference tells us that we intend to apply the experiment results, which involve a limited set of data, to a larger population.

###### A/B Testing

A/B test is an experiment with two groups to establish which of two treatments, products, or procedures is better (for example, to test the viability of a new product or a new feature). Often one of the two treatments is the standard existing treatment, or no treatment at all. If a standard (or no) treatment is used, it is called the _control_. A typical hypothesis is that the treatment is better than the control.

_Key Terms_:
 - Treatment - something (drug, price, web headline) to which a subject is exposed
 - Treatment group - a group of subjects exposed to a specific treatment
 - Control group - a group of subjects exposed to no (or standard) treatment
 - Randomization - process of randomly assigning subjects to treatments
 - Subjects - the items that are exposed to treatments
 - Test statistic - the metric used to measure the effect of the treatment
 
A/B tests are common because results are so readily measured. Ideally, subjects are randomized to treatments (or approaches). In that case, the difference in results is due either to the effect of the different treatments, or to the luck of the draw in how subjects were assigned to treatments (the random assignment may have put naturally better-performing subjects in A or in B).
 
A control group is needed as a baseline: it is subject to the same conditions as the treatment group except for the treatment of interest, so it shows how subjects react to the standard approach.

The use of A/B testing in DS is typically in a web context. Treatment might be the design of a web page, the price of the product.


###### Hypothesis Tests

Hypothesis tests are also called significance tests. They are used to help to understand whether random chance might be responsible for an observed effect and to protect researchers from being fooled by random chance.

_Key Terms_:
 - Null Hypothesis - hypothesis that chance is to blame
 - Alternative Hypothesis - counterpoint to the null (something you hope to prove)
 - One-way test - hypothesis test that counts chance results only in one direction
 - Two-way test - hypothesis test that counts chance results in two directions
 
A/B tests are usually constructed with the hypothesis in mind. In a properly designed A/B test, you collect data on treatments A and B in such a way that any observed difference between A and B is due to either a random chance in assignment of subjects, or a true difference between A and B.

The null hypothesis assumes that both treatments are equivalent and that any difference between the groups is due to chance. This is our baseline assumption. 

In A/B tests, we are testing two approaches. We want a hypothesis test to protect us from being fooled by chance in the direction favoring a new approach (say B). We don't care about being fooled the other way around, because we would stay with A unless B proves definitively better. But if we want to be protected from being fooled by chance in either direction, then the alternative hypothesis is bidirectional and we use a two-way test.

###### Resampling

Resampling in statistics means to repeatedly sample values from observed data, with a general goal of assessing random variability in a statistic. There are two main types of resampling: the bootstrap and the permutation test. Permutation tests are used to test hypotheses, typically involving two or more groups.

_Key Terms_:
 - Permutation test - procedure of combining two or more samples together and randomly (or exhaustively) reallocating the observations to resamples. 
 
Permute means to change the order of a set of values. The test has the following procedure:
 - Combine the results from different groups in a single dataset
 - Shuffle the combined data, then randomly draw (without replacement) a resample of the same size as group A
 - From the remaining data, randomly draw (without replacement) a resample of the same size as group B
 - Whatever statistic or estimate was calculated for the original samples, calculate it now for the resamples
 - Repeat the previous steps _R_ times to yield a permutation distribution of the test statistic.
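
The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `permutation_distribution`, the difference in means as the statistic, and the session-time numbers are assumptions made for the example, not part of the notes.

```python
import random
from statistics import mean

def permutation_distribution(group_a, group_b, n_resamples=1000, seed=0):
    """Return R permuted differences in means (resample B minus resample A)."""
    rng = random.Random(seed)
    combined = list(group_a) + list(group_b)   # combine the groups
    n_a = len(group_a)
    diffs = []
    for _ in range(n_resamples):               # repeat R times
        rng.shuffle(combined)                  # shuffle the combined data
        resample_a = combined[:n_a]            # draw without replacement, size of A
        resample_b = combined[n_a:]            # the remaining data, size of B
        diffs.append(mean(resample_b) - mean(resample_a))  # same statistic as before
    return diffs

# Hypothetical session times (seconds) for two page designs.
page_a = [21, 30, 35, 40, 46, 52]
page_b = [28, 39, 44, 58, 61, 67]

observed = mean(page_b) - mean(page_a)
perm_diffs = permutation_distribution(page_a, page_b)
print(f"observed difference: {observed:.2f}")
print(f"largest permuted difference: {max(perm_diffs):.2f}")
```

Slicing a shuffled list is equivalent to drawing the two resamples without replacement, which keeps the sketch short.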
 
Now we go back to the observed difference between groups and compare it to the set of permuted differences. If it lies well within the set of permuted differences, then we have not proven anything - the observed difference is within the range of what chance might produce. However, if the observed difference lies _outside_ most of the permutation distribution, then we conclude that chance is unlikely to be responsible. Technically, the difference is _statistically significant_.

##### Exhaustive and Bootstrap Permutation Test
There are two variants of the permutation test: the exhaustive permutation test and the bootstrap permutation test.

In an exhaustive permutation test, instead of just randomly shuffling and dividing the data, we actually figure out all the possible ways it could be divided. This is practical only for small sample sizes. With a large number of repeated shufflings, the random permutation test results approximate those of the exhaustive permutation test, and approach them in the limit. Exhaustive permutation tests are also known as exact tests, due to their statistical property of guaranteeing that the null model will not test as significant more often than the alpha level of the test. In a bootstrap permutation test, the draws in the random permutation procedure are made with replacement instead of without replacement.
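
As a rough sketch of the exhaustive variant, the code below enumerates every way of splitting the pooled data into a group of size A and a group of size B. The function name `exhaustive_permutation_pvalue` and the tiny data set are hypothetical, and the one-way p-value it returns is just one reasonable way to summarize the enumeration.

```python
from itertools import combinations
from statistics import mean

def exhaustive_permutation_pvalue(group_a, group_b):
    """Enumerate all splits and return (one-way p-value, number of splits)."""
    combined = list(group_a) + list(group_b)
    n, n_a = len(combined), len(group_a)
    observed = mean(group_b) - mean(group_a)
    total = sum(combined)
    n_splits = 0
    n_extreme = 0
    for idx_a in combinations(range(n), n_a):       # every possible "group A"
        sum_a = sum(combined[i] for i in idx_a)
        diff = (total - sum_a) / (n - n_a) - sum_a / n_a
        n_splits += 1
        if diff >= observed - 1e-12:                # as large as the observed difference
            n_extreme += 1
    return n_extreme / n_splits, n_splits

p_value, n_splits = exhaustive_permutation_pvalue([1, 2, 3], [4, 5, 6])
print(n_splits, p_value)   # 20 splits, only the original split is as extreme: 0.05
```

The number of splits grows combinatorially (two groups of 20 already give more than 10^11 splits), which is why this is practical only for small samples.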

Permutation tests are useful in exploring the role of random variation.

In a permutation test, multiple samples are combined and then shuffled. The shuffled values are then divided into resamples, and the statistic of interest is calculated. This process is then repeated, and the resampled statistic is tabulated. Comparing the observed value of the statistic to the resampled distribution allows you to judge whether an observed difference between samples might occur by chance. If the observed difference lies well within the set of permuted differences, then we have not proven anything - the observed difference is within the range of what chance might produce. However, if the observed difference lies outside most of the permutation distribution, then we conclude that chance is unlikely to be responsible.


##### Statistical Significance and P-values

Significance is how statisticians measure whether an experiment yields a result more extreme than what chance might produce. If the result is beyond the realm of chance variation, it is said to be statistically significant.

_Key Terms_:
 - P-value - Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed results. In other words, it is the probability of finding the observed, or more extreme, results when the null hypothesis is true. It is sometimes described as the level of marginal significance within a statistical hypothesis test. The p-value approach to hypothesis testing uses this calculated probability to determine whether there is evidence to reject the null hypothesis. A smaller p-value indicates stronger evidence against the null hypothesis, in favor of the alternative hypothesis.
 - P-hacking - Data dredging (also data fishing, data snooping, data butchery, and p-hacking) is the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect
 - Alpha - the probability threshold of "unusualness" that chance results must surpass, for actual outcomes to be deemed statistically significant
 - Type 1 Error - Mistakenly concluding an effect is real (when it is due to chance). A false rejection of the null hypothesis when it is true.
 - Type 2 Error - Mistakenly concluding an effect is due to chance when it is real. A failure to reject the null hypothesis when it is false.
 
Chance variation means the random variation produced by a probability model that embodies the null hypothesis, for example that there is no difference between two conversion rates. The question is then: "If the two prices share the same conversion rate, could chance variation produce a difference as big as 5%?"
 
##### P-value. Alphas

We can estimate a p-value from our permutation test by taking the proportion of times that the permutation test produces a difference equal to or greater than the observed difference.

Alpha is a threshold chosen in advance, such as "more extreme than 5% of the chance (null hypothesis) results".
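
The proportion described above is easy to compute once the permuted differences are available. In this small sketch the list `perm_diffs` is a made-up stand-in for a real permutation distribution, and both function names are hypothetical; the two-way version is included only to show how the count changes when both directions are considered.

```python
def one_way_p_value(perm_diffs, observed_diff):
    """Share of permuted differences at least as large as the observed one."""
    return sum(1 for d in perm_diffs if d >= observed_diff) / len(perm_diffs)

def two_way_p_value(perm_diffs, observed_diff):
    """Share of permuted differences at least as extreme in either direction."""
    return sum(1 for d in perm_diffs if abs(d) >= abs(observed_diff)) / len(perm_diffs)

# Hypothetical permuted differences; a real test would use many more.
perm_diffs = [-4.0, -2.5, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.5, 5.0]
observed_diff = 3.0
alpha = 0.05

p_one = one_way_p_value(perm_diffs, observed_diff)   # 2 of 10 -> 0.2
p_two = two_way_p_value(perm_diffs, observed_diff)   # 3 of 10 -> 0.3
print(p_one, p_two, p_one <= alpha)                  # not significant at alpha = 0.05
```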

Significance tests (also called hypothesis tests) exist to protect against being fooled by random chance.



```
Other sources
```
Before we run any statistical test, we must first determine the alpha level, which is also called the significance level. By definition, the alpha level is the probability of rejecting $H_{0}$ when the null hypothesis is true (the probability of a false positive). This number tells us the maximum probability with which we are willing to risk a Type I error (FP: rejecting the null when it is true).

Once we've chosen alpha, we are ready to conduct a hypothesis test. We can use any software, such as Minitab, but in the end, we will arrive at something known as p-value. P-value is the probability of obtaining a result as extreme as (or more extreme than) the result actually obtained when the null hypothesis is true. In the case when _p<=$\alpha$_, we reject null hypothesis and result is statistically significant. If _p>$\alpha$_, then we fail to reject null hypothesis and results are statistically non-significant. Alpha sets the standard for how extreme the data must be before we can reject the null hypothesis. The p-value tells us how extreme the data is. 
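
One way to see that alpha is the Type I error rate is to simulate many experiments in which the null hypothesis is true and count how often the rule "reject when p <= alpha" fires. The sketch below is only an illustration: it uses a two-way z-test with a known standard deviation, which is not covered in these notes and was picked solely because its p-value can be computed with the standard library. The function name and parameters are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def type_one_error_rate(alpha=0.05, n=30, n_experiments=2000, seed=0):
    """Fraction of true-null experiments in which H0 is (wrongly) rejected."""
    rng = random.Random(seed)
    std_normal = NormalDist()                       # standard normal, mean 0, sd 1
    rejections = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # H0 is true: mean is 0
        z = mean(sample) * sqrt(n)                  # z statistic with known sd = 1
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-way p-value
        if p_value <= alpha:
            rejections += 1                         # a Type I error
    return rejections / n_experiments

rate = type_one_error_rate()
print(f"rejection rate under H0: {rate:.3f}")       # should land near alpha
```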

P-value answers the question: How likely is it that I would get the data I have, assuming the null hypothesis is true? A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the p value. A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A high p value means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. The null hypothesis is usually an hypothesis of "no difference" e.g. no difference between blood pressures in group A and group B. Define a null hypothesis for each study question clearly before the start of your study. The alternative hypothesis (H1) is the opposite of the null hypothesis; in plain language terms this is usually the hypothesis you set out to investigate. If your P value is less than the chosen significance level then you reject the null hypothesis i.e. accept that your sample gives reasonable evidence to support the alternative hypothesis.

https://slideplayer.com/slide/6365872/



###### T-tests

There are numerous types of significance tests, depending on whether the data comprises count data or measured data, how many samples there are, and what is being measured. A very common one is the t-test. All significance tests require us to specify a test statistic that measures the effect we are interested in.
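
As an example of a test statistic, here is a minimal sketch of the two-sample t statistic. The notes do not say which t-test variant is meant, so the unequal-variance (Welch) form is an assumption, and the function name and data are hypothetical. Turning the statistic into a p-value needs the t-distribution, which is left to a statistics library (for instance `scipy.stats.ttest_ind` with `equal_var=False`).

```python
from math import sqrt
from statistics import mean, variance

def welch_t_statistic(sample_a, sample_b):
    """t = (mean_b - mean_a) / sqrt(s_a^2 / n_a + s_b^2 / n_b)."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_b) - mean(sample_a)) / standard_error

sample_a = [1, 2, 3, 4, 5]   # sample variance 2.5
sample_b = [3, 4, 5, 6, 7]   # sample variance 2.5
t_stat = welch_t_statistic(sample_a, sample_b)
print(t_stat)                # (5 - 3) / sqrt(0.5 + 0.5) = 2.0
```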

###### Notes
This issue is related to the problem of overfitting in data mining, or “fitting the model to the noise.” The more variables you add, or the more models you run, the greater the probability that something will emerge as “significant” just by chance.

For predictive modeling, the risk of getting an illusory model whose apparent efficacy is largely a product of random chance is mitigated by cross-validation (see “Cross-Validation”), and use of a holdout sample.

###### Degrees of freedom

The concept is applied to statistics calculated from sample data, and refers to the number of values free to vary. For example, if you know the mean for a sample of 10 values, and you also know 9 of the values, you also know the 10th value. Only 9 are free to vary.
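
The example above can be checked with a few lines of arithmetic. The nine known values and the mean below are made up for illustration.

```python
# Nine known values out of a sample of ten (hypothetical numbers).
known_values = [4, 8, 15, 16, 23, 42, 7, 9, 11]
sample_mean = 14.5
n = 10

# The mean fixes the total, so the tenth value is not free to vary.
tenth_value = sample_mean * n - sum(known_values)
print(tenth_value)   # 145 - 135 = 10.0
```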

###### ANOVA

Instead of an A/B test, suppose we have a comparison of multiple groups, say A-B-C-D, each with numeric data. The statistical procedure that tests for a statistically significant difference among the groups is called analysis of variance, or ANOVA. Using this test, we can determine whether there are any significant differences between the means of three or more independent groups. The one-way ANOVA compares the means between the groups you are interested in and determines whether any of those means are statistically different from each other. So the null hypothesis is $H_{0}: \mu_1=\mu_2=\dots=\mu_k$.

If, however, the one-way ANOVA returns a statistically significant result, we accept the alternative hypothesis (HA), which is that there are at least two group means that are statistically significantly different from each other.

_Key Terms_
 - Pairwise comparison - a hypothesis test between two groups among multiple groups
 - Omnibus test - a single hypothesis test of the overall variance among multiple group means
 - Decomposition of variance - separation of components, contributing to an individual value
 - F-statistic - a standardized statistic that measures the extent to which differences among group means exceeds what might be expected in a chance model
 - SS - sum of squares, referring to deviations from some average value
 
ANOVA is used to find out whether survey or experimental results are significant. It helps us decide whether to reject the null hypothesis in favor of the alternative hypothesis. ANOVA is a statistical method used to test differences between two or more means. One-way ANOVA tests the null hypothesis that three or more group means are equal: $H_{0}: \mu_{group 1} = \mu_{group 2} = \mu_{group 3} = \dots = \mu_{group k}$. One-way ANOVA is an omnibus test (a single hypothesis test of the overall variance among multiple group means). It cannot tell us which specific groups were statistically significantly different from each other; it only tells us that at least two group means are different.
 
Two-way ANOVA tests the effect of 2 independent variables (factors) on a dependent variable. 
 
 
##### F-statistic

Just like the t-test can be used instead of a permutation test for comparing the means of two groups, there is a statistical test for ANOVA based on the F-statistic. The value you get when you run an ANOVA test is called the F-value. The F-test tells us whether a group of variables is jointly significant. If the calculated F-value is larger than the critical F-value for the chosen alpha, we can reject the null hypothesis. The F-value in one-way ANOVA helps us answer the question: "Is the variance between the group means significantly larger than the variance within the groups?" F-value = between-group variance (mean square between) / within-group variance (mean square within).
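
A minimal sketch of the one-way ANOVA F-value, computed from sums of squares, may help tie the terms together. The function name `one_way_f_statistic` and the three small groups are hypothetical; the critical value needed for the final decision would come from an F-distribution table or a statistics library.

```python
from statistics import mean

def one_way_f_statistic(groups):
    """F = mean square between groups / mean square within groups."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    grand_mean = mean(all_values)
    # SS between: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # SS within: how far each value is from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)          # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)      # N - k degrees of freedom
    return ms_between / ms_within

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_f_statistic(groups)
print(f_value)   # SS between = 26, SS within = 6, so F = 13 / 1 = 13.0
```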


###### Chi-square Test

The chi-square test is used with count data to test how well it fits some expected distribution. It tests how likely it is that an observed distribution is due to chance. The chi-square statistic is also called a goodness of fit statistic because it measures how well the observed distribution of data fits the distribution that is expected if the variables are independent. It is designed to analyze categorical data.

_Key Terms_
 - Chi-square statistic - a measure of the extent to which some observed data departs from expectation
 - Expectation or expected - how we would expect the data to turn out under some assumption, typically the null hypothesis

It is a procedure for testing if two categorical variables are related in some population (independent vs. related). It is a nonparametric test. 


There are actually two types of chi-square tests: the chi-square goodness of fit test and the chi-square test for independence. The chi-square goodness of fit test determines whether sample data matches a population. The one we are working with is the chi-square test for independence. In a larger sense, this test tries to see whether distributions of categorical variables differ from each other (a large chi-square test statistic indicates a relationship, while a small one means the observed data fits the independence assumption well). The chi-square statistic is calculated as follows:
 - Find the Pearson residual for each cell first -> $R_{ij}=\frac{Observed_{ij}-Expected_{ij}}{\sqrt{Expected_{ij}}}$
 - Then the statistic is the sum of the squared Pearson residuals: $\chi^{2}=\sum_{i}^{row} \sum_{j}^{column} R_{ij}^{2}$, or if we want to combine the two formulas into one: $\chi^{2} = \sum_{i}^{row} \sum_{j}^{column} \frac{(Observed_{ij}-Expected_{ij})^{2}}{Expected_{ij}}$
 - Comparing the statistic to the chi-square distribution gives a p-value.
 
The hypotheses are formulated as:
 - $H_{0}$: Variable 1 is independent of Variable 2
 - $H_{1}$: Variable 1 is not independent of Variable 2

If the calculated $\chi^{2}$ value is larger than the critical $\chi^{2}$ value, then we reject the null hypothesis. With p-values, the approach is the same: if p-value<$\alpha$, then we reject the null hypothesis and conclude that the two variables are dependent; otherwise we fail to reject the null hypothesis and have no evidence that the two variables are dependent.
 
The degrees of freedom for chi-squared test is calculated as $df=(r-1)\cdot(c-1)$.
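
The chi-square statistic and its degrees of freedom can be computed directly from a contingency table. This is a sketch under simple assumptions: the function name `chi_square_independence` and the 2x2 table of counts are hypothetical, and expected counts are taken as row total times column total divided by the grand total (the usual expectation under independence).

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            residual = (obs - expected) / expected ** 0.5   # Pearson residual
            chi2 += residual ** 2
    df = (len(observed) - 1) * (len(observed[0]) - 1)       # (r - 1) * (c - 1)
    return chi2, df

# Hypothetical counts: rows are two page designs, columns are click / no click.
table = [[10, 20],
         [20, 10]]
chi2, df = chi_square_independence(table)
print(round(chi2, 3), df)   # every expected count is 15, so chi2 = 4 * 25 / 15 = 6.667, df = 1
```

With one degree of freedom the critical value at $\alpha = 0.05$ is about 3.84, so this made-up table would lead to rejecting independence.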
 
###### Relevance for DS

One DS application of the chi-square test, especially Fisher's exact version, is in determining appropriate sample sizes for web experiments. Chi-square tests are used in DS more as a filter to determine whether an effect or feature is worthy of further consideration than as a formal test of significance. They can also be used in automated feature selection in ML, to assess class prevalence across features and identify features where the prevalence of a certain class is unusually high or low.

# Interview Questions

##### 109 Data Science Interview Questions and Answers ( https://www.springboard.com/blog/data-science-interview-questions/ )

Q: What is the Central Limit Theorem and why is it important?

A: Central Limit Theorem - tendency of the sampling distribution to take on a normal shape (Gaussian Distribution) as sample size rises, even if the source population is not normally distributed (under assumption that the sample size is large enough, usually>30; if the sample size (n)<30, then the population should have a normal distribution). The mean of the sampling distribution will approximate the mean of the true population. A larger sample size will produce a smaller sampling distribution variance. The central-limit theorem allows normal approximation formulas like t-distribution to be used in calculating sampling distributions for inference - that is, confidence intervals and hypothesis tests.
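
A small simulation can make this answer concrete. The sketch below draws repeated samples from an exponential distribution (clearly not normal, with mean 1) and looks at the distribution of the sample means. The function name, the choice of distribution, and the sample sizes are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

def sample_mean_summary(n, n_samples=2000, seed=0):
    """Center and spread of the sampling distribution of the mean for sample size n."""
    rng = random.Random(seed)
    sample_means = [
        mean([rng.expovariate(1.0) for _ in range(n)])   # skewed source population
        for _ in range(n_samples)
    ]
    return mean(sample_means), pstdev(sample_means)

center_small, spread_small = sample_mean_summary(n=5)
center_large, spread_large = sample_mean_summary(n=50)
print(f"n=5:  center {center_small:.3f}, spread {spread_small:.3f}")
print(f"n=50: center {center_large:.3f}, spread {spread_large:.3f}")
```

Both centers should sit close to the population mean of 1, and the spread for n=50 should be noticeably smaller than for n=5, roughly following 1/sqrt(n).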

Q: What is sampling? How many sampling methods do you know?

A: Sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points, with the goal of identifying patterns in the larger dataset. There are several main sampling techniques:
 - Simple random sampling
 - Stratified sampling - subsets of population are created based on some common factor and samples are randomly collected from each subgroup
 - Cluster sampling - larger data sets are divided into subsets (clusters) based on a defined factor, then a random sampling of clusters is analyzed
 - Bootstrap aggregating
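
To illustrate one item from the list, here is a rough sketch of stratified sampling with the standard library. The function name `stratified_sample`, its `key` and `fraction` parameters, and the device-type records are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Randomly sample the same fraction from each stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)        # group records by the common factor
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))     # simple random sample within a stratum
    return sample

# Hypothetical visitors: 80 on mobile, 20 on desktop.
visitors = [("mobile", i) for i in range(80)] + [("desktop", i) for i in range(20)]
sample = stratified_sample(visitors, key=lambda r: r[0], fraction=0.1)
print(len(sample))   # 8 mobile + 2 desktop = 10
```

Unlike a simple random sample of 10 visitors, this keeps the 80/20 split of the population in the sample.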
 
Q: What is the difference between type I vs type II error?
 
A: Type 1 Error - False rejection of null hypothesis, when it is True. Type 2 Error - Fails to reject null hypothesis, when it is False.
 
Q: What are the assumptions required for linear regression?
 
A: There are four major assumptions: 1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data, 2. The errors or residuals of the data are normally distributed and independent from each other, 3. There is minimal multicollinearity between explanatory variables, and 4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.
 
Q: What is a statistical interaction?
 
A: Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor.
 
Q: Binomial Probability Formula?

A: P(x) = $\frac {n!}{(n-x)! \cdot x!} \cdot p^{x} \cdot q^{n-x}$, where
- n is the number of trials
- x is the number of successes desired
- p is the probability of getting a success in one trial
- q is the probability of getting a failure in one trial ($q = 1-p$)
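
The formula translates directly into Python. This is a minimal sketch; the function name `binomial_pmf` is hypothetical, and `math.comb` computes the $\frac{n!}{(n-x)! \cdot x!}$ term.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p                                   # probability of failure in one trial
    return comb(n, x) * p ** x * q ** (n - x)

# Probability of exactly 3 heads in 5 fair coin flips.
prob = binomial_pmf(3, 5, 0.5)
print(prob)   # 10 * 0.125 * 0.25 = 0.3125
```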
