<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Statistics Fundamentals Part 2

_Authors: Alexander Egorenkov (DC)_

---

<a id="learning-objectives"></a>
### Learning Objectives
- **Explain** the difference between causation and correlation
- **Determine** causality and sampling bias using Directed Acyclic Graphs
- **Identify** what missing data is and how to handle it
- **Test** a hypothesis using a sample case study

### Lesson Guide
- [Data Source](#data-source)
	- [What are the features/covariates/predictors?](#what-are-the-featurescovariatespredictors)
	- [What is the outcome/response?](#what-is-the-outcomeresponse)
	- [What do you think each row in the dataset represents?](#what-do-you-think-each-row-in-the-dataset-represents)
- [Math review](#math-review)
	- [Covariance](#covariance)
	- [Correlation](#correlation)
	- [The variance-covariance matrix](#the-variance-covariance-matrix)
- [Causation and Correlation](#causation-and-correlation)
	- [Structure of causal claims](#structure-of-causal-claims)
	- [Why do we care?](#why-do-we-care)
	- [How do we determine if something is causal?](#how-do-we-determine-if-something-is-causal)
- [Pearlean Causal DAG model](#pearlean-causal-dag-model)
	- [What is a DAG?](#what-is-a-dag)
	- [It's possible that X causes Y.](#its-possible-that-x-causes-y)
	- [Y causes X.](#y-causes-x)
	- [The correlation between X and Y is not statistically significant.](#the-correlation-between-x-and-y-is-not-statistically-significant)
	- [X or Y may cause one or the other indirectly through another variable.](#x-or-y-may-cause-one-or-the-other-indirectly-through-another-variable)
	- [There is a third common factor that causes both X and Y.](#there-is-a-third-common-factor-that-causes-both-x-and-y)
	- [Both X and Y cause a third variable and the dataset does not represent that third variable evenly.](#both-x-and-y-cause-a-third-variable-and-the-dataset-does-not-represent-that-third-variable-evenly)
	- [Controlled Experiments](#controlled-experiments)
	- [When is it OK to rely on association?](#when-is-it-ok-to-rely-on-association)
	- [How does association relate to causation?](#how-does-association-relate-to-causation)
- [Sampling bias](#sampling-bias)
	- [Forms of sampling bias](#forms-of-sampling-bias)
	- [Problems from sampling bias](#problems-from-sampling-bias)
	- [Recovering from sampling bias](#recovering-from-sampling-bias)
    - [Stratified random sampling](#stratified-random-sampling)
- [Missing data](#missing-data)
	- [Types of missing data](#types-of-missing-data)
	- [De minimis](#de-minimis)
	- [Class imbalance](#class-imbalance)
    - [Relation to machine learning](#relation-to-machine-learning)
- [Introduction to Hypothesis Testing](#introduction-to-hypothesis-testing)
	- [Validate your findings](#validate-your-findings)
	- [Confidence intervals](#confidence-intervals)
	- [Error types](#error-types)
- [Scenario](#scenario)
	- [Exercises](#exercises)
	- [Statistical Tests](#statistical-tests)
	- [Interpret your results](#interpret-your-results)


<a id="data-source"></a>
## Data Source

---

Today, we’ll use advertising data from an example in the book [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/).
- This is a well-known, standard introduction to Machine Learning.
- The book has a more advanced version, [Elements of Statistical Learning](http://web.stanford.edu/~hastie/ElemStatLearn/), if you are comfortable with Linear Algebra and Statistics approaching the grad level.

#### Codealong: Bring in Today's data

In [1]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# this allows plots to appear directly in the notebook
%matplotlib inline
plt.style.use('fivethirtyeight')

In [2]:
# read data into a DataFrame
# we use index_col to tell Pandas that the first column in the data has row labels
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

In [3]:
# examine the data with .head()

##### Questions About the Advertising Data

Let's pretend you work for the company that manufactures and markets a new device. The company might ask you the following: From the data we currently have, how should we spend our advertising money in the future?

<a id="what-are-the-featurescovariatespredictors"></a>
### What are the features/covariates/predictors?

In [4]:
# Answer:

<a id="what-is-the-outcomeresponse"></a>
### What is the outcome/response?

In [5]:
# Answer:

<a id="what-do-you-think-each-row-in-the-dataset-represents"></a>
### What do you think each row in the dataset represents?

In [6]:
# Answer:

<a id="math-review"></a>
## Math review
---

<a id="covariance"></a>
### Covariance

Covariance is a measure of the joint variability between two random variables.

You can think of this as a measure of linear association. If you have the variance of Y and the variance of X, the covariance is the amount of variance that they share.

$$cov(X, Y) = \frac {\sum{(x_i - \bar{X})(y_i - \bar{Y})}} {n}$$

> We can gain insight into covariance by looking closely at this formula. First, observe that the formula effectively pairs the first $x$ data point with the first $y$ data point: $(x_1, y_1)$. All computations are done only on these pairs of points.

> Second, let's ask **when would covariance be positive**? The numerator is a sum of products, so covariance is positive when those products are positive on balance. A single product is positive when $(x_i - \bar{X})$ and $(y_i - \bar{Y})$ are (1) both positive or (2) both negative. This occurs when: (1) both data points are above their respective means. Or when: (2) both data points are below their respective means! So, if the $x$ data points tend to vary from their mean in the same way the $y$ data points vary from their mean, covariance will be positive.

> Third, **might outliers affect covariance?** Yes! Due to the structure of the formula (a sum of terms), a large outlier pair far from the means could strongly pull the covariance in one direction.

Expressed using expectation notation:
$$cov(\mathbf{X}, \mathbf{Y}) = \mathbb{E}[(\mathbf{X}-\mathbb{E}[\mathbf{X}])(\mathbf{Y}-\mathbb{E}[\mathbf{Y}])]$$

A useful special case (used below):

$$cov(X, X) = \frac {\sum{(x_i - \bar{X})^2}} {n} = var(X) = \sigma_X^2$$
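
To make the formula concrete, here is a minimal sketch that computes covariance by hand on a small set of made-up paired values (the numbers are assumptions chosen for illustration, not the advertising data). It also checks the special case $cov(X, X) = var(X)$. Note that `np.cov` divides by $n - 1$ by default, so `bias=True` is passed to match the formula above.

```python
import numpy as np

# Toy paired data (hypothetical values chosen for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance from the formula: sum of products of deviations, divided by n
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / len(x)

# np.cov divides by n - 1 by default; bias=True divides by n instead
cov_np = np.cov(x, y, bias=True)[0, 1]

# Special case: cov(X, X) is the variance of X (np.var divides by n by default)
cov_xx = np.sum((x - x.mean()) ** 2) / len(x)

print(cov_xy, cov_np)      # both 1.2 for these values
print(cov_xx, np.var(x))   # both 2.0 for these values
```

Only the first and last pairs contribute to the sum here, because the other three pairs have a deviation of zero in either $x$ or $y$.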

<a id="correlation"></a>
### Correlation

While covariance is a useful measure, it can be difficult to compare covariances because they are not standardized. 

Instead, we can use the correlation, which measures the same linear association but rescales it to lie between -1 and 1. A value of 1 means perfect positive linear correlation, 0 means no linear correlation, and -1 means perfect inverse correlation.

$$corr(X,Y) = \frac {cov(X,Y)} {\sigma_X\sigma_Y} = \frac {\mathbb{E}[(\mathbf{X}-\mathbb{E}[\mathbf{X}])(\mathbf{Y}-\mathbb{E}[\mathbf{Y}])]} {\sigma_X\sigma_Y}$$

Note that the standard deviations $\sigma_X$ and $\sigma_Y$ are always positive (as long as neither variable is constant), making the denominator positive. So, the sign of the covariance between $X$ and $Y$ is the same as the sign of their correlation! 
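
The following sketch, again on made-up values, computes correlation from covariance and compares it with `np.corrcoef`. It also rescales $x$ by 100 to show why correlation is easier to compare than covariance: the covariance grows by the same factor, while the correlation stays the same.

```python
import numpy as np

# Toy paired data (hypothetical values chosen for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Correlation = covariance divided by the product of the standard deviations
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr_xy = cov_xy / (x.std() * y.std())   # np.std divides by n by default
corr_np = np.corrcoef(x, y)[0, 1]

# Changing the units of x (multiplying by 100) scales the covariance...
x_scaled = 100 * x
cov_scaled = np.mean((x_scaled - x_scaled.mean()) * (y - y.mean()))
# ...but leaves the correlation unchanged
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(corr_xy, corr_np, corr_scaled)
print(cov_xy, cov_scaled)
```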

To better illustrate how correlation describes the way $X$ and $Y$ change together, here are some visual examples. Notice that a correlation number by itself does not always reveal the relationship between the variables -- always try to supplement a correlation with a plot!

![](./assets/images/correlation_examples.png)

<a id="the-variance-covariance-matrix"></a>
### The variance-covariance matrix

For our purposes in modeling and machine learning, the fastest way to get a preview of the underlying relationships in our data is to use the variance-covariance matrix.

The variance-covariance matrix shows the covariance between every variable in our data.

Given $n$ features from $X_1$ to $X_n$, the variance-covariance matrix looks like this (recall that $cov(X, X) = var(X)$):

$$\left[ \begin{array}{cccc}
var(X_1) & cov(X_1,X_2) & \cdots & cov(X_1,X_n)  \\
cov(X_2,X_1) & var(X_2) & \cdots & cov(X_2,X_n)  \\
\vdots & \vdots & \ddots & \vdots \\
cov(X_n,X_1) & cov(X_n,X_2) & \cdots & var(X_n)
\end{array} \right]$$

By quickly glancing at this matrix, we can glean insight about which variables might be strongly correlated. This may indicate redundant features and/or may affect some models.

If data is mean-centered, every column has its mean subtracted from itself. So, the mean for every column is now 0. You can then compute the variance-covariance matrix as:

$$\frac {X^TX} {n}$$

Those of you who have been exposed to linear regression may recognize this term.
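
As a quick check of the $\frac{X^TX}{n}$ identity, the sketch below builds a small synthetic matrix (random values, an assumption made purely for illustration), mean-centers each column, and compares the result with `np.cov`. The `rowvar=False` argument tells NumPy that columns are variables, and `bias=True` makes it divide by $n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 columns; column 2 is built to depend on column 0
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 2]

# Mean-center every column, then apply X^T X / n
X_centered = X - X.mean(axis=0)
n = X.shape[0]
cov_matrix = X_centered.T @ X_centered / n

# Reference result from NumPy (columns as variables, divide by n)
cov_np = np.cov(X, rowvar=False, bias=True)

print(np.round(cov_matrix, 3))
```

The diagonal holds the variances, and the matrix is symmetric because $cov(X_i, X_j) = cov(X_j, X_i)$.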

#### Calculate the variance-covariance matrix. Make sure to first de-mean the data.

In [7]:
# Answer:

#### Calculate the correlation matrix using the built-in `corr` method of the DataFrame.

In [8]:
# Answer:

When we have a lot of data, the correlation matrix may be too difficult to read. It can help to make a plot.

#### Use seaborn's `heatmap` function to make a plot of the correlation matrix.

In [9]:
# Answer:

Of course, looking at linear association doesn't tell us the whole picture. We can get a more detailed look with a scatter plot matrix.

#### Use seaborn's `pairplot` function to make joint scatterplots of the data.

In [10]:
# Answer:

<a id="causation-and-correlation"></a>
## Causation and Correlation
---

#### Objective: Explain the difference between causation and correlation

- Think of various examples you’ve seen in the media related to food.
- [Study links coffee consumption to decreased risk of colorectal cancer](https://news.usc.edu/97761/new-study-links-coffee-consumption-to-decreased-risk-of-colorectal-cancer/)
- [Coffee Does Not Decrease Risk of Colorectal Cancer](http://news.cancerconnect.com/coffee-does-not-decrease-risk-of-colorectal-cancer/)
- [There's a whole book series based on these Spurious Correlations](http://www.tylervigen.com/spurious-correlations)

**Why is this?**
- Sensational headlines?
- Robust data analysis is often neglected.
- Causal claims and associations are difficult to convey in an unambiguous way.

- The above food claims are **correlated**, but may or may not be **causal**.

<a id="structure-of-causal-claims"></a>
### Structure of causal claims
- If X happens, Y must happen
- If Y happens, X must have happened 
  - (You need X and something else for Y to happen)
- If X happens, Y will probably happen
- If Y happens, X probably happened

> **Note:** Properties that follow from a definition are not causal. If something is a triangle, it is implied that it has three sides. However, its being a triangle does not cause it to have three sides.

<a id="why-do-we-care"></a>
### Why do we care?
- Understanding this difference is critical in the data science workflow, especially when Identifying and Acquiring data.
- We need to fully articulate our question and use the right data to answer it, including considering any **confounders**.

> **Confounders** are variables, often unobserved, that affect both the predictor and the outcome. If we neglect to include confounding variables in an analysis, we could easily make an inaccurate model. For example, we might falsely assume that eating more ice cream cones causes us to wear fewer layers of clothing. In actuality, both ice cream consumption and clothing layers are driven by a confounding variable -- temperature! From this data alone, we can only conclude that ice cream consumption is _correlated with_ clothing layers.

- We don’t want to overstate what our model measures.
- Be careful not to say “caused” when you really mean “measured” or “associated”.

<a id="how-do-we-determine-if-something-is-causal"></a>
### How do we determine if something is causal?
Causal criteria are one approach to assessing causal relationships.

However, it’s very hard to define universal causal criteria.

One attempt that is commonly used in the medical field is based on work by Bradford Hill.


He developed a list of “tests” that an analysis must pass in order to indicate a causal relationship:


- Strength of association
- Consistency
- Specificity
- Temporality
- Biological gradient
- Plausibility
- Coherence
- Experiment
- Analogy


**Strength (effect size)**: A small association does not mean that there is not a causal effect, though the larger the association, the more likely that it is causal.

**Consistency (reproducibility)**: Consistent findings observed by different persons in different places with different samples strengthens the likelihood of an effect.

**Specificity**: Causation is likely if there is a very specific population at a specific site and disease with no other likely explanation. The more specific an association between a factor and an effect is, the bigger the probability of a causal relationship.

**Temporality**: The effect has to occur after the cause (and if there is an expected delay between the cause and expected effect, then the effect must occur after that delay).

**Biological gradient**: Greater exposure should generally lead to greater incidence of the effect. However, in some cases, the mere presence of the factor can trigger the effect. In other cases, an inverse proportion is observed: greater exposure leads to lower incidence.

**Plausibility**: A plausible mechanism between cause and effect is helpful (but Hill noted that knowledge of the mechanism is limited by current knowledge).

**Coherence**: Coherence between epidemiological and laboratory findings increases the likelihood of an effect. However, Hill noted that "... lack of such [laboratory] evidence cannot nullify the epidemiological effect on associations".

**Experiment**: "Occasionally it is possible to appeal to experimental evidence".

**Analogy**: The effect of similar factors may be considered.

<a id="pearlean-causal-dag-model"></a>
## Pearlean Causal Directed Acyclic Graph (DAG) model

---
### Some quick background notes:
- This is a visual tool to help us reason about causality and association
- It was proposed by Judea Pearl, although there are many similar models
- We will only scratch the surface, so look into other resources if you find this interesting
    - We cover the basic idea and most notable cases
    - We won't talk about the formal mathematics and probability underneath or how to use d-separation to infer causality

<a id="what-is-a-dag"></a>
### What is a DAG?
- DAG stands for directed acyclic graph. It's a collection of nodes connected by lines.
- Each line has an arrow to point in a direction.
- If you follow the arrows, you eventually reach a final node; you can never loop back to a node you have already visited.

A single circle or node in a Causal DAG represents an event, something that happens at one point in time.

![](./assets/images/dag1.png)

Let's pretend the random variables X and Y, or two different types of events, are correlated with each other.

**What are the possible causal structures that will give us this correlation?**
- X causes Y
- Y causes X
- There is no actual causation
- X or Y indirectly causes the other
- There is a third factor that causes both
- X and Y cause a third factor, but our data collects the third factor unevenly

<a id="its-possible-that-x-causes-y"></a>
### It's possible that X causes Y.
![](./assets/images/x-cause-y.png)

In [11]:
# Example where Y is a function of X:
X = np.random.randn(100)
Y = 5 + 2*X + np.random.randn(100)
dag = pd.DataFrame({'X':X, 'Y':Y})

# make a pairplot of the data:

<a id="y-causes-x"></a>
### Y causes X.
![](./assets/images/y-cause-x.png)

In [12]:
# Example where X is a function of Y:
Y = np.random.randn(100)
X = 5 + 2*Y + np.random.randn(100)
dag = pd.DataFrame({'X':X, 'Y':Y})

# make a pairplot of the data:

<a id="the-correlation-between-x-and-y-is-not-statistically-significant"></a>
### The correlation between X and Y is not statistically significant.
![](./assets/images/xy.png)

In [13]:
# No correlation between X and Y:
X = np.random.randn(100)
Y = 5 + np.random.randn(100)
dag = pd.DataFrame({'X':X, 'Y':Y})

# make a pairplot of the data:

<a id="x-or-y-may-cause-one-or-the-other-indirectly-through-another-variable"></a>
### X or Y may cause one or the other indirectly through another variable.
![](./assets/images/x-c-z-y.png)

In [14]:
# Y is a function of Z, and Z is a function of X:
X = 5 + np.random.randn(100)
Z = X + 0.1*np.random.randn(100)
Y = 3 + Z + np.random.randn(100)

dag = pd.DataFrame({'X':X, 'Y':Y, 'Z':Z})

# make a pairplot of the data:

<a id="there-is-a-third-common-factor-that-causes-both-x-and-y"></a>
### There is a third common factor that causes both X and Y.
![](./assets/images/z-cause-xy.png)

In [15]:
# Both X and Y are functions of Z:
Z = np.random.randn(100)

X = 5 + 2*Z + np.random.randn(100)
Y = 3 + 3*Z + np.random.randn(100)
common_cause = pd.DataFrame({'X':X, 'Y':Y, 'Z':Z})

# make a pairplot of the data:

<a id="both-x-and-y-cause-a-third-variable-and-the-dataset-does-not-represent-that-third-variable-evenly"></a>
### Both X and Y cause a third variable and the dataset does not represent that third variable evenly.

![](./assets/images/xy-causez.png)

In [16]:
# Z is a function of X and Y:
X = 5 + np.random.randn(100)
Y = 3 + np.random.randn(100)
Z = X + Y + 0.1*np.random.randn(100)
common_effect = pd.DataFrame({'X':X, 'Y':Y, 'Z':Z})

# make a pairplot of the data:

Generally, recovering the causality structure from a correlation matrix is difficult or impossible. However, thinking through causal effects can give you a much better intuition of your variables and data.
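
The last case (a common effect that the dataset represents unevenly) is the least intuitive, so here is a hedged sketch of it. It reuses the same construction as the cell above, with a larger sample size and an arbitrary cutoff of `Z > 9` standing in for "uneven" data collection; both choices are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# X and Y are generated independently; Z is caused by both
X = 5 + rng.normal(size=n)
Y = 3 + rng.normal(size=n)
Z = X + Y + 0.1 * rng.normal(size=n)
common_effect = pd.DataFrame({'X': X, 'Y': Y, 'Z': Z})

# In the full data, X and Y are essentially uncorrelated
corr_all = common_effect['X'].corr(common_effect['Y'])

# Suppose our dataset only contains rows with a large Z (hypothetical cutoff)
selected = common_effect[common_effect['Z'] > 9]
corr_selected = selected['X'].corr(selected['Y'])

print(f"all rows:      {corr_all:.3f}")
print(f"only Z > 9:    {corr_selected:.3f}")
```

Within the selected rows, a small `X` must be paired with a large `Y` for `Z` to clear the cutoff, which produces a negative correlation even though neither variable causes the other.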

### What is a "confounder"?

Let’s say we did an analysis to understand what causes lung cancer. 

We find that people who carry cigarette lighters are 2.4 times as likely to contract lung cancer as people who don’t carry lighters.

Does this mean that the lighters are causing cancer?

As noted before, if lighters and cancer are both caused by smoking, there will be a correlation between lighters and cancer. This isn't the only possible diagram, but it makes the most sense.
![](./assets/images/smoke-lighter-cancer.png)

Conditioning on smoking, for example by looking only at non-smokers, removes the correlation between lighters and cancer if we believe the above structure.
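
A small simulation can illustrate this. All probabilities below are invented for the sketch and are not real epidemiological figures; the only structural assumption is the diagram above, where smoking influences both lighter-carrying and cancer, and lighters have no effect on cancer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical probabilities, chosen only to illustrate the structure
smoker = rng.random(n) < 0.3
lighter = rng.random(n) < np.where(smoker, 0.80, 0.05)   # smoking -> lighter
cancer = rng.random(n) < np.where(smoker, 0.15, 0.02)    # smoking -> cancer

df = pd.DataFrame({'smoker': smoker, 'lighter': lighter, 'cancer': cancer}).astype(int)

# Across everyone, lighters and cancer are correlated
corr_all = df['lighter'].corr(df['cancer'])

# Conditioning on smoking: look only at non-smokers
non_smokers = df[df['smoker'] == 0]
corr_non_smokers = non_smokers['lighter'].corr(non_smokers['cancer'])

print(f"everyone:     {corr_all:.3f}")
print(f"non-smokers:  {corr_non_smokers:.3f}")
```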

<a id="controlled-experiments"></a>
### Controlled Experiments

- The most foolproof way to measure an effect is to control all the confounders and to directly intervene and control our variable of interest. 
- This way we know that any correlation we find is not due to the confounders, but instead due to the variable we control. 
- This also means that all the effects we see are due to the variable we control.
- However, experiments are not always possible, and they usually take longer to set up than an analysis of existing observational data.
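
The sketch below contrasts observational data with a randomized experiment, which is one common way to "control" confounders without measuring them. The effect size, the confounder strength, and the variable names (`Z`, `T_obs`, `T_exp`) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
true_effect = 2.0            # hypothetical effect of the treatment on the outcome

Z = rng.normal(size=n)       # confounder that also drives the outcome

# Observational data: units with high Z are more likely to be treated
T_obs = (Z + rng.normal(size=n)) > 0
Y_obs = true_effect * T_obs + 3 * Z + rng.normal(size=n)
diff_obs = Y_obs[T_obs].mean() - Y_obs[~T_obs].mean()

# Experiment: we assign the treatment by coin flip, independent of Z
T_exp = rng.random(n) < 0.5
Y_exp = true_effect * T_exp + 3 * Z + rng.normal(size=n)
diff_exp = Y_exp[T_exp].mean() - Y_exp[~T_exp].mean()

print(f"observational difference in means: {diff_obs:.2f}")
print(f"experimental difference in means:  {diff_exp:.2f}")
```

In the observational data the difference in means mixes the treatment effect with the effect of `Z`; with random assignment, `Z` is balanced across the two groups and the difference lands close to `true_effect`.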

<a id="when-is-it-ok-to-rely-on-association"></a>
### When is it OK to rely on association?

- **When any intervention that arises from your model affects only the outcome variable.**
    - In other words, you only need to predict y.
    - This works because we only need to observe explanatory variables and implicitly know what the confounder is doing.
    - Decision making and intervention due to your model are a hidden danger that can shift confounders.
    - You can always retrain your model to work with a new set of confounders if they shift.


- **When correlation is causal**
    - If you are sure there are no confounding factors or selection bias, then that association might be a causation (risky)
    - It's OK to exclude confounders that have very unlikely or small effects
    - This is a saving grace. To have a good model you only need variables that correlate with your outcome.
        - Those variables merely need to meaningfully correlate with your outcome.

<a id="how-does-association-relate-to-causation"></a>
### How does association relate to causation?

Most commonly, we find an association between two variables.
- There is an observed correlation between the variables.
- There is an observed correlation in a subset of data.
- We find that the descriptive statistics significantly differ in two subsets of the data.

We may still not fully understand the causal direction (e.g. does smoking cause cancer or does cancer cause smoking?).
- A causes B, B causes A, or a third factor causes both.
    - In a DAG, A and B cannot both cause each other, because that would create a cycle!

We also might not understand other factors influencing the association.

Confounding variables often hide the true association between causes and outcomes.

A Directed Acyclic Graph (DAG) can help determine which variables are most important for your model.  It helps visually demonstrate the logic of your models.

A DAG always includes at least one exposure/predictor and one outcome.


### Codealong: Explore the associations in the advertising data

#### Visualize the relationship between the features and the response using scatterplots.

In [17]:
# visualize the relationship between the features and the response using scatterplots:

**Is there a relationship between ads and sales? Which type of ads?**

In [18]:
# Answer:

**Can we say this is a causal relationship?**

In [19]:
# Answer:

**What other questions might we want to know about this data?**

In [20]:
# Answer:

### Group Exercise: evaluate which type of ad is associated with higher sales.

Let's say we want to evaluate which type of ad is associated with higher sales.

1. Draw a basic DAG on your table or whiteboard.
- Think about other variables that may predict sales.
- Think about confounding causes.
- Think about downstream effects changing investment in advertising.
- Be ready to share an example.

### Section Summary

1) **The importance of having deep subject area knowledge.** You'll develop this over time and it will help you move through your analysis in a logical manner. However, keep in mind that you can show a strong association and still be totally wrong.

2) **A DAG (directed acyclic graph) can be a handy tool for thinking through the logic of your models.**

3) **The distinction between causation and correlation.** In our smoking example, it's relatively obvious that there's a flaw in our logic; however, this won't always be so readily apparent... especially in cutting edge fields where there are many other unknown variables.

4) **The importance of good data.** Throughout the class we will be working on helping you develop your data intuition, so that you can spot gaps and bias more readily. With this will come a bunch of tools to help you. However, your analysis is only as good as your understanding of the problem and the data.

<a id="sampling-bias"></a>
## Sampling bias
---

**Sampling bias** occurs when a sample is collected in such a way that some members of the intended population are more or less likely to be included than others.

This can happen when a sample is taken non-randomly, either implicitly or explicitly.

When non-random sampling results in sampling bias, it can affect the inferences or results of our analyses. We must be sure not to attribute our results to the process we observe when they could actually be due to non-random sampling.

This is conceptually straightforward: when we have sampling bias, we aren't measuring what we think we are measuring.

<a id="forms-of-sampling-bias"></a>
### Examples of sampling bias

- **Pre-screening:** Purposely restricting the sample to a specific group or region.
    - This typically happens when people try to study priority areas to save costs and assume priority areas are the same as random areas.
- **Self-selection:** When someone has the ability to non-randomly decide what gets included in our sample.
    - This typically happens in surveys and polls, but can also be an issue with other kinds of reporting.
- **Survivorship bias:** When we select only surviving subjects in a sample over time.
    - This might happen when we only look at existing customers and assume they have the same characteristics as new incoming customers.

<a id="problems-from-sampling-bias"></a>
### Problems that arise from sampling bias
- We will overestimate or underestimate means and sample statistics for simple characteristics.
- It's possible to have artificial correlation where there should be none.
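
To illustrate the first problem, the following sketch simulates a self-selected survey. The population of scores and the response rule (people with higher scores are more likely to respond) are assumptions invented for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population of satisfaction scores
population = rng.normal(loc=50, scale=10, size=100_000)

# A simple random sample of 1,000 people
random_sample = rng.choice(population, size=1_000, replace=False)

# Self-selection: the chance of responding rises with the score itself
p_respond = 1 / (1 + np.exp(-(population - 50) / 10))
responded = rng.random(population.size) < p_respond
biased_sample = population[responded]

print(f"population mean:    {population.mean():.2f}")
print(f"random sample mean: {random_sample.mean():.2f}")
print(f"self-selected mean: {biased_sample.mean():.2f}")
```

The self-selected sample is far larger than the random one, yet its mean is further from the truth: more data does not repair a biased selection process.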

<a id="recovering-from-sampling-bias"></a>
### Recovering from sampling bias
- Working out causal DAGs can help you identify when you need to watch out for sample bias
- Generally, it's best to prevent sample bias whenever possible
- We can't really do anything if we ENTIRELY exclude an important group of data
- However, if portions of our data are overrepresented or underrepresented, there are ways to correct the effect.
    - Typically, we explicitly model the selection process, which means we need data on factors that determine whether someone participates or not.

<a id="stratified-random-sampling"></a>
### Stratified random sampling

We discussed above that it is important to obtain a random sample of our population. However, sometimes it is more effective to apply some reasoning to our sampling process. By optimizing how we choose samples, we can possibly make a more accurate model using fewer samples.

- **Stratified random sampling** ensures we capture important population characteristics in the random sample. If we know that the population is half males and half females, for example, we can make sure that our sample is half male and half female. We effectively break the population into two "strata" (groups), then randomly sample from each group to obtain our overall sample. This method is similar to taking a weighted average and depends on knowing key population statistics.

    - For example, if we are collecting survey data, we might ensure our participants are evenly split between men and women.
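
One possible way to draw a stratified sample with pandas is sketched below. The population DataFrame, its `sex` and `score` columns, and the sample sizes are hypothetical; `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical population that is half male and half female
population = pd.DataFrame({
    'sex': ['male'] * 5_000 + ['female'] * 5_000,
    'score': rng.normal(size=10_000),
})

# Simple random sample: the male/female split varies from draw to draw
simple = population.sample(n=100, random_state=0)

# Stratified random sample: draw 50 at random from each stratum
stratified = population.groupby('sex').sample(n=50, random_state=0)

print(simple['sex'].value_counts())
print(stratified['sex'].value_counts())
```

If the strata were not equally sized in the population, passing `frac=` instead of `n=` would keep each stratum's share of the sample proportional to its share of the population.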

<a id="missing-data"></a>
## Missing data
---

Sometimes we are unable to collect every attribute for a particular observation. 

Unfortunately, this makes the observation unusable until we decide how to deal with it.

**We have to decide whether to:**
- Drop the observation
- Drop the attribute
- Impute a value for that specific attribute and observation
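
In pandas, these three options map onto a few standard calls. The sketch below uses a tiny made-up DataFrame, and the choice of mean and median as imputation values is only one simple possibility.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two columns
df = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'income': [50_000, 62_000, np.nan, 48_000],
    'city': ['DC', 'NY', 'DC', 'SF'],
})

# Option 1: drop every observation (row) that has a missing value
dropped_rows = df.dropna()

# Option 2: drop every attribute (column) that has a missing value
dropped_cols = df.dropna(axis=1)

# Option 3: impute, here with the column mean for age and the median for income
imputed = df.fillna({'age': df['age'].mean(), 'income': df['income'].median()})

print(dropped_rows.shape, dropped_cols.shape)
print(imputed)
```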

**How do we decide?**

<a id="types-of-missing-data"></a>
### Types of missing data
- **Missing completely at random (MCAR)**
    - The reason that the data is missing is completely random and introduces no sampling bias
    - In this case it's very safe to drop or impute
    - We can test for this by comparing the other attributes of the missing and non-missing groups to see if they match


- **Missing at random (MAR)**
    - The data is missing in a way that is related to another factor
    - This is a form of sampling bias
    - Like other sampling bias, we can fix this by modeling the selection process
        - This is done by building a model to impute the missing value based on other variables


- **Missing not at random (MNAR)**
    - The response is missing in a way that relates to its own value
    - We can't test for this
    - We also can't fix this in a reasonable way
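
The three mechanisms can be compared with a simulation. Everything here is synthetic: the `age` and `income` variables, the relationship between them, and the missingness probabilities are assumptions chosen to make the differences visible. The MAR case is then repaired by imputing from `age` with a simple linear fit, as one example of modeling the missing values from other variables.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

# Synthetic data: income depends on age (hypothetical relationship)
age = rng.normal(40, 10, size=n)
income = 1_000 * age + rng.normal(0, 5_000, size=n)
true_mean = income.mean()

# MCAR: every income has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.3
# MAR: missingness depends on age, which we always observe
mar_missing = rng.random(n) < np.where(age > 40, 0.5, 0.1)
# MNAR: missingness depends on the income value itself
mnar_missing = rng.random(n) < np.where(income > 40_000, 0.5, 0.1)

mcar_mean = income[~mcar_missing].mean()
mar_mean = income[~mar_missing].mean()
mnar_mean = income[~mnar_missing].mean()

# Repair the MAR case: fit income ~ age on observed rows, impute the rest
slope, intercept = np.polyfit(age[~mar_missing], income[~mar_missing], deg=1)
mar_imputed = np.where(mar_missing, slope * age + intercept, income)
mar_fixed_mean = mar_imputed.mean()

print(f"true {true_mean:.0f} | MCAR {mcar_mean:.0f} | MAR {mar_mean:.0f} "
      f"| MAR imputed {mar_fixed_mean:.0f} | MNAR {mnar_mean:.0f}")
```

The same repair is not available for the MNAR case, because the variable that drives the missingness is exactly the one we failed to observe.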

<a id="de-minimis"></a>
### De minimis
- If few enough observations are missing, it's not likely to change our results to a meaningful degree.
- In these cases, we don't have to bother with trivialities and can just pick a method that works well enough.

<a id="class-imbalance"></a>
### Class imbalance

Sometimes, one class may be heavily overrepresented in a sample. For example, airport security may have 990 X-ray scans showing the absence of a weapon. Due to natural scarcity, they may provide only 10 scans showing a weapon.

- If our goal is to create a model that indicates whether a weapon is present, then we are at a disadvantage. **Ignoring the class imbalance** would lead to a model that simply always guesses that a weapon is not present!
    - Note that most optimization procedures optimize for training data accuracy. Always guessing that a weapon is absent leads to 990/1000 correct, an accuracy of 99%.


- A simple way to get around this is to **undersample** the majority class, deliberately leaving us with a balanced dataset of 10 each. However, this is less than ideal since it effectively ignores much of the available data.

- We could alternatively **oversample** the minority class by duplicating examples. Again, this is not ideal. Because we have very little data, this will magnify small differences that may just be error, leading to a model that overfits.
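
The sketch below reproduces the 990/10 split on a made-up DataFrame (the `feature` column is random filler) and shows the always-guess-absent baseline along with one simple way to undersample and oversample using `DataFrame.sample`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Hypothetical scans: 990 without a weapon (0) and 10 with a weapon (1)
scans = pd.DataFrame({
    'feature': rng.normal(size=1_000),
    'weapon': [0] * 990 + [1] * 10,
})

# A "model" that always predicts 0 is right 99% of the time
baseline_accuracy = (scans['weapon'] == 0).mean()

majority = scans[scans['weapon'] == 0]
minority = scans[scans['weapon'] == 1]

# Undersample: keep only as many majority rows as there are minority rows
undersampled = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Oversample: draw minority rows with replacement until the classes match
oversampled = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)]
)

print(baseline_accuracy)
print(undersampled['weapon'].value_counts().to_dict())
print(oversampled['weapon'].value_counts().to_dict())
```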

Later in the course, we will look at additional methods for training models to get around class imbalance. For example, we may use an optimization algorithm that cares less about accuracy and more about minimizing particular types of errors.

<a id="relation-to-machine-learning"></a>
### Relation to machine learning

Many of the topics discussed in this lesson are used both in statistics and machine learning. However, some of the terminology differs. 

Throughout this lesson, we have discussed **variables** (typically **independent variables** and **dependent variables**). For example, we might be given the **linear estimator** $Y = mX + b$. We could say this contains two variables ($X$ - independent, and $Y$ - dependent (i.e. the prediction) since it depends on $X$), a coefficient $m$, and the constant $b$.

In machine learning, we typically rewrite this as a function, $\hat{y}(x) = mx + b$, and call it a **linear model**. The predicted value is $\hat{y}(x)$ (the hat denotes a prediction), which depends on $x$. We might call $x$ a **feature** rather than a variable.

> **Example:** Suppose the house price $P$ is linearly dependent on the square footage $S$. So, we might predict $P = cS + b$, where $c$ and $b$ are constants. Alternatively, we could write $\hat{p}(s) = cs + b$. Here, we took a complicated house and modeled it using a single feature, its square footage. Of course, we are likely missing many confounding variables/features that also affect the price! So, our model likely has a lot of error.
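
As a rough illustration of the house-price example, the following sketch generates synthetic prices and fits $\hat{p}(s) = cs + b$ with `np.polyfit`. The "true" constants (150 per square foot and a base of 50,000) and the noise level are invented; the noise term stands in for all the features the model leaves out.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic houses: square footage s and price p (hypothetical constants)
s = rng.uniform(500, 3_500, size=200)
p = 150 * s + 50_000 + rng.normal(0, 20_000, size=200)

# Fit a degree-1 polynomial: returns the slope c and the intercept b
c, b = np.polyfit(s, p, deg=1)

def p_hat(square_feet):
    """Predicted price from the single feature, square footage."""
    return c * square_feet + b

print(f"c = {c:.1f}, b = {b:.0f}")
print(f"predicted price for 2,000 sq ft: {p_hat(2_000):.0f}")
```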

<a id="introduction-to-hypothesis-testing"></a>
## Introduction to Hypothesis Testing
---

#### Objective: Test a hypothesis within a sample case study

You'll remember from last time that we worked on descriptive statistics such as mean and variance. How would we tell if there is a difference between our groups? How would we know if this difference was real or if our finding is simply due to chance?

For example, if we are working on sales data, how would we know if there was a difference between the buying patterns of men and women at Acme Inc? Hypothesis testing!

> **Note:** In this class, hypothesis testing is primarily used to assess foundational models such as linear and logistic regression.

### Hypothesis testing steps

Generally speaking, we start with a **null hypothesis** and an **alternative hypothesis**, which is the opposite of the null. Then, we check whether the data supports rejecting the null hypothesis or fails to reject it.

For example:

    Null hypothesis: There is no relationship between Gender and Sales.
    Alternative hypothesis: There is a relationship between Gender and Sales

Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. Your alternative hypothesis may indeed be true, but you don't necessarily have enough data to show that yet.

This distinction is important to help you avoid overstating your findings. You should only state what your data and analysis can truly represent.
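
One way to carry out these steps without any distributional formulas is a permutation test, sketched below on synthetic data. The two groups, their sizes, and their purchase amounts are assumptions made up for the example. The idea: if the null hypothesis is true, the group labels carry no information, so shuffling them shows how large a difference arises by chance alone.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical purchase amounts for two groups of customers
sales_men = rng.normal(100, 15, size=200)
sales_women = rng.normal(110, 15, size=200)

observed_diff = sales_women.mean() - sales_men.mean()

# Shuffle the pooled values and re-split them many times
pooled = np.concatenate([sales_men, sales_women])
n_perm = 2_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[200:].mean() - shuffled[:200].mean()

# Fraction of shuffles with a difference at least as large as the observed one
# (the +1 terms count the observed arrangement itself)
p_value = (np.sum(np.abs(perm_diffs) >= abs(observed_diff)) + 1) / (n_perm + 1)

print(f"observed difference: {observed_diff:.2f}")
print(f"fraction of shuffles at least as extreme: {p_value:.4f}")
```

A small fraction means the observed difference would be unusual if the null hypothesis were true, which is grounds for rejecting it. A large fraction means we fail to reject the null.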

<a id="validate-your-findings"></a>
### Validate your findings

##### How do we tell if the association we observed is statistically significant?

A result is called statistically significant when it would be unlikely to arise from random chance alone if the null hypothesis were true. Statistical hypothesis testing is traditionally employed to determine if a result is statistically significant or not.

We ask: **how likely is it that we would observe an effect at least this large, assuming the null hypothesis is true?** If there is less than a 5% chance of observing what we observed by chance (supposing the null hypothesis), then we reject the null hypothesis. Note that the 5% value is in many ways arbitrary -- many statisticians require even stricter thresholds, such as 1%.

The probability of observations at least as extreme as ours occurring by chance, given the null hypothesis, is the **p-value** $p$.

---

**Example:** Suppose you flip a coin three times and get three heads in a row. These three flips are our observations.

+ We want to know whether the coin is fair or not. So, we select the **null hypothesis: The coin is fair.**
+ Now, let's suppose the null hypothesis is true. Three heads in a row occurs with chance $1/2^3 = 12.5\%$.
+ Because there is a reasonable ($> 5\%$) chance of three heads occurring naturally, we do not reject the null hypothesis.
+ So, **we conclude:** we do not have enough data to tell whether the coin is fair or not, $p = 0.125$.

---


In other words, we say that something IS statistically significant if there is a less than 5% chance of seeing a finding like ours due to chance alone (assuming the null hypothesis is true).
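
The coin example above can be checked numerically. The sketch below is only an illustration: the helper name `prob_at_least_k_heads` and the extra five-flip case are assumptions added here, and the p-value is one-sided (it only counts runs of heads), matching the way the example computes $p = 0.125$.

```python
from math import comb

def prob_at_least_k_heads(k, n, p_heads=0.5):
    """Probability of k or more heads in n independent flips of a coin."""
    return sum(
        comb(n, i) * p_heads**i * (1 - p_heads) ** (n - i)
        for i in range(k, n + 1)
    )

alpha = 0.05  # the conventional 5% threshold

# p-values under the null hypothesis "the coin is fair"
p_three = prob_at_least_k_heads(3, 3)  # three heads in three flips
p_five = prob_at_least_k_heads(5, 5)   # hypothetical: five heads in five flips

for flips, p_value in [(3, p_three), (5, p_five)]:
    decision = "reject" if p_value < alpha else "fail to reject"
    print(f"{flips} heads in {flips} flips: p = {p_value:.5f} -> {decision} the null")
```

Three heads in a row is not enough evidence at the 5% level, while five heads in a row would be.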

<a id="confidence-intervals"></a>
### Confidence intervals

A closely related concept is **confidence intervals**. A 95% confidence interval can be interpreted as follows: if we **sampled the population 100 times** and built a 95% confidence interval from each sample, approximately **95 of those intervals** would contain the true population value.



Keep in mind that we only have a **single sample of data**, and not the **entire population of data**. The "true" effect/difference is either within this interval or it isn't, but there's no way to actually know. We estimate the difference with the data we do have, and we show uncertainty about that estimate by giving a range that the difference is **probably** within.

Note that using 95% confidence intervals is just a convention. You can create 90% confidence intervals (which are narrower, and so more liberal), 99% confidence intervals (which are wider, and so more conservative), or whatever intervals you like.
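
To make the repeated-sampling interpretation concrete, here is a small simulation sketch. Everything in it is an assumption chosen for illustration: the population parameters (`TRUE_MEAN`, `TRUE_SD`), the sample size, and the use of a normal-approximation interval for the mean. It counts how often the interval built from each sample contains the true mean, and compares the widths of a 90% and a 99% interval on one sample.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)

TRUE_MEAN, TRUE_SD = 10.0, 2.0     # hypothetical population parameters
N_SAMPLES, SAMPLE_SIZE = 1000, 50  # how many samples we draw, and their size

def normal_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `sample`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * stdev(sample) / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width

def draw_sample():
    return [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]

# Fraction of intervals that contain the true mean
hits = 0
for _ in range(N_SAMPLES):
    low, high = normal_ci(draw_sample(), 0.95)
    hits += low <= TRUE_MEAN <= high
coverage_95 = hits / N_SAMPLES
print(f"95% intervals containing the true mean: {coverage_95:.1%}")

# Interval width grows with the confidence level
one_sample = draw_sample()
low_90, high_90 = normal_ci(one_sample, 0.90)
low_99, high_99 = normal_ci(one_sample, 0.99)
print(f"90% width: {high_90 - low_90:.3f}, 99% width: {high_99 - low_99:.3f}")
```

In practice we only ever see one of these samples, which is why a single interval either contains the true value or does not.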


<a id="error-types"></a>
### Error types

Statisticians often classify errors not just as errors, but as two specific types of errors -- Type I and Type II.

+ **Type I Errors** are false positives.
    - Machine learning: Our model falsely predicts "positive". (The prediction is incorrect.)
    - Statistics: Incorrect rejection of a true null hypothesis.


+ **Type II Errors** are false negatives.
    - Machine learning: Our model falsely predicts "negative". (The prediction is incorrect.)
    - Statistics: Incorrectly retaining a false null hypothesis.


Understanding these errors can be very beneficial when designing models. For example, we might decide that Type I errors are okay but Type II errors are not okay. We can then optimize our model appropriately.
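
As a minimal sketch of how the two error types are counted for a classifier, the snippet below compares made-up labels with made-up predictions. The lists `actual` and `predicted` are hypothetical and exist only for illustration.

```python
# Hypothetical labels: 1 = "positive", 0 = "negative"
actual    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))

# Type I error: predicted positive when the truth is negative (false positive)
type_1 = sum(1 for a, p in pairs if a == 0 and p == 1)

# Type II error: predicted negative when the truth is positive (false negative)
type_2 = sum(1 for a, p in pairs if a == 1 and p == 0)

false_positive_rate = type_1 / actual.count(0)
false_negative_rate = type_2 / actual.count(1)

print(f"Type I errors:  {type_1} (false positive rate {false_positive_rate:.2f})")
print(f"Type II errors: {type_2} (false negative rate {false_negative_rate:.2f})")
```

Tuning a model to reduce one of these rates usually raises the other, so the choice depends on which error is more costly.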

> **Example:** Suppose we make a model for airline security where we predict whether a weapon is present ("positive"). In this case, we would much rather have Type I errors (falsely predict a weapon) than Type II errors (falsely predict no weapon).

> **Example:** Suppose we make a model for the criminal justice system, predicting whether a defendant is guilty ("positive"). In this case, we would much rather have Type II errors (falsely predict innocent) than Type I errors (falsely predict guilty).

Can you phrase these examples in terms of null hypotheses?

## Class Challenge: A/B Testing Hypothesis tests

<a id="scenario"></a>

---

### Scenario

You are a data science team working for a web-based company that is planning to roll out a new site design soon. Each user in a random sample was shown one of two competing designs, and that user's ultimate purchase total (if any) was recorded.

Your task is to determine which of the two designs yields higher total purchases, and whether the result is statistically significant.

In [21]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns

%matplotlib inline
np.random.seed(42)

In [22]:
# Generate some data and randomize

# Some people bought nothing, the others bought 
# with some distribution
data1 = [0] * 50
data1.extend(np.random.normal(14, 4, 150))
np.random.shuffle(data1)

# The second design hooked fewer people, 
# but those who were hooked bought more stuff
data2 = [0] * 100
data2.extend(np.random.normal(20, 5, 100))
np.random.shuffle(data2)

# Make a data frame
df = pd.DataFrame()
df["A"] = data1
df["B"] = data2

df.head()

|   | A | B |
|---|---|---|
| 0 | 14.685473 | 25.66671 |
| 1 | 20.152146 | 0.0 |
| 2 | 14.274252 | 18.370134 |
| 3 | 12.122102 | 26.632519 |
| 4 | 18.228489 | 25.862179 |


#### Plot out the distributions of group A and group B.

In [23]:
# Answer:

In [24]:
# Answer:

#### Make a boxplot and a violin plot of the two groups using seaborn.

In [25]:
# Plot the violin plot:

In [26]:
# Plot the boxplot:

**Are our datasets (approximately) normal? Use what we learned in the previous lesson to decide.**

In [27]:
# Plot the distributions for group A and B. Are they approximately normal?

<a id="statistical-tests"></a>
### Statistical Tests

There are a few good statistical tests for A/B testing:
* [ANOVA](https://en.wikipedia.org/wiki/Analysis_of_variance)
* [Welch's t-test](https://en.wikipedia.org/wiki/Welch's_t-test)
* [Mann-Whitney test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)

**Each test makes various assumptions:**
* ANOVA assumes the residuals are normally distributed and the groups have equal variances
* The Welch t-test assumes normal distributions but not necessarily equal variances, and accounts for small sample sizes better
* The Mann-Whitney test does not assume normality, but the normal approximation used for its p-value works best with at least 20 data points in each set, and it is somewhat less powerful than a t-test when the data really are normal

Typically you need to choose the most appropriate test. Tests that make more assumptions have more statistical power (they are more likely to detect a real difference) but can be misleading on data sets that don't satisfy the assumptions.

**Which test is most appropriate for our data?**

In [28]:
# Answer:

In statistics, **one-way analysis of variance** (abbreviated one-way **ANOVA**) is a technique used to compare the means of two or more samples (using the **F distribution**). The **ANOVA** tests the **null hypothesis** (the default position that there is no relationship) that samples in two or more groups are drawn from populations with the same mean values. Typically, however, the **one-way ANOVA** is used to test for differences among at least three groups, since the two-group case can be covered by a **t-test**. When there are only two means to compare, the equal-variance **t-test** and the **F-test** are equivalent.
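
The two-group equivalence can be checked numerically. This is a sketch on made-up data (the groups and the seed are assumptions), using `scipy.stats.ttest_ind` with `equal_var=True` and `scipy.stats.f_oneway`: with two groups, the F statistic equals the square of the pooled-variance t statistic, and the two p-values agree.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical groups with the same spread but different means
group_1 = rng.normal(10, 2, 40)
group_2 = rng.normal(11, 2, 40)

# Student's t-test (pooled variance) and one-way ANOVA on the same two groups
t_result = stats.ttest_ind(group_1, group_2, equal_var=True)
f_result = stats.f_oneway(group_1, group_2)

print(f"t^2 = {t_result.statistic ** 2:.6f}, F = {f_result.statistic:.6f}")
print(f"t-test p = {t_result.pvalue:.6f}, ANOVA p = {f_result.pvalue:.6f}")
```

Note that this equivalence holds for the equal-variance t-test, not for Welch's t-test.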

> **Note:** A one-way ANOVA tests the difference in population means based on one characteristic or factor.
>
> A two-way ANOVA tests comparisons between populations based on two characteristics or factors.

**Use the Mann-Whitney test on our data**

- Look up the function in SciPy using the link below:
    - https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
- `statistic` (float): The Mann-Whitney U statistic. In older SciPy versions this is equal to min(U for x, U for y) if `alternative` is `None` (deprecated; exists for backward compatibility), and U for y otherwise. Recent versions report U for the first sample, so check the documentation for your installed version.
- `pvalue` (float): The p-value assuming an asymptotic normal distribution. It is one-sided or two-sided, depending on the choice of `alternative`.

In [29]:
# Answer:

The Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test) is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.

Unlike the t-test it does not require the assumption of normal distributions. It is nearly as efficient as the t-test on normal distributions.
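
As a rough sketch of how these calls look in SciPy, the example below runs Welch's t-test and the Mann-Whitney U test on two made-up samples. The arrays `sample_x` and `sample_y` are hypothetical stand-ins, not the class data, and `alternative="two-sided"` is passed explicitly because the default has differed between SciPy versions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical samples: some zeros (no purchase) plus normally distributed totals
sample_x = np.concatenate([np.zeros(30), rng.normal(14, 4, 70)])
sample_y = np.concatenate([np.zeros(50), rng.normal(20, 5, 50)])

# Welch's t-test: compares means without assuming equal variances
welch = stats.ttest_ind(sample_x, sample_y, equal_var=False)

# Mann-Whitney U test: rank-based, no normality assumption
mwu = stats.mannwhitneyu(sample_x, sample_y, alternative="two-sided")

print(f"Welch t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
print(f"Mann-Whitney U = {mwu.statistic:.1f}, p = {mwu.pvalue:.4f}")
```

The two tests answer slightly different questions: Welch's t-test compares means, while Mann-Whitney compares whether values from one group tend to be larger than values from the other.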

<a id="interpret-your-results"></a>
### Interpret your results
* Is there a significant difference in the mean total purchases in the two designs?
* Which design do you recommend and why? 
* Write two sentences explaining your results and your recommendation.

In [30]:
# Answer: