<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Matching" data-toc-modified-id="Matching-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Matching</a></span></li><li><span><a href="#Propensity-Score" data-toc-modified-id="Propensity-Score-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Propensity Score</a></span></li><li><span><a href="#Inverse-Probability-of-Treatment-Weighting" data-toc-modified-id="Inverse-Probability-of-Treatment-Weighting-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Inverse Probability of Treatment Weighting</a></span></li><li><span><a href="#McNemar’s-Test" data-toc-modified-id="McNemar’s-Test-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>McNemar’s Test</a></span></li><li><span><a href="#Reference" data-toc-modified-id="Reference-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Reference</a></span></li></ul></div>

Correlation doesn't imply causation, and even if there is a causal relationship, its direction might be unclear.

- Are physically active people more likely to prioritize living near green spaces? Because we like to exercise, we plan to move near the park.
- Does green space in an urban environment cause people to exercise more? If there was a park like this near where we live, we would exercise more.

- Potential Outcome: outcome we would see under each possible treatment option.
- Counterfactual Outcome: outcome that would have been observed had the treatment been different, i.e. what would have happened (contrary to what actually happened).

We denote $Y^a$ as the outcome that would have been observed if treatment was set to $A = a$.

Before the treatment decision is made, any outcome is a potential outcome. After the study, there is an observed outcome and a counterfactual outcome.

A typical statement that people make in the real world follows a pattern like this:

> I took ibuprofen and my headache is gone, therefore the medicine worked.

But this statement only tells us that $Y^1=1$. It doesn't tell us what would have happened had we not taken ibuprofen, $Y^0=?$. We can only state that there is a causal effect if $Y^1 \neq Y^0$.

This is known as the **fundamental problem of causal inference**: we can only observe one potential outcome for each person. However, with certain assumptions, we can estimate population level causal effects. In other words, it is possible for us to answer questions such as: what would the rate of headache remission be if everyone took ibuprofen when they had a headache versus if no one did?
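
To make the distinction between potential and observed outcomes concrete, here is a minimal simulation sketch. The remission probabilities (0.4 untreated, 0.7 treated) and all variable names are assumptions made up for illustration. In a simulation we can generate both potential outcomes for every person, which is never possible with real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical potential outcomes: 1 = headache gone, 0 = headache remains.
y0 = rng.binomial(1, 0.4, size=n)  # outcome each person would have without ibuprofen
y1 = rng.binomial(1, 0.7, size=n)  # outcome each person would have with ibuprofen

# Treatment is assigned at random; only the matching potential outcome is observed.
a = rng.binomial(1, 0.5, size=n)
y_obs = np.where(a == 1, y1, y0)

# Population level effect. It needs both potential outcomes, so it is
# only computable inside a simulation.
true_effect = y1.mean() - y0.mean()

# Estimate from observed data alone, valid here because treatment was randomized.
estimated_effect = y_obs[a == 1].mean() - y_obs[a == 0].mean()

print(f"true effect: {true_effect:.3f}, estimated: {estimated_effect:.3f}")
```

For any single person only one of `y0[i]` and `y1[i]` ends up in `y_obs`, yet the population level difference is still recoverable because treatment assignment was random.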

Thus the next question is: how do we use observed data to link observed outcomes to potential outcomes? This requires the following assumptions:

- Positivity Assumption: For every set of values for $X$, treatment assignment was not deterministic: $P(A=a|X=x) > 0$ for all $a$ and $x$. This assumption ensures that we have some data at every level of $X$ for both treated and untreated people. In cases where people with certain diseases are ineligible for a particular treatment, we wouldn't want to make inferences about that population, so we would probably exclude them from the study.
- Stable Unit Treatment Value Assumption (SUTVA): Treatment assignment of one unit does not affect the outcome of another unit, i.e. no interference between units.
- Ignorability Assumption: Given pre-treatment covariates $X$, treatment assignment is independent of the potential outcomes. $Y^1, Y^0 \perp A | X$ (here, $\perp$ denotes independence). So among people with the same values of $X$, we can essentially think of treatment as being randomly assigned. This is also referred to as the "no unmeasured confounders" assumption.
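- Consistency Assumption: The potential outcome under treatment $A=a$ is equal to the observed outcome if the treatment actually received is $A=a$, i.e. $Y = Y^a$ if $A = a$.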

We're mainly interested in the relationship between means of different potential outcomes.

\begin{align}
E(Y^1 - Y^0)
\end{align}

Confounding Variables: Defined as variables that affect both the treatment and the outcome.

e.g. If older people are at higher risk of cardiovascular disease (the outcome), and are more likely to receive statins (the treatment), then age is a confounder.
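
The sketch below simulates the statin example to show how a confounder can distort a naive comparison. All numbers are assumptions chosen for illustration: being old raises event risk by 0.30, treatment lowers it by 0.05, and old people are treated 80% of the time versus 20% for young people.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical confounder: old (1) or young (0).
old = rng.binomial(1, 0.5, size=n)

# Older people are more likely to receive the treatment.
a = rng.binomial(1, np.where(old == 1, 0.8, 0.2))

# Event risk rises with age and falls slightly with treatment (true effect: -0.05).
p_event = 0.10 + 0.30 * old - 0.05 * a
y = rng.binomial(1, p_event)

# Naive comparison ignores age, so it mixes the age effect into the estimate.
naive = y[a == 1].mean() - y[a == 0].mean()

# Compare within each age stratum, then average over the age distribution.
adjusted = 0.0
for level in (0, 1):
    in_stratum = old == level
    diff = y[in_stratum & (a == 1)].mean() - y[in_stratum & (a == 0)].mean()
    adjusted += diff * in_stratum.mean()

print(f"naive: {naive:.3f}, age-adjusted: {adjusted:.3f}")
```

The naive difference comes out positive, suggesting the treatment is harmful, because treated people are mostly old. Comparing within age groups recovers an estimate close to the simulated effect of -0.05.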

Confounding Control:

- We're interested in identifying a set of variables $X$ that will make the ignorability assumption hold. If we can do this, then that set of variables is sufficient to control for confounding, and we can use statistical methods to control for these variables and estimate causal effects.

A DAG (Directed Acyclic Graph) encodes assumptions about dependencies between nodes/variables.

A DAG will tell us:

- which variables are independent from each other
- which variables are conditionally independent from each other
- ways we can simplify the joint distribution

To identify a set of variables that is sufficient to control for confounding, we can use the backdoor path criterion. The set must:

- **block all backdoor paths** from treatment to outcome.
- not include any descendants of treatment.
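
The two conditions can be checked mechanically on a small graph. The following is a pure Python sketch, not a library API: the function names, the edge-list representation, and the two example DAGs are all assumptions made for illustration. A path counts as blocked if it contains a non-collider that is in the adjustment set, or a collider such that neither it nor any of its descendants is in the set.

```python
def descendants(edges, node):
    """All nodes reachable from `node` by following arrows."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    found, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def simple_paths(edges, start, end):
    """All simple paths between start and end, ignoring arrow direction."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    paths = []

    def walk(path):
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in neighbors.get(path[-1], ()):
            if nxt not in path:
                walk(path + [nxt])

    walk([start])
    return paths


def is_blocked(edges, path, z):
    """True if conditioning on the set z blocks the given path."""
    edge_set = set(edges)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edge_set and (nxt, node) in edge_set
        if collider:
            # A collider blocks unless it, or one of its descendants, is in z.
            if node not in z and not (descendants(edges, node) & z):
                return True
        elif node in z:
            return True
    return False


def satisfies_backdoor(edges, treatment, outcome, z):
    z = set(z)
    if z & descendants(edges, treatment):
        return False  # the set must not contain descendants of treatment
    edge_set = set(edges)
    # Backdoor paths are those that start with an arrow pointing into treatment.
    backdoor = [p for p in simple_paths(edges, treatment, outcome)
                if (p[1], p[0]) in edge_set]
    return all(is_blocked(edges, p, z) for p in backdoor)


# Hypothetical DAG 1: X confounds A and Y, and M mediates the effect of A on Y.
dag1 = [("X", "A"), ("X", "Y"), ("A", "M"), ("M", "Y")]
# Hypothetical DAG 2: C is a collider on the path A <- U1 -> C <- U2 -> Y.
dag2 = [("U1", "A"), ("U1", "C"), ("U2", "C"), ("U2", "Y"), ("A", "Y")]

print(satisfies_backdoor(dag1, "A", "Y", {"X"}))   # controls the confounder
print(satisfies_backdoor(dag1, "A", "Y", {"M"}))   # M is a descendant of A
print(satisfies_backdoor(dag2, "A", "Y", {"C"}))   # conditioning on a collider
```

In the second graph the empty set already satisfies the criterion, while adding the collider `C` opens the backdoor path, so controlling for more variables is not always safer.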

## Matching

Matching is a method that attempts to make an observational study more like a randomized trial. The main idea is to match individuals in the treated group $A=1$ to individuals in the control group $A=0$ on the covariates $X$.

In a randomized trial, for any particular age, there should be about the same number of treated and untreated people. In an observational study where older people are more likely to get $A=1$, if we match treated people to control people of the same age, there will be about the same number of treated and controls at any age. Once the data are matched, we can treat them as if they came from a randomized trial.

Matching can help reveal lack of overlap in covariate distribution.

We usually can't match exactly on the full set of covariates, so instead we try to make sure the distribution of covariates is balanced between the groups, also referred to as stochastic balance (the distribution of confounders being similar for treated and untreated subjects). Because controls are chosen to resemble the treated subjects, the resulting estimate is close in spirit to the causal effect of treatment on the treated.


Since exact matches are rarely available, we first need to choose a metric of closeness, e.g.

- Mahalanobis Distance
- Robust Mahalanobis Distance. The motivation is that outliers can create a large distance between subjects, even if the covariates are otherwise similar. Hence the ranks of the covariates might be more relevant, i.e. the highest and second highest values of a covariate should perhaps be treated as similar, even if the values are far apart.
    - Replace each covariate value with its rank.
    - Use a covariance matrix with a constant diagonal.
    - Compute the Mahalanobis distance on the ranks.
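
A rough sketch of rank-based Mahalanobis matching is shown below. The helper names, the simulated covariates, and the greedy nearest-neighbor pairing are assumptions for illustration. Greedy matching is only one simple strategy and is not prescribed by the notes above. Ties are broken by position, so every column of ranks is a permutation of 1..n and the covariance diagonal is constant automatically.

```python
import numpy as np


def rank_columns(x):
    """Replace each covariate value with its rank within its column."""
    return x.argsort(axis=0).argsort(axis=0).astype(float) + 1.0


def mahalanobis_matrix(x_treated, x_control, cov):
    """Pairwise Mahalanobis distances between treated rows and control rows."""
    cov_inv = np.linalg.inv(cov)
    diff = x_treated[:, None, :] - x_control[None, :, :]
    return np.sqrt(np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff))


def greedy_match(dist):
    """Pair each treated unit with its nearest unused control, in row order."""
    available = set(range(dist.shape[1]))
    pairs = []
    for i in range(dist.shape[0]):
        j = min(available, key=lambda c: dist[i, c])
        pairs.append((i, j))
        available.remove(j)
    return pairs


rng = np.random.default_rng(2)
x = rng.normal(size=(30, 2))        # hypothetical covariates for 30 subjects
x[0] = [25.0, -25.0]                # an extreme outlier
a = np.array([1] * 10 + [0] * 20)   # first 10 subjects are treated

ranks = rank_columns(x)
cov = np.cov(ranks, rowvar=False)   # diagonal entries are equal by construction
dist = mahalanobis_matrix(ranks[a == 1], ranks[a == 0], cov)
pairs = greedy_match(dist)          # (treated index, control index) pairs
```

After ranking, the outlier in the first row is simply the largest value in one column and the smallest in the other, so it no longer dominates the distances.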
    

We assess balance after the match using the standardized mean difference. A standardized difference is the difference in means between groups, divided by the pooled standard deviation.

\begin{align}
\frac{\bar{X}_{treatment} - \bar{X}_{control}}{\sqrt{\frac{s^2_{treatment} + s^2_{control}}{2}}}
\end{align}

- It is a measure that does not depend on sample size.
- Often the absolute value is reported.
- We calculate it for each variable that we matched on.
- Rule of thumb:
    - values < 0.1 indicate adequate balance.
    - values of 0.1-0.2 are not too alarming.
    - values > 0.2 indicate serious imbalance.
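
The standardized mean difference is short enough to compute directly. Below is a minimal sketch for a single covariate. The function name and the toy ages are made up for illustration, and the sample variance (`ddof=1`) is an assumption about which variance estimate to use.

```python
import numpy as np


def standardized_mean_difference(x, a):
    """Difference in means between groups divided by the pooled standard deviation."""
    x, a = np.asarray(x, dtype=float), np.asarray(a)
    treated, control = x[a == 1], x[a == 0]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd


# Toy example: ages of three treated and three control subjects.
age = np.array([60.0, 62.0, 64.0, 61.0, 63.0, 65.0])
a = np.array([1, 1, 1, 0, 0, 0])

smd = standardized_mean_difference(age, a)
print(abs(smd))  # 0.5, which the rule of thumb flags as serious imbalance
```

In practice the function would be applied to every matched covariate, and the absolute values compared against the thresholds in the rule of thumb.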

Assume that we've already matched our data, checked for balance, and are happy with the result. We're now ready to analyze the outcome data.

One approach is a permutation test. If there is a large treatment effect, the treatment labels are meaningful, so we expect the observed difference in mean outcome between matched pairs to be very different from the differences we get after randomly permuting the treatment labels. If there were no treatment effect, randomly switching the labels shouldn't matter.
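
The label-switching idea can be sketched for matched pairs as follows. Swapping the treatment labels within a pair flips the sign of that pair's outcome difference, so the permutation distribution is obtained by applying random signs to the paired differences. The function name, the simulated outcomes, and the two-sided p-value are assumptions for illustration.

```python
import numpy as np


def paired_permutation_test(y_treated, y_control, n_permutations=10_000, seed=0):
    """Permutation test for the mean outcome difference across matched pairs."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(y_treated, dtype=float) - np.asarray(y_control, dtype=float)
    observed = diffs.mean()

    # Each row is one random relabeling: -1 swaps the labels within a pair.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted = (signs * diffs).mean(axis=1)

    # Share of relabelings with a mean difference at least as extreme as observed.
    p_value = np.mean(np.abs(permuted) >= abs(observed))
    return observed, p_value


# Hypothetical matched pairs with a simulated treatment effect of 1.0.
rng = np.random.default_rng(3)
y_control = rng.normal(size=50)
y_treated = y_control + 1.0 + rng.normal(scale=0.5, size=50)

observed, p_value = paired_permutation_test(y_treated, y_control)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value means that very few random relabelings produce a difference as large as the one observed, which is what we expect when the labels carry real information.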

## Propensity Score

The propensity score is the probability of receiving treatment, rather than control, given covariates $X$.

We'll define $A=1$ for treatment and $A=0$ for control. The propensity score for subject $i$ is denoted as $\pi_i$.

\begin{align}
\pi_i = P(A=1 | X_i)
\end{align}

As an example, if a person has a propensity score of 0.3, then given their particular covariates, there was a 30% chance that they would receive the treatment.
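
In observational data the propensity score is unknown and has to be estimated. Logistic regression of $A$ on $X$ is a common choice, although the notes above do not prescribe a model. The sketch below fits one with Newton's method in plain NumPy so that it stays self-contained. In practice a statistics library would normally be used. The function names, the single standardized covariate, and the simulated coefficients (-0.5 and 1.0) are assumptions.

```python
import numpy as np


def fit_logistic(x, a, n_iter=25):
    """Fit logistic regression by Newton's method; returns coefficients, intercept first."""
    design = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (a - p)
        hessian = design.T @ (design * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hessian, gradient)
    return beta


def propensity_scores(x, beta):
    """Estimated P(A=1 | X) for each row of x."""
    design = np.column_stack([np.ones(len(x)), x])
    return 1.0 / (1.0 + np.exp(-design @ beta))


# Hypothetical data: one standardized covariate that drives treatment assignment.
rng = np.random.default_rng(4)
age = rng.normal(size=20_000)
true_pi = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * age)))
a = rng.binomial(1, true_pi)

beta = fit_logistic(age, a)
pi_hat = propensity_scores(age, beta)
print(beta)  # should be close to the simulated coefficients
```

Each entry of `pi_hat` is the estimated probability that the corresponding subject receives treatment given their covariate value.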

## Inverse Probability of Treatment Weighting

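IPTW weights each subject by the inverse of the probability of the treatment they actually received: $1/\pi_i$ for treated subjects and $1/(1-\pi_i)$ for controls. In the original population, some people were more likely to get treated than others, based on their $X$ values. In the resulting pseudo-population, everyone is equally likely to be treated, regardless of their $X$ values.

Someone who was likely to be treated given their covariates, but wasn't, will have a large weight, because such a person is underrepresented among the controls.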


- IPTW estimation works because it creates an unconfounded pseudo-population.
- Marginal structural models are models of the mean of the potential outcome as a function of possible values of treatment.
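
A small simulation can illustrate the weighting. The sketch below uses weights of $1/\pi_i$ for treated subjects and $1/(1-\pi_i)$ for controls, then compares weighted outcome means. The data, the function name, and the use of the true (rather than estimated) propensity score are assumptions made to keep the example short. The simulated treatment effect is 1.0.

```python
import numpy as np


def iptw_effect(y, a, propensity):
    """Weighted difference in mean outcome between treated and control subjects."""
    y, a, propensity = (np.asarray(v, dtype=float) for v in (y, a, propensity))
    weights = a / propensity + (1 - a) / (1 - propensity)
    mean_treated = np.sum(weights * a * y) / np.sum(weights * a)
    mean_control = np.sum(weights * (1 - a) * y) / np.sum(weights * (1 - a))
    return mean_treated - mean_control, weights


# Hypothetical confounded data: old people are treated more often and have
# higher outcomes regardless of treatment.
rng = np.random.default_rng(5)
n = 50_000
old = rng.binomial(1, 0.5, size=n)
propensity = np.where(old == 1, 0.8, 0.2)
a = rng.binomial(1, propensity)
y = 2.0 * old + 1.0 * a + rng.normal(size=n)

naive = y[a == 1].mean() - y[a == 0].mean()
effect, weights = iptw_effect(y, a, propensity)
print(f"naive: {naive:.2f}, IPTW: {effect:.2f}, largest weight: {weights.max():.1f}")
```

The largest weights belong to old subjects who were not treated and to young subjects who were, since those are the people whose observed treatment was unlikely given their covariates.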

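Doubly robust estimation, also known as augmented inverse probability of treatment weighting (AIPTW), combines a propensity score model with an outcome model. The estimator remains consistent if either of the two models is correctly specified.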

## McNemar’s Test

- https://machinelearningmastery.com/mcnemars-test-for-machine-learning/
- https://www.theanalysisfactor.com/difference-between-chi-square-test-and-mcnemar-test/
- https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html
- http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.evaluate/#mcnemar

## Reference

- http://www.degeneratestate.org/posts/2018/Mar/24/causal-inference-with-python-part-1-potential-outcomes/
- https://github.com/ijmbarr/notes-on-causal-inference
- http://laurence-wong.com/software/
- https://github.com/laurencium/Causalinference
- https://github.com/jrfiedler/causal_inference_python_code
- https://github.com/vlaskinvlad/coursera-causality-crash-course
- http://www.rebeccabarter.com/blog/2017-07-05-ip-weighting/