# Homework 3: Correlation, Causal Modeling, and Regressions

* 3.1 Dissecting some data (including checking for independence)
* 3.2 Creating your own causal model
* 3.3 Statistical tests

## Setup

In [None]:
# The standard packages
import os
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib as mpl
import matplotlib.pyplot as plt

# Additional packages relevant for this HW
import scipy.stats as sps
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline

## Problem 3.1: Dissecting a causal model

Below is a graphical model for how recovery from an ailment depends on whether or not the afflicted person takes a given drug. The drug is known to act through two separate pathways: it affects the patient's heart rate and their immune response. The other major assumption of this model is that age is only relevant insofar as it affects a set of risk factors that affect (1) the patient's heart rate, (2) their immune response, (3) their recovery rate in general, and (4) whether or not they decide to take the drug. Note that the graph as coded below also includes a direct edge from age to the drug node.

This is not a controlled experiment but observational data collected over years of patients using the drug. Assume you have access to data for all the nodes in your model. 


In [None]:
from causalgraphicalmodels import CausalGraphicalModel

In [None]:
recovery_model = CausalGraphicalModel(
    nodes=["drug", "heart rate", "immune response", "recovery", "age", "risk factors"],
    edges=[
        ("drug", "immune response"),
        ("drug", "heart rate"),
        ("heart rate", "recovery"),
        ("immune response", "recovery"),
        ("age", "drug"),
        ("age", "risk factors"),
        ("risk factors", "drug"),
        ("risk factors", "immune response"),
        ("risk factors", "heart rate"),
        ("risk factors", "recovery"),

    ]
)

recovery_model.draw()

### 3.1.1 Is age really not directly causally related to recovery?

This model makes a strong assumption that age only affects recovery through other risk factors. What variables would you need to condition on to see that age and recovery do not have a direct causal link, as the model assumes?

**Your answer:**

???

In [None]:
# Check using the `is_d_separated` function. It returns True if the set of conditioning variables you enter is sufficient to make age and recovery independent.
recovery_model.is_d_separated("age", "recovery", {"risk factors", "???"})
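For intuition about what a d-separation check does, here is a minimal sketch in plain Python on a made-up toy graph. It uses the standard construction (keep only ancestors of the variables involved, "marry" parents that share a child, drop edge directions, then remove the conditioning set and test connectivity). The function names `ancestors_of` and `d_separated` and the toy nodes `a`, `b`, `c`, `d` are hypothetical, and this is not necessarily how `causalgraphicalmodels` implements `is_d_separated`.

```python
from itertools import combinations


def ancestors_of(nodes, parents):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(edges, x, y, given):
    """True if x and y are d-separated by `given` in the DAG described by `edges`."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)

    # 1. Keep only x, y, the conditioning set, and their ancestors.
    keep = ancestors_of({x, y} | set(given), parents)

    # 2. Moralize: link each node to its parents and link co-parents to each other.
    neighbors = {node: set() for node in keep}
    for child in keep:
        node_parents = parents.get(child, set())
        for parent in node_parents:
            neighbors[parent].add(child)
            neighbors[child].add(parent)
        for first, second in combinations(node_parents, 2):
            neighbors[first].add(second)
            neighbors[second].add(first)

    # 3. Search from x without passing through conditioned nodes.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in given:
            continue
        seen.add(node)
        stack.extend(neighbors[node])
    return True


# Toy graph: a chain a -> b -> c plus a collider a -> d <- c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "d"), ("c", "d")]

print(d_separated(toy_edges, "a", "c", set()))        # False: the chain is open
print(d_separated(toy_edges, "a", "c", {"b"}))        # True: b blocks the chain
print(d_separated(toy_edges, "a", "c", {"b", "d"}))   # False: conditioning on the collider d opens a path
```

The last call illustrates why adding more conditioning variables is not always safe: conditioning on a collider can create a dependence that was not there before.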

### 3.1.2 Is the causal relationship between drug and recovery identifiable from the data using the backdoor criterion?

According to our model, can we figure out $P(\textrm{recovery} | \textrm{do(drug)})$ from our data by applying the backdoor criterion? What variables would we need to use for backdoor correction? _Bonus: type out the equation you would use_
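To make the idea of backdoor correction concrete, here is a minimal sketch on a made-up toy dataset that is unrelated to the recovery model. The variables `z`, `x`, `y`, the probabilities used to simulate them, and the helper `adjusted_mean` are all assumptions chosen for illustration. A single binary confounder `z` influences both `x` and `y`, and the true effect of `x` on `y` is 0.3 by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

# Simulated toy data: z confounds the relationship between x and y.
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.8, 0.2)
y = rng.random(n) < (0.2 + 0.3 * x + 0.4 * z)
df = pd.DataFrame({"z": z, "x": x, "y": y})


def adjusted_mean(data, x_value):
    """Backdoor-adjusted P(y=1 | do(x=x_value)): sum over z of P(y=1 | x, z) * P(z)."""
    p_z = data["z"].value_counts(normalize=True)
    p_y_given_xz = data[data["x"] == x_value].groupby("z")["y"].mean()
    return float((p_y_given_xz * p_z).sum())


# Naive comparison ignores the confounder.
naive_effect = df.loc[df["x"], "y"].mean() - df.loc[~df["x"], "y"].mean()

# Adjusted comparison averages the stratum-specific rates over the marginal P(z).
adjusted_effect = adjusted_mean(df, True) - adjusted_mean(df, False)

print(f"naive difference:    {naive_effect:.3f}")
print(f"adjusted difference: {adjusted_effect:.3f}")
```

In this simulation the adjusted difference should land close to the built-in effect of 0.3, while the naive difference is noticeably larger because people with `z` true are both more likely to have `x` true and more likely to have `y` true.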



**Your answer**:

???

### 3.1.3 Is the causal relationship between drug and recovery identifiable from the data using the frontdoor criterion?

According to our model, can we figure out $P(\textrm{recovery} | \textrm{do(drug)})$ from our data by applying the frontdoor criterion? What variables would we need to use for frontdoor correction? _Bonus: type out the equation you would use_
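As a reminder of the general form (not the answer for this specific graph): if a set of mediators $M$ satisfies the frontdoor criterion relative to $X$ and $Y$, the textbook frontdoor adjustment reads as follows. The symbols $X$, $Y$, and $M$ are generic placeholders, not nodes of the model above, and whether the criterion is actually satisfied here is for you to decide.

$$
P(y \mid \textrm{do}(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m)\, P(x')
$$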


In [None]:
# Check using `get_all_backdoor_adjustment_sets`; it returns a `frozenset` of all the possible sets you could use for backdoor adjustment
recovery_model.get_all_backdoor_adjustment_sets("drug", "recovery")


## Problem 3.2: Creating your own causal model

In this question you'll use the `causalgraphicalmodels` package to generate your own causal graphical model, ideally one relevant to your research. It's even better if you create a graphical model relevant to the data you plan to use in your final project.

In [None]:
from causalgraphicalmodels import CausalGraphicalModel

### 3.2.1 Construct your model

Construct your model. It's OK to omit unobserved sources of error as long as they are uncorrelated. 


In [None]:
my_model = CausalGraphicalModel(
    nodes=[???],
    edges=[
        ???
    ]
)

my_model.draw()

### 3.2.2 Find the predicted independence relationships for your model

Determine what conditional independence relations your model predicts using the criteria we discussed in class.


**List them here**:

1. ???

Verify these are right using the `get_all_independence_relationships()` function attached to your model 

In [None]:
my_model.get_all_independence_relationships()
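A predicted conditional independence can also be checked against data. The following is a minimal sketch on simulated data, assuming a hypothetical chain `x -> z -> y` with linear Gaussian relationships. The variable names, coefficients, and the helper `residuals` are illustrative assumptions. It compares the raw correlation between `x` and `y` with their partial correlation after regressing out `z`. A near-zero partial correlation is only evidence of conditional independence under linear Gaussian assumptions; in general it is a weaker statement.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Simulated chain x -> z -> y: the graph predicts that x and y are independent given z.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)


def residuals(target, covariate):
    """Residuals of a straight-line least-squares fit of target on covariate."""
    slope, intercept = np.polyfit(covariate, target, 1)
    return target - (slope * covariate + intercept)


# Marginal association: x and y are correlated through z.
raw_corr = np.corrcoef(x, y)[0, 1]

# Partial correlation: correlate what is left of x and y after removing z.
partial_corr = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"corr(x, y)     = {raw_corr:.3f}")
print(f"corr(x, y | z) = {partial_corr:.3f}")
```

With real data you would replace the simulated arrays with columns from your dataset and repeat the check for each independence relation your graph predicts.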

### 3.2.3 Is the relationship you are interested in identifiable?  

Identify a pair of variables in your model whose relationship you are interested in. Describe whether the relationship between these variables, $P(Y|do(X))$, is _identifiable_ given your model. (You can use the frontdoor or backdoor criterion.)

*Bonus: Use one of these methods to write $ P(Y|do(X)) $ in terms of the observed probabilities.* 


**Which variables are you interested in**:

1. ???
2. ???

**Describe whether this relationship is identifiable given your model**:

???



**If it is identifiable, write the formula in terms of the data. If it isn't, say what experimental manipulations might change your graph to make it identifiable**:

???

## Problem 3.3: Classical Statistical Tests

For this question you need to reproduce the test results from Figure 4A, B, or C (pick one) at https://elifesciences.org/articles/07643/figures


### Problem 3.3.1: Download the data, upload to cluster, import as pandas dataframe

The data is in an Excel file under the figure. The pandas `read_excel` function has a keyword argument `sheet_name` that lets you specify which sheet you want.

In [None]:
fig4_data = pd.read_excel(???, sheet_name=???)

### Problem 3.3.2: Plot the comparison you are interested in

Do this any way you like; just remember to represent uncertainty somehow and label everything appropriately.
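One possible layout, shown here as a sketch on synthetic data, plots the raw points for each group together with the group mean and its standard error. The data frame `demo`, its column names `group` and `value`, and the group labels are made up for illustration and do not correspond to the columns in the figure's Excel file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# Synthetic stand-in for two experimental groups (20 samples each).
demo = pd.DataFrame({
    "group": np.repeat(["control", "treated"], 20),
    "value": np.concatenate([rng.normal(10, 2, 20), rng.normal(12, 2, 20)]),
})

# Per-group mean, standard error of the mean, and sample size.
summary = demo.groupby("group")["value"].agg(["mean", "sem", "count"])
positions = np.arange(len(summary))

fig, ax = plt.subplots(figsize=(4, 3))

# Raw points with a little horizontal jitter so they do not overlap.
for pos, (name, values) in zip(positions, demo.groupby("group")["value"]):
    jitter = rng.uniform(-0.08, 0.08, size=len(values))
    ax.scatter(pos + jitter, values, alpha=0.4, color="gray")

# Group means with SEM error bars.
ax.errorbar(positions, summary["mean"], yerr=summary["sem"],
            fmt="o", color="black", capsize=4)

ax.set_xticks(positions)
ax.set_xticklabels(summary.index)
ax.set_xlabel("Group")
ax.set_ylabel("Measured value (arbitrary units)")
ax.set_title("Mean +/- SEM with raw points (synthetic data)")
```

Showing the raw points alongside the summary makes the sample size and spread visible at a glance; a confidence interval or standard deviation could be used in place of the SEM, as long as the figure states which one is drawn.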

### Problem 3.3.3: Test for normality, equal variances, and then run the appropriate t-test
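The sequence of checks can be sketched as follows on synthetic data. The arrays `control` and `treated`, the 0.05 threshold, and the choice of Shapiro-Wilk and Levene tests are assumptions for illustration; the paper may have used different tests or a paired design, so check the figure legend before settling on one.

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two groups being compared.
control = rng.normal(loc=10.0, scale=2.0, size=30)
treated = rng.normal(loc=12.0, scale=2.0, size=30)

alpha = 0.05

# 1. Normality: Shapiro-Wilk on each group (null hypothesis: the data are normal).
normal_ok = all(sps.shapiro(group).pvalue > alpha for group in (control, treated))

# 2. Equal variances: Levene's test (null hypothesis: the variances are equal).
equal_var = bool(sps.levene(control, treated).pvalue > alpha)

# 3. Pick the test that matches what the checks suggest.
if normal_ok:
    result = sps.ttest_ind(control, treated, equal_var=equal_var)
    test_name = "Student t-test" if equal_var else "Welch t-test"
else:
    # Rank-based fallback when normality looks doubtful.
    result = sps.mannwhitneyu(control, treated, alternative="two-sided")
    test_name = "Mann-Whitney U test"

print(f"{test_name}: statistic = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

Failing to reject normality or equal variances does not prove either property, especially with small samples, so it is worth looking at the plotted data as well as the p-values.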

In [36]:
???



Did you reproduce their results?

**Your answer here**:

???