Homework Reflection 5

1. Draw a diagram for the following negative feedback loop:

Sweating causes body temperature to decrease.  High body temperature causes sweating.

A negative feedback loop means that one thing increases another while the second thing decreases the first.

Remember that we are using directed acyclic graphs where two things cannot directly cause each other.

2. Describe an example of a positive feedback loop.  This means that one thing increases another while the second thing also increases the first.

3. Draw a diagram for the following situation:

Lightning storms frighten away deer and bears, decreasing their population, and cause flowers to grow, increasing their population.
Bears eat deer, decreasing their population.
Deer eat flowers, decreasing their population.

Write a dataset that simulates this situation.  (Show the code.) Include noise / randomness in all cases.

Identify a backdoor path with one or more confounders for the relationship between deer and flowers.

4. Draw a diagram for a situation of your own invention.  The diagram should include at least four nodes, one confounder, and one collider.  Be sure that it is acyclic (no loops).  Which node would you say is most like a treatment (X)?  Which is most like an outcome (Y)?


**Reflection 5 Answer**

1. See attachment

2. An example of a positive feedback loop would be population growth. As more people or animals in a population give birth, the population increases, and with a higher population, there are more individuals giving birth, continuing the cycle of population growth.

3. See attachment and code below. One backdoor path runs through lightning, which has a positive effect on flowers and a negative effect on deer. Lightning is therefore a confounder on the path deer <-- lightning --> flowers.

4. See attachment. In the example, income is the confounder, as it causes both the treatment (diet/exercise, X) and the outcome (health, Y). The collider is health, as diet and exercise both have an effect on it.
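
The positive feedback loop in answer 2 can be sketched as a small time-indexed simulation. Indexing by time also keeps the graph acyclic: population at step t affects births at step t, which affect population at step t+1. The birth rate and starting population below are assumptions chosen for illustration, not values from the assignment.

```python
# Hypothetical parameters (not from the assignment): birth rate and starting size.
birth_rate = 0.1
population = [100.0]

for t in range(10):
    births = birth_rate * population[-1]        # population_t -> births_t
    population.append(population[-1] + births)  # births_t -> population_{t+1}

# Growth per step gets larger as the population gets larger.
growth_per_step = [b - a for a, b in zip(population, population[1:])]
print([round(p, 1) for p in population])
```

Each step adds more individuals than the previous one, which is the signature of a positive feedback loop.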

In [61]:
import pandas as pd
import numpy as np

np.random.seed(42)

# Number of datapoints
n = 100

# Simulate lightning  
lightning = np.random.normal(loc=0, scale=1, size=n)

# Simulate flower population: lightning increases growth
flowers = 30 + 2 * lightning + np.random.normal(0, 1, n)

# Simulate deer population: lightning decreases growth
deer = 25 - 3 * lightning + np.random.normal(0, 1, n)  

# Simulate bear population: lightning decreases growth 
bears = 10 - 2 * lightning + np.random.normal(0, 1, n)  

# Adjust deer to account for bear predation
deer -= 2 * bears 

# Adjust flowers to account for deer grazing
flowers -= 2 * deer 

df_sim = pd.DataFrame({
    'Lightning': lightning,
    'Bears': bears,
    'Deer': deer,
    'Flowers': flowers
})

df_sim.head()

   Lightning      Bears      Deer    Flowers
0   0.496714   8.177577  7.512492  14.553074
1  -0.138264   9.716348  6.542882  16.217061
2   0.647689   9.451917  5.236153  20.480357
3   1.523030   7.564311  6.356091  19.531600
4  -0.234153  10.447405  3.429980  22.510447
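
The backdoor path from answer 3 (deer <-- lightning --> flowers) can be checked numerically. The sketch below regenerates data with the same structural coefficients as the simulation above, but with a larger sample and a different random generator (both assumptions made so that the estimates are stable), and compares the deer coefficient with and without adjusting for lightning. The helper `ols_slopes` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # larger than the n = 100 used above, so the estimates are stable

# Same structural equations as the simulation above
lightning = rng.normal(0, 1, n)
bears = 10 - 2 * lightning + rng.normal(0, 1, n)
deer = 25 - 3 * lightning - 2 * bears + rng.normal(0, 1, n)
flowers = 30 + 2 * lightning - 2 * deer + rng.normal(0, 1, n)

def ols_slopes(y, *regressors):
    """Least-squares slopes of y on the given regressors (intercept included)."""
    A = np.column_stack((np.ones(len(y)),) + regressors)
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coefs[1:]

naive_deer = ols_slopes(flowers, deer)[0]                # backdoor path left open
adjusted_deer = ols_slopes(flowers, deer, lightning)[0]  # backdoor path blocked

print(f"deer coefficient, unadjusted:            {naive_deer:.3f}")
print(f"deer coefficient, adjusted for lightning: {adjusted_deer:.3f}")
```

With these coefficients the true effect of deer on flowers is -2. The unadjusted slope should come out closer to -1.67, because lightning raises flowers directly and, through its effect on bears, is positively associated with deer overall. Adjusting for lightning also blocks the longer path deer <-- bears <-- lightning --> flowers.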


Homework Reflection 6

1. What is a potential problem with computing the Marginal Treatment Effect simply by comparing each untreated item to its counterfactual and taking the maximum difference?  (Hint: think of statistics here.  Consider that only the most extreme item ends up being used to estimate the MTE.  That's not necessarily a bad thing; the MTE is supposed to come from the untreated item that will produce the maximum effect.  But there is nevertheless a problem.)
Possible answer: We are likely to find the item with the most extreme difference, which may be high simply due to randomness.
(Please explain / justify this answer, or give a different one if you can think of one.)

2. Propose a solution that remedies this problem and write some code that implements your solution.  It's very important here that you clearly explain what your solution will do.
Possible answer: maybe we could take the 90th percentile of the treatment effect and use it as a proxy for the Marginal Treatment Effect.
(Either code this answer or choose a different one.)

**Reflection 6 Answer**

1. The Marginal Treatment Effect (MTE) is meant to reflect the treatment effect for the next unit to be treated, that is, the untreated unit that would benefit the most. As noted, it could be approximated by comparing each untreated item to its counterfactual and taking the maximum difference. The problem is that each estimated treatment effect includes noise, so some differences are large because of noise rather than the actual treatment effect. Taking the maximum selects for exactly those cases: the more items are compared, the more likely it is that the largest difference reflects an extreme noise draw. The result is an overstated estimate of the effect.

2. Using the 90th percentile still focuses on large treatment effects while avoiding the maximum's sensitivity to random noise, because no single extreme item determines the estimate. In the code below, synthetic data is generated with random treatment assignment and a regression model is fit to it. For each untreated item, the outcome under treatment is predicted, and the observed outcome is subtracted from it to get the treatment effect. The 90th percentile of these differences is then used as the MTE.

In [48]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(42)

# Generate data
num = 1000
X = np.random.uniform(0, 1, num)
y = np.random.normal(10, 5, num) + X * 5

# Randomly assign treatment
treated = np.random.binomial(1, 0.5, num)

df = pd.DataFrame({
    'X': X,
    'y': y,
    'treated': treated
})

# Fit regression
X_model = sm.add_constant(df[['X', 'treated']], has_constant='add')
y_model = df['y']
model = sm.OLS(y_model, X_model).fit()

# Compute counterfactual
df_untreated = df[df['treated'] == 0].copy()
df_untreated['treated'] = 1 

X_cf = sm.add_constant(df_untreated[['X', 'treated']], has_constant='add')
X_cf = X_cf[model.params.index]  

# Predict counterfactual
df_untreated['pred_if_treated'] = model.predict(X_cf)

# Treatment effect for each untreated item: predicted outcome if treated minus observed outcome
df_untreated['treatment_effect'] = df_untreated['pred_if_treated'] - df_untreated['y']

# Use the 90th percentile of these effects as the MTE estimate
mte_90th = np.percentile(df_untreated['treatment_effect'], 90)
print(f"Estimated Marginal Treatment Effect (90th percentile): {mte_90th:.2f}")

Estimated Marginal Treatment Effect (90th percentile): 6.44
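
The selection problem described in answer 1 can be illustrated with a separate, simplified simulation. This is a sketch under stated assumptions: every unit is given the same hypothetical true effect (`true_effect`), and each estimated effect is that value plus independent normal noise (`noise_sd`). Neither value comes from the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 2.0   # hypothetical: every unit has the same true effect
noise_sd = 5.0      # hypothetical noise in each estimated effect
n_repeats = 200     # repeat each experiment to average out luck

mean_max, mean_p90 = {}, {}
for n_units in (100, 1_000, 10_000):
    # One row per repeated experiment, one column per untreated unit
    estimates = true_effect + rng.normal(0, noise_sd, size=(n_repeats, n_units))
    mean_max[n_units] = estimates.max(axis=1).mean()
    mean_p90[n_units] = np.percentile(estimates, 90, axis=1).mean()
    print(f"n={n_units:>6}: mean max = {mean_max[n_units]:.2f}, "
          f"mean 90th pct = {mean_p90[n_units]:.2f}")
```

The maximum keeps growing as more units are compared, even though the true effect never changes, while the 90th percentile settles at a stable value. Neither statistic recovers the true effect of 2.0 in this setup; the percentile is simply not driven by the single most extreme noise draw.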


Homework Reflection 7

1. Create a linear regression model involving a confounder that is left out of the model.  Show whether the true correlation between X and Y is overestimated, underestimated, or neither.  Explain in words why this is the case for the given coefficients you have chosen.

2. Perform a linear regression analysis in which one of the coefficients is zero, e.g.

W = [noise]

X = [noise]

Y = 2 * X + [noise]

And compute the p-value of a coefficient - in this case, the coefficient of W.  
(This is the probability that the estimated coefficient would be at least as far from zero as it is, given that the actual coefficient is zero.)
If the p-value is less than 0.05, this ordinarily means that we judge the coefficient to be nonzero (incorrectly, in this case.)
Run the analysis 1000 times and report the best (smallest) p-value.  
If the p-value is less than 0.05, does this mean the coefficient actually is nonzero?  What is the problem with repeating the analysis?

**Reflection 7 Answer**

In [11]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(42)

# Synthetic data
n = 1000
Z = np.random.normal(0, 1, n)
epsilon_X = np.random.normal(0, 1, n)
epsilon_Y = np.random.normal(0, 1, n)

X = 0.6 * Z + epsilon_X
Y = 3 * X + 5 * Z + epsilon_Y

df = pd.DataFrame({'X': X, 'Z': Z, 'Y': Y})

X_full = sm.add_constant(df[['X', 'Z']])
model_full = sm.OLS(df['Y'], X_full).fit()

# Confounder left out of model
X_omit = sm.add_constant(df[['X']])
model_omit = sm.OLS(df['Y'], X_omit).fit()

# Extract coefficients
coef_full = model_full.params
coef_omit = model_omit.params

coef_full, coef_omit

(const    0.006134
 X        2.989823
 Z        5.027912
 dtype: float64,
 const   -0.068472
 X        5.073939
 dtype: float64)

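As seen above, the true X coefficient is 3 and the full model estimates it as approximately 3.0, but omitting the confounder gives an X coefficient of about 5.1. From the X and Y equations, Z has a positive effect on both X and Y. When Z is omitted, the part of Z's effect on Y that is correlated with X is falsely attributed to X, so the X coefficient is overestimated. The expected size of the bias is 5 * Cov(X, Z) / Var(X) = 5 * 0.6 / 1.36, or about 2.2, which is consistent with the observed gap of about 2.1.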
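
The size of the overestimate can be checked against the standard omitted-variable bias formula, bias = (coefficient of Z in Y) * Cov(X, Z) / Var(X). The sketch below computes the expected bias from the coefficients used above and compares it with a much larger simulated sample. The larger sample size and the different random generator are assumptions made so that the check is stable.

```python
import numpy as np

# Coefficients taken from the simulation above
beta_x, beta_z = 3.0, 5.0   # Y = 3 * X + 5 * Z + noise
alpha_z = 0.6               # X = 0.6 * Z + noise

cov_xz = alpha_z * 1.0            # Cov(X, Z) = 0.6 * Var(Z), with Var(Z) = 1
var_x = alpha_z**2 * 1.0 + 1.0    # Var(X) = 0.36 * Var(Z) + Var(noise) = 1.36
expected_bias = beta_z * cov_xz / var_x
expected_slope = beta_x + expected_bias

# Check against a large simulated sample, regressing Y on X alone
rng = np.random.default_rng(0)
n = 200_000
Z = rng.normal(0, 1, n)
X = alpha_z * Z + rng.normal(0, 1, n)
Y = beta_x * X + beta_z * Z + rng.normal(0, 1, n)
slope_omit = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

print(f"expected bias:        {expected_bias:.3f}")
print(f"expected slope:       {expected_slope:.3f}")
print(f"simulated slope (omit Z): {slope_omit:.3f}")
```

Because both 0.6 and 5 are positive, the bias is positive. Flipping the sign of either coefficient would turn the overestimate into an underestimate.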

In [14]:
np.random.seed(42)

p_values = []

for _ in range(1000):
    n = 100
    W = np.random.normal(0, 1, n)
    X = np.random.normal(0, 1, n)
    noise = np.random.normal(0, 1, n)
    Y = 2 * X + noise  

    # Regress Y on both W and X
    df = sm.add_constant(np.column_stack((W, X)))
    model = sm.OLS(Y, df).fit()
    
    # Store the p-value for W 
    p_values.append(model.pvalues[1])

# Minimum p-value
min_p_value = np.min(p_values)
min_p_value

np.float64(0.0004745704124689169)

Since the smallest p-value observed across the 1000 iterations is less than 0.05, the initial thought would be that the coefficient is nonzero. However, the data are generated so that W has no effect on Y, so we know the true W coefficient is zero. The problem with repeating the analysis 1000 times is that, when the null hypothesis is true, each test still has a 5% chance of producing a p-value below 0.05 just by chance, so about 50 of the 1000 runs are expected to do so. Selecting the smallest p-value out of these iterations all but guarantees finding one of these instances and falsely rejecting the null hypothesis.
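
The scale of the problem can be made concrete by counting false positives across the repeated runs. The sketch below is an illustration rather than a rerun of the cell above: it computes the OLS p-value for W by hand with NumPy and SciPy instead of statsmodels and uses a different random generator, so the individual p-values will differ. The Bonferroni threshold at the end is one common correction, added here as an assumption about how the problem might be handled. The helper `p_value_for_w` is a hypothetical name.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_runs, n, alpha = 1000, 100, 0.05

def p_value_for_w(rng, n):
    """Two-sided p-value for the W coefficient when Y does not depend on W."""
    W = rng.normal(size=n)
    X = rng.normal(size=n)
    Y = 2 * X + rng.normal(size=n)
    A = np.column_stack((np.ones(n), W, X))           # intercept, W, X
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ beta
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof                      # residual variance
    cov_beta = sigma2 * np.linalg.inv(A.T @ A)        # coefficient covariance
    t_stat = beta[1] / np.sqrt(cov_beta[1, 1])
    return 2 * stats.t.sf(abs(t_stat), dof)

p_values = np.array([p_value_for_w(rng, n) for _ in range(n_runs)])

n_false_positives = int((p_values < alpha).sum())
fwer = 1 - (1 - alpha) ** n_runs      # chance of at least one p < alpha
bonferroni_alpha = alpha / n_runs     # per-test threshold after correction

print(f"runs with p < {alpha}: {n_false_positives} of {n_runs}")
print(f"P(at least one p < {alpha}) = {fwer:.6f}")
print(f"Bonferroni threshold = {bonferroni_alpha}")
```

Roughly 5% of the runs fall below 0.05 even though the true coefficient is zero. The minimum p-value reported above, about 0.00047, is larger than the Bonferroni threshold of 0.00005, so it would not count as significant once the 1000 repetitions are taken into account.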

Homework Reflection 8

Include the code you used to solve the two coding quiz problems and write about the obstacles / challenges / insights you encountered while solving them.

Some of the insights gathered included that using Mahalanobis distance instead of Euclidean distance matched the items based on multivariate similarity. In this case, it accounted for the different scales of Z1 and Z2 and the correlation between them, giving more accurate counterfactuals. Another insight was that propensity score estimation is quite sensitive: even with just a single Z, logistic regression produced a wide range of scores. A challenge was refreshing myself on matrix manipulation in Python, which was used in the latter half of the problems. Functions like np.cov and the np.linalg module were helpful to revisit.
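
The difference between the two distance measures can be seen in a toy example. The covariance matrix and points below are hypothetical values chosen for illustration, not the homework data: Z1 is assumed to vary ten times as much as Z2, so a gap of 5 in Z1 is small relative to its spread while a gap of 3 in Z2 is large.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical covariance: Z1 has standard deviation 10, Z2 has standard deviation 1
cov = np.array([[100.0, 0.0],
                [0.0, 1.0]])
inv_cov = np.linalg.inv(cov)

treated_point = np.array([[0.0, 0.0]])
candidates = np.array([[5.0, 0.0],   # far in raw units, but only 0.5 std of Z1
                       [0.0, 3.0]])  # close in raw units, but 3 std of Z2

euclidean = cdist(treated_point, candidates, metric='euclidean')[0]
mahalanobis = cdist(treated_point, candidates, metric='mahalanobis', VI=inv_cov)[0]

print(euclidean, euclidean.argmin())      # distances [5. 3.], picks candidate 1
print(mahalanobis, mahalanobis.argmin())  # distances [0.5 3.], picks candidate 0
```

Euclidean distance picks the second candidate and Mahalanobis distance picks the first. A nonzero off-diagonal entry in the covariance matrix would adjust the distances for correlation between Z1 and Z2 in the same way.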

**Reflection 8 Answer**

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import numpy as np
from scipy.spatial.distance import cdist

In [2]:
df = pd.read_csv('homework_8.1.csv')
df.head()

   Unnamed: 0  X         Y         Z
0           0  1  4.109218  1.764052
1           1  0  2.259504  0.400157
2           2  0 -0.647584  0.978738
3           3  0  2.106071  2.240893
4           4  1  3.583464  1.867558


In [3]:
X_covariate = df[['Z']]
treatment = df['X']

log_reg = LogisticRegression()
log_reg.fit(X_covariate, treatment)

# Predict propensity scores
propensity_scores = log_reg.predict_proba(X_covariate)[:, 1]

# Inverse probability weights
weights = np.where(df['X'] == 1, 1 / propensity_scores, 1 / (1 - propensity_scores))

# Calculate weighted means
weighted_y_treated = np.sum((df['Y'] * weights)[df['X'] == 1]) / np.sum(weights[df['X'] == 1])
weighted_y_untreated = np.sum((df['Y'] * weights)[df['X'] == 0]) / np.sum(weights[df['X'] == 0])

# ATE
ate_ipw = weighted_y_treated - weighted_y_untreated
ate_ipw

np.float64(2.2743411898510133)

In [4]:
# Display the first three propensity scores
propensity_scores[:3]

array([0.84011371, 0.58464597, 0.71108245])

In [9]:
df2 = pd.read_csv('homework_8.2.csv')

treated = df2[df2['X'] == 1].reset_index(drop=True)
untreated = df2[df2['X'] == 0].reset_index(drop=True)

# Extract covariates
Z_all = df2[['Z1', 'Z2']].values.T
cov_matrix = np.cov(Z_all)
inv_cov_matrix = np.linalg.inv(cov_matrix)

treated_vectors = treated[['Z1', 'Z2']].values
untreated_vectors = untreated[['Z1', 'Z2']].values

# Compute all Mahalanobis distances
all_distances = cdist(treated_vectors, untreated_vectors, metric='mahalanobis', VI=inv_cov_matrix)

# Index of the nearest untreated unit for each treated unit
nearest_indices = np.argmin(all_distances, axis=1)

# Use indices to get matched outcomes from untreated group
matched_outcomes = untreated.iloc[nearest_indices]['Y'].values

# Average treated-minus-matched difference (averaged over treated units only, so strictly an ATT)
ate_mahalanobis = (treated['Y'].values - matched_outcomes).mean()
ate_mahalanobis

np.float64(3.437678997912609)

In [7]:
treated_vectors = treated[['Z1', 'Z2']].values
untreated_vectors = untreated[['Z1', 'Z2']].values

# Compute Mahalanobis distances for all treated and untreated pairs
all_distances = cdist(treated_vectors, untreated_vectors, metric='mahalanobis', VI=inv_cov_matrix)

# Find the treated item whose nearest untreated match is farthest away (the worst match)
min_distances = all_distances.min(axis=1)
worst_match_index_no_lambda = np.argmax(min_distances)

# Get the Z1 and Z2 values of that treated item
nearest_z_values = treated.iloc[worst_match_index_no_lambda][['Z1', 'Z2']]
nearest_z_values.tolist()

[2.6962240525635797, 0.5381554886023228]