(activity7)=
# Activity 7: Positivity in Observational Studies

**2025-02-27**

---

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets


In [45]:
# let's load the data
learning_df = pd.read_csv("~/COMSC-341CD/data/learning_mindset.csv")

This selected portion of the National Study of Learning Mindsets dataset is not truly randomized, so we'll need to adjust for confounding.

The columns we will look at are:

- `intervention`: whether the student received the intervention (1) or not (0)
- `success_expect`: student Likert scale response to a fixed mindset survey question prior to the intervention: "You have a certain amount of intelligence, and you really can’t do much to change it" (1, strongly disagree; 7, strongly agree)
- `frst_in_family`: whether the student would be the first in their family to attend college (1) or not (0)
- `gender`: student's self-reported gender
- `school_urbanicity`: categorical variable corresponding to the urbanicity of the school the student attends, e.g. urban, suburban, rural

# Part 1 

In [89]:
covariates = ['success_expect', 'frst_in_family']

Let's first take `success_expect` and `frst_in_family` as our `covariates`. These seem like reasonable potential confounders to control for. If we take the same strategy as we did with the kidney stone dataset, we'll need to bin on the confounders and compute treatment effects for each bin.

However, as we've seen today, we also need to be careful about positivity violations. First, let's compute the total number of bins we need to create if we want to control for these two covariates.

We can do this by using [pd.Series.nunique](https://pandas.pydata.org/docs/reference/api/pandas.Series.nunique.html) to get the number of unique values for each covariate and then multiplying them together. This is like taking a cross product over all possible values of each variable.
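As a minimal sketch of this counting idea on made-up data (the `toy_df` frame and its columns `a` and `b` are hypothetical and not part of the learning mindset dataset), the number of possible bins is the product of the per-column unique counts:

```python
import pandas as pd

# Hypothetical toy data: `a` takes 3 distinct values, `b` takes 2
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 1, 2],
    "b": ["x", "y", "x", "x", "y"],
})

# Multiply the number of unique values of each column together
toy_total_bins = 1
for col in ["a", "b"]:
    toy_total_bins *= toy_df[col].nunique()

print(toy_total_bins)  # 3 * 2 = 6 possible (a, b) combinations
```

Note that only 3 of those 6 combinations actually occur in the toy rows: the product counts bins that are possible, not bins that are observed.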

In [None]:
# TODO calculate the total number of bins 
total_bins = 0

print(f"Total number of bins: {total_bins}")   

# Part 2

Next, let's see if there are any positivity violations. We can do this by grouping over the covariates plus the intervention, and then counting the number of unique groups that are actually present in the data.

To generate the per-bin counts, we perform a `groupby(all_cols, as_index=False)` over the intervention and the covariate columns, and then check the `ngroups` attribute of the resulting groupby object. How many groups are there?
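To illustrate what `ngroups` reports, here is a small sketch on made-up data (the frame `toy_df`, the treatment column `t`, and the covariate `x` are hypothetical names chosen for illustration). With ordinary non-categorical columns, only combinations that actually appear in the rows are counted:

```python
import pandas as pd

# Hypothetical toy data with a binary treatment `t` and one covariate `x`
toy_df = pd.DataFrame({
    "t": [0, 1, 0, 1, 0],
    "x": ["lo", "lo", "hi", "lo", "hi"],
})

# ngroups counts the (t, x) combinations that occur in the data
toy_group_count = toy_df.groupby(["t", "x"], as_index=False).ngroups

print(toy_group_count)  # 3: the combination (t=1, x="hi") never occurs
```

Here there are 2 x 2 = 4 possible combinations but only 3 observed ones, so one treatment arm is missing for some value of `x`.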

In [None]:
# Group by the intervention column and the two covariates
all_cols = []
group_count = learning_df.groupby(all_cols, as_index=False).ngroups

print(f"Number of actual groups among the bins for {all_cols}: {group_count}")

Since we need each bin to have both control and treatment units in order to have a valid comparison, the total number of groups should be equal to **2 times the total number of bins possible** for there to be no positivity violations.
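One alternative way to look at the same requirement, sketched here on made-up data, is to ask for each observed bin how many distinct treatment values it contains. The names `toy_df`, `t`, and `x` are hypothetical, and this check is only an illustration, not the method the activity asks for. It also cannot see bins that have no rows at all, since those never show up in a groupby.

```python
import pandas as pd

# Hypothetical toy data with a binary treatment `t` and one covariate `x`
toy_df = pd.DataFrame({
    "t": [0, 1, 0, 1, 0],
    "x": ["lo", "lo", "hi", "lo", "hi"],
})

# For each observed bin of x, count how many distinct treatment values appear
arms_per_bin = toy_df.groupby("x")["t"].nunique()

# Bins with fewer than 2 arms contain only control or only treatment units
bins_missing_an_arm = arms_per_bin[arms_per_bin < 2].index.tolist()

print(bins_missing_an_arm)  # ['hi']: this bin has control units only
```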

Does the number of groups you found in part 2 match this?

**Your response**: TODO

# Part 3

Ideally, we'd like to control for as many confounders as possible. Let's now add `gender` and `school_urbanicity` to our list of covariates, making a total of 4 confounders.

Repeat the analysis above with the new set of covariates. Do we see positivity violations with the new set of covariates?

**Your response**: TODO


In [None]:
# TODO your code here for part 3

**Optional extra**: if we actually want to see the bins that are missing, we can generate a [pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) of the counts, and then identify the bins that are missing in either the control or treatment group.

In [None]:
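# Count rows per (intervention, covariates) group; with as_index=False the counts land in a `size` column
# Assumes all_cols holds 'intervention' plus the four covariates from part 3
count_df = learning_df.groupby(all_cols, as_index=False).size()

# Create a pivot table to show counts by intervention and bins
bin_pivot = pd.pivot_table(
    count_df, 
    index=['success_expect', 'gender', 'frst_in_family', 'school_urbanicity'],
    columns=['intervention'],
    values='size',  # pivot the per-group counts, not the intervention flag itself
    aggfunc='sum',
    fill_value=0
)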

# Display information about the pivot table
print("Bins with no control units:")
display(bin_pivot[bin_pivot[0] == 0])

print("Bins with no treatment units:")
display(bin_pivot[bin_pivot[1] == 0])

# References

- Yeager, D. S. et al. (2019). A national experiment reveals where a growth mindset improves achievement. Nature.
- Athey, S., & Wager, S. (2019). Estimating treatment effects with causal forests: An application. Observational Studies.
- Facure, M. (2023). Causal Inference for the Brave and the True.