---
title: "Counterfactual Data Balancing"
format:
    html: 
        toc: true
        code-fold: false
        embed-resources: true
---

# Introduction and Motivation
In this project, I aim to create a balanced dataset to facilitate supervised learning for predicting factors associated with exoneration. The core idea behind this counterfactual balancing is to construct a dataset that includes both exonerated individuals and a comparable set of non-exonerated individuals, representative of the incarcerated population in Illinois. This allows for fair and meaningful comparisons in modeling exoneration outcomes.

## Why Counterfactual Data?
Counterfactual data is critical when direct access to complete prison population databases is unavailable. Since I don’t have access to a comprehensive dataset of all incarcerated individuals in Illinois and their exoneration statuses, I use counterfactuals to create a synthetic, balanced dataset.

A counterfactual statement allows us to explore “what if” scenarios, comparing two outcomes that differ in one key aspect. For example, a counterfactual asks: What if this person had not been exonerated? Would their characteristics resemble those of non-exonerated individuals?

As explained in this <a href="https://bayes.cs.ucla.edu/PRIMER/primer-ch4.pdf" target="_blank">primer on counterfactuals and their applications</a>, “This kind of statement—an ‘if’ statement in which the ‘if’ portion is untrue or unrealized—is known as a counterfactual. The ‘if’ portion of a counterfactual is called the hypothetical condition, or more often, the antecedent.” Counterfactual data allows us to compare outcomes under identical conditions that differ only in the hypothetical condition (here, exoneration versus non-exoneration). This approach helps mitigate biases and gives the models balanced inputs for training.

### Acknowledgments
The implementation of this counterfactual data balancing relied heavily on guidance and code contributions from <a href="https://gufaculty360.georgetown.edu/s/contact/003Hp00002jMlEDIA0/jeffrey-jacobs" target="_blank">Professor Jeff Jacobs</a>, which shaped both the methodology and the implementation of the counterfactual sampling process.

# Implementation

In [1]:
from io import StringIO
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

## Scraping & Cleaning Rows for Exoneree Data
Raw HTML from [Prison Policy Initiative](https://www.prisonpolicy.org/racialgeography/counties.html)

In [5]:
html_url = "https://raw.githubusercontent.com/jpowerj/dsan-content/refs/heads/main/2024-fall-dsan5000/exoneration/counties.html"

result = requests.get(html_url)
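# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(result.text, "html.parser")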

table_elt = soup.find("table")

table_sio = StringIO(str(table_elt))
county_df = pd.read_html(table_sio)[0]

county_df.columns = [c.replace("","").replace("","").strip() for c in county_df.columns]

il_df = county_df[county_df['State'] == "Illinois"].copy()
il_df.head(3)


Unnamed: 0,County,State,Total Population,Total White Population,Total Black Population,Total Latino Population,Incarcerated Population,Incarcerated White Population,Incarcerated Black Population,Incarcerated Latino Population,Non-incarcerated Population,Non-incarcerated White Population,Non-Incarcerated Black Population,Non-Incarcerated Latino Population,Ratio of Overrepresentation of Whites Incarcerated Compared to Whites Non-Incarcerated,Ratio of Overrepresentation of Blacks Incarcerated Compared to Blacks Non-Incarcerated,Ratio of Overrepresentation of Latinos Incarcerated Compared to Latinos Non-Incarcerated
595,Adams County,Illinois,67103,62414,2331,776,110,73,36,0,66993,62341,2295,776,0.71,9.54,0.0
596,Alexander County,Illinois,8238,4983,2915,155,411,89,242,79,7827,4894,2673,76,0.35,1.72,19.82
597,Bond County,Illinois,17768,15797,1080,547,1542,500,657,304,16226,15297,423,243,0.34,16.32,13.14


## Narrowing to Incarcerated Population 
We're interested specifically in simulating "draws" from the incarcerated population of Illinois. So, we select the relevant subset of columns here, renaming them to be a bit shorter while we're at it (two birds, one stone):

In [6]:
rename_map = {
    'County': 'County',
    'State': 'State',
    'Incarcerated Population': 'Total',
    'Incarcerated White Population': 'White',
    'Incarcerated Black Population': 'Black',
    'Incarcerated Latino Population': 'Latino',
}
# Keep only the cols in the rename_map
cols_to_keep = list(rename_map.keys())
il_df = il_df[cols_to_keep].copy()
# And do the renaming
il_df.rename(columns=rename_map, inplace=True)
il_df.head()

Unnamed: 0,County,State,Total,White,Black,Latino
595,Adams County,Illinois,110,73,36,0
596,Alexander County,Illinois,411,89,242,79
597,Bond County,Illinois,1542,500,657,304
598,Boone County,Illinois,71,38,12,21
599,Brown County,Illinois,2059,419,1267,367


In [7]:
# Since the Exoneree project uses just the county name (like "Cook"), we'll remove the trailing " County" (so, e.g., "Cook County" will turn into just "Cook"):
il_df['County'] = il_df['County'].str.replace(" County","")

# And, we can compute a state_prop column representing the % of all Illinois inmates contained in each county:
il_df['state_prop'] = il_df['Total'] / il_df['Total'].sum()
il_df.sort_values(by='state_prop', ascending=False).head()


Unnamed: 0,County,State,Total,White,Black,Latino,state_prop
610,Cook,Illinois,11649,1769,8369,1468,0.164469
693,Will,Illinois,3902,811,2528,538,0.055091
673,Randolph,Illinois,3571,934,2250,377,0.050418
648,Logan,Illinois,3060,963,1705,389,0.043203
647,Livingston,Illinois,2798,905,1577,294,0.039504


So we see that our sample should be about 16.4% from Cook County, 5.5% from Will, 5% from Randolph, and so on.

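Note that state_prop is the share of the Illinois incarcerated population held in each county, not the sampled proportion that we compute below, so the two should not be confused. Next, we check how the three race counts compare with each county's total: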

In [8]:
# Since PPI only tracks three racial groups, the sum of the three race counts will not necessarily equal the total incarcerated population. Let's check:
il_df['three_cat_total'] = il_df['Black'] + il_df['White'] + il_df['Latino']
il_df.head()

Unnamed: 0,County,State,Total,White,Black,Latino,state_prop,three_cat_total
595,Adams,Illinois,110,73,36,0,0.001553,109
596,Alexander,Illinois,411,89,242,79,0.005803,410
597,Bond,Illinois,1542,500,657,304,0.021771,1461
598,Boone,Illinois,71,38,12,21,0.001002,71
599,Brown,Illinois,2059,419,1267,367,0.02907,2053


Since we need our sample to be fully representative of the county-by-county distributions, we use the difference between Total and three_cat_total to construct an Other category:

In [9]:
il_df['Other'] = il_df['Total'] - il_df['three_cat_total']
il_df.head()

Unnamed: 0,County,State,Total,White,Black,Latino,state_prop,three_cat_total,Other
595,Adams,Illinois,110,73,36,0,0.001553,109,1
596,Alexander,Illinois,411,89,242,79,0.005803,410,1
597,Bond,Illinois,1542,500,657,304,0.021771,1461,81
598,Boone,Illinois,71,38,12,21,0.001002,71,0
599,Brown,Illinois,2059,419,1267,367,0.02907,2053,6


It's not all that well-documented in the source for this data, but I suspect some counties count a person more than once if they report more than one race. I say that because some of the three_cat_total values are actually higher than the overall county totals:

In [10]:
il_df[il_df['three_cat_total'] > il_df['Total']]

Unnamed: 0,County,State,Total,White,Black,Latino,state_prop,three_cat_total,Other
608,Clinton,Illinois,1599,486,917,199,0.022576,1602,-3
611,Crawford,Illinois,1230,310,782,141,0.017366,1233,-3
620,Fayette,Illinois,1527,467,933,129,0.021559,1529,-2
635,Jefferson,Illinois,1857,827,812,224,0.026218,1863,-6
645,Lawrence,Illinois,2358,486,1490,393,0.033292,2369,-11
654,Madison,Illinois,14,0,11,14,0.000198,25,-11
655,Marion,Illinois,114,69,37,10,0.00161,116,-2
686,Vermilion,Illinois,2084,536,1236,319,0.029423,2091,-7
690,Wayne,Illinois,2,0,2,2,2.8e-05,4,-2
691,White,Illinois,72,35,19,36,0.001017,90,-18


However, since most of these are low numbers (Madison County and White County are obviously exceptions, sketchy af but... not much we could do besides contacting the correctional facilities in those counties 😵), we will set the other value to 0 in these cases:

In [11]:
il_df['Other'] = il_df['Other'].apply(lambda x: 0 if x < 0 else x)
# And now we can drop three_cat_total, since we only needed that in order to form the other count:
il_df.drop(columns=['three_cat_total'], inplace=True, errors='ignore')

# Since we've now arrived at consistent names for the four race categories used by PPI, we store these names in a list for future use (to ensure consistency in naming throughout):
race_category_names = ['White', 'Black', 'Latino', 'Other']
il_df.head()

Unnamed: 0,County,State,Total,White,Black,Latino,state_prop,Other
595,Adams,Illinois,110,73,36,0,0.001553,1
596,Alexander,Illinois,411,89,242,79,0.005803,1
597,Bond,Illinois,1542,500,657,304,0.021771,81
598,Boone,Illinois,71,38,12,21,0.001002,0
599,Brown,Illinois,2059,419,1267,367,0.02907,6
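
The rule for the Other column can also be written in plain Python, without pandas, which makes the floor at zero explicit. This is only an illustrative sketch: the helper name `other_count` is hypothetical and not part of the notebook, and the three rows are copied from the tables above.

```python
def other_count(total, white, black, latino):
    """Residual count after the three tracked groups, floored at zero."""
    residual = total - (white + black + latino)
    # Counties whose three-group sum exceeds the total get 0 rather than a negative count
    return max(residual, 0)

# (Total, White, Black, Latino) rows copied from the tables above
rows = {
    "Adams": (110, 73, 36, 0),
    "Bond": (1542, 500, 657, 304),
    "Madison": (14, 0, 11, 14),
}

other_by_county = {county: other_count(*counts) for county, counts in rows.items()}
print(other_by_county)  # {'Adams': 1, 'Bond': 81, 'Madison': 0}
```

Madison is the interesting case: its three race counts sum to 25 against a total of 14, so the residual of -11 is replaced by 0.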


## Illinois Exoneree Counts/Demographics

In [20]:
exon_il_df = pd.read_csv('../../data/processed-data/illinois_exoneration_data.csv')
exon_il_df.head(3)

Unnamed: 0,last_name,first_name,age,race,sex,state,county,latitude,longitude,worst_crime_display,...,child_welfare_worker_misconduct,withheld_exculpatory_evidence,misconduct_that_is_not_withholding_evidence,knowingly_permitting_perjury,witness_tampering_or_misconduct_interrogating_co_defendant,misconduct_in_interrogation_of_exoneree,perjury_by_official,prosecutor_lied_in_court,tag_sum,geocode_address
0,Abbott,Cinque,19.0,Black,male,Illinois,Cook,41.819738,-87.756525,Drug Possession or Sale,...,0,1,1,0,0,0,0,0,7,"Cook County, Illinois, United States"
1,Abernathy,Christopher,17.0,White,male,Illinois,Cook,41.819738,-87.756525,Murder,...,0,1,1,0,0,1,0,0,10,"Cook County, Illinois, United States"
2,Abrego,Eruby,20.0,Hispanic,male,Illinois,Cook,41.819738,-87.756525,Murder,...,0,1,1,0,1,1,1,0,9,"Cook County, Illinois, United States"


In [24]:
num_il = len(exon_il_df)
num_il

548

In [21]:
exon_il_df['race'].value_counts(normalize=True)

race
Black              0.762774
Hispanic           0.147810
White              0.085766
Asian              0.001825
Native American    0.001825
Name: proportion, dtype: float64

Since the PPI data only has Black, White, Latino, and Other, we need to rename Hispanic to Latino and combine Asian and Native American into "Other" (since we're going to combine the two datasets at the end, using the race_category_names list we created above). We still keep the original race variable, just in a new column named Race_orig:

In [23]:
recode_map = {
    'Black': 'Black',
    'Hispanic': 'Latino',
    'White': 'White',
    'Asian': 'Other',
    'Native American': 'Other',
}
exon_il_df['Race_orig'] = exon_il_df['race']
exon_il_df['race'] = exon_il_df['race'].apply(lambda x: recode_map[x])
exon_il_df['race'].value_counts(normalize=True)

race
Black     0.762774
Latino    0.147810
White     0.085766
Other     0.003650
Name: proportion, dtype: float64
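
One property of the recoding step is worth spelling out: indexing the dictionary directly (`recode_map[x]`) raises a `KeyError` for any category that is missing from the map, whereas `Series.map` with a plain dict yields NaN for unmapped values without complaint. The sketch below shows the fail-fast behaviour in plain Python. The helper `recode_race` and the sample values are illustrative assumptions, not part of the notebook.

```python
recode_map = {
    "Black": "Black",
    "Hispanic": "Latino",
    "White": "White",
    "Asian": "Other",
    "Native American": "Other",
}

def recode_race(values, mapping):
    """Recode each value, raising if any category is missing from the mapping."""
    unmapped = sorted(set(values) - set(mapping))
    if unmapped:
        raise KeyError(f"Unmapped race categories: {unmapped}")
    return [mapping[v] for v in values]

sample = ["Black", "Hispanic", "Asian", "White", "Native American"]
recoded = recode_race(sample, recode_map)
print(recoded)  # ['Black', 'Latino', 'Other', 'White', 'Other']

# Recoding already-recoded values fails loudly, because "Latino" is not a key
try:
    recode_race(recoded, recode_map)
except KeyError as err:
    print("Refused to recode twice:", err)
```

Failing loudly is useful here because a recode that is applied twice, or applied to labels it does not know, would otherwise drop rows from later counts without any warning.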

## Sampling from Incarcerated Population
Now we can conduct the simulated "sample". The first step is to:

1. Sample 548 "people" from the prison population of Illinois by taking a weighted sample with replacement of size 548 from il_df. This gives us a valid population-weighted sample of 548 "people" about whom all we know is their county.

Here we also make sure to set the seed for Pandas' random number generator, so that our sample is replicable:

In [25]:
il_sample_df = il_df.sample(
    num_il,
    replace = True,
    weights = il_df['Total'],
    random_state = 5000,
).copy()
il_sample_df.head()

Unnamed: 0,County,State,Total,White,Black,Latino,state_prop,Other
610,Cook,Illinois,11649,1769,8369,1468,0.164469,43
631,Henry,Illinois,301,172,108,21,0.00425,0
667,Perry,Illinois,2323,561,1398,352,0.032798,12
610,Cook,Illinois,11649,1769,8369,1468,0.164469,43
648,Logan,Illinois,3060,963,1705,389,0.043203,3


In [26]:
il_sample_df['County'].value_counts(normalize=True).head()

County
Cook        0.142336
Will        0.060219
Randolph    0.056569
Perry       0.040146
Logan       0.040146
Name: proportion, dtype: float64
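
To make the idea of weighted sampling with replacement concrete, here is a minimal standard-library sketch of the same kind of draw. It is not the notebook's method (which uses `DataFrame.sample`). Restricting the population to the five largest counties is a simplification for illustration, so the shares it prints are not comparable to the statewide ones above.

```python
import random
from collections import Counter

# Incarcerated totals for the five largest counties, copied from the table above.
# Using only five counties is an assumption made to keep the example small.
county_totals = {
    "Cook": 11649,
    "Will": 3902,
    "Randolph": 3571,
    "Logan": 3060,
    "Livingston": 2798,
}

rng = random.Random(5000)  # fixed seed so the draw is repeatable
counties = list(county_totals)
weights = list(county_totals.values())

# Weighted sampling with replacement: each draw picks a county with
# probability proportional to its incarcerated total
draws = rng.choices(counties, weights=weights, k=548)

expected = {c: t / sum(weights) for c, t in county_totals.items()}
observed = {c: n / len(draws) for c, n in Counter(draws).items()}
for c in counties:
    print(f"{c:<11} expected={expected[c]:.3f} observed={observed.get(c, 0):.3f}")
```

As in the real sample, the observed shares track the expected ones without matching them exactly, since 548 draws leave room for sampling noise.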

Now, all that's left is the second step:

2. For each row in il_sample_df, use the counts for each race to form a probability distribution, then draw one race from it, so that our sample replicates that county's racial distribution of inmates.

This time, since we're working row-by-row, we use NumPy, still making sure to seed the RNG so that our results are replicable:

In [27]:
rng = np.random.default_rng(seed = 5000)
def draw_race_sample(row):
  race_counts = [row[cur_val] for cur_val in race_category_names]
  total_count = sum(race_counts)
  race_probs = [cur_count / total_count for cur_count in race_counts]
  # And now we have a probability distribution! We can use rng.choice() to
  # sample from it
  sampled_vals = rng.choice(race_category_names, size=1, p=race_probs)
  # We only sampled 1 value here, so we use [0] to extract it
  sampled_val = list(sampled_vals)[0]
  return sampled_val

Before using it to sample, it's helpful to check that our function works by using it to draw a bunch of samples for a specific county (Cook, in this case). First, let's compute what we would expect in terms of proportions if we sampled N inmates from Cook:

In [28]:
# Copy the row so that adding the *_prop entries below does not touch il_df
cook_row = il_df[il_df['County'] == "Cook"].iloc[0].copy()
for cname in race_category_names:
  cook_row[f'{cname}_prop'] = cook_row[cname] / cook_row['Total']
cook_row

County             Cook
State          Illinois
Total             11649
White              1769
Black              8369
Latino             1468
state_prop     0.164469
Other                43
White_prop     0.151859
Black_prop     0.718431
Latino_prop    0.126019
Other_prop     0.003691
Name: 610, dtype: object

This means that, if our draw_race_sample() function is working correctly, we'd expect it to generate "White" 15.2% of the time, "Black" 71.8% of the time, and so on. So, let's check that by using it to generate an N=5000 sample from Cook:

In [29]:
N = 5000
cook_samples = [draw_race_sample(cook_row) for _ in range(N)]
cook_sample_df = pd.DataFrame(cook_samples, columns = ['Race'])
cook_sample_df['Race'].value_counts(normalize=True)

Race
Black     0.7186
White     0.1518
Latino    0.1260
Other     0.0036
Name: proportion, dtype: float64

Looks good, very close to the expected proportions! So, now that we trust our draw_race_sample() function a bit more, we can use it to sample a race value for each row in il_sample_df. This also gives me a chance to show how the tqdm library works, which is helpful when doing things like this to check the progress (and thus to make sure that your simulation code isn't taking too long per row):

In [30]:
il_sample_df['Race'] = il_sample_df.progress_apply(draw_race_sample, axis=1)

100%|██████████| 548/548 [00:00<00:00, 22719.42it/s]


In [31]:
sample_cols_to_keep = [
    'County',
    'State',
    'Race'
]
il_sample_df = il_sample_df[sample_cols_to_keep].copy()
il_sample_df

Unnamed: 0,County,State,Race
610,Cook,Illinois,Black
631,Henry,Illinois,White
667,Perry,Illinois,Black
610,Cook,Illinois,Black
648,Logan,Illinois,Black
...,...,...,...
646,Lee,Illinois,White
605,Christian,Illinois,Black
620,Fayette,Illinois,Black
639,Kane,Illinois,White


Let's see what the racial distribution of the Cook County subset of our sample ended up looking like:

In [32]:
cook_sample_df = il_sample_df[il_sample_df['County'] == "Cook"].copy()
cook_sample_df['Race'].value_counts(normalize=True)

Race
Black     0.743590
Latino    0.166667
White     0.089744
Name: proportion, dtype: float64

So, we ended up with a slight oversample of Latinos relative to the population expectation, and an undersample of Whites, which is unsurprising given that the Cook subset contains only 78 rows (14.2% of 548). As weird as it might feel, this is a "feature" of this mode of sampling. We're simulating a simplified model of the Exoneration Project in which the exonerees are one size-548 subset of Illinois inmates, so we want to compare them with another size-548 random draw from those still incarcerated in Illinois, sampling noise included.
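
A rough way to judge whether these deviations are ordinary sampling noise is to compare each observed share with the binomial standard error of a proportion. The sketch below rests on two assumptions: the number of Cook rows (78) is inferred from Cook's 0.142336 share of the 548 sampled rows, and each race category is treated as an independent binomial outcome, which is only an approximation.

```python
import math

n_cook = round(0.142336 * 548)  # Cook rows in the sample, inferred from its share of 548

# Population shares (from cook_row) and observed shares (from the Cook subset)
expected = {"Black": 0.718431, "Latino": 0.126019, "White": 0.151859}
observed = {"Black": 0.743590, "Latino": 0.166667, "White": 0.089744}

z_scores = {}
for race, p in expected.items():
    std_err = math.sqrt(p * (1 - p) / n_cook)  # binomial standard error of a proportion
    z_scores[race] = (observed[race] - p) / std_err
    print(f"{race:<6} z = {z_scores[race]:+.2f}")
```

Under these assumptions every category lands within two standard errors of its population share, which is consistent with the gaps being noise from a small subsample rather than a problem with the sampling function.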

So, with all that completed, we can combine the 548 rows in il_sample_df with exon_il_df to create a balanced DataFrame with (548 * 2) = 1096 rows, half of which are exonerated inmates from Illinois and half of which are non-exonerated inmates from Illinois (where the non-exonerated group is statistically representative of the incarcerated population of Illinois as a whole):

In [42]:
# Construct our new label: exonerated vs. non-exonerated
exon_il_df['Label'] = "Exonerated"
il_sample_df['Label'] = "Non-Exonerated"
# And combine!
balanced_df = pd.concat([exon_il_df, il_sample_df], axis=0)
# The 'race' column of exon_il_df already uses the four PPI categories
# (recoded above via recode_map), so it can be merged directly with the
# sampled 'Race' column; mapping it a second time with the original labels
# as keys would turn 'Latino' and 'Other' into NaN

# Combine 'race' and 'Race' columns, prioritizing non-NaN values
balanced_df['Race'] = balanced_df['race'].combine_first(balanced_df['Race'])

# Drop the old 'race' column
balanced_df.drop(columns=['race'], inplace=True)

# Combine 'county' and 'County' columns, prioritizing non-NaN values
balanced_df['County'] = balanced_df['county'].combine_first(balanced_df['County'])

# Drop the old 'county' column
balanced_df.drop(columns=['county'], inplace=True)

# Verify the final Race and County columns
print(balanced_df['Race'].value_counts())
print(balanced_df['County'].value_counts())
balanced_df.head()

Race
Black     709
White     207
Latino    173
Other       7
Name: count, dtype: int64
County
Cook           552
Will            37
Randolph        31
Jefferson       23
Perry           22
Livingston      22
Logan           22
Johnson         21
Fulton          21
Tazewell        19
Lawrence        18
Montgomery      17
Bond            17
Vermilion       16
Winnebago       15
La Salle        14
Clinton         14
St. Clair       14
Lake            14
Fayette         13
Lee             13
Kane            12
Brown           12
Knox            11
Dupage          10
Morgan          10
Rock Island     10
Peoria          10
Crawford         9
Macon            9
DuPage           6
Williamson       6
Christian        6
Champaign        5
Sangamon         4
Mclean           4
McHenry          4
Mchenry          3
Henry            3
Kankakee         3
Iroquois         2
Stephenson       2
Effingham        2
Edgar            2
Woodford         2
Adams            2
Menard           1
Boone        

Unnamed: 0,last_name,first_name,age,sex,state,latitude,longitude,worst_crime_display,sentence,sentence_in_years,...,misconduct_in_interrogation_of_exoneree,perjury_by_official,prosecutor_lied_in_court,tag_sum,geocode_address,Race_orig,Label,County,State,Race
0,Abbott,Cinque,19.0,male,Illinois,41.819738,-87.756525,Drug Possession or Sale,Probation,0.0,...,0.0,0.0,0.0,7.0,"Cook County, Illinois, United States",Black,Exonerated,Cook,,Black
1,Abernathy,Christopher,17.0,male,Illinois,41.819738,-87.756525,Murder,Life without parole,100.0,...,1.0,0.0,0.0,10.0,"Cook County, Illinois, United States",White,Exonerated,Cook,,White
2,Abrego,Eruby,20.0,male,Illinois,41.819738,-87.756525,Murder,90 years,90.0,...,1.0,1.0,0.0,9.0,"Cook County, Illinois, United States",Hispanic,Exonerated,Cook,,Latino
3,Adams,Demetris,22.0,male,Illinois,41.819738,-87.756525,Drug Possession or Sale,1 year,1.0,...,0.0,0.0,0.0,7.0,"Cook County, Illinois, United States",Black,Exonerated,Cook,,Black
4,Adams,Kenneth,22.0,male,Illinois,41.819738,-87.756525,Murder,75 years,75.0,...,0.0,0.0,0.0,11.0,"Cook County, Illinois, United States",Black,Exonerated,Cook,,Black


In [43]:
balanced_df.to_csv("../../data/processed-data/exonerees_balanced.csv", index=False)
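
Before relying on the exported file, a couple of consistency checks may be worth running. The county counts above list both `Dupage` and `DuPage`, and both `Mchenry` and `McHenry`, which suggests that the two sources capitalize some county names differently. The sketch below is plain Python with hypothetical helper names (`case_variants`, `count_missing`) and toy rows standing in for balanced_df. It groups county spellings that differ only by case and counts missing race labels.

```python
from collections import defaultdict

def case_variants(names):
    """Group names that differ only by capitalization, e.g. 'Dupage' and 'DuPage'."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

def count_missing(values):
    """Count entries that are None or NaN (NaN is the only value not equal to itself)."""
    return sum(1 for v in values if v is None or v != v)

# Toy rows standing in for balanced_df; the county spellings come from the counts above
counties = ["Cook", "Dupage", "DuPage", "Mchenry", "McHenry", "Will"]
races = ["Black", "White", float("nan"), "Latino", None, "Black"]

variants = case_variants(counties)
missing = count_missing(races)
print(variants)  # {'dupage': ['DuPage', 'Dupage'], 'mchenry': ['McHenry', 'Mchenry']}
print(missing)   # 2
```

If checks like these flag anything on the real data, one option would be to standardize county names to a single spelling before concatenating, so that per-county comparisons between the two groups are not split across variants.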