---
title: "Counterfactual Data Balancing"
format:
    html: 
        toc: true
        code-fold: false
        embed-resources: true
bibliography: ../../references.bib
---

# Introduction and Motivation
In this project, I set out to create a balanced dataset that would support supervised learning models for predicting the factors linked to exonerations. At the heart of this process is **counterfactual balancing**: building a dataset that includes exonerated individuals alongside a comparable group of non-exonerated individuals, drawn to reflect the broader incarcerated population in Illinois. This balance is critical—it allows the model to make fair and meaningful comparisons when identifying patterns and predictors of exoneration outcomes.  

## Why Use Counterfactual Data?  
Counterfactual data becomes necessary when complete prison population records are unavailable. Since I don’t have access to a full dataset of all incarcerated individuals in Illinois and their exoneration statuses (e.g., 'exonerated', 'not exonerated'), I relied on counterfactuals to bridge the gap and construct a balanced dataset.  

The idea behind counterfactuals is simple: they allow us to ask *“what if?”* questions. For example: *What if an exonerated person had not been exonerated? Would their characteristics look similar to non-exonerated individuals?* Counterfactual data helps isolate these comparisons by holding everything else constant except the hypothetical condition—in this case, exoneration.  

As explained in this [primer on counterfactuals](https://bayes.cs.ucla.edu/PRIMER/primer-ch4.pdf), a counterfactual statement operates on an unrealized “if” condition. The “if” portion, also known as the antecedent, frames the comparison: exonerated individuals versus those who weren’t. This approach is useful because it helps reduce bias and gives the model balanced, more representative data to train on [@pearl2016counterfactuals].

### Acknowledgments
The implementation of this counterfactual data balancing relied heavily on expert guidance and code contributions from <a href="https://gufaculty360.georgetown.edu/s/contact/003Hp00002jMlEDIA0/jeffrey-jacobs" target="_blank">Professor Jeff Jacobs</a>. His insights and support were invaluable in refining the methodology and making this process possible.  

# Narrowing to Incarcerated Population 
To focus on the incarcerated population in Illinois, the dataset was filtered to include only relevant columns that captured key demographic details, such as total incarcerated populations broken down by race—White, Black, and Latino. This step ensured that the precise subset of data needed for balancing was used while also laying the groundwork for simulating representative draws from the Illinois incarcerated population.  

In [59]:
import pandas as pd
import numpy as np
from tqdm import tqdm # Adds progress bars to loops and other iterable processes for better visualization.
tqdm.pandas() # Allows progress bars to appear during DataFrame operations.

In [60]:
il_df = pd.read_csv('../../data/processed-data/representation_by_county.csv')
il_df = il_df[il_df['state'] == "Illinois"].copy()
il_df.head(3)

Unnamed: 0,county,state,total_population,total_white_population,total_black_population,total_latino_population,incarcerated_population,incarcerated_white_population,incarcerated_black_population,incarcerated_latino_population,non-incarcerated_population,non-incarcerated_white_population,non-incarcerated_black_population,non-incarcerated_latino_population,ratio_of_overrepresentation_of_whites_incarcerated_compared_to_whites_non-incarcerated,ratio_of_overrepresentation_of_blacks_incarcerated_compared_to_blacks_non-incarcerated,ratio_of_overrepresentation_of_latinos_incarcerated_compared_to_latinos_non-incarcerated
0,Adams,Illinois,67103,62414,2331,776,110,73,36,0,66993,62341,2295,776,0.71,9.54,0.0
1,Alexander,Illinois,8238,4983,2915,155,411,89,242,79,7827,4894,2673,76,0.35,1.72,19.82
2,Bond,Illinois,17768,15797,1080,547,1542,500,657,304,16226,15297,423,243,0.34,16.32,13.14


Columns are renamed to streamline the analysis, removing unnecessary verbosity while retaining clarity.  

In [61]:
rename_map = {
    'county': 'county',
    'state': 'state',
    'incarcerated_population': 'Total',
    'incarcerated_white_population': 'White',
    'incarcerated_black_population': 'Black',
    'incarcerated_latino_population': 'Latino',
}

# Keep only the cols in the rename_map
cols_to_keep = list(rename_map.keys())
il_df = il_df[cols_to_keep].copy()

# And do the renaming
il_df.rename(columns=rename_map, inplace=True)
il_df.head()

Unnamed: 0,county,state,Total,White,Black,Latino
0,Adams,Illinois,110,73,36,0
1,Alexander,Illinois,411,89,242,79
2,Bond,Illinois,1542,500,657,304
3,Boone,Illinois,71,38,12,21
4,Brown,Illinois,2059,419,1267,367


To align the data with the exoneration registry, a small adjustment was made to clean up the county names. The original dataset listed counties with the trailing word "County" (e.g., "Cook County"), but the registry uses simplified names (like "Cook"), so the suffix was removed to keep county names consistent across datasets.  
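
As a minimal sketch of this cleanup, the example below uses made-up labels (the `toy` series is illustrative, not taken from the dataset). It also shows why the pattern has to be spelled `" County"` with a capital C: pandas string replacement is case-sensitive, so a lowercase pattern silently matches nothing.

```python
import pandas as pd

# Toy county labels (hypothetical values for illustration only)
toy = pd.Series(["Cook County", "Will County", "Adams"])

# Anchored regex: strip " County" only when it ends the string
cleaned = toy.str.replace(r" County$", "", regex=True)

# A lowercase pattern matches nothing, so the labels come back unchanged
unchanged = toy.str.replace(" county", "", regex=False)

print(cleaned.tolist())
print(unchanged.tolist())
```

Anchoring the pattern with `$` is a design choice: it avoids touching a name that happens to contain the word elsewhere.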

A **`state_prop`** column was then added to represent the proportion of all Illinois inmates coming from each county. This was calculated by dividing each county's total incarcerated population (`Total`) by the sum of the incarcerated population across all counties. Sorting the values in descending order highlighted the counties with the largest share of the state's incarcerated population.  


In [62]:
# Since the Exoneree project uses just the county name (like "Cook"), remove the trailing " County" (so, e.g., "Cook County" becomes "Cook"). The match is case-sensitive, so the pattern needs a capital "C":
il_df['county'] = il_df['county'].str.replace(r" County$", "", regex=True)

# Compute a state_prop column representing the % of all Illinois inmates contained in each county:
il_df['state_prop'] = il_df['Total'] / il_df['Total'].sum()
il_df.sort_values(by='state_prop', ascending=False).head()


Unnamed: 0,county,state,Total,White,Black,Latino,state_prop
15,Cook,Illinois,11649,1769,8369,1468,0.164469
98,Will,Illinois,3902,811,2528,538,0.055091
78,Randolph,Illinois,3571,934,2250,377,0.050418
53,Logan,Illinois,3060,963,1705,389,0.043203
52,Livingston,Illinois,2798,905,1577,294,0.039504


From the output, **Cook County** stands out, contributing roughly 16% of Illinois’ incarcerated individuals, followed by Will, Randolph, Logan, and Livingston counties. This helps identify where most of the incarcerated population is concentrated, which will be key for balancing comparisons in the analysis. 

In [63]:
# To avoid confusing the state_prop value with the sampled proportion that we compute below, we can drop state_prop now:
il_df = il_df.drop(columns=['state_prop'])
# Since only three racial groups are tracked, the three race counts need not sum to the total incarcerated population. Let's check:
il_df['three_cat_total'] = il_df['Black'] + il_df['White'] + il_df['Latino']
il_df.head()

Unnamed: 0,county,state,Total,White,Black,Latino,three_cat_total
0,Adams,Illinois,110,73,36,0,109
1,Alexander,Illinois,411,89,242,79,410
2,Bond,Illinois,1542,500,657,304,1461
3,Boone,Illinois,71,38,12,21,71
4,Brown,Illinois,2059,419,1267,367,2053


To ensure the sample accurately represents the county-by-county distributions, the difference between `three_cat_total` and `Total` was used to construct the "Other" category.  

In [64]:
il_df['Other'] = il_df['Total'] - il_df['three_cat_total']
il_df.head()

Unnamed: 0,county,state,Total,White,Black,Latino,three_cat_total,Other
0,Adams,Illinois,110,73,36,0,109,1
1,Alexander,Illinois,411,89,242,79,410,1
2,Bond,Illinois,1542,500,657,304,1461,81
3,Boone,Illinois,71,38,12,21,71,0
4,Brown,Illinois,2059,419,1267,367,2053,6


The data source doesn’t provide much documentation, but it seems like some counties might be double-counting individuals who report more than one race. This assumption comes from the fact that, in some cases, the `three_cat_total` values (sum of White, Black, and Latino counts) are higher than the overall `Total` population for those counties.  

In [65]:
il_df[il_df['three_cat_total'] > il_df['Total']]

Unnamed: 0,county,state,Total,White,Black,Latino,three_cat_total,Other
13,Clinton,Illinois,1599,486,917,199,1602,-3
16,Crawford,Illinois,1230,310,782,141,1233,-3
25,Fayette,Illinois,1527,467,933,129,1529,-2
40,Jefferson,Illinois,1857,827,812,224,1863,-6
50,Lawrence,Illinois,2358,486,1490,393,2369,-11
59,Madison,Illinois,14,0,11,14,25,-11
60,Marion,Illinois,114,69,37,10,116,-2
91,Vermilion,Illinois,2084,536,1236,319,2091,-7
95,Wayne,Illinois,2,0,2,2,4,-2
96,White,Illinois,72,35,19,36,90,-18


Most of these discrepancies are small relative to the county totals. Madison County and White County are notable exceptions: their overcounts are large relative to their totals, which is anomalous but cannot be resolved without direct input from correctional facilities. In all of these cases, the negative "Other" value was set to `0`.  

In [66]:
il_df['Other'] = il_df['Other'].apply(lambda x: 0 if x < 0 else x)

# Drop three_cat_total, since we only needed that in order to form the other count:
il_df.drop(columns=['three_cat_total'], inplace=True, errors='ignore')

#  Store these names in a list for future use (to ensure consistency in naming throughout):
race_category_names = ['White', 'Black', 'Latino', 'Other']
il_df.head()

Unnamed: 0,county,state,Total,White,Black,Latino,Other
0,Adams,Illinois,110,73,36,0,1
1,Alexander,Illinois,411,89,242,79,1
2,Bond,Illinois,1542,500,657,304,81
3,Boone,Illinois,71,38,12,21,0
4,Brown,Illinois,2059,419,1267,367,6
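
The zeroing step above can also be expressed with `Series.clip`, which floors every value at a lower bound. The sketch below uses toy differences (not the real column) to show that both forms give the same result.

```python
import pandas as pd

# Toy "Total minus three-category sum" values, some of them negative
other = pd.Series([1, 81, -3, 0, -18])

# Row-by-row version, as in the cell above
via_apply = other.apply(lambda x: 0 if x < 0 else x)

# Vectorized version: every value below 0 is raised to 0
via_clip = other.clip(lower=0)

print(via_clip.tolist())
```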


# Illinois Exoneree Counts/Demographics
The Illinois exoneration data was loaded, and the total number of exonerated individuals was calculated by taking the length of the dataframe using `len(exon_il_df)`. The result: **548 exonerations**. This serves as the starting point for understanding the scope of exoneration cases in Illinois.  

In [67]:
exon_il_df = pd.read_csv('../../data/processed-data/illinois_exoneration_data.csv')
exon_il_df.head(3)

Unnamed: 0,last_name,first_name,age,race,sex,state,county,latitude,longitude,worst_crime_display,...,child_welfare_worker_misconduct,withheld_exculpatory_evidence,misconduct_that_is_not_withholding_evidence,knowingly_permitting_perjury,witness_tampering_or_misconduct_interrogating_co_defendant,misconduct_in_interrogation_of_exoneree,perjury_by_official,prosecutor_lied_in_court,tag_sum,geocode_address
0,Abbott,Cinque,19.0,Black,male,Illinois,Cook,41.819738,-87.756525,Drug Possession or Sale,...,0,1,1,0,0,0,0,0,7,"Cook County, Illinois, United States"
1,Abernathy,Christopher,17.0,White,male,Illinois,Cook,41.819738,-87.756525,Murder,...,0,1,1,0,0,1,0,0,10,"Cook County, Illinois, United States"
2,Abrego,Eruby,20.0,Hispanic,male,Illinois,Cook,41.819738,-87.756525,Murder,...,0,1,1,0,1,1,1,0,9,"Cook County, Illinois, United States"


In [68]:
num_il = len(exon_il_df)
num_il

548

The `value_counts()` function was applied to the `race` column with `normalize=True` to calculate the proportion of exonerated individuals by race in Illinois. The results highlight significant disparities:

- **Black individuals** make up the majority of exonerations at **76.3%**.  
- **Hispanic individuals** account for **14.8%**, while **White individuals** represent only **8.6%**.  
- The remaining categories, including **Asian** and **Native American**, each comprise less than **0.2%** of exonerations.  

In [69]:
exon_il_df['race'].value_counts(normalize=True)

race
Black              0.762774
Hispanic           0.147810
White              0.085766
Asian              0.001825
Native American    0.001825
Name: proportion, dtype: float64

Since the Prison Policy Initiative demographic data only includes Black, White, Latino, and Other as race categories, "Hispanic" was first renamed to "Latino" for consistency. "Asian" and "Native American" were then combined into the "Other" category. To preserve the original race data, it was saved into a new column called `Race_orig` for future reference if needed.  


In [70]:
recode_map = {
    'Black': 'Black',
    'Hispanic': 'Latino',
    'White': 'White',
    'Asian': 'Other',
    'Native American': 'Other',
}
exon_il_df['Race_orig'] = exon_il_df['race']
exon_il_df['race'] = exon_il_df['race'].apply(lambda x: recode_map[x])
exon_il_df['race'].value_counts(normalize=True)

race
Black     0.762774
Latino    0.147810
White     0.085766
Other     0.003650
Name: proportion, dtype: float64
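
One design detail in the recoding cell is worth spelling out. Looking labels up with `recode_map[x]` raises a `KeyError` for any label missing from the dictionary, whereas `Series.map` quietly turns such labels into `NaN`. The sketch below illustrates the difference with a deliberately incomplete, hypothetical mapping.

```python
import pandas as pd

# Hypothetical mapping that deliberately omits "Asian"
recode = {"Black": "Black", "Hispanic": "Latino", "White": "White"}
labels = pd.Series(["Black", "Hispanic", "Asian"])

# Series.map: the unmapped label silently becomes NaN
mapped = labels.map(recode)
print(mapped.isna().sum())

# Dictionary lookup: the unmapped label raises immediately
try:
    labels.apply(lambda x: recode[x])
    raised = False
except KeyError:
    raised = True
print(raised)
```

The strict lookup is the safer choice when every label is expected to be covered, because a missing category fails loudly instead of producing missing values downstream.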

# Sampling from the Incarcerated Population  

## Draw Representative Samples  
The first step in the simulation is to draw a representative sample of **548 "people"** from the Illinois prison population. To achieve this, a weighted random sample with replacement was performed from the `il_df` dataset. Sampling weights were determined based on each county's total incarcerated population, ensuring that counties with larger populations contributed proportionally more to the sample.  

A random seed (`random_state=5000`) was set to ensure the results are replicable. This step produces a valid population-weighted sample where the only known characteristic of each "person" is their county.  


In [71]:
il_sample_df = il_df.sample(
    num_il,
    replace = True,
    weights = il_df['Total'],
    random_state = 5000,
).copy()
il_sample_df.head()

Unnamed: 0,county,state,Total,White,Black,Latino,Other
15,Cook,Illinois,11649,1769,8369,1468,43
36,Henry,Illinois,301,172,108,21,0
72,Perry,Illinois,2323,561,1398,352,12
15,Cook,Illinois,11649,1769,8369,1468,43
53,Logan,Illinois,3060,963,1705,389,3


In [72]:
il_sample_df['county'].value_counts(normalize=True).head()

county
Cook        0.142336
Will        0.060219
Randolph    0.056569
Perry       0.040146
Logan       0.040146
Name: proportion, dtype: float64
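
The sampled county shares will not match the population shares exactly. Cook makes up about 16.4% of the incarcerated population but 14.2% of this sample, a gap in line with sampling variation for 548 draws. As a rough sketch of how weighted sampling behaves, the toy example below (hypothetical counties `A`, `B`, `C` with made-up totals) draws a much larger sample and compares observed shares with the shares implied by the weights.

```python
import pandas as pd

# Toy counties with made-up incarcerated totals
toy = pd.DataFrame({"county": ["A", "B", "C"], "Total": [600, 300, 100]})

# Weighted sampling with replacement; pandas normalizes the weights internally
draws = toy.sample(10_000, replace=True, weights=toy["Total"], random_state=0)

observed = draws["county"].value_counts(normalize=True)
expected = toy.set_index("county")["Total"] / toy["Total"].sum()

# Align the two series on county name for a side-by-side view
comparison = pd.DataFrame({"expected": expected, "observed": observed})
print(comparison)
```

With more draws the observed shares settle closer to the expected ones, which is why a size-548 sample shows visible but modest deviations.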

## Simulating Racial Distribution  

To replicate the racial makeup of the incarcerated population, racial counts for each county were used to create a probability distribution for race. For each row in `il_sample_df` (which represents a sampled county), a distribution was formed based on the race-specific counts, and a single "person" was drawn from that distribution.  

This process was done row-by-row using the `choice()` method of a NumPy random `Generator`. The generator (`rng`) was created with a fixed seed so the results remain consistent and replicable across runs.  

In [73]:
rng = np.random.default_rng(seed = 5000)
def draw_race_sample(row):
  race_counts = [row[cur_val] for cur_val in race_category_names]
  total_count = sum(race_counts)
  race_probs = [cur_count / total_count for cur_count in race_counts]
  # And now we have a probability distribution! We can use rng.choice() to sample from it
  sampled_vals = rng.choice(race_category_names, size=1, p=race_probs)
  # We only sampled 1 value here, so we use [0] to extract it
  sampled_val = list(sampled_vals)[0]
  return sampled_val
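
To make the normalization step concrete, the sketch below works through a single row by hand, using the Adams County counts shown earlier (White 73, Black 36, Latino 0, Other 1). The seed and the number of draws are arbitrary choices for this illustration.

```python
import numpy as np

race_category_names = ["White", "Black", "Latino", "Other"]
adams_counts = [73, 36, 0, 1]  # Adams County counts from the table above

# Divide each count by the row total to get a probability distribution
total = sum(adams_counts)
probs = [count / total for count in adams_counts]
print(dict(zip(race_category_names, [round(p, 4) for p in probs])))

# Draw from that distribution; a category with probability 0 is never chosen
rng = np.random.default_rng(seed=0)  # arbitrary seed for this sketch
draws = rng.choice(race_category_names, size=1000, p=probs)
print(sorted(set(draws)))
```

Note that the probabilities are normalized by the sum of the four categories rather than by `Total`, so they still sum to 1 in counties where "Other" was clipped to zero.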

Before sampling, the function was tested by drawing multiple samples for a specific county—Cook County, in this case. To verify its accuracy, the expected proportions for sampling `N` inmates from Cook were first computed.  


In [74]:
cook_row = il_df[il_df['county'] == "Cook"].iloc[0]
for cname in race_category_names:
  cook_row[f'{cname}_prop'] = cook_row[cname] / cook_row['Total']
cook_row

county             Cook
state          Illinois
Total             11649
White              1769
Black              8369
Latino             1468
Other                43
White_prop     0.151859
Black_prop     0.718431
Latino_prop    0.126019
Other_prop     0.003691
Name: 15, dtype: object

This means that if the `draw_race_sample()` function is working correctly, it should generate "White" 15.2% of the time, "Black" 71.8% of the time, and so on. To confirm this, a sample of size **N=5000** was generated from Cook County to check whether the proportions align with the expected values.  

In [75]:
N = 5000
cook_samples = [draw_race_sample(cook_row) for _ in range(N)]
cook_sample_df = pd.DataFrame(cook_samples, columns = ['Race'])
cook_sample_df['Race'].value_counts(normalize=True)

Race
Black     0.7186
White     0.1518
Latino    0.1260
Other     0.0036
Name: proportion, dtype: float64
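
One way to judge how close these proportions are is to compare each deviation with the binomial standard error, $\sqrt{p(1-p)/N}$. The sketch below does this using the expected and observed values printed above; treating each category as a separate binomial proportion is a simplifying assumption.

```python
import math

N = 5000
# Expected shares (from cook_row) and observed shares (from the sample above)
expected = {"White": 0.151859, "Black": 0.718431, "Latino": 0.126019, "Other": 0.003691}
observed = {"White": 0.1518, "Black": 0.7186, "Latino": 0.1260, "Other": 0.0036}

z_scores = {}
for race, p in expected.items():
    se = math.sqrt(p * (1 - p) / N)           # binomial standard error
    z_scores[race] = (observed[race] - p) / se  # deviation in standard-error units
    print(f"{race}: se={se:.4f}, z={z_scores[race]:+.2f}")
```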

The results look good and are very close to the expected proportions, which confirms that the `draw_race_sample()` function is working as intended. With this validation, the function can now be used to sample a race value for each row in `il_sample_df`.  

This step also introduces the `tqdm` library, which is useful for tracking progress when running simulations like this. It helps monitor how long the code takes per row, ensuring the simulation remains efficient.  

In [76]:
il_sample_df['Race'] = il_sample_df.progress_apply(draw_race_sample, axis=1)

100%|██████████| 548/548 [00:00<00:00, 8289.38it/s]


In [78]:
sample_cols_to_keep = [
    'county',
    'state',
    'Race'
]
il_sample_df = il_sample_df[sample_cols_to_keep].copy()
il_sample_df

Unnamed: 0,county,state,Race
15,Cook,Illinois,Black
36,Henry,Illinois,White
72,Perry,Illinois,Black
15,Cook,Illinois,Black
53,Logan,Illinois,Black
...,...,...,...
51,Lee,Illinois,White
10,Christian,Illinois,Black
25,Fayette,Illinois,Black
44,Kane,Illinois,White


Let’s take a look at the racial distribution of the Cook County subset from our sample to see how it turned out:  

In [79]:
cook_sample_df = il_sample_df[il_sample_df['county'] == "Cook"].copy()
cook_sample_df['Race'].value_counts(normalize=True)

Race
Black     0.743590
Latino    0.166667
White     0.089744
Name: proportion, dtype: float64

The results show an oversample of Latinos and an undersample of Whites relative to the population expectation. While this might seem odd, it is actually a *feature* of this sampling process. Only about 78 of the 548 sampled rows fall in Cook County (14.2% of 548), so deviations of this size are ordinary sampling variation. The goal here is to mirror a simplified model of the Exoneration Registry, in which the exonerees are a size-548 subset of people incarcerated in Illinois. Drawing another size-548 subset from those still incarcerated allows a direct comparison between the two groups.  
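
A small simulation can show how unusual the Cook subset is. The sketch below assumes the subset has 78 rows (0.142336 × 548) of which 7 are White (0.089744 × 78), then repeatedly draws 78 people from Cook's population shares to see how often 7 or fewer White draws occur. The seed and the number of simulations are arbitrary.

```python
import numpy as np

n_cook = 78          # sampled rows that fell in Cook (0.142336 * 548)
p_white = 0.151859   # White share of Cook's incarcerated population
observed_white = 7   # White rows in the Cook subset (0.089744 * 78)

rng = np.random.default_rng(seed=0)  # arbitrary seed for this sketch

# Number of White draws in each of 20,000 simulated size-78 samples
sim = rng.binomial(n_cook, p_white, size=20_000)

# Share of simulated samples at least as low as the observed count
tail = (sim <= observed_white).mean()
print(f"Share of simulations with {observed_white} or fewer White draws: {tail:.3f}")
```

A tail share well above zero would suggest the observed undersample is within ordinary sampling variation for a subset this small.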

With this step completed, the 548 rows from `il_sample_df` can now be combined with the 548 rows in `exon_il_df`, creating a balanced DataFrame with a total of **1,096 rows**. Half of these rows represent exonerated individuals from Illinois, and the other half represent non-exonerated individuals, sampled to be statistically representative of Illinois' incarcerated population as a whole.  

## Constructing the Final Balanced Dataset  

To prepare the final balanced dataset, a new label column was added to distinguish between exonerated and non-exonerated individuals. Specifically:  
- The `Label` column in `exon_il_df` was set to **"Exonerated"**.  
- The `Label` column in `il_sample_df` was set to **"Non-Exonerated"**.  

To avoid confusion when combining datasets, the `county` column in `il_sample_df` was renamed to **`County`**. With the labels in place and columns aligned, both datasets were combined into a single DataFrame using `pd.concat()`.  

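Next, a race mapping was applied to standardize the race categories across datasets. Because the exoneree `race` column was already recoded above, the mapping must pass the recoded labels through unchanged; any label missing from the mapping would become `NaN`:  
- "Asian" and "Native American" belong in the **"Other"** category, and "Hispanic" in **"Latino"**.  
- "Black," "White," "Latino," and "Other" are kept as-is.  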

To clean up, the **`race`** and **`Race`** columns were combined, prioritizing non-NaN values to ensure no data was lost. The original `race` column was then dropped. Similarly, the **`county`** and **`County`** columns were merged, and the original `county` column was removed to streamline the final DataFrame.  

Finally, the resulting **`Race`** and **`County`** columns were checked to confirm the expected values, and the first few rows of the balanced dataset were displayed to verify everything was in place.  


In [80]:
# Construct our new label: exonerated vs. non-exonerated
exon_il_df['Label'] = "Exonerated"
il_sample_df['Label'] = "Non-Exonerated"
il_sample_df = il_sample_df.rename(columns={'county' : 'County'}) # Rename to distinguish when combining datasets

# And combine!
balanced_df = pd.concat([exon_il_df, il_sample_df], axis=0)
# Define the mapping for 'race'. The exoneree 'race' column was already recoded
# above (Hispanic -> Latino, Asian / Native American -> Other), so the recoded
# labels must map to themselves; any label missing here would turn into NaN.
race_mapping = {
    'Asian': 'Other',
    'Native American': 'Other',
    'Hispanic': 'Latino',
    'Black': 'Black',
    'White': 'White',
    'Latino': 'Latino',
    'Other': 'Other'
}


# Map the 'race' column
balanced_df['race'] = balanced_df['race'].map(race_mapping)

# Combine 'race' and 'Race' columns, prioritizing non-NaN values
balanced_df['Race'] = balanced_df['race'].combine_first(balanced_df['Race'])

# Drop the old 'race' column
balanced_df.drop(columns=['race'], inplace=True)

# Combine 'county' and 'County' columns, prioritizing non-NaN values
balanced_df['County'] = balanced_df['county'].combine_first(balanced_df['County'])

# Drop the old 'county' column
balanced_df.drop(columns=['county'], inplace=True)

# Verify the final Race column
print(balanced_df['Race'].value_counts())
print(balanced_df['County'].value_counts())
balanced_df.head()

Race
Black     709
White     207
Latino    173
Other       7
Name: count, dtype: int64
County
Cook           552
Will            37
Randolph        31
Jefferson       23
Logan           22
Perry           22
Livingston      22
Fulton          21
Johnson         21
Tazewell        19
Lawrence        18
Montgomery      17
Bond            17
Vermilion       16
DuPage          15
Winnebago       15
Lake            14
St. Clair       14
Clinton         14
La Salle        14
Fayette         13
Lee             13
Brown           12
Kane            12
Knox            11
Peoria          10
Morgan          10
Rock Island     10
Macon            9
Crawford         9
McHenry          7
Christian        6
Williamson       6
Champaign        5
McLean           4
Sangamon         4
Henry            3
Kankakee         3
Stephenson       2
Woodford         2
Edgar            2
Effingham        2
Iroquois         2
Adams            2
Richland         1
Menard           1
Pope             1
Madison      

Unnamed: 0,last_name,first_name,age,sex,state,latitude,longitude,worst_crime_display,sentence,sentence_in_years,...,witness_tampering_or_misconduct_interrogating_co_defendant,misconduct_in_interrogation_of_exoneree,perjury_by_official,prosecutor_lied_in_court,tag_sum,geocode_address,Race_orig,Label,County,Race
0,Abbott,Cinque,19.0,male,Illinois,41.819738,-87.756525,Drug Possession or Sale,Probation,0.0,...,0.0,0.0,0.0,0.0,7.0,"Cook County, Illinois, United States",Black,Exonerated,Cook,Black
1,Abernathy,Christopher,17.0,male,Illinois,41.819738,-87.756525,Murder,Life without parole,100.0,...,0.0,1.0,0.0,0.0,10.0,"Cook County, Illinois, United States",White,Exonerated,Cook,White
2,Abrego,Eruby,20.0,male,Illinois,41.819738,-87.756525,Murder,90 years,90.0,...,1.0,1.0,1.0,0.0,9.0,"Cook County, Illinois, United States",Hispanic,Exonerated,Cook,Latino
3,Adams,Demetris,22.0,male,Illinois,41.819738,-87.756525,Drug Possession or Sale,1 year,1.0,...,0.0,0.0,0.0,0.0,7.0,"Cook County, Illinois, United States",Black,Exonerated,Cook,Black
4,Adams,Kenneth,22.0,male,Illinois,41.819738,-87.756525,Murder,75 years,75.0,...,1.0,0.0,0.0,0.0,11.0,"Cook County, Illinois, United States",Black,Exonerated,Cook,Black


In [82]:
balanced_df.to_csv("../../data/processed-data/exonerees_balanced.csv", index=False)
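
Before relying on the saved file, a few quick checks can catch silent problems such as unmapped race labels turning into missing values. The helper below is a hypothetical sketch (`check_balanced` is not part of the original workflow), shown on a toy frame instead of `balanced_df`.

```python
import pandas as pd

def check_balanced(df, race_col="Race", label_col="Label"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Every row should carry a race value
    n_missing = int(df[race_col].isna().sum())
    if n_missing:
        problems.append(f"{n_missing} rows have no {race_col} value")

    # Both labels should appear the same number of times
    counts = df[label_col].value_counts()
    if counts.nunique() != 1:
        problems.append(f"labels are unbalanced: {counts.to_dict()}")

    return problems

# Toy frame with one missing race value
toy = pd.DataFrame({
    "Race": ["Black", None, "White", "Latino"],
    "Label": ["Exonerated", "Exonerated", "Non-Exonerated", "Non-Exonerated"],
})
print(check_balanced(toy))
```

Calling the same helper on the real DataFrame, for example `check_balanced(balanced_df)`, would be one way to confirm that all 1,096 rows have a race value and that the two labels are evenly split.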

# Summary and Next Steps  

The final balanced dataset now consists of **1,096 rows**, split evenly between exonerated and non-exonerated individuals. Key steps included creating consistent labels, standardizing race categories, and combining the datasets while ensuring no critical data was lost. The resulting DataFrame provides a clean and structured foundation for further analysis.  

## Next Steps  
This balanced dataset can now be used for **supervised learning** tasks, such as:  
- **Predicting Exoneration Factors:** Training machine learning models to identify the characteristics most associated with exoneration outcomes.  
- **Comparative Analysis:** Exploring differences in demographics, geographic distribution, or other variables between exonerated and non-exonerated individuals.  
- **Visualization and Insights:** Mapping trends or disparities across counties and racial groups to better understand systemic patterns in wrongful convictions.  
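
As a starting point for the comparative analysis, a row-normalized cross-tabulation shows the racial composition within each label. The sketch below uses a small made-up frame; with the real file, the same call would be applied to the `Label` and `Race` columns of the saved dataset.

```python
import pandas as pd

# Toy rows (hypothetical values, not taken from the balanced dataset)
toy = pd.DataFrame({
    "Label": ["Exonerated"] * 4 + ["Non-Exonerated"] * 4,
    "Race": ["Black", "Black", "Black", "Latino",
             "Black", "Black", "White", "White"],
})

# normalize="index" makes each row (label) sum to 1
shares = pd.crosstab(toy["Label"], toy["Race"], normalize="index")
print(shares)
```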

With this dataset, models and analyses can provide deeper insights into the factors driving exonerations while ensuring fairness and balance in comparisons.  