---
title: "Data Collection"
format:
    html: 
        toc: true
        code-fold: false
        embed-resources: true
bibliography: ../../references.bib
---


{{< include overview.qmd >}} 

# Methods
Data collection combined direct downloads, web scraping, and API-based geocoding. The exoneration and arrest datasets form the foundation of the analysis, while the scraped county data and the geocoding add demographic and spatial context.

## Illinois Arrest Data
The Illinois arrest dataset was sourced from the [**Illinois Criminal Justice Information Authority's (ICJIA) Arrest Explorer**](https://icjia.illinois.gov/arrestexplorer/docs/#data-privacy-and-precision){target="_blank"}, a platform providing aggregate arrest data from the **Criminal History Record Information (CHRI)** system, a statewide resource for demographic and offense-related variables [@illinois_criminal_justice_information_authority_overview_2024].

To ensure privacy and confidentiality, ICJIA applied the following modifications:  

- Counts under 10 are approximated (e.g., 1 for counts 0–4, 6 for counts 5–9),  
- Subtotals, such as arrests by race or county, are accurate within +1/-1, and  
- Statewide totals align exactly with the CHRI database at the time of retrieval, which occurs twice annually.  
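
The small-count rule has a practical consequence: a value of 1 or 6 in the downloaded data is a placeholder, not an exact count. The sketch below illustrates the rule as described in the list above. It is not ICJIA's implementation, and the function name `mask_small_count` is hypothetical.

```python
def mask_small_count(count):
    """Approximate a count the way the list above describes (illustrative only)."""
    if count < 0:
        raise ValueError("counts cannot be negative")
    if count <= 4:
        return 1   # counts 0-4 are reported as 1
    if count <= 9:
        return 6   # counts 5-9 are reported as 6
    return count   # counts of 10 or more are reported unchanged


for true_count in (0, 3, 5, 9, 10, 226):
    print(true_count, "->", mask_small_count(true_count))
```

Because many small counties report a masked value for smaller racial groups, sums across counties can differ from the true total, which is why ICJIA reports subtotals and statewide totals separately.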

The dataset also excludes juvenile arrests, class C misdemeanors, and cases with missing demographic details. For this project, the data was first filtered by **race, county, and year** and then downloaded directly [@illinois_criminal_justice_information_authority_arrests_2024].

In [4]:
import pandas as pd
arrest_df = pd.read_csv('../../data/raw-data/illinois_arrest_explorer_data.csv')
arrest_df.head(3)

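|   | Year | race | county_Adams | county_Alexander | county_Bond | county_Boone | county_Brown | county_Bureau | county_Calhoun | county_Carroll | ... | county_Wabash | county_Warren | county_Washington | county_Wayne | county_White | county_Whiteside | county_Will | county_Williamson | county_Winnebago | county_Woodford |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | African American | 226 | 147 | 25 | 18 | 18 | 48 | 6 | 12 | ... | 6 | 46 | 25 | 6 | 16 | 128 | 3000 | 75 | 2509 | 42 |
| 1 | 2001 | Asian | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 14 | 1 | 28 | 1 |
| 2 | 2001 | Hispanic | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |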
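
The preview above stores one column per county, which is awkward for grouping or for joining with county-level data. A minimal sketch of reshaping it to one row per year, race, and county follows. It uses a few values copied from the preview, and the output column names `county` and `arrests` are assumptions chosen for illustration.

```python
import pandas as pd

# Two rows and two county columns copied from the preview above
wide = pd.DataFrame({
    "Year": [2001, 2001],
    "race": ["African American", "Asian"],
    "county_Adams": [226, 1],
    "county_Alexander": [147, 1],
})

# Turn every county_* column into (county, arrests) pairs
long_df = wide.melt(id_vars=["Year", "race"], var_name="county", value_name="arrests")

# Drop the "county_" prefix so the names match other county tables
long_df["county"] = long_df["county"].str.replace("county_", "", regex=False)

print(long_df)
```

In the long layout, a county name becomes a value that can be filtered or merged on, rather than part of a column header.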



## Exoneration Data 
The exoneration dataset was downloaded directly from the [***National Registry of Exonerations***](https://www.law.umich.edu/special/exoneration/Pages/Spread-Sheet-Request-Form.aspx){target=_blank}, a collaborative initiative by the Newkirk Center for Science and Society at the University of California (Irvine), the University of Michigan Law School, and Michigan State University College of Law. Established in 2012 by Rob Warden, then Executive Director of Northwestern University’s Pritzker School of Law’s Center on Wrongful Convictions, and Samuel R. Gross, a Law Professor at the University of Michigan, the Registry collects and publishes comprehensive, searchable statistical data and detailed case records for exonerations of innocent criminal defendants in the United States dating back to 1989 [@university_of_california_irvine_newkirk_center_for_science__society_national_2024].

The Registry defines exonerations as cases where a person, following new evidence of innocence, is officially cleared through actions like **factual declarations of innocence**, **pardons**, or the **dismissal/acquittal** of charges [@university_of_california_irvine_newkirk_center_for_science__society_national_criteria].

To access the exoneration dataset, a [**spreadsheet request form**](https://www.law.umich.edu/special/exoneration/Pages/Spread-Sheet-Request-Form.aspx){target=_blank} was submitted, and the dataset was provided under conditions ensuring its proper use. These conditions include restrictions on retransmission, a requirement for advance notice of publication, and the obligation to report any identified errors or missing data.

In [3]:
exoneration_df = pd.read_csv('../../data/raw-data/US_exoneration_data.csv')
exoneration_df.head(3)

|   | Last Name | First Name | Age | Race | Sex | State | County | Tags | Worst Crime Display | Sentence | ... | F/MFE | FC | ILD | P/FA | DNA | MWID | OM | Date of Exoneration | Date of 1st Conviction | Date of Release |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbitt | Joseph | 31.0 | Black | Male | North Carolina | Forsyth | CV;#IO;#SA | Child Sex Abuse | Life | ... |  |  |  |  | DNA | MWID |  | 9/2/09 | 6/22/95 | 9/2/09 |
| 1 | Abbott | Cinque | 19.0 | Black | Male | Illinois | Cook | CIU;#IO;#NC;#P | Drug Possession or Sale | Probation | ... |  |  |  | P/FA |  |  | OM | 2/1/22 | 3/25/08 | 3/25/08 |
| 2 | Abdal | Warith Habib | 43.0 | Black | Male | New York | Erie | IO;#SA | Sexual Assault | 20 to Life | ... | F/MFE |  |  |  | DNA | MWID | OM | 9/1/99 | 6/6/83 | 9/1/99 |
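
The date columns in the preview appear as `m/d/yy` strings with two-digit years. Python's `%y` directive maps 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, so a conviction from the 1960s or earlier would be read as a future date. The sketch below shows one way to guard against that, assuming the raw CSV stores dates exactly as shown. The helper name `parse_registry_date` is hypothetical.

```python
from datetime import date, datetime


def parse_registry_date(text, today=None):
    """Parse an m/d/yy string; roll back any date that lands in the future."""
    today = today or date.today()
    parsed = datetime.strptime(text, "%m/%d/%y").date()
    if parsed > today:
        # A future date means %y guessed the wrong century
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


# Values from the first row of the preview above
convicted = parse_registry_date("6/22/95")
exonerated = parse_registry_date("9/2/09")
days_between = (exonerated - convicted).days

print(convicted, exonerated, days_between)
```

Checking the minimum and maximum of each parsed column is a quick way to confirm that no date ended up in the wrong century.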


## Mass Incarceration Racial Geography 
The population and incarceration data for Illinois counties were obtained by scraping the [**Prison Policy Initiative's**](https://www.prisonpolicy.org/racialgeography/counties.html) website, which provides total and incarcerated population counts broken down by race for counties across the United States [@prison_policy_initiative_home_2024].  

To extract the data, the `requests` library retrieved the webpage's HTML content, and `BeautifulSoup` parsed the HTML and located the relevant table, which was then converted into a pandas DataFrame for cleaning and analysis. Here is the code used:  


In [26]:
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

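# Retrieve the HTML content of the target webpage
html_url = "https://www.prisonpolicy.org/racialgeography/counties.html"
result = requests.get(html_url, timeout=30)
result.raise_for_status()  # Stop on an HTTP error instead of parsing an error page

# Parse the HTML content with BeautifulSoup, naming the parser explicitly
soup = BeautifulSoup(result.text, "html.parser")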

# Locate the table element containing the data
table_elt = soup.find("table")

# Convert the HTML table into a Pandas DataFrame
table_sio = StringIO(str(table_elt))
county_df = pd.read_html(table_sio)[0]

# Clean the column names by collapsing stray whitespace (including non-breaking spaces)
county_df.columns = [" ".join(str(c).split()) for c in county_df.columns]

# Filter for Illinois
il_df = county_df[county_df['State'] == "Illinois"].copy()

# Export to CSV
il_df.to_csv('../../data/raw-data/representation_by_county_raw.csv', index=False)
print("Data saved to 'representation_by_county_raw.csv'")
il_df.head()

Data saved to 'representation_by_county_raw.csv'


|   | County | State | Total Population | Total White Population | Total Black Population | Total Latino Population | Incarcerated Population | Incarcerated White Population | Incarcerated Black Population | Incarcerated Latino Population | Non-incarcerated Population | Non-incarcerated White Population | Non-Incarcerated Black Population | Non-Incarcerated Latino Population | Ratio of Overrepresentation of Whites Incarcerated Compared to Whites Non-Incarcerated | Ratio of Overrepresentation of Blacks Incarcerated Compared to Blacks Non-Incarcerated | Ratio of Overrepresentation of Latinos Incarcerated Compared to Latinos Non-Incarcerated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 595 | Adams County | Illinois | 67103 | 62414 | 2331 | 776 | 110 | 73 | 36 | 0 | 66993 | 62341 | 2295 | 776 | 0.71 | 9.54 | 0.0 |
| 596 | Alexander County | Illinois | 8238 | 4983 | 2915 | 155 | 411 | 89 | 242 | 79 | 7827 | 4894 | 2673 | 76 | 0.35 | 1.72 | 19.82 |
| 597 | Bond County | Illinois | 17768 | 15797 | 1080 | 547 | 1542 | 500 | 657 | 304 | 16226 | 15297 | 423 | 243 | 0.34 | 16.32 | 13.14 |
| 598 | Boone County | Illinois | 54165 | 40757 | 1064 | 10967 | 71 | 38 | 12 | 21 | 54094 | 40719 | 1052 | 10946 | 0.71 | 8.71 | 1.46 |
| 599 | Brown County | Illinois | 6937 | 5191 | 1280 | 402 | 2059 | 419 | 1267 | 367 | 4878 | 4772 | 13 | 35 | 0.21 | 227.91 | 24.76 |
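
The ratio columns can be sanity-checked against the counts in the same row. The sketch below assumes each ratio is a group's share of the incarcerated population divided by its share of the non-incarcerated population. That reading is an inference from the column names, not a documented formula, and the function name is hypothetical. The published figures may use rounded intermediate percentages, so small differences are possible (the same calculation for Adams County's Black population gives roughly 9.55 against a published 9.54).

```python
def overrepresentation_ratio(group_incarcerated, total_incarcerated,
                             group_non_incarcerated, total_non_incarcerated):
    """Share of the incarcerated population divided by share of the non-incarcerated one."""
    if group_non_incarcerated == 0:
        return float("nan")  # the ratio is undefined when the group is absent outside prison
    share_incarcerated = group_incarcerated / total_incarcerated
    share_non_incarcerated = group_non_incarcerated / total_non_incarcerated
    return share_incarcerated / share_non_incarcerated


# Adams County, white population, using the counts from the first row above
ratio = overrepresentation_ratio(73, 110, 62341, 66993)
print(round(ratio, 2))
```

A ratio above 1 means the group makes up a larger share of the incarcerated population than of the county's non-incarcerated residents.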


## Geographical Data

To enable geographic analysis, I geocoded each Illinois county, adding its latitude, longitude, and full address. These fields make it possible to examine trends and disparities across counties, evaluate geographic clustering of exoneration cases, and explore how systemic factors vary by region. They add a spatial dimension to the analysis and provide the foundation for mapping and visualization.

The geocoding was done with [**GeoPy**](https://geopy.readthedocs.io/en/stable/){target=_blank}, a Python client for several geocoding services; here it queried the **Nominatim API** from OpenStreetMap. GeoPy sends each place name to Nominatim and returns the matching latitude, longitude, and full address.  

In [None]:
# Import the Nominatim geocoder for converting location names into geographic coordinates:
from geopy.geocoders import Nominatim  

# Initialize the geolocator with a custom user agent, which Nominatim's usage policy requires to identify the application:
geolocator = Nominatim(user_agent="illinois_exoneration_geocode") 

# Clean the 'County' column by removing the word "County" and any extra spaces:
il_df['County'] = il_df['County'].str.replace("County", "").str.strip()

# Copy only the county and state columns into a new DataFrame of Illinois counties:
illinois_counties = il_df[['County', 'State']].copy()

# Define a function to geocode counties and return geographic details:
def geocode_county(row):
    """
    Takes a row containing 'County' and 'State' columns.
    Uses the geolocator to find the full address, latitude, and longitude.
    Returns a dictionary with the geocoded data or None if geocoding fails.
    """
    try:
        # Combine county and state into a query string for geocoding:
        location = geolocator.geocode(f"{row['County']}, {row['State']}, USA")
        
        # If a location is found, return the geocoded details:
        if location:
            return {
                'address': location.address,    # The full geocoded address
                'latitude': location.latitude,  # The latitude coordinate
                'longitude': location.longitude # The longitude coordinate
            }
        else:  # Print a message if no geocoding result is found:
            print(f"Failed: No result for {row['County']}, {row['State']}")
            return None
    except Exception as e:  # Handle errors during geocoding and print them:
        print(f"Error geocoding {row['County']}, {row['State']}: {e}")
        return None

# Apply the geocoding function to each Illinois county:
geocoded_results = illinois_counties.apply(geocode_county, axis=1)

# Extract the geocoding results into separate columns for address, latitude, and longitude:
illinois_counties['geocode_address'] = geocoded_results.apply(lambda x: x['address'] if isinstance(x, dict) and 'address' in x else None)
illinois_counties['latitude'] = geocoded_results.apply(lambda x: x['latitude'] if isinstance(x, dict) and 'latitude' in x else None)
illinois_counties['longitude'] = geocoded_results.apply(lambda x: x['longitude'] if isinstance(x, dict) and 'longitude' in x else None)

# Display the first few rows of the DataFrame with geocoded data:
illinois_counties.head()


|   | County | State | geocode_address | latitude | longitude |
|---|---|---|---|---|---|
| 595 | Adams | Illinois | Adams County, Illinois, United States | 39.978779 | -91.211006 |
| 596 | Alexander | Illinois | Alexander County, Illinois, United States | 37.180153 | -89.350283 |
| 597 | Bond | Illinois | Bond County, Illinois, United States | 38.863033 | -89.439142 |
| 598 | Boone | Illinois | Boone County, Illinois, United States | 42.321246 | -88.823551 |
| 599 | Brown | Illinois | Brown County, Illinois, United States | 39.949821 | -90.748566 |
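
The geocoding cell above sends one request per county with no pause between calls, while the public Nominatim service limits clients to at most one request per second. GeoPy ships `geopy.extra.rate_limiter.RateLimiter` for this purpose. The sketch below shows the same idea using only the standard library so that it runs offline. `throttle` and `fake_geocode` are hypothetical names, and `fake_geocode` stands in for `geolocator.geocode`.

```python
import time


def throttle(func, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap func so that consecutive calls start at least min_delay_seconds apart."""
    last_call = [None]  # mutable cell holding the start time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        last_call[0] = clock()
        return func(*args, **kwargs)

    return wrapper


def fake_geocode(query):
    """Stand-in for geolocator.geocode so the sketch needs no network access."""
    return {"query": query}


# A short delay keeps the demo quick; a real Nominatim client would use 1.0 or more
throttled_geocode = throttle(fake_geocode, min_delay_seconds=0.05)
results = [throttled_geocode(f"{name}, Illinois, USA") for name in ["Adams", "Alexander", "Bond"]]
print(len(results), "lookups completed")
```

With GeoPy itself, wrapping `geolocator.geocode` in `RateLimiter(geolocator.geocode, min_delay_seconds=1)` and calling the wrapped function inside `geocode_county` would have the same effect.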


In [28]:
# Save the geocoded results to a CSV file 
illinois_counties.to_csv("../../data/raw-data/geocoded_population_counties.csv", index=False)
print("Geocoded county data saved to 'geocoded_population_counties.csv'")

Geocoded county data saved to 'geocoded_population_counties.csv'


## Illinois Shapefile  

To visualize the geocoded data and perform geographic exploratory data analysis (EDA), I needed a shapefile of Illinois county boundaries. The Census Bureau provides [Illinois shapefiles](https://catalog.data.gov/dataset/tiger-line-shapefile-2021-state-illinois-census-tracts){target=_blank}, but these files did not work for my EDA because of compatibility issues.  

Instead, I sourced a shapefile directly from the [**Illinois State Geological Survey (ISGS)**](https://clearinghouse.isgs.illinois.edu/data/reference/illinois-county-boundaries-polygons-and-lines){target=_blank}, which provides county boundaries in a format compatible with the GIS tools I used. This shapefile is the geographic base for mapping and visualizing trends across Illinois counties, and it adds spatial context to the other datasets [@IllinoisCountyBoundaries].  

{{< include closing.qmd >}} 