---
title: "Data Cleaning"
format:
    html: 
        toc: true
        code-fold: false
---


# Code 


### Cleaning the U.S. Exoneration Data 

*Data Cleaning Process*

- Removed the eight columns with the most missing values (`OM Tags`, `F/MFE`, `FC`, `ILD`, `P/FA`, `DNA`, `MWID`, `OM`), each missing in at least 1,291 rows, to focus on more complete data.
- Standardized column names to lowercase with underscores for easier manipulation.
- Addressed missing values in key columns:
  - Filled missing `county` values with "Unknown".
  - Dropped rows with missing `age` values, since age is critical for the analysis.
- Cleaned the `tags` column by removing the extraneous `#` symbols and replacing the `;` separators with commas for uniform formatting.
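
The two missing-value steps in the list can be sketched on a small, made-up DataFrame. This is a minimal illustration, assuming the column names have already been standardized to `county` and `age`; the `toy` frame and its values are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the exoneration dataset)
toy = pd.DataFrame({
    "age": [31.0, np.nan, 43.0, 17.0],
    "county": ["County A", "County B", None, "County B"],
})

# Fill missing county values with a placeholder label
toy["county"] = toy["county"].fillna("Unknown")

# Drop rows where age is missing, since age is needed for the analysis
toy = toy.dropna(subset=["age"]).reset_index(drop=True)

print(toy)
```

Filling `county` keeps those rows available for analyses that do not depend on location, while dropping on `age` only removes rows that lack that one field.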


```python
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the exoneration dataset
df = pd.read_csv('../../data/raw-data/US_exoneration_data.csv')
print("Initial Dataset: ")
df.head()
```


Initial Dataset: 


| | Last Name | First Name | Age | Race | Sex | State | County | Tags | Worst Crime Display | Sentence | ... | F/MFE | FC | ILD | P/FA | DNA | MWID | OM | Date of Exoneration | Date of 1st Conviction | Date of Release |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbitt | Joseph | 31.0 | Black | Male | North Carolina | Forsyth | CV;#IO;#SA | Child Sex Abuse | Life | ... | | | | | DNA | MWID | | 9/2/09 | 6/22/95 | 9/2/09 |
| 1 | Abbott | Cinque | 19.0 | Black | Male | Illinois | Cook | CIU;#IO;#NC;#P | Drug Possession or Sale | Probation | ... | | | | P/FA | | | OM | 2/1/22 | 3/25/08 | 3/25/08 |
| 2 | Abdal | Warith Habib | 43.0 | Black | Male | New York | Erie | IO;#SA | Sexual Assault | 20 to Life | ... | F/MFE | | | | DNA | MWID | OM | 9/1/99 | 6/6/83 | 9/1/99 |
| 3 | Abernathy | Christopher | 17.0 | White | Male | Illinois | Cook | CIU;#CV;#H;#IO;#JV;#SA | Murder | Life without parole | ... | | FC | | P/FA | DNA | | OM | 2/11/15 | 1/15/87 | 2/11/15 |
| 4 | Abney | Quentin | 32.0 | Black | Male | New York | New York | CV | Robbery | 20 to Life | ... | | | | | | MWID | | 1/19/12 | 3/20/06 | 1/19/12 |


```python
# Count missing values per column to identify columns with a lot of missing data
na_counts = df.isna().sum()
print(na_counts)
```

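| Column | Missing values |
|---|---:|
| Last Name | 0 |
| First Name | 0 |
| Age | 27 |
| Race | 0 |
| Sex | 0 |
| State | 0 |
| County | 66 |
| Tags | 171 |
| Worst Crime Display | 0 |
| Sentence | 0 |
| Posting Date | 0 |
| OM Tags | 1430 |
| F/MFE | 2557 |
| FC | 3133 |
| ILD | 2602 |
| P/FA | 1291 |
| DNA | 2984 |
| MWID | 2610 |
| OM | 1430 |
| Date of Exoneration | 0 |
| Date of 1st Conviction | 0 |
| Date of Release | 0 |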
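
To read these counts as shares of the dataset, the per-column missing fraction can be computed with `isna().mean()`. The sketch below uses a hypothetical toy frame and a hypothetical `threshold` of 0.5. It shows one way to select columns by missing share and is not necessarily how the columns in this project were chosen.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the exoneration dataset)
toy = pd.DataFrame({
    "age": [31.0, 19.0, np.nan, 17.0],
    "dna": ["DNA", None, None, None],
    "om": [None, "OM", "OM", None],
})

# Fraction of missing values per column
na_share = toy.isna().mean()

# Hypothetical cutoff: drop columns missing in more than half of the rows
threshold = 0.5
cols_to_drop = na_share[na_share > threshold].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)

print(na_share)
print("Dropped:", cols_to_drop)
```

Because the comparison is strict (`>`), a column missing in exactly half of the rows is kept.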


```python
# Drop columns with excessive missing values
df.drop(columns=['OM Tags', 'F/MFE', 'ILD', 'P/FA', 'DNA', 'MWID', 'FC', 'OM'], inplace=True)

# Standardize column names: lowercase, spaces replaced with '_'
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Clean the tags column: remove the unnecessary '#' and replace ';' separators with ','
df['tags'] = df['tags'].str.replace('#', '', regex=False).str.replace(';', ',', regex=False)

print(df.columns)
```


Index(['last_name', 'first_name', 'age', 'race', 'sex', 'state', 'county',
       'tags', 'worst_crime_display', 'sentence', 'posting_date',
       'date_of_exoneration', 'date_of_1st_conviction', 'date_of_release'],
      dtype='object')
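
The effect of the tag cleanup is easiest to see on a few values. The sketch below applies the same two literal replacements to a small Series whose strings follow the `;#`-separated format in the first rows of the dataset, plus one missing value. The final split into lists is an optional extra step, added here as an assumption about how the tags might be counted later.

```python
import pandas as pd

# Small sample in the raw ';#'-separated format, with one missing value
tags = pd.Series(["CV;#IO;#SA", "CIU;#IO;#NC;#P", "CV", None])

# Literal (non-regex) replacements: drop '#', turn ';' into ','
cleaned = (
    tags.str.replace("#", "", regex=False)
        .str.replace(";", ",", regex=False)
)

# Optional extra step (an assumption): split into lists for per-tag counting
tag_lists = cleaned.str.split(",")

print(cleaned.tolist())
```

Missing tags stay missing through both `.str` calls, so the 171 rows without tags need no special handling at this step.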


*Illinois Subset:*

- Filtered the dataset to focus on cases from Illinois, resulting in a subset of 548 rows to be used for further analysis.


```python
# Filter data for Illinois
IL_exonerations = df[df['state'] == 'Illinois']
print("Number of rows: ", IL_exonerations.shape[0])
IL_exonerations.head()
```

Number of rows:  548


| | last_name | first_name | age | race | sex | state | county | tags | worst_crime_display | sentence | posting_date | date_of_exoneration | date_of_1st_conviction | date_of_release |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Abbott | Cinque | 19.0 | Black | Male | Illinois | Cook | CIU,IO,NC,P | Drug Possession or Sale | Probation | 2/14/22 | 2/1/22 | 3/25/08 | 3/25/08 |
| 3 | Abernathy | Christopher | 17.0 | White | Male | Illinois | Cook | CIU,CV,H,IO,JV,SA | Murder | Life without parole | 2/13/15 | 2/11/15 | 1/15/87 | 2/11/15 |
| 5 | Abrego | Eruby | 20.0 | Hispanic | Male | Illinois | Cook | CDC,H,IO | Murder | 90 years | 8/25/22 | 7/21/22 | 9/22/04 | 7/21/22 |
| 10 | Adams | Demetris | 22.0 | Black | Male | Illinois | Cook | CIU,IO,NC,P | Drug Possession or Sale | 1 year | 4/13/20 | 2/11/20 | 9/8/04 | 12/26/04 |
| 15 | Adams | Kenneth | 22.0 | Black | Male | Illinois | Cook | CDC,H,IO,JI,SA | Murder | 75 years | 8/29/11 | 7/2/96 | 10/20/78 | 6/14/96 |
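
If a filtered subset like this is modified later, taking an explicit copy avoids pandas' `SettingWithCopyWarning`. A minimal sketch on a hypothetical toy frame follows; the `toy` data and the `il_subset` name are illustrative and not part of the project code.

```python
import pandas as pd

# Hypothetical toy data (not from the exoneration dataset)
toy = pd.DataFrame({
    "state": ["Illinois", "New York", "Illinois", "Texas"],
    "age": [19.0, 43.0, 17.0, 32.0],
})

# Boolean-mask filter; .copy() makes the subset independent of `toy`
il_subset = toy[toy["state"] == "Illinois"].copy()

# Filtering keeps the original row labels, so the index is not renumbered
print("Number of rows:", il_subset.shape[0])
print(il_subset.index.tolist())
```

The preserved row labels explain why the Illinois preview above starts at index 1 rather than 0. Calling `reset_index(drop=True)` would renumber the rows if a consecutive index is preferred.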
