# Clean Clearinghouse Dataset

The dataset "clearinghouse.csv" was provided by the Clearinghouse and generated by one of their internal SQL queries. It contains the metadata for every case whose coding is complete. This notebook outlines the steps I took to clean the dataset before building an ML model that predicts case issues and issue categories.

# Import Libraries

In [None]:
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer

# Join the issues dataset to the clearinghouse dataset that contains all the other metadata

In [None]:
issue_data = pd.read_json("data/case_issues.json")

In [None]:
cases = pd.read_csv("data/clearinghouse.csv")
len(cases)

10796

In [None]:
result = cases.merge(issue_data, on="case_id", how="left")
len(result)

10796
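The unchanged row count suggests that no `case_id` on the left matched more than one row in the issues data. A minimal sketch of how the same join could be checked more explicitly, using toy frames with made-up values (the names `cases_demo`, `issues_demo`, and `merged_demo` are hypothetical and not part of this notebook):

```python
import pandas as pd

# Toy stand-ins for the two inputs (hypothetical values, not real cases).
cases_demo = pd.DataFrame(
    {"case_id": [1, 2, 3], "case_name": ["A v. B", "C v. D", "E v. F"]}
)
issues_demo = pd.DataFrame(
    {
        "case_id": [1, 3],
        "issue_data": [{"Category X": ["issue a"]}, {"Category Y": ["issue b"]}],
    }
)

merged_demo = cases_demo.merge(
    issues_demo,
    on="case_id",
    how="left",
    validate="one_to_one",  # raises MergeError if case_id repeats on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Rows marked "left_only" are cases with no matching issue data.
print(len(merged_demo))
print(merged_demo["_merge"].value_counts())
```

With `validate="one_to_one"`, a duplicated `case_id` fails loudly instead of silently adding rows, and the `_merge` column shows how many cases found no match.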

In [None]:
result.head()

Unnamed: 0,case_id,case_name,case_status,case_state,court_name,case_type,case_ongoing,case_special_collections,case_causes_of_action,case_issues,issue_data
0,8429,"EEOC v. EADS AEROFRAME SERVICES, LLC",Coding Complete,Louisiana,Western District of Louisiana,Equal Employment,No,"EEOC Study — in sample, Multi-LexSum (in sample)","Title VII (including PDA), 42 U.S.C. § 2000e","Race discrimination, Black, Hispanic, White, R...",{'Affected National Origin/Ethnicity(s)': ['Hi...
1,9719,Regan v. Salt Lake County,Coding Complete,Utah,District of Utah,Jail Conditions,No,Strip Search Cases,42 U.S.C. § 1983,"Search policies, Strip search policy (faciliti...","{'Affected Sex/Gender(s)': ['Female'], 'Discri..."
2,18526,Kerrigan v. Philadelphia Board of Elections,Coding Complete,Pennsylvania,Eastern District of Pennsylvania,Disability Rights,No,,"Americans with Disabilities Act (ADA), 42 U.S....","Disability (inc. reasonable accommodations), M...",{'Disability and Disability Rights': ['Mobilit...
3,6263,EEOC v. TRANSIT MIX CONCRETE,Coding Complete,Texas,Northern District of Texas,Equal Employment,No,"EEOC Study — in sample, IWPR/Wage Project Cons...","Title VII (including PDA), 42 U.S.C. § 2000e","Race discrimination, Black, Disparate Treatmen...","{'Affected Race(s)': ['Black'], 'Discriminatio..."
4,10082,Shorter v. DC,Coding Complete,District of Columbia,District of District of Columbia,Policing,No,,42 U.S.C. § 1983,"Hearing impairment, Disability (inc. reasonabl...",{'Disability and Disability Rights': ['Hearing...


## Check for cases with blank issue data

In [None]:
result[(result["issue_data"].isna()) & (~result["case_issues"].isna())]

Unnamed: 0,case_id,case_name,case_status,case_state,court_name,case_type,case_ongoing,case_special_collections,case_causes_of_action,case_issues,issue_data


In [None]:
result[(~result["issue_data"].isna()) & (result["case_issues"].isna())]

Unnamed: 0,case_id,case_name,case_status,case_state,court_name,case_type,case_ongoing,case_special_collections,case_causes_of_action,case_issues,issue_data
454,46103,Jones v. Trump,Coding Complete,District of Columbia,District of District of Columbia,Prison Conditions,Yes,Trump Administration 2.0: Challenges to the Go...,"Section 504 (Rehabilitation Act), 29 U.S.C. § ...",,{'Discrimination Area': ['Disparate Treatment'...
8349,46348,American Foreign Service Association v. Trump,Coding Complete,District of Columbia,District of District of Columbia,Labor Rights,Yes,Trump Administration 2.0: Challenges to the Go...,,,{'Presidential/Gubernatorial Authority': ['Civ...


## Drop all cases where issue_data is blank, as we cannot use them for prediction or evaluation

In [None]:
len(result)

10796

In [None]:
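# Keep only rows with issue data; .copy() avoids a SettingWithCopyWarning
# when new columns are added to the filtered frame later on.
result = result[~result["issue_data"].isna()].copy()
len(result)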

10549

## Extract the issue categories and issues from issue data

In [None]:
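# Issue categories are the keys of each issue_data dict.
result['issue_category'] = result["issue_data"].apply(lambda d: list(d.keys()) if isinstance(d, dict) else [])
# Issues are the values of each dict, flattened into one sorted list per case.
result['issues'] = result["issue_data"].apply(
    lambda d: sorted([item for sublist in d.values() for item in sublist]) if isinstance(d, dict) else []
)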
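To make the extraction concrete, here is a small sketch of the same logic applied to a single value. The helper name `split_issue_data` and the example dict are illustrative assumptions; the dict only mimics the shape of `issue_data` seen in the preview above.

```python
def split_issue_data(d):
    """Return (categories, sorted flat issue list) for one issue_data value."""
    if not isinstance(d, dict):
        # Missing values (e.g. NaN) produce empty lists, as in the cell above.
        return [], []
    categories = list(d.keys())
    issues = sorted(item for sublist in d.values() for item in sublist)
    return categories, issues


# Hypothetical example value with the same shape as issue_data.
example = {
    "Affected Race(s)": ["Black"],
    "Discrimination Area": ["Disparate Treatment", "Discharge"],
}

categories, issues = split_issue_data(example)
print(categories)  # ['Affected Race(s)', 'Discrimination Area']
print(issues)      # ['Black', 'Discharge', 'Disparate Treatment']

# A non-dict input falls back to two empty lists.
print(split_issue_data(float("nan")))  # ([], [])
```

Note that flattening discards which category each issue belongs to, so the two columns can be modelled independently but cannot be re-paired without `issue_data`.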

In [None]:
result.columns

Index(['case_id', 'case_name', 'case_status', 'case_state', 'court_name',
       'case_type', 'case_ongoing', 'case_special_collections',
       'case_causes_of_action', 'case_issues', 'issue_data', 'issue_category',
       'issues'],
      dtype='object')

## Drop columns we don't need

In [None]:
result = result[['case_id', 'case_name', 'case_status', 'case_state', 'court_name',
       'case_type', 'case_ongoing', 'case_special_collections',
       'case_causes_of_action', 'issue_category', 'issues']]
result.head()

Unnamed: 0,case_id,case_name,case_status,case_state,court_name,case_type,case_ongoing,case_special_collections,case_causes_of_action,issue_category,issues
0,8429,"EEOC v. EADS AEROFRAME SERVICES, LLC",Coding Complete,Louisiana,Western District of Louisiana,Equal Employment,No,"EEOC Study — in sample, Multi-LexSum (in sample)","Title VII (including PDA), 42 U.S.C. § 2000e","[Affected National Origin/Ethnicity(s), Affect...","[Black, Conditions of Employment (including as..."
1,9719,Regan v. Salt Lake County,Coding Complete,Utah,District of Utah,Jail Conditions,No,Strip Search Cases,42 U.S.C. § 1983,"[Affected Sex/Gender(s), Discrimination Basis,...","[Female, Search policies, Sex discrimination, ..."
2,18526,Kerrigan v. Philadelphia Board of Elections,Coding Complete,Pennsylvania,Eastern District of Pennsylvania,Disability Rights,No,,"Americans with Disabilities Act (ADA), 42 U.S....","[Disability and Disability Rights, Discriminat...","[Disability (inc. reasonable accommodations), ..."
3,6263,EEOC v. TRANSIT MIX CONCRETE,Coding Complete,Texas,Northern District of Texas,Equal Employment,No,"EEOC Study — in sample, IWPR/Wage Project Cons...","Title VII (including PDA), 42 U.S.C. § 2000e","[Affected Race(s), Discrimination Area, Discri...","[Black, Direct Suit on Merits, Disparate Treat..."
4,10082,Shorter v. DC,Coding Complete,District of Columbia,District of District of Columbia,Policing,No,,42 U.S.C. § 1983,"[Disability and Disability Rights, Discriminat...","[Disability (inc. reasonable accommodations), ..."


## Save the result df for future use

In [None]:
result.to_json("data/clean.json")

## Do some EDA on the result df

In [None]:
cols = ["case_state", "court_name", "case_type", "case_ongoing", "case_special_collections"]

In [None]:
for col in cols:
    print("---", col, "---")
    display(result[col].nunique())
    display(result[col].value_counts())

--- case_state ---


57

case_state
California                      1064
New York                         826
Texas                            675
District of Columbia             620
Illinois                         549
Pennsylvania                     399
Michigan                         393
Florida                          386
Alabama                          294
Georgia                          283
Louisiana                        276
Washington                       254
Tennessee                        247
Maryland                         247
Arizona                          244
Ohio                             231
North Carolina                   224
Missouri                         218
Massachusetts                    209
Virginia                         201
Indiana                          199
New Jersey                       186
Colorado                         185
Mississippi                      168
Minnesota                        133
Arkansas                         131
Oklahoma                   

--- court_name ---


193

court_name
District of District of Columbia                 504
Northern District of Illinois                    442
Southern District of New York                    438
Northern District of California                  432
No Court                                         329
                                                ... 
West Virginia state appellate court                1
U.S. Court of Appeals for the Seventh Circuit      1
Washington state supreme court                     1
Maryland state trial court                         1
Supreme Court of the United States                 1
Name: count, Length: 193, dtype: int64

--- case_type ---


27

case_type
Equal Employment                             3141
Prison Conditions                            1088
Immigration and/or the Border                 886
Jail Conditions                               820
Election/Voting Rights                        792
Disability Rights                             459
Public Benefits/Government Services           452
Policing                                      408
Criminal Justice (Other)                      323
Education                                     298
Speech and Religious Freedom                  284
Healthcare Access and Reproductive Issues     265
National Security                             245
Fair Housing/Lending/Insurance                224
Juvenile Institution                          205
Presidential/Gubernatorial Authority          127
Intellectual Disability (Facility)            107
School Desegregation                           89
Child Welfare                                  85
Mental Health (Facility)                

--- case_ongoing ---


5

case_ongoing
No                           7611
Yes                          2231
No reason to think so         550
Perhaps, but long-dormant     100
Unknown                        57
Name: count, dtype: int64

--- case_special_collections ---


270

case_special_collections
Multi-LexSum (in sample)                                                                                                                                                                                                                                      1659
EEOC Study — in sample                                                                                                                                                                                                                                        1057
EEOC Study — in sample, Multi-LexSum (in sample)                                                                                                                                                                                                               885
Law Firm Antiracism Alliance (LFAA) project                                                                                                                                                           

In [None]:
mlb = MultiLabelBinarizer()
binary_matrix = mlb.fit_transform(result['issue_category'].to_list())
binary_matrix.shape

(10549, 22)

In [None]:
category_df = pd.DataFrame(binary_matrix, columns=mlb.classes_)
category_df.sum().sort_values(ascending=False)

General/Misc.                                                6342
Discrimination Basis                                         5202
Discrimination Area                                          4175
EEOC-centric                                                 2542
Affected Sex/Gender(s)                                       2454
Jails, Prisons, Detention Centers, and Other Institutions    2325
Affected Race(s)                                             1362
Disability and Disability Rights                             1300
Medical/Mental Health Care                                   1256
Immigration/Border                                            929
Voting                                                        826
Affected National Origin/Ethnicity(s)                         598
Policing                                                      562
Reproductive rights                                           463
LGBTQ+                                                        417
Benefits (

In [None]:
mlb = MultiLabelBinarizer()
binary_matrix = mlb.fit_transform(result['issues'].to_list())
binary_matrix.shape

(10549, 403)

In [None]:
issues_df = pd.DataFrame(binary_matrix, columns=mlb.classes_)
issues_df.sum().sort_values(ascending=False)

Disparate Treatment                                          3377
Direct Suit on Merits                                        2468
Female                                                       1961
Sex discrimination                                           1657
Race discrimination                                          1563
                                                             ... 
Currency                                                        1
Catholicism                                                     1
Buddhism                                                        1
Underground Storage Tank (UST) leakage                          1
Local / state enforcement of immigration laws (duplicate)       1
Length: 403, dtype: int64
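As a cross-check on the column sums above, the same per-label counts can be computed without building the binary matrix. The sketch below uses a toy list of label lists (`label_lists` is a hypothetical stand-in for `result['issues'].to_list()`). Because `MultiLabelBinarizer` produces 0/1 indicators, a label repeated within one case counts once, which is why each list is converted to a set first.

```python
from collections import Counter

# Hypothetical stand-in for result['issues'].to_list().
label_lists = [
    ["Black", "Disparate Treatment"],
    ["Female", "Disparate Treatment"],
    ["Female", "Female"],  # duplicate within one case counts once
]

# Count the number of cases each label appears in.
counts = Counter(label for labels in label_lists for label in set(labels))

for label, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{label:<25}{n}")
```

If these counts disagree with the sums of the binarized frame, that would point to a problem in how the label lists were built rather than in the model inputs.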

In [None]:
issues_counts = issues_df.sum()

threshold = issues_counts.quantile(0.90)
issues = issues_counts[issues_counts >= threshold]
issues.sort_values(ascending=False)

Disparate Treatment                                                                           3377
Direct Suit on Merits                                                                         2468
Female                                                                                        1961
Sex discrimination                                                                            1657
Race discrimination                                                                           1563
Discharge / Constructive Discharge / Layoff                                                   1256
Harassment / Hostile Work Environment                                                         1142
Disability (inc. reasonable accommodations)                                                   1090
Black                                                                                         1084
Retaliation                                                                                    879
Pattern or