## TODO:
1. Load court data
2. Load Cooper Center race data, removing the header
3. Load Cooper Center Hispanic data, removing the header
4. Merge Cooper files, maybe filtering for only 1 conviction
5. Collapse courts' data to counts of unique people within each race/code section
6. Reshape the resultant dataset so that race is in columns
7. Merge with the Cooper datasets
8. Produce "disparity index" (% in data/ % in population) for each respective race
9. Sort the dataframe to see the results 
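
Steps 8 and 9 can be sketched on a toy frame as follows. The column names follow the `*count` / `*pop` pattern used later in this notebook, but the numbers are made up and the `disparity_index` helper is hypothetical. Which denominator to use for the share in the data (here: the sum of the counts in the row) is an assumption, not something the list above specifies.

```python
import pandas as pd

# Toy data: made-up counts and populations, not taken from the real files
toy = pd.DataFrame({
    "FIPS": [1, 3],
    "blackcount": [30, 10],
    "whitecount": [70, 90],
    "blackpop": [200, 100],
    "whitepop": [800, 900],
    "totalpop": [1000, 1000],
})


def disparity_index(df, race):
    """(% of people in the court data) / (% of the population) for one race.

    Hypothetical helper: the share in the data is computed over the black
    and white counts only, because those are the only counts in the toy frame.
    """
    share_in_data = df[f"{race}count"] / (df["blackcount"] + df["whitecount"])
    share_in_pop = df[f"{race}pop"] / df["totalpop"]
    return share_in_data / share_in_pop


toy["black_disparity"] = disparity_index(toy, "black")
toy["white_disparity"] = disparity_index(toy, "white")

# Step 9: sort to see the largest disparities first
print(toy.sort_values("black_disparity", ascending=False))
```

A value above 1 means the group is over-represented in the court data relative to its population share, and a value below 1 means it is under-represented.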

## 1. Load dependencies and read the Courts' data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read the smaller file with court cases
court = pd.read_csv("data1k.csv")
court.head(1).T

Unnamed: 0,0
person_id,199031000000439
HearingDate,2018-06-01
CodeSection,A.46.2-862
codesection,covered elsewhere
ChargeType,Misdemeanor
chargetype,Misdemeanor
Class,
DispositionCode,Guilty
disposition,Conviction
Plea,


In [3]:
# Check how Race is coded
court.Race.value_counts()

White Caucasian(Non-Hispanic)                  1093
Black(Non-Hispanic)                             864
White Caucasian (Non-Hispanic)                  342
Black (Non-Hispanic)                            311
Hispanic                                         88
Asian Or Pacific Islander                        27
Other(Includes Not Applicable.. Unknown)         27
White                                            17
Unknown (Includes Not Applicable.. Unknown)      11
MISSING                                          10
Black                                            10
Other (Includes Not Applicable.. Unknown)         8
American Indian                                   3
Name: Race, dtype: int64

In [3]:
# Get all the labels used in Race column
court.Race.unique()

array(['Black(Non-Hispanic)', 'White Caucasian (Non-Hispanic)',
       'Hispanic', 'White Caucasian(Non-Hispanic)',
       'Black (Non-Hispanic)', 'Asian Or Pacific Islander',
       'Other(Includes Not Applicable.. Unknown)',
       'Other (Includes Not Applicable.. Unknown)', 'MISSING',
       'Unknown (Includes Not Applicable.. Unknown)', 'Black', 'White',
       'American Indian'], dtype=object)

In [4]:
# Map every raw Race label to the standardized label used in the rest of the analysis
replace_map = {'Black(Non-Hispanic)': 'Black (Non-Hispanic)', 
               'Hispanic': 'Hispanic', 
               'White Caucasian(Non-Hispanic)': 'White (Non-Hispanic)',
               'MISSING': 'Missing or Other', 
               'Asian Or Pacific Islander': 'Asian or Pacific Islander', 
               'Black (Non-Hispanic)': 'Black (Non-Hispanic)',
               'White Caucasian (Non-Hispanic)': 'White (Non-Hispanic)',
               'Other(Includes Not Applicable.. Unknown)': 'Missing or Other',
               'Other (Includes Not Applicable.. Unknown)': 'Missing or Other', 
               'Black': 'Black (Non-Hispanic)', 
               'White': 'White (Non-Hispanic)',
               'Unknown (Includes Not Applicable.. Unknown)': 'Missing or Other', 
               'American Indian': 'American Indian or Alaskan Native',
               'Unknown': 'Missing or Other', 
               'Asian or Pacific Islander': 'Asian or Pacific Islander',
               'American Indian Or Alaskan Native': 'American Indian or Alaskan Native'}

# Remap the Race labels using replace_map
court['Race'] = court.Race.replace(replace_map)

In [5]:
# Check the results
court.Race.value_counts()

White (Non-Hispanic)                 1452
Black (Non-Hispanic)                 1185
Hispanic                               88
Missing or Other                       56
Asian or Pacific Islander              27
American Indian or Alaskan Native       3
Name: Race, dtype: int64
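
A quick sanity check after remapping is to confirm that no raw label slipped past the mapping. The sketch below uses a shortened, illustrative subset of the mapping and a made-up series of labels. It uses `Series.map` rather than `replace` so that labels without an entry become NaN and are easy to spot.

```python
import pandas as pd

# Shortened, illustrative mapping (a subset of the labels seen above)
label_map = {
    "Black(Non-Hispanic)": "Black (Non-Hispanic)",
    "Black (Non-Hispanic)": "Black (Non-Hispanic)",
    "Black": "Black (Non-Hispanic)",
    "White Caucasian(Non-Hispanic)": "White (Non-Hispanic)",
    "White": "White (Non-Hispanic)",
    "MISSING": "Missing or Other",
}

# Made-up raw labels; the last one is deliberately not in the mapping
raw = pd.Series(["Black", "White", "MISSING", "Black(Non-Hispanic)", "Whte"])

# map() yields NaN where a label has no entry, unlike replace(),
# which would silently keep the original value
mapped = raw.map(label_map)
unmapped = sorted(raw[mapped.isna()].unique())
print(unmapped)
```

An empty list would mean every label in the column is covered by the mapping.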

In [7]:
# Select columns of interest for further research
my_columns = ['HearingDate', 'CodeSection', 'Race', 'disposition', 'fips', 'expungable']
# .copy() gives an independent frame for the column assignments made later
court_small = court[my_columns].copy()
court_small.head()

Unnamed: 0,HearingDate,CodeSection,Race,disposition,fips,expungable
0,2018-06-01,A.46.2-862,Black (Non-Hispanic),Conviction,117,Automatic (pending)
1,2000-08-07,18.2-26,White (Non-Hispanic),Conviction,191,Petition
2,2000-08-07,18.2-95,White (Non-Hispanic),Conviction,191,Petition
3,2019-09-25,46.2-300,Hispanic,Conviction,23,Automatic (pending)
4,2010-05-03,46.2-613(2),White (Non-Hispanic),Conviction,840,Automatic


In [8]:
# Check data types
court_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2811 entries, 0 to 2810
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   HearingDate  2811 non-null   object
 1   CodeSection  2811 non-null   object
 2   Race         2811 non-null   object
 3   disposition  2811 non-null   object
 4   fips         2811 non-null   int64 
 5   expungable   2811 non-null   object
dtypes: int64(1), object(5)
memory usage: 131.9+ KB


In [9]:
# Convert HearingDate from object (string) to datetime
court_small['HearingDate'] = pd.to_datetime(court_small['HearingDate'])


In [10]:
# Alternatively, extract just the year of the hearing into its own column
court_small['year'] = court_small['HearingDate'].dt.year


In [11]:
# Now we can aggregate using the "year" column
court_small.head()

Unnamed: 0,HearingDate,CodeSection,Race,disposition,fips,expungable,year
0,2018-06-01,A.46.2-862,Black (Non-Hispanic),Conviction,117,Automatic (pending),2018
1,2000-08-07,18.2-26,White (Non-Hispanic),Conviction,191,Petition,2000
2,2000-08-07,18.2-95,White (Non-Hispanic),Conviction,191,Petition,2000
3,2019-09-25,46.2-300,Hispanic,Conviction,23,Automatic (pending),2019
4,2010-05-03,46.2-613(2),White (Non-Hispanic),Conviction,840,Automatic,2010
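
Selecting columns from one frame and then assigning new columns to the result can trigger pandas' `SettingWithCopyWarning`. Taking an explicit copy of the selection is the usual way to avoid it. A minimal sketch on made-up rows that mimic the columns used above (`court_toy` and `subset` are illustrative names):

```python
import pandas as pd

# Made-up rows that mimic the columns used above
court_toy = pd.DataFrame({
    "HearingDate": ["2018-06-01", "2000-08-07"],
    "CodeSection": ["A.46.2-862", "18.2-26"],
    "person_id": [1, 2],
})

# .copy() makes the subset an independent frame, so adding columns
# to it is unambiguous
subset = court_toy[["HearingDate", "CodeSection"]].copy()
subset["HearingDate"] = pd.to_datetime(subset["HearingDate"])
subset["year"] = subset["HearingDate"].dt.year

print(subset.dtypes)
print(court_toy.columns.tolist())  # the original frame is unchanged
```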


In [12]:
# Inspect hearings dated 2021 or later, most recent first
court_small.query("year >= 2021").sort_values('HearingDate', ascending=False)

Unnamed: 0,HearingDate,CodeSection,Race,disposition,fips,expungable,year
605,2024-04-16,18.2-95,White (Non-Hispanic),Conviction,550,Not eligible,2024
606,2024-04-16,18.2-23,White (Non-Hispanic),Conviction,550,Not eligible,2024
607,2024-04-16,18.2-108.01,White (Non-Hispanic),Conviction,550,Not eligible,2024
834,2021-01-30,18.2-248,White (Non-Hispanic),Conviction,161,Not eligible,2021


In [13]:
# Note: the rows above have hearing years after 2020; they should not be in the dataset
# Count the records flagged as "Automatic" in each year
court_small['Automatic'] = court_small['expungable'] == "Automatic"
court_small.groupby('year').agg({'Automatic':'sum'}).head()


Unnamed: 0_level_0,Automatic
year,Unnamed: 1_level_1
2000,1
2001,5
2002,7
2003,1
2004,4


In [14]:
# Most frequent code sections:
court_small['CodeSection'].value_counts()

B.46.2-301        268
A.46.2-862        253
46.2-300          177
C.46.2-862        123
18.2-250.1         96
                 ... 
A.46.2-704          1
4VAC20-1250-30      1
14-104              1
F.46.2-875          1
46.2-1052           1
Name: CodeSection, Length: 345, dtype: int64

In [16]:
# Load the FIPS lookup table (FIPS code and court name)
fips = pd.read_csv('fips.csv')
fips.head()

Unnamed: 0,fips,name
0,1,Accomack Circuit Court
1,3,Albemarle Circuit Court
2,5,Alleghany Circuit Court
3,7,Amelia Circuit Court
4,9,Amherst Circuit Court


## 2. Load the Cooper basic demographics data

In [43]:
# Reading excel file from URL
url = 'https://demographics.coopercenter.org/sites/demographics/files/media/files/2020-07/Census_2019_RaceEstimates_forVA_0.xls'
cooper_race = pd.read_excel(url, header=4)
cooper_race.head(5)


Unnamed: 0,FIPS,Jurisdiction,Total Population,White Alone,Unnamed: 4,African American Alone,Unnamed: 6,Asian Alone,Unnamed: 8,Other Races Alone,Unnamed: 10,Two or more races,Unnamed: 12
0,,,,,,,,,,,,,
1,,Virginia,8535519.0,5922648.0,0.693883,1696911.0,0.198806,589710.0,0.069089,56694.0,0.006642,269556.0,0.031581
2,,,,,,,,,,,,,
3,1.0,Accomack County,32316.0,21899.0,0.677652,9304.0,0.287907,257.0,0.007953,293.0,0.009067,563.0,0.017422
4,3.0,Albemarle County,109330.0,89388.0,0.817598,10600.0,0.096954,6051.0,0.055346,483.0,0.004418,2808.0,0.025684


In [44]:
# Keep only rows with a FIPS code (drops the blank rows and the statewide row);
# the .notnull() call inside query() is evaluated with the python engine
cooper_race = cooper_race.query("FIPS.notnull()", engine='python')
cooper_race.head(5)

Unnamed: 0,FIPS,Jurisdiction,Total Population,White Alone,Unnamed: 4,African American Alone,Unnamed: 6,Asian Alone,Unnamed: 8,Other Races Alone,Unnamed: 10,Two or more races,Unnamed: 12
3,1.0,Accomack County,32316.0,21899.0,0.677652,9304.0,0.287907,257.0,0.007953,293.0,0.009067,563.0,0.017422
4,3.0,Albemarle County,109330.0,89388.0,0.817598,10600.0,0.096954,6051.0,0.055346,483.0,0.004418,2808.0,0.025684
5,5.0,Alleghany County,14860.0,13783.0,0.927524,698.0,0.046972,46.0,0.003096,56.0,0.003769,277.0,0.018641
6,7.0,Amelia County,13145.0,10050.0,0.764549,2688.0,0.204488,80.0,0.006086,85.0,0.006466,242.0,0.01841
7,9.0,Amherst County,31605.0,24299.0,0.768834,6041.0,0.191141,180.0,0.005695,305.0,0.00965,780.0,0.02468


In [46]:
# Selecting the required columns only
cooper_race = cooper_race[['FIPS', 
                           'Jurisdiction', 
                           'Total Population', 
                           'White Alone', 
                           'African American Alone',
                           'Asian Alone']]

cooper_race.head()

Unnamed: 0,FIPS,Jurisdiction,Total Population,White Alone,African American Alone,Asian Alone
3,1.0,Accomack County,32316.0,21899.0,9304.0,257.0
4,3.0,Albemarle County,109330.0,89388.0,10600.0,6051.0
5,5.0,Alleghany County,14860.0,13783.0,698.0,46.0
6,7.0,Amelia County,13145.0,10050.0,2688.0,80.0
7,9.0,Amherst County,31605.0,24299.0,6041.0,180.0


In [49]:
# Renaming columns
cooper_race = cooper_race.rename({'Total Population': 'totalpop',
                    'White Alone': 'whitepop',
                    'African American Alone': 'blackpop',
                    'Asian Alone': 'asianpop'}, axis=1)

cooper_race.head()

Unnamed: 0,FIPS,Jurisdiction,totalpop,whitepop,blackpop,asianpop
3,1.0,Accomack County,32316.0,21899.0,9304.0,257.0
4,3.0,Albemarle County,109330.0,89388.0,10600.0,6051.0
5,5.0,Alleghany County,14860.0,13783.0,698.0,46.0
6,7.0,Amelia County,13145.0,10050.0,2688.0,80.0
7,9.0,Amherst County,31605.0,24299.0,6041.0,180.0


## 3. Load the Cooper Hispanic demographics data

In [30]:
# Same as above: reading the Excel file from the URL
url2 = 'https://demographics.coopercenter.org/sites/demographics/files/media/files/2020-07/Census_2019_HispanicEstimates_forVA_0.xls'
cooper_hisp = pd.read_excel(url2, header=4)
cooper_hisp.head()


Unnamed: 0,FIPS,Jurisdiction,"Decennial Census Count, April 1, 2010",Unnamed: 3,Unnamed: 4,"Population Estimate, July 1, 2019",Unnamed: 6,Unnamed: 7,"April 1, 2010 - July 1, 2019",Unnamed: 9
0,,,Total Population,Hispanic Population,,Total Population,Hispanic Population,,Hispanic Change,
1,,,,Total Hispanic,(%),,Total Hispanic,(%),Total Hispanic,(%)
2,,,,,,,,,,
3,,Virginia,8001024,631825,0.078968,8535519,834422,0.097759,202597,0.320654
4,,,,,,,,,,


In [35]:
# Keep only rows with a FIPS code (evaluated with the python engine, as above)
cooper_hisp = cooper_hisp.query("FIPS.notnull()", engine='python')
# 'Unnamed: 6' holds the total Hispanic population from the July 1, 2019 estimate
cooper_hisp = cooper_hisp[['FIPS', 'Unnamed: 6']]
cooper_hisp = cooper_hisp.rename({'Unnamed: 6': 'hisppop'}, axis=1)
cooper_hisp.head(5)

Unnamed: 0,FIPS,hisppop
5,1.0,2955
6,3.0,6313
7,5.0,238
8,7.0,418
9,9.0,767


## 4. Merge the Cooper files

In [51]:
# Merge on FIPS and check whether the merge was successful
cooper = pd.merge(cooper_race, cooper_hisp,
                  on='FIPS',
                  how='outer',
                  indicator='matched',
                  validate='one_to_one'
                 )
cooper.head()


Unnamed: 0,FIPS,Jurisdiction,totalpop,whitepop,blackpop,asianpop,hisppop,matched
0,1.0,Accomack County,32316.0,21899.0,9304.0,257.0,2955,both
1,3.0,Albemarle County,109330.0,89388.0,10600.0,6051.0,6313,both
2,5.0,Alleghany County,14860.0,13783.0,698.0,46.0,238,both
3,7.0,Amelia County,13145.0,10050.0,2688.0,80.0,418,both
4,9.0,Amherst County,31605.0,24299.0,6041.0,180.0,767,both


In [53]:
# Check if all matched as "both"
cooper.matched.value_counts()

both          133
left_only       0
right_only      0
Name: matched, dtype: int64
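
The `indicator` and `validate` arguments used in the merge above can be seen at work on a small made-up example. The indicator column records where each row came from, and `validate='one_to_one'` raises `pandas.errors.MergeError` if a key is duplicated on either side. The frames and values below are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"FIPS": [1, 3, 5], "totalpop": [100, 200, 300]})
right = pd.DataFrame({"FIPS": [1, 3, 7], "hisppop": [10, 20, 70]})

merged = pd.merge(left, right, on="FIPS", how="outer",
                  indicator="matched", validate="one_to_one")
counts = merged["matched"].value_counts()
print(counts)

# A duplicated key on one side violates the one-to-one assumption
dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, dup, on="FIPS", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("duplicate key detected:", raised)
```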

In [54]:
cooper.head()

Unnamed: 0,FIPS,Jurisdiction,totalpop,whitepop,blackpop,asianpop,hisppop,matched
0,1.0,Accomack County,32316.0,21899.0,9304.0,257.0,2955,both
1,3.0,Albemarle County,109330.0,89388.0,10600.0,6051.0,6313,both
2,5.0,Alleghany County,14860.0,13783.0,698.0,46.0,238,both
3,7.0,Amelia County,13145.0,10050.0,2688.0,80.0,418,both
4,9.0,Amherst County,31605.0,24299.0,6041.0,180.0,767,both


## 5. Aggregate the Courts' data to count ids

In [58]:
# Keep convictions only
court_conv = court.query("disposition == 'Conviction'")
court_conv.head()

Unnamed: 0,person_id,HearingDate,CodeSection,codesection,ChargeType,chargetype,Class,DispositionCode,disposition,Plea,...,within7,within10,class1_2,class3_4,expungable,old_expungable,expungable_no_lifetimelimit,reason,sameday,lifetime
0,199031000000439,2018-06-01,A.46.2-862,covered elsewhere,Misdemeanor,Misdemeanor,,Guilty,Conviction,,...,True,True,False,False,Automatic (pending),False,Automatic (pending),Conviction of misdemeanor charges listed in 19...,False,False
1,15100000000316,2000-08-07,18.2-26,covered elsewhere,Felony,Felony,,Guilty,Conviction,,...,False,False,False,False,Petition,False,Petition,Conviction or deferred dismissal of felony cha...,False,False
2,15100000000316,2000-08-07,18.2-95,covered elsewhere,Felony,Felony,,Guilty,Conviction,,...,False,False,False,False,Petition,False,Petition,Conviction or deferred dismissal of felony cha...,False,False
3,10210000000095,2019-09-25,46.2-300,covered elsewhere,Misdemeanor,Misdemeanor,,Guilty In Absentia,Conviction,,...,True,True,False,False,Automatic (pending),False,Automatic (pending),Conviction of misdemeanor charges listed in 19...,False,False
4,51220000000305,2010-05-03,46.2-613(2),covered elsewhere,Misdemeanor,Misdemeanor,1.0,Guilty,Conviction,Guilty,...,False,False,False,False,Automatic,False,Automatic,Conviction of misdemeanor charges listed in 19...,False,False


In [68]:
# Count unique people within each race / locality / code section
court_agg = court_conv.groupby(['Race',
                                'fips',
                                'CodeSection'
                               ])['person_id'].nunique().reset_index()


court_agg

Unnamed: 0,Race,fips,CodeSection,person_id
0,American Indian or Alaskan Native,25,A.46.2-862,1
1,Asian or Pacific Islander,13,18.2-56.1,1
2,Asian or Pacific Islander,25,A.46.2-862,1
3,Asian or Pacific Islander,41,C.46.2-862,1
4,Asian or Pacific Islander,53,A.46.2-862,1
...,...,...,...,...
1162,White (Non-Hispanic),820,A.46.2-862,1
1163,White (Non-Hispanic),830,46.2-300,1
1164,White (Non-Hispanic),840,46.2-300,1
1165,White (Non-Hispanic),840,46.2-613(2),1
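
The aggregation above uses `nunique()` rather than `count()`, so a person with several charges under the same code section is counted once. A small made-up example of the difference (the ids and rows are illustrative):

```python
import pandas as pd

# Person 101 appears twice under the same code section
toy = pd.DataFrame({
    "person_id": [101, 101, 102, 103],
    "Race": ["Hispanic"] * 4,
    "fips": [23] * 4,
    "CodeSection": ["46.2-300"] * 4,
})
keys = ["Race", "fips", "CodeSection"]

# count() counts rows (charges); nunique() counts distinct people
charges = toy.groupby(keys)["person_id"].count().reset_index(name="charges")
people = toy.groupby(keys)["person_id"].nunique().reset_index(name="people")

print(charges)
print(people)
```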


## 6. Reshape so that we have different columns by Race

In [72]:
court_pivot = court_agg.pivot_table(index = ['fips', 'CodeSection'],
                      columns = ['Race'],
                      values = ['person_id'],
                      fill_value=0).reset_index()


court_pivot.head()

Unnamed: 0_level_0,fips,CodeSection,person_id,person_id,person_id,person_id,person_id,person_id
Race,Unnamed: 1_level_1,Unnamed: 2_level_1,American Indian or Alaskan Native,Asian or Pacific Islander,Black (Non-Hispanic),Hispanic,Missing or Other,White (Non-Hispanic)
0,1,C.46.2-894,0,0,0,0,0,1
1,3,18.2-192,0,0,0,0,0,1
2,3,18.2-195,0,0,0,0,0,1
3,3,18.2-250.1,0,0,1,0,0,0
4,3,18.2-258.1,0,0,0,0,0,1


In [73]:
# The MultiIndex columns are awkward to work with
court_pivot.columns

MultiIndex([(       'fips',                                  ''),
            ('CodeSection',                                  ''),
            (  'person_id', 'American Indian or Alaskan Native'),
            (  'person_id',         'Asian or Pacific Islander'),
            (  'person_id',              'Black (Non-Hispanic)'),
            (  'person_id',                          'Hispanic'),
            (  'person_id',                  'Missing or Other'),
            (  'person_id',              'White (Non-Hispanic)')],
           names=[None, 'Race'])

In [74]:
# Overwrite the column names manually (the order must match the MultiIndex above)
court_pivot.columns = ['FIPS', 
                       'CodeSection', 
                       'amerindcount', 
                       'asiancount', 
                       'blackcount', 
                       'hispcount',
                       'missingcount',
                       'whitecount'
                      ]

In [75]:
court_pivot.head()

Unnamed: 0,FIPS,CodeSection,amerindcount,asiancount,blackcount,hispcount,missingcount,whitecount
0,1,C.46.2-894,0,0,0,0,0,1
1,3,18.2-192,0,0,0,0,0,1
2,3,18.2-195,0,0,0,0,0,1
3,3,18.2-250.1,0,0,1,0,0,0
4,3,18.2-258.1,0,0,0,0,0,1
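
An alternative to overwriting the column names by position is to avoid the MultiIndex in the first place and rename by label. Passing `values` as a scalar instead of a list yields single-level columns. The sketch below runs on made-up counts. The `short` dictionary is a hypothetical mapping, and `aggfunc="sum"` is chosen here for clarity (the cell above relies on the default).

```python
import pandas as pd

# Made-up aggregated counts in the same shape as court_agg
agg = pd.DataFrame({
    "Race": ["Hispanic", "White (Non-Hispanic)", "Hispanic"],
    "fips": [1, 1, 3],
    "CodeSection": ["46.2-300", "46.2-300", "18.2-95"],
    "person_id": [2, 1, 1],
})

pivot = agg.pivot_table(index=["fips", "CodeSection"],
                        columns="Race",
                        values="person_id",   # scalar, not a list
                        aggfunc="sum",
                        fill_value=0).reset_index()
pivot.columns.name = None  # drop the leftover "Race" label on the columns

# Hypothetical short names; columns missing from the dict keep their name
short = {"fips": "FIPS",
         "Hispanic": "hispcount",
         "White (Non-Hispanic)": "whitecount"}
pivot = pivot.rename(columns=short)
print(pivot)
```

Renaming by label does not depend on the column order, so it keeps working if a race category is absent from a given extract.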


## 7. Merge Cooper with the Courts' pivoted data

In [78]:
mergedata = pd.merge(court_pivot, cooper, 
                     how = 'outer',
                     on = 'FIPS',
                     indicator = 'matched2', 
                     validate = 'many_to_one'
                    )

In [80]:
# Check if everything was merged correctly:
mergedata.matched2.value_counts()

both          920
left_only      69
right_only     19
Name: matched2, dtype: int64

In [None]:
# Conclusion: 69 court rows have a FIPS code that does not appear in the Cooper data,
# and 19 Cooper localities have no matching court rows
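
To see which FIPS codes failed to match, the indicator column can be used as a filter. The sketch below uses toy frames with made-up values (999 is a deliberately invalid code). In the notebook, the same filter would presumably be applied to `mergedata` using its `matched2` column.

```python
import pandas as pd

# Toy stand-ins for court_pivot and cooper
court_toy = pd.DataFrame({"FIPS": [1, 1, 999], "whitecount": [1, 2, 1]})
cooper_toy = pd.DataFrame({"FIPS": [1, 3], "totalpop": [32316, 109330]})

merged = pd.merge(court_toy, cooper_toy, how="outer", on="FIPS",
                  indicator="matched2", validate="many_to_one")

# FIPS codes present on only one side of the merge
court_only = sorted(
    merged.loc[merged["matched2"] == "left_only", "FIPS"].unique())
cooper_only = sorted(
    merged.loc[merged["matched2"] == "right_only", "FIPS"].unique())

print("FIPS only in court data:", court_only)
print("FIPS only in Cooper data:", cooper_only)
```

Listing the unmatched codes makes it easier to tell real gaps from coding differences between the two sources.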