# Judicial Voter Data Analysis V3

### Last updated: 3/6/22

Notes on data:
- Only retention races
- Subcircuit races excluded (there shouldn't be any among retention races)

Metrics calculated:

A. Percentage of people who voted in at least one judicial race: votes in the judicial race with the highest total votes, as a percentage of total ballots cast. This assumes that the number of votes in the highest-vote race equals the number of people who voted in any judicial race. Someone could have skipped the highest-vote race and voted in another, but we're assuming that's a very small share of voters, if any.

(working note: need to see a ranking of top 5 races with the most votes by ward, and see if it's generally the same across wards...or a distribution)

B. Percentage of people who didn't vote in any judicial race: 100% minus A

C. Percentage of judicial voters out of registered voters: votes in the highest-vote judicial race divided by registered voters.

D. Percentage of registered voters who didn't vote in any judicial race: 100% minus C
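
The four metrics reduce to two ratios and their complements. A minimal sketch, assuming the three inputs have already been totaled; the function name `judicial_metrics` and the round numbers are made up for illustration and are not election data:

```python
def judicial_metrics(max_race_votes, ballots_cast, registered_voters):
    """Return metrics A-D as fractions (multiply by 100 for percentages)."""
    a = max_race_votes / ballots_cast        # A: voted in at least one judicial race
    c = max_race_votes / registered_voters   # C: judicial voters out of registered voters
    return {'A': a, 'B': 1 - a, 'C': c, 'D': 1 - c}

# Illustrative round numbers only, not real election data
metrics = judicial_metrics(max_race_votes=600, ballots_cast=800, registered_voters=1000)
print(metrics)
```

A and B always sum to 1, as do C and D, which makes a quick sanity check on any year's numbers.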

The Google Sheet is here: https://docs.google.com/spreadsheets/d/1cR_HXbwe4G9WpkGQxl8u21jGGdNDY_1leo2BQ1P53FQ/edit?usp=sharing

In [2]:
#Check my working directory
!pwd

/Users/amy/Code/injustice_watch/analysis


In [115]:
#Import other packages for analysis
import pandas as pd
from openpyxl import load_workbook #used to read .xlsx files
import numpy as np

### Define Data Cleaning Functions

Finds and tags retention races 

In [116]:
#Returns the df with an added column called 'Retention'
def tag_retention(df):
    def retention(a):
        if 'No' in a:
            return 'Retention'
        elif 'Yes' in a:
            return 'Retention'
        else:
            return 'Not Retention'
    
    df['Retention'] = df['CANDIDATE'].apply(lambda x: retention(x))
    
    return df
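
Two caveats about the function above: the substring test `'No' in a` also matches any candidate name that happens to contain "No" or "Yes", and it writes a new column onto whatever frame is passed in. A hedged alternative sketch follows. It assumes retention rows carry exactly `Yes` or `No` in `CANDIDATE` (true of the 2020 rows shown in this notebook, unverified for other years), and `tag_retention_exact` is a hypothetical name, not part of the original analysis:

```python
import pandas as pd

def tag_retention_exact(df):
    """Return a copy of df with a 'Retention' column added."""
    out = df.copy()  # work on a copy so the caller's frame is not modified
    # Normalize whitespace and case, then require an exact Yes/No match
    labels = out['CANDIDATE'].astype(str).str.strip().str.lower()
    is_retention = labels.isin(['yes', 'no'])
    out['Retention'] = is_retention.map({True: 'Retention', False: 'Not Retention'})
    return out

# Made-up rows for illustration; "Nora Example" is a hypothetical candidate name
sample = pd.DataFrame({
    'RACE': ['Judge A', 'Judge A', 'Some Vacancy'],
    'CANDIDATE': ['Yes', 'No', 'Nora Example'],
})
tagged = tag_retention_exact(sample)
print(tagged)
```

With the substring version, the "Nora Example" row would be tagged as a retention race because the name contains "No".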

In [117]:
#Pass a string pathname
#Returns a pandas df
def excel_to_df(pathname): 
    wb = load_workbook(pathname)
    ws = wb.active
    data = ws.values
    columns = next(data)[0:] #Gets the first line in the file as a header line
    df = pd.DataFrame(data, columns=columns)
    
    return df

## 2020 General 

In [119]:
#load data into pandas using the helper defined above
df = excel_to_df('../Judicial General Data/judicial_general_2020.xlsx')

In [120]:
#check the total number of rows = 7,505
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7505 entries, 0 to 7504
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   WARD               7505 non-null   float64
 1   REGISTERED VOTERS  7505 non-null   float64
 2   BALLOTS CAST       7505 non-null   float64
 3   RACE               7505 non-null   object 
 4   CANDIDATE          7505 non-null   object 
 5   VOTES              7505 non-null   float64
dtypes: float64(4), object(2)
memory usage: 351.9+ KB


In [121]:
df.head(20)

Unnamed: 0,WARD,REGISTERED VOTERS,BALLOTS CAST,RACE,CANDIDATE,VOTES
0,1.0,38017.0,30731.0,Abbey Fishman Romanek,No,2839.0
1,1.0,38017.0,30731.0,Abbey Fishman Romanek,Yes,17794.0
2,1.0,38017.0,30731.0,Andrea M. Buford,No,2683.0
3,1.0,38017.0,30731.0,Andrea M. Buford,Yes,17950.0
4,1.0,38017.0,30731.0,Anjana Hansen,No,3801.0
5,1.0,38017.0,30731.0,Anjana Hansen,Yes,16603.0
6,1.0,38017.0,30731.0,Ann Collins-Dole,No,2789.0
7,1.0,38017.0,30731.0,Ann Collins-Dole,Yes,17806.0
8,1.0,38017.0,30731.0,Anna Helen Demacopoulos,No,7004.0
9,1.0,38017.0,30731.0,Anna Helen Demacopoulos,Yes,14045.0


In [122]:
#clean data
#remove ballot measures 
#total number of rows = 7,187; 318 ballot measures removed
ballot_measures = excel_to_df('../Judicial General Data/ballot_measures.xlsx')

clean_df = df[~df.RACE.isin(ballot_measures.RACE)]

clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7187 entries, 0 to 7504
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   WARD               7187 non-null   float64
 1   REGISTERED VOTERS  7187 non-null   float64
 2   BALLOTS CAST       7187 non-null   float64
 3   RACE               7187 non-null   object 
 4   CANDIDATE          7187 non-null   object 
 5   VOTES              7187 non-null   float64
dtypes: float64(4), object(2)
memory usage: 393.0+ KB


In [123]:
clean_df.tail(20)

Unnamed: 0,WARD,REGISTERED VOTERS,BALLOTS CAST,RACE,CANDIDATE,VOTES
7485,50.0,29522.0,21384.0,Raul Vega,Yes,9975.0
7486,50.0,29522.0,21384.0,Robert D. Kuzas,No,3189.0
7487,50.0,29522.0,21384.0,Robert D. Kuzas,Yes,9131.0
7488,50.0,29522.0,21384.0,Robert E. Gordon,No,2435.0
7489,50.0,29522.0,21384.0,Robert E. Gordon,Yes,10468.0
7490,50.0,29522.0,21384.0,Shelley Lynn Sutker-Dermer,No,2714.0
7491,50.0,29522.0,21384.0,Shelley Lynn Sutker-Dermer,Yes,10317.0
7492,50.0,29522.0,21384.0,Steven G. Watkins,No,2571.0
7493,50.0,29522.0,21384.0,Steven G. Watkins,Yes,9824.0
7494,50.0,29522.0,21384.0,Supreme Court Judge (Vacancy of Freeman),"P. Scott Neville, Jr.",14254.0


In [124]:
#remove non-retention races
#total number of rows = 6,200
clean_df = tag_retention(clean_df)
clean_df = clean_df[clean_df['Retention'] == 'Retention']
clean_df.info()
clean_df.head(20)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6200 entries, 0 to 7504
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   WARD               6200 non-null   float64
 1   REGISTERED VOTERS  6200 non-null   float64
 2   BALLOTS CAST       6200 non-null   float64
 3   RACE               6200 non-null   object 
 4   CANDIDATE          6200 non-null   object 
 5   VOTES              6200 non-null   float64
 6   Retention          6200 non-null   object 
dtypes: float64(4), object(3)
memory usage: 387.5+ KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Retention'] = df['CANDIDATE'].apply(lambda x: retention(x))


Unnamed: 0,WARD,REGISTERED VOTERS,BALLOTS CAST,RACE,CANDIDATE,VOTES,Retention
0,1.0,38017.0,30731.0,Abbey Fishman Romanek,No,2839.0,Retention
1,1.0,38017.0,30731.0,Abbey Fishman Romanek,Yes,17794.0,Retention
2,1.0,38017.0,30731.0,Andrea M. Buford,No,2683.0,Retention
3,1.0,38017.0,30731.0,Andrea M. Buford,Yes,17950.0,Retention
4,1.0,38017.0,30731.0,Anjana Hansen,No,3801.0,Retention
5,1.0,38017.0,30731.0,Anjana Hansen,Yes,16603.0,Retention
6,1.0,38017.0,30731.0,Ann Collins-Dole,No,2789.0,Retention
7,1.0,38017.0,30731.0,Ann Collins-Dole,Yes,17806.0,Retention
8,1.0,38017.0,30731.0,Anna Helen Demacopoulos,No,7004.0,Retention
9,1.0,38017.0,30731.0,Anna Helen Demacopoulos,Yes,14045.0,Retention


In [15]:
#export to csv to check in excel
#clean_df.to_csv('Test.csv')

In [81]:
#group by ward AND race
#make sure registered voters, ballots cast are the same and not summed up
grouped = clean_df.groupby(['WARD','RACE'], as_index=False).agg({'WARD': 'first','REGISTERED VOTERS': 'first',
                                                     'BALLOTS CAST': 'first','VOTES': 'sum'}).sort_values('VOTES',ascending=False)
grouped

Unnamed: 0,RACE,WARD,REGISTERED VOTERS,BALLOTS CAST,VOTES
2896,Michael P. Toomin,47.0,40767.0,36503.0,28918.0
2893,Mauricio Araujo,47.0,40767.0,36503.0,28241.0
2872,Jackie Marie Portman-Brown,47.0,40767.0,36503.0,27914.0
2858,Aurelia Marie Pucinski,47.0,40767.0,36503.0,27807.0
2882,Kenneth J. Wadas,47.0,40767.0,36503.0,27427.0
...,...,...,...,...,...
895,John J. Mahoney,15.0,18597.0,10086.0,7087.0
875,Bridget Anne Mitchell,15.0,18597.0,10086.0,7086.0
921,Robert D. Kuzas,15.0,18597.0,10086.0,7070.0
925,Terrence J. McGuire,15.0,18597.0,10086.0,7054.0
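
For the working note about ranking the top races in each ward, a minimal sketch follows. It assumes the same `WARD`, `RACE`, and `VOTES` columns used above; the function name `top_races_by_ward` and the toy rows are made up for illustration:

```python
import pandas as pd

def top_races_by_ward(df, n=5):
    """Return the n races with the most total votes in each ward."""
    # Sum the Yes and No rows for each race within each ward
    totals = df.groupby(['WARD', 'RACE'], as_index=False)['VOTES'].sum()
    # Sort so the highest-vote race comes first within each ward
    totals = totals.sort_values(['WARD', 'VOTES'], ascending=[True, False])
    # Rank 1 is the race with the most votes in that ward
    totals['RANK'] = totals.groupby('WARD').cumcount() + 1
    return totals[totals['RANK'] <= n].reset_index(drop=True)

# Toy data, not real election results
toy = pd.DataFrame({
    'WARD': [1, 1, 1, 1, 2, 2, 2, 2],
    'RACE': ['Judge A', 'Judge A', 'Judge B', 'Judge B',
             'Judge A', 'Judge A', 'Judge B', 'Judge B'],
    'CANDIDATE': ['Yes', 'No'] * 4,
    'VOTES': [60, 30, 50, 20, 40, 10, 45, 25],
})
top = top_races_by_ward(toy, n=1)
# How many wards each race leads; one dominant race suggests a stable ranking
leaders = top.loc[top['RANK'] == 1, 'RACE'].value_counts()
print(top)
print(leaders)
```

If a single race leads in nearly every ward, the citywide highest-vote race is a reasonable stand-in for each ward; a spread of leaders would argue for computing the maximum ward by ward.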


In [82]:
#export to csv to handcheck
grouped.to_csv('grouped.csv')

In [125]:
#group by race only and show the top 20 races with the most votes
#Aurelia Marie Pucinski is the race with the most votes = 884,434
by_race = clean_df.groupby('RACE').sum().sort_values('VOTES',ascending=False)
by_race.head(20)

Unnamed: 0_level_0,WARD,REGISTERED VOTERS,BALLOTS CAST,VOTES
RACE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aurelia Marie Pucinski,2550.0,3168586.0,2321986.0,884434.0
Michael P. Toomin,2550.0,3168586.0,2321986.0,871882.0
Mauricio Araujo,2550.0,3168586.0,2321986.0,828503.0
Jackie Marie Portman-Brown,2550.0,3168586.0,2321986.0,825928.0
"James Patrick Flannery, Jr.",2550.0,3168586.0,2321986.0,824548.0
Mary Ellen Coghlan,2550.0,3168586.0,2321986.0,819483.0
Kenneth J. Wadas,2550.0,3168586.0,2321986.0,817770.0
Patricia Manila Martin,2550.0,3168586.0,2321986.0,816891.0
Mary Katherine Rochford,2550.0,3168586.0,2321986.0,816618.0
Shelley Lynn Sutker-Dermer,2550.0,3168586.0,2321986.0,814021.0


## Citywide Participation - 2020 General

In [87]:
#find total number of ballots cast and registered voters
#first groupby
by_ward = clean_df.groupby('WARD').agg({'REGISTERED VOTERS':'first','BALLOTS CAST':'first'})
by_ward

Unnamed: 0_level_0,REGISTERED VOTERS,BALLOTS CAST
WARD,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,38017.0,30731.0
2.0,40366.0,34147.0
3.0,38027.0,28573.0
4.0,34870.0,28567.0
5.0,30166.0,24275.0
6.0,31173.0,20236.0
7.0,31134.0,20337.0
8.0,36234.0,24776.0
9.0,34876.0,22021.0
10.0,26971.0,16218.0
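
Taking `'first'` within each ward is only safe if `REGISTERED VOTERS` and `BALLOTS CAST` repeat the same value on every row of that ward. A quick check, sketched under that assumption; `ward_totals_are_constant` is a hypothetical helper and the rows are toy data:

```python
import pandas as pd

def ward_totals_are_constant(df, cols=('REGISTERED VOTERS', 'BALLOTS CAST')):
    """True if each column in cols has exactly one distinct value per ward."""
    distinct = df.groupby('WARD')[list(cols)].nunique()
    return bool((distinct == 1).all().all())

# Toy rows, not real election data
consistent = pd.DataFrame({
    'WARD': [1, 1, 2, 2],
    'REGISTERED VOTERS': [200, 200, 100, 100],
    'BALLOTS CAST': [100, 100, 80, 80],
})
inconsistent = consistent.copy()
inconsistent.loc[1, 'BALLOTS CAST'] = 101  # introduce a mismatch within ward 1

print(ward_totals_are_constant(consistent))
print(ward_totals_are_constant(inconsistent))
```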


In [89]:
#total number of registered voters
by_ward['REGISTERED VOTERS'].sum()

1584293.0

In [90]:
#total number of ballots cast
by_ward['BALLOTS CAST'].sum()

1160993.0

In [91]:
#overall voter turnout
by_ward['BALLOTS CAST'].sum()/by_ward['REGISTERED VOTERS'].sum()

0.7328145740718415

In [126]:
by_race['VOTES'].max()

884434.0

In [100]:
#the race with the most votes is Aurelia Marie Pucinski, with 884,434 votes
#participation rate = votes in that race / total ballots cast
participation = by_race['VOTES'].max()/by_ward['BALLOTS CAST'].sum()
participation

0.761790984097234

In [101]:
#judicial turnout rate = votes in the highest-vote race / total registered voters
judicial_turnout = by_race['VOTES'].max()/by_ward['REGISTERED VOTERS'].sum()
judicial_turnout

0.5582515355429836

## Citywide Participation 2006-2020

In [135]:
judicial_list = ['/Users/amy/Code/injustice_watch/Judicial General Data/judicial_general_2006.xlsx',
                '/Users/amy/Code/injustice_watch/Judicial General Data/judicial_general_2008.xlsx',
                '/Users/amy/Code/injustice_watch/Judicial General Data/judicial_general_2010.xlsx',
                '/Users/amy/Code/injustice_watch/Judicial General Data/judicial_general_2012.xlsx',
                '/Users/amy/Code/injustice_watch/Judicial General Data/judicial_general_2014.xlsx',
                '/Users/amy/Code/injustice_watch/Judicial General Data/judicial_general_2016.xlsx',
                '/Users/amy/Code/injustice_watch/Judicial General Data/judicial_general_2018.xlsx',
                '/Users/amy/Code/injustice_watch/Judicial General Data/judicial_general_2020.xlsx']

In [136]:
participation_list = []
jud_turnout_list = []
turnout_list = []
years_list = ['2006','2008','2010','2012','2014','2016','2018','2020']
ballot_measures_df = excel_to_df('/Users/amy/Code/injustice_watch/Judicial General Data/ballot_measures.xlsx')

for index, pathname in enumerate(judicial_list):
    print(index)
    df = excel_to_df(pathname)
    df = tag_retention(df)
    clean_df = df[df['Retention'] == 'Retention']
    clean_df = clean_df[~clean_df.RACE.isin(ballot_measures_df.RACE)]
    
    #find race w most votes
    by_race = clean_df.groupby('RACE').sum().sort_values('VOTES',ascending=False)
    print(by_race.head(3))
    max_votes = by_race['VOTES'].max()
    print('max votes')
    print(max_votes)
    
    #find denominators
    by_ward = clean_df.groupby('WARD').agg({'REGISTERED VOTERS':'first','BALLOTS CAST':'first'})
    registered = by_ward['REGISTERED VOTERS'].sum()
    ballots = by_ward['BALLOTS CAST'].sum()
    
    #participation
    participation = max_votes/ballots
    participation_list.append(participation)
    
    #judicial_turnout
    judicial_turnout = max_votes/registered
    jud_turnout_list.append(judicial_turnout)
    
    #voter turnout
    turnout = ballots/registered
    turnout_list.append(turnout)
    

0
                     WARD  REGISTERED VOTERS  BALLOTS CAST     VOTES
RACE                                                                
Patrick J. Quinn   2550.0          2721494.0     1340444.0  436438.0
Warren D. Wolfson  2550.0          2721494.0     1340444.0  392230.0
Kathy M. Flanagan  2550.0          2721494.0     1340444.0  389768.0
max votes
436438.0
1
                        WARD  REGISTERED VOTERS  BALLOTS CAST     VOTES
RACE                                                                   
Michael J. Gallagher  2550.0          2994584.0     2211996.0  643592.0
Thomas E. Flanagan    2550.0          2994584.0     2211996.0  633698.0
Richard J. Elrod      2550.0          2994584.0     2211996.0  613793.0
max votes
643592.0
2
                        WARD  REGISTERED VOTERS  BALLOTS CAST     VOTES
RACE                                                                   
Charles E. Freeman    2550.0          2669614.0     1411738.0  463055.0
Thomas R. Fitzgerald  2550.0       

In [137]:
#check 2020
participation_list

[0.6511842344775314,
 0.581910636366431,
 0.6560069927989471,
 0.6037546045661746,
 0.6530695339900873,
 0.6272166171894047,
 0.7400612459035086,
 0.761790984097234]

In [138]:
jud_turnout_list

[0.3207341261821632,
 0.4298373329985066,
 0.34690783012075904,
 0.45529038655908105,
 0.3187473149861037,
 0.445558789427002,
 0.44898370509121943,
 0.5582515355429836]

In [139]:
turnout_list

[0.49253975941155853,
 0.7386655375170641,
 0.5288172747071299,
 0.7540984087172771,
 0.4880756158362485,
 0.7103746572014907,
 0.6066845245261758,
 0.7328145740718415]

In [140]:
#save into a dictionary and convert to df
dict_to_df = {
    'Year':years_list,
    'Participation':participation_list,
    'Judicial Turnout':jud_turnout_list,
    'Voter Turnout':turnout_list,
}

summary_df = pd.DataFrame(dict_to_df)
summary_df

Unnamed: 0,Year,Participation,Judicial Turnout,Voter Turnout
0,2006,0.651184,0.320734,0.49254
1,2008,0.581911,0.429837,0.738666
2,2010,0.656007,0.346908,0.528817
3,2012,0.603755,0.45529,0.754098
4,2014,0.65307,0.318747,0.488076
5,2016,0.627217,0.445559,0.710375
6,2018,0.740061,0.448984,0.606685
7,2020,0.761791,0.558252,0.732815


## Ward by Ward Participation TK

Open question: should the race with the highest votes be determined citywide or separately for each ward? Still working out how to calculate the per-ward version in pandas.
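
One possible way to compute the per-ward version, as a sketch rather than the settled method: total the votes for each race within each ward, pick each ward's highest-vote race with `idxmax`, and divide by that ward's ballots cast and registered voters. The function name `ward_participation`, the shortened column names `REGISTERED` and `BALLOTS`, and the toy rows are assumptions for illustration:

```python
import pandas as pd

def ward_participation(df):
    """Per-ward participation based on each ward's own highest-vote race."""
    race_totals = df.groupby(['WARD', 'RACE'], as_index=False).agg(
        REGISTERED=('REGISTERED VOTERS', 'first'),
        BALLOTS=('BALLOTS CAST', 'first'),
        VOTES=('VOTES', 'sum'),
    )
    # idxmax gives the row label of the highest-vote race within each ward
    top_rows = race_totals.loc[race_totals.groupby('WARD')['VOTES'].idxmax()]
    top_rows = top_rows.set_index('WARD')
    top_rows['PARTICIPATION'] = top_rows['VOTES'] / top_rows['BALLOTS']
    top_rows['JUDICIAL TURNOUT'] = top_rows['VOTES'] / top_rows['REGISTERED']
    return top_rows

# Toy data, not real election results
toy = pd.DataFrame({
    'WARD': [1, 1, 1, 1, 2, 2, 2, 2],
    'REGISTERED VOTERS': [200] * 4 + [100] * 4,
    'BALLOTS CAST': [100] * 4 + [80] * 4,
    'RACE': ['Judge A', 'Judge A', 'Judge B', 'Judge B',
             'Judge A', 'Judge A', 'Judge B', 'Judge B'],
    'CANDIDATE': ['Yes', 'No'] * 4,
    'VOTES': [60, 30, 50, 20, 40, 10, 45, 25],
})
result = ward_participation(toy)
print(result)
```

Using each ward's own maximum gives a participation figure that is at least as high as one based on the citywide top race, so the two approaches should be compared before choosing one.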

In [86]:
#calculate voter turnout by ward 
by_ward['TURNOUT'] = by_ward['BALLOTS CAST']/by_ward['REGISTERED VOTERS']
by_ward

Unnamed: 0_level_0,REGISTERED VOTERS,BALLOTS CAST,TURNOUT
WARD,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0,38017.0,30731.0,0.808349
2.0,40366.0,34147.0,0.845935
3.0,38027.0,28573.0,0.751387
4.0,34870.0,28567.0,0.819243
5.0,30166.0,24275.0,0.804714
6.0,31173.0,20236.0,0.649152
7.0,31134.0,20337.0,0.653209
8.0,36234.0,24776.0,0.683778
9.0,34876.0,22021.0,0.631408
10.0,26971.0,16218.0,0.601313
