# Candidate Data Check

Make sure the DCBOE hash IDs were pasted correctly, and generally check for other obvious problems with the `candidates` table. 

In [1]:
import os
os.chdir('../')

import pandas as pd
pd.set_option('display.max_rows', 500)

from scripts.data_transformations import list_candidates

In [2]:
cand = list_candidates(election_year=2022)
dcboe = pd.read_csv('data/dcboe/candidates_dcboe.csv')

In [3]:
cd = pd.merge(cand, dcboe, how='inner', on='dcboe_hash_id', suffixes=['_openanc', '_dcboe'])
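
An inner join silently drops any OpenANC row whose `dcboe_hash_id` has no counterpart in the DCBOE file, so a mistyped hash would not show up in `cd` at all. One way to surface such rows is a left join with `indicator=True`. This is a minimal sketch on toy data: the frames `cand_toy` and `dcboe_toy` and every value in them are made up for illustration and do not come from the real tables.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are hypothetical
cand_toy = pd.DataFrame({
    'candidate_name': ['A', 'B', 'C'],
    'dcboe_hash_id': ['h1', 'h2x', None],  # 'h2x' simulates a paste error
})
dcboe_toy = pd.DataFrame({
    'candidate_name': ['A', 'B'],
    'dcboe_hash_id': ['h1', 'h2'],
})

# indicator=True adds a _merge column recording where each row matched
check = pd.merge(
    cand_toy, dcboe_toy, how='left', on='dcboe_hash_id',
    suffixes=['_openanc', '_dcboe'], indicator=True
)

# Rows with a hash ID on the OpenANC side but no match in DCBOE
unmatched = check.loc[
    check.dcboe_hash_id.notnull() & (check['_merge'] == 'left_only')
]
print(unmatched[['candidate_name_openanc', 'dcboe_hash_id']])
```

Candidates with no hash ID at all are excluded here on purpose, since they have not been matched to DCBOE yet.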

## Candidate Name Comparison

These candidates have a different name in DCBOE than in OpenANC. Compare them and consider changing the OpenANC name to match the ballot.

The "rejects" are names as they are rendered by DCBOE (initially uppercase, then fed through the Python title case function) that we have decided are not displayable.
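
To illustrate why some title-cased names end up rejected, here is a small sketch using Python's built-in `str.title()`. The uppercase inputs are assumed reconstructions of how the names might look in the source file; whether the pipeline uses exactly this function is also an assumption.

```python
# Uppercase inputs are illustrative guesses at the source rendering
raw_names = ['ALYCE MCFARLAND', 'CARL EDGAR ROHDE III', 'VJ KAPUR']

# str.title() capitalizes only the first letter of each word,
# so interior capitals, suffixes, and initials are lowercased
titled = [name.title() for name in raw_names]
print(titled)
# ['Alyce Mcfarland', 'Carl Edgar Rohde Iii', 'Vj Kapur']
```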

In [4]:
dcboe_name_rejects = [
    'Alyce Mcfarland'
    , 'Andrew Mccarthy-Clark'
    , 'Mike Mclaughlin'
    , 'Susana Bara√Ëano'
    , 'Brian J. Mccabe'
    , 'Carl Edgar Rohde Iii'
    , 'Quentin Col√Ìn Roosevelt'
    , 'Thomas P. Defranco'
    , 'Nicole Mcentee'
    , 'V√Çctor De Le√Ìn'
    , 'Laqueda Tate'
    , 'Bobby \"Slli\'M" Williams'
    , 'Matt Lafortune'
    , 'Vj Kapur'
    , 'Steven Mccarty'
    , 'H Norman Knickle'
    , 'Obbie English'
    , 'Robin Mckinney'
    , 'Roric Mccorristin'
]

name_mismatches = (
    cd.loc[
        (cd.candidate_name_openanc != cd.candidate_name_dcboe)
        & ~cd.candidate_name_dcboe.isin(dcboe_name_rejects)
        & ~cd.candidate_name_dcboe.str.contains('Withdrew')
    , ['smd_id_openanc', 'candidate_name_openanc', 'candidate_name_dcboe']]
    .sort_values(by='smd_id_openanc')
)

print(len(name_mismatches))
name_mismatches

56


| | smd_id_openanc | candidate_name_openanc | candidate_name_dcboe |
|---|---|---|---|
| 385 | smd_2022_1C08 | Barney R. Shapiro | Barney R Shapiro |
| 215 | smd_2022_1D04 | Yasmin Romero-Latin | Yasmin Romero |
| 55 | smd_2022_1E07 | Amanda Farnan | Amanda M Farnan |
| 56 | smd_2022_2A03 | Trupti "Trip" J. Patel | Trupti "Trip" Patel |
| 57 | smd_2022_2B02 | Jeff Rueckgauer | Jeffrey Rueckgauer |
| 143 | smd_2022_2D02 | Carole Feld | Carole L. Feld |
| 203 | smd_2022_2E01 | Kishan Putta | Kishan Kumar Putta |
| 241 | smd_2022_2G01 | Tony Brown | Anthony "Tony" Brown |
| 22 | smd_2022_3C03 | Janell Pagats | Janell Marie Pagats |
| 275 | smd_2022_3C05 | Nicholas Ide | Nicholas (Nick) Ide |


## SMD Comparison

This should be empty. If there's a disagreement about which district a candidate is running in, resolve it using the candidate's address.

In [5]:
(
    cd.loc[cd.smd_id_openanc != cd.smd_id_dcboe
    , ['candidate_name_openanc', 'candidate_name_dcboe', 'smd_id_openanc', 'smd_id_dcboe']]
    .sort_values(by='smd_id_openanc')
)

| | candidate_name_openanc | candidate_name_dcboe | smd_id_openanc | smd_id_dcboe |
|---|---|---|---|---|


In [6]:
# This count should generally be zero:
# once a candidate has a dcboe_hash_id, they should not also have a manual status
cd_left = pd.merge(cand, dcboe, how='left', on='dcboe_hash_id')
cd_left[cd_left.manual_status.notnull()].dcboe_hash_id.notnull().sum()
# cd_left[cd_left.manual_status.notnull()]

74
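
When the count is not zero, it helps to list the offending rows rather than just count them. The sketch below does that on toy data; the frame `toy` and its status strings and IDs are invented for illustration and are not the real `manual_status` values.

```python
import pandas as pd

# Hypothetical rows; status values and IDs are invented for illustration
toy = pd.DataFrame({
    'candidate_name': ['A', 'B', 'C'],
    'manual_status': ['Pulled Papers', None, 'Filed Signatures'],
    'dcboe_hash_id': [None, 'h2', 'h3'],
})

# Candidates that have both a manual status and a DCBOE hash ID
conflicts = toy[toy.manual_status.notnull() & toy.dcboe_hash_id.notnull()]
print(conflicts[['candidate_name', 'manual_status', 'dcboe_hash_id']])
```

Applied to `cd_left`, the same boolean mask would return the rows whose manual status may need to be cleared.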

## Candidate Duplicates

Each person_id should appear in the candidate table only once.

In [7]:
# Count rows per person_id once, then keep only the IDs that appear more than once
person_counts = cand.groupby('person_id').size()
person_counts[person_counts > 1]

Series([], dtype: int64)

In [8]:
# Same check by name: names shared by more than one candidate row
name_counts = cand.groupby('candidate_name').size()
name_counts[name_counts > 1]

Series([], dtype: int64)

There are two different candidates named Patricia Williams.
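
A repeated name is only a problem if it is the same person entered twice. One way to tell the cases apart is to count the distinct `person_id` values behind each repeated name. This is a sketch on toy data; the `person_id` values and the third row are hypothetical.

```python
import pandas as pd

# Toy table; person_id values are made up for illustration
toy = pd.DataFrame({
    'person_id': [101, 102, 103],
    'candidate_name': ['Patricia Williams', 'Patricia Williams', 'Another Candidate'],
})

# keep=False marks every row in a duplicated group, not just the repeats
same_name = toy[toy.duplicated(subset='candidate_name', keep=False)]

# Number of distinct person_ids behind each repeated name
ids_per_name = same_name.groupby('candidate_name').person_id.nunique()
print(ids_per_name)
```

A value greater than one means the shared name belongs to different people, which is acceptable; a value of one would point to a true duplicate row.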

## person_id Integrity Check

Are there any person_ids in the candidate table that are NOT in the person table?

In [9]:
c = pd.read_csv('data/candidates.csv')
p = pd.read_csv('data/people.csv')

In [10]:
# person_ids in the candidate table that are missing from the person table
c.loc[~c.person_id.isin(p.person_id), 'person_id'].tolist()

[]