# Candidate Data Check

Make sure the DCBOE hash IDs were pasted correctly, and generally check for other obvious problems with the `candidates` table. 

In [1]:
import os
os.chdir('../')

import pandas as pd
pd.set_option('display.max_rows', 500)

from scripts.data_transformations import list_candidates

In [2]:
cand = list_candidates(election_year=2022)
dcboe = pd.read_csv('data/dcboe/candidates_dcboe.csv')

In [3]:
cd = pd.merge(cand, dcboe, how='inner', on='dcboe_hash_id', suffixes=['_openanc', '_dcboe'])
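
Because the merge above uses `how='inner'`, an OpenANC candidate whose `dcboe_hash_id` was pasted incorrectly drops out of `cd` silently instead of showing up as an error. The following is a minimal sketch of one way to surface such rows. It uses toy frames in place of `cand` and `dcboe`; the hash values and names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `cand` and `dcboe`; all values are hypothetical.
cand_toy = pd.DataFrame({
    'dcboe_hash_id': ['aaa111', 'bbb222', 'ccc33', None],
    'candidate_name': ['Ann Lee', 'Bo Diaz', 'Cy Ford', 'Di Park'],
})
dcboe_toy = pd.DataFrame({
    'dcboe_hash_id': ['aaa111', 'bbb222', 'ccc333'],
    'candidate_name': ['Ann Lee', 'Bo Diaz', 'Cy Ford'],
})

# Keep only candidates that claim a hash ID, then left-merge with an indicator column.
has_hash = cand_toy[cand_toy.dcboe_hash_id.notnull()]
checked = pd.merge(
    has_hash, dcboe_toy, how='left', on='dcboe_hash_id',
    suffixes=['_openanc', '_dcboe'], indicator=True,
)

# 'left_only' rows carry a hash ID that matches nothing in the DCBOE file.
unmatched = checked.loc[
    checked['_merge'] == 'left_only',
    ['dcboe_hash_id', 'candidate_name_openanc'],
]
print(unmatched)
```

In this toy data the truncated hash `ccc33` is the only row flagged; candidates with no hash ID at all are excluded up front because they are not expected to match.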

## Candidate Name Comparison

These candidates have a different name in DCBOE than in OpenANC. Compare the two and consider changing the OpenANC name to match the ballot.

The "rejects" are names as they are rendered by DCBOE (initially uppercase, then fed through the Python title case function) that we have decided are not displayable.
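
To illustrate why these names end up rejected, here is a small sketch. It assumes the title-case step is Python's built-in `str.title()`, which this notebook does not confirm, and the uppercase inputs are guesses at how DCBOE renders the names.

```python
# Assumed uppercase DCBOE renderings (hypothetical inputs for illustration).
dcboe_raw = [
    'ALYCE MCFARLAND',
    'CARL EDGAR ROHDE III',
    'THOMAS P. DEFRANCO',
    'VJ KAPUR',
]

# str.title() uppercases the first letter of each word and lowercases the rest,
# so internal capitals (McFarland, DeFranco) and suffixes (III) are lost.
title_cased = [name.title() for name in dcboe_raw]
print(title_cased)
```

Under that assumption, the results match entries in the reject list, such as `'Alyce Mcfarland'` and `'Carl Edgar Rohde Iii'`.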

In [4]:
dcboe_name_rejects = [
    'Alyce Mcfarland'
    , 'Andrew Mccarthy-Clark'
    , 'Mike Mclaughlin'
    , 'Susana Bara√Ëano'
    , 'Brian J. Mccabe'
    , 'Carl Edgar Rohde Iii'
    , 'Quentin Col√Ìn Roosevelt'
    , 'Thomas P. Defranco'
    , 'Nicole Mcentee'
    , 'V√Çctor De Le√Ìn'
    , 'Laqueda Tate'
    , 'Bobby \"Slli\'M" Williams'
    , 'Matt Lafortune'
    , 'Vj Kapur'
    , 'Steven Mccarty'
    , 'H Norman Knickle'
    , 'Obbie English'
    , 'Robin Mckinney'
    , 'Roric Mccorristin'
]

name_mismatches = (
    cd.loc[
        (cd.candidate_name_openanc != cd.candidate_name_dcboe)
        & ~cd.candidate_name_dcboe.isin(dcboe_name_rejects)
        & ~cd.candidate_name_dcboe.str.contains('Withdrew')
    , ['smd_id_openanc', 'candidate_name_openanc', 'candidate_name_dcboe']]
    .sort_values(by='smd_id_openanc')
)

print(len(name_mismatches))
name_mismatches

48


Unnamed: 0,smd_id_openanc,candidate_name_openanc,candidate_name_dcboe
377,smd_2022_1B05,Deborah Thomas,Deborah R. Thomas
291,smd_2022_1B07,J. Swiderski,J.I. Swiderski
354,smd_2022_1C07,Jake Faleschini,Jacob Faleschini
247,smd_2022_1D04,Yasmin Romero-Latin,Yasmin Romero
59,smd_2022_1E07,Amanda Farnan,Amanda M Farnan
60,smd_2022_2A03,"Trupti ""Trip"" J. Patel","Trupti ""Trip"" Patel"
69,smd_2022_2A04,Carson Colton Robb,Carson Robb
61,smd_2022_2B02,Jeff Rueckgauer,Jeffrey Rueckgauer
154,smd_2022_2D02,Carole Feld,Carole L. Feld
278,smd_2022_2G01,Tony Brown,"Anthony ""Tony"" Brown"


## SMD Comparison

This should be empty. If there's a disagreement about which district a candidate is running in, resolve it using the candidate's address.

In [5]:
(
    cd.loc[cd.smd_id_openanc != cd.smd_id_dcboe
    , ['candidate_name_openanc', 'candidate_name_dcboe', 'smd_id_openanc', 'smd_id_dcboe']]
    .sort_values(by='smd_id_openanc')
)

Unnamed: 0,candidate_name_openanc,candidate_name_dcboe,smd_id_openanc,smd_id_dcboe


In [6]:
# This sum should be zero: once a candidate has a dcboe_hash_id,
# they should generally not have a manual status.
# Only the last expression in the cell is displayed, so the sum itself is not shown.
cd_left = pd.merge(cand, dcboe, how='left', on='dcboe_hash_id')
cd_left[cd_left.manual_status.notnull()].dcboe_hash_id.notnull().sum()
cd_left[cd_left.manual_status.notnull()]

Unnamed: 0,candidate_id,person_id,election_year,dcboe_hash_id,smd_id_x,candidate_name_x,manual_status,manual_source,write-in winner according to DCBOE,candidate_source_description,...,filed_date_x,dcboe_source,dcboe_source_link,dcboe_updated_at,updated_at,smd_id_y,candidate_name_y,pickup_date_y,filed_date_y,candidate_status_y
4,50519,11356,2022,,smd_2022_7D07,Brian Voorhees,Declared Intention to Run,,,,...,,,,,2022-07-14,,,,,
31,50546,11370,2022,,smd_2022_3E03,Thomas Marabello,Declared Intention to Run,,,,...,,,,,2022-07-29,,,,,
198,50714,11460,2022,,smd_2022_3E08,Rohin Ghosh,Declared Intention to Run,,,,...,,,,,2022-07-26,,,,,
249,50766,10546,2022,,smd_2022_8F02,Eric S. Blaylock,Declared Intention to Run,,,,...,,,,,2022-07-27,,,,,
254,50772,10178,2022,d6636977c7d7149b5c9f7ca10cf268612f96edba603301...,smd_2022_5E05,Robert Vinson Brannum,Withdrew,,,,...,,DCBOE,https://dcboe.org/Elections/2022-Elections,2022-08-08,2022-08-08,smd_2022_5E05,Robert Vinson Brannum,2022-07-27,,Pulled Papers for Ballot
327,50845,11530,2022,,smd_2022_1B05,Lindsay Webb,Declared Intention to Run,,,,...,,,,,2022-08-03,,,,,
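
In the output above, Robert Vinson Brannum has both a `dcboe_hash_id` and a manual status of Withdrew, which is the kind of row this check is meant to flag. Since a notebook cell displays only its last expression, one option is to print the count and the flagged rows together. A minimal sketch on toy data (the frame below is a made-up stand-in for `cand`):

```python
import pandas as pd

# Toy stand-in for `cand`; all values are hypothetical.
cand_toy = pd.DataFrame({
    'candidate_name': ['Ann Lee', 'Bo Diaz', 'Cy Ford'],
    'dcboe_hash_id': [None, 'bbb222', 'ccc333'],
    'manual_status': ['Declared Intention to Run', 'Withdrew', None],
})

# Rows that have both a manual status and a DCBOE hash ID.
both = cand_toy[
    cand_toy.manual_status.notnull() & cand_toy.dcboe_hash_id.notnull()
]

print(len(both))  # ideally zero
print(both[['candidate_name', 'dcboe_hash_id', 'manual_status']])
```

Filtering on both conditions at once also avoids the extra merge, since `dcboe_hash_id` and `manual_status` both live on the candidate side.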


## Candidate Duplicates

Every `person_id` should appear only once in the `candidates` table.

In [7]:
person_counts = cand.groupby('person_id').size()
person_counts[person_counts > 1]

Series([], dtype: int64)

In [8]:
name_counts = cand.groupby('candidate_name').size()
name_counts[name_counts > 1]

candidate_name
Patricia Williams    2
dtype: int64

There are two different candidates named Patricia Williams.
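
A count alone does not show whether a repeated name belongs to different people. Below is a minimal sketch, on toy data, of listing every row that shares a name so the `person_id` and `smd_id` values can be compared side by side. The IDs, names, and districts are invented for illustration.

```python
import pandas as pd

# Toy stand-in for `cand`; all values are hypothetical.
cand_toy = pd.DataFrame({
    'person_id': [1, 2, 3],
    'candidate_name': ['Pat Smith', 'Pat Smith', 'Lee Jones'],
    'smd_id': ['smd_2022_1A01', 'smd_2022_8B02', 'smd_2022_3C03'],
})

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = cand_toy[cand_toy.duplicated('candidate_name', keep=False)]
print(dupes.sort_values(by='candidate_name'))

# More than one person_id under a name suggests different people,
# not one person entered twice.
people_per_name = dupes.groupby('candidate_name').person_id.nunique()
print(people_per_name)
```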