# Candidate Data Check

Make sure the DCBOE hash IDs were pasted correctly, and check the `candidates` table for other obvious problems.

In [1]:
import os
os.chdir('../')

import pandas as pd
pd.set_option('display.max_rows', 500)

from scripts.data_transformations import list_candidates

In [2]:
cand = list_candidates(election_year=2022)
dcboe = pd.read_csv('data/dcboe/candidates_dcboe.csv')

In [3]:
cd = pd.merge(cand, dcboe, how='inner', on='dcboe_hash_id', suffixes=['_openanc', '_dcboe'])
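
Because the merge above is an inner join, a mistyped `dcboe_hash_id` simply drops out of `cd` without any warning. One way to surface such rows is a left merge with pandas' `indicator=True` option. The sketch below runs on made-up toy frames (`cand_demo`, `dcboe_demo`, and every value in them are hypothetical), so it illustrates the idea rather than the actual check used for this data.

```python
import pandas as pd

# Toy stand-ins for the real tables; every value here is invented for illustration.
cand_demo = pd.DataFrame({
    'candidate_name': ['A. Example', 'B. Example', 'C. Example'],
    'dcboe_hash_id': ['aaa111', 'bbb222', None],
})
dcboe_demo = pd.DataFrame({
    'candidate_name': ['A. EXAMPLE', 'D. EXAMPLE'],
    'dcboe_hash_id': ['aaa111', 'ddd444'],
})

# Only rows that actually carry a hash ID can have been pasted incorrectly.
with_hash = cand_demo[cand_demo['dcboe_hash_id'].notnull()]

# indicator=True adds a `_merge` column recording where each row was found.
checked = pd.merge(
    with_hash, dcboe_demo, how='left', on='dcboe_hash_id',
    suffixes=['_openanc', '_dcboe'], indicator=True,
)

# 'left_only' means the hash ID is present in OpenANC but absent from the DCBOE file.
unmatched = checked.loc[
    checked['_merge'] == 'left_only',
    ['candidate_name_openanc', 'dcboe_hash_id'],
]
print(unmatched)
```

An empty `unmatched` frame would suggest that every hash ID entered in OpenANC has a counterpart in the DCBOE file.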

## Candidate Name Comparison

These candidates have a different name in DCBOE than in OpenANC. Compare the two and consider changing the OpenANC name to match the ballot.

The "rejects" are names as they are rendered by DCBOE (initially uppercase, then fed through the Python title case function) that we have decided are not displayable.
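
To see why title-cased names end up on the reject list, here is a minimal sketch, assuming the "title case function" is Python's built-in `str.title()`. The uppercase inputs are reconstructed guesses at how DCBOE renders the names, not values read from the DCBOE file.

```python
# Assumed uppercase renderings; the real DCBOE strings may differ.
raw_names = [
    'ALYCE MCFARLAND',
    'CARL EDGAR ROHDE III',
    'VJ KAPUR',
    'THOMAS P. DEFRANCO',
    'BOBBY "SLLI\'M" WILLIAMS',
]

# str.title() capitalizes the first letter of each run of letters and
# lowercases the rest, so "Mc" prefixes, suffixes, and initials come out wrong.
titled = [name.title() for name in raw_names]

for raw, shown in zip(raw_names, titled):
    print(f'{raw} -> {shown}')
```

Since `str.title()` has no knowledge of name conventions, results such as `Mcfarland`, `Iii`, and `Vj` have to be caught by hand, which is what the list below does.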

In [4]:
dcboe_name_rejects = [
    'Alyce Mcfarland'
    , 'Andrew Mccarthy-Clark'
    , 'Mike Mclaughlin'
    , 'Susana Bara√Ëano'
    , 'Brian J. Mccabe'
    , 'Carl Edgar Rohde Iii'
    , 'Quentin Col√Ìn Roosevelt'
    , 'Thomas P. Defranco'
    , 'Nicole Mcentee'
    , 'V√Çctor De Le√Ìn'
    , 'Laqueda Tate'
    , 'Bobby \"Slli\'M" Williams'
    , 'Matt Lafortune'
    , 'Vj Kapur'
    , 'Steven Mccarty'
    , 'H Norman Knickle'
    , 'Obbie English'
    , 'Robin Mckinney'
    , 'Roric Mccorristin'
]

name_mismatches = (
    cd.loc[
        (cd.candidate_name_openanc != cd.candidate_name_dcboe)
        & ~cd.candidate_name_dcboe.isin(dcboe_name_rejects)
        & ~cd.candidate_name_dcboe.str.contains('Withdrew')
    , ['smd_id_openanc', 'candidate_name_openanc', 'candidate_name_dcboe']]
    .sort_values(by='smd_id_openanc')
)

print(len(name_mismatches))
name_mismatches

51


Unnamed: 0,smd_id_openanc,candidate_name_openanc,candidate_name_dcboe
382,smd_2022_1C07,Ivan Taylor Jr.,Ivan Taylor J.R.
221,smd_2022_1D04,Yasmin Romero-Latin,Yasmin Romero
57,smd_2022_1E07,Amanda Farnan,Amanda M Farnan
58,smd_2022_2A03,"Trupti ""Trip"" J. Patel","Trupti ""Trip"" Patel"
66,smd_2022_2A04,Carson Colton Robb,Carson Robb
59,smd_2022_2B02,Jeff Rueckgauer,Jeffrey Rueckgauer
144,smd_2022_2D02,Carole Feld,Carole L. Feld
248,smd_2022_2G01,Tony Brown,"Anthony ""Tony"" Brown"
23,smd_2022_3C03,Janell Pagats,Janell Marie Pagats
121,smd_2022_3C05,Sauleh A. Siddiqui,Sauleh Ahmad Siddiqui


## SMD Comparison

This should be empty. If there's a disagreement about which district a candidate is running in, resolve it using the candidate's address.

In [5]:
(
    cd.loc[cd.smd_id_openanc != cd.smd_id_dcboe
    , ['candidate_name_openanc', 'candidate_name_dcboe', 'smd_id_openanc', 'smd_id_dcboe']]
    .sort_values(by='smd_id_openanc')
)

Unnamed: 0,candidate_name_openanc,candidate_name_dcboe,smd_id_openanc,smd_id_dcboe


In [6]:
# The printed count should be zero:
# once a candidate has a dcboe_hash_id, they generally should not have a manual status
cd_left = pd.merge(cand, dcboe, how='left', on='dcboe_hash_id')
manual = cd_left[cd_left.manual_status.notnull()]

# A bare expression in the middle of a cell is not displayed, so print the count explicitly
print(manual.dcboe_hash_id.notnull().sum())
manual

Unnamed: 0,candidate_id,person_id,election_year,dcboe_hash_id,smd_id_x,candidate_name_x,manual_status,manual_source,write-in winner according to DCBOE,candidate_source_description,...,filed_date_x,dcboe_source,dcboe_source_link,dcboe_updated_at,updated_at,smd_id_y,candidate_name_y,pickup_date_y,filed_date_y,candidate_status_y
4,50519,11356,2022,,smd_2022_7D07,Brian Voorhees,Write-In Candidate,,,,...,,,,,2022-08-22,,,,,
31,50546,11370,2022,,smd_2022_3E03,Thomas Marabello,Withdrew,,,,...,,,,,2022-08-17,,,,,
40,50555,10089,2022,fa1e9c655802e03affea2b345f0df38cfa609d596ba270...,smd_2022_3D04,Michael Sriqui,Pulled Papers for Ballot,,,,...,,,,,2022-08-12,,,,,
56,50571,10231,2022,7ad48a813bd7c8ab03ec75994250c26faf5714c0e91440...,smd_2022_7D05,Tamara Blair,Withdrew,,,,...,,,,,2022-08-08,,,,,
70,50585,11372,2022,2e893e83de78052fdc3324531028bc951d469342a23280...,smd_2022_1E06,Eric Jonathan Sheptok,Pulled Papers for Ballot,,,,...,,,,,2022-08-12,,,,,
86,50601,11388,2022,3d4c304f37a8be2017c3d60075836a5655d1934298c95c...,smd_2022_6C03,Rasheedah Hasan,Pulled Papers for Ballot,,,,...,,,,,2022-08-12,,,,,
92,50607,11394,2022,e728479719f877d7162ed7ed8df74fe05d1d36ffee07e7...,smd_2022_7E05,Tim Howard,Pulled Papers for Ballot,,,,...,,,,,2022-08-12,,,,,
120,50635,11415,2022,e5df28984ae702d84c4d2a44c86a87b3f2f9afb5995209...,smd_2022_7B05,Dexter Williams,Withdrew,,,,...,,,,,2022-08-02,,,,,
121,50636,10081,2022,5434005c4071000bb657a02f8ba798c05c25e97cd9fffa...,smd_2022_3C04,Beau Finley,Withdrew,,,,...,,,,,2022-08-10,,,,,
134,50649,10411,2022,26fe7d8b2f1d77541142d2fed528430c8c3c5bee8e0f6d...,smd_2022_7F03,Rebecca J. Morris,Pulled Papers for Ballot,,,,...,,,,,2022-08-12,,,,,


## Candidate Duplicates

Each `person_id` should appear in the candidate table only once.

In [7]:
cand.groupby('person_id').size()[cand.groupby('person_id').size() > 1]

Series([], dtype: int64)

In [8]:
cand.groupby('candidate_name').size()[cand.groupby('candidate_name').size() > 1]

Series([], dtype: int64)

There are two different candidates named Patricia Williams.
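
A repeated name is therefore not necessarily a duplicate record. As a rough sketch of how to tell the two cases apart, the example below counts distinct `person_id` values per name on a made-up frame. The IDs, district codes, and the other two names are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Invented rows: two distinct people share a name, two others do not.
cand_demo = pd.DataFrame({
    'person_id': [1, 2, 3, 4],
    'candidate_name': ['Patricia Williams', 'Patricia Williams',
                       'Sam Example', 'Lee Example'],
    'smd_id': ['smd_2022_0X01', 'smd_2022_0Y02',
               'smd_2022_0X03', 'smd_2022_0Y04'],
})

# Number of distinct people behind each name.
people_per_name = cand_demo.groupby('candidate_name')['person_id'].nunique()
shared_names = people_per_name[people_per_name > 1].index

# Show the rows side by side so the districts can be compared by eye.
same_name = cand_demo[cand_demo['candidate_name'].isin(shared_names)]
print(same_name)
```

If the rows for a shared name have different `person_id` values and different districts, they are most likely distinct people rather than an accidental double entry.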