Compare two DCBOE candidate files and find the differences between them, matching rows on dcboe_hash_id.

* old_dcboe = the CSV in the current head commit
* new_dcboe = the CSV in the current working directory, not yet committed

In [1]:
import io
import os
import pandas as pd
from git import Repo

os.chdir('../..') # root of repo is two directories above this notebook

In [2]:
csv_file_path = 'data/dcboe/candidates_dcboe.csv'

oa_repo = Repo('.') 
commit = oa_repo.head.commit
targetfile = commit.tree / csv_file_path

# Read the committed version of the CSV straight from the git blob
with io.BytesIO(targetfile.data_stream.read()) as f:
    old_dcboe = pd.read_csv(f, encoding='utf-8')

In [3]:
new_dcboe = pd.read_csv(csv_file_path)

In [4]:
len(old_dcboe)

398

In [5]:
len(new_dcboe)

398

In [6]:
# Change in number of active candidates
len(new_dcboe) - len(old_dcboe)

0

## Old hashes not in new file

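Candidates whose hash appears only in the old file. Either they will no longer be on the ballot, or one of their hashed fields changed (for example, a corrected name), in which case they show up again under a new hash in the next section.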

In [7]:
old_hashes_not_in_new = ~old_dcboe['dcboe_hash_id'].isin(new_dcboe['dcboe_hash_id'])
old_dcboe[old_hashes_not_in_new][['smd_id', 'candidate_name']]

Unnamed: 0,smd_id,candidate_name
229,smd_2022_5D05,"Salvador ""The Commissioner"" Sauceda-Guzm"


In [8]:
old_dcboe['smd_name'] = old_dcboe['smd_id'].str.replace('smd_', '') + ': ' + old_dcboe['candidate_name']
old_dcboe[old_hashes_not_in_new][['smd_name']].to_clipboard(index=False)

## New hashes not in old file

In [9]:
new_hashes_not_in_old = ~new_dcboe['dcboe_hash_id'].isin(old_dcboe['dcboe_hash_id'])
new_dcboe[new_hashes_not_in_old]

Unnamed: 0,dcboe_hash_id,smd_id,candidate_name,pickup_date,filed_date,candidate_status
229,34685c5c49315bbe2d24cda257eb8da53fba75741e183c...,smd_2022_5D05,"Salvador ""The Commissioner"" Sauceda-Guzman",2022-07-21,2022-08-10,Filed Signatures
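The two `isin` checks above could also be done in a single pass with an outer merge and pandas' `indicator` flag. This is only a sketch on toy data: the frames `old` and `new` and their hash values are made up for illustration and stand in for `old_dcboe` and `new_dcboe`.

```python
import pandas as pd

# Toy stand-ins for old_dcboe / new_dcboe; hash values are invented.
old = pd.DataFrame({
    'dcboe_hash_id': ['a1', 'b2', 'c3'],
    'candidate_name': ['Ann', 'Bob', 'Cy'],
})
new = pd.DataFrame({
    'dcboe_hash_id': ['a1', 'b2', 'd4'],
    'candidate_name': ['Ann', 'Bob', 'Cyrus'],
})

# An outer merge keeps every hash from both files; the _merge column
# records whether each row came from the left, the right, or both.
merged = pd.merge(
    old, new,
    how='outer',
    on='dcboe_hash_id',
    suffixes=('_old', '_new'),
    indicator=True,
)

only_in_old = merged.loc[merged['_merge'] == 'left_only', 'dcboe_hash_id'].tolist()
only_in_new = merged.loc[merged['_merge'] == 'right_only', 'dcboe_hash_id'].tolist()

print('old hashes not in new:', only_in_old)
print('new hashes not in old:', only_in_new)
```

Rows marked `both` are the ones the next section compares field by field.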


## Same hash, changed info

Have any fields changed for rows with the same dcboe_hash_id? 

In [10]:
df = pd.merge(old_dcboe, new_dcboe, how='inner', on='dcboe_hash_id', suffixes=['_old', '_new'])

In [11]:
columns_to_check = [c for c in new_dcboe.columns if c != 'dcboe_hash_id']

for c in columns_to_check:
    
    # Fill nulls with a placeholder so two missing values compare as equal.
    # Without this, NaN != NaN is True and every pair of nulls would be counted as a difference.
    diff = (df[c + '_old'].fillna('x') != df[c + '_new'].fillna('x'))
    num_differences = diff.sum()
    print(f'column "{c}" has {num_differences} differences')

    df[c + '_diff'] = diff

csv_columns = ['dcboe_hash_id']
for c in columns_to_check:
    csv_columns += [c + '_old', c + '_new', c + '_diff']

df[csv_columns].to_csv('data/dcboe/candidates_dcboe_diff.csv', index=False)

column "smd_id" has 0 differences
column "candidate_name" has 0 differences
column "pickup_date" has 0 differences
column "filed_date" has 0 differences
column "candidate_status" has 0 differences
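To illustrate why the nulls are filled before comparing, here is a small sketch on made-up data (the series `old_col` and `new_col` are hypothetical, not taken from the DCBOE file). It also shows one possible alternative that avoids a placeholder value, in case a real field could ever legitimately contain 'x'.

```python
import numpy as np
import pandas as pd

# Made-up column values; row 1 is missing in both versions.
old_col = pd.Series(['2022-07-21', np.nan, '2022-08-01'])
new_col = pd.Series(['2022-07-21', np.nan, '2022-08-10'])

# Naive comparison: NaN != NaN is True, so row 1 is flagged as a difference.
naive_diff = old_col != new_col

# The approach used in this notebook: fill nulls with a placeholder first.
filled_diff = old_col.fillna('x') != new_col.fillna('x')

# Alternative: flag a difference unless both sides are missing.
both_missing = old_col.isna() & new_col.isna()
explicit_diff = (old_col != new_col) & ~both_missing

print(naive_diff.sum(), filled_diff.sum(), explicit_diff.sum())
```

Only the last row is a real change; the naive comparison also counts the row that is missing on both sides.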


## Compare counts by district

When many candidates have small changes to their names, the hashes change too, so hash matching overstates the differences. To find real differences in who is a candidate between the two files, compare the number of candidates in each SMD instead.
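As a rough sketch of the idea, the counts can be taken directly from the two candidate tables with a `groupby` on smd_id. This is an assumption about how one might do it, not the method in the commented-out cells that follow, which rely on a project helper. The frames `old` and `new` and the names in them are invented for illustration.

```python
import pandas as pd

# Invented stand-ins for old_dcboe / new_dcboe.
old = pd.DataFrame({
    'smd_id': ['smd_2022_1A01', 'smd_2022_1A01', 'smd_2022_1A02'],
    'candidate_name': ['Ann Lee', 'Bob Ray', 'Cy Poe'],
})
new = pd.DataFrame({
    'smd_id': ['smd_2022_1A01', 'smd_2022_1A01', 'smd_2022_1A02', 'smd_2022_1A03'],
    'candidate_name': ['Ann Lee', 'Robert Ray', 'Cy Poe', 'Dee Fox'],
})

# Count candidates per district in each file.
old_counts = old.groupby('smd_id').size().rename('n_old')
new_counts = new.groupby('smd_id').size().rename('n_new')

# Align on smd_id; a district missing from one file gets a count of 0.
counts = pd.concat([old_counts, new_counts], axis=1).fillna(0).astype(int)

# Keep only the districts where the count changed.
changed = counts[counts['n_old'] != counts['n_new']]
print(changed)
```

In this toy data the renamed candidate in 1A01 does not register as a change, while the new candidate in 1A03 does.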

In [12]:
# from scripts.data_transformations import districts_candidates_commissioners
# new_smd = districts_candidates_commissioners()

In [13]:
# new_smd['number_of_candidates'].sum()

In [14]:
# smd_df.to_csv('smd_df_old.csv', index=False)

In [15]:
# old_smd = pd.read_csv('/Users/devin/Dropbox/OpenANC/smd_df_2020-09-03.csv')

In [16]:
# compare_smd = pd.merge(old_smd, new_smd, how='inner', on='smd_id', suffixes=['_old', '_new'])

In [17]:
# compare_smd['district'] = compare_smd['smd_id'].str.replace('smd_', '')

In [18]:
# diff = compare_smd['number_of_candidates_old'] != compare_smd['number_of_candidates_new']

In [19]:
# sum(diff)

In [20]:
# compare_smd.loc[diff, [
#     'district'
#     , 'list_of_candidates_old'
#     , 'list_of_candidates_new'
# ]].to_clipboard(index=False)