Compare two versions of the DCBOE candidates CSV and find the differences between them, matching rows on their hashes.

* old_dcboe = the CSV in the current head commit
* new_dcboe = the CSV in the current working directory, not yet committed

In [1]:
import io
import os
import pandas as pd
from git import Repo

os.chdir('../..') # root of repo is two directories above this notebook

In [2]:
csv_file_path = 'data/dcboe/candidates_dcboe.csv'

oa_repo = Repo('.') 
commit = oa_repo.head.commit
targetfile = commit.tree / csv_file_path

# Read the committed version of the CSV straight from the git blob
with io.BytesIO(targetfile.data_stream.read()) as f:
    old_dcboe = pd.read_csv(f, encoding='utf-8')

In [3]:
new_dcboe = pd.read_csv(csv_file_path)

In [4]:
len(old_dcboe)

420

In [5]:
len(new_dcboe)

420

## Old hashes not in new file

In [6]:
old_hashes_not_in_new = ~( old_dcboe['dcboe_hash_id'].isin(new_dcboe['dcboe_hash_id']))
old_dcboe[old_hashes_not_in_new]

|     | dcboe_hash_id | smd_id | candidate_name | pickup_date | filed_date |
| --- | --- | --- | --- | --- | --- |
| 227 | 0412ea8a9cf9c03d0744c9e8f6c89064f8692ec6ca32fe... | smd_5D02 | Cameron Brown | 2020-08-03 | 2020-08-05 |
| 286 | 97dc1b7950240e70ca562b808b8017f4d334573452e36f... | smd_6C04 | Pranav Nanda | 2020-06-30 | 2020-08-04 |
| 326 | 7e6eff492c2dcca912c54ae1dc711dd7f1f0f4e238a005... | smd_7C03 | Vincent Van | 2020-06-26 | 2020-08-03 |


## New hashes not in old file

In [7]:
new_hashes_not_in_old = ~( new_dcboe['dcboe_hash_id'].isin(old_dcboe['dcboe_hash_id']))
new_dcboe[new_hashes_not_in_old]

|     | dcboe_hash_id | smd_id | candidate_name | pickup_date | filed_date |
| --- | --- | --- | --- | --- | --- |
| 267 | 02647cfc9fba7e5e15efc35629f391c7cb0be97813d4e8... | smd_6A07 | Rico Dancy | 2020-07-22 | 2020-08-05 |
| 325 | 00969a9abd072c6fc2c620ae89aad6b3b60ee63648b21c... | smd_7C03 | Vince Van | 2020-06-26 | 2020-08-03 |
| 399 | 9bd61943b16542005aa92a0ab6066a18e51a92b76bc8e8... | smd_8D01 | Patricia Carmon | 2020-07-17 | 2020-08-05 |
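
The two `isin` checks above could also be done in one pass with an outer merge and pandas' `indicator=True` option, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. The sketch below is a minimal illustration on toy data; `old_df`, `new_df`, and their values are hypothetical stand-ins for `old_dcboe` and `new_dcboe`, not real candidate records.

```python
import pandas as pd

# Hypothetical toy frames standing in for old_dcboe and new_dcboe
old_df = pd.DataFrame({
    'dcboe_hash_id': ['a1', 'b2', 'c3'],
    'candidate_name': ['Alice', 'Bob', 'Carol'],
})
new_df = pd.DataFrame({
    'dcboe_hash_id': ['b2', 'c3', 'd4'],
    'candidate_name': ['Bob', 'Carol', 'Dan'],
})

# The outer merge keeps every hash from both sides; indicator=True adds
# a '_merge' column recording which side(s) each hash came from
both = pd.merge(old_df, new_df, how='outer', on='dcboe_hash_id',
                suffixes=('_old', '_new'), indicator=True)

only_old = both[both['_merge'] == 'left_only']   # hashes dropped from the new file
only_new = both[both['_merge'] == 'right_only']  # hashes added in the new file

print(only_old[['dcboe_hash_id', 'candidate_name_old']])
print(only_new[['dcboe_hash_id', 'candidate_name_new']])
```

Whether this is preferable to two separate `isin` filters is a matter of taste; the merged frame is convenient when both directions are wanted side by side.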


## Same hash, changed info

For hashes that appear in both files, have any of the other fields changed?

In [8]:
df = pd.merge(old_dcboe, new_dcboe, how='inner', on='dcboe_hash_id', suffixes=['_old', '_new'])
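
The inner merge above implicitly assumes that each hash appears at most once per file; if a hash were duplicated, the merge would silently produce extra rows. One possible guard, sketched here on hypothetical toy data, is the `validate='one_to_one'` argument of `pd.merge`, which raises `pd.errors.MergeError` when the keys are not unique on either side.

```python
import pandas as pd

# Hypothetical toy frames; new_df deliberately contains a duplicated hash
old_df = pd.DataFrame({
    'dcboe_hash_id': ['a1', 'b2'],
    'smd_id': ['smd_1A01', 'smd_1A02'],
})
new_df = pd.DataFrame({
    'dcboe_hash_id': ['a1', 'a1'],
    'smd_id': ['smd_1A01', 'smd_1A03'],
})

try:
    # validate='one_to_one' checks that merge keys are unique in both frames
    pd.merge(old_df, new_df, how='inner', on='dcboe_hash_id',
             suffixes=('_old', '_new'), validate='one_to_one')
    duplicate_found = False
except pd.errors.MergeError as err:
    duplicate_found = True
    print(f'Merge keys are not unique: {err}')
```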

In [9]:
columns_to_check = [c for c in old_dcboe.columns if c != 'dcboe_hash_id']

for c in columns_to_check:
    # Note: NaN != NaN evaluates to True, so a field that is missing
    # in both files is still reported as a difference here
    is_different = df[c + '_old'] != df[c + '_new']
    if is_different.sum() > 0:
        print(df[is_different])

                                         dcboe_hash_id smd_id_old  \
371  a18cc1ecaaf81f445eeb1e80ddc7d01168aa3d278ea46f...   smd_8A06   

    candidate_name_old pickup_date_old filed_date_old smd_id_new  \
371  Kristina Leszczak      2020-08-26            NaN   smd_8A06   

    candidate_name_new pickup_date_new filed_date_new  
371  Kristina Leszczak      2020-08-26            NaN  
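
In the output above, row 371 shows identical values in the old and new columns. It appears to be flagged only because `filed_date` is `NaN` on both sides, and `NaN != NaN` evaluates to `True`. A minimal sketch of a comparison that treats two missing values as equal follows; the frame `merged` and its values are hypothetical, and the columns are forced to `object` dtype on the assumption that the date columns are read as plain strings.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame with the same '_old' / '_new' column layout
merged = pd.DataFrame({
    'dcboe_hash_id': ['a1', 'b2', 'c3'],
    'filed_date_old': ['2020-08-03', np.nan, np.nan],
    'filed_date_new': ['2020-08-03', np.nan, '2020-08-05'],
}, dtype=object)

old_col = merged['filed_date_old']
new_col = merged['filed_date_new']

# Naive comparison: flags row 'b2' even though both values are missing
naive_diff = old_col != new_col

# NaN-aware comparison: ignore rows where both sides are missing
both_missing = old_col.isna() & new_col.isna()
true_diff = naive_diff & ~both_missing

print(merged[true_diff])
```

With this mask, only rows where a value was added, removed, or changed are reported; a field that is missing in both files no longer counts as a difference.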
