# Candidate Master &ndash; Data Quality #

## Overview ##

Each Candidate in the FEC data set is assigned a unique ID (`cand_id`).  However, because we combine Candidate records from the source files of multiple election cycles, the same `cand_id` can appear once per cycle, so joins to `cand` should use both `cand_id` and `elect_cycle`.  The purpose of this notebook is to explore the "quality" of the Candidate data, both within and across election cycles, to see how consistent it is, and whether (and how) we can create a unified Candidate Master entity to improve referential integrity for the larger, complete data set.
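
To illustrate why the join needs both columns, here is a minimal sketch using an in-memory SQLite database as a stand-in for the real PostgreSQL instance.  The `contrib` table, the candidate ID, the name, and the amounts are all made up for illustration; only the `cand` column names come from this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- made-up rows: one candidate ID appearing in two election cycles
    create table cand (cand_id text, elect_cycle integer, cand_name text);
    create table contrib (cand_id text, elect_cycle integer, amount integer);
    insert into cand values ('X00000001', 2018, 'DOE, JANE'),
                            ('X00000001', 2020, 'DOE, JANE Q');
    insert into contrib values ('X00000001', 2018, 100),
                               ('X00000001', 2020, 250);
""")

# Joining on cand_id alone pairs each contribution with every cycle's record
id_only = conn.execute("""
    select count(*)
      from contrib c
      join cand k on k.cand_id = c.cand_id
""").fetchone()[0]

# Joining on (cand_id, elect_cycle) matches exactly one record per contribution
composite = conn.execute("""
    select count(*)
      from contrib c
      join cand k on k.cand_id = c.cand_id
                 and k.elect_cycle = c.elect_cycle
""").fetchone()[0]

print(id_only, composite)
```

With these two contributions, the ID-only join returns 4 rows while the composite join returns 2, which is the row multiplication that the two-column join avoids.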


## Notebook Setup ##

### Configure database connect info/options ###

Note: the database connect string can be specified on the initial `%sql` command:

```python
database_url = "postgresql+psycopg2://user@localhost/fecdb"
%sql $database_url

```

Otherwise, if no connect string is given to `%sql`, it is taken from the `DATABASE_URL` environment variable:

```python
%sql

```
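
For code that runs outside the `%sql` magic (a helper script, say), the same precedence can be mimicked in plain Python.  This is only a sketch: `resolve_database_url` is a hypothetical helper, not part of any library used here.

```python
import os

def resolve_database_url(explicit=None, environ=None):
    """Return `explicit` if given, else the DATABASE_URL environment variable."""
    environ = os.environ if environ is None else environ
    url = explicit or environ.get("DATABASE_URL")
    if not url:
        raise RuntimeError("no connect string given and DATABASE_URL is not set")
    return url
```

Calling `resolve_database_url()` with no argument reads the environment, while passing a connect string overrides it.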

In [1]:
%load_ext sql
%config SqlMagic.autopandas=True
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
# connect string taken from DATABASE_URL environment variable
%sql

'Connected: crash@fecdb'

### Configure Python modules ###

In [2]:
import pandas as pd

pd.set_option("display.max_rows", 200)

### Set styling ###

In [3]:
%%html
<style>
  tr, th, td {
    text-align: left !important;
  }
</style>

## Examine Constitution of Candidate Data ##

### High-level summary ###

First, count total records and distinct `cand_id`'s (saving the results for reference in later queries)

In [4]:
%%sql result <<
select count(*) as count_total,
       count(distinct cand_id) as count_distinct
  from cand

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.
Returning data to local variable result


In [5]:
# positional access: row 0, columns 0 and 1 of the single-row result
cand_count_total  = int(result.iloc[0, 0])
cand_distinct_ids = int(result.iloc[0, 1])
"cand_count_total = %d, cand_distinct_ids = %d" % (cand_count_total, cand_distinct_ids)

'cand_count_total = 56615, cand_distinct_ids = 26243'

### Quality of Candidate Names (`cand_name`) ###

Let's try to get a sense of the extent of formatting problems (inconsistencies or flaws).  First, look for names that contain lowercase letters (uppercase is now the standard)...

In [6]:
%%sql
select elect_cycle,
       count(*)
  from cand
 where cand_name ~ '[a-z]'
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
0 rows affected.


In [7]:
%%sql
select cand_name,
       array_agg(distinct elect_cycle)
  from cand
 where cand_name ~ '[a-z]'
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
0 rows affected.


Next, look for names with consecutive whitespace...

In [8]:
%%sql
select elect_cycle,
       count(*)
  from cand
 where cand_name ~ '\s{2,}'
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
10 rows affected.


Unnamed: 0,elect_cycle,count
0,2000,1
1,2004,6
2,2006,8
3,2008,10
4,2010,8
5,2012,14
6,2014,14
7,2016,17
8,2018,18
9,2020,4


In [9]:
%%sql
select cand_name,
       array_agg(distinct elect_cycle)
  from cand
 where cand_name ~ '\s{2,}'
 group by 1
 order by 1
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
43 rows affected.


Unnamed: 0,cand_name,array_agg
0,"AYYADURAI, SHIVA DR",[2018]
1,"BARLOW, PAMELA LEE DVM","[2012, 2014, 2016]"
2,"BEATTY, JEFFREY K","[2006, 2008]"
3,"BISHOP, RONALD HUBERT JR MR",[2018]
4,"BOYCE, HENRY CHARLES ""CHUCK"" JR","[2018, 2020]"
5,"BOYD, JIM",[2004]
6,"BOYD JR, WILLIE ""WILL"" EUGENE",[2018]
7,"BURNS, NEIL PAUL","[2012, 2014, 2016, 2018]"
8,"BYRD, JOSEPH H ""JOE""",[2004]
9,"CHURCHILL, ROBERT W","[2006, 2008]"
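
The two checks above can also be sketched in Python, for instance to screen names before they are loaded.  The patterns are the same ones used in the SQL; `name_format_issues` and the sample names are made up for illustration, and Python's `\s` may not match exactly the same character set as PostgreSQL's for unusual whitespace.

```python
import re

HAS_LOWERCASE = re.compile(r"[a-z]")
HAS_MULTI_SPACE = re.compile(r"\s{2,}")

def name_format_issues(cand_name):
    """Return a list of the formatting problems found in one name."""
    issues = []
    if HAS_LOWERCASE.search(cand_name):
        issues.append("lowercase")
    if HAS_MULTI_SPACE.search(cand_name):
        issues.append("consecutive whitespace")
    return issues

# made-up sample names, not taken from the data set
samples = ["DOE, JANE Q", "DOE,  JANE Q", "Doe, Jane  Q"]
for name in samples:
    print(repr(name), name_format_issues(name))
```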


### Integrity of Candidate ID (`cand_id`) ###

Count records per election cycle and check for null `cand_id`'s (zero in every cycle is the desired result)

In [10]:
%%sql
select elect_cycle,
       count(*)                  as records,
       count(*) - count(cand_id) as null_cand_ids
  from cand
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,records,null_cand_ids
0,2000,4529,0
1,2002,3944,0
2,2004,3814,0
3,2006,3704,0
4,2008,4072,0
5,2010,5126,0
6,2012,5628,0
7,2014,5536,0
8,2016,7641,0
9,2018,7590,0


Let's see if there are any duplicate `cand_id`'s within any election cycle

In [11]:
%%sql
with dup_cand_id as (
    select elect_cycle,
           cand_id,
           count(*) as id_count
      from cand
     group by 1, 2
    having count(*) > 1
)
select elect_cycle,
       count(*) as dupes,
       sum(id_count) as total_dupe_ids,
       max(id_count) as max_dupe_ids
  from dup_cand_id
 group by 1

 * postgresql+psycopg2://crash@caladan/fecdb
0 rows affected.
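
An empty result here means that the pair (`elect_cycle`, `cand_id`) is unique in `cand`.  The same check can be sketched in Python with a counter over the pairs; the rows below are made up, with one deliberate duplicate to show what a failure would look like.

```python
from collections import Counter

# made-up (elect_cycle, cand_id) pairs; the last repeats the first
rows = [
    (2018, "X00000001"),
    (2018, "X00000002"),
    (2020, "X00000001"),
    (2018, "X00000001"),
]

pair_counts = Counter(rows)
# keep only the pairs that occur more than once
dupes = {pair: n for pair, n in pair_counts.items() if n > 1}
print(dupes)
```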


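Now let's look at repeated `cand_id`'s across election cycles.  (Note that specifying `distinct` within `array_agg` is a trick that also sorts the values, which keeps the arrays consistent if we care to group by that field.  The ordering appears to be a side effect of how PostgreSQL removes duplicates rather than a documented guarantee, so an explicit `order by` inside the aggregate is the safer choice when order matters.)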

In [12]:
%%sql
with cand_id_sum as (
    select cand_id,
           count(*) as ec_count,
           array_agg(distinct elect_cycle) as elect_cycles
      from cand
     group by 1
)
select ec_count,
       count(*) as cand_ids,
       round(count(*)::numeric / :cand_distinct_ids * 100.0, 2) as pct_cand_ids
  from cand_id_sum
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
11 rows affected.


Unnamed: 0,ec_count,cand_ids,pct_cand_ids
0,11,214,0.82
1,10,99,0.38
2,9,134,0.51
3,8,198,0.75
4,7,260,0.99
5,6,521,1.99
6,5,949,3.62
7,4,1259,4.8
8,3,2586,9.85
9,2,7973,30.38


### Consistency of names for `cand_id`'s across election cycles ###

In [13]:
%%sql
with cand_diff_names as (
    select cand_id,
           count(distinct cand_name)     as num_diff_names,
           array_agg(distinct cand_name) as diff_names
      from cand
     group by 1
    having count(distinct cand_name) > 1
)
select num_diff_names,
       count(*) as cand_ids,
       round(count(*)::numeric / :cand_distinct_ids * 100.0, 2) as pct_cand_ids
  from cand_diff_names
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
5 rows affected.


Unnamed: 0,num_diff_names,cand_ids,pct_cand_ids
0,2,1245,4.74
1,3,200,0.76
2,4,53,0.2
3,5,10,0.04
4,7,1,0.0


Get an idea of what the different names associated with the same `cand_id` look like&mdash;let's start with a sampling of Candidate IDs with `num_diff_names` = 2 (compare adjacent `cand_name`'s)...

In [14]:
%%sql
with cand_diff_names as (
    select cand_id,
           count(distinct cand_name)     as num_diff_names,
           array_agg(distinct cand_name) as diff_names
      from cand
     group by 1
    having count(distinct cand_name) = 2
)
select cand_id,
       unnest(diff_names) as cand_name
  from cand_diff_names
 order by cand_id
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.


Unnamed: 0,cand_id,cand_name
0,H0AK00097,"COX, JOHN R."
1,H0AK00097,"COX, JOHN ROBERT"
2,H0AL07086,"SEWELL, TERRI A."
3,H0AL07086,"SEWELL, TERRYCINA ANDREA"
4,H0AL07177,"CHAMBERLAIN, DON"
5,H0AL07177,"CHAMBERLAIN, DONALD NORWOOD"
6,H0AR04038,"ROSS, MICHAEL A"
7,H0AR04038,"ROSS, MICHAEL AVERY"
8,H0AS00018,"FALEOMAVAEGA, ENI"
9,H0AS00018,"FALEOMAVAEGA, ENI F H"


Now, let's look at `num_diff_names` = 3...

In [15]:
%%sql
with cand_diff_names as (
    select cand_id,
           count(distinct cand_name)     as num_diff_names,
           array_agg(distinct cand_name) as diff_names
      from cand
     group by 1
    having count(distinct cand_name) = 3
)
select cand_id,
       unnest(diff_names) as cand_name
  from cand_diff_names
 order by cand_id
 limit 51

 * postgresql+psycopg2://crash@caladan/fecdb
51 rows affected.


Unnamed: 0,cand_id,cand_name
0,H0AL07060,"DAVIS, ARTUR G"
1,H0AL07060,"DAVIS, ARTUR G"
2,H0AL07060,"DAVIS, ARTUR GENESTRE"
3,H0AZ01184,"FLAKE, JEFF L"
4,H0AZ01184,"FLAKE, JEFF MR."
5,H0AZ01184,"FLAKE, JEFFRY LANE"
6,H0AZ01259,"GOSAR, PAUL ANTHONY"
7,H0AZ01259,"GOSAR, PAUL ANTHONY ANTHONY"
8,H0AZ01259,"GOSAR, PAUL DR."
9,H0AZ04501,"PENALOSA, JOE"


And `num_diff_names` = 4...

In [16]:
%%sql
with cand_diff_names as (
    select cand_id,
           count(distinct cand_name)     as num_diff_names,
           array_agg(distinct cand_name) as diff_names
      from cand
     group by 1
    having count(distinct cand_name) = 4
)
select cand_id,
       unnest(diff_names) as cand_name
  from cand_diff_names
 order by cand_id
 limit 52

 * postgresql+psycopg2://crash@caladan/fecdb
52 rows affected.


Unnamed: 0,cand_id,cand_name
0,H0AL05049,"CRAMER, ROBERT E ""BUD"""
1,H0AL05049,"CRAMER, ROBERT E ""BUD"" JR"
2,H0AL05049,"CRAMER, ROBERT EDWARD ""BUD"" JR"
3,H0AL05049,"CRAMER, ROBERT EDWARD BUD JR"
4,H0CA48024,"ISSA, DARRELL"
5,H0CA48024,"ISSA, DARRELL"
6,H0CA48024,"ISSA, DARRELL E"
7,H0CA48024,"ISSA, DARRELL EDWARD"
8,H0FL04066,"CRENSHAW, ANDER"
9,H0FL04066,"CRENSHAW, ANDER HON"


### Multiple `cand_id`'s for identical names &ndash; within election cycles ###

In [17]:
%%sql
with shared_cand_name as (
    select elect_cycle,
           cand_name,
           count(*) as num_shares
      from cand
     group by 1, 2
    having count(*) > 1
)
select num_shares,
       count(*) as shared_names
  from shared_cand_name
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
4 rows affected.


Unnamed: 0,num_shares,shared_names
0,5,2
1,4,5
2,3,41
3,2,1035


In [18]:
%%sql
with shared_cand_name as (
    select elect_cycle,
           cand_name,
           count(*) as num_shares,
           array_agg(distinct cand_id) as cand_ids
      from cand
     group by 1, 2
    having count(*) > 2
)
select cand_name,
       array_length(cand_ids, 1) as num_cand_ids,
       cand_ids,
       array_agg(elect_cycle) as elect_cycles
  from shared_cand_name
 group by 1, 3
 order by array_length(cand_ids, 1) desc, count(*) desc, cand_name

 * postgresql+psycopg2://crash@caladan/fecdb
36 rows affected.


Unnamed: 0,cand_name,num_cand_ids,cand_ids,elect_cycles
0,"DE LA FUENTE, ROQUE ""ROCKY""",5,"[S8FL00299, S8MN00719, S8VT00166, S8WA00384, S...",[2018]
1,"GRAYSON, RICHARD",5,"[H6ID02191, H6WA08118, H6WY01033, H8AZ06012, S...",[2016]
2,"BRYK, WILLIAM",4,"[H2IN08136, S0ID00099, S2WY00091, S4OR00214]",[2014]
3,"FARRIS, JADEN THOMAS MR.",4,"[H0MD05175, P00009266, P00009928, S2MD00511]",[2020]
4,"HAMBURG, AL",4,"[H6WY00050, P00003277, S0NE00056, S4WY00030]",[2000]
5,"MAGEE, ERIN KENT",4,"[H4FL25014, H4IN07175, P20002648, S4TN00419]",[2014]
6,"MARTIN, ANDY",4,"[H6NH02196, P00003731, S4IL00362, S4NH00096]",[2016]
7,"KALEMKARIAN, TIMOTHY CHARLES",3,"[H0CA23043, P60003175, S6CA00477]","[2000, 2002, 2004, 2006, 2008, 2010, 2012, 201..."
8,"WELLS, TOM",3,"[H6FL01069, P00003160, S4FL00421]","[2004, 2006, 2008]"
9,"KOPSICK, JOSEPH WILLIAM",3,"[H2WI02116, H4OR03119, H6IL10150]","[2016, 2018]"


In [19]:
%%sql
with shared_cand_name as (
    select elect_cycle,
           cand_name,
           count(*) as num_shares,
           array_agg(distinct cand_id) as cand_ids
      from cand
     group by 1, 2
    having count(*) = 2
)
select cand_name,
       cand_ids,
       count(*) as num_elect_cycles,
       array_agg(elect_cycle) as elect_cycles
  from shared_cand_name
 group by 1, 2
 order by 3 desc, 1, 2
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.


Unnamed: 0,cand_name,cand_ids,num_elect_cycles,elect_cycles
0,"CARROLL, JERRY LEON","[P00000679, S2CA00591]",9,"[2002, 2004, 2006, 2010, 2012, 2014, 2016, 201..."
1,"KEYES, ALAN L","[P60003076, S4IL00404]",8,"[2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018]"
2,"CHERRICKS, LIZA DAWN","[H8DE00046, P80003890]",6,"[2008, 2010, 2012, 2014, 2016, 2018]"
3,"KUCINICH, DENNIS J","[H6OH23033, P40002545]",6,"[2004, 2006, 2008, 2010, 2012, 2014]"
4,"PIPKIN, E J","[H8MD01128, S4MD00152]",6,"[2008, 2010, 2012, 2014, 2016, 2018]"
5,"SPECTER, ARLEN","[P60003233, S6PA00100]",6,"[2000, 2002, 2004, 2006, 2008, 2010]"
6,"VAUGHN, CORROGAN R","[P80004237, S0MD00200]",6,"[2008, 2010, 2012, 2014, 2016, 2018]"
7,"BACHMANN, MICHELE","[H6MN06074, P20002978]",5,"[2012, 2014, 2016, 2018, 2020]"
8,"BALDWIN, TAMMY","[H8WI00018, S2WI00219]",5,"[2012, 2014, 2016, 2018, 2020]"
9,"BATES, DON JR","[H2IN06171, S0IN00111]",5,"[2012, 2014, 2016, 2018, 2020]"


### Multiple `cand_id`'s for identical names &ndash; across election cycles ###

In [20]:
%%sql
with shared_cand_name as (
    select cand_name,
           count(distinct cand_id) as num_cand_ids
      from cand
     group by 1
    having count(distinct cand_id) > 1
)
select num_cand_ids,
       count(*) as num_shared_names
  from shared_cand_name
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
6 rows affected.


Unnamed: 0,num_cand_ids,num_shared_names
0,9,1
1,6,2
2,5,4
3,4,12
4,3,93
5,2,1030


In [21]:
%%sql
with shared_cand_name as (
    select cand_name,
           count(distinct cand_id) as num_cand_ids,
           array_agg(distinct cand_id) as cand_ids,
           count(*) as count_records
      from cand
     group by 1
    having count(distinct cand_id) > 2
)
select cand_name,
       num_cand_ids,
       cand_ids
  from shared_cand_name
 order by 2 desc, 1
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.


Unnamed: 0,cand_name,num_cand_ids,cand_ids
0,"GRAYSON, RICHARD",9,"[H4FL09034, H4WY00170, H6FL14088, H6ID02191, H..."
1,"BRYK, WILLIAM",6,"[H0NY18016, H2IN08136, P20001822, S0ID00099, S..."
2,"CARTER, JERRY DEAN",6,"[S0NV00211, S4IA00160, S6CA00493, S8CO00198, S..."
3,"DE LA FUENTE, ROQUE ""ROCKY""",5,"[S8FL00299, S8MN00719, S8VT00166, S8WA00384, S..."
4,"LAROSE, JOSUE",5,"[H0FL19056, H2LA02099, P60004777, S0AZ00343, S..."
5,"MARTIN, ANDY",5,"[H6NH02196, P00003731, S0FL00197, S4IL00362, S..."
6,"SWING, GARY",5,"[H6AZ02213, H8CO01089, P00007674, S0CO00542, S..."
7,"BOSS, JEFF",4,"[H0NJ33010, H6NJ09249, P80005077, S8NJ00418]"
8,"CAMPBELL, TOM",4,"[H0SC05015, H8ND00104, S2CA00351, S8ND00104]"
9,"EVANS, MERVIN LEON",4,"[H0CA32085, H0CA33109, H4CA29026, S4CA00191]"


### To Do ###

Examine the following:

* Multiple `cand_id`'s for *similar* names &ndash; within election cycles
    * Look at names that are strict subsets/supersets of one another (indicating abbreviations, or dropping middle names, etc.)
    * Look at names that only differ by quotation marks (e.g. used for nicknames)
* Multiple `cand_id`'s for *similar* names &ndash; across election cycles
    * Same patterns as above should be investigated
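
As a starting point for these checks, the two patterns could be sketched in Python as follows.  The helper names are hypothetical, and the token-set comparison is only one possible reading of "subset/superset"; the example names come from the outputs above.

```python
import re

def name_tokens(cand_name):
    """Split a name into a set of uppercase tokens, ignoring quotes, commas, and periods."""
    cleaned = re.sub(r'[",.]', " ", cand_name.upper())
    return frozenset(cleaned.split())

def is_strict_subset(name_a, name_b):
    """True if every token of name_a appears in name_b, and name_b has extra tokens."""
    return name_tokens(name_a) < name_tokens(name_b)

def differs_only_by_quotes(name_a, name_b):
    """True if the names differ as written but match once quotation marks are removed."""
    def strip_quotes(name):
        return " ".join(name.replace('"', " ").split())
    return name_a != name_b and strip_quotes(name_a) == strip_quotes(name_b)

print(is_strict_subset("CRENSHAW, ANDER", "CRENSHAW, ANDER HON"))
print(differs_only_by_quotes('CRAMER, ROBERT EDWARD "BUD" JR',
                             "CRAMER, ROBERT EDWARD BUD JR"))
```

Note that a token-set comparison ignores word order and repeated words, and it does not treat an initial as matching a full middle name (`COX, JOHN R.` is not a subset of `COX, JOHN ROBERT`); catching abbreviations would need a prefix comparison on top of this.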

## Summary ##

Based on the above exploration, the following enhancements to the schema are recommended:

* Normalize `cand_name` (and other text fields? \[see below]) during initial load (awk script)
    * Convert to uppercase (for future protection)
    * Collapse consecutive whitespace
* Create `cand_master` table based on unique `cand_id`'s (across election cycles)
    * Include `base_cand_master_id` for associating Candidates with different `cand_id`'s, but identical (*or similar?*) names
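
One way the `base_cand_master_id` association might be populated is sketched below: `cand_id`'s that share an identical name are linked (transitively) with a small union-find, and the smallest ID in each group is used as the base.  Both the function name `assign_base_ids` and the choice of the smallest ID are assumptions made for illustration, and since identical names may belong to different people, the output is best treated as a list of candidates for review.

```python
def assign_base_ids(rows):
    """Map each cand_id to a base id shared by all ids linked through identical names.

    rows: iterable of (cand_id, cand_name) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # keep the smaller id as root so the base id is deterministic
            low, high = sorted((root_a, root_b))
            parent[high] = low

    first_id_for_name = {}
    for cand_id, cand_name in rows:
        find(cand_id)  # register the id even if its name is never shared
        if cand_name in first_id_for_name:
            union(cand_id, first_id_for_name[cand_name])
        else:
            first_id_for_name[cand_name] = cand_id

    return {cand_id: find(cand_id) for cand_id in parent}

# (cand_id, cand_name) pairs taken from the shared-name output above
rows = [
    ("P60003233", "SPECTER, ARLEN"),
    ("S6PA00100", "SPECTER, ARLEN"),
    ("H8WI00018", "BALDWIN, TAMMY"),
    ("S2WI00219", "BALDWIN, TAMMY"),
]
base_ids = assign_base_ids(rows)
print(base_ids)
```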

Other text fields to consider for normalization:

* `cand_st1`
* `cand_st2`
* `cand_city`
* `cand_st`
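
The normalization steps recommended above (uppercase, collapse consecutive whitespace) are simple enough to sketch in a few lines.  The load itself uses an awk script, so the Python below is only a stand-in to pin down the intended behavior; `normalize_text` is a hypothetical name, and trimming leading/trailing whitespace is an extra assumption not stated in the recommendations.

```python
import re

def normalize_text(value):
    """Uppercase, collapse runs of whitespace to one space, and trim the ends."""
    return re.sub(r"\s+", " ", value.upper()).strip()

# made-up input, not taken from the data set
print(normalize_text("Doe,  Jane   Q "))
```

The same function would apply unchanged to `cand_name` and to the address fields listed above.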