# Committee-Candidate Association &ndash; Data Quality #

## Overview ##

The `dq1` and `dq2` notebooks in this directory examined the quality of the Committee and Candidate FEC data sets, respectively.  This notebook looks at the nature and integrity of the association between the two.  There are actually four different relational structures that can represent connections between Committees and Candidates:

* Many-to-one relationship, through the `cmte.cand_id` foreign key
    * Note that this key only pertains to `cmte_tp` value of "H", "S", or "P"
* One-to-many relationship, through the `cand.cand_pcc` foreign key
    * Note: "pcc" stands for "principal campaign committee"
* Many-to-many relationship, through the `cand_cmte` intersect ("link") table
* Many-to-one relationship, through the `cmte_contrib` table (which itself is a many-to-many between `cmte` and itself)

For now, we will only consider the first three of these association mechanisms.  We'll take a look at the fourth relationship if/when we start drilling down on Committee Contributions (i.e. the data in `cmte_contrib`).

Thus, here is a list of the examinations in this notebook:

* Integrity of `cmte.cand_id` foreign key
* Integrity of `cand.cand_pcc` foreign key
* Integrity of `cand_cmte` intersect table
* Consistency of Committees to Candidates, through `cmte.cand_id` foreign key
* Cardinality of Committees to Candidates, through `cand_cmte` intersect table
* Cardinality of Candidates to Committees, through `cmte.cand_id` foreign key
* Cardinality of Candidates to Committees, through `cand_cmte` intersect table
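
To make the first three association mechanisms concrete, here is a minimal sketch against a toy in-memory SQLite database.  The table and key names follow this notebook, but the trimmed column lists, the IDs, and the `count` helper are all made up for illustration; the real tables live in PostgreSQL and carry many more columns.

```python
import sqlite3

# Toy stand-in for the real tables: only the keys discussed above are kept,
# and every row is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cand (cand_id text, elect_cycle int, cand_pcc text);
    create table cmte (cmte_id text, elect_cycle int, cand_id text);
    create table cand_cmte (cand_id text, cmte_id text, elect_cycle int);

    insert into cand values ('H0XX00001', 2020, 'C00000001');
    insert into cmte values ('C00000001', 2020, 'H0XX00001'),
                            ('C00000002', 2020, null);
    insert into cand_cmte values ('H0XX00001', 'C00000001', 2020),
                                 ('H0XX00001', 'C00000002', 2020);
""")

def count(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# 1. Committee -> Candidate through the cmte.cand_id foreign key
via_cmte_fk = count("""
    select count(*)
      from cmte cm
      join cand cn on  cn.cand_id     = cm.cand_id
                   and cn.elect_cycle = cm.elect_cycle
""")

# 2. Candidate -> principal campaign committee through cand.cand_pcc
via_cand_pcc = count("""
    select count(*)
      from cand cn
      join cmte cm on  cm.cmte_id     = cn.cand_pcc
                   and cm.elect_cycle = cn.elect_cycle
""")

# 3. Many-to-many through the cand_cmte intersect table
via_link = count("""
    select count(*)
      from cand_cmte cc
      join cand cn on  cn.cand_id     = cc.cand_id
                   and cn.elect_cycle = cc.elect_cycle
      join cmte cm on  cm.cmte_id     = cc.cmte_id
                   and cm.elect_cycle = cc.elect_cycle
""")

print(via_cmte_fk, via_cand_pcc, via_link)
```

In this toy data the second Committee has no direct `cand_id`, so it is reachable only through the intersect table; that is the kind of gap the examinations below try to quantify.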

## Notebook Setup ##

### Configure database connect info/options ###

Note: the database connect string can be specified on the initial `%sql` command:

```python
database_url = "postgresql+psycopg2://user@localhost/fecdb"
%sql $database_url

```

Otherwise, if no connect string is given to `%sql`, it is taken from the `DATABASE_URL` environment variable:

```python
%sql

```
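
The precedence described above (explicit connect string first, environment variable second) can be sketched in plain Python.  `resolve_database_url` is a hypothetical helper written for this note; it is not how the `%sql` magic is implemented internally.

```python
import os

def resolve_database_url(explicit=None, env=None):
    """Return the explicit connect string if given, else DATABASE_URL from env.

    Hypothetical helper for illustration; returns None if neither is set.
    """
    env = os.environ if env is None else env
    if explicit:
        return explicit
    return env.get("DATABASE_URL")

# An explicit value wins over the environment
print(resolve_database_url("postgresql+psycopg2://user@localhost/fecdb",
                           {"DATABASE_URL": "ignored"}))
```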

In [1]:
%load_ext sql
%config SqlMagic.autopandas=True
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
# connect string taken from DATABASE_URL environment variable
%sql

'Connected: crash@fecdb'

### Configure Python modules ###

In [2]:
import pandas as pd

pd.set_option("display.max_rows", 200)

### Set styling ###

In [3]:
%%html
<style>
  tr, th, td {
    text-align: left !important;
  }
</style>

## Examination of Committee-Candidate Associations ##

### High-level summary ###

#### Replay `cmte` summary stats ####

In [4]:
%%sql result <<
select count(*) as count_total,
       count(distinct cmte_id) as count_distinct_ids,
       count(distinct cmte_nm) as count_distinct_names
  from cmte

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.
Returning data to local variable result


In [5]:
# use positional indexing explicitly (row 0, column N) rather than chained lookups
cmte_count_total    = int(result.iloc[0, 0])
cmte_distinct_ids   = int(result.iloc[0, 1])
cmte_distinct_names = int(result.iloc[0, 2])
"cmte_count_total = %d, cmte_distinct_ids = %d, cmte_distinct_names = %d" % \
    (cmte_count_total, cmte_distinct_ids, cmte_distinct_names)

'cmte_count_total = 138384, cmte_distinct_ids = 45868, cmte_distinct_names = 50614'

#### Replay `cand` summary stats ####

In [6]:
%%sql result <<
select count(*) as count_total,
       count(distinct cand_id) as count_distinct_ids,
       count(distinct cand_name) as count_distinct_names
  from cand

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.
Returning data to local variable result


In [7]:
# use positional indexing explicitly (row 0, column N) rather than chained lookups
cand_count_total    = int(result.iloc[0, 0])
cand_distinct_ids   = int(result.iloc[0, 1])
cand_distinct_names = int(result.iloc[0, 2])
"cand_count_total = %d, cand_distinct_ids = %d, cand_distinct_names = %d" % \
    (cand_count_total, cand_distinct_ids, cand_distinct_names)

'cand_count_total = 56615, cand_distinct_ids = 26243, cand_distinct_names = 26807'

#### Gather `cand_cmte` summary stats ####

Record the total number of records and the counts of distinct foreign key values in the `cand_cmte` intersect table

In [8]:
%%sql result <<
select count(*) as count_total,
       count(distinct cmte_id) as count_cmte,
       count(distinct cand_id) as count_cand
  from cand_cmte

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.
Returning data to local variable result


In [9]:
# use positional indexing explicitly (row 0, column N) rather than chained lookups
cand_cmte_count_total = int(result.iloc[0, 0])
cand_cmte_count_cmte  = int(result.iloc[0, 1])
cand_cmte_count_cand  = int(result.iloc[0, 2])
"cand_cmte_count_total = %d, cand_cmte_count_cmte = %d, cand_cmte_count_cand = %d" % \
    (cand_cmte_count_total, cand_cmte_count_cmte, cand_cmte_count_cand)

'cand_cmte_count_total = 55899, cand_cmte_count_cmte = 22305, cand_cmte_count_cand = 18291'

In [10]:
"cmte_ids represented: %.2f%%, cand_ids represented: %.2f%%" % \
    (cand_cmte_count_cmte / cmte_distinct_ids * 100.0,
     cand_cmte_count_cand / cand_distinct_ids * 100.0)

'cmte_ids represented: 48.63%, cand_ids represented: 69.70%'

### Integrity of `cmte.cand_id` foreign key ###

Let's check the number (and percentage) of null `cand_id`'s, which presumably represent Committees with no direct Candidate association (though they may still have one or more associations through the `cand_cmte` intersect table)

In [11]:
%%sql
select count(*) as null_cand_ids,
       round(count(*)::numeric / :cmte_count_total * 100.0, 2) as pct_null_cand_ids
  from cmte
 where cand_id is null

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,null_cand_ids,pct_null_cand_ids
0,86964,62.84


Same thing, but only looking at distinct `cmte_id`'s (i.e. across election cycles)

In [12]:
%%sql
select count(distinct cmte_id) as distinct_null_cand_ids,
       round(count(distinct cmte_id)::numeric / :cmte_distinct_ids * 100.0, 2) as pct_distinct_null_cand_ids
  from cmte
 where cand_id is null

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,distinct_null_cand_ids,pct_distinct_null_cand_ids
0,23865,52.03


Now let's look for `cand_id`'s that don't join to Candidate records from the same election cycle (`cand` and `cmte` files should be published together by election cycle)

In [13]:
%%sql
select count(*)
  from cmte
 where (cand_id, elect_cycle) not in (select cand_id, elect_cycle from cand)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,18


Of those, let's see how many of the `cand_id`'s just point into outer space (i.e. don't exist in *any* election cycle's Candidate data)

In [14]:
%%sql
select count(*)
  from cmte
 where (cand_id) not in (select cand_id from cand)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,11
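
A note on the `not in` checks above: a null `cand_id` makes the `not in` predicate evaluate to null rather than true, so Committees with no `cand_id` are left out of these counts (which is what we want here, since nulls were counted separately).  The following sketch shows that behavior on a toy SQLite database; the three Committee rows are invented, and `not exists` is included only for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cand (cand_id text);
    create table cmte (cmte_id text, cand_id text);
    insert into cand values ('H1');
    insert into cmte values ('C1', 'H1'),
                            ('C2', null),
                            ('C3', 'H9');
""")
# C1 joins, C2 has no candidate specified, C3 carries a dangling key

# NOT IN: the null cand_id yields NULL (not true), so only C3 is counted
not_in_count = conn.execute("""
    select count(*)
      from cmte
     where cand_id not in (select cand_id from cand)
""").fetchone()[0]

# NOT EXISTS: the null cand_id matches nothing, so C2 and C3 are both counted
not_exists_count = conn.execute("""
    select count(*)
      from cmte cm
     where not exists (select 1 from cand cn where cn.cand_id = cm.cand_id)
""").fetchone()[0]

print(not_in_count, not_exists_count)
```

The same caveat cuts the other way: if the subquery itself ever returned a null `cand_id`, `not in` would match no rows at all.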


Let's take a look at those, and see if there are perhaps other Committee records with the same `cmte_id` that *do* join to Candidate records

In [15]:
%%sql
select cmte.cmte_id,
       cmte.cmte_nm,
       count(*),
       array_agg(distinct cmte.elect_cycle) as elect_cycles,
       array_agg(distinct cmte.cand_id)     as bad_cand_ids,
       array_agg(distinct cand2.cand_id)    as other_cand_ids
  from cmte
  left join cmte cmte2 on  cmte2.cmte_id     = cmte.cmte_id
  left join cand cand2 on  cand2.cand_id     = cmte2.cand_id
                       and cand2.elect_cycle = cmte2.elect_cycle
 where (cmte.cand_id) not in (select cand_id from cand)
 group by 1, 2
 order by 1, 2

 * postgresql+psycopg2://crash@caladan/fecdb
8 rows affected.


Unnamed: 0,cmte_id,cmte_nm,count,elect_cycles,bad_cand_ids,other_cand_ids
0,C00356071,BILL HAAS FOR CONGRESS (1ST DISTRICT MISSOURI),4,"[2000, 2002]",[H0MO01082],[None]
1,C00363937,MIKE GALLAGHER FOR CONGRESS,2,[2000],[H0FL15047],[None]
2,C00625046,COMMITTEE TO ELECT BENJAMIN NOFS,4,"[2016, 2018]",[H6MD10024],[None]
3,C00626838,SHELLY SCHRATZ FOR CONGRESS,1,[2016],[H6NY26123],[None]
4,C00650424,DWIGHT BRADY FOR CONGRESS,1,[2018],[H8MI01156],[None]
5,C00677690,WASHINGTON FOR CONGRESS COMMITTEE,4,"[2018, 2020]",[H8PA13158],[None]
6,C00712851,AMERICAN RELIGIOUS PARTY,1,[2020],[C00713602],[None]
7,C00716191,LOGAN WILDE FOR CONGRESS,1,[2020],[H0UT01122],[None]


Now we'll take a look at the Committee-Candidate joins (through `cmte.cand_id`) that don't match within the election cycles they came with, but do join across disparate election cycles

In [16]:
%%sql
select count(*)
  from cmte
 where (cand_id, elect_cycle) not in (select cand_id, elect_cycle from cand)
   and (cand_id) in (select cand_id from cand)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,7


Let's take a look at those records

In [17]:
%%sql
select cmte.cmte_id,
       cmte.cmte_nm,
       cmte.elect_cycle as cmte_ec,
       cmte.cand_id,
       array_agg(distinct cand.cand_name) as cand_names,
       array_agg(distinct cand.elect_cycle) as cand_elect_cycles
  from cmte
  join cand on cand.cand_id = cmte.cand_id
 where (cmte.cand_id, cmte.elect_cycle) not in (select cand_id, elect_cycle from cand)
 group by 1, 2, 3, 4
 order by 1, 2, 3, 4

 * postgresql+psycopg2://crash@caladan/fecdb
7 rows affected.


Unnamed: 0,cmte_id,cmte_nm,cmte_ec,cand_id,cand_names,cand_elect_cycles
0,C00330043,ROSKAM FOR CONGRESS COMMITTEE,2000,H6IL06117,"[ROSKAM, PETER]","[2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020]"
1,C00342261,PEOPLE FOR AMERICAN LEADERSHIP,2000,P00003392,"[CLINTON, HILLARY RODHAM, CLINTON, HILLARY ROD...","[2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020]"
2,C00390468,DRAFT HILLARY FOR PRESIDENT 2004,2004,P00003392,"[CLINTON, HILLARY RODHAM, CLINTON, HILLARY ROD...","[2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020]"
3,C00400473,HILLARY CLINTON FOR PRESIDENT 2008,2004,P00003392,"[CLINTON, HILLARY RODHAM, CLINTON, HILLARY ROD...","[2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020]"
4,C00459768,JOSE RUIZ FOR CONGRESS COMMITTEE,2012,H0FL19049,"[RUIZ, JOSE M]","[2010, 2014, 2016, 2018]"
5,C00636308,GABRIEL MCARTHUR FOR CONGRESS,2018,H8C006211,"[MCARTHUR, GABRIEL SHAWN MR.]",[2016]
6,C00636308,GABRIEL MCARTHUR FOR CONGRESS,2020,H8C006211,"[MCARTHUR, GABRIEL SHAWN MR.]",[2016]


### Integrity of `cand.cand_pcc` foreign key ###

First, let's take a look at percentage of `cand` records with `cand_pcc` specified, and percentage of those that successfully join to `cmte` records, summarized by `cand_status`

In [18]:
%%sql
select cn.cand_status,
       count(*)                          as count,
       round(count(*)::numeric / :cand_count_total * 100.0, 2)
                                         as pct_of_total,
       count(cn.cand_pcc)                as pcc_cmte_specified,
       round(count(cn.cand_pcc)::numeric / count(*) * 100.0, 2)
                                         as pct_of_count,
       count(cn.cand_pcc) - count(cm.id) as pcc_cmte_not_found,
       round((count(cn.cand_pcc) - count(cm.id))::numeric / count(cn.cand_pcc) * 100.0, 2)
                                         as pct_of_specified
  from cand cn
  left join cmte cm on  cm.cmte_id     = cn.cand_pcc
                    and cm.elect_cycle = cn.elect_cycle
 group by 1
 order by 2 desc

 * postgresql+psycopg2://crash@caladan/fecdb
6 rows affected.


Unnamed: 0,cand_status,count,pct_of_total,pcc_cmte_specified,pct_of_count,pcc_cmte_not_found,pct_of_specified
0,N,24868,43.92,14717,59.18,595,4.04
1,C,17504,30.92,17452,99.7,20,0.11
2,P,13617,24.05,13580,99.73,279,2.05
3,F,617,1.09,615,99.68,1,0.16
4,,8,0.01,8,100.0,0,0.0
5,A,1,0.0,1,100.0,1,100.0


For the ones that join, let's see how many have a non-reciprocal `cand_id` in the PCC `cmte` table

In [19]:
%%sql
select cn.cand_status,
       count(*),
       sum((cm.cand_id != cn.cand_id)::integer) as mismatches,
       round(sum((cm.cand_id != cn.cand_id)::integer)::numeric / count(*) * 100.0, 2) as pct_mismatch
  from cand cn
  join cmte cm on  cm.cmte_id     = cn.cand_pcc
               and cm.elect_cycle = cn.elect_cycle
 group by 1
 order by 2 desc

 * postgresql+psycopg2://crash@caladan/fecdb
5 rows affected.


Unnamed: 0,cand_status,count,mismatches,pct_mismatch
0,C,17432,84,0.48
1,N,14122,91,0.64
2,P,13301,217,1.63
3,F,614,1,0.16
4,,8,0,0.0


And let's list some of those non-reciprocal Candidate records&mdash;first, those whose PCC `cand_id` joins back to a (different) Candidate record with a matching name (only looking at `cand_status` = "C" for now)

In [20]:
%%sql
select cn.elect_cycle,
       cn.cand_id,
       cn.cand_name,
       cn.cand_pcc,
       cm.cmte_nm,
       cm.cand_id as cmte_cand_id,
       cn2.cand_name as cmte_cand_name
  from cand cn
  join cmte cm       on  cm.cmte_id      = cn.cand_pcc
                     and cm.elect_cycle  = cn.elect_cycle
  left join cand cn2 on  cn2.cand_id     = cm.cand_id
                     and cn2.elect_cycle = cm.elect_cycle
 where cn.cand_status in ('C')
   and cm.cand_id != cn.cand_id
   and cn2.cand_name = cn.cand_name
 order by 3, 1
 limit 100

 * postgresql+psycopg2://crash@caladan/fecdb
52 rows affected.


Unnamed: 0,elect_cycle,cand_id,cand_name,cand_pcc,cmte_nm,cmte_cand_id,cmte_cand_name
0,2012,H0MO02148,"AKIN, W TODD",C00343475,TODD AKIN FOR SENATE,S2MO00429,"AKIN, W TODD"
1,2008,H6ME01157,"ALLEN, THOMAS H",C00328245,TOM ALLEN FOR SENATE,S8ME00080,"ALLEN, THOMAS H"
2,2010,H6ME01157,"ALLEN, THOMAS H",C00328245,TOM ALLEN FOR SENATE,S8ME00080,"ALLEN, THOMAS H"
3,2012,S0NV00138,"ANGLE, SHARRON E",C00460758,FRIENDS OF SHARRON ANGLE,H6NV02172,"ANGLE, SHARRON E"
4,2012,H8WI00018,"BALDWIN, TAMMY",C00326801,TAMMY BALDWIN FOR SENATE,S2WI00219,"BALDWIN, TAMMY"
5,2018,H2PA11098,"BARLETTA, LOU",C00445122,LOU BARLETTA FOR SENATE,S8PA00320,"BARLETTA, LOU"
6,2020,P00007492,"BENZEL, JULIANNE ELIZABETH MRS.",C00676320,UNITED FORWARD 2020 BENZEL FOR CONGRESS,H0CA04183,"BENZEL, JULIANNE ELIZABETH MRS."
7,2012,H8NV01071,"BERKLEY, SHELLEY",C00325738,BERKLEY FOR SENATE,S2NV00209,"BERKLEY, SHELLEY"
8,2018,H2TN06030,"BLACKBURN, MARSHA MRS.",C00376939,MARSHA FOR SENATE,S8TN00337,"BLACKBURN, MARSHA MRS."
9,2010,H6MO07128,"BLUNT, ROY",C00304758,FRIENDS OF ROY BLUNT,S0MO00183,"BLUNT, ROY"
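
Many of the rows above look like the same person moving between offices.  Assuming the first character of a `cand_id` encodes the office sought ("H", "S", or "P"), a quick tally of the office change is possible.  The `office_change` helper is hypothetical, and the ID pairs are copied from a few of the rows listed above.

```python
from collections import Counter

def office_change(cand_id, cmte_cand_id):
    """Return (office on the Candidate record, office on the PCC's cand_id).

    Assumes the leading character of a cand_id is the office code.
    """
    return cand_id[0], cmte_cand_id[0]

# (cand.cand_id, cmte.cand_id) pairs taken from the listing above
pairs = [
    ("H0MO02148", "S2MO00429"),  # AKIN, W TODD
    ("H6ME01157", "S8ME00080"),  # ALLEN, THOMAS H
    ("S0NV00138", "H6NV02172"),  # ANGLE, SHARRON E
    ("H8WI00018", "S2WI00219"),  # BALDWIN, TAMMY
    ("P00007492", "H0CA04183"),  # BENZEL, JULIANNE ELIZABETH MRS.
]

changes = Counter(office_change(a, b) for a, b in pairs)
print(changes.most_common())
```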


And now those for which the Candidate specified by the PCC's `cand_id` has a *different* name than the original Candidate record (again, only looking at `cand_status` = "C" for now)

In [21]:
%%sql
select cn.elect_cycle,
       cn.cand_id,
       cn.cand_name,
       cn.cand_pcc,
       cm.cmte_nm,
       cm.cand_id as cmte_cand_id,
       cn2.cand_name as cmte_cand_name
  from cand cn
  join cmte cm       on  cm.cmte_id      = cn.cand_pcc
                     and cm.elect_cycle  = cn.elect_cycle
  left join cand cn2 on  cn2.cand_id     = cm.cand_id
                     and cn2.elect_cycle = cm.elect_cycle
 where cn.cand_status in ('C')
   and cm.cand_id != cn.cand_id
   and cn2.cand_name != cn.cand_name
 order by 3, 1
 limit 100

 * postgresql+psycopg2://crash@caladan/fecdb
31 rows affected.


Unnamed: 0,elect_cycle,cand_id,cand_name,cand_pcc,cmte_nm,cmte_cand_id,cmte_cand_name
0,2018,H8MD03116,ADAM DAVIDSON DEMARCO,C00660282,COMMITTEE TO ELECT ADAM DEMARCO,H8DC03014,"DEMARCO, ADAM 1985"
1,2010,S0CO00302,"BARTON, STEVEN KENT",C00468710,STEVE BARTON CAMPAIGN INC,H0CO01102,"BARTON, STEVEN DR"
2,2016,H4LA07029,"BOUSTANY, CHARLES W. DR. JR.",C00394866,BOUSTANY FOR SENATE INC,S6LA00300,"BOUSTANY, CHARLES W JR DR"
3,2014,H4TN04130,"CARR, JOE",C00541904,JOE CARR FOR SENATE,S4TN00302,"CARR, JOE S"
4,2018,H2MD06195,"DELANEY, JOHN K",C00508416,FRIENDS OF JOHN DELANEY,P00006213,"DELANEY, JOHN K."
5,2012,H4IN02101,"DONNELLY, JOSEPH SIMON MR.",C00393652,DONNELLY FOR INDIANA,S2IN00091,"DONNELLY, JOSEPH S"
6,2012,H0AZ01184,"FLAKE, JEFF MR.",C00347260,JEFF FLAKE FOR US SENATE INC,S2AZ00141,"FLAKE, JEFF"
7,2006,H6TN09043,"FORD, HAROLD JR",C00316141,HAROLD FORD JR FOR TENNESSEE,S6TN00240,"FORD, HAROLD E JR"
8,2010,H6NY20167,"GILLIBRAND, KIRSTEN ELIZABETH MRS.",C00413914,GILLIBRAND FOR SENATE,S0NY00410,"GILLIBRAND, KIRSTEN ELIZABETH"
9,2014,H2GA11149,"GINGREY, PHIL REP.",C00370783,GINGREY FOR SENATE INC,S2GA00100,"GINGREY, J PHILLIP"


### Integrity of `cand_cmte` intersect table ###

Check for bad Committee keys (i.e. if `cand_cmte` has any `cmte_id` values that don't exist in the `cmte` table)

In [22]:
%%sql
select count(*)
  from cand_cmte cc
 where not exists
       (select *
          from cmte cm
         where cm.cmte_id = cc.cmte_id)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,0


Do the same thing, except add `elect_cycle` to the (anti) join

In [23]:
%%sql
select count(*)
  from cand_cmte cc
 where not exists
       (select *
          from cmte cm
         where cm.cmte_id     = cc.cmte_id
           and cm.elect_cycle = cc.elect_cycle)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,0


Check for bad Candidate keys (i.e. if `cand_cmte` has any `cand_id` values that don't exist in the `cand` table)

In [24]:
%%sql
select count(*)
  from cand_cmte cc
 where not exists
       (select *
          from cand cn
         where cn.cand_id = cc.cand_id)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,11


Let's take a look at the Committees involved

In [25]:
%%sql
select cm.elect_cycle,
       cm.cmte_id,
       cm.cmte_nm,
       cm.cmte_tp,
       cm.org_tp,
       cm.cmte_pty_affiliation,
       cc.cand_election_yr,
       cc.fec_election_yr
  from cand_cmte cc
  join cmte cm on  cm.cmte_id     = cc.cmte_id
               and cm.elect_cycle = cc.elect_cycle
 where not exists
       (select *
          from cand cn
         where cn.cand_id = cc.cand_id)
 order by 2, 1

 * postgresql+psycopg2://crash@caladan/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,cmte_id,cmte_nm,cmte_tp,org_tp,cmte_pty_affiliation,cand_election_yr,fec_election_yr
0,2000,C00356071,BILL HAAS FOR CONGRESS (1ST DISTRICT MISSOURI),H,,DEM,2000,2000
1,2002,C00356071,BILL HAAS FOR CONGRESS (1ST DISTRICT MISSOURI),H,,DEM,2000,2002
2,2000,C00363937,MIKE GALLAGHER FOR CONGRESS,H,,DEM,2002,2000
3,2016,C00625046,COMMITTEE TO ELECT BENJAMIN NOFS,H,,GRN,2016,2016
4,2018,C00625046,COMMITTEE TO ELECT BENJAMIN NOFS,H,,GRN,2016,2018
5,2016,C00626838,SHELLY SCHRATZ FOR CONGRESS,H,,,2016,2016
6,2018,C00650424,DWIGHT BRADY FOR CONGRESS,H,,DEM,2018,2018
7,2018,C00677690,WASHINGTON FOR CONGRESS COMMITTEE,H,,,2018,2018
8,2020,C00677690,WASHINGTON FOR CONGRESS COMMITTEE,H,,,2018,2020
9,2020,C00712851,AMERICAN RELIGIOUS PARTY,X,,UNK,2019,2020


Let's check for bad Candidate keys again, except this time we'll add `elect_cycle` to the (anti) join

In [26]:
%%sql
select count(*)
  from cand_cmte cc
 where not exists
       (select *
          from cand cn
         where cn.cand_id     = cc.cand_id
           and cn.elect_cycle = cc.elect_cycle)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,18


And now take a look at the Committees involved

In [27]:
%%sql
select cm.elect_cycle,
       cm.cmte_id,
       cm.cmte_nm,
       cm.cmte_tp,
       cm.org_tp,
       cm.cmte_pty_affiliation,
       cc.cand_election_yr,
       cc.fec_election_yr
  from cand_cmte cc
  join cmte cm on  cm.cmte_id     = cc.cmte_id
               and cm.elect_cycle = cc.elect_cycle
 where not exists
       (select *
          from cand cn
         where cn.cand_id     = cc.cand_id
           and cn.elect_cycle = cc.elect_cycle)
 order by 2, 1

 * postgresql+psycopg2://crash@caladan/fecdb
18 rows affected.


Unnamed: 0,elect_cycle,cmte_id,cmte_nm,cmte_tp,org_tp,cmte_pty_affiliation,cand_election_yr,fec_election_yr
0,2000,C00330043,ROSKAM FOR CONGRESS COMMITTEE,H,,REP,1998,2000
1,2000,C00342261,PEOPLE FOR AMERICAN LEADERSHIP,U,,,2000,2000
2,2000,C00356071,BILL HAAS FOR CONGRESS (1ST DISTRICT MISSOURI),H,,DEM,2000,2000
3,2002,C00356071,BILL HAAS FOR CONGRESS (1ST DISTRICT MISSOURI),H,,DEM,2000,2002
4,2000,C00363937,MIKE GALLAGHER FOR CONGRESS,H,,DEM,2002,2000
5,2004,C00390468,DRAFT HILLARY FOR PRESIDENT 2004,U,,DEM,2004,2004
6,2004,C00400473,HILLARY CLINTON FOR PRESIDENT 2008,U,,DEM,2004,2004
7,2012,C00459768,JOSE RUIZ FOR CONGRESS COMMITTEE,H,,DEM,2012,2012
8,2016,C00625046,COMMITTEE TO ELECT BENJAMIN NOFS,H,,GRN,2016,2016
9,2018,C00625046,COMMITTEE TO ELECT BENJAMIN NOFS,H,,GRN,2016,2018


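Let's see how many `cand_cmte` records fail to join to `cmte`, to `cand`, or to both.  Note that joining on the ID columns alone matches each link record against every election cycle's `cmte` and `cand` rows, so this query counts joined rows rather than distinct `cand_cmte` records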

In [28]:
%%sql
with bad_cand_cmte as (
    select cc.*
      from cand_cmte cc
      left join cmte cm on cm.cmte_id = cc.cmte_id
      left join cand cn on cn.cand_id = cc.cand_id
     where cm.id is null
        or cn.id is null
)
select count(*)
  from bad_cand_cmte

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,18
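
One thing to keep in mind when reading the count above: a join on `cmte_id` alone can match a single link record to several `cmte` rows (one per election cycle), so the result is a count of joined rows.  The sketch below shows that fan-out on invented data in SQLite; counting `distinct cc.id` would be one way to get back to link records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cmte (id integer primary key, cmte_id text, elect_cycle int);
    create table cand (id integer primary key, cand_id text, elect_cycle int);
    create table cand_cmte (id integer primary key, cmte_id text,
                            cand_id text, elect_cycle int);

    -- one committee present in two cycles, one link record with a dangling cand_id
    insert into cmte (cmte_id, elect_cycle) values ('C1', 2018), ('C1', 2020);
    insert into cand_cmte (cmte_id, cand_id, elect_cycle) values ('C1', 'H9', 2018);
""")

# Join on IDs only: the single link record fans out to both cmte rows
joined_rows, distinct_links = conn.execute("""
    select count(*), count(distinct cc.id)
      from cand_cmte cc
      left join cmte cm on cm.cmte_id = cc.cmte_id
      left join cand cn on cn.cand_id = cc.cand_id
     where cm.id is null
        or cn.id is null
""").fetchone()

# Join on IDs plus elect_cycle: one joined row per link record
cycle_rows = conn.execute("""
    select count(*)
      from cand_cmte cc
      left join cmte cm on  cm.cmte_id     = cc.cmte_id
                        and cm.elect_cycle = cc.elect_cycle
      left join cand cn on  cn.cand_id     = cc.cand_id
                        and cn.elect_cycle = cc.elect_cycle
     where cm.id is null
        or cn.id is null
""").fetchone()[0]

print(joined_rows, distinct_links, cycle_rows)
```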


And let's do the same thing, adding `elect_cycle` to the joins.  This is largely academic: the previous count already equals the number of bad Candidate keys found in the by-election-cycle examination above, so this query should return the same result

In [29]:
%%sql
with bad_cand_cmte as (
    select cc.*
      from cand_cmte cc
      left join cmte cm on  cm.cmte_id     = cc.cmte_id
                        and cm.elect_cycle = cc.elect_cycle
      left join cand cn on  cn.cand_id     = cc.cand_id
                        and cn.elect_cycle = cc.elect_cycle
     where cm.id is null
        or cn.id is null
)
select count(*)
  from bad_cand_cmte

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,18


Now let's check to see if there are duplicate entries in `cand_cmte`, meaning identical values for `cmte_id`, `cand_id`, and `elect_cycle`

In [30]:
%%sql
with cand_cmte_dup as (
    select cmte_id,
           cand_id,
           elect_cycle,
           count(*) dups,
           array_agg(id)
      from cand_cmte
     group by 1, 2, 3
    having count(*) > 1
)
select elect_cycle,
       dups,
       count(*)
  from cand_cmte_dup
 group by 1, 2
 order by 1, 2 desc

 * postgresql+psycopg2://crash@caladan/fecdb
8 rows affected.


Unnamed: 0,elect_cycle,dups,count
0,2000,2,1
1,2008,2,23
2,2010,2,40
3,2012,2,32
4,2014,2,53
5,2016,2,14
6,2018,2,60
7,2020,2,28


Note that there may be a legitimate reason for these duplications; let's see if/how these records differ from each other in columns other than `cmte_id`, `cand_id`, and `elect_cycle`.  We'll take the first 25 examples as a sampling to inspect...

In [31]:
%%sql
with cand_cmte_dup as (
    select cmte_id,
           cand_id,
           elect_cycle,
           count(*) dups,
           array_agg(id) as cand_cmte_pks
      from cand_cmte
     group by 1, 2, 3
    having count(*) > 1
     limit 25
)
select *
  from cand_cmte
 where id in
       (select unnest(cand_cmte_pks)
          from cand_cmte_dup)
 order by cand_id, cmte_id, elect_cycle

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.


Unnamed: 0,id,cand_id,cand_election_yr,fec_election_yr,cmte_id,cmte_tp,cmte_dsgn,linkage_id,elect_cycle
0,30928,H0CA10099,2010,2010,C00461103,H,P,154748,2010
1,30929,H0CA10099,2009,2010,C00461103,H,P,157564,2010
2,30937,H0CA10131,2009,2010,C00461947,H,P,155488,2010
3,30938,H0CA10131,2010,2010,C00461947,H,P,155492,2010
4,30943,H0CA10172,2010,2010,C00463257,H,P,156812,2010
5,30944,H0CA10172,2009,2010,C00463257,H,P,156761,2010
6,30969,H0CA15171,2009,2010,C00458117,H,P,153200,2010
7,30970,H0CA15171,2010,2010,C00458117,H,P,154590,2010
8,211,H0FL09107,2019,2020,C00709188,H,P,228108,2020
9,212,H0FL09107,2020,2020,C00709188,H,P,228337,2020


Looks like it may be due to `cand_election_yr`, so let's factor that into the equation and see to what extent that is true

In [32]:
%%sql
with cand_cmte_dup as (
    select cmte_id,
           cand_id,
           cand_election_yr,
           elect_cycle,
           count(*) dups,
           array_agg(id)
      from cand_cmte
     group by 1, 2, 3, 4
    having count(*) > 1
)
select elect_cycle,
       dups,
       count(*)
  from cand_cmte_dup
 group by 1, 2
 order by 1, 2 desc

 * postgresql+psycopg2://crash@caladan/fecdb
2 rows affected.


Unnamed: 0,elect_cycle,dups,count
0,2018,2,2
1,2020,2,2
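
The effect of widening the grouping key can be reproduced on a small scale.  This is a sketch on invented rows in SQLite; `dup_groups` is a hypothetical helper that counts how many key combinations occur more than once for a given column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cand_cmte (id integer primary key, cmte_id text, cand_id text,
                            cand_election_yr int, elect_cycle int);
    insert into cand_cmte (cmte_id, cand_id, cand_election_yr, elect_cycle)
    values ('C1', 'H1', 2009, 2010),
           ('C1', 'H1', 2010, 2010),
           ('C2', 'H2', 2018, 2018),
           ('C2', 'H2', 2018, 2018),
           ('C3', 'H3', 2020, 2020);
""")

def dup_groups(key_cols):
    """Count key combinations that appear more than once in cand_cmte."""
    sql = f"""
        select count(*)
          from (select 1
                  from cand_cmte
                 group by {key_cols}
                having count(*) > 1)
    """
    return conn.execute(sql).fetchone()[0]

# C1/H1 and C2/H2 both repeat within an election cycle
dups_3col = dup_groups("cmte_id, cand_id, elect_cycle")

# Only C2/H2 still repeats once cand_election_yr is part of the key
dups_4col = dup_groups("cmte_id, cand_id, cand_election_yr, elect_cycle")

print(dups_3col, dups_4col)
```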


And let's see what the remaining exceptions are

In [33]:
%%sql
with cand_cmte_dup as (
    select cmte_id,
           cand_id,
           cand_election_yr,
           elect_cycle,
           count(*) dups,
           array_agg(id) as cand_cmte_pks
      from cand_cmte
     group by 1, 2, 3, 4
    having count(*) > 1
     limit 25
)
select *
  from cand_cmte
 where id in
       (select unnest(cand_cmte_pks)
          from cand_cmte_dup)
 order by cand_id, cmte_id, elect_cycle

 * postgresql+psycopg2://crash@caladan/fecdb
8 rows affected.


Unnamed: 0,id,cand_id,cand_election_yr,fec_election_yr,cmte_id,cmte_tp,cmte_dsgn,linkage_id,elect_cycle
0,7927,H8GA04117,2018,2018,C00654699,H,P,217917,2018
1,7926,H8GA04117,2018,2018,C00654699,H,P,217918,2018
2,2213,H8GA04117,2018,2020,C00654699,H,P,223620,2020
3,2212,H8GA04117,2018,2020,C00654699,H,P,223619,2020
4,8672,H8NJ07256,2018,2018,C00690123,O,U,221986,2018
5,8673,H8NJ07256,2018,2018,C00690123,O,U,221985,2018
6,2698,H8NJ07256,2018,2020,C00690123,O,U,224070,2020
7,2697,H8NJ07256,2018,2020,C00690123,O,U,224071,2020


### Consistency of Committees to Candidates, through `cmte.cand_id` foreign key ###

We first check `cmte` for `cmte_id`'s with different `cand_id`'s within any election cycle

In [34]:
%%sql
with ec_cmte_cand_diffs as (
    select elect_cycle,
           cmte_id,
           count(distinct cand_id) as distinct_cand_ids,
           array_agg(distinct cand_id) as cand_ids
      from cmte
     group by 1, 2
    having count(distinct cand_id) > 1
)
select elect_cycle,
       distinct_cand_ids,
       count(*) as ec_cmte_ids
  from ec_cmte_cand_diffs
 group by 1, 2
 order by 1, 2 desc

 * postgresql+psycopg2://crash@caladan/fecdb
0 rows affected.


Since there are none, let's now check for `cmte_id`'s with different `cand_id`'s *across* election cycles

In [35]:
%%sql
with cmte_cand_diffs as (
    select cmte_id,
           count(distinct cand_id) as distinct_cand_ids,
           array_agg(distinct cand_id) as cand_ids
      from cmte
     group by 1
    having count(distinct cand_id) > 1
)
select distinct_cand_ids,
       count(*) as cmte_ids
  from cmte_cand_diffs
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
2 rows affected.


Unnamed: 0,distinct_cand_ids,cmte_ids
0,3,3
1,2,228


Let's capture some of the instances, so we can visually inspect the disparate Candidate information for a specific Committee (as identified by `cmte_id`)

In [36]:
%%sql result <<
with cmte_cand_diffs as (
    select cmte_id,
           count(distinct cand_id) as distinct_cand_ids,
           array_agg(distinct cand_id) as cand_ids
      from cmte
     group by 1
    having count(distinct cand_id) > 1
)
select *
  from cmte_cand_diffs
 order by 2 desc
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.
Returning data to local variable result


We'll take a look at the first ten instances...

In [37]:
# use positional indexing explicitly (row N, column 0) rather than chained lookups
cmte_id0 = result.iloc[0, 0]
cmte_id1 = result.iloc[1, 0]
cmte_id2 = result.iloc[2, 0]
cmte_id3 = result.iloc[3, 0]
cmte_id4 = result.iloc[4, 0]
cmte_id5 = result.iloc[5, 0]
cmte_id6 = result.iloc[6, 0]
cmte_id7 = result.iloc[7, 0]
cmte_id8 = result.iloc[8, 0]
cmte_id9 = result.iloc[9, 0]
result[0:9]

Unnamed: 0,cmte_id,distinct_cand_ids,cand_ids
0,C00354597,3,"[H0NJ10117, H2FL11075, P40003345]"
1,C00501197,3,"[H2TX16185, P00010793, S8TX00285]"
2,C00567677,3,"[H0NY19139, H2AZ08102, H4CA45097]"
3,C00254938,2,"[H2DE00072, S0DE00068, None]"
4,C00266932,2,"[H2GA08038, S2GA00118]"
5,C00267708,2,"[H2FL16041, S4FL00207]"
6,C00264697,2,"[H2OH13033, S6OH00163]"
7,C00303552,2,"[H2NJ06053, S6NJ00164]"
8,C00304758,2,"[H6MO07128, S0MO00183]"


We sort by `cmte_id`, so the variation in Candidate ID and name will be in adjacent rows.  It is interesting to see whether `cand_id` just switches at a point in time (election cycle), or whether multiple values are used haphazardly between cycles

In [38]:
%%sql
select cmte.cmte_id,
       cand.cand_id,
       cand.cand_name,
       array_agg(distinct cand.elect_cycle) as elect_cycles
  from cmte
  join cand on  cand.cand_id     = cmte.cand_id
            and cand.elect_cycle = cmte.elect_cycle
 where cmte.cmte_id in (:cmte_id0, :cmte_id1, :cmte_id2, :cmte_id3, :cmte_id4,
                        :cmte_id5, :cmte_id6, :cmte_id7, :cmte_id8, :cmte_id9)
 group by 1, 2, 3
 order by 1, 2, 3

 * postgresql+psycopg2://crash@caladan/fecdb
28 rows affected.


Unnamed: 0,cmte_id,cand_id,cand_name,elect_cycles
0,C00254938,H2DE00072,"CASTLE, MICHAEL N","[2000, 2002, 2004, 2006, 2008]"
1,C00254938,S0DE00068,"CASTLE, MICHAEL N","[2010, 2012]"
2,C00264697,H2OH13033,"BROWN, SHERROD","[2000, 2002, 2004]"
3,C00264697,S6OH00163,"BROWN, SHERROD","[2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020]"
4,C00266932,H2GA08038,"CHAMBLISS, SAXBY",[2000]
5,C00266932,S2GA00118,"CHAMBLISS, C SAXBY","[2012, 2014, 2016, 2018, 2020]"
6,C00266932,S2GA00118,"CHAMBLISS, SAXBY","[2002, 2004, 2006, 2008, 2010]"
7,C00267708,H2FL16041,"DEUTSCH, PETER R","[2000, 2002]"
8,C00267708,S4FL00207,"DEUTSCH, PETER","[2004, 2006, 2008]"
9,C00303552,H2NJ06053,"SMITH, ROBERT G","[2002, 2004]"
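
The "clean switch versus haphazard" question can also be answered mechanically once the per-cycle `cand_id` values are in hand.  Below is a minimal sketch; `classify_cand_id_history` and its three labels are hypothetical, the first history is taken from the `C00254938` rows above, and the second is invented to show the interleaved case.

```python
def classify_cand_id_history(history):
    """Classify a Committee's (elect_cycle, cand_id) history.

    Returns "stable" if one cand_id is used throughout, "clean switch" if
    each cand_id occupies a single contiguous run of cycles, and
    "interleaved" if any cand_id reappears after a different one.
    """
    ordered = [cand_id for _, cand_id in sorted(history)]
    switches = sum(1 for prev, cur in zip(ordered, ordered[1:]) if prev != cur)
    if switches == 0:
        return "stable"
    if switches == len(set(ordered)) - 1:
        return "clean switch"
    return "interleaved"

# C00254938 (CASTLE, MICHAEL N), as listed above
castle = [(2000, "H2DE00072"), (2002, "H2DE00072"), (2004, "H2DE00072"),
          (2006, "H2DE00072"), (2008, "H2DE00072"),
          (2010, "S0DE00068"), (2012, "S0DE00068")]

# Invented history in which an earlier cand_id comes back
made_up = [(2000, "H0XX00001"), (2002, "S0XX00001"), (2004, "H0XX00001")]

print(classify_cand_id_history(castle), classify_cand_id_history(made_up))
```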


### Cardinality of Committees to Candidates, through `cand_cmte` intersect table ###

See how many `cmte` records join (or attempt to join) to `cand` through `cand_cmte`

In [39]:
%%sql
select count(*) as link_recs,
       round(count(*)::numeric / :cmte_count_total * 100.0, 2) as pct_link_recs,
       count(cn.id) as cand_recs,
       round(count(cn.id)::numeric / :cmte_count_total * 100.0, 2) as pct_cand_recs
  from cmte cm
  join cand_cmte cc on  cc.cmte_id     = cm.cmte_id
                    and cc.elect_cycle = cm.elect_cycle
  left join cand cn on  cn.cand_id     = cc.cand_id
                    and cn.elect_cycle = cc.elect_cycle

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,link_recs,pct_link_recs,cand_recs,pct_cand_recs
0,55899,40.39,55881,40.38


See how many `cmte` records join to multiple `cand_cmte` ("link") and `cand` records, and in what cardinality; show the `cmte.id` ("pk") values with the highest number of link/Candidate associations

In [41]:
%%sql
with multi_cand_cmte as (
    select cm.id as cmte_pk,
           count(*) as link_recs,
           count(cn.id) as cand_recs,
           array_agg(distinct cn.cand_id) as cand_ids
      from cmte cm
      join cand_cmte cc on  cc.cmte_id     = cm.cmte_id
                        and cc.elect_cycle = cm.elect_cycle
      left join cand cn on  cn.cand_id     = cc.cand_id
                        and cn.elect_cycle = cc.elect_cycle
     group by 1
    having count(*) > 1
)
select link_recs,
       cand_recs,
       count(*)           as num_cmtes,
       array_agg(cmte_pk) as cmte_pks
  from multi_cand_cmte
 group by 1, 2
 order by 1 desc, 2 desc, 3 desc

 * postgresql+psycopg2://crash@caladan/fecdb
24 rows affected.


Unnamed: 0,link_recs,cand_recs,num_cmtes,cmte_pks
0,43,43,2,"[57967, 74923]"
1,31,31,3,"[71683, 87658, 131737]"
2,22,22,8,"[95639, 106101, 107390, 107653, 116269, 117119..."
3,21,21,2,"[61867, 87328]"
4,20,20,11,"[73669, 83802, 86615, 90337, 95512, 99832, 998..."
5,19,19,5,"[43255, 58784, 63241, 76005, 98290]"
6,18,18,3,"[87582, 106839, 136132]"
7,17,17,1,[109356]
8,16,16,11,"[4073, 11334, 18671, 25086, 38725, 57412, 7413..."
9,15,15,4,"[42135, 59902, 63442, 77472]"


Now let's take a look at the actual Committee records at the top of the list, and validate that the `cand_id` for the record (if specified) is actually in the `cand_cmte` link association as well

In [42]:
%%sql
with multi_cand_cmte as (
    select cm.id as cmte_pk,
           count(*) as link_recs,
           count(cn.id) as cand_recs,
           array_agg(distinct cn.cand_id) as cand_ids
      from cmte cm
      join cand_cmte cc on  cc.cmte_id     = cm.cmte_id
                        and cc.elect_cycle = cm.elect_cycle
      left join cand cn on  cn.cand_id     = cc.cand_id
                        and cn.elect_cycle = cc.elect_cycle
     group by 1
    having count(*) > 1
)
select cm.id as cmte_pk,
       cm.elect_cycle,
       cm.cmte_id,
       cm.cmte_nm,
       cm.cand_id,
       mcc.link_recs,
       cm.cand_id = any(mcc.cand_ids) as cand_id_link_rec
  from multi_cand_cmte mcc
  join cmte cm on cm.id = mcc.cmte_pk
 order by 6 desc
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.


Unnamed: 0,cmte_pk,elect_cycle,cmte_id,cmte_nm,cand_id,link_recs,cand_id_link_rec
0,74923,2012,C00501825,JARED POLIS VICTORY FUND 2012,S2IN00091,43,True
1,57967,2014,C00501825,JARED POLIS VICTORY FUND 2012,S2IN00091,43,True
2,71683,2012,C00461913,JARED POLIS VICTORY FUND,S0CO00211,31,True
3,131737,2000,C00242941,REPUBLICAN SENATORIAL INNER CIRCLE 1990,S0AK00063,31,True
4,87658,2010,C00461913,JARED POLIS VICTORY FUND,S0CO00211,31,True
5,127679,2002,C00376608,W/N 2002 C0MMITTEE,S0IA00077,22,True
6,117119,2004,C00386342,FRONTLINE DEMOCRATS,H0GA08032,22,True
7,119032,2004,C00405977,2004 JOINT CANDIDATE COMMITTEE II,S2SD00068,22,True
8,106101,2006,C00386342,FRONTLINE DEMOCRATS,H0GA08032,22,True
9,95639,2008,C00396226,SENATE MAJORITY COMMITTEE,S4AZ00030,22,True
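
The `cand_id_link_rec` column above is just a set-membership test: is the Committee's direct `cand_id` one of the Candidates reached through the link table? Here is a minimal sketch of that check in plain Python. The rows and IDs are made up for illustration (they are not from the FEC data), and the helper name is hypothetical.

```python
# Hypothetical toy rows: (cmte_pk, direct cand_id on cmte, cand_ids reached via cand_cmte)
cmte_rows = [
    (1, "S0XX00001", {"S0XX00001", "H0XX00002"}),  # direct key is among the linked Candidates
    (2, "H0XX00003", {"H0XX00004", "H0XX00005"}),  # direct key is missing from the link table
    (3, None,        {"H0XX00006", "H0XX00007"}),  # no direct key at all
]

def cand_id_link_rec(direct_cand_id, linked_cand_ids):
    """Mirror `cm.cand_id = any(mcc.cand_ids)`; None stands in for SQL NULL."""
    if direct_cand_id is None:
        return None  # NULL = ANY(...) is NULL in SQL, neither true nor false
    return direct_cand_id in linked_cand_ids

flags = {pk: cand_id_link_rec(direct, linked) for pk, direct, linked in cmte_rows}
print(flags)
```

A `None` here corresponds to a Committee with no `cmte.cand_id` at all, which is the case the next query looks for explicitly.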


See if there are any `cmte` records with multiple `cand` associations through the link table but no direct join through `cmte.cand_id`

In [43]:
%%sql
with multi_cand_cmte as (
    select cm.id as cmte_pk,
           count(*) as link_recs,
           count(cn.id) as cand_recs,
           array_agg(distinct cn.cand_id) as cand_ids
      from cmte cm
      join cand_cmte cc on  cc.cmte_id     = cm.cmte_id
                        and cc.elect_cycle = cm.elect_cycle
      left join cand cn on  cn.cand_id     = cc.cand_id
                        and cn.elect_cycle = cc.elect_cycle
     group by 1
    having count(*) > 1
)
select mcc.link_recs,
       mcc.cand_recs,
       count(*),
       array_agg(mcc.cmte_pk) as cmte_pks
  from multi_cand_cmte mcc
  join cmte cm on cm.id = mcc.cmte_pk
 where cm.cand_id is null
 group by 1, 2
 order by 1 desc, 2 desc, 3 desc

 * postgresql+psycopg2://crash@caladan/fecdb
0 rows affected.


See if there are any `cmte` records with a single `cand` association through the link table but a mismatching direct join through `cmte.cand_id`

In [44]:
%%sql
with single_cand_cmte as (
    select cm.id as cmte_pk,
           count(*) as link_recs,
           sum((cm.cand_id != cc.cand_id)::integer) as mismatches
      from cmte cm
      join cand_cmte cc on  cc.cmte_id     = cm.cmte_id
                        and cc.elect_cycle = cm.elect_cycle
     group by 1
    having count(*) = 1
)
select count(*) as link_recs,
       sum(scc.mismatches) as mismatches,
       round(sum(scc.mismatches)::numeric / count(*) * 100.0, 2) as pct_mismatch
  from single_cand_cmte scc

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,link_recs,mismatches,pct_mismatch
0,49756,12,0.02
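
One caveat on the mismatch count: `cm.cand_id != cc.cand_id` evaluates to NULL (not true) when `cmte.cand_id` is NULL, and `sum()` skips NULLs, so a Committee with no direct key is not counted as a mismatch. The sketch below illustrates this using SQLite as a stand-in for PostgreSQL, with made-up rows; SQLite's `is not` plays the role of PostgreSQL's `is distinct from` (a NULL-safe comparison).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table cmte      (id integer primary key, cmte_id text, elect_cycle int, cand_id text);
create table cand_cmte (id integer primary key, cmte_id text, elect_cycle int, cand_id text);
-- hypothetical rows: C1 matches, C2 mismatches, C3 has no direct key
insert into cmte      values (1, 'C1', 2020, 'H1'), (2, 'C2', 2020, 'H2'), (3, 'C3', 2020, NULL);
insert into cand_cmte values (1, 'C1', 2020, 'H1'), (2, 'C2', 2020, 'H9'), (3, 'C3', 2020, 'H3');
""")

row = conn.execute("""
    select count(*)                          as link_recs,
           sum(cm.cand_id != cc.cand_id)     as mismatches,          -- NULL comparison is skipped
           sum(cm.cand_id is not cc.cand_id) as mismatches_nullsafe  -- NULL counts as a difference
      from cmte cm
      join cand_cmte cc on  cc.cmte_id     = cm.cmte_id
                        and cc.elect_cycle = cm.elect_cycle
""").fetchone()
print(row)
```

Whether a missing `cmte.cand_id` should count as a mismatch is a judgment call; the query above takes the narrower reading.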


And look for any direct Committee-Candidate joins (using `cmte.cand_id`) that don't have a corresponding association through `cand_cmte`

In [45]:
%%sql
with cmte_cand_join as (
    select cm.*
      from cmte cm
      join cand cn           on  cn.cand_id     = cm.cand_id
                             and cn.elect_cycle = cm.elect_cycle
      left join cand_cmte cc on  cc.cmte_id     = cm.cmte_id
                             and cc.cand_id     = cm.cand_id
                             and cc.elect_cycle = cm.elect_cycle
     where cc.id is null
)
select count(*)
  from cmte_cand_join

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,count
0,56


List them in descending chronological order

In [46]:
%%sql
with cmte_cand_join as (
    select cm.*,
           cn.cand_name,
           cn.cand_pty_affiliation,
           cn.cand_election_yr
      from cmte cm
      join cand cn           on  cn.cand_id     = cm.cand_id
                             and cn.elect_cycle = cm.elect_cycle
      left join cand_cmte cc on  cc.cmte_id     = cm.cmte_id
                             and cc.cand_id     = cm.cand_id
                             and cc.elect_cycle = cm.elect_cycle
     where cc.id is null
)
select elect_cycle,
       cmte_id,
       cmte_nm,
       cand_name,
       cand_pty_affiliation,
       cand_election_yr
  from cmte_cand_join
 order by 1 desc, 3
 limit 100

 * postgresql+psycopg2://crash@caladan/fecdb
56 rows affected.


Unnamed: 0,elect_cycle,cmte_id,cmte_nm,cand_name,cand_pty_affiliation,cand_election_yr
0,2020,C00709600,ANAMO (AMERICA NEEDS A MAKEOVER),MICHEL ANISSA POWELL DR,DEM,2020
1,2020,C00693267,BLACK LABEL EMPIRE (HOUSE OF LORDS) CYBER UNITS,"MCGEE, ANTONIO HON.",W,2019
2,2020,C00583567,COMMITTEE TO ELECT ROBERT MARSHALL,"MARSHALL, ROBERT",REP,2020
3,2020,C00573105,COMMITTEE TO INSURE CIVIL & CONSTITUTIONAL RIGHTS,"SMITH, JOSEPH DR",REP,1988
4,2020,C00610022,FINANCE COMMITTEE FOR DOUGLAS HOWARD PIERCE,"PIERCE, DOUGLAS HOWARD",DEM,2018
5,2020,C00664243,JAZMINA SAAVEDRA FOR US CONGRESS,"SAAVEDRA, JAZMINA",REP,2018
6,2020,C00550863,MARIANNE WILLIAMSON FOR CONGRESS,"WILLIAMSON, MARIANNE",NNE,2014
7,2020,C00684381,M HUDSON HALE COMMITTEE,"HUDSON HALE, MICHELLE R",DEM,2018
8,2020,C00612051,NEW FUTURE FOR MATTHEW KALUS THALER,"THALER, MATTHEW KALUS",REP,2018
9,2020,C00716381,PATRONS OF ROBERT E SMITH,"SMITH, ROBERT EUGENE",REP,2020


### Cardinality of Candidates to Committees, through `cmte.cand_id` foreign key ###

First see if there are any Candidate records with no directly associated Committee (through the `cmte.cand_id` foreign key)

In [47]:
%%sql result <<
select count(*) as orphan_count,
       round(count(*)::numeric / :cand_count_total * 100.0, 2) as pct_total_cands,
       count(distinct cn.cand_id) orphans_distinct,
       round(count(distinct cn.cand_id)::numeric / :cand_distinct_ids * 100.0, 2) as pct_distinct_cands
  from cand cn
 where not exists
       (select *
          from cmte cm
         where cm.cand_id = cn.cand_id)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.
Returning data to local variable result


In [48]:
orphan_count     = int(result.loc[0, "orphan_count"])
orphans_distinct = int(result.loc[0, "orphans_distinct"])
result

Unnamed: 0,orphan_count,pct_total_cands,orphans_distinct,pct_distinct_cands
0,9904,17.49,7973,30.38


Let's see if the unassociated nature of these Candidate records has something to do with their "status"

In [49]:
%%sql
select cand_status,
       count(*) as orphan_count,
       round(count(*)::numeric / :orphan_count * 100.0, 2) as pct_total_orphans,
       count(distinct cn.cand_id) orphans_distinct,
       round(count(distinct cn.cand_id)::numeric / :orphans_distinct * 100.0, 2) as pct_distinct_orphans
  from cand cn
 where not exists
       (select *
          from cmte cm
         where cm.cand_id = cn.cand_id)
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
5 rows affected.


Unnamed: 0,cand_status,orphan_count,pct_total_orphans,orphans_distinct,pct_distinct_orphans
0,A,1,0.01,1,0.01
1,C,35,0.35,34,0.43
2,F,2,0.02,2,0.03
3,N,9805,99.0,7916,99.29
4,P,61,0.62,37,0.46


For the ones with `cand_status` = "C" ("statutory" candidate), let's see what their PCCs (principal campaign committees) look like; in particular, whether they have a `cand_id` defined and, if so, how those `cand` records relate to the current orphaned record

In [50]:
%%sql
select cn.elect_cycle,
       cn.cand_id,
       cn.cand_name,
       cn.cand_pcc,
       cm.cmte_nm,
       cm.cand_id as cmte_cand_id,
       cn2.cand_name as cmte_cand_name
  from cand cn
  left join cmte cm  on  cm.cmte_id      = cn.cand_pcc
                     and cm.elect_cycle  = cn.elect_cycle
  left join cand cn2 on  cn2.cand_id     = cm.cand_id
                     and cn2.elect_cycle = cm.elect_cycle
 where not exists
       (select *
          from cmte cm
         where cm.cand_id = cn.cand_id)
   and cn.cand_status = 'C'
 order by cn2.id > 0, cm.id > 0, cn.cand_pcc is null, 1 desc, 3

 * postgresql+psycopg2://crash@caladan/fecdb
35 rows affected.


Unnamed: 0,elect_cycle,cand_id,cand_name,cand_pcc,cmte_nm,cmte_cand_id,cmte_cand_name
0,2018,H8MD03116,ADAM DAVIDSON DEMARCO,C00660282,COMMITTEE TO ELECT ADAM DEMARCO,H8DC03014,"DEMARCO, ADAM 1985"
1,2018,S8MI00356,"EPSTEIN, LENA ROSE",C00641498,LENA FOR CONGRESS,H8MI11320,"EPSTEIN, LENA ROSE"
2,2018,H8MD02118,"MATORY, LIZ",C00583781,LIZ MATORY FOR CONGRESS,H6MD08507,"MATORY, LIZ"
3,2018,S8PA00338,"WRIGHT, THERESA MICHELLE",C00659953,CAMPAIGN FOR THERESA WRIGHT,H8PA07259,"WRIGHT, THERESA"
4,2016,S6FL00376,"GRAYSON, ALAN MARK",C00424713,COMMITTEE TO ELECT ALAN GRAYSON,H6FL08213,"GRAYSON, ALAN MARK"
5,2016,S6FL00350,"JOLLY, DAVID W",C00551572,FRIENDS OF DAVID JOLLY,H4FL13101,"JOLLY, DAVID W."
6,2016,P60006954,"LYNCH, DENNIS MICHAEL",C00576074,DML FOR AMERICA,P60006814,"LYNCH, DENNIS M"
7,2016,H6IL18153,"MELLON, ROBERT",C00582460,ROB MELLON FOR CONGRESS,H4IL18109,"MELLON, ROB"
8,2016,S6NY00383,"SPOTORNO, FRANK",C00579599,SPOTORNO FOR AMERICA,H6NY14244,"SPOTORNO, FRANK"
9,2014,S4MT00084,"STAPLETON, COREY C",C00542068,WWW.COREYSTAPLETON.COM,H4MT01033,"STAPLETON, COREY"


Now we'll do the same thing (looking at Candidate records with no `cmte.cand_id` association), with `elect_cycle` added to the (anti) join

In [51]:
%%sql
select count(*) as orphan_count,
       count(distinct cn.cand_id) orphans_distinct
  from cand cn
 where not exists
       (select *
          from cmte cm
         where cm.cand_id     = cn.cand_id
           and cm.elect_cycle = cn.elect_cycle)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,orphan_count,orphans_distinct
0,11114,8853
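
The orphan count goes up once `elect_cycle` is part of the anti-join because a Candidate record can be "rescued" in the looser query by a Committee record from a different cycle. A minimal sketch of the effect, using SQLite as a stand-in for PostgreSQL and hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table cand (id integer primary key, cand_id text, elect_cycle int);
create table cmte (id integer primary key, cmte_id text, cand_id text, elect_cycle int);
-- hypothetical rows: H1 appears in two cycles, but its Committee only in 2018
insert into cand (cand_id, elect_cycle) values ('H1', 2018), ('H1', 2020), ('H2', 2020);
insert into cmte (cmte_id, cand_id, elect_cycle) values ('C1', 'H1', 2018);
""")

# Anti-join on cand_id only: any cycle's Committee record counts
loose = conn.execute("""
    select count(*) from cand cn
     where not exists (select 1 from cmte cm
                        where cm.cand_id = cn.cand_id)
""").fetchone()[0]

# Anti-join on cand_id and elect_cycle: the Committee must be in the same cycle
strict = conn.execute("""
    select count(*) from cand cn
     where not exists (select 1 from cmte cm
                        where cm.cand_id     = cn.cand_id
                          and cm.elect_cycle = cn.elect_cycle)
""").fetchone()[0]

print(loose, strict)
```

In the toy data, the 2020 record for `H1` is only an orphan under the stricter, same-cycle definition.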


Again, we'll summarize by `cand_status`

In [52]:
%%sql
select cand_status,
       count(*) as orphan_count,
       count(distinct cn.cand_id) orphans_distinct
  from cand cn
 where not exists
       (select *
          from cmte cm
         where cm.cand_id     = cn.cand_id
           and cm.elect_cycle = cn.elect_cycle)
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
6 rows affected.


Unnamed: 0,cand_status,orphan_count,orphans_distinct
0,A,1,1
1,C,112,109
2,F,4,4
3,N,10595,8537
4,P,400,266
5,,2,2


And list the ones with `cand_status` = "C" ("statutory" candidate), along with their PCC information (including associated Candidate for the Committee)

In [53]:
%%sql
select cn.elect_cycle,
       cn.cand_id,
       cn.cand_name,
       cn.cand_pcc,
       cm.cmte_nm,
       cm.cand_id as cmte_cand_id,
       cn2.cand_name as cmte_cand_name
  from cand cn
  left join cmte cm  on  cm.cmte_id      = cn.cand_pcc
                     and cm.elect_cycle  = cn.elect_cycle
  left join cand cn2 on  cn2.cand_id     = cm.cand_id
                     and cn2.elect_cycle = cm.elect_cycle
 where not exists
       (select *
          from cmte cm
         where cm.cand_id     = cn.cand_id
           and cm.elect_cycle = cn.elect_cycle)
   and cn.cand_status = 'C'
 order by cn2.id > 0, cm.id > 0, cn.cand_pcc is null, 1 desc, 3

 * postgresql+psycopg2://crash@caladan/fecdb
112 rows affected.


Unnamed: 0,elect_cycle,cand_id,cand_name,cand_pcc,cmte_nm,cmte_cand_id,cmte_cand_name
0,2020,P00007492,"BENZEL, JULIANNE ELIZABETH MRS.",C00676320,UNITED FORWARD 2020 BENZEL FOR CONGRESS,H0CA04183,"BENZEL, JULIANNE ELIZABETH MRS."
1,2018,H8MD03116,ADAM DAVIDSON DEMARCO,C00660282,COMMITTEE TO ELECT ADAM DEMARCO,H8DC03014,"DEMARCO, ADAM 1985"
2,2018,H2TN06030,"BLACKBURN, MARSHA MRS.",C00376939,MARSHA FOR SENATE,S8TN00337,"BLACKBURN, MARSHA MRS."
3,2018,H0ND01026,"CRAMER, KEVIN MR.",C00504704,CRAMER FOR SENATE,S8ND00120,"CRAMER, KEVIN MR."
4,2018,H2MD06195,"DELANEY, JOHN K",C00508416,FRIENDS OF JOHN DELANEY,P00006213,"DELANEY, JOHN K."
5,2018,S8MI00356,"EPSTEIN, LENA ROSE",C00641498,LENA FOR CONGRESS,H8MI11320,"EPSTEIN, LENA ROSE"
6,2018,H8NC08109,"HUFFMAN, SCOTT",C00663229,HUFFMAN 2018 CAMPAIGN COMMITTEE,H8NC08091,"HUFFMAN, JEFFREY SCOTT"
7,2018,H4WV03070,"JENKINS, EVAN H",C00548271,JENKINS FOR SENATE,S8WV00127,"JENKINS, EVAN H"
8,2018,H8MD02118,"MATORY, LIZ",C00583781,LIZ MATORY FOR CONGRESS,H6MD08507,"MATORY, LIZ"
9,2018,H2TX16185,"O'ROURKE, ROBERT BETO",C00501197,BETO FOR TEXAS,S8TX00285,"O'ROURKE, ROBERT (BETO)"


Now let's look at all of the cardinalities of Candidate to Committee (minus orphans); we count the number of Candidates with various numbers of direct associations (through `cmte.cand_id`)

In [54]:
%%sql
with cand_cmte_card as (
    select cn.cand_id,
           count(distinct cm.cmte_id) as num_cmtes
      from cand cn
      left join cmte cm on cm.cand_id = cn.cand_id
     group by 1
    having count(distinct cm.cmte_id) > 0
)
select num_cmtes as assoc_cmtes,
       count(*)  as cand_count,
       round(count(*)::numeric / (:cand_distinct_ids - :orphans_distinct) * 100.0, 2) as pct_non_orphans
    from cand_cmte_card
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
28 rows affected.


Unnamed: 0,assoc_cmtes,cand_count,pct_non_orphans
0,43,1,0.01
1,39,1,0.01
2,34,1,0.01
3,31,1,0.01
4,30,1,0.01
5,29,2,0.01
6,25,1,0.01
7,24,1,0.01
8,23,1,0.01
9,22,3,0.02
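
For reference, the cardinality query is a two-level aggregation: first count distinct Committees per Candidate, then count Candidates per Committee-count. The same logic can be sketched in plain Python; the pairs below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (cand_id, cmte_id) pairs; repeats represent multiple election cycles
pairs = [
    ("H1", "C1"), ("H1", "C1"),  # same Committee in two cycles counts once
    ("H1", "C2"),
    ("H2", "C3"),
    ("H3", "C4"), ("H3", "C5"),
]

# Level 1: distinct Committees per Candidate (count(distinct cm.cmte_id) ... group by cand_id)
cmtes_by_cand = defaultdict(set)
for cand_id, cmte_id in pairs:
    cmtes_by_cand[cand_id].add(cmte_id)

# Level 2: number of Candidates for each Committee-count (group by num_cmtes)
histogram = Counter(len(cmtes) for cmtes in cmtes_by_cand.values())
non_orphans = len(cmtes_by_cand)

for num_cmtes, cand_count in sorted(histogram.items(), reverse=True):
    print(num_cmtes, cand_count, round(cand_count / non_orphans * 100.0, 2))
```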


Let's see who the candidates are that have 10 or more directly associated Committees, including the office(s) they are running for

In [55]:
%%sql
with cand_cmte_card as (
    select cn.cand_id,
           count(distinct cm.cmte_id) as num_cmtes
      from cand cn
      join cmte cm on cm.cand_id = cn.cand_id
     group by 1
    having count(distinct cm.cmte_id) >= 10
)
select cn.cand_id,
       cn.cand_name,
       array_agg(distinct cn.cand_pty_affiliation) as party,
       ccc.num_cmtes,
       array_agg(distinct concat('''', cn.cand_election_yr, ' ', cn.cand_office_st, ' ', cn.cand_office, ''''))
           as "race (year state office)"
  from cand_cmte_card ccc
  join cand cn on cn.cand_id = ccc.cand_id
 group by 1, 2, 4
 order by 4 desc

 * postgresql+psycopg2://crash@caladan/fecdb
86 rows affected.


Unnamed: 0,cand_id,cand_name,party,num_cmtes,race (year state office)
0,P00003335,"BUSH, GEORGE W",[REP],43,"['2000 US P', '2004 US P']"
1,S0NH00219,"SHAHEEN, JEANNE",[DEM],39,"['2002 NH S', '2008 NH S', '2014 NH S', '2020 ..."
2,P80003338,"OBAMA, BARACK",[DEM],34,"['2008 US P', '2012 US P']"
3,P80003338,"OBAMA, BARACK / JOSEPH R. BIDEN",[DEM],34,['2012 US P']
4,P20000527,"NADER, RALPH","[IND, UNK]",31,"['2000 US P', '2004 US P', '2008 US P']"
5,P60002458,"JEWELL, ROGER H",[REP],30,['2008 US P']
6,P60002458,"JEWELL, ROGER HENRY","[NNE, REP]",30,['2016 US P']
7,P80000235,"KERRY, JOHN F",[DEM],29,['2004 US P']
8,P80002801,"MCCAIN, JOHN S",[REP],29,"['2000 US P', '2008 US P']"
9,P80002801,"MCCAIN, JOHN S.",[REP],29,['2008 US P']


### Cardinality of Candidates to Committees, through `cand_cmte` intersect table ###

Now see how many Candidates do not have a Committee association through the `cand_cmte` intersect table

In [56]:
%%sql result <<
select count(distinct cand_id) orphans_distinct,
       round(count(distinct cand_id)::numeric / :cand_distinct_ids * 100.0, 2) as pct_distinct_cands
  from cand
 where cand_id not in
       (select cand_id
          from cand_cmte)

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.
Returning data to local variable result


In [57]:
orphans_distinct = int(result.loc[0, "orphans_distinct"])
result

Unnamed: 0,orphans_distinct,pct_distinct_cands
0,7960,30.33
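
Note that the `not in` form used here only works as intended when the subquery returns no NULLs; a single NULL `cand_cmte.cand_id` would make the predicate NULL for every non-matching row and the orphan count would collapse to zero. The non-zero result above suggests that is not happening in this data, but the `not exists` form used earlier is the safer habit. A minimal sketch of the difference, using SQLite as a stand-in for PostgreSQL and hypothetical rows (including a deliberately NULL key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table cand      (cand_id text);
create table cand_cmte (cand_id text, cmte_id text);
insert into cand      values ('H1'), ('H2');
insert into cand_cmte values ('H1', 'C1'), (NULL, 'C2');  -- hypothetical NULL key
""")

# H2 has no link record, but `H2 not in ('H1', NULL)` is NULL, so the row is filtered out
not_in = conn.execute("""
    select count(distinct cand_id) from cand
     where cand_id not in (select cand_id from cand_cmte)
""").fetchone()[0]

# The correlated anti-join is unaffected by the NULL
not_exists = conn.execute("""
    select count(distinct cn.cand_id) from cand cn
     where not exists (select 1 from cand_cmte cc
                        where cc.cand_id = cn.cand_id)
""").fetchone()[0]

print(not_in, not_exists)
```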


Let's see if the unassociated nature of these Candidate records has something to do with their "status"

In [58]:
%%sql
select cand_status,
       count(distinct cand_id) orphans_distinct,
       round(count(distinct cand_id)::numeric / :orphans_distinct * 100.0, 2) as pct_distinct_orphans
  from cand
 where cand_id not in
       (select cand_id
          from cand_cmte)
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
5 rows affected.


Unnamed: 0,cand_status,orphans_distinct,pct_distinct_orphans
0,A,1,0.01
1,C,30,0.38
2,F,2,0.03
3,N,7910,99.37
4,P,33,0.41


Now let's look at all of the cardinalities of Candidate to Committee (minus orphans); we count the number of Candidates with various numbers of associations through `cand_cmte`

In [59]:
%%sql
with cand_cmte_card as (
    select cn.cand_id,
           count(distinct cm.cmte_id) as num_cmtes
      from cand cn
      left join cand_cmte cc on cc.cand_id = cn.cand_id
      left join cmte cm      on cm.cmte_id = cc.cmte_id
     group by 1
    having count(distinct cm.cmte_id) > 0
)
select num_cmtes as assoc_cmtes,
       count(*)  as cand_count,
       round(count(*)::numeric / (:cand_distinct_ids - :orphans_distinct) * 100.0, 2) as pct_non_orphans
    from cand_cmte_card
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
39 rows affected.


Unnamed: 0,assoc_cmtes,cand_count,pct_non_orphans
0,51,1,0.01
1,47,1,0.01
2,44,1,0.01
3,43,2,0.01
4,41,1,0.01
5,38,1,0.01
6,35,1,0.01
7,34,1,0.01
8,32,2,0.01
9,31,1,0.01


Let's see who the candidates are that have 20 or more associated Committees (through `cand_cmte`), including the office(s) they are running for

In [60]:
%%sql
with cand_cmte_card as (
    select cn.cand_id,
           count(distinct cm.cmte_id) as num_cmtes
      from cand cn
      join cand_cmte cc on cc.cand_id = cn.cand_id
      join cmte cm      on cm.cmte_id = cc.cmte_id
     group by 1
    having count(distinct cm.cmte_id) >= 20
)
select cn.cand_id,
       cn.cand_name,
       array_agg(distinct cn.cand_pty_affiliation) as party,
       ccc.num_cmtes,
       array_agg(distinct concat('''', cn.cand_election_yr, ' ', cn.cand_office_st, ' ', cn.cand_office, ''''))
           as "race (year state office)"
  from cand_cmte_card ccc
  join cand cn on cn.cand_id = ccc.cand_id
 group by 1, 2, 4
 order by 4 desc

 * postgresql+psycopg2://crash@caladan/fecdb
55 rows affected.


Unnamed: 0,cand_id,cand_name,party,num_cmtes,race (year state office)
0,S6MO00305,"MCCASKILL, CLAIRE",[DEM],51,"['2006 MO S', '2012 MO S', '2018 MO S']"
1,P00003335,"BUSH, GEORGE W",[REP],47,"['2000 US P', '2004 US P']"
2,S8NC00239,"HAGAN, KAY R",[DEM],44,"['2008 NC S', '2014 NC S']"
3,S0NH00219,"SHAHEEN, JEANNE",[DEM],43,"['2002 NH S', '2008 NH S', '2014 NH S', '2020 ..."
4,S6LA00227,"LANDRIEU, MARY L",[DEM],43,"['2002 LA S', '2008 LA S', '2014 LA S']"
5,S6MN00267,"KLOBUCHAR, AMY",[DEM],41,['2018 MN S']
6,S6MN00267,"KLOBUCHAR, AMY J","[DEM, DFL]",41,"['2006 MN S', '2012 MN S', '2024 MN S']"
7,S2WI00219,"BALDWIN, TAMMY",[DEM],38,"['2012 WI S', '2018 WI S', '2024 WI S']"
8,P80003338,"OBAMA, BARACK",[DEM],35,"['2008 US P', '2012 US P']"
9,P80003338,"OBAMA, BARACK / JOSEPH R. BIDEN",[DEM],35,['2012 US P']


## Summary ##

### Findings ###

* `cmte.cand_id` foreign key
    * High integrity, more than 99.9% of keys successfully join to `cand`
    * In a small number of instances (less than 0.5%), the `cmte.cand_id` key changes for a Committee record (identified by `cmte.cmte_id`) across election cycles; visual examination shows that in almost all cases, it is between Candidate records with identical or very similar names (and thus, the same underlying person)
    * For Candidate records not referenced by a Committee through a direct join ("orphans"), the vast majority of them (over 99%) have a `cand_status` value of "N"
        * Of the "orphans" with `cand_status` of "C", the reverse join through `cand.cand_pcc` successfully identifies a valid Committee about 50-70% of the time (depending on how the orphan is identified), including a closely matching Candidate record (joined to the Committee through `cmte.cand_id`)
* `cand.cand_pcc` foreign key
    * High integrity (99.7% of keys successfully join to `cmte`) when `cand.cand_status` has values of "C" (statutory candidate), "P" (prior statutory candidate), or "F" (future statutory candidate)
    * Low integrity (59% of keys successfully join to `cmte`) when `cand.cand_status` has a value of "N" (not yet a statutory candidate)
    * For successful joins across all `cand.cand_status` values, there is a matching reciprocal join from `cmte` to `cand` through `cmte.cand_id` more than 99% of the time
        * Looking at `cand.cand_status` = "C", the non-reciprocal joins (only 0.5% of cases) went to other Candidate records with the **same name** 63% of the time, and Candidate records with **different names** (though almost always similar) 37% of the time
* `cand_cmte` intersect table
    * Integrity of the foreign keys is high, 100% valid for `cmte_id` and 99.97% valid for `cand_id`
    * 0.5% of the records have duplicate `cmte_id` and `cand_id` associations within an election cycle; further analysis shows that it is almost entirely due to differing values for `cand_cmte.cand_election_yr` (not sure how this field is used&mdash;note that many of the diffs contain a value of an odd-numbered year)
    * 3.2% of Committee records that join to Candidate through `cand_cmte` join to multiple Candidate records; of those, 81% join to 2, 3, or 4 Candidates (though we don't currently know how those Candidate records relate to one another&mdash;e.g. whether they represent the same, or disparate, real-world people)
    * In almost all cases of joins between Committee and Candidate through `cand_cmte`, the direct join through `cmte.cand_id` matches one of the intersect table records
        * In the case of one-to-one associations through `cand_cmte`, the match rate is 99.98%
    * For Candidate records not referenced by a Committee through the intersect table, the vast majority of them (over 99%) have a `cand_status` value of "N" (as with the direct association using `cmte.cand_id`)

### Recommendations ###

Based on the above exploration and findings, the following enhancements to the schema are recommended:

* Create `cand_cmte_mstr` table based on unique `cmte_id`/`cand_id` associations (across election cycles)
    * Still need to explore a little more whether specificity based on any other `cand_cmte` attributes is warranted (currently leaning against this idea)
* Create "Base Candidate" associations (through `cand_mstr.base_cand_mstr_id`, see `dc2` notebook) based on Candidate records that are closely related through various relationships with common Committee records
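
As a rough illustration of the first recommendation, here is one way the unique `cmte_id`/`cand_id` associations could be collapsed across election cycles. This is only a sketch under assumptions: it uses SQLite as a stand-in for PostgreSQL, the rows are made up, and the `first_cycle`, `last_cycle`, and `link_recs` columns are hypothetical additions, not part of any settled design for `cand_cmte_mstr`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table cand_cmte (id integer primary key, cand_id text, cmte_id text, elect_cycle int);
-- hypothetical link records; (C1, H1) recurs across two cycles
insert into cand_cmte (cand_id, cmte_id, elect_cycle) values
    ('H1', 'C1', 2018), ('H1', 'C1', 2020), ('H1', 'C2', 2020), ('H2', 'C3', 2020);

-- one row per unique cmte_id/cand_id pair, regardless of cycle
create table cand_cmte_mstr as
select cmte_id,
       cand_id,
       min(elect_cycle) as first_cycle,  -- hypothetical summary columns
       max(elect_cycle) as last_cycle,
       count(*)         as link_recs
  from cand_cmte
 group by cmte_id, cand_id;
""")

mstr_rows = conn.execute("""
    select cmte_id, cand_id, first_cycle, last_cycle, link_recs
      from cand_cmte_mstr
     order by 1, 2
""").fetchall()
print(mstr_rows)
```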
