# Committee Reference &ndash; Data Quality #

## Overview ##

Each Committee in the FEC data set is assigned a unique ID (`cmte_id`).  However, because we are combining Committee records from multiple election-cycle source files, the same `cmte_id` can appear once per cycle, so joins to `cmte` should use both `cmte_id` and `elect_cycle`.  The purpose of this notebook is to explore the "quality" of the Committee data, both within and across election cycles, to see how consistent it is and how we can create a unified Committee Master entity to improve referential integrity for the larger, complete data set.

Here is a list of the examinations in this notebook:

* Quality of Committee Names (`cmte_nm`)
* Integrity of Committee ID (`cmte_id`)
* Consistency of names for `cmte_id`'s across election cycles
* Multiple `cmte_id`'s for identical names &ndash; within election cycles
* Multiple `cmte_id`'s for identical names &ndash; across election cycles

Here are additional examinations to do:

* Multiple `cmte_id`'s for *similar* names &ndash; within election cycles
* Multiple `cmte_id`'s for *similar* names &ndash; across election cycles
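
As a minimal sketch of why joins to `cmte` should use both `cmte_id` and `elect_cycle`, the following uses an in-memory SQLite database as a stand-in for the PostgreSQL `fecdb`. The `contrib` table, its columns, and all rows are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cmte (cmte_id text, elect_cycle integer, cmte_nm text);
    -- hypothetical referencing table, not part of the notebook
    create table contrib (cmte_id text, elect_cycle integer, amount integer);
    insert into cmte values
        ('C00000001', 2016, 'EXAMPLE PAC'),
        ('C00000001', 2018, 'EXAMPLE POLITICAL ACTION COMMITTEE');
    insert into contrib values ('C00000001', 2018, 500);
""")

# Joining on cmte_id alone matches one cmte row per election cycle
join_on_id_only = conn.execute("""
    select c.amount, m.elect_cycle, m.cmte_nm
      from contrib c
      join cmte m on m.cmte_id = c.cmte_id
     order by m.elect_cycle
""").fetchall()

# Joining on both columns matches only the record for the same cycle
join_on_both = conn.execute("""
    select c.amount, m.elect_cycle, m.cmte_nm
      from contrib c
      join cmte m on  m.cmte_id     = c.cmte_id
                  and m.elect_cycle = c.elect_cycle
""").fetchall()

print(len(join_on_id_only), len(join_on_both))
```

With the id-only join, the single hypothetical contribution row fans out to one result per cycle in which the Committee appears, which would inflate any aggregate computed over it.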

## Notebook Setup ##

### Configure database connect info/options ###

Note: the database connect string can be specified on the initial `%sql` command:

```python
database_url = "postgresql+psycopg2://user@localhost/fecdb"
%sql $database_url

```

Or, if no connect string is specified for `%sql`, it is taken from the `DATABASE_URL` environment variable:

```python
%sql

```
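
The lookup order described above can be sketched in plain Python. This only illustrates the fallback logic and is not how the `%sql` magic is implemented; `resolve_database_url` is a hypothetical helper name, and the example URLs are placeholders.

```python
import os

def resolve_database_url(explicit_url=None, environ=None):
    """Return the explicit connect string if given, else DATABASE_URL from the environment."""
    environ = os.environ if environ is None else environ
    if explicit_url:
        return explicit_url
    if "DATABASE_URL" in environ:
        return environ["DATABASE_URL"]
    raise RuntimeError("no connect string given and DATABASE_URL is not set")

# hypothetical environment, passed in explicitly so the example has no side effects
example_env = {"DATABASE_URL": "postgresql+psycopg2://user@localhost/fecdb"}
from_env = resolve_database_url(environ=example_env)
from_arg = resolve_database_url("postgresql+psycopg2://user@otherhost/fecdb", example_env)
```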

In [1]:
%load_ext sql
%config SqlMagic.autopandas=True
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
# connect string taken from DATABASE_URL environment variable
%sql

'Connected: crash@fecdb'

### Configure Python modules ###

In [2]:
import pandas as pd

pd.set_option("display.max_rows", 200)

### Set styling ###

In [3]:
%%html
<style>
  tr, th, td {
    text-align: left !important;
  }
</style>

## Examination of Committee Data Set ##

### High-level summary ###

First, count total records, distinct `cmte_id`'s, and distinct `cmte_nm`'s (and save the results for reference in later queries)

In [4]:
%%sql result <<
select count(*) as count_total,
       count(distinct cmte_id) as count_distinct_ids,
       count(distinct cmte_nm) as count_distinct_names
  from cmte

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.
Returning data to local variable result


In [5]:
# positional access via .iloc; integer keys on a label-indexed row (result.loc[0][0]) are deprecated in newer pandas
cmte_count_total    = int(result.iloc[0, 0])
cmte_distinct_ids   = int(result.iloc[0, 1])
cmte_distinct_names = int(result.iloc[0, 2])
"cmte_count_total = %d, cmte_distinct_ids = %d, cmte_distinct_names = %d" % \
    (cmte_count_total, cmte_distinct_ids, cmte_distinct_names)

'cmte_count_total = 138384, cmte_distinct_ids = 45868, cmte_distinct_names = 50614'

### Quality of Committee Names (`cmte_nm`) ###

Let's try to get a sense of the extent of formatting problems (inconsistencies or flaws).  First, look for names that have lowercase letters (uppercase is now the standard)...

In [6]:
%%sql
select elect_cycle,
       count(*)
  from cmte
 where cmte_nm ~ '[a-z]'
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
7 rows affected.


Unnamed: 0,elect_cycle,count
0,2004,4
1,2006,3
2,2008,11
3,2010,3
4,2012,2
5,2014,2
6,2016,1


In [7]:
%%sql
select cmte_nm,
       array_agg(distinct elect_cycle)
  from cmte
 where cmte_nm ~ '[a-z]'
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
15 rows affected.


Unnamed: 0,cmte_nm,array_agg
0,Catholics United,"[2010, 2012, 2014, 2016]"
1,Defenders of Wildlife Action Fund 527,[2008]
2,Defenders of Willdife Action Fund,[2008]
3,Environment Colorado,[2008]
4,Environment New Jersey,[2008]
5,Free Enterprise Fund Committee,[2006]
6,League of Conservation Voters Inc,[2008]
7,Let Freedom Ring,[2008]
8,NARAL Pro-Choice America,"[2004, 2008, 2010, 2012, 2014]"
9,Planned Parenthood Action Fund Inc.,"[2004, 2006, 2008, 2010]"


Next, look for names with consecutive whitespace...

In [8]:
%%sql
select elect_cycle,
       count(*)
  from cmte
 where cmte_nm ~ '\s{2,}'
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,count
0,2000,6
1,2002,33
2,2004,61
3,2006,84
4,2008,86
5,2010,80
6,2012,78
7,2014,68
8,2016,85
9,2018,90


In [9]:
%%sql
select cmte_nm,
       array_agg(distinct elect_cycle)
  from cmte
 where cmte_nm ~ '\s{2,}'
 group by 1
 order by 1
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.


Unnamed: 0,cmte_nm,array_agg
0,ACADIAN AMBULANCE SERVICE EMPLOYEE FEDERAL PO...,"[2018, 2020]"
1,ADVANCING CHIROPRACTIC PAC (FEDERATION OF STR...,"[2010, 2012, 2014, 2016]"
2,ALASKA FEDERATION OF REPUBLICAN WOMEN,"[2014, 2016, 2018]"
3,ALEXANDER AND BALDWIN INC FEDPAC (A,"[2004, 2006, 2008]"
4,AMEGY BANK NATIONAL ASSOC. POLITICAL ACTION ...,"[2006, 2008, 2010, 2012]"
5,AMERICAN BEVERAGE ASSOCIATION POLITICAL ACTION...,"[2006, 2008]"
6,AMERICAN CLARITY AND EXCEPTIONALISM (ACE PAC),"[2018, 2020]"
7,AMERICAN FEDERATION OF STATE COUNTY & MUNICIPA...,"[2008, 2010, 2012, 2014, 2016, 2018, 2020]"
8,AMERICAN PODIATRIC MEDICAL ASSN. INC. PODIATR...,[2006]
9,AMERICAN RELIGIOUS PARTY,[2020]


And now, names with errant spaces within parentheses...

In [10]:
%%sql
select elect_cycle,
       count(*)
  from cmte
 where cmte_nm ~ '\( | \)'
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,count
0,2000,1
1,2002,2
2,2004,1
3,2006,1
4,2008,3
5,2010,3
6,2012,6
7,2014,5
8,2016,7
9,2018,7


In [11]:
%%sql
select cmte_nm,
       array_agg(distinct elect_cycle)
  from cmte
 where cmte_nm ~ '\( | \)'
 group by 1
 order by 1
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
11 rows affected.


Unnamed: 0,cmte_nm,array_agg
0,BIPARTISAN VOLUNTARY PUBLIC AFFAIRS COMMITTEE ...,"[2012, 2014, 2016]"
1,CALIFORNIA FARM BUREAU FEDERATION FUND TO PROT...,"[2010, 2012, 2016, 2018, 2020]"
2,CALIFORNIA FARM BUREAU FUND TO PROTECT THE FAM...,[2008]
3,JOHN MANEELY COMPANY FAIR TRADE POLITICAL ACTI...,"[2008, 2010, 2012, 2014, 2016, 2018]"
4,MANTECH INTERNATIONAL CORPORATION POLITICAL AC...,"[2016, 2018, 2020]"
5,MANTECH INTERNATIONAL CORPORATION POLITICAL AC...,"[2012, 2014]"
6,SALT RIVER VALLEY WATER USERS' ASSOCIATION POL...,"[2002, 2004, 2006, 2008, 2010, 2012, 2014, 201..."
7,SEIU CALIFORNIA STATE COUNCIL (NONPROFIT 501 (...,"[2018, 2020]"
8,STATE OF HAWAII ORGANIZATION OF POLICE OFFICER...,"[2012, 2014, 2016, 2018, 2020]"
9,SYNTEX ( U S A ) LLC EMPLOYEES PAC,"[2000, 2002]"


And inconsistent spacing around commas (either space before, or no space after)

In [12]:
%%sql
select elect_cycle,
       count(*)
  from cmte
 where cmte_nm ~ ' ,|,[^ ]'
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,count
0,2000,3
1,2002,3
2,2004,6
3,2006,2
4,2008,2
5,2010,5
6,2012,4
7,2014,6
8,2016,6
9,2018,8


In [13]:
%%sql
select cmte_nm,
       array_agg(distinct elect_cycle)
  from cmte
 where cmte_nm ~ ' ,|,[^ ]'
 group by 1
 order by 1
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
22 rows affected.


Unnamed: 0,cmte_nm,array_agg
0,"10,000 LAKES VICTORY 2014",[2014]
1,"BAKER, MANOCK & JENSEN , A PROFESSIONAL CORPOR...",[2010]
2,"BSA , THE SOFTWARE ALLIANCE PAC","[2014, 2016, 2018, 2020]"
3,"CHAD LEE FOR CONGRESS,INC.","[2012, 2014]"
4,"CITIZENS FOR MIKE ASSAD,INC","[2014, 2016]"
5,CLOROX COMPANY EMPLOYEES' POLITICAL ACTION COM...,"[2000, 2002, 2004, 2006, 2008]"
6,"CONTI FOR CONGRESS,INC.",[2004]
7,"DAVID WOOD , MAKE AMERICA RIGHTEOUS AGAIN",[2020]
8,"DEMOCRACYONE,INC.",[2020]
9,"DISTRICT 1199J,NUHHCE,AFSCME,AFL-CIO","[2010, 2012, 2014, 2016, 2018, 2020]"
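
The four name checks in this section can be collected into one sketch using Python's `re` module. The patterns are copied from the queries above; the labels and the `name_flaws` helper are hypothetical names introduced here, and the sketch assumes these patterns behave the same under Python's regex engine as under PostgreSQL's `~` operator.

```python
import re

# Patterns taken from the queries above; labels are illustrative only
NAME_CHECKS = {
    "lowercase": re.compile(r"[a-z]"),
    "consecutive_whitespace": re.compile(r"\s{2,}"),
    "space_inside_parens": re.compile(r"\( | \)"),
    "comma_spacing": re.compile(r" ,|,[^ ]"),
}

def name_flaws(cmte_nm):
    """Return the labels of all checks that the given name fails."""
    return [label for label, pattern in NAME_CHECKS.items() if pattern.search(cmte_nm)]

# Names taken from the outputs above
for name in ["Catholics United",
             "SYNTEX ( U S A ) LLC EMPLOYEES PAC",
             "CHAD LEE FOR CONGRESS,INC.",
             "10,000 LAKES VICTORY 2014"]:
    print(name, name_flaws(name))
```

Note that the comma check also flags the thousands separator in `10,000 LAKES VICTORY 2014`, as it does in the query output, so any automated fix for comma spacing would need to allow for that case.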


### Integrity of Committee ID (`cmte_id`) ###

Count records across election cycles and see if we have any null `cmte_id`'s (all zeros would be good)

In [14]:
%%sql
select elect_cycle,
       count(*)                  as records,
       count(*) - count(cmte_id) as null_cmte_ids
  from cmte
 group by 1
 order by 1

 * postgresql+psycopg2://crash@caladan/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,records,null_cmte_ids
0,2000,9577,0
1,2002,9103,0
2,2004,9322,0
3,2006,9282,0
4,2008,10017,0
5,2010,11138,0
6,2012,14455,0
7,2014,14905,0
8,2016,17827,0
9,2018,19270,0


Let's see if there are any duplicate `cmte_id`'s in any election cycles (should also be zero)

In [15]:
%%sql
with dup_cmte_id as (
    select elect_cycle,
           cmte_id,
           count(*) as id_count
      from cmte
     group by 1, 2
    having count(*) > 1
)
select elect_cycle,
       count(*) as dupes,
       sum(id_count) as total_dupe_ids,
       max(id_count) as max_dupe_ids
  from dup_cmte_id
 group by 1

 * postgresql+psycopg2://crash@caladan/fecdb
0 rows affected.
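
The same duplicate check can be sketched outside the database, assuming the records are available as `(elect_cycle, cmte_id)` tuples. The `duplicate_keys` helper and the sample rows are hypothetical; an empty result corresponds to the "0 rows affected" outcome above.

```python
from collections import Counter

def duplicate_keys(records):
    """Return {(elect_cycle, cmte_id): count} for keys that occur more than once."""
    counts = Counter((elect_cycle, cmte_id) for elect_cycle, cmte_id in records)
    return {key: n for key, n in counts.items() if n > 1}

# hypothetical rows: the same id in two cycles is fine, twice in one cycle is not
clean = [(2016, "C00000001"), (2018, "C00000001"), (2018, "C00000002")]
dirty = clean + [(2018, "C00000002")]

print(duplicate_keys(clean))
print(duplicate_keys(dirty))
```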


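Now let's look at repeated `cmte_id`'s across election cycles (note that specifying `distinct` within `array_agg` also returns the values sorted in practice, which keeps the arrays consistent if we care to group by that field; this ordering is a side effect rather than a documented guarantee, so `array_agg(distinct elect_cycle order by elect_cycle)` is the explicit form)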

In [16]:
%%sql
with cmte_id_sum as (
    select cmte_id,
           count(*) as ec_count,
           array_agg(distinct elect_cycle) as elect_cycles
      from cmte
     group by 1
)
select ec_count,
       count(*) as cmte_ids,
       round(count(*)::numeric / :cmte_distinct_ids * 100.0, 2) as pct_cmte_ids
  from cmte_id_sum
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
11 rows affected.


Unnamed: 0,ec_count,cmte_ids,pct_cmte_ids
0,11,2193,4.78
1,10,490,1.07
2,9,554,1.21
3,8,651,1.42
4,7,801,1.75
5,6,1338,2.92
6,5,2505,5.46
7,4,2971,6.48
8,3,5380,11.73
9,2,15998,34.88
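
A minimal Python sketch of the same longevity distribution, assuming the records are available as `(cmte_id, elect_cycle)` tuples. The helper names and sample rows are hypothetical. Here the cycles are sorted explicitly with `sorted(...)`, and a set is used per `cmte_id`, which matches `count(*)` only because no `cmte_id` is duplicated within a cycle.

```python
from collections import Counter, defaultdict

def cycles_by_cmte_id(records):
    """Map each cmte_id to the sorted list of distinct election cycles it appears in."""
    cycles = defaultdict(set)
    for cmte_id, elect_cycle in records:
        cycles[cmte_id].add(elect_cycle)
    return {cmte_id: sorted(ec) for cmte_id, ec in cycles.items()}

def longevity_distribution(records):
    """Map each cycle count to (number of cmte_ids, percent of distinct cmte_ids)."""
    per_id = cycles_by_cmte_id(records)
    dist = Counter(len(ec) for ec in per_id.values())
    total = len(per_id)
    return {n: (count, round(100.0 * count / total, 2))
            for n, count in sorted(dist.items(), reverse=True)}

# hypothetical rows
rows = [("A", 2018), ("A", 2016), ("B", 2018), ("C", 2020), ("C", 2000), ("D", 2020)]
print(cycles_by_cmte_id(rows))
print(longevity_distribution(rows))
```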


### Consistency of names for `cmte_id`'s across election cycles ###

Count the number of different names for Committee records (identified by `cmte_id`) appearing in multiple election cycles

In [17]:
%%sql
with cmte_diff_names as (
    select cmte_id,
           count(distinct cmte_nm)     as num_diff_names,
           array_agg(distinct cmte_nm) as diff_names
      from cmte
     group by 1
    having count(distinct cmte_nm) > 1
)
select count(*) as cmte_ids,
       round(count(*)::numeric / :cmte_distinct_ids * 100.0, 2) as pct_cmte_ids
  from cmte_diff_names

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,cmte_ids,pct_cmte_ids
0,4572,9.97


Let's report by the different levels of variation on name (i.e. number of different representations)

In [18]:
%%sql
with cmte_diff_names as (
    select cmte_id,
           count(distinct cmte_nm)     as num_diff_names,
           array_agg(distinct cmte_nm) as diff_names
      from cmte
     group by 1
    having count(distinct cmte_nm) > 1
)
select num_diff_names,
       count(*) as cmte_ids,
       round(count(*)::numeric / :cmte_distinct_ids * 100.0, 2) as pct_cmte_ids
  from cmte_diff_names
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
8 rows affected.


Unnamed: 0,num_diff_names,cmte_ids,pct_cmte_ids
0,9,2,0.0
1,8,1,0.0
2,7,6,0.01
3,6,17,0.04
4,5,75,0.16
5,4,291,0.63
6,3,979,2.13
7,2,3201,6.98


To get an idea of what the different names associated with the same `cmte_id` look like, let's start with a sampling of Committee IDs with `num_diff_names` = 2 (compare adjacent `cmte_nm`'s)...

In [19]:
%%sql
with cmte_diff_names as (
    select cmte_id,
           count(distinct cmte_nm)     as num_diff_names,
           array_agg(distinct cmte_nm) as diff_names
      from cmte
     group by 1
    having count(distinct cmte_nm) = 2
)
select cmte_id,
       unnest(diff_names) as cmte_nm
  from cmte_diff_names
 order by cmte_id
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.


Unnamed: 0,cmte_id,cmte_nm
0,C00000042,ILLINOIS TOOL WORKS FOR BETTER GOVERNMENT COMM...
1,C00000042,ILLINOIS TOOL WORKS INC. FOR BETTER GOVERNMENT...
2,C00000059,HALLMARK CARDS PAC
3,C00000059,HALLMARK POLITICAL ACTION COMMITTEE-FEDERAL HA...
4,C00000372,MAINTENANCE OF WAY POLITICAL LEAGUE
5,C00000372,MAINTENANCE OF WAY POLITICAL LEAGUE A PAC OF T...
6,C00000794,LENT & SCRIVNER PAC
7,C00000794,LENT & SCRIVNER PAC
8,C00000901,BUILD POLITICAL ACTION COMMITTEE OF THE NATION...
9,C00000901,BUILD POLITICAL ACTION COMMITTEE OF THE NATION...


Now, let's look at `num_diff_names` = 3...

In [20]:
%%sql
with cmte_diff_names as (
    select cmte_id,
           count(distinct cmte_nm)     as num_diff_names,
           array_agg(distinct cmte_nm) as diff_names
      from cmte
     group by 1
    having count(distinct cmte_nm) = 3
)
select cmte_id,
       unnest(diff_names) as cmte_nm
  from cmte_diff_names
 order by cmte_id
 limit 51

 * postgresql+psycopg2://crash@caladan/fecdb
51 rows affected.


Unnamed: 0,cmte_id,cmte_nm
0,C00000489,D R I V E POLITICAL FUND CHAPTER 886
1,C00000489,"D R I V E POLITICAL FUND, CHAPTER 886"
2,C00000489,"D R I V E POLITICAL FUND, TEAMSTERS LOCAL UNIO..."
3,C00000638,INDIANA MEDICAL POLITICAL ACTION COMMITTEE
4,C00000638,INDIANA STATE MEDICAL ASSOCIATION POLITICAL AC...
5,C00000638,INDIANA STATE MEDICAL ASSOCIATION POLITICAL AC...
6,C00000729,AMERICAN DENTAL ASSOCIATION POLITICAL ACTION C...
7,C00000729,AMERICAN DENTAL POLITICAL ACTION CMTE.
8,C00000729,AMERICAN DENTAL POLITICAL ACTION COMMITTEE
9,C00000935,DCCC


And `num_diff_names` = 4...

In [21]:
%%sql
with cmte_diff_names as (
    select cmte_id,
           count(distinct cmte_nm)     as num_diff_names,
           array_agg(distinct cmte_nm) as diff_names
      from cmte
     group by 1
    having count(distinct cmte_nm) = 4
)
select cmte_id,
       unnest(diff_names) as cmte_nm
  from cmte_diff_names
 order by cmte_id
 limit 52

 * postgresql+psycopg2://crash@caladan/fecdb
52 rows affected.


Unnamed: 0,cmte_id,cmte_nm
0,C00000885,INTERNATIONAL UNION OF PAINTERS & ALLIED TRADE...
1,C00000885,INTERNATIONAL UNION OF PAINTERS AND ALLIED TRA...
2,C00000885,INTERNATIONAL UNION OF PAINTERS AND ALLIED TRA...
3,C00000885,INTERNATIONAL UNION OF PAINTERS AND ALLIED TRA...
4,C00001198,AMERICAN HOTEL AND LODGING ASSOCIATION PAC
5,C00001198,AMERICAN HOTEL AND LODGING ASSOCIATION POLITIC...
6,C00001198,AMERICAN HOTEL & LODGING ASSOCIATION PAC FKA A...
7,C00001198,AMERICAN HOTEL MOTEL POLITICAL ACTION COMMITTEE
8,C00002469,MACHINISTS NON PARTISAN POLITICAL LEAGUE
9,C00002469,MACHINISTS NON-PARTISAN POLITICAL LEAGUE


And for the fun of it, let's look at the most extreme examples (`num_diff_names` > 7)

In [22]:
%%sql
with cmte_diff_names as (
    select cmte_id,
           count(distinct cmte_nm)     as num_diff_names,
           array_agg(distinct cmte_nm) as diff_names
      from cmte
     group by 1
    having count(distinct cmte_nm) > 7
)
select cmte_id,
       unnest(diff_names) as cmte_nm
  from cmte_diff_names
 order by cmte_id
 limit 100

 * postgresql+psycopg2://crash@caladan/fecdb
26 rows affected.


Unnamed: 0,cmte_id,cmte_nm
0,C00077701,KRAFT FOOD INC POLITICAL ACTION COMMITTEE (KF ...
1,C00077701,KRAFT FOODS GLOBAL INC. POLITICAL ACTION COMMI...
2,C00077701,KRAFT FOODS GLOBAL INC. POLITICAL ACTION COMMI...
3,C00077701,KRAFT FOODS GLOBAL INC. POLITICAL ACTION COMMI...
4,C00077701,KRAFT FOODS GROUP INC. POLITICAL ACTION COMMIT...
5,C00077701,"KRAFT FOODS GROUP, INC. POLITICAL ACTION COMMI..."
6,C00077701,KRAFT FOODS NORTH AMERICA INC. POLTICAL ACTION...
7,C00077701,THE KRAFT HEINZ COMPANY POLITICAL ACTION COMMI...
8,C00113753,JOHNSON CONTROLS INC. PAC
9,C00113753,MALLINCKRODT INC. POLITICAL ACTION COMMITTEE


### Multiple `cmte_id`'s for identical names &ndash; within election cycles ###

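Let's first see how many names are involved, and what percentage of total distinct names that represents.  Note that `shared_cmte_name` is grouped by both `elect_cycle` and `cmte_nm`, so a name that is shared in several cycles is counted once per cycle; the percentage is therefore an upper bound on the share of distinct names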

In [23]:
%%sql
with shared_cmte_name as (
    select elect_cycle,
           cmte_nm,
           count(*) as num_shares
      from cmte
     where cmte_nm is not null
     group by 1, 2
    having count(*) > 1
)
select count(*) as shared_names,
       round(count(*)::numeric / :cmte_distinct_names * 100.0, 2) as pct_distinct_names
  from shared_cmte_name

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,shared_names,pct_distinct_names
0,1412,2.79


Now we'll report by the level of replication (the number of Committees, as identified by `cmte_id`, sharing a name) in the various cycles.  Note that the same name may "offend" (map to multiple `cmte_id`'s) in more than one election cycle, with either the same or a different `num_shares`

In [24]:
%%sql
with shared_cmte_name as (
    select elect_cycle,
           cmte_nm,
           count(*) as num_shares
      from cmte
     where cmte_nm is not null
     group by 1, 2
    having count(*) > 1
)
select num_shares,
       count(*) as shared_names,
       array_agg(distinct elect_cycle) as elect_cycles
  from shared_cmte_name
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
4 rows affected.


Unnamed: 0,num_shares,shared_names,elect_cycles
0,5,5,"[2000, 2012, 2016, 2018, 2020]"
1,4,12,"[2000, 2008, 2010, 2012, 2014, 2016, 2018]"
2,3,70,"[2000, 2002, 2004, 2006, 2008, 2010, 2012, 201..."
3,2,1325,"[2000, 2002, 2004, 2006, 2008, 2010, 2012, 201..."


Let's take a look at some of the top offenders (see "elect_cycles" for multiple offenses of the same "num_shares" by a name)

In [25]:
%%sql
with shared_cmte_name as (
    select elect_cycle,
           cmte_nm,
           count(*) as num_shares,
           array_agg(distinct cmte_id) as cmte_ids
      from cmte
     where cmte_nm is not null
     group by 1, 2
    having count(*) > 2
)
select cmte_nm,
       array_length(cmte_ids, 1) as num_cmte_ids,
       cmte_ids,
       array_agg(elect_cycle) as elect_cycles
  from shared_cmte_name
 group by 1, 3
 order by array_length(cmte_ids, 1) desc, count(*) desc, cmte_nm

 * postgresql+psycopg2://crash@caladan/fecdb
61 rows affected.


Unnamed: 0,cmte_nm,num_cmte_ids,cmte_ids,elect_cycles
0,COLLINS FOR CONGRESS,5,"[C00502039, C00520379, C00521641, C00544684, C...","[2016, 2018, 2020]"
1,COLLINS FOR CONGRESS,5,"[C00245712, C00250605, C00270983, C00331124, C...",[2000]
2,MOORE FOR CONGRESS,5,"[C00331066, C00397505, C00464578, C00508853, C...",[2012]
3,CLEAN WATER ACTION PROJECT,4,"[C70001839, C70001862, C70001870, C70002324]","[2016, 2018]"
4,COLLINS FOR CONGRESS,4,"[C00502039, C00520379, C00521641, C00544684]",[2014]
5,COLLINS FOR CONGRESS,4,"[C00335695, C00502039, C00520379, C00521641]",[2012]
6,KELLY FOR CONGRESS,4,"[C00352732, C00417998, C00444216, C00460808]",[2010]
7,KELLY FOR CONGRESS,4,"[C00295493, C00317834, C00323493, C00352732]",[2000]
8,KENNEDY FOR CONGRESS,4,"[C00287078, C00347203, C00357046, C00360032]",[2000]
9,MOORE FOR CONGRESS,4,"[C00397505, C00464578, C00508853, C00555185]",[2014]


And we'll look at the top recurrences for cases where the number of shares is exactly two in an election cycle

In [26]:
%%sql
with shared_cmte_name as (
    select elect_cycle,
           cmte_nm,
           count(*) as num_shares,
           array_agg(distinct cmte_id) as cmte_ids
      from cmte
     where cmte_nm is not null
     group by 1, 2
    having count(*) = 2
)
select cmte_nm,
       cmte_ids,
       count(*) as num_elect_cycles,
       array_agg(elect_cycle) as elect_cycles
  from shared_cmte_name
 group by 1, 2
 order by 3 desc, 1, 2
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.


Unnamed: 0,cmte_nm,cmte_ids,num_elect_cycles,elect_cycles
0,FRELINGHUYSEN FOR CONGRESS,"[C00148684, C00299404]",11,"[2000, 2002, 2004, 2006, 2008, 2010, 2012, 201..."
1,FOCUS ON THE FAMILY ACTION,"[C30000673, C90008186]",7,"[2006, 2008, 2010, 2012, 2014, 2016, 2018]"
2,LEAGUE OF CONSERVATION VOTERS INC,"[C70004262, C90005786]",7,"[2004, 2006, 2012, 2014, 2016, 2018, 2020]"
3,COMMON SENSE ISSUES INC,"[C30001457, C90009739]",6,"[2008, 2010, 2012, 2014, 2016, 2018]"
4,GRAVES FOR CONGRESS,"[C00359034, C00462556]",6,"[2010, 2012, 2014, 2016, 2018, 2020]"
5,HUMAN RIGHTS CAMPAIGN,"[C70004569, C90012626]",6,"[2010, 2012, 2014, 2016, 2018, 2020]"
6,PRICE FOR CONGRESS,"[C00195628, C00386755]",6,"[2010, 2012, 2014, 2016, 2018, 2020]"
7,REFORM PARTY OF THE UNITED STATES OF AMERICA,"[C00331314, C00364307]",6,"[2000, 2002, 2004, 2006, 2008, 2010]"
8,VOTEVETS.ORG ACTION FUND,"[C30001275, C90010620]",6,"[2008, 2010, 2012, 2014, 2016, 2018]"
9,AFL-CIO COPE POLITICAL CONTRIBUTIONS COMMITTEE,"[C00003806, C70000112]",5,"[2002, 2004, 2006, 2008, 2010]"
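
The within-cycle grouping can be sketched in Python as follows, assuming records are `(elect_cycle, cmte_nm, cmte_id)` tuples. The `FRELINGHUYSEN FOR CONGRESS` rows reuse the ids and the first two cycles shown in the output above; the `EXAMPLE PAC` row and the helper name are hypothetical. The sketch also shows the difference between counting (cycle, name) pairs and counting distinct names.

```python
from collections import defaultdict

def shared_within_cycle(records):
    """Map (elect_cycle, cmte_nm) to sorted cmte_ids where a name has more than one id."""
    ids = defaultdict(set)
    for elect_cycle, cmte_nm, cmte_id in records:
        if cmte_nm is not None:
            ids[(elect_cycle, cmte_nm)].add(cmte_id)
    return {key: sorted(v) for key, v in ids.items() if len(v) > 1}

rows = [
    (2000, "FRELINGHUYSEN FOR CONGRESS", "C00148684"),
    (2000, "FRELINGHUYSEN FOR CONGRESS", "C00299404"),
    (2002, "FRELINGHUYSEN FOR CONGRESS", "C00148684"),
    (2002, "FRELINGHUYSEN FOR CONGRESS", "C00299404"),
    (2002, "EXAMPLE PAC", "C99999999"),  # hypothetical, not shared
]

shared = shared_within_cycle(rows)
shared_pairs = len(shared)                                  # one per (cycle, name)
shared_distinct_names = len({name for _, name in shared})   # one per name
print(shared_pairs, shared_distinct_names)
```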


### Multiple `cmte_id`'s for identical names &ndash; across election cycles ###

As above, we'll see how many names are mapped to different `cmte_id`'s, except now *across* election cycles

In [27]:
%%sql
with shared_cmte_name as (
    select cm.cmte_nm,
           count(distinct cm2.cmte_id) as num_cmte_ids,
           array_agg(distinct cm2.elect_cycle) as elect_cycles
      from cmte cm
      join cmte cm2 on  cm2.cmte_nm      = cm.cmte_nm
                    and cm2.cmte_id     != cm.cmte_id
                    and cm2.elect_cycle != cm.elect_cycle
     group by 1
)
select count(*) as shared_names,
       round(count(*)::numeric / :cmte_distinct_names * 100.0, 2) as pct_distinct_names
  from shared_cmte_name

 * postgresql+psycopg2://crash@caladan/fecdb
1 rows affected.


Unnamed: 0,shared_names,pct_distinct_names
0,1370,2.71


And now report by the level of replication (name sharing by different Committees) we have across cycles

In [28]:
%%sql
with shared_cmte_name as (
    select cm.cmte_nm,
           count(distinct cm2.cmte_id) as num_cmte_ids,
           array_agg(distinct cm2.elect_cycle) as elect_cycles
      from cmte cm
      join cmte cm2 on  cm2.cmte_nm      = cm.cmte_nm
                    and cm2.cmte_id     != cm.cmte_id
                    and cm2.elect_cycle != cm.elect_cycle
     group by 1
)
select num_cmte_ids,
       count(*) as num_shared_names
  from shared_cmte_name
 group by 1
 order by 1 desc

 * postgresql+psycopg2://crash@caladan/fecdb
9 rows affected.


Unnamed: 0,num_cmte_ids,num_shared_names
0,13,1
1,9,1
2,8,1
3,7,4
4,6,4
5,5,21
6,4,26
7,3,147
8,2,1165


And we'll take a look at the top offenders for this replication (the names mapped to the most `cmte_id`'s)

In [29]:
%%sql
with shared_cmte_name as (
    select cm.cmte_nm,
           count(distinct cm2.cmte_id) as num_cmte_ids,
           array_agg(distinct cm2.cmte_id) as cmte_ids,
           count(*) as count_records
      from cmte cm
      join cmte cm2 on  cm2.cmte_nm      = cm.cmte_nm
                    and cm2.cmte_id     != cm.cmte_id
                    and cm2.elect_cycle != cm.elect_cycle
     group by 1
)
select cmte_nm,
       num_cmte_ids,
       cmte_ids
  from shared_cmte_name
 order by 2 desc, 1
 limit 50

 * postgresql+psycopg2://crash@caladan/fecdb
50 rows affected.


Unnamed: 0,cmte_nm,num_cmte_ids,cmte_ids
0,COLLINS FOR CONGRESS,13,"[C00245712, C00250605, C00270983, C00331124, C..."
1,KELLY FOR CONGRESS,9,"[C00295493, C00317834, C00323493, C00352732, C..."
2,MOORE FOR CONGRESS,8,"[C00320721, C00331066, C00397505, C00397927, C..."
3,ANDERSON FOR CONGRESS,7,"[C00310235, C00338426, C00422972, C00462549, C..."
4,CAMPBELL FOR CONGRESS,7,"[C00352468, C00358416, C00398313, C00412312, C..."
5,ROBINSON FOR CONGRESS,7,"[C00338806, C00386359, C00417212, C00428185, C..."
6,ROGERS FOR CONGRESS,7,"[C00343863, C00344507, C00344812, C00370155, C..."
7,KENNEDY FOR CONGRESS,6,"[C00287078, C00347203, C00357046, C00360032, C..."
8,LUCAS FOR CONGRESS,6,"[C00287912, C00328922, C00366146, C00419705, C..."
9,THOMAS FOR CONGRESS,6,"[C00321653, C00374561, C00384222, C00649632, C..."
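
The self-join used in this section can be read as: for each name, collect every `cmte_id` that has a counterpart record with the same name, a different `cmte_id`, and a different `elect_cycle`. A minimal Python sketch of that logic, with a hypothetical helper name and hypothetical rows, assuming records are `(cmte_id, elect_cycle, cmte_nm)` tuples:

```python
from collections import defaultdict

def shared_across_cycles(records):
    """Map each name to the sorted cmte_ids matched by the cross-cycle self-join."""
    by_name = defaultdict(set)
    for cmte_id, elect_cycle, cmte_nm in records:
        if cmte_nm is not None:  # null names never satisfy the equality join
            by_name[cmte_nm].add((cmte_id, elect_cycle))
    shared = {}
    for name, rows in by_name.items():
        matched = {id2
                   for id1, cycle1 in rows
                   for id2, cycle2 in rows
                   if id2 != id1 and cycle2 != cycle1}
        if matched:
            shared[name] = sorted(matched)
    return shared

rows = [
    ("C1", 2016, "SMITH FOR CONGRESS"), ("C2", 2018, "SMITH FOR CONGRESS"),  # different id and cycle
    ("C3", 2016, "JONES FOR CONGRESS"), ("C4", 2016, "JONES FOR CONGRESS"),  # same cycle only
    ("C5", 2016, "EXAMPLE PAC"), ("C5", 2018, "EXAMPLE PAC"),                # same id only
]
result = shared_across_cycles(rows)
print(result)
```

A name that is shared only within a single cycle is not matched by this join, which is one reason the within-cycle and across-cycle lists overlap without being identical.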


## Summary ##

### Findings ###

* Fewer than 1% of records have obvious flaws in the name, but there are opportunities to normalize (to make name matching more accurate, when needed)
* There are no null `cmte_id`'s and no duplicated `cmte_id`'s within an election cycle
* High-level summary of `cmte_id` longevity across two-year election cycles:
    * 28% of distinct `cmte_id`'s only appear in a single election cycle's data (i.e. 2 years)
    * 47% of distinct `cmte_id`'s appear in two or three election cycles' data (i.e. 3-6 years)
    * 25% of distinct `cmte_id`'s appear in four or more election cycles' data (i.e. 7+ years)
* 10% of Committees (identified by `cmte_id`) are listed under different names (`cmte_nm`) across election cycles
    * This represents 14% of Committees appearing in multiple election cycles
    * In some cases, it is a matter of spacing and punctuation (fixable by standard normalization), but in many other cases, the names have different wording (e.g. more or less specificity) or spelling (e.g. abbreviations)
* About 2.8% of distinct Committee names in the overall data set map to 2 or more `cmte_id`'s within an election cycle (an upper bound, since a name shared in several cycles is counted once per cycle)
    * A similar number (actually 2.7%) of distinct Committee names are seen with different `cmte_id`'s *across* election cycles
    * Many of the same names appear in both lists of replicated/reused names
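
The longevity and renaming percentages above can be re-derived from the query outputs as a quick arithmetic check. The counts below are copied from the `ec_count` table and the summary cell; the single-cycle count is not shown in the truncated output, so it is computed here as the remainder, which is an assumption.

```python
cmte_distinct_ids = 45868
ids_with_diff_names = 4572

# cmte_ids per number of election cycles, as shown in the ec_count output
ids_by_cycle_count = {11: 2193, 10: 490, 9: 554, 8: 651, 7: 801,
                      6: 1338, 5: 2505, 4: 2971, 3: 5380, 2: 15998}

multi_cycle_ids = sum(ids_by_cycle_count.values())
single_cycle_ids = cmte_distinct_ids - multi_cycle_ids  # assumed: remainder of distinct ids
two_or_three = ids_by_cycle_count[2] + ids_by_cycle_count[3]
four_or_more = multi_cycle_ids - two_or_three

def pct(part, whole):
    """Percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)

print("single cycle:      ", pct(single_cycle_ids, cmte_distinct_ids))
print("two or three:      ", pct(two_or_three, cmte_distinct_ids))
print("four or more:      ", pct(four_or_more, cmte_distinct_ids))
print("renamed, of all:   ", pct(ids_with_diff_names, cmte_distinct_ids))
print("renamed, of multi: ", pct(ids_with_diff_names, multi_cycle_ids))
```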

### Recommendations ###

Based on the above exploration and findings, the following enhancements to the schema are recommended:

* Normalize `cmte_nm` (and other text fields? \[see below]) during initial load (awk script)
    * Convert to uppercase
    * Collapse consecutive whitespace
    * Eliminate whitespace immediately inside of parentheses
* Create `cmte_mstr` table based on unique `cmte_id`'s (across election cycles)
    * Include `base_cmte_mstr_id` for associating Committees with different `cmte_id`'s, but identical (*or similar*) names, or otherwise deemed to be the same underlying organization
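
The three normalization steps listed above could look roughly like this. Python is used here only as a stand-in for the awk script mentioned in the recommendation, and `normalize_name` is a hypothetical helper; it applies exactly the three listed rules and nothing else (for example, it does not touch comma spacing).

```python
import re

def normalize_name(text):
    """Uppercase, collapse consecutive whitespace, and drop spaces just inside parentheses."""
    if text is None:
        return None
    text = text.upper()
    text = re.sub(r"\s{2,}", " ", text)   # collapse consecutive whitespace
    text = re.sub(r"\(\s+", "(", text)    # no whitespace right after "("
    text = re.sub(r"\s+\)", ")", text)    # no whitespace right before ")"
    return text

# names taken from the outputs above
print(normalize_name("Catholics United"))
print(normalize_name("SYNTEX ( U S A ) LLC EMPLOYEES PAC"))
```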

Other text fields to consider for normalization:

* `tres_nm`
* `cmte_st1`
* `cmte_st2`
* `cmte_city`
* `cmte_st`
* `connected_org_nm`