# Define Donor Segment Context &ndash; Top Contributors to the 314 PAC #

## Overview ##

Explore the FEC data by specifying SQL predicates that identify **Donor Segments**, which are static lists of Donor (`donor_indiv` view) records.  Note that a Donor Segment context may include one *or more* segments (e.g. by name or ID).  As with Donor contexts, Donor identities ***are*** discernible within queries using this context type.

For this notebook, we will create a Donor Segment based on top contributors to the 314 PAC.  This represents a common use case of manually creating a single Donor Segment and setting a query context in which to explore the collective giving patterns for the included members.  As a basis for the Donors comprising the Segment, we will also group the underlying Individual records into Donors using a simple name and address matching scheme.  As described below, this grouping scheme is by no means rigorous, but it demonstrates one approach to reducing some of the variability in the base FEC data sets.

This approach will create the following query contexts:

**Principal Context View**

* `ctx_dseg`

**Dependent Context Views**

* `ctx_dseg_memb`
* `ctx_donor`
* `ctx_indiv`
* `ctx_indiv_contrib`
* `ctx_donor_contrib`

## Notebook Setup ##

### Configure database connect info/options ###

Note: the database connect string can be specified on the initial `%sql` command:

```python
database_url = "postgresql+psycopg2://user@localhost/fecdb"
%sql $database_url

```

Otherwise, the connect string is taken from the `DATABASE_URL` environment variable (if not specified for `%sql`):

```python
%sql

```
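The precedence described above (an explicit connect string first, then the environment variable) can be sketched in plain Python.  This is only an illustration of the lookup order, not how the `%sql` magic is implemented; the helper name `resolve_database_url` and the example URLs are hypothetical.

```python
import os


def resolve_database_url(explicit_url=None, env=None):
    """Return the explicit URL if given, else the DATABASE_URL value from env."""
    env = os.environ if env is None else env
    url = explicit_url or env.get("DATABASE_URL")
    if not url:
        raise RuntimeError("no connect string: pass one or set DATABASE_URL")
    return url


# Hypothetical values, for illustration only
example_env = {"DATABASE_URL": "postgresql+psycopg2://user@localhost/fecdb"}
from_env = resolve_database_url(env=example_env)
from_arg = resolve_database_url(
    "postgresql+psycopg2://other@localhost/fecdb", env=example_env
)
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.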

In [1]:
%load_ext sql
%config SqlMagic.autopandas=True
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
# connect string taken from DATABASE_URL environment variable
%sql

'Connected: crash@fecdb'

### Clear context ###

Note that we drop *all* context views so we won't have any inconsistencies after this notebook is run.  After defining `ctx_dseg` below, we will define all dependent views (see Overview, above), and leave any higher-order or orthogonal views undefined.

In [2]:
%sql drop view if exists ctx_dseg_memb     cascade
%sql drop view if exists ctx_dseg          cascade
%sql drop view if exists ctx_donor_contrib cascade
%sql drop view if exists ctx_donor         cascade
%sql drop view if exists ctx_household     cascade
%sql drop view if exists ctx_iseg_memb     cascade
%sql drop view if exists ctx_iseg          cascade
%sql drop view if exists ctx_indiv_contrib cascade
%sql drop view if exists ctx_indiv         cascade

 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.


### Set styling ###

In [3]:
%%html
<style>
  tr, th, td {
    text-align: left !important;
  }
</style>

## Create Donor Segment for Top 314 Donors ##

We first clear out any previous versions of the tables and views created in this notebook, so that all of the SQL shown below is actually executed (this use case is written for demonstration purposes).

In [4]:
%sql delete from donor_seg where name = 'Top 314 Donors'
%sql drop materialized view if exists donor_sum_314 cascade
%sql drop materialized view if exists indiv_group cascade

 * postgresql+psycopg2://crash@localhost/fecdb
0 rows affected.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.


This view is a rough-cut approach to grouping `indiv` records that are likely to represent the same real-world person.  Individuals here are grouped together if they match on the combination of: last name, first three characters of first name, and first three characters of zip code.  Note that this view only considers the most standard pattern of name representation in the FEC data (i.e. "&lt;last&gt;, &lt;first&gt; [&lt;middle&gt;|&lt;titles&gt;|&lt;degrees&gt;|...]"); other non-well-formed representations will be skipped (or not properly parsed and associated).

This quick and dirty logic is used for creating Donors from Individuals in this notebook (for demonstration purposes), but should be replaced later by higher-definition, context-sensitive algorithms when trying to get more accurate analysis and reporting out of the data.
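The matching rule described above can be sketched in plain Python to make the grouping key concrete.  This is a minimal illustration, not the view itself: it assumes name parts are comma-separated (the actual parsing behind `indiv_parsed` may differ), and the records and the `group_key` helper are hypothetical.

```python
import re
from collections import defaultdict

# Mirrors the view's filter: an uppercase letter followed by a non-comma character
WELL_FORMED = re.compile(r"^[A-Z][^,]")


def group_key(name, zip_code):
    """Return (last_name, first_name_pfx, zip_pfx), or None if the record is skipped."""
    if zip_code is None or not WELL_FORMED.match(name):
        return None
    # ASSUMPTION: parts are comma-separated; indiv_parsed may split names differently
    parts = [p.strip() for p in name.split(",")]
    if len(parts) < 2:  # corresponds to num_parts > 1
        return None
    last_name = parts[0]
    if " " in last_name:  # corresponds to part1 !~ ' '
        return None
    return (last_name, parts[1][:3], zip_code[:3])


# Hypothetical records: (id, name, zip_code)
records = [
    (1, "DOE, JONATHAN Q", "941051234"),
    (2, "DOE, JON", "94110"),
    (3, "DOE, JANE", "94105"),
    (4, "VAN DYKE, PAT", "10001"),  # skipped: space in last name
    (5, "DOE JOHN", "94105"),       # skipped: no comma-separated first name
    (6, "DOE, JON", None),          # skipped: no zip code
]

groups = defaultdict(list)
for indiv_id, name, zip_code in records:
    key = group_key(name, zip_code)
    if key is not None:
        groups[key].append(indiv_id)
```

Records 1 and 2 land in the same group even though the first names and zip codes differ past the third character, which shows both the benefit of the scheme and its potential for false matches.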

In [5]:
%%sql
create materialized view if not exists indiv_group as
select ip.part1                  as last_name,
       substr(ip.part2, 1, 3)    as first_name_pfx,
       substr(ip.zip_code, 1, 3) as zip_pfx,
       count(distinct ip.id)     as indivs,
       array_agg(distinct ip.id) as indiv_ids
  from indiv_parsed ip
 where ip.name ~ '^[A-Z][^,]'
   and ip.zip_code is not null
   and ip.num_parts > 1
   and ip.part1 !~ ' '
 group by 1, 2, 3

 * postgresql+psycopg2://crash@localhost/fecdb
6738578 rows affected.


Create a view to represent `indiv_contrib` records associated with any committee whose name is prefixed by "314" (this qualification can be amended if there are other patterns representing the same PAC; currently there are no others with "314" elsewhere in the name).

Note that this serves as a template for creating other segments of contributions, and hence the Donors (or Individuals, if wanting to create an Individual Segment instead) behind them, for doing a similar type of investigation.

In [6]:
%%sql
create or replace view contrib_to_314 as
select cm.cmte_nm,
       ic.*
  from cmte cm
  join indiv_contrib ic
       on ic.cmte_id = cm.cmte_id
 where cm.cmte_nm like '314%'

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


We now create a view summarizing the contributions to the "314" committees from the Individual groupings (i.e. approximation of Donors) created above.  The aggregation for each "Donor" includes the list of consolidated `indiv_id` keys, the total number of contributions, and the total and average amounts.

SQL design note: it is not obvious whether it is better to re-aggregate the unnested ids (even though we are not able to omit the `distinct` qualifier), or to select `ig.indiv_ids` and add it to the GROUP BY clause; we use the former option for now.
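To make the shape of this aggregation concrete, here is a small Python sketch of the same steps: unnest the grouped ids, join to contributions, and summarize per group.  The data and variable names are hypothetical, and the sketch only approximates the SQL semantics.

```python
from collections import defaultdict

# Hypothetical groups: key -> member indiv ids (id 4 has no contributions)
indiv_groups = {("DOE", "JON", "941"): [1, 2, 4], ("ROE", "ANN", "100"): [3]}

# Hypothetical contributions: (indiv_id, transaction_amt, elect_cycle)
contribs = [
    (1, 500.0, 2016),
    (2, 250.0, 2018),
    (2, 250.0, 2018),
    (3, 1000.0, 2018),
    (9, 75.0, 2018),  # no matching group, dropped by the inner join
]

# "unnest": map each indiv id back to its group key
key_by_indiv = {iid: key for key, ids in indiv_groups.items() for iid in ids}

acc = defaultdict(lambda: {"ids": set(), "amts": [], "cycles": set()})
for indiv_id, amt, cycle in contribs:
    key = key_by_indiv.get(indiv_id)
    if key is None:
        continue
    acc[key]["ids"].add(indiv_id)
    acc[key]["amts"].append(amt)
    acc[key]["cycles"].add(cycle)

summary = {
    key: {
        "indiv_ids": sorted(a["ids"]),  # distinct ids that actually contributed
        "contribs": len(a["amts"]),
        "total_amt": sum(a["amts"]),
        "avg_amt": round(sum(a["amts"]) / len(a["amts"]), 2),
        "elect_cycles": sorted(a["cycles"]),
    }
    for key, a in acc.items()
}
```

One thing the sketch makes visible: because the ids are re-aggregated after an inner join, the resulting list contains only members with at least one matching contribution (id 4 is absent here), whereas carrying the original id array through would keep every member of the group.  That difference may be relevant when choosing between the two options in the design note.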

In [7]:
%%sql
create materialized view donor_sum_314 as
with indiv_group_memb as (
    select ig.last_name,
           ig.first_name_pfx,
           ig.zip_pfx,
           --ig.indiv_ids,
           unnest(ig.indiv_ids) as indiv_id
      from indiv_group ig
)
select igm.last_name,
       igm.first_name_pfx,
       igm.zip_pfx,
       array_agg(distinct igm.indiv_id)
                                 as indiv_ids,
       count(ct.transaction_amt) as contribs,
       sum(ct.transaction_amt)   as total_amt,
       round(sum(ct.transaction_amt) / count(ct.transaction_amt), 2)
                                 as avg_amt,
       array_agg(distinct ct.elect_cycle)
                                 as elect_cycles
  from indiv_group_memb igm
  join contrib_to_314 ct on ct.indiv_id = igm.indiv_id
 group by 1, 2, 3

 * postgresql+psycopg2://crash@localhost/fecdb
2438 rows affected.


Create a couple of indexes for performance.

In [8]:
%sql create index donor_sum_314_total_amt on donor_sum_314 (total_amt)
%sql create index donor_sum_314_avg_amt on donor_sum_314 (avg_amt)

 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.


Let's inspect the top 50 "Donors" (the actual Donor records are not yet created) by total contribution amount.  This is the list that we will create our segment from.
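The ordering used for this selection (total amount descending, then number of contributions descending as a tie-breaker) can be illustrated with a short Python sketch.  The rows and the `top_n` helper are hypothetical.

```python
# Hypothetical summary rows: (label, contribs, total_amt)
rows = [
    ("A", 2, 200000.0),
    ("B", 21, 240000.0),
    ("C", 7, 200000.0),
    ("D", 4, 50000.0),
]


def top_n(rows, n):
    """Order by total_amt descending, then contribs descending, and keep n rows."""
    return sorted(rows, key=lambda r: (-r[2], -r[1]))[:n]


top3 = top_n(rows, 3)
```

Rows A and C tie on total amount, so the contribution count decides their order.  In SQL, rows that tie on *both* keys are returned in no guaranteed order, so in principle the last slot of a `limit` query could vary between runs; adding a unique column to the ordering would remove that ambiguity.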

In [9]:
%%sql
select *
  from donor_sum_314
 order by total_amt desc, contribs desc
 limit 50

 * postgresql+psycopg2://crash@localhost/fecdb
50 rows affected.


|   | last_name | first_name_pfx | zip_pfx | indiv_ids | contribs | total_amt | avg_amt | elect_cycles |
|---|-----------|----------------|---------|-----------|----------|-----------|---------|--------------|
| 0 | STOREY | BAY | 191 | [11659801] | 21 | 240000.0 | 11428.57 | [2016] |
| 1 | ROSZAK | MAT | 600 | [10325529] | 2 | 204562.0 | 102281.0 | [2018] |
| 2 | PROCKOP | DAR | 191 | [9683572] | 2 | 200000.0 | 100000.0 | [2018] |
| 3 | PARK | TOD | 940 | [9188602] | 2 | 200000.0 | 100000.0 | [2018] |
| 4 | SHENKER | SCO | 947 | [10979072] | 7 | 105500.0 | 15071.43 | [2014, 2018] |
| 5 | GIRARDI | THO | 900 | [4336953] | 4 | 100000.0 | 25000.0 | [2018] |
| 6 | NASH | RIC | 598 | [8656333] | 2 | 100000.0 | 50000.0 | [2018] |
| 7 | ABRAMSON | RON | 200 | [28429] | 4 | 50000.0 | 12500.0 | [2018] |
| 8 | TAYLOR | DAL | 606 | [11902785] | 2 | 50000.0 | 25000.0 | [2018] |
| 9 | LARSEN | CHR | 941 | [6822374] | 2 | 50000.0 | 25000.0 | [2018] |


Now we use the `create_donor_seg` function to create the segment based on the query we just executed.  This step also creates the actual underlying Donor records from the groupings determined by `indiv_group`.

In [10]:
%%sql
with donor_set as (
    select row(indiv_ids)::id_array as ids
      from donor_sum_314
     order by total_amt desc, contribs desc
     limit 50
)
select create_donor_seg(array_agg(ids), 'Top 314 Donors') as seg_id
  from donor_set

 * postgresql+psycopg2://crash@localhost/fecdb
1 rows affected.


|   | seg_id |
|---|--------|
| 0 | 239 |


## Create Principal View (`ctx_dseg`) ##

In [11]:
%%sql
create or replace view ctx_dseg as
select id,
       name,
       description
  from donor_seg ds
 where ds.name = 'Top 314 Donors'

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


In [12]:
%%sql
select *
  from ctx_dseg

 * postgresql+psycopg2://crash@localhost/fecdb
1 rows affected.


|   | id | name | description |
|---|----|------|-------------|
| 0 | 239 | Top 314 Donors |  |


## Create Dependent Views ##

### Create `ctx_dseg_memb` ###

In [13]:
%%sql
create or replace view ctx_dseg_memb as
select dsm.*
  from ctx_dseg dsx
  join donor_seg_memb dsm on dsm.donor_seg_id = dsx.id

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


In [14]:
%%sql
select ds.name as dseg_name,
       d.name  as indiv_name,
       d.city,
       d.state,
       d.zip_code,
       d.elect_cycles
  from ctx_dseg_memb dsmx
  join donor_seg ds on ds.id = dsmx.donor_seg_id
  join donor_indiv d on d.id = dsmx.donor_indiv_id

 * postgresql+psycopg2://crash@localhost/fecdb
50 rows affected.


|   | dseg_name | indiv_name | city | state | zip_code | elect_cycles |
|---|-----------|------------|------|-------|----------|--------------|
| 0 | Top 314 Donors | ABRAMSON, RONALD | WASHINGTON | DC | 200063807 | [2012, 2014, 2016, 2018, 2020] |
| 1 | Top 314 Donors | BASSI, STEVE | CARLSBAD | CA | 920081900 | [2018] |
| 2 | Top 314 Donors | BEEUWKES, REINIER | CONCORD | MA | 17425322 | [2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020] |
| 3 | Top 314 Donors | BERG, FRED | CUSHING | ME | 45633307 | [2016, 2018] |
| 4 | Top 314 Donors | BLUE, ALLEN | VENICE | CA | 902913830 | [2016, 2018] |
| 5 | Top 314 Donors | BYERS, BROOK | MENLO PARK | CA | 940257020 | [2014, 2016, 2018, 2020] |
| 6 | Top 314 Donors | CUELLAR, CLIFFORD | TACOMA | WA | 984053360 | [2016, 2018, 2020] |
| 7 | Top 314 Donors | FERSTER, DAVID | WILMETTE | IL | 600911553 | [2014, 2016, 2018, 2020] |
| 8 | Top 314 Donors | FORDE, JAMES | TUSTIN | CA | 927806320 | [2018] |
| 9 | Top 314 Donors | FRIEDMAN, DONNA | MOUNT PLEASANT | SC | 294644305 | [2014, 2016, 2018, 2020] |


### Create `ctx_donor` ###

Since there is no chance of multiple inclusion of Donors in `ctx_dseg_memb` (i.e. `ctx_dseg` does not include more than one Donor Segment), we can use straightforward SQL for creating `ctx_donor` (see the similar section in `dc7` for discussion of a slightly more complicated case).
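To see why multiple inclusion matters, here is a small Python sketch of segment membership with one segment versus two overlapping segments.  The membership rows and the `donors_in_context` helper are hypothetical, and the de-duplication shown is just one possible approach, not necessarily the one used in `dc7`.

```python
# Hypothetical membership rows: (donor_seg_id, donor_indiv_id)
memberships = [(1, 100), (1, 101), (2, 101), (2, 102)]


def donors_in_context(memberships, segment_ids):
    """Return (raw join result, de-duplicated result) for the given segments."""
    hits = [donor_id for seg_id, donor_id in memberships if seg_id in segment_ids]
    # dict.fromkeys keeps the first occurrence of each id, preserving order
    return hits, list(dict.fromkeys(hits))


single_raw, single_unique = donors_in_context(memberships, {1})
multi_raw, multi_unique = donors_in_context(memberships, {1, 2})
```

With a single segment the raw join already yields each Donor once.  With two segments, Donor 101 appears twice in the raw result, so a plain join would double-count that Donor (and every contribution joined to it downstream) unless duplicates are removed.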

In [15]:
%%sql
create or replace view ctx_donor as
select d.*
  from ctx_dseg_memb dsmx
  join donor_indiv d on d.id = dsmx.donor_indiv_id

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


Quick validation&mdash;the rows here should match those queried just before creation of the Donor Segment (above), except full Donor names are now shown.

In [16]:
%%sql
select id,
       name,
       city,
       state,
       zip_code,
       elect_cycles
  from ctx_donor

 * postgresql+psycopg2://crash@localhost/fecdb
50 rows affected.


|   | id | name | city | state | zip_code | elect_cycles |
|---|----|------|------|-------|----------|--------------|
| 0 | 28429 | ABRAMSON, RONALD | WASHINGTON | DC | 200063807 | [2012, 2014, 2016, 2018, 2020] |
| 1 | 678448 | BASSI, STEVE | CARLSBAD | CA | 920081900 | [2018] |
| 2 | 779629 | BEEUWKES, REINIER | CONCORD | MA | 17425322 | [2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020] |
| 3 | 877907 | BERG, FRED | CUSHING | ME | 45633307 | [2016, 2018] |
| 4 | 1083690 | BLUE, ALLEN | VENICE | CA | 902913830 | [2016, 2018] |
| 5 | 1661575 | BYERS, BROOK | MENLO PARK | CA | 940257020 | [2014, 2016, 2018, 2020] |
| 6 | 2576925 | CUELLAR, CLIFFORD | TACOMA | WA | 984053360 | [2016, 2018, 2020] |
| 7 | 3678807 | FERSTER, DAVID | WILMETTE | IL | 600911553 | [2014, 2016, 2018, 2020] |
| 8 | 3850042 | FORDE, JAMES | TUSTIN | CA | 927806320 | [2018] |
| 9 | 3993057 | FRIEDMAN, DONNA | MOUNT PLEASANT | SC | 294644305 | [2014, 2016, 2018, 2020] |


### Create `ctx_indiv` ###

In [17]:
%%sql
create or replace view ctx_indiv as
select i.*
  from ctx_donor dx
  join indiv i on i.donor_indiv_id = dx.id

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


Note that there will likely be more Individuals in the context (compared to Donors above) due to the coalescing logic in `indiv_group`.  Individual records that are combined to form a Donor should be adjacent to each other here.
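A quick way to spot which Donors were formed from more than one Individual is to count Individuals per Donor.  The following Python sketch shows the idea on hypothetical `(id, donor_indiv_id)` pairs; the same check could be written as a `group by ... having count(*) > 1` query against `ctx_indiv`.

```python
from collections import Counter

# Hypothetical (indiv id, donor_indiv_id) pairs, shaped like rows of ctx_indiv
ctx_indiv_rows = [
    (101, 101),
    (102, 101),
    (201, 201),
    (301, 301),
    (302, 301),
    (303, 301),
]

# Number of Individual records behind each Donor
per_donor = Counter(donor_id for _, donor_id in ctx_indiv_rows)

# Donors that coalesce more than one Individual
coalesced = {donor_id: n for donor_id, n in per_donor.items() if n > 1}

# How many more Individuals than Donors are in the context
extra_indivs = len(ctx_indiv_rows) - len(per_donor)
```

The value of `extra_indivs` corresponds to the gap between the Individual and Donor row counts in the context.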

In [18]:
%%sql
select id,
       name,
       city,
       state,
       zip_code,
       elect_cycles,
       donor_indiv_id
  from ctx_indiv
 order by donor_indiv_id, name

 * postgresql+psycopg2://crash@localhost/fecdb
54 rows affected.


|   | id | name | city | state | zip_code | elect_cycles | donor_indiv_id |
|---|----|------|------|-------|----------|--------------|----------------|
| 0 | 28429 | ABRAMSON, RONALD | WASHINGTON | DC | 200063807 | [2012, 2014, 2016, 2018, 2020] | 28429 |
| 1 | 678448 | BASSI, STEVE | CARLSBAD | CA | 920081900 | [2018] | 678448 |
| 2 | 779629 | BEEUWKES, REINIER | CONCORD | MA | 17425322 | [2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020] | 779629 |
| 3 | 877907 | BERG, FRED | CUSHING | ME | 45633307 | [2016, 2018] | 877907 |
| 4 | 1083690 | BLUE, ALLEN | VENICE | CA | 902913830 | [2016, 2018] | 1083690 |
| 5 | 1661575 | BYERS, BROOK | MENLO PARK | CA | 940257020 | [2014, 2016, 2018, 2020] | 1661575 |
| 6 | 2576925 | CUELLAR, CLIFFORD | TACOMA | WA | 984053360 | [2016, 2018, 2020] | 2576925 |
| 7 | 3678807 | FERSTER, DAVID | WILMETTE | IL | 600911553 | [2014, 2016, 2018, 2020] | 3678807 |
| 8 | 3850042 | FORDE, JAMES | TUSTIN | CA | 927806320 | [2018] | 3850042 |
| 9 | 3993057 | FRIEDMAN, DONNA | MOUNT PLEASANT | SC | 294644305 | [2014, 2016, 2018, 2020] | 3993057 |


### Create `ctx_indiv_contrib` ###

In [19]:
%%sql
create or replace view ctx_indiv_contrib as
select ic.*
  from ctx_indiv ix
  join indiv_contrib ic on ic.indiv_id = ix.id

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


In [20]:
%%sql
select count(*)             as contribs,
       sum(transaction_amt) as total_amt,
       array_agg(distinct elect_cycle) as elect_cycles
  from ctx_indiv_contrib

 * postgresql+psycopg2://crash@localhost/fecdb
1 rows affected.


|   | contribs | total_amt | elect_cycles |
|---|----------|-----------|--------------|
| 0 | 4369 | 6663529.0 | [2002, 2004, 2006, 2008, 2010, 2012, 2014, 201... |


### Create `ctx_donor_contrib` ###

This is really the same as `ctx_indiv_contrib`, except that we are adding `donor_indiv_id` on top of the `indiv_contrib` columns so that queries using this context view are able to join to (and/or group by) the underlying Donor record (and not just the Individual associated with the contribution record).

In [21]:
%%sql
create or replace view ctx_donor_contrib as
select ic.*,
       ix.donor_indiv_id
  from ctx_indiv ix
  join indiv_contrib ic on ic.indiv_id = ix.id

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


In [22]:
%%sql
select d.id                 as donor_id,
       d.name               as donor_name,
       count(*)             as contribs,
       sum(transaction_amt) as total_amt,
       array_agg(distinct elect_cycle) as elect_cycles
  from ctx_donor_contrib cx
  join donor_indiv d on d.id = cx.donor_indiv_id
 group by 1, 2
 order by 4 desc, 3 desc

 * postgresql+psycopg2://crash@localhost/fecdb
50 rows affected.


|   | donor_id | donor_name | contribs | total_amt | elect_cycles |
|---|----------|------------|----------|-----------|--------------|
| 0 | 779629 | BEEUWKES, REINIER | 433 | 888258.0 | [2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020] |
| 1 | 6169689 | KARPLUS, BARBARA | 340 | 878285.0 | [2016, 2018, 2020] |
| 2 | 10979072 | SHENKER, SCOTT | 241 | 849800.0 | [2014, 2016, 2018, 2020] |
| 3 | 28429 | ABRAMSON, RONALD | 295 | 471583.0 | [2012, 2014, 2016, 2018, 2020] |
| 4 | 1083690 | BLUE, ALLEN | 81 | 426400.0 | [2016, 2018] |
| 5 | 7863787 | MCEVOY, NION | 558 | 383899.0 | [2002, 2004, 2006, 2008, 2010, 2012, 2014, 201... |
| 6 | 11902785 | TAYLOR, DALE | 48 | 258050.0 | [2016, 2018, 2020] |
| 7 | 4336953 | GIRARDI, THOMAS V. | 24 | 235100.0 | [2014, 2016, 2018, 2020] |
| 8 | 6433238 | KIRK, CLAY | 98 | 211850.0 | [2014, 2016, 2018, 2020] |
| 9 | 1661575 | BYERS, BROOK | 60 | 194900.0 | [2014, 2016, 2018, 2020] |
