# Query by Individual #

## Overview ##

Explore the FEC data by specifying SQL predicates that identify **Individuals**, which are people identities extracted (and somewhat cleansed) from the [Individual Contributions](https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description/) file.  Individual records (stored in the `indiv` table) are distinct combinations of name and address information (city, state, zip code) that have not been aggressively deduplicated.  Thus, a real-world person will have multiple records if the identifying information on their contribution records contains variants (or typos, or deception).

Querying by Individual can be used to target all of the `indiv` records (and associated contribution data in `indiv_contrib`) for a single person, or for a set of people to be explored collectively.  Examples of both usages will be presented here.

Note that this approach will create the following query contexts (each of which may be used in formulating specific queries for investigation or reporting):

* `ctx_indiv`
* `ctx_contrib`

One limitation of Querying by Individual is that it is difficult to distinguish between the contributions of distinct people identities within a result set.
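
To make this limitation concrete, here is a minimal sketch that joins `ctx_contrib` back to `ctx_indiv` and groups by `indiv` record. It assumes only the `id`, `name`, `indiv_id` and `transaction_amt` columns used elsewhere in this notebook. The tables are toy stand-ins with invented rows, and SQLite (via Python's `sqlite3`) replaces PostgreSQL so the example runs on its own.

```python
import sqlite3

# Toy stand-ins for `indiv` and `indiv_contrib`; every row is invented
# for illustration and SQLite stands in for the notebook's PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table indiv (id integer primary key, name text);
    create table indiv_contrib (indiv_id integer, transaction_amt real);
    insert into indiv values
        (1, 'DOE, JANE'), (2, 'DOE, JANE A'), (3, 'DOE, JOHN');
    insert into indiv_contrib values
        (1, 500.0), (2, 250.0), (2, 1000.0), (3, 100.0);
    create view ctx_indiv as
    select * from indiv where name like 'DOE, %';
    create view ctx_contrib as
    select ic.*
      from ctx_indiv ix
      join indiv_contrib ic on ic.indiv_id = ix.id;
""")

# Break the contribution context out by `indiv` record.
by_record = conn.execute("""
    select ix.name,
           count(*)                as contribs,
           sum(cx.transaction_amt) as total_amt
      from ctx_contrib cx
      join ctx_indiv ix on ix.id = cx.indiv_id
     group by ix.id, ix.name
     order by ix.id
""").fetchall()

for row in by_record:
    print(row)
```

The breakdown is per `indiv` record, not per real-world person: the two name variants for the hypothetical Jane Doe remain separate rows, so attributing them to one person still takes manual judgment.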

## Notebook Setup ##

* Configure database connect information and options
* Clear potentially interfering context (PostgreSQL doesn't let you replace a view definition with conflicting column names)
* Set styling for notebook

In [1]:
sqlconnect = "postgresql+psycopg2://crash@localhost/fecdb"

%load_ext sql
%config SqlMagic.autopandas=True
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
%sql $sqlconnect

'Connected: crash@fecdb'

In [2]:
%sql drop view if exists ctx_contrib cascade
%sql drop view if exists ctx_indiv cascade

 * postgresql+psycopg2://crash@localhost/fecdb
Done.
 * postgresql+psycopg2://crash@localhost/fecdb
Done.


In [3]:
%%html
<style>
  tr, th, td {
    text-align: left !important;
  }
</style>

## Create Context Views &ndash; Single-Person Use Case ##

### Create `ctx_indiv` ###

For this use case, we'll identify the `indiv` records associated with an identity that we previously queried (in `el_queries1.sql` and `el_queries3.sql`)

In [4]:
%%sql
create or replace view ctx_indiv as
select *
  from indiv
 where name like 'SANDELL, SCOTT%'
   and zip_code ~ '9402[58]'
   and name !~ 'MRS\.'

 * postgresql+psycopg2://crash@localhost/fecdb
Done.
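
The predicate above combines `like` with PostgreSQL's regular-expression operators: `~` (matches) and `!~` (does not match), both case-sensitive and unanchored. The sketch below approximates it in Python to show which rows it admits. `matches_ctx` is a hypothetical helper, Python's `re` is only a stand-in for PostgreSQL's regex engine (the two agree for patterns this simple), and the zip code `194025` is invented to show the effect of the missing anchor.

```python
import re

def matches_ctx(name: str, zip_code: str) -> bool:
    """Approximate the single-person predicate for one (name, zip_code) pair."""
    return (
        name.startswith("SANDELL, SCOTT")                  # name like 'SANDELL, SCOTT%'
        and re.search(r"9402[58]", zip_code) is not None   # zip_code ~ '9402[58]'
        and re.search(r"MRS\.", name) is None              # name !~ 'MRS\.'
    )

# A 9-digit zip matches because the pattern is not anchored at the end.
print(matches_ctx("SANDELL, SCOTT D", "940257022"))
# The "MRS." variant is excluded by the negated match.
print(matches_ctx("SANDELL, SCOTT MRS.", "94025"))
# The pattern is not anchored at the start either, so this invented zip matches.
print(matches_ctx("SANDELL, SCOTT", "194025"))
```

If stray matches like the last one ever became a concern, anchoring the pattern (`'^9402[58]'`) would restrict it to zip codes that begin with those digits.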


Let's take a quick look at the context we just set (for validation) before proceeding

In [5]:
%%sql
select id,
       name,
       city,
       state,
       zip_code,
       elect_cycles
  from ctx_indiv

 * postgresql+psycopg2://crash@localhost/fecdb
18 rows affected.


Unnamed: 0,id,name,city,state,zip_code,elect_cycles
0,10527433,"SANDELL, SCOTT D",MENLO PARK,CA,94025,"[2004, 2006, 2008, 2010]"
1,10527430,"SANDELL, SCOTT",MENLO PARK,CA,940257022,"[2016, 2018, 2020]"
2,10527429,"SANDELL, SCOTT",MENLO PARK,CA,94025,"[2000, 2008, 2010, 2012, 2016]"
3,10527435,"SANDELL, SCOTT D",MENLO PARK,CA,940257022,[2016]
4,10527437,"SANDELL, SCOTT D",PORTOLA VALLEY,CA,940287608,[2016]
5,10527445,"SANDELL, SCOTT MR.",PORTOLA VALLEY,CA,94028,[2018]
6,10527438,"SANDELL, SCOTT D.",MENLO PARK,CA,94025,[2010]
7,10527431,"SANDELL, SCOTT",PORTOLA VALLEY,CA,94028,"[2010, 2016]"
8,10527434,"SANDELL, SCOTT D",MENLO PARK,CA,940256112,[2014]
9,10527441,"SANDELL, SCOTT D. MR.",MENLO PARK,CA,94025,"[2000, 2008, 2010, 2012]"


### Create `ctx_contrib` ###

Now we'll create the context view for the contributions from the targeted "Individual" records

In [6]:
%%sql
create or replace view ctx_contrib as
select ic.*
  from ctx_indiv ix
  join indiv_contrib ic on ic.indiv_id = ix.id

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


And some quick validation on the view

In [7]:
%%sql
select count(*)             as contribs,
       sum(transaction_amt) as total_amt,
       array_agg(distinct elect_cycle) as elect_cycles
  from ctx_contrib

 * postgresql+psycopg2://crash@localhost/fecdb
1 rows affected.


Unnamed: 0,contribs,total_amt,elect_cycles
0,73,227250.0,"[2000, 2002, 2004, 2006, 2008, 2010, 2012, 201..."


## Query Based on Context &ndash; Single-Person Use Case ##

### Query using `ctx_indiv` ###

Now we can use this context to do a little investigation.  Drawing on `el_queries1.sql`, let's take a look at the contributions by election cycle from the identity that we have just isolated.

In [8]:
%%sql
select ic.elect_cycle,
       count(*) cycle_contribs,
       sum(ic.transaction_amt) cycle_amount,
       round(avg(ic.transaction_amt), 2) avg_amount,
       min(ic.transaction_amt) min_amount,
       max(ic.transaction_amt) max_amount
  from ctx_indiv ix
  join indiv_contrib ic on ic.indiv_id = ix.id
 group by 1
 order by 1

 * postgresql+psycopg2://crash@localhost/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,cycle_contribs,cycle_amount,avg_amount,min_amount,max_amount
0,2000,4,2000.0,500.0,250.0,1000.0
1,2002,3,5800.0,1933.33,1400.0,2500.0
2,2004,5,7150.0,1430.0,500.0,2000.0
3,2006,3,4850.0,1616.67,1100.0,2500.0
4,2008,10,10650.0,1065.0,-2300.0,2300.0
5,2010,9,15950.0,1772.22,1000.0,5000.0
6,2012,4,3650.0,912.5,500.0,1175.0
7,2014,1,2500.0,2500.0,2500.0,2500.0
8,2016,21,83500.0,3976.19,-2500.0,20000.0
9,2018,10,80600.0,8060.0,2500.0,20000.0


### Query using `ctx_contrib` ###

Now let's do the same query using the context view on the contribution data, rather than having to join to `indiv_contrib` explicitly (should get the same results as above)

In [9]:
%%sql
select cx.elect_cycle,
       count(*) cycle_contribs,
       sum(cx.transaction_amt) cycle_amount,
       round(avg(cx.transaction_amt), 2) avg_amount,
       min(cx.transaction_amt) min_amount,
       max(cx.transaction_amt) max_amount
  from ctx_contrib cx
 group by 1
 order by 1

 * postgresql+psycopg2://crash@localhost/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,cycle_contribs,cycle_amount,avg_amount,min_amount,max_amount
0,2000,4,2000.0,500.0,250.0,1000.0
1,2002,3,5800.0,1933.33,1400.0,2500.0
2,2004,5,7150.0,1430.0,500.0,2000.0
3,2006,3,4850.0,1616.67,1100.0,2500.0
4,2008,10,10650.0,1065.0,-2300.0,2300.0
5,2010,9,15950.0,1772.22,1000.0,5000.0
6,2012,4,3650.0,912.5,500.0,1175.0
7,2014,1,2500.0,2500.0,2500.0,2500.0
8,2016,21,83500.0,3976.19,-2500.0,20000.0
9,2018,10,80600.0,8060.0,2500.0,20000.0


## Create Context Views &ndash; Multi-Person Use Case ##

### Create `ctx_indiv` ###

For this use case, we'll identify the `indiv` records associated with the household (multiple people) previously queried (in `el_queries1.sql` and `el_queries3.sql`)

In [10]:
%%sql
create or replace view ctx_indiv as
select *
  from indiv
 where name like 'SANDELL, %'
   and zip_code ~ '9402[58]'

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


Let's take a quick look at the context we just set (for validation) before proceeding

In [11]:
%%sql
select id,
       name,
       city,
       state,
       zip_code,
       elect_cycles
  from ctx_indiv

 * postgresql+psycopg2://crash@localhost/fecdb
27 rows affected.


Unnamed: 0,id,name,city,state,zip_code,elect_cycles
0,10527433,"SANDELL, SCOTT D",MENLO PARK,CA,94025,"[2004, 2006, 2008, 2010]"
1,10527447,"SANDELL, SCOTT MRS.",MENLO PARK,CA,94025,[2004]
2,10527430,"SANDELL, SCOTT",MENLO PARK,CA,940257022,"[2016, 2018, 2020]"
3,10527429,"SANDELL, SCOTT",MENLO PARK,CA,94025,"[2000, 2008, 2010, 2012, 2016]"
4,10527435,"SANDELL, SCOTT D",MENLO PARK,CA,940257022,[2016]
5,10527437,"SANDELL, SCOTT D",PORTOLA VALLEY,CA,940287608,[2016]
6,10527445,"SANDELL, SCOTT MR.",PORTOLA VALLEY,CA,94028,[2018]
7,10527438,"SANDELL, SCOTT D.",MENLO PARK,CA,94025,[2010]
8,10527431,"SANDELL, SCOTT",PORTOLA VALLEY,CA,94028,"[2010, 2016]"
9,10527434,"SANDELL, SCOTT D",MENLO PARK,CA,940256112,[2014]


### Create `ctx_contrib` ###

Note that we don't actually have to recreate this (see Summary, below), but we are doing it just so this use case can be extracted to work stand-alone.

In [12]:
%%sql
create or replace view ctx_contrib as
select ic.*
  from ctx_indiv ix
  join indiv_contrib ic on ic.indiv_id = ix.id

 * postgresql+psycopg2://crash@localhost/fecdb
Done.


And some quick validation on the view

In [13]:
%%sql
select count(*)             as contribs,
       sum(transaction_amt) as total_amt,
       array_agg(distinct elect_cycle) as elect_cycles
  from ctx_contrib

 * postgresql+psycopg2://crash@localhost/fecdb
1 rows affected.


Unnamed: 0,contribs,total_amt,elect_cycles
0,101,264450.0,"[2000, 2002, 2004, 2006, 2008, 2010, 2012, 201..."


## Query Based on Context &ndash; Multi-Person Use Case ##

### Query using `ctx_indiv` ###

Now we can use this context to do a little investigation.  Drawing on `el_queries1.sql`, let's take a look at the contributions by election cycle from the household that we have just isolated.

In [14]:
%%sql
select ic.elect_cycle,
       count(*) cycle_contribs,
       sum(ic.transaction_amt) cycle_amount,
       round(avg(ic.transaction_amt), 2) avg_amount,
       min(ic.transaction_amt) min_amount,
       max(ic.transaction_amt) max_amount
  from ctx_indiv ix
  join indiv_contrib ic on ic.indiv_id = ix.id
 group by 1
 order by 1

 * postgresql+psycopg2://crash@localhost/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,cycle_contribs,cycle_amount,avg_amount,min_amount,max_amount
0,2000,4,2000.0,500.0,250.0,1000.0
1,2002,3,5800.0,1933.33,1400.0,2500.0
2,2004,15,17400.0,1160.0,250.0,2500.0
3,2006,6,9350.0,1558.33,1000.0,2500.0
4,2008,17,17200.0,1011.76,-2300.0,2300.0
5,2010,11,20750.0,1886.36,1000.0,5000.0
6,2012,4,3650.0,912.5,500.0,1175.0
7,2014,1,2500.0,2500.0,2500.0,2500.0
8,2016,24,88200.0,3675.0,-2500.0,20000.0
9,2018,12,86000.0,7166.67,2500.0,20000.0


### Query using `ctx_contrib` ###

Now let's do the same query using the context view on the contribution data, rather than having to join to `indiv_contrib` explicitly (should get the same results as above)

In [15]:
%%sql
select cx.elect_cycle,
       count(*) cycle_contribs,
       sum(cx.transaction_amt) cycle_amount,
       round(avg(cx.transaction_amt), 2) avg_amount,
       min(cx.transaction_amt) min_amount,
       max(cx.transaction_amt) max_amount
  from ctx_contrib cx
 group by 1
 order by 1

 * postgresql+psycopg2://crash@localhost/fecdb
11 rows affected.


Unnamed: 0,elect_cycle,cycle_contribs,cycle_amount,avg_amount,min_amount,max_amount
0,2000,4,2000.0,500.0,250.0,1000.0
1,2002,3,5800.0,1933.33,1400.0,2500.0
2,2004,15,17400.0,1160.0,250.0,2500.0
3,2006,6,9350.0,1558.33,1000.0,2500.0
4,2008,17,17200.0,1011.76,-2300.0,2300.0
5,2010,11,20750.0,1886.36,1000.0,5000.0
6,2012,4,3650.0,912.5,500.0,1175.0
7,2014,1,2500.0,2500.0,2500.0,2500.0
8,2016,24,88200.0,3675.0,-2500.0,20000.0
9,2018,12,86000.0,7166.67,2500.0,20000.0


## Summary ##

It is significant that the only difference between the "Single-Person Use Case" and the "Multi-Person Use Case" above was the definition of the `ctx_indiv` view.  The contribution context (and hence the SQL definition of `ctx_contrib`) is entirely contingent upon the "Individual" context, and any queries used for investigation or reporting based on either of these context views can be reused without change when targeting other sets of "Individual" records.
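
The reuse described above can be sketched in a self-contained way. This is an illustration under stated assumptions, not the notebook's own setup: the tables are toy stand-ins with invented rows, SQLite (via Python's `sqlite3`) replaces PostgreSQL, and `set_ctx_indiv` is a hypothetical helper that drops and recreates the view because SQLite has no `create or replace view`.

```python
import sqlite3

# Toy stand-ins for `indiv` and `indiv_contrib`; every row is invented
# for illustration and SQLite stands in for the notebook's PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table indiv (id integer primary key, name text, zip_code text);
    create table indiv_contrib (indiv_id integer, elect_cycle integer,
                                transaction_amt real);
    insert into indiv values
        (1, 'DOE, JANE',   '12345'),
        (2, 'DOE, JANE A', '12345'),
        (3, 'DOE, JOHN',   '12345');
    insert into indiv_contrib values
        (1, 2016, 500.0), (2, 2016, 250.0),
        (2, 2018, 1000.0), (3, 2018, 100.0);
""")

def set_ctx_indiv(predicate: str) -> None:
    """(Re)define ctx_indiv from a trusted SQL predicate (illustration only)."""
    conn.execute("drop view if exists ctx_indiv")
    conn.execute(f"create view ctx_indiv as select * from indiv where {predicate}")

# Single-person context, with the contingent view defined exactly once.
set_ctx_indiv("name like 'DOE, JANE%'")
conn.execute("""
    create view ctx_contrib as
    select ic.*
      from ctx_indiv ix
      join indiv_contrib ic on ic.indiv_id = ix.id
""")

# The same report query is used, unchanged, for both contexts.
REPORT = "select count(*), sum(transaction_amt) from ctx_contrib"
single = conn.execute(REPORT).fetchone()

# Multi-person context: only ctx_indiv changes.
set_ctx_indiv("name like 'DOE, %'")
multi = conn.execute(REPORT).fetchone()

print(single, multi)
```

Only the predicate passed to `set_ctx_indiv` differs between the two runs; `ctx_contrib` and the report query are left untouched.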

Note that the context views (here and in subsequent notebooks) can also be created as [materialized views](https://www.postgresql.org/docs/10/rules-materializedviews.html), especially if they are somewhat expensive to query (large or complex query logic or result sets) and will be used multiple times for investigation or reporting.  Purely contingent context views (e.g. `ctx_contrib` in this notebook) do not have to be recreated when the underlying context is changed, though they will need to be "[refreshed](https://www.postgresql.org/docs/10/sql-refreshmaterializedview.html)" if defined as a materialized view.