## Exploratory Data Analysis


Brad Howlett (bth2g)  
Eric Larson (rel4yx)  
Hanim Song (hs4cf)

---

In [1]:
#import findspark
#findspark.init()

In [54]:
import os
from functools import reduce

import numpy as np
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as typ
from pyspark.sql import DataFrame, SparkSession, SQLContext
from pyspark.sql.functions import asc, col
from pyspark.sql.types import DateType, DoubleType, StringType, StructField, StructType

In [3]:
import pyspark.sql.column
# Note: this `sum` shadows Python's built-in sum() for the rest of the notebook
from pyspark.sql.functions import sum, count, avg, expr, lit

In [4]:
spark = SparkSession \
    .builder \
    .getOrCreate()

sc = spark.sparkContext

## Utility functions

In [5]:
def csv_header(filename: str):
    with open(filename, 'r') as csv:
        line = csv.readline().rstrip()
    headers = line.split(',')
    return headers
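
The header file is a single comma-separated line, so a plain `split(',')` works as long as no column name is quoted or padded with spaces. As a hedged alternative, the sketch below reads the same first line with the standard `csv` module. The name `csv_header_safe` and the temporary file contents are made up for illustration.

```python
import csv
import os
import tempfile


def csv_header_safe(filename: str) -> list:
    """Return the column names from the first line of a CSV header file."""
    with open(filename, 'r', newline='') as f:
        # csv.reader handles quoted fields; next() returns the first row
        return [name.strip() for name in next(csv.reader(f))]


# Illustrative check against a throwaway file (contents are made up)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('CAND_ID,CAND_NAME,"CAND_PTY_AFFILIATION"\n')
    tmp_path = tmp.name

headers = csv_header_safe(tmp_path)
os.remove(tmp_path)
print(headers)
```

For the FEC header files shown here, both versions should return the same list.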

In [6]:
csv_header('fec_data/cn/cn_header_file.csv')

['CAND_ID',
 'CAND_NAME',
 'CAND_PTY_AFFILIATION',
 'CAND_ELECTION_YR',
 'CAND_OFFICE_ST',
 'CAND_OFFICE',
 'CAND_OFFICE_DISTRICT',
 'CAND_ICI',
 'CAND_STATUS',
 'CAND_PCC',
 'CAND_ST1',
 'CAND_ST2',
 'CAND_CITY',
 'CAND_ST',
 'CAND_ZIP']

### Based on data from: https://www.fec.gov/data/browse-data/?tab=bulk-data
This is House/Senate campaign finance data:

In [7]:
data_dir = 'fec_data'
year20 = '20'
year18 = '18'
year16 = '16'
years = [year20, year18, year16]


In [8]:
def read_data_in(dataname: str, fields_to_double):
    """Load one file per entry in `years` for the given FEC dataset.

    Returns a dict mapping each two-digit year to a DataFrame; every
    column is a string except those listed in fields_to_double.
    """
    dfs = {}
    header_row = csv_header(f'{data_dir}/{dataname}/{dataname}_header_file.csv')
    # default to StringType for ease of loading, then cast selected fields below
    fields = [typ.StructField(h, typ.StringType(), True) for h in header_row]
    schema = typ.StructType(fields)

    for year in years:
        txt_filename = f'{data_dir}/{dataname}/{dataname}{year}.txt'
        # each line of the text file is one pipe-delimited record
        rows = sc.textFile(txt_filename).map(lambda row: row.split('|'))
        df = spark.createDataFrame(rows, schema)
        for field in fields_to_double:
            df = df.withColumn(field, df[field].cast(DoubleType()))
        dfs[year] = df
    return dfs
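
`read_data_in` pairs each split line with the header schema, so a line with the wrong number of fields is likely to fail once Spark evaluates the DataFrame. The plain-Python sketch below shows one way to guard against that. `split_row` is a hypothetical helper, and the rows (including the ID `H0XX00000`) are made up.

```python
def split_row(line: str, n_fields: int, sep: str = '|'):
    """Split one raw line and force it to exactly n_fields values.

    Hypothetical helper: short rows are padded with empty strings and
    over-long rows return None so the caller can filter them out.
    """
    parts = line.rstrip('\n').split(sep)
    if len(parts) > n_fields:
        return None
    return parts + [''] * (n_fields - len(parts))


header = ['CAND_ID', 'CAND_NAME', 'CAND_PTY_AFFILIATION']  # made-up subset
rows = [
    'H0AK00105|LAMB, THOMAS|NNE',
    'H0AK00113|TUGATUK, RAY SEAN',   # one field short
    'H0XX00000|A|B|C',               # one field too many
]
parsed = [split_row(r, len(header)) for r in rows]
clean = [p for p in parsed if p is not None]
print(clean)
```

In the Spark version, the same helper could be applied with `.map(...)` followed by `.filter(lambda r: r is not None)` on the RDD, assuming that padding short rows is acceptable for the analysis.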

**Create dataframes and combine the three files together for analysis:**

### This is the original data that you had read in
The all candidates file contains summary financial information for each candidate who raised or spent money during the period, regardless of when they are up for election.

The file has one record per candidate and shows information about the candidate, total receipts, transfers received from authorized committees, total disbursements, transfers given to authorized committees, cash-on-hand totals, loans and debts, and other financial summary information.

In [9]:
# weball: create campaigns (weball data dir) dataframes
weball_double_fields = [
    'TTL_RECEIPTS', 
    'TTL_INDIV_CONTRIB', 
    'CAND_CONTRIB', 
    'OTHER_POL_CMTE_CONTRIB',
    'POL_PTY_CONTRIB'
]
dfs = read_data_in('weball', weball_double_fields)
# unionAll matches columns by position, so all three years must share the same column order
df = reduce(DataFrame.unionAll, dfs.values())
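
Once the three years are stacked, nothing in the combined rows records which election cycle a row came from, unless the file itself carries such a column. The sketch below illustrates, in plain Python with made-up rows, tagging each row with its cycle before applying the same `reduce` pattern. `CYCLE` is a hypothetical column name.

```python
from functools import reduce

# Made-up rows standing in for the per-year DataFrames
rows_by_year = {
    '20': [{'CAND_ID': 'A1', 'TTL_RECEIPTS': 10.0}],
    '18': [{'CAND_ID': 'A1', 'TTL_RECEIPTS': 7.5}],
    '16': [{'CAND_ID': 'B2', 'TTL_RECEIPTS': 3.0}],
}

# Tag every row with its cycle (CYCLE is a hypothetical column name)
tagged = [
    [dict(row, CYCLE=year) for row in rows]
    for year, rows in rows_by_year.items()
]

# Same reduce pattern as the union above, with list concatenation
combined = reduce(lambda left, right: left + right, tagged)
print(len(combined), sorted({r['CYCLE'] for r in combined}))
```

In Spark, the equivalent tag would be something like `df.withColumn('CYCLE', F.lit(year))` applied to each DataFrame before the union.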

The candidate master file contains basic information for each candidate, including:

- Candidates who have filed a Statement of Candidacy (Form 2) for the upcoming election
- Candidates who have active campaign committees without regard to election year
- Candidates who are referenced as a part of a draft committee or a nonconnected committee that registers as supporting or opposing a particular candidate

The file shows the candidate's identification number, candidate's name, party affiliation, election year, office state, office sought, district, incumbent/challenger status, status as a candidate, name of the candidate's principal campaign committee, and address.

In [10]:
# cn: The candidate master file contains basic information for each candidate
cn_df = read_data_in('cn', [])

In [11]:
cn_df[year20].show(5)

+---------+--------------------+--------------------+----------------+--------------+-----------+--------------------+--------+-----------+---------+--------------------+--------+------------+-------+--------+
|  CAND_ID|           CAND_NAME|CAND_PTY_AFFILIATION|CAND_ELECTION_YR|CAND_OFFICE_ST|CAND_OFFICE|CAND_OFFICE_DISTRICT|CAND_ICI|CAND_STATUS| CAND_PCC|            CAND_ST1|CAND_ST2|   CAND_CITY|CAND_ST|CAND_ZIP|
+---------+--------------------+--------------------+----------------+--------------+-----------+--------------------+--------+-----------+---------+--------------------+--------+------------+-------+--------+
|H0AK00105|        LAMB, THOMAS|                 NNE|            2020|            AK|          H|                  00|       C|          N|C00607515|1861 W LAKE LUCIL...|        |     WASILLA|     AK|   99654|
|H0AK00113|   TUGATUK, RAY SEAN|                 DEM|            2020|            AK|          H|                  00|       C|          N|         |          P

The committee master file contains one record for each committee registered with the Federal Election Commission. This includes federal political action committees and party committees, campaign committees for presidential, House and Senate candidates, as well as groups or organizations that are spending money for or against candidates for federal office.

In [12]:
# cm: committee master file
cm_df = read_data_in('cm', [])
cm_df[year20].show(5)

+---------+--------------------+--------------------+--------------------+---------+-------------+-------+---------+---------+-------+--------------------+----------------+------+--------------------+-------+
|  CMTE_ID|             CMTE_NM|             TRES_NM|            CMTE_ST1| CMTE_ST2|    CMTE_CITY|CMTE_ST| CMTE_ZIP|CMTE_DSGN|CMTE_TP|CMTE_PTY_AFFILIATION|CMTE_FILING_FREQ|ORG_TP|    CONNECTED_ORG_NM|CAND_ID|
+---------+--------------------+--------------------+--------------------+---------+-------------+-------+---------+---------+-------+--------------------+----------------+------+--------------------+-------+
|C00000059|  HALLMARK CARDS PAC|           SARAH MOE|          2501 MCGEE|  MD #500|  KANSAS CITY|     MO|    64108|        U|      Q|                 UNK|               M|     C|                    |       |
|C00000422|AMERICAN MEDICAL ...|   WALKER, KEVIN MR.|25 MASSACHUSETTS ...|SUITE 600|   WASHINGTON|     DC|200017400|        B|      Q|                    |         

In [13]:
# ccl: This file contains one record for each candidate to committee linkage.
ccl_df = read_data_in('ccl', [])
ccl_df[year20].show(5)

+---------+----------------+---------------+---------+-------+---------+----------+
|  CAND_ID|CAND_ELECTION_YR|FEC_ELECTION_YR|  CMTE_ID|CMTE_TP|CMTE_DSGN|LINKAGE_ID|
+---------+----------------+---------------+---------+-------+---------+----------+
|C00713602|            2019|           2020|C00712851|      O|        U|    228963|
|H0AK00105|            2020|           2020|C00607515|      H|        P|    229250|
|H0AL01055|            2020|           2020|C00697789|      H|        P|    226125|
|H0AL01063|            2020|           2020|C00701557|      H|        P|    227053|
|H0AL01071|            2020|           2020|C00701409|      H|        P|    227054|
+---------+----------------+---------------+---------+-------+---------+----------+
only showing top 5 rows



The House/Senate campaigns file (webl) has one record per House and Senate campaign committee and shows information about the candidate, total receipts, transfers received from authorized committees, total disbursements, transfers given to authorized committees, cash-on-hand totals, loans and debts, and other financial summary information.

In [14]:
# webl: House Senate campaigns
webl_df = read_data_in('webl', [])
webl_df[year20].show(5)

+---------+--------------------+--------+------+--------------------+------------+---------------+---------+-------------+---------+----------+------------+----------+-----------+---------------+----------------+-------------+-----------------+--------------+--------------------+-------------+-------------+------------+------------+--------------------+----------------------+---------------+----------+-------------+------------+
|  CAND_ID|           CAND_NAME|CAND_ICI|PTY_CD|CAND_PTY_AFFILIATION|TTL_RECEIPTS|TRANS_FROM_AUTH| TTL_DISB|TRANS_TO_AUTH|  COH_BOP|   COH_COP|CAND_CONTRIB|CAND_LOANS|OTHER_LOANS|CAND_LOAN_REPAY|OTHER_LOAN_REPAY|DEBTS_OWED_BY|TTL_INDIV_CONTRIB|CAND_OFFICE_ST|CAND_OFFICE_DISTRICT|SPEC_ELECTION|PRIM_ELECTION|RUN_ELECTION|GEN_ELECTION|GEN_ELECTION_PRECENT|OTHER_POL_CMTE_CONTRIB|POL_PTY_CONTRIB|CVG_END_DT|INDIV_REFUNDS|CMTE_REFUNDS|
+---------+--------------------+--------+------+--------------------+------------+---------------+---------+-------------+---------+--


**Contributions from committees to candidates file description**

The contributions from committees to candidates file is a subset of the itemized records (OTH) file and contains each contribution or independent expenditure made by a:

- PAC
- Party committee
- Candidate committee
- Other federal committee

and given to a candidate during the two-year election cycle.

In [15]:
# pas2: Contributions from committees to candidates
pas2_df = read_data_in('pas2', [])
pas2_df[year20].show(5)

+---------+---------+------+---------------+------------------+--------------+---------+--------------------+-----------+-----+--------+--------+----------+--------------+---------------+---------+---------+----------+--------+-------+---------+-------------------+
|  CMTE_ID|AMNDT_IND|RPT_TP|TRANSACTION_PGI|         IMAGE_NUM|TRANSACTION_TP|ENTITY_TP|                NAME|       CITY|STATE|ZIP_CODE|EMPLOYER|OCCUPATION|TRANSACTION_DT|TRANSACTION_AMT| OTHER_ID|  CAND_ID|   TRAN_ID|FILE_NUM|MEMO_CD|MEMO_TEXT|             SUB_ID|
+---------+---------+------+---------------+------------------+--------------+---------+--------------------+-----------+-----+--------+--------+----------+--------------+---------------+---------+---------+----------+--------+-------+---------+-------------------+
|C00567180|        T|   TER|          P2020|201901099143774199|           24K|      PAC|TED YOHO FOR CONG...|GAINESVILLE|   FL|   32608|        |          |      01082019|           1880|C00494583|H2FL0


**Contributions by individuals file description**

The contributions by individuals file contains information for contributions given by individuals. The method used to include contributions in this file has changed over time. From 2015 to the present, the threshold is greater than $200.

A contribution will be included if:

- The contribution's election cycle-to-date amount is over $200 for contributions to candidate committees
- The contribution's calendar year-to-date amount is over $200 for contributions to political action committees (PACs) and party committees

# The individual files are huge!
## Download these files from the Dropbox link and put them in the fec_data/indiv folder
- We should make a subset of these to work with in our code.
- Then, when things are getting close to final, run on the whole file?
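
One way to build the working subset is to stream the large file once and keep a random fraction of its lines. This is a minimal sketch under that assumption. `write_sample`, the file names, and the generated rows are all hypothetical.

```python
import os
import random
import tempfile


def write_sample(src: str, dst: str, fraction: float, seed: int = 0) -> int:
    """Stream src line by line and copy roughly `fraction` of lines to dst.

    Returns the number of lines written. The fixed seed keeps the
    subset reproducible across runs.
    """
    rng = random.Random(seed)
    kept = 0
    with open(src, 'r') as fin, open(dst, 'w') as fout:
        for line in fin:
            if rng.random() < fraction:
                fout.write(line)
                kept += 1
    return kept


# Illustrative run on a throwaway file with made-up rows
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'indiv_full.txt')     # hypothetical name
dst_path = os.path.join(tmp_dir, 'indiv_sample.txt')   # hypothetical name
with open(src_path, 'w') as f:
    for i in range(1000):
        f.write(f'C{i:08d}|N|M3|G|{i}\n')

n_kept = write_sample(src_path, dst_path, fraction=0.1, seed=42)
print(n_kept)
```

Because the file is read line by line, memory use stays flat regardless of file size. Pointing `read_data_in` at the sampled files would then only require swapping the files in the `indiv` folder.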

In [16]:
indiv_df = read_data_in('indiv', [])
indiv_df[year16].show(5)

+---------+---------+------+---------------+-----------+--------------+---------+-----------------+------------+-----+---------+----------------+--------------------+--------------+---------------+--------+--------------------+--------+-------+---------+-------------------+
|  CMTE_ID|AMNDT_IND|RPT_TP|TRANSACTION_PGI|  IMAGE_NUM|TRANSACTION_TP|ENTITY_TP|             NAME|        CITY|STATE| ZIP_CODE|        EMPLOYER|          OCCUPATION|TRANSACTION_DT|TRANSACTION_AMT|OTHER_ID|             TRAN_ID|FILE_NUM|MEMO_CD|MEMO_TEXT|             SUB_ID|
+---------+---------+------+---------------+-----------+--------------+---------+-----------------+------------+-----+---------+----------------+--------------------+--------------+---------------+--------+--------------------+--------+-------+---------+-------------------+
|C00088591|        N|    M3|              P|15970306895|            15|      IND|   BURCH, MARY K.|FALLS CHURCH|   VA|220424511|NORTHROP GRUMMAN|VP PROGRAM MANAGE...|      021

# New code that joins tables and counts the number of individual donations each candidate gets
Now that I have this, it should be easy to compute other stats and aggregate by state, party (Dem/Rep), etc.

In [31]:
# keep only donations for the general election (TRANSACTION_PGI starts with 'G')
# and discard primary ('P') and other election types
ind16 = indiv_df[year16].filter(col('TRANSACTION_PGI').startswith('G'))

In [32]:
#ind16.count()

In [33]:
# join on 'CMTE_ID', which gives us 'CAND_ID'
ind16a = ind16.join(ccl_df[year16], on='CMTE_ID', how='inner')

In [35]:
#ind16a.count()

In [36]:
# join on 'CAND_ID', which gives us all the info about the candidate
# NOTE: this step turns out to be unnecessary, because the same join is repeated after aggregating below
ind16b = ind16a.join(cn_df[year16], on='CAND_ID', how='inner')

In [37]:
# TODO
# for our purposes we need to combine multiple donations from one person to the same candidate into one row
# using some groupby code or such
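
A possible shape for the TODO above, sketched in plain Python on made-up rows: group by candidate plus a donor key and sum the amounts. Treating `NAME` plus `ZIP_CODE` as the donor identity is an assumption, not something the FEC data guarantees.

```python
from collections import defaultdict

# Made-up donation rows after the joins (only the fields used here)
donations = [
    {'CAND_ID': 'A1', 'NAME': 'DOE, JANE', 'ZIP_CODE': '22042', 'TRANSACTION_AMT': '250'},
    {'CAND_ID': 'A1', 'NAME': 'DOE, JANE', 'ZIP_CODE': '22042', 'TRANSACTION_AMT': '100'},
    {'CAND_ID': 'A1', 'NAME': 'ROE, SAM', 'ZIP_CODE': '10001', 'TRANSACTION_AMT': '500'},
    {'CAND_ID': 'B2', 'NAME': 'DOE, JANE', 'ZIP_CODE': '22042', 'TRANSACTION_AMT': '300'},
]

# Assumption: NAME plus ZIP_CODE is a good-enough donor identity
totals = defaultdict(float)
for d in donations:
    key = (d['CAND_ID'], d['NAME'], d['ZIP_CODE'])
    # amounts are loaded as strings, so convert before summing
    totals[key] += float(d['TRANSACTION_AMT'])

# One row per (candidate, donor) pair
for key, amount in sorted(totals.items()):
    print(key, amount)
```

In Spark this would roughly correspond to `groupBy('CAND_ID', 'NAME', 'ZIP_CODE')` with a sum over `TRANSACTION_AMT` cast to double, since `indiv` is read with every column as a string.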

In [38]:
# count the number of individual donations each candidate receives
ind16agg = ind16b.groupby(col('CAND_ID')).agg(
    count(lit(1)).alias('numdonat')
)

In [41]:
#ind16agg.show(10)

In [42]:
# join candidates table to above table to get details about candidate
numdonations = ind16agg.join(cn_df[year16], on='CAND_ID', how='inner')
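
With one row per candidate, the follow-on aggregates mentioned earlier (by state, by party) are a second group-by. The sketch below shows the idea in plain Python on made-up rows shaped like `numdonations`. The values are not taken from the real data.

```python
from collections import defaultdict

# Made-up rows shaped like numdonations (only the fields used here)
numdonations_rows = [
    {'CAND_ID': 'A1', 'numdonat': 120, 'CAND_PTY_AFFILIATION': 'DEM'},
    {'CAND_ID': 'B2', 'numdonat': 80, 'CAND_PTY_AFFILIATION': 'REP'},
    {'CAND_ID': 'C3', 'numdonat': 40, 'CAND_PTY_AFFILIATION': 'DEM'},
]

by_party = defaultdict(lambda: {'candidates': 0, 'donations': 0})
for row in numdonations_rows:
    bucket = by_party[row['CAND_PTY_AFFILIATION']]
    bucket['candidates'] += 1
    bucket['donations'] += row['numdonat']

# Average number of donations per candidate within each party
avg_by_party = {
    party: b['donations'] / b['candidates'] for party, b in by_party.items()
}
print(avg_by_party)
```

Swapping `CAND_PTY_AFFILIATION` for `CAND_OFFICE_ST` gives the per-state version.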

In [43]:
numdonations.cache()

DataFrame[CAND_ID: string, numdonat: bigint, CAND_NAME: string, CAND_PTY_AFFILIATION: string, CAND_ELECTION_YR: string, CAND_OFFICE_ST: string, CAND_OFFICE: string, CAND_OFFICE_DISTRICT: string, CAND_ICI: string, CAND_STATUS: string, CAND_PCC: string, CAND_ST1: string, CAND_ST2: string, CAND_CITY: string, CAND_ST: string, CAND_ZIP: string]

In [45]:
#numdonations.show(10)

## Continuation of original notebook code

In [40]:
df.count()

10476

---
**Number of columns:**

In [None]:
len(df.columns)

---
**Statistical summary of response variable:**

Our statistical summary will be based on whether a candidate won or lost the relevant political race.  
  
We are still gathering and joining that data to this set.

---
**Statistical summary of potential predictor variables:**

Total receipts -

In [46]:
df.select('TTL_RECEIPTS').describe().show()

+-------+-------------------+
|summary|       TTL_RECEIPTS|
+-------+-------------------+
|  count|              10476|
|   mean| 1594327.4810843852|
| stddev|4.960967162034303E7|
|    min|          -674132.5|
|    max|      4.824617973E9|
+-------+-------------------+
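
The summary above has a mean near 1.6 million, a standard deviation near 50 million, and a maximum in the billions, which suggests a heavily skewed column. A median is less sensitive to such extremes. The sketch below compares the two on made-up values and assumes the reported `stddev` is a sample standard deviation.

```python
import statistics

# Made-up receipts with one extreme value, mimicking a skewed column
receipts = [0.0, 0.0, 1500.0, 25000.0, 1362383.63, 4.8e9]

summary = {
    'count': len(receipts),
    'mean': statistics.mean(receipts),
    'stddev': statistics.stdev(receipts),   # sample standard deviation
    'min': min(receipts),
    'max': max(receipts),
    'median': statistics.median(receipts),  # not part of describe()
}
for name, value in summary.items():
    print(f'{name:>7}: {value:,.2f}')
```

On the Spark side, `df.approxQuantile('TTL_RECEIPTS', [0.5], 0.01)` should give an approximate median without collecting the column to the driver.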



Contributions by individuals -

In [21]:
df.select('TTL_INDIV_CONTRIB').describe().show()

+-------+--------------------+
|summary|   TTL_INDIV_CONTRIB|
+-------+--------------------+
|  count|               10476|
|   mean|   2462247.248615882|
| stddev|1.8436468552952343E8|
|    min|             -2695.0|
|    max|     1.8853982587E10|
+-------+--------------------+



Contributions by candidates -

In [22]:
df.select('CAND_CONTRIB').describe().show()

+-------+-------------------+
|summary|       CAND_CONTRIB|
+-------+-------------------+
|  count|              10476|
|   mean|  427017.3470924015|
| stddev|2.980191026125246E7|
|    min|                0.0|
|    max|      2.831281203E9|
+-------+-------------------+



Contributions from party committees -

In [23]:
df.select('POL_PTY_CONTRIB').describe().show()

+-------+------------------+
|summary|   POL_PTY_CONTRIB|
+-------+------------------+
|  count|             10476|
|   mean| 1114.714221076747|
| stddev|31051.635875742028|
|    min|               0.0|
|    max|         3100000.0|
+-------+------------------+



Contributions from other political committees -

In [24]:
df.select('OTHER_POL_CMTE_CONTRIB').describe().show()

+-------+----------------------+
|summary|OTHER_POL_CMTE_CONTRIB|
+-------+----------------------+
|  count|                 10476|
|   mean|    315356.70375429565|
| stddev|  1.8795125581625413E7|
|    min|                   0.0|
|    max|           1.9235003E9|
+-------+----------------------+



Candidate status (C = Challenger, O = Open, I = Incumbent) -

In [25]:
# TODO: clean up the rows where CAND_ICI is blank
df.groupby('CAND_ICI').count().orderBy('count', ascending = False).show()

+--------+-----+
|CAND_ICI|count|
+--------+-----+
|       C| 5887|
|       O| 2630|
|       I| 1747|
|        |  212|
+--------+-----+
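
For the blank status codes, one option is to map them to an explicit label before counting, so they are visible as a category rather than an empty string. This is a plain-Python sketch on made-up codes, and the `'Unknown'` label is a hypothetical choice.

```python
from collections import Counter

# Made-up status codes, including blank and whitespace-only values
statuses = ['C', 'O', 'I', '', 'C', ' ', 'I', 'C']

LABELS = {'C': 'Challenger', 'O': 'Open', 'I': 'Incumbent'}


def clean_status(code: str) -> str:
    """Map a raw CAND_ICI code to a label; blanks become 'Unknown'."""
    return LABELS.get(code.strip(), 'Unknown')


counts = Counter(clean_status(s) for s in statuses)
print(counts.most_common())
```

Whether blank rows should be relabeled or dropped depends on how the status is used later as a predictor.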



Candidate party affiliation (count) -

In [52]:
can_df = df.groupby('CAND_PTY_AFFILIATION').count().orderBy('count', ascending = False)
can_df.show()  # show() returns None, so call it separately from the assignment

+--------------------+-----+
|CAND_PTY_AFFILIATION|count|
+--------------------+-----+
|                 REP| 4896|
|                 DEM| 4633|
|                 IND|  349|
|                 LIB|  174|
|                 GRE|   77|
|                 OTH|   48|
|                 DFL|   45|
|                 NPA|   45|
|                 NNE|   43|
|                 UNK|   39|
|                  UN|   25|
|                 CON|   19|
|                   W|   12|
|                 NON|    7|
|                 IDP|    7|
|                 NOP|    5|
|                 UNI|    4|
|                 SEP|    4|
|                 AMP|    3|
|                 WFP|    3|
+--------------------+-----+
only showing top 20 rows
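
Most party codes in the table cover only a handful of candidates, so they may be too sparse to use as separate levels of a predictor. The sketch below collapses small parties into a single bucket, using the first six counts from the table. The cutoff `MIN_COUNT` and the `'OTHER'` bucket name are hypothetical.

```python
from collections import Counter

# Counts copied from the first six rows of the table above
party_counts = Counter({
    'REP': 4896, 'DEM': 4633, 'IND': 349, 'LIB': 174, 'GRE': 77, 'OTH': 48,
})

MIN_COUNT = 100  # hypothetical cutoff for keeping a party as its own level

collapsed = Counter()
for party, n in party_counts.items():
    collapsed[party if n >= MIN_COUNT else 'OTHER'] += n

print(collapsed.most_common())
```

The right cutoff is a modeling decision. This only shows the mechanics of the regrouping.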



Candidate state (count) - 

In [27]:
df.groupby('CAND_OFFICE_ST').count().orderBy('count', ascending = False).show()

+--------------+-----+
|CAND_OFFICE_ST|count|
+--------------+-----+
|            CA|  972|
|            TX|  781|
|            FL|  668|
|            NY|  552|
|            00|  513|
|            PA|  387|
|            NC|  342|
|            IL|  329|
|            GA|  319|
|            NJ|  281|
|            VA|  280|
|            OH|  280|
|            MI|  275|
|            AZ|  268|
|            TN|  226|
|            MD|  218|
|            IN|  217|
|            CO|  198|
|            WA|  191|
|            MN|  190|
+--------------+-----+
only showing top 20 rows



In [28]:
df.select('CAND_NAME', 
          'CAND_OFFICE_ST',
          'CAND_PTY_AFFILIATION', 
          'CAND_ICI', 
          'TTL_RECEIPTS',
          'CAND_CONTRIB',    
          'TTL_INDIV_CONTRIB',
          'POL_PTY_CONTRIB',
          'OTHER_POL_CMTE_CONTRIB').show(5)

+-------------------+--------------+--------------------+--------+------------+------------+-----------------+---------------+----------------------+
|          CAND_NAME|CAND_OFFICE_ST|CAND_PTY_AFFILIATION|CAND_ICI|TTL_RECEIPTS|CAND_CONTRIB|TTL_INDIV_CONTRIB|POL_PTY_CONTRIB|OTHER_POL_CMTE_CONTRIB|
+-------------------+--------------+--------------------+--------+------------+------------+-----------------+---------------+----------------------+
|     SHEIN, DIMITRI|            AK|                 DEM|       C|         0.0|         0.0|              0.0|            0.0|                   0.0|
|    YOUNG, DONALD E|            AK|                 REP|       I|  1362383.63|         0.0|        637025.31|            0.0|             584444.63|
|NELSON, THOMAS JOHN|            AK|                 REP|       C|         0.0|         0.0|              0.0|            0.0|                   0.0|
|      GALVIN, ALYSE|            AK|                 IND|       C|  2266364.63|     3394.63|        