# Week 4 Assignment
Aaron Palumbo || IS620

## Deliverables:

Centrality measures can be used to predict (positive or negative) outcomes for a node.

Your task in this week’s assignment is to identify an interesting set of network data that is available on the web (either through web scraping or web APIs) that could be used for analyzing and comparing centrality measures across nodes.  As an additional constraint, there should be at least one categorical variable available for each node (such as “Male” or “Female”; “Republican”, “Democrat,” or “Undecided”, etc.)

In addition to identifying your data source, you should create a high level plan that describes how you would load the data for analysis, and describe a hypothetical outcome that could be predicted from comparing degree centrality across categorical groups. 

For this week’s assignment, you are not required to actually load or analyze the data.  Please see also Project 1 below.

You may work in a small group on the assignment.   You should post your document to GitHub by end of day on Sunday September 20th.


## Proposed data source

Because we're heading into a presidential election, I thought it would be interesting to take a look at campaign finances. The Federal Election Commission has data on campaign finances here: http://www.fec.gov/data/DataCatalog.do?format=html.

Let's take a look at some of the information available:

From the website:

**Committees (Committee Master)**

The committee master file contains one record for each committee registered with the Federal Election Commission. This includes federal political action committees and party committees, campaign committees for presidential, house and senate candidates, as well as groups or organizations who are spending money for or against candidates for federal office.

The file contains basic information about the committees. The ID number the Commission assigned to the committee is first, along with the name of the committee, the sponsor, where appropriate, the treasurer's name and the committee's address. The file also includes information about what type of committee is being described, along with the candidate's ID number if it is a campaign committee. A comma delimited header file is available on this file's data dictionary page.

**Candidates (Candidate Master)**

The candidate master file contains one record for each candidate who has either registered with the Federal Election Commission or appeared on a ballot list prepared by a state elections office.

The file contains basic information about the candidate, including name, party, whether the candidate is an incumbent, challenger, or involved in an open seat, address, state and district in which the candidate is running and the year of the election for which the candidate is registered. (Note that incumbent/challenger status is dynamic in the current election cycle and there may be delays in identifying districts that will involve open seats.) The file also includes the ID number assigned to the candidate by the FEC which is used in tracking campaign finance information about the campaign, as well as the ID number of the candidate's principal campaign committee. A comma delimited header file is available on this file's data dictionary page.

**Linkages (Candidate-Committee Linkage)**

This file contains one record for each candidate to committee linkage.

**Itemized Records (Any Transaction from One Committee to Another)**

The itemized records (miscellaneous transactions) file contains all transactions (contributions, transfers, etc. among federal committees). It contains all data in the itemized committee contributions file plus PAC contributions to party committees, party transfers from state committee to state committee, and party transfers from national committee to state committee. Note that this file only includes federal transfers not soft money transactions. A comma delimited header file is available on this file's data dictionary page.

**Contributions to Candidates (Contributions to Candidates (and other expenditures) from Committees)**

The itemized committee contributions file contains each contribution or independent expenditure made by a PAC, party committee, candidate committee, or other federal committee to a candidate during the two-year election cycle. It includes the ID number of the contributing committee and the ID number of the recipient. You will need to use the committee master and candidate master files in conjunction with this file to set up a relational database to analyze these data. A comma delimited header file is available on this file's data dictionary page.

**Individual Contributions (Contributions by Individuals)**

The individual contributions file contains each contribution from an individual to a federal committee if the contribution was at least $200. It includes the ID number of the committee receiving the contribution, the name, city, state, zip code, and place of business of the contributor along with the date and amount of the contribution. NOTE: this file can be a very large file. A comma delimited header file is available on this file's data dictionary page.

**Operating Expenditures**

The Operating Expenditures file contains disbursements reported on FEC Form 3 Line 17, FEC Form 3P Line 23 and FEC Form 3X Lines 21(a)(i), 21(a)(ii) and 21(b). Operating expenditures disclosed by electronic filers are available for the current election cycle and for election cycles through 2004. Operating expenditures disclosed by paper filers, excluding Form 3P, are available for the current election cycle and for election cycles through 2006. Please note, operating expenditures disclosed by paper filers during the 2006 cycle are only available from reports filed in October 2005 and later. NOTE: this file can be a very large file. A comma delimited header file is available on this file's data dictionary page.

## Cursory look at the data

I know we aren't required to load or analyze this data, but let's just take a quick look to make sure we *could* work with it ...

In [242]:
from IPython import display
import ftplib
import os
import time
import pandas as pd

In [243]:
## Download from fec.gov
HOST = 'ftp.fec.gov'
DIR = '/FEC/2016/'
ZIP_FILE_LIST = ['cm16.zip', 'cn16.zip', 'ccl16.zip', 'oth16.zip', 
                 'pas216.zip', 'indiv16.zip', 'oppexp16.zip']
TXT_FILE_LIST = ['cm.txt', 'cn.txt', 'ccl.txt', 'itoth.txt',
                 'itpas2.txt', 'itcont.txt', 'oppexp.txt']
CURRENT_FILES = os.listdir('.')

DOWNLOAD_LIST = [z for (z, t) in zip(ZIP_FILE_LIST, TXT_FILE_LIST)
                 if z not in CURRENT_FILES and t not in CURRENT_FILES]

if len(DOWNLOAD_LIST) > 0:
    ftp = ftplib.FTP(HOST)
    ftp.login()
    ftp.cwd(DIR)
    for FILE in DOWNLOAD_LIST:
        # Close each local file as soon as its download finishes
        with open(FILE, 'wb') as zip_file:
            ftp.retrbinary('RETR ' + FILE, zip_file.write)
        with open("README.txt", "a") as myfile:
            myfile.write("\n" + FILE + " downloaded on " + time.strftime("%d/%m/%Y"))
    ftp.quit()
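
The read step that follows expects the extracted `.txt` files, so the downloaded archives have to be unpacked first. Here is a minimal sketch using the standard-library `zipfile` module. It assumes each archive contains the text file listed at the same position in `TXT_FILE_LIST`, and the helper name `extract_missing` is hypothetical:

```python
import os
import zipfile


def extract_missing(zip_names, txt_names, directory="."):
    """Extract each expected text file from its archive if it is not on disk yet.

    Assumes zip_names[i] contains a member named txt_names[i].
    Returns the list of text files that were extracted.
    """
    extracted = []
    for zip_name, txt_name in zip(zip_names, txt_names):
        txt_path = os.path.join(directory, txt_name)
        zip_path = os.path.join(directory, zip_name)
        if os.path.exists(txt_path) or not os.path.exists(zip_path):
            continue  # already extracted, or no archive to extract from
        with zipfile.ZipFile(zip_path) as archive:
            archive.extract(txt_name, path=directory)
        extracted.append(txt_name)
    return extracted
```

In this notebook it would be called as `extract_missing(ZIP_FILE_LIST, TXT_FILE_LIST)` once the download cell has finished.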

In [244]:
## Read into data frames
dfList = []
for f in TXT_FILE_LIST:
    dfList.append(pd.read_csv(f, sep='|', header=None))
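
The files come without a header row, which is why the columns end up addressed by position. Names could be attached at read time instead. The sketch below does this for one pipe-delimited record in the candidate master layout. The column names are what I believe the FEC data dictionary lists for that file, so treat them as an assumption and check them against the header file first:

```python
import io

import pandas as pd

# Assumed column names for the candidate master file (cn.txt); verify them
# against the header file on the FEC data dictionary page before relying on them.
CN_COLUMNS = [
    "CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION", "CAND_ELECTION_YR",
    "CAND_OFFICE_ST", "CAND_OFFICE", "CAND_OFFICE_DISTRICT", "CAND_ICI",
    "CAND_STATUS", "CAND_PCC", "CAND_ST1", "CAND_ST2", "CAND_CITY",
    "CAND_ST", "CAND_ZIP",
]

# One record in the candidate master layout, pipe-delimited like the real file.
sample = (
    "P60007242|FIORINA, CARLY|REP|2016|US|P|0|O|C|C00577312|"
    "1020 N FAIRFAX ST STE 200||ALEXANDRIA|VA|22314\n"
)

# dtype=str keeps IDs and zip codes as text instead of converting them to numbers.
candidates = pd.read_csv(io.StringIO(sample), sep="|", header=None,
                         names=CN_COLUMNS, dtype=str)
print(candidates.loc[0, ["CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION", "CAND_PCC"]])
```

Named columns would make later lookups easier to read than positional ones like `dfList[1][1]`.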

Let's do a quick search to see what we have:

In [245]:
candidateName = "CARLY"
cand = dfList[1].loc[dfList[1][1].str.contains(candidateName, na=False), :]
display.display(cand)
comm = dfList[0].loc[dfList[0][1].str.contains(candidateName, na=False), :]
display.display(comm)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
3305,P60007242,"FIORINA, CARLY",REP,2016,US,P,0,O,C,C00577312,1020 N FAIRFAX ST STE 200,,ALEXANDRIA,VA,22314
4257,S0CA00330,"FIORINA, CARLY",REP,2010,CA,S,0,C,P,C00469924,455 CAPITOL MALL SUITE 801,,SACRAMENTO,CA,95814


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
3502,C00417386,WOMBLE CARLYLE SANDRIDGE & RICE LLP POLITICAL ...,"FOSTER, HAILS",ONE WEST FOURTH STREET,,WINSTON SALEM,NC,27101,B,Q,,Q,,,
5019,C00469924,CARLY FOR CALIFORNIA INC,FRANK SADLER,C/O COVE STRATEGIES,1020 BERNARD STREET,ALEXANDRIA,VA,22314,P,S,REP,T,,,S0CA00330
10535,C00576934,CALI FOR CARLY FIORINA,DAVE POPE,5042 WILSHIRE BLVD #14961,,LOS ANGELES,CA,90036,U,N,,T,,,
10571,C00577312,CARLY FOR PRESIDENT,"SCHMUCKLER, JOSEPH R",1020 N FAIRFAX ST,STE 200,ALEXANDRIA,VA,22314,P,P,REP,Q,,NONE,P60007242


In [246]:
## Let's look up records for that last one, "CARLY FOR PRESIDENT"
## (dfList[6] is oppexp.txt, the operating expenditures file)
committee_id = comm.iloc[3, 0]
dfNum = 6
f = dfList[dfNum].loc[:, 0].values == committee_id
df = dfList[dfNum].loc[f, :]
df.iloc[0, :]

0                                         C00577312
1                                                 N
2                                              2015
3                                                Q2
4                                201507159000202958
5                                                23
6                                               F3P
7                                                SB
8     CHRIS WALTERS PROFESSIONAL VIDEO SERVICES INC
9                                         WHITEHALL
10                                               MD
11                                            21161
12                                       05/29/2015
13                                           1529.4
14                                              NaN
15          EQUIPMENT RENTAL/TRAVEL/PHOTOGRAPHY SVC
16                                              NaN
17                                              NaN
18                                              NaN
19          

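We see that column 13 contains the transaction amounts. Since `dfList[6]` was read from oppexp.txt, these are operating expenditures by the committee rather than contributions to it. What's the total?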

In [247]:
sum(df.iloc[:, 13])

726989.85999999999

There will be a bit more work to join the data in an appropriate manner, but it looks like we have what we need.
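
As a rough idea of what that join might look like: in the search output above, column 9 of the candidate table matches column 0 of the committee table (for example `C00577312`), which suggests it holds the candidate's principal campaign committee ID. The sketch below merges on that key, using a few values copied from the output. The descriptive column names are hypothetical:

```python
import pandas as pd

# A few columns copied from the search output above; integer labels are the
# positional column numbers used throughout this notebook.
cand = pd.DataFrame({
    0: ["P60007242", "S0CA00330"],            # candidate ID
    1: ["FIORINA, CARLY", "FIORINA, CARLY"],  # candidate name
    2: ["REP", "REP"],                        # party
    9: ["C00577312", "C00469924"],            # principal campaign committee ID
})
comm = pd.DataFrame({
    0: ["C00469924", "C00576934", "C00577312"],  # committee ID
    1: ["CARLY FOR CALIFORNIA INC", "CALI FOR CARLY FIORINA", "CARLY FOR PRESIDENT"],
})

# Rename to descriptive (hypothetical) names so the merge key is explicit.
cand_named = cand.rename(columns={0: "cand_id", 1: "cand_name", 2: "party", 9: "committee_id"})
comm_named = comm.rename(columns={0: "committee_id", 1: "committee_name"})

# A left join keeps every candidate, even one whose committee is missing.
linked = cand_named.merge(comm_named, on="committee_id", how="left")
print(linked[["cand_id", "party", "committee_name"]])
```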

Okay, so we have data that we can work with, now what?

## Centrality Measures

First, how do we load our data into a graph database?

Nodes: 
>We represent each person or committee as a node. Each node would carry an attribute identifying it as one of {candidate, contributor, committee}. We might also add other attributes such as state and zip code.

Edges:
>We then create an edge for each donation made. The edge would also be weighted by the amount of the donation. We might also add other attributes like date.
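
The node and edge scheme above could be sketched as follows. This assumes the `networkx` library, which this notebook does not otherwise use, and the donation records are made up for illustration. A `MultiDiGraph` is used so that repeated donations between the same pair stay separate edges:

```python
import networkx as nx

# Hypothetical records for illustration only: (donor, recipient, amount, date).
donations = [
    ("donor_a", "committee_x", 500.0, "2015-05-01"),
    ("donor_a", "committee_x", 250.0, "2015-06-15"),
    ("donor_b", "committee_x", 1000.0, "2015-06-20"),
    ("committee_x", "candidate_1", 1500.0, "2015-07-01"),
]
node_types = {
    "donor_a": "contributor",
    "donor_b": "contributor",
    "committee_x": "committee",
    "candidate_1": "candidate",
}

G = nx.MultiDiGraph()
for name, kind in node_types.items():
    G.add_node(name, node_type=kind)  # categorical attribute on every node
for donor, recipient, amount, date in donations:
    G.add_edge(donor, recipient, weight=amount, date=date)  # one edge per donation

print(G.number_of_nodes(), G.number_of_edges())
```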

Now what might we expect to see?

>If we measure centrality by degree, we might get a sense of which candidates are receiving the largest **number** of donations. If we include the weight of each donation, we can also get a sense of which candidate is raising the most money.
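
To make the difference between the two measures concrete, here is a small sketch, again assuming `networkx` and made-up donations. One candidate receives many small donations and the other a single large one, so the unweighted and weighted in-degree rank them differently:

```python
import networkx as nx

G = nx.MultiDiGraph()
# Hypothetical donations: (donor, recipient, amount).
for donor, recipient, amount in [
    ("d1", "cand_a", 100.0),
    ("d2", "cand_a", 100.0),
    ("d3", "cand_a", 100.0),
    ("d4", "cand_b", 2000.0),
]:
    G.add_edge(donor, recipient, weight=amount)

for cand in ["cand_a", "cand_b"]:
    n_donations = G.in_degree(cand)                    # number of incoming edges
    total_raised = G.in_degree(cand, weight="weight")  # sum of incoming edge weights
    print(cand, n_donations, total_raised)
```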

Potential Issues:
>This network involves heterogeneous nodes, which may cause issues. For one, I think it won't be possible to compare the centrality of candidates with that of donors or committees. It's possible there will be other unforeseen (at least by me) issues.
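
One possible way around the heterogeneity is to compare centrality only within a single node type, and then across a categorical attribute such as party. The sketch below uses a made-up per-node table. The column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical per-node table: node type, a categorical attribute, and in-degree.
nodes = pd.DataFrame({
    "node": ["cand_a", "cand_b", "cand_c", "comm_x", "donor_1"],
    "node_type": ["candidate", "candidate", "candidate", "committee", "contributor"],
    "party": ["REP", "DEM", "REP", None, None],
    "in_degree": [3, 5, 1, 40, 0],
})

# Restrict to one node type so that like is compared with like.
candidates_only = nodes[nodes["node_type"] == "candidate"]

# Summarize degree per categorical group.
by_party = candidates_only.groupby("party")["in_degree"].agg(["count", "mean"])
print(by_party)
```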
