# Summary
This notebook, in its third iteration, takes the individual JSON files we downloaded from GovTrack onto an external hard drive and converts them into one .csv file for each body of Congress in each session.

Each .csv file has one column per member of that house, plus columns for the date of the vote, whether or not the vote was on an amendment to a bill rather than on the bill itself, the fraction of that body required for the vote to succeed, the result of the vote, and the title of the bill, which is used to find the bill's JSON file later. Each row is a record of one vote that occurred in that legislative body during that session. Because we split by body, the 103rd Congress, for example, produces [103house.csv](https://github.com/dinopants174/DataScienceFinalProject/blob/gh-pages/data/congress_sessions_legislation/103house.csv) and [103senate.csv](https://github.com/dinopants174/DataScienceFinalProject/blob/gh-pages/data/congress_sessions_legislation/103senate.csv).
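
To make the layout concrete, here is a minimal sketch that reads a made-up two-row file in this shape. The legislator IDs (`A000001`, `B000002`), dates, and vote values are invented for illustration and are not taken from our data; only the five metadata column names come from the code in this notebook.

```python
import csv
import io

# Hypothetical sample in the layout described above; IDs and values are made up.
SAMPLE = """date,isAmendment,requires,result,title,A000001,B000002
1993-02-03T12:00:00-05:00,0,0.5,Passed,hr1,Yea,Nay
1993-02-04T12:00:00-05:00,1,0.5,Failed,hamdt2 to hr1,Not Voting,
"""

METADATA_COLUMNS = ["date", "isAmendment", "requires", "result", "title"]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Every column that is not vote metadata holds one legislator's votes.
legislators = [name for name in rows[0] if name not in METADATA_COLUMNS]

# An empty cell means no vote was recorded for that legislator on that row.
missing = [name for name in legislators if rows[1][name] == ""]
```

In this sample, `B000002` has an empty cell in the second row, which is how a padded np.nan appears once the dataframe is written out with the default settings.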

We record whether a vote is on an amendment to a bill or on the bill itself because later, when we try to find what legislation a given vote actually concerns, we cannot find the subjects of a particular amendment, only the subjects of the bill that the amendment amends. So what we do is essentially a double counting: we say that every amendment to a bill concerns the same subjects as that bill. This means that we are essentially saying bills with more amendments count as more votes on that bill's subjects.
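
The double counting can be sketched as follows. The `bill_subjects` lookup and its contents are hypothetical stand-ins for the bill JSON files, and `bill_of` is a helper name introduced only for this example; it relies on the "amendment to bill" title format that get_vote_data writes later in this notebook.

```python
from collections import Counter

# Hypothetical subject lookup standing in for the bill JSON files.
bill_subjects = {"hr1": ["Health"], "hr2": ["Taxation"]}

# (title, isAmendment) pairs in the format written to the .csv files.
votes = [("hr1", 0), ("hamdt5 to hr1", 1), ("hamdt6 to hr1", 1), ("hr2", 0)]

def bill_of(title, is_amendment):
    """Return the bill a vote concerns; amendments map to the bill they amend."""
    if is_amendment and " to " in title:
        return title.split(" to ", 1)[1]
    return title

# Count one vote per subject of the underlying bill, amendments included.
subject_counts = Counter()
for title, is_amendment in votes:
    for subject in bill_subjects.get(bill_of(title, is_amendment), []):
        subject_counts[subject] += 1
```

Here hr1 contributes three votes to its subject because two of its amendments were also voted on, while hr2 contributes one.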

Import the packages we need.

In [2]:
import json
import numpy as np
import pandas as pd
import os
import subprocess

We kept all of our individual JSON files on an external hard drive. For each session of Congress, we copy the files we need from the external hard drive to our computer, go through them to create the .csv files, and then delete them from the computer.

In order to run this, you will need our external hard drive and will have to update the pathtodata variable.

In [None]:
for i in range(89,88,-1):
    print i
    print "copying locally..."
    subprocess.check_call("./copydata.sh "+str(i), shell=True)

    print "making csv..."
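    pathtodata = '/media/anne/LACIE SHARE/DataScienceFinalProject/'  #location of the data on the external drive
    pathtodata = 'data/'  #this second assignment wins, so make_csv reads from the local data/ folder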
    make_csv(i, pathtodata)

    print "deleting "+str(i)+" locally..."
    subprocess.check_call("./rmdata.sh "+str(i), shell=True)
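
copydata.sh and rmdata.sh are not shown in this notebook. As a rough pure-Python sketch of the same copy, process, delete pattern, assuming each Congress lives in a folder named after its number, one could write something like the following. `with_local_copy`, its parameters, and `list_contents` are names made up for this example.

```python
import os
import shutil
import tempfile

def with_local_copy(congress_no, external_root, local_root, process):
    """Copy one Congress folder locally, run process on it, then delete the copy."""
    src = os.path.join(external_root, str(congress_no))
    dst = os.path.join(local_root, str(congress_no))
    shutil.copytree(src, dst)
    try:
        return process(congress_no, local_root)
    finally:
        # Remove the local copy even if processing raised an error.
        shutil.rmtree(dst)

# Demo on throwaway directories standing in for the external drive and the computer.
external_root = tempfile.mkdtemp()
local_root = tempfile.mkdtemp()
os.makedirs(os.path.join(external_root, "89", "votes"))

def list_contents(congress_no, root):
    # Stand-in for make_csv: just report what was copied.
    return sorted(os.listdir(os.path.join(root, str(congress_no))))

copied = with_local_copy(89, external_root, local_root, list_contents)
local_copy_left = os.path.exists(os.path.join(local_root, "89"))
```

The try/finally means the local copy is cleaned up even when the processing step fails partway through. Note that make_csv builds its paths by string concatenation, so the root passed to it would need a trailing slash.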

These are helper functions we will use to make the .csv file. convert_str_to_float is used to determine the fraction of the body needed for a vote to succeed. pad_dict is used as one of the final steps before making the .csv and accommodates the fact that not every legislator we find remains for the entire session. Some legislators are nominated for cabinet positions and some retire, so we have to pad their lists in the dictionary with np.nan values. We know that isAmendment will always be populated, so we use it as our model column to determine how many np.nans we need to add.

In [4]:
def convert_str_to_float(string):
    num, denom = string.split('/')
    return float(num)/float(denom)

def pad_dict(vote_dict):
    for key in vote_dict:
        if len(vote_dict[key]) != len(vote_dict['isAmendment']):
            vote_dict[key] = vote_dict[key] + [np.nan]*(len(vote_dict['isAmendment']) - len(vote_dict[key]))
    return vote_dict
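
As a quick illustration of what the two helpers do, the sketch below repeats them in a self-contained form and applies them to a made-up dictionary. It uses `float("nan")` in place of np.nan so that it runs without NumPy, and the legislator IDs are invented.

```python
import math

NAN = float("nan")  # stand-in for np.nan so the sketch needs no NumPy

def convert_str_to_float(string):
    num, denom = string.split('/')
    return float(num) / float(denom)

def pad_dict(vote_dict):
    target = len(vote_dict['isAmendment'])  # model column, always fully populated
    for key in vote_dict:
        # Lists that are already full get zero NaNs appended.
        vote_dict[key] = vote_dict[key] + [NAN] * (target - len(vote_dict[key]))
    return vote_dict

# Made-up data: legislator B000002 left office after the first of three votes.
example = {
    'isAmendment': [0, 1, 0],
    'A000001': ['Yea', 'Nay', 'Yea'],
    'B000002': ['Nay'],
}
padded = pad_dict(example)
trailing_is_nan = [math.isnan(v) for v in padded['B000002'][1:]]
two_thirds = convert_str_to_float("2/3")
```

After padding, every list has the same length as isAmendment, which is what pd.DataFrame needs when it is built from a dictionary of lists.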

Below is our function that takes in the session number and the path to the data for that session and outputs the house and senate .csv files for that session.

In [3]:
def make_csv(congress_no, pathtodata):
    #content_list now contains the years for that session of Congress (each session is two years)
    #in earlier sessions, Congress data is broken up arbitrarily into sections, content_list lets us treat both organizational systems the same
    content_list = []
    for content in os.listdir(pathtodata+str(congress_no)+ "/votes"):
        if content != '.DS_Store':
            content_list.append(content)
    
    #finds the vote numbers in the house versus in the senate, this lets us separate one session into the two .csv files
    #again, because we don't always start at years and because votes in different years may be numbered differently, we get the
    #number where the votes start and the number where the votes end, letting us know what vote files to look at
    def get_vote_nums(path):
        housevotes = []
        senvotes = []
        for folder in os.listdir(path):
            if folder[0] == 'h':
                housevotes.append(int(folder[1:]))
            elif folder[0] == 's':
                senvotes.append(int(folder[1:]))   
        housevotes.sort()
        senvotes.sort()
        if len(housevotes) == 0:
            housevotes = [0]
        if len(senvotes) == 0:
            senvotes = [0]
        return housevotes[0], housevotes[-1], senvotes[0], senvotes[-1]

    #we loop through the years within the session and call get_vote_data which populates a dictionary we can then convert into a dataframe
    house_dict = {}
    senate_dict = {}
    for year in content_list:
        house_count = 0
        senate_count = 0
        path = pathtodata+str(congress_no) + "/votes/" + year
        hstart, hend, sstart, send = get_vote_nums(path)
        print year, hstart, hend
        house = get_vote_data(path+"/h", hstart, hend, house_dict)
        senate = get_vote_data(path+"/s", sstart, send, senate_dict)

    house_dict = pad_dict(house_dict)
    senate_dict = pad_dict(senate_dict)
    
    #we now take each dictionary and convert it to our final .csv file
    df = pd.DataFrame(data=house_dict)
    df.to_csv("csv/" + str(congress_no)+"house.csv", index=False)

    df = pd.DataFrame(data=senate_dict)
    df.to_csv("csv/" + str(congress_no)+"senate.csv", index=False)
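
To see what the directory scan in get_vote_nums returns, here is a small self-contained sketch that repeats its logic on a throwaway folder. The folder names (`h1`, `h2`, `h7`, `s3`) are made up, and `vote_number_range` is a name chosen for this example. Unlike the function above, it also skips names whose suffix is not a number, which is an added guard rather than something the notebook does.

```python
import os
import tempfile

def vote_number_range(path):
    """Return (house_first, house_last, senate_first, senate_last) for one year folder."""
    numbers = {'h': [], 's': []}
    for folder in os.listdir(path):
        # Keep only names like h12 or s7; anything else (e.g. .DS_Store) is ignored.
        if folder[0] in numbers and folder[1:].isdigit():
            numbers[folder[0]].append(int(folder[1:]))
    house = sorted(numbers['h']) or [0]
    senate = sorted(numbers['s']) or [0]
    return house[0], house[-1], senate[0], senate[-1]

# Throwaway year folder with made-up vote folders.
year_dir = tempfile.mkdtemp()
for name in ["h1", "h2", "h7", "s3", ".DS_Store"]:
    os.mkdir(os.path.join(year_dir, name))

ranges = vote_number_range(year_dir)
```

Because only the first and last numbers are returned, get_vote_data has to check that each data.json in between actually exists, which is what its os.path.exists test is for.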

Below is our function that takes the path to the data, the vote numbers at which the vote data starts and ends, and a dictionary (empty on the first call, then reused for each later year of the session). It returns that dictionary filled in, so that we can convert it to a dataframe and then to a .csv file with the layout described in the summary at the top of this notebook.

In [11]:
votes = ['Yea', 'Nay', 'Present', 'Not Voting']

def get_vote_data(path, start, end, vote_dict):
    for i in range(start, end+1):  #loop through all of the votes
        votepath = path + str(i) + "/data.json"
        if os.path.exists(votepath):  #we have the vote data and we can use it in the dictionary
            with open(votepath, 'r') as f:  #the with block closes the file after reading
                vote = json.loads(f.read(), 'utf-8') #load the vote data
            if 'Aye' in vote['votes']:  #vote['votes'] is grouped by the response given by the legislator, replace 'Aye' with 'Yea'
                vote['votes']['Yea'] = vote['votes']['Aye']
                del vote['votes']['Aye']
            if 'No' in vote['votes']: #replace 'No' with 'Nay', just standardizing vote data
                vote['votes']['Nay'] = vote['votes']['No']
                del vote['votes']['No']
            if not any(vote_dict): #if the vote_dict is empty, then we have to create all the keys and lists for their values
                if vote['category'] == 'passage' or vote['category'] == 'amendment' or vote['category'] == 'unknown': #we only care about votes on bills and amendments, not nominations or confirmations or resolutions
                    vote_dict['date'] = [vote['date']] #grab the date of the vote
                    if vote['category'] == 'passage' or vote['category'] == 'unknown': #handles bills and unknowns
                        if 'bill' not in vote or vote['bill'] is None:
                            vote_dict['title'] = [np.nan] #we do not know the title so we store it as a nan
                        else:
                            vote_dict['title'] = [vote['bill']['type'] + str(vote['bill']['number'])]
                        vote_dict['isAmendment'] = [0] #the vote is not on an amendment
                    elif vote['category'] == 'amendment' and 'amendment' in vote:  #handles amendments
                        if 'bill' not in vote or vote['bill'] is None:
                            vote_dict['title'] = [np.nan] #if we don't know the bill the amendment amends, we have no title
                        else:
                            #we record the title as the amendment number to a bill, which we can use to find the amendment data later
                            vote_dict['title'] = [vote['amendment']['type'][0] + "amdt" + str(vote['amendment']['number']) + " to " + vote['bill']['type'] + str(vote['bill']['number'])]
                        vote_dict['isAmendment'] = [1]
                    else:
                        #handles the case where we have a vote['category'] == 'amendment' but there is no amendment field provided in the amendment's .json file
                        if 'bill' in vote and vote['bill'] is not None: #the bill field can be present but empty
                            vote_dict['title'] = [vote['bill']['type'] + str(vote['bill']['number'])] #we store the bill title anyway and use isAmendment to know it's an amendment
                            vote_dict['isAmendment'] = [1]
                        else:
                            vote_dict['title'] = [np.nan] #we don't know the title
                            vote_dict['isAmendment'] = [1]
                    if 'result_text' not in vote: #handles the fact that the result of the vote is stored as different keys in vote's .json file
                        vote_dict['result'] = [vote['result']]
                    else:
                        vote_dict['result'] = [vote['result_text']]
                    if vote['requires'] == 'unknown': #we don't know the fraction needed for the vote to succeed
                        vote_dict['requires'] = [np.nan]
                    elif vote['requires'] == 'QUORUM':
                        vote_dict['requires'] = [convert_str_to_float("1/2")] # assuming quorum is simple 1/2 majority
                    else:
                        vote_dict['requires'] = [convert_str_to_float(vote['requires'])]
                    #stores how each person voted, using their id
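                    for vote_type in votes:
                        if vote_type in vote['votes']:
                            for person in vote['votes'][vote_type]:
                                if person == 'VP': #the VP is stored as a plain string with no id, as in the branch below
                                    print "The VP voted to break a tie in this vote"
                                else:
                                    vote_dict[person['id']] = [vote_type]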
            #now the dictionary exists so we'll just be appending, this next section is the same as above
            else:
                if vote['category'] == 'passage' or vote['category'] == 'amendment' or vote['category'] == 'unknown': 
                    vote_dict['date'].append(vote['date'])
                    if vote['category'] == 'passage' or vote['category'] == 'unknown':
                        if 'bill' not in vote or vote['bill'] is None:
                            vote_dict['title'].append(np.nan)
                        else:
                            vote_dict['title'].append(vote['bill']['type'] + str(vote['bill']['number']))
                        vote_dict['isAmendment'].append(0)
                    elif vote['category'] == 'amendment' and 'amendment' in vote:
                        if 'bill' not in vote or vote['bill'] is None:
                            vote_dict['title'].append(np.nan)
                        else:
                            vote_dict['title'].append(vote['amendment']['type'][0] + "amdt" + str(vote['amendment']['number']) + " to " + vote['bill']['type'] + str(vote['bill']['number']))
                        vote_dict['isAmendment'].append(1)
                    else:
                        if 'bill' in vote and vote['bill'] is not None: #the bill field can be present but empty
                            vote_dict['title'].append(vote['bill']['type'] + str(vote['bill']['number']))
                            vote_dict['isAmendment'].append(1)
                        else:
                            vote_dict['title'].append(np.nan)
                            vote_dict['isAmendment'].append(1)
                    if 'result_text' not in vote:
                        vote_dict['result'].append(vote['result'])
                    else:
                        vote_dict['result'].append(vote['result_text'])

                    if vote['requires'] == 'unknown':
                        vote_dict['requires'].append(np.nan)
                    elif vote['requires'] == 'QUORUM':
                        vote_dict['requires'].append(convert_str_to_float("1/2")) # assuming quorum is simple 1/2 majority
                    else:
                        vote_dict['requires'].append(convert_str_to_float(vote['requires']))
                    for vote_type in votes:
                        if vote_type in vote['votes']:
                            for person in vote['votes'][vote_type]:
                                if person == 'VP':
                                    print "The VP voted to break a tie in this vote"
                                elif person['id'] in vote_dict:
                                    vote_dict[person['id']].append(vote_type)
                                else:
                                    #if we see a new person, we need to pad their list with the np.nans to show the time they weren't a legislator
                                    vote_dict[person['id']] = [np.nan]*(len(vote_dict['isAmendment'])-1) + [vote_type] 
    return vote_dict
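
The two branches above repeat the same renaming and title logic. As a sketch of how that logic could be factored out (this is not how the notebook was run), the hypothetical helpers `normalize_votes` and `title_and_flag` below reproduce the rules on made-up vote records that contain only the fields read here. `None` stands in for np.nan, and a missing amendment field is treated the same as an empty one, which is an assumption of this sketch.

```python
def normalize_votes(votes_by_response):
    """Rename 'Aye' to 'Yea' and 'No' to 'Nay' so every vote uses the same labels."""
    renamed = dict(votes_by_response)
    for old, new in (('Aye', 'Yea'), ('No', 'Nay')):
        if old in renamed:
            renamed[new] = renamed.pop(old)
    return renamed

def title_and_flag(vote):
    """Return (title, isAmendment) following the rules used in get_vote_data."""
    bill = vote.get('bill')
    bill_title = bill['type'] + str(bill['number']) if bill else None
    if vote['category'] != 'amendment':
        return bill_title, 0
    amendment = vote.get('amendment')
    if amendment and bill_title:
        prefix = amendment['type'][0] + "amdt" + str(amendment['number'])
        return prefix + " to " + bill_title, 1
    # Amendment vote with a missing piece: fall back to the bill title only
    # when the amendment field itself is absent, otherwise the title is unknown.
    return (None if amendment else bill_title), 1

# Made-up records containing only the fields read here.
bill_vote = {'category': 'passage', 'bill': {'type': 'hr', 'number': 3}}
amdt_vote = {'category': 'amendment', 'bill': {'type': 'hr', 'number': 3},
             'amendment': {'type': 'h', 'number': 12}}
orphan_vote = {'category': 'amendment', 'bill': None}

titles = [title_and_flag(v) for v in (bill_vote, amdt_vote, orphan_vote)]
labels = sorted(normalize_votes({'Aye': [], 'No': [], 'Present': []}))
```

With helpers like these, the first-vote and later-vote branches would differ only in whether they create a list or append to it.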