# Summary
This notebook extends [data_scraping_iter3.ipynb](https://github.com/dinopants174/DataScienceFinalProject/blob/gh-pages/data_scraping_iter3.ipynb) and requires the .csv files that notebook creates. Here we find every session of Congress for which GovTrack provides bill data, so that we know what specifically the legislators in those sessions voted on. We then use that bill data to add several fields to our .csv files.

First, we build a matrix whose columns are a master list of all subjects addressed in that session and whose rows correspond to votes: each entry is 1 if the bill being voted on relates to that subject and 0 if not. We also record the main committee the bill was in, its short title, its official title, and the name of its sponsor.
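As a rough illustration of the subject matrix described above, the sketch below builds the 1/0 columns from a small made-up list of votes. The names `vote_subjects`, `all_subjects`, and `indicator_columns` and the subject strings are hypothetical and are not taken from the GovTrack data.

```python
# Hypothetical subject lists for three votes; None marks a vote with no bill data.
vote_subjects = [
    ["Taxation", "Health"],
    None,
    ["Health"],
]

# Master list of every subject seen across the votes, sorted for a stable order.
all_subjects = sorted({s for subjects in vote_subjects if subjects for s in subjects})

# One column per subject: 1 if the vote's bill lists the subject, 0 otherwise.
indicator_columns = {
    subject: [1 if subjects and subject in subjects else 0 for subjects in vote_subjects]
    for subject in all_subjects
}

for subject in all_subjects:
    print("%s: %s" % (subject, indicator_columns[subject]))
```

Each list has one entry per vote, so the columns line up row for row with the votes they describe.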

Because the bill data only gives us the sponsor's name, a final step is needed to replace the name with that legislator's id. That step is in [reference_id.ipynb](https://github.com/dinopants174/DataScienceFinalProject/blob/gh-pages/reference_id.ipynb).
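The helper later in this notebook stores the sponsor as a `First#Last#State` string so that it can be matched against the legislators data. A minimal sketch of that formatting, assuming the sponsor name arrives as "Last, First", might look like the following. The function name `sponsor_key` and the example record are made up for illustration.

```python
def sponsor_key(sponsor):
    """Build a 'First#Last#State' key from a sponsor record, or None if unusable."""
    if not sponsor or "name" not in sponsor or "state" not in sponsor:
        return None
    parts = sponsor["name"].split(", ")
    if len(parts) < 2:
        return None  # name is not in the assumed "Last, First" form
    # parts[0] is the last name and parts[1] the first name.
    return parts[1] + "#" + parts[0] + "#" + sponsor["state"]

example = {"name": "Doe, Jane", "state": "MA"}  # made-up sponsor record
print(sponsor_key(example))
```

Returning None for records that cannot be formatted keeps the failure visible instead of raising partway through a long run.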

Imports the packages we need.

In [9]:
import json
import numpy as np
import pandas as pd
import os

Finds all of the sessions of Congress where we have bill data to begin with.

In [14]:
contains_bills = []

for i in range(1, 114):
    path = '/media/anne/LACIE SHARE/DataScienceFinalProject/' + str(i) + '/bills'  #this path needs to be changed
    if os.path.exists(path):
        contains_bills.append(i)
        
print(contains_bills)

[6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113]
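The same directory check can be wrapped in a function that takes the data root as a parameter, which avoids editing a hard-coded path. This is only a sketch: `congresses_with_bills` and `data_root` are hypothetical names, and the demonstration runs on a temporary directory tree rather than the real data.

```python
import os
import tempfile

def congresses_with_bills(data_root, first=1, last=113):
    """Return the Congress numbers whose folder contains a 'bills' subdirectory."""
    return [
        n for n in range(first, last + 1)
        if os.path.isdir(os.path.join(data_root, str(n), "bills"))
    ]

# Demonstration on a throwaway directory tree rather than the real data.
demo_root = tempfile.mkdtemp()
for n in (6, 7, 82):
    os.makedirs(os.path.join(demo_root, str(n), "bills"))
os.makedirs(os.path.join(demo_root, "12"))  # a Congress folder without bill data

found = congresses_with_bills(demo_root)
print(found)
```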


These are the helper functions that add the fields outlined in the summary. find_subject_in_bills takes a subject and returns a list with one entry per vote: 1 if the bill being voted on relates to that subject and 0 otherwise. get_subjects reads a bill's .json file and collects the data we need for the remaining fields.

In [15]:
def find_subject_in_bills(subject):
    res = []
    for i in range(0, len(tmp)): #tmp holds a list of objects for our bill data where each element corresponds to a vote
        #a vote with no bill data, or a bill with no subject list (nan), cannot relate to the subject
        if type(tmp[i]) == dict and type(tmp[i]['subjects']) == list and subject in tmp[i]['subjects']:
            res.append(1) #if that bill relates to that subject, then so does that vote
        else:
            res.append(0)
    return res #each subject now has a list of 1s and 0s that show us the subjects each vote concerns

def get_subjects(isAmendment, title):
    if type(title) != str: #this means we weren't able to find the title so we won't be able to look up the bill
        tmp.append(np.nan)
    else:
        if isAmendment: #for an amendment, the bill identifier is the third whitespace-separated token of the title
            if len(title.split()) > 2: #need at least three tokens before indexing [2]
                bill = title.split()[2]
            else:
                bill = title
        else:
            bill = title
        #finds the path to the bill data
        path = "/media/anne/LACIE SHARE/DataScienceFinalProject/" + str(congress_no) + "/bills/" + ''.join([i for i in bill if not i.isdigit()]) + "/" + bill + "/data.json"
        if os.path.exists(path): #checks if bill .json exists
            with open(path, 'r') as f: #the with block closes the file once it has been read
                vote = json.load(f)
            obj = {}
            if 'subjects' in vote:  #gets all the subjects and stores them as a list
                obj['subjects'] = vote['subjects']
            else:
                obj['subjects'] = np.nan
            if vote.get('committees'): #the first committee should be the one the bill originated in; .get avoids a KeyError if the field is missing
                obj['committee'] = vote['committees'][0]['committee']
            else:
                obj['committee'] = np.nan
            if 'short_title' in vote: #gets the short title
                obj['billTitle'] = vote['short_title']
            else:
                obj['billTitle'] = np.nan
            if 'official_title' in vote: #gets the official title
                obj['officialTitle'] = vote['official_title']
            else:
                obj['officialTitle'] = np.nan
            if 'sponsor' in vote and vote['sponsor']: #we get the sponsor name and format it to match our legislators data for reference_id.ipynb
                name = vote['sponsor']['name'].split(', ')
                filtered_name = name[1] + "#" + name[0] + "#" + vote['sponsor']['state']
                obj['sponsor'] = filtered_name
            else:
                obj['sponsor'] = np.nan
            tmp.append(obj) #tmp holds a list of objects for our bill data and each element corresponds to a vote
        else:
            tmp.append(np.nan)
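The path lookup inside get_subjects relies on the bill type being the bill identifier with its digits removed. A small sketch of that rule on its own, assuming identifiers of the form `hr1234`, is shown below. The function name `bill_json_path`, the `data` root, and the example identifier are hypothetical.

```python
import os

def bill_json_path(data_root, congress_no, bill_id):
    """Return the expected data.json path for a bill id such as 'hr1234'."""
    # The bill type is the id with its digits removed, e.g. 'hr1234' -> 'hr'.
    bill_type = "".join(ch for ch in bill_id if not ch.isdigit())
    return os.path.join(data_root, str(congress_no), "bills", bill_type, bill_id, "data.json")

path = bill_json_path("data", 113, "hr1234")
print(path)
```

Building the path with os.path.join rather than string concatenation keeps the separator handling in one place.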

Using the helper functions above, we loop through all the sessions of Congress and write new .csv files to a cleanedcsv directory. The new .csv files contain everything in the previous .csv files plus the new fields outlined in the summary at the top of this notebook.

In [22]:
for congress_no in range(1,114): #we loop through all sessions
    print(congress_no)
    for body in ['house', 'senate']:
        tmp = [] #holds a list of objects for each vote and the objects hold our bill data
        path = 'csv/' + str(congress_no) + body + '.csv'   #this is where we stored our first iteration of .csvs
        path2 = 'cleanedcsv/' + str(congress_no) + body + '.csv' #this is where we will store our second iteration of .csvs
        df = pd.read_csv(path)
        subjects_dict = {'committee': [], 'billTitle': [], 'sponsor': [], 'officialTitle': []} #initialize dict for each body
        for i in range(0, len(df.title)): #loop through all of the titles and get all of the data from the .json files the titles lead to
            get_subjects(df.isAmendment[i], df.title[i])
            
        count_subjects_dict = {}

        for i in range(0, len(tmp)): #loop through all of the votes
            if type(tmp[i]) == dict: #this means we found a bill .json and have some of its data
                subjects_dict['committee'].append(str(tmp[i]['committee'])) #append the data if it exists or if it's a nan
                subjects_dict['billTitle'].append(str(tmp[i]['billTitle']))
                official = tmp[i]['officialTitle'] #may be a nan, which has no encode method
                subjects_dict['officialTitle'].append(official.encode('utf-8') if hasattr(official, 'encode') else official)
                subjects_dict['sponsor'].append(str(tmp[i]['sponsor']))
                if type(tmp[i]['subjects']) == list:
                    #we loop through all of the subjects and create a master dictionary for each body in each session of all subjects addressed
                    for j in range(0, len(tmp[i]['subjects'])):
                        if tmp[i]['subjects'][j] in count_subjects_dict:
                            count_subjects_dict[tmp[i]['subjects'][j]] += 1
                        else:
                            count_subjects_dict[tmp[i]['subjects'][j]] = 1
            else: #we weren't able to find bill .json so we only have nans
                subjects_dict['committee'].append(np.nan)
                subjects_dict['billTitle'].append(np.nan)
                subjects_dict['officialTitle'].append(np.nan)
                subjects_dict['sponsor'].append(np.nan)

        for key in count_subjects_dict: #loop through all the subjects now and set up the array of 1s and 0s
            subjects_dict[key] = find_subject_in_bills(key)
        
        subjects_df = pd.DataFrame(data=subjects_dict) #basically appends these new fields to our existing .csv file in new path

        final = pd.concat([df, subjects_df], axis=1)
        final.to_csv(path2, index=False)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
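The subject tally built by hand in the loop above could also be written with `collections.Counter` from the standard library. The sketch below assumes records shaped like the entries of tmp (a dict when bill data was found, something else otherwise). The `records` list is made-up data and `subject_counts` is a hypothetical name.

```python
from collections import Counter

# Hypothetical per-vote records shaped like the entries in tmp:
# a dict when bill data was found, a non-dict placeholder otherwise.
records = [
    {"subjects": ["Taxation", "Health"]},
    None,
    {"subjects": ["Health"]},
    {"subjects": float("nan")},  # bill found, but it listed no subjects
]

subject_counts = Counter()
for record in records:
    # Only count subjects when the record is a dict holding a real list.
    if isinstance(record, dict) and isinstance(record["subjects"], list):
        subject_counts.update(record["subjects"])

print(subject_counts.most_common())
```

The keys of the counter play the same role as the keys of count_subjects_dict: they form the master list of subjects for that body and session.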
