# Representation Learning at the US Congress




This notebook reruns the awesome Senator Representations notebook by Nathaniel Tucker:

https://github.com/knathanieltucker/tf-keras-tutorial/blob/master/SenatorRepresentations.ipynb

What follows is his work, with a few extra comments I added while working through the material:

## A bit more interesting

Nathan wants to briefly talk about two other ways to use representation learning, in addition to the first:

- Representations as a byproduct of prediction
- Hand crafted representations
- Prediction as a byproduct of representations

Nathan covered the first in the WordRepresentations notebook. The second we won't cover here, but Nathan took inspiration from this homework from a class he took a long time ago: https://github.com/cs109/content/blob/master/HW5_solutions.ipynb

In this homework we used graph algorithms to hand-craft a representation of senators, and lo and behold, they turned out to be quite partisan. So if you want a great example of tactic 2, check that out.

The third tactic is what we will do below. We will create a prediction problem that we don't really care about, but one that, when solved through representations, yields quite useful ones. The classic example of this is word2vec.
We will walk through many more of the details this time because we are on fresh ground.

In [1]:
import requests

In [2]:
from pattern import web

In [3]:
import json

In [4]:
def get_senate_vote(congress, year, vote):
    url = 'http://www.govtrack.us/data/congress/{}/votes/{}/s{}/data.json'.format(congress, year, vote)
    page = requests.get(url).text
    return json.loads(page)

In [5]:
def get_all_votes(congress, year):
    page = requests.get('https://www.govtrack.us/data/congress/{}/votes/{}/'.format(congress, year)).text
    dom = web.Element(page)
    votes = [a.attr['href'] for a in dom.by_tag('a') 
             if a.attr.get('href', '').startswith('s')]
    n_votes = len(votes)
    votes_on_bills = []
    for i in range(1, n_votes + 1):
        vote = get_senate_vote(congress, year, i)
        if 'bill' in vote:
            votes_on_bills.append(vote)
    return votes_on_bills

The two functions above scrape GovTrack, a website that keeps track of how votes go in the US Congress. Nathan already scraped it, but if you are curious how he got the data, check out the cells above.
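
The link-filtering step inside `get_all_votes` can also be sketched with only the standard library, in case `pattern` is not installed. This is a minimal sketch in Python 3 syntax, not the method the original notebook uses: the class name `SenateVoteLinks`, the helper `senate_vote_links`, and the sample listing are all made up for illustration, and the real directory page may be laid out differently.

```python
from html.parser import HTMLParser  # Python 3 standard library


class SenateVoteLinks(HTMLParser):
    """Collect href values that start with 's' (the Senate vote folders)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value can be None
        href = dict(attrs).get('href') or ''
        if href.startswith('s'):
            self.links.append(href)


def senate_vote_links(html):
    """Return every Senate-style link found in an HTML directory listing."""
    parser = SenateVoteLinks()
    parser.feed(html)
    return parser.links


# Hypothetical directory listing, invented for this example.
sample_listing = '''
<a href="../">../</a>
<a href="h1/">h1/</a>
<a href="s1/">s1/</a>
<a href="s2/">s2/</a>
'''

print(senate_vote_links(sample_listing))
```

The length of the returned list plays the role of `n_votes` in `get_all_votes`.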

In [6]:
# vote_data_113_2013 = get_all_votes(113, 2013)

In [7]:
# vote_data_113_2014 = get_all_votes(113, 2014)

In [8]:
# vote_data_114_2015 = get_all_votes(114, 2015)

In [9]:
# vote_data_114_2016 = get_all_votes(114, 2016)

In [10]:
# all_vote_data = vote_data_113_2013 + \
#                 vote_data_113_2014 + \
#                 vote_data_114_2015 + \
#                 vote_data_114_2016

In [11]:
# with open('data/congress/USA/all_vote_data.json', 'w') as outfile:
#     json.dump(all_vote_data, outfile)

In [12]:
all_vote_data = json.load(open('data/congress/USA/all_vote_data.json'))

In [13]:
all_vote_data[0]

{u'bill': {u'congress': 113,
  u'number': 15,
  u'title': u'A resolution to improve procedures for the consideration of legislation and nominations in the Senate.',
  u'type': u'sres'},
 u'category': u'passage',
 u'chamber': u's',
 u'congress': 113,
 u'date': u'2013-01-24T19:54:00-05:00',
 u'number': 1,
 u'question': u'On the Resolution S.Res. 15',
 u'record_modified': u'2013-01-24T20:38:00-05:00',
 u'requires': u'3/5',
 u'result': u'Resolution Agreed to',
 u'result_text': u'Resolution Agreed to (78-16, 3/5 majority required)',
 u'session': u'2013',
 u'source_url': u'http://www.senate.gov/legislative/LIS/roll_call_votes/vote1131/vote_113_1_00001.xml',
 u'subject': u'S. Res. 15',
 u'type': u'On the Resolution',
 u'updated_at': u'2016-12-25T10:01:28-05:00',
 u'vote_id': u's1-113.2013',
 u'votes': {u'Nay': [{u'display_name': u'Crapo (R-ID)',
    u'first_name': u'Mike',
    u'id': u'S266',
    u'last_name': u'Crapo',
    u'party': u'R',
    u'state': u'ID'},
   {u'display_name': u'Cruz (R-

In [14]:
len(all_vote_data)

744

You can see that we have the bill and all the votes it received from the various senators. We want one more piece of information: who sponsored the bill?
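
To get a feel for the structure of one vote record, here is a small sketch that counts how many senators fall under each option. The record `toy_vote` is a hand-made miniature that only mimics the keys visible in the output above (`bill`, `votes`, `Nay`, `Yea`); the helper name `tally` is an assumption of this example, not part of the original notebook.

```python
def tally(vote):
    """Return {option: number of senators} for one roll-call record."""
    return {option: len(members) for option, members in vote['votes'].items()}


# Toy record shaped like all_vote_data[0], with far fewer senators.
toy_vote = {
    'bill': {'congress': 113, 'number': 15, 'type': 'sres'},
    'votes': {
        'Nay': [
            {'last_name': 'Crapo', 'state': 'ID'},
            {'last_name': 'Cruz', 'state': 'TX'},
        ],
        'Yea': [
            {'last_name': 'Reid', 'state': 'NV'},
        ],
    },
}

print(tally(toy_vote))
```

On the real data, `tally(all_vote_data[0])` should presumably agree with the counts in `result_text` for the Yea and Nay keys.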

In [15]:
def get_senate_bill(congress, bill_type, bill_number):
    url = 'http://www.govtrack.us/data/congress/{}/bills/{}/{}{}/data.json'.format(congress, bill_type, bill_type, bill_number)
    page = requests.get(url).text
    return json.loads(page)

In [16]:
def get_all_bills(vote_data):
    bill_data = []
    for vote in vote_data:
        if 'bill' in vote:
            bill_type = vote['bill']['type']
            bill_number = vote['bill']['number']
            congress = vote['bill']['congress']
            bill = get_senate_bill(congress, bill_type, bill_number)
            bill['id'] = '{}{}'.format(bill_type, bill_number)
            bill_data.append(bill)
    return bill_data

In [17]:
# bill_data = get_all_bills(all_vote_data)

In [18]:
# with open('data/congress/USA/bill_data.json', 'w') as outfile:
#     json.dump(bill_data, outfile)

In [19]:
bill_data = json.load(open('data/congress/USA/bill_data.json'))

In [20]:
bill_data[0]

{u'actions': [{u'acted_at': u'2013-01-24',
   u'references': [{u'reference': u'CR S293',
     u'type': u'text of measure as introduced'}],
   u'text': u'Submitted in the Senate.',
   u'type': u'action'},
  {u'acted_at': u'2013-01-24',
   u'references': [{u'reference': u'CR S270-274', u'type': u'consideration'},
    {u'reference': u'CR S293', u'type': u'text of measure as introduced'}],
   u'text': u'Measure laid before Senate by unanimous consent.',
   u'type': u'action'},
  {u'acted_at': u'2013-01-24',
   u'how': u'roll',
   u'references': [{u'reference': u'CR S272', u'type': u'text'}],
   u'result': u'pass',
   u'roll': u'1',
   u'status': u'PASSED:SIMPLERES',
   u'text': u'Resolution agreed to in Senate, under the order of 1/24/2012, having achieved 60 votes in the affirmative, without amendment by Yea-Nay Vote. 78 - 16. Record Vote Number: 1.',
   u'type': u'vote',
   u'vote_type': u'vote',
   u'where': u's'}],
 u'amendments': [{u'amendment_id': u'samdt3-113',
   u'amendment_type':

In [21]:
len(bill_data)

744

Again we get a ton of information. But we are just interested in who sponsored it.

We will then map each senator to an ID, just like we did with words:

In [22]:
def get_senators(vote_data):
    senators = []
    for vote in vote_data:
        for sen in vote['votes']['Nay']:
            senators.append(sen['last_name'] + ', ' + sen['state'])
        for sen in vote['votes']['Yea']:
            senators.append(sen['last_name'] + ', ' + sen['state'])
    return senators

In [23]:
senators = get_senators(all_vote_data)

In [24]:
senators[2]

u'Flake, AZ'

In [25]:
for v, k in enumerate(set(senators)):
    print v, k

0 Kirk, IL
1 Blumenthal, CT
2 Booker, NJ
3 Walsh, MT
4 Vitter, LA
5 Boxer, CA
6 Barrasso, WY
7 Johnson, WI
8 Udall, CO
9 Cardin, MD
10 Sanders, VT
11 Cornyn, TX
12 Hatch, UT
13 Bennet, CO
14 Klobuchar, MN
15 Peters, MI
16 Toomey, PA
17 Cantwell, WA
18 Nelson, FL
19 Hirono, HI
20 Tester, MT
21 Cochran, MS
22 Reid, NV
23 Gillibrand, NY
24 Landrieu, LA
25 Coons, DE
26 Franken, MN
27 Hagan, NC
28 Capito, WV
29 Wicker, MS
30 Carper, DE
31 Merkley, OR
32 Murray, WA
33 Whitehouse, RI
34 Cruz, TX
35 Ayotte, NH
36 Feinstein, CA
37 Inhofe, OK
38 Risch, ID
39 Graham, SC
40 Chiesa, NJ
41 Johnson, SD
42 Burr, NC
43 Lautenberg, NJ
44 Moran, KS
45 McCain, AZ
46 Donnelly, IN
47 Warren, MA
48 Boozman, AR
49 Cotton, AR
50 Coburn, OK
51 Daines, MT
52 Schumer, NY
53 Lee, UT
54 Levin, MI
55 Gardner, CO
56 Heller, NV
57 Markey, MA
58 Murphy, CT
59 Durbin, IL
60 McCaskill, MO
61 McConnell, KY
62 Reed, RI
63 Mikulski, MD
64 King, ME
65 Thune, SD
66 Paul, KY
67 Flake, AZ
68 Alexander, TN
69 Coats, IN
70 Fische

In [26]:
# IDs 0 and 1 are reserved: 0 for padding, 1 for sponsors who are not senators
senator_to_id = { k: v + 2 for v, k in enumerate(set(senators)) }

In [27]:
senator_to_id

{u'Alexander, TN': 70,
 u'Ayotte, NH': 37,
 u'Baldwin, WI': 117,
 u'Barrasso, WY': 8,
 u'Baucus, MT': 103,
 u'Begich, AK': 118,
 u'Bennet, CO': 15,
 u'Blumenthal, CT': 3,
 u'Blunt, MO': 84,
 u'Booker, NJ': 4,
 u'Boozman, AR': 50,
 u'Boxer, CA': 7,
 u'Brown, OH': 105,
 u'Burr, NC': 44,
 u'Cantwell, WA': 19,
 u'Capito, WV': 30,
 u'Cardin, MD': 11,
 u'Carper, DE': 32,
 u'Casey, PA': 78,
 u'Cassidy, LA': 85,
 u'Chambliss, GA': 119,
 u'Chiesa, NJ': 42,
 u'Coats, IN': 71,
 u'Coburn, OK': 52,
 u'Cochran, MS': 23,
 u'Collins, ME': 73,
 u'Coons, DE': 27,
 u'Corker, TN': 115,
 u'Cornyn, TX': 13,
 u'Cotton, AR': 51,
 u'Cowan, MA': 83,
 u'Crapo, ID': 106,
 u'Cruz, TX': 36,
 u'Daines, MT': 53,
 u'Donnelly, IN': 48,
 u'Durbin, IL': 61,
 u'Enzi, WY': 109,
 u'Ernst, IA': 93,
 u'Feinstein, CA': 38,
 u'Fischer, NE': 72,
 u'Flake, AZ': 69,
 u'Franken, MN': 28,
 u'Gardner, CO': 57,
 u'Gillibrand, NY': 25,
 u'Graham, SC': 41,
 u'Grassley, IA': 108,
 u'Hagan, NC': 29,
 u'Harkin, IA': 90,
 u'Hatch, UT': 14,


We will convert all the sponsors and cosponsors into IDs:

In [28]:
def get_senator_unique_name(senator):
    last_name = senator['name'].split(',')[0]
    return '{}, {}'.format(last_name, senator['state'])

def get_senator_id(senator):
    if senator not in senator_to_id:
        return 1
    return senator_to_id[senator]
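
The ID scheme can be sketched on a toy list of names. This is only an illustration under stated assumptions: `build_ids`, `lookup`, `PAD_ID` and `UNKNOWN_ID` are names invented here, and the sketch sorts the names so that the numbering is reproducible, whereas the notebook iterates over a `set` (whose order for strings can differ between interpreter runs on Python 3, so the IDs and the metadata file should come from the same session).

```python
PAD_ID = 0      # fills unused cosponsor slots
UNKNOWN_ID = 1  # anyone who never appears in the Senate vote lists


def build_ids(names):
    """Map each distinct name to an integer ID, starting at 2."""
    return {name: i + 2 for i, name in enumerate(sorted(set(names)))}


def lookup(name, ids):
    """Return the ID for a name, or UNKNOWN_ID if it was never seen."""
    return ids.get(name, UNKNOWN_ID)


toy_names = ['Flake, AZ', 'Crapo, ID', 'Flake, AZ', 'Reid, NV']
ids = build_ids(toy_names)

print(ids)
print(lookup('Reid, NV', ids))
print(lookup('Someone, XX', ids))
```

Every sponsor or cosponsor that falls back to `UNKNOWN_ID` shares a single embedding row, so the model cannot tell such people apart.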

For each bill, pull out its sponsor and cosponsors:

In [29]:
def get_bill_sponsors(bill_data):
    d = {}
    for bill in bill_data:
        d[bill['id']] = {
            'sponsor': get_senator_id(get_senator_unique_name(bill['sponsor'])),
            'cosponsors': [get_senator_id(get_senator_unique_name(cosponsor)) for cosponsor in bill['cosponsors']]
        }
    return d

In [30]:
bill_dict = get_bill_sponsors(bill_data)

In [31]:
# bill_dict

And finally we will make our data. We are really interested in representing our senators, but for an ML algorithm to learn that, it needs a goal to achieve with the representations, i.e. a procedure to determine whether a representation is good.

Our prediction problem will be: can we predict a senator's vote on a bill based on who sponsored it?

Notice that the prediction problem is built on an interaction between representations (if the representations didn't interact, the problem would be too simple). If we were truly interested in the prediction problem itself, we would include many more features: the age of the senator, whether they are a Republican or a Democrat. But we are interested in the representations. So let's get our data:

In [33]:
senator_vote_data = []
id_to_displayname = {}

for vote in all_vote_data:
    
    bill_type = vote['bill']['type']
    bill_number = vote['bill']['number']
    bill_id = '{}{}'.format(bill_type, bill_number)
    
    if bill_id in bill_dict:
        bill_sponsors = bill_dict[bill_id]
        sponsor = bill_sponsors['sponsor']
        cosponsors = bill_sponsors['cosponsors']
    else:
        continue
    
    for sen in vote['votes']['Nay']:
        senator_id = get_senator_id(sen['last_name'] + ', ' + sen['state'])
        id_to_displayname[senator_id] = sen[u'display_name']
        senator_vote_data.append((0, senator_id, sponsor, cosponsors)) 
        
    for sen in vote['votes']['Yea']:
        senator_id = get_senator_id(sen['last_name'] + ', ' + sen['state'])
        id_to_displayname[senator_id] = sen[u'display_name']
        senator_vote_data.append((1, senator_id, sponsor, cosponsors))

In [35]:
senator_vote_data[0]

(0, 106, 24, [56, 47])

In [36]:
id_to_displayname

{2: u'Kirk (R-IL)',
 3: u'Blumenthal (D-CT)',
 4: u'Booker (D-NJ)',
 5: u'Walsh (D-MT)',
 6: u'Vitter (R-LA)',
 7: u'Boxer (D-CA)',
 8: u'Barrasso (R-WY)',
 9: u'Johnson (R-WI)',
 10: u'Udall (D-CO)',
 11: u'Cardin (D-MD)',
 12: u'Sanders (I-VT)',
 13: u'Cornyn (R-TX)',
 14: u'Hatch (R-UT)',
 15: u'Bennet (D-CO)',
 16: u'Klobuchar (D-MN)',
 17: u'Peters (D-MI)',
 18: u'Toomey (R-PA)',
 19: u'Cantwell (D-WA)',
 20: u'Nelson (D-FL)',
 21: u'Hirono (D-HI)',
 22: u'Tester (D-MT)',
 23: u'Cochran (R-MS)',
 24: u'Reid (D-NV)',
 25: u'Gillibrand (D-NY)',
 26: u'Landrieu (D-LA)',
 27: u'Coons (D-DE)',
 28: u'Franken (D-MN)',
 29: u'Hagan (D-NC)',
 30: u'Capito (R-WV)',
 31: u'Wicker (R-MS)',
 32: u'Carper (D-DE)',
 33: u'Merkley (D-OR)',
 34: u'Murray (D-WA)',
 35: u'Whitehouse (D-RI)',
 36: u'Cruz (R-TX)',
 37: u'Ayotte (R-NH)',
 38: u'Feinstein (D-CA)',
 39: u'Inhofe (R-OK)',
 40: u'Risch (R-ID)',
 41: u'Graham (R-SC)',
 42: u'Chiesa (R-NJ)',
 43: u'Johnson (D-SD)',
 44: u'Burr (R-NC)',
 45:

In [37]:
len(senator_vote_data)

72382

~70k examples of (vote, senator voting, sponsor, cosponsor) tuples is pretty good (we could of course scrape more).
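
To read one of these tuples back, the IDs can be translated into display names. The sketch below decodes the first example shown above, `(0, 106, 24, [56, 47])`. The small `names` dictionary is an excerpt reconstructed by hand from the listings earlier in the notebook (IDs 47 and 56 fall in the truncated part of the output, so their display strings are assumed to follow the same `Name (P-ST)` pattern), and `describe` is a helper invented for this illustration.

```python
# Hand-built excerpt of id_to_displayname; 47 and 56 are inferred from the
# enumerate listing ('McCain, AZ' and 'Levin, MI'), not copied from output.
names = {
    24: 'Reid (D-NV)',
    47: 'McCain (R-AZ)',
    56: 'Levin (D-MI)',
    106: 'Crapo (R-ID)',
}


def describe(example, names):
    """Turn a (vote, senator, sponsor, cosponsors) tuple into a sentence."""
    label, voter, sponsor, cosponsors = example
    verb = 'Yea' if label == 1 else 'Nay'
    cos = ', '.join(names[c] for c in cosponsors) or 'nobody'
    return '{} voted {} on a bill sponsored by {} and cosponsored by {}'.format(
        names[voter], verb, names[sponsor], cos)


print(describe((0, 106, 24, [56, 47]), names))
```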

In [38]:
y = [d[0] for d in senator_vote_data]

In [39]:
# again we pad: extend with zeros, then crop to exactly l entries
def pad_or_crop(lst, l=10):
    return (lst + [0] * l)[:l]

In [40]:
pad_or_crop([99])

[99, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [41]:
pad_or_crop([99], 2)

[99, 0]

In [42]:
import numpy as np

# list comprehensions (rather than map) build real arrays on both Python 2 and 3
x_1 = np.array([d[1] for d in senator_vote_data])
x_2 = np.array([d[2] for d in senator_vote_data])
x_3 = np.array([pad_or_crop(d[3]) for d in senator_vote_data])
x = [x_1, x_2, x_3]

In [43]:
x_1

array([106,  36,  69, ...,   6,  74,  31])

In [44]:
x_2

array([24, 24, 24, ..., 13, 13, 13])

In [45]:
x_3

array([[56, 47,  0, ...,  0,  0,  0],
       [56, 47,  0, ...,  0,  0,  0],
       [56, 47,  0, ...,  0,  0,  0],
       ..., 
       [36,  0,  0, ...,  0,  0,  0],
       [36,  0,  0, ...,  0,  0,  0],
       [36,  0,  0, ...,  0,  0,  0]])

In [46]:
# we add in padding and unknown senators
id_to_displayname[0] = '<PAD>'
id_to_displayname[1] = '<NOT A SENATOR>'

In [47]:
id_to_displayname

{0: '<PAD>',
 1: '<NOT A SENATOR>',
 2: u'Kirk (R-IL)',
 3: u'Blumenthal (D-CT)',
 4: u'Booker (D-NJ)',
 5: u'Walsh (D-MT)',
 6: u'Vitter (R-LA)',
 7: u'Boxer (D-CA)',
 8: u'Barrasso (R-WY)',
 9: u'Johnson (R-WI)',
 10: u'Udall (D-CO)',
 11: u'Cardin (D-MD)',
 12: u'Sanders (I-VT)',
 13: u'Cornyn (R-TX)',
 14: u'Hatch (R-UT)',
 15: u'Bennet (D-CO)',
 16: u'Klobuchar (D-MN)',
 17: u'Peters (D-MI)',
 18: u'Toomey (R-PA)',
 19: u'Cantwell (D-WA)',
 20: u'Nelson (D-FL)',
 21: u'Hirono (D-HI)',
 22: u'Tester (D-MT)',
 23: u'Cochran (R-MS)',
 24: u'Reid (D-NV)',
 25: u'Gillibrand (D-NY)',
 26: u'Landrieu (D-LA)',
 27: u'Coons (D-DE)',
 28: u'Franken (D-MN)',
 29: u'Hagan (D-NC)',
 30: u'Capito (R-WV)',
 31: u'Wicker (R-MS)',
 32: u'Carper (D-DE)',
 33: u'Merkley (D-OR)',
 34: u'Murray (D-WA)',
 35: u'Whitehouse (D-RI)',
 36: u'Cruz (R-TX)',
 37: u'Ayotte (R-NH)',
 38: u'Feinstein (D-CA)',
 39: u'Inhofe (R-OK)',
 40: u'Risch (R-ID)',
 41: u'Graham (R-SC)',
 42: u'Chiesa (R-NJ)',
 43: u'Johnso

In [48]:
# this gives us how many representations:
len(id_to_displayname)

120

In [49]:
# we again need to write down the metadata
import csv

with open('data/congress/USA/senator_metadata.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
    for key, value in sorted(id_to_displayname.items()):
        writer.writerow([value.encode('utf8')])
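
The cell above uses Python 2 idioms (binary mode `'wb'` and an explicit `.encode('utf8')`). A rough Python 3 equivalent is sketched below, assuming the same one-label-per-line layout. The function name `write_metadata` and the three-entry dictionary are made up for the example, and an in-memory buffer stands in for the real file so the result can be inspected.

```python
import csv
import io


def write_metadata(id_to_name, handle):
    """Write one display name per line, ordered by ID."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    for _, name in sorted(id_to_name.items()):
        writer.writerow([name])


# Toy mapping, deliberately out of order to show that sorting by ID matters.
toy_names = {1: '<NOT A SENATOR>', 0: '<PAD>', 2: 'Kirk (R-IL)'}

buffer = io.StringIO()
write_metadata(toy_names, buffer)
print(buffer.getvalue())
```

With a real file one would open it as `open(path, 'w', newline='', encoding='utf8')`. The ordering matters because line i of the metadata file labels row i of the embedding matrix, which works here because the IDs run contiguously from 0.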

In [50]:
# finally we build our model
from keras.layers import concatenate
from keras.layers import Dense, Input, Flatten
from keras.layers import MaxPooling1D, Embedding

embedding_layer = Embedding(len(id_to_displayname), 100)

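# one shared embedding feeds three small dense towers (voter, sponsor, cosponsors);
# the cosponsor embeddings are max-pooled into a single vector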
voting = voting_input = Input(shape=(1,), dtype='int32')
voting = embedding_layer(voting)
voting = Dense(32, activation='relu')(voting)
voting = Dense(32, activation='relu')(voting)

sponsor = sponsor_input = Input(shape=(1,), dtype='int32')
sponsor = embedding_layer(sponsor)
sponsor = Dense(32, activation='relu')(sponsor)
sponsor = Dense(32, activation='relu')(sponsor)

cosponsor = cosponsor_input = Input(shape=(10,), dtype='int32')
cosponsor = embedding_layer(cosponsor)
cosponsor = MaxPooling1D(10)(cosponsor)
cosponsor = Dense(32, activation='relu')(cosponsor)
cosponsor = Dense(32, activation='relu')(cosponsor)

combined = concatenate([voting, sponsor, cosponsor])
combined = Dense(32, activation='relu')(combined)
combined = Dense(1, activation='sigmoid')(combined)

Using TensorFlow backend.
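
What `MaxPooling1D(10)` does to the ten cosponsor embeddings can be illustrated without Keras: it keeps, for each embedding dimension, the largest value across the ten slots. The sketch below uses made-up 3-dimensional vectors and only four slots to stay readable; `max_pool` and `toy_embeddings` are hypothetical names, and the numbers are not taken from any trained model.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


# Invented embeddings, one row per ID (the notebook uses 100 dimensions).
toy_embeddings = {
    0: [0.0, 0.5, -0.2],   # the <PAD> row is a learned vector too
    47: [0.3, -0.1, 0.4],
    56: [-0.6, 0.2, 0.1],
}

# Cosponsor IDs already padded to a fixed length (4 here, 10 in the notebook).
cosponsors = [56, 47, 0, 0]

pooled = max_pool([toy_embeddings[i] for i in cosponsors])
print(pooled)
```

Note that the second component of the result comes from the padding row. As far as I can tell, the `Embedding` layer above is created without `mask_zero`, so the `<PAD>` vector takes part in the maximum just like a real senator's vector would.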


In data/congress/USA, launch TensorBoard:
    
> tensorboard --logdir=senator_reps/

In [51]:
from keras.models import Model
from keras.callbacks import TensorBoard

model = Model([voting_input, sponsor_input, cosponsor_input], combined)

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

embedding_metadata = {
    embedding_layer.name: '../senator_metadata.csv'
}

model.fit([x_1, x_2, x_3], np.array(y).reshape(-1, 1, 1),
          batch_size=128,
          epochs=10,
          validation_split=0.2,
          callbacks=[TensorBoard(log_dir='data/congress/USA/senator_reps', 
                                 embeddings_freq=1,
                                 embeddings_metadata=embedding_metadata)])

Train on 57905 samples, validate on 14477 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1834e373d0>

Go to TensorBoard:
> http://localhost:6006/#projector

In TensorBoard, we can look at the representations created by our model using t-SNE or PCA. A t-SNE analysis run for more than 1.1K iterations divides the senators into two groups, roughly Republicans and Democrats.

![title](figures/t-SNE-USA.png)

It is interesting to note that some senators are "on the other side", such as Kerry (Democrat from MA), who appears in the Republican cohort.
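
Outside TensorBoard, the learned vectors can also be inspected directly, for example by asking which senators lie closest to a given one. The sketch below shows a cosine-similarity nearest-neighbour lookup on invented 2-dimensional vectors with placeholder labels `A` to `D`. With the real model, one would presumably fill the dictionary from `embedding_layer.get_weights()[0]` and `id_to_displayname`; the helper names `cosine` and `nearest` are assumptions of this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(name, vectors, k=2):
    """Return the k labels whose vectors are most similar to vectors[name]."""
    scored = [(cosine(vectors[name], vec), other)
              for other, vec in vectors.items() if other != name]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


# Made-up vectors: A and B point one way, C and D the other.
toy_vectors = {
    'A': [1.0, 0.1],
    'B': [0.9, 0.2],
    'C': [-1.0, 0.3],
    'D': [-0.8, 0.1],
}

print(nearest('A', toy_vectors, k=1))
print(nearest('C', toy_vectors, k=1))
```

A senator who sits "on the other side" in the t-SNE plot would show up here as someone whose nearest neighbours mostly belong to the other party.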