## WIM Python API-Webscraping workshop: 2020-09-18
### Helge Marahrens (hmarahre@iu.edu) & Anne Kavalerchik (akavaler@iu.edu)
#### Part 1: ProPublica API
https://www.propublica.org/datastore/api/propublica-congress-api

#### What you need:
* Python
    * required: requests (needs to be installed), json (comes with Python)
    * optional: pandas (needs to be installed), pickle (comes with Python)

In [22]:
# packages you need to install
import requests
import pandas as pd

# packages that come with Python
import json
from collections import defaultdict
from time import sleep
import pickle

* a Developer account/API key
    * I saved my API key in a .txt document
    * do not share your API key with anyone (i.e., treat it like a password)

In [23]:
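local_file = 'congress_auth.txt'
with open(local_file, "r") as txtfile:
    # the key is on the first line; strip() removes the trailing newline and stray spaces
    content = txtfile.readline().strip()
# the API expects the key in a request header named X-API-Key
credentials = {'X-API-Key': content}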

In [24]:
#credentials = {'X-API-Key':'asdj423948239wfwdsld3445'} <- not my real API key (I just slammed my keyboard)
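
Another way to keep the key out of the notebook is to look for it in an environment variable first and fall back to the text file. The following is only a minimal sketch: the helper name `load_api_key` and the variable name `PROPUBLICA_API_KEY` are made up for this example and are not part of the API.

```python
import os
from pathlib import Path


def load_api_key(path="congress_auth.txt", env_var="PROPUBLICA_API_KEY"):
    """Return the API key from an environment variable or, failing that, a file.

    Both this helper and the default `env_var` name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Fall back to the first line of the text file, as in the cell above.
    lines = Path(path).read_text().splitlines()
    if not lines or not lines[0].strip():
        raise ValueError(f"no API key found in {path}")
    return lines[0].strip()


# Usage (commented out because it needs a key on your machine):
# credentials = {"X-API-Key": load_api_key()}
```

Either way, the key itself never appears in a cell that might be shared.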

#### pseudoscript
1. read API documentation
<br> check the API limit
2. import packages
3. authentication
4. build get request
5. send get request – check server response
<br><font color=green>200 – OK</font>
<br><font color=orange>404 – data not found</font>
<br><font color=red>401 – unauthorized</font>
<br><font color=red>429 – too many requests</font>
6. explore data structures
<br> lists, dictionaries
7. save data
<br> e.g. csv

#### 1. read API documentation
https://www.propublica.org/datastore/api/propublica-congress-api  <br>
"Usage is limited to 5000 requests per day (rate limits are subject to change)."

see above for: <br>
2. import packages <br>
3. authentication key
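
To put the daily limit quoted in step 1 into perspective, here is a small back-of-the-envelope calculation. It assumes the limit is still 5000 requests per day; the variable names are just for illustration.

```python
DAILY_LIMIT = 5000               # requests per day, as quoted from the documentation
SECONDS_PER_DAY = 24 * 60 * 60

# Smallest average gap between requests that stays within the limit
# if a script were to run around the clock.
min_interval = SECONDS_PER_DAY / DAILY_LIMIT
print(f"{min_interval:.2f} seconds between requests")

# With a 1-second pause per request, using up the daily limit takes
# at least this many minutes (the requests themselves add more time).
minutes_to_exhaust = DAILY_LIMIT * 1 / 60
print(f"{minutes_to_exhaust:.1f} minutes")
```

For a handful of requests, as in this workshop, the limit is not a concern; it matters once requests are sent in a loop.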

#### 4. build get request

In [25]:
host = "https://api.propublica.org/congress/v1/116"
chamber = "/house"
data_section = "/members.json"
print(host + chamber + data_section)

https://api.propublica.org/congress/v1/116/house/members.json
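
The same concatenation can be wrapped in a small function so that the congress number and chamber become parameters. This is a sketch only: `members_url` is a hypothetical helper, and the `"senate"` value follows the URL pattern in the API documentation, so check there before relying on it.

```python
BASE_URL = "https://api.propublica.org/congress/v1"


def members_url(congress, chamber):
    """Build the member-list URL for one congress and chamber.

    Hypothetical helper; it only reproduces the string
    concatenation from the cell above.
    """
    if chamber not in ("house", "senate"):
        raise ValueError("chamber must be 'house' or 'senate'")
    return f"{BASE_URL}/{congress}/{chamber}/members.json"


print(members_url(116, "house"))
print(members_url(116, "senate"))
```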


#### 5. send get request – check server response

In [26]:
response = requests.get(host + chamber + data_section, headers=credentials)
assert response.status_code == 200  # stop here unless the server answered "OK"
members = response.json()  # parse the JSON body into Python objects

In [27]:
response.status_code

200

In [28]:
assert response.status_code == 200
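
The status codes from the outline can also be handled in code instead of with a bare `assert`. The sketch below retries only on 429 (too many requests) and stops on everything else. The function `get_with_retry`, its parameters, and the wait time are assumptions made for this example; the `get` argument is any function that is called like `requests.get`, which lets the sketch run without a network connection.

```python
from time import sleep


def get_with_retry(get, url, headers, max_tries=3, wait_seconds=5):
    """Call `get(url, headers=headers)` and retry only on status 429.

    `get` is any function with the same call shape as `requests.get`.
    """
    for attempt in range(1, max_tries + 1):
        response = get(url, headers=headers)
        if response.status_code == 200:
            return response
        if response.status_code == 429 and attempt < max_tries:
            # Too many requests: pause, then try again.
            sleep(wait_seconds)
            continue
        # 401, 404, a final 429, or anything else: stop and report the code.
        raise RuntimeError(f"request failed with status {response.status_code}")


# Usage with the objects defined earlier in this notebook:
# response = get_with_retry(requests.get, host + chamber + data_section, credentials)
# members = response.json()
```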

#### 6. explore data structures

In [35]:
len(members)

3

In [36]:
type(members)

dict

In [37]:
members.keys()

dict_keys(['status', 'copyright', 'results'])

In [38]:
len(members['results'])

1

In [39]:
members['results'][0].keys()

dict_keys(['congress', 'chamber', 'num_results', 'offset', 'members'])

In [40]:
members['results'][0]['congress']

'116'

In [41]:
type(members['results'][0]['members'])

list

In [42]:
len(members['results'][0]['members'])

449
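
Stepping through `len`, `type`, and `keys()` by hand works well, but the same exploration can be automated. The following is a rough sketch of a recursive summary; `describe` is a hypothetical helper, and the `sample` dictionary is a tiny stand-in with placeholder values that only mimics the shape of the response explored above.

```python
def describe(obj, name="data", depth=0, max_depth=3):
    """Return lines summarizing the type and size of nested dicts and lists."""
    indent = "    " * depth
    if isinstance(obj, dict):
        lines = [f"{indent}{name}: dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines += describe(value, repr(key), depth + 1, max_depth)
        return lines
    if isinstance(obj, list):
        lines = [f"{indent}{name}: list of {len(obj)} items"]
        if obj and depth < max_depth:
            # Only look at the first element, assuming all items share one shape.
            lines += describe(obj[0], "[0]", depth + 1, max_depth)
        return lines
    return [f"{indent}{name}: {type(obj).__name__}"]


# Stand-in data (placeholder values), shaped like the response above.
sample = {"status": "OK",
          "results": [{"congress": "116", "members": [{"id": "A000374"}]}]}
print("\n".join(describe(sample)))

# With the real response: print("\n".join(describe(members)))
```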

In [44]:
print(json.dumps(members['results'][0]['members'][0],
                 indent=4, sort_keys=True))

{
    "api_uri": "https://api.propublica.org/congress/v1/members/A000374.json",
    "at_large": false,
    "contact_form": null,
    "cook_pvi": "R+15",
    "crp_id": "N00036633",
    "cspan_id": "76236",
    "date_of_birth": "1954-09-16",
    "district": "5",
    "dw_nominate": 0.541,
    "facebook_account": "CongressmanRalphAbraham",
    "fax": null,
    "fec_candidate_id": "H4LA05221",
    "first_name": "Ralph",
    "gender": "M",
    "geoid": "2205",
    "google_entity_id": "/m/012dwd7_",
    "govtrack_id": "412630",
    "icpsr_id": "21522",
    "id": "A000374",
    "ideal_point": null,
    "in_office": true,
    "last_name": "Abraham",
    "last_updated": "2020-09-16 10:30:22 -0400",
    "leadership_role": "",
    "middle_name": null,
    "missed_votes": 319,
    "missed_votes_pct": 35.84,
    "next_election": "2020",
    "ocd_id": "ocd-division/country:us/state:la/cd:5",
    "office": "417 Cannon House Office Building",
    "party": "R",
    "phone": "202-225-8490",
    "rss_url"

#### 7. save data

In [46]:
df = pd.DataFrame(members['results'][0]['members'])
df.shape

(449, 46)

In [48]:
df.head()

| | id | title | short_title | api_uri | first_name | middle_name | last_name | suffix | date_of_birth | gender | ... | office | phone | fax | state | district | at_large | geoid | missed_votes_pct | votes_with_party_pct | votes_against_party_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A000374 | Representative | Rep. | https://api.propublica.org/congress/v1/members... | Ralph | | Abraham | | 1954-09-16 | M | ... | 417 Cannon House Office Building | 202-225-8490 | | LA | 5 | False | 2205 | 35.84 | 94.88 | 4.95 |
| 1 | A000370 | Representative | Rep. | https://api.propublica.org/congress/v1/members... | Alma | | Adams | | 1946-05-27 | F | ... | 2436 Rayburn House Office Building | 202-225-1510 | | NC | 12 | False | 3712 | 2.92 | 99.19 | 0.7 |
| 2 | A000055 | Representative | Rep. | https://api.propublica.org/congress/v1/members... | Robert | B. | Aderholt | | 1965-07-22 | M | ... | 1203 Longworth House Office Building | 202-225-4876 | | AL | 4 | False | 104 | 5.06 | 93.56 | 6.32 |
| 3 | A000371 | Representative | Rep. | https://api.propublica.org/congress/v1/members... | Pete | | Aguilar | | 1979-06-19 | M | ... | 109 Cannon House Office Building | 202-225-3201 | | CA | 31 | False | 631 | 0.9 | 97.27 | 2.61 |
| 4 | A000372 | Representative | Rep. | https://api.propublica.org/congress/v1/members... | Rick | | Allen | | 1951-11-07 | M | ... | 2400 Rayburn House Office Building | 202-225-2823 | | GA | 12 | False | 1312 | 0.22 | 92.27 | 7.61 |


In [49]:
# writing .xlsx files requires an Excel engine such as openpyxl
df.to_excel("congress_house_116.xlsx")
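
Since the outline lists CSV as an example format, here is a sketch of writing a few columns with Python's built-in `csv` module, which needs neither pandas nor an Excel engine. The helper `members_to_csv` and the chosen column list are assumptions for this example; the demo record reuses values from the output above.

```python
import csv
import io


def members_to_csv(members_list, columns, handle):
    """Write the chosen columns of a list of member dicts as CSV."""
    # extrasaction="ignore" skips keys that are not in `columns`.
    writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for member in members_list:
        # Keys missing from a record are written as empty cells.
        writer.writerow(member)


# Demonstration on an in-memory file with a single record.
demo = [{"id": "A000374", "first_name": "Ralph", "last_name": "Abraham",
         "party": "R", "state": "LA", "district": "5"}]
buffer = io.StringIO()
members_to_csv(demo, ["id", "last_name", "party", "state"], buffer)
print(buffer.getvalue())

# With the real data this could become:
# with open("congress_house_116.csv", "w", newline="", encoding="utf-8") as f:
#     members_to_csv(members['results'][0]['members'],
#                    ["id", "last_name", "party", "state"], f)
```

With the DataFrame already in memory, `df.to_csv("congress_house_116.csv")` is the shorter route.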

#### health bills

In [53]:
bills_100 = {}  # maps page number -> parsed JSON response for that page
host = "https://api.propublica.org/congress/v1/bills/subjects/health.json?offset="

In [54]:
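for i, offset in enumerate(range(0, 100, 20)):  # offsets 0, 20, 40, 60, 80
    sleep(1)  # pause between requests to go easy on the server
    response = requests.get(host + str(offset), headers=credentials)
    assert response.status_code == 200
    bills = response.json()
    bills_100[i] = bills  # store each page under its page number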
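
The loop stores one response per page, so the bills end up spread across several dictionaries. A possible next step is to combine them into one list. The sketch below assumes each page has a `'results'` key holding a list of bills, which is what indexing `bills_100[0]['results'][0]` in this notebook relies on. `flatten_pages` is a hypothetical helper, and the second bill id in the stand-in data is a placeholder.

```python
def flatten_pages(pages):
    """Combine the 'results' lists of all pages into one list, in page order.

    Assumes each page is a dict whose 'results' value is a list of bills.
    """
    all_bills = []
    for index in sorted(pages):
        # A page without 'results' contributes nothing instead of raising.
        all_bills.extend(pages[index].get("results", []))
    return all_bills


# Stand-in pages; "placeholder-bill" is not a real bill id.
pages = {
    1: {"results": [{"bill_id": "placeholder-bill"}]},
    0: {"results": [{"bill_id": "hr945-116"}]},
}
print([bill["bill_id"] for bill in flatten_pages(pages)])

# With the real data: all_bills = flatten_pages(bills_100)
```

A flat list like this can be passed straight to `pd.DataFrame`, as was done for the members.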

In [58]:
bills_100[0]['results'][0]

{
    "active": true,
    "bill_id": "hr945-116",
    "bill_slug": "hr945",
    "bill_type": "hr",
    "bill_uri": "https://api.propublica.org/congress/v1/116/bills/hr945.json",
    "committee_codes": [],
    "committees": "House Energy and Commerce Committee",
    "congressdotgov_url": "https://www.congress.gov/bill/116th-congress/house-bill/945",
    "cosponsors": 120,
    "cosponsors_by_party": {
        "D": 96,
        "R": 24
    },
    "enacted": null,
    "govtrack_url": "https://www.govtrack.us/congress/bills/116/hr945",
    "gpo_pdf_uri": null,
    "house_passage": null,
    "introduced_date": "2019-01-31",
    "last_vote": null,
    "latest_major_action": "Ordered to be Reported (Amended) by Voice Vote.",
    "latest_major_action_date": "2020-09-09",
    "number": "H.R.945",
    "primary_subject": "Health",
    "senate_passage": null,
    "short_title": "Mental Health Access Improvement Act of 2019",
    "sponsor_id": "T000460",
    "sponsor_name": "Mike Thompson",
    "spon

In [None]:
with open("ProPublica_Members-Bills.pkl", "wb") as file:
    pickle.dump(bills_100, file)
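
To get the data back in a later session, the file is opened in binary read mode and passed to `pickle.load`. The sketch below shows the round trip on a small stand-in dictionary written to a temporary folder; the file name `example.pkl` and the stand-in content are placeholders.

```python
import os
import pickle
import tempfile

# Stand-in for bills_100 (placeholder content for this sketch).
data = {0: {"results": [{"bill_id": "hr945-116"}]}}

path = os.path.join(tempfile.mkdtemp(), "example.pkl")
with open(path, "wb") as file:
    pickle.dump(data, file)

# Reading uses "rb" (read binary), mirroring the "wb" used for writing.
with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored == data)  # True: the loaded object equals the original
```

Only unpickle files you created yourself or otherwise trust, because loading a pickle file can execute arbitrary code.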