## The regulations.gov API

In an effort to take citizens' opinions and perspectives into account, the federal government has launched [regulations.gov](https://www.regulations.gov/). This site allows visitors to review documents related to open proposals and to comment on them online. The site acts as a repository for those comments, and it provides an admirable API for accessing this data. Below are some notes on how to interact with the API effectively in Python.

## Getting set up

* Sign up for an API account at [http://regulationsgov.github.io/developers/key/](http://regulationsgov.github.io/developers/key/)

* This blog post goes over the minimum set of parameters that I found useful when consuming this API. For a full list of options, visit [http://regulationsgov.github.io/developers/console/](http://regulationsgov.github.io/developers/console/) and click "Expanded operations".


### Searching for documents

The first endpoint of interest is [https://api.data.gov/regulations/v3/documents](https://api.data.gov/regulations/v3/documents). This endpoint can be used to search for documents in the system using the parameters in the table below.

Add any number of querystring parameters to refine the query to your needs:

| Attribute     | Description
| ------------- |:-----------------------------
| api_key       | Your API key, obtained from the sign-up link above
| s             | Keywords 
| rpp           | Results per page, max=1000
| po            | Page offset, starting at 0
| crd           | Creation date, accepts date as MM/DD/YY or range as MM/DD/YY-MM/DD/YY
| cat           | Document category; for our purposes we wanted "AEP" (Agriculture, Environment, and Public Lands)
| dct           | Document type; for our purposes we wanted "PS" (Public Submission). We don't use this in the first call, but we will later in this post.

In [173]:
import requests
import pandas as pd
import json
import sys
from cStringIO import StringIO

In [148]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from docx import Document

In [64]:
api_key = 'YOUR_API_KEY'  # placeholder: replace with the key you received when signing up
keywords = 'water'
results_per_page = 1000
page_offset = 0
creation_range = '01/01/16-11/01/16'
category = 'AEP'

In [67]:
uri_template = 'https://api.data.gov/regulations/v3/documents?api_key={}&s={}&rpp={}&po={}&crd={}&cat={}'
uri = uri_template.format(api_key, keywords, results_per_page, page_offset, creation_range, category)
r = requests.get(uri)

In [68]:
search_result = json.loads(r.content)
print('Found ' + str(search_result['totalNumRecords']) + ' results.')

Found 685 results.
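
The string template above works for simple keywords, but values such as the date range contain slashes, and keywords may contain spaces. As a minimal sketch (written for Python 3, and assuming the same parameter names as in the table above), the query string can be built with the standard library's `urlencode`, which percent-encodes each value. `build_search_uri` is a hypothetical helper name, not part of the API, and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

BASE_URI = 'https://api.data.gov/regulations/v3/documents'

def build_search_uri(api_key, **params):
    """Return a search URI with every parameter percent-encoded."""
    query = {'api_key': api_key}
    # Drop parameters the caller left as None so they are not sent at all.
    query.update({k: v for k, v in params.items() if v is not None})
    return BASE_URI + '?' + urlencode(query)

uri = build_search_uri('YOUR_API_KEY', s='clean water', rpp=1000, po=0,
                       crd='01/01/16-11/01/16', cat='AEP', dct=None)
print(uri)
```

Passing `dct=None` leaves the document type out of the first query, which mirrors the call above.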


Here's an example of the first result:

In [61]:
doc = search_result['documents'][0]
print(json.dumps(doc, indent=4, sort_keys=True))

{
    "agencyAcronym": "EPA", 
    "allowLateComment": false, 
    "attachmentCount": 0, 
    "commentDueDate": "2016-12-28T23:59:59-05:00", 
    "commentStartDate": "2016-09-29T00:00:00-04:00", 
    "docketId": "EPA-HQ-OW-2016-0405", 
    "docketTitle": "Federal Baseline Water Quality Standards for Indian Reservations", 
    "docketType": "Rulemaking", 
    "documentId": "EPA-HQ-OW-2016-0405-0001", 
    "documentStatus": "Posted", 
    "documentType": "Proposed Rule", 
    "frNumber": "2016-23432", 
    "numberOfCommentsReceived": 0, 
    "openForComment": true, 
    "postedDate": "2016-09-29T00:00:00-04:00", 
    "rin": "2060-AF62", 
    "summary": "Document Contents : ...OW-2016-0405; FRL-9953-19-OW] RIN 2040-AF62 Federal Baseline <endeca_term>Water</endeca_term> Quality Standards for Indian Reservations AGENCY: Environmental Protection Agency (EPA). ACTION: Advance...", 
    "title": "Federal Baseline Water Quality Standards: Indian Reservations"
}
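
Two details of the result above are worth handling before analysis: the `summary` field wraps the matched keyword in `<endeca_term>` tags, and the dates are ISO 8601 strings. The following is a small sketch (Python 3.7 or later, for `datetime.fromisoformat`) of one way to tidy a result. The helper names are hypothetical, and the assumption that some fields may be absent on other documents is mine, not something the API documentation states here.

```python
import re
from datetime import datetime

def clean_summary(summary):
    """Strip the <endeca_term> highlight tags from a summary string."""
    return re.sub(r'</?endeca_term>', '', summary)

def summarize_document(doc):
    """Pick out a few fields from one search result."""
    due = doc.get('commentDueDate')  # assumed optional; falls back to None
    return {
        'documentId': doc['documentId'],
        'title': doc['title'],
        'summary': clean_summary(doc.get('summary', '')),
        'commentDueDate': datetime.fromisoformat(due) if due else None,
    }

# A trimmed copy of the first result shown above.
example = {
    'documentId': 'EPA-HQ-OW-2016-0405-0001',
    'title': 'Federal Baseline Water Quality Standards: Indian Reservations',
    'summary': 'Federal Baseline <endeca_term>Water</endeca_term> Quality Standards',
    'commentDueDate': '2016-12-28T23:59:59-05:00',
}
info = summarize_document(example)
print(info['summary'])
print(info['commentDueDate'].date())
```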


### Document lookup

Once you've gotten a longer list, you are likely to want further details on one or more of the results. The document lookup endpoint, [https://api.data.gov/regulations/v3/document](https://api.data.gov/regulations/v3/document), can help with that. It requires your api_key and a documentId, which can be found in the search results retrieved previously.

In [72]:
uri_template = 'https://api.data.gov/regulations/v3/document?api_key={}&documentId={}'
documentId = doc['documentId']
print(documentId)
uri = uri_template.format(api_key, documentId)

EPA-HQ-OA-2013-0031-0009


In [73]:
r = requests.get(uri)

In [74]:
first_doc = json.loads(r.content)

The document response has a lot more details, so rather than printing it out here, let's just see some of what's available.

In [77]:
print(first_doc.keys())

[u'allowLateComment', u'docketTitle', u'pageCount', u'receivedDate', u'abstract', u'rin', u'cfrPart', u'documentType', u'postedDate', u'numItemsRecieved', u'title', u'frCitation', u'docketType', u'openForComment', u'startEndPage', u'commentDueDate', u'status', u'federalRegisterNumber', u'agencyAcronym', u'docSubType', u'docketId', u'documentId', u'agencyName', u'attachmentCount', u'fileFormats']
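
The cells above pass `r.content` straight to `json.loads`, which assumes every request succeeds. As a hedged sketch (Python 3), a small wrapper can check the status code first so that a failed request raises a clear error instead of a confusing JSON parse failure. `parse_response` is a hypothetical helper, and treating the body as UTF-8 is an assumption.

```python
import json

def parse_response(status_code, body):
    """Decode a JSON response body, raising if the request did not succeed."""
    if status_code != 200:
        raise RuntimeError('Request failed with HTTP status %d' % status_code)
    if isinstance(body, bytes):
        body = body.decode('utf-8')  # assumption: the API responds in UTF-8
    return json.loads(body)

# With requests this would be called as parse_response(r.status_code, r.content).
doc_details = parse_response(200, b'{"documentId": "EPA-HQ-OW-2016-0405-0001"}')
print(doc_details['documentId'])
```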


### Dockets

Dockets are "organizational folders" on the regulations.gov site. They likely have wider uses than I needed here, but I believe they are mainly helpful for finding collections of documents that are related in some way. For our purposes, we want to retrieve all the comments for a given documentId of interest, and the docket endpoint gives us some high-level metadata about those comments.

In [98]:
doc_type = 'PS' # Public Submission
parentDocumentId = 'EPA–R08–OAR–2015–0463'
uri_template = 'https://api.data.gov/regulations/v3/docket?api_key={}&rpp={}&po={}&dct={}&docketId={}'
uri = uri_template.format(api_key, results_per_page, page_offset, doc_type, parentDocumentId)

In [99]:
r = requests.get(uri)

In [100]:
docket_data = json.loads(r.content)
print(json.dumps(docket_data, indent=4, sort_keys=True))

{
    "agency": "Environmental Protection Agency", 
    "agencyAcronym": "EPA", 
    "docketId": "EPA-R08-OAR-2015-0463", 
    "generic": {
        "label": "Location", 
        "value": "R08-OAR"
    }, 
    "numberOfComments": 151, 
    "rin": "Not Assigned", 
    "title": "Approval and Promulgation of Air Quality Implementation Plans; Utah; Revisions to Regional Haze State Implementation Plan", 
    "type": {
        "label": "Type", 
        "value": "Rulemaking"
    }
}


Great, so for this particular proposal, we find 151 comments.  Let's retrieve them with the search API.

To retrieve them, simply use the ID you are interested in (here, the value stored in `parentDocumentId`) as your search keyword, and ask for just public submissions (comments).

In [101]:
keywords = parentDocumentId
results_per_page = 1000
page_offset = 0
doc_type = 'PS' # Public Submission

In [102]:
uri_template = 'https://api.data.gov/regulations/v3/documents?api_key={}&s={}&rpp={}&po={}&dct={}'
uri = uri_template.format(api_key, keywords, results_per_page, page_offset, doc_type)
r = requests.get(uri)

In [103]:
search_result = json.loads(r.content)
print('Found ' + str(search_result['totalNumRecords']) + ' results.')

Found 151 results.


There's one last catch in getting all the data we're interested in. Some comments are entirely contained in the response, while others have attachments that may hold useful information for us. Let's examine a few fields of the document below, for which this is the case.

In [159]:
doc = search_result['documents'][0]
print('Attachment count: ' + str(doc['attachmentCount']) + ', Comment: ' + doc['commentText'])

Attachment count: 2, Comment: See Attached


Above we can see that there are two attachments to this document which are not part of the API's response. We can retrieve them in the manner shown below, which I believe is an undocumented feature of the API. These attachments seem to always be PDF or Microsoft Word files. Let's first retrieve them, and then convert them to text.

We'll need some helper functions to make that happen. The first function below unpacks the "Content-Disposition" header, which the server should send back to describe the content. We'll extract the filename and look at its extension so that we know whether it's a PDF or a Word doc.

In [258]:
def get_file_type(content_disposition):
    # Find the filename parameter of the header and return its extension
    for elem in content_disposition.split(';'):
        elem = elem.strip()
        if elem.startswith('filename='):
            fname = elem.split('=', 1)[1].replace('"', '').replace("'", '')
            i = fname.rfind('.')
            # Only report an extension when the filename actually has one
            if i >= 0:
                return fname[i:].lower()
    return 'unknown'
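
As an alternative to splitting the header by hand, the standard library can parse it. The sketch below uses `email.message.Message.get_filename()` to read the `filename` parameter and `os.path.splitext` to take the extension. `get_file_type_stdlib` is a hypothetical name, and the sample header values are made up for illustration rather than captured from the server.

```python
import os
from email.message import Message

def get_file_type_stdlib(content_disposition):
    """Return the lower-cased extension of the filename in a Content-Disposition header."""
    msg = Message()
    msg['Content-Disposition'] = content_disposition
    filename = msg.get_filename()  # None when the header carries no filename
    if not filename:
        return 'unknown'
    ext = os.path.splitext(filename)[1].lower()
    return ext or 'unknown'

print(get_file_type_stdlib('attachment; filename="Comment Letter.PDF"'))
print(get_file_type_stdlib('inline'))
```

The parser takes care of quoting for us, so filenames containing spaces or an `=` sign need no special handling.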

The next two helper functions convert PDFs and Word docs into plain text for use in our analysis.

In [241]:
def pdf_to_text(pdf_string):
    f = StringIO(pdf_string)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(f):
        interpreter.process_page(page)
    device.close()
    # Read the buffer once, after every page has been processed;
    # this also avoids a NameError when the PDF has no pages
    data = retstr.getvalue()
    retstr.close()
    return data

In [242]:
def docx_to_text(docx_string):
    f = StringIO(docx_string)
    d = Document(f)
    data = ''
    for p in d.paragraphs:
        data += p.text + ' '
    return data

Now that we have our helper functions in place, let's write one last function which gets all the attachments for any particular documentId of a comment and aggregates them all into one simple string.

In [260]:
def get_attachment_comments(documentId, attachmentCount):
    comment = u''
    uri_template = 'https://www.regulations.gov/contentStreamer?documentId={}&attachmentNumber={}&disposition=attachment'
    for attachment_num in range(1, attachmentCount + 1):
        uri = uri_template.format(documentId, attachment_num)
        r = requests.get(uri)
        sc = r.status_code
        if sc == 200:
            # The header may be missing, so fall back to an empty string
            content_disposition = r.headers.get('Content-Disposition', '')
            file_type = get_file_type(content_disposition)
            if file_type == '.pdf':
                # pdf_to_text returns UTF-8 bytes; decode before joining with unicode text
                comment += pdf_to_text(r.content).decode('utf-8') + u' '
            elif file_type == '.docx':
                comment += docx_to_text(r.content) + u' '
            else:
                print("Can't handle this attachment: " + content_disposition)
        else:
            print("Could not download " + uri)
    return comment
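
If more attachment types turn up later, the `if`/`elif` chain can be replaced by a lookup table that maps each extension to its converter. This is only a sketch of that design. `extract_text` is a hypothetical helper, and the two lambdas are stand-ins so the example runs without pdfminer or python-docx; in the notebook they would be `pdf_to_text` and `docx_to_text`.

```python
def extract_text(file_type, payload, converters):
    """Convert an attachment to text using the converter registered for its type."""
    converter = converters.get(file_type)
    if converter is None:
        return None  # the caller decides how to report an unsupported type
    return converter(payload)

# Stand-in converters for illustration only.
converters = {
    '.pdf': lambda data: 'pdf:%d bytes' % len(data),
    '.docx': lambda data: 'docx:%d bytes' % len(data),
}

print(extract_text('.pdf', b'%PDF-1.4', converters))
print(extract_text('.xlsx', b'', converters))
```

Supporting a new format then means adding one entry to the dictionary rather than another branch.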

## Ready to pull data
With a few helper functions in place, we're now ready to iterate over our search results and grab our corpus.

In [261]:
print('Going to retrieve values for ' + parentDocumentId)

Going to retrieve values for EPA–R08–OAR–2015–0463


In [262]:
corpus = {}

In [263]:
corpus[parentDocumentId] = {}
keywords = parentDocumentId
results_per_page = 1000
page_offset = 0
doc_type = 'PS' # Public Submission

In [None]:
uri_template = 'https://api.data.gov/regulations/v3/documents?api_key={}&s={}&rpp={}&po={}&dct={}'
nresults = 1000
while results_per_page == nresults:
    uri = uri_template.format(api_key, keywords, results_per_page, page_offset, doc_type)
    print(uri)
    r = requests.get(uri)
    search_result = json.loads(r.content)
    for doc in search_result['documents']:
        thisDocumentId = doc['documentId']
        if thisDocumentId not in corpus[parentDocumentId]:
            ct = doc['commentText']
            if doc['attachmentCount'] > 0:
                acom = get_attachment_comments(doc['documentId'], doc['attachmentCount'])
                ct += ' ' + acom
            corpus[parentDocumentId][thisDocumentId] = ct
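    # Assumption: po is a record offset, (page - 1) * rpp, as I read the v3 docs,
    # so advance by a full page rather than by 1
    page_offset += results_per_page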
    nresults = len(search_result['documents'])
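
The paging logic in the loop above can also be separated from the HTTP calls, which makes it easy to check offline. The sketch below (Python 3) is one way to do that, under the assumption that `po` counts records rather than pages. `iter_documents` and `fetch_page` are hypothetical names, and the fake data source stands in for the API.

```python
def iter_documents(fetch_page, results_per_page):
    """Yield every document, requesting pages until a short page comes back.

    fetch_page(offset) is a caller-supplied function that returns the list of
    documents starting at that record offset.
    """
    offset = 0
    while True:
        documents = fetch_page(offset)
        for doc in documents:
            yield doc
        # A page shorter than the page size means there is nothing left.
        if len(documents) < results_per_page:
            break
        offset += results_per_page

# Fake data source standing in for the API, so the sketch runs offline.
all_docs = [{'documentId': 'DOC-%04d' % i} for i in range(5)]

def fake_fetch(offset):
    return all_docs[offset:offset + 2]

ids = [d['documentId'] for d in iter_documents(fake_fetch, 2)]
print(ids)
```

In real use, `fetch_page` would format the search URI with the given offset, call `requests.get`, and return `search_result['documents']`.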