
Build scraper for Statements of Administrative Policy, connect to bill #34

Closed
aih opened this issue Oct 20, 2020 · 7 comments

aih (Collaborator, Author) commented Nov 17, 2020

The SAP may be connected to one or more bills (the associated bills are identified in the SAP's title).

aih (Collaborator, Author) commented Jan 12, 2021

Related to #82

ayeshamk self-assigned this Jan 20, 2021
aih (Collaborator, Author) commented Jan 21, 2021

This is why we need to link to the PDFs from our server:

[screenshot]

ayeshamk pinned this issue Jan 25, 2021
ayeshamk (Collaborator) commented

SAP data archive for the Obama administration:

[screenshot]

ayeshamk (Collaborator) commented

Obama administration SAP data archive link: https://obamawhitehouse.archives.gov/omb/legislative_sap_default.
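For reference, here is a minimal sketch of how a scraper could collect the SAP PDF links from an index page like that one. The page structure assumed here (plain anchor tags whose href ends in .pdf) is an assumption, not something verified against the live archive page:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

INDEX_URL = "https://obamawhitehouse.archives.gov/omb/legislative_sap_default"

def find_sap_pdf_links(index_url):
    # Fetch the index page and collect every link that points at a PDF.
    html = requests.get(index_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    pdf_links = []
    for a in soup.find_all("a", href=True):
        if a["href"].lower().endswith(".pdf"):
            # Resolve relative hrefs against the index page URL.
            pdf_links.append(urljoin(index_url, a["href"]))
    return pdf_links

if __name__ == "__main__":
    for url in find_sap_pdf_links(INDEX_URL):
        print(url)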

JoshData (Member) commented

As promised, here's some feedback on the JSON metadata currently at https://github.com/aih/FlatGov/blob/FT_branch_1/server_py/flatgov/dump_statement.json.zip.

I think it might be important to re-think the philosophy here: build a permanent archive of this data (Statements of Administration Policy) first, and think about how it integrates into the Flatgov application second. This creates a little buffer between the usefulness and value of your data-gathering efforts and the longevity of the Flatgov application: a permanent archive will live forever and will create value for researchers so long as humans continue to exist, but Flatgov might not.

As a permanent archive, you want to make sure the archive is complete and accurate, and that the metadata is clear in meaning, well organized, has some documentation about what it is, where it's from, and how it's organized, and doesn't have extraneous (e.g. Flatgov-internal) information in it. I would move it out of a repository that has UI/front-end/server things.

Then a second step is to integrate it into the UI, being a consumer of your own data to prove that the data is consumable.

The JSON currently looks like: (my specific comments are below)

[
...
{
  "model": "bills.statement",
  "pk": 24,
  "fields": {
    "bill_number": "HR1140",
    "bill": "H.R. 1140 Rights for Transportation Security Officers Act of 2020",
    "congres": "116",
    "date_issued": " March 2, 2020",
    "pdf_link": null,
    "link": "https://www.whitehouse.gov/wp-content/uploads/2020/03/SAP_HR-1140.pdf",
    "created_at": "2021-01-16T17:11:53.026Z"
  }
},
...
]
  • Don't ZIP it. It's not huge. It's easier to work with if it's not compressed.
  • The Django model information (model, pk) doesn't have a use beyond Flatgov-internal purposes. It might be handy for export, but a long-term archive of this information will be clearer without it.
  • bill_number and congress could be combined into a congress-project-style bill id field just named bill or bill_id, which would look like hr1140-116. This format has the advantage of being unambiguously an ID, with known values for the bill type characters (https://github.com/unitedstates/congress/wiki/bills#basic-information). (Note that congres is misspelled.) The example record after this list shows the resulting shape.
  • bill should be renamed bill_title.
  • date_issued should be an ISO-8601 YYYY-MM-DD date.
  • pdf_link is empty so I'm not sure if it's meant to be used.
  • link is great. It will go stale eventually though, and this JSON may last forever. So a better name might be original_pdf_link. (It's the URL you actually downloaded the PDF from.)
  • Since the link will eventually break, it might be cool to pre-emptively fetch an Internet Archive link to the PDF so that there's a permanent and quasi-authoritative URL. This could be named link or permanent_pdf_link. (A sketch of such a lookup follows this list.)
  • There should be a date_fetched field that indicates when the scraper acquired this document. There's some overlap with created_at but created_at probably refers to the ORM database record, which might have a different date.
  • You should add a field pdf_filename that has the name of the local PDF file. Then the naming convention of the PDF file doesn't matter as long as it's unique. The pdf_filename should be the primary way that the metadata record is connected to the local PDF file, not by assuming a particular directory structure and file naming convention for the PDFs.
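Putting those suggestions together, a single record in the archive might look something like the following. The values are illustrative, carried over from the record quoted above; the Wayback URL is a placeholder, not a real snapshot:

{
  "bill_id": "hr1140-116",
  "bill_title": "H.R. 1140 Rights for Transportation Security Officers Act of 2020",
  "date_issued": "2020-03-02",
  "original_pdf_link": "https://www.whitehouse.gov/wp-content/uploads/2020/03/SAP_HR-1140.pdf",
  "permanent_pdf_link": "https://web.archive.org/web/<timestamp>/https://www.whitehouse.gov/wp-content/uploads/2020/03/SAP_HR-1140.pdf",
  "pdf_filename": "SAP_HR-1140.pdf",
  "date_fetched": "2021-01-16"
}

And a minimal sketch of the Internet Archive lookup, using the public Wayback Machine availability API (the endpoint is real; the helper name is mine, and error handling beyond a status check is omitted):

import requests

def wayback_snapshot_url(url):
    # Ask the Wayback Machine for the closest existing snapshot of a URL.
    # Returns the snapshot URL, or None if nothing has been archived yet.
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

If no snapshot exists yet, a GET request to https://web.archive.org/save/ followed by the PDF's URL asks the Wayback Machine to create one.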

smplater unpinned this issue Feb 2, 2021
nkinaba added this to the Understand the Context Section__Bill Page milestone Feb 4, 2021
aih (Collaborator, Author) commented Feb 12, 2021

This is merged and deployed as of d80b0:

[screenshot]

aih closed this as completed Feb 12, 2021