
Build scraper for Statements of Administrative Policy, connect to bill #34

Closed
aih opened this issue Oct 20, 2020 · 7 comments

aih (Collaborator, Author) commented Nov 17, 2020

The SAP may be connected to one or more bills (the associated bills are identified in the SAP's title).

aih (Collaborator, Author) commented Jan 12, 2021

Related to #82

ayeshamk self-assigned this Jan 20, 2021
aih (Collaborator, Author) commented Jan 21, 2021

This is why we need to link to the PDFs from our server:

[screenshot]

ayeshamk pinned this issue Jan 25, 2021
ayeshamk (Collaborator) commented

SAP data archive for the Obama administration:

[screenshot]

ayeshamk (Collaborator) commented

Obama administration SAP data archive link: https://obamawhitehouse.archives.gov/omb/legislative_sap_default.
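For reference, here is a minimal sketch of how a scraper could collect the SAP PDF links from an index page like that one. The page structure assumed here (plain anchor tags whose href ends in .pdf) is an assumption, not something verified against the live archive page:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

INDEX_URL = "https://obamawhitehouse.archives.gov/omb/legislative_sap_default"

def find_sap_pdf_links(index_url):
    # Fetch the index page and collect every link that points at a PDF.
    html = requests.get(index_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    pdf_links = []
    for a in soup.find_all("a", href=True):
        if a["href"].lower().endswith(".pdf"):
            # Resolve relative hrefs against the index page URL.
            pdf_links.append(urljoin(index_url, a["href"]))
    return pdf_links

if __name__ == "__main__":
    for url in find_sap_pdf_links(INDEX_URL):
        print(url)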

JoshData (Member) commented

As promised, here's some feedback on the JSON metadata currently at https://github.com/aih/FlatGov/blob/FT_branch_1/server_py/flatgov/dump_statement.json.zip.

I think it might be important to re-think the philosophy here: build a permanent archive of this data (Statements of Administration Policy) first, and think about how it integrates into the Flatgov application second. This creates a little buffer between the usefulness and value of your data-gathering efforts and the longevity of the Flatgov application: a permanent archive will live forever and will create value for researchers so long as humans continue to exist, but Flatgov might not.

As a permanent archive, you want to make sure the archive is complete and accurate, and that the metadata is clear in meaning, well organized, has some documentation about what it is, where it's from, and how it's organized, and doesn't have extraneous (e.g. Flatgov-internal) information in it. I would move it out of a repository that has UI/front-end/server things.

Then a second step is to integrate it into the UI, being a consumer of your own data to prove that the data is consumable.

The JSON currently looks like: (my specific comments are below)

[
...
{
  "model": "bills.statement",
  "pk": 24,
  "fields": {
    "bill_number": "HR1140",
    "bill": "H.R. 1140 Rights for Transportation Security Officers Act of 2020",
    "congres": "116",
    "date_issued": " March 2, 2020",
    "pdf_link": null,
    "link": "https://www.whitehouse.gov/wp-content/uploads/2020/03/SAP_HR-1140.pdf",
    "created_at": "2021-01-16T17:11:53.026Z"
  }
},
...
]
  • Don't ZIP it. It's not huge. It's easier to work with if it's not compressed.
  • The Django model information (model, pk) doesn't have a use beyond Flatgov-internal purposes. It might be handy for export, but a long-term archive of this information will be clearer without it.
  • bill_number and congress could be combined into a congress-project-style bill id field just named bill or bill_id, which would look like hr1140-116. This format has the advantage of being unambiguously an ID, with known values for the bill type characters (https://github.com/unitedstates/congress/wiki/bills#basic-information). (Note that congres is misspelled.) The example record after this list shows the resulting shape.
  • bill should be renamed bill_title.
  • date_issued should be an ISO-8601 YYYY-MM-DD date.
  • pdf_link is empty so I'm not sure if it's meant to be used.
  • link is great. It will go stale eventually though, and this JSON may last forever. So a better name might be original_pdf_link. (It's the URL you actually downloaded the PDF from.)
  • Since the link will eventually break, it might be cool to pre-emptively fetch an Internet Archive link to the PDF so that there's a permanent and quasi-authoritative URL. This could be named link or permanent_pdf_link. (A sketch of such a lookup follows this list.)
  • There should be a date_fetched field that indicates when the scraper acquired this document. There's some overlap with created_at but created_at probably refers to the ORM database record, which might have a different date.
  • You should add a field pdf_filename that has the name of the local PDF file. Then the naming convention of the PDF file doesn't matter as long as it's unique. The pdf_filename should be the primary way that the metadata record is connected to the local PDF file, not by assuming a particular directory structure and file naming convention for the PDFs.
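Putting those suggestions together, a single record in the archive might look something like the following. The values are illustrative, carried over from the record quoted above; the Wayback URL is a placeholder, not a real snapshot:

{
  "bill_id": "hr1140-116",
  "bill_title": "H.R. 1140 Rights for Transportation Security Officers Act of 2020",
  "date_issued": "2020-03-02",
  "original_pdf_link": "https://www.whitehouse.gov/wp-content/uploads/2020/03/SAP_HR-1140.pdf",
  "permanent_pdf_link": "https://web.archive.org/web/<timestamp>/https://www.whitehouse.gov/wp-content/uploads/2020/03/SAP_HR-1140.pdf",
  "pdf_filename": "SAP_HR-1140.pdf",
  "date_fetched": "2021-01-16"
}

And a minimal sketch of the Internet Archive lookup, using the public Wayback Machine availability API (the endpoint is real; the helper name is mine, and error handling beyond a status check is omitted):

import requests

def wayback_snapshot_url(url):
    # Ask the Wayback Machine for the closest existing snapshot of a URL.
    # Returns the snapshot URL, or None if nothing has been archived yet.
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

If no snapshot exists yet, a GET request to https://web.archive.org/save/ followed by the PDF's URL asks the Wayback Machine to create one.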

smplater unpinned this issue Feb 2, 2021
nkinaba added this to the Understand the Context Section__Bill Page milestone Feb 4, 2021
aih (Collaborator, Author) commented Feb 12, 2021

This is merged and deployed as of d80b0:

[screenshot]

aih closed this as completed Feb 12, 2021