Add scraper for CRS reports #51

Closed
aih opened this issue Nov 17, 2020 · 7 comments

Comments

@aih
Collaborator

aih commented Nov 17, 2020

Start with everycrsreport.com (built by Josh). Can download data using instructions on About page.

May need to limit bill citations to summary.

  1. CRS reports may be updated over time, so date of CRS report does not determine related bills.
  2. Not many reports reference bills, so key will be filtering. CRS report will typically cite the bill name.
  3. May be able to use categorization to narrow search criteria or confirm relationship with bills.

NOTE: only 1 in 10 or even 1 in 100 of the reports will be relevant to us. These are only reports that include a mention of the bill title or number or other related information.

Can start by indexing CRS reports and searching by bill name & bill number.
Better data is before March 2020. Recent data is pdf => OCR, so may not be as usable.

@aih
Collaborator Author

aih commented Jan 27, 2021

We need to:

  1. download all of the report pdfs, following the information in the csv file: https://www.everycrsreport.com/reports.csv

  2. index the pdfs (we're already using Elasticsearch, so I'd recommend that).

  3. look for a bill number or bill title in the index. If we find it, we associate that pdf with that bill. The list of bill numbers and bill names to search is in the Bill table (flatgov/bills/models.py), as the number (string) and titles (array) columns. Below is a zipped json file with the bill numbers as key, and titles as a field. This can be used to test the ES index search.
    billsMeta.json.gz

image
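A minimal sketch of step 1: turning the CSV listing into a list of PDF URLs to download. It assumes the CSV has `number`, `title`, `latestPubDate`, and `latestPDF` columns, with `latestPDF` holding a path relative to the site root. These column names are assumptions, so check them against the header of the real file. The sample rows are made up, and `pdf_targets` is a hypothetical helper name.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "https://www.everycrsreport.com/"


def pdf_targets(csv_text, base_url=BASE_URL):
    """Return one dict per CSV row that lists a PDF, with an absolute URL."""
    targets = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed column name; rows without a PDF path are skipped.
        pdf_path = (row.get("latestPDF") or "").strip()
        if not pdf_path:
            continue
        targets.append({
            "number": row["number"],
            "date": row["latestPubDate"],
            "title": row["title"],
            "pdf_url": urljoin(base_url, pdf_path),
        })
    return targets


# Made-up rows in the assumed layout, for illustration only.
SAMPLE = """number,title,latestPubDate,latestPDF
R00001,Example Report One,2013-06-01,files/example_one.pdf
R00002,Example Report Without PDF,2015-12-01,
"""

for target in pdf_targets(SAMPLE):
    print(target["number"], target["pdf_url"])
```

Keeping the date next to each URL means it can be written into the Elasticsearch document later, where the search needs it as a filter.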

Associating a bill with reports

When we are in production, we will use the Bills table to get a list of the bill numbers and bill titles. But for now, I have provided a .json file which is of the form `{ "113hr2000": {"titles": ["My long title 1", "My title 2"], ...}, "113hr2001": {...} }`.

So, once you have the pdfs downloaded, and indexed into Elasticsearch, you will go through this .json object, one key at a time and search.

For 113hr2000 you will look for H. R. 2000 in reports with a date in the years 2013 or 2014 (corresponding to Congress 113). So the Elasticsearch index should include a date field. You will also search reports for the titles in 113hr2000.titles; only return reports that have a high score (almost exact match).
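As a rough sketch, the search for one key could be built like this. The index field names `text` and `date`, the `BILL_TYPES` mapping, and the helper names are all assumptions for illustration. `match_phrase` is used here as one way to approximate "almost exact match"; a score threshold on a regular `match` query would be another.

```python
import re

# Assumed display forms for bill types; extend for other types as needed.
BILL_TYPES = {"hr": "H. R.", "s": "S."}


def parse_bill_id(bill_id):
    """Split an id like '113hr2000' into (congress, type, number)."""
    m = re.fullmatch(r"(\d+)([a-z]+)(\d+)", bill_id)
    if m is None:
        raise ValueError(f"unrecognized bill id: {bill_id!r}")
    return int(m.group(1)), m.group(2), m.group(3)


def congress_years(congress):
    """Return the two calendar years covered by a Congress (113 -> 2013, 2014)."""
    first = 1787 + 2 * congress
    return first, first + 1


def build_query(bill_id, titles):
    """Build an Elasticsearch query body for one bill (field names are hypothetical)."""
    congress, bill_type, number = parse_bill_id(bill_id)
    first, second = congress_years(congress)
    label = BILL_TYPES[bill_type]  # raises KeyError for types not listed above
    should = [{"match_phrase": {"text": f"{label} {number}"}}]
    should += [{"match_phrase": {"text": title}} for title in titles]
    return {
        "query": {
            "bool": {
                # Restrict to reports dated within the two years of the Congress.
                "filter": [{"range": {"date": {"gte": f"{first}-01-01",
                                               "lte": f"{second}-12-31"}}}],
                # Match the bill number or any one of the titles as a phrase.
                "should": should,
                "minimum_should_match": 1,
            }
        }
    }


query = build_query("113hr2000", ["My long title 1", "My title 2"])
```

The resulting dict can be passed as the body of a search request against whatever index holds the report text.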

The goal of this issue is to create a new database table, 'crs', which includes the name of the report, the date of the report, and a link to the report (which we will store locally). Then in the Bills table we will add a foreign key that references any reports that the bill is associated with.

Once we have found which CRS reports mention a bill number or title (it will not be many of them), we will then add a table crs_reports to our Django database that has the name of the report, a column for associated bills, and a link to the report in our static directory.
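The table layout can be sketched in plain SQL using Python's built-in `sqlite3`; in the project itself this would presumably be a Django model instead. Apart from `crs_reports`, every table and column name here is hypothetical, and the many-to-many link table is only one way to represent "a column for associated bills".

```python
import sqlite3

SCHEMA = """
CREATE TABLE bill (
    id INTEGER PRIMARY KEY,
    number TEXT UNIQUE NOT NULL          -- e.g. '113hr2000'
);
CREATE TABLE crs_reports (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,                 -- name of the report
    date TEXT NOT NULL,                  -- report date, ISO format
    link TEXT NOT NULL                   -- path to the locally stored pdf
);
CREATE TABLE crs_reports_bills (         -- one row per (report, bill) association
    report_id INTEGER NOT NULL REFERENCES crs_reports(id),
    bill_id INTEGER NOT NULL REFERENCES bill(id),
    PRIMARY KEY (report_id, bill_id)
);
"""


def reports_for_bill(conn, bill_number):
    """Return (title, date, link) for every report linked to a bill."""
    rows = conn.execute(
        """
        SELECT r.title, r.date, r.link
        FROM crs_reports r
        JOIN crs_reports_bills rb ON rb.report_id = r.id
        JOIN bill b ON b.id = rb.bill_id
        WHERE b.number = ?
        ORDER BY r.date
        """,
        (bill_number,),
    )
    return rows.fetchall()


# Made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO bill (id, number) VALUES (1, '113hr2000')")
conn.execute(
    "INSERT INTO crs_reports (id, title, date, link) "
    "VALUES (1, 'Example report', '2013-06-01', '/static/crs/example.pdf')"
)
conn.execute("INSERT INTO crs_reports_bills (report_id, bill_id) VALUES (1, 1)")
print(reports_for_bill(conn, "113hr2000"))
```

A link table lets one report point to many bills and one bill to many reports, which fits reports that cite several bills at once.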

@aih
Collaborator Author

aih commented Jan 27, 2021

We will want to add a Celery task to download new reports periodically (once a week?), index them and search for bill numbers.

@aih
Collaborator Author

aih commented Jan 27, 2021

The goal is to create a table in this tab that has columns for the Bill Number, Bill (title, truncated), Congress (congress number), Report (report title), Date (report date), and Link (internal link to the pdf):
image

It should be similar (except for the report title) to the SAP table:
image

@aih
Collaborator Author

aih commented Jan 27, 2021

One example of a bill that is referenced in a CRS report is here:
https://crsreports.congress.gov/product/pdf/R/R42765
image

This mentions H.R. 2000. However, it is not immediately clear, without reading the context (May 2013 introduction date for the bill) which Congress this refers to. We could make the assumption that the references are to a bill in the same year +/- 1 as the CRS report was written. In this case, the report is from June 2013 = 113th Congress, so we'd assume (correctly) this is 113hr2000: https://www.govtrack.us/congress/bills/113/hr2000

We have the bill number as 113hr2000, and the report shows H. R. 2000. The 113 is the congress number. It is related to the year as follows: congressNumber = round_up((year - 1788)/2). So, 2020 is Congress 116, and 2021 is Congress 117. As a first pass, we will assume the report is referring to the Congress of the same year as the report.
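The year-to-Congress formula, written out as a small sketch. `candidate_congresses` is a hypothetical helper covering the looser "same year +/- 1" assumption mentioned earlier; the first-pass rule is just `congress_for_year` on the report year.

```python
import math


def congress_for_year(year):
    """congressNumber = round_up((year - 1788) / 2)."""
    return math.ceil((year - 1788) / 2)


def candidate_congresses(report_year):
    """Congress numbers for the report year and the years just before and after."""
    years = (report_year - 1, report_year, report_year + 1)
    return sorted({congress_for_year(y) for y in years})


print(congress_for_year(2020))     # 116
print(congress_for_year(2021))     # 117
print(candidate_congresses(2013))  # [112, 113]
```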

@aih
Collaborator Author

aih commented Jan 27, 2021

Another example, this time a pdf that references many bills, all in the 114th Congress, also from the year of the report (December 2015):

https://crsreports.congress.gov/product/pdf/R/R43518
image
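For reports like this one that cite many bills, an alternative to searching bill by bill is to pull the citations out of the extracted report text directly. A minimal sketch, assuming only House and Senate bills written as "H.R. 123", "H. R. 123", or "S. 123"; the pattern and the `bill_mentions` helper are illustrative, not what the project uses, and other bill types (resolutions, etc.) are not handled.

```python
import re

# The lookbehind rejects matches that start mid-abbreviation, e.g. the "S." in "U.S. 2000".
BILL_RE = re.compile(r"(?<![\w.])(H\.\s?R\.|S\.)\s?(\d{1,5})\b")


def bill_mentions(text, congress):
    """Return sorted bill ids such as '114hr2000' for citations found in text."""
    ids = set()
    for prefix, number in BILL_RE.findall(text):
        bill_type = "hr" if prefix.startswith("H") else "s"
        ids.add(f"{congress}{bill_type}{number}")
    return sorted(ids)


sample = "H.R. 2000 and H. R. 1735 were considered alongside S. 1376."
print(bill_mentions(sample, 114))
```

The Congress number passed in would come from the report date, under the same-year assumption described above.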

@aih
Collaborator Author

aih commented Jan 27, 2021

The search at Congress.gov is pretty effective at finding pdfs related by bill number:
https://crsreports.congress.gov/search/#/?termsToSearch=hr1800&orderBy=Relevance

@nkinaba nkinaba added this to the Understand the Context Section__Bill Page milestone Feb 4, 2021
@aih
Collaborator Author

aih commented Feb 14, 2021

This is now working and documented in server_py/flatgov/crs/CRS_REPORTS.adoc

image

@aih aih closed this as completed Feb 14, 2021