Add scraper for CRS reports #51
We need to:
Associating a bill with reports
When we are in production, we will use the Bills table to get a list of the bill numbers and bill titles. For now, I have provided a .json file of the form { 113hr2000: {titles: ['My long title 1', 'My title 2']...}, 113hr2001: {...} }.
Once you have the pdfs downloaded and indexed into Elasticsearch, go through this .json object one key at a time and search. For 113hr2000, look for H. R. 2000 in reports with a date in the years 2013 or 2014 (corresponding to Congress 113), so the Elasticsearch index should include a date field. Also search reports for the titles in 113hr2000.titles, and only return reports that have a high score (almost exact match).
The goal of this issue is to create a new database table, `crs`, which includes the name of the report, the date of the report, and a link to the report (which we will store locally). Then in the Bills table we will add a foreign key that references any reports that the bill is associated with.
Once we have found which CRS reports mention a bill number or title (it will not be many of them), we will add a table crs_reports to our Django database that has the name of the report, a column for associated bills, and a link to the report in our static directory.
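The lookup described above could be sketched roughly as follows. This is only an illustration, not the project's code: the field names `text` and `date`, the `min_score` threshold, and every bill-type label other than `hr` -> "H. R." are assumptions, and the function names are hypothetical. The functions only build Elasticsearch query bodies as plain dicts; sending them to a cluster is left out.

```python
import re

# Only "hr" -> "H. R." comes from the issue text; the other labels are
# assumed here for illustration and should be checked against real reports.
BILL_TYPE_LABELS = {
    "hr": "H. R.",
    "s": "S.",
    "hres": "H. Res.",
    "sres": "S. Res.",
}

BILL_KEY_RE = re.compile(r"^(\d+)([a-z]+)(\d+)$")


def parse_bill_key(key):
    """Split a key such as '113hr2000' into (congress, bill_type, number)."""
    match = BILL_KEY_RE.match(key)
    if match is None:
        raise ValueError(f"unrecognized bill key: {key!r}")
    congress, bill_type, number = match.groups()
    return int(congress), bill_type, int(number)


def congress_years(congress):
    """Return the two calendar years of a Congress (113 -> 2013, 2014)."""
    first = 1787 + 2 * congress
    return first, first + 1


def date_filter(congress, date_field):
    """Range filter limiting hits to the two years of the given Congress."""
    first, second = congress_years(congress)
    return {"range": {date_field: {"gte": f"{first}-01-01",
                                   "lte": f"{second}-12-31"}}}


def build_number_query(key, text_field="text", date_field="date"):
    """Query body: exact phrase match on the bill number, within date range."""
    congress, bill_type, number = parse_bill_key(key)
    phrase = f"{BILL_TYPE_LABELS[bill_type]} {number}"
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {text_field: phrase}}],
                "filter": [date_filter(congress, date_field)],
            }
        }
    }


def build_title_query(key, titles, min_score=10.0,
                      text_field="text", date_field="date"):
    """Query body: any title as a phrase, dropping hits below min_score."""
    congress, _, _ = parse_bill_key(key)
    return {
        "min_score": min_score,  # assumed threshold; tune on real data
        "query": {
            "bool": {
                "should": [{"match_phrase": {text_field: t}} for t in titles],
                "minimum_should_match": 1,
                "filter": [date_filter(congress, date_field)],
            }
        },
    }
```

Looping over the .json file would then call `build_number_query(key)` and `build_title_query(key, entry["titles"])` for each key and pass the result to whatever Elasticsearch client the project uses.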
We will want to add a Celery task to download new reports periodically (once a week?), index them, and search for bill numbers.
One example of a bill that is referenced in a CRS report is here: This mentions We have the bill number as 113hr2000, and the report shows
Another example, this time a pdf that references many bills, all in the 114th Congress, also from the year of the report (December 2015):
The search at Congress.gov is pretty effective at finding pdfs related by bill number:
Start with everycrsreport.com (built by Josh). Can download data using instructions on the About page. May need to limit bill citations to the summary.
NOTE: only 1 in 10 or even 1 in 100 of the reports will be relevant to us. These are only reports that include a mention of the bill title or number or other related information.
Can start by indexing CRS reports and searching by bill name & bill number.
Data from before March 2020 is better. More recent data is pdf => OCR, so it may not be as usable.
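Since only a small fraction of reports will be relevant, a cheap text scan can help sanity-check the bill-number search before or alongside the Elasticsearch index. The sketch below uses a plain regular expression rather than Elasticsearch, so it is a stand-in for the real search, not the approach described in this issue. It tolerates spacing and punctuation variants such as "H.R. 2000", "H. R. 2000" and "H.R.2000". The function names and the report dict shape (`name`, `text`) are hypothetical.

```python
import re


def bill_number_pattern(letters, number):
    """Compile a regex for a bill citation, e.g. ("H", "R"), 2000.

    Each letter may be followed by an optional period and whitespace,
    and the number must not run into further digits.
    """
    prefix = r"\.?\s*".join(re.escape(c) for c in letters)
    return re.compile(rf"\b{prefix}\.?\s*{number}\b", re.IGNORECASE)


def reports_mentioning(reports, pattern):
    """Return the names of reports whose text matches the pattern."""
    return [r["name"] for r in reports if pattern.search(r["text"])]


# Usage with made-up report text
sample_reports = [
    {"name": "report-a", "text": "This report discusses H.R. 2000 at length."},
    {"name": "report-b", "text": "See H. R. 20001 and unrelated measures."},
    {"name": "report-c", "text": "The bill (H.R.2000) was introduced in 2013."},
]
hr2000 = bill_number_pattern(("H", "R"), 2000)
matching = reports_mentioning(sample_reports, hr2000)
```

Short prefixes such as a single "S" are more likely to produce false positives with this kind of pattern, so any hits would still need the date filter and a manual spot check.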