
Organize Celery tasks #248

Closed · 7 tasks done
aih opened this issue Mar 23, 2021 · 4 comments

aih commented Mar 23, 2021

We have a number of scrapers and processing tasks. We need to make sure they run efficiently and, where possible, fetch only the latest changes.
The first step is to analyze the current scrapers and how they work. We need to know:

  1. How is the initial data loaded?
  2. How are updates made?

@ayeshamk, I'm asking Wei to organize this. He may have questions about individual scrapers.

We now have the following scrapers and processors:

  1. US Congress bill scraper
  2. Scripts to process bills and create JSON with bill metadata (I am working to update this for performance)
  3. Loading the metadata into the database
  4. CBO reports (Ayesha)
  5. CRS reports (@idmitryv)
  6. Committee Documents (crec_loader, Ayesha)
  7. Statements of Administration Policy (Ayesha)
    • This consists of a loader with fixtures for previous administrations (Trump and Obama) and a scraper for the current administration (Biden)
@weinicookpad

CREC scraper

  • There is a JSON file named crec_detail_urls.json that lists the URLs to crawl.

Example URL: https://www.govinfo.gov/wssearch/getContentDetail?packageId=CRPT-117hrpt1&granuleId=CRPT-117hrpt1

  • We send a request to each URL and parse the response for the fields below.

title, pdf_link, Category, Report Type, Report Number, Date, Committee, Associated Legislation

Category, Report Type, Report Number, Date, Committee, Associated Legislation are in the response metadata.

  • The scraper stores that data as JSON in the crec_data.json file.

  • Run the Django command ./manage.py load_crec to store it in the database.

When we run the Django command above, it calls the crec_loader function in common/crec_data.py, which stores the data in the CommitteeDocument table.
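
A rough sketch of that flow (the govinfo response keys and the exact JSON layout here are assumptions, not the project's actual parsing code):

```python
import json

import requests

def scrape_crec_details(url_file="crec_detail_urls.json", out_file="crec_data.json"):
    with open(url_file) as f:
        urls = json.load(f)

    results = []
    for url in urls:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        payload = resp.json()

        # title and pdf_link come from the document itself; the remaining
        # fields live in the response metadata (key names are illustrative).
        metadata = payload.get("metadata", {})
        results.append({
            "title": payload.get("title"),
            "pdf_link": payload.get("pdfLink"),
            "category": metadata.get("Category"),
            "report_type": metadata.get("Report Type"),
            "report_number": metadata.get("Report Number"),
            "date": metadata.get("Date"),
            "committee": metadata.get("Committee"),
            "associated_legislation": metadata.get("Associated Legislation"),
        })

    with open(out_file, "w") as f:
        json.dump(results, f, indent=2)
```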

Statements of Administration Policy Scraper

  • The scraper goes to https://www.whitehouse.gov/omb/statements-of-administration-policy/

  • It collects the URLs on that page and stores them in the ../server_py/flatgov/biden_data.json file.

  • Run the Django command ./manage.py biden_statements to store them in the database.

When we run the Django command above, it calls the load_statements function in common/biden_statements.py, which stores the data in the Statement table.
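
A minimal sketch of the URL-collection step (the link-selection heuristic is an assumption about the page structure, not the project's actual Scrapy spider):

```python
import json

import requests
from bs4 import BeautifulSoup

SAP_INDEX = "https://www.whitehouse.gov/omb/statements-of-administration-policy/"

def collect_sap_urls(out_file="../server_py/flatgov/biden_data.json"):
    resp = requests.get(SAP_INDEX, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Keep links that look like individual statements (heuristic only).
    urls = [a["href"] for a in soup.select("a[href]")
            if "statement" in a["href"].lower()]

    with open(out_file, "w") as f:
        json.dump(sorted(set(urls)), f, indent=2)
```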

CBO Scraper

  • Run the Django command ./manage.py load_cbo.

Before loading, the Django command automatically deletes all existing CBO instances in the database.

  • It stores the data in the CboReport table.
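
For reference, a minimal management-command sketch of the current wipe-and-reload behaviour (the model location and the scrape_cbo_reports() helper are assumptions):

```python
from django.core.management.base import BaseCommand

from bills.models import CboReport          # assumed app/model location
from common.cbo import scrape_cbo_reports   # hypothetical scraper helper

class Command(BaseCommand):
    help = "Reload CBO reports (current behaviour: delete everything, then recreate)"

    def handle(self, *args, **options):
        # Wipe the table first, as described above.
        CboReport.objects.all().delete()
        for item in scrape_cbo_reports():
            CboReport.objects.create(**item)
        self.stdout.write(f"Loaded {CboReport.objects.count()} CBO reports")
```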

CRS Scraper

  • See CRS_REPORT.adoc

How daily updates work

  1. We run the CRS and CBO scrapers daily using the Celery beat scheduler (see the sketch after this list).

  2. The CREC and SAP scrapers were built with Scrapy, which we will need to integrate with Django.
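
A rough sketch of the daily schedule, assuming a standard Django + Celery setup (the task paths are hypothetical, not the project's actual task names):

```python
# settings.py (assuming the Celery app is configured with namespace="CELERY")
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "crs-daily": {
        "task": "common.tasks.run_crs_scraper",  # hypothetical task path
        "schedule": crontab(hour=2, minute=0),   # every day at 02:00
    },
    "cbo-daily": {
        "task": "common.tasks.run_cbo_scraper",  # hypothetical task path
        "schedule": crontab(hour=3, minute=0),
    },
}
```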

Here is the proposed flow for the Scrapy/Django integration:

  1. The client sends a request with a URL to crawl.

  2. Django triggers Scrapy to run a spider to crawl that URL.

  3. Django returns a response telling the client that crawling has started.

  4. Scrapy completes crawling and saves the extracted data to the database.

  5. Django fetches that data from the database and returns it to the client.

This way, we no longer need Scrapy to store data in intermediate JSON files.
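
One possible way to wire steps 2–5, assuming a scrapyd instance runs alongside Django (the project/spider names and the model the spider writes to are assumptions):

```python
import requests
from django.http import JsonResponse

from scrapers.models import CrecItem  # hypothetical model populated by the spider

def start_crawl(request):
    """Steps 2-3: trigger the spider via scrapyd and acknowledge immediately."""
    target_url = request.GET.get("url")
    requests.post(
        "http://localhost:6800/schedule.json",  # scrapyd's schedule endpoint
        data={"project": "flatgov", "spider": "crec", "url": target_url},
    )
    return JsonResponse({"status": "crawl started"})

def crawl_results(request):
    """Step 5: return whatever the spider has saved so far."""
    items = list(CrecItem.objects.values("title", "pdf_link", "date"))
    return JsonResponse({"items": items})
```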


aih commented Mar 26, 2021

Notes:

  • For the CBO scraper, it seems inefficient to delete the data and recreate it. Can we change the scraper to scrape by date and only update the most recent items? Or check, before adding to the database, whether the item already exists?
  • Similarly, for the CRS scraper, let's see if we can avoid re-scraping.

For the scrapy scrapers, we also want to:

  • only scrape the most recent items
  • if we need to store to a temporary JSON file, that's OK; better, of course, if we store straight to the database


aih commented Mar 26, 2021

Also, for the crec scraper, I believe there is code that makes the crec_detail_urls.json.

@ayeshamk ?

@kapphire
Collaborator

For the CBO scraper, we don't need to delete the data and recreate it.
We can check by bill_number whether an item already exists before adding it to the database (see the sketch below).
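
A sketch of that check using Django's update_or_create with bill_number as the natural key (the other CboReport fields are assumptions):

```python
from bills.models import CboReport  # assumed app/model location

def upsert_cbo_report(item):
    # Creates the row if bill_number is new, otherwise updates it in place.
    CboReport.objects.update_or_create(
        bill_number=item["bill_number"],
        defaults={
            "title": item.get("title"),
            "pub_date": item.get("pub_date"),
            "link": item.get("link"),
        },
    )
```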

CRS scraper:
We need to store the latest URL from the CSV file while running the Celery task. That way, we can avoid duplicates in Scrapy as well.
