
Organize Celery tasks #248

Closed · 7 tasks done
aih opened this issue Mar 23, 2021 · 4 comments

aih commented Mar 23, 2021

We have a number of scrapers and processing tasks. We need to make sure they run efficiently and, where possible, fetch only the latest changes.
The first step is to analyze the current scrapers and how they work. We need to know:

  1. How is the initial data loaded?
  2. How are updates made?

@ayeshamk, I'm asking Wei to organize this. He may have questions about individual scrapers.

We now have the following scrapers and processors:

  1. US Congress bill scraper
  2. Scripts to process bills and create JSON with bill metadata (I am working to update this for performance)
  3. Loading the metadata into the database
  4. CBO reports (Ayesha)
  5. CRS reports (@idmitryv)
  6. Committee Documents (crec_loader, Ayesha)
  7. Statements of Administration Policy (Ayesha)
    • This consists of a loader with fixtures for previous administrations (Trump and Obama) and a scraper for the current administration (Biden)
@weinicookpad

CREC scraper

  • There is a JSON file named crec_detail_urls.json that lists the URLs to crawl.

Example URL: https://www.govinfo.gov/wssearch/getContentDetail?packageId=CRPT-117hrpt1&granuleId=CRPT-117hrpt1

  • We send a request to each URL and parse the response for the fields below.

title, pdf_link, Category, Report Type, Report Number, Date, Committee, Associated Legislation

Category, Report Type, Report Number, Date, Committee, Associated Legislation are in the response metadata.

  • The scraper stores that data as JSON in the crec_data.json file.

  • Run the Django command ./manage.py load_crec to store it in the database.

When we run the Django command above, it calls the crec_loader function in common/crec_data.py, which stores the data in the CommitteeDocument table.
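
A rough sketch of that flow (the govinfo response keys and the exact JSON layout here are assumptions, not the project's actual parsing code):

```python
import json

import requests

def scrape_crec_details(url_file="crec_detail_urls.json", out_file="crec_data.json"):
    with open(url_file) as f:
        urls = json.load(f)

    results = []
    for url in urls:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        payload = resp.json()

        # title and pdf_link come from the document itself; the remaining
        # fields live in the response metadata (key names are illustrative).
        metadata = payload.get("metadata", {})
        results.append({
            "title": payload.get("title"),
            "pdf_link": payload.get("pdfLink"),
            "category": metadata.get("Category"),
            "report_type": metadata.get("Report Type"),
            "report_number": metadata.get("Report Number"),
            "date": metadata.get("Date"),
            "committee": metadata.get("Committee"),
            "associated_legislation": metadata.get("Associated Legislation"),
        })

    with open(out_file, "w") as f:
        json.dump(results, f, indent=2)
```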

Statements of Administration Policy Scraper

  • The scraper goes to https://www.whitehouse.gov/omb/statements-of-administration-policy/

  • It collects the URLs on that page and stores them in the ../server_py/flatgov/biden_data.json file.

  • Run the Django command ./manage.py biden_statements to store them in the database.

When we run the Django command above, it calls the load_statements function in common/biden_statements.py, which stores the data in the Statement table.
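
A minimal sketch of the URL-collection step (the link-selection heuristic is an assumption about the page structure, not the project's actual Scrapy spider):

```python
import json

import requests
from bs4 import BeautifulSoup

SAP_INDEX = "https://www.whitehouse.gov/omb/statements-of-administration-policy/"

def collect_sap_urls(out_file="../server_py/flatgov/biden_data.json"):
    resp = requests.get(SAP_INDEX, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Keep links that look like individual statements (heuristic only).
    urls = [a["href"] for a in soup.select("a[href]")
            if "statement" in a["href"].lower()]

    with open(out_file, "w") as f:
        json.dump(sorted(set(urls)), f, indent=2)
```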

CBO Scraper

  • Run the Django command ./manage.py load_cbo.

Before loading, the Django command automatically deletes all existing CBO instances in the database.

  • It stores the data in the CboReport table.
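
For reference, a minimal management-command sketch of the current wipe-and-reload behaviour (the model location and the scrape_cbo_reports() helper are assumptions):

```python
from django.core.management.base import BaseCommand

from bills.models import CboReport          # assumed app/model location
from common.cbo import scrape_cbo_reports   # hypothetical scraper helper

class Command(BaseCommand):
    help = "Reload CBO reports (current behaviour: delete everything, then recreate)"

    def handle(self, *args, **options):
        # Wipe the table first, as described above.
        CboReport.objects.all().delete()
        for item in scrape_cbo_reports():
            CboReport.objects.create(**item)
        self.stdout.write(f"Loaded {CboReport.objects.count()} CBO reports")
```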

CRS Scraper

  • See CRS_REPORT.adoc

How daily updates work

  1. We run the CRS and CBO scrapers daily using the Celery beat scheduler (see the sketch after this list).

  2. The CREC and SAP scrapers were built with Scrapy, which we will need to integrate with Django.
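
A rough sketch of the daily schedule, assuming a standard Django + Celery setup (the task paths are hypothetical, not the project's actual task names):

```python
# settings.py (assuming the Celery app is configured with namespace="CELERY")
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "crs-daily": {
        "task": "common.tasks.run_crs_scraper",  # hypothetical task path
        "schedule": crontab(hour=2, minute=0),   # every day at 02:00
    },
    "cbo-daily": {
        "task": "common.tasks.run_cbo_scraper",  # hypothetical task path
        "schedule": crontab(hour=3, minute=0),
    },
}
```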

Here is the proposed flow for the Scrapy/Django integration:

  1. The client sends a request with a URL to crawl.

  2. Django triggers Scrapy to run a spider to crawl that URL.

  3. Django returns a response telling the client that crawling has started.

  4. Scrapy completes crawling and saves the extracted data to the database.

  5. Django fetches that data from the database and returns it to the client.

This way, we no longer need Scrapy to store data in intermediate JSON files.
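
One possible way to wire steps 2–5, assuming a scrapyd instance runs alongside Django (the project/spider names and the model the spider writes to are assumptions):

```python
import requests
from django.http import JsonResponse

from scrapers.models import CrecItem  # hypothetical model populated by the spider

def start_crawl(request):
    """Steps 2-3: trigger the spider via scrapyd and acknowledge immediately."""
    target_url = request.GET.get("url")
    requests.post(
        "http://localhost:6800/schedule.json",  # scrapyd's schedule endpoint
        data={"project": "flatgov", "spider": "crec", "url": target_url},
    )
    return JsonResponse({"status": "crawl started"})

def crawl_results(request):
    """Step 5: return whatever the spider has saved so far."""
    items = list(CrecItem.objects.values("title", "pdf_link", "date"))
    return JsonResponse({"items": items})
```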


aih commented Mar 26, 2021

Notes:

  • For the CBO scraper, it seems inefficient to delete the data and recreate it. Can we change the scraper to scrape by date and only update the most recent items? Or check, before adding to the database, whether the item already exists?
  • Similarly, for the CRS scraper, let's see if we can avoid re-scraping.

For the scrapy scrapers, we also want to:

  • only scrape the most recent items
  • if we need to store to a temporary JSON file, that's OK; better, of course, if we store straight to the database


aih commented Mar 26, 2021

Also, for the crec scraper, I believe there is code that makes the crec_detail_urls.json.

@ayeshamk ?

@kapphire
Collaborator

For the CBO scraper, we don't need to delete the data and recreate it.
We can check by bill_number whether an item already exists before adding it to the database (see the sketch below).
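
A sketch of that check using Django's update_or_create with bill_number as the natural key (the other CboReport fields are assumptions):

```python
from bills.models import CboReport  # assumed app/model location

def upsert_cbo_report(item):
    # Creates the row if bill_number is new, otherwise updates it in place.
    CboReport.objects.update_or_create(
        bill_number=item["bill_number"],
        defaults={
            "title": item.get("title"),
            "pub_date": item.get("pub_date"),
            "link": item.get("link"),
        },
    )
```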

CRS scraper:
We need to store the latest URL from the CSV file while running the Celery task. That way, we can avoid duplicates in Scrapy as well.
