
ETL #522

Closed · fgregg opened this issue Dec 17, 2019 · 7 comments

@fgregg (Collaborator) commented Dec 17, 2019

The major work of this task is to set up scheduled jobs for the scrapers, as well as scheduled jobs for the management commands that LA Metro Councilmatic is still responsible for running.

These are:

manage.py refresh_pic
manage.py compile_pdfs
manage.py convert_attachment_text 
manage.py update_index --batch-size=100 --age=1 
manage.py data_integrity 

I recommend doing these as two separate containers.
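
For illustration, a rough sketch of that split, assuming each piece is driven by its own docker-compose service (the service names here are hypothetical):

    docker-compose run --rm scrapers          # container 1: the scheduled pupa scrapes
    docker-compose run --rm councilmatic-etl  # container 2: the scheduled management commands listed above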

fgregg added the Epic label Dec 17, 2019
@hancush (Collaborator) commented Dec 26, 2019

Part 1: The Scrapers

Scraper code is not bundled with the application code, so it must be deployed separately from the application (i.e., the problem can't be solved by manually running pupa commands on the app container).

As far as I can tell, Heroku apps need to have a web process that binds to a particular port within 30 seconds, or the deployment is marked as failed. This is not the case for the scrapers, which are long-running Python processes. That means they can't be deployed as their own application to Heroku. This will probably be solved when we implement the Airflow dashboard – hurrah!

In the meantime, I used Heroku's instructions for connecting to their Postgres instances remotely to initialize the database and populate it with data from my local machine:

docker-compose run --rm -e DATABASE_URL=$(heroku config:get DATABASE_URL -a la-metro-councilmatic-staging)?sslmode=require -e DJANGO_SETTINGS_MODULE=pupa.settings scrapers

Notes

pupa dbinit us runs nearly 200k inserts. This takes a loooong time (possibly a limitation of the Postgres tier we're using?), so long that the command appeared to hang indefinitely.

To debug whether the command was actually running, I tried deconstructing dbinit into its constituent parts: a database migration and a call to the loaddivisions management command in python-opencivicdata. I noticed that loaddivisions has a --bulk option, which leverages Django's bulk_create method to create divisions in batches of 100k. So, I updated the scrapers service command to run the migration and load the divisions in bulk, to see whether it would make any difference:

sh -c 'django-admin migrate && \
    django-admin loaddivisions us --bulk && \
    pupa update --rpm=600 lametro people && \
    pupa update --rpm=600 lametro bills window=30 && \
    pupa update --rpm=600 lametro events'

It did: inserting divisions took < 1 minute. It won't be necessary to preserve this behavior, at least not for this database, but I thought it would be good to document it. The ability to specify bulk inserts in dbinit might make a good PR against pupa.

Moving forward

Barring dashboard implementation, if it's important to run scrapes on a recurring basis for the test deployment, I think I'd recommend authenticating with Heroku on an EC2 instance (the Metro server, perhaps?) and running the scrapers service in a cronjob.
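
As a sketch of what that cronjob might look like (the schedule, working directory, log path, and availability of the Heroku CLI on cron's PATH are all assumptions):

    # hypothetical hourly entry on the EC2 instance
    0 * * * * cd /path/to/scrapers-us-municipal && docker-compose run --rm -e DJANGO_SETTINGS_MODULE=pupa.settings -e DATABASE_URL="$(heroku config:get DATABASE_URL -a la-metro-councilmatic-staging)?sslmode=require" scrapers >> /tmp/lametro.log 2>&1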

Heroku does seem to provide some facilities for scheduling work and executing background processes. With that said, as we haven't done much (any?) R&D with this stack, I'd like to use research time, rather than client hours, to assess them. Additionally, we'd need to package the scraper code with the application or do additional R&D on deploying images with Heroku's container registry, namely: Is it possible to use both a prebuilt image (scraper code) and leverage Heroku's build pipeline (app code) to define services within a single application?

Moreover, if we are going to implement Airflow, which provides its own facilities for interacting with containers and scheduling work, all of this might be redundant, at least where Metro's concerned.

@hancush (Collaborator) commented Dec 26, 2019

Part 2: The Management Commands

Since the management commands are part of the application code, I defined a metro_etl task that runs all five pieces of the ETL pipeline. This allows us to hook into Heroku's free scheduler to run the pipeline; however, we probably want to run it in concert with the scrapers, which the scheduler can't help us achieve. And again, scheduling will eventually be handled by the ETL dashboard, so we probably don't want to invest too much time in another approach.
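
For reference, a minimal sketch of what the metro_etl task boils down to, assuming it simply chains the five management commands from the top of this issue (the actual task definition may differ):

    python manage.py refresh_pic && \
    python manage.py compile_pdfs && \
    python manage.py convert_attachment_text && \
    python manage.py update_index --batch-size=100 --age=1 && \
    python manage.py data_integrity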

With all of this in mind, let's have a conversation about how often we need the Metro data pipeline to run for the test deployment (i.e., whether it needs to be scheduled and automated) and identify options from there.

@hancush (Collaborator) commented Dec 26, 2019

Some options:

  • If we decide to use EC2 to schedule scrapes hooking into the remote Postgres instance, we could extend that cronjob to run the ETL pipeline, as well.
  • We could install docker-compose in the Metro container, then either extend the metro_etl task to start with scraping (using the containerized command) or, if it's easy to link scheduled tasks, define a separate metro_scrape task. There are some limitations to the scheduler (namely, that it occasionally misses a task execution, and that tasks not completed before the next run begins are terminated), but it would give us some basic scheduling functionality without a whole lot of overhead (I think).
  • We could run the scrape and ETL pipeline by hand, when needed.

@fgregg (Collaborator, Author) commented Jan 2, 2020 via email

@hancush (Collaborator) commented Jan 2, 2020

@fgregg Thanks for your input!

As we're on a limited time budget, I'd prefer not to re-deploy the test site. I'm also enjoying this opportunity to pilot Heroku for a more complicated application.

As an alternative, I propose deploying the scrapers to an EC2 instance, as you suggest, and scraping into the remote Heroku database. Since this is temporary, I will authenticate with Heroku using my credentials on the server so we can access the app's database URL environment variable.
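
Roughly, the flow on the server would look something like this (a sketch only; the exact Heroku CLI install and login steps depend on the instance):

    heroku login                                                      # authenticate with personal credentials
    heroku config:get DATABASE_URL -a la-metro-councilmatic-staging   # confirm the app's database URL is retrievable

The scrapers can then be pointed at that URL the same way as in the docker-compose command above.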

Meanwhile, since it's not important for the scrape and ETL to be coordinated, we can use the Heroku scheduler for the application ETL.

Thoughts?

@hancush (Collaborator) commented Jan 6, 2020

For posterity, I've scheduled the Councilmatic ETL pipeline to run every 20 minutes with the free Heroku Scheduler integration.

Meanwhile, I've manually deployed the metro-upgrade-deploy branch of our scrapers fork to a dedicated EC2 instance called Metro Upgrade (Public DNS: ec2-3-93-23-235.compute-1.amazonaws.com). The configs for this server are under version control. @fgregg, you should have shell access. I have not yet enabled CI for the scrapers because I'm having trouble with the updated crontab. It doesn't appear to be running (no logs in the /tmp/lametro.log file).
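
A couple of standard first checks for a silent crontab (the syslog path here assumes a Debian/Ubuntu instance):

    crontab -l                            # confirm the updated entry is actually installed for the expected user
    grep CRON /var/log/syslog | tail -20  # see whether cron attempted to run the job at all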

@fgregg (Collaborator, Author) commented Jan 7, 2020

I've got it all working. datamade/scrapers-us-municipal@cdc0d81

Didn't set up CI. Don't think we really want it right now, tbh.
