
ETL #522

Closed · fgregg opened this issue Dec 17, 2019 · 7 comments

@fgregg (Collaborator) commented Dec 17, 2019

The major work of this task is to set up scheduled jobs for the scrapers, as well as scheduled jobs for the management commands that LA Metro Councilmatic is still responsible for running.

These are:

manage.py refresh_pic
manage.py compile_pdfs
manage.py convert_attachment_text 
manage.py update_index --batch-size=100 --age=1 
manage.py data_integrity 

I recommend doing these as two separate containers.
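
For illustration, a rough sketch of that split, assuming each piece is driven by its own docker-compose service (the service names here are hypothetical):

    docker-compose run --rm scrapers          # container 1: the scheduled pupa scrapes
    docker-compose run --rm councilmatic-etl  # container 2: the scheduled management commands listed above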

fgregg added the Epic label Dec 17, 2019
@hancush (Collaborator) commented Dec 26, 2019

Part 1: The Scrapers

Scraper code is not bundled with the application code, so it must be deployed separately from the application (i.e., the problem can't be solved by manually running pupa commands on the app container).

As far as I can tell, Heroku apps need to have a web process that binds to a particular port within 30 seconds, or the deployment is marked as failed. This is not the case for the scrapers, which are long-running Python processes. That means they can't be deployed as their own application to Heroku. This will probably be solved when we implement the Airflow dashboard – hurrah!

In the meantime, I used Heroku's instructions for connecting to their Postgres instances remotely to initialize the database and populate it with data from my local machine:

docker-compose run --rm -e DATABASE_URL=$(heroku config:get DATABASE_URL -a la-metro-councilmatic-staging)?sslmode=require -e DJANGO_SETTINGS_MODULE=pupa.settings scrapers

Notes

pupa dbinit us runs nearly 200k inserts. This takes a loooong time (possibly a limitation of the Postgres tier we're using?), so long that the command appeared to hang indefinitely.

To debug whether the command was actually running, I tried deconstructing dbinit into its constituent parts: a database migration and a call to the loaddivisions management command in python-opencivicdata. I noticed that loaddivisions has a --bulk option, which leverages Django's bulk_create method to create divisions in batches of 100k. So, I updated the scrapers service command to run the migration and load the divisions in bulk, to see whether it would make any difference:

sh -c 'django-admin migrate && \
    django-admin loaddivisions us --bulk && \
    pupa update --rpm=600 lametro people && \
    pupa update --rpm=600 lametro bills window=30 && \
    pupa update --rpm=600 lametro events'

It did: inserting divisions took < 1 minute. It won't be necessary to preserve this behavior, at least not for this database, but I thought it would be good to document it. The ability to specify bulk inserts in dbinit might make a good PR against pupa.

Moving forward

Barring dashboard implementation, if it's important to run scrapes on a recurring basis for the test deployment, I think I'd recommend authenticating with Heroku on an EC2 instance (the Metro server, perhaps?) and running the scrapers service in a cronjob.
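
As a sketch of what that cronjob might look like (the schedule, working directory, log path, and availability of the Heroku CLI on cron's PATH are all assumptions):

    # hypothetical hourly entry on the EC2 instance
    0 * * * * cd /path/to/scrapers-us-municipal && docker-compose run --rm -e DJANGO_SETTINGS_MODULE=pupa.settings -e DATABASE_URL="$(heroku config:get DATABASE_URL -a la-metro-councilmatic-staging)?sslmode=require" scrapers >> /tmp/lametro.log 2>&1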

Heroku does seem to provide some facilities for scheduling work and executing background processes. With that said, as we haven't done much (any?) R&D with this stack, I'd like to use research time, rather than client hours, to assess them. Additionally, we'd need to package the scraper code with the application or do additional R&D on deploying images with Heroku's container registry, namely: Is it possible to use both a prebuilt image (scraper code) and leverage Heroku's build pipeline (app code) to define services within a single application?

Moreover, if we are going to implement Airflow, which provides its own facilities for interacting with containers and scheduling work, all of this might be redundant, at least where Metro's concerned.

@hancush (Collaborator) commented Dec 26, 2019

Part 2: The Management Commands

Since the management commands are part of the application code, I defined a metro_etl task that runs all five pieces of the ETL pipeline. This allows us to hook into Heroku's free scheduler to run the pipeline; however, we probably want to run it in concert with the scrapers, which the scheduler can't help us achieve. And again, scheduling will eventually be handled by the ETL dashboard, so we probably don't want to invest too much time in another approach.
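
For reference, a minimal sketch of what the metro_etl task boils down to, assuming it simply chains the five management commands from the top of this issue (the actual task definition may differ):

    python manage.py refresh_pic && \
    python manage.py compile_pdfs && \
    python manage.py convert_attachment_text && \
    python manage.py update_index --batch-size=100 --age=1 && \
    python manage.py data_integrity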

With all of this in mind, let's have a conversation about how often we need the Metro data pipeline to run for the test deployment (i.e., whether it needs to be scheduled and automated) and identify options from there.

@hancush (Collaborator) commented Dec 26, 2019

Some options:

  • If we decide to use EC2 to schedule scrapes hooking into the remote Postgres instance, we could extend that cronjob to run the ETL pipeline, as well.
  • We could install docker-compose in the Metro container, then either extend the metro_etl task to start with scraping (using the containerized command) or, if it's easy to link scheduled tasks, define a separate metro_scrape task. There are some limitations to the scheduler (namely, that it occasionally misses a task execution, and that tasks not completed before the next run begins are terminated), but it would give us some basic scheduling functionality without a whole lot of overhead (I think).
  • We could run the scrape and ETL pipeline by hand, when needed.

@fgregg (Collaborator, Author) commented Jan 2, 2020 via email

@hancush (Collaborator) commented Jan 2, 2020

@fgregg Thanks for your input!

As we're on a limited time budget, I'd prefer not to re-deploy the test site. I'm also enjoying this opportunity to pilot Heroku for a more complicated application.

As an alternative, I propose deploying the scrapers to an EC2 instance, as you suggest, and scraping into the remote Heroku database. Since this is temporary, I will authenticate with Heroku using my credentials on the server so we can access the app's database URL environment variable.
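
Roughly, the flow on the server would look something like this (a sketch only; the exact Heroku CLI install and login steps depend on the instance):

    heroku login                                                      # authenticate with personal credentials
    heroku config:get DATABASE_URL -a la-metro-councilmatic-staging   # confirm the app's database URL is retrievable

The scrapers can then be pointed at that URL the same way as in the docker-compose command above.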

Meanwhile, since it's not important for the scrape and ETL to be coordinated, we can use the Heroku scheduler for the application ETL.

Thoughts?

@hancush (Collaborator) commented Jan 6, 2020

For posterity, I've scheduled the Councilmatic ETL pipeline to run every 20 minutes with the free Heroku Scheduler integration.

Meanwhile, I've manually deployed the metro-upgrade-deploy branch of our scrapers fork to a dedicated EC2 instance called Metro Upgrade (Public DNS: ec2-3-93-23-235.compute-1.amazonaws.com). The configs for this server are under version control. @fgregg, you should have shell access. I have not yet enabled CI for the scrapers because I'm having trouble with the updated crontab. It doesn't appear to be running (no logs in the /tmp/lametro.log file).
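
A couple of standard first checks for a silent crontab (the syslog path here assumes a Debian/Ubuntu instance):

    crontab -l                            # confirm the updated entry is actually installed for the expected user
    grep CRON /var/log/syslog | tail -20  # see whether cron attempted to run the job at all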

@fgregg (Collaborator, Author) commented Jan 7, 2020

I've got it all working. datamade/scrapers-us-municipal@cdc0d81

Didn't set up CI. Don't think we really want it right now, tbh.
