ETL #522
**Part 1: The Scrapers**

Scraper code is not bundled with the application code, so it must be deployed separately from the application (i.e., the problem can't be solved by manually running pupa commands on the app container). As far as I can tell, Heroku apps need to have a web process that binds to a particular port within 30 seconds, else deployments are marked as failed. That is not the case for the scrapers, which are long-running Python processes, so they can't be deployed as their own application to Heroku. This will probably be solved when we implement the Airflow dashboard – hurrah! In the meantime, I used Heroku's instructions for connecting to their Postgres instances remotely to initialize the database and populate it with data from my local machine:

```bash
docker-compose run --rm -e DATABASE_URL=$(heroku config:get DATABASE_URL -a la-metro-councilmatic-staging)?sslmode=require -e DJANGO_SETTINGS_MODULE=pupa.settings scrapers
```

**Notes**

To debug whether the command was actually running, I tried deconstructing it:

```bash
sh -c 'django-admin migrate && \
django-admin loaddivisions us --bulk && \
pupa update --rpm=600 lametro people && \
pupa update --rpm=600 lametro bills window=30 && \
pupa update --rpm=600 lametro events'
```

It did: inserting divisions took < 1 minute. It will not be necessary to preserve this behavior, at least not for this database, but I thought it would be good to document the ability to specify bulk inserts in `loaddivisions`.

**Moving forward**

Barring dashboard implementation, if it's important to run scrapes on a recurring basis for the test deployment, I'd recommend authenticating with Heroku on an EC2 instance (the Metro server, perhaps?) and running the scrape command there. Heroku does seem to provide some facilities for scheduling work and executing background processes. With that said, since we haven't done much (any?) R&D with this stack, I'd like to use research time, rather than client hours, to assess them. Additionally, we'd need to package the scraper code with the application or do additional R&D on deploying images with Heroku's container registry, namely: is it possible to use both a prebuilt image (scraper code) and Heroku's build pipeline (app code) to define services within a single application? Moreover, if we are going to implement an Airflow dashboard, which provides its own facilities for interacting with containers and scheduling work, all of this might be redundant, at least where Metro's concerned.
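A quick way to confirm a scrape actually landed in the remote database is to query it over the same Heroku connection. A minimal sketch, assuming the Heroku CLI is logged in, the staging app name above, and that the standard opencivicdata table names apply:

```bash
# Sanity check: count scraped bills in the remote Heroku Postgres instance.
# Assumes the Heroku CLI is authenticated and the app name used above; the table
# name follows opencivicdata's conventions and may differ in practice.
heroku pg:psql -a la-metro-councilmatic-staging \
  --command "SELECT count(*) FROM opencivicdata_bill;"
```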
**Part 2: The Management Commands**

Since the management commands are part of the application code, I defined a `metro_etl` task for them. With all of this in mind, let's have a conversation about how often we need the Metro data pipeline to run for the test deployment (i.e., whether it needs to be scheduled and automated) and identify options from there.
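For a sense of what such a task boils down to: a Heroku Scheduler job is just a shell command run in a one-off dyno. A minimal sketch, where the management command names are hypothetical placeholders rather than the project's actual ETL commands:

```bash
# Hypothetical shape of a scheduled ETL job; the command names below are
# placeholders, not the real Councilmatic management commands.
python manage.py refresh_metro_data && python manage.py rebuild_search_index
```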
Some options:

- If we decide to use EC2 to schedule scrapes hooking into the remote Postgres instance, we could extend that cronjob to run the ETL pipeline, as well.
- We could install docker-compose in the Metro container, then either extend the `metro_etl` task to start with scraping (using the containerized command) or, if it's easy to link scheduled tasks, define a separate `metro_scrape` task. There are some limitations to the scheduler (namely that it sometimes – but rarely – misses task execution, and tasks that are not completed before the next one begins are terminated), but it would give us some basic scheduling functionality without a whole lot of overhead (I think).
- We could run the scrape and ETL pipeline by hand, when needed.
Few thoughts:

At this stage, we are hoping to shake out any bugs, so we want the ETL patterns to be similar to what they will be in production.

It is our expectation that when we go to the Airflow pattern, the ETL jobs will run sequentially (or in a DAG); however, it is also our intention to get Councilmatic 2.5 into production before we go to the Airflow pattern.

If we keep the LA Metro scrapers in our current scraper setup, it means that the scrapers and LA Metro Councilmatic will be on separate servers, and we should not expect any ordering between the scrapers and LA Metro Councilmatic. I think this is the pattern we should test for. Alternatively, we could run everything on the same server for LA and guarantee sequentiality, but I think it's best not to add a different pattern, in production, for how we do ETL.

I recommend that we schedule a run of the scraper and, separately, schedule a run of all the other LA Metro ETL bits and bobs.

I think a pattern that would be familiar to us would be to do this on EC2. We could just deploy the LA Metro 2.5 instance and use the docker-compose.yml file along with crontabs that run `docker-compose run --rm scrapers` and `docker-compose run --rm other_metro_etl`, as sketched below.
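A minimal sketch of what those crontab entries might look like, assuming the project is checked out at `/home/ubuntu/la-metro-councilmatic` and that the second service is named `other_metro_etl`; the path and schedules here are illustrative, not prescriptive:

```bash
# Illustrative crontab entries for the EC2 pattern described above.
# The checkout path, schedules, and the `other_metro_etl` service name are assumptions.

# Scrape on the hour.
0 * * * * cd /home/ubuntu/la-metro-councilmatic && docker-compose run --rm scrapers

# Run the rest of the Metro ETL on the half hour.
30 * * * * cd /home/ubuntu/la-metro-councilmatic && docker-compose run --rm other_metro_etl
```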
@fgregg Thanks for your input! As we're on a limited time budget, I'd prefer not to re-deploy the test site, and I'm also enjoying this opportunity to pilot Heroku for a more complicated application. As an alternative, I propose deploying the scrapers to an EC2 instance, as you suggest, and scraping into the remote Heroku DB. Since it's temporary, I will authenticate with Heroku using my credentials on the server so we can access the database URL environment variable for the app. Meanwhile, since it's not important for the scrape and ETL to be coordinated, we can use the Heroku Scheduler for the application ETL. Thoughts?
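Concretely, the EC2-side scrape could reuse the same `heroku config:get` pattern as the local run in Part 1. A rough sketch, assuming the Heroku CLI is already authenticated on the instance and the scrapers' docker-compose.yml is present there:

```bash
# Sketch of a scrape run on an EC2 host against the remote Heroku database.
# Assumes `heroku login` has already been completed on this machine and that the
# docker-compose.yml defines a `scrapers` service, as in the local run above.
DATABASE_URL="$(heroku config:get DATABASE_URL -a la-metro-councilmatic-staging)?sslmode=require"
docker-compose run --rm \
  -e DATABASE_URL="$DATABASE_URL" \
  -e DJANGO_SETTINGS_MODULE=pupa.settings \
  scrapers
```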
For posterity, I've scheduled the Councilmatic ETL pipeline to run every 20 minutes with the free Heroku Scheduler integration. Meanwhile, I've manually deployed the scrapers.
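For reference, a sketch of provisioning the scheduler from the CLI, assuming the staging app name used earlier in the thread (the job command and frequency are then configured in the add-on's dashboard):

```bash
# Provision the free Heroku Scheduler add-on and open its dashboard to define the job.
# Assumes the staging app name used earlier in this thread.
heroku addons:create scheduler:standard -a la-metro-councilmatic-staging
heroku addons:open scheduler -a la-metro-councilmatic-staging
```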
I've got it all working: datamade/scrapers-us-municipal@cdc0d81. Didn't set up CI. Don't think we really want it right now, tbh.
The major work of this task is to set up scheduled jobs for the scrapers, and scheduled jobs for the management commands that LA Metro Councilmatic is still responsible for. These are:

I recommend doing these as two separate containers.