This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Investigate Glue Crawlers and Workflows #15

Open
dacort opened this issue Sep 4, 2019 · 2 comments

Comments

@dacort
Contributor

dacort commented Sep 4, 2019

Crawlers can now use existing tables as a crawler source, which may give us the ability to deprecate our custom partitioning code that searches S3 for new partitions.
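To make this concrete, here is a sketch of what a crawler with an existing-table (catalog) source might look like, expressed as the parameter payload for Glue's `create_crawler` call. All names (crawler, role, database, table) are placeholders, not anything from this project; note that catalog-target crawlers require the delete behavior to be `LOG`.

```python
# Parameters for a crawler that uses a pre-created catalog table as its
# source, so the crawler only manages partitions and we keep control of
# the table name and schema. Names below are hypothetical placeholders.
crawler_params = {
    "Name": "raw-logs-partition-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "Targets": {
        "CatalogTargets": [
            {"DatabaseName": "logs", "Tables": ["alb_raw"]}
        ]
    },
    # Catalog-target crawlers must LOG (not delete) missing objects.
    "SchemaChangePolicy": {
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG",
    },
}
# import boto3
# boto3.client("glue").create_crawler(**crawler_params)
```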

In combination with Workflows, we could easily trigger a Crawler to run after our job is finished.

@davehowell

I've done this previously and it works well. In CloudFormation or Terraform, define the Glue database, the table, and a crawler that depends on that table; then at the end of the Glue script, after the job.commit(), run something like this. Super easy!

import boto3

# ${region} and ${glue_crawler_name} are template variables substituted
# in by CloudFormation/Terraform when the script is rendered.
glue_client = boto3.client('glue', region_name='${region}')
glue_client.start_crawler(Name='${glue_crawler_name}')
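One caveat worth adding (from boto3's documented behavior, not from the thread): start_crawler raises CrawlerRunningException if the crawler is already mid-run, so a job that can be retried might want to tolerate that:

```python
def start_crawler_safely(glue_client, crawler_name):
    """Start a Glue crawler, tolerating the case where it is already running.

    boto3's start_crawler raises CrawlerRunningException when a run is in
    progress; here we treat that as "already started" rather than a failure.
    """
    try:
        glue_client.start_crawler(Name=crawler_name)
        return True   # we started a new run
    except glue_client.exceptions.CrawlerRunningException:
        return False  # a run was already in progress
```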

@dacort
Contributor Author

dacort commented May 5, 2021

I'm looking back into this again, as noted in #23.

Probably the part of this project that I was least happy with (but also kind of proud of 😆 ) was the partition management portion. We couldn't originally use Glue Crawlers because we wanted to control the table names and already knew the schemas, but now we can pre-create the tables and use the Crawlers to update the partitions.

This, to me, seems like a better approach than managing custom partitioning logic inside the job itself, but it does have the downside of a more complex workflow. Instead of having a single job that manages raw and converted tables and partitions, we would need to have the following:

  • Source crawler for adding new partitions
  • Job for handling the conversion
  • Destination crawler for converted data
  • Workflow for orchestrating the above
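The four pieces above could be wired together as one Glue Workflow whose triggers chain crawler → job → crawler. A sketch of the trigger definitions (as `create_trigger` payloads) follows; the workflow, crawler, job names, and the schedule are all hypothetical:

```python
# A workflow with three triggers: a schedule starts the source crawler,
# its success starts the conversion job, and the job's success starts
# the destination crawler. All names are placeholders.
workflow_name = "log-conversion"

triggers = [
    {
        "Name": "start-source-crawler",
        "WorkflowName": workflow_name,
        "Type": "SCHEDULED",
        "Schedule": "cron(0 * * * ? *)",  # hourly, for example
        "Actions": [{"CrawlerName": "source-crawler"}],
    },
    {
        "Name": "run-conversion-job",
        "WorkflowName": workflow_name,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "source-crawler",
            "CrawlState": "SUCCEEDED",
        }]},
        "Actions": [{"JobName": "conversion-job"}],
    },
    {
        "Name": "start-destination-crawler",
        "WorkflowName": workflow_name,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "conversion-job",
            "State": "SUCCEEDED",
        }]},
        "Actions": [{"CrawlerName": "destination-crawler"}],
    },
]

# import boto3
# glue = boto3.client("glue")
# glue.create_workflow(Name=workflow_name)
# for trigger in triggers:
#     glue.create_trigger(**trigger)
```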

And with the addition of Blueprints, we could essentially package this all up. Blueprints can take a set of parameters (see screenshot below) and then you can create a Workflow from the Blueprint.

Combining Blueprints with Workflows and pre-configured Crawlers would probably cut 80% of the code in this project, which would be a fantastic result. The more components of Glue I can successfully leverage, the better.

A couple notes on running Crawlers on existing tables:

  • If you run the crawler without the appropriate classifier, it removes the schema.
  • If you run the crawler with schema updates enabled, it will change the partition names.
  • It seems like it only adds the partitions (at least for ALB) when we completely ignore schema updates. I tried "add new columns only" and it still didn't add the partitions, but I may need to try recreating the table and crawler from scratch.
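For reference, "completely ignore schema updates" corresponds to setting both halves of the crawler's schema change policy to LOG. A sketch of the `update_crawler` payload (crawler name is a placeholder):

```python
# Schema change policy that leaves the pre-created schema and partition
# keys alone, so the crawler only adds partitions. Name is hypothetical.
schema_policy_update = {
    "Name": "source-crawler",
    "SchemaChangePolicy": {
        "UpdateBehavior": "LOG",  # don't rewrite columns or partition names
        "DeleteBehavior": "LOG",  # don't drop anything from the catalog
    },
}
# import boto3
# boto3.client("glue").update_crawler(**schema_policy_update)
```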
