Common Crawl infrastructure updates meta ticket #445
Labels:
- 🔒 staff only - Restricted to CC staff members
- 🙅 status: discontinued - Not suitable for work as repo is in maintenance
The main goal of these updates will be to reduce the number of technologies we're using to process Common Crawl data.
The first step should be to move the AWS Data Pipeline jobs into Apache Airflow.
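As a rough first sketch, a migrated job could look something like the DAG below. The DAG id, schedule, and fetch callable are hypothetical placeholders, not the actual Data Pipeline job definitions:

```python
# Minimal sketch of one Data Pipeline job migrated to Airflow.
# All names here (dag_id, task_id, the callable) are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_common_crawl_batch():
    # Placeholder for the work the AWS Data Pipeline job does today,
    # e.g. pulling the latest Common Crawl data and staging it.
    pass


with DAG(
    dag_id="common_crawl_ingest",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_common_crawl_batch",
        python_callable=fetch_common_crawl_batch,
    )
```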
Next, we should experiment to see whether it's worth getting rid of the AWS Glue job as well.
Once these steps are done, all of our data pipelines (at least the ones that bring data from the internet into our catalog) will live in one unified location, with a single interface where we can see their status at a glance.
Further, we will have a bit more flexibility and control over dependencies between the jobs. Currently, the flow is a sequence of three steps, scheduled independently across two AWS services.

However, if the first of these steps fails, the second two still try to run, costing time and resources. Tracking down the source of the failure also requires comparing logs across the two AWS services. Both problems would be addressed by bringing these steps into the same Apache Airflow DAG, or into a number of DAGs with dependencies defined between them.
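For illustration, here is a minimal sketch of that flow as a single Airflow DAG; the task names and commands are hypothetical stand-ins for the real jobs:

```python
# Sketch of the three-step flow as one Airflow DAG. With the default
# "all_success" trigger rule, a failure in step_1 stops step_2 and
# step_3 from running, and all three tasks log to the same place.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="common_crawl_flow",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    step_1 = BashOperator(task_id="step_1", bash_command="echo ingest")
    step_2 = BashOperator(task_id="step_2", bash_command="echo transform")
    step_3 = BashOperator(task_id="step_3", bash_command="echo load")

    # Each step only runs after the previous one succeeds.
    step_1 >> step_2 >> step_3
```

If we'd rather keep the steps in separate DAGs, Airflow's ExternalTaskSensor or TriggerDagRunOperator can express the same dependencies across DAG boundaries.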
Lastly, bringing these pipelines into Airflow will reduce our dependency on AWS-only services, and make our work more approachable and visible to people not on CC Staff.