This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Common Crawl infrastructure updates meta ticket #445

Closed
kgodey opened this issue Jun 26, 2020 · 1 comment
Labels
🔒 staff only Restricted to CC staff members 🙅 status: discontinued Not suitable for work as repo is in maintenance

Comments

@kgodey
Contributor

kgodey commented Jun 26, 2020

The main goal of these updates will be to reduce the number of technologies we're using to process Common Crawl data.

The first step should be to move the AWS Data Pipeline jobs into Apache Airflow.

Next, we should evaluate whether it's worth replacing the AWS Glue job as well.

Once these steps are done, all of our data pipelines (at least the ones bringing data from the internet into our catalog) will live in one unified location, with a single interface where we can see their status at a glance.

Further, we will have a bit more flexibility and control over dependencies between the jobs. Currently, we have a flow like:

AWS Data Pipeline 1 -> AWS Glue -> AWS Data Pipeline 2

However, if the first of these steps fails, the other two still try to run, wasting time and resources. Further, tracking down the source of the failure requires comparing logs across two different AWS services. This would be improved by bringing these steps into the same Apache Airflow DAG, or into a number of DAGs with dependencies defined between them.
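The failure-propagation benefit can be sketched in plain Python (this is illustrative logic, not actual Airflow code; Airflow gives you this behavior for free once task dependencies are declared):

```python
# Minimal sketch: with explicit dependencies, a failure in an upstream
# step prevents the downstream steps from running at all.

def run_chain(steps):
    """Run steps in order; stop at the first failure.

    `steps` is a list of (name, callable) pairs. Returns a dict mapping
    each step name to "success", "failed", or "skipped".
    """
    results = {}
    failed = False
    for name, task in steps:
        if failed:
            results[name] = "skipped"
            continue
        try:
            task()
            results[name] = "success"
        except Exception:
            results[name] = "failed"
            failed = True
    return results


def data_pipeline_1():
    # Simulated failure in the first step of the current flow.
    raise RuntimeError("simulated failure")


results = run_chain([
    ("aws_data_pipeline_1", data_pipeline_1),
    ("aws_glue", lambda: None),
    ("aws_data_pipeline_2", lambda: None),
])
print(results)
# The Glue and second-pipeline steps are skipped, unlike the current setup
# where they would still run and burn resources.
```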

Lastly, bringing these pipelines into Airflow will reduce our dependency on AWS-only services, and make our work more approachable and visible to people not on CC Staff.

@kgodey kgodey created this issue from a note in Active Sprint (Ticket Work Required) Jun 26, 2020
@kgodey kgodey added this to Pending Review in Backlog Jun 26, 2020
@mathemancer
Contributor

As an intermediate step, it might make sense to create an Airflow DAG with the following steps:

  1. Run the PySpark job defined by the first data pipeline
  2. Call AWS Glue to run the intermediate transformation job that it handles
  3. Run the PySpark job defined by the second data pipeline

Once this is set up, we'll already have many of the visibility advantages of bringing things into Airflow, and we won't have to deal with replacing the AWS Glue transformations with our own code (which might be a pain).

@annatuma annatuma removed this from Pending Review in Backlog Jul 10, 2020
@kgodey kgodey removed this from Ticket Work Required in Active Sprint Aug 24, 2020
@kgodey kgodey added this to Pending Review in Backlog via automation Aug 24, 2020
@kgodey kgodey moved this from Pending Review to Q4 2020 in Backlog Aug 24, 2020
@kgodey kgodey moved this from Q4 2020 to Q3 2020 in Backlog Aug 24, 2020
@kgodey kgodey moved this from Q3 2020 to Next Sprint in Backlog Sep 18, 2020
@annatuma annatuma removed this from Next Sprint in Backlog Sep 21, 2020
@annatuma annatuma added this to Ready for Development in Active Sprint via automation Sep 21, 2020
@kgodey kgodey added 🔒 staff only Restricted to CC staff members 🧹 status: ticket work required Needs more details before it can be worked on and removed CC staff only labels Sep 22, 2020
@mathemancer mathemancer removed the 🧹 status: ticket work required Needs more details before it can be worked on label Nov 2, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
@kgodey kgodey added 🙅 status: discontinued Not suitable for work as repo is in maintenance and removed 🏷 status: label work required Needs proper labelling before it can be worked on labels Dec 16, 2020
@kgodey kgodey closed this as completed Dec 16, 2020
Active Sprint automation moved this from Ready for Development to Done Dec 16, 2020
@TimidRobot TimidRobot removed this from Done in Active Sprint Jan 12, 2022