Common Crawl infrastructure updates meta ticket #445
Labels:
- 🔒 staff only - Restricted to CC staff members
- 🙅 status: discontinued - Not suitable for work as repo is in maintenance
The main goal of these updates will be to reduce the number of technologies we're using to process Common Crawl data.
The first step should be to move the AWS Data Pipeline jobs into Apache Airflow.
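As a rough first sketch, a migrated job could look something like the DAG below. The DAG id, schedule, and fetch callable are hypothetical placeholders, not the actual Data Pipeline job definitions:

```python
# Minimal sketch of one Data Pipeline job migrated to Airflow.
# All names here (dag_id, task_id, the callable) are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_common_crawl_batch():
    # Placeholder for the work the AWS Data Pipeline job does today,
    # e.g. pulling the latest Common Crawl data and staging it.
    pass


with DAG(
    dag_id="common_crawl_ingest",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_common_crawl_batch",
        python_callable=fetch_common_crawl_batch,
    )
```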
Next, we should experiment to see whether it's worth getting rid of the AWS Glue job as well.
Once these steps are done, all of our data pipelines (at least the ones that bring data from the internet into our catalog) will live in one unified location, with a single interface where we can see their status at a glance.
Further, we will have a bit more flexibility and control over dependencies between the jobs. Currently, the flow is a sequence of three steps, scheduled independently across two AWS services.

However, if the first of these steps fails, the second two still try to run, costing time and resources. Tracking down the source of the failure also requires comparing logs across the two AWS services. Both problems would be addressed by bringing these steps into the same Apache Airflow DAG, or into a number of DAGs with dependencies defined between them.
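For illustration, here is a minimal sketch of that flow as a single Airflow DAG; the task names and commands are hypothetical stand-ins for the real jobs:

```python
# Sketch of the three-step flow as one Airflow DAG. With the default
# "all_success" trigger rule, a failure in step_1 stops step_2 and
# step_3 from running, and all three tasks log to the same place.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="common_crawl_flow",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    step_1 = BashOperator(task_id="step_1", bash_command="echo ingest")
    step_2 = BashOperator(task_id="step_2", bash_command="echo transform")
    step_3 = BashOperator(task_id="step_3", bash_command="echo load")

    # Each step only runs after the previous one succeeds.
    step_1 >> step_2 >> step_3
```

If we'd rather keep the steps in separate DAGs, Airflow's ExternalTaskSensor or TriggerDagRunOperator can express the same dependencies across DAG boundaries.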
Lastly, bringing these pipelines into Airflow will reduce our dependency on AWS-only services, and make our work more approachable and visible to people not on CC Staff.