This repository contains an Apache Airflow Directed Acyclic Graph (DAG) designed to automate the extraction, transformation, and loading (ETL) of flight departure data from the Airlabs API into an AWS S3 bucket. The workflow is hosted on an Amazon EC2 instance.
Here is a link to my Medium post describing my process and the insights that can be drawn from the data:
https://medium.com/inst414-data-science-tech/exploring-the-busiest-times-at-jfk-airport-f25454671a07
- DAG File: departures_dag.py
- Python File: departures_analysis.py

This DAG, named departures_dag, is scheduled to run hourly and is designed to perform the following tasks:
- Check Airlabs API Availability: Utilizes an HttpSensor to ensure the Airlabs departure API is reachable before proceeding with data extraction.
- Extract Flight Departure Data: A SimpleHttpOperator fetches data from the Airlabs API for a specific airport (JFK in this example). The extraction focuses on fields such as departure and arrival times, flight numbers, delays, and other relevant flight details.
- Transform and Load Data: A custom Python function (transform_load_data) transforms the extracted JSON data into a structured format suitable for analytics. It processes each flight's departure time, terminal, airline, and other attributes, then appends the new data to an existing CSV file in an AWS S3 bucket. The process includes deduplication and sorting of the records. Sketches of the task wiring and of this transform follow below.
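A minimal sketch of how these three tasks might be wired together in departures_dag. The endpoint path, task IDs, the Variable name for the API key, and the import of transform_load_data from departures_analysis.py are assumptions; the actual DAG file may differ:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.http.sensors.http import HttpSensor

# Assumed: the transform/load callable lives in departures_analysis.py on the DAG path.
from departures_analysis import transform_load_data

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="departures_dag",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:

    # 1. Confirm the Airlabs departures endpoint is reachable (endpoint path is assumed).
    is_api_available = HttpSensor(
        task_id="is_departure_api_available",
        http_conn_id="departure_api",
        endpoint="schedules?dep_iata=JFK&api_key={{ var.value.airlabs_api_key }}",
    )

    # 2. Pull the raw departures JSON for JFK and push it to XCom.
    extract_departures = SimpleHttpOperator(
        task_id="extract_departure_data",
        http_conn_id="departure_api",
        endpoint="schedules?dep_iata=JFK&api_key={{ var.value.airlabs_api_key }}",
        method="GET",
        response_filter=lambda response: response.json(),
        log_response=True,
    )

    # 3. Flatten the JSON and append it to the CSV in S3.
    transform_load = PythonOperator(
        task_id="transform_load_departure_data",
        python_callable=transform_load_data,
    )

    is_api_available >> extract_departures >> transform_load
```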
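And a hedged sketch of what transform_load_data might look like, based only on the description above (flatten the JSON, append to an existing CSV in S3, deduplicate, sort). The S3 path is a placeholder, the field names are a best guess at the Airlabs schedules schema, and reading/writing s3:// paths with pandas assumes s3fs and AWS credentials are available on the worker:

```python
import pandas as pd


def transform_load_data(task_instance):
    """Flatten the Airlabs response and append it to the departures CSV in S3."""
    # Pull the raw JSON pushed by the extract task (task_id is an assumption).
    data = task_instance.xcom_pull(task_ids="extract_departure_data")

    # Keep only the fields used for analysis; names are assumed from the Airlabs schema.
    records = [
        {
            "flight_number": f.get("flight_iata"),
            "airline": f.get("airline_iata"),
            "terminal": f.get("dep_terminal"),
            "departure_time": f.get("dep_time"),
            "arrival_time": f.get("arr_time"),
            "delay_minutes": f.get("delayed"),
        }
        for f in data.get("response", [])
    ]
    new_df = pd.DataFrame(records)

    # Append to the existing CSV in S3 (bucket/key are placeholders), then
    # deduplicate and sort before writing back.
    s3_path = "s3://your-bucket/departures/jfk_departures.csv"
    existing_df = pd.read_csv(s3_path)
    combined = (
        pd.concat([existing_df, new_df], ignore_index=True)
        .drop_duplicates()
        .sort_values("departure_time")
    )
    combined.to_csv(s3_path, index=False)
```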
- Apache Airflow 2.0+
- AWS S3 bucket access
- Airlabs API key
- Airflow Environment: Ensure Airflow is installed and properly configured on your EC2 instance. Refer to the Apache Airflow documentation for installation and setup instructions.
- AWS S3 Bucket: Ensure you have access to an AWS S3 bucket and that Airflow has the necessary permissions to read from and write to this bucket.
- Airlabs API Key: Replace the placeholder API key in the DAG file with your actual Airlabs API key.
- Airflow Connections: Set up the http_conn_id='departure_api' connection in Airflow to point to the Airlabs API endpoint. Ensure the connection includes the base URL and any necessary authentication headers; a sketch of registering this connection programmatically follows below.
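The connection can be created in the Airflow UI or CLI; as one option, here is a sketch of registering it programmatically. The host shown is an assumed Airlabs base URL, so adjust it to the endpoint you actually call:

```python
from airflow import settings
from airflow.models import Connection

# HTTP connection used by the sensor and extract tasks.
# The host is an assumed Airlabs base URL.
conn = Connection(
    conn_id="departure_api",
    conn_type="http",
    host="https://airlabs.co/api/v9/",
)

session = settings.Session()
# Only add the connection if it does not already exist.
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
session.close()
```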
After configuring the prerequisites and setting up the necessary connections in Airflow, enable the departures_dag DAG in the Airflow UI. The workflow will execute according to its schedule, processing new flight departure data hourly.
Credentials: Ensure that your Airlabs API key and AWS credentials are securely stored and managed within Airflow. Use Airflow's built-in secrets management to handle sensitive information.
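For example, rather than hard-coding the key in the DAG file, it can be stored as an Airflow Variable (or in a configured secrets backend) and read at runtime; the variable name below is an assumption:

```python
from airflow.models import Variable

# Read the Airlabs API key from an Airflow Variable instead of hard-coding it.
# "airlabs_api_key" is an assumed name; create it under Admin -> Variables
# or in your secrets backend.
AIRLABS_API_KEY = Variable.get("airlabs_api_key")

# In templated operator fields the same value can be referenced as
# {{ var.value.airlabs_api_key }} without importing Variable at all.
```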