Datapipeline to extract SubReddit using Reddit-API - AWS - Airflow

devallasaitej/Reddit_API_AWS_pipeline

Reddit API - AWS - Python - Airflow Data Pipeline

Overview

The objective of this project is to orchestrate a data pipeline with Airflow, running in Docker, that acquires data from the subreddit r/dataengineering through Reddit's API, cleanses the acquired data, and finally loads reporting-level data into an Amazon RDS MySQL table.
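
As an illustration of the cleanse and aggregate stages described above, here is a minimal sketch in plain Python. It is not the repository's code: the field names (`id`, `title`, `score`, `num_comments`, `created_utc`) are assumed to be the post attributes pulled from Reddit, and the function names `cleanse` and `aggregate_by_day`, along with the per-day aggregation itself, are hypothetical choices made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Assumed post attributes; the actual pipeline may select different fields.
REQUIRED_FIELDS = ("id", "title", "score", "num_comments", "created_utc")


def cleanse(raw_posts):
    """Drop incomplete and duplicate posts, then normalise field types."""
    seen_ids = set()
    clean = []
    for post in raw_posts:
        # Skip posts that are missing any required field.
        if any(post.get(field) is None for field in REQUIRED_FIELDS):
            continue
        # Skip posts already seen (same id).
        if post["id"] in seen_ids:
            continue
        seen_ids.add(post["id"])
        created = datetime.fromtimestamp(float(post["created_utc"]), tz=timezone.utc)
        clean.append({
            "id": post["id"],
            "title": post["title"].strip(),
            "score": int(post["score"]),
            "num_comments": int(post["num_comments"]),
            "created_date": created.date().isoformat(),
        })
    return clean


def aggregate_by_day(clean_posts):
    """Build one reporting row per UTC day: post count, total score, total comments."""
    totals = defaultdict(lambda: {"posts": 0, "score": 0, "comments": 0})
    for post in clean_posts:
        day = totals[post["created_date"]]
        day["posts"] += 1
        day["score"] += post["score"]
        day["comments"] += post["num_comments"]
    return [{"created_date": d, **totals[d]} for d in sorted(totals)]
```

In an Airflow setup, functions like these would typically be wrapped in separate tasks so that extraction, cleansing, and loading can be retried independently.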

Platforms Used

  1. Airflow: Workflow orchestration platform
  2. AWS S3: Object storage service that holds the raw, cleansed, and aggregated forms of the data
  3. AWS RDS: Relational Database Service that stores the final aggregated, reporting-layer data in a table
  4. AWS IAM: Identity and Access Management service used to create roles for accessing AWS S3
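
One way to keep the raw, cleansed, and aggregated data separate in S3 is to give each layer its own key prefix. The sketch below is an assumption made for illustration only: the layer names mirror the list above, but the key layout, the `s3_key` helper, and the CSV extension are hypothetical and not taken from the repository.

```python
from datetime import date

# Layer names follow the three forms of data listed above.
LAYERS = ("raw", "cleansed", "aggregated")


def s3_key(layer, subreddit, run_date, ext="csv"):
    """Build an object key such as 'raw/dataengineering/2024-03-02.csv' (hypothetical layout)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{layer}/{subreddit}/{run_date.isoformat()}.{ext}"
```

With date-stamped keys, rerunning the pipeline for a given day overwrites that day's objects instead of creating duplicates.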

Data Pipeline Architecture

[Image: Reddit_datapipeline architecture diagram]

Airflow DAG

[Image: screenshot of the Airflow DAG, 2024-03-02]

Final RDS Table snapshot

[Image: snapshot of the final RDS table]
