airflow_aws_justwatch_pipeline

Data pipeline using Airflow, GraphQL, AWS S3, AWS Glue Jobs and AWS Redshift. The objective is to create a database in AWS Redshift with data from JustWatch.com, a website through which it is possible to watch movies and shows provided by many streaming services (Netflix, YouTube, Amazon Prime and many others). For this project, I considered only the 10 most watched services:

Amazon Prime Video
Apple TV Plus
Crunchyroll
Disney Plus
Hulu
Netflix
Paramount Plus
Peacock
Tubi TV
YouTube

The pipeline follows the steps below:

Extract data from the API endpoint 'https://apis.justwatch.com/graphql'.
Save the raw data into JSON files.
Upload the JSON files to an AWS S3 bucket.
Trigger AWS Glue Jobs that normalize the titles data and write it to Parquet files. Example: one title has many production countries in the raw data. The Glue Job separates these production countries from the raw data and writes them to Parquet files that will be read into a Redshift database.
Load these Parquet files into a Redshift database.
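
The steps above can be sketched in plain Python. This is a minimal illustration, not the code from the dags or aws_scripts folders. The query string, the function names, and the field names `id` and `productionCountries` are assumptions made for the example. The real GraphQL query is not shown in this README, and the actual Glue Jobs may normalize the data differently. The S3 and Glue calls assume boto3 is installed and AWS credentials are configured.

```python
import json
import urllib.request
from pathlib import Path

GRAPHQL_URL = "https://apis.justwatch.com/graphql"

# Placeholder query (assumption): the real query and its fields are not in this README.
PLACEHOLDER_QUERY = "query { __typename }"


def build_request(query, variables=None):
    """Step 1: build a POST request carrying a GraphQL query as a JSON body."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(query, variables=None, timeout=30):
    """Step 1: send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(query, variables), timeout=timeout) as resp:
        return json.load(resp)


def save_raw(payload, path):
    """Step 2: write the raw response to a JSON file, creating folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return path


def upload_and_trigger(path, bucket, key, job_name):
    """Steps 3 and 4: upload the file to S3, then start a Glue Job run.

    bucket, key and job_name are hypothetical values supplied by the caller.
    """
    import boto3  # third-party; imported lazily so the other helpers work without it

    boto3.client("s3").upload_file(str(path), bucket, key)
    return boto3.client("glue").start_job_run(JobName=job_name)["JobRunId"]


def explode_countries(title):
    """Illustrates the normalization in step 4: one row per production country.

    The field names "id" and "productionCountries" are assumed for this example.
    """
    return [
        {"title_id": title["id"], "country": country}
        for country in title.get("productionCountries", [])
    ]
```

For example, `explode_countries({"id": "t1", "productionCountries": ["US", "GB"]})` yields two rows, one per country, which is the shape a normalized table of production countries would need. In the project itself, Airflow tasks would call steps like these in order.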

Repository contents:

aws_scripts
dags
dockerfile/airflow
README.md
docker-compose.yaml

danrbueno/airflow_aws_justwatch_pipeline
