# Scheduled data pipeline to load VAT rates

**WORKFLOW MANAGEMENT TOOL SELECTION**

<img src="resources/images/etl-comparison.png" width="850"/>

**ARCHITECTURE DEFINITION**

- **Harbor**: An open-source, cloud-native registry that stores, signs, and scans container images for vulnerabilities.
- **AWS Batch**: Dynamically allocates the resources needed to run the container, so no machine has to stay running between pipeline runs.
- **CloudWatch**: Monitoring and observability service. Used to trigger the pipeline every day at 00:00 UTC.
- **Snowflake**: Provides a Python connector, so the pipeline can run SQL queries from code to transform and aggregate data.

![title](resources/images/pipeline-architecture.png)
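
The daily 00:00 UTC trigger from the architecture list can be written as an AWS schedule expression. The snippet below is a minimal sketch: the constant name `DAILY_MIDNIGHT_UTC` and the helper `next_run` are hypothetical and only illustrate when such a rule would next fire; nothing here creates the rule itself.

```python
from datetime import datetime, timedelta, timezone

# AWS schedule-expression syntax for "every day at 00:00 UTC".
# Fields: minutes hours day-of-month month day-of-week year.
DAILY_MIDNIGHT_UTC = "cron(0 0 * * ? *)"


def next_run(now: datetime) -> datetime:
    """Return the first 00:00 UTC strictly after `now` (hypothetical helper).

    `now` must be timezone-aware.
    """
    now_utc = now.astimezone(timezone.utc)
    midnight_today = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today + timedelta(days=1)


if __name__ == "__main__":
    # A run started at 18:52:20 UTC-5 on 2022-03-02 is 23:52:20 UTC the same day
    started = datetime(2022, 3, 2, 18, 52, 20, tzinfo=timezone(timedelta(hours=-5)))
    print(next_run(started).isoformat())  # 2022-03-03T00:00:00+00:00
```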

In [1]:
import json
import requests
import pandas as pd
from datetime import datetime
from prefect import task, Flow, Parameter

In [2]:
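@task
def extract(url: str) -> dict:
    """Fetch the VAT rates JSON document from `url`."""
    # A timeout keeps a stalled connection from hanging the flow run
    res = requests.get(url, timeout=30)
    # Fail on 4xx/5xx responses instead of parsing an error page
    res.raise_for_status()
    if not res.content:
        raise ValueError('No data fetched')
    return res.json()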

In [3]:
@task
def transform(data: dict) -> dict:
    """
    Transform data:
      - Create a dataframe with the rates information and the last_updated date value
      - Replace missing values with 0
      - Cast numeric columns to float
      - Split the dataframe into countries and VAT rates

    Returns a dict with two dataframes under the keys 'countries' and 'vat_rates'.
    """
    return {'countries': pd.DataFrame([]), 'vat_rates': pd.DataFrame([])}
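
The `transform` task above is still a stub. The following is a minimal sketch of the four steps listed in its docstring, written as a plain function so it can be tried outside a flow. It assumes a payload with a top-level `last_updated` string and a `rates` mapping keyed by country code, where each entry has a `country` name and several `*_rate*` fields, and where an unavailable rate arrives as `False` or null. That shape, the `SAMPLE` data, the function name `transform_sketch`, and the `country_code` column are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample mimicking the assumed payload shape; the values are
# illustrative and are not taken from the real rates file.
SAMPLE = {
    "last_updated": "2022-01-01T00:00Z",
    "rates": {
        "XA": {"country": "Examplia", "standard_rate": 20.0,
               "reduced_rate": 10.0, "super_reduced_rate": False},
        "XB": {"country": "Testland", "standard_rate": 25.0,
               "reduced_rate": False, "super_reduced_rate": 4.0},
    },
}


def transform_sketch(data: dict) -> dict:
    """Plain-function sketch of the steps listed in transform's docstring."""
    # 1. One row per country code, plus the file-level last_updated value
    df = pd.DataFrame(
        [{"country_code": code, **info} for code, info in data["rates"].items()]
    )
    df["last_updated"] = data["last_updated"]

    rate_cols = [c for c in df.columns if "_rate" in c]
    for col in rate_cols:
        # 2. Missing rates (assumed to arrive as False or null) become 0
        # 3. Every rate column is cast to float
        df[col] = df[col].map(
            lambda v: 0.0 if v is False or pd.isna(v) else float(v)
        ).astype(float)

    # 4. Split into a countries frame and a VAT rates frame
    countries = df[["country_code", "country"]]
    vat_rates = df[["country_code", *rate_cols, "last_updated"]]
    return {"countries": countries, "vat_rates": vat_rates}


if __name__ == "__main__":
    result = transform_sketch(SAMPLE)
    print(result["countries"])
    print(result["vat_rates"].dtypes)
```

Keeping `country_code` in both frames gives the load step a key to join on; whether the real tables are keyed that way is not stated in this notebook.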

In [4]:
@task
def load(countries: pd.DataFrame, rates: pd.DataFrame, path: str) -> None:
    """
        Create a script that inserts the data into VATRates.
        If a country has been added or removed, the Country table should be updated as well.
    """

In [5]:
def prefect_flow():
    with Flow(name='vat_rates_etl_pipeline') as flow:
        param_url = Parameter(name='rates_url', required=True)
        vat_info = extract(url=param_url)
        data = transform(vat_info)
        load(countries=data['countries'], rates=data['vat_rates'], path='')
    return flow

In [6]:
if __name__ == '__main__':
    flow = prefect_flow()
    flow.run(parameters={
        'rates_url': 'https://raw.githubusercontent.com/benbucksch/eu-vat-rates/ce7a777b7bbdcc94e352d816282647271f6baebf/rates.json'
    })

[2022-03-02 18:52:20-0500] INFO - prefect.FlowRunner | Beginning Flow run for 'vat_rates_etl_pipeline'
[2022-03-02 18:52:20-0500] INFO - prefect.TaskRunner | Task 'rates_url': Starting task run...
[2022-03-02 18:52:20-0500] INFO - prefect.TaskRunner | Task 'rates_url': Finished task run for task with final state: 'Success'
[2022-03-02 18:52:20-0500] INFO - prefect.TaskRunner | Task 'extract': Starting task run...
[2022-03-02 18:52:21-0500] INFO - prefect.TaskRunner | Task 'extract': Finished task run for task with final state: 'Success'
[2022-03-02 18:52:21-0500] INFO - prefect.TaskRunner | Task 'transform': Starting task run...
[2022-03-02 18:52:21-0500] INFO - prefect.TaskRunner | Task 'transform': Finished task run for task with final state: 'Success'
[2022-03-02 18:52:21-0500] INFO - prefect.TaskRunner | Task 'transform['countries']': Starting task run...
[2022-03-02 18:52:21-0500] INFO - prefect.TaskRunner | Task 'transform['countries']': Finished task run for task with final stat