# Challenge

## Problem statement

🧩 Challenge: Public Data ETL

🎯 Objective: Build an ETL pipeline in Apache Airflow that:

Extracts data from a public dataset.

Transforms the data by cleaning, filtering, and unifying information.

Loads the result into a PostgreSQL database (or saves it as Parquet).

Schedules the pipeline to run daily (simulating incremental ingestion).
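One way to simulate incremental ingestion, independent of the scheduler, is a watermark: each run loads only the rows newer than the latest date already stored. The sketch below uses only the standard library. The `sales` table, its columns, and the `incremental_load` helper are hypothetical names chosen for illustration; they are not part of the challenge.

```python
import sqlite3

def incremental_load(conn, rows):
    """Insert only rows whose sale_date is newer than the stored watermark."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, model TEXT, units INTEGER)"
    )
    # Watermark: the latest date already loaded (None on the first run).
    watermark = conn.execute("SELECT MAX(sale_date) FROM sales").fetchone()[0]
    # ISO-8601 date strings compare correctly as plain strings.
    new_rows = [r for r in rows if watermark is None or r[0] > watermark]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", new_rows)
    conn.commit()
    return len(new_rows)

conn = sqlite3.connect(":memory:")
day1 = [("2024-01-01", "X5", 10), ("2024-01-02", "i4", 7)]
day2 = day1 + [("2024-01-03", "X3", 4)]  # the source now holds one extra day

first_run = incremental_load(conn, day1)   # loads both rows
second_run = incremental_load(conn, day2)  # loads only the new row
total_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Re-running the same batch inserts nothing, so a daily schedule can safely re-read the full source each time.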

## Install dependencies

In [None]:
%pip install kagglehub==0.3.13
%pip install mlflow==3.5.1
%pip install pandas==2.3.3

### Dependencies and imports

In [None]:
from pathlib import Path
import sys
sys.path.insert(0, str(Path("desafio").resolve()))

import mlflow
import pandas as pd
import os

from src.public_dataset.extract import extract
from src.public_dataset.transform import transform
from src.public_dataset.load import load

### Constants

In [None]:
KAGGLE_HUB_DATASET = "ahmadrazakashif/bmw-worldwide-sales-records-20102024"
MLFLOW_EXPERIMENT_NAME = "etl_public_dataset"
MLFLOW_RUN_NAME = "jupyter_run"
ARTIFACTS_PATH = "artifacts"
DB_URL = "sqlite:///data/debug.db"
DB_TABLE = "desafio"

Path("data").mkdir(parents=True, exist_ok=True)

## MLflow - Configuration and utilities

In [None]:
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)
run_id = mlflow.start_run(run_name=MLFLOW_RUN_NAME).info.run_id
mlflow.end_run()

os.makedirs(ARTIFACTS_PATH, exist_ok=True)  # create the folder if it does not exist

def log_artifact(df: pd.DataFrame, name: str):
    """Save `df` as a Parquet file under ARTIFACTS_PATH and log it to MLflow."""
    path = os.path.join(ARTIFACTS_PATH, f"{name}.parquet")
    df.to_parquet(path, index=False)
    mlflow.log_artifact(path, artifact_path="data")

### Download the dataset

In [None]:
df_raw = extract(run_id=run_id, dataset_kaggle_hub=KAGGLE_HUB_DATASET)
mlflow.end_run()

### Explore the dataset

In [None]:
df_raw.info()
display("Dimensions:", df_raw.shape)
display(df_raw.describe())
display("Null values:", df_raw.isnull().sum().sort_values(ascending=False))
df_raw.head()

### Clean the dataset

In [None]:
df_clear = transform(run_id=run_id)
mlflow.end_run()
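The implementation of `transform` lives in `src.public_dataset.transform` and is not shown here. As a rough illustration of the cleaning, filtering, and unifying described in the problem statement, the sketch below does the same kind of work on plain dictionaries using only the standard library. The `clean_records` helper and the sample columns are hypothetical and may differ from what the real module does.

```python
def clean_records(records):
    """Normalize text fields, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Filter: skip rows with any missing (None or empty) value.
        if any(v is None or v == "" for v in rec.values()):
            continue
        # Clean: trim and upper-case strings so "x5 " and "X5" unify.
        norm = {
            k: v.strip().upper() if isinstance(v, str) else v
            for k, v in rec.items()
        }
        # Unify: keep only the first occurrence of each normalized row.
        key = tuple(sorted(norm.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(norm)
    return cleaned

raw = [
    {"model": "x5 ", "region": "Europe", "units": 10},
    {"model": "X5", "region": "europe", "units": 10},  # duplicate once normalized
    {"model": "i4", "region": None, "units": 7},       # incomplete row
    {"model": "X3", "region": "Asia", "units": 4},
]
cleaned = clean_records(raw)
```

With pandas, the equivalent steps would typically rely on `dropna` and `drop_duplicates`; the dictionary version is used here only to keep the example dependency-free.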

## Store these values as an artifact in MLflow

In [None]:
load(run_id=run_id, target_table=DB_TABLE, engine_url=DB_URL, schema=None)
mlflow.end_run()
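After the load step, a quick sanity check can confirm that the target table exists and contains rows. The sketch below assumes a SQLite target, as in `DB_URL`, and uses only the standard library. The `table_row_count` helper is hypothetical, and the demo runs against an in-memory database rather than `data/debug.db`.

```python
import sqlite3

def table_row_count(conn, table):
    """Return the row count of `table`, or None if the table does not exist."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if exists is None:
        return None
    # Identifiers cannot be bound as parameters, so quote the name explicitly.
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]

conn = sqlite3.connect(":memory:")
missing = table_row_count(conn, "desafio")  # table not created yet

conn.execute("CREATE TABLE desafio (model TEXT, units INTEGER)")
conn.executemany("INSERT INTO desafio VALUES (?, ?)", [("X5", 10), ("i4", 7)])
loaded = table_row_count(conn, "desafio")
```

To check the notebook's own output, the same helper could be pointed at `sqlite3.connect("data/debug.db")` once `load` has run.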