# Data processing pipeline: demonstration

<br>

> This pipeline processes the hydrological data end to end, from extraction from the database to the generation of a final file ready for analysis and modeling. Each step is designed to ensure data quality and integrity and to enrich the data, making it easier to use in data science and machine learning tasks.

## 1. Importing libraries

<br>

**In this step, all libraries and functions needed to process the data are imported.**

- *Libraries such as pandas and numpy are essential for data manipulation and analysis.*
- *The pipeline's utility functions are imported from the processing module (`data_process.py`).*

> Making sure every dependency is loaded is essential for the following steps to work correctly.

In [1]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
db_path = os.path.abspath(os.path.join(os.getcwd(), "..", "db"))
if db_path not in sys.path:
    sys.path.append(db_path)

# Import all processing functions from data_process.py
from ml.data_process import (
    load_all_data_from_db,
    format_db_data_columns,
    print_missing_values,
    fill,
    encode_date,
    resample_data,
    feature_engineering,
)

In [2]:
PROCESSED_PATH = "../data/ANA HIDROWEB/RIO MEIA PONTE/processed.csv"

## 2. Importing and formatting data

<br>


**The raw data is extracted directly from the relational database, which stores information about sensors, monitoring stations, and river segments.**

- *The extraction goes through all available measurements, consolidating data from different sensors (rain, level, flow) and stations along the river segments.*

- *Each DataFrame column represents a variable measured by a specific sensor at a given station and segment, mirroring the structure of the database tables (`SensorMeasurement`, `Sensor`, `MonitoringStation`, `RiverSegment`).*

- *After extraction, column names are standardized to make access and manipulation easier, for example `rain_upstream` and `level_downstream`. This makes the variable, its position along the river, and the data type easy to identify at a glance, hiding the technical details of the database identifiers.*

- *Formatting also keeps only the columns relevant to the analysis, discarding redundant or unused data.*


> The goal is to leave the data in a clean, standardized, and consistent format, ready for missing-value handling and the remaining pipeline transformations.


In [3]:
# --- Load and format data from the database ---

data = load_all_data_from_db()
data = format_db_data_columns(data)

data.head()

| datetime | rain_upstream | flow_upstream | level_upstream | rain_downstream | flow_downstream | level_downstream | rain_after | flow_after | level_after |
|---|---|---|---|---|---|---|---|---|---|
| 2010-02-04 21:00:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 64.8 | 289.0 |
| 2010-02-04 22:00:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 64.8 | 289.0 |
| 2010-02-04 23:00:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 64.2 | 288.0 |
| 2010-02-05 00:00:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 64.8 | 289.0 |
| 2010-02-05 01:00:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 64.8 | 289.0 |


## 3. Handling missing values

<br>

**In this step, missing values in the dataset are identified and handled.**

- *First, missing values are counted per column for diagnosis.*
- *Missing values in rain columns are filled with zero, since a missing record usually indicates no precipitation.*
- *For level and flow columns, backward fill (bfill) is used, which propagates the next available observation back into the gap.*

> Handling missing values properly is essential to avoid distorting the analyses and to keep the pipeline robust.
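
The two filling rules above can be sketched as follows. This is a minimal illustration, not the actual `fill` from `data_process.py`: the function name `fill_missing` and the selection of rain columns by the `rain_` prefix are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the pipeline's `fill` helper."""
    out = df.copy()
    # Assumption: rain columns can be recognized by their "rain_" prefix.
    rain_cols = [c for c in out.columns if c.startswith("rain_")]
    other_cols = [c for c in out.columns if c not in rain_cols]
    # A missing rain record is treated as no precipitation.
    out[rain_cols] = out[rain_cols].fillna(0.0)
    # Level and flow take the next available observation (backward fill).
    out[other_cols] = out[other_cols].bfill()
    return out


index = pd.date_range("2024-01-01", periods=4, freq="h", tz="UTC")
sample = pd.DataFrame(
    {
        "rain_upstream": [np.nan, 0.2, np.nan, 0.0],
        "level_upstream": [np.nan, 101.0, np.nan, 103.0],
    },
    index=index,
)
filled = fill_missing(sample)
print(filled)
```

Backward fill alone cannot fill gaps at the very end of a series, because there is no later observation to copy from. A production version would likely need a fallback for that case.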

In [4]:
# Count missing values in each column
print_missing_values(data)

Missing values in each column:
rain_upstream        43969
flow_upstream        72821
level_upstream       70096
rain_downstream      42796
flow_downstream      44537
level_downstream     44542
rain_after          290930
flow_after          277952
level_after         266295
dtype: int64


In [5]:
data = fill(data)
print_missing_values(data.loc["2024"])

Missing values in each column:
Series([], dtype: int64)


In [6]:
# Check if the index is ordered
is_ordered = data.index.is_monotonic_increasing
print(f"Data is ordered by index: {is_ordered}")

Data is ordered by index: True


## 4. Resampling the data to daily frequency

<br>

**The data is resampled to daily frequency, aggregating the multiple measurements taken on the same day.**

- *Statistics such as the mean, maximum, minimum, and quantiles (25% and 75%) are computed for each variable on each day.*
- *After resampling, the missing-value handling is applied again to make sure the daily data is complete.*

> Resampling standardizes the granularity of the data, making it suitable for time-series analysis and modeling.
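
As a rough sketch of this aggregation, the snippet below resamples an hourly series to daily statistics with pandas. The helper names `q25`, `q75`, and `resample_daily` are hypothetical. The real `resample_data` also accepts a `fill_func` argument, which this sketch leaves out.

```python
import numpy as np
import pandas as pd


def q25(values: pd.Series) -> float:
    """25th percentile; the function name becomes the column suffix."""
    return values.quantile(0.25)


def q75(values: pd.Series) -> float:
    """75th percentile; the function name becomes the column suffix."""
    return values.quantile(0.75)


def resample_daily(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical daily aggregation with mean, max, min and quartiles."""
    daily = df.resample("D").agg(["mean", "max", "min", q25, q75])
    # Flatten the (variable, statistic) MultiIndex into names like flow_upstream_mean.
    daily.columns = [f"{variable}_{stat}" for variable, stat in daily.columns]
    daily.index.name = "date"
    return daily


index = pd.date_range("2024-01-01", periods=48, freq="h", tz="UTC")
hourly = pd.DataFrame({"flow_upstream": np.arange(48, dtype=float)}, index=index)
daily = resample_daily(hourly)
print(daily)
```

Days with no measurements produce NaN in every statistic, which is one reason the filling step has to run again after resampling.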

In [None]:
# Resample the data to daily frequency and aggregate using the new resample_data function
data_resampled = resample_data(data, fill_func=fill)

In [8]:
print_missing_values(data_resampled)

Missing values in each column:
Series([], dtype: int64)


In [9]:
data_resampled.loc["2024"].head()

| date | rain_upstream_mean | rain_upstream_max | rain_upstream_min | rain_upstream_q25 | rain_upstream_q75 | flow_upstream_mean | flow_upstream_max | flow_upstream_min | flow_upstream_q25 | flow_upstream_q75 | ... | flow_after_mean | flow_after_max | flow_after_min | flow_after_q25 | flow_after_q75 | level_after_mean | level_after_max | level_after_min | level_after_q25 | level_after_q75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-01-01 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 34.632292 | 45.3 | 26.8 | 31.125 | 37.6 | ... | 195.152083 | 247.4 | 77.8 | 134.4 | 243.7 | 309.333333 | 360.0 | 184.0 | 254.0 | 357.0 |
| 2024-01-02 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32.297872 | 46.5 | 27.2 | 28.4 | 34.075 | ... | 117.896809 | 154.2 | 72.3 | 83.5 | 146.275 | 233.031915 | 275.0 | 176.0 | 192.0 | 266.75 |
| 2024-01-03 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 35.730208 | 42.8 | 30.4 | 32.9 | 38.475 | ... | 113.429167 | 132.6 | 95.6 | 106.7 | 122.35 | 229.770833 | 252.0 | 208.0 | 222.0 | 240.5 |
| 2024-01-04 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 33.115625 | 38.4 | 27.2 | 30.275 | 36.15 | ... | 107.172917 | 131.7 | 85.7 | 97.1 | 118.025 | 222.020833 | 251.0 | 195.0 | 210.0 | 235.5 |
| 2024-01-05 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.752083 | 28.2 | 20.8 | 21.4 | 26.0 | ... | 90.875 | 102.7 | 82.8 | 85.0 | 97.9 | 201.71875 | 217.0 | 191.0 | 194.0 | 211.0 |


<!-- ## Feature engineering

#### New features:

- chuva_acumulada_2_dias: sum of precipitation over the last 2 days
- chuva_acumulada_3_dias: sum of precipitation over the last 3 days
- dias_sem_chuva: number of days without rain
- variacao_chuva: rate of change of precipitation relative to the previous day
- variacao_nivel: rate of change of the river level relative to the previous day
- variacao_vazao: rate of change of the flow relative to the previous day
- Day-of-year encodings: sine and cosine to capture seasonality -->


## 5. Feature engineering

<br>

**In this step, new variables (features) are created that enrich the dataset and capture relevant temporal patterns.**

- *Accumulated rainfall over 2 and 3 days is computed for the different stations, making it possible to identify prolonged precipitation events.*
- *Dates are encoded with sine and cosine functions, which helps capture seasonality and annual patterns.*
- *The year is included as an explicit variable for temporal analyses.*

> Enriching the data with new features improves the predictive capacity of machine learning models and statistical analyses.
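
One way these features could be computed is sketched below. The helper names `add_accumulated_rain` and `encode_day_of_year` are hypothetical. Two choices are assumptions rather than confirmed details of `feature_engineering` and `encode_date`: the rolling sum runs over the daily mean with `min_periods=1`, and the angle is `2π · (day_of_year − 1) / days_in_year`. The sample rain values are the `rain_upstream_mean` entries for 2024-12-27 to 2024-12-31 from the notebook output.

```python
import numpy as np
import pandas as pd


def add_accumulated_rain(df: pd.DataFrame, station: str, windows=(2, 3)) -> pd.DataFrame:
    """Hypothetical helper: rolling sums of the daily mean rainfall."""
    out = df.copy()
    source = f"rain_{station}_mean"
    for days in windows:
        # Assumption: partial windows at the start of the series are allowed.
        out[f"rain_{station}_acc_{days}_days"] = (
            out[source].rolling(days, min_periods=1).sum()
        )
    return out


def encode_day_of_year(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cyclical day-of-year encoding plus the year."""
    out = df.copy()
    idx = out.index
    day = idx.dayofyear.to_numpy()
    days_in_year = np.where(idx.is_leap_year, 366, 365)
    # January 1st maps to angle 0, so sin = 0 and cos = 1 on that day.
    angle = 2 * np.pi * (day - 1) / days_in_year
    out["date_sin"] = np.sin(angle)
    out["date_cos"] = np.cos(angle)
    out["year"] = idx.year.to_numpy()
    return out


index = pd.date_range("2024-12-27", periods=5, freq="D", tz="UTC", name="date")
daily = pd.DataFrame(
    {"rain_upstream_mean": [0.01875, 0.0, 0.122581, 0.0, 0.004167]}, index=index
)
features = encode_day_of_year(add_accumulated_rain(daily, "upstream"))
print(features.round(6))
```

Using sine and cosine together keeps December 31 and January 1 close in feature space, which a raw day-of-year number would not.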

In [10]:
# Feature engineering
data_resampled = feature_engineering(data_resampled, encode_date_func=encode_date)
data_resampled.tail()

| date | rain_upstream_mean | rain_upstream_max | rain_upstream_min | rain_upstream_q25 | rain_upstream_q75 | flow_upstream_mean | flow_upstream_max | flow_upstream_min | flow_upstream_q25 | flow_upstream_q75 | ... | level_after_q75 | rain_upstream_acc_2_days | rain_downstream_acc_2_days | rain_after_acc_2_days | rain_upstream_acc_3_days | rain_downstream_acc_3_days | rain_after_acc_3_days | date_sin | date_cos | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-05-28 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.95125 | 12.01 | 9.05 | 10.72 | 11.62 | ... | 111.25 | 0.0 | 0.002083 | 0.002083 | 0.0 | 0.004167 | 0.00625 | 0.573772 | -0.819015 | 2025 |
| 2025-05-29 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.93875 | 11.17 | 9.05 | 9.25 | 10.72 | ... | 114.25 | 0.0 | 0.002083 | 0.002083 | 0.0 | 0.002083 | 0.002083 | 0.559589 | -0.82877 | 2025 |
| 2025-05-30 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.681354 | 10.72 | 9.25 | 9.25 | 10.08 | ... | 115.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002083 | 0.002083 | 0.54524 | -0.83828 | 2025 |
| 2025-05-31 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.683646 | 10.72 | 9.05 | 9.46 | 9.87 | ... | 110.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.53073 | -0.847541 | 2025 |
| 2025-06-01 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.530204 | 10.51 | 8.86 | 9.05 | 9.87 | ... | 111.0 | 0.0 | 0.003922 | 0.003922 | 0.0 | 0.003922 | 0.003922 | 0.516062 | -0.856551 | 2025 |


In [11]:
# Filter out data beyond 2024
data_filtered = data_resampled[data_resampled.index.year <= 2024]
data_filtered.tail()

| date | rain_upstream_mean | rain_upstream_max | rain_upstream_min | rain_upstream_q25 | rain_upstream_q75 | flow_upstream_mean | flow_upstream_max | flow_upstream_min | flow_upstream_q25 | flow_upstream_q75 | ... | level_after_q75 | rain_upstream_acc_2_days | rain_downstream_acc_2_days | rain_after_acc_2_days | rain_upstream_acc_3_days | rain_downstream_acc_3_days | rain_after_acc_3_days | date_sin | date_cos | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-12-27 00:00:00+00:00 | 0.01875 | 0.4 | 0.0 | 0.0 | 0.0 | 34.536667 | 37.63 | 32.17 | 33.45 | 35.26 | ... | 225.0 | 0.121908 | 0.203728 | 0.105154 | 0.123991 | 0.224561 | 0.10932 | -0.085731 | 0.996318 | 2024 |
| 2024-12-28 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 29.894421 | 32.17 | 27.23 | 28.69 | 30.91 | ... | 195.0 | 0.01875 | 0.045833 | 0.010417 | 0.121908 | 0.203728 | 0.105154 | -0.068615 | 0.997643 | 2024 |
| 2024-12-29 00:00:00+00:00 | 0.122581 | 8.4 | 0.0 | 0.0 | 0.0 | 27.145699 | 35.0 | 24.63 | 26.51 | 26.99 | ... | 176.0 | 0.122581 | 0.335484 | 0.049462 | 0.141331 | 0.381317 | 0.059879 | -0.051479 | 0.998674 | 2024 |
| 2024-12-30 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 26.008696 | 27.23 | 24.16 | 25.09 | 26.99 | ... | 247.0 | 0.122581 | 0.355049 | 0.049462 | 0.122581 | 0.355049 | 0.049462 | -0.034328 | 0.999411 | 2024 |
| 2024-12-31 00:00:00+00:00 | 0.004167 | 0.4 | 0.0 | 0.0 | 0.0 | 22.969271 | 24.39 | 21.88 | 22.11 | 23.47 | ... | 184.0 | 0.004167 | 0.440399 | 0.0625 | 0.126747 | 0.775882 | 0.111962 | -0.017166 | 0.999853 | 2024 |


In [12]:
# Filter out data before 2014
data_filtered = data_filtered[data_filtered.index.year >= 2014]
data_filtered.head()

| date | rain_upstream_mean | rain_upstream_max | rain_upstream_min | rain_upstream_q25 | rain_upstream_q75 | flow_upstream_mean | flow_upstream_max | flow_upstream_min | flow_upstream_q25 | flow_upstream_q75 | ... | level_after_q75 | rain_upstream_acc_2_days | rain_downstream_acc_2_days | rain_after_acc_2_days | rain_upstream_acc_3_days | rain_downstream_acc_3_days | rain_after_acc_3_days | date_sin | date_cos | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014-01-01 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19.369072 | 20.1 | 18.3 | 18.8 | 19.8 | ... | 291.0 | 0.0 | 0.010309 | 0.0 | 0.0 | 0.173196 | 0.0 | 0.0 | 1.0 | 2014 |
| 2014-01-02 00:00:00+00:00 | 0.004124 | 0.2 | 0.0 | 0.0 | 0.0 | 20.271134 | 21.0 | 19.3 | 19.8 | 20.6 | ... | 308.0 | 0.004124 | 0.084536 | 0.0 | 0.004124 | 0.092784 | 0.0 | 0.017213 | 0.999852 | 2014 |
| 2014-01-03 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.906383 | 22.3 | 19.3 | 19.8 | 22.1 | ... | 302.0 | 0.004124 | 0.082474 | 0.0 | 0.004124 | 0.084536 | 0.0 | 0.034422 | 0.999407 | 2014 |
| 2014-01-04 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.985567 | 19.8 | 16.7 | 17.4 | 18.3 | ... | 280.0 | 0.0 | 0.0 | 0.0 | 0.004124 | 0.082474 | 0.0 | 0.05162 | 0.998667 | 2014 |
| 2014-01-05 00:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.802128 | 16.7 | 15.0 | 15.5 | 16.0 | ... | 271.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.068802 | 0.99763 | 2014 |


## 6. Saving the formatted data to CSV

<br>

**Finally, the processed data is saved to a CSV file, ready for use in analysis and modeling.**

- *The file contains all the treated, resampled, and enriched variables, with intuitively named columns.*
- *The CSV is written to `PROCESSED_PATH` with `;` as the separator, allowing easy access and sharing.*

> The final file serves as the basis for future analyses, ensuring the data is ready for use in machine learning models or other statistical analyses.

In [13]:
# Save the processed data to a CSV file
data_filtered.to_csv(PROCESSED_PATH, sep=";", index=True)
print(f"Processed data saved to {PROCESSED_PATH}")

Processed data saved to ../data/ANA HIDROWEB/RIO MEIA PONTE/processed.csv
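
A downstream notebook has to read the file with the same `;` separator and rebuild the timezone-aware `date` index. The sketch below round-trips a tiny frame through a temporary file instead of the real `PROCESSED_PATH`. The two sample rows and the choice to restore the index with `pd.to_datetime(..., utc=True)` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

index = pd.date_range("2024-12-30", periods=2, freq="D", tz="UTC", name="date")
frame = pd.DataFrame(
    {"flow_upstream_mean": [26.008696, 22.969271], "year": [2024, 2024]},
    index=index,
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed.csv")
    # Same writer arguments as the notebook cell above.
    frame.to_csv(path, sep=";", index=True)
    # Read with the matching separator and use the "date" column as the index.
    restored = pd.read_csv(path, sep=";", index_col="date")

# The index comes back as strings; convert it to a UTC DatetimeIndex.
restored.index = pd.to_datetime(restored.index, utc=True)
print(restored)
```

Reading the file with the default comma separator would collapse every row into a single column, so the separator has to be passed explicitly.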
