# 🛠️ ETL and Feature Engineering

In this notebook, we load, clean, and transform the Rossmann sales dataset with a structured ETL (Extract, Transform, Load) pipeline, then engineer temporal, lag, and rolling-mean features to prepare the data for modeling.


In [4]:
import sys
import os
# Add the project root (the parent of src/) to sys.path so that `src` can be imported as a package
sys.path.append(os.path.abspath(os.path.join(os.path.dirname('/home/amanda/rossmann-sales-forecast/src/'), '..')))
from src.etl import carregar_dados, limpar_dados
from src.features import criar_variaveis_temporais, criar_lags, criar_medias_moveis


## 📥 Step 1: Load and Clean Data

We start by loading the raw sales and store data, then apply basic cleaning operations such as handling missing values and fixing column formats.


In [5]:
# Load and clean data
df = carregar_dados("~/rossmann-sales-forecast/data/raw/train.csv", "~/rossmann-sales-forecast/data/raw/store.csv")
df = limpar_dados(df)
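
The bodies of `carregar_dados` and `limpar_dados` live in `src/etl.py` and are not shown in this notebook. The following is only a minimal sketch of what such a load-and-clean step could look like, assuming the two files are joined on a `Store` key. The function names `load_data_sketch` and `clean_data_sketch`, the tiny in-memory data, and the specific cleaning rules are all hypothetical.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for train.csv and store.csv (illustrative data only).
train_csv = io.StringIO(
    "Store,Date,Sales,Open,StateHoliday\n"
    "1,2015-07-30,5020,1,0\n"
    "1,2015-07-31,5263,1,0\n"
    "2,2015-07-31,0,0,a\n"
)
store_csv = io.StringIO(
    "Store,StoreType,CompetitionDistance\n"
    "1,c,1270\n"
    "2,a,\n"
)

def load_data_sketch(train_source, store_source):
    """Hypothetical stand-in for carregar_dados: read both files and join on Store."""
    # Reading StateHoliday as str avoids mixed-type inference (it holds both 0 and letters).
    train = pd.read_csv(train_source, parse_dates=["Date"], dtype={"StateHoliday": str})
    store = pd.read_csv(store_source)
    return train.merge(store, on="Store", how="left")

def clean_data_sketch(df):
    """Hypothetical stand-in for limpar_dados: fill gaps and drop closed days."""
    out = df.copy()
    # Assumption: fill missing competition distance with the column median.
    median_distance = out["CompetitionDistance"].median()
    out["CompetitionDistance"] = out["CompetitionDistance"].fillna(median_distance)
    # Assumption: days when the store was closed carry no sales signal.
    out = out[out["Open"] == 1]
    return out.sort_values(["Store", "Date"]).reset_index(drop=True)

example = clean_data_sketch(load_data_sketch(train_csv, store_csv))
print(example)
```

Declaring an explicit `dtype` for columns that mix numbers and letters is one way to keep `pd.read_csv` from guessing types chunk by chunk on a large file.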



## 🧠 Step 2: Feature Engineering

We generate new features to enrich the dataset:

- **Temporal features** like day of the week, month, etc.
- **Lag features** to capture past sales trends.
- **Rolling means** to smooth fluctuations and detect trends.
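
To make the first bullet concrete, here is a small sketch of how temporal features can be derived from a date column with the pandas `.dt` accessor. It is not the code of `criar_variaveis_temporais`; the function name `add_temporal_features_sketch` and the output column names are hypothetical, and the `Date` column name is an assumption.

```python
import pandas as pd

def add_temporal_features_sketch(df, date_col="Date"):
    """Hypothetical helper: derive calendar features from a date column."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["day_of_week"] = dates.dt.dayofweek          # Monday=0 ... Sunday=6
    out["month"] = dates.dt.month
    out["year"] = dates.dt.year
    out["week_of_year"] = dates.dt.isocalendar().week.astype(int)
    out["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)
    return out

demo_dates = add_temporal_features_sketch(
    pd.DataFrame({"Date": ["2015-07-31", "2015-08-01"]})
)
print(demo_dates)
```

Lower-case column names are used here so the sketch does not overwrite any calendar columns the raw data may already contain.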


In [7]:
# Feature engineering
df = criar_variaveis_temporais(df)
df = criar_lags(df, lags=[1, 7, 14])
df = criar_medias_moveis(df, janelas=[7, 14])
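
The implementations of `criar_lags` and `criar_medias_moveis` are not shown here. As a hedged sketch of the general technique, lags and rolling means for a multi-store panel are usually computed per store so that one store's history never leaks into another's. The helper names, the `Store`/`Date`/`Sales` column names, and the choice to shift by one day before averaging (so the window only sees earlier days) are assumptions, not a description of the project's code.

```python
import pandas as pd

def add_lags_sketch(df, lags, target="Sales", group="Store"):
    """Hypothetical helper: per-store lagged copies of the target."""
    out = df.sort_values([group, "Date"]).copy()
    for lag in lags:
        out[f"{target}_lag_{lag}"] = out.groupby(group)[target].shift(lag)
    return out

def add_rolling_means_sketch(df, windows, target="Sales", group="Store"):
    """Hypothetical helper: per-store rolling means over previous days only."""
    out = df.sort_values([group, "Date"]).copy()
    # Shift by one day first so the current day's value is excluded from its own window.
    previous = out.groupby(group)[target].shift(1)
    for window in windows:
        out[f"{target}_rollmean_{window}"] = previous.groupby(out[group]).transform(
            lambda s: s.rolling(window).mean()
        )
    return out

demo_sales = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2, 2],
    "Date": pd.to_datetime(["2015-07-28", "2015-07-29", "2015-07-30", "2015-07-31"] * 2),
    "Sales": [10, 20, 30, 40, 100, 200, 300, 400],
})
demo_sales = add_lags_sketch(demo_sales, lags=[1])
demo_sales = add_rolling_means_sketch(demo_sales, windows=[2])
print(demo_sales)
```

The first rows of each store come out as `NaN` because there is no earlier history to look back on, which is why a missing-value step follows feature creation.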


## 🧹 Step 3: Handle Missing Values

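We drop the remaining rows that contain `NaN` values, such as the earliest rows, for which the lag and rolling features have no history. Note that `dropna()` with no arguments removes a row if any of its columns is missing.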


In [8]:
# Drop rows that contain NaN in any column
df = df.dropna()
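
Because a bare `dropna()` looks at every column, it can be worth checking how many rows it removes. The sketch below, on made-up data, compares it with `dropna(subset=...)` restricted to engineered columns. The column names and the `_lag_` / `_rollmean_` naming pattern are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Sales": [10, 20, 30, 40],
    "Sales_lag_1": [np.nan, 10, 20, 30],
    "Note": [None, None, "promo", None],  # an unrelated, mostly empty column
})

# Drops every row with a missing value in ANY column.
dropped_any = frame.dropna()

# Drops only rows where an engineered feature is missing.
feature_cols = [c for c in frame.columns if "_lag_" in c or "_rollmean_" in c]
dropped_features_only = frame.dropna(subset=feature_cols)

rows_lost = len(frame) - len(dropped_features_only)
print(len(dropped_any), len(dropped_features_only), rows_lost)
```

If an unrelated column is sparsely populated, the unrestricted call can discard far more rows than the lag and rolling windows alone would justify.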


## 💾 Step 4: Save Processed Data

We export the final processed dataset to a CSV file to be used in the training pipeline.


In [9]:
df.to_csv("~/rossmann-sales-forecast/data/processed/train_processed.csv", index=False)
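
One caveat with CSV output is that column types such as dates are not preserved, so the training pipeline has to parse them again on load. The following sketch writes a small made-up frame to a temporary directory (standing in for `data/processed/`), creates the folder if it is missing, and reads the file back. The data and paths are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

processed = pd.DataFrame({
    "Store": [1, 2],
    "Date": pd.to_datetime(["2015-07-30", "2015-07-31"]),
    "Sales": [5020, 6064],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "processed" / "train_processed.csv"
    # to_csv does not create missing parent folders, so make them first.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    processed.to_csv(out_path, index=False)
    # Dates come back as plain strings unless they are parsed explicitly.
    reloaded = pd.read_csv(out_path, parse_dates=["Date"])

same_shape = reloaded.shape == processed.shape
print(same_shape, reloaded.dtypes.to_dict())
```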
