# 03 - Initial Feast (Point-in-Time Training Set)

**Objective:** Build and inspect the *training set* with **Feast** using the `FeatureView` `premium_features` (join key: `id`, timestamp: `event_timestamp`).  
**Prerequisites:**  
- Run the *preprocess* step, which generates `data/processed/train_proc.parquet` with the columns `id`, `event_timestamp`, and the target.  
- Apply the *feature repo*: run `feast apply` inside `feature_repo/`.  
- `feature_repo/features.py` must declare **premium_features** with the `schema` shown in the repo.
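
For orientation, a declaration of this kind might look roughly like the sketch below. It is not the repo's actual `features.py`: the entity name `policy`, the source path, and the dtypes are assumptions made for illustration, only three of the fields are shown, and it assumes a recent Feast release in which `FeatureView` accepts `schema` and `Entity` objects.

```python
# Hypothetical sketch of feature_repo/features.py; the real schema lives in the repo.
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Bool, Float32

# Entity whose join key matches the `id` column of the processed parquet.
# The entity name "policy" is an assumption.
policy = Entity(name="policy", join_keys=["id"])

# File source pointing at the preprocess output (path is an assumption,
# written relative to feature_repo/).
source = FileSource(
    path="../data/processed/train_proc.parquet",
    timestamp_field="event_timestamp",
)

premium_features = FeatureView(
    name="premium_features",
    entities=[policy],
    schema=[
        Field(name="Age", dtype=Float32),            # dtype assumed
        Field(name="Annual Income", dtype=Float32),  # dtype assumed
        Field(name="Gender_Male", dtype=Bool),       # dtype assumed
        # remaining fields as declared in the repo
    ],
    source=source,
)
```

After editing a file like this, `feast apply` has to be rerun so the registry picks up the change.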

> Note: This notebook does **not** retrain models. It only shows how the point-in-time training set is built.

In [1]:
%load_ext autoreload
%autoreload 2

import os
from pathlib import Path
import pandas as pd
from feast import FeatureStore

REPO_PATH = "../feature_repo"
PROCESSED = "../data/processed/train_proc.parquet"
OUT_PATH  = "../data/feast/training_set.parquet"
FV_NAME   = "premium_features"
ID_COL    = "id"
TS_COL    = "event_timestamp"
TARGET    = "Premium Amount"

assert Path(REPO_PATH).exists(), "'feature_repo/' not found. Run 'feast init' and/or check your repo."
assert Path(PROCESSED).exists(), "Processed parquet not found. Run the pipeline's preprocess step first."



## 1) Inspect the registered FeatureView

In [2]:
store = FeatureStore(repo_path=REPO_PATH)
fv = store.get_feature_view(FV_NAME)
fields = getattr(fv, "features", None) or getattr(fv, "schema", None)
feat_names = [getattr(f, "name", None) for f in fields]
feat_names = [n for n in feat_names if n]
print(f"FeatureView: {FV_NAME} — {len(feat_names)} features")
feat_names

FeatureView: premium_features — 36 features


['Age',
 'Annual Income',
 'Number of Dependents',
 'Health Score',
 'Previous Claims',
 'Vehicle Age',
 'Credit Score',
 'Insurance Duration',
 'psd_year',
 'psd_month',
 'psd_dow',
 'psd_month_sin',
 'psd_month_cos',
 'Policy Type',
 'Education Level',
 'Customer Feedback',
 'Gender_Male',
 'Gender_Unknown',
 'Smoking Status_Yes',
 'Smoking Status_Unknown',
 'Marital Status_Married',
 'Marital Status_Single',
 'Marital Status_Unknown',
 'Occupation_Self-Employed',
 'Occupation_Unemployed',
 'Occupation_Unknown',
 'Location_Suburban',
 'Location_Urban',
 'Location_Unknown',
 'Exercise Frequency_Monthly',
 'Exercise Frequency_Rarely',
 'Exercise Frequency_Weekly',
 'Exercise Frequency_Unknown',
 'Property Type_Condo',
 'Property Type_House',
 'Property Type_Unknown']

## 2) Build the training set with a point-in-time join

In [3]:
dfp = pd.read_parquet(PROCESSED)
assert all(c in dfp.columns for c in [ID_COL, TS_COL, TARGET]), "Key columns are missing from the processed data."

entity_df = dfp[[ID_COL, TS_COL, TARGET]].rename(columns={TARGET: "label"})
feature_refs = [f"{FV_NAME}:{n}" for n in feat_names]

retrieval_job = store.get_historical_features(
    entity_df=entity_df,
    features=feature_refs,
    full_feature_names=False,  # column names without the feature view prefix
)
training_df = retrieval_job.to_df()
training_df = training_df.rename(columns={"label": TARGET})

Path(OUT_PATH).parent.mkdir(parents=True, exist_ok=True)
training_df.to_parquet(OUT_PATH, index=False)

training_df.head()

| | id | event_timestamp | Premium Amount | Age | Annual Income | Number of Dependents | Health Score | Previous Claims | Vehicle Age | Credit Score | ... | Location_Suburban | Location_Urban | Location_Unknown | Exercise Frequency_Monthly | Exercise Frequency_Rarely | Exercise Frequency_Weekly | Exercise Frequency_Unknown | Property Type_Condo | Property Type_House | Property Type_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1112515 | 2019-08-17 15:21:39.080371+00:00 | 1328.0 | 19.0 | 498.0 | 2.0 | 27.068329 | 0.0 | 17.0 | 480.0 | ... | False | False | False | False | True | False | False | True | False | False |
| 1 | 364240 | 2019-08-17 15:21:39.080440+00:00 | 20.0 | 45.0 | 102043.0 | 0.0 | 36.477553 | 1.0 | 14.0 | 543.0 | ... | False | False | False | False | False | False | False | True | False | False |
| 2 | 957582 | 2019-08-17 15:21:39.080440+00:00 | 730.0 | 33.0 | 7894.0 | 1.0 | 35.986064 | 0.0 | 6.0 | 445.0 | ... | True | False | False | False | True | False | False | True | False | False |
| 3 | 994714 | 2019-08-17 15:21:39.080440+00:00 | 2979.0 | 58.0 | 47253.0 | 2.0 | 24.98191 | 1.0 | 14.0 | 485.0 | ... | False | True | False | False | True | False | False | True | False | False |
| 4 | 212928 | 2019-08-17 15:21:39.080440+00:00 | 688.0 | 40.0 | 3050.0 | 3.0 | 18.849237 | 1.0 | 19.0 | 716.0 | ... | True | False | False | False | False | False | False | False | False | False |
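
To make the point-in-time semantics concrete: for each entity row, the join keeps the most recent feature row with the same `id` whose timestamp is at or before the entity row's timestamp, so no value from the future leaks into training. The following is a dependency-free sketch of that rule on toy data, not Feast's implementation; the function name `point_in_time_join` and the integer timestamps are illustrative assumptions, and it ignores any `ttl` configured on the feature view.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(entity_rows, feature_rows):
    """Attach to each (id, ts) entity row the latest feature value with the
    same id and a timestamp at or before ts; None when no such row exists."""
    # Group feature rows by id, each group sorted by timestamp.
    by_id = defaultdict(list)
    for fid, fts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        by_id[fid].append((fts, value))

    joined = []
    for eid, ets in entity_rows:
        history = by_id.get(eid, [])
        timestamps = [fts for fts, _ in history]
        # Number of feature rows with timestamp <= entity timestamp.
        pos = bisect_right(timestamps, ets)
        value = history[pos - 1][1] if pos > 0 else None
        joined.append((eid, ets, value))
    return joined


# Toy data: integer timestamps stand in for event_timestamp.
features = [(1, 10, "a"), (1, 20, "b"), (2, 15, "c")]
entities = [(1, 5), (1, 10), (1, 25), (2, 14), (2, 30)]
result = point_in_time_join(entities, features)
```

In this toy run, the entity rows `(1, 5)` and `(2, 14)` get `None` because every feature row for those ids is later than the requested timestamp.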


## 3) Summary and checks

In [5]:
print(f"Saved to: {OUT_PATH}")
print(f"Rows: {len(training_df):,} | Columns: {len(training_df.columns):,}")
print("Key columns:", [ID_COL, TS_COL, TARGET])

Saved to: ../data/feast/training_set.parquet
Rows: 1,200,000 | Columns: 39
Key columns: ['id', 'event_timestamp', 'Premium Amount']
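
Beyond row and column counts, a few structural checks can catch a join that silently went wrong. The sketch below is one possible set of checks, not part of the original pipeline: the helper `check_training_set` is hypothetical, and it is exercised on a toy frame rather than on `training_df`.

```python
import pandas as pd


def check_training_set(df, id_col, ts_col, target, feature_names):
    """Return basic sanity checks for a point-in-time training set."""
    expected = [id_col, ts_col, target, *feature_names]
    present = [c for c in feature_names if c in df.columns]
    return {
        # Columns the caller expected but the frame does not contain.
        "missing_columns": [c for c in expected if c not in df.columns],
        # Repeated (id, timestamp) pairs, counted after the first occurrence.
        "duplicate_keys": int(df.duplicated(subset=[id_col, ts_col]).sum()),
        "null_targets": int(df[target].isna().sum()),
        # A feature that is null everywhere often means the join matched nothing.
        "all_null_features": [c for c in present if df[c].isna().all()],
    }


toy = pd.DataFrame({
    "id": [1, 2, 2],
    "event_timestamp": pd.to_datetime(["2019-08-17"] * 3, utc=True),
    "Premium Amount": [100.0, None, 50.0],
    "Age": [30.0, 40.0, 40.0],
    "Credit Score": [None, None, None],
})
report = check_training_set(
    toy, "id", "event_timestamp", "Premium Amount",
    ["Age", "Credit Score", "Vehicle Age"],
)
```

In the notebook, the same call could be made with `training_df`, `ID_COL`, `TS_COL`, `TARGET`, and `feat_names`.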


## 4) Artifact

In [None]:
# Saved to DagsHub under the corresponding experiment.