# Preprocessing: Step-by-Step

This notebook walks through the **complete preprocessing workflow** for the King County house price dataset, explaining each step in detail.

**Note:** For production use, see the companion notebook `04b-preprocessing-pipeline.ipynb`, which encapsulates all of these steps in a reusable sklearn pipeline.

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path

## 1. Load Data

In [None]:
import kagglehub

path = kagglehub.dataset_download("harlfoxem/housesalesprediction")
csv_path = Path(path) / "kc_house_data.csv"
df = pd.read_csv(csv_path)
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (21613, 21)


| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


## 2. Parse Dates

The `date` column contains sale dates as strings like `"20140502T000000"`. We need to parse these into proper datetime objects.

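**⚠️ CRITICAL:** Date parsing must happen **BEFORE** splitting the data, because the train/val/test splits depend on temporal ordering. Fixed-width `YYYYMMDD...` strings do happen to sort chronologically, but a proper datetime column makes the ordering explicit and supports the date arithmetic used later (for example, `days_since_start`).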

In [None]:
# Parse the date string to datetime
# Format: "YYYYMMDDTHHMMSS" → we only need the first 8 characters (YYYYMMDD)
df["date_parsed"] = pd.to_datetime(df["date"].str[:8], format="%Y%m%d")

# Sort by date for temporal ordering
df = df.sort_values("date_parsed").reset_index(drop=True)

print(f"Date range: {df['date_parsed'].min().date()} to {df['date_parsed'].max().date()}")
print(f"Total sales period: {(df['date_parsed'].max() - df['date_parsed'].min()).days} days")

Date range: 2014-05-02 to 2015-05-27
Total sales period: 390 days
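
As a small self-contained illustration of the parsing step, the sketch below applies the same slicing and format string to three date strings and shows the date arithmetic that a datetime column enables. The toy frame is an assumption for illustration; it is not the full dataset.

```python
import pandas as pd

# Toy data in the same "YYYYMMDDTHHMMSS" layout as the `date` column.
toy = pd.DataFrame({"date": ["20150225T000000", "20140502T000000", "20141209T000000"]})

# Keep the first 8 characters (YYYYMMDD) and parse with an explicit format.
toy["date_parsed"] = pd.to_datetime(toy["date"].str[:8], format="%Y%m%d")

# Sorting on the datetime column gives chronological order.
toy_sorted = toy.sort_values("date_parsed").reset_index(drop=True)

# A datetime column also supports arithmetic, which raw strings do not.
span_days = (toy_sorted["date_parsed"].max() - toy_sorted["date_parsed"].min()).days
print(toy_sorted["date"].tolist())
print(span_days)
```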


## 3. Temporal Train/Validation/Test Split

We use **temporal splitting** (not random) to prevent data leakage, as seen in the previous notebook.

In [None]:
def temporal_train_val_test_split(
    df: pd.DataFrame, 
    date_column: str,
    val_size: float = 0.15, 
    test_size: float = 0.15
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Split data temporally into train, validation, and test sets.
    
    Data is sorted by date_column, then split into three consecutive chunks.
    
    Parameters
    ----------
    df : DataFrame
        Data to split.
    date_column : str
        Name of the datetime column for temporal ordering.
    val_size : float
        Proportion for validation set (default 0.15).
    test_size : float  
        Proportion for test set (default 0.15).
        
    Returns
    -------
    tuple of (train_df, val_df, test_df)
    """
    df_sorted = df.sort_values(date_column).reset_index(drop=True)
    
    n = len(df_sorted)
    train_end = int(n * (1 - val_size - test_size))
    val_end = int(n * (1 - test_size))
    
    train_df = df_sorted.iloc[:train_end].copy()
    val_df = df_sorted.iloc[train_end:val_end].copy()
    test_df = df_sorted.iloc[val_end:].copy()
    
    return train_df, val_df, test_df

In [None]:
# Apply the split
train_df, val_df, test_df = temporal_train_val_test_split(df, date_column="date_parsed")

print("Split sizes:")
print(f"  Train: {len(train_df):,} records ({len(train_df)/len(df)*100:.1f}%)")
print(f"  Val:   {len(val_df):,} records ({len(val_df)/len(df)*100:.1f}%)")
print(f"  Test:  {len(test_df):,} records ({len(test_df)/len(df)*100:.1f}%)")
print()
print("Date ranges:")
print(f"  Train: {train_df['date_parsed'].min().date()} to {train_df['date_parsed'].max().date()}")
print(f"  Val:   {val_df['date_parsed'].min().date()} to {val_df['date_parsed'].max().date()}")
print(f"  Test:  {test_df['date_parsed'].min().date()} to {test_df['date_parsed'].max().date()}")

Split sizes:
  Train: 15,129 records (70.0%)
  Val:   3,242 records (15.0%)
  Test:  3,242 records (15.0%)

Date ranges:
  Train: 2014-05-02 to 2015-01-16
  Val:   2015-01-16 to 2015-03-26
  Test:  2015-03-26 to 2015-05-27
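
Note that the train split ends on 2015-01-16 and the validation split starts on the same day: cutting by row position can place sales from one calendar day on both sides of a boundary. If that matters for your evaluation, one option is to cut on date values instead. The helper below, `temporal_split_on_dates`, is a hypothetical variant and not part of the original notebook. It derives the cutoff dates from the same row positions and then assigns whole days to a single split, so the split proportions become approximate.

```python
import pandas as pd


def temporal_split_on_dates(df, date_column, val_size=0.15, test_size=0.15):
    """Hypothetical variant: split on cutoff dates so no day spans two splits."""
    df_sorted = df.sort_values(date_column).reset_index(drop=True)
    dates = df_sorted[date_column]
    n = len(df_sorted)

    # Dates found at the positional cut points become the split boundaries.
    val_start = dates.iloc[int(n * (1 - val_size - test_size))]
    test_start = dates.iloc[int(n * (1 - test_size))]

    train = df_sorted[dates < val_start].copy()
    val = df_sorted[(dates >= val_start) & (dates < test_start)].copy()
    test = df_sorted[dates >= test_start].copy()
    return train, val, test


# Toy data: four days with five sales each (made-up values).
toy = pd.DataFrame({
    "date_parsed": pd.to_datetime(
        ["2015-01-01"] * 5 + ["2015-01-02"] * 5 + ["2015-01-03"] * 5 + ["2015-01-04"] * 5
    )
})
toy_train, toy_val, toy_test = temporal_split_on_dates(toy, "date_parsed")
print(len(toy_train), len(toy_val), len(toy_test))
```

The trade-off is that a busy day near a boundary can shift the split sizes away from the requested 70/15/15.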


## 4. Feature Engineering

Based on the EDA findings (`01-eda.ipynb`), we'll create several derived features that better capture the underlying patterns.

### 4.1 Temporal Feature: `days_since_start`

This feature captures **market trends over time** — real estate prices tend to appreciate (or depreciate) systematically.

We compute: `days_since_start = sale_date - reference_date`

> **⚠️ CRITICAL: Data Leakage Warning**
>
> The `reference_date` must be computed from the **training set ONLY** and then applied to all splits.
> 
> **Wrong approach (causes data leakage):**
> ```python
> # DON'T DO THIS!
> for split in [train_df, val_df, test_df]:
>     min_date = split["date_parsed"].min()  # Different per split!
>     split["days_since_start"] = (split["date_parsed"] - min_date).dt.days
> ```
> 
> This causes the feature to have **inconsistent semantics** across splits:
> - Training: `days_since_start=300` means "late in training period"
> - Validation: `days_since_start=0` means "start of validation" (which is actually later!)
> 
> The model learns that high values mean "late", but validation/test reset to 0!

In [None]:
# CORRECT: Compute reference date from TRAINING data only
train_min_date = train_df["date_parsed"].min()
print(f"Reference date (from training): {train_min_date.date()}")

# Store this value! It's needed for inference on new data.
# In a production pipeline, this would be a fitted parameter.

Reference date (from training): 2014-05-02
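
The comment about a fitted parameter can be made concrete with a scikit-learn style transformer. The class below, `DaysSinceStart`, is a hypothetical sketch (the companion pipeline notebook may implement this differently): `fit` stores the earliest date it sees and `transform` reuses it for any split.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DaysSinceStart(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: learn the reference date once, reuse it everywhere."""

    def __init__(self, date_column="date_parsed"):
        self.date_column = date_column

    def fit(self, X, y=None):
        # Learned from whatever is passed to fit, which should be training data only.
        self.reference_date_ = X[self.date_column].min()
        return self

    def transform(self, X):
        X = X.copy()
        X["days_since_start"] = (X[self.date_column] - self.reference_date_).dt.days
        return X


# Toy frames with made-up dates.
toy_train = pd.DataFrame({"date_parsed": pd.to_datetime(["2014-05-02", "2014-06-01"])})
toy_future = pd.DataFrame({"date_parsed": pd.to_datetime(["2015-05-27"])})

encoder = DaysSinceStart().fit(toy_train)
train_days = encoder.transform(toy_train)["days_since_start"].tolist()
future_days = encoder.transform(toy_future)["days_since_start"].tolist()
print(train_days, future_days)
```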


In [None]:
# Apply the SAME reference date to ALL splits
train_df["days_since_start"] = (train_df["date_parsed"] - train_min_date).dt.days
val_df["days_since_start"] = (val_df["date_parsed"] - train_min_date).dt.days
test_df["days_since_start"] = (test_df["date_parsed"] - train_min_date).dt.days

print("days_since_start ranges (should increase across splits):")
print(f"  Train: {train_df['days_since_start'].min():3d} to {train_df['days_since_start'].max():3d}")
print(f"  Val:   {val_df['days_since_start'].min():3d} to {val_df['days_since_start'].max():3d}")
print(f"  Test:  {test_df['days_since_start'].min():3d} to {test_df['days_since_start'].max():3d}")

days_since_start ranges (should increase across splits):
  Train:   0 to 259
  Val:   259 to 328
  Test:  328 to 390


**Verify:** The validation and test ranges should start at or above the training maximum (they're in the future). If you see values starting from 0 in each split, the reference date was computed per split and the feature no longer means the same thing across splits!
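
The same check can be written as assertions so that a mistake fails loudly instead of relying on reading printed ranges. The function name `check_days_since_start` is an assumption for illustration; it accepts any sequences of day counts, including pandas Series.

```python
def check_days_since_start(train_days, val_days, test_days):
    """Raise AssertionError if later splits do not continue where earlier ones end."""
    assert min(train_days) == 0, "training should start at the reference date"
    assert min(val_days) >= max(train_days), "validation starts before training ends"
    assert min(test_days) >= max(val_days), "test starts before validation ends"
    return True


# Ranges matching the output above pass the check.
ok = check_days_since_start([0, 259], [259, 328], [328, 390])
print(ok)
```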

### 4.2 Age Features

These features capture the property's age-related characteristics.

In [None]:
def add_age_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add age-related features (per-row computation, no fitting needed)."""
    df = df.copy()
    
    sale_year = df["date_parsed"].dt.year
    
    # House age at time of sale
    df["house_age"] = sale_year - df["yr_built"]
    
    # Was the house ever renovated?
    df["was_renovated"] = (df["yr_renovated"] > 0).astype(int)
    
    # Years since last renovation (0 if never renovated)
    df["years_since_renovation"] = np.where(
        df["yr_renovated"] > 0,
        sale_year - df["yr_renovated"],
        0
    )
    
    return df

# Apply to all splits
train_df = add_age_features(train_df)
val_df = add_age_features(val_df)
test_df = add_age_features(test_df)

print("Age features added: house_age, was_renovated, years_since_renovation")

Age features added: house_age, was_renovated, years_since_renovation
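
To make the three definitions concrete, here is a worked example on two rows that reuse values from the `df.head()` output above: one renovated house and one that was never renovated.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date_parsed": pd.to_datetime(["2014-12-09", "2014-10-13"]),
    "yr_built": [1951, 1955],
    "yr_renovated": [1991, 0],  # 0 encodes "never renovated"
})

sale_year = toy["date_parsed"].dt.year
toy["house_age"] = sale_year - toy["yr_built"]
toy["was_renovated"] = (toy["yr_renovated"] > 0).astype(int)
toy["years_since_renovation"] = np.where(
    toy["yr_renovated"] > 0, sale_year - toy["yr_renovated"], 0
)
print(toy[["house_age", "was_renovated", "years_since_renovation"]])
```

Because `house_age` is derived from the sale year, it can come out negative if a home is sold before its recorded construction year, so counting negative values after this step is a cheap safeguard.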


### 4.3 Ratio Features

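These features relate a property to itself (basement share of living area) or to its neighbors (living area and lot size relative to `sqft_living15` and `sqft_lot15`).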

In [None]:
def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add ratio features (per-row computation, no fitting needed)."""
    df = df.copy()
    
    # What fraction of living space is basement?
    df["basement_ratio"] = df["sqft_basement"] / df["sqft_living"].replace(0, 1)
    
    # How does living area compare to neighbors?
    df["living_vs_neighbors"] = df["sqft_living"] / df["sqft_living15"].replace(0, 1)
    
    # How does lot size compare to neighbors?
    df["lot_vs_neighbors"] = df["sqft_lot"] / df["sqft_lot15"].replace(0, 1)
    
    return df

# Apply to all splits
train_df = add_ratio_features(train_df)
val_df = add_ratio_features(val_df)
test_df = add_ratio_features(test_df)

print("Ratio features added: basement_ratio, living_vs_neighbors, lot_vs_neighbors")

Ratio features added: basement_ratio, living_vs_neighbors, lot_vs_neighbors
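
A worked example helps to see what `.replace(0, 1)` does. The first row below reuses values from the `df.head()` output above; the second row contains a made-up zero denominator (an assumption for illustration) to show that the ratio stays finite instead of becoming `inf`.

```python
import pandas as pd

toy = pd.DataFrame({
    "sqft_basement": [400, 0],
    "sqft_living": [2570, 770],
    "sqft_living15": [1690, 0],  # second value is a made-up zero
})

# Same zero-safe division as in add_ratio_features.
toy["basement_ratio"] = toy["sqft_basement"] / toy["sqft_living"].replace(0, 1)
toy["living_vs_neighbors"] = toy["sqft_living"] / toy["sqft_living15"].replace(0, 1)
print(toy[["basement_ratio", "living_vs_neighbors"]].round(3))
```

With a zero denominator the ratio falls back to the raw numerator (770 here), which is finite but on a very different scale, so it is worth counting how often that case occurs in the real data.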


## 5. Drop Non-Predictive Columns

Some columns should be removed because they:
- Are identifiers (not predictive)
- Have been replaced by engineered features
- Cannot be used effectively (high cardinality)

In [None]:
columns_to_drop = [
    "id",           # Property identifier (21,000+ unique values, not predictive)
    "date",         # Original string format (replaced by date_parsed)
    "date_parsed",  # Used for splitting, now replaced by days_since_start
    "zipcode",      # High cardinality categorical (we use lat/long instead)
    "yr_built",     # Replaced by house_age
    "yr_renovated", # Replaced by was_renovated, years_since_renovation
]

train_df = train_df.drop(columns=columns_to_drop)
val_df = val_df.drop(columns=columns_to_drop)
test_df = test_df.drop(columns=columns_to_drop)

print(f"Columns after dropping: {list(train_df.columns)}")
print(f"Shape: {train_df.shape}")

Columns after dropping: ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'days_since_start', 'house_age', 'was_renovated', 'years_since_renovation', 'basement_ratio', 'living_vs_neighbors', 'lot_vs_neighbors']
Shape: (15129, 23)


## 6. Separate Features and Target

In [None]:
target = "price"

X_train = train_df.drop(columns=[target])
y_train = train_df[target]

X_val = val_df.drop(columns=[target])
y_val = val_df[target]

X_test = test_df.drop(columns=[target])
y_test = test_df[target]

print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_val:   {X_val.shape}, y_val:   {y_val.shape}")
print(f"X_test:  {X_test.shape}, y_test:  {y_test.shape}")

X_train: (15129, 22), y_train: (15129,)
X_val:   (3242, 22), y_val:   (3242,)
X_test:  (3242, 22), y_test:  (3242,)


## 7. Numeric Preprocessing

Most machine learning algorithms work better with normalized/standardized features. We'll apply:

1. **Log transformation** for highly skewed features (square footage)
2. **Standard scaling** for all non-binary numeric features, including the log-transformed ones (binary flags are passed through unchanged)
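
To illustrate why `log1p` suits the square-footage columns, the sketch below applies it to a few made-up values, including a zero such as a house without a basement. `np.log1p(x)` computes `log(1 + x)`, so zero maps to zero rather than to negative infinity.

```python
import numpy as np

sqft = np.array([0.0, 770.0, 2570.0, 13540.0])  # made-up, right-skewed values
logged = np.log1p(sqft)

# The raw spread is dominated by the largest value; the logged spread is far smaller.
raw_spread = sqft.max() - sqft.min()
log_spread = logged.max() - logged.min()
print(raw_spread, round(float(log_spread), 2))
```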

> **⚠️ CRITICAL: Fit on Training Only**
>
> Scalers must be **fit on training data** and then applied to all splits.
> 
> ```python
> # CORRECT:
> scaler.fit(X_train)           # Learn mean/std from training
> X_train_scaled = scaler.transform(X_train)
> X_val_scaled = scaler.transform(X_val)    # Apply training params
> X_test_scaled = scaler.transform(X_test)  # Apply training params
> ```
> 
> **Wrong:**
> ```python
> # DON'T DO THIS!
> X_train_scaled = scaler.fit_transform(X_train)
> X_val_scaled = scaler.fit_transform(X_val)  # Leaks val statistics!
> ```
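
The difference between the two approaches is easy to see on toy numbers (made up for illustration). When the validation values are genuinely larger than the training values, refitting the scaler on validation erases that shift:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_val_toy = np.array([[10.0], [11.0], [12.0]])

# Correct: statistics come from training, validation is only transformed.
scaler = StandardScaler().fit(X_train_toy)
val_correct = scaler.transform(X_val_toy)

# Wrong: validation is centered on its own mean, hiding the shift.
val_wrong = StandardScaler().fit_transform(X_val_toy)

print(val_correct.ravel().round(2))
print(val_wrong.ravel().round(2))
```

In the correct version the validation values stay far above zero, which is exactly the information a model would need to see.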

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

In [None]:
# Identify feature groups
log_features = [
    "sqft_living", "sqft_lot", "sqft_above", "sqft_basement", 
    "sqft_living15", "sqft_lot15"
]

# Binary features that don't need scaling
passthrough_features = ["waterfront", "was_renovated"]

# All other features get standard scaling
scale_features = [col for col in X_train.columns 
                  if col not in log_features + passthrough_features]

print(f"Log+scale features ({len(log_features)}): {log_features}")
print(f"Scale only ({len(scale_features)}): {scale_features}")
print(f"Passthrough ({len(passthrough_features)}): {passthrough_features}")

Log+scale features (6): ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']
Scale only (14): ['bedrooms', 'bathrooms', 'floors', 'view', 'condition', 'grade', 'lat', 'long', 'days_since_start', 'house_age', 'years_since_renovation', 'basement_ratio', 'living_vs_neighbors', 'lot_vs_neighbors']
Passthrough (2): ['waterfront', 'was_renovated']


In [None]:
# Create preprocessing pipeline
log_pipeline = Pipeline([
    ("log", FunctionTransformer(np.log1p, validate=True, feature_names_out="one-to-one")),
    ("scale", StandardScaler())
])

preprocessor = ColumnTransformer([
    ("log", log_pipeline, log_features),
    ("scale", StandardScaler(), scale_features),
    ("passthrough", "passthrough", passthrough_features)
])

# FIT ON TRAINING ONLY
preprocessor.fit(X_train)

print("Preprocessor fitted on training data.")

Preprocessor fitted on training data.


In [None]:
# Transform all splits using the fitted preprocessor
X_train_processed = preprocessor.transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

print("Processed shapes:")
print(f"  Train: {X_train_processed.shape}")
print(f"  Val:   {X_val_processed.shape}")
print(f"  Test:  {X_test_processed.shape}")

Processed shapes:
  Train: (15129, 22)
  Val:   (3242, 22)
  Test:  (3242, 22)


In [None]:
# Get feature names from preprocessor
feature_names = preprocessor.get_feature_names_out()
print(f"Total features: {len(feature_names)}")
print(f"Feature names: {list(feature_names)}")

Total features: 22
Feature names: ['log__sqft_living', 'log__sqft_lot', 'log__sqft_above', 'log__sqft_basement', 'log__sqft_living15', 'log__sqft_lot15', 'scale__bedrooms', 'scale__bathrooms', 'scale__floors', 'scale__view', 'scale__condition', 'scale__grade', 'scale__lat', 'scale__long', 'scale__days_since_start', 'scale__house_age', 'scale__years_since_renovation', 'scale__basement_ratio', 'scale__living_vs_neighbors', 'scale__lot_vs_neighbors', 'passthrough__waterfront', 'passthrough__was_renovated']


## 8. Verify Preprocessing

Let's sanity-check our preprocessing to catch any issues.

In [None]:
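# Check that scaling worked (scaled training columns should have mean~0, std~1).
# The two passthrough binary columns are not scaled, so they widen the ranges printed below.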
train_means = X_train_processed.mean(axis=0)
train_stds = X_train_processed.std(axis=0)

print("Training set statistics (should be ~0 mean, ~1 std for scaled features):")
print(f"  Means range: {train_means.min():.3f} to {train_means.max():.3f}")
print(f"  Stds range:  {train_stds.min():.3f} to {train_stds.max():.3f}")

Training set statistics (should be ~0 mean, ~1 std for scaled features):
  Means range: -0.000 to 0.045
  Stds range:  0.088 to 1.000
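
Because the processed array also contains the unscaled passthrough columns, the ranges above mix scaled and unscaled features. A tighter check restricts the statistics to the scaled columns. The helper `scaled_column_stats` and the toy array are assumptions for illustration; the default prefix matches the `passthrough__` prefix in the feature names printed earlier.

```python
import numpy as np


def scaled_column_stats(X_processed, feature_names, passthrough_prefix="passthrough__"):
    """Return per-column mean and std, ignoring passthrough columns."""
    mask = np.array([not name.startswith(passthrough_prefix) for name in feature_names])
    scaled = X_processed[:, mask]
    return scaled.mean(axis=0), scaled.std(axis=0)


# Toy array: one standardized column and one binary passthrough column.
X_toy = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
X_toy[:, 0] /= X_toy[:, 0].std()
names = ["scale__grade", "passthrough__waterfront"]

means, stds = scaled_column_stats(X_toy, names)
print(means, stds)
```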


In [None]:
# Check for any NaN or Inf values
for name, data in [("Train", X_train_processed), ("Val", X_val_processed), ("Test", X_test_processed)]:
    nan_count = np.isnan(data).sum()
    inf_count = np.isinf(data).sum()
    print(f"{name}: {nan_count} NaN, {inf_count} Inf")
    
print("\n✓ Preprocessing verification complete")

Train: 0 NaN, 0 Inf
Val: 0 NaN, 0 Inf
Test: 0 NaN, 0 Inf

✓ Preprocessing verification complete


## 9. Save Preprocessed Data

We save:
1. Processed feature arrays and targets
2. The fitted preprocessor (for scaling new data consistently)
3. Reference values needed for feature engineering
4. Feature names for interpretability

In [None]:
import joblib

output_dir = Path("processed_data")
output_dir.mkdir(exist_ok=True)

# Save processed arrays
np.save(output_dir / "X_train.npy", X_train_processed)
np.save(output_dir / "X_val.npy", X_val_processed)
np.save(output_dir / "X_test.npy", X_test_processed)
np.save(output_dir / "y_train.npy", y_train.values)
np.save(output_dir / "y_val.npy", y_val.values)
np.save(output_dir / "y_test.npy", y_test.values)

# Save the fitted preprocessor (for scaling)
joblib.dump(preprocessor, output_dir / "scaler.joblib")

# Save feature engineering reference values
reference_values = {
    "train_min_date": train_min_date,
}
joblib.dump(reference_values, output_dir / "reference_values.joblib")

# Save feature names (an object-dtype array: reload with np.load(..., allow_pickle=True))
np.save(output_dir / "feature_names.npy", feature_names)

print(f"Saved preprocessed data to {output_dir.absolute()}")

Saved preprocessed data to /media/DIURNOext4/alejandro/wip-clase/PIA-SAA/examen-SAA/example_repos/king-county/processed_data
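
For completeness, here is a sketch of how these artifacts might be reloaded at inference time. It is a self-contained round trip in a temporary directory, with a toy scaler standing in for the fitted preprocessor; all values are made up and only the file names follow the cell above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)

    # Stand-ins for the fitted preprocessor, reference values, and feature names.
    toy_scaler = StandardScaler().fit(np.array([[0.0], [100.0], [200.0]]))
    joblib.dump(toy_scaler, out / "scaler.joblib")
    joblib.dump({"train_min_date": pd.Timestamp("2014-05-02")}, out / "reference_values.joblib")
    np.save(out / "feature_names.npy", np.array(["scale__days_since_start"], dtype=object))

    # Reload everything, as an inference script would.
    scaler = joblib.load(out / "scaler.joblib")
    reference = joblib.load(out / "reference_values.joblib")
    # Object arrays are pickled by np.save, so loading needs allow_pickle=True.
    names = np.load(out / "feature_names.npy", allow_pickle=True)

# Recreate the engineered feature for a new sale, then scale it with training statistics.
new_sale = pd.Timestamp("2015-06-01")
days = (new_sale - reference["train_min_date"]).days
scaled = scaler.transform(np.array([[float(days)]]))
print(list(names), days, scaled.ravel())
```

Only load joblib and pickled NumPy files from sources you trust, since unpickling can execute arbitrary code.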
