# Data Preprocessing

This notebook prepares the house price dataset for modeling: it splits the data
into training and test sets and normalizes the features, using reusable functions from `src.preprocessing`.

In [1]:
import sys
import os

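# Add the project root (the parent of the current working directory) to the
# import path so that the `src` package can be imported below.
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.insert(0, project_root)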

In [2]:
import numpy as np
import pandas as pd
from src.preprocessing import train_test_split, normalize_features

In [3]:
df = pd.read_csv("../data/house_prices.csv")

# Keep the five input features and the target column (`price`).
df = df[
    ["sqft_living", "bedrooms", "bathrooms", "floors", "view", "price"]
]

In [4]:
X = df.drop("price", axis=1).values
y = df["price"].values

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [6]:
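# Compute the mean and standard deviation on the training set only.
X_train_norm, mean, std = normalize_features(X_train)
# Reuse the training statistics for the test set; the returned statistics are discarded.
X_test_norm, _, _ = normalize_features(X_test, mean, std)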

In [7]:
print("Before normalization:", X_train[0])
print("After normalization:", X_train_norm[0])

Before normalization: [2.77e+03 4.00e+00 2.50e+00 2.00e+00 0.00e+00]
After normalization: [ 0.66097147  0.67051991  0.43690203  0.91040303 -0.30619401]
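
The implementations of `train_test_split` and `normalize_features` live in `src/preprocessing.py` and are not shown in this notebook. The sketch below is one plausible way such helpers could be written, not the project's actual code. The `_sketch` function names, the `test_size` and `seed` parameters, the shuffling behavior, and the handling of constant columns are all assumptions made for illustration.

```python
import numpy as np


def train_test_split_sketch(X, y, test_size=0.2, seed=0):
    """Shuffle the rows and split them into train and test parts.

    Hypothetical signature: `test_size` and `seed` are assumptions.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def normalize_features_sketch(X, mean=None, std=None):
    """Z-score each column, computing statistics only when none are supplied."""
    X = np.asarray(X, dtype=float)
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0)
    # Assumption: a constant column is left centered at zero instead of
    # being divided by a zero standard deviation.
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std
```

Called the same way as in cell 6, the second function would first be fitted on the training array and then applied to the test array with the returned `mean` and `std`.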


## Why Feature Normalization Is Required

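- Features such as `sqft_living` (2770 in the first training row) take values orders of magnitude larger than room counts such as `bedrooms` (4).
- Gradient descent converges faster and more reliably when all features are on comparable scales, because a single learning rate then suits every weight.
- The mean and standard deviation are computed on the training data only and reused for the test set, so that no information from the test set leaks into preprocessing.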
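
The effect of scale on gradient descent can be illustrated numerically. For linear regression with a mean squared error loss, the Hessian with respect to the weights is proportional to `X.T @ X`, and a large condition number of that matrix generally means slow, zig-zagging convergence. The sketch below compares the condition number before and after normalization. It uses synthetic stand-in data, not the `house_prices.csv` dataset, and the distributions chosen for the two columns are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (assumed distributions, not the real dataset):
# one large-scale feature and one small-scale feature.
sqft = rng.normal(2000.0, 800.0, size=n)
bedrooms = rng.normal(3.0, 1.0, size=n)
X = np.column_stack([sqft, bedrooms])


def hessian_condition_number(X):
    """Condition number of X^T X / n, which is proportional to the MSE Hessian."""
    return np.linalg.cond(X.T @ X / len(X))


# Center both versions so that only the difference in scale is compared.
X_centered = X - X.mean(axis=0)
X_norm = X_centered / X.std(axis=0)

cond_raw = hessian_condition_number(X_centered)
cond_norm = hessian_condition_number(X_norm)

print(f"Condition number without scaling: {cond_raw:.1f}")
print(f"Condition number with scaling:    {cond_norm:.3f}")
```

With unscaled features the condition number is dominated by the ratio of the two feature variances, while after normalization it depends only on how correlated the features are.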
