# Data Preprocessing

This notebook prepares the house-price dataset for modeling. It splits the data
into training and test sets, then normalizes the features, using reusable
functions from `src.preprocessing`.

In [None]:
import numpy as np
import pandas as pd
from src.preprocessing import train_test_split, normalize_features

In [None]:
# Load the raw data, then keep only the model features and the target column.
df = pd.read_csv("../data/house_prices.csv")

df = df[
    ["sqft_living", "bedrooms", "bathrooms", "floors", "view", "price"]
]

In [None]:
# Separate the feature matrix (X) from the target vector (y) as NumPy arrays.
X = df.drop(columns="price").values
y = df["price"].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
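
The implementation of `train_test_split` lives in `src.preprocessing` and is not shown in this notebook. As a rough illustration of what such a helper typically does, here is a minimal sketch. The function name `train_test_split_sketch`, the `test_size` and `seed` parameters, and the demo arrays are all assumptions made for this example, not the project's actual API.

```python
import numpy as np

def train_test_split_sketch(X, y, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a `test_size` fraction (hypothetical helper)."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    # One shared permutation keeps each row of X aligned with its target in y.
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up demo data: row i of X_demo is [2*i, 2*i + 1] and its target is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split_sketch(X_demo, y_demo)
print(X_demo_train.shape, X_demo_test.shape)
```

Shuffling before splitting matters when the CSV rows are ordered (for example by date or price), and fixing the seed keeps the split reproducible between runs.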

In [None]:
# Compute mean and std on the training set, then reuse them for the test set.
X_train_norm, mean, std = normalize_features(X_train)
X_test_norm, _, _ = normalize_features(X_test, mean, std)

In [None]:
print("Before normalization:", X_train[0])
print("After normalization:", X_train_norm[0])

## Why Feature Normalization Is Required

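- Features such as `sqft_living` span a far larger numeric range than counts such as `bedrooms` or `floors`, so without scaling they dominate the gradient.
- Gradient descent converges faster and more reliably on scaled features, because a single learning rate then suits every feature.
- The mean and standard deviation are computed on the training data only and reused to scale the test set. This prevents data leakage: no test-set statistics influence the transformation.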
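
The points above can be illustrated with a small z-score example. This is a sketch only: `normalize_features_sketch` is a hypothetical stand-in that mirrors the call pattern used earlier in this notebook, the real `normalize_features` in `src.preprocessing` may differ (for instance in how it handles constant columns), and the toy values are invented for illustration.

```python
import numpy as np

def normalize_features_sketch(X, mean=None, std=None):
    """Z-score normalization (hypothetical stand-in for normalize_features)."""
    X = np.asarray(X, dtype=float)
    # Compute statistics only when none are supplied, i.e. on the training set.
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0)
    # Assumption: treat a zero std (constant column) as 1 to avoid dividing by zero.
    safe_std = np.where(std == 0, 1.0, std)
    return (X - mean) / safe_std, mean, std

# Toy data: column 0 is square footage, column 1 is a bedroom count.
X_train_toy = np.array([[1000.0, 2.0], [2000.0, 3.0], [3000.0, 4.0]])
X_test_toy = np.array([[2500.0, 3.0]])

X_train_toy_norm, toy_mean, toy_std = normalize_features_sketch(X_train_toy)
# The test set is scaled with the training statistics, never its own.
X_test_toy_norm, _, _ = normalize_features_sketch(X_test_toy, toy_mean, toy_std)

print("Train column means:", X_train_toy_norm.mean(axis=0))
print("Train column stds:", X_train_toy_norm.std(axis=0))
print("Scaled test row:", X_test_toy_norm[0])
```

After scaling, both training columns have mean 0 and standard deviation 1, so square footage no longer outweighs the bedroom count by three orders of magnitude.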
