# House Price Prediction with Neural Networks

This notebook trains a deep neural network to predict house sale prices from the King County (Seattle area) dataset.

**Key Concepts:**
- Regression (predicting continuous values)
- Feature scaling with MinMaxScaler
- Deep neural network architecture
- Mean Absolute Error (MAE) loss

In [None]:
# ============================================
# IMPORTS
# ============================================
import pandas as pd  # Data manipulation and CSV loading
import numpy as np   # Numerical operations

In [None]:
# ============================================
# LOAD DATASET
# ============================================
# King County House Sales dataset
# Contains ~21,000 house sales with features like:
# bedrooms, bathrooms, sqft, condition, grade, etc.
dataset = pd.read_csv("kc_house_data.csv")

In [None]:
# ============================================
# HANDLE MISSING VALUES
# ============================================
# dropna() removes rows with any missing values
# This ensures clean data for training
dataset = dataset.dropna()

In [None]:
# ============================================
# PREPARE FEATURES AND TARGET
# ============================================
# X = features (columns 3 onwards): bedrooms, bathrooms, sqft, etc.
# Y = target (column 2): price - what we want to predict
# We skip columns 0-1 (id, date) as they don't help prediction
X = dataset.iloc[:, 3:]   # All rows, columns 3 to end
Y = dataset.iloc[:, 2]    # All rows, column 2 (price)

In [None]:
# ============================================
# FEATURE SCALING
# ============================================
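# MinMaxScaler rescales each feature column to the range [0, 1]
# This matters for neural networks because:
# - Features have very different scales (sqft vs bedrooms)
# - Gradient-based training usually converges faster on normalized inputs
# - It keeps large-valued features from dominating the gradients
# Caveat: the scaler here is fit on the full dataset before the split, so
# test-set min/max values leak into training. For a stricter evaluation,
# fit the scaler on the training rows only and reuse it on the test rows.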
from sklearn.preprocessing import MinMaxScaler

mmScaler_X = MinMaxScaler()
X = mmScaler_X.fit_transform(X.astype(float))
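
To make the scaling step concrete, here is a minimal NumPy sketch of the per-column formula `(x - min) / (max - min)` that min-max scaling is based on. The helper name `min_max_scale` and the toy rows are made up for illustration; they are not part of the notebook or of scikit-learn.

```python
import numpy as np

def min_max_scale(columns):
    """Scale each column to [0, 1] using (x - min) / (max - min)."""
    columns = np.asarray(columns, dtype=float)
    col_min = columns.min(axis=0)
    col_range = columns.max(axis=0) - col_min
    # A constant column has zero range; divide by 1 so it maps to 0
    col_range[col_range == 0] = 1.0
    return (columns - col_min) / col_range

# Made-up rows in the form [bedrooms, sqft_living]
toy = np.array([[2, 800], [3, 1500], [5, 4300]])
scaled = min_max_scale(toy)
print(scaled)
```

After scaling, both columns span the same [0, 1] range even though the raw square footage values are hundreds of times larger than the bedroom counts.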

In [None]:
# ============================================
# TRAIN/TEST SPLIT
# ============================================
# Split data: 80% training, 20% testing
# random_state=0 ensures reproducible results
# Test set is used to evaluate model on unseen data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, 
    test_size=0.2,      # 20% for testing
    random_state=0      # Seed for reproducibility
)
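
One variation worth knowing is to split first and fit the scaler on the training rows only, so that no information from the test rows reaches the scaler. The sketch below uses synthetic data as a stand-in for the house features; the names `X_raw`, `y_raw`, and the `_scaled` arrays are hypothetical and chosen so they do not clash with the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: 100 rows, 3 numeric features)
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 5000, size=(100, 3))
y_raw = rng.uniform(1e5, 1e6, size=100)

# 1) Split the raw, unscaled data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=0
)

# 2) Fit the scaler on the training rows only
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# 3) Reuse the training min/max on the test rows (no refit)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(), X_tr_scaled.max())
print(X_te_scaled.min(), X_te_scaled.max())
```

With this ordering, scaled test values can fall slightly outside [0, 1] when a test row exceeds the training range. That is expected and reflects what the model would see on new data.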

In [None]:
# ============================================
# IMPORT KERAS COMPONENTS
# ============================================
# Sequential: Linear stack of layers
# Dense: Fully connected layer (every neuron connects to all inputs)
# Adam: Adaptive learning rate optimizer (a common default choice)
# Activations are passed via Dense(..., activation=...), so the separate
# Activation layer is not needed here
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

In [None]:
# ============================================
# BUILD THE NEURAL NETWORK
# ============================================
# Architecture: Input -> 30 -> 30 -> 40 -> 40 -> 40 -> 1
# - Hidden layers use ReLU activation (handles non-linearity)
# - Output layer has 1 neuron (single price prediction)
# - No activation on output (regression outputs raw values)
model = Sequential()

# Input layer is automatically created based on input shape during fit()
model.add(Dense(30, activation="relu"))  # Hidden layer 1: 30 neurons
model.add(Dense(30, activation="relu"))  # Hidden layer 2: 30 neurons
model.add(Dense(40, activation="relu"))  # Hidden layer 3: 40 neurons
model.add(Dense(40, activation="relu"))  # Hidden layer 4: 40 neurons
model.add(Dense(40, activation="relu"))  # Hidden layer 5: 40 neurons
model.add(Dense(1))                       # Output layer: 1 neuron (price)
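
To get a feel for the size of this architecture, the number of trainable parameters can be counted by hand: each Dense layer has `(inputs + 1) * units` parameters (weights plus one bias per unit). The sketch below assumes 18 input features, which is what selecting columns 3 onwards yields if the CSV has 21 columns; check `X.shape[1]` to confirm. The helper `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        # fan_in weights per unit, plus one bias per unit
        total += (fan_in + 1) * units
        fan_in = units
    return total

n_features = 18  # assumption: verify with X.shape[1]
hidden_and_output = [30, 30, 40, 40, 40, 1]
total_params = dense_param_count(n_features, hidden_and_output)
print(total_params)
```

Once the model has been built (for example after the first call to `fit()`), `model.summary()` reports the same kind of count and can be used to cross-check this arithmetic.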

In [None]:
# ============================================
# COMPILE THE MODEL
# ============================================
# optimizer="Adam": Adaptive learning rate, works well in most cases
# loss="mae": Mean Absolute Error - average of |predicted - actual|
#   - Good for regression, less sensitive to outliers than MSE
#   - Easier to interpret (same units as price)
model.compile(optimizer="Adam", loss="mae")
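
The claim that MAE is less sensitive to outliers can be illustrated with a small NumPy comparison. The sketch below compares MAE with RMSE (the square root of MSE, used here so both metrics are in the same units). All prices are made up for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up prices; every prediction is off by exactly 10,000
actual = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
predicted = np.array([310_000.0, 440_000.0, 530_000.0, 600_000.0])

# Same predictions, but the last one is now off by 400,000
predicted_outlier = predicted.copy()
predicted_outlier[-1] = 210_000.0

print("no outlier:   MAE", mae(actual, predicted), "RMSE", rmse(actual, predicted))
print("with outlier: MAE", mae(actual, predicted_outlier),
      "RMSE", rmse(actual, predicted_outlier))
```

Squaring the errors makes the single large miss dominate RMSE, while MAE grows only in proportion to the size of that miss.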

In [None]:
# ============================================
# TRAIN THE MODEL
# ============================================
# epochs=150: Pass through entire dataset 150 times
# batch_size=32: Update weights after every 32 samples
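# validation_data: Evaluate on the held-out test set after each epoch
#   - Watch for overfitting: training loss keeps decreasing while validation loss rises
#   - Because the test set is monitored during training, it is not a fully independent final check
model.fit(
    x=X_train,
    y=y_train.values,           # .values converts pandas Series to numpy array
    epochs=150,
    validation_data=(X_test, y_test.values),  # same conversion for the validation target
    batch_size=32
)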

In [None]:
# ============================================
# MAKE PREDICTIONS
# ============================================
# Use trained model to predict prices on test data
# Compare predictions with actual prices (y_test)
predictions = model.predict(X_test)
print("Predictions:")
print(predictions)
print("\nActual prices:")
print(y_test)
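
When comparing predictions with actual prices numerically, array shapes matter: a model ending in `Dense(1)` returns predictions of shape `(n_samples, 1)`, while the target values are one-dimensional. The sketch below uses made-up numbers in place of `predictions` and `y_test.values` to show how subtracting the two directly broadcasts into a matrix, and how flattening first avoids that.

```python
import numpy as np

# Stand-ins for model.predict(X_test) and y_test.values (made-up numbers)
preds_2d = np.array([[310_000.0], [440_000.0], [530_000.0]])  # shape (3, 1)
actual_1d = np.array([300_000.0, 450_000.0, 520_000.0])       # shape (3,)

# Pitfall: (3, 1) minus (3,) broadcasts to a (3, 3) matrix of all pairings
wrong = preds_2d - actual_1d
print(wrong.shape)

# Flatten the predictions first so errors line up element by element
errors = preds_2d.ravel() - actual_1d
test_mae = np.mean(np.abs(errors))
print(errors.shape, test_mae)
```

The same pattern applies to the real arrays: `predictions.ravel()` and `y_test.values` have matching shapes and can be compared directly.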

## Understanding the Results

**MAE (Mean Absolute Error)**: Average prediction error in dollars
- Lower is better
- If MAE = \$50,000, predictions are off by \$50k on average

**Validation Loss**: MAE on unseen test data
- Should be close to training loss
- If much higher → overfitting
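
The comparison between training and validation loss can be made explicit in code. In Keras, `model.fit(...)` returns a History object whose `.history` dict holds the per-epoch `"loss"` and `"val_loss"` lists when `validation_data` is given. The sketch below works on plain lists so it runs on its own; the loss values are invented for illustration and the helper names are hypothetical.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

def relative_gap(train_losses, val_losses):
    """Final-epoch gap between validation and training loss, relative to training loss."""
    return (val_losses[-1] - train_losses[-1]) / train_losses[-1]

# Invented per-epoch MAE values; in the notebook these would come from
# history.history["loss"] and history.history["val_loss"]
train_losses = [150_000, 120_000, 100_000, 90_000, 82_000, 76_000]
val_losses = [152_000, 125_000, 108_000, 104_000, 106_000, 111_000]

print("best epoch:", best_epoch(val_losses))
print("final relative gap:", round(relative_gap(train_losses, val_losses), 3))
```

In this invented run the validation loss bottoms out partway through while the training loss keeps falling, which is the overfitting pattern described above.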

**Improvements to try:**
- Add more layers or neurons
- Add Dropout for regularization
- Try different optimizers or learning rates
- Scale the target variable (Y) as well
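
As a sketch of the last suggestion, the target can be scaled with its own `MinMaxScaler` and the model's outputs converted back to dollars with `inverse_transform`. The prices and the scaled "predictions" below are made up, and the names `y_scaler` and `scaled_preds` are hypothetical; a trained model would supply the real scaled predictions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training prices
prices = np.array([200_000.0, 350_000.0, 500_000.0, 800_000.0])

# MinMaxScaler expects a 2D array, so reshape the 1D target to a column
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(prices.reshape(-1, 1))

# ... the model would be trained on y_scaled.ravel() here ...

# Pretend model outputs in the scaled [0, 1] space (made-up values)
scaled_preds = np.array([[0.0], [0.5], [1.0]])

# Convert back to dollars before reporting errors
preds_dollars = y_scaler.inverse_transform(scaled_preds).ravel()
print(preds_dollars)
```

Note that a loss computed on the scaled target is no longer in dollars, so MAE should be recomputed on the inverse-transformed predictions to keep it interpretable.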