# House Price Prediction using Regression Models

## Objective:
To predict median house value using regression models and study the effect of
underfitting and overfitting through training and test errors.

## Models Used:
- Linear Regression
- Ridge Regression
- Decision Tree Regressor

## Dataset:
California Housing Prices Dataset


In [3]:
from google.colab import files
uploaded = files.upload()


Saving housing.csv to housing.csv


In [4]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error


In [5]:
df = pd.read_csv("housing.csv")
df.head()


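|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|-----------|----------|--------------------|-------------|----------------|------------|------------|---------------|--------------------|-----------------|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |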


In [6]:
# Fill missing bedroom counts with the column median.
# Assigning the result back avoids the chained `inplace=True` call,
# which pandas warns about and which will stop working in pandas 3.0.
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].median())


In [7]:
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]


In [8]:
num_features = X.select_dtypes(include=["int64", "float64"]).columns
cat_features = ["ocean_proximity"]

numeric_transformer = Pipeline([
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, num_features),
    ("cat", categorical_transformer, cat_features)
])
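
The median fill in the earlier cell is computed on the full dataset before the train/test split. One alternative, sketched here on a small made-up frame, is to place a `SimpleImputer` inside the numeric pipeline so that the median is learned only from the rows passed to `fit`. The frame `toy` and the names `toy_numeric`, `toy_categorical`, and `toy_preprocessor` are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the housing features.
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})

toy_numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # median is learned in fit()
    ("scaler", StandardScaler()),
])
toy_categorical = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
toy_preprocessor = ColumnTransformer([
    ("num", toy_numeric, ["total_bedrooms", "median_income"]),
    ("cat", toy_categorical, ["ocean_proximity"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; convert it for inspection.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# Medians the imputer learned for the two numeric columns.
learned_medians = (
    toy_preprocessor.named_transformers_["num"]
    .named_steps["imputer"]
    .statistics_
)
print(transformed.shape, learned_medians)
```

With this layout, calling `fit` on the training split alone keeps test rows from influencing the imputed value.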


In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [12]:
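def evaluate_model(model, name):
    """Fit `model` behind the shared preprocessor and return train/test errors."""
    # Bundle preprocessing and the estimator so both are fit on X_train only.
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("model", model)
    ])

    pipeline.fit(X_train, y_train)

    y_train_pred = pipeline.predict(X_train)
    y_test_pred = pipeline.predict(X_test)

    # RMSE is computed as sqrt(MSE), which works across scikit-learn versions,
    # unlike the deprecated `squared=False` argument of mean_squared_error.
    rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
    rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
    mae_test = mean_absolute_error(y_test, y_test_pred)

    # One row of the comparison table.
    return {
        "Model": name,
        "RMSE Train": rmse_train,
        "RMSE Test": rmse_test,
        "MAE Test": mae_test
    }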


In [14]:
results = []

results.append(
    evaluate_model(LinearRegression(), "Linear Regression")
)

results.append(
    evaluate_model(Ridge(alpha=1.0), "Ridge Regression")
)

results.append(
    evaluate_model(
        DecisionTreeRegressor(max_depth=10, random_state=42),
        "Decision Tree"
    )
)

pd.DataFrame(results)


|   | Model | RMSE Train | RMSE Test | MAE Test |
|---|-------|------------|-----------|----------|
| 0 | Linear Regression | 68433.937367 | 70060.521845 | 50670.738241 |
| 1 | Ridge Regression | 68434.995896 | 70067.346500 | 50677.170993 |
| 2 | Decision Tree | 48589.563780 | 61279.694223 | 40556.375765 |
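
One way to read the table above is to compute the gap between test and training RMSE for each model. The sketch below re-enters the reported numbers by hand; the frame name `scores` and the derived columns `Gap` and `Test/Train` are illustrative additions, not part of the notebook.

```python
import pandas as pd

# Values copied from the results table above.
scores = pd.DataFrame({
    "Model": ["Linear Regression", "Ridge Regression", "Decision Tree"],
    "RMSE Train": [68433.937367, 68434.995896, 48589.563780],
    "RMSE Test": [70060.521845, 70067.346500, 61279.694223],
})

# Absolute and relative train-test gap; a large gap suggests overfitting.
scores["Gap"] = scores["RMSE Test"] - scores["RMSE Train"]
scores["Test/Train"] = scores["RMSE Test"] / scores["RMSE Train"]

best_test = scores.loc[scores["RMSE Test"].idxmin(), "Model"]
largest_gap = scores.loc[scores["Gap"].idxmax(), "Model"]

print(scores.round(3))
print("Lowest test RMSE:", best_test)
print("Largest train-test gap:", largest_gap)
```

On these numbers the Decision Tree has both the lowest test RMSE and the largest gap, while the two linear models differ by only a few units of RMSE.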


## Model Behavior Analysis

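- **Linear Regression** has high training and test error (RMSE of about 68,400 and
  70,100). The two values are close to each other but both large, indicating
  underfitting: a high-bias linear model cannot capture non-linear patterns.

- **Ridge Regression** with `alpha=1.0` is practically indistinguishable from Linear
  Regression (test RMSE 70,067 versus 70,061). At this regularization strength the
  penalty barely changes the fit, so Ridge underfits in the same way.

- **Decision Tree** (`max_depth=10`) achieves the lowest test error (RMSE about 61,300,
  MAE about 40,600) but also the largest gap between training RMSE (about 48,600) and
  test RMSE, indicating overfitting due to high variance.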
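
The underfitting and overfitting patterns in the list above can be reproduced in miniature by varying tree depth. The following is a minimal sketch on synthetic data (a noisy sine curve), not on the housing set; the names `X_demo`, `y_demo`, and `depth_scores` are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (an assumption for illustration only).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(600, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=600)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

depth_scores = {}
for depth in [1, 3, 5, 10, None]:  # None grows the tree until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    rmse_tr = np.sqrt(mean_squared_error(yd_train, tree.predict(Xd_train)))
    rmse_te = np.sqrt(mean_squared_error(yd_test, tree.predict(Xd_test)))
    depth_scores[depth] = (rmse_tr, rmse_te)
    print(f"max_depth={depth}: train RMSE={rmse_tr:.3f}, test RMSE={rmse_te:.3f}")
```

Training RMSE keeps falling as depth grows and reaches zero for the unrestricted tree, while test RMSE typically stops improving at a moderate depth. A very shallow tree plays the role of the underfitting model, and the unrestricted tree that of the overfitting one.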

### Real-World ML Issue:
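Outliers and non-linear relationships in housing data can hurt linear models.
Because RMSE squares each error, a few large misses can dominate the score and give a
misleading picture of typical performance, which is why MAE is reported alongside it.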
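
To make the effect of outliers on the metrics concrete, here is a small sketch with made-up prices. The helper functions `rmse` and `mae` and all values are hypothetical and chosen only for easy arithmetic.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Hypothetical prices: every prediction is off by exactly 10,000.
y_true = np.array([200_000, 210_000, 190_000, 205_000, 195_000], dtype=float)
y_pred = y_true + 10_000

base = (rmse(y_true, y_pred), mae(y_true, y_pred))

# Make one target extreme so that a single error becomes 300,000.
y_outlier = y_true.copy()
y_outlier[0] = 510_000  # the prediction for this row stays at 210,000
with_outlier = (rmse(y_outlier, y_pred), mae(y_outlier, y_pred))

print("RMSE, MAE without outlier:", base)
print("RMSE, MAE with outlier:   ", with_outlier)
```

Without the outlier both metrics equal 10,000. With it, MAE rises to 68,000 while RMSE rises to roughly 134,000, so the single extreme house accounts for most of the RMSE.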


## Conclusion

The Decision Tree gives the lowest test error on this split, although its train-test
gap shows that it is already overfitting. Linear and Ridge Regression perform almost
identically and underfit. This experiment demonstrates the bias-variance trade-off and
highlights the risks of underfitting and overfitting in real-world machine learning problems.
