# House Prices Regression

In this project, we aim to predict house prices from various features, such as the area of the house, the number of bedrooms, the neighborhood, and others. We approach this problem in two different ways: a two-step approach where we first classify houses into price groups and then perform regression within each group, and a direct approach where we apply regression to the whole dataset. We use a dataset of house prices from the Kaggle House Prices competition. The goal of this project is not only to build accurate prediction models but also to demonstrate the process of data analysis and model evaluation.

This is the third part of a series. The first part covers exploratory data analysis (EDA), and the second explores in more depth how to solve a classification problem.

Let's get started!


The first part of the code below was explained in "Classification Project (House Prices Comp 2 of 3)".

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [2]:
# Load the dataset
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')

# For simplicity, we will use only numeric columns
df = df.select_dtypes(include=[np.number])
df = df.dropna()

# Create price categories: 10 quantile bins with roughly equal counts.
# Note: this overwrites SalePrice with the bin label (0-9).
df['SalePrice'] = pd.qcut(df['SalePrice'], q=10, labels=False)

# Define the predictors and target
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
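
As a side note on the binning step, here is a minimal sketch of what `pd.qcut` with `labels=False` returns. It uses synthetic, right-skewed prices (an assumption; this is not the Kaggle data), and the variable names are illustrative only. The bins hold about the same number of houses each, but they do not cover equal price ranges.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (an assumption; not the Kaggle data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

# labels=False returns the integer bin index (0-9) instead of an interval
groups = pd.qcut(prices, q=10, labels=False)

# Each bin holds about the same number of houses...
counts = groups.value_counts().sort_index()
print(counts)

# ...but the price range covered by each bin is not constant
ranges = prices.groupby(groups).agg(['min', 'max'])
ranges['width'] = ranges['max'] - ranges['min']
print(ranges)
```

With skewed prices the top bin tends to span a much wider price range than the middle ones, which is worth keeping in mind when reading per-class errors later on.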

In [3]:
# Define the classifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.62      0.67      0.65        15
           1       0.11      0.18      0.13        11
           2       0.32      0.27      0.29        22
           3       0.28      0.25      0.26        28
           4       0.38      0.38      0.38        26
           5       0.39      0.39      0.39        23
           6       0.50      0.39      0.44        31
           7       0.46      0.46      0.46        24
           8       0.58      0.68      0.62        22
           9       0.83      0.83      0.83        23

    accuracy                           0.45       225
   macro avg       0.45      0.45      0.45       225
weighted avg       0.46      0.45      0.45       225



Now, we are going to make price predictions based on each class.

After classifying the houses into price groups, we perform linear regression within each group to predict the exact price of a house. This is a two-step approach: first classification, then regression.

In [4]:
# Load the dataset again for regression
df_reg = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')

# For simplicity, we will use only numeric columns
df_reg = df_reg.select_dtypes(include=[np.number])
df_reg = df_reg.dropna()

# Define the predictors and target
X_reg = df_reg.drop('SalePrice', axis=1)
y_reg = df_reg['SalePrice']

# Split the data into train and test sets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Get the labels of the classes
classes = df['SalePrice'].unique()

# Dictionary to store the models
models = {}

# For each class, train a model
for class_ in classes:
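    # Select the training rows in this price group. The mask comes from the
    # classification split (y_train) and is applied to the regression target
    # (y_train_reg); both splits use the same rows and random_state, so the
    # indices line up.
    mask = y_train == class_
    X_train_class = X_train[mask]
    y_train_class = y_train_reg[mask]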
    
    # Define the pipeline
    steps = [('scaler', StandardScaler()),
             ('lr', LinearRegression())]
    pipeline = Pipeline(steps)
    
    # Train the model
    model = pipeline.fit(X_train_class, y_train_class)
    
    # Store the model
    models[class_] = model

In this step, we evaluate the performance of our models. For each price group, we calculate the Root Mean Squared Error (RMSE) between the predicted and actual house prices; test houses are assigned to groups by their true price group (`y_test`). The RMSE is a common measure of prediction error in regression problems. Finally, we average the per-group RMSEs, weighting each by the number of test houses in its group, to summarize performance across all price groups.

In [5]:
# List to store the RMSEs and weights for each class
rmse_list = []
weights_list = []

# For each class, predict and calculate metrics
for class_ in classes:
    # Filter the test data by the true price group (not the classifier's prediction)
    mask = y_test == class_
    X_test_class = X_test[mask]
    y_test_class = y_test_reg[mask]
    
    # Predict
    y_pred_class = models[class_].predict(X_test_class)
    
    # Calculate and print metrics
    rmse = np.sqrt(mean_squared_error(y_test_class, y_pred_class))
    print('Class:', class_, 'RMSE:', rmse)
    
    # Append the RMSE and the weight to the lists
    rmse_list.append(rmse)
    weights_list.append(len(y_test_class))

# Convert the lists to arrays
rmse_array = np.array(rmse_list)
weights_array = np.array(weights_list)

# Calculate the weighted average
rmse_weighted = np.average(rmse_array, weights=weights_array)
print('Weighted RMSE:', rmse_weighted)

Class: 7 RMSE: 10449.895523577286
Class: 6 RMSE: 7878.869140890224
Class: 3 RMSE: 4120.306745904547
Class: 8 RMSE: 37480.826988836656
Class: 9 RMSE: 56350.64879297594
Class: 2 RMSE: 3277.9051931666168
Class: 1 RMSE: 5678.313326125489
Class: 0 RMSE: 11410.591603765497
Class: 4 RMSE: 5082.277810929216
Class: 5 RMSE: 5533.56819435699
Weighted RMSE: 14649.776298931249
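
The weighted figure above is an average of per-group RMSEs. A related quantity is the pooled RMSE, which averages the squared errors over all test houses before taking the square root. The sketch below recomputes both from the numbers printed above. The per-class counts are read from the `support` column of the classification report, on the assumption that both cells use the same test split.

```python
import numpy as np

# Per-class RMSEs printed above, reordered by class label 0-9
rmse_by_class = np.array([
    11410.591603765497, 5678.313326125489, 3277.9051931666168,
    4120.306745904547, 5082.277810929216, 5533.56819435699,
    7878.869140890224, 10449.895523577286, 37480.826988836656,
    56350.64879297594,
])

# Test-set size per class, read from the 'support' column of the report
n_by_class = np.array([15, 11, 22, 28, 26, 23, 31, 24, 22, 23])

# Weighted average of the RMSEs (what the cell above reports)
rmse_weighted = np.average(rmse_by_class, weights=n_by_class)

# Pooled RMSE: n_i * rmse_i**2 is the sum of squared errors in group i,
# so summing over groups and dividing by the total count gives the overall MSE
rmse_pooled = np.sqrt(np.sum(n_by_class * rmse_by_class**2) / n_by_class.sum())

print('Weighted average of RMSEs:', rmse_weighted)
print('Pooled RMSE:', rmse_pooled)
```

The pooled value is never smaller than the weighted average, and it is the one computed the same way as a single RMSE over the whole test set.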


Finally, we train a linear regression model to predict house prices directly, without classifying them into price groups first. This gives us a baseline to compare the performance of our two-step approach.

In [6]:
# Define the pipeline
steps = [('scaler', StandardScaler()),
         ('lr', LinearRegression())]
pipeline = Pipeline(steps)

# Train the model
model_reg = pipeline.fit(X_train_reg, y_train_reg)

In [7]:
# Predict
y_pred_reg = model_reg.predict(X_test_reg)

# Calculate and print metrics
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
print('RMSE:', rmse)

RMSE: 39869.514607966565


Comparing both results:

Using the 10 classes of prices --> RMSE: 14649.776298931249

Using all house prices directly --> RMSE: 39869.514607966565

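Using the classes, the RMSE is lower by roughly 25,000 USD. Keep in mind that the per-class evaluation assigns each test house to its true price group, which is derived from the sale price itself, so the two numbers are not strictly like for like.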

A likely reason for this is that by partitioning the houses into different price groups, we essentially created a stratified dataset where each group represents houses with similar characteristics and, more importantly, similar prices. This could help the model to better understand the price distribution within each group and hence make more accurate predictions. This stratification might have reduced the inherent complexity and heterogeneity within the data, hence providing the model with a clearer signal to learn from.

On the other hand, the single regression model has to deal with the entire spectrum of house prices and their corresponding characteristics all at once. This might make it more challenging for the model to accurately capture the relationships between the features and the target variable, especially if the relationships are non-linear or if they change across different price ranges.

As such, these findings suggest that when dealing with complex and heterogeneous data, it might be beneficial to partition the data into more homogenous groups and train separate models for each group. This can potentially lead to more accurate and robust predictions.
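
To use separate models per group on houses whose price is unknown, the group has to come from the classifier rather than from the true price. A minimal sketch of that routing follows, under stated assumptions: the data is synthetic (`make_regression`, not the Kaggle set), five groups are used instead of ten, and all variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house data (an assumption, not the Kaggle set)
X, y = make_regression(n_samples=1000, n_features=8, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Price groups are built from the training target only
group_tr = pd.qcut(y_tr, q=5, labels=False)

# Step 1: classifier that predicts the group from the features
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, group_tr)

# Step 2: one regressor per group, trained on that group's rows
models = {g: LinearRegression().fit(X_tr[group_tr == g], y_tr[group_tr == g])
          for g in np.unique(group_tr)}

# Inference: route each test row by its *predicted* group
group_pred = clf.predict(X_te)
y_pred = np.empty(len(X_te))
for g in np.unique(group_pred):
    rows = group_pred == g
    y_pred[rows] = models[g].predict(X_te[rows])

# Both RMSEs are computed over the same full test set
rmse_two_step = np.sqrt(mean_squared_error(y_te, y_pred))
rmse_direct = np.sqrt(mean_squared_error(
    y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te)))
print('Two-step RMSE (predicted groups):', rmse_two_step)
print('Direct RMSE:', rmse_direct)
```

The synthetic target here is linear in the features, so the printed numbers say nothing about the house data. The point is only the pattern: classify first, then let the predicted group choose the regressor, and score everything on one test set.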



To clarify this concept further, imagine that the prices of less expensive houses are driven mainly by features such as total area and construction year, while for more expensive houses the presence and attributes of a swimming pool carry more weight. This is hypothetical: a more detailed analysis of the data would be needed to confirm it, although a simple groupby operation might suffice. When we run a regression per class, these variations within each category are accounted for. By contrast, a model built on the entire dataset may overemphasize less impactful features or fail to give enough weight to crucial ones, resulting in suboptimal predictions.
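
One way such a groupby check could look is sketched below. The data is synthetic and the relationships are made up; the column names mirror ones in the Kaggle set (`GrLivArea`, `YearBuilt`, `PoolArea`) purely for illustration, and `df_demo` and `PriceGroup` are hypothetical names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data; column names mirror the Kaggle set but the values are made up
df_demo = pd.DataFrame({
    'GrLivArea': rng.normal(1500, 400, n),
    'YearBuilt': rng.integers(1900, 2011, n).astype(float),
    'PoolArea': rng.choice([0.0, 500.0], size=n, p=[0.9, 0.1]),
})
df_demo['SalePrice'] = (100 * df_demo['GrLivArea']
                        + 500 * (df_demo['YearBuilt'] - 1900)
                        + 100 * df_demo['PoolArea']
                        + rng.normal(0, 10_000, n))

# Price group per house (5 quantile bins to keep the table short)
df_demo['PriceGroup'] = pd.qcut(df_demo['SalePrice'], q=5, labels=False)

# Correlation of each feature with SalePrice, computed separately per group
features = ['GrLivArea', 'YearBuilt', 'PoolArea']
rows = {}
for group, part in df_demo.groupby('PriceGroup'):
    rows[group] = part[features].corrwith(part['SalePrice'])
corr_by_group = pd.DataFrame(rows).T
print(corr_by_group.round(2))
```

A feature whose correlation with price changes noticeably from one group to the next would support the hypothesis; a `NaN` entry means the feature does not vary within that group.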
