# Housing Prices Regression - Initial Model

In [14]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [2]:
# Load the housing dataset
file_path = 'data/housing.csv'  # Adjust if your folder is different
home_data = pd.read_csv(file_path)

In [6]:
# Basic data review
print(home_data.describe())
print(home_data.head())

# Display all column names
print(home_data.columns)

              Rooms         Price      Distance      Postcode      Bedroom2  \
count  13580.000000  1.358000e+04  13580.000000  13580.000000  13580.000000   
mean       2.937997  1.075684e+06     10.137776   3105.301915      2.914728   
std        0.955748  6.393107e+05      5.868725     90.676964      0.965921   
min        1.000000  8.500000e+04      0.000000   3000.000000      0.000000   
25%        2.000000  6.500000e+05      6.100000   3044.000000      2.000000   
50%        3.000000  9.030000e+05      9.200000   3084.000000      3.000000   
75%        3.000000  1.330000e+06     13.000000   3148.000000      3.000000   
max       10.000000  9.000000e+06     48.100000   3977.000000     20.000000   

           Bathroom           Car       Landsize  BuildingArea    YearBuilt  \
count  13580.000000  13518.000000   13580.000000   7130.000000  8205.000000   
mean       1.534242      1.610075     558.416127    151.967650  1964.684217   
std        0.691712      0.962634    3990.669241   

In [9]:
# Select target
y = home_data['Price']

# Select features
feature_names = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Car']
X = home_data[feature_names]

In [10]:
# Split into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

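# Create and train the model
# Note: Car, BuildingArea, and YearBuilt contain missing values (see the counts above).
# DecisionTreeRegressor only accepts NaNs natively in scikit-learn 1.3 or later.
model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)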

In [11]:
# Predict on validation data
val_predictions = model.predict(val_X)

# Calculate Mean Absolute Error
val_mae = mean_absolute_error(val_y, val_predictions)
print(f"Validation MAE: {val_mae}")

Validation MAE: 417117.66685911367


In [12]:
# Create a comparison table
comparison = pd.DataFrame({
    'Predicted': val_predictions,
    'Actual': val_y
})

# Show first few predictions
print(comparison.head())

       Predicted     Actual
321    1500000.0  1640000.0
4003    572000.0   675000.0
13348  3450000.0  2800000.0
2697    767000.0   615000.0
12600  2325000.0  2700000.0
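
The MAE reported above is the mean of the absolute differences between actual and predicted prices. As a small worked check, the sketch below recomputes it by hand for just the five rows printed in the comparison table. The variable names are illustrative, and the result covers only these five rows, not the full validation set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# The five (predicted, actual) pairs printed in the comparison table
predicted = np.array([1500000.0, 572000.0, 3450000.0, 767000.0, 2325000.0])
actual = np.array([1640000.0, 675000.0, 2800000.0, 615000.0, 2700000.0])

# MAE is the mean of the absolute differences |actual - predicted|
abs_errors = np.abs(actual - predicted)
manual_mae = abs_errors.mean()

# scikit-learn's helper should agree with the hand computation
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae)  # 284000.0
```

These five rows happen to average a smaller error than the full validation MAE, which is a reminder that a handful of rows is not representative.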


## Notes and Reflection

- **Model:** DecisionTreeRegressor
- **Features used:** Rooms, Bathroom, Landsize, BuildingArea, YearBuilt, Car
- **Validation MAE:** 417,118
- **Observations:**
  - On average, predictions miss the actual price by about $417k, which is large relative to the median price of $903k.
  - Some predictions are fairly close (e.g., $1.5M predicted vs $1.64M actual), but others are off by much more (e.g., $3.45M predicted vs $2.8M actual).
  - Location factors (Suburb, Distance, etc.) are not among the features and likely influence house price heavily.
- **Next steps:**
  - Try more features (e.g., Distance, Lattitude/Longtitude).
  - Try more powerful models like Random Forests.
  - Improve data cleaning (handle missing values, remove outliers).
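
As a rough illustration of the "more features" and "Random Forests" next steps, the sketch below fits a single tree and a forest on the same split and prints both MAEs. It runs on synthetic data with a made-up price rule (an assumption, so that it does not depend on `data/housing.csv`); the `_demo` names and all numbers are hypothetical, and the printed scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (hypothetical values, not the real CSV)
rng = np.random.default_rng(1)
n_rows = 500
X_demo = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n_rows),
    "Bathroom": rng.integers(1, 4, n_rows),
    "Landsize": rng.uniform(100, 1000, n_rows),
    "Distance": rng.uniform(1, 40, n_rows),  # extra location-style feature
})
# Made-up price rule plus noise, just so the models have something to learn
y_demo = (
    200_000 * X_demo["Rooms"]
    + 100_000 * X_demo["Bathroom"]
    + 300 * X_demo["Landsize"]
    - 10_000 * X_demo["Distance"]
    + rng.normal(0, 100_000, n_rows)
)

train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)

# Fit a single tree and a forest of 100 trees on the same training rows
tree_model = DecisionTreeRegressor(random_state=1).fit(train_X_demo, train_y_demo)
forest_model = RandomForestRegressor(n_estimators=100, random_state=1).fit(
    train_X_demo, train_y_demo
)

# Score both on the same validation rows so the MAEs are directly comparable
tree_mae = mean_absolute_error(val_y_demo, tree_model.predict(val_X_demo))
forest_mae = mean_absolute_error(val_y_demo, forest_model.predict(val_X_demo))
print(f"Decision Tree MAE: {tree_mae:,.0f}")
print(f"Random Forest MAE: {forest_mae:,.0f}")
```

On the real data, the same pattern would apply with `feature_names` extended and `home_data` in place of the synthetic frame.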

In [17]:
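# Drop rows with missing feature values (LinearRegression cannot handle NaNs),
# then realign the targets to the surviving row indices.
# Note: this shrinks both sets, so the validation rows here differ from
# those used to score the Decision Tree above.
train_X = train_X.dropna()
train_y = train_y.loc[train_X.index]

val_X = val_X.dropna()
val_y = val_y.loc[val_X.index]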

# Create and train a Linear Regression model
linreg_model = LinearRegression()
linreg_model.fit(train_X, train_y)

# Make predictions with Linear Regression
linreg_val_predictions = linreg_model.predict(val_X)

# Evaluate the Linear Regression model
linreg_val_mae = mean_absolute_error(val_y, linreg_val_predictions)
print(f"Linear Regression Validation MAE: {linreg_val_mae}")

Linear Regression Validation MAE: 319811.9951414179


## Notes on Linear Regression

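- Trained a Linear Regression model after dropping rows with missing feature values.
- Validation MAE for Linear Regression: 319,812
- Linear Regression scored better than the Decision Tree (MAE: 417,118), but the comparison is not like-for-like: the Decision Tree was scored on the full validation split, while Linear Regression was scored only on rows with no missing values.
- Likely reasons:
  - The relationship between these features and price may be fairly linear.
  - The Decision Tree was grown with no depth or leaf limit, so it may have overfitted to noise in the training data.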
- Future ideas:
  - Add more features (e.g., Distance, Lattitude, Longtitude).
  - Impute missing values instead of dropping rows, to preserve data; BuildingArea alone is missing in nearly half of the 13,580 rows.
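
One possible way to try the imputation idea while keeping the model comparison like-for-like is sketched below: each model is wrapped in a pipeline with scikit-learn's `SimpleImputer`, and both are scored on the same validation rows. The data is synthetic (an assumption, so the example runs without the CSV), the median strategy is only one of several reasonable choices, and the `_demo` names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (hypothetical values, not the real CSV)
rng = np.random.default_rng(1)
n_rows = 400
X_demo = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n_rows).astype(float),
    "BuildingArea": rng.uniform(50, 300, n_rows),
})
y_demo = (
    200_000 * X_demo["Rooms"]
    + 3_000 * X_demo["BuildingArea"]
    + rng.normal(0, 50_000, n_rows)
)

# Blank out roughly half of BuildingArea to mimic the gaps in the real column
missing_mask = rng.random(n_rows) < 0.5
X_demo.loc[missing_mask, "BuildingArea"] = np.nan

train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)

# Each pipeline learns column medians from the training rows only,
# then uses them to fill NaNs in any data it later sees.
models = {
    "Linear Regression": make_pipeline(
        SimpleImputer(strategy="median"), LinearRegression()
    ),
    "Decision Tree": make_pipeline(
        SimpleImputer(strategy="median"), DecisionTreeRegressor(random_state=1)
    ),
}

# No rows are dropped, so both models are scored on identical validation rows
maes = {}
for name, pipeline in models.items():
    pipeline.fit(train_X_demo, train_y_demo)
    maes[name] = mean_absolute_error(val_y_demo, pipeline.predict(val_X_demo))
    print(f"{name} MAE: {maes[name]:,.0f}")
```

Putting the imputer inside the pipeline means the fill values come from the training rows only, which avoids leaking validation information into the fit.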