In [1]:
import pandas as pd

df_model = pd.read_csv("../data/uk_rent_clean_rooms_price.csv")
df_model.head()


Unnamed: 0,Rooms_numeric,Price_numeric
0,1.0,625
1,1.0,625
2,3.0,6000
3,2.0,1400
4,2.0,1400


In [2]:
# pandas was already imported in the first cell
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [5]:
# Feature matrix X (one column) and target vector y
X = df_model[["Rooms_numeric"]]
y = df_model["Price_numeric"]

print("X shape: ", X.shape)
print("y shape: ", y.shape)

X shape:  (38952, 1)
y shape:  (38952,)


In [7]:
# Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size = 0.2,
    random_state = 42
)

print("training size: ",X_train.shape[0])
print("Test size: ", X_test.shape[0])

training size:  31161
Test size:  7791


In [9]:
lin_reg = LinearRegression()

# Fit the model on the training set
lin_reg.fit(X_train, y_train)

print("Intercept (bias): ", lin_reg.intercept_)
print("Coefficient for Rooms: ", lin_reg.coef_[0])


Intercept (bias):  2004.1236604999194
Coefficient for Rooms:  275.42884193826023


In [10]:
# Predict on the test set
y_pred = lin_reg.predict(X_test)

print("First five predictions: ", y_pred[:5])
print("Actual first five: ",y_test.values[:5])

First five predictions:  [2554.98134438 2830.41018631 2554.98134438 2554.98134438 2554.98134438]
Actual first five:  [10833  1800  4796  4780  2058]


In [11]:
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # square root of the MSE, in the same units as price (£)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f} £")
print(f"R^2:  {r2:.3f}")

RMSE: 2198.52 £
R^2:  0.013


## Baseline Model Summary: Rooms => Price
### Model
- Model type: **Linear Regression**
- Train/test split: 80% train, 20% test
- Learned parameters:
  - **Intercept (bias)** ≈ 2004.12
  - **Coefficient for Rooms** ≈ 275.43

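This gives the prediction formula:

\[
\hat{y} = 2004.12 + 275.43\,x
\]

where *x* is the number of rooms (`Rooms_numeric`) and *ŷ* is the predicted rent in £ (`Price_numeric`).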
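As a quick check of the formula, the sketch below plugs in one, two and three rooms. The helper name `predict_rent` is illustrative only, and the parameters are the rounded values listed above, so results can differ slightly from `lin_reg.predict`.

```python
# Rounded parameters taken from the fitted model above.
INTERCEPT = 2004.12  # predicted rent (£) at zero rooms
SLOPE = 275.43       # extra predicted rent (£) per additional room


def predict_rent(rooms: float) -> float:
    """Hypothetical helper: apply the fitted line to a room count."""
    return INTERCEPT + SLOPE * rooms


for rooms in (1, 2, 3):
    print(f"{rooms} room(s): £{predict_rent(rooms):.2f}")
```

The two-room and three-room values agree, to two decimals, with the 2554.98 and 2830.41 that appear in the first five test predictions.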


### Performance (on test set)
- **RMSE** ≈ £2198.52  
  => Predictions are typically off by about £2,200 (RMSE weights large errors more heavily than a plain average error would).
- **R²** ≈ 0.013  
  => Only about **1.3%** of the variance in rental prices is explained by the number of rooms alone.
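To put the two metrics side by side: scikit-learn's `r2_score` is defined as 1 − MSE(model) / MSE(mean), where the reference is a model that always predicts the mean of the test targets. The sketch below rearranges that definition to estimate the RMSE of such a mean-only reference from the reported numbers. Variable names are illustrative, and the result is approximate because R² is rounded to three decimals.

```python
import math

rmse_model = 2198.52  # test RMSE reported above (£)
r2 = 0.013            # test R² reported above (rounded)

# R² = 1 - MSE_model / MSE_mean  =>  RMSE_mean = RMSE_model / sqrt(1 - R²)
rmse_mean_only = rmse_model / math.sqrt(1 - r2)

# Relative reduction in RMSE achieved by using the number of rooms
improvement = 1 - rmse_model / rmse_mean_only

print(f"RMSE of always predicting the mean: about £{rmse_mean_only:.0f}")
print(f"RMSE reduction from using rooms:    about {improvement:.1%}")
```

Under these figures, the single-feature model reduces the error by less than 1% compared with a constant prediction.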

### Interpretation
- The model has learned a positive trend:  
  => *More rooms => higher predicted rent* (~£275 extra per additional room).
- However, the RMSE (~£2,200) is about as large as the predictions themselves (roughly £2,280 to £2,830 for one to three rooms), and R² is very low.  
- This suggests that **the number of rooms alone is not enough** to predict rent accurately.
- Other factors (especially **location** and property type) are likely to be very important.

### To examine later on
- Add more features (e.g. one-hot encode `Location` as a categorical variable).
- Refit the model with multiple features and compare RMSE and R² against this baseline.
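One possible shape for that follow-up is sketched below, under several assumptions: the full dataset is assumed to contain a `Location` column, the rows in `df_toy` are invented purely for illustration (they are not taken from the real data), and the pipeline layout is just one reasonable option. It one-hot encodes the location and passes the room count through unchanged before fitting the same linear model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows for illustration only; NOT taken from the real dataset.
df_toy = pd.DataFrame(
    {
        "Rooms_numeric": [1, 2, 3, 1, 2, 3],
        "Location": ["London", "London", "London", "Leeds", "Leeds", "Leeds"],
        "Price_numeric": [1500, 2200, 3000, 600, 850, 1100],
    }
)

X = df_toy[["Rooms_numeric", "Location"]]
y = df_toy["Price_numeric"]

# One-hot encode Location; pass Rooms_numeric through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("location", OneHotEncoder(handle_unknown="ignore"), ["Location"]),
    ],
    remainder="passthrough",
)

model = Pipeline(
    [
        ("preprocess", preprocess),
        ("regress", LinearRegression()),
    ]
)
model.fit(X, y)

# Same room count, different locations
new_rows = pd.DataFrame(
    {"Rooms_numeric": [2, 2], "Location": ["London", "Leeds"]}
)
preds = model.predict(new_rows)
print(preds)
```

On the real data, the same pipeline could be fitted on `X_train` and scored on `X_test` with `mean_squared_error` and `r2_score`, so that the results are directly comparable with the baseline above.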
