# Multimodal Property Price Prediction

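This notebook trains two Ridge regression models to predict log-transformed property prices:
1. Tabular features only
2. Tabular features combined with satellite image embeddings

Both models are evaluated on a held-out validation split. The multimodal model is then refit on all training data to generate the final test predictions.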

In [8]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Ridge

## Data Inputs

This notebook consumes preprocessed tabular features and precomputed satellite
image embeddings. Image embeddings are generated separately using a pretrained
CNN to decouple expensive visual feature extraction from downstream regression
modeling.

In [9]:
train_tab = pd.read_csv("../data/processed/train_tabular.csv")
test_tab  = pd.read_csv("../data/processed/test_tabular.csv")

X_img_train = np.load("../data/processed/train_image_embeddings.npy")
X_img_test  = np.load("../data/processed/test_image_embeddings.npy")

print(train_tab.shape, X_img_train.shape)

(16209, 14) (16209, 512)
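
The tabular rows and the embedding rows are paired purely by position, so the shape printout is the only evidence here that they line up. A minimal sketch of a stricter check follows, assuming the embeddings were written in the same order as the CSV rows. `check_row_alignment` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd


def check_row_alignment(tab: pd.DataFrame, emb: np.ndarray) -> None:
    """Raise ValueError if the table and embedding matrix cannot be paired row by row."""
    if emb.ndim != 2:
        raise ValueError(f"expected a 2-D embedding matrix, got shape {emb.shape}")
    if len(tab) != emb.shape[0]:
        raise ValueError(
            f"row count mismatch: {len(tab)} tabular rows vs {emb.shape[0]} embeddings"
        )
    if not np.isfinite(emb).all():
        raise ValueError("embedding matrix contains NaN or infinite values")


# Toy stand-ins for train_tab and X_img_train (hypothetical values).
toy_tab = pd.DataFrame({"sqft": [900, 1200, 1500]})
toy_emb = np.zeros((3, 4))
check_row_alignment(toy_tab, toy_emb)  # passes silently
```

This only validates counts and values. It cannot detect a reordering, which would need a shared key such as the `image_path` column.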


In [10]:
X_tab = train_tab.drop(columns=["log_price", "image_path"])
y = train_tab["log_price"]

X_tab_test = test_tab.drop(columns=["image_path"])
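
Ridge receives plain feature matrices, so the test columns must match the training columns in both name and order. The sketch below shows one way to enforce that, under the assumption that `test_tab` holds the same feature columns as `train_tab` minus the target. `align_columns` and the toy column names are hypothetical.

```python
import pandas as pd


def align_columns(train_X: pd.DataFrame, test_X: pd.DataFrame) -> pd.DataFrame:
    """Return test_X with exactly the training columns, in training order."""
    missing = [c for c in train_X.columns if c not in test_X.columns]
    extra = [c for c in test_X.columns if c not in train_X.columns]
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    return test_X[list(train_X.columns)]


# Toy frames with hypothetical column names, deliberately in different orders.
toy_train = pd.DataFrame({"bedrooms": [3], "sqft": [1200]})
toy_test = pd.DataFrame({"sqft": [950], "bedrooms": [2]})
aligned = align_columns(toy_train, toy_test)
print(list(aligned.columns))  # ['bedrooms', 'sqft']
```

In the notebook this would be called as `align_columns(X_tab, X_tab_test)` before any `.values` conversion, since the column names are lost once the frames become arrays.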

In [11]:
X_tab_tr, X_tab_val, y_tr, y_val, \
X_img_tr, X_img_val = train_test_split(
    X_tab, y, X_img_train,
    test_size=0.2,
    random_state=42
)
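
Passing several arrays to `train_test_split` applies one shuffle to all of them, which keeps each embedding paired with its tabular row and target. The NumPy sketch below illustrates the idea with a single shared permutation. It is an illustration only and does not reproduce scikit-learn's exact index selection; `shared_split` is a hypothetical helper.

```python
import numpy as np


def shared_split(arrays, test_size=0.2, seed=42):
    """Split every array with the same shuffled indices; returns (train, val) pairs."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of rows")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_val = int(np.ceil(n * test_size))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    return [(a[train_idx], a[val_idx]) for a in arrays]


# Toy data: row i of each array carries the value i, so pairing is easy to verify.
toy_tab = np.arange(10).reshape(-1, 1)
toy_y = np.arange(10)
toy_emb = np.arange(10).reshape(-1, 1) * np.ones((1, 3))

(tab_a, tab_b), (y_a, y_b), (emb_a, emb_b) = shared_split([toy_tab, toy_y, toy_emb])

# Every validation row still refers to the same original sample in all three arrays.
assert np.array_equal(tab_b[:, 0], y_b)
assert np.array_equal(emb_b[:, 0], y_b)
```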

#### Tabular Model

In [14]:
# Baseline: Ridge regression on tabular features only
tabular_model = Ridge(alpha=1.0)
tabular_model.fit(X_tab_tr, y_tr)

y_val_pred_tab = tabular_model.predict(X_tab_val)

# Validation metrics, computed in log-price units
mse_tab = mean_squared_error(y_val, y_val_pred_tab)
rmse_tab = np.sqrt(mse_tab)
r2_tab = r2_score(y_val, y_val_pred_tab)

rmse_tab, r2_tab

(np.float64(0.2613044529181127), 0.7525671179379543)

#### Multimodal Feature Fusion

In [15]:
# Concatenate tabular features and image embeddings column-wise
X_multi_tr = np.hstack([X_tab_tr.values, X_img_tr])
X_multi_val = np.hstack([X_tab_val.values, X_img_val])
X_multi_test = np.hstack([X_tab_test.values, X_img_test])

In [16]:
multi_model = Ridge(alpha=1.0)
multi_model.fit(X_multi_tr, y_tr)

Ridge parameters: alpha=1.0, fit_intercept=True, copy_X=True, max_iter=None, tol=0.0001, solver='auto', positive=False, random_state=None

In [18]:
y_val_pred_multi = multi_model.predict(X_multi_val)

# Validation metrics, computed in log-price units
mse_multi = mean_squared_error(y_val, y_val_pred_multi)
rmse_multi = np.sqrt(mse_multi)
r2_multi = r2_score(y_val, y_val_pred_multi)

rmse_multi, r2_multi

(np.float64(0.24086084279734465), 0.7897692712388062)

The gain is modest: validation RMSE on log price falls from 0.261 to 0.241, and R² rises from 0.753 to 0.790. This suggests that the satellite image embeddings carry neighborhood-level information that complements the tabular house features rather than replacing them.
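
To put the two validation scores side by side, the arithmetic below computes the relative RMSE reduction and converts each log-scale RMSE into an approximate multiplicative price error. It assumes `log_price` is a natural logarithm, which the notebook's later use of `np.exp` suggests. The scores are copied from the cell outputs above.

```python
import math

# Validation scores copied from the notebook outputs.
tab_rmse, tab_r2 = 0.2613044529181127, 0.7525671179379543
multi_rmse, multi_r2 = 0.24086084279734465, 0.7897692712388062

# Relative reduction in validation RMSE from adding image embeddings.
rmse_reduction = (tab_rmse - multi_rmse) / tab_rmse

# Absolute gain in R^2.
r2_gain = multi_r2 - tab_r2

# An error of r in natural-log units corresponds to a factor of exp(r) in price.
factor_tab = math.exp(tab_rmse)
factor_multi = math.exp(multi_rmse)

print(f"RMSE reduction: {rmse_reduction:.1%}")
print(f"R^2 gain: {r2_gain:.3f}")
print(f"Typical price error factor: {factor_tab:.2f} -> {factor_multi:.2f}")
```

This works out to roughly a 7.8% reduction in RMSE, an R² gain of about 0.037, and a typical error factor that drops from about 1.30 to about 1.27. These figures come from a single validation split, so they are indicative rather than precise.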

In [19]:
X_multi_full = np.hstack([X_tab.values, X_img_train])

final_model = Ridge(alpha=1.0)
final_model.fit(X_multi_full, y)

Ridge parameters: alpha=1.0, fit_intercept=True, copy_X=True, max_iter=None, tol=0.0001, solver='auto', positive=False, random_state=None


In [20]:
log_price_preds = final_model.predict(X_multi_test)
price_preds = np.exp(log_price_preds)
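
The back-transform must invert whatever produced `log_price` during preprocessing, which is not shown in this notebook. Below is a small sketch of the two common pairings, assuming the target was built with either `np.log` or `np.log1p`. The prices are made-up values.

```python
import numpy as np

prices = np.array([250_000.0, 480_000.0, 1_200_000.0])  # hypothetical prices

# Pairing 1: np.log forward, np.exp inverse (what the cell above assumes).
assert np.allclose(np.exp(np.log(prices)), prices)

# Pairing 2: np.log1p forward, np.expm1 inverse.
assert np.allclose(np.expm1(np.log1p(prices)), prices)

# Mixing the pairs shifts every prediction by one currency unit.
mismatch = np.exp(np.log1p(prices)) - prices
print(mismatch)
```

At these price levels the offset is negligible, but matching the inverse to the forward transform keeps the pipeline exact.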

## Test Set Predictions and Submission Format

The test dataset does not provide a unique property identifier. Although a house
identifier exists in the raw data, it is non-unique and therefore unsuitable as
a submission key. Predictions are indexed using the original row order of the
test set, which serves as a consistent and unambiguous identifier for evaluation.
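
The uniqueness claim can be checked directly. The sketch below assumes a raw identifier column named `house_id` (a hypothetical name, since the raw data is not shown) and falls back to row order when duplicates are present, as this notebook does.

```python
import pandas as pd


def choose_submission_id(df: pd.DataFrame, id_col: str) -> pd.Series:
    """Use id_col if it exists and is unique; otherwise fall back to row position."""
    if id_col in df.columns and df[id_col].is_unique:
        return df[id_col].reset_index(drop=True)
    return pd.Series(range(len(df)), name="id")


# Toy test set in which one house appears twice (hypothetical values).
toy_test = pd.DataFrame({"house_id": [101, 102, 101], "sqft": [900, 1200, 910]})
ids = choose_submission_id(toy_test, "house_id")
print(ids.tolist())  # [0, 1, 2]
```

Row-order ids are only meaningful if the evaluator reads the test set in the same order, so the test file should not be shuffled or filtered between loading and submission.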

In [24]:
submission = pd.DataFrame({
    "id": test_tab.index,
    "predicted_price": price_preds
})

submission.to_csv("../submission.csv", index=False)
submission.head(10)

   id  predicted_price
0   0         437584.5
1   1        1018171.0
2   2        1301534.0
3   3        1940665.0
4   4         707450.7
5   5         304621.8
6   6         756110.7
7   7         668257.2
8   8         408072.0
9   9         524435.2
