# 🏠 House Price Prediction - Data Preprocessing

This notebook handles all data preprocessing steps:
1. Tabular feature engineering
2. Image embedding extraction (ResNet50)
3. Geo-visual feature extraction
4. Transport distance features (OpenStreetMap)

**By:** Anshit Das


## 📚 Import Libraries

In [10]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from PIL import Image
from scipy import ndimage
from skimage.feature import graycomatrix, graycoprops

import torch
import torch.nn as nn
from torchvision import models, transforms

import osmnx as ox
from sklearn.neighbors import BallTree

import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## ⚙️ Configuration

In [11]:
# Paths
CSV_PATH = "/kaggle/input/dataaa/train(1)(train(1)).csv"  # Update with your path
IMAGE_DIR = "/kaggle/working/mapbox_images-1"  # Update with your path
OUTPUT_PATH = "processed_data.csv"

# Settings
CURRENT_YEAR = 2015
IMAGE_SIZE = 512
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
TRANSPORT_SEARCH_RADIUS = 40000  # meters

print(f"Device: {DEVICE}")
print(f"Image directory: {IMAGE_DIR}")

Device: cuda
Image directory: /kaggle/working/mapbox_images-1


## 📊 Load Data

In [12]:
# Load CSV
df = pd.read_csv(CSV_PATH)

print(f"📊 Loaded {len(df)} rows")
print(f"\nShape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

# Display first few rows
df.head()

📊 Loaded 16209 rows

Shape: (16209, 21)

Columns: ['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']


Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,9117000170,20150505T000000,268643,4,2.25,1810,9240,2.0,0,0,3,7,1810,0,1961,0,98055,47.4362,-122.187,1660,9240
1,6700390210,20140708T000000,245000,3,2.5,1600,2788,2.0,0,0,4,7,1600,0,1992,0,98031,47.4034,-122.187,1720,3605
2,7212660540,20150115T000000,200000,4,2.5,1720,8638,2.0,0,0,3,8,1720,0,1994,0,98003,47.2704,-122.313,1870,7455
3,8562780200,20150427T000000,352499,2,2.25,1240,705,2.0,0,0,3,7,1150,90,2009,0,98027,47.5321,-122.073,1240,750
4,7760400350,20141205T000000,232000,3,2.0,1280,13356,1.0,0,0,3,7,1280,0,1994,0,98042,47.3715,-122.074,1590,8071


In [13]:
# Remove duplicates
before = len(df)
df = df.drop_duplicates(subset='id', keep='first')
after = len(df)

print(f"✅ Removed {before - after} duplicate rows")
print(f"📊 Rows before: {before}, Rows after: {after}")

✅ Removed 99 duplicate rows
📊 Rows before: 16209, Rows after: 16110


## 🔧 1. Tabular Feature Engineering

In [14]:
print("🔧 Engineering tabular features...\n")

# Parse date
df['date'] = (
    df['date']
    .astype(str)
    .str.split('T')
    .str[0]
    .pipe(pd.to_datetime, format='%Y%m%d')
)
print("✓ Date parsed")

# Target transformation
df['log_price'] = np.log1p(df['price'])
print("✓ Log price created")

# Ratio features
# The 1e-3 offset avoids division by zero, but rows with 0 bedrooms get very large ratios
df['bath_per_bed'] = df['bathrooms'] / (df['bedrooms'] + 1e-3)
df['sqft_per_bed'] = df['sqft_living'] / (df['bedrooms'] + 1e-3)
df['lot_to_living_ratio'] = df['sqft_lot'] / df['sqft_living']
df['basement_ratio'] = df['sqft_basement'] / df['sqft_living']
print("✓ Ratio features created")

# Age features
df['house_age'] = CURRENT_YEAR - df['yr_built']
df['renovated'] = (df['yr_renovated'] > 0).astype(int)
# Houses that were never renovated fall back to the age of the house
df['years_since_reno'] = np.where(
    df['renovated'] == 1,
    CURRENT_YEAR - df['yr_renovated'],
    df['house_age']
)
print("✓ Age features created")

# Quality features
df['quality_area'] = df['sqft_living'] * df['grade']
df['condition_area'] = df['sqft_living'] * df['condition']
print("✓ Quality features created")

# View features
df['view_score'] = df['view'] ** 1.5
df['waterfront_premium'] = df['waterfront'] * df['sqft_living']
print("✓ View features created")

# Neighborhood comparison features
df['relative_living_size'] = df['sqft_living'] / df['sqft_living15']
df['relative_lot_size'] = df['sqft_lot'] / df['sqft_lot15']
print("✓ Neighborhood features created")

print("\n✅ Tabular feature engineering complete!")
print(f"Total features: {df.shape[1]}")

🔧 Engineering tabular features...

✓ Date parsed
✓ Log price created
✓ Ratio features created
✓ Age features created
✓ Quality features created
✓ View features created
✓ Neighborhood features created

✅ Tabular feature engineering complete!
Total features: 35
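
The age features above are measured from a fixed `CURRENT_YEAR` of 2015, while the sample rows show sale dates in both 2014 and 2015. One possible alternative, sketched here on toy data, is to measure age at the time of sale. The column names `house_age_at_sale` and `years_since_reno_at_sale` and the renovation year are hypothetical, and this is not what the notebook does.

```python
import pandas as pd

# Toy rows: dates and build years echo df.head(); the renovation year is made up.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-05-05", "2014-07-08"]),
    "yr_built": [1961, 1992],
    "yr_renovated": [0, 2005],
})

# Use the year of each sale instead of a single global constant.
sale_year = toy["date"].dt.year
toy["house_age_at_sale"] = sale_year - toy["yr_built"]

# Renovated homes count from the renovation year, others from the build year.
renovated = toy["yr_renovated"] > 0
toy["years_since_reno_at_sale"] = (sale_year - toy["yr_renovated"]).where(
    renovated, toy["house_age_at_sale"]
)

print(toy[["house_age_at_sale", "years_since_reno_at_sale"]])
```

With this variant an age can in principle come out negative if a home is sold before its recorded build year, so the minimum is worth checking before the feature is used.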


In [15]:
# Display new features
new_cols = ['bath_per_bed', 'sqft_per_bed', 'house_age', 'quality_area', 'view_score']
df[new_cols].describe()

Unnamed: 0,bath_per_bed,sqft_per_bed,house_age,quality_area,view_score
count,16110.0,16110.0,16110.0,16110.0,16110.0
mean,1.106221,1395.11,43.769336,16698.770267,0.381207
std,31.256126,45780.53,29.384455,10062.472475,1.334629
min,0.0,49.08942,0.0,290.0,0.0
25%,0.499833,469.7651,18.0,10080.0,0.0
50%,0.624844,576.4745,40.0,14140.0,0.0
75%,0.749813,722.3194,63.0,20560.0,0.0
max,2500.0,4810000.0,115.0,156650.0,8.0
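
The maxima of 2500 for `bath_per_bed` and 4,810,000 for `sqft_per_bed` in the table above are consistent with a row that has 0 bedrooms being divided by the `1e-3` offset. A minimal sketch of one alternative, assuming it is acceptable to leave the ratio undefined for such rows and to flag them separately, is shown below. The toy values and the `no_bedrooms` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: the middle row has no bedrooms.
toy = pd.DataFrame({
    "bedrooms": [3, 0, 4],
    "bathrooms": [2.0, 2.5, 1.0],
    "sqft_living": [1500, 4810, 2000],
})

# Replace zero bedroom counts with NaN so the ratio is undefined rather than huge.
beds = toy["bedrooms"].replace(0, np.nan)
toy["bath_per_bed"] = toy["bathrooms"] / beds
toy["sqft_per_bed"] = toy["sqft_living"] / beds

# Keep the information that the row had no bedrooms as its own indicator.
toy["no_bedrooms"] = (toy["bedrooms"] == 0).astype(int)

print(toy)
```

Whether NaN is the right encoding depends on the downstream model: tree-based libraries often accept missing values directly, while linear models would need an imputation step.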


## 🖼️ 2. Image Embedding Extraction (ResNet50)

In [16]:
print("🖼️ Setting up ResNet50 for image embeddings...\n")

# Load pretrained ResNet50
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model = nn.Sequential(*list(model.children())[:-1])  # Remove classifier
model = model.to(DEVICE)
model.eval()

print(f"✓ ResNet50 loaded on {DEVICE}")

# Image preprocessing
transform = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

print("✓ Image transforms ready")

🖼️ Setting up ResNet50 for image embeddings...

✓ ResNet50 loaded on cuda
✓ Image transforms ready


In [17]:
# Create image lookup
image_files = {
    f.split("_")[0]: f
    for f in os.listdir(IMAGE_DIR)
    if f.endswith(".png")
}

print(f"Found {len(image_files)} images")

Found 16110 images
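
The lookup above keys each image by the text before the first underscore in its file name. A small sanity check along the following lines can reveal ids without an image, or two files that map to the same id, before the slow extraction loop starts. The file names and the `_sat` suffix are hypothetical, since the actual naming pattern is not shown in the notebook.

```python
# Hypothetical file names following an "<id>_<suffix>.png" pattern.
file_names = ["9117000170_sat.png", "6700390210_sat.png", "notes.txt"]
property_ids = ["9117000170", "6700390210", "7212660540"]

image_lookup = {}
collisions = []
for name in file_names:
    if not name.endswith(".png"):
        continue  # skip anything that is not an image
    key = name.split("_")[0]
    if key in image_lookup:
        # A second file for the same id would silently overwrite the first.
        collisions.append(name)
    image_lookup[key] = name

# Ids from the table that have no matching image file.
missing = [pid for pid in property_ids if pid not in image_lookup]

print(f"images: {len(image_lookup)}, missing: {missing}, collisions: {collisions}")
```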


In [18]:
# Extract embeddings
def extract_embedding(img_path):
    """Extract 2048-dim embedding from image."""
    img = Image.open(img_path).convert("RGB")
    x = transform(img).unsqueeze(0).to(DEVICE)
    
    with torch.no_grad():
        emb = model(x)
        emb = emb.squeeze()
    
    return emb.cpu().numpy()

# Process all images
rows = []

for pid in tqdm(df['id'], desc="Extracting CNN embeddings"):
    pid = str(pid)
    
    if pid not in image_files:
        continue
    
    img_path = os.path.join(IMAGE_DIR, image_files[pid])
    embedding = extract_embedding(img_path)
    
    row = {"id": pid}
    for i, val in enumerate(embedding):
        row[f"img_{i}"] = val
    
    rows.append(row)

# Create embedding dataframe
img_emb_df = pd.DataFrame(rows)

print(f"\n✅ CNN embeddings extracted: {img_emb_df.shape}")

Extracting CNN embeddings: 100%|██████████| 16110/16110 [09:03<00:00, 29.62it/s]

✅ CNN embeddings extracted: (16110, 2049)
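
The loop above builds one Python dict with 2048 keys per image. As a sketch of an alternative, the vectors can be collected in a list, stacked into a single array, and turned into a DataFrame in one step. The random vectors below only stand in for real ResNet50 outputs, and `emb_df` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Stand-in embeddings: 3 properties, 2048 float32 values each (random, for illustration only).
rng = np.random.default_rng(0)
ids = ["9117000170", "6700390210", "7212660540"]
embeddings = [rng.standard_normal(2048).astype(np.float32) for _ in ids]

# Stack into an (n_images, 2048) matrix and name the columns img_0 ... img_2047.
matrix = np.stack(embeddings)
columns = [f"img_{i}" for i in range(matrix.shape[1])]
emb_df = pd.DataFrame(matrix, columns=columns)

# Put the id first so the frame has the same layout as the loop-based version.
emb_df.insert(0, "id", ids)

print(emb_df.shape)
```

This keeps the values as `float32` and avoids creating millions of small dict entries, which tends to matter more as the number of images grows.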


In [19]:
# Merge with main dataframe
df['id'] = df['id'].astype(str)
img_emb_df['id'] = img_emb_df['id'].astype(str)

df = df.merge(img_emb_df, on='id', how='left')

print("✓ Image embeddings merged")
print(f"New shape: {df.shape}")

✓ Image embeddings merged
New shape: (16110, 2083)


## 🌍 3. Geo-Visual Feature Extraction

In [20]:
print("🌍 Extracting geo-visual features from images...\n")

# Helper functions
def load_image(path, size=(512, 512)):
    """Load and normalize image."""
    img = Image.open(path).convert("RGB")
    img = img.resize(size, Image.BILINEAR)
    arr = np.asarray(img).astype(np.float32) / 255.0
    return arr

def excess_green(arr):
    """Calculate Excess Green Index (vegetation proxy)."""
    r, g, b = arr[:,:,0], arr[:,:,1], arr[:,:,2]
    exg = 2*g - r - b
    exg = (exg - exg.min()) / (exg.max() - exg.min() + 1e-9)
    return exg

def green_fraction(arr, thresh=0.15):
    """Fraction of pixels that are green (vegetation)."""
    exg = excess_green(arr)
    return float((exg > thresh).mean())

def impervious_fraction(arr, bright_thresh=0.6, green_thresh=0.12):
    """Fraction of pixels that are impervious surfaces."""
    brightness = arr.mean(axis=2)
    exg = excess_green(arr)
    mask = (brightness > bright_thresh) & (exg < green_thresh)
    return float(mask.mean())

def edge_density(arr, thresh=0.15):
    """Density of edges (complexity of scene)."""
    gray = arr.mean(axis=2)
    sx = ndimage.sobel(gray, axis=0)
    sy = ndimage.sobel(gray, axis=1)
    grad = np.hypot(sx, sy)
    grad = (grad - grad.min()) / (grad.max() - grad.min() + 1e-9)
    return float((grad > thresh).mean())

def brightness_stats(arr):
    """Mean and std of brightness."""
    b = arr.mean(axis=2)
    return float(b.mean()), float(b.std())

def texture_features(arr):
    """GLCM texture features."""
    gray = (arr.mean(axis=2) * 255).astype(np.uint8)
    glcm = graycomatrix(
        gray,
        distances=[2],
        angles=[0],
        levels=256,
        symmetric=True,
        normed=True
    )
    contrast = graycoprops(glcm, 'contrast')[0, 0]
    homogeneity = graycoprops(glcm, 'homogeneity')[0, 0]
    return float(contrast), float(homogeneity)

print("✓ Helper functions defined")

🌍 Extracting geo-visual features from images...

✓ Helper functions defined


In [21]:
# Extract features for all images
geo_rows = []

for pid in tqdm(df['id'], desc="Extracting geo-visual features"):
    pid = str(pid)
    
    if pid not in image_files:
        # Image missing → fill NaNs
        geo_rows.append({
            "id": pid,
            "green_fraction": np.nan,
            "impervious_fraction": np.nan,
            "edge_density": np.nan,
            "brightness_mean": np.nan,
            "brightness_std": np.nan,
            "texture_contrast": np.nan,
            "texture_homogeneity": np.nan
        })
        continue
    
    img_path = os.path.join(IMAGE_DIR, image_files[pid])
    
    try:
        arr = load_image(img_path)
        
        gf = green_fraction(arr)
        imp = impervious_fraction(arr)
        ed = edge_density(arr)
        b_mean, b_std = brightness_stats(arr)
        tex_con, tex_hom = texture_features(arr)
        
        geo_rows.append({
            "id": pid,
            "green_fraction": gf,
            "impervious_fraction": imp,
            "edge_density": ed,
            "brightness_mean": b_mean,
            "brightness_std": b_std,
            "texture_contrast": tex_con,
            "texture_homogeneity": tex_hom
        })
    
    except Exception as e:
        print(f"❌ Error processing id={pid}: {e}")
        geo_rows.append({
            "id": pid,
            "green_fraction": np.nan,
            "impervious_fraction": np.nan,
            "edge_density": np.nan,
            "brightness_mean": np.nan,
            "brightness_std": np.nan,
            "texture_contrast": np.nan,
            "texture_homogeneity": np.nan
        })

# Create geo-visual dataframe
geo_df = pd.DataFrame(geo_rows)

print(f"\n✅ Geo-visual features extracted: {geo_df.shape}")

Extracting geo-visual features: 100%|██████████| 16110/16110 [13:45<00:00, 19.52it/s]

✅ Geo-visual features extracted: (16110, 8)


In [22]:
# Merge with main dataframe
df = df.merge(geo_df, on="id", how="left")

print("✓ Geo-visual features merged")
print(f"New shape: {df.shape}")

# Display sample features
geo_cols = ['green_fraction', 'impervious_fraction', 'edge_density', 'brightness_mean']
df[geo_cols].describe()

✓ Geo-visual features merged
New shape: (16110, 2090)


Unnamed: 0,green_fraction,impervious_fraction,edge_density,brightness_mean
count,16110.0,16110.0,16110.0,16110.0
mean,0.995318,9.8e-05,0.262282,0.349901
std,0.025951,0.000976,0.042047,0.076173
min,0.27861,0.0,0.033867,0.104035
25%,0.998985,0.0,0.235504,0.29593
50%,0.99968,0.0,0.261347,0.351284
75%,0.999866,1.1e-05,0.290174,0.405003
max,0.999989,0.041489,0.453255,0.602635
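
A mean `green_fraction` of about 0.995 and a near-zero `impervious_fraction` suggest that the threshold marks almost every pixel as vegetation. One likely cause is the per-image min-max scaling in `excess_green`: after scaling, a neutral grey pixel usually lands well above 0.15. The sketch below compares the notebook's logic with a variant that thresholds the raw index instead. The function names and the raw threshold of 0.05 are assumptions for illustration, not tuned values.

```python
import numpy as np

def green_fraction_minmax(arr, thresh=0.15):
    """Same logic as the notebook: threshold ExG after per-image min-max scaling."""
    r, g, b = arr[:, :, 0], arr[:, :, 1], arr[:, :, 2]
    exg = 2 * g - r - b
    exg = (exg - exg.min()) / (exg.max() - exg.min() + 1e-9)
    return float((exg > thresh).mean())

def green_fraction_raw(arr, thresh=0.05):
    """Variant: threshold the raw ExG value (2g - r - b), with no rescaling."""
    r, g, b = arr[:, :, 0], arr[:, :, 1], arr[:, :, 2]
    exg = 2 * g - r - b
    return float((exg > thresh).mean())

# Synthetic 4x4 image: grey everywhere, plus one green and one magenta pixel.
img = np.full((4, 4, 3), 0.5, dtype=np.float32)
img[0, 0] = [0.2, 0.8, 0.2]  # green pixel, ExG = 1.6 - 0.2 - 0.2 = 1.2
img[0, 1] = [0.8, 0.2, 0.8]  # magenta pixel, ExG = 0.4 - 0.8 - 0.8 = -1.2

print(green_fraction_minmax(img))  # 0.9375: every grey pixel counts as green
print(green_fraction_raw(img))     # 0.0625: only the green pixel counts
```

On this toy image the scaled version reports 15 of 16 pixels as green even though only one is. Whether the same effect explains the statistics above would need to be confirmed on the real tiles.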


## 🚇 4. Transport Distance Features (OpenStreetMap)

In [23]:
print("🚇 Fetching transport infrastructure from OpenStreetMap...\n")

# Get center point
center_lat = df.lat.mean()
center_lon = df.long.mean()
house_coords = df[['lat', 'long']].values

print(f"Center point: ({center_lat:.4f}, {center_lon:.4f})")
print(f"Search radius: {TRANSPORT_SEARCH_RADIUS/1000:.1f} km")

🚇 Fetching transport infrastructure from OpenStreetMap...

Center point: (47.5607, -122.2139)
Search radius: 40.0 km


In [24]:
# Fetch metro stations
print("\n➡️ Fetching METRO stations...")
# Note: OSMnx returns the union of the tag matches below, not their intersection,
# so this query can also return railway and other public-transport stations.
metro = ox.features_from_point(
    (center_lat, center_lon),
    tags={
        "railway": ["subway_entrance", "station"],
        "station": "subway",
        "public_transport": "station"
    },
    dist=TRANSPORT_SEARCH_RADIUS
)
print(f"✅ Metro fetched: {len(metro)} stations")


➡️ Fetching METRO stations...
✅ Metro fetched: 247 stations


In [25]:
# Fetch railway stations
print("\n➡️ Fetching RAILWAY stations...")
rail = ox.features_from_point(
    (center_lat, center_lon),
    tags={"railway": ["station", "halt", "stop"]},
    dist=TRANSPORT_SEARCH_RADIUS
)
print(f"✅ Railway fetched: {len(rail)} stations")


➡️ Fetching RAILWAY stations...
✅ Railway fetched: 159 stations


In [26]:
# Fetch airports
print("\n➡️ Fetching AIRPORTS...")
airport = ox.features_from_point(
    (center_lat, center_lon),
    tags={"aeroway": ["aerodrome", "airport"]},
    dist=TRANSPORT_SEARCH_RADIUS
)
print(f"✅ Airport fetched: {len(airport)} airports")


➡️ Fetching AIRPORTS...
✅ Airport fetched: 41 airports


In [27]:
# Extract coordinates
def extract_coords(gdf):
    """Extract coordinates from GeoDataFrame."""
    if gdf.empty:
        return np.empty((0, 2))
    gdf = gdf[gdf.geometry.notnull()]
    gdf = gdf[gdf.is_valid]
    centroids = gdf.geometry.centroid
    return np.column_stack((centroids.y, centroids.x))

metro_coords = extract_coords(metro)
rail_coords = extract_coords(rail)
airport_coords = extract_coords(airport)

print(f"\nCoordinates extracted:")
print(f"  Metro: {len(metro_coords)} points")
print(f"  Railway: {len(rail_coords)} points")
print(f"  Airport: {len(airport_coords)} points")


Coordinates extracted:
  Metro: 247 points
  Railway: 159 points
  Airport: 41 points


In [28]:
# Build BallTrees for efficient nearest neighbor search
def build_tree(coords):
    """Build BallTree for efficient spatial queries."""
    if len(coords) == 0:
        return None
    return BallTree(np.radians(coords), metric="haversine")

metro_tree = build_tree(metro_coords)
rail_tree = build_tree(rail_coords)
airport_tree = build_tree(airport_coords)

print("✓ BallTrees built for spatial queries")

✓ BallTrees built for spatial queries


In [29]:
# Calculate distances
house_rad = np.radians(house_coords)

def nearest_distance(tree):
    """Find distance to nearest facility."""
    if tree is None:
        return np.full(len(house_rad), np.nan)
    dist, _ = tree.query(house_rad, k=1)
    return dist.flatten() * 6371000  # radians → meters (mean Earth radius)

df['dist_to_metro_m'] = nearest_distance(metro_tree)
df['dist_to_railway_m'] = nearest_distance(rail_tree)
df['dist_to_airport_m'] = nearest_distance(airport_tree)

# Log-transformed distances
df['log_dist_to_metro'] = np.log1p(df['dist_to_metro_m'])
df['log_dist_to_railway'] = np.log1p(df['dist_to_railway_m'])
df['log_dist_to_airport'] = np.log1p(df['dist_to_airport_m'])

print("\n✅ Transport distance features calculated")
print(f"Final shape: {df.shape}")


✅ Transport distance features calculated
Final shape: (16110, 2096)
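
The BallTree with `metric="haversine"` works on `[lat, lon]` pairs in radians and returns an angle, which the code above converts to meters by multiplying by 6,371,000. As a rough cross-check of that conversion, the same great-circle distance can be computed by hand. The helper name `haversine_m` is hypothetical, and the two points are the coordinates of the first two rows shown by `df.head()`.

```python
import math

EARTH_RADIUS_M = 6_371_000  # same mean Earth radius the notebook multiplies by

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Two houses on the same meridian, about 0.0328 degrees of latitude apart.
d = haversine_m(47.4362, -122.187, 47.4034, -122.187)
print(f"{d:.0f} m")
```

Because the two points share a longitude, the result should equal the Earth radius times the latitude difference in radians, roughly 3.6 km. Comparing a few BallTree results against a helper like this is a cheap way to catch a swapped lat/lon order.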


In [30]:
# Display distance statistics
distance_cols = ['dist_to_metro_m', 'dist_to_railway_m', 'dist_to_airport_m']
df[distance_cols].describe()

Unnamed: 0,dist_to_metro_m,dist_to_railway_m,dist_to_airport_m
count,16110.0,16110.0,16110.0
mean,4012.818228,5354.991093,5716.584681
std,3777.172746,4191.067265,3070.927829
min,63.905359,62.245059,129.037803
25%,1690.212696,2263.44935,3441.879678
50%,3017.531741,4281.690207,5295.44498
75%,4801.464883,7014.750062,7340.139496
max,42844.191355,42844.191355,44619.727479


## 💾 Save Processed Data

In [31]:
# Save to CSV
df.to_csv(OUTPUT_PATH, index=False)

print(f"\n💾 Processed data saved to: {OUTPUT_PATH}")
print(f"📊 Final shape: {df.shape}")
print(f"📊 Total features: {df.shape[1]}")
print(f"📊 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")


💾 Processed data saved to: processed_data.csv
📊 Final shape: (16110, 2096)
📊 Total features: 2096
📊 Memory usage: 132.54 MB


## 📊 Data Summary

In [32]:
# Count features by type
img_cols = [c for c in df.columns if c.startswith('img_')]
geo_cols = ['green_fraction', 'impervious_fraction', 'edge_density', 
            'brightness_mean', 'brightness_std', 'texture_contrast', 'texture_homogeneity']
transport_cols = ['dist_to_metro_m', 'dist_to_railway_m', 'dist_to_airport_m',
                  'log_dist_to_metro', 'log_dist_to_railway', 'log_dist_to_airport']

print("\n" + "="*50)
print("PREPROCESSING SUMMARY")
print("="*50)
print(f"Total rows: {len(df):,}")
print(f"Total columns: {df.shape[1]:,}")
print(f"\nFeature breakdown:")
print(f"  Image embeddings: {len(img_cols)}")
print(f"  Geo-visual features: {len(geo_cols)}")
print(f"  Transport features: {len(transport_cols)}")
print(f"  Other features: {df.shape[1] - len(img_cols) - len(geo_cols) - len(transport_cols)}")
print("="*50)


PREPROCESSING SUMMARY
Total rows: 16,110
Total columns: 2,096

Feature breakdown:
  Image embeddings: 2048
  Geo-visual features: 7
  Transport features: 6
  Other features: 35


## ✅ Preprocessing Complete!

Next step: Run `model_training.ipynb` to train the multimodal model.
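
One detail to watch when the saved CSV is read back: the `id` column was cast to `str` before the merges, but `pd.read_csv` infers numeric types by default, so the ids come back as integers unless a dtype is given. The sketch below shows the effect on a toy frame written to an in-memory buffer. The values are hypothetical, and how `model_training.ipynb` actually loads the file is not shown here.

```python
import io
import pandas as pd

# Toy frame with string ids, mimicking the processed data.
toy = pd.DataFrame({"id": ["9117000170", "6700390210"], "log_price": [12.5, 12.4]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: ids are inferred as integers.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: ids stay strings, matching the in-memory frame.
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"id": str})

print(default_read["id"].dtype, string_read["id"].tolist())
```

The `date` column is affected in a similar way: it is written as text, so it needs `parse_dates=["date"]` or a later `pd.to_datetime` call to become a datetime again.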