# Real Estate Price Prediction: End-to-End ML Project

## Business Objective
Predict house prices and identify market segments to support investment decisions and pricing strategies.

**Project Scope:**
- Clustering analysis to identify market segments
- Classification model to categorize price ranges
- Regression model to predict exact prices
- Business insights for stakeholders

---


In [None]:
# Standard imports for production ML pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, classification_report, confusion_matrix
import xgboost as xgb
import joblib

# Set style for professional visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ Libraries imported successfully")
print(f"✓ Working directory: {Path.cwd()}")


## 1. Dataset Loading

Auto-detect and load the dataset from the Kaggle input directory, falling back to a local `data/` folder or a few common filenames in the working directory.
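
The lookup order described above can be sketched as a small helper. This is a minimal illustration, not the notebook's own loader: `find_first_csv` is a hypothetical name, and sorting the matches is an added assumption so that the same file is picked on every run.

```python
from pathlib import Path
import tempfile


def find_first_csv(search_dirs):
    """Return the alphabetically first CSV under the first directory that has one, else None."""
    for directory in search_dirs:
        directory = Path(directory)
        if not directory.exists():
            continue
        matches = sorted(directory.rglob("*.csv"))  # sorted() makes the pick reproducible
        if matches:
            return matches[0]
    return None


# Demo on a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "b_listing.csv").write_text("price\n1\n")
    (root / "a_listing.csv").write_text("price\n2\n")
    found = find_first_csv([root / "missing", root])
    found_name = found.name

print(found_name)
```

Missing directories are skipped rather than treated as errors, which mirrors the Kaggle-then-local fallback.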


In [None]:
# Auto-detect dataset path in Kaggle environment
# Production approach: handle both Kaggle and local environments
kaggle_input_path = Path('/kaggle/input')
local_data_path = Path('data')

# Find CSV files in input directory
if kaggle_input_path.exists():
    csv_files = sorted(kaggle_input_path.rglob('*.csv'))  # sorted so the chosen file is reproducible
    if csv_files:
        data_path = csv_files[0]
        print(f"✓ Found dataset: {data_path}")
    else:
        raise FileNotFoundError("No CSV file found in /kaggle/input/")
elif local_data_path.exists():
    csv_files = sorted(local_data_path.glob('*.csv'))  # sorted so the chosen file is reproducible
    if csv_files:
        data_path = csv_files[0]
        print(f"✓ Found dataset: {data_path}")
    else:
        raise FileNotFoundError("No CSV file found in data/")
else:
    # Fallback: try common filenames
    possible_names = ['data.csv', 'house_data.csv', 'housedata.csv', 'House_data.csv']
    data_path = None
    for name in possible_names:
        if Path(name).exists():
            data_path = Path(name)
            break
    
    if data_path is None:
        raise FileNotFoundError("Dataset not found. Please ensure CSV is in /kaggle/input/")

# Load dataset
df = pd.read_csv(data_path)
print(f"✓ Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nFirst few rows:")
df.head()


In [None]:
# Initial data inspection
print("Dataset Info:")
print(f"Shape: {df.shape}")
print(f"\nColumn names:")
print(df.columns.tolist())
print(f"\nData types:")
print(df.dtypes)
print(f"\nMissing values:")
print(df.isnull().sum())
print(f"\nBasic statistics:")
df.describe()


## 2. Data Cleaning

Real-world data requires cleaning before modeling. We'll handle missing values, outliers, and data type issues.
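
As a toy illustration of the two cleaning rules used here (median imputation and IQR-based outlier removal), the sketch below works on made-up numbers. The values and the `factor = 3` fence are assumptions chosen for readability.

```python
import pandas as pd

# Median imputation: the missing area is replaced by the median of the observed values
areas = pd.Series([1500.0, None, 2100.0, 1800.0])
areas_filled = areas.fillna(areas.median())

# IQR rule: keep values inside [Q1 - factor*IQR, Q3 + factor*IQR]
prices = pd.Series([200_000, 250_000, 300_000, 350_000, 400_000, 5_000_000])
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
factor = 3
lower, upper = q1 - factor * iqr, q3 + factor * iqr
kept = prices[(prices >= lower) & (prices <= upper)]

print(areas_filled.tolist())
print(f"fences: [{lower:,.0f}, {upper:,.0f}], kept {len(kept)} of {len(prices)}")
```

With these numbers the fences are -112,500 and 762,500, so only the 5,000,000 listing is dropped.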


In [None]:
# Create a copy for cleaning (best practice: preserve original)
df_clean = df.copy()

# Identify target variable (common names for price)
price_columns = [col for col in df_clean.columns if 'price' in col.lower() or 'cost' in col.lower()]
if price_columns:
    target_col = price_columns[0]
else:
    # Try to infer: usually last column or numeric column with high variance
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    target_col = numeric_cols[-1]  # Often price is last column
    print(f"⚠ No explicit price column found. Using: {target_col}")

print(f"✓ Target variable: {target_col}")

# Remove rows with missing target (can't predict without target)
df_clean = df_clean.dropna(subset=[target_col])
print(f"✓ Removed rows with missing target. Remaining: {df_clean.shape[0]} rows")

# Handle missing values in features
# Business logic: for numeric features, use median (robust to outliers)
# For categorical, use mode or 'Unknown'
numeric_features = df_clean.select_dtypes(include=[np.number]).columns.tolist()
if target_col in numeric_features:
    numeric_features.remove(target_col)

categorical_features = df_clean.select_dtypes(include=['object']).columns.tolist()

# Fill numeric missing values with median
for col in numeric_features:
    if df_clean[col].isnull().sum() > 0:
        median_val = df_clean[col].median()
        df_clean[col] = df_clean[col].fillna(median_val)  # assign back; inplace on a column selection is deprecated
        print(f"✓ Filled {col} missing values with median: {median_val:.2f}")

# Fill categorical missing values with mode or 'Unknown'
for col in categorical_features:
    if df_clean[col].isnull().sum() > 0:
        mode_val = df_clean[col].mode()[0] if not df_clean[col].mode().empty else 'Unknown'
        df_clean[col] = df_clean[col].fillna(mode_val)  # assign back; inplace on a column selection is deprecated
        print(f"✓ Filled {col} missing values with: {mode_val}")

print(f"\n✓ Missing values after cleaning:")
print(df_clean.isnull().sum().sum())


In [None]:
# Outlier handling: remove extreme outliers that don't make business sense
# Production approach: use IQR method for numeric features
def remove_outliers_iqr(df, column, factor=3):
    """Remove outliers beyond factor * IQR"""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - factor * IQR
    upper_bound = Q3 + factor * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Remove outliers from target variable (critical for regression)
initial_rows = len(df_clean)
df_clean = remove_outliers_iqr(df_clean, target_col, factor=3)
outliers_removed = initial_rows - len(df_clean)
print(f"✓ Removed {outliers_removed} outliers from {target_col} ({outliers_removed/initial_rows*100:.1f}%)")

# Remove outliers from key numeric features (size-related features)
size_related = [col for col in numeric_features if any(word in col.lower() for word in ['sqft', 'area', 'size', 'bed', 'bath', 'room'])]
for col in size_related[:3]:  # Limit to top 3 to avoid over-cleaning
    if col in df_clean.columns:
        df_clean = remove_outliers_iqr(df_clean, col, factor=3)

print(f"✓ Final dataset shape: {df_clean.shape}")
df_clean.head()


## 3. Feature Engineering

Create business-relevant features that improve model performance.
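
A minimal sketch of the engineered features on a toy frame. The column names (`sqft_living`, `yr_built`, and so on) and the reference year are assumptions for illustration. It also shows one caveat: a feature computed from the price, such as price per square foot, lets a model recover the price exactly, so it is useful for EDA but should be kept out of the model inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the notebook detects its own by keyword
homes = pd.DataFrame({
    "sqft_living": [1200, 1800, 2600],
    "bedrooms": [2, 3, 4],
    "bathrooms": [1.0, 2.0, 2.5],
    "yr_built": [1975, 2001, 2018],
    "price": [250_000.0, 410_000.0, 720_000.0],
})

reference_year = 2024  # assumption for the sketch; any fixed reference year works
homes["total_rooms"] = homes["bedrooms"] + homes["bathrooms"]
homes["property_age"] = (reference_year - homes["yr_built"]).clip(lower=0)
homes["sqft_living_log"] = np.log1p(homes["sqft_living"])

# Target-derived feature: multiplying it back by the area reproduces the price
homes["price_per_sqft"] = homes["price"] / (homes["sqft_living"] + 1)
reconstructed = homes["price_per_sqft"] * (homes["sqft_living"] + 1)
leaks_target = bool(np.allclose(reconstructed, homes["price"]))

# Columns that are safe to pass to a model
model_features = [c for c in homes.columns if c not in ("price", "price_per_sqft")]
print(leaks_target, model_features)
```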


In [None]:
# Feature engineering: create features that make business sense
df_features = df_clean.copy()

# 1. Price per square foot (if area/size columns exist)
area_cols = [col for col in df_features.columns if any(word in col.lower() for word in ['sqft', 'area', 'size', 'living'])]
if area_cols and target_col in df_features.columns:
    area_col = area_cols[0]
    df_features['price_per_sqft'] = df_features[target_col] / (df_features[area_col] + 1)  # +1 to avoid division by zero
    print(f"✓ Created price_per_sqft using {area_col}")

# 2. Total rooms (bedrooms + bathrooms)
bed_cols = [col for col in df_features.columns if 'bed' in col.lower()]
bath_cols = [col for col in df_features.columns if 'bath' in col.lower()]
if bed_cols and bath_cols:
    bed_col = bed_cols[0]
    bath_col = bath_cols[0]
    df_features['total_rooms'] = df_features[bed_col].fillna(0) + df_features[bath_col].fillna(0)
    print(f"✓ Created total_rooms")

# 3. Age of property (if year built exists)
year_cols = [col for col in df_features.columns if 'year' in col.lower()]
if year_cols:
    year_col = year_cols[0]
    current_year = pd.Timestamp.now().year  # Derived from the clock rather than hard-coded
    df_features['property_age'] = current_year - df_features[year_col]
    df_features['property_age'] = df_features['property_age'].clip(lower=0)  # No negative ages
    print(f"✓ Created property_age")

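# 4. Encode categorical variables (location, condition, etc.)
# Modeling choice: integer codes from LabelEncoder are workable for tree-based models (RandomForest, XGBoost),
# but the codes imply an arbitrary order; OrdinalEncoder or one-hot encoding is the usual choice for features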
label_encoders = {}
for col in categorical_features:
    if col in df_features.columns:
        le = LabelEncoder()
        df_features[f'{col}_encoded'] = le.fit_transform(df_features[col].astype(str))
        label_encoders[col] = le
        print(f"✓ Encoded {col}")

# 5. Log transform for highly skewed numeric features (helps with regression)
skewed_cols = []
for col in numeric_features:
    if col in df_features.columns:
        skewness = df_features[col].skew()
        if abs(skewness) > 2 and df_features[col].min() >= 0:  # Highly skewed; skip negative values, where log1p can return NaN
            skewed_cols.append(col)
            df_features[f'{col}_log'] = np.log1p(df_features[col])  # log1p handles zeros
            print(f"✓ Created log transform for {col} (skewness: {skewness:.2f})")

print(f"\n✓ Feature engineering complete. New shape: {df_features.shape}")
df_features.head()


## 4. Exploratory Data Analysis (EDA)

Visualize data patterns to understand the market and inform modeling decisions.


In [None]:
# EDA: Price distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Price distribution
axes[0, 0].hist(df_features[target_col], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_title(f'Distribution of {target_col}', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Price')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].axvline(df_features[target_col].median(), color='red', linestyle='--', label=f'Median: ${df_features[target_col].median():,.0f}')
axes[0, 0].legend()

# 2. Price distribution (log scale)
axes[0, 1].hist(np.log1p(df_features[target_col]), bins=50, edgecolor='black', alpha=0.7, color='green')
axes[0, 1].set_title(f'Distribution of {target_col} (Log Scale)', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Log(Price)')
axes[0, 1].set_ylabel('Frequency')

# 3. Correlation heatmap (top features)
numeric_cols_for_corr = [col for col in df_features.select_dtypes(include=[np.number]).columns if col != target_col]
if numeric_cols_for_corr:
    # Rank by absolute correlation so the plot shows the strongest relationships, not the first 10 columns
    target_corr = df_features[numeric_cols_for_corr].corrwith(df_features[target_col])
    top_corr = target_corr.loc[target_corr.abs().nlargest(10).index].sort_values(ascending=False)
    sns.heatmap(top_corr.to_frame(name=target_col),
                annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=axes[1, 0], cbar_kws={'label': 'Correlation'})
    axes[1, 0].set_title(f'Top Feature Correlations with {target_col}', fontsize=12, fontweight='bold')

# 4. Box plot for price by category (if categorical exists)
if categorical_features:
    cat_col = categorical_features[0]
    top_categories = df_features[cat_col].value_counts().head(5).index
    df_top_cats = df_features[df_features[cat_col].isin(top_categories)]
    sns.boxplot(data=df_top_cats, x=cat_col, y=target_col, ax=axes[1, 1])
    axes[1, 1].set_title(f'{target_col} by {cat_col} (Top 5)', fontsize=12, fontweight='bold')
    axes[1, 1].tick_params(axis='x', rotation=45)
else:
    # Scatter plot: price vs area if available
    if area_cols:
        area_col = area_cols[0]
        axes[1, 1].scatter(df_features[area_col], df_features[target_col], alpha=0.5)
        axes[1, 1].set_xlabel(area_col)
        axes[1, 1].set_ylabel(target_col)
        axes[1, 1].set_title(f'{target_col} vs {area_col}', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

print("✓ EDA visualizations complete")


In [None]:
# Additional EDA: Key statistics
print("=" * 60)
print("KEY MARKET STATISTICS")
print("=" * 60)
print(f"\nPrice Statistics:")
print(f"  Mean: ${df_features[target_col].mean():,.2f}")
print(f"  Median: ${df_features[target_col].median():,.2f}")
print(f"  Std Dev: ${df_features[target_col].std():,.2f}")
print(f"  Min: ${df_features[target_col].min():,.2f}")
print(f"  Max: ${df_features[target_col].max():,.2f}")

if area_cols:
    area_col = area_cols[0]
    print(f"\n{area_col} Statistics:")
    print(f"  Mean: {df_features[area_col].mean():,.2f}")
    print(f"  Median: {df_features[area_col].median():,.2f}")

if 'price_per_sqft' in df_features.columns:
    print(f"\nPrice per SqFt Statistics:")
    print(f"  Mean: ${df_features['price_per_sqft'].mean():,.2f}")
    print(f"  Median: ${df_features['price_per_sqft'].median():,.2f}")

print("\n" + "=" * 60)


## 5. Business Insights

### Key Findings from EDA:

**Price Drivers (confirm against the correlation heatmap and box plot above):**
- Location is usually a primary factor when a location column is available
- Property size (square footage) is expected to correlate strongly with price
- Number of bedrooms/bathrooms affects value
- Property age may affect price (newer properties often command a premium)

**Market Segmentation Opportunities:**
- Luxury segment: High-end properties with premium features
- Mid-market: Standard family homes
- Budget segment: Smaller, affordable properties

**Investment Recommendations:**
- Focus on properties with good price-to-size ratio
- Consider location premium for long-term value
- Newer properties may offer better appreciation potential


## 6. KMeans Clustering Analysis

Identify market segments using unsupervised learning. This helps understand natural groupings in the data.
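
The elbow plot used below is read by eye. As a complementary check, the silhouette score gives one number per candidate k. The following is a sketch on synthetic data, not the notebook's dataset: the three blob centers, their spread, and the range of k are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Three synthetic "segments" in (area, price) space
centers = np.array([[1000, 200_000], [2500, 500_000], [4000, 800_000]])
X = np.vstack([rng.normal(loc=c, scale=[100, 20_000], size=(60, 2)) for c in centers])

# KMeans is distance-based, so both columns are put on the same scale first
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"best k by silhouette: {best_k}")
```

On real listings the separation is rarely this clean, so the score is best treated as one input alongside the elbow plot and business interpretability.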


In [None]:
# Prepare features for clustering
# Business logic: use key numeric features that define property segments
clustering_features = []

# Include size-related features
if area_cols:
    clustering_features.append(area_cols[0])
if bed_cols:
    clustering_features.append(bed_cols[0])
if bath_cols:
    clustering_features.append(bath_cols[0])

# Include price (normalized)
clustering_features.append(target_col)

# Include other numeric features (limit to avoid curse of dimensionality)
other_numeric = [col for col in numeric_features if col not in clustering_features and col != target_col][:3]
clustering_features.extend(other_numeric)

# Remove duplicates and ensure columns exist
clustering_features = list(dict.fromkeys([col for col in clustering_features if col in df_features.columns]))

print(f"✓ Clustering features: {clustering_features}")

# Prepare clustering data
X_cluster = df_features[clustering_features].copy()

# Standardize features for clustering (KMeans is distance-based)
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

print(f"✓ Prepared {X_cluster_scaled.shape[0]} samples for clustering")


In [None]:
# Elbow method to determine optimal number of clusters
# Production approach: test k from 2 to 8 (reasonable range for market segments)
inertias = []
K_range = range(2, 9)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_cluster_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Inertia (Within-cluster sum of squares)', fontsize=12)
plt.title('Elbow Method for Optimal k', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

# Choose optimal k (simple heuristic: look for "elbow" - here we'll use k=3 or 4)
# Business logic: 3-4 segments is interpretable (Luxury, Mid-market, Budget, maybe Premium)
optimal_k = 4  # Can be adjusted based on elbow plot
print(f"✓ Selected k={optimal_k} clusters (adjust if needed based on elbow plot)")


In [None]:
# Fit KMeans with optimal k
kmeans_model = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = kmeans_model.fit_predict(X_cluster_scaled)

# Add cluster labels to dataframe
df_features['cluster'] = cluster_labels

print(f"✓ Clustering complete. Cluster distribution:")
print(df_features['cluster'].value_counts().sort_index())

# Visualize clusters
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Scatter plot: Price vs Area (if available) colored by cluster
if area_cols and target_col in df_features.columns:
    scatter = axes[0].scatter(df_features[area_cols[0]], df_features[target_col], 
                              c=cluster_labels, cmap='viridis', alpha=0.6, s=50)
    axes[0].set_xlabel(area_cols[0])
    axes[0].set_ylabel(target_col)
    axes[0].set_title('Clusters: Price vs Area', fontsize=12, fontweight='bold')
    plt.colorbar(scatter, ax=axes[0], label='Cluster')

# Cluster statistics
cluster_stats = df_features.groupby('cluster')[target_col].agg(['mean', 'count', 'std'])
cluster_stats.columns = ['Avg Price', 'Count', 'Std Dev']
cluster_stats = cluster_stats.sort_values('Avg Price', ascending=False)

axes[1].barh(cluster_stats.index, cluster_stats['Avg Price'], color='steelblue')
axes[1].set_xlabel('Average Price')
axes[1].set_ylabel('Cluster')
axes[1].set_title('Average Price by Cluster', fontsize=12, fontweight='bold')
for i, (idx, row) in enumerate(cluster_stats.iterrows()):
    axes[1].text(row['Avg Price'], idx, f"${row['Avg Price']:,.0f}\n(n={int(row['Count'])})", 
                va='center', fontsize=9)

plt.tight_layout()
plt.show()

# Interpret clusters
print("\n" + "=" * 60)
print("CLUSTER INTERPRETATION")
print("=" * 60)
for cluster_id in sorted(df_features['cluster'].unique()):
    cluster_data = df_features[df_features['cluster'] == cluster_id]
    avg_price = cluster_data[target_col].mean()
    print(f"\nCluster {cluster_id}:")
    print(f"  Average Price: ${avg_price:,.2f}")
    print(f"  Count: {len(cluster_data)} properties ({len(cluster_data)/len(df_features)*100:.1f}%)")
    if area_cols:
        print(f"  Avg {area_cols[0]}: {cluster_data[area_cols[0]].mean():,.0f}")
print("=" * 60)


## 7. Classification Model: Price Range Prediction

Predict which price range a property falls into (useful for quick categorization).
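
Quartile-based tiers can be illustrated on a handful of made-up prices. This sketch uses `pd.qcut`, which computes the quartile edges itself and includes the lowest value in the first bin. That differs slightly from the `pd.cut` call in the next cell, whose first bin starts at 0 and therefore excludes a price of exactly 0.

```python
import pandas as pd

prices = pd.Series([120_000, 180_000, 240_000, 310_000, 390_000, 470_000, 650_000, 900_000])
labels = ["Budget", "Mid", "Premium", "Luxury"]

# q=4 splits at the 25th, 50th, and 75th percentiles
tiers = pd.qcut(prices, q=4, labels=labels)
counts = {label: int((tiers == label).sum()) for label in labels}

print(pd.DataFrame({"price": prices, "tier": tiers}))
print(counts)
```

Each tier receives two of the eight listings, which is the balanced class distribution that quartile cuts are meant to produce.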


In [None]:
# Prepare data for classification
# Create price range categories (business logic: quartile-based segmentation)
price_quartiles = df_features[target_col].quantile([0.25, 0.5, 0.75])
df_features['price_category'] = pd.cut(df_features[target_col], 
                                        bins=[0, price_quartiles[0.25], price_quartiles[0.5], 
                                              price_quartiles[0.75], float('inf')],
                                        labels=['Budget', 'Mid', 'Premium', 'Luxury'])

print("Price category distribution:")
print(df_features['price_category'].value_counts())

# Remove rows with NaN in price_category (can happen with edge cases in pd.cut)
# Production practice: handle missing categories before modeling
initial_count = len(df_features)
df_features = df_features.dropna(subset=['price_category'])
dropped_count = initial_count - len(df_features)
if dropped_count > 0:
    print(f"✓ Removed {dropped_count} rows with NaN price_category")

# Prepare features for classification
# Exclude the target and anything derived from it: price_per_sqft is computed from the price (target leakage)
exclude_cols = [target_col, 'price_category', 'cluster', 'price_per_sqft']
feature_cols = [col for col in df_features.select_dtypes(include=[np.number]).columns 
                if col not in exclude_cols]

# Limit features to avoid overfitting (use top correlated or all if < 20)
if len(feature_cols) > 20:
    # Select top 20 features by absolute correlation with target
    correlations = df_features[feature_cols].corrwith(df_features[target_col]).abs().sort_values(ascending=False)
    feature_cols = correlations.head(20).index.tolist()

X_class = df_features[feature_cols].copy()
y_class = df_features['price_category'].copy()

# Final check: ensure no NaN in features or target
# Real-world preprocessing: handle any remaining NaN
X_class = X_class.fillna(X_class.median())  # Fill any remaining NaN in features with median
y_class = y_class.dropna()  # Remove any remaining NaN in target
X_class = X_class.loc[y_class.index]  # Align indices

print(f"\n✓ Classification features: {len(feature_cols)}")
print(f"✓ Target classes: {y_class.unique()}")
print(f"✓ Data shape: {X_class.shape}, Target shape: {y_class.shape}")

# Train-test split
X_class_train, X_class_test, y_class_train, y_class_test = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42, stratify=y_class
)

print(f"✓ Train set: {X_class_train.shape[0]} samples")
print(f"✓ Test set: {X_class_test.shape[0]} samples")


In [None]:
# Train RandomForest Classifier
# Production modeling choice: RandomForest handles mixed data types well, no scaling needed
rf_classifier = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)

rf_classifier.fit(X_class_train, y_class_train)

# Predictions
y_class_pred = rf_classifier.predict(X_class_test)

# Evaluation
print("=" * 60)
print("CLASSIFICATION MODEL RESULTS")
print("=" * 60)
print("\nClassification Report:")
print(classification_report(y_class_test, y_class_pred))

# Confusion matrix
cm = confusion_matrix(y_class_test, y_class_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=rf_classifier.classes_, 
            yticklabels=rf_classifier.classes_)
plt.title('Confusion Matrix: Price Category Classification', fontsize=12, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

accuracy = (y_class_pred == y_class_test).mean()
print(f"\n✓ Classification Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print("=" * 60)


## 8. Regression Model: Price Prediction

Predict exact property prices using XGBoost, a gradient-boosted tree library that is a strong default for tabular data.
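
The three metrics reported for the regressor can be computed by hand on a few made-up values, which makes their differences easier to see. The prices below are illustrative only.

```python
import numpy as np

y_true = np.array([300_000.0, 450_000.0, 500_000.0, 750_000.0])
y_pred = np.array([320_000.0, 430_000.0, 540_000.0, 700_000.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))            # average miss in dollars
rmse = np.sqrt(np.mean(errors ** 2))     # squares errors first, so large misses weigh more
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                 # share of price variance explained

print(f"MAE:  ${mae:,.0f}")    # MAE:  $32,500
print(f"RMSE: ${rmse:,.0f}")   # RMSE: $35,000
print(f"R²:   {r2:.4f}")       # R²:   0.9533
```

RMSE exceeds MAE here because the single 50,000 miss is weighted more heavily once squared.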


In [None]:
# Prepare data for regression
# Use the same features as classification.
# The cluster label is left out: KMeans was fit on features that include the target,
# so the label encodes the price and would leak it into the regressor.
regression_features = feature_cols.copy()

X_reg = df_features[regression_features].copy()
y_reg = df_features[target_col].copy()

# Real-world preprocessing: ensure no NaN values
X_reg = X_reg.fillna(X_reg.median())  # Fill NaN in features with median
y_reg = y_reg.dropna()  # Remove NaN in target
X_reg = X_reg.loc[y_reg.index]  # Align indices

# Train-test split
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

print(f"✓ Regression features: {X_reg.shape[1]}")
print(f"✓ Train set: {X_reg_train.shape[0]} samples")
print(f"✓ Test set: {X_reg_test.shape[0]} samples")
print(f"✓ Target range: ${y_reg.min():,.0f} - ${y_reg.max():,.0f}")


In [None]:
# Train XGBoost Regressor
# Production modeling choice: XGBoost often outperforms RandomForest for regression
xgb_regressor = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)

xgb_regressor.fit(X_reg_train, y_reg_train)

# Predictions
y_reg_pred = xgb_regressor.predict(X_reg_test)

# Evaluation metrics
rmse = np.sqrt(mean_squared_error(y_reg_test, y_reg_pred))
r2 = r2_score(y_reg_test, y_reg_pred)
mae = np.mean(np.abs(y_reg_test - y_reg_pred))

print("=" * 60)
print("REGRESSION MODEL RESULTS")
print("=" * 60)
print(f"\nRMSE: ${rmse:,.2f}")
print(f"R² Score: {r2:.4f}")
print(f"MAE: ${mae:,.2f}")
print(f"\nMean Actual Price: ${y_reg_test.mean():,.2f}")
print(f"Mean Predicted Price: ${y_reg_pred.mean():,.2f}")
print("=" * 60)


In [None]:
# Visualize regression results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Actual vs Predicted scatter
axes[0, 0].scatter(y_reg_test, y_reg_pred, alpha=0.5, s=30)
axes[0, 0].plot([y_reg_test.min(), y_reg_test.max()], 
                [y_reg_test.min(), y_reg_test.max()], 'r--', lw=2, label='Perfect Prediction')
axes[0, 0].set_xlabel('Actual Price')
axes[0, 0].set_ylabel('Predicted Price')
axes[0, 0].set_title(f'Actual vs Predicted Prices (R² = {r2:.3f})', fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Residuals plot
residuals = y_reg_test - y_reg_pred
axes[0, 1].scatter(y_reg_pred, residuals, alpha=0.5, s=30)
axes[0, 1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 1].set_xlabel('Predicted Price')
axes[0, 1].set_ylabel('Residuals (Actual - Predicted)')
axes[0, 1].set_title('Residuals Plot', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# 3. Residuals distribution
axes[1, 0].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].axvline(x=0, color='r', linestyle='--', lw=2)
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Residuals Distribution', fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# 4. Feature importance (top 10)
feature_importance = pd.DataFrame({
    'feature': X_reg.columns,
    'importance': xgb_regressor.feature_importances_
}).sort_values('importance', ascending=False).head(10)

axes[1, 1].barh(feature_importance['feature'], feature_importance['importance'], color='steelblue')
axes[1, 1].set_xlabel('Importance')
axes[1, 1].set_title('Top 10 Feature Importance', fontsize=12, fontweight='bold')
axes[1, 1].invert_yaxis()

plt.tight_layout()
plt.show()

print("\n✓ Regression evaluation complete")


## 9. Feature Importance Analysis

Understand which features drive predictions most.
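
Model-internal importance scores depend on how the trees were built. A complementary view is permutation importance, which measures how much the test score drops when one feature is shuffled. The sketch below uses synthetic data and a `RandomForestRegressor` as a stand-in so that it runs without XGBoost; the feature names and data-generating rule are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
area = rng.uniform(500, 4000, n)                 # informative feature
noise_col = rng.normal(size=n)                   # unrelated to price
price = 200 * area + rng.normal(0, 20_000, n)    # synthetic pricing rule

X = np.column_stack([area, noise_col])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column of the held-out set and record the average drop in R²
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
importances = dict(zip(["area", "noise_col"], result.importances_mean))

for name, value in importances.items():
    print(f"{name:10s} {value:.3f}")
```

Because the scores are computed on held-out data, a feature that the model merely memorized tends to score low.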


In [None]:
# Feature importance from XGBoost
feature_importance_df = pd.DataFrame({
    'feature': X_reg.columns,
    'importance': xgb_regressor.feature_importances_
}).sort_values('importance', ascending=False)

print("=" * 60)
print("TOP 15 MOST IMPORTANT FEATURES")
print("=" * 60)
for idx, row in feature_importance_df.head(15).iterrows():
    print(f"{row['feature']:30s} {row['importance']:.4f}")

# Visualize
plt.figure(figsize=(10, 8))
top_features = feature_importance_df.head(15)
plt.barh(range(len(top_features)), top_features['importance'], color='steelblue')
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance', fontsize=12)
plt.title('Top 15 Feature Importance (XGBoost)', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\n✓ Feature importance analysis complete")


## 10. Model Persistence

Save trained models for deployment.
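
Saving is only half of persistence; a deployment also has to load the artifacts and get identical behavior. A minimal round-trip sketch, assuming a fitted `StandardScaler` and a small metadata file written to a temporary directory (the file names here are illustrative):

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1000.0, 2.0], [2000.0, 3.0], [3000.0, 4.0]])
scaler = StandardScaler().fit(X)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)

    # Save the fitted object and the feature order it expects
    joblib.dump(scaler, out / "scaler.pkl")
    (out / "metadata.json").write_text(json.dumps({"feature_columns": ["area", "bedrooms"]}))

    # Reload and confirm the restored object transforms data identically
    restored = joblib.load(out / "scaler.pkl")
    metadata = json.loads((out / "metadata.json").read_text())
    same_output = bool(np.allclose(restored.transform(X), scaler.transform(X)))

print(same_output, metadata["feature_columns"])
```

Storing the feature order next to the model matters because the estimators expect columns in the same order at prediction time as during training.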


In [None]:
# Save models to /kaggle/working (Kaggle's output directory)
# Production practice: save models together with metadata (feature names, metrics)
output_dir = Path('/kaggle/working')
if not output_dir.exists():
    output_dir = Path('.')  # Fallback for local testing

# Save models
joblib.dump(kmeans_model, output_dir / 'kmeans_model.pkl')
joblib.dump(rf_classifier, output_dir / 'rf_classifier.pkl')
joblib.dump(xgb_regressor, output_dir / 'xgb_regressor.pkl')
joblib.dump(scaler_cluster, output_dir / 'scaler_cluster.pkl')

# Save feature names and metadata
import json
metadata = {
    'target_column': target_col,
    'feature_columns': feature_cols,
    'clustering_features': clustering_features,
    'optimal_k': optimal_k,
    'model_metrics': {
        'classification_accuracy': float(accuracy),
        'regression_rmse': float(rmse),
        'regression_r2': float(r2)
    }
}

with open(output_dir / 'model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print("=" * 60)
print("MODELS SAVED")
print("=" * 60)
print(f"✓ KMeans model: {output_dir / 'kmeans_model.pkl'}")
print(f"✓ RandomForest Classifier: {output_dir / 'rf_classifier.pkl'}")
print(f"✓ XGBoost Regressor: {output_dir / 'xgb_regressor.pkl'}")
print(f"✓ Scaler: {output_dir / 'scaler_cluster.pkl'}")
print(f"✓ Metadata: {output_dir / 'model_metadata.json'}")
print("=" * 60)


## 11. LLM-Based Prediction Interpreter

Generate natural language explanations for predictions with a rule-based template function; a commented-out OpenAI call is included as a placeholder for an LLM-backed version.
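
If the template is later swapped for an LLM call, the first step is turning a prediction into a prompt. The helper below is a hypothetical sketch of that step only: `build_explanation_prompt`, its wording, and the sample feature names are assumptions, and no API is called.

```python
def build_explanation_prompt(features, prediction, max_features=10):
    """Format feature values and a predicted price into a prompt string (hypothetical helper)."""
    # Cap the number of features so the prompt stays short
    lines = [f"- {name}: {value}" for name, value in list(features.items())[:max_features]]
    return (
        "Explain this real estate price prediction in plain language.\n"
        f"Predicted price: ${prediction:,.2f}\n"
        "Property features:\n" + "\n".join(lines)
    )


prompt = build_explanation_prompt(
    {"sqft_living": 1800, "bedrooms": 3, "bathrooms": 2},
    425000.0,
)
print(prompt)
```

Keeping prompt construction in a plain function makes it testable without network access.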


In [None]:
def explain_prediction(features, prediction, model_type='regression'):
    """
    Generate natural language explanation for a prediction.
    
    Args:
        features: dict or pd.Series with feature values
        prediction: predicted value (price or category)
        model_type: 'regression' or 'classification'
    
    Returns:
        str: Natural language explanation
    """
    # Convert to dict if Series
    if isinstance(features, pd.Series):
        features = features.to_dict()
    
    explanation_parts = []
    
    if model_type == 'regression':
        price = float(prediction)
        explanation_parts.append(f"This property is predicted to be valued at **${price:,.2f}**.")
        
        # Explain based on key features
        if 'price_per_sqft' in features:
            ppsf = features.get('price_per_sqft', 0)
            median_ppsf = df_features['price_per_sqft'].median() if 'price_per_sqft' in df_features.columns else 0
            explanation_parts.append(f"The price per square foot is ${ppsf:,.2f}, which is {'above' if ppsf > median_ppsf else 'below'} the market median.")
        
        # Area impact
        area_cols_found = [k for k in features.keys() if any(word in str(k).lower() for word in ['sqft', 'area', 'size'])]
        if area_cols_found:
            area_key = area_cols_found[0]
            area_val = features.get(area_key, 0)
            if area_key in df_features.columns:
                median_area = df_features[area_key].median()
                explanation_parts.append(f"The property size ({area_key}: {area_val:,.0f}) {'contributes significantly' if area_val > median_area else 'is below average'} to the valuation.")
            else:
                explanation_parts.append(f"The property size ({area_key}: {area_val:,.0f}) is a key factor in the valuation.")
        
        # Bedrooms/Bathrooms
        bed_cols_found = [k for k in features.keys() if 'bed' in str(k).lower()]
        bath_cols_found = [k for k in features.keys() if 'bath' in str(k).lower()]
        if bed_cols_found and bath_cols_found:
            beds = features.get(bed_cols_found[0], 0)
            baths = features.get(bath_cols_found[0], 0)
            explanation_parts.append(f"With {beds} bedrooms and {baths} bathrooms, this property offers {'generous' if beds >= 3 else 'modest'} living space.")
        
        # Cluster impact
        if 'cluster' in features:
            cluster = int(features.get('cluster', 0))
            cluster_avg = df_features[df_features['cluster'] == cluster][target_col].mean()
            explanation_parts.append(f"This property belongs to market segment {cluster}, where average prices are ${cluster_avg:,.0f}.")
    
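    else:  # classification
        category = str(prediction)
        explanation_parts.append(f"This property is classified as **{category}** tier.")
        # Each tier is bounded by the quartiles used to build price_category
        q1, q2, q3 = price_quartiles[0.25], price_quartiles[0.5], price_quartiles[0.75]
        tier_ranges = {
            'Budget': f"up to ${q1:,.0f}",
            'Mid': f"from ${q1:,.0f} to ${q2:,.0f}",
            'Premium': f"from ${q2:,.0f} to ${q3:,.0f}",
            'Luxury': f"above ${q3:,.0f}",
        }
        if category in tier_ranges:
            explanation_parts.append(f"Properties in this category are priced {tier_ranges[category]}.")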
    
    # Add investment insight
    explanation_parts.append("\n**Investment Insight:** Consider location, market trends, and property condition for final decision.")
    
    return " ".join(explanation_parts)


# Example usage
print("=" * 60)
print("PREDICTION INTERPRETER - EXAMPLE")
print("=" * 60)

# Get a sample property
sample_idx = X_reg_test.index[0]
sample_features = X_reg_test.iloc[0]
sample_prediction = y_reg_pred[0]
sample_actual = y_reg_test.iloc[0]

print(f"\nSample Property Features:")
for i, (key, val) in enumerate(sample_features.items()):
    if i < 10:  # Show first 10 features
        print(f"  {key}: {val}")

print(f"\nPredicted Price: ${sample_prediction:,.2f}")
print(f"Actual Price: ${sample_actual:,.2f}")
print(f"Error: ${abs(sample_prediction - sample_actual):,.2f}")

print(f"\n{'='*60}")
print("EXPLANATION:")
print("="*60)
explanation = explain_prediction(sample_features, sample_prediction, model_type='regression')
print(explanation)
print("="*60)

# Note: In production, this could call the OpenAI API (client interface of openai>=1.0):
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": f"Explain this real estate prediction: {sample_features.to_dict()}, prediction: {sample_prediction}"}]
# )
# llm_explanation = response.choices[0].message.content


## 12. Final Summary & Business Recommendations

### Model Performance Summary

**Classification Model (RandomForest):**
- Accuracy: share of test properties assigned to the correct price category (Budget/Mid/Premium/Luxury)
- Use case: Quick property categorization for marketing and portfolio management

**Regression Model (XGBoost):**
- RMSE: root-mean-square prediction error in dollars; large misses count more than in MAE
- R²: share of price variance the model explains on the test set
- Use case: Precise price estimation for listings, appraisals, and investment analysis

**Clustering Analysis (KMeans):**
- Identified market segments based on property characteristics
- Use case: Market segmentation, targeted marketing, portfolio diversification

### Key Business Insights

1. **Price Drivers:** Property size, location, and number of rooms are the expected primary factors; the feature importance ranking in Section 9 shows which ones this dataset supports
2. **Market Segments:** KMeans with k=4 produces four segments, which can be labeled Luxury, Premium, Mid-market, and Budget by average price
3. **Investment Strategy:** Focus on properties with favorable price-to-size ratios in growing segments
4. **Model Reliability:** The test-set R² reported in Section 8 shows how much of the price variation the model explains

### Next Steps for Production

1. **Model Monitoring:** Track prediction accuracy over time
2. **Feature Updates:** Incorporate market trends, economic indicators
3. **Backtesting:** Compare model predictions with actual sales
4. **Deployment:** Integrate models into listing platform or appraisal system

---

**Project Status:** ✅ Complete
**Models Saved:** ✅ Ready for deployment
**Notebook:** ✅ Production-ready


In [None]:
# Final summary statistics
print("=" * 70)
print("PROJECT COMPLETION SUMMARY")
print("=" * 70)
print(f"\nDataset: {df.shape[0]} original rows → {df_features.shape[0]} cleaned rows")
print(f"Columns after feature engineering: {df_features.shape[1]}")
print(f"Clusters identified: {optimal_k} market segments")
print(f"\nModel Performance:")
print(f"  Classification Accuracy: {accuracy:.2%}")
print(f"  Regression RMSE: ${rmse:,.2f}")
print(f"  Regression R²: {r2:.4f}")
print(f"\nModels saved to: {output_dir}")
print("\n✅ End-to-end ML pipeline complete!")
print("=" * 70)
