# 🏠 House Price Prediction - Complete ML Pipeline

**A Professional Machine Learning Project**

This notebook implements a complete end-to-end ML pipeline for predicting house prices in Bangalore, including:
- Data loading and preprocessing
- Exploratory Data Analysis (EDA) with visualizations
- Training and comparison of 8 different ML models
- Model evaluation and selection
- Interactive prediction system

---

## 📦 1. Import Required Libraries

In [None]:
# Core libraries
import os
import json
import warnings
from typing import List, Dict, Optional

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML libraries
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
import joblib

# Display settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

print("✅ All libraries imported successfully!")

## 🔧 2. Define Helper Functions

In [None]:
def clean_numeric_range(value):
    """Convert range strings like '3307 - 3464' to their average."""
    if pd.isna(value):
        return value
    
    value_str = str(value).strip()
    
    # Handle ranges (e.g., "3307 - 3464")
    if '-' in value_str:
        parts = value_str.split('-')
        if len(parts) == 2:
            try:
                num1 = float(parts[0].strip())
                num2 = float(parts[1].strip())
                return (num1 + num2) / 2
            except ValueError:
                return np.nan
    
    # Try to convert to float
    try:
        return float(value_str)
    except ValueError:
        return np.nan


def extract_bhk_number(size_str):
    """Extract number from BHK/Bedroom strings (e.g., '3 BHK' -> 3)."""
    if pd.isna(size_str):
        return np.nan
    
    size_str = str(size_str).strip().upper()
    
    # Extract number from strings like "3 BHK", "4 Bedroom", "1 RK"
    for word in size_str.split():
        try:
            num = int(word)
            return num
        except ValueError:
            continue
    
    return np.nan


def _normalize_col_name(name: str) -> str:
    """Normalize column names for matching."""
    return (
        str(name).strip().lower()
        .replace(" ", "").replace("_", "")
        .replace("-", "").replace("/", "")
        .replace("(", "").replace(")", "")
    )


def _pick_first_match(normalized_to_original: Dict[str, str], candidates: List[str]) -> Optional[str]:
    """Pick the first matching column name from candidates."""
    for c in candidates:
        if c in normalized_to_original:
            return normalized_to_original[c]
    return None


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names and clean data."""
    normalized_to_original = {_normalize_col_name(c): c for c in df.columns}
    mapping = {}

    # Map common column name variations
    location_col = _pick_first_match(
        normalized_to_original,
        ["location", "locality", "area", "city", "address", "region"]
    )
    size_col = _pick_first_match(
        normalized_to_original,
        ["size", "type", "propertytype", "bhk", "bedroom"]
    )
    sqft_col = _pick_first_match(
        normalized_to_original,
        ["totalsqfeet", "totalsqft", "squarefeet", "sqft", "area"]
    )
    bathroom_col = _pick_first_match(
        normalized_to_original,
        ["bathroom", "bathrooms", "bath"]
    )
    price_col = _pick_first_match(
        normalized_to_original,
        ["priceinlakhs", "pricelakhs", "price", "cost"]
    )

    # Create mapping
    if location_col:
        mapping[location_col] = "location"
    if size_col:
        mapping[size_col] = "size"
    if sqft_col:
        mapping[sqft_col] = "total_sqft"
    if bathroom_col:
        mapping[bathroom_col] = "bath"
    if price_col:
        mapping[price_col] = "price"

    df = df.rename(columns=mapping)
    
    # Extract BHK number from size column
    if "size" in df.columns:
        df["bhk"] = df["size"].apply(extract_bhk_number)
    
    # Clean numeric columns
    if "total_sqft" in df.columns:
        df["total_sqft"] = df["total_sqft"].apply(clean_numeric_range)
        df["total_sqft"] = pd.to_numeric(df["total_sqft"], errors='coerce')
    
    if "bath" in df.columns:
        df["bath"] = pd.to_numeric(df["bath"], errors='coerce')
    
    if "price" in df.columns:
        df["price"] = pd.to_numeric(df["price"], errors='coerce')
    
    return df


def build_preprocessor(categorical_features: List[str], numeric_features: List[str]) -> ColumnTransformer:
    """Build preprocessing pipeline."""
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    return ColumnTransformer(transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numeric_transformer, numeric_features)
    ])


def evaluate_model(y_true: pd.Series, y_pred: np.ndarray) -> dict:
    """Calculate evaluation metrics."""
    return {
        "R2": float(r2_score(y_true, y_pred)),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred)))
    }


print("✅ Helper functions defined successfully!")
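
To make the range handling concrete, here is a standalone sketch of the same idea using only the standard library. It is not the notebook's helper: `parse_area` is a hypothetical name, and the regular expression is an assumption about what a range looks like (two non-negative numbers separated by a hyphen). Unlike a plain `'-' in value` check, this pattern does not treat a leading minus sign as a range separator.

```python
import math
import re

# Assumed range format: "<number> - <number>", both non-negative.
_RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def parse_area(value):
    """Return a float; average a range; NaN when unparseable (hypothetical helper)."""
    if value is None:
        return math.nan
    text = str(value).strip()
    match = _RANGE.match(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return (low + high) / 2
    try:
        return float(text)
    except ValueError:
        return math.nan


print(parse_area("3307 - 3464"))  # 3385.5
print(parse_area("1200"))         # 1200.0
print(parse_area("-5"))           # -5.0
print(parse_area("n/a"))          # nan
```

With the notebook's `clean_numeric_range`, a value such as `"-5"` contains a hyphen, splits into an empty string and `"5"`, and therefore comes back as NaN. For square footage that is probably acceptable, since negative areas are invalid anyway.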

## 📂 3. Load Dataset

**Note:** Update the `DATA_PATH` variable to point to your dataset file.

In [None]:
# Configuration
DATA_PATH = "data/house_prices.xlsx"  # Update this path

# Load dataset
print("="*80)
print("LOADING DATASET")
print("="*80)

if not os.path.exists(DATA_PATH):
    print(f"❌ ERROR: Dataset not found at: {DATA_PATH}")
    print("\nPlease update the DATA_PATH variable to point to your dataset.")
else:
    # Load based on file extension
    _, ext = os.path.splitext(DATA_PATH.lower())
    if ext in {".xlsx", ".xls"}:
        df_raw = pd.read_excel(DATA_PATH)
    elif ext == ".csv":
        df_raw = pd.read_csv(DATA_PATH)
    else:
        raise ValueError("Unsupported file type. Use .csv or .xlsx")
    
    # Strip whitespace from column names
    df_raw.columns = df_raw.columns.str.strip()
    
    print(f"✅ Dataset loaded successfully!")
    print(f"   Rows: {df_raw.shape[0]:,}")
    print(f"   Columns: {df_raw.shape[1]}")
    print(f"\n📋 Column Names: {list(df_raw.columns)}")

### 3.1 Display First Few Rows

In [None]:
print("\n" + "="*80)
print("FIRST 5 ROWS OF DATASET")
print("="*80)
display(df_raw.head())

### 3.2 Dataset Information

In [None]:
print("\n" + "="*80)
print("DATASET INFORMATION")
print("="*80)
print(df_raw.info())

### 3.3 Data Types

In [None]:
print("\n" + "="*80)
print("DATA TYPES")
print("="*80)
print(df_raw.dtypes)

## 🧹 4. Data Cleaning and Preprocessing

In [None]:
print("\n" + "="*80)
print("DATA CLEANING AND PREPROCESSING")
print("="*80)

# Standardize column names
print("\n[1/4] Standardizing column names...")
df = standardize_columns(df_raw.copy())
print(f"✅ Columns standardized: {list(df.columns)}")

# Remove duplicates
print("\n[2/4] Removing duplicate rows...")
initial_rows = len(df)
df = df.drop_duplicates()
duplicates_removed = initial_rows - len(df)
print(f"✅ Removed {duplicates_removed} duplicate rows")

# Drop rows with missing target
print("\n[3/4] Removing rows with missing target (price)...")
initial_rows = len(df)
df = df.dropna(subset=['price'])
missing_price_removed = initial_rows - len(df)
print(f"✅ Removed {missing_price_removed} rows with missing price")

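# Feature engineering
print("\n[4/4] Feature engineering...")
if 'total_sqft' in df.columns:
    # Price per square foot in lakhs per sqft. Used for EDA only: it is derived
    # from the target, so it must not be fed to the models as a feature.
    # Zero areas are treated as missing to avoid infinite values.
    df['price_per_sqft'] = df['price'] / df['total_sqft'].replace(0, np.nan)
    print("✅ Created feature: price_per_sqft")

print(f"\n✅ Data cleaning complete! Final dataset: {len(df)} rows, {len(df.columns)} columns")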
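
The column standardization step depends on how headers are normalized before matching. The sketch below reproduces that normalization rule in isolation so its effect is easy to see. `normalize_header` is a hypothetical stand-in for the notebook's `_normalize_col_name`, and the raw headers are invented for illustration.

```python
def normalize_header(name):
    """Lowercase a header and drop spaces and common separators."""
    text = str(name).strip().lower()
    for ch in " _-/()":
        text = text.replace(ch, "")
    return text


# Hypothetical raw headers, for illustration only
raw_headers = ["Total Sq Ft", "total_sqft", "Price (in Lakhs)", " Location "]
lookup = {normalize_header(h): h for h in raw_headers}

for key, original in lookup.items():
    print(f"{key!r} -> {original!r}")
```

Two different headers can collapse to the same normalized key (here `"Total Sq Ft"` and `"total_sqft"` both become `"totalsqft"`), in which case the later column silently wins in the lookup dictionary. It is worth checking the printed column list after standardization if a dataset has near-duplicate headers.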

### 4.1 Display Cleaned Data

In [None]:
print("\n" + "="*80)
print("CLEANED DATASET - FIRST 10 ROWS")
print("="*80)
display(df.head(10))

## 📊 5. Exploratory Data Analysis (EDA)

### 5.1 Missing Values Analysis

In [None]:
print("\n" + "="*80)
print("MISSING VALUES ANALYSIS")
print("="*80)

missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Count': missing_values.values,
    'Percentage': missing_percent.values
}).sort_values('Missing Count', ascending=False)

missing_df = missing_df[missing_df['Missing Count'] > 0]

if len(missing_df) > 0:
    display(missing_df)
    
    # Visualize missing values
    fig, ax = plt.subplots(figsize=(10, 6))
    missing_df.plot(x='Column', y='Percentage', kind='bar', ax=ax, color='coral', legend=False)
    plt.title('Missing Values by Column', fontsize=16, fontweight='bold')
    plt.xlabel('Column Name', fontsize=12)
    plt.ylabel('Percentage Missing (%)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("✅ No missing values found in any column!")

### 5.2 Statistical Summary

In [None]:
print("\n" + "="*80)
print("STATISTICAL SUMMARY")
print("="*80)
display(df.describe())

### 5.3 Price Distribution

In [None]:
print("\n" + "="*80)
print("PRICE DISTRIBUTION ANALYSIS")
print("="*80)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Histogram
axes[0].hist(df['price'].dropna(), bins=50, color='skyblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Price Distribution (Histogram)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Price (Lakhs)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].grid(alpha=0.3)

# KDE Plot
df['price'].dropna().plot(kind='density', ax=axes[1], color='darkblue', linewidth=2)
axes[1].set_title('Price Distribution (Density)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Price (Lakhs)', fontsize=12)
axes[1].set_ylabel('Density', fontsize=12)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Price statistics
print(f"\nPrice Statistics:")
print(f"  Mean: ₹{df['price'].mean():.2f} Lakhs")
print(f"  Median: ₹{df['price'].median():.2f} Lakhs")
print(f"  Std Dev: ₹{df['price'].std():.2f} Lakhs")
print(f"  Min: ₹{df['price'].min():.2f} Lakhs")
print(f"  Max: ₹{df['price'].max():.2f} Lakhs")

### 5.4 Correlation Heatmap

In [None]:
print("\n" + "="*80)
print("CORRELATION ANALYSIS")
print("="*80)

numeric_cols = df.select_dtypes(include=[np.number]).columns

if len(numeric_cols) > 1:
    correlation_matrix = df[numeric_cols].corr()
    
    # Display correlation with price
    if 'price' in correlation_matrix.columns:
        price_corr = correlation_matrix['price'].sort_values(ascending=False)
        print("\nCorrelation with Price:")
        print(price_corr)
    
    # Heatmap
    plt.figure(figsize=(12, 10))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
                fmt='.2f', square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Heatmap', fontsize=16, fontweight='bold', pad=20)
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ Not enough numeric columns for correlation analysis")

### 5.5 BHK vs Price Analysis

In [None]:
print("\n" + "="*80)
print("BHK vs PRICE ANALYSIS")
print("="*80)

if 'bhk' in df.columns and 'price' in df.columns:
    df_bhk = df.dropna(subset=['bhk', 'price'])
    
    # Statistics by BHK
    bhk_stats = df_bhk.groupby('bhk')['price'].agg(['count', 'mean', 'median', 'std'])
    bhk_stats.columns = ['Count', 'Mean Price', 'Median Price', 'Std Dev']
    print("\nPrice Statistics by BHK:")
    display(bhk_stats)
    
    # Visualizations
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Box plot
    sns.boxplot(x='bhk', y='price', data=df_bhk, ax=axes[0], palette='Set2')
    axes[0].set_title('Price Distribution by BHK (Box Plot)', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('BHK', fontsize=12)
    axes[0].set_ylabel('Price (Lakhs)', fontsize=12)
    axes[0].grid(alpha=0.3)
    
    # Violin plot
    sns.violinplot(x='bhk', y='price', data=df_bhk, ax=axes[1], palette='Set3')
    axes[1].set_title('Price Distribution by BHK (Violin Plot)', fontsize=14, fontweight='bold')
    axes[1].set_xlabel('BHK', fontsize=12)
    axes[1].set_ylabel('Price (Lakhs)', fontsize=12)
    axes[1].grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ BHK or Price column not found")

### 5.6 Area (Square Feet) Analysis

In [None]:
print("\n" + "="*80)
print("AREA (SQUARE FEET) ANALYSIS")
print("="*80)

if 'total_sqft' in df.columns and 'price' in df.columns:
    df_sqft = df.dropna(subset=['total_sqft', 'price'])
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Scatter plot
    axes[0].scatter(df_sqft['total_sqft'], df_sqft['price'], alpha=0.5, color='navy')
    axes[0].set_title('Price vs Total Square Feet', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Total Square Feet', fontsize=12)
    axes[0].set_ylabel('Price (Lakhs)', fontsize=12)
    axes[0].grid(alpha=0.3)
    
    # Hexbin plot for density
    axes[1].hexbin(df_sqft['total_sqft'], df_sqft['price'], gridsize=30, cmap='YlOrRd')
    axes[1].set_title('Price vs Total Square Feet (Density)', fontsize=14, fontweight='bold')
    axes[1].set_xlabel('Total Square Feet', fontsize=12)
    axes[1].set_ylabel('Price (Lakhs)', fontsize=12)
    
    plt.tight_layout()
    plt.show()
    
    # Correlation
    corr = df_sqft['total_sqft'].corr(df_sqft['price'])
    print(f"\n📊 Correlation between Total Sqft and Price: {corr:.4f}")
else:
    print("⚠️ total_sqft or price column not found")

### 5.7 Top Locations by Average Price

In [None]:
print("\n" + "="*80)
print("TOP LOCATIONS ANALYSIS")
print("="*80)

if 'location' in df.columns and 'price' in df.columns:
    location_stats = df.groupby('location')['price'].agg(['count', 'mean', 'median']).reset_index()
    location_stats.columns = ['Location', 'Count', 'Mean Price', 'Median Price']
    location_stats = location_stats.sort_values('Mean Price', ascending=False)
    
    # Top 15 locations by average price
    top_locations = location_stats.head(15)
    
    print("\nTop 15 Locations by Average Price:")
    display(top_locations)
    
    # Visualization
    plt.figure(figsize=(12, 8))
    plt.barh(range(len(top_locations)), top_locations['Mean Price'], color='teal')
    plt.yticks(range(len(top_locations)), top_locations['Location'])
    plt.xlabel('Average Price (Lakhs)', fontsize=12)
    plt.ylabel('Location', fontsize=12)
    plt.title('Top 15 Locations by Average Price', fontsize=14, fontweight='bold')
    plt.gca().invert_yaxis()
    plt.grid(alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ location or price column not found")

## 🤖 6. Model Training and Comparison

### 6.1 Prepare Features and Target

In [None]:
print("\n" + "="*80)
print("PREPARING FEATURES AND TARGET")
print("="*80)

# Check required columns
required_cols = ['location', 'total_sqft', 'bath', 'bhk', 'price']
missing = [c for c in required_cols if c not in df.columns]

if missing:
    raise ValueError(f"❌ Missing required columns: {missing}")

# Separate features and target
X = df[['location', 'total_sqft', 'bath', 'bhk']].copy()
y = df['price'].copy()

print(f"✅ Features selected: {list(X.columns)}")
print(f"✅ Target variable: price")
print(f"✅ Total samples: {len(X):,}")
print(f"\nFeature data types:")
print(X.dtypes)
print(f"\nTarget statistics:")
print(y.describe())

### 6.2 Train-Test Split

In [None]:
print("\n" + "="*80)
print("SPLITTING DATA INTO TRAIN AND TEST SETS")
print("="*80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"✅ Training set: {len(X_train):,} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"✅ Test set: {len(X_test):,} samples ({len(X_test)/len(X)*100:.1f}%)")
print(f"\nTrain set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

### 6.3 Build Preprocessing Pipeline

In [None]:
print("\n" + "="*80)
print("BUILDING PREPROCESSING PIPELINE")
print("="*80)

categorical_features = ['location']
numeric_features = ['total_sqft', 'bath', 'bhk']

preprocessor = build_preprocessor(categorical_features, numeric_features)

print(f"✅ Categorical features: {categorical_features}")
print(f"✅ Numeric features: {numeric_features}")
print(f"\nPreprocessing steps:")
print("  - Categorical: Imputation (most frequent) → One-Hot Encoding")
print("  - Numeric: Imputation (median) → Standard Scaling")
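
The categorical branch uses `OneHotEncoder(handle_unknown='ignore')`, which encodes a category that was not seen during fitting as a row of all zeros instead of raising an error. The pure-Python sketch below imitates that behavior to show the consequence. `fit_categories` and `one_hot` are hypothetical helpers and the location labels are invented; this is not scikit-learn's implementation.

```python
def fit_categories(values):
    """Collect the sorted set of categories seen during fitting."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value; a category not seen in fitting becomes all zeros."""
    return [1.0 if value == cat else 0.0 for cat in categories]


# Hypothetical location labels, for illustration only
categories = fit_categories(["Area B", "Area A", "Area B"])

print(one_hot("Area A", categories))  # [1.0, 0.0]
print(one_hot("Area Z", categories))  # [0.0, 0.0]
```

In practice this means a misspelled or new location at prediction time does not fail. The model simply receives no location signal for that row, so the estimate rests on the numeric features alone.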

### 6.4 Train Multiple Models

In [None]:
print("\n" + "="*80)
print("TRAINING AND COMPARING 8 MACHINE LEARNING MODELS")
print("="*80)

# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0, random_state=42),
    "Lasso Regression": Lasso(alpha=0.1, random_state=42),
    "ElasticNet Regression": ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42, max_depth=15),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42, max_depth=15),
    "Gradient Boosting Regressor": GradientBoostingRegressor(n_estimators=100, random_state=42, max_depth=5),
    "Support Vector Regressor (SVR)": SVR(kernel='rbf', C=1.0, epsilon=0.1)
}

results = []
best_score = -np.inf
best_model_name = None
best_pipeline = None
trained_pipelines = {}

for name, model in models.items():
    print(f"\n{'='*80}")
    print(f"Training: {name}")
    print(f"{'='*80}")
    
    # Create pipeline with its own preprocessor so that fitted state is not
    # shared between the pipelines stored in trained_pipelines
    pipeline = Pipeline(steps=[
        ('preprocessor', build_preprocessor(categorical_features, numeric_features)),
        ('model', model)
    ])
    
    # Train
    print("  ⏳ Training model...")
    pipeline.fit(X_train, y_train)
    print("  ✅ Training complete")
    
    # Predict
    print("  ⏳ Making predictions...")
    y_pred = pipeline.predict(X_test)
    print("  ✅ Predictions complete")
    
    # Evaluate
    print("  ⏳ Evaluating model...")
    metrics = evaluate_model(y_test, y_pred)
    
    # Cross-validation on the training split only, so the held-out test rows
    # never take part in fitting during validation
    print("  ⏳ Performing 5-fold cross-validation...")
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
    metrics['CV_R2_Mean'] = float(cv_scores.mean())
    metrics['CV_R2_Std'] = float(cv_scores.std())
    print("  ✅ Cross-validation complete")
    
    # Store results
    results.append({"Model": name, **metrics})
    trained_pipelines[name] = pipeline
    
    # Display metrics
    print(f"\n  📊 Performance Metrics:")
    print(f"     R² Score (Test): {metrics['R2']:.4f}")
    print(f"     MAE: ₹{metrics['MAE']:.2f} Lakhs")
    print(f"     RMSE: ₹{metrics['RMSE']:.2f} Lakhs")
    print(f"     CV R² (mean±std): {metrics['CV_R2_Mean']:.4f} ± {metrics['CV_R2_Std']:.4f}")
    
    # Track best model (selected by test-set R²)
    if metrics['R2'] > best_score:
        best_score = metrics['R2']
        best_model_name = name
        best_pipeline = pipeline
        print(f"\n  🏆 NEW BEST MODEL!")

print(f"\n\n" + "="*80)
print("MODEL TRAINING COMPLETE!")
print("="*80)
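
For a regressor, passing `cv=5` to `cross_val_score` uses K-fold splitting without shuffling: the rows are cut into five contiguous blocks, and each block serves once as the validation set. The sketch below illustrates that index arithmetic in plain Python. `kfold_indices` is a hypothetical helper, not scikit-learn code.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs over contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(11, n_splits=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

One consequence worth keeping in mind: if the rows passed to cross-validation happen to be sorted (for example by location), contiguous folds may not be representative. Passing a `KFold(n_splits=5, shuffle=True, random_state=42)` object as `cv` is one way to avoid that.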

### 6.5 Model Comparison Results

In [None]:
print("\n" + "="*80)
print("MODEL COMPARISON RESULTS")
print("="*80)

# Create results dataframe
results_df = pd.DataFrame(results).sort_values(by='R2', ascending=False)

print(f"\n🏆 BEST MODEL: {best_model_name}")
print(f"   R² Score: {best_score:.4f}\n")

print("\nAll Models Comparison (sorted by R² Score):")
display(results_df.style.highlight_max(subset=['R2', 'CV_R2_Mean'], color='lightgreen')
                         .highlight_min(subset=['MAE', 'RMSE'], color='lightcoral')
                         .format({
                             'R2': '{:.4f}',
                             'MAE': '{:.2f}',
                             'RMSE': '{:.2f}',
                             'CV_R2_Mean': '{:.4f}',
                             'CV_R2_Std': '{:.4f}'
                         }))
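
As a reference for reading the comparison table, the three metrics can be computed by hand. The sketch below uses only the standard library and made-up numbers. `regression_metrics` is a hypothetical name, and it assumes the true values are not all identical (otherwise R² is undefined).

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², MAE, and RMSE from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "RMSE": rmse}


# Made-up prices in lakhs, for illustration only
scores = regression_metrics([3, 5, 7, 9], [2, 5, 8, 10])
print(scores)
```

For these numbers the errors are 1, 0, -1, -1, so MAE is 0.75, RMSE is the square root of 0.75 (about 0.866), and R² is 1 - 3/20 = 0.85. MAE and RMSE are in the same unit as the target (lakhs), while R² is unitless.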

### 6.6 Visualize Model Performance

In [None]:
print("\n" + "="*80)
print("MODEL PERFORMANCE VISUALIZATION")
print("="*80)

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. R² Score Comparison
ax1 = axes[0, 0]
colors = ['gold' if model == best_model_name else 'skyblue' for model in results_df['Model']]
ax1.barh(results_df['Model'], results_df['R2'], color=colors, edgecolor='black')
ax1.set_xlabel('R² Score', fontsize=12, fontweight='bold')
ax1.set_title('Model Comparison - R² Score', fontsize=14, fontweight='bold')
ax1.grid(alpha=0.3, axis='x')
ax1.invert_yaxis()
for i, v in enumerate(results_df['R2']):
    ax1.text(v + 0.01, i, f'{v:.4f}', va='center', fontweight='bold')

# 2. MAE Comparison
ax2 = axes[0, 1]
ax2.barh(results_df['Model'], results_df['MAE'], color='coral', edgecolor='black')
ax2.set_xlabel('Mean Absolute Error (Lakhs)', fontsize=12, fontweight='bold')
ax2.set_title('Model Comparison - MAE (Lower is Better)', fontsize=14, fontweight='bold')
ax2.grid(alpha=0.3, axis='x')
ax2.invert_yaxis()
for i, v in enumerate(results_df['MAE']):
    ax2.text(v + 0.5, i, f'₹{v:.2f}', va='center', fontweight='bold')

# 3. RMSE Comparison
ax3 = axes[1, 0]
ax3.barh(results_df['Model'], results_df['RMSE'], color='lightgreen', edgecolor='black')
ax3.set_xlabel('Root Mean Square Error (Lakhs)', fontsize=12, fontweight='bold')
ax3.set_title('Model Comparison - RMSE (Lower is Better)', fontsize=14, fontweight='bold')
ax3.grid(alpha=0.3, axis='x')
ax3.invert_yaxis()
for i, v in enumerate(results_df['RMSE']):
    ax3.text(v + 0.5, i, f'₹{v:.2f}', va='center', fontweight='bold')

# 4. Cross-Validation R² Score
ax4 = axes[1, 1]
colors = ['gold' if model == best_model_name else 'plum' for model in results_df['Model']]
ax4.barh(results_df['Model'], results_df['CV_R2_Mean'], 
         xerr=results_df['CV_R2_Std'], color=colors, edgecolor='black', capsize=5)
ax4.set_xlabel('Cross-Validation R² Score', fontsize=12, fontweight='bold')
ax4.set_title('Model Comparison - CV R² Score (with std)', fontsize=14, fontweight='bold')
ax4.grid(alpha=0.3, axis='x')
ax4.invert_yaxis()
for i, (mean, std) in enumerate(zip(results_df['CV_R2_Mean'], results_df['CV_R2_Std'])):
    ax4.text(mean + 0.01, i, f'{mean:.4f}±{std:.4f}', va='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

### 6.7 Actual vs Predicted (Best Model)

In [None]:
print("\n" + "="*80)
print(f"ACTUAL vs PREDICTED - {best_model_name}")
print("="*80)

# Get predictions from best model
y_pred_best = best_pipeline.predict(X_test)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Scatter plot
ax1 = axes[0]
ax1.scatter(y_test, y_pred_best, alpha=0.6, color='navy', edgecolors='black')
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
         'r--', lw=3, label='Perfect Prediction')
ax1.set_xlabel('Actual Price (Lakhs)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Predicted Price (Lakhs)', fontsize=12, fontweight='bold')
ax1.set_title(f'Actual vs Predicted - {best_model_name}', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(alpha=0.3)

# Residual plot
ax2 = axes[1]
residuals = y_test - y_pred_best
ax2.scatter(y_pred_best, residuals, alpha=0.6, color='darkgreen', edgecolors='black')
ax2.axhline(y=0, color='r', linestyle='--', lw=3)
ax2.set_xlabel('Predicted Price (Lakhs)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Residuals (Lakhs)', fontsize=12, fontweight='bold')
ax2.set_title(f'Residual Plot - {best_model_name}', fontsize=14, fontweight='bold')
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Residual statistics
print(f"\nResidual Statistics:")
print(f"  Mean Residual: ₹{residuals.mean():.2f} Lakhs")
print(f"  Std Residual: ₹{residuals.std():.2f} Lakhs")
print(f"  Min Residual: ₹{residuals.min():.2f} Lakhs")
print(f"  Max Residual: ₹{residuals.max():.2f} Lakhs")

## 🔮 7. Interactive Prediction System

Use the best trained model to make predictions for custom inputs.

### 7.1 Get Available Locations

In [None]:
# Get unique locations from dataset
available_locations = sorted(df['location'].dropna().unique().tolist())

print(f"\n📍 Available Locations ({len(available_locations)} total):")
print("\nShowing first 30 locations:")
for i, loc in enumerate(available_locations[:30], 1):
    print(f"  {i}. {loc}")

if len(available_locations) > 30:
    print(f"\n... and {len(available_locations) - 30} more locations")

### 7.2 Make Predictions - Example 1

In [None]:
print("\n" + "="*80)
print("PREDICTION EXAMPLE 1")
print("="*80)

# Example input - Modify these values
example_1 = {
    'location': available_locations[0] if available_locations else 'Rajaji Nagar',
    'total_sqft': 1500.0,
    'bath': 2.0,
    'bhk': 3.0
}

# Create input dataframe
input_df_1 = pd.DataFrame([example_1])

print("\n📋 Input Details:")
display(input_df_1)

# Make prediction
prediction_1 = best_pipeline.predict(input_df_1)[0]

print(f"\n" + "="*80)
print(f"💰 PREDICTED PRICE")
print(f"="*80)
print(f"\n  ₹{prediction_1:.2f} Lakhs")
print(f"  (₹{prediction_1*100000:,.0f} Rupees)")

# Calculate price per sqft
price_per_sqft = (prediction_1 * 100000) / example_1['total_sqft']
print(f"\n📊 Price Analysis:")
print(f"  Price per sqft: ₹{price_per_sqft:,.0f}")
print(f"  Total Price: ₹{prediction_1:.2f} Lakhs")
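
The unit conversion in the cell above is simple arithmetic: one lakh is 100,000 rupees, so a price in lakhs is multiplied by 100,000 before dividing by the area. A minimal worked sketch follows, with a hypothetical helper name and a made-up price.

```python
RUPEES_PER_LAKH = 100_000


def price_per_sqft_rupees(price_lakhs, total_sqft):
    """Convert a price in lakhs to rupees per square foot."""
    if total_sqft <= 0:
        raise ValueError("total_sqft must be positive")
    return price_lakhs * RUPEES_PER_LAKH / total_sqft


# A made-up 75-lakh price for a 1,500 sq ft home:
# 75 * 100,000 = 7,500,000 rupees; 7,500,000 / 1,500 = 5,000 rupees per sq ft
print(price_per_sqft_rupees(75.0, 1500.0))  # 5000.0
```

Note that the `price_per_sqft` column created during data cleaning divides the price in lakhs by the area without this conversion, so it is expressed in lakhs per square foot and differs from the rupee figure here by a factor of 100,000.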

### 7.3 Make Predictions - Example 2

In [None]:
print("\n" + "="*80)
print("PREDICTION EXAMPLE 2")
print("="*80)

# Example input - Modify these values
example_2 = {
    'location': available_locations[5] if len(available_locations) > 5 else available_locations[0],
    'total_sqft': 2000.0,
    'bath': 3.0,
    'bhk': 4.0
}

# Create input dataframe
input_df_2 = pd.DataFrame([example_2])

print("\n📋 Input Details:")
display(input_df_2)

# Make prediction
prediction_2 = best_pipeline.predict(input_df_2)[0]

print(f"\n" + "="*80)
print(f"💰 PREDICTED PRICE")
print(f"="*80)
print(f"\n  ₹{prediction_2:.2f} Lakhs")
print(f"  (₹{prediction_2*100000:,.0f} Rupees)")

# Calculate price per sqft
price_per_sqft = (prediction_2 * 100000) / example_2['total_sqft']
print(f"\n📊 Price Analysis:")
print(f"  Price per sqft: ₹{price_per_sqft:,.0f}")
print(f"  Total Price: ₹{prediction_2:.2f} Lakhs")

### 7.4 Batch Predictions - Compare Multiple Properties

In [None]:
print("\n" + "="*80)
print("BATCH PREDICTIONS - COMPARE MULTIPLE PROPERTIES")
print("="*80)

# Create multiple property scenarios
batch_properties = pd.DataFrame([
    {'location': available_locations[0] if available_locations else 'Rajaji Nagar', 
     'total_sqft': 1000, 'bath': 2, 'bhk': 2, 'property': '2 BHK Apartment'},
    {'location': available_locations[0] if available_locations else 'Rajaji Nagar', 
     'total_sqft': 1500, 'bath': 2, 'bhk': 3, 'property': '3 BHK Apartment'},
    {'location': available_locations[0] if available_locations else 'Rajaji Nagar', 
     'total_sqft': 2000, 'bath': 3, 'bhk': 4, 'property': '4 BHK Villa'},
    {'location': available_locations[0] if available_locations else 'Rajaji Nagar', 
     'total_sqft': 2500, 'bath': 4, 'bhk': 5, 'property': '5 BHK Luxury Villa'},
])

# Make predictions
prediction_features = batch_properties[['location', 'total_sqft', 'bath', 'bhk']]
batch_predictions = best_pipeline.predict(prediction_features)
batch_properties['predicted_price_lakhs'] = batch_predictions
batch_properties['predicted_price_rupees'] = batch_predictions * 100000
batch_properties['price_per_sqft'] = batch_properties['predicted_price_rupees'] / batch_properties['total_sqft']

print("\nComparison of Different Property Types:")
display(batch_properties[['property', 'bhk', 'total_sqft', 'bath', 
                          'predicted_price_lakhs', 'price_per_sqft']].style.format({
    'predicted_price_lakhs': '₹{:.2f} L',
    'price_per_sqft': '₹{:.0f}',
    'total_sqft': '{:.0f}'
}))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Price comparison
ax1 = axes[0]
ax1.bar(range(len(batch_properties)), batch_properties['predicted_price_lakhs'], 
        color=['skyblue', 'lightgreen', 'coral', 'gold'], edgecolor='black', linewidth=2)
ax1.set_xticks(range(len(batch_properties)))
ax1.set_xticklabels(batch_properties['property'], rotation=15, ha='right')
ax1.set_ylabel('Predicted Price (Lakhs)', fontsize=12, fontweight='bold')
ax1.set_title('Price Comparison Across Property Types', fontsize=14, fontweight='bold')
ax1.grid(alpha=0.3, axis='y')
for i, v in enumerate(batch_properties['predicted_price_lakhs']):
    ax1.text(i, v + 5, f'₹{v:.1f}L', ha='center', fontweight='bold')

# Price per sqft comparison
ax2 = axes[1]
ax2.bar(range(len(batch_properties)), batch_properties['price_per_sqft'], 
        color=['skyblue', 'lightgreen', 'coral', 'gold'], edgecolor='black', linewidth=2)
ax2.set_xticks(range(len(batch_properties)))
ax2.set_xticklabels(batch_properties['property'], rotation=15, ha='right')
ax2.set_ylabel('Price per Sqft (₹)', fontsize=12, fontweight='bold')
ax2.set_title('Price per Sqft Comparison', fontsize=14, fontweight='bold')
ax2.grid(alpha=0.3, axis='y')
for i, v in enumerate(batch_properties['price_per_sqft']):
    ax2.text(i, v + 100, f'₹{v:.0f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

### 7.5 Custom Prediction Function

Create your own predictions by modifying the cell below:

In [None]:
def predict_house_price(location, sqft, bathrooms, bedrooms):
    """
    Predict house price based on input parameters.
    
    Parameters:
    -----------
    location : str
        Location of the property
    sqft : float
        Total area in square feet
    bathrooms : int
        Number of bathrooms
    bedrooms : int
        Number of bedrooms (BHK)
    
    Returns:
    --------
    dict : Dictionary containing prediction results
    """
    # Create input dataframe
    input_data = pd.DataFrame([{
        'location': location,
        'total_sqft': float(sqft),
        'bath': float(bathrooms),
        'bhk': float(bedrooms)
    }])
    
    # Make prediction
    prediction_lakhs = best_pipeline.predict(input_data)[0]
    prediction_rupees = prediction_lakhs * 100000
    price_per_sqft = prediction_rupees / sqft
    
    # Display results
    print("\n" + "="*80)
    print("🏠 HOUSE PRICE PREDICTION")
    print("="*80)
    print(f"\n📍 Location: {location}")
    print(f"🛏️  BHK: {bedrooms}")
    print(f"🚿 Bathrooms: {bathrooms}")
    print(f"📏 Total Area: {sqft:,.0f} sq ft")
    print(f"\n" + "-"*80)
    print(f"💰 Predicted Price: ₹{prediction_lakhs:.2f} Lakhs")
    print(f"💵 In Rupees: ₹{prediction_rupees:,.0f}")
    print(f"📊 Price per sq ft: ₹{price_per_sqft:,.0f}")
    print("="*80)
    
    return {
        'price_lakhs': prediction_lakhs,
        'price_rupees': prediction_rupees,
        'price_per_sqft': price_per_sqft
    }

# Example usage - Modify these values to make your own predictions
result = predict_house_price(
    location=available_locations[0] if available_locations else 'Rajaji Nagar',
    sqft=1800,
    bathrooms=3,
    bedrooms=3
)
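
`predict_house_price` passes its arguments straight to the pipeline, so a zero area causes a division error and an unknown location is silently encoded as all zeros. One possible guard is to check the inputs first. The sketch below is an assumption about what such a check could look like: `validate_inputs` is a hypothetical helper and the location names are invented.

```python
def validate_inputs(location, sqft, bathrooms, bedrooms, known_locations):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    if location not in known_locations:
        problems.append(f"unknown location: {location!r}")
    fields = (("sqft", sqft), ("bathrooms", bathrooms), ("bedrooms", bedrooms))
    for label, value in fields:
        try:
            number = float(value)
        except (TypeError, ValueError):
            problems.append(f"{label} is not numeric: {value!r}")
            continue
        if not number > 0:  # also rejects NaN
            problems.append(f"{label} must be positive, got {value!r}")
    return problems


known = {"Area A", "Area B"}  # stands in for available_locations
ok = validate_inputs("Area A", 1800, 3, 3, known)
bad = validate_inputs("Area Z", 0, "three", 3, known)
print(ok)
print(bad)
```

In the notebook, `known_locations` could be `set(available_locations)`, and the prediction would only be made when the returned list is empty.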

## 💾 8. Save Model and Results

In [None]:
print("\n" + "="*80)
print("SAVING MODEL AND RESULTS")
print("="*80)

# Create output directory
output_dir = "models"
os.makedirs(output_dir, exist_ok=True)

# Save best model
model_path = os.path.join(output_dir, "house_price_model.pkl")
joblib.dump(best_pipeline, model_path)
print(f"\n✅ Best model saved: {model_path}")
print(f"   Model: {best_model_name}")

# Save model comparison
comparison_path = os.path.join(output_dir, "model_comparison.csv")
results_df.to_csv(comparison_path, index=False)
print(f"\n✅ Model comparison saved: {comparison_path}")

# Save metrics JSON
metrics_data = {
    "best_model": best_model_name,
    "best_metrics": results_df.iloc[0].to_dict(),
    "all_models": results,
    "training_info": {
        "total_samples": len(df),
        "train_samples": len(X_train),
        "test_samples": len(X_test),
        "features": list(X.columns),
        "test_size": 0.2,
        "random_state": 42
    }
}

metrics_path = os.path.join(output_dir, "metrics.json")
with open(metrics_path, 'w') as f:
    json.dump(metrics_data, f, indent=2)
print(f"\n✅ Metrics saved: {metrics_path}")

print("\n" + "="*80)
print("ALL FILES SAVED SUCCESSFULLY!")
print("="*80)
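
A downstream script can read the saved metrics file with the standard `json` module. The sketch below writes a placeholder file with the same top-level keys the cell above uses and reads it back. The values are invented for illustration and are not results from this notebook, and the temporary directory stands in for the `models` folder.

```python
import json
import os
import tempfile

# Placeholder content with the same top-level keys the notebook writes;
# the values are invented for illustration, not real results.
example = {
    "best_model": "Example Model",
    "best_metrics": {"Model": "Example Model", "R2": 0.5, "MAE": 1.0, "RMSE": 2.0},
    "all_models": [],
    "training_info": {"features": ["location", "total_sqft", "bath", "bhk"]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metrics.json")
    with open(path, "w") as f:
        json.dump(example, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

print(loaded["best_model"])
print(loaded["training_info"]["features"])
```

The saved pipeline itself can be restored with `joblib.load(model_path)`, provided the loading environment has a compatible scikit-learn version installed.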

## 📋 9. Project Summary

In [None]:
print("\n" + "="*80)
print("PROJECT SUMMARY")
print("="*80)

summary = f"""
🏠 HOUSE PRICE PREDICTION PROJECT
{'='*80}

📊 DATASET INFORMATION:
  • Total Records: {len(df):,}
  • Features: {list(X.columns)}
  • Target: price (in Lakhs)
  • Locations: {len(available_locations)}

🔬 MODELS TRAINED:
  • Total Models: 8
  • Linear Models: 4 (Linear, Ridge, Lasso, ElasticNet)
  • Non-Linear Models: 4 (Decision Tree, Random Forest, Gradient Boosting, SVR)

🏆 BEST MODEL:
  • Model: {best_model_name}
  • R² Score: {best_score:.4f}
  • MAE: ₹{results_df.iloc[0]['MAE']:.2f} Lakhs
  • RMSE: ₹{results_df.iloc[0]['RMSE']:.2f} Lakhs
  • CV R² Score: {results_df.iloc[0]['CV_R2_Mean']:.4f} ± {results_df.iloc[0]['CV_R2_Std']:.4f}

📁 OUTPUT FILES:
  • Model: {model_path}
  • Comparison: {comparison_path}
  • Metrics: {metrics_path}

✨ PROJECT COMPLETED SUCCESSFULLY!
{'='*80}
"""

print(summary)

---

## 🎓 Conclusion

This notebook demonstrates a complete end-to-end machine learning pipeline for house price prediction:

1. ✅ **Data Loading & Preprocessing**: Loaded CSV or Excel input, standardized column names, and removed duplicates and rows with a missing price
2. ✅ **Exploratory Data Analysis**: Examined missing values, price distribution, correlations, BHK, area, and location with visualizations
3. ✅ **Feature Engineering**: Derived `bhk` from the size column and `price_per_sqft` for analysis (not used as a model input, since it is computed from the target)
4. ✅ **Model Training**: Trained and compared 8 different ML models
5. ✅ **Model Evaluation**: Used multiple metrics (R², MAE, RMSE) and 5-fold cross-validation
6. ✅ **Model Selection**: Automatically selected the model with the highest test R²
7. ✅ **Prediction System**: Helper function and worked examples for predicting prices from custom inputs
8. ✅ **Results Export**: Saved model and metrics for future use

### 📝 Key Takeaways:

- The best model's test R² score is printed in Section 6.5; read it together with MAE, RMSE, and the cross-validation scores before judging predictive performance
- The models use location, square footage, number of bathrooms, and number of bedrooms (BHK) as inputs
- The model can be used to estimate house prices for new properties

### 🚀 Next Steps:

- Try hyperparameter tuning to improve model performance
- Add more features (age of property, amenities, etc.)
- Experiment with ensemble methods
- Deploy the model as a web application

---

**Created by:** House Price Prediction ML Pipeline  