# Ames Housing Price Prediction
## Advanced Apex Project - Real Estate Price Modeling

A comprehensive machine learning approach to predicting residential property sale prices using multiple regression techniques and extensive feature engineering.

---

### Project Information

**Team:** The Outliers

**Course:** Advanced Apex Project 1

**Institution:** BITS Pilani - Digital Campus

**Academic Term:** First Trimester 2025-26

**Project Supervisor:** Bharathi Dasari

**Submission Date:** November 2024

### Team Members

| Student Name | BITS ID |
|--------------|----------|
| Anik Das | 2025EM1100026 |
| Adeetya Wadikar | 2025EM1100384 |
| Tushar Nishane | 2025EM1100306 |

---

## Executive Summary

### Problem Statement

Accurate real estate valuation is essential for buyers, sellers, and financial institutions. Traditional valuation methods can be subjective and time-consuming. This project develops machine learning models to predict house sale prices objectively based on property characteristics.

### Business Objective

Develop a predictive regression model that estimates residential property sale prices with high accuracy. The model should help stakeholders:
- **Buyers**: Assess fair market value before purchase
- **Sellers**: Set competitive listing prices
- **Investors**: Identify undervalued properties
- **Lenders**: Support loan underwriting decisions

### Dataset

**Name:** Ames Housing Dataset

**Source:** Kaggle (https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset)

**Size:** 2,930 residential property sales transactions

**Features:** 82 variables describing:
- Physical characteristics (size, rooms, age)
- Quality ratings (construction, condition)
- Location attributes (neighborhood, zoning)
- Amenities (garage, basement, fireplace, pool)

**Target Variable:** SalePrice (in USD)

**Time Period:** Properties sold in Ames, Iowa from 2006 to 2010

---

## Table of Contents

### [Phase 1: Data Acquisition](#phase1)
1.1 [Environment Setup](#setup)
1.2 [Data Loading](#loading)
1.3 [Initial Data Inspection](#inspection)
1.4 [Schema Validation](#schema)
1.5 [Data Quality Assessment](#quality)

### [Phase 2A: Data Preprocessing & Exploratory Analysis](#phase2a)
2.1 [Missing Value Analysis](#missing)
2.2 [Missing Value Treatment](#treatment)
2.3 [Univariate Analysis - Numerical](#univariate-num)
2.4 [Univariate Analysis - Categorical](#univariate-cat)
2.5 [Low-Variance Feature Removal](#lowvar)
2.6 [Bivariate Analysis - Correlations](#bivariate-corr)
2.7 [Bivariate Analysis - Visualizations](#bivariate-viz)
2.8 [Outlier Detection](#outliers)

### [Phase 2B: Feature Engineering](#phase2b)
3.1 [Feature Creation](#creation)
3.2 [Feature Transformation](#transformation)
3.3 [Categorical Encoding](#encoding)
3.4 [Feature Importance](#importance)

### [Phase 3: Model Development & Evaluation](#phase3)
4.1 [Data Preparation](#preparation)
4.2 [Simple Linear Regression](#simple-lr)
4.3 [Multiple Linear Regression](#multiple-lr)
4.4 [Model Comparison](#comparison)
4.5 [Conclusions & Recommendations](#conclusions)

---
<a id='phase1'></a>

# Phase 1: Data Acquisition

## Objective

Acquire the Ames Housing dataset and perform initial validation to ensure data integrity. This foundational phase establishes the quality and completeness of our data before proceeding to analysis.

## Deliverables

- Successfully load dataset from CSV file
- Verify data structure and schema
- Conduct initial quality checks
- Document data characteristics and potential issues

---
<a id='setup'></a>

## 1.1 Environment Setup

We import all necessary Python libraries for data manipulation, statistical analysis, visualization, and machine learning. Proper configuration ensures consistent behavior across different environments.

In [None]:
# Import core data manipulation libraries
import pandas as pd
import numpy as np
import os

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Import statistical libraries
from scipy import stats

# Import machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Configure environment
import warnings
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.width', 1000)

# Set visualization defaults
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Print confirmation
print("✓ All libraries imported successfully")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ NumPy version: {np.__version__}")
print(f"✓ Matplotlib version: {plt.matplotlib.__version__}")
print("\nEnvironment configured and ready for analysis.")

---
<a id='loading'></a>

## 1.2 Data Loading

The Ames Housing dataset was downloaded from Kaggle and stored in the project's data directory. This dataset provides comprehensive information on residential properties sold in Ames, Iowa, making it an excellent resource for developing price prediction models.

**Data Source:** Kaggle - Ames Housing Dataset

**Citation:** Shashank Necrothapa. (n.d.). Ames Housing Dataset. Kaggle. https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset

In [None]:
# Define the path to the dataset
data_path = "../data/AmesHousing.csv"

# Load the dataset into a pandas DataFrame
df = pd.read_csv(data_path)

# Display basic information
print("✓ Dataset loaded successfully!")
print(f"\nDataset Dimensions: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display first few records
print("\nFirst 5 Records:")
df.head()

---
<a id='inspection'></a>

## 1.3 Initial Data Inspection

Before conducting detailed analysis, we perform a high-level inspection to understand the dataset structure, identify data types, and spot any immediate quality concerns.

In [None]:
# Display comprehensive dataset information
print("Dataset Structure Overview:\n")
df.info()

print("\n" + "="*70)
print("Data Type Summary:")
print("="*70)
print(df.dtypes.value_counts())

print("\n" + "="*70)
print("Column Distribution:")
print("="*70)
print(f"Numerical columns (int64): {len(df.select_dtypes(include=['int64']).columns)}")
print(f"Numerical columns (float64): {len(df.select_dtypes(include=['float64']).columns)}")
print(f"Categorical columns (object): {len(df.select_dtypes(include=['object']).columns)}")

---
<a id='schema'></a>

## 1.4 Schema Validation

We verify that all expected columns are present and properly formatted. This schema validation ensures data integrity and helps identify any structural anomalies early in the process.

In [None]:
# Display all column names
print(f"Total Features: {len(df.columns)}\n")
print("All Column Names:")
print("="*70)

# Print in organized format (4 columns)
col_list = df.columns.tolist()
for i in range(0, len(col_list), 4):
    row = col_list[i:i+4]
    print(f"{i+1:2d}-{i+len(row):2d}: " + " | ".join(f"{col:20s}" for col in row))

print("\n" + "="*70)
print("Key Columns Verified:")
print("="*70)
important_cols = ['Order', 'PID', 'SalePrice', 'Gr Liv Area', 'Overall Qual', 'Neighborhood']
for col in important_cols:
    status = "✓" if col in df.columns else "✗"
    print(f"{status} {col}")
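
The check above prints a status line per column. As a minimal sketch, the same idea can return the absent columns as a list so that later cells can stop early. The helper name `find_missing_columns` and the toy frame are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

def find_missing_columns(frame, required):
    """Return the required column names that are absent from frame (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the Ames data (illustrative values only)
toy = pd.DataFrame({'Order': [1, 2], 'PID': [101, 102], 'SalePrice': [150000, 200000]})

required_cols = ['Order', 'PID', 'SalePrice', 'Gr Liv Area']
missing_cols = find_missing_columns(toy, required_cols)
print(f"Missing required columns: {missing_cols}")
```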

---
<a id='quality'></a>

## 1.5 Data Quality Assessment

We conduct initial quality checks to identify missing values, duplicate records, and verify the target variable integrity.

In [None]:
# Perform comprehensive quality checks
print("Data Quality Assessment:")
print("="*70)

# Check for missing values
total_missing = df.isnull().sum().sum()
cols_with_missing = df.isnull().any().sum()
print(f"\nMissing Value Check:")
print(f"  Total missing values: {total_missing:,}")
print(f"  Columns with missing data: {cols_with_missing} out of {len(df.columns)}")

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nDuplicate Check:")
print(f"  Duplicate rows: {duplicates}")
if duplicates == 0:
    print("  ✓ No duplicates found")

# Verify target variable
print(f"\nTarget Variable (SalePrice) Verification:")
print(f"  Missing values: {df['SalePrice'].isnull().sum()}")
print(f"  Minimum: ${df['SalePrice'].min():,}")
print(f"  Maximum: ${df['SalePrice'].max():,}")
print(f"  Mean: ${df['SalePrice'].mean():,.2f}")
print(f"  Median: ${df['SalePrice'].median():,.2f}")
print(f"  Standard Deviation: ${df['SalePrice'].std():,.2f}")

print("="*70)

In [None]:
# Create detailed schema summary table
schema_summary = pd.DataFrame({
    'Column': df.columns,
    'Data_Type': df.dtypes.values,
    'Non_Null_Count': df.count().values,
    'Null_Count': df.isnull().sum().values,
    'Null_Percentage': (df.isnull().sum() / len(df) * 100).values,
    'Unique_Values': [df[col].nunique() for col in df.columns]
})

# Sort by null percentage to see problematic columns first
schema_summary = schema_summary.sort_values('Null_Percentage', ascending=False)

print("Schema Summary (Top 20 columns by missing data):")
print("="*90)
schema_summary.head(20)

### 1.5.1 Data Dictionary Cross-Reference

We attempt to load the official data dictionary to cross-reference feature definitions and ensure our understanding aligns with the dataset documentation.

In [None]:
# Attempt to load the data dictionary
try:
    data_dict_path = "../docs/data_dictionary.xlsx"
    data_dict = pd.read_excel(data_dict_path)
    print(f"✓ Data dictionary loaded successfully")
    print(f"  Total feature descriptions: {len(data_dict)}")
    print(f"\nFirst 10 Feature Definitions:")
    print("="*70)
    print(data_dict.head(10))
except FileNotFoundError:
    print("ℹ Data dictionary file not found at expected location")
    print("  This is not critical - proceeding with dataset analysis")
    print(f"  Expected path: {data_dict_path}")
except Exception as e:
    print(f"ℹ Could not load data dictionary: {str(e)}")
    print("  Proceeding with dataset analysis")

---

## Phase 1 Summary

### Accomplishments

✅ **Environment Configured**
- All required libraries imported successfully
- Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn ready
- Display settings optimized for analysis

✅ **Dataset Successfully Loaded**
- **Source:** Ames Housing Dataset from Kaggle
- **Size:** 2,930 residential property records
- **Features:** 82 variables (28 int, 11 float, 43 categorical)
- **Memory:** ~2MB dataset size
- **Target:** SalePrice (range: $12,789 - $755,000)

✅ **Data Quality Verified**
- Schema matches expectations (82 columns present)
- No duplicate records identified
- Target variable has no missing values
- 27 features contain missing values (to be addressed in Phase 2)

✅ **Initial Observations**
- Mix of numerical and categorical features
- Some features have high missingness (>50%) - candidates for removal
- Price range suggests diverse property types
- Data appears well-structured and ready for analysis

### Next Steps

Proceed to **Phase 2A: Data Preprocessing & Exploratory Analysis** where we will:
- Conduct comprehensive missing value analysis
- Implement systematic data cleaning procedures
- Perform univariate and bivariate analysis
- Identify and handle outliers
- Prepare data for feature engineering

---
<a id='phase2a'></a>

# Phase 2A: Data Preprocessing & Exploratory Analysis

## Objective

Transform raw data into a clean, analysis-ready format through systematic preprocessing. Conduct comprehensive exploratory analysis to understand variable distributions, relationships, and data quality issues.

## Key Activities

- Systematic missing value analysis and treatment
- Univariate analysis of all features
- Bivariate analysis to identify price predictors
- Low-variance feature identification and removal
- Outlier detection and assessment

---
<a id='missing'></a>

## 2.1 Missing Value Analysis

Missing data is common in real-world datasets. We systematically analyze missing value patterns to develop an appropriate treatment strategy.

In [None]:
# Calculate missing value statistics
missing_counts = df.isnull().sum()
missing_pct = (missing_counts / len(df)) * 100

missing_df = pd.DataFrame({
    'Feature': missing_counts.index,
    'Missing_Count': missing_counts.values,
    'Missing_Percentage': missing_pct.values
})

# Filter to only features with missing values
missing_df = missing_df[missing_df['Missing_Count'] > 0]
missing_df = missing_df.sort_values('Missing_Percentage', ascending=False)

print(f"Features with Missing Values: {len(missing_df)} out of {len(df.columns)}")
print("\nTop 15 Features with Most Missing Data:")
print("="*70)
missing_df.head(15)

### 2.1.1 Missing Value Visualization

Visualizing missingness helps reveal whether values are likely missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).

In [None]:
# Visualize missing data patterns using missingno
plt.figure(figsize=(14, 8))
msno.matrix(df, figsize=(14, 8), fontsize=10, sparkline=False)
plt.title('Missing Value Matrix - Complete Dataset View', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("Matrix shows:")  
print("  - White lines = missing values")
print("  - Dark bars = complete data")
print("  - Patterns suggest some features missing together (e.g., garage features)")
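
The "missing together" pattern noted above can also be quantified. The sketch below, on a toy frame with made-up values, correlates the missingness indicators of each column; a value near 1 means two columns tend to be missing on the same rows. The toy data and variable names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Toy frame: the two garage columns are missing on the same rows, the lot column is not
toy = pd.DataFrame({
    'Garage Type': ['Attchd', np.nan, 'Detchd', np.nan, 'Attchd', 'Attchd'],
    'Garage Qual': ['TA', np.nan, 'TA', np.nan, 'Gd', 'TA'],
    'Lot Frontage': [60.0, 70.0, np.nan, 80.0, 65.0, np.nan],
})

# 1 where a value is missing, 0 otherwise
null_flags = toy.isnull().astype(int)

# Correlation between missingness indicators; values near 1 mean columns are missing together
nullity_corr = null_flags.corr()
print(nullity_corr.round(2))
```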

In [None]:
# Bar chart of missing percentages
plt.figure(figsize=(12, 8))
missing_to_plot = missing_df.head(20)
plt.barh(range(len(missing_to_plot)), missing_to_plot['Missing_Percentage'].values, color='coral', alpha=0.7)
plt.yticks(range(len(missing_to_plot)), missing_to_plot['Feature'].values)
plt.xlabel('Percentage Missing (%)', fontweight='bold', fontsize=11)
plt.ylabel('Feature', fontweight='bold', fontsize=11)
plt.title('Top 20 Features by Missing Data Percentage', fontweight='bold', fontsize=13)
plt.axvline(x=50, color='red', linestyle='--', linewidth=2, label='50% threshold')
plt.legend()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

### Key Observations from Missing Data Analysis

**High Missingness (>50% - Candidates for Removal):**
- **Pool QC** (99.6%): Pool quality - most homes don't have pools
- **Misc Feature** (96.4%): Miscellaneous features - rarely present
- **Alley** (93.2%): Alley access type - uncommon
- **Fence** (80.5%): Fence quality - many homes lack fences

**Low-to-Moderate Missingness (<50% - Contextual Imputation):**
- **Fireplace Qu** (48.5%): Fireplace quality - indicates no fireplace
- **Lot Frontage** (16.7%): Linear feet of street connected to property
- **Garage features** (~5%): Likely indicates no garage
- **Basement features** (~3%): Likely indicates no basement

**Strategy:** Drop high-missingness features, impute others based on context

---
<a id='treatment'></a>

## 2.2 Missing Value Treatment

We implement a systematic five-step treatment strategy based on missingness patterns and feature semantics:

1. **Drop** features with >50% missing (insufficient data for reliable imputation)
2. **Categorical imputation**: Fill with 'None' for features where absence has meaning
3. **Numerical imputation**: Fill with 0 for counts/areas where absence = zero
4. **Context-aware imputation**: Neighborhood-based median for Lot Frontage
5. **Remaining cases**: Fill Garage Yr Blt with Year Built and Electrical with its mode

In [None]:
# Step 1: Drop columns with excessive missing values (>50%)
threshold = 50
cols_to_drop = missing_df[missing_df['Missing_Percentage'] > threshold]['Feature'].tolist()

print(f"Dropping {len(cols_to_drop)} features with >{threshold}% missing:")
print("="*70)
for col in cols_to_drop:
    pct = missing_df[missing_df['Feature'] == col]['Missing_Percentage'].values[0]
    print(f"  - {col:20s}: {pct:6.2f}% missing")

df = df.drop(columns=cols_to_drop)
print(f"\nDataset shape after dropping: {df.shape}")
print(f"Columns remaining: {df.shape[1]}")

In [None]:
# Step 2: Impute categorical features with 'None'
# For these features, missing means the feature doesn't exist
categorical_none = [
    'Mas Vnr Type', 'Fireplace Qu', 'Garage Type', 'Garage Finish',
    'Garage Qual', 'Garage Cond', 'Bsmt Qual', 'Bsmt Cond',
    'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2'
]

print("Imputing categorical features (None = feature absent):")
print("="*70)

for col in categorical_none:
    if col in df.columns:
        before_count = df[col].isnull().sum()
        df[col] = df[col].fillna('None')
        print(f"  ✓ {col:25s}: {before_count:4d} values → 'None'")

print(f"\nCategorical imputation complete.")

In [None]:
# Step 3: Impute numerical features with 0
# For areas and counts, zero indicates feature is absent
numeric_zero = [
    'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF',
    'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath',
    'Garage Cars', 'Garage Area'
]

print("Imputing numerical features (0 = feature absent):")
print("="*70)

for col in numeric_zero:
    if col in df.columns:
        before_count = df[col].isnull().sum()
        df[col] = df[col].fillna(0)
        print(f"  ✓ {col:25s}: {before_count:4d} values → 0")

print(f"\nNumerical imputation complete.")

In [None]:
# Step 4: Neighborhood-based imputation for Lot Frontage
# Lot Frontage varies by neighborhood, so use neighborhood median
print("Imputing Lot Frontage using neighborhood-grouped median:")
print("="*70)

before_count = df['Lot Frontage'].isnull().sum()
print(f"Missing before: {before_count}\n")

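# Group by neighborhood and fill with that neighborhood's median
df['Lot Frontage'] = df.groupby('Neighborhood')['Lot Frontage'].transform(
    lambda x: x.fillna(x.median())
)
# A neighborhood with no observed frontage has a NaN median; fall back to the overall median
df['Lot Frontage'] = df['Lot Frontage'].fillna(df['Lot Frontage'].median())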

after_count = df['Lot Frontage'].isnull().sum()
print(f"Missing after: {after_count}")
print(f"✓ Imputed {before_count - after_count} values using neighborhood medians")
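
Grouped median imputation cannot fill a group that has no observed values, because that group's median is itself undefined. The sketch below shows the effect on toy data (neighborhood labels and values are made up) and one possible fallback to the overall median:

```python
import numpy as np
import pandas as pd

# Toy data: neighborhood 'B' has no observed frontage at all
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Lot Frontage': [60.0, np.nan, 80.0, np.nan, np.nan],
})

# Per-neighborhood median; 'B' stays NaN because its median is undefined
filled = toy.groupby('Neighborhood')['Lot Frontage'].transform(
    lambda s: s.fillna(s.median())
)
still_missing = int(filled.isnull().sum())

# Fallback (one option among several): overall median of the observed values
filled = filled.fillna(toy['Lot Frontage'].median())
print(f"Missing after grouped fill: {still_missing}")
print(filled.tolist())
```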

In [None]:
# Step 5: Handle remaining missing values
print("Handling remaining missing values:")
print("="*70)

# Garage Year Built - use house year if missing
if 'Garage Yr Blt' in df.columns and df['Garage Yr Blt'].isnull().sum() > 0:
    before = df['Garage Yr Blt'].isnull().sum()
    df['Garage Yr Blt'] = df['Garage Yr Blt'].fillna(df['Year Built'])
    print(f"  ✓ Garage Yr Blt: {before} values → Year Built (no garage = same as house)")

# Electrical - only 1 missing, use mode
if 'Electrical' in df.columns and df['Electrical'].isnull().sum() > 0:
    before = df['Electrical'].isnull().sum()
    mode_val = df['Electrical'].mode()[0]
    df['Electrical'] = df['Electrical'].fillna(mode_val)
    print(f"  ✓ Electrical: {before} value → '{mode_val}' (mode)")

print(f"\nAll specific imputations complete.")

In [None]:
# Verify all missing values have been handled
remaining_missing = df.isnull().sum().sum()
cols_with_missing = df.isnull().any().sum()

print("\n" + "="*70)
print("MISSING VALUE TREATMENT - FINAL VERIFICATION")
print("="*70)
print(f"Total missing values remaining: {remaining_missing}")
print(f"Columns with missing values: {cols_with_missing}")

if remaining_missing == 0:
    print("\n✅ SUCCESS: All missing values successfully handled!")
    print("   Dataset is now complete and ready for analysis.")
else:
    print(f"\n⚠ WARNING: {remaining_missing} missing values still present")
    print("\nColumns with remaining missing values:")
    still_missing = df.isnull().sum()
    print(still_missing[still_missing > 0])

print("="*70)
print(f"Final dataset shape: {df.shape}")

---
<a id='univariate-num'></a>

## 2.3 Univariate Analysis - Numerical Features

We examine the distribution of each numerical variable to understand central tendencies, spread, skewness, and potential data quality issues.

In [None]:
# Select numerical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols = [col for col in numeric_cols if col not in ['Order', 'PID']]

print(f"Analyzing {len(numeric_cols)} numerical features\n")
print("First 10 numerical features:")
for i, col in enumerate(numeric_cols[:10], 1):
    print(f"  {i:2d}. {col}")

In [None]:
# Create comprehensive histograms for all numerical features
fig, axes = plt.subplots(10, 4, figsize=(20, 25))
axes = axes.ravel()

for idx, col in enumerate(numeric_cols):
    if idx < 40:
        axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7, color='steelblue')
        axes[idx].set_title(col, fontweight='bold', fontsize=10)
        axes[idx].set_ylabel('Frequency', fontsize=8)
        axes[idx].tick_params(labelsize=8)

for idx in range(len(numeric_cols), 40):
    axes[idx].axis('off')

plt.suptitle('Distribution of Numerical Features', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

### Distribution Patterns Observed

**Right-Skewed (Positive Skew):**
- Lot Area, Sale Price, Living Area
- Most values concentrated at lower end

**Approximately Symmetric:**
- Number of bedrooms, bathrooms
- Discrete counts centered on a typical value

**Left-Skewed:**
- Year Built, Overall Quality
- More recent/higher quality homes

---
<a id='univariate-cat'></a>

## 2.4 Univariate Analysis - Categorical Features

Examine categorical variables to understand category distributions and identify dominant values.

In [None]:
# Select categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"Analyzing {len(categorical_cols)} categorical features\n")

# Show value counts for key categorical features
key_cats = ['MS Zoning', 'Neighborhood', 'Bldg Type', 'House Style']
for cat in key_cats:
    if cat in df.columns:
        print(f"\n{cat}:")
        print(df[cat].value_counts().head())

In [None]:
# Visualize categorical features
fig, axes = plt.subplots(3, 3, figsize=(18, 12))
axes = axes.ravel()

cat_viz = ['MS Zoning', 'Neighborhood', 'Bldg Type', 'House Style', 'Foundation', 
           'Heating QC', 'Central Air', 'Kitchen Qual', 'Sale Condition']

for idx, col in enumerate(cat_viz):
    if col in df.columns and idx < 9:
        vc = df[col].value_counts().head(10)
        axes[idx].bar(range(len(vc)), vc.values, color='coral', alpha=0.7)
        axes[idx].set_xticks(range(len(vc)))
        axes[idx].set_xticklabels(vc.index, rotation=45, ha='right', fontsize=8)
        axes[idx].set_title(col, fontweight='bold')
        axes[idx].set_ylabel('Count')

plt.tight_layout()
plt.show()

---
<a id='lowvar'></a>

## 2.5 Low-Variance Feature Removal

Features dominated by a single category provide little predictive power.

In [None]:
# Report the dominant category for each pre-selected low-variance feature, then drop them
low_var_cols = ['Street', 'Utilities', 'Condition 2', 'Roof Matl', 'Heating', 'Land Slope']

print(f"Dropping {len(low_var_cols)} low-variance features:\n")
for col in low_var_cols:
    if col in df.columns:
        dominant = df[col].value_counts().index[0]
        pct = (df[col].value_counts().iloc[0] / len(df)) * 100
        print(f"  - {col:15s}: {pct:5.1f}% are '{dominant}'")

df = df.drop(columns=[c for c in low_var_cols if c in df.columns])
print(f"\nNew shape: {df.shape}")
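
The list above is hard-coded. As a sketch of how such a list could be derived instead, the snippet below flags columns whose most frequent category exceeds a cutoff. The 95% threshold, the helper name `dominant_share`, and the toy data are assumptions, not values taken from the notebook:

```python
import pandas as pd

def dominant_share(series):
    """Fraction of non-null rows taken by the most frequent category (hypothetical helper)."""
    return series.value_counts(normalize=True).iloc[0]

# Toy data: 'Street' is dominated by one value, 'Bldg Type' is not
toy = pd.DataFrame({
    'Street': ['Pave'] * 99 + ['Grvl'],
    'Bldg Type': ['1Fam'] * 60 + ['TwnhsE'] * 25 + ['Duplex'] * 15,
})

threshold = 0.95  # assumed cutoff for "dominated by a single category"
low_var = [col for col in toy.columns if dominant_share(toy[col]) > threshold]
print(low_var)
```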

---
<a id='bivariate-corr'></a>

## 2.6 Bivariate Analysis - Correlations

Examine relationships between features and the target variable.

In [None]:
# Calculate correlation with SalePrice
corr_matrix = df.corr(numeric_only=True)
saleprice_corr = corr_matrix['SalePrice'].sort_values(ascending=False)

print("Top 15 Features Correlated with SalePrice:\n")
print(saleprice_corr.head(15))

In [None]:
# Correlation heatmap
top_features = saleprice_corr.head(12).index
corr_subset = df[top_features].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_subset, annot=True, fmt='.2f', cmap='RdYlBu_r',
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap - Top Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---
<a id='bivariate-viz'></a>

## 2.7 Bivariate Analysis - Visualizations

Scatter plots reveal relationships between features and sale price.

In [None]:
# Scatter plots for top features
top_num = ['Gr Liv Area', 'Garage Area', 'Total Bsmt SF', '1st Flr SF', 'Year Built']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, feat in enumerate(top_num[:6]):
    if feat in df.columns:
        axes[idx].scatter(df[feat], df['SalePrice'], alpha=0.5, s=20)
        axes[idx].set_xlabel(feat, fontweight='bold')
        axes[idx].set_ylabel('SalePrice', fontweight='bold')
        corr = df[[feat, 'SalePrice']].corr().iloc[0,1]
        axes[idx].set_title(f'{feat} (r={corr:.3f})')

axes[5].axis('off')
plt.tight_layout()
plt.show()

In [None]:
# Box plots for categorical features
cat_feats = ['Overall Qual', 'Neighborhood', 'Kitchen Qual', 'Garage Type']

fig, axes = plt.subplots(2, 2, figsize=(16, 10))
axes = axes.ravel()

for idx, feat in enumerate(cat_feats):
    if feat in df.columns:
        order = df.groupby(feat)['SalePrice'].median().sort_values().index
        data = [df[df[feat]==cat]['SalePrice'].values for cat in order]
        axes[idx].boxplot(data, labels=order)
        axes[idx].set_xlabel(feat, fontweight='bold')
        axes[idx].set_ylabel('SalePrice', fontweight='bold')
        axes[idx].tick_params(axis='x', rotation=45, labelsize=8)

plt.tight_layout()
plt.show()

---
<a id='outliers'></a>

## 2.8 Outlier Detection

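We use the interquartile range (IQR) method: values more than 1.5 × IQR below the first quartile or above the third quartile are flagged as potential outliers.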

In [None]:
# IQR outlier detection
def detect_outliers(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower) | (data[column] > upper)]
    return outliers, lower, upper

key_feats = ['SalePrice', 'Gr Liv Area', 'Lot Area', 'Total Bsmt SF']

print("Outlier Detection Results:\n")
for feat in key_feats:
    outliers, lower, upper = detect_outliers(df, feat)
    print(f"{feat}:")
    print(f"  Bounds: [{lower:.0f}, {upper:.0f}]")
    print(f"  Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)\n")

**Decision:** Retain outliers as they represent legitimate high-value properties and large estates.
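
The IQR rule can be checked by hand on a small series. The values below are made up for illustration; the quartiles use pandas' default linear interpolation:

```python
import pandas as pd

prices = pd.Series([10, 12, 14, 16, 18, 100])

q1 = prices.quantile(0.25)   # 12.5
q3 = prices.quantile(0.75)   # 17.5
iqr = q3 - q1                # 5.0
lower = q1 - 1.5 * iqr       # 5.0
upper = q3 + 1.5 * iqr       # 25.0

# Only the value 100 falls outside [5.0, 25.0]
flagged = prices[(prices < lower) | (prices > upper)]
print(f"Bounds: [{lower}, {upper}]  Flagged: {flagged.tolist()}")
```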

---
<a id='phase2b'></a>

# Phase 2B: Feature Engineering

## Objective

Create meaningful features and transform data for optimal model performance.

---
<a id='creation'></a>

## 3.1 Feature Creation

In [None]:
# Create engineered features
print("Engineering features...\n")

df['Total_Bathrooms'] = df['Full Bath'] + 0.5*df['Half Bath'] + df['Bsmt Full Bath'] + 0.5*df['Bsmt Half Bath']
df['Total_Porch_SF'] = df['Wood Deck SF'] + df['Open Porch SF'] + df['Enclosed Porch'] + df['3Ssn Porch'] + df['Screen Porch']
df['House_Age'] = df['Yr Sold'] - df['Year Built']
df['Years_Since_Remod'] = df['Yr Sold'] - df['Year Remod/Add']
df['Total_SF'] = df['Total Bsmt SF'] + df['Gr Liv Area']

print("✓ 5 new features created")
print(f"Total features: {df.shape[1]}")

In [None]:
# Check new feature correlations
new_feats = ['Total_Bathrooms', 'Total_Porch_SF', 'House_Age', 'Years_Since_Remod', 'Total_SF']
for feat in new_feats:
    corr = df[[feat, 'SalePrice']].corr().iloc[0,1]
    print(f"{feat:25s}: {corr:.4f}")
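
Difference-based features such as `House_Age` can turn negative if a sale is recorded before the listed build year. A small sketch on toy rows shows how to count such cases; clipping at zero is one possible treatment and is an assumption here, not a step the notebook performs:

```python
import pandas as pd

# Toy rows: the second one is sold a year before its recorded build year
toy = pd.DataFrame({'Yr Sold': [2008, 2007, 2010], 'Year Built': [1998, 2008, 2010]})

toy['House_Age'] = toy['Yr Sold'] - toy['Year Built']
n_negative = int((toy['House_Age'] < 0).sum())

# One option (an assumption, not the notebook's choice): clip negative ages to zero
toy['House_Age_Clipped'] = toy['House_Age'].clip(lower=0)
print(f"Negative ages: {n_negative}")
print(toy)
```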

---
<a id='transformation'></a>

## 3.2 Feature Transformation

In [None]:
# Analyze skewness
from scipy import stats
skewed = []
for col in df.select_dtypes(include=[np.number]).columns:
    if col != 'SalePrice':
        skew = stats.skew(df[col].dropna())
        if abs(skew) > 1:
            skewed.append((col, skew))

print(f"Highly skewed features (|skew| > 1): {len(skewed)}\n")
for feat, skew in sorted(skewed, key=lambda x: abs(x[1]), reverse=True)[:10]:
    print(f"  {feat:25s}: {skew:7.2f}")
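
The cell above only measures skewness. A common follow-up for right-skewed, non-negative features is a `log1p` transform. The sketch below uses synthetic lognormal values as a stand-in for a feature such as Lot Area, and pandas' `Series.skew` rather than `scipy.stats.skew`; it is not a transformation the notebook applies:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values (illustrative, not drawn from the Ames data)
values = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=2000))

skew_before = values.skew()
skew_after = np.log1p(values).skew()  # log(1 + x) compresses the long right tail

print(f"Skew before: {skew_before:.2f}")
print(f"Skew after log1p: {skew_after:.2f}")
```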

---
<a id='encoding'></a>

## 3.3 Categorical Encoding

In [None]:
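# Encode categorical variables as integer codes
# Note: LabelEncoder assigns codes in sorted label order, which is arbitrary for nominal features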
from sklearn.preprocessing import LabelEncoder

df_encoded = df.copy()
cat_cols = df_encoded.select_dtypes(include=['object']).columns

label_encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
    label_encoders[col] = le

print(f"✓ Encoded {len(cat_cols)} categorical features")
print(f"All features now numeric: {df_encoded.shape}")
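
Integer codes imply an ordering that a linear model will treat as meaningful. One-hot encoding is a common alternative for nominal features. The sketch below uses `pd.get_dummies` on a toy frame with made-up values; it is offered as an option to compare against, not as the encoding used in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['NAmes', 'Gilbert', 'NAmes', 'StoneBr'],
    'Gr Liv Area': [1200, 1500, 1100, 2400],
})

# drop_first=True drops one level per feature to avoid perfectly collinear dummy columns
toy_onehot = pd.get_dummies(toy, columns=['Neighborhood'], drop_first=True, dtype=int)
print(toy_onehot.columns.tolist())
print(toy_onehot)
```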

---
<a id='importance'></a>

## 3.4 Feature Importance

In [None]:
# Random Forest feature importance
from sklearn.ensemble import RandomForestRegressor

X = df_encoded.drop(['SalePrice', 'Order', 'PID'], axis=1, errors='ignore')
y = df_encoded['SalePrice']

rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("Top 15 Most Important Features:\n")
print(importances.head(15).to_string(index=False))

In [None]:
# Visualize top 20
plt.figure(figsize=(10, 8))
top20 = importances.head(20)
plt.barh(range(len(top20)), top20['Importance'].values, color='steelblue')
plt.yticks(range(len(top20)), top20['Feature'].values)
plt.xlabel('Importance', fontweight='bold')
plt.title('Top 20 Feature Importances', fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### Phase 2B Summary

✅ 5 engineered features created
✅ Categorical encoding complete
✅ Feature importance analyzed
✅ Dataset ready for modeling

---
<a id='phase3'></a>

# Phase 3: Model Development & Evaluation

## Objective

Build regression models to predict house prices and evaluate their performance.

---
<a id='preparation'></a>

## 4.1 Data Preparation

In [None]:
# Prepare data
X = df_encoded.drop(['SalePrice', 'Order', 'PID'], axis=1, errors='ignore')
y = df_encoded['SalePrice']

# Handle any remaining NaNs
for col in X.columns:
    if X[col].isnull().sum() > 0:
        X[col] = X[col].fillna(X[col].median())

print(f"Features: {X.shape}")
print(f"Target: {y.shape}")

In [None]:
# Train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Testing: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
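
Filling NaNs with medians computed on the full feature matrix lets test rows influence the fill values. A stricter variant computes the medians on the training rows only and reuses them for the test rows. The sketch below uses a toy frame and a hypothetical positional split purely for illustration:

```python
import numpy as np
import pandas as pd

X_toy = pd.DataFrame({'Lot Frontage': [60.0, 70.0, 80.0, np.nan, 200.0, np.nan]})

# Hypothetical split: first four rows are training data, last two are test data
X_tr, X_te = X_toy.iloc[:4].copy(), X_toy.iloc[4:].copy()

# Medians come from the training rows only (here: median of 60, 70, 80)
train_medians = X_tr.median()

X_tr = X_tr.fillna(train_medians)
X_te = X_te.fillna(train_medians)
print(X_te['Lot Frontage'].tolist())
```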

---
<a id='simple-lr'></a>

## 4.2 Simple Linear Regression

In [None]:
# Identify best feature
corrs = X_train.corrwith(y_train).abs().sort_values(ascending=False)
best_feat = corrs.index[0]

print(f"Best feature: {best_feat}")
print(f"Correlation: {corrs[best_feat]:.4f}")

X_train_simple = X_train[[best_feat]]
X_test_simple = X_test[[best_feat]]

In [None]:
# Train Simple LR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import math

model_simple = LinearRegression()
model_simple.fit(X_train_simple, y_train)

y_train_pred_s = model_simple.predict(X_train_simple)
y_test_pred_s = model_simple.predict(X_test_simple)

r2_train_s = r2_score(y_train, y_train_pred_s)
r2_test_s = r2_score(y_test, y_test_pred_s)
rmse_s = math.sqrt(mean_squared_error(y_test, y_test_pred_s))
mae_s = mean_absolute_error(y_test, y_test_pred_s)

print(f"Simple LR Results:")
print(f"  R² (train): {r2_train_s:.4f}")
print(f"  R² (test): {r2_test_s:.4f}")
print(f"  RMSE: ${rmse_s:,.2f}")
print(f"  MAE: ${mae_s:,.2f}")
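
The three metrics reported above can be reproduced by hand, which helps when interpreting them. The sketch below computes MAE, RMSE, and R² with NumPy on four made-up price pairs:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

errors = y_true - y_pred                      # [-10, 10, -10, 20]

mae = np.mean(np.abs(errors))                 # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))          # root mean squared error

ss_res = np.sum(errors ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                      # coefficient of determination

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R²: {r2:.4f}")
```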

In [None]:
# Visualize Simple LR
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].scatter(X_test_simple, y_test, alpha=0.5, s=30)
axes[0].plot(X_test_simple, y_test_pred_s, 'r-', lw=2)
axes[0].set_xlabel(best_feat, fontweight='bold')
axes[0].set_ylabel('SalePrice', fontweight='bold')
axes[0].set_title(f'Simple LR: {best_feat}', fontweight='bold')

axes[1].scatter(y_test, y_test_pred_s, alpha=0.5, s=30)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual', fontweight='bold')
axes[1].set_ylabel('Predicted', fontweight='bold')
axes[1].set_title(f'R² = {r2_test_s:.4f}', fontweight='bold')

plt.tight_layout()
plt.show()

---
<a id='multiple-lr'></a>

## 4.3 Multiple Linear Regression

In [None]:
# Train Multiple LR
model_multiple = LinearRegression()
model_multiple.fit(X_train, y_train)

y_train_pred_m = model_multiple.predict(X_train)
y_test_pred_m = model_multiple.predict(X_test)

r2_train_m = r2_score(y_train, y_train_pred_m)
r2_test_m = r2_score(y_test, y_test_pred_m)
rmse_m = math.sqrt(mean_squared_error(y_test, y_test_pred_m))
mae_m = mean_absolute_error(y_test, y_test_pred_m)

print(f"Multiple LR Results ({X_train.shape[1]} features):")
print(f"  R² (train): {r2_train_m:.4f}")
print(f"  R² (test): {r2_test_m:.4f}")
print(f"  RMSE: ${rmse_m:,.2f}")
print(f"  MAE: ${mae_m:,.2f}")

In [None]:
# Visualize Multiple LR
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].scatter(y_test, y_test_pred_m, alpha=0.5, s=30, color='green')
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Price', fontweight='bold')
axes[0].set_ylabel('Predicted Price', fontweight='bold')
axes[0].set_title(f'Multiple LR: R² = {r2_test_m:.4f}', fontweight='bold')

residuals = y_test - y_test_pred_m
axes[1].scatter(y_test_pred_m, residuals, alpha=0.5, s=30, color='green')
axes[1].axhline(0, color='red', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted Price', fontweight='bold')
axes[1].set_ylabel('Residuals', fontweight='bold')
axes[1].set_title('Residual Plot', fontweight='bold')

plt.tight_layout()
plt.show()

---
<a id='comparison'></a>

## 4.4 Model Comparison

In [None]:
# Comparison table
comp = pd.DataFrame({
    'Metric': ['Features', 'R² (Train)', 'R² (Test)', 'RMSE', 'MAE'],
    'Simple LR': [1, f'{r2_train_s:.4f}', f'{r2_test_s:.4f}', f'${rmse_s:,.0f}', f'${mae_s:,.0f}'],
    'Multiple LR': [X.shape[1], f'{r2_train_m:.4f}', f'{r2_test_m:.4f}', f'${rmse_m:,.0f}', f'${mae_m:,.0f}']
})

print("\n" + "="*70)
print("MODEL COMPARISON")
print("="*70)
print(comp.to_string(index=False))
print("="*70)

In [None]:
# Visual comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

axes[0].bar(['Simple', 'Multiple'], [r2_test_s, r2_test_m], color=['steelblue', 'green'])
axes[0].set_ylabel('R² Score', fontweight='bold')
axes[0].set_title('R² Comparison', fontweight='bold')
axes[0].set_ylim([0, 1])

axes[1].bar(['Simple', 'Multiple'], [rmse_s, rmse_m], color=['steelblue', 'green'])
axes[1].set_ylabel('RMSE ($)', fontweight='bold')
axes[1].set_title('RMSE (Lower Better)', fontweight='bold')

axes[2].bar(['Simple', 'Multiple'], [mae_s, mae_m], color=['steelblue', 'green'])
axes[2].set_ylabel('MAE ($)', fontweight='bold')
axes[2].set_title('MAE (Lower Better)', fontweight='bold')

plt.tight_layout()
plt.show()

---
<a id='conclusions'></a>

## 4.5 Conclusions & Recommendations

### Key Findings

**Simple LR:** Provides interpretable baseline using single best feature

**Multiple LR:** Better performance than the baseline using all features; see the test R², RMSE, and MAE in Section 4.4

### Recommendations

1. Deploy Multiple LR for production use
2. Model suitable for property valuation
3. Future work: explore Random Forest and Gradient Boosting
4. Consider regularization (Ridge, LASSO)
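
To illustrate the regularization recommendation, the sketch below fits ordinary least squares and Ridge on synthetic data with two nearly identical predictors. The data, the `alpha=1.0` setting, and the variable names are assumptions for illustration; no results from the Ames models are implied:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two almost identical predictors, mimicking strongly correlated features
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X_syn = np.column_stack([x1, x2])
y_syn = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X_syn, y_syn)
ridge = Ridge(alpha=1.0).fit(X_syn, y_syn)  # alpha chosen arbitrarily for the sketch

# The L2 penalty shrinks the coefficient vector relative to OLS
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"OLS coefficients:   {ols.coef_.round(2)}  (norm {ols_norm:.2f})")
print(f"Ridge coefficients: {ridge.coef_.round(2)}  (norm {ridge_norm:.2f})")
```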

In [None]:
# Final summary
print("\n" + "="*70)
print("PROJECT COMPLETE")
print("="*70)
print(f"Dataset: {len(df):,} properties")
print(f"Features: {X.shape[1]}")
print(f"Best Model: Multiple LR")
print(f"R²: {r2_test_m:.4f}")
print(f"RMSE: ${rmse_m:,.0f}")
print(f"MAE: ${mae_m:,.0f}")
print("="*70)

## Project Complete

This analysis successfully developed predictive models for house price estimation.

**All phases completed:**
- ✅ Phase 1: Data Acquisition
- ✅ Phase 2A: Preprocessing & EDA
- ✅ Phase 2B: Feature Engineering
- ✅ Phase 3: Modeling & Evaluation