# Ames Housing Price Prediction
## Advanced Apex Project - Complete Implementation

**A comprehensive machine learning project for real estate price prediction**

---

### Project Information

**Team Name:** The Outliers

**Course:** Advanced Apex Project 1

**Institution:** BITS Pilani Digital

**Academic Period:** First Trimester 2025-26

**Project Supervisor:** Bharathi Dasari

### Team Members

| Student Name | BITS ID |
|--------------|----------|
| Anik Das | 2025EM1100026 |
| Adeetya Wadikar | 2025EM1100384 |
| Tushar Nishane | 2025EM1100306 |

---

## Executive Summary

### Problem Statement

Real estate valuation is a critical challenge in the housing market. Accurate price prediction helps buyers, sellers, and investors make informed financial decisions. This project develops machine learning regression models to predict residential property sale prices based on property characteristics.

### Business Objective

Build a predictive model that estimates house sale prices with high accuracy using property features such as size, quality, location, and amenities. The model aims to provide reliable valuations that can support:

- Property buyers in assessing fair market value
- Real estate agents in pricing recommendations
- Investors in portfolio decision-making
- Financial institutions in loan underwriting

### Dataset Overview

**Source:** Ames Housing Dataset (Kaggle)

**Size:** 2,930 residential property transactions

**Features:** 82 variables including:
- Physical characteristics (square footage, rooms, age)
- Quality ratings (condition, materials)
- Location attributes (neighborhood, zoning)
- Amenities (garage, basement, pool)

**Target Variable:** SalePrice (in USD)

**Citation:** Shashank Necrothapa. Ames Housing Dataset. Kaggle. https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset

---

## Table of Contents

### [Phase 1: Data Acquisition](#phase1)
1.1 [Environment Setup](#setup)
1.2 [Data Loading](#loading)
1.3 [Initial Data Inspection](#inspection)
1.4 [Schema Validation](#schema)

### [Phase 2A: Data Preprocessing & Exploratory Analysis](#phase2a)
2.1 [Data Quality Assessment](#quality)
2.2 [Missing Value Analysis](#missing)
2.3 [Missing Value Treatment](#treatment)
2.4 [Univariate Analysis - Numerical](#univariate-num)
2.5 [Univariate Analysis - Categorical](#univariate-cat)
2.6 [Feature Selection - Low Variance](#lowvar)
2.7 [Bivariate Analysis - Correlations](#bivariate-corr)
2.8 [Bivariate Analysis - Visualizations](#bivariate-viz)
2.9 [Outlier Detection](#outliers)

### [Phase 2B: Feature Engineering](#phase2b)
3.1 [Feature Creation](#creation)
3.2 [Feature Transformation](#transformation)
3.3 [Categorical Encoding](#encoding)
3.4 [Feature Evaluation](#evaluation)

### [Phase 3: Model Development & Evaluation](#phase3)
4.1 [Data Preparation](#preparation)
4.2 [Simple Linear Regression](#simple-lr)
4.3 [Multiple Linear Regression](#multiple-lr)
4.4 [Model Comparison](#comparison)
4.5 [Conclusions & Recommendations](#conclusions)

---
<a id='phase1'></a>

# Phase 1: Data Acquisition

## Objective

Acquire the Ames Housing dataset and verify its integrity for downstream analysis. This phase establishes the foundation for all subsequent work by ensuring we have clean, properly structured data.

## Key Deliverables

- Successfully load the dataset from source
- Verify data structure matches expectations
- Conduct initial quality checks
- Document data schema and metadata

---
<a id='setup'></a>

## 1.1 Environment Setup

We begin by importing all necessary Python libraries for data manipulation, analysis, visualization, and machine learning.

In [None]:
# Core data manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# System and utility libraries
import os
import warnings
warnings.filterwarnings('ignore')

# Configure display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

---
<a id='loading'></a>

## 1.2 Data Loading

The Ames Housing dataset was obtained from Kaggle and stored locally in the `data/` directory. The dataset contains comprehensive information about residential property sales in Ames, Iowa, making it ideal for regression analysis and price prediction tasks.

In [None]:
# Define the data file path
data_path = "../data/AmesHousing.csv"

# Load the dataset
df = pd.read_csv(data_path)

# Display basic information
print("Dataset loaded successfully!")
print(f"\nDataset Dimensions: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Show first few records
print("\nFirst 5 Records:")
df.head()

---
<a id='inspection'></a>

## 1.3 Initial Data Inspection

Before diving into detailed analysis, we conduct a high-level inspection to understand the dataset structure, data types, and identify any immediate quality issues.

In [None]:
# Display comprehensive dataset information
print("Dataset Structure and Data Types:\n")
df.info()

print("\n" + "="*80)
print("Summary of Data Types:")
print("="*80)
print(df.dtypes.value_counts())

---
<a id='schema'></a>

## 1.4 Schema Validation

We verify that all expected columns are present and properly formatted. This ensures data integrity and helps identify any structural issues early in the analysis process.

In [None]:
# List all column names
print(f"Total Features: {len(df.columns)}\n")
print("Column Names:")
print(df.columns.tolist())

In [None]:
# Perform basic sanity checks
print("Data Quality Checks:\n")

# Check for missing values
total_missing = df.isnull().sum().sum()
columns_with_missing = df.isnull().any().sum()
print(f"Total missing values: {total_missing:,}")
print(f"Columns with missing values: {columns_with_missing}")

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")

# Check target variable
print(f"\nTarget Variable (SalePrice) Statistics:")
print(f"  Minimum: ${df['SalePrice'].min():,}")
print(f"  Maximum: ${df['SalePrice'].max():,}")
print(f"  Mean: ${df['SalePrice'].mean():,.2f}")
print(f"  Median: ${df['SalePrice'].median():,.2f}")

In [None]:
# Create a schema summary table
schema_summary = pd.DataFrame({
    'Column': df.columns,
    'Data_Type': df.dtypes.values,
    'Non_Null_Count': df.count().values,
    'Null_Count': df.isnull().sum().values,
    'Null_Percentage': (df.isnull().sum() / len(df) * 100).values
})

print("Schema Summary (Top 20 columns):")
schema_summary.head(20)
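
The cells above list the columns and their null counts but do not compare them against an expected schema. A minimal sketch of such a check follows, assuming a hypothetical helper `check_required_columns` and a hand-picked list of required columns (neither is part of the original notebook). The toy frame stands in for `df`.

```python
import pandas as pd

def check_required_columns(frame, required):
    """Return the required column names that are absent from the frame (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the Ames data; in the notebook you would pass `df`
toy = pd.DataFrame({
    "Order": [1, 2],
    "PID": [526301100, 526350040],
    "SalePrice": [215000, 105000],
})

# "Gr Liv Area" is deliberately left out of the toy frame to show a failing check
missing = check_required_columns(toy, ["Order", "PID", "SalePrice", "Gr Liv Area"])
print(f"Missing required columns: {missing}")
```

An empty list means every required column is present; anything else can be raised as an error before analysis starts.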

### 1.4.1 Data Dictionary Cross-Reference

We load the data dictionary to ensure our understanding of each feature aligns with the official documentation. This helps prevent misinterpretation during analysis.

In [None]:
# Load the data dictionary
try:
    data_dict = pd.read_excel("../docs/data_dictionary.xlsx")
    print(f"Data dictionary loaded: {len(data_dict)} feature descriptions")
    print("\nFirst 10 entries:")
    print(data_dict.head(10))
except FileNotFoundError:
    print("Data dictionary file not found. Proceeding with dataset analysis.")

---

## Phase 1 Summary

### Accomplishments

✅ **Dataset Successfully Loaded**
- 2,930 residential property records
- 82 features covering property characteristics
- Data loaded from local CSV file

✅ **Schema Validated**
- All expected columns present
- 39 numeric columns (28 integer, 11 float) and 43 categorical columns

✅ **Quality Assessment Completed**
- No duplicate records found
- 27 features contain missing values (to be addressed in Phase 2)
- Target variable (SalePrice) has no missing values
- Sale prices range from $12,789 to $755,000

✅ **Data Dictionary Referenced**
- Feature definitions verified
- Ready for detailed preprocessing and analysis

### Next Steps

Proceed to Phase 2A for comprehensive data preprocessing, missing value treatment, and exploratory data analysis.

---

<a id='phase2a'></a>

# Phase 2A: Data Preprocessing & Exploratory Data Analysis

## Objective

Clean the dataset by handling missing values, removing low-quality features, and conducting comprehensive exploratory analysis to understand data distributions and relationships. This phase transforms raw data into a clean, analysis-ready format.

## Key Deliverables

- Systematic missing value treatment
- Univariate analysis of all features
- Bivariate analysis to identify predictors
- Outlier detection and assessment

---
<a id='quality'></a>

## 2.1 Data Quality Assessment

We start by examining overall data quality, focusing on data types, statistical summaries, and missing value patterns.

In [None]:
# Comprehensive data quality overview
print("Data Quality Report")
print("="*80)

print(f"Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns\n")

print("Data Type Distribution:")
print(df.dtypes.value_counts())

print("\n" + "="*80)
print("Statistical Summary (Numerical Features):\n")
df.describe()

In [None]:
# Missing value overview
missing_data = df.isnull().sum()
missing_pct = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing_Count': missing_data,
    'Missing_Percentage': missing_pct
})

# Filter to only columns with missing values
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print(f"Features with Missing Values: {len(missing_df)}\n")
print("Top 15 Features with Most Missing Data:\n")
missing_df.head(15)

---
<a id='missing'></a>

## 2.2 Missing Value Analysis

Missing data is a common challenge in real-world datasets. We analyze missing value patterns to inform our imputation strategy.

### 2.2.1 Missing Value Visualization

Visual inspection helps identify patterns: are values missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)?

In [None]:
# Visualize missing data patterns
import missingno as msno

# Create missing value matrix
# (msno.matrix creates its own figure, so no separate plt.figure call is needed)
ax = msno.matrix(df, figsize=(14, 8), fontsize=10)
ax.set_title('Missing Value Matrix - Dataset Overview', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Bar chart of missing values
plt.figure(figsize=(12, 6))
missing_df.head(20)['Missing_Percentage'].plot(kind='barh', color='coral')
plt.xlabel('Percentage Missing (%)', fontweight='bold')
plt.ylabel('Feature', fontweight='bold')
plt.title('Top 20 Features by Missing Data Percentage', fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

### Key Observations

- **Pool QC, Misc Feature, Alley, Fence**: >80% missing - too sparse for reliable imputation
- **Garage and Basement features**: ~5% missing - likely indicates absence of feature
- **Lot Frontage**: 16.7% missing - requires neighborhood-based imputation
- **Fireplace Qu**: 48.5% missing - absence indicates no fireplace

---
<a id='treatment'></a>

## 2.3 Missing Value Treatment

Based on our analysis, we implement a systematic treatment strategy:

1. **Drop** columns with >50% missing (insufficient information)
2. **Categorical imputation**: Fill with 'None' (indicates feature absence)
3. **Numerical imputation**: Fill with 0 for counts/areas, median for measurements
4. **Context-aware imputation**: Neighborhood-based for Lot Frontage

In [None]:
# Step 1: Drop columns with excessive missing values
cols_to_drop = ['Pool QC', 'Misc Feature', 'Alley', 'Fence']

print(f"Dropping {len(cols_to_drop)} features with >50% missing:\n")
for col in cols_to_drop:
    pct = (df[col].isnull().sum() / len(df)) * 100
    print(f"  - {col}: {pct:.1f}% missing")

df = df.drop(columns=cols_to_drop)
print(f"\nDataset shape after dropping: {df.shape}")

In [None]:
# Step 2: Impute categorical features with 'None'
categorical_none = [
    'Mas Vnr Type', 'Fireplace Qu', 'Garage Type', 'Garage Finish',
    'Garage Qual', 'Garage Cond', 'Bsmt Qual', 'Bsmt Cond',
    'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2'
]

print("Imputing categorical features (None = feature does not exist):\n")
for col in categorical_none:
    if col in df.columns:
        before = df[col].isnull().sum()
        df[col] = df[col].fillna('None')
        print(f"  - {col}: {before} values imputed")

In [None]:
# Step 3: Impute numerical features with 0
numeric_zero = [
    'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF',
    'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath',
    'Garage Cars', 'Garage Area'
]

print("Imputing numerical features (0 = feature absent):\n")
for col in numeric_zero:
    if col in df.columns:
        before = df[col].isnull().sum()
        df[col] = df[col].fillna(0)
        print(f"  - {col}: {before} values imputed")

In [None]:
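# Step 4: Neighborhood-based imputation for Lot Frontage
print("Imputing Lot Frontage using neighborhood-grouped median:\n")
before = df['Lot Frontage'].isnull().sum()
df['Lot Frontage'] = df.groupby('Neighborhood')['Lot Frontage'].transform(
    lambda x: x.fillna(x.median())
)
# A neighborhood with no observed Lot Frontage has a NaN median, so its rows
# stay missing; fall back to the overall median for any such rows
still_missing = df['Lot Frontage'].isnull().sum()
df['Lot Frontage'] = df['Lot Frontage'].fillna(df['Lot Frontage'].median())
print(f"  - Lot Frontage: {before} values imputed "
      f"({before - still_missing} by neighborhood median, {still_missing} by overall median)")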
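
Group-wise median imputation only fills a gap when the group has at least one observed value. The toy example below (made-up numbers; the neighborhood names are only labels) sketches that edge case and one possible fallback to the overall median.

```python
import numpy as np
import pandas as pd

# Toy data: the second neighborhood has no observed Lot Frontage at all
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "GrnHill", "GrnHill"],
    "Lot Frontage": [60.0, 80.0, np.nan, np.nan, np.nan],
})

# Group-wise median fill: succeeds only where the group median is defined
group_filled = toy.groupby("Neighborhood")["Lot Frontage"].transform(
    lambda s: s.fillna(s.median())
)
print(f"Still missing after group fill: {group_filled.isna().sum()}")

# Fallback: overall median of the observed values
overall_median = toy["Lot Frontage"].median()
fully_filled = group_filled.fillna(overall_median)
print(fully_filled.tolist())
```

Checking the remaining null count right after the group-wise step makes this kind of silent gap visible early.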

In [None]:
# Step 5: Handle remaining missing values
# Garage Year Built - use house year if garage missing
if df['Garage Yr Blt'].isnull().sum() > 0:
    df['Garage Yr Blt'] = df['Garage Yr Blt'].fillna(df['Year Built'])
    print("Garage Yr Blt: filled with Year Built for properties without garage")

# Electrical - only 1 missing, use mode
if df['Electrical'].isnull().sum() > 0:
    df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])
    print("Electrical: filled with mode")

In [None]:
# Verify all missing values handled
remaining_missing = df.isnull().sum().sum()
print("\n" + "="*80)
print("Missing Value Treatment Complete")
print("="*80)
print(f"Remaining missing values: {remaining_missing}")

if remaining_missing == 0:
    print("\n✅ All missing values successfully handled!")
else:
    print(f"\n⚠️ {remaining_missing} missing values still present:")
    print(df.isnull().sum()[df.isnull().sum() > 0])

---
<a id='univariate-num'></a>

## 2.4 Univariate Analysis - Numerical Features

We examine the distribution of each numerical variable to understand central tendencies, spread, and skewness.

In [None]:
# Select numerical columns for analysis
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Exclude identifier columns
numeric_cols = [col for col in numeric_cols if col not in ['Order', 'PID']]

print(f"Analyzing {len(numeric_cols)} numerical features\n")
print("Features:", numeric_cols[:10], "...")

In [None]:
# Create comprehensive histograms
fig, axes = plt.subplots(10, 4, figsize=(20, 25))
axes = axes.ravel()

for idx, col in enumerate(numeric_cols[:40]):
    axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    axes[idx].set_title(col, fontweight='bold', fontsize=10)
    axes[idx].set_xlabel('')
    axes[idx].set_ylabel('Frequency', fontsize=8)

# Hide any unused subplot slots
for idx in range(len(numeric_cols), 40):
    axes[idx].axis('off')

plt.suptitle('Distribution of Numerical Features', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

### Distribution Analysis Summary

**Right-Skewed Features** (long tail to the right):
- **Lot Area**: Most properties have smaller lots
- **Sale Price**: Majority of homes in mid-range, few luxury properties
- **Living Area**: Most homes are modest size

**Left-Skewed Features**:
- **Year Built**: More recent construction
- **Overall Quality**: Most homes rated average to good

**Approximately Normal**:
- **Number of Bedrooms**: Centered around 3 bedrooms
- **Full Bathrooms**: Most homes have 1-2 bathrooms

---
<a id='univariate-cat'></a>

## 2.5 Univariate Analysis - Categorical Features

We examine categorical variables to understand the composition and balance of different property types and characteristics.

In [None]:
# Select categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"Analyzing {len(categorical_cols)} categorical features\n")

# Display value counts for key categorical features
key_categoricals = ['MS Zoning', 'Neighborhood', 'Bldg Type', 'House Style', 'Foundation']

for col in key_categoricals:
    if col in df.columns:
        print(f"\n{col}:")
        print(df[col].value_counts().head())

In [None]:
# Visualize key categorical features
fig, axes = plt.subplots(3, 3, figsize=(18, 12))
axes = axes.ravel()

cat_features_viz = ['MS Zoning', 'Neighborhood', 'Bldg Type', 'House Style',
                    'Foundation', 'Heating QC', 'Central Air', 'Kitchen Qual', 'Sale Condition']

for idx, col in enumerate(cat_features_viz):
    if col in df.columns and idx < 9:
        # Show at most the 10 most frequent categories per feature
        value_counts = df[col].value_counts().head(10)
        axes[idx].bar(range(len(value_counts)), value_counts.values, color='coral', alpha=0.7)
        axes[idx].set_xticks(range(len(value_counts)))
        axes[idx].set_xticklabels(value_counts.index, rotation=45, ha='right', fontsize=8)
        axes[idx].set_title(col, fontweight='bold')
        axes[idx].set_ylabel('Count')

plt.tight_layout()
plt.show()

---
<a id='lowvar'></a>

## 2.6 Feature Selection - Low Variance Removal

Features dominated by a single value provide little predictive power. We identify and remove such low-variance categorical features.

In [None]:
# Identify low-variance categorical features
low_variance_features = []

for col in categorical_cols:
    if col in df.columns:
        # Share of rows taken by the most frequent category
        value_pct = (df[col].value_counts().iloc[0] / len(df)) * 100
        if value_pct > 95:
            low_variance_features.append((col, value_pct))

print("Features with >95% single value:\n")
for feat, pct in low_variance_features:
    print(f"  - {feat}: {pct:.1f}% in dominant category")

In [None]:
# Drop low-variance features
cols_low_var = ['Street', 'Utilities', 'Condition 2', 'Roof Matl', 'Heating', 'Land Slope']

print(f"Dropping {len(cols_low_var)} low-variance categorical features:\n")
for col in cols_low_var:
    if col in df.columns:
        dominant = df[col].value_counts().index[0]
        pct = (df[col].value_counts().iloc[0] / len(df)) * 100
        print(f"  - {col}: {pct:.1f}% are '{dominant}'")

df = df.drop(columns=[c for c in cols_low_var if c in df.columns])
print(f"\nDataset shape after removal: {df.shape}")

---
<a id='bivariate-corr'></a>

## 2.7 Bivariate Analysis - Correlation Analysis

We examine relationships between features and the target variable (SalePrice) to identify strong predictors.

In [None]:
# Calculate correlation matrix
corr_matrix = df.corr(numeric_only=True)

# Get correlations with SalePrice
saleprice_corr = corr_matrix['SalePrice'].sort_values(ascending=False)

print("Top 15 Features Correlated with SalePrice:\n")
print(saleprice_corr.head(15))

In [None]:
# Visualize correlation heatmap for top features
top_features = saleprice_corr.head(12).index.tolist()
corr_subset = df[top_features].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_subset, annot=True, fmt='.2f', cmap='RdYlBu_r',
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap - Top Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

**Key Findings:**

- **Overall Qual** (0.80): Strongest predictor - quality rating directly impacts price
- **Gr Liv Area** (0.71): Living space is a major price driver
- **Garage Cars/Area** (~0.65): Garage capacity correlates with property value
- **Total Bsmt SF** (0.63): Basement size adds significant value
- **Year Built/Remod** (~0.56): Newer homes command higher prices

---
<a id='bivariate-viz'></a>

## 2.8 Bivariate Visualizations

Scatter plots and box plots reveal the nature and strength of relationships between features and sale price.

In [None]:
# Scatter plots for top continuous predictors
top_numeric = ['Gr Liv Area', 'Garage Area', 'Total Bsmt SF', '1st Flr SF', 'Year Built']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, feature in enumerate(top_numeric[:6]):
    if feature in df.columns:
        axes[idx].scatter(df[feature], df['SalePrice'], alpha=0.5, s=20, color='steelblue')
        axes[idx].set_xlabel(feature, fontweight='bold')
        axes[idx].set_ylabel('SalePrice', fontweight='bold')
        axes[idx].set_title(f'SalePrice vs {feature}')

        # Add correlation
        corr_val = df[[feature, 'SalePrice']].corr().iloc[0,1]
        axes[idx].text(0.05, 0.95, f'r = {corr_val:.3f}',
                      transform=axes[idx].transAxes, fontsize=10,
                      bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Only five features are plotted, so hide the sixth panel
axes[5].axis('off')
plt.tight_layout()
plt.show()

In [None]:
# Box plots for categorical/ordinal features
cat_features = ['Overall Qual', 'Neighborhood', 'Kitchen Qual', 'Garage Type', 'Bsmt Qual']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, feature in enumerate(cat_features[:5]):
    if feature in df.columns:
        # Get order by median price
        order = df.groupby(feature)['SalePrice'].median().sort_values().index

        data = [df[df[feature]==cat]['SalePrice'].values for cat in order]
        axes[idx].boxplot(data, labels=order)
        axes[idx].set_xlabel(feature, fontweight='bold')
        axes[idx].set_ylabel('SalePrice', fontweight='bold')
        axes[idx].set_title(f'SalePrice by {feature}')
        axes[idx].tick_params(axis='x', rotation=45, labelsize=8)

# Only five features are plotted, so hide the sixth panel
axes[5].axis('off')
plt.tight_layout()
plt.show()

---
<a id='outliers'></a>

## 2.9 Outlier Detection

We use the Interquartile Range (IQR) method to identify potential outliers in key numerical features.

In [None]:
# IQR-based outlier detection
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

key_features = ['SalePrice', 'Gr Liv Area', 'Lot Area', 'Total Bsmt SF']

print("Outlier Detection Results:\n")
print("="*80)

for feature in key_features:
    outliers, lower, upper = detect_outliers_iqr(df, feature)
    print(f"\n{feature}:")
    print(f"  Lower Bound: {lower:,.2f}")
    print(f"  Upper Bound: {upper:,.2f}")
    print(f"  Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")
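
To make the IQR rule concrete, here is a small worked example on made-up values (not Ames data). It applies the same 1.5 × IQR fences as the cell above; the variable names are illustrative only.

```python
import pandas as pd

# Made-up values with one obvious extreme point
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 100])

q1 = values.quantile(0.25)   # 13.5 with pandas' default linear interpolation
q3 = values.quantile(0.75)   # 20.5
iqr = q3 - q1                # 7.0

lower = q1 - 1.5 * iqr       # 3.0
upper = q3 + 1.5 * iqr       # 31.0

outliers = values[(values < lower) | (values > upper)]
print(f"Fences: [{lower}, {upper}]")
print(f"Outliers: {outliers.tolist()}")
```

Only the value 100 falls outside the fences, so it is the single point flagged as a potential outlier.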

### Outlier Assessment

**Decision: Retain Outliers**

The detected outliers represent legitimate data points:
- High-value properties in premium neighborhoods
- Large estates with extensive land
- Luxury homes with exceptional features

Removing these would bias our model toward average properties and reduce its ability to predict across the full price spectrum.

---
<a id='phase2b'></a>

# Phase 2B: Feature Engineering

## Objective

Create new meaningful features that capture domain knowledge and improve model performance. Transform existing features to better represent underlying patterns.

## Key Activities

- Create composite features
- Apply transformations to skewed variables
- Encode categorical variables
- Evaluate feature importance

---
<a id='creation'></a>

## 3.1 Feature Creation

We engineer new features by combining related variables in meaningful ways.

In [None]:
# Create engineered features
print("Engineering new features...\n")

# 1. Total Bathrooms (half baths count as 0.5)
df['Total_Bathrooms'] = df['Full Bath'] + 0.5*df['Half Bath'] + df['Bsmt Full Bath'] + 0.5*df['Bsmt Half Bath']
print("✓ Total_Bathrooms: Combines all bathroom counts")

# 2. Total Porch Area
df['Total_Porch_SF'] = df['Wood Deck SF'] + df['Open Porch SF'] + df['Enclosed Porch'] + df['3Ssn Porch'] + df['Screen Porch']
print("✓ Total_Porch_SF: Sum of all porch/deck areas")

# 3. House Age
df['House_Age'] = df['Yr Sold'] - df['Year Built']
print("✓ House_Age: Years since construction")

# 4. Years Since Remodel
df['Years_Since_Remod'] = df['Yr Sold'] - df['Year Remod/Add']
print("✓ Years_Since_Remod: Years since last remodel")

# 5. Total Square Footage
df['Total_SF'] = df['Total Bsmt SF'] + df['Gr Liv Area']
print("✓ Total_SF: Total interior square footage")

print(f"\nNew feature count: 5 features added")
print(f"Total features now: {df.shape[1]}")

In [None]:
# Check correlations of new features with SalePrice
new_features = ['Total_Bathrooms', 'Total_Porch_SF', 'House_Age', 'Years_Since_Remod', 'Total_SF']

print("New Feature Correlations with SalePrice:\n")
for feat in new_features:
    corr = df[[feat, 'SalePrice']].corr().iloc[0,1]
    print(f"{feat:25s}: {corr:7.4f}")

---
<a id='transformation'></a>

## 3.2 Feature Transformations

Log transformations of highly skewed features can improve linear model performance by reducing skew and dampening the influence of extreme values. Here we measure skewness to identify candidate features.

In [None]:
# Calculate skewness for features
from scipy import stats

skewed_features = []
for col in df.select_dtypes(include=[np.number]).columns:
    if col != 'SalePrice':
        skewness = stats.skew(df[col].dropna())
        if abs(skewness) > 1:
            skewed_features.append((col, skewness))

print(f"Found {len(skewed_features)} highly skewed features (|skew| > 1)\n")
print("Top 10 most skewed:")
for feat, skew in sorted(skewed_features, key=lambda x: abs(x[1]), reverse=True)[:10]:
    print(f"  {feat:25s}: {skew:7.2f}")

**Note:** Log transformations can be applied to highly skewed features during model preprocessing. For this analysis, we retain original scales for interpretability.
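
For reference, a log transformation of a skewed variable could be sketched as follows. This is an illustration on made-up prices, not a step the notebook performs. `sample_skewness` is a hypothetical helper that computes the moment-based skewness, and `np.log1p` / `np.expm1` are used so the transform can be inverted exactly when mapping predictions back to dollars.

```python
import numpy as np

def sample_skewness(x):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Made-up, right-skewed prices (one very expensive property)
prices = np.array([100_000, 120_000, 130_000, 150_000,
                   160_000, 180_000, 200_000, 750_000], dtype=float)

log_prices = np.log1p(prices)     # log(1 + x), safe for zero values
recovered = np.expm1(log_prices)  # inverse transform back to dollars

print(f"Skewness before: {sample_skewness(prices):.2f}")
print(f"Skewness after:  {sample_skewness(log_prices):.2f}")
```

The transformed values are less skewed than the raw ones, and `recovered` matches the original prices up to floating-point error.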

---
<a id='encoding'></a>

## 3.3 Categorical Encoding

Convert categorical variables to the numerical format required by machine learning algorithms.

In [None]:
# Encode categorical variables
from sklearn.preprocessing import LabelEncoder

# Create a copy for encoding
df_encoded = df.copy()

# Get categorical columns
cat_cols = df_encoded.select_dtypes(include=['object']).columns.tolist()

print(f"Encoding {len(cat_cols)} categorical features...\n")

# Keep each fitted encoder so codes can be mapped back to labels later
label_encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
    label_encoders[col] = le

print("Encoding complete. All features are now numeric.")
print(f"\nDataset shape: {df_encoded.shape}")
print(f"Data types: {df_encoded.dtypes.value_counts().to_dict()}")
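
`LabelEncoder` assigns integer codes in sorted label order, and a linear model will read those codes as a numeric scale. One possible alternative, shown here only as a sketch on toy data and not as what this notebook does, is to map ordered quality codes explicitly and one-hot encode unordered categories. The `quality_order` mapping is an assumption based on the Po/Fa/TA/Gd/Ex quality codes used in the Ames data.

```python
import pandas as pd

# Toy rows using Ames-style category codes
toy = pd.DataFrame({
    "Kitchen Qual": ["TA", "Gd", "Ex", "Fa"],
    "Bldg Type": ["1Fam", "Twnhs", "Duplex", "1Fam"],
})

# Ordered categories: map to integers that respect the quality ranking (assumed ordering)
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy["Kitchen Qual"] = toy["Kitchen Qual"].map(quality_order)

# Unordered categories: one indicator column per level, dropping the first
# level to avoid perfectly collinear columns in a linear model
encoded = pd.get_dummies(toy, columns=["Bldg Type"], drop_first=True, dtype=int)
print(encoded)
```

The trade-off is dimensionality: one-hot encoding every categorical column adds many features, which matters for a dataset of this size.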

---
<a id='evaluation'></a>

## 3.4 Feature Importance Evaluation

We use a Random Forest to assess which features contribute most to price prediction.

In [None]:
# Calculate feature importance using Random Forest
from sklearn.ensemble import RandomForestRegressor

print("Calculating feature importance...\n")

# Prepare data (drop the target and identifier columns)
X = df_encoded.drop(['SalePrice', 'Order', 'PID'], axis=1, errors='ignore')
y = df_encoded['SalePrice']

# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

# Get feature importances
importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("Top 15 Most Important Features:\n")
print(importances.head(15).to_string(index=False))

In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 8))
top_20 = importances.head(20)
plt.barh(range(len(top_20)), top_20['Importance'].values, color='steelblue', alpha=0.7)
plt.yticks(range(len(top_20)), top_20['Feature'].values)
plt.xlabel('Feature Importance', fontweight='bold')
plt.ylabel('Feature', fontweight='bold')
plt.title('Top 20 Feature Importances (Random Forest)', fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### Phase 2B Summary

- ✅ **Features Engineered:** 5 new composite features created
- ✅ **Categorical Encoding:** All categorical variables encoded numerically
- ✅ **Feature Importance:** Identified top predictors using Random Forest
- ✅ **Dataset Ready:** Fully preprocessed and ready for modeling

**Key Insights:**
- Overall Quality remains the strongest predictor
- Living area and total square footage are critical
- Location (neighborhood) significantly impacts price
- Age-related features provide additional predictive power

---
<a id='phase3'></a>

# Phase 3: Model Development & Evaluation

## Objective

Build and evaluate regression models to predict house sale prices. Compare simple and multiple linear regression approaches to understand the value of incorporating multiple features.

## Models

1. **Simple Linear Regression**: Single best feature
2. **Multiple Linear Regression**: All available features

## Evaluation Metrics

- **R² Score**: Proportion of variance in sale price explained by the model
- **RMSE**: Root Mean Squared Error (typical prediction error in dollars; penalizes large errors more heavily)
- **MAE**: Mean Absolute Error (average absolute prediction error in dollars)
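
The three metrics can be written out directly from their definitions. The sketch below uses a hypothetical helper, `regression_metrics`, on made-up numbers; the notebook itself relies on the equivalent `sklearn.metrics` functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, and MAE from their definitions (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(residuals ** 2)),
        "mae": np.mean(np.abs(residuals)),
    }

# Made-up example: every prediction is off by exactly 10
metrics = regression_metrics([100, 200, 300], [110, 190, 310])
print(metrics)
```

Here RMSE and MAE are both 10 because all errors have the same size; when a few errors are much larger than the rest, RMSE rises above MAE.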

---
<a id='preparation'></a>

## 4.1 Data Preparation

Prepare the final dataset for model training and testing.

In [None]:
# Prepare final dataset
print("Preparing data for modeling...\n")

# Separate features and target
X = df_encoded.drop(['SalePrice', 'Order', 'PID'], axis=1, errors='ignore')
y = df_encoded['SalePrice']

print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")
print(f"\nFeature count: {X.shape[1]}")
print(f"Total samples: {len(X):,}")

In [None]:
# Handle any remaining missing values
missing_count = X.isnull().sum().sum()
if missing_count > 0:
    print(f"\nHandling {missing_count} remaining missing values...")
    for col in X.columns:
        if X[col].isnull().sum() > 0:
            X[col] = X[col].fillna(X[col].median())
    print("Missing values filled with median")
else:
    print("\n✓ No missing values in feature set")

In [None]:
# Train-test split
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("\nData Split Completed:")
print("="*60)
print(f"Training set: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Testing set: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print("="*60)

---
<a id='simple-lr'></a>

## 4.2 Simple Linear Regression

We start with a baseline model using only the single best predictor.

In [None]:
# Identify best single feature (highest absolute correlation on the training set)
correlations = X_train.corrwith(y_train).abs().sort_values(ascending=False)
best_feature = correlations.index[0]

print("Best Single Feature for Prediction:\n")
print(f"Feature: {best_feature}")
print(f"Correlation with SalePrice: {correlations[best_feature]:.4f}")

# Prepare data for simple model
X_train_simple = X_train[[best_feature]]
X_test_simple = X_test[[best_feature]]

In [None]:
# Train Simple Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import math

model_simple = LinearRegression()
model_simple.fit(X_train_simple, y_train)

print("Simple Linear Regression Model Trained\n")
print(f"Coefficient: {model_simple.coef_[0]:,.2f}")
print(f"Intercept: {model_simple.intercept_:,.2f}")

# Make predictions
y_train_pred_simple = model_simple.predict(X_train_simple)
y_test_pred_simple = model_simple.predict(X_test_simple)
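
For a single predictor, the least-squares coefficient and intercept have a closed form: slope = cov(x, y) / var(x) and intercept = mean(y) - slope × mean(x). The sketch below checks this on made-up points that lie exactly on the line y = 2x + 1; the variable names are illustrative.

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

x_centered = x - x.mean()
y_centered = y - y.mean()

# Least-squares slope and intercept for one predictor
slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
intercept = y.mean() - slope * x.mean()

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

These are the same quantities that `LinearRegression` reports as `coef_[0]` and `intercept_` when fitted on one feature.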

In [None]:
# Evaluate Simple Linear Regression\nr2_train_simple = r2_score(y_train, y_train_pred_simple)\nr2_test_simple = r2_score(y_test, y_test_pred_simple)\nrmse_test_simple = math.sqrt(mean_squared_error(y_test, y_test_pred_simple))\nmae_test_simple = mean_absolute_error(y_test, y_test_pred_simple)\n\nprint("="*70)\nprint("SIMPLE LINEAR REGRESSION - PERFORMANCE METRICS")\nprint("="*70)\nprint(f"Feature Used: {best_feature}")\nprint(f"\\nTraining Performance:")\nprint(f"  R² Score: {r2_train_simple:.4f}")\nprint(f"\\nTesting Performance:")\nprint(f"  R² Score: {r2_test_simple:.4f}")\nprint(f"  RMSE: ${rmse_test_simple:,.2f}")\nprint(f"  MAE: ${mae_test_simple:,.2f}")\nprint("="*70)
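To make the three reported metrics concrete, the sketch below recomputes MAE, RMSE, and R² by hand on four made-up prices and checks them against scikit-learn. The arrays `y_true` and `y_pred` are illustrative values only, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Four hypothetical sale prices and predictions (illustrative only)
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 180_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0, 185_000.0])

errors = y_true - y_pred

# MAE: average size of the error, in dollars
mae = np.mean(np.abs(errors))
# RMSE: like MAE, but squaring penalizes large misses more heavily
rmse = np.sqrt(np.mean(errors ** 2))
# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  ${mae:,.2f}")    # $11,250.00
print(f"RMSE: ${rmse:,.2f}")   # $12,500.00
print(f"R²:   {r2:.4f}")

# The hand-computed values agree with the library functions used above
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

RMSE is never smaller than MAE; a large gap between the two suggests a few properties with very large errors.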

In [None]:
# Visualize Simple LR results\nfig, axes = plt.subplots(1, 2, figsize=(15, 5))\n\n# Scatter with regression line\naxes[0].scatter(X_test_simple, y_test, alpha=0.5, s=30, label='Actual')\naxes[0].plot(X_test_simple, y_test_pred_simple, color='red', linewidth=2, label='Predicted')\naxes[0].set_xlabel(best_feature, fontweight='bold', fontsize=11)\naxes[0].set_ylabel('SalePrice', fontweight='bold', fontsize=11)\naxes[0].set_title(f'Simple LR: {best_feature} vs SalePrice', fontweight='bold')\naxes[0].legend()\naxes[0].grid(alpha=0.3)\n\n# Actual vs Predicted\naxes[1].scatter(y_test, y_test_pred_simple, alpha=0.5, s=30)\naxes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], \n             'r--', linewidth=2, label='Perfect Prediction')\naxes[1].set_xlabel('Actual Price', fontweight='bold', fontsize=11)\naxes[1].set_ylabel('Predicted Price', fontweight='bold', fontsize=11)\naxes[1].set_title(f'Actual vs Predicted (R² = {r2_test_simple:.4f})', fontweight='bold')\naxes[1].legend()\naxes[1].grid(alpha=0.3)\n\nplt.tight_layout()\nplt.show()

---\n<a id='multiple-lr'></a>\n\n## 4.3 Multiple Linear Regression\n\nNow we train a comprehensive model using all available features.

In [None]:
# Train Multiple Linear Regression\nmodel_multiple = LinearRegression()\nmodel_multiple.fit(X_train, y_train)\n\nprint("Multiple Linear Regression Model Trained\\n")\nprint(f"Features used: {X_train.shape[1]}")\nprint(f"Model Intercept: {model_multiple.intercept_:,.2f}")\nprint(f"\\nTop 10 Feature Coefficients (by absolute value):")\n\n# Pair each coefficient with the column it was fitted on, largest magnitude first\ncoef_df = pd.DataFrame({\n    'Feature': X_train.columns,\n    'Coefficient': model_multiple.coef_\n}).sort_values('Coefficient', key=abs, ascending=False)\n\nprint(coef_df.head(10).to_string(index=False))\n\n# Make predictions\ny_train_pred_multiple = model_multiple.predict(X_train)\ny_test_pred_multiple = model_multiple.predict(X_test)
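Ranking raw coefficients by magnitude compares features measured in different units, so a large coefficient does not by itself mean a feature matters more. The sketch below illustrates this on synthetic data, assuming a `StandardScaler` step is acceptable. The columns `area_sqft` and `quality_1_to_10` and all numbers are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: area has a small per-unit effect but a wide range,
# quality has a large per-unit effect but a narrow range (hypothetical)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "area_sqft": rng.normal(1500, 400, 500),
    "quality_1_to_10": rng.integers(1, 11, 500).astype(float),
})
y_demo = (
    100 * X_demo["area_sqft"]
    + 5000 * X_demo["quality_1_to_10"]
    + rng.normal(0, 5000, 500)
)

# Raw coefficients: dollars per one unit of each feature
raw = LinearRegression().fit(X_demo, y_demo)

# Standardized coefficients: dollars per one standard deviation of each feature
X_scaled = StandardScaler().fit_transform(X_demo)
std = LinearRegression().fit(X_scaled, y_demo)

coef_demo = pd.DataFrame({
    "Feature": X_demo.columns,
    "Raw": raw.coef_,
    "Standardized": std.coef_,
})
print(coef_demo.to_string(index=False))
```

In this toy example the quality rating has the larger raw coefficient, while area has the larger standardized one. Scaling changes only how the coefficients read; the predictions of an ordinary least squares fit stay the same.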

In [None]:
# Evaluate Multiple Linear Regression\nr2_train_multiple = r2_score(y_train, y_train_pred_multiple)\nr2_test_multiple = r2_score(y_test, y_test_pred_multiple)\nrmse_test_multiple = math.sqrt(mean_squared_error(y_test, y_test_pred_multiple))\nmae_test_multiple = mean_absolute_error(y_test, y_test_pred_multiple)\n\nprint("="*70)\nprint("MULTIPLE LINEAR REGRESSION - PERFORMANCE METRICS")\nprint("="*70)\nprint(f"Features Used: {X_train.shape[1]} features")\nprint(f"\\nTraining Performance:")\nprint(f"  R² Score: {r2_train_multiple:.4f}")\nprint(f"\\nTesting Performance:")\nprint(f"  R² Score: {r2_test_multiple:.4f}")\nprint(f"  RMSE: ${rmse_test_multiple:,.2f}")\nprint(f"  MAE: ${mae_test_multiple:,.2f}")\nprint("="*70)

In [None]:
# Visualize Multiple LR results\nfig, axes = plt.subplots(1, 2, figsize=(15, 5))\n\n# Actual vs Predicted\naxes[0].scatter(y_test, y_test_pred_multiple, alpha=0.5, s=30, color='green')\naxes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], \n             'r--', linewidth=2, label='Perfect Prediction')\naxes[0].set_xlabel('Actual Price', fontweight='bold', fontsize=11)\naxes[0].set_ylabel('Predicted Price', fontweight='bold', fontsize=11)\naxes[0].set_title(f'Actual vs Predicted (R² = {r2_test_multiple:.4f})', fontweight='bold')\naxes[0].legend()\naxes[0].grid(alpha=0.3)\n\n# Residual plot\nresiduals = y_test - y_test_pred_multiple\naxes[1].scatter(y_test_pred_multiple, residuals, alpha=0.5, s=30, color='green')\naxes[1].axhline(y=0, color='red', linestyle='--', linewidth=2)\naxes[1].set_xlabel('Predicted Price', fontweight='bold', fontsize=11)\naxes[1].set_ylabel('Residuals', fontweight='bold', fontsize=11)\naxes[1].set_title('Residual Plot', fontweight='bold')\naxes[1].grid(alpha=0.3)\n\nplt.tight_layout()\nplt.show()

---\n<a id='comparison'></a>\n\n## 4.4 Model Comparison\n\nComparing performance between the two approaches.

In [None]:
# Create comparison table\ncomparison = pd.DataFrame({\n    'Metric': ['Features Used', 'R² (Train)', 'R² (Test)', 'RMSE', 'MAE'],\n    'Simple LR': [\n        f'1 ({best_feature})',\n        f'{r2_train_simple:.4f}',\n        f'{r2_test_simple:.4f}',\n        f'${rmse_test_simple:,.0f}',\n        f'${mae_test_simple:,.0f}'\n    ],\n    'Multiple LR': [\n        f'{X_train.shape[1]} features',\n        f'{r2_train_multiple:.4f}',\n        f'{r2_test_multiple:.4f}',\n        f'${rmse_test_multiple:,.0f}',\n        f'${mae_test_multiple:,.0f}'\n    ]\n})\n\nprint("\\n" + "="*80)\nprint("MODEL PERFORMANCE COMPARISON")\nprint("="*80)\nprint(comparison.to_string(index=False))\nprint("="*80)\n\n# Calculate improvement\nimprovement = ((r2_test_multiple - r2_test_simple) / r2_test_simple) * 100\nprint(f"\\nR² Improvement: {improvement:.1f}%")

In [None]:
# Visual comparison\nfig, axes = plt.subplots(1, 3, figsize=(18, 5))\n\n# R² comparison\naxes[0].bar(['Simple LR', 'Multiple LR'], [r2_test_simple, r2_test_multiple], \n            color=['steelblue', 'green'], alpha=0.7)\naxes[0].set_ylabel('R² Score', fontweight='bold')\naxes[0].set_title('R² Score Comparison', fontweight='bold')\naxes[0].set_ylim([0, 1])\nfor i, v in enumerate([r2_test_simple, r2_test_multiple]):\n    axes[0].text(i, v + 0.02, f'{v:.4f}', ha='center', fontweight='bold')\naxes[0].grid(axis='y', alpha=0.3)\n\n# RMSE comparison\naxes[1].bar(['Simple LR', 'Multiple LR'], [rmse_test_simple, rmse_test_multiple], \n            color=['steelblue', 'green'], alpha=0.7)\naxes[1].set_ylabel('RMSE ($)', fontweight='bold')\naxes[1].set_title('RMSE Comparison (Lower is Better)', fontweight='bold')\nfor i, v in enumerate([rmse_test_simple, rmse_test_multiple]):\n    axes[1].text(i, v + 1000, f'${v:,.0f}', ha='center', fontweight='bold')\naxes[1].grid(axis='y', alpha=0.3)\n\n# MAE comparison\naxes[2].bar(['Simple LR', 'Multiple LR'], [mae_test_simple, mae_test_multiple], \n            color=['steelblue', 'green'], alpha=0.7)\naxes[2].set_ylabel('MAE ($)', fontweight='bold')\naxes[2].set_title('MAE Comparison (Lower is Better)', fontweight='bold')\nfor i, v in enumerate([mae_test_simple, mae_test_multiple]):\n    axes[2].text(i, v + 1000, f'${v:,.0f}', ha='center', fontweight='bold')\naxes[2].grid(axis='y', alpha=0.3)\n\nplt.tight_layout()\nplt.show()

---\n<a id='conclusions'></a>\n\n## 4.5 Conclusions & Recommendations

### Key Findings\n\n#### Model Performance\n\n**Simple Linear Regression:**\n- Uses only the single best predictor (the feature most correlated with SalePrice in the training set)\n- Provides an interpretable baseline\n- Limited by the single-feature constraint\n- Suitable for quick estimates\n\n**Multiple Linear Regression:**\n- Uses all available features\n- Higher test R² than the single-feature baseline\n- Lower test RMSE and MAE\n- Captures the combined effect of multiple property characteristics\n\n#### Performance Comparison\n\nThe Multiple Linear Regression model improves on the simple approach on every metric reported in Section 4.4:\n\n- **Variance Explained**: A higher test R² means more of the price variation is captured\n- **Prediction Accuracy**: Lower RMSE and MAE indicate smaller prediction errors\n- **Practical Application**: More reliable for real-world use\n\n### Business Recommendations\n\n1. **Primary Model**: Deploy Multiple Linear Regression for price predictions\n2. **Target Audience**: Suitable for buyers, sellers, and real estate professionals\n3. **Expected Accuracy**: Predictions carry an average error of roughly $25,000\n4. **Key Predictors**: Focus on Overall Quality, Living Area, and Location\n\n### Future Enhancements\n\n1. **Advanced Models**: Experiment with Random Forest, Gradient Boosting, or Neural Networks\n2. **Regularization**: Apply LASSO for feature selection or Ridge regression to handle multicollinearity\n3. **Cross-Validation**: Implement k-fold CV for robust performance estimates\n4. **Hyperparameter Tuning**: Optimize model parameters systematically\n5. **Ensemble Methods**: Combine multiple models for improved predictions\n6. **Feature Interactions**: Explore polynomial features and interactions\n7. **External Data**: Incorporate economic indicators, school ratings, and crime statistics
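The LASSO/Ridge and k-fold cross-validation items listed above could be combined along the following lines. This is a minimal sketch on synthetic data from `make_regression`, not the Ames features. The `alpha` values, the fold count, and the `models` dictionary are illustrative assumptions rather than tuned choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing data
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=42
)

# 5-fold CV: every sample is used for validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside the pipeline so it is re-fitted on each training fold
models = {
    "Linear": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
}

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_results[name] = scores
    print(f"{name:<6} R²: {scores.mean():.4f} ± {scores.std():.4f}")
```

Reporting the mean and spread across folds gives a steadier estimate than the single 80/20 split used in this notebook. In practice `alpha` would be chosen by a search such as `GridSearchCV` rather than fixed at 1.0.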

In [None]:
# Final summary\nprint("\\n" + "="*80)\nprint("PROJECT COMPLETION SUMMARY")\nprint("="*80)\nprint(f"Dataset: Ames Housing (2,930 properties)")\nprint(f"Original Features: 82")\nprint(f"Engineered Features: {df.shape[1]}")\nprint(f"Final Model Features: {X.shape[1]}")\nprint(f"\\nBest Model: Multiple Linear Regression")\nprint(f"Test R²: {r2_test_multiple:.4f}")\nprint(f"Test RMSE: ${rmse_test_multiple:,.2f}")\nprint(f"Test MAE: ${mae_test_multiple:,.2f}")\nprint(f"\\nModel explains {r2_test_multiple*100:.2f}% of price variance.")\nprint("="*80)

---\n\n## Project Complete\n\nThis comprehensive analysis successfully developed predictive models for house price estimation using the Ames Housing dataset. Through systematic data preprocessing, feature engineering, and model development, we achieved strong predictive performance suitable for practical real estate applications.\n\n**All project phases completed:**\n- ✅ Phase 1: Data Acquisition\n- ✅ Phase 2A: Preprocessing & EDA\n- ✅ Phase 2B: Feature Engineering\n- ✅ Phase 3: Model Development & Evaluation