# Ames Housing Price Prediction
## Advanced Apex Project - Real Estate Price Modeling

A comprehensive machine learning approach to predicting residential property sale prices using multiple regression techniques and extensive feature engineering.

---

### Project Information

**Team:** The Outliers

**Course:** Advanced Apex Project 1

**Institution:** BITS Pilani - Digital Campus

**Academic Term:** First Trimester 2025-26

**Project Supervisor:** Bharathi Dasari

**Submission Date:** November 2024

### Team Members

| Student Name | BITS ID |
|--------------|----------|
| Anik Das | 2025EM1100026 |
| Adeetya Wadikar | 2025EM1100384 |
| Tushar Nishane | 2025EM1100306 |

---

## Executive Summary

### Problem Statement

Accurate real estate valuation is essential for buyers, sellers, and financial institutions. Traditional valuation methods can be subjective and time-consuming. This project develops machine learning models to predict house sale prices objectively based on property characteristics.

### Business Objective

Develop a predictive regression model that estimates residential property sale prices with high accuracy. The model should help stakeholders:
- **Buyers**: Assess fair market value before purchase
- **Sellers**: Set competitive listing prices
- **Investors**: Identify undervalued properties
- **Lenders**: Support loan underwriting decisions

### Dataset

**Name:** Ames Housing Dataset

**Source:** Kaggle (https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset)

**Size:** 2,930 residential property sales transactions

**Features:** 82 variables describing:
- Physical characteristics (size, rooms, age)
- Quality ratings (construction, condition)
- Location attributes (neighborhood, zoning)
- Amenities (garage, basement, fireplace, pool)

**Target Variable:** SalePrice (in USD)

**Time Period:** Properties sold in Ames, Iowa, from 2006 to 2010

---

## Table of Contents

### [Phase 1: Data Acquisition](#phase1)
- 1.1 [Environment Setup](#setup)
- 1.2 [Data Loading](#loading)
- 1.3 [Initial Data Inspection](#inspection)
- 1.4 [Schema Validation](#schema)
- 1.5 [Data Quality Assessment](#quality)

### [Phase 2A: Data Preprocessing & Exploratory Analysis](#phase2a)
- 2.1 [Missing Value Analysis](#missing)
- 2.2 [Missing Value Treatment](#treatment)
- 2.3 [Univariate Analysis - Numerical](#univariate-num)
- 2.4 [Univariate Analysis - Categorical](#univariate-cat)
- 2.5 [Low-Variance Feature Removal](#lowvar)
- 2.6 [Bivariate Analysis - Correlations](#bivariate-corr)
- 2.7 [Bivariate Analysis - Visualizations](#bivariate-viz)
- 2.8 [Outlier Detection](#outliers)

### [Phase 2B: Feature Engineering](#phase2b)
- 3.1 [Feature Creation](#creation)
- 3.2 [Feature Transformation](#transformation)
- 3.3 [Categorical Encoding](#encoding)
- 3.4 [Feature Importance](#importance)

### [Phase 3: Model Development & Evaluation](#phase3)
- 4.1 [Data Preparation](#preparation)
- 4.2 [Simple Linear Regression](#simple-lr)
- 4.3 [Multiple Linear Regression](#multiple-lr)
- 4.4 [Model Comparison](#comparison)
- 4.5 [Conclusions & Recommendations](#conclusions)

---
<a id='phase1'></a>

# Phase 1: Data Acquisition

## Objective

Acquire the Ames Housing dataset and perform initial validation to ensure data integrity. This phase establishes how complete and reliable the data is before any analysis begins.

## Deliverables

- Successfully load dataset from CSV file
- Verify data structure and schema
- Conduct initial quality checks
- Document data characteristics and potential issues
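
The deliverables above can be sketched as a small validation helper. This is a minimal illustration, not the notebook's actual implementation: the function name `validate_housing_frame` and the in-memory sample are hypothetical, and only the `SalePrice` column name comes from the dataset description.

```python
import pandas as pd


def validate_housing_frame(df, expected_shape=None, target="SalePrice"):
    """Summarize basic structure and quality checks (hypothetical helper)."""
    missing = df.isna().sum()
    report = {
        "shape": df.shape,
        # Skip the shape check when no expectation is supplied
        "shape_ok": expected_shape is None or df.shape == expected_shape,
        "has_target": target in df.columns,
        "duplicate_rows": int(df.duplicated().sum()),
        # Keep only the columns that actually contain missing values
        "missing_by_column": missing[missing > 0].to_dict(),
    }
    # A usable price column must be numeric and strictly positive
    if report["has_target"]:
        prices = df[target]
        report["target_valid"] = bool(
            pd.api.types.is_numeric_dtype(prices) and (prices.dropna() > 0).all()
        )
    else:
        report["target_valid"] = False
    return report


# Tiny illustrative sample with made-up values (not real Ames records)
sample = pd.DataFrame(
    {
        "Lot Area": [8450, 9600, None],
        "Neighborhood": ["NAmes", "Gilbert", "NAmes"],
        "SalePrice": [208500, 181500, 223500],
    }
)
report = validate_housing_frame(sample, expected_shape=(3, 3))
print(report)
```

For the full dataset, the same helper could be called with `expected_shape=(2930, 82)`, matching the size stated in the dataset description.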

---
<a id='setup'></a>

## 1.1 Environment Setup

This cell imports the Python libraries used throughout the notebook for data manipulation, statistical analysis, visualization, and machine learning. It then sets pandas display options and plotting defaults so that output is formatted the same way wherever the notebook runs.

In [None]:
# Standard library
import os
import warnings

# Core data manipulation libraries
import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib
import matplotlib.pyplot as plt
import missingno as msno
import seaborn as sns

# Statistical libraries
from scipy import stats

# Machine learning libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Suppress all warnings to keep notebook output readable.
# Note: this also hides deprecation and convergence warnings.
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.width', 1000)

# Set visualization defaults
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Print confirmation with library versions
print("✓ All libraries imported successfully")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ NumPy version: {np.__version__}")
print(f"✓ Matplotlib version: {matplotlib.__version__}")
print("\nEnvironment configured and ready for analysis.")
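
Consistent behavior across environments also depends on fixed random seeds and known library versions. The following is a minimal sketch of that idea. The `RANDOM_STATE` constant and the `environment_report` and `seed_everything` helpers are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
import platform
import random

import numpy as np
import pandas as pd

RANDOM_STATE = 42  # hypothetical seed value; any fixed integer works


def environment_report():
    """Collect interpreter and library versions for the run log."""
    return {
        "python": platform.python_version(),
        "pandas": pd.__version__,
        "numpy": np.__version__,
    }


def seed_everything(seed=RANDOM_STATE):
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding before each draw should reproduce the same random numbers
seed_everything()
first_draw = np.random.rand(3)
seed_everything()
second_draw = np.random.rand(3)

print(environment_report())
print(np.allclose(first_draw, second_draw))
```

The same constant could be passed as `random_state` to `train_test_split` and `RandomForestRegressor` so that data splits and model fits repeat across runs.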