# Phase 3: Modeling & Inferencing

**Team**: The Outliers  
**Course**: Advanced Apex Project 1 - BITS Pilani Digital  
**Phase**: 3 (Model Construction & Evaluation)

---

## 🎯 What are we doing in Phase 3?

We are building **regression models** to predict house prices (`SalePrice`) using the features we engineered in Phase 2.

## 🤔 Why are we doing this?

- To **predict house prices accurately** for properties in Ames, Iowa
- To **understand which features** most influence property values
- To **compare different modeling approaches** (Simple vs Multiple Linear Regression)
- To **measure model performance** using statistical metrics

## 📊 Models to Build (as per Phase 3 requirements):

1. **Simple Linear Regression**: Uses ONE best feature to predict price
2. **Multiple Linear Regression**: Uses ALL features to predict price
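
To make the difference between the two models concrete, here is a minimal sketch on synthetic data (not the Ames dataset). The feature names `quality` and `area`, the coefficients, and the noise level are all made-up assumptions for illustration; only the scikit-learn calls mirror what we use later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for housing data (assumption: NOT the Ames dataset)
rng = np.random.default_rng(42)
n = 500
quality = rng.integers(1, 11, size=n).astype(float)  # hypothetical 1-10 quality score
area = rng.normal(1500, 400, size=n)                 # hypothetical living area
price = 20000 * quality + 60 * area + rng.normal(0, 15000, size=n)

X = np.column_stack([quality, area])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Simple linear regression: fit on ONE feature (column 0 only)
simple = LinearRegression().fit(X_train[:, [0]], y_train)

# Multiple linear regression: fit on ALL features
multiple = LinearRegression().fit(X_train, y_train)

# .score() returns R² on the data passed in (here, the held-out test set)
r2_simple = simple.score(X_test[:, [0]], y_test)
r2_multiple = multiple.score(X_test, y_test)
print(f"Simple LR   R²: {r2_simple:.3f}")
print(f"Multiple LR R²: {r2_multiple:.3f}")
```

In this toy setup the second feature carries real signal, so the multiple model should score higher on the test set. On real data the gain depends on how much independent information the extra features add.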

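## 📏 Evaluation Metrics:

- **R² (R-squared)**: Share of the variance in price that the model explains (at most 1, higher is better; it can fall below 0 on held-out data if the model does worse than always predicting the mean price)
- **RMSE (Root Mean Squared Error)**: Square root of the mean squared prediction error, in dollars; penalizes large errors more heavily than MAE (lower is better)
- **MAE (Mean Absolute Error)**: Average absolute prediction error in dollars (lower is better)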
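
As a quick illustration of how these three metrics are computed, the sketch below uses four made-up prices and predictions (the numbers are assumptions chosen for easy arithmetic, not results from our models). RMSE is taken as the square root of `mean_squared_error`, which works across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices in dollars (illustrative only)
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 185000.0])

# Errors are -10000, 10000, 20000, -5000
mae = mean_absolute_error(y_true, y_pred)            # mean of |errors|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # sqrt of mean of errors²
r2 = r2_score(y_true, y_pred)                        # 1 - SS_res / SS_tot

print(f"MAE:  ${mae:,.0f}")    # $11,250
print(f"RMSE: ${rmse:,.0f}")   # $12,500
print(f"R²:   {r2:.3f}")       # 0.963
```

RMSE comes out larger than MAE here because the single $20,000 miss is squared before averaging, which is why RMSE is the more sensitive of the two to large errors.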

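## 🔮 Expected Results:

From Phase 2, we know `Overall Qual` has a correlation of 0.80 with `SalePrice`. For a simple linear regression, R² on the fitting data equals the squared correlation between the feature and the target, so:
- Simple LR should achieve R² ≈ 0.64 (0.80² = 0.64)
- Multiple LR should achieve R² > 0.70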
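
The link between correlation and the simple-regression R² can be checked numerically. The sketch below uses synthetic data (the slope and noise scale are arbitrary assumptions) and compares the squared Pearson correlation with the in-sample R² of a one-feature `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic one-feature data (assumption: arbitrary slope and noise)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=2.0, size=1000)

# Pearson correlation between feature and target
r = np.corrcoef(x, y)[0, 1]

# In-sample R² of a simple linear regression (with intercept)
model = LinearRegression().fit(x.reshape(-1, 1), y)
r2_in_sample = model.score(x.reshape(-1, 1), y)

print(f"r = {r:.3f}, r² = {r**2:.3f}, in-sample R² = {r2_in_sample:.3f}")
```

The two values match up to floating-point error. The identity holds on the data used for fitting; on a held-out test set, R² will usually differ somewhat from the squared training correlation.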

---

## Step 1: Setup & Import Libraries

### 🎯 What are we doing?
Importing all necessary Python libraries for data manipulation, machine learning, and visualization.

### 🤔 Why these libraries?
- **pandas**: Load and work with datasets
- **numpy**: Mathematical operations
- **scikit-learn**: Build and evaluate machine learning models
- **matplotlib/seaborn**: Create visualizations

### 📊 Expected Result:
All libraries should import successfully with no errors.

---

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Settings
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)

# Visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ Libraries imported successfully!")
print(f"📦 Pandas version: {pd.__version__}")
print(f"📦 NumPy version: {np.__version__}")

✅ Libraries imported successfully!
📦 Pandas version: 2.3.3
📦 NumPy version: 2.3.4


---

## Step 2: Load Engineered Dataset

### 🎯 What are we doing?
Loading the **engineered dataset** from Phase 2C (`AmesHousing_engineered.csv`).

### 🤔 Why this specific dataset?
This dataset contains:
- ✅ All missing values handled (imputed)
- ✅ All categorical features encoded (converted to numbers)
- ✅ New engineered features (Total_SF, Total_Bathrooms, etc.)
- ✅ Log-transformed skewed features for better modeling
- ✅ Multicollinearity reduced (highly correlated features removed)

### 📊 Expected Result:
- Dataset shape: **(2,930 rows × 71 columns)**
- Target variable: `SalePrice` (house prices in dollars)
- All features should be numeric (ready for machine learning)
- Minimal to no missing values

---

In [None]:
# Load the engineered dataset from Phase 2C
df = pd.read_csv("../data/AmesHousing_engineered.csv")

print("✅ Dataset loaded successfully!")
print(f"\n📊 Dataset Shape: {df.shape}")
print(f"   - Total Records (Houses): {df.shape[0]:,}")
print(f"   - Total Features (Columns): {df.shape[1]}")

# Check for missing values
missing_count = df.isnull().sum().sum()
print(f"\n❓ Missing Values: {missing_count}")
if missing_count > 0:
    print(f"   ⚠️ Warning: Found {missing_count} missing values - need to handle these")
else:
    print("   ✅ No missing values - dataset is complete!")

# Check data types
print("\n📋 Data Types:")
print(df.dtypes.value_counts())

# Display first few rows
print("\n📋 First 3 rows of the dataset:")
print(df.head(3))
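
Beyond eyeballing the printed summary, the expectations listed above (target present, no missing values, all columns numeric) can be checked programmatically. The helper below, `check_model_ready`, is a hypothetical name introduced for this sketch, and the tiny `demo` frame uses made-up values (including a made-up `Neighborhood` text column) so the example runs without the CSV.

```python
import pandas as pd


def check_model_ready(df: pd.DataFrame, target: str = "SalePrice") -> dict:
    """Summarize whether a frame looks ready for scikit-learn (hypothetical helper)."""
    # Columns that are neither numeric nor boolean would break LinearRegression
    non_numeric = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    return {
        "has_target": target in df.columns,
        "n_missing": int(df.isnull().sum().sum()),
        "non_numeric_columns": non_numeric,
    }


# Tiny stand-in frame with made-up values (NOT rows from the engineered dataset)
demo = pd.DataFrame({
    "SalePrice": [215000, 105000, 172000],
    "Total_SF": [2736.0, 1778.0, None],               # one deliberately missing value
    "Neighborhood": ["NAmes", "NAmes", "Gilbert"],    # deliberately left unencoded
})

report = check_model_ready(demo)
print(report)
```

Calling `check_model_ready(df)` on the loaded dataset should, if Phase 2 went as described, report the target as present, zero missing values, and an empty list of non-numeric columns.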