# 🏠 HouseWorth: Predicting Ames Housing Sale Prices with Machine Learning

Welcome to **HouseWorth**, an end-to-end machine learning project that predicts house sale prices in Ames, Iowa.  
The project covers data cleaning, exploratory analysis, three regression models (Linear Regression, Random Forest, XGBoost), and model interpretation through XGBoost feature importances.

---

## 🎯 Project Objective

Our goal is to develop a predictive model for real estate pricing, using detailed housing features such as area, quality, neighborhood, and amenities.

---

## 📂 Dataset Description

The dataset includes 80+ variables describing different aspects of residential homes.  
Key attributes include:

- **Gr Liv Area** (above-ground living area, in square feet)
- **Garage Cars** (garage size, in car capacity)
- **Total Bsmt SF** (total basement area, in square feet)
- **Neighborhood** (location within Ames)
- **Overall Qual** (overall material and finish quality)

🔗 Dataset Source: *AmesHousing.csv*

---

## 📚 Table of Contents

1. [Import Libraries & Load Dataset](#import)
2. [Data Overview & Basic Cleaning](#overview)
3. [Exploratory Data Analysis (EDA)](#eda)
4. [Feature Engineering](#feature-engineering)
5. [Modeling (Linear Regression)](#lr-modeling)
6. [Modeling (Random Forest)](#rf-modeling)
7. [Modeling (XGBoost)](#xgb-modeling)
8. [Model Comparison & Evaluation](#comparison)
9. [Conclusions & Business Recommendations](#conclusion)


## 📦 Import Libraries & Load Dataset <a class="anchor" id="import"></a>

Let's start by importing essential libraries and loading the Ames Housing dataset.

We will:
- Load the dataset into a pandas DataFrame
- Preview the first few rows to understand its structure

In [None]:

# 📚 Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

# List the files available under the read-only Kaggle input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



# 📂 Load the Dataset
df = pd.read_csv('/kaggle/input/ames-housing-dataset/AmesHousing.csv')

# Preview the first 5 rows
df.head()


## 🔍 Data Overview & Basic Cleaning <a class="anchor" id="overview"></a>

Before diving into exploratory data analysis (EDA), let's understand the overall structure of the dataset.

In this section, we will:
- Check the dataset shape
- Review column data types
- Identify missing values
- Detect anomalies or inconsistencies


In [None]:
# 📏 Dataset Shape
print(f"The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")

# 🧾 Data Types and Non-Null Counts
df.info()

# 🔍 Checking Missing Values
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
missing_values
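
Raw counts are easier to act on when shown next to the share of rows they represent. The sketch below is one way to build such a summary. The `missing_summary` helper and the toy values are assumptions made for illustration; they are not part of the original notebook or the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are illustrative only)
toy = pd.DataFrame({
    "Lot Frontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Gr Liv Area": [1710, 1262, 1786, 1717],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one gap, worst offenders first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the full dataset would give the same table for every column that has gaps.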


## 🛠️ Handling Missing Values <a class="anchor" id="missing"></a>

Now that we have identified missing values in the dataset, we need to decide how to handle them.

In this section:
- Drop columns with excessive missing data
- Fill missing values based on feature types (categorical vs numerical)
- Maintain data integrity for modeling


In [None]:
# 📉 Dropping columns with too many missing values
threshold = 0.4  # If more than 40% of a column's values are missing, drop that column
missing_ratio = df.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > threshold].index
df.drop(columns=cols_to_drop, inplace=True)

print(f"Dropped columns with more than 40% missing values: {list(cols_to_drop)}")

# 🛠️ Filling missing numerical features with median
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in num_cols:
    if df[col].isnull().sum() > 0:
        median_value = df[col].median()
        # Assign the result back: fillna(inplace=True) on a column selection
        # is deprecated and does not update df under pandas copy-on-write
        df[col] = df[col].fillna(median_value)

# 🛠️ Filling missing categorical features with mode
cat_cols = df.select_dtypes(include=['object']).columns
for col in cat_cols:
    if df[col].isnull().sum() > 0:
        mode_value = df[col].mode()[0]
        df[col] = df[col].fillna(mode_value)
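
The three steps above (drop sparse columns, fill numeric gaps with the median, fill categorical gaps with the mode) can also be wrapped in a single function that returns a new DataFrame. This is a minimal sketch on made-up data. The `drop_and_impute` name and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def drop_and_impute(frame: pd.DataFrame, threshold: float = 0.4) -> pd.DataFrame:
    """Hypothetical helper: drop sparse columns, then fill remaining gaps."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    missing_ratio = out.isnull().mean()
    out = out.drop(columns=missing_ratio[missing_ratio > threshold].index)

    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())   # numeric: median
        else:
            out[col] = out[col].fillna(out[col].mode()[0])  # categorical: mode
    return out

toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave", np.nan],               # 80% missing
    "Lot Frontage": [60.0, np.nan, 80.0, 70.0, 90.0],                # one numeric gap
    "Garage Type": ["Attchd", "Detchd", None, "Attchd", "Attchd"],   # one categorical gap
})
clean = drop_and_impute(toy)
print(clean)
```

Returning a copy keeps the raw data available, which makes it easier to compare imputation strategies later.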


## 📊 Exploratory Data Analysis (EDA) <a class="anchor" id="eda"></a>

Now that our dataset is clean, it's time to explore the data visually.

In this section:
- Analyze the distribution of the target variable (`SalePrice`)
- Identify potential skewness or outliers
- Explore feature relationships with the target variable

In [None]:
# 📈 Distribution Plot of SalePrice
plt.figure(figsize=(8,5))
sns.histplot(df['SalePrice'], kde=True, color='skyblue')
plt.title("Distribution of SalePrice")
plt.xlabel("Sale Price")
plt.ylabel("Frequency")
plt.show()

### 📐 Skewness and Kurtosis Analysis

Let's statistically evaluate the distribution of `SalePrice`.

- **Skewness** measures the asymmetry of the distribution.
- **Kurtosis** measures the heaviness of the distribution tails compared to a normal distribution. `scipy.stats.kurtosis` returns *excess* kurtosis by default, so a normal distribution scores 0.


In [None]:
from scipy.stats import skew, kurtosis

# Skewness and Kurtosis calculation
saleprice_skew = skew(df['SalePrice'])
saleprice_kurtosis = kurtosis(df['SalePrice'])

print(f"Skewness of SalePrice: {saleprice_skew:.2f}")
print(f"Kurtosis of SalePrice: {saleprice_kurtosis:.2f}")
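
To make the two statistics less of a black box, here is a small sketch that computes them from their moment definitions with NumPy only. The function names and sample values are assumptions for illustration. The formulas are the biased moment estimators, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` compute with their default arguments.

```python
import numpy as np

def sample_skewness(x):
    """Third central moment divided by variance**1.5 (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5

def excess_kurtosis(x):
    """Fourth central moment divided by variance**2, minus 3 (normal -> 0)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return m4 / m2 ** 2 - 3.0

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]       # balanced around the mean
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]   # one large value pulls the tail right

print(sample_skewness(symmetric), sample_skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A positive skewness, as in the second sample, is the pattern that usually motivates a log transformation of a price variable.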


### 🔥 Log Transformation (If Needed)

If the target variable `SalePrice` shows significant skewness, applying a log transformation can help to normalize the distribution.

We will use `np.log1p`, which applies `log(1+x)` transformation to avoid issues with zero values.


In [None]:
# 📈 Applying log transformation
df['SalePrice_Log'] = np.log1p(df['SalePrice'])

# 📈 Visualizing the transformed SalePrice
plt.figure(figsize=(8,5))
sns.histplot(df['SalePrice_Log'], kde=True, color='lightcoral')
plt.title("Distribution of SalePrice after Log Transformation")
plt.xlabel("Log of Sale Price")
plt.ylabel("Frequency")
plt.show()
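
Because the models are trained on `SalePrice_Log`, predictions have to be mapped back to dollars with the inverse function, `np.expm1`. The sketch below checks the round trip on a few made-up prices (the values are illustrative assumptions, not rows from the dataset).

```python
import numpy as np

prices = np.array([0.0, 105_000.0, 172_000.0, 755_000.0])  # illustrative values

log_prices = np.log1p(prices)     # log(1 + x), defined even when x == 0
recovered = np.expm1(log_prices)  # exp(x) - 1, the exact inverse of log1p

print(log_prices.round(3))
print(np.allclose(recovered, prices))
```

The transform is monotonic, so the ordering of houses by price is preserved while the long right tail is compressed.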


### 📈 Outlier Detection with Boxplot

After applying log transformation, we visualize the `SalePrice_Log` variable using a boxplot.

Boxplots are effective for detecting outliers, as points lying beyond the whiskers represent potential anomalies.


In [None]:
# 📈 Boxplot for SalePrice_Log
plt.figure(figsize=(8,5))
sns.boxplot(x=df['SalePrice_Log'], color='orchid')
plt.title("Boxplot of SalePrice after Log Transformation")
plt.xlabel("Log of Sale Price")
plt.show()
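
The points a boxplot draws beyond the whiskers follow the 1.5 × IQR rule, which is seaborn's default. The sketch below applies the same rule numerically so the flagged rows can be inspected. The `iqr_outliers` helper and the sample values are assumptions for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: return entries outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Illustrative log prices with one unusually expensive house
log_prices = pd.Series([11.8, 11.9, 12.0, 12.0, 12.1, 12.2, 13.5])
flagged = iqr_outliers(log_prices)
print(flagged)
```

On the real data this would be called as `iqr_outliers(df['SalePrice_Log'])`. Whether to drop, cap, or keep the flagged rows remains a modeling decision.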


### 📈 Feature Relationships with SalePrice

Now we analyze how selected numerical and categorical features relate to the log-transformed target, `SalePrice_Log`.

Understanding these relationships helps guide feature selection and engineering for modeling.

In [None]:
# 📋 List of important numeric features
important_numeric = ['Overall Qual', 'Gr Liv Area', 'Garage Cars', 'Total Bsmt SF', '1st Flr SF']


# 📈 Scatterplots
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(18,10))
axs = axs.flatten()

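for i, feature in enumerate(important_numeric):
    sns.scatterplot(x=df[feature], y=df['SalePrice_Log'], ax=axs[i], color='royalblue')
    axs[i].set_title(f"{feature} vs SalePrice_Log")

# Hide the unused panel(s) of the 2x3 grid (only five features are plotted)
for ax in axs[len(important_numeric):]:
    ax.set_visible(False)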

plt.tight_layout()
plt.show()

### 📋 Categorical Features vs SalePrice

Let's analyze how key categorical features such as `Neighborhood` and `House Style` relate to the log-transformed target, `SalePrice_Log`.

Boxplots help visualize price distributions across different categories.

In [None]:
# 📋 List of important categorical features
important_categorical = ['Neighborhood', 'House Style']

# 📈 Boxplots for categorical features
for feature in important_categorical:
    plt.figure(figsize=(14,6))
    sns.boxplot(x=df[feature], y=df['SalePrice_Log'], palette='Set3')
    plt.xticks(rotation=45)
    plt.title(f"{feature} vs SalePrice_Log")
    plt.xlabel(feature)
    plt.ylabel("Log of Sale Price")
    plt.show()

### 🔥 Correlation Matrix and Heatmap

Understanding feature correlations is crucial to identifying multicollinearity and selecting impactful features.

A heatmap helps visualize the strength of relationships between numerical variables.


In [None]:
# 📋 Only Numeric Columns for Correlation
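# Exclude the raw SalePrice so it does not take a slot among the top features
numeric_df = df.select_dtypes(include=['int64', 'float64']).drop(columns=['SalePrice'])

# 📈 Correlation Matrix
corr_matrix = numeric_df.corr()

# 📋 Top 10 features most correlated with SalePrice_Log
# (head(11) because SalePrice_Log itself always ranks first)
top_corr_features = corr_matrix['SalePrice_Log'].abs().sort_values(ascending=False).head(11).index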

# 📈 Focused Heatmap
plt.figure(figsize=(12,8))
sns.heatmap(numeric_df[top_corr_features].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Top Correlated Features with SalePrice_Log")
plt.show()
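
When ranking features by correlation with the target, the target itself and any column derived from it (such as the raw `SalePrice`) should be left out of the ranking. The sketch below shows one way to do that on synthetic data. The `top_correlated` helper, the coefficients, and the generated values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
quality = rng.integers(1, 11, n).astype(float)
log_price = 11 + 0.0005 * area + 0.1 * quality + rng.normal(0, 0.05, n)

toy = pd.DataFrame({
    "Gr Liv Area": area,
    "Overall Qual": quality,
    "Noise": rng.normal(0, 1, n),      # unrelated to the target by construction
    "SalePrice": np.expm1(log_price),  # raw target, must not count as a feature
    "SalePrice_Log": log_price,
})

def top_correlated(frame, target, k, exclude=()):
    """Hypothetical helper: k features with the largest |correlation| to target."""
    corr = frame.corr(numeric_only=True)[target].drop(labels=[target, *exclude])
    return corr.abs().sort_values(ascending=False).head(k)

top = top_correlated(toy, "SalePrice_Log", k=2, exclude=["SalePrice"])
print(top)
```

Pearson correlation only captures linear association, so a low score does not rule out a nonlinear relationship that tree-based models could still use.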


## 🛠️ Feature Engineering <a class="anchor" id="feature-engineering"></a>

Before training our models, we need to engineer the features to ensure better performance.

In this section:
- Drop unnecessary columns
- Encode categorical variables
- Scale numerical variables
- Finalize the training set


In [None]:
# 🚮 Drop unnecessary columns (like ID if exists)
# 'Order' is a row counter and 'PID' a parcel identifier; either is skipped if absent
df.drop(columns=['Order', 'PID'], errors='ignore', inplace=True)

# 🚮 Drop the original SalePrice (we use SalePrice_Log instead)
df.drop(columns=['SalePrice'], inplace=True)

# 🎛️ Apply One-Hot Encoding to categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)
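
To see what `pd.get_dummies(..., drop_first=True)` does to a single column, here is a small sketch on a made-up frame (the rows are illustrative assumptions, not taken from the dataset).

```python
import pandas as pd

toy = pd.DataFrame({
    "House Style": ["1Story", "2Story", "1Story", "SLvl"],
    "Gr Liv Area": [1200, 1800, 1100, 1500],
})

# Numeric columns pass through; each categorical column becomes k-1 indicators
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

The first category in sorted order (`1Story` here) is dropped and becomes the implicit baseline: a row with zeros in every `House Style_*` column is a one-story house. Dropping one level avoids perfectly collinear indicator columns, which matters most for the linear model.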


In [None]:
# 🎯 Separate features (X) and target variable (y)
X = df_encoded.drop(columns=['SalePrice_Log'])
y = df_encoded['SalePrice_Log']


In [None]:
from sklearn.preprocessing import StandardScaler

# 📏 Scale only the features (not the target)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 📋 Convert scaled features back to a DataFrame
# Keep the original index so X_scaled stays row-aligned with y
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
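
One caveat: fitting the scaler on the full dataset lets test-set statistics influence preprocessing, which is a mild form of data leakage. A common alternative, sketched below on synthetic data, is to split first and wrap the scaler and the model in a scikit-learn pipeline so the scaler is fitted on the training rows only. This is not what the notebook does. The variable names, coefficients, and generated data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two synthetic features loosely resembling living area and quality
X_demo = rng.normal(loc=[1500.0, 6.0], scale=[400.0, 1.5], size=(300, 2))
y_demo = 11.0 + 0.0004 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 0.05, 300)

# Split BEFORE any statistics are learned from the data
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling parameters from X_tr only;
# score()/predict() reuse them unchanged on X_te
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)
scaler = model.named_steps["standardscaler"]
print(f"Test R^2: {r2:.3f}")
print("Scaler means come from the training split:", np.allclose(scaler.mean_, X_tr.mean(axis=0)))
```

Scaling matters for the linear model's coefficients and for regularized models. Tree-based models such as Random Forest and XGBoost are largely insensitive to it.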


## 🤖 Linear Regression Modeling <a class="anchor" id="lr-modeling"></a>

We start the modeling phase with a baseline Linear Regression model.

Linear Regression helps set a benchmark and provides insights into feature relationships.

In [None]:
from sklearn.model_selection import train_test_split

# ✂️ Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# 🤖 Initialize and train the Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# 📈 Make predictions on training and testing sets (still in log scale)
y_pred_train_log = lin_reg.predict(X_train)
y_pred_test_log = lin_reg.predict(X_test)

# 🔄 Convert predictions and actual values back from log scale
y_train_orig = np.expm1(y_train)
y_test_orig = np.expm1(y_test)
y_pred_train_orig = np.expm1(y_pred_train_log)
y_pred_test_orig = np.expm1(y_pred_test_log)

# 📊 Calculate RMSE on the original SalePrice scale
train_rmse = np.sqrt(mean_squared_error(y_train_orig, y_pred_train_orig))
test_rmse = np.sqrt(mean_squared_error(y_test_orig, y_pred_test_orig))

print(f"Training RMSE (Original Scale): {train_rmse:.2f}")
print(f"Test RMSE (Original Scale): {test_rmse:.2f}")
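
The same "invert the log, then compute RMSE" steps recur for every model, so they can be factored into one function. The sketch below is one possible shape for it. The `rmse_original_scale` name and the example prices are assumptions for illustration.

```python
import numpy as np

def rmse_original_scale(y_true_log, y_pred_log):
    """Hypothetical helper: RMSE in dollars for values stored on the log1p scale."""
    y_true = np.expm1(np.asarray(y_true_log, dtype=float))
    y_pred = np.expm1(np.asarray(y_pred_log, dtype=float))
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Two illustrative houses, each mispredicted by $10,000 in opposite directions
y_true_log = np.log1p([100_000.0, 200_000.0])
y_pred_log = np.log1p([110_000.0, 190_000.0])

rmse = rmse_original_scale(y_true_log, y_pred_log)
print(f"RMSE: {rmse:.2f}")
```

With such a helper, each model's evaluation reduces to a call like `rmse_original_scale(y_test, model.predict(X_test))`. Note that RMSE in dollars weights errors on expensive houses more heavily than RMSE on the log scale does.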


## 🌳 Random Forest Regression Modeling <a class="anchor" id="rf-modeling"></a>

We now move on to training a Random Forest model.

Random Forests can capture nonlinear relationships and typically outperform simple linear models in complex datasets.


In [None]:
from sklearn.ensemble import RandomForestRegressor

# 🌳 Initialize and train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)


In [None]:
# 📈 Make predictions on training and testing sets (still in log scale)
y_pred_train_rf_log = rf_model.predict(X_train)
y_pred_test_rf_log = rf_model.predict(X_test)

# 🔄 Convert predictions and actual values back from log scale
y_train_rf_orig = np.expm1(y_train)
y_test_rf_orig = np.expm1(y_test)
y_pred_train_rf_orig = np.expm1(y_pred_train_rf_log)
y_pred_test_rf_orig = np.expm1(y_pred_test_rf_log)

# 📊 Calculate RMSE on the original SalePrice scale
train_rmse_rf = np.sqrt(mean_squared_error(y_train_rf_orig, y_pred_train_rf_orig))
test_rmse_rf = np.sqrt(mean_squared_error(y_test_rf_orig, y_pred_test_rf_orig))

print(f"Random Forest Training RMSE (Original Scale): {train_rmse_rf:.2f}")
print(f"Random Forest Test RMSE (Original Scale): {test_rmse_rf:.2f}")


## ⚡ XGBoost Regression Modeling <a class="anchor" id="xgb-modeling"></a>

Next, we train an XGBoost model. XGBoost is a gradient-boosted decision tree library that is widely used for tabular data.

Boosting builds trees sequentially, each one fitted to the errors of the ensemble so far, which lets the model capture nonlinear relationships and feature interactions.


In [None]:
from xgboost import XGBRegressor

# ⚡ Initialize and train the XGBoost model
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
xgb_model.fit(X_train, y_train)


In [None]:
# 📈 Make predictions on training and testing sets (still in log scale)
y_pred_train_xgb_log = xgb_model.predict(X_train)
y_pred_test_xgb_log = xgb_model.predict(X_test)

# 🔄 Convert predictions and actual values back from log scale
y_train_xgb_orig = np.expm1(y_train)
y_test_xgb_orig = np.expm1(y_test)
y_pred_train_xgb_orig = np.expm1(y_pred_train_xgb_log)
y_pred_test_xgb_orig = np.expm1(y_pred_test_xgb_log)

# 📊 Calculate RMSE on the original SalePrice scale
train_rmse_xgb = np.sqrt(mean_squared_error(y_train_xgb_orig, y_pred_train_xgb_orig))
test_rmse_xgb = np.sqrt(mean_squared_error(y_test_xgb_orig, y_pred_test_xgb_orig))

print(f"XGBoost Training RMSE (Original Scale): {train_rmse_xgb:.2f}")
print(f"XGBoost Test RMSE (Original Scale): {test_rmse_xgb:.2f}")


## 🌟 Feature Importance from XGBoost

Understanding which features most influence the target variable helps us interpret the model better.

We will visualize the top important features ranked by the XGBoost model.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 🌟 Get feature importances from XGBoost
feature_importance = pd.Series(xgb_model.feature_importances_, index=X.columns)

# 📋 Sort by importance
feature_importance = feature_importance.sort_values(ascending=False)

# 🎨 Plot top 15 important features
plt.figure(figsize=(10,6))
sns.barplot(x=feature_importance.values[:15], y=feature_importance.index[:15], palette='viridis')
plt.title('Top 15 Important Features - XGBoost')
plt.xlabel('Feature Importance Score')
plt.ylabel('Feature')
plt.show()
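
Built-in tree importances are computed from the training process and can be sensitive to how a model splits on correlated features. As a cross-check, permutation importance measures how much the test score drops when one feature's values are shuffled. The sketch below uses scikit-learn's `permutation_importance` with a Random Forest on synthetic data. It is a different technique from the XGBoost importances plotted above, and the data, feature roles, and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))  # columns: strong, weak, irrelevant
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set several times and record the score drop
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]

for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")
```

The same call accepts any fitted estimator that follows the scikit-learn interface, so it could be applied to `xgb_model` with `X_test` and `y_test`.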


## 🎯 Error Distribution Plot

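The distribution of prediction errors (actual minus predicted price) shows whether the model is systematically off. A center far from zero indicates bias: positive errors mean the model under-predicts, negative errors mean it over-predicts. A wide spread or a long tail indicates large misses on some houses.

Ideally, errors should be centered around zero and symmetrically distributed.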


In [None]:
# 📈 Calculate errors on test set
errors = y_test_xgb_orig - y_pred_test_xgb_orig

# 🎨 Plot error distribution
plt.figure(figsize=(10,6))
sns.histplot(errors, bins=30, kde=True, color='salmon')
plt.title('Error Distribution - XGBoost Predictions')
plt.xlabel('Prediction Error ($)')
plt.ylabel('Frequency')
plt.axvline(0, color='black', linestyle='--')
plt.show()
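
A histogram is easier to interpret alongside a few numbers. The sketch below summarizes residuals with the mean, median, mean absolute error, and the share of houses that were under-predicted. The `residual_summary` helper and the example prices are assumptions for illustration.

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Hypothetical helper: summary statistics of errors (actual - predicted)."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return {
        "mean_error": float(errors.mean()),             # far from 0 -> systematic bias
        "median_error": float(np.median(errors)),
        "mae": float(np.abs(errors).mean()),            # typical miss in dollars
        "share_underpredicted": float((errors > 0).mean()),
    }

actual = [200_000, 150_000, 320_000, 95_000]     # illustrative prices
predicted = [190_000, 155_000, 300_000, 100_000]

stats = residual_summary(actual, predicted)
print(stats)
```

On the notebook's variables the call would be `residual_summary(y_test_xgb_orig, y_pred_test_xgb_orig)`.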


## 📊 Model Performance Comparison <a class="anchor" id="comparison"></a>

We now compare the performance of all three models (Linear Regression, Random Forest, XGBoost) based on their RMSE scores.

Lower RMSE indicates better model performance.


In [None]:
# 📋 Create a summary DataFrame
model_results = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'XGBoost'],
    'Training RMSE': [train_rmse, train_rmse_rf, train_rmse_xgb],
    'Test RMSE': [test_rmse, test_rmse_rf, test_rmse_xgb]
})

# 📋 Display the table
print(model_results)

# 🎨 Plot Test RMSE comparison
plt.figure(figsize=(8,5))
sns.barplot(x='Model', y='Test RMSE', data=model_results, palette='pastel')
plt.title('Model Comparison - Test RMSE')
plt.ylabel('Test RMSE ($)')
plt.xlabel('Model')
plt.show()


## 📜 Final Conclusion <a class="anchor" id="conclusion"></a>

In this project, we performed a complete data science pipeline on a real-world housing prices dataset.

Key steps included:
- Data cleaning and preprocessing
- Exploratory Data Analysis (EDA)
- Feature engineering (encoding, scaling)
- Model training and evaluation (Linear Regression, Random Forest, XGBoost)

**Main findings:**
- XGBoost achieved the best performance with a Test RMSE of approximately $26,784.
- Random Forest also performed well, slightly behind XGBoost.
- Linear Regression served as a good baseline but was outperformed by tree-based models.

**Next Steps (Optional):**
- Hyperparameter tuning with GridSearchCV or RandomizedSearchCV
- Feature selection based on importance scores
- Ensemble methods (stacking multiple models)
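
As a starting point for the first item, here is a minimal sketch of a grid search with cross-validation. It runs on synthetic data with a deliberately tiny grid so it finishes quickly. The grid values, data, and variable names are assumptions for illustration, not tuned recommendations for the Ames dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.1, 200)

# Small illustrative grid: 2 x 2 = 4 candidate settings
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

For larger grids, `RandomizedSearchCV` takes the same estimator and scoring arguments but samples a fixed number of candidates instead of trying every combination.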

Overall, XGBoost was the most effective model for predicting housing prices in this dataset.
