# House Prices Prediction - Regression Techniques

My name is **Nikos**, and this project, along with the others in this folder, marks a significant milestone in my journey to becoming an AI Engineer. I’m passionate about artificial intelligence and machine learning, and this project serves as the first step on that path. Through it, I aim not only to understand how machine learning models work but also to demonstrate the practical applications of AI to real-world problems.

In this project, we aim to predict house prices using various features available in the dataset. We will explore the dataset, clean the data, perform feature engineering, and then train machine learning models to make predictions.

The steps we will follow:
1. Data Exploration
2. Data Cleaning
3. Feature Engineering
4. Model Selection
5. Model Evaluation

## Data Exploration

In this step, we load the dataset and explore its structure, checking for missing values, data types, and summary statistics. Understanding the data is crucial before performing any cleaning or preprocessing.


In [1]:
import pandas as pd

In [2]:
# Load the train and test data
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [3]:
# Display the first few rows of the training data
train_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


## Exploring the Data

Before diving into machine learning, let's explore the dataset. We will check the size of the dataset, look at the data types, and identify any potential issues such as missing values.


In [4]:
# Get the shape of the training dataset
train_df.shape

# Check the data types and non-null counts for each column
train_df.info()

# Display summary statistics of the numerical columns
train_df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [5]:
# Check for missing values
train_df.isnull().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64
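
Because `isnull().sum()` on 81 columns gets truncated in the output, it can help to list only the columns that actually have gaps. The helper below is a minimal sketch of that idea; `missing_summary` and the tiny `demo` frame are hypothetical names and values used for illustration, not part of the original notebook or dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the columns that contain missing values, with counts and percentages."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one gap, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny stand-in frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, None, None, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

print(missing_summary(demo))
```

On the real data the same call would be `missing_summary(train_df)`.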

## Handling Missing Values

Based on the missing value analysis, `LotFrontage` has 259 missing values. Since it’s a numerical feature, we will fill them with the **median**, which is less sensitive than the mean to extreme values in skewed data (the maximum of 313 is far above the 75th percentile of 80).

### Filling Missing Values in `LotFrontage`

To avoid pandas warnings about modifying a copy (such as `SettingWithCopyWarning`), we do not call `fillna(..., inplace=True)` on the selected column. Instead, we assign the filled column back to `train_df`.


In [6]:
# Fill missing values in LotFrontage with the median
train_df['LotFrontage'] = train_df['LotFrontage'].fillna(train_df['LotFrontage'].median())

# Check if missing values in LotFrontage are filled
train_df.isnull().sum().sort_values(ascending=False).head(10)


PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
GarageCond        81
GarageType        81
GarageYrBlt       81
GarageFinish      81
dtype: int64
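
The same fill value will eventually be needed for `test.csv`. A common practice, sketched below under the assumption that the test set should be treated as unseen data, is to compute the median on the training data only and reuse it for the test data. The frames `train_demo` and `test_demo` are hypothetical stand-ins with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames standing in for train_df / test_df
train_demo = pd.DataFrame({"LotFrontage": [60.0, 70.0, np.nan, 80.0, 313.0]})
test_demo = pd.DataFrame({"LotFrontage": [np.nan, 75.0]})

# Learn the fill value from the training data only
lot_frontage_median = train_demo["LotFrontage"].median()

# Reassign the column rather than using inplace=True
train_demo["LotFrontage"] = train_demo["LotFrontage"].fillna(lot_frontage_median)
test_demo["LotFrontage"] = test_demo["LotFrontage"].fillna(lot_frontage_median)

print(lot_frontage_median)
print(test_demo)
```

Note how the single extreme value (313) does not pull the median upward the way it would pull the mean.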

## Filling Missing Values in Categorical Columns (Remaining)

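We start with a first group of categorical columns that have missing values: `PoolQC`, `MiscFeature`, `Alley`, `Fence`, `MasVnrType`, `FireplaceQu`, `GarageType`, `GarageQual`, and `GarageCond`. In these columns a missing value most likely means that the house does not have the feature at all (no pool, no alley access, no garage), so we fill the missing values with `'None'` rather than guessing a category.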


In [7]:
# List of categorical features with missing values
categorical_features = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType', 
                        'FireplaceQu', 'GarageType', 'GarageQual', 'GarageCond']

# Fill missing values in these categorical columns with 'None'
train_df[categorical_features] = train_df[categorical_features].fillna('None')

# Check if missing values are filled for these columns
train_df.isnull().sum().sort_values(ascending=False).head(10)


GarageYrBlt     81
GarageFinish    81
BsmtExposure    38
BsmtFinType2    38
BsmtQual        37
BsmtCond        37
BsmtFinType1    37
MasVnrArea       8
Electrical       1
HalfBath         0
dtype: int64

## Filling Missing Values in Numerical Columns

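We will fill missing values in `GarageYrBlt` and `MasVnrArea` using the median, as these are numerical features. Note that `GarageYrBlt` has the same number of missing values (81) as the garage-related categorical columns, so these rows are probably houses without a garage; using the median year is a simplification here.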


In [8]:
# Fill missing values in numerical columns with the median
train_df['GarageYrBlt'] = train_df['GarageYrBlt'].fillna(train_df['GarageYrBlt'].median())
train_df['MasVnrArea'] = train_df['MasVnrArea'].fillna(train_df['MasVnrArea'].median())

# Check if missing values are filled for numerical columns
train_df.isnull().sum().sort_values(ascending=False).head(10)


GarageFinish    81
BsmtFinType2    38
BsmtExposure    38
BsmtFinType1    37
BsmtCond        37
BsmtQual        37
Electrical       1
Fireplaces       0
Functional       0
TotRmsAbvGrd     0
dtype: int64

## Filling Missing Values in Categorical Columns

We'll fill the remaining categorical columns (`GarageFinish`, `BsmtExposure`, `BsmtQual`, etc.) with `'None'`.


In [9]:
# List of basement-related categorical features with missing values
basement_features = ['BsmtExposure', 'BsmtFinType2', 'BsmtQual', 'BsmtCond', 'BsmtFinType1']

# Fill missing values in the categorical columns with 'None'
train_df[basement_features + ['GarageFinish']] = train_df[basement_features + ['GarageFinish']].fillna('None')

# Check if missing values are filled
train_df.isnull().sum().sort_values(ascending=False).head(10)


Electrical      1
GarageYrBlt     0
GarageType      0
FireplaceQu     0
Fireplaces      0
Functional      0
TotRmsAbvGrd    0
KitchenQual     0
KitchenAbvGr    0
BedroomAbvGr    0
dtype: int64

## Filling Missing Value in `Electrical`

We will fill the single missing value in `Electrical` with the most common value (mode).


In [10]:
# Fill missing value in Electrical with the most common value (mode)
train_df['Electrical'] = train_df['Electrical'].fillna(train_df['Electrical'].mode()[0])

# Check if any missing values remain
train_df.isnull().sum().sort_values(ascending=False).head(10)


SalePrice        0
PavedDrive       0
WoodDeckSF       0
OpenPorchSF      0
EnclosedPorch    0
3SsnPorch        0
ScreenPorch      0
PoolArea         0
PoolQC           0
Utilities        0
dtype: int64

## Why Weren’t These Missing Values Handled Initially?

In the earlier steps, we filled missing values in one group of categorical features, but several garage- and basement-related columns were left for later rounds. After each fill we only inspected the ten columns with the most missing values (`.head(10)`), so the basement-related columns only appeared in that list once the columns above them had been filled. `GarageYrBlt` and `GarageFinish` were visible from the start but were not part of the first list we filled.

Additionally:
- **GarageYrBlt** is a numerical column, so it needed a different treatment (median filling) from the categorical columns handled first.
- The basement-related features (`BsmtExposure`, `BsmtQual`, etc.) are categorical, and a missing value there most likely means that the house has no basement, hence the `'None'` fill.

We addressed these separately to ensure all missing values were handled before moving on to the next steps in the project.
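
The three strategies used above (`'None'` for absent features, the median for numerical columns, the mode for a lone categorical gap) could also be applied in one pass. The following is a minimal sketch of that idea; `fill_missing` and the `demo` frame are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd


def fill_missing(df, none_cols, median_cols, mode_cols):
    """Return a copy of df with the three fill strategies applied."""
    out = df.copy()
    # Missing category means "feature not present"
    out[none_cols] = out[none_cols].fillna("None")
    # Numerical columns: fill with the column median
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    # Categorical columns with a few gaps: fill with the most common value
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Made-up rows that mimic the kinds of gaps seen in the dataset
demo = pd.DataFrame({
    "BsmtQual": ["Gd", None, "TA", "Gd"],
    "GarageYrBlt": [2003.0, np.nan, 1976.0, 2001.0],
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA"],
})

filled = fill_missing(
    demo,
    none_cols=["BsmtQual"],
    median_cols=["GarageYrBlt"],
    mode_cols=["Electrical"],
)
print(filled)
```

Keeping the column lists explicit makes it easy to apply the identical treatment to the test set later.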

---

## Next Steps: Encoding Categorical Variables

Now that we’ve handled all missing values, we need to convert categorical variables into a format that machine learning algorithms can work with. We will use **One-Hot Encoding** to transform these categorical features into binary columns.

One-hot encoding is necessary because machine learning models work best with numerical data, and this step allows us to convert categories into numbers without introducing any unintended ordinal relationships between them.


## Encoding Categorical Variables

We need to convert the categorical variables into numeric format using **One-Hot Encoding**. This will create binary columns for each category, allowing machine learning algorithms to interpret them.


In [11]:
# Apply one-hot encoding to categorical variables
train_df = pd.get_dummies(train_df)

# Display the first few rows after encoding
train_df.head()


Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,False,False,False,True,False,False,False,False,True,False
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,False,False,False,True,False,False,False,False,True,False
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,False,False,False,True,False,False,False,False,True,False
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,False,False,False,True,True,False,False,False,False,False
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,False,False,False,True,False,False,False,False,True,False
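
One caveat with `pd.get_dummies` is that it creates columns only for the categories it sees, so encoding `train_df` and `test_df` separately can produce different column sets. A possible way to handle this, sketched below with hypothetical miniature frames, is to reindex the encoded test frame to the training columns.

```python
import pandas as pd

# Hypothetical stand-ins: the test frame has a category (FV) unseen in training
# and lacks one (RM) that training has
train_demo = pd.DataFrame({"LotArea": [8450, 9600, 11250], "MSZoning": ["RL", "RM", "RL"]})
test_demo = pd.DataFrame({"LotArea": [7000, 8000], "MSZoning": ["RL", "FV"]})

train_encoded = pd.get_dummies(train_demo)
test_encoded = pd.get_dummies(test_demo)

# Align the test columns to the training columns:
# unseen categories are dropped, missing ones are added as all-False columns
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=False)

print(train_encoded.columns.tolist())
print(test_aligned)
```

With the real data, the target column `SalePrice` would need to be excluded from the column list before reindexing the test frame.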


## Feature Selection and Target Variable

Now that all the categorical variables are encoded, it's time to separate the **features** (independent variables) from the **target variable** (`SalePrice`). 
- **X** will represent the features (all columns except `SalePrice`).
- **y** will represent the target variable (`SalePrice`), which we want to predict.


In [12]:
# Define feature matrix (X) and target vector (y)
X = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']

# Display the shapes of X and y
X.shape, y.shape


((1460, 303), (1460,))

## Splitting the Data into Training and Validation Sets

We’ll split the dataset into training and validation sets. Typically, we use 80% of the data for training and 20% for validation to ensure we can evaluate the model’s performance on unseen data.


In [13]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the split data
X_train.shape, X_val.shape, y_train.shape, y_val.shape


((1168, 303), (292, 303), (1168,), (292,))

## Training a Linear Regression Model

We’ll start by training a **Linear Regression** model. This model is a good starting point for regression tasks like predicting house prices. After training, we’ll evaluate the model’s performance on the validation set using the **Mean Absolute Error (MAE)**.


In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the validation data
y_pred = model.predict(X_val)

# Calculate the Mean Absolute Error (MAE)
mae = mean_absolute_error(y_val, y_pred)
print(f'Mean Absolute Error: {mae}')


Mean Absolute Error: 21109.70645342619
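
A single 80/20 split gives one MAE value that depends on which rows happen to land in the validation set. A k-fold cross-validation, sketched below on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration), averages the error over several splits and also shows how much it varies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: a linear signal plus noise)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation with shuffling
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_mean_absolute_error", cv=cv,
)

# scikit-learn reports the negative MAE, so flip the sign
mae_per_fold = -scores
print(f"MAE per fold: {mae_per_fold}")
print(f"Mean MAE: {mae_per_fold.mean():.3f} (std {mae_per_fold.std():.3f})")
```

On the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`.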


## Feature Engineering

We will create new features based on domain knowledge, which might improve the model’s ability to make accurate predictions. We will:
1. Create an `Age` feature, which represents the age of the house.
2. Create a `RemodAge` feature, which represents the number of years since the last remodel.
3. Transform the skewed `LotArea` feature using a logarithmic scale (`LogLotArea`).


In [15]:
import numpy as np

# Create a new feature 'Age' (current year - year built)
train_df['Age'] = 2024 - train_df['YearBuilt']  # Assuming the current year is 2024

# Create a new feature 'RemodAge' (current year - year remodeled)
train_df['RemodAge'] = 2024 - train_df['YearRemodAdd']

# Log-transform the 'LotArea' feature to deal with skewness
train_df['LogLotArea'] = np.log1p(train_df['LotArea'])

# Drop 'YearBuilt' and 'YearRemodAdd' as they are now represented by new features
train_df.drop(['YearBuilt', 'YearRemodAdd'], axis=1, inplace=True)

# Display the first few rows to verify the changes
train_df[['Age', 'RemodAge', 'LogLotArea']].head()


Unnamed: 0,Age,RemodAge,LogLotArea
0,21,21,9.04204
1,48,48,9.169623
2,23,22,9.328212
3,109,54,9.164401
4,24,24,9.565284


In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the validation data
y_pred = model.predict(X_val)

# Calculate the Mean Absolute Error (MAE)
mae = mean_absolute_error(y_val, y_pred)
print(f'Mean Absolute Error: {mae}')


Mean Absolute Error: 21109.70645342619


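## No Change After Feature Engineering

The MAE is identical to the earlier run (21,109.71), down to the last digit. This is not evidence that feature engineering does not help: `X_train` and `X_val` were created in cell 13, before `Age`, `RemodAge`, and `LogLotArea` were added to `train_df`, so the model was refit on exactly the same features. To measure the effect, `X` and `y` would have to be rebuilt from the updated `train_df` and split again. Note also that `Age` and `RemodAge` are linear transformations of `YearBuilt` and `YearRemodAdd`, so for an ordinary linear regression only `LogLotArea` could change the fit. Plain linear regression also has no hyperparameters worth tuning, so with this caveat in mind we move on to other models.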
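
One way to check whether an engineered feature matters is to rebuild the feature matrix from the updated frame, split again, and refit. The sketch below does this on synthetic data in which the target is, by construction, driven by the logarithm of `LotArea`; the frame `demo`, its values, and the helper `validation_mae` are assumptions made for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: price depends on log(LotArea) and on the build year (assumption)
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=n),
    "YearBuilt": rng.integers(1900, 2010, size=n),
})
demo["SalePrice"] = (
    50_000 * np.log1p(demo["LotArea"])
    + 500 * (demo["YearBuilt"] - 1900)
    + rng.normal(scale=5_000, size=n)
)


def validation_mae(frame):
    """Rebuild X and y from the frame, split, fit, and return the validation MAE."""
    X = frame.drop("SalePrice", axis=1)
    y = frame["SalePrice"]
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))


mae_before = validation_mae(demo)

# Add the engineered feature, then rebuild X, split, and refit
engineered = demo.copy()
engineered["LogLotArea"] = np.log1p(engineered["LotArea"])
mae_after = validation_mae(engineered)

print(f"MAE without LogLotArea: {mae_before:.0f}")
print(f"MAE with LogLotArea:    {mae_after:.0f}")
```

The key point is that `X` is derived from the frame after the new column exists, so the model can actually use it.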

## Training a Decision Tree Model

We will now train a **Decision Tree** model, which is more flexible and can capture non-linear relationships in the data. Decision Trees do not require feature scaling and can work with the one-hot encoded categorical columns alongside the numerical ones.


In [17]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Initialize the Decision Tree model
tree_model = DecisionTreeRegressor(random_state=42)

# Train the model on the training data
tree_model.fit(X_train, y_train)

# Make predictions on the validation data
y_pred_tree = tree_model.predict(X_val)

# Calculate the Mean Absolute Error (MAE)
mae_tree = mean_absolute_error(y_val, y_pred_tree)
print(f'Mean Absolute Error (Decision Tree): {mae_tree}')


Mean Absolute Error (Decision Tree): 27587.36301369863


## Evaluation of the Decision Tree Model

We trained a **Decision Tree** model, and the resulting **Mean Absolute Error (MAE)** was approximately **27,587**. This is higher than the MAE from the Linear Regression model, which indicates that the decision tree, in its current form, is not performing as well.

### Why Did the Decision Tree Perform Worse?

1. **Overfitting**: 
   - Decision Trees can easily overfit to the training data, especially if they are allowed to grow too deep. This might lead to excellent performance on the training data but poor generalization to unseen data (like the validation set).

2. **Default Hyperparameters**:
   - By default, the decision tree may have created a very complex tree, capturing noise and not just the actual patterns in the data. Without any constraints (e.g., limiting tree depth), this can lead to high variance.

### Next Step: Tuning the Decision Tree

To improve the performance of the Decision Tree, we need to **tune its hyperparameters**. Some important hyperparameters to tune include:
- **max_depth**: Limits the depth of the tree to prevent overfitting.
- **min_samples_split**: The minimum number of samples required to split a node.
- **min_samples_leaf**: The minimum number of samples that a leaf node must have.

We will tune the **max_depth** hyperparameter first to see if limiting the depth of the tree can reduce overfitting and improve the MAE.
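
A quick way to see whether overfitting is the issue is to compare the training MAE with the validation MAE for several depths. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the helper `train_val_mae` are hypothetical); an unconstrained tree typically reaches a training error near zero while its validation error stays much higher.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (assumption: linear plus sinusoidal signal)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 10 * X_demo[:, 0] + 5 * np.sin(X_demo[:, 1]) + rng.normal(scale=5.0, size=400)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)


def train_val_mae(max_depth):
    """Fit a tree with the given depth limit and return (train MAE, validation MAE)."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42).fit(X_tr, y_tr)
    return (
        mean_absolute_error(y_tr, tree.predict(X_tr)),
        mean_absolute_error(y_va, tree.predict(X_va)),
    )


# None means the tree grows until every leaf is pure
for depth in [None, 3, 5, 8]:
    train_mae, val_mae = train_val_mae(depth)
    print(f"max_depth={depth}: train MAE {train_mae:.2f}, validation MAE {val_mae:.2f}")
```

A large gap between the two numbers points to overfitting; similar but high numbers point to a model that is too simple.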


In [18]:
# Tune the max_depth hyperparameter of the Decision Tree
tree_model_tuned = DecisionTreeRegressor(max_depth=5, random_state=42)

# Train the tuned model on the training data
tree_model_tuned.fit(X_train, y_train)

# Make predictions on the validation data
y_pred_tree_tuned = tree_model_tuned.predict(X_val)

# Calculate the Mean Absolute Error (MAE) for the tuned model
mae_tree_tuned = mean_absolute_error(y_val, y_pred_tree_tuned)
print(f'Mean Absolute Error (Tuned Decision Tree): {mae_tree_tuned}')


Mean Absolute Error (Tuned Decision Tree): 27511.28283135003


## Decision Tree Tuning Results

After tuning the **max_depth** hyperparameter of the Decision Tree, the **Mean Absolute Error (MAE)** only slightly improved, from **27,587** to **27,511**. This minor improvement suggests that tuning the depth of the tree is not sufficient to significantly enhance the model's performance.

### Evaluation of the Situation

1. **Decision Trees**:
   - Even after tuning, the Decision Tree model does not outperform the Linear Regression model, which had an MAE of **21,109**. This suggests that the relationships in the data might be simpler and more linear in nature, which is why a Decision Tree model (which is more complex and captures non-linear relationships) is not yielding better results.

2. **Data Simplicity**:
   - It’s possible that the dataset’s features have a more linear relationship with the target (`SalePrice`), meaning that **Linear Regression** might be better suited for this task.
   - For instance, house prices may increase in a relatively straightforward way as features like living area or overall quality increase. Complex algorithms that model non-linear interactions (like Decision Trees) may not be necessary here.

### Next Steps: Trying Different Algorithms

Given that **Linear Regression** outperformed the Decision Tree, it might be worth trying a few other algorithms to explore different aspects of the data:
1. **Random Forests or Gradient Boosting**:
   - These ensemble methods can help improve upon Decision Trees by combining multiple trees and reducing overfitting.
2. **Neural Networks**:
   - Neural networks might capture more complex patterns in the data, but given the dataset size and the relatively straightforward nature of the problem, neural networks may not provide significant benefits.
3. **Regularized Linear Models**:
   - Using models like **Ridge Regression** or **Lasso Regression**, which add regularization to the linear model, can help improve performance by penalizing large coefficients and reducing overfitting.

### Conclusion: Is Linear Regression Better?

In this case, it seems that the relationships in the dataset might be simple enough that **Linear Regression** works better than more complex models like Decision Trees. While it's always worth experimenting with different algorithms, the linear relationship in this dataset might be driving better results with simpler models.

---

## Next Step: Trying Random Forest

Let's next try a **Random Forest** model, which is an ensemble method that builds multiple Decision Trees and averages their predictions to improve performance and reduce overfitting.


In [19]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest model
forest_model = RandomForestRegressor(random_state=42)

# Train the model on the training data
forest_model.fit(X_train, y_train)

# Make predictions on the validation data
y_pred_forest = forest_model.predict(X_val)

# Calculate the Mean Absolute Error (MAE)
mae_forest = mean_absolute_error(y_val, y_pred_forest)
print(f'Mean Absolute Error (Random Forest): {mae_forest}')


Mean Absolute Error (Random Forest): 17615.47934931507


## Random Forest Results

The **Random Forest** model resulted in a **Mean Absolute Error (MAE)** of **17,615**, which is a significant improvement over both the Decision Tree (MAE ~27,500) and Linear Regression (MAE ~21,100). This shows that Random Forest is able to capture more complex patterns in the data compared to the simpler models.

### What Does This Result Show?

1. **Improved Performance**:
   - The Random Forest's MAE of **17,615** indicates that it is better at capturing the relationships in the data compared to both Linear Regression and Decision Trees. This makes sense because Random Forests, as an ensemble method, aggregate multiple decision trees, reducing the risk of overfitting while still capturing complex patterns.

2. **Random Forest's Strengths**:
   - Random Forests combine the strengths of many decision trees, averaging their predictions to reduce variance. This helps the model generalize better to unseen data, as evidenced by the lower error.
   - It also handles both linear and non-linear relationships well, making it a versatile model for this dataset.
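
The averaging described above can be verified directly: a fitted `RandomForestRegressor` exposes its individual trees in `estimators_`, and its prediction is the mean of their predictions. The data below is synthetic and only serves the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Predict with every tree separately, then average across trees
X_new = rng.normal(size=(5, 4))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(per_tree.shape)
print(np.allclose(manual_average, forest.predict(X_new)))
```

Each tree is trained on a different bootstrap sample, so the individual predictions differ and their average is less variable than any single tree.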

### How Do We Proceed?

Since Random Forest has shown significant improvement, we can now focus on **fine-tuning** the model to optimize its performance even further. Some common hyperparameters to tune in a Random Forest include:
- **n_estimators**: The number of trees in the forest. More trees generally improve performance but also increase computation time.
- **max_depth**: The maximum depth of each tree. Limiting the depth can help prevent overfitting.
- **min_samples_split** and **min_samples_leaf**: These control when nodes are split and how many samples a leaf must have. Higher values can reduce overfitting.

Alternatively, if we want to explore further improvements, we could try other ensemble methods like **Gradient Boosting**, which builds trees sequentially and corrects mistakes from previous trees.

### Conclusion:
The Random Forest model currently provides the best performance with an MAE of **17,615**. The next step would be to either:
1. **Tune the Random Forest hyperparameters** to further optimize the model.
2. **Try Gradient Boosting**, which might provide even better results by focusing on correcting errors in predictions step by step.

---

## Next Step: Hyperparameter Tuning for Random Forest

Let's start by tuning some key hyperparameters for the Random Forest model and see if we can further improve the MAE.


In [20]:
# Tune the hyperparameters of the Random Forest
forest_model_tuned = RandomForestRegressor(n_estimators=300, max_depth=20, random_state=42)

# Train the tuned model
forest_model_tuned.fit(X_train, y_train)

# Make predictions on the validation data
y_pred_forest_tuned = forest_model_tuned.predict(X_val)

# Calculate the Mean Absolute Error (MAE) for the tuned model
mae_forest_tuned = mean_absolute_error(y_val, y_pred_forest_tuned)
print(f'Mean Absolute Error (Tuned Random Forest): {mae_forest_tuned}')


Mean Absolute Error (Tuned Random Forest): 17461.31504498976


## Explanation of Hyperparameters in Random Forest

After setting **max_depth** to 20 and increasing **n_estimators** (the number of trees) from the default of 100 to 300, we achieved an MAE of about **17,461**, a slight improvement over the 17,615 of the default model. Let’s break down what these hyperparameters mean and why adjusting them can affect the model’s performance.

### Key Hyperparameters:

1. **max_depth**:
   - **Definition**: This parameter controls the maximum depth of each tree in the forest. A deeper tree can model more complex relationships but also increases the risk of overfitting, where the model performs well on training data but poorly on unseen data.
   - **Effect**: Increasing the `max_depth` allows the model to split the data more and make finer decisions, which can improve accuracy up to a certain point. Beyond that, it might overfit.
   - In our case, `max_depth=20` combined with 300 trees lowered the validation MAE slightly, which suggests that this depth does not hurt generalization here.

2. **n_estimators**:
   - **Definition**: This parameter controls the number of trees in the Random Forest. More trees generally lead to better performance, as the model averages predictions over more trees, reducing variance and improving generalization.
   - **Effect**: Increasing `n_estimators` makes the model more robust, but with diminishing returns and longer computation time. In our case, increasing to 300 trees contributed to a slight improvement in performance.

### Next Step: Hyperparameter Tuning Loop

To find the optimal combination of `max_depth` and `n_estimators`, we can set up a loop that systematically varies both parameters and tracks the resulting MAE. This will allow us to identify the combination that works best for this dataset.


In [21]:
from sklearn.metrics import mean_absolute_error
import numpy as np

# Define a range of values for max_depth and n_estimators
max_depth_range = [10, 15, 20, 25]
n_estimators_range = [100, 200, 300, 400, 500]

# To store the results
results = []

# Loop through all combinations of max_depth and n_estimators
for max_depth in max_depth_range:
    for n_estimators in n_estimators_range:
        # Initialize the Random Forest with current parameters
        forest_model = RandomForestRegressor(max_depth=max_depth, n_estimators=n_estimators, random_state=42)
        
        # Train the model
        forest_model.fit(X_train, y_train)
        
        # Make predictions on the validation set
        y_pred = forest_model.predict(X_val)
        
        # Calculate the MAE
        mae = mean_absolute_error(y_val, y_pred)
        
        # Append the results as a tuple (max_depth, n_estimators, mae)
        results.append((max_depth, n_estimators, mae))

# Convert results to a sorted list and display the top combinations
results = sorted(results, key=lambda x: x[2])
for result in results:
    print(f"max_depth: {result[0]}, n_estimators: {result[1]}, MAE: {result[2]}")


max_depth: 25, n_estimators: 400, MAE: 17456.371685216895
max_depth: 25, n_estimators: 500, MAE: 17459.093170091324
max_depth: 20, n_estimators: 400, MAE: 17461.181618514885
max_depth: 20, n_estimators: 300, MAE: 17461.31504498976
max_depth: 15, n_estimators: 400, MAE: 17463.258091350883
max_depth: 15, n_estimators: 500, MAE: 17468.437681632855
max_depth: 25, n_estimators: 300, MAE: 17469.458280060884
max_depth: 20, n_estimators: 500, MAE: 17480.510176232194
max_depth: 15, n_estimators: 300, MAE: 17503.25418301735
max_depth: 20, n_estimators: 200, MAE: 17514.76008789611
max_depth: 15, n_estimators: 200, MAE: 17524.09608539495
max_depth: 25, n_estimators: 200, MAE: 17531.990650684933
max_depth: 20, n_estimators: 100, MAE: 17600.040353847635
max_depth: 25, n_estimators: 100, MAE: 17612.309760273973
max_depth: 15, n_estimators: 100, MAE: 17658.094402714447
max_depth: 10, n_estimators: 400, MAE: 17712.123787781504
max_depth: 10, n_estimators: 500, MAE: 17727.101839952917
max_depth: 10, n_e

## Hyperparameter Tuning Results for Random Forest

After running a loop to tune the hyperparameters (`max_depth` and `n_estimators`), we identified the best combinations. Here are the top-performing models based on **Mean Absolute Error (MAE)**:

### Top Results:
1. **max_depth: 25, n_estimators: 400**, MAE: **17,456**
2. **max_depth: 25, n_estimators: 500**, MAE: **17,459**
3. **max_depth: 20, n_estimators: 400**, MAE: **17,461**
4. **max_depth: 20, n_estimators: 300**, MAE: **17,461**
5. **max_depth: 15, n_estimators: 400**, MAE: **17,463**

### Interpretation:
1. **Optimal Depth**: Depths of 15 to 25 give nearly identical results (within about 10 of each other at 400 trees), while shallower trees with `max_depth: 10` are clearly worse (an MAE of 17,712 or higher).
2. **Number of Trees**: For depths of 15 to 25, increasing `n_estimators` from 100 to 400 lowered the MAE at each step, but going from 400 to 500 did not help. More trees generally help, but after a certain point the improvement diminishes.

### Conclusion:
- The best model uses **max_depth: 25** and **n_estimators: 400**, resulting in an MAE of **17,456**. This is a strong improvement compared to previous models like Linear Regression (~21,100) and Decision Trees (~27,500).
- Going beyond a depth of 15 or beyond 400 trees changes the MAE by only a few units, which is likely within the noise of a single 292-row validation split, so any of the top combinations is a reasonable choice for this dataset.
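
The manual loop can also be expressed with scikit-learn's `GridSearchCV`, which evaluates every combination with cross-validation instead of a single validation split. The sketch below uses a deliberately small grid and synthetic data (`X_demo`, `y_demo`, and the grid values are assumptions chosen to keep the example fast).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, small enough for a quick search
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] * 3 + X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=150)

# Small illustrative grid; the notebook's ranges could be used instead
param_grid = {"max_depth": [3, 10], "n_estimators": [25, 50]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # negative MAE: higher is better
    cv=3,
)
search.fit(X_demo, y_demo)

best_mae = -search.best_score_
print(search.best_params_)
print(f"Cross-validated MAE of best combination: {best_mae:.3f}")
```

Because each combination is scored on several folds, small differences between combinations are easier to judge than with one split.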


## Trying Another Model: Gradient Boosting

Since Random Forest provided good results but still has room for improvement, the next step is to try **Gradient Boosting**. Gradient Boosting is another powerful ensemble method that builds trees sequentially, where each tree tries to correct the errors of the previous ones. This can make Gradient Boosting more effective than Random Forests at handling complex patterns, as it focuses on reducing bias over multiple iterations, although it is also more sensitive to its hyperparameters.

### Why Gradient Boosting?
- **Sequential learning**: Unlike Random Forests, which build trees independently, Gradient Boosting builds trees in a sequential manner, where each tree attempts to correct the mistakes of the previous trees.
- **More control over overfitting**: Gradient Boosting tends to perform better with hyperparameter tuning, offering finer control over the learning process and tree complexity.

### Let's try Gradient Boosting using the `GradientBoostingRegressor` from `sklearn`.


In [22]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize the Gradient Boosting model
gb_model = GradientBoostingRegressor(random_state=42)

# Train the model on the training data
gb_model.fit(X_train, y_train)

# Make predictions on the validation data
y_pred_gb = gb_model.predict(X_val)

# Calculate the Mean Absolute Error (MAE)
mae_gb = mean_absolute_error(y_val, y_pred_gb)
print(f'Mean Absolute Error (Gradient Boosting): {mae_gb}')


Mean Absolute Error (Gradient Boosting): 17205.71765440213


## Gradient Boosting Results

The **Gradient Boosting** model resulted in a **Mean Absolute Error (MAE)** of **17,205**, which is better than the tuned Random Forest model (MAE ~17,456) by about 250, or roughly 1.4%. This suggests that Gradient Boosting may be more effective at capturing the patterns in this dataset, although the margin is small and comes from a single validation split.

### Why is Gradient Boosting Performing Better?
- **Sequential Learning**: Gradient Boosting corrects mistakes made by previous models, leading to a more refined set of predictions.
- **Bias Reduction**: The model reduces bias by sequentially learning from residuals, leading to more accurate predictions.
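
The idea of learning from residuals can be illustrated with a hand-rolled loop. This is a simplified sketch for squared-error loss on synthetic data, not the implementation used by `GradientBoostingRegressor`: each small tree is fit to the current residuals and a fraction of its prediction (the learning rate) is added to the running prediction. The error tracked here is the training MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional data with a non-linear signal (assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) * 10 + rng.normal(scale=0.5, size=300)

learning_rate = 0.1

# Stage 0: predict the mean of the target for every row
prediction = np.full_like(y_demo, y_demo.mean())
mae_history = [mean_absolute_error(y_demo, prediction)]

for _ in range(50):
    # The residuals are what the current ensemble still gets wrong
    residuals = y_demo - prediction
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Take a small step towards correcting those errors
    prediction += learning_rate * small_tree.predict(X_demo)
    mae_history.append(mean_absolute_error(y_demo, prediction))

print(f"Training MAE at stage 0:  {mae_history[0]:.3f}")
print(f"Training MAE at stage 50: {mae_history[-1]:.3f}")
```

A smaller learning rate makes each correction more cautious, which is why it usually has to be paired with more trees.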

### Hyperparameter Tuning for Gradient Boosting

Gradient Boosting has several hyperparameters that can significantly affect its performance. Some key parameters to tune include:
- **n_estimators**: The number of boosting stages (trees). More trees usually improve performance up to a point, at the cost of longer training; with too many trees the model can start to overfit.
- **learning_rate**: Scales the contribution of each tree to the ensemble. Lower values often generalize better, but they need more trees to reach the same fit.
- **max_depth**: The maximum depth of each tree. Deeper trees can model more complex patterns but are prone to overfitting.
- **subsample**: The fraction of training samples used to fit each individual tree. Values below 1.0 (stochastic gradient boosting) can improve generalization.
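
As a rough illustration of how these parameters interact, the sketch below combines a low `learning_rate` with `subsample` and scikit-learn's built-in early stopping (`n_iter_no_change` with `validation_fraction`), so that `n_estimators` acts as an upper bound rather than a fixed count. This is an optional technique that the notebook itself does not use. The data is synthetic and the specific values are assumptions chosen for the example, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (an assumption, not the housing dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

early_stop_model = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on the number of trees
    learning_rate=0.05,       # small steps, so more trees are needed
    max_depth=3,              # shallow trees keep each stage simple
    subsample=0.8,            # each tree sees a random 80% of the rows
    n_iter_no_change=10,      # stop once the held-out score stalls for 10 stages
    validation_fraction=0.1,  # internal hold-out used only for early stopping
    random_state=0,
)
early_stop_model.fit(X_demo, y_demo)

# n_estimators_ reports how many trees were actually fitted
print(f"Trees fitted: {early_stop_model.n_estimators_} (cap was 1000)")
```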

### Next Step: Tuning the Hyperparameters

We can use a loop to test different combinations of `n_estimators`, `learning_rate`, and `max_depth` to see which combination yields the best results.


In [23]:
# Define a range of values for n_estimators, learning_rate, and max_depth
n_estimators_range = [100, 200, 300, 400]
learning_rate_range = [0.01, 0.05, 0.1]
max_depth_range = [3, 5, 7]

# To store the results
results_gb = []

# Loop through all combinations of n_estimators, learning_rate, and max_depth
for n_estimators in n_estimators_range:
    for learning_rate in learning_rate_range:
        for max_depth in max_depth_range:
            # Initialize the Gradient Boosting model with current parameters
            gb_model_tuned = GradientBoostingRegressor(n_estimators=n_estimators, 
                                                       learning_rate=learning_rate, 
                                                       max_depth=max_depth, 
                                                       random_state=42)
            
            # Train the model
            gb_model_tuned.fit(X_train, y_train)
            
            # Make predictions on the validation set
            y_pred_gb_tuned = gb_model_tuned.predict(X_val)
            
            # Calculate the MAE
            mae_gb_tuned = mean_absolute_error(y_val, y_pred_gb_tuned)
            
            # Append the results as a tuple (n_estimators, learning_rate, max_depth, mae)
            results_gb.append((n_estimators, learning_rate, max_depth, mae_gb_tuned))

# Sort the results and display the top combinations
results_gb = sorted(results_gb, key=lambda x: x[3])
for result in results_gb[:5]:  # Display top 5
    print(f"n_estimators: {result[0]}, learning_rate: {result[1]}, max_depth: {result[2]}, MAE: {result[3]}")


n_estimators: 400, learning_rate: 0.05, max_depth: 5, MAE: 15972.977296066885
n_estimators: 300, learning_rate: 0.05, max_depth: 5, MAE: 15994.068067639964
n_estimators: 200, learning_rate: 0.1, max_depth: 5, MAE: 16001.807618517276
n_estimators: 100, learning_rate: 0.1, max_depth: 5, MAE: 16026.443816349469
n_estimators: 300, learning_rate: 0.1, max_depth: 5, MAE: 16030.627509743945


## Final Gradient Boosting Results

The tuning loop for **Gradient Boosting** produced the best result with:
- **n_estimators: 400**
- **learning_rate: 0.05**
- **max_depth: 5**
- **MAE: 15,973**

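This is a clear improvement over the previous models, although the figure is likely a little optimistic because the best combination was chosen on the same validation set used to report it: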
- **Linear Regression MAE**: ~21,109
- **Tuned Random Forest MAE**: ~17,456
- **Gradient Boosting Initial MAE**: ~17,205
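
To put these numbers in perspective, the short calculation below expresses the tuned model's MAE as a percentage reduction relative to each earlier model. The MAE values are the ones reported in this notebook; the dictionary names are introduced only for this example.

```python
# Validation MAE values reported in this notebook
mae_scores = {
    "Linear Regression": 21109,
    "Tuned Random Forest": 17456,
    "Gradient Boosting (default)": 17205.72,
    "Gradient Boosting (tuned)": 15972.98,
}

best_mae = mae_scores["Gradient Boosting (tuned)"]

# Percentage by which the tuned model lowers the MAE of each model
reductions = {
    name: (mae - best_mae) / mae * 100
    for name, mae in mae_scores.items()
}

for name, mae in mae_scores.items():
    print(f"{name:<28} MAE {mae:>9,.0f}   tuned GB is {reductions[name]:4.1f}% lower")
```

The tuned model's MAE is about 24% lower than Linear Regression, about 8.5% lower than the tuned Random Forest, and about 7% lower than Gradient Boosting with default settings.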

### Is It Worth Trying Other Models?

Given the current results, **Gradient Boosting** has shown strong performance and may already be a good candidate for a production model. Libraries such as **XGBoost** or **LightGBM** implement the same gradient boosting idea with additional optimizations, so they may offer marginal improvements or faster training, but they are unlikely to yield drastically better results on a dataset of this size.

### Is This Project Workable in Real Life?

Absolutely! Here's why:
1. **Structured Data Problem**: Predicting house prices based on structured data is a common real-life application in fields like real estate or financial services.
2. **Machine Learning Engineer Role**: As a machine learning engineer, being able to develop, tune, and evaluate models like Random Forest and Gradient Boosting is essential. This project demonstrates key skills, including data preprocessing, feature engineering, and hyperparameter tuning.
3. **Deployability**: The models built here can be deployed in real-world applications, for example behind an API or inside a data pipeline that produces predictions on a schedule.
4. **Explainability**: Gradient Boosting and Random Forest models are not as transparent as a single decision tree or a linear model, but they expose feature importances, so you can show which features influence the predictions the most. This is crucial in business contexts.
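
As a minimal sketch of the explainability point, the example below compares the model's built-in impurity-based importances with permutation importance from `sklearn.inspection`. It uses synthetic data in which the true influence of each feature is known; the feature names are hypothetical and do not correspond to columns of the housing dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (illustrative only): the target depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = 5.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(0, 0.1, size=400)
feature_names = ["strong_feature", "weak_feature", "noise_feature"]  # hypothetical names

# Hold out the last 100 rows so importance is measured on unseen data
X_fit, X_holdout = X_demo[:300], X_demo[300:]
y_fit, y_holdout = y_demo[:300], y_demo[300:]

demo_model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

# Impurity-based importances are computed during training
for name, value in zip(feature_names, demo_model.feature_importances_):
    print(f"impurity importance    {name:<15} {value:.3f}")

# Permutation importance: drop in held-out score when one column is shuffled
perm = permutation_importance(demo_model, X_holdout, y_holdout, n_repeats=10, random_state=0)
ranking = [feature_names[i] for i in np.argsort(perm.importances_mean)[::-1]]
print("Ranking by permutation importance:", ranking)
```

Permutation importance is measured on held-out data, which makes it less prone than impurity-based importance to overstating features the model has merely memorized.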

### Would Neural Networks Help?

While neural networks are powerful, they generally excel in cases where there are highly complex patterns in data (e.g., images, text, deep relationships between features). In this case:
- **Structured Data**: For tabular data like this, algorithms such as **Gradient Boosting** or **Random Forest** usually perform better or on par with neural networks.
- **Computational Cost**: Neural networks typically require more computational resources and more tuning to reach results similar to those of the Gradient Boosting model tuned above.

Given the results from Gradient Boosting, neural networks would likely provide **marginal benefits at a much higher cost**. So, it may not be worth trying neural networks for this particular dataset.


## Trying a Neural Network Model

Even though **Gradient Boosting** has proven to be highly effective for this dataset, it’s worth experimenting with a **Neural Network** to see how it performs. Neural networks are powerful, but for structured data like this, they often don’t outperform models like Gradient Boosting. However, for curiosity’s sake, let’s try a basic neural network using **Keras** and see how it compares.

### Key Considerations for Neural Networks:
- **Feature Scaling**: Neural networks typically require that features are scaled. We will apply **StandardScaler** to ensure all features are on the same scale.
- **Architecture**: We'll create a simple feed-forward neural network with a few dense layers.
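
To show what scaling does in practice, here is a small sketch with two features on very different scales. The values are made up for illustration, and the `*_demo` names are used only here. The key point is that the scaler learns its mean and standard deviation from the training rows only, and the validation rows are transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (made-up values for illustration)
X_train_demo = np.array([[8450.0, 3.0], [9600.0, 2.0], [11250.0, 4.0], [14260.0, 3.0]])
X_val_demo = np.array([[9550.0, 5.0]])

demo_scaler = StandardScaler()
X_train_demo_scaled = demo_scaler.fit_transform(X_train_demo)  # learn mean/std from training rows
X_val_demo_scaled = demo_scaler.transform(X_val_demo)          # reuse the training statistics

print("Learned means:", demo_scaler.mean_)
print("Training column means after scaling:", X_train_demo_scaled.mean(axis=0).round(6))
print("Training column stds after scaling: ", X_train_demo_scaled.std(axis=0).round(6))
print("Scaled validation row:", X_val_demo_scaled.round(3))
```

Fitting the scaler on the validation data as well would leak information from the validation set into training, which is why only `transform` is called on it.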

Let's start by setting up a neural network and training it.


In [28]:
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler


# Step 1: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Step 2: Define the neural network architecture using Input layer
model = Sequential()

# Input layer followed by three hidden layers
model.add(Input(shape=(X_train_scaled.shape[1],)))  # Using Input layer, shape should match feature count
model.add(Dense(128, activation='relu'))  # 128 neurons, hidden layer
model.add(Dense(64, activation='relu'))  # 64 neurons, hidden layer
model.add(Dense(32, activation='relu'))  # 32 neurons, hidden layer

# Output layer
model.add(Dense(1))  # Output for regression task

# Step 3: Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_absolute_error')

# Step 4: Train the model
history = model.fit(X_train_scaled, y_train, epochs=100, batch_size=32, validation_data=(X_val_scaled, y_val), verbose=1)

# Step 5: Make predictions on the validation set
y_pred_nn = model.predict(X_val_scaled)

# Step 6: Calculate the Mean Absolute Error
mae_nn = mean_absolute_error(y_val, y_pred_nn)
print(f'Mean Absolute Error (Neural Network): {mae_nn}')


Epoch 1/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 176285.2500 - val_loss: 178820.6562
Epoch 2/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 781us/step - loss: 182358.8906 - val_loss: 178588.2812
Epoch 3/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 743us/step - loss: 179599.4844 - val_loss: 177004.2500
Epoch 4/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 725us/step - loss: 177299.9219 - val_loss: 170712.7188
Epoch 5/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 751us/step - loss: 170505.3750 - val_loss: 153472.9688
Epoch 6/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 732us/step - loss: 147293.9688 - val_loss: 118456.4219
Epoch 7/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 707us/step - loss: 112232.0469 - val_loss: 68946.2266
Epoch 8/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 720us/step

## Neural Network Results

After training the Neural Network model, the **Mean Absolute Error (MAE)** was approximately **19,961**. While this is a respectable result, it did not outperform the best results from **Gradient Boosting** (MAE ~15,973) or **Random Forest** (MAE ~17,456).

### Key Observations:
1. **Neural Networks Performance**:
   - Despite fine-tuning the architecture (hidden layers, neurons) and training for 100 epochs, the Neural Network’s MAE was about 25% higher than that of the tuned Gradient Boosting model (19,961 vs. 15,973), and also higher than that of the tuned Random Forest.
   - This is expected, as **Neural Networks** often perform best on highly complex data (like images, text, or large-scale data) rather than structured data like this housing dataset.

2. **Gradient Boosting Is the Best Performer**:
   - For structured/tabular data like this, **Gradient Boosting** algorithms often outperform neural networks due to their ability to capture complex, non-linear interactions in the data without extensive feature engineering.

### Conclusion: Best Algorithm for This Project
- **Gradient Boosting** emerged as the best model with an MAE of **15,973**, making it the most suitable for this problem.
- **Random Forest** also performed well, but Gradient Boosting’s sequential learning process allowed it to refine predictions more effectively.
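
All of the comparisons above rely on a single train/validation split, so small differences between models could partly reflect that particular split. A more robust check, sketched below under the assumption that it would be applied to the housing features in the same way, is k-fold cross-validation with `cross_val_score`. The example uses synthetic data, so its output says nothing about which model wins on the housing dataset; `candidates` and `cv_mae` are names introduced only for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption, not the housing dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

cv_mae = {}
for name, estimator in candidates.items():
    # scikit-learn maximizes scores, so MAE is exposed as a negative number
    scores = cross_val_score(
        estimator, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores  # flip the sign back to a regular MAE per fold
    print(f"{name}: MAE {cv_mae[name].mean():.1f} +/- {cv_mae[name].std():.1f} over 5 folds")
```

If the gap between two models is smaller than the spread across folds, the ranking should be treated as tentative.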

---

## Next Steps: Moving to a New Project

### Where to Go from Here:

1. **More Complex Datasets**:
   - You can move on to larger, more complex datasets, possibly from industries like **finance**, **healthcare**, or **e-commerce**. These often require handling more features and complex relationships.
   
2. **Time Series or Sequential Data**:
   - You could explore projects that involve **time series forecasting** (predicting future values) or **sequence data** (such as stock prices, weather, or user behavior).

3. **Unsupervised Learning**:
   - Work on **clustering** or **dimensionality reduction** projects using methods like **K-means** or **PCA** to explore datasets without labeled outcomes.

4. **End-to-End Deployment**:
   - Deploy one of your models into a real-world application. This could involve using **Flask**, **FastAPI**, or deploying it on platforms like **AWS Lambda** or **Google Cloud**.
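
As a first step towards deployment, a trained model can be saved to disk and reloaded inside whatever service ends up making predictions. The sketch below uses `joblib` for persistence and a plain Python function as a stand-in for an API endpoint. The file name `house_price_model.joblib` and the function `predict_price` are hypothetical, and the model is trained on synthetic data purely so that the example runs on its own.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Train a small stand-in model on synthetic data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
trained_model = GradientBoostingRegressor(random_state=0).fit(X_demo, y_demo)

# Persist the fitted model to disk (hypothetical file name, temporary directory)
model_path = os.path.join(tempfile.mkdtemp(), "house_price_model.joblib")
joblib.dump(trained_model, model_path)


def predict_price(feature_rows, path=model_path):
    """Load the saved model and return predictions for a 2-D array of feature rows."""
    model = joblib.load(path)
    return model.predict(np.asarray(feature_rows))


# The reloaded model should reproduce the in-memory model's predictions
served = predict_price(X_demo[:3])
print("Predictions from reloaded model:", served.round(2))
```

In a real service the model would be loaded once at startup rather than on every call, and the same preprocessing used during training would have to be applied to incoming data before prediction.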

