<p style="text-align:center"> 
<a href="https://skills.network" target="_blank"> 
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo"> 
</a>
</p>

# <h1 align="center"><font size="7"><strong>Final project</strong></font></h1>
## <h2 align= "center"><font size="6.8">**Homes for sale in King County, USA**</font></h2>

<hr>

# Part 4: Model Development

Once we understand the relationships between our variables, we can begin to develop a model that estimates the price of a house from specific characteristics.

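First, we split the data into separate samples. The evaluation functions below hold out 20% of the rows as test data and train on the remaining 80%; a separate validation set is not created in this part.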
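For reference, a three-way split into training, validation and test sets can be sketched with two successive calls to `train_test_split`. The snippet below is a minimal illustration on synthetic data; the names `X_demo` and `y_demo` and the 60/20/20 proportions are assumptions, and the evaluation cells in this part hold out only a test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1,000 rows, 3 features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.normal(size=1000)

# Step 1: hold out 20% of all rows as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into training and validation.
# 25% of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

With such a split, the validation set would be used to compare models or tune hyperparameters, and the test set would be kept aside for one final estimate.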

We install `xgboost` and import the libraries used in this part, including `train_test_split` to separate our data:

In [None]:
%pip install xgboost

In [28]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('../data/processed/data_clean.csv')

In [3]:
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,221900.0,3.0,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,538000.0,3.0,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,180000.0,2.0,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,604000.0,4.0,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,510000.0,3.0,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


We split the data into training and test sets inside an evaluation function and, as a baseline, fit each model on a single feature, `sqft_living`:

In [29]:
X = df[['sqft_living']]
Y = df['price']

def evaluate_regression_models(X, y, test_size=0.2, random_state=42):
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Models to evaluate
    models = {
        "Linear Regression": LinearRegression(),
        "Lasso Regression (L1)": Lasso(alpha=0.1),
        "Decision Tree": DecisionTreeRegressor(random_state=random_state),
        "Random Forest": RandomForestRegressor(random_state=random_state),
        "Support Vector Regression": SVR(),
        "K-Nearest Neighbors": KNeighborsRegressor(),
        "XGBoost": XGBRegressor(random_state=random_state)
    }

    results = []

    # Train and evaluate each model
    for name, model in models.items():
        model.fit(X_train, y_train)  # Fit the model
        y_pred = model.predict(X_test)  # Make predictions

        mae = mean_absolute_error(y_test, y_pred)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)

        # Store the results
        results.append({
            "Model": name,
            "MAE": mae,
            "RMSE": rmse,
            "R2 Score": r2
        })

    # Return the sorted results by RMSE
    return pd.DataFrame(results).sort_values(by="RMSE")

In [30]:
print(evaluate_regression_models(X, Y))

                       Model            MAE           RMSE  R2 Score
6                    XGBoost  165352.247594  246739.413739  0.512308
1      Lasso Regression (L1)  170870.786261  252029.740667  0.491170
0          Linear Regression  170870.786277  252029.740668  0.491170
3              Random Forest  168343.428823  256248.182112  0.473994
2              Decision Tree  170120.593611  261662.193989  0.451533
5        K-Nearest Neighbors  176919.756772  262332.935369  0.448717
4  Support Vector Regression  219747.736216  362978.747862 -0.055435


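Each model was fitted on the training data and scored on the held-out test data using MAE, RMSE and the `Coefficient of Determination` (R²).

The best model here, XGBoost, reaches an R² of about 0.51, which means it explains roughly 51% of the variance in test-set prices using `sqft_living` alone. While this doesn't necessarily mean the model is poor, about half of the variance remains unexplained, so there is room for improvement. The negative R² for Support Vector Regression means that, with default settings, it does worse on the test data than always predicting the mean price.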
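To make the metric concrete, R² can be computed by hand as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below uses made-up prices and predictions (they are not taken from the dataset) and checks the manual result against `r2_score`. It also shows how R² becomes negative when predictions are worse than simply predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical prices and predictions, for illustration only
y_true = np.array([200_000.0, 350_000.0, 500_000.0, 650_000.0])
y_pred = np.array([250_000.0, 300_000.0, 520_000.0, 600_000.0])

# Manual computation: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(round(r2_manual, 4), round(r2_score(y_true, y_pred), 4))

# A constant prediction far from the mean gives a negative R²
y_bad = np.full_like(y_true, 100_000.0)
r2_bad = r2_score(y_true, y_bad)
print(r2_bad < 0)
```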

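Now we use a broader set of thirteen features with the same regression algorithms and compare the results to find the best performer. The location columns (`zipcode`, `lat`, `long`) and the neighbor columns (`sqft_living15`, `sqft_lot15`) are not included.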

In [31]:
X = df[['sqft_living', 'sqft_lot', 'bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated']]
Y = df['price']

In [33]:
print(evaluate_regression_models(X, Y))

                       Model            MAE           RMSE  R2 Score
3              Random Forest  115583.976128  178454.927215  0.744891
6                    XGBoost  118690.431786  179693.848951  0.741336
1      Lasso Regression (L1)  134497.326922  206149.246421  0.659566
0          Linear Regression  134497.339596  206149.304852  0.659566
5        K-Nearest Neighbors  160978.677458  247851.798866  0.507900
2              Decision Tree  159011.911642  252021.843763  0.491202
4  Support Vector Regression  220730.705907  364160.594745 -0.062319
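In both result tables, Lasso and Linear Regression are almost indistinguishable. One plausible explanation, stated here as an assumption rather than something verified on this dataset, is that `alpha=0.1` is a very weak penalty when the target is measured in hundreds of thousands of dollars. The sketch below illustrates the effect on synthetic data with a price-like scale; `X_demo`, `y_demo`, the coefficients and the larger `alpha` value are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data (hypothetical): 3 standardized features, price-like target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
true_coef = np.array([150_000.0, -80_000.0, 40_000.0])
y_demo = 500_000 + X_demo @ true_coef + rng.normal(scale=50_000, size=500)

ols = LinearRegression().fit(X_demo, y_demo)
weak = Lasso(alpha=0.1).fit(X_demo, y_demo)       # penalty used in this notebook
strong = Lasso(alpha=60_000).fit(X_demo, y_demo)  # penalty on the scale of the coefficients

print("OLS:          ", np.round(ols.coef_))
print("Lasso a=0.1:  ", np.round(weak.coef_))
print("Lasso a=60000:", np.round(strong.coef_))
```

With the weak penalty the coefficients stay practically equal to the ordinary least squares ones, while the strong penalty shrinks them and sets the smallest one to zero. If regularization is meant to matter, `alpha` would need to be tuned on a scale that matches the target.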


### Pipeline or data pipeline

We already saw that using more variables improves the fit: the best test-set R² rose from about 0.51 with `sqft_living` alone to about 0.74. Next, we check whether preprocessing helps further by wrapping each model in a pipeline that standardizes the features and, for the linear models, optionally adds degree-2 polynomial terms.

A `Pipeline` chains the preprocessing steps and the final estimator into a single object, so that the same transformations, fitted on the training data only, are applied in the same order at training time and at prediction time.
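As a minimal sketch of this behavior, the snippet below builds one pipeline on synthetic data with 13 columns, chosen to mirror the 13 features selected above. The data and the names `X_demo`, `y_demo` and `pipe` are assumptions for illustration. It checks that the scaler's statistics come from the training rows only and shows how many columns the degree-2 expansion produces.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 300 rows, 13 features
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(300, 13))
y_demo = 2.0 * X_demo[:, 0] + 0.01 * X_demo[:, 1] ** 2 + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])

# fit() fits each step in order, using the training rows only
pipe.fit(X_train, y_train)

# The scaler's means are those of the training set, not of the full data
scaler = pipe.named_steps["scaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True

# 13 linear terms + 13 squares + 78 pairwise interactions = 104 columns
print(pipe.named_steps["poly"].n_output_features_)  # 104

# predict() reuses the fitted transformations on unseen rows
y_pred = pipe.predict(X_test)
```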

In [None]:
def evaluate_models_with_pipeline(X, y):
    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Models and pipelines
    models = {
        "Linear Regression": Pipeline([
            ("scaler", StandardScaler()),
            ("model", LinearRegression())
        ]),
        "Linear Regression (Poly)": Pipeline([
            ("scaler", StandardScaler()),
            ("poly", PolynomialFeatures(degree=2, include_bias=False)),
            ("model", LinearRegression())
        ]),
        "Lasso": Pipeline([
            ("scaler", StandardScaler()),
            ("model", Lasso(alpha=0.1))
        ]),
        "Lasso (Poly)": Pipeline([
            ("scaler", StandardScaler()),
            ("poly", PolynomialFeatures(degree=2, include_bias=False)),
            ("model", Lasso(alpha=0.1))
        ]),
        "KNN": Pipeline([
            ("scaler", StandardScaler()),
            ("model", KNeighborsRegressor())
        ]),
        "SVR": Pipeline([
            ("scaler", StandardScaler()),
            ("model", SVR())
        ]),
        "Decision Tree": Pipeline([
            ("scaler", StandardScaler()),
            ("model", DecisionTreeRegressor(random_state=42))
        ]),
        "Random Forest": Pipeline([
            ("scaler", StandardScaler()),
            ("model", RandomForestRegressor(random_state=42))
        ]),
        "XGBoost": Pipeline([
            ("scaler", StandardScaler()),
            ("model", XGBRegressor(random_state=42))
        ])
    }

    results = []

    # Training, prediction and evaluation
    for name, pipeline in models.items():
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)

        mae = mean_absolute_error(y_test, y_pred)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)

        results.append({
            "Model": name,
            "MAE": mae,
            "RMSE": rmse,
            "R2": r2,
            "Pipeline": pipeline
        })

    # Results sorted by RMSE
    results_df = pd.DataFrame(results).sort_values("RMSE")

    return results_df[["Model", "MAE", "RMSE", "R2"]]

In [37]:
print(evaluate_models_with_pipeline(X, Y))

                      Model            MAE           RMSE        R2
7             Random Forest  115588.837804  178333.500991  0.745238
8                   XGBoost  118690.431786  179693.848951  0.741336
1  Linear Regression (Poly)  125148.974702  189339.644386  0.712821
3              Lasso (Poly)  125155.924272  189339.712971  0.712821
4                       KNN  129793.162338  203895.945859  0.666968
0         Linear Regression  134497.339596  206149.304852  0.659566
2                     Lasso  134497.357740  206149.307683  0.659566
6             Decision Tree  159503.223098  252150.177925  0.490684
5                       SVR  220241.219285  363591.481095 -0.059001


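For the best model the improvement is practically insignificant: ``Random Forest`` is still first, and its R2 only moves from ``0.744891`` to ``0.745238``. The preprocessing does help other models. ``KNN`` improves from ``0.507900`` to ``0.666968`` once the features are standardized, and adding degree-2 polynomial terms lifts ``Linear Regression`` and ``Lasso`` from ``0.659566`` to ``0.712821``. ``XGBoost`` is unchanged and ``Decision Tree`` barely moves.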
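The gain for KNN is consistent with how distance-based models behave: without scaling, features with large numeric ranges dominate the distance computation. The sketch below illustrates this on synthetic data and is not a reproduction of the results above. The two features, their ranges and the names `X_demo` and `y_demo` are assumptions chosen to make the effect easy to see.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic data (hypothetical): one informative feature with a small range,
# one uninformative feature with a large range
rng = np.random.default_rng(0)
n = 500
small = rng.uniform(0, 1, size=n)
large = rng.uniform(0, 10_000, size=n)
X_demo = np.column_stack([small, large])
y_demo = 10 * small + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# KNN on raw features: distances are dominated by the large-range column
raw = KNeighborsRegressor().fit(X_train, y_train)

# KNN after standardization: both columns contribute on the same scale
scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor()),
]).fit(X_train, y_train)

r2_raw = r2_score(y_test, raw.predict(X_test))
r2_scaled = r2_score(y_test, scaled.predict(X_test))
print(f"R2 without scaling: {r2_raw:.3f}")
print(f"R2 with scaling:    {r2_scaled:.3f}")
```

Tree-based models split on one feature at a time using thresholds, so rescaling a feature leaves their splits essentially unchanged. This matches the near-identical Random Forest and XGBoost scores in the two tables.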

<hr>

## Author

<a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/">**Flavio Aguirre**</a><br>
<a href="https://coursera.org/share/e27ae5af81b56f99a2aa85289b7cdd04">***Data Scientist***</a>