# 🏡 House Prices - Advanced Regression Techniques 

In this notebook, we'll try to predict house prices in Ames, Iowa. The dataset has 79 features (independent variables) to work with, and it was created by Dean De Cock for data science education. You can get more information by visiting the Kaggle page:

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/description 

Let's begin by defining our objective:

* **Objective**: Given 79 features, we'd like to predict *the sale price of a house* with the minimum amount of error, measured by Root-Mean-Squared-Log-Error (RMSLE).
* This is a *supervised learning* example, since the model will be trained on data with *labeled examples*.
* This is a typical *regression* problem, since we're trying to predict a numeric value. Moreover, this is a *univariate regression* problem because we are predicting only one target value: the price of the given house.
* **Evaluation**:
> From the page: Submissions are evaluated on **Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.** (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
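
The metric quoted above can be sketched in a few lines of NumPy. The helper name `rmsle` and the example prices are assumptions made for illustration; they are not taken from the competition's own code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between the logs of the observed and the predicted prices."""
    log_true = np.log(np.asarray(y_true, dtype=float))
    log_pred = np.log(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_true) ** 2)))

# Hypothetical prices: the same 10% overestimate on a cheap and an expensive house
cheap_error = rmsle([100_000], [110_000])
expensive_error = rmsle([500_000], [550_000])
print(cheap_error, expensive_error)
```

Both calls give essentially the same error, the log of 1.1, which is what the quote means when it says cheap and expensive houses affect the result equally.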
 

### Getting Started with Data

I downloaded the training data and uploaded it to my Google Drive account. To access the data, we'll mount the drive in this notebook.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Standard tools for data analysis; we'll use the Plotly library for charts
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import scipy.stats as stats
from IPython.display import display, HTML

In [None]:
# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/house-prices-advanced-regression/data/train.csv")

### **1. Exploratory Data Analysis**

In [None]:
df.shape

(1460, 81)

In [None]:
# Function to create a scrollable table for better visualization,
# because there are too many variables to display at once
def create_scrollable_table(df, table_id, title):
    html = f"<h3>{title}</h3>"  # close the heading tag properly
    html += f'<div id="{table_id}" style="height:200px; overflow:auto;">'
    html += df.to_html() 
    html += '</div>'
    return html

In [None]:
numerical_features = df.select_dtypes(include=[np.number])
numerical_features.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [None]:
# Select all the numerical variables and visualize them in a scrollable window
numerical_features = df.select_dtypes(include=[np.number])
summary_stats = numerical_features.describe().T
html_numerical = create_scrollable_table(summary_stats, "numerical_features", "Summary statistics for numerical features")

display(HTML(html_numerical))

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Id,1460.0,730.5,421.610009,1.0,365.75,730.5,1095.25,1460.0
MSSubClass,1460.0,56.89726,42.300571,20.0,20.0,50.0,70.0,190.0
LotFrontage,1201.0,70.049958,24.284752,21.0,59.0,69.0,80.0,313.0
LotArea,1460.0,10516.828082,9981.264932,1300.0,7553.5,9478.5,11601.5,215245.0
OverallQual,1460.0,6.099315,1.382997,1.0,5.0,6.0,7.0,10.0
OverallCond,1460.0,5.575342,1.112799,1.0,5.0,5.0,6.0,9.0
YearBuilt,1460.0,1971.267808,30.202904,1872.0,1954.0,1973.0,2000.0,2010.0
YearRemodAdd,1460.0,1984.865753,20.645407,1950.0,1967.0,1994.0,2004.0,2010.0
MasVnrArea,1452.0,103.685262,181.066207,0.0,0.0,0.0,166.0,1600.0
BsmtFinSF1,1460.0,443.639726,456.098091,0.0,0.0,383.5,712.25,5644.0


There are 79 features we can use for our model's prediction. This is a lot to go over one by one. But pay attention to this: variables like YrSold or YearBuilt are stored as numbers (YearBuilt, for example, ranges from 1872 to 2010), but they don't have to be treated as continuous. We can convert them to categories later.
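
As a rough illustration of treating a year column as a category, here is a minimal sketch on a made-up four-row frame. The `toy` DataFrame and its values are assumptions, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset with only the sale year and price
toy = pd.DataFrame({
    "YrSold": [2006, 2008, 2008, 2010],
    "SalePrice": [150_000, 200_000, 180_000, 210_000],
})

# Treat the year as a label rather than a quantity
toy["YrSold"] = toy["YrSold"].astype("category")

print(toy["YrSold"].dtype)
print(list(toy["YrSold"].cat.categories))

# Average price per sale year, using only the years that actually occur
mean_by_year = toy.groupby("YrSold", observed=True)["SalePrice"].mean()
print(mean_by_year)
```

A column converted this way is no longer matched by `select_dtypes(include=[np.number])`, so it would be handled by whatever preprocessing is applied to categorical columns.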



In [None]:
# Summary statistics for categorical features
categorical_features = df.select_dtypes(include=[object])
cat_summary_stats = categorical_features.describe().T 
html_categorical = create_scrollable_table(cat_summary_stats, 'categorical_features', 'Summary statistics for categorical features')

display(HTML(html_categorical))

Unnamed: 0,count,unique,top,freq
MSZoning,1460,5,RL,1151
Street,1460,2,Pave,1454
Alley,91,2,Grvl,50
LotShape,1460,4,Reg,925
LandContour,1460,4,Lvl,1311
Utilities,1460,2,AllPub,1459
LotConfig,1460,5,Inside,1052
LandSlope,1460,3,Gtl,1382
Neighborhood,1460,25,NAmes,225
Condition1,1460,9,Norm,1260


In [None]:
# Null values in the dataset 
null_values = df.isnull().sum()
html_null_values = create_scrollable_table(null_values.to_frame(), 'null_values', 'Null values in dataset')

# Percentage of missing values for each feature
missing_percentage = (df.isnull().sum()/len(df)) * 100
html_missing_percentage = create_scrollable_table(missing_percentage.to_frame(), 'missing_percentage', "Percentage of missing values per feature")

display(HTML(html_null_values + html_missing_percentage))

Unnamed: 0,0
Id,0
MSSubClass,0
MSZoning,0
LotFrontage,259
LotArea,0
Street,0
Alley,1369
LotShape,0
LandContour,0
Utilities,0

Unnamed: 0,0
Id,0.0
MSSubClass,0.0
MSZoning,0.0
LotFrontage,17.739726
LotArea,0.0
Street,0.0
Alley,93.767123
LotShape,0.0
LandContour,0.0
Utilities,0.0
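
Most rows in these tables are zero, so it can help to keep only the features that actually have missing values and sort them. The sketch below does this on a small made-up frame; `toy` and its values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two incomplete columns and one complete column
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# The mean of the boolean null mask is the fraction of missing values
missing_pct = toy.isnull().mean() * 100

# Keep only features with missing values, largest share first
missing_only = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_only)
```

The same two lines should work unchanged on the full `df`, where features such as Alley would appear at the top of the list.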


In [None]:
# Get a list of all the columns
df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [None]:
# We don't need the Id column in our analysis, so we can drop it.
df = df.drop("Id", axis=1)
df.columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

### 1.2. Explore the Dependent Variable

* Should we normalize the dependent variable?

Normalizing the dependent variable (SalePrice), that is, transforming it so that its distribution is closer to normal, might be necessary in certain scenarios. Some models, such as linear regression, behave best when the target (more precisely, the residuals) is approximately normally distributed. A less skewed target can also improve the model's performance, because a handful of very expensive houses no longer dominates the squared error.

By visually inspecting the distribution of SalePrice and comparing it to a fitted normal distribution, we can assess if normalization is necessary or if any transformations are required to meet the assumptions of the chosen machine learning model.

In [None]:
import scipy.stats as stats

# Fit normal distribution to the SalePrice data 
mu, sigma = stats.norm.fit(df.SalePrice) # returns mean, standard deviation 

# Create a histogram of the SalePrice column
hist_data = go.Histogram(x=df['SalePrice'], 
                         nbinsx=50, 
                         name="Histogram", 
                         opacity=0.75, 
                         histnorm="probability density", 
                         marker=dict(color='red'))
# Calculate the normal distribution based on the fitted parameters
# to generate a set of 100 equally spaced values between the min and max SalePrice values
x_norm = np.linspace(df['SalePrice'].min(), df['SalePrice'].max(), 100)  
y_norm = stats.norm.pdf(x_norm, mu, sigma)

# Create the normal distribution overlay 
norm_data = go.Scatter(x=x_norm, 
                       y=y_norm, 
                       mode='lines',
                       name=f'Normal dist. (μ={mu:.2f}, σ={sigma:.2f})',
                       line=dict(color='blue'))
# Combine histogram and overlay 
fig = go.Figure(data=[hist_data, norm_data])

# Set the layout for the plot
fig.update_layout(
    title="SalePrice Distribution",
    xaxis_title="SalePrice",
    yaxis_title="Density",
    legend_title_text="Fitted Normal Distribution",
    plot_bgcolor='rgba(32, 32, 32, 1)',
    paper_bgcolor='rgba(32, 32, 32, 1)',
    font=dict(color='white'))

**What is a Q-Q Plot?**

A Q-Q plot (Quantile-Quantile plot) is a graphical tool used to assess the similarity between the observed data and a theoretical distribution. It is commonly used to determine if a dataset follows a specific probability distribution, such as the normal distribution.

The Q-Q plot compares the quantiles of the observed data against the quantiles of the theoretical distribution. The x-axis represents the quantiles of the theoretical distribution, while the y-axis represents the quantiles of the observed data. If the data perfectly follows the theoretical distribution, the points in the Q-Q plot will lie on a straight line.
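
To make the idea concrete, the sketch below runs `scipy.stats.probplot` on two synthetic samples and compares the correlation coefficient `r` of the fitted line. The samples are generated here for illustration and are not related to the housing data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)      # synthetic, close to normal
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # synthetic, right-skewed

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
(theo_q, ordered), (slope, intercept, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r for the normal sample: {r_normal:.3f}")
print(f"r for the skewed sample: {r_skewed:.3f}")
```

An `r` close to 1 means the points sit close to a straight line. Note that `probplot` already returns the slope and intercept of the least-squares line, so a separate regression call is optional.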

In [None]:
# Create a Q-Q plot
qq_data = stats.probplot(df['SalePrice'], dist="norm")
qq_fig = px.scatter(x=qq_data[0][0], 
                    y=qq_data[0][1], 
                    labels={'x': 'Theoretical Quantiles', 'y': 'Ordered Values'}, 
                    color_discrete_sequence=["red"])
qq_fig.update_layout(
    title="Q-Q plot",
    plot_bgcolor='rgba(32, 32, 32, 1)',
    paper_bgcolor='rgba(32, 32, 32, 1)',
    font=dict(color='white')
)

# Calculate the line of best fit
slope, intercept, r_value, p_value, std_err = stats.linregress(qq_data[0][0], qq_data[0][1])
line_x = np.array(qq_data[0][0])
line_y = intercept + slope * line_x

# Add the line of best fit to the Q-Q plot
line_data = go.Scatter(x=line_x, y=line_y, mode="lines", name="Normal Line", line=dict(color="blue"))

# Update the Q-Q plot with the normal line
qq_fig.add_trace(line_data)

# Show the plot
qq_fig.show()


The SalePrice column is right-skewed. Without this skew, the points would lie along the straight blue line. We can use a log transformation later to bring the distribution closer to normal.
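
To see how a log transformation affects skew, here is a small sketch on synthetic right-skewed data. The `prices` array is generated from a log-normal distribution as a stand-in; it is an assumption, not the real SalePrice column.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=2000)  # synthetic, not the real SalePrice

raw_skew = stats.skew(prices)          # clearly positive for right-skewed data
log_skew = stats.skew(np.log(prices))  # close to zero after the transformation

print(f"skew before log: {raw_skew:.2f}")
print(f"skew after log:  {log_skew:.2f}")
```

Running the same two `stats.skew` calls on `df['SalePrice']` would give a quick numeric check to go with the plots.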

### Questions to ask about our data:

We can examine how different features of the dataset are related to SalePrice: 

Some examples:
1. How are dwelling types related to sale prices?

2. How does the type of road access (Street) impact the sale price (SalePrice)?

3. Does the presence or type of alley access (Alley) affect the sale price (SalePrice)?

*Dwelling type and sale prices:* 

In [None]:
# Select value counts of BldgType
value_counts = df['BldgType'].value_counts()

# Create a bar chart for value counts
fig1 = go.Figure(data=go.Bar(x=value_counts.index, y=value_counts.values, text=value_counts.values, textposition='auto', marker_color='red'))

# Set the axis labels and title for value counts chart
fig1.update_layout(
    xaxis_title='Building Type',
    yaxis_title='Count',
    title='Value Counts of Building Type',
    bargap=0.4  # Adjust the gap between bars for more distance
)

# Group the data by building type and calculate the average sale price
grouped_data = df.groupby('BldgType')['SalePrice'].mean().reset_index()

# Format the sale price values with $ sign and thousand separators
formatted_values = grouped_data['SalePrice'].apply(lambda x: '${:,.2f}'.format(x))

# Create a bar chart for average sale price with formatted values
fig2 = go.Figure(data=go.Bar(x=grouped_data['BldgType'], y=grouped_data['SalePrice'], text=formatted_values,
                            textposition='auto', marker_color='blue'))

# Set the axis labels and title for average sale price chart
fig2.update_layout(
    xaxis_title='Building Type',
    yaxis_title='Sale Price',
    title='Average Sale Price by Building Type',
    bargap=0.4  # Adjust the gap between bars for more distance
)

# Set the dark theme for both plots
fig1.update_layout(template='plotly_dark')
fig2.update_layout(template='plotly_dark')

# Display both plots
fig1.show()
fig2.show()


*Street access and sale prices:*

In [None]:
import locale

# Set the locale for formatting average prices with commas and dollar sign
locale.setlocale(locale.LC_ALL, '')

# Create a box plot for the sale price by street type
fig1 = go.Figure()
for street_type in df['Street'].unique():
    fig1.add_trace(go.Box(y=df[df['Street'] == street_type]['SalePrice'], name=street_type))

# Set the axis labels and title for the box plot
fig1.update_layout(
    xaxis_title='Street',
    yaxis_title='Sale Price',
    title='Sale Price Distribution by Street Type',
    template='plotly_dark'
)

# Calculate the average sale price by street type
average_prices = df.groupby('Street')['SalePrice'].mean().reset_index()

# Format the average prices with dollar sign and commas
average_prices['FormattedPrice'] = average_prices['SalePrice'].apply(lambda x: locale.currency(x, grouping=True))

# Create a bar chart for the average sale price
fig2 = go.Figure(data=go.Bar(x=average_prices['Street'], y=average_prices['SalePrice'], text=average_prices['FormattedPrice'], textposition='auto', marker_color='red'))

# Set the axis labels and title for the bar chart
fig2.update_layout(
    xaxis_title='Street',
    yaxis_title='Average Sale Price',
    title='Average Sale Price by Street Type',
    template='plotly_dark',
    bargap=0.4  # Adjust the gap between bars for more distance
)

# Display both plots
fig1.show()
fig2.show()


Finally, *alley access and sale price:*

In [None]:
# Select value counts of Alley access
value_counts = df['Alley'].value_counts()

# Create a bar chart for value counts
fig1 = go.Figure(data=go.Bar(x=value_counts.index, y=value_counts.values, text=value_counts.values, textposition='auto', marker_color='red'))

# Set the axis labels and title for value counts chart
fig1.update_layout(
    xaxis_title='Alley Access',
    yaxis_title='Count',
    title='Value Counts of Alley Access',
    bargap=0.4  # Adjust the gap between bars for more distance
)

# Group the data by Alley access and calculate the average sale price
grouped_data = df.groupby('Alley')['SalePrice'].mean().reset_index()

# Format the sale price values with $ sign and thousand separators
formatted_values = grouped_data['SalePrice'].apply(lambda x: '${:,.2f}'.format(x))

# Create a bar chart for average sale price with formatted values
fig2 = go.Figure(data=go.Bar(x=grouped_data['Alley'], y=grouped_data['SalePrice'], text=formatted_values,
                            textposition='auto', marker_color='blue'))

# Set the axis labels and title for average sale price chart
fig2.update_layout(
    xaxis_title='Alley Access',
    yaxis_title='Average Sale Price',
    title='Average Sale Price by Alley Access',
    bargap=0.4  # Adjust the gap between bars for more distance
)

# Set the dark theme for both plots
fig1.update_layout(template='plotly_dark')
fig2.update_layout(template='plotly_dark')

# Display both plots
fig1.show()
fig2.show()


In [None]:
import numpy as np
import scipy.stats as stats
import plotly.graph_objects as go

# Select numeric features from the DataFrame
numeric_features = df.select_dtypes(include='number').columns.tolist()

# Create a separate histogram plot for each numeric feature
for feature in numeric_features:
    # Normalize to a probability density so the histogram and the fitted PDF share one scale
    fig = go.Figure(data=go.Histogram(x=df[feature], nbinsx=30,
                                      histnorm='probability density', marker_color='blue'))

    # Overlay the normal PDF built from the feature's sample mean and standard deviation
    x_range = np.linspace(df[feature].min(), df[feature].max(), num=100)
    fitted_line = stats.norm.pdf(x_range, loc=df[feature].mean(), scale=df[feature].std())
    fig.add_trace(go.Scatter(x=x_range, y=fitted_line, mode='lines', name='Fitted Line', line_color='red'))

    # Set the axis labels and title for the plot
    fig.update_layout(
        xaxis_title=feature,
        yaxis_title='Density',
        title=f'Distribution of {feature}',
        template='plotly_dark'
    )

    # Show the plot
    fig.show()


As you can see, most numeric features have skewed distributions, so we'll impute missing numeric values with the median, which is more robust to outliers than the mean.
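
The robustness of the median can be shown with a tiny example. The five values below are made up, with one deliberate outlier, to compare the two `SimpleImputer` strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature column with one missing value and one outlier (1000)
values = np.array([[1.0], [2.0], [3.0], [np.nan], [1000.0]])

median_filled = SimpleImputer(strategy="median").fit_transform(values)
mean_filled = SimpleImputer(strategy="mean").fit_transform(values)

# The fourth row held the missing value
print("median fill:", median_filled[3, 0])
print("mean fill:  ", mean_filled[3, 0])
```

The median fill stays near the bulk of the data, while the mean fill is pulled far toward the outlier.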

### **2. Create Data Pipeline**

Creating a Pipeline ensures that the preprocessing steps are applied consistently to all data. It also keeps the code organized, automates the preprocessing workflow, and integrates well with machine learning models, which makes the analysis easier to reproduce and deploy.
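
What "applied consistently" means in practice is that statistics learned from one set of rows are reused on any new rows. The sketch below shows this on two made-up frames; the names `train`, `new` and `prep` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows and one new row with a missing value and an unseen category
train = pd.DataFrame({"LotArea": [8000.0, np.nan, 12000.0],
                      "Street": ["Pave", "Grvl", "Pave"]})
new = pd.DataFrame({"LotArea": [np.nan], "Street": ["Dirt"]})

numeric_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

prep = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, ["LotArea"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Street"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_out = prep.fit_transform(train)  # learns the median, scale and categories
new_out = prep.transform(new)          # reuses them without refitting

print(train_out.shape)
print(new_out)
```

The missing `LotArea` in the new row is filled with the training median, which scales to 0, and the unseen street type `Dirt` is encoded as all zeros, so the new row has the same columns as the training data.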









In [None]:
# Import the required libraries
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# This transformer handles the numerical columns in the dataset.
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # replace missing values by median
    ('scaler', StandardScaler()) # Scale the data using standard scaler
])


# Create a categorical transformer
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), # create new cat for missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)) # ignore unknown categories, return dense arrays (`sparse` was renamed to `sparse_output` in scikit-learn 1.2)
])


In [None]:
# Select categorical and numerical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
numeric_columns = df.select_dtypes(include=['number']).columns

# Drop target variable from numeric columns
numeric_columns = numeric_columns.drop('SalePrice')

# Use Columntransformer to combine transformers
preprocessor = ColumnTransformer(
    transformers = [
        ("numeric", numerical_transformer, numeric_columns),
        ("categorical", categorical_transformer, categorical_columns)],
        remainder="passthrough")

In [None]:
# Create a Pipeline with this preprocessor
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor)
    ])

# Apply this Pipeline to dataset
X = df.drop("SalePrice", axis=1)
y = np.log(df["SalePrice"]) # log-transform the dependent (y) variable to reduce its right skew
X_preprocessed = pipeline.fit_transform(X)





### **3. Train Model**

First, we'll import several algorithms and try them on the training data. Later, we'll choose a subset that performed the best, and we'll do hyperparameter tuning on them to improve the model's performance. 


In [None]:
# Import algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
# Evaluation function
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error

In [None]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

In [None]:
# Define a dictionary of regression algorithms
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(),
    'Random Forest': RandomForestRegressor(),
    'Support Vector Machine': SVR(),
    'K-Nearest Neighbors': KNeighborsRegressor(),
    'Gradient Boosting': XGBRegressor()
}

scores = {}

In [None]:
cv = KFold(n_splits=3, shuffle=True, random_state=42)

Explanation on the cv object: This code initializes a 3-fold cross-validation object (cv) with shuffling enabled and a specific random state. In this case, n_splits is set to 3, so the data will be split into 3 subsets or folds. Data will be randomly shuffled before splitting. Shuffling the data helps in reducing any potential biases that may be present in the original order of the data. Random state is specified for reproducibility. 

You can pass this cv object (created with the KFold class) to the cv parameter of cross_val_score later.
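
A minimal sketch of how such an object behaves, using a made-up linear dataset instead of the housing data. The names `X_toy`, `y_toy` and `cv_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: 12 samples that follow y = 2x + 1 exactly
X_toy = np.arange(12, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0

cv_toy = KFold(n_splits=3, shuffle=True, random_state=42)

# Each of the 3 folds holds out a different third of the shuffled rows
fold_sizes = [len(test_idx) for _, test_idx in cv_toy.split(X_toy)]
print(fold_sizes)

# One negative-MSE score per fold; negate and take the root to get an RMSE
fold_scores = cross_val_score(LinearRegression(), X_toy, y_toy,
                              cv=cv_toy, scoring="neg_mean_squared_error")
rmse = np.sqrt(-fold_scores.mean())
print(len(fold_scores), rmse)
```

Because the same `random_state` is used, the folds are identical on every run, which keeps scores comparable across models.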

In [None]:
for name, model in models.items():
    print(f"Evaluation {name}..")
    # Score the model with cross-validation (negative MSE per fold, averaged)
    score = np.mean(cross_val_score(model,
                                    X_preprocessed,
                                    y,
                                    cv=cv,
                                    scoring="neg_mean_squared_error"))
    # Negate to get a positive MSE, then take the square root to store the RMSE
    scores[name] = np.sqrt(-score)
print(scores)

Evaluation Linear Regression..
Evaluation Decision Tree..
Evaluation Random Forest..
Evaluation Support Vector Machine..
Evaluation K-Nearest Neighbors..
Evaluation Gradient Boosting..
{'Linear Regression': 1094921217.2471867, 'Decision Tree': 0.2202366400657777, 'Random Forest': 0.14977577449462873, 'Support Vector Machine': 0.14417618033337998, 'K-Nearest Neighbors': 0.17574327222954278, 'Gradient Boosting': 0.14682223493959806}


The three algorithms with the lowest cross-validated RMSE (lower is better) in our initial training are:

* Random Forest
* Support Vector Machine
* Gradient Boosting
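
The selection above can be reproduced by sorting the scores dictionary. In the sketch below the values are copied, rounded, from the printed output; the variable names `initial_scores` and `top_three` are assumptions.

```python
# RMSE values copied (rounded) from the printed cross-validation output
initial_scores = {
    "Linear Regression": 1094921217.2472,
    "Decision Tree": 0.2202,
    "Random Forest": 0.1498,
    "Support Vector Machine": 0.1442,
    "K-Nearest Neighbors": 0.1757,
    "Gradient Boosting": 0.1468,
}

# Sort by RMSE in ascending order (lower is better) and keep the first three names
ranked = sorted(initial_scores.items(), key=lambda item: item[1])
top_three = [name for name, _ in ranked[:3]]
print(top_three)
```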

### **4. Hyperparameter Tuning with GridSearchCV**

We'll create hyperparameter grids for the subset of algorithms that performed best in our initial test.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Define the hyperparameter grids for each algorithm

parameter_grids = { 
        "XGBoost" : {'n_estimators': [100, 200, 500],
                    'learning_rate': [0.01, 0.1, 0.3],
                    'max_depth': [3, 6, 10] },

        "RandomForest" : {'n_estimators': [100, 200, 300],
                          'max_depth': [None, 5, 10],
                          'min_samples_split': [2, 5, 10]},
        "SVM" : {'C': [0.1, 1, 10],
                 'kernel': ['linear', 'rbf'],
                 'gamma': ['scale', 'auto']}

}
best_scores = {}
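
Before running a grid search it can be useful to estimate how many model fits it will trigger. The sketch below counts the combinations with `ParameterGrid`; the grids are copied from the cell above, and `n_folds = 5` is an assumption that matches a 5-fold search.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from the notebook cell above
grids = {
    "XGBoost": {"n_estimators": [100, 200, 500],
                "learning_rate": [0.01, 0.1, 0.3],
                "max_depth": [3, 6, 10]},
    "RandomForest": {"n_estimators": [100, 200, 300],
                     "max_depth": [None, 5, 10],
                     "min_samples_split": [2, 5, 10]},
    "SVM": {"C": [0.1, 1, 10],
            "kernel": ["linear", "rbf"],
            "gamma": ["scale", "auto"]},
}

n_folds = 5  # assumed number of cross-validation folds

# Every parameter combination is fitted once per fold
fits = {name: len(ParameterGrid(grid)) * n_folds for name, grid in grids.items()}
print(fits)
print("total fits:", sum(fits.values()))
```

By default `GridSearchCV` also refits the best combination once on the whole training set, so the real count is slightly higher.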

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Create a dictionary to store the best scores for each algorithm
best_scores = {}

# Iterate over the parameter grids for each algorithm
for algorithm, param_grid in parameter_grids.items():
    if algorithm == 'XGBoost':
        # Initialize the XGBoost regressor
        regressor = XGBRegressor()

    elif algorithm == 'RandomForest':
        # Initialize the RandomForest regressor
        regressor = RandomForestRegressor()

    elif algorithm == 'SVM':
        # Initialize the SVM regressor
        regressor = SVR()

    # Perform grid search using cross-validation
    grid_search = GridSearchCV(regressor, param_grid, scoring='neg_mean_squared_error', cv=5)
    grid_search.fit(X_train, y_train)  # Assuming you have X_train and y_train defined

    # Get the best parameters and the best cross-validated MSE
    # (scikit-learn reports the negated MSE, so we flip the sign)
    best_params = grid_search.best_params_
    best_score = -grid_search.best_score_

    # Store the best score for the algorithm
    best_scores[algorithm] = best_score

    # Make predictions on the test data using the best model
    y_pred = grid_search.predict(X_test)  # Assuming you have X_test defined

    # Calculate the RMSE for the predictions
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{algorithm} - Best Parameters: {best_params}")
    print(f"{algorithm} - Best Score: {best_score}")
    print(f"{algorithm} - RMSE: {rmse}\n")

# Print the best scores for each algorithm
print("Best Scores:")
for algorithm, score in best_scores.items():
    print(f"{algorithm}: {score}")
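
Because the target was log-transformed (`y = np.log(df["SalePrice"])`), the scores above are on the log scale and the models predict log prices. A minimal sketch of converting predictions back to dollars is shown below; the prices and the constant prediction offset are made-up values.

```python
import numpy as np

# Hypothetical observed prices and their log-scale targets
prices = np.array([100_000.0, 250_000.0])
log_prices = np.log(prices)

# Hypothetical model output on the log scale: each prediction is 0.05 too high
log_pred = log_prices + 0.05

# np.exp reverses np.log, so predictions are back in dollars
pred_prices = np.exp(log_pred)
print(pred_prices)
print(pred_prices / prices)
```

An error of 0.05 on the log scale corresponds to the same relative overestimate (about 5%) for both houses, regardless of their price.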
