<a href="https://www.kaggle.com/code/vidhikishorwaghela/house-prices-predictions?scriptVersionId=117543739" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### COMPETITION DESCRIPTION:

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

### SKILLS I'M GOING TO USE:

- Creative feature engineering
- Advanced regression techniques such as random forest and gradient boosting

The code below builds machine learning models that predict the final sale price of each home in Ames, Iowa. Although the dataset provides 79 explanatory variables, the models here are trained on four engineered features derived from a handful of them.

The following steps are performed in the code:

1. Importing the required libraries (pandas, numpy, scikit-learn)
2. Loading the training dataset using pandas
3. Creating new features like total area, age of the house, ratio of living area to total area, ratio of porch area to total area
4. Selecting the features to be used in the model
5. Splitting the data into training and validation sets
6. Preprocessing the data using SimpleImputer (a OneHotEncoder is also defined, but the selected features are all numeric, so it is not used)
7. Defining two models: a Random Forest Regressor and a Gradient Boosting Regressor
8. Creating a pipeline for each model and fitting it to the training data
9. Making predictions on the validation set
10. Evaluating the models using mean absolute error and mean squared error


In [1]:
# Import the required libraries
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

In [2]:
# Load the training dataset
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')

In [3]:
# Create a new feature: total area (square feet)
df["TotalArea"] = df["GrLivArea"] + df["TotalBsmtSF"] + df["GarageArea"]

In [4]:
# Create a new feature: age of the house
df["Age"] = df["YrSold"] - df["YearBuilt"]

In [6]:
# Create a new feature: ratio of living area to total area
df["LivAreaRatio"] = df["GrLivArea"] / df["TotalArea"]

In [7]:
# Create a new feature: ratio of porch area to total area
porch_area = df["OpenPorchSF"] + df["EnclosedPorch"] + df["3SsnPorch"] + df["ScreenPorch"]
df["PorchAreaRatio"] = porch_area / df["TotalArea"]
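
The same four features are computed again later for the test set, so one option is to wrap the feature engineering in a single function and apply it to both files. The sketch below assumes a helper named `add_features` (a hypothetical name, not part of the original notebook) and uses a made-up one-row DataFrame purely for illustration.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the four engineered features added."""
    out = df.copy()
    # Total area: above-ground living area + basement + garage (square feet)
    out["TotalArea"] = out["GrLivArea"] + out["TotalBsmtSF"] + out["GarageArea"]
    # Age of the house at the time of sale
    out["Age"] = out["YrSold"] - out["YearBuilt"]
    # Share of the total area that is living area
    out["LivAreaRatio"] = out["GrLivArea"] / out["TotalArea"]
    # Share of the total area taken up by porches
    porch_area = (
        out["OpenPorchSF"] + out["EnclosedPorch"] + out["3SsnPorch"] + out["ScreenPorch"]
    )
    out["PorchAreaRatio"] = porch_area / out["TotalArea"]
    return out


# Made-up row for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "GrLivArea": [1500], "TotalBsmtSF": [1000], "GarageArea": [500],
    "YrSold": [2008], "YearBuilt": [1998],
    "OpenPorchSF": [60], "EnclosedPorch": [0], "3SsnPorch": [0], "ScreenPorch": [90],
})
toy_features = add_features(toy)
print(toy_features[["TotalArea", "Age", "LivAreaRatio", "PorchAreaRatio"]])
```

Calling the same function on the training and test DataFrames keeps the two feature sets in sync if a formula changes.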

In [8]:
# Select the features and the target
X = df[["TotalArea", "Age", "LivAreaRatio", "PorchAreaRatio"]]
y = df["SalePrice"]

In [9]:
# Split the data into training and validation sets (80% / 20%)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
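
A single 80/20 split gives one error estimate that depends on which rows land in the validation set. As a hedged alternative, k-fold cross-validation averages the error over several splits. The sketch below runs on synthetic data (the arrays `X_demo` and `y_demo` are invented for illustration and are not the Ames data), using scikit-learn's `cross_val_score` with the `neg_mean_absolute_error` scorer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: 200 rows, 4 numeric features (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = 100_000 + 50_000 * X_demo[:, 0] + rng.normal(0, 5_000, size=200)

# 5 shuffled folds; each row is used for validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn scorers follow "higher is better", so MAE is returned negated
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error")
cv_mae = -scores

print("MAE per fold:", cv_mae)
print("Mean MAE:", cv_mae.mean(), "Std:", cv_mae.std())
```

The spread across folds gives a rough sense of how much a single-split score could move by chance.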

In [10]:
# Preprocessing for numerical data: fill missing values with the column mean
numerical_transformer = SimpleImputer(strategy='mean')

In [11]:
# Preprocessing for categorical data
# Note: this encoder is defined but not used below, because all selected features are numeric
categorical_transformer = OneHotEncoder()

In [12]:
# Bundle the preprocessing steps (only numerical columns are present in X_train)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, X_train.select_dtypes(include=np.number).columns)
    ])
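
If categorical columns were added to the feature set, the `ColumnTransformer` could route them through a `OneHotEncoder` alongside the numeric imputer. The following is a minimal sketch under that assumption; the toy DataFrame and the name `demo_preprocessor` are invented for illustration, and `Neighborhood` stands in for any categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data: one numeric column with a gap, one categorical column (illustration only)
toy = pd.DataFrame({
    "TotalArea": [3000.0, np.nan, 2500.0, 4000.0],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown"],
})

num_cols = ["TotalArea"]
cat_cols = ["Neighborhood"]

demo_preprocessor = ColumnTransformer(
    transformers=[
        # Numeric columns: fill missing values with the column mean
        ("num", SimpleImputer(strategy="mean"), num_cols),
        # Categorical columns: one-hot encode; unseen categories become all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

encoded = demo_preprocessor.fit_transform(toy)
print(encoded.shape)  # 1 numeric column + 3 one-hot columns

# A category that was not seen during fit does not raise an error
unseen = demo_preprocessor.transform(
    pd.DataFrame({"TotalArea": [1000.0], "Neighborhood": ["NotSeenBefore"]})
)
print(unseen)
```

Setting `handle_unknown="ignore"` matters when the test set contains categories that never appear in the training split; the default setting raises an error in that case. Categorical columns with missing values would additionally need an imputation step before encoding.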

In [13]:
# Define the two models
rf = RandomForestRegressor(n_estimators=100, random_state=42)
gb = GradientBoostingRegressor(random_state=42)


In [14]:
# Create one pipeline per model (the final step is named 'classifier', although both models are regressors)
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', rf)])

gb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', gb)])

In [15]:
# Fit both pipelines to the training data
rf_pipeline.fit(X_train, y_train)
gb_pipeline.fit(X_train, y_train)


Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', SimpleImputer(),
                                                  Index(['TotalArea', 'Age', 'LivAreaRatio', 'PorchAreaRatio'], dtype='object'))])),
                ('classifier', GradientBoostingRegressor(random_state=42))])

In [16]:
# Make predictions on the validation set
rf_val_preds = rf_pipeline.predict(X_val)
gb_val_preds = gb_pipeline.predict(X_val)


In [17]:
# Evaluate the models using mean absolute error and mean squared error
print("Random Forest MAE:", mean_absolute_error(y_val, rf_val_preds))
print("Random Forest MSE:", mean_squared_error(y_val, rf_val_preds))
print("Gradient Boosting MAE:", mean_absolute_error(y_val, gb_val_preds))
print("Gradient Boosting MSE:", mean_squared_error(y_val, gb_val_preds))

Random Forest MAE: 21856.327423352905
Random Forest MSE: 1222006238.9447064
Gradient Boosting MAE: 22541.704284093335
Gradient Boosting MSE: 1212398798.7224472
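
MSE is expressed in squared dollars, which makes the numbers above hard to read; taking the square root (RMSE) brings the error back to dollars. In addition, as far as I can tell from the competition's evaluation page, the leaderboard scores RMSE between the logarithms of predicted and actual prices, so a log-scale metric is a closer proxy for the leaderboard. The helpers below (`rmse` and `rmse_log`, both hypothetical names) are a small sketch; the toy prices are made up, while the two MSE values are copied from the output above.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_log(y_true, y_pred):
    """RMSE computed on log-prices; requires strictly positive values."""
    return rmse(np.log(y_true), np.log(y_pred))


# Convert the MSE values printed above into RMSE (dollars)
rf_rmse = float(np.sqrt(1222006238.9447064))
gb_rmse = float(np.sqrt(1212398798.7224472))
print("Random Forest RMSE:", round(rf_rmse))
print("Gradient Boosting RMSE:", round(gb_rmse))

# Made-up prices, for illustration only
toy_true = [100_000, 200_000]
toy_pred = [110_000, 180_000]
print("Toy RMSE:", rmse(toy_true, toy_pred))
print("Toy log-RMSE:", rmse_log(toy_true, toy_pred))
```

On the validation set, the same helpers could be called as `rmse_log(y_val, rf_val_preds)`, provided every prediction is positive.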


In [18]:
# Loading the test dataset
test_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')


In [19]:
# Creating new features for the test dataset similar to the training dataset
test_df["TotalArea"] = test_df["GrLivArea"] + test_df["TotalBsmtSF"] + test_df["GarageArea"]
test_df["Age"] = test_df["YrSold"] - test_df["YearBuilt"]
test_df["LivAreaRatio"] = test_df["GrLivArea"] / test_df["TotalArea"]
test_df["PorchAreaRatio"] = (test_df["OpenPorchSF"] + test_df["EnclosedPorch"] + test_df["3SsnPorch"] + test_df["ScreenPorch"]) /test_df["TotalArea"]


In [20]:
# Selecting the features for the test dataset
X_test = test_df[["TotalArea", "Age", "LivAreaRatio", "PorchAreaRatio"]]


In [21]:
# Make predictions on the test dataset using both fitted pipelines
rf_test_preds = rf_pipeline.predict(X_test)
gb_test_preds = gb_pipeline.predict(X_test)


In [22]:
# Creating a dataframe with the Random Forest predictions (the Gradient Boosting predictions are not submitted)
submission_df = pd.DataFrame({'Id': test_df.Id, 'SalePrice': rf_test_preds})


In [23]:
# Saving the dataframe as a csv file
submission_df.to_csv('/kaggle/working/submission.csv', index=False)
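
Before uploading, it can help to run a quick sanity check on the submission DataFrame. The sketch below assumes a hypothetical helper `check_submission` and uses a made-up two-row DataFrame; it only verifies the column names, the Ids, and that every predicted price is present and positive.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_ids) -> None:
    """Raise AssertionError if the submission looks malformed."""
    assert list(submission.columns) == ["Id", "SalePrice"], "unexpected columns"
    assert submission["Id"].tolist() == list(expected_ids), "Ids do not match the test set"
    assert submission["SalePrice"].notna().all(), "missing predictions"
    assert (submission["SalePrice"] > 0).all(), "non-positive predictions"


# Made-up rows for illustration only
toy_submission = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [150000.0, 182500.5]})
check_submission(toy_submission, expected_ids=[1461, 1462])
print("Submission looks well-formed")
```

With the notebook's variables, the call would be `check_submission(submission_df, test_df["Id"])`.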