## **This code incorporates various techniques for preprocessing, feature engineering, model selection, hyperparameter tuning, and evaluation.**

This code incorporates the following steps:

1. Load the dataset and split it into features and target variable.
2. Create separate preprocessing pipelines for numeric and categorical features, handling missing values and performing scaling and encoding.
3. Combine the preprocessing pipelines using the ColumnTransformer.
4. Create a feature selection transformer to select the most relevant features.
5. Define a random forest regressor as the underlying model for prediction.
6. Create a pipeline that includes the preprocessing, feature selection, and regression steps.
7. Define a parameter grid to search over for hyperparameter tuning.
8. Perform a grid search using cross-validation to find the best combination of hyperparameters.
9. Get the best model from the grid search.
10. Make predictions on the testing data using the best model.
11. Calculate evaluation metrics including RMSE, MAE, and R-squared.

### Import libraries

In [None]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor



### **1.** Load the dataset and split it into features and target variable.

This code snippet performs data loading, preprocessing, and data splitting to prepare the dataset for house price prediction using advanced regression techniques. The main steps are as follows:

1. Load the dataset from a CSV file located in the '/kaggle/input/house-prices-advanced-regression-techniques/' directory and store it in the 'data' variable.
2. Keep only the numeric columns of the dataset and discard the non-numeric ones, so that the remaining subset contains the numeric features and the target variable.
3. Split the dataset into features 'X' (all columns except 'SalePrice') and the target variable 'y' (which is 'SalePrice').
4. Apply one-hot encoding to 'X' using 'pd.get_dummies()'. Because the non-numeric columns were already discarded in step 2, there are no categorical columns left to convert, and 'X_encoded' contains the same columns as 'X'.
5. Split the preprocessed dataset into training and testing sets using 'train_test_split' from 'sklearn.model_selection'. The training set consists of 80% of the data, and the testing set contains 20% of the data. The random_state is set to 42 for reproducibility.

After this code snippet, the data is ready for further processing, including the creation of a machine learning model, hyperparameter tuning, and evaluation using metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared.

In [None]:
# Load the dataset
data = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
data = data.select_dtypes(include='number')

# Split the data into features (X) and target variable (y)
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']

# Perform one-hot encoding for categorical variables
X_encoded = pd.get_dummies(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
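The effect of filtering to numeric columns before encoding can be checked on a small frame. The sketch below uses a hypothetical toy DataFrame (the values are made up and only mimic the dataset's column names) to show that `pd.get_dummies` has nothing left to encode after `select_dtypes(include='number')`, and how an 80/20 split divides five rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for train.csv (values are illustrative)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Same sequence as the notebook cell: numeric filter, split X/y, encode
toy_numeric_only = toy.select_dtypes(include="number")
X_toy = toy_numeric_only.drop("SalePrice", axis=1)
y_toy = toy_numeric_only["SalePrice"]
X_toy_encoded = pd.get_dummies(X_toy)

# 'Neighborhood' was dropped by the numeric filter, so no dummy columns appear
print(list(X_toy_encoded.columns))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy_encoded, y_toy, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))
```

If the categorical columns should contribute to the model, one option is to skip the numeric filter and let the preprocessing pipeline in the next steps encode them.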

### **2.** Create separate preprocessing pipelines for numeric and categorical features, handling missing values and performing scaling and encoding.

This code snippet creates preprocessing pipelines for both numeric and categorical features to prepare the data for training a machine learning model:

1. For numeric features:
   - `numeric_features` stores the names of columns with numeric data types (float64 and int64) in the 'X' dataset.
   - The `numeric_pipeline` is created as a sequence of preprocessing steps using 'Pipeline' from 'sklearn.pipeline':
     - 'imputer': It uses 'SimpleImputer' with the 'median' strategy to fill missing values in numeric features with the median of the respective columns.
     - 'scaler': It uses 'StandardScaler' to standardize the numeric features by subtracting the mean and scaling to unit variance.

2. For categorical features:
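   - `categorical_features` stores the names of columns with object data type (categorical) in the 'X' dataset. Because 'X' was restricted to numeric columns in step 1, this selection is empty here; the categorical pipeline only takes effect if the non-numeric columns are kept.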
   - The `categorical_pipeline` is created as a sequence of preprocessing steps using 'Pipeline':
     - 'imputer': It uses 'SimpleImputer' with the 'most_frequent' strategy to fill missing values in categorical features with the most frequent value of the respective columns.
     - 'encoder': It uses 'OneHotEncoder' to perform one-hot encoding on the categorical features, converting them into binary vectors.

These preprocessing pipelines will be applied to the respective feature types during the data transformation step, ensuring that the data is appropriately handled before training the machine learning model.

In [None]:
# Create preprocessing pipelines for numeric and categorical features
numeric_features = X.select_dtypes(include=['float64', 'int64']).columns
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = X.select_dtypes(include=['object']).columns
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
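To make the two pipelines concrete, here is a minimal sketch that runs each of them on a tiny hypothetical column with one missing value. The column names and values are assumptions chosen for illustration, not rows from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy columns, each with one missing entry
toy_numeric = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 100.0]})
toy_categorical = pd.DataFrame(
    {"Street": pd.Series(["Pave", np.nan, "Grvl", "Pave"], dtype=object)}
)

demo_numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
demo_categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# The missing value becomes the median (80), then the column is standardized
numeric_out = demo_numeric_pipeline.fit_transform(toy_numeric)
print(numeric_out.ravel())

# The missing value becomes 'Pave'; the encoder output is sparse, so densify it
categorical_out = demo_categorical_pipeline.fit_transform(toy_categorical).toarray()
print(categorical_out)
```

The encoded columns follow the sorted category order, so the first column corresponds to 'Grvl' and the second to 'Pave'.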

### **3.** Combine the preprocessing pipelines using the ColumnTransformer.

In this code snippet, the preprocessing pipelines for numeric and categorical features are combined using `ColumnTransformer` to create an integrated preprocessing step that handles both types of features.

The `ColumnTransformer` allows you to apply different preprocessing steps to different subsets of the features in the dataset. It takes a list of transformers, where each transformer is associated with a specific set of features.

In this case:
- The `numeric_pipeline` is associated with the numeric features specified in `numeric_features`.
- The `categorical_pipeline` is associated with the categorical features specified in `categorical_features`.

During the data transformation, the `preprocessor` will apply the appropriate preprocessing steps to each subset of features based on their types (numeric or categorical). This ensures that the data is correctly processed before feeding it into the machine learning model.

In [None]:
# Combine the preprocessing pipelines
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_features),
    ('categorical', categorical_pipeline, categorical_features)
])
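The following sketch shows the combined transformer on a hypothetical mixed-type frame. The data is made up; the small `to_dense` helper is also an assumption, added only so the example prints the same way whether `ColumnTransformer` returns a sparse or a dense result. It also illustrates what `handle_unknown='ignore'` does when a category appears that was not seen during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; names mimic the dataset but values are made up
toy_df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "OverallQual": [7, 6, 7, 8],
    "Street": pd.Series(["Pave", "Grvl", "Pave", np.nan], dtype=object),
})
toy_numeric_cols = toy_df.select_dtypes(include="number").columns
toy_categorical_cols = toy_df.select_dtypes(include=["object"]).columns

toy_preprocessor = ColumnTransformer([
    ("numeric", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), toy_numeric_cols),
    ("categorical", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), toy_categorical_cols),
])


def to_dense(matrix):
    """Return a dense array for either sparse or dense transformer output."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


# Two scaled numeric columns plus two one-hot columns ('Grvl', 'Pave')
toy_out = to_dense(toy_preprocessor.fit_transform(toy_df))
print(toy_out.shape)

# A category never seen during fit is encoded as all zeros instead of failing
unseen = pd.DataFrame({
    "LotArea": [10000.0],
    "OverallQual": [5],
    "Street": pd.Series(["Dirt"], dtype=object),
})
unseen_out = to_dense(toy_preprocessor.transform(unseen))
print(unseen_out[0, 2:])
```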

### **4.** Create a feature selection transformer to select the most relevant features.

In this code snippet, a feature selection transformer is created to select the most relevant features from the dataset before training the machine learning model:


`SelectKBest` is a univariate feature selection transformer from `sklearn.feature_selection`. It scores each feature independently against the target variable and keeps the 'K' features with the highest scores.

- The `score_func` is set to `f_regression`, the score function used here to evaluate the relevance of features for regression problems. It computes an F-statistic and a p-value for each feature from a univariate linear regression of the target on that feature. No `k` is passed at this point, so the default of `k=10` applies until the grid search overrides it.

- After creating the `feature_selector`, it can be incorporated into the machine learning pipeline to automatically select the 'K' most informative features during the model training process.

By using `SelectKBest` feature selection, the model can focus on the most relevant features, potentially improving the model's performance and reducing overfitting. The choice of 'K' (the number of top features to select) can be adjusted based on the dataset and the model's complexity requirements.

In [None]:
# Create a feature selection transformer
feature_selector = SelectKBest(score_func=f_regression)
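As a rough illustration of how `SelectKBest` with `f_regression` ranks features, the sketch below builds a synthetic dataset (an assumption made purely for the demonstration) in which only the first of three features drives the target, then keeps the single best feature.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only the first column is related to the target
rng = np.random.default_rng(0)
n_samples = 200
informative = rng.normal(size=n_samples)
noise_a = rng.normal(size=n_samples)
noise_b = rng.normal(size=n_samples)
X_demo = np.column_stack([informative, noise_a, noise_b])
y_demo = 3.0 * informative + rng.normal(scale=0.1, size=n_samples)

# Keep only the single highest-scoring feature
demo_selector = SelectKBest(score_func=f_regression, k=1)
X_demo_selected = demo_selector.fit_transform(X_demo, y_demo)

print(demo_selector.scores_)        # one F-statistic per feature
print(demo_selector.get_support())  # boolean mask of the kept features
```

Because each feature is scored on its own, this method cannot detect features that are only useful in combination with others, which is one reason to tune `k` rather than fix it.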

### **5.** Define a random forest regressor as the underlying model for prediction.

In this code snippet, a Random Forest Regressor is created to serve as the machine learning model for predicting house prices:

The `RandomForestRegressor` is a powerful ensemble learning algorithm from `sklearn.ensemble` that utilizes an ensemble of decision trees to perform regression tasks. It works by constructing multiple decision trees during training and then combining their predictions to make a final prediction.

- The `RandomForestRegressor` is well-suited for regression problems like predicting house prices, as it can model non-linear relationships and interactions between features. Note that scikit-learn's implementation expects numeric input, so categorical features must be encoded before they reach the model.

- The regressor is initialized with default hyperparameters, but you can further customize its behavior by tuning hyperparameters such as the number of trees (`n_estimators`), the maximum depth of the trees (`max_depth`), and the minimum number of samples per leaf (`min_samples_leaf`).

- Once the regressor is created, it can be incorporated into the machine learning pipeline, along with the preprocessing steps and feature selection, to build an end-to-end predictive model for house prices based on the given dataset.

Overall, the Random Forest Regressor is a popular and versatile choice for regression tasks, and it often performs well on tabular data such as house prices because it can capture complex relationships without much manual feature engineering.

In [None]:
# Create a random forest regressor
regressor = RandomForestRegressor()
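The idea of "combining the predictions of multiple trees" can be verified directly. This sketch fits a small forest on synthetic data from `make_regression` (an assumption for illustration, with hyperparameter values chosen arbitrarily) and checks that the forest's prediction is the average of its trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for this demonstration
X_rf, y_rf = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_forest = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)
demo_forest.fit(X_rf, y_rf)

# n_estimators controls how many trees are built
print(len(demo_forest.estimators_))

# max_depth caps the depth of every individual tree
deepest_tree = max(tree.get_depth() for tree in demo_forest.estimators_)
print(deepest_tree)

# The forest prediction is the mean of the individual tree predictions
per_tree_predictions = np.array([tree.predict(X_rf) for tree in demo_forest.estimators_])
averaged_predictions = per_tree_predictions.mean(axis=0)
print(np.allclose(averaged_predictions, demo_forest.predict(X_rf)))
```

Setting `random_state` makes the fitted forest reproducible between runs, which can help when comparing configurations.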

### **6.** Create a pipeline that includes the preprocessing, feature selection, and regression steps.

In this code snippet, a machine learning pipeline is created by combining the previously defined preprocessing steps, feature selection, and the Random Forest Regressor into a cohesive sequence:


The pipeline allows you to streamline the entire machine learning workflow, from data preprocessing to model training and prediction, in a concise and organized manner.

- The `preprocessor` is the `ColumnTransformer` that was defined earlier. It handles the transformation of the dataset, applying the appropriate preprocessing steps to numeric and categorical features.

- The `feature_selector` is the `SelectKBest` feature selection transformer that was also defined earlier. It selects the most relevant features based on their statistical significance.

- The `regressor` is the `RandomForestRegressor` machine learning model that was instantiated earlier. It performs the regression task, predicting house prices based on the selected features.

By combining these steps into a pipeline, you can conveniently apply them in sequence, ensuring consistent data processing and model training. Later, this pipeline can be used for both training and making predictions on new data efficiently, and it simplifies the process of tuning hyperparameters or incorporating additional preprocessing steps into the model.

In [None]:
# Create a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('regressor', regressor)
])

### **7.** Define a parameter grid to search over for hyperparameter tuning.

In this code snippet, a grid of hyperparameters is defined to perform a hyperparameter search using GridSearchCV:


The `param_grid` is a dictionary containing hyperparameter names as keys and lists of hyperparameter values to be searched over. Each key has the form `<step name>__<parameter name>`, which tells the pipeline which step receives the value; the keys used here target the `feature_selector` and `regressor` steps.

- `'feature_selector__k'`: This hyperparameter controls the number of top features to select from the dataset using the `SelectKBest` feature selection. It is set to be searched over three values: 10, 20, and 30.

- `'regressor__n_estimators'`: This hyperparameter determines the number of decision trees to create in the Random Forest Regressor. The grid search will try three values: 100, 200, and 300.

- `'regressor__max_depth'`: This hyperparameter specifies the maximum depth of the decision trees in the Random Forest Regressor. The grid search will explore three options: None (unlimited depth), 5, and 10.

By performing a grid search with these hyperparameter combinations, the pipeline will evaluate different model configurations, selecting the best combination of hyperparameters based on cross-validated performance. The result will be a model that is optimized for predicting house prices with the most appropriate number of selected features, number of estimators, and maximum depth of trees.

In [None]:
# Define hyperparameters to search over
param_grid = {
    'feature_selector__k': [10, 20, 30],  # Adjust the number of selected features
    'regressor__n_estimators': [100, 200, 300],  # Adjust the number of estimators
    'regressor__max_depth': [None, 5, 10]  # Adjust the maximum depth of trees
}
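The cost of this grid is easy to work out ahead of time. The sketch below counts the candidate combinations with `ParameterGrid` and multiplies by the 5 folds used in the next step. The final `+ 1` assumes `GridSearchCV` keeps its default `refit=True`, which trains the best configuration once more on the full training set.

```python
from sklearn.model_selection import ParameterGrid

demo_param_grid = {
    "feature_selector__k": [10, 20, 30],
    "regressor__n_estimators": [100, 200, 300],
    "regressor__max_depth": [None, 5, 10],
}

# Every combination of the listed values is one candidate: 3 * 3 * 3
n_candidates = len(ParameterGrid(demo_param_grid))

# Each candidate is fitted once per fold, plus one final refit of the winner
n_folds = 5
total_fits = n_candidates * n_folds + 1

print(n_candidates)  # 27
print(total_fits)    # 136
```

Since the number of fits grows multiplicatively with each added hyperparameter, it can be worth starting with a coarse grid and refining around the best values.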

### **8.** Perform a grid search using cross-validation to find the best combination of hyperparameters.

In this code snippet, the GridSearchCV is performed to find the best hyperparameters for the machine learning pipeline:


The `GridSearchCV` is a method from `sklearn.model_selection` that performs an exhaustive search over the specified hyperparameter grid to find the best combination of hyperparameters for the given pipeline.

- `pipeline`: The machine learning pipeline that includes data preprocessing, feature selection, and the Random Forest Regressor.

- `param_grid`: The hyperparameter grid that defines the combinations of hyperparameters to be searched over. This grid was defined earlier and includes various values for the number of selected features, the number of estimators, and the maximum depth of trees.

- `cv=5`: The parameter `cv` determines the number of folds for cross-validation. In this case, cross-validation is performed with 5 folds, meaning the data is split into 5 parts, and the model is trained and evaluated 5 times, each time using a different part as the validation set.

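During the grid search process, the pipeline will be trained and evaluated with different combinations of hyperparameters using cross-validation. The best combination is the one with the highest mean score across the validation folds. Because no `scoring` argument is passed, `GridSearchCV` uses the estimator's default `score` method, which for regressors is R-squared; to select on an error metric instead, pass a scorer such as `scoring='neg_root_mean_squared_error'`.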

After the grid search is completed, the `grid_search` object will contain the best model, which is trained on the entire training set using the best hyperparameters. This best model can be accessed using `grid_search.best_estimator_`, and it will be used for making predictions on the test dataset and evaluating the model's performance.

In [None]:
# Perform grid search to find the best hyperparameters
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
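To show how the selection criterion can be changed, here is a small sketch on synthetic data with a deliberately tiny grid (both are assumptions to keep the run short). It compares the default scoring, which is R-squared for regressors, with the built-in `'neg_root_mean_squared_error'` scorer.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny grid, chosen only to keep the example fast
X_gs, y_gs = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)
small_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

# Default scoring: the regressor's own score method (R-squared)
r2_search = GridSearchCV(RandomForestRegressor(random_state=0), small_grid, cv=3)
r2_search.fit(X_gs, y_gs)

# Explicit scoring: negated RMSE, so that "higher is better" still holds
rmse_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    small_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
rmse_search.fit(X_gs, y_gs)

print(r2_search.best_params_, r2_search.best_score_)
print(rmse_search.best_params_, -rmse_search.best_score_)  # flip sign to get RMSE
```

The two criteria often agree on the best configuration, but they are not guaranteed to, so it helps to choose the scorer that matches the metric reported at the end.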

### **9.** Get the best model from the grid search.

In this code snippet, the best model from the grid search is obtained.


After performing the grid search using `GridSearchCV`, the `grid_search` object contains information about the results of the search, including the best hyperparameters and the best model.

The `best_estimator_` attribute of the `grid_search` object holds the machine learning pipeline that achieved the best performance based on the cross-validated evaluation metrics. This pipeline consists of the optimal combination of preprocessing steps, feature selection, and the Random Forest Regressor, as determined by the grid search.

Now, the `best_model` variable contains the entire pipeline with the best hyperparameters. You can use this model to make predictions on new data or evaluate its performance on the test dataset. Since this configuration achieved the best cross-validated score in the grid, it will often, though not always, outperform the default configuration on unseen data.

In [None]:
# Get the best model from grid search
best_model = grid_search.best_estimator_

### **10.** Make predictions on the testing data using the best model.

In this code snippet, predictions are made on the testing data using the best model obtained from the grid search.

The `predict` method of the `best_model` is used to generate predicted values for the target variable 'y' based on the features in the testing dataset 'X_test'. Because `best_model` is the full pipeline, it first applies the fitted preprocessing and feature selection steps to 'X_test', and the regressor then produces the predicted house prices.

After executing this code, the `y_pred` variable holds the predicted house prices for the testing dataset. These predictions can now be used to evaluate the model's performance and compare it with the true house prices in the testing dataset.

By comparing the predicted values (`y_pred`) with the true target values (`y_test`), you can assess the accuracy of the model's predictions and compute various evaluation metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared. These metrics provide insights into how well the model is performing in predicting house prices on unseen data.

In [None]:
# Make predictions on the testing data
y_pred = best_model.predict(X_test)

### **11.** Calculate evaluation metrics including RMSE, MAE, and R-squared.

In [None]:
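# Fit baseline models for comparison with the tuned random forest.
# The numeric columns contain missing values, which LinearRegression does not
# accept, so the first two baselines get a median imputer in front of the model.
lr_model = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', LinearRegression())
])
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_test)

gb_model = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', GradientBoostingRegressor(random_state=42))
])
gb_model.fit(X_train, y_train)
gb_preds = gb_model.predict(X_test)

# XGBoost handles missing values natively, so no imputer is needed
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict(X_test)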


In the next code snippet, several evaluation metrics are calculated to assess the performance of the tuned model on the testing data.

- `mean_squared_error(y_test, y_pred)`: Calculates the Mean Squared Error (MSE) between the true target values (`y_test`) and the predicted values (`y_pred`). MSE is the average squared difference between the predicted and true values, with lower values indicating better performance. Here it is only an intermediate value used to compute the RMSE.

- `np.sqrt(mean_squared_error(y_test, y_pred))`: Computes the Root Mean Squared Error (RMSE) by taking the square root of the MSE. RMSE is expressed in the same unit as the target variable (SalePrice), which makes it easier to interpret than MSE. Because the errors are squared before averaging, large errors weigh more heavily.

- `mean_absolute_error(y_test, y_pred)`: Calculates the Mean Absolute Error (MAE), the average absolute difference between the predicted and true values. It is also in the unit of the target variable and is less sensitive to a few large errors than RMSE.

- `r2_score(y_test, y_pred)`: Computes the R-squared (R2) score, also known as the coefficient of determination. R2 indicates the proportion of the variance in the target variable (SalePrice) that is explained by the model. Its maximum is 1 for a perfect fit, a value of 0 corresponds to always predicting the mean of the target, and it becomes negative when the model performs worse than that.

After calculating these metrics, they are printed to the console, providing a comprehensive evaluation of the model's performance on the testing data. By analyzing these metrics, you can gain insights into how well the model predicts house prices and assess its overall effectiveness in capturing the underlying relationships in the data.

In [None]:
# Calculate evaluation metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared:", r2)

Root Mean Squared Error (RMSE): 29309.827794785124
Mean Absolute Error (MAE): 18100.190770547946
R-squared: 0.8880013728027616
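The metric definitions can be checked by hand on a few numbers. The sketch below uses four hypothetical target values and predictions (not taken from the dataset), computes each metric from its formula with NumPy, and compares the result with the scikit-learn functions. It also includes a deliberately bad set of predictions to show that R-squared can be negative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions, chosen for easy arithmetic
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true_demo - y_pred_demo            # [-10, 10, -20, 20]
mse_manual = np.mean(errors ** 2)             # (100 + 100 + 400 + 400) / 4 = 250
rmse_manual = np.sqrt(mse_manual)             # about 15.81
mae_manual = np.mean(np.abs(errors))          # (10 + 10 + 20 + 20) / 4 = 15

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)                                  # 1000
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)      # 50000
r2_manual = 1.0 - ss_res / ss_tot                             # 0.98

print(rmse_manual, np.sqrt(mean_squared_error(y_true_demo, y_pred_demo)))
print(mae_manual, mean_absolute_error(y_true_demo, y_pred_demo))
print(r2_manual, r2_score(y_true_demo, y_pred_demo))

# Predictions in reversed order are worse than predicting the mean everywhere
y_bad_demo = np.array([400.0, 300.0, 200.0, 100.0])
r2_bad = r2_score(y_true_demo, y_bad_demo)    # 1 - 200000 / 50000 = -3.0
print(r2_bad)
```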


def evaluate_model(y_true, y_pred, model_name="Model"):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)

    print(f"--- {model_name} ---")
    print(f"MAE: {mae:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"R² Score: {r2:.2f}")
    print()


evaluate_model(y_test, lr_preds, "Linear Regression")
evaluate_model(y_test, gb_preds, "Gradient Boosting")
evaluate_model(y_test, xgb_preds, "XGBoost")


### See the result

You can combine the predicted values `y_pred` and the true target values `y_test` with the features in `X_test` to compare predictions and true values row by row. Because `pd.concat` aligns rows on their index labels, both Series must carry the index of `X_test`; otherwise the rows do not line up and the new columns are filled with NaN. Here's how you can do it using pandas:

```python
y_pred_series = pd.Series(y_pred, index=X_test.index, name='Predicted_SalePrice')

y_test_series = pd.Series(y_test.values, index=X_test.index, name='True_SalePrice')

result_df = pd.concat([X_test, y_pred_series, y_test_series], axis=1)

```

This will create a new DataFrame `result_df` that contains the original features in `X_test`, the predicted SalePrice values in a new column 'Predicted_SalePrice', and the true SalePrice values in a new column 'True_SalePrice'. You can then examine this DataFrame to compare the predicted and true values side by side and analyze the differences between them.
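Index alignment in `pd.concat` is easy to overlook, so here is a minimal sketch of it. The three `LotArea` values and index labels are taken from the test rows shown in this notebook; the prediction values are hypothetical placeholders.

```python
import pandas as pd

# Features keep their original row labels after train_test_split
demo_features = pd.DataFrame({"LotArea": [8414, 12256, 8960]}, index=[892, 1105, 413])

# Hypothetical predictions, in the same row order as demo_features
demo_predictions = [150000.0, 280000.0, 120000.0]

# A Series built without an index gets labels 0, 1, 2, which match no feature row
misaligned = pd.concat(
    [demo_features, pd.Series(demo_predictions, name="Predicted_SalePrice")],
    axis=1,
)

# Reusing the feature index puts every prediction next to its own row
aligned = pd.concat(
    [
        demo_features,
        pd.Series(demo_predictions, index=demo_features.index, name="Predicted_SalePrice"),
    ],
    axis=1,
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

An alternative with the same effect is `demo_features.assign(Predicted_SalePrice=demo_predictions)`, which attaches a plain list or array by position.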

In [None]:
# Convert the predicted values y_pred to a pandas Series that shares X_test's index
y_pred_series = pd.Series(y_pred, index=X_test.index, name='Predicted_SalePrice')

# Convert the true target values y_test to a pandas Series with the same index
y_test_series = pd.Series(y_test.values, index=X_test.index, name='True_SalePrice')

# Concatenate X_test with the predicted and true SalePrice columns (aligned on index)
result_df = pd.concat([X_test, y_pred_series, y_test_series], axis=1)

# Display result_df to see the differences between predicted and true values
result_df

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,Predicted_SalePrice,True_SalePrice
892,893.0,20.0,70.0,8414.0,6.0,8.0,1963.0,2003.0,0.0,663.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2006.0,,
1105,1106.0,60.0,98.0,12256.0,8.0,5.0,1994.0,1995.0,362.0,1032.0,...,32.0,0.0,0.0,0.0,0.0,0.0,4.0,2010.0,,
413,414.0,30.0,56.0,8960.0,5.0,6.0,1927.0,1950.0,0.0,0.0,...,0.0,130.0,0.0,0.0,0.0,0.0,3.0,2010.0,,
522,523.0,50.0,50.0,5000.0,6.0,7.0,1947.0,1950.0,0.0,399.0,...,24.0,36.0,0.0,0.0,0.0,0.0,10.0,2006.0,,
1036,1037.0,20.0,89.0,12898.0,9.0,5.0,2007.0,2008.0,70.0,1022.0,...,0.0,0.0,0.0,0.0,0.0,0.0,9.0,2009.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
286,,,,,,,,,,,...,,,,,,,,,227146.645,230000.0
287,,,,,,,,,,,...,,,,,,,,,119098.875,89471.0
288,,,,,,,,,,,...,,,,,,,,,235196.155,260000.0
290,,,,,,,,,,,...,,,,,,,,,136779.415,108000.0


In [None]:
result_df.to_csv('House_Prediction_Result.csv', index=False)

In [None]:
last_two_columns = result_df.columns[-2:]

# Create a new DataFrame with only the last two columns
df_new = result_df[last_two_columns]
df_new

Unnamed: 0,Predicted_SalePrice,True_SalePrice
892,,
1105,,
413,,
522,,
1036,,
...,...,...
286,227146.645,230000.0
287,119098.875,89471.0
288,235196.155,260000.0
290,136779.415,108000.0


In [None]:
df_new.to_csv('House_Prediction_Result_Price.csv', index=False)