# Why Is Sklearn Used?
1. Model building: classification models (LogisticRegression, SVC, KNN, RandomForest), regression models (LinearRegression, Ridge, SVR), clustering (KMeans), and dimensionality reduction (PCA).
2. Model evaluation: metrics (accuracy, precision, recall, F1 score, ROC AUC, mean squared error, and R² score) and cross-validation tools (cross_val_score, GridSearchCV).
3. Data preprocessing: scaling (StandardScaler, MinMaxScaler), encoding (OneHotEncoder, LabelEncoder), and imputation (SimpleImputer, KNNImputer).
4. Feature engineering.
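
As a minimal sketch of how these pieces fit together, the cell below chains preprocessing, model building, and evaluation in one pipeline. It runs on synthetic data from `make_regression` rather than the housing data, and the names `X_demo`, `y_demo`, `demo_pipe`, and `demo_scores` are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only, not the housing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Preprocessing (scaling) and model building chained into a single estimator
demo_pipe = Pipeline([('std', StandardScaler()), ('ridge', Ridge(alpha=1.0))])

# Model evaluation: 5-fold cross-validated R2
demo_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5, scoring='r2')
print('Mean R2: %.2f' % demo_scores.mean())
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so no statistics from the held-out fold reach the model.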

In [1]:
# Imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import  Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv(r'C:\Users\adity\OneDrive\Documents\GitHub\Advanced_Housing_Price_Prediction\data\combined_file_EDA_final.csv')
df.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,LotConfig,LandSlope,Neighborhood,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,3.044522,1,4.143135,8.922792,1,3,3,4,0,21,...,0.0,0.0,0.0,0.0,0.0,10,2009,8,4,12.128117
1,4.110874,3,4.143135,8.976894,1,3,3,4,0,8,...,0.0,0.0,0.0,0.0,0.0,8,2007,8,4,12.072547
2,3.044522,3,4.454347,9.486152,1,3,3,4,0,14,...,0.0,0.0,0.0,0.0,0.0,2,2010,8,4,12.254868
3,4.26268,3,4.204693,9.109746,1,3,3,4,0,6,...,0.0,0.0,0.0,0.0,7.824446,5,2010,8,4,12.493133
4,3.044522,3,4.234107,9.181735,1,3,3,4,0,12,...,4.727388,0.0,0.0,0.0,0.0,4,2010,8,4,11.864469


In [3]:
# Print the types of each column
print(df.dtypes)
df = df.select_dtypes(include=['int64','float64'])

MSSubClass       float64
MSZoning           int64
LotFrontage      float64
LotArea          float64
Street             int64
                  ...   
MoSold             int64
YrSold             int64
SaleType           int64
SaleCondition      int64
SalePrice        float64
Length: 74, dtype: object


In [4]:
nan_mean = df.isna().mean()
threshold = 0.5  # drop columns with more than 50% missing values
columns_to_drop = nan_mean[nan_mean > threshold].index
df = df.drop(columns=columns_to_drop)
# df = df.drop(columns=['Id'])
print("\nCleaned DataFrame:")
print(df)


Cleaned DataFrame:
      MSSubClass  MSZoning  LotFrontage   LotArea  Street  LotShape  \
0       3.044522         1     4.143135  8.922792       1         3   
1       4.110874         3     4.143135  8.976894       1         3   
2       3.044522         3     4.454347  9.486152       1         3   
3       4.262680         3     4.204693  9.109746       1         3   
4       3.044522         3     4.234107  9.181735       1         3   
...          ...       ...          ...       ...     ...       ...   
2915    4.510860         3     4.110874  9.105091       1         3   
2916    3.044522         3     4.369448  9.133783       1         3   
2917    5.198497         4     3.583519  8.209580       1         3   
2918    3.044522         3     4.510860  9.753711       1         3   
2919    3.044522         1     4.143135  8.922792       1         3   

      LandContour  LotConfig  LandSlope  Neighborhood  ...  EnclosedPorch  \
0               3          4          0           

In [5]:
print(df.isnull().sum().sort_values(ascending=False).head(10))

MSSubClass      0
GarageYrBlt     0
Fireplaces      0
Functional      0
TotRmsAbvGrd    0
KitchenQual     0
KitchenAbvGr    0
BedroomAbvGr    0
HalfBath        0
FullBath        0
dtype: int64


In [6]:
# Fill all missing values with 0
df = df.fillna(0)
print(df.isnull().sum().sort_values(ascending=False).head(10))

MSSubClass      0
GarageYrBlt     0
Fireplaces      0
Functional      0
TotRmsAbvGrd    0
KitchenQual     0
KitchenAbvGr    0
BedroomAbvGr    0
HalfBath        0
FullBath        0
dtype: int64


### Splitting the Dataset

In [7]:
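X = df.drop('SalePrice', axis=1)  # Features
y = df['SalePrice']               # Target
# Note: the pipelines below also apply StandardScaler, so this step is redundant,
# and fitting the scaler on the full dataset lets test-set statistics leak into training.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns)
X.head()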

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,LotConfig,LandSlope,Neighborhood,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,-1.120412,-3.075032,-0.194624,-0.337569,0.064238,0.74647,0.316611,0.58842,-0.216222,1.436698,...,1.115326,-0.427459,-0.112755,-0.309086,-0.06669,-0.189295,1.394469,0.917963,0.319251,0.204931
1,0.419784,-0.041042,-0.194624,-0.231427,0.064238,0.74647,0.316611,0.58842,-0.216222,-0.745221,...,0.640561,-0.427459,-0.112755,-0.309086,-0.06669,-0.189295,0.65775,-0.603292,0.319251,0.204931
2,-1.120412,-0.041042,0.775178,0.767672,0.064238,0.74647,0.316611,0.58842,-0.216222,0.261819,...,-1.083506,-0.427459,-0.112755,-0.309086,-0.06669,-0.189295,-1.55241,1.67859,0.319251,0.204931
3,0.639046,-0.041042,-0.002797,0.029212,0.064238,0.74647,0.316611,0.58842,-0.216222,-1.0809,...,0.825013,-0.427459,-0.112755,-0.309086,-0.06669,6.121818,-0.44733,1.67859,0.319251,0.204931
4,-1.120412,-0.041042,0.088863,0.170445,0.064238,0.74647,0.316611,0.58842,-0.216222,-0.073861,...,-1.083506,2.241357,-0.112755,-0.309086,-0.06669,-0.189295,-0.81569,1.67859,0.319251,0.204931


In [8]:
X = X.to_numpy()
y = y.to_numpy()

In [9]:
print(X.shape,y.shape)

(2920, 73) (2920,)


## Splitting of Data
* Use train_test_split from sklearn.model_selection to shuffle and split the features and prices data into training and testing sets.
    * Split the data into 80% training and 20% testing.
    * Set the random_state for train_test_split to a value of your choice. This ensures results are consistent.
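
The effect of `test_size` and `random_state` can be checked on a small toy array. This is only a sketch: `X_toy`, `y_toy`, and the `Xa_*`/`Xb_*` names are illustrative and stand in for the real features and prices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for the features and prices (illustrative only)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# 80% training / 20% testing, shuffled with a fixed seed
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1)

# Repeating the call with the same random_state reproduces the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1)

print(Xa_train.shape, Xa_test.shape)     # (80, 2) (20, 2)
print(np.array_equal(ya_test, yb_test))  # True
```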

In [10]:
X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                 test_size=0.2,
                                                random_state=1)  
# random_state makes the shuffle reproducible, so the same split is produced on every run
# stratify (not used here) preserves the proportion of each class in the train and test sets; it applies to classification targets

In [11]:
# Initializing regressors for all algorithms
reg1 = KNeighborsRegressor(algorithm='ball_tree', leaf_size=50)

reg2 = DecisionTreeRegressor(random_state=1)

reg3 = SVR()

reg4 = RandomForestRegressor(random_state=1)

reg5 = Lasso(fit_intercept=True, max_iter=5000)

reg6 = Ridge()

In [12]:
# Building the pipelines to streamline the process
pipe1 = Pipeline([('std', StandardScaler()),
                 ('reg1',reg1)])
pipe2 = Pipeline([('std', StandardScaler()),
                 ('reg2',reg2)])
pipe3 = Pipeline([('std', StandardScaler()),
                 ('reg3',reg3)])
pipe4 = Pipeline([('std', StandardScaler()),
                 ('reg4',reg4)])
pipe5 = Pipeline([('std', StandardScaler()),
                 ('reg5',reg5)])
pipe6 = Pipeline([('std', StandardScaler()),
                 ('reg6',reg6)])

In [13]:
# Setting up the Parameter grids

param_grid1 = [{'reg1__n_neighbors': list(range(1, 10)),
                'reg1__p': [1, 2]}]

param_grid2 = [{'reg2__max_depth': list(range(1, 10)) + [None],
                'reg2__criterion': ['squared_error', 'absolute_error']}]

param_grid3 = [
    {
        'reg3__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'reg3__C': [1, 5, 10],
        'reg3__degree': [3, 8],  # Only for 'poly'
        'reg3__coef0': [0.01, 10, 0.5],  # Only for 'poly' and 'sigmoid'
        'reg3__gamma': ['auto', 'scale']  # Only for 'rbf', 'poly', and 'sigmoid'
    }
]
param_grid4 = [{'reg4__n_estimators': [10, 100, 500, 1000, 10000]}]

param_grid5 = [{'reg5__alpha':[0.001, 0.01, 0.1, 1, 10]}]

param_grid6 = [{'reg6__alpha':[0.001, 0.01, 0.1, 1, 10]}]
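
The `reg3__C`-style keys follow the pipeline convention `<step name>__<parameter>`. The sketch below shows how such a key is routed to its step and how many candidates the SVR grid expands to. The names `svr_pipe`, `svr_grid`, and `n_candidates` are illustrative, and `ParameterGrid` is used here only to count combinations.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

svr_pipe = Pipeline([('std', StandardScaler()), ('reg3', SVR())])

svr_grid = {
    'reg3__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'reg3__C': [1, 5, 10],
    'reg3__degree': [3, 8],
    'reg3__coef0': [0.01, 10, 0.5],
    'reg3__gamma': ['auto', 'scale'],
}

# '<step name>__<parameter>' routes the value to that pipeline step
svr_pipe.set_params(reg3__C=5)
print(svr_pipe.named_steps['reg3'].C)  # 5

# Candidates fitted per CV fold: 4 * 3 * 2 * 3 * 2
n_candidates = len(ParameterGrid(svr_grid))
print(n_candidates)                    # 144
```

Many of these 144 candidates are redundant, because `degree` is ignored by every kernel except `poly`. Passing a list of per-kernel dictionaries as the grid is one way to avoid the repeated fits.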

### Define a Performance Metric
For this project, you will be calculating the coefficient of determination, R2, to quantify your model's performance. The coefficient of determination is a useful statistic in regression analysis, as it describes how "good" a model is at making predictions.

R2 is at most 1 and measures the proportion of the variance in the target variable that the model explains. A model with an R2 of 0 is no better than a model that always predicts the mean of the target variable, whereas a model with an R2 of 1 perfectly predicts the target variable. A value between 0 and 1 indicates the fraction of the variance in the target variable that the features explain under this model. R2 can also be negative, which indicates that the model is worse than one that always predicts the mean of the target variable.
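
The three cases above (perfect fit, mean-only baseline, worse than the baseline) can be reproduced by hand. The sketch below assumes the usual definition R2 = 1 - SS_res / SS_tot; the helper `r2_manual` and the toy arrays are illustrative and not part of the project code.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])

def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

perfect = y_true.copy()                          # predicts every value exactly
mean_only = np.full_like(y_true, y_true.mean())  # always predicts the mean
bad = y_true[::-1].copy()                        # predictions in reversed order

print(r2_manual(y_true, perfect))    # 1.0
print(r2_manual(y_true, mean_only))  # 0.0
print(r2_manual(y_true, bad))        # -3.0

# The hand-written version agrees with sklearn's r2_score
print(np.isclose(r2_manual(y_true, bad), r2_score(y_true, bad)))  # True
```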

In [14]:
# Initialising list for param_grid, pipelines and names
param_grids = [param_grid1, param_grid2, param_grid3, param_grid4, param_grid5, param_grid6]
pipelines = [pipe1, pipe2, pipe3, pipe4, pipe5, pipe6]
names = ['KNN', 'DTree', 'SVR', 'RForest', 'Lasso', 'Ridge']

# Setup GridSearchCV objects
gridcvs = {}
inner_cv = KFold(n_splits = 2, shuffle = True, random_state=1)

for pgrid, est, name in zip(param_grids,pipelines,names):
    gcv = GridSearchCV(
        estimator = est,
        param_grid= pgrid,
        # scoring='r2',     # Already the default for regressors. GridSearchCV maximizes the score,
        #                   # so an error metric such as MSE must be negated ('neg_mean_squared_error')
        n_jobs = -1,
        cv = inner_cv,
        verbose = 0,
        refit = True
    )
    gridcvs[name] = gcv

In [15]:
def nested_cv(X_train, y_train):
    """Run nested cross-validation for every GridSearchCV object in gridcvs.

    The inner loop (GridSearchCV) tunes hyperparameters on each outer training fold;
    the outer loop scores the refitted best estimator on the held-out fold.
    """
    for name, gs_est in sorted(gridcvs.items()):
        print(50*'-','\n')
        print('Algorithm:',name)
        print('       Inner Loop:')

        outer_scores = []
        outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

        for train_idx, valid_idx in outer_cv.split(X_train, y_train):
            # Run inner loop (hyperparameter search on the outer training fold)
            gs_est.fit(X_train[train_idx], y_train[train_idx])
            print('\n        Best R2 (avg. of inner test folds): %.2f' % (gs_est.best_score_))
            print('        Best parameters:', gs_est.best_params_)

            # Performance on the outer test fold (valid_idx)
            outer_scores.append(gs_est.best_estimator_.score(X_train[valid_idx], y_train[valid_idx]))
            print('               R2 (on outer test fold) %.2f' % (outer_scores[-1]))

        print('\n       Outer Loop :')
        print('           Mean R2: %.2f +/- %.2f' % (np.mean(outer_scores), np.std(outer_scores)))
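
For comparison, the same nested scheme can be written more compactly by passing a `GridSearchCV` object to `cross_val_score`, which clones and refits the search on each outer training fold. This is a sketch on synthetic data with a single decision-tree pipeline. The names `tree_pipe`, `search`, and `nested_scores` are illustrative, and it does not print the per-fold best parameters the way `nested_cv` does.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only, not the housing dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)

tree_pipe = Pipeline([('std', StandardScaler()),
                      ('reg', DecisionTreeRegressor(random_state=1))])

# Inner loop: hyperparameter search run on each outer training fold
inner = KFold(n_splits=2, shuffle=True, random_state=1)
search = GridSearchCV(tree_pipe, {'reg__max_depth': [2, 4, 6, None]}, cv=inner)

# Outer loop: the search is refit per fold and scored on the held-out fold
outer = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer)

print('Mean R2: %.2f +/- %.2f' % (nested_scores.mean(), nested_scores.std()))
```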

In [16]:
# Nested CV Without Principal Component Analysis
nested_cv(X_train,y_train)

-------------------------------------------------- 

Algorithm: DTree
       Inner Loop:

        Best R2 (avg. of inner test folds): 0.33
        Best parameters: {'reg2__criterion': 'squared_error', 'reg2__max_depth': 3}
               R2 (on outer test fold) 0.34

        Best R2 (avg. of inner test folds): 0.30
        Best parameters: {'reg2__criterion': 'squared_error', 'reg2__max_depth': 3}
               R2 (on outer test fold) 0.31

        Best R2 (avg. of inner test folds): 0.28
        Best parameters: {'reg2__criterion': 'squared_error', 'reg2__max_depth': 4}
               R2 (on outer test fold) 0.39

        Best R2 (avg. of inner test folds): 0.34
        Best parameters: {'reg2__criterion': 'squared_error', 'reg2__max_depth': 3}
               R2 (on outer test fold) 0.34

        Best R2 (avg. of inner test folds): 0.32
        Best parameters: {'reg2__criterion': 'squared_error', 'reg2__max_depth': 4}
               R2 (on outer test fold) 0.26

       Outer Loop :


## Nested CV with Principal Component Analysis

In [17]:
df = pd.read_csv(r'data\combined_output_pca.csv')
df.head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC52,PC53,PC54,PC55,PC56,PC57,PC58,PC59,PC60,output
0,2.503369,0.981434,1.452086,-0.919147,-0.569434,1.03184,0.408213,1.352798,-0.259095,1.344579,...,0.111515,0.627657,0.009098,-0.257025,0.172023,-0.307469,-0.1798,0.522025,-0.210473,208500
1,0.389412,-2.255764,-0.742392,-0.293369,-0.813007,-0.124055,-0.817843,0.258579,-0.34879,-0.904775,...,0.599312,-0.998549,0.54397,-0.584959,0.223728,-0.743847,0.350146,-0.00437,-0.200582,181500
2,3.129607,0.578375,0.945196,-0.810753,-0.471309,1.198339,-0.035848,1.028838,0.657342,1.230198,...,0.378765,-0.049458,0.092046,0.086509,0.187371,-0.167515,-0.571866,0.317287,-0.038791,223500
3,-1.087373,0.670798,-1.634989,-1.401094,-0.489789,1.386112,-0.612641,-1.073709,-0.143132,2.311278,...,-1.439508,1.24668,-0.618297,0.489699,-0.990683,0.102216,0.389552,-0.03701,0.368762,140000
4,5.266403,1.039346,-0.567113,-1.468311,0.047984,0.953615,-0.474076,0.997248,0.251615,1.299148,...,0.437711,-0.530069,0.04647,0.22921,0.086508,0.026815,-0.364965,0.248669,0.403331,250000


In [18]:
# Print the types of each column
print(df.dtypes)
df = df.select_dtypes(include=['int64','float64'])

PC1       float64
PC2       float64
PC3       float64
PC4       float64
PC5       float64
           ...   
PC57      float64
PC58      float64
PC59      float64
PC60      float64
output      int64
Length: 61, dtype: object


In [19]:
nan_mean = df.isna().mean()
threshold = 0.5  # drop columns with more than 50% missing values
columns_to_drop = nan_mean[nan_mean > threshold].index
df = df.drop(columns=columns_to_drop)
# df = df.drop(columns=['Id'])
print("\nCleaned DataFrame:")
print(df)


Cleaned DataFrame:
           PC1       PC2       PC3       PC4       PC5       PC6       PC7  \
0     2.503369  0.981434  1.452086 -0.919147 -0.569434  1.031840  0.408213   
1     0.389412 -2.255764 -0.742392 -0.293369 -0.813007 -0.124055 -0.817843   
2     3.129607  0.578375  0.945196 -0.810753 -0.471309  1.198339 -0.035848   
3    -1.087373  0.670798 -1.634989 -1.401094 -0.489789  1.386112 -0.612641   
4     5.266403  1.039346 -0.567113 -1.468311  0.047984  0.953615 -0.474076   
...        ...       ...       ...       ...       ...       ...       ...   
2915 -5.915854  2.950851  5.440385 -0.986437  0.728559  0.336368 -2.429370   
2916 -4.146280  0.950446  5.208857 -3.249140 -0.510877 -1.634566 -0.914299   
2917 -0.717246 -2.109801 -2.603754 -0.592623 -0.442453  1.335394  0.451205   
2918 -2.835838  0.271633  1.446852  0.923445  2.008861  2.186468 -1.773955   
2919  3.359249  0.641392  0.399533 -2.328369  0.629936  0.118251 -0.790409   

           PC8       PC9      PC10  ...    

In [20]:
print(df.isnull().sum().sort_values(ascending=False).head(10))

PC1     0
PC32    0
PC34    0
PC35    0
PC36    0
PC37    0
PC38    0
PC39    0
PC40    0
PC41    0
dtype: int64


In [21]:
# Fill all missing values with 0
df = df.fillna(0)
print(df.isnull().sum().sort_values(ascending=False).head(10))

PC1     0
PC32    0
PC34    0
PC35    0
PC36    0
PC37    0
PC38    0
PC39    0
PC40    0
PC41    0
dtype: int64


In [22]:
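X = df.drop('output', axis=1)  # Features (principal components)
y = df['output']               # Target
# Note: the pipelines defined earlier also apply StandardScaler, so this step is redundant,
# and fitting the scaler on the full dataset lets test-set statistics leak into training.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns)
X.head()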

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC51,PC52,PC53,PC54,PC55,PC56,PC57,PC58,PC59,PC60
0,0.781092,0.475969,0.763351,-0.540695,-0.356187,0.720676,0.286309,1.03782,-0.205435,1.088999,...,0.805345,0.170564,0.96986,0.014441,-0.419712,0.285787,-0.524763,-0.315513,0.940876,-0.390364
1,0.121503,-1.093984,-0.39027,-0.172576,-0.508544,-0.086644,-0.573612,0.198373,-0.276555,-0.732794,...,-0.280371,0.91666,-1.542965,0.863412,-0.955217,0.371686,-1.269539,0.614437,-0.007876,-0.372019
2,0.976489,0.280496,0.496882,-0.476932,-0.294808,0.836965,-0.025142,0.78929,0.521204,0.996359,...,0.454151,0.579329,-0.076423,0.1461,0.141267,0.311285,-0.285901,-1.003512,0.571866,-0.071945
3,-0.339278,0.325319,-0.859502,-0.824205,-0.306368,0.968112,-0.429689,-0.823713,-0.113489,1.871945,...,0.056418,-2.201756,1.926378,-0.981387,0.79966,-1.645852,0.174454,0.683587,-0.066705,0.683942
4,1.643204,0.504054,-0.298127,-0.863746,0.030015,0.666041,-0.332503,0.765055,0.199505,1.052203,...,-0.196253,0.669488,-0.819066,0.07376,0.374291,0.143718,0.045765,-0.640442,0.448191,0.748056


In [23]:
X = X.to_numpy()
y = y.to_numpy()

In [24]:
print(X.shape,y.shape)

(2920, 60) (2920,)


In [25]:
X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                 test_size=0.2,
                                                random_state=1)  


In [26]:
# Nested CV With Principal Component Analysis
nested_cv(X_train,y_train)

-------------------------------------------------- 

Algorithm: DTree
       Inner Loop:

        Best R2 (avg. of inner test folds): 0.25
        Best parameters: {'reg2__criterion': 'absolute_error', 'reg2__max_depth': 2}
               R2 (on outer test fold) 0.22

        Best R2 (avg. of inner test folds): 0.29
        Best parameters: {'reg2__criterion': 'squared_error', 'reg2__max_depth': 2}
               R2 (on outer test fold) 0.33

        Best R2 (avg. of inner test folds): 0.22
        Best parameters: {'reg2__criterion': 'squared_error', 'reg2__max_depth': 1}
               R2 (on outer test fold) 0.28

        Best R2 (avg. of inner test folds): 0.32
        Best parameters: {'reg2__criterion': 'squared_error', 'reg2__max_depth': 5}
               R2 (on outer test fold) -0.23

        Best R2 (avg. of inner test folds): 0.29
        Best parameters: {'reg2__criterion': 'squared_error', 'reg2__max_depth': 2}
               R2 (on outer test fold) 0.36

       Outer Loop 

## Conclusion
- Random Forest emerged as the most reliable and accurate model for this regression task.
- **Reason:** Its R² value of 0.83 indicates that the model explains 83% of the variance in the target variable.

## Hyperparameter Tuning for Random Forest

In [27]:
gcv_hyperparameter_tuning = GridSearchCV(estimator=RandomForestRegressor(random_state=1),
                                        param_grid= [{'n_estimators': [10, 100, 500, 1000, 10000]}],
                                        n_jobs=-1,
                                        cv = inner_cv,
                                        verbose=1,
                                        refit=True)
gcv_hyperparameter_tuning.fit(X_train,y_train)
print('Best CV R2: %.2f' % (gcv_hyperparameter_tuning.best_score_))
print('Best Parameters:',gcv_hyperparameter_tuning.best_params_)

Fitting 2 folds for each of 5 candidates, totalling 10 fits
Best CV R2: 0.42
Best Parameters: {'n_estimators': 10000}


### Question - Best Hyperparameter for RandomForestRegressor
- Which number of trees (n_estimators) do you think results in a model that best generalizes to unseen data?
### Answer -
- Random Forest with **n_estimators = 1000** gives the best R² score on the validation dataset in the comparison output. The training score is around 0.80 and close to the validation score, which suggests that the model generalizes well.
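
One way to compare training and validation scores per candidate is to set `return_train_score=True` on `GridSearchCV` and read both from `cv_results_`. The sketch below uses synthetic data and a deliberately small grid so that it runs quickly. The data, the grid values, and the names `rf_search` and `results` are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data (illustrative only, not the housing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=20.0, random_state=1)

rf_search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={'n_estimators': [10, 50, 100]},
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
    return_train_score=True,  # also record the score on each training fold
)
rf_search.fit(X_demo, y_demo)

# Compare mean training and validation R2 for each candidate
results = rf_search.cv_results_
for params, train_r2, valid_r2 in zip(results['params'],
                                      results['mean_train_score'],
                                      results['mean_test_score']):
    print('n_estimators=%d  train R2=%.2f  validation R2=%.2f'
          % (params['n_estimators'], train_r2, valid_r2))
```

A large gap between the two columns points to overfitting; similar values suggest the model generalizes.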

# Evaluation and Performance Results

In [28]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

final_model = RandomForestRegressor(random_state=1,n_estimators=1000)
final_model.fit(X_train,y_train)
# Predicting on the training data
y_train_pred = final_model.predict(X_train)

# Predicting on the test data
y_test_pred = final_model.predict(X_test)

# Calculating metrics for the training data
mse_train = mean_squared_error(y_train, y_train_pred)
rmse_train = np.sqrt(mse_train)
mae_train = mean_absolute_error(y_train, y_train_pred)
r2_train = r2_score(y_train, y_train_pred)

# Calculating metrics for the test data
mse_test = mean_squared_error(y_test, y_test_pred)
rmse_test = np.sqrt(mse_test)
mae_test = mean_absolute_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)

print('Training Metrics:')
print('Mean Squared Error (MSE): %.2f' % mse_train)
print('Root Mean Squared Error (RMSE): %.2f' % rmse_train)
print('Mean Absolute Error (MAE): %.2f' % mae_train)
print('R² Score: %.2f' % r2_train)

print('\nTest Metrics:')
print('Mean Squared Error (MSE): %.2f' % mse_test)
print('Root Mean Squared Error (RMSE): %.2f' % rmse_test)
print('Mean Absolute Error (MAE): %.2f' % mae_test)
print('R² Score: %.2f' % r2_test)

Training Metrics:
Mean Squared Error (MSE): 252578587.53
Root Mean Squared Error (RMSE): 15892.72
Mean Absolute Error (MAE): 10694.36
R² Score: 0.92

Test Metrics:
Mean Squared Error (MSE): 1996855774.09
Root Mean Squared Error (RMSE): 44686.19
Mean Absolute Error (MAE): 30543.31
R² Score: 0.48
