# For this assignment, we use the House Prices dataset

### https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

We will use regression techniques to predict house prices.

In [1]:
import numpy as np
import pandas as pd

In [2]:
house_train_data = pd.read_csv("House Price Prediction\\train.csv")

## EDA

After importing the training data, we use the function below to get a first look at it.

In [3]:
def understand_variables(dataset):
    print("Datatype = " +str(type(dataset))+"\n") 
    print("Data Shape = "+str(dataset.shape)+"\n")
    print("Top5 Rows : \n\n"+str(dataset.head())+"\n\n")
    print("Data Columns:\n"+str(dataset.columns)+"\n\n")
    print("No.of unique values :\n\n"+str(dataset.nunique(axis=0).sort_values())+"\n\n")
    print("Description :\n\n"+str(dataset.describe())+"\n\n")
    
    #print(dataset.describe(exclude=[np.number]))
    # Uncomment the line above to also summarise the non-numeric (categorical) columns
    
    print("Null count :\n\n"+str(dataset.isnull().sum().sort_values()))

In [4]:
understand_variables(house_train_data)

Datatype = <class 'pandas.core.frame.DataFrame'>

Data Shape = (1460, 81)

Top5 Rows : 

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         Na

### Insights from data :

1) Several columns, such as Street, hold discrete string values. One-hot encoding converts them to numeric form so that ML models can train on them.

2) Many columns contain null values that need to be imputed. We will use the mean for numeric columns and the mode for categorical columns.
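
To make the two ideas above concrete, here is a minimal sketch on a hypothetical four-row frame. The column names are borrowed from the dataset only for readability, the values are made up, and plain pandas calls stand in for the scikit-learn transformers used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are illustrative, not taken from train.csv)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": ["Grvl", np.nan, "Grvl", "Pave"],
})

# Numeric column: replace nulls with the mean of the available values
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].mean())

# Categorical column: replace nulls with the most frequent value (mode)
toy["Alley"] = toy["Alley"].fillna(toy["Alley"].mode()[0])

# One-hot encoding: one indicator column per distinct category
encoded = pd.get_dummies(toy, columns=["Alley"])
print(encoded)
```

After encoding, the single Alley column becomes two indicator columns, Alley_Grvl and Alley_Pave.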

In [5]:
#id is data index, and is not meant for training a model
house_train_data = house_train_data.set_index("Id")

In [6]:
# Next, we split the columns into categorical (discrete, typically strings) and numerical (typically integers or decimals)

categorical=[feature for feature in house_train_data.columns if house_train_data[feature].dtype=='O'] 
numerical=[feature for feature in house_train_data.columns if house_train_data[feature].dtype!='O' and feature!='SalePrice']

# Then, we separate the target variable from the features

features = numerical + categorical
target = ['SalePrice']
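
The same split can probably also be written with select_dtypes, which does not depend on text columns being stored with the object dtype. The sketch below uses a hypothetical two-row frame, and the names feature_frame, categorical_cols and numerical_cols are ours, not part of the notebook above.

```python
import pandas as pd

# Hypothetical toy frame with one numeric feature, one text feature and the target
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Pave"],
    "SalePrice": [208500, 181500],
})

# Drop the target first so it cannot end up in either feature list
feature_frame = toy.drop(columns=["SalePrice"])

# Everything that is not numeric is treated as categorical
categorical_cols = feature_frame.select_dtypes(exclude="number").columns.tolist()
numerical_cols = feature_frame.select_dtypes(include="number").columns.tolist()

print(categorical_cols, numerical_cols)
```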

## ML Pipeline

Next, we create an ML Pipeline to process our data

In [20]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

numerical_transformer = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='mean')), # replaces nulls in numeric columns with the mean of available values
           ('scaler', StandardScaler())]) # standardizes numeric data to zero mean and unit variance

categorical_transformer = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='most_frequent')), # replaces nulls in categorical columns with the mode of available values
           ('onehot', OneHotEncoder(handle_unknown='ignore'))]) # converts categorical variables into a form suitable for ML algorithms

preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, numerical), # to process numerical columns
                  ('cat', categorical_transformer, categorical)]) # to process categorical columns
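
To check what such a preprocessor produces, we can run the same kind of ColumnTransformer on a hypothetical four-row frame. This is only a sketch: the data is made up, and toy_preprocessor and dense are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one null in each column
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "LotShape": pd.Series(["Reg", "Reg", "IR1", np.nan], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["LotArea"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["LotShape"]),
])

transformed = toy_preprocessor.fit_transform(toy)

# The result may be sparse or dense depending on its density, so normalise it
dense = transformed.toarray() if hasattr(transformed, "toarray") else np.asarray(transformed)

print(dense.shape)          # 1 scaled numeric column + 2 one-hot columns
print(dense[:, 0].mean())   # close to 0 after standardization
print(dense[:, 0].std())    # close to 1 after standardization
```

The imputed LotArea row receives the column mean, so its standardized value is 0, and the missing LotShape is filled with the most frequent category before encoding.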

In [8]:
from sklearn.ensemble import RandomForestRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

def accuracy_per_regression_model(regression_model, cv=5):
    """Return 100 minus the cross-validated MAE as a percentage of the mean sale price."""
    # preprocessing first, then the model; cross_val_score fits a fresh clone of this pipeline on every fold
    pipe = Pipeline(steps=[('preprocessor', preprocessor), ('model', regression_model)])

    X = house_train_data[features]
    y = np.ravel(house_train_data[target])

    # cv is either a number of folds or a splitter object such as KFold
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_mean_absolute_error")

    # error % = mean absolute error of predicted prices vs actual prices / mean of actual prices * 100
    error_percentage = -scores.mean() / y.mean() * 100

    # accuracy % = 100 - error %
    return 100 - error_percentage
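
Note that this "accuracy" is not classification accuracy. It is 100 minus the mean absolute error expressed as a percentage of the mean price. A small worked example with three made-up prices shows the arithmetic:

```python
import numpy as np

# Hypothetical actual and predicted prices (illustrative values only)
actual = np.array([200_000.0, 150_000.0, 250_000.0])
predicted = np.array([210_000.0, 140_000.0, 240_000.0])

# Each prediction is off by 10,000, so the mean absolute error is 10,000
mae = np.mean(np.abs(actual - predicted))

# The mean actual price is 200,000, so the relative error is 5 %
error_percentage = mae / actual.mean() * 100

# accuracy % = 100 - error %
accuracy_percentage = 100 - error_percentage

print(mae, error_percentage, accuracy_percentage)
```

A score of about 90 in the results therefore means that predictions are off by roughly 10 % of the average sale price.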


We try out Linear, Random Forest and XGBoost regression.

In [9]:
print("LinearRegression accuracy : "+str(accuracy_per_regression_model(LinearRegression())))
print("RandomForestRegressor accuracy : "+str(accuracy_per_regression_model(RandomForestRegressor(random_state=1))))
print("XGBRegressor accuracy : "+str(accuracy_per_regression_model(XGBRegressor(random_state=1))))

LinearRegression accuracy : 89.60502768602375
RandomForestRegressor accuracy : 89.4800797362218
XGBRegressor accuracy : 90.40758463471842


### XGBoost scores highest among the three regression models

The gap is less than one percentage point, but we go ahead with XGBRegressor and try to improve its score further.

## KFold Cross Validation

So far, we have passed cv=5 directly, which for a regressor means five unshuffled folds. Now we build the splitter explicitly with KFold.

In [10]:
from sklearn.model_selection import KFold

cv = KFold(n_splits=10, random_state=1, shuffle=True)
accuracy_per_regression_model(XGBRegressor())

90.40758463471842

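The score is identical to the earlier 5-fold result because the cv object defined in this cell was never passed to accuracy_per_regression_model, so the call fell back to the default cv=5. This run therefore does not measure the effect of shuffle=True. In the following steps we pass cv explicitly and keep shuffle=False.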
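
To see what shuffle actually changes, here is a small sketch on ten hypothetical rows. A splitter only takes effect when it is passed as cv to the scoring call. Here we call split directly to inspect the folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical rows, identified by their position 0..9
X = np.arange(10).reshape(-1, 1)

unshuffled = KFold(n_splits=5, shuffle=False)
shuffled = KFold(n_splits=5, shuffle=True, random_state=1)

# Test indices of the first fold for each splitter
first_test_unshuffled = next(iter(unshuffled.split(X)))[1]
first_test_shuffled = next(iter(shuffled.split(X)))[1]

print(first_test_unshuffled)  # a contiguous block from the start of the data
print(first_test_shuffled)    # two rows drawn from a random permutation
```

Without shuffling, the folds follow the row order of the file, so any ordering in the data carries over into the folds.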

## Best Value for n_splits in KFold Cross Validation

Next, we will find out the best value for n_splits from a range [2,5,10,15,20,50,100]

In [11]:
n_accuracy_dict = dict()

for n in [2,5,10,15,20,50,100]:
    cv = KFold(n_splits=n, shuffle=False) # random_state only applies when shuffle=True; recent scikit-learn versions raise an error if both are set
    
    n_accuracy_dict[n] = accuracy_per_regression_model(XGBRegressor(),cv)

print(dict(sorted(n_accuracy_dict.items(), key=lambda item: item[1])))

{2: 89.70878951626808, 20: 90.33706074403274, 5: 90.40758463471842, 100: 90.50864423275947, 15: 90.5240708964742, 10: 90.54618703166244, 50: 90.61426574495266}


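### n_splits = 50 gives the highest score

All values from 5 upward land within about 0.3 percentage points of each other, so the differences are small. We use n_splits = 50 for XGBRegressor in the following steps.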
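
One possible reason the score rises with n_splits is that each fold's model is trained on more rows. Keep in mind that n_splits changes how the score is measured, not the model itself. The sketch below only counts rows per fold for the 1460 training rows and does not train anything. The name fold_sizes is illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 1460  # number of rows in the training data
positions = np.arange(n_rows).reshape(-1, 1)

fold_sizes = {}
for n in [2, 5, 10, 15, 20, 50, 100]:
    # Sizes of the training and test part of the first fold
    train_idx, test_idx = next(iter(KFold(n_splits=n).split(positions)))
    fold_sizes[n] = (len(train_idx), len(test_idx))

for n, (n_train, n_test) in fold_sizes.items():
    print(n, n_train, n_test)
```

With n_splits = 2 each model sees only 730 rows, while with n_splits = 50 it sees 1430, at the cost of very small test folds of about 30 rows.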

## Repeated k-Fold Cross-Validation

We now move on to repeated k-fold cross-validation.

The estimate of model performance via k-fold cross-validation can be noisy.

This means that each time the procedure is run, a different split of the dataset into k-folds can be implemented, and in turn, the distribution of performance scores can be different, resulting in a different mean estimate of model performance.

One solution to reduce the noise in the estimated model performance is to increase the k-value. This reduces the bias in the model's estimated performance but increases its variance, i.e. it ties the result more closely to the specific dataset used in the evaluation.

An alternate approach is to repeat the k-fold cross-validation process multiple times and report the mean performance across all folds and all repeats. This approach is generally referred to as repeated k-fold cross-validation.
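
As a minimal sketch of this idea, the example below runs repeated k-fold on synthetic data with a plain LinearRegression. Both the data and the model are stand-ins chosen to keep the example small, not the setup used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data (assumption: 60 rows, 3 features, small noise)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# 5 folds repeated 3 times with different shuffles gives 15 scores
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")

# Mean absolute error per repeat, then over all folds and repeats
mae_per_repeat = -scores.reshape(3, 5).mean(axis=1)
mae_overall = -scores.mean()

print(len(scores))
print(mae_per_repeat)
print(mae_overall)
```

The per-repeat means differ slightly from each other, which is the noise that averaging over repeats is meant to smooth out.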

In [12]:
from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=50, n_repeats=3, random_state=1)

accuracy_per_regression_model(XGBRegressor(),cv)

90.62046434585686

The score has improved slightly. However, we are not sure that n_repeats = 3 gives the best score, so we try the values [1,3,5,10].

In [13]:
from sklearn.model_selection import RepeatedKFold

for n in [1,3,5,10]:

    cv = RepeatedKFold(n_splits=50, n_repeats=n, random_state=1)

    print(str(n)+" : "+str(accuracy_per_regression_model(XGBRegressor(),cv)))

1 : 90.56938217777602
3 : 90.62046434585686
5 : 90.6529924008154
10 : 90.69408668747397


### Best result is for n_repeats = 10

## Further improvements on accuracy

We can also tune the hyperparameters of the ML algorithm. One of these is max_depth, the maximum depth of each tree. Here we try values from 1 to 5.

In [27]:
depth_list = range(1,6) 

In [28]:
repeated_cv = RepeatedKFold(n_splits=50, n_repeats=10, random_state=1)
for depth in depth_list:
    print(str(depth)+" : "+str(accuracy_per_regression_model(XGBRegressor(max_depth=depth), cv=repeated_cv)))

1 : 89.57777521119488
2 : 90.9839599873666
3 : 90.91137393263479
4 : 90.8283626675284
5 : 90.85393923389182


### max_depth = 2 gives the best score
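
The manual loop over max_depth could also be expressed with scikit-learn's GridSearchCV. The sketch below is not the setup used above. It uses synthetic data and scikit-learn's GradientBoostingRegressor as a stand-in for XGBRegressor, so that it runs without xgboost installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data (assumption: 120 rows, 4 features, a mildly non-linear target)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

search = GridSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=30, random_state=1),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},   # same candidates as depth_list
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)
search.fit(X, y)

# Best depth and the mean cross-validated MAE for every candidate
print(search.best_params_)
print(-search.cv_results_["mean_test_score"])
```

The same param_grid can hold several hyperparameters at once, which becomes useful when more than max_depth is tuned.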

## Conclusion

We first use pipelines to preprocess the input data.

Then, we compare ML algorithms to find out which one scores highest, which is XGBRegressor.

Then, we apply RepeatedKFold cross-validation to find the most appropriate way to measure accuracy.

Finally, we tune the max_depth hyperparameter to improve the accuracy further.