# Introduction

In this notebook I train several regression models on two versions of the data: 'all features' and 'selected features'.

Some features may be especially important and have a large impact on housing prices.

Feature selection techniques are used to improve a model's accuracy.

In this notebook we look at how much the accuracy of our models improves after feature selection.

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [None]:
df = pd.read_csv("/kaggle/input/house/Housing.csv")

# Superficial analysis of the dataset

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

## Checking for null values and replacing them with the mean of each feature
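
As a minimal sketch of what mean imputation does, the toy frame below (hypothetical column names, not the housing data) has one gap in each feature column. Each gap is filled with the mean of its own column and the target column is left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    "rooms": [6.0, np.nan, 7.0, 5.0],
    "age": [30.0, 45.0, np.nan, 60.0],
    "price": [24.0, 21.0, 34.0, 18.0],
})

# Count missing values per column before imputing.
missing_before = toy.isna().sum()

# Fill every feature column (all but the last, the target) with its own mean.
feature_cols = toy.columns[:-1]
toy[feature_cols] = toy[feature_cols].fillna(toy[feature_cols].mean())

missing_after = toy.isna().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```

Note that computing the means on the full frame before splitting lets test rows influence the fill values; computing them on the training split only is a common alternative.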

In [None]:
df.isna().sum()

In [None]:
# Fill missing values in each feature column (all but the target) with that column's mean
for i in df.columns[:-1]:
    df[i] = df[i].fillna(df[i].mean())

In [None]:
sns.pairplot(df, corner=True)

In [None]:
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)

# Evaluation Pipeline

The evaluation pipeline automates training and evaluation, so the same steps do not have to be repeated by hand for every model.
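
The three metrics the pipeline reports can be checked by hand on a few made-up numbers. The sketch below uses toy arrays, not the housing data. scikit-learn's metric functions take `(y_true, y_pred)` in that order: MAE and MSE are symmetric in their arguments, but `r2_score` is not, so swapping them changes the result.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values chosen so the metrics are easy to verify by hand.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 3) / 4
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 9) / 4
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot, SS_tot from y_true

# Swapping the arguments normalises by the variance of the predictions instead.
r2_swapped = r2_score(y_pred, y_true)

print("MAE:", mae, "MSE:", mse, "r2:", r2, "r2 (swapped):", r2_swapped)
```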

In [None]:
class Evaluation:
    def __init__(self, train, test):
        self.train = train
        self.test = test
        
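    def evaluate(self, model, name):
        # Score on the held-out test split, not on the data the model was fit on
        x, y = self.test
        y_pred = model.predict(x)
        # scikit-learn metrics take (y_true, y_pred); the order matters for r2_score
        mae = mean_absolute_error(y, y_pred)
        mse = mean_squared_error(y, y_pred)
        r2 = r2_score(y, y_pred)
        print(name, "\n", "-"*20)
        print("MAE: {}\nMSE: {}\nr2: {}".format(mae, mse, r2))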
        
    def training(self, model, name):
        x, y = self.train
        model.fit(x, y)
        self.evaluate(model, name)
        return model

## Defining models, tuning their hyperparameters

In [None]:
lnr = LinearRegression()
rfr = RandomForestRegressor(n_estimators=150, max_depth=115, criterion='friedman_mse',
                           max_features='log2')
dtr = DecisionTreeRegressor(max_depth=110,criterion='friedman_mse')
svr = SVR(C=0.7)
abr = AdaBoostRegressor(n_estimators=50, learning_rate=0.5)
xgb = XGBRegressor(n_estimators=1000, max_depth=11, eta=0.31)

models = [lnr, rfr, dtr, svr, abr, xgb]
names = ['Linear Regression', 'Random Forest Regressor',
        'Decision Tree Regressor', 'SVR',
        'Ada Boost Regressor', 'XGBRegressor']

assesment = Evaluation((x_train, y_train), (x_test, y_test))

## Evaluation of models

In [None]:
trained = []
for i, j in zip(models, names):
    trained += [assesment.training(i, j)]
    print()

# Feature Selection

We will perform Feature Selection using Feature Importance techniques.

We are performing regression with mostly numerical inputs, which means we can use either Pearson's or Spearman's correlation.
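
The difference between the two coefficients can be seen on a small synthetic example (toy columns, not the housing data): Pearson measures linear association, while Spearman works on ranks and so captures any monotonic relationship.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one exactly linear in x, one monotonic but strongly curved.
x = np.arange(1, 21, dtype=float)
toy = pd.DataFrame({"x": x, "linear": 2 * x + 1, "cubic": x ** 3})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# Pearson is 1 only for the linear column; Spearman is 1 for both,
# because both columns preserve the ordering of x.
print(pearson.loc["x"])
print(spearman.loc["x"])
```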

![image](https://machinelearningmastery.com/wp-content/uploads/2019/11/How-to-Choose-Feature-Selection-Methods-For-Machine-Learning.png)

## Pearson's correlation

I decided to use Pearson's correlation coefficients.

Judging from the pairplot, most features show a monotonic or linear relationship, with linear relationships dominating, which is why Pearson's correlation is a reasonable basis for feature selection here.

![formula](https://editor.analyticsvidhya.com/uploads/39170Formula.JPG)
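
As a rough sketch, the coefficient can be computed directly as the sum of products of the centred values divided by the square root of the product of their sums of squares. The helper name `pearson_r` and the sample values are made up for illustration; the result is compared against NumPy's `np.corrcoef`.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson's r from centred values (illustrative helper, not from the notebook)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

# Made-up sample values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r, r_numpy)
```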

In [None]:
corr_matrix = df.corr(method='pearson')
sns.heatmap(corr_matrix, annot=True)

## Threshold

Let's classify ranges of Pearson's correlation coefficient.

A coefficient with an absolute value from 0.4 to 0.59 inclusive is considered moderate.

If we require an absolute correlation above 0.6 we are left with very few features, so we keep features whose absolute correlation with the target is greater than 0.4.

![pearson](https://www.researchgate.net/profile/Mahiswaran-Selvanathan/publication/345693737/figure/tbl1/AS:956412914040832@1605038016475/The-scale-of-Pearsons-Correlation-Coefficient.png)
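
The thresholding step can be sketched on synthetic data. The column names (`strong`, `weak`, `negative`, `target`) and the data below are assumptions made for illustration; only the 0.4 cut-off comes from the text above. Taking the absolute value keeps strongly negative correlations as well as strongly positive ones.

```python
import numpy as np
import pandas as pd

# Synthetic data: two columns related to the target, one pure noise.
rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
weak = rng.normal(size=n)
target = 3 * strong + rng.normal(scale=0.5, size=n)
negative = -strong + rng.normal(scale=0.5, size=n)

toy = pd.DataFrame(
    {"strong": strong, "weak": weak, "negative": negative, "target": target}
)

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr(method="pearson")["target"].drop("target")

# Keep features whose absolute correlation exceeds the threshold.
threshold = 0.4
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()
print(corr_with_target.round(2).to_dict())
print(selected)
```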

In [None]:
selected = []
for i in corr_matrix.index[:-1]:
    if abs(corr_matrix.loc[i, "MEDV"]) > 0.4:
        selected += [i]

In [None]:
sns.pairplot(df, vars=selected+['MEDV'], corner=True)

In [None]:
x_s = df.loc[:, selected].values
y_s = df.loc[:, 'MEDV'].values
xs_train, xs_test, ys_train, ys_test = train_test_split(x_s, y_s, random_state=42, test_size=0.2)

## Evaluation of Selected Features

Drastic improvement for:
* Random Forest Regressor
* Decision Tree Regressor
* Ada Boost Regressor
* XGBRegressor

Slight improvement for Linear Regression

SVR remains poor: its r^2 score is still negative
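
One way to keep an 'all features' vs 'selected features' comparison clean is to fit a fresh copy of the estimator for each feature set and score both on held-out data. The sketch below is an illustration only: it uses synthetic data from `make_regression`, and the column indices in `keep` are a hypothetical selection, not the features chosen above.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real dataset.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

base = LinearRegression()
keep = [0, 1, 2]  # hypothetical indices of the "selected" columns

# clone() returns an unfitted copy, so the two fits do not overwrite each other.
full_model = clone(base).fit(X_train, y_train)
sub_model = clone(base).fit(X_train[:, keep], y_train)

scores = {
    "all features": r2_score(y_test, full_model.predict(X_test)),
    "selected features": r2_score(y_test, sub_model.predict(X_test[:, keep])),
}
print(scores)
```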

In [None]:
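from sklearn.base import clone

assesment_selected = Evaluation((xs_train, ys_train), (xs_test, ys_test))
selected_trained = []
for i, j in zip(models, names):
    # Fit a fresh copy so the models stored in `trained` are not overwritten by the refit
    selected_trained += [assesment_selected.training(clone(i), j)]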

# Conclusion

In this dataset, **ALL** of the listed features affect the housing price.

However, to improve the regression models' accuracy we can apply preprocessing techniques such as scaling the data, outlier detection and feature selection.

Most of the sklearn algorithms used here perform just as well on data that has not been scaled with the Min-Max Scaler, the Standard Scaler, etc.

There are not many significant outliers.

Therefore the only thing left is feature selection, and I decided to use Pearson's correlation as the feature importance measure.

I compared the results of models trained on data that did not go through feature selection and models trained on selected features.

In the end, most models' performance was drastically improved.