In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('/kaggle/input/housing-prices-dataset/Housing.csv')
print(df.columns)
print(df.shape)

Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')
(545, 13)


### Data Preprocessing
Now we group the features by datatype (int, float, object) and count how many fall into each group.

In [3]:
obj = (df.dtypes == 'object')
object_cols = list(obj[obj].index)
print("Categorical variables:",len(object_cols))
 
int_ = (df.dtypes == 'int')
num_cols = list(int_[int_].index)
print("Integer variables:",len(num_cols))
 
fl = (df.dtypes == 'float')
fl_cols = list(fl[fl].index)
print("Float variables:",len(fl_cols))

Categorical variables: 7
Integer variables: 6
Float variables: 0
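Comparing `df.dtypes` against the strings `'int'` and `'float'` relies on the platform's default integer and float sizes, so the counts may differ between systems. A minimal sketch of a more explicit grouping, using a small made-up frame in place of the housing data (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "price": [13300000, 12250000, 12250000],
    "area": [7420, 8960, 9960],
    "mainroad": ["yes", "yes", "no"],
    "score": [0.5, 1.5, 2.5],
})

# Integer and float columns, whatever their bit width.
int_cols = [c for c in toy.columns if is_integer_dtype(toy[c])]
float_cols = [c for c in toy.columns if is_float_dtype(toy[c])]

# Everything that is not numeric is treated as categorical here.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical variables:", len(cat_cols))
print("Integer variables:", len(int_cols))
print("Float variables:", len(float_cols))
```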


[Data Cleaning](https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/) is the process of improving the data by removing incorrect, corrupted or irrelevant records.

Some datasets contain columns that are irrelevant for model training, and those can be dropped before training. There are 2 approaches to dealing with empty/null values:

*   Deleting the column/row (if the feature or record is not very important).
    
*   Filling the empty slots with mean/mode/0/NA/etc. (depending on the dataset requirement).
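The two approaches can be sketched on a small made-up frame. The `toy` frame and its values are assumptions for illustration and are not taken from the housing data:

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row (values are made up).
toy = pd.DataFrame({
    "area": [7420.0, np.nan, 9960.0, 7500.0],
    "furnishingstatus": ["furnished", None, "furnished", "unfurnished"],
})

# Approach 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Approach 2: fill numeric gaps with the mean and categorical gaps with the mode.
filled = toy.copy()
filled["area"] = filled["area"].fillna(filled["area"].mean())
filled["furnishingstatus"] = filled["furnishingstatus"].fillna(
    filled["furnishingstatus"].mode()[0]
)

print(len(dropped), "rows left after dropping")
print(filled)
```

Dropping is simpler but discards data, while filling keeps every row at the cost of introducing estimated values.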

In [23]:
df.head()



In [24]:
# Check if there are any missing values in the DataFrame
has_missing = df.isnull().values.any()
print(has_missing)

False


OneHotEncoder – For Categorical Features
----------------------------------------

One-hot encoding is a common way to convert categorical data into binary vectors: each category becomes its own column holding 0 or 1. By using [OneHotEncoder](https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/), we can convert object data into numeric columns. To do that, we first collect all the features that have the object datatype, using a boolean mask on the column dtypes.

In [26]:
from sklearn.preprocessing import OneHotEncoder

s = (df.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
print('No. of. categorical features: ', 
	len(object_cols))


Categorical variables:
['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea', 'furnishingstatus']
No. of. categorical features:  7


Once we have the list of categorical features, we can apply one-hot encoding to all of them at once.

In [30]:
OH_encoder = OneHotEncoder(sparse_output=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(df[object_cols]))
OH_cols.index = df.index
OH_cols.columns = OH_encoder.get_feature_names_out()
df_final = df.drop(object_cols, axis=1)
df_final = pd.concat([df_final, OH_cols], axis=1)


Splitting Dataset into Training and Testing
-------------------------------------------

X and Y splitting (i.e. Y is the price column and the remaining columns are X)

In [31]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = df_final.drop(['price'], axis=1)
Y = df_final['price']

# Split the data into
# training and validation sets
X_train, X_valid, Y_train, Y_valid = train_test_split(
	X, Y, train_size=0.8, test_size=0.2, random_state=0)
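As a quick sanity check on the split sizes, the sketch below applies the same `train_test_split` settings to placeholder arrays with 545 rows, the row count reported by `df.shape`. The array contents are made up; only the shapes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 545  # same row count as the housing data

# Placeholder features and target (contents are arbitrary).
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0
)

print(X_tr.shape, X_va.shape)
```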


### **Linear Regression**

Linear Regression predicts the dependent (output) value as a weighted sum of the given independent features. Here, we have to predict price from features like area, bedrooms, bathrooms, stories, mainroad, guestroom, basement, hotwaterheating, airconditioning, parking, prefarea and furnishingstatus.

In [33]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error  # Import the required function


model_LR = LinearRegression()
model_LR.fit(X_train, Y_train)
Y_pred = model_LR.predict(X_valid)

print(mean_absolute_percentage_error(Y_valid, Y_pred))


0.16035195155220616
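The printed value is a fraction rather than a percentage, so 0.160 means the predictions are off by about 16% of the true price on average. A minimal sketch of how the metric is computed, using made-up numbers and checking the result against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up true values and predictions.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# MAPE = mean of |true - predicted| / |true|
manual_mape = np.mean(np.abs((y_true - y_pred) / y_true))
sklearn_mape = mean_absolute_percentage_error(y_true, y_pred)

print(manual_mape, sklearn_mape)
```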


### **SVM – Support vector Machine**

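SVM can be used for both regression and classification. It finds a hyperplane in the n-dimensional feature space; for regression we use the SVR variant, which fits that hyperplane so that most points lie within a margin around it.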

In [34]:
from sklearn import svm
from sklearn.metrics import mean_absolute_percentage_error

model_SVR = svm.SVR()
model_SVR.fit(X_train,Y_train)
Y_pred = model_SVR.predict(X_valid)

print(mean_absolute_percentage_error(Y_valid, Y_pred))


0.2710074432862681
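One likely reason for the higher error is that SVR with default settings is sensitive to the scale of both the features and the target, and prices here are in the millions. The sketch below illustrates the effect on synthetic data (the data-generating formula is an assumption, not the housing dataset) by standardizing features with a pipeline and the target with `TransformedTargetRegressor`. It is not a tuned model.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic "housing-like" data: price grows with area and bedrooms.
rng = np.random.default_rng(0)
n = 500
area = rng.uniform(1500, 10000, n)
bedrooms = rng.integers(1, 6, n)
X_demo = np.column_stack([area, bedrooms])
y_demo = 500 * area + 300_000 * bedrooms + rng.normal(0, 200_000, n)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# SVR on raw features and raw target.
raw_svr = SVR().fit(X_tr, y_tr)
mape_raw = mean_absolute_percentage_error(y_va, raw_svr.predict(X_va))

# SVR with standardized features and standardized target.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_svr.fit(X_tr, y_tr)
mape_scaled = mean_absolute_percentage_error(y_va, scaled_svr.predict(X_va))

print("raw:", mape_raw)
print("scaled:", mape_scaled)
```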


### **Random Forest Regression**

Random Forest is an ensemble technique that combines the predictions of multiple decision trees and can be used for both regression and classification tasks.

In [35]:
from sklearn.ensemble import RandomForestRegressor

model_RFR = RandomForestRegressor(n_estimators=10)
model_RFR.fit(X_train, Y_train)
Y_pred = model_RFR.predict(X_valid)

mean_absolute_percentage_error(Y_valid, Y_pred)


0.18611592106068287
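Because the forest above is built without a fixed seed, its error will vary slightly from run to run. A minimal sketch on synthetic data (the data is made up) showing that passing `random_state` makes the result repeatable:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

# Two forests with the same seed are trained on identical bootstrap samples.
forest_a = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)
forest_b = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

same_predictions = np.array_equal(forest_a.predict(X_demo), forest_b.predict(X_demo))
print(same_predictions)
```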

**Note:** On this validation split, Linear Regression gives the lowest error (MAPE of about 0.160, compared with 0.186 for Random Forest and 0.271 for SVR).
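A single train/validation split can favour one model by chance, so a comparison across several folds is usually more reliable. The sketch below shows one way to do that with `cross_val_score` on synthetic data (the data and the two-model dictionary are assumptions for illustration). It says nothing about which model would win on the housing data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a strictly positive target so MAPE is well defined.
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 10, size=(300, 3))
y_demo = 50 + 4 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 1, 300)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

cv_mape = {}
for name, model in models.items():
    # scikit-learn reports this metric negated, so flip the sign back.
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5,
        scoring="neg_mean_absolute_percentage_error",
    )
    cv_mape[name] = -scores.mean()

for name, value in cv_mape.items():
    print(name, round(value, 4))
```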