# Introduction

Choosing furniture for your household is an important decision, especially if you are a new family moving into a new place to put down roots.

But managing your budget is also important.

In this post, I use ML algorithms to predict prices for furniture listed on the Jamia website.

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [None]:
df = pd.read_csv("/kaggle/input/furniture-price-prediction/Furniture Price Prediction.csv")

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.isna().sum()

In [None]:
# Keep rows with a missing price separately (copied so they can be modified later)
absent = df[df['price'].isnull()].copy()
df.dropna(inplace=True)

## There are more unrated furniture items than rated ones

In [None]:
non_rated = df[df['rate'] == 0.0].copy()

In [None]:
non_rated.head()

In [None]:
non_rated['price'].plot.hist(bins=20)

In [None]:
rated = df[df['rate'] > 0.0].copy()

In [None]:
rated['price'].plot.hist(bins=15)

In [None]:
# Count of non-rated vs rated items
print(non_rated.shape)
print(rated.shape)

## Pre-processing of the dataset

The furniture name and type columns contain a lot of unique values. I assumed they are not a good fit for Label Encoder, so I decided to leave these features out.
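To see how many distinct values a column holds, one option is to count them per column. The sketch below runs on a small made-up frame; the column names `furniture` and `type` are assumptions for illustration and may not match the names in the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset; column names are hypothetical.
toy = pd.DataFrame({
    "furniture": ["Oak table", "Pine chair", "Oak table", "Sofa bed"],
    "type": ["Table", "Chair", "Table", "Sofa"],
    "rate": [4.5, 0.0, 3.0, 0.0],
})

# Number of distinct values in each column. A count close to the number
# of rows means almost every row would receive its own label.
cardinality = toy.nunique()
print(cardinality)
```

On the real data, the same call would be `df.nunique()`.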

The only features I consider suitable for training the models are:
* rate
* delivery
* sale

However, the sale feature needs additional pre-processing: we need to remove the "%" sign.

In [None]:
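def sales_process(x):
    # Convert a discount string such as "25%" to the integer 25
    if isinstance(x, str) and "%" in x:
        return int(x.strip().rstrip("%"))

    # Leave anything else (numbers, missing values) unchanged
    return x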

In [None]:
df['sale'] = df['sale'].apply(sales_process)
absent['sale'] = absent['sale'].apply(sales_process)
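As an alternative, the same conversion can be written with pandas string methods instead of `apply`. This is only a sketch on made-up values, and it assumes every entry is a string ending in "%".

```python
import pandas as pd

# Made-up discount strings in the same format as the sale column.
sale = pd.Series(["25%", "5%", "0%"])

# Strip the trailing "%" and cast the remaining digits to integers.
sale_int = sale.str.rstrip("%").astype(int)
print(sale_int.tolist())
```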

In [None]:
# selecting the most important features
df = df[['rate', 'delivery', 'sale', 'price']]

## Model training and hyperparameter tuning phase

In [None]:
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)

In [None]:
def model_evaluate(model, name, plot=True):
    # Predict on the held-out test set and report MAE and MSE
    y_pred = model.predict(x_test)
    # scikit-learn metrics expect (y_true, y_pred) in that order
    tit = name + "\nMAE:{}\nMSE:{}".format(mean_absolute_error(y_test, y_pred), mean_squared_error(y_test, y_pred))
    if plot:
        # Plot the predicted price against each feature
        dd = pd.DataFrame(x_test, columns=['rate', 'delivery', 'sale'])
        dd['price'] = y_pred
        sns.pairplot(dd, x_vars=['rate', 'delivery', 'sale'], y_vars=['price'])
        plt.suptitle(tit)
        plt.tight_layout()
    else:
        print(tit)

In [None]:
def model_train(model, name):
    model.fit(x_train, y_train)
    model_evaluate(model, name, False)
    return model

In [None]:
lr = LinearRegression()
rfr = RandomForestRegressor(n_estimators=150, max_depth=60)
svr = SVR(kernel='linear', C=0.6)
abr = AdaBoostRegressor(n_estimators=45, learning_rate=0.3)
knr = KNeighborsRegressor(n_neighbors=10)
models = [lr, rfr, svr, abr, knr]
names = ['Linear Regression', 'Random Forest Regressor', 'Support Vector Regression',
        'Ada Boost Regressor', 'KNeighbors Regressor']

In [None]:
mls = []
for i, j in zip(models, names):
    mls.append(model_train(i, j))
    print()

In [None]:
dt = pd.DataFrame(x_test, columns=['rate', 'delivery', 'sale'])
dt['price'] = y_test

# Evaluate trained models

### I used Mean Absolute Error (MAE)

![MAE](https://www.codingprof.com/wp-content/uploads/2021/12/Formula_MeanAbsoluteError.png)

### and Mean Squared Error (MSE)

![MSE](https://cdn-media-1.freecodecamp.org/images/hmZydSW9YegiMVPWq2JBpOpai3CejzQpGkNG)
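MAE averages the absolute differences between true and predicted values, while MSE averages the squared differences. A minimal sketch of both, computed by hand with NumPy on made-up numbers (not values from the dataset):

```python
import numpy as np

# Made-up true and predicted prices, for illustration only.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_true - y_pred          # [-10, 10, -30]

# MAE: mean of absolute errors -> (10 + 10 + 30) / 3
mae = np.mean(np.abs(errors))

# MSE: mean of squared errors -> (100 + 100 + 900) / 3
mse = np.mean(errors ** 2)

print(mae, mse)
```

Because MSE squares each error, the single 30-unit miss dominates it far more than it dominates MAE.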

In [None]:
sns.pairplot(dt, x_vars=['rate', 'delivery', 'sale'], y_vars=['price'])
plt.suptitle("Original")
plt.tight_layout()
for i, j in zip(mls, names):
    model_evaluate(i, j)

# Conclusion

## The best and most consistent models are KNeighborsRegressor with K=10 and Random Forest Regressor with max_depth=60 and n_estimators=150

However, keep in mind that the only features used are rate, delivery, and sale.

Other factors, such as brand and type of furniture, also matter. However, I assumed that using Label Encoder would be wrong, since the furniture type column would have 500+ unique labels.
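One way to check how consistent a model is, beyond a single train/test split, is k-fold cross-validation. The sketch below is not part of the original notebook: it uses synthetic data as a stand-in for the three features and the price, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for (rate, delivery, sale) and price; an assumption
# made purely so that the example runs on its own.
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 3))
y = 50 + 2.0 * X[:, 2] + rng.normal(0, 10, size=200)

candidates = {
    "KNeighbors Regressor": KNeighborsRegressor(n_neighbors=10),
    "Random Forest Regressor": RandomForestRegressor(
        n_estimators=150, max_depth=60, random_state=42),
}

cv_mae = {}
for name, model in candidates.items():
    # scikit-learn reports negated MAE so that higher is better; flip the sign.
    scores = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_absolute_error")
    cv_mae[name] = scores
    print(f"{name}: MAE mean={scores.mean():.2f}, std={scores.std():.2f}")
```

A small standard deviation across folds suggests the model's error does not depend much on which rows end up in the test set.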