<h1 style='color:purple' align='center'>Predicting House Prices in Indonesia

The dataset contains about 1,000 house listings and is available at
https://www.kaggle.com/datasets/wisnuanggara/daftar-harga-rumah

In [135]:
import warnings
warnings.filterwarnings('ignore')

import os
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib 
matplotlib.rcParams["figure.figsize"] = (20,10)

In [136]:
df1 = pd.read_excel("data/DATA RUMAH.xlsx")
df1.head()

,NO,NAMA RUMAH,HARGA,LB,LT,KT,KM,GRS
0,1,"Rumah Murah Hook Tebet Timur, Tebet, Jakarta S...",3800000000,220,220,3,3,0
1,2,"Rumah Modern di Tebet dekat Stasiun, Tebet, Ja...",4600000000,180,137,4,3,2
2,3,"Rumah Mewah 2 Lantai Hanya 3 Menit Ke Tebet, T...",3000000000,267,250,4,4,4
3,4,"Rumah Baru Tebet, Tebet, Jakarta Selatan",430000000,40,25,2,2,0
4,5,"Rumah Bagus Tebet komp Gudang Peluru lt 350m, ...",9000000000,400,355,6,5,3


In [137]:
df1 = df1.drop(columns=['NO', 'NAMA RUMAH'])
df1.head()

,HARGA,LB,LT,KT,KM,GRS
0,3800000000,220,220,3,3,0
1,4600000000,180,137,4,3,2
2,3000000000,267,250,4,4,4
3,430000000,40,25,2,2,0
4,9000000000,400,355,6,5,3


In [138]:
train_df = df1.copy()
# Price per square metre of building area (LB = luas bangunan)
train_df["harga_per_m2"] = df1["HARGA"] / df1["LB"]

</h1>

In [139]:
# reset_index() keeps the old index as an 'index' column; it is dropped before modelling
train_df = train_df.reset_index()

## Outlier Removal

In [140]:
train_df['harga_per_m2'].describe()

count    1.010000e+03
mean     2.649793e+07
std      1.508695e+07
min      2.604167e+06
25%      1.791957e+07
50%      2.222222e+07
75%      3.000000e+07
max      2.166667e+08
Name: harga_per_m2, dtype: float64

The lowest price per m2 is about 2.6 million rupiah and the highest is about 216.7 million rupiah. <br>
Outliers are therefore removed using the mean and standard deviation: only rows whose price per m2 lies within one standard deviation of the mean are kept.

In [141]:
m = np.mean(train_df['harga_per_m2'])
st = np.std(train_df['harga_per_m2'])

In [142]:
train_df = train_df[train_df['harga_per_m2'] > (m-st)]

In [143]:
train_df = train_df[train_df['harga_per_m2'] <= (m+st)]

In [144]:
train_df.shape

(859, 8)

After filtering, 859 rows remain within one standard deviation of the mean.
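
The two filtering steps above can be folded into one reusable helper. The sketch below is only an illustration: the function name `remove_outliers_by_std`, its `k` parameter, and the toy data are assumptions, not part of the original notebook. It keeps the same bounds as the cells above (strictly greater than mean - std, at most mean + std) and uses the population standard deviation, which is what `np.std` computes by default.

```python
import pandas as pd


def remove_outliers_by_std(df, column, k=1.0):
    """Keep rows whose `column` value lies in (mean - k*std, mean + k*std].

    Hypothetical helper; k=1.0 reproduces the bounds used in the cells above.
    """
    values = df[column].to_numpy()
    m = values.mean()
    st = values.std()  # population std (ddof=0), matching np.std
    mask = (df[column] > m - k * st) & (df[column] <= m + k * st)
    return df[mask]


# Toy data (made up for illustration): one value sits far above the rest.
toy = pd.DataFrame({"harga_per_m2": [20e6, 22e6, 25e6, 27e6, 30e6, 200e6]})
filtered = remove_outliers_by_std(toy, "harga_per_m2")
print(len(toy), "->", len(filtered))
```

A single boolean mask avoids the intermediate DataFrame, and exposing `k` makes it easy to try a wider band such as two standard deviations.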

## Modelling

In [145]:
X = train_df.drop(columns=['HARGA', 'index', 'harga_per_m2'])
y = train_df['HARGA']

In [146]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)

Use GridSearchCV to find the best model

In [147]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import ShuffleSplit

def find_best_model_using_gridsearchcv(X,y):
    algos = {
        'linear_regression' : {
            'model': LinearRegression(),
            'params': {
                # `normalize` was removed in scikit-learn 1.2; this grid needs an older release
                'normalize': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                # 'mse' was renamed 'squared_error' in scikit-learn 1.0 and removed in 1.2
                'criterion' : ['mse','friedman_mse'],
                'splitter': ['best','random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X,y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['model','best_score','best_params'])

find_best_model_using_gridsearchcv(X,y)

,model,best_score,best_params
0,linear_regression,0.830733,{'normalize': True}
1,lasso,0.830733,"{'alpha': 2, 'selection': 'random'}"
2,decision_tree,0.75226,"{'criterion': 'mse', 'splitter': 'random'}"


Linear regression is used for modelling because it has the best cross-validation score, 0.83 (R², tied with Lasso in the table above).

In [148]:
from sklearn.linear_model import LinearRegression
lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)
lr_clf.score(X_test,y_test)

0.8585718646063245
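
The `score` call above reports R², which says how much variance the model explains but not how far off a typical prediction is. `mean_absolute_percentage_error` is imported at the top of the notebook and could fill that gap. The sketch below shows one way to use it; the synthetic data and the `_demo` variable names are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Synthetic data (made up): price grows linearly with two area-like features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(50, 400, size=(300, 2))
y_demo = 1e9 + 1.5e7 * X_demo[:, 0] + 5e6 * X_demo[:, 1] + rng.normal(0, 2e8, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)
model = LinearRegression().fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)  # coefficient of determination
mape = mean_absolute_percentage_error(y_te, model.predict(X_te))  # a fraction, not a percent
print(f"R^2 = {r2:.3f}, MAPE = {mape:.1%}")
```

On the real data the equivalent call would be `mean_absolute_percentage_error(y_test, lr_clf.predict(X_test))`.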

In [149]:
from sklearn.tree import DecisionTreeRegressor 

regressor = DecisionTreeRegressor(random_state = 0) 
regressor.fit(X_train, y_train)


DecisionTreeRegressor(random_state=0)

In [150]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

cross_val_score(LinearRegression(), X, y, cv=cv)

array([0.86573303, 0.8864129 , 0.76008866, 0.82255773, 0.81887252])
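
The five fold scores are easier to compare with other models once they are condensed into a mean and a spread. The short sketch below does that with the values copied from the output above; the variable names are illustrative. The mean works out to about 0.8307, the same value the grid search reported for linear regression, which is expected because both use the same `ShuffleSplit` configuration.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above.
scores = np.array([0.86573303, 0.8864129, 0.76008866, 0.82255773, 0.81887252])

mean_score = scores.mean()
std_score = scores.std()
print(f"R^2 = {mean_score:.3f} +/- {std_score:.3f}")
```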

In [151]:
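def predict_price(lb, lt, kt, km, grs):
    """Predict the price in rupiah of one house with the fitted linear model."""
    # Build a one-row frame so feature names and order match the training columns
    x = pd.DataFrame([[lb, lt, kt, km, grs]], columns=X.columns)
    return lr_clf.predict(x)[0]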

In [152]:
predict_price(180,137,4,3,0)

3244487002.6515055

## Export the tested model to a pickle file

In [153]:
import pickle
with open('../server/models/home_price_prediction.pickle','wb') as f:
    pickle.dump(lr_clf, f)
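
To confirm that the exported file is usable, the model can be loaded back and compared with the in-memory one. The sketch below shows the round trip under stated assumptions: it fits a small stand-in model on made-up data and writes to a temporary file, whereas the notebook pickles `lr_clf` to a path under `../server/models/`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on made-up data; the notebook pickles `lr_clf` instead.
X_demo = np.array([[100, 90], [150, 120], [200, 180], [250, 200]], dtype=float)
y_demo = 1e7 * X_demo[:, 0] + 5e6 * X_demo[:, 1]
model = LinearRegression().fit(X_demo, y_demo)

# Write to a temporary file for this demo.
path = os.path.join(tempfile.mkdtemp(), "model_demo.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the model back and check that it predicts the same values.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("round trip ok:", same)
```

Pickled estimators should be loaded with the same scikit-learn version that wrote them, and only from files you trust.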