## Regression Trees

In [1]:
# Pandas will allow us to create a dataframe of the data so it can be used and manipulated
import pandas as pd
# Regression Tree Algorithm
from sklearn.tree import DecisionTreeRegressor
# Split our data into training and testing sets
from sklearn.model_selection import train_test_split

## Project Aim

##### In this project, I am a data scientist working for a real estate company that is planning to invest in Malaysian real estate. I have collected information about rental listings in Kuala Lumpur and Selangor and am tasked with creating a model that can predict the monthly rent of a property so it can be used to make offers.

In [2]:
data = pd.read_csv('/kaggle/input/rent-pricing-kuala-lumpur-malaysi/mudah-apartment-kl-selangor.csv')


In [3]:
data.head()
data.shape
data.isna().sum()

ads_id                      0
prop_name                 948
completion_year          9185
monthly_rent                2
location                    0
property_type               0
rooms                       6
parking                  5702
bathroom                    6
size                        0
furnished                   5
facilities               2209
additional_facilities    5948
region                      0
dtype: int64

## Data Pre-Processing

In [4]:
data.dropna(inplace=True)

In [5]:
# Strip the currency prefix, the " per month" suffix, and any remaining spaces from the rent
data['monthly_rent'] = (data['monthly_rent'].str.replace('RM ', '', regex=False)
                        .str.replace(' per month', '', regex=False)
                        .str.replace(' ', '', regex=False)
                        .astype(int))
# Strip the unit from the size and convert it to a number
data['size'] = pd.to_numeric(data['size'].str.replace('sq.ft.', '', regex=False).str.strip())
# Drop identifier and free-text columns that the model will not use
data.drop(columns=['location', 'ads_id', 'prop_name', 'region'], inplace=True)
# Encode each property type as an integer code
data['property_type'] = pd.factorize(data['property_type'])[0]
# Replace each comma-separated list with the number of items it contains
data['furnished'] = data['furnished'].str.split(',').str.len()
data['facilities'] = data['facilities'].str.split(',').str.len()
data['additional_facilities'] = data['additional_facilities'].str.split(',').str.len()
# Cast every numeric column to int
numeric_cols = data.select_dtypes(include=['float', 'int']).columns
data[numeric_cols] = data[numeric_cols].fillna(0).astype(int)

data.head()


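| | completion_year | monthly_rent | property_type | rooms | parking | bathroom | size | furnished | facilities | additional_facilities |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 4200 | 0 | 5 | 2 | 6 | 1842 | 1 | 10 | 3 |
| 3 | 2020 | 1700 | 1 | 2 | 1 | 2 | 743 | 1 | 8 | 3 |
| 7 | 2018 | 1550 | 2 | 1 | 1 | 1 | 700 | 1 | 7 | 4 |
| 8 | 2014 | 1400 | 1 | 2 | 1 | 1 | 750 | 1 | 5 | 4 |
| 11 | 2022 | 2000 | 2 | 4 | 2 | 2 | 1100 | 1 | 9 | 4 |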
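To make the string clean-up easier to follow, here is a minimal sketch of the same steps on a tiny hand-made frame. The sample values in `raw` are hypothetical and only assume that the raw columns look like `RM 4 200 per month` and `1842 sq.ft.`; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows, formatted the way the raw listing columns are assumed to look
raw = pd.DataFrame({
    "monthly_rent": ["RM 4 200 per month", "RM 1 700 per month"],
    "size": ["1842 sq.ft.", "743 sq.ft."],
    "facilities": ["Gym, Pool, Parking", "Gym"],
})

clean = pd.DataFrame({
    # Remove the prefix, the suffix, and inner spaces, then convert to int
    "monthly_rent": (raw["monthly_rent"]
                     .str.replace("RM ", "", regex=False)
                     .str.replace(" per month", "", regex=False)
                     .str.replace(" ", "", regex=False)
                     .astype(int)),
    # Remove the unit and surrounding whitespace, then convert to a number
    "size": pd.to_numeric(raw["size"].str.replace("sq.ft.", "", regex=False).str.strip()),
    # Count the comma-separated entries instead of keeping the text
    "facilities": raw["facilities"].str.split(",").str.len(),
})

print(clean)
```

Counting the entries keeps the feature numeric, at the cost of ignoring which facilities a listing actually has.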


In [6]:
X = data.drop(columns=["monthly_rent"])
Y = data["monthly_rent"]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)


## Create Regression Tree

Regression trees are implemented using `DecisionTreeRegressor` from `sklearn.tree`.

The important parameters of `DecisionTreeRegressor` are:

`criterion`: {"squared_error", "friedman_mse", "absolute_error", "poisson"} - The function used to measure the quality of a split (older scikit-learn releases named the first and third options "mse" and "mae")

`max_depth` - The maximum depth the tree can reach

`min_samples_split` - The minimum number of samples required to split a node

`min_samples_leaf` - The minimum number of samples that a leaf can contain

`max_features`: {"sqrt", "log2"}, an int, a float, or `None` - The number of features examined when looking for the best split; lowering it speeds up training


In [7]:
regression_tree = DecisionTreeRegressor(criterion = 'squared_error')

## Training

In [8]:
regression_tree.fit(X_train, Y_train)

## Evaluation


In [9]:
regression_tree.score(X_test, Y_test)

-12.478508927275906
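For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. A value of 1 is a perfect fit, 0 matches always predicting the mean of the test targets, and a negative value means the model does worse than that baseline. The sketch below reproduces the formula by hand on made-up numbers (the arrays are hypothetical, not rents from the dataset) and compares it with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and two sets of predictions
y_true = np.array([1000.0, 1500.0, 2000.0, 2500.0])
y_good = np.array([1100.0, 1400.0, 2100.0, 2400.0])  # close to the targets
y_bad = np.array([2500.0, 1000.0, 5000.0, 1000.0])   # far off, including one extreme value

def r2_manual(y, y_hat):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r2_manual(y_true, y_good), r2_score(y_true, y_good))  # both about 0.968
print(r2_manual(y_true, y_bad), r2_score(y_true, y_bad))    # both about -10.0
```

Because the residuals are squared, a handful of very large misses can push the score far below zero, even when most predictions are reasonable.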

Next, we compute the mean absolute error on the test set, i.e. the average amount by which a rent prediction is off.


In [10]:
prediction = regression_tree.predict(X_test)

# Rents are in Malaysian ringgit (RM)
print("RM", round((prediction - Y_test).abs().mean(), 2))

RM 563.17
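The expression `(prediction - Y_test).abs().mean()` is the mean absolute error (MAE). As a small cross-check, the sketch below computes it both by hand and with `sklearn.metrics.mean_absolute_error`; the numbers and the `_demo` names are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical true rents and predictions
y_test_demo = pd.Series([1500, 2000, 4200])
pred_demo = pd.Series([1400, 2300, 4000]).to_numpy()  # predict() returns a NumPy array

# Manual MAE: average of the absolute differences
manual_mae = (pred_demo - y_test_demo).abs().mean()

# Same metric from scikit-learn
library_mae = mean_absolute_error(y_test_demo, pred_demo)

print(manual_mae, library_mae)  # 200.0 200.0
```

Unlike $R^2$, the MAE is expressed in the same unit as the target, which makes it easier to relate to an actual rent.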


In [11]:
regression_tree = DecisionTreeRegressor(criterion = 'squared_error')
regression_tree.fit(X_train, Y_train)


In [12]:
regression_tree.score(X_test, Y_test)


-170.6035249469245

In [13]:
prediction = regression_tree.predict(X_test)
# monthly_rent is already in ringgit, so the error needs no extra scaling
print("RM", round((prediction - Y_test).abs().mean(), 2))

RM 956.55
