Let's first load the dataset and see what information we can get from it.

In [1]:
import pandas as pd

# Load the dataset
file_path = 'domain_properties.csv'
data = pd.read_csv(file_path)

# Display the first few rows and a column summary to understand the dataset's structure
data.head(), data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11160 entries, 0 to 11159
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   price                     11160 non-null  int64  
 1   date_sold                 11160 non-null  object 
 2   suburb                    11160 non-null  object 
 3   num_bath                  11160 non-null  int64  
 4   num_bed                   11160 non-null  int64  
 5   num_parking               11160 non-null  int64  
 6   property_size             11160 non-null  int64  
 7   type                      11160 non-null  object 
 8   suburb_population         11160 non-null  int64  
 9   suburb_median_income      11160 non-null  int64  
 10  suburb_sqkm               11160 non-null  float64
 11  suburb_lat                11160 non-null  float64
 12  suburb_lng                11160 non-null  float64
 13  suburb_elevation          11160 non-null  int64  
 14  cash_r

(    price date_sold         suburb  num_bath  num_bed  num_parking  \
 0  530000   13/1/16      Kincumber         4        4            2   
 1  525000   13/1/16     Halekulani         2        4            2   
 2  480000   13/1/16  Chittaway Bay         2        4            2   
 3  452000   13/1/16        Leumeah         1        3            1   
 4  365500   13/1/16    North Avoca         0        0            0   
 
    property_size         type  suburb_population  suburb_median_income  \
 0           1351        House               7093                 29432   
 1            594        House               2538                 24752   
 2            468        House               2028                 31668   
 3            344        House               9835                 32292   
 4           1850  Vacant land               2200                 45084   
 
    suburb_sqkm  suburb_lat  suburb_lng  suburb_elevation  cash_rate  \
 0        9.914   -33.47252   151.40208         

Next, we train a CatBoost model and check its performance.

In [5]:
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the dataset
data = pd.read_csv('domain_properties.csv')

# Convert 'date_sold' to datetime (dates are day/month/two-digit year, e.g. 13/1/16)
data['date_sold'] = pd.to_datetime(data['date_sold'], format='%d/%m/%y')

# Add year-sold and month-sold columns. The raw datetime column is dropped
# from the features below, so these two columns carry the date
# information into the model.
data['year_sold'] = data['date_sold'].dt.year
data['month_sold'] = data['date_sold'].dt.month

# Separate features and target
X = data.drop(columns=['price', 'date_sold'])
y = data['price']

# Identify categorical columns. These columns represent categories
# rather than continuous numerical values, so CatBoost needs to be
# told to treat them differently.
categorical_features = ['suburb', 'type', 'year_sold', 'month_sold']

# Split the data into training and test sets:
# 80% of the dataset for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the CatBoost Regressor
model = CatBoostRegressor(cat_features=categorical_features, verbose=0)
model.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)

print(f'MAE: {mae}')
print(f'RMSE: {rmse}')
print(f'R^2: {r2}')


MAE: 313687.39225696994
RMSE: 631309.6028879706
R^2: 0.7072074147967327


MAE: Mean Absolute Error measures the average size of the errors in a set of predictions, without taking their direction into account. It is the average absolute difference between the predicted values and the actual values and is used to assess the effectiveness of a regression model. Because each error contributes in proportion to its size, MAE is less sensitive to outliers than RMSE. It is also simple to understand and calculate. Our MAE is $313,687.
RMSE: Root Mean Squared Error squares each error before averaging, so it is particularly useful when you want to penalize larger errors more heavily. Squared error is smooth and differentiable everywhere, which can be beneficial during the optimization process in gradient boosting models like CatBoost. Our RMSE is $631,309.
R²: R² is the proportion of the variance in the target that the model explains. A value close to 1 suggests that the model explains most of the variability in the data, indicating a good fit. R² also allows for easy comparison of the model's performance with other models or benchmarks. Our R² is 0.71, so about 29% of the variance in price is still unexplained and the model has room for improvement.
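
To make the three definitions concrete, here is a minimal sketch that computes them from scratch using only the standard library. The function name regression_metrics and the toy prices are hypothetical and are not taken from domain_properties.csv; in the notebook itself the scikit-learn functions do this work.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average of the absolute errors
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the average squared error
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R^2: 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return mae, rmse, r2

# Toy values for illustration only (hypothetical, not real sale prices)
toy_true = [100, 200, 300, 400]
toy_pred = [110, 190, 310, 390]

toy_mae, toy_rmse, toy_r2 = regression_metrics(toy_true, toy_pred)
print(f"MAE: {toy_mae}, RMSE: {toy_rmse}, R^2: {toy_r2}")
```

When every error has the same magnitude, as in this toy case, MAE and RMSE are equal; they drift apart once some errors are much larger than others.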

In [7]:
df = pd.DataFrame(data)

# Calculate Mean
mean_value = df['price'].mean()

# Calculate Median
median_value = df['price'].median()

# Calculate Q1 (25th percentile)
Q1 = df['price'].quantile(0.25)

# Calculate Q3 (75th percentile)
Q3 = df['price'].quantile(0.75)

# Calculate Interquartile Range (IQR)
IQR = Q3 - Q1

# Output the results
print("Mean:", mean_value)
print("Median:", median_value)
print("Q1 (25th percentile):", Q1)
print("Q3 (75th percentile):", Q3)
print("IQR (Interquartile Range):", IQR)

# Find the lowest value
lowest_value = df['price'].min()

# Find the highest value
highest_value = df['price'].max()

# Output the results
print("Lowest Value:", lowest_value)
print("Highest Value:", highest_value)

Mean: 1675395.2672939068
Median: 1388000.0
Q1 (25th percentile): 1002000.0
Q3 (75th percentile): 2020000.0
IQR (Interquartile Range): 1018000.0
Lowest Value: 225000
Highest Value: 60000000


Choose MAE if you want a straightforward average error metric that is not too sensitive to outliers. Choose RMSE if avoiding large errors is critical or if you're optimizing models where smooth gradients are beneficial.
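
The difference in outlier sensitivity can be seen in a small sketch. The error values below are made up for illustration, and the helper names mae_of and rmse_of are hypothetical; the only point is how each metric reacts when one miss is much larger than the rest.

```python
import math

def mae_of(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse_of(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical prediction errors: same size everywhere vs. one large miss
even_errors = [10, -10, 10, -10]
one_big_miss = [10, -10, 10, -100]

for label, errs in [("even errors", even_errors), ("one big miss", one_big_miss)]:
    print(f"{label}: MAE = {mae_of(errs):.2f}, RMSE = {rmse_of(errs):.2f}")
```

With the single large miss, both metrics rise, but RMSE rises further than MAE because the large error is squared before averaging.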

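From the dataset, we can see that the lowest price is $225,000 and the highest is $60,000,000. With such a wide range, a few very expensive properties can produce very large errors, so we need to focus on improving the RMSE. Since 75% of the prices are below $2,020,000, an RMSE of $631,309, which is more than half the interquartile range of $1,018,000, indicates that our model needs improvement.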
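
One rough way to put these errors in context is to express them relative to the typical price. The sketch below uses the figures printed in the outputs above; the variable names are hypothetical. The error metrics come from the test split while the price statistics cover the full dataset, so the ratios are only indicative.

```python
# Figures copied from the notebook outputs above
rmse = 631309.6028879706
mae = 313687.39225696994
mean_price = 1675395.2672939068
median_price = 1388000.0
iqr = 1018000.0

# Errors expressed as a fraction of typical price levels
rmse_vs_mean = rmse / mean_price
mae_vs_median = mae / median_price
rmse_vs_iqr = rmse / iqr

# Ratio of RMSE to MAE: the further above 1, the more the total error
# is dominated by a small number of large misses
rmse_vs_mae = rmse / mae

print(f"RMSE / mean price:   {rmse_vs_mean:.1%}")
print(f"MAE / median price:  {mae_vs_median:.1%}")
print(f"RMSE / IQR:          {rmse_vs_iqr:.1%}")
print(f"RMSE / MAE:          {rmse_vs_mae:.2f}")
```

An RMSE roughly twice the MAE is consistent with a handful of large misses, which fits the very wide price range in this dataset.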