# Compare Multiple Machine Learning Models 

In [12]:
import pandas as pd

# Load the dataset
data = pd.read_csv(r"C:\Users\USER\Desktop\Tedprime\3mtt\Real.csv")

# Display a summary of the dataset and its first few rows.
# DataFrame.info() prints its report directly and returns None,
# so it is called on its own rather than wrapped in print().
data.info()
print(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Transaction date                     414 non-null    object 
 1   House age                            414 non-null    float64
 2   Distance to the nearest MRT station  414 non-null    float64
 3   Number of convenience stores         414 non-null    int64  
 4   Latitude                             414 non-null    float64
 5   Longitude                            414 non-null    float64
 6   House price of unit area             414 non-null    float64
dtypes: float64(5), int64(1), object(1)
memory usage: 22.8+ KB
      Transaction date  House age  Distance to the nearest MRT station  \
0  2012-09-02 16:42:31       13.3                            4082.0150   
1  2012-09-04 22:52:30       35.5                             274.0144   
2  2012-09-05 01:

In [None]:
'''The dataset consists of 414 entries and 7 columns, with no missing values. Here’s a brief overview of the columns:

Transaction date: The date of the house sale (object type, so it should be converted to datetime before 
extracting useful features such as year and month).

House age: The age of the house in years (float).

Distance to the nearest MRT station: The distance to the nearest mass rapid transit station in meters (float).

Number of convenience stores: The number of convenience stores within walking distance of the house (integer).

Latitude: The geographic coordinate that specifies the north-south position (float).

Longitude: The geographic coordinate that specifies the east-west position (float).

House price of unit area: The price of the house per unit area (float), which is our target variable for prediction.'''

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# convert "Transaction date" to datetime and extract year and month
data['Transaction date'] = pd.to_datetime(data['Transaction date'])
data['Transaction year'] = data['Transaction date'].dt.year
data['Transaction month'] = data['Transaction date'].dt.month

# drop the original "Transaction date" as we've extracted relevant features
data = data.drop(columns=['Transaction date'])

# define features and target variable
X = data.drop('House price of unit area', axis=1)
y = data['House price of unit area']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled.shape

(331, 7)

In [14]:
X_test_scaled.shape

(83, 7)
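
The split-then-scale steps above can also be bundled into a single estimator. The sketch below is one possible arrangement, not the notebook's own method: it uses scikit-learn's `make_pipeline` so that the scaler is fitted on the training rows only and then reused for prediction. The data comes from `make_regression` as a synthetic stand-in (an assumption; it is not the housing dataset), and the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the same shape as the notebook's feature matrix
# (an assumption for illustration; not the housing dataset)
X_demo, y_demo = make_regression(n_samples=414, n_features=7, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling statistics from the training rows only;
# predict() and score() reuse those statistics on new rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(pipe.score(X_te, y_te))  # R² on the held-out rows
```

Keeping the scaler inside the pipeline means the test rows never influence the scaling statistics, which is the same guarantee the separate `fit_transform` / `transform` calls provide.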

# Model Training and Comparison

In [None]:
'''The following models are compared:

Linear Regression: A good baseline model for regression tasks.
Decision Tree Regressor: To see how a simple tree-based model performs.
Random Forest Regressor: An ensemble method to improve upon the decision tree’s performance.
Gradient Boosting Regressor: Another powerful ensemble method for regression.

We’ll train each model on the training data and evaluate its performance on the test set using 
Mean Absolute Error (MAE) and R-squared (R²) as metrics. MAE gives the average size of the prediction 
errors, and R² shows how much of the variance in the target variable the model explains.'''
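
To make the two metrics concrete, the sketch below computes MAE and R² by hand on four made-up values and compares them with scikit-learn's `mean_absolute_error` and `r2_score`. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical values, chosen only to illustrate the formulas
y_true = np.array([30.0, 40.0, 50.0, 60.0])
y_pred = np.array([32.0, 38.0, 55.0, 57.0])

# MAE: mean of the absolute differences between actual and predicted values
mae_manual = np.mean(np.abs(y_true - y_pred))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Here the absolute errors are 2, 2, 5, and 3, so the MAE is 3.0. The residual sum of squares is 42 and the total sum of squares is 500, so R² is 1 - 42/500 = 0.916.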

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# initialize the models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42)
}

# dictionary to hold the evaluation metrics for each model
results = {}

# train and evaluate each model
for name, model in models.items():
    # training the model
    model.fit(X_train_scaled, y_train)

    # making predictions on the test set
    predictions = model.predict(X_test_scaled)

    # calculating evaluation metrics
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)

    # storing the metrics
    results[name] = {"MAE": mae, "R²": r2}

# convert the results to a DataFrame for better readability
results_df = pd.DataFrame(results).T
print(results_df)

                         MAE        R²
Linear Regression   9.748246  0.529615
Decision Tree      11.700141  0.191559
Random Forest       9.829512  0.513097
Gradient Boosting  10.002797  0.475671
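
The Decision Tree's weak test score in the table above raises the question of whether it is overfitting. One common check, sketched here under stated assumptions, is to compare the model's R² on the training rows with its R² on the test rows: a large gap points to overfitting. The data is a synthetic stand-in from `make_regression` (not the housing dataset), and `X_demo`, `y_demo`, `train_r2`, and `test_r2` are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration; not the housing dataset)
X_demo, y_demo = make_regression(n_samples=400, n_features=7, noise=25.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# A tree with no depth limit can memorize the training rows
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

# Compare the fit on seen rows with the fit on unseen rows
train_r2 = r2_score(y_tr, tree.predict(X_tr))
test_r2 = r2_score(y_te, tree.predict(X_te))
print(f"train R²: {train_r2:.3f}, test R²: {test_r2:.3f}")
```

Running the same comparison with `X_train_scaled` and `X_test_scaled` from the notebook would show whether the gap also appears on the housing data.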


In [None]:
''''Linear Regression has the lowest MAE (9.75) and the highest R² (0.53), making it the 
best-performing model among those evaluated. It suggests that, despite its simplicity, 
Linear Regression is quite effective for this dataset.


Decision Tree Regressor shows the highest MAE (11.70) and the lowest R² (0.19), indicating that it may be
overfitting the training data and generalizing poorly to the test data. Random Forest Regressor and 
Gradient Boosting Regressor have similar MAEs (9.83 and 10.00, respectively) and R² scores 
(0.51 and 0.48, respectively), performing slightly worse than Linear Regression but better than 
the Decision Tree.