## Data Preparation & Exploration
The first step in any machine learning project is to understand and prepare the data. Here, we load the dataset from houses_Madrid.csv and perform an initial exploration.

The scatter plot gives us visual confirmation that there appears to be a positive linear relationship between a property's size in square meters and its price: as the size increases, the price tends to increase as well. This makes it a good candidate for a linear regression model.
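One way to put a number on that visual impression is the Pearson correlation coefficient. The sketch below uses a handful of made-up values rather than the Madrid data; the names `sizes` and `prices` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical sizes (square meters) and prices; NOT taken from houses_Madrid.csv
sizes = np.array([50.0, 65.0, 80.0, 100.0, 120.0, 150.0])
prices = np.array([210_000.0, 260_000.0, 330_000.0, 420_000.0, 480_000.0, 640_000.0])

# Pearson correlation: values close to +1 indicate a strong positive linear trend
r = np.corrcoef(sizes, prices)[0, 1]
print(f"Pearson r: {r:.3f}")
```

A value near +1 supports trying a straight-line model first; a value near 0 would suggest the relationship is weak or not linear.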

Finally, we split the data into a training set (first 16,000 samples) and a test set (remaining samples).

Why do we do this? This separation is crucial. We use the training set to teach our model the relationship between size and price. The test set, which the model has never seen, is then used to evaluate how well the model generalizes to new, unseen data. This gives us an honest assessment of its performance and helps us detect overfitting.
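A minimal sketch of the positional split described above, run on a small synthetic array. The helper name `split_first_n` and the toy sizes are assumptions for illustration only; the notebook's split uses the first 16,000 samples for training.

```python
import numpy as np

def split_first_n(x, y, n_train):
    """Positional split: the first n_train samples train, the rest test."""
    x = np.asarray(x)
    y = np.asarray(y)
    return x[:n_train], x[n_train:], y[:n_train], y[n_train:]

# Toy data standing in for (size, price) pairs
x = np.arange(10, dtype=float)
y = 3.0 * x + 1.0

x_train, x_test, y_train, y_test = split_first_n(x, y, n_train=8)
print(x_train.shape, x_test.shape)
```

A positional split like this assumes the rows are not sorted in a way that correlates with the target; if they might be, shuffling before splitting is the safer choice.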

In [2]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn as sk
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Set a random seed for reproducibility of results
np.random.seed(42)

# Load the dataset from a CSV file into a pandas DataFrame
df = pd.read_csv("houses_Madrid.csv")
features = [
    'sq_mt_built',
    'n_rooms',
    'n_bathrooms',
    'built_year',
    'has_lift',
    'has_pool',
    'has_parking',
    'is_new_development',
    'is_renewal_needed',
    'floor',
    'rent_price',
    'is_exterior',
    'buy_price'
]

df_dropped = df[features].dropna()

cat = ['has_lift', 'has_pool', 'has_parking', 'is_new_development',
    'is_renewal_needed', 'is_exterior', 'floor']

num = ['sq_mt_built', 'n_rooms', 'n_bathrooms', 'rent_price']

one_hot_df = pd.get_dummies(df_dropped, columns = cat, drop_first=True).astype(int)
# print(one_hot_df.head())

# for item in features:
#     print(df[item].head())

# Process the categorical floor values into numeric ones
# print(df_dropped['floor'].unique())
# df_dropped['floor'] = df_dropped['floor'].replace(floor_mapping)

# Drop rows with missing values in the selected features
# (dropped 2992 rows)

# Convert has_lift to numeric
# df_dropped['has_lift'] = df_dropped['has_lift'].replace(True, 1)
# df_dropped['has_lift'] = df_dropped['has_lift'].replace(False, 0)

# for item in features:
#     print(df_dropped[item].head())

X = one_hot_df.drop(columns=['buy_price'])
y = df_dropped['buy_price']
print(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")


# 1. Initialize the model
model = LinearRegression()

# 2. Train the model using the training data
model.fit(X_train, y_train)

# 3. Predict prices for the held-out test set
y_pred = model.predict(X_test)

# 4. Calculate the performance metrics by comparing actual values (y_test) to predicted values (y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error (RMSE): {rmse:,.2f}")
print(f"R-squared (R²): {r2:.4f}")

       sq_mt_built  n_rooms  n_bathrooms  built_year  rent_price  \
4              108        2            2        2003        1094   
28              65        3            1        1976         744   
31             123        3            2        2002        1195   
40             125        3            3        2004        1135   
45              83        2            2        2020        1004   
...            ...      ...          ...         ...         ...   
21719          161        4            2        1974        1816   
21724          102        2            2        1976        1187   
21732           74        2            1        1988        1037   
21738           96        2            2        2002        1496   
21739          175        4            2        2002        2081   

       has_lift_True  has_parking_True  is_renewal_needed_True  \
4                  1                 1                       0   
28                 1                 0             

## Building a Model from Scratch
The Linear Predictor
Why a linear model? Based on our initial data exploration, a simple straight line is a good starting point to model the relationship between a house's size and its price.

How it works:
The predict function implements the mathematical formula for a line:

$\hat{y} = \theta_1 \cdot x + \theta_0$

Where:

$\hat{y}$ is the predicted price.

$x$ is the input feature (sq_mt_built).

$\theta_1$ is the weight (slope), representing the price increase per square meter.

$\theta_0$ is the bias (y-intercept), representing the base price of a house.

This function takes an input x and the model parameters theta and returns the model's prediction.

In [None]:
# Import the standard RMSE function
from sklearn.metrics import root_mean_squared_error

# Calculate Root Mean Squared Error (RMSE) from scratch
def loss_RMSE(y, yhat):
    y = np.array(y)
    yhat = np.array(yhat)
    diff = y-yhat
    squ_diff = diff ** 2
    mean = np.mean(squ_diff)
    rmse = np.sqrt(mean)
    return rmse

# Wrapper for the scikit-learn RMSE
def loss_RMSE_sk(y, yhat):
    return root_mean_squared_error(y, yhat)

# Validation test data
test_y = [1, 5, 6, 10, 11]
test_y_hat = [1.5, 5.7, 6.1, 10.4, 11.2]

# Test
print(loss_RMSE(test_y, test_y_hat))
print(loss_RMSE_sk(test_y, test_y_hat))

# find the best fit line
m, b = np.polyfit(training_data_x, training_data_y, 1)
yhat = m * test_data_x + b

print("m: ", m,"b: ", b)

print(loss_RMSE(test_data_y, yhat))
print(loss_RMSE_sk(test_data_y, yhat))
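
For reference, the quantity that loss_RMSE computes step by step can be written as a single formula, with $n$ the number of samples:

$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$

Working through the five validation pairs from the cell by hand gives a value to compare against the two printed results:

$$\sqrt{\frac{(-0.5)^2 + (-0.7)^2 + (-0.1)^2 + (-0.4)^2 + (-0.2)^2}{5}} = \sqrt{\frac{0.95}{5}} = \sqrt{0.19} \approx 0.4359$$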



## Implementation of a Linear Predictor
Linear Predictor: The predictor function implements the standard linear model equation:  $\hat{y} = \theta_1 \cdot x + \theta_0$

In [None]:
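def predict(x, theta):
    # Accept a 1D feature array by turning it into a single-column matrix
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    # Prepend a column of ones so theta[0] acts as the bias and theta[1] as the slope
    X_with_bias = np.hstack([np.ones((x.shape[0], 1)), x])
    return X_with_bias @ np.asarray(theta, dtype=float)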
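
To see why prepending a column of ones works, the sketch below compares the matrix form with the explicit line formula. It is a standalone illustration: `predict_sketch` is a hypothetical stand-in for the function above, and the sizes and parameter values are made up.

```python
import numpy as np

def predict_sketch(x, theta):
    """Return theta[0] + theta[1] * x using a design matrix with a bias column."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)  # one feature per row
    X_with_bias = np.hstack([np.ones((x.shape[0], 1)), x])
    return X_with_bias @ np.asarray(theta, dtype=float)

x = np.array([50.0, 80.0, 120.0])  # hypothetical sizes in square meters
theta = [15000.0, 4000.0]          # hypothetical [theta_0 (bias), theta_1 (slope)]

matrix_form = predict_sketch(x, theta)
line_form = theta[1] * x + theta[0]

print(matrix_form)                         # values: 215000, 335000, 495000
print(np.allclose(matrix_form, line_form))
```

Each row of the design matrix is `[1, x]`, so its dot product with `[theta_0, theta_1]` is exactly $\theta_0 + \theta_1 \cdot x$. The same code then extends to several features without modification.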

## Training with Grid Search
Why Grid Search? Now that we have a model and a way to measure its error, we need to find the best values for the parameters $\theta_0$ and $\theta_1$. Grid search is an intuitive, brute-force technique for this.

How it works:
The grid_search function systematically checks every possible combination of $\theta_0$ and $\theta_1$ from a predefined range of values (the "grid"). For each pair, it calculates the RMSE on the training data. The function keeps track of the parameter combination that results in the lowest error and returns it as the optimal set of parameters.

We first run it with a coarse grid to find a promising region and then use a finer grid in that region to zero in on a more precise solution, all while keeping the training time under 10 seconds.
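
The coarse-then-fine idea can be sketched on synthetic data as follows. Everything here is illustrative: the helper names `rmse` and `grid_search_sketch`, the grid ranges, and the data (generated from a known line with intercept 3.0 and slope 2.0) are assumptions, not the values used in this notebook.

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))

def grid_search_sketch(x, y, grid0, grid1):
    """Exhaustively evaluate every (theta_0, theta_1) pair and keep the best."""
    best = (None, None, float("inf"))
    for t0 in grid0:
        for t1 in grid1:
            loss = rmse(y, t0 + t1 * x)
            if loss < best[2]:
                best = (t0, t1, loss)
    return best

# Synthetic data from a known line (intercept 3.0, slope 2.0) plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

# Pass 1: coarse grid over a wide range
coarse0 = np.arange(0.0, 10.0, 1.0)
coarse1 = np.arange(0.0, 5.0, 0.5)
c0, c1, coarse_loss = grid_search_sketch(x, y, coarse0, coarse1)

# Pass 2: finer grid centred on the coarse winner
fine0 = np.arange(c0 - 1.0, c0 + 1.0, 0.05)
fine1 = np.arange(c1 - 0.5, c1 + 0.5, 0.025)
f0, f1, fine_loss = grid_search_sketch(x, y, fine0, fine1)

print(f"coarse: theta_0={c0:.2f}, theta_1={c1:.2f}, RMSE={coarse_loss:.4f}")
print(f"fine:   theta_0={f0:.2f}, theta_1={f1:.2f}, RMSE={fine_loss:.4f}")
```

The cost of each pass is the product of the two grid sizes, which is why narrowing the range before refining the step keeps the total number of evaluations manageable.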

In [None]:
import sys
import time

# Implement grid search to find the optimal theta_0, theta_1.
# This is a brute-force method that exhaustively checks every parameter combination in a predefined grid.
def grid_search(training_data_x, training_data_y, grid0, grid1):
    best_theta_0 = None
    best_theta_1 = None
    best_loss = float("inf")  # any real loss will be smaller
    for i in grid0:       # candidate values for theta_0 (bias)
        for j in grid1:   # candidate values for theta_1 (slope)
            yhat = predict(training_data_x, [i, j])
            temp_loss = loss_RMSE_sk(training_data_y, yhat)
            if temp_loss < best_loss:
                best_loss = temp_loss
                best_theta_0 = i
                best_theta_1 = j

    return best_theta_0, best_theta_1, best_loss


# Define the ranges (grids) of parameters to search through
grid0 = np.arange(14000,16000,10)
grid1 = np.arange(4000,6000,10)

# Time the training process to measure computational cost.
start = time.time()
theta_0, theta_1, best_loss = grid_search(training_data_x, training_data_y, grid0, grid1)
end = time.time()
s = (end-start)
print(f"Training Time Elapsed {s} secs.")

print("theta_0: ", theta_0, "theta_1: ", theta_1)

# loss on training set
print("Best train loss: ", best_loss)



# Evaluate the model's performance on the unseen test set using the best parameters found via grid search
yhat_test = predict(test_data_x, [theta_0, theta_1])
print("Test loss is: ", loss_RMSE_sk(test_data_y, yhat_test))



## Training with Random Search
Why Random Search? Grid search can be computationally expensive, especially if the range of possible parameter values is large. Random search is a more efficient alternative. Instead of trying every single combination, it tests a fixed number of random combinations within the specified range.

How it works:
The random_search function draws a specified number of trial values for each of $\theta_0$ and $\theta_1$ from a uniform distribution over a given range. It then evaluates the loss for every pairing of the sampled values (trials × trials combinations in total) and returns the best one it found. This approach can often find a very good solution much faster than an exhaustive grid search.
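
For comparison, a common variant of random search draws independent $(\theta_0, \theta_1)$ pairs, so the cost grows linearly with the number of trials. The sketch below is that variant on synthetic data, not this notebook's function; `random_search_sketch`, its `range0`/`range1`/`seed` parameters, and the data are all assumptions made for illustration.

```python
import numpy as np

def random_search_sketch(x, y, trials, range0, range1, seed=0):
    """Draw `trials` independent (theta_0, theta_1) pairs and keep the best one."""
    rng = np.random.default_rng(seed)
    theta0 = rng.uniform(range0[0], range0[1], trials)
    theta1 = rng.uniform(range1[0], range1[1], trials)
    # Predictions for every candidate at once: shape (trials, n_samples)
    preds = theta0[:, None] + theta1[:, None] * x[None, :]
    losses = np.sqrt(np.mean((preds - y[None, :]) ** 2, axis=1))
    best = int(np.argmin(losses))
    return theta0[best], theta1[best], float(losses[best])

# Synthetic data from a known line (intercept 3.0, slope 2.0); not the housing data
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x

t0, t1, loss = random_search_sketch(
    x, y, trials=2000, range0=(0.0, 6.0), range1=(1.0, 3.0)
)
print(f"theta_0={t0:.3f}, theta_1={t1:.3f}, RMSE={loss:.4f}")
```

Because each trial is one pair, doubling `trials` doubles the work, whereas pairing every sampled $\theta_0$ with every sampled $\theta_1$ quadruples it.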

In [None]:
# Implement a random search to find the optimal model parameters.
# Instead of a fixed, evenly spaced grid, candidate values are drawn at random
# from a uniform range for each parameter.
def random_search(training_data_x, training_data_y, trials):
    best_theta_0 = None
    best_theta_1 = None
    best_loss = sys.maxsize
    # Draw `trials` candidate values for each parameter
    grid0 = np.random.uniform(15000, 16000, trials)  # theta_0 (bias)
    grid1 = np.random.uniform(4000, 5000, trials)    # theta_1 (slope)
    # The nested loops evaluate every pairing: trials * trials combinations in total
    for i in grid0:
        for j in grid1:
            yhat = predict(training_data_x, [i,j])
            temp_loss = loss_RMSE_sk(training_data_y, yhat)
            if(temp_loss < best_loss):
                best_loss = temp_loss
                best_theta_0 = i
                best_theta_1 = j

    return best_theta_0, best_theta_1, best_loss


# Measure computational cost in seconds
start = time.time()

# Run the random search with 300 trials for each parameter.
theta_0, theta_1, best_loss = random_search(training_data_x, training_data_y, 300)
end = time.time()
s = (end-start)
print(f"Training Time Elapsed {s} secs.")

print("theta_0: ", theta_0, "theta_1: ", theta_1)

# loss on training set
print("Best train loss is: ", best_loss)

#loss on test set
yhat_test = predict(test_data_x, [theta_0, theta_1])
print("Test loss is: ", loss_RMSE_sk(test_data_y, yhat_test))

## The Industry Standard: Scikit-Learn
While building models from scratch is excellent for learning, in a professional setting, we use highly optimized libraries like scikit-learn.

Why scikit-learn? It's fast, reliable, and its functions are implemented using efficient, direct analytical methods (like Ordinary Least Squares) rather than brute-force search. This allows it to find the optimal solution almost instantly.

How it works:
We use the LinearRegression object from sklearn.linear_model. The .fit() method trains the model on the training data, automatically finding the best intercept $\theta_0$ and coefficient $\theta_1$. 

We then use the .predict() method to make predictions on our training and test sets.
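
To illustrate what a direct least-squares solve looks like, the sketch below fits a line with NumPy alone. It shows the general idea rather than scikit-learn's exact internals, and it runs on synthetic data (intercept 3.0, slope 2.0, plus noise) rather than the housing data.

```python
import numpy as np

# Synthetic data from a known line plus small noise; not the housing data
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

# Design matrix with a bias column, then a single least-squares solve
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
theta_0, theta_1 = theta

# np.polyfit with degree 1 solves the same problem and returns (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)

print(f"lstsq:   theta_0={theta_0:.4f}, theta_1={theta_1:.4f}")
print(f"polyfit: theta_0={intercept:.4f}, theta_1={slope:.4f}")
```

There is no grid and no loop over candidate parameters: one linear-algebra call returns the minimizer of the squared error, which is why this kind of fit finishes almost instantly.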

In [None]:
# Use the highly-optimized LinearRegression model from scikit-learn.
# This represents the industry-standard approach for this type of problem.
from sklearn import linear_model
def sk_linear_reg(tdx, tdy):
    reg_sk = linear_model.LinearRegression()
    reg_sk.fit(tdx, tdy)
    return reg_sk

# Reshape the 1D feature arrays into 2D column vectors, as scikit-learn expects
training_data_x_sk = training_data_x.reshape(-1, 1)
test_data_x_sk = test_data_x.reshape(-1, 1)

# Measure computational cost in seconds
start = time.time()
reg = sk_linear_reg(training_data_x_sk, training_data_y)
end = time.time()
s = (end-start)
print(f"Training Time Elapsed {s} secs.")

# The optimal parameters are stored in the .coef_ (slope) and .intercept_ (bias) attributes.
print(reg.coef_, reg.intercept_)

# Evaluate the final model on both the training and test sets.
yhat_train = reg.predict(training_data_x_sk)
print("train loss is: ", loss_RMSE_sk(training_data_y,yhat_train))
yhat_test = reg.predict(test_data_x_sk)
print("test loss is: ", loss_RMSE_sk(test_data_y,yhat_test))

## Results

Grid Search

    Best theta: 4372.100000000044 (slope), 15440.0 (intercept)
    Training Error: 510201.77011160203
    Test Error: 497888.0900697986
    Training Time: 3.399505138397217 secs

Conclusion: I spent a lot of time fine-tuning the grid so that it bracketed good values of theta. Without that tuning, the search would have run faster but at a cost to accuracy, since it would have returned less optimal values of theta.

Random Search

    Best theta: 4374.653091653261 (slope), 15000.330159259476 (intercept)
    Training Error: 510201.4673081034
    Test Error: 497888.0900697986
    Training Time: 7.122501850128174 secs

Conclusion: I used a higher number of trials, which made this algorithm take longer. However, I reused the ranges found during grid search, which gave similar accuracy. Without them, it would have been slower and less accurate.

Sk Learn

    Best theta: 4374.46949346 (slope), 14798.092112582992 (intercept)
    Training Error: 510201.4152506144
    Test Error: 497867.84193848644
    Training Time: 0.0020799636840820312 secs

Conclusion: This algorithm is by far the fastest and is highly accurate, with no trial and error needed to find good theta values. If I were to build a linear regression on my own, I would use scikit-learn.

## Key Takeaways
Performance: All three methods found very similar parameters and achieved nearly identical performance on the test set, with an RMSE between roughly 497,868 and 497,888. This indicates that our custom implementations were successful in finding a near-optimal solution. While an RMSE of about 497,900 seems high, a large error is not surprising for a single-feature model.

Speed: The difference in training time is staggering. scikit-learn found the solution in milliseconds, while our search-based methods took several seconds. This highlights the power of using libraries with analytically optimized algorithms.

Conclusion: This project demonstrates a solid understanding of the end-to-end machine learning process. Implementing models from scratch provides invaluable insight into the underlying mechanics of training. However, for practical applications, scikit-learn is the clear winner, offering superior speed and guaranteed optimality without the need for manual parameter searching. A professional engineer knows both how the tools work and which one to use for the job.
