# q3
From scratch (not using any pre-packaged tools for direct optimization), implement the
stochastic gradient descent algorithm for linear regression and test your results on the
California Housing Prices Dataset (you can implement simple matrix operations by using
a package like numpy):
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing
Here is what you need to do step by step:
* Implement the stochastic gradient descent algorithm from scratch

* Choose the following features from the dataset as your X matrix: MedInc,
HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude

* Choose the following feature from the dataset as your Y matrix: MedHouseVal

* Apply 0 – 1 normalization on X and Y.

* Randomly split your data into training (70% of total) and test sets (30% of total)
by using sklearn’s train_test_split function. Set random_state = 265:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.
* Use the ‘ideal’ learning rate and number of steps values [‘ideal’ means that
you can make the decision based on the computational power you have].
* By running your code, determine the best set of parameters (=weights) for the
constant and your features listed in b). Your cost function will be MSE (=you should
pick the set of parameters that give you the lowest MSE).

* Report and interpret the results. What are the factors that explain the house
prices the most?
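
As a brief sketch of the update rule behind the task above: for a single randomly chosen sample, the squared error and its gradient can be written as follows. The symbols are assumptions for illustration: $x_i$ is one row of X with the constant 1 included, $w$ is the weight vector, and $\eta$ is the learning rate. The factor $\tfrac{1}{2}$ is a common convention that cancels the 2 from differentiation; it only rescales the learning rate and does not change which weights minimize the MSE.

```latex
\begin{align}
J_i(w) &= \tfrac{1}{2}\,\bigl(x_i^{\top} w - y_i\bigr)^2 \\
\nabla_w J_i(w) &= \bigl(x_i^{\top} w - y_i\bigr)\, x_i \\
w &\leftarrow w - \eta\,\bigl(x_i^{\top} w - y_i\bigr)\, x_i
\end{align}
```

One pass over as many random draws as there are training samples is then counted as one epoch.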


In [10]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
Y = data.target


features = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude"]
X = X[features]

# 0-1 normalization
X = (X - X.min()) / (X.max() - X.min())
Y = (Y - Y.min()) / (Y.max() - Y.min())


X = X.to_numpy()
Y = Y.reshape(-1, 1)

# bias terms
X = np.c_[np.ones(X.shape[0]), X]


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=265)


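def stochastic_gradient_descent(X, Y, learning_rate=0.01, epochs=1000):
    """Fit linear-regression weights with per-sample SGD on the squared error.

    X must already contain the bias column; Y must have shape (m, 1).
    """
    m, n = X.shape
    weights = np.zeros((n, 1))  # Initialize weights at zero

    for epoch in range(epochs):
        for i in range(m):
            # Draw one sample uniformly at random (with replacement).
            # np.random is not seeded here, so results vary slightly between runs.
            idx = np.random.randint(0, m)
            x_i = X[idx].reshape(1, -1)
            y_i = Y[idx].reshape(1, -1)

            prediction = np.dot(x_i, weights)
            error = prediction - y_i

            # SGD update rule; the factor 2 of the squared-error gradient
            # is absorbed into the learning rate
            weights -= learning_rate * x_i.T @ error

    return weights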

In [11]:

# Train the model
learning_rate = 0.01
epochs = 1000
weights = stochastic_gradient_descent(X_train, Y_train, learning_rate, epochs)

# Predict function
def predict(X, weights):
    return np.dot(X, weights)

# Compute Mean Squared Error (MSE)
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Evaluate the model
y_train_pred = predict(X_train, weights)
y_test_pred = predict(X_test, weights)

mse_train = mse(Y_train, y_train_pred)
mse_test = mse(Y_test, y_test_pred)

print(f"Training MSE: {mse_train}")
print(f"Testing MSE: {mse_test}")
print(f"Learned Weights: {weights.ravel()}")


Training MSE: 0.02204339920941451
Testing MSE: 0.023050721944464954
Learned Weights: [ 0.73996337  1.30962198  0.10408374 -3.01055051  4.72020398 -0.06997644
 -0.82964972 -0.82005066 -0.91684462]
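
One way to sanity-check weights learned by SGD is to compare them with the least-squares solution, which minimizes the training MSE exactly. The sketch below does this on small synthetic data rather than on the housing set. The names `X_demo`, `Y_demo`, `true_w`, `w_ref`, and `w_sgd` are hypothetical, and the learning rate and step count are illustrative choices, not the ones used above.

```python
import numpy as np

rng = np.random.default_rng(265)  # seeded so the sketch is reproducible

# Synthetic stand-in for the housing data: 3 features in [0, 1] plus a bias column
m = 200
X_demo = np.c_[np.ones(m), rng.random((m, 3))]
true_w = np.array([[0.5], [1.0], [-2.0], [0.3]])
Y_demo = X_demo @ true_w + rng.normal(0.0, 0.01, size=(m, 1))

# Reference: least-squares weights, i.e. the minimizer of the training MSE
w_ref, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)

# Per-sample SGD with the same kind of update as in the notebook
w_sgd = np.zeros((4, 1))
learning_rate = 0.05
for _ in range(200 * m):                      # 200 "epochs" of m random draws
    idx = rng.integers(0, m)
    x_i = X_demo[idx:idx + 1]                 # shape (1, 4)
    error = x_i @ w_sgd - Y_demo[idx:idx + 1]
    w_sgd -= learning_rate * x_i.T @ error

print("least squares:", w_ref.ravel())
print("SGD          :", w_sgd.ravel())
print("max abs gap  :", np.abs(w_ref - w_sgd).max())
```

If the gap is small, the SGD run has effectively converged. A large gap would suggest that more steps or a different learning rate are needed.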


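Important Factors (the first weight, 0.7400, is the constant; the remaining weights follow the feature order in X):
* average bedrooms (+4.7202) → largest weight in magnitude.
* average rooms (-3.0106) → second largest, with the opposite sign. large opposite-sign weights on two closely related features tend to offset each other, so neither should be read in isolation.
* median income (+1.3096) → wealthier areas have more expensive homes. this is the most directly interpretable effect.
* longitude & latitude (-0.9168, -0.8201) → location matters significantly; prices fall toward the north and east.
* average occupancy (-0.8296) → more people per household goes with lower prices.
* house age (+0.1041) and population (-0.0700) → only weak effects.

since every feature is scaled to [0, 1], each weight is the change in normalized price when that feature moves across its full observed range.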
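
Because the weight vector starts with the constant, it is easy to shift the feature names by one position when reading it. A minimal sketch of pairing the printed weights with their names and ranking them by magnitude is shown here; the list `names` and the variable `ranked` are hypothetical helpers, and the numbers are copied from the output above.

```python
import numpy as np

# Order matches the design matrix: bias column first, then the selected features
names = ["const", "MedInc", "HouseAge", "AveRooms", "AveBedrms",
         "Population", "AveOccup", "Latitude", "Longitude"]

# Weights copied from the printed "Learned Weights" output
weights = np.array([0.73996337, 1.30962198, 0.10408374, -3.01055051,
                    4.72020398, -0.06997644, -0.82964972, -0.82005066,
                    -0.91684462])

# Rank the features (excluding the constant) by absolute weight
ranked = sorted(zip(names[1:], weights[1:]),
                key=lambda pair: abs(pair[1]), reverse=True)

for name, w in ranked:
    print(f"{name:>10s}  {w:+.4f}")
```

Ranking by absolute value only makes sense here because all features share the same 0–1 scale.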

_training mse_ = 0.0220, _testing mse_ = 0.0231
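* since the training and testing mse values are very close, the model shows no sign of overfitting and generalizes well to the held-out 30%.
* the mse is computed on the 0–1 normalized target, so 0.0231 corresponds to an rmse of about 0.15, i.e. a typical error of roughly 15% of the full range of house values. "small" should be read relative to that scale.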
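
To make the normalized MSE easier to interpret, it can be converted back to the original units of the target. The sketch below assumes min-max scaling was used, so one unit of normalized error equals `y_max - y_min` original units. The helper name `denormalized_rmse` is hypothetical, and the range 0.15 to 5.0 (in units of $100,000) is an approximate assumption about MedHouseVal that should be replaced with the actual `Y.min()` and `Y.max()` taken before normalization.

```python
import math

def denormalized_rmse(mse_normalized, y_min, y_max):
    """Convert an MSE measured on a min-max scaled target to an RMSE in original units."""
    return math.sqrt(mse_normalized) * (y_max - y_min)

# Test MSE reported above; the target range is an approximate assumption
rmse_original = denormalized_rmse(0.023050721944464954, y_min=0.15, y_max=5.0)
print(f"RMSE in original target units: {rmse_original:.3f}")
```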

# q4
* Use SGDRegressor provided by scikit:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
* Step b, c, d, and e are the same as in Question 3.
* Set random_state = 265, and loss = ‘squared_error’. Use the ‘ideal’
learning rate and number of steps values [‘ideal means that you can make
the decision based on the computational power you have]. These two parameters
should be the same as those you used in Question 3. Other parameters should be
set to ‘default’.
* By running your code, determine the best set of parameters (=weights) for the
constant and your features listed in b).
* Report and interpret the results. What are the factors that explain the house
prices the most? Are the results different from the previous question? If
different, explain why the results might be different.

In [12]:
from sklearn.linear_model import SGDRegressor

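# X already carries the bias column added in Question 3
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=265)
# NOTE: the next two lines prepend a second column of ones, and SGDRegressor
# also fits its own intercept, so the constant appears three times in the output
X_train = np.c_[np.ones(X_train.shape[0]), X_train]
X_test = np.c_[np.ones(X_test.shape[0]), X_test]

# Hyperparameters (same values as in Question 3)
learning_rate = 0.01
num_epochs = 1000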

# Using Scikit-Learn's SGDRegressor
sgd_reg = SGDRegressor(loss='squared_error', learning_rate='constant', eta0=learning_rate, max_iter=num_epochs, random_state=265)
sgd_reg.fit(X_train, y_train.ravel())

# Compute MSE
train_mse_sklearn = np.mean((sgd_reg.predict(X_train) - y_train.ravel()) ** 2)
test_mse_sklearn = np.mean((sgd_reg.predict(X_test) - y_test.ravel()) ** 2)

print("\nSGDRegressor Results:")
print(f"Training MSE: {train_mse_sklearn}")
print(f"Testing MSE: {test_mse_sklearn}")
print(f"Learned Weights: {np.hstack((sgd_reg.intercept_.reshape(-1), sgd_reg.coef_))}")




SGDRegressor Results:
Training MSE: 0.022833223249611876
Testing MSE: 0.023289385971676832
Learned Weights: [ 0.25506976  0.23673588  0.23673588  1.13991477  0.11819594  0.07999486
  0.15911785 -0.01362413 -0.07618674 -0.77718429 -0.85884423]


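Interpreting the Weights:
The printed vector has 11 entries because the constant appears three times: the intercept fitted by SGDRegressor (0.2551) plus two columns of ones in the design matrix (0.2367 each), which add up to an effective constant of about 0.729. The remaining eight entries follow the feature order in X.
* MedInc (1.140) has the strongest positive impact on house prices.
* Longitude (-0.859) and Latitude (-0.777) have the largest negative coefficients, suggesting a geographical pricing pattern (e.g., inland locations may be less expensive).
* AveBedrms (0.159), HouseAge (0.118), AveRooms (0.080), AveOccup (-0.076) and Population (-0.014) have only small weights.

The MSE values are very close to those of our custom Stochastic Gradient Descent (test MSE 0.0233 vs. 0.0231), and MedInc, Latitude and Longitude receive similar weights in both models. The weights of AveRooms and AveBedrms differ sharply, however (0.080 and 0.159 here vs. -3.011 and 4.720 in Question 3). Likely reasons are that SGDRegressor with default settings adds an L2 penalty (alpha = 0.0001) and stops early once the loss improves by less than tol = 0.001, so it probably ran far fewer than 1000 epochs, whereas our implementation always performs every update and has more time to build up large offsetting weights on these two closely related features. SGDRegressor also shuffles and sweeps through the data each epoch instead of sampling with replacement.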

# q10 


10) Using the pre-packaged cross-validation functions implemented in the scikit package,
provide a classification by using the California Housing Prices dataset, and answer the
following questions:
Dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing
* Choose the following features from the dataset as your X matrix: MedInc,
HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
* Choose the following feature from the dataset as your Y matrix: MedHouseVal
* Apply 0 – 1 normalization on X and Y.
* Apply the three different cross-validation strategies to train your model. (For
splitting your data always use 265 as your random number or seed value).
* Now, using scikit’s sklearn.linear_model.LinearRegression, predict the
house prices by using all of the data in your X matrix. Compare the performance
obtained with different techniques of CV. Which CV strategy provides the lowest
MSE? Why? Interpret the results.

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
model = LinearRegression()
random_seed = 265
# K-Fold Cross Validation (5 shuffled folds)
kf = KFold(n_splits=5, shuffle=True, random_state=random_seed)
kf_mse = cross_val_score(model, X, Y, cv=kf, scoring='neg_mean_squared_error')

# Leave-One-Out Cross Validation: one fit per sample, so this is the slowest strategy
loo = LeaveOneOut()
loo_mse = cross_val_score(model, X, Y, cv=loo, scoring='neg_mean_squared_error')


# Print MSE for each strategy
print(f'K-Fold CV MSE: {-kf_mse.mean()}')
print(f'Leave-One-Out CV MSE: {-loo_mse.mean()}')




K-Fold CV MSE: 0.022460523225369913
Leave-One-Out CV MSE: 0.022456875235560322


In [19]:
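# train-test split (hold-out) validation
# NOTE: random_state is fixed, so all 10 iterations produce the same split and
# the same MSE; the average below therefore equals a single hold-out estimate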
from sklearn.metrics import mean_squared_error
num_splits = 10
test_size = 0.2

mse_scores = []

for _ in range(num_splits):
    # split data into training and test sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=random_seed)

    # train the model
    model.fit(X_train, Y_train)

    # predict on the test set
    Y_pred = model.predict(X_test)

    # calculate MSE
    tt_mse = mean_squared_error(Y_test, Y_pred)
    mse_scores.append(tt_mse)

# average MSE across splits
train_test_split_mse = np.mean(mse_scores)
print(f'Train-Test Split (avg over {num_splits} runs) MSE: {train_test_split_mse:.6f}')


Train-Test Split (avg over 10 runs) MSE: 0.023056
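
If the goal is a repeated hold-out estimate with different random splits under a single seed, scikit-learn's `ShuffleSplit` is one option. The sketch below shows it on synthetic data; `X_demo`, `y_demo`, `splitter`, and `mse_per_split` are hypothetical names. It illustrates the technique only and does not reproduce the numbers above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(265)
X_demo = rng.random((300, 4))
y_demo = X_demo @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(0.0, 0.05, 300)

# Ten independent 80/20 splits, all derived from one seed
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=265)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=splitter, scoring="neg_mean_squared_error")
mse_per_split = -scores

# Collect the test indices to confirm that the splits really differ
test_sets = [tuple(sorted(test_idx)) for _, test_idx in splitter.split(X_demo)]

print("distinct test sets:", len(set(test_sets)))
print(f"mean MSE: {mse_per_split.mean():.6f}, std: {mse_per_split.std():.6f}")
```

The spread across splits gives a rough idea of how much a single hold-out estimate depends on the particular partition.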


In [21]:
model.fit(X, Y)
y_pred = model.predict(X)
final_mse = mean_squared_error(Y, y_pred)
print(f'Final model trained on full data MSE: {final_mse:.6f}')

Final model trained on full data MSE: 0.022290
