<b>Implementing Linear Regression from Scratch with the California Housing Dataset</b>

Objective:
This exercise provides hands-on experience implementing linear regression from scratch on the California housing dataset. It covers the inner workings of linear regression, in particular the cost function and its optimization by gradient descent.

<b>Steps:</b>

1- Load the California Housing Dataset:

- Use the fetch_california_housing function from scikit-learn to load the dataset.

In [1]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the California housing dataset
housing = fetch_california_housing()
data, target = housing.data, housing.target

In [3]:
# Explore the data: array shapes and the dataset description
print(data.shape)
print(target.shape)
print(housing.DESCR)

(20640, 8)
(20640,)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, 

2- Data Preprocessing:

- Add a bias term to the input features.
- Split the dataset into training and testing sets.

In [4]:
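# Add a bias term (a leading column of ones) to the input features.
# Note: StandardScaler in step 3 centers this constant column to all zeros,
# so the intercept is effectively learned through theta_0 in the model of step 4.
data_bias = np.c_[np.ones((data.shape[0], 1)), data]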

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data_bias, target, test_size=0.2, random_state=42)
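In the spirit of the exercise, the split itself can also be written by hand. The following is a minimal sketch, not what `train_test_split` does internally: `split_train_test` and the toy arrays are hypothetical names introduced here, and the shuffle will not select the same rows as `random_state=42` above.

```python
import numpy as np

def split_train_test(X, y, test_size=0.2, seed=42):
    """Shuffle row indices once, then slice them into a test part and a train part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    n_test = int(np.ceil(test_size * X.shape[0]))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same indices so rows and targets stay aligned
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data standing in for the housing arrays: 10 rows, 3 features
X_toy = np.arange(30, dtype=float).reshape(10, 3)
y_toy = np.arange(10, dtype=float)

X_tr, X_te, y_tr, y_te = split_train_test(X_toy, y_toy)
print(X_tr.shape, X_te.shape)
```

The important detail is that features and targets are indexed with the same shuffled indices, so each row keeps its own target.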

3- Standardization:

- Standardize the input features using StandardScaler from scikit-learn.

In [5]:
# Standardize the input features.
# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
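To make the standardization step concrete, here is a rough from-scratch equivalent in NumPy. It is a sketch under two assumptions: that the scaler uses the population standard deviation (`ddof=0`), and that constant columns are given a scale of 1 to avoid dividing by zero, which is how `StandardScaler` is understood to behave. The `standardize` helper and the toy matrices are hypothetical.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both sets using the mean and std of the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # population std (ddof=0)
    # Assumption: zero-variance columns get scale 1 to avoid division by zero
    std_safe = np.where(std == 0, 1.0, std)
    return (X_train - mean) / std_safe, (X_test - mean) / std_safe

# Toy matrix: the first column is a bias column of ones, the second a real feature
X_tr_demo = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]])
X_te_demo = np.array([[1.0, 8.0]])

X_tr_s, X_te_s = standardize(X_tr_demo, X_te_demo)
print(X_tr_s[:, 0])  # the bias column after scaling
print(X_tr_s[:, 1])  # the real feature after scaling
```

On this toy input the column of ones comes out as all zeros. This suggests that a bias column added before scaling no longer acts as an intercept, so a model needs a separate intercept parameter.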

4- Linear Regression Implementation:

- Implement a simple class for linear regression with methods for fitting the model and making predictions.
- Use mean squared error as the cost function.
- Utilize gradient descent for optimization.
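For reference, the quantities named in these bullets can be written out as follows. The notation is an assumption made here ($m$ samples, $n$ features, learning rate $\alpha$), chosen to match the `theta_0` and `theta_n` names used in the class. The cost is written as half the mean squared error because the gradients in the code carry no factor 2. The minimizer is the same as for the plain MSE.

```latex
\hat{y}^{(i)} = \theta_0 + \sum_{j=1}^{n} \theta_j x_j^{(i)}

J(\theta_0, \theta_1, \dots, \theta_n)
  = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
  = \tfrac{1}{2}\,\mathrm{MSE}

\frac{\partial J}{\partial \theta_0}
  = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right),
\qquad
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}

\theta_0 \leftarrow \theta_0 - \alpha \frac{\partial J}{\partial \theta_0},
\qquad
\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}
\quad (j = 1, \dots, n)
```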

In [6]:
# Linear regression implementation from scratch
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=10000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        # The parameters are unknown until fit() is called, so they start as None.
        # fit() then sets theta_0 (intercept) to a scalar and theta_n (weights) to a NumPy array.
        self.theta_0 = None
        self.theta_n = None

    def fit(self, X, y):
        # Unpack the shape of X: n_samples is the number of rows,
        # n_features is the number of columns
        n_samples, n_features = X.shape

        # Store the weights in a NumPy array so that every update is vectorized:
        # one expression updates all weights at once, which is faster than a Python loop.
        # The model is y = theta_0 + theta_1*x_1 + ... + theta_n*x_n,
        # so there is one weight per feature plus the intercept theta_0.
        self.theta_0 = 0.0
        self.theta_n = np.zeros(n_features)
        
        for _ in range(self.n_iterations):
            # At the start of each iteration, compute the predictions for the
            # current parameters; the result is an array of length n_samples
            y_pred = self.theta_0 + np.dot(X, self.theta_n)

            # Gradients of the cost with respect to the intercept and the weights.
            # The factor 2 from differentiating the squared error is omitted,
            # which amounts to minimizing MSE/2 (or folding the 2 into the learning rate).
            MSE_gradient_0 = (1 / n_samples) * np.sum(y_pred - y)
            MSE_gradient_n = (1 / n_samples) * np.dot(X.T, (y_pred - y))

            # Gradient descent step: move each parameter against its gradient
            self.theta_n -= self.learning_rate * MSE_gradient_n
            self.theta_0 -= self.learning_rate * MSE_gradient_0

    def predict(self, X):
        # Compute the predicted values for the input data from the learned parameters
        return self.theta_0 + np.dot(X, self.theta_n)
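A quick way to gain confidence in this kind of update rule is to run it on synthetic data whose true parameters are known. The sketch below is a standalone function rather than the class above. `gradient_descent`, the learning rate, the iteration count, and the synthetic data are all made up for this check. It also records the cost at each iteration so convergence can be inspected.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iterations=2000):
    """Batch gradient descent on half the MSE.

    Returns the intercept, the weights, and the cost recorded at each iteration.
    """
    n_samples, n_features = X.shape
    theta_0, theta_n = 0.0, np.zeros(n_features)
    history = []
    for _ in range(n_iterations):
        residual = theta_0 + X @ theta_n - y
        history.append(np.mean(residual ** 2) / 2)
        theta_0 -= learning_rate * residual.mean()
        theta_n -= learning_rate * (X.T @ residual) / n_samples
    return theta_0, theta_n, history

# Synthetic, noise-free data with known parameters: intercept 3, weights [2, -1]
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + X_demo @ np.array([2.0, -1.0])

b, w, costs = gradient_descent(X_demo, y_demo)
print(np.round(b, 3), np.round(w, 3))
print(costs[0], costs[-1])
```

If the recovered values are close to the true ones and the recorded cost shrinks toward zero, the update rule is at least consistent with the model it is supposed to fit.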

5- Training the Model:

- Instantiate the linear regression model.
- Train the model on the training set using the implemented gradient descent algorithm.

In [7]:
# Instantiate and train the model
model = LinearRegression(learning_rate=0.05, n_iterations=10000)
model.fit(X_train_scaled, y_train)
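One way to judge whether a given learning rate and iteration count are enough is to compare the gradient descent result with the closed-form least-squares solution. The sketch below does this on synthetic data, with `np.linalg.lstsq` as the reference. The data, learning rate, and iteration count are arbitrary illustrative choices, not values taken from the housing run. In this sketch the intercept is carried by an unscaled column of ones in the design matrix.

```python
import numpy as np

# Synthetic data with a little noise (values made up for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = 1.5 + X_demo @ np.array([0.5, -2.0, 1.0]) + rng.normal(scale=0.1, size=300)

# Design matrix with a leading column of ones for the intercept
A = np.c_[np.ones(len(X_demo)), X_demo]

# Closed-form reference: ordinary least squares
theta_ref, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Batch gradient descent on the same problem
theta = np.zeros(A.shape[1])
for _ in range(5000):
    theta -= 0.05 * A.T @ (A @ theta - y_demo) / len(y_demo)

# Largest absolute difference between the two parameter vectors
print(np.max(np.abs(theta - theta_ref)))
```

A large gap between the two solutions would point to a learning rate that is too small, too few iterations, or poorly scaled features.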

6- Prediction and Evaluation:

- Make predictions on the test set.
- Evaluate the model's performance using mean squared error.

In [8]:
# Make predictions on the test set
predictions = model.predict(X_test_scaled)

# Evaluate the model
mse = np.mean((predictions - y_test)**2)
print(f"Mean Squared Error on Test Set: {mse}")

Mean Squared Error on Test Set: 0.5558915986959896
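Because the target is expressed in units of $100,000, the MSE is in squared units and is hard to read directly. Its square root (RMSE) is about 0.75 here, which corresponds to a typical error of roughly $75,000. The sketch below computes MSE, RMSE, and R² from their definitions. `regression_metrics` and the toy arrays are hypothetical, and only the last line uses the figure reported above.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R^2 computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_pred - y_true) ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return mse, rmse, 1.0 - ss_res / ss_tot

# Toy values, not taken from the housing data
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 4.0])
mse_demo, rmse_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(mse_demo, rmse_demo, r2_demo)

# RMSE implied by the test MSE reported above, in units of $100,000
print(np.sqrt(0.5558915986959896))
```

Calling `regression_metrics(y_test, predictions)` with the notebook's arrays would give the same MSE as the cell above, plus the two additional metrics.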
