## Mathematics behind regularisation

**Step 1**: Initialization and Setup

The Concept: First, we define our class. The most important parameter here is `lambda_param` (often called $\alpha$ in libraries like Scikit-Learn). This controls the "strength" of the regularization.
- If $\lambda = 0$: we have standard Linear Regression.
- If $\lambda$ is high: we heavily penalize the model for large weights.

In [3]:
import numpy as np

class RegularizedLinearRegression:
    def __init__(self, learning_rate=0.001, n_iterations=1000, lambda_param=0.1, mode='l2'):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.lambda_param = lambda_param
        self.mode = mode  # 'l1' for Lasso, 'l2' for Ridge
        self.weights = None
        self.bias = None

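**Step 2**: The Prediction (Forward Pass)

The Math: This part does not change. Regardless of regularization, the linear prediction formula remains:

$$y_{pred} = X \cdot w + b$$

Here $X$ is the matrix of inputs (one row per sample), $w$ is the weight vector (written $\theta$ in the cost formulas), and $b$ is the bias.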

In [5]:
def predict(self, X):
        return np.dot(X, self.weights) + self.bias
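
To make the shapes concrete, here is a minimal sketch of the same expression outside the class. The inputs, weights, and bias are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical toy inputs: 3 samples, 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
weights = np.array([0.5, -1.0])  # one weight per feature
bias = 2.0

# Same expression as predict(): one prediction per row of X.
y_pred = np.dot(X, weights) + bias
print(y_pred.shape)
print(y_pred)
```

Each row of `X` produces one number, so the result has one entry per sample.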

**Step 3**: The Cost Function (The "Why")

The Math: This is where we change the definition of "success." We don't just want low error; we want low error AND small weights.

**Ridge (L2) Cost**:

$$J(\theta) = \underbrace{\frac{1}{2m} \sum (y_{pred} - y)^2}_{\text{MSE}} + \underbrace{\frac{\lambda}{m} \sum \theta^2}_{\text{Penalty}}$$

**Lasso (L1) Cost**:

$$J(\theta) = \underbrace{\frac{1}{2m} \sum (y_{pred} - y)^2}_{\text{MSE}} + \underbrace{\frac{\lambda}{m} \sum |\theta|}_{\text{Penalty}}$$

Here $m$ is the number of samples (`n_samples` in the code). The penalty is scaled by $1/m$ so that this cost is the function whose gradient the Step 4 code computes.

**Note**: We calculate this just to track progress. The gradient descent update never calls it, but it's great for debugging.

In [7]:
def _compute_cost(self, y_predicted, y, n_samples):
        # 1. Standard Error (MSE)
        cost = np.sum((y_predicted - y) ** 2) / (2 * n_samples)

        # 2. Add Regularization Penalty (scaled by 1/n_samples, like the gradient)
        if self.mode == 'l2':
            penalty = (self.lambda_param / n_samples) * np.sum(self.weights ** 2)
        else:  # l1
            penalty = (self.lambda_param / n_samples) * np.sum(np.abs(self.weights))

        return cost + penalty
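
To see how differently the two penalties treat the same weights, the sketch below compares the raw sums $\sum \theta^2$ and $\sum |\theta|$ for a made-up weight vector. The values are hypothetical and the scaling by $\lambda$ is left out, since it affects both sums equally.

```python
import numpy as np

# Hypothetical weight vector with one large weight and a few small ones.
weights = np.array([0.1, -3.0, 0.0, 2.0])

l2_sum = np.sum(weights ** 2)      # sum of squares, used by Ridge
l1_sum = np.sum(np.abs(weights))   # sum of absolute values, used by Lasso

# Share of each penalty that comes from the largest weight (-3.0).
l2_share = weights[1] ** 2 / l2_sum
l1_share = abs(weights[1]) / l1_sum

print(f"sum of squares: {l2_sum:.2f}, share from largest weight: {l2_share:.0%}")
print(f"sum of abs:     {l1_sum:.2f}, share from largest weight: {l1_share:.0%}")
```

The largest weight accounts for a bigger share of the squared penalty than of the absolute penalty, which is why Ridge pushes hardest on large weights while Lasso applies the same pressure to every non-zero weight.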

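**Step 4**: The Derivatives (The "How")

We need to find the gradient (slope) to update our weights. When we take the derivative of the penalty terms, we get:

Ridge Derivative: The derivative of $\theta^2$ is $2\theta$.

$$dw = \text{Standard Gradient} + \frac{2\lambda}{m} \theta$$

Lasso Derivative: The derivative of $|\theta|$ is the sign of $\theta$ (either +1 or -1). At exactly $\theta = 0$ the derivative is undefined; `np.sign` returns 0 there, so a weight sitting at 0 receives no push from the penalty.

$$dw = \text{Standard Gradient} + \frac{\lambda}{m} \cdot \text{sign}(\theta)$$

The $1/m$ factor, where $m$ is the number of samples (`n_samples`), matches the scaling used in the code.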

In [9]:
def _get_gradients(self, X, y, y_predicted, n_samples):
        # 1. Calculate Standard Gradient (dW)
        dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
        db = (1 / n_samples) * np.sum(y_predicted - y)
        
        # 2. Add Regularization Term to dW
        if self.mode == 'l2':
            # Ridge: Derivative of w^2 is 2w
            dw += (2 * self.lambda_param / n_samples) * self.weights
            
        elif self.mode == 'l1':
            # Lasso: Derivative of |w| is sign(w)
            dw += (self.lambda_param / n_samples) * np.sign(self.weights)
            
        # Note: We rarely regularize the bias (db), so we leave it alone.
        return dw, db
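
A common way to sanity-check hand-derived gradients is to compare them with finite differences of the cost. The sketch below does that with free functions instead of the class. It assumes the penalty in the cost is scaled by `1 / n_samples`, which is the scaling the gradient code uses; the function names and random data are hypothetical.

```python
import numpy as np

def cost(X, y, w, b, lam, mode):
    """Half-MSE plus a penalty scaled by 1/n (assumed, to match the gradient code)."""
    n = X.shape[0]
    residual = X @ w + b - y
    penalty = np.sum(w ** 2) if mode == 'l2' else np.sum(np.abs(w))
    return np.sum(residual ** 2) / (2 * n) + (lam / n) * penalty

def analytic_gradients(X, y, w, b, lam, mode):
    """Same formulas as _get_gradients, written as a free function."""
    n = X.shape[0]
    residual = X @ w + b - y
    dw = X.T @ residual / n
    db = np.sum(residual) / n
    if mode == 'l2':
        dw = dw + (2 * lam / n) * w
    else:
        dw = dw + (lam / n) * np.sign(w)
    return dw, db

def numeric_dw(X, y, w, b, lam, mode, eps=1e-6):
    """Central finite differences of the cost with respect to each weight."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (cost(X, y, w + step, b, lam, mode)
                   - cost(X, y, w - step, b, lam, mode)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = np.array([0.7, -1.2, 0.4])   # non-zero weights, so |w| is differentiable here
b = 0.3

max_diffs = {}
for mode in ('l2', 'l1'):
    dw, _ = analytic_gradients(X, y, w, b, lam=0.5, mode=mode)
    max_diffs[mode] = np.max(np.abs(dw - numeric_dw(X, y, w, b, 0.5, mode)))
    print(mode, max_diffs[mode])
```

If the printed differences are tiny, the cost and the gradient describe the same objective. A large difference usually means a scaling factor is missing on one side.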

**Step 5**: The Training Loop (Putting it Together)

The Concept: Finally, we run the loop. We predict, compute gradients, and update weights.

$$\theta_{new} = \theta_{old} - \text{learning\_rate} \times \text{gradient}$$

In [11]:
def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.n_iterations):
            # 1. Predict
            y_predicted = np.dot(X, self.weights) + self.bias
            
            # 2. Calculate Gradients (with regularization)
            dw, db = self._get_gradients(X, y, y_predicted, n_samples)
            
            # 3. Update Weights
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
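
Since the cost exists mainly to track progress, one option is to record it inside the loop. The sketch below is a hypothetical standalone helper, not part of the class. It mirrors `fit` in `'l2'` mode and assumes the penalty in the cost is scaled by `1 / n_samples`, as in the gradient code.

```python
import numpy as np

def fit_with_history(X, y, lr=0.01, n_iterations=1000, lam=0.5):
    """Ridge gradient descent that also records the cost at every iteration.

    Hypothetical helper for illustration; it mirrors fit() in 'l2' mode.
    """
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    history = []

    for _ in range(n_iterations):
        residual = np.dot(X, weights) + bias - y

        # Cost before the update: half-MSE plus the (assumed) 1/n-scaled penalty.
        cost = (np.sum(residual ** 2) / (2 * n_samples)
                + (lam / n_samples) * np.sum(weights ** 2))
        history.append(cost)

        # Same gradients as _get_gradients in 'l2' mode.
        dw = np.dot(X.T, residual) / n_samples + (2 * lam / n_samples) * weights
        db = np.sum(residual) / n_samples

        weights -= lr * dw
        bias -= lr * db

    return weights, bias, history

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
weights, bias, history = fit_with_history(X, y)
print(f"cost at start: {history[0]:.4f}, cost at end: {history[-1]:.4f}")
```

A cost that keeps falling at the last iteration is a hint that more iterations (or a larger learning rate) would still change the weights.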

**Conclusion**: The "Tax" on Complexity
In this notebook, we built Regularised Linear Regression from scratch. We discovered that regularisation is essentially adding a "tax" to our model's learning process.
- Standard Regression cares only about getting the answer right (minimising error).
- Regularised Regression cares about getting the answer right simply.

By adding the penalty term (scaled by $\lambda$) to our cost function, we force the model to ask: is it worth increasing this weight to lower the error slightly, or is the "tax" (penalty) too high?

Which one should you choose?
- Use Ridge (L2) when you want to keep all your features but reduce the impact of noise. It shrinks weights close to 0 but rarely exactly to 0.
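- Use Lasso (L1) when you suspect many features are useless. It can shrink weights all the way to 0, effectively deleting bad features from your equation. (With the plain `np.sign` update used in this notebook, such weights tend to hover very close to 0 rather than land on it exactly; library solvers such as Scikit-Learn's `Lasso`, which uses coordinate descent, return exact zeros.)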
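
The difference between "close to 0" and "exactly 0" is easiest to see in a simplified one-weight problem, which is an assumption made for illustration and not the notebook's training procedure. Minimising $\frac{1}{2}(w - a)^2 + \lambda w^2$ gives $w = a / (1 + 2\lambda)$, while minimising $\frac{1}{2}(w - a)^2 + \lambda |w|$ gives the soft-threshold $w = \text{sign}(a) \cdot \max(|a| - \lambda, 0)$. Here $a$ stands for the weight the model would pick with no penalty.

```python
import numpy as np

# Hypothetical unpenalised weights: one large, two small.
a = np.array([3.0, 0.5, -0.2])
lam = 1.0

# Ridge in this one-weight setting: every weight is scaled down by the same factor.
ridge_w = a / (1 + 2 * lam)

# Lasso in this one-weight setting: every weight moves toward 0 by lam,
# and anything smaller than lam in magnitude is set to 0.
lasso_w = np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

print("ridge:", ridge_w)
print("lasso:", lasso_w)
```

Ridge keeps all three weights non-zero, while Lasso zeroes the two small ones and keeps only the large one.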

**The Implementation Example**
The code below demonstrates how to actually run the class we wrote. I have commented on every single "knob" (variable) you can turn so you can explain exactly how tuning works to your audience.

In [14]:
import numpy as np

class RegularizedLinearRegression:
    def __init__(self, learning_rate=0.001, n_iterations=1000, lambda_param=0.1, mode='l2'):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.lambda_param = lambda_param
        self.mode = mode  # 'l1' for Lasso, 'l2' for Ridge
        self.weights = None
        self.bias = None

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def _get_gradients(self, X, y, y_predicted, n_samples):
        # 1. Calculate Standard Gradient (dW)
        dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
        db = (1 / n_samples) * np.sum(y_predicted - y)
        
        # 2. Add Regularisation Term to dW
        if self.mode == 'l2':
            # Ridge: Derivative of w^2 is 2w
            dw += (2 * self.lambda_param / n_samples) * self.weights
            
        elif self.mode == 'l1':
            # Lasso: Derivative of |w| is sign(w)
            dw += (self.lambda_param / n_samples) * np.sign(self.weights)
            
        return dw, db
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.n_iterations):
            # 1. Predict
            y_predicted = np.dot(X, self.weights) + self.bias
            
            # 2. Calculate Gradients (with regularisation)
            dw, db = self._get_gradients(X, y, y_predicted, n_samples)
            
            # 3. Update Weights
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

In [24]:
import numpy as np
# Note: Ensure the 'RegularizedLinearRegression' class is defined in a previous cell

# --- STEP 1: GENERATE DUMMY DATA ---
# We create 5 samples with 1 feature each.
# The relationship is exactly y = 2x + 1 (no noise added).
X_train = np.array([[1], [2], [3], [4], [5]]) 
y_train = np.array([3, 5, 7, 9, 11]) 

# --- STEP 2: CONFIGURE THE MODEL ---
# Here is where you change variables to your liking:

model = RegularizedLinearRegression(
    learning_rate=0.01,   # How big of a step we take. 
                          # If too small: Model learns too slowly.
                          # If too big: Model might overshoot the answer.
    
    n_iterations=1000,    # How many times the model sees the data.
                          # More iterations = better fit, but takes longer.
    
    lambda_param=0.5,     # THE REGULARISATION STRENGTH.
                          # 0.0 = No regularisation (Standard Regression).
                          # 100.0 = Huge penalty (Weights will be crushed to near 0).
                          # Try changing this to see how weights shrink!
    
    mode='l2'             # THE TYPE OF PENALTY.
                          # 'l2' (Ridge) = Good for general stability.
                          # 'l1' (Lasso) = Good for feature selection.
)

# --- STEP 3: TRAIN THE MODEL ---
# The .fit() method runs the gradient descent loop we wrote earlier.
print("Training model...")
model.fit(X_train, y_train)

# --- STEP 4: INSPECT RESULTS ---
# Let's see what weights the model learned.
print(f"Final Weights: {model.weights}")
print(f"Final Bias: {model.bias}")

# --- STEP 5: MAKE A PREDICTION ---
# Predict the value for a new input, x = 6.
# We expect the answer to be around 13 (since y = 2x + 1).
X_test = np.array([[6]])
prediction = model.predict(X_test)

print(f"Prediction for input 6: {prediction}")

Training model...
Final Weights: [1.86079214]
Final Bias: 1.3889763125214563
Prediction for input 6: [12.55372914]


**1. The Equation the Model Found**

The model learned that the relationship between $X$ and $y$ is:

$$y = 1.86x + 1.39$$

- Final Weights (1.86): This is the slope.
- Final Bias (1.39): This is the y-intercept (where the line hits the vertical axis).

**2. Why isn't it exactly $y = 2x + 1$?**

The "true" data followed the pattern $y = 2x + 1$.
- True Slope: 2.0
- Learned Slope: 1.86

Why is it lower? Mainly because of regularisation. With `lambda_param=0.5`, the "tax" penalised the model for keeping a weight of 2.0, so it shrank the weight to lower its "tax bill." Two caveats apply. First, the bias is not penalised, and it rose from 1.0 to 1.39 to compensate for the smaller slope. Second, 1000 iterations at `learning_rate=0.01` may not be enough for gradient descent to converge fully, so part of the gap can come from stopping early. To isolate the effect of the penalty, rerun with `lambda_param=0.0` and compare the weights.

**3. The Prediction Check**

The model was asked: "If the input is 6, what is the output?"
- The Math: $1.86(6) + 1.39 \approx 12.55$
- The "Perfect" Answer: $2(6) + 1 = 13$

The answer is slightly off (12.55 vs 13), and that is the trade-off: you accept a slightly less accurate line in exchange for a "simpler" (smaller weight) model. On this noise-free toy data the penalty only costs accuracy. On real, noisy data, a model with smaller weights is often more robust to noise and overfitting.
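
One way to separate the effect of the penalty from the effect of stopping early is to solve for the minimiser directly instead of iterating. The sketch below assumes the objective implied by the gradient code in `'l2'` mode: half-MSE plus $\frac{\lambda}{m}\sum w^2$, with the bias left unpenalised. Setting both gradients to zero gives a small linear system; the matrix names `A` and `P` are introduced here for illustration.

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
lam = 0.5
n = X.shape[0]

# Append a column of ones so the bias is solved for together with the weight.
A = np.hstack([X, np.ones((n, 1))])

# Setting the gradients to zero gives (A^T A + 2*lam*P) theta = A^T y,
# where P penalises the weight but not the bias.
P = np.diag([1.0, 0.0])
theta = np.linalg.solve(A.T @ A + 2 * lam * P, A.T @ y)
w_exact, b_exact = theta
print(w_exact, b_exact)

# With no penalty, the same system recovers the line that generated the data.
w_plain, b_plain = np.linalg.solve(A.T @ A, A.T @ y)
print(w_plain, b_plain)
```

Under these assumptions the penalised minimiser is $w = 20/11 \approx 1.82$ and $b = 17/11 \approx 1.55$, while the unpenalised solve returns slope 2 and intercept 1. The run above reported 1.86 and 1.39, which suggests gradient descent had not fully converged after 1000 iterations, and that the fully converged slope would be shrunk slightly more.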