# Imports

In [1]:
import numpy as np

np.random.seed(42)

# What is Gradient Descent?

Gradient descent is an iterative optimization algorithm used in machine learning to minimize a cost (loss) function. The gradient of a function points in the direction of its steepest increase, so in each iteration the algorithm computes the gradient of the cost function with respect to the model parameters and then moves the parameters a small step in the opposite direction. The size of that step is set by the learning rate. The process repeats until convergence, where the gradient is close to zero, which indicates that the algorithm has reached a stationary point, typically a local minimum.
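To make the update rule concrete, here is a minimal one-dimensional sketch. The function `f`, its derivative `df`, the starting point, and the step count are illustrative assumptions and are not part of the model built later in this notebook.

```python
# Hypothetical toy problem: minimize f(x) = (x - 3)^2, whose minimum is at x = 3
def f(x):
    return (x - 3.0) ** 2

def df(x):
    # Analytical derivative of f
    return 2.0 * (x - 3.0)

x = 0.0              # arbitrary starting point
learning_rate = 0.1  # step size

for step in range(100):
    # Move against the gradient, scaled by the learning rate
    x -= learning_rate * df(x)

print(x, f(x))  # x ends up very close to 3 and f(x) very close to 0
```

Each step shrinks the distance to the minimum by a constant factor here, which is why the gradient approaches zero as the iterations proceed.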

# Selecting the Functions

### Function to Replicate

**The Sigmoid Function**

The sigmoid function maps any real number to a value strictly between 0 and 1: \( \sigma(z) = \frac{1}{1 + e^{-z}} \). In this notebook it is applied to a linear combination of the inputs, \( z = XW + b \), so the model's output is a non-linear function of its parameters.

**Properties**

- **Range**: The output of the sigmoid function is always between 0 and 1. As \( x \) approaches negative infinity, the output approaches 0, and as \( x \) approaches positive infinity, the output approaches 1.
- **Smoothness**: The sigmoid function is smooth and differentiable everywhere, which makes it suitable for optimization algorithms such as gradient descent.
- **S-shaped curve**: The graph of the sigmoid function resembles the letter "S", hence the name "sigmoid". This shape makes the model non-linear, allowing it to capture relationships that a purely linear model cannot.
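The properties above can be checked numerically. The sketch below uses a hypothetical helper, `scalar_sigmoid`, which is the plain sigmoid of a single input rather than the notebook's model function, and compares the well-known derivative identity \( \sigma'(z) = \sigma(z)(1 - \sigma(z)) \) against a finite-difference estimate.

```python
import numpy as np

def scalar_sigmoid(z):
    # Plain sigmoid of z (hypothetical helper, separate from the model function)
    return 1.0 / (1.0 + np.exp(-z))

# Range: large negative inputs approach 0, large positive inputs approach 1
z = np.array([-20.0, 0.0, 20.0])
s = scalar_sigmoid(z)
print(s)

# Smoothness: the derivative exists everywhere and equals s * (1 - s)
z0 = 0.7
h = 1e-5
numeric = (scalar_sigmoid(z0 + h) - scalar_sigmoid(z0 - h)) / (2 * h)
analytic = scalar_sigmoid(z0) * (1 - scalar_sigmoid(z0))
print(numeric, analytic)
```

The factor `y_hat * (1 - y_hat)` that appears in the gradient code later in the notebook comes from this identity.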


In [2]:
def sigmoid(W: np.array, b: np.array, X: np.array):
    # Sigmoid applied to the linear model: 1 / (1 + exp(-(XW + b)))
    z = np.dot(X, W) + b
    return 1 / (1 + np.exp(-z))

### Loss Function

Mean squared error (MSE) is a fundamental metric used in statistics and machine learning to quantify the average squared difference between the actual values and the predicted values. It is calculated by taking the average of the squared differences between the predicted and true values for each data point.

MSE is favored for its mathematical properties: it is non-negative, differentiable, and penalizes large errors more heavily than small ones because the differences are squared. These properties make it a widely adopted measure for evaluating model performance and a convenient objective for gradient-based optimization.
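As a small worked example of the definition, the sketch below computes the MSE for four made-up predictions. The arrays `y_true_demo` and `y_pred_demo` are illustrative values, not data from this notebook.

```python
import numpy as np

# Hypothetical targets and predictions for four data points
y_true_demo = np.array([1.0, 0.0, 1.0, 1.0])
y_pred_demo = np.array([0.9, 0.2, 0.6, 1.0])

# Squared differences are 0.01, 0.04, 0.16 and 0.0
squared_errors = (y_pred_demo - y_true_demo) ** 2

# Their average is 0.21 / 4 = 0.0525 (up to floating-point rounding)
mse_demo = np.mean(squared_errors)
print(mse_demo)
```

Note how the single largest error (0.4) contributes most of the total, which illustrates the sensitivity to large errors.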

In [None]:
def error(W: np.array, b: np.array, X: np.array, y_true: np.array):
    m = X.shape[0]

    y_hat = sigmoid(W, b, X)

    cost = np.sum((y_hat - y_true)**2, axis = 0) / m

    return cost

# Generating the Data

In [3]:
# Taking number of features as 3 and instances as 100
n_features = 3
m = 100

In [4]:
W_true = np.full((n_features, 1), 0.5)
b_true = 1

# We are only focused on the algorithm, so we generate this randomly

X = np.random.rand(m, n_features)

y_true = sigmoid(W_true, b_true, X)
y_true.shape, X.shape

((100, 1), (100, 3))

# Calculating Gradients

In [5]:
def grad(W: np.array, b: np.array, X: np.array, y_true: np.array):
    m = X.shape[0]
    n_features = X.shape[1]

    # Calculating predicted y
    y_hat = sigmoid(W, b, X)

    # Initializing gradients
    grad_W = np.zeros_like(W)
    grad_b = 0

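    # Chain rule for sample i: (y_hat - y_true) * y_hat * (1 - y_hat) * x_ij.
    # The constant factor 2/m from the MSE derivative is omitted here, so it is
    # effectively absorbed into the learning rate.
    for i in range(m):
        delta = (y_hat[i] - y_true[i]) * y_hat[i] * (1 - y_hat[i])
        for j in range(n_features):
            grad_W[j] += delta * X[i][j]
        grad_b += delta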

    return grad_W, grad_b
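The nested loops can also be written with matrix operations. The sketch below is one possible vectorized form, named `grad_vectorized` here as a hypothetical alternative; `sigmoid` is repeated only so the block runs on its own. Like the loop version, it omits the constant factor 2/m, so it corresponds to the gradient of half the sum of squared errors. The finite-difference check at the end uses small demo arrays that are assumptions for illustration.

```python
import numpy as np

def sigmoid(W, b, X):
    # Same model function as in the notebook, repeated for self-containment
    return 1 / (1 + np.exp(-(np.dot(X, W) + b)))

def grad_vectorized(W, b, X, y_true):
    y_hat = sigmoid(W, b, X)                        # shape (m, 1)
    delta = (y_hat - y_true) * y_hat * (1 - y_hat)  # shape (m, 1)
    grad_W = X.T @ delta                            # shape (n_features, 1)
    grad_b = np.sum(delta, axis=0)                  # shape (1,)
    return grad_W, grad_b

def half_sse(W, b, X, y_true):
    # Half the sum of squared errors: the objective whose gradient is computed above
    return 0.5 * np.sum((sigmoid(W, b, X) - y_true) ** 2)

# Demo data (assumed values, generated only for this check)
rng = np.random.default_rng(0)
X_demo = rng.random((20, 3))
y_demo = sigmoid(np.full((3, 1), 0.5), 1.0, X_demo)
W_demo = rng.random((3, 1))
b_demo = 0.3

grad_W, grad_b = grad_vectorized(W_demo, b_demo, X_demo, y_demo)

# Central finite differences as an independent check of the analytical gradient
eps = 1e-6
numeric_W = np.zeros_like(W_demo)
for j in range(W_demo.shape[0]):
    step = np.zeros_like(W_demo)
    step[j] = eps
    numeric_W[j] = (half_sse(W_demo + step, b_demo, X_demo, y_demo)
                    - half_sse(W_demo - step, b_demo, X_demo, y_demo)) / (2 * eps)
numeric_b = (half_sse(W_demo, b_demo + eps, X_demo, y_demo)
             - half_sse(W_demo, b_demo - eps, X_demo, y_demo)) / (2 * eps)

print(np.allclose(grad_W, numeric_W, atol=1e-6))
print(np.allclose(grad_b, numeric_b, atol=1e-6))
```

The vectorized form avoids Python-level loops over samples and features, which usually matters once `m` or `n_features` grows.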

# Gradient Descent Loop

In [9]:
def gradient_descent(W: np.array, b: np.array, X: np.array, y_true: np.array, epochs = 100, learning_rate = 0.1):
    # Note: W is a NumPy array and is updated in place below, so the caller's array changes too

    for epoch in range(1, epochs + 1):

        # Compute gradients
        grad_W, grad_b = grad(W, b, X, y_true)

        # Update weights and bias
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b

        # Printing the Progress
        if epoch % 10 == 0:
            print(f"Epoch: {epoch}")
            print(f"Weights: {W}")
            print(f"Bias: {b}")
            print("-----------------------------------")

    return W, b
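Printing the parameters shows where they are heading, but the loss itself is a more direct signal of convergence. The sketch below is a hypothetical variant, `gradient_descent_with_history`, that records the MSE after every epoch and leaves the caller's arrays untouched. It uses a vectorized gradient with the same scaling as the loop-based `grad` above, and the demo data is generated with an assumed seed purely for illustration.

```python
import numpy as np

def sigmoid(W, b, X):
    # Same model function as in the notebook, repeated for self-containment
    return 1 / (1 + np.exp(-(np.dot(X, W) + b)))

def mse(W, b, X, y_true):
    return float(np.mean((sigmoid(W, b, X) - y_true) ** 2))

def gradient_descent_with_history(W, b, X, y_true, epochs=100, learning_rate=0.03):
    history = []
    for _ in range(epochs):
        y_hat = sigmoid(W, b, X)
        delta = (y_hat - y_true) * y_hat * (1 - y_hat)
        # "W = W - ..." builds a new array, so the caller's W is not modified
        W = W - learning_rate * (X.T @ delta)
        b = b - learning_rate * float(np.sum(delta))
        history.append(mse(W, b, X, y_true))
    return W, b, history

# Demo data (assumed values, mirroring the shapes used in this notebook)
rng = np.random.default_rng(0)
X_demo = rng.random((100, 3))
y_demo = sigmoid(np.full((3, 1), 0.5), 1.0, X_demo)
W0 = rng.random((3, 1))

W_fit, b_fit, history = gradient_descent_with_history(W0, 1.0, X_demo, y_demo)
print(history[0], history[-1])  # the loss after the last epoch should be lower
```

A loss curve that flattens out suggests the learning rate and epoch count are adequate; one that oscillates or grows suggests the learning rate is too large.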

In [7]:
W = np.random.rand(n_features, 1)
b = 1

In [10]:
gradient_descent(W, b, X, y_true, epochs = 100, learning_rate = 0.03)

Epoch: 0
Weights: [[0.23151505]
 [0.54758624]
 [0.54064995]]
Bias: [1.00461619]
-----------------------------------
Epoch: 10
Weights: [[0.25468119]
 [0.55742402]
 [0.54963451]]
Bias: [1.03225451]
-----------------------------------
Epoch: 20
Weights: [[0.26904467]
 [0.55939198]
 [0.55098635]]
Bias: [1.04264798]
-----------------------------------
Epoch: 30
Weights: [[0.27981704]
 [0.55841785]
 [0.54951411]]
Bias: [1.04639438]
-----------------------------------
Epoch: 40
Weights: [[0.28893044]
 [0.55631601]
 [0.54698947]]
Bias: [1.04742517]
-----------------------------------
Epoch: 50
Weights: [[0.29716913]
 [0.55381435]
 [0.54412192]]
Bias: [1.04732672]
-----------------------------------
Epoch: 60
Weights: [[0.3048667 ]
 [0.55121394]
 [0.54120328]]
Bias: [1.0467613]
-----------------------------------
Epoch: 70
Weights: [[0.31217081]
 [0.54863954]
 [0.53835272]]
Bias: [1.0460091]
-----------------------------------
Epoch: 80
Weights: [[0.31915063]
 [0.54614157]
 [0.53561673]]
Bias:

(array([[0.33163468],
        [0.54166255],
        [0.53077951]]),
 array([1.04360017]))