### 1. Linear function: $f(x)=Wx+b$
This is the basic building block of most neural network layers: a weighted input plus a bias.


In [3]:
x,W,b = 2.0,3.0,1.0
# forward
y = W*x +b
print("Output",y)

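# backward intuition
# y changes directly with W, x, and b: dy/dW = x, dy/dx = W, dy/db = 1
# dy is the upstream gradient (1 here, because y is the final output)
dy, dy_dw, dy_dx, dy_db = 1, x, W, 1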
print("Gradient wrt W:",dy*dy_dw)
print("Gradient wrt x:",dy*dy_dx)
print("Gradient wrt b:",dy*dy_db)


Output 7.0
Gradient wrt W: 2.0
Gradient wrt x: 3.0
Gradient wrt b: 1


#### What to notice
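* `b` always receives the full upstream gradient, because $\partial y/\partial b = 1$
* The gradient for `W` is proportional to the input `x`
* This is why scaling inputs matters: large inputs produce large weight gradients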
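To make the scaling point concrete, here is a minimal sketch that evaluates the same three gradients for inputs of different magnitude. The helper name `linear_grads` and the input values are illustrative assumptions, not part of the original cell.

```python
def linear_grads(W, x, b, upstream=1.0):
    """Gradients of y = W*x + b, scaled by the upstream gradient dL/dy."""
    dW = upstream * x    # dy/dW = x
    dx = upstream * W    # dy/dx = W
    db = upstream * 1.0  # dy/db = 1
    return dW, dx, db

W, b = 3.0, 1.0
# Hypothetical inputs spanning four orders of magnitude
for x in (0.02, 2.0, 200.0):
    dW, dx, db = linear_grads(W, x, b)
    print(f"x={x}: dW={dW}, dx={dx}, db={db}")
```

Only `dW` changes with the input, so a feature that is 100 times larger pulls on its weight 100 times harder for the same error.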

### 2. Sigmoid activation

Sigmoid squashes any real input into the range (0, 1).

In [4]:
import math
def sigmoid(z):
    return 1/(1+math.exp(-z))
z = 1.2
# forward
a = sigmoid(z)
print("Sigmoid output:",a)
# backward intuition
da_dz = a*(1-a)  # sensitivity depends on the output itself
da = 1  # upstream gradient
print("Gradient wrt z:",da*da_dz)

Sigmoid output: 0.7685247834990175
Gradient wrt z: 0.17789444064680576


#### What to notice
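* When the output is near 0 or 1, the gradient becomes very small (its maximum is 0.25, at `z = 0`)
* Multiplying many such small factors across layers is one cause of vanishing gradients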
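The saturation effect can be sketched by evaluating the sigmoid gradient at a few points. The sample values of `z` and the helper `sigmoid_grad` are assumptions chosen for illustration. The second loop ignores the weights, which also multiply into the real chain rule, and only shows how fast the best-case factor of 0.25 shrinks with depth.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_grad(z):
    a = sigmoid(z)
    return a * (1 - a)

# The gradient peaks at z = 0 and shrinks quickly on both sides
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  a={sigmoid(z):.5f}  da/dz={sigmoid_grad(z):.5f}")

# Best case per layer is 0.25, so stacking sigmoids shrinks the signal fast
for depth in (1, 5, 10):
    print(f"{depth} layers: at most {0.25 ** depth:.2e}")
```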

### 3. ReLU activation
Simple, fast, and widely used.

In [5]:
def relu(z):
    return max(0,z)
def relu_grad(z):
    return 1 if z>0 else 0

z = -0.5
# forward
a = relu(z)
print("ReLU output:",a)
# backward
da = 1
print("Gradient wrt z:", da*relu_grad(z))

ReLU output: 0
Gradient wrt z: 0


#### What to notice
* If a neuron is off (`z <= 0`), it passes back zero gradient
* This can stall learning if many neurons stay inactive (the "dying ReLU" problem)
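As a rough sketch of what "zero gradient" means for the weights, the snippet below puts a ReLU on top of a linear function and a squared-error loss. The function `neuron_weight_grad` and all numeric values are hypothetical, picked only so that one call leaves the neuron active and the other leaves it inactive.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def neuron_weight_grad(W, b, x, y_true):
    """dLoss/dW for loss = (relu(W*x + b) - y_true)**2."""
    z = W * x + b
    a = relu(z)
    dloss_da = 2 * (a - y_true)  # loss gradient at the neuron output
    da_dz = relu_grad(z)         # 1 if active, 0 if inactive
    dz_dW = x
    return dloss_da * da_dz * dz_dW

print("active neuron:  ", neuron_weight_grad(W=1.0, b=0.0, x=3.0, y_true=10.0))
print("inactive neuron:", neuron_weight_grad(W=-1.0, b=0.0, x=3.0, y_true=10.0))
```

In the inactive case the loss is still large, but the weight receives no signal, so a gradient step leaves `W` unchanged.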

### 4. Softmax (intuition only)

Softmax turns scores into probabilities.

In [6]:
import numpy as np
scores = np.array([2.0,1.0,0.1])
# forward
exp_scores = np.exp(scores)
probs = exp_scores/np.sum(exp_scores)

print("Probabilities:",probs)

Probabilities: [0.65900114 0.24243297 0.09856589]


#### Gradient intuition
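* Increasing one score increases its own probability
* But it decreases all the others, because the probabilities must sum to 1
* Outputs are coupled, unlike sigmoid or ReLU, which act on each value independently
* In practice, softmax gradients are usually computed by libraries.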
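For readers who want to see the coupling directly, here is a small sketch of the softmax Jacobian, using the standard identity $\partial p_i/\partial s_j = p_i(\delta_{ij} - p_j)$. The helper names `softmax` and `softmax_jacobian` are assumptions for this example. The max-subtraction is a common numerical-stability trick that the original cell does not use, and it does not change the result.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)  # avoids overflow in exp; output is unchanged
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

def softmax_jacobian(probs):
    # J[i, j] = d probs[i] / d scores[j] = probs[i] * (delta_ij - probs[j])
    return np.diag(probs) - np.outer(probs, probs)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
J = softmax_jacobian(probs)

print("Jacobian:\n", J)
print("Column sums:", J.sum(axis=0))  # each column sums to ~0
```

The diagonal entries are positive and the off-diagonal entries are negative: raising one score raises its own probability and lowers every other one by the same total amount.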

### 5. Mean Squared Error (MSE)

Measures how wrong predictions are by squaring the error. With a single example, as here, the mean is just that one squared error.

In [7]:
y_true = 10.0
y_pred = 7.0
# forward
loss = (y_pred - y_true)**2
print("Loss:",loss)

# backward 
dloss_dy = 2*(y_pred - y_true)

print("Gradient wrt prediction:",dloss_dy)

Loss: 9.0
Gradient wrt prediction: -6.0


#### What to notice
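* Bigger error → bigger correction: the gradient grows linearly with the error
* The sign tells the model whether to increase or decrease its output (here it is negative, so increasing the prediction lowers the loss)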
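The cell above uses a single example, so no averaging happens. A minimal sketch of the batched version, assuming three made-up targets and predictions, shows where the "mean" comes in and how it rescales each gradient by `1/N`.

```python
import numpy as np

# Hypothetical batch of three examples
y_true = np.array([10.0, 4.0, -1.0])
y_pred = np.array([7.0, 5.0, -1.0])

errors = y_pred - y_true
loss = np.mean(errors ** 2)

# Each prediction gets 2 * error / N, because the loss is averaged over N examples
dloss_dy = 2 * errors / errors.size

print("Loss:", loss)
print("Gradient wrt predictions:", dloss_dy)
```

The first prediction is too low and gets a negative gradient, the second is too high and gets a positive one, and the third is exact and gets zero.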

### 6. Putting it together 

Linear layer + loss + gradient update = one step of learning.

In [8]:
W, b, x, y_true, lr = 1.0, 0.0, 3.0, 10.0, 0.1
# forward
y_pred = W*x + b
loss = (y_pred - y_true)**2
# backward
dloss_dy = 2*(y_pred-y_true)
dy_dW = x
dy_db = 1

dW = dloss_dy*dy_dW
db = dloss_dy*dy_db

# update
W -= lr*dW
b -= lr*db

print("Update W:",W)
print("Update b:",b)
# This is gradient descent in its simplest form.

Update W: 5.2
Update b: 1.4000000000000001
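A single update is only one step. The sketch below repeats the same forward, backward, and update steps in a loop, under one deliberate change: it assumes `lr = 0.01` instead of `0.1`. With `x = 3`, each step multiplies the error by `1 - 2*lr*(x**2 + 1)`, which is `-1` for `lr = 0.1`, so the prediction jumps from 3 to 17 and the loss stays at 49. The number of steps is also an arbitrary choice.

```python
W, b, x, y_true = 1.0, 0.0, 3.0, 10.0
lr = 0.01  # assumption: smaller than the 0.1 used above, so the loss shrinks

losses = []
for step in range(50):
    # forward
    y_pred = W * x + b
    loss = (y_pred - y_true) ** 2
    losses.append(loss)

    # backward
    dloss_dy = 2 * (y_pred - y_true)
    dW = dloss_dy * x
    db = dloss_dy * 1

    # update
    W -= lr * dW
    b -= lr * db

    if step % 10 == 0:
        print(f"step {step:2d}  loss {loss:.6f}  W {W:.4f}  b {b:.4f}")

print("Final prediction:", W * x + b)
```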


### How to practice effectively
Try this:
* Change the inputs or targets and watch the gradient signs flip
* Stack Linear → ReLU → Linear → MSE
* Print the gradients at every step
* Break ReLU (make its input negative) and observe that learning stops
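One possible starting point for the stacking exercise is sketched below with scalars only. All parameter names (`W1`, `b1`, `W2`, `b2`) and values are hypothetical; the backward pass simply applies the chain rule from the loss back to the first layer, printing each gradient along the way.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Hypothetical data and parameters
x, y_true = 2.0, 5.0
W1, b1 = 0.5, 0.1
W2, b2 = -1.0, 0.2

# forward: Linear -> ReLU -> Linear -> MSE
z1 = W1 * x + b1
a1 = relu(z1)
y_pred = W2 * a1 + b2
loss = (y_pred - y_true) ** 2
print("z1:", z1, "a1:", a1, "y_pred:", y_pred, "loss:", loss)

# backward: walk the same chain in reverse
dloss_dy = 2 * (y_pred - y_true)
dW2 = dloss_dy * a1        # second linear layer
db2 = dloss_dy * 1
da1 = dloss_dy * W2        # gradient flowing into the ReLU output
dz1 = da1 * relu_grad(z1)  # ReLU gates the gradient
dW1 = dz1 * x              # first linear layer
db1 = dz1 * 1

for name, grad in [("dW2", dW2), ("db2", db2), ("da1", da1),
                   ("dz1", dz1), ("dW1", dW1), ("db1", db1)]:
    print(f"{name}: {grad}")
```

Setting `b1 = -2.0` makes `z1` negative, which zeroes `dz1`, `dW1`, and `db1` while leaving `db2` nonzero. That is the "broken ReLU" case from the last bullet.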