## Neural Network Basics

A gentle, practical introduction to deep learning fundamentals. This notebook explains the key ideas in simple terms: how ML and DL differ, how neural networks are built, what MLPs are, common activation and loss functions, and how gradient descent and backpropagation train a network.



In [13]:
# Simple contrast: linear model vs small MLP in scikit-learn
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')  # keeps the output short; also hides e.g. ConvergenceWarning

X, y = make_moons(n_samples=600, noise=0.2, random_state=45)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=45)

lr = LogisticRegression().fit(X_train, y_train)
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=500, random_state=45).fit(X_train, y_train)

print("LogReg acc:", accuracy_score(y_test, lr.predict(X_test)))
print("MLP acc:", accuracy_score(y_test, mlp.predict(X_test)))


LogReg acc: 0.8444444444444444
MLP acc: 0.9388888888888889


## What is Machine Learning (ML) vs Deep Learning (DL)?

- **Machine Learning (ML)**: Teach computers to make predictions or decisions from data. In the common supervised setting, you provide features (inputs) and true labels (targets), and the algorithm learns the patterns that relate them.
- **Deep Learning (DL)**: A subset of ML that uses multi-layer neural networks to automatically learn useful representations (features) from raw data (images, audio, text), often at large scale.

Key differences in simple terms:
- **Feature engineering**: Classical ML often needs hand-crafted features; DL learns features end-to-end from data.
- **Data scale**: DL shines with lots of labeled data and compute.
- **Model depth**: DL stacks many layers (hence "deep").

When to use DL:
- You have complex inputs (images, audio, text) and enough data.
- You want to replace manual feature engineering with learned representations. 


In [2]:
# Minimal forward pass through a 2-layer perceptron
import numpy as np

x = np.array([0.2, -0.4])
W1 = np.array([[0.3, -0.1],
               [0.25, 0.4]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[0.5],
               [-0.3]])
b2 = np.array([0.1])

relu = lambda z: np.maximum(0, z)

z1 = W1 @ x + b1
h1 = relu(z1)
z2 = W2.T @ h1 + b2
print("z1:", z1)
print("h1:", h1)
print("z2 (logit):", z2)


z1: [ 0.1  -0.11]
h1: [0.1 0. ]
z2 (logit): [0.15]


## Neural Network Structure (High-Level)

A neural network is a stack of layers that transform inputs into outputs. Each layer applies a simple operation and passes its result to the next layer.

- **Neurons**: Small units that compute a weighted sum of inputs plus a bias, then pass it through an activation function.
- **Layers**: Collections of neurons. Common types: input, hidden, output.
- **Weights and biases**: Parameters the model learns to minimize error.
- **Activation function**: Non-linearity that lets networks learn complex patterns.

Basic forward pass idea:
1. Start with input features x.
2. Multiply by weights and add bias: z = W x + b.
3. Apply activation: a = activation(z).
4. Repeat across layers until you get predictions. 
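
The four steps above can be sketched as a loop over layers. This is a minimal illustration rather than a reference implementation: the `forward` helper, the 2-3-1 layer sizes, and the weight values are all hypothetical, and it assumes ReLU on every hidden layer with the final layer left linear.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, layers):
    """Run a forward pass through a list of (W, b) pairs.

    Each hidden layer computes z = W a + b followed by ReLU; the final
    layer is left linear so its output can be read as a raw score (logit).
    """
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b
        is_last = i == len(layers) - 1
        a = z if is_last else relu(z)
    return a

# Hypothetical 2-3-1 network with hand-picked weights
layers = [
    (np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), np.zeros(3)),
    (np.array([[1.0, -1.0, 0.5]]), np.array([0.1])),
]
out = forward(np.array([0.5, -2.0]), layers)
print("output:", out)
```

Storing the layers as a list keeps the loop independent of network depth: adding a layer means appending one more `(W, b)` pair.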


In [3]:
# Tiny example contrasting ML vs DL (feature engineering vs end-to-end)
# Here, we simulate a hand-crafted feature and a small MLP on raw features
import numpy as np

np.random.seed(1)
X = np.random.randn(100, 2)
y = (X[:, 0] * 0.7 + (X[:, 1] > 0).astype(float) + 0.2*np.random.randn(100) > 0).astype(int)

# "Manual feature": thresholded second feature
feat_manual = (X[:, 1] > 0).astype(float)
acc_manual = (feat_manual == y).mean()

# Simple 2-4-1 MLP trained with 50 steps of full-batch gradient descent (very rough)
W1 = np.random.randn(2, 4) * 0.5
b1 = np.zeros(4)
W2 = np.random.randn(4, 1) * 0.5
b2 = np.zeros(1)

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1/(1+np.exp(-z))

lr = 0.1
for _ in range(50):
    # forward pass over the full dataset (no mini-batches)
    Z1 = X @ W1 + b1
    H1 = relu(Z1)
    P = sigmoid(H1 @ W2 + b2)[:, 0]
    # gradients of the mean binary cross-entropy; with a sigmoid output, dL/dlogit = P - y
    dZ2 = P - y
    dW2 = H1.T @ dZ2[:, None] / len(X)
    db2 = dZ2.mean()
    dH1 = dZ2[:, None] @ W2.T
    dZ1 = dH1 * (Z1 > 0)
    dW1 = X.T @ dZ1 / len(X)
    db1 = dZ1.mean(axis=0)
    # update
    W2 -= lr*dW2
    b2 -= lr*db2
    W1 -= lr*dW1
    b1 -= lr*db1

# evaluate on the same data used for training (no held-out split)
P = sigmoid(relu(X @ W1 + b1) @ W2 + b2)[:, 0]
pred = (P > 0.5).astype(int)
acc_mlp = (pred == y).mean()

print(f"Manual feature acc: {acc_manual:.2f}\nMLP acc: {acc_mlp:.2f}")


Manual feature acc: 0.63
MLP acc: 0.82


## Multilayer Perceptrons (MLPs)

An MLP is a classic feedforward neural network:
- Layers are arranged in a sequence: input → one or more hidden layers → output.
- Consecutive layers are fully connected: every neuron feeds every neuron in the next layer.
- Suitable for tabular data and as building blocks in many models.

Key ideas:
- **Capacity vs overfitting**: More layers/neurons can model more complex patterns but may overfit small datasets.
- **Regularization**: Techniques like dropout, weight decay, early stopping reduce overfitting.
- **Initialization & normalization**: Good weight init and normalization (e.g., batch norm) help stable training. 
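
To make the capacity point concrete, it can help to count trainable parameters. The hypothetical helper `count_mlp_params` below assumes fully connected layers with one bias per neuron. The first size list mirrors the `(16, 16)` hidden layers of the scikit-learn classifier at the top of this notebook, assuming two inputs and a single output unit; the wider variant is made up for comparison.

```python
def count_mlp_params(layer_sizes):
    """Count weights and biases in a fully connected network.

    A layer mapping n_in inputs to n_out neurons has n_in * n_out weights
    plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

small = count_mlp_params([2, 16, 16, 1])   # 48 + 272 + 17
wide = count_mlp_params([2, 64, 64, 1])    # 192 + 4160 + 65
print(small, wide)
```

Making the hidden layers four times wider increases the count roughly thirteenfold, because the hidden-to-hidden weight matrix grows with the square of the width.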


In [4]:
# Simple MLP layer forward pass
import numpy as np

np.random.seed(42)

x = np.array([1.0, -1.0, 0.5])   # 3 input features
W = np.random.randn(4, 3) * 0.1   # 4 neurons x 3 inputs
b = np.zeros(4)

def relu(z):
    return np.maximum(0, z)

z = W @ x + b
h = relu(z)
print("z:", np.round(z, 3))
print("h (ReLU):", np.round(h, 3))


z: [0.096 0.164 0.058 0.077]
h (ReLU): [0.096 0.164 0.058 0.077]


## Activation Functions (Intuition + Common Choices)

Why we need them:
- Without activations, a stack of linear layers collapses to a single linear transform. Activations add non-linearity so networks can model complex patterns.
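
The collapse can be checked numerically. The sketch below uses arbitrary random weights (the shapes and seed are assumptions chosen only for illustration) and shows that two linear layers with no activation in between give the same output as one merged linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two "layers" with no activation in between
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer: W = W2 W1, b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print("same output:", np.allclose(two_layers, one_layer))
```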

Common functions:
- **ReLU (Rectified Linear Unit)**: `ReLU(z) = max(0, z)`
  - Simple, fast, works well in many models.
  - Variants: LeakyReLU, GELU.
- **Sigmoid**: Squashes values to (0, 1). Often used for binary classification output.
- **Tanh**: Squashes to (-1, 1). Zero-centered, but can saturate.
- **Softmax**: Turns a vector into probabilities that sum to 1. Used for multi-class outputs.

Practical tips:
- Start with ReLU (or GELU) in hidden layers.
- Use sigmoid for binary output, softmax for multi-class output. 
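
Softmax is described above, but the NumPy activation cell in this notebook only covers ReLU, sigmoid, and tanh. A minimal sketch follows, assuming the usual trick of subtracting the maximum logit before exponentiating to avoid overflow; the example logits are made up.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print("probs:", np.round(probs, 3), "sum:", probs.sum())

# Very large logits would overflow a naive np.exp(logits); the shift avoids that
big = softmax(np.array([1000.0, 1001.0]))
print("large logits:", np.round(big, 3))
```

Subtracting a constant from every logit does not change the result, since the same factor cancels in the numerator and denominator.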


In [5]:
# Activation functions in NumPy
import numpy as np

z = np.linspace(-5, 5, 11)
relu = np.maximum(0, z)
sigmoid = 1/(1+np.exp(-z))
tanh = np.tanh(z)

print("z:", z)
print("ReLU:", relu)
print("Sigmoid:", np.round(sigmoid, 3))
print("Tanh:", np.round(tanh, 3))


z: [-5. -4. -3. -2. -1.  0.  1.  2.  3.  4.  5.]
ReLU: [0. 0. 0. 0. 0. 0. 1. 2. 3. 4. 5.]
Sigmoid: [0.007 0.018 0.047 0.119 0.269 0.5   0.731 0.881 0.953 0.982 0.993]
Tanh: [-1.    -0.999 -0.995 -0.964 -0.762  0.     0.762  0.964  0.995  0.999
  1.   ]


## Loss (Cost) Functions

The loss tells us how wrong the model is. Training means adjusting weights to reduce this loss.

Common choices:
- **Mean Squared Error (MSE)**: For regression (predicting numbers). Penalizes the squared difference between prediction and target, averaged over examples.
- **Binary Cross-Entropy (Log Loss)**: For binary classification. Compares predicted probability vs true label (0/1).
- **Categorical Cross-Entropy**: For multi-class classification. Used with softmax outputs.

Notes:
- Loss choice should match the task and output activation.
- Add regularization terms (e.g., L2) to discourage overly large weights. 
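
As a rough sketch of the two points not shown in the loss cell of this notebook, the block below computes categorical cross-entropy directly from logits and then adds an L2 penalty. The helper name `cross_entropy_from_logits`, the example logits, the weight vector, and the strength `lam` are all hypothetical values chosen for illustration.

```python
import numpy as np

def cross_entropy_from_logits(logits, label):
    """Categorical cross-entropy for one example.

    Computes log-softmax in a numerically stable way, then returns the
    negative log-probability of the true class.
    """
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

logits = np.array([2.0, 1.0, 0.1])
loss_correct = cross_entropy_from_logits(logits, label=0)  # true class has the top score
loss_wrong = cross_entropy_from_logits(logits, label=2)    # true class has the lowest score
print(f"CE (label=0): {loss_correct:.3f}  CE (label=2): {loss_wrong:.3f}")

# L2 regularization adds lam * sum(w^2) to the data loss
weights = np.array([0.5, -1.0, 2.0])  # hypothetical model weights
lam = 0.01                            # hypothetical regularization strength
total = loss_correct + lam * np.sum(weights ** 2)
print(f"CE + L2: {total:.4f}")
```

The loss is small when the true class already has the highest logit and grows as the true class is ranked lower.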


In [6]:
# Simple loss function examples
import numpy as np

# Regression with MSE
y_true = np.array([2.5, 0.0, 2.1])
y_pred = np.array([2.7, -0.1, 2.0])
mse = np.mean((y_pred - y_true)**2)
print("MSE:", mse)

# Binary classification with BCE
def bce(p, y):
    p = np.clip(p, 1e-9, 1-1e-9)
    return -(y*np.log(p) + (1-y)*np.log(1-p))

print("BCE( p=0.9, y=1 ):", bce(0.9, 1))
print("BCE( p=0.1, y=1 ):", bce(0.1, 1))


MSE: 0.020000000000000028
BCE( p=0.9, y=1 ): 0.10536051565782628
BCE( p=0.1, y=1 ): 2.3025850929940455


## Gradient Descent (How We Learn)

Goal: Find weights that minimize the loss.

Simple idea:
1. Pick initial weights (random).
2. Compute the loss on your data.
3. Compute the gradient: how much each weight affects the loss.
4. Move weights a little in the direction that reduces the loss.
5. Repeat until the loss stops improving.

Update rule (conceptually):
- New weights = old weights − learning_rate × gradient

Variants:
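- **SGD**: Estimates the gradient from a small random batch of data (a mini-batch) at each step instead of the full dataset.
- **Momentum**: Accumulates a decaying sum of past gradients, which damps oscillations and speeds up progress along consistent directions.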
- **Adam**: Adapts learning rates per-parameter; a strong default in practice. 
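
To illustrate the momentum variant, here is a minimal sketch of classical (heavy-ball) momentum on a one-dimensional quadratic. The function name `minimize_with_momentum` and the hyperparameters `lr=0.1`, `beta=0.9`, and `steps=200` are assumptions for this example, not recommended defaults.

```python
def minimize_with_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with classical (heavy-ball) momentum.

    `velocity` is an exponentially decaying sum of past gradients; the
    weight moves along the velocity instead of the raw gradient.
    """
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad_fn(w)
        w = w - lr * velocity
    return w

# Test problem: f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_final = minimize_with_momentum(lambda w: 2 * (w - 3), w0=0.0)
print(f"w after momentum updates: {w_final:.4f}")
```

Setting `beta=0` recovers plain gradient descent, which makes it easy to compare the two on the same problem.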


In [7]:
# Gradient descent trace for a simple quadratic, f(w) = (w - 3)^2
import numpy as np

f = lambda w: (w - 3)**2
fprime = lambda w: 2*(w - 3)

w = 0.0
lr = 0.3
for step in range(10):
    grad = fprime(w)
    w = w - lr * grad
    print(f"step={step:02d} w={w:.4f} f(w)={f(w):.4f}")


step=00 w=1.8000 f(w)=1.4400
step=01 w=2.5200 f(w)=0.2304
step=02 w=2.8080 f(w)=0.0369
step=03 w=2.9232 f(w)=0.0059
step=04 w=2.9693 f(w)=0.0009
step=05 w=2.9877 f(w)=0.0002
step=06 w=2.9951 f(w)=0.0000
step=07 w=2.9980 f(w)=0.0000
step=08 w=2.9992 f(w)=0.0000
step=09 w=2.9997 f(w)=0.0000


## Backpropagation (How We Compute Gradients)

Backpropagation is a fast way to compute all gradients using the chain rule from calculus.

Intuition:
- Do a forward pass to compute predictions and loss.
- Work backwards layer by layer, figuring out how much each weight contributed to the error.
- Use these gradients to update weights (via gradient descent or a variant such as Adam).

Key points:
- Efficient: Reuses intermediate results from the forward pass.
- Exact (for the given loss and activations), up to numerical precision.
- Most DL frameworks (PyTorch, TensorFlow) do this automatically (autograd). 
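
One way to see the "exact up to numerical precision" point is a gradient check: compute the gradient with the chain rule and compare it against a finite-difference estimate. The sketch below does this in NumPy for a single sigmoid neuron with binary cross-entropy; the function names and the input values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, x, y):
    """Binary cross-entropy of a single sigmoid neuron p = sigmoid(w . x)."""
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_chain_rule(w, x, y):
    """dL/dw = dL/dz * dz/dw, where dL/dz = p - y and dz/dw = x."""
    p = sigmoid(w @ x)
    return (p - y) * x

def grad_finite_diff(w, x, y, eps=1e-6):
    """Central-difference approximation, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (loss_fn(w + step, x, y) - loss_fn(w - step, x, y)) / (2 * eps)
    return grad

w = np.array([0.3, -0.2])
x = np.array([0.5, -1.0])
y = 1.0
analytic = grad_chain_rule(w, x, y)
numeric = grad_finite_diff(w, x, y)
print("chain rule:  ", analytic)
print("finite diff: ", numeric)
print("match:", np.allclose(analytic, numeric, atol=1e-6))
```

The finite-difference version needs two loss evaluations per weight, which is why it is used only as a check; backpropagation gets every gradient from one forward and one backward pass.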


In [9]:
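# Backprop demonstration with autograd (PyTorch)
import torch

torch.manual_seed(0)

x = torch.tensor([[0.5, -1.0]], dtype=torch.float32)  # (1,2)
y = torch.tensor([[1.0]], dtype=torch.float32)        # (1,1)

# Scale first, then mark as requiring grad, so each parameter stays a leaf
# tensor. torch.randn(..., requires_grad=True) * 0.5 would yield a non-leaf
# result whose .grad is not populated by backward().
W1 = (torch.randn(2, 2) * 0.5).requires_grad_()
b1 = torch.zeros(2, requires_grad=True)
W2 = (torch.randn(2, 1) * 0.5).requires_grad_()
b2 = torch.zeros(1, requires_grad=True)

h1 = torch.relu(x @ W1 + b1)
p = torch.sigmoid(h1 @ W2 + b2)  # (1,1)

loss = torch.nn.functional.binary_cross_entropy(p, y)
loss.backward()  # compute gradients via backprop

# Gradients are now stored on the leaf parameters
print("loss:", loss.item())
print("dL/dW2:", W2.grad)
print("dL/db2:", b2.grad)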


## Quick Recap

- **ML vs DL**: DL uses deep neural networks to learn features directly from data.
- **Network structure**: Layers of neurons with weights, biases, and activations.
- **MLP**: A sequence of fully connected layers; a basic, versatile architecture.
- **Activations**: Add non-linearity; ReLU/GELU are common in hidden layers.
- **Loss**: Measures prediction error; choose based on the task.
- **Gradient Descent**: Procedure to minimize loss by updating weights.
- **Backpropagation**: Efficient method to compute gradients.

Next steps: Try building a tiny MLP in your favorite framework (PyTorch or TensorFlow) on a small dataset. 
