### Starting Point: Linear Regression

Think back to linear regression, one of the simplest predictive models.

- You have inputs (features) like $x_1, x_2, \dots, x_D$.  
- You have an output (target) $y$.  
- The goal is to learn a function $h(x)$ that predicts $y$ for new input values.  

In math form, linear regression says:  

$$
h(x) = A^T \phi(x) + b
$$

where:  
- $A$ = vector of weights (slopes)  
- $b$ = bias (intercept)  
- $\phi(x)$ = feature transformations of the inputs (for instance, squaring or taking logs to bend a straight line into a curve)
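
To make the formula concrete, here is a minimal sketch of a prediction in plain Python. The particular feature map `phi` and the values of `A` and `b` are made up for illustration; nothing here is learned yet.

```python
import math

def phi(x):
    """Hypothetical feature map: the raw value, its square, and log(1 + |x|)."""
    return [x, x * x, math.log1p(abs(x))]

def h(x, A, b):
    """Linear regression prediction h(x) = A^T phi(x) + b."""
    return sum(a_i * f_i for a_i, f_i in zip(A, phi(x))) + b

A = [2.0, 0.5, 0.0]  # illustrative weights, not learned
b = 1.0              # illustrative bias
prediction = h(3.0, A, b)
print(prediction)    # 2*3 + 0.5*9 + 0 + 1 = 11.5
```

The model is still linear in $A$ and $b$ even though `phi` bends the input, which is why it counts as linear regression.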

The computer’s job during training is to find $A$ and $b$ that make predictions closest to the real data by minimizing a loss function, usually the sum of squared errors between predictions and true values.  
Because the model is linear in $A$ and $b$, the squared-error loss is convex - imagine a smooth bowl-shaped surface with a single lowest point; gradient descent slides down to the bottom reliably.
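
As a rough illustration of "sliding down the bowl", the sketch below fits a one-feature line by gradient descent. The toy data, learning rate, and step count are assumptions chosen for the example, and it uses the mean rather than the sum of squared errors (the same minimum, just rescaled).

```python
# Toy data generated from y = 3x + 2 with no noise, so the true answer is known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 2.0 for x in xs]

def fit_line(xs, ys, lr=0.05, steps=2000):
    """Minimize the mean squared error of h(x) = a*x + b by gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current predictions.
        errs = [(a * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(err^2) with respect to a and b.
        grad_a = (2.0 / n) * sum(e * x for e, x in zip(errs, xs))
        grad_b = (2.0 / n) * sum(errs)
        # Step downhill.
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))  # close to 3.0 and 2.0
```

Because the loss is convex, the starting point `(0, 0)` does not matter here; any start reaches the same minimum as long as the learning rate is small enough.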



### Turning a Linear Model into a Neural Network

A neural network is just a generalization of linear regression with a few smart upgrades.

#### 1. Using Activation Functions (Nonlinearities)

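Each fixed “feature function” $\phi$ can be replaced by a learned one: a weighted sum of the inputs passed through an activation function, which introduces non‑linearity.  
That non‑linearity is what allows neural networks to model complex, curved, and layered relationships; without it, stacked layers would collapse into a single linear model.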

Common examples:
- Sigmoid: squashes input into values between 0 and 1 - like a smooth on/off switch.  
- Tanh: similar, but outputs lie between −1 and 1, so they are centered on zero.  
- ReLU (Rectified Linear Unit): passes positive values unchanged and zeros out negatives - simple, fast, and one of the most widely used today.

Activations are like gates that decide whether certain signals pass through based on their strength.
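
The three activations listed above can be sketched in a few lines of plain Python. These are the textbook definitions written naively for clarity; the `sigmoid` here can overflow for very large negative inputs, which production libraries guard against.

```python
import math

def sigmoid(z):
    """Squash z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squash z into (-1, 1), centered on zero."""
    return math.tanh(z)

def relu(z):
    """Pass positive values through unchanged, zero out negatives."""
    return max(0.0, z)

# Compare the three on a negative, zero, and positive input.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```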



#### 2. Adding Weights Everywhere

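In linear regression, there is a single weight vector, applied once to the features to produce the output.  
In neural networks, every layer has its own weight matrix ($A_1, A_2, \dots, A_L$) and bias vector ($b_1, b_2, \dots, b_L$), so each connection between units in adjacent layers carries its own weight.  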

This makes the model far more flexible - each layer can transform the data in a unique way.
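
To see how quickly the weights add up, the sketch below builds one weight matrix and one bias vector per layer and counts the parameters. The layer sizes and the uniform initialization range are hypothetical choices for the example, not a recommended scheme.

```python
import random

def init_layer(n_in, n_out, rng):
    """Return (A, b): an n_out x n_in weight matrix and a length-n_out bias vector."""
    A = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return A, b

rng = random.Random(0)
# Hypothetical sizes: 3 inputs, two hidden layers of 4 units, 1 output.
layer_sizes = [3, 4, 4, 1]
params = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

n_params = sum(len(A) * len(A[0]) + len(b) for A, b in params)
print(n_params)  # (3*4 + 4) + (4*4 + 4) + (4*1 + 1) = 41
```

Even this tiny network has 41 trainable numbers, compared with 4 for a linear regression on the same three inputs.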



#### 3. Stacking Layers (Depth)

Instead of one big transformation, neural networks apply many transformations in sequence - layer by layer.

- Each layer takes the outputs of the previous layer and applies its weights, bias, and activation function.  
- The outputs of one layer become the inputs to the next.

If we have $L$ layers:
- $c_1$ = output of layer 1 = activation of ($A_1 x + b_1$)  
- $c_2$ = output of layer 2 = activation of ($A_2 c_1 + b_2$)  
- … and so on, until  
- the final output layer produces $h(x)$, the model’s prediction.

Layers in the middle are called hidden layers because we don’t directly observe what they compute - they capture internal representations.
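
The layer-by-layer recipe above can be sketched as a short forward pass. The weights below are hand-picked so the arithmetic is easy to follow; the helper names `dense` and `forward` are made up for this example.

```python
def dense(A, b, v):
    """Compute A v + b for a weight matrix A (a list of rows) and a bias vector b."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(A, b)]

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def forward(x, layers):
    """Apply each (A, b, activation) in turn and return every layer's output."""
    outputs = []
    c = x
    for A, b, act in layers:
        c = [act(z) for z in dense(A, b, c)]
        outputs.append(c)
    return outputs

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),  # A_1, b_1: hidden layer
    ([[1.0, 2.0]], [0.5], identity),                # A_2, b_2: output layer
]
c1, c2 = forward([2.0, 1.0], layers)
print(c1, c2)  # [1.0, 1.5] [4.5]
```

Here `c1` is the hidden representation and `c2` is the prediction $h(x)$; adding depth just means adding more entries to `layers`.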



#### 4. Flexible Output Layers

The last layer determines what kind of problem you’re solving:
- Regression: no activation on the output (continuous values).  
- Binary classification: add a sigmoid to output probabilities between 0 and 1.  
- Multiclass classification: use a softmax function so outputs sum to 1 across classes.

So by swapping the output layer, the same architecture can handle very different tasks.
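
As a sketch of that swap, the function below takes the raw numbers from the last layer and applies the output rule for each task. The function name `output_head` and the task labels are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(zs)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def output_head(raw, task):
    """Apply the output rule for the given task to the last layer's raw values."""
    if task == "regression":
        return raw                  # no activation: any real value
    if task == "binary":
        return [sigmoid(raw[0])]    # one probability in (0, 1)
    if task == "multiclass":
        return softmax(raw)         # one probability per class
    raise ValueError(f"unknown task: {task}")

probs = output_head([1.0, 2.0, 3.0], "multiclass")
print([round(p, 3) for p in probs], output_head([0.0], "binary"))
```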



### Training the Network

Just like in linear regression, training means finding all the weights and biases ($A$’s and $b$’s) that minimize the loss.  
The big difference is that now the model is nonlinear and nested - one layer’s output feeds into another - so the loss surface is no longer a simple bowl.  

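Gradient descent still works, with backpropagation computing the gradients layer by layer via the chain rule, but training is trickier:
- The loss is non-convex: there can be many local minima and flat regions, not just one global bottom.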
- More parameters mean more data and computation are needed.

Still, this complexity is what makes neural networks capable of modeling images, language, and other rich data - far beyond what simple regression can do.
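
To tie the pieces together, here is a minimal sketch of training a one-hidden-layer network with hand-written backpropagation. The toy target $y = x^2$, the hidden size, learning rate, step count, and seed are all assumptions for the example; real projects rely on a library's automatic differentiation rather than deriving gradients by hand.

```python
import math
import random

def train_tiny_net(xs, ys, hidden=4, lr=0.05, steps=2000, seed=0):
    """Train a 1-input network (tanh hidden layer, linear output) on mean squared error."""
    rng = random.Random(seed)
    A1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]  # input -> hidden weights
    b1 = [0.0] * hidden
    A2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0
    n = len(xs)
    losses = []
    for _ in range(steps):
        gA1, gb1, gA2, gb2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden, 0.0
        loss = 0.0
        for x, y in zip(xs, ys):
            # Forward pass: c1 = tanh(A1 * x + b1), pred = A2 . c1 + b2.
            c1 = [math.tanh(A1[j] * x + b1[j]) for j in range(hidden)]
            pred = sum(A2[j] * c1[j] for j in range(hidden)) + b2
            err = pred - y
            loss += err * err / n
            # Backward pass: chain rule from the loss back to each parameter.
            d_pred = 2.0 * err / n
            gb2 += d_pred
            for j in range(hidden):
                gA2[j] += d_pred * c1[j]
                d_z = d_pred * A2[j] * (1.0 - c1[j] ** 2)  # tanh'(z) = 1 - tanh(z)^2
                gA1[j] += d_z * x
                gb1[j] += d_z
        losses.append(loss)
        # Gradient descent step on every parameter.
        for j in range(hidden):
            A1[j] -= lr * gA1[j]
            b1[j] -= lr * gb1[j]
            A2[j] -= lr * gA2[j]
        b2 -= lr * gb2
    return losses

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [x * x for x in xs]  # toy target: y = x^2
losses = train_tiny_net(xs, ys)
print(losses[0], losses[-1])  # the loss at the last step is lower than at the first
```

A straight line cannot fit this U-shaped target, but the hidden tanh units let the network bend. Unlike the linear case, a different `seed` can lead to a different final loss, which is the non-convexity described above.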



Analogy:  
If linear regression is like fitting one straight beam between two points, a neural network is like building an entire bridge of interlinked beams - each section correcting and shaping the last - allowing it to adapt to terrains (data) of any shape.


