# Advanced Learning Algorithms: Neural Networks Part 1

## 1. Introduction to Neural Networks (NNs)

### Motivation and History
- **Original Motivation**: To create software that mimics how the human brain learns and thinks. While modern NNs are very different from biological brains, some terminology is inspired by neuroscience.
- **Key Brain Component**: The **neuron**, which receives electrical impulses (inputs), performs a computation, and sends outputs to other neurons.
- **History**:
    - **1950s**: Work began but fell out of favor.
    - **1980s-1990s**: Gained popularity again, used in applications like handwritten digit recognition for postal codes.
    - **Late 1990s**: Fell out of favor again.
    - **~2005 onwards**: Resurgence under the new branding of **Deep Learning**. Since then, NNs have revolutionized fields like speech recognition, computer vision, and natural language processing (NLP).

### Why are Neural Networks So Effective Now?
The recent success of NNs is driven by two main factors:
1.  **Data Scale**: Society has become digitized, generating massive amounts of data (Big Data). Traditional algorithms like linear and logistic regression tend to plateau in performance even when given more data, whereas NNs, especially large ones, can keep improving as the amount of data increases.
2.  **Computational Power**: The development of faster CPUs and especially **GPUs (Graphics Processing Units)** has made it feasible to train large neural networks. GPUs were originally designed for rendering graphics, but they are also highly effective at the matrix computations used in deep learning.

![Performance vs Data](https://i.imgur.com/rLoCf2F.png)
*(A conceptual graph showing how different sized NNs scale with data compared to traditional ML algorithms)*

---

## 2. Neural Network Model Representation

### The Simplest Neuron
- A single artificial neuron can be modeled as a **logistic regression unit**.
- It takes one or more inputs, performs a computation, and produces a single output.
- **Input**: Feature vector $\vec{x}$.
- **Computation**: $z = \vec{w} \cdot \vec{x} + b$
- **Output (Activation)**: $a = g(z) = \frac{1}{1 + e^{-z}}$
- Here, **'a'** stands for **activation**, a term from neuroscience referring to how strongly a neuron is firing.
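
The single-neuron computation above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper names `sigmoid` and `neuron` and the input and parameter values are assumptions chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Single logistic unit: z = w . x + b, then a = g(z)."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Hypothetical inputs and parameters, chosen only for illustration
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.0

# z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, so a = g(0) = 0.5
a = neuron(x, w, b)
print(a)
```

An activation of 0.5 corresponds to $z = 0$, the point where the sigmoid is exactly halfway between its two extremes.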

### Building a Network with Layers
A neural network is formed by connecting multiple neurons together in layers.
- **Input Layer**: This isn't a computational layer; it simply holds the input features $\vec{x}$.
- **Hidden Layer**: A group of neurons that takes inputs from the previous layer and computes a set of activations. These activations are then passed to the next layer.
    - They are called "hidden" because their "correct" values are not present in the training data; we only see the inputs ($x$) and the final outputs ($y$).
- **Output Layer**: The final layer of neurons that produces the network's prediction, $\hat{y}$.

#### Demand Prediction Example
Let's predict if a T-shirt will be a top seller based on four features: *price, shipping cost, marketing, and material*.
1.  **Input Layer**: $\vec{x}$ = [price, shipping, marketing, material]
2.  **Hidden Layer**: We can design a hidden layer to learn intermediate concepts or features. For example:
    - Neuron 1: Learns **affordability** (from price, shipping).
    - Neuron 2: Learns **awareness** (from marketing).
    - Neuron 3: Learns **perceived quality** (from price, material).
3.  **Output Layer**: Takes the outputs (activations) from the hidden layer (affordability, awareness, quality) and makes the final prediction.

A key strength of NNs is that we **don't need to manually engineer these hidden features**. The network learns the most useful features by itself during training. In a standard implementation, **every neuron in a layer connects to all outputs from the previous layer**.

![NN with one hidden layer](https://i.imgur.com/3D1W7bO.png)

### Neural Networks as Feature Learners
- A neural network can be viewed as a more powerful version of logistic regression that **learns its own features**.
- The hidden layers transform the original input features $\vec{x}$ into a new, more useful set of features (the activations $\vec{a}$). The output layer then performs logistic regression on these learned features.
- This automates the process of **feature engineering**.

### Deeper Networks
- A neural network can have multiple hidden layers.
- The output of the first hidden layer becomes the input for the second hidden layer, and so on.
- A network built from stacked, fully connected layers with one or more hidden layers is often called a **Multilayer Perceptron (MLP)**.
- The **architecture** of a network (number of hidden layers and number of neurons per layer) is a key design choice that impacts performance.
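
One way to get a feel for an architecture choice is to count how many parameters it implies. The sketch below assumes fully connected layers, where each neuron has one weight per input plus one bias; the helper `count_dense_parameters` is a hypothetical name introduced here, not part of any library.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, starting with the
    input layer, e.g. [64, 25, 15, 1].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # one weight per input, per neuron
        total += n_out         # one bias per neuron
    return total

# 2 inputs -> 3 hidden units -> 1 output: (2*3 + 3) + (3*1 + 1) = 13
small = count_dense_parameters([2, 3, 1])

# 64 inputs -> 25 -> 15 -> 1: 1625 + 390 + 16 = 2031
larger = count_dense_parameters([64, 25, 15, 1])

print(small, larger)
```

Adding layers or widening them increases the parameter count quickly, which is part of why architecture affects both performance and training cost.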

---

## 3. Application: Computer Vision

Neural networks excel at learning hierarchical features, which is very powerful for tasks like face recognition.
- **Input**: A flattened vector of pixel brightness values from an image. For a 1000x1000 pixel grayscale image, this would be a vector with 1,000,000 features.
- **Layer 1 (First Hidden Layer)**: Neurons in this layer learn to detect simple patterns like small edges and lines at different orientations.
- **Layer 2 (Second Hidden Layer)**: Neurons combine the edges from the previous layer to detect more complex shapes, like eyes, noses, and ears.
- **Layer 3 (Third Hidden Layer)**: Neurons combine the facial parts to recognize larger face shapes.
- **Output Layer**: Uses the rich set of features from the final hidden layer to identify the person.

The remarkable thing is that the network learns this entire hierarchy of features **automatically from the data**. If trained on cars instead of faces, it would learn to detect wheels, windows, etc.

![NN for Vision](https://i.imgur.com/K1j11oF.png)
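
The flattening step for the input can be sketched with NumPy as follows. The array of zeros is only a placeholder for real pixel data, and a single-channel (grayscale) image is assumed.

```python
import numpy as np

# Stand-in for a 1000x1000 grayscale image; the zero values are placeholders
image = np.zeros((1000, 1000))

# Unroll the 2D grid of pixels into one long feature vector
x = image.reshape(-1)
print(x.shape)        # (1000000,)

# The same data as a batch containing one example (one row per example)
x_batch = image.reshape(1, -1)
print(x_batch.shape)  # (1, 1000000)
```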

---

## 4. A Closer Look at a Layer

### Notation
To talk about specific layers and neurons, we use the following notation:
- $a_j^{[l]}$: Activation of the $j^{th}$ neuron in layer $l$.
- $\vec{w}_j^{[l]}$: Weight vector for the $j^{th}$ neuron in layer $l$.
- $b_j^{[l]}$: Bias parameter for the $j^{th}$ neuron in layer $l$.
- $\vec{a}^{[l-1]}$: The vector of activations from the previous layer ($l-1$), which serves as the input to layer $l$.
- By convention, the input layer is **Layer 0**, so $\vec{a}^{[0]} = \vec{x}$.

### Computation for a Single Neuron
The computation for a single neuron $j$ in layer $l$ is:
$$ z_j^{[l]} = \vec{w}_j^{[l]} \cdot \vec{a}^{[l-1]} + b_j^{[l]} $$
$$ a_j^{[l]} = g(z_j^{[l]}) $$
Where $g$ is the **activation function**. So far, we have used the sigmoid function:
$$ g(z) = \frac{1}{1 + e^{-z}} $$

### Forward Propagation
**Forward propagation** (or inference) is the process of computing the network's output by passing the input data from left to right through the layers.

For a network with 2 hidden layers and 1 output layer:
1. **Input**: Start with $\vec{a}^{[0]} = \vec{x}$.
2. **Layer 1**: For each neuron $j$ in layer 1, compute $a_j^{[1]} = g(\vec{w}_j^{[1]} \cdot \vec{x} + b_j^{[1]})$. Collect these into a vector $\vec{a}^{[1]}$.
3. **Layer 2**: For each neuron $j$ in layer 2, compute $a_j^{[2]} = g(\vec{w}_j^{[2]} \cdot \vec{a}^{[1]} + b_j^{[2]})$. Collect these into a vector $\vec{a}^{[2]}$.
4. **Layer 3 (Output)**: Compute $a_1^{[3]} = g(\vec{w}_1^{[3]} \cdot \vec{a}^{[2]} + b_1^{[3]})$.
5. **Final Prediction**: The output is $\hat{y} = a_1^{[3]}$. For binary classification, you can threshold this value at 0.5 to get a prediction of 0 or 1.
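
The steps above can be sketched as a short loop over layers. This is an illustrative sketch under stated assumptions: the layer sizes are made up, the parameters are random stand-ins for values that training would normally learn, and `forward_propagation` is a hypothetical helper name. Each weight matrix stores the weight vector of neuron $j$ in column $j$, so one matrix product computes every dot product in a layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, params):
    """Compute a[l] = g(a[l-1] @ W[l] + b[l]) for each layer in turn.

    params is a list of (W, b) pairs, one per layer.
    W has shape (units_in, units_out); b has shape (units_out,).
    """
    a = x  # a[0] = x
    for W, b in params:
        a = sigmoid(a @ W + b)
    return a

# Hypothetical sizes: 4 inputs -> 3 units -> 2 units -> 1 output unit
rng = np.random.default_rng(0)
sizes = [4, 3, 2, 1]
params = [
    (rng.normal(size=(n_in, n_out)), rng.normal(size=n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = np.array([0.5, 1.0, -1.5, 2.0])
a3 = forward_propagation(x, params)  # shape (1,), a value between 0 and 1
y_hat = 1 if a3[0] >= 0.5 else 0      # threshold at 0.5 for a 0/1 prediction
print(a3, y_hat)
```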

---

## 5. Implementing Neural Networks

### Data Representation in NumPy and TensorFlow
- **NumPy**: In previous courses, we often used **1D arrays** (e.g., `np.array([200, 17])`) for feature vectors.
- **TensorFlow**: Prefers data to be represented as **2D matrices (or tensors)**, even for a single example. This is for computational efficiency.
    - A single training example with 2 features would be a `(1, 2)` matrix: `np.array([[200, 17]])`. Notice the double square brackets.
    - An `(M, N)` matrix has `M` rows and `N` columns.
- **Tensor**: A `tf.Tensor` is TensorFlow's primary data structure, similar to a NumPy array but optimized for running on GPUs/TPUs. You can convert between them using `.numpy()` (Tensor to NumPy) and `tf.convert_to_tensor()` (NumPy to Tensor).
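
The difference between the 1D and 2D representations is easiest to see by printing shapes. The sketch below uses only NumPy; the variable names are chosen for illustration.

```python
import numpy as np

x_1d = np.array([200, 17])     # 1D array: shape (2,)
x_2d = np.array([[200, 17]])   # 2D matrix with 1 row and 2 columns: shape (1, 2)
print(x_1d.shape, x_2d.shape)

# Convert the 1D vector into a single-row matrix, and back again
x_row = x_1d.reshape(1, -1)    # shape (1, 2)
x_flat = x_row.reshape(-1)     # shape (2,)
print(x_row.shape, x_flat.shape)
```

The reshape with `-1` lets NumPy infer the remaining dimension, which is convenient when the number of features is not hard-coded.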

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# === Example: Coffee Roasting (2 features, 1 hidden layer, 1 output) ===

# --- Method 1: Layer-by-layer Forward Prop (for demonstration) ---
print("--- Method 1: Manual Forward Prop ---")
# X is a single data point, shaped as a (1, 2) matrix
X_coffee = np.array([[200.0, 17.0]])  # Temperature, Duration

# Define Layer 1 (hidden layer, 3 units)
layer_1 = Dense(units=3, activation='sigmoid')
# Define Layer 2 (output layer, 1 unit)
layer_2 = Dense(units=1, activation='sigmoid')

# Pass data through the layers
a1 = layer_1(X_coffee)
a2 = layer_2(a1)
print(f"Output a2: {a2.numpy()}")

# Optional: threshold the single (1, 1) output at 0.5 for a binary prediction
y_hat = 1 if a2.numpy()[0, 0] >= 0.5 else 0
print(f"Prediction: {y_hat}")


# --- Method 2: Using the Sequential API (the standard way) ---
print("\n--- Method 2: Using the Sequential API ---")
# Define the model architecture
model_coffee = Sequential([
    Dense(units=3, activation='sigmoid', name='L1'),
    Dense(units=1, activation='sigmoid', name='L2')
])

# To use the model, you'd typically train it with .fit()
# For now, let's just do prediction (inference)
# Note: The weights will be random until the model is trained
prediction = model_coffee.predict(X_coffee)
print(f"Prediction from Sequential model: {prediction}")


# === Example: Digit Recognition (64 features, 2 hidden layers, 1 output) ===
print("\n--- Digit Recognition Example ---")
# X_digit is a single data point, shaped as a (1, 64) matrix
X_digit = np.random.rand(1, 64)  # Random stand-in for a flattened 8x8 image

model_digit = Sequential([
    Dense(units=25, activation='sigmoid', name='L1'),
    Dense(units=15, activation='sigmoid', name='L2'),
    Dense(units=1, activation='sigmoid', name='L3_output')
])

# Make a prediction
digit_prediction = model_digit.predict(X_digit)
print(f"Prediction for digit: {digit_prediction}")
```



```text
--- Method 1: Manual Forward Prop ---
Output a2: [[0.82609373]]
Prediction: 1

--- Method 2: Using the Sequential API ---
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step
Prediction from Sequential model: [[0.6917728]]

--- Digit Recognition Example ---
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 63ms/step
Prediction for digit: [[0.27761093]]
```

These values come from untrained, randomly initialized weights, so they will differ from run to run.


### Building a Model with TensorFlow
TensorFlow is a popular library for deep learning. We use the Keras API within TensorFlow, which is very user-friendly.

#### A. Building Layer by Layer (for understanding)
You can define each layer and manually pass the data through it.
- A standard fully-connected layer is called a **Dense** layer in Keras.
- `tf.keras.layers.Dense(units=3, activation='sigmoid')` creates a layer with 3 neurons using the sigmoid activation function.
- You then call the layer like a function: `a1 = layer_1(x)`.

#### B. Building with `tf.keras.Sequential` (the common way)
The `Sequential` model is the most common way to build simple, stacked networks. It allows you to define the entire network architecture in one block.
- You provide a list of layers to the `Sequential` constructor.
- **Training**: `model.compile(...)` and `model.fit(X, y, ...)`.
- **Inference**: `model.predict(X_new)`.
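
A minimal end-to-end sketch of the define, compile, fit, and predict workflow is shown below. The four training examples are made-up toy data, and the loss, optimizer, learning rate, and epoch count are assumptions picked for illustration; the notes above do not specify them.

```python
import numpy as np
import tensorflow as tf

# Made-up toy data: 4 examples with 2 features each, plus binary labels
X = np.array([[200.0, 17.0], [120.0, 5.0], [425.0, 20.0], [212.0, 18.0]])
y = np.array([[1.0], [0.0], [0.0], [1.0]])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(units=3, activation='sigmoid'),
    tf.keras.layers.Dense(units=1, activation='sigmoid'),
])

# Assumed settings: binary cross-entropy suits a 0/1 target
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
model.fit(X, y, epochs=5, verbose=0)

# Inference on a new example, shaped (1, 2)
X_new = np.array([[210.0, 16.0]])
probs = model.predict(X_new, verbose=0)  # shape (1, 1), values in [0, 1]
y_hat = (probs >= 0.5).astype(int)       # threshold at 0.5
print(probs, y_hat)
```

With so little data and so few epochs the prediction is not meaningful; the point is only the order of the calls.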

```python
import numpy as np

# Sigmoid activation function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Implementation of a single dense layer
def dense_layer(a_in, W, b):
    """
    Computes the output of a dense layer.
    Args:
        a_in (ndarray (n,)): Input activations from the previous layer (n features).
        W (ndarray (n, m)): Weight matrix, where n is features in, m is units out.
        b (ndarray (m,)): Bias vector.
    Returns:
        a_out (ndarray (m,)): Output activations of the current layer.
    """
    num_units = W.shape[1]
    a_out = np.zeros(num_units)
    for j in range(num_units):
        w_j = W[:, j]  # Column j holds the weight vector for the j-th neuron
        z_j = np.dot(w_j, a_in) + b[j]
        a_out[j] = sigmoid(z_j)
    return a_out

# Implementation of the full forward propagation
def sequential_model(x, W1, b1, W2, b2):
    """
    Performs forward propagation for a 2-layer neural network.
    Args:
        x (ndarray (n,)): Input features.
        W1, b1: Parameters for the first layer.
        W2, b2: Parameters for the second layer.
    Returns:
        a2 (ndarray): Final output/prediction.
    """
    # Compute activations for the first hidden layer
    a1 = dense_layer(x, W1, b1)

    # Compute activations for the second layer (output layer)
    a2 = dense_layer(a1, W2, b2)

    return a2

# --- Example Usage (Coffee Roasting) ---
# Note: In a real scenario, these weights W and biases b would be learned during training.
# Here we use placeholder values.

# A single input example (1D array for this implementation)
x_example = np.array([200.0, 17.0])

# Parameters for Layer 1 (2 inputs, 3 units)
W1_example = np.array([
    [1, -3, 5],
    [2, 4, -6]
])
b1_example = np.array([-1, 1, 2])

# Parameters for Layer 2 (3 inputs, 1 unit)
W2_example = np.array([
    [1],
    [2],
    [3]
])
b2_example = np.array([1])


# Make a prediction
prediction = sequential_model(x_example, W1_example, b1_example, W2_example, b2_example)
print("\n--- From-Scratch Implementation ---")
print(f"Input: {x_example}")
print(f"Prediction: {prediction}")
```


```text
--- From-Scratch Implementation ---
Input: [200.  17.]
Prediction: [0.99330715]
```


### Implementing Forward Prop from Scratch (Python & NumPy)
Understanding the low-level implementation is crucial for debugging and building intuition, even if you primarily use libraries like TensorFlow.
- The goal is to create a `dense_layer` function that computes the activations for one layer.
- This function loops through each neuron in the layer, computes its $z$ and $a$ values, and collects them into an output vector.
- A master function (here `sequential_model`) can then call `dense_layer` once per layer to perform the full forward pass.
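
As an optional variation on the per-neuron loop, the same layer can be written with a single matrix product. This is a sketch under the assumption that inputs are stored as a 2D matrix with one row per example; `dense_vectorized` is a hypothetical name, and the parameters are the same placeholder values used in the coffee roasting example, so the result should match the prediction of about 0.9933 reported there.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dense_vectorized(A_in, W, b):
    """Dense layer for a batch of examples.

    A_in: (num_examples, n) inputs, W: (n, m) weights, b: (m,) biases.
    Returns a (num_examples, m) matrix of activations.
    """
    return sigmoid(np.matmul(A_in, W) + b)

# Same placeholder parameters as the coffee roasting example
X = np.array([[200.0, 17.0]])           # one example, shape (1, 2)
W1 = np.array([[1, -3, 5], [2, 4, -6]])
b1 = np.array([-1, 1, 2])
W2 = np.array([[1], [2], [3]])
b2 = np.array([1])

A1 = dense_vectorized(X, W1, b1)        # shape (1, 3)
A2 = dense_vectorized(A1, W2, b2)       # shape (1, 1)
print(A2)
```

The loop version makes each neuron's computation explicit, while the matrix version does the same arithmetic for every neuron and every example in one call.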