## **Lecture 8. Introduction to Feedforward Neural Networks**
### **Summary of Feed-Forward Neural Networks**

## **1. Introduction to Feed-Forward Neural Networks**
- **Goal**: Understand feed-forward neural networks (FFNNs) as models and learn them from data.
- Neural networks are extensions of nonlinear predictors, but they **optimize the feature representation** for a specific task.
- **Key Challenge**: Jointly learning feature representation and model parameters (chicken-and-egg problem).

---

## **2. Biological Inspiration and Artificial Abstraction**
- **Real Neural Networks**:
   - Composed of neurons that aggregate input signals through dendrites and propagate them through axons.
- **Artificial Neurons**:
   - Simplified as linear classifiers with input coordinates weighted by parameters.
   - Aggregate input: $ z = \sum_{i=1}^d x_i w_i + w_0 $.
   - Output: Nonlinear transformation $ f(z) $ using **activation functions**.

### **Common Activation Functions**:
1. **Linear**: $ f(z) = z $ – used for output layers.
2. **ReLU** (Rectified Linear Unit): $ f(z) = \max(0, z) $.
3. **Tanh**: Smooth nonlinearity squashing values between -1 and 1.
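
As a minimal sketch of a single artificial neuron, the snippet below computes the aggregate input $ z $ and applies each of the three activation functions listed above. The function names and the example numbers are illustrative assumptions, not part of the lecture.

```python
import math

def aggregate(x, w, w0):
    """Aggregate input z = sum_i x_i * w_i + w_0."""
    return sum(xi * wi for xi, wi in zip(x, w)) + w0

def linear(z):
    return z

def relu(z):
    return max(0.0, z)

def tanh(z):
    return math.tanh(z)

def neuron(x, w, w0, f):
    """Output of one unit: the activation f applied to the aggregate input."""
    return f(aggregate(x, w, w0))

# Illustrative values (assumed for this example only).
x = [1.0, -2.0]
w = [0.5, 1.0]
w0 = 0.25
z = aggregate(x, w, w0)  # 0.5 - 2.0 + 0.25 = -1.25
outputs = {f.__name__: neuron(x, w, w0, f) for f in (linear, relu, tanh)}
```

With a negative aggregate input, the linear unit passes the value through, ReLU outputs zero, and tanh returns a value between -1 and 0.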

---

## **3. Network Architecture**
- **Layer Structure**:
   - **Input Layer**: Holds input coordinates.
   - **Hidden Layers**: Intermediate transformations of input data.
   - **Output Layer**: Final prediction.
- **Width**: Number of units in a layer.  
- **Depth**: Number of layers in the network.

- **Computation**:
   - Each unit takes weighted inputs from the previous layer and passes the aggregate input through an activation function.  
   - Output layers compute a final weighted combination of hidden unit activations.
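
The layer-by-layer computation can be sketched as follows, assuming one tanh hidden layer of width 3 and a single linear output unit. The parameter names (`W1`, `b1`, `W2`, `b2`) and the weight values are hypothetical choices made for illustration.

```python
import math

def layer(inputs, weights, offsets, activation):
    """Each unit aggregates weighted inputs from the previous layer, then applies the activation."""
    return [
        activation(sum(w * a for w, a in zip(unit_weights, inputs)) + offset)
        for unit_weights, offset in zip(weights, offsets)
    ]

def forward(x, params):
    """Forward pass: input -> tanh hidden layer -> linear output layer."""
    hidden = layer(x, params["W1"], params["b1"], math.tanh)
    output = layer(hidden, params["W2"], params["b2"], lambda z: z)
    return hidden, output

# Hypothetical parameters: 2 inputs, hidden width 3, 1 output unit.
params = {
    "W1": [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]],
    "b1": [0.0, 0.1, -0.1],
    "W2": [[1.0, -1.0, 0.5]],
    "b2": [0.2],
}
hidden, output = forward([0.3, -0.7], params)
```

Adding depth under this sketch would mean calling `layer` again on `hidden` before the output layer; adding width would mean adding rows to `W1`.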

---

## **4. Role of Hidden Layers**
- Hidden layers transform input data into a representation that simplifies the final classification task.
- **Hidden Units**:
   - Function like linear classifiers, creating **decision boundaries** in input space.
   - Nonlinear activation functions allow for richer transformations of data.

### **Example**:
- A linearly inseparable problem in $ x_1, x_2 $ can become linearly separable in the hidden layer's transformed space.

### **Visualization**:
- **Linear Units**: Maintain linear mappings, limiting transformation power.
- **Nonlinear Units** (e.g., Tanh, ReLU): Enable more expressive mappings, improving separability in hidden layer activations.
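
A small worked example may help here. The sketch below uses an XOR-style dataset and hand-picked (not learned) hidden weights, both of which are assumptions for illustration. With ReLU hidden units the two classes become separable by a single linear rule on the hidden activations; with linear hidden units the same weights collapse every point onto the same score.

```python
def relu(z):
    return max(0.0, z)

def hidden(x, activation):
    """Two hidden units with hand-picked weights: x1 - x2 and x2 - x1."""
    x1, x2 = x
    return (activation(x1 - x2), activation(x2 - x1))

def classify(h):
    """Linear classifier on the hidden representation: sign(h1 + h2 - 0.5)."""
    return 1 if h[0] + h[1] - 0.5 > 0 else -1

# XOR-style data: not linearly separable in (x1, x2).
data = [((0, 0), -1), ((1, 1), -1), ((0, 1), 1), ((1, 0), 1)]
labels = [y for _, y in data]

relu_predictions = [classify(hidden(x, relu)) for x, _ in data]
linear_predictions = [classify(hidden(x, lambda z: z)) for x, _ in data]
```

Here `relu_predictions` matches `labels`, because $ h_1 + h_2 = |x_1 - x_2| $ under ReLU, while with linear units $ h_1 + h_2 = 0 $ for every input, so all points receive the same label.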

---

## **5. Power of Depth and Redundancy**
- Deep architectures combine layers to perform increasingly abstract computations.
- Redundancy in hidden units (e.g., expanding dimensions) helps learn better representations and simplifies optimization.

### **Why Neural Networks Work**:
1. **Data Availability**: Large datasets allow learning complex models.
2. **Computation**: Modern hardware (GPUs, TPUs) efficiently handles parallel computations.
3. **Optimization**: Stochastic gradient descent (SGD) is effective for learning large models.
4. **Modularity**: Neural networks serve as flexible computational components for diverse tasks.

---

## **6. Summary of Key Points**
- Feed-forward neural networks are composed of **input layers, hidden layers**, and **output layers**.  
- **Hidden layers** play a critical role by transforming input data into a representation that is easier for the output layer to classify.  
- **Activation functions** like ReLU and Tanh enable nonlinearity, allowing networks to model complex data.  
- The depth of a network (number of layers) and redundancy in hidden units help achieve better performance.  
- Despite their complexity, neural networks can be trained efficiently with simple optimization techniques like **stochastic gradient descent (SGD)**.


## **Lecture 9. Feedforward Neural Networks, Back Propagation, and Stochastic Gradient Descent (SGD)**

# **Summary of Feed-Forward Neural Networks and Learning**

## **1. Learning Feed-Forward Neural Networks**
- **Goal**: Train a feed-forward neural network to learn a **mapping** from input $ x $ to output $ y $.  
- Neural networks optimize **feature representation** and **parameters** simultaneously to minimize prediction error.
- **Key Challenge**: Compute gradients of the loss with respect to the parameters efficiently using **backpropagation**.

---

## **2. Stochastic Gradient Descent (SGD)**
- **SGD Process**:
   1. Compute the **loss**: Measures the difference between predicted output and target $ y $.  
   2. Compute the **gradient**: Derivative of the loss with respect to each parameter.  
   3. Update parameters:  
      $
      \theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla L(\theta_{\text{old}})
      $
      - $ \eta $: Learning rate.
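
The three steps above can be sketched on a deliberately tiny model. The example assumes a linear predictor $ w x + b $ with squared loss and noiseless synthetic data; the function name `sgd_fit`, the data, and the hyperparameter values are illustrative assumptions rather than anything prescribed by the lecture.

```python
import random

def sgd_fit(data, eta=0.1, steps=500, seed=0):
    """Fit y ~ w * x + b by SGD on the squared loss 0.5 * (y - prediction)^2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)            # one randomly chosen training example
        error = (w * x + b) - y            # derivative of the loss w.r.t. the prediction
        grad_w, grad_b = error * x, error  # gradient w.r.t. each parameter
        w -= eta * grad_w                  # theta_new = theta_old - eta * gradient
        b -= eta * grad_b
    return w, b

# Synthetic data generated from y = 2x + 1 (assumed for illustration).
data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = sgd_fit(data)
```

Because each update uses a single example, the trajectory is noisy, but on this consistent dataset the parameters settle near the generating values.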

### **Backpropagation**:
- **Purpose**: Efficiently compute gradients for all layers of the network.  
- **Steps**:
   1. Compute the **forward pass**: Evaluate activations layer by layer.
   2. Compute the **loss** at the output.
   3. Propagate the gradients **backward** using chain rule:
      - Gradients are calculated layer by layer and propagated to earlier layers.

---

## **3. Example of Backpropagation**
- **Simple Network**: One unit per layer, input $ x $, output $ f_L $, and loss $ L $.
   - **Forward Pass**: $ z_1 = x \cdot w_1 $, $ f_1 = \tanh(z_1) $, etc.
   - **Gradient Computation**:
      1. Compute the loss gradient with respect to final layer output:
         $
         \frac{\partial L}{\partial f_L} = - (y - f_L) \quad \text{(squared loss example)}
         $
      2. Propagate gradients backward through the layers using Jacobians:
         $
         \frac{\partial L}{\partial f_1} = \frac{\partial L}{\partial f_2} \cdot \frac{\partial f_2}{\partial f_1}
         $
- Issues in deep networks:
   - **Vanishing Gradients**: Small derivatives cause gradients to shrink.  
   - **Exploding Gradients**: Large derivatives lead to unstable updates.
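
A minimal sketch of this example follows. It assumes every layer uses tanh (the notes only show the first layer explicitly) and uses the squared loss $ \frac{1}{2}(y - f_L)^2 $, so that $ \partial L / \partial f_L = -(y - f_L) $. The weights and data point are hypothetical.

```python
import math

def forward(x, weights):
    """Return activations [f_0 = x, f_1, ..., f_L] with f_l = tanh(f_{l-1} * w_l)."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(activations[-1] * w))
    return activations

def loss(x, y, weights):
    return 0.5 * (y - forward(x, weights)[-1]) ** 2

def backward(x, y, weights):
    """Backpropagation: gradient of the loss with respect to each weight."""
    activations = forward(x, weights)
    grad_f = -(y - activations[-1])            # dL/df_L for the squared loss
    grads = [0.0] * len(weights)
    for l in reversed(range(len(weights))):
        f_out = activations[l + 1]
        grad_z = grad_f * (1.0 - f_out ** 2)   # tanh'(z) = 1 - tanh(z)^2
        grads[l] = grad_z * activations[l]     # dL/dw_l = dL/dz_l * f_{l-1}
        grad_f = grad_z * weights[l]           # dL/df_{l-1} = dL/dz_l * w_l
    return grads

# Hypothetical three-layer chain and training pair.
weights = [0.8, -1.2, 0.5]
x, y = 0.7, 0.3
grads = backward(x, y, weights)
```

Each backward step multiplies the running gradient by $ w_l (1 - f_l^2) $, which is why long chains of small factors shrink the gradient and long chains of large factors inflate it.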

---

## **4. Overcapacity and Learning Representations**
- **Overcapacity**: Providing more hidden units than necessary can facilitate learning and optimization.  
- Example:
   - Networks with extra hidden units can find better representations and achieve lower loss.  
   - Some units may not be "useful," but redundancy helps optimization.  

### **Visualization**:
- Hidden units evolve during training to provide decision boundaries that simplify the classification task.
- Training with randomly initialized weights and more hidden units can yield smoother decision boundaries.

---

## **5. Challenges in Training Deep Networks**
- **Gradient Vanishing/Explosion**:
   - Gradients become very small or very large as they propagate through layers.  
   - Solutions:
     - Use **ReLU** activation functions.
     - Introduce normalization techniques like **Batch Normalization**.

- **Initialization**:
   - Proper initialization of weights (including offsets) avoids artifacts like phantom boundaries.

- **Local Optima**:
   - Stochastic gradient descent typically finds **locally optimal solutions**, which are often sufficient for good performance.

---

## **6. Summary of Key Points**
- Feed-forward neural networks are trained using **stochastic gradient descent** and **backpropagation**.
- **Backpropagation** propagates the loss gradient efficiently through layers to update parameters.
- **Overcapacity** in networks (adding more hidden units) improves optimization and facilitates learning.
- Challenges like gradient vanishing/explosion and proper initialization must be addressed for successful training.


## **Lecture 10. Recurrent Neural Networks 1** 

# **Summary of Recurrent Neural Networks (RNNs) and Sequence Modeling**

## **1. Introduction to Sequence Modeling**
- **Goal**: Predict properties of sequences, such as the next word in a sentence, sentiment, or translation.
- Unlike feed-forward networks, recurrent neural networks (RNNs) model sequences more flexibly by retaining history information in a state vector.

### **Challenges with Fixed-Length Representations**:
- Fixed-length history may miss important information from earlier parts of the sequence.
- A flexible mechanism is needed to encode variable-length sequences into meaningful vectors.

---

## **2. Recurrent Neural Networks (RNNs)**
- **RNN Concept**:
  - RNNs process sequences by applying the same transformation repeatedly at each step.
  - They update a **state vector** $ s_t $ based on the previous state $ s_{t-1} $ and new input $ x_t $:  
    $
    s_t = \tanh(W_{ss} s_{t-1} + W_{sx} x_t)
    $ 
  - Parameters $ W_{ss} $ and $ W_{sx} $ are learned to optimize the task.

- **Key Properties**:
  1. **Retain State**: State $ s_t $ summarizes the sequence seen so far.
  2. **Parameter Sharing**: The same parameters are applied across all time steps, so the number of parameters does not grow with sequence length.
  3. **Variable-Length Sequences**: RNNs adapt to sequences of any length.

---

## **3. Applications of RNNs**
RNNs encode sequences into vectors for various prediction tasks:
1. **Next Word Prediction**: Predict the next word in a sentence.
2. **Sequence-Level Tasks**: Predict properties like sentiment or classify entire sequences.
3. **Machine Translation**: Encode a sentence into a vector and decode it into another language.

---

## **4. Encoding Sequences into Vectors**
- RNNs turn sequences into feature vectors **piecemeal**:
   1. Start with an initial state $ s_0 = 0 $.
   2. At each step $ t $, combine the previous state $ s_{t-1} $ with the current word vector $ x_t $.
   3. Apply a **nonlinear transformation** (e.g., $ \tanh $) to produce the new state $ s_t $.

- **Example**:
   - $ s_1 $ summarizes the first word.
   - $ s_2 $ combines $ s_1 $ and the second word, continuing sequentially until the entire sequence is represented.
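
The piecemeal encoding can be sketched as below, using the update $ s_t = \tanh(W_{ss} s_{t-1} + W_{sx} x_t) $ with $ s_0 = 0 $. The dimensions, the random weights, and the toy input vectors are assumptions for illustration; in practice the weights would be learned.

```python
import math
import random

def rnn_step(s_prev, x, W_ss, W_sx):
    """s_t = tanh(W_ss s_{t-1} + W_sx x_t), computed one state coordinate at a time."""
    return [
        math.tanh(
            sum(a * s for a, s in zip(row_s, s_prev))
            + sum(b * xi for b, xi in zip(row_x, x))
        )
        for row_s, row_x in zip(W_ss, W_sx)
    ]

def encode(sequence, W_ss, W_sx):
    """Fold a sequence of input vectors into a single state vector."""
    state = [0.0] * len(W_ss)  # s_0 = 0
    for x in sequence:
        state = rnn_step(state, x, W_ss, W_sx)
    return state

# Hypothetical sizes and random (untrained) weights.
rng = random.Random(0)
state_dim, input_dim = 3, 2
W_ss = [[rng.uniform(-0.5, 0.5) for _ in range(state_dim)] for _ in range(state_dim)]
W_sx = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(state_dim)]

short_sequence = [[1.0, 0.0], [0.0, 1.0]]
long_sequence = short_sequence + [[1.0, 1.0], [0.5, -0.5], [0.0, 1.0]]
short_code = encode(short_sequence, W_ss, W_sx)
long_code = encode(long_sequence, W_ss, W_sx)
```

Both sequences map to a vector of the same dimension, regardless of their length, because the same step function is reused at every position.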

---

## **5. Gated Architectures for RNNs**
- **Challenge**: Vanilla RNNs suffer from **vanishing** or **exploding gradients** when processing long sequences.
- **Solution**: Use **gating mechanisms** to control state updates:
   - **Gated Recurrent Unit (GRU)**: Adds a gating network to retain or overwrite information.
   - **LSTM (Long Short-Term Memory)**:
     - Adds gates to **forget**, **input**, and **output** information:
       - **Forget Gate**: Controls which parts of the previous state to discard.
       - **Input Gate**: Controls how much new information to add.
       - **Output Gate**: Controls what part of the memory to reveal as visible state.
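
The notes do not give gate equations, so the sketch below assumes one commonly used LSTM formulation: sigmoid gates computed from the concatenation of the previous visible state and the current input, a tanh candidate, a memory cell `c`, and a visible state `h`. All parameter names (`W_f`, `b_f`, and so on) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    return [sum(w * vi for w, vi in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(c_prev, h_prev, x, p):
    """One LSTM step under the assumed formulation; returns (memory cell, visible state)."""
    v = h_prev + x  # concatenation [h_{t-1}, x_t]
    forget = [sigmoid(z) for z in affine(p["W_f"], v, p["b_f"])]      # what to discard
    inp = [sigmoid(z) for z in affine(p["W_i"], v, p["b_i"])]         # how much to add
    out = [sigmoid(z) for z in affine(p["W_o"], v, p["b_o"])]         # what to reveal
    candidate = [math.tanh(z) for z in affine(p["W_c"], v, p["b_c"])]
    c = [f * cp + i * g for f, cp, i, g in zip(forget, c_prev, inp, candidate)]
    h = [o * math.tanh(ci) for o, ci in zip(out, c)]
    return c, h

# Hand-set parameters that make the cell retain its memory and ignore the input.
state_dim, input_dim = 2, 3
zeros = [[0.0] * (state_dim + input_dim) for _ in range(state_dim)]
params = {
    "W_f": zeros, "b_f": [20.0, 20.0],    # forget gate close to 1: keep old memory
    "W_i": zeros, "b_i": [-20.0, -20.0],  # input gate close to 0: add almost nothing
    "W_o": zeros, "b_o": [0.0, 0.0],      # output gate equal to 0.5
    "W_c": zeros, "b_c": [0.0, 0.0],
}
c, h = lstm_step([0.5, -0.3], [0.0, 0.0], [1.0, 2.0, 3.0], params)
```

With the forget gate saturated open and the input gate saturated closed, the memory cell is carried forward almost unchanged, which is the mechanism that lets gated models preserve information over long sequences.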

---

## **6. Training RNNs**
- RNNs are trained using **backpropagation through time (BPTT)**:
   - Compute the loss at the output.
   - Backpropagate gradients **through time steps** to update parameters.  
- **Issues**:
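   - **Vanishing Gradients**: Gradients shrink as they are propagated back through many time steps, making long-range dependencies hard to learn.
   - **Exploding Gradients**: Gradients grow as they are propagated back through many time steps, making updates unstable.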
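
Both issues can be seen in a scalar sketch. The example assumes a one-dimensional state with recurrence $ s_t = \tanh(w\, s_{t-1} + x_t) $ (so $ W_{ss} $ reduces to a single number $ w $ and $ W_{sx} = 1 $) and tracks $ \partial s_T / \partial s_0 $, the factor by which a gradient is scaled when propagated back through $ T $ steps. The specific values of $ w $ and the inputs are illustrative.

```python
import math

def state_gradient(w, inputs, s0=0.0):
    """Return d s_T / d s_0 for the scalar recurrence s_t = tanh(w * s_{t-1} + x_t)."""
    s, grad = s0, 1.0
    for x in inputs:
        s = math.tanh(w * s + x)
        grad *= w * (1.0 - s * s)  # chain rule: d s_t / d s_{t-1} = w * tanh'(z_t)
    return grad

horizons = (1, 10, 30)
# |w| < 1: every factor has magnitude at most 0.5, so the product shrinks geometrically.
vanishing = [abs(state_gradient(0.5, [0.3] * T)) for T in horizons]
# w = 3 with zero inputs: the state stays at 0, tanh'(0) = 1, so each factor equals 3.
exploding = [abs(state_gradient(3.0, [0.0] * T)) for T in horizons]
```

In the first case the factor is bounded by $ 0.5^T $; in the second it equals $ 3^T $. The vector-valued case behaves analogously, with products of Jacobian matrices in place of scalar factors.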

---

## **7. Encoding vs. Decoding**
- **Encoding**: Turn a sequence into a meaningful vector (e.g., for sentiment analysis).
- **Decoding**: Use the encoded vector to generate predictions, including sequences (e.g., translation).

### **Power of Encoded Vectors**:
- Encoded representations allow objects like sentences, images, or events to be mapped into the same space.
- This enables tasks like translating sentences to images, relating disparate objects, and flexible sequence modeling.

---

## **8. Key Takeaways**
- RNNs enable flexible modeling of sequences by maintaining a **state vector** that evolves over time.
- **Gated architectures** (GRU, LSTM) improve training stability and address long-sequence challenges.
- RNNs encode sequences into vectors that can be used for various tasks, from next-word prediction to machine translation.
- Backpropagation through time (BPTT) is used for training, but gradient issues must be managed carefully.


## **Lecture 11. Recurrent Neural Networks 2** 

# **Summary of Sequence Generation and Recurrent Neural Networks**

## **1. Introduction to Sequence Generation**
- **Goal**: Use recurrent neural networks (RNNs) to generate sequences, such as sentences or character streams.
- **Key Idea**:
   - Predict the next word in a sequence based on previous words.
   - Translate inputs (e.g., words, images) into meaningful vector representations and decode them back into sequences.

---

## **2. Markov Models**
- **First-Order Markov Model**:
   - Predicts the next word using only the **immediate preceding word**.
   - Probability of a sentence is the product of the probabilities of generating each word conditioned on the previous word.

### **Steps to Generate a Sentence**:
1. Start with a **beginning symbol** `<beg>`.
2. Sample the first word using a probability table.
3. Use each generated word to predict the next word until the **end symbol** `<end>` is reached.

### **Maximum Likelihood Estimation**:
- To train the Markov model:
   - Count pairs of successive words across a corpus.
   - Normalize these counts to derive probabilities.
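
The counting-and-normalizing estimate, together with the generation steps described earlier in this section, can be sketched as follows. The three-sentence corpus and the function names are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

BEG, END = "<beg>", "<end>"

def fit_bigram(corpus):
    """Maximum likelihood estimate: count successive word pairs, then normalize."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = [BEG] + sentence.split() + [END]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {word: c / sum(nexts.values()) for word, c in nexts.items()}
        for prev, nexts in counts.items()
    }

def generate(probs, rng, max_len=20):
    """Sample words one at a time, starting from <beg>, until <end> is drawn."""
    word, sentence = BEG, []
    for _ in range(max_len):
        options = probs[word]
        word = rng.choices(list(options), weights=list(options.values()))[0]
        if word == END:
            break
        sentence.append(word)
    return sentence

# Toy corpus (assumed for illustration).
corpus = ["the cat sat", "the dog sat", "the cat ran"]
probs = fit_bigram(corpus)
sample = generate(probs, random.Random(0))
```

In this corpus "cat" follows "the" in two of three sentences, so the estimated probability of "cat" given "the" is 2/3. Any word pair that never appears in the corpus receives probability zero, which is one motivation for the neural models in the next section.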

---

## **3. Feed-Forward Neural Networks for Sequence Modeling**
- **Transition from Markov Models**:
   - Replace the fixed probability table with a **feed-forward neural network**.
   - Input: One-hot vector of the previous word.
   - Output: A probability distribution over the next word using **softmax**.

- **Extension to Higher Orders**:
   - Include multiple preceding words as inputs.
   - Introduce **hidden layers** to model complex combinations of preceding words.
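
A minimal sketch of the first-order case is shown below: a one-hot input for the previous word, a single weight matrix with no hidden layer, and a softmax output. The four-word vocabulary and the hand-set weights are hypothetical; a trained model would learn `W` and `w0` from data.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def next_word_distribution(prev_index, W, w0):
    """Softmax over scores z_j = sum_i W[j][i] * x_i + w0[j] for a one-hot input x."""
    x = one_hot(prev_index, len(W[0]))
    scores = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, w0)]
    return softmax(scores)

vocab = ["<beg>", "the", "cat", "<end>"]
# W[j][i] scores next word j after previous word i (hand-set for illustration).
W = [
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
w0 = [0.0, 0.0, 0.0, 0.0]
dist = next_word_distribution(vocab.index("the"), W, w0)
most_likely = vocab[dist.index(max(dist))]
```

With a one-hot input and no hidden layer, each column of `W` plays the role of one row of the Markov probability table, expressed as unnormalized log-probabilities. The benefit of the network form appears once several preceding words or hidden layers are added.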

---

## **4. Recurrent Neural Networks (RNNs)**
- RNNs generalize Markov models and feed-forward networks by allowing:
   1. **Variable-length history**: Retains information from earlier steps.
   2. **State Persistence**: State $ s_t $ summarizes all previous words and updates with new input $ x_t $:  
      $
      s_t = \tanh(W_{ss} s_{t-1} + W_{sx} x_t)
      $
- **Architecture**:
   - Input: One-hot vector of the previous word.
   - Hidden State: Maintains the history of the sequence.
   - Output: Probability distribution over the next word via **softmax**.
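
The architecture above can be sketched as a language model that scores a whole sentence. The output weight matrix `W_o`, the sizes, and the random (untrained) weights are assumptions for illustration; only the state update $ s_t = \tanh(W_{ss} s_{t-1} + W_{sx} x_t) $ comes from the notes.

```python
import math
import random

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_lm_step(state, word_index, p):
    """Consume one word; return the new state and the next-word distribution."""
    x = [1.0 if i == word_index else 0.0 for i in range(len(p["W_sx"][0]))]
    pre = [a + b for a, b in zip(matvec(p["W_ss"], state), matvec(p["W_sx"], x))]
    state = [math.tanh(z) for z in pre]
    return state, softmax(matvec(p["W_o"], state))

def sentence_log_prob(indices, p):
    """Sum of log p(word_t | all earlier words); indices[0] is the <beg> symbol."""
    state = [0.0] * len(p["W_ss"])
    total = 0.0
    for prev, nxt in zip(indices, indices[1:]):
        state, dist = rnn_lm_step(state, prev, p)
        total += math.log(dist[nxt])
    return total

# Hypothetical sizes and random weights.
rng = random.Random(0)
vocab_size, state_dim = 4, 3

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

params = {
    "W_ss": rand_matrix(state_dim, state_dim),
    "W_sx": rand_matrix(state_dim, vocab_size),
    "W_o": rand_matrix(vocab_size, state_dim),
}
log_prob = sentence_log_prob([0, 1, 2, 3], params)
```

Unlike the first-order Markov model, the distribution at each step depends on the entire prefix through the state, not only on the immediately preceding word.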

---

## **5. Training RNNs**
- **Backpropagation Through Time (BPTT)**:
   - Unroll the network over the sequence and backpropagate the loss gradient from each time step back through the earlier steps.
   - Update parameters using stochastic gradient descent.

- **Issues**:
   - **Vanishing Gradients**: Gradients shrink over long sequences.
   - **Exploding Gradients**: Gradients grow excessively, causing instability.

---

## **6. Advanced Sequence Models**
- **Gated RNNs**:
   - Add mechanisms (gates) to control how information is retained or overwritten in the hidden state.
- **Long Short-Term Memory (LSTM)**:
   - Introduces gates:
      - **Forget Gate**: Controls what information to discard.
      - **Input Gate**: Controls what new information to add.
      - **Output Gate**: Controls what information to expose as output.

---

## **7. Sequence-to-Sequence Tasks**
- **Encoding and Decoding**:
   - **Encoder**: Turns sequences (e.g., sentences, images) into vectors.
   - **Decoder**: Converts vectors back into sequences.
   - Example: Translating a sentence from one language to another.

### **Key Steps**:
1. Initialize the decoder state with the encoded vector rather than with zeros.
2. Use RNN (or LSTM) to sequentially generate words, conditioned on the state and previously generated words.
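
The two steps can be sketched with a plain RNN decoder. The example assumes greedy decoding (always taking the most likely word), which is one simple choice and not necessarily the one used in the lecture; sampling from the softmax distribution is an equally valid alternative. The weights, sizes, and the initial vector are hypothetical.

```python
import math
import random

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def decode(initial_state, p, beg, end, max_len=10):
    """Greedy decoding: feed each generated word back in as the next input."""
    state, word, generated = list(initial_state), beg, []
    for _ in range(max_len):
        x = [1.0 if i == word else 0.0 for i in range(len(p["W_sx"][0]))]
        pre = [a + b for a, b in zip(matvec(p["W_ss"], state), matvec(p["W_sx"], x))]
        state = [math.tanh(z) for z in pre]
        scores = matvec(p["W_o"], state)
        word = scores.index(max(scores))  # argmax of the scores equals argmax of their softmax
        if word == end:
            break
        generated.append(word)
    return generated

# Hypothetical sizes and random (untrained) weights.
rng = random.Random(1)
vocab_size, state_dim = 4, 3

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

params = {
    "W_ss": rand_matrix(state_dim, state_dim),
    "W_sx": rand_matrix(state_dim, vocab_size),
    "W_o": rand_matrix(vocab_size, state_dim),
}
encoded = [0.2, -0.1, 0.4]  # stands in for the output of an encoder
words = decode(encoded, params, beg=0, end=3)
```

The `max_len` cap is a practical safeguard in this sketch, since an untrained decoder is not guaranteed to ever produce the end symbol.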

---

## **8. Applications**
- **Language Modeling**: Predict the next word or character in a sentence.
- **Sentiment Analysis**: Summarize a sentence into a vector to predict sentiment.
- **Machine Translation**: Encode a source sentence into a vector and decode it into a sentence in another language.
- **Image Captioning**: Encode an image into a vector and decode it into a sentence.

---

## **9. Key Takeaways**
- RNNs overcome limitations of fixed-length history in Markov models.
- They maintain a hidden state that evolves over time, enabling sequence predictions.
- Advanced models like LSTMs handle long sequences better with gates for retaining or forgetting information.
- RNNs can be extended to sequence-to-sequence tasks, such as translation and captioning, by encoding and decoding vector representations.
