# 8.1 Neurons as Classifiers and Supervised Learning

### **Supervised Learning and Perceptrons**

1. **Classification Problem**:
   - **Task**: Given images, classify them as containing faces or not.
   - **Approach**: Use machine learning to find a hyperplane that separates the images into classes.

2. **Perceptron Model**:
   - **Basic Idea**: A perceptron computes a weighted sum of its inputs, compares the sum to a threshold, and outputs a result based on this comparison.
   - **Mathematical Model**: If the weighted sum of inputs exceeds the threshold, the output is +1; otherwise, it’s -1.
   - **Geometric Interpretation**: The perceptron essentially defines a hyperplane (or a line in 2D) that separates classes.

3. **Learning in Perceptrons**:
   - **Objective**: Adjust weights and threshold to correctly classify inputs.
   - **Learning Rule**: Update the weights and threshold in proportion to the error between desired and actual outputs, so correctly classified inputs cause no change.
   - **Limitations**: Perceptrons can only classify linearly separable data. For more complex tasks, multilayer perceptrons (MLPs) are needed.
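
The perceptron model and learning rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the learning rate, the epoch limit, the function names, and the choice of the logical AND function as a linearly separable toy dataset are all assumptions made for the example.

```python
def perceptron_output(weights, threshold, x):
    """Return +1 if the weighted sum of inputs exceeds the threshold, else -1."""
    weighted_sum = sum(w * xi for w, xi in zip(weights, x))
    return 1 if weighted_sum > threshold else -1


def train_perceptron(samples, lr=0.1, epochs=100):
    """Error-driven updates; lr and epochs are illustrative choices."""
    weights = [0.0] * len(samples[0][0])
    threshold = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - perceptron_output(weights, threshold, x)
            if error != 0:
                mistakes += 1
                # Move the hyperplane toward classifying this input correctly.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                threshold -= lr * error
        if mistakes == 0:  # every sample classified correctly
            break
    return weights, threshold


# Toy dataset (assumed for illustration): logical AND with targets in {-1, +1}.
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, threshold = train_perceptron(AND_DATA)
predictions = [perceptron_output(weights, threshold, x) for x, _ in AND_DATA]
```

Swapping in XOR targets would keep the loop from ever reaching zero mistakes, which is the linear-separability limitation noted above.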

4. **Multilayer Perceptrons (MLPs)**:
   - **Structure**: Consist of multiple layers (input, hidden, and output layers).
   - **Function**: Can handle more complex, non-linear classification problems like XOR.
   - **Continuous Outputs**: Replacing the hard threshold with a smooth function such as the sigmoid gives outputs in a continuous range, which suits regression tasks and makes the error differentiable for training.

5. **Training MLPs**:
   - **Backpropagation**: An algorithm used to minimize error by propagating errors backward through the network and adjusting weights accordingly.
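   - **Gradient Descent**: Used to reduce the error function step by step by moving the weights against its gradient; it finds a local minimum, which is not guaranteed to be the global optimum.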
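
As a rough sketch of how backpropagation and gradient descent fit together, the following trains a small network with one hidden layer on XOR. The network size, learning rate, squared-error loss, random seed, and all function names are assumptions for illustration. Whether the trained outputs end up close to the XOR targets depends on the random initialization, so the example only records the error before and after training.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return the hidden activations and the network output for input x."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def total_error(params, data):
    """Squared error summed over the dataset: E = 0.5 * sum (t - y)^2."""
    return 0.5 * sum((t - forward(params, x)[1]) ** 2 for x, t in data)


def train(data, n_hidden=3, lr=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = rng.uniform(-1, 1)
    initial = total_error((w_hidden, b_hidden, w_out, b_out), data)
    for _ in range(epochs):
        for x, t in data:
            hidden, y = forward((w_hidden, b_hidden, w_out, b_out), x)
            # Error signal at the output unit (derivative of E w.r.t. its net input).
            delta_out = (y - t) * y * (1 - y)
            # Propagate the error backward through the output weights.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                            for j in range(n_hidden)]
            # Gradient descent step on every weight and bias.
            for j in range(n_hidden):
                w_out[j] -= lr * delta_out * hidden[j]
                b_hidden[j] -= lr * delta_hidden[j]
                for i in range(n_in):
                    w_hidden[j][i] -= lr * delta_hidden[j] * x[i]
            b_out -= lr * delta_out
    params = (w_hidden, b_hidden, w_out, b_out)
    return params, initial, total_error(params, data)


XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
params, initial_error, final_error = train(XOR_DATA)
outputs = [forward(params, x)[1] for x, _ in XOR_DATA]
```

Comparing `initial_error` with `final_error` shows how much the error dropped; inspecting `outputs` shows whether this particular run separated the two XOR classes.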

6. **Application Example**:
   - **Truck Backing Simulation**: Demonstrates using MLPs for practical tasks like backing a truck into a loading dock, showing the training process and gradual improvement.


# 8.2 Reinforcement Learning: Predicting Rewards


1. **Reinforcement Learning (RL):** In RL, an agent (like Pavlov's dog) interacts with an environment and aims to maximize the cumulative future rewards. The agent learns by exploring actions and receiving feedback in the form of rewards or punishments.

2. **Pavlov’s Classical Conditioning:** Pavlov’s experiments with dogs showed that animals can learn to predict rewards based on stimuli (like a bell). This prediction can lead to anticipatory responses, such as salivation.

3. **Temporal Difference (TD) Learning:** To predict future rewards, RL algorithms use TD learning, which updates predictions based on differences between successive predictions. The recursive formulation from Richard Bellman’s dynamic programming expresses the expected future reward at one time step through the prediction at the next step, so an update does not have to wait for rewards that have not yet arrived.

4. **Neuroscientific Insights:** Studies have shown that dopamine neurons in the ventral tegmental area (VTA) of the brain encode reward prediction errors. Before learning, these neurons respond strongly at the time of reward. After learning, they respond more to the predictive stimulus and less to the reward itself. This aligns with TD learning principles.

5. **Prediction Errors:** If a reward is omitted when expected, it results in a negative prediction error, leading to a decrease in neuron firing rates in the VTA. This reflects the discrepancy between expected and actual outcomes.
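
The shift of the prediction error from reward time to stimulus time can be reproduced with a small TD simulation. This is a sketch under stated assumptions: the stimulus is represented by one weight per time step since its onset, there is no discounting, and the trial length, stimulus time, reward time, learning rate, and function names are all hypothetical choices for the example.

```python
N_STEPS = 25    # time steps per trial (assumed)
STIM_T = 5      # time step at which the stimulus (e.g. the bell) appears
REWARD_T = 15   # time step at which the reward arrives


def value(w, t):
    """Predicted future reward at time t: one weight per step since stimulus onset."""
    tau = t - STIM_T
    return w[tau] if 0 <= tau < len(w) else 0.0


def run_trial(w, reward, lr=0.0):
    """Return the TD error at every time step; update the weights if lr > 0."""
    deltas = []
    for t in range(N_STEPS):
        r = reward if t == REWARD_T else 0.0
        # TD error: reward plus the change between successive predictions.
        delta = r + value(w, t + 1) - value(w, t)
        deltas.append(delta)
        tau = t - STIM_T
        if lr > 0.0 and 0 <= tau < len(w):
            w[tau] += lr * delta
    return deltas


w = [0.0] * (N_STEPS - STIM_T)
before = run_trial(w, reward=1.0)       # errors before any learning
for _ in range(300):
    run_trial(w, reward=1.0, lr=0.5)    # rewarded training trials
after = run_trial(w, reward=1.0)        # errors after learning
omitted = run_trial(w, reward=0.0)      # expected reward withheld
```

In this sketch `before` has its only nonzero error at `REWARD_T`, `after` has it at the step just before `STIM_T` (where the prediction first jumps), and `omitted` shows a negative error at `REWARD_T`, mirroring the three dopamine response patterns described above.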


# 8.3 Reinforcement Learning: Time for Action!

1. **Reinforcement Learning Framework**:
   - The framework involves an agent interacting with an environment, where the agent observes the state, receives rewards, and takes actions. 
   - The goal is to learn a policy that maximizes the expected total future reward.

2. **Policy and Value**:
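   - A policy ($\pi$) maps states to actions; the aim is to find a policy that maximizes the expected total future reward.
   - The value of a state is the expected total future reward obtained when starting from that state and following the policy. It is crucial for deciding which action to take.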

3. **Example with the Rat**:
   - In a maze with varying food rewards, the rat needs to decide whether to go left or right at each location to maximize future rewards.
   - States are the locations in the maze, and actions are "go left" or "go right."
   - The expected reward for each state is computed based on the rewards of neighboring states.
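
To make the value computation concrete, the sketch below evaluates a three-location maze under a policy that picks left or right with equal probability. The maze layout, the reward amounts, and the names `MAZE`, `RANDOM_POLICY`, and `state_value` are hypothetical and chosen only for illustration.

```python
# Hypothetical maze: each action leads to (next_state, reward);
# a next_state of None means the trial ends.
MAZE = {
    "A": {"left": ("B", 0.0), "right": ("C", 0.0)},
    "B": {"left": (None, 0.0), "right": (None, 5.0)},
    "C": {"left": (None, 2.0), "right": (None, 0.0)},
}


def state_value(state, policy):
    """Expected total future reward from `state` when following `policy`."""
    if state is None:
        return 0.0
    return sum(
        policy[state][action] * (reward + state_value(next_state, policy))
        for action, (next_state, reward) in MAZE[state].items()
    )


# Policy that chooses left or right with equal probability everywhere.
RANDOM_POLICY = {s: {"left": 0.5, "right": 0.5} for s in MAZE}
values = {s: state_value(s, RANDOM_POLICY) for s in MAZE}
# values == {"A": 1.75, "B": 2.5, "C": 1.0}
```

The value of A is the average of the values of its neighbors B and C, which is the sense in which a state's expected reward is built from neighboring states.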

4. **Temporal Difference Learning (TD Learning)**:
   - TD Learning updates the value of a state using the difference between the current prediction and the reward actually received plus the predicted value of the next state.
   - It helps the agent learn the value of each state as it explores the environment.

5. **Actor-Critic Learning**:
   - This algorithm has two components:
     - **Critic**: Evaluates the current policy by learning the value of states using TD Learning.
     - **Actor**: Selects actions based on a probabilistic policy that uses the Q-function (value of state-action pairs) and the softmax function to balance exploration and exploitation.
   - The algorithm learns an optimal policy by repeating the process of evaluating and improving the policy.
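
A compact actor-critic loop might look like the following. It is a sketch rather than the algorithm as presented in the source: the maze and rewards are hypothetical, and the specific actor update (adding the TD error to the chosen state-action value), the learning rates, the softmax parameter `beta`, and the seed are assumptions.

```python
import math
import random

# Hypothetical maze: each action leads to (next_state, reward); None ends the trial.
MAZE = {
    "A": {"left": ("B", 0.0), "right": ("C", 0.0)},
    "B": {"left": (None, 0.0), "right": (None, 5.0)},
    "C": {"left": (None, 2.0), "right": (None, 0.0)},
}
ACTIONS = ("left", "right")


def softmax_policy(action_values, beta):
    """Turn state-action values into action probabilities."""
    top = max(action_values.values())
    exps = {a: math.exp(beta * (val - top)) for a, val in action_values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def actor_critic(n_trials=5000, lr_critic=0.2, lr_actor=0.2, beta=1.0, seed=0):
    rng = random.Random(seed)
    v = {s: 0.0 for s in MAZE}                           # critic: state values
    q = {s: {a: 0.0 for a in ACTIONS} for s in MAZE}     # actor: state-action values
    for _ in range(n_trials):
        state = "A"
        while state is not None:
            probs = softmax_policy(q[state], beta)
            action = rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
            next_state, reward = MAZE[state][action]
            next_value = v[next_state] if next_state is not None else 0.0
            delta = reward + next_value - v[state]       # TD error
            v[state] += lr_critic * delta                # critic evaluates the policy
            q[state][action] += lr_actor * delta         # actor improves the policy
            state = next_state
    return v, q


v, q = actor_critic()
greedy = {s: max(ACTIONS, key=lambda a: q[s][a]) for s in MAZE}
```

With these rewards one would expect `greedy` to favor left at A, right at B, and left at C, since that path leads to the largest reward.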

6. **Exploration vs. Exploitation**:
   - Initially, the agent explores various actions to learn about the environment.
   - Over time, it exploits the knowledge to choose actions that maximize rewards based on learned values.
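
One common way to control this trade-off with the softmax function is a scaling parameter, here called `beta` (an assumed name, as are the example action values). Small `beta` gives near-uniform random choices (exploration); large `beta` concentrates probability on the highest-valued action (exploitation).

```python
import math


def softmax(values, beta):
    """Action probabilities from action values; beta sets how greedy the choice is."""
    top = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (v - top)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


action_values = [1.0, 2.0]                    # hypothetical values for two actions
explore = softmax(action_values, beta=0.0)    # [0.5, 0.5]: purely random choice
exploit = softmax(action_values, beta=10.0)   # almost always picks the second action
```

Gradually increasing `beta` during learning is one simple way to move from exploring to exploiting.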

7. **Application to Real-World Problems**:
   - Reinforcement learning has been applied to complex tasks like autonomous helicopter flight, showcasing its practical utility.

8. **Mapping to the Brain**:
   - There are theoretical parallels between reinforcement learning components and brain structures like the basal ganglia.
