<h1 align="center">INFO621 - Advanced Machine Learning Applications</h1>

<h2 align="center"><strong>Homework 2: Neural Language and Sequence Modeling</strong></h2>

# Guidelines

**Worth:** 10% of your final grade (**100 points total**)  
**Submission Deadline:** Friday, February 21, 11:59 PM (Tucson Time)  


### **Instructions**

- For exercises involving code, write your solutions in the provided code chunks. You may add additional code chunks if needed.  
- For exercises involving plots, ensure all axes and legends are labeled and give each plot an informative title.  
- For exercises requiring descriptions or interpretations, use full sentences and provide clear, concise explanations.  

### **Policies**

**Sharing/Reusing Code Policy:**  
You are allowed to use online resources (e.g., RStudio Community, StackOverflow) but **must explicitly cite** any external code you use or adapt. Failure to do so will be considered plagiarism, regardless of the source.  

**Late Submission Policy:**  
- **Less than 1 day late:** -25% of available points.  
- **At least 1 day but less than 7 days late:** -50% of available points.  
- **7 days or more late:** No credit will be awarded, and feedback will not be provided.  

**Declaration of Independent Work:**  
You must acknowledge your submission as your independent work by including your **name** and **date** at the end of the "Declaration of Independent Work" section.  

### **Grading**

- **Total Points:** 100 points.

- **Grade Breakdown:**  
  - **Part 1 (40 Points Total):**  
    - Multiple-Choice Questions: 5 questions, 4 points each.  
    - Descriptive Questions: 4 questions, 5 points each.  

  - **Part 2 (40 Points Total):**  
    - 10 code completion tasks, 4 points each.  

  - **Part 3 (20 Points Total):**  
    - 4 code completion tasks, 5 points each.  

# Part 1. Written Questions (40 points)

- **Multiple-Choice Questions**: 5 questions, 4 points each  
- **Descriptive Questions**: 4 questions, 5 points each

## 1.1 Multiple-Choice Questions (20 points)

1. **What is the key role of the activation function in a perceptron? (4 points)**

   - A) To initialize weights  
   - B) To compute the weighted sum of inputs  
   - C) To introduce non-linearity into the model
   - D) To apply regularization  
   - E) To update weights through gradient descent  

**Your Answer**: 

2. **Which of the following is NOT a type of neural network activation function? (4 points)**

   - A) Sigmoid  
   - B) Tanh  
   - C) ReLU  
   - D) Softmax  
   - E) Gradient

**Your Answer**: E

3. **What is backpropagation primarily used for in a neural network? (4 points)**

   - A) Data normalization  
   - B) Computing gradients for weight updates
   - C) Generating predictions from the model  
   - D) Preventing overfitting  
   - E) Reducing the training time  

**Your Answer**:

4. **Which regularization method involves halting training at a point to avoid overfitting? (4 points)**

   - A) Dropout  
   - B) Batch Normalization  
   - C) Weight Decay  
   - D) Early Stopping
   - E) Adaptive Learning  

**Your Answer**:

5. **What is one major advantage of using mini-batch gradient descent over full-batch gradient descent? (4 points)**

   - A) It avoids the vanishing gradient problem
   - B) It improves the generalization of the model
   - C) It balances computational efficiency and convergence stability
   - D) It guarantees convergence to a global minimum  
   - E) It completely eliminates overfitting

**Your Answer**:

## 1.2 Open-Ended Questions (20 points)

### Q1. Perceptron and XOR

**Scenario**:
Imagine you are tackling a binary classification task where the data exhibits a pattern similar to the XOR problem—a classic example where data points cannot be separated by a single straight line. This scenario illustrates the challenge of linear separability and sets the stage for understanding why a basic model might struggle with such tasks.

**Question**:
Why is a single-layer perceptron unable to solve the XOR problem, and how does adding multiple layers to a neural network overcome this limitation?

**Hint**:
Consider the concept of linear separability and how deeper networks can perform non-linear transformations.
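
To make the hint concrete, here is a minimal sketch of a hand-wired two-layer network that reproduces XOR. It is only an illustration: the weights are chosen by hand rather than learned, a hard step activation is assumed, and the names `step` and `xor_net` are made up for this example.

```python
import numpy as np

def step(z):
    """Hard threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

# The four XOR inputs and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hidden layer: unit 0 behaves like OR, unit 1 behaves like AND.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output unit: fires when OR is on and AND is off.
W2 = np.array([1.0, -1.0])
b2 = -0.5

def xor_net(X):
    hidden = step(X @ W1 + b1)      # non-linear re-mapping of the inputs
    return step(hidden @ W2 + b2)   # a linear boundary in the hidden space

predictions = xor_net(X)
print(predictions, np.array_equal(predictions, y))
```

A single unit $\text{step}(w_1 x_1 + w_2 x_2 + b)$ cannot do the same: it would need $b \le 0$, $w_1 + b > 0$, $w_2 + b > 0$, and $w_1 + w_2 + b \le 0$ all at once. Adding the two middle inequalities gives $w_1 + w_2 + b > -b \ge 0$, which contradicts the last one.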

**Answer**

### Q2. Loss Functions: Binary Cross-Entropy vs. Mean Squared Error

**Scenario**
Imagine you are training a neural network to distinguish between two classes, such as identifying whether an email is spam or not. You need a reliable way to quantify how well your model's predictions match the actual outcomes. This scenario highlights the importance of selecting the right loss function based on the task at hand.  

**Question**

What is the purpose of the loss function in a neural network? How do binary cross-entropy (BCE) and mean squared error (MSE) differ in their application?

**Hint**  
Think about the types of tasks (**classification vs. regression**) each loss function is best suited for.
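
One way to explore the hint is to evaluate both losses on the same prediction. The sketch below is only illustrative: the helper names `bce` and `mse` and the three probabilities are chosen for this example, and the natural logarithm is assumed.

```python
import numpy as np

def bce(y_true, y_pred):
    """Binary cross-entropy for a single prediction."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mse(y_true, y_pred):
    """Squared error for a single prediction."""
    return (y_true - y_pred) ** 2

y_true = 1.0
for y_pred in (0.9, 0.5, 0.01):
    print(f"p={y_pred:<4}  BCE={bce(y_true, y_pred):.3f}  MSE={mse(y_true, y_pred):.3f}")
```

For a true label of 1, the squared error can never exceed 1, while the cross-entropy grows without bound as the predicted probability approaches 0, so confident mistakes are penalized far more heavily.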

**Answer**

### Q3. Importance of Non-Linearity  

**Scenario**

Consider a scenario where you are building a **neural network** for a **complex task**, such as **recognizing handwritten digits**. The ability to distinguish between similar but distinct patterns hinges on the network’s capacity to model **non-linear relationships**. Without **non-linear transformations**, the network's power would be drastically limited.  

**Question**  
Why is non-linearity critical in neural networks, and what would be the consequence if the network's activation functions were purely linear?  

**Hint**  
Think about how **stacking linear layers** without **non-linear transformations** would affect the **model’s expressive power**.  
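
The hint can be checked numerically. The following sketch (random shapes and variable names are assumptions made for the illustration) stacks two layers with no activation between them and compares the result with a single linear layer built from the same parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two stacked layers with identity (purely linear) activations.
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping written as one linear layer.
W_single = W1 @ W2
b_single = b1 @ W2 + b2
one_layer = X @ W_single + b_single

print(np.allclose(two_layers, one_layer))
```

Because the two outputs agree, depth adds no expressive power unless a non-linear activation sits between the layers.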

**Answer**  

### Q4. Dropout as a Regularization Method

**Scenario**  
Imagine training a **deep neural network** that performs **exceptionally well on the training data** but **fails to generalize to unseen data**—a common problem known as **overfitting**. To combat this, you decide to incorporate a technique that forces the network to learn **robust and redundant features**, thereby improving its **generalization ability**.  

**Question**  
What is dropout, and why is it particularly effective as a regularization method in deep neural networks?

**Hint**  
Consider how **dropout temporarily disables neurons during training** and what effect this has on the **network’s learning process**.  
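
As a rough illustration of the mechanism in the hint, here is a minimal sketch of one common variant, "inverted" dropout. The assignment does not prescribe a variant, so the rescaling by the keep probability, the function name `dropout`, and its parameters are assumptions for this example.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return activations                      # inference: leave untouched
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the surviving units so the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
dropped = dropout(a, drop_prob=0.5, rng=rng)

print("fraction zeroed:", np.mean(dropped == 0))
print("mean activation:", dropped.mean())
```

Since a different random subset of neurons is silenced on every call, no single unit can be relied on, which is what pushes the network toward redundant features.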

**Answer**


# Part 2. Implementing a Simple Neural Network from Scratch (40 points)

## **Objective**
In this homework, you will implement various components of a simple neural network from scratch. You are required to complete the following tasks:

1. **Perceptron Basics**: Implement the forward pass of a perceptron, including computing the weighted sum and applying the sigmoid activation function.
2. **Multi-Layer Perceptron**: Extend your implementation to include a two-layer neural network, using ReLU for the hidden layer and softmax for the output layer.
3. **Loss Function**: Implement the binary cross-entropy loss function to evaluate the model's performance.
4. **Backpropagation**: Compute the gradients of the loss with respect to the weights and biases using the backpropagation algorithm.
5. **Training the Neural Network**: Train the neural network using gradient descent by iteratively updating the weights and biases to minimize the loss.

By completing these tasks, you will gain hands-on experience in building and training neural networks and understanding their underlying mechanics.


In [2]:
# Necessary Imports
import numpy as np

## **2.1 Perceptron Basics (2 missing blocks, 8 points in total)**

A perceptron calculates a weighted sum of inputs and applies an activation function to produce an output. The sigmoid activation function is defined as:

$\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$

### **Task**:
1. Compute the weighted sum $z = x \cdot w + b$, where $x$ is the input vector, $w$ is the weight vector, and $b$ is the bias term.
2. Apply the sigmoid activation function to $z$ to compute the perceptron output.


In [25]:
class Perceptron:
    def __init__(self, weights, bias):
        """
        Initialize the Perceptron.

        Parameters:
        - weights: Array of weights for the input features.
        - bias: Bias term.
        """
        self.weights = np.array(weights)
        self.bias = bias

    def sigmoid(self, z):
        """
        Sigmoid activation function.

        Parameters:
        - z: Weighted sum of inputs.

        Returns:
        - Sigmoid activation value.
        """
        #############################################################
        return 1/((1+np.exp(-z)))
    

        #############################################################

    def forward(self, x):
        """
        Forward pass for the perceptron.

        Parameters:
        - x: Input features.

        Returns:
        - Output after applying sigmoid activation.
        """
        #############################################################
        # Your Turn: write your own code here
        
        z = (np.dot(x,self.weights) + self.bias)
        
        return self.sigmoid(z)
    
        #
        #############################################################


# Example usage
weights = [0.1, 0.2, 0.3]
bias = 0.5
perceptron = Perceptron(weights, bias)
x = np.array([1.0, 2.0, 3.0])
output = perceptron.forward(x)
print("Output:", output)

Output: 0.8698915256370021
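
The printed value can be checked by hand. The short sketch below repeats the computation outside the class, using the same inputs as the example usage; it is only a sanity check, not part of the required solution.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.1, 0.2, 0.3])
b = 0.5

# Weighted sum: 1.0*0.1 + 2.0*0.2 + 3.0*0.3 + 0.5 = 1.9
z = x @ w + b

# Sigmoid of the weighted sum.
output = 1.0 / (1.0 + np.exp(-z))

print(round(float(z), 6), round(float(output), 6))
```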


## **2.2 Multi-Layer Perceptron (3 missing blocks, 12 points in total)**

A multi-layer perceptron (MLP) consists of layers of neurons, each performing a weighted sum followed by a non-linear activation function.

The **ReLU** (Rectified Linear Unit) activation function is defined as:

$\text{ReLU}(z) = \max(0, z)$

The **softmax** function for the output layer is defined as:

$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$
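
When the input is a batch, the sum over $j$ runs over the classes of one sample, not over the whole array. The sketch below shows one way to write that, under two assumptions: samples are stored as rows, and the row maximum is subtracted first (a common stability trick that leaves the result unchanged). The name `softmax_rows` is hypothetical.

```python
import numpy as np

def softmax_rows(z):
    """Softmax applied independently to each row of z."""
    shifted = z - np.max(z, axis=-1, keepdims=True)   # avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])   # exp(1000) alone would overflow
probs = softmax_rows(logits)

print(probs.sum(axis=1))
```

Dividing by the sum over the entire array instead would make all entries of the batch sum to 1 together, so individual rows would no longer be probability distributions.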

### **Task**:
1. Compute the hidden layer activations: $ z_1 = X \cdot W_1 + b_1 $, $ a_1 = \text{ReLU}(z_1) $.
2. Compute the output layer activations: $ z_2 = a_1 \cdot W_2 + b_2 $, $ a_2 = \text{softmax}(z_2) $.
3. Combine both steps in the `forward` method of the `MultiLayerPerceptron` class.


In [39]:
class MultiLayerPerceptron:
    def __init__(self, W1, b1, W2, b2):
        """
        Initialize the Multi-Layer Perceptron.

        Parameters:
        - W1: Weights for the hidden layer.
        - b1: Biases for the hidden layer.
        - W2: Weights for the output layer.
        - b2: Biases for the output layer.
        """
        self.W1 = np.array(W1)
        self.b1 = np.array(b1)
        self.W2 = np.array(W2)
        self.b2 = np.array(b2)

    def relu(self, z):
        """
        ReLU activation function.

        Parameters:
        - z: Weighted sum of inputs.

        Returns:
        - ReLU activation value.
        """
        #############################################################
        # Your Turn: write your own code here
        
        return np.maximum(0,z)

        # 
        #############################################################

    def softmax(self, z):
        """
        Softmax activation function.

        Parameters:
        - z: Weighted sum of inputs.

        Returns:
        - Softmax probabilities.
        """
        #############################################################
        # Your Turn: write your own code here
        
        # Subtract the row-wise maximum for numerical stability, then
        # normalize over the last axis so each sample's probabilities sum to 1.
        exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

        #
        #############################################################

    def forward(self, X):
        """
        Forward pass for the multi-layer perceptron.

        Parameters:
        - X: Input features.

        Returns:
        - Output probabilities.
        """
        #############################################################
        
        # Your Turn: write your own code here
        
        z1 = np.dot(X,self.W1)+self.b1
        a1 = self.relu(z1)
        z2 = np.dot(a1,self.W2)+self.b2
        a2 = self.softmax(z2)
        
        return a2

        #
        #############################################################


# Example usage
W1 = np.random.rand(3, 4)
b1 = np.random.rand(4)
W2 = np.random.rand(4, 2)
b2 = np.random.rand(2)
X = np.random.rand(5, 3)
mlp = MultiLayerPerceptron(W1, b1, W2, b2)
output = mlp.forward(X)
# `output` has shape (5, 2) and every row sums to 1; the exact values
# vary from run to run because the inputs and weights are random.
print("Output Activations:", output)


## **2.3 Loss Function (1 missing block, 4 points in total)**

The binary cross-entropy loss quantifies the error between predicted probabilities ($y_{pred}$) and actual labels ($y_{true}$). The formula is:

$L = -\left(y_{true} \cdot \log(y_{pred}) + (1 - y_{true}) \cdot \log(1 - y_{pred})\right)$

### **Task**:
1. Compute the loss for each sample using the formula.
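
As a quick worked instance of the formula (natural logarithm assumed, values rounded to three decimals), consider one positive and one negative sample:

$y_{true} = 1,\ y_{pred} = 0.9: \quad L = -\log(0.9) \approx 0.105$

$y_{true} = 0,\ y_{pred} = 0.2: \quad L = -\log(1 - 0.2) = -\log(0.8) \approx 0.223$

In each case only one of the two terms is active, because the other is multiplied by zero.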

In [40]:
class BinaryCrossEntropy:
    def __init__(self):
        """
        Initialize the BinaryCrossEntropy class.
        """
        pass

    def compute(self, y_true, y_pred):
        """
        Compute binary cross-entropy loss.

        Parameters:
        - y_true: Ground truth labels.
        - y_pred: Predicted probabilities.

        Returns:
        - Binary cross-entropy loss.
        """
        #############################################################
        # Your Turn: write your own code here
        
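        # Clip predictions away from 0 and 1 so the logarithms stay finite.
        eps = 1e-12
        y_pred = np.clip(y_pred, eps, 1 - eps)

        # Per-sample loss from the formula above, averaged over all samples.
        per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return np.mean(per_sample)

        #
        #############################################################


# Example usage
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.2, 0.8, 0.7, 0.1])
bce = BinaryCrossEntropy()
loss = bce.compute(y_true, y_pred)
print(f"Loss: {loss:.4f}")

Loss: 0.2027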


## **2.4 Backpropagation (1 missing block, 4 points in total)**

In this task, you will compute:
1. The gradient of the loss with respect to the outputs ($ \frac{\partial L}{\partial \text{output}} $).
2. The gradients of the loss with respect to the weights ($ \frac{\partial L}{\partial W} $) and biases ($ \frac{\partial L}{\partial b} $) using the chain rule.
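
A common way to validate gradients obtained with the chain rule is to compare them with finite differences. The sketch below does this for a single sigmoid unit trained with mean binary cross-entropy. That setup, and the helper names `mean_bce`, `analytic_grads`, and `numeric_grad_w`, are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_bce(w, b, X, y):
    """Mean binary cross-entropy of a sigmoid unit."""
    p = sigmoid(X @ w + b)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def analytic_grads(w, b, X, y):
    """Chain rule: dL/dz simplifies to (p - y) / n for sigmoid + mean BCE."""
    p = sigmoid(X @ w + b)
    dz = (p - y) / len(y)
    return X.T @ dz, dz.sum()

def numeric_grad_w(w, b, X, y, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (mean_bce(w + step, b, X, y) - mean_bce(w - step, b, X, y)) / (2 * eps)
    return grad

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 0.0])
w = np.array([0.5, 0.5])
b = 0.1

grad_w, grad_b = analytic_grads(w, b, X, y)
print(np.allclose(grad_w, numeric_grad_w(w, b, X, y), atol=1e-6))
```

If the two gradients disagree, the analytic derivation (or its implementation) most likely contains a mistake.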

### **Task**:
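Complete the `compute` method of the `Backpropagation` class by:
1. Computing the gradient of the loss w.r.t. outputs.
2. Applying the chain rule to obtain the gradients w.r.t. the weights and biases.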

In [15]:
class Backpropagation:
    def __init__(self):
        """
        Initialize the Backpropagation class.
        """
        pass

    def compute(self, X, y_true, weights, biases, activations):
        """
        Compute gradients for weights and biases.

        Parameters:
        - X: Input data.
        - y_true: Ground truth labels.
        - weights: Weight matrix.
        - biases: Bias vector.
        - activations: Predicted outputs.

        Returns:
        - grad_weights: Gradient of loss w.r.t weights.
        - grad_biases: Gradient of loss w.r.t biases.
        """
        #############################################################
        # Your Turn: write your own code here
        
        n_samples = X.shape[0]

        # Keep predictions strictly inside (0, 1) to avoid division by zero.
        eps = 1e-12
        a = np.clip(activations, eps, 1 - eps)

        # 1. Gradient of the mean binary cross-entropy loss w.r.t. the outputs.
        grad_output = (-(y_true / a) + (1 - y_true) / (1 - a)) / n_samples

        # 2. Chain rule through the sigmoid (this assumes `activations` are
        #    sigmoid outputs) and then through z = X . W + b.
        grad_z = grad_output * a * (1 - a)
        grad_weights = X.T @ grad_z
        grad_biases = np.sum(grad_z, axis=0)
        
        return grad_weights, grad_biases

        #
        #############################################################


X = np.array([[1.0, 2.0], [3.0, 4.0]])
y_true = np.array([[1], [0]])
weights = np.array([[0.5], [0.5]])
biases = np.array([0.1])
activations = np.array([[0.8], [0.6]])
bp = Backpropagation()
grad_weights, grad_biases = bp.compute(X, y_true, weights, biases, activations)
print("Gradient of Loss w.r.t Weights:", grad_weights)
print("Gradient of Loss w.r.t Biases:", grad_biases)

Gradient of Loss w.r.t Weights: [[0.8]
 [1. ]]
Gradient of Loss w.r.t Biases: [0.2]

## **2.5 Training the Neural Network (3 missing blocks, 12 points in total)**

Training involves:
1. Performing a forward pass to compute predictions.
2. Calculating the loss.
3. Backpropagating gradients to update weights and biases.

The update rule is:
$W = W - \eta \cdot \frac{\partial L}{\partial W}, \quad b = b - \eta \cdot \frac{\partial L}{\partial b}$

### **Task**:
Complete the training loop to:
1. Perform forward passes.
2. Compute the loss.
3. Backpropagate gradients.
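
The three steps and the update rule can be seen working together in a compact, self-contained loop. This is only a sketch on made-up data: the toy labels, the learning rate `eta = 0.5`, and the use of a single sigmoid unit with mean binary cross-entropy are assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy, linearly separable labels

w = np.zeros(2)
b = 0.0
eta = 0.5          # learning rate (the eta in the update rule)
losses = []

for epoch in range(100):
    # 1. Forward pass.
    p = sigmoid(X @ w + b)

    # 2. Loss (predictions clipped so the logarithms stay finite).
    p_safe = np.clip(p, 1e-12, 1 - 1e-12)
    loss = np.mean(-(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe)))
    losses.append(loss)

    # 3. Gradients via the chain rule, then the gradient descent update.
    dz = (p - y) / len(y)
    w -= eta * (X.T @ dz)
    b -= eta * dz.sum()

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

With zero initial weights every prediction starts at 0.5, so the first loss equals $\log 2$. The loss should then fall as the parameters are updated.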

In [None]:
class NeuralNetworkTrainer:
    def __init__(self, perceptron, loss_function, backpropagation, learning_rate=0.01, epochs=100):
        """
        Initialize the trainer.

        Parameters:
        - perceptron: Perceptron instance for forward passes.
        - loss_function: BinaryCrossEntropy instance for loss computation.
        - backpropagation: Backpropagation instance for gradient computation.
        - learning_rate: Learning rate for gradient descent.
        - epochs: Number of training iterations.
        """
        self.perceptron = perceptron
        self.loss_function = loss_function
        self.backpropagation = backpropagation
        self.learning_rate = learning_rate
        self.epochs = epochs

    def train(self, X, y_true):
        """
        Train the neural network using gradient descent.

        Parameters:
        - X: Input features.
        - y_true: Ground truth labels.
        """
        for epoch in range(self.epochs):
            # Forward pass
            #############################################################
            # Your Turn: Use the Perceptron instance for forward passes
            
            y_pred = self.perceptron.forward(X)

            #
            #############################################################

            # Compute loss
            #############################################################
            # Your Turn: Use the BinaryCrossEntropy instance for loss computation

            loss = self.loss_function.compute(y_true, y_pred)

            #
            #############################################################

            # Backpropagation
            #############################################################
            # Your Turn: Use the Backpropagation instance to compute gradients

            grad_weights, grad_biases = self.backpropagation.compute(
                X, y_true, self.perceptron.weights, self.perceptron.bias, y_pred
            )

            #
            #############################################################

            # Update parameters
            self.perceptron.weights -= self.learning_rate * grad_weights
            self.perceptron.bias -= self.learning_rate * grad_biases

            # Print loss every 10 epochs
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}/{self.epochs}, Loss: {loss:.4f}")


# Initialize instances of the required classes
weights = np.random.rand(3, 1)
bias = np.random.rand(1)
perceptron = Perceptron(weights, bias)
loss_function = BinaryCrossEntropy()
backpropagation = Backpropagation()

# Initialize the trainer
trainer = NeuralNetworkTrainer(perceptron, loss_function, backpropagation, learning_rate=0.01, epochs=100)

# Training data
X = np.random.rand(5, 3)
y_true = np.array([[1], [0], [1], [1], [0]])

# Train the model
trainer.train(X, y_true)

# Part 3: Implementing RNNs from scratch (20 points)

## Objective
In this section, you will implement a basic Recurrent Neural Network (RNN) from scratch. You'll complete the following components:

1. Initialize RNN parameters (5 points)
2. Implement the forward pass for a single time step (5 points)
3. Process a sequence through the RNN (5 points)
4. Compute the loss and implement backpropagation (5 points)

## Hints
The equation for updating the hidden state is:

$h_t = \tanh(x_t \cdot W_{xh} + h_{t-1} \cdot W_{hh} + b_h)$

The equation for computing the output is:

$y_t = h_t \cdot W_{hy} + b_y$
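
To see how the shapes in these two equations fit together, here is a minimal NumPy sketch of a single time step. The sizes, the random initialization, and the row-vector convention (one sample per row, weights stored as input-by-output matrices) are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_size, hidden_size, output_size = 4, 3, 5, 2

# Parameters stored as (in_features, out_features).
W_xh = rng.normal(size=(input_size, hidden_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hy = rng.normal(size=(hidden_size, output_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

x_t = rng.normal(size=(batch_size, input_size))
h_prev = np.zeros((batch_size, hidden_size))     # initial hidden state

# Hidden state update followed by the output projection.
h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
y_t = h_t @ W_hy + b_y

print(h_t.shape, y_t.shape)
```

Processing a whole sequence amounts to repeating this step and feeding each `h_t` back in as the next `h_prev`.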

In [7]:
# Install PyTorch
!pip install torch



In [42]:
import torch
import torch.nn as nn
import numpy as np

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        """
        Initialize the RNN parameters.

        Parameters:
        - input_size: Size of input features
        - hidden_size: Size of hidden state
        - output_size: Size of output
        """
        super(SimpleRNN, self).__init__()

        #############################################################
        # Your Turn: Initialize the network parameters
        # Initialize:
        # - self.W_xh: Input to hidden weights
        # - self.W_hh: Hidden to hidden weights
        # - self.W_hy: Hidden to output weights
        # - self.bh: Hidden bias
        # - self.by: Output bias
        # Use torch.randn() for initialization
        
        # Weights are stored as (in_features, out_features) so that a batch of
        # row vectors can be multiplied from the left, as in the hint equations.
        # Small random values keep tanh away from saturation at the start.
        self.W_xh = nn.Parameter(torch.randn(input_size, hidden_size) * 0.01)
        self.W_hh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.01)
        self.W_hy = nn.Parameter(torch.randn(hidden_size, output_size) * 0.01)
        self.bh = nn.Parameter(torch.randn(hidden_size) * 0.01)
        self.by = nn.Parameter(torch.randn(output_size) * 0.01)

        #
        #############################################################


    def tanh(self, x):
        """Helper function: tanh activation"""
        return torch.tanh(x)

    def forward_step(self, x, h_prev):
        """
        Implement one step of the RNN.

        Parameters:
        - x: Input at current time step (batch_size, input_size)
        - h_prev: Previous hidden state (batch_size, hidden_size)

        Returns:
        - h_next: Next hidden state
        - y: Output at current time step
        """
        #############################################################
        # Your Turn: Implement the forward computation for a single time step
        # Compute:
        # 1. Next hidden state using tanh activation
        # 2. Output for this time step
        # Hint: h_next = tanh(x @ W_xh + h_prev @ W_hh + bh)
        #       y = h_next @ W_hy + by
        
        h_next = self.tanh(x @ self.W_xh + h_prev @ self.W_hh + self.bh)
        
        y = h_next @ self.W_hy + self.by

        return h_next, y

        #
        #############################################################

    def forward(self, x_sequence):
        """
        Process a sequence of inputs.

        Parameters:
        - x_sequence: Input sequence (seq_length, batch_size, input_size)

        Returns:
        - outputs: Sequence of outputs
        - hidden_states: Sequence of hidden states
        """
        # Initialize lists to store outputs and hidden states
        outputs = []
        hidden_states = []
        batch_size = x_sequence.shape[1]

        # Initialize first hidden state with zeros
        h_t = torch.zeros((batch_size, self.W_hh.shape[0]))

        #############################################################
        # Your Turn: Process sequence one step at a time
        # Hint: Use a for loop to iterate over the sequence,
        #       and call forward_step for each time step.
        #       Store the results in hidden_states and outputs lists.
        
        for t in range(x_sequence.shape[0]):
            # Feed the previous hidden state back in at every step
            h_t, y_t = self.forward_step(x_sequence[t], h_t)
            hidden_states.append(h_t)
            outputs.append(y_t)

        #
        #############################################################

        # Store results in lists and return
        return torch.stack(outputs), torch.stack(hidden_states)

def compute_loss_and_gradients(rnn, x_sequence, y_sequence):
    """
    Compute loss and gradients for the RNN.

    Parameters:
    - rnn: SimpleRNN instance
    - x_sequence: Input sequence
    - y_sequence: Target sequence

    Returns:
    - loss: Mean squared error loss
    - gradients: Dictionary of gradients for each parameter
    """
    # Convert inputs to tensors and enable gradient tracking
    x_sequence = x_sequence.clone().detach().requires_grad_(True)
    y_sequence = y_sequence.clone().detach().requires_grad_(True)

    # Clear stale gradients and make sure every parameter tracks gradients
    rnn.zero_grad()
    rnn.W_xh.requires_grad_(True)
    rnn.W_hh.requires_grad_(True)
    rnn.W_hy.requires_grad_(True)
    rnn.bh.requires_grad_(True)
    rnn.by.requires_grad_(True)

    # Forward pass
    outputs, hidden_states = rnn.forward(x_sequence)

    #############################################################
    # Your Turn: Compute MSE Loss
    
    loss = torch.mean((outputs - y_sequence) ** 2)

    #
    #############################################################

    # Backward pass
    loss.backward()

    # Collect gradients
    gradients = {
        'W_xh': rnn.W_xh.grad,
        'W_hh': rnn.W_hh.grad,
        'W_hy': rnn.W_hy.grad,
        'bh': rnn.bh.grad,
        'by': rnn.by.grad
    }

    return loss, gradients

Run the following code to test your implementation.

In [43]:
# Example usage
input_size = 10
hidden_size = 20
output_size = 5
batch_size = 32
seq_length = 15

# Create model instance
rnn = SimpleRNN(input_size, hidden_size, output_size)

# Generate sample data
x_sequence = torch.randn(seq_length, batch_size, input_size)
y_sequence = torch.randn(seq_length, batch_size, output_size)

# Forward pass
outputs, hidden_states = rnn(x_sequence)
print("Output shape:", outputs.shape)              # torch.Size([15, 32, 5])
print("Hidden states shape:", hidden_states.shape)  # torch.Size([15, 32, 20])

# Compute loss and gradients
loss, gradients = compute_loss_and_gradients(rnn, x_sequence, y_sequence)
print("Loss:", loss.item())  # varies from run to run (random data and weights)


# Declaration of Independent Work  

I hereby declare that this assignment is entirely my own work and that I have neither given nor received unauthorized assistance in completing it. I have adhered to all the guidelines provided for this assignment and have cited all sources from which I derived data, ideas, or words, whether quoted directly or paraphrased.

Furthermore, I understand that providing false declaration is a violation of the University of Arizona's honor code and will result in appropriate disciplinary action consistent with the severity of the violation.

**Name:** ___________________________  
**Date:** ___________________________  