# Assignment 12 Solutions

Submitted By: ANSARI PARVEJ

#### 1.	How does unsqueeze help us to solve certain broadcasting problems?

**Ans:** 

In PyTorch, unsqueeze is a method that increases the number of dimensions of a tensor by inserting a dimension of size 1 at a specified position. This operation is useful in broadcasting problems, where you have tensors with different shapes and you want to perform element-wise operations between them.

Broadcasting is a technique in PyTorch that allows tensors with different shapes to be used in arithmetic operations. PyTorch compares the two shapes starting from the trailing (last) dimension. Two dimensions are compatible if they are equal or if one of them is 1, and a missing leading dimension is treated as size 1. A tensor with size 1 along a dimension is then treated as if it were repeated along that dimension to match the other tensor. If any pair of dimensions is incompatible, the operation raises an error.

This is where unsqueeze can be useful. By adding a dimension of size 1 to a tensor, you can effectively match its shape with another tensor's shape, allowing for element-wise operations between them.
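As a minimal sketch of this idea (the tensor names `m` and `col` are illustrative, not from the assignment), the following adds one value per row to a 3×4 matrix. The plain addition fails because the trailing dimensions are 4 and 3, while the unsqueezed version broadcasts:

```python
import torch

m = torch.ones(3, 4)
col = torch.tensor([10., 20., 30.])  # one value per row of m, shape (3,)

# Shapes (3, 4) and (3,) are not broadcastable: trailing dims are 4 and 3.
try:
    m + col
    failed = False
except RuntimeError:
    failed = True
print("plain addition failed:", failed)

# unsqueeze(1) turns shape (3,) into (3, 1), which broadcasts against (3, 4).
out = m + col.unsqueeze(1)
print(out.shape)
print(out)
```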

#### 2.	How can we use indexing to do the same operation as unsqueeze?

**Ans:**

We can use indexing to achieve the same operation as unsqueeze. To do this, we can create a new axis with a size of 1 using the indexing syntax.

For example, let's say we have a tensor x with shape (3, 4). To add a new axis with a size of 1 at the end of the tensor, we can use the following indexing syntax:

- x[:, :, None]

Here, None creates a new axis with a size of 1. The resulting tensor will have shape (3, 4, 1).
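The equivalence can be checked directly. This is a small sketch with an arbitrary example tensor; it also shows the `...` shorthand, which stands for "all remaining dimensions":

```python
import torch

x = torch.arange(12).reshape(3, 4)

a = x.unsqueeze(2)   # explicit method call
b = x[:, :, None]    # indexing with None
c = x[..., None]     # same thing, using ... for the leading dimensions

print(a.shape, b.shape, c.shape)
print(torch.equal(a, b), torch.equal(a, c))

# A new leading axis works the same way.
front = x[None]      # equivalent to x.unsqueeze(0)
print(front.shape)
```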


#### 3.	How do we show the actual contents of the memory used for a tensor?

**Ans:**

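To show the actual contents of the memory used for a tensor in PyTorch, we can call the storage() method of the tensor object. It returns the underlying one-dimensional storage that holds the tensor's elements, independent of the tensor's shape and strides. (Converting to NumPy with numpy() shows the tensor's values in their logical shape, not the raw memory layout.) Recent PyTorch versions may emit a deprecation warning for storage() and suggest untyped_storage(), which exposes the same memory as raw bytes.

Here's an example:

In [None]:
import torch

# create a 2x3 tensor and a transposed view of it
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
t = x.t()

# both print the same flat storage (1 to 6): the transpose only changes strides
print(x.storage())
print(t.storage())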

#### 4.	When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)

In [None]:
#**Ans:** the vector is added to each row of the matrix
# (v is treated as a 1x3 row and repeated down the rows)

import torch

# create a matrix and a vector
m = torch.ones(3, 3)
v = torch.tensor([1, 2, 3])

# add the vector to the matrix
result = m + v

# print the result
print(result)

# output: 

tensor([[2., 3., 4.],
        [2., 3., 4.],
        [2., 3., 4.]])
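To make the row-versus-column distinction visible, the sketch below adds the vector to a matrix of zeros, so the result shows exactly where each element of `v` lands. The column variant uses `v[:, None]`, which is an illustrative addition and not part of the original question:

```python
import torch

m = torch.zeros(3, 3)
v = torch.tensor([1., 2., 3.])

# Default broadcasting: v has shape (3,), treated as (1, 3),
# so every row of the result is a copy of v.
by_row = m + v
print(by_row)

# Giving v shape (3, 1) instead makes every column a copy of v.
by_col = m + v[:, None]
print(by_col)
```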


#### 5.	Do broadcasting and expand_as result in increased memory use? Why or why not?

**Ans:**

The expansion itself does not increase memory use. When a tensor is broadcast, or expanded explicitly with expand_as, PyTorch does not copy its data. It produces a view that shares the original storage and gives each expanded dimension a stride of 0, so the same elements in memory are read repeatedly.

The result of an arithmetic operation on broadcast tensors is still a new tensor with the full broadcast shape, so it is allocated as usual. What broadcasting avoids is materializing an expanded copy of the smaller input before the operation.

Similarly, the expand_as method returns a tensor with the shape of the tensor passed as its argument, but backed by the same underlying storage as the original tensor. Only the shape and strides differ. Memory is copied only if the expanded tensor is later made contiguous, for example with contiguous().
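The stride trick can be inspected directly. This is a small sketch with made-up values; it checks that the expanded view points at the same memory as the original vector and uses a stride of 0 along the expanded dimension:

```python
import torch

v = torch.tensor([10., 20., 30.])
m = torch.ones(3, 3)

e = v.expand_as(m)

print(e.shape)                        # logical shape matches m
print(e.stride())                     # stride 0 along the expanded dimension
print(e.data_ptr() == v.data_ptr())   # same memory, no copy

# Forcing a contiguous layout is what actually allocates new memory.
copied = e.contiguous()
print(copied.data_ptr() == v.data_ptr())
```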

#### 6.	Implement matmul using Einstein summation.

In [None]:
#**Ans:**

import torch

# create two matrices
A = torch.randn(3, 4)
B = torch.randn(4, 5)

# matrix multiplication using Einstein summation
C = torch.einsum('ij,jk->ik', A, B)

# print the result
print(C)
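As a sanity check, the einsum result can be compared with the built-in matrix product. The batched variant below is an extra illustration (not asked for in the question) of how adding a shared leading index `b` extends the same expression:

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 4)
B = torch.randn(4, 5)

C = torch.einsum('ij,jk->ik', A, B)
same_as_matmul = torch.allclose(C, A @ B, atol=1e-6)
print(same_as_matmul)

# Batched matrix multiplication: b is carried through unchanged.
Ab = torch.randn(2, 3, 4)
Bb = torch.randn(2, 4, 5)
Cb = torch.einsum('bij,bjk->bik', Ab, Bb)
same_as_bmm = torch.allclose(Cb, torch.bmm(Ab, Bb), atol=1e-6)
print(Cb.shape, same_as_bmm)
```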


#### 7.	What does a repeated index letter represent on the lefthand side of einsum?

**Ans:**

A repeated index letter on the left-hand side of an einsum expression marks a dimension along which the inputs are multiplied element by element. If that letter is omitted from the right-hand side, the products are also summed over that index.

For example, consider the expression ii->, which computes the sum of the diagonal elements (the trace) of a square matrix. The repeated letter i selects the elements whose row and column indices are equal, and because i does not appear after ->, those elements are summed. The empty right-hand side after -> indicates that the output is a scalar, i.e., a tensor with no dimensions.
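A short sketch of the trace example, using an arbitrary 3×3 matrix. Keeping `i` on the right-hand side returns the diagonal itself instead of its sum:

```python
import torch

A = torch.arange(9.).reshape(3, 3)   # [[0,1,2],[3,4,5],[6,7,8]]

trace = torch.einsum('ii->', A)      # i omitted on the right: summed
diag = torch.einsum('ii->i', A)      # i kept on the right: no sum

print(trace)   # 0 + 4 + 8 = 12
print(diag)    # the diagonal elements 0, 4, 8
```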

#### 8.	What are the three rules of Einstein summation notation? Why?

**Ans:**

The three rules of Einstein summation notation are:

- Repeated indices on the left-hand side are implicitly summed over if they do not appear on the right-hand side.
- Free (unrepeated) indices on the left-hand side must also appear on the right-hand side, where they become the indices of the output.
- Each index can appear at most twice on the left-hand side.

These rules are used to define tensor contraction operations, which are fundamental operations in tensor algebra. The Einstein summation notation provides a concise and efficient way to express these operations using index notation.

The first rule states that repeated indices are summed over. This means that if an index appears twice in an expression and is absent from the output, the expression represents a sum over that index. For example, the expression A_i B_i represents the sum of the element-wise products of the vectors A and B, which is their dot product.

The second rule states that free indices on the left-hand side of the expression correspond to output indices on the right-hand side. This means that an index that appears only once on the left-hand side is not summed and is kept as an index of the output tensor. For example, the expression A_i B_j represents an outer product, with the output tensor having indices (i, j).

The third rule states that each index can appear at most twice in a given term. A repeated index pairs up two dimensions that are multiplied together and summed, and a third occurrence would make that pairing ambiguous. For example, the expression A_i B_i C_i is not valid under this convention, since the index i appears three times.

These rules provide a consistent and concise notation for manipulating tensors in linear algebra, and are used extensively in various scientific fields and in machine learning frameworks like PyTorch and TensorFlow.
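The rules can be illustrated with `torch.einsum` on two small vectors (the values are arbitrary). The sketch shows a repeated index that is summed, two free indices that are kept, and a repeated index that is kept on the right-hand side and therefore not summed:

```python
import torch

a = torch.tensor([1., 2., 3.])
b = torch.tensor([4., 5., 6.])

# Repeated index i, absent from the output: summed (dot product).
dot = torch.einsum('i,i->', a, b)

# Free indices i and j both appear in the output (outer product).
outer = torch.einsum('i,j->ij', a, b)

# Repeated index i kept in the output: multiplied but not summed.
elementwise = torch.einsum('i,i->i', a, b)

print(dot)            # 1*4 + 2*5 + 3*6 = 32
print(outer.shape)    # (3, 3)
print(elementwise)    # 4, 10, 18
```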

#### 9.	What are the forward pass and backward pass of a neural network?

**Ans:**

The forward pass and backward pass are two key steps in the training of a neural network.

The forward pass is the computation of the output of the neural network for a given input. The input is fed into the network, and the network applies a series of mathematical operations (such as linear transformations, non-linear activations, pooling, etc.) to the input data to compute the output. The forward pass is essentially the process of propagating the input through the neural network to compute the output.

The backward pass, also known as backpropagation, is the process of computing the gradients of the loss function with respect to the weights of the neural network. In other words, it measures how much each weight contributed to the error in the output. The gradients are computed using the chain rule of calculus, starting from the loss and working backwards through each intermediate variable in the network. The weights themselves are then adjusted in a separate optimizer step that uses these gradients to reduce the error.
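A minimal sketch of both passes on a single linear layer with a mean squared error loss. The data, sizes, and learning rate are made up for illustration. The gradient that autograd produces in the backward pass is compared with the formula obtained by hand from the chain rule:

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 3)                     # 8 samples, 3 features
y = torch.randn(8, 1)                     # targets
w = torch.randn(3, 1, requires_grad=True)

# Forward pass: input -> prediction -> loss.
pred = x @ w
loss = ((pred - y) ** 2).mean()

# Backward pass: autograd fills w.grad with d(loss)/d(w).
loss.backward()

# The same gradient derived by hand: (2 / N) * x^T (pred - y).
manual_grad = 2 * x.t() @ (pred - y).detach() / len(x)
grads_match = torch.allclose(w.grad, manual_grad, atol=1e-6)
print(grads_match)

# The weight update is a separate step that uses the gradient.
with torch.no_grad():
    w -= 0.1 * w.grad
```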

#### 10.	Why do we need to store some of the activations calculated for intermediate layers in the forward pass?

**Ans:**

Some of the activations calculated for intermediate layers must be stored because the backward pass (backpropagation) reuses them when it applies the chain rule.

The gradient of the loss with respect to a layer's weights depends on the input that layer received in the forward pass, which is the activation of the previous layer. Likewise, the gradient of a non-linearity such as ReLU depends on the value it was applied to. These values are computed once in the forward pass, so they are kept until the backward pass has used them.

If they were not stored, they would have to be recomputed from the input during the backward pass, which costs extra computation. Storing them trades memory for speed and lets the gradients for every layer be computed in a single sweep from the output back to the input.
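To see where the stored values are used, here is a hand-written backward pass for a tiny two-layer network with a ReLU in between. The network, its sizes, and the sum used as a stand-in loss are assumptions made for this sketch. The manual gradients are checked against autograd:

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)                        # batch of 5 inputs
w1 = torch.randn(3, 4, requires_grad=True)
w2 = torch.randn(4, 1, requires_grad=True)

# Forward pass: keep the intermediate results z1 and a1.
z1 = x @ w1               # hidden pre-activation
a1 = z1.clamp_min(0.)     # ReLU activation
out = a1 @ w2
loss = out.sum()          # stand-in loss for illustration

# Manual backward pass, reusing the values stored above.
with torch.no_grad():
    grad_out = torch.ones_like(out)          # d(loss)/d(out)
    grad_w2 = a1.t() @ grad_out              # needs the stored activation a1
    grad_a1 = grad_out @ w2.t()
    grad_z1 = grad_a1 * (z1 > 0).float()     # needs z1 to know where ReLU was active
    grad_w1 = x.t() @ grad_z1                # needs the layer input x

loss.backward()
w1_ok = torch.allclose(grad_w1, w1.grad, atol=1e-6)
w2_ok = torch.allclose(grad_w2, w2.grad, atol=1e-6)
print(w1_ok, w2_ok)
```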

#### 11.	What is the downside of having activations with a standard deviation too far away from 1?

**Ans:**

Having activations with a standard deviation that is too far away from 1 can lead to a number of issues in training neural networks.

When the standard deviation of the activations is too small (i.e., close to 0), each layer scales the signal down further, so after many layers the activations shrink toward zero. The gradients flowing backwards shrink in the same way and become too small to update the weights effectively, leading to slow convergence or even stagnation of the training process. This is known as the vanishing gradient problem.

On the other hand, when the standard deviation of the activations is too large, each layer scales the signal up, so after many layers the activations grow until they overflow. The gradients grow in the same way and cause the weights to update too much, leading to unstable training and divergence. This is known as the exploding gradient problem.

In both cases, having activations with a standard deviation too far away from 1 can lead to unstable training and poor performance of the neural network.
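The effect can be reproduced with repeated matrix multiplications standing in for the layers of a deep network. This is a rough sketch with no non-linearities; the layer width of 100, depth of 50, and the helper name `final_std` are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)

def final_std(weight_scale, n_layers=50):
    """Push random inputs through n_layers random linear layers
    and return the standard deviation of the final activations."""
    x = torch.randn(200, 100)
    for _ in range(n_layers):
        x = x @ (torch.randn(100, 100) * weight_scale)
    return x.std()

print(final_std(1.0))    # weights too large: activations overflow
print(final_std(0.01))   # weights too small: activations underflow to zero
print(final_std(0.1))    # scale 1/sqrt(100): activations stay at a usable size
```

With a scale of 1 each layer multiplies the standard deviation by roughly 10, and with a scale of 0.01 by roughly 0.1, so 50 layers are enough to leave the range that 32-bit floats can represent.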

#### 12.	How can weight initialization help avoid this problem?

**Ans:**

Weight initialization is an important technique for avoiding the problem of vanishing or exploding gradients in neural networks. By initializing the weights of the network appropriately, we can ensure that the activations have a reasonable standard deviation and avoid the problems associated with having activations that are too small or too large.

One common approach to weight initialization is to use a Gaussian distribution with mean 0 and a standard deviation that depends on the size of the input and output layers of the weight matrix. For example, the Xavier (Glorot) initialization method initializes the weights with a Gaussian distribution with mean 0 and standard deviation sqrt(2 / (n_in + n_out)), where n_in and n_out are the number of input and output neurons of the weight matrix, respectively. This scaling is designed to keep the variance of the activations and of the gradients roughly constant from layer to layer, which helps to avoid vanishing or exploding gradients. For layers followed by ReLU, Kaiming (He) initialization, with standard deviation sqrt(2 / n_in), is a common alternative.
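PyTorch provides this initializer as `torch.nn.init.xavier_normal_`. The sketch below applies it to an empty weight tensor (the layer sizes are arbitrary) and compares the empirical standard deviation with the formula:

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
n_in, n_out = 512, 256

# Same layout as the weight of nn.Linear(n_in, n_out): (n_out, n_in).
w = torch.empty(n_out, n_in)
nn.init.xavier_normal_(w)

expected_std = math.sqrt(2 / (n_in + n_out))
print(w.std().item(), expected_std)   # the two values should be close
```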