# Basics

Let's go through each of these PyTorch functions to understand their uses and differences:

### `torch.matmul`

`torch.matmul` performs matrix multiplication between two tensors. The behavior depends on the dimensionality of the tensors:
- For two 1-D tensors, it computes their dot product.
- For two 2-D tensors, it computes their matrix multiplication.
- For higher-dimensional tensors, it performs batched matrix multiplication, treating the extra dimensions as batch dimensions.

Example:
```python
import torch

a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[5, 6], [7, 8]])
result = torch.matmul(a, b)  # tensor([[19, 22], [43, 50]])
```
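
The dimensionality rules listed above can be checked with a short sketch. The tensor names and values here are illustrative only; the last case also shows that leading batch dimensions are broadcast, so a single 2-D matrix can be reused for every entry of a batch.

```python
import torch

# 1-D @ 1-D: dot product, returned as a 0-D (scalar) tensor
v = torch.tensor([1., 2., 3.])
w = torch.tensor([4., 5., 6.])
dot = torch.matmul(v, w)             # 1*4 + 2*5 + 3*6 = 32

# 2-D @ 1-D: matrix-vector product
m = torch.tensor([[1., 0., 0.],
                  [0., 2., 0.]])     # shape (2, 3)
mv = torch.matmul(m, v)              # shape (2,), values [1., 4.]

# Batched: the last two dimensions are multiplied, leading ones are broadcast
batch = torch.randn(4, 2, 3)         # 4 matrices of shape 2x3
shared = torch.randn(3, 5)           # one 3x5 matrix reused for each batch entry
out = torch.matmul(batch, shared)    # shape (4, 2, 5)
```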

### `torch.bmm`

`torch.bmm` performs batch matrix multiplication between two 3-D tensors. Each tensor's first dimension is considered as the batch size, and `torch.bmm` performs matrix multiplication for each pair of matrices in the batch.

Example:
```python
a = torch.rand(3, 2, 5)  # 3 matrices of shape 2x5
b = torch.rand(3, 5, 4)  # 3 matrices of shape 5x4
result = torch.bmm(a, b)  # 3 matrices of shape 2x4
```
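
As a rough check of what "each pair of matrices in the batch" means, the sketch below compares `torch.bmm` with an explicit Python loop and with `torch.matmul`. The seed and variable names are arbitrary. Unlike `torch.matmul`, `torch.bmm` does not broadcast, so both inputs must be 3-D with the same batch size.

```python
import torch

torch.manual_seed(0)
a = torch.rand(3, 2, 5)  # 3 matrices of shape 2x5
b = torch.rand(3, 5, 4)  # 3 matrices of shape 5x4

bmm_result = torch.bmm(a, b)

# Multiply each pair of matrices one at a time and stack the results
loop_result = torch.stack([a[i] @ b[i] for i in range(a.shape[0])])

# For two 3-D inputs with equal batch size, matmul computes the same thing
matmul_result = torch.matmul(a, b)

same_as_loop = torch.allclose(bmm_result, loop_result)
same_as_matmul = torch.allclose(bmm_result, matmul_result)
```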

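### `torch.swapdims` (alias of `torch.transpose`)

`torch.swapdims` swaps two dimensions of a tensor. It is an alias of `torch.transpose`, so the two can be used interchangeably. It's useful for reordering the dimensions of a tensor without copying its data: the result is a view that shares storage with the input.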

Example:
```python
x = torch.randn(2, 3, 5)
result = torch.swapdims(x, 0, 1)  # Swaps the first and second dimension
```
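
A small sketch of the "no data is changed" point: the swapped tensor is a view on the same storage, so writing through it also changes the original. The values are illustrative.

```python
import torch

x = torch.arange(6).reshape(2, 3)        # [[0, 1, 2], [3, 4, 5]]
y = torch.swapdims(x, 0, 1)              # shape (3, 2)

# swapdims and transpose give the same result
same_as_transpose = torch.equal(y, torch.transpose(x, 0, 1))

# y is a view: y[0, 1] and x[1, 0] refer to the same element
y[0, 1] = 100
shares_memory = x[1, 0].item() == 100

# The swapped view is not contiguous in memory here;
# .contiguous() returns a compact copy when one is needed
is_contig = y.is_contiguous()
y_copy = y.contiguous()
```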

### `torch.unsqueeze`

`torch.unsqueeze` adds a dimension of size one at a specified position in the tensor's shape. It's useful for increasing the dimensionality of a tensor, often for aligning tensor shapes for broadcasting.

Example:
```python
x = torch.tensor([1, 2, 3])
result = torch.unsqueeze(x, 0)  # Result shape is now (1, 3)
```

### `torch.squeeze`

`torch.squeeze` removes all dimensions of size one from the tensor's shape. If a specific dimension is specified, it removes that dimension only if it is of size one.

Example:
```python
x = torch.rand(1, 3, 1, 5)
result = torch.squeeze(x)  # Result shape is now (3, 5)
```
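
The sketch below illustrates two points from the descriptions above: `squeeze` with an explicit `dim` only removes that dimension when its size is one, and `unsqueeze` can line shapes up for broadcasting. The variable names are made up for the example.

```python
import torch

x = torch.rand(1, 3, 1, 5)
only_dim0 = torch.squeeze(x, dim=0)      # shape (3, 1, 5)
no_change = torch.squeeze(x, dim=1)      # dim 1 has size 3, so the shape is unchanged

# unsqueeze to align shapes for broadcasting: an "outer sum" of two vectors
row = torch.tensor([1, 2, 3])            # shape (3,)
col = torch.tensor([10, 20])             # shape (2,)
table = col.unsqueeze(1) + row.unsqueeze(0)   # (2, 1) + (1, 3) -> (2, 3)
```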

### `torch.argmax`

`torch.argmax` returns the indices of the maximum value of all elements in the input tensor. It can also operate along a specified dimension.

Example:
```python
x = torch.tensor([[1, 2, 3], [4, 3, 2]])
result = torch.argmax(x)  # Returns the index of the overall max value, which is 3 (flattened index)
result_dim = torch.argmax(x, dim=1)  # Returns the indices of max values along dim 1, which is [2, 0]
```
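
When no dimension is given, the returned index refers to the flattened tensor. One possible way to turn it back into row and column coordinates, using plain integer arithmetic, is sketched here together with a reduction along `dim=0`:

```python
import torch

x = torch.tensor([[1, 2, 3], [4, 3, 2]])

flat_idx = torch.argmax(x).item()             # 3 (index into the flattened tensor)
row, col = divmod(flat_idx, x.shape[1])       # row 1, column 0

per_column = torch.argmax(x, dim=0)           # best row for each column: [1, 1, 0]
per_row_kept = torch.argmax(x, dim=1, keepdim=True)   # shape (2, 1): [[2], [0]]
```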

Each of these functions serves a specific purpose in tensor manipulation, from performing mathematical operations to altering the shape or dimensionality of tensors, which are fundamental in developing and manipulating models in PyTorch.

In [26]:
a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[5, 6], [7, 8]])
result = torch.matmul(a, b)
result

tensor([[19, 22],
        [43, 50]])

In [27]:
a = torch.rand(3, 2, 5)  # 3 matrices of shape 2x5
b = torch.rand(3, 5, 4)  # 3 matrices of shape 5x4
result = torch.bmm(a, b)  # 3 matrices of shape 2x4
result

tensor([[[0.9343, 1.5857, 1.6803, 0.6953],
         [0.7715, 1.1902, 1.1333, 0.3764]],

        [[1.7452, 1.3827, 1.5974, 0.9578],
         [1.7699, 1.7691, 1.9269, 1.1248]],

        [[0.6482, 0.5338, 0.5232, 1.0784],
         [0.8460, 0.8159, 0.5877, 1.6714]]])

In [28]:
x = torch.randn(2, 3, 5)
result = torch.swapdims(x, 0, 1)  # Swaps the first and second dimension
result

tensor([[[ 0.5636, -1.5072, -1.6107, -1.4790,  0.4323],
         [ 0.9733, -1.0151, -0.5419, -0.4410, -0.3136]],

        [[-0.1250,  0.7821,  0.5635, -0.1091,  0.7152],
         [-0.1293, -0.7150,  2.1698,  2.0207,  0.2539]],

        [[ 0.0391,  1.3059,  0.2466, -1.9776,  0.3370],
         [ 0.9364,  0.7122, -0.0318,  0.1016,  1.3433]]])

In [29]:
x = torch.tensor([1, 2, 3])
result = torch.unsqueeze(x, 0)  # Result shape is now (1, 3)
result

tensor([[1, 2, 3]])

In [30]:
x = torch.rand(1, 3, 1, 5)
result = torch.squeeze(x)  # Result shape is now (3, 5)
result

tensor([[0.2606, 0.0931, 0.9193, 0.2999, 0.6325],
        [0.3265, 0.5406, 0.9662, 0.7304, 0.0667],
        [0.6985, 0.9746, 0.6315, 0.8352, 0.9929]])

In [31]:
x = torch.tensor([[1, 2, 3], [4, 3, 2]])
result = torch.argmax(x)  # Returns the index of the overall max value, which is 3 (flattened index)
result_dim = torch.argmax(x, dim=1)  # Returns the indices of max values along dim 1, which is [2, 0]
result

tensor(3)

`torch.sum` is a PyTorch function that computes the sum of all elements in the input tensor, or along a specified dimension if given. It is a versatile function that can be used for reducing tensors by summing their elements, which is useful in various mathematical and neural network operations.

### Syntax

The basic syntax of `torch.sum` is as follows:

```python
torch.sum(input, dim=None, keepdim=False, dtype=None) -> Tensor
```

- **`input`** (Tensor): The input tensor whose elements you want to sum.
- **`dim`** (int or tuple of ints, optional): The dimension or dimensions along which the elements will be summed. If not specified, the sum of all elements will be returned.
- **`keepdim`** (bool, optional): Whether the output tensor has `dim` retained or not. If `True`, the output tensor will have the same number of dimensions as the input, with the length of 1 in the reduced dimensions. Default is `False`, which reduces the dimension.
- **`dtype`** (torch.dtype, optional): The desired data type of the returned tensor. If specified, the input tensor is cast to `dtype` before performing the operation. Default is `None`, which infers the dtype from the input tensor.

### Examples

#### Sum of All Elements

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6]])
total_sum = torch.sum(a)
print(total_sum)  # Output: tensor(21)
```

#### Sum Along a Specific Dimension

```python
column_sum = torch.sum(a, dim=0)  # Sum along columns
print(column_sum)  # Output: tensor([5, 7, 9])

row_sum = torch.sum(a, dim=1)  # Sum along rows
print(row_sum)  # Output: tensor([ 6, 15])
```

#### Keeping Dimension After Sum

```python
column_sum_keepdim = torch.sum(a, dim=0, keepdim=True)
print(column_sum_keepdim)  # Output: tensor([[5, 7, 9]])

row_sum_keepdim = torch.sum(a, dim=1, keepdim=True)
print(row_sum_keepdim)  # Output: tensor([[ 6], [15]])
```
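
The examples above do not exercise the `dtype` argument or a tuple-valued `dim`. A minimal sketch of both follows, along with the common idiom of summing a boolean mask to count matches; the tensors are illustrative.

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6]])

# dtype: return the sum as float32 instead of the integer default
float_sum = torch.sum(a, dtype=torch.float32)     # tensor(21.)

# dim as a tuple: reduce several dimensions in one call
t = torch.ones(2, 3, 4)
per_sample = torch.sum(t, dim=(1, 2))             # shape (2,), each entry 3 * 4 = 12

# Summing a boolean mask counts the elements that satisfy a condition
num_greater_than_2 = torch.sum(a > 2)             # 3, 4, 5, 6 -> tensor(4)
```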

### Use Cases

`torch.sum` is widely used in neural network operations, such as computing loss functions, normalizing data, implementing custom layers or functions, and aggregating model outputs. It's a fundamental operation for tensor manipulation and analysis in PyTorch.

In [None]:
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6]])
total_sum = torch.sum(a)
print(total_sum)  # Output: tensor(21)

In [None]:
column_sum = torch.sum(a, dim=0)  # Sum along columns
print(column_sum)  # Output: tensor([5, 7, 9])

row_sum = torch.sum(a, dim=1)  # Sum along rows
print(row_sum)  # Output: tensor([ 6, 15])

In [None]:
column_sum_keepdim = torch.sum(a, dim=0, keepdim=True)
print(column_sum_keepdim)  # Output: tensor([[5, 7, 9]])

row_sum_keepdim = torch.sum(a, dim=1, keepdim=True)
print(row_sum_keepdim)  # Output: tensor([[ 6], [15]])

### Manipulate an image using PyTorch

1. **Load the Image**: You load an image from a URL. This step needs a Python environment with internet access.

2. **Convert to NumPy Array**: Then, you convert the PIL image to a NumPy array. This is a common practice when working with images in Python, as it allows for easier manipulation.

3. **Convert to PyTorch Tensor**: Next, you convert the NumPy array to a PyTorch tensor. This is useful for performing tensor operations using PyTorch.

4. **Print Tensor Shape**: You print the shape of the tensor to understand its dimensions, which typically are `(height, width, channels)` for an image.

5. **Transpose the Image**: You transpose the image with `torch.transpose`, swapping the height and width dimensions. On a tensor of shape `(height, width, channels)` this returns a tensor of shape `(width, height, channels)`. The channel dimension stays in place, so the result can still be displayed with `plt.imshow`.

6. **Permute the Image**: Lastly, you use `torch.permute`, which takes the complete new ordering of the dimensions. `torch.permute(t_img, (1, 0, 2))` gives the same result as `torch.transpose(t_img, 0, 1)`. `permute` becomes necessary when more than two dimensions have to move at once.

Here's one way to transpose an image tensor and visualize it using Matplotlib, assuming `t_img` is your image tensor:

```python
import matplotlib.pyplot as plt
import numpy as np
import torch

# Assuming t_img is your image tensor with shape [H, W, C]

# Swap height and width, keeping channels last: [H, W, C] -> [W, H, C]
t_img_T = torch.permute(t_img, (1, 0, 2))

# Convert the transposed tensor to a NumPy array and visualize
plt.imshow(t_img_T.numpy())
plt.show()
```

The key takeaway is that `torch.transpose` swaps exactly two dimensions, while `torch.permute` reorders all of them in a single call. For a plain height/width swap of an image tensor the two are interchangeable.
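
To make the relationship between the two functions concrete, here is a sketch on a synthetic tensor standing in for an image (the 4x6 size is arbitrary, and no real image is loaded). It also shows a reordering that needs `torch.permute`: moving to the channels-first layout `[C, H, W]` and back.

```python
import torch

# Synthetic stand-in for an RGB image: height 4, width 6, 3 channels
t_img = torch.arange(4 * 6 * 3).reshape(4, 6, 3)

# Swapping height and width: both calls give shape [W, H, C]
swapped_transpose = torch.transpose(t_img, 0, 1)
swapped_permute = torch.permute(t_img, (1, 0, 2))
identical = torch.equal(swapped_transpose, swapped_permute)

# Channels-first layout: all three dimensions move, so permute is used
chw = torch.permute(t_img, (2, 0, 1))    # [C, H, W]

# Back to channels-last, the layout plt.imshow expects
hwc = torch.permute(chw, (1, 2, 0))      # [H, W, C]
```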

In [None]:
from PIL import Image
import requests
from io import BytesIO
import numpy as np
from matplotlib import pyplot as plt

In [None]:
plt.imshow(img)
plt.show()

In [None]:
t_img = torch.tensor(img)
print(t_img.shape)

In [None]:
t_img_T = torch.transpose(t_img, dim0=1, dim1=0)
plt.imshow(np.array(t_img_T))
plt.show()

In [None]:
t_img_T = torch.permute(t_img, (1, 0, 2))
plt.imshow(np.array(t_img_T))
plt.show()

# Autograds

# Embeddings

This is a simple example that demonstrates how to use the `torch.nn.Embedding` layer as part of a larger model, such as a basic neural network for processing sequences. It covers creating an embedding layer, using it within a model, and a simple training loop. It is a toy example for educational purposes.

In [None]:
# Import Necessary Libraries
import torch
import torch.nn as nn
import torch.optim as optim

This model will be a simple feed-forward neural network for classification purposes, with one embedding layer followed by a couple of linear layers.

In [None]:
# Simple Model with an Embedding Layer
class SimpleNNWithEmbedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(SimpleNNWithEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.embedding(x)  # Apply embedding layer
        x = torch.mean(x, dim=1)  # Example of pooling/aggregating embeddings
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

In [None]:
# Initialize the Model, Loss Function, and Optimizer

# Parameters for our model
vocab_size = 100  # Example vocabulary size
embedding_dim = 10  # Size of each embedding vector
hidden_dim = 16  # Hidden dimension size
output_dim = 2  # Output dimension size (e.g., for binary classification)

model = SimpleNNWithEmbedding(vocab_size, embedding_dim, hidden_dim, output_dim)

loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
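# Training Loop
# NOTE: `data_loader` is not defined in this notebook. It is assumed to yield
# (inputs, labels) batches, with inputs as integer token ids of shape
# (batch, seq_len) and labels as class indices of shape (batch,).
for epoch in range(1):  # Just for demo, run for more epochs
    for inputs, labels in data_loader: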
        optimizer.zero_grad()  # Zero the gradients
        outputs = model(inputs)  # Forward pass
        loss = loss_function(outputs, labels)  # Compute loss
        loss.backward()  # Backward pass
        optimizer.step()  # Update weights
    print(f"Epoch {epoch}, Loss: {loss.item()}")

*    This example simplifies many aspects of training a neural network, such as data preparation and evaluation.

*    The forward method of our model applies the embedding layer to the input indices, aggregates the embeddings (in this case, by averaging), and then passes the result through additional layers.

*    The training loop consists of the typical steps: forward pass, loss computation, backward pass, and parameter update.

This example should give you a starting point for understanding how to incorporate embedding layers into PyTorch models and how they can be trained. Remember to adapt and expand upon this example based on the specifics of your task and data.
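
The training loop above relies on a `data_loader` that the notebook never defines. Below is a minimal end-to-end sketch under loud assumptions: the token ids and labels are random (so nothing meaningful is learned), and `TinyEmbeddingClassifier` is a hypothetical, shortened stand-in for the model above.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
vocab_size, seq_len, num_samples = 100, 8, 64

# Hypothetical toy data: random token ids and random binary labels
token_ids = torch.randint(0, vocab_size, (num_samples, seq_len))
labels = torch.randint(0, 2, (num_samples,))
data_loader = DataLoader(TensorDataset(token_ids, labels), batch_size=16, shuffle=True)

class TinyEmbeddingClassifier(nn.Module):
    """Embedding -> mean pooling over the sequence -> linear classifier."""
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 10)
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(self.embedding(x).mean(dim=1))

model = TinyEmbeddingClassifier()
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epoch_losses = []
for epoch in range(2):
    running = 0.0
    for inputs, targets in data_loader:
        optimizer.zero_grad()
        loss = loss_function(model(inputs), targets)
        loss.backward()
        optimizer.step()
        running += loss.item() * inputs.size(0)
    epoch_losses.append(running / num_samples)  # mean loss over the epoch
```

Averaging the loss over the whole epoch, rather than printing only the last batch's loss, gives a steadier number to watch from one epoch to the next.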

# Parameters

In PyTorch, `torch.nn.parameter.Parameter` is a subclass of `torch.Tensor` that is used to represent parameters of a model. Parameters get special treatment: when a `Parameter` is assigned as an attribute of an `nn.Module`, it is automatically added to the module's parameter list (`model.parameters()`). Passing that list to an optimizer means that the gradients computed during the backward pass are applied to these tensors when `optimizer.step()` is called.

### Key Characteristics

- **Trainable**: By default, parameters are considered trainable, meaning their values are adjusted through backpropagation during the training process.
- **Automatically Registered**: When used within a `torch.nn.Module`, any `Parameter` is automatically registered as a parameter of the module. This is done by assigning `Parameter` objects as attributes of the module.
- **Included in Model's Parameters**: When you call `model.parameters()`, it returns an iterator of all parameters in the model, which includes any instance of `Parameter`.

### Usage

The primary use of `Parameter` is to define variables that should be considered as parameters of a model—weights, biases, etc.—that will be learned during the training process.

### Example

When you define a custom layer or module, you might use `Parameter` to explicitly define tensors that should be treated as parameters:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomLinearLayer(nn.Module):
    def __init__(self, input_features, output_features):
        super(CustomLinearLayer, self).__init__()
        # Define weight and bias as Parameters, making them part of the model's parameters.
        # F.linear computes x @ weight.T + bias, so weight has shape (output_features, input_features).
        self.weight = nn.Parameter(torch.empty(output_features, input_features))
        self.bias = nn.Parameter(torch.empty(output_features))

        # Initialize parameters (weights and biases)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))  # Example initialization
        fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, x):
        # Use the parameters in a forward pass
        return F.linear(x, self.weight, self.bias)
```

In this example, `self.weight` and `self.bias` are instances of `Parameter`, which makes them automatically trainable parameters of the `CustomLinearLayer` module. This means they will be updated by the optimizer during training.

### Conclusion

`torch.nn.parameter.Parameter` is a powerful tool for defining trainable parameters within PyTorch models. It ensures that parameters are automatically considered for updates during the training process, making it easier to build and train complex neural networks.
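
The registration behavior described in this section can be observed directly. In this sketch, `Demo` and its attribute names are made up for illustration: a `Parameter` attribute shows up in `named_parameters()`, a plain tensor attribute does not, and `requires_grad=False` keeps a parameter registered but frozen.

```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(3))   # registered and trainable
        self.offset = torch.zeros(3)               # plain tensor: not registered
        self.frozen = nn.Parameter(torch.ones(3), requires_grad=False)  # registered, not trained

demo = Demo()
names = [name for name, _ in demo.named_parameters()]   # ['scale', 'frozen']
saved_keys = list(demo.state_dict().keys())             # 'offset' is absent here too
```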

In [None]:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomLinearLayer(nn.Module):
    def __init__(self, input_features, output_features):
        super(CustomLinearLayer, self).__init__()
        # Define weight and bias as Parameters, making them part of the model's parameters.
        # F.linear computes x @ weight.T + bias, so weight has shape (output_features, input_features).
        self.weight = nn.Parameter(torch.empty(output_features, input_features))
        self.bias = nn.Parameter(torch.empty(output_features))

        # Initialize parameters (weights and biases)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))  # Example initialization
        fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, x):
        # Use the parameters in a forward pass
        return F.linear(x, self.weight, self.bias)

* What is new to us is the `nn.Parameter` class.
* It is just a subclass of `torch.Tensor` with `requires_grad` set to `True` by default.
* Its constructor signature is `torch.nn.parameter.Parameter(data=None, requires_grad=True)`.
* A method in the **base class** (`nn.Module`) looks for **Parameter** attributes in the module. If one is found, it is registered in the collection (in fact, a dict) of parameters.

# Linear

`torch.nn.Linear` is a module provided by PyTorch that applies a linear transformation to the incoming data. It's one of the most commonly used layers in neural networks for creating fully connected layers. The linear transformation it applies can be mathematically represented as \(y = xA^T + b\), where:
- \(x\) is the input feature vector,
- \(A\) is the weight matrix,
- \(b\) is the bias vector, and
- \(y\) is the output of the module.

### Parameters

The `torch.nn.Linear` module has the following main parameters:

- `in_features`: size of each input sample (the size of \(x\)),
- `out_features`: size of each output sample (the dimension of \(y\)),
- `bias`: a boolean value indicating whether to include a bias term \(b\) in the transformation. The default value is `True`.

### Usage

Here's a basic example of how to use the `torch.nn.Linear` module:

```python
import torch
import torch.nn as nn

# Create a Linear layer
linear_layer = nn.Linear(in_features=10, out_features=5)

# Example input (batch_size=1, in_features=10)
input_tensor = torch.randn(1, 10)

# Apply the linear layer
output_tensor = linear_layer(input_tensor)

print(output_tensor)
```

In this example, `linear_layer` is an instance of `torch.nn.Linear` that takes inputs with 10 features and transforms them into outputs with 5 features. This is achieved by multiplying the input by the transpose of the weight matrix (stored with shape [5, 10], that is `[out_features, in_features]`) and adding a bias vector (of size [5]) if `bias` is `True`.

### Key Points

- **Weight and Bias**: The weights and bias of the linear layer are automatically initialized but can be customized post-initialization. They are accessible through `linear_layer.weight` and `linear_layer.bias`, respectively, and are instances of `torch.nn.parameter.Parameter`, meaning they are trainable parameters.
- **Training**: During the training process, these parameters are automatically adjusted through backpropagation to minimize the loss function.
- **Applications**: `torch.nn.Linear` is a fundamental building block in neural networks and can be used in various architectures, including feedforward neural networks, convolutional neural networks (for fully connected layers), and recurrent neural networks.

`torch.nn.Linear` is a versatile and essential module in PyTorch, making it straightforward to add fully connected layers to your models.
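
As a quick check of the formula \(y = xA^T + b\), the sketch below inspects the stored parameter shapes and reproduces the layer's output by hand. The seed and batch size are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear_layer = nn.Linear(in_features=10, out_features=5)
x = torch.randn(3, 10)                              # batch of 3 samples

weight_shape = tuple(linear_layer.weight.shape)     # (5, 10): [out_features, in_features]
bias_shape = tuple(linear_layer.bias.shape)         # (5,)

# y = x A^T + b, written out with the layer's own parameters
manual = x @ linear_layer.weight.T + linear_layer.bias
matches = torch.allclose(linear_layer(x), manual, atol=1e-5)
```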

In [None]:
import torch
import torch.nn as nn

# Create a Linear layer
linear_layer = nn.Linear(in_features=10, out_features=5)

# Example input (batch_size=1, in_features=10)
input_tensor = torch.randn(1, 10)

# Apply the linear layer
output_tensor = linear_layer(input_tensor)

print(output_tensor)

**Base class**

 * The **nn** package contains all the components to **conveniently** build any deep learning architecture.
 * It provides different types of layers (called modules in PyTorch), such as the `Linear` layer, convolutional layers, and so on.
 * All the modules or models that we create must subclass the `nn.Module` base class.

 * They must **implement the forward** method in the subclass. The rest is taken care of by the methods defined in the base class.

 * I really urge you to take a look at the source code of **nn.Module**.

 * Let's try to reproduce one of the existing layers (modules), called Linear.

In [None]:
class LinearLayer(nn.Module):

  def __init__(self,in_features,out_features):
    super(LinearLayer,self).__init__()
    self.in_features = in_features
    self.out_features = out_features
    self.w = nn.Parameter(torch.randn(in_features, out_features))
    self.b = nn.Parameter(torch.randn(out_features))

  def forward(self,x):
    out = torch.matmul(x,self.w)+self.b
    return out

# LayerNorm

`torch.nn.LayerNorm` is a PyTorch module that applies Layer Normalization to the input data. Layer Normalization is a technique to normalize the inputs across the features for each data sample independently, which can help stabilize the learning process and lead to faster convergence. It's particularly useful in deep learning models where the normalization of activations can significantly impact performance.

### How LayerNorm Works

Layer Normalization works by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training sample. Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization performs normalization for each individual sample. This makes it particularly effective in situations where batch sizes are small or vary between iterations.

### Parameters

`torch.nn.LayerNorm` has the following parameters:

- `normalized_shape`: the shape of the input tensor, or a subset of the input tensor dimensions, that should be normalized. For example, if the input tensor has the shape `(N, C, H, W)` (batch size, channels, height, width), and you wish to normalize across the `C, H, W` dimensions, you would set `normalized_shape` to `(C, H, W)`.
- `eps`: a value added to the denominator for numerical stability. The default value is `1e-5`.
- `elementwise_affine`: a boolean indicating whether to learn affine parameters (scale and shift) for each feature. The default value is `True`. If `True`, it learns a scale and shift parameter for each feature dimension.

### Usage

Here's an example of how to use `torch.nn.LayerNorm`:

```python
import torch
import torch.nn as nn

# Example input tensor of shape (batch_size, num_features)
input_tensor = torch.randn(2, 5)

# Apply LayerNorm
layer_norm = nn.LayerNorm(normalized_shape=5, elementwise_affine=True)
output_tensor = layer_norm(input_tensor)

print(output_tensor)
```

In this example, `layer_norm` is an instance of `torch.nn.LayerNorm` that normalizes the input tensor across its features (`num_features` dimension). The `normalized_shape` parameter is set to the number of features in the input tensor. If `elementwise_affine` is `True`, the module also learns a scale and shift for each feature dimension, further enhancing the capability of the model to fit the training data.
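
To see what the module computes, the normalization can be reproduced by hand. This is a sketch that relies on a freshly constructed `nn.LayerNorm`, whose learnable scale and shift start at 1 and 0, so its output is just the normalized input. Note that the variance is the biased one (divided by the number of features).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5)                        # (batch_size, num_features)
layer_norm = nn.LayerNorm(normalized_shape=5)

# Per-sample statistics over the feature dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)   # biased variance
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

out = layer_norm(x)
matches = torch.allclose(out, manual, atol=1e-5)
row_means = out.mean(dim=-1)                 # close to 0 for every sample
```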

### Key Points

- **Independence from Batch Size**: Layer Normalization's effectiveness does not depend on the batch size, making it suitable for tasks with variable batch sizes or where small batch sizes are preferred.
- **Use Cases**: It's widely used in models where controlling the internal distribution of activations is crucial, such as recurrent neural networks (RNNs) and Transformers.
- **Versatility**: Layer Normalization can be applied to various types of data and model architectures, making it a versatile choice for normalization needs in deep learning models.

Layer Normalization is a powerful tool to improve the training dynamics and stability of deep neural networks, especially in architectures where batch normalization might not be applicable or optimal.

In [None]:
import torch
import torch.nn as nn

# Example input tensor of shape (batch_size, num_features)
input_tensor = torch.randn(2, 5)

# Apply LayerNorm
layer_norm = nn.LayerNorm(normalized_shape=5, elementwise_affine=True)
output_tensor = layer_norm(input_tensor)

print(output_tensor)

# Module List

`torch.nn.ModuleList` is a PyTorch container module designed to hold a list of `nn.Module` instances. It is similar to a Python list, but with added functionalities that are specific to PyTorch, making it suitable for use within neural network architectures. One of the key features of `ModuleList` is that it correctly registers the modules contained within it as submodules of the parent module. This registration is crucial for ensuring that all parameters of the modules within the `ModuleList` are visible to PyTorch optimizers, and properly included in the model's parameter list for training.

### Key Characteristics

- **Automatic Registration**: When you add instances of `nn.Module` to a `ModuleList`, they are automatically registered as submodules (or components) of the parent module. This means their parameters are included in the parent's parameters, and they are properly tracked for training and updating.
- **Flexibility**: `ModuleList` provides a flexible way to work with lists of modules, especially when the exact number of modules you need is dynamic or when you want to apply the same operation to a series of modules in a loop.

### Usage

`ModuleList` is particularly useful when you have a variable number of modules that perform similar operations, or when you want to iterate over modules. Here's an example of how to use `ModuleList`:

```python
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 20),
            nn.ReLU(),
            nn.Linear(20, 30)
        ])
    
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```

In this example, `MyModel` contains a `ModuleList` of layers. During the forward pass, it applies each layer in the `ModuleList` to the input sequentially. This pattern is very common in neural network architectures, especially when constructing networks with variable or dynamic structures.

### Differences from Python Lists

While you can use a regular Python list to store modules, doing so does not automatically register the modules with the parent module, which means their parameters won't be tracked or updated during training. `ModuleList` overcomes this limitation by ensuring proper registration and parameter tracking.
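
The difference is easy to observe by counting parameters. In this sketch the two class names are hypothetical; both hold the same two layers, once in a plain Python list and once in a `ModuleList`.

```python
import torch.nn as nn

class WithPythonList(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain list: the layers are NOT registered as submodules
        self.layers = [nn.Linear(10, 20), nn.Linear(20, 30)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(10, 20), nn.Linear(20, 30)])

n_list = sum(p.numel() for p in WithPythonList().parameters())
n_module_list = sum(p.numel() for p in WithModuleList().parameters())
# n_list is 0; n_module_list is (10*20 + 20) + (20*30 + 30) = 850
```

An optimizer built from `WithPythonList().parameters()` would receive no parameters at all, so none of those layers would be trained.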

### Conclusion

`torch.nn.ModuleList` is a useful and necessary tool for effectively managing collections of modules in PyTorch. It ensures that modules are correctly registered as part of the overall model, making it easy to define, iterate, and train complex neural network architectures.

In [None]:
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 20),
            nn.ReLU(),
            nn.Linear(20, 30)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Sequential

`torch.nn.Sequential` is a container module in PyTorch that sequences together other modules and layers in a specific order. It simplifies the process of building neural networks by allowing you to stack different layers and modules in the order they should be executed during the forward pass. When you input data into a `Sequential` container, it passes through all its modules sequentially, from the first to the last, with the output of one module becoming the input to the next.

### Key Characteristics

- **Simplicity**: It provides a clean and straightforward way to define a model by stacking layers and modules without explicitly defining the `forward` method.
- **Automatic Forward Pass**: The `Sequential` container automatically handles the forward pass through all its contained modules in the order they were added.
- **Flexibility**: While it's most commonly used for simple linear stacks of layers, you can also include modules that contain branching, pooling, or other more complex behaviors, as long as the overall data flow remains sequential.

### Usage

Here's an example of how to use `torch.nn.Sequential`:

```python
import torch
import torch.nn as nn

# Define a simple sequential model
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 30),
    nn.ReLU(),
    nn.Linear(30, 10)
)

# Now you can pass input data to this model directly
# Example input tensor
input_tensor = torch.randn(5, 10)  # batch size of 5, input features of 10
output_tensor = model(input_tensor)
```

In this example, `model` is an instance of `torch.nn.Sequential` containing a stack of linear layers interspersed with ReLU activations. This model can take an input tensor and automatically apply all its layers in sequence.

### Advantages and Limitations

- **Advantages**: `Sequential` is great for quickly building models when the data flow is a simple linear stack of layers. It makes the code shorter and cleaner.
- **Limitations**: Since `Sequential` automatically defines the forward pass, it's not suitable for models that require branching, multiple inputs/outputs at different layers, or any other non-sequential data flow. For such models, you would need to define a custom `nn.Module` and explicitly implement the forward pass.

### Conclusion

`torch.nn.Sequential` is a convenient tool for defining straightforward neural networks in PyTorch. It allows for quick and clean model definition, making it ideal for many common architectures. However, for more complex models with non-sequential data flows, a custom `nn.Module` with an explicitly defined `forward` method would be necessary.
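
One common pattern that combines both approaches is sketched below: a skip connection cannot be written as a pure `Sequential`, but it can be wrapped in a small custom module, which then becomes one step in a `Sequential`. `ResidualBlock` is a hypothetical name and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Adds the block's input back to its output (a skip connection)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.body(x)   # branching data flow needs an explicit forward

model = nn.Sequential(
    nn.Linear(10, 16),
    ResidualBlock(16),
    nn.Linear(16, 4),
)

out = model(torch.randn(5, 10))   # shape (5, 4)
first_layer = model[0]            # Sequential supports indexing
```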

In [None]:
import torch
import torch.nn as nn

# Define a simple sequential model
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 30),
    nn.ReLU(),
    nn.Linear(30, 10)
)

# Now you can pass input data to this model directly
# Example input tensor
input_tensor = torch.randn(5, 10)  # batch size of 5, input features of 10
output_tensor = model(input_tensor)

# Cross Entropy Loss

`torch.nn.CrossEntropyLoss` is a loss function provided by PyTorch that is commonly used for classification tasks. It combines `nn.LogSoftmax` and `nn.NLLLoss` (negative log likelihood loss) in a single class, which makes it very convenient and computationally efficient for training models on classification problems.

### How it Works

- **Log-Softmax Application**: First, it applies log-softmax to the output logits (predictions) of the model, which gives the log-probabilities of a distribution over classes for each sample.
- **Negative Log Likelihood**: Then, it computes the negative log likelihood of the true class labels under that distribution, averaged over the batch by default. This measures how well the model's predictions match the true labels.

### Characteristics

- It expects the model outputs to be raw, unnormalized scores (also known as logits) for each class.
- The target should be either a tensor of class indices, one per sample (the usual single-label case), or a tensor of the same shape as the input containing class probabilities (soft labels, for example after label smoothing).

### Usage

Here's an example of how to use `torch.nn.CrossEntropyLoss`:

```python
import torch
import torch.nn as nn

# Define the size of each input sample and the number of classes
input_size = 10  # number of input features
num_classes = 4  # number of classes

# Example model
model = nn.Linear(input_size, num_classes)

# Example input and true labels
input_tensor = torch.randn(3, input_size)  # batch size of 3
target = torch.tensor([2, 0, 1])  # true class indices

# CrossEntropyLoss
criterion = nn.CrossEntropyLoss()

# Forward pass: compute predicted outputs by passing inputs to the model
output = model(input_tensor)

# Compute the loss
loss = criterion(output, target)

print(loss)
```

### Key Points

- **No need for Softmax**: You do not need to apply a softmax layer to your model's output before passing it to `CrossEntropyLoss` since it's already included within the loss computation.
- **Target Format**: For single-label classification, the target tensor should contain the class indices. For multi-label classification, where one sample can belong to several classes at once, `CrossEntropyLoss` is not the right fit; consider a loss such as `torch.nn.BCEWithLogitsLoss`.
- **Numerical Stability**: Combining softmax and negative log likelihood in one operation is numerically more stable than applying them separately.

`torch.nn.CrossEntropyLoss` is widely used for training neural networks on classification tasks due to its efficiency and convenience, making it a fundamental component of many machine learning pipelines in PyTorch.
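
The statement that the loss combines log-softmax and negative log likelihood can be verified numerically. The sketch below uses random logits (arbitrary seed) and compares three ways of computing the same value, assuming the default mean reduction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 4)             # 3 samples, 4 classes
target = torch.tensor([2, 0, 1])       # true class indices

# 1) The combined loss
ce = nn.CrossEntropyLoss()(logits, target)

# 2) log-softmax followed by negative log likelihood
log_probs = F.log_softmax(logits, dim=1)
nll = F.nll_loss(log_probs, target)

# 3) By hand: pick the log-probability of the true class and average
manual = -log_probs[torch.arange(3), target].mean()
```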

# Simple NN for MNIST

In PyTorch, the `forward` method defines the forward pass of a neural network. When you subclass `torch.nn.Module` to create your own model, you override the `forward` method to specify how your model processes input tensors and produces output tensors. The `forward` method is where you define the operations that your model performs on its input data, such as applying layers, activation functions, and any other computations involved in producing the model's output.

Here's a simplified overview of how it works:

- **Definition**: You define the `forward` method as part of your subclass of `torch.nn.Module`. This method takes the input tensor(s) as its argument(s) and returns the output tensor(s).
- **Execution**: When you call your model on an input tensor, PyTorch automatically calls the `forward` method with the input tensor. You don't call `forward` directly; instead, you call the model itself with the input data, and the `__call__` method of `nn.Module` ensures that your `forward` method is invoked.
- **Computation**: Inside the `forward` method, you specify the computations to be performed on the input tensors, using other modules (like convolutional layers, linear layers, or activation functions) that you've defined as attributes in your model's constructor (`__init__` method). The sequence of operations in the `forward` method defines the computational graph for the forward pass.
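
One way to see the difference between calling the model and calling `forward` directly is to attach a forward hook. The sketch below uses a hypothetical toy module, `Doubler`, that is not part of the original notebook. The hook runs when the module is called, because `nn.Module.__call__` wraps `forward`, but it is skipped when `forward` is invoked by hand:

```python
import torch
import torch.nn as nn

class Doubler(nn.Module):
    """Hypothetical toy module used only for this illustration."""
    def forward(self, x):
        return 2 * x

model = Doubler()
calls = []

# A forward hook is run by nn.Module.__call__ after forward() returns
handle = model.register_forward_hook(
    lambda module, inputs, output: calls.append(tuple(output.shape))
)

x = torch.ones(2, 3)

y_call = model(x)                 # goes through __call__: forward + hooks
hooks_after_call = len(calls)

y_direct = model.forward(x)       # bypasses __call__: hooks are not run
hooks_after_direct = len(calls)

handle.remove()
print(hooks_after_call, hooks_after_direct)
```

Both calls return the same tensor, but only the first one triggers the hook, which is why calling the model itself is the recommended form.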

### Example

The code cell at the end of this section gives a simple example of how the `forward` method is used.

In that example:
- The `SimpleNN` class inherits from `torch.nn.Module`.
- Two linear layers are defined in the `__init__` method.
- The `forward` method specifies that the input tensor `x` should first pass through the first linear layer (`fc1`), then a ReLU activation function, and finally through the second linear layer (`fc2`).
- The model is called with an input tensor, which triggers the `forward` method and produces an output tensor.

The `forward` method is essential for defining the behavior of your neural network during the forward pass, determining how input data is transformed into outputs by the model.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(in_features=10, out_features=20)  # First layer
        self.fc2 = nn.Linear(in_features=20, out_features=10)  # Second layer

    def forward(self, x):
        x = F.relu(self.fc1(x))  # Apply ReLU activation function after first layer
        x = self.fc2(x)          # Output layer
        return x

# Create an instance of the model
model = SimpleNN()

# Create a dummy input tensor
input_tensor = torch.randn(1, 10)  # Batch size of 1, 10 features

# Call the model on the input tensor
output_tensor = model(input_tensor)

print(output_tensor)

Here is a brief overview of the key parts, along with some insights:

1. **Loading the MNIST Dataset**: You've used `torchvision.datasets.MNIST` to load the MNIST dataset, specifying that it should be transformed into tensors using `transforms.ToTensor()`. This is a common practice to prepare data for training with PyTorch models.

2. **Inspecting the Dataset**: You've printed the classes, shape of the data, and shape of the targets to get a better understanding of the dataset structure.

3. **Preparing the Data**: You've selected the first 1000 samples for training and normalized them by dividing by the maximum value. This normalization step is crucial for helping the neural network learn more effectively.

4. **Building a Simple Linear Model**: You've used `nn.Linear` to create a simple linear model (fully connected layer) and examined its parameters (weights and biases). This is a basic building block for neural networks.

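5. **Making Predictions and Calculating Loss**: You've passed an input through the model, used softmax to get class probabilities, and then calculated the loss using `nn.CrossEntropyLoss`, which is a common loss function for classification tasks. Note that `nn.CrossEntropyLoss` should receive the raw model outputs (logits) rather than the softmax probabilities, since it applies log-softmax internally; the explicit softmax is only needed for inspecting the probabilities.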

6. **Setting Up an Optimizer**: You've initialized an SGD optimizer with the model's parameters and a learning rate. Optimizers are used to update model parameters based on gradients computed during backpropagation.

7. **Training the Model**: You've demonstrated how to compute gradients with `.backward()`, visualize gradients, update model parameters with `optimizer.step()`, and reset gradients with `optimizer.zero_grad()`.

8. **Building a Two-Layer Neural Network**: You've shown two approaches for constructing a neural network model in PyTorch: using `nn.Sequential` for a straightforward stack of layers and subclassing `nn.Module` for more flexibility and control over the model architecture.

9. **Training the Two-Layer Neural Network**: You've outlined a training loop that iterates over epochs, computes loss for each training sample, updates model parameters, and zeroes gradients. Additionally, you've visualized the initial and final weights of the first layer to observe changes after training.

This code snippet encapsulates many fundamental aspects of working with PyTorch, from data preparation and model definition to training and evaluation. It's a solid foundation for understanding how to implement and train neural networks using PyTorch.
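
As a minimal sketch of steps 6 to 8 above, the following builds a two-layer network with `nn.Sequential` and performs a single optimization step. The data is random and merely stands in for flattened MNIST images; the hidden size of 32 and the learning rate are arbitrary choices for illustration, not values taken from the original code:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic stand-in for a batch of flattened 28x28 images with 10 classes
x = torch.rand(16, 28 * 28)
y = torch.randint(0, 10, (16,))

# Two-layer network as a plain stack of layers
model = nn.Sequential(
    nn.Linear(28 * 28, 32),
    nn.ReLU(),
    nn.Linear(32, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

weights_before = model[0].weight.detach().clone()

optimizer.zero_grad()              # reset gradients from any previous step
loss = criterion(model(x), y)      # forward pass on raw logits
loss.backward()                    # compute gradients
grad_norm = model[0].weight.grad.norm().item()
optimizer.step()                   # update the parameters

weights_changed = not torch.equal(weights_before, model[0].weight.detach())
print(loss.item(), grad_norm, weights_changed)
```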

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

# Step 1: Load MNIST Dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = datasets.MNIST('data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Step 2: Define the Model
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)  # Input layer to hidden layer
        self.fc2 = nn.Linear(128, 10)     # Hidden layer to output layer

    def forward(self, x):
        x = x.view(-1, 28*28)  # Flatten the image
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = FFN()

# Step 3: Define Loss Function and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Step 4: Train the Model
epochs = 5
loss_trace = []

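for epoch in range(epochs):
    running_loss = 0.0
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()             # clear gradients from the previous step
        output = model(data)              # forward pass: logits of shape (batch, 10)
        loss = criterion(output, target)
        loss.backward()                   # backpropagate
        optimizer.step()                  # update the parameters
        running_loss += loss.item()

    # Record the mean loss over all batches rather than only the last batch's loss
    epoch_loss = running_loss / len(train_loader)
    print(f'Epoch {epoch+1}/{epochs}, Loss: {epoch_loss:.4f}')
    loss_trace.append(epoch_loss)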

# Plot the training loss
plt.plot(loss_trace)
plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

# Note: This is a basic example for educational purposes.
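
The training cell above tracks only the loss. A natural follow-up is to measure classification accuracy. The sketch below defines a hypothetical helper, `accuracy`, and exercises it on random stand-in data (`fake_loader`, `toy_model`) so that it runs without downloading MNIST; with the real notebook one would presumably pass the trained `model` and a loader built from `datasets.MNIST(..., train=False)`:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, loader):
    """Fraction of samples whose highest-scoring class matches the label."""
    model.eval()                       # switch off training-only behavior such as dropout
    correct, total = 0, 0
    with torch.no_grad():              # no gradients needed for evaluation
        for data, target in loader:
            pred = model(data).argmax(dim=1)
            correct += (pred == target).sum().item()
            total += target.size(0)
    return correct / total

# Random stand-in data shaped like MNIST batches (assumption for illustration)
torch.manual_seed(0)
fake_images = torch.rand(20, 1, 28, 28)
fake_labels = torch.randint(0, 10, (20,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=8)

toy_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
acc = accuracy(toy_model, fake_loader)
print(f'Accuracy on random data: {acc:.2f}')
```

On random data with an untrained model the value is only a smoke test; it should land somewhere between 0 and 1 and carries no meaning beyond that.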

# Transformer

`torch.nn.Transformer` is a module provided by PyTorch that implements the Transformer model architecture as described in the paper "Attention Is All You Need" by Vaswani et al. The Transformer model has been highly influential in natural language processing (NLP) and beyond, because it handles sequential data effectively without recurrent neural networks (RNNs) or convolutional neural networks (CNNs). Instead, it relies entirely on attention mechanisms to draw global dependencies between input and output.

### Key Components

The Transformer architecture comprises several key components. `torch.nn.Transformer` implements the encoder and decoder stacks; embeddings and positional encoding are not part of the module and must be applied to its inputs separately:

- **Encoder**: The encoder maps an input sequence to a sequence of continuous representations. It consists of a stack of identical layers, each containing two main sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.
- **Decoder**: The decoder is responsible for generating the output sequence. It also consists of a stack of identical layers, but with an additional multi-head attention mechanism that attends to the encoder's output.
- **Positional Encoding** (not included in `torch.nn.Transformer`): Since the model does not use recurrence or convolution, a positional encoding is added to the input embeddings at the bottom of the encoder and decoder stacks to include information about the sequence order.

### Usage

The Transformer model is highly versatile and can be used for a wide range of tasks, such as machine translation, text summarization, and more. Here's a simplified example of how to use `torch.nn.Transformer`:

```python
import torch
import torch.nn as nn

# Model parameters
d_model = 512  # The dimensionality of the input and output of the model
nhead = 8  # The number of heads in the multi-head attention models
num_encoder_layers = 6  # The number of sub-encoder-layers in the encoder
num_decoder_layers = 6  # The number of sub-decoder-layers in the decoder
dim_feedforward = 2048  # The dimensionality of the feed-forward network model in encoder and decoder

# Initialize the transformer model
transformer_model = nn.Transformer(d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward)

# Example input (src) and target (tgt) tensors
src = torch.rand((10, 32, d_model))  # (sequence length, batch size, feature size)
tgt = torch.rand((20, 32, d_model))  # (sequence length, batch size, feature size)

# Forward pass
output = transformer_model(src, tgt)
```

### Key Points

- **Flexibility**: The `torch.nn.Transformer` module is highly configurable, allowing adjustments to the number of layers, heads, and other parameters to suit different tasks and datasets.
- **Attention Mechanism**: The core of the Transformer is the self-attention mechanism, which enables the model to weigh the importance of different parts of the input data differently.
- **Positional Encoding**: It's crucial to use a positional encoding (or an alternative method to incorporate sequence order information) when working with `torch.nn.Transformer`, as the model itself does not inherently understand sequence order.
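
Because `torch.nn.Transformer` leaves positional information to the caller, something like the following is needed in front of it. This is a sketch of the fixed sinusoidal encoding from the original paper; the class name `SinusoidalPositionalEncoding`, the `max_len` default, and the small model sizes are assumptions made for illustration, and the code assumes an even `d_model` and the default `(seq, batch, feature)` layout:

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to (seq, batch, d_model) inputs."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )                                                              # (d_model / 2,)
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)                # even feature indices
        pe[:, 0, 1::2] = torch.cos(position * div_term)                # odd feature indices
        self.register_buffer("pe", pe)                                 # stored, but not trained

    def forward(self, x):
        # Broadcast the first seq_len encodings over the batch dimension
        return x + self.pe[: x.size(0)]

d_model = 64
pos_enc = SinusoidalPositionalEncoding(d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=128)

src = torch.rand(10, 2, d_model)   # stand-in for embedded source tokens
tgt = torch.rand(7, 2, d_model)    # stand-in for embedded target tokens
out = transformer(pos_enc(src), pos_enc(tgt))
print(out.shape)
```

A learned positional embedding (for example an `nn.Embedding` indexed by position) is a common alternative; the sinusoidal variant is shown here only because it needs no training.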

The Transformer model represents a significant advancement in sequence modeling, offering parallelizability and efficiency improvements over traditional RNN-based approaches. Its architecture has served as the foundation for numerous state-of-the-art models in NLP, including BERT, GPT, and many others.

The following is an overview of the configurable options available for the `torch.nn.Transformer` module in PyTorch. These parameters allow for extensive customization of the Transformer model to suit various tasks and data characteristics. Let's break down these parameters and their roles in the Transformer architecture:

### Core Parameters

- **`d_model`**: The number of expected features in the encoder/decoder inputs. This is the size of the vectors that the Transformer processes in each position of the sequence.
- **`nhead`**: The number of heads in the multi-head attention mechanisms. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
- **`num_encoder_layers`**: The number of sub-encoder-layers in the encoder. Each layer consists of a multi-head self-attention mechanism followed by a position-wise fully connected feed-forward network.
- **`num_decoder_layers`**: The number of sub-decoder-layers in the decoder. Similar to the encoder layers, but with an additional multi-head attention mechanism that attends to the encoder's output.
- **`dim_feedforward`**: The dimension of the feed-forward network model in encoder and decoder intermediate layers. This is the size of the hidden layer in the feed-forward networks.
- **`dropout`**: The dropout rate, a regularization technique to prevent overfitting by randomly setting a fraction of the input units to 0 during training.
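
To see how the core parameters map onto the constructed module, here is a small sketch with deliberately tiny, arbitrary sizes. It inspects the number of layers and the hidden width of the feed-forward block. Note that `d_model` must be divisible by `nhead`, since the feature dimension is split evenly across the attention heads:

```python
import torch
import torch.nn as nn

small = nn.Transformer(
    d_model=32,            # feature size at every sequence position
    nhead=4,               # 32 / 4 = 8 features per attention head
    num_encoder_layers=2,
    num_decoder_layers=3,
    dim_feedforward=64,    # hidden width of each feed-forward block
    dropout=0.1,
)

n_enc = len(small.encoder.layers)
n_dec = len(small.decoder.layers)
ff_hidden = small.encoder.layers[0].linear1.out_features

src = torch.rand(5, 2, 32)   # (source length, batch, d_model)
tgt = torch.rand(4, 2, 32)   # (target length, batch, d_model)
out = small(src, tgt)

print(n_enc, n_dec, ff_hidden, out.shape)
```

The output has the target's sequence length and the same feature size as the inputs.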

### Advanced Configuration

- **`activation`**: The activation function of the encoder/decoder's intermediate layer. Common options are "relu" and "gelu".
- **`custom_encoder`/`custom_decoder`**: Custom encoder or decoder to replace the default Transformer encoder or decoder.
- **`layer_norm_eps`**: The epsilon value for layer normalization components, to prevent division by zero.
- **`batch_first`**: If `True`, the input and output tensors are expected in the format `(batch, seq, feature)`, which is more aligned with other PyTorch modules. The default is `False`, meaning the format is `(seq, batch, feature)`.
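- **`norm_first`**: Determines the order of layer normalization and attention/feedforward operations within the encoder and decoder layers. If `True`, layer normalization is applied before the attention and feed-forward operations; the default is `False`, which applies it after them.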
- **`bias`**: If set to `False`, linear layers and layer normalization will not learn an additive bias. The default is `True`.
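
A short sketch of a few of these options in use, again with arbitrary small sizes. It assumes a PyTorch version that supports `batch_first` and `norm_first` (both were added in the 1.x series, as far as I know); the `bias` argument is left out because it is only available in more recent releases:

```python
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=32,
    nhead=4,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=64,
    activation="gelu",   # instead of the default "relu"
    batch_first=True,    # tensors are (batch, seq, feature)
    norm_first=True,     # layer norm before attention / feed-forward
)

src = torch.rand(2, 5, 32)   # (batch, source length, feature)
tgt = torch.rand(2, 4, 32)   # (batch, target length, feature)
out = model(src, tgt)
print(out.shape)
```

With `batch_first=True` the output follows the same `(batch, seq, feature)` layout as the inputs.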

### Masking and Causality

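Unlike the options above, these are arguments of the `forward` call rather than constructor parameters:

- **`src_mask`**, **`tgt_mask`**, **`memory_mask`**: Attention masks for the source sequence, the target sequence, and the encoder output (memory). They prevent the model from attending to certain positions. A boolean mask marks positions that may not be attended to with `True`; a float mask is added to the attention scores, so `-inf` entries block attention.
- **`src_key_padding_mask`**, **`tgt_key_padding_mask`**, **`memory_key_padding_mask`**: Per-sequence masks that indicate which positions of each batch element are padding and should not be attended to.
- **`src_is_causal`**, **`tgt_is_causal`**, **`memory_is_causal`**: Hints to the model indicating whether the corresponding masks are causal. A causal mask prevents the model from attending to future positions during prediction, which is crucial for tasks like language modeling. Because these are only hints, providing an incorrect one can lead to incorrect results.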
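
The effect of these masks can be verified directly. The sketch below builds a float causal mask by hand with `torch.triu` and a boolean padding mask, then checks two properties: changing the last target step leaves earlier outputs untouched, and changing padded source positions leaves the whole output untouched. The sizes are arbitrary, and `dropout=0.0` is an assumption made so that repeated forward passes are comparable:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, S, T, N = 16, 5, 4, 2   # feature size, source length, target length, batch size

# dropout=0.0 keeps the forward pass deterministic for the comparisons below
model = nn.Transformer(d_model=d_model, nhead=2, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=32, dropout=0.0)

src = torch.rand(S, N, d_model)
tgt = torch.rand(T, N, d_model)

# Float causal mask: -inf above the diagonal blocks attention to later positions
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

# Boolean padding mask of shape (N, S): True marks padding.
# Here the last two source positions of the second sequence are padding.
src_padding = torch.zeros(N, S, dtype=torch.bool)
src_padding[1, -2:] = True

def run(src_in, tgt_in):
    return model(src_in, tgt_in,
                 tgt_mask=tgt_mask,
                 src_key_padding_mask=src_padding,
                 memory_key_padding_mask=src_padding)

out = run(src, tgt)

# Causality: altering the last target step must not change earlier outputs
tgt_changed = tgt.clone()
tgt_changed[-1] += 1.0
causal_ok = torch.allclose(out[:-1], run(src, tgt_changed)[:-1], atol=1e-5)

# Padding: altering padded source positions must not change the output
src_changed = src.clone()
src_changed[-2:, 1] += 1.0
padding_ok = torch.allclose(out, run(src_changed, tgt), atol=1e-5)

print(causal_ok, padding_ok)
```

Passing the padding mask as both `src_key_padding_mask` and `memory_key_padding_mask` matters here: the first hides padding inside the encoder, the second hides the corresponding encoder outputs from the decoder.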

These parameters give users the flexibility to tailor the Transformer model to a wide range of tasks beyond simple sequence-to-sequence modeling, including but not limited to language translation, text generation, and more complex sequential tasks that require careful control over the flow of information.