<a href="https://colab.research.google.com/github/athahibatullah/llm-from-scratch/blob/main/Pytorch_Appendix_A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **A.1 What is PyTorch?**

## **A.1.1 The Three Core Components of PyTorch**

PyTorch is an open-source, Python-based deep learning library. It has been the most widely used deep learning library for research since 2019 by a wide margin.

One of the reasons PyTorch is so popular is its user-friendly interface and efficiency. Despite its accessibility, it doesn't compromise on flexibility, allowing advanced users to tweak lower-level aspects of their models for customization and optimization. In short, for many practitioners and researchers, PyTorch offers just the right balance between usability and features.

<img src="https://drive.google.com/uc?id=1CwRGaZ3_oEKUdxZbwXZiK86Jbppfi6ag">

* Tensor Library: PyTorch is a tensor library that extends the concept of the array-oriented programming library NumPy with the additional feature that accelerates computation on GPUs, thus providing a seamless switch between CPUs and GPUs.

* Automatic Differentiation Engine: PyTorch is an automatic differentiation engine, also known as autograd, that enables the automatic computation of gradients for tensor operations, simplifying backpropagation and model optimization.

* Deep Learning Library: PyTorch is a deep learning library. It offers modular, flexible, and efficient building blocks, including pretrained models, loss functions, and optimizers, for designing and training a wide range of deep learning models, catering to both researchers and developers.
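The three components above can be seen working together in a few lines. This is only a minimal sketch: the input values, the layer size, the seed, and the choice of mean squared error are assumptions made for illustration, not part of the original notes.

```python
import torch

# 1) Tensor library: create an array-like tensor to compute with
x = torch.tensor([[1.0, 2.0, 3.0]])   # one sample with three features
target = torch.tensor([[1.0]])         # illustrative target value

# 3) Deep learning library: a ready-made layer and loss function
torch.manual_seed(123)                 # fixed seed so the random weights are reproducible
layer = torch.nn.Linear(in_features=3, out_features=1)
loss = torch.nn.functional.mse_loss(layer(x), target)

# 2) Autograd: compute gradients of the loss w.r.t. the layer's parameters
loss.backward()

print(layer.weight.grad.shape)  # torch.Size([1, 3])
print(layer.bias.grad.shape)    # torch.Size([1])
```

Each parameter receives a gradient with the same shape as the parameter itself, which is what an optimizer needs in order to update it.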

# **A.2 Understanding Tensors**

Tensors represent a mathematical concept that generalizes vectors and matrices to potentially higher dimensions. A tensor is a mathematical object that can be characterized by its order (or rank), which gives the number of dimensions. For example, a scalar is a rank-0 tensor, a vector is a rank-1 tensor, and a matrix is a rank-2 tensor.

<img src="https://drive.google.com/uc?id=1Gn7mKtrvucwRRZxmq2x69tF-ygOAK5iL">



From a computational perspective, tensors serve as data containers. They hold multidimensional data, where each dimension represents a different feature. Tensor libraries like PyTorch can create, manipulate, and compute with these arrays efficiently. In this context, a tensor library functions as an array library.

PyTorch tensors are similar to NumPy arrays but have several additional features that are important for deep learning. For example, PyTorch adds an automatic differentiation engine, which simplifies computing gradients. PyTorch also supports GPU computation to speed up deep neural network training.
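As a rough illustration of the GPU support mentioned above, the sketch below moves two tensors to a GPU when one is available and otherwise stays on the CPU. The variable names and values are assumptions chosen for the example.

```python
import torch

# Pick a GPU if PyTorch can see one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([1.0, 2.0, 3.0]).to(device)
b = torch.tensor([4.0, 5.0, 6.0]).to(device)

# Both operands live on the same device, and so does the result
c = a + b

print(c.device.type == device.type)  # True
print(c.cpu())                       # tensor([5., 7., 9.])
```

The same code runs unchanged on machines with and without a GPU, since only the value of device differs.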

## **A.2.1 Scalars, Vectors, Matrices, and Tensors**

As mentioned earlier, PyTorch tensors are data containers for array-like structures. A scalar is a zero-dimensional tensor (0D tensor), a vector is a one-dimensional tensor (1D tensor), and a matrix is a two-dimensional tensor (2D tensor). For three dimensions and above, we simply say 3D tensor, 4D tensor, and so forth. We can create objects of PyTorch's Tensor class using the torch.tensor function as shown:

In [2]:
import torch

print(torch.__version__)

2.6.0+cu124


In [None]:
print(torch.cuda.is_available())

False


In [None]:
import numpy as np  # needed for the NumPy example at the end of this cell

# create a 0D tensor (scalar) from a Python integer
tensor0d = torch.tensor(1)

# create a 1D tensor (vector) from a Python list
tensor1d = torch.tensor([1, 2, 3])

# create a 2D tensor from a nested Python list
tensor2d = torch.tensor([[1, 2],
                         [3, 4]])

# create a 3D tensor from a nested Python list
tensor3d_1 = torch.tensor([[[1, 2], [3, 4]],
                           [[5, 6], [7, 8]]])

# create a 3D tensor from NumPy array
ary3d = np.array([[[1, 2], [3, 4]],
                  [[5, 6], [7, 8]]])
tensor3d_2 = torch.tensor(ary3d)  # Copies NumPy array
tensor3d_3 = torch.from_numpy(ary3d)  # Shares memory with NumPy array

In [None]:
ary3d[0, 0, 0] = 999
print(tensor3d_2) # remains unchanged

tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])


In [None]:
print(tensor3d_3) # changes because of memory sharing

tensor([[[999,   2],
         [  3,   4]],

        [[  5,   6],
         [  7,   8]]])
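To connect these tensors back to the notion of rank, the following sketch prints the number of dimensions and the shape of one tensor of each kind. The variable names are chosen for this example only.

```python
import torch

scalar = torch.tensor(1)
vector = torch.tensor([1, 2, 3])
matrix = torch.tensor([[1, 2], [3, 4]])
cube = torch.tensor([[[1, 2], [3, 4]],
                     [[5, 6], [7, 8]]])

# .ndim gives the number of dimensions (the rank); .shape gives the size of each one
for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, t.ndim, tuple(t.shape))

# scalar 0 ()
# vector 1 (3,)
# matrix 2 (2, 2)
# cube 3 (2, 2, 2)
```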


## **A.2.2 Tensor Data Types**

PyTorch adopts the default 64-bit integer data type from Python. We can access the data type of a tensor via its .dtype attribute:

In [None]:
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d.dtype)

torch.int64


If we create a tensor from Python floats, PyTorch uses 32-bit precision by default:

In [None]:
floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

torch.float32


This is primarily because 32-bit precision is sufficient for most deep learning tasks while consuming less memory and fewer computational resources than 64-bit. Moreover, GPU architectures are optimized for 32-bit computations, and using this data type can significantly speed up model training and inference.
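The memory difference is easy to check. The sketch below, using made-up example vectors, compares the number of bytes each element occupies in the two precisions.

```python
import torch

vec32 = torch.tensor([1.0, 2.0, 3.0])                       # default: float32
vec64 = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)  # explicit 64-bit

# element_size() returns the number of bytes used by a single element
print(vec32.element_size())  # 4
print(vec64.element_size())  # 8

# Total bytes of tensor data = bytes per element * number of elements
print(vec32.element_size() * vec32.numel())  # 12
print(vec64.element_size() * vec64.numel())  # 24
```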

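It's also possible to change a tensor's data type with the .to method. Here we convert the 64-bit integer tensor from before into a 32-bit float tensor: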

In [None]:
floatvec = tensor1d.to(torch.float32)
print(floatvec.dtype)

torch.float32


## **A.2.3 Common PyTorch Tensor Operations**

We start by creating a 2D tensor as we did earlier, this time with two rows and three columns:

In [None]:
tensor2d = torch.tensor([[1, 2, 3],
                         [4, 5, 6]])
tensor2d

tensor([[1, 2, 3],
        [4, 5, 6]])

The .shape attribute gives us the shape of a tensor:

In [None]:
tensor2d.shape # row,column

torch.Size([2, 3])

We can also reshape the tensor from 2x3 to 3x2 with .reshape:

In [None]:
tensor2d.reshape(3, 2)

tensor([[1, 2],
        [3, 4],
        [5, 6]])

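However, .view is the more common way to reshape tensors in PyTorch. The two differ in how they handle memory: .view requires the tensor's data to be contiguous and fails otherwise, whereas .reshape works either way and copies the data if it has to.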

In [None]:
tensor2d.view(3, 2)

tensor([[1, 2],
        [3, 4],
        [5, 6]])
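For the contiguous tensor above, .view and .reshape give the same result. They behave differently once the memory layout is no longer contiguous, for example after a transpose. The following sketch illustrates this; the variable names are chosen for this example.

```python
import torch

t = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
t_T = t.T  # transposed view: shares memory with t but is not contiguous

print(t_T.is_contiguous())  # False

# .view cannot reinterpret non-contiguous memory and raises an error
try:
    t_T.view(6)
except RuntimeError as err:
    print("view failed:", type(err).__name__)  # view failed: RuntimeError

# .reshape falls back to copying the data when a view is not possible
flat = t_T.reshape(6)
print(flat)  # tensor([1, 4, 2, 5, 3, 6])
```

Calling .contiguous() before .view would also work, at the cost of the same copy that .reshape makes implicitly.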

As with .reshape and .view, PyTorch often offers more than one syntax for the same computation. This is because PyTorch initially followed the original Torch syntax convention but then, by popular request, added syntax to make it similar to NumPy.

Next, we can use .T to transpose a tensor, which flips it across its diagonal:

In [None]:
tensor2d.T

tensor([[1, 4],
        [2, 5],
        [3, 6]])

The .matmul method multiplies two matrices:

In [None]:
tensor2d.matmul(tensor2d.T)

tensor([[14, 32],
        [32, 77]])

We can also use the @ operator, which does the same thing as .matmul more compactly:

In [None]:
tensor2d @ tensor2d.T

tensor([[14, 32],
        [32, 77]])
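Matrix multiplication should not be confused with the element-wise product computed by the * operator. The sketch below contrasts the two on the same 2x3 tensor (named m here for brevity) and also shows the function form torch.matmul.

```python
import torch

m = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

# Element-wise product: both operands have the same shape, and so does the result
elementwise = m * m
print(elementwise)
# tensor([[ 1,  4,  9],
#         [16, 25, 36]])

# Matrix product: (2x3) times (3x2) gives a 2x2 result
product = torch.matmul(m, m.T)
print(product)
# tensor([[14, 32],
#         [32, 77]])
```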

# **A.3 Seeing Models as Computation Graphs**

Now let's look at PyTorch's automatic differentiation engine, also known as autograd. PyTorch's autograd system provides functions to compute gradients in dynamic computational graphs automatically.

A computational graph is a directed graph that allows us to express and visualize mathematical expressions. In deep learning, a computation graph lays out the sequence of calculations needed to compute the output of a neural network. We will need this to compute the required gradients for backpropagation, the main training algorithm for neural networks.

Let's look at a concrete example to illustrate the concept of a computation graph. The code in the following listing implements the forward pass (prediction step) of a simple logistic regression classifier, which can be seen as a single-layer neural network. It returns a score between 0 and 1, which is compared to the true class label (0 or 1) when computing the loss.

In [None]:
import torch.nn.functional as F # This import statement is a common convention in PyTorch to prevent long lines of code

y = torch.tensor([1.0])  # true label
x1 = torch.tensor([1.1]) # input feature
w1 = torch.tensor([2.2]) # weight parameter
b = torch.tensor([0.0])  # bias unit

z = x1 * w1 + b          # net input
a = torch.sigmoid(z)     # activation & output

loss = F.binary_cross_entropy(a, y)
print(loss)

tensor(0.0852)


The figure below illustrates the computation graph for this example:

<img src="https://drive.google.com/uc?id=1TxVSmxeVfWNPiksT0XNhglafj_xlEBfS">

PyTorch builds such a computation graph in the background, and we can use this to calculate gradients of a loss function with respect to the model parameters (here w1 and b) to train the model.

# **A.4 Automatic Differentiation Made Easy**

When we carry out computations in PyTorch, it builds a computational graph internally by default whenever one of the graph's terminal nodes has its requires_grad attribute set to True. This is useful if we want to compute gradients, which are required when training neural networks via the popular backpropagation algorithm.
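One way to observe this behavior is to inspect the grad_fn attribute of a result, which refers to the operation that produced it when the computation is being tracked. The sketch below is illustrative only; the names w_plain and w_tracked are made up for this example.

```python
import torch

x1 = torch.tensor([1.1])
w_plain = torch.tensor([2.2])                        # requires_grad defaults to False
w_tracked = torch.tensor([2.2], requires_grad=True)  # ask autograd to track this tensor

z_plain = x1 * w_plain
z_tracked = x1 * w_tracked

# Without requires_grad, no graph is recorded for the result
print(z_plain.grad_fn)                # None

# With requires_grad, the result remembers the operation that created it
print(z_tracked.grad_fn is not None)  # True
print(z_tracked.requires_grad)        # True

# Tensors created directly by the user are the leaf (terminal) nodes of the graph
print(w_tracked.is_leaf)              # True
```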

<img src="https://drive.google.com/uc?id=1FBYR12vad2hwlzIYKT1eDMIdATZV2z3l">

The figure above shows partial derivatives, which measure the rate at which a function changes with respect to one of its variables. A gradient is a vector containing all of the partial derivatives of a multivariate function, that is, a function with more than one variable as input.

At a high level, all you need to know for this book is that the chain rule is a way to compute the gradients of a loss function with respect to the model's parameters in a computation graph. These gradients provide the information needed to update each parameter, using a method such as gradient descent, so as to minimize the loss function, which serves as a proxy for measuring the model's performance.

All of this relates to PyTorch's second component, the automatic differentiation engine (autograd). PyTorch's autograd engine constructs a computational graph in the background by tracking every operation performed on tensors. Then, by calling the grad function, we can compute the gradient of the loss with respect to the model parameters w1 and b:


In [3]:
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)

print(grad_L_w1)
print(grad_L_b)

(tensor([-0.0898]),)
(tensor([-0.0817]),)
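These values can be cross-checked by applying the chain rule by hand. For a sigmoid output combined with binary cross-entropy, the derivative of the loss with respect to the net input z simplifies to a - y, so the gradient for w1 is (a - y) * x1 and the gradient for b is a - y. The sketch below compares this manual calculation with autograd; the manual_ and auto_ names are chosen for this example.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)

# Gradients from autograd, requested for both parameters at once
auto_w1, auto_b = grad(loss, (w1, b))

# Gradients from the chain rule, computed without tracking
with torch.no_grad():
    dL_dz = a - y            # d(loss)/dz for sigmoid + binary cross-entropy
    manual_w1 = dL_dz * x1   # dz/dw1 = x1
    manual_b = dL_dz         # dz/db  = 1

print(manual_w1)  # tensor([-0.0898])
print(manual_b)   # tensor([-0.0817])
print(torch.allclose(auto_w1, manual_w1, atol=1e-6))  # True
print(torch.allclose(auto_b, manual_b, atol=1e-6))    # True
```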


So far we have called the grad function manually, which can be useful for experimentation, debugging, and demonstrating concepts. In practice, PyTorch provides higher-level tools to automate this process. For example, when we call .backward on the loss, PyTorch computes the gradients of the loss with respect to all leaf nodes in the graph that require gradients and stores them in the tensors' .grad attributes:

In [4]:
loss.backward()

print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])
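With the gradients stored in .grad, a single gradient descent update can be sketched as follows. The learning rate of 0.1 is a hypothetical value chosen for illustration, and the manual update stands in for what an optimizer would normally do.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

learning_rate = 0.1  # hypothetical value, not taken from the text

# Forward pass and backward pass
loss_before = F.binary_cross_entropy(torch.sigmoid(x1 * w1 + b), y)
loss_before.backward()

# Update step: move each parameter against its gradient, without tracking the update
with torch.no_grad():
    w1 -= learning_rate * w1.grad
    b -= learning_rate * b.grad

# Gradients accumulate across .backward calls, so reset them before the next step
w1.grad.zero_()
b.grad.zero_()

loss_after = F.binary_cross_entropy(torch.sigmoid(x1 * w1 + b), y)
print(loss_after.item() < loss_before.item())  # True
```

Because both gradients are negative here, the update increases w1 and b slightly, which pushes the prediction closer to the true label of 1 and lowers the loss.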
