<a href="https://colab.research.google.com/github/annesjyu/NLP2/blob/main/NLP2_PyTorch_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch Basics

PyTorch is a leading deep learning framework that provides extensive libraries for developing complex neural network architectures. NLP has greatly benefited from deep learning techniques, especially in areas such as language modeling.

## Learning objectives

* Computational Graphs
* Tensors
* Operations with tensors
* Indexing, slicing, and joining
* Computing gradients
* Use CUDA tensors with GPUs

## Important Prerequisite Terms

1.   features
2.   targets
3.   models
4.   parameters
5.   hyperparameters
6.   predictions
7.   loss functions
8.   learning/training
9.   testing


## Computational Graph

A computational graph is a fundamental concept that underpins how neural networks are constructed and executed.

1. Node

> A node represents either a **variable** (such as input data, weights, or biases) or an **operation** (such as a mathematical operation or an activation function). The edges of the graph represent the flow of data from one node to another.

2. Backpropagation and Autograd

> PyTorch's automatic differentiation engine is called autograd. It records the operations applied to tensors so that, when backpropagation is performed to train a neural network, gradients are computed automatically.

3. Eager Execution

> PyTorch operations are evaluated as they are called, which is different from static graphs used in other frameworks like TensorFlow (prior to version 2.0), where the graph is defined and compiled before it is run. Eager execution makes PyTorch more intuitive and user-friendly, especially for debugging, as it allows for normal Python debugging tools to be used.

4. Optimization and GPU Acceleration

> Since the graph outlines the entire sequence of operations, PyTorch can optimize memory usage and computation for better performance.

Reference:

> Vijay. (2023). *Deep Learning: Computational Graphs*. Medium, March 2021. [https://vijay110402.medium.com/deep-learning-computational-graphs-f0e12d82f78d](https://vijay110402.medium.com/deep-learning-computational-graphs-f0e12d82f78d)

For example,

A computational graph is used to implement a function $f(A,B,C) =(A*B)+C$:

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*HK6gaBlCJLQOTldCURi7qQ.gif" height=200>

A computational graph is used to implement a basic neural network layer:

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*d3uM1IwDZWqvEU2G0p0gxA.gif" height=200>
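The first graph above, $f(A,B,C) = (A*B)+C$, can be sketched directly in PyTorch. This is a minimal illustration with arbitrarily chosen input values (2, 3 and 4 are assumptions, not taken from the figure); the intermediate name `product` is only there to make the two operation nodes visible.

```python
import torch

# Leaf nodes of the graph: the inputs A, B and C (values chosen arbitrarily).
A = torch.tensor(2.0, requires_grad=True)
B = torch.tensor(3.0, requires_grad=True)
C = torch.tensor(4.0, requires_grad=True)

# Operation nodes: a multiplication followed by an addition.
product = A * B
f = product + C

# Walk the graph backwards to obtain df/dA, df/dB and df/dC.
f.backward()

print(f.item())                 # 10.0
print(A.grad, B.grad, C.grad)   # df/dA = B = 3, df/dB = A = 2, df/dC = 1
```

Each gradient matches what the chain rule gives by hand, which is a quick way to check that the graph was recorded as intended.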



In [6]:
from IPython  import display
from IPython.display import IFrame

# Display a webpage in an IFrame
url = "https://www.youtube.com/embed/JAB_plj2rbA?si=x7woApkNANHhq9Ow"
iframe = IFrame(url, width=560, height=315)
display.display(iframe)

## Dynamic (Define-by-Run) Nature

### **Static Computational Graphs** (TensorFlow)

A static computational graph means the graph's structure is defined and compiled before it's run. Once compiled, it cannot be changed. This is what TensorFlow used initially (up until v2.0, where it embraced more dynamic graphs through Eager Execution).

<img src="https://miro.medium.com/v2/resize:fit:504/format:webp/0*4UHwQnsmUjyD7VtW.gif">

#### Advantages:
* Efficiency: Once compiled, the graph can be optimized, leading to faster execution and less resource consumption.
* Portability: The graph can be saved, deployed, and run without the code that generated it.

* Visualization: Easy to visualize and debug using tools like TensorBoard.

#### Disadvantages:
* Less Intuitive: Harder for Python programmers to debug and understand, as the code doesn't execute Python line by line.

* Flexibility: Less flexible in changing the graph during runtime, making it difficult for dynamic models and research.

### Dynamic Computational Graphs (PyTorch)

A dynamic computational graph, also known as an "imperative" or "define-by-run" graph, is constructed on the fly during execution. This approach is used by PyTorch.

#### Advantages:

* Intuitiveness: More intuitive and Pythonic. The graph is built as the code is run, making it easier to understand and debug.

* Flexibility: Easy to change and adapt the graph dynamically, which is particularly useful for models where the structure changes every iteration (e.g., with variable input lengths or recursive neural networks).

#### Disadvantages:
* Overhead: The flexibility can come with a runtime overhead, as the graph needs to be built from scratch at each iteration.

* Optimization: Less opportunity for upfront optimization compared to static graphs.
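To make the define-by-run idea concrete, here is a small sketch in which ordinary Python control flow decides the shape of the graph. The helper `repeated_double` and its `n_steps` parameter are hypothetical names introduced only for this illustration.

```python
import torch

def repeated_double(x, n_steps):
    """Double x n_steps times; the loop length decides how many graph nodes exist."""
    y = x
    for _ in range(n_steps):
        y = y * 2
    return y

x = torch.tensor(1.5, requires_grad=True)

out = repeated_double(x, n_steps=3)    # graph with three multiplications
out.backward()
grad_3 = x.grad.item()
print(out.item(), grad_3)              # 12.0 8.0

x.grad = None                          # reset before building a different graph
out = repeated_double(x, n_steps=5)    # graph with five multiplications
out.backward()
grad_5 = x.grad.item()
print(out.item(), grad_5)              # 48.0 32.0
```

The same function yields two different graphs, and autograd differentiates whichever one was actually executed.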

## Installing PyTorch

In [7]:
import torch
import numpy as np
torch.manual_seed(1234)

<torch._C.Generator at 0x7f26fae6abf0>

## Tensors

In PyTorch, a tensor is a fundamental data structure that is similar to an array or a matrix. Tensors are used to encode the inputs and outputs of a model, as well as the model’s parameters.

* Scalar is a single number. Rank 0 tensor.
* Vector is an array of numbers - Rank 1 tensor.
* Matrix is a 2-D array of numbers - Rank 2 tensor.
* Tensors are N-D arrays of numbers - Rank N tensor.
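The ranks listed above can be checked with `Tensor.dim()`. The following is a small sketch; the variable names and values are illustrative only.

```python
import torch

scalar = torch.tensor(3.14)                        # rank 0
vector = torch.tensor([1.0, 2.0, 3.0])             # rank 1
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])    # rank 2
cube = torch.zeros(2, 3, 4)                        # rank 3

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    # dim() is the rank; shape lists the size along each dimension.
    print(name, t.dim(), tuple(t.shape))
```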

### Features

* GPU Support

> Unlike NumPy arrays, PyTorch tensors can be stored and operated on a Graphics Processing Unit (GPU) to accelerate computing. This is crucial for training deep learning models efficiently.

* Automatic Differentiation

> PyTorch tensors support automatic differentiation. When tensors are used in neural networks, PyTorch tracks all operations on them. This feature is fundamental for the backpropagation algorithm, as it allows PyTorch to automatically compute gradients.

* Interoperability with NumPy

> PyTorch tensors can be easily converted to and from NumPy arrays. This allows for leveraging the vast ecosystem of tools and libraries available for NumPy.

* Dynamic Computational Graph:

> PyTorch uses a dynamic computational graph, which means that the graph is generated on the fly as operations are performed on tensors.

* Data Type and Device

> Each tensor in PyTorch has a data type (such as float, int) and a device (such as CPU or GPU) on which it is stored. This allows for control over the precision and computing resources used in calculations.
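A few of the features above can be inspected directly. The sketch below assumes PyTorch's default settings (default dtype `float32`, default device CPU) and uses illustrative variable names.

```python
import numpy as np
import torch

t = torch.zeros(2, 3)
print(t.dtype, t.device)       # torch.float32 cpu (with default settings)

# Changing precision returns a new tensor with the requested dtype.
t64 = t.to(torch.float64)
print(t64.dtype)               # torch.float64

# NumPy interoperability: a CPU tensor and the array from .numpy() share memory,
# so a write through the array is visible in the tensor.
arr = t.numpy()
arr[0, 0] = 7.0
print(t[0, 0].item())          # 7.0

# The reverse direction keeps the array's dtype.
back = torch.from_numpy(np.ones((2, 2), dtype=np.float32))
print(back.dtype)              # torch.float32
```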

#### Creating Tensors

You can create tensors by specifying the shape as arguments. The examples below use a tensor with 2 rows and 3 columns and print it with the following `describe` helper.

In [20]:
def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))

`torch.Tensor(2, 3)` allocates a tensor of the given shape without initializing it, so its values are whatever happened to be in memory.

In [None]:
describe(torch.Tensor(2, 3))

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[1.5975e-43, 1.4013e-43, 5.6052e-44],
        [5.7453e-44, 4.4842e-44, 0.0000e+00]])


Create a tensor from the standard normal distribution.

In [None]:
describe(torch.randn(2, 3))

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[ 0.0461,  0.4024, -1.0115],
        [ 0.2167, -0.6123,  0.5036]])


It is common in prototyping to create a tensor of a specific shape filled with random numbers. `torch.rand` samples from a uniform distribution on the interval $[0,1)$.

In [None]:
x = torch.rand(2, 3)
describe(x)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0.7749, 0.8208, 0.2793],
        [0.6817, 0.2837, 0.6567]])


You can also initialize tensors of **ones** or **zeros**.

In [None]:
describe(torch.zeros(2, 3))
x = torch.ones(2, 3)
describe(x)
# Any method whose name ends with an underscore is an in-place operation.
x.fill_(5)
describe(x)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0., 0., 0.],
        [0., 0., 0.]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[1., 1., 1.],
        [1., 1., 1.]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[5., 5., 5.],
        [5., 5., 5.]])


Tensors can be initialized and then filled in place.

Note: operations that end in an underscore (`_`) are in-place operations.

In [None]:
x = torch.Tensor(3,4).fill_(5)
describe(x)

Type: torch.FloatTensor
Shape/size: torch.Size([3, 4])
Values: 
tensor([[5., 5., 5., 5.],
        [5., 5., 5., 5.],
        [5., 5., 5., 5.]])


Tensors can be initialized from a list of lists

In [None]:
x = torch.Tensor([[1, 2,],
                  [2, 4,]])
describe(x)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 2])
Values: 
tensor([[1., 2.],
        [2., 4.]])


Tensors can be initialized from NumPy arrays with `torch.from_numpy`. The tensor keeps the array's dtype, so a `float64` array becomes a `DoubleTensor`, and it shares memory with the original array.

In [None]:
npy = np.random.rand(2, 3)
describe(torch.from_numpy(npy))
print(npy.dtype)

Type: torch.DoubleTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0.7856, 0.4701, 0.9814],
        [0.5040, 0.8839, 0.3517]], dtype=torch.float64)
float64


#### Tensor Types

`FloatTensor` has been the default tensor type in the examples so far. Integer constructors such as `torch.arange` with integer arguments produce a `LongTensor` instead:

In [None]:
import torch
x = torch.arange(6).view(2, 3)
describe(x)

Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5]])


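To control the type, use a constructor such as `FloatTensor` or `LongTensor`, pass the `dtype` argument to `torch.tensor`, or cast an existing tensor with methods such as `.long()` and `.float()`.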

In [None]:
x = torch.FloatTensor([[1, 2, 3],
                       [4, 5, 6]])
describe(x)

x = x.long()
describe(x)

x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]], dtype=torch.int64)
describe(x)

x = x.float()
describe(x)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[1., 2., 3.],
        [4., 5., 6.]])
Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[1, 2, 3],
        [4, 5, 6]])
Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[1, 2, 3],
        [4, 5, 6]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[1., 2., 3.],
        [4., 5., 6.]])


In [None]:
x = torch.randn(2, 3)
describe(x)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[ 1.5385, -0.9757,  1.5769],
        [ 0.3840, -0.6039, -0.5240]])


### Math Operations

In [None]:
# plus
describe(x + x)
# add func
describe(torch.add(x, x))

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[ 3.0771, -1.9515,  3.1539],
        [ 0.7680, -1.2077, -1.0479]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[ 3.0771, -1.9515,  3.1539],
        [ 0.7680, -1.2077, -1.0479]])


#### Concatenation and Joining

In [None]:
x = torch.arange(6)
describe(x)

Type: torch.LongTensor
Shape/size: torch.Size([6])
Values: 
tensor([0, 1, 2, 3, 4, 5])


In [None]:
# Reshape
x = x.view(2, 3)
describe(x)

Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5]])


In [None]:
# concatenation
describe(torch.cat([x, x], dim=0))
describe(torch.cat([x, x], dim=1))

Type: torch.LongTensor
Shape/size: torch.Size([4, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5],
        [0, 1, 2],
        [3, 4, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([2, 6])
Values: 
tensor([[0, 1, 2, 0, 1, 2],
        [3, 4, 5, 3, 4, 5]])


In [None]:
describe(torch.stack([x, x], dim=0))
describe(torch.stack([x, x], dim=1))

Type: torch.LongTensor
Shape/size: torch.Size([2, 2, 3])
Values: 
tensor([[[0, 1, 2],
         [3, 4, 5]],

        [[0, 1, 2],
         [3, 4, 5]]])
Type: torch.LongTensor
Shape/size: torch.Size([2, 2, 3])
Values: 
tensor([[[0, 1, 2],
         [0, 1, 2]],

        [[3, 4, 5],
         [3, 4, 5]]])


In [None]:
describe(torch.sum(x, dim=0))
describe(torch.sum(x, dim=1))

Type: torch.LongTensor
Shape/size: torch.Size([3])
Values: 
tensor([3, 5, 7])
Type: torch.LongTensor
Shape/size: torch.Size([2])
Values: 
tensor([ 3, 12])


In [None]:
describe(torch.transpose(x, 0, 1))

Type: torch.LongTensor
Shape/size: torch.Size([3, 2])
Values: 
tensor([[0, 3],
        [1, 4],
        [2, 5]])


In [None]:
import torch
x = torch.arange(6).view(2, 3)
describe(x)
describe(x[:1, :2])
describe(x[0, 1])

Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([1, 2])
Values: 
tensor([[0, 1]])
Type: torch.LongTensor
Shape/size: torch.Size([])
Values: 
1


In [None]:
indices = torch.LongTensor([0, 2])
print(indices)
describe(torch.index_select(x, dim=1, index=indices))

tensor([0, 2])
Type: torch.LongTensor
Shape/size: torch.Size([2, 2])
Values: 
tensor([[0, 2],
        [3, 5]])


In [None]:
indices = torch.LongTensor([0, 0])
print(indices)
describe(torch.index_select(x, dim=0, index=indices))

tensor([0, 0])
Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [0, 1, 2]])


In [None]:
row_indices = torch.arange(2).long()
col_indices = torch.LongTensor([0, 1])
describe(x[row_indices, col_indices])

Type: torch.LongTensor
Shape/size: torch.Size([2])
Values: 
tensor([0, 4])


Long Tensors are used for indexing operations and mirror the `int64` numpy type

In [None]:
x = torch.LongTensor([[1, 2, 3],
                      [4, 5, 6],
                      [7, 8, 9]])
describe(x)
print(x.dtype)
print(x.numpy().dtype)

Type: torch.LongTensor
Shape/size: torch.Size([3, 3])
Values: 
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
torch.int64
int64


You can convert a FloatTensor to a LongTensor

In [None]:
x = torch.FloatTensor([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])
x = x.long()
describe(x)

Type: torch.LongTensor
Shape/size: torch.Size([3, 3])
Values: 
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])


### Special Tensor initializations

We can create a vector of incremental numbers

In [None]:
x = torch.arange(0, 10)
describe(x)

Type: torch.LongTensor
Shape/size: torch.Size([10])
Values: 
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


Sometimes it's useful to have an integer-based arange for indexing

In [None]:
x = torch.arange(0, 10).long()
describe(x)

Type: torch.LongTensor
Shape/size: torch.Size([10])
Values: 
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


## Advanced Matrix Operations

Using tensors to do linear algebra is a foundation of modern deep learning practice.

Reshaping rearranges the numbers in a tensor into a new shape while preserving their order. In PyTorch this is done with `view`, which returns a tensor that shares the same underlying data:

In [None]:
x = torch.arange(0, 20)

print(x.view(1, 20))
print(x.view(2, 10))
print(x.view(4, 5))
print(x.view(5, 4))
print(x.view(10, 2))
print(x.view(20, 1))

tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
         18, 19]])
tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
tensor([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]])
tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]])
tensor([[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17],
        [18, 19]])
tensor([[ 0],
        [ 1],
        [ 2],
        [ 3],
        [ 4],
        [ 5],
        [ 6],
        [ 7],
        [ 8],
        [ 9],
        [10],
        [11],
        [12],
        [13],
        [14],
        [15],
        [16],
        [17],
        [18],
        [19]])
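Two properties of `view` are worth seeing in code: it shares storage with the original tensor, and it only works when the memory layout allows it. The sketch below uses a transposed tensor as an example of a non-contiguous layout; `reshape` and `contiguous()` are standard PyTorch calls, and the variable names are illustrative.

```python
import torch

x = torch.arange(6)
v = x.view(2, 3)

# A view shares storage with the original: writing through v changes x.
v[0, 0] = 100
print(x[0].item())             # 100

# Transposing produces a non-contiguous tensor of shape (3, 2).
t = v.t()
print(t.is_contiguous())       # False

# view cannot flatten this layout without copying, so it raises.
view_failed = False
try:
    t.view(6)
except RuntimeError:
    view_failed = True
print(view_failed)             # True

# reshape copies when needed; contiguous().view(...) is the explicit version.
flat = t.reshape(6)
same = t.contiguous().view(6)
print(torch.equal(flat, same))  # True
```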


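We can use `view` to add size-1 dimensions, which is useful for combining a tensor with tensors of other shapes. When two tensors are combined, PyTorch expands each size-1 dimension to match the other tensor; this is called broadcasting.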

In [None]:
x = torch.arange(12).view(3, 4)
y = torch.arange(4).view(1, 4)
z = torch.arange(3).view(3, 1)

print(x)
print(y)
print(z)
print(x + y)
print(x + z)

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
tensor([[0, 1, 2, 3]])
tensor([[0],
        [1],
        [2]])
tensor([[ 0,  2,  4,  6],
        [ 4,  6,  8, 10],
        [ 8, 10, 12, 14]])
tensor([[ 0,  1,  2,  3],
        [ 5,  6,  7,  8],
        [10, 11, 12, 13]])
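As a further sketch of broadcasting, a column of shape `(3, 1)` and a row of shape `(1, 4)` combine into a full `(3, 4)` table, while shapes that cannot be aligned raise an error. The names `row`, `col` and `table` are illustrative.

```python
import torch

row = torch.arange(4).view(1, 4)   # shape (1, 4)
col = torch.arange(3).view(3, 1)   # shape (3, 1)

# Both size-1 dimensions are expanded, giving every pairwise sum.
table = col + row
print(table.shape)                 # torch.Size([3, 4])
print(table)

# Trailing dimensions 4 and 3 are neither equal nor 1, so this cannot broadcast.
incompatible = False
try:
    torch.ones(3, 4) + torch.ones(3)
except RuntimeError:
    incompatible = True
print(incompatible)                # True
```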


`unsqueeze` and `squeeze` add and remove size-1 dimensions.

In [None]:
x = torch.arange(12).view(3, 4)
describe(x)

x = x.unsqueeze(dim=1)
describe(x)

x = x.squeeze()
describe(x)

Type: torch.LongTensor
Shape/size: torch.Size([3, 4])
Values: 
tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
Type: torch.LongTensor
Shape/size: torch.Size([3, 1, 4])
Values: 
tensor([[[ 0,  1,  2,  3]],

        [[ 4,  5,  6,  7]],

        [[ 8,  9, 10, 11]]])
Type: torch.LongTensor
Shape/size: torch.Size([3, 4])
Values: 
tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
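Called without arguments, `squeeze` removes every size-1 dimension, which is not always what is wanted. A short sketch of the difference, using an illustrative tensor of shape `(1, 3, 1)`:

```python
import torch

x = torch.zeros(1, 3, 1)

print(x.squeeze().shape)       # torch.Size([3])          all size-1 dims removed
print(x.squeeze(0).shape)      # torch.Size([3, 1])       only dim 0 removed
print(x.squeeze(1).shape)      # torch.Size([1, 3, 1])    dim 1 has size 3, unchanged
print(x.unsqueeze(-1).shape)   # torch.Size([1, 3, 1, 1]) new trailing dim
```

Passing the dimension explicitly is the safer habit when a batch dimension might itself have size 1.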


All of the standard mathematical operations apply (such as `add` below):

In [None]:
x = torch.rand(3,4)
print("x: \n", x)
print("--")
print("torch.add(x, x): \n", torch.add(x, x))
print("--")
print("x+x: \n", x + x)

x: 
 tensor([[0.6662, 0.3343, 0.7893, 0.3216],
        [0.5247, 0.6688, 0.8436, 0.4265],
        [0.9561, 0.0770, 0.4108, 0.0014]])
--
torch.add(x, x): 
 tensor([[1.3324, 0.6686, 1.5786, 0.6433],
        [1.0494, 1.3377, 1.6872, 0.8530],
        [1.9123, 0.1540, 0.8216, 0.0028]])
--
x+x: 
 tensor([[1.3324, 0.6686, 1.5786, 0.6433],
        [1.0494, 1.3377, 1.6872, 0.8530],
        [1.9123, 0.1540, 0.8216, 0.0028]])


The convention of `_` indicating in-place operations continues:

In [None]:
x = torch.arange(12).reshape(3, 4)
print(x)
print(x.add_(x))

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
tensor([[ 0,  2,  4,  6],
        [ 8, 10, 12, 14],
        [16, 18, 20, 22]])


Many operations reduce a tensor along a dimension, for example `sum`:

In [None]:
x = torch.arange(12).reshape(3, 4)
print("x: \n", x)
print("---")
print("Summing across rows (dim=0): \n", x.sum(dim=0))
print("---")
print("Summing across columns (dim=1): \n", x.sum(dim=1))

x: 
 tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
---
Summing across rows (dim=0): 
 tensor([12, 15, 18, 21])
---
Summing across columns (dim=1): 
 tensor([ 6, 22, 38])
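Reductions drop the reduced dimension by default. Passing `keepdim=True` keeps it as a size-1 dimension, which makes the result broadcast cleanly against the original tensor. The sketch below uses that to normalize each row so it sums to 1; the variable names are illustrative.

```python
import torch

x = torch.arange(12).reshape(3, 4).float()

print(x.sum(dim=1).shape)                  # torch.Size([3])
row_totals = x.sum(dim=1, keepdim=True)
print(row_totals.shape)                    # torch.Size([3, 1])

# (3, 4) / (3, 1) broadcasts the row totals across the columns.
normalized = x / row_totals
print(normalized.sum(dim=1))               # each entry is (approximately) 1
```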


#### Indexing, Slicing, Joining and Mutating

In [None]:
x = torch.arange(6).view(2, 3)
print("x: \n", x)
print("---")
print("x[:2, :2]: \n", x[:2, :2])
print("---")
print("x[0][1]: \n", x[0][1])
print("---")
print("Setting [0][1] to be 8")
x[0][1] = 8
print(x)

x: 
 tensor([[0, 1, 2],
        [3, 4, 5]])
---
x[:2, :2]: 
 tensor([[0, 1],
        [3, 4]])
---
x[0][1]: 
 tensor(1)
---
Setting [0][1] to be 8
tensor([[0, 8, 2],
        [3, 4, 5]])


We can select a subset of a tensor along a given dimension using `index_select`:

In [None]:
x = torch.arange(9).view(3,3)
print(x)

print("---")
indices = torch.LongTensor([0, 2])
print(torch.index_select(x, dim=0, index=indices))

print("---")
indices = torch.LongTensor([0, 2])
print(torch.index_select(x, dim=1, index=indices))

tensor([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]])
---
tensor([[0, 1, 2],
        [6, 7, 8]])
---
tensor([[0, 2],
        [3, 5],
        [6, 8]])


We can also use numpy-style advanced indexing:

In [None]:
x = torch.arange(9).view(3,3)
indices = torch.LongTensor([0, 2])

print(x[indices])
print("---")
print(x[indices, :])
print("---")
print(x[:, indices])

tensor([[0, 1, 2],
        [6, 7, 8]])
---
tensor([[0, 1, 2],
        [6, 7, 8]])
---
tensor([[0, 2],
        [3, 5],
        [6, 8]])


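We can combine tensors by concatenating them along an existing dimension with `torch.cat` (rows are `dim=0`, columns are `dim=1`) or by stacking them along a new dimension with `torch.stack`: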

In [None]:
x = torch.arange(6).view(2,3)
describe(x)
describe(torch.cat([x, x], dim=0))
describe(torch.cat([x, x], dim=1))
describe(torch.stack([x, x]))

Type: torch.LongTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([4, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5],
        [0, 1, 2],
        [3, 4, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([2, 6])
Values: 
tensor([[0, 1, 2, 0, 1, 2],
        [3, 4, 5, 3, 4, 5]])
Type: torch.LongTensor
Shape/size: torch.Size([2, 2, 3])
Values: 
tensor([[[0, 1, 2],
         [3, 4, 5]],

        [[0, 1, 2],
         [3, 4, 5]]])


We can concatenate along dimension 1, the columns:

In [None]:
x = torch.arange(9).view(3,3)

print(x)
print("---")
new_x = torch.cat([x, x, x], dim=1)
print(new_x.shape)
print(new_x)

tensor([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]])
---
torch.Size([3, 9])
tensor([[0, 1, 2, 0, 1, 2, 0, 1, 2],
        [3, 4, 5, 3, 4, 5, 3, 4, 5],
        [6, 7, 8, 6, 7, 8, 6, 7, 8]])


We can also concatenate on a new 0th dimension to "stack" the tensors:

In [None]:
x = torch.arange(9).view(3,3)
print(x)
print("---")
new_x = torch.stack([x, x, x])
print(new_x.shape)
print(new_x)

tensor([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]])
---
torch.Size([3, 3, 3])
tensor([[[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]],

        [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]],

        [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]])
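One way to see the relationship between `stack` and `cat` is that stacking behaves like concatenating tensors that have first been given a new size-1 dimension. A minimal sketch with illustrative names:

```python
import torch

x = torch.arange(9).view(3, 3)

stacked = torch.stack([x, x, x])                       # new dim 0, shape (3, 3, 3)
via_cat = torch.cat([x.unsqueeze(0)] * 3, dim=0)       # same result built with cat
print(torch.equal(stacked, via_cat))                   # True

# The new dimension can be placed elsewhere, for example last.
stacked_last = torch.stack([x, x], dim=2)
print(stacked_last.shape)                              # torch.Size([3, 3, 2])
```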


#### Linear Algebra Tensor Functions

Transposing swaps two dimensions of a tensor. For a matrix, this turns all the rows into columns and vice versa.

In [None]:
x = torch.arange(0, 12).view(3,4)
print("x: \n", x)
print("---")
print("x.transpose(1, 0): \n", x.transpose(1, 0))

x: 
 tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
---
x.transpose(1, 0): 
 tensor([[ 0,  4,  8],
        [ 1,  5,  9],
        [ 2,  6, 10],
        [ 3,  7, 11]])


A three dimensional tensor would represent a batch of sequences, where each sequence item has a feature vector.  It is common to switch the batch and sequence dimensions so that we can more easily index the sequence in a sequence model.

Note: `transpose` swaps exactly two axes. `permute` (in the next cell) can reorder any number of axes at once.

In [None]:
batch_size = 3
seq_size = 4
feature_size = 5

x = torch.arange(batch_size * seq_size * feature_size).view(batch_size, seq_size, feature_size)

print("x.shape: \n", x.shape)
print("x: \n", x)
print("-----")

print("x.transpose(1, 0).shape: \n", x.transpose(1, 0).shape)
print("x.transpose(1, 0): \n", x.transpose(1, 0))

x.shape: 
 torch.Size([3, 4, 5])
x: 
 tensor([[[ 0,  1,  2,  3,  4],
         [ 5,  6,  7,  8,  9],
         [10, 11, 12, 13, 14],
         [15, 16, 17, 18, 19]],

        [[20, 21, 22, 23, 24],
         [25, 26, 27, 28, 29],
         [30, 31, 32, 33, 34],
         [35, 36, 37, 38, 39]],

        [[40, 41, 42, 43, 44],
         [45, 46, 47, 48, 49],
         [50, 51, 52, 53, 54],
         [55, 56, 57, 58, 59]]])
-----
x.transpose(1, 0).shape: 
 torch.Size([4, 3, 5])
x.transpose(1, 0): 
 tensor([[[ 0,  1,  2,  3,  4],
         [20, 21, 22, 23, 24],
         [40, 41, 42, 43, 44]],

        [[ 5,  6,  7,  8,  9],
         [25, 26, 27, 28, 29],
         [45, 46, 47, 48, 49]],

        [[10, 11, 12, 13, 14],
         [30, 31, 32, 33, 34],
         [50, 51, 52, 53, 54]],

        [[15, 16, 17, 18, 19],
         [35, 36, 37, 38, 39],
         [55, 56, 57, 58, 59]]])


`permute` is a more general version of `transpose`:

In [None]:
batch_size = 3
seq_size = 4
feature_size = 5

x = torch.arange(batch_size * seq_size * feature_size).view(batch_size, seq_size, feature_size)

print("x.shape: \n", x.shape)
print("x: \n", x)
print("-----")

print("x.permute(1, 0, 2).shape: \n", x.permute(1, 0, 2).shape)
print("x.permute(1, 0, 2): \n", x.permute(1, 0, 2))

x.shape: 
 torch.Size([3, 4, 5])
x: 
 tensor([[[ 0,  1,  2,  3,  4],
         [ 5,  6,  7,  8,  9],
         [10, 11, 12, 13, 14],
         [15, 16, 17, 18, 19]],

        [[20, 21, 22, 23, 24],
         [25, 26, 27, 28, 29],
         [30, 31, 32, 33, 34],
         [35, 36, 37, 38, 39]],

        [[40, 41, 42, 43, 44],
         [45, 46, 47, 48, 49],
         [50, 51, 52, 53, 54],
         [55, 56, 57, 58, 59]]])
-----
x.permute(1, 0, 2).shape: 
 torch.Size([4, 3, 5])
x.permute(1, 0, 2): 
 tensor([[[ 0,  1,  2,  3,  4],
         [20, 21, 22, 23, 24],
         [40, 41, 42, 43, 44]],

        [[ 5,  6,  7,  8,  9],
         [25, 26, 27, 28, 29],
         [45, 46, 47, 48, 49]],

        [[10, 11, 12, 13, 14],
         [30, 31, 32, 33, 34],
         [50, 51, 52, 53, 54]],

        [[15, 16, 17, 18, 19],
         [35, 36, 37, 38, 39],
         [55, 56, 57, 58, 59]]])
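The call above only swaps two axes, so it matches `transpose`. As a sketch of what `permute` adds, the following reorders all three axes at once, assuming the same `(batch, seq, feature)` layout used in the cells above.

```python
import torch

x = torch.arange(3 * 4 * 5).view(3, 4, 5)    # (batch, seq, feature)

# New order: feature first, then batch, then sequence.
y = x.permute(2, 0, 1)
print(y.shape)                                # torch.Size([5, 3, 4])

# Element y[i, j, k] is x[j, k, i].
print(y[4, 2, 3].item(), x[2, 3, 4].item())   # 59 59

# permute returns a view with rearranged strides, not a copy.
print(y.is_contiguous())                      # False
z = y.contiguous()                            # explicit copy in the new layout
print(z.is_contiguous())                      # True
```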


Matrix multiplication of two 2-D tensors is done with `mm`:

In [None]:
torch.randn(2, 3, requires_grad=True)

tensor([[-0.4790,  0.8539, -0.2285],
        [ 0.3081,  1.1171,  0.1585]], requires_grad=True)

In [None]:
x1 = torch.arange(6).view(2, 3).float()
describe(x1)

x2 = torch.ones(3, 2)
x2[:, 1] += 1
describe(x2)

describe(torch.mm(x1, x2))

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0., 1., 2.],
        [3., 4., 5.]])
Type: torch.FloatTensor
Shape/size: torch.Size([3, 2])
Values: 
tensor([[1., 2.],
        [1., 2.],
        [1., 2.]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 2])
Values: 
tensor([[ 3.,  6.],
        [12., 24.]])


In [None]:
x = torch.arange(0, 12).view(3,4).float()
print(x)

x2 = torch.ones(4, 2)
x2[:, 1] += 1
print(x2)

print(x.mm(x2))

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
tensor([[1., 2.],
        [1., 2.],
        [1., 2.],
        [1., 2.]])
tensor([[ 6., 12.],
        [22., 44.],
        [38., 76.]])


See the [PyTorch Math Operations Documentation](https://pytorch.org/docs/stable/torch.html#math-operations) for more!
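`mm` only accepts 2-D tensors. As a complementary sketch, `torch.matmul` (also available as the `@` operator) covers the 2-D case and additionally broadcasts over leading batch dimensions. The names `batch` and `weights` are illustrative.

```python
import torch

a = torch.arange(6).view(2, 3).float()
b = torch.ones(3, 2)

# For 2-D inputs, @ and torch.mm give the same result.
print(torch.equal(a @ b, torch.mm(a, b)))     # True

# A (3, 4, 5) batch times a single (5, 4) matrix: the matrix is reused per batch item.
batch = torch.rand(3, 4, 5)
weights = torch.rand(5, 4)
out = torch.matmul(batch, weights)
print(out.shape)                              # torch.Size([3, 4, 4])

# Same computation written with bmm and an explicitly expanded matrix.
expected = torch.bmm(batch, weights.unsqueeze(0).expand(3, 5, 4))
print(torch.allclose(out, expected, atol=1e-6))
```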

## Computing Gradients

In [None]:
x = torch.tensor([[2.0, 3.0]], requires_grad=True)
z = 3 * x
print(z)

tensor([[6., 9.]], grad_fn=<MulBackward0>)


This small snippet shows the gradient computation at work. We create a tensor and multiply it by 3, then reduce the result to a scalar with `sum()`. A scalar output is needed because it plays the role of the loss. Calling `backward()` on the loss computes its rate of change with respect to the inputs. Since the scalar was created with `sum()`, each position of `x` contributes to the loss independently through the matching position of `z`.

The rate of change of the loss with respect to each element of `x` is just the constant 3 that we multiplied `x` by.

In [None]:
x = torch.tensor([[2.0, 3.0]], requires_grad=True)
print("x: \n", x)
print("---")
z = 3 * x
print("z = 3*x: \n", z)
print("---")

loss = z.sum()
print("loss = z.sum(): \n", loss)
print("---")

loss.backward()

print("after loss.backward(), x.grad: \n", x.grad)


x: 
 tensor([[2., 3.]], requires_grad=True)
---
z = 3*x: 
 tensor([[6., 9.]], grad_fn=<MulBackward0>)
---
loss = z.sum(): 
 tensor(15., grad_fn=<SumBackward0>)
---
after loss.backward(), x.grad: 
 tensor([[3., 3.]])
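One detail that matters once `backward()` is called more than once: gradients accumulate in `.grad` rather than being overwritten. The sketch below repeats the same computation to show this; the `first`, `second` and `third` names exist only to record the intermediate values.

```python
import torch

x = torch.tensor([[2.0, 3.0]], requires_grad=True)

(3 * x).sum().backward()
first = x.grad.clone()
print(first)                   # tensor([[3., 3.]])

# A second backward pass adds to the stored gradient.
(3 * x).sum().backward()
second = x.grad.clone()
print(second)                  # tensor([[6., 6.]])

# Zero the gradient in place before the next pass to start fresh.
x.grad.zero_()
(3 * x).sum().backward()
third = x.grad.clone()
print(third)                   # tensor([[3., 3.]])
```

This is why training loops reset gradients before each backward pass.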


### Example: Computing a conditional gradient

Find the gradient of $f(x)$ at $x=1$, where

$$ f(x)=\begin{cases}
    \sin(x) & \text{if } x>0 \\
    \cos(x) & \text{otherwise}
\end{cases} $$

In [None]:
def f(x):
    # One branch is chosen for the whole tensor: sin only if every element is positive.
    if (x > 0).all():
        return torch.sin(x)
    else:
        return torch.cos(x)

In [None]:
x = torch.tensor([1.0], requires_grad=True)
y = f(x)
y.backward()
print(x.grad)

tensor([0.5403])


We could apply this to a larger vector too, but we need to make sure the output is a scalar:

In [None]:
x = torch.tensor([1.0, 0.5], requires_grad=True)
y = f(x)
describe(y)
# this is meant to break! can you fix it??
y.backward()
print(x.grad)

Type: torch.FloatTensor
Shape/size: torch.Size([2])
Values: 
tensor([0.8415, 0.4794], grad_fn=<SinBackward0>)
None


Making the output a scalar:

In [None]:
x = torch.tensor([1.0, 0.5], requires_grad=True)
y = f(x)
y.sum().backward()
print(x.grad)

tensor([0.5403, 0.8776])
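An alternative to summing first is to pass a `gradient` argument to `backward()`, giving the weight of each output element. With a tensor of ones this is equivalent to `y.sum().backward()`. The sketch below calls `torch.sin` directly rather than `f`, on the assumption that all inputs are positive so `f` would take the sine branch anyway.

```python
import torch

x = torch.tensor([1.0, 0.5], requires_grad=True)
y = torch.sin(x)               # non-scalar output, shape (2,)

# Supply d(loss)/dy explicitly; ones reproduce the effect of summing y.
y.backward(gradient=torch.ones_like(y))
print(x.grad)                  # cos(x), approximately tensor([0.5403, 0.8776])
```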


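However, there is an issue: the result is wrong when the elements of `x` differ in sign. Because the `.all()` check in `f` is false here, `cos` is applied to both elements, so the first gradient is $-\sin(1) \approx -0.8415$ instead of the expected $\cos(1) \approx 0.5403$: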

In [None]:
x = torch.tensor([1.0, -1], requires_grad=True)
y = f(x)
y.sum().backward()
print(x.grad)

tensor([-0.8415,  0.8415])


In [None]:
x = torch.tensor([-0.5, -1], requires_grad=True)
y = f(x)
y.sum().backward()
print(x.grad)

tensor([0.4794, 0.8415])


The problem is that the boolean test, and therefore the choice between `cos` and `sin`, is made once for the whole tensor rather than elementwise. A common solution is masking:

In [None]:
def f2(x):
    mask = torch.gt(x, 0).float()
    return mask * torch.sin(x) + (1 - mask) * torch.cos(x)

x = torch.tensor([1.0, -1], requires_grad=True)
y = f2(x)
y.sum().backward()
print(x.grad)

tensor([0.5403, 0.8415])
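The same elementwise selection can also be written with `torch.where`, which picks from one tensor or the other according to a boolean condition. This is an alternative sketch, not the notebook's method; `f3` is a hypothetical name chosen to avoid clashing with `f` and `f2`.

```python
import torch

def f3(x):
    # Elementwise: sin where x > 0, cos elsewhere.
    return torch.where(x > 0, torch.sin(x), torch.cos(x))

x = torch.tensor([1.0, -1.0], requires_grad=True)
y = f3(x)
y.sum().backward()
print(x.grad)   # cos(1) and -sin(-1), approximately tensor([0.5403, 0.8415])
```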


In [None]:
def describe_grad(x):
    if x.grad is None:
        print("No gradient information")
    else:
        print("Gradient: \n{}".format(x.grad))
        print("Gradient Function: {}".format(x.grad_fn))

In [None]:
import torch
x = torch.ones(2, 2, requires_grad=True)
describe(x)
describe_grad(x)
print("--------")

y = (x + 2) * (x + 5) + 3
describe(y)
z = y.mean()
describe(z)
describe_grad(x)
print("--------")
z.backward(create_graph=True, retain_graph=True)
describe_grad(x)
print("--------")


Type: torch.FloatTensor
Shape/size: torch.Size([2, 2])
Values: 
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
No gradient information
--------
Type: torch.FloatTensor
Shape/size: torch.Size([2, 2])
Values: 
tensor([[21., 21.],
        [21., 21.]], grad_fn=<AddBackward0>)
Type: torch.FloatTensor
Shape/size: torch.Size([])
Values: 
21.0
No gradient information
--------
Gradient: 
tensor([[2.2500, 2.2500],
        [2.2500, 2.2500]], grad_fn=<CopyBackwards>)
Gradient Function: None
--------


In [None]:
x = torch.ones(2, 2, requires_grad=True)

In [None]:
y = x + 2

In [None]:
y.grad_fn

<AddBackward0 at 0x78d41f2ac700>

### CUDA Tensors

CUDA tensors are tensors allocated on a GPU so that operations on them run as CUDA operations. CUDA, which stands for Compute Unified Device Architecture, is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows developers to use NVIDIA GPUs for general-purpose processing (an approach known as GPGPU, General-Purpose computing on Graphics Processing Units).

In [8]:
url = "https://www.youtube.com/embed/r9IqwpMR9TE?si=IYtbYya_rpxv1a1N"
iframe = IFrame(url, width=560, height=315)
display.display(iframe)

PyTorch's operations can seamlessly be used on the GPU or on the CPU.  There are a couple basic operations for interacting in this way.

In [None]:
print(torch.cuda.is_available())

True


In [None]:
x = torch.rand(3,3)
describe(x)

Type: torch.FloatTensor
Shape/size: torch.Size([3, 3])
Values: 
tensor([[0.9149, 0.3993, 0.1100],
        [0.2541, 0.4333, 0.4451],
        [0.4966, 0.7865, 0.6604]])


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [None]:
x = torch.rand(3, 3).to(device)
describe(x)
print(x.device)

Type: torch.cuda.FloatTensor
Shape/size: torch.Size([3, 3])
Values: 
tensor([[0.1303, 0.3498, 0.3824],
        [0.8043, 0.3186, 0.2908],
        [0.4196, 0.3728, 0.3769]], device='cuda:0')
cuda:0


In [12]:
cpu_device = torch.device("cpu")

In [14]:
# this will break!
y = torch.rand(3, 3).to("cuda")
x = torch.rand(3, 3)
x + y

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

In [15]:
y = y.to(cpu_device)
x = x.to(cpu_device)
x + y

tensor([[1.0452, 0.7491, 0.4924],
        [1.0584, 0.7519, 0.7359],
        [0.9162, 1.1594, 1.0373]])

In [17]:
if torch.cuda.is_available(): # only if a GPU is available
    a = torch.rand(3,3).to(device='cuda:0') #  CUDA Tensor
    print(a)

    b = torch.rand(3,3).cuda()
    print(b)

    print(a + b)

    # Error expected
    a = a.cpu()
    print(a + b)

tensor([[0.5274, 0.6325, 0.0910],
        [0.2323, 0.7269, 0.1187],
        [0.3951, 0.7199, 0.7595]], device='cuda:0')
tensor([[0.5311, 0.6449, 0.7224],
        [0.4416, 0.3634, 0.8818],
        [0.9874, 0.7316, 0.2814]], device='cuda:0')
tensor([[1.0585, 1.2775, 0.8134],
        [0.6739, 1.0903, 1.0006],
        [1.3825, 1.4515, 1.0409]], device='cuda:0')


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
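A common way to avoid the device mismatch shown above is to pick one device up front and place every tensor on it. The following is a minimal device-agnostic sketch that also runs on a machine without a GPU; the variable names are illustrative.

```python
import torch

# Fall back to the CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create directly on the target device, or move an existing tensor there.
a = torch.rand(3, 3, device=device)
b = torch.rand(3, 3).to(a.device)

total = a + b                  # both operands live on the same device
print(total.device)

# Bring the result back to the CPU before converting to NumPy.
result = total.cpu().numpy()
print(result.shape)            # (3, 3)
```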

### Exercises
(Answers are at the bottom)

#### References

1. PyTorch documentation. (2023). PyTorch 2.2 documentation. [https://pytorch.org/docs/stable/index.html](https://pytorch.org/docs/stable/index.html).

#### Exercise 1

Create a 2D tensor and then add a dimension of size 1 inserted at the 0th axis.

In [19]:
t = torch.rand(3,3)
describe(t)
describe(t.unsqueeze(0))


#### Exercise 2

Remove the extra dimension you just added to the previous tensor.


#### Exercise 3

Create a random tensor of shape 5x3 in the interval [3, 7)

#### Exercise 4

Create a tensor with values from a normal distribution (mean=0, std=1).

#### Exercise 5

Retrieve the indexes of all the non zero elements in the tensor torch.Tensor([1, 1, 1, 0, 1]).

#### Exercise 6

Create a random tensor of size (3,1) and then horizontally stack 4 copies together.

#### Exercise 7

Return the batch matrix-matrix product of two 3 dimensional matrices (a=torch.rand(3,4,5), b=torch.rand(3,5,4)).

#### Exercise 8

Return the batch matrix-matrix product of a 3D matrix and a 2D matrix (a=torch.rand(3,4,5), b=torch.rand(5,4)).

### Answers below

#### Exercise 1

Create a 2D tensor and then add a dimension of size 1 inserted at the 0th axis.

In [None]:
a = torch.rand(3,3)
a = a.unsqueeze(0)
print(a)
print(a.shape)

#### Exercise 2

Remove the extra dimension you just added to the previous tensor.

In [None]:
a = a.squeeze(0)
print(a.shape)

#### Exercise 3

Create a random tensor of shape 5x3 in the interval [3, 7)

In [None]:
3 + torch.rand(5, 3) * 4

#### Exercise 4

Create a tensor with values from a normal distribution (mean=0, std=1).

In [None]:
a = torch.rand(3,3)
a.normal_(mean=0, std=1)

#### Exercise 5

Retrieve the indexes of all the non zero elements in the tensor torch.Tensor([1, 1, 1, 0, 1]).

In [None]:
a = torch.Tensor([1, 1, 1, 0, 1])
torch.nonzero(a)

#### Exercise 6

Create a random tensor of size (3,1) and then horizontally stack 4 copies together.

In [None]:
a = torch.rand(3,1)
a.expand(3,4)

#### Exercise 7

Return the batch matrix-matrix product of two 3 dimensional matrices (a=torch.rand(3,4,5), b=torch.rand(3,5,4)).

In [None]:
a = torch.rand(3,4,5)
b = torch.rand(3,5,4)
torch.bmm(a, b)

#### Exercise 8

Return the batch matrix-matrix product of a 3D matrix and a 2D matrix (a=torch.rand(3,4,5), b=torch.rand(5,4)).

In [None]:
a = torch.rand(3,4,5)
b = torch.rand(5,4)
torch.bmm(a, b.unsqueeze(0).expand(a.size(0), *b.size()))

tensor([[[2.3908, 1.8123, 2.2887, 1.2006],
         [1.6499, 1.2374, 1.8164, 0.8209],
         [1.9252, 1.5492, 1.8218, 1.3872],
         [2.3996, 1.5903, 2.0939, 1.5061]],

        [[2.5568, 1.9145, 2.3988, 1.5418],
         [2.3120, 1.6481, 1.8575, 1.3524],
         [1.1413, 0.8692, 1.2374, 0.7383],
         [2.1882, 1.9155, 2.1293, 1.5624]],

        [[1.5678, 1.2558, 1.3171, 0.7612],
         [1.9138, 1.2136, 1.7878, 1.1857],
         [1.9839, 1.2644, 1.7041, 1.2728],
         [1.6183, 1.4240, 1.5611, 0.9084]]])

### END