# Recitation 1: PyTorch tutorial
_Date_: 2025-09-04

## Tensor

### Introduction
From a computational perspective, a tensor is a data structure that represents a multi-dimensional array. The terms "vector" and "matrix" sound more familiar than "tensor", but "tensor" is the more general term and covers both. That is,

* A vector is a one-dimensional tensor in $\mathbb{R}^{n}$: $$\mathbf{x} = [1, 5, 9]$$
* A matrix is a two-dimensional tensor in $\mathbb{R}^{n \times m}$: $$X = \begin{bmatrix} 2 & 4 & 6 \\ 7 & 10 & 3\end{bmatrix} = \begin{bmatrix} \text{---} & \mathbf{x_1} & \text{---} \\ \text{---} & \mathbf{x_2} & \text{---} \end{bmatrix}$$
* A three-dimensional tensor is an element of $\mathbb{R}^{n \times m \times k}$
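
As a small illustration, the vector and matrix above can be written as PyTorch tensors (PyTorch is introduced in the next subsection). The three-dimensional tensor `T` is not from the text; it is built here by stacking two matrices purely to show a third axis.

```python
import torch

# Values taken from the examples in the list above
x = torch.tensor([1, 5, 9])                  # vector, shape (3,)
X = torch.tensor([[2, 4, 6], [7, 10, 3]])    # matrix, shape (2, 3)

# Assumption for illustration: stack two (2 x 3) matrices into a (2 x 2 x 3) tensor
T = torch.stack([X, X + 1])

for name, t in [("x", x), ("X", X), ("T", T)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```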

### PyTorch's tensor
Implementation-wise, it is similar to NumPy's `ndarray`, but adds features needed for deep learning:
* `autograd`, which computes the gradients that back propagation uses to update the weights of a neural net
* storage in GPU memory, so that matrix multiplications can run in parallel
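
A minimal sketch of both features, assuming a toy "loss" that is just the dot product of a weight vector and an input vector. The variable names (`w`, `x`, `loss`) are illustrative, and the code falls back to the CPU when no GPU is available.

```python
import torch

# Use a GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# requires_grad=True asks autograd to track every operation applied to w
w = torch.tensor([1.0, 2.0, 3.0], device=device, requires_grad=True)
x = torch.tensor([4.0, 5.0, 6.0], device=device)

loss = (w * x).sum()   # toy loss: dot product of w and x
loss.backward()        # back propagation stores d(loss)/dw in w.grad

print(f"device: {w.device}")
print(f"gradient of loss w.r.t. w: {w.grad}")  # equals x for this loss
```

Because the loss is linear in `w`, the gradient is simply `x`, which makes the result easy to check by hand.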

### Exercise: `torch` basics

In [17]:
import torch

In [18]:
# Exercise 1: Initialize a tensor
v = torch.zeros(5)  # vector
X = torch.zeros(5,2)  # matrix
T = torch.zeros(5,2,3)   # 3-dimensional tensor

print(f"v:\n{'-' * 5}\n{v}, shape:{v.shape}, dimensions: {v.ndim}")
print(f"\nX:\n{'-' * 5}\n{X}, shape:{X.shape}, dimensions: {X.ndim}")
print(f"\nT:\n{'-' * 5}\n{T}, shape:{T.shape}, dimensions: {T.ndim}")

v:
-----
tensor([0., 0., 0., 0., 0.]), shape:torch.Size([5]), dimensions: 1

X:
-----
tensor([[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]]), shape:torch.Size([5, 2]), dimensions: 2

T:
-----
tensor([[[0., 0., 0.],
         [0., 0., 0.]],

        [[0., 0., 0.],
         [0., 0., 0.]],

        [[0., 0., 0.],
         [0., 0., 0.]],

        [[0., 0., 0.],
         [0., 0., 0.]],

        [[0., 0., 0.],
         [0., 0., 0.]]]), shape:torch.Size([5, 2, 3]), dimensions: 3


In [19]:
# Exercise 2: Random or constant tensor
ones_tensor = torch.full((5,), 1)  # tensor with all ones
zeros_tensor = torch.zeros(5) # tensor with all zeros
rand_tensor = torch.rand(5)  # tensor with random values

print(f"ones:\n{'-' * 5}\n{ones_tensor}")
print(f"\nzeros:\n{'-' * 5}\n{zeros_tensor}")
print(f"\nrandom:\n{'-' * 5}\n{rand_tensor}")

ones:
-----
tensor([1, 1, 1, 1, 1])

zeros:
-----
tensor([0., 0., 0., 0., 0.])

random:
-----
tensor([0.5980, 0.8485, 0.0442, 0.4108, 0.0837])


In [20]:
# Exercise 3: Get access to attributes of a tensor
tensor = torch.zeros(3,4)  # Initialize a (3 x 4) tensor
print(f"Shape of tensor: {tensor.shape}")
print(f"\nData type of tensor: {tensor.dtype}")
print(f"\nDevice of tensor: {tensor.device}")

Shape of tensor: torch.Size([3, 4])

Data type of tensor: torch.float32

Device of tensor: cpu


In [21]:
# Exercise 4: Indexing and slicing
tensor = torch.zeros(4,4)  # Initialize a (4 x 4) tensor

print(f"Tensor as:\n{'-' * 20}\n{tensor}")
print(f"\nFirst row of tensor:\n{'-' * 20}\n{tensor[0]}")
print(f"\nLast column of tensor:\n{'-' * 20}\n{tensor[:, -1]}")

Tensor as:
--------------------
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

First row of tensor:
--------------------
tensor([0., 0., 0., 0.])

Last column of tensor:
--------------------
tensor([0., 0., 0., 0.])


In [22]:
# Exercise 5: Manipulation
x = torch.rand(2,4)  # Initialize a (2 x 4) tensor
print(f"x as:\n{'-' * 20}\n{x}")

x[:,1].zero_()      # Set all values of the second column to 0 in place (masking)
print(f"\nAfter masking:\n{'-' * 20}\n{x}")

y = torch.rand(1,4)  # Initialize a (1 x 4) tensor
print(f"\ny as:\n{'-' * 20}\n{y}")

xy = torch.cat((x,y)) # Vertically stack x and y, resulting in a (3 x 4) tensor
print(f"\nAfter stacking:\n{'-' * 20}\n{xy}")

print(f"\n dimensions of stacked tensor:\n{xy.shape}")

x as:
--------------------
tensor([[0.5595, 0.3762, 0.2428, 0.5637],
        [0.6477, 0.9716, 0.8429, 0.1957]])

After masking:
--------------------
tensor([[0.5595, 0.0000, 0.2428, 0.5637],
        [0.6477, 0.0000, 0.8429, 0.1957]])

y as:
--------------------
tensor([[0.6036, 0.0696, 0.8351, 0.6708]])

After stacking:
--------------------
tensor([[0.5595, 0.0000, 0.2428, 0.5637],
        [0.6477, 0.0000, 0.8429, 0.1957],
        [0.6036, 0.0696, 0.8351, 0.6708]])

 dimensions of stacked tensor:
torch.Size([3, 4])


In [23]:
# Exercise 6: Arithmetic operations
u = torch.randn(3,4)  # Initialize a (3 x 4) tensor
v = torch.randn(4,2)  # Initialize a (4 x 2) tensor
print(f"u:\n{'-' * 20}\n{u}")
print(f"\nv:\n{'-' * 20}\n{v}")

u_times_v = u @ v  # multiply u and v using operator ``@``
u_times_v = torch.matmul(u, v)  # multiply u and v using the function ``torch.matmul``
print(f"\nu @ v:\n{'-' * 20}\n{u_times_v}")

k = torch.randn(3,4)  # Initialize a (3 x 4) tensor
u_elem_times_k = u * k  # Compute element-wise product between u and k using ``*``
u_elem_times_k = torch.mul(u,k)  # Compute element-wise product between u and k using ``torch.mul``
print(f"\nu * k:\n{'-' * 20}\n{u_elem_times_k}")

u:
--------------------
tensor([[ 0.0559, -0.3794,  1.5184,  0.6403],
        [ 0.4333, -0.5603, -0.2830,  0.3149],
        [ 0.0869,  1.7221,  0.4788,  1.5434]])

v:
--------------------
tensor([[-0.2789,  0.3984],
        [ 0.8057,  0.8326],
        [-0.6939, -0.4142],
        [-0.0186,  0.1564]])

u @ v:
--------------------
tensor([[-1.3868, -0.8225],
        [-0.3817, -0.1274],
        [ 1.0023,  1.5116]])

u * k:
--------------------
tensor([[-0.1500,  0.2362, -1.9582, -0.3361],
        [ 0.0903,  0.2550,  0.0290,  0.0069],
        [ 0.0156, -0.8477, -0.4300, -1.0642]])


## Datasets & Dataloaders

### Introduction
Data preprocessing is one of the most fundamental steps in the standard pipeline for training a neural network. In NLP tasks specifically, this step performs feature representation, which translates text data into numerical values. It commonly does the following:
* Clean and organize text data (e.g. remove punctuation and special characters, and split the dataset)
* Featurize text into numbers, which may include the following steps
  * Build vocabulary from corpus
  * Figure out how to numerically represent (or encode) a string given the vocabulary
    * sparse encoding?
    * dense encoding?
  * Encode labels

Finally, implement the steps above as Python classes or functions and integrate them with PyTorch's `torch.utils.data.Dataset` class.
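
The cleaning step in the list above can be sketched with a regular expression. This is one possible approach, not the one used in the exercises below; the helper name `clean_and_tokenize` and the choice to keep only lowercase letters and digits are assumptions made for illustration.

```python
import re
from typing import List


def clean_and_tokenize(text: str) -> List[str]:
    """Lowercase the text, replace punctuation with spaces, and split on whitespace."""
    text = text.lower()
    # Assumption: anything that is not a letter, digit, or whitespace counts as noise
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


print(clean_and_tokenize("Although the weather was cold, we enjoyed our walk"))
# ['although', 'the', 'weather', 'was', 'cold', 'we', 'enjoyed', 'our', 'walk']
```

Without this step, a plain whitespace split keeps punctuation attached to words, so `cold,` and `cold` would be treated as different tokens.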

### PyTorch dataset and dataloaders
`Dataset` can be thought of as a list of data samples, which means it is indexed like a Python list. Normally, this class includes the implementation of featurization, so accessing an individual sample returns numerical values.

`DataLoader` wraps an iterable around a `Dataset` object, so samples are fetched on demand in the training loop instead of all at once (recall Python's generators), and it can group samples into batches.
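
A minimal sketch of how the two classes fit together, assuming a made-up dataset of 10 samples with 3 features each. `ToyDataset` and its contents are hypothetical and exist only to show the list-like indexing and the batching.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples, each with 3 features and a binary label."""

    def __init__(self):
        self.features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
        self.labels = torch.arange(10) % 2

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, idx: int):
        return self.features[idx], self.labels[idx]


dataset = ToyDataset()
print(len(dataset))    # a Dataset has a length, like a list
print(dataset[0])      # and can be indexed, like a list

# DataLoader iterates over the dataset and groups samples into batches
loader = DataLoader(dataset, batch_size=4, shuffle=False)
for batch_features, batch_labels in loader:
    print(batch_features.shape, batch_labels.shape)
```

With 10 samples and `batch_size=4`, the loader yields two batches of 4 and a final batch of 2.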

### Exercise: preprocess dataset

In [26]:
from typing import Dict, List, Tuple
from torch.utils.data import Dataset, DataLoader

In [27]:
# The corpus below is AI-generated
corpus = """
The old bookstore on the corner smelled of paper and dust
Have you ever wondered what lies beyond the farthest star
Please hand me the blue folder from the top shelf
Although the weather was cold, we enjoyed our walk along the beach
What an incredible view from the mountaintop
He practiced the piano for an hour every day; his dedication was admirable
The new software update will be installed automatically tonight
She brewed a cup of tea and watched the rain fall outside her window
Innovation often arises from the intersection of different fields of study
The children laughed as the puppy chased its tail in circles
Can we reschedule our meeting for early next week
The project was a success, but there were many challenges along the way
"""

# Generate synthetic labels
# Note: the corpus above has 12 sentences, but only 10 labels are listed here
labels = ['y', 'n', 'n', 'n', 'y', 'y', 'n', 'n', 'y', 'y']

In [63]:
# Exercise 1: Implement functions for tokenizing a sentence into tokens
def tokenize(instance: str) -> List[str]:
    """Tokenize a text instance into a list of lowercase tokens"""
    tokens = instance.lower().strip().split()
    return tokens

tokenize("The old bookstore on the corner smelled of paper and dust")

['the',
 'old',
 'bookstore',
 'on',
 'the',
 'corner',
 'smelled',
 'of',
 'paper',
 'and',
 'dust']

In [64]:
words = []

for sentence in corpus.split('\n'):
    words.extend(tokenize(sentence))

words

['the',
 'old',
 'bookstore',
 'on',
 'the',
 'corner',
 'smelled',
 'of',
 'paper',
 'and',
 'dust',
 'have',
 'you',
 'ever',
 'wondered',
 'what',
 'lies',
 'beyond',
 'the',
 'farthest',
 'star',
 'please',
 'hand',
 'me',
 'the',
 'blue',
 'folder',
 'from',
 'the',
 'top',
 'shelf',
 'although',
 'the',
 'weather',
 'was',
 'cold,',
 'we',
 'enjoyed',
 'our',
 'walk',
 'along',
 'the',
 'beach',
 'what',
 'an',
 'incredible',
 'view',
 'from',
 'the',
 'mountaintop',
 'he',
 'practiced',
 'the',
 'piano',
 'for',
 'an',
 'hour',
 'every',
 'day;',
 'his',
 'dedication',
 'was',
 'admirable',
 'the',
 'new',
 'software',
 'update',
 'will',
 'be',
 'installed',
 'automatically',
 'tonight',
 'she',
 'brewed',
 'a',
 'cup',
 'of',
 'tea',
 'and',
 'watched',
 'the',
 'rain',
 'fall',
 'outside',
 'her',
 'window',
 'innovation',
 'often',
 'arises',
 'from',
 'the',
 'intersection',
 'of',
 'different',
 'fields',
 'of',
 'study',
 'the',
 'children',
 'laughed',
 'as',
 'the',
 'p

In [43]:
# Exercise 2: Implement functions for building vocabulary and label map

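def build_vocabulary(tokens: List[str], most_common: int) -> Dict[str, int]:
    """Map the `most_common` most frequent tokens, plus '<PAD>' and '<UNK>', to indices"""
    from collections import Counter
    word_freq = Counter(tokens).most_common(most_common)
    vocab = [word for word, _ in word_freq]
    vocab.extend(['<PAD>', '<UNK>'])

    return {w: i for i, w in enumerate(vocab)}


def build_label_map(labels: List[str]) -> Dict[str, int]:
    """Map each distinct label to an index"""
    # Sort the labels so the mapping is the same on every run (set order is not guaranteed)
    label_set = sorted(set(labels))

    return {label: i for i, label in enumerate(label_set)}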

In [44]:
k = 5
vocab = build_vocabulary(words, k)
label_map = build_label_map(labels)

print(f"Vocabulary with most frequent {k} words:\n{'-' * 50}\n{vocab}")
print(f"\nLabel map:\n{'-' * 50}\n{label_map}")

Vocabulary with most frequent 5 words:
--------------------------------------------------
{'the': 0, 'of': 1, 'from': 2, 'was': 3, 'and': 4, '<PAD>': 5, '<UNK>': 6}

Label map:
--------------------------------------------------
{'n': 0, 'y': 1}


In [65]:
# Exercise 3: Implement functions for featurizing processed data into numerical representations
def to_sparse_vector(instance: str, vocab: Dict[str, int]) -> List[int]:
    """Encode a sentence to a sparse vector by multi-hot encoding"""
    # convert the sentence to a set of unique tokens
    token_set = set(instance.lower().split())
    # for each token found in the vocabulary, set its position in the vector to 1
    vector = [0] * len(vocab)
    for token in token_set:
        if token in vocab:
            vector[vocab[token]] = 1
    return vector

def to_dense_vector(instance: str, vocab: Dict[str, int]) -> List[int]:
    """Encode a sentence as a list of integers"""
    tokens = tokenize(instance)
    vector = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    return vector

In [66]:
sparse_vector = [to_sparse_vector(sentence, vocab) for sentence in corpus.strip().split('\n')]
dense_vector = [to_dense_vector(sentence, vocab) for sentence in corpus.strip().split('\n')]

In [54]:
sparse_vector

[[1, 1, 0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 0, 0],
 [1, 0, 1, 0, 0, 0, 0],
 [1, 0, 0, 1, 0, 0, 0],
 [1, 0, 1, 0, 0, 0, 0],
 [1, 0, 0, 1, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0],
 [1, 1, 0, 0, 1, 0, 0],
 [1, 1, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 1, 0, 0, 0]]

In [67]:
dense_vector

[[0, 6, 6, 6, 0, 6, 6, 1, 6, 4, 6],
 [6, 6, 6, 6, 6, 6, 6, 0, 6, 6],
 [6, 6, 6, 0, 6, 6, 2, 0, 6, 6],
 [6, 0, 6, 3, 6, 6, 6, 6, 6, 6, 0, 6],
 [6, 6, 6, 6, 2, 0, 6],
 [6, 6, 0, 6, 6, 6, 6, 6, 6, 6, 6, 3, 6],
 [0, 6, 6, 6, 6, 6, 6, 6, 6],
 [6, 6, 6, 6, 1, 6, 4, 6, 0, 6, 6, 6, 6, 6],
 [6, 6, 6, 2, 0, 6, 1, 6, 6, 1, 6],
 [0, 6, 6, 6, 0, 6, 6, 6, 6, 6, 6],
 [6, 6, 6, 6, 6, 6, 6, 6, 6],
 [0, 6, 3, 6, 6, 6, 6, 6, 6, 6, 6, 0, 6]]
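
The dense vectors above have different lengths, so they cannot be combined into a single tensor as they are. One common remedy, sketched here as an assumption about how the `<PAD>` entry of the vocabulary might be used, is to pad every sequence to the length of the longest one with `torch.nn.utils.rnn.pad_sequence`. `PAD_ID = 5` is the index of `<PAD>` in the vocabulary printed earlier.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 5  # index of '<PAD>' in the vocabulary printed earlier

# First three dense vectors, copied from the output above
dense = [
    [0, 6, 6, 6, 0, 6, 6, 1, 6, 4, 6],
    [6, 6, 6, 6, 6, 6, 6, 0, 6, 6],
    [6, 6, 6, 0, 6, 6, 2, 0, 6, 6],
]

# Pad each sequence on the right up to the length of the longest sequence
padded = pad_sequence(
    [torch.tensor(v) for v in dense],
    batch_first=True,        # result has shape (batch, max_length)
    padding_value=PAD_ID,
)

print(padded.shape)  # sequence lengths 11, 10, 10 give torch.Size([3, 11])
print(padded)
```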

In [None]:
# Exercise 4: Implement the custom dataset using ``Dataset``
class CustomDataset(Dataset):
    def __init__(self, corpus: str, labels: List[str], top_k: int):
        ...

    def __len__(self) -> int:
        ...

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        ...

In [None]:
corpus_dataset = CustomDataset(corpus, labels, 5)

In [None]:
print(f"Length of the dataset is {len(corpus_dataset)}")

i = 3
sp_v, ds_v, y = corpus_dataset[i]
print(f"\nFor the {i+1}th element of the dataset:\n{'-' * 50}")
print(f"\nSparse vector:\n{'-' * 10}\n{sp_v}")
print(f"\nDense vector:\n{'-' * 10}\n{ds_v}")
print(f"\nLabel:\n{'-' * 10}\n{y}")
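
One possible way to fill in the skeleton is sketched below. It is not the reference solution: the class name `CorpusDataset`, the three-sentence excerpt, and its labels are hypothetical, and the tokenization and vocabulary logic is inlined so the block runs on its own. The constructor raises an error when the number of sentences and labels differ, which would be the case for the 12 sentences and 10 labels defined earlier.

```python
from collections import Counter
from typing import Dict, List, Tuple

import torch
from torch.utils.data import Dataset


class CorpusDataset(Dataset):
    """Hypothetical sketch following the signature of the skeleton above."""

    def __init__(self, corpus: str, labels: List[str], top_k: int):
        # One sentence per non-empty line
        self.sentences = [s for s in corpus.strip().split("\n") if s.strip()]
        if len(self.sentences) != len(labels):
            raise ValueError("corpus and labels must have the same length")
        self.labels = labels

        # Vocabulary: the top_k most frequent tokens plus two special tokens
        tokens = [t for s in self.sentences for t in s.lower().split()]
        words = [w for w, _ in Counter(tokens).most_common(top_k)]
        words.extend(["<PAD>", "<UNK>"])
        self.vocab: Dict[str, int] = {w: i for i, w in enumerate(words)}
        self.label_map = {l: i for i, l in enumerate(sorted(set(labels)))}

    def __len__(self) -> int:
        return len(self.sentences)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        tokens = self.sentences[idx].lower().split()
        sparse = torch.zeros(len(self.vocab))   # multi-hot vector
        dense = []                              # one vocabulary index per token
        for t in tokens:
            if t in self.vocab:
                sparse[self.vocab[t]] = 1.0
            dense.append(self.vocab.get(t, self.vocab["<UNK>"]))
        label = torch.tensor(self.label_map[self.labels[idx]])
        return sparse, torch.tensor(dense), label


# Assumption: a three-sentence excerpt of the corpus with made-up labels
small_corpus = """
The old bookstore on the corner smelled of paper and dust
Please hand me the blue folder from the top shelf
What an incredible view from the mountaintop
"""
small_labels = ["y", "n", "y"]

ds = CorpusDataset(small_corpus, small_labels, top_k=2)
sp_v, ds_v, y = ds[0]
print(f"length: {len(ds)}, vocab: {ds.vocab}")
print(f"sparse: {sp_v}\ndense: {ds_v}\nlabel: {y}")
```

Because the dense vectors differ in length from sample to sample, batching this dataset with a `DataLoader` would also need padding, for example through a custom `collate_fn`.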