## Machine Learning and Artificial Intelligence 
Summer High School Academic Program for Engineers (2025)
## PyTorch Basics

Torch is an open-source machine learning library and scientific computing framework. PyTorch is the Python frontend to Torch.  

PyTorch provides the *tensor* data structure, which is similar to a NumPy ndarray, but with added features. Notably, tensor operations can be accelerated on a GPU. Tensors also
support automatic differentiation / backpropagation, so we do not have to implement this from scratch. 

PyTorch comes with a higher-level library to construct neural networks (`torch.nn`), and provides infrastructure for working with data sets. 

In [6]:
import torch

### Tensors

Tensors are used for 1) storing data, 2) storing activations (and pre-activations) in a neural network, and 3) storing parameters.
They support most of the same operations as numpy arrays do. 
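
To make the similarity with NumPy concrete, here is a minimal sketch that converts between the two and applies a few familiar operations. The array values are made up for illustration, and the sketch assumes NumPy is installed alongside PyTorch.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])   # illustrative values
x = torch.from_numpy(arr)                  # tensor that shares memory with arr

doubled = x * 2            # element-wise arithmetic, as in NumPy
col_sums = x.sum(dim=0)    # NumPy calls this argument `axis`, PyTorch calls it `dim`
back = doubled.numpy()     # convert back to a NumPy array

print(doubled)
print(col_sums)
print(type(back))
```

The main naming difference to remember is `dim` in PyTorch versus `axis` in NumPy.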

In [12]:
t = torch.tensor([1,2,3])
t.shape

torch.Size([3])

In [18]:
W = torch.tensor([[1,2,3],[4,5,6],[7,8,9]])

In [29]:
W

tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

In [35]:
W.shape

torch.Size([3, 3])

In [33]:
W[0,1] #Note: the result is a 0D tensor = a scalar

tensor(2)

In [37]:
W[0,1].shape

torch.Size([])

In [20]:
W @ t 

tensor([14, 32, 50])

In [22]:
W.T

tensor([[1, 4, 7],
        [2, 5, 8],
        [3, 6, 9]])

In [24]:
t.reshape(-1,1)

tensor([[1],
        [2],
        [3]])

### Autograd

When we create a tensor we can set the parameter `requires_grad=True`. PyTorch will then track operations on these tensors, 
and automatically construct a computational graph in the background. 

When we call the `.backward()` method on a tensor, PyTorch traverses this graph in reverse and applies the chain rule to compute gradients automatically.


In [56]:
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

y = (a + b) * (b + 1)  # computation graph built here

y.backward()  # compute dy/da and dy/db
print(f"partial derivative w.r.t a: {a.grad}, partial derivative w.r.t b: {b.grad}")

partial derivative w.r.t a: 4.0, partial derivative w.r.t b: 9.0
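
As a sanity check, the same partial derivatives can be worked out by hand with the product rule and compared to what autograd reports. The sketch below repeats the computation from the cell above; the helper function `f` is introduced only for this illustration.

```python
import torch

def f(a, b):
    return (a + b) * (b + 1)

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
f(a, b).backward()

# Hand-derived partial derivatives (product rule)
dy_da = b.item() + 1                             # d/da [(a+b)(b+1)] = b + 1
dy_db = (b.item() + 1) + (a.item() + b.item())   # d/db [(a+b)(b+1)] = (b+1) + (a+b)

print(a.grad.item(), dy_da)   # 4.0 4.0
print(b.grad.item(), dy_db)   # 9.0 9.0
```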


We can use this to rebuild the neural network we created from scratch in numpy. 

In [256]:
W1 = torch.rand(10 ,4, requires_grad = True)
b1 = torch.rand(10, requires_grad = True)

W2 = torch.rand(10, 10, requires_grad = True)
b2 = torch.rand(10, requires_grad = True)

W3 = torch.rand(3, 10, requires_grad = True)
b3 = torch.rand(3, requires_grad = True)

def loss(prediction, y_one_hot):
    return -torch.sum(y_one_hot * torch.log(prediction)) 

def forward(x):
    z1 = W1 @ x + b1
    a1 = torch.sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = torch.sigmoid(z2)
    scores = W3 @ a2 + b3    
    print(scores)
    probs = torch.softmax(scores,-1)
    return probs

def step(x,y): 
    probs = forward(x)
    loss_val = loss(probs, y)
    loss_val.backward()  # THIS DOES THE BACKPROP FOR US :-)

    

In [138]:
# Dummy data
data_x = torch.tensor([   # 4 attributes 
    [0.5, -0.2, 0.1, 0.4],
    [1.5,  0.2, 1.1, -0.4],
    [0.3,  0.8, 0.5, 0.7],
    [0.6,  0.3, -0.9, 1.0],
    [1.0, -0.1, 0.2, -0.3]
])

# There are 3 classes 0, 1, 2.
data_y = torch.tensor([0, 2, 1, 1, 0])

In [140]:
def one_hot(y): 
    return torch.nn.functional.one_hot(y, num_classes=3)

step(data_x[0], one_hot(data_y[0]))

tensor([3.2363, 4.5511, 6.0235], grad_fn=<AddBackward0>)
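
After `.backward()` has run, each parameter tensor carries its gradient in `.grad`, but nothing has been updated yet. The following is a minimal sketch of one manual gradient descent update. It uses a hypothetical single-layer model rather than the three-layer network above, and the learning rate of 0.1 is an assumed value.

```python
import torch

torch.manual_seed(0)  # make the random initialization repeatable

# Hypothetical tiny model: one linear layer with 4 inputs and 3 class scores
W = torch.rand(3, 4, requires_grad=True)
b = torch.rand(3, requires_grad=True)

x = torch.tensor([0.5, -0.2, 0.1, 0.4])
y_one_hot = torch.tensor([1.0, 0.0, 0.0])
learning_rate = 0.1   # assumed value

def compute_loss():
    probs = torch.softmax(W @ x + b, dim=-1)
    return -torch.sum(y_one_hot * torch.log(probs))

loss_before = compute_loss()
loss_before.backward()             # fills W.grad and b.grad

with torch.no_grad():              # the update itself should not be tracked
    W -= learning_rate * W.grad
    b -= learning_rate * b.grad
    W.grad.zero_()                 # reset, otherwise the next backward() adds to these
    b.grad.zero_()

loss_after = compute_loss()
print(loss_before.item(), loss_after.item())   # the loss should have decreased
```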


### The Module Abstraction 

The `torch.nn` component contains a number of tools to build neural networks more easily. Rather than building neural networks from scratch using individual tensors, which can become very cumbersome for larger networks, we can construct the network from `Modules`. 




In [150]:
import torch.nn as nn # common convention

`nn.Module` is the base class for all other Modules. We can get a standard (fully connected) neural network layer using the `nn.Linear` Module. 

Each Module has a `.forward(x)` method that returns the activation of the module when applied to the tensor `x`. Alternatively, the module can just be called *as if it were a function*, which will implicitly call `.forward`.


In [156]:
linear = nn.Linear(4, 3)  #linear layer without an activation function.  
test_input = data_x[0]
linear(test_input)

tensor([-0.4205,  0.3464, -0.2782], grad_fn=<ViewBackward0>)

In [204]:
linear.forward(test_input) # equivalent, but directly calling the module is preferred

tensor([-0.4205,  0.3464, -0.2782], grad_fn=<ViewBackward0>)

Each module keeps track of its parameters (weights and biases). All of these have `requires_grad=True` set by default. But the beauty of the module abstraction is that we rarely have to look at these. 

In [172]:
for p in linear.parameters(): 
    print(p)

Parameter containing:
tensor([[ 0.3690,  0.4537,  0.1393, -0.1728],
        [ 0.2088,  0.3057,  0.3565, -0.1958],
        [-0.1450,  0.1597, -0.1829,  0.0631]], requires_grad=True)
Parameter containing:
tensor([-0.4591,  0.3458, -0.1807], requires_grad=True)
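
When we do want to inspect the parameters, `named_parameters()` also reports the name of each tensor. The short sketch below creates a fresh layer with the same sizes as above (so its random values will differ) and counts the trainable parameters.

```python
import torch.nn as nn

layer = nn.Linear(4, 3)   # same layer sizes as above

for name, p in layer.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)   # 4*3 weights + 3 biases = 15
```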


**Writing your own Modules**

Modules can be nested inside of other modules -- this is fairly common in larger neural networks, and also the way you typically write your own neural network in PyTorch.

The basic approach is to inherit from `nn.Module` (review the notes on inheritance in the Python review section -- but really all this means is that the methods of `nn.Module` become available in your custom module). We add all component modules in the `__init__` method, which is called when the new class is first instantiated. Then we write a `forward` method to define the computation flow through the network. Autograd will construct the computation graph for us in the background. 

In [317]:
from torch.nn import functional as F # another common convention

class Model(nn.Module): # inherit from the nn.Module class

    def __init__(self): 
        # This is the constructor. 
        super().__init__() # call constructor of the parent class
        
        # Specify all the component modules inside of __init__. 
        self.hidden1 = nn.Linear(4,10)
        self.hidden2 = nn.Linear(10,10)
        self.out_layer = nn.Linear(10,3)

    def forward(self, x):
        # Need to implement the forward method to specify computation flow
        z1 = self.hidden1(x)
        a1 = F.relu(z1)
        z2 = self.hidden2(a1)
        a2 = F.relu(z2) 
        z3 = self.out_layer(a2) 
        return z3 


# Note: We are NOT applying softmax here. Below, we will use PyTorch's nn.CrossEntropyLoss, which implicitly computes the softmax 
# before comparing to the target class. 

In [341]:
model = Model()
model(data_x[0])

tensor([ 0.2202,  0.3002, -0.0720], grad_fn=<ViewBackward0>)
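
For a plain stack of layers like this one, the `nn.Sequential` container is an alternative to writing a custom class. The sketch below builds a network with the same layer sizes; the name `seq_model` is chosen only for this example. A custom `nn.Module` remains the more flexible option once the forward pass is no longer a simple chain.

```python
import torch
import torch.nn as nn

seq_model = nn.Sequential(
    nn.Linear(4, 10),
    nn.ReLU(),
    nn.Linear(10, 10),
    nn.ReLU(),
    nn.Linear(10, 3),   # raw class scores, no softmax
)

scores = seq_model(torch.tensor([0.5, -0.2, 0.1, 0.4]))
print(scores.shape)     # torch.Size([3])
```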

### Training Loop

We can now write a training loop, either for our hand-built neural network or for the Module version.  
We will use <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html">nn.CrossEntropyLoss</a>. This implementation allows us to compare the prediction directly to an integer class index, so the target can just be an integer scalar, not a one-hot vector. CrossEntropyLoss also computes the softmax operation implicitly before comparing to the target. 
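
To see what "implicitly computes the softmax" means, the sketch below compares the built-in loss with a hand-written version for a single example. The scores are illustrative values, and passing a single unbatched score vector assumes a reasonably recent PyTorch version.

```python
import torch
import torch.nn as nn

scores = torch.tensor([0.2202, 0.3002, -0.0720])   # raw class scores (illustrative)
target = torch.tensor(0)                           # integer class index

loss_builtin = nn.CrossEntropyLoss()(scores, target)

# The same quantity by hand: softmax, then negative log of the target class probability
probs = torch.softmax(scores, dim=-1)
loss_manual = -torch.log(probs[target])

print(loss_builtin.item(), loss_manual.item())   # the two values should agree
```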

PyTorch provides a number of different optimization algorithms. Here we will just use the basic Stochastic Gradient Descent algorithm we already know, but more advanced algorithms are available in the <a href="https://docs.pytorch.org/docs/stable/optim.html">torch.optim</a> package. 


In [350]:
def train(model, data_x, data_y, epochs=100, learning_rate=0.1):
    
    loss_fn = nn.CrossEntropyLoss() 
    
    #register the model's parameters with the optimizer, so it knows what to update. 
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 

    model.train() # Put the model in training mode. This doesn't have any effect for now (it matters for layers such as dropout), but it's good practice. 
    
    for epoch in range(epochs):
        total_loss = 0.0
        for i in range(len(data_x)):
            x = data_x[i]                
            y = data_y[i]

            prediction = model(x)            # raw class scores            

            loss = loss_fn(prediction, y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}: Loss = {total_loss}")
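
The call to `optimizer.zero_grad()` in the loop is needed because `.backward()` adds new gradients to whatever is already stored in `.grad` instead of overwriting it. A minimal sketch of this behavior with a single scalar parameter (the variable names are chosen only for this example):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()      # 3.0

(3 * w).backward()         # without resetting, the new gradient is added
second = w.grad.item()     # 6.0

w.grad.zero_()             # clear the stored gradient, which is the job of optimizer.zero_grad()
(3 * w).backward()
third = w.grad.item()      # 3.0

print(first, second, third)
```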
    

In [346]:
train(model, data_x, data_y) 

Epoch 1: Loss = 5.398195028305054
Epoch 2: Loss = 5.267715513706207
Epoch 3: Loss = 5.1859389543533325
Epoch 4: Loss = 5.046169936656952
Epoch 5: Loss = 4.857535719871521
Epoch 6: Loss = 4.721572637557983
Epoch 7: Loss = 4.468655467033386
Epoch 8: Loss = 4.21475487947464
Epoch 9: Loss = 4.056803077459335
Epoch 10: Loss = 3.7503445148468018
Epoch 11: Loss = 3.5339067578315735
Epoch 12: Loss = 3.381823390722275
Epoch 13: Loss = 3.214255303144455
Epoch 14: Loss = 3.069305196404457
Epoch 15: Loss = 2.9365364760160446
Epoch 16: Loss = 2.8356797993183136
Epoch 17: Loss = 2.694264739751816
Epoch 18: Loss = 2.6512050330638885
Epoch 19: Loss = 2.55254065990448
Epoch 20: Loss = 2.475143790245056
Epoch 21: Loss = 2.368068002164364
Epoch 22: Loss = 2.3232356309890747
Epoch 23: Loss = 2.2044807337224483
Epoch 24: Loss = 2.1476596482098103
Epoch 25: Loss = 2.0254349149763584
Epoch 26: Loss = 1.9394690729677677
Epoch 27: Loss = 1.8480589017271996
Epoch 28: Loss = 1.6801892146468163
Epoch 29: Loss = 1

### Evaluation 

We can write another function to evaluate the model on some test data. 

Note that we are turning off gradient tracking here. Otherwise PyTorch would record the computation graph and keep all intermediate activations during the forward pass so that gradients could be computed later. Turning this off saves time and memory. 
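
The effect of `torch.no_grad()` can be observed directly on a small tensor. In this sketch (variable names are illustrative), the result computed inside the block is not connected to any computation graph:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

tracked = (w * 2).sum()
print(tracked.requires_grad)     # True

with torch.no_grad():
    untracked = (w * 2).sum()
print(untracked.requires_grad)   # False
print(untracked.grad_fn)         # None
```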

In [378]:
def evaluate(model, data_x, data_y):
    correct = 0
    
    model.eval()  # place the model in evaluation mode -- again this doesn't have an effect for now, but is good practice. 
    
    with torch.no_grad():  # turn off automatic gradient computation and storage -- not needed since we are no longer training. 
        for i in range(data_x.shape[0]):
            x = data_x[i]
            y = data_y[i]

            prediction = model(x)                 # raw class scores. We are just interested in the max, so no need to use softmax.
            predicted = torch.argmax(prediction)  # predicted class index -- the one with the max score.

            if predicted == y:
                correct += 1

    accuracy = correct / data_x.shape[0]
    print(f"Accuracy: {accuracy * 100:.2f}%")
    return accuracy

In [371]:
evaluate(model, data_x, data_y) # Run it on the training data for now. 

Accuracy: 100.00%


1.0
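
The loop above handles one example at a time. Since the layers also accept a whole matrix of inputs (one row per example), accuracy can be computed without a Python loop. The sketch below uses made-up scores in place of a model's output to keep the result predictable.

```python
import torch

# Hypothetical raw scores for 5 examples and 3 classes (one row per example)
scores = torch.tensor([
    [2.0, 0.1, -1.0],
    [0.3, 0.2,  1.5],
    [0.0, 1.2,  0.4],
    [0.9, 0.8,  0.1],
    [1.1, 0.0,  0.2],
])
labels = torch.tensor([0, 2, 1, 1, 0])

predicted = scores.argmax(dim=1)                        # class with the highest score in each row
accuracy = (predicted == labels).float().mean().item()  # fraction of correct predictions

print(predicted)   # tensor([0, 2, 1, 0, 0])
print(accuracy)    # 4 of 5 correct, approximately 0.8
```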

### Datasets and Batching


***Mini Batches***

So far, we have trained the model on a single input/output pair at a time: present the input, compute the loss, compute the gradients, and update the weights. This process can be slow with large amounts of training data, and it also tends to be unstable, because a single example can drastically change the weights. One important trick is to shuffle the data so that in each epoch the examples are presented in a different order (the "stochastic" in SGD). 

Alternatively, we could use the entire available training data for the forward pass, average the losses (to compute the training error) and compute the gradients for this aggregate error. But with large datasets this can result in overfitting and can be quite memory intensive.

As a compromise, we often present the data in "mini batches": perform the forward pass on a few data items at a time, average the losses for those items, then compute the gradient. This is especially useful on GPUs, where we can efficiently parallelize tensor operations and "stack" together the forward and backward computation for multiple data items at the same time. 
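
A minimal sketch of what "average the losses" means in practice: with its default settings, `nn.CrossEntropyLoss` returns the mean loss over the batch, which matches the average of the per-example losses. The single linear layer here is only a stand-in for a full model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 3)            # stand-in for a model
loss_fn = nn.CrossEntropyLoss()    # averages over the batch by default

batch_x = torch.tensor([[0.5, -0.2, 0.1, 0.4],
                        [1.5,  0.2, 1.1, -0.4]])
batch_y = torch.tensor([0, 2])

scores = layer(batch_x)            # shape (2, 3): one row of scores per example
batch_loss = loss_fn(scores, batch_y)

# Average of the two per-example losses, computed one at a time
single_losses = [loss_fn(layer(x), y) for x, y in zip(batch_x, batch_y)]
mean_of_singles = torch.stack(single_losses).mean()

print(scores.shape)
print(batch_loss.item(), mean_of_singles.item())   # the two values should agree
```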

***Dataset and DataLoader***

PyTorch provides a mechanism for maintaining data sets and automatically creating batches for training. 

The class `torch.utils.data.Dataset` is a base class for concrete dataset implementations. A subclass needs to implement two methods: 

* `__len__(self)`: returns the number of samples in the dataset. Enables `len(data)`.
* `__getitem__(self, k)`: returns an (input, output) tuple for the data at index k. Enables indexing, such as `data[5]`.

We can build our own Dataset.

In [428]:
from torch.utils.data import Dataset, DataLoader

class DummyData(Dataset): 

    def __init__(self):
        # Possibly load the data from a file and store it in a data structure. Here we will just use the dummy data from above
        # Dummy data
        self.data_x = torch.tensor([   # 4 attributes 
            [0.5, -0.2, 0.1, 0.4],
            [1.5,  0.2, 1.1, -0.4],
            [0.3,  0.8, 0.5, 0.7],
            [0.6,  0.3, -0.9, 1.0],
            [1.0, -0.1, 0.2, -0.3]])
        
        self.data_y = torch.tensor([0, 2, 1, 1, 0])
        
    
    def __len__(self): 
        return self.data_x.shape[0]

    def __getitem__(self,k): 
        return (self.data_x[k], self.data_y[k])

        

In [430]:
data = DummyData()

In [432]:
len(data)

5

In [434]:
data[4]

(tensor([ 1.0000, -0.1000,  0.2000, -0.3000]), tensor(0))

The DataLoader will automatically batch the data and allow us to obtain one batch at a time for training. 

In [437]:
loader = DataLoader(data, batch_size = 2, shuffle=True)

In [439]:
for batch in loader: 
    print(batch)

[tensor([[ 1.5000,  0.2000,  1.1000, -0.4000],
        [ 0.3000,  0.8000,  0.5000,  0.7000]]), tensor([2, 1])]
[tensor([[ 1.0000, -0.1000,  0.2000, -0.3000],
        [ 0.5000, -0.2000,  0.1000,  0.4000]]), tensor([0, 0])]
[tensor([[ 0.6000,  0.3000, -0.9000,  1.0000]]), tensor([1])]


If the data is in a tensor, PyTorch actually provides an existing TensorDataset class. 

In [460]:
from torch.utils.data import TensorDataset
data = TensorDataset(data_x, data_y)
loader = DataLoader(data, batch_size = 2, shuffle=True)

In [462]:
for batch in loader:
    print(batch)

[tensor([[ 0.6000,  0.3000, -0.9000,  1.0000],
        [ 0.3000,  0.8000,  0.5000,  0.7000]]), tensor([1, 1])]
[tensor([[ 0.5000, -0.2000,  0.1000,  0.4000],
        [ 1.0000, -0.1000,  0.2000, -0.3000]]), tensor([0, 0])]
[tensor([[ 1.5000,  0.2000,  1.1000, -0.4000]]), tensor([2])]
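
With 5 examples and a batch size of 2, the last batch contains only one item. The sketch below, which uses made-up feature values, shows how to check the number of batches and how the `drop_last` option discards an incomplete final batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.arange(20, dtype=torch.float32).reshape(5, 4)  # 5 made-up examples
labels = torch.tensor([0, 2, 1, 1, 0])
dataset = TensorDataset(features, labels)

loader = DataLoader(dataset, batch_size=2, shuffle=True)
print(len(loader))                       # 3 batches
print([len(y) for _, y in loader])       # [2, 2, 1]

loader_full = DataLoader(dataset, batch_size=2, shuffle=True, drop_last=True)
print(len(loader_full))                  # 2 batches: the incomplete one is dropped
```

Shuffling changes which examples land in which batch, but not the batch sizes.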


Now update the train function to use the DataLoader: 

In [487]:
model = Model() # Reset the model

def train(model, dataloader, epochs=100, learning_rate=0.1):
    
    loss_fn = nn.CrossEntropyLoss() 
    
    #register the model's parameters with the optimizer, so it knows what to update. 
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 

    model.train() # Put the model in training mode. This doesn't have any effect for now (it matters for layers such as dropout), but it's good practice. 
    
    for epoch in range(epochs):
        total_loss = 0.0
        for x,y in dataloader:            

            prediction = model(x)            # raw class scores            

            loss = loss_fn(prediction, y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}: Loss = {total_loss}")

train(model, loader)  # This now uses batches! 

### GPU Acceleration 

Graphics Processing Units (GPUs) are specialized processors originally developed to accelerate graphics rendering and animation, such as those used in video games and CGI for films.

GPUs are designed for parallel processing: they can apply the same operation across many data elements simultaneously. This makes them well-suited for the kinds of computations common in neural networks, such as element-wise operations and matrix multiplication.

PyTorch can use Nvidia's CUDA computing library behind the scenes to accelerate tensor operations. There are also backends for other GPU frameworks, such as Apple's MPS (Metal Performance Shaders) for Apple silicon. 

First, we can check if CUDA is available. 

In [500]:
torch.cuda.is_available()

False

All we need to do in order for operations to be performed by the GPU is to make sure the relevant tensors (data and weights) are moved to the GPU memory. 

In [506]:
model.to('cuda')

AssertionError: Torch not compiled with CUDA enabled
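
The error above occurs because this machine has no CUDA support. A common way to avoid it is to select the device at runtime and fall back to the CPU. The following is a sketch of that pattern, using a stand-in layer instead of the full model; the MPS check is guarded because older PyTorch versions may not have that backend.

```python
import torch
import torch.nn as nn

# Pick the best available device and fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

layer = nn.Linear(4, 3).to(device)                    # move the parameters
x = torch.tensor([0.5, -0.2, 0.1, 0.4]).to(device)    # move the data

out = layer(x)       # runs on whichever device was selected
print(device, out.device)
```

Both the model and the data must be on the same device; mixing a CPU tensor with a GPU model raises an error.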