# PyTorch: More control, but more work

After Tensorflow+Keras, PyTorch is hands-down the most widely used neural network library in Python.  It's generally more popular in academia, and it competes pretty well with Tensorflow in the private/corporate space too.  But, it has a very different approach to working with models.

Keras hides away a lot of the details of  neural network from you.  This is not necessarily a bad thing.  It strikes a very good balance between abstraction and usability.  Keras doesn't try too hard to anticipate what you want to do with it, and it can very effectively cover most use cases for most people.  Tensorflow--which Keras is built on top of--has an extremely bumpy and weird learning curve.  Tensorflow is basically a different programming language, with a fundamentally different model of how code works, embedded into a Python library.  It can be very clunky to work with.  Writing Tensorflow code does not feel like writing Python.

PyTorch is designed to be closer to Tensorflow than to Keras, but to do as much using native Python constructs as possible.  As a result, writing code with PyTorch feels like writing Python code, and you can use everything you've learned about Python so far.  The end result is that you're dealing with more of the nitty-gritty details about creating and training networks, but it's much easier than if you were using Tensorflow directly.  It's not always as easy as Keras, but it's not too far from Keras for simple models.  (And for more complex models, PyTorch is _way_ more flexible).

There are a lot of other differences between the libraries, but honestly, most of them don't matter when you're getting started.  (E.g.: the difference between dynamic and static graphs; different tools for parallelism and deployment; integration with monitoring and logging; ease of running on embedded devices; and so on).  

But maybe most importantly: PyTorch is _way_ less of a pain to install.  To install PyTorch, go to the PyTorch website and follow [the installation instructions.](https://pytorch.org/get-started/locally/)  Note: if you're on Linux, and have an AMD GPU, PyTorch has support for ROCm!  Windows/Mac users, you're not so lucky; you still need to have an NVidia GPU, or just run the code on your CPU.  If you have an NVidia GPU and don't know what version of CUDA you have/should use, just use the default selection.  The installation can take a while, so be warned.

# PyTorch Quickstart

One very important thing: PyTorch has a `Tensor` class that's core to the entire library.  The `Tensor` is basically a Numpy `array`, but it can be easily moved to a GPU for faster math operations.  So instead of using Numpy `arrays`, we'll use PyTorch `Tensor`s.  (Keras/Tensorflow also have `Tensor`s, but we didn't use them in the previous notebooks; the code would look almost identical if we had, though).

Let's load our Diamonds data again and convert it into PyTorch Tensors.

In [1]:
import os
import pandas as pd

if os.path.isfile("diamonds.csv"):
    diamonds = pd.read_csv("diamonds.csv")
else:
    diamonds = pd.read_csv("https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/diamonds.csv")
    diamonds.to_csv("diamonds.csv", index=False)
    
diamonds = pd.get_dummies(
    diamonds,
    ["color", "clarity", "cut"],
)

# Train-test split; ~20% data for testing.
diamonds = diamonds.sample(frac=1, replace=False).reset_index(drop=True)
test = diamonds.loc[:9999]
train = diamonds.loc[9999:]

print(test.shape)
print(train.shape)

train_x = train.drop(columns=["price"]).values
test_x = test.drop(columns=["price"]).values

# .reshape(-1,1) --> this will avoid some pytorch warnings
# later.
train_y = train["price"].values.reshape(-1,1)
test_y = test["price"].values.reshape(-1,1)

(10000, 27)
(43941, 27)


In [2]:
# Convert to tensors
import torch # imports as torch, not pytorch!

train_x = torch.Tensor(train_x)
train_y = torch.Tensor(train_y)
test_x = torch.Tensor(test_x)
test_y = torch.Tensor(test_y)

print(train_x)

  from .autonotebook import tqdm as notebook_tqdm


tensor([[ 0.7000, 61.1000, 56.0000,  ...,  0.0000,  0.0000,  1.0000],
        [ 0.3400, 63.7000, 57.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.4000, 63.4000, 59.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 1.5200, 62.6000, 60.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.5200, 62.3000, 55.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.5100, 61.5000, 57.0000,  ...,  1.0000,  0.0000,  0.0000]])


`torch.Tensor` objects default to float32 for better performance. but Pandas/Numpy default to float64.  There is some loss of numeric precision, but it really doesn't matter in almost any scenario.  PyTorch has its own datatypes which have all the same names as the ones in Numpy: `float32`, `float64`, `int16`, etc.

In [3]:
print(train_x.dtype)

torch.float32


Now, let's build the same simple neural network as in the last two notebooks!

In [4]:
model = torch.nn.Sequential(
    torch.nn.Linear(in_features=train_x.shape[1], out_features=128),
    torch.nn.Linear(in_features=128, out_features=128),
    torch.nn.Linear(in_features=128, out_features=128),
    torch.nn.Linear(in_features=128, out_features=1)
)
print(model)

Sequential(
  (0): Linear(in_features=26, out_features=128, bias=True)
  (1): Linear(in_features=128, out_features=128, bias=True)
  (2): Linear(in_features=128, out_features=128, bias=True)
  (3): Linear(in_features=128, out_features=1, bias=True)
)


Note a few big differences between this and the Keras `Sequential` model:
1. We pass multiple positional arguments to the `Sequential` contructor, not a list of layers
2. We're not specifying the activation--PyTorch requires activation to be specified like layers (Keras allowsus to specify activation in the layer's constructor, or as a separate layer--we did it in the layer constructors).  Not specifying an activation = linear/identity activation.  If we wanted to use something like ReLU activation, we would add `torch.nn.ReLU()` after the layer we want to apply it to.
3. We specify the number of features coming in, and going out, of the layers.  This is a bit of extra work but it does end up allowing us to be more flexible with our model (but only in fairly exotic use cases).

Now let's train this model.

In [5]:
# Create the loss function and optimizer object
# These objects track a lot of state internally between
# calls, so they're objects and not plain functions.
loss_fn = torch.nn.MSELoss(reduction="mean")
optimizer = torch.optim.Adam(
    # Parameters to track and update
    model.parameters(), 
    
    # Learning rate
    lr=1e-3
)

We can write the training loop ourselves pretty easily.  Remember that neural networks are trained by:
1. Grab a few observations, at random, from the training dataset.
2. Run them through the model to generate predictions.
3. Check how good the predictions are.
4. Update the model's parameters.

We have to do a bit of work to to the batching and shuffling, but it's not too much.

In [6]:
from sklearn.metrics import r2_score

def batches(x, y, batchsize=128):
    # Generator that yields (x, y) pairs
    indices = torch.randperm(x.shape[0])
    for i in range(0, len(indices), batchsize):
        idx = indices[i:i+batchsize]
        yield x[idx], y[idx]
        
# Training loop
for epoch in range(25):
    for (x, y) in batches(
        train_x,
        train_y.reshape(-1,1),
        128,
    ):
        
        # Note that we use the model like a callable/function;
        # we don't call a .predict() method on it.
        preds = model(x)
        
        # Calculate the loss.
        loss = loss_fn(preds, y)
        
        # The next three lines do the backpropagation step.
        # Reset the optimizer's gradient information.
        optimizer.zero_grad()
        
        # Calculate the gradient of the compute graph
        # based on the loss function.
        loss.backward()
        
        # Use the optimizers to take a step down the gradient
        # of all free parameters in the compute graph, i.e.,
        # to update the graph's paramters.
        optimizer.step()
        
    # print the loss after each epoch.
    # torch.no_grad() --> context manager that prevents
    # pytorch from calculating gradients.  We don't need
    # them here anyways.
    with torch.no_grad():
        test_preds = model(test_x)
        loss = loss_fn(test_preds, test_y)
        r2 = r2_score(test_y, test_preds)
        print(f"Epoch {epoch:5<} - MSE Loss={int(loss):,} - R2={r2:.4f}")

Epoch 0 - MSE Loss=3,384,875 - R2=0.7865
Epoch 1 - MSE Loss=2,363,933 - R2=0.8509
Epoch 2 - MSE Loss=2,232,256 - R2=0.8592
Epoch 3 - MSE Loss=2,151,837 - R2=0.8643
Epoch 4 - MSE Loss=2,031,320 - R2=0.8719
Epoch 5 - MSE Loss=1,971,408 - R2=0.8757
Epoch 6 - MSE Loss=1,842,296 - R2=0.8838
Epoch 7 - MSE Loss=1,772,493 - R2=0.8882
Epoch 8 - MSE Loss=1,645,327 - R2=0.8962
Epoch 9 - MSE Loss=1,539,982 - R2=0.9029
Epoch 10 - MSE Loss=1,477,258 - R2=0.9068
Epoch 11 - MSE Loss=1,834,693 - R2=0.8843
Epoch 12 - MSE Loss=1,348,638 - R2=0.9149
Epoch 13 - MSE Loss=1,339,713 - R2=0.9155
Epoch 14 - MSE Loss=1,313,609 - R2=0.9172
Epoch 15 - MSE Loss=1,314,494 - R2=0.9171
Epoch 16 - MSE Loss=1,298,895 - R2=0.9181
Epoch 17 - MSE Loss=1,307,711 - R2=0.9175
Epoch 18 - MSE Loss=1,601,847 - R2=0.8990
Epoch 19 - MSE Loss=1,352,297 - R2=0.9147
Epoch 20 - MSE Loss=1,323,062 - R2=0.9166
Epoch 21 - MSE Loss=1,342,844 - R2=0.9153
Epoch 22 - MSE Loss=1,378,596 - R2=0.9131
Epoch 23 - MSE Loss=1,313,490 - R2=0.9172
Ep

There's a lot of "magic" happening in the above code.  PyTorch, like Keras, is constructing a _compute graph_ by adding connections every time we do anything that logically connects two objects.  So:
- Specifying `model.parameters()` when creating our optimizer creates a link between the model's parameters and the optimizer.  PyTorch will remember this link.
- Calcuating the loss function creates a connection between the loss function, the true values, and the model's predictions.
- The model creates connections between each of its layers and their parameters.

So, when we call `optimizer.step()`, PyTorch is looking at all the things the optimizer has ben connected to (directly and indirectly), and doing the update step for all of them.

Some people find all this implicit "magic" to be a bad thing; I tend to agree more than I disagree.  It's not too bad when you have a good handle on the underlying steps and what they're actually doing to the network, but one of the criticisms of PyTorch is that it strikes a kind of werid balance between how much you have to do by hand, and how much PyTorch does for you, behind the scenes.  Fortunately, the code is usually pretty simple, and very well documented, so this ends up being mostly a philosophical point rather than a practical one.

Note that we aren't doing any sort of early stopping.  We could do that, but we would have to implement it ourselves.  It's not actually that hard: just store the last few values of whatever we're tracking, then check if the next value we get is less than any of them.  Something like:

```python
losses = []
for epoch in range(25):
    for (x, y) in batches(...):
        # do model stuff
    loss = loss_fn(...)
    
    # stop if no improvement for 10 epochs
    if any(loss < i for i in losses[-10:]):
        losses.append(loss)
    else:
        break
```

One dirty secret here: we've been running this model on the CPU.  To move it--and our data--to the GPU, we can do two things:

1. Use the `.to()` method to move tensors and models from the CPU to the GPU (or vice versa).
2. Set the `device=` keyword argument when creating tensors/models/layers.

In [7]:
# Get the list of device
if torch.cuda.is_available():
    print("CUDA is available!")
    device = torch.device("cuda")
else:
    print("CUDA is not available :(")
    device = torch.device("cpu")
print(device)

# Create the tensor directly on the device
gpu_tensor = torch.tensor([1,2,3,4,5], device=device)
print("Created on the GPU:", gpu_tensor)
print("Device:", gpu_tensor.device)
print()

CUDA is available!
cuda
Created on the GPU: tensor([1, 2, 3, 4, 5], device='cuda:0')
Device: cuda:0



In [8]:
# Move a tensor to the GPU after creating it on the CPU
gpu_tensor = torch.tensor([1,2,3,4,5])
print("Created on the CPU:", gpu_tensor)
print("Device:", gpu_tensor.device)
gpu_tensor = gpu_tensor.to("cuda")
print("Moved to GPU:", gpu_tensor)
print("Device:", gpu_tensor.device)

Created on the CPU: tensor([1, 2, 3, 4, 5])
Device: cpu
Moved to GPU: tensor([1, 2, 3, 4, 5], device='cuda:0')
Device: cuda:0


PyTorch doesn't have a way to default to constructing tensors/models on the GPU, which according to them is because this can be a pretty expensive and slow operation; it's better to move thing over to the GPU when they need to be moved, not sooner.  (Also, your GPU probably has less RAM than your main system, so you can store more stuff in the CPU/main system memory anyways).

Let's re-run the training loop, but put everything on the GPU.  One issue we'll run into: when `torch.Tensor`s are on the GPU, we can't plug them in to scikit-learn's metric functions, because PyTorch doesn't automatically convert GPU tensor to Numpy `array`s (which can only live in CPU memory).  This is annoying, but it's done for a reason--moving data between the GPU and CPU can cause all sorts of slowdowns and other issues, so PyTorch wants you to do that explicitly, so it doesn't have to worry about it.

Fortunately, it's very easy to define our own function to calculate the $R^2$ metric.

In [9]:
from tqdm import tqdm

def pytorch_r2(pred, true):
    """Compute the r-squared metric."""
    ss_res = ((true - pred) ** 2).sum()
    ss_tot = ((true - true.mean()) ** 2).sum()
    return 1 - (ss_res / ss_tot)

# Everything needs to be on the GPU!
# We can also use .cuda() and .cpu() in place of .to();
# this is just a shortcut for .to("cuda") and .to("cpu"),
# respectively.
# NOTE: using any of these methods on a Sequential model
# will update it *in-place*, but for individual Tensors,
# it *returns a copy* of the moved Tensor.
gpu_model = torch.nn.Sequential(
    torch.nn.Linear(in_features=train_x.shape[1], out_features=128),
    torch.nn.Linear(in_features=128, out_features=128),
    torch.nn.Linear(in_features=128, out_features=128),
    torch.nn.Linear(in_features=128, out_features=1)
).cuda()

# Re-create the optimizer and make sure it points at the GPU model's
# parameters.
gpu_optimizer = torch.optim.Adam(
    gpu_model.parameters(), 
    lr=1e-3,
)

# Move our data to the GPU
train_x_gpu = train_x.cuda()
train_y_gpu = train_y.cuda()
test_x_gpu = test_x.cuda()
test_y_gpu = test_y.cuda()

# Training loop--this time with early stopping
losses = []
for epoch in range(500):
    for (x, y) in batches(train_x_gpu, train_y_gpu, 128):
        preds = gpu_model(x)
        loss = loss_fn(preds, y)
        gpu_optimizer.zero_grad()
        loss.backward()
        gpu_optimizer.step()
        
    # print the loss after each epoch.
    # Notice the hoops we're jumping through with device management.
    # There might be a better way to d this, but I've not spent enough
    # time with PyTorch to know how.
    with torch.no_grad():
        test_preds = gpu_model(test_x_gpu)
        loss = loss_fn(test_preds, test_y_gpu)
        r2 = pytorch_r2(test_y_gpu, test_preds)
        print(f"Epoch {epoch:5<} - MSE Loss={int(loss):7<,} - R2={r2:.4f}")
        
        # Early stopping: validation loss must improve, by any margin,
        # within 5 epochs.
        if (
            len(losses) < 5
            or any(loss < i for i in losses[-5:])
        ):
            pass
        else:
            print("Early stopping.")
            break
        losses.append(loss)

Epoch 0 - MSE Loss=3,902,924 - R2=0.6935
Epoch 1 - MSE Loss=2,431,993 - R2=0.8153
Epoch 2 - MSE Loss=2,469,485 - R2=0.8243
Epoch 3 - MSE Loss=2,127,600 - R2=0.8433
Epoch 4 - MSE Loss=2,106,534 - R2=0.8429
Epoch 5 - MSE Loss=1,980,015 - R2=0.8559
Epoch 6 - MSE Loss=2,270,354 - R2=0.8138
Epoch 7 - MSE Loss=1,854,507 - R2=0.8624
Epoch 8 - MSE Loss=1,791,943 - R2=0.8624
Epoch 9 - MSE Loss=1,714,606 - R2=0.8674
Epoch 10 - MSE Loss=1,611,747 - R2=0.8817
Epoch 11 - MSE Loss=1,593,832 - R2=0.8848
Epoch 12 - MSE Loss=1,510,548 - R2=0.8927
Epoch 13 - MSE Loss=1,446,722 - R2=0.9044
Epoch 14 - MSE Loss=1,396,417 - R2=0.8997
Epoch 15 - MSE Loss=1,421,135 - R2=0.8983
Epoch 16 - MSE Loss=1,331,716 - R2=0.9150
Epoch 17 - MSE Loss=1,321,524 - R2=0.9084
Epoch 18 - MSE Loss=1,630,032 - R2=0.8870
Early stopping.


That's a basic sketch of a PyTorch model.  Building the model and the loss function and optimizer looks pretty similar to Keras, but note the following major differences:

1. Keras uses a scikit-learn-style API with `.fit()` and `.predict()`.  PyTorch asks you do do manually manage the training loop--it's not much code for a basic training loop, and it gives you a lot more control, but you have to do more work.
2. Keras transparently handles devices (CPU vs GPU).  PyTorch asks you to do this yourself.
3. PyTorch does a lot of stuff implicitly: calling `loss.backward()` calculates gradients for _every tensor that was used at any point in the calculation of `loss`_, and sets the gradients as attributes on those tensors. Similarly, `optimizer.step()` steps through every tensor that you gave the optimizer when you initialized it, checks the gradients, and updates them in-place.  There's a lot of back magic going on here.  Keras doesn't expose any of this; you just call `.fit()` and Keras handles all the other details.
4. In general, Keras ia _a neural network library._  PyTorch, though, is a GPU math library with automatic gradients/differentiation. You have to implement the neural network bits yourself, but it's very easy to see what needs to be done.

Both libaries are excellent.  Which one you want to use is really just a matter of personal preference.  I personally prefer PyTorch--it's where most people are flocking these days, since it's much easier to create complex, exotic networks using it rather than Keras/Tensorflow.  PyTorch is also a lot easier to install.  But Keras has a lot more quality-of-life features that are immediately and easily accessible.

# Stuff not covered

I was originally planning to cover how to build more complex models using PyTorch's built-in classes and subclassing them.  That ended up getting a bit too messy, so I've cut it for time.  But, see the notebook on Python `class`es for a quickstart on the Python language tools for doing that.

PyTorch also integrates with TensorBoard, as well as its own in-house tools for monitoring and visualizing models, but I haven't shown how to do that (because I haven't looked at how to do it).

There are also a _lot_ of other neural network libraries.  JAX (coupled with FLAX or Haiku) is currently the biggest up-and-comer.  JAX is pretty interesting from a technical perspective, but I haven't spent enough time with it to say much more than that.  Picking a neural network library can be hard, and requires comparing a lot of apples to a lot of oranges--and almost all of that stuff only matters if you're thinking of deploying models in a production environment, rather than using them for research projects or experimentation.