# You can do machine learning

These are some ideas that helped me understand machine learning in PyTorch.

## Step 1. Developing in Python

### Python

Python is an interpreted language that is very popular in data science. It's slow but very flexible. Dynamic typing makes it perfect for torturing actual software developers.

Python has "virtual environments" instead of projects. The package manager is called `pip`.

### Visual Studio Code

VS Code seems to be a popular, cross-platform app for Python development. It's a text editor, not a full-fledged IDE, but there is a rich marketplace of extensions that provide language-specific behavior.

### Jupyter notebooks

Notebooks are a cool way to meld markdown with code, creating executable documentation.

In [None]:
import math
import matplotlib.pyplot as plt

x = range(-6, 7)
y = [ 1 / (1 + math.exp(-z)) for z in x ]   # sigmoid function

plt.plot(x, y)
plt.grid()
plt.show()

Note that ranges in Python are [inclusive, exclusive).
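
A quick sketch of what that means for the range used above (the name `values` is just for illustration):

```python
# range(start, stop) includes start but excludes stop
values = list(range(-6, 7))
print(values)        # runs from -6 up to and including 6
print(len(values))   # 7 - (-6) = 13 values
```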

## Step 2. PyTorch

PyTorch is an open source machine learning framework created by Meta that is becoming more popular than Google's TensorFlow. It can run on either the CPU (slow) or a GPU (fast; on NVIDIA hardware this goes through "CUDA").

In [None]:
import torch

print(torch.__version__)
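
A common way to choose between the CPU and the GPU looks something like this. It's a minimal sketch; the names `device` and `t` are mine, and it assumes the GPU (if any) is an NVIDIA one that PyTorch can see.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t = torch.ones(2, 3, device=device)   # create the tensor directly on that device
print(device, t.device)
```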

## Step 3. Tensors

A tensor is an N-dimensional array.

<img src="https://miro.medium.com/1*6Z892ClZGon03_Mawj4Pew.png" width="400"/>

### 0-dimensional tensor

Also known as a scalar. Here we create a tensor from a single numeric value:

In [None]:
ndim0 = torch.tensor(3.1415)
print(ndim0)

### 1-dimensional tensor

Also known as a vector. One way to create a 1D tensor is from a Python list:

In [None]:
ndim1 = torch.tensor([1, 2, 4, 8, 16, 32])
print(ndim1)

Note: A vector can also be interpreted as a position in N-dimensional space. For example, the vector above is a point in 6D space. Both interpretations are valid, but don't get them mixed up.

### 2-dimensional tensor

Also known as a matrix, which is essentially a table of rows and columns. Here, we generate a 1D tensor of 12 integers, and then change its shape to 2D:

In [None]:
ndim1 = torch.arange(0, 12)
ndim2 = ndim1.view(3, 4)   # 3 rows by 4 columns
print(ndim2)

We can directly modify the data in the tensor:

In [None]:
ndim2[1,0] = -100
print(ndim2)

Note that the view shares its data with the original tensor:

In [None]:
print(ndim1)
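
If you want an independent copy instead, `clone()` is one option. Here's a small sketch contrasting the two (the names `base`, `shared`, and `copied` are made up for this example):

```python
import torch

base = torch.arange(0, 12)
shared = base.view(3, 4)             # same underlying storage as base
copied = base.view(3, 4).clone()     # independent copy of the data

shared[0, 0] = -1                    # this change is visible through base
copied[0, 1] = -2                    # this change is not

print(base)
print(shared.data_ptr() == base.data_ptr())   # both start at the same memory address
```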

We can also transpose a tensor's rows and columns:

In [None]:
print(ndim2.transpose(0, 1))   # transpose dim-0 and dim-1

### 3-dimensional tensor

Here's a 3D tensor with random values:

In [None]:
ndim3 = torch.rand(2, 3, 4)   # 2 layers X 3 rows X 4 columns
print(ndim3)

You can ask any tensor for its shape:

In [None]:
print(ndim3.shape)
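
Two related questions you can ask a tensor are how many dimensions it has and how many elements it holds in total. A short sketch, using a throwaway tensor `t`:

```python
import torch

t = torch.rand(2, 3, 4)
print(t.shape)      # size of each dimension
print(t.ndim)       # number of dimensions
print(t.numel())    # total number of elements: 2 * 3 * 4
```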

### Higher dimensions

It gets harder to visualize tensors as the number of dimensions increases:

In [None]:
ndim4 = torch.rand(2, 3, 2, 3)   # 2 hyperlayers X 3 layers X 2 rows X 3 columns
print(ndim4)

### Changing dimensionality

Start with a 1D vector:

In [None]:
vector = torch.arange(-4, 5)   # 1D: 9 columns
print(vector)
print(vector.shape)

If we insert a dimension of size 1 in front, the columns stay columns:

In [None]:
unsq0 = vector.unsqueeze(0)   # 2D: 1 row x 9 columns
print(unsq0)                  # note the extra pair of brackets in the output
print(unsq0.shape)

But if we insert a dimension of size 1 at the end, the columns become rows:

In [None]:
unsq1 = vector.unsqueeze(1)   # 2D: 9 rows x 1 column
print(unsq1)
print(unsq1.shape)
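
Going the other way, `squeeze` removes a dimension of size 1, and indexing with `None` is another spelling of `unsqueeze`. A small sketch (the names `column`, `restored`, and `row` are mine):

```python
import torch

vector = torch.arange(-4, 5)      # shape (9,)
column = vector.unsqueeze(1)      # shape (9, 1)
restored = column.squeeze(1)      # back to shape (9,)
row = vector[None, :]             # indexing with None acts like unsqueeze(0)

print(column.shape, restored.shape, row.shape)
```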

## Step 4. Neural network building blocks

Some examples from the "nn" zoo:

In [None]:
import torch.nn as nn

In general, these are functions of type `Tensor -> Tensor`, so a complex neural network can be built by composition. During training, the model learns the best values for the parameters inside these blocks.
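
To make "composition" concrete, here's a minimal sketch that chains three blocks by hand and then via `nn.Sequential`. The layer sizes and the names `first`, `act`, and `second` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

first = nn.Linear(4, 8)
act = nn.ReLU()
second = nn.Linear(8, 2)

x = torch.rand(5, 4)
byHand = second(act(first(x)))                     # ordinary function composition
composed = nn.Sequential(first, act, second)(x)    # same blocks, chained for us

print(byHand.shape)
print(torch.allclose(byHand, composed))
```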

### Linear layer

Applies a linear transformation: **y = xW<sup>T</sup> + B**

Where:
* **x** is the input tensor
* **y** is the output tensor
* **W** is a 2D tensor of weight parameters
* **B** is a 1D tensor of bias parameters
* **xW<sup>T</sup>** is matrix multiplication of **x** by the transpose of **W**.

In [None]:
linear = nn.Linear(in_features = 20, out_features = 30)
print("Weight:", linear.weight.shape)
print("Bias:", linear.bias.shape)
print("Total # of parms:", sum(parm.numel() for parm in linear.parameters()))   # (30 x 20) + 30

Linear transforms are useful for "projecting" a tensor into a different shape with the same number of dimensions:

In [None]:
x = torch.rand(5, 6, 20)   # last dimension must match linear input
print("Input:", x.shape)
y = linear(x)              # result is still 3D, but now with 30 columns
print("Output:", y.shape)
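
We can check the formula ourselves by doing the matrix multiplication by hand. This is a sketch for illustration (the name `byHand` is mine), not how you'd normally call the layer:

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=20, out_features=30)
x = torch.rand(5, 6, 20)

byHand = x @ linear.weight.T + linear.bias   # y = xW^T + B
print(byHand.shape)
print(torch.allclose(byHand, linear(x), atol=1e-5))   # equal up to rounding
```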

### Activation functions

#### ReLU

Activation functions provide non-linear transformations. They typically don't have any parameters. One simple activation function is "rectified linear unit" (ReLU), which maps any negative input to zero, and any non-negative input to itself:

In [None]:
relu = nn.ReLU()
x = torch.arange(-10, 11)
y = relu(x)

plt.plot(x, y)
plt.axis("equal")
plt.grid()
plt.show()
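
Another way to think of ReLU is as `max(0, x)` applied to each element. A small sketch comparing it with `torch.clamp` (the name `byHand` is just for this example):

```python
import torch
import torch.nn as nn

x = torch.arange(-3.0, 4.0)          # -3, -2, ..., 3 as floats
y = nn.ReLU()(x)
byHand = torch.clamp(x, min=0)       # max(0, x), element by element

print(y)
print(torch.equal(y, byHand))
```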

### Dropout layer

During training, randomly sets some of the input elements to zero, and scales the remaining elements up by 1 / (1 - rate) to compensate. Dropout prevents the model from becoming too reliant on a small set of parameters.

A dropout layer has no parameters, but its dropout rate (e.g. 20%) is a hyperparameter.

In [None]:
dropout = nn.Dropout(0.2)

x = torch.ones(3, 4)
print(x)
y = dropout(x)
print(y)

The dropout is recomputed during each application, so results are not deterministic:

In [None]:
y = dropout(x)
print(y)
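
Two details are easy to miss: the surviving elements are scaled by 1 / (1 - rate), and dropout only happens in training mode. Here's a sketch that checks both, using a larger tensor so some elements are almost certain to survive (the names `rate`, `yTrain`, `kept`, and `yEval` are mine):

```python
import torch
import torch.nn as nn

rate = 0.2
dropout = nn.Dropout(rate)
x = torch.ones(1000)

dropout.train()                  # training mode: dropout is active
yTrain = dropout(x)
kept = yTrain[yTrain != 0]
print(kept.unique())             # survivors are scaled by 1 / (1 - rate)

dropout.eval()                   # evaluation mode: dropout does nothing
yEval = dropout(x)
print(torch.equal(yEval, x))
```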

### Layer normalization

Shifts and scales the input so that each row (the last dimension) has a mean of 0 and a standard deviation of 1. This reduces training time by reining in large values.

In [None]:
numColumns = 4
norm = nn.LayerNorm(numColumns)   # expect last dimension of this size

x = torch.arange(0.0, 12.0).view(-1, numColumns)   # -1: PyTorch infers the # of rows
print(x)
y = norm(x)
print(y)

By default, the normalization also has weight and bias parameters (initialized to 1 and 0) that the model learns during training:

In [None]:
print("Weight:", norm.weight)
print("Bias:", norm.bias)
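
To see what the layer is doing, we can reproduce it by hand. This is a sketch under the assumption that the layer uses its default settings; the names `mean`, `var`, and `byHand` are mine.

```python
import torch
import torch.nn as nn

norm = nn.LayerNorm(4)
x = torch.arange(0.0, 12.0).view(-1, 4)

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased variance (divide by N)
byHand = (x - mean) / torch.sqrt(var + norm.eps)    # eps avoids division by zero
byHand = byHand * norm.weight + norm.bias           # weight starts at 1, bias at 0

print(torch.allclose(byHand, norm(x), atol=1e-5))
```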

### Embedding

An embedding is a lookup table that maps scalar indexes to vectors, where each vector represents the embedding of the corresponding key into a higher dimensional space.

In [None]:
embedding = nn.Embedding(num_embeddings = 4, embedding_dim = 3)
key = torch.tensor(0)
v = embedding(key)
print(v)

The embedding vectors are learnable parameters of the model:

In [None]:
print(embedding.weight.shape)
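
Since it's a lookup table, embedding a batch of keys gives the same result as indexing the weight tensor directly. A small sketch (the names `keys` and `vectors` are mine):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=4, embedding_dim=3)
keys = torch.tensor([0, 2, 2])      # a batch of keys; repeats are fine

vectors = embedding(keys)           # one 3-element row per key
print(vectors.shape)
print(torch.equal(vectors, embedding.weight[keys]))   # same as indexing the table
```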

## Step 5. Building a model

### Non-linear regression

Let's say we have a simple non-linear function that we want a neural network to learn. (This is like using a bomb to kill an ant, but it illustrates the basic principle well.)

In [None]:
def targetFunc(x):
    return 3 * x ** 2 + 10

We need some training data, consisting of inputs to and outputs from the function:

In [None]:
domain = 6.0
xBatch = torch.arange(-domain, domain+1).unsqueeze(1)   # each row is a distinct input
print("Input:\n", xBatch)

yTarget = targetFunc(xBatch)                            # each output row corresponds to a specific input
print("Output:\n", yTarget)

Note that we've used tensors to create a dataset of inputs/outputs all at once, but you can also think of `xBatch` as a single input, with corresponding output `yTarget`.

If we stick to thinking of these as pairs of input/output scalars, the target function is a parabola:

In [None]:
plt.plot(xBatch, yTarget)
plt.grid()
plt.show()

Is it possible to approximate a parabola with a pair of linear functions sandwiched around ReLU, none of which are curved?

Since the target function is `scalar -> scalar`, the model should have a single input value and a single output value, but we can give the model lots of parameters in between.

In [None]:
model = nn.Sequential(
    nn.Linear(1, 10),   # scalar -> 10-dimensional space
    nn.ReLU(),          # non-linear
    nn.Linear(10, 1))   # 10-dimensional space -> scalar

The model's parameters are initialized with random values by default.

In [None]:
for (name, parm) in model.named_parameters():
    print("\n{}: {}".format(name, parm.data.shape))
    print("   ", parm.data)

numParms = sum(parm.numel() for parm in model.parameters())
print("\nTotal # of parameters:", numParms)

To train the model, we need:
* A loss function that determines how far off target the model is, and
* An optimizer that will attempt to minimize the loss. The optimizer's "learning rate" determines how much of an adjustment it will make to the model's parameters on each iteration.

In [None]:
lossFunc = nn.MSELoss()   # mean squared error
optimizer = torch.optim.SGD(model.parameters(), lr = 0.001)   # stochastic gradient descent

Now, we're ready to train the model. During each iteration:

1. Make a forward pass to generate a batch of predicted values.
2. Calculate the difference between the predicted and target values. This is called the "loss".
3. Make a backward pass from the loss through the model to calculate gradients, which tell us how the loss changes as each parameter changes, and therefore in which direction each parameter must be adjusted.
4. Adjust each parameter by a small amount accordingly. We don't want to overshoot the best value.
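
To demystify the four steps above, here's one iteration written out by hand for plain gradient descent, without an optimizer object. It's only a sketch: the tiny one-parameter-pair model, the data, the seed, and the learning rate are all made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                 # arbitrary seed so the run is repeatable
tiny = nn.Linear(1, 1)               # hypothetical tiny model, for illustration only
x = torch.tensor([[1.0], [2.0], [3.0]])
target = 2 * x
lr = 0.01

prediction = tiny(x)                                  # 1. forward pass
lossBefore = ((prediction - target) ** 2).mean()      # 2. loss (mean squared error)
lossBefore.backward()                                 # 3. backward pass fills parm.grad

with torch.no_grad():                                 # 4. small step against the gradient
    for parm in tiny.parameters():
        parm -= lr * parm.grad

lossAfter = ((tiny(x) - target) ** 2).mean()
print(lossBefore.item(), lossAfter.item())            # the loss should drop a little
```

An optimizer like `torch.optim.SGD` does the bookkeeping in step 4 for us, which is what the training loop uses.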

In [None]:
for epoch in range(100000):

    # forward pass
    yPrediction = model(xBatch)

    # calculate loss
    loss = lossFunc(yPrediction, yTarget)
    if epoch & (epoch - 1) == 0:   # true when epoch is 0 or a power of 2
        print("Epoch {}, loss {}".format(epoch, loss.item()))
        plt.plot(xBatch, yTarget)
        plt.plot(xBatch, yPrediction.detach())   # detach from the autograd graph before plotting
        plt.grid()
        plt.show()

    # backward pass
    optimizer.zero_grad()
    loss.backward()

    # adjust parameters
    optimizer.step()

The model has learned parameter values that approximate the target function well.

In [None]:
for (name, parm) in model.named_parameters():
    print("{}: {}".format(name, parm.data))