## Welcome!

Welcome to the Zero to GPT course!  Over the next few lessons, you'll go from no deep learning knowledge to being able to train your own GPT model.

At a high level, GPT works by predicting what words should come after your prompt.  If you prompt GPT-3 with `Write me a song about deep learning`, it returns something like this:

```
Verse 1:
Deep learning is the way of the future,
Data mining through layers of complexity
Processing information with intelligence
Where will this technology take us?
```

GPT models are deep learning models.  Deep learning is a subfield of machine learning, which is a subfield of AI.  In deep learning, we train a neural network composed of many layers to transform an input into an output.

Neural networks can't understand text directly, so GPT has to convert your prompt to numbers, make predictions, then convert the predictions back to text.  The process works like this:

1. Split your prompt up into tokens (tokens are similar to words), and convert each token to a vector of numbers
2. Feed the vectors into a neural network
3. Predict what vectors come after the input tokens
4. Convert the vectors back into text
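The four steps above can be sketched in a few lines of Python.  This is a toy illustration under heavy assumptions, not how GPT is actually implemented: the four-word vocabulary, the whitespace `tokenize` helper, and the random `embeddings` and `weights` arrays are all made up for the example.  As a simplification, the sketch scores every token in the vocabulary and picks the best one instead of predicting a vector.  Because the weights are untrained, the predicted token is arbitrary.  The code uses NumPy, which is introduced later in this lesson.

```python
import numpy as np

# Toy vocabulary (an assumption): real tokenizers have tens of thousands of tokens
vocab = ["deep", "learning", "is", "fun"]
token_to_id = {token: i for i, token in enumerate(vocab)}

def tokenize(prompt):
    # Step 1a: split on whitespace (real tokenizers split text into sub-word pieces)
    return [token_to_id[word] for word in prompt.lower().split()]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))  # one vector of 8 numbers per token
weights = rng.normal(size=(8, len(vocab)))     # random stand-in for a trained network

token_ids = tokenize("deep learning is")
vectors = embeddings[token_ids]        # Step 1b: tokens -> vectors, shape (3, 8)
scores = vectors @ weights             # Steps 2-3: score each vocabulary token
next_id = int(scores[-1].argmax())     # best-scoring token after the last input token
next_token = vocab[next_id]            # Step 4: convert the prediction back to text

print(next_token)
```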

Splitting the prompt into tokens and converting the predictions back into text are both natural language processing (NLP) techniques.  NLP is the field concerned with using computers to work with text.

The key part of GPT is the third step, where the network predicts which tokens will come after the input tokens.  To do this, the network uses its parameters.  You can think of parameters like memory.  This memory has been trained on a lot of data (gigabytes or terabytes).  Parameters enable neural networks to make accurate predictions about what should come after your prompt.

Usually, the more parameters a model has, the more human-like its output seems.  The smallest GPT models have around 100M parameters, and large ones such as GPT-3 have 175B.  The model uses the parameters to transform your input - usually by adding the parameters to your input, or multiplying your input by the parameters.  By doing this repeatedly across several layers, the model generates predictions.
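To make "multiply, add, and repeat across layers" concrete, here is a minimal sketch in plain Python.  The two layers and their parameter values are made up for illustration, and each layer has just one weight and one bias.  Real networks use many parameters per layer and also apply a nonlinear function between layers, which is left out here.

```python
# Hypothetical parameters for two tiny "layers": each has a weight and a bias
layers = [
    {"weight": 0.5, "bias": 1.0},
    {"weight": 3.0, "bias": -1.0},
]

value = 2.0  # the input
for layer in layers:
    # Multiply the input by one parameter, then add another
    value = value * layer["weight"] + layer["bias"]

print(value)  # 5.0
```

The first layer turns `2.0` into `2.0`, and the second turns that into `5.0`.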

By the end of this course, you'll understand how to train deep neural networks that work like GPT.  This course is broadly split up into three parts:

1. Understanding deep learning - how to create and train deep neural networks, including gradient descent and backpropagation.
2. Deep learning for NLP - deep learning architectures to work with text, including RNNs, encoder/decoders, attention, and transformers.
3. Scaling up models - putting together the building blocks to train a model with a large number of parameters.

In the rest of this lesson, we'll cover some basics that you'll need to understand to start the course.  Neural networks use matrix operations to apply their parameters to input data.  In Python, the NumPy package enables us to do these operations.  If you already understand NumPy, you can skip to the next lesson.  If not, let's learn some fundamentals.

## NumPy fundamentals

NumPy is a Python package that enables us to work with arrays of data more efficiently.  Arrays are similar to lists in Python.

To understand why we need NumPy, let's start with one of the most basic machine learning algorithms, linear regression.  In linear regression, we make a prediction using the equation $\hat{y} = wx + b$.  In deep learning, we call $w$ the weight, and $b$ the bias.

Let's say that we want to predict tomorrow's temperature using three data points from today - today's max temperature, today's min temperature, and how much it rained today.

Let's read in our dataset using pandas:

In [1]:
import pandas as pd

# Read in the data
data = pd.read_csv("../data/clean_weather.csv", index_col=0)
# Fill in any missing values in the data with past values
data = data.ffill()

# Show the first 5 rows of the data
data.head(5)

|  | tmax | tmin | rain | tmax_tomorrow |
|---|---|---|---|---|
| 1970-01-01 | 60.0 | 35.0 | 0.0 | 52.0 |
| 1970-01-02 | 52.0 | 39.0 | 0.0 | 52.0 |
| 1970-01-03 | 52.0 | 35.0 | 0.0 | 53.0 |
| 1970-01-04 | 53.0 | 36.0 | 0.0 | 52.0 |
| 1970-01-05 | 52.0 | 35.0 | 0.0 | 50.0 |


In the table above, tomorrow's temperature is `tmax_tomorrow`, today's max temperature is `tmax`, today's min temperature is `tmin`, and how much it rained today is `rain`.
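If you're curious what the `ffill` call did when we loaded the data, here is a small sketch on a made-up series with one missing reading.  The values and dates are assumptions for illustration, not rows from the real dataset.

```python
import pandas as pd

# A made-up temperature series with one missing reading
temps = pd.Series(
    [60.0, None, 52.0],
    index=["1970-01-01", "1970-01-02", "1970-01-03"],
)

# ffill copies the last known value forward into the gap
filled = temps.ffill()

print(filled.tolist())  # [60.0, 60.0, 52.0]
```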

We have data from `1970` to the present.  With linear regression, we can extend our equation to multiple predictors like this:

$$\hat{y} = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + b$$

So to get a prediction for tomorrow's temperature $\hat{y}$, we can take a value called $w_{1}$, and multiply it by `tmax`, then take $w_{2}$ and multiply it by `tmin`, then take $w_{3}$ and multiply it by `rain`.  We'd then add in $b$.  Here's how that could look:

In [2]:
.7 * 60 + .3 * 35 + .1 * 0 + 10

62.5

The above could be our prediction for the first row of the data, if our $w$ values are `.7`, `.3`, and `.1`, and our $b$ value is `10`.  We'll discuss how you can calculate the correct $w$ and $b$ values in the next lesson, but for now, let's just use these values.

Whenever we want to make a new prediction, we apply the same equation:

In [3]:
w1 = .7
w2 = .3
w3 = .1
b = 10

w1 * 52 + w2 * 39 + w3 * 0 + b

58.099999999999994

What if instead of 3 predictor variables (`tmax`, `tmin`, and `rain`), we had `10`?  Or `100`?  It would get annoying to keep track of that many individual $w$ values.
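One way to cope without linear algebra is to keep the weights in a list and loop over them.  The sketch below reuses the weights and bias from above.  The `predict` helper is a name introduced here for illustration.

```python
def predict(row, weights, bias):
    # Multiply each predictor by its weight, add the products, then add the bias
    return sum(w * x for w, x in zip(weights, row)) + bias

weights = [.7, .3, .1]
bias = 10

first_row = [60, 35, 0]  # tmax, tmin, rain
prediction = predict(first_row, weights, bias)

print(prediction)
```

This gives the same `62.5` we calculated by hand for the first row, and it works for any number of predictors.  It still handles only one row at a time, though.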

Luckily, we can use linear algebra to help us out.  Matrix multiplication is a linear algebra operation defined like this:

$$
A \times B =
\begin{bmatrix}
  a_{11} & a_{12} \\
  a_{21} & a_{22}
\end{bmatrix}
\times
\begin{bmatrix}
  b_{11} \\
  b_{21}
\end{bmatrix}
=
\begin{bmatrix}
  a_{11}b_{11} + a_{12}b_{21} \\
  a_{21}b_{11} + a_{22}b_{21}
\end{bmatrix}
$$

We can visualize how it works with this gif:

![Matrix mult](images/intro/matrix_mult.gif)

As you can see, we take each row of matrix A, multiply it element by element with a column of matrix B, then add the products together to get one entry of the result.  We repeat this for every row and column pair.  The number of columns in the first matrix has to equal the number of rows in the second matrix.
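The row-times-column rule can be written out with two loops.  This is a sketch for building intuition, with made-up numbers.  The `matmul` function is a hypothetical helper, and in practice you'd use NumPy's built-in `@` operator, which the sketch checks itself against.

```python
import numpy as np

def matmul(a, b):
    rows, inner = a.shape
    inner_b, cols = b.shape
    # Columns of the first matrix must equal rows of the second
    if inner != inner_b:
        raise ValueError("columns of a must match rows of b")

    result = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # Multiply row i of a by column j of b element by element, then sum
            result[i, j] = np.sum(a[i, :] * b[:, j])
    return result

a = np.array([[1., 2.],
              [3., 4.]])
b = np.array([[5.],
              [6.]])

product = matmul(a, b)
print(product)                       # rows are 1*5 + 2*6 = 17 and 3*5 + 4*6 = 39
print(np.allclose(product, a @ b))   # True
```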

This is very useful to us when we're multiplying weights by input numbers.  We can take our $x$ values, put them into a matrix, then multiply by the weights (also in a matrix):

In [6]:
# Convert the first 3 rows of data into a numpy matrix from a pandas dataframe
# the first column of the matrix is tmax, second is tmin, third is rain

x = data[["tmax", "tmin", "rain"]].iloc[:3].to_numpy()
x

array([[60., 35.,  0.],
       [52., 39.,  0.],
       [52., 35.,  0.]])

In [7]:
import numpy as np

# Create a matrix with our weights

w = np.array([.7, .3, .1])

w

array([0.7, 0.3, 0.1])

We can verify the shapes of our matrices to make sure they can be multiplied:

In [8]:
# Print the shape (number of rows and columns) in x
x.shape

(3, 3)

In [9]:
# Print the shape of w
w.shape

(3,)

As we can see, `x` is a matrix with `3` rows and `3` columns.  `w` is actually a vector with `3` elements.  A vector is a one-dimensional array, because it only has length in a single dimension.

You can also get arrays with more than 2 dimensions, which we'll work with in later lessons.  Each dimension of the array will have a length corresponding to the size in that dimension.
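As a quick sketch, here is a made-up 3-dimensional array.  The sizes are arbitrary and chosen only to show how `shape` reports one length per dimension.

```python
import numpy as np

# A made-up 3-dimensional array: 2 blocks, each with 3 rows and 4 columns
arr = np.zeros((2, 3, 4))

print(arr.ndim)      # 3
print(arr.shape)     # (2, 3, 4)
print(arr.shape[1])  # 3, the length in dimension 1
```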

`x` has length `3` in dimension `0` (rows) and in dimension `1` (columns).  `w` has length `3` in dimension `0`.

NumPy can multiply a matrix by a vector directly, but to match the matrix multiplication definition above, we'll convert `w` from a vector into a matrix with a single column.  To do that, we can use the NumPy `reshape` method.  We pass in our desired lengths in each dimension:

In [10]:
# Reshape w into a 3 by 1 matrix
w = w.reshape(3,1)
w

array([[0.7],
       [0.3],
       [0.1]])

In [11]:
w.shape

(3, 1)

We now have a 3x3 matrix, and a 3x1 matrix.  Since the number of columns in `x` matches the number of rows in `w`, we can multiply them:

In [12]:
x @ w

array([[52.5],
       [48.1],
       [46.9]])
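As a side note, the `@` operator also accepts a one-dimensional `w`.  The sketch below compares the two cases using the same numbers as above.  The names `w_vector` and `w_column` are introduced here for illustration.  The values are the same either way, and only the shape of the result differs.

```python
import numpy as np

x = np.array([[60., 35., 0.],
              [52., 39., 0.],
              [52., 35., 0.]])

w_vector = np.array([.7, .3, .1])   # shape (3,)
w_column = w_vector.reshape(-1, 1)  # shape (3, 1); -1 lets NumPy infer the length

flat = x @ w_vector     # 1-D result
column = x @ w_column   # 2-D result, one row per prediction

print(flat.shape)    # (3,)
print(column.shape)  # (3, 1)
```

Keeping `w` as a column means the output keeps one row per input row, which makes the shapes easier to follow.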

We can now add in our $b$ value, `10`.  $b$ is just a single number, which we store in a one-element array.  When we add it to `x @ w`, it will be broadcast across the matrix (added to each element):

In [13]:
b = np.array([10])
x @ w + b

array([[62.5],
       [58.1],
       [56.9]])
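Broadcasting depends on the shapes of both arrays, so it's worth checking the shape of the result.  The sketch below reuses the `x @ w` values from above.  The length-3 vector in the second half is made up to show a common pitfall.

```python
import numpy as np

predictions = np.array([[52.5], [48.1], [46.9]])  # shape (3, 1)

# A one-element array (or a plain number) is added to every element
with_bias = predictions + np.array([10])
print(with_bias.shape)  # (3, 1)

# Pitfall: adding a length-3 vector to a (3, 1) column stretches both arrays
# and silently produces a 3x3 matrix
grid = predictions + np.array([1, 2, 3])
print(grid.shape)  # (3, 3)
```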

The first row matches the prediction we calculated by hand from our `w` and `x` values:

In [14]:
w1 = .7
w2 = .3
w3 = .1
b = 10

w1 * 60 + w2 * 35 + w3 * 0 + b

62.5

This is the power of matrix multiplication - it enables us to store all of our parameters ($w$ and $b$ values) in arrays and then use them to modify our inputs.  This is a lot easier to manage than keeping track of each individual variable!
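The same one-line calculation works no matter how many predictors there are.  Here is a sketch with `100` predictors, using random made-up data in place of real weather measurements.  The `predict` function and the `x_big` and `w_big` names are assumptions for illustration.

```python
import numpy as np

def predict(x, w, b):
    # x: (rows, predictors), w: (predictors, 1), b: a single number
    # Returns one prediction per row, with shape (rows, 1)
    return x @ w + b

rng = np.random.default_rng(0)
x_big = rng.normal(size=(5, 100))  # 5 made-up rows with 100 predictors each
w_big = rng.normal(size=(100, 1))  # one weight per predictor

preds = predict(x_big, w_big, 10)
print(preds.shape)  # (5, 1)
```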