# LLM Learning

Following [this guide](https://www.youtube.com/watch?v=UU1WVnMk4E8) to learn about large language models.

## Setting up with a small training model using [The Lord of the Rings](lotr.txt) text

First, we pick the device to run on and set the `block_size` and `batch_size` hyperparameters. Then we open the file, read its contents into `text`, and collect every unique character (sorted) into `chars`.

In [None]:
import torch

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")  # same type in both branches
# device = "cuda" if torch.cuda.is_available() else "cpu"

block_size = 8
batch_size = 4

print(device)

mps


In [17]:
with open("lotr.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text)) # gets all of the characters in the text file using set, then sorts them
vocabulary_size = len(chars)
print(vocabulary_size)

85


In [12]:
string_to_int = { ch:i for i, ch in enumerate(chars) }      # creates a dict "\n": 0, " ": 1, etc.
int_to_string = { i:ch for i, ch in enumerate(chars) }      # creates a dict 0: "\n", 1: " ", etc.
encode = lambda s: [string_to_int[c] for c in s]            # converts an input string to a list of ints using the string_to_int map
decode = lambda l: "".join([int_to_string[i] for i in l])   # converts an input list of ints to a string using the int_to_string map

To show how these functions work, let's look at them in a more traditional way.

In [13]:
# def string_to_int():
#     return_dict = {}
#     for i, ch in enumerate(chars):
#         return_dict[ch] = i
#     return return_dict

# def int_to_string():
#     return_dict = {}
#     for i, ch in enumerate(chars):
#         return_dict[i] = ch    # the key is the int itself, matching the dict comprehension above
#     return return_dict

# def encode(s):
#     return_s = []
#     for c in s:
#         return_s.append(string_to_int[c])
#     return return_s

# def decode(s):
#     return_s = ""
#     for i in s:
#         return_s += int_to_string[i]
#     return return_s

To see how `encode` and `decode` work, here's an example of encoding and decoding *"The Lord of the Rings"*.

In [14]:
print(encode("The Lord of the Rings"))
print(decode(encode("The Lord of the Rings")))

[44, 60, 57, 1, 36, 67, 70, 56, 1, 67, 58, 1, 72, 60, 57, 1, 42, 61, 66, 59, 71]
The Lord of the Rings


### Some info on tokenizers

The example above (`string_to_int`, `int_to_string`, `encode`, and `decode`) is a **character-level tokenizer**. It converts each character to its equivalent integer and back. This leaves us with a small vocabulary (85 characters here) but a large number of tokens to process, since every single character in the text becomes its own token.
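
To make the token-count side of that trade-off concrete, here is a minimal sketch that tokenizes one short string two ways. The word-level tokenizer is only here for comparison and is not part of this notebook, and `sample`, `char_to_int`, and `word_to_int` are names made up for this illustration. On a string this short the vocabulary sizes aren't representative; over a whole book, the character vocabulary stays tiny while a word-level vocabulary would be far larger.

```python
# Hypothetical sample; the notebook itself uses the full lotr.txt.
sample = "The Lord of the Rings"

# Character-level: every distinct character gets an integer.
char_to_int = {ch: i for i, ch in enumerate(sorted(set(sample)))}
char_tokens = [char_to_int[ch] for ch in sample]

# Word-level (comparison only): every distinct whitespace-separated word gets an integer.
words = sample.split()
word_to_int = {w: i for i, w in enumerate(sorted(set(words)))}
word_tokens = [word_to_int[w] for w in words]

print(len(char_tokens), len(word_tokens))  # 21 5
```

The same sentence costs 21 tokens at the character level and only 5 at the word level.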

### On `pytorch`

Plain Python lists aren't an efficient way to hold the encoded data, so we'll use PyTorch's tensors instead.

In [15]:
data = torch.tensor(encode(text), dtype=torch.long)       # stores the encoded text as a 1-D tensor of 64-bit integers (torch.long)
print(data[:100])

tensor([27, 60, 53, 68, 72, 57, 70,  1, 33,  0, 25, 66,  1, 45, 66, 57, 76, 68,
        57, 55, 72, 57, 56,  1, 40, 53, 70, 72, 77,  0, 33, 66,  1, 53,  1, 60,
        67, 64, 57,  1, 61, 66,  1, 72, 60, 57,  1, 59, 70, 67, 73, 66, 56,  1,
        72, 60, 57, 70, 57,  1, 64, 61, 74, 57, 56,  1, 53,  1, 60, 67, 54, 54,
        61, 72, 10,  1, 38, 67, 72,  1, 53,  1, 66, 53, 71, 72, 77,  8,  1, 56,
        61, 70, 72, 77,  8,  1, 75, 57, 72,  1])
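
As a small aside, here is a sketch of what a tensor gives us over a list, using the three integers that `encode` produced for "The" earlier. The variable names `tokens` and `t` are just for this illustration.

```python
import torch

tokens = [44, 60, 57]                        # encoded "The" from the earlier example
t = torch.tensor(tokens, dtype=torch.long)

print(t.dtype, t.shape)   # the element type and the size of the tensor
print(t + 1)              # arithmetic applies to every element at once, no loop needed
print(t.tolist())         # converts back to a plain Python list, e.g. to pass to decode
```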


### Some info on training and validation splits

Let's say we have a text corpus, a document with tons of text. We would make our training set the first 80% of it and hold out the remaining 20% for validation. If we instead trained on the entire 100%, we'd have no unseen text left to check the model against, so we couldn't tell whether it had learned general patterns or had simply memorized the text. The purpose of language modeling is to generate new text that's like the training data, not to reproduce it. So the model only ever trains on the 80%, and we measure how well it does on the 20% it has never seen. If it does well on the training split but poorly on the validation split, it's memorizing rather than generalizing.

### On the name of this file, `bigram` language

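The definition for bigram is: *a pair of consecutive written units such as letters, syllables, or words*. Let's say we have the word "Hello". At the character level, each character is paired with the one that follows it:

| Context | Next character |
|---|---|
| Start of context | "H" |
| "H" | "e" |
| "e" | "l" |
| "l" | "l" |
| "l" | "o" |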
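
The pairs for "Hello" can also be produced in a couple of lines. This is just a sketch using plain Python (`word` and `bigrams` are illustrative names), and it leaves out the "start of context" row since there's no character before "H".

```python
word = "Hello"

# Pair each character with the character that follows it.
bigrams = list(zip(word, word[1:]))

print(bigrams)  # [('H', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
```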

### On inputs and targets

A bigram model only uses the previous character to predict the next. To see how inputs and targets line up over a longer sequence, consider this example:

```python
block_size = 5

# ... [5, 67, 21, 58, 40] 35 ...    [:block_size]
# ... 5 [67, 21, 58, 40, 35] ...    [1:block_size+1]
```

In the example above, the first list of integers is the input. The second list is the same sequence shifted by one position, and it holds the targets to predict: each target is the character that comes right after the matching input. The `block_size` is the number of characters in each training sequence. During training, we measure how far the model's prediction is from each target and use that difference to adjust the model.

Below, let's implement it in Python.

In [None]:
n = int(0.8 * len(data))    # the split point in the data (80%)
train_data = data[:n]       # the training data is the first 80% of the data
val_data = data[n:]         # the validation data is the last 20% of the data

x = train_data[:block_size]
y = train_data[1:block_size + 1]

print(train_data)

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} target is {target}")

tensor([27, 60, 53,  ...,  1, 67, 66])
when input is tensor([27]) target is 60
when input is tensor([27, 60]) target is 53
when input is tensor([27, 60, 53]) target is 68
when input is tensor([27, 60, 53, 68]) target is 72
when input is tensor([27, 60, 53, 68, 72]) target is 57
when input is tensor([27, 60, 53, 68, 72, 57]) target is 70
when input is tensor([27, 60, 53, 68, 72, 57, 70]) target is 1
when input is tensor([27, 60, 53, 68, 72, 57, 70,  1]) target is 33


While this can do predictions, it is not scalable. The loop above is sequential, which is how a CPU works: each operation is fast, but they run one after another. A GPU can run many simple operations in parallel. So, if using a GPU, we would take a bunch of these operations and stack them to run at the same time, which lets us scale to more training data. Using the code from above, we could take multiple `x`, `y` pairs and process them all at once instead of one at a time.

The number of blocks we process at once is a hyperparameter called `batch_size`. So, we would have maybe 8 blocks in a batch, each 12 characters long. In other words, the `block_size` is the length of each sequence and the `batch_size` is how many of these sequences run at the same time. Without parallel hardware, we wouldn't get that speed. Since I don't have a dedicated GPU and will be using my M1 Mac, I won't get the performance I'd get with one, but the M1 chip (through the `mps` device) will still perform very well for my uses.
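
Here's a rough sketch of what pulling one batch could look like. The `get_batch` helper is a name I'm assuming for illustration (it isn't defined anywhere in this notebook yet), and `torch.arange(100)` is stand-in data so the block runs on its own; in the notebook, `train_data` would take its place.

```python
import torch

block_size = 8   # length of each sequence
batch_size = 4   # how many sequences are processed at the same time

# Stand-in for the encoded text (assumption); the notebook would pass train_data.
data = torch.arange(100, dtype=torch.long)

def get_batch(data):
    # Random start positions, leaving room for a full block plus its final target.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets, offset by one
    return x, y

x, y = get_batch(data)
print(x.shape, y.shape)  # both have shape (batch_size, block_size)
```

Each row of `x` is one block, and the matching row of `y` is the same block shifted by one, so a whole batch of input/target pairs sits in two tensors that can be processed together.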

### `torch` examples and info

Check out the [torch-examples](torch-examples.ipynb) file to see some of `torch`'s capabilities and some M1 (GPU)/CPU comparisons.

Continue with [More `PyTorch` Functions](https://www.youtube.com/watch?v=UU1WVnMk4E8&t=2869s&pp=0gcJCdkCDuyUWbzu)