<a href="https://colab.research.google.com/github/badlogic/genai-workshop/blob/main/04_unsupervised_learning_language_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> **Note:** Before you continue, make sure the notebook runtime on Google Colab is set to T4. Click on the arrow to the right of the "RAM/Disk" area in the top right corner, then click "Runtime Type" and select T4. This will start the notebook with a GPU-enabled runtime.

# Language Models
Language models are at the core of the recent AI hype, brought to you by OpenAI's ChatGPT, products like GitHub's Copilot, or open-source models like Llama.

As described in [the unsupervised learning overview](https://colab.research.google.com/drive/10tlC17BRVoX9aPp66orqiI16iUzLx-4p?usp=sharing), the goal of a language model is to predict the likelihood of a sequence of **tokens** in a language. In the simplest case, a token is a word or a punctuation mark. More sophisticated tokenizers break words down into sub-units, like [byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) or [SentencePiece](https://github.com/google/sentencepiece).
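
To make the "simplest case" concrete, here is a minimal sketch of a tokenizer that splits text into words and punctuation marks with a regular expression. The function name `simple_tokenize` and the pattern are illustrative assumptions, not what NLTK or any production tokenizer actually does.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits or underscores (a "word");
    # [^\w\s] matches a single punctuation character. Whitespace is dropped.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The quick brown fox, surprisingly, jumped.")
print(tokens)
```

Each element of the resulting list is one token. Sub-word tokenizers go one step further and may split a single word into several pieces.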

A language model learns patterns and structures based on large amounts of text data to generate or understand text. The patterns are learned in an unsupervised manner directly from the training data.

Such models can then be used to predict the next token in a sequence, or fill in a token in-between other tokens of a text.

Let's create a simple language model of our own to build an initial intuition about such models, before we dive into the current state of the art in more depth.

# A naive n-gram language model
The simplest language model we can build is an [n-gram model](https://en.wikipedia.org/wiki/Word_n-gram_language_model), based on word tokens.


## Data Preparation
Before we start building the model, we need some data to learn from. We'll use a small [Wikipedia dataset](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-raw-v1), consisting of roughly 44k paragraphs from Wikipedia articles, of which we use about 36k to train the language model ("train" split).

In [2]:
!pip -q install datasets

In [3]:
from datasets import load_dataset
import pandas as pd

paragraphs = load_dataset('wikitext', 'wikitext-2-raw-v1')["train"]
print(f'{len(paragraphs)} paragraphs')
pd.DataFrame(paragraphs["text"][:100], columns=["text"])

36718 paragraphs


Unnamed: 0,text
0,
1,= Valkyria Chronicles III = \n
2,
3,Senjō no Valkyria 3 : Unrecorded Chronicles (...
4,"The game began development in 2010 , carrying..."
...,...
95,"75 @,@ 000 buck & ball cartridges - percussio..."
96,"14 @,@ 000 buck & ball cartridges - flint \n"
97,275 paper fuzes \n
98,"117 rounds , 6 @-@ pounder canister shot \n"


Next, we need to split each paragraph into sentences, and each sentence into word tokens. We use the sentence splitter and tokenizer from [NLTK](https://www.nltk.org/) to achieve this.

In [4]:
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk

nltk.download('punkt')
sentences = [word_tokenize(sentence) for text in paragraphs['text'] for sentence in sent_tokenize(text)]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Here's one tokenized sentence.

In [5]:
pd.DataFrame(sentences[10], columns=["token"])

Unnamed: 0,token
0,It
1,met
2,with
3,positive
4,sales
5,in
6,Japan
7,","
8,and
9,was


## Training

Time to train the language model. The algorithm may look a bit daunting, but the concept is very simple.

We first check that the provided parameter `n` is valid, that is, greater than or equal to 1.

Next we instantiate a dictionary called `n_gram_stats`, which maps from a prefix to a dictionary. The dictionary for a prefix contains all the tokens and their counts that were observed in the training data following the prefix.

Next, we iterate through all sentences. For each sentence, we iterate through all token positions.

For each token position, we initialize the prefix to the empty string. We then iterate through all token positions from the current token position to the current token position + `n`. In each step, we get the current token for the position and add one to the dictionary entry of its prefix. We then expand the prefix with this token, and move on to the next token position.

Once we've collected all statistics, we convert each dictionary for a prefix to a list, and sort the `(token, count)` tuples in descending order by count. This will later help us speed up generation.

> **Note:** standard n-gram models are probabilistic in nature and would normalize the counts to model probability distributions. For our purposes, it's sufficient to stick to the raw counts.

Execute the code below to see the training algorithm build a model over the single sentence `"The quick brown fox jumped over the lazy dog"` using `n=5`. The debug parameter is set to `True`, so the training code will output all n-grams it records counts for.

In [11]:
from collections import defaultdict

def train_lm(sentences, n, debug=False):
  if (n < 1):
    raise ValueError("n must be >= 1")
  n_gram_stats = defaultdict(lambda: defaultdict(int))
  for sentence in sentences:
    for i in range(len(sentence)):
      prefix = ""
      for j in range(0, n):
        if i + j < len(sentence):
          token = sentence[i + j]
          n_gram_stats[prefix][token] += 1
          if (debug):
            print(f'{prefix} -> {token}')
          prefix += (" " if prefix else "") + token

  for prefix, token_counts in n_gram_stats.items():
    sorted_token_counts = sorted(token_counts.items(), key=lambda item: item[1], reverse=True)
    n_gram_stats[prefix] = sorted_token_counts

  return n_gram_stats

sentence = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog", "."]
model = train_lm([sentence], 5, True)

 -> The
The -> quick
The quick -> brown
The quick brown -> fox
The quick brown fox -> jumped
 -> quick
quick -> brown
quick brown -> fox
quick brown fox -> jumped
quick brown fox jumped -> over
 -> brown
brown -> fox
brown fox -> jumped
brown fox jumped -> over
brown fox jumped over -> the
 -> fox
fox -> jumped
fox jumped -> over
fox jumped over -> the
fox jumped over the -> lazy
 -> jumped
jumped -> over
jumped over -> the
jumped over the -> lazy
jumped over the lazy -> dog
 -> over
over -> the
over the -> lazy
over the lazy -> dog
over the lazy dog -> .
 -> the
the -> lazy
the lazy -> dog
the lazy dog -> .
 -> lazy
lazy -> dog
lazy dog -> .
 -> dog
dog -> .
 -> .


We can also output the recorded counts for each `prefix -> token` pair the model learned.

In [12]:
for prefix, tokens in model.items():
  for token, count in tokens:
    print(f'"{prefix}", "{token}" -> {count}')

"", "The" -> 1
"", "quick" -> 1
"", "brown" -> 1
"", "fox" -> 1
"", "jumped" -> 1
"", "over" -> 1
"", "the" -> 1
"", "lazy" -> 1
"", "dog" -> 1
"", "." -> 1
"The", "quick" -> 1
"The quick", "brown" -> 1
"The quick brown", "fox" -> 1
"The quick brown fox", "jumped" -> 1
"quick", "brown" -> 1
"quick brown", "fox" -> 1
"quick brown fox", "jumped" -> 1
"quick brown fox jumped", "over" -> 1
"brown", "fox" -> 1
"brown fox", "jumped" -> 1
"brown fox jumped", "over" -> 1
"brown fox jumped over", "the" -> 1
"fox", "jumped" -> 1
"fox jumped", "over" -> 1
"fox jumped over", "the" -> 1
"fox jumped over the", "lazy" -> 1
"jumped", "over" -> 1
"jumped over", "the" -> 1
"jumped over the", "lazy" -> 1
"jumped over the lazy", "dog" -> 1
"over", "the" -> 1
"over the", "lazy" -> 1
"over the lazy", "dog" -> 1
"over the lazy dog", "." -> 1
"the", "lazy" -> 1
"the lazy", "dog" -> 1
"the lazy dog", "." -> 1
"lazy", "dog" -> 1
"lazy dog", "." -> 1
"dog", "." -> 1


With `n=5`, the model learns the counts of sequences of up to `n` tokens. It cannot learn the counts of longer sequences. We can increase `n`, but we will eventually run out of memory.
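
The note in the training description mentions that standard n-gram models normalize counts into probabilities. As a minimal sketch of that step, assuming the `(token, count)` list format that `train_lm` produces for each prefix, the counts can be divided by their sum. The helper `normalize_counts` and the example counts are hypothetical and only serve as illustration.

```python
def normalize_counts(token_counts):
    # token_counts: list of (token, count) tuples for one prefix
    total = sum(count for _, count in token_counts)
    # Dividing by the total turns raw counts into a probability distribution
    return [(token, count / total) for token, count in token_counts]

example_counts = [("the", 3), ("a", 1)]  # made-up counts for a single prefix
probs = normalize_counts(example_counts)
print(probs)
```

Because normalization preserves the ordering of the tokens, picking the most likely token works the same on raw counts, which is why the notebook can skip this step.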

Let's build the model over all sentences from our Wikipedia training set using a token window size of `n=8`. We also want to measure the approximate RAM consumption of the resulting model, as well as the execution time of the training.

In [13]:
import psutil
import os
import gc
import time

# Function to get current RAM usage in MB
def get_ram_usage():
    process = psutil.Process(os.getpid())
    return f"{process.memory_info().rss / 1024 / 1024: .2f} MB"

for i in range(20):
  gc.collect()
print(f'Before: {get_ram_usage()}')
start_time = time.time()

model = train_lm(sentences, 8)

print(f'After: {get_ram_usage()}')
print(f'Took: {time.time() - start_time} seconds')

Before:  555.16 MB
After:  3760.29 MB
Took: 41.685065269470215 seconds


Execution takes around 30-60 seconds, and the model takes up roughly 3.2GB of RAM. That's a lot for such a small dataset and token window size!

We could improve both execution time and RAM usage by employing a smarter encoding scheme. For the sake of brevity, that is left as an exercise to the reader. It should make us better appreciate what large language models are capable of, though. For only 2 to 4 times that amount of (V)RAM, we can run a vastly more capable large language model like [Llama 7B](https://huggingface.co/meta-llama/Llama-2-7b).
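
One possible direction for such an encoding scheme, sketched under assumptions: map every distinct token to an integer id once, and use tuples of ids as prefixes instead of concatenated strings. The names `build_vocab` and `train_lm_ids` are hypothetical, and the sketch only demonstrates the encoding idea. Whether and how much it saves would have to be measured, and a trie or packed arrays would likely be needed for larger gains.

```python
from collections import defaultdict

def build_vocab(sentences):
    # Assign each distinct token an integer id in order of first appearance
    vocab = {}
    for sentence in sentences:
        for token in sentence:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def train_lm_ids(sentences, n, vocab):
    # Same counting scheme as train_lm, but prefixes are tuples of token ids
    stats = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        ids = [vocab[token] for token in sentence]
        for i in range(len(ids)):
            for j in range(n):
                if i + j < len(ids):
                    # ids[i:i + j] is the prefix, ids[i + j] the token that follows
                    stats[tuple(ids[i:i + j])][ids[i + j]] += 1
    return stats

toy_sentences = [["the", "lazy", "dog"], ["the", "lazy", "cat"]]
vocab = build_vocab(toy_sentences)
stats = train_lm_ids(toy_sentences, 3, vocab)
print(dict(stats[(vocab["the"], vocab["lazy"])]))
```

The empty tuple `()` plays the role of the empty string prefix `""` in the original model and holds the unigram counts.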

## Generating sentences
Given a token sequence, we can now predict the next most likely token using our language model. Let's write a function called `predict_lm` that does just that.

It takes as input the model, the sequence for which we want to predict the next token, and two optional parameters: `top_k`, which defines how many of the most probable tokens to consider, and `randomize`, which defines whether a token should be picked at random from the `top_k` candidate tokens, or whether the most probable token should be picked. These two optional parameters define how "creative" our language model is allowed to get.

The function first tokenizes the input sequence.

It then tries to find the longest prefix from the end of the sequence in the model for which a token list is available.

If no token list can be found, it fetches the token list for the prefix `""`, which is the list of unigram counts.

Next, it selects the top-k tokens from the list based on their count.

If randomization is enabled, it will pick a token at random from this list. Otherwise, it will pick the token with the highest count.

In [14]:
import random

def predict_lm(model, sequence, top_k=5, randomize=True, debug=False):
    input_tokens = word_tokenize(sequence)
    n = max(1, len(input_tokens))
    token_list = None

    while n > 0 and token_list is None:
        prefix = " ".join(input_tokens[-n:])
        if debug:
          print(f'Probing prefix "{prefix}"')
        token_list = model.get(prefix)
        if token_list is None:
          n -= 1
        else:
          if debug:
            print(f'Full token list for prefix "{prefix}": {token_list}')

    if token_list is None:
      token_list = model.get("")

    top_k_tokens = token_list[:top_k]
    if debug:
      print(f'Top-k tokens for prefix "{prefix}": {top_k_tokens}')
    if randomize:
        next_token = random.choice(top_k_tokens)[0]
    else:
        next_token = token_list[0][0]
    return next_token

Let's see what the model predicts for "The cat in":

In [15]:
predict_lm(model, "The cat in", 10, True, True)

Probing prefix "The cat in"
Probing prefix "cat in"
Full token list for prefix "cat in": [('the', 1), ('every', 1), ('its', 1)]
Top-k tokens for prefix "cat in": [('the', 1), ('every', 1), ('its', 1)]


'every'
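
With `randomize=True`, `predict_lm` picks uniformly among the top-k candidates, so a token seen once is as likely to be chosen as one seen a hundred times. A possible variation, shown here as a sketch and not as part of the notebook's model, draws a token in proportion to its count. The function name `sample_weighted` and the example counts are hypothetical.

```python
import random

def sample_weighted(token_list, top_k=5):
    # token_list: (token, count) tuples sorted by descending count
    candidates = token_list[:top_k]
    tokens = [token for token, _ in candidates]
    weights = [count for _, count in candidates]
    # random.choices draws with probability proportional to the weights
    return random.choices(tokens, weights=weights, k=1)[0]

example = [("the", 8), ("a", 1), ("its", 1)]  # made-up counts
print(sample_weighted(example))
```

This keeps some randomness while still favouring continuations the model has seen more often.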

Let's write a function called `complete_lm` which generates a complete sentence based on an input query.

It takes the same inputs as `predict_lm`, plus an additional parameter `max_tokens`, which defines the maximum number of tokens to generate.

The function will call `predict_lm` in a loop, until either a "." was predicted, or the maximal number of tokens was generated.

The generated token is appended to the current sequence. This extended sequence is then fed back into the model prediction function to generate the next token.

This is also how large language models work at the core.

In [16]:
def complete_lm(model, sequence, max_tokens=40, top_k=5, randomize=True, debug=False):
  for i in range(max_tokens):
    # Predict one token, append it, and feed the extended sequence back in
    next_token = predict_lm(model, sequence, top_k, randomize, debug)
    sequence += " " + next_token
    if next_token == ".":
      break

  return sequence

In [17]:
complete_lm(model, "The president of", 40, 5, True, True)

Probing prefix "The president of"
Full token list for prefix "The president of": [('the', 3)]
Top-k tokens for prefix "The president of": [('the', 3)]
Probing prefix "The president of the"
Full token list for prefix "The president of the": [('Supreme', 1), ('Constitutional', 1), ('university', 1)]
Top-k tokens for prefix "The president of the": [('Supreme', 1), ('Constitutional', 1), ('university', 1)]
Probing prefix "The president of the university"
Full token list for prefix "The president of the university": [('later', 1)]
Top-k tokens for prefix "The president of the university": [('later', 1)]
Probing prefix "The president of the university later"
Full token list for prefix "The president of the university later": [('recalled', 1)]
Top-k tokens for prefix "The president of the university later": [('recalled', 1)]
Probing prefix "The president of the university later recalled"
Full token list for prefix "The president of the university later recalled": [('that', 1)]
Top-k tokens fo

'The president of the university later recalled that he was often mentally jumbled and disorganized near the end of his employment .'

From the debug output, we can see that the model has memorized longer sequences, which it will faithfully reproduce, for lack of alternatives. Let's try a few more.

In [18]:
complete_lm(model, "Cassini")

'Cassini probe approached the planet in 2000 and took very detailed images of its atmosphere .'

In [19]:
complete_lm(model, "Schwarzenegger")

'Schwarzenegger character Terminator , he prompts Leslie to do that impression as well .'

In [20]:
complete_lm(model, "And then the")

'And then the federal Treasury stopped paying out gold at face value .'

In [21]:
complete_lm(model, "Aliens have")

'Aliens have to wait `` , referring to a scene in the `` New York `` episode where Kurt discussed the planned move with Rachel .'

Let us contrast this with a large language model.

# Playing with pre-trained large language models
Even our small n-gram language model takes up a considerable amount of RAM and a surprising amount of training time.

More capable large language models take many orders of magnitude more training time (days to months), as they are trained on trillions of tokens instead of just a couple of million. Their architecture is much more complex, with the number of model parameters ranging from a few hundred million (BERT) to a trillion (like Google Brain's [latest entry](https://aibusiness.com/nlp/google-brain-unveils-trillion-parameter-ai-language-model-the-largest-yet)). They also don't just learn simple n-gram statistics, but complex latent variables that take into account linguistic and semantic properties of text at a much, much deeper level.

As such, the community around large language models has started the trend of publishing **pre-trained large language models**, which allows organizations and individuals without the necessary resources to use and adapt these models.

A popular hub for exchanging models is [Hugging Face](https://huggingface.co/models), where pre-trained, **open-weights models** like Llama, Mistral, and others can be found. The models are published together with the model architecture, which allows them to be run via [Hugging Face transformers](https://huggingface.co/docs/transformers/en/index) and similar libraries and frameworks.

We will dive into the details of large language model architecture later. For now, we want to load a "small" large language model and compare its output with our own language model above. Ultimately, large language models and our puny n-gram model do the same thing: predict the next token for a sequence.

## GPU support
Running such large language models on the CPU is possible, but not the most enjoyable pastime in terms of speed. For any productive work, GPU support is mandatory.

To run a large language model on the GPU in Google Colab, we need to select a runtime with GPU support, like T4 (free), A100 (paid), or V100 (paid). Click on `Runtime -> Change runtime type` above, switch to a GPU runtime, and restart the session.

## Quantization
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/bitsandbytes/FP8-scheme.png)

The free Google Colab T4 runtime features a GPU with 16GB of RAM. How much RAM a model requires can be ballparked via its parameter count.

The number of parameters is usually part of the model name, e.g. the Mistral-7B model has 7 billion parameters. How big is one parameter? All of these models are deep neural networks, like the one we built in the supervised learning section. One parameter is a weight (or bias) in such a network. Weights and biases are encoded as floating point numbers, with a bit depth ranging from [4-bit](https://arxiv.org/abs/2310.16836) to 64-bit. Most open-weight models are published with [16-bit wide float](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) or [bfloat](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) weights and biases.

A 7B model with weights encoded as 16-bit (b)float thus requires at a minimum 14GB of RAM. This does not include working memory required to hold auxiliary data.
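
The 14GB figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below spells this out for a few bit depths. The helper name `estimate_weight_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and the estimate covers the weights only, not the working memory.

```python
def estimate_weight_memory_gb(num_parameters, bits_per_parameter):
    # bytes = parameters * bits / 8, then convert bytes to GB (10^9 bytes)
    return num_parameters * bits_per_parameter / 8 / 1e9

for bits in (32, 16, 8, 4):
    gb = estimate_weight_memory_gb(7e9, bits)
    print(f"7B parameters at {bits}-bit: {gb:.1f} GB")
```

At 16 bits this gives the 14GB mentioned above, and at 4 bits the weights shrink to 3.5GB.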

Getting such a model to fit into a consumer-grade GPU or a Google Colab T4 instance ranges from hard to impossible.

Thus, practitioners have come up with quantization schemes that reduce the bit depth of model weights from 16 bits to 8 or even 4 bits. This reduces memory requirements by roughly 2x-4x, and can also speed up inference. Of course, there are also downsides: since the weights are compressed lossily, the model may suffer some quality degradation. For many models, this degradation is acceptable though, especially during development.
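
To get a feeling for what "lossy" means here, the following sketch implements a very simple round-to-nearest quantization with a single scale factor (often called absmax quantization). This is explicitly not the scheme used by real 4-bit methods, which are considerably more sophisticated. The function names and the example weights are made up for illustration.

```python
def quantize_absmax(weights, bits=8):
    # Largest representable magnitude for a signed integer of this bit depth
    qmax = 2 ** (bits - 1) - 1
    # One scale for all weights; fall back to 1.0 if all weights are zero
    scale = (max(abs(w) for w in weights) or 1.0) / qmax
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate floats; the rounding error is lost for good
    return [q * scale for q in quantized]

weights = [0.12, -0.53, 0.98, -0.07]  # made-up example weights
q8, scale8 = quantize_absmax(weights, bits=8)
q4, scale4 = quantize_absmax(weights, bits=4)
print(q8, dequantize(q8, scale8))
print(q4, dequantize(q4, scale4))
```

The 4-bit version only has 15 integer levels to work with, so the restored weights deviate more from the originals than in the 8-bit case.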

One such quantization method is AWQ, available through the [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) package. It quantizes model weights to 4-bit, while trying to maintain as much fidelity as possible. AutoAWQ is compatible with the Hugging Face Transformers library, which we'll use to load and run AWQ-quantized LLMs.

Let's install AutoAWQ:


In [22]:
!pip -q install autoawq

## Loading Mistral-7B
A handful of notorious users on Hugging Face seem to have made it their life goal to quantize any and all open-weight LLMs in existence. The most prominent is "TheBloke", who also provides an AWQ-quantized version of the [Mistral-7B model](https://huggingface.co/TheBloke/Mistral-7B-v0.1-AWQ).

Mistral is a family of LLMs by [Mistral AI](https://mistral.ai/contact/), a French company working in the AI space. Mistral models are regularly at the top of [LLM model benchmarks](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).

Like most LLM families, Mistral comes in [several flavours](https://huggingface.co/mistralai). Generally, we differentiate between pre-trained models with and without **instruction fine-tuning**. Models without instruction fine-tuning are sort of "raw". They can generally not follow instructions given to them, but will merely try to come up with the most likely next token, given a sequence. Just like our n-gram model above. Just way smarter.

Let's load the AWQ-quantized Mistral-7B from Hugging Face. To do so, we use the Hugging Face Transformers class `AutoModelForCausalLM`. The class name already tells us what the class does.

`AutoModel` means that the class knows how to load a specific model's architecture and its corresponding parameters.

The `ForCausalLM` part means that the model to be loaded is expected to be a large language model for next-token prediction (also known as causal language modeling). Check out the [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto) documentation for more details.

The `AutoModelForCausalLM` class has a static method `from_pretrained()` which takes the identifier of a causal large language model to be downloaded from the Hugging Face Hub. The identifier is composed of the name of the user or organization that has uploaded the model to the hub, e.g. "TheBloke", and the model name, e.g. "Mistral-7B-v0.1-AWQ", separated by a slash. You can also specify a local directory instead.

The method will cache the model locally, so subsequent loads are faster. Once downloaded, the `from_pretrained` method then instantiates the concrete model class, in our case [MistralForCausalLM](https://huggingface.co/docs/transformers/v4.38.1/en/model_doc/mistral#transformers.MistralForCausalLM). This class knows how to instantiate all the modules and layers of the model. Once instantiated, the model parameters from the downloaded (or local) model file(s) are loaded into the model.

The model can be instantiated CPU-side or GPU-side. We tell the `from_pretrained` method where to place it via the `device_map` parameter. [CUDA](https://developer.nvidia.com/cuda-zone) by NVIDIA is the de facto framework when it comes to doing computations on a GPU. It also only works for NVIDIA GPUs. Passing `device_map="cuda"` asks the method to put the model on a CUDA-enabled GPU, which is why the notebook needs a GPU runtime.

A lot of words for 3 lines of code. Here's how we load the model:


In [23]:
from transformers import AutoModelForCausalLM

model_name = "TheBloke/Mistral-7B-v0.1-AWQ"
llm = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda")

config.json:   0%|          | 0.00/757 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.15G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Printing the model tells us about its internal architecture. Compare and contrast this with our own little deep-ish neural network.

In [24]:
print(llm)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=False, w_bit=4, group_size=128)
          (k_proj): WQLinear_GEMM(in_features=4096, out_features=1024, bias=False, w_bit=4, group_size=128)
          (v_proj): WQLinear_GEMM(in_features=4096, out_features=1024, bias=False, w_bit=4, group_size=128)
          (o_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=False, w_bit=4, group_size=128)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): WQLinear_GEMM(in_features=4096, out_features=14336, bias=False, w_bit=4, group_size=128)
          (up_proj): WQLinear_GEMM(in_features=4096, out_features=14336, bias=False, w_bit=4, group_size=128)
          (down_proj): WQLinear_GEMM(in_features=143

## Tokenization
While large language models deal with text, they cannot work on raw text input directly. Just as in the case of our n-gram model, we need to use a tokenizer to preprocess any text we want to pass to the large language model.

Tokenizers for large language models are a science in their own right. We've already mentioned [byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) and other tokenizers that, unlike the tokenizer in our n-gram case, do not operate on word boundaries.
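
To give a rough idea of how such a sub-word vocabulary can come about, here is a minimal sketch of the merge-learning loop at the heart of byte pair encoding: repeatedly find the most frequent pair of adjacent symbols and merge it into a new symbol. The toy word frequencies and the function names are made up, and production tokenizers add many details this sketch omits.

```python
from collections import Counter

def most_frequent_pair(words):
    # words maps a tuple of symbols to how often that word occurs
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Replace every adjacent occurrence of pair with one merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word -> frequency; every word starts as a sequence of characters
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After a few merges, frequent character sequences such as the suffix `est` become single symbols, which is how sub-word units like `Pe` and `ach` can end up in a vocabulary.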

Whatever model we use, we must make sure to use the same tokenizer that was used for pre-training it. Thankfully, the information about which tokenizer a model needs is part of the model configuration. This means we can load it automatically, just like we did with the model itself. The Hugging Face Transformers library has another auto class for that, `AutoTokenizer`. Here is how we load the tokenizer for the Mistral model:

In [25]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenizing a text string is equally simple:

In [26]:
tokens = tokenizer("Peach is in a different castle")
tokens

{'input_ids': [1, 3242, 595, 349, 297, 264, 1581, 19007], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

The tokenizer returns a list of ids. The first id, with value `1`, signals the start of the sequence. Each remaining id maps to one token in the tokenizer's vocabulary, which holds a total of 32000 tokens. When a string is tokenized, it gets broken down into ids from that vocabulary.

We can decode the ids back to text:

In [27]:
for i in range(len(tokens["input_ids"])):
  print(tokenizer.decode(tokens["input_ids"][i]))

<s>
Pe
ach
is
in
a
different
castle


Where `<s>` is the textual encoding for the special "beginning of stream" token.

The `attention_mask` is used to tell the model which of the tokens it can ignore (`0`) and which tokens it should consider (`1`). For our purpose, we'll always have an attention mask full of `1`s when passing input to the LLM.
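
Zeros in the attention mask typically show up once several inputs of different lengths are combined into one batch: shorter inputs are padded to the length of the longest one, and the padding positions are masked out. The sketch below illustrates the idea in plain Python. The helper `pad_batch`, the padding id `0`, and padding on the right are assumptions for illustration. In practice the Hugging Face tokenizer can produce padded batches itself.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence on the right to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two token id sequences of different lengths
batch = pad_batch([[1, 3242, 595, 349], [1, 13367, 3494]])
print(batch)
```

The second sequence gets one padding id appended, and its mask ends in `0` so that the padding does not influence the prediction.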

The values returned by the `tokenizer` call above live CPU-side. However, our model lives GPU-side. We need to invoke the tokenizer with an additional parameter `return_tensors`, so it returns PyTorch tensors. We then transfer all those tensors to the GPU via `value.to("cuda")`.

In [28]:
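def tokenize_llm(text, max_length=4096):
  # Tokenize to PyTorch tensors, truncating inputs longer than max_length tokens
  inputs = tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True)
  # Move every tensor to the GPU, where the model lives
  inputs = {key: value.to("cuda") for key, value in inputs.items()}
  return inputs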

tokenize_llm("Peach is in a different castle")

{'input_ids': tensor([[    1,  3242,   595,   349,   297,   264,  1581, 19007]],
        device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

The input ids and attention mask have been converted to PyTorch tensors, which live on the GPU. We are ready to pass input to the LLM for it to predict the next word in a sequence.

## Logits and predicting the next word
Let's start by tokenizing an input we also used with our n-gram model:

In [29]:
input = tokenize_llm("Cassini")
input

{'input_ids': tensor([[    1, 13367,  3494]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1]], device='cuda:0')}

The `MistralForCausalLM` instance is actually a PyTorch `Module`, like our deep-ish neural network from the previous section. As such, we can make it predict output via:

In [30]:
llm.eval()
output = llm(**input)
print(type(output))

<class 'transformers.modeling_outputs.CausalLMOutputWithPast'>


The function returns an instance of [CausalLMOutputWithPast](https://huggingface.co/docs/transformers/v4.38.1/en/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast). We are only interested in the `logits`, so let's have a look at what they are.

In [31]:
output.logits.shape

torch.Size([1, 3, 32000])

It's a PyTorch tensor, with a very weird shape. Let's break it down.

The first dimension of size `1` is the batch size, which is equivalent to the number of inputs we submitted. We only submitted a single input, so its size is 1. In production, we'd submit multiple inputs at once, which is called batching. This way we use all (or most) of the available resources in parallel.

The next dimension (`3`) is the sequence length. We passed in the tokens for `Cassini`, which got tokenized to `3` tokens (including the beginning of stream token with id `1`).

The final dimension has size `32000`. That is equivalent to the number of tokens in the vocabulary of the tokenizer.

We thus get one vector with 32000 dimensions for each input token. Each element in this vector corresponds to one token in the vocabulary the tokenizer understands. The index of the element is equal to the id of the corresponding token in the vocabulary. And the value of a vector element gives us the (unnormalized) log-probability that this token is the next token in the sequence after the current token! This (unnormalized) log-probability for a token is also called the **logit score**.

This is very similar to the token list we fetch for a prefix in our n-gram model. The difference is that the n-gram model only produces such a list for the longest matching prefix, whereas the LLM produces one for every token position in the input.

We can visualize the top 5 predicted tokens for each position in the input sequence.

In [32]:
import torch
data = []
for i in range(output.logits.shape[1]):
    logits_for_position = output.logits[0, i, :]
    topk_values, topk_indices = torch.topk(logits_for_position, 5)
    topk_tokens_with_scores = [(tokenizer.decode([idx.item()], skip_special_tokens=True), score.item()) for idx, score in zip(topk_indices, topk_values)]
    for rank, (token, score) in enumerate(topk_tokens_with_scores, start=1):
        data.append({
            "Position": i,
            "Input Token": tokenizer.decode(input["input_ids"][0][i]),
            "Predicted Next Token": token,
            "Logit Score": score
        })
pd.DataFrame(data)

Unnamed: 0,Position,Input Token,Predicted Next Token,Logit Score
0,0,<s>,#,12.25
1,0,<s>,##,11.101562
2,0,<s>,###,10.703125
3,0,<s>,The,10.0625
4,0,<s>,User,10.039062
5,1,Cass,ie,11.3125
6,1,Cass,andra,11.164062
7,1,Cass,ini,10.84375
8,1,Cass,idy,10.445312
9,1,Cass,per,10.351562


A few interesting observations:

* For the beginning of stream token `<s>`, that is for an empty input sequence, the model really wants to output "#", "##", "###". These are clearly markdown headings, which were likely very abundant in the training data.
* For the next token `Cass`, the model thinks common names like `Cassie` or `Cassandra` would be appropriate.
* For the last token in our input sequence, `ini`, the model predicts `Sign` as the most likely next token, followed by `is`.
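
Logit scores become probabilities by applying the softmax function. As a sketch, the code below applies softmax to the five logit scores listed for `Cass` in the table above. Note the assumption: normalizing over only these five values overstates the probabilities, since the model's actual softmax runs over all 32000 vocabulary entries.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Top-5 logit scores for the input token "Cass", taken from the table above
top5 = {"ie": 11.3125, "andra": 11.164062, "ini": 10.84375,
        "idy": 10.445312, "per": 10.351562}
probs = softmax(list(top5.values()))
for token, p in zip(top5, probs):
    print(f"Cass -> {token}: {p:.3f}")
```

Softmax preserves the ranking of the logits, so the token with the highest logit also has the highest probability.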

For our final trick, we'll write a `complete_llm` function, that completes an input sequence up to a maximum number of generated tokens. We'll use the model's `generate` method for that, to make our life a little easier.

In [33]:
def complete_llm(llm, input, max_length = 40):
    input = tokenize_llm(input)
    output_ids = llm.generate(**input, max_length=max_length, num_return_sequences=1)
    generated_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    return generated_text

In [34]:
complete_llm(llm, "The Cassini probe")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'The Cassini probe has been orbiting Saturn for over a decade, and has been sending back some amazing images of the planet and its moons. But the probe is running out of fuel'

In [35]:
complete_llm(llm, "And then the")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'And then the world changed.\n\nI’m not talking about the pandemic. I’m talking about the murder of George Floyd.\n\nI’m not talking about the protests'

In [None]:
complete_llm(llm, "Aliens have")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'Aliens have been a part of the Star Wars universe since the very beginning. In fact, the first Star Wars movie was originally titled Star Wars: Episode IV – The Star Wars.\n\n'

We can also check logits on OpenAI's playground.

![](https://marioslab.io/uploads/genai/openai-logits.png)

## Running LLMs locally
If you have a PC with a recent NVIDIA GPU and enough VRAM (Minimum 8GB) or a MacBook with an M1 or above (Minimum 16GB), you can try to run these models on your machine as well.

There are different ways to run the models. They typically fetch the models from Hugging Face or from their own model registry.

* [LM Studio](https://lmstudio.ai/), a comprehensive GUI application that makes downloading and chatting with open-weight LLMs simple. Also supports serving models via an OpenAI compatible REST API.
* [Ollama](https://ollama.com/), more or less a command line tool to quickly pull models from the Ollama model library and run them locally. Also allows serving those models via a REST API that is compatible with the OpenAI API.
* [llama.cpp](https://github.com/ggerganov/llama.cpp), a C++ library to load and run many open-weight models. Comes with its own file format (GGUF), and includes CLI tools for all kinds of tasks. Models can also be served via an OpenAI compatible REST API through [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/server/).

llama.cpp is the underpinning of many local open-weight model runners, including LM Studio and Ollama. It recently got support for [**LoRA fine-tuning**](https://rentry.org/cpu-lora) though only on the CPU, which is ... not great.

If you use Ollama, you can get a local OpenAI compatible server running as follows:

```
ollama pull mixtral:latest
ollama serve
```

You can then use the OpenAI Python module to talk to your local model or an OpenAI model, and thereby compare and contrast the performance of both:

```
from openai import OpenAI

use_local_model = True
if use_local_model:
  client = OpenAI(
    base_url = 'http://localhost:11434/v1',
    api_key='ollama', # required, but unused
  )
  model_name="mixtral:latest"
else:
  client = OpenAI(api_key="<your OpenAI API key>")
  model_name="gpt-3.5-turbo"

# The rest of your code stays the same!
completion = client.chat.completions.create(
  model=model_name,
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)
```

You can use this trick for any OpenAI API based code in the subsequent sections!
