---
title: Language Model
math:
    '\abs': '\left\lvert #1 \right\rvert'
    '\norm': '\left\lVert #1 \right\rVert'
    '\Set': '\left\{ #1 \right\}'
    '\set': '\operatorname{set}'   
    '\mc': '\mathcal{#1}'
    '\M': '\boldsymbol{#1}'
    '\R': '\mathsf{#1}'
    '\hR': '\R{\hat{#1}}'
    '\RM': '\mathbf{\mathsf{#1}}'
    '\op': '\operatorname{#1}'
    '\E': '\op{E}'
    '\d': '\mathrm{\mathstrut d}'
    '\SFM': '\operatorname{SFM}'
    '\utag': '\stackrel{\text{(#1)}}{#2}'
    '\uref': '\text{(#1)}'
    '\minimal': '\operatorname{minimal}'
---

::::{attention}
This notebook is optional and NOT required for any course assessment activities. The lab tutor may go through it if time permits.
::::

In [None]:
from __init__ import show
import os
from IPython.display import JSON
import transformers as tfm
import torch

In [None]:
if not input('Load JupyterAI? [Y/n]').lower()=='n':
    %reload_ext jupyter_ai

## Problem Formulation

What is a language model?

From [Wikipedia](https://en.wikipedia.org/wiki/Language_model):

> A language model is a model of the human brain's ability to produce natural language.

To put it simply, a causal language model completes an input prompt such as

> A language model is ...

into a realistic text like the one from Wikipedia. More formally:

::::{prf:definition} language model
:label: def:LM

A language model is a generative (artificial neural) network trained on a *dataset* of *samples/examples* of *random text/source* $\R{s}$ to generate realistic text $\hR{s}$ when given a prompt $\R{u}$ for $\R{s}$. The goal is to make the conditional pmf $p_{\hR{s}|\R{u}}$ as close as possible to $p_{\R{s}|\R{u}}$, but without knowing the joint distribution of $\R{u}$ and $\R{s}$. The statistical "closeness" can be measured by a [divergence](https://en.wikipedia.org/wiki/Divergence_(statistics)) such as the Kullback-Leibler/information [divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)

$$
D(\R{s}\|\hR{s}|\R{u}) = \underbrace{\E\left[ \log \frac1{p_{\hR{s}|\R{u}}(\R{s}|\R{u})}\right]}_{\text{cross entropy $H(\R{s}\|\hR{s}|\R{u})$}} - \underbrace{\E\left[\log \frac{1}{p_{\R{s}|\R{u}}(\R{s}|\R{u})} \right]}_{\text{entropy $H(\R{s}|\R{u})$}}.
$$ (eq:D)

::::

::::{prf:remark} Training objective

We denote random variables in sans-serif font. For instance, the entropy $H(\R{s}|\R{u})$ above is an expectation, denoted by $\E[\cdot]$, of the log reciprocal of the probability mass $p_{\R{s}|\R{u}}(\R{s}|\R{u})$, which is random because the arguments $\R{s}$ and $\R{u}$ are random.

The divergence in [](#eq:D) is an important statistical distance in Information Theory and Machine Learning. Since the entropy $H(\R{s}|\R{u})$ does not depend on the generative network, i.e., $p_{\hR{s}|\R{u}}$, minimizing the divergence is equivalent to minimizing the cross entropy $H(\R{s}\|\hR{s}|\R{u})$, which is often used as an objective function in training a model.

::::
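As a minimal numerical sketch of [](#eq:D), the following code computes the divergence as cross entropy minus entropy for a single fixed prompt. The two pmfs `p_true` and `p_model` are hypothetical values chosen for illustration, and the code assumes the model assigns positive probability to every outcome in the support of the true pmf.

```python
import math


def cross_entropy(p, q):
    """Return E_p[log 1/q] in nats for pmfs given as dictionaries."""
    return sum(p[x] * math.log(1 / q[x]) for x in p if p[x] > 0)


def entropy(p):
    """Entropy is the cross entropy of a pmf with itself."""
    return cross_entropy(p, p)


def divergence(p, q):
    """Divergence as cross entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)


# Hypothetical next-token pmfs for one fixed prompt (illustrative values only)
p_true = {"a": 0.5, "the": 0.3, "one": 0.2}
p_model = {"a": 0.4, "the": 0.4, "one": 0.2}

print(divergence(p_true, p_model))  # positive
print(divergence(p_model, p_true))  # positive, but a different value
print(divergence(p_true, p_true))   # zero up to rounding
```

Since `entropy(p_true)` does not involve `p_model`, changing `p_model` changes the divergence only through the cross entropy term.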

The expression in [](#eq:D) is called a divergence because it satisfies the following property:[^divergence]

[^divergence]: The divergence is not a (pseudo-)metric as it is not symmetric, i.e., $D(\R{s}\|\hR{s}|\R{u})\not\equiv D(\hR{s}\|\R{s}|\R{u})$.

::::{prf:proposition} non-negativity of divergence
:label: pro:D:non-negative

The divergence in [](#eq:D) is non-negative, i.e.,

$$
\begin{align}
D(\R{s}\|\hR{s}|\R{u}) &\geq 0, && \text{or equivalently}\\
H(\R{s}\|\hR{s}|\R{u}) &\geq H(\R{s}|\R{u}).
\end{align}
$$ (eq:D:non-negative)

Equality holds if and only if the conditional distributions are identical almost surely, i.e., the random event[^random]

$$
p_{\hR{s}|\R{u}}(\R{s}|\R{u}) = p_{\R{s}|\R{u}}(\R{s}|\R{u})
$$ (eq:D:0)

occurs with probability $1$.

::::

[^random]: [](#eq:D:0) is a random event because $\R{u}$ and $\R{s}$ are random.

::::{prf:proof}
:nonumber:

To prove the non-negativity [](#eq:D:non-negative) of the divergence, we start from equation [](#eq:D):

\begin{align}
\E\left[ \log p_{\R{s}|\R{u}}(\R{s}|\R{u}) - \log p_{\hR{s}|\R{u}}(\R{s}|\R{u}) \right]
&= \E\left[\log \frac{p_{\R{s}|\R{u}}(\R{s}|\R{u})}{p_{\hR{s}|\R{u}}(\R{s}|\R{u})} \right]\\
&\utag{a}= \E\left[\frac{p_{\R{s}|\R{u}}(\hR{s}|\R{u})}{p_{\hR{s}|\R{u}}(\hR{s}|\R{u})} \log \frac{p_{\R{s}|\R{u}}(\hR{s}|\R{u})}{p_{\hR{s}|\R{u}}(\hR{s}|\R{u})} \right]\\
&\utag{b}\geq \E\left[\frac{p_{\R{s}|\R{u}}(\hR{s}|\R{u})}{p_{\hR{s}|\R{u}}(\hR{s}|\R{u})}\right] \log \E\left[\frac{p_{\R{s}|\R{u}}(\hR{s}|\R{u})}{p_{\hR{s}|\R{u}}(\hR{s}|\R{u})} \right]\\
&\utag{c}= \E\left[\sum_{x} p_{\R{s}|\R{u}}(x|\R{u}) \right] \log \E\left[\sum_{x} p_{\R{s}|\R{u}}(x|\R{u}) \right]\\
&\utag{d}= 1\cdot \log 1 = 0,
\end{align}

- $\uref{a}$ and $\uref{c}$ follow from the definition of expectation. To show $\uref{a}$, note that the R.H.S. (with an appropriate choice of $f$) is
  
  $$
  \begin{align}
  \E\left[\frac{p_{\R{s}|\R{u}}(\hR{s}|\R{u})}{p_{\hR{s}|\R{u}}(\hR{s}|\R{u})} f(\R{u},\hR{s})\right]
  &= \E\left[\sum_{x} \sout{p_{\hR{s}|\R{u}}(x|\R{u})} \frac{p_{\R{s}|\R{u}}(x|\R{u})}{\sout{p_{\hR{s}|\R{u}}(x|\R{u})}} f(\R{u},x)\right]\\
  &= \E\left[\sum_{x} p_{\R{s}|\R{u}}(x|\R{u}) f(\R{u},x)\right]\\
  &= \E\left[f(\R{u},\R{s})\right],
  \end{align}
  $$
  which gives the L.H.S. of $\uref{a}$. $\uref{c}$ can be shown similarly with $f(u,x)=1$.

- $\uref{b}$ results from applying [Jensen's inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality) (see [](#lem:jensen) below) to the convex function $r \mapsto r \log r$.

- Since $r \mapsto r \log r$ is strictly convex, the inequality holds with equality if and only if $\frac{p_{\R{s}|\R{u}}(\hR{s}|\R{u})}{p_{\hR{s}|\R{u}}(\hR{s}|\R{u})} = C$ for some constant $C$ almost surely. However, $C$ must be $1$, which implies [](#eq:D:0), as probability mass must sum to $1$ over all possible outcomes, i.e.,
  $$
  \sum_x p_{\R{s}|\R{u}}(x|\R{u}) = 1 = \sum_x p_{\hR{s}|\R{u}}(x|\R{u}),
  $$ 
  which also justifies $\uref{d}$. (Q.E.D.)

::::

The above proof relies on the strict convexity of $f(r):= r\log(r)$, where convexity is defined as follows:

::::{prf:definition} convexity

A function $f:\mc{R} \to \mathbb{R}$ is convex if for all $r_1,r_2\in \mc{R}$,

$$
\begin{align}
\lambda f(r_1) + (1-\lambda) f(r_2) \geq f(\lambda r_1 + (1-\lambda) r_2) && \forall \lambda \in [0,1].
\end{align}
$$ (eq:convexity)

$f$ is strictly convex if the above inequality is strict whenever $r_1\neq r_2$ and $\lambda\in (0,1)$.

::::

::::{prf:lemma} Jensen's inequality
:label: lem:jensen

For any random variable $\R{r}$ and convex function $f$ satisfying [](#eq:convexity), we have

$$
\E[f(\R{r})] \geq f(\E[\R{r}]).
$$ (eq:jensen)

If $f$ is strictly convex, equality holds if and only if $\R{r}$ is deterministic, i.e., equal to a constant almost surely.

::::
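A numerical check is no substitute for a proof, but it can make [](#eq:jensen) concrete. The sketch below evaluates the gap $\E[f(\R{r})] - f(\E[\R{r}])$ for $f(r) = r\log r$ on randomly drawn discrete distributions. The helper names, the value range, and the number of trials are arbitrary choices made for illustration.

```python
import math
import random


def f(r):
    """The convex function r -> r log r, with f(0) = 0 by continuity."""
    return r * math.log(r) if r > 0 else 0.0


def jensen_gap(values, probs):
    """Return E[f(r)] - f(E[r]) for a discrete random variable r."""
    mean = sum(p * r for r, p in zip(values, probs))
    mean_f = sum(p * f(r) for r, p in zip(values, probs))
    return mean_f - f(mean)


rng = random.Random(0)
gaps = []
for _ in range(1000):
    # Random support of size 4 with a random pmf on it
    values = [rng.uniform(0.0, 5.0) for _ in range(4)]
    weights = [rng.random() for _ in range(4)]
    total = sum(weights)
    probs = [w / total for w in weights]
    gaps.append(jensen_gap(values, probs))

min_gap = min(gaps)
# A deterministic random variable: both outcomes carry the same value
constant_gap = jensen_gap([2.0, 2.0], [0.3, 0.7])
print(min_gap, constant_gap)
```

The smallest gap over all trials should be non-negative up to rounding error, while the gap for the deterministic case should vanish up to rounding error.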

::::{exercise}
:label: ex:jensen

Prove Jensen's inequality in [](#eq:jensen) for a discrete random variable $\R{r}$ taking values from a finite set $\mc{R}$.[^jensen]

:::{hint}
:class: dropdown

Consider a [proof by induction](https://en.wikipedia.org/wiki/Mathematical_induction) on the size $n$ of the support set 

$$
\operatorname{supp}(\R{r}) := \Set{r\in \mc{R} | p_{\R{r}}(r)>0} = \Set{r_i | i\in [n]:=\Set{0,\dots,n-1}}.
$$

The base case with $n = 1$ is trivial. To show the inductive step, note that [](#eq:convexity) can be obtained from [](#eq:jensen) by setting 

$$
\begin{align}
p_{\R{r}}(r_0) &= \lambda\\
p_{\R{r}}(r_1) &= 1-\lambda.
\end{align}
$$

:::

::::

[^jensen]: We have restricted Jensen's inequality to discrete random variables here for simplicity, even though the inequality holds more generally.

YOUR ANSWER HERE

## Tokenization

Just like we compose a text using words from a vocabulary, a language model also generates a text from a vocabulary consisting of meaningful units called tokens.

::::{prf:definition} tokenizer

A tokenizer encodes an input text $x$ into a sequence

$$
y=(x_0, x_1, \dots, x_{n-1}) := f(x) \in \mc{X}^n,
$$ (eq:tokens)

and decodes the sequence back to $x = f^{-1}(y)$. Each $x_i$, called a token, takes its value from the same vocabulary $\mc{X}$.

::::
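To make the pair $f$ and $f^{-1}$ concrete, here is a toy tokenizer in plain Python. The vocabulary `toy_vocab` is hypothetical, and the greedy longest-match rule is only one simple way to encode. It is an assumption made for illustration, not the algorithm used by the pretrained tokenizers loaded in this notebook.

```python
# Hypothetical toy vocabulary: token -> token ID (real vocabularies are far larger).
toy_vocab = {"A": 0, " ": 1, "language": 2, "model": 3, "is": 4,
             "probabil": 5, "istic": 6, ".": 7}
toy_tokens = {i: t for t, i in toy_vocab.items()}  # token ID -> token


def toy_encode(text):
    """Encode text by repeatedly taking the longest token that prefixes it."""
    ids = []
    while text:
        match = max((t for t in toy_vocab if text.startswith(t)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches {text!r}")
        ids.append(toy_vocab[match])
        text = text[len(match):]
    return ids


def toy_decode(ids):
    """Decode by concatenating the tokens, which inverts toy_encode."""
    return "".join(toy_tokens[i] for i in ids)


sample_text = "A language model is probabilistic."
sample_ids = toy_encode(sample_text)
print(sample_ids)
print(toy_decode(sample_ids) == sample_text)
```

Note that the word `probabilistic` is not in the toy vocabulary, so it is split into the two subword tokens `probabil` and `istic`.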

The following code creates a tokenizer from the configuration files under `model_path` using [`AutoTokenizer.from_pretrained`](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained):

In [None]:
# Load the tokenizer
model_path = "/models/hf/Phi-3.5-mini-instruct/"
tokenizer = tfm.AutoTokenizer.from_pretrained(model_path)
show(tokenizer)

The configuration of the tokenizer is specified in [JSON format](https://en.wikipedia.org/wiki/JSON), which is a collection of key/value pairs where the keys are names given as strings:

In [None]:
JSON(filename=os.path.join(model_path, "tokenizer_config.json"))

In [None]:
JSON(filename=os.path.join(model_path, "special_tokens_map.json"))

In [None]:
JSON(filename=os.path.join(model_path, "tokenizer.json"))

::::{note}

`AutoTokenizer` automatically selects a more specific tokenizer type such as `LlamaTokenizerFast` for the specified model.

::::

To encode and decode a text using the tokenizer:

In [None]:
text = "A language model is a probabilistic model of a natural language."
ids = tokenizer.encode(text)
decoded_text = tokenizer.decode(ids)
assert text == decoded_text
ids

In [None]:
show(tokenizer.encode)

In [None]:
show(tokenizer.decode)

For efficient implementation of `encode` and `decode`, tokens are represented by integers known as token IDs. The mapping from tokens to IDs is provided by the dictionary below:

In [None]:
show(tokenizer.vocab)

To obtain the tokens from IDs, we can use the method `convert_ids_to_tokens`:

In [None]:
tokens = tokenizer.convert_ids_to_tokens(ids)
tokens

::::{exercise}
:label: reverse_dict

Write a function `reverse_dict(d)` to create a new dictionary where the keys are the values of the input dictionary `d`, and the values are the original keys. If multiple keys share the same value, only the last key should be kept. Assume the values of `d` are hashable.

::::

In [None]:
def reverse_dict(d):
    # YOUR CODE HERE
    raise NotImplementedError

In [None]:
# tests
reversed_vocab = reverse_dict(tokenizer.vocab)
assert tokens == [reversed_vocab[i] for i in ids]

Note that a token need not be an English word. For instance:

In [None]:
tokens[5], tokens[6], tokens[-1]

The tokens can be punctuation marks such as `.` and even subwords that are meaningful by themselves such as `▁probabil` and `istic`.

::::{exercise}
:label: ex:meta

Why do some tokens such as `▁probabil` have a meta symbol `▁` (which is not the same as an underscore `_`) while others such as `istic` do not?

:::{hint}
:class: dropdown

See the [tokenization process](https://github.com/google/sentencepiece#whitespace-is-treated-as-a-basic-symbol).

:::

::::

YOUR ANSWER HERE

## Generation

A language model generates a text one token at a time, just like we speak a text word by word. The model is probabilistic in the sense that each token is generated randomly according to some distribution. The sequence of randomly generated tokens is called a [*stochastic/random process*](https://en.wikipedia.org/wiki/Stochastic_process). If each token is generated based on some previously generated tokens, the process is said to be *auto-regressive*.

::::{prf:definition} auto-regressive generation

The generated text $\hR{s}$ of a language model in [](#def:LM) is the decoding $\hR{s} = f^{-1}(\R{x})$ of the sequence of tokens $\R{x}$ in [](#eq:tokens) where:

- For some integer $n>0$ called the context length, each new token $\R{x}_{n+t}$ for $t\in \mathbb{N}$ is sampled based only on the realization of the last $n$ tokens $\R{x}_{t:n+t}$, called the *context*, i.e.,

  $$
  p_{\R{x}_{n+t}|\R{x}_{:n+t}}(x_{n+t}|x_{:n+t}) = p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x_{t:n+t})
  $$ (eq:auto-regressive)
  
  for all $x_{:n+t+1}\in \mc{X}^{n+t+1}$.
- The initial sequence of tokens is the sequence $\R{x}_{:n} = f(\R{u})$ of tokens for the input prompt $\R{u}$.
  
::::
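The sampling rule in [](#eq:auto-regressive) can be sketched with a toy process in plain Python. The binary vocabulary, the context length `N = 2`, and the conditional pmf in `next_token_pmf` are all hypothetical choices for illustration and have nothing to do with the pretrained model used in this notebook.

```python
import random

VOCAB = [0, 1]  # hypothetical binary vocabulary
N = 2           # hypothetical context length n


def next_token_pmf(context):
    """Hypothetical conditional pmf of the next token given the last N tokens."""
    p_one = 0.9 if sum(context) % 2 == 1 else 0.2
    return [1 - p_one, p_one]


def generate(prompt, num_new_tokens, rng):
    """Extend a prompt of N tokens by sampling one new token at a time."""
    x = list(prompt)
    assert len(x) == N
    for _ in range(num_new_tokens):
        context = x[-N:]  # only the last N tokens enter the sampling rule
        pmf = next_token_pmf(context)
        x.append(rng.choices(VOCAB, weights=pmf)[0])
    return x


rng = random.Random(0)
generated = generate([0, 1], 10, rng)
print(generated)
```

Each call to `rng.choices` draws a new token whose distribution is a function of the current context alone, which is exactly what the right-hand side of [](#eq:auto-regressive) expresses.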

A common misconception is that a token cannot depend on other tokens outside its context.

::::{exercise}
:label: ex:auto-regressive

Give an example showing that a process satisfying [](#eq:auto-regressive) can have $\R{x}_{n+t}$ depend on $\R{x}_{:t}$.

:::{hint}
:class: dropdown

See the [data processing inequality](https://en.wikipedia.org/wiki/Data_processing_inequality) and [conditional independence](https://math.stackexchange.com/questions/22407/independence-and-conditional-independence-between-random-variables).

:::

::::

YOUR ANSWER HERE

If the tokenized prompt is shorter (longer) than the context length $n$, it can be left-padded (left-truncated) by the tokenizer:

In [None]:
text = "A language model is a probabilistic model of a natural language."
encoding = tokenizer(text, padding='max_length', truncation=True)
len(encoding.keys()), len(encoding.input_ids), len(encoding.attention_mask)

In [None]:
show(tokenizer.__call__)

The above call to `tokenizer` returns a dictionary consisting of two lists, both with the same length as the context length:

In [None]:
tokenizer.model_max_length

`input_ids` points to the list of token IDs:

In [None]:
show(encoding.input_ids)

Note that `input_ids` is left-padded by the padding token ID:

In [None]:
tokenizer.pad_token_id, tokenizer.pad_token

Intuitively, the padding tokens should not be used to generate new tokens. To avoid unnecessary computations, the attention mask explicitly gives $0$ attention/importance/weight to those special tokens:

In [None]:
show(encoding.attention_mask)

There are also other special tokens that should normally be masked off:

In [None]:
tokenizer.special_tokens_map_extended

To load a language model:

In [None]:
bnb_config = tfm.BitsAndBytesConfig(load_in_8bit=True)
model = tfm.AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
# Use GPU if available
if torch.cuda.is_available() and model.device.type != "cuda":
    model = model.to("cuda")
print(f"Model loaded on device: {model.device}")
print(model)

A language model is a type of neural network consisting of layers of computational units called neurons. To generate text quickly, the above code attempts to utilize a Graphics Processing Unit (GPU) whenever available. It further quantizes the model to a lower precision, specifically 8-bit instead of the original 16-bit, to reduce the memory footprint.
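As a back-of-the-envelope sketch of the saving, the memory needed to store the weights is roughly the number of parameters times the number of bytes per parameter. The parameter count below (about 3.8 billion) is an assumed approximate figure for illustration. The estimate ignores activations, caches, and any quantization overhead.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory in GiB needed to store the weights alone."""
    return num_params * bits_per_param / 8 / 2**30


num_params = 3.8e9  # assumed approximate parameter count (illustrative)
mem_16bit = weight_memory_gib(num_params, 16)
mem_8bit = weight_memory_gib(num_params, 8)
print(f"16-bit: {mem_16bit:.2f} GiB, 8-bit: {mem_8bit:.2f} GiB")
```

Under these assumptions, going from 16-bit to 8-bit halves the estimate, from roughly 7 GiB to roughly 3.5 GiB.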

Finally, to generate the text, run the following cell:

::::{tip}

If it takes too long to generate, reduce the `max_length` parameter to `50` or smaller.

::::

In [None]:
# Tokenize input text and generate output
u = "A language model is"
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
    encoding = encoding.to("cuda")

# Generate response
with torch.no_grad():
    shat_ids = model.generate(**encoding, max_length=100)

# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)

Note that repeatedly running the above code will generate the same text. This is because, instead of sampling from the distribution in [](#eq:auto-regressive), it makes a hard decision:

::::{prf:definition} hardening
:label: def:hard-decision

A sequence $x^*$ is called a hard decision of $\R{x}$ in [](#eq:auto-regressive) if each new token is a most probable one given its context, i.e.,

$$
x^*_{n+t} \in \arg\max_{x_{n+t}\in \mc{X}} p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x^*_{t:n+t}),
$$

where [$\arg\max_{x_{n+t}}$](https://en.wikipedia.org/wiki/Arg_max) denotes the set of optimal solutions $x_{n+t}\in \mc{X}$ that maximize the conditional pmf.

::::
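The difference between a hard decision and sampling can be seen on a single next-token distribution. The pmf below is hypothetical, and the helper names are chosen for illustration only.

```python
import random

# Hypothetical conditional pmf of the next token for some fixed context
pmf = {"a": 0.5, "the": 0.3, "one": 0.2}


def hard_decision(pmf):
    """Return a most probable token (the first one found in case of ties)."""
    return max(pmf, key=pmf.get)


def sample(pmf, rng):
    """Draw a token randomly according to the pmf."""
    tokens = list(pmf)
    return rng.choices(tokens, weights=[pmf[t] for t in tokens])[0]


rng = random.Random(0)
greedy = [hard_decision(pmf) for _ in range(5)]
sampled = [sample(pmf, rng) for _ in range(1000)]
print(greedy)
print({t: sampled.count(t) for t in pmf})
```

The hard decision returns the same token every time, whereas the sampled tokens vary, with empirical frequencies that should be close to the pmf.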

To perform the sampling, we can pass the keyword argument `do_sample=True` to `model.generate` as follows:

In [None]:
# Generate response
with torch.no_grad():
    shat_ids = model.generate(**encoding, max_length=100, do_sample=True)

# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)

Verify that the code generates the tokens randomly by running it repeatedly.

The generated text might have been cut off in the middle of a sentence. Although you can increase `max_length` to a sufficiently large value to ensure that generation terminates with an end-of-sequence token (`eos_token`), this can result in excessively long outputs. Fortunately, there are [other stopping criteria](https://huggingface.co/docs/transformers/v4.46.3/en/internal/generation_utils#transformers.StoppingCriteria) implemented that can help control the length and content of the generated text more effectively.

::::{exercise}
:label: ex:stopping_criteria

Modify the call to `model.generate` to stop at a line break `"\n"`.

:::{hint}
:class: dropdown

Use `StoppingCriteriaList` and `StopStringCriteria` from `transformers`.

:::

::::

In [None]:
# Assign the desired stopping criteria to `stopping_criteria`.
# YOUR CODE HERE
raise NotImplementedError
stopping_criteria

# Generate response
with torch.no_grad():
    shat_ids = model.generate(**encoding, 
                              max_length=2000, # Make this big as the default is 20
                              stopping_criteria=stopping_criteria
                             )

# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat.strip())

In [None]:
# test
assert "A language model is" in shat
assert "\n" not in shat.strip()

## Chat Completion

A language model can also be trained to complete a chat, as popularized by [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT). Following the [Chat Completion API](https://huggingface.co/docs/api-inference/en/tasks/chat-completion#api-specification), a chat can be represented as a list of chat messages:

In [None]:
chat = [
    {"role": "system", "content": "You are an AI engineer who knows language models so well that you can explain the theory to a first-year undergraduate without any background."},
    {"role": "user", "content": "What is a language model?"}
]

Each message is associated with a role:

- The `system` message sets the behavior for the AI assistant.
- The `user` message represents the user's query.

The tokenizer can be used to convert the list of messages into a single text for the language model to complete in the same way as before:

In [None]:
# Apply the chat template
formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print("Formatted chat:\n", formatted_chat)

Note that `<|system|>`, `<|user|>`, `<|assistant|>`, and `<|end|>` are special tokens used to mark the different chat messages. The chat template can be printed as follows:

In [None]:
chat_template = tokenizer.get_chat_template()
print("Chat template:\n", chat_template)

This is a [Jinja](https://en.wikipedia.org/wiki/Jinja_(template_engine)) template, which uses Python-like syntax such as iterations and conditionals to render the text from an input list `messages` of dictionaries.
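The rendering logic of such a template can be approximated in plain Python. The sketch below assumes that each message is rendered as a role marker, a line break, the content, and an `<|end|>` marker. This format is an assumption inferred from the special tokens mentioned above, so the template printed by the tokenizer remains the authoritative reference. The function name `render_chat` and the list `toy_chat` are hypothetical.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render chat messages into one prompt string (assumed format)."""
    text = ""
    for message in messages:
        # Role marker, line break, content, then the end-of-message marker
        text += f"<|{message['role']}|>\n{message['content']}<|end|>\n"
    if add_generation_prompt:
        # Invite the model to continue with the assistant's reply
        text += "<|assistant|>\n"
    return text


toy_chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a language model?"},
]
rendered = render_chat(toy_chat)
print(rendered)
```

The loop plays the role of the iteration in the template, and the final `if` plays the role of the conditional controlled by `add_generation_prompt`.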

We can now call the language model to complete the text as before:

In [None]:
# Tokenize input text and generate output
u = formatted_chat
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
    encoding = encoding.to("cuda")

# Generate response
with torch.no_grad():
    shat_ids = model.generate(**encoding, max_length=200)

# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)

::::{exercise}
:label: ex:decode_chat_messages

Complete the following function that

- takes an input list of token IDs, obtained from a text in the chat template above, and
- returns a list of chat messages as dictionaries according to the Chat Completion API.

::::

In [None]:
def decode_chat_messages(ids):
    roles = {32006: "system", 32010: "user", 32001: "assistant"}
    output = []
    # YOUR CODE HERE
    raise NotImplementedError
    return output

In [None]:
# tests
generated_text = """
<|system|> You are an AI engineer who knows language models so well that you can explain the theory to a first-year undergraduate without any background.<|end|>
<|user|> What is a language model?<|end|>
<|assistant|> A language model is a type of artificial intelligence (AI) system that is designed to understand, interpret, and generate human language. It is a mathematical representation of how words and phrases are likely to occur in a given language. Language models are used in various applications, such as speech recognition, machine translation, text generation, and natural language processing (NLP).
"""

assert decode_chat_messages(tokenizer.encode(generated_text)) == [
    {
        "role": "system",
        "content": "You are an AI engineer who knows language models so well that you can explain the theory to a first-year undergraduate without any background.",
    },
    {"role": "user", "content": "What is a language model?"},
    {
        "role": "assistant",
        "content": "A language model is a type of artificial intelligence (AI) system that is designed to understand, interpret, and generate human language. It is a mathematical representation of how words and phrases are likely to occur in a given language. Language models are used in various applications, such as speech recognition, machine translation, text generation, and natural language processing (NLP).\n",
    },
]