```python
import warnings
warnings.filterwarnings('ignore')
from tabulate import tabulate
from transformers import AutoTokenizer, AutoModel, pipeline, set_seed
from ipywidgets import interact, widgets
from IPython.display import display
from IPython.utils import io
```

# Transformer LMs

<br>
<br>

*"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."*  ~ Shazeer (2020)


## Goals of this Presentation

- Provide an introduction to transformer language models (LMs)
- Understand basic architecture and its variations
- Understand sources of failure
- Highlight details of interest for philosophers and cognitive scientists

# Generative AI

- Models that generate content (text, images, video, etc.) that can be consumed directly.
- In the case of text: typically sequence-to-sequence (seq2seq)
- Models that can be used for generation:
    + Recurrent Neural Networks
    + Transformers <span style="color:red">&larr;*today*</span>
    + Diffusion models (images)
    + ...

## Language Models

- A language model (LM) provides a probability distribution over tokens given a context
- $p(\text{word} \mid \text{context}) \approx \Phi(\text{context})$
- Old approach: Counting occurrences
- Neural language models: Around since the early 2000s
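
To make the counting approach concrete, here is a minimal sketch of a bigram model, assuming a one-token context and a toy corpus invented for illustration. The function names `train_bigram_lm` and `next_token_distribution` are hypothetical and not part of the original notebook.

```python
from collections import Counter, defaultdict

def train_bigram_lm(tokens):
    """Count how often each token follows each one-token context."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_distribution(counts, context):
    """Turn the raw counts for one context into relative frequencies."""
    following = counts[context]
    total = sum(following.values())
    return {token: n / total for token, n in following.items()}

# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat the cat ate".split()
lm = train_bigram_lm(corpus)

# Estimate p(word | context="the") from the counts.
dist = next_token_distribution(lm, "the")
print(dist)
```

A neural LM plays the same role as `next_token_distribution`, but computes the distribution with a learned function of the context instead of a lookup table, so it can also handle contexts it has never seen.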

# Transformer Models
- Introduced by Vaswani et al. (2017)
- A type of neural architecture based on the attention mechanism
- Efficient for processing and producing sequences

![original transformer architecture](img/transformer.svg "original transformer")

(image taken from Vaswani et al. (2017) and modified)

![simplified transformer architecture](img/transformer_basic.png "simplified transformer")

![transformer block expanded](img/transformer_block_expand.png "expanded transformer block")

## Attention Mechanism

- A way to contextualise representations
- Representations are combined based on their similarity (dot-product)
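
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, assuming a single attention head and omitting the learned query, key, and value projections; the toy vectors are made up. The scaling by the square root of the key dimension follows Vaswani et al. (2017).

```python
import math

def softmax(scores):
    """Normalise scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, without projections."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # New representation: similarity-weighted average of the values.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-d token representations (made up), used as Q, K and V alike.
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
out, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each output row is a mixture of all input rows, with the most similar tokens contributing the most. That mixing is the sense in which a representation becomes contextualised.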


---
- "Attention Is All You Need" (Vaswani et al. 2017)
- "Attention Is Not All You Need" (Dong et al. 2021)
- "Attention Is Turing-Complete" (Pérez et al. 2021) &larr; under certain assumptions
- https://www.isattentionallyouneed.com/

# Generation

- How to get from state to vocabulary
- Different decoding strategies
- Sampling from the probability distribution:
    - Temperature: rescales the logits before the softmax; values below 1 sharpen the distribution, values above 1 flatten it
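
The effect of temperature can be sketched as follows. This is a minimal illustration, assuming the model has already produced one logit per vocabulary item; the three-word vocabulary and the logit values are invented, and the helper names are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply the softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Toy vocabulary and logits, invented for illustration.
vocab = ["cat", "dog", "mat"]
logits = [2.0, 1.0, 0.1]

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter

# Greedy decoding always picks the most probable token instead of sampling.
greedy = vocab[logits.index(max(logits))]

rng = random.Random(0)  # seeded so the draw is reproducible
sampled = sample_token(vocab, logits, temperature=1.0, rng=rng)
print(sharp, flat, greedy, sampled)
```

As the temperature approaches zero, sampling behaves more and more like greedy decoding; at high temperatures the draw approaches a uniform choice over the vocabulary.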

# Semantic Space

- Embeddings/activations live in a Cartesian space
- Geometric interpretation of embeddings
- Space exhibits interesting regularities
  - Even when reduced to 2D
  - Why would that be?
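
One regularity often reported for word embeddings is that vector offsets line up with relations between words. The sketch below only illustrates the geometric operations involved (cosine similarity and vector offsets); the 2-d vectors are hand-made for the example and are not taken from any model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, embeddings, exclude=()):
    """Return the word whose vector points most nearly in the query direction."""
    candidates = {w: vec for w, vec in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(query, candidates[w]))

# Hand-made 2-d vectors, NOT real model embeddings.
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [-0.5, 0.5],
}

# Vector offset: king - man + woman.
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
result = nearest(query, toy, exclude=("king", "man", "woman"))
print(result)
```

With real embeddings the same two functions apply unchanged, only with vectors of several hundred dimensions, and the match is approximate where this toy case is exact.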

# Training

- Large amounts of data
- In batches: predict the target token (masked or next), then update the weights via backpropagation
- Models might be undertrained

## Backpropagation
1. Pass a batch of data through the network (forward pass).
2. Calculate the loss: the divergence between output and target.
3. Use the gradient of the loss to adjust the weights.
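
The three steps can be sketched on a deliberately tiny model. This is an illustration under stated assumptions: a single linear layer followed by a softmax, cross-entropy as the loss, plain gradient descent as the update rule, and a made-up two-example batch. With only one layer the gradient can be written by hand; in a deep network the same chain rule is applied layer by layer, usually by an automatic differentiation library.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(weights, x):
    """One linear layer followed by a softmax."""
    logits = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
    return softmax(logits)

def cross_entropy(probs, target):
    """Loss for one example: negative log-probability of the target class."""
    return -math.log(probs[target])

def train_step(weights, batch, lr=0.5):
    n_out, n_in = len(weights), len(weights[0])
    grad = [[0.0] * n_in for _ in range(n_out)]
    total_loss = 0.0
    for x, target in batch:
        probs = forward(weights, x)                 # 1. forward pass
        total_loss += cross_entropy(probs, target)  # 2. loss
        for i in range(n_out):
            # Gradient of the loss w.r.t. logit i: probs[i] - 1 if i is the target.
            delta = probs[i] - (1.0 if i == target else 0.0)
            for j in range(n_in):
                grad[i][j] += delta * x[j] / len(batch)
    # 3. Move the weights a small step against the gradient.
    new_weights = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grad)
    ]
    return new_weights, total_loss / len(batch)

# Made-up batch of (input vector, target class) pairs.
batch = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
weights = [[0.0, 0.0], [0.0, 0.0]]

weights, loss_before = train_step(weights, batch)
_, loss_after = train_step(weights, batch)
print(loss_before, loss_after)
```

The loss reported by the second call is lower than the first, because the weights were already adjusted once in the direction that reduces it.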

## Some bold claims

- The neural network is not a blank slate
    - Architectural bias
    - Random initialisation
    - Lottery ticket hypothesis
- The training is not just gradual
    - Especially when it comes to compositional skills

# What I have not covered

- Prompting
- Hyperparameter search
- Finetuning
- RLHF
- Multi-modality

# References

- Dong, Y., Cordonnier, J.-B., & Loukas, A. (2021). Attention is not all you need: Pure attention loses rank doubly exponentially with depth. Proceedings of the 38th International Conference on Machine Learning, 2793–2803. https://proceedings.mlr.press/v139/dong21a.html
- Pérez, J., Barceló, P., & Marinkovic, J. (2021). Attention is Turing-Complete. Journal of Machine Learning Research, 22(75), 1–35.
- Shazeer, N. (2020). GLU Variants Improve Transformer (arXiv:2002.05202). arXiv. https://doi.org/10.48550/arXiv.2002.05202
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information Processing Systems, 30. https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html