# Language models

<div class="alert alert-block alert-info"> <b>Activity.</b> Predict the missing word. <br>
    <ol>
        <li>I went to ___</li>
        <li>Would you like another ___</li>
        <li>He went up the stairs ___</li>
        <li>I would like ___ but today I don't have time</li>
        <li>There are ___ animals in the zoo</li>
        <li>This is ___ last chance you'll get</li>
    </ol>
</div>




Language models are neural network models trained in a *self-supervised* way on tasks such as:

1. *Causal language modeling*: predict the next word given the previous $n$ words, e.g., (1)-(3) above.
2. *Masked language modeling*: predict a masked word from the words around it, e.g., (4)-(6) above.
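
To make the two tasks concrete, here is a minimal sketch of how training examples could be derived from raw text without any manual labels. The function names, the whitespace tokenization, and the toy sentence are assumptions made for illustration; real models use subword tokenizers, and masked language modeling typically hides a random subset of tokens rather than each token in turn.

```python
def causal_examples(tokens, n):
    """Pair each word with the (up to) n words that precede it."""
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - n):i]
        examples.append((context, tokens[i]))
    return examples


def masked_examples(tokens, mask_token="[MASK]"):
    """Hide one word at a time; the hidden word becomes the label."""
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((masked, target))
    return examples


# Toy sentence, split on whitespace (an assumption for illustration)
tokens = "the cat sat on the mat".split()

for context, target in causal_examples(tokens, n=2):
    print(context, "->", target)

for masked, target in masked_examples(tokens)[:2]:
    print(" ".join(masked), "->", target)
```

In both cases the label comes from the text itself, which is what makes the training self-supervised.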

These tasks are general: they are not tied to any particular application, and they are claimed to allow large language models to learn general linguistic properties. A large research area studies which properties of natural language these models do, and do not, learn from self-supervised training on large amounts of unstructured data.


<div class="alert alert-block alert-info"> <b>Discussion.</b> What linguistic properties do you expect a model to learn from causal or masked language modeling? Which ones do you expect it to have a harder time with?
</div>

***

## Transformers

Transformers are models that process sequential data such as language input. What makes them special, and a large part of the reason for their success, is their *attention mechanism*. We won't go into the details here, but see [Attention Is All You Need](https://arxiv.org/abs/1706.03762) and [Hugging Face's tutorial on transformers](https://huggingface.co/course). Intuitively, attention lets the representation of each word depend on the other words in the input, so that, unlike static word embeddings, which assign a word the same vector in every sentence, transformers output *contextualized* representations.
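
For readers who want a concrete picture anyway, the linked paper defines scaled dot-product attention as $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$. The following is a minimal pure-Python sketch of that formula only. It leaves out the learned projections, multiple heads, and everything else a real transformer layer contains, and the input vectors are made-up toy numbers.

```python
import math


def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Three toy 2-dimensional "token" vectors (made-up numbers)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: each token attends to all tokens, itself included
contextualized = attention(x, x, x)
print(contextualized)
```

Each output vector is a weighted mix of all input vectors, which is the sense in which the resulting representations are contextualized.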

In all likelihood, you will mainly be using *pre-trained* language models (most likely transformers), because training a large language model requires a lot of data and computing time. Nowadays, large companies are the main source of new models.

Using a pre-trained language model and fine-tuning it for a particular task is called *transfer learning*.

<div class="alert alert-block alert-info"> <b>Discussion.</b> What are the advantages of using a pre-trained model? What are the disadvantages?
</div>

### Data-hungry algorithms and bias

In the tutorial, you came across the following example of bias induced by training data:

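```python
from transformers import pipeline

# Load a fill-mask pipeline backed by the pre-trained BERT model
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Compare the top predictions for two sentences that differ only in the subject
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
```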

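
To compare the two sets of predictions more systematically, one option is to separate the fillers that both sentences share from those that appear for only one subject. The sketch below assumes `unmasker` is the fill-mask pipeline created above and that it accepts a `top_k` argument when called. The helper names `top_tokens` and `compare_fillers` and the `{subject}` template placeholder are hypothetical and introduced only for illustration.

```python
def top_tokens(unmasker, sentence, k=5):
    """Return the k highest-scoring fillers for the [MASK] slot in `sentence`."""
    return [r["token_str"] for r in unmasker(sentence, top_k=k)]


def compare_fillers(unmasker, template, subjects, k=5):
    """Fill `template` once per subject; return shared and subject-specific fillers."""
    predictions = {
        s: set(top_tokens(unmasker, template.format(subject=s), k)) for s in subjects
    }
    shared = set.intersection(*predictions.values())
    specific = {s: sorted(p - shared) for s, p in predictions.items()}
    return sorted(shared), specific


# Usage, assuming `unmasker` is the fill-mask pipeline created above:
# shared, specific = compare_fillers(
#     unmasker, "This {subject} works as a [MASK].", ["man", "woman"]
# )
# print(shared)
# print(specific)
```

Fillers that appear for only one subject are candidates for associations the model picked up from its training data.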

<div class="alert alert-block alert-info"> <b>Discussion.</b> What other types of biases (linguistic and cultural) may a language model inadvertently incorporate?

Identifying and mitigating bias in language models is a fast-growing area in industry and academia.
</div>