# Language models

<div class="alert alert-block alert-info"> <b>Activity.</b> Predict the missing word <br>
    <ol>
        <li>I went to ___</li>
        <li>Would you like another ___</li>
        <li>He went up the stairs ___</li>
        <li>I would like ___ but today I don't have time</li>
        <li>There are ___ animals in the zoo</li>
        <li>This is ___ last chance you'll get</li>
    </ol>
</div>




Language models are *self-supervised* neural network models trained on tasks such as:

1. *Causal language modeling*: predict the next word given the previous $n$ words, as in (1)-(3) above.
2. *Masked language modeling*: predict a masked word given the words around it, as in (4)-(6) above.
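
The two objectives can be illustrated by turning a sentence into training examples. The following is a minimal sketch in plain Python, not how any particular library builds its data: the helper names `causal_examples` and `masked_examples` are hypothetical, the sentence fills the blank of item (5) with an arbitrary word, and whitespace splitting stands in for a real tokenizer.

```python
def causal_examples(tokens, n):
    """Return (context, target) pairs: the previous n tokens predict the next one."""
    pairs = []
    for i in range(n, len(tokens)):
        pairs.append((tokens[i - n:i], tokens[i]))
    return pairs


def masked_examples(tokens, mask_token="[MASK]"):
    """Return (masked_sequence, target) pairs, masking one position at a time."""
    pairs = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        pairs.append((masked, target))
    return pairs


# Illustrative sentence; "many" is just one possible filler for the blank.
tokens = "there are many animals in the zoo".split()

causal = causal_examples(tokens, n=2)
masked = masked_examples(tokens)

print(causal[0])  # (['there', 'are'], 'many')
print(masked[2])  # (['there', 'are', '[MASK]', 'animals', 'in', 'the', 'zoo'], 'many')
```

In both cases the target comes from the text itself, which is why no manual labels are needed. The sketch masks every position in turn to keep things simple; actual masked language models mask only a random subset of (subword) tokens.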

These tasks are general: they need no labels beyond the text itself, and they are claimed to let large language models learn general linguistic properties. A large research area studies which properties of natural language these models do and do not learn from self-supervised training on large amounts of unstructured data.

## Transformers

Transformers are models that process sequential data, such as language input. What makes them special, and a large reason for their success, is their *attention mechanism*. We won't go into the details here, but see [Attention Is All You Need](https://arxiv.org/abs/1706.03762) and [Hugging Face's tutorial on transformers](https://huggingface.co/course).
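
For orientation only, the core computation of the attention mechanism described in the linked paper (scaled dot-product attention) can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions: the input vectors are made up, and the same vectors are used as queries, keys, and values, whereas a real transformer first applies learned linear projections and uses several attention heads.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy 2-dimensional vectors for three tokens (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)  # self-attention: queries = keys = values

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how much one token attends to every token in the sequence, so each token's output mixes in information from its context.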

In all likelihood, you will mainly be using *pre-trained* language models (most likely transformers), because training a large language model requires a lot of data and computing time. Nowadays, large companies are the main source of new models.

Using a pre-trained language model and fine-tuning it for a particular task is called *transfer learning*.

<div class="alert alert-block alert-info"> <b>Discussion.</b> What are the advantages of using a pre-trained model? What are the disadvantages?
</div>

### Data-hungry algorithms and bias

In the tutorial, you came across the following example of bias induced by training data:

In [1]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])



['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
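
One simple way to quantify how different the two sets of predictions are is to measure their overlap. The sketch below copies the two top-5 lists from the output above and computes their Jaccard overlap; the helper name `jaccard` is our own, and this is only one of many possible measures.

```python
# Top-5 predictions copied from the output above.
man = ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
woman = ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


shared = sorted(set(man) & set(woman))
print(shared)               # []
print(jaccard(man, woman))  # 0.0
```

The two prompts differ in a single word, yet the top-5 predictions share no occupation at all.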


<div class="alert alert-block alert-info"> <b>Discussion.</b> What other types of biases (linguistic and cultural) may a language model inadvertently incorporate?
</div>

# Transformers and spaCy

`spaCy` has a wrapper for the `transformers` library, called `spacy-transformers`. The wrapper does not support some tasks (for instance, `[MASK]` prediction, since it is considered a Natural Language Generation task), but many are supported, including fine-tuning.
    
More information can be found on spaCy's webpage and in [this blog post](https://explosion.ai/blog/spacy-transformers).
    
    
### Word embeddings
Besides fine-tuning, you can also easily access a language model's word embeddings:

In [6]:
import spacy

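# The small model ships without static word vectors; for meaningful similarity
# scores, load a larger package such as en_core_web_md or en_core_web_lg.
nlp = spacy.load('en_core_web_sm')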


doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))









I like salty fries and hamburgers. <-> Fast food tastes very good. 0.27134929909014804
salty fries <-> hamburgers 0.40727245807647705


In [19]:
nlp = spacy.load("en_core_web_trf")

doc = nlp("Apple shares rose on the news. Apple pie is delicious.")

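# doc[0] is the first token, 'Apple' (the company)
print(doc[0])  # Apple

# doc[7] is the 8th token, 'Apple' (the fruit)
print(doc[7])  # Apple

# Transformer embeddings are context-sensitive, so the two occurrences are
# represented differently. Token.similarity does not show this here: it printed
# 1.0 in the run below, which appears to be spaCy's shortcut for tokens with
# identical text rather than a comparison of the contextual embeddings.
print(doc[0].similarity(doc[7]))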

#doc1 = nlp(u"The labrador barked.")
#doc2 = nlp(u"The labrador swam.")
#doc3 = nlp(u"the labrador people live in canada.")

#for doc in [doc1, doc2, doc3]:
#    labrador = doc[1]
#    dog = nlp(u"dog")
#    print(labrador.similarity(dog))

Apple
Apple
1.0
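
The `similarity` calls above return a single score per pair. According to spaCy's documentation, the default score is the cosine similarity of the two objects' vectors. A minimal pure-Python sketch of that measure, using made-up toy vectors and a hypothetical helper `cosine_similarity` (not spaCy's own implementation), looks like this:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Undefined for zero vectors; return 0.0 by convention in this sketch.
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
no_vector = cosine_similarity([0.0, 0.0], [1.0, 1.0])

print(same_direction)  # 1.0
print(orthogonal)      # 0.0
print(no_vector)       # 0.0
```

A score of 0.0 can therefore mean either that two vectors are unrelated or that one of them is empty, which is worth keeping in mind when a model has no word vectors loaded.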


<div class="alert alert-block alert-success"> <b>Activity.</b> Use spaCy's en_core_web_sm's word embeddings to answer the following questions
    1. 
</div>



import spacy
import en_core_web_sm