![](img/575_banner.png)

# Lecture 8: More transformers

UBC Master of Data Science program, 2022-23

Instructor: Varada Kolhatkar

> [Attention is all you need!](https://arxiv.org/pdf/1706.03762.pdf)

## Imports, LO

### Imports

In [1]:
import sys
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim

pd.set_option("display.max_colwidth", 0)

<br><br>

### Learning outcomes

From this lecture you will be able to 

- Broadly explain how transformers are used in a language model.
- Broadly explain how transformers are used for autoregressive text generation. 
- Broadly explain how bi-directional attention works. 
- Broadly explain masking and masked language models.
- Explain the difference between a causal language model and a bi-directional language model. 
- Explain why we get contextual embeddings with bidirectional models. 
- Use basic functionalities provided in the 🤗 Transformers library.  

<br><br>

### Attributions

This material is heavily based on [Jurafsky and Martin, Chapter 10](https://web.stanford.edu/~jurafsky/slp3/10.pdf) and [Chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf).

<br><br><br><br>

## ❓❓ Questions for you


### Exercise 8.1: Discuss the following questions with your neighbour

- (A) What are the key operations of self-attention? 
- (B) What is the purpose of query ($Q$), key ($K$), and value ($V$) matrices?
- (C) How to compute contextual representation of a token using $Q$, $K$, $V$ matrices and the key operations of self-attention?    
- (D) What are positional embeddings? 
- (E) What is multihead attention? 

<br><br><br><br>

### Transformers as a language model 

How are transformers deployed as language models? 
- We use self-supervision to train language models: the text itself provides the labels, so no manual annotation is needed. 
- Given a training corpus of plain text we'll train the model autoregressively to predict the next token in a sequence $y_t$, using cross-entropy loss. 
- At each time step, cross-entropy loss gives the negative log probability the model assigns to the next word in the training sequence.

![](img/transformer_language_model.png)

<!-- <img src="img/transformer_language_model.png" width="600" height="600"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/10.pdf)

- For training, we create input/output sequences of a fixed sequence length from a large amount of text (e.g., the whole of Wikipedia), as shown below. 

> Input: So long and thanks for

> Gold output: long and thanks for all

- At each step, given all the preceding words, the final transformer layer produces an output distribution over the entire vocabulary. 
- During training, the probability assigned to the correct word is used to calculate the cross-entropy loss for each item in the sequence.  
- The loss for a training sequence is the average cross-entropy loss over the entire sequence.
- With transformers, each training item can be processed in parallel since the output for each element in the sequence is computed separately.
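To make the training setup above concrete, here is a minimal NumPy sketch. The random logits stand in for the output of the final transformer layer, and the tiny vocabulary is built from the example sentence itself; both are assumptions for illustration, not part of any real model.

```python
import numpy as np

tokens = ["So", "long", "and", "thanks", "for", "all"]

# Inputs are all tokens but the last; targets are the same tokens shifted left by one.
inputs, targets = tokens[:-1], tokens[1:]

# Toy vocabulary built from the example itself (an assumption for illustration).
vocab = sorted(set(tokens))
token_to_id = {tok: i for i, tok in enumerate(vocab)}
target_ids = np.array([token_to_id[t] for t in targets])

# Stand-in for the model: random logits of shape (sequence length, vocabulary size).
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), len(vocab)))

# Softmax over the vocabulary at every position.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Cross-entropy at step t is -log of the probability assigned to the gold next token.
per_step_loss = -np.log(probs[np.arange(len(targets)), target_ids])

# The loss for the sequence is the average over all positions.
sequence_loss = per_step_loss.mean()

print(list(zip(inputs, targets)))
print(per_step_loss.round(3), round(float(sequence_loss), 3))
```

Notice that all per-step losses are computed in one vectorized operation, which mirrors how a transformer can process every position of a training sequence in parallel.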

### Autoregressive text generation 

- Once we have a trained model, we can autoregressively generate novel text just as with RNN-based models. 
- Incrementally generating words by repeatedly sampling the next word conditioned on our previous choices is called **autoregressive generation** or **causal LM generation**.
- The neural text generation process is similar to generation with **Markov models**. 

![](img/transformer_autoregressive_text_generation.png)

<!-- <img src="img/transformer_autoregressive_text_generation.png" width="600" height="600"> -->

- Text generation, along with image generation and code generation, constitutes a new area of AI that is often called **generative AI**.
- The language model above is called an **autoregressive** or **causal** language model.
- The models in the GPT family are autoregressive language models which are primarily designed for text generation. 
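The generation loop described in this section can be sketched in a few lines. The vocabulary and the probability table below are hypothetical stand-ins for a trained model. For simplicity the toy `next_token_distribution` only looks at the last token of the prefix (like a Markov model), whereas a transformer would condition on the whole prefix.

```python
import numpy as np

# Hypothetical toy vocabulary and next-token table standing in for a trained model.
vocab = ["<s>", "so", "long", "and", "thanks", "</s>"]
# table[i, j] = P(next token is vocab[j] | last token is vocab[i])
table = np.array([
    [0.0, 0.9, 0.1, 0.0, 0.0, 0.0],  # after <s>
    [0.0, 0.0, 0.8, 0.1, 0.1, 0.0],  # after so
    [0.0, 0.0, 0.0, 0.7, 0.1, 0.2],  # after long
    [0.0, 0.1, 0.0, 0.0, 0.8, 0.1],  # after and
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # after thanks
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # after </s>
])


def next_token_distribution(prefix_ids):
    """Toy 'model': receives the whole prefix but only uses its last token."""
    return table[prefix_ids[-1]]


def generate(max_new_tokens=10, seed=0):
    rng = np.random.default_rng(seed)
    sequence = [0]  # start from <s>
    for _ in range(max_new_tokens):
        probs = next_token_distribution(sequence)
        # Sample the next token conditioned on our previous choices.
        next_id = int(rng.choice(len(vocab), p=probs))
        sequence.append(next_id)
        if vocab[next_id] == "</s>":
            break
    return [vocab[i] for i in sequence]


print(" ".join(generate()))
```

Each sampled token is appended to the sequence and fed back in at the next step, which is what makes the process autoregressive.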

### Decoding methods 

- Greedy
- Sampling 
    - Temperature
- Beam search
    - The greedy approach of selecting the token with the highest probability at each time step might be problematic because the token that looks good to the decoder now might turn out later to have been the wrong choice.  
    - Beam search is a popular alternative which keeps track of the $k$ best partial sequences (hypotheses) at each time step.  
    - For more details, refer to Section 10.4 in [this chapter](https://web.stanford.edu/~jurafsky/slp3/10.pdf). 
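Here is a small sketch of the decoding methods listed above. The two-step "model" (`first` and `second`) is a hypothetical toy chosen so that greedy decoding misses the most probable sequence; the function names are also just for illustration.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()


# Hypothetical two-step model: P(first token) and P(second token | first token).
first = {"a": 0.6, "b": 0.4}
second = {"a": {"x": 0.55, "y": 0.45}, "b": {"x": 0.9, "y": 0.1}}


def next_probs(prefix):
    return first if len(prefix) == 0 else second[prefix[0]]


def greedy(steps=2):
    """Pick the single most probable token at every step."""
    prefix, logp = (), 0.0
    for _ in range(steps):
        probs = next_probs(prefix)
        tok = max(probs, key=probs.get)
        prefix, logp = prefix + (tok,), logp + np.log(probs[tok])
    return prefix, logp


def beam_search(k=2, steps=2):
    """Keep the k highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), logp + np.log(p))
            for prefix, logp in beams
            for tok, p in next_probs(prefix).items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]


print(greedy())
print(beam_search())
print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5).round(3))
```

In this toy example, greedy decoding commits to `a` (probability 0.6) and ends with a sequence of probability 0.6 × 0.55 = 0.33, while beam search with $k=2$ keeps `b` alive and finds `b x` with probability 0.4 × 0.9 = 0.36.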

<br><br>

## Bidirectional transformer encoders 
- Models such as [**BERT**](https://en.wikipedia.org/wiki/BERT_(language_model)) and its variant **RoBERTa** are **bidirectional transformer models**.  
- These models are primarily designed for a wide range of NLP tasks (e.g., text classification)
- Remember the [sentence transformers](https://www.sbert.net/) you used in DSCI 563 lab1 to get sentence embeddings? These sentence embeddings are based on **BERT**. 

There are three main components of BERT. 

- Bidirectional self-attention
- Masking
- Next sentence prediction 

### Bidirectional self-attention

- We have seen backward-looking self-attention.
- This is also referred to as a causal (left-to-right) transformer model. 
- Each output is computed using only information seen **earlier** in the context. 
- This is great for autoregressive generation. 
- But for sequence classification or sequence labeling, causal models have an obvious shortcoming: they do not have access to the context to the right of the current token.
- The hidden state computation is based solely on the current and earlier elements in the sequence, and it ignores potentially useful information located to the right of each time step.  

![](img/self_attention.png)

<!-- <img src="img/self_attention.png" width="600" height="600"> -->

#### Bidirectional self-attention

- Bidirectional self-attention overcomes this limitation 
    - Allow the self-attention mechanism to access elements from the entire input sequence. 
- Again, it's used to map sequences of input embeddings $(x_1, \dots, x_n)$ to sequences of output embeddings of the same length $(y_1, \dots, y_n)$. 
- The model attends to all inputs, both before and after the current one.    
- Information flows in both directions in bidirectional self-attention. 

![](img/bidirectional_self_attention.png)

<!-- <img src="img/bidirectional_self_attention.png" width="600" height="600"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- With bidirectional encoders we get contextual representations of tokens in the input sequence which are generally useful across a range of downstream applications. 

- All the computations are exactly the same as before. 
- The matrix below shows the $q_i \cdot k_j$ comparisons. We no longer set the values in the upper triangle to $-\infty$. 

![](img/self_attention_calc_all.png)

<!-- <img src="img/self_attention_calc_all.png" width="400" height="400"> -->
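The difference between causal and bidirectional self-attention comes down to one masking step, as the following NumPy sketch shows. The random inputs and weight matrices are placeholders, and the scaling by $\sqrt{d_k}$ follows the standard scaled dot-product formulation, which is assumed here.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(X, W_q, W_k, W_v, causal):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # all q_i . k_j comparisons
    if causal:
        # Causal models set the upper triangle (future positions) to -inf.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d = 4, 8  # sequence length and embedding size (arbitrary toy values)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

out_causal, w_causal = self_attention(X, W_q, W_k, W_v, causal=True)
out_bi, w_bi = self_attention(X, W_q, W_k, W_v, causal=False)

print(w_causal.round(2))  # zeros above the diagonal
print(w_bi.round(2))      # every token attends to every token
```

With `causal=False`, the output for the first token depends on all later tokens as well, which is exactly the extra right-hand context that bidirectional encoders exploit.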

- What's the challenge with training bidirectional encoders? 
- We trained the causal transformer language model by making them iteratively predict the next word. 
- Can we use the same approach? 
<br><br><br><br>

### "fill-in-the-blank" task

- No, because that would be cheating: we have access to the right context as well. 
- So instead of predicting the next word, the model learns to perform a fill-in-the-blank task (also referred to as **cloze task**).

> ##### I am studying science at UBC because I want to ___ as a data scientist. 

> ##### The ___ in the exam where the fire alarm is ___ are really stressed. 

- Given an input sequence with one or more elements missing, the learning task is to predict the missing elements. 
- During training the model is deprived of one or more elements of an input sequence and the model predicts the probability distribution over all words in the vocabulary for each of these missing items. 
- Use cross-entropy loss for each of the model's predictions to drive the learning process. 

### Masking

- There are several ways to deprive the model of one or more elements of an input sequence during training. 
    - Remove a token and replace it by a special token called `[MASK]` and learn to recover the masked token 
    - Corrupt the input by replacing a token with a random token from the vocabulary and ask it to recover the original input  
- BERT used this approach of masked language modeling for training.         
- In BERT, $15\%$ of the input tokens in a training sequence are sampled for learning. Of these
    - $80\%$ are replaced with `[MASK]`
    - $10\%$ are replaced with randomly selected tokens and 
    - the remaining $10\%$ are left unchanged.

![](img/BERT-MLM.png)

<!-- <img src="img/BERT-MLM.png" width="600" height="600"> -->
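The 15% / 80-10-10 recipe can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: it selects each token independently with probability 0.15, and the function name `mask_tokens` and the toy vocabulary are made up for this example.

```python
import numpy as np


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return the corrupted input and the prediction targets.

    targets[i] is None when position i does not contribute to the loss.
    """
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not sampled for learning
        targets[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"                        # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = vocab[rng.integers(len(vocab))]  # 10%: random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets


tokens = "so long and thanks for all the fish".split()
vocab = sorted(set(tokens))
corrupted, targets = mask_tokens(tokens, vocab, np.random.default_rng(0))
print(corrupted)
print(targets)
```

The cross-entropy loss is then computed only at positions where `targets` is not `None`.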

### (Optional) Next Sentence Prediction (NSP)

- This is particularly designed to help BERT learn relationships between pairs of sentences. 

Many downstream tasks depend on such relationships. For example, in natural language inference, given a premise and a hypothesis, the task is to determine whether the hypothesis is entailed by, contradicts, or is neutral with respect to the premise. 

> Premise: "The weather outside is sunny and warm."

> Hypothesis 1: "People are enjoying outdoor activities."

> Hypothesis 2: "People are staying indoors due to the weather."

![](img/NSP-loss-calculation.png)

<!-- <img src="img/NSP-loss-calculation.png" width="600" height="600"> -->

- BERT is trained to predict whether a pair of sentences are consecutive or not. 
- BERT's final training loss combines the NSP objective and the MLM objective.

## Contextual embeddings 

The representations created by masked language models are called **contextual embeddings**.

- The methods like word2vec learned a single vector embedding for each unique word $w$ in the vocabulary. 
- By contrast, the representations created by masked language models carry contextual information. Each word is represented by a different vector each time it appears in a different context.   
- Did we create contextual embeddings with causal language models? 

### BERT model parameters 

The original bidirectional transformer encoder model (BERT) consisted of the following:

- The corpus used to train BERT and other early transformer-based language models consisted of an **800 million** word corpus of book texts called BooksCorpus and a **2.5 billion** word corpus derived from English Wikipedia, for a combined size of **3.3 billion words**. 
- Hidden layers of size **768**. (If you recall, the sentence embeddings from DSCI 563 lab 1 were 768-dimensional.)
- **12** layers of transformer blocks with **12** attention heads in each multihead attention layer! 

The model has over **100M parameters**. 
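As a sanity check on the "over 100M parameters" figure, we can count the weights ourselves. The hidden size and number of layers come from the list above. The vocabulary size (30,522 WordPiece tokens), the 512 positions, the 2 segment types, and the feed-forward size of 3,072 are the BERT-base values and are assumptions not stated in these notes.

```python
# Assumed BERT-base hyperparameters not stated above: WordPiece vocabulary of 30,522,
# 512 positions, 2 segment types, and a feed-forward size of 3,072.
vocab_size, max_positions, segment_types = 30_522, 512, 2
hidden, ffn, layers, heads = 768, 3_072, 12, 12

layer_norm = 2 * hidden  # scale and bias

# Token, position, and segment embeddings, followed by a layer norm.
embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

# Query, key, value, and output projections (weights + biases).
attention = 4 * (hidden * hidden + hidden)

# Two linear layers: hidden -> ffn -> hidden (weights + biases).
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)

per_layer = attention + layer_norm + feed_forward + layer_norm

# Final pooling layer applied to the [CLS] representation.
pooler = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(f"embeddings: {embeddings:,}")
print(f"per layer:  {per_layer:,}")
print(f"total:      {total:,}")
print(f"dimension per attention head: {hidden // heads}")
```

Under these assumptions the total comes to roughly 110M parameters. Note that the number of attention heads does not change the count: the 768 dimensions are simply split into 12 heads of 64 dimensions each.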

### Transfer Learning through Fine-Tuning

- GPT and BERT extract generalizations from large amounts of data that are useful for a large number of downstream applications.
- How do we use these representations in our tasks?
- Create interfaces from these models to downstream applications through a process called **fine-tuning**.
- Create applications on top of pre-trained models by adding a small set of application-specific parameters.
- Either freeze the pre-trained language model parameters or make only minimal adjustments to them.
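The idea of freezing the pre-trained model and training only a small set of application-specific parameters can be illustrated with a toy NumPy sketch. Everything here is a stand-in: `W_encoder` plays the role of a frozen pre-trained encoder, the task is synthetic, and the "head" is a plain logistic regression trained with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained encoder: a fixed projection we never update.
hidden = 16
W_encoder = rng.normal(size=(4, hidden))
frozen_copy = W_encoder.copy()


def encode(x):
    """Frozen 'pre-trained' features; W_encoder receives no gradient updates."""
    return np.tanh(x @ W_encoder)


# Hypothetical toy task: the label is 1 when the first input feature is positive.
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)

# Application-specific parameters: a logistic-regression head on top of the features.
w_head, b_head = np.zeros(hidden), 0.0


def loss_and_probs(features):
    p = 1.0 / (1.0 + np.exp(-(features @ w_head + b_head)))
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss, p


features = encode(X)  # computed once, because the encoder is frozen
initial_loss, _ = loss_and_probs(features)

learning_rate = 0.1
for _ in range(1000):
    _, p = loss_and_probs(features)
    grad = p - y  # gradient of the loss with respect to the head's logits
    w_head -= learning_rate * features.T @ grad / len(y)
    b_head -= learning_rate * grad.mean()

final_loss, p = loss_and_probs(features)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print("encoder unchanged:", np.array_equal(W_encoder, frozen_copy))
```

Only the 17 head parameters are updated here. With a real pre-trained model the same pattern applies, except that the encoder has millions of parameters and may optionally be adjusted slightly as well.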

### Bidirectional sequence classification

![](img/BERT-sequence-classification.png)
<!-- <img src="img/BERT-sequence-classification.png" width="600" height="600"> -->
[source](https://web.stanford.edu/~jurafsky/slp3/11.pdf)

### Bidirectional sequence labeling
![](img/BERT_sequence_labeling.png)
<!-- <img src="img/BERT_sequence_labeling.png" width="600" height="600"> -->
[source](https://web.stanford.edu/~jurafsky/slp3/11.pdf)

### General comments 

Language models are broadly categorized into three types.

- Decoder-only (e.g., GPT-3)
    - each output is computed only using information seen **earlier** in the context. 
    - given a prompt $X_{1:i}$ produces unidirectional contextual embeddings and a distribution over next tokens $X_{i+1}$ and recursively over the entire sequence $X_{i+1:L}$, where L is the length of the sequence.  
    - can naturally generate completions
    - ##### Example: MDS Cohort 7 is the ___
- Encoder-only (e.g., BERT or RoBERTa etc): 
    - Compute bidirectional contextual embeddings.
    - Cannot be directly used to generate text
    - The contextual embeddings are used for classification tasks 
    - ##### Example: MDS Cohort 7 is the best! $\rightarrow$ positive
- Encoder-decoder (e.g., T5, BART)
    - Encode input, decode output
    - Take the best of both worlds
    - Bidirectional contextual embeddings
    - Can naturally generate outputs

- Now you know some fundamentals of transformers. 
- There are many things related to transformers which we have not covered. You can look up the following if you want to know more. [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) is an excellent resource. 
- Transformers are not only for NLP. They have been successfully applied in many other domains often with state-of-the-art results. For example, 
    - [Vision Transformers](https://arxiv.org/pdf/2010.11929.pdf)
    - Bioinformatics: See [this](https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology) and [this](http://people.csail.mit.edu/tommi/papers/Ingraham_etal_neurips19.pdf).

### Break

![](img/eva-coffee.png)

<br><br><br><br>

## [🤗 Transformers library](https://huggingface.co/docs/transformers/index)
- The Hugging Face Transformers library is a popular open-source Python library for working with transformer models.
- It provides a wide range of pre-trained transformer models which have achieved state-of-the-art performance on many NLP tasks.
- It provides 
    - an easy-to-use API that allows using these pre-trained models off-the-shelf for tasks such as text classification, question answering, named entity recognition, or machine translation.  
    - an easy-to-use API that allows developers to fine-tune pre-trained transformer models on their own NLP tasks.  
- It also includes utilities for tokenization, model inference, and training and other useful features for model visualization, model comparison, and model sharing via the Hugging Face model hub.  
- It supports deep learning frameworks such as PyTorch and TensorFlow and provides a unified interface for working with transformer models across these frameworks. 
- It has excellent documentation and is a very useful tool for NLP practitioners and researchers.

### Installation 
First, install the library if it's not already in your course environment. On the command line, activate the course environment and install the library. [Here](https://huggingface.co/docs/transformers/installation) you will find installation instructions. 

`conda activate 575`
`pip install transformers`

### Using pre-trained transformer models  

- Let's try sentiment analysis with pre-trained models. 
- The easiest way to get started with pre-trained transformer models is to use [pipelines](https://huggingface.co/docs/transformers/pipeline_tutorial), which abstract many things away from the user. 

In [2]:
from transformers import pipeline

input_text = "MDS Cohort 7 is the best!"
sentiment_classifier = pipeline("sentiment-analysis")
pd.DataFrame(sentiment_classifier(input_text))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


|   | label    | score   |
|---|----------|---------|
| 0 | POSITIVE | 0.99985 |


Right now it's using the default pre-trained model. You can use a model that is more suitable for your specific language or data. For example, you are working on 

You can also pass a list of documents to the pipeline. 

In [3]:
sentiment_classifier = pipeline("sentiment-analysis")
input_text = ["MDS Cohort 7 is the best!", "I am sad that this is the last lecture of the program."]
pd.DataFrame(sentiment_classifier(input_text))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


|   | label    | score    |
|---|----------|----------|
| 0 | POSITIVE | 0.99985  |
| 1 | NEGATIVE | 0.999666 |


You can also explicitly specify the pre-trained model you want to use for your data. For example, if you want to do sentiment analysis on Twitter data, it's better to use a model trained on Twitter data. 
There are more than 1000 sentiment analysis models publicly available on [the Hub](https://huggingface.co/models), and integrating them with Python takes just a few lines of code. Let's just go with one of the most commonly used models for sentiment analysis.  

In [4]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment_classifier = pipeline("sentiment-analysis", model=model_name)
pd.DataFrame(sentiment_classifier(input_text))

|   | label    | score    |
|---|----------|----------|
| 0 | POSITIVE | 0.99985  |
| 1 | NEGATIVE | 0.999666 |


Let's pass the model and tokenizer explicitly. 

In [5]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentiment_classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
pd.DataFrame(sentiment_classifier(input_text))

|   | label    | score    |
|---|----------|----------|
| 0 | POSITIVE | 0.99985  |
| 1 | NEGATIVE | 0.999666 |


We get the same results! 

Let's examine what the tokenizer is doing. 

In [6]:
tokenized = tokenizer(input_text[1])
tokenized

{'input_ids': [101, 1045, 2572, 6517, 2008, 2023, 2003, 1996, 2197, 8835, 1997, 1996, 2565, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The attention mask is a binary mask that indicates which elements in a sequence the self-attention mechanism should attend to.  

In [7]:
tokens = tokenizer.tokenize(input_text[1])
tokens

['i',
 'am',
 'sad',
 'that',
 'this',
 'is',
 'the',
 'last',
 'lecture',
 'of',
 'the',
 'program',
 '.']

In [8]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids 

[1045, 2572, 6517, 2008, 2023, 2003, 1996, 2197, 8835, 1997, 1996, 2565, 1012]

In [9]:
decoded_string = tokenizer.decode(ids)
decoded_string

'i am sad that this is the last lecture of the program.'
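Comparing the outputs above, the `input_ids` returned by calling the tokenizer directly contain two more ids than the list returned by `convert_tokens_to_ids`. The following check, which uses only values copied from the outputs above, shows that the extra ids are the special tokens `[CLS]` (101) and `[SEP]` (102) that this tokenizer adds at the start and end of the sequence.

```python
# Values copied from the tokenizer outputs shown above.
input_ids = [101, 1045, 2572, 6517, 2008, 2023, 2003, 1996, 2197, 8835, 1997, 1996, 2565, 1012, 102]
ids = [1045, 2572, 6517, 2008, 2023, 2003, 1996, 2197, 8835, 1997, 1996, 2565, 1012]
tokens = ["i", "am", "sad", "that", "this", "is", "the", "last", "lecture", "of", "the", "program", "."]

# Ids of the special tokens [CLS] and [SEP] in this tokenizer's vocabulary.
CLS_ID, SEP_ID = 101, 102

# Calling the tokenizer wraps the plain token ids in [CLS] ... [SEP].
wrapped = [CLS_ID] + ids + [SEP_ID]
print(wrapped == input_ids)

# Each token maps to exactly one id, so repeated tokens share an id.
token_to_id = dict(zip(tokens, ids))
print(token_to_id["the"], len(tokens), len(input_ids))
```

This is also why `tokenizer.decode(ids)` above returned the plain sentence without any special tokens: the list we decoded never contained them.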

#### Getting predictions with PyTorch

In [10]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Pad to the longest sequence in the batch; truncate anything beyond 512 tokens.
batch = tokenizer(input_text, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch)

{'input_ids': tensor([[  101,  9108,  2015,  2522, 27794,  1021,  2003,  1996,  2190,   999,
           102,     0,     0,     0,     0],
        [  101,  1045,  2572,  6517,  2008,  2023,  2003,  1996,  2197,  8835,
          1997,  1996,  2565,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}




Note that both sequences were padded to the length of the longest sequence in the batch (15 tokens). The attention mask for the first sentence has zeros for the last four elements because these are padding and not valid tokens in this sequence, so we do not want to attend to them. 
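To see what "not attending" to padding means numerically, here is a NumPy sketch that applies the attention mask from the output above to a batch of attention scores. The random scores are placeholders for real $q \cdot k$ values, and blocking keys with $-\infty$ is one common way to implement the mask; the library's internal implementation may differ in detail.

```python
import numpy as np

# Attention mask copied from the batch printed above (1 = real token, 0 = padding).
attention_mask = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
])

# Placeholder attention scores with shape (batch, query position, key position).
rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 15, 15))

# Broadcast the mask over the query dimension and block padded keys with -inf.
masked = np.where(attention_mask[:, None, :] == 1, scores, -np.inf)

# Softmax over the key dimension; exp(-inf) = 0, so padded keys get zero weight.
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)

print(weights[0, 0].round(2))  # last four entries are 0
print(weights[1, 0].round(2))  # no padding, so all entries are positive
```

The attention weights in each row still sum to one, but in the first sentence all of the probability mass is distributed over the eleven real tokens.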

In [11]:
with torch.no_grad():
    outputs = model(**batch)
    print(outputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=1)
    print(predictions)
    labels = torch.argmax(predictions, dim=1)
    print(labels)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.2119,  4.5918],
        [ 4.3894, -3.6148]]), hidden_states=None, attentions=None)
tensor([[1.5016e-04, 9.9985e-01],
        [9.9967e-01, 3.3396e-04]])
tensor([1, 0])
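The softmax step can be reproduced by hand from the logits printed above. The `id2label` mapping below is an assumption that is consistent with the pipeline outputs earlier in this section (index 1 for `POSITIVE`, index 0 for `NEGATIVE`); in practice you would read it from the model's configuration.

```python
import numpy as np

# Logits copied from the model output above (rounded to 4 decimals).
logits = np.array([[-4.2119, 4.5918],
                   [4.3894, -3.6148]])

# Numerically stable softmax over the class dimension.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Assumed label mapping, consistent with the pipeline output shown earlier.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
predicted = [id2label[int(i)] for i in probs.argmax(axis=1)]

print(probs.round(5))
print(predicted)
```

The largest probability in each row matches the `score` column that the pipeline reported, up to the rounding of the printed logits.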


### Visualizing attention weights using [BertViz](https://github.com/jessevig/bertviz)

- An interactive tool for visualizing attention in Transformer language models such as BERT, GPT-2, or T5.

In [12]:
from transformers import AutoTokenizer, AutoModel, utils
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
utils.logging.set_verbosity_error()  # Suppress standard warnings
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True) # note output_attentions parameter

In [13]:
from bertviz import head_view, model_view

The head view visualizes attention for one or more attention heads in the same layer.

In [14]:
inputs = tokenizer.encode(input_text[1], return_tensors='pt')
outputs = model(inputs)
attention = outputs[-1]  # Output includes attention weights when output_attentions=True
tokens = tokenizer.convert_ids_to_tokens(inputs[0]) 
head_view(attention, tokens)

<IPython.core.display.Javascript object>

The model view shows attention across all layers and all heads. 

In [15]:
model_view(attention, tokens)

<IPython.core.display.Javascript object>

### Prompts

- Until recently, language models were only one component of a larger system, such as a speech recognition or machine translation system.
- Now they are becoming capable of being standalone systems. 
- Language models are capable of conditional generation, so they can generate a completion given a prompt.  
$$\text{prompt} \rightarrow \text{completion}$$
- This simple interface lets us use language models for a variety of tasks just by changing the prompt. 
- Let's try a couple of prompts with the T5 encoder-decoder language model. 
- The following examples are based on the documentation [here](https://huggingface.co/docs/transformers/model_doc/t5).

In [16]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [17]:
tokenizer = AutoTokenizer.from_pretrained('t5-base')
# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the matching class for T5.
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base', return_dict=True)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [18]:
sequence = ('''
           A transformer is a deep learning model that adopts the mechanism of self-attention, 
           differentially weighting the significance of each part of the input data. 
           It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).
           Like recurrent neural networks (RNNs), transformers are designed to process sequential input data, 
           such as natural language, with applications towards tasks such as translation and text summarization. 
           However, unlike RNNs, transformers process the entire input all at once. 
           The attention mechanism provides context for any position in the input sequence. 
           For example, if the input data is a natural language sentence, 
           the transformer does not have to process one word at a time. 
           This allows for more parallelization than RNNs and therefore reduces training times.
           
           Transformers were introduced in 2017 by a team at Google Brain and are increasingly the model of choice 
           for NLP problems, replacing RNN models such as long short-term memory (LSTM). 
           The additional training parallelization allows training on larger datasets. 
           This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) 
           and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, 
           such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks. 
           
           Before transformers, most state-of-the-art NLP systems relied on gated RNNs, 
           such as LSTMs and gated recurrent units (GRUs), with added attention mechanisms. 
           Transformers also make use of attention mechanisms but, unlike RNNs, do not have a recurrent structure. 
           This means that provided with enough training data, attention mechanisms alone can match the performance 
           of RNNs with attention.
           
           Gated RNNs process tokens sequentially, maintaining a state vector that contains 
           a representation of the data seen prior to the current token. To process the 
           nth token, the model combines the state representing the sentence up to token n-1 with the information 
           of the new token to create a new state, representing the sentence up to token n. 
           Theoretically, the information from one token can propagate arbitrarily far down the sequence, 
           if at every point the state continues to encode contextual information about the token. 
           In practice this mechanism is flawed: the vanishing gradient problem leaves the model's state at 
           the end of a long sentence without precise, extractable information about preceding tokens. 
           The dependency of token computations on the results of previous token computations also makes it hard 
           to parallelize computation on modern deep-learning hardware. This can make the training of RNNs inefficient.
           
           These problems were addressed by attention mechanisms. Attention mechanisms let a model draw 
           from the state at any preceding point along the sequence. The attention layer can access 
           all previous states and weigh them according to a learned measure of relevance, providing 
           relevant information about far-away tokens.
           
           A clear example of the value of attention is in language translation, where context is essential 
           to assign the meaning of a word in a sentence. In an English-to-French translation system, 
           the first word of the French output most probably depends heavily on the first few words of the English input. 
           However, in a classic LSTM model, in order to produce the first word of the French output, the model 
           is given only the state vector after processing the last English word. Theoretically, this vector can encode 
           information about the whole English sentence, giving the model all the necessary knowledge. 
           In practice, this information is often poorly preserved by the LSTM. 
           An attention mechanism can be added to address this problem: the decoder is given access to the state vectors of every English input word, 
           not just the last, and can learn attention weights that dictate how much to attend to each English input state vector.
            ''')

In [19]:
prompt = "summarize: "
inputs = tokenizer.encode(prompt + sequence, return_tensors="pt", max_length=512, truncation=True)

In [20]:
summary_ids = model.generate(inputs, max_length=150, min_length=80, length_penalty=5., num_beams=2)
summary_ids

tensor([[    0,     3,     9, 19903,    19,     3,     9,  1659,  1036,   825,
            24,  4693,     7,     8,  8557,    13,  1044,    18, 25615,     3,
             5,  9770,     3,    60, 14907, 24228,  5275,    41, 14151,   567,
             7,   201, 19903,     7,   433,     8,  1297,  3785,    66,    44,
           728,     3,     5,     8,  1388,  8557,   795,  2625,    21,   136,
          1102,    16,     8,  3785,  5932,     3,     5, 19903,     7,   130,
          3665,    16,  1233,    57,     3,     9,   372,    44, 10283,  2241,
             3,     5,     8,  1388,  8557,    19,  1126,    12,    24,    13,
             3,     9,     3,    60, 14907, 24228,  1229,    41, 14151,   567,
            61,     1]])

In [21]:
tokenizer.decode(summary_ids[0])

'<pad> a transformer is a deep learning model that adopts the mechanism of self-attention. unlike recurrent neural networks (RNNs), transformers process the entire input all at once. the attention mechanism provides context for any position in the input sequence. transformers were introduced in 2017 by a team at google brain. the attention mechanism is similar to that of a recurrent neural network (RNN)</s>'

<br><br>

Let's try translation with the same model. 

In [22]:
input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids

labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
labels

tensor([[  644,  4598,   229, 19250,     5,     1]])

In [23]:
tokenizer.decode(labels[0])

'Das Haus ist wunderbar.</s>'

<br><br><br><br>

### Increase in size of language models 
- Model sizes have increased by a factor of roughly 500x over the last 4 years. 

![](img/model-sizes.png)

[Source](https://stanford-cs324.github.io/winter2022/lectures/introduction/)

### Harms of large language models 

While these models are super powerful and useful, be mindful of the harms caused by these models. Some of the harms as summarized [here](https://stanford-cs324.github.io/winter2022/lectures/harms-1/) are: 

- performance disparities
- social biases and stereotypes
- toxicity
- misinformation
- security and privacy risks
- copyright and legal protections
- environmental impact
- centralization of power



<br><br><br><br>

## Summary and wrap up 

### Week 1 ✅

- Markov models, language models, text generation 

![](img/Markov_autocompletion.png)

<!-- <center> -->
<!-- <img src="img/Markov_autocompletion.png" height="800" width="800">  -->
<!-- </center>     -->
    

### Applications of Markov models

![](img/Markov_chain_applications.png)

<!-- <center> -->
<!-- <img src="img/Markov_chain_applications.png" width="500" height="500"> -->
<!-- </center>     -->
    

### Week 2 ✅

- Hidden Markov models, speech recognition, POS tagging

![](img/hmm_eks.gif)

<!-- <center> -->
<!-- <img src="img/hmm_eks.gif" height="800" width="800"> -->
<!-- </center> -->

    

### Week 3 ✅

- Topic modeling (Latent Dirichlet Allocation (LDA)), organizing documents 
- Introduction to Recurrent Neural Networks (RNNs)

![](img/TM_food_magazines.png)

<!-- <center> -->
<!-- <img src="img/TM_food_magazines.png" height="1000" width="1000">  -->
<!-- </center>     -->


### Week 4 ✅

- Transformers 


![](img/eva-accomplished.png)

### Final remarks 

That's all! I had fun teaching you this complex material. I very much appreciate your support, patience, and great questions ♥️!   

It has been a challenging year but we all tried to make the best out of it. I wish you every success in your job search. Stay in touch!

### Time for course evaluations

I would love to hear your thoughts on this course. When you get a chance, it'll be great if you fill in the evaluation survey for this course on [Canvas](https://canvas.ubc.ca/courses/106530/external_tools/4732). 

The evaluation closing date is: **April 28th, 2022**