# 🙌 Welcome

In [2]:
# @title
from IPython.display import HTML, display

centered_html_with_title = lambda source, title: f"""
<div style="text-align: center;">
    <div style="font-size: 34px; margin-bottom: 20px; color: #8ab4f7;">{title}</div>
    <iframe src="{source}" width="800" height="392" frameBorder="0" class="giphy-embed" allowFullScreen></iframe>
</div>
"""
display(HTML(centered_html_with_title("https://giphy.com/embed/YOjP742CyBbg0zKCbl", "Let's get started")))

In [None]:
# @title
from IPython.display import HTML, display, Image, YouTubeVideo
from IPython.core.magic import register_line_magic
# Helper functions and line magics for styled notebook output (purely cosmetic).

def print_html(line, background_color):
  display(HTML(f"""
  <div style="display: inline-block; font-size:120%; border:1px solid black; padding: 15px; background-color: {background_color}; color: black; margin-bottom: 10px; border-radius:10px; border-width: 1px; border-style: solid; border-color: white; box-sizing: border-box;">
      {line}
  </div>
  """))


@register_line_magic
def note(line):
  print_html(line, "#94d4f5")


@register_line_magic
def warning(line):
  print_html(line, "#f59494")



styled_text_html = lambda x: f"""
<div style="padding:20px; color:#150d0a; margin:10px; font-size:220%; text-align:center; display:block; border-radius:20px; border-width: 1px; border-style: solid; border-color: white; background-color: #94d4f5; overflow:hidden;font-weight:500">
{x}</div>
"""

# 📍 Lecture Plan

In [None]:
# @title
display(HTML(styled_text_html("📍 Lecture Plan")))

# Key ideas in pretraining

- Make sure your model can process large-scale, diverse datasets
- Don’t use labeled data (otherwise you can’t scale!)
- Compute-aware scaling

In [None]:
# @title
url = "https://i.postimg.cc/g0rpYwqJ/0-c-ZXn-AEouh74p-Cb-Xf.webp"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=1000 height=500></div>')
display(centered_image)

# Motivating model pretraining from word embeddings

> “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)



*Consider I **record** the **record**: the two instances of record mean different things.*





In [None]:
# @title **Where we were**: pretrained word embeddings
url = "https://i.postimg.cc/LXG8BpKr/Screenshot-2024-06-25-at-22-22-09.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=600 height=500></div>')
display(centered_image)

- Start with pretrained word embeddings (no context!)
- Learn how to incorporate context in an LSTM or Transformer while training on the task.

In [None]:
# @title **Where we’re going**: pretraining whole models
url = "https://i.postimg.cc/wTXjJZWG/Screenshot-2024-06-25-at-22-37-26.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=600 height=500></div>')
display(centered_image)

- In modern NLP, all (or almost all) network parameters are initialized via pretraining.
- Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.

## What can we learn from reconstructing the input?

- Tbilisi is the capital city of ___
- I went to the ocean to see the fish, turtles, seals, and ___
- Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. ___
- I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21 ___
- The woman walked across the street, checking for traffic over ___ shoulder.
- I put ___ fork down on the table.


In [3]:
# @title
url = "https://i.postimg.cc/9M7j7vkH/Screenshot-2024-06-10-at-02-10-41.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=1000 height=500></div>')
display(centered_image)

# Pretraining

The language modeling task:
- Model $p_\theta(w_t|w_{1:t-1})$, the probability distribution over words given their past contexts.
- There’s lots of data for this! (In English.)

Pretraining through language modeling:
- Train a neural network to perform language modeling on a large amount of text.
- Save the network parameters to a file.
Where does this data come from?

- The Pile
- RedPajama
- RefinedWeb
- FineWeb
- DataComp-LM [**240 trillion tokens**] 🤯


# Model pretraining three ways

1. Encoders
2. Encoder-Decoders
3. Decoders

## 1. Encoders

In [None]:
# @title Encoder
url = "https://i.postimg.cc/nh5VtxMM/Screenshot-2024-06-25-at-23-09-57.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=500 height=200></div>')
display(centered_image)

In [None]:
# @title
url = "https://i.postimg.cc/0N8nxT3R/Screenshot-2024-06-25-at-23-25-08.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=600 height=500></div>')
display(centered_image)

- So far, we’ve looked at language model pretraining. But encoders get bidirectional context, so we can’t do language modeling!
- Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.

$$h_1, \ldots, h_T = \text{Encoder}(w_1, \ldots, w_T)$$

$$y_i \sim A h_i + b$$

Only add loss terms from words that are "masked out." If $\tilde{x}$ is the masked version of $x$, we're learning $p_\theta(x|\tilde{x})$. This objective is called Masked LM.
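
The "loss only on masked positions" idea can be sketched as follows. The per-position distributions are hypothetical stand-ins for the model outputs $y_i$, and `masked_lm_loss` is an illustrative name rather than a library function.

```python
import math

def masked_lm_loss(predicted_probs, original_tokens, masked_positions):
    """Average -log p(original token) over the masked positions only."""
    losses = [
        -math.log(predicted_probs[i][original_tokens[i]])
        for i in masked_positions
    ]
    return sum(losses) / len(losses)

original = ["I", "went", "to", "the", "store"]
corrupted = ["I", "[MASK]", "to", "the", "[MASK]"]
masked_positions = [i for i, tok in enumerate(corrupted) if tok == "[MASK]"]

# Hypothetical model outputs: one distribution over words per position.
predicted = [
    {"I": 0.9, "we": 0.1},
    {"went": 0.5, "walked": 0.5},
    {"to": 1.0},
    {"the": 0.8, "a": 0.2},
    {"store": 0.25, "park": 0.75},
]

# Positions 0, 2 and 3 are not masked, so they contribute nothing to the loss.
loss = masked_lm_loss(predicted, original, masked_positions)
print(masked_positions, loss)
```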

**BERT: Bidirectional Encoder Representations from Transformers**

Devlin et al., 2018 proposed the “Masked LM” objective and released the weights of a pretrained Transformer, a model they labeled BERT.

Some more details about Masked LM for BERT:
- Predict a random 15% of (sub)word tokens. For each selected token:
  - Replace the input word with [MASK] 80% of the time.
  - Replace the input word with a random token 10% of the time.
  - Leave the input word unchanged 10% of the time (but still predict it!).
- Why? This doesn’t let the model get complacent and fail to build strong representations of non-masked words. (No masks are seen at fine-tuning time!)
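
A minimal sketch of the corruption procedure described above. It selects each token independently with probability 0.15, which only approximates "a random 15%", and it works on whole words with a tiny made-up vocabulary. The function name, vocabulary, and seed are illustrative assumptions, not BERT's actual implementation.

```python
import random

def bert_mask(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted tokens, {position: original token to predict})."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: no loss term at this position
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged, but still predict it
    return corrupted, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = ["cat", "tree", "blue", "runs"]
corrupted, targets = bert_mask(tokens, vocab, select_prob=0.15, seed=0)
print(corrupted)
print(targets)
```

Only the positions recorded in `targets` would receive a loss term during pretraining.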

In [None]:
# @title
url = "https://i.postimg.cc/9fFk1wc7/Screenshot-2024-06-25-at-23-29-40.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=700 height=600></div>')
display(centered_image)

**Details about BERT**

Two models were released:
- BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
- BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
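
The parameter counts above can be roughly reproduced with some arithmetic. This sketch assumes details not stated in these notes: a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, feed-forward widths of 4x the hidden size (3072 and 4096), and a final pooler layer. It leaves out the pretraining output heads. The number of attention heads does not change the count.

```python
def bert_param_count(layers, hidden, ffn, vocab=30522, max_pos=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (assumed layout)."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

base = bert_param_count(layers=12, hidden=768, ffn=3072)
large = bert_param_count(layers=24, hidden=1024, ffn=4096)
print(f"BERT-base:  {base / 1e6:.1f}M")   # 109.5M, usually quoted as 110M
print(f"BERT-large: {large / 1e6:.1f}M")  # 335.1M, usually quoted as 340M
```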

Trained on:
- BooksCorpus (800 million words)
- English Wikipedia (2,500 million words)

Pretraining is expensive and impractical on a single GPU.
- BERT was pretrained with 64 TPU chips for a total of 4 days.
- (TPUs are special tensor operation acceleration hardware)

Finetuning is practical and common on a single GPU
- “Pretrain once, finetune many times.”

**Limitations of pretrained encoders**

Those results looked great! Why not use pretrained encoders for everything?

If your task involves generating sequences, consider using a pretrained decoder; BERT and other pretrained encoders don’t naturally lead to nice autoregressive (1-word-at-a-time) generation methods.

In [None]:
# @title
url = "https://i.postimg.cc/zvcDXYpd/Screenshot-2024-06-25-at-23-33-41.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=1400 height=400></div>')
display(centered_image)

**Extensions of BERT**

You’ll see a lot of BERT variants like RoBERTa, SpanBERT, and more.

Some generally accepted improvements to the BERT pretraining formula:

- RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
- SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task

In [None]:
# @title
url = "https://i.postimg.cc/7hFvWHKy/Screenshot-2024-06-25-at-23-40-15.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=1000 height=400></div>')
display(centered_image)

A takeaway from the RoBERTa paper: more compute and more data can improve pretraining even when the underlying Transformer encoder is unchanged.

## 2. Encoder-Decoders

In [None]:
# @title Encoder - Decoders
url = "https://i.postimg.cc/B6fqXCT0/Screenshot-2024-06-25-at-23-10-00.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=500 height=300></div>')
display(centered_image)

For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted.

The encoder portion can benefit from bidirectional context; the decoder portion is used to train the whole model through language modeling, autoregressively predicting and then conditioning on one token at a time.
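
One way to picture this setup is to split each training sequence into a prefix for the encoder and a continuation for the decoder. The sketch below is illustrative only: the function name, the `<s>` start symbol, and the fixed split point are assumptions.

```python
def prefix_lm_example(tokens, prefix_len):
    """Split a sequence into encoder input and decoder input/target."""
    encoder_input = tokens[:prefix_len]   # seen with bidirectional context, never predicted
    continuation = tokens[prefix_len:]    # predicted one token at a time
    # Teacher forcing: the decoder sees the previous gold token at each step.
    decoder_input = ["<s>"] + continuation[:-1]
    decoder_target = continuation
    return encoder_input, decoder_input, decoder_target

enc, dec_in, dec_out = prefix_lm_example(
    ["The", "cat", "sat", "on", "the", "mat"], prefix_len=3
)
print(enc)      # ['The', 'cat', 'sat']
print(dec_in)   # ['<s>', 'on', 'the']
print(dec_out)  # ['on', 'the', 'mat']
```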

In [None]:
# @title
url = "https://i.postimg.cc/BQFf1FvZ/Screenshot-2024-06-25-at-23-42-52.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=600 height=500></div>')
display(centered_image)

## 3. Decoders

In [None]:
# @title Decoders
url = "https://i.postimg.cc/pVZ2Lzyw/Screenshot-2024-06-25-at-23-10-03.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=500 height=200></div>')
display(centered_image)

It’s natural to pretrain decoders as language models and then use them as generators, finetuning their $p_\theta(w_t|w_{1:t-1})$.

This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time:
- Dialogue (context=dialogue history)
- Summarization (context=document)

# GPT

2018’s GPT was a big success in pretraining a decoder!

- Transformer decoder with 12 layers, 117M parameters.
- 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
- Byte-pair encoding with 40,000 merges
- Trained on BooksCorpus: over 7000 unique books.
- Contains long spans of contiguous text, for learning long-distance dependencies.
- The acronym “GPT” never showed up in the original paper; it could stand for “Generative PreTraining” or “Generative Pretrained Transformer”
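
The byte-pair encoding mentioned in the list above builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols. Below is a generic, character-level sketch of that merge loop. It is not GPT's exact tokenizer, and the toy word list and function name are made up for illustration.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(words, num_merges=2)
print(merges)
print(dict(vocab))
```

With 40,000 merges on a large corpus, frequent words end up as single tokens while rare words are split into subword pieces.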

# GPT-2

# GPT-3

So far, we’ve interacted with pretrained models in two ways:
- Sample from the distributions they define (maybe providing a prompt)
- Fine-tune them on a task we care about and take their predictions.

Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.

GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters.
GPT-3 has 175 billion parameters.
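
In-context learning amounts to placing worked examples in the prompt and letting the model continue the pattern, with no gradient updates. Here is a minimal sketch of building such a few-shot prompt. The `Input:`/`Output:` template and the function name are assumptions, not a format required by any particular model.

```python
def build_few_shot_prompt(examples, query):
    """Format (input, output) demonstrations followed by an unanswered query."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")  # the model is expected to continue from here
    return "\n\n".join(blocks)

examples = [
    ("I loved this movie", "positive"),
    ("The plot made no sense", "negative"),
]
prompt = build_few_shot_prompt(examples, "The acting was wonderful")
print(prompt)
```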

ChatGPT/GPT-4/GPT-3.5 Turbo introduced a further instruction-tuning idea that we
cover in the next lecture.

# Scaling Efficiency

GPT-3 was 175B parameters and trained on 300B tokens of text.

Roughly, the cost of training a large transformer scales as parameters*tokens

Did OpenAI strike the right balance between parameters and tokens to get the best model? No.
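
A back-of-the-envelope version of the cost estimate above. The factor of 6 FLOPs per parameter per token is a commonly used approximation (forward plus backward pass) and is an assumption here, not something stated in these notes.

```python
def training_flops(params, tokens, flops_per_param_token=6):
    """Rough training cost: proportional to parameters * tokens."""
    return flops_per_param_token * params * tokens

gpt3_params = 175e9
gpt3_tokens = 300e9

flops = training_flops(gpt3_params, gpt3_tokens)
tokens_per_param = gpt3_tokens / gpt3_params

print(f"Estimated training compute: {flops:.2e} FLOPs")
print(f"Tokens per parameter: {tokens_per_param:.2f}")
```

For comparison, compute-optimal analyses such as Hoffmann et al. (2022) suggest a much higher ratio, on the order of 20 tokens per parameter. That figure comes from outside these notes and is a rule of thumb.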

In [1]:
# @title
from IPython.display import HTML, display  # imported here so the cell also runs on its own

url = "https://i.postimg.cc/3RWxMwFy/Screenshot-2024-06-25-at-23-58-23.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=900 height=300></div>')
display(centered_image)

In [None]:
# @title
url = "https://i.postimg.cc/d3jtJhDw/Screenshot-2024-06-26-at-00-18-57.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=1000 height=500></div>')
display(centered_image)

In [None]:
# @title
url = "https://i.postimg.cc/L6m8N30X/Screenshot-2024-06-26-at-00-18-45.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=1200 height=500></div>')
display(centered_image)

# What kinds of things does pretraining teach?

In [None]:
# @title
url = "https://i.postimg.cc/LXfrXjqH/Screenshot-2024-06-10-at-02-13-41.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=1000 height=500></div>')
display(centered_image)

In [None]:
# @title
url = "https://i.postimg.cc/2jQsNFRG/Screenshot-2024-06-10-at-02-15-32.png"
centered_image = HTML(f'<div style="text-align:center;"><img src="{url}" width=1000 height=500></div>')
display(centered_image)

In [None]:
# @title
video_id = "zjkBMFhNj_g"  # Replace with the actual YouTube video ID

embed_code = f'<iframe width="640" height="360" src="https://www.youtube.com/embed/{video_id}" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>'

centered_video = HTML(f'<div style="text-align:center;">{embed_code}</div>')
display(centered_video)

# asd

- Modern LLM pretraining
- In-context learning / chain of thought
