# Explanation

The GPT-2 and GPT-3 papers are included for their historical significance, and I doubt too much context is needed.

The later papers in the GPT series don't introduce much that is new in terms of architecture or approach to building models.

Instead, they primarily focus on executing the bet on the scaling laws hypothesis and showing the impressive results.

OpenAI as a company pivoted to focus on the scaling laws hypothesis well before it became obvious to the public that this was the path toward increased intelligence. As expected, most people thought that scaling up the parameters without much else could not produce better results.

The GPT series as a whole developed alongside BERT, RoBERTa, T5 and the other large language models being trained at the time, and of course most notably builds on the transformer architecture.

GPT-2, with 1.5B parameters, achieved state-of-the-art results again (after BERT), this time on 7 of the 8 language modeling datasets it was tested on, in a zero-shot setting. This was another demonstration of the power of large-scale pre-training, here without any fine-tuning, and it showed early signs of generality: the model handled many tasks it wasn't explicitly trained on.

Then GPT-3, over 100x larger with 175B parameters, matched or beat the state of the art on several benchmarks and showed that language models can learn a variety of tasks with only a few examples (hence the title "Language Models are Few-Shot Learners").

Despite the impressive results, GPT-3 didn't quite break into the mainstream until the creation of InstructGPT and the release of the ChatGPT interface.

# My Notes

## 📜 [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) [GPT-2]

> We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText.

Given enough training data, language models start to learn, in an unsupervised way, tasks that typically required supervised learning and fine-tuning.

> Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still under-fits WebText.

> Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems.

> It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.

Multitask learning is promising for approaching general intelligence in language models, but it is expensive to create labeled datasets for it.

> We demonstrate language models can perform down-stream tasks in a zero-shot setting – without any parameter or architecture modification.

### Approach

> Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted.

> The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement.

This is the OpenAI hypothesis that leads to all of their scaling laws research. The intuition is that the internet already holds a huge amount of data, and that a model trained on this data will learn more than people expect.
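To make the "tasks demonstrated in natural language sequences" idea concrete: the paper describes writing a translation example as the sequence (translate to french, english text, french text). Below is a rough sketch of that framing. The function name `as_sequence`, the newline separator, and the example sentences are my own assumptions, not the paper's exact format.

```python
def as_sequence(task: str, source: str, target: str = "") -> str:
    """Flatten (task, input, output) into one plain-text sequence.

    The newline separator is an illustrative choice, not the paper's format.
    """
    parts = [task, source]
    if target:
        parts.append(target)
    return "\n".join(parts)


# A "training example" is just text the model might encounter in its corpus.
train_example = as_sequence("translate to french", "the cat sleeps", "le chat dort")

# At test time the same format is used, with the output left for the model to predict.
zero_shot_prompt = as_sequence("translate to french", "good morning") + "\n"

print(train_example)
print(repr(zero_shot_prompt))
```

The point is that nothing in the model changes between tasks. The task description is part of the text, so predicting the next tokens well implies performing the task.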

**1. Training Dataset**

> Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible.
> A promising source of diverse and nearly unlimited text is web scrapes such as Common Crawl.

> Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans.

Better taste in curating the dataset (as judged by humans) improves the quality of the model. This fits the broader trend of improving both the quality of web scrapes and the model size to improve the model itself.
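For WebText, the paper's human-curation heuristic was to keep outbound Reddit links that received at least 3 karma. A minimal sketch of that kind of filter follows. The `Link` type, the function name, and the URL-level de-duplication are my own simplifications, and the real pipeline involved more steps (content extraction, further cleaning).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    url: str
    karma: int


MIN_KARMA = 3  # threshold described in the GPT-2 paper


def curated_urls(links, min_karma=MIN_KARMA):
    """Keep unique URLs whose post met the karma threshold, preserving order."""
    seen = set()
    kept = []
    for link in links:
        if link.karma >= min_karma and link.url not in seen:
            seen.add(link.url)
            kept.append(link.url)
    return kept


# Hypothetical data, only to show the filter's behaviour.
sample = [
    Link("https://example.com/a", 5),
    Link("https://example.com/b", 1),
    Link("https://example.com/a", 7),
    Link("https://example.com/c", 3),
]
print(curated_urls(sample))
```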

**2. Input Representation**

> We prevent BPE from merging across character categories for any byte sequence. We add an exception for spaces which significantly improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens.
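Here is a rough sketch of what "preventing merges across character categories" can look like in practice: split the text into chunks of letters, digits, or punctuation (letting a single leading space attach to the chunk that follows), then run BPE merges only inside each chunk. The pattern below is my approximation using the standard library `re` module. GPT-2's released encoder uses a similar pattern built on Unicode property classes from the third-party `regex` package, so treat this as illustrative rather than identical.

```python
import re

# Approximate pre-tokenization pattern (an assumption, not GPT-2's exact regex):
# contractions | optional space + letters | optional space + digits |
# optional space + punctuation | whitespace
PRETOKENIZE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"
    r"| ?[^\W\d_]+"
    r"| ?\d+"
    r"| ?(?:[^\s\w]|_)+"
    r"|\s+(?!\S)|\s+"
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges would only happen inside a chunk."""
    return PRETOKENIZE.findall(text)


def to_byte_chunks(text: str) -> list[bytes]:
    """Byte-level view: each chunk becomes the UTF-8 bytes that BPE operates on."""
    return [chunk.encode("utf-8") for chunk in pre_tokenize(text)]


text = "Hello world, it's 2024!"
chunks = pre_tokenize(text)
print(chunks)
print(to_byte_chunks(text))
```

Because the space is allowed to stick to the following word, " world" stays a single chunk, which is the exception the quote credits with better compression. Letters, digits, and punctuation never share a chunk, so a token like "dog!" cannot be formed by a merge.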

### Experiments

> Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT.

![Screenshot 2024-05-16 at 11.59.59 AM.png](../../images/Screenshot_2024-05-16_at_11.59.59_AM.png)

### Discussion

> Much research has been dedicated to learning, understanding, and critically evaluating the representations of both supervised and unsupervised pre-training methods. Our results suggest that unsupervised task learning is an additional promising area of research to explore.

> While zero-shot performance establishes a baseline of the potential performance of GPT-2 on many tasks, it is not clear where the ceiling is with fine-tuning.

### Conclusion

> When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets.


## 📜 [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165) [GPT-3]

> Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.

> Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting.

Again, GPT-3, like GPT-2, is less an introduction of new research methods than a practical application of a principle realized earlier: scaling is the direction of improvement.

> We discuss broader societal impacts of this finding and of GPT-3 in general.

> First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models.

> Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution.

> Third, humans do not require large supervised datasets to learn most language tasks.

> Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.

> In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call GPT-3, and measuring its in-context learning abilities.
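In-context learning means the "training examples" are placed in the prompt itself and the model's weights are never updated. A small sketch of how zero-shot, one-shot, and few-shot prompts differ only in the number of demonstrations is below. The `Q:`/`A:` layout, the function name, and the arithmetic examples are my own illustrative choices, not a format taken from the paper.

```python
def build_prompt(instruction: str, demos: list[tuple[str, str]], query: str) -> str:
    """Build a prompt with K demonstrations followed by the unanswered query.

    K = 0 is zero-shot, K = 1 is one-shot, K > 1 is few-shot.
    """
    blocks = [instruction] if instruction else []
    for question, answer in demos:
        blocks.append(f"Q: {question}\nA: {answer}")
    blocks.append(f"Q: {query}\nA:")  # the model continues from here
    return "\n\n".join(blocks)


demos = [("2 + 2", "4"), ("7 + 5", "12")]

zero_shot = build_prompt("Add the numbers.", [], "3 + 9")
one_shot = build_prompt("Add the numbers.", demos[:1], "3 + 9")
few_shot = build_prompt("Add the numbers.", demos, "3 + 9")

print(few_shot)
```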

### Results

All you need is scale! These training curves show that increasing model parameters significantly decreases cross-entropy loss.

![Screenshot 2024-05-16 at 12.57.05 PM.png](../../images/Screenshot_2024-05-16_at_12.57.05_PM.png)
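Scaling-law results of this kind are usually summarized as a power law, where loss falls as a fixed power of model size. A toy sketch of that shape is below, assuming a form like L(N) = (N_c / N)^alpha. The constants `n_c` and `alpha` are placeholder values I picked for illustration, not fitted numbers from the paper.

```python
import math


def power_law_loss(n_params: float, n_c: float = 1e14, alpha: float = 0.08) -> float:
    """Toy scaling law L(N) = (n_c / N) ** alpha with placeholder constants."""
    return (n_c / n_params) ** alpha


# Parameter counts mentioned in these notes.
gpt2_params = 1.5e9
gpt3_params = 175e9

loss_gpt2 = power_law_loss(gpt2_params)
loss_gpt3 = power_law_loss(gpt3_params)

# Under a power law, every 10x in parameters multiplies loss by the same factor.
per_decade_factor = power_law_loss(10 * gpt2_params) / loss_gpt2

print(f"toy loss at 1.5B params: {loss_gpt2:.3f}")
print(f"toy loss at 175B params: {loss_gpt3:.3f}")
print(f"loss multiplier per 10x params: {per_decade_factor:.3f}")
```

The takeaway from the shape alone is that gains are steady but slow: each 10x increase in parameters buys the same multiplicative reduction in loss, which shows up as a straight line on a log-log plot.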

### Limitations

> First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks.

> A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pre-training objective.

> Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on.

### Broader Impacts

This section doesn’t say anything too unintuitive. It seems to be more about the optics of showing that they have considered the safety aspects of the model, now that it has reached a level that’s potentially harmful.

However, it’s interesting to see how quickly society has adjusted, through memetics, to its awareness of AI and the algorithms working around it. Social media has rapidly scaled awareness of AI.

### Conclusion

> We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly.

> Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.