# BERTology: Putting it all Together

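Together, we have come a long way since we started tinkering with `spacy` in Chapter 1. We began by solving the most common NLP problems using libraries that are the deep learning equivalent of a microwave meal, and then proceeded to the lower-level details, including tokenization and embeddings. Along the way, we covered recurrent networks, including RNNs, LSTMs, and GRUs, as well as the Transformer architecture and attention mechanisms.

In many ways, this chapter is the grand finale. We will tie all the pieces together and retrace the steps that led to NLP's so-called ImageNet moment in 2018. Since then, there has been a great deal of excitement about the potential commercial applications of these advances in NLP. We will touch on some of these possibilities, too. Let's get started.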

## ImageNet

It's worth taking a moment to clarify what we mean by "ImageNet moment." ImageNet is a computer vision dataset that was originally published in 2009. It became a benchmark for progress in image classification, a core computer vision task, and spawned an annual computer vision competition to see which research team could identify objects in the dataset's images with the lowest error rate.

The high visibility of the competition helped spur significant advances in the field of computer vision starting in 2010. From 2010 through 2017, the winning accuracy jumped from 71.8% to 97.3%, surpassing human ability and capturing the world's imagination as to what machine learning could do.

Looking back, 2012 was the breakthrough year for computer vision and the so-called original ImageNet moment. In 2012, a team from the University of Toronto, led by Geoffrey Hinton, Ilya Sutskever, and Alex Krizhevsky, beat the rest of the field by a margin of 10.8 percentage points.

This performance shocked the AI research community and began to convince more commercially minded enterprises to pay more attention to computer vision. Over the next few years, enterprises used pretrained ImageNet models to solve a wide array of computer vision tasks, including tasks the models were not trained for explicitly. In other words, ImageNet was when computer vision broke through a performance and ease-of-application barrier that helped draw the attention of the rest of the world.

### The Power of Pretrained Models 

In much the same way that 2012 was the breakout year for computer vision, 2018 was the breakout year for NLP. This was the year the world started to pay much more serious attention to what NLP could accomplish in the enterprise, and since then there has been very active interest in NLP for commercial applications. Let's retrace the events that led to this breakout moment for NLP, its so-called ImageNet moment.

Prior to 2018, the mainstream view was that NLP models had to be trained mostly from scratch in order to solve specific NLP tasks. There was little to reuse from other language models to help develop a model for your specific task. The only thing that was of value to transfer from other language models was pretrained word embeddings, which could help your model get started but provided limited value.

This presented a major problem for solving specific NLP tasks in the enterprise because, to train a model mostly from scratch, you needed a lot of annotated data for the specific task at hand. Without this large volume of annotated data, you would not be able to train a model from scratch to sufficiently good levels of performance. But getting such a large volume of annotated data was a nonstarter for many companies, limiting the applicability of NLP in the enterprise.

This train-from-scratch paradigm in NLP contrasted sharply with the leverage-pretrained-models paradigm in computer vision, which became dogma by late 2017. In computer vision, it was considered foolish to train models from scratch. Instead, applied machine learning engineers would leverage the first several layers of large, pretrained computer vision models, which had already learned some of the basic elements of computer vision such as identifying edges and shapes, to develop computer vision models for their specific task.

Transferring some of the "knowledge" from these pretrained models to new models required less annotated data than would otherwise be necessary and improved the adoption of computer vision in the enterprise. Unfortunately, as of the end of 2017, such a transfer of knowledge from pretrained models was not possible in NLP, requiring teams to gather a lot of annotated data to train their specific models from scratch.

## The Path to NLP’s ImageNet Moment

In 2018, the mainstream view changed dramatically as NLP researchers showed that pretrained language models could be used to achieve state-of-the-art results on a wide range of NLP tasks; you did not need to train language models from scratch to solve specific NLP problems. This led to a watershed moment for NLP because now applied machine learning teams could leverage pretrained language models to solve a wide array of NLP tasks, just like computer vision engineers were leveraging pretrained ImageNet models to solve a wide array of computer vision tasks. By reusing several layers of pretrained language models, applied NLP scientists and engineers needed much less annotated data to solve specific NLP problems. Previously intractable problems in NLP became ripe for solving.

To understand what led to this breakthrough moment in NLP, let's retrace the progress in NLP over the last several years. This will help us tie together the major concepts across this book, deepening your understanding of the field.

## Pretrained Word Embeddings

One of the first steps in NLP is tokenization, which we covered in Chapter 5. Tokenization breaks down text into discrete units (e.g., words and punctuation), after which we can apply NLP algorithms to learn the structure of the text, including how to represent each token.

Learning how to represent each token is generally the second step in NLP. This process is called learning word embeddings (i.e., word vectors), which we covered in Chapter 6. Word embeddings are vital in NLP because they capture the relationships between words. Unless a model learns the relationships between words, it cannot perform more complex NLP tasks, such as text classification, well.

Prior to 2013, NLP researchers had to train their own word embeddings from scratch for much of the work they did. Starting in 2013, pretrained word embeddings began to rise in prominence, allowing NLP researchers to leverage them for model development, speeding up their training process.

### The Limitations of One-Hot Encoding

Before we dive into these pretrained word embeddings, let's revisit why simple one-hot encoding of words is not an optimal approach to generating meaningful word vectors. If we applied one-hot encoding to every word in a large corpus, the dimensionality of the encoding matrix would equal the number of unique words, which would be massive and impractical to work with.

For example, if our corpus had a vocabulary of 400,000 unique words, our one-hot encoding matrix would have 400,000 dimensions, which is very large. This matrix would be sparse (mostly zeros) and would suffer from the curse of dimensionality: because the matrix is both large and sparse, parameter estimation becomes more difficult, and we would need a lot of data to train a model that generalizes well.

Aside from the high dimensionality, a one-hot encoding matrix would not capture any of the semantic properties of words. For example, "queen" and "king" would have vectors that are orthogonal, implying that they are completely different when they are in fact related.
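
The orthogonality problem is easy to see with a toy vocabulary. The sketch below is illustrative only: the four-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for this example, not part of any library.

```python
# Toy vocabulary (hypothetical); a real corpus would have far more entries.
vocab = ["king", "queen", "apple", "river"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a one-hot vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


king, queen, apple = one_hot("king"), one_hot("queen"), one_hot("apple")

# Any two distinct words have a dot product of 0 (orthogonal vectors), so
# "king" looks exactly as unrelated to "queen" as it does to "apple".
print(dot(king, queen))  # 0
print(dot(king, apple))  # 0
print(dot(king, king))   # 1
```

Because every pair of distinct words scores the same, no notion of "more similar" or "less similar" can be read off one-hot vectors.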

In contrast, word embeddings trained with algorithms such as Word2Vec, GloVe, and fastText store contextual information in a much lower-dimensional space. For the same vocabulary of 400,000 unique words, we could store the contextual information for each word using just a few hundred dimensions, far fewer than the 400,000 dimensions required by one-hot encoding.
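
A quick back-of-the-envelope calculation makes the size difference concrete. The sketch assumes 300 dimensions for the dense embeddings (one plausible value for "a few hundred") and counts entries as if both matrices were stored densely; in practice a one-hot matrix would usually be kept in a sparse format.

```python
vocab_size = 400_000   # vocabulary size from the running example
embedding_dim = 300    # assumed value for "a few hundred dimensions"

# One row per word in both cases; only the row length differs.
one_hot_entries = vocab_size * vocab_size
dense_entries = vocab_size * embedding_dim

print(f"{one_hot_entries:,}")                  # 160,000,000,000
print(f"{dense_entries:,}")                    # 120,000,000
print(round(one_hot_entries / dense_entries))  # 1333
```

Under these assumptions, the dense embedding table is roughly three orders of magnitude smaller.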

Moreover, the word embeddings trained by Word2Vec, GloVe, and fastText store semantic information for each word, unlike one-hot encoding. Words such as "queen" and "king" have vectors that are closer together in space, implying that there is some semantic relationship or similarity between the two. By capturing this semantic property, these word embeddings capture more of the structure in language than one-hot encodings do, which is how they helped materially advance the field of NLP starting in 2013. That was the year pretrained word embeddings that capture semantic information became widely available to researchers in the NLP community.

### Word2Vec

> A word is characterized by the company it keeps.
> - John Rupert Firth

In 2013, pretrained word embeddings became popular with the rise of Word2Vec, the first of the major word embedding algorithms. As you may recall from Chapter 6, Word2Vec is a highly efficient algorithm that is used to learn word associations from a large corpus of text. Each distinct word is represented with a vector (hence Word2Vec, which is short for "word to vector"). You can think of Word2Vec and other word embedding algorithms as unsupervised feature extractors for words.

Word2Vec learns how to represent each word with a vector based on the surrounding context of each word; in other words, the words around the target word help define the vector representation for the target word. There are two approaches to do this: continuous bag of words (CBOW), which uses a neural network to predict the target word given its surrounding words, and skip-gram, which uses a neural network to predict the surrounding words given a target word (the opposite of CBOW).
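
To make the two directions concrete, the sketch below builds training examples from a single sentence: CBOW pairs a window of surrounding words with the center word it should predict, while the opposite setup, commonly called skip-gram, pairs the center word with each surrounding word. This shows only how the examples are formed, not the neural network itself; the function names, the sample sentence, and the window size of 2 are assumptions chosen for illustration.

```python
def context_window(tokens, position, window_size):
    """Return the tokens within window_size positions of tokens[position]."""
    left = tokens[max(0, position - window_size):position]
    right = tokens[position + 1:position + 1 + window_size]
    return left + right


def cbow_examples(tokens, window_size=2):
    """CBOW: the input is the surrounding context, the label is the center word."""
    return [(context_window(tokens, i, window_size), target)
            for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window_size=2):
    """Skip-gram: the input is the center word, the label is one context word."""
    return [(target, context_word)
            for i, target in enumerate(tokens)
            for context_word in context_window(tokens, i, window_size)]


tokens = "the queen rules the kingdom".split()

print(cbow_examples(tokens)[1])
# (['the', 'rules', 'the'], 'queen')
print(skipgram_examples(tokens)[:3])
# [('the', 'queen'), ('the', 'rules'), ('queen', 'the')]
```

Either way, words that occur in similar windows end up with similar training signals, which is why they end up with similar vectors.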

The magic of Word2Vec is that semantically similar words have similar vectors (i.e., numerical representations) because the words appear in similar contexts. In other words, in a high-dimensional space, words that have similar meanings, such as "queen" and "king", have similar representations and are therefore located closer together.

Instead of having to learn word embeddings from scratch, ML engineers could use the pretrained word embeddings trained by Word2Vec in their model development, leveraging some of the "learning" done beforehand. This emergence of pretrained word embeddings helped ML engineers because they did not have to start model development entirely from scratch.

Despite its successes, Word2Vec has shortcomings. First, it learns from a relatively small context window around each word, which is narrow relative to the entire document. Second, it does not consider subword information, which means that it cannot efficiently learn, for example, how a noun and an adjective that are derived from the same subword are related. For instance, "intelligent" and "intelligence" share the subword "intelligen" and are related as a result, sharing similar semantic information.
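
The subword overlap that a whole-word model ignores can be made visible by splitting words into character n-grams. The sketch below uses trigrams with `<` and `>` as word-boundary markers; the helper `char_ngrams`, the choice of n = 3, and the boundary markers are assumptions for this illustration, in the spirit of subword-aware embedding methods rather than a reproduction of any particular one.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


adjective = char_ngrams("intelligent")
noun = char_ngrams("intelligence")
shared = adjective & noun

# A whole-word vocabulary sees two unrelated symbols...
print("intelligent" == "intelligence")  # False
# ...while most of their character trigrams are identical.
print(len(shared), "of", len(adjective | noun), "trigrams are shared")
print(sorted(shared))
```

A model that builds word vectors from such pieces can share what it learns across related word forms, which Word2Vec cannot do.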

Third, Word2Vec cannot handle out-of-vocabulary (OOV) words; it can only vectorize words that it has seen in training. Finally, Word2Vec cannot disambiguate the context-specific semantic properties of words. For example, with Word2Vec, the word "bank" has the same word vector regardless of whether it appears in the financial setting ("I deposited a check at the bank") or in the river setting ("I sat on the river bank after fishing").
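
Both of these limitations follow from the fact that a static embedding model is, in effect, a lookup table from words to vectors. The sketch below mimics that with a plain dictionary; the three-dimensional vectors are made-up placeholder values, not real Word2Vec output, and `lookup` is a hypothetical helper.

```python
# Hypothetical toy embedding table; the numbers are placeholders, and real
# Word2Vec vectors typically have a few hundred dimensions.
embeddings = {
    "bank": [0.2, -0.5, 0.7],
    "river": [0.9, 0.1, -0.3],
    "check": [-0.4, 0.8, 0.2],
}


def lookup(word):
    """Return the stored vector for a word, or None if it is out of vocabulary."""
    return embeddings.get(word)


financial = "i deposited a check at the bank".split()
river = "i sat on the river bank after fishing".split()

# The lookup ignores the rest of the sentence, so "bank" gets one vector
# no matter which sense is intended.
bank_in_financial = lookup(financial[financial.index("bank")])
bank_in_river = lookup(river[river.index("bank")])
print(bank_in_financial == bank_in_river)  # True

# A word that was never seen in training has no entry at all.
print(lookup("cryptocurrency"))  # None
```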

> Generally, pretrained word embeddings have a few hundred dimensions (typically three hundred). The more dimensions, the more subtle the representations the word embedding algorithm can capture, but this comes at the cost of computation speed and increased complexity. If you want a better-performing model, it is better to use a word embedding matrix with more dimensions rather than fewer, for whichever word embedding algorithm you choose. If you want a faster, more computationally efficient model, it is better to use a matrix with fewer dimensions, all else being equal.

### GloVe

### fastText

### Context-aware Pretrained Word Embeddings

## Sequential Models

### Sequential Data and the Importance of Sequential Models

## RNNs

### Vanilla RNNs

### LSTMs

### GRUs

## Attention Mechanisms

## Transformers

### Transformer-XL

## NLP’s ImageNet Moment

### ULMFiT

### ELMo

### BERT

### BERTology

### GPT-1, GPT-2, GPT-3

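OpenAI also jumped into the NLP race by designing its own Transformer-based models. These models are known as GPT models, short for Generative Pretrained Transformer. The first GPT model, GPT-1, was released in 2018 and used an unsupervised pretraining and supervised fine-tuning process similar to that of ULMFiT. GPT-2 was released in 2019. It was trained on more data and had more parameters than its predecessor, which helped it achieve state-of-the-art performance on many tasks in zero-shot settings.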

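> In zero-shot learning, the model is given no examples to learn from but must understand the task to be performed from the instructions provided. For example, a zero-shot learning task may require a model to translate English sentences to German without being given any English-German sentence pairs to learn from. In few-shot learning, the model is given a few examples to learn from, but generally not many.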

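OpenAI released GPT-3 in 2020. Compared to GPT-2, GPT-3 was trained on a much larger dataset and had many more parameters. GPT-3 outperformed the previous models and set a new standard for zero-shot and few-shot learning. At the time of writing, it is considered the best-performing generative NLP model.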
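
With GPT-style models, one common way to set up zero-shot and few-shot tasks is to put everything in the text prompt: an instruction alone for zero-shot, or an instruction followed by a handful of worked examples for few-shot. The sketch below only assembles such prompts as strings and does not call any model. The `build_prompt` helper and the "English:" / "German:" layout are assumptions for illustration, not a format required by any particular model.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a text prompt; with no examples this is the zero-shot case."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"German: {target}")
    lines.append(f"English: {query}")
    lines.append("German:")  # left open for the model to complete
    return "\n".join(lines)


instruction = "Translate English to German."
query = "Where is the library?"

# Zero-shot: the instruction only, with no translated pairs.
zero_shot = build_prompt(instruction, query)

# Few-shot: the same instruction plus a few translated pairs.
few_shot = build_prompt(
    instruction,
    query,
    examples=[("Good morning.", "Guten Morgen."), ("Thank you.", "Danke.")],
)

print(zero_shot)
print("---")
print(few_shot)
```

In both cases the model's weights are unchanged; the only difference is how much guidance the prompt carries.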

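> As you can see, models have grown larger and have been trained on more and more data over the years. Together with design changes, this growth in model and dataset size has helped push state-of-the-art performance in NLP.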

## Conclusion

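In this chapter, we tied together all the major concepts in the book, including word embeddings, RNNs, attention mechanisms, the Transformer architecture, and contextualized word representations. Collectively, these advances helped bring about NLP's ImageNet moment in 2018, when large, pretrained language models became publicly available and set new performance records on NLP benchmarks.

The emergence of pretrained language models has allowed applied NLP engineers to fine-tune large models on domain-specific NLP tasks and achieve remarkable performance. Now that we have covered the major NLP concepts you need to know to develop NLP models, let's discuss how to productionize them once they are developed.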