# BERTology: Putting it all Together

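Together, we've come a long way since we started fiddling with `spacy` in Chapter 1. We started by solving the most common NLP problems using libraries, the deep learning equivalent of microwave meals, and then proceeded to the low-level details, including tokenization and embeddings. Along the way, we covered recurrent networks, including RNNs, LSTMs, and GRUs, as well as the Transformer architecture and attention mechanisms.

In many ways, this chapter is the grand finale. We will tie all the pieces together and retrace the steps that led to the so-called ImageNet moment of 2018, which has since generated a great deal of excitement about the potential commercial applications of these advances in NLP. We will touch on some of these possibilities, too. Let's get started.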

## ImageNet

It's worth taking a moment to clarify what we mean by "ImageNet moment." ImageNet is a computer vision dataset that was originally published in 2009. It became a benchmark for progress in image classification, a core computer vision task, and spawned an annual computer vision competition to see which research team could identify objects in the dataset's images with the lowest error rate.

The high visibility of the competition helped spur significant advances in the field of computer vision starting in 2010. From 2010 through 2017, the winning accuracy jumped from 71.8% to 97.3%, surpassing human ability and capturing the world's imagination as to what machine learning could do.

Looking back, 2012 was the breakthrough year for computer vision and the so-called original ImageNet moment. In 2012, a team from the University of Toronto, led by Geoffrey Hinton, Ilya Sutskever, and Alex Krizhevsky, beat the rest of the field by a 10.8% margin.

This performance shocked the AI research community and began to convince more commercially minded enterprises to pay more attention to computer vision. Over the next few years, enterprises used pretrained ImageNet models to solve a wide array of computer vision tasks, including tasks the models were not trained for explicitly. In other words, ImageNet was when computer vision broke through a performance and ease-of-application barrier that helped draw the attention of the rest of the world.

### The Power of Pretrained Models 

In much the same way that 2012 was the breakout year for computer vision, 2018 was the breakout year for NLP. This was the year the world started to pay much more serious attention to what NLP could accomplish in the enterprise, and since then there has been very active interest in NLP for commercial applications. Let's retrace the events that led to this breakout moment for NLP, its so-called ImageNet moment.

Prior to 2018, the mainstream view was that NLP models had to be trained mostly from scratch in order to solve specific NLP tasks. There was little to reuse from other language models to help develop a model for your specific task. The only thing that was of value to transfer from other language models was pretrained word embeddings, which could help your model get started but provided limited value.

This presented a major problem for solving specific NLP tasks in the enterprise because, to train a model mostly from scratch, you needed a lot of annotated data for your specific task at hand. Without this large volume of annotated data, you would not be able to train a model from scratch to sufficiently good levels of performance. But getting such a large volume of annotated data was a nonstarter for many companies, limiting the applicability of NLP in the enterprise.

This train-from-scratch paradigm in NLP contrasted sharply with the leverage-pretrained-models paradigm in computer vision, which became dogma by late 2017. In computer vision, it was considered foolish to train computer vision models from scratch. Instead, applied machine learning engineers would leverage the first several layers of large, pretrained computer vision models, which had already learned some of the basic elements of computer vision such as identifying edges and shapes, to develop computer vision models for their specific task.

Transferring some of the "knowledge" from these pretrained models to new models required less annotated data than would otherwise be necessary and improved the adoption of computer vision in the enterprise. Unfortunately, as of the end of 2017, such a transfer of knowledge from pretrained models was not possible in NLP, requiring teams to gather a lot of annotated data to train their specific models from scratch.

## The Path to NLP’s ImageNet Moment

In 2018, the mainstream view changed dramatically as NLP researchers showed that pretrained language models could be used to achieve state-of-the-art results on a wide range of NLP tasks; you did not need to train language models from scratch to solve specific NLP problems. This led to a watershed moment for NLP because now applied machine learning teams could leverage pretrained language models to solve a wide array of NLP tasks, just like computer vision engineers were leveraging pretrained ImageNet models to solve a wide array of computer vision tasks. By reusing several layers of pretrained language models, applied NLP scientists and engineers needed much less annotated data to solve specific NLP problems. Previously intractable problems in NLP became ripe for solving.

To understand what led to this breakthrough moment in NLP, let's retrace the progress in NLP over the last several years. This will help us tie together the major concepts across this book, deepening your understanding of the field.

## Pretrained Word Embeddings

One of the first steps in NLP is tokenization, which we covered in Chapter 5. Tokenization breaks down text into discrete units (e.g., words, punctuation, etc.), after which we can apply NLP algorithms to learn the structure of the text, including how to represent each token.

Learning how to represent each token is generally the second step in NLP. This process is called `learning word embeddings` (i.e., word vectors), which we covered in Chapter 6. Word embeddings are vital in NLP because they capture the relationship between words. Unless a model learns the relationship between words, it cannot perform more complex NLP tasks, such as text classification, well.

Prior to 2013, NLP researchers had to train their own word embeddings from scratch for much of the work they did. Starting in 2013, pretrained word embeddings began to rise in prominence, allowing NLP researchers to leverage them for model development, speeding up their training process.

### The Limitations of One-Hot Encoding

Before we dive into these pretrained word embeddings, let's revisit why simple one-hot encoding of words is not a good approach for generating meaningful word vectors. If we had to apply one-hot encoding for every word in a large corpus, the dimensionality of the encoding matrix would be equal to the number of unique words, which would be massive and impractical to work with.

For example, if our corpus had a vocabulary of 400,000 unique words, our one-hot encoding matrix would have 400,000 dimensions, which is very large. This matrix would be a sparse matrix (mostly zero) and would suffer from the curse of dimensionality (e.g., we would need a lot of data to train a model that generalizes well because this matrix is both large and sparse, making parameter estimation more difficult).

Aside from the high dimensionality, a one-hot encoding matrix would not capture any of the semantic properties of words. For example, "queen" and "king" would have vectors that are orthogonal, implying that they are completely different when they are in fact related.
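
The orthogonality problem can be seen with a minimal sketch in plain Python. The four-word vocabulary and the helper names `one_hot` and `dot` are assumptions made purely for illustration; a real vocabulary would be far larger.

```python
def one_hot(index, vocab_size):
    """Return a one-hot vector with 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy vocabulary (hypothetical); a real one could have 400,000 entries.
vocab = ["king", "queen", "apple", "river"]
word_to_index = {word: i for i, word in enumerate(vocab)}

king = one_hot(word_to_index["king"], len(vocab))
queen = one_hot(word_to_index["queen"], len(vocab))

# Two different one-hot vectors never share a nonzero position, so their
# dot product is zero no matter how related the words are.
print(dot(king, queen))  # 0.0
print(dot(king, king))   # 1.0
```

Every pair of distinct words gets the same similarity score of zero, so the encoding cannot tell "queen" apart from "apple" in terms of how related each is to "king."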

In contrast, word embeddings trained with algorithms such as Word2Vec, GloVe, and fastText store contextual information in a much lower dimensional space. For the same vocabulary of 400,000 unique words, we could store the contextual information for each word using just a few hundred dimensions, far fewer than the 400,000 dimensions required by one-hot encoding.
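
A rough back-of-the-envelope calculation makes the size difference concrete. It assumes 300-dimensional embeddings, 4-byte (float32) values, and a one-hot matrix stored densely; practical systems usually store one-hot data sparsely or as integer indices, so treat this only as an illustration of scale.

```python
vocab_size = 400_000   # vocabulary size used in the example above
embedding_dim = 300    # assumed embedding width ("a few hundred dimensions")
bytes_per_value = 4    # assumed float32 storage

# One row per word in both cases; only the number of columns differs.
one_hot_bytes = vocab_size * vocab_size * bytes_per_value
dense_bytes = vocab_size * embedding_dim * bytes_per_value

print(f"one-hot matrix:   {one_hot_bytes / 1e9:,.0f} GB")  # 640 GB
print(f"dense embeddings: {dense_bytes / 1e6:,.0f} MB")    # 480 MB
```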

Moreover, the word embeddings trained by Word2Vec, GloVe, and fastText store semantic information for each word, unlike one-hot encoding. Words such as "queen" and "king" have vectors that are closer together in space, implying that there is some semantic relationship/similarity between the two. By capturing this semantic property, word embeddings trained by Word2Vec, GloVe, and fastText capture more of the structure in language compared to one-hot encodings, which is how these word embeddings helped materially advance the field of NLP starting in 2013. This was the year pretrained word embeddings that capture semantic information began to become widely available to the researchers in the NLP community.

### Word2Vec

> A word is characterized by the company it keeps.
> - John Rupert Firth

In 2013, pretrained word embeddings became popular with the rise of Word2Vec, the first of the major word embedding algorithms. As you may recall from Chapter 6, Word2Vec is a highly efficient algorithm that is used to learn word associations from a large corpus of text. Each distinct word is represented with a vector (hence Word2Vec, which is short for "word to vector"). You can think of Word2Vec and other word embedding algorithms as unsupervised feature extractors for words.

Word2Vec learns how to represent each word with a vector based on the surrounding context of each word; in other words, the words around the target word help define the vector representation for the target word. There are two approaches to do this: continuous bag of words (CBOW), which uses a neural network to predict the target word given the words that surround it, and skip-gram, which uses a neural network to predict the surrounding words given a target word (the opposite of CBOW).

The magic of Word2Vec is that semantically similar words have vectors (i.e., numerical representations) that are similar because the words appear in similar contexts. In other words, in a high-dimensional space, words that have similar meaning, such as "queen" and "king", have similar representations (i.e., vectors) and, therefore, are located closer together.
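
To make the difference between the two approaches concrete, here is a small sketch that builds only the training examples each approach would see, not the neural network itself. The function names and the toy sentence are hypothetical, and real implementations add details such as subsampling and negative sampling that are omitted here.

```python
def context_window(tokens, position, window):
    """Return the tokens within `window` positions of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: each example maps the context words to the target word."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: each example maps the target word to one context word."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_window(tokens, i, window)]


sentence = "the queen greeted the king".split()

print(cbow_examples(sentence, window=1)[1])
# (['the', 'greeted'], 'queen')
print(skipgram_examples(sentence, window=1)[:3])
# [('the', 'queen'), ('queen', 'the'), ('queen', 'greeted')]
```

The same window of text produces one CBOW example per target word but one skip-gram example per (target, context) pair, which is why the two approaches are mirror images of each other.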
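
"Closer together" is usually measured with cosine similarity. The sketch below uses three-dimensional vectors whose values were picked by hand for illustration; they are not real Word2Vec output, and real embeddings would have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-picked toy vectors for illustration only; NOT real Word2Vec output.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "river": [0.1, 0.2, 0.9],
}

king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
king_river = cosine_similarity(toy_embeddings["king"], toy_embeddings["river"])

# Related words should score higher than unrelated ones.
print(king_queen > king_river)  # True
```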

Instead of having to learn word embeddings from scratch, ML engineers could use the pretrained word embeddings trained by Word2Vec in their model development, leveraging some of the "learning" done beforehand. This emergence of pretrained word embeddings helped ML engineers because they did not have to start model development entirely from scratch.

Despite its successes, Word2Vec has shortcomings. First, it relies on a window-based model whose window is relatively small compared with the entire document. Second, it does not consider subword information, which means that it cannot efficiently learn, for example, how a noun and an adjective that are derived from the same subword are related. For instance, "intelligent" and "intelligence" share the subword "intelligen" and are related as a result, sharing similar semantic information.

Third, Word2Vec cannot handle out-of-vocabulary (OOV) words; it can only vectorize words that it has seen in training. Finally, Word2Vec cannot disambiguate the context-specific semantic properties of words. For example, with Word2Vec, the word "bank" has the same word vector regardless of whether it appears in the financial setting ("I deposited a check at the bank") or in the river setting ("I sat on the river bank after fishing").

> Generally, pretrained word embeddings have a few hundred dimensions (typically three hundred). The more dimensions, the more subtle the representations you can embed with the word embedding algorithms, but this comes at the cost of computation speed and increased complexity. If you want a better-performing model, it is better to use a word embedding matrix that has more dimensions rather than fewer for any given word embedding algorithm you choose. If you want a faster/more computationally efficient model, it is better to use a matrix that has fewer dimensions, all else being equal.

### GloVe

GloVe, short for Global Vectors, was the next major word embedding to come onto the scene; it launched in 2014, a year after Word2Vec. GloVe addressed the first major shortcoming of Word2Vec. Instead of relying on a small window-based model like Word2Vec, GloVe considered the word statistics of the entire corpus when learning the word embedding for each word.

GloVe works similarly to Word2Vec but uses a different approach to learn the vector representations for words. More specifically, GloVe uses unsupervised learning, generating a global co-occurrence matrix to learn the semantic property of the target word given the entire corpus it appears in.

Although GloVe addresses one of the shortcomings of Word2Vec, it still does not consider subword information.
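
The global co-occurrence matrix can be sketched as a simple counting pass over the whole corpus. This is only the statistics-gathering step under simplifying assumptions (whitespace tokenization, unweighted counts, a hypothetical two-sentence corpus); fitting vectors to these counts, which is the core of GloVe, is not shown.

```python
from collections import Counter


def cooccurrence_counts(corpus, window=2):
    """Count how often each ordered word pair appears within `window`
    tokens of each other, accumulated over every sentence in the corpus."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Hypothetical toy corpus for illustration.
corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("rules", "the")])   # 2
print(counts[("king", "rules")])  # 1
print(counts[("king", "queen")])  # 0
```

Because the counts are accumulated across all sentences, "king" and "queen" end up with similar rows (both co-occur with "the" and "rules") even though they never appear in the same sentence.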

### fastText

In 2016, Facebook launched fastText, the third major word embedding approach in recent years. fastText differs from Word2Vec and GloVe; instead of considering each word as the smallest unit, fastText uses n-gram characters as the smallest unit. In other words, fastText uses subword information to generate word embeddings. For example, the word "kingdom" could be broken down into n-gram characters such as "ki," "kin," "ing," "ngd," "gdo," and "dom."
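
Extracting n-gram characters from a word takes only a few lines. The helper `char_ngrams` below is a hypothetical sketch, not the fastText API; among other differences, fastText itself also wraps each word in boundary symbols before extracting n-grams, which is left out here to match the example above.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Return every character n-gram of `word` with n_min <= n <= n_max."""
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(word) - n + 1):
            ngrams.append(word[start:start + n])
    return ngrams


print(char_ngrams("kingdom"))
# ['kin', 'ing', 'ngd', 'gdo', 'dom']
print(char_ngrams("king", n_min=2, n_max=3))
# ['ki', 'in', 'ng', 'kin', 'ing']
```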

Instead of learning vector representations for words by using other words as context, fastText learns vector representations for n-gram characters by using other n-gram characters as context. Because it breaks the units down into a more granular level than either Word2Vec or GloVe, fastText achieves a wider variety and more nuanced set of word embeddings.

This use of n-gram characters instead of words as the smallest unit is a material improvement over Word2Vec and GloVe for several reasons. First, fastText requires less training data because it is able to learn a lot more from the various n-gram characters in a set of words than what Word2Vec or GloVe could for the same set of words.

Second, fastText uses subword information and, therefore, generalizes better because new words that fastText has not trained on may share n-gram characters with words that fastText has trained on. For example, if fastText has trained on "fastest" but has not trained on "biggest," it can infer the meaning of "est" in "biggest" from the "est" in "fastest," whereas Word2Vec and GloVe could not. Third, fastText can generate embeddings for OOV words because OOV words can be broken down into n-gram characters that fastText has embeddings for. This is related to the use of subword information, which neither Word2Vec nor GloVe supports.
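
The OOV behavior can be illustrated with a toy sketch. The n-gram vectors below are random stand-ins for learned embeddings, the training vocabulary is hypothetical, and averaging the known n-gram vectors is a simplification of how a subword model composes a word vector.

```python
import random


def char_ngrams(word, n=3):
    """Return the character n-grams of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


rng = random.Random(0)
training_words = ["fastest", "big"]  # hypothetical training vocabulary

# Stand-in for learned n-gram embeddings: random vectors, illustration only.
ngram_vectors = {}
for word in training_words:
    for gram in char_ngrams(word):
        ngram_vectors.setdefault(gram, [rng.uniform(-1, 1) for _ in range(4)])


def embed(word):
    """Average the vectors of the word's known n-grams; None if none are known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


# "biggest" was never seen as a word, but two of its n-grams were.
shared = [g for g in char_ngrams("biggest") if g in ngram_vectors]
print(shared)                        # ['big', 'est']
print(embed("biggest") is not None)  # True
print(embed("xyz"))                  # None
```

A word-level lookup table would have nothing to return for "biggest," whereas the subword table can still assemble a vector from "big" and "est."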

With fastText, the only major shortcoming is its inability to produce multiple vectors for each word depending on the context.

### Context-aware Pretrained Word Embeddings

As good as they are, word embeddings trained by Word2Vec, GloVe, and fastText are not context-aware. They do not capture the different context-specific semantic properties of words. For example, the word "bank" has the same word vector (and, therefore, the same semantic property) regardless of whether it is used in the sentence, "I deposited a check at the bank," or in the sentence, "I sat on the river bank after fishing."

The large, pretrained language models that came onto the scene starting in 2018, such as ELMo and BERT, changed this: they introduced context-aware word representations. With context-aware word representations, the "bank" in a financial setting has a different word vector than the "bank" in a river setting. This should feel intuitive: the same word in different contexts means different things, and, therefore, we should have different word vectors to represent the different meanings of the word depending on the context. More on this soon.

In this section, we just covered the advances in word embeddings over the years. In the next section, we'll explore the advances in modeling approaches in recent years, starting with sequential models.
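
The limitation follows directly from how static embeddings are used: each token is looked up in a table, and the neighboring words are never consulted. The sketch below uses a hypothetical three-word table with hand-picked values to show this.

```python
# Toy static embedding table with hand-picked values (illustration only).
static_embeddings = {
    "bank": [0.4, 0.1, 0.7],
    "river": [0.1, 0.2, 0.9],
    "check": [0.8, 0.3, 0.1],
}


def embed_sentence(sentence):
    """Look up each known token; the surrounding words play no role."""
    tokens = sentence.lower().replace(",", "").replace(".", "").split()
    return {tok: static_embeddings[tok] for tok in tokens if tok in static_embeddings}


financial = embed_sentence("I deposited a check at the bank.")
river = embed_sentence("I sat on the river bank after fishing.")

# The same vector comes back for "bank" in both sentences.
print(financial["bank"] == river["bank"])  # True
```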

## Sequential Models

Starting in 2016, sequential models began to rise in prominence in the field of NLP, achieving success in tasks such as machine translation, text summarization, conversational bots, and image captioning. Sequential models also captured mainstream attention with the New York Times article on Google's new machine-translation-based Google Translate (https://oreil.ly/M0Vjb). Dubbed "The Great AI Awakening," the article brought to the world stage for the first time the power of NLP models in solving complex language tasks.

Sequential models are machine learning models that input or output sequences of data, such as text, audio, and time series data. Sequential models are a class of modeling approaches, not just a singular approach, and they include RNNs, LSTMs, and GRUs, all of which we covered in detail earlier in the book. All of these sequential models either take in a sequence of data and output a singular result (for example, classify movie reviews as positive or negative sentiment), take in a singular input and output a sequence of data (for example, take in an image and return a caption that describes the image), or turn one sequence of data (such as text or audio) into another sequence (known as seq2seq modeling). For instance, neural machine translation models take text in one language as the input sequence (e.g., English) and produce text in another language as the output sequence. In other words, the model takes in an input sequence and outputs an output sequence.

> To recap, sequential models handle multiple types of scenarios: (a) sequential input to single output, for scenarios such as sentiment analysis; (b) single input to sequential output, for image captioning; and (c) sequential input to sequential output, for machine translation.

Sequential models are generally composed of an encoder and a decoder. The encoder takes in the input sequence, item by item, and generates a representation; think of this as converting text (such as a sentence) into a vector of numbers that machines can process. In the machine translation task, the encoder "encodes" the input sentence word by word to form the representation.

Once the encoder processes the entire input sequence, it passes the representation to the decoder, which unravels it into an output sequence, item by item. For example, in the machine translation task, the decoder "decodes" the representation word by word to form the output sentence.

Over the past several years, sequential models have gotten better and better, solving the flaws of their predecessors. Let's explore the nature of sequential data some more, and then we will start with the earliest of the modern-day sequential models, made up of RNNs.

> Although we focus on NLP-based sequential modeling applications in this book, it is important to know that sequential modeling has applications well beyond NLP. Within NLP, sequential modeling is relevant for text (e.g., machine translation, text summarization, question answering, and more), audio (e.g., chatbots), and speech (e.g., speech recognition). Outside NLP, sequential modeling is relevant for images (e.g., image captioning), video (e.g., video captioning), anomaly detection on time series data, and time series prediction involving sensor data, stock market data, genomic data, and weather data. Sequential modeling is one of the most relevant and flourishing areas of machine learning in the enterprise, not just for NLP.

### Sequential Data and the Importance of Sequential Models

Before we dive into RNNs, let's explore the nature of sequential data and why there is a need for a special class of models (i.e., sequential models) to work with it. Sequential data involves a series of sequentially interdependent/related items; for example, the words in a sentence are related to one another in some sequential fashion and, therefore, are interdependent. The words in a sentence are not sequentially independent of each other.

This is true of other sequential data as well. For example, the phonemes in a spoken sentence (phonemes are the smallest speech utterances, such as the "c" in "cat") are also sequentially dependent on each other. The sounds we make in speech are related to the ones preceding and succeeding each utterance. If we want to model audio, we need a way to capture the sequential connectedness of the data.

A third example of sequential data is stock market prices. Each one-second tick in stock prices is related to the series of ticks before and the series of ticks after. There is a pattern in the data that connects each tick with the rest. To predict stock prices well, a machine learning model that learns from the stock market's price data needs to be able to represent and process the sequential nature of stock prices well. This is true of sensor-type time series data in healthcare and industrial robotics and many other fields.

Traditional feed-forward neural networks treat each input/observation as independent of the one preceding it and the one succeeding it. For example, a computer vision model that classifies images as "cat" or "dog" does not need to consider the preceding or succeeding image in order to successfully classify the current image. The model needs to focus only on the current image.

This singular focus on a single input is, of course, not optimal for sequential data problems. If we had to build a model to translate sentences from French to English, it would be suboptimal to translate each word in the French sentence to a word in English, word by word. While this would be a literal translation of French to English, the output sentence in English would likely not make much grammatical sense because the rules of French grammar differ from the rules of English grammar.

A more optimal approach would be for the model to generate a representation of the entire input (French) sentence first before attempting to translate it to an output (English) sentence. The model should do this by processing each word, word by word, while taking into consideration the preceding word(s). This is critical because language exhibits a sequential pattern that will help the model generate a better representation of the French sentence than if the model ignored the sequential pattern altogether.

By taking into account the sequential pattern of the French language as it processes the input sentence, the model is better equipped to more accurately translate the sentence into English. This is the crux of sequential modeling; by considering the sequential pattern of data, such as text, audio, and time series data, sequential models generate better performance on tasks than traditional feed-forward neural networks. 

With this context, let's dig into the first of the successful sequential models in recent years: RNNs.

## RNNs

Sequential models learn about the temporal nature of data, one time step at a time. Let's use an example to demonstrate this. Sequential models that process text "read" each word, one word at a time. Each moment the model reads a word is a time step. As the model processes an entire sentence, it moves from time step 0 to time step `x`, where `x` is the length of the sentence. At every time step increment, the model considers the present word while taking into consideration the series of words that preceded it.

> Here's a simple, intuitive way to think of RNNs: they are networks with loops in them, allowing information of the past to persist as "memory," which can be used to process the next input.

The better the memory of the model, the better the model will be able to perform tasks, such as translating a sentence from one language to another or answering questions.

RNNs are a family of machine learning models that can store and use memory of prior sequential data in processing the current data. For example, RNNs have memory of prior words and use this memory to process the current word in a sentence. The major challenge of RNNs is having great memory of sequential data that spans a long time frame. For example, it is easier for an RNN to have memory of the most recent few words it has processed compared to memory of words in a sentence several sentences ago.

Let's start with vanilla RNNs in this section, which have good short-term memory, before we turn to RNNs with gates (e.g., LSTMs and GRUs), which model not only short-term memory but also long-term memory and are better able to capture the long-term dependencies in sequential data that are necessary to solve more complex tasks such as question answering.

> Before RNNs, CNNs were used to solve NLP problems. CNNs have become famous in machine learning for their performance on computer vision tasks, but they also have relevance for natural language tasks. In CNNs, the neural net uses fixed-length windows to represent the data. For example, in text-based problems, the neural network uses a small, bounded context of words to perform tasks such as machine translation. While CNN-based language models are very fast, they have little context of words; they have even less context than the short-term memory of RNNs. This limits the CNN's performance, which is why researchers switched to RNNs once these became available.

### Vanilla RNNs

Unlike conventional feed-forward networks, recurrent neural networks have a temporal dimension; in other words, RNNs take time into account, whereas conventional feed-forward networks do not. Conventional feed-forward networks feed information in one direction (hence, feed forward), but RNNs pass the data forward and then cycle the data back through a loop.

This "recurrent" cycling allows RNNs to have a sense of time, enabling the network to process the current input while retaining some context of the previous inputs. When an RNN processes an event in time step t, it also considers the recent past (e.g., what happened in time steps t-1, t-2, etc.). In other words, RNNs share weights over time. You can think of conventional feed-forward networks as having forgetful memories, whereas RNNs have better memories of recent events. Information in RNNs persists, whereas it does not with conventional feed-forward networks.

In sequential models, both the encoder and decoder could be RNNs. Both the encoder and decoder RNNs take two inputs at each time step. In the machine translation case, for example, the two inputs are (a) a word and (b) a hidden state. This hidden state vector is the sequential memory that the recurrent network has preserved from previous time steps.
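
The recurrence can be written in a few lines. The sketch below follows the common textbook form h_t = tanh(W_xh x_t + W_hh h_{t-1} + b) in plain Python; the weight names, the tiny sizes, and the random, untrained weights are all assumptions chosen for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    from_input = matvec(W_xh, x)
    from_memory = matvec(W_hh, h_prev)
    return [math.tanh(i + m + bias)
            for i, m, bias in zip(from_input, from_memory, b)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
input_size, hidden_size = 3, 4
W_xh = random_matrix(hidden_size, input_size, rng)   # input-to-hidden weights
W_hh = random_matrix(hidden_size, hidden_size, rng)  # hidden-to-hidden weights
b = [0.0] * hidden_size

# Three toy "word vectors". The SAME weights are reused at every time step.
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h = [0.0] * hidden_size
for x in sequence:
    h = rnn_step(x, h, W_xh, W_hh, b)

print(h)  # final hidden state: a summary of the whole sequence
```

The hidden state `h` is the only thing carried from one time step to the next, which is what gives the network its memory of earlier inputs.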

> Each word is represented with a word embedding, which we reviewed earlier in the chapter.

At each time step, the encoder RNN processes the input (i.e., the word vector) and the hidden state (also a vector). At the next time step, the RNN processes the next input (i.e., the next word vector) and the (output) hidden state from the previous time step and generates another output vector and a new output hidden state. This continues until the encoder RNN has finished processing the entire input sequence.

Once the encoder is done, it passes only the last output hidden state it generated to the decoder. This last output hidden state is the "representation" we alluded to earlier. You can think of it as the input sequence represented in a machine-processable format, ready for the decoder to process into an output sequence.

Once the decoder RNN receives the "representation" from the encoder, it unravels it word by word into the output sequence. In other words, the decoder RNN "translates" the hidden state word by word. You can think of this as unraveling what the encoder did but in reverse. This is a very simple explanation of how RNNs work, but check out Chapter 6 for more details.

> RNNs can handle all sorts of sequential data, not just text; this includes time series data, for example.

RNNs are an excellent choice for modeling sequential data such as text because they use an internal state to process the sequence. In other words, as the RNN works item by item through a sequence, it relies on its internal state/memory to process each item. This is very important because the items in the sequence (e.g., words in a sentence) are not independent of one another; they are related. Having an internal state/memory of how the inputs are related to one another is crucial for modeling the data effectively.

This should be intuitive. Translating one sentence from one language to another requires representing the input sentence properly first. Each word in the input sentence is dependent on/related to the word(s) prior. By managing the hidden states in the way it does, an RNN uses its internal state/memory (based on the prior words processed) to process each subsequent word.

However, RNNs also have a major flaw: they cannot process very long sequences very well.

### LSTMs

Vanilla RNNs have memory, but this memory is mostly just short-term; vanilla RNNs really struggle with capturing and storing long-term dependencies in data. Therefore, they have limited performance in solving more complex NLP tasks such as question answering.

LSTM networks also use a recurrent neural network architecture but help solve for the inability of vanilla RNNs to process long sequences very well. LSTMs are able to hold memory of data over longer sequences. They can keep the context of sequential data in mind for much longer using mechanisms known as gates as part of their neural network architecture (more on gates soon).

Having longer-term memory is so important because solving an NLP task may require the memory of an item from many time steps before. Think of the following passage, taken from the Wikipedia article on George Washington (https://oreil.ly/V7Sdh):

> George Washington was an American political leader, military general, statesman, and Founding Father who served as the first president of the United States from 1789 to 1797. Previously, he led Patriot forces to victory in the nation's War for Independence. He presided at the Constitutional Convention of 1787, which established the U.S. Constitution and a federal government. Washington has been called the "Father of His Country" for his manifold leadership in the formative days of the new nation.

If asked "Who is the first president of the United States?" an NLP model trained using vanilla RNNs may be able to answer the question correctly ("George Washington") because the number of time steps between the mention of "George Washington" and the mention of "the first president of the United States" is reasonably low (fewer than 20 steps).

But an NLP model trained using vanilla RNNs would have much greater difficulty answering "Who presided at the Constitutional Convention of 1787?" because of the number of time steps between the mention of "George Washington" and the mention of "the Constitutional Convention of 1787" unless the NLP model could hold some form of longer-term memory.

> RNNs work perfectly fine when there's a small gap between the relevant information and the point where it is needed, but RNNs begin to struggle as the gap grows and the RNNs' short-term memory is unable to connect the question with information from the distant past. In other words, since RNNs can't preserve information over many time steps, they can only handle tasks that require short-term memory. An NLP model would have greater success answering the question, "Who presided at the Constitutional Convention of 1787," if it could successfully disambiguate the pronoun "he" as "George Washington" as it processes the entire paragraph of text on George Washington. More modern NLP models do this very well using attention mechanisms. We'll discuss this in more detail in "Attention Mechanisms."

LSTMs hold long-term memory using a series of three carefully regulated gates, which control how much information flows into and out of the LSTM's memory. These gates enable LSTMs to remember values over arbitrary time intervals. You can think of the three gates as mechanisms that allow the network to add or remove items to its memory depending on how relevant the network deems the items to be.

In other words, an LSTM network updates its memory vector at each time step depending on (a) which information it wants to add from the current input, (b) which information it deems not relevant anymore and wants to forget, and (c) which information it wants to keep. The gates are the mechanisms that learn which items in the sequential data are important to store in the LSTM's memory and which items are not. The gates are neural networks in their own regard, learning how to perform their respective specialized roles best. These three types of gates are as follows:

`Input gate`  
    This determines which information from the current input should be used to modify the memory.

`Forget gate`  
    This type of gate determines which information to forget from memory because the information is no longer relevant.

`Output gate`  
    This determines which information from the updated memory to expose as output (and pass on to the next time step as the hidden state) given the current input and what the network has chosen to forget.

Using these gates, LSTM allows a neural network to operate on different time scales at once, capturing longer dependencies much better than RNNs would.
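
To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM time step. It follows the commonly used LSTM formulation rather than any particular library's implementation. The function and parameter names (`lstm_step`, `init_params`, `W_i`, and so on) are hypothetical, the weights are random, and the sizes are toy values chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell (illustrative sketch)."""
    z = np.concatenate([h_prev, x])                  # previous hidden state + current input
    i = sigmoid(params["W_i"] @ z + params["b_i"])   # input gate: what to add to memory
    f = sigmoid(params["W_f"] @ z + params["b_f"])   # forget gate: what to drop from memory
    o = sigmoid(params["W_o"] @ z + params["b_o"])   # output gate: what to expose from memory
    g = np.tanh(params["W_g"] @ z + params["b_g"])   # candidate values computed from the input
    c = f * c_prev + i * g                           # updated memory (cell state)
    h = o * np.tanh(c)                               # hidden state passed to the next step
    return h, c

def init_params(input_size, hidden_size, seed=0):
    """Random toy weights; a real model would learn these during training."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("i", "f", "o", "g"):
        params[f"W_{name}"] = rng.normal(0.0, 0.1, (hidden_size, hidden_size + input_size))
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

params = init_params(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
for x in np.random.default_rng(1).normal(size=(5, 3)):  # toy sequence of 5 input vectors
    h, c = lstm_step(x, h, c, params)
```

Each gate is a small sigmoid layer whose outputs fall between 0 and 1, so multiplying by a gate scales each memory entry between "block completely" and "let through completely."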

> The TL;DR is that LSTM is an improved version of RNNs. LSTM has longer-term memory compared to RNNs. LSTM achieves this using mechanisms called gates, which learn which information in the sequential data matters most and which does not.

### GRUs

Gated recurrent units (GRUs) are another form of RNNs with gates. They are similar to LSTM but have a simpler structure, using just two gates instead of three. These two kinds of gates are the following:

`Update gate`  
    The update gate determines whether the memory should be updated with new information from the current input.

`Reset gate`  
    This determines how much of the existing memory is used when computing the new candidate memory (and should be retained and passed on) and how much is ignored (and therefore reset).

In other words, the update gate controls information that flows into memory, and the reset gate controls information that flows out of memory. The update gate in GRUs is similar to the combination of the input gate and the forget gate in LSTM, while the reset gate in GRUs is similar to the output gate in LSTM.
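
For comparison with LSTM, the following is a minimal NumPy sketch of one GRU time step under a commonly used formulation. The names (`gru_step`, `W_u`, `W_r`, `W_h`) are hypothetical and the weights are random toy values. Note that libraries differ on whether the update gate multiplies the old state or the candidate; this sketch assumes it weights the candidate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One time step of a GRU cell (illustrative sketch)."""
    z_in = np.concatenate([h_prev, x])
    u = sigmoid(p["W_u"] @ z_in + p["b_u"])          # update gate
    r = sigmoid(p["W_r"] @ z_in + p["b_r"])          # reset gate
    cand_in = np.concatenate([r * h_prev, x])        # reset gate scales the old memory
    h_cand = np.tanh(p["W_h"] @ cand_in + p["b_h"])  # candidate memory
    # The update gate blends the old memory with the candidate
    return (1.0 - u) * h_prev + u * h_cand

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
p = {}
for name in ("u", "r", "h"):
    p[f"W_{name}"] = rng.normal(0.0, 0.1, (hidden_size, hidden_size + input_size))
    p[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):  # toy sequence of 5 input vectors
    h = gru_step(x, h, p)
```

Unlike the LSTM, the GRU keeps a single state vector and uses three weight blocks instead of four.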

The performance of GRUs is similar to the performance of LSTMs (but generally not quite as good), but, because of their simpler structure, GRUs are computationally more efficient. They train faster and are also a better choice than LSTM when you have limited training data because GRUs have fewer weights and parameters to update during training.
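
A quick back-of-the-envelope count shows where the savings come from. The sketch below assumes one weight matrix and one bias vector per block (three gates plus a candidate for LSTM, two gates plus a candidate for GRU). Real libraries may count slightly differently, for example by keeping two bias vectors per block, and the sizes 300 and 512 are arbitrary illustrative choices.

```python
def lstm_param_count(input_size, hidden_size):
    # Four blocks: input gate, forget gate, output gate, candidate
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return 4 * per_block

def gru_param_count(input_size, hidden_size):
    # Three blocks: update gate, reset gate, candidate
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return 3 * per_block

print(lstm_param_count(300, 512))  # 1665024
print(gru_param_count(300, 512))   # 1248768
```

Under these assumptions a GRU layer has three-quarters as many parameters as an LSTM layer of the same size.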

To recap, RNNs are a family of sequential models that have helped advance the field of NLP, particularly since 2015. Vanilla RNNs have simple, short-term memory, but the gated variants (LSTM and GRUs) have longer short-term memory and capture longer-term dependencies in sequential data better. LSTM and GRUs are the best performing RNNs in the field of NLP today. Both have similar performance, but GRUs are simpler than LSTM (GRUs use two gates instead of the three in LSTM) and faster to train.

As successful as these RNNs were at the end of 2016, they still handled longer-term dependencies poorly, a problem that the next breakthrough in NLP, attention mechanisms, sought to address.

## Attention Mechanisms

Although LSTM and GRUs have longer-term memories than RNNs, they still have shortcomings that hamper their performance on NLP tasks. This is most apparent on sequence-to-sequence NLP tasks such as machine translation. Both LSTM and GRUs have a hidden state ("memory") that is passed on from one time step to the next. In machine translation, the input sentence is encoded first, and the final hidden state ("representation") is passed on to the decoder to decode/translate the sentence into the output language.

The decoder has only this final hidden state to work off of. It does not have access to the intermediate hidden states of the encoder. This means that there is information left on the table (in the form of the intermediate hidden states of the encoder) that the decoder could use to improve its translation but does not have access to. Because of this limitation, LSTM, for example, cannot preserve information for more than 20 words in machine translation. The bidirectional version of LSTM was invented to solve this (known as Bi-LSTMs), but even Bi-LSTM could not preserve information for more than 40 words.

It should be clear that trying to squeeze the meaning of an entire sentence into one vector (the final hidden state of the encoder) and passing this to the decoder to translate the input sentence is suboptimal. It would be better for the decoder to focus on the relevant locations of the input sentence (via the intermediate hidden states of the encoder) at every time step as the decoder worked on the translation.

In other words, the decoder should focus its attention on the relevant hidden states of the encoder at every time step instead of using just one vector (the final hidden state of the encoder) to perform the translation. Intuitively, this should improve the quality of translation because the decoder would have more relevant information as it worked through the translation of the input sentence.

The mechanisms that enable LSTMs to focus their attention are known as attention mechanisms. They helped unleash the major breakthroughs in NLP in recent years, starting in 2017. Attention mechanisms in LSTM allow the decoder to access all the hidden states of the encoder, not just the final hidden state. Beyond this, they help the decoder focus its attention on particular hidden states and ignore the others as it translates the input sentence into the output sentence.

By being able to access the entire source input, an LSTM with attention mechanisms can handle longer input sequences better than vanilla LSTM and GRUs (and certainly much better than RNNs). In a nutshell, LSTMs with attention mechanisms are less forgetful because they are able to have better and more focused memory.

> In an LSTM with attention, the encoder passes all the hidden states to the decoder instead of just the last hidden state, which is the case in an LSTM without attention.
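
The idea of letting the decoder look at every encoder hidden state can be sketched in a few lines of NumPy. This is an illustrative sketch only: it assumes simple dot-product scoring (one of several scoring functions used in practice), random toy vectors in place of real hidden states, and a hypothetical function name `attend`.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()   # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Weigh all encoder hidden states by their relevance to the decoder state."""
    scores = encoder_states @ decoder_state   # one relevance score per source token
    weights = softmax(scores)                 # attention weights, summing to 1
    context = weights @ encoder_states        # weighted average of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # 6 source tokens, hidden size 8 (toy values)
decoder_state = rng.normal(size=8)         # decoder hidden state at the current step
context, weights = attend(decoder_state, encoder_states)
```

At each decoding step, the decoder receives a fresh `context` vector built from all encoder states, rather than relying on the final hidden state alone. The `weights` show which source positions the decoder is focusing on.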

This should feel intuitive. Attention in neural networks mimics cognitive attention in humans. Whether reading a sentence or driving a car, we as humans are not paying equal attention to everything around us all the time. Doing so would not only be mentally exhausting, but it would be impossible for us. We could not hold that cognitive load at once.

Instead, we focus our attention on items that matter most given the task we are performing. If we are reading a sentence, we pay more attention to some of the words (such as the names of the protagonists, where they are, and the activities they are performing) than to filler words such as articles and prepositions and words from many sentences ago. The filler words are less relevant and, therefore, not worth as much of our attention.

This is also true for complex tasks such as driving a car. We focus our attention on the road in front of us, including stop signs, traffic lights, crosswalks, other nearby vehicles, pedestrians, cyclists, and other objects we want to avoid hitting. We are much less focused on what is happening in our peripheral vision or the landscape on the horizon, especially when there is a lot of active traffic in front of us.

In much the same way, attention mechanisms in machine learning help neural networks focus their attention on what matters most for the task at hand while ignoring everything else that is less relevant. This enables neural networks to have better performance on tasks in much the same way that humans who can focus well (free of cognitive impairment from drugs or alcohol and of other distractions) are able to perform tasks better.

After their early success in improving machine translation, attention mechanisms became highly in vogue. A flurry of attention mechanisms came onto the market including self-attention, global/soft attention, and local/hard attention. Moreover, attention mechanisms were used for more than NLP applications; they became popular in computer vision, too. For more on these variants of attention mechanisms and how they work, refer to Chapter 7.

> Attention mechanisms also help make models more interpretable and explainable. This allows us to learn what the model is focused on as it translates words or generates a caption for an image, for example.

As these became more popular, researchers began to explore neural network architectures that relied more and more on attention mechanisms and less on the recurrent neural architecture of RNNs. This resulted in the next big breakthrough in NLP: the Transformer architecture.

## Transformers

LSTM with attention mechanisms was a major improvement over vanilla LSTM and GRUs, but it had some shortcomings. Most notably, LSTM with attention mechanisms is computationally intensive to train and hard to parallelize.

Soon after LSTM with attention mechanisms became popular in 2017, researchers designed an even better architecture, one that was faster to train and eliminated recurrent networks altogether and relied solely on attention mechanisms. This new architecture is known as the Transformer architecture (or Transformer, for short).

Instead of using a recurrent network-based encoder-decoder, Transformers use a feed-forward encoder-decoder with attention.

Transformers were the catalyst for the ImageNet moment in NLP, heralding the advent of large, pretrained language models such as ULMFiT, ELMo, BERT, GPT-2, and GPT-3. Transformers are very memory-intensive, but parallelize very well. Given the parallelization, it is possible to train Transformers on lots of data super fast across lots of GPUs, which is exactly what has happened at the large tech giants in recent years.

Let's dig into the parallelization some more because it helps explain why the Transformer was such a major breakthrough compared to LSTMs with attention. An LSTM, like all RNNs, needs to process data sequentially: the first word needs to be processed before the LSTM can process the second word, then the third word, and so on. Transformers do not have this requirement; they do not need to process data in order, from beginning to end.

By removing the recurrent processing and relying solely on attention mechanisms, Transformers pass an entire sequence of data to the decoder at once rather than sequentially as the older conventional sequential models do. This innovation around passing blocks of data (such as several sentences) through the network at once was a game changer.

Compared to conventional sequential models, the Transformer learns from more data at any given time and has much more parallelization as a result, reducing training times. Greater parallelization and faster training times have allowed researchers to train on massively large datasets, much larger than the datasets the conventional sequential models could have trained on.
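
The contrast between sequential and parallel processing can be seen in a small NumPy sketch. The recurrent part must loop over time steps because each state depends on the previous one, whereas the attention part handles every position with a few matrix products. To keep the focus on parallelism, this sketch assumes a vanilla RNN update and an attention computation without learned projections; all weights and inputs are random toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4                       # 5 tokens, vector size 4 (toy values)
X = rng.normal(size=(T, d))       # one input vector per token

# Recurrent processing: step t needs the state from step t-1,
# so the loop cannot be parallelized across time steps.
W_x = rng.normal(0.0, 0.1, (d, d))
W_h = rng.normal(0.0, 0.1, (d, d))
h = np.zeros(d)
rnn_states = []
for x_t in X:
    h = np.tanh(W_x @ x_t + W_h @ h)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Attention-only processing: every position is handled at once
# with matrix products, which map well onto GPUs.
scores = X @ X.T / np.sqrt(d)                  # all pairwise relevance scores
scores -= scores.max(axis=1, keepdims=True)    # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
attn_out = weights @ X                         # one output vector per position
```

Both approaches produce one vector per token, but only the second can compute them all simultaneously.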

This enabled research teams at Google, Facebook, and other firms to train on very large datasets, much larger than what was possible with LSTM. This breakthrough in parallelization during training led to the advent of very large, pretrained language models.

With this in mind, let's explore the Transformer architecture in more detail. Transformers rely on attention mechanisms alone, without any recurrent sequential processing, to perform tasks. The invention of attention mechanisms made the Transformer possible.

Like LSTM, the Transformer relies on an encoder-decoder architecture. Specifically, the encoding component is a stack of encoders that process the input, and the decoding component is a stack of decoders that process the encodings passed to them by the encoders. The number of encoders equals the number of decoders. Also, all the encoders are identical in structure, and all the decoders are identical in structure.

Let's examine the stack of encoders first. Each encoder has two components (or sublayers): a self-attention mechanism and a feed-forward neural network. The self-attention mechanism takes in the input encodings from the previous encoder and weighs the relevance of the encodings to each other to generate a set of output encodings. The output encodings are fed into the feed-forward neural network, which processes the encodings individually before passing them to the next encoder layers and the decoder layers.

The self-attention mechanism helps the encoder weigh the relevance of the words in the input sentence for the word the encoder is processing. For example, in the sentence, "Washington was the first President of the United States, while Adams was the first Vice President," the self-attention mechanism would assign greater relevance to the words "first" and "President" when processing the word "Washington" than it would assign to the other words, particularly words related to "Adams." This helps the encoder focus on the more relevant information when processing the word at hand, once again highlighting the beauty of attention mechanisms.

> The first encoder takes the word embeddings of the input sentence and the positional information of the tokens as its input rather than encodings. Every other encoder uses the encodings generated by the prior encoder in the stack.
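
The encoder stack described above can be sketched in NumPy as follows. This is a simplified sketch, not a faithful Transformer implementation: it uses a single attention head and omits the residual connections and layer normalization found in the full architecture. The sinusoidal `positional_encoding` is one common way to supply positional information and is an assumption here, as are all function names, parameter names, and toy sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def positional_encoding(seq_len, d_model):
    """Sinusoidal position signal (one common choice, assumed here)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def encoder_layer(x, p):
    # Sublayer 1: self-attention weighs every token against every other token
    q, k, v = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    attended = weights @ v
    # Sublayer 2: feed-forward network applied to each position individually
    hidden = np.maximum(0.0, attended @ p["W_1"] + p["b_1"])
    return hidden @ p["W_2"] + p["b_2"]

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_layers = 5, 8, 16, 2   # toy sizes

def init_layer():
    """Random toy weights; a real model would learn these."""
    return {
        "W_q": rng.normal(0.0, 0.1, (d_model, d_model)),
        "W_k": rng.normal(0.0, 0.1, (d_model, d_model)),
        "W_v": rng.normal(0.0, 0.1, (d_model, d_model)),
        "W_1": rng.normal(0.0, 0.1, (d_model, d_ff)), "b_1": np.zeros(d_ff),
        "W_2": rng.normal(0.0, 0.1, (d_ff, d_model)), "b_2": np.zeros(d_model),
    }

# The first encoder sees word embeddings plus positional information;
# every later encoder sees the encodings produced by the one before it.
embeddings = rng.normal(size=(seq_len, d_model))
x = embeddings + positional_encoding(seq_len, d_model)
for layer in [init_layer() for _ in range(n_layers)]:
    x = encoder_layer(x, layer)
```

Because every layer maps a `(seq_len, d_model)` array to another array of the same shape, identical encoders can be stacked as deep as needed.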

The stack of decoders processes the encodings passed by the encoders and generates an output. Each decoder has three components: a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network. The decoder has a structure similar to the encoder's except for the additional attention mechanism, which helps it weigh the relevant information from the encodings generated by the encoders. This attention mechanism helps the decoder focus on the relevant parts of the input sentence, similar to what the attention mechanism does in LSTMs with attention.

The self-attention layer in the decoder also works a bit differently; the self-attention mechanism is allowed to attend to only earlier positions in the output sentence. All future positions are masked so the Transformer cannot use the current or future output to generate its output.
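
Masking is typically implemented by setting the attention scores of future positions to negative infinity before the softmax, so that their weights become exactly zero. The NumPy sketch below illustrates this with all-zero toy scores; the function name is hypothetical. In the usual setup the decoder's input is the output sentence shifted right by one position, so letting position i see decoder inputs up to and including i exposes only earlier outputs.

```python
import numpy as np

def masked_self_attention_weights(scores):
    """Softmax over attention scores with all future positions masked out."""
    seq_len = scores.shape[0]
    # True above the diagonal: position i may not look at positions j > i
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(masked)                                     # exp(-inf) is 0
    return e / e.sum(axis=1, keepdims=True)

scores = np.zeros((4, 4))    # toy scores: every position equally relevant
weights = masked_self_attention_weights(scores)
print(np.round(weights, 2))
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```

Each row spreads its attention only over positions that are not in the future, which is what keeps the decoder from peeking ahead during training.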

After the stack of decoders, the output vector is fed into a final linear transformation and softmax layer to produce the output probabilities over the vocabulary. The word in the vocabulary with the highest probability is generated as the final output of the time step.
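
This final step can be sketched as follows. The four-word vocabulary, the random projection weights, and the function name `next_token` are all hypothetical, chosen only to show the shape of the computation. Picking the single most probable word, as done here, is the simplest decoding strategy.

```python
import numpy as np

vocab = ["<eos>", "Washington", "presided", "Adams"]   # hypothetical toy vocabulary
rng = np.random.default_rng(0)
d_model = 8
W_out = rng.normal(0.0, 0.1, (d_model, len(vocab)))    # final linear transformation
b_out = np.zeros(len(vocab))

def next_token(decoder_output):
    """Map the decoder's output vector to a word in the vocabulary."""
    logits = decoder_output @ W_out + b_out    # one score per vocabulary word
    logits = logits - logits.max()             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax: probabilities over vocab
    return vocab[int(np.argmax(probs))], probs      # highest-probability word wins

token, probs = next_token(rng.normal(size=d_model))
```

The chosen word is then fed back into the decoder as input for the next time step, and the process repeats until an end-of-sequence token is produced.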

After it was introduced in late 2017, the Transformer became the clear standout architecture for solving many NLP problems. Researchers have used it to advance the field dramatically over the past three years, starting with the explosion of activity in 2018, the year of NLP's ImageNet moment.

### Transformer-XL

## NLP’s ImageNet Moment

### ULMFiT

### ELMo

### BERT

### BERTology

### GPT-1, GPT-2, GPT-3

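OpenAI also jumped into the NLP race by designing its own Transformer-based models. These models are known as GPT models, short for Generative Pretrained Transformer. The first GPT model, GPT-1, was released in 2018 and used an unsupervised pretraining and supervised fine-tuning process similar to ULMFiT. GPT-2 was released in 2019. Compared with its predecessor, it was trained on more data and with more parameters, which helped it achieve state-of-the-art performance on many tasks in a zero-shot setting.

> In zero-shot learning, the model is not given any examples to learn from but must understand the task to perform based on the instructions given. For example, a zero-shot learning task may require the model to translate English sentences into German, but the model is not given any English-German sentence pairs to learn from. In few-shot learning, the model is given a few examples to learn from, but generally not many.

OpenAI released GPT-3 in 2020. Compared with GPT-2, GPT-3 was trained on a much larger dataset and had a larger number of parameters. GPT-3 outperformed previous models and set a new standard for zero-shot and few-shot learning. It is considered the best-performing generative NLP model to date.

> As you can see, models have been getting bigger and have been trained on more and more data over the years. Together with design changes, larger models and more data have helped drive state-of-the-art performance in NLP.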

## Conclusion

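In this chapter, we tied together all of the book's major concepts, including word embeddings, RNNs, attention mechanisms, the Transformer architecture, and contextualized word representations. Collectively, these advances helped bring about NLP's ImageNet moment in 2018, when large, pretrained language models became available to the public and set new performance records on NLP benchmarks.

The advent of pretrained language models has allowed applied NLP engineers to fine-tune large models on domain-specific NLP tasks and achieve remarkable performance. Now that we have covered the major NLP concepts you need to know to develop NLP models, let's discuss how to productionize NLP models once you have developed them.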