# Chapter 1 - Overview of Transformers Library

<b>Transformer</b> - a novel machine learning architecture for sequence modelling, able to outperform recurrent networks.

Around the same time, research was also under way on training methods that let LSTM-based models perform text classification with little labeled data.

Together, these advances catalyzed modern transformers like BERT and GPT, which combine the transformer architecture with unsupervised learning. This in turn produced a large array of transformer models, all of which build on the same core concepts:
- The encoder-decoder framework
- Attention mechanisms
- Transfer learning 

### Encoder-Decoder Framework

First, returning to RNNs:
- Architectures that contain a feedback loop in the network connections, which allows information to propagate from one step to the next; ideal for sequence data (NLP, time series, etc.)
- At each step of the sequence, an RNN receives some input, feeds it through the network, and outputs a vector called the hidden state
- At the same time, the model feeds some information back to itself through the feedback loop, which it can use in the next step
- Still widely used for NLP tasks
- Played an important role in the development of machine translation systems, a task usually tackled with an encoder-decoder architecture
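
The feedback loop described above can be sketched in a few lines of plain Python. This is a minimal illustration of a simple tanh RNN cell, not the formulation of any particular library; the weight values and the names `rnn_step` and `run_rnn` are made up for the example.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One RNN step: h_t = tanh(W_xh @ x + W_hh @ h_prev + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, W_xh, W_hh, b):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = [0.0] * len(b)          # initial hidden state
    states = []
    for x in inputs:            # steps must run in order: each needs the previous h
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states

# Toy weights and inputs (arbitrary values, for illustration only)
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [-0.4, 0.6]]
b = [0.0, 0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

hidden_states = run_rnn(sequence, W_xh, W_hh, b)
last_hidden_state = hidden_states[-1]
print(last_hidden_state)
```

The loop in `run_rnn` is the feedback loop: the hidden state from one step is an input to the next, which is also why the steps cannot be computed independently.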

<b>Encoder-Decoder:</b> A sequence-to-sequence architecture, where the input and output are both sequences of arbitrary length. 
- The encoder converts information from the input sequence into a numerical representation, often called the last hidden state
- This is passed to the decoder, which generates the output sequence

In practice there are usually many RNN layers in each encoder and decoder.

One shortcoming of this architecture is the creation of an information bottleneck. The final hidden state has to represent the meaning of the whole input sequence, which is difficult to do with longer sequences, as meaning towards the beginning might be lost in the process of compressing everything into a single, fixed representation. 

We can counteract this bottleneck by allowing the decoder to have access to all of the encoder's hidden states - a general mechanism called <b>attention</b>, a key component of the transformer architecture.

### Attention Mechanisms

Instead of producing a single hidden state for the input sequence, the encoder outputs a hidden state at each step that the decoder can access. However, using all of these states at once would create a huge input for the decoder, so some mechanism is required to prioritize which states to use. Attention allows the decoder to assign a different weight, or attention, to each of the encoder states at every decoding timestep.

By focusing on which input tokens are the most relevant at each time step, attention-based models are able to learn nontrivial alignments between words in a generated translation and those in a source sentence. 
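
As a rough sketch of this weighting, the snippet below scores each encoder state against the current decoder state, turns the scores into weights with a softmax, and builds a weighted average (the context vector). It uses dot-product scoring for brevity, which is an assumption of this example rather than the scoring function of any specific model, and all vectors are toy values.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Weight every encoder state by its similarity to the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted average of all encoder states
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Toy encoder states (one per input token) and one decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)   # the first encoder state receives the largest weight
print(context)
```

At each decoding step a new set of weights is computed, so the decoder can focus on different input tokens as the output sequence is generated.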

This allows tasks like translation to be performed much more accurately, but there were still major shortcomings with using recurrent models for the encoder and the decoder: the computations are inherently sequential and cannot be parallelized across the input sequence.

So, transformers removed recurrence entirely and instead shifted to a concept called self-attention. Its basic idea is to allow attention to operate on all states in the same layer of the network. The encoder and decoder each have their own self-attention mechanisms, whose outputs are fed to a feed-forward network. This can be trained much faster than recurrent models.

Most transformers are trained on a large corpus of text, often spanning several languages, but in practical applications we often do not have access to that much data.
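
A minimal sketch of self-attention, assuming scaled dot-product scoring and, to keep it short, no learned query/key/value projections (real transformer layers do learn them). Every position attends to every position in the same layer, and each output row depends only on the layer's inputs, not on the previous output row.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(states):
    """Scaled dot-product self-attention without learned projections."""
    d = len(states[0])
    outputs = []
    for query in states:
        # Score this position against every position in the same layer
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in states]
        weights = softmax(scores)
        # New representation: weighted average of all states
        outputs.append([sum(w * value[i] for w, value in zip(weights, states))
                        for i in range(d)])
    return outputs

# Toy token representations (illustrative values)
token_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_states = self_attention(token_states)
for row in new_states:
    print(row)
```

Because no iteration of the outer loop uses the result of another, all positions can be computed at once as matrix operations, which is what removes the sequential bottleneck of recurrent models.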

### Transfer Learning

Transfer learning, where a network is trained on one general task and then fine-tuned for a more specific downstream application, has long been common in computer vision. Despite this, it took many years before a comparable training process was established for NLP. The general framework, introduced by ULMFiT, is:
- *Pretraining*: The initial training objective for the model is to predict the next word based on the previous words, a task called language modeling. This approach requires no labeled data and can make use of abundantly available text
- *Domain Adaptation*: After pretraining, the model is adapted to an in-domain corpus. This still uses language modeling, but now the model has to predict words within the target corpus (e.g., the IMDb review dataset)
- *Fine-Tuning*: The language model is then fine-tuned with a classification layer for the target task (e.g., sentiment classification of movie reviews)

### HuggingFace

Traditionally, adapting pretrained models to new use cases was an intensive process that required a lot of task-specific engineering and custom logic. HuggingFace provides a standardized interface to a wide range of transformer models, as well as code and tools to adapt them to new uses. It supports TensorFlow, PyTorch, and JAX, and provides task-specific heads, making fine-tuning much easier.

### HuggingFace Transformers Library Tour

In [1]:
import transformers
from transformers import pipeline
import pandas as pd 

Each NLP task starts with text, like this made-up customer review:

In [9]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror
that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve this issue, I demand 
an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

**Text Classification**

In [4]:
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



HF has downloaded the model weights from the HF Hub. The second time the pipeline is instantiated, the library will use the cached version instead.

In [5]:
outputs = classifier(text)
pd.DataFrame(outputs)

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.916535 |


The pipeline returns only the top label (POSITIVE or NEGATIVE) together with its score, so the confidence of the opposite label is $1 - score$.
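
The $1 - score$ relationship follows if the two label scores come from a softmax over two logits, which is assumed here for a two-label classifier. The sketch below uses hypothetical logit values, not the model's actual outputs, and `binary_softmax` is a helper defined only for this example.

```python
import math

def binary_softmax(logit_negative, logit_positive):
    """Softmax over two logits; the two probabilities always sum to 1."""
    m = max(logit_negative, logit_positive)
    e_neg = math.exp(logit_negative - m)
    e_pos = math.exp(logit_positive - m)
    total = e_neg + e_pos
    return {"NEGATIVE": e_neg / total, "POSITIVE": e_pos / total}

# Hypothetical logits for illustration only
probs = binary_softmax(1.2, -1.2)

top_label = max(probs, key=probs.get)   # the label the pipeline would report
top_score = probs[top_label]
other_score = 1 - top_score             # confidence of the label not reported

print(top_label, round(top_score, 4), round(other_score, 4))
```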

**Named Entity Recognition**

In [6]:
ner_tagger = pipeline('ner', aggregation_strategy='simple')
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.8817 | Amazon | 5 | 11 |
| 1 | MISC | 0.990799 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999756 | Germany | 90 | 97 |
| 3 | MISC | 0.55873 | Mega | 208 | 212 |
| 4 | PER | 0.591302 | ##tron | 212 | 216 |
| 5 | ORG | 0.68014 | Decept | 253 | 259 |
| 6 | MISC | 0.490222 | ##icons | 259 | 264 |
| 7 | MISC | 0.782622 | Megatron | 351 | 359 |
| 8 | MISC | 0.987765 | Optimus Prime | 368 | 381 |
| 9 | PER | 0.810411 | Bumblebee | 503 | 512 |


The pipeline detects the entities in the text and assigns each one a category.
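
In the output above, `Mega`/`##tron` and `Decept`/`##icons` appear as separate rows. The `##` prefix is the WordPiece convention for a piece that continues the previous token, and here the two halves received different labels, so they were not grouped. As a small illustration of how such pieces relate to whole words, the hypothetical helper below (not part of the Transformers library) glues continuation pieces back together:

```python
def merge_word_pieces(pieces):
    """Join WordPiece continuation pieces (prefixed with '##') onto the previous piece."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]   # drop the '##' marker and append
        else:
            words.append(piece)
    return words

pieces = ["Mega", "##tron", "and", "Decept", "##icons"]
print(merge_word_pieces(pieces))
```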

**Question Answering**

We provide the model with a context, as well as a question whose answer we'd like to extract from the context. The model returns the span of text that corresponds to the answer. 

In [10]:
question = 'what does the customer want?'
reader = pipeline('question-answering')
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.640578 | 337 | 360 | an exchange of Megatron |
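
Extractive question answering of this kind is commonly framed as scoring every token as a possible answer start and a possible answer end, then picking the best valid pair. The sketch below illustrates that selection step with made-up scores over whole words; the real pipeline works on subword tokens and maps the result back to the character offsets shown in `start` and `end`. The function name `best_span` and the `max_answer_len` parameter are assumptions of this example.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return the (start, end) token indices with the highest combined score."""
    best = (0, 0)
    best_score = float("-inf")
    for i, start_score in enumerate(start_scores):
        # The end must not come before the start, and the span length is capped
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best

# Toy tokens and hypothetical scores, for illustration only
tokens = ["I", "demand", "an", "exchange", "of", "Megatron"]
start_scores = [0.1, 0.3, 2.5, 0.9, 0.0, 0.2]
end_scores = [0.0, 0.1, 0.2, 0.8, 0.3, 2.9]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```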


**Summarization**

In [14]:
summarizer = pipeline('summarization')
outputs = summarizer(text, max_length=56, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead. As a lifelong enemy of the Decepticons, I


**Translation**

In [15]:
translator = pipeline('translation_en_to_de', model='Helsinki-NLP/opus-mt-en-de')
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])




Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um dieses Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, bald von Ihnen zu hören. Aufrichtig, Bumblebee.


**Text Generation**

In [16]:
generator = pipeline('text-generation')
response = 'Dear Bumblebee, I am sorry to hear that your order was mixed up.'
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror
that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve this issue, I demand 
an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. I received a request for a new Optimus Prime action figure that was not available on the B&E website in Germany. I asked for something like the Optimus Prime action figure, but this time it was a completely different toy. I wanted a better deal and needed a place to store mine. When I heard you were happy


### HuggingFace Tokenizers

Behind each of these models is a tokenizer that splits the raw text into smaller pieces called tokens: words, parts of words, or single characters (like punctuation). Transformers are trained on numerical representations of these tokens, so getting this right is critical to the NLP process.

The HF tokenizers library provides many tokenization strategies and is very fast. It also takes care of all pre- and postprocessing steps. Loading a tokenizer works the same way as loading a transformer.
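
To make the text-to-tokens-to-numbers idea concrete, here is a deliberately simple toy tokenizer in plain Python. It is not how the HF tokenizers library works internally (which uses subword algorithms); it splits on words and punctuation and maps unseen tokens to a hypothetical `[UNK]` entry.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Assign an integer ID to each distinct token; ID 0 is reserved for unknowns."""
    vocab = {"[UNK]": 0}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Convert text to the list of token IDs a model would consume."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokenize(text)]

corpus = "Dear Amazon, last week I ordered an Optimus Prime action figure."
vocab = build_vocab(tokenize(corpus))

ids = encode("I ordered a Megatron figure.", vocab)
print(tokenize("I ordered a Megatron figure."))
print(ids)   # words not seen in the corpus map to the [UNK] ID
```

A word-level vocabulary like this loses every unseen word to `[UNK]`, which is one motivation for splitting text into parts of words instead.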

### Main Transformer Challenges

1. Language: NLP research is dominated by English
2. Data Availability: Transfer learning can dramatically decrease the amount of labeled training data we need, but the amount still required can be hard to obtain in domains where labeled data is scarce
3. Working with long documents: Self-attention works extremely well on paragraph-long texts, but becomes very expensive on longer texts such as full documents
4. Opacity: As with other deep learning models, transformers are very opaque. It's difficult to unravel exactly why a model makes a certain prediction
5. Bias: Transformer models are predominantly pretrained on text from the internet, which carries the biases and misrepresentations present in that data into the models
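
On the long-document point (item 3): in full self-attention every token is scored against every other token, so the number of scores grows with the square of the sequence length. A quick back-of-the-envelope calculation, counting scores per attention head per layer and ignoring all other costs:

```python
def attention_score_count(sequence_length):
    """Pairwise attention scores for one head in one layer of full self-attention."""
    return sequence_length * sequence_length

for n in (128, 512, 4096):
    print(f"{n:>5} tokens -> {attention_score_count(n):>10,} scores")
```

Going from 512 to 4,096 tokens is 8 times more text but 64 times more attention scores.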