# Hello Transformers

<img alt="transformer-timeline" caption="The transformers timeline" src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_timeline.png?raw=1" id="transformer-timeline"/>

## The Encoder-Decoder Framework

<img alt="rnn" caption="Unrolling an RNN in time." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_rnn.png?raw=1" id="rnn"/>

<img alt="enc-dec" caption="Encoder-decoder architecture with a pair of RNNs. In general, there are many more recurrent layers than those shown." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_enc-dec.png?raw=1" id="enc-dec"/>

## Attention Mechanisms
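- If the decoder used all encoder hidden states at once, its input would be too large to handle.
- So at each decoding step, the encoder states are passed on with different weights (the states most relevant to the current prediction get the largest weights).
- When the decoder outputs 'sind', it picks out the 'are' vector among the encoder hidden vectors (alignment, i.e. the correspondence between source and target words).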
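
The weighting described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the formulation from the original attention papers: it assumes dot-product scoring (a swapped-in choice made for brevity), and the arrays and the `attend` helper are made up for this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(decoder_state, encoder_states):
    """Return (weights, context) for a single decoding step.

    decoder_state:  shape (d,)
    encoder_states: shape (n, d), one hidden state per input token
    """
    scores = encoder_states @ decoder_state  # (n,) relevance of each input token
    weights = softmax(scores)                # (n,) non-negative, sums to 1
    context = weights @ encoder_states       # (d,) weighted sum of encoder states
    return weights, context

# Toy encoder states (hypothetical values): 4 input tokens, hidden size 3
encoder_states = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
# A decoder state that is aligned with the third input token
decoder_state = np.array([0.0, 0.0, 4.0])

weights, context = attend(decoder_state, encoder_states)
print(weights.round(3))          # the third weight dominates
print(int(np.argmax(weights)))   # index of the most relevant input token
```

The weights change at every decoding step because the decoder state changes, which is what lets the decoder focus on a different input token for each output token.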


<img alt="enc-dec-attn" caption="Encoder-decoder architecture with an attention mechanism for a pair of RNNs." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_enc-dec-attn.png?raw=1" id="enc-dec-attn"/>

<img alt="attention-alignment" width="500" caption="RNN encoder-decoder alignment of words in English and the generated translation in French (courtesy of Dzmitry Bahdanau)." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter02_attention-alignment.png?raw=1" id="attention-alignment"/>

## Self-Attention

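- Self-attention processes the whole sequence at once instead of token by token (RNN-based attention is sequential by nature and cannot be parallelized across time steps). In both the encoder and the decoder, the self-attention output is passed to a feed-forward neural network (FFNN).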
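
As a rough sketch of the idea, the snippet below applies single-head scaled dot-product self-attention to all tokens in one matrix operation and then feeds the result to a small feed-forward network. It is a simplification under stated assumptions: the weights are random, and multi-head attention, residual connections, layer normalization, and biases are all left out. The function names and sizes are made up for this example.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens at once."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (n, n) token-to-token scores
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def feed_forward(h, w1, w2):
    """Position-wise feed-forward network with a ReLU activation."""
    return np.maximum(h @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 5, 8, 16                # hypothetical sizes
x = rng.normal(size=(n_tokens, d_model))          # one embedding per token
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
w1 = rng.normal(size=(d_model, d_ff))
w2 = rng.normal(size=(d_ff, d_model))

attended, weights = self_attention(x, w_q, w_k, w_v)
out = feed_forward(attended, w1, w2)
print(weights.shape, out.shape)
```

No loop over time steps is needed: every row of `weights` is computed in the same matrix product, which is why this layer parallelizes well.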

<img alt="transformer-self-attn" caption="Encoder-decoder architecture of the original Transformer." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_self-attention.png?raw=1" id="transformer-self-attn"/>

## Transfer Learning in NLP
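- Transfer learning is needed because a large labeled dataset for training is usually not available.
- A model is first trained on one task and then fine-tuned for a new task.
- The model has a head and a body; during training, the weights of the body learn "features".
- The weights of the body are used to initialize the model for the new task.
- This is what pretraining is: learning basic features (in a CNN, colors, edges, and so on).
- Afterwards the model is fine-tuned for problems such as classification (for example, with only a small amount of labeled data).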
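
The body/head split can be illustrated with a toy NumPy model. This is only a sketch of the weight-reuse step: the `Model` class, its sizes, and the random "pretrained" weights are hypothetical, and no actual training takes place.

```python
import numpy as np

class Model:
    """Toy model: a shared body (feature extractor) plus a task-specific head."""

    def __init__(self, d_in, d_hidden, n_outputs, rng):
        self.body = rng.normal(size=(d_in, d_hidden))
        self.head = rng.normal(size=(d_hidden, n_outputs))

    def forward(self, x):
        features = np.tanh(x @ self.body)  # general-purpose features
        return features @ self.head        # task-specific predictions

rng = np.random.default_rng(0)

# Stands in for a model already pretrained on a source task with 10 outputs
pretrained = Model(d_in=4, d_hidden=8, n_outputs=10, rng=rng)

# New task with 2 classes: reuse the body weights, start with a fresh head
finetuned = Model(d_in=4, d_hidden=8, n_outputs=2, rng=rng)
finetuned.body = pretrained.body.copy()

logits = finetuned.forward(np.ones((3, 4)))
print(logits.shape)
```

Fine-tuning would then continue training `finetuned` on the labeled data of the new task, typically starting from these copied body weights rather than from random ones.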


<img alt="transfer-learning" caption="Comparison of traditional supervised learning (left) and transfer learning (right)." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_transfer-learning.png?raw=1" id="transfer-learning"/>  

## ULMFiT
- Pretraining (LM)
- Domain Adaptation (LM)
- Fine-tuning

<img alt="ulmfit" width="500" caption="The ULMFiT process (courtesy of Jeremy Howard)." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_ulmfit.png?raw=1" id="ulmfit"/>

# Transformer Model (Self-Attention & Transfer Learning)
1. GPT: Decoder
   - Training on a large unlabeled corpus with a language modeling objective is called "unsupervised pre-training". Supervised fine-tuning (with a supervised objective) is then applied for the specific task.
2. BERT: Encoder
   - Masked LM
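
To make the two pretraining objectives concrete, here is a small pure-Python sketch of how training examples could be built from a token list. It is a simplification: the helper names are made up, and the masked position is fixed by hand, whereas BERT picks positions to mask at random.

```python
def causal_lm_examples(tokens):
    """GPT-style objective: predict each token from the tokens to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_positions, mask_token="[MASK]"):
    """BERT-style objective: hide some tokens and predict them from both sides."""
    inputs = [mask_token if i in mask_positions else tok
              for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in mask_positions}
    return inputs, targets

tokens = ["transformers", "are", "awesome", "!"]

for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

inputs, targets = masked_lm_example(tokens, mask_positions={2})
print(inputs, targets)
```

Neither objective needs human labels, since the targets come from the text itself.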

### Text Classification
- Sentiment analysis

In [None]:
!pip install transformers

In [6]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

In [7]:
# Use a pipeline -> the task is run with a fine-tuned model
from transformers import pipeline

classifier = pipeline("text-classification") # Downloaded automatically from Hugging Face (handles sentiment analysis as well as multiclass classification)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [9]:
import pandas as pd
outputs = classifier(text)
pd.DataFrame(outputs)

|   | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901546 |


### Named Entity Recognition
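- `aggregation_strategy`: controls how tokens are grouped into entities
  - "simple": the score is the mean of the scores of each token in the entity. For example, the score of "Sylvain" is the mean of the scores computed for the tokens S, ##yl, ##va, and ##in in the source's example.
  - "first": the score of each entity is the score of its first token (so for "Sylvain" it is 0.993828, the score of the token S).
  - "max": the score of each entity is the maximum score among the tokens in that entity (so for "Hugging Face" it is 0.98879766, the score of "Face").
  - "average": the score of each entity is the mean of the scores of the words (not tokens) that make it up (so for "Sylvain" there is no difference from "simple", but "Hugging Face" gets 0.9819, with "Hugging" at 0.975 and "Face" at 0.98879).

  Source: https://wikidocs.net/166822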
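
The four strategies can be compared side by side with a small pure-Python sketch. It follows the descriptions in these notes rather than the library's internals (which also have to decide on the entity label), and the token scores below are hypothetical values chosen for the example.

```python
from statistics import mean

def aggregate_score(words, strategy):
    """Combine token scores into one entity score.

    words: list of lists, one inner list of token scores per word of the entity.
    """
    tokens = [score for word in words for score in word]
    if strategy == "simple":
        return mean(tokens)                          # mean over all tokens
    if strategy == "first":
        return tokens[0]                             # score of the first token
    if strategy == "max":
        return max(tokens)                           # highest token score
    if strategy == "average":
        return mean([mean(word) for word in words])  # mean over word-level scores
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical entity with two words: the first split into two tokens
entity = [[0.97, 0.98], [0.99]]

for strategy in ("simple", "first", "max", "average"):
    print(strategy, round(aggregate_score(entity, strategy), 4))
```

"simple" and "average" differ only when the words of an entity are split into different numbers of tokens, as in this example.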

In [10]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.87901 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 90 | 97 |
| 3 | MISC | 0.556571 | Mega | 208 | 212 |
| 4 | PER | 0.590256 | ##tron | 212 | 216 |
| 5 | ORG | 0.669692 | Decept | 253 | 259 |
| 6 | MISC | 0.498349 | ##icons | 259 | 264 |
| 7 | MISC | 0.775362 | Megatron | 350 | 358 |
| 8 | MISC | 0.987854 | Optimus Prime | 367 | 380 |
| 9 | PER | 0.812096 | Bumblebee | 502 | 511 |


In [12]:
# ORGanization, LOCation, PERson (MISC = miscellaneous)
# The hash symbols (##) mark subword tokens: pieces of a word that the tokenizer split apart
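
As a small illustration of the `##` convention, the helper below glues continuation pieces back onto the preceding token. `merge_wordpieces` is a hypothetical function written for these notes, not part of the Transformers API.

```python
def merge_wordpieces(tokens):
    """Join WordPiece continuation tokens (prefixed with '##') to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and append the piece
        else:
            words.append(token)
    return words

print(merge_wordpieces(["Mega", "##tron", "Decept", "##icons"]))
# ['Megatron', 'Decepticons']
```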

### Question Answering
- `start` and `end` are character index positions in the text, the same as in NER tagging.
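
A quick way to check what `start` and `end` mean is to slice the text with them. The sketch below uses only the first sentence of the letter together with three spans from the NER output above; `end` is exclusive, as in ordinary Python slicing.

```python
sentence = (
    "Dear Amazon, last week I ordered an Optimus Prime action figure "
    "from your online store in Germany."
)

# (entity_group, start, end) taken from the NER output table
spans = [("ORG", 5, 11), ("MISC", 36, 49), ("LOC", 90, 97)]

for label, start, end in spans:
    print(label, repr(sentence[start:end]))
```

The same slicing applies to the question-answering output, where `context[start:end]` gives the answer span.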

In [13]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



|   | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 335 | 358 | an exchange of Megatron |


In [14]:
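pd.DataFrame(outputs) # Fails: outputs is a single dict of scalars, so it has to be wrapped in a list as in the previous cell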

ValueError: If using all scalar values, you must pass an index

### Summarization
- `clean_up_tokenization_spaces`: removes space artifacts inserted while encoding the sequence. For example, "state-of-the-art" may be encoded as "state - of - the - art", and the cleanup is meant to remove the spaces around the hyphens.

https://discuss.huggingface.co/t/what-does-the-parameter-clean-up-tokenization-spaces-do-in-the-tokenizer-decode-function/17399
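
The kind of cleanup described above can be sketched with two regular expressions. This follows the forum description rather than the library's implementation, whose exact rules may differ; `clean_up_spaces` and its rules are assumptions made for illustration.

```python
import re

def clean_up_spaces(text):
    """Illustrative only: remove spaces that tokenization left around punctuation."""
    text = re.sub(r"\s+([.,!?])", r"\1", text)     # "word ." -> "word."
    text = re.sub(r"(?<=\w) - (?=\w)", "-", text)  # "state - of" -> "state-of"
    return text

print(clean_up_spaces("This is state - of - the - art , really !"))
# This is state-of-the-art, really!
```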

In [15]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Your min_length=56 must be inferior than your max_length=45.


[{'summary_text': ' Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.'}]
 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.


### Translation

In [19]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.


### Text Generation

In [20]:
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results

In [21]:
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200) # The generated text is appended after the prompt
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. The only thing I can give you is information about the particular products you received with your request, and I will provide that information to you when I can.

You are extremely welcome to come look at your package. The first thing to do is to bring that package over to your local store.

I will note to you


This warning appears because no truncation behavior was specified for the case where the text is longer than the given max_length. By default, the 'longest_first' strategy is used to truncate the text. To resolve it, set the truncation=True option explicitly. The message also states that when pad_token_id is not set for text generation, eos_token_id is used instead.

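```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the model and tokenizer ("gpt2" is the checkpoint ID on the Hub)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Input text
input_text = "Once upon a time"

# Tokenize with truncation enabled explicitly. Padding is left out:
# GPT-2 has no pad token by default, and padding a single prompt up to
# max_length would leave no room for newly generated tokens.
inputs = tokenizer(
    input_text,
    max_length=50,
    truncation=True,
    return_tensors="pt"
)

# Generate text, passing the attention mask and an explicit pad_token_id
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=50,
    pad_token_id=tokenizer.eos_token_id
)

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```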

In [22]:
outputs = generator(prompt, max_length=200) # The generated text is appended after the prompt
print(outputs[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. It appears that the order list sent with your order form is faulty. Please contact Amazon Customer Support and we will attempt to resolve the issue. Thanks.

Customer service response: Sorry but no answer to this issue. Sorry but there is still no answer to this question.


In [24]:
set_seed(42) # Matches the first generation result
outputs = generator(prompt, max_length=200) # The generated text is appended after the prompt
print(outputs[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. The only thing I can give you is information about the particular products you received with your request, and I will provide that information to you when I can.

You are extremely welcome to come look at your package. The first thing to do is to bring that package over to your local store.

I will note to you
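
The cells above show that re-seeding reproduces the first generation, while generating again without re-seeding gives a different continuation. The same effect can be shown without a model using Python's standard `random` module. This is only an analogy: as far as I know, `set_seed` seeds the random number generators of Python, NumPy, and the deep learning backend, and the `sample_tokens` helper and its vocabulary are made up for this sketch.

```python
import random

def sample_tokens(seed, n=5):
    """Draw n pseudo-random 'tokens' after resetting the seed."""
    random.seed(seed)
    vocab = ["sorry", "order", "refund", "exchange", "package"]
    return [random.choice(vocab) for _ in range(n)]

first = sample_tokens(42)
second = sample_tokens(42)
print(first == second)  # True: same seed, same sequence of draws
```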


## The Hugging Face Ecosystem

<img alt="ecosystem" width="500" caption="An overview of the Hugging Face ecosystem of libraries and the Hub." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_hf-ecosystem.png?raw=1" id="ecosystem"/>

### The Hugging Face Hub

<img alt="hub-overview" width="1000" caption="The models page of the Hugging Face Hub, showing filters on the left and a list of models on the right." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_hub-overview.png?raw=1" id="hub-overview"/>

<img alt="hub-model-card" width="1000" caption="An example model card from the Hugging Face Hub. The inference widget is shown on the right, where you can interact with the model." src="https://github.com/nlp-with-transformers/notebooks/blob/main/images/chapter01_hub-model-card.png?raw=1" id="hub-model-card"/>

### Hugging Face Tokenizers

### Hugging Face Datasets

### Hugging Face Accelerate

## Main Challenges with Transformers

## Conclusion