# Introduction

## What to expect?
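- Chapters 1-4
  - The transformers library
  - How to work with the Hugging Face Hub
  - How to fine-tune a model on a specific dataset
  - How to share the results on the Hub
- Chapters 5-8 (not yet covered)
  - Basics of Datasets and Tokenizers
  - How to tackle each NLP problem
- Chapters 9-12 (not yet covered)
  - Approaches to memory efficiency and long sequences
  - And more!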

# Natural Language Processing
## What is NLP
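NLP is a field that combines machine learning and linguistics to understand everything related to human language. The goal is to understand not only individual words but also those words in their context.

NLP tasks come in many forms, as listed below.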

- **Classifying whole sentences**: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
- **Classifying each word in a sentence**: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
- **Generating text content**: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words
- **Extracting an answer from a text**: Given a question and a context, extracting the answer to the question based on the information provided in the context
- **Generating a new sentence from an input text**: Translating a text into another language, summarizing a text

NLP is not limited to text, however. It also covers problems such as speech recognition, generating a transcript of the audio in a video, and describing images.

# Transformers, what can they do?
- Learn how to use Transformer models together with pipelines




## Working with pipelines
The most basic object in the Transformers library is the pipeline. A pipeline ties the various processing steps and a model together into a single object.

First, install the libraries needed for the examples.

In [10]:
!pip install datasets transformers[sentencepiece]



## Sentiment analysis
Load a pipeline and run sentiment classification, as in the course example.

In [11]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [12]:
classifier("I've been waiting for a HuggingFace course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

As shown above, the sentence is classified as positive with a score of about 96%. This time, let's analyze two sentences at once.

In [13]:
classifier([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!"
])

[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

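The two sentences were analyzed as shown. So how does this work?
Calling the pipeline() function for a given task loads the corresponding pipeline object and downloads the model it needs. Here, the log shows that **distilbert-base-uncased-finetuned-sst-2-english** was downloaded.

A pipeline consists of three steps.
1. The text is preprocessed into the input format the model expects.
2. The converted inputs are passed to the model.
3. The predictions are returned.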
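
The three steps above can be sketched with a toy stand-in. This only illustrates the control flow and is not how Transformers implements it: the vocabulary, the word weights, and the names `preprocess`, `toy_model`, `postprocess`, and `toy_pipeline` are all hypothetical.

```python
import math

# Hypothetical toy vocabulary and weights; a real pipeline uses a trained
# tokenizer and a neural network instead.
VOCAB = {"love": 0, "great": 1, "hate": 2, "awful": 3}
WEIGHTS = [2.0, 1.5, -2.5, -2.0]  # one sentiment weight per vocabulary entry


def preprocess(text):
    """Step 1: turn raw text into the model's input format (token ids)."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [VOCAB[w] for w in words if w in VOCAB]


def toy_model(token_ids):
    """Step 2: run the model, here just a sum of per-token weights (a logit)."""
    return sum(WEIGHTS[i] for i in token_ids)


def postprocess(logit):
    """Step 3: convert the raw logit into a label and a probability."""
    prob_positive = 1.0 / (1.0 + math.exp(-logit))
    if prob_positive >= 0.5:
        return {"label": "POSITIVE", "score": prob_positive}
    return {"label": "NEGATIVE", "score": 1.0 - prob_positive}


def toy_pipeline(text):
    """Chain the three steps, mirroring what a pipeline object does in one call."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I hate this so much!"))
```

The returned dictionary deliberately has the same `label` and `score` keys as the classifier output shown earlier, so the shape of the result is familiar even though the model is a toy.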

Pipelines are available for tasks such as the following.
- feature-extraction
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification
- etc...

## Zero-shot classification

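Real projects often start with classifying text that has not been labeled, and annotating text takes a lot of time and domain expertise. Zero-shot classification avoids this step: you directly specify the labels to use for classification.


First, load the pipeline for zero-shot classification.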

In [14]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


The log shows that the facebook/bart-large-mnli pretrained model is downloaded.
Now let's specify the labels and run the classification.

In [16]:
classifier("This is a course about the Transformers library",
          candidate_labels = ["education","politics","business"])

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445971608161926, 0.11197549849748611, 0.04342736303806305]}

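The text is classified as education with a score of about 84.5%.
This pipeline is called zero-shot because it works without any additional fine-tuning on the data being analyzed. Relying only on the pretrained model, it returns a probability for each of the labels you supply.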
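
To see where the per-label probabilities can come from: the default model here is an NLI model, and the pipeline (as I understand it) scores a hypothesis such as "This example is education." for each candidate label, then normalizes the entailment scores across labels. The sketch below shows only that final normalization step. The logit values are made up for illustration, and `softmax` is a helper defined here, not a library call.

```python
import math


def softmax(values):
    """Normalize raw scores into probabilities that sum to 1."""
    shifted = [v - max(values) for v in values]  # subtract the max for stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical entailment logits, one per candidate label (not real model output).
candidate_labels = ["education", "politics", "business"]
entailment_logits = [2.1, -0.9, 0.1]

scores = softmax(entailment_logits)

# Sort labels by descending score, the same ordering the pipeline output uses.
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalized across the supplied labels, adding or removing a candidate label changes every score, not just one.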

## Text generation

In the text generation task, the model is given the beginning of a text and generates the continuation.
Because this task involves randomness, the output may differ from the example.

In [17]:
from transformers import pipeline
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


The log confirms that the default model for text-generation is gpt2. Now let's generate some text.

In [18]:
generator("In this course, we will teach you how to")

Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use a real-time JavaScript parser to perform a data conversion from one data point onto another. In order to perform this conversion, create a JSON document and add the following:\n\nmyDataObject'}]

This time, load the distilgpt2 model instead of gpt2, and generate text while specifying max_length and the number of sequences to return (num_return_sequences).

In [22]:
from transformers import pipeline
generator = pipeline("text-generation", model="distilgpt2")



In [27]:
generator("In this course, we will teach you how to",
          max_length = 15,
          num_return_sequences = 2)

Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create a new class with'},
 {'generated_text': 'In this course, we will teach you how to handle this issue. If'}]
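
The randomness in text generation comes from sampling the next token from a probability distribution rather than always taking the most likely one. A minimal sketch with a made-up next-token distribution (the tokens and probabilities are hypothetical, not model output) shows why two runs can differ and how fixing a seed makes a run repeatable.

```python
import random

# Hypothetical next-token distribution after some prompt (not real model output).
next_tokens = ["use", "create", "handle", "build"]
probabilities = [0.4, 0.3, 0.2, 0.1]


def sample_continuations(seed, count=5):
    """Draw `count` next tokens at random, weighted by their probabilities."""
    rng = random.Random(seed)  # local generator: the seed fully determines the draws
    return rng.choices(next_tokens, weights=probabilities, k=count)


first = sample_continuations(seed=0)
second = sample_continuations(seed=0)
print(first)
print(first == second)  # same seed, same samples
```

For the real pipelines, calling `set_seed` from the `transformers` package before generating should serve the same purpose.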

## Mask filling
The next task is mask filling, which works the same way.
Note that for mask filling, the word to be predicted must be replaced with `<mask>` in the input.

In [29]:
from transformers import pipeline
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.",
         top_k=2)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'sequence': 'This course will teach you all about mathematical models.',
  'score': 0.1961982101202011,
  'token': 30412,
  'token_str': ' mathematical'},
 {'sequence': 'This course will teach you all about computational models.',
  'score': 0.04052715376019478,
  'token': 38163,
  'token_str': ' computational'}]

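distilroberta-base is the default model for this task.
With top_k = n, the pipeline outputs the n most likely words to replace `<mask>`.
For each word, it reports the probability (score), the token id (token), and the token string (token_str).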
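
The effect of top_k can be sketched as picking the k highest-probability candidates for the masked position. In the example below the first two probabilities are rounded from the output above and the other two are made up. `top_k_words` is a hypothetical helper, not part of the library.

```python
import heapq

# Candidate words for the masked position with their probabilities.
# The first two values are rounded from the output above; the rest are made up.
candidates = {
    " mathematical": 0.196,
    " computational": 0.041,
    " predictive": 0.030,
    " statistical": 0.025,
}


def top_k_words(scores, k):
    """Return the k highest-scoring (word, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


for word, score in top_k_words(candidates, k=2):
    print(f"This course will teach you all about{word} models. ({score:.3f})")
```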

## Named entity recognition
NER is the task of classifying which entity type, such as location or person, each token belongs to.
The grouped_entities=True option groups the subword pieces that belong to the same entity.
For example, given the tokens Hugging and Face, setting it to True makes the pipeline recognize "Hugging Face" as a single entity.

In [33]:
from transformers import pipeline
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.97960204,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.99321055,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [34]:
from transformers import pipeline

ner = pipeline("ner", grouped_entities=False)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'entity': 'I-PER',
  'score': 0.9993828,
  'index': 4,
  'word': 'S',
  'start': 11,
  'end': 12},
 {'entity': 'I-PER',
  'score': 0.99815476,
  'index': 5,
  'word': '##yl',
  'start': 12,
  'end': 14},
 {'entity': 'I-PER',
  'score': 0.99590725,
  'index': 6,
  'word': '##va',
  'start': 14,
  'end': 16},
 {'entity': 'I-PER',
  'score': 0.9992327,
  'index': 7,
  'word': '##in',
  'start': 16,
  'end': 18},
 {'entity': 'I-ORG',
  'score': 0.97389334,
  'index': 12,
  'word': 'Hu',
  'start': 33,
  'end': 35},
 {'entity': 'I-ORG',
  'score': 0.976115,
  'index': 13,
  'word': '##gging',
  'start': 35,
  'end': 40},
 {'entity': 'I-ORG',
  'score': 0.98879766,
  'index': 14,
  'word': 'Face',
  'start': 41,
  'end': 45},
 {'entity': 'I-LOC',
  'score': 0.99321055,
  'index': 16,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
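
Comparing the two outputs, grouping amounts to merging neighbouring tokens of the same entity type. The sketch below is a simplified illustration of that idea using the token-level predictions above. `group_entities` is a hypothetical helper and does not reproduce the library's actual aggregation logic (for example, it ignores scores).

```python
# Token-level predictions copied from the ungrouped output above
# (scores and indices omitted for brevity).
tokens = [
    {"entity": "I-PER", "word": "S", "start": 11, "end": 12},
    {"entity": "I-PER", "word": "##yl", "start": 12, "end": 14},
    {"entity": "I-PER", "word": "##va", "start": 14, "end": 16},
    {"entity": "I-PER", "word": "##in", "start": 16, "end": 18},
    {"entity": "I-ORG", "word": "Hu", "start": 33, "end": 35},
    {"entity": "I-ORG", "word": "##gging", "start": 35, "end": 40},
    {"entity": "I-ORG", "word": "Face", "start": 41, "end": 45},
    {"entity": "I-LOC", "word": "Brooklyn", "start": 49, "end": 57},
]


def group_entities(tokens):
    """Merge neighbouring tokens that share an entity type into one span."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # "I-PER" -> "PER"
        is_subword = tok["word"].startswith("##")
        last = groups[-1] if groups else None
        # Continue the previous group if the type matches and the token is
        # either a subword piece or at most one character (a space) away.
        if last and last["entity_group"] == label and (
            is_subword or tok["start"] - last["end"] <= 1
        ):
            last["word"] += tok["word"][2:] if is_subword else " " + tok["word"]
            last["end"] = tok["end"]
        else:
            groups.append({"entity_group": label, "word": tok["word"],
                           "start": tok["start"], "end": tok["end"]})
    return groups


for group in group_entities(tokens):
    print(group)
```

The merged spans line up with the `word`, `start`, and `end` values of the grouped output shown earlier in this section.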

## Question answering

In [35]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn"
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.6949764490127563, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
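
The `start` and `end` fields are character offsets into the context, which reflects that the answer is extracted from the context rather than generated. A quick check, using only the values from the output above:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the pipeline output above.
result = {"score": 0.6949764490127563, "start": 33, "end": 45, "answer": "Hugging Face"}

# Slicing the context with start/end reproduces the answer text.
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```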

## Summarization
- A summarization task; you can specify max_length or min_length for the result.

In [36]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer("""
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

## Translation
A translation task; you can specify max_length or min_length for the result.

In [37]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")



[{'translation_text': 'This course is produced by Hugging Face.'}]

# How do Transformers work?
- This section covers the Transformer architecture.
- https://huggingface.co/course/chapter1/4?fw=pt

## Encoder models
- https://huggingface.co/course/chapter1/5?fw=pt
- Best suited to tasks that require understanding the full sentence: sentence classification, named entity recognition (and more generally word classification), and extractive question answering.
- ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa

## Decoder models
- https://huggingface.co/course/chapter1/6?fw=pt
- Best suited to text generation
- [CTRL](https://huggingface.co/transformers/model_doc/ctrl.html), GPT, GPT-2, Transformer XL

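## Encoder-decoder models (= seq2seq models)
- https://huggingface.co/course/chapter1/7?fw=pt
- Use both parts of the Transformer architecture (encoder and decoder)
- Best suited to generating new sentences from an input (summarization, translation, generative question answering)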
- BART, mBART, Marian, T5

## Bias and limitations
- https://huggingface.co/course/chapter1/8?fw=pt
- Pretrained or fine-tuned models are very useful in production, but they come with limitations
- When scraping large amounts of data, it is not possible to collect only good data
- The data inherently carries human biases (racism, sexism, ageism, and so on), so these biases do not disappear even after fine-tuning

In [38]:
!pip install datasets transformers[sentencepiece]

from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")




Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [39]:
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']


As shown, when the occupation is masked for a man and for a woman and the model predicts it, the predicted jobs differ because of bias.

## Summary

|Model | Examples |Tasks|
|------|----------|-----|
|Encoder|ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa|Sentence classification, named entity recognition, extractive question answering|
|Decoder|CTRL, GPT, GPT-2, Transformer XL|Text generation|
|Encoder-decoder|BART, T5, Marian, mBART|Summarization, translation, generative question answering|