## Transformers

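### Transformers are everywhere

* The library provides a wide range of functionality for creating and using shared models.
* The Hub hosts pretrained models that anyone can download and use, and you can also upload your own models to it.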

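### Using pipelines

A pipeline connects a model with the __preprocessing and post-processing__ steps it needs, so you can pass in text directly and get back an easy-to-understand answer.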

In [None]:
!pip install transformers
!pip install transformers[sentencepiece]

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

In [None]:
# Pass several sentences at once
classifier(["I've been waiting for a HuggingFace course my whole life.",
            "I hate this so much!"])

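The pipeline above selects a pretrained model that has been fine-tuned for sentiment analysis.

The model is downloaded and cached when the `classifier` object is created.

If the code is run again, the cached model is used and nothing needs to be downloaded a second time.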

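* When text is passed to a pipeline, three steps run:
1. The text is preprocessed into a format the model can understand.
2. The preprocessed input is passed to the model.
3. The model's predictions are post-processed into a result that is easy to read.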
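
The three steps can be sketched in plain Python. This is only a toy illustration: the word lists, the `toy_model` scoring function, and the two-label output are hypothetical stand-ins invented for this sketch, not what the real pipeline or its model does internally. The softmax in the last step reflects a common way of turning raw scores into probabilities.

```python
import math

# Hypothetical word lists standing in for what a trained model has learned.
POSITIVE_WORDS = {"love", "great", "waiting"}
NEGATIVE_WORDS = {"hate", "awful", "bad"}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into lowercase tokens without punctuation."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def toy_model(tokens):
    """Step 2: return one raw score (logit) per label."""
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    return [float(negative), float(positive)]


def postprocess(logits):
    """Step 3: apply a softmax and report the most likely label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline object does for you."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I hate this so much!"))
```

The real `pipeline` object hides the same chain behind a single call, with a tokenizer in place of `preprocess` and a neural network in place of `toy_model`.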

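### Zero-shot classification

Zero-shot classification is the task of classifying text that has not been labeled.

Because you can choose the candidate labels freely, you do not have to rely on the label set of the pretrained model. The same model can therefore classify text against an entirely new set of labels.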

In [None]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.844599187374115, 0.11197401583194733, 0.04342672601342201]}
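
In the output above, the `labels` and `scores` lists are index-aligned and ordered from most to least likely. A small sketch of reading the result, with the output copied into a literal so that it runs without downloading a model (the variable names are illustrative):

```python
# Output of the zero-shot classifier, copied from the cell above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.844599187374115, 0.11197401583194733, 0.04342672601342201],
}

# Pair each label with its score; the two lists share the same order.
ranked = list(zip(result["labels"], result["scores"]))
best_label, best_score = ranked[0]

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")

# In this output the scores over the candidate labels add up to about 1.
total = sum(result["scores"])
print(f"best: {best_label} ({best_score:.1%}), total: {total:.3f}")
```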

### Text generation

Given a prompt as input, the model auto-completes it by generating the remaining text.

In [4]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

# Pass num_return_sequences to the generator to set how many sequences are generated
# Pass max_length to set the total length of the output text

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'generated_text': 'In this course, we will teach you how to build and modify an object/object family using 3rd party tools including: C++ Programmer\n\nObject Oriented Programming or ORP for Programming a Simple Game for Python (and Python 3 and'}]

### Using other models from the Hub

Let's use the `distilgpt2` model from the Hub in a text-generation pipeline.

In [5]:
generator = pipeline("text-generation", model="distilgpt2")
print(generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
))


[{'generated_text': 'In this course, we will teach you how to best use the best of its kind on every level of training.”\nThis course teaches you'}, {'generated_text': "In this course, we will teach you how to use JavaScript for complex, complex applications. I'll be showing you how to build complex applications using JavaScript"}]


### Mask filling

Fill in the blanks in a given text.

`top_k`: the number of candidate fills to return

In [7]:
unmasker = pipeline("fill-mask")
print(
    unmasker("This course will teach you all about <mask> models.",
             top_k=2)
)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.19619806110858917, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}, {'score': 0.04052723944187164, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}]


### NER

Identify which parts of a text correspond to named entities such as persons, locations, or organizations.

In [8]:
ner = pipeline(
    "ner",
    grouped_entities=True,  # group the parts of the sentence that belong to the same entity
)

print(
    ner(
        "My name is Sylvain and I work at Hugging Face in Brooklyn."
    )
)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18}, {'entity_group': 'ORG', 'score': 0.9796019, 'word': 'Hugging Face', 'start': 33, 'end': 45}, {'entity_group': 'LOC', 'score': 0.9932106, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
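
The `start` and `end` values in this output are character offsets into the input string, so each entity can be recovered by slicing the original text. A minimal sketch using the entities shown above, copied into a literal with the scores left out for brevity:

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the pipeline output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]

# Slice the original text with each entity's character offsets.
spans = [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]

for group, span in spans:
    print(f"{group}: {span}")
```

Slicing by offset is useful when the entity has to be highlighted or replaced in the original string rather than only listed.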


### QA

Answers a question using the information in a given context.

In [9]:
question_answerer = pipeline("question-answering")
print(question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn."
))

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6385912299156189, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}


### Summarization

`max_length` and `min_length` can be specified to bound the length of the summary.

In [15]:
summarizer = pipeline("summarization")
print(
    summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
    """
    )
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance engineering .'}]


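### Machine translation

If the task name includes a language pair (for example `translation_en_to_fr`), a default model is used. Here a model from the Hub is selected explicitly instead.

`max_length` and `min_length` can be specified to bound the length of the translation.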

In [3]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(
    translator("Ce cours est produit par Hugging Face.")
)


[{'translation_text': 'This course is produced by Hugging Face.'}]
