# __Transformers, what can they do?__

## <font color="yellow">Setting up the token in Colab</font>

In [None]:
!pip install huggingface_hub

In [1]:
# Set the token in an environment variable (for the current session only)
# See the token on GitHub
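
The cell above is left empty in the notebook. A minimal sketch of what it could contain, assuming the token is kept out of the notebook itself; the helper name `set_hf_token` and the placeholder value are hypothetical:

```python
import os


def set_hf_token(token: str) -> None:
    """Store the token in HF_TOKEN for the lifetime of this process only."""
    if not token:
        raise ValueError("empty token")
    os.environ["HF_TOKEN"] = token


# Placeholder value for illustration; a real token should come from a secret
# store or an interactive prompt such as getpass.getpass(), never from source.
set_hf_token("hf_xxx_placeholder")
print(os.getenv("HF_TOKEN") is not None)
```

Because the value lives only in `os.environ`, it disappears when the Colab session ends.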

In [2]:
# Using the token:
import os

from huggingface_hub import login

login(os.getenv("HF_TOKEN"))

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


<font color="yellow">============================</font>

## Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [3]:
from transformers import pipeline

## <font color="orange">Classification (positive / negative)</font>
model=`"distilbert-base-uncased-finetuned-sst-2-english"`

DistilBERT models on Hugging Face support the same core functionality as BERT, but they run faster and use fewer resources.

In [None]:
# https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english

# text-classification: a single sentence
classifier_1 = pipeline("text-classification", model='distilbert-base-uncased-finetuned-sst-2-english')
classifier_1("I've been waiting for a HuggingFace course my whole life.")

In [18]:
# sentiment-analysis: several sentences at once
classifier_2 = pipeline("sentiment-analysis", model='distilbert-base-uncased-finetuned-sst-2-english')
classifier_2(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
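
Each prediction is a dict with `label` and `score`, returned in input order. A small sketch of post-processing, using the two predictions printed above as data; the helper `confident_labels` and the 0.98 threshold are illustrative assumptions, not part of the pipeline API:

```python
texts = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Predictions copied from the cell output above.
predictions = [
    {"label": "POSITIVE", "score": 0.9598049521446228},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]


def confident_labels(texts, predictions, threshold=0.98):
    """Pair each text with its label, or None when the score is below threshold."""
    return [
        (text, pred["label"] if pred["score"] >= threshold else None)
        for text, pred in zip(texts, predictions)
    ]


for text, label in confident_labels(texts, predictions):
    print(f"{label}: {text}")
```

With this threshold the first sentence is left unlabeled, because its score of about 0.96 falls below 0.98.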

## <font color="orange">Classification (candidate labels)</font>
model=`"facebook/bart-large-mnli"`

The task here is to classify text that has not been labeled. This is a common scenario in real projects, because annotating text usually takes a lot of time and requires domain expertise. For this case, use the `zero-shot-classification` pipeline: it lets you specify which labels to use for classification, so you do not have to rely on the labels of the pretrained model and can supply any other set of labels.

In [None]:
classifier_3 = pipeline("zero-shot-classification", model='facebook/bart-large-mnli')
classifier_3(
    "This is a course about the Transformers library",
    candidate_labels=["education", "sport", "business"],
)
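
To the best of my understanding of the pipeline's documented behavior, zero-shot classification with an NLI model turns each candidate label into a hypothesis (the default template is `"This example is {}."`), scores how strongly the text entails it, and, in single-label mode, normalizes the entailment scores with a softmax. The sketch below reproduces only that bookkeeping in plain Python; the entailment logits are made-up numbers, not output from `facebook/bart-large-mnli`.

```python
import math

candidate_labels = ["education", "sport", "business"]
hypothesis_template = "This example is {}."

# One hypothesis per label; the NLI model scores (text, hypothesis) pairs.
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]

# Hypothetical entailment logits, one per hypothesis (invented for illustration).
entailment_logits = [2.1, -1.3, 0.4]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = softmax(entailment_logits)
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalized across the labels, adding or removing a candidate label changes every score, not just one.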

__`Possible candidate_labels for different tasks:`__  
📌 1. Sentiment analysis = ["positive", "negative", "neutral"]  
📌 2. News topic classification = ["sports", "politics", "technology", "business", "entertainment", "health"]  
📌 3. Categorizing user reviews = ["customer service", "product quality", "pricing", "shipping", "returns"]  
📌 4. Emotion recognition = ["joy", "anger", "sadness", "fear", "surprise", "love"]  
📌 5. Legal document classification = ["contract", "privacy policy", "terms of service", "legal notice"]  
📌 6. Spam type detection = ["scam", "advertisement", "phishing", "legitimate"]  
📌 7. Social media content classification = ["meme", "news", "personal story", "opinion", "advertisement"]  
📌 8. Customer message analysis = ["complaint", "suggestion", "praise", "question"]  

__Example:__  
text = "The new iPhone has amazing camera features and a sleek design."  
candidate_labels = ["technology", "sports", "entertainment", "business"]  
result = classifier_3(text, candidate_labels)  
print(result)  
💡 How to choose candidate_labels?  
- Pick keywords that have a clear, unambiguous meaning.  
- Use between 3 and 10 categories in total (too many can confuse the model).
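
The two tips above can be turned into a quick check before calling the classifier. This is a hypothetical helper, not part of Transformers; the 3 to 10 range is the rule of thumb from this notebook, not a library limit.

```python
def check_candidate_labels(labels, min_count=3, max_count=10):
    """Return cleaned labels, or raise ValueError if they break the rules of thumb."""
    cleaned = [label.strip() for label in labels]
    if any(not label for label in cleaned):
        raise ValueError("labels must be non-empty")
    if len(set(label.lower() for label in cleaned)) != len(cleaned):
        raise ValueError("labels must be distinct")
    if not min_count <= len(cleaned) <= max_count:
        raise ValueError(f"expected {min_count} to {max_count} labels, got {len(cleaned)}")
    return cleaned


labels = check_candidate_labels(["technology", "sports", "entertainment", "business"])
print(labels)
```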

## <font color="orange">Text generation</font>  
model = `"openai-community/gpt2"`  
model=`"distilgpt2"`

The main idea is that you provide a prompt and the model completes it by generating the remaining text. Text generation involves randomness, so it is normal to get different results on each run.
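
To see why repeated runs differ, here is a toy sketch of sampling in plain Python. The next-token probabilities are invented and no model is involved; it only shows that drawing from a distribution gives varying continuations unless the random seed is fixed.

```python
import random

# Invented next-token distribution for the prompt (illustration only).
next_tokens = ["use", "build", "train", "apply"]
probabilities = [0.4, 0.3, 0.2, 0.1]


def sample_tokens(seed, length=5):
    """Draw `length` tokens from the toy distribution with a seeded generator."""
    rng = random.Random(seed)
    return rng.choices(next_tokens, weights=probabilities, k=length)


# Same seed, same draw; different seeds usually differ.
print(sample_tokens(seed=0) == sample_tokens(seed=0))
print(sample_tokens(seed=1))
print(sample_tokens(seed=2))
```

With the real pipeline, calling `transformers.set_seed` before generation plays the same role as the seed here.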

In [None]:
generator_1 = pipeline("text-generation", model = "openai-community/gpt2")
generator_1("In this course, we will teach you how to")

In [None]:
generator_2 = pipeline("text-generation", model="distilgpt2")
generator_2(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

In [None]:
! pip install bitsandbytes

## <font color="orange">Filling in masked text (fill-mask)</font>  
model = `"distilroberta-base"`  
model=`"distilbert-base-uncased"`

In [50]:
# unmasker = pipeline("fill-mask", model = "distilbert/distilroberta-base")
# unmasker("This course will teach you all about <mask> models.", top_k=2)
unmasker = pipeline("fill-mask", model = "distilbert-base-uncased")
unmasker("This course will teach you all about [MASK] models.", top_k=2)

Device set to use cpu


[{'score': 0.09235800057649612,
  'token': 8045,
  'token_str': 'mathematical',
  'sequence': 'this course will teach you all about mathematical models.'},
 {'score': 0.027659984305500984,
  'token': 2535,
  'token_str': 'role',
  'sequence': 'this course will teach you all about role models.'}]
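
The commented-out and active lines above differ in the mask token: the RoBERTa-style checkpoint expects `<mask>`, the BERT-style one `[MASK]`. A small sketch that builds the prompt from a template so the token is not hard-coded; the `MASK_TOKENS` table and `build_prompt` helper are illustrative, and with a loaded pipeline the value can instead be read from `unmasker.tokenizer.mask_token`.

```python
# Mask tokens for the two checkpoints used in this section.
MASK_TOKENS = {
    "distilbert/distilroberta-base": "<mask>",
    "distilbert-base-uncased": "[MASK]",
}


def build_prompt(template, model_name):
    """Fill the {mask} placeholder with the mask token of the given checkpoint."""
    return template.format(mask=MASK_TOKENS[model_name])


template = "This course will teach you all about {mask} models."
for name in MASK_TOKENS:
    print(name, "->", build_prompt(template, name))
```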

## <font color="orange">ner (named entity recognition)</font>  
model = `"dbmdz/bert-large-cased-finetuned-conll03-english"`

In [9]:
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
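
The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing. A sketch using the output above as data:

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."
# Entities copied from the cell output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]

# Group the sliced spans by entity type.
by_type = {}
for entity in entities:
    span = text[entity["start"]:entity["end"]]
    by_type.setdefault(entity["entity_group"], []).append(span)

print(by_type)
```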

## <font color="orange">question-answering</font>  
model = `"distilbert-base-cased-distilled-squad"`


In [None]:
question_answerer_1 = pipeline("question-answering", model = "distilbert-base-cased-distilled-squad")
question_answerer_1(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

In [12]:
question_answerer_2 = pipeline('question-answering', model = "distilbert-base-uncased-finetuned-sst-2-english")
question_answerer_2(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
# Some weights of DistilBertForQuestionAnswering were not initialized from the checkpoint distilbert-base-uncased-finetuned-sst-2-english
# and were newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
# This model should probably be TRAINED on a downstream task before it is used for predictions and inference.

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.008997570723295212,
 'start': 8,
 'end': 29,
 'answer': 'is Sylvain and I work'}
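
The near-zero score and the meaningless span are what a randomly initialized QA head produces. A sketch of a guard around QA results, using the output above as data; the `accept_answer` helper and the 0.5 threshold are arbitrary illustrations, not a library feature:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
# Result copied from the cell output above.
result = {
    "score": 0.008997570723295212,
    "start": 8,
    "end": 29,
    "answer": "is Sylvain and I work",
}


def accept_answer(result, context, min_score=0.5):
    """Return the answer only if it matches the context span and is confident."""
    span = context[result["start"]:result["end"]]
    if span != result["answer"] or result["score"] < min_score:
        return None
    return span


print(accept_answer(result, context))
```

The answer is rejected here because its score is far below the threshold, even though the offsets do match the context.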

## <font color="orange">summarization</font>  
model = `"sshleifer/distilbart-cnn-12-6"`

`distilbart-cnn-12-6` is a distilled BART checkpoint fine-tuned for news summarization. Tasks it suits:  
- Text summarization: generating short summaries of news articles.
- Text annotations: producing brief overviews of longer texts.

It is not a translation model: it was trained only to summarize English text.

In [15]:
from transformers import pipeline

summarizer_1 = pipeline("summarization")
summarizer_1(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

In [19]:
summarizer_2 = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""

summary = summarizer_2(text, max_length=50, min_length=25, do_sample=False)
print(summary)


Device set to use cpu


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering . Rapidly developing economies such as'}]


## <font color="orange">Translation</font>  
model = `"Helsinki-NLP/opus-mt-fr-en"`

In [20]:
translator_1 = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator_1("Ce cours est produit par Hugging Face.")

Device set to use cpu


[{'translation_text': 'This course is produced by Hugging Face.'}]

In [21]:
# This translates poorly: distilbart-cnn-12-6 is a summarization model, not a translation model
translator_2 = pipeline("translation", model="sshleifer/distilbart-cnn-12-6")
translator_2("Ce cours est produit par Hugging Face.")


Device set to use cpu


[{'translation_text': ' Ce cours est produit par Hugging Face . Hugging faces are produced by Hugging Faces and Hugging Foursies . The film is based in Paris, Paris, France, and is produced on the basis of Hugging Eyes and Faces . It is not the first time Hugging Faces has been produced in France .'}]
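
The last cell shows what happens when a task is paired with a checkpoint trained for something else. One way to guard against that is a small lookup of the task and checkpoint pairs used in this notebook; the `TASK_MODELS` table and `model_for` helper are hypothetical conveniences:

```python
# Task -> checkpoint pairs taken from the cells in this notebook.
TASK_MODELS = {
    "text-classification": "distilbert-base-uncased-finetuned-sst-2-english",
    "zero-shot-classification": "facebook/bart-large-mnli",
    "text-generation": "distilgpt2",
    "fill-mask": "distilbert-base-uncased",
    "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    "question-answering": "distilbert-base-cased-distilled-squad",
    "summarization": "sshleifer/distilbart-cnn-12-6",
    "translation": "Helsinki-NLP/opus-mt-fr-en",
}


def model_for(task):
    """Look up the checkpoint for a task, failing loudly on unknown tasks."""
    try:
        return TASK_MODELS[task]
    except KeyError:
        raise ValueError(f"no checkpoint registered for task {task!r}") from None


print(model_for("translation"))
```

A call such as `pipeline(task, model=model_for(task))` then keeps each task tied to a checkpoint that was trained for it.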