In [5]:
!python --version

Python 3.10.12


In [None]:
# !pip install transformers

In [6]:
# NLP Purpose
!pip install "transformers[sentencepiece]"



In [9]:
import warnings
warnings.filterwarnings("ignore")

In [10]:
import transformers

## **What is NLP?**
NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.

## **Transformers, what can they do?**
The 🤗 **Transformers library** provides the functionality to create and use shared pretrained models. The **Model Hub** contains thousands of pretrained models that anyone can download and use.

**`pipeline()`** 👉 **connects a model** with its necessary **preprocessing** and **postprocessing** steps, allowing us to directly input any text and get an intelligible answer:

In [11]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]

You can even pass several sentences!

In [12]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
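To make the postprocessing step more concrete, here is a minimal sketch of how raw model scores (logits) could be turned into the `label`/`score` dictionaries shown above. The logits and the helper names `softmax` and `postprocess` are made up for illustration; the real pipeline also handles tokenization and the model forward pass.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable class and report it with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one sentence: index 0 = NEGATIVE, index 1 = POSITIVE.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
prediction = postprocess([-1.5, 1.7], id2label)
print(prediction["label"])
```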

Some of the currently available pipelines are:

- `feature-extraction` (get the vector representation of a text)
- `fill-mask`
- `ner` (named entity recognition)
- `question-answering`
- `sentiment-analysis`
- `summarization`
- `text-generation`
- `translation`
- `zero-shot-classification`

### **Zero-shot Classification**
Zero-shot classification is a challenging task in which we need to **classify texts that haven’t been labelled**. For this use case, the `zero-shot-classification` pipeline is very powerful: it **allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model**.

➡ *Annotating text by hand is usually time-consuming and requires domain expertise.*

In [13]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445993661880493, 0.1119738519191742, 0.04342673718929291]}

This pipeline is called `zero-shot` because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!
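One way to picture what happens under the hood: the default checkpoint, `facebook/bart-large-mnli`, is a natural language inference model, so each candidate label can be rewritten as a hypothesis sentence and scored for how strongly the input text entails it. The sketch below shows only that bookkeeping. The template string, the helper names, and the entailment scores are assumptions for illustration, not values produced by the model.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a sentence an NLI model can score."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment scores and sort labels by probability."""
    shift = max(entailment_logits)
    exps = [math.exp(x - shift) for x in entailment_logits]
    total = sum(exps)
    scored = sorted(
        zip(labels, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }


candidate_labels = ["education", "politics", "business"]
hypotheses = build_hypotheses(candidate_labels)

# Made-up entailment logits, one per hypothesis, in the same order as the labels.
ranking = rank_labels(candidate_labels, [2.0, -1.0, 0.0])
print(ranking["labels"])
```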

### **Text generation**
The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves ***randomness***, so **it’s normal if you don’t get the same results each time you run it**.

In [15]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create a simple, functional web application with PHP 7 and MySQL 8. It will also teach you how to build the source and build its dependencies.\n\nTo learn the PHP syntax and basic scripting,'}]

You can control:
- how many different sequences are generated, with the argument `num_return_sequences`; and
- the total length of the output text, with the argument `max_length`.

In [16]:
generator = pipeline("text-generation", num_return_sequences=3, max_length=100)
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use all these great technologies in an engaging and insightful way.\n\nWe offer courses and workshops to educate you and improve your approach to learning, focusing on two main topics:\n\nBuilding your knowledge, and\n\nDeveloping your skills.\n\nHow Do You Learn?\n\nIn this course, we will show you how to apply them all together. Your knowledge base is huge. Your skills, your knowledge base will grow. And now'},
 {'generated_text': 'In this course, we will teach you how to implement the F-Type class.\n\nPrerequisites: This course is offered in English, French and Spanish in France.\n\nCourse objectives:\n\n1) To create a F-Type framework.\n\n2) To implement the F-Type class from scratch.\n\n3) To describe the F-Type concept with reference to the C system.\n\n4) To solve the problems of the F-Type system in a'},
 {'generated_text': "In this course, we will teach you how to use MSP in Python 2.7. You will learn about MSP's mo
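The effect of sampling, `max_length`, and `num_return_sequences` can be imitated with a toy generator. The sketch below is not how GPT-2 works internally: the tiny next-word table and the `generate` helper are invented for illustration, and length is counted in words rather than tokens.

```python
import random

# Hypothetical next-word table: each word maps to the words that may follow it.
NEXT_WORDS = {
    "to": ["build", "train", "use"],
    "build": ["models", "apps"],
    "train": ["models"],
    "use": ["models", "pipelines"],
}


def generate(prompt, max_length, num_return_sequences, seed=None):
    """Sample continuations word by word until max_length words or a dead end."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_return_sequences):
        words = prompt.split()
        while len(words) < max_length and words[-1] in NEXT_WORDS:
            words.append(rng.choice(NEXT_WORDS[words[-1]]))
        results.append(" ".join(words))
    return results


samples = generate("how to", max_length=4, num_return_sequences=3, seed=0)
for text in samples:
    print(text)
```

Calling `generate` again with the same seed reproduces the same texts, while a different seed (or no seed) may give different ones, which mirrors why repeated runs of the pipeline differ.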

### **Use any Model**
You can also choose a particular model from the Hub to use in a pipeline for a specific task.

In [17]:
# Try distilgpt2 model
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create a project using jQuery.'},
 {'generated_text': 'In this course, we will teach you how to make changes to your mind and focus on your emotions, and how to work on your goals. For'}]
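The log messages earlier in this notebook warn against relying on the default model in production. A hedged sketch of pinning both the checkpoint and its revision is shown below; the model name and revision are the ones printed in the sentiment-analysis log, and the wrapper function `build_pinned_classifier` is a name chosen here for illustration. The import sits inside the function so that defining it does not download anything.

```python
# Values copied from the log printed by the sentiment-analysis cell.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"


def build_pinned_classifier(model_id=MODEL_ID, revision=REVISION):
    """Create a sentiment-analysis pipeline tied to one checkpoint and revision."""
    from transformers import pipeline  # imported lazily; needs transformers installed

    return pipeline("sentiment-analysis", model=model_id, revision=revision)
```

Calling `build_pinned_classifier()` should then behave like the `classifier` defined earlier, without depending on whichever default the installed library version ships.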

### **Mask Filling**
The next pipeline is `fill-mask`. The idea of this task is to fill in the blanks in a given text. The `top_k` argument controls how many candidates are returned.

In [18]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.19619794189929962,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052729159593582,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

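Note that here the model fills in the special `<mask>` word, which is often referred to as a *mask token*. Other mask-filling models may use a different mask token (for example, `[MASK]` for BERT checkpoints), so it is worth checking which one applies when trying another model.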
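The role of `top_k` can be sketched without a model: given a probability for each candidate token, keep the `top_k` most likely ones and substitute each into the text. The first two scores below are rounded from the output above; the third candidate and the `fill_mask` helper are hypothetical.

```python
def fill_mask(text, candidates, top_k=2, mask_token="<mask>"):
    """Return the top_k most probable fillers, each with the completed sentence."""
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [
        {"score": score, "token_str": token, "sequence": text.replace(mask_token, token)}
        for token, score in ranked[:top_k]
    ]


# Scores for the first two tokens are rounded from the notebook output;
# "predictive" and its score are invented to show what top_k filters out.
candidates = {"computational": 0.0405, "mathematical": 0.1962, "predictive": 0.01}
results = fill_mask("This course will teach you all about <mask> models.", candidates)
for item in results:
    print(item["sequence"])
```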

### **Named Entity Recognition**
Named entity recognition (NER) is a task where the model has to **find which parts of the input text correspond to entities** such as persons, locations, or organizations.

In [19]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

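Here the model correctly identified that Sylvain is a person (PER), Hugging Face an organization (ORG), and Brooklyn a location (LOC). The option `grouped_entities=True` tells the pipeline to regroup the parts of the sentence that correspond to the same entity, which is why “Hugging Face” is returned as a single organization even though the name consists of two words.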
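A rough sketch of what grouping entities means: token-level predictions that are adjacent and share the same entity type are merged into one span. The token labels below are written by hand to match the output above (with `O` marking tokens that are not entities), and `group_entities` is an illustrative helper, not the library's implementation.

```python
import re


def group_entities(text, token_labels):
    """Merge consecutive tokens that carry the same non-"O" entity label."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    groups = []
    previous_label = "O"
    for (word, start, end), label in zip(tokens, token_labels):
        if label != "O" and label == previous_label:
            groups[-1]["end"] = end  # extend the current entity span
        elif label != "O":
            groups.append({"entity_group": label, "start": start, "end": end})
        previous_label = label
    for group in groups:
        group["word"] = text[group["start"]:group["end"]]
    return groups


sentence = "My name is Sylvain and I work at Hugging Face in Brooklyn."
# One hand-written label per word, in order.
labels = ["O", "O", "O", "PER", "O", "O", "O", "O", "ORG", "ORG", "O", "LOC"]
entities = group_entities(sentence, labels)
for entity in entities:
    print(entity)
```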

### **Question Answering**
The `question-answering` pipeline answers questions using information from a given context.

In [20]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

Note that this pipeline works by extracting information from the provided context; it does not generate the answer.
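Extraction can be pictured as choosing a start position and an end position inside the context. In the sketch below, each word gets a hypothetical start score and end score, and the answer is the span with the highest combined score. The scores and the `best_span` helper are assumptions for illustration; a real model scores sub-word tokens, not whitespace-separated words.

```python
import re


def best_span(context, start_scores, end_scores):
    """Return the answer span (start <= end) with the highest start + end score."""
    offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]
    best = None
    for i, start_score in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            total = start_score + end_scores[j]
            if best is None or total > best[0]:
                best = (total, i, j)
    _, i, j = best
    start, end = offsets[i][0], offsets[j][1]
    return {"start": start, "end": end, "answer": context[start:end]}


context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
# Made-up scores, one per word: "Hugging" is the likeliest start, "Face" the likeliest end.
start_scores = [0.1, 0.0, 0.0, 0.5, 0.0, 0.0, 0.2, 0.3, 4.0, 0.4, 0.0, 1.0]
end_scores = [0.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.1, 0.0, 0.5, 3.5, 0.0, 1.2]
answer = best_span(context, start_scores, end_scores)
print(answer)
```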

### **Summarization**
Summarization is the task of **reducing a text into a shorter text while keeping all (or most) of the important aspects** referenced in the text.

In [21]:
summarizer = pipeline("summarization")  # the call below also accepts max_length and min_length
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

### **Translation**
For translation, you can use a default model if you provide a language pair in the task name (such as `translation_en_to_fr`), but the easiest way is to pick the model you want to use on the Model Hub.

In [None]:
# Translation French to English
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

The pipelines shown so far are mostly for demonstration purposes. They were programmed for specific tasks and cannot perform variations of them.

## **How do Transformers Work?**
The [Transformer architecture](https://arxiv.org/abs/1706.03762) was introduced in June 2017.

The focus of the original research was on translation tasks.

**Transformer models** can be grouped into three categories:
- **GPT**-like (also called auto-regressive Transformer models)
- **BERT**-like (also called auto-encoding Transformer models)
- **BART/T5**-like (also called sequence-to-sequence Transformer models)

### **Transformers are Language Models**

- **Self-supervised learning** ➡ a type of training in which the objective is *automatically computed from the inputs of the model*. That means that humans are not needed to label the data!
- **Transfer learning** ➡ the pretrained model is then fine-tuned in a supervised way, that is, using human-annotated labels, on a given task.
- **Causal language modeling** ➡ predict the next word in a sentence; the output depends on the past and present inputs, but not the future ones.
- **Masked language modeling** ➡ predict a masked word in a sentence.
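Both language modeling objectives build their training targets directly from raw text, which is what makes them self-supervised. A minimal sketch, with helper names and the `[MASK]` placeholder chosen for illustration (real models operate on sub-word tokens and pick masked positions at random):

```python
def causal_lm_examples(words):
    """Pair each prefix with the next word: the model never sees future words."""
    return [(words[:i], words[i]) for i in range(1, len(words))]


def masked_lm_example(words, position, mask_token="[MASK]"):
    """Hide one word; the model sees both sides and must recover it."""
    masked = list(words)
    target = masked[position]
    masked[position] = mask_token
    return masked, target


sentence = "My name is Sylvain".split()

causal = causal_lm_examples(sentence)
masked, target = masked_lm_example(sentence, position=2)

for prefix, next_word in causal:
    print(prefix, "->", next_word)
print(masked, "->", target)
```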

### **Transfer Learning**
- `Pretraining` is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.

- `Fine-tuning`, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task.
    
    The `fine-tuning` will only require a limited amount of data: the knowledge the pretrained model has acquired is “*transferred*,” hence the term *transfer learning*.

    This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should **always try to leverage a pretrained model — one as close as possible to the task you have at hand** — and fine-tune it.
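A toy numerical illustration of why starting from pretrained weights helps when data and training budget are limited. Everything here is invented for the sketch: a one-parameter model `y = w * x`, a two-point dataset, and a "pretrained" weight that is assumed to already be close to the target. It only mimics the idea; it is not a language model.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, steps, lr=0.1):
    """Run a few steps of gradient descent on the mean squared error."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Small task-specific dataset generated by y = 3 * x.
task_data = [(1.0, 3.0), (2.0, 6.0)]

w_scratch = train(0.0, task_data, steps=3)    # start with no prior knowledge
w_finetuned = train(2.5, task_data, steps=3)  # start from an assumed pretrained weight

print(mse(w_scratch, task_data), mse(w_finetuned, task_data))
```

With the same three update steps, the run that starts from the assumed pretrained weight ends closer to the target than the run that starts from zero.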

### **Architectures vs. Checkpoints**
- **Architecture**: This is the **skeleton of the model** — the definition of each layer and each operation that happens within the model.
- **Checkpoints**: These are **the weights that will be loaded in a given architecture**.
- **Model**: This is an umbrella term that isn't as precise as “architecture” or “checkpoint”: **it can mean both**.

`example:` **BERT** is an *architecture* while `bert-base-cased`, a set of weights trained by the Google team for the first release of **BERT**, is a *checkpoint*.
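The distinction can be mirrored in a few lines of plain Python: the class plays the role of the architecture, and each dictionary of numbers plays the role of a checkpoint. The class name, the weight values, and the checkpoint names are all hypothetical.

```python
class TinyClassifier:
    """Architecture: defines which operations run, but not the weight values."""

    def __init__(self, checkpoint):
        # Checkpoint: the concrete weights loaded into the architecture.
        self.weights = checkpoint["weights"]
        self.bias = checkpoint["bias"]

    def score(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


# Two made-up checkpoints for the same architecture.
checkpoint_a = {"weights": [1.0, -1.0], "bias": 0.0}
checkpoint_b = {"weights": [0.5, 2.0], "bias": 1.0}

model_a = TinyClassifier(checkpoint_a)
model_b = TinyClassifier(checkpoint_b)

features = [2.0, 1.0]
print(model_a.score(features), model_b.score(features))
```

Both objects share one architecture, yet they give different scores for the same input because they were loaded from different checkpoints.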

## **Encoder Models**
Encoder models use only the encoder of a Transformer model. At each stage, the **attention layers can access all the words in the initial sentence**. These models are often characterized as having “*bi-directional*” attention, and are often called *auto-encoding* models.

Encoder models are best suited for tasks requiring an **understanding of the full sentence**, such as **sentence classification**, **named entity recognition** (and more generally word classification), and **extractive question answering**.

Example:
- **ALBERT**
- **BERT**
- **DistilBERT**
- **ELECTRA**
- **RoBERTa**

## **Decoder Models**
Decoder models use only the decoder of a Transformer model. At each stage, for a given word the **attention layers can only access the words positioned before it in the sentence**. These models are often called *auto-regressive* models.

The pretraining of decoder models usually revolves around **predicting the next word in the sentence**.

Example:
- **CTRL**
- **GPT**
- **GPT-2**
- **Transformer XL**
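The difference between the bi-directional attention of encoder models and the left-to-right attention of decoder models can be drawn as a visibility mask, where row `i` lists which positions word `i` may attend to (`1` = visible, `0` = hidden). This is a simplified sketch; the function names are illustrative, and it follows the common convention that a position can also see itself.

```python
def bidirectional_mask(n):
    """Encoder-style: every position can see every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i sees positions 0..i, never later ones."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


words = ["My", "name", "is", "Sylvain"]
for name, mask in [("encoder", bidirectional_mask(len(words))),
                   ("decoder", causal_mask(len(words)))]:
    print(name)
    for word, row in zip(words, mask):
        print(f"  {word:>8}: {row}")
```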

## **Sequence-to-Sequence Models**
**Encoder-decoder** models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the **attention layers of the encoder can access all the words in the initial sentence**, whereas the **attention layers of the decoder can only access the words positioned before a given word in the input**.

Sequence-to-sequence models are best suited for tasks revolving around **generating new sentences depending on a given input**, such as **summarization**, **translation**, or **generative question answering**.

Example:
- **BART**
- **mBART**
- **Marian**
- **T5**

## **Bias and Limitations**
If you intend to use a pretrained model or a fine-tuned version in production, be aware that these models come with **limitations**. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet.

For example:

In [22]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
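A quick way to quantify the gap between the two lists is to check which suggestions they share. The lists below are copied from the output above; the variable names are illustrative.

```python
man_jobs = ["carpenter", "lawyer", "farmer", "businessman", "doctor"]
woman_jobs = ["nurse", "maid", "teacher", "waitress", "prostitute"]

# Occupations suggested for both prompts.
shared = set(man_jobs) & set(woman_jobs)
print(f"shared suggestions: {len(shared)} of {len(man_jobs)}")
```

The two top-5 lists have no occupation in common, even though the prompts differ by a single word.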


This happens even though **BERT** is one of the rare Transformer models not built by scraping data from all over the internet, but rather using apparently neutral data (it’s trained on the English Wikipedia and BookCorpus datasets).

⚠ ⚠

When you use these tools, you therefore need to keep in the back of your mind that **the original model** you are using **could very easily generate sexist, racist, or homophobic content**. **Fine-tuning** the model on your data **won’t make this intrinsic bias disappear**.