# Hugging Face - lesson 1

In [1]:
from transformers import pipeline



Run this cell only once, the first time, to download the default Hugging Face model for each of the tasks below.

In [2]:
pipeline("text-classification")
pipeline("feature-extraction")
pipeline("fill-mask")
pipeline("ner")
pipeline("question-answering")
pipeline("summarization")
pipeline("zero-shot-classification")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
No model was supplied, defaulted to distilbert/distilbert-base-cased and revision 935ac13 (https://huggingface.co/distilbert/distilbert-base-cased).
Using a pipeline without specifying a model name and revision in production is not recommended.
No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initia

<transformers.pipelines.zero_shot_classification.ZeroShotClassificationPipeline at 0x28b806921d0>
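The log above warns against relying on default models in production. One way to follow that advice is to pin the model and revision explicitly. The sketch below is only an illustration: `PINNED_MODELS` and `load_pinned` are hypothetical names, and the model ids and revisions are copied from the log messages in this cell.

```python
# Task -> (model id, revision), copied from the log messages above.
PINNED_MODELS = {
    "text-classification": (
        "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        "af0f99b",
    ),
    "feature-extraction": ("distilbert/distilbert-base-cased", "935ac13"),
    "fill-mask": ("distilbert/distilroberta-base", "ec58a5b"),
}


def load_pinned(task):
    """Build a pipeline with an explicit model and revision (hypothetical helper)."""
    # Imported lazily so the mapping can be inspected without loading transformers.
    from transformers import pipeline

    model, revision = PINNED_MODELS[task]
    return pipeline(task, model=model, revision=revision)


# Example (downloads the model on first use):
# unmasker = load_pinned("fill-mask")
```

Pinning both values means a later change to the library defaults or to the model repository should not silently change the results.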

## Sentiment Analysis

In [2]:
sa_clf = pipeline("sentiment-analysis")
sa_clf("I've been looking for a God and Jesus appeared.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.7388884425163269}]
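The classifier returns one dictionary per input text, with a `label` and a `score`. A minimal sketch of pairing inputs with their predictions follows; `label_texts` is a hypothetical helper, and the data is copied from the cell above so the snippet runs without loading a model.

```python
def label_texts(texts, predictions):
    """Pair each input text with its predicted label and rounded score."""
    return [
        (text, pred["label"], round(pred["score"], 3))
        for text, pred in zip(texts, predictions)
    ]


# Input and output copied from the cell above.
texts = ["I've been looking for a God and Jesus appeared."]
predictions = [{"label": "NEGATIVE", "score": 0.7388884425163269}]

for text, label, score in label_texts(texts, predictions):
    print(f"{label} ({score}): {text}")
```

The same helper works unchanged when the classifier is called with a list of several texts, since the output list has one entry per input.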

In [3]:
res = sa_clf([
    "I've been looking for a God and Jesus appeared.",
    "There is nowhere that i can run to but to God who show mercy to the world",
])

## Zero Shot Classification

In [11]:
zs_clf = pipeline("zero-shot-classification")
res = zs_clf(
    "Can you help me to check if i can collect my passport tomorrow? I need it very urgently as i will be flying to China",
    candidate_labels=["expedite", "collection", "Singapore"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [12]:
res

{'sequence': 'Can you help me to check if i can collect my passport tomorrow? I need it very urgently as i will be flying to China',
 'labels': ['expedite', 'collection', 'Singapore'],
 'scores': [0.5579254031181335, 0.4045458436012268, 0.037528764456510544]}
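In the result above the three scores add up to roughly 1, so the candidate labels compete with each other and "collection" loses out to "expedite" even though both fit the request. A small sketch of reading the result with a confidence cut-off is shown below; `top_label` and the 0.5 threshold are assumptions for illustration, not part of the library.

```python
def top_label(result, threshold=0.5):
    """Return the best-scoring label, or None if it falls below the threshold."""
    # Pair labels with scores and pick the maximum explicitly,
    # rather than relying on the order of the lists.
    label, score = max(zip(result["labels"], result["scores"]), key=lambda p: p[1])
    return label if score >= threshold else None


# Labels and scores copied from the output above.
result = {
    "labels": ["expedite", "collection", "Singapore"],
    "scores": [0.5579254031181335, 0.4045458436012268, 0.037528764456510544],
}

print(top_label(result))                 # expedite
print(top_label(result, threshold=0.6))  # None
```

If several labels can apply at once, the zero-shot pipeline also accepts `multi_label=True`, which scores each label independently instead of making them compete.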

## Text Generation

In [13]:
text_gen = pipeline("text-generation")
res = text_gen("I go to ICA to collect")
res

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "I go to ICA to collect our first shipment. I need to collect my first shipment in 3 days. We're at the back of our building. In that time we will have 3 delivery slots.\n\nWe have received our first delivery slots"}]
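Note that `generated_text` starts with the original prompt. A minimal sketch of keeping only the newly generated part follows; `continuation` is a hypothetical helper, and the strings are copied from the cell above.

```python
def continuation(prompt, generated_text):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt is not a prefix.
    return generated_text


prompt = "I go to ICA to collect"
generated = (
    "I go to ICA to collect our first shipment. I need to collect my first "
    "shipment in 3 days. We're at the back of our building. In that time we "
    "will have 3 delivery slots.\n\nWe have received our first delivery slots"
)

print(continuation(prompt, generated))
```

The text-generation pipeline also accepts `return_full_text=False`, which should give the same effect without post-processing.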

Specifying a text-generation model explicitly instead of relying on the default:

In [17]:
dist_gpt2 = pipeline("text-generation", model="distilgpt2")
res = dist_gpt2(
    "I go to ICA to collect",
    max_length=30,
    num_return_sequences=2,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [20]:
print(res[0]['generated_text'])

I go to ICA to collect for the first time.
“I had a problem that I could not resolve. I thought I was going


## Mask Filling

A model that fills in the blank: it predicts the most likely tokens for the masked position in a sentence.

In [22]:
unmasker = pipeline("fill-mask")
unmasker("I go to <mask> in Singapore to collect my passport.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.3037779629230499,
  'token': 10102,
  'token_str': ' customs',
  'sequence': 'I go to customs in Singapore to collect my passport.'},
 {'score': 0.13516353070735931,
  'token': 461,
  'token_str': ' court',
  'sequence': 'I go to court in Singapore to collect my passport.'}]
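The mask placeholder depends on the model: the RoBERTa-based default above expects `<mask>`, while BERT models such as `bert-base-uncased` expect `[MASK]`. The sketch below builds the prompt from a template so the same sentence can be reused across models; `make_masked_prompt` and the `{mask}` slot are hypothetical conventions introduced here.

```python
def make_masked_prompt(template, mask_token):
    """Insert a model-specific mask token into a template with a {mask} slot."""
    return template.format(mask=mask_token)


template = "I go to {mask} in Singapore to collect my passport."

print(make_masked_prompt(template, "<mask>"))  # RoBERTa-style models
print(make_masked_prompt(template, "[MASK]"))  # BERT-style models

# With a loaded pipeline, the token can be read from the tokenizer:
# prompt = make_masked_prompt(template, unmasker.tokenizer.mask_token)
```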

## Named Entity Recognition (NER)
A model that detects which parts of a text correspond to entities such as a person, location or organisation.

In [8]:
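# aggregation_strategy="simple" merges sub-word tokens into whole entities;
# it replaces the deprecated grouped_entities=True.
ner = pipeline("ner", aggregation_strategy="simple")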
res = ner("My name is John and I work in DHL in Germany")
print(res)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER', 'score': 0.99883586, 'word': 'John', 'start': 11, 'end': 15}, {'entity_group': 'ORG', 'score': 0.9977331, 'word': 'DHL', 'start': 30, 'end': 33}, {'entity_group': 'LOC', 'score': 0.9996934, 'word': 'Germany', 'start': 37, 'end': 44}]
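Each entity carries `start` and `end` character offsets into the input text. A minimal sketch of using those offsets to group the matched spans by entity type follows; `entities_by_type` is a hypothetical helper, and the data is copied from the output above.

```python
def entities_by_type(text, entities):
    """Group the matched text spans by entity type using the character offsets."""
    grouped = {}
    for ent in entities:
        span = text[ent["start"]:ent["end"]]
        grouped.setdefault(ent["entity_group"], []).append(span)
    return grouped


text = "My name is John and I work in DHL in Germany"
entities = [
    {"entity_group": "PER", "score": 0.99883586, "word": "John", "start": 11, "end": 15},
    {"entity_group": "ORG", "score": 0.9977331, "word": "DHL", "start": 30, "end": 33},
    {"entity_group": "LOC", "score": 0.9996934, "word": "Germany", "start": 37, "end": 44},
]

print(entities_by_type(text, entities))
# {'PER': ['John'], 'ORG': ['DHL'], 'LOC': ['Germany']}
```

Slicing the original text rather than reading `word` keeps the original casing and spacing, which can matter for models that normalise their tokens.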


## Question-Answering
A model that answers a question by extracting the answer from a given context.

In [10]:
ques_ans = pipeline("question-answering")

ques_ans(
    question="Where do i work",
    context="I live in Germany near a town call bonn. I work as a data analyst in DHL",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9072030186653137, 'start': 69, 'end': 72, 'answer': 'DHL'}
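The answer is extracted from the context rather than generated: `start` and `end` are character offsets into the context string. The sketch below checks this and shows the answer with some surrounding text; `answer_with_window` and the 20-character width are assumptions for illustration.

```python
def answer_with_window(context, result, width=20):
    """Return the answer plus up to `width` characters of context on each side."""
    start = max(0, result["start"] - width)
    end = min(len(context), result["end"] + width)
    return context[start:end]


# Context and result copied from the cell above.
context = "I live in Germany near a town call bonn. I work as a data analyst in DHL"
result = {"score": 0.9072030186653137, "start": 69, "end": 72, "answer": "DHL"}

# The offsets point at the answer inside the context.
print(context[result["start"]:result["end"]])  # DHL
print(answer_with_window(context, result))
```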

## Summarisation
A model that summarises a longer text into a shorter one.

In [13]:
summarizer = pipeline("summarization")

summarizer(
    """
    Agriculture encompasses crop and livestock production, aquaculture, and forestry
    for food and non-food products.[1] Agriculture was a key factor in the rise of
    sedentary human civilization, whereby farming of domesticated species created
    food surpluses that enabled people to live in cities. While humans started
    gathering grains at least 105,000 years ago, nascent farmers only began 
    planting them around 11,500 years ago. Sheep, goats, pigs, and cattle were
    domesticated around 10,000 years ago. Plants were independently cultivated
    in at least 11 regions of the world. In the 20th century, industrial agriculture
    based on large-scale monocultures came to dominate agricultural output.
    """,
    max_length=30,
    min_length=10,
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Agriculture was a key factor in the rise of sedentary human civilization, whereby farming of domesticated species created food surpluses that enabled'}]

## Translation
A model that translates text from one language to another.

In [14]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [16]:
translator("il y a trois chien et un chat.")

[{'translation_text': 'There are three dogs and one cat.'}]

## Biases

In [2]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [24]:
unmasker("This man is [MASK] lunch.")

[{'score': 0.14701026678085327,
  'token': 5983,
  'token_str': 'eating',
  'sequence': 'this man is eating lunch.'},
 {'score': 0.14547321200370789,
  'token': 2026,
  'token_str': 'my',
  'sequence': 'this man is my lunch.'},
 {'score': 0.05538772791624069,
  'token': 2010,
  'token_str': 'his',
  'sequence': 'this man is his lunch.'},
 {'score': 0.04142344370484352,
  'token': 2437,
  'token_str': 'making',
  'sequence': 'this man is making lunch.'},
 {'score': 0.03949165344238281,
  'token': 11065,
  'token_str': 'stealing',
  'sequence': 'this man is stealing lunch.'}]

In [5]:
# Gender bias: compare the top predictions for two prompts that differ only in gender
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
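The two lists above have no occupation in common. A small sketch of making that comparison explicit follows; `compare_predictions` is a hypothetical helper, and the lists are copied from the output above.

```python
def compare_predictions(first, second):
    """Report which predictions are shared and which are unique to each prompt."""
    set_first, set_second = set(first), set(second)
    return {
        "shared": sorted(set_first & set_second),
        "only_first": sorted(set_first - set_second),
        "only_second": sorted(set_second - set_first),
    }


# Predictions copied from the output above.
man = ["carpenter", "lawyer", "farmer", "businessman", "doctor"]
woman = ["nurse", "maid", "teacher", "waitress", "prostitute"]

diff = compare_predictions(man, woman)
print(diff["shared"])  # []
print(diff["only_first"])
print(diff["only_second"])
```

This only compares the top five predictions for one prompt pair, so it illustrates the bias rather than measuring it.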
