### Transformers

#### First tool: Pipelines

In [1]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis") # create a classifier object
classifier("We are very happy to show you the 🤗 Transformers library.") # classify a sentence


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

The pipeline object is the most basic object in the Hugging Face Transformers library. It connects the preprocessing and postprocessing steps around a model, so you can pass in raw text and get back an intelligible answer.
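
The log above warns against relying on the default checkpoint. A minimal sketch of pinning the model and revision follows, assuming the checkpoint name and revision hash printed in that log; `PINNED_SENTIMENT` and `build_pinned_classifier` are hypothetical names used only for illustration.

```python
# Checkpoint name and revision copied from the log message above.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_pinned_classifier(spec=PINNED_SENTIMENT):
    """Create a pipeline whose checkpoint is fixed rather than defaulted."""
    # Imported lazily: the first call downloads the model, which needs network access.
    from transformers import pipeline

    return pipeline(spec["task"], model=spec["model"], revision=spec["revision"])
```

Calling `build_pinned_classifier()("some text")` should then behave like the cell above, without depending on whichever default the library ships.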

In [7]:
question_answer = pipeline("question-answering")
question_answer(
    question="Where does Curtis live?",
    context="My name is Curtis Pond and I work at Workday. I live in Monterey.",
)


No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9701592922210693, 'start': 56, 'end': 64, 'answer': 'Monterey'}

In [6]:
ner_extraction = pipeline("ner", aggregation_strategy="simple")  # group sub-word tokens into entities; replaces the deprecated grouped_entities=True
ner_extraction("My name is Curtis Pond and I work at Workday. I live in Monterey.")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.9993125,
  'word': 'Curtis Pond',
  'start': 11,
  'end': 22},
 {'entity_group': 'ORG',
  'score': 0.9889389,
  'word': 'Workday',
  'start': 37,
  'end': 44},
 {'entity_group': 'LOC',
  'score': 0.9805864,
  'word': 'Monterey',
  'start': 56,
  'end': 64}]
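
The `start` and `end` fields in the question-answering and NER outputs above are character offsets into the input string. The short check below, using only the values printed above, slices the original sentence with those offsets; the `spans` list and `extract` helper are illustrative names, not part of the library.

```python
context = "My name is Curtis Pond and I work at Workday. I live in Monterey."

# (label, start, end) triples copied from the NER output above.
spans = [("PER", 11, 22), ("ORG", 37, 44), ("LOC", 56, 64)]


def extract(text, start, end):
    """Return the substring covered by a half-open [start, end) character span."""
    return text[start:end]


recovered = {label: extract(context, start, end) for label, start, end in spans}
print(recovered)
```

Each slice reproduces the `word` (or `answer`) field, which is handy when you need to highlight the match in the source text.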

There are three main steps involved when you pass some text to a pipeline:

1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.
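
The three steps can be sketched by hand. This is an illustration under stated assumptions, not the library's internal implementation: `softmax`, `postprocess`, and `classify` are hypothetical helper names, and the default checkpoint name is taken from the sentiment-analysis log earlier in this notebook.

```python
import math


def softmax(logits):
    """Convert raw model scores (logits) into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Step 3: turn logits into a pipeline-style {'label', 'score'} dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


def classify(text, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Run the three steps manually (downloads the checkpoint on first call)."""
    # Imported lazily so the helpers above work without torch/transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")  # step 1: text -> token-id tensors
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()  # step 2: forward pass
    return postprocess(logits, model.config.id2label)  # step 3: logits -> label and score
```

`classify("We are very happy to show you the 🤗 Transformers library.")` should return a single dict comparable to the first cell's output, although the exact score may differ slightly across library versions.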

In [9]:
classifier = pipeline("zero-shot-classification")
classifier(
    "My product is a new type of machine learning algorithm that can predict the future.",
    candidate_labels=["technology", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'My product is a new type of machine learning algorithm that can predict the future.',
 'labels': ['technology', 'business', 'politics'],
 'scores': [0.9761189222335815, 0.021940218284726143, 0.0019408687949180603]}
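
The zero-shot output keeps `labels` and `scores` as parallel lists. A small sketch, using only the values printed above, pairs them up; `label_scores` and `top_label` are illustrative names. With the default settings used here (no `multi_label=True`), the scores are normalized across the candidate labels, so they should sum to roughly 1.

```python
# Values copied from the zero-shot output above.
result = {
    "labels": ["technology", "business", "politics"],
    "scores": [0.9761189222335815, 0.021940218284726143, 0.0019408687949180603],
}

# Pair each candidate label with its score.
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the most likely label and check the normalization.
top_label = max(label_scores, key=label_scores.get)
total = sum(result["scores"])

print(top_label, round(total, 6))
```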

In [11]:
generator = pipeline("text-generation")
generator("My name is Curtis Pond and I work at Workday. My goal is to",
          max_length=30,
          num_return_sequences=2)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "My name is Curtis Pond and I work at Workday. My goal is to be a firefighter with amazing skills. It's the heart and soul of"},
 {'generated_text': 'My name is Curtis Pond and I work at Workday. My goal is to educate everyone on the best way to have fun as well as the most'}]

In [12]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use your knowledge to build a learning experience with the skills you take to create your own training experiences.'},
 {'generated_text': 'In this course, we will teach you how to be a part of the team and the organization."'}]

#### Fill Mask

In [13]:
unmask = pipeline("fill-mask")
unmask("My name is Curtis Pond and I work at Workday. My goal is to <mask>.")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.17194245755672455,
  'token': 5318,
  'token_str': ' graduate',
  'sequence': 'My name is Curtis Pond and I work at Workday. My goal is to graduate.'},
 {'score': 0.12913177907466888,
  'token': 6726,
  'token_str': ' succeed',
  'sequence': 'My name is Curtis Pond and I work at Workday. My goal is to succeed.'},
 {'score': 0.05298031494021416,
  'token': 6008,
  'token_str': ' survive',
  'sequence': 'My name is Curtis Pond and I work at Workday. My goal is to survive.'},
 {'score': 0.052075281739234924,
  'token': 339,
  'token_str': ' win',
  'sequence': 'My name is Curtis Pond and I work at Workday. My goal is to win.'},
 {'score': 0.04167119041085243,
  'token': 10732,
  'token_str': ' publish',
  'sequence': 'My name is Curtis Pond and I work at Workday. My goal is to publish.'}]
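
Note that the mask placeholder depends on the checkpoint: the cell above uses `<mask>` with `distilroberta-base`, while the `bert-base-uncased` cell later in this notebook uses `[MASK]`. The sketch below keeps the prompt template independent of the model; `mask_prompt` is a hypothetical helper, and the two token strings are the ones that appear in this notebook.

```python
def mask_prompt(template, mask_token):
    """Fill the {mask} placeholder with the checkpoint's own mask token."""
    return template.format(mask=mask_token)


# Mask tokens as they appear in the two fill-mask cells of this notebook.
roberta_prompt = mask_prompt("My goal is to {mask}.", "<mask>")      # distilroberta-base
bert_prompt = mask_prompt("This man works as a {mask}.", "[MASK]")   # bert-base-uncased

print(roberta_prompt)
print(bert_prompt)
```

With a loaded pipeline, the token can be read from `unmask.tokenizer.mask_token` instead of being hard-coded.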

#### Part-of-speech tagging

In [15]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

model_name = "QCRI/bert-base-multilingual-cased-pos-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Use a distinct name so the imported `pipeline` function is not shadowed.
pos_tagger = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
outputs = pos_tagger("A test example")
print(outputs)




[{'entity': 'DT', 'score': 0.9997243, 'index': 1, 'word': 'A', 'start': 0, 'end': 1}, {'entity': 'NN', 'score': 0.9997472, 'index': 2, 'word': 'test', 'start': 2, 'end': 6}, {'entity': 'NN', 'score': 0.99973196, 'index': 3, 'word': 'example', 'start': 7, 'end': 14}]


### Bias

In [18]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
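
The two prompts differ only in "man" versus "woman", yet the top five completions do not overlap at all. A short check on the lists printed above makes this explicit; the variable names are illustrative.

```python
# Top-5 completions copied from the output above.
man_jobs = ["carpenter", "lawyer", "farmer", "businessman", "doctor"]
woman_jobs = ["nurse", "maid", "teacher", "waitress", "prostitute"]

# Occupations predicted for both prompts.
shared = set(man_jobs) & set(woman_jobs)

print(f"shared occupations: {len(shared)}")
```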


### Translations

In [16]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")



[{'translation_text': 'This course is produced by Hugging Face.'}]