## Install Hugging Face Transformers
[More detailed code is available here](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/pytorch/quicktour.ipynb#scrollTo=rUsYoEj3kM4N)

In [None]:
!pip install transformers datasets

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading saf

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/tiZFewofSLM?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')



[Source](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/pytorch/quicktour.ipynb#scrollTo=2iCHdTn7kM4I)


- The [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) is the easiest way to use a pretrained model for inference. You can use the [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) out-of-the-box for many tasks across different modalities. Take a look at the table below for some supported tasks:

| **Task**                     | **Description**                                                                                              | **Modality**    | **Pipeline identifier**                       |
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------|
| Text classification          | assign a label to a given sequence of text                                                                   | NLP             | pipeline(task="sentiment-analysis")           |
| Text generation              | generate text that follows a given prompt                                                                    | NLP             | pipeline(task="text-generation")              |
| Named entity recognition     | assign a label to each token in a sequence (people, organization, location, etc.)                            | NLP             | pipeline(task="ner")                          |
| Question answering           | extract an answer from the text given some context and a question                                            | NLP             | pipeline(task="question-answering")           |
| Fill-mask                    | predict the correct masked token in a sequence                                                               | NLP             | pipeline(task="fill-mask")                    |
| Summarization                | generate a summary of a sequence of text or document                                                         | NLP             | pipeline(task="summarization")                |
| Translation                  | translate text from one language into another                                                                | NLP             | pipeline(task="translation")                  |
| Image classification         | assign a label to an image                                                                                   | Computer vision | pipeline(task="image-classification")         |
| Image segmentation           | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")           |
| Object detection             | predict the bounding boxes and classes of objects in an image                                                | Computer vision | pipeline(task="object-detection")             |
| Audio classification         | assign a label to an audio file                                                                              | Audio           | pipeline(task="audio-classification")         |
| Automatic speech recognition | extract speech from an audio file into text                                                                  | Audio           | pipeline(task="automatic-speech-recognition") |
| Visual question answering    | given an image and a question, correctly answer a question about the image                                   | Multimodal      | pipeline(task="vqa")                          |

Start by creating an instance of [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) and specifying a task you want to use it for. You can use the [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) for any of the previously mentioned tasks, and for a complete list of supported tasks, take a look at the [pipeline API reference](https://huggingface.co/docs/transformers/main/en/./main_classes/pipelines).

## Define pretrained models for various tasks

In [None]:
from transformers import pipeline

classifier = pipeline(task="sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
generator = pipeline(task='text-generation', model="distilgpt2")
fill_mask = pipeline(task='fill-mask', model='bert-base-uncased')
question_answering = pipeline(task='question-answering', model="deepset/roberta-base-squad2", tokenizer='deepset/roberta-base-squad2')

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



## Sentiment Classifier

In [None]:
classifier("It is too cold outside. But I love the winter season.")

[{'label': 'positive', 'score': 0.810588002204895}]

In [None]:
results = classifier(["It is too cold outside. I don't want to go outside today.",
                      "But Christmas is coming!"])
for res in results:
    print("label: {}, with score: {}".format(res['label'], res['score']))

label: negative, with score: 0.906129777431488
label: positive, with score: 0.84145188331604


## Text Generator

In [None]:
results = generator(['Baby it is cold outside', 'Finally it is summer time'])
print(results)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[[{'generated_text': "Baby it is cold outside with the cold air in it and the cold air in it.\n\nI don't know how many times I see a wet or dry cold air in my backyard.\nI used to live in a family farm. It"}], [{'generated_text': 'Finally it is summer time to go to the beach and to go to school.'}]]


In [None]:
print(results[0][0]['generated_text'])
print('*'*50)
print(results[1][0]['generated_text'])

Baby it is cold outside with the cold air in it and the cold air in it.

I don't know how many times I see a wet or dry cold air in my backyard.
I used to live in a family farm. It
**************************************************
Finally it is summer time to go to the beach and to go to school.


## Fill Mask

In [None]:
fill_results = fill_mask('A man is in a kitchen, and he is holding an empty mug. He walks towards a coffee machine to get [MASK].')

In [None]:
for res in fill_results:
    print(res['score'], '/', res['token_str'], '/', res['sequence'])

0.289387971162796 / coffee / a man is in a kitchen, and he is holding an empty mug. he walks towards a coffee machine to get coffee.
0.21969836950302124 / it / a man is in a kitchen, and he is holding an empty mug. he walks towards a coffee machine to get it.
0.16633819043636322 / one / a man is in a kitchen, and he is holding an empty mug. he walks towards a coffee machine to get one.
0.0972013771533966 / something / a man is in a kitchen, and he is holding an empty mug. he walks towards a coffee machine to get something.
0.03316611051559448 / water / a man is in a kitchen, and he is holding an empty mug. he walks towards a coffee machine to get water.
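
The `score` values printed above can be read as the probability the model assigns to each candidate token, so adding them up shows how much of the probability mass the five listed candidates cover. A minimal sketch using only the numbers copied from the output above (no model is needed; the variable names are illustrative):

```python
# (score, token) pairs copied from the fill-mask output above.
candidates = [
    (0.289387971162796, 'coffee'),
    (0.21969836950302124, 'it'),
    (0.16633819043636322, 'one'),
    (0.0972013771533966, 'something'),
    (0.03316611051559448, 'water'),
]

# Tuples compare by their first element, so max() picks the highest score.
best_score, best_token = max(candidates)

# Total probability covered by the five listed candidates.
top5_mass = sum(score for score, _ in candidates)

print(best_token, round(top5_mass, 3))
```

The remaining probability is spread over the rest of the vocabulary, which is why the scores listed do not sum to 1.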


## Question Answering

In [None]:
QA_input = {'question': 'What should a robot bring to man?',
            'context': fill_results[0]['sequence']}
results = question_answering(QA_input)
print(results)

{'score': 0.041885070502758026, 'start': 96, 'end': 102, 'answer': 'coffee'}
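
The `start` and `end` fields in the result are character offsets into the `context` string, so slicing the context with them should recover the answer. A minimal sketch with the context and result copied from the outputs above (no model is needed):

```python
# Context copied from the top fill-mask prediction shown earlier.
context = ("a man is in a kitchen, and he is holding an empty mug. "
           "he walks towards a coffee machine to get coffee.")

# Result dictionary copied from the question-answering output above.
result = {'score': 0.041885070502758026, 'start': 96, 'end': 102, 'answer': 'coffee'}

# Slicing the context with the reported offsets recovers the answer span.
span = context[result['start']:result['end']]
print(span)
```

Note that the reported score is low (about 0.04), so the extracted span should be treated with some caution even though it looks reasonable here.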


## Sentence similarity based on Sentence-BERT

In [None]:
model_name = "sentence-transformers/all-MiniLM-L6-v2"

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)


## The first method: non-parallel computation

In [None]:
sentences = ['In this paper, we propose a novel method called Transformer.',
             'We propose a novel model named Transformer.']
# Each call returns token-level features of shape [1, num_tokens, hidden_dim]
feature_1 = torch.tensor(feature_extractor(sentences[0]))
feature_2 = torch.tensor(feature_extractor(sentences[1]))
print(feature_1.shape, feature_2.shape)

# Average over the token dimension to get one vector per sentence
print(torch.mean(feature_1, dim=1).shape)
norm_feat_1 = F.normalize(torch.mean(feature_1, dim=1), dim=-1)
norm_feat_2 = F.normalize(torch.mean(feature_2, dim=1), dim=-1)
# The dot product of L2-normalized vectors is their cosine similarity
similarity = torch.sum(norm_feat_1 * norm_feat_2)
print('TWO SENTENCE Similarity: ', similarity.item())

torch.Size([1, 15, 384]) torch.Size([1, 11, 384])
torch.Size([1, 384])
TWO SENTENCE Similarity:  0.7857804298400879


## The second method: parallel computation with padding (recommended)

In [None]:
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
print(encoded_input.keys())
print(encoded_input['input_ids'])
print(encoded_input['attention_mask'])

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
tensor([[  101,  1999,  2023,  3259,  1010,  2057, 16599,  1037,  3117,  4118,
          2170, 10938,  2121,  1012,   102],
        [  101,  2057, 16599,  1037,  3117,  2944,  2315, 10938,  2121,  1012,
           102,     0,     0,     0,     0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])
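
The output above shows what `padding=True` does: the shorter sentence is right-padded with id 0 up to the length of the longest one, and the attention mask marks real tokens with 1 and padding with 0. The following plain-Python sketch reproduces that layout from the unpadded ids. `pad_batch` is a hypothetical helper written for illustration, not part of the tokenizer API, and the pad id of 0 is taken from the printed output.

```python
# Token ids copied from the tokenizer output above, with the padding removed.
ids_a = [101, 1999, 2023, 3259, 1010, 2057, 16599, 1037, 3117, 4118,
         2170, 10938, 2121, 1012, 102]
ids_b = [101, 2057, 16599, 1037, 3117, 2944, 2315, 10938, 2121, 1012, 102]


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build a 0/1 attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([ids_a, ids_b])
print(input_ids[1])
print(attention_mask[1])
```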


In [None]:
with torch.no_grad():
    model_output = model(**encoded_input)
print(model_output.keys())
print(model_output['last_hidden_state'].shape)
print(model_output['pooler_output'].shape)

odict_keys(['last_hidden_state', 'pooler_output'])
torch.Size([2, 15, 384])
torch.Size([2, 384])


In [None]:
# Mean pooling: use the attention mask so padding tokens do not affect the average
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings, shape [B, L, D]
    # Expand the mask from [B, L] to [B, L, D] so it can be multiplied elementwise
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
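
To see why the mask matters, here is a plain-Python sketch of the same idea for a single sentence. `masked_mean` is a hypothetical helper and the toy vectors are made up for illustration; the point is only that a padding position with a non-zero vector would distort an unmasked average.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
    # Same guard against division by zero as torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in total]


# Toy sentence: two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(tokens, mask))       # padding position ignored
print(masked_mean(tokens, [1, 1, 1]))  # padding position wrongly included
```

With the mask applied, the result is the mean of the two real tokens only. Treating the padding position as real pulls the average far away from it.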

In [None]:
attention_mask = encoded_input['attention_mask']
print(attention_mask.unsqueeze(-1).shape)
print(model_output[0].size())
print(attention_mask.unsqueeze(-1).expand(model_output[0].size()).shape)

torch.Size([2, 15, 1])
torch.Size([2, 15, 384])
torch.Size([2, 15, 384])


In [None]:
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)

torch.Size([2, 384])


In [None]:
similarity = (sentence_embeddings[0] * sentence_embeddings[1]).sum()
print('TWO SENTENCE SIMILARITY: ', similarity.item())

TWO SENTENCE SIMILARITY:  0.7857806086540222
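
Both methods compute a cosine similarity: once the sentence vectors are L2-normalized, their dot product equals the cosine of the angle between them. The plain-Python sketch below checks this on two made-up vectors; the helper names are illustrative and not part of any library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed directly from its definition."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [3.0, 4.0]
b = [4.0, 3.0]

# Dot product of the normalized vectors, as done in both methods above.
dot_of_unit = sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
print(dot_of_unit, cosine_similarity(a, b))
```

The two methods above report 0.7857804 and 0.7857806. A difference in the seventh decimal place is consistent with floating-point rounding rather than a real disagreement, which suggests that the masked mean pooling correctly ignores the padded positions.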
