## Hugging Face Transformers


### Pipeline feature

Inside `pipeline()`, the following steps run in sequence. We can either call `pipeline()` directly or run each step ourselves:

- `AutoTokenizer`: raw text $\rightarrow$ tokens (input IDs)

- `AutoModel` with a task head (here `AutoModelForSequenceClassification`): input IDs $\rightarrow$ logits

- PyTorch `softmax`: logits $\rightarrow$ predictions

In [None]:
from transformers import pipeline

### Tokenizer

In [None]:
# The first step is tokenization
from transformers import AutoTokenizer

# Model checkpoint name; other checkpoints from the Hugging Face Hub can be used instead
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [None]:
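raw_inputs = ["I love my dog, he is puppy", "I am coding"]
# Pad to the longest sentence, truncate to the model's maximum length, return PyTorch tensors
inputs1 = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# Default call: plain Python lists, no padding
inputs2 = tokenizer(raw_inputs)
print(f'inputs{inputs1}')
print(f'inputs without padding,truncation: \n {inputs2}')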


inputs{'input_ids': tensor([[  101,  1045,  2293,  2026,  3899,  1010,  2002,  2003, 17022,   102],
        [  101,  1045,  2572, 16861,   102,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])}
inputs without padding,truncation: 
 {'input_ids': [[101, 1045, 2293, 2026, 3899, 1010, 2002, 2003, 17022, 102], [101, 1045, 2572, 16861, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
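
The two outputs above differ only in padding: with `padding=True` the shorter sentence is filled with ID 0 up to the longest length, and the attention mask marks real tokens with 1 and padding with 0. A minimal pure-Python sketch of that idea follows. `pad_batch` is a hypothetical helper written for illustration, not a `transformers` function, and the pad ID of 0 is taken from the printed output.

```python
# Hypothetical helper, not part of transformers: right-pads ID lists to a common length.
def pad_batch(batch, pad_id=0):
    """Pad each ID list to the longest length and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded IDs copied from the `inputs2` output above.
unpadded = [
    [101, 1045, 2293, 2026, 3899, 1010, 2002, 2003, 17022, 102],
    [101, 1045, 2572, 16861, 102],
]
padded = pad_batch(unpadded)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

The second row comes out with five trailing zeros in both lists, matching the padded `inputs1` tensors.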


In [None]:
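# tokenize() is documented for a single string; passing the two-sentence list
# returns one flat token list, as the output below shows
tokenizer.tokenize(raw_inputs)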

['i', 'love', 'my', 'dog', ',', 'he', 'is', 'puppy', 'i', 'am', 'coding']
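
To connect these tokens to the `input_ids` printed earlier, here is a small sketch of the lookup step. The vocabulary is a tiny excerpt inferred by lining up the printed tokens with the printed IDs (the real `vocab.txt` is far larger), and `encode` is a hypothetical helper, not the tokenizer's actual implementation.

```python
# Tiny vocabulary excerpt inferred from the printed tokens and input IDs;
# the checkpoint's real vocab.txt contains many more entries.
vocab_excerpt = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "love": 2293, "my": 2026, "dog": 3899, ",": 1010,
    "he": 2002, "is": 2003, "puppy": 17022, "am": 2572, "coding": 16861,
}


def encode(tokens, vocab):
    """Wrap tokens in [CLS] ... [SEP] and look up each ID (hypothetical helper)."""
    return [vocab[t] for t in ["[CLS]", *tokens, "[SEP]"]]


ids = encode(["i", "am", "coding"], vocab_excerpt)
print(ids)  # [101, 1045, 2572, 16861, 102]
```

This reproduces the unpadded IDs of the second sentence, which shows where the leading 101 and trailing 102 come from.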

### Model

In [None]:
from transformers import AutoModel

# Reuse the checkpoint defined above; AutoModel loads the base model without a task head
model = AutoModel.from_pretrained(checkpoint)

In [None]:
outputs = model(**inputs1)
print(outputs.last_hidden_state.shape) # batch size, sequence length, hidden size

torch.Size([2, 10, 768])


**There are many other Auto classes in the Hugging Face library, such as `AutoModelForQuestionAnswering` and `AutoModelForSequenceClassification`, which add a task-specific head on top of the base model.**

In [None]:
from transformers import AutoModelForSequenceClassification

# Same checkpoint, now loaded with its sequence-classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs1)

In [None]:
print(outputs.logits.shape) # (batch size, number of labels): 2 sentences x 2 labels

torch.Size([2, 2])


In [None]:
print(outputs.logits)

tensor([[-3.3514,  3.5196],
        [-1.5453,  1.4910]], grad_fn=<AddmmBackward0>)


### Prediction

In [None]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[0.0010, 0.9990],
        [0.0458, 0.9542]], grad_fn=<SoftmaxBackward0>)
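
To see what `softmax` does to each row of logits, the calculation can be reproduced without PyTorch. This is a minimal sketch using only the standard library. The `softmax` function here is a hand-written stand-in for `torch.nn.functional.softmax`, and the logits are copied from the printed output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the printed `outputs.logits`.
logits = [[-3.3514, 3.5196], [-1.5453, 1.4910]]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the values agree with the tensor printed above to four decimal places.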


In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

The first sentence is classified as POSITIVE with probability 0.9990, and the second as POSITIVE with probability 0.9542.
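
The last step, going from probabilities to a label, can be sketched as picking the index of the largest probability in each row and looking it up in `id2label`. `to_labels` is a hypothetical helper for illustration. The probabilities and the label mapping are copied from the printed outputs.

```python
# Values copied from the printed `predictions` and `model.config.id2label`.
predictions = [[0.0010, 0.9990], [0.0458, 0.9542]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def to_labels(probabilities, id2label):
    """Pick the highest-probability class for each row (hypothetical helper)."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over the row
        results.append({"label": id2label[best], "score": row[best]})
    return results


results = to_labels(predictions, id2label)
print(results)
```

The result is a list with one `label`/`score` dictionary per input sentence.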

In [None]:
# We can do the same thing with pipeline
from transformers import pipeline

classifier = pipeline("sentiment-analysis")



No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
classifier(raw_inputs)

[{'label': 'POSITIVE', 'score': 0.9989637136459351},
 {'label': 'POSITIVE', 'score': 0.9541887044906616}]