# Hugging Face basic usage

## Behind the pipeline 

First, load a tokenizer with AutoTokenizer. A tokenizer's job is to turn a string input into numbers (token IDs) that the model can process. Each model was trained with its own tokenizer, so we have to load the one that matches the checkpoint.

In [3]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Now that we have the right tokenizer, we can turn our raw input into tensors. The following turns raw_inputs into PyTorch tensors, padding the shorter sentence so that both rows have the same length.

In [17]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
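To make the output above easier to read, here is a minimal plain-Python sketch of what padding=True does to a batch. The token IDs are copied from the printed input_ids. The helper name pad_batch and the constant PAD_ID = 0 are assumptions made for illustration (0 is the padding value visible in the output), not part of the transformers API.

```python
# Token IDs copied from the tokenizer output above, without padding.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

PAD_ID = 0  # assumption: the padding ID seen in the printed input_ids


def pad_batch(sequences, pad_id=PAD_ID):
    """Hypothetical helper: pad every sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([first, second])
print(input_ids[1])
print(attention_mask[1])
```

The second row matches the printed tensors: eight real tokens followed by eight padding positions, which the attention mask flags with 0.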


The <code>return_tensors</code> parameter controls the return type. With "np", the tokenizer returns NumPy ndarrays; with "tf", it returns TensorFlow tensors.

In [9]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="np")
print(inputs)

{'input_ids': array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102],
       [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,
            0,     0,     0,     0,     0,     0,     0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [15]:
type(inputs['input_ids'])

numpy.ndarray

We can load a model with AutoModel, following the same steps as for the tokenizer.

In [33]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [34]:
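# inputs currently holds NumPy arrays, so re-tokenize as PyTorch tensors first
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])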


AutoModel returns only the hidden states, with shape (batch size, sequence length, hidden size). What we want, however, is a classification model, which we get from AutoModelForSequenceClassification.

In [35]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [23]:
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

Logits are the raw, unnormalized scores the model outputs for each class. To make them easier to interpret, we convert them into probabilities with softmax.

In [24]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Each row sums to 1 because the probabilities over all classes must add up to 1.

In [37]:
predictions[0].sum()

tensor(1.0000, grad_fn=<SumBackward0>)
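As a sanity check, softmax can be reproduced by hand. The sketch below is a plain-Python version (the function name softmax is ours, not the torch implementation) applied to the logits printed above. Because those logits are rounded to four decimals, the results should agree with the torch output only approximately.

```python
import math


def softmax(logits):
    """Exponentiate each logit and normalize so the values sum to 1."""
    # Subtracting the maximum first keeps exp() from overflowing
    # and does not change the result.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from outputs.logits above (rounded to four decimals).
batch_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in batch_logits]

for row in probabilities:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 4))
```

Softmax is applied to each row independently, which is what dim=-1 selects in the torch call: the normalization runs over the class dimension, not over the batch.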

In case we forget which index is negative and which is positive, we can look it up in the model config.

In [25]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
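Putting the two pieces together, the predicted label for each sentence is the id2label entry at the index with the highest probability. Here is a minimal plain-Python sketch, with the probabilities and the mapping copied from the outputs above (the variable names are illustrative):

```python
# Probabilities copied from the softmax output above, one row per sentence.
predictions = [
    [4.0195e-02, 9.5980e-01],
    [9.9946e-01, 5.4418e-04],
]
# Mapping copied from model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

labels = []
for row in predictions:
    # Index of the largest probability in this row (argmax).
    best = max(range(len(row)), key=row.__getitem__)
    labels.append((id2label[best], row[best]))

print(labels)
```

The first sentence comes out as POSITIVE and the second as NEGATIVE. With the real tensors, predictions.argmax(dim=-1) should give the same indices.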

Here is another example that goes through each step with new inputs.

In [38]:
# Grab raw inputs
raw_inputs = [
    "All I want is freedom.",
    "I wish I can eat some chicken nuggets.",
    "If I enjoy the food, I don't get fat."
]

In [39]:
# Get a tokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [40]:
# Turn raw inputs into inputs 
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  2035,  1045,  2215,  2003,  4071,  1012,   102,     0,     0,
             0,     0,     0,     0,     0],
        [  101,  1045,  4299,  1045,  2064,  4521,  2070,  7975, 16371, 13871,
          8454,  1012,   102,     0,     0],
        [  101,  2065,  1045,  5959,  1996,  2833,  1010,  1045,  2123,  1005,
          1056,  2131,  6638,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [41]:
# Use a classification model to get the outputs
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-2.6301,  2.7617],
        [ 3.0354, -2.5513],
        [-3.4392,  3.5611]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [30]:
# Turn our outputs into probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.5329e-03, 9.9547e-01],
        [9.9627e-01, 3.7333e-03],
        [9.1076e-04, 9.9909e-01]], grad_fn=<SoftmaxBackward0>)


In [42]:
# Just in case we forget which is positive
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}