In [6]:
import torch
from torch import Tensor as tensor
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

In [3]:
from transformers import pipeline


classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

We'll need to convert our input text into something our transformer model can use. To do this, we'll use a tokenizer.

The tokenizer will *tokenize* the text, that is, split it into chunks like words, subwords, or punctuation. It then maps each of these tokens to an integer and adds some special tokens to the resulting sequence, like start and end markers.
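To make those steps concrete, here is a minimal sketch of a toy tokenizer. The vocabulary, the `toy_encode` helper, and the ID values are all made up for illustration; a real pretrained tokenizer uses a learned vocabulary and more careful splitting rules.

```python
# Toy illustration only: a hand-made vocabulary, not the real DistilBERT one.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}

def toy_encode(text):
    """Split text into tokens, add start/end markers, and map to integer IDs."""
    # Separate the exclamation mark so it becomes its own token, then split on spaces.
    tokens = text.lower().replace("!", " !").split()
    # Wrap the sequence in special start ([CLS]) and end ([SEP]) tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
    return tokens, ids

tokens, ids = toy_encode("I hate this so much!")
print(tokens)
print(ids)
```

The first and last IDs come from the special tokens, not from the text itself.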

In [4]:

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


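We can see the integers that represent the tokens in our inputs. input_ids holds one row of integers per sentence, and the shorter sentence is padded with 0s so both rows have the same length. The attention_mask marks real tokens with 1 and padding with 0.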

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Our output has three dimensions:
1. Batch size - the number of sequences processed at a time (2 here).
2. Sequence length - the length of the numerical representation of each sequence, i.e. how many tokens our input ends up being (16 here).
3. Hidden size - the dimension of the vector that represents each input token. Each token is represented by a vector with as many elements as the hidden size, which lets the model perform mathematical operations on it and thus 'work' with natural language.
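As a rough picture of that layout, the sketch below builds a dummy nested list with the same three dimensions and measures it. The zeros are placeholders rather than real hidden states, the `shape` helper is hypothetical, and the hidden size of 768 is an assumption for this example.

```python
# Dummy data only: zeros stand in for the model's real hidden states.
batch_size, seq_len, hidden_size = 2, 16, 768  # assumed sizes for illustration

hidden_states = [
    [[0.0] * hidden_size for _ in range(seq_len)]
    for _ in range(batch_size)
]

def shape(nested):
    """Return the dimensions of a regular nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

print(shape(hidden_states))
```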

In [None]:
from transformers import AutoModelForSequenceClassification
import torch


checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

Using a model specialized for sequence classification gives us a 2x2 output: one row for each sequence and one column for each of our sentiment categories (positive and negative).

The model itself outputs logits (the raw, unnormalized output of the last layer). We put them through a softmax to turn them into the probability of each input falling into each of our output categories. After that, we can read off the chance of our input having negative or positive sentiment.
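To see what the softmax step does numerically, here is a small pure-Python sketch. The logits are invented for illustration, and the label order (column 0 = NEGATIVE, column 1 = POSITIVE) is an assumption that should be checked against the model's configuration.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max improves numerical stability without changing the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits: one row per sentence, one column per class.
logits = [[-1.5, 1.5], [4.0, -3.5]]
# Assumed label order for illustration only.
labels = ["NEGATIVE", "POSITIVE"]

for row in logits:
    probs = softmax(row)
    best = probs.index(max(probs))
    print(labels[best], [round(p, 4) for p in probs])
```

Each row of probabilities sums to 1, and the largest logit in a row always maps to the largest probability.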

In [None]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

We can use a Configuration object, which contains a set of predetermined hyperparameters, to instantiate our Model object. A model created this way is initialized with random weights.

At this point, we would train our model. This would require a non-trivial amount of resources, so we can instead use a pretrained model with the from_pretrained() method, passing in our checkpoint of choice.

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

model.save_pretrained("directory_on_my_computer")

We save the model by passing in the directory we want to save it in. This puts two files in that directory: a config.json with the model's architecture (the hyperparameters it's configured with) and a weights file, which is pytorch_model.bin in older versions of the library and model.safetensors in newer ones.

## Tokenizers

We need to convert our input string to numbers for the model to understand it. This is done by splitting our input into tokens, and there are a few ways of doing that split. Each token has a numerical ID, and the model represents each ID with a vector of length 'hidden size'.

Three of the most popular approaches to tokenization are word-based, subword-based, and character-based.

Word - Each word is itself a token. The input is split on either spaces or punctuation. Since each word is a token, our vocabulary (the total set of tokens we have IDs for) must be large to account for all of the different words that might be in our input. Since each token ID is mapped to a vector that might be hundreds of numbers long, the model's table of token vectors can quickly become overwhelmingly large. Additionally, we'll have an "UNKNOWN" token ID, assigned to every word that isn't in our vocabulary, so we might also lose a lot of information if our vocabulary isn't sufficiently comprehensive.
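A minimal sketch of word-based tokenization shows how the "UNKNOWN" fallback loses information. The vocabulary and both helper functions are hypothetical and exist only for this example.

```python
import re

# Tiny hand-made vocabulary (an assumption for this sketch); ID 0 is reserved for unknown words.
word_vocab = {"[UNK]": 0, "i": 1, "hate": 2, "this": 3, "so": 4, "much": 5, "!": 6}

def word_tokenize(text):
    """Split into lowercase words and keep punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def words_to_ids(tokens):
    """Map each word to its ID, falling back to the unknown ID."""
    return [word_vocab.get(tok, word_vocab["[UNK]"]) for tok in tokens]

known = word_tokenize("I hate this so much!")
unseen = word_tokenize("I despise this!")

print(known, words_to_ids(known))
# "despise" is not in the vocabulary, so it collapses to the unknown ID.
print(unseen, words_to_ids(unseen))
```

Once a word has been mapped to the unknown ID, the model can no longer tell it apart from any other out-of-vocabulary word.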

Character - Each character (like a letter) becomes a token. This usually results in a smaller necessary vocabulary (although character-based languages like Chinese can still have large vocabularies). Issues arise because individual characters hold less semantic information than words. Additionally, this approach produces longer token ID sequences than word-based tokenization, which limits the length of the text that can be processed.
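The trade-off between vocabulary size and sequence length can be seen on a single sentence. This is only a sketch: the vocabulary is built from the one example string, whereas a real character vocabulary would cover a whole corpus.

```python
text = "I hate this so much!"

# Word-level: one token per whitespace-separated chunk.
word_tokens = text.split()

# Character-level: one token per character (spaces included here).
char_tokens = list(text)

# The character vocabulary is just the set of distinct characters seen.
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(char_tokens)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters in the vocabulary")
```

The character sequence is several times longer than the word sequence, while the set of distinct characters stays small.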

Subword - A balance between character-based and word-based tokenization. Complex words can be split into meaningful components as necessary, like how the -s or -es suffix indicates plurality in English, while common words can be efficiently represented with a single token ID. Because words are split up, the tokenizer also uses a special marker to show where word boundaries fall (for example, BERT prefixes pieces that continue a word with ##).
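One simple way to split a word into subwords is greedy longest-match against a fixed vocabulary, sketched below in the spirit of WordPiece-style matching. The vocabulary and the `subword_tokenize` helper are made up for illustration; real tokenizers learn their vocabulary from data and may use different algorithms.

```python
# Hand-made subword vocabulary (an assumption for this sketch).
# Pieces starting with "##" continue the previous piece, following BERT's convention.
subword_vocab = {"token", "##ize", "##s", "##ization", "dog", "the", "[UNK]"}

def subword_tokenize(word):
    """Greedy longest-match-first split of a single word into known pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in subword_vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

for word in ["dogs", "tokenization", "tokenizes", "cat"]:
    print(word, "->", subword_tokenize(word))
```

Common stems keep a single piece, and suffixes are shared across words, so a small vocabulary can cover many word forms.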

In [22]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

#tokenizer.save_pretrained("directory_on_my_computer")

Using tokenizers is essentially the same as using other pretrained models. We load the one we want based on our chosen checkpoint. This loads the algorithm (character/subword/word) used by the tokenizer and its vocabulary (the dictionary of tokens and their associated IDs).

Saving a tokenizer works the same way as saving a model: call save_pretrained() with the target directory.

In [None]:
tokenizer("Using a Transformer network is simple")

We see the IDs of the tokens in our input string (we'll talk about the other outputs - token_type_ids/attention_mask later). 

This translation from text to numbers is called 'encoding' and consists of two parts:

1. Tokenization - Splitting up our input string into tokens based on our tokenization algorithm.
2. Conversion - Taking those tokens and creating a sequence of those tokens' IDs as defined by our vocabulary.

As a note: To use a model, we'll need to tokenize any input string in the same way that was used in the initial training of that model. 

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

This process can also be reversed through decoding, translating our token ID sequence back to the string it was derived from. 
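Decoding can be sketched as a reverse lookup followed by merging subword pieces back into words. The vocabulary and IDs below are invented for this example and are not the real bert-base-cased values, and `toy_decode` is a hypothetical helper.

```python
# Hand-made vocabulary (an assumption for this sketch, not bert-base-cased's real one).
toy_vocab = {"Using": 0, "a": 1, "Transform": 2, "##er": 3,
             "network": 4, "is": 5, "simple": 6}
# Decoding needs the reverse lookup: ID -> token.
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def toy_decode(ids):
    """Map IDs back to tokens, then glue '##' continuation pieces onto the previous token."""
    words = []
    for idx in ids:
        token = id_to_token[idx]
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

print(toy_decode([0, 1, 2, 3, 4, 5, 6]))
```

Note that decoding does more than look up tokens: it also has to rejoin the pieces that belong to the same word.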

## Handling Longer Inputs with Batching

Our models expect a tensor of IDs as input. Since tensors are rectangular, we need a way to handle sequences of different lengths.

We'll do this by padding. We pad the end of the shorter sequence with a padding token ID, which we can find using tokenizer.pad_token_id, until it's the same length as the longest sequence.
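Padding itself is a small list operation, sketched here in plain Python. The `pad_batch` helper is hypothetical, and pad_id=0 is an assumption for this example; in practice the value should be read from tokenizer.pad_token_id.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# pad_id=0 is an assumption for this sketch.
batch = pad_batch([[1045, 5223, 2023], [1045, 999]], pad_id=0)
print(batch)
```

After padding, every row has the same length, so the batch can be turned into a rectangular tensor.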

On its own, padding disrupts our attention mechanism, since the padding tokens improperly contribute to the context in which the model interprets the other tokens.

To solve this problem, we pass in an attention_mask argument during inference. This is another tensor of the same shape as our input IDs, consisting of 1s and 0s. Token IDs that should be attended to by our model have a 1 in the corresponding spot in our attention mask. Token IDs that should be ignored have a 0 (all our padding tokens will have a 0).
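The effect of the mask can be illustrated with a softmax over made-up attention scores for a single query token. This is a simplified sketch of the idea, not the library's implementation: `attention_weights` is a hypothetical helper, and real models apply the mask inside each attention layer.

```python
import math

def attention_weights(scores, mask):
    """Softmax over attention scores, giving masked (0) positions zero weight."""
    # Masked positions get a score of -infinity, so exp() sends them to 0.
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for one query token against 3 real tokens and 2 padding tokens.
scores = [2.0, 1.0, 0.5, 0.3, 0.3]
mask = [1, 1, 1, 0, 0]

unmasked_weights = attention_weights(scores, [1] * len(scores))
masked_weights = attention_weights(scores, mask)

print([round(w, 3) for w in unmasked_weights])  # padding positions receive some weight
print([round(w, 3) for w in masked_weights])    # padding positions receive exactly 0
```

Without the mask, the padding positions absorb part of the attention and change the weights of the real tokens, which is why the unmasked model output would differ.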

In [17]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

seq1 = "I've been waiting for a HuggingFace course my whole life."
seq2 = "I hate this so much!"

# Tokenize each sequence on its own (no special tokens are added this way)
ids1 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(seq1))
ids2 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(seq2))

pad_id = tokenizer.pad_token_id

# Pad the shorter sequence with pad_id so both rows have the same length
batched_ids = [
    [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
    [1045, 5223, 2023, 2061, 2172, 999, pad_id, pad_id, pad_id, pad_id, pad_id, pad_id, pad_id, pad_id],
]

# 1 = real token (attend to it), 0 = padding (ignore it)
mask1 = [1 if i != pad_id else 0 for i in batched_ids[0]]
mask2 = [1 if i != pad_id else 0 for i in batched_ids[1]]

attention_mask = [mask1, mask2]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
output1 = model(torch.tensor([ids1]))
output2 = model(torch.tensor([ids2]))
print(f'W/ batching and attention masking:{outputs}')
print(f'Passing in those sequences separately: {output1} and {output2}')


W/ batching and attention masking:SequenceClassifierOutput(loss=None, logits=tensor([[-2.7276,  2.8789],
        [ 3.1931, -2.6685]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
Passing in those sequences separately: SequenceClassifierOutput(loss=None, logits=tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None) and SequenceClassifierOutput(loss=None, logits=tensor([[ 3.1931, -2.6685]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


We can see that the batched output is identical to the output of running each sequence separately, as long as we pad properly and use the attention mask.

We might run into another problem where our sequences are too long for the model. In this case, we either have to switch to a model that accepts longer input sequences or truncate (cut off) our sequences.
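In its simplest form, truncation just keeps the first tokens up to a maximum length. The `truncate` helper and max_length=8 are assumptions for this example; each real model has its own limit, and this sketch ignores special tokens, which need to stay in place when a real tokenizer truncates.

```python
def truncate(ids, max_length):
    """Keep only the first max_length token IDs."""
    return ids[:max_length]

# max_length=8 is an arbitrary value for illustration.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
short = truncate(ids, max_length=8)

print(len(ids), "->", len(short))
print(short)
```

Everything past the cut is lost, so truncation trades completeness of the input for the ability to run the model at all.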

In [21]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

We can just call the tokenizer directly on our input sequences and pass its output to our model as keyword arguments. The tokenizer's other parameters handle truncation, padding, and the type of tensor returned.