# Hugging Face's pretrained models

Let's experiment with the pretrained models offered by the Hugging Face ecosystem.

The `AutoModel` class is to models what `AutoTokenizer` is to tokenizers: it loads a pretrained model (downloading it if it's not already in the local cache) so that it can be used as is.

`AutoModel` loads the PyTorch implementation of a model. For the other supported frameworks, use the corresponding class, e.g. `TFAutoModel` for TensorFlow (or `FlaxAutoModel` for JAX/Flax). If a checkpoint only ships weights for a different framework, they can usually be converted when the model is loaded.

In [None]:
from transformers import AutoModel, TFAutoModel, AutoTokenizer

In [None]:
# This establishes which model to (down)load, and with
# which weight values.
model_ckpt = 'distilbert-base-uncased'

# (Down)load the model.
model = TFAutoModel.from_pretrained(model_ckpt)

## Data and tokenization

Loaded models assume the input to be already tokenized.

__Note:__ a model must be used with the tokenizer associated with the same checkpoint. Different tokenizers generally split text differently and map tokens to different integer IDs, so IDs produced by a mismatched tokenizer would select the wrong rows of the model's embedding matrix and the output would be meaningless.
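To make the note above concrete, here is a minimal sketch in plain Python. The two vocabularies are hypothetical toy examples (they are not taken from any real checkpoint), and `encode`/`decode` are illustrative helpers, not part of the `transformers` API. The point is only that the same tokens get different IDs under different vocabularies, so IDs cannot be exchanged between them.

```python
# Two hypothetical toy vocabularies: same tokens, different ID assignments.
vocab_a = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "zen": 3, "is": 4, "good": 5}
vocab_b = {"[CLS]": 0, "[SEP]": 1, "[PAD]": 2, "good": 3, "zen": 4, "is": 5}


def encode(tokens, vocab):
    """Map tokens to integer IDs, wrapping them in [CLS] ... [SEP]."""
    return [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]


def decode(ids, vocab):
    """Map integer IDs back to tokens using the given vocabulary."""
    id_to_token = {i: t for t, i in vocab.items()}
    return [id_to_token[i] for i in ids]


tokens = ["zen", "is", "good"]
ids_a = encode(tokens, vocab_a)
ids_b = encode(tokens, vocab_b)

print(ids_a)                    # IDs under vocabulary A
print(ids_b)                    # IDs under vocabulary B: a different sequence
print(decode(ids_a, vocab_a))   # round trip with the matching vocabulary
print(decode(ids_a, vocab_b))   # same IDs read with the wrong vocabulary
```

Reading the IDs of vocabulary A through vocabulary B yields a scrambled sequence; a model trained with vocabulary B would see exactly that scrambled input.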

In [None]:
# We use the tokenizer associated with the same model
# (same checkpoint).
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
text = """
Instead of feeding into their need to “fix” their lives, I try to teach the concept of impermanence, and for those who are interested, I share mindfulness practices to help them understand and internalize the concept. 

Many of them already have experience with meditation, but it is often a goal-orientated practice in line with fixing themselves. In fact, I find that a goalless practice is the best way to understand impermanence. A goalless practice is about being right here in each moment without any conceptual objective in mind. It means giving up conceptual thinking and concepts, putting the brakes on constantly doing, releasing the need to be in control, and starting to just be in the world as you are. It means sitting with the fact that nothing is permanent, that everything is changing, and that is OK. I think of it as watching the clouds float by on a sunny day. Or, as Soto Zen teacher “Homeless” Kodo Sawaki Roshi said long ago, “Zazen is good for nothing!”
"""

documents = [t for t in text.strip().split('\n') if t != '']

print(f'{len(documents)} documents found')

for i, document in enumerate(documents):
    print(f'\nDocument {i+1}:')
    print(document)

In [None]:
inputs = tokenizer(documents, padding=True, return_tensors='tf')

inputs

The input IDs are TensorFlow tensors of shape `(batch_size, n_tokens)`, where `n_tokens` is the number of tokens in the longest sequence of the batch, since shorter sequences are padded to that length.

In [None]:
inputs['input_ids']
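The effect of padding on the shape can be sketched without any library. The helper `pad_batch`, the padding ID `0` and the token IDs below are all made up for illustration; this is not how the tokenizer is implemented, only a picture of what `padding=True` produces: every sequence is extended to the length of the longest one, and an attention mask marks which positions hold real tokens.

```python
PAD_ID = 0  # hypothetical padding ID, chosen only for this sketch


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in the batch.

    Returns a dict with the padded IDs and a mask that is 1 for real
    tokens and 0 for padding positions.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of token IDs with different lengths (4 and 7).
batch = pad_batch([[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]])

for row in batch["input_ids"]:
    print(row)
for row in batch["attention_mask"]:
    print(row)

# Shape of the padded batch: (batch_size, n_tokens)
print((len(batch["input_ids"]), len(batch["input_ids"][0])))
```

Here the batch has shape `(2, 7)`: the shorter sequence gains three padding positions, and its attention mask is `0` at exactly those positions.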

## Generating an output

The model expects the output of the tokenizer as its input (a dict-like object containing both the token IDs and the attention masks as tensors). The output is another dict-like object containing a tensor, `last_hidden_state`, of shape `(batch_size, n_tokens, hidden_dim)`, where `hidden_dim` is the output dimension of the model for each token (each token is mapped to a vector of size `hidden_dim`).

In [None]:
outputs = model(inputs)

outputs

For downstream classification tasks, it's customary to use only the hidden state of the first token (the special `[CLS]` token that the tokenizer prepends to each sequence) as the input feature. Although the token itself is identical in every input, its hidden state is __NOT__: it is computed in a context-dependent way, so it acts as a "summary" of the whole sequence.

In [None]:
outputs['last_hidden_state'][:, 0, :]
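The slicing above can be checked on a stand-in array. The sketch below uses NumPy and random numbers in place of a real model output (the sizes `2`, `5` and `8` are arbitrary assumptions chosen for readability), so it only illustrates the indexing: taking position `0` along the token axis leaves one vector of size `hidden_dim` per document.

```python
import numpy as np

# Arbitrary sizes for the sketch; a real model fixes hidden_dim itself.
batch_size, n_tokens, hidden_dim = 2, 5, 8

# Random stand-in for outputs['last_hidden_state'].
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, n_tokens, hidden_dim))

# Keep every document, only the first token, and every hidden dimension.
cls_features = last_hidden_state[:, 0, :]

print(last_hidden_state.shape)  # (2, 5, 8)
print(cls_features.shape)       # (2, 8)
```

The resulting `(batch_size, hidden_dim)` matrix is the usual input to a downstream classifier: one row of features per document.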