# **Auto Class**  

We have already seen the `dataset` and `pipeline` abstractions from Hugging Face.  While a `pipeline` is a quick way to set up an LLM for a given task, the slightly lower-level abstractions `model` and `tokenizer` permit a bit more control over options.  

<img width="800" height="500" src="data/images/hugging_face_transformers_pipeline.jpeg">

## **Pipeline**
1. Preprocess the data (i.e., tokenize the input)
2. Use the model to solve the task (i.e., pass the input to the appropriate model)

#### **Preprocessing Data**
Before you can train a model on a dataset, the data must be preprocessed into the input format the model expects. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you’ll learn that for:
- Text, use an `AutoTokenizer` to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
- Speech and audio, use an `AutoFeatureExtractor` to extract sequential features from audio waveforms and convert them into tensors.
- Image inputs, use an `AutoImageProcessor` to convert images into tensors.
- Multimodal inputs, use an `AutoProcessor` to combine a tokenizer and a feature extractor or image processor.

**Note: `AutoProcessor` always works and automatically chooses the correct class for the model you’re using, whether you’re using a tokenizer, image processor, feature extractor or processor.**
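
The "automatically chooses the correct class" behaviour can be pictured as a lookup in a registry. The sketch below is a toy illustration only, not the library's actual implementation: the stub classes, the `TOKENIZER_REGISTRY` dictionary, and the `load_auto` helper are hypothetical names, and the only idea borrowed from the library is that a checkpoint's configuration names its model type.

```python
# Toy illustration of Auto-class dispatch; every name here is hypothetical.

class BertTokenizerStub:
    """Stand-in for a BERT-style tokenizer class."""


class RobertaTokenizerStub:
    """Stand-in for a RoBERTa-style tokenizer class."""


# Maps the model type found in a checkpoint's config to a concrete class.
TOKENIZER_REGISTRY = {
    "bert": BertTokenizerStub,
    "roberta": RobertaTokenizerStub,
}


def load_auto(config):
    """Return an instance of the class registered for config['model_type']."""
    model_type = config["model_type"]
    if model_type not in TOKENIZER_REGISTRY:
        raise ValueError(f"No tokenizer registered for model type {model_type!r}")
    return TOKENIZER_REGISTRY[model_type]()


tok = load_auto({"model_type": "roberta"})
print(type(tok).__name__)  # RobertaTokenizerStub
```

The caller never names a concrete class, which is why the same loading code keeps working when the checkpoint changes.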


#### **Solving the Task**
After the data has been preprocessed and converted to tensors, we can use the following families of pre-trained Auto classes to solve [Natural Language Processing](https://huggingface.co/docs/transformers/model_doc/auto#natural-language-processing), [Computer Vision](https://huggingface.co/docs/transformers/model_doc/auto#computer-vision), [Audio](https://huggingface.co/docs/transformers/model_doc/auto#audio) and [Multimodal](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) tasks:
1. `AutoModelFor*` (PyTorch)
2. `TFAutoModelFor*` (TensorFlow)
3. `FlaxAutoModelFor*` (Flax/JAX)




References:
- https://huggingface.co/docs/transformers/model_doc/auto
- https://huggingface.co/docs/transformers/autoclass_tutorial

We will first look at the [Auto* classes](https://huggingface.co/docs/transformers/model_doc/auto) for tokenizers and model types which can simplify loading pre-trained tokenizers and models.

## **Building the Summarization Pipeline**

Recall the `xsum` dataset from the earlier **Summarization** section.

**Steps**
1. Load the data
```python
from datasets import load_dataset

dataset = load_dataset('xsum')
```
2. Define the pipeline by specifying the task and model
```python
from transformers import pipeline

summarizer = pipeline(
                task="summarization",
                model="t5-small"
)
```
3. Use `summarizer` to summarize the articles
```python
summarizer(article)
```

In [20]:
# ! pip install tensorflow
# ! pip install tf-keras

In [1]:
# Step 1 - Load the dataset

from datasets import load_dataset

xsum_dataset = load_dataset('xsum')

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [2]:
xsum_sample = xsum_dataset['train'].select(range(10))

xsum_sample

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 10
})

In [5]:
# Step 2 - Define the pipeline by specifying the task and model

from transformers import pipeline

summarizer = pipeline(
                task="summarization",
                model="t5-small",
                truncation=True
)

# If we do not set truncation=True, we get the following warning during inference:
# Token indices sequence length is longer than the specified maximum
# sequence length for this model (541 > 512). Running this sequence
# through the model will result in indexing errors




In [6]:
# Step 3 - Use summarizer to summarize the articles

summarizer(xsum_sample["document"][0], do_sample=True, top_k=10, top_p=0.8)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining wall, flooding many commercial properties .'}]

## **AutoTokenizer**

A tokenizer takes text as input and outputs numbers the associated model can make sense of.

<img width="400" height="400" src="data/images/tokenization.JPG">

Let's walk through the process step by step.

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

input_text = "Let's try to tokenize!"
print("Input Text:", input_text)
print()

## The first step of the above pipeline is to split the text into tokens
tokens = tokenizer.tokenize(input_text)
print("Tokens:", tokens)
print()

## Convert each token to its numerical ID
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens Id:", input_ids)
print()

## Lastly, the tokenizer adds special tokens the model expects
final_inputs = tokenizer.prepare_for_model(input_ids)
print("Tokens Id with special tokens:", final_inputs["input_ids"])
print()

## Decode method allows us to check how the final output of the 
## tokenizer translates back to text
print("Decoded Text Output:", tokenizer.decode(final_inputs["input_ids"]))

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Input Text: Let's try to tokenize!

Tokens: ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']

Tokens Id: [2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]

Tokens Id with special tokens: [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]

Decoded Text Output: [CLS] let's try to tokenize! [SEP]
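
The split of `tokenize` into `token` and `##ize` in the output above comes from WordPiece, the sub-word scheme BERT uses: each word is matched greedily against the vocabulary, longest piece first, and continuation pieces carry a `##` prefix. The following is a minimal sketch of that matching loop, assuming a tiny hypothetical vocabulary (`toy_vocab`); the real tokenizer also handles lowercasing, punctuation splitting, and a much larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ize", "try", "to", "let"}

print(wordpiece_tokenize("tokenize", toy_vocab))  # ['token', '##ize']
print(wordpiece_tokenize("try", toy_vocab))       # ['try']
print(wordpiece_tokenize("xyz", toy_vocab))       # ['[UNK]']
```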


In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

input_text = "Let's try to tokenize!"
print("Input Text:", input_text)
print()

## The first step of the above pipeline is to split the text into tokens
tokens = tokenizer.tokenize(input_text)
print("Tokens:", tokens)
print()

## Convert each token to its numerical ID
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens Id:", input_ids)
print()

## Lastly, the tokenizer adds special tokens the model expects
final_inputs = tokenizer.prepare_for_model(input_ids)
print("Tokens Id with special tokens:", final_inputs["input_ids"])
print()

## Decode method allows us to check how the final output of the 
## tokenizer translates back to text
print("Decoded Text Output:", tokenizer.decode(final_inputs["input_ids"]))

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Input Text: Let's try to tokenize!

Tokens: ['Let', "'s", 'Ġtry', 'Ġto', 'Ġtoken', 'ize', '!']

Tokens Id: [7939, 18, 860, 7, 19233, 2072, 328]

Tokens Id with special tokens: [0, 7939, 18, 860, 7, 19233, 2072, 328, 2]

Decoded Text Output: <s>Let's try to tokenize!</s>


**Note: Now that you understand how `AutoTokenizer` works step by step, let's see how to tokenize input text in a single call.**

The tokenizer returns a dictionary with three important items:

- `input_ids` are the indices corresponding to each token in the sentence: numerical representations of the tokens that make up the input sequence. They are often the only required input to the model.
- `attention_mask` indicates whether a token should be attended to or not.
- `token_type_ids` identifies which sequence a token belongs to when there is more than one sequence.

In [11]:
# We can convert the input text to token IDs directly, as follows

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = "Let's try to tokenize!"

input_ids = tokenizer(inputs)

print(input_ids)

{'input_ids': [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [12]:
print(input_ids["input_ids"])

[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]


In [13]:
print(tokenizer.decode(input_ids["input_ids"]))

[CLS] let's try to tokenize! [SEP]


**token_type_ids**  
Some models’ purpose is to do classification on pairs of sentences or question answering. These require two different sequences to be joined in a single “input_ids” entry, which usually is performed with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens. For example, the BERT model builds its two sequence input as such:
```python
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
```

We can have the tokenizer build such an input automatically by passing the two sequences as two separate arguments (not as a single list):

In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

sequence_a = "HuggingFace is based in NYC"

sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)

decoded = tokenizer.decode(encoded_dict["input_ids"])

print(decoded)

[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
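
The decoded string above does not show the `token_type_ids` that accompany it. As a minimal sketch of how the pair layout and its segment labels fit together, the helper below assembles both by hand. `build_pair_inputs` is a hypothetical name, the content IDs are placeholders rather than real vocabulary entries, and the special-token IDs 101 and 102 are the `[CLS]` and `[SEP]` IDs seen in the earlier BERT outputs.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble [CLS] A [SEP] B [SEP] and label each position with its segment."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


# Placeholder content IDs, for illustration only.
pair = build_pair_inputs([11, 12, 13], [21, 22])

print(pair["input_ids"])       # [101, 11, 12, 13, 102, 21, 22, 102]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1]
```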


#### **Padding and Truncation**

Reference: https://huggingface.co/docs/transformers/en/pad_truncation

Batched inputs are often different lengths, so they can’t be converted to fixed-size tensors. **Padding and truncation are strategies for** dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a **special padding token** to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences.

In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well. However, the API supports more strategies if you need them. The three arguments you need to know are: `padding`, `truncation` and `max_length`.

#### **Pad**

Sentences aren’t always the same length which can be an issue because vectors/tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences.

The padding argument controls padding. It can be a boolean or a string:

- True or 'longest': pad to the longest sequence in the batch (no padding is applied if you only provide a single sequence).
- 'max_length': pad to a length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). Padding will still be applied if you only provide a single sequence.
- False or 'do_not_pad': no padding is applied. This is the default behavior.

Set the padding parameter to True to pad the shorter sequences in the batch to match the longest sequence.

In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

encoded_input = tokenizer(batch_sentences, padding=True)

encoded_input["input_ids"]

# The first and third sentences are now padded with 0’s because they are shorter.

[[101, 2021, 2054, 2055, 2117, 6350, 1029, 102, 0, 0, 0, 0, 0, 0],
 [101, 2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012, 102],
 [101, 2054, 2055, 5408, 14625, 1029, 102, 0, 0, 0, 0, 0, 0, 0]]

In [15]:
for seq in encoded_input["input_ids"]:
    print(tokenizer.decode(seq))

[CLS] but what about second breakfast? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] don't think he knows about second breakfast, pip. [SEP]
[CLS] what about elevensies? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
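
Padding to the longest sequence, and the `attention_mask` that goes with it, can be reproduced by hand. The sketch below is a plain-Python illustration, not the tokenizer's own code: `pad_batch` is a hypothetical helper, and the unpadded ID lists are copied from the output above with the trailing zeros removed.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


unpadded = [
    [101, 2021, 2054, 2055, 2117, 6350, 1029, 102],
    [101, 2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012, 102],
    [101, 2054, 2055, 5408, 14625, 1029, 102],
]

input_ids, attention_mask = pad_batch(unpadded)
for ids, mask in zip(input_ids, attention_mask):
    print(len(ids), mask)
```

Every row ends up with the same length, which is what makes the batch convertible to a rectangular tensor.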


#### **Truncate**

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you’ll need to truncate the sequence to a shorter length.

The truncation argument controls truncation. It can be a boolean or a string:

- True or 'longest_first': truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will truncate token by token, removing a token from the longest sequence in the pair until the proper length is reached.
- 'only_second': truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will only truncate the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- 'only_first': truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- False or 'do_not_truncate': no truncation is applied. This is the default behavior.
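
The 'longest_first' strategy for sequence pairs can be sketched as a loop that drops one trailing token at a time from whichever sequence is currently longer. This is a simplified illustration under stated assumptions: `truncate_longest_first` is a hypothetical helper, the token IDs are placeholders, the budget counts content tokens only (special tokens are ignored), and breaking ties by trimming the second sequence is an assumption of this sketch.

```python
def truncate_longest_first(ids_a, ids_b, max_tokens):
    """Drop trailing tokens, one at a time, from whichever sequence is longer."""
    ids_a, ids_b = list(ids_a), list(ids_b)  # work on copies
    while len(ids_a) + len(ids_b) > max(max_tokens, 0):
        if len(ids_a) > len(ids_b):
            ids_a.pop()
        else:
            ids_b.pop()  # assumption: ties are resolved by trimming the second sequence
    return ids_a, ids_b


a = list(range(10, 18))  # 8 placeholder token IDs
b = list(range(20, 23))  # 3 placeholder token IDs

print(truncate_longest_first(a, b, max_tokens=6))
# ([10, 11, 12], [20, 21, 22])
```

Because the longer sequence always shrinks first, a short question paired with a long passage keeps the question intact for as long as possible.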

Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model.

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

encoded_input = tokenizer(batch_sentences, truncation=True, max_length=7)

encoded_input["input_ids"]

[[101, 2021, 2054, 2055, 2117, 6350, 102],
 [101, 2123, 1005, 1056, 2228, 2002, 102],
 [101, 2054, 2055, 5408, 14625, 1029, 102]]

In [17]:
for seq in encoded_input["input_ids"]:
    print(tokenizer.decode(seq))

[CLS] but what about second breakfast [SEP]
[CLS] don't think he [SEP]
[CLS] what about elevensies? [SEP]
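
With `max_length=7`, two of the seven positions are taken by `[CLS]` and `[SEP]`, which is why only five content tokens survive in each row above. A minimal sketch of that budget calculation follows; `truncate_with_specials` is a hypothetical helper, and the content IDs are taken from the padded output shown earlier with the special tokens stripped.

```python
def truncate_with_specials(content_ids, max_length, cls_id=101, sep_id=102):
    """Keep as many content tokens as fit once [CLS] and [SEP] are accounted for."""
    budget = max_length - 2  # two positions are reserved for the special tokens
    if budget < 0:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    return [cls_id] + content_ids[:budget] + [sep_id]


second_sentence = [2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012]
third_sentence = [2054, 2055, 5408, 14625, 1029]

print(truncate_with_specials(second_sentence, max_length=7))
# [101, 2123, 1005, 1056, 2228, 2002, 102]
print(truncate_with_specials(third_sentence, max_length=7))
# [101, 2054, 2055, 5408, 14625, 1029, 102]
```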


#### **Build tensors**

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the `return_tensors` parameter to choose the tensor type. Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return Numpy np.ndarray objects.


In [19]:
encoded_input = tokenizer(batch_sentences, truncation=True, max_length=7, return_tensors="tf")

encoded_input["input_ids"]

<tf.Tensor: shape=(3, 7), dtype=int32, numpy=
array([[  101,  2021,  2054,  2055,  2117,  6350,   102],
       [  101,  2123,  1005,  1056,  2228,  2002,   102],
       [  101,  2054,  2055,  5408, 14625,  1029,   102]])>

## **AutoModelFor`*`, TFAutoModelFor`*` and FlaxAutoModelFor`*`**

We will show how to use those briefly, following this pattern:

* Given input articles.
* Tokenize them (converting to token indices).
* Apply the model on the tokenized data to generate summaries (represented as token indices).
* Decode the summaries into human-readable text.

In [46]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd

# Load the pre-trained tokenizer.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Load the pre-trained model.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

In [47]:
# For summarization, T5-small expects a prefix "summarize: ", 
# so we prepend that to each article as a prompt.

articles = ["summarize: " + article for article in xsum_sample["document"]]

pd.DataFrame(articles, columns=["prompts"])

|   | prompts |
|---|---|
| 0 | summarize: The full cost of damage in Newton S... |
| 1 | summarize: A fire alarm went off at the Holida... |
| 2 | summarize: Ferrari appeared in a position to c... |
| 3 | summarize: John Edward Bates, formerly of Spal... |
| 4 | summarize: Patients and staff were evacuated f... |
| 5 | summarize: Simone Favaro got the crucial try w... |
| 6 | summarize: Veronica Vanessa Chango-Alverez, 31... |
| 7 | summarize: Belgian cyclist Demoitie died after... |
| 8 | summarize: Gundogan, 26, told BBC Sport he "ca... |
| 9 | summarize: The crash happened about 07:20 GMT ... |


In [48]:
# Tokenize the input

inputs = tokenizer(
    articles, return_tensors="pt", padding=True, truncation=True, max_length=1024
)

print("input_ids:")
print(inputs["input_ids"])
print("attention_mask:")
print(inputs["attention_mask"])

input_ids:
tensor([[21603,    10,    37,  ...,     0,     0,     0],
        [21603,    10,    71,  ...,     0,     0,     0],
        [21603,    10, 21945,  ..., 18002,    21,     1],
        ...,
        [21603,    10, 21768,  ...,     0,     0,     0],
        [21603,    10,  9982,  ...,     0,     0,     0],
        [21603,    10,    37,  ...,     0,     0,     0]])
attention_mask:
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])


In [49]:
# Generate summaries

summary_ids = model.generate(
                inputs.input_ids,
                attention_mask=inputs.attention_mask,
                num_beams=2,
                min_length=0,
                max_length=40,
)

print(summary_ids)

tensor([[    0,     8,   423,   583,    13,  1783,    16, 20126, 16496,    19,
           341,   271, 14841,     3,     5,   186,  7540,    16,   158,    15,
          2296,     7,  5718,  2367, 14621,  4161,    57,  4125,   387,     3,
             5,     3,     9,  8347,  5685,  3048,    16,   286,   640,     8],
        [    0,  1472,  6196,   877,   326,    44,     8,  9108,    86,    29,
            16,  6000,  1887,    30,  1856,     3,     5,  2554,   130,  1380,
            12,  1175,     8,  1595,     3,     5,    80,    13,     8,   192,
         14264,    19,    45, 13692,    63,     6,     8,   119,    45, 20576],
        [    0,     3,   849,  2239,     7,   163, 14014,     3,    60,  8234,
           232,   227,     3, 19585,   643,   845,   150,  8033,    47,   787,
            30,   213,     3,    88,   225,  2447,     3,     5,     3,   849,
          2239,     7,   497,     3,    31,    29,    32,   964,  8033,    47],
        [    0,     8,     3,  3708,    18,  1201

In [50]:
# Decode the generated summaries

decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

pd.DataFrame(decoded_summaries, columns=["decoded_summaries"])

|   | decoded_summaries |
|---|---|
| 0 | the full cost of damage in Newton Stewart is s... |
| 1 | fire alarm went off at the Holiday Inn in Hope... |
| 2 | stewards only handed reprimand after governing... |
| 3 | the 67-year-old is accused of committing the o... |
| 4 | a man receiving treatment at the clinic threat... |
| 5 | Gregor Townsend gave a debut to powerhouse win... |
| 6 | Veronica Vanessa Chango-Alverez, 31, was kille... |
| 7 | the 25-year-old was hit by a motorbike during ... |
| 8 | gundogan says he can see the finishing line af... |
| 9 | the crash happened about 07:20 GMT at the junc... |
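
Each row of `summary_ids` starts with 0, which is T5's padding ID (also used as the decoder start token), and `skip_special_tokens=True` drops such IDs from the decoded text. The sketch below illustrates that filtering with a toy vocabulary. `TOY_VOCAB`, `toy_decode`, and `toy_batch_decode` are hypothetical, and the content IDs are not T5's real vocabulary entries. Joining with spaces is also a simplification, since a real tokenizer merges sub-word pieces.

```python
PAD_ID, EOS_ID = 0, 1  # T5 uses 0 for padding and 1 for end-of-sequence

# Hypothetical toy vocabulary; only the two special IDs mirror T5.
TOY_VOCAB = {0: "<pad>", 1: "</s>", 2: "the", 3: "full", 4: "cost"}


def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to tokens, optionally dropping padding and end-of-sequence."""
    special = {PAD_ID, EOS_ID}
    tokens = [TOY_VOCAB[i] for i in ids if not (skip_special_tokens and i in special)]
    return " ".join(tokens)


def toy_batch_decode(batch, skip_special_tokens=True):
    """Decode every sequence in a batch."""
    return [toy_decode(ids, skip_special_tokens) for ids in batch]


batch = [[0, 2, 3, 4, 1, 0], [0, 2, 4, 1, 0, 0]]

print(toy_batch_decode(batch))                          # ['the full cost', 'the cost']
print(toy_decode(batch[0], skip_special_tokens=False))  # <pad> the full cost </s> <pad>
```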


## **Fine-Tuning**

https://huggingface.co/docs/transformers/training#train-a-tensorflow-model-with-keras