<a href="https://colab.research.google.com/github/ai-bites/generative-ai-course/blob/main/Tour_of_Transformers_Library.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A crash course on the Hugging Face Transformers Library
* Pipelines
* AutoClass
* Tokenizers
* Models
* Trainer
* Saving and pushing models to the Hugging Face Hub

In [None]:
!pip install transformers
!pip install --upgrade huggingface_hub



# Pipeline


## Three components of a pipeline
* Tokenizer
* Model
* Optional Post-processing

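To make the three stages above concrete, here is a minimal toy sketch in plain Python. Everything in it (the vocabulary, the weights, and the `toy_*` function names) is hypothetical and exists only for illustration; it is not how the library implements a pipeline, only the same tokenizer, model, post-processing flow at a tiny scale.

```python
import math

# Hypothetical toy vocabulary and per-token logits, for illustration only
VOCAB = {"good": 0, "bad": 1, "day": 2, "<unk>": 3}
LABELS = ["NEGATIVE", "POSITIVE"]
# WEIGHTS[token_id] = contribution of that token to [NEGATIVE, POSITIVE] logits
WEIGHTS = [[-2.0, 2.0], [2.0, -2.0], [0.0, 0.0], [0.0, 0.0]]


def toy_tokenizer(text):
    """Stage 1: map whitespace-separated words to integer ids."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def toy_model(token_ids):
    """Stage 2: turn token ids into raw scores (logits), one per label."""
    logits = [0.0] * len(LABELS)
    for token_id in token_ids:
        for i, weight in enumerate(WEIGHTS[token_id]):
            logits[i] += weight
    return logits


def toy_postprocess(logits):
    """Stage 3: convert logits to probabilities and pick the best label."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as a pipeline does."""
    return toy_postprocess(toy_model(toy_tokenizer(text)))


result = toy_pipeline("good day")
print(result)
```

The real `pipeline` returns a dictionary of the same shape (`label` and `score`), as the cells below show.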
In [5]:
from transformers import pipeline

# A simple classification example with the default model
# (a single string here; multiple inputs can be passed as a list)
text_classifier = pipeline(task="text-classification") # a.k.a. sentiment-analysis
text_classifier("I am feeling very good today!")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.9998730421066284}]

In [6]:
# Multiple inputs in a list
text_classifier(["I am feeling very good today!",
                "The weather is not so good in winter",
                "what a day its been!"])

[{'label': 'POSITIVE', 'score': 0.9998730421066284},
 {'label': 'NEGATIVE', 'score': 0.9996813535690308},
 {'label': 'NEGATIVE', 'score': 0.9783575534820557}]

In [7]:
#
# Create a pipeline with a model other than the default
#
model_name = "roberta-large-mnli"
classifier = pipeline(task="text-classification", model=model_name)
# classifier("What a day its been!")
classifier("I am feeling very good today!")

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'label': 'ENTAILMENT', 'score': 0.5101075172424316}]

In [None]:
#
# List all available pipeline tasks from the pipeline registry
#
from transformers.pipelines import PIPELINE_REGISTRY
PIPELINE_REGISTRY.supported_tasks.keys()

dict_keys(['audio-classification', 'automatic-speech-recognition', 'text-to-audio', 'feature-extraction', 'text-classification', 'token-classification', 'question-answering', 'table-question-answering', 'visual-question-answering', 'document-question-answering', 'fill-mask', 'summarization', 'translation', 'text2text-generation', 'text-generation', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-audio-classification', 'conversational', 'image-classification', 'image-segmentation', 'image-to-text', 'object-detection', 'zero-shot-object-detection', 'depth-estimation', 'video-classification', 'mask-generation', 'image-to-image'])

## Creating Custom Pipelines
The Transformers library lets us create our own pipelines and tasks in addition to the readily available ones listed above. To do so, we inherit from the `Pipeline` class and implement four methods: `_sanitize_parameters`, `preprocess`, `_forward`, and `postprocess`.

In [None]:
from transformers import Pipeline, AutoModelForSequenceClassification
from transformers.pipelines import PIPELINE_REGISTRY
import numpy as np

def softmax(outputs):
    maxes = np.max(outputs, axis=-1, keepdims=True)
    shifted_exp = np.exp(outputs - maxes)
    return shifted_exp / shifted_exp.sum(axis=-1, keepdims=True)


# Step 1 - create a class for custom pipeline
class MyPairClassifiionPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "second_text" in kwargs:
            preprocess_kwargs["second_text"] = kwargs["second_text"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, text, second_text=None):
        return self.tokenizer(text, text_pair=second_text, return_tensors=self.framework)

    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs):
        logits = model_outputs.logits[0].numpy()
        probabilities = softmax(logits)

        best_class = np.argmax(probabilities)
        label = self.model.config.id2label[best_class]
        score = probabilities[best_class].item()
        logits = logits.tolist()
        return {"label": label, "score": score, "logits": logits}

# Step 2 - register the pipeline to the pipeline registry
PIPELINE_REGISTRY.register_pipeline(
    "pair-classification",
    pipeline_class=MyPairClassifiionPipeline,
    pt_model=AutoModelForSequenceClassification,
)

pair-classification is already registered. Overwriting pipeline for task pair-classification...


In [None]:
# classifier = MyPairClassifiionPipeline(model=model, tokenizer=tokenizer, ...)
classifier = pipeline("pair-classification", pipeline_class=MyPairClassifiionPipeline, model="sgugger/finetuned-bert-mrpc")
classifier("this is a test message", second_text="I didn't go home yesterday")

{'label': 'not_equivalent',
 'score': 0.9380707144737244,
 'logits': [0.8161974549293518, -1.901633858680725]}

In [8]:
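# NOTE: `classifier` must be the custom "pair-classification" pipeline created above.
# The TypeError below is what a standard text-classification pipeline raises,
# since it does not accept `second_text` (likely a result of running cells out of order).
classifier("this is a test message", second_text="This is also a test mess")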

TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'second_text'

# AutoClass
* An AutoClass automatically infers and loads the correct architecture from a given model name.
* The `from_pretrained()` method takes care of loading the right pretrained weights. It applies to both models and tokenizers.
* All AutoClasses follow the `Auto*` naming convention.

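The idea of inferring the architecture can be pictured as a lookup from a model type to a concrete class. The sketch below is conceptual only: `ToyAutoModel`, `ToyBertModel`, `ToyRobertaModel`, and `TOY_MODEL_REGISTRY` are hypothetical names, and the real library resolves classes with considerably more machinery. It assumes the checkpoint's configuration exposes a `model_type` entry.

```python
# Conceptual sketch only: hypothetical classes and registry, not the library's code.

class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyRobertaModel:
    def __init__(self, config):
        self.config = config


# Maps the `model_type` found in a checkpoint's config to a concrete class
TOY_MODEL_REGISTRY = {
    "bert": ToyBertModel,
    "roberta": ToyRobertaModel,
}


class ToyAutoModel:
    """Picks the concrete model class based on the config's model type."""

    @classmethod
    def from_config(cls, config):
        model_type = config["model_type"]
        if model_type not in TOY_MODEL_REGISTRY:
            raise ValueError(f"Unrecognized model type: {model_type!r}")
        return TOY_MODEL_REGISTRY[model_type](config)


toy_model = ToyAutoModel.from_config({"model_type": "roberta", "num_hidden_layers": 24})
print(type(toy_model).__name__)
```

The caller never names `ToyRobertaModel` directly, which mirrors why the same `Auto*` call works across many checkpoints.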
### AutoTokenizer, AutoModel and `from_pretrained` function



In [None]:
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

model_name = "roberta-large-mnli"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
classifier("I am feeling far better today!")


Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'label': 'NEUTRAL', 'score': 0.6073631048202515}]

### Auto Classes for Different Modalities
* Use a `Tokenizer` for text input. To load one from a model name, use `AutoTokenizer`.
* Use a `FeatureExtractor` for speech and audio. To load one from a model name, use `AutoFeatureExtractor`.
* Use an `ImageProcessor` for images and videos (`AutoImageProcessor`).
* Use a `Processor` for multi-modal inputs (`AutoProcessor`).

In [None]:
from transformers import AutoImageProcessor, AutoFeatureExtractor, AutoProcessor, AutoTokenizer

# for NLP tasks
text_tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
# For speech tasks
feat_extractor  = AutoFeatureExtractor.from_pretrained("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")
# for image and video tasks
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
# for Multi-modality
processor = AutoProcessor.from_pretrained("microsoft/git-base")

#
# Use processor directly with the transforms (pre-processing) we do before feeding the model
#
def transforms(example_batch):
    images = [x for x in example_batch["image"]]
    captions = [x for x in example_batch["text"]]
    # Note the use of processor we just defined
    inputs = processor(images=images, text=captions, padding="max_length")
    inputs.update({"labels": inputs["input_ids"]})
    return inputs

# use the transform when loading data
# train_ds.set_transform(transforms)


### Auto Backbone

In [None]:
from PIL import Image
from transformers import AutoBackbone, AutoImageProcessor

# read an image (fill in the path to an image file)
image = Image.open("")

# define the model name
model_name = "microsoft/swin-tiny-patch4-window7-224"
# do auto pre-processing on input
processor = AutoImageProcessor.from_pretrained(model_name)

# get auto backbone model; out_indices selects which stages' feature maps to return
model = AutoBackbone.from_pretrained(model_name, out_indices=(0,))

# use the processor and model
inputs = processor(image, return_tensors="pt")
outputs = model(**inputs)
# feature_maps is a tuple with one tensor per requested stage (here only stage 0)
print(outputs.feature_maps[0].shape)

# Tokens and Tokenizers in Transformers

In [None]:
# =======================================
# Tokenize inputs using the tokenizer
# =======================================
from transformers import AutoTokenizer

model_name = "roberta-large-mnli"

# step 1 - tokenize the input text
roberta_tokenizer = AutoTokenizer.from_pretrained(model_name)
roberta_tokens = roberta_tokenizer.tokenize("Lets test tokenization with this!")
print(f"Roberta tokens are: {roberta_tokens}")

# step 2 - Convert the tokens into ids
roberta_ids = roberta_tokenizer.convert_tokens_to_ids(roberta_tokens)
print(f"Roberta tokens ids are: {roberta_ids}")

# Another example with the ALBERT model
# step 1 - tokenize the input text
alberta_tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
alberta_tokens  = alberta_tokenizer.tokenize("Lets test tokenization with this!")
# step 2 - Convert the tokens into ids
alberta_ids = alberta_tokenizer.convert_tokens_to_ids(alberta_tokens)
print(f"Albert tokens are: {alberta_tokens}")
print(f"Albert token ids are: {alberta_ids}")

# txt_cls = pipeline("text-classification", model="roberta-large-mnli", tokenizer=tokenizer)



Roberta tokens are: ['L', 'ets', 'Ġtest', 'Ġtoken', 'ization', 'Ġwith', 'Ġthis', '!']
Roberta tokens ids are: [574, 2580, 1296, 19233, 1938, 19, 42, 328]



Albert tokens are: ['▁lets', '▁test', '▁to', 'ken', 'ization', '▁with', '▁this', '!']
Albert token ids are: [6884, 1289, 20, 2853, 1829, 29, 48, 187]

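The `Ġ` and `▁` prefixes in the printed tokens are whitespace markers: RoBERTa's byte-level BPE writes a preceding space as `Ġ`, while ALBERT's SentencePiece tokenizer marks word starts with `▁`. The short sketch below rebuilds the text from the token lists printed above; `join_tokens` is a hypothetical helper for illustration, not a library function (the tokenizers' own `decode` methods are the proper way to do this).

```python
# Token lists copied from the output above
roberta_tokens = ['L', 'ets', 'Ġtest', 'Ġtoken', 'ization', 'Ġwith', 'Ġthis', '!']
albert_tokens = ['▁lets', '▁test', '▁to', 'ken', 'ization', '▁with', '▁this', '!']


def join_tokens(tokens, space_marker):
    """Hypothetical helper: concatenate subword tokens and turn markers into spaces."""
    return "".join(tokens).replace(space_marker, " ").strip()


roberta_text = join_tokens(roberta_tokens, "Ġ")
albert_text = join_tokens(albert_tokens, "▁")

print(roberta_text)
print(albert_text)
```

Tokens without a marker (`ets`, `ization`, `ken`) continue the previous token, which is how a single word ends up split into several subwords. The ALBERT result is lowercased because the tokens themselves are.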

In [None]:
# With special Tokens needed for model input
prepared_inputs = roberta_tokenizer.prepare_for_model(roberta_ids)
print(f"Prepared inputs for the model are: {prepared_inputs}")

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Prepared inputs for the model are: {'input_ids': [0, 574, 2580, 1296, 19233, 1938, 19, 42, 328, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
# Lastly, we can also decode the tokens back to text if we want to visualize
roberta_tokenizer.decode(prepared_inputs["input_ids"])

'<s>Lets test tokenization with this!</s>'

## Auto Models

In [None]:
import importlib.util
import torch
from transformers import AutoModel, BertModel

# load a pretrained model by its name
model_name = "roberta-large-mnli"
model = AutoModel.from_pretrained(model_name)

# choose the data type of the weights when loading, e.g. float16 or float32
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)

# if accelerate is installed, large models can be loaded with lower CPU memory usage.
# Disabled by default because the checkpoint is large; set the flag to True to try it.
LOAD_LARGE_MODEL = False
if LOAD_LARGE_MODEL and importlib.util.find_spec("accelerate") is not None:
    model_name = "NousResearch/Llama-2-7b-chat-hf"
    model = AutoModel.from_pretrained(model_name, low_cpu_mem_usage=True)

# config values can be changed/overridden during loading
bert_model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

# we can even load weights saved by another framework by simply specifying it, e.g. Flax
# bert_model = BertModel.from_pretrained("bert-base-uncased", from_flax=True)



# Models in Transformers
* Model classes
* Create Custom Models
* Pushing to Hub

## Create custom models

# An example Vision pipeline

# Save, Load and Push - Models, Tokenizers

In [None]:
from transformers import AutoModel

# Save the model
model_name = "roberta-large-mnli"
roberta_model = AutoModel.from_pretrained(model_name)
roberta_model.save_pretrained("./my_models/roberta_model")

# load the saved model with a different config,
# e.g. with the number of attention heads changed
modified_model = AutoModel.from_pretrained("./my_models/roberta_model",
                                           num_attention_heads=32)

# save the modified model and its config
modified_model.save_pretrained("./my_models/roberta_modified_model")

# save the tokenizer, similar to the model
# tokenizer.save_pretrained(<saved directory>)

In [None]:
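# Save the model and tokenizer
# `save_directory` is a placeholder; set it to your own directory.
# `my_model` stands for any model instance, e.g. `roberta_model` above.
save_directory = "<saved directory>"
tokenizer.save_pretrained(save_directory)
my_model.save_pretrained(save_directory)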

In [None]:
# Optionally add tags to the model before pushing
# model.add_model_tags(["custom", "test-roberta"])

# log in to the Hugging Face Hub, then push the model
!huggingface-cli login
model.push_to_hub("ai-bites/my-first-roberta-model")

# Devices - CPU, GPU

In [None]:
#
# Device-related parameters - batch_size and device
#
# Set the batch size EXPLICITLY, if needed.
# By default the batch size is 1, which avoids the padding overhead that batching incurs when sequence lengths vary

classifier = pipeline(task = "text-classification", model = "distilbert-base-uncased-finetuned-sst-2-english", batch_size=2)
print(classifier(["I am feeling very good today!",
            "The weather is not so good in winter",
            "what a day its been!"])
)

# choose which GPU to use (device=0 selects the first GPU; requires a GPU runtime)
classifier = pipeline(task = "text-classification", model = "distilbert-base-uncased-finetuned-sst-2-english", batch_size=2, device=0)
print(classifier(["I am feeling very good today!",
            "The weather is not so good in winter",
            "what a day its been!"])
)

[{'label': 'POSITIVE', 'score': 0.9998730421066284}, {'label': 'NEGATIVE', 'score': 0.9996813535690308}, {'label': 'NEGATIVE', 'score': 0.9783575534820557}]
[{'label': 'POSITIVE', 'score': 0.9998730421066284}, {'label': 'NEGATIVE', 'score': 0.9996813535690308}, {'label': 'NEGATIVE', 'score': 0.9783575534820557}]

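The comment about sequence-length variation in the cell above can be made concrete with a little arithmetic. The sketch below assumes that every sequence in a batch is padded to the longest sequence in that batch; the token counts and the `padding_overhead` helper are hypothetical, chosen only to illustrate the effect.

```python
def padding_overhead(lengths, batch_size):
    """Return (real_tokens, processed_tokens) when each batch is padded
    to the length of its longest sequence."""
    real_tokens = sum(lengths)
    processed_tokens = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        processed_tokens += max(batch) * len(batch)
    return real_tokens, processed_tokens


# Hypothetical token counts: three short inputs and one very long one
lengths = [5, 7, 400, 6]

real, processed_bs1 = padding_overhead(lengths, batch_size=1)
_, processed_bs4 = padding_overhead(lengths, batch_size=4)

print(f"real tokens:              {real}")
print(f"processed (batch_size=1): {processed_bs1}")
print(f"processed (batch_size=4): {processed_bs4}")
```

With one outlier in the batch, the model processes far more padded positions than real tokens, so a larger `batch_size` is worth measuring on your own data rather than assuming it is faster.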

In [None]:
vision_classifier = pipeline(model="google/vit-base-patch16-224")
preds = vision_classifier(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds


[{'score': 0.4335, 'label': 'lynx, catamount'},
 {'score': 0.0348,
  'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'},
 {'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'},
 {'score': 0.0239, 'label': 'Egyptian cat'},
 {'score': 0.0229, 'label': 'tiger cat'}]

# Datasets - Data loading

In [None]:
# Install the datasets library if not done so
!pip install datasets

In [None]:
# Load from a given dataset
from datasets import load_dataset

dataset = load_dataset("food101", split="train[:100]")
dataset[0]["image"]


# Training - Models, Trainer, Training Config, Datasets Library

In [None]:
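# Install accelerate (required by Trainer with PyTorch) and datasets.
# In Colab, the runtime may need a restart after installing accelerate
# before the Trainer import picks it up.
!pip install accelerate -U
!pip install datasets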



In [None]:
# Step 1 - import the trainer class
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

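# Step 2 - create a trainer object with all the settings
# (assumes `model` and `tokenizer` are already defined and that
#  `dataset` has "train" and "test" splits)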
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    # data_collator=data_collator,
)


# Step 3 - kick start the training with just 1 line
trainer.train()



ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`

# Generative AI - Text Generation