<a href="https://colab.research.google.com/github/desankha88/desankha88/blob/main/Hf_Tokenizer_Models_and_Ouputs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Hugging Face Tokenizer, Models and Model Outputs**

### Objectives:

At the end of the experiment you will be able to:

1. use the tokenizer classes in Hugging Face (HF)
2. load models in HF
3. read and interpret model outputs
4. use the built-in HF pipelines for different tasks

### **Tokenizer**
[BERT](https://huggingface.co/transformers/v3.0.2/model_doc/bert.html)

A tokenizer is in charge of preparing the inputs for a model. The Transformers library contains tokenizers for all of its models; tokenization rules are model-specific and differ from model to model.

There is, however, a universal interface, so you don't have to worry about picking the right class. The **`AutoTokenizer`** class accepts a model checkpoint, just as pipelines do, and automatically returns the correct tokenizer object.

For example, if your model checkpoint is based on BERT, you get back a tokenizer object that has all the components the BERT model requires.

In [None]:
!pip install transformers



In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint = 'bert-base-cased'  # Several BERT checkpoints are available
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
tokenizer # Notice the output from tokenizer object

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

#### Tokenizing any text:

Note the result of the code below: it is a dictionary with several keys. The key `input_ids` holds the integer IDs of the tokens produced by tokenizing the input. `input_ids` is always present, because you always need to convert your text into token IDs. The other keys are `attention_mask` and `token_type_ids`.

These keys can be model-specific. For instance, `token_type_ids` shows up for BERT but not for DistilBERT.

In [None]:
tokenizer("Hello Dost") # --> "[CLS] Hello Dost [SEP] " --> "[CLS] Hello  Do  ##st [SEP]"

{'input_ids': [101, 8667, 2091, 2050, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

#### Using tokenizer object

In [None]:
tokens= tokenizer.tokenize("Hello Dost")
tokens

['Hello', 'Do', '##st']

#### Tokens into token IDs

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[8667, 2091, 2050]

#### Above two steps can be done in one go by encoding

In [None]:
ids_n = tokenizer.encode("Hello Dost")
ids_n

[101, 8667, 2091, 2050, 102]

**Note:** there are 5 IDs above because the input was converted to `[CLS]`, `Hello`, `Do`, `##st`, `[SEP]`.

In [None]:
tokenizer.convert_ids_to_tokens(ids_n)

['[CLS]', 'Hello', 'Do', '##st', '[SEP]']

In [None]:
tokenizer.decode(ids_n) # Gives the single string with tokens joined back together

'[CLS] Hello Dost [SEP]'
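
The tokenize, convert, encode and decode steps above can be mimicked without the library. The following is a minimal sketch, assuming a tiny hand-picked vocabulary that reuses the IDs printed above; the helper names (`wordpiece`, `tokenize`, `encode`, `decode`) are hypothetical and the real BERT tokenizer also splits punctuation and handles many more cases.

```python
# Toy vocabulary: a hand-picked subset that reuses the IDs shown above (assumption)
VOCAB = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "Hello": 8667, "Do": 2091, "##st": 2050}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}


def wordpiece(word):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text):
    """Whitespace split, then WordPiece on each word (no punctuation handling)."""
    return [p for word in text.split() for p in wordpiece(word)]


def encode(text):
    """Token IDs wrapped in [CLS] ... [SEP], as for a single BERT sentence."""
    return [VOCAB["[CLS]"]] + [VOCAB[t] for t in tokenize(text)] + [VOCAB["[SEP]"]]


def decode(ids):
    """Join tokens with spaces and glue ## continuation pieces back on."""
    return " ".join(ID_TO_TOKEN[i] for i in ids).replace(" ##", "")


print(tokenize("Hello Dost"))
print(encode("Hello Dost"))
print(decode(encode("Hello Dost")))
```

With this toy vocabulary any word that cannot be assembled from known pieces maps to `[UNK]`; the real vocabulary of 28,996 entries makes that much rarer.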

#### Getting tensor as an output

The tokenizer output we just saw is a dictionary whose values are all Python lists. A PyTorch model does not accept lists as input.

PyTorch models process torch tensors, so to get the values back as tensors we set the `return_tensors` argument to the string `'pt'`.

For TensorFlow use the string `'tf'`, and if you just want NumPy arrays pass `'np'`.



In [None]:
tokenizer("Hello Dost",return_tensors='pt') # tf, np

{'input_ids': tensor([[ 101, 8667, 2091, 2050,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

#### Multiple inputs: two more parameters are needed, `padding` and `truncation`

In [None]:
data = [ "Hello where are you going? " ,"going to home.",  "I like playing cricket."]

In [None]:
model_inputs=tokenizer(data,padding=True,truncation=True,return_tensors='pt')
model_inputs

{'input_ids': tensor([[ 101, 8667, 1187, 1132, 1128, 1280,  136,  102],
        [ 101, 1280, 1106, 1313,  119,  102,    0,    0],
        [ 101,  146, 1176, 1773, 5428,  119,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 0]])}

Purpose of the attention mask: it tells the model which tokens it should pay attention to. Wherever the attention mask is zero, the model ignores the token, so padding tokens do not contribute to the model output, which is as intended.
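
One common way a mask takes effect inside a Transformer is by pushing the attention scores of masked positions to minus infinity before the softmax, so their weights become exactly zero. The snippet below is a simplified, library-free sketch of that idea, not BERT's actual implementation; the function name `masked_softmax` and the score values are made up for illustration.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores in which positions with mask 0 receive zero weight."""
    # Masked scores become -inf, and exp(-inf) evaluates to 0.0
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw attention scores for the 8 positions of the second sentence
scores = [0.5, 1.2, -0.3, 0.8, 0.1, 0.4, 2.0, 1.5]
mask = [1, 1, 1, 1, 1, 1, 0, 0]  # attention mask row taken from the batch above

weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

The last two positions are padding, so they end up with zero weight even though their raw scores were the largest.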

In summary, to ensure that a batch of data can be fed into a PyTorch model, we need to specify the `padding`, `truncation` and `return_tensors` arguments. **Also note that PyTorch is the default for Hugging Face and currently the most flexible.**
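
To make the effect of padding and truncation concrete, here is a small sketch that rebuilds the batch by hand from the token IDs printed above. It assumes single-sentence BERT inputs, right-side padding and the special-token IDs from the tokenizer printout; `build_batch` is a hypothetical helper, not a library function.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # IDs taken from the tokenizer printout above


def build_batch(token_id_lists, max_length=None):
    """Add [CLS]/[SEP], optionally truncate, then right-pad to the longest row."""
    rows = []
    for ids in token_id_lists:
        if max_length is not None:
            ids = ids[: max_length - 2]  # leave room for the two special tokens
        rows.append([CLS_ID] + ids + [SEP_ID])
    width = max(len(row) for row in rows)
    input_ids = [row + [PAD_ID] * (width - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs without special tokens, copied from the batch output above
batch = build_batch([
    [8667, 1187, 1132, 1128, 1280, 136],
    [1280, 1106, 1313, 119],
    [146, 1176, 1773, 5428, 119],
])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Every row now has the same length, which is what allows the batch to be stacked into a single rectangular tensor.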

### **Model**

We use a BERT model for text classification through the `AutoModelForSequenceClassification` class, since it is more flexible; a BERT-specific model class could be used instead. To load a pre-trained BERT model, we simply call `from_pretrained`, just as we did with the tokenizer.

**Note that the checkpoint we pass in must match the checkpoint we passed to the tokenizer, so that we get the right tokenizer for the model.**

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
checkpoint = 'bert-base-cased'  # cased: keeps upper and lower case | uncased: text is lowercased first
model= AutoModelForSequenceClassification.from_pretrained(checkpoint)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Making Predictions

In [None]:
data

['Hello where are you going? ', 'going to home.', 'I like playing cricket.']

In [None]:
model_inputs = tokenizer(data, padding = True, return_tensors = 'pt')
model_inputs

{'input_ids': tensor([[ 101, 8667, 1187, 1132, 1128, 1280,  136,  102],
        [ 101, 1280, 1106, 1313,  119,  102,    0,    0],
        [ 101,  146, 1176, 1773, 5428,  119,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 0]])}

In [None]:
model_inputs['input_ids']

tensor([[  101,  8667,  1187,  1132,  1128,  1280,   136,   102],
        [  101, 11099,  1106,  1313,   119,   102,     0,     0],
        [  101,   146,  1176,  1773,  5428,   119,   102,     0]])

In [None]:
model_inputs['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 0]])

In [None]:
outputs = model(**model_inputs)  # ** unpacks the dict into keyword (named) arguments
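
The `**` in the call above is plain Python dictionary unpacking. A minimal sketch, assuming a hypothetical `forward` function that stands in for the model's call signature:

```python
def forward(input_ids=None, token_type_ids=None, attention_mask=None):
    """Hypothetical stand-in for a model's forward signature."""
    return {"n_rows": len(input_ids), "has_mask": attention_mask is not None}


toy_inputs = {
    "input_ids": [[101, 8667, 102]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}

# These two calls are equivalent: ** turns each key into a keyword argument
unpacked = forward(**toy_inputs)
explicit = forward(
    input_ids=toy_inputs["input_ids"],
    token_type_ids=toy_inputs["token_type_ids"],
    attention_mask=toy_inputs["attention_mask"],
)
print(unpacked == explicit)
```

This is why the tokenizer's dictionary keys need to match the parameter names the model expects.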

In [None]:
data

['Hello where are you going?', 'Going to home.', 'I like playing cricket.']

In [None]:
outputs  # These logits are not meaningful: the final (classifier) layers are untrained, and by default the model assumes binary classification

SequenceClassifierOutput(loss=None, logits=tensor([[-0.2633, -0.7576],
        [-0.1019, -0.6716],
        [-0.2258, -0.7429]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

### **Model Outputs**
* If you pass in N documents, you get back an N×K output, where K is the number of classes.
* If you pass in a single document, you get back a 1×K output, i.e. a single row of K values.
* The outputs are logits, i.e. the values before softmax is applied.
* To get the class prediction, just take the argmax.
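
As an illustration of the last two points, the sketch below applies softmax and argmax to the 3×2 logits printed earlier, using only the standard library. The helper names are hypothetical; on a torch tensor the same steps would typically be `outputs.logits.softmax(dim=-1)` and `outputs.logits.argmax(dim=-1)`. Since the classifier head is untrained, the resulting predictions carry no meaning; only the mechanics matter here.

```python
import math


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=lambda i: row[i])


# N=3 documents, K=2 classes: logits copied from the output shown earlier
logits = [[-0.2633, -0.7576],
          [-0.1019, -0.6716],
          [-0.2258, -0.7429]]

probs = [softmax(row) for row in logits]
preds = [argmax(row) for row in logits]
print(preds)
```

Because softmax preserves the ordering of the values, taking the argmax of the logits gives the same class as taking the argmax of the probabilities.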


In [None]:
model_n= AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)
model_n

In [None]:
outputs_n = model_n(**model_inputs)  # K=4: number of classes; N=3: number of documents (sentences in this case)
outputs_n

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.4259,  0.0940,  0.2969, -0.0144],
        [ 0.3444,  0.3503,  0.3919, -0.0832],
        [ 0.3527,  0.1463,  0.4027, -0.0383]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [None]:
outputs_n.logits # Method1

tensor([[ 0.2527, -0.3958, -0.2612, -0.5529],
        [ 0.2320, -0.4483, -0.3449, -0.4332],
        [ 0.2593, -0.4572, -0.2861, -0.4897]], grad_fn=<AddmmBackward0>)

In [None]:
outputs_n['logits'] # Method2

tensor([[ 0.2527, -0.3958, -0.2612, -0.5529],
        [ 0.2320, -0.4483, -0.3449, -0.4332],
        [ 0.2593, -0.4572, -0.2861, -0.4897]], grad_fn=<AddmmBackward0>)

In [None]:
outputs_n[0] # Method3 --> Not recommended

tensor([[ 0.2527, -0.3958, -0.2612, -0.5529],
        [ 0.2320, -0.4483, -0.3449, -0.4332],
        [ 0.2593, -0.4572, -0.2861, -0.4897]], grad_fn=<AddmmBackward0>)

In [None]:
outputs.logits.detach().cpu().numpy() # convert into numpy array

array([[ 0.34314597, -0.10546011],
       [ 0.4258019 , -0.18717323],
       [ 0.29425198, -0.13826472]], dtype=float32)

### **Built in Pipeline**

[Link: HF-pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines)

#### **Example-1: Pipeline-Sentiment Analysis**

In [None]:
from transformers import pipeline

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I ate Biryani in HYD not so good.", "I like the book by Geron in ML but it is in Tensorflow"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'NEGATIVE', 'score': 0.9992625117301941},
 {'label': 'NEGATIVE', 'score': 0.5498270392417908}]

In [None]:
type(sentiment_pipeline)

#### **Example-2: Pipeline-Question Answering**

In [None]:
# The task "question-answering" will return a QuestionAnsweringPipeline object
question_answer = pipeline(task="question-answering", model="distilbert-base-cased-distilled-squad")

In [None]:
context = """
Tea is an aromatic beverage prepared by pouring hot or boiling water over cured or fresh leaves of Camellia sinensis,
an evergreen shrub native to China and East Asia. After water, it is the most widely consumed drink in the world.
There are many different types of tea; some, like Chinese greens and Darjeeling, have a cooling, slightly bitter,
and astringent flavour, while others have vastly different profiles that include sweet, nutty, floral, or grassy
notes. Tea has a stimulating effect in humans primarily due to its caffeine content.

The tea plant originated in the region encompassing today's Southwest China, Tibet, north Myanmar and Northeast India,
where it was used as a medicinal drink by various ethnic groups. An early credible record of tea drinking dates to
the 3rd century AD, in a medical text written by Hua Tuo. It was popularised as a recreational drink during the
Chinese Tang dynasty, and tea drinking spread to other East Asian countries. Portuguese priests and merchants
introduced it to Europe during the 16th century. During the 17th century, drinking tea became fashionable among the
English, who started to plant tea on a large scale in India.

The term herbal tea refers to drinks not made from Camellia sinensis: infusions of fruit, leaves, or other plant
parts, such as steeps of rosehip, chamomile, or rooibos. These may be called tisanes or herbal infusions to prevent
confusion with 'tea' made from the tea plant."""

In [None]:
result = question_answer(question="Where is tea native to?", context=context)  # BERT --> extractive Q&A | GPT --> generative Q&A
print(result['answer'])

China and East Asia


In [None]:
questions = ["Where is tea native to?",
             "When was tea discovered?",
             "What is the species name for tea?"]

results = question_answer(question=questions, context=context)

for q, r in zip(questions, results):
    print(q, "\n>> " + r['answer'])

Where is tea native to? 
>> China and East Asia
When was tea discovered? 
>> 3rd century AD
What is the species name for tea? 
>> Camellia sinensis


#### [Pipeline examples for almost all tasks](https://www.kdnuggets.com/2023/02/simple-nlp-pipelines-huggingface-transformers.html)