In [2]:
import transformers
from transformers import AutoModel,AutoBackbone, AutoFeatureExtractor, AutoTokenizer


### Preprocessing
`pipeline()` consists of three stages: a) preprocessing, b) the model, and c) postprocessing.

In NLP, preprocessing means splitting text into tokens and mapping each token to an integer ID. This is performed by a tokenizer.
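
The three stages can be pictured with a toy example. This is only a sketch: `TOY_VOCAB`, `preprocess`, `toy_model`, `postprocess`, and `toy_pipeline` are hypothetical stand-ins written for illustration, not the way `transformers` implements `pipeline()`.

```python
# Toy three-stage pipeline: preprocess -> model -> postprocess.
# All names here are hypothetical; this is not the transformers implementation.

TOY_VOCAB = {"[UNK]": 0, "good": 1, "bad": 2, "movie": 3}


def preprocess(text):
    """Split on whitespace and map each token to an integer ID."""
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in text.lower().split()]


def toy_model(input_ids):
    """Stand-in for a model: returns one score per label (negative, positive)."""
    positive = float(input_ids.count(TOY_VOCAB["good"]))
    negative = float(input_ids.count(TOY_VOCAB["bad"]))
    return [negative, positive]


def postprocess(scores):
    """Turn raw scores into a human-readable label."""
    labels = ["NEGATIVE", "POSITIVE"]
    return labels[scores.index(max(scores))]


def toy_pipeline(text):
    """Chain the three stages in order."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("good movie"))
```

The rest of this notebook looks only at the first stage, where a real tokenizer replaces the whitespace split used here.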

In [3]:
## The first step in an NLP task is tokenizing the raw text; for this we will use AutoTokenizer
model_identifier='google-bert/bert-base-uncased'
tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_identifier)

In [4]:
sequence="The quick brown fox jumps over the lazy dog."

tokenized_sequence=tokenizer(sequence)

Outputs of the tokenizer above

1) **input_ids**: The index of each token in the tokenizer's vocabulary. Each token is converted to an integer ID by vocabulary lookup.
2) **attention_mask**: Marks which tokens should be attended to (1) and which should not (0). Padding tokens are not attended to, so they get the value 0.
3) **token_type_ids**: Identifies which sequence a token belongs to when there is more than one sequence. For example, in QA the *question* tokens get one value and the *answer* tokens get another.
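
To make the three outputs concrete, here is a minimal sketch that builds them by hand for a question/answer pair. The vocabulary and the `encode_pair` helper are hypothetical; only the special IDs (101 and 102 at the start and end of each sequence) mirror what the BERT outputs in this notebook show.

```python
# Toy illustration of input_ids, token_type_ids and attention_mask for a pair.
# TOY_VOCAB and encode_pair are hypothetical, not part of transformers.

CLS_ID, SEP_ID = 101, 102  # IDs seen at the start/end of the BERT outputs in this notebook
TOY_VOCAB = {"who": 1, "wrote": 2, "it": 3, "alice": 4, "did": 5}


def encode_pair(question, answer):
    q_ids = [TOY_VOCAB[w] for w in question.split()]
    a_ids = [TOY_VOCAB[w] for w in answer.split()]
    input_ids = [CLS_ID] + q_ids + [SEP_ID] + a_ids + [SEP_ID]
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(q_ids) + 2) + [1] * (len(a_ids) + 1)
    # Nothing is padded here, so every position is attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair("who wrote it", "alice did")
for name, values in encoded.items():
    print(name, values)
```

All three lists have the same length, one entry per token.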

In [9]:
## We can also tokenize a batch. With padding=True, shorter sequences are padded (with ID 0 here) so that every sequence in the batch has the same number of tokens.

batch_Sequence=[
    "Hi, my name is Akash.",
    "The quick brown fox jumps toward the lazy dog."
]

batch_tokenized_sequence=tokenizer(batch_Sequence, padding=True) ## every input_ids list now has the same length

In [10]:
print("input_ids",batch_tokenized_sequence['input_ids'])
print("attention_mask",batch_tokenized_sequence['attention_mask']) ## the attention mask is 0 at the padded positions

input_ids [[101, 7632, 1010, 2026, 2171, 2003, 9875, 4095, 1012, 102, 0, 0], [101, 1996, 4248, 2829, 4419, 14523, 2646, 1996, 13971, 3899, 1012, 102]]
attention_mask [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
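
The padding behaviour visible in this output can be sketched in a few lines of plain Python. `pad_batch` is a hypothetical helper, and it assumes right-side padding with ID 0, which is what the output above shows for this tokenizer.

```python
# Minimal sketch of padding a batch to its longest sequence.
# pad_batch is a hypothetical helper, not a transformers function.

PAD_ID = 0  # pad ID seen in the output above


def pad_batch(batch_ids, pad_id=PAD_ID):
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for the padding that was just appended.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7632, 102], [101, 1996, 4248, 2829, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```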


In [11]:
## Some sequences may be too long for the model. With truncation=True, sequences are truncated to "max_length" if it is given, otherwise to the maximum input length the model accepts.

batch_tokenized_sequence=tokenizer(batch_Sequence,padding=True,truncation=True)

print("input_ids",batch_tokenized_sequence['input_ids'])
print("attention_mask",batch_tokenized_sequence['attention_mask']) 

input_ids [[101, 7632, 1010, 2026, 2171, 2003, 9875, 4095, 1012, 102, 0, 0], [101, 1996, 4248, 2829, 4419, 14523, 2646, 1996, 13971, 3899, 1012, 102]]
attention_mask [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
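
The output is unchanged because both sequences are only 12 tokens long, well under the 512 positions this BERT checkpoint supports. A rough sketch of what truncation does when a sequence is too long, assuming the special tokens at both ends are kept; `truncate_with_specials` is a hypothetical helper, not the library's code.

```python
# Minimal sketch of truncating to max_length while keeping [CLS] and [SEP].
# truncate_with_specials is a hypothetical helper, not a transformers function.

CLS_ID, SEP_ID = 101, 102


def truncate_with_specials(content_ids, max_length):
    budget = max_length - 2  # leave room for [CLS] and [SEP]
    if budget < 0:
        raise ValueError("max_length must be at least 2")
    return [CLS_ID] + content_ids[:budget] + [SEP_ID]


long_ids = [1996, 4248, 2829, 4419, 14523]
print(truncate_with_specials(long_ids, max_length=5))  # content cut to 3 tokens
print(truncate_with_specials([7632], max_length=5))    # short input is left alone
```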


In [12]:
## Notice that the returned values are plain Python lists, not tensors; return_tensors="pt" returns PyTorch tensors instead.
batch_tokenized_sequence=tokenizer(batch_Sequence,padding=True,truncation=True,return_tensors="pt")

print("input_ids",batch_tokenized_sequence['input_ids'])
print("attention_mask",batch_tokenized_sequence['attention_mask'])
print('type check')

print("type(input_ids",type(batch_tokenized_sequence['input_ids']))



input_ids tensor([[  101,  7632,  1010,  2026,  2171,  2003,  9875,  4095,  1012,   102,
             0,     0],
        [  101,  1996,  4248,  2829,  4419, 14523,  2646,  1996, 13971,  3899,
          1012,   102]])
attention_mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
type check
type(input_ids <class 'torch.Tensor'>


In [32]:
batch_ids=batch_tokenized_sequence['input_ids']

In [33]:
decode_tokens=tokenizer.batch_decode(batch_ids,skip_special_tokens=True) ## for a single sequence, use tokenizer.decode instead

In [34]:
## convert the sequences of IDs back into sentences
decode_tokens

['hi, my name is akash.', 'the quick brown fox jumps toward the lazy dog.']
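
Decoding is the reverse lookup, from IDs back to tokens, and the text comes back lowercased because this checkpoint is uncased. A toy sketch of the idea, including the effect of `skip_special_tokens`; the `ID_TO_TOKEN` table and both helpers are hypothetical, and a real tokenizer also merges word pieces and fixes spacing.

```python
# Toy sketch of decoding IDs back to text.
# ID_TO_TOKEN, toy_decode and toy_batch_decode are hypothetical names.

ID_TO_TOKEN = {0: "[PAD]", 101: "[CLS]", 102: "[SEP]", 1: "hello", 2: "world"}
SPECIAL_IDS = {0, 101, 102}


def toy_decode(ids, skip_special_tokens=True):
    tokens = [
        ID_TO_TOKEN[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return " ".join(tokens)


def toy_batch_decode(batch_ids, skip_special_tokens=True):
    return [toy_decode(ids, skip_special_tokens) for ids in batch_ids]


toy_batch = [[101, 1, 2, 102, 0], [101, 2, 102, 0, 0]]
print(toy_batch_decode(toy_batch))
print(toy_decode(toy_batch[0], skip_special_tokens=False))
```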

### Revising the *call* parameters of the tokenizer returned by AutoTokenizer.from_pretrained()

1. inputs: can be a single sequence or a batch (list) of sequences
2. padding: if set to True, pads shorter sequences (with ID 0 for this tokenizer) up to the longest sequence in the batch
3. truncation: if set to True, truncates long sequences to "max_length", or to the maximum input length of the model when "max_length" is not given
4. return_tensors: if "pt", returns PyTorch tensors; if "tf", returns TensorFlow tensors

In [10]:
from transformers import AutoModelForSequenceClassification

### Parameters of AutoModelForSequenceClassification.from_pretrained()

- **pretrained_model_name_or_path**: This parameter specifies the pre-trained model to load. It can be the name of the model (e.g., "bert-base-uncased") or a path to a directory containing model weights and configuration files.

- **config**: (Optional) A configuration object, typically loaded with AutoConfig, which defines the architecture and other settings of the model. If it is omitted, the configuration is loaded from the checkpoint.

- **cache_dir**: (Optional) The directory where downloaded pre-trained models are cached.

- **Additional keyword arguments**: Keyword arguments that match a configuration attribute override that attribute. They vary depending on the specific model architecture you're using. For example, for BERT models, common ones include **num_labels (number of output labels) or hidden_dropout_prob (dropout probability for hidden layers)**.

- **model** and **tokenizer** are not arguments of this method. The concrete model class, such as BertForSequenceClassification, is selected from the checkpoint's configuration, and the tokenizer, such as BertTokenizer, is loaded separately with AutoTokenizer.from_pretrained(), which accepts its own keyword arguments.
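
The way extra keyword arguments such as num_labels end up in the model's configuration can be pictured with a small sketch. It is a loose illustration under assumptions: `ToyConfig` and `split_kwargs` are hypothetical names, and the real loading code does considerably more than this.

```python
# Loose sketch: keyword arguments that match a config attribute override it,
# and the rest are handed on. ToyConfig and split_kwargs are hypothetical.
from dataclasses import dataclass, fields


@dataclass
class ToyConfig:
    num_labels: int = 2
    hidden_dropout_prob: float = 0.1


def split_kwargs(config, **kwargs):
    config_field_names = {f.name for f in fields(config)}
    unused = {}
    for key, value in kwargs.items():
        if key in config_field_names:
            setattr(config, key, value)  # override the matching attribute
        else:
            unused[key] = value  # not a config attribute, pass it along
    return config, unused


config, unused = split_kwargs(ToyConfig(), num_labels=3, some_model_arg=True)
print(config)
print(unused)
```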


### Parameters of forward() on the model returned by AutoModelForSequenceClassification.from_pretrained()

- **input_ids**: Tensor containing the token indices of the input sequences.
- **attention_mask**: (Optional) Tensor of 1s and 0s specifying which tokens should be attended to (1) and which should not (0, e.g., padding tokens).
- **token_type_ids**: (Optional) Tensor of segment indices indicating which segment each token belongs to (e.g., for sequence pairs).
- **position_ids**: (Optional) Tensor containing indices indicating the position of each token in the input sequence.
- **head_mask**: (Optional) Tensor specifying which heads to mask in the self-attention layers.
- **inputs_embeds**: (Optional) Tensor containing the embeddings of the input sequences instead of token indices.
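
To see why attention_mask matters, the sketch below applies a mask inside a softmax over attention scores, so padded positions receive zero weight. It is a simplified illustration that assumes at least one position is unmasked; `masked_softmax` is a hypothetical helper, and real implementations typically add a large negative value to the scores rather than using infinity.

```python
# Simplified sketch of how an attention mask removes padded positions.
# masked_softmax is a hypothetical helper, not the BERT implementation.
import math


def masked_softmax(scores, attention_mask):
    # Masked positions become -inf, which exp() maps to exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_softmax([2.0, 1.0, 0.5, 0.5], [1, 1, 0, 0])
print(weights)
```

The last two weights are zero, so the padded tokens contribute nothing to the attended representation.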


In [11]:
model=AutoModelForSequenceClassification.from_pretrained(model_identifier)

model.safetensors: 100%|██████████| 440M/440M [01:01<00:00, 7.10MB/s] 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [16]:
## For fine-tuning on a new task we can keep the same pretrained weights and change num_labels; the final classification layer is then newly created with that many outputs and has to be trained

model=AutoModelForSequenceClassification.from_pretrained(model_identifier,num_labels=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
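
The warning names 'classifier.weight' and 'classifier.bias' because the classification head does not exist in the pretrained checkpoint and is created from scratch. A plain-Python sketch of such a head, mapping a 768-dimensional vector to 3 label probabilities; the helper names, the input vector, and the initialisation scale of 0.02 are assumptions made for illustration.

```python
# Sketch of a freshly initialised classification head: hidden_size -> num_labels.
# make_classifier_head and classify are hypothetical; the 0.02 scale is an assumption.
import math
import random


def make_classifier_head(hidden_size, num_labels, seed=0):
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weight, bias


def classify(pooled, weight, bias):
    # One logit per label: dot product of the pooled vector with each weight row.
    logits = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weight, bias)]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


weight, bias = make_classifier_head(hidden_size=768, num_labels=3)
probs = classify([0.1] * 768, weight, bias)
print(len(weight), len(weight[0]), probs)
```

Because the weights are random, the probabilities carry no meaning until the head is trained, which is what the warning asks for.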


In [18]:
model ## the classifier head now has 3 outputs (the printout below is cut off before that layer)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [19]:
## We will cover AutoModel in detail in other notebooks