In [1]:
# Install the Transformers library from Hugging Face
!pip install transformers



In [2]:
# pipeline() downloads a tokenizer and a model for the requested task.
# Because no model is named here, it falls back to the default
# checkpoint for 'sentiment-analysis'.
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [3]:
# Testing the classifier for single input
print(classifier('I am happy to learn about transformers'))
print(classifier('I am not happy going to zoo'))

[{'label': 'POSITIVE', 'score': 0.9998570680618286}]
[{'label': 'NEGATIVE', 'score': 0.9996175765991211}]


In [4]:
# Testing the classifier for multiple inputs
results = classifier([
    'I am happy to learn about transformers',
    'I am not happy going to zoo'
])

In [5]:
# Printing the results
for result in results:
  print(f"label: {result['label']}, with score: {result['score']}")

label: POSITIVE, with score: 0.9998570680618286
label: NEGATIVE, with score: 0.9996175765991211
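
When a list is passed to the classifier, the results come back in the same order as the inputs, so each text can be paired with its prediction using `zip`. The sketch below hard-codes the results from the output above so that it runs without downloading a model; `pair_results` is a hypothetical helper, not part of Transformers.

```python
# Inputs and results copied from the cells above; in the notebook you would
# use the live `results` list returned by classifier([...]) instead.
texts = [
    'I am happy to learn about transformers',
    'I am not happy going to zoo',
]
results = [
    {'label': 'POSITIVE', 'score': 0.9998570680618286},
    {'label': 'NEGATIVE', 'score': 0.9996175765991211},
]

def pair_results(texts, results):
    """Pair each input text with its predicted label and rounded score."""
    return [
        (text, result['label'], round(result['score'], 4))
        for text, result in zip(texts, results)
    ]

for text, label, score in pair_results(texts, results):
    print(f"{label:8s} {score:.4f}  {text}")
```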


In [6]:
# Using an explicitly specified model (pre-trained on multiple languages)
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [7]:
# This model predicts a rating from '1 star' (very bad) to '5 stars' (very good).
# The score is the probability the model assigns to the predicted label.
classifier("Esperamos que no lo odie.")

[{'label': '3 stars', 'score': 0.33688199520111084}]
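
The star label is a plain string, so it usually needs to be converted to a number before it can be averaged or thresholded. A minimal sketch, assuming the labels always look like `'1 star'` or `'3 stars'`; `parse_star_label` and `CONFIDENCE_THRESHOLD` are hypothetical names chosen for illustration.

```python
def parse_star_label(label):
    """Extract the integer rating from labels such as '1 star' or '3 stars'."""
    return int(label.split()[0])

# Prediction copied from the output above.
prediction = {'label': '3 stars', 'score': 0.33688199520111084}
stars = parse_star_label(prediction['label'])

# With five classes, uniform guessing would give 0.2 per class, so a top
# score of about 0.34 signals a fairly uncertain prediction.
CONFIDENCE_THRESHOLD = 0.5  # hypothetical cut-off, tune for your use case
is_confident = prediction['score'] >= CONFIDENCE_THRESHOLD

print(stars, is_confident)
```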

In [8]:
# Everything so far used pre-trained models directly through pipeline().
# To load a model and tokenizer explicitly, for example to run them
# locally or to fine-tune them, follow the steps below.

# AutoTokenizer:
#   handles text preprocessing: tokenization, adding special tokens,
#   padding, and creating attention masks.
# Task-specific model classes (e.g., TFAutoModelForSequenceClassification):
#   load pre-trained models for a specific task, such as sequence
#   classification, and turn tokenized inputs into predictions.
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

In [9]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [10]:
classifier("Esperamos que no lo odie.")

[{'label': '3 stars', 'score': 0.3368818163871765}]
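
The label matches the earlier pipeline call, but the score differs in the seventh decimal place. Small differences like this are common in floating-point computation when the same weights run through a different backend, so scores are better compared with a tolerance than with `==`. A small sketch using the two scores printed above; the tolerance value is an assumption.

```python
import math

first_score = 0.33688199520111084   # from the earlier pipeline call
second_score = 0.3368818163871765   # from the pipeline built on the TF model

# Exact equality fails even though the predictions agree in practice.
print(first_score == second_score)

# Comparing with a relative tolerance (1e-5 here, an assumed value) succeeds.
print(math.isclose(first_score, second_score, rel_tol=1e-5))
```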

In [11]:
# What is happening under the hood:
# input_ids: the token IDs representing the input text.
# token_type_ids: all zeros, because this is a single sequence (every token belongs to the same segment).
# attention_mask: all ones, because every token is real and there is no padding to ignore.
encoded = tokenizer("I am happy to learn about transformers")  # avoid shadowing the built-in input()
print(encoded)

{'input_ids': [101, 151, 10345, 19308, 10114, 34990, 10935, 58263, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
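
The three fields can be reproduced by hand once the token IDs are known. The sketch below is not the tokenizer's implementation; it only rebuilds the structure of the output above, assuming that 101 and 102 are the `[CLS]` and `[SEP]` special tokens, as in the standard BERT vocabularies. `build_single_sequence_inputs` is a hypothetical helper.

```python
CLS_ID = 101  # assumed [CLS] token ID (first ID in the output above)
SEP_ID = 102  # assumed [SEP] token ID (last ID in the output above)

def build_single_sequence_inputs(token_ids):
    """Wrap token IDs in special tokens and build the matching masks."""
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return {
        'input_ids': input_ids,
        # One segment only, so every position gets segment ID 0.
        'token_type_ids': [0] * len(input_ids),
        # No padding, so every position should be attended to.
        'attention_mask': [1] * len(input_ids),
    }

# Token IDs for the sentence, copied from the output above without 101/102.
token_ids = [151, 10345, 19308, 10114, 34990, 10935, 58263]
print(build_single_sequence_inputs(token_ids))
```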


In [12]:
# Using multiple sentences (no padding yet, so the sequence lengths differ)
encoded = tokenizer([
    "I am happy to learn about transformers",
    "Why should I learn Fine-Tuning"
])
for i in range(len(encoded['input_ids'])):
    print(f"input_ids: {encoded['input_ids'][i]}, token_type_ids: {encoded['token_type_ids'][i]}, attention_mask: {encoded['attention_mask'][i]}")

input_ids: [101, 151, 10345, 19308, 10114, 34990, 10935, 58263, 102], token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]
input_ids: [101, 18469, 14693, 151, 34990, 12922, 118, 30438, 10285, 102], token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [13]:
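# Tokenize the sentences as a single batch
tf_batch = tokenizer(
    [
        "I am happy to learn about transformers",
        "Why should I learn Fine-Tuning"
    ],
    padding=True,        # pad shorter sequences to the longest one in the batch
    truncation=True,     # cut off sequences that are longer than max_length
    max_length=512,      # maximum number of tokens per sequence
    return_tensors="tf"  # return TensorFlow tensors instead of Python lists
)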

In [14]:
for key, value in tf_batch.items():
  print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 151, 10345, 19308, 10114, 34990, 10935, 58263, 102, 0], [101, 18469, 14693, 151, 34990, 12922, 118, 30438, 10285, 102]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
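
The effect of `padding=True` can be illustrated in plain Python: shorter sequences are extended on the right with the padding ID, and the attention mask marks the padded positions with 0. This is only a sketch of the idea, assuming a padding ID of 0 as seen in the output above; `pad_batch` is a hypothetical helper, not the tokenizer's own code.

```python
PAD_ID = 0  # padding token ID, as seen at the end of the first row above

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Unpadded token IDs copied from the earlier per-sentence output.
sequences = [
    [101, 151, 10345, 19308, 10114, 34990, 10935, 58263, 102],
    [101, 18469, 14693, 151, 34990, 12922, 118, 30438, 10285, 102],
]
batch = pad_batch(sequences)
for key, value in batch.items():
    print(f"{key}: {value}")
```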


In [15]:
# The model returns a TFSequenceClassifierOutput, which can also be indexed like a tuple.
# No labels were passed, so it only carries the logits (the final, pre-softmax activations).
tf_output = model(tf_batch)
print(tf_output)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[-2.3471975 , -2.088179  ,  0.04335596,  1.7839197 ,  1.90646   ],
       [-0.6704417 , -0.39086363,  0.22899042,  0.37120494,  0.3005649 ]],
      dtype=float32)>, hidden_states=None, attentions=None)


In [16]:
# Apply softmax to the logits (tf_output[0]) to turn them into class probabilities.
import tensorflow as tf
tf_predictions = tf.nn.softmax(tf_output[0], axis=-1)

In [17]:
print(tf_predictions)

tf.Tensor(
[[0.00685754 0.00888502 0.07488114 0.42686412 0.48251215]
 [0.09751094 0.1289652  0.23970205 0.27633426 0.25748748]], shape=(2, 5), dtype=float32)
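
The same probabilities can be reproduced without TensorFlow, which makes the softmax step easier to follow. The sketch below uses the logits printed earlier; the mapping from class index `i` to `i + 1` stars is an assumption based on the model's `'1 star'` to `'5 stars'` labels.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above (one row per sentence).
logits = [
    [-2.3471975, -2.088179, 0.04335596, 1.7839197, 1.90646],
    [-0.6704417, -0.39086363, 0.22899042, 0.37120494, 0.3005649],
]
probabilities = [softmax(row) for row in logits]

# Assumed mapping: class index i corresponds to a rating of i + 1 stars.
predicted_stars = [row.index(max(row)) + 1 for row in probabilities]

for probs, stars in zip(probabilities, predicted_stars):
    print([round(p, 4) for p in probs], "->", stars, "stars")
```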


In [18]:
# If you have labels, you can pass them to the model;
# the output then also contains the loss (one value per example here).
tf_output = model(tf_batch, labels=tf.constant([1, 0]))

In [19]:
tf_output

TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([4.723388 , 2.3277907], dtype=float32)>, logits=<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[-2.3471975 , -2.088179  ,  0.04335596,  1.7839197 ,  1.90646   ],
       [-0.6704417 , -0.39086363,  0.22899042,  0.37120494,  0.3005649 ]],
      dtype=float32)>, hidden_states=None, attentions=None)
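
The two loss values are consistent with a per-example cross-entropy between the logits and the given class indices, that is, the negative log of the softmax probability of the labelled class. A plain-Python sketch that recomputes them from the printed logits; `cross_entropy` is a hypothetical helper, not a Transformers function.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of class `label` under softmax(logits)."""
    shift = max(logits)  # stabilizes the log-sum-exp computation
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[label]

# Logits and labels copied from the cells above.
logits = [
    [-2.3471975, -2.088179, 0.04335596, 1.7839197, 1.90646],
    [-0.6704417, -0.39086363, 0.22899042, 0.37120494, 0.3005649],
]
labels = [1, 0]

losses = [cross_entropy(row, label) for row, label in zip(logits, labels)]
print([round(loss, 4) for loss in losses])
```

The first loss is large because the label (class index 1) has a very low predicted probability for that sentence.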

In [21]:
# Once your model is fine-tuned, you can save it with its tokenizer in the following way:
save_directory = "/content"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

In [22]:
from transformers import TFAutoModel, AutoModel

In [24]:
# After saving, the model can be reloaded with either TensorFlow or PyTorch.
tokenizer = AutoTokenizer.from_pretrained(save_directory)
# Loading a saved PyTorch model into a TensorFlow model:
# model = TFAutoModel.from_pretrained(save_directory, from_pt=True)
# Loading a saved TensorFlow model into a PyTorch model.
# Note: AutoModel loads the base BertModel, without the classification head.
model = AutoModel.from_pretrained(save_directory, from_tf=True)

All TF 2.0 model weights were used when initializing BertModel.

All the weights of BertModel were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertModel for predictions without further training.


In [36]:
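# You can also ask the model to return all hidden states and all attention weights if you need them.
# `model` is now the PyTorch BertModel loaded above, so re-encode the batch as PyTorch tensors first.
sentences = ["I am happy to learn about transformers", "Why should I learn Fine-Tuning"]
pt_batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
pt_outputs = model(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = pt_outputs.hidden_states, pt_outputs.attentions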

In [37]:
# We can also use the concrete model and tokenizer classes directly, without the Auto* classes:
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


If you want to change how the model itself is built, you can define a custom configuration. Each architecture comes with its own configuration class (DistilBertConfig in the case of DistilBERT), which lets you specify the hidden dimension, the dropout rate, and so on. If you make core modifications, such as changing the hidden size, you can no longer use the pretrained weights and will need to train from scratch. You would then instantiate the model directly from this configuration.

Here we use the predefined vocabulary of DistilBERT (hence the tokenizer is loaded with DistilBertTokenizer.from_pretrained). To initialize the model from scratch, you would instantiate TFDistilBertForSequenceClassification from the configuration instead of calling its from_pretrained method. The cell below keeps those from-scratch lines commented out and loads the pretrained distilbert-base-uncased weights instead.
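
Before training from scratch with a custom configuration, it can help to sanity-check the sizes. Multi-head attention splits the hidden dimension across the heads, so `dim` is generally expected to be divisible by `n_heads`. A rough sketch using the values from the commented-out `DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)` line in the next cell; `attention_head_size` is a hypothetical helper, not a Transformers function.

```python
def attention_head_size(dim, n_heads):
    """Return the per-head size, or raise if dim cannot be split evenly."""
    if dim % n_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by n_heads={n_heads}")
    return dim // n_heads

dim, n_heads = 512, 8   # values from the commented-out config in the next cell
hidden_dim = 4 * dim    # feed-forward width, as in that config

head_size = attention_head_size(dim, n_heads)
print(f"head size: {head_size}, feed-forward width: {hidden_dim}")
```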

In [57]:
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
model_name = 'distilbert-base-uncased'

# config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

# model = TFDistilBertForSequenceClassification(config)
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, from_pt=True)

# To use a different number of output classes, pass num_labels:
# model = TFDistilBertForSequenceClassification.from_pretrained(model_name, from_pt=True, num_labels=5)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'cla

In [58]:
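# The classification head of this model is newly initialized (see the warning above),
# so its predictions are not meaningful until the model is fine-tuned.
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)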

In [59]:
classifier('Hi, My name is Akbar, and i am a very good boy')

[{'label': 'LABEL_0', 'score': 0.5351351499557495}]
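
A score this close to 0.5 with a generic `LABEL_0` name is what one would expect from a two-class head whose weights were just initialized: the two logits are nearly equal, so softmax gives each class roughly half of the probability. The sketch below illustrates this with made-up logits; the values `[0.14, 0.0]` are hypothetical and were chosen only to land near the score above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from an untrained two-class head: nearly equal values.
hypothetical_logits = [0.14, 0.0]
probs = softmax(hypothetical_logits)

print([round(p, 3) for p in probs])
```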