# Installing the necessary libraries

In [None]:
!pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
!pip install tensorflow==2.15.0
!pip install jax==0.4.23
!pip install transformers==4.35.2
!pip install numpy==1.23.5

# Pipeline Function

Suggestion: If you are running this notebook in Colab, change the runtime type to GPU.
This significantly speeds up model inference.
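
To confirm that a GPU is actually visible after switching the runtime, a quick check along the following lines can help. This is a minimal sketch, assuming TensorFlow is the backend (it is the one used later in this notebook). `gpu_available` is a hypothetical helper name, and it returns False if TensorFlow is not installed.

```python
def gpu_available() -> bool:
    """Return True if TensorFlow can see at least one GPU."""
    try:
        import tensorflow as tf  # imported lazily so the check degrades gracefully
    except ImportError:
        return False
    return len(tf.config.list_physical_devices("GPU")) > 0


print("GPU available:", gpu_available())
```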

In [2]:
# import the pipeline function
from transformers import pipeline

# Initialize the text classifier for sentiment analysis by specifying only the task
classifier = pipeline(task='sentiment-analysis')

# Make inference from the model in one line of code
single_result = classifier('This movie is awesome')
print(single_result)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[{'label': 'POSITIVE', 'score': 0.9998761415481567}]
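
The log above warns against relying on the default model and revision. One way to address that, sketched here under the assumption that the model name and revision printed in the log are the ones you want to pin, is to pass them explicitly to `pipeline`. `PINNED_PIPELINE_ARGS` and `build_pinned_classifier` are hypothetical names introduced for illustration.

```python
# Values copied from the log message above; adjust them if you pin a different checkpoint.
PINNED_PIPELINE_ARGS = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_pinned_classifier():
    """Create the classifier with an explicit model and revision.

    The weights are downloaded on first use, so this needs network access.
    """
    from transformers import pipeline

    return pipeline(**PINNED_PIPELINE_ARGS)
```

Calling `classifier = build_pinned_classifier()` should then behave like the cell above, without the "No model was supplied" warning.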


In [3]:
# Or multiple sentences
multiple_results = classifier(["This movie is awesome", "This movie is awful"])
print(multiple_results)

[{'label': 'POSITIVE', 'score': 0.9998761415481567}, {'label': 'NEGATIVE', 'score': 0.9998006224632263}]
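
The classifier returns one dictionary per input sentence, in the same order as the inputs, so the results can be paired back with their sentences. The sketch below uses the scores copied from the output above rather than calling the model, so it runs on its own.

```python
sentences = ["This movie is awesome", "This movie is awful"]

# Results copied from the output above, so this runs without loading the model.
results = [
    {"label": "POSITIVE", "score": 0.9998761415481567},
    {"label": "NEGATIVE", "score": 0.9998006224632263},
]

# Pair each sentence with its prediction and format a short report line.
report = [
    f"{text!r}: {res['label']} ({res['score']:.4f})"
    for text, res in zip(sentences, results)
]
for line in report:
    print(line)
```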


# Pipeline Under the Hood

## Replicating the pre-processing step

In [4]:
# Importing the AutoTokenizer class to load a tokenizer from a Hugging Face checkpoint
from transformers import AutoTokenizer
import tensorflow as tf

# Specifying the model checkpoint on the Hugging Face model hub
model_checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'

# Load the tokenizer for this checkpoint, which determines the tokenization algorithm and the vocabulary
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [5]:
# Option A: get the model inputs in one line
model_inputs_one = tokenizer('This movie is awesome', return_tensors="tf")
print(model_inputs_one)

{'input_ids': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[  101,  2023,  3185,  2003, 12476,   102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1]], dtype=int32)>}


In [6]:
# Option B: perform the tokenization step by step

# Split the string into tokens
tokens = tokenizer.tokenize('This movie is awesome')
print(tokens)

# Convert the tokens into ids using the tokenizer's vocabulary
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

# Find the special tokens needed for the model
print('Special tokens map:', tokenizer.all_special_tokens)
print('Special tokens ids:', tokenizer.all_special_ids)

# Add the special tokens needed by the model
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs)

# Convert the input ids into a batched tensor to feed to the model
model_inputs = tf.constant([final_inputs['input_ids']])
print(model_inputs)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


['this', 'movie', 'is', 'awesome']
[2023, 3185, 2003, 12476]
Special tokens map: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
Special tokens ids: [100, 102, 0, 101, 103]
{'input_ids': [101, 2023, 3185, 2003, 12476, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]}
tf.Tensor([[  101  2023  3185  2003 12476   102]], shape=(1, 6), dtype=int32)
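
The step-by-step path above can be mimicked in plain Python to make each stage explicit. This is a toy sketch, not the real tokenizer: it assumes a tiny vocabulary holding only the ids printed above, and it splits on whitespace instead of running WordPiece. `TOY_VOCAB` and `toy_encode` are hypothetical names.

```python
# Toy vocabulary containing only the ids printed above (the real vocab.txt is far larger).
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103,
    "this": 2023, "movie": 3185, "is": 2003, "awesome": 12476,
}


def toy_encode(text):
    # Simplification: lowercase and split on whitespace instead of real WordPiece tokenization.
    tokens = text.lower().split()
    # Unknown words fall back to the [UNK] id.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Wrap the sequence in [CLS] ... [SEP], matching the prepare_for_model output above.
    input_ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    # Every position is a real token (no padding), so the mask is all ones.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


print(toy_encode("This movie is awesome"))
```

For this sentence the toy version reproduces the `input_ids` and `attention_mask` shown above; for arbitrary text the real tokenizer will differ, since it splits rare words into subword pieces.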


## Replicating the model inference task

In [7]:
# Importing the TFAutoModel to load the base model
from transformers import TFAutoModel

# Providing the checkpoint to denote the model architecture and the pretrained weights
base_model = TFAutoModel.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

# Compute a contextual, high-dimensional hidden state for each token of the text
model_results = base_model(model_inputs)
print('Model results:',model_results)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Model results: TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(1, 6, 768), dtype=float32, numpy=
array([[[ 0.7164141 ,  0.13687831,  0.18230182, ...,  0.42986768,
          0.9575808 , -0.5615794 ],
        [ 0.82266235,  0.11453596,  0.09670142, ...,  0.33378035,
          1.0659671 , -0.44736785],
        [ 0.83979034,  0.12032587,  0.21936359, ...,  0.29039967,
          0.95869756, -0.48858052],
        [ 0.85048985,  0.15025085,  0.13571319, ...,  0.39664596,
          0.9679358 , -0.39255747],
        [ 0.8707633 ,  0.21408029,  0.16835524, ...,  0.47769868,
          0.9595284 , -0.57123035],
        [ 1.1498686 ,  0.18695728,  0.77936405, ...,  0.5333732 ,
          0.7962912 , -0.9428895 ]]], dtype=float32)>, hidden_states=None, attentions=None)
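
The warning above lists `pre_classifier` and `classifier` weights as unused, because the base model stops at the hidden states and has no classification head. As a rough illustration of what such a head does, the sketch below assumes the usual DistilBERT arrangement (the first token's hidden vector goes through a dense layer with ReLU, then a final dense layer with one output per class). All numbers and dimensions are made up for illustration; the real vectors have 768 dimensions and the real weights come from the checkpoint.

```python
def linear(x, weight, bias):
    """Dense layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in x]


# Hypothetical 4-dimensional [CLS] vector (the real one has 768 dimensions).
cls_vector = [0.5, -1.0, 0.25, 2.0]

# Made-up weights for illustration only.
pre_w = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
pre_b = [0.0, 0.0, 0.0, 0.0]
cls_w = [[-1.0, 0.0, 0.0, -1.0], [1.0, 0.0, 0.0, 1.0]]
cls_b = [0.0, 0.0]

hidden = relu(linear(cls_vector, pre_w, pre_b))  # dropout is inactive at inference time
toy_logits = linear(hidden, cls_w, cls_b)        # one logit per class
print(toy_logits)
```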


In [8]:
# Importing TFAutoModelForSequenceClassification to load the model with a fine-tuned head for sequence classification
from transformers import TFAutoModelForSequenceClassification

# Providing the checkpoint to denote the model architecture and the pretrained weights
model = TFAutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

# Make inferences using the model
model_results = model(model_inputs)
print('Model results:',model_results)

# Take the logits the model outputs for each class
outputed_logits = model_results.logits
print('Outputed Logits:', outputed_logits)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


Model results: TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-4.3079815,  4.6880302]], dtype=float32)>, hidden_states=None, attentions=None)
Outputed Logits: tf.Tensor([[-4.3079815  4.6880302]], shape=(1, 2), dtype=float32)


## Replicating the post-processing task

In [9]:
# Importing NumPy to perform some post-processing steps
import numpy as np

# Convert the logits into probabilities
prediction = tf.nn.softmax(outputed_logits)
print(prediction)
# Show the index-to-label mapping to see which label each probability belongs to
print(model.config.id2label)

# Finding the max probability
score = max(prediction.numpy()[0])
print(score)
# Finding the position of the max probability
label_position = np.argmax(prediction.numpy()[0])
print(label_position)
# Find the label corresponding to the max probability
label = model.config.id2label[label_position]
print(label)
# Reconstructing the pipeline's output format
output = [{'label': label, 'score': score}]
print(output)

tf.Tensor([[1.2388763e-04 9.9987614e-01]], shape=(1, 2), dtype=float32)
{0: 'NEGATIVE', 1: 'POSITIVE'}
0.99987614
1
POSITIVE
[{'label': 'POSITIVE', 'score': 0.99987614}]
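
To see what `tf.nn.softmax` and `np.argmax` compute here, the same post-processing can be worked through with the standard library alone. This is a minimal sketch that reuses the logits and the label mapping printed above as plain Python values; small differences in the last digits are expected because Python uses 64-bit floats.

```python
import math

logits = [-4.3079815, 4.6880302]            # values printed by the model above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}   # copied from model.config.id2label

# Softmax: exponentiate (shifted by the max for numerical stability) and normalize.
shift = max(logits)
exps = [math.exp(v - shift) for v in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Index of the largest probability, then the matching label.
best = max(range(len(probs)), key=probs.__getitem__)
output = [{"label": id2label[best], "score": probs[best]}]
print(output)
```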


In [10]:
# Suppose these input ids were the outputs of a text-generation model
hypothetical_outputs = final_inputs['input_ids']

# Decode the token ids back into text, as you would for generated output
print(tokenizer.decode(hypothetical_outputs))

[CLS] this movie is awesome [SEP]
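
Decoding is essentially the reverse lookup of encoding. The toy sketch below assumes a reverse vocabulary built only from the ids seen in this notebook's output; `TOY_ID_TO_TOKEN` and `toy_decode` are hypothetical names, and the real tokenizer additionally merges subword pieces and cleans up spacing.

```python
# Reverse lookup built only from the ids seen in this notebook's output.
TOY_ID_TO_TOKEN = {
    101: "[CLS]", 2023: "this", 3185: "movie",
    2003: "is", 12476: "awesome", 102: "[SEP]",
}
TOY_SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def toy_decode(ids, skip_special_tokens=False):
    # Map each id back to its token, falling back to [UNK] for ids outside the toy vocabulary.
    tokens = [TOY_ID_TO_TOKEN.get(i, "[UNK]") for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in TOY_SPECIAL_TOKENS]
    return " ".join(tokens)


print(toy_decode([101, 2023, 3185, 2003, 12476, 102]))
print(toy_decode([101, 2023, 3185, 2003, 12476, 102], skip_special_tokens=True))
```

The real `tokenizer.decode` also accepts a `skip_special_tokens` argument, which drops markers such as `[CLS]` and `[SEP]` from the decoded text.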
