<a href="https://colab.research.google.com/github/arksolutionzz/ark/blob/master/BERTWorkshopNotebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Transformers Library

In [None]:
! pip install transformers



Download and initialise the pretrained tokenizer and model

In [None]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.



Tokenise the input

In [None]:
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

In [None]:
print(inputs)

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
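The ids 101 and 102 at the two ends of the sequence are the `[CLS]` and `[SEP]` special tokens of the BERT uncased vocabulary, and 100 is `[UNK]`, which is what the 🤗 emoji maps to because it is not in the vocabulary. A minimal sketch that labels those ids in the output above follows. The `SPECIAL_TOKENS` table and the `label_special_tokens` helper are illustrative names, not part of the library, and the ids are hard-coded from the printed output rather than read from the tokenizer.

```python
# Special-token ids of the BERT uncased vocabulary used by this checkpoint.
SPECIAL_TOKENS = {0: "[PAD]", 100: "[UNK]", 101: "[CLS]", 102: "[SEP]"}

# input_ids copied from the printed output above.
input_ids = [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102]


def label_special_tokens(ids):
    """Return (position, id, name) for every special token found in ids."""
    return [
        (pos, tok, SPECIAL_TOKENS[tok])
        for pos, tok in enumerate(ids)
        if tok in SPECIAL_TOKENS
    ]


for pos, tok, name in label_special_tokens(input_ids):
    print(f"position {pos}: id {tok} is {name}")
```

With the tokenizer loaded, `tokenizer.convert_ids_to_tokens(inputs["input_ids"])` returns the token string for every id, not only the special ones.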


Tokenise a batch, padding to a uniform length

In [None]:
tf_batch = tokenizer(
    ["I am very excited to see you", "I didn't like the food at the restaurant. It was salty"],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)

In [None]:
for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 1045, 2572, 2200, 7568, 2000, 2156, 2017, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2134, 1005, 1056, 2066, 1996, 2833, 2012, 1996, 4825, 1012, 2009, 2001, 23592, 102]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
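The shorter sentence is right-padded with id 0 (`[PAD]`) up to the length of the longest sentence in the batch, and the attention mask marks real tokens with 1 and padding with 0. A rough pure-Python sketch of that idea is given here, assuming right-padding with pad id 0 as seen in the output. `pad_batch` is a hypothetical helper written for illustration and is not how the tokenizer is implemented internally.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return input_ids, attention_mask


# Unpadded token ids taken from the printed output above.
short_ids = [101, 1045, 2572, 2200, 7568, 2000, 2156, 2017, 102]
long_ids = [101, 1045, 2134, 1005, 1056, 2066, 1996, 2833, 2012, 1996, 4825, 1012, 2009, 2001, 23592, 102]

ids, mask = pad_batch([short_ids, long_ids])
print(ids[0])
print(mask[0])
```

The model uses the mask to ignore the padded positions, so padding should not change the prediction for the shorter sentence.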


Use the model to process the input

In [None]:
tf_outputs = tf_model(tf_batch)

In [None]:
print(tf_outputs)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.260652 ,  4.5962405],
       [ 2.714738 , -2.3840768]], dtype=float32)>, hidden_states=None, attentions=None)


Apply softmax to the logits to obtain class probabilities

In [None]:
import tensorflow as tf

# Softmax over the last axis turns each row of logits into probabilities that sum to 1.
tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)

In [None]:
print(tf_predictions)

tf.Tensor(
[[1.4237658e-04 9.9985754e-01]
 [9.9393302e-01 6.0669431e-03]], shape=(2, 2), dtype=float32)
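To see where these numbers come from, the softmax can be reproduced by hand: each probability is exp(logit) divided by the sum of exp over that row. The sketch below does this in plain Python on the logits printed earlier. The `softmax` function here is an illustrative reimplementation, not the TensorFlow one, and subtracting the row maximum is a common numerical-stability step that does not change the result.

```python
import math


def softmax(logits):
    """Softmax of one row of logits, shifted by the row maximum for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output printed above.
logits = [[-4.260652, 4.5962405], [2.714738, -2.3840768]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([round(p, 6) for p in row])
```

Each row sums to 1, so the two values can be read as the model's probabilities for class 0 and class 1.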


In [None]:
# Find the class with the highest probability for each sample
predicted_classes = tf.argmax(tf_predictions, axis=-1)
predicted_probabilities = tf.reduce_max(tf_predictions, axis=-1)

# Print predicted classes and their probabilities
for i, (pred_class, pred_prob) in enumerate(zip(predicted_classes.numpy(), predicted_probabilities.numpy())):
    sentiment = 'Positive' if pred_class == 1 else 'Negative'
    print(f"Sample {i}: Predicted Class = {sentiment} ({pred_class}), Probability = {pred_prob:.6f}")

Sample 0: Predicted Class = Positive (1), Probability = 0.999858
Sample 1: Predicted Class = Negative (0), Probability = 0.993933


## Simplified Pipelines in the Transformers Library

In [None]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
classifier('The pizza is not that great but the crust is awesome.')

[{'label': 'POSITIVE', 'score': 0.9998461008071899}]

In [None]:
results = classifier([
    "We are very happy to show you the 🤗 Transformers library.",
    "We hope you don't hate it.",
])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
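The second score, 0.5309, is barely above 0.5, so the NEGATIVE label for that sentence is close to a coin flip between the two classes. One way to surface this is to flag predictions whose score falls below a cutoff. The sketch below assumes a cutoff of 0.75, which is an arbitrary illustrative value, and `flag_uncertain` is a hypothetical helper rather than a pipeline feature. The results are hard-coded from the output above.

```python
# Pipeline results copied (rounded) from the output above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]


def flag_uncertain(predictions, threshold=0.75):
    """Return copies of the predictions with an 'uncertain' flag for scores below threshold."""
    return [dict(p, uncertain=p["score"] < threshold) for p in predictions]


for item in flag_uncertain(results):
    note = " (low confidence)" if item["uncertain"] else ""
    print(f"label: {item['label']}, with score: {item['score']}{note}")
```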


## Multilingual Models

Model Link: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

In [None]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")


This classifier handles text in English, French, Dutch, German, Italian and Spanish. It predicts a rating from 1 to 5 stars, where 1 star is the most negative sentiment and 5 stars the most positive.

In [None]:
# Uncomment any of the examples below to try it.
# Negative:
# classifier("the food was bad at the restaurant")
# classifier("La nourriture n'est pas bonne à la cantine")  # The food is not good in the canteen.
# Positive:
# classifier("I like the music and the artists")
# Neutral:
classifier("Esperamos que no lo odie.")  # We hope you don't hate it.

[{'label': '3 stars', 'score': 0.33688217401504517}]
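The label comes back as a string such as `'3 stars'`, so it needs parsing before it can be compared with the negative, neutral and positive groupings used in the comments above. A minimal sketch follows. The `stars_to_sentiment` helper and its cutoffs (1 or 2 stars negative, 3 neutral, 4 or 5 positive) are assumptions made for illustration and are not defined by the model.

```python
def stars_to_sentiment(label):
    """Parse a label like '3 stars' into (stars, coarse sentiment name)."""
    stars = int(label.split()[0])  # the leading number in '1 star' or '4 stars'
    if stars <= 2:
        sentiment = "negative"
    elif stars == 3:
        sentiment = "neutral"
    else:
        sentiment = "positive"
    return stars, sentiment


# Prediction copied from the output above.
prediction = {"label": "3 stars", "score": 0.33688217401504517}
stars, sentiment = stars_to_sentiment(prediction["label"])
print(f"{stars} stars -> {sentiment} (score {prediction['score']:.4f})")
```

A score of about 0.34 for the top class means roughly two thirds of the probability is spread over the other four ratings, so the neutral reading here is a weak one.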