In [1]:
from transformers import pipeline

In [2]:
# pipeline() downloads a pre-trained model and wraps it with its tokenizer for inference
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [3]:
classifier('I am neither happy nor sad at the moment!!!')

[{'label': 'NEGATIVE', 'score': 0.9991381168365479}]

In [12]:
results = classifier(['I am an RCB fan', 'We hope you don"t love it'])

for res in results:
    print(f"label: {res['label']}, score: {round(res['score'],3)}")

label: POSITIVE, score: 0.997
label: NEGATIVE, score: 0.598


The second sentence has been classified as negative, but with a score of only 0.598 the model is far from confident, so the prediction is close to neutral.
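Because this model only has the labels POSITIVE and NEGATIVE, a low score is the only hint that a sentence may be neutral. The following is a minimal sketch of how such predictions could be flagged. The helper `flag_uncertain`, the `UNCERTAIN` label and the threshold of 0.75 are assumptions made for illustration and are not part of the library; the input is the rounded output printed above.

```python
def flag_uncertain(results, threshold=0.75):
    """Relabel predictions whose score is below `threshold` as 'UNCERTAIN'.

    `threshold` is an arbitrary, hypothetical cut-off chosen for this sketch.
    """
    flagged = []
    for res in results:
        label = res["label"] if res["score"] >= threshold else "UNCERTAIN"
        flagged.append({"label": label, "score": res["score"]})
    return flagged


# Rounded results copied from the cell output above
results = [
    {"label": "POSITIVE", "score": 0.997},
    {"label": "NEGATIVE", "score": 0.598},
]

for res in flag_uncertain(results):
    print(f"label: {res['label']}, score: {res['score']}")
```

A suitable threshold depends on the application and would normally be tuned on labelled examples.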

By default, the model downloaded for this pipeline is "distilbert-base-uncased-finetuned-sst-2-english". It uses the DistilBERT architecture and has been fine-tuned on the SST-2 dataset for the sentiment analysis task.

Suppose we want to use another model, for instance one that has been trained on French data. We can search the model hub, which gathers models pretrained on large datasets by research labs as well as community models (usually versions of those large models fine-tuned on a specific dataset). Filtering by the tags "French" and "text-classification" suggests "nlptown/bert-base-multilingual-uncased-sentiment".

In [14]:
new_classifier = pipeline('sentiment-analysis', 
                          model='nlptown/bert-base-multilingual-uncased-sentiment')


In [15]:
new_classifier('Bonjour')

[{'label': '3 stars', 'score': 0.2779175341129303}]

In [16]:
new_classifier("Nous espérons que vous ne l'aimez pas")

[{'label': '1 star', 'score': 0.4849565625190735}]
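Unlike the default model, this model returns star ratings as text labels. A small sketch of turning those labels into integers for further processing is shown here. The helper `stars_from_label` is hypothetical, and it assumes every label has the form "N star" or "N stars" as in the outputs above.

```python
def stars_from_label(label):
    """Convert a label such as '3 stars' or '1 star' to the integer 3 or 1."""
    return int(label.split()[0])


# Predictions copied from the two cell outputs above
predictions = [
    {"label": "3 stars", "score": 0.2779175341129303},
    {"label": "1 star", "score": 0.4849565625190735},
]

ratings = [stars_from_label(p["label"]) for p in predictions]
print(ratings)
```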

This classifier can now handle texts in English and French, as well as Dutch, German, Italian and Spanish. Instead of a model name, we can also pass the path to a local folder where a pretrained model has been saved, or pass a model object together with its associated tokenizer.

We need two classes for this. The first is "AutoTokenizer", which we use to download the tokenizer associated with the model we picked and instantiate it. The second is "AutoModelForSequenceClassification" (or "TFAutoModelForSequenceClassification" if we are using TensorFlow), which we use to download the model itself. Note that if we were using the library for another task, the model class would change.

In [17]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

In [18]:
model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'
# This model exists only in PyTorch, so we use the 'from_pt' flag to import that model in TensorFlow
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [19]:
classifier("I am a good boy")

[{'label': '4 stars', 'score': 0.4229269027709961}]
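The label and score come from the raw model outputs: the model produces one logit per class, a softmax turns the logits into probabilities, and the most probable class is reported. The sketch below reproduces that last step in plain Python. The logits are made-up values, and the `id2label` mapping is an assumption about how this model's five classes are named, based on the labels seen in the outputs above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_prediction(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Assumed class-index-to-label mapping (hypothetical, for illustration)
id2label = {0: "1 star", 1: "2 stars", 2: "3 stars", 3: "4 stars", 4: "5 stars"}

# Made-up logits for a single sentence, not real model output
example_logits = [-1.2, -0.4, 0.6, 1.1, 0.3]

prediction = logits_to_prediction(example_logits, id2label)
print(prediction)
```

With five classes the probability mass is spread more thinly than with two, which is one reason the scores from this model tend to be lower than those of the binary classifier.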

In [22]:
inputs = tokenizer("I am a good boy, not a bad boy")

In [23]:
inputs
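# input_ids: each token (a word, subword or punctuation mark) is mapped to an integer id from the vocabulary;
# the tokenizer also adds special tokens at the start and end (ids 101 and 102 here)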

{'input_ids': [101, 151, 10345, 143, 12050, 14140, 117, 10497, 143, 12428, 14140, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
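To make the ids easier to read, the sketch below lines them up with the words of the sentence. It assumes that 101 and 102 are the special start and end tokens and that every word and the comma map to exactly one token in this sentence; both assumptions are consistent with the output above (the repeated words "a" and "boy" receive the same id each time) but are not verified against the vocabulary here.

```python
from collections import defaultdict

# input_ids copied from the cell output above
input_ids = [101, 151, 10345, 143, 12050, 14140, 117, 10497, 143, 12428, 14140, 102]

# Assumed ids of the special start/end tokens
CLS_ID, SEP_ID = 101, 102

# Drop the special tokens to keep only the ids that belong to the sentence
content_ids = [i for i in input_ids if i not in (CLS_ID, SEP_ID)]

# Assumed one-to-one alignment between lower-cased words and tokens
words = ["i", "am", "a", "good", "boy", ",", "not", "a", "bad", "boy"]

word_to_ids = defaultdict(set)
for word, token_id in zip(words, content_ids):
    word_to_ids[word].add(token_id)

for word, ids in word_to_ids.items():
    print(word, sorted(ids))
```

In practice `tokenizer.convert_ids_to_tokens(inputs['input_ids'])` gives the exact tokens without any manual alignment.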

In [24]:
# padding=True pads shorter sentences so every sequence in the batch has the same length
# truncation=True cuts sequences that are longer than max_length
# max_length=512 is the maximum sequence length, in tokens, that we allow here
tf_batch = tokenizer(
    ['We are very happy to show this library',
     'We hope you do not hate it'],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors='tf')

In [25]:
for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 11312, 10320, 12495, 19308, 10114, 11391, 10372, 13299, 102], [101, 11312, 18763, 10855, 10154, 10497, 39487, 10197, 102, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]]
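The output shows the effect of padding: the second sentence is one token shorter, so it receives a trailing 0 in input_ids and a matching 0 in attention_mask, which tells the model to ignore that position. The following plain-Python sketch mimics this behaviour. The function `pad_batch` is hypothetical, it assumes a padding id of 0 as seen in the output, and its truncation is simplified (a real tokenizer keeps the closing special token when it truncates).

```python
def pad_batch(sequences, pad_id=0, max_length=512):
    """Truncate each sequence to max_length, then right-pad to the longest one.

    Simplified sketch: truncation here just cuts the tail of the list.
    """
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token ids taken from the output above
sequences = [
    [101, 11312, 10320, 12495, 19308, 10114, 11391, 10372, 13299, 102],
    [101, 11312, 18763, 10855, 10154, 10497, 39487, 10197, 102],
]

batch = pad_batch(sequences)
for key, value in batch.items():
    print(f"{key}: {value}")
```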
