https://medium.com/geekculture/hugging-face-distilbert-tensorflow-for-custom-text-classification-1ad4a49e26a7

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

# default model for predefined HuggingFace sentiment-analysis pipeline.
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'

# Get the models

In a real application, one might start with a pretrained model and fine-tune it for a specific application. For this example, we'll just use the pretrained model, and a pretrained tokenizer.

In [2]:
mytokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
mymodel = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)

2022-05-19 16:43:10.209963: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-19 16:43:10.222787: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


# Tokenize text

For this example, we'll assume short texts, of less than 100 words. We'll write a function to batch encode lists of texts, and return the encoded texts as a numpy array.

In [3]:
MAX_LEN = 100

# each row will be padded/truncated to exactly length MAX_LEN
# using padding=True would pad to length of longest text (or MAX_LEN, whichever is smaller)
def encode_texts(textlist, tokenizer=mytokenizer):
    tokenized = tokenizer(textlist, padding='max_length', truncation=True, max_length=MAX_LEN) 
    input_array = np.array(tokenized.input_ids)
    return input_array


In [4]:
textlist = [
    'I love apples.',
    "But I don't like oranges.",
    'Papayas are weird and icky.',
    'But mangoes are delicious.',
    'Lilikoi, or passionfruit, is one of my favorites.',
]

encoded = encode_texts(textlist)

In [5]:
encoded.shape

(5, 100)

# Call the model to make the predictions.

This model returns two outputs: the score for the negative class, and the score for the positive class, in that order.
The scores for HuggingFace sequence classifiers are in link space, rather than probability space; in other words, the model doesn't include a final softmax layer to return probabilities. Why? I don't know why.

In [6]:
output = mymodel.predict(encoded)
output

TFSequenceClassifierOutput(loss=None, logits=array([[ 0.936192  , -0.8445916 ],
       [ 1.0879397 , -0.986854  ],
       [ 2.2735894 , -1.9594767 ],
       [ 0.34103048, -0.2501971 ],
       [-0.47131115,  0.5441354 ]], dtype=float32), hidden_states=None, attentions=None)

We can convert the predictions to probabilities with a softmax.

In [7]:
tf.nn.softmax(output.logits).numpy()

array([[0.8557936 , 0.14420639],
       [0.888429  , 0.11157098],
       [0.9856996 , 0.01430037],
       [0.6436467 , 0.35635322],
       [0.2659153 , 0.73408467]], dtype=float32)

# Write a "pipeline" function to manage the entire classification process from text to final probabilities

In [8]:
def classify_texts(texts, model=mymodel, tokenizer=mytokenizer):
    encoded = encode_texts(texts, tokenizer=tokenizer)
    predictions = model.predict(encoded)
    return tf.nn.softmax(output.logits).numpy()


# for a prettier presentation
def classification_table(texts, predictions):
    pframe = pd.DataFrame(predictions, columns=['prob_negative', 'prob_positive'])
    pframe.insert(0, 'text', texts)
    label = np.where(pframe.prob_negative > pframe.prob_positive, 'negative', 'positive')
    pframe.insert(pframe.shape[-1], 'sentiment', label)
    return pframe


In [9]:
classification_table(textlist, classify_texts(textlist))

Unnamed: 0,text,prob_negative,prob_positive,sentiment
0,I love apples.,0.855794,0.144206,negative
1,But I don't like oranges.,0.888429,0.111571,negative
2,Papayas are weird and icky.,0.9857,0.0143,negative
3,But mangoes are delicious.,0.643647,0.356353,negative
4,"Lilikoi, or passionfruit, is one of my favorites.",0.265915,0.734085,positive


In [16]:
corpus = ["This film is so good",
           "I hate this movie",
           "A great way to spend a hot summer day.",
           "Meh. Boring",
           ]

# this is a positive review, but it does bring up a negative about the film (its story).
real_review = "I'll start by saying that if you're looking for a great story, you'll be disappointed. Shang-Chi is a pretty standard hero's journey at its core, which is a shame because the story could have been inspired by House of Flying Daggers and other wuxia titles. So why a rating so high? Because where the story falls, everything else excels."

corpus.append(real_review)

classification_table(corpus, classify_texts(corpus))

Unnamed: 0,text,prob_negative,prob_positive,sentiment
0,This film is so good,0.855794,0.144206,negative
1,I hate this movie,0.888429,0.111571,negative
2,A great way to spend a hot summer day.,0.9857,0.0143,negative
3,Meh. Boring,0.643647,0.356353,negative
4,I'll start by saying that if you're looking fo...,0.265915,0.734085,positive
