# Use of Hugging Face Hub models

Created by Andrés Segura-Tinoco  
Created on Jan 28, 2022

## <span>1. Argument Classifier</span>

In [1]:
# Uncomment to install the dependencies (TensorFlow is only needed for section 2)
# !pip install transformers tensorflow

In [2]:
from transformers import pipeline

In [3]:
task = "text-classification"
arg_text = "It has been determined that the amount of greenhouse gases have decreased by almost half because of the prevalence in the utilization of nuclear power."
non_arg_text = "I think coding in Google Colab is a lot of fun."

### 1.1. Model: chkla/roberta-argument
https://huggingface.co/chkla/roberta-argument

In [4]:
model_name = "chkla/roberta-argument"
classifier_1 = pipeline(task, model=model_name)

In [5]:
classifier_1(arg_text)

[{'label': 'ARGUMENT', 'score': 0.974433183670044}]

In [6]:
classifier_1(non_arg_text)

[{'label': 'NON-ARGUMENT', 'score': 0.9412918090820312}]
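
Each pipeline call returns a list with one dictionary per input text, holding the predicted `label` and its `score`. The following is a minimal sketch of how that structure could be consumed downstream. The helper name `is_argument` and the threshold of 0.9 are illustrative assumptions, not part of the Transformers library; the two results are copied from the outputs above.

```python
def is_argument(results, threshold=0.9):
    """Return True when the prediction is ARGUMENT with enough confidence.

    `results` follows the pipeline output format shown above: a list with
    one dict per input text. Helper name and threshold are hypothetical.
    """
    top = results[0]
    return top["label"] == "ARGUMENT" and top["score"] >= threshold


# Results copied from the two classifier calls above
arg_result = [{"label": "ARGUMENT", "score": 0.974433183670044}]
non_arg_result = [{"label": "NON-ARGUMENT", "score": 0.9412918090820312}]

print(is_argument(arg_result))
print(is_argument(non_arg_result))
```

Raising the threshold trades recall for precision, so a suitable value would depend on the application.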

### 1.2. Model: addy88/argument-classifier
https://huggingface.co/addy88/argument-classifier

In [7]:
model_name = "addy88/argument-classifier"
classifier_2 = pipeline(task, model=model_name)

In [8]:
classifier_2(arg_text)

[{'label': 'ARGUMENT', 'score': 0.974433183670044}]

In [9]:
classifier_2(non_arg_text)

[{'label': 'NON-ARGUMENT', 'score': 0.9412918090820312}]

## <span>2. Classifier from Scratch</span>

This section uses the TensorFlow model classes of the Transformers library.

In [10]:
import tensorflow as tf
from transformers import AutoTokenizer
from transformers import TFAutoModel
from transformers import TFAutoModelForSequenceClassification

### 2.1. Preprocessing with a tokenizer

In [11]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [12]:
raw_inputs = [
    arg_text,
    non_arg_text,
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)

{'input_ids': <tf.Tensor: shape=(2, 28), dtype=int32, numpy=
array([[  101,  2009,  2038,  2042,  4340,  2008,  1996,  3815,  1997,
        16635, 15865,  2031, 10548,  2011,  2471,  2431,  2138,  1997,
         1996, 20272,  1999,  1996, 27891,  1997,  4517,  2373,  1012,
          102],
       [  101,  1045,  2228, 16861,  1999,  8224, 15270,  2497,  2003,
         1037,  2843,  1997,  4569,  1012,   102,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 28), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]], dtype=int32)>}
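
In the output above, the shorter sentence is padded with zeros up to the length of the longer one, and the attention mask marks real tokens with 1 and padding with 0. The following plain-Python sketch reproduces that relationship. It assumes right-side padding with pad id `0`, and `pad_and_mask` is a hypothetical helper rather than a library function.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token ids copied from the output above, without the padding zeros
arg_ids = [101, 2009, 2038, 2042, 4340, 2008, 1996, 3815, 1997,
           16635, 15865, 2031, 10548, 2011, 2471, 2431, 2138, 1997,
           1996, 20272, 1999, 1996, 27891, 1997, 4517, 2373, 1012, 102]
non_arg_ids = [101, 1045, 2228, 16861, 1999, 8224, 15270, 2497, 2003,
               1037, 2843, 1997, 4569, 1012, 102]

input_ids, attention_mask = pad_and_mask([arg_ids, non_arg_ids])
print(len(input_ids[1]), sum(attention_mask[1]))
```

The mask lets the model ignore the padded positions, so the padding should not influence the prediction for the shorter sentence.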


### 2.2. Going through the model

In [13]:
model = TFAutoModel.from_pretrained(checkpoint)
outputs = model(inputs)
print(outputs.last_hidden_state.shape)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertModel: ['dropout_19', 'classifier', 'pre_classifier']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


(2, 28, 768)

The three dimensions are the batch size (2 sentences), the sequence length (28 tokens after padding), and the hidden size of DistilBERT (768).


### 2.3. Making sense out of numbers

In [14]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)
print(outputs.logits.shape)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_38']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(2, 2)


### 2.4. Postprocessing the output

In [15]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [16]:
predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)

tf.Tensor(
[[9.9009854e-01 9.9014947e-03]
 [9.6102111e-04 9.9903905e-01]], shape=(2, 2), dtype=float32)
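
Each row of the tensor holds the probabilities for one input sentence, in the order given by `id2label`. The following sketch, in plain Python with the values copied from the outputs above, shows one way to turn them into readable labels. Note that this checkpoint predicts sentiment (NEGATIVE or POSITIVE), not argumentativeness.

```python
# Probabilities and label mapping copied from the outputs above
probabilities = [
    [9.9009854e-01, 9.9014947e-03],   # arg_text
    [9.6102111e-04, 9.9903905e-01],   # non_arg_text
]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

labels = []
for row in probabilities:
    # Index of the highest probability in this row (argmax)
    best_id = max(range(len(row)), key=row.__getitem__)
    labels.append((id2label[best_id], row[best_id]))

print(labels)
```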


## <span>3. Tokenization</span>

In [17]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

### 3.1. Encoding

Translating text to numbers is known as encoding. It is a two-step process: tokenization splits the text into tokens, and the tokens are then converted to input IDs.

In [18]:
tokens = tokenizer.tokenize(non_arg_text)
print(tokens)

['i', 'think', 'coding', 'in', 'google', 'cola', '##b', 'is', 'a', 'lot', 'of', 'fun', '.']


In [19]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[1045, 2228, 16861, 1999, 8224, 15270, 2497, 2003, 1037, 2843, 1997, 4569, 1012]
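
Calling the tokenizer directly, as in section 2.1, performs both steps and also adds the special tokens the model expects. The sketch below checks this using only numbers that appear in this notebook. It assumes the uncased BERT vocabulary ids `101` for `[CLS]` and `102` for `[SEP]`.

```python
CLS_ID, SEP_ID = 101, 102  # assumed ids of [CLS] and [SEP]

# Output of convert_tokens_to_ids above
ids = [1045, 2228, 16861, 1999, 8224, 15270, 2497, 2003,
       1037, 2843, 1997, 4569, 1012]

# Second row of input_ids from section 2.1, with the padding zeros removed
model_input = [101, 1045, 2228, 16861, 1999, 8224, 15270, 2497, 2003,
               1037, 2843, 1997, 4569, 1012, 102]

# Wrapping the plain ids in [CLS] ... [SEP] gives the model input
with_special_tokens = [CLS_ID] + ids + [SEP_ID]
print(with_special_tokens == model_input)
```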


### 3.2. Decoding

In [20]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

i think coding in google colab is a lot of fun.
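
Decoding does more than map IDs back to tokens: word pieces such as `cola` and `##b` are merged into `colab`, and the space before the final period is removed. The following is a simplified illustration of that merging step, not the library's implementation. `merge_wordpieces` is a hypothetical helper, and the punctuation cleanup covers only a few common marks.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into a string (simplified illustration)."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: glue it to the previous word
            words[-1] += token[2:]
        else:
            words.append(token)
    text = " ".join(words)
    # Remove the space that joining leaves before common punctuation
    for mark in (".", ",", "!", "?"):
        text = text.replace(" " + mark, mark)
    return text


# Tokens copied from the tokenizer output in section 3.1
tokens = ['i', 'think', 'coding', 'in', 'google', 'cola', '##b',
          'is', 'a', 'lot', 'of', 'fun', '.']
merged = merge_wordpieces(tokens)
print(merged)
```

Because the tokenizer is uncased, the original capitalization ("Google Colab") cannot be recovered by decoding.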


<hr>

**Note**: code taken from the Hugging Face [course](https://huggingface.co/course/).

You can contact me on <a href="https://twitter.com/SeguraAndres7" target="_blank">Twitter</a> | <a href="https://github.com/ansegura7/" target="_blank">GitHub</a> | <a href="https://www.linkedin.com/in/andres-segura-tinoco/" target="_blank">LinkedIn</a>