## Libraries required

A few libraries must be installed before using Hugging Face in a notebook:

- transformers
- PyTorch or TensorFlow (the AutoClass cells below use TensorFlow)
- datasets

In [None]:
pip install datasets 

In [5]:
pip install transformers

Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting regex!=2019.12.17
  Downloading regex-2022.6.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (764 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: tokenizers, regex, huggingface-hub, transformers
Successfully installed huggingface-hub-0.8.1 regex-2022.6.2 tokenizers-0.12.1 transformers-4.20.1
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip
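
To confirm that the installs above succeeded, the installed versions can be queried from Python. The following is a minimal sketch using only the standard library. The package list is an assumption: it uses the PyPI distribution names (torch for PyTorch), and a TensorFlow variant installed under a different distribution name would show up as not installed.

```python
from importlib import metadata

# Distribution names as published on PyPI (assumed; "torch" is PyTorch).
PACKAGES = ["transformers", "datasets", "tensorflow", "torch"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```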

## Load pipeline from transformers

The pipeline function is the main entry point to Hugging Face Transformers. With it you can combine different preprocessing and classification techniques in a single call.

This is the most basic example: we load the default sentiment classifier and pass it a sentence.

In [8]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

print(classifier("We are very happy to show you the Transformers library."))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9997994303703308}]
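
The classifier returns a list with one dictionary per input sentence, each holding a label and a score. As a small sketch of how that result might be consumed, the helper below keeps a label only when its score reaches a cut-off. The function name and the 0.9 threshold are hypothetical and chosen purely for illustration; the result list is copied from the output above.

```python
# Output copied from the cell above: one dict per input sentence.
results = [{"label": "POSITIVE", "score": 0.9997994303703308}]


def confident_labels(results, threshold=0.9):
    """Keep a label when its score reaches the threshold, otherwise None.

    `threshold` is a hypothetical cut-off, not a library default.
    """
    return [r["label"] if r["score"] >= threshold else None for r in results]


print(confident_labels(results))  # ['POSITIVE']
```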


## Combine pre-trained models

In [4]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, pipeline

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [5]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier(['gran película', "un asco de pelicula, totalmente aburridora"])

[{'label': '5 stars', 'score': 0.6853786110877991},
 {'label': '1 star', 'score': 0.8153558969497681}]
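
This model reports its prediction as a star-rating string. If a numeric rating is more convenient downstream, the label can be parsed, as in the sketch below. The helper name is hypothetical, and it assumes every label has the form seen above (a number followed by "star" or "stars").

```python
# Output copied from the cell above.
results = [
    {"label": "5 stars", "score": 0.6853786110877991},
    {"label": "1 star", "score": 0.8153558969497681},
]


def stars_to_int(label):
    """Parse labels such as '1 star' or '5 stars' into an integer rating."""
    return int(label.split()[0])


ratings = [stars_to_int(r["label"]) for r in results]
print(ratings)  # [5, 1]
```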

## Using AutoClass

An AutoClass retrieves the architecture of a particular tokenizer or model from its name or path. There is an AutoClass for tokenizers and one for models:
- AutoTokenizer
- TFAutoModelForSequenceClassification (this one is for text classification only; there are plenty of other options)
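
The idea of picking a concrete class from a name can be illustrated with a toy registry. This is a simplified sketch of the dispatch pattern, not the library's implementation: every class, checkpoint name, and dictionary below is made up for illustration, and the real library works from the checkpoint's own configuration rather than a hard-coded table.

```python
# Toy stand-ins for architecture-specific classes (hypothetical).
class ToyBertTokenizer:
    def __init__(self, name):
        self.name = name


class ToyGPT2Tokenizer:
    def __init__(self, name):
        self.name = name


# Hypothetical table from checkpoint name to its configuration.
TOY_CONFIGS = {
    "toy/bert-sentiment": {"model_type": "bert"},
    "toy/gpt2-small": {"model_type": "gpt2"},
}

# Registry: architecture type -> concrete class.
TOKENIZER_REGISTRY = {"bert": ToyBertTokenizer, "gpt2": ToyGPT2Tokenizer}


class ToyAutoTokenizer:
    @classmethod
    def from_pretrained(cls, name):
        # Look up the architecture type, then build the matching class.
        model_type = TOY_CONFIGS[name]["model_type"]
        return TOKENIZER_REGISTRY[model_type](name)


tok = ToyAutoTokenizer.from_pretrained("toy/bert-sentiment")
print(type(tok).__name__)  # ToyBertTokenizer
```

The caller only supplies a name; which concrete class comes back is decided by the lookup, which is why the same call works across architectures.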



In [1]:
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoding = tokenizer("Are we happy?")
# print(encoding)

# Tokenize a mini-batch: pad to the longest sentence, truncate at 100 tokens,
# and return TensorFlow tensors
tf_batch = tokenizer(
    ['gran película', "un asco de pelicula, totalmente aburridora"],
    padding=True,
    truncation=True,
    max_length=100,
    return_tensors="tf",
)

print(tf_batch)

{'input_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[  101, 11121, 15141,   102,     0,     0,     0,     0,     0,
            0,     0,     0],
       [  101, 10119, 10146, 10805, 10102, 15141,   117, 28743, 16079,
        34923, 21829,   102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
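
In the output above, the shorter sentence is filled with zeros up to the length of the longer one, and attention_mask marks real tokens with 1 and padding with 0. The sketch below reproduces that behaviour in plain Python on the token IDs shown above. The helper is hypothetical and simplified: it truncates by slicing, whereas the real tokenizer also takes special tokens into account when truncating.

```python
PAD_ID = 0  # padding ID seen in the output above


def pad_batch(sequences, pad_id=PAD_ID, max_length=None):
    """Truncate to max_length (if given), then right-pad to the longest row."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (width - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Token IDs copied from the output above, with the padding removed.
batch = [
    [101, 11121, 15141, 102],
    [101, 10119, 10146, 10805, 10102, 15141, 117, 28743, 16079, 34923, 21829, 102],
]

input_ids, attention_mask = pad_batch(batch, max_length=100)
print(input_ids[0])       # [101, 11121, 15141, 102, 0, 0, 0, 0, 0, 0, 0, 0]
print(attention_mask[0])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```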


In [2]:
from transformers import TFAutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

tf_outputs = tf_model(tf_batch)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [3]:
import tensorflow as tf

tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
tf_predictions

<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[0.01016899, 0.01044207, 0.04682862, 0.2471818 , 0.6853785 ],
       [0.81535566, 0.16306774, 0.01867863, 0.00183764, 0.00106033]],
      dtype=float32)>
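
Each row above is a probability distribution over the five star ratings, and the largest entry matches the score the pipeline reported earlier. The sketch below shows the two steps in plain Python: a softmax over logits, and an argmax that maps a row back to a label. The label order is an assumption inferred from the earlier pipeline output, and the example logits are hypothetical; in the notebook itself, tf_model.config.id2label holds the model's own mapping.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Probabilities copied from the output above (one row per input sentence).
probabilities = [
    [0.01016899, 0.01044207, 0.04682862, 0.2471818, 0.6853785],
    [0.81535566, 0.16306774, 0.01867863, 0.00183764, 0.00106033],
]

# Assumed label order, inferred from the earlier pipeline output.
LABELS = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]

# Argmax per row, then look up the label.
predicted = [
    LABELS[max(range(len(row)), key=row.__getitem__)] for row in probabilities
]
print(predicted)  # ['5 stars', '1 star']

# Hypothetical logits, only to show that softmax output sums to 1.
example = softmax([1.0, 2.0, 3.0])
print(sum(example))
```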

## Save and load your model

In [4]:
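tf_save_directory = "./tf_save_pretrained"
# Save the tokenizer and the model to the same directory
tokenizer.save_pretrained(tf_save_directory)
tf_model.save_pretrained(tf_save_directory)

# Load the saved model and tokenizer back from that directory
tf_model = TFAutoModelForSequenceClassification.from_pretrained(tf_save_directory)
tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)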

Some layers from the model checkpoint at ./tf_save_pretrained were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at ./tf_save_pretrained.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.
