### 1 - Pipeline

In [None]:
%pip install transformers -q

[Transformers library](https://github.com/huggingface/transformers)

In [2]:
from transformers import pipeline

In [3]:
# classification task
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [8]:
res = classifier("El fin de semana seguro que llueve")
print(res)

[{'label': 'Neutral', 'score': 0.543632984161377}]


In [11]:
res = classifier("Me temo que el fin de semana no lloverá")
print(res)

[{'label': 'Negative', 'score': 0.31309881806373596}]


In [10]:
res = classifier("Vaya lata! Otro fin de semana que llueve")
print(res)

[{'label': 'Very Negative', 'score': 0.5361645221710205}]


#### Model selection

Let's select a model that can handle Spanish.

In [7]:
# select a multilingual model instead of the English-only default
classifier = pipeline("sentiment-analysis", model="tabularisai/multilingual-sentiment-analysis")

Device set to use cpu


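Now we can try this classifier on the examples above. The labels in those outputs (`Neutral`, `Negative`, `Very Negative`) already come from this multilingual model, since those cells were re-run after loading it; the default English model only predicts `POSITIVE` or `NEGATIVE`.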
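To re-run all three example sentences in a single call, a small helper can be used. The sketch below assumes only that `classifier` behaves like a Transformers text-classification pipeline, that is, it accepts a list of strings and returns one dictionary with `label` and `score` keys per input. `classify_all` is a hypothetical helper name, not part of the library.

```python
examples = [
    "El fin de semana seguro que llueve",
    "Me temo que el fin de semana no lloverá",
    "Vaya lata! Otro fin de semana que llueve",
]


def classify_all(classifier, texts):
    """Run a pipeline-like callable on every text and pair inputs with outputs.

    `classifier` is assumed to accept a list of strings and to return a list
    of dicts with 'label' and 'score' keys, one per input text.
    """
    results = classifier(list(texts))
    return [
        (text, result["label"], round(result["score"], 4))
        for text, result in zip(texts, results)
    ]
```

In the notebook this would be called as `classify_all(classifier, examples)`, and the returned tuples can be printed one per line to compare the three predictions.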

There is a wide variety of pipelines available: [list of pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline)

### 2 - Model and Tokenizer

We ran the model with barely one line of code, but several stages happen under the hood. The following code walks through the most important ones.

In [12]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [13]:
# First, let's reproduce the previous steps with an explicitly named model
# model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model_name = "pysentimiento/robertuito-sentiment-analysis"

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer
)

res = classifier("Vaya lata! Otro fin de semana que llueve")
print(res)

Device set to use cpu


[{'label': 'NEG', 'score': 0.9593914747238159}]


In [15]:
print(model)

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(30002, 768, padding_idx=1)
      (position_embeddings): Embedding(130, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

### 3 - What is the tokenizer for?

Encoding (before the LLM)

In [16]:
secuencia = "Vaya lata! Otro fin de semana que llueve"
res = tokenizer(secuencia)
print(res)

{'input_ids': [0, 2081, 9686, 5, 1198, 831, 413, 1292, 443, 10722, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Step by step

In [None]:
tokens = tokenizer.tokenize(secuencia)
print(tokens)

['▁vaya', '▁lata', '!', '▁otro', '▁fin', '▁de', '▁semana', '▁que', '▁llueve']


In [None]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[2081, 9686, 5, 1198, 831, 413, 1292, 443, 10722]


Decoding

In [None]:
tokenizer.decode(res['input_ids'])

'<s> vaya lata! otro fin de semana que llueve</s>'
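To make the three steps concrete without downloading anything, here is a toy re-implementation that uses only the token/id pairs visible in the outputs above. It is a simplified illustration, not the real tokenizer: `TOY_VOCAB`, `toy_tokenize`, `toy_encode`, and `toy_decode` are hypothetical names, the vocabulary holds just these eleven entries, and splitting is done on whitespace rather than with the trained subword model.

```python
# Token/id pairs copied from the notebook outputs above; ids 0 and 2 are the
# <s> and </s> markers that the tokenizer adds around the sequence.
TOY_VOCAB = {
    "<s>": 0, "</s>": 2, "!": 5,
    "▁vaya": 2081, "▁lata": 9686, "▁otro": 1198, "▁fin": 831,
    "▁de": 413, "▁semana": 1292, "▁que": 443, "▁llueve": 10722,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_tokenize(text):
    """Lowercase, split on spaces, detach a trailing '!', mark word starts with '▁'."""
    tokens = []
    for word in text.lower().split():
        if word.endswith("!") and len(word) > 1:
            tokens += ["▁" + word[:-1], "!"]
        else:
            tokens.append("▁" + word)
    return tokens


def toy_encode(text):
    """Tokens -> ids, wrapped in the <s> ... </s> special tokens."""
    ids = [TOY_VOCAB[token] for token in toy_tokenize(text)]
    return [TOY_VOCAB["<s>"]] + ids + [TOY_VOCAB["</s>"]]


def toy_decode(ids):
    """Ids -> text: join the tokens and turn each '▁' back into a space."""
    return "".join(ID_TO_TOKEN[i] for i in ids).replace("▁", " ")


ids = toy_encode("Vaya lata! Otro fin de semana que llueve")
print(ids)
print(toy_decode(ids))
```

The toy version reproduces the `input_ids` and the decoded string shown above for this one sentence; any word outside the eleven-entry vocabulary raises a `KeyError`, whereas the real tokenizer falls back to smaller subword pieces.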

### 4 - Saving the model and tokenizer locally

In [None]:
model_path = "./modelo"
tokenizer.save_pretrained(model_path)
model.save_pretrained(model_path)

In [None]:
tokenizer_local = AutoTokenizer.from_pretrained(model_path)
model_local = AutoModelForSequenceClassification.from_pretrained(model_path)

### 5 - PyTorch

Transformers is also compatible with TensorFlow.

In [None]:
import torch
import torch.nn.functional as F

In [None]:
sentences = [
    "Ya queda poco para las vacaciones",
    "Me encanta HuggingFace"
]

Tokenizer

In [None]:
batch = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
print(batch)
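With `padding=True`, the shorter sequence in the batch is extended with a padding id until both have the same length, and `attention_mask` marks which positions hold real tokens (1) and which hold padding (0). The pure-Python sketch below illustrates that idea. `pad_batch` is a hypothetical helper, the input ids are made-up placeholders rather than real tokenizer output, padding on the right is an assumption, and the pad id of 1 is taken from the `padding_idx=1` shown in the model summary.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad lists of token ids to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Placeholder ids (not real vocabulary entries): a 6-token and a 4-token sequence.
toy_batch = pad_batch([[0, 11, 12, 13, 14, 2], [0, 21, 22, 2]])
print(toy_batch["input_ids"])       # [[0, 11, 12, 13, 14, 2], [0, 21, 22, 2, 1, 1]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]]
```

The model uses the mask to ignore the padded positions, so padding does not change the prediction for the shorter sentence.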

Model

In [None]:
with torch.no_grad():
    outputs = model(**batch)
    predictions = F.softmax(outputs.logits, dim=1)
    labels = torch.argmax(predictions, dim=1)

In [None]:
# Logits
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-0.6811, -0.0606,  0.9667],
        [-1.8555, -0.3444,  2.4575]]), hidden_states=None, attentions=None)


In [None]:
# Neg, Neu, Pos
print(predictions)

tensor([[0.1241, 0.2309, 0.6450],
        [0.0125, 0.0565, 0.9310]])


In [None]:
# labels
print(labels)

tensor([2, 2])
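The conversion from logits to labels can be checked by hand. The sketch below recomputes the softmax for the two logit rows printed above using only the standard library, then maps each argmax index to a name. The `LABELS` order follows the `# Neg, Neu, Pos` comment above; the exact strings `NEU` and `POS` are assumptions modeled on the `NEG` label seen earlier. In the notebook, `model.config.id2label` holds the mapping the model actually uses.

```python
import math

# Logit rows copied from the `outputs` printed above.
logits = [
    [-0.6811, -0.0606, 0.9667],
    [-1.8555, -0.3444, 2.4575],
]
LABELS = ["NEG", "NEU", "POS"]  # assumed names, in the order Neg, Neu, Pos


def softmax(row):
    """exp(x_i) / sum_j exp(x_j), shifted by max(row) for numerical stability."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]
predicted = [LABELS[row.index(max(row))] for row in probs]

for row, name in zip(probs, predicted):
    print([round(p, 4) for p in row], name)
```

Both rows peak at index 2, which matches the `tensor([2, 2])` result: under the assumed label order, both sentences are classified as positive.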



### 6 - TensorFlow

We need to check whether a TensorFlow version of the model we are using is available:

https://huggingface.co/pysentimiento/robertuito-sentiment-analysis

In [None]:
# Import the required libraries
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

In [None]:
# 1. Load the model and tokenizer (TensorFlow version)
model_name = "pysentimiento/robertuito-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can alrea

In [None]:
# 2. Input data
sentences = [
    "Ya queda poco para las vacaciones",
    "Me encanta HuggingFace"
]

In [None]:
# 3. Tokenization (produces TensorFlow tensors)
batch = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"  # Important: return TensorFlow tensors
)

In [None]:
# 4. Model prediction
outputs = model(**batch)
logits = outputs.logits

In [None]:
# 5. Convert logits to probabilities
probabilities = tf.nn.softmax(logits, axis=-1).numpy()

In [None]:
# 6. Get the predicted labels
predicted_labels = tf.argmax(probabilities, axis=1).numpy()

# Show the results
print("Probabilities:", probabilities)
print("Predicted labels:", predicted_labels)

Probabilities: [[0.02485846 0.22100838 0.75413316]
 [0.01246901 0.05650822 0.93102276]]
Predicted labels: [2 2]


In [None]:
model_path = "./modelo"
tokenizer.save_pretrained(model_path)
model.save_pretrained(model_path)

In [None]:
tokenizer_local = AutoTokenizer.from_pretrained(model_path)
model_local = TFAutoModelForSequenceClassification.from_pretrained(model_path)

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at ./modelo.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.
