# Introduction to the 🤗 Transformers library
This notebook demonstrates the tasks that can be performed with the 🤗 *transformers* library from [Hugging Face](https://huggingface.co).

In [None]:
# install the library
# !pip install transformers[sentencepiece]

## Text tokenization
The inputs to transformer models are text *tokenized* with a subword algorithm over a vocabulary that is specific to each model. Each model must be used with its own associated tokenizer.

In [1]:
from transformers import AutoTokenizer



In [2]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")



We explore the vocabulary of the tokenizer for the `bert-base-cased` model.

In [3]:
tokenizer.vocab_size

28996

In [4]:
import numpy as np

np.random.choice(list(tokenizer.vocab.keys()), 10)

array(['##marks', 'ই', 'Goodbye', '1995', '##xed', '##ctive', 'Annette',
       'Marilyn', 'dominating', 'Pandora'], dtype='<U18')

The BERT model uses *WordPiece* subword tokenization.

In [5]:
output = tokenizer.tokenize("the BERT tokenizer was created with a WordPiece model")
print(output)

['the', 'B', '##ER', '##T', 'token', '##izer', 'was', 'created', 'with', 'a', 'Word', '##P', '##ie', '##ce', 'model']
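The `##` prefix in the output above marks a piece that continues the previous one. WordPiece is commonly described as a greedy longest-match-first split of each word. The sketch below illustrates that idea in plain Python; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the toy vocabulary was picked by hand so that the split matches the output above. It is not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest candidate first and shrink it until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary (the real one has 28996 entries).
toy_vocab = {"B", "##ER", "##T", "token", "##izer",
             "Word", "##P", "##ie", "##ce", "model"}

for word in ["BERT", "tokenizer", "WordPiece", "model"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Rare words such as `BERT` fall apart into several short pieces because only those pieces are in the vocabulary, while frequent words such as `model` stay whole.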


Each token corresponds to a `token_id` in the vocabulary.

In [6]:
import pandas as pd

tokens = map(lambda t: {'token': t,
                        'token_id': tokenizer.convert_tokens_to_ids(t)},
             output)

pd.DataFrame(tokens)

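| | token | token_id |
|---|---|---|
| 0 | the | 1103 |
| 1 | B | 139 |
| 2 | ##ER | 9637 |
| 3 | ##T | 1942 |
| 4 | token | 22559 |
| 5 | ##izer | 17260 |
| 6 | was | 1108 |
| 7 | created | 1687 |
| 8 | with | 1114 |
| 9 | a | 170 |
| 10 | Word | 10683 |
| 11 | ##P | 2101 |
| 12 | ##ie | 1663 |
| 13 | ##ce | 2093 |
| 14 | model | 2235 |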


The BERT tokenizer adds special tokens to the model input.

In [7]:
output = tokenizer("the BERT tokenizer was created with a WordPiece model")
output

{'input_ids': [101, 1103, 139, 9637, 1942, 22559, 17260, 1108, 1687, 1114, 170, 10683, 2101, 1663, 2093, 2235, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

We can also tokenize a pair of sentences as input to the model.

In [8]:
output = tokenizer("the BERT tokenizer", "was created with a WordPiece model")
output

{'input_ids': [101, 1103, 139, 9637, 1942, 22559, 17260, 102, 1108, 1687, 1114, 170, 10683, 2101, 1663, 2093, 2235, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
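The two outputs above suggest a simple layout: id 101 opens the sequence, id 102 closes each sentence, and `token_type_ids` is 0 for the first segment and 1 for the second. The sketch below rebuilds that layout by hand from the token ids shown earlier. `build_pair_inputs` is a hypothetical helper written for illustration, not a library function.

```python
CLS_ID, SEP_ID = 101, 102  # special token ids as they appear in the outputs above


def build_pair_inputs(ids_a, ids_b=None):
    """Assemble the three fields for one sentence or for a sentence pair."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # first segment, including its specials
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)  # second segment and its closing id
    attention_mask = [1] * len(input_ids)  # no padding, so every position is real
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


first = [1103, 139, 9637, 1942, 22559, 17260]                   # "the BERT tokenizer"
second = [1108, 1687, 1114, 170, 10683, 2101, 1663, 2093, 2235]  # "was created with ..."
encoded = build_pair_inputs(first, second)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```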

The English BERT model does not know Spanish vocabulary, so it splits Spanish words into many small pieces.

In [9]:
output = tokenizer.tokenize("El modelo BERT en inglés no conoce el vocabulario español")
print(output)

['El', 'model', '##o', 'B', '##ER', '##T', 'en', 'ing', '##lé', '##s', 'no', 'con', '##oc', '##e', 'el', 'v', '##oc', '##ab', '##ular', '##io', 'es', '##pa', '##ño', '##l']


In [10]:
tokenizer_multi = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")  # multilingual model



In [11]:
output = tokenizer_multi.tokenize("El modelo BERT en inglés no conoce el vocabulario español")
print(output)

['el', 'modelo', 'bert', 'en', 'ingles', 'no', 'conoce', 'el', 'voc', '##ab', '##ular', '##io', 'espanol']
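One rough way to quantify the difference between the two outputs above is the average number of tokens per word. The sketch below counts a word start as any token without the `##` prefix. That rule is a simplifying assumption that works for this sentence, and `tokens_per_word` is a hypothetical helper.

```python
def tokens_per_word(tokens):
    """Average number of subword tokens per word (words start at non-## tokens)."""
    word_starts = sum(1 for t in tokens if not t.startswith("##"))
    return len(tokens) / word_starts


# Token lists copied from the two outputs above.
english_tokens = ['El', 'model', '##o', 'B', '##ER', '##T', 'en', 'ing', '##lé',
                  '##s', 'no', 'con', '##oc', '##e', 'el', 'v', '##oc', '##ab',
                  '##ular', '##io', 'es', '##pa', '##ño', '##l']
multilingual_tokens = ['el', 'modelo', 'bert', 'en', 'ingles', 'no', 'conoce',
                       'el', 'voc', '##ab', '##ular', '##io', 'espanol']

print(len(english_tokens), tokens_per_word(english_tokens))
print(len(multilingual_tokens), tokens_per_word(multilingual_tokens))
```

The same ten-word sentence costs 24 tokens with the English vocabulary and 13 with the multilingual one.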


## Using the models (inference)
To run inference with a pre-trained model, we load the chosen model and pass it the tokenized text as input.

In [12]:
from transformers import AutoConfig, AutoModel

nombre_modelo = "bert-base-cased"
config = AutoConfig.from_pretrained(nombre_modelo)
config.output_hidden_states = True
model = AutoModel.from_pretrained(nombre_modelo, config=config)
tokenizer = AutoTokenizer.from_pretrained(nombre_modelo)


In [None]:
model

In [None]:
sentences = [
    'We are very happy to show you the 🤗 Transformers library.',
    'I hate chocolate ice cream']

In [None]:
encodings = tokenizer(sentences, padding=True, return_tensors = "pt")
encodings.input_ids

In [None]:
encodings.input_ids.shape
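With `padding=True`, shorter sequences in a batch are filled with a padding id so that every row has the same length, and the attention mask marks which positions hold real tokens. Below is a minimal sketch in plain Python, assuming right-padding with id 0 and made-up token ids. `pad_batch` is a hypothetical helper, not part of the library.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Made-up ids for two sentences of different length.
batch_ids = [[101, 7, 8, 9, 102], [101, 7, 102]]
input_ids, attention_mask = pad_batch(batch_ids)
print(input_ids)       # every row now has length 5
print(attention_mask)  # zeros mark the padded positions
```

The resulting shape is (number of sentences, length of the longest sentence), which is what `encodings.input_ids.shape` reports for the real batch.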

In [None]:
output = model(**encodings)

In [None]:
output.keys()

In [None]:
output.last_hidden_state.shape

In [None]:
output.pooler_output.shape

>`pooler_output` contains a "representation" of each sequence in the batch. What it basically does is take the hidden representation of the `[CLS]` token of each sequence in the batch, and then run that through the BertPooler nn.Module. This consists of a linear layer followed by a Tanh activation function. The weights of this linear layer are already pretrained on the next sentence prediction task
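The quoted description can be sketched in a few lines: take the first hidden vector of the sequence (the `[CLS]` position), apply a linear layer, then apply tanh. The weights, bias, and hidden values below are toy numbers chosen for illustration, and `pooler` is a hypothetical function, not the actual `BertPooler` module.

```python
import math


def pooler(cls_hidden, weight, bias):
    """Linear layer followed by tanh, applied to the [CLS] hidden vector."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]


# One toy sequence: 3 tokens with hidden size 2 (the real hidden size is much larger).
last_hidden_state = [
    [0.5, -1.0],  # [CLS]
    [0.1, 0.2],
    [0.3, 0.4],
]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, only for readability
bias = [0.0, 0.0]

pooled = pooler(last_hidden_state[0], weight, bias)
print(pooled)
```

Because of the tanh, every component of the pooled vector lies strictly between -1 and 1, and the output has one vector per sequence rather than one per token.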

In [None]:
len(output.hidden_states)  # one entry per internal layer of the model, plus the embedding output

In [None]:
for layer in output.hidden_states:
  print(layer.shape)

## Using tasks with `pipeline`
The most direct way to use a pre-trained task is through a `pipeline`. Transformers provides pre-trained tasks for:
- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place,
  etc.)
- Question answering: provide the model with some context and a question, extract the answer from the context.
- Filling masked text: given a text with masked words (e.g., replaced by `[MASK]`), fill the blanks.
- Summarization: generate a summary of a long text.
- Translation: translate a text into another language.
- Feature extraction: return a tensor representation of the text.  

First we import the `pipeline` function so that we can use it:


In [None]:
from transformers import pipeline

### Sentiment analysis

In [None]:
classifier = pipeline('sentiment-analysis')

Once the pipeline has been instantiated, using it is almost immediate:

In [None]:
classifier('We are very happy to show you the 🤗 Transformers library.')

In [None]:
classifier.model

In [None]:
classifier.model.config.id2label

In [None]:
!pip install torchinfo

In [None]:
from torchinfo import summary
summary(classifier.model)

We can choose any pre-trained model from the Hugging Face [model hub](https://huggingface.co/models). For example, the model `"nlptown/bert-base-multilingual-uncased-sentiment"` is pre-trained on several languages, including Spanish.

In [None]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [None]:
classifier('We are very happy to show you the 🤗 Transformers library.')

In [None]:
classifier.model.config.id2label

In [None]:
classifier.model

In [None]:
classifier('Me encanta el helado de vainilla')

In [None]:
classifier('I hate chocolate ice cream')

In [None]:
classifier(['Odio el helado de chocolate', 'Me encanta el helado de vainilla'])

### Zero-shot classification
With this task we can classify a text without having to label a training set.

In [None]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "international politics", "business", "sports"],
)
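Zero-shot classification is usually built on a natural language inference (NLI) model: each candidate label is turned into a hypothesis sentence, the model scores whether the text entails it, and the scores are normalized across labels. The sketch below shows only that bookkeeping. The template string is assumed to match the pipeline default, the entailment scores are invented, and `build_nli_pairs` and `normalize` are hypothetical helpers.

```python
import math


def build_nli_pairs(text, candidate_labels, template="This example is {}."):
    """Pair the text (premise) with one hypothesis sentence per candidate label."""
    return [(text, template.format(label)) for label in candidate_labels]


def normalize(entailment_scores):
    """Softmax across labels so that the scores sum to 1."""
    top = max(entailment_scores)
    exps = [math.exp(s - top) for s in entailment_scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["education", "international politics", "business", "sports"]
pairs = build_nli_pairs("This is a course about the Transformers library", labels)

# Invented entailment scores, one per pair; a real NLI model would produce these.
invented_scores = [3.0, -1.0, 0.5, -2.0]
scores = normalize(invented_scores)

ranked = sorted(zip(labels, scores), key=lambda item: item[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

No labeled training set is needed because the label names themselves are fed to the model as text.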

In [None]:
del classifier

### Text generation
Using a generative (auto-regressive) model, we can generate text from a seed prompt.

In [None]:
generator = pipeline("text-generation")
generator("In this tutorial, we will teach you how to")

In [None]:
generator.model

In [None]:
output = generator("In this tutorial, we will teach you how to", num_return_sequences=2)
print(output[0]['generated_text'])
print(output[1]['generated_text'])
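"Auto-regressive" means that the text is produced one token at a time, and each new token is appended to the input before the next one is predicted. The toy loop below illustrates the idea with a hand-written lookup table standing in for the model. The table, its scores, and `greedy_generate` are all hypothetical. A real model conditions on the whole prefix rather than only the last token, and generation often samples instead of always taking the top token.

```python
# Invented next-token scores that stand in for a real language model.
NEXT_TOKEN_SCORES = {
    "how": {"to": 0.9, "we": 0.1},
    "to": {"use": 0.6, "train": 0.4},
    "use": {"transformers": 0.8, "<eos>": 0.2},
    "transformers": {"<eos>": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Repeatedly append the highest-scoring next token until <eos> or the limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:
            break  # the toy table has no continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # end-of-sequence marker stops generation
        tokens.append(next_token)
    return tokens


print(greedy_generate(["how"]))
print(greedy_generate(["how"], max_new_tokens=1))
```

Asking for several sequences, as with `num_return_sequences=2` above, only makes sense when the next token is sampled, because a purely greedy loop like this one always returns the same continuation.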

In [None]:
generator = pipeline("text-generation", model="mrm8488/spanish-gpt2")
generator("Me llamo Joan y me gusta")

In [None]:
del generator

### Mask filling
This task consists of filling in the gaps in the middle of a sentence. It is the objective used to pre-train BERT-style (masked) language models.

In [None]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

In [None]:
unmasker("I went to a japanese <mask> to eat some <mask> with cheese.", top_k=1)

In [None]:
del unmasker

### Named Entity Recognition
In this task each *token* is labeled according to the entity it belongs to.

In [None]:
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

In [None]:
# try aggregation_strategy="none" (the default) to see the label of each token in a B-I-O scheme
ner = pipeline("ner", aggregation_strategy="none")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
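The difference between the two calls above is whether per-token tags are merged into entity spans. The sketch below shows a simplified version of that merging over word-level B-I-O tags. The tags are written by hand for illustration and are not model output, `group_entities` is a hypothetical helper, and the real pipeline works on subword pieces and also reports scores and character offsets.

```python
def group_entities(tagged_tokens):
    """Merge consecutive (token, tag) pairs with B-/I- tags into (type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in tagged_tokens:
        prefix, _, entity_type = tag.partition("-")
        # A B- tag, or an I- tag of a different type, starts a new entity.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if tag == "O" or starts_new:
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            current_type = entity_type
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hand-written tags for the example sentence.
tagged = [("My", "O"), ("name", "O"), ("is", "O"), ("Sylvain", "B-PER"),
          ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
          ("Hugging", "B-ORG"), ("Face", "I-ORG"), ("in", "O"),
          ("Brooklyn", "B-LOC"), (".", "O")]

print(group_entities(tagged))
```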

In [None]:
ner.model

In [None]:
ner.model.config.id2label

In [None]:
del ner

### Question answering
This task consists of answering a question from a given context. The answer is extracted as a span of the context, so the slice below recovers it from character positions.

In [None]:
question_answerer = pipeline("question-answering")
context = r"""
Joan lives in New York. His friend Antonio lives in Brussels.
"""
question_answerer(
    question="Where does Joan live?",
    context=context
)

In [None]:
context[15:23]
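Extractive question answering models typically score every position as a possible start and a possible end of the answer, and the answer is the span with the best combined score. The sketch below shows that selection step over invented scores for a word-level version of the context. `best_span` is a hypothetical helper, and the numbers are not model output.

```python
def best_span(start_scores, end_scores, max_answer_len=4):
    """Return (start, end) indices maximizing start score + end score, with start <= end."""
    best_start, best_end, best_score = None, None, float("-inf")
    for i, start_score in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start_score + end_scores[j]
            if score > best_score:
                best_start, best_end, best_score = i, j, score
    return best_start, best_end


tokens = ["Joan", "lives", "in", "New", "York", "."]
start_scores = [0.1, 0.0, 0.2, 2.5, 0.3, 0.0]  # invented
end_scores = [0.0, 0.1, 0.0, 0.4, 2.8, 0.1]    # invented

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)
```

In the pipeline output, the chosen span is reported as character offsets into the context, which is why slicing `context` with those offsets returns the answer text.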

In [None]:
del question_answerer

### Summarization
This task consists of generating a short (abstractive) summary from a text.

In [None]:
summarizer = pipeline("summarization")
summarizer("""
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
""")

In [None]:
del summarizer

### Text translation
We can use the default model by specifying the language pair in the task name (for example `translation_en_to_fr`), or we can use a specific model from the [model hub](https://huggingface.co/models).

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")
translator("Me llamo Joan y soy profesor de universidad.")

In [None]:
translator.model

In [None]:
del translator

### Feature extraction
The model returns the vector representation (embeddings) from the last layer for each token.

In [None]:
extractor = pipeline(model="bert-base-uncased", task="feature-extraction")
result = extractor("the BERT tokenizer was created with a WordPiece model.", return_tensors=True)
result.shape  # This is a tensor of shape [1, sequence_length, hidden_dimension] representing the input string.

In [None]:
extractor.model

In [None]:
result = extractor("the BERT tokenizer was created with a WordPiece model.", return_tensors=False)
type(result)

In [None]:
len(result[0])

In [None]:
len(result[0][0])
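With `return_tensors=False` the result is a nested list indexed as `[sequence][token][dimension]`. One common way to reduce it to a single vector per sentence is to average over the token axis. The sketch below does this on a tiny invented result. `mean_pool` is a hypothetical helper, and averaging is only one option among several.

```python
def mean_pool(token_vectors):
    """Average a list of token vectors into one vector of the same dimension."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]


# Invented result: 1 sequence, 3 tokens, hidden size 2.
toy_result = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]

sentence_vector = mean_pool(toy_result[0])
print(len(toy_result[0]), len(toy_result[0][0]))  # tokens, hidden size
print(sentence_vector)
```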

The length is given by the number of tokens, not words, and it includes the added `[CLS]` and `[SEP]` tokens.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
output = tokenizer("the BERT tokenizer was created with a WordPiece model.")
print(output)

In [None]:
len(output.input_ids)

In [None]:
print(tokenizer.convert_ids_to_tokens(output.input_ids))

In [None]:
del extractor

## Using the models
To use these models in our own workflow (e.g. as a `tensorflow.keras` model), we need to load the model together with its specific tokenizer.  
For example, for a sentiment analysis model:

In [None]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, AutoConfig

nombre_modelo = "distilbert-base-uncased-finetuned-sst-2-english"
config = AutoConfig.from_pretrained(nombre_modelo)
config.output_hidden_states = True
tf_model = TFAutoModelForSequenceClassification.from_pretrained(nombre_modelo, config=config)
tokenizer = AutoTokenizer.from_pretrained(nombre_modelo)


In [None]:
tf_model.summary()

To use the model, we first convert the input into tokens.

In [None]:
docs = ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it.", "I hate chocolate ice cream"]

tf_batch = tokenizer(
    docs,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)

In [None]:
# produces a dictionary with 'input_ids' and 'attention_mask' for each input text
for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

In [None]:

print(tokenizer.convert_ids_to_tokens(tf_batch['input_ids'][0]))

We apply the model, which returns the logits of the final layer and the outputs of each intermediate layer (*embeddings*).

In [None]:
tf_outputs = tf_model(tf_batch)
tf_outputs.keys()

In [None]:
len(tf_outputs.hidden_states)  # number of internal outputs of the transformer (embedding + 6 attention layers)

In [None]:
tf_outputs.hidden_states[0].shape  # output embeddings of each layer (n_samples, n_tokens, n_dimensions)

In [None]:
tf_outputs.logits.shape  # output layer (n_samples, n_classes)

In [None]:
tf_outputs.logits  # model output

We apply the softmax activation function to the logits to obtain normalized probabilities for each class (negative, positive).
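As a worked example of the softmax step, assume a single hypothetical pair of logits $z = (0, \ln 3)$ (chosen for easy arithmetic, not taken from the model):

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \mathrm{softmax}(0, \ln 3) = \left(\frac{1}{1+3}, \frac{3}{1+3}\right) = (0.25, 0.75)
$$

The outputs are positive and sum to 1, so they can be read as class probabilities, and the predicted class is the one with the largest value.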

In [None]:
import tensorflow as tf
predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
print(predictions)

In [None]:
import numpy as np

np.argmax(predictions, axis=1)

We can also load the models in PyTorch.

In [None]:
from transformers import AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(nombre_modelo)
config.output_hidden_states = True
model = AutoModelForSequenceClassification.from_pretrained(nombre_modelo, config=config)


In [None]:
model

In [None]:
batch = tokenizer(
    docs,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

In [None]:
batch.keys()

In [None]:
batch.input_ids  # the arrays are now PyTorch tensors

In [None]:
outputs = model(**batch)

In [None]:
outputs.keys()

In [None]:
outputs.logits

We convert the *logits* into normalized probabilities.

In [None]:
outputs.logits.softmax(dim=-1).tolist()

In [None]:
outputs.logits.softmax(dim=-1).argmax(dim=-1)

## Model bias
The language models behind *transformers* have been trained on large amounts of unlabeled text, mostly obtained from the Internet. They can therefore carry biases (racism, gender bias, etc.).

In [None]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])