# Introduction to Hugging Face
In this notebook we will explore how to use Hugging Face pretrained models and the Transformers library.

In [1]:
import warnings
warnings.filterwarnings('ignore')


from transformers import AutoTokenizer

## Tokenizers


A **tokenizer** is a crucial component in Natural Language Processing (NLP). It converts raw text into tokens, which are the smallest units of text that a machine learning model can understand. 

A **token** can be a word, subword, character, or even punctuation depending on the tokenizer used.

In this example, we're using the Hugging Face **Transformers** library to load a pretrained tokenizer. 

The tokenizer is associated with the `sshleifer/distilbart-cnn-12-6` model, a distilled version of the BART model fine-tuned for summarization.

[Link to the model](https://huggingface.co/sshleifer/distilbart-cnn-12-6)

In [2]:
# initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")

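We now define a list of raw text inputs and feed it to the tokenizer.

With `padding=True` and `truncation=True`, shorter sentences are padded to the length of the longest one and overlong ones are cut to the model's maximum length. Setting `return_tensors="pt"` returns the result as PyTorch tensors.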

The output is a **dictionary of tensors** that contain the tokenized representation of the input sentences, structured in a way that can be fed into a PyTorch model.

In [3]:
# provide a raw input
raw_inputs = [
    "I love learning languages!",
    "I find learning Slovak language difficult."
]

# convert input
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

# print inputs
print(inputs)

{'input_ids': tensor([[    0,   100,   657,  2239, 11991,   328,     2,     1,     1,     1],
        [    0,   100,   465,  2239, 33458,   677,  2777,  1202,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
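
To make the padding and the attention mask in the output above more concrete, here is a minimal pure-Python sketch of what padding does conceptually. The helper `pad_batch` is hypothetical (it is not part of the Transformers API), and the pad id `1` is an assumption read off the printed `input_ids`.

```python
# Toy illustration of padding and attention masks (not the library's implementation).

def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids taken from the tokenizer output above
batch = pad_batch(
    [
        [0, 100, 657, 2239, 11991, 328, 2],
        [0, 100, 465, 2239, 33458, 677, 2777, 1202, 4, 2],
    ],
    pad_id=1,  # assumption: read off the printed output above
)
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

The shorter sentence receives three pad ids, and its attention mask is `0` at exactly those positions.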


### How a tokenizer works

You can feed the tokenizer the input directly without specifying parameters.

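It converts sentences into tokens and then into token ids.

This process, called **encoding**, has two steps: splitting the text into tokens and mapping each token to its id. The **embedding** of those ids into vectors happens later, inside the model.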

In [4]:
# Convert text to token ids (special tokens are added automatically)
tokenizer('I love learning languages!')

{'input_ids': [0, 100, 657, 2239, 11991, 328, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Some tokenizers (such as those used by GPT, BERT, and BART) use subword tokenization.

This means that a single word can be split into multiple tokens.

In [5]:
# Not always 1 token = 1 word... 
tokens = tokenizer.tokenize('I find learning Slovak language difficult.')
print(tokens)

['I', 'Ġfind', 'Ġlearning', 'ĠSlov', 'ak', 'Ġlanguage', 'Ġdifficult', '.']
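
The `Ġ` prefix in the output marks a token that begins with a space, a convention of the byte-level BPE vocabulary used by this tokenizer. As a rough sketch, the marker can be used to see which tokens belong to the same whitespace-separated chunk of text. The helper `group_into_words` is hypothetical and only covers this simple case.

```python
# Toy illustration: regroup subword tokens using the "Ġ" (leading space) marker.

def group_into_words(tokens, space_marker="Ġ"):
    """Group tokens so that each group starts at a token carrying the space marker."""
    groups = []
    for token in tokens:
        if token.startswith(space_marker) or not groups:
            groups.append([token])      # a new whitespace-separated chunk starts here
        else:
            groups[-1].append(token)    # continuation of the previous chunk
    return groups


tokens = ['I', 'Ġfind', 'Ġlearning', 'ĠSlov', 'ak', 'Ġlanguage', 'Ġdifficult', '.']
for group in group_into_words(tokens):
    word = "".join(group).replace("Ġ", "")
    print(f"{word!r}: {len(group)} token(s)")
```

Here `Slovak` is made of two tokens (`ĠSlov` and `ak`), and the final period is a separate token attached to `difficult`.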


In [6]:
# Convert tokens to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("ID \tTOKEN\n-----------------")
for i, y in zip(token_ids, tokens):
    print(f"{i}\t{y}")

ID 	TOKEN
-----------------
100	I
465	Ġfind
2239	Ġlearning
33458	ĠSlov
677	ak
2777	Ġlanguage
1202	Ġdifficult
4	.


### Decoding Process

The tokenizer also supports the opposite process, **decoding**.

Starting from token ids, the tokenizer can recover the original text.

In [7]:
print(token_ids)
print()

# decode token ids
decoded_tokens = tokenizer.decode(token_ids)
print(decoded_tokens)


[100, 465, 2239, 33458, 677, 2777, 1202, 4]

I find learning Slovak language difficult.
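
Conceptually, decoding is a lookup from ids back to tokens followed by joining the pieces. The sketch below rebuilds the sentence from the ID/TOKEN table printed earlier. It is only a toy: `id_to_token` holds just these eight entries, `toy_decode` is a hypothetical helper, and the real `tokenizer.decode` also handles special tokens and byte-level details. The `Ġ` character is the vocabulary's marker for a leading space.

```python
# Toy decoder built from the ID/TOKEN table printed earlier (not the real vocabulary).
id_to_token = {
    100: "I", 465: "Ġfind", 2239: "Ġlearning", 33458: "ĠSlov",
    677: "ak", 2777: "Ġlanguage", 1202: "Ġdifficult", 4: ".",
}


def toy_decode(token_ids):
    """Map ids back to tokens, concatenate, and turn the space marker into a space."""
    tokens = [id_to_token[i] for i in token_ids]
    return "".join(tokens).replace("Ġ", " ")


decoded = toy_decode([100, 465, 2239, 33458, 677, 2777, 1202, 4])
print(decoded)
```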


### Preparing data for a model

The tokenizer can also turn a plain list of token ids into model-ready input.

The `prepare_for_model` method adds the necessary special tokens and builds the attention mask; padding and truncation can be applied at this stage as well.

In [8]:
# prepare tokens for model
model_preps = tokenizer.prepare_for_model(token_ids)
print(model_preps)

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': [0, 100, 465, 2239, 33458, 677, 2777, 1202, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [9]:
# Structure of preps for model = dictionary
for karg in model_preps.keys():
    print(f"{karg} : {model_preps[karg]}")

input_ids : [0, 100, 465, 2239, 33458, 677, 2777, 1202, 4, 2]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
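
Comparing this output with `token_ids`, the only differences are the id `0` at the start, the id `2` at the end, and an attention mask of ones. A minimal sketch of that step, assuming `0` and `2` are this tokenizer's sequence-start and sequence-end ids (as the output suggests), could look as follows. The function `toy_prepare_for_model` is hypothetical and ignores padding and truncation.

```python
# Toy illustration of adding special tokens (not the library's implementation).

def toy_prepare_for_model(token_ids, start_id=0, end_id=2):
    """Wrap the ids in start/end tokens and build an all-ones attention mask."""
    input_ids = [start_id] + list(token_ids) + [end_id]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


prepared = toy_prepare_for_model([100, 465, 2239, 33458, 677, 2777, 1202, 4])
print(prepared)
```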


## Transformers: Pipelines

The Hugging Face Transformers library supports many different tasks through the `pipeline` function.

Some of the tasks are:

- `sentiment-analysis` : performs sentiment analysis
- `question-answering` : provides an answer to a given question
- `summarization` : summarizes a given text
- `translation_en_to_fr` : translates text (here from English to French)
- `fill-mask` : predicts missing words in a sentence

In the next example we will use **sentiment analysis**.

In [10]:
from transformers import pipeline

In [11]:
# use sentiment analysis in pipeline
classifier = pipeline("sentiment-analysis")

# classify a batch of sentences
classifier(
    [
        "I love learning languages!",
        "I hate eating junk food!"
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.9998078942298889},
 {'label': 'NEGATIVE', 'score': 0.9950483441352844}]
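
The pipeline returns one dictionary per input sentence, in the same order as the inputs. As a small sketch of how such a result might be consumed, the snippet below pairs each sentence with its prediction. The `results` list is copied from the output above rather than produced by a live model, so the snippet runs without downloading anything.

```python
sentences = ["I love learning languages!", "I hate eating junk food!"]

# Copied from the classifier output above
results = [
    {"label": "POSITIVE", "score": 0.9998078942298889},
    {"label": "NEGATIVE", "score": 0.9950483441352844},
]

# One formatted line per sentence: label, confidence, original text
summary = [
    f"{res['label']:<8} ({res['score']:.2%})  {sentence}"
    for sentence, res in zip(sentences, results)
]
print("\n".join(summary))
```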

In the next example we will instead perform **summarization**.

In [12]:
text = """Odysseus lands on the island of the Cyclopes during his journey home from the Trojan War and, 
together with some of his men, enters a cave filled with provisions. When the giant Polyphemus returns 
home with his flocks, he blocks the entrance with a great stone and, scorning the usual custom of 
hospitality, eats two of the men. Next morning, the giant kills and eats two more and leaves the cave 
to graze his sheep. After the giant returns in the evening and eats two more of the men, Odysseus offers 
Polyphemus some strong and undiluted wine given to him earlier on his journey. Drunk and unwary, the 
giant asks Odysseus his name, promising him a guest-gift if he answers. Odysseus tells him 'Nobody' and 
Polyphemus promises to eat this "Nobody" last of all. With that, he falls into a drunken sleep. 
Odysseus had meanwhile hardened a wooden stake in the fire and drives it into Polyphemus' eye. 
When Polyphemus shouts for help from his fellow giants, saying that "Nobody" has hurt him, they think 
Polyphemus is being afflicted by divine power and recommend prayer as the answer. """

# initialize summarizer
summarizer = pipeline("summarization")
summarizer([text])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'summary_text': ' Odysseus lands on the island of the Cyclopes during his journey home from the Trojan War and, with some of his men, enters a cave filled with provisions . When the giant Polyphemus returns with his flocks, he blocks the entrance with a great stone and, scorning the usual custom of \xa0hospitality, eats two of the men . The giant kills and eats two more and leaves the cave  to graze his sheep next morning . After the giant returns in the evening, he returns to the cave and eats three more men . Drunk and unwary, the giant asks Odyseus his name, promising him a guest-gift if he answers'}]


## Pretrained Models


A **pretrained model** on Hugging Face is a model that has already been trained on a large dataset and can be reused for various tasks without needing to train it from scratch.

These models allow users to download and apply them directly for specific applications like text classification, sentiment analysis, translation, text generation, or image recognition.

##### Key Benefits of Pretrained Models:

- **Time-saving**: You don't need to train the model from scratch, saving computation time and resources.
- **Performance**: These models are often trained on massive datasets using powerful computing resources, resulting in high accuracy and generalizability.
- **Transfer Learning**: You can fine-tune a pretrained model on your specific task with less data, leveraging the knowledge it has already learned.
- **Plug-and-play**: You can use a pretrained model with just a few lines of code.

<br>

<u>In the next steps we will load a model directly from Hugging Face.</u>

We will load a pretrained model specifically designed for **sequence classification tasks**, such as sentiment analysis, spam detection, or topic classification.

In [13]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [15]:
# Specify the checkpoint (the tokenizer and the model are loaded from the same name)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"


# Load tokenizer from pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [18]:
# Tokenized input
text = """I like learning German language"""
inputs = tokenizer(text, return_tensors='pt')

In [19]:
# Output
outputs = model(**inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-2.9966,  3.0761]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
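
The model returns raw **logits**, not probabilities. A softmax turns them into class probabilities, which is roughly what the sentiment pipeline reports as `score`. The sketch below uses only the standard library and the logits copied from the output above. The `id2label` mapping is an assumption about this checkpoint's label order; in a live session it can be read from `model.config.id2label`.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-2.9966, 3.0761]                 # copied from the output above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumption: label order of this checkpoint

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 4))
```

With these logits the second class wins with a probability well above 0.99, so the sentence is classified as positive.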

## Embedding

Next we will see how model embeddings work. The following code will perform a series of steps:

1. It tokenizes the input sentence, padding and truncating it, and returns a PyTorch tensor.
2. The tokenized inputs are fed into a model to obtain the output, which includes token embeddings.

<hr>

#### Outputs:

##### Last Hidden State

The token embeddings are returned as the "last hidden state"; the code prints their **shape**, showing the dimensions of the token representations.

The last hidden state provides contextual embeddings of tokens, which are crucial for understanding the input's meaning and relationships, making them essential for various NLP tasks like classification and generation.

##### Shape of Output

The **shape of the output** typically follows the format (batch_size, sequence_length, hidden_size), representing the batch size, the number of tokens, and the size of each token's embedding.

The shape of the output conveys important information about the batch size, sequence length, and embedding dimensionality, which is necessary for ensuring proper model functioning, optimizing performance, and adapting the model for specific tasks.

<br>

<hr>


We will import the **AutoModel** class from the Hugging Face Transformers library, which allows you to easily load and utilize pre-trained transformer models without specifying the exact architecture. 


In [21]:
from transformers import AutoModel
model = AutoModel.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [22]:
inputs = tokenizer('I love deep learning!', padding=True, truncation=True, return_tensors='pt')

outputs = model(**inputs)

print(outputs.last_hidden_state.shape) # the token embeddings

torch.Size([1, 7, 768])
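
To make the three dimensions concrete, the sketch below builds a stand-in for `outputs.last_hidden_state` out of nested Python lists filled with zeros, using the sizes printed above. The helper `nested_shape` is hypothetical; with real tensors, `.shape` and indexing such as `outputs.last_hidden_state[0, 0]` play the same role.

```python
def nested_shape(x):
    """Return the shape of a regularly nested, non-empty list."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0]
    return tuple(shape)


batch_size, sequence_length, hidden_size = 1, 7, 768  # from the printed shape above

# Stand-in for outputs.last_hidden_state, filled with zeros instead of real embeddings
hidden = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]

print(nested_shape(hidden))
first_token_vector = hidden[0][0]  # vector of the first token of the first sentence
print(len(first_token_vector))
```

Each of the 7 tokens of the single input sentence (including the special tokens added by the tokenizer) is represented by a vector of 768 numbers.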
