In [4]:
import torch

In [5]:
!pip install -U transformers datasets accelerate sentencepiece



# Using Pre-trained Models with Hugging Face

Hugging Face is a platform that hosts a wide variety of pre-trained models for natural language processing (NLP) tasks. Using these models lets you apply advanced NLP techniques without training a model from scratch.

## Tokenizers

Tokenization is the process of converting text into tokens, the basic units that NLP models process. Hugging Face provides a tokenizer for each model, producing tokens compatible with the model you intend to use.

In this section, we will learn how to load and use tokenizers with Hugging Face.
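
To make the idea of subword tokens concrete, here is a minimal sketch of greedy longest-match splitting, the general strategy behind WordPiece-style tokenizers. The vocabulary `TOY_VOCAB` and the function `wordpiece_tokenize` are hypothetical and exist only for this illustration; the real BERT vocabulary is far larger and will split words differently.

```python
# Toy vocabulary (hypothetical, for illustration only)
TOY_VOCAB = {"art", "##ific", "##ial", "intelligence", "is", "the", "future", ".", "[UNK]"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of a single lowercase word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until one matches
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: treat the whole word as unknown
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("artificial"))  # ['art', '##ific', '##ial']
print(wordpiece_tokenize("future"))      # ['future']
print(wordpiece_tokenize("xyz"))         # ['[UNK]']
```

The `##` prefix marks a piece that continues the previous one, which is how a fixed vocabulary can still cover words it has never seen as a whole.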

In [3]:
from transformers import GPT2Tokenizer, BertTokenizer

# Loading the GPT-2 and BERT tokenizers
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


In [6]:
# Tokenization example
input_text = "Artificial intelligence is the future."
gpt2_tokens = gpt2_tokenizer.tokenize(input_text)
bert_tokens = bert_tokenizer.tokenize(input_text)

print(f"Tokens GPT-2: {gpt2_tokens}")
print(f"Tokens BERT: {bert_tokens}")

Tokens GPT-2: ['Art', 'ificial', 'Ġintelligence', 'Ġis', 'Ġthe', 'Ġfuture', '.']
Tokens BERT: ['artificial', 'intelligence', 'is', 'the', 'future', '.']


In [7]:
gpt2_ids = gpt2_tokenizer.convert_tokens_to_ids(gpt2_tokens)
bert_ids = bert_tokenizer.convert_tokens_to_ids(bert_tokens)

print(f"IDs GPT-2: {gpt2_ids}")
print(f"IDs BERT: {bert_ids}")

IDs GPT-2: [8001, 9542, 4430, 318, 262, 2003, 13]
IDs BERT: [7976, 4454, 2003, 1996, 2925, 1012]


In [8]:
gpt2_input = gpt2_tokenizer(input_text, return_tensors="pt")
bert_input = bert_tokenizer(input_text, return_tensors="pt")

print(f"Input GPT-2: {gpt2_input}")
print(f"Input BERT: {bert_input}")

Input GPT-2: {'input_ids': tensor([[8001, 9542, 4430,  318,  262, 2003,   13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
Input BERT: {'input_ids': tensor([[ 101, 7976, 4454, 2003, 1996, 2925, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
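
The `attention_mask` in the outputs above is all ones because a single sentence needs no padding. The sketch below shows, in plain Python, how padding a batch of unequal-length sequences produces zeros in the mask. The helper `pad_batch` is hypothetical, and `pad_id=0` is an assumption chosen for the illustration; the token IDs are taken from the BERT output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each sequence to the longest length and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7976, 4454, 102], [101, 2925, 102]])
print(batch["input_ids"])       # [[101, 7976, 4454, 102], [101, 2925, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In practice the tokenizer does this itself when called on a list of texts with padding enabled; the sketch only shows what the two returned fields mean.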


## Models

Models are the neural networks that process tokens and produce outputs such as text or embeddings. Hugging Face offers a variety of models, such as GPT-2 for text generation and BERT for tasks such as semantic search.

Here, we will learn how to load and use these models for different tasks.

### Loading Pre-trained Models

To start, let's load a pre-trained model, GPT-2.

In [9]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Loading the GPT-2 tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Loading the GPT-2 model
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=gpt2_tokenizer.eos_token_id)


## Text Generation

Text generation consists of providing an initial sequence and letting the model continue the text from that input. Let's see how to do this with GPT-2.

Text generation is one of the most common applications of models like GPT-2, which can be used to complete sentences, write stories, or even generate code.
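
As a rough mental model of what a generation call does by default, the sketch below runs a greedy decoding loop: at each step it picks the highest-scoring next token and appends it. The score table `TOY_SCORES` and the function `greedy_generate` are hypothetical stand-ins for a real language model, which scores every token in its vocabulary using the whole context and not only the last token.

```python
# Hypothetical next-token scores: current token -> {candidate: score}
TOY_SCORES = {
    "In": {"a": 2.0, "the": 1.5},
    "a": {"world": 1.8, "country": 1.2},
    "world": {"where": 2.2, ".": 0.5},
    "where": {"AI": 1.9, "people": 1.0},
    "AI": {"<eos>": 1.0},
}

def greedy_generate(prompt_tokens, max_length=10, eos="<eos>"):
    """Repeatedly append the highest-scoring next token."""
    tokens = list(prompt_tokens)
    # max_length counts the prompt tokens too
    while len(tokens) < max_length:
        candidates = TOY_SCORES.get(tokens[-1])
        if not candidates:
            break  # the toy table has no continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == eos:
            break  # the "model" chose to stop
        tokens.append(next_token)
    return tokens

print(" ".join(greedy_generate(["In"])))  # In a world where AI
```

Greedy decoding is deterministic, which is one reason its output tends to become repetitive on longer continuations.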

In [10]:
# Input text
input_text = "In a world where AI"

# Tokenizing the input
input_ids = gpt2_tokenizer(input_text, return_tensors="pt")

print(f"Input IDs: {input_ids}")

Input IDs: {'input_ids': tensor([[ 818,  257,  995,  810, 9552]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}


In [11]:
# Generating the text
output = gpt2_model.generate(**input_ids, max_length=50)

print(output.shape)
print(f"Output: {output}")

torch.Size([1, 50])
Output: tensor([[  818,   257,   995,   810,  9552,   318,   257,  1263,  1917,    11,
           340,   338,  1593,   284,  1833,   703,   340,  2499,    13,   198,
           198,   464,  1917,   318,   326,  9552,   318,   407,   257,  1917,
           379,   477,    13,   632,   338,   257,  1917,   326,   460,   307,
         16019,   416,   257,  1256,   286,  1180, 10581,    13,   198,   198]])


In [12]:
# Decoding the generated text
generated_text = gpt2_tokenizer.decode(output[0])
print(generated_text)

In a world where AI is a big problem, it's important to understand how it works.

The problem is that AI is not a problem at all. It's a problem that can be solved by a lot of different approaches.




In [13]:
# Function to complete text
def complete_text(input_text, model=gpt2_model, tokenizer=gpt2_tokenizer, max_length=50):
    input_ids = tokenizer(input_text, return_tensors="pt")
    output = model.generate(**input_ids, max_length=max_length)
    generated_text = tokenizer.decode(output[0])
    return generated_text

In [14]:
prediction = complete_text("Brazil is")
print(prediction)

Brazil is a country that has been a beacon of hope for the poor and the working class.

The country's economic growth has been a boon for the country's poor, and the country's economy has been a boon for the country's middle


In [15]:
def generate_next_tokens(input_text, n_tokens=1, model=gpt2_model, tokenizer=gpt2_tokenizer):
    inputs = tokenizer(input_text, return_tensors="pt")
    # Pass the attention mask along with the input IDs to avoid the
    # "attention mask is not set" warning from generate()
    output = model.generate(**inputs, max_new_tokens=n_tokens)
    generated_tokens = output[0][inputs.input_ids.shape[1]:]  # Keep only the newly generated tokens
    predicted_tokens = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return predicted_tokens.strip()

input_text = "Brazil is a country. Apple is a fruit. Python is a"

next_token = generate_next_tokens(input_text)
print(next_token)


language
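
Greedy decoding returns only the single highest-scoring token, but the model actually produces a score (logit) for every token in its vocabulary. The sketch below shows how such scores are usually turned into probabilities with a softmax and then ranked. The values in `toy_logits` are invented for illustration and are not real GPT-2 outputs; `softmax` and `top_k` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Convert a {token: logit} dict into a {token: probability} dict."""
    highest = max(logits.values())
    # Subtracting the maximum keeps exp() numerically stable
    exps = {tok: math.exp(value - highest) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_k(logits, k=3):
    """Return the k most probable (token, probability) pairs."""
    probs = softmax(logits)
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Invented scores for the prompt "... Python is a"
toy_logits = {" language": 5.0, " programming": 4.0, " snake": 2.5, " fruit": 0.5}
for token, prob in top_k(toy_logits):
    print(f"{token!r}: {prob:.3f}")
```

Inspecting the top few candidates this way is a simple check on how confident a model is about its next token.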


## Text Embeddings

Text embeddings are vector representations of words or sentences that capture semantic meaning. They can be used in many NLP tasks, such as text classification, semantic search, and clustering.

In this section, we will use BERT to generate text embeddings and compare the similarity between different sentences. We will also introduce concepts such as cosine similarity.

In [16]:
from transformers import BertTokenizer, BertModel

# Loading the BERT tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Loading the BERT model
bert_model = BertModel.from_pretrained("bert-base-uncased")



## Extracting Embeddings

Computing an embedding consists of feeding a text to an encoder model and taking the vector representation it produces. Let's see how to do this with BERT.

In [17]:
# Input text
text = "Artificial intelligence is the future."
input_ids = bert_tokenizer.encode(text, return_tensors="pt")

# Getting the representation of the text
with torch.no_grad():
    output = bert_model(input_ids)


pooled_output = output.last_hidden_state.mean(dim=1)

print(output.last_hidden_state.shape)
print(pooled_output.shape)

torch.Size([1, 8, 768])
torch.Size([1, 768])


In [18]:
# Function to generate embeddings using BERT
def get_embedding(text):
    input_ids = bert_tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        outputs = bert_model(input_ids)
    return outputs.last_hidden_state.mean(dim=1)
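
The function above averages over every token position, which works for one sentence at a time. When sentences are batched and padded, a common variation is to average only over real tokens using the attention mask. The dependency-free sketch below illustrates that idea on plain lists; `masked_mean_pool` is a hypothetical helper and the numbers are invented.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0.

    hidden_states: list of token vectors (one list of floats per token)
    attention_mask: list of 0/1 flags, one per token
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    if count == 0:
        raise ValueError("attention_mask contains no real tokens")
    return [t / count for t in total]

hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row stands for padding
mask = [1, 1, 0]
print(masked_mean_pool(hidden, mask))  # [2.0, 3.0]
```

A plain mean over all three rows would give a different vector, because the padding row would pull the average toward zero.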

### Similarity Measures

Cosine similarity is a common metric for measuring how similar two vectors are in the embedding space. It computes the cosine of the angle between the two vectors: a value of 1 means they point in the same direction, 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions.

The cosine similarity formula is:

$
\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}
$

Let's compute the similarity between different sentences using embeddings generated by BERT.
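
Before applying the formula to BERT embeddings, it can help to evaluate it by hand on tiny vectors. The sketch below is a plain-Python version of the formula above; `cosine_sim` is a hypothetical helper name, and the vectors are chosen so the results are easy to verify.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([3.0, 4.0], [6.0, 8.0]))   # 1.0  (same direction, different length)
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite directions)
```

The first case shows that the metric ignores vector length: only the direction matters.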

In [19]:
# Example sentences
query = "Artificial intelligence is transforming industries."
doc1 = "AI is changing the way we work."
doc2 = "Brazil is a country in South America."

# Generating embeddings
query_embedding = get_embedding(query)
doc1_embedding = get_embedding(doc1)
doc2_embedding = get_embedding(doc2)

In [20]:
# Computing cosine similarity
cos = torch.nn.CosineSimilarity(dim=1)
similarity_doc1 = cos(query_embedding, doc1_embedding)
similarity_doc2 = cos(query_embedding, doc2_embedding)

print(f"Similarity with doc1: {similarity_doc1.item():.4f}")
print(f"Similarity with doc2: {similarity_doc2.item():.4f}")

Similarity with doc1: 0.8325
Similarity with doc2: 0.5219


## Exercises

### Exercise 1
Using GPT-2, create a function that predicts the subject of a sentence. For example, in "John went to the store", the subject is John.

In [25]:
# Re-imported GPT-2 here only as a visual aid during development
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch


In [26]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

model.eval()



GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [30]:
# I had to use spaCy to extract the subject of the sentence correctly

!pip install spacy
!python -m spacy download en_core_web_sm



In [31]:
# Importing the library
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_subject(sentence):
    doc = nlp(sentence)
    for token in doc:
        if token.dep_ in ("nsubj", "nsubjpass"):
            return token.text
    return None


In [39]:
# Function that asks GPT-2 for the subject of a sentence via a prompt.
# Note: the test cell below prints the spaCy result instead, because this
# function did not reliably return the correct subject.
def predict_subject(sentence, max_new_tokens=5):
    """Use GPT-2 to predict the subject of a sentence via a prompt."""
    # English prompt, since GPT-2 was trained mostly on English text
    prompt = f"The subject of the sentence '{sentence}' is"

    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )

    # Keep only the newly generated tokens (drops the prompt)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    subject_prediction = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Basic cleanup: keep only the text before the first period
    return subject_prediction.split(".")[0]


In [40]:
# Test cell: predict_subject is called, but the printed subject comes from
# the spaCy-based extract_subject, which identifies it correctly

sentence = "John went to the store"
subject = predict_subject(sentence)

print("Sentence:", sentence)
print("Subject (spaCy):", extract_subject(sentence))


Sentence: John went to the store
Subject (spaCy): John


### Exercise 2
In an advanced book search system, implement a function that performs a semantic search and returns the 5 most relevant books for the user's query.

The solution follows below.

In [21]:
descriptions = [
    "A tale of love and loss set against the backdrop of war.",
    "A gripping mystery where nothing is as it seems.",
    "An epic fantasy adventure in a world of magic and dragons.",
    "A heartwarming story of friendship and second chances.",
    "A chilling thriller that will keep you on the edge of your seat.",
    "A coming-of-age story about finding yourself and your place in the world.",
    "A historical novel that brings the past to life with vivid detail.",
    "A suspenseful crime novel where the detective becomes the hunted.",
    "A dystopian future where one woman's rebellion could change everything.",
    "A romantic comedy that will make you believe in love again.",
    "A science fiction saga that explores the limits of human ingenuity.",
    "A powerful drama about family, secrets, and redemption.",
    "A journey through time to uncover hidden truths.",
    "A modern fairy tale where dreams really do come true.",
    "A dark fantasy filled with intrigue, betrayal, and forbidden magic.",
    "A psychological thriller that will mess with your mind.",
    "A poetic exploration of life, love, and everything in between.",
    "A detective novel where every clue leads to more questions.",
    "A story of survival and the strength of the human spirit.",
    "A heart-pounding adventure in a world beyond our own.",
    "A memoir of a life lived on the edge of society.",
    "A romance that defies the boundaries of time and space.",
    "A political thriller set in a world of corruption and power.",
    "A fantasy epic that weaves together destiny and desire.",
    "A mystery novel where the past refuses to stay buried.",
    "A heartwrenching story of love, loss, and letting go.",
    "A darkly comic tale of life in the absurd.",
    "A science fiction adventure that questions what it means to be human.",
    "A historical romance set in a time of revolution and change.",
    "A supernatural thriller where nightmares come to life.",
    "A journey of self-discovery in a world that demands conformity.",
    "A story of forbidden love in a society bound by tradition.",
    "A fast-paced action novel where every second counts.",
    "A lyrical exploration of nature, solitude, and the passage of time.",
    "A detective story where the truth is stranger than fiction.",
    "A powerful saga of family, loyalty, and betrayal.",
    "A fantastical journey through a land of myths and legends.",
    "A tale of revenge, justice, and the price of power.",
    "A story of hope in the face of overwhelming odds.",
    "A quirky romance where opposites truly attract.",
    "A sci-fi thriller that blurs the line between reality and illusion.",
    "A historical epic that spans generations and continents.",
    "A crime novel where the line between right and wrong is razor-thin.",
    "A love story that unfolds in the most unexpected way.",
    "A philosophical exploration of what it means to live a good life.",
    "A gripping tale of survival in a post-apocalyptic world.",
    "A romance that blossoms in the midst of chaos and war.",
    "A detective novel that unravels the darkest secrets of the human soul.",
    "A story of redemption and the power of forgiveness.",
    "A fantasy adventure where a reluctant hero must save the world."
]

input_text = "A horror novel"

# ...

In [43]:
# Installing sentence-transformers because I found it easier to use
!pip install sentence-transformers scikit-learn

In [47]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity



In [52]:

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


In [61]:
# I built this DataFrame to explore how to work with the descriptions

dataset = pd.DataFrame(descriptions, columns=['description'])
dataset['title'] = [f'Book {i+1}' for i in range(len(descriptions))]
dataset['author'] = 'Various Authors'
df = dataset.copy()
df['text'] = df['title'] + '. ' + df['description']
dataset

| | description | title | author |
|---|---|---|---|
| 0 | A tale of love and loss set against the backdr... | Book 1 | Various Authors |
| 1 | A gripping mystery where nothing is as it seems. | Book 2 | Various Authors |
| 2 | An epic fantasy adventure in a world of magic ... | Book 3 | Various Authors |
| 3 | A heartwarming story of friendship and second ... | Book 4 | Various Authors |
| 4 | A chilling thriller that will keep you on the ... | Book 5 | Various Authors |
| 5 | A coming-of-age story about finding yourself a... | Book 6 | Various Authors |
| 6 | A historical novel that brings the past to lif... | Book 7 | Various Authors |
| 7 | A suspenseful crime novel where the detective ... | Book 8 | Various Authors |
| 8 | A dystopian future where one woman's rebellion... | Book 9 | Various Authors |
| 9 | A romantic comedy that will make you believe i... | Book 10 | Various Authors |


In [68]:
# Semantic search returning the 5 most relevant books for the user's query
# (uses semantic_search, defined in cell In [67] further down)

query = "books about deep learning and neural networks"
results = semantic_search(query)
display(results)

| | description | title | author | text | combined_text | similarity_score |
|---|---|---|---|---|---|---|
| 10 | A science fiction saga that explores the limit... | Book 11 | Various Authors | Book 11. A science fiction saga that explores ... | A science fiction saga that explores the limit... | 0.311442 |
| 22 | A political thriller set in a world of corrupt... | Book 23 | Various Authors | Book 23. A political thriller set in a world o... | A political thriller set in a world of corrupt... | 0.298477 |
| 14 | A dark fantasy filled with intrigue, betrayal,... | Book 15 | Various Authors | Book 15. A dark fantasy filled with intrigue, ... | A dark fantasy filled with intrigue, betrayal,... | 0.292085 |
| 6 | A historical novel that brings the past to lif... | Book 7 | Various Authors | Book 7. A historical novel that brings the pas... | A historical novel that brings the past to lif... | 0.286045 |
| 15 | A psychological thriller that will mess with y... | Book 16 | Various Authors | Book 16. A psychological thriller that will me... | A psychological thriller that will mess with y... | 0.280821 |


In [65]:
# Select only the text columns
text_columns = df.select_dtypes(include="object").columns

# Build a combined text field
df["combined_text"] = df[text_columns].fillna("").agg(" ".join, axis=1)

In [66]:
book_embeddings = model.encode(
    df["combined_text"].tolist(),
    convert_to_numpy=True,
    show_progress_bar=True
)


Now that we have embeddings for all the books, we can define the semantic search function.

In [67]:
from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query, top_k=5):
    # Embed the query
    query_embedding = model.encode([query], convert_to_numpy=True)

    # Cosine similarity between the query and every book
    similarities = cosine_similarity(query_embedding, book_embeddings)[0]

    # Indices of the top-k results, highest similarity first
    top_indices = np.argsort(similarities)[::-1][:top_k]

    results = df.iloc[top_indices].copy()
    results["similarity_score"] = similarities[top_indices]

    return results.sort_values(by="similarity_score", ascending=False)
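
The ranking step in `semantic_search` (sort the similarities, reverse, take the first `top_k`) can be checked on small hand-made vectors. The sketch below reproduces that step in plain Python with `heapq.nlargest`; the names `cosine_sim`, `rank_top_k`, and `toy_docs` are hypothetical and the vectors are invented, so this illustrates only the ranking logic, not the embedding model.

```python
import heapq
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_top_k(query_vec, doc_vecs, top_k=2):
    """Return (score, index) pairs for the top_k most similar documents."""
    scored = [(cosine_sim(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    # nlargest returns the pairs sorted from highest to lowest score
    return heapq.nlargest(top_k, scored)

toy_docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
for score, idx in rank_top_k([1.0, 0.0], toy_docs):
    print(idx, round(score, 3))
```

With the query pointing along the first axis, document 0 ranks first and document 1 second, while the orthogonal document 2 is left out.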

Let's test the semantic search function with an example query.

In [64]:
query = "books about artificial intelligence and machine learning"
results = semantic_search(query)
display(results)

| | title | author | description | similarity_score |
|---|---|---|---|---|
| 10 | Book 11 | Various Authors | A science fiction saga that explores the limit... | 0.297868 |
| 17 | Book 18 | Various Authors | A detective novel where every clue leads to mo... | 0.238624 |
| 27 | Book 28 | Various Authors | A science fiction adventure that questions wha... | 0.237364 |
| 30 | Book 31 | Various Authors | A journey of self-discovery in a world that de... | 0.237352 |
| 34 | Book 35 | Various Authors | A detective story where the truth is stranger ... | 0.227985 |
