# Test 3 - Creating embeddings

## 1. Create a DataFrame with 4 sets of 3 distinct English sentences each. Sentences in the same set should be strongly related semantically, while sentences in different sets should be semantically distant.

In [1]:
import pandas as pd

# Sentence data organized in a dictionary, with the set names in English
data = {
    "Conjunto": ["Technology and Innovation", "Technology and Innovation", "Technology and Innovation",
                 "Climate Change and Environment", "Climate Change and Environment", "Climate Change and Environment",
                 "Health and Well-being", "Health and Well-being", "Health and Well-being",
                 "Travel and Culture", "Travel and Culture", "Travel and Culture"],
    "Frase": ["Advancements in artificial intelligence are transforming industries.",
              "The development of quantum computing holds the potential to revolutionize data processing.",
              "Innovative technologies like blockchain are reshaping financial transactions.",
              "Global warming is leading to more extreme weather patterns.",
              "Deforestation contributes significantly to the increase in atmospheric carbon dioxide levels.",
              "Renewable energy sources are crucial for reducing greenhouse gas emissions.",
              "Regular exercise is key to maintaining a healthy lifestyle.",
              "Mental health awareness is becoming increasingly important in society.",
              "Balanced nutrition is essential for physical and mental well-being.",
              "Exploring different cultures enriches our understanding of the world.",
              "Travel restrictions have impacted international tourism significantly.",
              "Learning a new language opens up opportunities for cultural exchange."]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

                          Conjunto  \
0        Technology and Innovation   
1        Technology and Innovation   
2        Technology and Innovation   
3   Climate Change and Environment   
4   Climate Change and Environment   
5   Climate Change and Environment   
6            Health and Well-being   
7            Health and Well-being   
8            Health and Well-being   
9               Travel and Culture   
10              Travel and Culture   
11              Travel and Culture   

                                                Frase  
0   Advancements in artificial intelligence are tr...  
1   The development of quantum computing holds the...  
2   Innovative technologies like blockchain are re...  
3   Global warming is leading to more extreme weat...  
4   Deforestation contributes significantly to the...  
5   Renewable energy sources are crucial for reduc...  
6   Regular exercise is key to maintaining a healt...  
7   Mental health awareness is becoming increasing...  
8
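
The construction above is meant to yield 4 sets of 3 sentences each. A quick sanity check, sketched here on a small stand-in frame (the `toy` data and its sentences are placeholders, not the notebook's data), counts the rows per set and prints the sentences without the `...` truncation seen in the output above:

```python
import pandas as pd

# Stand-in data with the same column names as the notebook's DataFrame
# (the values are placeholders for illustration only).
toy = pd.DataFrame({
    "Conjunto": ["A", "A", "A", "B", "B", "B"],
    "Frase": ["a1", "a2", "a3", "b1", "b2", "b3"],
})

# Number of sentences per set; every set should have the same count.
counts = toy.groupby("Conjunto").size()
print(counts)

# Temporarily lift the column-width limit so long sentences print in full.
with pd.option_context("display.max_colwidth", None):
    print(toy)
```

With the notebook's own data, the same two steps would be applied to `df` instead of `toy`.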

## 2. Tokenizing the text.

In [2]:
from transformers import AutoTokenizer

# Load the tokenizer for the chosen checkpoint
# (the DistilBERT checkpoint is kept here as a commented-out alternative)
#model_ckpt = "distilbert-base-uncased"
model_ckpt = 'sentence-transformers/all-MiniLM-L6-v2'

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)



In [3]:
# Define the function that tokenizes a batch
def tokenize(batch):
    return tokenizer(batch["Frase"], padding=True, truncation=True)

In [4]:
# Apply the tokenization function to the data
data_encoded = data.copy()
tokenized_outputs = tokenize(data_encoded)
tokenized_outputs

{'input_ids': [[101, 12607, 2015, 1999, 7976, 4454, 2024, 17903, 6088, 1012, 102, 0, 0, 0, 0, 0], [101, 1996, 2458, 1997, 8559, 9798, 4324, 1996, 4022, 2000, 4329, 4697, 2951, 6364, 1012, 102], [101, 9525, 6786, 2066, 3796, 24925, 2078, 2024, 24501, 3270, 4691, 3361, 11817, 1012, 102, 0], [101, 3795, 12959, 2003, 2877, 2000, 2062, 6034, 4633, 7060, 1012, 102, 0, 0, 0, 0], [101, 13366, 25794, 16605, 6022, 2000, 1996, 3623, 1999, 12483, 6351, 14384, 3798, 1012, 102, 0], [101, 13918, 2943, 4216, 2024, 10232, 2005, 8161, 16635, 3806, 11768, 1012, 102, 0, 0, 0], [101, 3180, 6912, 2003, 3145, 2000, 8498, 1037, 7965, 9580, 1012, 102, 0, 0, 0, 0], [101, 5177, 2740, 7073, 2003, 3352, 6233, 2590, 1999, 2554, 1012, 102, 0, 0, 0, 0], [101, 12042, 14266, 2003, 6827, 2005, 3558, 1998, 5177, 2092, 1011, 2108, 1012, 102, 0, 0], [101, 11131, 2367, 8578, 4372, 13149, 2229, 2256, 4824, 1997, 1996, 2088, 1012, 102, 0, 0], [101, 3604, 9259, 2031, 19209, 2248, 6813, 6022, 1012, 102, 0, 0, 0, 0, 0, 0], [101,

In [5]:
# Store input_ids and attention_mask in data_encoded
data_encoded['input_ids'] = tokenized_outputs['input_ids']
data_encoded['attention_mask'] = tokenized_outputs['attention_mask']
data_encoded.keys()

dict_keys(['Conjunto', 'Frase', 'input_ids', 'attention_mask'])
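
With `padding=True`, the tokenizer pads every sequence in the batch to the length of the longest one and returns an `attention_mask` holding 1 for real tokens and 0 for padding. The following is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the pad id of 0 is taken from the `input_ids` output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build attention masks.

    Hypothetical helper for illustration; the real tokenizer does this internally.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two short made-up id sequences of different lengths
# (101 and 102 are the [CLS] and [SEP] ids seen in the output above).
batch = pad_batch([[101, 7, 8, 102], [101, 7, 102]])
print(batch["input_ids"])       # both rows now have the same length
print(batch["attention_mask"])  # 0 marks the padded position
```

The mask is what later allows the pooling step to ignore padded positions.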

In [6]:
for frase, input_id in zip(data_encoded["Frase"], data_encoded["input_ids"]):
    print(f"Frase: {frase}, \nInput IDs: {tokenizer.convert_ids_to_tokens(input_id)}\n")

Frase: Advancements in artificial intelligence are transforming industries., 
Input IDs: ['[CLS]', 'advancement', '##s', 'in', 'artificial', 'intelligence', 'are', 'transforming', 'industries', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

Frase: The development of quantum computing holds the potential to revolutionize data processing., 
Input IDs: ['[CLS]', 'the', 'development', 'of', 'quantum', 'computing', 'holds', 'the', 'potential', 'to', 'revolution', '##ize', 'data', 'processing', '.', '[SEP]']

Frase: Innovative technologies like blockchain are reshaping financial transactions., 
Input IDs: ['[CLS]', 'innovative', 'technologies', 'like', 'block', '##chai', '##n', 'are', 'res', '##ha', '##ping', 'financial', 'transactions', '.', '[SEP]', '[PAD]']

Frase: Global warming is leading to more extreme weather patterns., 
Input IDs: ['[CLS]', 'global', 'warming', 'is', 'leading', 'to', 'more', 'extreme', 'weather', 'patterns', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PA
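
In the output above, pieces prefixed with `##` continue the previous token (for example `block`, `##chai`, `##n`). Below is a small sketch of how such pieces could be merged back into whole words for inspection. `merge_wordpieces` is a hypothetical helper, not part of `transformers`, and the set of special tokens it skips is an assumption based on the tokens printed above.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}  # as seen in the printed tokens


def merge_wordpieces(tokens):
    """Join '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            continue  # drop markers that carry no text
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' and append to the previous word
        else:
            words.append(tok)
    return words


tokens = ['[CLS]', 'innovative', 'technologies', 'like', 'block', '##chai', '##n',
          'are', 'res', '##ha', '##ping', 'financial', 'transactions', '.',
          '[SEP]', '[PAD]']
merged = merge_wordpieces(tokens)
print(merged)
```

In practice, the tokenizer's own `decode` method with `skip_special_tokens=True` serves a similar purpose.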

## 3. Obtaining the embeddings.

In [7]:
import torch
from transformers import AutoModel

# Load the model for the checkpoint selected earlier (model_ckpt)
#model_ckpt = "distilbert-base-uncased"

# Use the GPU if one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

In [8]:
# Function that extracts the last hidden layer (only the [CLS] token representation)
def extract_hidden_states(batch):
    # Place model inputs on the selected device (GPU or CPU)
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    # Extract last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}

In [35]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
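
To see why the attention mask matters, here is a small hand-checkable sketch with made-up numbers (the tensors are illustrative, not model output). It restates the mean-pooling function above and compares it with a plain mean over all positions, which would be pulled toward the padding vector.

```python
import torch


def mean_pooling(model_output, attention_mask):
    # Same logic as the function above: masked average over the token axis.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)


# One sequence, three positions, two dimensions; the last position is padding.
token_embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

# The function only reads element 0, so a 1-tuple stands in for the model output.
pooled = mean_pooling((token_embeddings,), attention_mask)
naive = token_embeddings.mean(dim=1)

print(pooled)  # averages only the two real tokens
print(naive)   # also averages in the padding vector
```

The `torch.clamp(..., min=1e-9)` term only guards against division by zero for a sequence whose mask is all zeros.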

In [21]:
# Function that extracts the last hidden layer and mean-pools it over the tokens
def extract_hidden_states_mean_pooling(batch):
    # Place model inputs on the selected device (GPU or CPU)
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    # Run the model without tracking gradients
    with torch.no_grad():
        model_output = model(**inputs)
    # Average the token embeddings, ignoring padded positions
    return {"hidden_state": mean_pooling(model_output, inputs['attention_mask']).cpu().numpy()}

In [9]:
# Convert input_ids and attention_mask to tensors
data_encoded['input_ids'] = torch.tensor(data_encoded['input_ids'])
data_encoded['attention_mask'] = torch.tensor(data_encoded['attention_mask'])

In [22]:
# Extract the mean-pooled last hidden layer for data_encoded and store it in hidden_state
hidden_state = extract_hidden_states_mean_pooling(data_encoded)

In [23]:
# Convert hidden_state to a tensor
data_hidden = data_encoded.copy()
data_hidden['hidden_state'] = torch.tensor(hidden_state['hidden_state'])
data_hidden

{'Conjunto': ['Technology and Innovation',
  'Technology and Innovation',
  'Technology and Innovation',
  'Climate Change and Environment',
  'Climate Change and Environment',
  'Climate Change and Environment',
  'Health and Well-being',
  'Health and Well-being',
  'Health and Well-being',
  'Travel and Culture',
  'Travel and Culture',
  'Travel and Culture'],
 'Frase': ['Advancements in artificial intelligence are transforming industries.',
  'The development of quantum computing holds the potential to revolutionize data processing.',
  'Innovative technologies like blockchain are reshaping financial transactions.',
  'Global warming is leading to more extreme weather patterns.',
  'Deforestation contributes significantly to the increase in atmospheric carbon dioxide levels.',
  'Renewable energy sources are crucial for reducing greenhouse gas emissions.',
  'Regular exercise is key to maintaining a healthy lifestyle.',
  'Mental health awareness is becoming increasingly important

In [24]:
# Dimensions of data_hidden['hidden_state']
data_hidden['hidden_state'].size()

torch.Size([12, 384])

## 4. Computing the distance between embeddings.

In [34]:
# Take the embeddings of the first two sentences (both from the same set)
embedding1_tensor = data_hidden['hidden_state'][0]
embedding2_tensor = data_hidden['hidden_state'][1]

# Normalize the embeddings to unit length
embedding1_norm = embedding1_tensor / embedding1_tensor.norm()
embedding2_norm = embedding2_tensor / embedding2_tensor.norm()

# Compute the cosine similarity
cosine_similarity = torch.dot(embedding1_norm, embedding2_norm)

print(f"Cosine similarity: {cosine_similarity.item()}")


Cosine similarity: 0.36288851499557495
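
The cell above compares only one pair. To check the premise of section 1 (high similarity within a set, low similarity across sets), one option is to compute the full pairwise matrix and compare the average within-set and between-set similarity. This is a sketch on synthetic stand-in vectors, so it runs without the model. `cosine_matrix` and `within_between` are hypothetical helpers; with the notebook's data one would pass `data_hidden['hidden_state'].numpy()` and `data['Conjunto']` instead.

```python
import numpy as np


def cosine_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T


def within_between(sim, labels):
    """Mean similarity of same-label pairs (i != j) vs. different-label pairs."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    return sim[same & off_diag].mean(), sim[~same].mean()


# Synthetic stand-in: 2 groups of 3 vectors scattered around different centers.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
labels = ["A", "A", "A", "B", "B", "B"]
embeddings = np.repeat(centers, 3, axis=0) + 0.05 * rng.normal(size=(6, 4))

sim = cosine_matrix(embeddings)
within, between = within_between(sim, labels)
print(f"within-set: {within:.3f}, between-set: {between:.3f}")
```

If the sets are well separated, the within-set average should be clearly higher than the between-set average.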


## 5. Strategy using Sentence-Transformers.

In [33]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(model_ckpt)
# normalize_embeddings=True returns unit-length vectors, so the dot product below is the cosine similarity
embeddings = model.encode(data['Frase'], normalize_embeddings=True)

# Compute the cosine similarity
cosine_similarity = torch.dot(torch.tensor(embeddings[0]), torch.tensor(embeddings[1]))
print(f"Cosine similarity: {cosine_similarity.item()}")

Cosine similarity: 0.36288851499557495
