# 1. question과 유사한 context token Masking

## 1.0 데이터셋 불러오기

In [1]:
from datasets import load_from_disk

datasets = load_from_disk('/opt/ml/data/train_dataset')
train_dataset = datasets['train']

## 1.1 `sentence_transformers`를 이용한 유사도 계산

### 1.1.1 구현

In [2]:
from transformers import AutoTokenizer, AutoConfig, AutoModel
from sentence_transformers import SentenceTransformer, util
import torch
import numpy as np

In [3]:
model_name = 'klue/roberta-large'

tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, config=config)
model.to('cuda')

Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it f

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(32000, 1024, padding_idx=1)
    (position_embeddings): Embedding(514, 1024, padding_idx=1)
    (token_type_embeddings): Embedding(1, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (d

In [4]:
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(outputs, attention_mask):
    token_embeddings = outputs[0]
    attention_mask = attention_mask.unsqueeze(-1)
    embeddings = torch.sum(token_embeddings * attention_mask, 1)
    sum_masks = torch.clamp(attention_mask.sum(1), min=1e-9)
    return embeddings/sum_masks

In [5]:
def get_embedding(sentence):
    tokenized_sentence = tokenizer(
        sentence,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors='pt'    
    )

    with torch.no_grad():
        outputs = model(input_ids=tokenized_sentence['input_ids'].to('cuda'))

    def mean_pooling(outputs, attention_mask):
        token_embeddings = outputs[0]
        attention_mask = attention_mask.unsqueeze(-1)
        embeddings = torch.sum(token_embeddings * attention_mask, 1)
        sum_masks = torch.clamp(attention_mask.sum(1), min=1e-9)
        return embeddings/sum_masks

    return mean_pooling(outputs, tokenized_sentence['attention_mask'].to('cuda'))

In [18]:
rnd_idx = np.random.choice(len(train_dataset))
#rnd_idx = 0
sample = train_dataset[rnd_idx]
answer = sample['answers']['text'][0]

question_embeddings = get_embedding(sample['question'])

tokenized_contexts = tokenizer.tokenize(
    sample['context'],
    padding=True,
    truncation=True,
    max_length=800,
    return_tensors='pt'       
)
context_token_embeddings = get_embedding(tokenized_contexts)

cosine_scores = util.pytorch_cos_sim(question_embeddings, context_token_embeddings).squeeze(0)
sim_indices = torch.argsort(cosine_scores)
tokens = [(token, score.item()) for token, score in zip(tokenized_contexts, cosine_scores)]
tokens_subset = [(token, score) for token, score in tokens if '#' not in token]
tokens_subset = sorted(tokens_subset, key=lambda x: x[1], reverse=True)

for token, score in tokens_subset[:10]:
    print(f"q:{sample['question']} | a:{answer} | token: {token} : {score:.4f}")

q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 전쟁 : 0.7302
q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 가하 : 0.7275
q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 제국 : 0.7273
q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 제국 : 0.7273
q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 일본 : 0.7267
q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 일본 : 0.7267
q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 일본 : 0.7267
q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 이 : 0.7267
q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 이 : 0.7267
q:1940년부터 독일과 함께 연합군을 공격한 나라는? | a:일본 | token: 가한 : 0.7266


### 1.1.2 평가

유사도가 높다고 나온 토큰이 질문과 유사해보이지 않는다.