## Virtual environments
- BERT, SBERT, SimCSE: s-bert
- DiffCSE: diffcse
- PromCSE: promcse

In [1]:
import torch
import numpy as np
import csv
from tqdm.notebook import tqdm

from covid_q.code.get_align_uniform import *

In [2]:
align_csv = 'covid_q/data/positive_pairs.csv'
uniform_csv = 'covid_q/data/train4_uniform.csv'
uniform_all_csv = 'covid_q/data/final_master_dataset.csv'

## BERT

In [None]:
from transformers import BertModel, BertTokenizer

In [3]:
# BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

tokenizer_add = BertTokenizer.from_pretrained('bert-base-uncased')
model_add = BertModel.from_pretrained('bert-base-uncased')

# Add vocabulary
new_tokens = ['covid']
num_added_toks = tokenizer_add.add_tokens(new_tokens)
model_add.resize_token_embeddings(len(tokenizer_add))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Embedding(30523, 768)
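
The `Embedding(30523, 768)` output reflects one extra row for the new `covid` token on top of the 30522 entries in the `bert-base-uncased` vocabulary. The following is a minimal sketch of what resizing an embedding table amounts to, written in plain PyTorch. The helper name `resize_embedding_sketch` is hypothetical, and the random initialization of the new row is an assumption of this sketch; the library's own initialization may differ.

```python
import torch
from torch import nn


def resize_embedding_sketch(old: nn.Embedding, new_num_tokens: int) -> nn.Embedding:
    """Hypothetical helper: build a larger table and copy the existing rows over."""
    new = nn.Embedding(new_num_tokens, old.embedding_dim)
    n = min(old.num_embeddings, new_num_tokens)
    with torch.no_grad():
        # Rows for existing tokens are preserved; any extra rows keep
        # their fresh random initialization (assumption of this sketch).
        new.weight[:n] = old.weight[:n]
    return new


old_emb = nn.Embedding(30522, 768)   # same shape as the bert-base-uncased word embeddings
new_emb = resize_embedding_sketch(old_emb, 30523)
print(new_emb)  # Embedding(30523, 768)
```

Because the added row has never been trained, the embedding of `covid` carries no learned meaning until the model is fine-tuned, which may be one reason the "Vocab added" scores differ from the baseline.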

### Alignment

In [4]:
%%time
# With added vocab
q1_emb_add, q2_emb_add = get_all_embeddings_align(align_csv, tokenizer_add, model_add)
print('Alignment (Vocab added):', align_loss(q1_emb_add, q2_emb_add).numpy())

# Without added vocab
q1_emb, q2_emb = get_all_embeddings_align(align_csv, tokenizer, model)
print('Alignment:', align_loss(q1_emb, q2_emb).numpy())

Alignment (Vocab added): 41.089592
Alignment: 36.87594
CPU times: user 38min 32s, sys: 1min 24s, total: 39min 57s
Wall time: 1min 44s
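
`align_loss` comes from `covid_q.code.get_align_uniform`, whose source is not shown here. A minimal sketch of the alignment metric as it is commonly defined (mean distance between positive pairs raised to a power `alpha`) might look like the following. The name `align_loss_sketch` and the default `alpha=2` are assumptions, not the project's confirmed implementation.

```python
import torch


def align_loss_sketch(x: torch.Tensor, y: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Mean of ||x_i - y_i||^alpha over positive pairs (rows of x and y)."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()


# Two toy positive pairs: the first pair is sqrt(2) apart, the second is identical.
x = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
y = torch.tensor([[0.0, 1.0], [0.0, 1.0]])
print(align_loss_sketch(x, y))  # squared distances 2 and 0, so the mean is about 1.0
```

Under this formulation, unit-norm embeddings can never score above 4 (the largest squared distance between two points on the unit sphere). Values such as 36.9 or 41.1 would therefore suggest that the BERT embeddings here are not L2-normalized, if the project uses the same definition.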


### Uniformity

In [5]:
%%time
# Uses train4_uniform.csv

# With added vocab
q_emb_add = get_all_embeddings_uniform(uniform_csv, tokenizer_add, model_add)
print('Uniformity (Vocab added):', uniform_loss(q_emb_add).numpy())

# Without added vocab
q_emb = get_all_embeddings_uniform(uniform_csv, tokenizer, model)
print('Uniformity:', uniform_loss(q_emb).numpy())

Uniformity (Vocab added): -14.242066
Uniformity: -14.06199
CPU times: user 12min 47s, sys: 26.7 s, total: 13min 13s
Wall time: 34.7 s


In [6]:
%%time
# Uses final_master_dataset.csv

# With added vocab
q_emb_add = get_all_embeddings_uniform_all(uniform_all_csv, tokenizer_add, model_add)
print('Uniformity (All) (Vocab added):', uniform_loss(q_emb_add).numpy())

# Without added vocab
q_emb = get_all_embeddings_uniform_all(uniform_all_csv, tokenizer, model)
print('Uniformity (All):', uniform_loss(q_emb).numpy())

Uniformity (All) (Vocab added): -11.433103
Uniformity (All): -11.431488
CPU times: user 59min 13s, sys: 2min 4s, total: 1h 1min 17s
Wall time: 2min 40s
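
`uniform_loss` is likewise imported from `covid_q.code.get_align_uniform` and is not shown. A minimal sketch of the uniformity metric in its usual form (log of the mean Gaussian potential over all pairs) is given below. The name `uniform_loss_sketch` and the default `t=2` are assumptions.

```python
import torch


def uniform_loss_sketch(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """log of the mean of exp(-t * ||x_i - x_j||^2) over all distinct pairs."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()


# Two antipodal unit vectors: a single pair with squared distance 4.
spread = torch.tensor([[1.0, 0.0], [-1.0, 0.0]])
print(uniform_loss_sketch(spread))  # log(exp(-8)), about -8.0

# Fully collapsed embeddings: every pairwise distance is 0.
collapsed = torch.ones(3, 2)
print(uniform_loss_sketch(collapsed))  # log(1) = 0, the worst possible value
```

Lower values mean the embeddings are spread more evenly. With `t=2` and unit-norm embeddings, this sketch cannot go below -8, so scores near -14 would again point to unnormalized embeddings or a different `t` in the project's implementation.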


## SBERT

In [None]:
from sentence_transformers import SentenceTransformer

In [7]:
%%time

model_list = [   
    'nli-bert-base',
    'nli-roberta-base',
    'stsb-bert-base',
    'stsb-roberta-base',
    'bert-base-nli-stsb-mean-tokens',
    'roberta-base-nli-stsb-mean-tokens'
    ]

test_sentence = 'This framework generates embeddings for each input sentence'

for model_name in model_list:
    
    model = SentenceTransformer(model_name)
    
    # Check the embedding dimension
    if model.encode(test_sentence).shape[0] != 768:
        print(f'Embedding dimension for {model_name} is not 768')
        break
        
    # Alignment
    q1_emb, q2_emb = get_sentence_embedding_align(align_csv, model)
    print(f'[{model_name}] Alignment:', align_loss(q1_emb, q2_emb).numpy())
    
    # Uniformity
    q_emb = get_sentence_embedding_uniform(uniform_csv, model) # uses train4_uniform.csv
    print(f'[{model_name}] Uniformity:', uniform_loss(q_emb).numpy())

    q_emb = get_sentence_embedding_uniform_all(uniform_all_csv, model) # uses final_master_dataset.csv
    print(f'[{model_name}] Uniformity (All):', uniform_loss(q_emb).numpy())

12/04/2022 21:38:21 - INFO - sentence_transformers.SentenceTransformer -   Load pretrained SentenceTransformer: nli-bert-base
12/04/2022 21:38:22 - INFO - sentence_transformers.SentenceTransformer -   Use pytorch device: cuda
[nli-bert-base] Alignment: 152.76573
[nli-bert-base] Uniformity: -15.318449
[nli-bert-base] Uniformity (All): -11.435321

12/04/2022 21:38:27 - INFO - sentence_transformers.SentenceTransformer -   Load pretrained SentenceTransformer: nli-roberta-base
12/04/2022 21:38:29 - INFO - sentence_transformers.SentenceTransformer -   Use pytorch device: cuda
[nli-roberta-base] Alignment: 272.89117
[nli-roberta-base] Uniformity: -16.950438
[nli-roberta-base] Uniformity (All): -11.436182

12/04/2022 21:38:31 - INFO - sentence_transformers.SentenceTransformer -   Load pretrained SentenceTransformer: stsb-bert-base
12/04/2022 21:38:32 - INFO - sentence_transformers.SentenceTransformer -   Use pytorch device: cuda
[stsb-bert-base] Alignment: 165.91066
[stsb-bert-base] Uniformity: -17.821161
[stsb-bert-base] Uniformity (All): -11.436219

12/04/2022 21:38:35 - INFO - sentence_transformers.SentenceTransformer -   Load pretrained SentenceTransformer: stsb-roberta-base
12/04/2022 21:38:36 - INFO - sentence_transformers.SentenceTransformer -   Use pytorch device: cuda
[stsb-roberta-base] Alignment: 242.88684
[stsb-roberta-base] Uniformity: -17.610994
[stsb-roberta-base] Uniformity (All): -11.436294

12/04/2022 21:38:39 - INFO - sentence_transformers.SentenceTransformer -   Load pretrained SentenceTransformer: bert-base-nli-stsb-mean-tokens
12/04/2022 21:38:40 - INFO - sentence_transformers.SentenceTransformer -   Use pytorch device: cuda
[bert-base-nli-stsb-mean-tokens] Alignment: 165.91066
[bert-base-nli-stsb-mean-tokens] Uniformity: -17.821161
[bert-base-nli-stsb-mean-tokens] Uniformity (All): -11.436219

12/04/2022 21:38:42 - INFO - sentence_transformers.SentenceTransformer -   Load pretrained SentenceTransformer: roberta-base-nli-stsb-mean-tokens
12/04/2022 21:38:44 - INFO - sentence_transformers.SentenceTransformer -   Use pytorch device: cuda
[roberta-base-nli-stsb-mean-tokens] Alignment: 242.88684
[roberta-base-nli-stsb-mean-tokens] Uniformity: -17.610994
[roberta-base-nli-stsb-mean-tokens] Uniformity (All): -11.436294

CPU times: user 1min 18s, sys: 11.3 s, total: 1min 30s
Wall time: 25.1 s


## SimCSE

In [None]:
from simcse import SimCSE

In [8]:
%%time

model_list = [   
    'princeton-nlp/unsup-simcse-bert-base-uncased',
    'princeton-nlp/unsup-simcse-roberta-base',
    'princeton-nlp/sup-simcse-bert-base-uncased',
    'princeton-nlp/sup-simcse-roberta-base'
    ]

test_sentence = 'This framework generates embeddings for each input sentence'

for model_name in model_list:
    
    model = SimCSE(model_name)
    
    # Check the embedding dimension
    if model.encode(test_sentence).shape[0] != 768:
        print(f'Embedding dimension for {model_name} is not 768')
        break
        
    # Alignment
    q1_emb, q2_emb = get_sentence_embedding_align(align_csv, model)
    print(f'[{model_name}] Alignment:', align_loss(q1_emb, q2_emb).numpy())
    
    # Uniformity
    q_emb = get_sentence_embedding_uniform(uniform_csv, model) # uses train4_uniform.csv
    print(f'[{model_name}] Uniformity:', uniform_loss(q_emb).numpy())

    q_emb = get_sentence_embedding_uniform_all(uniform_all_csv, model) # uses final_master_dataset.csv
    print(f'[{model_name}] Uniformity (All):', uniform_loss(q_emb).numpy())

12/04/2022 21:38:49 - INFO - simcse.tool -   Use `cls_before_pooler` for unsupervised models. If you want to use other pooling policy, specify `pooler` argument.
[princeton-nlp/unsup-simcse-bert-base-uncased] Alignment: 0.66342306
[princeton-nlp/unsup-simcse-bert-base-uncased] Uniformity: -2.160287
[princeton-nlp/unsup-simcse-bert-base-uncased] Uniformity (All): -2.1886444

12/04/2022 21:38:54 - INFO - simcse.tool -   Use `cls_before_pooler` for unsupervised models. If you want to use other pooling policy, specify `pooler` argument.
[princeton-nlp/unsup-simcse-roberta-base] Alignment: 0.53692436
[princeton-nlp/unsup-simcse-roberta-base] Uniformity: -1.660558
[princeton-nlp/unsup-simcse-roberta-base] Uniformity (All): -1.760116

[princeton-nlp/sup-simcse-bert-base-uncased] Alignment: 0.7114831
[princeton-nlp/sup-simcse-bert-base-uncased] Uniformity: -2.472804
[princeton-nlp/sup-simcse-bert-base-uncased] Uniformity (All): -2.553935

[princeton-nlp/sup-simcse-roberta-base] Alignment: 0.7651381
[princeton-nlp/sup-simcse-roberta-base] Uniformity: -2.6867628
[princeton-nlp/sup-simcse-roberta-base] Uniformity (All): -2.7603693

CPU times: user 38.4 s, sys: 5.51 s, total: 43.9 s
Wall time: 19.6 s


## DiffCSE

In [3]:
from DiffCSE.diffcse import DiffCSE

In [4]:
%%time

model_list = [   
    'voidism/diffcse-bert-base-uncased-sts',
    'voidism/diffcse-bert-base-uncased-trans',
    'voidism/diffcse-roberta-base-sts',
    'voidism/diffcse-roberta-base-trans'
    ]

test_sentence = 'This framework generates embeddings for each input sentence'

for model_name in model_list:
    
    model = DiffCSE(model_name)
    
    # Check the embedding dimension
    if model.encode(test_sentence).shape[0] != 768:
        print(f'Embedding dimension for {model_name} is not 768')
        break
        
    # Alignment
    q1_emb, q2_emb = get_sentence_embedding_align(align_csv, model)
    print(f'[{model_name}] Alignment:', align_loss(q1_emb, q2_emb).numpy())
    
    # Uniformity
    q_emb = get_sentence_embedding_uniform(uniform_csv, model) # uses train4_uniform.csv
    print(f'[{model_name}] Uniformity:', uniform_loss(q_emb).numpy())

    q_emb = get_sentence_embedding_uniform_all(uniform_all_csv, model) # uses final_master_dataset.csv
    print(f'[{model_name}] Uniformity (All):', uniform_loss(q_emb).numpy())

Some weights of BertModel were not initialized from the model checkpoint at voidism/diffcse-bert-base-uncased-sts and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
12/04/2022 21:53:50 - INFO - DiffCSE.diffcse.tool -   Use `cls_before_pooler` for DiffCSE models. If you want to use other pooling policy, specify `pooler` argument.
[voidism/diffcse-bert-base-uncased-sts] Alignment: 0.28357688
[voidism/diffcse-bert-base-uncased-sts] Uniformity: -0.95707244
[voidism/diffcse-bert-base-uncased-sts] Uniformity (All): -0.97734153

Some weights of BertModel were not initialized from the model checkpoint at voidism/diffcse-bert-base-uncased-trans and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
12/04/2022 21:54:04 - INFO - DiffCSE.diffcse.tool -   Use `cls_before_pooler` for DiffCSE models. If you want to use other pooling policy, specify `pooler` argument.
[voidism/diffcse-bert-base-uncased-trans] Alignment: 0.2766701
[voidism/diffcse-bert-base-uncased-trans] Uniformity: -0.8962289
[voidism/diffcse-bert-base-uncased-trans] Uniformity (All): -0.8996946

Some weights of RobertaModel were not initialized from the model checkpoint at voidism/diffcse-roberta-base-sts and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
12/04/2022 21:54:17 - INFO - DiffCSE.diffcse.tool -   Use `cls_before_pooler` for DiffCSE models. If you want to use other pooling policy, specify `pooler` argument.
[voidism/diffcse-roberta-base-sts] Alignment: 0.20578939
[voidism/diffcse-roberta-base-sts] Uniformity: -0.69976825
[voidism/diffcse-roberta-base-sts] Uniformity (All): -0.7303018

Some weights of RobertaModel were not initialized from the model checkpoint at voidism/diffcse-roberta-base-trans and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
12/04/2022 21:54:30 - INFO - DiffCSE.diffcse.tool -   Use `cls_before_pooler` for DiffCSE models. If you want to use other pooling policy, specify `pooler` argument.
[voidism/diffcse-roberta-base-trans] Alignment: 0.1858467
[voidism/diffcse-roberta-base-trans] Uniformity: -0.6210463
[voidism/diffcse-roberta-base-trans] Uniformity (All): -0.6481909

CPU times: user 48 s, sys: 14.6 s, total: 1min 2s
Wall time: 50.9 s


## PromCSE
- PromCSE has a different code structure (it is configured through args), so it is run directly as a Python script:
```
cd PromCSE
python promcse_encoder.py
```

- Results
 - Alignment: 0.90093094
 - Uniformity: -2.7525558
 - Uniformity (All): -2.8021905