### Add hard negatives to DeBERTa's training data

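Uses the SPECTER2 proximity adapter to embed each context:
https://huggingface.co/allenai/specter2_proximity

Ranks candidate neighbor contexts by cosine distance between their embeddings; questions taken from the nearest contexts are added as unanswerable hard negatives.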
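
As a minimal illustration of the cosine-distance ranking, the sketch below orders a few toy vectors by their distance to a query. The vectors and the helper names `cosine_distance` and `rank_by_distance` are hypothetical and are not part of the notebook, which relies on scikit-learn for this step.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_by_distance(query, candidates):
    """Sort candidate names from nearest to farthest from the query vector."""
    return sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))

# Toy vectors (hypothetical, not real SPECTER2 embeddings).
query = [1.0, 0.0]
candidates = {
    "opposite": [-1.0, 0.0],
    "same_direction": [2.0, 0.0],
    "orthogonal": [0.0, 1.0],
}
ranking = rank_by_distance(query, candidates)
```

Cosine distance ignores vector length, so `same_direction` ranks first even though its norm differs from the query's.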

!pip install -U adapter-transformers

In [1]:
from transformers import AutoTokenizer, AutoModel
import pickle
import re
from tqdm import tqdm
import numpy as np
import torch
import gc



In [2]:
with open('/home/jovyan/chatbot/_common/datasets/deberta_retrain/squad_format_train_upd.pkl', 'rb') as f:
    train_squad = pickle.load(f)
    
with open('/home/jovyan/chatbot/_common/datasets/deberta_retrain/squad_format_valid_upd.pkl', 'rb') as f:
    valid_squad = pickle.load(f)    

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2')

# Load the base model
model = AutoModel.from_pretrained('allenai/specter2')

# Load the task adapter, name it via load_as, and activate it
model.load_adapter("allenai/specter2_proximity", source="hf", load_as="specter2_proximity", set_active=True)

model.to('cuda')


In [4]:
specter2_embeddings = {}
all_data = train_squad + valid_squad

for d in tqdm(all_data):
    # Examples that share the first two id fields are treated as one document.
    paper_id = '_'.join(d['id'].split('_')[0:2])
    
    if paper_id not in specter2_embeddings:
        # Non-string titles (e.g. float NaN) become empty strings.
        t1 = d['title'] if isinstance(d['title'], str) else ''
        txt = t1 + tokenizer.sep_token + d['context'].replace(tokenizer.sep_token, '')
        
        inputs = tokenizer(
            [txt], 
            padding=True, 
            truncation=True,
            return_tensors="pt", 
            return_token_type_ids=False, 
            max_length=512).to('cuda')
        
        # Inference only: skip gradient tracking to save memory.
        with torch.no_grad():
            output = model(**inputs)
        # Use the first-token vector as the document embedding.
        embeddings = output.last_hidden_state[0, 0, :].cpu().numpy().astype(np.float16)
        specter2_embeddings[paper_id] = {
            'specter2_text': txt,
            'specter2_embedding': embeddings,
            }
                
with open('/home/jovyan/chatbot/_common/datasets/deberta_retrain/specter2_embeddings_upd.pkl', 'wb') as f:
    pickle.dump(specter2_embeddings, f)
        

100%|██████████| 45174/45174 [03:12<00:00, 234.85it/s]


In [7]:
len(specter2_embeddings)

11602
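
The 45,174 examples collapse to 11,602 embeddings because the dictionary key keeps only the first two underscore-separated fields of each example id. A small sketch of that key derivation, using an id that appears in this notebook's output; the helper name `paper_id_of` is hypothetical, and the notebook does not state what the second field means.

```python
def paper_id_of(example_id):
    """Keep the first two underscore-separated fields of an example id."""
    return "_".join(example_id.split("_")[:2])

example_id = "02c3a1d660128b694ce0aa97ab857eea1851193c_0_0_1"
paper_id = paper_id_of(example_id)

# Ids that differ only in later fields map to the same key (second id is made up).
same_key = paper_id_of("02c3a1d660128b694ce0aa97ab857eea1851193c_0_0_2") == paper_id
```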

In [9]:
import random
from sklearn.metrics.pairwise import cosine_distances

In [10]:
all_ids = list(specter2_embeddings.keys())
all_values = [specter2_embeddings[id]['specter2_embedding'] for id in all_ids]

In [11]:
dists = cosine_distances(all_values, all_values)

In [13]:
test_ids = set(['_'.join(el['id'].split('_')[0:2]) for el in valid_squad])

In [14]:
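# np.argpartition leaves the 20 nearest candidates unordered, so the 3 kept are among the 20 nearest, not necessarily the 3 nearest.
id2neig = {all_ids[i]: [all_ids[k] for k in np.argpartition(dists[i], 20)[:20] if k != i and all_ids[k] not in test_ids][:3] for i in range(len(dists))}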
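
`np.argpartition` only guarantees that the 20 smallest distances come first, not that they are ordered among themselves. If strictly nearest-first neighbors are wanted, one option is to sort the candidates explicitly. The sketch below shows that variant in plain Python on toy data; `nearest_ids` and its parameters are hypothetical, and switching to it would likely change the neighbors selected in this notebook.

```python
def nearest_ids(dist_row, self_index, ids, excluded, k=3, pool=20):
    """Return up to k ids closest to ids[self_index], nearest first.

    Only the `pool` closest indices are considered, mirroring the
    20-candidate pool used in the notebook.
    """
    order = sorted(range(len(dist_row)), key=lambda j: dist_row[j])[:pool]
    kept = [ids[j] for j in order if j != self_index and ids[j] not in excluded]
    return kept[:k]

# Toy data (hypothetical): distances from document "a" to every document.
ids = ["a", "b", "c", "d", "e"]
dist_row = [0.0, 0.4, 0.1, 0.3, 0.2]
neighbors = nearest_ids(dist_row, 0, ids, excluded={"c"}, k=2)
```

Here `"c"` is the closest document but is excluded, as validation documents are in the notebook, so the result starts with the next closest one.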

In [15]:
id2quest = {}
id2context = {}
for el in train_squad:
    paper_id = '_'.join(el['id'].split('_')[0:2])
    # Collect every training question asked about this document.
    id2quest.setdefault(paper_id, []).append(el['question'])
    # The last example seen for a key wins.
    id2context[paper_id] = (el['title'], el['context'])

In [16]:
key = list(id2quest.keys())[2300]
print(id2context[key])
print()
print()
for el in id2neig[key]:
    print()
    print(id2context[el])

('Key Principles', 'To allow PyDial to be applied to new problems easily, the PyDial architecture is designed to support three key principles: Domain Independence Wherever possible, the implementation of the dialogue modules is kept separate from the domain specification. Thus, the main functionality is domain independent, i.e., by simply using a different domain specification, simulated dialogues using belief tracker and policy are possible. To achieve this, the Ontology handles all domain-related functionality and is accessible system-wide. While this is completely true for the belief tracker, the policy, and the user simulator, the semantic decoder and the language generator inevitably have some domain-dependency and each needs domain-specific models to be loaded. To use PyDial, all relevant functionality can be controlled via a configuration file. This specifies the domains of the conversation, the variant of each domain module, which is used in the pipeline, and its parameters. Fo

In [17]:
# list of general questions to remove
manual_selected = '''
What kind of models were trained?
What is the article about?
What is the baseline method?
What is the paper about?
What is the purpose of the paper?
What is the mechanism that is being discussed in this paper?
What are the datasets discussed in the paper?
What is the main topic of the article?
What is the focus of the paper?
What is the purpose of the paper?
What is the focus of the study mentioned in the summary?
What is the model presented in the paper?
What is the paper about?
What is the purpose of the paper?
What is the context of the article?
What is the purpose of the algorithm introduced in the paper?
what is this algorithm used for? 
What is the context or topic of the article?
What is the article about?
What is the paper discussing?
What is the paper about?
What is the article about?
What is the focus of the paper? 
What is the paper examining?  
What is the proof about? 
What was the purpose of the experiments?
What is the purpose of the paper?
What is the topic of the article?
What is the proposed method in this paper? 
What is the paper discussing?
What is the topic of the paper?
What is the method presented in the paper?
What is the approach being examined in the paper?
What is the article discussing?
What is the method proposed in the paper?
What is the main focus of the paper?
What is the main topic of the article?
What is the paper evaluating?
What is the main focus of the article?
What is the main topic of the paper?
Thank you!
'''.split('?')
manual_selected = [m.strip()+'?' for m in manual_selected][:-1]
manual_selected

['What kind of models were trained?',
 'What is the article about?',
 'What is the baseline method?',
 'What is the paper about?',
 'What is the purpose of the paper?',
 'What is the mechanism that is being discussed in this paper?',
 'What are the datasets discussed in the paper?',
 'What is the main topic of the article?',
 'What is the focus of the paper?',
 'What is the purpose of the paper?',
 'What is the focus of the study mentioned in the summary?',
 'What is the model presented in the paper?',
 'What is the paper about?',
 'What is the purpose of the paper?',
 'What is the context of the article?',
 'What is the purpose of the algorithm introduced in the paper?',
 'what is this algorithm used for?',
 'What is the context or topic of the article?',
 'What is the article about?',
 'What is the paper discussing?',
 'What is the paper about?',
 'What is the article about?',
 'What is the focus of the paper?',
 'What is the paper examining?',
 'What is the proof about?',
 'What was
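
The parsing above splits on `?`, so whatever follows the last question mark (here the `Thank you!` line) ends up in a final fragment that `[:-1]` drops. A short sketch on a three-question toy string shows the mechanics and, as an optional extra that the notebook does not do, removes repeated questions.

```python
raw = '''
What is the paper about?
What is the article about?
What is the paper about?
Thank you!
'''

# Splitting on '?' leaves the text after the last question mark as a final fragment.
pieces = raw.split('?')
questions = [p.strip() + '?' for p in pieces][:-1]

# Optional: drop repeats while keeping first-seen order
# (dict keys preserve insertion order).
unique_questions = list(dict.fromkeys(questions))
```

Duplicates do not affect the membership tests used later, so deduplicating is cosmetic.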

In [18]:
def add_hard_negatives(ids, initial_set):
    """Append unanswerable items whose questions come from neighboring documents."""
    random.seed(5757)
    set_with_hard = initial_set.copy()
    for paper_id in tqdm(ids):
        try:
            # Pool the neighbors' questions, skipping each neighbor's first 2 (noted as common).
            closest_quests = sum([id2quest[idx][2:] for idx in id2neig[paper_id]], [])
            closest_quests = [q for q in closest_quests if q not in manual_selected]
            # Sample 2 questions, then keep only 1 of them in 75% of cases.
            closest_quests_sample = random.sample(closest_quests, 2)

            if random.random() < 0.75:
                closest_quests_sample = closest_quests_sample[:1]

            # Use the first item of this document as the template for the negatives.
            for el in initial_set:
                if '_'.join(el['id'].split('_')[0:2]) == paper_id:
                    new_item = el.copy()
                    break
            new_item['answers'] = {'text': [], 'answer_start': []}
            new_item['chat_gpt_answer'] = ''

            for q in closest_quests_sample:
                assert q not in manual_selected
                new_item_add = new_item.copy()
                new_item_add['question'] = q
                set_with_hard.append(new_item_add)
        except (KeyError, ValueError):
            # KeyError: missing neighbor or question list; ValueError: fewer than 2 candidates to sample.
            continue
    
    # Replace non-string titles (e.g. NaN) with empty strings.
    for entry in set_with_hard:
        entry['title'] = '' if type(entry['title']) != str else entry['title']
    return set_with_hard
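
The core of the function can be sketched on toy data: copy an example, replace its question with one taken from a neighboring document, and clear the answers so the item becomes unanswerable. The helper names and data below are hypothetical; the sketch omits the `chat_gpt_answer` field and uses a local `random.Random` instance instead of seeding the global generator.

```python
import copy
import random

def make_hard_negative(example, question):
    """Copy a SQuAD-style example, swap in a foreign question, clear the answers."""
    negative = copy.deepcopy(example)
    negative['question'] = question
    negative['answers'] = {'text': [], 'answer_start': []}
    return negative

def sample_negative_questions(pool, rng, p_single=0.75):
    """Draw two distinct questions, then keep only the first with probability p_single."""
    sample = rng.sample(pool, 2)
    return sample[:1] if rng.random() < p_single else sample

# Toy data (hypothetical).
example = {
    'id': 'paperA_0_0_0',
    'title': 'abstract',
    'context': 'Context of paper A.',
    'question': 'What loss is used?',
    'answers': {'text': ['L1'], 'answer_start': [0]},
}
neighbor_questions = [
    'Which optimizer is used?',
    'How many layers does the model have?',
    'What is the batch size?',
]

rng = random.Random(5757)
negatives = [
    make_hard_negative(example, q)
    for q in sample_negative_questions(neighbor_questions, rng)
]
```

Because the context stays the same while the question comes from a topically close document, the model has to learn to return no answer rather than matching on surface similarity.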

In [19]:
def filt_non_negative(data):
    # Drop unanswerable items whose question is one of the generic questions in manual_selected.
    return [el for el in data if not (el['question'] in manual_selected and len(el['answers']['answer_start']) == 0)]

In [20]:
train_ids = set(['_'.join(el['id'].split('_')[0:2]) for el in train_squad])
train_with_hard = add_hard_negatives(train_ids, train_squad)
train_with_hard = filt_non_negative(train_with_hard)

100%|██████████| 11202/11202 [01:31<00:00, 121.97it/s]


In [21]:
len(train_squad), len(train_with_hard)

(43615, 57306)

In [22]:
train_with_hard[-2]

{'id': '02c3a1d660128b694ce0aa97ab857eea1851193c_0_0_1',
 'title': 'abstract',
 'context': 'Temporal irradiance variations are useful for finding dense stereo correspondences. These variations can be created artificially using structured light. They also occur naturally underwater. We introduce a variational optimization formulation for finding a dense stereo correspondence field. It is based on multi-frame optical flow, adapted to stereo. The formulation uses a sequence of stereo frames, and yields dense and robust results. The inherent aperture problem of optical flow is resolved using a temporal sequence of stereo frame-pairs. The results are achieved even without considering epi-polar geometry. The method has the ability to handle dynamic stereo underwater, in harsh conditions of flickering illumination. The method is demonstrated experimentally both outdoors and indoors. We use the L 1 norm, as described in Sec. 2. Therefore, Ψ is determined according to Eq. (3) in our scheme. Ste

In [23]:
with open('/home/jovyan/chatbot/_common/datasets/deberta_retrain/squad_format_train_withhn_filt_upd.pkl', 'wb') as f:
    pickle.dump(train_with_hard, f)

In [24]:
test_ids = set(['_'.join(el['id'].split('_')[0:2]) for el in valid_squad])
valid_with_hard = add_hard_negatives(test_ids, valid_squad)
valid_with_hard = filt_non_negative(valid_with_hard)

100%|██████████| 400/400 [00:00<00:00, 3506.27it/s]


In [25]:
len(valid_squad), len(valid_with_hard)

(1559, 2039)
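
The reported sizes can be sanity-checked against the sampling rule: each document contributes one negative with probability 0.75 and two otherwise, so about 1.25 per document if none are skipped. The sketch below compares that figure with the net growth of each split, using only counts printed in this notebook. The net growth also reflects skipped documents and anything removed by `filt_non_negative`, so only rough agreement should be expected.

```python
# One negative with probability 0.75, two with probability 0.25.
expected_per_document = 0.75 * 1 + 0.25 * 2

# Counts copied from the notebook outputs.
counts = {
    'train': {'documents': 11202, 'before': 43615, 'after': 57306},
    'valid': {'documents': 400, 'before': 1559, 'after': 2039},
}

for split, c in counts.items():
    c['net_added'] = c['after'] - c['before']
    c['expected_if_none_skipped'] = c['documents'] * expected_per_document
    print(split, c['net_added'], c['expected_if_none_skipped'])
```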

In [26]:
valid_with_hard[-2]

{'id': '1cb4e99c30e602be6af42b559d7b667307d1a853_7_0_1',
 'title': 'Experiments',
 'context': 'For all experiments, we use a radial basis-function (RBF) kernel as in [15] , i.e., k(x, x ) = exp(− 1 h x − x 2 2 ) , where the bandwidth, h, is the median of pairwise distances between current samples. q 0 (θ) and q 0 (ξ) are set to isotropic Gaussian distributions. We share the samples of ξ across data points, i.e., ξ jn = ξ j , for n = 1, . . . , N (this is not necessary, but it saves computation). The samples of θ and z, and parameters of the recognition model, η, are optimized via Adam [9] with learning rate 0.0002. We do not perform any dataset-specific tuning or regularization other than dropout [32] and early stopping on validation sets. We set M = 100 and k = 50, and use minibatches of size 64 for all experiments, unless otherwise specified. 1 −2 and σ = 0.1. The recognition model f η (x n , ξ j ) is specified as a multi-layer perceptron (MLP) with 100 hidden units, by first concate

In [27]:
with open('/home/jovyan/chatbot/_common/datasets/deberta_retrain/squad_format_valid_withhn_filt_upd.pkl', 'wb') as f:
    pickle.dump(valid_with_hard, f)