# Test retrieval from centroid + BM25
This notebook runs the centroid retrieval script on top of BM25 results to produce a set of candidate documents that can be fed to downstream models or used directly for snippet retrieval.

## Imports

In [1]:
%load_ext autoreload
%autoreload 2

import copy
import logging
from pprint import pprint

from src.cemb_bm25 import centroid_retrieval
from src.elastic_search_utils.elastic_utils import load_json, save_json

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

In [2]:
!which python

/datasets/anaconda3/envs/tf2.8/bin/python


## Params

In [3]:
BM25_QUESTIONS = '/datasets/johan_tests_original_format/test_docs_10b-testset4.json'

ABSTRACT_WEIGHT = 0.7
TITLE_WEIGHT = 0.3

In [4]:
LOADING_FOLDER = '/datasets/johan_tests_original_format_centroid/merged_training_docs'

LOADING_ABSTRACT_W2V_PATH = f'{LOADING_FOLDER}/Bio_Word2Vec_doc_abstract_model_10b_train.bin'
LOADING_TITLE_W2V_PATH = f'{LOADING_FOLDER}/Bio_Word2Vec_doc_title_model_10b_train.bin'
LOADING_QUESTION_W2V_PATH = f'{LOADING_FOLDER}/Bio_Word2Vec_doc_question_model_10b_train.bin'

## Saving paths

In [5]:
SAVING_FOLDER = '/datasets/johan_tests_original_format_centroid/merged_training_docs'

SAVING_ORIGINAL_PATH = f'{SAVING_FOLDER}/test_original_10b-testset4.json'
SAVING_TOKENS_PATH = f'{SAVING_FOLDER}/test_tokens_10b-testset4.json'
SAVING_ENTITY_PATH = f'{SAVING_FOLDER}/test_entity_10b-testset4.json'

## Loading BM25 questions

In [6]:
questions = load_json(BM25_QUESTIONS)

## Extracting unique document info
Only documents that have an abstract are kept.

Two dictionaries are derived from these documents: one cleaned and tokenized, the other only cleaned.

In [7]:
unique_docs = centroid_retrieval.extract_unique_doc_info(
    questions=questions['questions']
)

Extracting unique doc info: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 45595.77it/s]
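
The abstract filter described above can be sketched as follows. This is a minimal illustration and not the actual `extract_unique_doc_info` implementation: the function name `collect_unique_docs` and the `title`/`abstract` keys are assumptions, and the sample data is made up.

```python
def collect_unique_docs(questions):
    """Return {doc_id: doc} for documents that have a non-empty abstract.

    Hypothetical helper: the real script may use different keys or cleaning.
    """
    unique = {}
    for question in questions:
        for doc in question.get("documents", []):
            abstract = (doc.get("abstract") or "").strip()
            if not abstract:
                continue  # documents without an abstract are dropped
            # Keep the first occurrence of each document id
            unique.setdefault(doc["id"], doc)
    return unique


# Made-up sample: doc "2" has no abstract, doc "1" appears twice
sample_questions = [
    {"id": "q1", "documents": [
        {"id": "1", "title": "A title", "abstract": "Some abstract."},
        {"id": "2", "title": "No abstract here", "abstract": ""},
    ]},
    {"id": "q2", "documents": [
        {"id": "1", "title": "A title", "abstract": "Some abstract."},
    ]},
]

sample_unique_docs = collect_unique_docs(sample_questions)
print(sorted(sample_unique_docs))
```

Keying the result by document id means each document is processed once in the later steps, even when BM25 returns it for several questions.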


## Extracting tokens for valid documents

In [8]:
%%time
tokenized_unique_docs = centroid_retrieval.docs_to_tokens(
    unique_docs=unique_docs,
    n_jobs=16
)

[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done   9 tasks      | elapsed:    4.8s
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    4.8s
[Parallel(n_jobs=16)]: Done  29 tasks      | elapsed:    4.8s
[Parallel(n_jobs=16)]: Done  40 tasks      | elapsed:    4.9s
[Parallel(n_jobs=16)]: Batch computation too fast (0.1971s.) Setting batch_size=2.
[Parallel(n_jobs=16)]: Done  53 tasks      | elapsed:    4.9s
[Parallel(n_jobs=16)]: Done  66 tasks      | elapsed:    4.9s
[Parallel(n_jobs=16)]: Batch computation too fast (0.0262s.) Setting batch_size=4.
[Parallel(n_jobs=16)]: Done  85 tasks      | elapsed:    4.9s
[Parallel(n_jobs=16)]: Done 114 tasks      | elapsed:    4.9s
[Parallel(n_jobs=16)]: Batch computation too fast (0.0464s.) Setting batch_size=8.
[Parallel(n_jobs=16)]: Done 154 tasks      | elapsed:    4.9s
[Parallel(n_jobs=16)]: Done 216 tasks      | elapsed:    5.0s
[Parallel(n_jobs=16)]: Batch computation too fast (0.0

CPU times: user 574 ms, sys: 236 ms, total: 810 ms
Wall time: 6.28 s


[Parallel(n_jobs=16)]: Done 6381 tasks      | elapsed:    6.1s
[Parallel(n_jobs=16)]: Done 6585 tasks      | elapsed:    6.2s
[Parallel(n_jobs=16)]: Done 6730 tasks      | elapsed:    6.2s
[Parallel(n_jobs=16)]: Done 6885 tasks      | elapsed:    6.2s
[Parallel(n_jobs=16)]: Done 7098 out of 7098 | elapsed:    6.2s finished
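
For orientation, a cleaning and tokenization step of this kind might look like the sketch below. The rule used here (lowercase, drop every character that is not a letter or whitespace, split on whitespace) is an assumption inferred from tokens such as `enzymelinked` in the saved output; the real `docs_to_tokens` may differ, and `simple_tokenize` is a hypothetical name.

```python
import re

# Anything that is not a lowercase ASCII letter or whitespace is removed.
# Assumption: this mirrors the cleaning suggested by the saved tokens.
_NON_LETTERS = re.compile(r"[^a-z\s]")


def simple_tokenize(text):
    """Lowercase the text, strip digits and punctuation, split on whitespace."""
    cleaned = _NON_LETTERS.sub("", text.lower())
    return cleaned.split()


# Made-up example sentence
example_tokens = simple_tokenize("Is COVID-19 induced anosmia enzyme-linked?")
print(example_tokens)
# ['is', 'covid', 'induced', 'anosmia', 'enzymelinked']
```

Because punctuation is deleted rather than replaced by a space, hyphenated terms collapse into a single token.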


## Extracting entities for valid documents

In [9]:
%%time
entitized_unique_docs = centroid_retrieval.docs_to_entities(
    unique_docs=unique_docs,
    n_jobs=16
)

[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Batch computation too fast (0.0680s.) Setting batch_size=2.
[Parallel(n_jobs=16)]: Done   9 tasks      | elapsed:    0.2s
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.3s
[Parallel(n_jobs=16)]: Done  29 tasks      | elapsed:    0.4s
[Parallel(n_jobs=16)]: Done  48 tasks      | elapsed:    0.8s
[Parallel(n_jobs=16)]: Done  74 tasks      | elapsed:    1.1s
[Parallel(n_jobs=16)]: Done 100 tasks      | elapsed:    1.3s
[Parallel(n_jobs=16)]: Done 130 tasks      | elapsed:    1.7s
[Parallel(n_jobs=16)]: Done 160 tasks      | elapsed:    2.0s
[Parallel(n_jobs=16)]: Done 194 tasks      | elapsed:    2.4s
[Parallel(n_jobs=16)]: Done 228 tasks      | elapsed:    2.8s
[Parallel(n_jobs=16)]: Done 266 tasks      | elapsed:    3.3s
[Parallel(n_jobs=16)]: Done 304 tasks      | elapsed:    3.8s
[Parallel(n_jobs=16)]: Done 346 tasks      | elapsed:    4.4s
[Parallel(n_jobs=16)]: Done 388 ta

CPU times: user 3.96 s, sys: 425 ms, total: 4.38 s
Wall time: 1min 7s


[Parallel(n_jobs=16)]: Done 7098 out of 7098 | elapsed:  1.1min finished


## Tokenize question body and replace documents with tokenized documents

In [10]:
tokenized_questions, question_solving_doc_ids = centroid_retrieval.select_questions_useful_documents(
    questions=questions['questions'],
    unique_docs=tokenized_unique_docs
)

Selecting useful documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 4718.71it/s]


In [11]:
tokenized_questions = {'questions': tokenized_questions}

## (Graph entity dict) Tokenize question body and replace documents with entity-extracted documents

In [12]:
graph_questions, question_solving_doc_ids = centroid_retrieval.select_questions_useful_documents(
    questions=questions['questions'],
    unique_docs=entitized_unique_docs
)

Selecting useful documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 4303.22it/s]


In [13]:
graph_questions = {'questions': graph_questions}

## Extracting unique tokenized abstracts and titles for word2vec

In [14]:
unique_abstract_tokens, unique_title_tokens = centroid_retrieval.extract_unique_titles_and_abstracts(
    tokenized_unique_docs=tokenized_unique_docs,
    question_solving_doc_ids=question_solving_doc_ids
)

In [15]:
len(unique_abstract_tokens), len(unique_title_tokens)

(7098, 7098)

In [16]:
len(list(unique_abstract_tokens.values())[0]), len(list(unique_title_tokens.values())[0])

(165, 11)

## Extracting unique tokenized questions for word2vec

In [17]:
unique_question_tokens = centroid_retrieval.extract_unique_questions(
    tokenized_questions['questions']
)

In [18]:
len(unique_question_tokens)

90

In [19]:
len(list(unique_question_tokens.values())[0])

10

## Loading w2vec models for questions, titles and abstracts

In [20]:
question_w2vec_model = centroid_retrieval.load_bio_w2vec_model(LOADING_QUESTION_W2V_PATH)

In [21]:
abstract_w2vec_model = centroid_retrieval.load_bio_w2vec_model(LOADING_ABSTRACT_W2V_PATH)

In [22]:
title_w2vec_model = centroid_retrieval.load_bio_w2vec_model(LOADING_TITLE_W2V_PATH)

## Calculating centroids

In [23]:
question_centroids = centroid_retrieval.calculate_centroids_test(
    text_tokens=unique_question_tokens,
    model=question_w2vec_model
)

Extracting centroids: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 34577.94it/s]


In [24]:
abstract_centroids = centroid_retrieval.calculate_centroids_test(
    text_tokens=unique_abstract_tokens,
    model=abstract_w2vec_model
)

Extracting centroids: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7098/7098 [00:01<00:00, 4687.08it/s]


In [25]:
title_centroids = centroid_retrieval.calculate_centroids_test(
    text_tokens=unique_title_tokens,
    model=title_w2vec_model
)

Extracting centroids: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7098/7098 [00:00<00:00, 37624.54it/s]
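
As a rough illustration of what a centroid is in this setting, the sketch below averages the embedding vectors of the tokens found in the vocabulary. It assumes an unweighted mean and uses a plain dictionary in place of a word2vec model; `calculate_centroids_test` may weight tokens differently, and `centroid` is a hypothetical helper.

```python
def centroid(tokens, embeddings):
    """Unweighted mean of the vectors of in-vocabulary tokens.

    Returns None when no token has an embedding.
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy 2-dimensional embeddings standing in for a trained word2vec model
toy_embeddings = {"covid": [1.0, 0.0], "anosmia": [0.0, 1.0]}

example_centroid = centroid(["covid", "anosmia", "unknownword"], toy_embeddings)
print(example_centroid)  # [0.5, 0.5]
```

Out-of-vocabulary tokens are skipped, so a text made only of unknown words has no centroid and needs separate handling.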


## Calculating question cosine similarities to answers

In [26]:
question_similarities = centroid_retrieval.calculate_question_answer_similarity(
    tokenized_questions=tokenized_questions['questions'],
    question_centroids=question_centroids,
    abstract_centroids=abstract_centroids,
    title_centroids=title_centroids
)

Calculating cosine similarity: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 736.81it/s]
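
Cosine similarity between two centroids is the dot product divided by the product of their norms. The sketch below is a plain-Python version for illustration only; `cosine_similarity` is a hypothetical helper, and returning 0.0 for a zero vector is an assumption, not necessarily what the script does.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: undefined similarity is treated as 0
    return dot / (norm_a * norm_b)


orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction
print(orthogonal, parallel)
```

The measure ignores vector length, so a long abstract and a short question can still score close to 1 when their centroids point in the same direction.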


## Calculating document scores for questions

In [27]:
question_scores = centroid_retrieval.calculate_centroid_score(
    questions_similarities=question_similarities['questions'],
    abstract_weight=ABSTRACT_WEIGHT,
    title_weight=TITLE_WEIGHT
)

Calculating centroid distance: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 15109.16it/s]
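
One plausible reading of this step is a weighted sum of the abstract and title similarities, using the `ABSTRACT_WEIGHT` and `TITLE_WEIGHT` values from the Params section. The sketch below assumes that linear form, which the notebook does not confirm; the similarity values are made up and `centroid_score` is a hypothetical helper.

```python
ABSTRACT_WEIGHT = 0.7
TITLE_WEIGHT = 0.3


def centroid_score(abstract_similarity, title_similarity,
                   abstract_weight=ABSTRACT_WEIGHT, title_weight=TITLE_WEIGHT):
    """Weighted sum of the question-abstract and question-title similarities."""
    return abstract_weight * abstract_similarity + title_weight * title_similarity


# Made-up (abstract_similarity, title_similarity) pairs for one question
example_similarities = {"doc_a": (0.8, 0.4), "doc_b": (0.5, 0.9)}

example_scores = {
    doc_id: centroid_score(abstract_sim, title_sim)
    for doc_id, (abstract_sim, title_sim) in example_similarities.items()
}

# Rank candidate documents from highest to lowest score
example_ranking = sorted(example_scores, key=example_scores.get, reverse=True)
print(example_ranking)  # ['doc_a', 'doc_b']
```

With these weights `doc_a` scores about 0.68 and `doc_b` about 0.62, so the abstract match outweighs the stronger title match of `doc_b`.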


## Selecting useful documents only from original question dictionaries
### Original dict (for Andres model)

In [28]:
centroid_retrieval.update_question_scores_from_raw_data(
    raw_questions=questions['questions'],
    question_scores=question_scores
)

Updating dictionary with centroid scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 12809.64it/s]


### Tokenized dict

In [29]:
centroid_retrieval.update_question_scores_from_raw_data(
    raw_questions=tokenized_questions['questions'],
    question_scores=question_scores
)

Updating dictionary with centroid scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 17121.16it/s]


### Graph entity dict 

In [30]:
centroid_retrieval.update_question_scores_from_raw_data(
    raw_questions=graph_questions['questions'],
    question_scores=question_scores
)

Updating dictionary with centroid scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 11726.85it/s]


## Saving to disk

In [31]:
save_json(questions, SAVING_ORIGINAL_PATH)

In [32]:
save_json(tokenized_questions, SAVING_TOKENS_PATH)

In [33]:
save_json(graph_questions, SAVING_ENTITY_PATH)
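
The `load_json` and `save_json` helpers come from `src.elastic_search_utils.elastic_utils`, which is not shown here. A minimal stand-in built on the standard `json` module could look like the sketch below; the names `save_json_sketch` and `load_json_sketch` and the encoding and indentation choices are assumptions.

```python
import json
import os
import tempfile


def save_json_sketch(data, path):
    """Write a JSON-serializable object to path as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False, indent=2)


def load_json_sketch(path):
    """Read a JSON file and return the parsed object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round trip in a temporary directory with a made-up payload
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "questions.json")
    example_payload = {"questions": [{"id": "q1", "body": ["is", "covid"]}]}
    save_json_sketch(example_payload, example_path)
    restored_payload = load_json_sketch(example_path)

print(restored_payload == example_payload)  # True
```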

In [35]:
graph_questions['questions'][0]

{'id': '620155b6c9dfcb9c09000024',
 'type': 'yesno',
 'body': ['is',
  'covid',
  'induced',
  'anosmia',
  'caused',
  'by',
  'disruption',
  'of',
  'nuclear',
  'architecture'],
 'documents': [{'id': '32708872',
   'entities': ['neutralizing',
    'antibody',
    'asymptomatic',
    'mild',
    'patients',
    'comparison',
    'pneumonic',
    'patients',
    'investigate',
    'antibody',
    'asymptomatic',
    'mild',
    'patients',
    'methods',
    'sera',
    'asymptomatic',
    'severe',
    'patients',
    'microneutralization',
    'fluorescence',
    'immunoassay',
    'fia',
    'enzymelinked',
    'immunosorbent',
    'assay',
    'elisa',
    'results',
    'patients',
    'asymptomaticanosmia',
    'mild',
    'symptomatic',
    'pneumonia',
    'patients',
    'production',
    'neutralizing',
    'antibody',
    'pneumonia',
    'mild',
    'symptomatic',
    'asymptomaticanosmia',
    'groups',
    'patients',
    'pneumonia',
    'group',
    'high',
    'titer