# Test retrieval from centroid + BM25
This notebook runs the centroid retrieval script on top of BM25 results to produce a set of candidate documents that can be passed to downstream models or used directly for snippet retrieval.

## Imports

In [1]:
%load_ext autoreload
%autoreload 2

import copy
import logging
from pprint import pprint

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

from src.cemb_bm25 import centroid_retrieval
from src.elastic_search_utils.elastic_utils import load_json, save_json

In [2]:
!which python

/datasets/anaconda3/envs/tf2.8/bin/python


## Params

In [3]:
BM25_QUESTIONS = '/datasets/johan_tests_original_format/test_docs_10b-testset3.json'

ABSTRACT_WEIGHT = 0.7
TITLE_WEIGHT = 0.3

In [4]:
LOADING_FOLDER = '/datasets/johan_tests_original_format_centroid/merged_training_docs'

LOADING_ABSTRACT_W2V_PATH = f'{LOADING_FOLDER}/Bio_Word2Vec_doc_abstract_model_10b_train.bin'
LOADING_TITLE_W2V_PATH = f'{LOADING_FOLDER}/Bio_Word2Vec_doc_title_model_10b_train.bin'
LOADING_QUESTION_W2V_PATH = f'{LOADING_FOLDER}/Bio_Word2Vec_doc_question_model_10b_train.bin'

## Saving paths

In [5]:
SAVING_FOLDER = '/datasets/johan_tests_original_format_centroid/merged_training_docs'

SAVING_ORIGINAL_PATH = f'{SAVING_FOLDER}/test_original_10b-testset3.json'
SAVING_TOKENS_PATH = f'{SAVING_FOLDER}/test_tokens_10b-testset3.json'
SAVING_ENTITY_PATH = f'{SAVING_FOLDER}/test_entity_10b-testset3.json'

## Loading questions

In [6]:
questions = load_json(BM25_QUESTIONS)

## Extracting unique document info
Only documents that have an abstract are kept.

Two dictionaries are produced: one cleaned and tokenized, the other only cleaned.

In [7]:
unique_docs = centroid_retrieval.extract_unique_doc_info(
    questions=questions['questions']
)

Extracting unique doc info: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 49318.97it/s]
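
The internals of `extract_unique_doc_info` are not shown in this notebook. As a rough sketch of the idea only, assuming each question holds a `documents` list whose entries have `id`, `title` and `abstract` keys (the `title` and `abstract` key names and the helper name are assumptions), deduplicating while dropping documents without an abstract could look like this:

```python
def unique_docs_with_abstract(questions):
    """Collect each document once, keyed by id, skipping docs with no abstract.

    Hypothetical helper for illustration; not the project's implementation.
    """
    unique = {}
    for question in questions:
        for doc in question.get('documents', []):
            abstract = (doc.get('abstract') or '').strip()
            if not abstract:
                continue  # only documents with an abstract are kept
            # setdefault keeps the first copy seen for each document id
            unique.setdefault(
                doc['id'],
                {'title': doc.get('title', ''), 'abstract': abstract},
            )
    return unique


example_questions = [
    {'documents': [{'id': '1', 'title': 'a', 'abstract': 'some text'},
                   {'id': '2', 'title': 'b', 'abstract': ''}]},
    {'documents': [{'id': '1', 'title': 'a', 'abstract': 'some text'}]},
]
print(unique_docs_with_abstract(example_questions))
```

In this toy input, document `1` appears under two questions and is stored once, while document `2` is dropped because its abstract is empty.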


## Extracting tokens for valid documents

In [8]:
%%time
tokenized_unique_docs = centroid_retrieval.docs_to_tokens(
    unique_docs=unique_docs,
    n_jobs=16
)

[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done   9 tasks      | elapsed:    4.7s
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    4.7s
[Parallel(n_jobs=16)]: Done  29 tasks      | elapsed:    4.8s
[Parallel(n_jobs=16)]: Batch computation too fast (0.1691s.) Setting batch_size=2.
[Parallel(n_jobs=16)]: Done  40 tasks      | elapsed:    4.8s
[Parallel(n_jobs=16)]: Done  53 tasks      | elapsed:    4.8s
[Parallel(n_jobs=16)]: Done  66 tasks      | elapsed:    4.8s
[Parallel(n_jobs=16)]: Batch computation too fast (0.0317s.) Setting batch_size=4.
[Parallel(n_jobs=16)]: Done  87 tasks      | elapsed:    4.8s
[Parallel(n_jobs=16)]: Done 116 tasks      | elapsed:    4.8s
[Parallel(n_jobs=16)]: Batch computation too fast (0.0482s.) Setting batch_size=8.
[Parallel(n_jobs=16)]: Done 154 tasks      | elapsed:    4.9s
[Parallel(n_jobs=16)]: Done 219 tasks      | elapsed:    4.9s
[Parallel(n_jobs=16)]: Batch computation too fast (0.0

CPU times: user 533 ms, sys: 233 ms, total: 766 ms
Wall time: 6.16 s


[Parallel(n_jobs=16)]: Done 6260 tasks      | elapsed:    6.1s
[Parallel(n_jobs=16)]: Done 6556 tasks      | elapsed:    6.1s
[Parallel(n_jobs=16)]: Done 6836 out of 6836 | elapsed:    6.1s finished
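
The cleaning rules used by `docs_to_tokens` are not shown here. Judging only from tokens such as `kmtbrelated` and `yearold` in the output of the last cell of this notebook, the cleaning appears to lowercase the text and drop digits and punctuation before splitting on whitespace. A minimal sketch under that assumption (the function name is hypothetical):

```python
import re


def clean_tokenize(text):
    """Lowercase, strip everything except letters and whitespace, then split.

    Assumed behaviour inferred from the notebook output, not the real code.
    """
    cleaned = re.sub(r'[^a-z\s]', '', text.lower())
    return cleaned.split()


print(clean_tokenize('KMT2B-related disorders'))
```

Removing the digit and the hyphen instead of replacing them with spaces is what merges `KMT2B-related` into a single token.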


## Extracting entities for valid documents

In [9]:
%%time
entitized_unique_docs = centroid_retrieval.docs_to_entities(
    unique_docs=unique_docs,
    n_jobs=16
)

[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Batch computation too fast (0.1178s.) Setting batch_size=2.
[Parallel(n_jobs=16)]: Done   9 tasks      | elapsed:    0.2s
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.3s
[Parallel(n_jobs=16)]: Done  29 tasks      | elapsed:    0.4s
[Parallel(n_jobs=16)]: Done  48 tasks      | elapsed:    0.7s
[Parallel(n_jobs=16)]: Done  74 tasks      | elapsed:    0.9s
[Parallel(n_jobs=16)]: Done 100 tasks      | elapsed:    1.2s
[Parallel(n_jobs=16)]: Done 130 tasks      | elapsed:    1.5s
[Parallel(n_jobs=16)]: Done 160 tasks      | elapsed:    1.8s
[Parallel(n_jobs=16)]: Done 194 tasks      | elapsed:    2.2s
[Parallel(n_jobs=16)]: Done 228 tasks      | elapsed:    2.5s
[Parallel(n_jobs=16)]: Done 266 tasks      | elapsed:    2.9s
[Parallel(n_jobs=16)]: Done 304 tasks      | elapsed:    3.3s
[Parallel(n_jobs=16)]: Done 346 tasks      | elapsed:    3.8s
[Parallel(n_jobs=16)]: Done 388 ta

CPU times: user 3.72 s, sys: 416 ms, total: 4.13 s
Wall time: 1min 3s


[Parallel(n_jobs=16)]: Done 6836 out of 6836 | elapsed:  1.1min finished


## Tokenize question body and replace documents with tokenized documents

In [10]:
tokenized_questions, question_solving_doc_ids = centroid_retrieval.select_questions_useful_documents(
    questions=questions['questions'],
    unique_docs=tokenized_unique_docs
)

Selecting useful documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 4081.43it/s]


In [11]:
tokenized_questions = {'questions': tokenized_questions}

## (Graph entity dict) Tokenize question body and replace documents with entity extracted documents

In [12]:
graph_questions, question_solving_doc_ids = centroid_retrieval.select_questions_useful_documents(
    questions=questions['questions'],
    unique_docs=entitized_unique_docs
)

Selecting useful documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 4314.54it/s]


In [13]:
graph_questions = {'questions': graph_questions}

## Extracting unique tokenized abstracts and titles for word2vec

In [14]:
unique_abstract_tokens, unique_title_tokens = centroid_retrieval.extract_unique_titles_and_abstracts(
    tokenized_unique_docs=tokenized_unique_docs,
    question_solving_doc_ids=question_solving_doc_ids
)

In [15]:
len(unique_abstract_tokens), len(unique_title_tokens)

(6836, 6836)

In [16]:
len(list(unique_abstract_tokens.values())[0]), len(list(unique_title_tokens.values())[0])

(156, 5)

## Extracting unique tokenized questions for word2vec

In [17]:
unique_question_tokens = centroid_retrieval.extract_unique_questions(
    tokenized_questions['questions']
)

In [18]:
len(unique_question_tokens)

90

In [19]:
len(list(unique_question_tokens.values())[0])

12

## Loading w2vec models for questions, titles and abstracts

In [20]:
question_w2vec_model = centroid_retrieval.load_bio_w2vec_model(LOADING_QUESTION_W2V_PATH)

In [21]:
abstract_w2vec_model = centroid_retrieval.load_bio_w2vec_model(LOADING_ABSTRACT_W2V_PATH)

In [22]:
title_w2vec_model = centroid_retrieval.load_bio_w2vec_model(LOADING_TITLE_W2V_PATH)

## Calculating centroids

In [23]:
question_centroids = centroid_retrieval.calculate_centroids_test(
    text_tokens=unique_question_tokens,
    model=question_w2vec_model
)

Extracting centroids: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 29367.31it/s]


In [24]:
abstract_centroids = centroid_retrieval.calculate_centroids_test(
    text_tokens=unique_abstract_tokens,
    model=abstract_w2vec_model
)

Extracting centroids: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6836/6836 [00:01<00:00, 4894.24it/s]


In [25]:
title_centroids = centroid_retrieval.calculate_centroids_test(
    text_tokens=unique_title_tokens,
    model=title_w2vec_model
)

Extracting centroids: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6836/6836 [00:00<00:00, 38896.16it/s]
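
For reference, a centroid in this setting is usually the element-wise mean of the word vectors of a text's tokens. The sketch below illustrates that idea in plain Python, using a small dictionary as a stand-in for a trained word2vec model. Skipping out-of-vocabulary tokens and returning `None` when no token is known are assumptions, not necessarily what `calculate_centroids_test` does.

```python
def centroid(tokens, vectors):
    """Average the vectors of the tokens present in `vectors`.

    `vectors` maps token -> list of floats (stand-in for a word2vec model).
    Returns None when none of the tokens is in the vocabulary.
    """
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


toy_vectors = {'gene': [1.0, 0.0], 'dystonia': [0.0, 1.0]}
print(centroid(['gene', 'dystonia', 'unknown'], toy_vectors))  # [0.5, 0.5]
```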


## Calculating question cosine similarities to answers

In [26]:
question_similarities = centroid_retrieval.calculate_question_answer_similarity(
    tokenized_questions=tokenized_questions['questions'],
    question_centroids=question_centroids,
    abstract_centroids=abstract_centroids,
    title_centroids=title_centroids
)

  cosine_similarity = projection / normalization
Calculating cosine similarity: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 762.40it/s]
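
The line `cosine_similarity = projection / normalization` in the output above looks like the source line echoed by a runtime warning, which would be consistent with a division by a zero norm (for example, a text with no in-vocabulary tokens). That reading is an inference from the output alone. A minimal sketch of cosine similarity with an explicit guard for that case, reusing the `projection` and `normalization` names and returning `0.0` as an assumed fallback:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors, 0.0 if either has zero norm."""
    projection = sum(a * b for a, b in zip(u, v))
    normalization = (math.sqrt(sum(a * a for a in u))
                     * math.sqrt(sum(b * b for b in v)))
    if normalization == 0.0:
        return 0.0  # assumed fallback; avoids dividing by zero
    return projection / normalization


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([0.0, 0.0], [1.0, 0.0]))  # 0.0
```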


## Calculating document scores for questions

In [27]:
question_scores = centroid_retrieval.calculate_centroid_score(
    questions_similarities=question_similarities['questions'],
    abstract_weight=ABSTRACT_WEIGHT,
    title_weight=TITLE_WEIGHT
)

Calculating centroid distance: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 16943.64it/s]
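
How `calculate_centroid_score` combines the two similarities is not shown in this notebook. One plausible reading of `ABSTRACT_WEIGHT = 0.7` and `TITLE_WEIGHT = 0.3` is a weighted sum of the question-to-abstract and question-to-title similarities. The sketch below assumes exactly that, with a hypothetical function name:

```python
ABSTRACT_WEIGHT = 0.7
TITLE_WEIGHT = 0.3


def centroid_score(abstract_similarity, title_similarity,
                   abstract_weight=ABSTRACT_WEIGHT, title_weight=TITLE_WEIGHT):
    """Weighted sum of the two similarities (assumed combination rule)."""
    return abstract_weight * abstract_similarity + title_weight * title_similarity


# A document whose abstract matches well but whose title matches less well.
print(round(centroid_score(0.8, 0.5), 2))
```

Because the weights sum to 1, the score stays in the same range as the similarities themselves, which keeps scores comparable across documents.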


## Selecting useful documents only from original question dictionaries
### Original dict (for Andres model)

In [28]:
centroid_retrieval.update_question_scores_from_raw_data(
    raw_questions=questions['questions'],
    question_scores=question_scores
)

Updating dictionary with centroid scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 12877.37it/s]


### Tokenized dict

In [29]:
centroid_retrieval.update_question_scores_from_raw_data(
    raw_questions=tokenized_questions['questions'],
    question_scores=question_scores
)

Updating dictionary with centroid scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 14760.59it/s]


### Graph entity dict 

In [30]:
centroid_retrieval.update_question_scores_from_raw_data(
    raw_questions=graph_questions['questions'],
    question_scores=question_scores
)

Updating dictionary with centroid scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 12685.67it/s]


## Saving to disk

In [31]:
save_json(questions, SAVING_ORIGINAL_PATH)

In [32]:
save_json(tokenized_questions, SAVING_TOKENS_PATH)

In [33]:
save_json(graph_questions, SAVING_ENTITY_PATH)

In [34]:
graph_questions['questions'][0]

{'id': '61f97372882a024a10000051',
 'type': 'list',
 'body': ['list',
  'clinical',
  'phenotypes',
  'and',
  'molecular',
  'genetic',
  'features',
  'of',
  'patients',
  'with',
  'kmtbrelated',
  'disorders'],
 'documents': [{'id': '32241076',
   'entities': ['pallidal',
    'stimulation',
    'patient',
    'kmtbrelated',
    'dystonia',
    'kmtb',
    'gene',
    'causative',
    'gene',
    'earlyonset',
    'generalized',
    'dystonia',
    'efficacy',
    'deep',
    'brain',
    'stimulation',
    'kmtbrelated',
    'dystonia',
    'yearold',
    'woman',
    'generalized',
    'dystonia',
    'developmental',
    'delay',
    'microcephaly',
    'short',
    'stature',
    'cognitive',
    'decline',
    'diagnosed',
    'kmtb',
    'related',
    'dystonia',
    'wholeexome',
    'sequencing',
    'heterozygous',
    'frameshift',
    'insertion',
    'kmtb',
    'gene',
    'oral',
    'medications',
    'botulinum',
    'toxin',
    'injection',
    'dystonia',
    'b