# Full Text Search evaluation with participant demographics

Here, I evaluate the strategy of finding relevant text sections using chunk embeddings.

First, I split each PMC article along its Markdown structure (and line breaks) into chunks of fewer than `n_tokens` (~4000) tokens.
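
A minimal sketch of this kind of splitting, for illustration only. The actual `embed` module is not shown in this notebook and may work differently. `split_markdown` is a hypothetical helper, whitespace-separated words stand in for real tokenizer tokens, and falling back to single lines for oversized sections is an assumption.

```python
import re

def split_markdown(text, n_tokens=4000):
    """Greedily pack Markdown sections into chunks of at most n_tokens words.

    Hypothetical helper: words are a rough stand-in for tokenizer tokens.
    """
    # Split before every Markdown heading, keeping each heading with its body.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks, current, current_len = [], [], 0
    for section in sections:
        # Assumption: a section that is too long is broken into its lines.
        if len(section.split()) <= n_tokens:
            pieces = [section]
        else:
            pieces = section.splitlines(keepends=True)
        for piece in pieces:
            piece_len = len(piece.split())
            # Start a new chunk when adding this piece would exceed the budget.
            if current and current_len + piece_len > n_tokens:
                chunks.append("".join(current))
                current, current_len = [], 0
            current.append(piece)
            current_len += piece_len
    if current:
        chunks.append("".join(current))
    return chunks
```

Note that a single line longer than `n_tokens` would still end up as one oversized chunk in this sketch.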

Next, I embed each chunk.

Finally, using a text query, I find the section of each article most relevant to participant demographics.
The query is also embedded, and a distance metric is computed between each chunk and the query.
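
The ranking step can be sketched as follows. This is not the real `query_embeddings` from `embed`, which takes the query text and embeds it itself. `rank_chunks` is a hypothetical stand-in that takes an already-embedded query, and cosine distance is an assumption about which metric is used.

```python
import numpy as np

def rank_chunks(chunk_embeddings, query_embedding):
    """Return (distances, ranks) of each chunk relative to the query.

    Hypothetical stand-in for `query_embeddings`; rank 0 is the closest chunk.
    """
    chunks = np.asarray(chunk_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Cosine distance = 1 - cosine similarity (assumed metric).
    norms = np.linalg.norm(chunks, axis=1) * np.linalg.norm(query)
    distances = 1.0 - (chunks @ query) / norms
    # argsort of argsort turns distances into a 0-indexed rank per chunk.
    ranks = np.argsort(np.argsort(distances))
    return distances, ranks
```

The ranks are 0-indexed to match the rest of the notebook, where `rank == 0` denotes the top-ranked chunk.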

To evaluate this method, I check whether it correctly identifies the section where human annotators found demographic information.

## Setup

In [105]:
import pandas as pd
from labelrepo.projects.participant_demographics import get_participant_demographics

subgroups = get_participant_demographics(include_locations=True)

In [106]:
from labelrepo.database import get_database_connection

import pandas as pd

docs_info = pd.read_sql(
    "select pmcid, publication_year, title, text from document",
    get_database_connection(),
)

In [107]:
# Only look at single group documents
jerome_pd = subgroups[(subgroups.project_name == 'participant_demographics') & \
                      (subgroups.annotator_name == 'Jerome_Dockes')]

counts = jerome_pd.groupby('pmcid').count().reset_index()
single_group_pmcids = counts[counts['count'] == 1].pmcid
single_group = jerome_pd[jerome_pd.pmcid.isin(single_group_pmcids)]

In [108]:
single_group_docs = docs_info[docs_info.pmcid.isin(single_group.pmcid)]

### Embed all documents

In [109]:
import openai
openai.api_key = open('/home/zorro/.keys/open_ai.key').read().strip()

In [110]:
from embed import embed_pmc_articles, query_embeddings
texts = single_group_docs[['pmcid', 'text']].to_dict(orient='records')
# single_group_embeddings = embed_pmc_articles(texts)

In [111]:
import pickle
# pickle.dump(single_group_embeddings, open('data/single_group_embeddings.pkl', 'wb'))

In [112]:
single_group_embeddings = pickle.load(open('data/single_group_embeddings.pkl', 'rb'))

### Test query across all documents

Given a query, compute the average rank of the chunk that contains the human annotation.

In [113]:
single_group_embeddings = pd.DataFrame(single_group_embeddings)

In [114]:
import numpy as np 

def get_matching_chunks(ranks_df, annotation_df):
    """Select the chunks that contain the original annotation."""
    matches = []
    for ix, row in annotation_df.iterrows():
        m = ranks_df[
            (ranks_df['pmcid'] == row['pmcid']) & (ranks_df['start_char'] <= row['start_char']) & (ranks_df['end_char'] >= row['end_char'])]
        matches.append(m)
    
    return pd.concat(matches)
    
def evaluate_query_across_docs(embeddings_df, annotations_df, query):
    # For every document, get distance and rank between query and embeddings
    distances, ranks = zip(*[
        query_embeddings(sub_df['embedding'].tolist(), query) 
        for pmcid, sub_df in embeddings_df.groupby('pmcid', sort=False)
    ])

    # Combine with meta-data into a df
    ranks_df = embeddings_df[['pmcid', 'chunk_id', 'start_char', 'end_char']].copy()
    ranks_df['distance'] = np.concatenate(distances)
    ranks_df['rank'] = np.concatenate(ranks)
    
    mc = get_matching_chunks(ranks_df, annotations_df)

    # Ranks are 0-indexed, so the top 3 chunks are those with rank < 3
    print(
        f"Query: '{query}'\n"
        f"Mean rank: {mc['rank'].mean():.2f}, "
        f"top 1 %: {(mc['rank'] == 0).mean():.2f}, "
        f"top 3 %: {(mc['rank'] < 3).mean():.2f}"
    )

In [115]:
queries = ['ljskfdklsjdfk', 'Methods section', 'Number of participants',
           'The number of subjects or participants that were involved in the study or underwent MRI',
           'How many participants or subjects were recruited for this study?',
           'How many participants were recruited for this study?']

In [116]:
# for q in queries:
#     evaluate_query_across_docs(single_group_embeddings, single_group, q)

### Try only on Body

Jerome's annotations appear to fall only in the Body of each paper, so it is fair to exclude any embeddings from outside the Body.

In [117]:
single_group_embeddings_body = single_group_embeddings[
    single_group_embeddings.section_name == 'Body'
].reset_index(drop=True)

In [118]:
# for q in queries:
#     evaluate_query_across_docs(single_group_embeddings_body, single_group, q)

# Extract Sample Size from relevant section

In [47]:
from extract import extract_from_multiple
from templates import ZERO_SHOT_SAMPLE_SIZE_FUNCTION

def extract_sample_size_full_text(embeddings_df, query, template, num_workers=3):
    # For every document, get distance and rank between query and embeddings
    distances, ranks = zip(*[
        query_embeddings(sub_df['embedding'].tolist(), query) 
        for pmcid, sub_df in embeddings_df.groupby('pmcid', sort=False)
    ])

    # Combine with meta-data into a df
    ranks_df = embeddings_df[['pmcid', 'content']].copy()
    ranks_df['rank'] = np.concatenate(ranks)

    # Subset to only include top ranked chunks
    ranks_df = ranks_df[ranks_df['rank'] == 0]

    # For every chunk, apply template
    predictions = extract_from_multiple(
        ranks_df.content.to_list(), 
        **template, 
        num_workers=num_workers
    )

    predictions['pmcid'] = ranks_df['pmcid'].tolist()
    predictions['content'] =  ranks_df['content'].tolist()
    return predictions

In [48]:
full_text_demo = extract_sample_size_full_text(
    single_group_embeddings_body, 
    'How many participants were recruited for this study?', 
    ZERO_SHOT_SAMPLE_SIZE_FUNCTION
)

100%|███████████████████████████████████████████| 75/75 [00:20<00:00,  3.71it/s]


#### Evaluation

In [49]:
sg_eval = single_group.reset_index()[['count', 'pmcid']].rename(columns={'count': 'annot_count'})

In [50]:
full_text_demo = pd.merge(full_text_demo, sg_eval)

In [51]:
full_text_demo['correct'] = full_text_demo['count'] == full_text_demo['annot_count']

In [52]:
# Overall accuracy
full_text_demo['correct'].mean()

0.7733333333333333

In [94]:
# Accuracy excluding missing predictions ('n/a' or 0)
ftd_nonna = full_text_demo[(full_text_demo['count'] != 'n/a') & (full_text_demo['count'] != 0) ]
ftd_nonna['correct'].mean()

0.8656716417910447

In [102]:
def is_within_percentage(num1, num2, percentage=10):
    """Return True if num2 is within `percentage` % of num1 (error relative to num1)."""
    num1, num2 = int(num1), int(num2)
    if num1 == num2:
        return True
    return abs(num1 - num2) / num1 <= percentage / 100

In [103]:
# Accuracy within 10%
(ftd_nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 10), axis=1)).mean()

0.9253731343283582

In [104]:
# Accuracy within 20%
(ftd_nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 20), axis=1)).mean()

0.9850746268656716
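
As a possible alternative, the tolerance check can be vectorised and measured relative to the annotated count instead of the predicted one. `within_tolerance` is a hypothetical helper, not part of the evaluation above. Because `is_within_percentage` divides by its first argument (the prediction), the two can disagree on borderline cases.

```python
import pandas as pd

def within_tolerance(predicted, annotated, percentage=10):
    """Element-wise check that predictions lie within `percentage` % of annotations.

    Hypothetical helper: the error is relative to the annotated count.
    """
    predicted = pd.to_numeric(predicted)
    annotated = pd.to_numeric(annotated)
    return (predicted - annotated).abs() <= annotated * percentage / 100
```

It would be applied to whole columns, for example `within_tolerance(ftd_nonna['count'], ftd_nonna['annot_count'], 10).mean()`.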

#### Incorrect responses

In [56]:
incorrect = ftd_nonna[ftd_nonna['correct'] == False]

In [57]:
incorrect

| | count | pmcid | content | annot_count | correct |
|---|---|---|---|---|---|
| 0 | 146 | 9108497 | `\n### Participants \n \nThe participants are ...` | 134 | False |
| 2 | 112 | 8752963 | `\n## Methods \n \n### MEG datasets \n \nWe a...` | 110 | False |
| 4 | 6 | 9461104 | `\n## Material & methods \n \nPatients who wer...` | 6 | False |
| 7 | 0 | 9308181 | `\n## Methods \n \n### Study group \n \nA coh...` | 173 | False |
| 8 | 17 | 8837589 | `\n## MATERIALS AND METHODS \n \n### Participa...` | 20 | False |
| 17 | 40 | 2748718 | `\n## Materials and Methods \n \n### Experimen...` | 30 | False |
| 37 | 25 | 3483694 | `\n## Methods \n \n### Participants \n \nTwen...` | 22 | False |
| 50 | 19 | 5537800 | `\n### 2.1. Participants \n \nThe study involv...` | 16 | False |
| 68 | 27 | 6509414 | `\n## Materials and Methods \n \n### Participa...` | 26 | False |
| 73 | 15 | 6989437 | `\n## Methods \n \n### Participants \n \nFift...` | 13 | False |


- 8/10 are off by only a few participants, due to complex exclusion criteria.
- Changing the prompt to also identify the number of excluded subjects actually lowers accuracy: n deviates further from either the final N or the total N, and subtracting the two numbers doesn't help.
- In one case (1/10), the annotation itself is incorrect.
- Zeros should be treated as `n/a` (1/1).
- Sometimes the models confuse other information (e.g., ROIs) for demographic information.
  The models are generally good at returning `n/a` for these sections, but occasionally, and not consistently, fail to do so.
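
One way to treat zeros as `n/a` before scoring is to coerce the predictions to numbers and mask zeros as missing. This is a sketch rather than the code used for the numbers above, and `normalize_counts` is a hypothetical helper.

```python
import pandas as pd

def normalize_counts(counts):
    """Convert raw predictions to numbers, mapping 'n/a' and 0 to missing (NaN)."""
    # Non-numeric entries such as 'n/a' become NaN; numeric strings are parsed.
    numeric = pd.to_numeric(pd.Series(counts), errors="coerce")
    # A predicted sample size of 0 carries no information, so mark it missing too.
    return numeric.mask(numeric == 0)
```

With this in place, the non-missing subset could be selected with `.notna()` instead of comparing against both `'n/a'` and `0`.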

#### Missing Responses

In [77]:
ftd_na = full_text_demo[full_text_demo['count'] == 'n/a']

In [78]:
ftd_na

| | count | pmcid | content | annot_count | correct |
|---|---|---|---|---|---|
| 3 | | 8978988 | `\n## AUTHOR CONTRIBUTIONS \n \nZhouwei Xu and...` | 1 | False |
| 9 | | 6492297 | `\nFurthermore, each subject has a number of be...` | 820 | False |
| 11 | | 3775427 | `\n## Methods \n \n### Regions of interest \n ...` | 102 | False |
| 14 | | 6678781 | `\n## 5. Limitations \n \nFirst of all, partic...` | 61 | False |
| 35 | | 6878729 | `\n## Ethics Statement \n \nThe studies involv...` | 8 | False |
| 60 | | 7481390 | `\n## Ethics Statement \n \nThe studies involv...` | 28 | False |
| 63 | | 3409150 | `\n## Methods \n \nThis research was approved ...` | 18 | False |


In almost all these cases, the wrong passage was selected. 

### Iterating over n/as, until finding a sample size
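
Since most missing responses come from selecting the wrong passage, one option is to walk down the ranked chunks of a document until the extraction returns a usable count. The sketch below shows the idea. `first_valid_count` and the `extract` callable are hypothetical, standing in for a single-chunk call to the extraction model, and `max_chunks` is an assumed cap on how far down the ranking to go.

```python
def first_valid_count(ranked_chunks, extract, max_chunks=3):
    """Try chunks in rank order until `extract` returns a usable count.

    `extract` is a hypothetical callable mapping chunk text to a count,
    'n/a', 0, or None. Returns (count, rank), or (None, None) if no chunk
    among the first `max_chunks` yields a sample size.
    """
    for rank, chunk in enumerate(ranked_chunks[:max_chunks]):
        count = extract(chunk)
        # Treat None, 'n/a', and 0 as "no sample size found in this chunk".
        if count not in (None, "n/a", 0):
            return count, rank
    return None, None
```

Returning the rank alongside the count makes it possible to check how often the fallback was needed.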