# Full Text Search evaluation with participant demographics

Here, I evaluate a strategy for finding relevant text sections using chunk embeddings.

First, I split PMC articles on Markdown headers (and then on lines) into chunks of fewer than `n_tokens` (~4000) tokens.

Next, I embed each chunk.

Finally, using a text query, I find the section of each article that is most relevant to participant demographics.
The query is also embedded, and a distance metric is computed between each chunk and the query.

To evaluate this method, I check whether it correctly identifies the section where human annotators found demographic information.
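
The ranking step described above can be sketched as follows. This is only an illustration: the notebook does not show which distance metric is used, so cosine distance is an assumption, and `cosine_distance` and `rank_chunks_by_distance` are hypothetical names, not functions from this repository. The toy vectors are made up.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_chunks_by_distance(chunk_embeddings, query_embedding):
    """Return (distances, ranks); rank 0 is the chunk closest to the query."""
    distances = [cosine_distance(c, query_embedding) for c in chunk_embeddings]
    # Sort chunk indices by distance, then record each chunk's position
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    ranks = [0] * len(distances)
    for rank, chunk_ix in enumerate(order):
        ranks[chunk_ix] = rank
    return distances, ranks

# Toy 2-D "embeddings" for three chunks and one query (made-up values)
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.2]
distances, ranks = rank_chunks_by_distance(toy_chunks, toy_query)
print(ranks)
```

Ranks are 0-based here, matching the `rank == 0` convention used for the top chunk later in this notebook.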

## Setup

In [1]:
import pandas as pd
from labelrepo.projects.participant_demographics import get_participant_demographics
from labelrepo.database import get_database_connection

subgroups = get_participant_demographics(include_locations=True)
docs_info = pd.read_sql(
    "select pmcid, publication_year, title, text from document",
    get_database_connection(),
)

In [2]:
# Only look at single group documents
jerome_pd = subgroups[(subgroups.project_name == 'participant_demographics') & \
                      (subgroups.annotator_name == 'Jerome_Dockes')]

counts = jerome_pd.groupby('pmcid').count().reset_index()
single_group_pmcids = counts[counts['count'] == 1].pmcid
single_group = jerome_pd[jerome_pd.pmcid.isin(single_group_pmcids)]
single_group_docs = docs_info[docs_info.pmcid.isin(single_group.pmcid)]

In [3]:
from split import split_pmc_document

In [4]:
pd.DataFrame(split_pmc_document(single_group_docs.iloc[0].text))

Unnamed: 0,content,start_chars,end_chars,section_0,section_1,section_2
0,"Pulli, Elmo P. and Silver, Eero and Kumpulaine...",0,323,,,
1,\n# Title\n\nFeasibility of FreeSurfer Process...,323,468,Title,,
2,\n# Keywords\n\nbrain\nchild\nneuroimaging\nbr...,468,563,Keywords,,
3,\n# Abstract\n \nPediatric neuroimaging is a q...,563,2381,Abstract,,
4,\n# Body\n \n## Introduction \n \nThere are m...,2381,5253,Body,Introduction,
5,\nQuality control is often done by applying a ...,5253,9234,Body,Introduction,
6,\n## Materials and Methods \n \nThis study wa...,9234,9500,Body,Materials and Methods,
7,\n### Participants \n \nThe participants are ...,9500,12550,Body,Materials and Methods,Participants
8,\n### The Study Visits \n \nAll MRI scans wer...,12550,16313,Body,Materials and Methods,The Study Visits
9,\nAll images were viewed by one neuroradiologi...,16313,16826,Body,Materials and Methods,The Study Visits
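
To give a rough idea of how header-based chunking can produce rows like the ones above, here is a minimal sketch. It is not the implementation in `split.py`: `split_markdown_sketch` and `max_chars` are hypothetical, and the character count is only a stand-in for the token budget (`n_tokens`).

```python
import re

def split_markdown_sketch(text, max_chars=4000):
    """Split Markdown at headers, then by lines, aiming for chunks of at most
    max_chars characters (a single longer line is kept whole)."""
    # Zero-width split just before every header line, so headers stay with their text
    sections = re.split(r"(?m)^(?=#+ )", text)
    chunks = []
    for section in sections:
        if not section:
            continue
        current = ""
        for line in section.splitlines(keepends=True):
            # Start a new chunk when adding this line would exceed the budget
            if current and len(current) + len(line) > max_chars:
                chunks.append(current)
                current = ""
            current += line
        if current:
            chunks.append(current)
    return chunks

# Made-up miniature document
doc = "# Abstract\nShort summary.\n# Body\n## Methods\nWe scanned 30 adults.\n"
for chunk in split_markdown_sketch(doc, max_chars=30):
    print(repr(chunk))
```

Because chunks are cut only at header or line boundaries, concatenating them reproduces the original text, which is what makes character offsets such as `start_char` and `end_char` meaningful.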


In [None]:
%debug

> /home/zorro/repos/biomed-llm-retrieval/split.py(68)split_markdown()
     66         if chunk:
     67             if 'Materials and Methods' in chunk:
---> 68                 assert 0
     69             if not ix == 0:
     70                 section_name, _ = candidate_chunks[1].split('\n', maxsplit=1)



ipdb>  chunk


'Materials and Methods \n  \nThis study was conducted in accordance with the Declaration of Helsinki, and it was approved by the Joint Ethics Committee of the University of Turku and the Hospital District of Southwest Finland (07.08.2018) §330, ETMK: 31/180/2011. \n\n### Participants \n  \nThe participants are part of the FinnBrain Birth Cohort Study  ( ), where 5-year-olds were invited to neuropsychological, logopedic, neuroimaging, and pediatric study visits. For the neuroimaging visit, we primarily recruited participants that had a prior visit to neuropsychological measurements at circa 5 years of age (  n   = 141/146). However, there were a few exceptions: three participants were included without a neuropsychological visit, as they had an exposure to maternal prenatal synthetic glucocorticoid treatment (recruited separately for a nested case–control sub-study). The data additionally includes two participants that were enrolled for pilot scans. We aimed to scan all subjects between 

ipdb>  candidate_chunks[1].split('\n', maxsplit=1)


['Introduction ', '  \nThere are multiple methodological challenges in pediatric neuroimaging studies that may affect quality of data and comparisons between studies. Magnetic resonance imaging (MRI) requires the subject to lie still while awake, which is more of a challenge with children than with adults ( ;  ;  ). This can lead to increased motion artifact. One study,   found that mild, moderate, and severe motion artifact were associated with 4, 7, and 27% loss of total gray matter (GM) volume in segmentation, respectively. Furthermore, subtle motion can cause bias even when a visible artifact is absent ( ). Another core challenge is the variation in preprocessing and segmentation techniques ( ), due to a lack of a “gold standard” processing pipeline for pediatric brain images. Therefore, some studies rightfully emphasize the importance of a validated quality control protocol ( ). \n\nFreeSurfer  is an open source software suite for processing brain MRI images that is commonly used 

ipdb>  candidate_chunks[0]


'\n# Body\n '


ipdb>  candidate_chunks[1]


'Introduction \n  \nThere are multiple methodological challenges in pediatric neuroimaging studies that may affect quality of data and comparisons between studies. Magnetic resonance imaging (MRI) requires the subject to lie still while awake, which is more of a challenge with children than with adults ( ;  ;  ). This can lead to increased motion artifact. One study,   found that mild, moderate, and severe motion artifact were associated with 4, 7, and 27% loss of total gray matter (GM) volume in segmentation, respectively. Furthermore, subtle motion can cause bias even when a visible artifact is absent ( ). Another core challenge is the variation in preprocessing and segmentation techniques ( ), due to a lack of a “gold standard” processing pipeline for pediatric brain images. Therefore, some studies rightfully emphasize the importance of a validated quality control protocol ( ). \n\nFreeSurfer  is an open source software suite for processing brain MRI images that is commonly used in 


### Embed all documents

In [3]:
import openai
openai.api_key = open('/home/zorro/.keys/open_ai.key').read().strip()

In [4]:
import pickle
# pickle.dump(single_group_embeddings, open('data/single_group_embeddings.pkl', 'wb'))

In [5]:
single_group_embeddings = pickle.load(open('data/single_group_embeddings.pkl', 'rb'))

In [6]:
single_group_embeddings = pd.DataFrame(single_group_embeddings)

In [7]:
single_group_embeddings

Unnamed: 0,section_name,content,start_char,end_char,embedding,pmcid
0,Authors,"Pulli, Elmo P. and Silver, Eero and Kumpulaine...",0,323,"[-0.005361288785934448, -0.015115763060748577,...",9108497
1,Title,\n# Title\n\nFeasibility of FreeSurfer Process...,323,468,"[-0.011751627549529076, 0.026484699919819832, ...",9108497
2,Keywords,\n# Keywords\n\nbrain\nchild\nneuroimaging\nbr...,468,563,"[-0.02721841260790825, 0.02183225192129612, 0....",9108497
3,Abstract,\n# Abstract\n \nPediatric neuroimaging is a q...,563,2381,"[-0.0069894008338451385, 0.03050420992076397, ...",9108497
4,Body,\n# Body\n \n## Introduction \n \nThere are m...,2381,5253,"[-0.0003534696879796684, 0.024464335292577744,...",9108497
...,...,...,...,...,...,...
1744,Body,\n## Data Availability Statement \n \nThe raw...,31764,31956,"[-0.016607334837317467, -0.02181965485215187, ...",6908505
1745,Body,\n## Ethics Statement \n \nThe studies involv...,31956,32200,"[-0.008219190873205662, -0.012068435549736023,...",6908505
1746,Body,\n## Author Contributions \n \nKM designed th...,32200,32513,"[-0.0033632966224104166, 0.008040985092520714,...",6908505
1747,Body,\n## Funding \n \nThe study was based on two ...,32513,32823,"[0.0009335727663710713, -0.024999098852276802,...",6908505


In [43]:
_PARTICIPANTS_SECTIONS = (
    r"#+ .*(?:participants?|subjects?|patients|population).*\n"
)
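
As an illustration of how this pattern could serve as a header heuristic, the sketch below checks whether a chunk contains a matching Markdown header. The helper `looks_like_participants_section` is hypothetical, and the use of `re.IGNORECASE` is an assumption (the headers in these documents are capitalized, so a case-sensitive match would miss them).

```python
import re

# Same pattern as above, repeated so the example is self-contained
_PARTICIPANTS_SECTIONS = r"#+ .*(?:participants?|subjects?|patients|population).*\n"

def looks_like_participants_section(chunk_text):
    """Return True if the chunk contains a Markdown header naming participants."""
    return re.search(_PARTICIPANTS_SECTIONS, chunk_text, flags=re.IGNORECASE) is not None

# Chunk beginnings taken from the split output shown earlier (truncated)
participants_chunk = "\n### Participants \n \nThe participants are part of ..."
introduction_chunk = "\n## Introduction \n \nThere are multiple methodological challenges ..."

print(looks_like_participants_section(participants_chunk))
print(looks_like_participants_section(introduction_chunk))
```

The keyword must appear on the header line itself, so a chunk that merely mentions participants in its body text is not flagged.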

### Test query across all documents

Given a query, compute the average rank of the chunk that contains the human annotation.

In [8]:
import numpy as np 

def get_matching_chunks(ranks_df, annotation_df):
    """Select, for each annotation row, the chunk that contains the original annotation."""
    matches = []
    for _, row in annotation_df.iterrows():
        for ix, start_char in enumerate(row['start_char']):
            end_char = row['end_char'][ix]
            m = ranks_df[
                (ranks_df['pmcid'] == row['pmcid']) & \
                (ranks_df['start_char'] <= start_char) & (ranks_df['end_char'] >= end_char)]
            if not m.empty:
                matches.append(m)
                break
    
    return pd.concat(matches)

def get_chunk_query_distance(embeddings_df, query):
    # For every document, get distance and rank between query and embeddings
    distances, ranks = zip(*[
        query_embeddings(sub_df['embedding'].tolist(), query) 
        for pmcid, sub_df in embeddings_df.groupby('pmcid', sort=False)
    ])

    # Combine with meta-data into a df
    ranks_df = embeddings_df[['pmcid', 'content', 'start_char', 'end_char']].copy()
    ranks_df['distance'] = np.concatenate(distances)
    ranks_df['rank'] = np.concatenate(ranks)

    return ranks_df
    
def evaluate_query_across_docs(embeddings_df, annotations_df, query):
    ranks_df = get_chunk_query_distance(embeddings_df, query)
    
    mc = get_matching_chunks(ranks_df, annotations_df)

    print(
        f"Query: '{query}'\n"
        f"Mean rank: {mc['rank'].mean():.2f}, "
        f"top 1 %: {(mc['rank'] == 0).mean():.2f}, "
        # Ranks are 0-based, so the top 3 chunks have rank < 3
        f"top 3 %: {(mc['rank'] < 3).mean():.2f}"
    )
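
To make the reported metrics concrete, here is a small worked example with made-up ranks. It assumes ranks are 0-based (rank 0 is the closest chunk), so "top 1" means `rank == 0` and "top 3" means `rank < 3`.

```python
# Hypothetical 0-based ranks of the annotated chunk in five documents
toy_ranks = [0, 0, 2, 5, 1]
n = len(toy_ranks)

mean_rank = sum(toy_ranks) / n
top1 = sum(r == 0 for r in toy_ranks) / n   # fraction ranked first
top3 = sum(r < 3 for r in toy_ranks) / n    # fraction ranked in the first three

print(f"Mean rank: {mean_rank:.2f}, top 1: {top1:.2f}, top 3: {top3:.2f}")
```

Note that the values labeled "%" in this notebook are fractions between 0 and 1, not percentages.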

In [9]:
queries = ['ljskfdklsjdfk', 'Methods section', 'Number of participants',
           'The number of subjects or participants that were involved in the study or underwent MRI',
           'How many participants or subjects were recruited for this study?',
           'How many participants were recruited for this study?']

In [10]:
for q in queries:
    evaluate_query_across_docs(single_group_embeddings, single_group, q)



### Try only on Body

It looks like, *for some studies*, Jerome's annotations were made only in the Body of the paper, so it is fair to exclude any embeddings that do not come from the Body.

In [61]:
single_group_embeddings_body = single_group_embeddings[
    single_group_embeddings.section_name == 'Body'
].reset_index(drop=True)

In [12]:
# for q in queries:
#     evaluate_query_across_docs(single_group_embeddings_body, single_group, q)

### Compare to heuristic approach (looking for headers)

## Extract Sample Size from relevant section

In [53]:
from extract import extract_from_multiple
from templates import ZERO_SHOT_SAMPLE_SIZE_FUNCTION

def extract_sample_size_full_text(embeddings_df, annotations_df, query, template, num_workers=3, model_name='gpt-3.5-turbo'):
    ranks_df = get_chunk_query_distance(embeddings_df, query)
    mc = get_matching_chunks(ranks_df, annotations_df).rename(columns={'rank': 'matching_rank'})[['pmcid', 'matching_rank']]
    
    ranks_df = pd.merge(ranks_df, mc, on='pmcid')
    ranks_df['is_matching_chunk'] = ranks_df['rank'] == ranks_df['matching_rank']

    # Subset to only include top ranked chunks
    ranks_df = ranks_df[ranks_df['rank'] == 0]

    # For every chunk, apply template
    predictions = extract_from_multiple(
        ranks_df.content.to_list(), 
        **template, 
        num_workers=num_workers,
        model_name=model_name
    )

    predictions = pd.DataFrame(predictions)

    predictions['is_matching_chunk'] = ranks_df['is_matching_chunk'].tolist()
    predictions['pmcid'] = ranks_df['pmcid'].tolist()
    predictions['content'] =  ranks_df['content'].tolist()
    return predictions

In [54]:
full_text_demo = extract_sample_size_full_text(
    single_group_embeddings_body, single_group,
    'How many participants were recruited for this study?', 
    ZERO_SHOT_SAMPLE_SIZE_FUNCTION
)

100%|███████████████████████████████████████████| 69/69 [00:18<00:00,  3.66it/s]


#### Evaluation

In [55]:
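def is_within_percentage(num1, num2, percentage=10):
    """Return True if num2 is within `percentage` percent of num1 (relative to num1)."""
    num1, num2 = int(num1), int(num2)
    if num1 == num2:
        return True
    if num1 == 0:
        # Avoid dividing by zero; a zero value only matches another zero
        return False
    return abs(num1 - num2) / abs(num1) <= percentage / 100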
    
def _print_evaluation(predictions_df, annotations_df):

    # Combine annotations with predicted values
    eval_df = annotations_df.reset_index()[['count', 'pmcid']].rename(columns={'count': 'annot_count'})
    predictions_df = pd.merge(predictions_df, eval_df)
    predictions_df['correct'] = predictions_df['count'] == predictions_df['annot_count']
    
    wrong_chunk = predictions_df[predictions_df.is_matching_chunk == False]

    matching = predictions_df[predictions_df.is_matching_chunk]

    non_na_ix = ((pd.isna(matching['count']) == False) & (matching['count'] != 0))

    nonna = matching[non_na_ix]

    print(f"""
    Accuracy: {predictions_df['correct'].mean():.2f}
    % FTS chose a chunk w/ annotated information: {predictions_df.is_matching_chunk.mean():.2f}
    %  null when wrong chunk: {pd.isna(wrong_chunk['count']).mean():.2f}
    Accuracy for cases when correct chunk was given to LLM: {matching['correct'].mean():.2f}
    % LLM reported a non-na value when correct chunk was given: {non_na_ix.mean():.2f}
    Accuracy for non-NA values w/ correct chunk given: {nonna['correct'].mean():.2f}
    Accuracy within 10%: {(nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 10), axis=1)).mean():.2f}
    Accuracy within 20%: {(nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 20), axis=1)).mean():.2f}
    Accuracy within 30%: {(nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 30), axis=1)).mean():.2f}""")

    return predictions_df, nonna
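
The tolerance check used in the "within 10/20/30%" lines can be illustrated with a standalone sketch. `within_tolerance` is a hypothetical stand-in with the same logic as `is_within_percentage`, and the numbers are made up. One thing worth noticing is that the error is measured relative to the first argument (the predicted count), so the check is not symmetric.

```python
def within_tolerance(predicted, annotated, percentage=10):
    """True if annotated is within `percentage` percent of predicted (relative to predicted)."""
    predicted, annotated = int(predicted), int(annotated)
    if predicted == annotated:
        return True
    if predicted == 0:
        # Guard against division by zero
        return False
    return abs(predicted - annotated) / abs(predicted) <= percentage / 100

# Made-up examples
print(within_tolerance(95, 100))        # error 5/95, about 5.3%
print(within_tolerance(100, 130, 25))   # error 30/100 = 30%
print(within_tolerance(130, 100, 25))   # error 30/130, about 23.1%
```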

In [56]:
full_text_demo, ft_nonna  = _print_evaluation(full_text_demo, single_group)


    Accuracy: 0.72
    % FTS chose a chunk w/ annotated information: 0.91
    %  null when wrong chunk: 0.17
    Accuracy for cases when correct chunk was given to LLM: 0.79
    % LLM reported a non-na value when correct chunk was given: 1.00
    Accuracy for non-NA values w/ correct chunk given: 0.79
    Accuracy within 10%: 0.86
    Accuracy within 20%: 0.97
    Accuracy within 30%: 0.98
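
To clarify how the conditional metrics above relate to each other, here is a toy computation on made-up predictions. The tuples and the helper `frac` are hypothetical; the point is only that overall accuracy mixes two effects: whether retrieval picked a chunk containing the annotation, and whether the LLM extracted the right number from it.

```python
# Made-up rows: (predicted count or None, annotated count, retrieved chunk was correct)
toy = [
    (30, 30, True),
    (28, 30, True),
    (45, 45, True),
    (12, 20, False),
    (None, 18, False),
]

def frac(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)

accuracy = frac(pred == annot for pred, annot, _ in toy)
chunk_hit_rate = frac(hit for _, _, hit in toy)
null_when_wrong = frac(pred is None for pred, _, hit in toy if not hit)
accuracy_when_hit = frac(pred == annot for pred, annot, hit in toy if hit)

print(f"Accuracy: {accuracy:.2f}")
print(f"Chunk hit rate: {chunk_hit_rate:.2f}")
print(f"Null when wrong chunk: {null_when_wrong:.2f}")
print(f"Accuracy when correct chunk: {accuracy_when_hit:.2f}")
```

A high "null when wrong chunk" rate is desirable: it means the model declines to answer rather than reporting an unrelated number.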


#### Summary

When the correct chunk is given to GPT-3.5, it can extract a sample size value (although typically not the final N) most of the time, with relatively few gross errors: 79% of values match exactly and 97% fall within 20% of the annotation.

However, when given an incorrect chunk, it often fails to return `null` when it should (only 17% of wrong chunks yielded a null).

#### Incorrect responses

In [57]:
incorrect = ft_nonna[ft_nonna['correct'] == False]

- The majority are only off by a few, due to complex exclusion criteria.
- Attempting to change the prompt to identify the number of excluded subjects actually lowers accuracy, and n deviates further from either the final N or the total N (and subtracting the two numbers doesn't help).
- In one case (1/10), the annotation itself is incorrect.
- Sometimes the models confuse other information (e.g., ROIs) for demographic information.
  The models are generally good at returning `n/a` for these sections, but sometimes (unstably) fail to.

## GPT-4 Full Text

In [58]:
full_text_gpt4 = extract_sample_size_full_text(
    single_group_embeddings_body, single_group,
    'How many participants were recruited for this study?',
    ZERO_SHOT_SAMPLE_SIZE_FUNCTION,
    model_name='gpt-4',
    num_workers=2
)

100%|███████████████████████████████████████████| 69/69 [03:17<00:00,  2.86s/it]


In [59]:
full_text_gpt4, ft_gpt4_nonna  = _print_evaluation(full_text_gpt4, single_group)


    Accuracy: 0.86
    % FTS chose a chunk w/ annotated information: 0.91
    %  null when wrong chunk: 0.83
    Accuracy for cases when correct chunk was given to LLM: 0.92
    % LLM reported a non-na value when correct chunk was given: 1.00
    Accuracy for non-NA values w/ correct chunk given: 0.92
    Accuracy within 10%: 0.94
    Accuracy within 20%: 0.98
    Accuracy within 30%: 0.98


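GPT-4 is more accurate (0.86 vs. 0.72 overall; 0.92 vs. 0.79 when given the correct chunk). More importantly, it is far less likely to hallucinate a value: it returns null for 83% of wrong chunks, compared with 17% for GPT-3.5.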

In [60]:
import pickle

# Use a context manager so the file is flushed and closed
with open('data/full_text_gpt4_single_group.pkl', 'wb') as f:
    pickle.dump(full_text_gpt4, f)