# Full Text Search evaluation with participant demographics

Here, I evaluate the strategy of finding relevant text sections using chunk embeddings.

First, I split each PMC article along its Markdown structure (and, where needed, along lines) into chunks of fewer than `n_tokens` (~4000) tokens.
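
As a rough illustration of this splitting step, here is a minimal sketch. It is not the code in `embed.py`: the function names are hypothetical, and tokens are approximated by whitespace-separated words rather than by a real tokenizer.

```python
import re


def approx_token_count(text):
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def split_markdown(text, max_tokens=4000):
    """Split at Markdown headings, then by lines if a section is still too long."""
    # Zero-width split just before every heading line (Python 3.7+).
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        if approx_token_count(section) < max_tokens:
            chunks.append(section)
            continue
        # Section too long: accumulate lines until the budget would be exceeded.
        # Note: a single line longer than max_tokens is kept whole.
        current = ""
        for line in section.splitlines(keepends=True):
            if current and approx_token_count(current + line) >= max_tokens:
                chunks.append(current)
                current = ""
            current += line
        if current:
            chunks.append(current)
    return chunks


toy_text = "# A\none two\n# B\nthree four five six\nseven eight\n"
toy_chunks = split_markdown(toy_text, max_tokens=8)
```

With a budget of 8 "tokens", section A stays whole while section B is cut at a line boundary, and joining the chunks gives back the original text.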

Next, I embed each chunk.

Finally, using a text query, I find the section of each article most relevant to participant demographics.
The query is also embedded, and a distance metric is computed between each chunk and the query.
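
To make the ranking step concrete, here is a small sketch on toy vectors. The notebook does not say which distance metric `get_chunk_query_distance` uses, so cosine distance is an assumption here, and the helper names are hypothetical. Ranks are 0-based, matching the `rank == 0` convention used later on.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity; assumed metric, not confirmed by the notebook.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_chunks(chunk_vectors, query_vector):
    """Return a 0-based rank per chunk (0 = closest to the query)."""
    distances = [cosine_distance(v, query_vector) for v in chunk_vectors]
    order = sorted(range(len(distances)), key=distances.__getitem__)
    ranks = [0] * len(distances)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


toy_chunk_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query_vector = [0.0, 1.0]
toy_ranks = rank_chunks(toy_chunk_vectors, toy_query_vector)
```

Here the third chunk points in the same direction as the query and gets rank 0, while the first chunk is orthogonal to it and gets the last rank.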

To evaluate this method, I check whether it correctly identifies the section where human annotators found demographic info.

## Setup

In [1]:
import pandas as pd
from labelrepo.projects.participant_demographics import get_participant_demographics
from labelrepo.database import get_database_connection

subgroups = get_participant_demographics(include_locations=True)
docs_info = pd.read_sql(
    "select pmcid, publication_year, title, text from document",
    get_database_connection(),
)

In [2]:
# Only look at single group documents
jerome_pd = subgroups[(subgroups.project_name == 'participant_demographics') & \
                      (subgroups.annotator_name == 'Jerome_Dockes')]

counts = jerome_pd.groupby('pmcid').count().reset_index()
single_group_pmcids = counts[counts['count'] == 1].pmcid
single_group = jerome_pd[jerome_pd.pmcid.isin(single_group_pmcids)]
all_pd_docs = docs_info[docs_info.pmcid.isin(jerome_pd.pmcid)]

### Embed all documents

In [3]:
import openai
from embed import embed_pmc_articles
with open('/home/zorro/.keys/open_ai.key') as key_file:
    openai.api_key = key_file.read().strip()

In [328]:
all_embeddings = embed_pmc_articles(all_pd_docs.to_dict(orient='records'))

100%|█████████████████████████████████████████| 156/156 [08:30<00:00,  3.27s/it]


In [398]:
import pickle
with open('data/all_embeddings.pkl', 'wb') as f:
    pickle.dump(all_embeddings, f)
# all_embeddings = pickle.load(open('data/all_embeddings.pkl', 'rb'))

In [332]:
all_embeddings = pd.DataFrame(all_embeddings)

### Test query across all documents

Given a query, compute the average rank of the chunk that contains the human annotation.
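
The summary statistics reported below can be sketched on a plain list of ranks. This is only an illustration with a hypothetical helper name; the actual evaluation works on the DataFrame returned by `get_chunk_query_distance`.

```python
def summarize_ranks(matching_ranks, k=3):
    """Summarize 0-based ranks of the chunks that contain the annotation."""
    n = len(matching_ranks)
    return {
        "mean_rank": sum(matching_ranks) / n,
        # Fraction of documents whose matching chunk is ranked first.
        "top_1": sum(r == 0 for r in matching_ranks) / n,
        # Fraction of documents whose matching chunk is among the first k.
        "top_k": sum(r < k for r in matching_ranks) / n,
    }


toy_summary = summarize_ranks([0, 0, 2, 5])
```

Since ranks are 0-based, a mean rank below 1 means the matching chunk is usually the top hit.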

In [377]:
import numpy as np 
from embed import query_embeddings, get_chunks_heuristic, get_chunk_query_distance
import re

def get_matching_chunks(ranks_df, annotation_df):
    """ Select the chunks that contain the original annotation """
    matches = []
    for _, row in annotation_df.iterrows():
        for ix, start_char in enumerate(row['start_char']):
            end_char = row['end_char'][ix]
            m = ranks_df[
                (ranks_df['pmcid'] == row['pmcid']) & \
                (ranks_df['start_char'] <= start_char) & (ranks_df['end_char'] >= end_char)]
            if not m.empty:
                matches.append(m)
                break
    
    return pd.concat(matches)

def evaluate_query_across_docs(embeddings_df, annotations_df, query):
    ranks_df = get_chunk_query_distance(embeddings_df, query)
    
    mc = get_matching_chunks(ranks_df, annotations_df)

    print(
        f"Query: '{query}' \nMean rank: {mc['rank'].mean():.2f},\
        Percentage match: {mc.pmcid.unique().shape[0] / embeddings_df.pmcid.unique().shape[0]:.2f}\
        top 1 %: {(mc['rank'] == 0).mean():.2f}, \
        top 3 %: {(mc['rank'] < 3).mean():.2f}"
    )

def evaluate_query_plus_heuristic(embeddings_df, annotations_df, query, use_heuristic=True, section_2=True):
    # Use heuristic to pre-select section
    # Fall back to searching entire document if this fails
    if use_heuristic:
        embeddings_df = get_chunks_heuristic(embeddings_df, section_2=section_2)

    # Rank chunks
    ranks_df = get_chunk_query_distance(embeddings_df, query)

    # Take only the top-ranking chunk, and keep it only if it matches the annotation
    top_1 = ranks_df[ranks_df['rank'] == 0]
    mc_1 = get_matching_chunks(top_1, annotations_df)

    # Same for the three top-ranking chunks
    top_3 = ranks_df[ranks_df['rank'] < 3]
    mc_3 = get_matching_chunks(top_3, annotations_df)


    print(
        f"Query: '{query}' \n\
        % match top 1: {mc_1.pmcid.unique().shape[0] / embeddings_df.pmcid.unique().shape[0]:.2f} \n\
        % match top 3: {mc_3.pmcid.unique().shape[0] / embeddings_df.pmcid.unique().shape[0]:.2f}"
    )
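
The containment test inside `get_matching_chunks` can be illustrated on toy character spans. This sketch uses a hypothetical helper and made-up offsets. It also shows one way a document can end up without any matching chunk: an annotation that straddles a chunk boundary is not fully contained in either neighbour.

```python
def find_containing_chunk(chunk_spans, annot_start, annot_end):
    """Return the index of the first chunk that fully contains the annotation, else None."""
    for ix, (chunk_start, chunk_end) in enumerate(chunk_spans):
        if chunk_start <= annot_start and chunk_end >= annot_end:
            return ix
    return None


# Made-up (start_char, end_char) spans for three consecutive chunks.
toy_spans = [(0, 100), (100, 250), (250, 400)]
inside = find_containing_chunk(toy_spans, 120, 180)      # fully inside chunk 1
straddling = find_containing_chunk(toy_spans, 90, 110)   # crosses the 0/1 boundary
```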

### Semantic search

In [366]:
queries = ['ljskfdklsjdfk', 'Methods section', 'Number of participants',
           'The number of subjects or participants that were involved in the study or underwent MRI',
           'How many participants or subjects were recruited for this study?',
           'How many participants were recruited for this study?']

In [367]:
for q in queries:
    evaluate_query_across_docs(all_embeddings, jerome_pd, q)

Query: 'ljskfdklsjdfk' 
Mean rank: 7.40,        Percentage match: 1.00        top 1 %: 0.01,         top 3 %: 0.20
Query: 'Methods section' 
Mean rank: 6.93,        Percentage match: 1.00        top 1 %: 0.06,         top 3 %: 0.29
Query: 'Number of participants' 
Mean rank: 3.40,        Percentage match: 1.00        top 1 %: 0.60,         top 3 %: 0.72
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
Mean rank: 3.48,        Percentage match: 1.00        top 1 %: 0.36,         top 3 %: 0.71
Query: 'How many participants or subjects were recruited for this study?' 
Mean rank: 3.19,        Percentage match: 1.00        top 1 %: 0.64,         top 3 %: 0.77
Query: 'How many participants were recruited for this study?' 
Mean rank: 3.16,        Percentage match: 1.00        top 1 %: 0.62,         top 3 %: 0.78


### Try only on Body

It looks like, *for some studies*, Jerome's annotations were made only in the Body of the paper, so it seems fair to exclude any embeddings from outside the Body.

In [338]:
all_embeddings_body = all_embeddings[all_embeddings.section_0 == 'Body']

In [339]:
for q in queries:
    evaluate_query_across_docs(all_embeddings_body, jerome_pd, q)

Query: 'ljskfdklsjdfk' 
Mean rank: 4.35,        Percentage match: 0.90        top 1 %: 0.09,         top 3 %: 0.42
Query: 'Methods section' 
Mean rank: 3.49,        Percentage match: 0.90        top 1 %: 0.14,         top 3 %: 0.48
Query: 'Number of participants' 
Mean rank: 0.58,        Percentage match: 0.90        top 1 %: 0.73,         top 3 %: 0.91
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
Mean rank: 1.21,        Percentage match: 0.90        top 1 %: 0.43,         top 3 %: 0.88
Query: 'How many participants or subjects were recruited for this study?' 
Mean rank: 0.50,        Percentage match: 0.90        top 1 %: 0.75,         top 3 %: 0.95
Query: 'How many participants were recruited for this study?' 
Mean rank: 0.49,        Percentage match: 0.90        top 1 %: 0.74,         top 3 %: 0.95


For some documents, the correct passage appears to lie outside the Body (likely in the Abstract), which would explain why the percentage match drops to 0.90.

### Single group only

In [340]:
single_group_embeddings_body = all_embeddings_body[all_embeddings_body.pmcid.isin(single_group.pmcid)]

In [341]:
for q in queries:
    evaluate_query_across_docs(single_group_embeddings_body, jerome_pd, q)

Query: 'ljskfdklsjdfk' 
Mean rank: 3.33,        Percentage match: 0.92        top 1 %: 0.09,         top 3 %: 0.65
Query: 'Methods section' 
Mean rank: 2.52,        Percentage match: 0.92        top 1 %: 0.20,         top 3 %: 0.67
Query: 'Number of participants' 
Mean rank: 0.25,        Percentage match: 0.92        top 1 %: 0.88,         top 3 %: 0.99
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
Mean rank: 0.70,        Percentage match: 0.92        top 1 %: 0.64,         top 3 %: 0.91
Query: 'How many participants or subjects were recruited for this study?' 
Mean rank: 0.19,        Percentage match: 0.92        top 1 %: 0.88,         top 3 %: 0.99
Query: 'How many participants were recruited for this study?' 
Mean rank: 0.16,        Percentage match: 0.92        top 1 %: 0.91,         top 3 %: 0.99


## Try combined approach (heuristic + embedding fallback)

In [378]:
for q in queries:
    evaluate_query_plus_heuristic(all_embeddings, jerome_pd, q, use_heuristic=False)

Query: 'ljskfdklsjdfk' 
        % match top 1: 0.01 
        % match top 3: 0.27
Query: 'Methods section' 
        % match top 1: 0.10 
        % match top 3: 0.37
Query: 'Number of participants' 
        % match top 1: 0.71 
        % match top 3: 0.82
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
        % match top 1: 0.46 
        % match top 3: 0.80
Query: 'How many participants or subjects were recruited for this study?' 
        % match top 1: 0.74 
        % match top 3: 0.86
Query: 'How many participants were recruited for this study?' 
        % match top 1: 0.73 
        % match top 3: 0.86


In [379]:
for q in queries:
    evaluate_query_plus_heuristic(all_embeddings, jerome_pd, q, section_2=False)

Query: 'ljskfdklsjdfk' 
        % match top 1: 0.30 
        % match top 3: 0.71
Query: 'Methods section' 
        % match top 1: 0.38 
        % match top 3: 0.71
Query: 'Number of participants' 
        % match top 1: 0.76 
        % match top 3: 0.86
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
        % match top 1: 0.47 
        % match top 3: 0.82
Query: 'How many participants or subjects were recruited for this study?' 
        % match top 1: 0.80 
        % match top 3: 0.86
Query: 'How many participants were recruited for this study?' 
        % match top 1: 0.78 
        % match top 3: 0.86


In [380]:
for q in queries:
    evaluate_query_plus_heuristic(all_embeddings, jerome_pd, q, section_2=True)

Query: 'ljskfdklsjdfk' 
        % match top 1: 0.67 
        % match top 3: 0.83
Query: 'Methods section' 
        % match top 1: 0.68 
        % match top 3: 0.82
Query: 'Number of participants' 
        % match top 1: 0.81 
        % match top 3: 0.84
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
        % match top 1: 0.73 
        % match top 3: 0.84
Query: 'How many participants or subjects were recruited for this study?' 
        % match top 1: 0.82 
        % match top 3: 0.84
Query: 'How many participants were recruited for this study?' 
        % match top 1: 0.82 
        % match top 3: 0.84


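Using the heuristic seems to increase the chance that the correct chunk is ranked first (top 1 rises from about 0.73 to 0.82 for the best queries), but it also slightly increases overall misses (top 3 drops from 0.86 to 0.84 with `section_2=True`).

The best approach is likely to use just the Methods-only heuristic.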
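
For reference, the "pre-select, else fall back" idea can be sketched as follows. This is not the implementation of `get_chunks_heuristic`, which lives in `embed.py` and is not shown here. The `section_1` key, the regular expression, and the function name are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a Methods section can be recognised from its title.
METHODS_PATTERN = re.compile(r"method", re.IGNORECASE)


def select_chunks_with_fallback(chunks):
    """Keep only Methods chunks per document; keep all chunks if none are found."""
    by_doc = defaultdict(list)
    for chunk in chunks:
        by_doc[chunk["pmcid"]].append(chunk)

    selected = []
    for doc_chunks in by_doc.values():
        methods = [
            c for c in doc_chunks
            if METHODS_PATTERN.search(c.get("section_1") or "")
        ]
        # Fall back to the whole document when the heuristic finds nothing.
        selected.extend(methods if methods else doc_chunks)
    return selected


toy_doc_chunks = [
    {"pmcid": 1, "section_1": "Introduction", "content": "a"},
    {"pmcid": 1, "section_1": "Materials and Methods", "content": "b"},
    {"pmcid": 2, "section_1": "Results", "content": "c"},
    {"pmcid": 2, "section_1": None, "content": "d"},
]
toy_selected = select_chunks_with_fallback(toy_doc_chunks)
```

Document 1 is reduced to its Methods chunk, while document 2 has no recognisable Methods section and keeps all of its chunks for the embedding search.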

# Extract Sample Size from relevant section (single group only)

In [393]:
from extract import extract_from_multiple
from templates import ZERO_SHOT_SAMPLE_SIZE_FUNCTION

def extract_sample_size_full_text(embeddings_df, annotations_df, query, template, num_workers=3, model_name='gpt-3.5-turbo'):
    ranks_df = get_chunk_query_distance(embeddings_df, query)
    mc = get_matching_chunks(ranks_df, annotations_df).rename(columns={'rank': 'matching_rank'})[['pmcid', 'matching_rank']]
    
    ranks_df = pd.merge(ranks_df, mc, on='pmcid')
    ranks_df['is_matching_chunk'] = ranks_df['rank'] == ranks_df['matching_rank']

    # Subset to only include top ranked chunks
    ranks_df = ranks_df[ranks_df['rank'] == 0]

    # For every chunk, apply template
    predictions = extract_from_multiple(
        ranks_df.content.to_list(), 
        **template, 
        num_workers=num_workers,
        model_name=model_name
    )

    predictions = pd.DataFrame(predictions)

    predictions['is_matching_chunk'] = ranks_df['is_matching_chunk'].tolist()
    predictions['pmcid'] = ranks_df['pmcid'].tolist()
    predictions['content'] =  ranks_df['content'].tolist()
    return predictions

In [394]:
full_text_demo = extract_sample_size_full_text(
    single_group_embeddings_body, single_group,
    'How many participants were recruited for this study?', 
    ZERO_SHOT_SAMPLE_SIZE_FUNCTION
)

100%|███████████████████████████████████████████| 69/69 [00:11<00:00,  5.77it/s]


In [392]:
single_group_embeddings_body.pmcid.unique().shape[0]

75

#### Evaluation

In [395]:
def is_within_percentage(num1, num2, percentage=10):
    if int(num1) == int(num2):
        return True
    if abs(int(num1) - int(num2)) / int(num1) <= percentage / 100:
        return True
    return False
    
def _print_evaluation(predictions_df, annotations_df):

    # Combine annotations with predicted values
    eval_df = annotations_df.reset_index()[['count', 'pmcid']].rename(columns={'count': 'annot_count'})
    predictions_df = pd.merge(predictions_df, eval_df)
    predictions_df['correct'] = predictions_df['count'] == predictions_df['annot_count']
    
    wrong_chunk = predictions_df[predictions_df.is_matching_chunk == False]

    matching = predictions_df[predictions_df.is_matching_chunk]

    non_na_ix = ((pd.isna(matching['count']) == False) & (matching['count'] != 0))

    nonna = matching[non_na_ix]

    print(f"""
    Accuracy: {predictions_df['correct'].mean():.2f}
    % FTS chose a chunk w/ annotated information: {predictions_df.is_matching_chunk.mean():.2f}
    %  null when wrong chunk: {pd.isna(wrong_chunk['count']).mean():.2f}
    Accuracy for cases when correct chunk was given to LLM: {matching['correct'].mean():.2f}
    % LLM reported a non-na value when correct chunk was given: {non_na_ix.mean():.2f}
    Accuracy for non-NA values w/ correct chunk given: {nonna['correct'].mean():.2f}
    Accuracy within 10%: {(nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 10), axis=1)).mean():.2f}
    Accuracy within 20%: {(nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 20), axis=1)).mean():.2f}
    Accuracy within 30%: {(nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 30), axis=1)).mean():.2f}""")

    return predictions_df, nonna
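
One detail of `is_within_percentage` worth keeping in mind: the relative difference is divided by the first argument, which in `_print_evaluation` is the predicted count, so the check is not symmetric. The sketch below re-implements the same rule under a hypothetical name to show this on two toy pairs.

```python
def within_percentage(predicted, annotated, percentage=10):
    """Same rule as is_within_percentage: error relative to the first argument."""
    predicted, annotated = int(predicted), int(annotated)
    if predicted == annotated:
        return True
    return abs(predicted - annotated) / predicted <= percentage / 100


# Over-predicting by 2 on an annotated N of 18: 2 / 20 = 10%, accepted.
over = within_percentage(20, 18, 10)
# Under-predicting by 2 on an annotated N of 20: 2 / 18 is about 11%, rejected.
under = within_percentage(18, 20, 10)
```

Dividing by the annotated count instead would make the tolerance relative to the ground truth, but that would be a change from what the notebook reports.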

In [396]:
full_text_demo, ft_nonna  = _print_evaluation(full_text_demo, single_group)


    Accuracy: 0.72
    % FTS chose a chunk w/ annotated information: 0.91
    %  null when wrong chunk: 0.33
    Accuracy for cases when correct chunk was given to LLM: 0.79
    % LLM reported a non-na value when correct chunk was given: 1.00
    Accuracy for non-NA values w/ correct chunk given: 0.79
    Accuracy within 10%: 0.84
    Accuracy within 20%: 0.97
    Accuracy within 30%: 0.98


#### Summary

When the correct chunk is given to GPT-3.5, it can extract a sample size value most of the time (although typically not the final N), with relatively few gross errors.

However, when given the incorrect chunk, it often fails to return `null` when it should.

#### Incorrect responses

In [57]:
incorrect = ft_nonna[ft_nonna['correct'] == False]

- The majority are only off by a few, due to complex exclusion criteria.
- Attempting to change the prompt to identify the number of excluded subjects actually makes accuracy go down, and n deviates further from either final N or total N (and subtracting the two numbers doesn't help).
- In one case, the annotation is actually incorrect (1/10).
- Sometimes the models confused other info for demographic information (e.g. ROIs).
  The models seem to be good at putting `n/a` for these sections, but sometimes (in an unstable manner) fail.

## GPT-4 Full Text

In [58]:
full_text_gpt4 = extract_sample_size_full_text(
    single_group_embeddings_body, single_group,
    'How many participants were recruited for this study?',
    ZERO_SHOT_SAMPLE_SIZE_FUNCTION,
    model_name='gpt-4',
    num_workers=2
)

100%|███████████████████████████████████████████| 69/69 [03:17<00:00,  2.86s/it]


In [59]:
full_text_gpt4, ft_gpt4_nonna  = _print_evaluation(full_text_gpt4, single_group)


    Accuracy: 0.86
    % FTS chose a chunk w/ annotated information: 0.91
    %  null when wrong chunk: 0.83
    Accuracy for cases when correct chunk was given to LLM: 0.92
    % LLM reported a non-na value when correct chunk was given: 1.00
    Accuracy for non-NA values w/ correct chunk given: 0.92
    Accuracy within 10%: 0.94
    Accuracy within 20%: 0.98
    Accuracy within 30%: 0.98


GPT-4 is more accurate (0.86 vs. 0.72 overall) and, more importantly, is much more likely to return `null` when given the wrong chunk (0.83 vs. 0.33), i.e. it is less likely to hallucinate an answer.
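
As a rough sanity check on how the reported numbers fit together, overall accuracy can be decomposed as `p * acc_correct_chunk + (1 - p) * acc_wrong_chunk`, where `p` is the fraction of documents for which the top chunk contained the annotation. The sketch below solves for the last term using the rounded values printed above. Because those inputs are rounded to two decimals and only about 9% of the 69 documents got the wrong chunk, the results are only indicative. The function name is hypothetical.

```python
def implied_accuracy_on_wrong_chunks(overall_acc, p_correct_chunk, acc_given_correct):
    """Solve overall = p * acc_correct + (1 - p) * acc_wrong for acc_wrong."""
    p_wrong = 1.0 - p_correct_chunk
    return (overall_acc - p_correct_chunk * acc_given_correct) / p_wrong


# Inputs are the rounded figures printed by _print_evaluation above.
gpt35_wrong = implied_accuracy_on_wrong_chunks(0.72, 0.91, 0.79)
gpt4_wrong = implied_accuracy_on_wrong_chunks(0.86, 0.91, 0.92)
print(f"GPT-3.5: {gpt35_wrong:.2f}, GPT-4: {gpt4_wrong:.2f}")
```

For GPT-3.5 the implied value is close to zero, so essentially all of its correct answers come from documents where retrieval picked the right chunk. In both cases retrieval puts a ceiling on end-to-end accuracy.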

In [60]:
import pickle
pickle.dump(full_text_gpt4, open('data/full_text_gpt4_single_group.pkl', 'wb'))