# Full Text Search evaluation with participant demographics

Here, I evaluate the strategy of finding relevant text sections using chunk embeddings.

First, I split PMC articles on Markdown structure (and on lines) into chunks of fewer than `n_tokens` (~4000) tokens.
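
As a rough illustration of this step, the sketch below starts a new chunk at every Markdown heading and otherwise packs lines greedily until a token budget is reached. It is not the code in `embed.py`: `split_markdown`, the whitespace-based `count_tokens`, and the dictionary keys are assumptions made for illustration, and a real pipeline would count tokens with the embedding model's tokenizer.

```python
import re


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def split_markdown(text, n_tokens=4000):
    """Pack lines into chunks of at most n_tokens, breaking at Markdown headings.

    A single line longer than n_tokens still becomes its own (oversized) chunk.
    """
    chunks = []
    current, current_tokens = [], 0
    start = 0  # character offset where the current chunk begins
    pos = 0    # character offset of the line being processed

    def flush(end):
        nonlocal current, current_tokens, start
        if current:
            chunks.append(
                {"content": "".join(current), "start_char": start, "end_char": end}
            )
        current, current_tokens, start = [], 0, end

    for line in text.splitlines(keepends=True):
        line_tokens = count_tokens(line)
        is_heading = re.match(r"#{1,6}\s", line) is not None
        # Close the running chunk at a heading or when the budget would be exceeded
        if current and (is_heading or current_tokens + line_tokens > n_tokens):
            flush(pos)
        current.append(line)
        current_tokens += line_tokens
        pos += len(line)
    flush(pos)
    return chunks


doc = (
    "# Title\nintro words here\n"
    "## Methods\nWe recruited 24 adults.\nAll gave consent.\n"
)
chunks = split_markdown(doc, n_tokens=6)
```

Keeping `start_char` and `end_char` for every chunk is what later makes it possible to check whether an annotated span falls inside a given chunk.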

Next, I embed each chunk.

Finally, given a text query, I find the chunk of each article that is most relevant to participant demographics.
The query is also embedded, and a distance is computed between the query and each chunk.
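
The ranking step can be sketched as follows. The notebook only says that "a distance metric" is used, so cosine distance here is an assumption, and `rank_chunks` is a hypothetical stand-in for `query_embeddings` (whose internals are not shown); the toy vectors are made up.

```python
import numpy as np


def rank_chunks(chunk_embeddings, query_embedding):
    """Return (distances, ranks); rank 0 is the chunk closest to the query."""
    chunks = np.asarray(chunk_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Cosine distance = 1 - cosine similarity (assumed metric)
    sims = chunks @ query / (np.linalg.norm(chunks, axis=1) * np.linalg.norm(query))
    distances = 1.0 - sims
    # argsort of argsort turns distances into 0-based ranks
    ranks = np.argsort(np.argsort(distances))
    return distances, ranks


# Three toy 2-d "chunk embeddings" and one query vector
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
distances, ranks = rank_chunks(chunk_vecs, [1.0, 0.1])
```

Ranks are computed separately for each article, so every document has exactly one chunk with rank 0.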

To evaluate this method, I check whether it correctly identifies the section where human annotators found demographic information.

## Setup

In [95]:
import pandas as pd
from labelrepo.projects.participant_demographics import get_participant_demographics
from labelrepo.database import get_database_connection

subgroups = get_participant_demographics(include_locations=True)
docs_info = pd.read_sql(
    "select pmcid, publication_year, title, text from document",
    get_database_connection(),
)

In [93]:
# Only look at single group documents
jerome_pd = subgroups[(subgroups.project_name == 'participant_demographics') & \
                      (subgroups.annotator_name == 'Jerome_Dockes')]

counts = jerome_pd.groupby('pmcid').count().reset_index()
single_group_pmcids = counts[counts['count'] == 1].pmcid
single_group = jerome_pd[jerome_pd.pmcid.isin(single_group_pmcids)]
single_group_docs = docs_info[docs_info.pmcid.isin(single_group.pmcid)]

### Embed all documents

In [5]:
import openai
openai.api_key = open('/home/zorro/.keys/open_ai.key').read().strip()

In [6]:
from embed import embed_pmc_articles, query_embeddings
texts = single_group_docs[['pmcid', 'text']].to_dict(orient='records')
# single_group_embeddings = embed_pmc_articles(texts)

In [7]:
import pickle
# pickle.dump(single_group_embeddings, open('data/single_group_embeddings.pkl', 'wb'))

In [8]:
single_group_embeddings = pickle.load(open('data/single_group_embeddings.pkl', 'rb'))
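
The cells above compute the embeddings once, pickle them, and then comment out the expensive call. A small helper can make that caching explicit. This is only a sketch: `load_or_compute` and `fake_embed` are hypothetical names, and the demo writes to a temporary directory rather than to `data/`.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute_fn):
    """Load a pickled result from `path`, or compute and cache it if missing."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


# Demo with a stand-in for the expensive embedding call
calls = []


def fake_embed():
    calls.append(1)
    return [{"pmcid": 1, "embedding": [0.1, 0.2]}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "embeddings.pkl"
    first = load_or_compute(cache, fake_embed)   # computes and writes the cache
    second = load_or_compute(cache, fake_embed)  # reads the cache, no recomputation
```

In this notebook the call would presumably look like `load_or_compute('data/single_group_embeddings.pkl', lambda: embed_pmc_articles(texts))`.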

### Test query across all documents

Given a query, compute the mean rank of the chunk that contains the human annotation.

In [9]:
single_group_embeddings = pd.DataFrame(single_group_embeddings)

In [10]:
import numpy as np 

def get_matching_chunks(ranks_df, annotation_df):
    """Select, for each document, the chunk that contains the original annotation."""
    matches = []
    for _, row in annotation_df.iterrows():
        for ix, start_char in enumerate(row['start_char']):
            end_char = row['end_char'][ix]
            m = ranks_df[
                (ranks_df['pmcid'] == row['pmcid']) & \
                (ranks_df['start_char'] <= start_char) & (ranks_df['end_char'] >= end_char)]
            if not m.empty:
                matches.append(m)
                break
    
    return pd.concat(matches)
    
def evaluate_query_across_docs(embeddings_df, annotations_df, query):
    # For every document, get distance and rank between query and embeddings
    distances, ranks = zip(*[
        query_embeddings(sub_df['embedding'].tolist(), query) 
        for pmcid, sub_df in embeddings_df.groupby('pmcid', sort=False)
    ])

    # Combine with meta-data into a df
    ranks_df = embeddings_df[['pmcid', 'start_char', 'end_char']].copy()
    ranks_df['distance'] = np.concatenate(distances)
    ranks_df['rank'] = np.concatenate(ranks)
    
    mc = get_matching_chunks(ranks_df, annotations_df)

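    # Ranks appear to be 0-based (rank 0 is the closest chunk), so `rank <= 3`
    # counts the four closest chunks, not three.
    print(
        f"Query: '{query}' \nMean rank: {mc['rank'].mean():.2f},\
        top 1 %: {(mc['rank'] == 0).mean():.2f}, \
        top 3 %: {(mc['rank'] <= 3).mean():.2f}"
    )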
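
To make the matching rule concrete, here is a toy run of the same containment test on made-up data (the chunk boundaries, ranks, and annotation spans below are invented for illustration). A chunk matches when it fully contains an annotated character span.

```python
import pandas as pd

# Two documents: three chunks for pmcid 1, two chunks for pmcid 2
ranks_df = pd.DataFrame({
    "pmcid":      [1, 1, 1, 2, 2],
    "start_char": [0, 100, 200, 0, 150],
    "end_char":   [100, 200, 300, 150, 400],
    "rank":       [2, 0, 1, 1, 0],
})
# One annotated span per document, stored as lists like in the notebook
annotations = pd.DataFrame({
    "pmcid":      [1, 2],
    "start_char": [[120], [30]],
    "end_char":   [[140], [45]],
})

matched = []
for _, row in annotations.iterrows():
    for start, end in zip(row["start_char"], row["end_char"]):
        hit = ranks_df[
            (ranks_df["pmcid"] == row["pmcid"])
            & (ranks_df["start_char"] <= start)
            & (ranks_df["end_char"] >= end)
        ]
        if not hit.empty:
            matched.append(hit)
            break  # keep only the first span that falls inside a chunk

matched = pd.concat(matched)
top1 = (matched["rank"] == 0).mean()
mean_rank = matched["rank"].mean()
```

One consequence of this rule is that a span straddling a chunk boundary matches no chunk, so that document drops out of the metrics.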

In [11]:
queries = ['ljskfdklsjdfk', 'Methods section', 'Number of participants',
           'The number of subjects or participants that were involved in the study or underwent MRI',
           'How many participants or subjects were recruited for this study?',
           'How many participants were recruited for this study?']

In [12]:
# for q in queries:
#     evaluate_query_across_docs(single_group_embeddings, single_group, q)

Query: 'ljskfdklsjdfk' 
Mean rank: 5.68,        top 1 %: 0.03,         top 3 %: 0.48
Query: 'Methods section' 
Mean rank: 4.95,        top 1 %: 0.12,         top 3 %: 0.52
Query: 'Number of participants' 
Mean rank: 1.52,        top 1 %: 0.80,         top 3 %: 0.89
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
Mean rank: 1.60,        top 1 %: 0.59,         top 3 %: 0.88
Query: 'How many participants or subjects were recruited for this study?' 
Mean rank: 1.39,        top 1 %: 0.81,         top 3 %: 0.91
Query: 'How many participants were recruited for this study?' 
Mean rank: 1.32,        top 1 %: 0.84,         top 3 %: 0.91


### Try only on Body

It looks like, *for some studies*, Jerome's annotations were made only in the Body of the paper, so it is fair to exclude embeddings of chunks outside the Body.

In [13]:
single_group_embeddings_body = single_group_embeddings[
    single_group_embeddings.section_name == 'Body'
].reset_index(drop=True)

In [14]:
# for q in queries:
#     evaluate_query_across_docs(single_group_embeddings_body, single_group, q)

Query: 'ljskfdklsjdfk' 
Mean rank: 3.33,        top 1 %: 0.09,         top 3 %: 0.72
Query: 'Methods section' 
Mean rank: 2.55,        top 1 %: 0.20,         top 3 %: 0.75
Query: 'Number of participants' 
Mean rank: 0.25,        top 1 %: 0.88,         top 3 %: 0.99
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
Mean rank: 0.70,        top 1 %: 0.64,         top 3 %: 0.96
Query: 'How many participants or subjects were recruited for this study?' 
Mean rank: 0.19,        top 1 %: 0.88,         top 3 %: 0.99
Query: 'How many participants were recruited for this study?' 
Mean rank: 0.16,        top 1 %: 0.91,         top 3 %: 0.99


# Extract Sample Size from relevant section

In [85]:
from extract import extract_from_multiple
from templates import ZERO_SHOT_SAMPLE_SIZE_FUNCTION

def extract_sample_size_full_text(embeddings_df, annotations_df, query, template, num_workers=3, model_name=None):
    # For every document, get distance and rank between query and embeddings
    distances, ranks = zip(*[
        query_embeddings(sub_df['embedding'].tolist(), query) 
        for pmcid, sub_df in embeddings_df.groupby('pmcid', sort=False)
    ])

    # Combine with meta-data into a df
    ranks_df = embeddings_df[['pmcid', 'content', 'start_char', 'end_char']].copy()
    ranks_df['rank'] = np.concatenate(ranks)

    # See if chunk being fed to LLM is "correct" chunk
    mc = get_matching_chunks(ranks_df, annotations_df).rename(columns={'rank': 'matching_rank'})
    
    mc = mc[['pmcid', 'matching_rank']]
    ranks_df = pd.merge(ranks_df, mc, on='pmcid')
    ranks_df['is_matching_chunk'] = ranks_df['rank'] == ranks_df['matching_rank']

    # Subset to only include top ranked chunks
    ranks_df = ranks_df[ranks_df['rank'] == 0]

    # For every chunk, apply template
    predictions = extract_from_multiple(
        ranks_df.content.to_list(), 
        **template, 
        num_workers=num_workers,
        model_name=model_name
    )

    predictions['is_matching_chunk'] = ranks_df['is_matching_chunk'].tolist()
    predictions['pmcid'] = ranks_df['pmcid'].tolist()
    predictions['content'] =  ranks_df['content'].tolist()
    return predictions

In [16]:
full_text_demo = extract_sample_size_full_text(
    single_group_embeddings_body, single_group,
    'How many participants were recruited for this study?', 
    ZERO_SHOT_SAMPLE_SIZE_FUNCTION
)

100%|███████████████████████████████████████████| 69/69 [00:13<00:00,  5.28it/s]


#### Evaluation

In [89]:
def is_within_percentage(num1, num2, percentage=10):
    if int(num1) == int(num2):
        return True
    if abs(int(num1) - int(num2)) / int(num1) <= percentage / 100:
        return True
    return False
    
def _print_evaluation(predictions_df, annotations_df):

    # Combine annotations with predicted values
    eval_df = annotations_df.reset_index()[['count', 'pmcid']].rename(columns={'count': 'annot_count'})
    predictions_df = pd.merge(predictions_df, eval_df)
    predictions_df['correct'] = predictions_df['count'] == predictions_df['annot_count']

    # Split by whether the top-ranked chunk contained the annotation
    wrong_chunk = predictions_df[~predictions_df.is_matching_chunk]
    matching = predictions_df[predictions_df.is_matching_chunk]

    # Rows where the LLM returned a usable (non-null, non-zero) count
    non_na_ix = matching['count'].notna() & (matching['count'] != 0)
    nonna = matching[non_na_ix]

    print(f"""
    Accuracy: {predictions_df['correct'].mean():.2f}
    % FTS chose a chunk w/ annotated information: {predictions_df.is_matching_chunk.mean():.2f}
    %  null when wrong chunk: {pd.isna(wrong_chunk['count']).mean():.2f}
    Accuracy for cases when correct chunk was given to LLM: {matching['correct'].mean():.2f}
    % LLM reported a non-na value when correct chunk was given: {non_na_ix.mean():.2f}
    Accuracy for non-NA values w/ correct chunk given: {nonna['correct'].mean():.2f}
    Accuracy within 10%: {(nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 10), axis=1)).mean():.2f}
    Accuracy within 20%: {(nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 20), axis=1)).mean():.2f}
    Accuracy within 30%: {(nonna.apply(lambda x: is_within_percentage(x['count'], x['annot_count'], 30), axis=1)).mean():.2f}""")

    return predictions_df, nonna
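
One detail of the tolerance check is worth spelling out: `is_within_percentage` measures the error relative to its first argument, which `_print_evaluation` passes as the predicted count. The sketch below contrasts this with a hypothetical variant, `is_within_percentage_of_annotation`, that normalizes by the annotated value instead; the numbers are made up.

```python
def is_within_percentage(num1, num2, percentage=10):
    # Same rule as in the cell above: error is measured relative to num1
    if int(num1) == int(num2):
        return True
    return abs(int(num1) - int(num2)) / int(num1) <= percentage / 100


def is_within_percentage_of_annotation(pred, annot, percentage=10):
    # Hypothetical variant: error is measured relative to the annotated value
    if int(pred) == int(annot):
        return True
    return abs(int(pred) - int(annot)) / int(annot) <= percentage / 100


pred, annot = 8, 10
relative_to_pred = is_within_percentage(pred, annot, 20)                 # 2 / 8 = 0.25
relative_to_annot = is_within_percentage_of_annotation(pred, annot, 20)  # 2 / 10 = 0.20
```

With a 20% tolerance the same pair fails the first check and passes the second, so underestimates are penalized slightly more by the notebook's version.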

In [90]:
full_text_demo, ft_nonna  = _print_evaluation(full_text_demo, single_group)


    Accuracy: 0.68
    % FTS chose a chunk w/ annotated information: 0.91
    %  null when wrong chunk: 0.33
    Accuracy for cases when correct chunk was given to LLM: 0.75
    % LLM reported a non-na value when correct chunk was given: 1.00
    Accuracy for non-NA values w/ correct chunk given: 0.75
    Accuracy within 10%: 0.84
    Accuracy within 20%: 0.97
    Accuracy within 30%: 0.98


#### Summary

When the correct chunk is given to GPT-3.5, it extracts a sample size value most of the time (although typically not the final N), with relatively few gross errors.

However, when given an incorrect chunk, it often fails to return `null` when it should: it did so in only 33% of those cases.
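
As a sanity check on how these numbers fit together, overall accuracy can be decomposed into the contribution from matching chunks and from wrong chunks. The sketch below solves that decomposition for the accuracy on wrong chunks using the rounded values printed above; `implied_wrong_chunk_accuracy` is a hypothetical helper, and the result is only as precise as the two-decimal inputs.

```python
def implied_wrong_chunk_accuracy(overall_acc, p_match, acc_match):
    """Solve overall = p_match * acc_match + (1 - p_match) * acc_wrong for acc_wrong."""
    return (overall_acc - p_match * acc_match) / (1 - p_match)


# Values reported above: accuracy 0.68, matching-chunk rate 0.91,
# accuracy on matching chunks 0.75
gpt35 = implied_wrong_chunk_accuracy(overall_acc=0.68, p_match=0.91, acc_match=0.75)
```

The implied value is approximately zero (slightly negative only because the inputs are rounded), which suggests that essentially all correct answers come from documents where the top-ranked chunk contained the annotation.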

#### Incorrect responses

In [58]:
incorrect = ft_nonna[ft_nonna['correct'] == False]

- The majority are only off by a few participants, due to complex exclusion criteria.
- Attempting to change the prompt to identify the number of excluded subjects actually lowers accuracy, and the extracted n deviates further from both the final N and the total N (subtracting the two numbers doesn't help).
- In one case (1/10), the annotation itself is incorrect.
- Sometimes the models confuse other information (e.g. ROIs) for demographic information.
  The models seem to be good at returning `n/a` for these sections, but occasionally (and not reproducibly) fail.

## GPT-4 Full Text

In [87]:
full_text_gpt4 = extract_sample_size_full_text(
    single_group_embeddings_body, single_group,
    'How many participants were recruited for this study?',
    ZERO_SHOT_SAMPLE_SIZE_FUNCTION,
    model_name='gpt-4',
    num_workers=2
)

100%|███████████████████████████████████████████| 69/69 [03:06<00:00,  2.70s/it]


In [91]:
full_text_gpt4, ft_gpt4_nonna  = _print_evaluation(full_text_gpt4, single_group)


    Accuracy: 0.75
    % FTS chose a chunk w/ annotated information: 0.91
    %  null when wrong chunk: 0.83
    Accuracy for cases when correct chunk was given to LLM: 0.83
    % LLM reported a non-na value when correct chunk was given: 1.00
    Accuracy for non-NA values w/ correct chunk given: 0.83
    Accuracy within 10%: 0.86
    Accuracy within 20%: 0.95
    Accuracy within 30%: 0.97


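GPT-4 is slightly more accurate (0.75 vs. 0.68 overall) and, more importantly, is less likely to hallucinate a value: when given the wrong chunk it returned `null` 83% of the time, compared with 33% for GPT-3.5.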