# Full Text Search evaluation with participant demographics

Here, I evaluate the strategy of finding relevant text sections using chunk embeddings.

First, I split PMC articles along their Markdown structure (and, where needed, along line breaks) into chunks of fewer than `n_tokens` (~4000) tokens.
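As a rough illustration of this splitting step, here is a minimal sketch. It is not the code in `embed`: the function `split_markdown`, the whitespace-based `count_tokens` stand-in, and the greedy line packing are all assumptions made for the example. It records `start_char` and `end_char` for each chunk, since those offsets are what the evaluation below relies on.

```python
import re

def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())

def split_markdown(text, n_tokens=4000):
    """Split text at Markdown headings, then at line breaks, into chunks of at most n_tokens."""
    # Character offsets where a Markdown heading line begins, plus the document start.
    starts = sorted({0, *(m.start() for m in re.finditer(r"^#{1,6} ", text, flags=re.M))})
    bounds = list(zip(starts, starts[1:] + [len(text)]))
    spans = []
    for sec_start, sec_end in bounds:
        if sec_end == sec_start:
            continue
        if count_tokens(text[sec_start:sec_end]) <= n_tokens:
            spans.append((sec_start, sec_end))
            continue
        # Section is too long: pack whole lines greedily into chunks.
        # A single line longer than n_tokens is kept whole in this sketch.
        chunk_start = pos = sec_start
        for line in text[sec_start:sec_end].splitlines(keepends=True):
            line_end = pos + len(line)
            if pos > chunk_start and count_tokens(text[chunk_start:line_end]) > n_tokens:
                spans.append((chunk_start, pos))
                chunk_start = pos
            pos = line_end
        spans.append((chunk_start, sec_end))
    return [{"start_char": s, "end_char": e, "content": text[s:e]} for s, e in spans]

doc = "# Title\nintro words here\n## Methods\nwe recruited twenty people\nmore text\n"
chunks = split_markdown(doc, n_tokens=5)
```

Because the chunks tile the document without gaps or overlap, any annotation span that does not cross a chunk boundary falls inside exactly one chunk.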

Next, I embed each chunk.

Finally, using a text query, I find the section of each article that is most relevant to participant demographics.
The query is also embedded, and a distance metric is computed between each chunk and the query.
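The ranking step can be sketched as follows. This is an illustration only: the function name `rank_chunks` is hypothetical, and cosine distance is an assumption, since the metric actually used is not stated here. Ranks are 0-based, so rank 0 is the chunk closest to the query.

```python
import numpy as np

def rank_chunks(chunk_embeddings, query_embedding):
    """Return (distances, ranks) of each chunk relative to the query."""
    chunks = np.asarray(chunk_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Cosine similarity between every chunk vector and the query vector.
    sims = chunks @ query / (np.linalg.norm(chunks, axis=1) * np.linalg.norm(query))
    distances = 1.0 - sims
    # A double argsort turns distances into 0-based ranks (0 = closest chunk).
    ranks = np.argsort(np.argsort(distances))
    return distances, ranks

# Toy 2-d "embeddings" for three chunks and one query.
distances, ranks = rank_chunks([[1, 0], [0, 1], [1, 1]], [1, 0.1])
```

In this toy case the first chunk points almost in the same direction as the query and gets rank 0, while the second chunk is nearly orthogonal to it and is ranked last.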

To evaluate this method, I check whether it correctly identifies the section where human annotators found demographic information.

## Setup

In [1]:
import pandas as pd
from labelrepo.projects.participant_demographics import get_participant_demographics

subgroups = get_participant_demographics(include_locations=True)

In [19]:
import pandas as pd
from labelrepo.database import get_database_connection

docs_info = pd.read_sql(
    "select pmcid, publication_year, title, text from document",
    get_database_connection(),
)

In [20]:
# Only look at single-group documents annotated by Jerome
jerome_pd = subgroups[(subgroups.project_name == 'participant_demographics') &
                      (subgroups.annotator_name == 'Jerome_Dockes')]

# Keep articles with exactly one non-null `count` entry, i.e. a single participant group
counts = jerome_pd.groupby('pmcid').count().reset_index()
single_group_pmcids = counts[counts['count'] == 1].pmcid
single_group = jerome_pd[jerome_pd.pmcid.isin(single_group_pmcids)]

In [23]:
single_group_docs = docs_info[docs_info.pmcid.isin(single_group.pmcid)]

### Embed all documents

In [4]:
import openai

# Read the API key with a context manager so the file handle is closed
with open('/home/zorro/.keys/open_ai.key') as f:
    openai.api_key = f.read().strip()

In [5]:
from embed import embed_pmc_articles
# Embedding takes about five minutes, so the calls are commented out after the first run
# texts = single_group_docs[['pmcid', 'text']].to_dict(orient='records')
# single_group_embeddings = embed_pmc_articles(texts)

100%|███████████████████████████████████████████| 75/75 [04:58<00:00,  3.97s/it]


In [13]:
import pickle
# pickle.dump(single_group_embeddings, open('data/single_group_embeddings.pkl', 'wb'))

### Test query across all documents

Given a query, compute the average rank of the chunk that contains the human annotation.
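To make the metric concrete, here is a small worked example on made-up data. The column names mirror the ones used in this notebook (`pmcid`, `start_char`, `end_char`, `rank`), but the values are invented, and the merge-based lookup is just one possible way to find the chunk that contains each annotation.

```python
import pandas as pd

# Invented chunk table: character spans and 0-based ranks for one query.
chunks = pd.DataFrame({
    "pmcid": [1, 1, 1, 2, 2],
    "start_char": [0, 100, 200, 0, 150],
    "end_char": [100, 200, 300, 150, 400],
    "rank": [2, 0, 1, 1, 0],
})
# Invented annotations: one demographic span per article.
annotations = pd.DataFrame({
    "pmcid": [1, 2],
    "start_char": [120, 30],
    "end_char": [135, 42],
})

# Pair each annotation with every chunk of the same article, then keep
# only the chunk whose span fully contains the annotation.
merged = annotations.merge(chunks, on="pmcid", suffixes=("_ann", "_chunk"))
contains = (
    (merged["start_char_chunk"] <= merged["start_char_ann"])
    & (merged["end_char_chunk"] >= merged["end_char_ann"])
)
matched = merged[contains]

mean_rank = matched["rank"].mean()          # 0.5
top1_rate = (matched["rank"] == 0).mean()   # 0.5
```

Here the annotation in article 1 sits in the top-ranked chunk (rank 0) and the one in article 2 sits in the second-ranked chunk (rank 1), so the mean rank is 0.5 and half of the annotations are found in the top chunk.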

In [194]:
with open('data/single_group_embeddings.pkl', 'rb') as f:
    single_group_embeddings = pickle.load(f)

In [195]:
single_group_embeddings = pd.DataFrame(single_group_embeddings)

In [216]:
import numpy as np

# NOTE: `query_embeddings` is not defined in this notebook; it is assumed to come
# from an earlier cell or from the local `embed` module.

def find_match(df, pmcid, start_char, end_char):
    """Find the rows for `pmcid` whose chunk contains the start and end character indices."""
    return df[(df['pmcid'] == pmcid) & (df['start_char'] <= start_char) & (df['end_char'] >= end_char)]

def get_matching_chunks(ranks_df, annotation_df):
    """Collect, for every annotation, the chunk(s) that contain it."""
    matches = []
    for _, row in annotation_df.iterrows():
        m = find_match(ranks_df, row['pmcid'], row['start_char'], row['end_char'])
        matches.append(m)

    return pd.concat(matches)

def _query_multiple_df(df, query):
    """Rank the chunks of each article against `query`, one article at a time."""
    all_distances, all_ranks = [], []
    for pmcid, sub_df in df.groupby('pmcid', sort=False):
        distances, ranks = query_embeddings(sub_df['embedding'].tolist(), query)
        all_distances.append(distances)
        all_ranks.append(ranks)

    # Assumes the rows of each article are contiguous in `df`, so that the
    # concatenated results line up with the original row order.
    df['distance'] = np.concatenate(all_distances)
    df['rank'] = np.concatenate(all_ranks)

    return df

def test_query_across_docs(embeddings_df, query):
    """Rank all chunks for `query` and return those containing an annotation (uses the global `single_group`)."""
    single_group_ranks = _query_multiple_df(embeddings_df, query)
    matching_chunks = get_matching_chunks(single_group_ranks, single_group)

    return matching_chunks
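Writing concatenated per-article results back into the DataFrame only lines up when each article's rows are contiguous. If per-chunk distances are already stored in a column, a grouped rank avoids that assumption because pandas aligns the result on the index. The sketch below uses invented distances and is an alternative formulation, not the code used above.

```python
import pandas as pd

df = pd.DataFrame({
    "pmcid": [1, 2, 1, 2, 1],  # rows of the two articles are interleaved on purpose
    "distance": [0.30, 0.20, 0.10, 0.40, 0.20],
})

# Rank distances within each article; subtract 1 to get 0-based ranks.
df["rank"] = df.groupby("pmcid")["distance"].rank(method="first").astype(int) - 1
```

With `method="first"`, ties are broken by row order, so every chunk in an article gets a distinct rank.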

In [290]:
def try_query(embeddings_df, query):
    mc = test_query_across_docs(embeddings_df, query)
    
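    # Mean rank of the chunk containing the annotation (ranks are 0-based)
    mean_rank = mc['rank'].mean()
    
    # Fraction of annotations whose chunk is ranked first
    top_rank_percentage = (mc['rank'] == 0).mean()
    
    # NOTE: with 0-based ranks, `<= 3` covers the four closest chunks;
    # use `< 3` for a strict top 3. Values are fractions, not percentages.
    top_3_percentage = (mc['rank'] <= 3).mean()

    print(f"Query: '{query}' \nMean rank: {mean_rank:.2f}, top 1 %: {top_rank_percentage:.2f}, top 3 %: {top_3_percentage:.2f}")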

In [291]:
# The first query is a nonsense string that serves as a baseline
queries = [
    'ljskfdklsjdfk',
    'Methods section',
    'Number of participants',
    'The number of subjects or participants that were involved in the study or underwent MRI',
    'How many participants or subjects were recruited for this study?',
    'How many participants were recruited for this study?',
]

In [292]:
for q in queries:
    try_query(single_group_embeddings, q)

Query: 'ljskfdklsjdfk' 
Mean rank: 5.60, top 1 %: 0.03, top 3 %: 0.51
Query: 'Methods section' 
Mean rank: 5.21, top 1 %: 0.13, top 3 %: 0.51
Query: 'Number of participants' 
Mean rank: 3.00, top 1 %: 0.73, top 3 %: 0.81
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
Mean rank: 2.67, top 1 %: 0.53, top 3 %: 0.80
Query: 'How many participants or subjects were recruited for this study?' 
Mean rank: 2.84, top 1 %: 0.73, top 3 %: 0.83
Query: 'How many participants were recruited for this study?' 
Mean rank: 2.83, top 1 %: 0.76, top 3 %: 0.83


### Try only on Body

Jerome's annotations appear to fall only in the Body of each paper, so it is fair to exclude embeddings of chunks outside the Body.

In [293]:
single_group_embeddings_body = single_group_embeddings[single_group_embeddings.section_name == 'Body'].reset_index(drop=True)

In [294]:
for q in queries:
    try_query(single_group_embeddings_body, q)

Query: 'ljskfdklsjdfk' 
Mean rank: 2.78, top 1 %: 0.10, top 3 %: 0.76
Query: 'Methods section' 
Mean rank: 2.35, top 1 %: 0.22, top 3 %: 0.78
Query: 'Number of participants' 
Mean rank: 0.25, top 1 %: 0.89, top 3 %: 0.98
Query: 'The number of subjects or participants that were involved in the study or underwent MRI' 
Mean rank: 0.71, top 1 %: 0.63, top 3 %: 0.95
Query: 'How many participants or subjects were recruited for this study?' 
Mean rank: 0.21, top 1 %: 0.87, top 3 %: 0.98
Query: 'How many participants were recruited for this study?' 
Mean rank: 0.17, top 1 %: 0.90, top 3 %: 0.98
