# Full Text Search evaluation with participant demographics

Here, I evaluate the strategy of finding relevant text sections using chunk embeddings.

First, I split each PMC article along its Markdown structure and line breaks into chunks of fewer than `n_tokens` tokens (~4000).
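
The line-based part of this splitting step can be sketched roughly as follows. This is only an illustration, not the code in the `embed` module: `split_into_chunks` is a hypothetical helper, whitespace-separated words stand in for real tokens, and Markdown headings are treated like any other line. It records `start_char` and `end_char` for each chunk, since those offsets are what later allow a chunk to be matched against an annotation.

```python
def split_into_chunks(text, max_tokens=4000, count_tokens=None):
    """Greedily pack whole lines into chunks of at most max_tokens tokens.

    Returns a list of dicts holding each chunk's text and character span.
    A single line longer than the budget becomes its own oversized chunk.
    """
    if count_tokens is None:
        # Assumption: whitespace-separated words approximate tokens.
        count_tokens = lambda s: len(s.split())

    chunks = []
    start = pos = 0   # character offsets into `text`
    n_tokens = 0      # tokens accumulated in the current chunk
    for line in text.splitlines(keepends=True):
        line_tokens = count_tokens(line)
        # Close the current chunk if this line would exceed the budget.
        if pos > start and n_tokens + line_tokens > max_tokens:
            chunks.append({"start_char": start, "end_char": pos, "text": text[start:pos]})
            start, n_tokens = pos, 0
        pos += len(line)
        n_tokens += line_tokens
    if pos > start:
        chunks.append({"start_char": start, "end_char": pos, "text": text[start:pos]})
    return chunks


article = "# Methods\nWe recruited 24 adults.\n\n# Results\nAccuracy was high.\n"
chunks = split_into_chunks(article, max_tokens=6)
for chunk in chunks:
    print(chunk["start_char"], chunk["end_char"], repr(chunk["text"]))
```

Because lines are kept whole and their terminators preserved, concatenating the chunk texts reproduces the original article, so character offsets stay comparable with annotation offsets.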

Next, I embed each chunk.

Finally, given a text query, I find the section of each article that is most relevant to participant demographics.
The query is embedded as well, a distance is computed between the query and each chunk, and the chunks are ranked by that distance.

To evaluate the method, I check whether it correctly identifies the section in which human annotators found the demographic information.
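
The distance-and-rank step can be sketched with toy vectors. This is a minimal illustration under assumptions: the text only says "a distance metric", so cosine distance is my choice here, `rank_chunks` is a hypothetical name, and ranks are taken to be 0-based (0 is the closest chunk).

```python
import numpy as np


def rank_chunks(chunk_embeddings, query_embedding):
    """Return cosine distances to the query and the 0-based rank of each chunk."""
    chunks = np.asarray(chunk_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Cosine similarity between every chunk and the query.
    sims = chunks @ query / (np.linalg.norm(chunks, axis=1) * np.linalg.norm(query))
    distances = 1.0 - sims
    # argsort of argsort turns distances into ranks (0 = smallest distance).
    ranks = distances.argsort().argsort()
    return distances, ranks


toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three 2-d "embeddings"
toy_query = [1.0, 0.2]
distances, ranks = rank_chunks(toy_chunks, toy_query)
print(ranks)  # first chunk is closest, second is farthest
```

In the real setting the vectors would come from the embedding model rather than being written by hand, and ranking would be done separately within each article.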

## Setup

In [1]:
import pandas as pd
from labelrepo.projects.participant_demographics import get_participant_demographics

subgroups = get_participant_demographics(include_locations=True)

In [19]:
from labelrepo.database import get_database_connection

import pandas as pd

docs_info = pd.read_sql(
    "select pmcid, publication_year, title, text from document",
    get_database_connection(),
)

In [20]:
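# Keep only documents in which this annotator labelled a single participant group
jerome_pd = subgroups[
    (subgroups.project_name == 'participant_demographics')
    & (subgroups.annotator_name == 'Jerome_Dockes')
]

# groupby().count() counts non-null values per column, so counts['count'] is the
# number of annotated groups per pmcid that have a non-null `count` value
counts = jerome_pd.groupby('pmcid').count().reset_index()
single_group_pmcids = counts[counts['count'] == 1].pmcid
single_group = jerome_pd[jerome_pd.pmcid.isin(single_group_pmcids)]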
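
As an aside, the same "exactly one row per document" filter can be written with `groupby(...).size()`, which counts rows regardless of missing values. The sketch below uses a made-up table; the column names mirror the ones above, but the data is invented for illustration.

```python
import pandas as pd

# Toy annotation table: document 101 has two groups, 202 and 303 have one each.
toy = pd.DataFrame({
    "pmcid": [101, 101, 202, 303],
    "count": [12, 15, 30, 24],
})

group_sizes = toy.groupby("pmcid").size()           # rows per document
single_ids = group_sizes[group_sizes == 1].index    # documents with one group
toy_single = toy[toy.pmcid.isin(single_ids)]
print(toy_single)
```

The two approaches differ only when the `count` column has missing values: `count()` ignores those rows, `size()` does not.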

In [23]:
single_group_docs = docs_info[docs_info.pmcid.isin(single_group.pmcid)]

### Embed all documents

In [4]:
import openai
# Read the API key from a local file, closing the file afterwards
with open('/home/zorro/.keys/open_ai.key') as key_file:
    openai.api_key = key_file.read().strip()

In [5]:
from embed import embed_pmc_articles
# texts = single_group_docs[['pmcid', 'text']].to_dict(orient='records')
# single_group_embeddings = embed_pmc_articles(texts)

100%|███████████████████████████████████████████| 75/75 [04:58<00:00,  3.97s/it]


In [13]:
import pickle
# pickle.dump(single_group_embeddings, open('data/single_group_embeddings.pkl', 'wb'))

### Test query across all documents

Given a query, find the rank of the chunk that contains the human annotation in each document, then average those ranks across documents.

In [194]:
single_group_embeddings = pickle.load(open('data/single_group_embeddings.pkl', 'rb'))

In [195]:
single_group_embeddings = pd.DataFrame(single_group_embeddings)

In [216]:
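import numpy as np

# NOTE: query_embeddings is not defined in this notebook and must already be in scope
# (presumably it comes from the local `embed` module).

def find_match(df, pmcid, start_char, end_char):
    """Return the rows of `df` for `pmcid` whose character span fully contains [start_char, end_char]."""
    return df[(df['pmcid'] == pmcid) & (df['start_char'] <= start_char) & (df['end_char'] >= end_char)]

def get_matching_chunks(ranks_df, annotation_df):
    """Collect, for every annotation, the chunk(s) that contain its character span."""
    matches = []
    for _, row in annotation_df.iterrows():
        m = find_match(ranks_df, row['pmcid'], row['start_char'], row['end_char'])
        matches.append(m)

    return pd.concat(matches)

def _query_multiple_df(df, query):
    """Rank the chunks of each document against `query`, adding 'distance' and 'rank' columns."""
    all_distances, all_ranks = [], []
    for _, sub_df in df.groupby('pmcid', sort=False):
        distances, ranks = query_embeddings(sub_df['embedding'].tolist(), query)
        all_distances.append(distances)
        all_ranks.append(ranks)

    # The concatenated results follow group order, so this assignment assumes that each
    # document's chunks occupy consecutive rows of df. Note that df is modified in place.
    df['distance'] = np.concatenate(all_distances)
    df['rank'] = np.concatenate(all_ranks)

    return df

def test_query_across_docs(embeddings_df, query):
    """Rank all chunks for `query` and return the chunks that contain an annotation."""
    # single_group is the global annotation table built in the Setup section
    single_group_ranks = _query_multiple_df(embeddings_df, query)
    matching_chunks = get_matching_chunks(single_group_ranks, single_group)

    return matching_chunks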
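
To make the span-containment test in `find_match` concrete, here is a small worked example on invented data. `containing_chunk` is a stand-alone copy of the same comparison, written out so the snippet runs by itself; the offsets and ranks are made up.

```python
import pandas as pd

# Three consecutive chunks of one toy document, with pre-computed ranks.
chunks = pd.DataFrame({
    "pmcid": [1, 1, 1],
    "start_char": [0, 100, 200],
    "end_char": [100, 200, 300],
    "rank": [2, 0, 1],
})


def containing_chunk(df, pmcid, start_char, end_char):
    """Rows of df whose [start_char, end_char] span fully contains the annotation."""
    mask = (
        (df["pmcid"] == pmcid)
        & (df["start_char"] <= start_char)
        & (df["end_char"] >= end_char)
    )
    return df[mask]


inside = containing_chunk(chunks, 1, 120, 180)      # lies within the middle chunk
straddling = containing_chunk(chunks, 1, 190, 210)  # crosses a chunk boundary
print(inside)
print(len(straddling))
```

An annotation that crosses a chunk boundary matches no chunk at all, so under this test it contributes nothing to the rank statistics rather than counting as a miss.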

In [233]:
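def test_query(embeddings_df, query):
    """Print rank statistics for `query` and return the annotated chunks with their ranks."""
    mc = test_query_across_docs(embeddings_df, query)

    # Mean rank of the chunk containing the annotation (lower is better)
    mean_rank = mc['rank'].mean()

    # Fraction of annotations whose chunk is ranked first; the `== 0` test implies 0-based ranks
    top_rank_percentage = (mc['rank'] == 0).mean()

    # NOTE: with 0-based ranks, `<= 5` admits six chunks (ranks 0 to 5); a strict top 5 would be `< 5`.
    # Left unchanged so that the outputs recorded below still correspond to this code.
    top_5_percentage = (mc['rank'] <= 5).mean()

    # The "percentage" values printed here are fractions in [0, 1]
    print(f"Mean rank: {mean_rank}, top 1 percentage: {top_rank_percentage}, top 5 percentage: {top_5_percentage}")

    return mc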
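
The summary statistics can be checked by hand on a short list of ranks. The sketch below assumes 0-based ranks, which the `rank == 0` test for the top match suggests; `summarize_ranks` is a hypothetical helper and the ranks are invented. Under that assumption a strict top-k uses `rank < k`, whereas `rank <= 5` as written above also admits the sixth-ranked chunk.

```python
import pandas as pd


def summarize_ranks(ranks, k=5):
    """Mean rank plus top-1 and top-k hit rates, assuming 0-based ranks."""
    ranks = pd.Series(ranks)
    return {
        "mean_rank": ranks.mean(),
        "top_1": (ranks == 0).mean(),     # fraction ranked first
        f"top_{k}": (ranks < k).mean(),   # fraction among the k best
    }


toy_ranks = [0, 0, 3, 5, 12]
summary = summarize_ranks(toy_ranks)
print(summary)
```

In this toy list, the chunk with rank 5 is the sixth-best match, so it counts toward `rank <= 5` but not toward a strict top 5.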

In [234]:
# Baseline: a nonsense query
_ = test_query(single_group_embeddings, 'kjfkfkjfkd')

Mean rank: 6.48, top 1 percentage: 0.013333333333333334, top 5 percentage: 0.6133333333333333


In [235]:
_ = test_query(single_group_embeddings, 'Methods section')

Mean rank: 5.2, top 1 percentage: 0.13333333333333333, top 5 percentage: 0.6933333333333334


In [236]:
_ = test_query(single_group_embeddings, 'Number of participants')

Mean rank: 3.0, top 1 percentage: 0.7333333333333333, top 5 percentage: 0.8266666666666667


In [237]:
_ = test_query(single_group_embeddings, 'How many participants were recruited for this study?')

Mean rank: 2.8266666666666667, top 1 percentage: 0.76, top 5 percentage: 0.8266666666666667


In [239]:
_ = test_query(single_group_embeddings, 'The number of subjects or participants that were involved in the study or underwent MRI')

Mean rank: 2.6666666666666665, top 1 percentage: 0.5333333333333333, top 5 percentage: 0.8133333333333334
