## Full text semantic search - evaluation

To perform information retrieval (IR) over full texts, we need to find the relevant sections to use as context for GPT.

This is typically done by chunking each text into sections and obtaining an embedding for each section.
Then, using a distance metric, one can find the section most relevant to the query and pass it to GPT as context.

Here, I evaluate this method for finding the sample size in papers that do not report it in the abstract.


### Setup
Load the participant annotations and the full text of each document.

In [1]:
import pandas as pd
from labelrepo.projects.participant_demographics import (
    get_participant_demographics, select_participants_annotations
)

subgroups = get_participant_demographics(include_locations=True)

In [2]:
from labelrepo.database import get_database_connection

import pandas as pd

docs_info = pd.read_sql(
    "select pmcid, publication_year, title, text from document",
    get_database_connection(),
)

In [4]:
# Attach document info and full text; with no `on=` given, pandas joins on the columns both frames share.
subgroups = pd.merge(subgroups, docs_info, how='left')
jerome_pd = subgroups[
    (subgroups.project_name == 'participant_demographics')
    & (subgroups.annotator_name == 'Jerome_Dockes')
]

# Keep documents where exactly one annotated group has a non-null `count`.
counts = jerome_pd.groupby('pmcid').count().reset_index()
single_group_pmcids = counts[counts['count'] == 1].pmcid
single_group = jerome_pd[jerome_pd.pmcid.isin(single_group_pmcids)]

# Case study

Testing out the workflow on a single study whose participant information is not in the abstract.

In [6]:
example = single_group[single_group.pmcid == 5548834]

In [7]:
full_text = example.text.values[0]

### Chunk Full Text

GPT-3.5 has a maximum context length of 4k to 16k tokens, depending on the model variant, so the text must be broken into chunks.
The `embed` helpers imported below chunk PMC articles into sections based on their Markdown structure.

In [8]:
import openai
import numpy as np
from embed import embed_pmc_article, embed_text
openai.api_key = open('/home/zorro/.keys/open_ai.key').read().strip()
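
The source of `embed_pmc_article` is not shown in this notebook, so the following is only a rough sketch of what section-level chunking of a Markdown document could look like. The function name `chunk_markdown_by_heading` and the `content` key are hypothetical. `start_char` and `end_char` mirror the column names used later in this notebook, on the assumption that they are character offsets into the full text.

```python
import re


def chunk_markdown_by_heading(text):
    """Split Markdown text into one chunk per heading.

    Hypothetical helper for illustration; not the actual `embed` module code.
    """
    # Offsets of every line that starts with 1-6 '#' characters and a space.
    starts = [m.start() for m in re.finditer(r"^#{1,6} ", text, flags=re.MULTILINE)]
    # Keep any text before the first heading as its own chunk.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [
        {"start_char": s, "end_char": e, "content": text[s:e]}
        for s, e in zip(starts, ends)
        if text[s:e].strip()  # drop whitespace-only chunks
    ]


sample = (
    "# Title\nIntro text.\n\n"
    "## Methods\nWe recruited 24 adults.\n\n"
    "## Results\nNothing notable.\n"
)
chunks = chunk_markdown_by_heading(sample)
for chunk in chunks:
    print(chunk["start_char"], chunk["end_char"], chunk["content"].splitlines()[0])
```

Keeping the character offsets with each chunk is what makes it possible to compare a retrieved chunk against an annotation's position later on.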

### Embed chunks

`embed_pmc_article` chunks the document and retrieves an embedding for each chunk, returned alongside the chunk metadata.

In [9]:
embeddings = embed_pmc_article(full_text, model_name='text-embedding-ada-002')

In [10]:
embeddings = pd.DataFrame(embeddings)

### Test Query

In [11]:
query = "How many subjects are in the study?"

query_embedding = embed_text(query)

In [12]:
from sklearn.metrics.pairwise import euclidean_distances

In [13]:
doc_embeddings = np.array(embeddings['embedding'].tolist())
# Squared Euclidean distance from each chunk to the query; ravel() flattens the (n_chunks, 1) result.
distances = euclidean_distances(doc_embeddings, np.array(query_embedding).reshape(1, -1), squared=True)
embeddings['distances'] = distances.ravel()
embeddings.sort_values('distances')
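
The cell above ranks chunks by squared Euclidean distance. If the embeddings are unit length (which I believe OpenAI documents for its embedding models, but it is worth checking the norms directly), this gives the same ordering as cosine similarity, because for unit vectors ||a - b||^2 = 2 - 2 cos(a, b). The sketch below checks that identity on synthetic vectors. It is written in plain Python so it runs without the notebook's dependencies, and the random vectors are stand-ins, not real embeddings.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    # For unit-length vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
# Synthetic stand-ins for five chunk embeddings and one query embedding.
chunk_vecs = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(5)]
query_vec = normalize([random.gauss(0, 1) for _ in range(8)])

sq_dist = [squared_euclidean(v, query_vec) for v in chunk_vecs]
cos_sim = [dot(v, query_vec) for v in chunk_vecs]

# Check the identity and compare the two rankings (closest chunk first).
identity_holds = all(
    math.isclose(d, 2 - 2 * c, abs_tol=1e-9) for d, c in zip(sq_dist, cos_sim)
)
rank_by_dist = sorted(range(len(chunk_vecs)), key=lambda i: sq_dist[i])
rank_by_cos = sorted(range(len(chunk_vecs)), key=lambda i: -cos_sim[i])
print(identity_holds, rank_by_dist == rank_by_cos)
```

If the embeddings turn out not to be normalized, the two metrics can rank chunks differently, and the choice of metric becomes part of what is being evaluated.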

The closest passage is where the participant information is found!

### Comparison to Jerome's annotations

The goal here is to see whether, across documents, we can design a query that finds the target passage (i.e., the one where the participant information actually appears).

In [14]:
top_1 = embeddings.sort_values('distances').iloc[0]
example.start_char > top_1.start_char

In [15]:
example.end_char < top_1.end_char
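
To repeat this check across many documents, the two comparisons above could be wrapped in a small helper and extended to the top k chunks. The following is a minimal sketch, not part of the existing code: `span_in_chunk` and `hit_at_k` are hypothetical names, the offsets are made up, and the bounds are inclusive so that an annotation starting exactly at a chunk boundary still counts as contained.

```python
def span_in_chunk(ann_start, ann_end, chunk_start, chunk_end):
    """True if the annotated span lies entirely inside the chunk."""
    return chunk_start <= ann_start and ann_end <= chunk_end


def hit_at_k(ranked_chunks, ann_start, ann_end, k=1):
    """True if any of the k closest chunks contains the annotated span.

    `ranked_chunks` is a list of (start_char, end_char) pairs sorted by
    increasing distance to the query.
    """
    return any(
        span_in_chunk(ann_start, ann_end, start, end)
        for start, end in ranked_chunks[:k]
    )


# Made-up offsets for illustration only: chunks ordered closest first.
ranked = [(1200, 2500), (0, 1200), (2500, 4000)]

hit_top1 = hit_at_k(ranked, ann_start=1350, ann_end=1362, k=1)
miss_top1 = hit_at_k(ranked, ann_start=100, ann_end=112, k=1)
hit_top2 = hit_at_k(ranked, ann_start=100, ann_end=112, k=2)
print(hit_top1, miss_top1, hit_top2)
```

Averaging `hit_at_k` over all single-group documents would give a simple retrieval accuracy for a given query and value of k.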