## Full text semantic search - demo

To perform information retrieval (IR) over full texts, we need to find the relevant section of each document to use as context for GPT.

This is typically done by chunking texts into sections, and obtaining embeddings for each section of the text.
Then, using a distance metric, one can find the most relevant section for use as context to the GPT query. 
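
As a minimal sketch of the retrieval step described above, the snippet below ranks a few chunks by squared Euclidean distance to a query. The vectors are made-up 3-dimensional stand-ins for real embeddings, and `rank_chunks` is a hypothetical helper, not part of this project's code.

```python
import numpy as np

def rank_chunks(chunk_embeddings, query_embedding):
    """Return chunk indices ordered from closest to farthest from the query."""
    chunk_embeddings = np.asarray(chunk_embeddings, dtype=float)
    query_embedding = np.asarray(query_embedding, dtype=float)
    # Squared Euclidean distance between each chunk vector and the query vector.
    distances = ((chunk_embeddings - query_embedding) ** 2).sum(axis=1)
    return np.argsort(distances), distances

# Toy vectors standing in for real embeddings (made up for illustration).
toy_chunks = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
toy_query = [0.0, 0.9, 0.1]

order, distances = rank_chunks(toy_chunks, toy_query)
print(order)  # chunk indices, most relevant first
```

The first index in `order` is the chunk that would be passed to GPT as context.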

Here, I will evaluate this method for finding sample size in papers that do not list it in the Abstract.


### Setup
Load the participant annotations and the full text of each document.

In [1]:
import pandas as pd
from labelrepo.projects.participant_demographics import (
    get_participant_demographics, select_participants_annotations
)

subgroups = get_participant_demographics(include_locations=True)

In [2]:
from labelrepo.database import get_database_connection

import pandas as pd

docs_info = pd.read_sql(
    "select pmcid, publication_year, title, text from document",
    get_database_connection(),
)

In [3]:
subgroups = pd.merge(subgroups, docs_info, how='left')
jerome_pd = subgroups[(subgroups.project_name == 'participant_demographics') &
                      (subgroups.annotator_name == 'Jerome_Dockes')]

counts = jerome_pd.groupby('pmcid').count().reset_index()
single_group_pmcids = counts[counts['count'] == 1].pmcid
single_group = jerome_pd[jerome_pd.pmcid.isin(single_group_pmcids)]

### Case study

Testing the workflow on a single study whose participant information is not in the Abstract.

In [4]:
example = single_group[single_group.pmcid == 5548834]

In [5]:
full_text = example.text.values[0]

### Chunk Full Text

The maximum context length for GPT-3.5 is 4k to 16k tokens, so we need to break the text up into chunks.
The following splits PMC articles into sections based on their Markdown headings.
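
To illustrate heading-based chunking, here is a minimal sketch that splits Markdown text at heading lines and records character offsets. `chunk_markdown` is a hypothetical function written for this example; it is not the implementation behind `embed_pmc_article`, and unlike a real chunker it does not cap chunk length.

```python
import re

def chunk_markdown(text):
    """Split Markdown text at heading lines, keeping character offsets."""
    # A heading is a line that starts with one to six '#' followed by a space.
    starts = [m.start() for m in re.finditer(r"^#{1,6} ", text, flags=re.MULTILINE)]
    # Text before the first heading (if any) becomes its own chunk.
    if not starts or starts[0] != 0:
        starts = [0] + starts
    ends = starts[1:] + [len(text)]
    return [
        {"start_char": s, "end_char": e, "content": text[s:e]}
        for s, e in zip(starts, ends)
        if text[s:e].strip()  # skip chunks that are only whitespace
    ]

sample = "Intro text\n## Method\nWe did X.\n### Participants\nTwenty adults.\n"
chunks = chunk_markdown(sample)
for chunk in chunks:
    print(chunk["start_char"], chunk["end_char"], repr(chunk["content"]))
```

Keeping `start_char` and `end_char` for each chunk makes it possible to compare chunk boundaries with annotation offsets later.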

In [6]:
import openai
import numpy as np
from embed import embed_pmc_article, embed_text
openai.api_key = open('/home/zorro/.keys/open_ai.key').read().strip()

### Embed chunks

`embed_pmc_article` chunks the document and retrieves an embedding for each chunk, returned alongside the chunk metadata.

In [7]:
embeddings = embed_pmc_article(full_text, model_name='text-embedding-ada-002')

In [8]:
embeddings = pd.DataFrame(embeddings)

### Test Query

In [9]:
query = "How many subjects are in the study?"

query_embedding = embed_text(query)

In [10]:
from sklearn.metrics.pairwise import euclidean_distances

In [11]:
doc_embeddings = np.array(embeddings['embedding'].tolist())
distances = euclidean_distances(doc_embeddings, np.array(query_embedding).reshape(1, -1), squared=True)
embeddings['distances'] = distances
embeddings.sort_values('distances')

| index | section_name | content | start_char | end_char | embedding | distances |
|---|---|---|---|---|---|---|
| 6 | Body | `\n## Method \n \n### Participants \n \nParti...` | 7835 | 9127 | `[-0.01049839984625578, 0.013235046528279781, 0...` | 0.39256 |
| 0 | Authors | `Riem, Madelon M. E. and Van Ijzendoorn, Marinu...` | 0 | 218 | `[-0.018246043473482132, -0.0213411133736372, 0...` | 0.440493 |
| 11 | Body | `\n### fMRI paradigm and data acquisition \n \...` | 16747 | 18029 | `[-0.03294625133275986, 0.015457101166248322, 0...` | 0.452415 |
| 10 | Body | `\n### Stimuli \n \nAll infant faces images an...` | 15342 | 16747 | `[-0.05092040076851845, 0.008138147182762623, 0...` | 0.454937 |
| 7 | Body | `\n### Procedure \n \nParticipants were invite...` | 9127 | 11203 | `[-0.03600265458226204, 0.006860304158180952, 0...` | 0.469708 |
| 8 | Body | `\nThe second part of the BSRT consisted of the...` | 11203 | 14248 | `[-0.027335399761795998, 0.009072703309357166, ...` | 0.475538 |
| 12 | Body | `\n### fMRI data acquisition and analysis \n \...` | 18029 | 21410 | `[-0.021956197917461395, 0.02941172942519188, 0...` | 0.4769 |
| 13 | Body | `\nAll first-level contrast images and the corr...` | 21410 | 23525 | `[-0.019373657181859016, 0.021196560934185982, ...` | 0.483069 |
| 9 | Body | `\nAfter the training phase participants were r...` | 14248 | 15342 | `[-0.02984154224395752, 0.010838804766535759, 0...` | 0.486831 |
| 14 | Body | `\n## Results \n \n### Behavioral analyses \n ...` | 23525 | 24503 | `[-0.018938401713967323, 0.024022815749049187, ...` | 0.489271 |


The closest passage is where the participant information is found!
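
A note on the choice of metric: for unit-length vectors, squared Euclidean distance and cosine similarity are related by ||a - b||^2 = 2 - 2 cos(a, b), so both give the same ranking. OpenAI documents its embeddings as normalized to length 1; assuming that holds for the vectors used here, the ranking above would not change under cosine similarity. The sketch below checks the identity on random unit vectors, which are stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random unit-length vectors standing in for real embeddings (illustration only).
chunks = rng.normal(size=(5, 8))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
query = rng.normal(size=8)
query /= np.linalg.norm(query)

squared_dist = ((chunks - query) ** 2).sum(axis=1)
cosine_sim = chunks @ query  # dot product equals cosine similarity for unit vectors

# For unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b).
identity_holds = np.allclose(squared_dist, 2 - 2 * cosine_sim)
# Sorting by ascending distance matches sorting by descending similarity.
same_ranking = np.array_equal(np.argsort(squared_dist), np.argsort(-cosine_sim))
print(identity_holds, same_ranking)
```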

### Comparison to Jerome's annotations

The goal here is to see whether we can design a query that, across documents, finds the target passage (i.e. the passage where the participant information actually appears).

In [14]:
top_1 = embeddings.sort_values('distances').iloc[0]
example.start_char > top_1.start_char

In [15]:
example.end_char < top_1.end_char
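
To run this comparison across many documents, the two checks above could be wrapped in a single helper. The sketch below is one possible version: `annotation_in_top_k` is a hypothetical function, the toy table only mimics the `start_char`, `end_char` and `distances` columns of `embeddings`, and it uses inclusive bounds so that an annotation starting exactly at a chunk boundary still counts as contained.

```python
import pandas as pd

def annotation_in_top_k(chunks, ann_start, ann_end, k=1):
    """Check whether the annotated span lies inside one of the k closest chunks."""
    top_k = chunks.sort_values("distances").head(k)
    contained = (top_k["start_char"] <= ann_start) & (ann_end <= top_k["end_char"])
    return bool(contained.any())

# Toy chunk table with the same columns as `embeddings` (values made up).
toy = pd.DataFrame({
    "start_char": [0, 100, 200],
    "end_char": [100, 200, 300],
    "distances": [0.45, 0.39, 0.47],
})

hit_top1 = annotation_in_top_k(toy, ann_start=120, ann_end=130, k=1)
miss_top1 = annotation_in_top_k(toy, ann_start=20, ann_end=30, k=1)
hit_top2 = annotation_in_top_k(toy, ann_start=20, ann_end=30, k=2)
print(hit_top1, miss_top1, hit_top2)
```

Averaging this check over documents for several values of `k` would give a simple top-k hit rate for a candidate query.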