# Infer ontology-grounded metadata from an abstract


## Motivation
We want to infer ontology-grounded metadata from an abstract. This will both facilitate and improve automatic annotation in the archive.

## Plan of action

The main input to the pipeline is either a DOI or a plain-text abstract, from which the relevant information is extracted.
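As a minimal sketch of how a DOI input could be resolved to a plain-text abstract, the snippet below assumes the public Crossref REST API (`https://api.crossref.org/works/<doi>`), which returns the abstract as JATS-tagged XML when the publisher has deposited one. The helper names `normalize_doi`, `strip_jats_tags`, and `fetch_abstract` are hypothetical; they are not the notebook's `get_crossref_abstract`, whose implementation is not shown here.

```python
import html
import json
import re
import urllib.parse
import urllib.request


def normalize_doi(doi_or_url: str) -> str:
    """Strip a resolver prefix such as https://doi.org/ or doi: from a DOI."""
    return re.sub(
        r"^(https?://(dx\.)?doi\.org/|doi:)",
        "",
        doi_or_url.strip(),
        flags=re.IGNORECASE,
    )


def strip_jats_tags(abstract_xml: str) -> str:
    """Remove XML tags (e.g. <jats:p>) and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", abstract_xml)
    return re.sub(r"\s+", " ", html.unescape(text)).strip()


def fetch_abstract(doi_or_url: str, timeout: float = 10.0) -> str:
    """Query Crossref for a DOI; return "" when no abstract was deposited."""
    doi = normalize_doi(doi_or_url)
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        message = json.load(response)["message"]
    return strip_jats_tags(message.get("abstract", ""))
```

Calling `fetch_abstract("https://doi.org/10.7554/eLife.89093.1")` requires network access; the two pure helpers can be exercised offline.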

There are two main steps to this process:
1) Use LLMs as a Named Entity Recognition (NER) tool to identify ontology-grounded entities in the abstract. This technique is based on [recent research](https://ar5iv.labs.arxiv.org/html/2304.10428). More specifically, we use in-context learning, where a set of relevant examples is presented as context or instructions at inference time to tailor a pre-trained language model to a specific task. The core of this technique is the selection of relevant examples to use as context, which teach the LLM the schema of the desired response. The output of this step is a set of entities in plain language (e.g. mouse, fly, etc.).

2) LLMs tend to hallucinate results, which is not acceptable in the context of provenance and metadata annotation. To ground the results and constrain them to specific ontologies, we semantically map the entities extracted in step 1 to identifiers taken from relevant ontologies (e.g. Uberon, NCBI Taxonomy, etc.). In more detail, for each relevant ontology we create a vector database that stores a semantic representation of each ontology identifier. We then use the entities extracted in step 1 to query the vector database and find the most similar ontology identifier. The output of this step is a set of ontology identifiers (e.g. UBERON:0000001, NCBITaxon:0000001, etc.). Some [related approaches](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2678-8) that combine ontologies with LLMs have been tried before in the literature.
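The grounding step can be illustrated with a toy nearest-neighbour search. This is a sketch under stated assumptions: the three-dimensional vectors are invented placeholders rather than real embeddings, the `TOY:` identifiers are not real ontology terms, the in-memory dictionary stands in for a vector database, and `nearest_identifier` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_identifier(
    query: list[float], index: dict[str, list[float]]
) -> tuple[str, float]:
    """Return the (identifier, score) pair most similar to the query vector."""
    scored = (
        (identifier, cosine_similarity(query, vector))
        for identifier, vector in index.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Placeholder index: identifiers and vectors are made up for illustration.
toy_index = {
    "TOY:0001": [1.0, 0.0, 0.0],
    "TOY:0002": [0.0, 1.0, 0.0],
    "TOY:0003": [0.0, 0.0, 1.0],
}

# Placeholder embedding of an entity extracted in step 1.
query_vector = [0.1, 0.9, 0.2]
best_id, best_score = nearest_identifier(query_vector, toy_index)
print(best_id, round(best_score, 3))
```

A real vector database performs the same ranking over many more vectors, typically with an approximate index instead of the exhaustive scan used here.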


A general overview of the pipeline is shown in the figure below:
<img src="metadata_extraction.svg" style="width: 1200px;" />


## Infrastructure and technical implementation

This notebook showcases a proof of concept of the pipeline. The current implementation relies on a limited number of well-annotated examples for in-context learning and on a limited number of ontologies for semantic mapping. So far, we have implemented two ontologies as vector databases in [Qdrant](https://qdrant.tech/), which we use both as a data model and as a hosting solution. The first is Uberon, which is based on the latest release and has been vectorized in full with the help of the [obonet](https://pypi.org/project/obonet/) project. The second is NCBI Taxonomy, which is based on the latest release, has been vectorized for [vertebrata](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=7742) only, and has been parsed with the help of [etetoolkit](http://etetoolkit.org/). For LLM technologies, we have used [OpenAI functions](https://openai.com/blog/function-calling-and-other-api-updates). More specifically, we have used `OpenAI ada-002` as an embedding solution and `gpt-3.5-turbo` as an auto-completion agent.
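To make the vectorization step more concrete, here is a minimal sketch of turning OBO term stanzas into the text that could be embedded for each identifier. It deliberately uses a small hand-written parser on an inline excerpt instead of `obonet`, so that it runs without downloads. The excerpt is written for this example in the OBO flat-file style and is not copied from a release; `parse_obo_terms` and `text_to_embed` are hypothetical names, and embedding the name plus synonyms is an assumption, since the notebook does not state which fields are embedded.

```python
import re

# Illustrative excerpt in OBO flat-file style (hand-written for this sketch).
SAMPLE_OBO = """
[Term]
id: UBERON:0000955
name: brain
synonym: "encephalon" RELATED []

[Term]
id: UBERON:0002436
name: primary visual cortex
synonym: "striate cortex" EXACT []
"""


def parse_obo_terms(obo_text: str) -> list[dict]:
    """Parse [Term] stanzas into dicts with id, name, and synonyms."""
    terms = []
    for stanza in re.split(r"\n\s*\n", obo_text.strip()):
        lines = [line.strip() for line in stanza.splitlines() if line.strip()]
        if not lines or lines[0] != "[Term]":
            continue  # skip headers and non-term stanzas
        term = {"id": None, "name": None, "synonyms": []}
        for line in lines[1:]:
            key, _, value = line.partition(": ")
            if key in ("id", "name"):
                term[key] = value
            elif key == "synonym":
                quoted = re.match(r'"(.*?)"', value)
                if quoted:
                    term["synonyms"].append(quoted.group(1))
        terms.append(term)
    return terms


def text_to_embed(term: dict) -> str:
    """Join the name and synonyms into one string to pass to an embedder."""
    return "; ".join([term["name"], *term["synonyms"]])


terms = parse_obo_terms(SAMPLE_OBO)
for term in terms:
    print(term["id"], "->", text_to_embed(term))
```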



# Pipeline

In [1]:
import pprint
from pathlib import Path
import os 

from repo_secrets import OPENAI_API_KEY
from repo_secrets import QDRANT_API_KEY

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["QDRANT_API_KEY"] = QDRANT_API_KEY

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
from utils.metadata_extraction import generate_prompt_examples, generate_task_prompt_from_abstract, infer_metadata, generate_prompt_from_dandiset_id
from utils.metadata_extraction import get_crossref_abstract
from utils.metadata_extraction import ground_metadata_in_ontologies

## Load context prompt built with examples

In [3]:
instructions_prompt = """You are a neuroscience researcher and you are interested in figuring the metadata from abstracts. Here are some examples of how you work:"""

training_dandiset_ids = ["000568", "000250", "000147", "000127", "000055", "000044"]
context_examples = generate_prompt_examples(training_dandiset_ids)

context_prompt = instructions_prompt + context_examples
pprint.pprint(context_prompt)

('You are a neuroscience researcher and you are interested in figuring the '
 'metadata from abstracts. Here are some examples of how you work:\n'
 '- Example 1: The abstract of the paper is:\n'
 '    Understanding how excitatory (E) and inhibitory (I) inputs are '
 'integrated by neurons requires monitoring their subthreshold behavior. We '
 'probed the subthreshold dynamics using optogenetic depolarizing pulses in '
 'hippocampal neuronal assemblies in freely moving mice. Excitability '
 'decreased during sharp-wave ripples coupled with increased I. In contrast to '
 'this “negative gain,” optogenetic probing showed increased within-field '
 'excitability in place cells by weakening I and unmasked stable place fields '
 'in initially non–place cells. Neuronal assemblies active during sharp-wave '
 'ripples in the home cage predicted spatial overlap and sequences of place '
 'fields of both place cells and unmasked preexisting place fields of '
 'non–place cells during track running. 
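
The printed prompt suggests how the context is assembled: an instruction sentence followed by numbered examples. The following is a minimal sketch of that in-context learning pattern, not the implementation of `generate_prompt_examples`. The `FewShotExample` structure, the answer layout, and the toy example are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class FewShotExample:
    """One annotated example: an abstract and its curated metadata."""
    abstract: str
    metadata: dict


def format_examples(examples: list[FewShotExample]) -> str:
    """Render examples as numbered abstract/answer pairs."""
    blocks = []
    for number, example in enumerate(examples, start=1):
        blocks.append(
            f"- Example {number}: The abstract of the paper is:\n"
            f"    {example.abstract}\n"
            f"    The extracted metadata is:\n"
            f"    {json.dumps(example.metadata, sort_keys=True)}\n"
        )
    return "\n".join(blocks)


def build_prompt(
    instructions: str, examples: list[FewShotExample], new_abstract: str
) -> str:
    """Instructions, then worked examples, then the abstract to annotate."""
    return (
        f"{instructions}\n{format_examples(examples)}\n"
        f"The abstract of the paper is:\n    {new_abstract}\n"
        "The extracted metadata is:\n"
    )


# Toy example, invented for this sketch.
toy_examples = [
    FewShotExample(
        abstract="Recordings in mouse hippocampus.",
        metadata={
            "species_names": ["Mus musculus - House mouse"],
            "anatomy": ["hippocampus"],
        },
    )
]
toy_prompt = build_prompt("Here are some examples:", toy_examples, "New abstract.")
print(toy_prompt)
```

Ending the prompt with the same cue that follows each example nudges the model to answer in the schema shown by the examples.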

## Example 1 from DOI

In [4]:
# Random article from eLife
doi = "https://doi.org/10.7554/eLife.89093.1" 
abstract_to_test = get_crossref_abstract(doi)

print("\n Abstract: \n")
pprint.pprint(abstract_to_test)


 Abstract: 

('Basal forebrain cholinergic neurons modulate how organisms process and '
 'respond to environmental stimuli through impacts on arousal, attention, and '
 'memory. It is unknown, however, whether basal forebrain cholinergic neurons '
 'are directly involved in conditioned behavior, independent of secondary '
 'roles in the processing of external stimuli. Using fluorescent imaging, we '
 'found that cholinergic neurons are active during behavioral responding for a '
 'reward – even in prior to reward delivery and in the absence of discrete '
 'stimuli. Photostimulation of basal forebrain cholinergic neurons, or their '
 'terminals in the basolateral amygdala (BLA), selectively promoted '
 'conditioned responding (licking), but not unconditioned behavior nor innate '
 'motor outputs. In vivo electrophysiological recordings during cholinergic '
 'photostimulation revealed reward-contingency-dependent suppression of BLA '
 'neural activity, but not prefrontal cortex (PFC). F

In [5]:
task_prompt = generate_task_prompt_from_abstract(abstract_to_test)
prompt = f"{context_prompt} {task_prompt}"
plain_metadata = infer_metadata(prompt)

print("\n Information extracted: \n")
pprint.pprint(plain_metadata)


 Information extracted: 

{'anatomy': ['basal forebrain cholinergic neurons',
             'basolateral amygdala',
             'prefrontal cortex'],
 'approach_names': ['fluorescent imaging', 'electrophysiological approach'],
 'measurement_names': ['surgical technique',
                       'signal filtering technique',
                       'multi electrode extracellular electrophysiology '
                       'recording technique'],
 'species_names': []}
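
The dictionary above has four list-valued fields. With OpenAI function calling, such a shape can be requested by describing it as a JSON Schema. The sketch below shows one way such a description and a defensive parser could look; the function name `record_metadata`, its description, and `parse_arguments` are assumptions, and the schema actually used inside `infer_metadata` is not shown in this notebook.

```python
import json

FIELDS = ("species_names", "anatomy", "approach_names", "measurement_names")

# Hypothetical function description in the function-calling format:
# a name, a description, and JSON Schema parameters.
METADATA_FUNCTION = {
    "name": "record_metadata",
    "description": "Record the metadata entities found in an abstract.",
    "parameters": {
        "type": "object",
        "properties": {
            field: {"type": "array", "items": {"type": "string"}}
            for field in FIELDS
        },
        "required": list(FIELDS),
    },
}


def parse_arguments(arguments_json: str) -> dict:
    """Parse the model's JSON arguments; bad or missing fields become []."""
    try:
        raw = json.loads(arguments_json)
    except json.JSONDecodeError:
        raw = {}
    if not isinstance(raw, dict):
        raw = {}
    parsed = {}
    for field in FIELDS:
        value = raw.get(field)
        parsed[field] = [str(v) for v in value] if isinstance(value, list) else []
    return parsed


example = parse_arguments('{"anatomy": ["prefrontal cortex"], "species_names": null}')
print(example)
```

Normalizing every field to a list keeps downstream code simple when the model omits a field, as happened with `species_names` in the output above.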


In [7]:
grounded_metadata = ground_metadata_in_ontologies(plain_metadata)
grounded_metadata

{'species_names': [],
 'anatomy': ['basal forebrain cholinergic neurons',
  'basolateral amygdala',
  'prefrontal cortex'],
 'approach_names': ['fluorescent imaging', 'electrophysiological approach'],
 'measurement_names': ['surgical technique',
  'signal filtering technique',
  'multi electrode extracellular electrophysiology recording technique'],
 'anatomy_identifiers': ['UBERON:0002743', 'UBERON:0002887', 'UBERON:0000451'],
 'anatomy_urls': ['http://purl.obolibrary.org/obo/UBERON_0002743',
  'http://purl.obolibrary.org/obo/UBERON_0002887',
  'http://purl.obolibrary.org/obo/UBERON_0000451']}
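
The identifier and URL lists above are two spellings of the same terms: each URL is the OBO Library PURL prefix followed by the identifier with its colon replaced by an underscore. A small sketch of that conversion (the helper name `curie_to_purl` is hypothetical):

```python
OBO_PURL_PREFIX = "http://purl.obolibrary.org/obo/"


def curie_to_purl(curie: str) -> str:
    """Convert e.g. 'UBERON:0002743' to its OBO Library PURL."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or not prefix or not local_id:
        raise ValueError(f"not a prefix:local identifier: {curie!r}")
    return f"{OBO_PURL_PREFIX}{prefix}_{local_id}"


identifiers = ["UBERON:0002743", "UBERON:0002887", "UBERON:0000451"]
urls = [curie_to_purl(identifier) for identifier in identifiers]
print(urls)
```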

## Example 2 from abstract

In [8]:
# Random article from eLife
abstract_to_test = """The relationship between mesoscopic local field potentials (LFPs) and single-neuron firing in the multi-layered neocortex is poorly understood. Simultaneous recordings from all layers in the primary visual cortex (V1) of the behaving mouse revealed functionally defined layers in V1. The depth of maximum spike power and sink-source distributions of LFPs provided consistent laminar landmarks across animals. Coherence of gamma oscillations (30-100 Hz) and spike-LFP coupling identified six physiological layers and further sublayers. Firing rates, burstiness, and other electrophysiological features of neurons displayed unique layer and brain state dependence. Spike transmission strength from layer 2/3 cells to layer 5 pyramidal cells and interneurons was stronger during waking compared with non-REM sleep but stronger during non-REM sleep among deep-layer excitatory neurons. A subset of deep-layer neurons was active exclusively in the DOWN state of non-REM sleep. These results bridge mesoscopic LFPs and single-neuron interactions with laminar structure in V1."""

print("\n Abstract: \n")
pprint.pprint(abstract_to_test)


 Abstract: 

('The relationship between mesoscopic local field potentials (LFPs) and '
 'single-neuron firing in the multi-layered neocortex is poorly understood. '
 'Simultaneous recordings from all layers in the primary visual cortex (V1) of '
 'the behaving mouse revealed functionally defined layers in V1. The depth of '
 'maximum spike power and sink-source distributions of LFPs provided '
 'consistent laminar landmarks across animals. Coherence of gamma oscillations '
 '(30-100 Hz) and spike-LFP coupling identified six physiological layers and '
 'further sublayers. Firing rates, burstiness, and other electrophysiological '
 'features of neurons displayed unique layer and brain state dependence. Spike '
 'transmission strength from layer 2/3 cells to layer 5 pyramidal cells and '
 'interneurons was stronger during waking compared with non-REM sleep but '
 'stronger during non-REM sleep among deep-layer excitatory neurons. A subset '
 'of deep-layer neurons was active exclusively 

In [9]:
task_prompt = generate_task_prompt_from_abstract(abstract_to_test)
prompt = f"{context_prompt} {task_prompt}"
plain_metadata = infer_metadata(prompt)

print("\n Information extracted: \n")
pprint.pprint(plain_metadata)


 Information extracted: 

{'anatomy': ['primary visual cortex (V1)'],
 'approach_names': ['electrophysiological approach'],
 'measurement_names': ['multi electrode extracellular electrophysiology '
                       'recording technique',
                       'spike sorting technique',
                       'signal filtering technique'],
 'species_names': ['Mus musculus - House mouse']}


In [10]:
grounded_metadata = ground_metadata_in_ontologies(plain_metadata)
pprint.pprint(grounded_metadata)

{'anatomy': ['primary visual cortex (V1)'],
 'anatomy_identifiers': ['UBERON:0002436'],
 'anatomy_urls': ['http://purl.obolibrary.org/obo/UBERON_0002436'],
 'approach_names': ['electrophysiological approach'],
 'measurement_names': ['multi electrode extracellular electrophysiology '
                       'recording technique',
                       'spike sorting technique',
                       'signal filtering technique'],
 'species_identifiers': ['NCBITaxon:3004188'],
 'species_names': ['Mus musculus - House mouse'],
 'species_urls': ['http://purl.obolibrary.org/obo/NCBITaxon_3004188']}


In [11]:
# Random article from eLife
abstract_to_test = """Walking is a fundamental mode of locomotion, yet its neural correlates are unknown at brain-
wide scale in any animal. We use volumetric two-photon imaging to map neural activity
associated with walking across the entire brain of Drosophila. We detect locomotor signals in
approximately 40% of the brain, identify a global signal associated with the transition from rest
to walking, and define clustered neural signals selectively associated with changes in forward or
angular velocity. These networks span functionally diverse brain regions, and include regions
that have not been previously linked to locomotion. We also identify time-varying trajectories of
neural activity that anticipate future movements, and that represent sequential engagement of
clusters of neurons with different behavioral selectivity. These motor maps suggest a dynamical
systems framework for constructing walking maneuvers reminiscent of models of forelimb
reaching in primates and set a foundation for understanding how local circuits interact across
large-scale networks."""

task_prompt = generate_task_prompt_from_abstract(abstract_to_test)
prompt = f"{context_prompt} {task_prompt}"

plain_metadata = infer_metadata(prompt)
grounded_metadata = ground_metadata_in_ontologies(plain_metadata)
pprint.pprint(grounded_metadata)

{'anatomy': [],
 'anatomy_identifiers': [],
 'anatomy_urls': [],
 'approach_names': ['imaging approach'],
 'measurement_names': ['two-photon imaging technique'],
 'species_identifiers': ['NCBITaxon:448761'],
 'species_names': ['Drosophila melanogaster - Fruit fly'],
 'species_urls': ['http://purl.obolibrary.org/obo/NCBITaxon_448761']}
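
The NCBI Taxonomy collection described earlier covers Vertebrata only, so an invertebrate query such as Drosophila would have no correct entry to match, yet a nearest-neighbour search still returns its closest stored neighbour. Identifiers obtained this way deserve a manual check. One possible guard, which is not part of this notebook, is to reject matches below a minimum similarity score. In the sketch below the scores, the placeholder `TOY:` identifiers, and the 0.85 threshold are all invented for illustration.

```python
from typing import Optional


def accept_match(
    candidates: list[tuple[str, float]], min_score: float = 0.85
) -> Optional[str]:
    """Return the best-scoring identifier, or None if it is below min_score."""
    if not candidates:
        return None
    identifier, score = max(candidates, key=lambda pair: pair[1])
    return identifier if score >= min_score else None


# Invented (identifier, similarity) pairs standing in for search results.
confident = accept_match([("TOY:0001", 0.93), ("TOY:0002", 0.41)])
doubtful = accept_match([("TOY:0003", 0.62)])
print(confident, doubtful)
```

Returning `None` leaves the field empty instead of recording a weak match; a suitable threshold would have to be calibrated on annotated examples.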


# Appendix (to be deleted)

## Comparison to baselines


In [None]:
dandiset_id_test = "000568"
doi = "https://doi.org/10.1038/s41593-022-01138-x" 
abstract_to_test = "The incorporation of new information into the hippocampal network is likely to be constrained by its innate architecture and internally generated activity patterns. However, the origin, organization and consequences of such patterns remain poorly understood. In the present study we show that hippocampal network dynamics are affected by sequential neurogenesis. We birthdated CA1 pyramidal neurons with in utero electroporation over 4 embryonic days, encompassing the peak of hippocampal neurogenesis, and compared their functional features in freely moving adult mice. Neurons of the same birthdate displayed distinct connectivity, coactivity across brain states and assembly dynamics. Same-birthdate neurons exhibited overlapping spatial representations, which were maintained across different environments. Overall, the wiring and functional features of CA1 pyramidal neurons reflected a combination of birthdate and the rate of neurogenesis. These observations demonstrate that sequential neurogenesis during embryonic development shapes the preconfigured forms of adult network dynamics."
task_prompt = generate_task_prompt_from_abstract(abstract_to_test)

In [None]:
context_prompt

In [None]:
task_prompt = generate_task_prompt_from_abstract(abstract_to_test)
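# NOTE: preamble_prompt and zero_shot_prompt are not defined anywhere in this
# notebook; they must be defined before this cell can run.
context_prompt = preamble_prompt + zero_shot_prompt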
prompt = f"{context_prompt} {task_prompt}"

print("abstract: \n")
pprint.pprint(abstract_to_test)
print("\n Information extracted: \n")
metadata = infer_metadata(prompt)
pprint.pprint(metadata)

#### Baseline without in-context learning for sanity check
Same as our pipeline, but without in-context learning.

In [None]:
context_prompt = preamble_prompt 
prompt = f"{context_prompt} {task_prompt}"

print("abstract: \n")
pprint.pprint(abstract_to_test)
print("\n Information extracted: \n")
metadata = infer_metadata(prompt)
pprint.pprint(metadata)

#### Better baseline
For a fairer comparison, here we attempt to improve the baseline by using a more sophisticated prompt and clearer expectations.

In [None]:
def generate_task_prompt_from_abstract_without_context(abstract: str) -> str:
    prompt = f"""The abstract of the paper is:
    {abstract} 

    Extract the following information from the abstract:
    - species:
    - species identifier in the NCBI taxonomy:
    - approach:
    - measurement:
    - anatomy:
    - anatomy identifier in the Uberon ontology:

    Return the response as a JSON object with the following format:
    
    {{
        "species": [species_name_1, species_name_2, ...],
        "species_identifier": [species identifiers in the NCBI taxonomy. e.g 'http://purl.obolibrary.org/obo/NCBITaxon_10090'],
        "approach": [e.g. 'electrophysiology', 'calcium imaging', 'optogenetics'],
        "measurement": [e.g. surgery, spike sorting, etc.],
        "anatomy": [e.g. 'hippocampus', 'cortex', 'thalamus'],
        "anatomy_identifier": [anatomy identifier in the Uberon ontology]
    }}
    
    If some information is missing, leave it blank.

    """

    return prompt

In [None]:
context_prompt = preamble_prompt 
task_prompt = generate_task_prompt_from_abstract_without_context(abstract_to_test)
prompt = f"{context_prompt} {task_prompt}"

print("abstract: \n")
pprint.pprint(abstract_to_test)
print("\n Information extracted: \n")
metadata = infer_metadata(prompt)
pprint.pprint(metadata)
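
To turn the comparisons in this appendix into numbers, the extracted fields could be scored against curated annotations. A minimal sketch, assuming set-based exact matching after lowercasing; `field_scores` and the example values are hypothetical, and no results from the notebook are implied.

```python
def field_scores(predicted: list[str], expected: list[str]) -> dict:
    """Precision and recall of predicted values against curated ones."""
    predicted_set = {value.strip().lower() for value in predicted}
    expected_set = {value.strip().lower() for value in expected}
    true_positives = len(predicted_set & expected_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(expected_set) if expected_set else 0.0
    return {"precision": precision, "recall": recall}


# Invented values: one correct prediction and one spurious prediction.
scores = field_scores(["Hippocampus", "cortex"], ["hippocampus"])
print(scores)
```

Applying the same function to the identifier fields would show how often identifiers written directly by the LLM agree with curated ones, compared with identifiers obtained through the vector database.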