# Identify Dandisets relevant to a scientific question


## Motivation

We want to provide a system that suggests dandisets relevant to a user's question. The hope is that a large language model such as ChatGPT can use the semantic content of the question to return better results than simple text matching.

## Plan of action
 

### 1. Collect metadata from dandisets.
<img src="step1_embed_dandiset_metadata.jpg" style="width: 700px;" />
For each Dandiset:

- get name, description
- get assets summary: approaches, measurement techniques, variables measured
- for species, we can get accurate info from NCBITaxon, if it's included
- (?) get related resources: scrape title/summary from articles
- Use OpenAI ada-002 to vectorize each information piece
- Store the vectors in Qdrant, have the original dandiset id as object metadata for each vector
- (?) any useful extra info we could include on object metadata, for later filtered searches?

### 2. Process user questions
<img src="step2_do_search.jpg" style="width: 700px;" />

- User queries can come in the form of questions, e.g.: "Which datasets can I use to investigate the effects of drug YYYY on cells of type XXXX?"
- For method 1, the question is embedded as-is using the OpenAI embeddings API, and the resulting vector is used for the semantic search.
- For method 2, a prompt instructs the LLM to extract neuroscience research-related keywords from the question, and these keywords are then used for the semantic search. We can achieve this using [OpenAI functions](https://openai.com/blog/function-calling-and-other-api-updates).
- (simple approach) We return the dandiset IDs present in the top results 
- (advanced approach) We gather the text content of the collected metadata for the dandisets in the top results and include them in a prompt together with the user's original question and a default instruction such as: "Given the user question and the listed reference datasets, which datasets could help address the user's question, and why?"
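
The advanced approach can be sketched as a small prompt-assembly helper. This is only an illustration: `build_prompt` and `DEFAULT_INSTRUCTION` are hypothetical names, and the layout (instruction, user question, reference datasets, separated by `---`) is an assumption rather than the exact template used by the notebook's `prepare_prompt`.

```python
# Default instruction taken from the plan above; the helper names are hypothetical.
DEFAULT_INSTRUCTION = (
    "Given the user question and the listed reference datasets, "
    "which datasets could help address the user's question, and why?"
)


def build_prompt(user_question, reference_texts, instruction=DEFAULT_INSTRUCTION):
    """Combine the instruction, the user's question and the reference texts."""
    references = "\n\n".join(reference_texts)
    return (
        f"{instruction}\n"
        "---\n"
        f"User question: {user_question}\n"
        "---\n"
        f"Reference datasets:\n\n{references}"
    )


# Toy reference texts, standing in for the stringified dandiset metadata.
reference_texts = [
    "DANDISET:toy-1\ntitle: First toy dataset",
    "DANDISET:toy-2\ntitle: Second toy dataset",
]
prompt = build_prompt("Which datasets cover topic X?", reference_texts)
print(prompt)
```

Keeping the instruction as a parameter makes it easy to try different wordings without touching the search step.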

```bash
pip install -r requirements.txt
export OPENAI_API_KEY="<your-openai-api-key>"
docker run -p 6333:6333 -v ~/qdrant_storage:/qdrant/storage qdrant/qdrant:latest
```

In [1]:
%load_ext autoreload
%autoreload 2

In [6]:
from rest.clients.dandi import DandiClient
from rest.clients.qdrant import QdrantClient
from rest.clients.openai import OpenaiClient
import json

dandi_client = DandiClient()
qdrant_client = QdrantClient()
openai_client = OpenaiClient()

In [7]:
# Get all dandisets metadata
all_metadata = dandi_client.get_all_dandisets_metadata()

In [24]:
# Extract only relevant text fields from metadata
all_metadata_formatted = dandi_client.collect_relevant_metadata(metadata_list=all_metadata)

In [25]:
print("Number of items: ", len(all_metadata_formatted))
all_metadata_formatted[0]

Number of items:  167


{'dandiset_id': 'DANDI:000005/0.220126.1853',
 'title': 'Electrophysiology data from thalamic and cortical neurons during somatosensation',
 'description': 'intracellular and extracellular electrophysiology recordings performed on mouse barrel cortex and ventral posterolateral nucleus (vpm) in whisker-based object locating task.',
 'approaches': ['electrophysiological approach', 'optogenetic approach'],
 'measurement_techniques': ['current clamp technique',
  'surgical technique',
  'spike sorting technique'],
 'variables_measured': ['CurrentClampStimulusSeries',
  'CurrentClampSeries',
  'OptogeneticSeries',
  'ElectrodeGroup',
  'Units'],
 'species': ['House mouse']}

In [28]:
print(dandi_client.stringify_relevant_metadata(all_metadata_formatted[0]))

Title: Electrophysiology data from thalamic and cortical neurons during somatosensation
Description: intracellular and extracellular electrophysiology recordings performed on mouse barrel cortex and ventral posterolateral nucleus (vpm) in whisker-based object locating task.
Approaches: electrophysiological approach, optogenetic approach
Measurement techniques: current clamp technique, surgical technique, spike sorting technique
Variables measured: CurrentClampStimulusSeries, CurrentClampSeries, OptogeneticSeries, ElectrodeGroup, Units
Species: House mouse



# Vector embeddings

At this step, we generate a vector embedding for each property of the formatted metadata of each dandiset. We then insert each vector, together with its payload (metadata information), into Qdrant.

To run the cells below, you must have:
- `OPENAI_API_KEY` set as an environment variable
- a Qdrant service running ([ref](https://qdrant.tech/documentation/quick-start/))
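
Before any embedding is computed, each formatted metadata record has to be split into individual text pieces, one per property value, each carrying the dandiset ID in its payload. The following is a minimal sketch of that splitting step, not the notebook's actual implementation: `metadata_to_payloads` and `LIST_FIELD_LABELS` are hypothetical names, and the labels for `variables_measured` and `species` are assumptions.

```python
from typing import Any

# Hypothetical mapping from list-valued fields to the singular label used as
# a text prefix. The last two labels are assumptions.
LIST_FIELD_LABELS = {
    "approaches": "approach",
    "measurement_techniques": "measurement technique",
    "variables_measured": "variable measured",
    "species": "species",
}


def metadata_to_payloads(meta: dict[str, Any]) -> list[dict[str, str]]:
    """Split one formatted metadata record into one payload per text piece."""
    payloads = []
    for field, value in meta.items():
        if field == "dandiset_id" or not value:
            continue  # the ID is attached to every payload; skip empty fields
        if isinstance(value, list):
            label = LIST_FIELD_LABELS.get(field, field)
            texts = [f"{label}: {item}" for item in value]
        else:
            texts = [f"{field}: {value}"]
        for text in texts:
            payloads.append(
                {
                    "dandiset_id": meta["dandiset_id"],
                    "field": field,
                    "text_content": text,
                }
            )
    return payloads


example = {
    "dandiset_id": "DANDI:000005/0.220126.1853",
    "title": "Electrophysiology data from thalamic and cortical neurons during somatosensation",
    "approaches": ["electrophysiological approach", "optogenetic approach"],
    "species": ["House mouse"],
}
payloads = metadata_to_payloads(example)
for payload in payloads:
    print(payload)
```

Each `text_content` string is what would be sent to the embedding model, and the resulting vector would be stored next to its payload.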

In [10]:
# Generate vector embeddings for all items in the formatted metadata list
# This can be slow and costs a few cents per run, so it's recommended to save the results to disk and load them later on
emb = openai_client.get_embeddings(
    metadata_list=all_metadata_formatted,
    max_num_sets=10,
    save_to_file=True
)

Processing item DANDI:000012/draft: 100%|█████████████████████████| 10/10 [00:13<00:00,  1.34s/it]


In [13]:
for e in emb:
    print(e["payload"]) 

{'dandiset_id': 'DANDI:000005/0.220126.1853', 'field': 'title', 'text_content': 'title: Electrophysiology data from thalamic and cortical neurons during somatosensation'}
{'dandiset_id': 'DANDI:000005/0.220126.1853', 'field': 'description', 'text_content': 'description: intracellular and extracellular electrophysiology recordings performed on mouse barrel cortex and ventral posterolateral nucleus (vpm) in whisker-based object locating task.'}
{'dandiset_id': 'DANDI:000005/0.220126.1853', 'field': 'approaches', 'text_content': 'approach: electrophysiological approach'}
{'dandiset_id': 'DANDI:000005/0.220126.1853', 'field': 'approaches', 'text_content': 'approach: optogenetic approach'}
{'dandiset_id': 'DANDI:000005/0.220126.1853', 'field': 'measurement_techniques', 'text_content': 'measurement technique: current clamp technique'}
{'dandiset_id': 'DANDI:000005/0.220126.1853', 'field': 'measurement_techniques', 'text_content': 'measurement technique: surgical technique'}
{'dandiset_id': '

In [12]:
# Or load them from a file, if already previously produced
with open("qdrant_points.json", "r") as file:
    emb = json.load(file)

In [13]:
print(f"Produced {len(emb)} embedding points for {len(all_metadata_formatted)} dandisets")

Produced 1605 embedding points for 156 dandisets


In [14]:
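# Create Qdrant collection
# NOTE: assumes the collection helpers live in utils.qdrant, like the query helpers imported further below
from utils.qdrant import create_collection, add_points_to_collection, get_collection_info
create_collection(collection_name="dandi_collection")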

In [15]:
# Populate collection with points
add_points_to_collection(embeddings_objects=emb)

All points added to collection dandi_collection


In [16]:
info = get_collection_info("dandi_collection")
print(f"Inserted {info['points_count']} points to Qdrant collection")

Inserted 1605 points to Qdrant collection


# Vectorize user questions

At this step, we handle user input, with the goal of finding the most relevant dandisets for each question.

We test two vectorization options:
- vectorize the entire question
- extract relevant keywords from the user's question and vectorize these keywords

We then perform a similarity search against our vector database to gather the points that are most semantically similar to the user's question.

Finally, we include the most semantically similar results and the user's input into a prompt which instructs the LLM to further refine the answer to the user.
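
To make the similarity search step concrete, here is a minimal pure-Python sketch of ranking stored points against a query vector. It assumes cosine similarity as the metric and a `vector` key on each point; in the notebook this work is delegated to Qdrant, so `cosine_similarity` and `top_k_similar` are illustrative names only, and the three-dimensional vectors are toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query_vector, points, top_k=15):
    """Return (dandiset_id, score) pairs for the top_k most similar points."""
    scored = [
        (point["payload"]["dandiset_id"], cosine_similarity(query_vector, point["vector"]))
        for point in points
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy points; real embeddings have many more dimensions.
points = [
    {"vector": [1.0, 0.0, 0.0], "payload": {"dandiset_id": "toy-A"}},
    {"vector": [0.0, 1.0, 0.0], "payload": {"dandiset_id": "toy-B"}},
    {"vector": [1.0, 1.0, 0.0], "payload": {"dandiset_id": "toy-C"}},
]
results = top_k_similar([1.0, 0.0, 0.0], points, top_k=2)
print(results)
```

A vector database performs the same ranking with an index, so that it does not have to compare the query against every stored point.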

In [17]:
from utils.openai import (
    keywords_extraction, 
    prepare_keywords_for_semantic_search, 
    add_ordered_similarity_results_to_prompt, 
    get_llm_chat_answer
)
from utils.qdrant import query_all_keywords, query_from_user_input
from utils.pipeline import prepare_prompt

In [18]:
# Vectorize user's input and query similar Qdrant points
user_input = "I am interested in the tuning properties of glial cells. Are there any good dandisets for studying that?"

ordered_similarity_results = query_from_user_input(text=user_input, top_k=15)

In [19]:
dandisets_text = add_ordered_similarity_results_to_prompt(similarity_results=ordered_similarity_results)
prompt = prepare_prompt(user_input=user_input, dandisets_text=dandisets_text, model="gpt-3.5-turbo-16k")
print(prompt)

Given the user input and the list of most reference dandisets, propose which dandisets the user might be interested in using.
Explain your decision based on the information of the reference dandisets.
Suggest only the dandisets which you consider to be most relevant. There can be multiple relevant dandisets.
Structure your answer as a numbered list, with one suggestion per item.
Always start your answers always with: "The most relevant dandisets for your question are:" 
Unless you consider there are no relevant dandisets, in which case only reply: "There are no relevant dandisets for your question"
---
User input: I am interested in the tuning properties of glial cells. Are there any good dandisets for studying that?
---
Reference dandisets:

DANDISET:000049/draft
title: Allen Institute – TF x SF tuning in mouse visual cortex with calcium imaging
description: A two photon calcium imaging dataset from Allen Institute measuring responses to full-field drifting gratings (approx. 120x90 de

In [20]:
answer = get_llm_chat_answer(prompt=prompt, model="gpt-3.5-turbo-16k")
print(answer)

The most relevant dandisets for your question are:

1. DANDISET:000049/draft (Allen Institute – TF x SF tuning in mouse visual cortex with calcium imaging): This dandiset includes a two-photon calcium imaging dataset from Allen Institute, measuring responses to full-field drifting gratings in mouse visual cortex. While the focus is on visual cortex neurons, it provides valuable information about stimulus response properties that can be applied to glial cells as well.

2. DANDISET:000168/draft (Simultaneous loose seal cell-attached recordings and two-photon imaging of GCaMP8 expressing mouse V1 neurons with drifting gratings visual stimuli): This dandiset combines two-photon imaging with electrophysiological recordings of GCaMP8 expressing neurons in mouse primary visual cortex. While the primary focus is on pyramidal neurons, the dataset can also provide insights into the tuning properties of glial cells in response to visual stimuli.

3. DANDISET:000003/draft (Physiological Properties

In [21]:
# A second approach is to first extract neuroscience-related keywords from the user's question
keywords = keywords_extraction(user_input=user_input)

# Join the results in a list of strings, before semantic search
keywords_2 = prepare_keywords_for_semantic_search(keywords)
keywords_2

['glial cells']
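
One way the keyword extraction could be set up with OpenAI function calling is sketched below. It only shows the function schema and the parsing of the returned arguments, without making an API call. The function name `extract_keywords`, its `keywords` parameter and the `parse_keywords` helper are assumptions for illustration and may differ from what `keywords_extraction` does internally.

```python
import json

# Hypothetical function schema, in the JSON Schema style that OpenAI
# function calling expects for the "parameters" field.
EXTRACT_KEYWORDS_FUNCTION = {
    "name": "extract_keywords",
    "description": "Extract neuroscience research-related keywords from a user question.",
    "parameters": {
        "type": "object",
        "properties": {
            "keywords": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Keywords relevant for searching neuroscience datasets.",
            }
        },
        "required": ["keywords"],
    },
}


def parse_keywords(function_call_arguments: str) -> list[str]:
    """Parse the JSON string of arguments returned by the model.

    The model is not guaranteed to return valid JSON, so malformed
    input yields an empty list instead of raising.
    """
    try:
        data = json.loads(function_call_arguments)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    cleaned = []
    for keyword in data.get("keywords", []):
        if isinstance(keyword, str) and keyword.strip() and keyword.strip() not in cleaned:
            cleaned.append(keyword.strip())
    return cleaned


# Simulated arguments string, as it could appear in a function_call response.
simulated_arguments = '{"keywords": ["glial cells", " tuning properties ", ""]}'
parsed = parse_keywords(simulated_arguments)
print(parsed)
```

Validating the arguments before the semantic search avoids embedding empty or duplicated keywords.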

In [22]:
# Query similar entries for each keyword, accumulate the scores for repeated results
ordered_similarity_results = query_all_keywords(keywords_2, top_k=15)
ordered_similarity_results

[('DANDI:000350/0.221219.1506', 1.60849445),
 ('DANDI:000003/0.230629.1955', 1.60160926),
 ('DANDI:000168/draft', 1.58226733),
 ('DANDI:000020/0.210913.1639', 0.81621814),
 ('DANDI:000245/draft', 0.79532015),
 ('DANDI:000302/draft', 0.7936288),
 ('DANDI:000295/draft', 0.78901976),
 ('DANDI:000165/0.211118.1526', 0.78858197),
 ('DANDI:000409/draft', 0.78844327),
 ('DANDI:000008/0.211014.0809', 0.786623),
 ('DANDI:000292/0.220708.1652', 0.7866051),
 ('DANDI:000244/draft', 0.7853388)]
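
The accumulation of scores for repeated results can be sketched as follows. This assumes that the scores of all hits belonging to the same dandiset are simply summed, which would be consistent with totals above 1 in the output above. `accumulate_scores` is a hypothetical helper and not necessarily how `query_all_keywords` is implemented; the IDs and scores are toy values.

```python
from collections import defaultdict


def accumulate_scores(hits):
    """Sum similarity scores per dandiset and sort by total, highest first.

    `hits` is an iterable of (dandiset_id, score) pairs, possibly containing
    the same dandiset several times (different fields or keywords).
    """
    totals = defaultdict(float)
    for dandiset_id, score in hits:
        totals[dandiset_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy hits: "toy-A" is matched twice, so its scores add up.
hits = [("toy-A", 0.8), ("toy-B", 0.79), ("toy-A", 0.7)]
ranked = accumulate_scores(hits)
print(ranked)
```

Summing favors dandisets that match on several fields or keywords, at the cost of making totals depend on how many points each dandiset has.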

In [23]:
# Prepare a prompt instructing the LLM to suggest the most relevant dandisets based on user's input
dandisets_text = add_ordered_similarity_results_to_prompt(similarity_results=ordered_similarity_results)
prompt = prepare_prompt(user_input=user_input, dandisets_text=dandisets_text, model="gpt-3.5-turbo-16k")
print(prompt)

Given the user input and the list of most reference dandisets, propose which dandisets the user might be interested in using.
Explain your decision based on the information of the reference dandisets.
Suggest only the dandisets which you consider to be most relevant. There can be multiple relevant dandisets.
Structure your answer as a numbered list, with one suggestion per item.
Always start your answers always with: "The most relevant dandisets for your question are:" 
Unless you consider there are no relevant dandisets, in which case only reply: "There are no relevant dandisets for your question"
---
User input: I am interested in the tuning properties of glial cells. Are there any good dandisets for studying that?
---
Reference dandisets:

DANDISET:000350/draft
title: Glia Accumulate Evidence that Actions Are Futile and Suppress Unsuccessful Behavior
description: When a behavior repeatedly fails to achieve its goal, animals often give up and become passive, which can be strategic fo

In [24]:
answer = get_llm_chat_answer(prompt=prompt)
print(answer)

The most relevant dandisets for your question are:

1. DANDISET:000350/draft:
   - Title: Glia Accumulate Evidence that Actions Are Futile and Suppress Unsuccessful Behavior
   - Description: This dandiset investigates the tuning properties of radial astrocytes in larval zebrafish. The study uses whole-brain calcium imaging and microscopy techniques to reveal how astrocytes accumulate evidence of failed motor behaviors. This dataset can provide insights into the tuning properties of glial cells.
   - Approach: Microscopy approach; Cell population imaging
   - Measurement Technique: Two-photon microscopy technique
   - Species: Danio rerio - Zebra fish

2. DANDISET:000168/draft:
   - Title: Simultaneous loose seal cell-attached recordings and two-photon imaging of GCaMP8 expressing mouse V1 neurons with drifting gratings visual stimuli
   - Description: This dandiset focuses on two-photon imaging of mouse visual cortex neurons with drifting gratings visual stimuli. The dataset includes 

# Comparison of methods

Here we compare the results of both methods for a variety of possible questions.

In [25]:
from utils.pipeline import suggest_relevant_dandisets

In [26]:
user_input = "I want to study natural movement in humans"

suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=1)
print(suggestions)

The most relevant dandisets for your question are:

1. DANDISET:000055/draft
   - Title: AJILE12: Long-term naturalistic human intracranial neural recordings and pose
   - Description: This dataset contains synchronized intracranial neural recordings and upper body pose trajectories from 12 human participants engaged in naturalistic movements. It includes a large amount of data recorded over 55 days, including thousands of wrist movement events and annotated behavioral states. The neural recordings are available at a high sampling rate, and pose trajectories are sampled at 30 frames per second. This dataset is ideal for studying natural movement in humans.

2. DANDISET:000540/draft
   - Title: Dataset for: A change in behavioral state switches the pattern of motor output that underlies rhythmic head and orofacial movements
   - Description: This dataset contains multi-modal data recorded from rats performing naturalistic foraging and rearing behaviors in an open arena. It includes vide

In [27]:
suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=2)
print(suggestions)

The most relevant dandisets for your question are:

1. DANDISET:000055/draft - AJILE12: Long-term naturalistic human intracranial neural recordings and pose:
   - This dataset includes synchronized intracranial neural recordings and upper body pose trajectories during naturalistic movements in humans. It provides insights into the neural basis of human movement in naturalistic scenarios, allowing for the study of unstructured, spontaneous movements in completely naturalistic settings.

2. DANDISET:000019/draft - Human ECoG speaking consonant-vowel syllables:
   - This dataset contains high-density electrocorticography (ECoG) recordings from human patients reading aloud consonant-vowel syllables. It enables the investigation of brain activity during natural spoken language production and can provide valuable information about natural movement in humans.

3. DANDISET:000143/draft - PPC_Finger: human posterior parietal cortex recordings during attempted finger movements:
   - This dataset

In [28]:
user_input = "Are there any datasets that have electrophysiology recordings of a rodent navigating a maze?"

suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=1)
print(suggestions)

The most relevant dandisets for your question are:

1. DANDISET:000044/draft: This dataset consists of electrophysiological recordings of rats navigating different types of mazes. It includes multi-cellular electrophysiological recordings, as well as neck EMG and head-mounted accelerometer signals. The dataset focuses on the effect of novel spatial learning on hippocampal neural firing and LFP patterns.

2. DANDISET:000410/draft: This dataset includes electrophysiological data from rats running on different types of tracks. It contains recordings from dorsal CA1 and provides information on spatial representations and the stepping rhythm.

3. DANDISET:000115/draft: This dataset includes electrophysiological and behavioral data from rats. It focuses on hippocampal replay and its relation to specific past experiences. The dataset includes dorsal CA1 tetrodes recordings, as well as behavioral data such as port triggers and position tracking.

4. DANDISET:000301/draft: This dataset includes

In [29]:
suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=2)
print(suggestions)

The most relevant dandisets for your question are:

1. DANDISET:000044/draft - "Diversity in neural firing dynamics supports both rigid and learned hippocampal sequences." This dataset includes multi-cellular electrophysiological recordings of rats navigating novel mazes. It contains hippocampal recordings as well as neck EMG and head-mounted accelerometer signals.

2. DANDISET:000017/draft - "Distributed coding of choice, action and engagement across the mouse brain." This dataset includes electrophysiological recordings from mouse brains during behavioral tasks. It provides insights into how different brain regions process choices, actions, and engagement.

3. DANDISET:000006/draft - "Mouse anterior lateral motor cortex (ALM) in delay response task." This dataset includes extracellular electrophysiology recordings from the mouse anterior lateral motor cortex during a delay response task. It focuses on neural activity related to movement execution.

4. DANDISET:000053/draft - "Recordi