In [1]:
%load_ext autoreload
%autoreload 2

# Motivation

We want to provide a system that suggests relevant dandisets based on a user's question.

# Plan of action

### Collect metadata from dandisets. For each set:
- get name, description
- get assets summary: approaches, measurement techniques, variables measured
- for species, we can get accurate info from NCBITaxon, if it's included
- (?) get related resources: scrape title/summary from articles

### Vector embeddings
- Use OpenAI `text-embedding-ada-002` to vectorize each piece of information
- Store the vectors in Qdrant, with the original dandiset ID as payload (object metadata) for each vector
- (?) any useful extra info we could include in the object metadata, for later filtered searches?

### Process user questions
- User queries can come in the form of questions, e.g.: "Which datasets can I use to investigate the effects of drug YYYY on cells of type XXXX?"
- We need a prompt that instructs the LLM to extract useful neuroscience research-related keywords from the question; these keywords are then passed to the semantic search engine. We can achieve this using [OpenAI functions](https://openai.com/blog/function-calling-and-other-api-updates)
- (simple approach) We return the dandiset IDs present in the top results
- (advanced approach) We gather the text content of the collected metadata for the dandisets in the top results and include them in a prompt together with the user's original question and a default instruction such as: "Given the user question and the listed reference datasets, which datasets could help address the user's question, and why?"
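
As a rough illustration of the keyword-extraction step, the sketch below defines a function schema in the JSON-schema shape that OpenAI function calling expects, plus a small parser for the arguments string the model returns. The function name `extract_keywords`, its fields, and the `parse_keywords` helper are assumptions made for this example; they are not taken from `utils.openai`.

```python
import json

# Hypothetical function schema (name and fields are illustrative assumptions).
# It follows the "name / description / parameters" layout used by OpenAI function calling.
KEYWORD_EXTRACTION_FUNCTION = {
    "name": "extract_keywords",
    "description": "Extract neuroscience research-related keywords from a user question.",
    "parameters": {
        "type": "object",
        "properties": {
            "keywords": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Keywords relevant to a neuroscience dataset search.",
            }
        },
        "required": ["keywords"],
    },
}


def parse_keywords(function_call_arguments: str) -> list[str]:
    """Parse the JSON arguments string returned by the model into a clean keyword list."""
    arguments = json.loads(function_call_arguments)
    # Strip whitespace, lowercase, and drop empty entries.
    return [kw.strip().lower() for kw in arguments.get("keywords", []) if kw.strip()]


# Example with a hand-written arguments string standing in for a model response.
print(parse_keywords('{"keywords": ["Glial cells", " tuning properties "]}'))
```

Constraining the model to a schema like this keeps the output machine-readable, so the keywords can be passed straight to the semantic search without free-text parsing.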

# Requirements
- `pip install -r requirements.txt`
- `export OPENAI_API_KEY=<your-openai-api-key>`
- `docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant:latest`

In [2]:
from utils.dandi import get_all_dandisets_metadata, collect_relevant_metadata

In [3]:
# Get all dandisets metadata
all_metadata = get_all_dandisets_metadata()

In [4]:
# Extract only relevant text fields from metadata
all_metadata_formatted = collect_relevant_metadata(metadata_list=all_metadata)

In [5]:
all_metadata_formatted[0]

{'dandiset_id': 'DANDI:000005/0.220126.1853',
 'title': 'title: Electrophysiology data from thalamic and cortical neurons during somatosensation',
 'description': 'description: intracellular and extracellular electrophysiology recordings performed on mouse barrel cortex and ventral posterolateral nucleus (vpm) in whisker-based object locating task.',
 'approaches': ['approach: electrophysiological approach',
  'approach: optogenetic approach'],
 'measurement_techniques': ['measurement technique: current clamp technique',
  'measurement technique: surgical technique',
  'measurement technique: spike sorting technique'],
 'variables_measured': ['variable measured: CurrentClampStimulusSeries',
  'variable measured: CurrentClampSeries',
  'variable measured: OptogeneticSeries',
  'variable measured: ElectrodeGroup',
  'variable measured: Units'],
 'species': ['species: House mouse']}

# Vector embeddings

At this step, we generate a vector embedding for each property of the formatted metadata of each dandiset. We then insert each vector, together with its payload (metadata information), into Qdrant.
To run the cells below, you must have:
- `OPENAI_API_KEY` set as an environment variable
- a running Qdrant service (see the [quick start](https://qdrant.tech/documentation/quick-start/))
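
To make the "one vector per property" idea concrete, here is a minimal sketch of how a formatted metadata entry could be flattened into text-plus-payload records before embedding. The helper `metadata_to_points` and the payload keys `field` and `text_content` are hypothetical; the actual layout used by `get_embeddings` may differ.

```python
def metadata_to_points(item: dict) -> list[dict]:
    """Flatten one formatted metadata entry into records ready for embedding.

    Hypothetical helper: each text value becomes one record whose payload
    carries the dandiset ID, so a search hit can be traced back to its dandiset.
    """
    points = []
    for field, value in item.items():
        if field == "dandiset_id":
            continue  # the ID is payload, not text to embed
        texts = value if isinstance(value, list) else [value]
        for text in texts:
            points.append({
                "text": text,  # string that would be sent to the embedding model
                "payload": {
                    "dandiset_id": item["dandiset_id"],
                    "field": field,
                    "text_content": text,
                },
            })
    return points


# Reduced example using a few fields from the entry shown earlier.
example_item = {
    "dandiset_id": "DANDI:000005/0.220126.1853",
    "approaches": [
        "approach: electrophysiological approach",
        "approach: optogenetic approach",
    ],
    "species": ["species: House mouse"],
}
for point in metadata_to_points(example_item):
    print(point["payload"]["field"], "->", point["text"])
```

Keeping the field name in the payload would also allow filtered searches later, for example restricting matches to `species` entries.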

In [25]:
from utils.openai import get_embeddings
from utils.qdrant import create_collection, add_points_to_collection, get_collection_info
import json

In [15]:
# Generate vector embeddings for all items in the formatted metadata list
# This can be slow and costs a few cents per run, so it is recommended to save the results to disk and load them later
emb = get_embeddings(
    metadata_list=all_metadata_formatted,
    # max_num_sets=10,
    save_to_file=True
)

Processing item DANDI:000363/0.230613.1608:  70%|████████████████████████████████████████████████████████▎                        | 105/151 [04:34<02:55,  3.81s/it]Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised APIError: The server had an error while processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 22f58235a3c8819f0f40e8b2b07fced0 in your message.) {
  "error": {
    "message": "The server had an error while processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 22f58235a3c8819f0f40e8b2b07fced0 in your message.)",
    "type": "server_error",
    "param": null,
    "code": null
  }
}
 500 {'error': {'message': 'The server had an error while processing your request. Sorry

In [None]:
# Or load them from a file, if they were produced previously
with open("qdrant_points.json", "r") as file:
    emb = json.load(file)

In [17]:
print(f"Produced {len(emb)} embedding points for {len(all_metadata_formatted)} dandisets")

Produced 1527 embedding points for 151 dandisets


In [18]:
# Create Qdrant collection
create_collection(collection_name="dandi_collection")

In [20]:
# Populate collection with points
add_points_to_collection(embeddings_objects=emb)

All points added to collection dandi_collection


In [28]:
info = get_collection_info("dandi_collection")
print(f"Inserted {info['points_count']} points to Qdrant collection")

Inserted 1527 points to Qdrant collection


# Vectorize user questions

At this step, we handle user input, with the goal of finding the most relevant dandisets for the user's question.

We test two vectorization options:
- vectorize the entire question
- extract relevant keywords from the user's question and vectorize these keywords

We then perform a similarity search against our vector database to gather the points that are semantically closest to the user's question.

Finally, we include the most similar results and the user's input in a prompt that instructs the LLM to further refine the answer to the user.
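
The similarity search itself is delegated to Qdrant, but the underlying idea can be sketched in a few lines of plain Python. The example below ranks toy two-dimensional vectors by cosine similarity to a query vector. Cosine distance is an assumption here (the notebook does not show how the collection is configured), and `top_k_similar` is a hypothetical helper, not the function in `utils.qdrant`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vector, points, top_k=3):
    """Return the top_k (dandiset_id, score) pairs, highest similarity first."""
    scored = [
        (point["dandiset_id"], cosine_similarity(query_vector, point["vector"]))
        for point in points
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy data: real embeddings have many more dimensions.
toy_points = [
    {"dandiset_id": "set-a", "vector": [1.0, 0.0]},
    {"dandiset_id": "set-b", "vector": [0.0, 1.0]},
    {"dandiset_id": "set-c", "vector": [1.0, 1.0]},
]
for dandiset_id, score in top_k_similar([1.0, 0.1], toy_points, top_k=2):
    print(dandiset_id, round(score, 3))
```

A vector database performs the same ranking with approximate nearest-neighbour indexes, which is what keeps the query fast as the number of points grows.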

In [66]:
from utils.openai import (
    keywords_extraction, 
    prepare_keywords_for_semantic_search, 
    add_ordered_similarity_results_to_prompt, 
    get_llm_chat_answer
)
from utils.qdrant import query_all_keywords, query_from_user_input
from utils.pipeline import prepare_prompt

In [67]:
# Vectorize the user's input and query similar Qdrant points
user_input = "I am interested in the tuning properties of glial cells. Are there any good dandisets for studying that?"

ordered_similarity_results = query_from_user_input(text=user_input, top_k=15)

In [68]:
dandisets_text = add_ordered_similarity_results_to_prompt(similarity_results=ordered_similarity_results)
prompt = prepare_prompt(user_input=user_input, dandisets_text=dandisets_text, model="gpt-3.5-turbo-16k")
print(prompt)

Given the user input and the list of most reference dandisets, propose which dandisets the user might be interested in using.
Explain your decision based on the information of the reference dandisets.
Suggest only the dandisets which you consider to be most relevant. There can be multiple relevant dandisets.
Structure your answer as a numbered list, with one suggestion per item.
Always start your answers always with: "The most relevant dandisets for your question are:" 
Unless you consider there are no relevant dandisets, in which case only reply: "There are no relevant dandisets for your question"
---
User input: I am interested in the tuning properties of glial cells. Are there any good dandisets for studying that?
---
Reference dandisets:

DANDISET:000049/draft
title: Allen Institute – TF x SF tuning in mouse visual cortex with calcium imaging
description: A two photon calcium imaging dataset from Allen Institute measuring responses to full-field drifting gratings (approx. 120x90 de

In [69]:
answer = get_llm_chat_answer(prompt=prompt, model="gpt-3.5-turbo-16k")
print(answer)

The most relevant dandisets for your question are:

1. DANDISET:000049/draft - Allen Institute – TF x SF tuning in mouse visual cortex with calcium imaging: This dandiset provides a two-photon calcium imaging dataset that measures the responses of pyramidal neurons and inhibitory interneurons in the mouse visual cortex to drifting gratings. It includes information about spatial and temporal frequencies, which are relevant for studying tuning properties.

2. DANDISET:000020/draft - Patch-seq recordings from mouse visual cortex: This dandiset contains whole-cell Patch-seq recordings from neurons in the mouse visual cortex. While it primarily focuses on GABAergic interneurons, it also includes some glutamatergic neurons. The combination of patch-clamp recordings with single-cell RNA sequencing allows for the comprehensive study of morpho-electric properties, which can provide insights into tuning properties.

These dandisets provide relevant data on the tuning properties of neurons, speci

In [70]:
# A second approach is to first extract neuroscience-related keywords from the user's question
keywords = keywords_extraction(user_input=user_input)

# Join the results into a list of strings before the semantic search
keywords_2 = prepare_keywords_for_semantic_search(keywords)
keywords_2

['glial cells']

In [71]:
# Query similar entries for each keyword and accumulate the scores of repeated results
ordered_similarity_results = query_all_keywords(keywords_2, top_k=15)
ordered_similarity_results

[('DANDI:000350/0.221219.1506', 1.60849447),
 ('DANDI:000003/0.210812.1448', 1.6015465400000002),
 ('DANDI:000168/draft', 1.58228317),
 ('DANDI:000020/0.210913.1639', 0.8162182),
 ('DANDI:000245/draft', 0.7953203),
 ('DANDI:000302/draft', 0.7935248),
 ('DANDI:000295/draft', 0.78901964),
 ('DANDI:000165/0.211118.1526', 0.78858215),
 ('DANDI:000409/draft', 0.7884433),
 ('DANDI:000292/0.220708.1652', 0.7866779),
 ('DANDI:000008/0.211014.0809', 0.7866111),
 ('DANDI:000244/draft', 0.7853387)]
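
The score accumulation can be pictured with the following minimal sketch, which sums the similarity scores of every hit that points to the same dandiset and then sorts by the total. The helper `accumulate_scores` is a hypothetical stand-in for what `query_all_keywords` does internally; the toy IDs and scores are made up for the example.

```python
from collections import defaultdict


def accumulate_scores(results_per_keyword, top_k=15):
    """Sum the scores of hits that share a dandiset ID and rank by the total.

    `results_per_keyword` is a list with one entry per keyword; each entry is
    a list of (dandiset_id, score) hits. Hypothetical helper for illustration.
    """
    totals = defaultdict(float)
    for hits in results_per_keyword:
        for dandiset_id, score in hits:
            totals[dandiset_id] += score
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy hits for two keywords; "set-a" is matched by both.
toy_hits = [
    [("set-a", 0.5), ("set-b", 0.25)],
    [("set-a", 0.25), ("set-c", 0.5)],
]
print(accumulate_scores(toy_hits))
# → [('set-a', 0.75), ('set-c', 0.5), ('set-b', 0.25)]
```

Under this scheme a dandiset matched by several points receives a summed score, which would be consistent with the first three results above scoring roughly twice as high as the rest.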

In [72]:
# Prepare a prompt instructing the LLM to suggest the most relevant dandisets based on the user's input
dandisets_text = add_ordered_similarity_results_to_prompt(similarity_results=ordered_similarity_results)
prompt = prepare_prompt(user_input=user_input, dandisets_text=dandisets_text, model="gpt-3.5-turbo-16k")
print(prompt)

Given the user input and the list of most reference dandisets, propose which dandisets the user might be interested in using.
Explain your decision based on the information of the reference dandisets.
Suggest only the dandisets which you consider to be most relevant. There can be multiple relevant dandisets.
Structure your answer as a numbered list, with one suggestion per item.
Always start your answers always with: "The most relevant dandisets for your question are:" 
Unless you consider there are no relevant dandisets, in which case only reply: "There are no relevant dandisets for your question"
---
User input: I am interested in the tuning properties of glial cells. Are there any good dandisets for studying that?
---
Reference dandisets:

DANDISET:000350/draft
title: Glia Accumulate Evidence that Actions Are Futile and Suppress Unsuccessful Behavior
description: When a behavior repeatedly fails to achieve its goal, animals often give up and become passive, which can be strategic fo

In [73]:
answer = get_llm_chat_answer(prompt=prompt)
print(answer)

The most relevant dandisets for your question are:

1. DANDISET:000350/draft - This dandiset titled "Glia Accumulate Evidence that Actions Are Futile and Suppress Unsuccessful Behavior" focuses on the tuning properties of astrocytes in larval zebrafish. It uses a microscopy approach and whole-brain calcium imaging to investigate the role of radial astrocytes in identifying failed swim attempts and mediating a switch to passive behavior.

2. DANDISET:000168/draft - This dandiset titled "Simultaneous loose seal cell-attached recordings and two-photon imaging of GCaMP8 expressing mouse V1 neurons with drifting gratings visual stimuli" involves the study of mouse primary visual cortex. It combines microscopy and electrophysiological approaches to examine the tuning properties of neurons in response to drifting gratings visual stimuli.

These two dandisets are directly relevant to your interest in studying the tuning properties of glial cells and provide valuable insights into the functiona

# Comparison of methods

Here we compare the results of both methods for a variety of possible questions.

In [17]:
from utils.pipeline import suggest_relevant_dandisets

In [25]:
user_input = "I want to study natural movement in humans"

suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=1)
print(suggestions)

The most relevant dandisets for your question are:

1. DANDISET:000055/draft - AJILE12: Long-term naturalistic human intracranial neural recordings and pose. This dataset includes synchronized intracranial neural recordings and upper-body pose trajectories across 55 semi-continuous days of naturalistic movements in humans. The dataset provides insights into the neural basis of human movement in naturalistic scenarios.

2. DANDISET:000540/draft - Dataset for: A change in behavioral state switches the pattern of motor output that underlies rhythmic head and orofacial movements. This dataset includes multi-modal data from rats performing naturalistic foraging and rearing behaviors. While it is not directly related to human movement, it provides valuable information on the patterns of motor output in response to changes in behavioral states.

These dandisets are the most relevant to your interest in studying natural movement in humans. The AJILE12 dataset offers intracranial neural recordi

In [26]:
suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=2)
print(suggestions)

The most relevant dandisets for your question are:

1. DANDISET:000045/draft: A NWB-based dataset and processing pipeline of human single-neuron activity during a declarative memory task. This dataset contains 1863 single neurons recorded from the medial temporal lobes of 59 human subjects during a recognition memory task. The data is stored in Neurodata Without Borders (NWB) format and includes meta-data, stimulus information, and behavior.

2. DANDISET:000097/draft: Large-scale neural recordings with single-neuron resolution using Neuropixels probes in human cortex. This dataset describes a new probe variant and techniques that enable simultaneous recording from over 200 well-isolated cortical single units in human participants. It provides high-density recordings at an unprecedented spatiotemporal resolution.

3. DANDISET:000115/draft: Low-noise encoding of active touch by layer 4 in the somatosensory cortex. This dataset includes data from a study on how layer 4 neurons in the soma

In [23]:
user_input = "Are there any datasets that have electrophysiology recordings of a rodent navigating a maze?"

suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=1)
print(suggestions)

The most relevant dandisets for your question are:
1. DANDISET:000044/draft - This dataset includes electrophysiological recordings from rat CA1 hippocampus during navigation in different types of mazes. The animals were trained to run on novel mazes and their neural firing and LFP patterns were recorded. This dataset provides insights into the neural activity underlying spatial learning and navigation in rodents.

2. DANDISET:000410/draft - This dataset contains electrophysiological data from rat CA1 hippocampus during navigation on a linear or w-shaped track. The dataset provides information about the dynamic synchronization between hippocampal spatial representations and the rat's stepping rhythm.

These dandisets are highly relevant to your query as they involve electrophysiological recordings of rodents navigating mazes. They provide valuable data for studying the neural correlates of spatial learning and memory.


In [24]:
suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=2)
print(suggestions)

The most relevant dandisets for your question are: 

1. DANDISET:000044/draft: This dandiset consists of electrophysiological recordings of hippocampal CA1 neural firing and LFP patterns in rats navigating different types of mazes. The recordings were performed during a novel spatial learning task in which the rats were transferred to a novel room and water-rewarded to run on a maze. The data includes multi-cellular electrophysiological recordings, neck EMG, head-mounted accelerometer signals, and tracking of the animal's position during the maze running epochs.

2. DANDISET:000402/draft: This dandiset includes two-photon functional imaging data from a male mouse. The recordings were performed in vivo using a two-photon random access mesoscope while the mouse viewed natural movies and parametric stimuli. The data includes calcium imaging of approximately 75,000 pyramidal cells in primary visual cortex and higher visual cortical areas, as well as treadmill rotation and eye movement data