In [1]:
%load_ext autoreload
%autoreload 2

# Plan of action

### Collect metadata from dandisets. For each set:
- get name and description
- get assets summary: approaches, measurement techniques, variables measured
- for species, we can get accurate info from NCBITaxon, if it's included
- (?) get related resources: scrape title/summary from articles

### Vector embeddings
- Use OpenAI's `text-embedding-ada-002` model to vectorize each piece of information
- Store the vectors in Qdrant, with the original dandiset ID as payload metadata on each vector
- (?) is there any useful extra info we could include in the payload metadata for later filtered searches?

### Process user questions
- User queries can come in the form of questions, e.g.: "Which datasets can I use to investigate the effects of drug YYYY on cells of type XXXX?"
- We need a prompt that instructs the LLM to extract useful neuroscience research-related keywords from the question, and proceed to use the semantic search engine. We can achieve this using [OpenAI functions](https://openai.com/blog/function-calling-and-other-api-updates)
- (simple approach) We return the dandiset IDs present in the top results 
- (advanced approach) We gather the text content of the collected metadata for the dandisets in the top results and include them in a prompt together with the user's original question and a default instruction such as: "Given the user question and the listed reference datasets, which datasets could help address the user's question, and why?"

# Collect metadata from dandisets

At this step, we collect the metadata of all dandisets and extract the relevant text fields, which are vectorized later on.

In [2]:
from utils.dandi import get_all_dandisets_metadata, collect_relevant_metadata

In [3]:
# Get all dandisets metadata
all_metadata = get_all_dandisets_metadata()

In [4]:
# Extract only relevant text fields from metadata
all_metadata_formatted = collect_relevant_metadata(metadata_list=all_metadata)
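
The internals of `collect_relevant_metadata` are not shown in this notebook. As a minimal sketch of the idea, assuming each raw record is a dict with `id`, `name`, `description` and an `assetsSummary` block (these key names, and the helper `format_dandiset_metadata`, are assumptions made for illustration), the extraction could prefix each value with its field label like this:

```python
def format_dandiset_metadata(raw: dict) -> dict:
    """Hypothetical helper: pick text fields from one raw metadata record
    and prefix each value with its field label."""
    summary = raw.get("assetsSummary") or {}

    def names(key: str) -> list:
        # Entries are assumed to be dicts with a "name" key, or plain strings.
        out = []
        for entry in summary.get(key) or []:
            name = entry.get("name") if isinstance(entry, dict) else entry
            if name:
                out.append(name)
        return out

    return {
        "dandiset_id": raw.get("id", ""),
        "title": f"title: {raw.get('name', '')}",
        "description": f"description: {raw.get('description', '')}",
        "approaches": [f"approach: {n}" for n in names("approach")],
        "measurement_techniques": [
            f"measurement technique: {n}" for n in names("measurementTechnique")
        ],
        "variables_measured": [
            f"variable measured: {n}" for n in names("variableMeasured")
        ],
    }


# Toy record built from values that appear in this notebook (assumed layout).
example_raw = {
    "id": "DANDI:000003/0.210812.1448",
    "name": "Physiological Properties and Behavioral Correlates of "
            "Hippocampal Granule Cells and Mossy Cells",
    "description": "Electrophysiology recordings of hippocampus during theta maze exploration.",
    "assetsSummary": {
        "approach": [
            {"name": "electrophysiological approach"},
            {"name": "behavioral approach"},
        ],
        "variableMeasured": ["LFP", "Units"],
    },
}
example_formatted = format_dandiset_metadata(example_raw)
```

Prefixing each value with its label keeps a short string such as `LFP` tied to its field once it is embedded on its own.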

In [5]:
all_metadata_formatted[0]

{'dandiset_id': 'DANDI:000003/0.210812.1448',
 'title': 'title: Physiological Properties and Behavioral Correlates of Hippocampal Granule Cells and Mossy Cells',
 'description': 'description: Data from "Physiological Properties and Behavioral Correlates of Hippocampal Granule Cells and Mossy Cells" Senzai, Buzsaki, Neuron 2017. Electrophysiology recordings of hippocampus during theta maze exploration.',
 'approaches': ['approach: electrophysiological approach',
  'approach: behavioral approach'],
 'measurement_techniques': ['measurement technique: signal filtering technique',
  'measurement technique: fourier analysis technique',
  'measurement technique: spike sorting technique',
  'measurement technique: behavioral technique',
  'measurement technique: multi electrode extracellular electrophysiology recording technique'],
 'variables_measured': ['variable measured: DecompositionSeries',
  'variable measured: LFP',
  'variable measured: Units',
  'variable measured: Position',
  'vari

# Vector embeddings

At this step, we generate a vector embedding for each property of the formatted metadata of each dandiset. After that, we insert each vector together with its payload (metadata information) into Qdrant.
To run the cells below, you must have:
- `OPENAI_API_KEY` set as an environment variable
- a Qdrant service running ([ref](https://qdrant.tech/documentation/quick-start/))

In [19]:
from utils.openai import get_embeddings
from utils.qdrant import create_collection, add_points_to_collection

In [16]:
# Generate vector embeddings for all items in the formatted metadata list
emb = get_embeddings(
    metadata_list=all_metadata_formatted,
    max_num_sets=10
)
print(len(emb))

Generating embeddings for dandiset: DANDI:000003/0.210812.1448
Generating embeddings for dandiset: DANDI:000004/0.220126.1852
Generating embeddings for dandiset: DANDI:000025/draft
Generating embeddings for dandiset: DANDI:000010/0.220126.1905
Generating embeddings for dandiset: DANDI:000007/0.220126.1903
Generating embeddings for dandiset: DANDI:000016/draft
Generating embeddings for dandiset: DANDI:000009/0.220126.1903
Generating embeddings for dandiset: DANDI:000005/0.220126.1853
Generating embeddings for dandiset: DANDI:000017/draft
Generating embeddings for dandiset: DANDI:000020/0.210913.1639
113
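
The count above (113 embeddings for 10 dandisets) is consistent with one vector per piece of text rather than one per dandiset. A minimal sketch of that flattening step is shown below; the helper name `flatten_metadata_for_embedding` and the payload keys are assumptions for illustration, not the actual `get_embeddings` implementation, and the embedding call itself is left out.

```python
def flatten_metadata_for_embedding(formatted: dict) -> list:
    """Hypothetical helper: return one {text, payload} item per piece of text."""
    dandiset_id = formatted["dandiset_id"]
    items = []
    for field, value in formatted.items():
        if field == "dandiset_id":
            continue
        # A field holds either a single string or a list of strings.
        texts = value if isinstance(value, list) else [value]
        for text in texts:
            if text:
                items.append(
                    {
                        "text": text,
                        "payload": {"dandiset_id": dandiset_id, "field": field},
                    }
                )
    return items


example_formatted_entry = {
    "dandiset_id": "DANDI:000003/0.210812.1448",
    "title": "title: Physiological Properties and Behavioral Correlates of "
             "Hippocampal Granule Cells and Mossy Cells",
    "approaches": [
        "approach: electrophysiological approach",
        "approach: behavioral approach",
    ],
    "variables_measured": ["variable measured: LFP", "variable measured: Units"],
}
embedding_items = flatten_metadata_for_embedding(example_formatted_entry)
print(len(embedding_items))
```

Each item would then be embedded separately, and its payload lets a search hit be traced back to the dandiset and field it came from.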


In [20]:
# Create collection
create_collection(collection_name="dandi_collection")

In [21]:
# Populate collection with points
add_points_to_collection(embeddings_objects=emb)
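
To make the role of the vector store concrete, here is a plain-Python stand-in for a similarity search over stored points. It is not Qdrant and does not use its client; it assumes cosine similarity as the metric (the notebook does not state which distance the collection uses), and the three-dimensional vectors are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_points(points, query_vector, top_k=3):
    """Return the top_k (score, payload) pairs, highest similarity first."""
    scored = [
        (cosine_similarity(point["vector"], query_vector), point["payload"])
        for point in points
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy points: each stored vector carries its dandiset ID as payload.
toy_points = [
    {
        "vector": [1.0, 0.0, 0.0],
        "payload": {"dandiset_id": "DANDI:000003/0.210812.1448", "field": "approaches"},
    },
    {
        "vector": [0.0, 1.0, 0.0],
        "payload": {"dandiset_id": "DANDI:000005/0.220126.1853", "field": "title"},
    },
]
top_hits = search_points(toy_points, query_vector=[0.9, 0.1, 0.0], top_k=1)
```

A real vector database does the same ranking with approximate nearest-neighbour indexes, so it scales well beyond a linear scan.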

# Process user questions

At this step, we handle user input, with the goal of finding the most relevant dandisets for each question. We start by extracting relevant keywords from the question, then perform a semantic similarity search against our vector database.

In [79]:
from utils.openai import (
    keywords_extraction,
    prepare_keywords_for_semantic_search,
    add_ordered_similarity_results_to_prompt,
    get_llm_chat_answer,
)
from utils.qdrant import query_all_keywords

In [32]:
# Extract keywords from user input
user_question = """I want to investigate the effects of caffeine on the activity of cerebellar Purkinje cells in monkeys. 
I'm interested in any type of electrophysiology recordings.
I also want spike sorting from neurons in mice."""

keywords = keywords_extraction(user_question=user_question)
keywords

[{'species': 'monkeys',
  'approaches': 'electrophysiology recordings',
  'measurement_techniques': 'spike sorting',
  'variables_measured': 'activity of cerebellar Purkinje cells',
  'anatomy': 'cerebellar Purkinje cells',
  'disease': '',
  'cell_types': '',
  'drugs': 'caffeine'},
 {'species': 'mice',
  'approaches': 'electrophysiology recordings',
  'measurement_techniques': 'spike sorting',
  'variables_measured': '',
  'anatomy': '',
  'disease': '',
  'cell_types': 'neurons',
  'drugs': ''}]
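
The structured output above is the kind of result that OpenAI function calling is designed to produce. As a rough sketch, a function definition for this extraction could look like the following; the function name `extract_keywords`, the `queries` wrapper and the descriptions are assumptions for illustration, and only the eight field names are taken from the output above. With function calling, the model returns the arguments as a JSON string, which is parsed here with a small hypothetical helper.

```python
import json

# Field names taken from the keywords output shown in this notebook.
KEYWORD_FIELDS = [
    "species",
    "approaches",
    "measurement_techniques",
    "variables_measured",
    "anatomy",
    "disease",
    "cell_types",
    "drugs",
]

# Hypothetical function definition in JSON Schema form.
extract_keywords_function = {
    "name": "extract_keywords",
    "description": "Extract neuroscience research keywords from a user question.",
    "parameters": {
        "type": "object",
        "properties": {
            "queries": {
                "type": "array",
                "description": "One entry per distinct request in the question.",
                "items": {
                    "type": "object",
                    "properties": {
                        field: {"type": "string"} for field in KEYWORD_FIELDS
                    },
                    "required": KEYWORD_FIELDS,
                },
            }
        },
        "required": ["queries"],
    },
}


def parse_function_arguments(arguments_json: str) -> list:
    """Parse the JSON arguments string, filling missing fields with ''."""
    parsed = json.loads(arguments_json)
    return [
        {field: query.get(field, "") for field in KEYWORD_FIELDS}
        for query in parsed.get("queries", [])
    ]


example_arguments = '{"queries": [{"species": "mice", "cell_types": "neurons"}]}'
parsed_keywords = parse_function_arguments(example_arguments)
```

Filling missing fields with empty strings keeps every entry in the same shape, which matches the empty values seen in the output above.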

In [36]:
# Join the results into a list of unique strings before the semantic search
keywords_2 = prepare_keywords_for_semantic_search(keywords)
keywords_2

['drugs caffeine',
 'variables_measured activity of cerebellar purkinje cells',
 'species monkeys',
 'species mice',
 'anatomy cerebellar purkinje cells',
 'measurement_techniques spike sorting',
 'approaches electrophysiology recordings',
 'cell_types neurons']
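
The list above contains each `field value` pair once, in lowercase, with empty values dropped. A minimal sketch that reproduces this behaviour is given below; the function name `prepare_keywords` is hypothetical and this is not necessarily how `prepare_keywords_for_semantic_search` is implemented. Unlike the output above, this version preserves first-seen order.

```python
def prepare_keywords(keyword_groups):
    """Build unique, lowercased 'field value' strings, skipping empty values."""
    seen = {}  # a dict keeps insertion order and removes duplicates
    for group in keyword_groups:
        for field, value in group.items():
            value = value.strip().lower()
            if value:
                seen[f"{field} {value}"] = None
    return list(seen)


example_keywords = [
    {
        "species": "monkeys",
        "approaches": "electrophysiology recordings",
        "measurement_techniques": "spike sorting",
        "variables_measured": "activity of cerebellar Purkinje cells",
        "anatomy": "cerebellar Purkinje cells",
        "disease": "",
        "cell_types": "",
        "drugs": "caffeine",
    },
    {
        "species": "mice",
        "approaches": "electrophysiology recordings",
        "measurement_techniques": "spike sorting",
        "variables_measured": "",
        "anatomy": "",
        "disease": "",
        "cell_types": "neurons",
        "drugs": "",
    },
]
search_strings = prepare_keywords(example_keywords)
```

Keeping the field name in each string mirrors the labelled text that was embedded, so a query such as `species mice` is compared against similarly labelled entries.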

In [77]:
# Query similar entries for each keyword, accumulate the scores for repeated results
ordered_similarity_results = query_all_keywords(keywords_2)
ordered_similarity_results

[('DANDI:000003/0.210812.1448', 11.780093099999998),
 ('DANDI:000005/0.220126.1853', 10.00378832),
 ('DANDI:000009/0.220126.1903', 8.42570872),
 ('DANDI:000017/draft', 7.604895280000001),
 ('DANDI:000010/0.220126.1905', 7.550612050000001),
 ('DANDI:000020/0.210913.1639', 6.56471599),
 ('DANDI:000004/0.220126.1852', 5.90100909),
 ('DANDI:000007/0.220126.1903', 5.18161296),
 ('DANDI:000025/draft', 3.3110855900000002),
 ('DANDI:000016/draft', 0.7949213)]
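
Scores above 1 in this ranking are what one would expect if per-keyword similarity scores are summed whenever the same dandiset appears in several results. A minimal sketch of that accumulation follows; `accumulate_scores` and the `search_fn` callback are hypothetical names, and the toy scores are invented for illustration rather than taken from the database.

```python
from collections import defaultdict


def accumulate_scores(keywords, search_fn, top_k=10):
    """Sum the scores of repeated dandisets across all keyword searches."""
    totals = defaultdict(float)
    for keyword in keywords:
        # search_fn is assumed to return (dandiset_id, score) pairs.
        for dandiset_id, score in search_fn(keyword, top_k):
            totals[dandiset_id] += score
    # Highest accumulated score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def fake_search(keyword, top_k):
    """Stand-in for the vector search, with made-up scores."""
    hits = {
        "species mice": [
            ("DANDI:000003/0.210812.1448", 0.5),
            ("DANDI:000005/0.220126.1853", 0.25),
        ],
        "measurement_techniques spike sorting": [
            ("DANDI:000003/0.210812.1448", 0.25),
        ],
    }
    return hits.get(keyword, [])[:top_k]


ranked = accumulate_scores(
    ["species mice", "measurement_techniques spike sorting"], fake_search
)
```

Summing favours dandisets that match many keywords; a dandiset with many embedded fields also has more chances to score, which is worth keeping in mind when reading the ranking.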

In [93]:
# Prepare a prompt instructing the LLM to suggest the most relevant dandisets based on the user's input
dandisets_text = add_ordered_similarity_results_to_prompt(similarity_results=ordered_similarity_results)

base_prompt = """Given the user input and the list of most relevant dandisets, propose which dandisets the user might be interested in using.
Explain your decision based on the information of the relevant dandisets.
Also take into consideration the relevance score; higher numbers are likely to be more relevant.
Structure your answer as one suggestion per paragraph.
Suggest only the dandisets which you consider most relevant. There can be multiple relevant dandisets.
Start your answers always with: "The most relevant dandisets for your question are" 
Unless you consider there are no relevant dandisets, in which case only reply: "There are no relevant dandisets for your question"
---
User input: {user_input}
---
{dandisets_text}
---
Begin:"""

prompt = base_prompt.format(user_input=user_question, dandisets_text=dandisets_text)
print(prompt)

Given the user input and the list of most relevant dandisets, propose which dandisets the user might be interested in using.
Explain your decision based on the information of the relevant dandisets.
Also take into consideration the relevance score; higher numbers are likely to be more relevant.
Structure your answer as one suggestion per paragraph.
Suggest only the dandisets which you consider most relevant. There can be multiple relevant dandisets.
Start your answers always with: "The most relevant dandisets for your question are" 
Unless you consider there are no relevant dandisets, in which case only reply: "There are no relevant dandisets for your question"
---
User input: I want to investigate the effects of caffeine on the activity of cerebellar Purkinje cells in monkeys. 
I'm interested in any type of electrophysiology recordings.
I also want spike sorting from neurons in mice.
---
Most relevant dandisets:

DANDISET:000003/draft
title: Physiological Properties and Behavioral Corre

In [94]:
answer = get_llm_chat_answer(prompt=prompt)

In [95]:
print(answer)

The most relevant dandiset for investigating the effects of caffeine on the activity of cerebellar Purkinje cells in monkeys, particularly using electrophysiology recordings, is DANDISET:000003/draft. This dataset contains electrophysiology recordings of the hippocampus during theta maze exploration in house mice. It includes measurements such as DecompositionSeries, LFP, Units, Position, and ElectricalSeries, which are relevant for studying neuronal activity. The approach used in this dataset includes electrophysiological and behavioral approaches, making it suitable for investigating the effects of caffeine on cerebellar Purkinje cell activity.

Another relevant dandiset for spike sorting from neurons in mice, and potentially for investigating the effects of caffeine, is DANDISET:000005/draft. This dataset consists of intracellular and extracellular electrophysiology recordings performed on the mouse barrel cortex and ventral posterolateral nucleus (vpm) during a whisker-based object