# Identify Dandisets relevant to a scientific question


## Motivation

We want to provide a system that suggests Dandisets relevant to a user's question. The hope is that an LLM such as ChatGPT can use the semantic content of the question to return better results than simple text matching.

## Plan of action
 

### 1. Collect metadata from dandisets.
<img src="step1_embed_dandiset_metadata.jpg" style="width: 700px;" />

For each Dandiset:

- Get the name and description.
- Get the assets summary: approaches, measurement techniques, variables measured.
- For species, we can get accurate information from NCBITaxon, if it is included.
- Use OpenAI's `text-embedding-ada-002` model to vectorize each metadata item.
- Store the vectors in Qdrant, with the original Dandiset ID as payload metadata for each vector.
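
The text-preparation part of the steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the `stringify_metadata` helper, the field names in the sample record, and the placeholder ID `000000` are all assumptions made for the example. The embedding call itself is left out.

```python
def stringify_metadata(meta: dict) -> str:
    """Flatten selected metadata fields into one text block for embedding."""
    parts = [
        f"Title: {meta.get('name', '')}",
        f"Description: {meta.get('description', '')}",
    ]
    # Hypothetical field names; the real metadata schema may differ.
    list_fields = ("approaches", "measurement_techniques", "variables_measured", "species")
    for key in list_fields:
        values = meta.get(key) or []
        if values:  # skip fields that are missing or empty
            label = key.replace("_", " ").capitalize()
            parts.append(f"{label}: {', '.join(values)}")
    return "\n".join(parts)


# Placeholder record, not a real Dandiset.
sample = {
    "dandiset_id": "000000",
    "name": "Example dataset",
    "description": "Placeholder description.",
    "approaches": ["electrophysiological approach"],
    "measurement_techniques": ["spike sorting technique"],
    "variables_measured": ["Units", "ElectrodeGroup"],
    "species": [],
}

text = stringify_metadata(sample)
# Keep the Dandiset ID next to the text so it can later be stored as payload.
record = {"dandiset_id": sample["dandiset_id"], "text": text}
print(text)
```

Keeping the ID alongside the text at this stage makes it straightforward to attach it to the vector once the embedding has been computed.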

### 2. Process user questions
<img src="step2_do_search.jpg" style="width: 700px;" />

- User queries can come in the form of questions, e.g.: "Which datasets can I use to investigate the effects of drug YYYY on cells of type XXXX?"
- For method 1, the question is passed directly to the OpenAI embeddings API and the resulting vector is used for the semantic search.
- For method 2, a prompt instructs the LLM to extract useful neuroscience research-related keywords from the question, and the semantic search is then run on those keywords. We can achieve this using [OpenAI functions](https://openai.com/blog/function-calling-and-other-api-updates).
- (simple approach) We return the dandiset IDs present in the top results.
- (advanced approach) We gather the text content of the collected metadata for the dandisets in the top results and include them in a prompt together with the user's original question and a default instruction such as: "Given the user question and the listed reference datasets, which datasets could help address the user's question, and why?"
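
For method 2, a function definition for keyword extraction could look like the sketch below. It follows the JSON Schema layout used by OpenAI function calling, where the model replies with a `function_call` whose `arguments` field is a JSON string. The function name `extract_keywords`, its description, and the hand-written `simulated_function_call` are assumptions for illustration; no API request is made and the simulated reply is not real model output.

```python
import json

# Hypothetical function definition, passed to the chat API in the `functions` list.
KEYWORD_FUNCTION = {
    "name": "extract_keywords",
    "description": "Extract neuroscience research-related keywords from a question.",
    "parameters": {
        "type": "object",
        "properties": {
            "keywords": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Keywords useful for a semantic search.",
            }
        },
        "required": ["keywords"],
    },
}

# Hand-written stand-in for the `function_call` part of a model reply.
simulated_function_call = {
    "name": "extract_keywords",
    "arguments": '{"keywords": ["glial cells", "tuning properties"]}',
}


def parse_keywords(function_call: dict) -> list:
    """Decode the JSON-encoded arguments and return the keyword list."""
    if function_call.get("name") != KEYWORD_FUNCTION["name"]:
        return []
    arguments = json.loads(function_call["arguments"])
    return [kw.strip() for kw in arguments.get("keywords", []) if kw.strip()]


keywords = parse_keywords(simulated_function_call)
print(keywords)
```

Because the arguments arrive as a string, decoding them with `json.loads` and validating the result is worth doing before the keywords are sent to the search step.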

```bash
pip install -r requirements.txt
export OPENAI_API_KEY=<your-openai-api-key>
docker run -p 6333:6333 -v ~/qdrant_storage:/qdrant/storage qdrant/qdrant:latest
```

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from rest.clients.dandi import DandiClient
from rest.clients.qdrant import QdrantClient
from rest.clients.openai import OpenaiClient
import json

dandi_client = DandiClient()
qdrant_client = QdrantClient(host="http://localhost")
openai_client = OpenaiClient()

# Extract Dandisets metadata

In [None]:
# Get all dandisets metadata
all_metadata = dandi_client.get_all_dandisets_metadata()

In [None]:
# Extract only relevant text fields from metadata
all_metadata_formatted = dandi_client.collect_relevant_metadata(metadata_list=all_metadata)

In [None]:
print("Number of items: ", len(all_metadata_formatted))
all_metadata_formatted[0]

In [None]:
print(dandi_client.stringify_relevant_metadata(all_metadata_formatted[0]))

# Vector embeddings

At this step, we generate vector embeddings for the formatted metadata of each dandiset. We then insert the vectors, together with their payload (metadata information), into Qdrant.

To run the cells below, you must have:
- `OPENAI_API_KEY` set as an environment variable
- A running Qdrant service (see the [Qdrant quick start](https://qdrant.tech/documentation/quick-start/))
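
To make the "vectors + payload" structure concrete, the sketch below builds the JSON bodies that Qdrant's REST API accepts for creating a collection and for upserting points. It only constructs and prints the bodies; nothing is sent to a server. The `rest.clients.qdrant` wrapper used in this notebook may work differently, and the helper names, the placeholder ID `000000`, and the all-zero vector are assumptions. Qdrant point IDs must be unsigned integers or UUIDs, so the Dandiset ID is carried in the payload.

```python
import json

# Output dimension of OpenAI's text-embedding-ada-002 model.
EMBEDDING_DIM = 1536


def create_collection_body(dim: int = EMBEDDING_DIM) -> dict:
    """Body for: PUT /collections/{collection_name}"""
    return {"vectors": {"size": dim, "distance": "Cosine"}}


def upsert_points_body(records: list) -> dict:
    """Body for: PUT /collections/{collection_name}/points"""
    points = []
    for idx, rec in enumerate(records):
        points.append(
            {
                "id": idx,  # integer point ID required by Qdrant
                "vector": rec["vector"],
                "payload": {"dandiset_id": rec["dandiset_id"], "text": rec["text"]},
            }
        )
    return {"points": points}


# Placeholder record with a dummy vector, not a real embedding.
records = [
    {
        "dandiset_id": "000000",
        "text": "Title: Example dataset",
        "vector": [0.0] * EMBEDDING_DIM,
    }
]

collection_body = create_collection_body()
points_body = upsert_points_body(records)
print(json.dumps(collection_body))
print(len(points_body["points"]), "point(s) prepared")
```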

In [None]:
# Generate vector embeddings for all items in the formatted metadata list
# This can be slow and costs a few cents per run, so it's recommended to save the results to disk and load them later
emb = openai_client.get_embeddings(
    metadata_list=all_metadata_formatted,
    # max_num_sets=10,
    save_to_file=True
)

In [None]:
# Or load them from a file, if previously produced
with open("data/qdrant_points.json", "r") as file:
    emb = json.load(file)

In [None]:
print(f"Produced {len(emb)} embedding points for {len(all_metadata_formatted)} dandisets")

In [None]:
# Create Qdrant collection
qdrant_client.create_collection(collection_name="dandi_collection")

In [None]:
# Populate collection with points
qdrant_client.add_points_to_collection(collection_name="dandi_collection", embeddings_objects=emb)

In [None]:
info = qdrant_client.get_collection_info("dandi_collection")
print(f"Inserted {info['points_count']} points to Qdrant collection")

# Vectorize user questions

At this step, we handle user input, with the goal of finding the dandisets most relevant to the user's question.

We test two vectorization options:
- vectorize the entire question
- extract relevant keywords from the user's question and vectorize these keywords

We then perform a similarity search against our vector database to gather the points most semantically similar to the user's question.
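
As a rough picture of what the similarity search does, the sketch below ranks stored points by cosine similarity and, for the keyword option, sums the scores of results that appear for more than one keyword. In the notebook this work is delegated to Qdrant through `query_from_user_input` and `query_all_keywords`; the pure-Python helpers, the 2-dimensional toy vectors, and the IDs `set-a`, `set-b`, `set-c` are assumptions for illustration only.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, points, k):
    """Return the k (score, dandiset_id) pairs with the highest similarity."""
    scored = [(cosine(query_vec, p["vector"]), p["dandiset_id"]) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def accumulate_over_keywords(keyword_vecs, points, k):
    """Sum scores per dandiset across keywords, then rank by total score."""
    totals = defaultdict(float)
    for vec in keyword_vecs:
        for score, dandiset_id in top_k(vec, points, k):
            totals[dandiset_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy data: placeholder IDs and 2-D vectors instead of real embeddings.
points = [
    {"dandiset_id": "set-a", "vector": [1.0, 0.0]},
    {"dandiset_id": "set-b", "vector": [0.0, 1.0]},
    {"dandiset_id": "set-c", "vector": [0.6, 0.8]},
]
keyword_vecs = [[1.0, 0.0], [0.6, 0.8]]

ranked = accumulate_over_keywords(keyword_vecs, points, k=2)
print([dandiset_id for dandiset_id, _ in ranked])
```

Summing scores favors dandisets that match several keywords at once, which is why `set-c` ends up ahead of `set-a` here even though `set-a` is the best match for the first keyword alone.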

Finally, we include the most semantically similar results and the user's input into a prompt which instructs the LLM to further refine the answer to the user.
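
The prompt assembly for this final step might look like the sketch below. The instruction text is the default instruction proposed in the plan of action; the `build_prompt` helper, the layout of the prompt, and the placeholder references are assumptions and may not match what `prepare_prompt` in `utils.pipeline` actually produces.

```python
INSTRUCTION = (
    "Given the user question and the listed reference datasets, "
    "which datasets could help address the user's question, and why?"
)


def build_prompt(user_question: str, references: list) -> str:
    """Combine the instruction, the retrieved dandiset texts and the question."""
    lines = [INSTRUCTION, "", "Reference datasets:"]
    for ref in references:
        lines.append(f"- DANDI:{ref['dandiset_id']}: {ref['text']}")
    lines += ["", f"User question: {user_question}"]
    return "\n".join(lines)


# Placeholder references standing in for the top similarity results.
references = [
    {"dandiset_id": "000000", "text": "Title: Example dataset A"},
    {"dandiset_id": "000001", "text": "Title: Example dataset B"},
]
question = "Which datasets include recordings from glial cells?"

prompt = build_prompt(question, references)
print(prompt)
```

Listing each reference with its ID lets the LLM cite specific dandisets in its answer, which makes the suggestions easy to trace back to the search results.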

In [None]:
from utils.openai import (
    keywords_extraction, 
    prepare_keywords_for_semantic_search, 
    add_ordered_similarity_results_to_prompt, 
    get_llm_chat_answer
)
from utils.qdrant import query_all_keywords, query_from_user_input
from utils.pipeline import prepare_prompt

In [None]:
# Vectorize user's input and query similar Qdrant points
user_input = "I am interested in the tuning properties of glial cells. Are there any good dandisets for studying that?"

ordered_similarity_results = query_from_user_input(text=user_input, top_k=15)

In [None]:
dandisets_text = add_ordered_similarity_results_to_prompt(similarity_results=ordered_similarity_results)
prompt = prepare_prompt(user_input=user_input, dandisets_text=dandisets_text, model="gpt-3.5-turbo-16k")
print(prompt)

In [None]:
answer = get_llm_chat_answer(prompt=prompt, model="gpt-3.5-turbo-16k")
print(answer)

In [None]:
# A second approach is to first extract neuroscience-related keywords from the user's question
keywords = keywords_extraction(user_input=user_input)

# Join the results into a list of strings before the semantic search
keywords_2 = prepare_keywords_for_semantic_search(keywords)
keywords_2

In [None]:
# Query similar entries for each keyword, accumulate the scores for repeated results
ordered_similarity_results = query_all_keywords(keywords_2, top_k=15)
ordered_similarity_results

In [None]:
# Prepare a prompt instructing the LLM to suggest the most relevant dandisets based on user's input
dandisets_text = add_ordered_similarity_results_to_prompt(similarity_results=ordered_similarity_results)
prompt = prepare_prompt(user_input=user_input, dandisets_text=dandisets_text, model="gpt-3.5-turbo-16k")
print(prompt)

In [None]:
# Use the same model the prompt was prepared for
answer = get_llm_chat_answer(prompt=prompt, model="gpt-3.5-turbo-16k")
print(answer)

# Comparison of methods

Here we compare the results of both methods for a variety of possible questions.

In [None]:
from utils.pipeline import suggest_relevant_dandisets

In [None]:
user_input = "I want to study natural movement in humans"

suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=1)
print(suggestions)

In [None]:
suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=2)
print(suggestions)

In [None]:
user_input = "Are there any datasets that have electrophysiology recordings of a rodent navigating a maze?"

suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=1)
print(suggestions)

In [None]:
suggestions = suggest_relevant_dandisets(user_input=user_input, model="gpt-3.5-turbo-16k", method=2)
print(suggestions)