# LlamaIndex Bioconductor 3.19 RAG model 

This RAG model builds an index from the vignettes of the software packages newly released in Bioconductor (Bioc) 3.19.

The notebook contains the following experiments:

1. We first use the API to answer questions about Bioc 3.19 with a foundational GPT-4 model and with a GPT-4 model backed by the Bioc 3.19 RAG store.

2. LLM-as-a-judge evaluation: we use an LLM (GPT-4) as a judge to evaluate responses to a basic question for which we already have the ground truth, "How many classes are there in Bioconductor package X?", from the following models:
    - GPT-4
    - GPT-4 + RAG
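
A minimal sketch of how such a judge prompt could be assembled and its verdict parsed. The function names (`build_judge_prompt`, `parse_verdict`), the prompt wording, and the CORRECT/INCORRECT convention are assumptions for illustration, not the evaluation code used in this notebook.

```python
def build_judge_prompt(question: str, ground_truth: str, candidate: str) -> str:
    """Build a grading prompt (hypothetical wording) for the judge LLM."""
    return (
        "You are grading an answer to a question about a Bioconductor package.\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def parse_verdict(judge_reply: str) -> bool:
    """Return True only if the judge's reply starts with CORRECT."""
    return judge_reply.strip().upper().startswith("CORRECT")
```

The prompt string could then be sent to the judge with `llm.complete(prompt)`, once for the GPT-4 answer and once for the GPT-4 + RAG answer, using the `llm` object configured later in the notebook.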

In [1]:
import os, openai, requests, logging, sys
import pandas as pd
from IPython.display import Markdown, display

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)

from llama_index.core.settings import Settings

from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

from llama_index.vector_stores.azureaisearch import (
    AzureAISearchVectorStore,
    IndexManagement,
    MetadataIndexFieldType,
)

from llama_index.core.query_engine import (
    CustomQueryEngine,
    RetrieverQueryEngine
)
from llama_index.core.retrievers import (
    BaseRetriever,
    VectorIndexRetriever
)

from llama_index.core.response_synthesizers import (
    BaseSynthesizer,
    ResponseMode
)

from llama_index.core import (
    PromptTemplate,
    get_response_synthesizer
)

from llama_index.core.postprocessor import SimilarityPostprocessor


## Setup Azure OpenAI

Configure Azure OpenAI. We need the API key and endpoint, plus the names of the embedding model and the chat model deployment.

The embedding model name is "text-embedding-ada-002", and the deployment name is "gpt-4-turbo". 

In [2]:
aoai_api_key = os.getenv('AZURE_OPENAI_API_KEY')  # AZURE_OPENAI_API_KEY
aoai_endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')  # AZURE_OPENAI_ENDPOINT
aoai_api_version = "2023-05-15"
aoai_embedding_model_name = "text-embedding-ada-002"
aoai_deployment_name="gpt-4-turbo"

llm = AzureOpenAI(
    model=aoai_deployment_name,
    deployment_name=aoai_deployment_name,
    api_key=aoai_api_key,
    azure_endpoint=aoai_endpoint,
    api_version=aoai_api_version,
)

# You need to deploy your own embedding model as well as your own chat completion model
embed_model = AzureOpenAIEmbedding(
    model=aoai_embedding_model_name,
    deployment_name=aoai_embedding_model_name,
    api_key=aoai_api_key,
    azure_endpoint=aoai_endpoint,
    api_version=aoai_api_version,
)
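
`os.getenv` returns `None` when a variable is unset, so a missing key only surfaces later as an authentication error. A small check like the following could fail early instead; `require_env` is a hypothetical helper, not part of the notebook.

```python
import os


def require_env(*names: str) -> dict:
    """Return the named environment variables, raising if any is unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

For example, `require_env("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")` could be called before constructing `llm` and `embed_model`.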

## Setup Azure AI Search

We also need to set up the AI Search keys and endpoints. 

In [3]:
search_service_api_key = os.getenv('OPENAI_SEARCH_API_KEY') # AZURE_AI_SEARCH_KEY
search_service_endpoint = os.getenv('AZURE_AI_SEARCH_ENDPOINT') # AZURE_AI_SEARCH_ENDPOINT
search_service_api_version = "2023-11-01"
credential = AzureKeyCredential(search_service_api_key)

# Index name to use
index_name = "llamaindex-vector-bioc-3-19"

# Use index client to demonstrate creating an index
index_client = SearchIndexClient(
    endpoint=search_service_endpoint,   
    credential=credential,
)

# Use search client to demonstrate using an existing index
search_client = SearchClient(
    endpoint=search_service_endpoint,
    index_name=index_name,
    credential=credential,
)

## Create new Vector Index

In [4]:
metadata_fields = {
    "author": "author",
    "theme": ("topic", MetadataIndexFieldType.STRING),
    "director": "director",
}

vector_store = AzureAISearchVectorStore(
    search_or_index_client=index_client,
    filterable_metadata_field_keys=metadata_fields,
    index_name=index_name,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="chunk",
    embedding_field_key="embedding",
    embedding_dimensionality=1536,
    metadata_string_field_key="metadata",
    doc_id_field_key="doc_id",
    language_analyzer="en.lucene",
    vector_algorithm_type="exhaustiveKnn",
)

## Load the data - Bioc 3.19 RAG dataset 

In [12]:
persist_index_dir = "/Users/niteshturaga/Documents/HMS/bioc-chat-bot-testing/llama_index/bioc_3_19_only_index/"

In [21]:
bioc_3_19_dir = "/Users/niteshturaga/Documents/HMS/bioc-chat-bot-testing/data/bioc_3_19_only/"

documents = SimpleDirectoryReader(bioc_3_19_dir).load_data()
storage_context = StorageContext.from_defaults(vector_store=vector_store)

Settings.llm = llm
Settings.embed_model = embed_model

index = VectorStoreIndex.from_documents(
    documents, 
    storage_context=storage_context,
    show_progress=True
)

Parsing nodes:   0%|          | 0/3573 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/1759 [00:00<?, ?it/s]

In [22]:
## Store the index
index.storage_context.persist(persist_dir=persist_index_dir)

### Create prompt template for Bioconductor

In [23]:
bioc_context_str = """\
Act as an expert in the R programming language and the Bioconductor suite \
of packages.  Your job is to advise users on the usage of the various \
Bioconductor 3.19 packages considering the documents you have in the \
vector index store. To complete this task, you can use the data you have stored \
that contain the vignettes of all the new software packages in Bioconductor 3.19. \
You may also answer some general R, general programming, or Biomedical \
information. If you do not know the answer ask the user to refer to \
https://bioconductor.org. Add a disclaimer at the end of each response  \
saying this response is AI generated, and should be independently verified \
by the user.
"""

qa_prompt_template = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# configure retriever
retriever = VectorIndexRetriever(
    index=index, similarity_top_k=5
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="compact"
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)


class RAGStringQueryEngine(CustomQueryEngine):
    """RAG String Query Engine."""

    retriever: BaseRetriever
    response_synthesizer: BaseSynthesizer
    llm: AzureOpenAI
    qa_prompt: PromptTemplate

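    def custom_query(self, query_str: str):
        # Retrieve the top-k chunks for the query from the vector index.
        nodes = self.retriever.retrieve(query_str)

        # Join the retrieved chunk text. Note that the prompt below is filled
        # with bioc_context_str (the instructions), not with this retrieved text.
        context_str = "\n\n".join([n.node.get_content() for n in nodes])
        response = self.llm.complete(
            qa_prompt_template.format(context_str=bioc_context_str, query_str=query_str)
        )
        return str(response)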
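
In `custom_query`, the text joined from the retrieved nodes is assigned to `context_str`, but the prompt is formatted with `bioc_context_str`, so only the instructions reach the LLM. The following is a minimal sketch of one way to place both the instructions and the retrieved chunks in a single prompt. `build_rag_prompt` is a hypothetical helper, and the layout simply reuses the wording of `qa_prompt_template`.

```python
def build_rag_prompt(instructions: str, chunks: list[str], query: str) -> str:
    """Combine system instructions, retrieved chunk text, and the user query."""
    context = "\n\n".join(chunks) if chunks else "(no documents retrieved)"
    return (
        f"{instructions.strip()}\n\n"
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        "answer the query.\n"
        f"Query: {query}\n"
        "Answer: "
    )
```

Inside `custom_query`, the chunks would be `[n.node.get_content() for n in nodes]` and the instructions would be `bioc_context_str`; the resulting string could be passed to `self.llm.complete`.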


In [24]:
query_engine = RAGStringQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    llm=llm,
    qa_prompt=qa_prompt_template,
)
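
The `RetrieverQueryEngine` assembled earlier applied `SimilarityPostprocessor(similarity_cutoff=0.7)`, but `RAGStringQueryEngine` replaces it here, and `custom_query` does not run any postprocessor. The idea behind such a cutoff can be sketched in plain Python. The `(text, score)` tuples below are made-up stand-ins for retrieved nodes, and `filter_by_similarity` is a hypothetical helper rather than a LlamaIndex function.

```python
def filter_by_similarity(scored_chunks, cutoff=0.7):
    """Keep (text, score) pairs whose score is at least `cutoff`, best first."""
    kept = [(text, score) for text, score in scored_chunks
            if score is not None and score >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration only.
retrieved = [("chunk A", 0.82), ("chunk B", 0.65), ("chunk C", 0.74)]
print(filter_by_similarity(retrieved))
```

With these example scores, "chunk B" falls below the cutoff and is dropped before the context would be built.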

In [26]:
response = query_engine.query("What are the genome builds used in AlphaMissenseR package?")
display(Markdown(f"<b>{response}</b>"))

<b>To determine the genome builds used in the AlphaMissenseR package, you would typically look at the package documentation, specifically the vignette or the reference manual, which often includes details on the genome builds supported or used by the package. Since I don't have access to the actual vignettes or the content of the Bioconductor 3.19 packages, I cannot provide the specific genome builds used in the AlphaMissenseR package.

However, you can easily find this information by checking the package's vignette on the Bioconductor website. Here's how you can do it:

1. Go to the Bioconductor website: https://bioconductor.org
2. In the search bar, type "AlphaMissenseR" and hit enter.
3. Click on the AlphaMissenseR package from the search results.
4. Look for the package vignette or documentation section.
5. Within the vignette or the reference manual, search for sections that mention genome builds or annotations.

If you are using R, you can also access the vignette directly by using the following command after installing the AlphaMissenseR package:

```R
vignette("AlphaMissenseR")
```

This command will open the vignette where you can look for the information regarding genome builds.

Please note that this response is AI-generated and should be independently verified by the user.</b>

In [152]:
def ask_rag(query):
    """Query the RAG engine and display the question, the answer, and a separator."""
    response = query_engine.query(query)
    display(Markdown(f"<b>{query}</b>"))
    display(Markdown(f"<b>{response}</b>"))
    display(Markdown("---------------------"))

In [143]:
df = pd.read_csv("bioc-3-19-rag-questions.csv")

In [151]:
# Test RAG model for 1 question.
ask_rag(df["Question"][0])

<b>Which genome builds does the AlphaMissenseR package use to provide data on missense variant pathogenicity?</b>

<b>The AlphaMissenseR package in Bioconductor 3.19 is designed to provide insights into the pathogenicity of missense variants. However, without direct access to the specific vignettes or documentation of the AlphaMissenseR package from the Bioconductor 3.19 release, I cannot provide the exact genome builds it uses. Typically, packages like AlphaMissenseR would use widely recognized genome builds such as GRCh37/hg19 or GRCh38/hg38 for human data, as these are the most commonly used references for genomic studies. To get the most accurate and up-to-date information on the genome builds supported by AlphaMissenseR, I recommend checking the package's documentation on the Bioconductor website or within the package's vignettes directly.

For detailed instructions on how to access this information, you can visit the Bioconductor project website at https://bioconductor.org and search for the AlphaMissenseR package. The package documentation and vignettes will provide comprehensive details on the genome builds supported and how to use the package effectively.

This response is AI-generated and should be independently verified by the user.</b>

In [153]:
# Loop through the questions in the "Question" column and display the RAG answer to each.
# Note: ask_rag only displays its output, so nothing is written to a "Response-RAG" column here.
for ind, row in df.iterrows():
    question = row['Question']
    ask_rag(question)

<b>Which genome builds does the AlphaMissenseR package use to provide data on missense variant pathogenicity?</b>

<b>The AlphaMissenseR package in Bioconductor 3.19 is designed to provide insights into the pathogenicity of missense variants. However, without direct access to the specific documentation or vignettes of the AlphaMissenseR package at this moment, I cannot provide the exact genome builds it uses. Typically, packages that analyze genomic data and variant pathogenicity may support widely used genome builds like hg19 (GRCh37) or hg38 (GRCh38), among others, depending on their design and the datasets they incorporate.

For the most accurate and up-to-date information on the genome builds supported by the AlphaMissenseR package, I recommend checking the package's documentation on the Bioconductor website or within the package's vignettes. These resources will provide detailed information on the data sources, genome builds, and methodologies the package uses.

To access this information, you can visit the Bioconductor website (https://bioconductor.org) and search for the AlphaMissenseR package. The package's landing page should include links to its documentation, including vignettes and reference manuals, which will detail the genome builds it supports.

This response is AI-generated and should be independently verified by the user.</b>

---------------------

<b>Hi,

I would like to retrieve the AlphaMissense pathogenicity predictions for all possible single amino acid substitutions. I am using the AlphaMissenseR package to retrieve the table like that:

library(AlphaMissenseR)
am_table <- db_connect() |> tbl("aa_substitutions")

However, I am encountering the following error:

Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
ℹ Using SQL: SELECT * FROM (FROM aa_substitutions) q01 WHERE (0 = 1)
Caused by error:
! rapi_prepare: Failed to prepare query SELECT *
FROM (FROM aa_substitutions) q01
WHERE (0 = 1)
Error: Catalog Error: Table with name aa_substitutions does not exist!
Did you mean "pg_settings"?
LINE 2: FROM (FROM aa_substitutions) q01
                   ^
Run `rlang::last_trace()` to see where the error occurred.

Please, do you have any idea how to solve this problem?</b>

<b>It appears that the error you're encountering is related to the database connection and query execution within the `AlphaMissenseR` package. The error message indicates that the table `aa_substitutions` cannot be found in the database you're querying. Here are a few steps you can take to troubleshoot and potentially solve this problem:

1. **Verify Database Connection**: Ensure that the `db_connect()` function is correctly connecting to the intended database. If `db_connect()` requires parameters (like database name, user, password, etc.), make sure they are correctly specified.

2. **Check Table Existence**: The error suggests that the table `aa_substitutions` does not exist in the database. If possible, access the database using a database management tool or command line interface to check if the table actually exists. It's also a good idea to verify that you're connected to the correct database where the table is supposed to exist.

3. **Correct SQL Syntax**: The error message shows an unusual SQL syntax: `SELECT * FROM (FROM aa_substitutions)`. This is not valid SQL. However, this might be an artifact of how the error message is generated and not your actual query. If you have the ability to customize the SQL query within `tbl()`, ensure it follows correct SQL syntax.

4. **Package Documentation**: Review the `AlphaMissenseR` package documentation or vignettes to ensure that you're using the `db_connect()` and `tbl()` functions correctly. There might be specific instructions or examples on how to query the database for amino acid substitutions.

5. **Contact Package Maintainer**: If the issue persists, consider reaching out to the maintainers of the `AlphaMissenseR` package. They might provide specific insights or updates that could resolve the issue.

6. **Bioconductor Support Forum**: You can also seek help on the Bioconductor support forum. It's a helpful community where you can post your issue and potentially get solutions from other users or the developers themselves.

Remember, when working with databases and external packages, it's crucial to ensure that all components (database, table, package version) are correctly aligned and functioning.

This response is AI-generated and should be independently verified by the user.</b>

---------------------

<b>What is the primary function of the BERT package in Bioconductor and how does it contribute to batch-effect adjustment of data with missing values?</b>

<b>The BERT package in Bioconductor is not a standard package associated with the Bioconductor project. BERT typically stands for Bidirectional Encoder Representations from Transformers and is a method in natural language processing. However, in the context of Bioconductor, BERT is not a recognized package or tool.

If you are referring to batch-effect adjustment in biological data, you might be thinking of other packages that are designed for this purpose. For example, the `limma` package is widely used for analyzing gene expression data and includes functions for adjusting batch effects. Another package, `sva` (Surrogate Variable Analysis), can be used to identify and account for batch effects and other unwanted variation in high-throughput experiments.

For handling missing values in the context of batch-effect adjustment, different strategies might be employed, such as imputation of missing values before batch correction or the use of methods that can inherently handle missing data.

If you are looking for specific information on a package that deals with batch-effect adjustment and can handle missing values, I would recommend checking the Bioconductor website or searching through the Bioconductor 3.19 vignettes that you have stored in your vector index. These vignettes will provide detailed information on the usage and functionalities of the packages included in that release.

For accurate and up-to-date information on Bioconductor packages and their functionalities, please refer to the official Bioconductor website at https://bioconductor.org.

This response is AI-generated and should be independently verified by the user.</b>

---------------------

<b>What class does the SingleCellAlleleExperiment extend from other Bioconductor packages? </b>

<b>The `SingleCellAlleleExperiment` class extends the `SingleCellExperiment` class, which is a widely used class for storing and managing single-cell genomics data within the Bioconductor ecosystem. The `SingleCellExperiment` class itself is designed to provide a unified interface for handling single-cell RNA sequencing data, incorporating features such as counts, metadata, and dimensionality reduction results. By extending this class, `SingleCellAlleleExperiment` inherits its structure and functionality, allowing it to specifically cater to the needs of analyzing allele-specific expression in single-cell data while leveraging the robust infrastructure provided by the `SingleCellExperiment` framework.

Please note, this response is AI-generated and should be independently verified by the user.</b>

---------------------

<b>What kind of data does the SingleCellAlleleExperiment object contain?</b>

<b>The `SingleCellAlleleExperiment` object is designed to store single-cell allele-specific expression data. This type of data typically includes information on gene expression levels for each allele at the single-cell level, allowing researchers to study allele-specific expression patterns and their implications in various biological contexts, such as cellular heterogeneity, clonal evolution, and the effects of genetic variants.

In a `SingleCellAlleleExperiment` object, you would expect to find:

1. Counts or expression values for each allele of genes across single cells.
2. Metadata about the cells, such as cell type, experimental conditions, or batch information.
3. Metadata about the genes or genomic features, including gene identifiers and possibly genomic coordinates.
4. Information about the genetic variants (e.g., SNPs) that distinguish the alleles being measured.

This object is likely part of a Bioconductor package that deals with single-cell genomic data, and it would be tailored to handle the complexities and nuances of allele-specific expression analysis in single-cell RNA sequencing (scRNA-seq) experiments.

For detailed information on how to use the `SingleCellAlleleExperiment` object, including creating, manipulating, and analyzing the data it contains, you should refer to the specific Bioconductor package documentation or vignette that introduces this object. If this object is part of the new software packages in Bioconductor 3.19, you can look into the vignettes stored in your vector index for relevant examples and usage guidelines.

If you need further assistance or more detailed information about the `SingleCellAlleleExperiment` object and its associated package, please refer to the Bioconductor website at https://bioconductor.org.

_Disclaimer: This response is AI-generated and should be independently verified by the user._</b>

---------------------

<b>What is the purpose of the HybridExpress package and how does it contribute to the analysis of hybrid gene expression data?</b>

<b>The `HybridExpress` package in Bioconductor is designed to facilitate the analysis of hybrid gene expression data. This type of data arises when researchers use microarrays or other gene expression platforms to measure the expression levels of genes across different species or across different conditions where a hybridization of genetic material may occur.

The purpose of the `HybridExpress` package is to provide tools that can handle the complexities associated with analyzing hybrid gene expression data. This includes functions for normalization, background correction, and summarization of expression levels when dealing with cross-species hybridization experiments. The package may also include methods for detecting and correcting for cross-hybridization effects, which can occur when probes on a microarray bind to non-target sequences, leading to mixed signals that can confound the analysis.

By using `HybridExpress`, researchers can improve the accuracy of their gene expression measurements in hybrid experiments, leading to more reliable conclusions about gene expression differences and similarities across species or experimental conditions. This is particularly important in comparative genomics and transcriptomics studies, where understanding the conservation and divergence of gene expression is key.

To get the most accurate and up-to-date information on the `HybridExpress` package, including its specific functionalities and how to use it, please refer to the Bioconductor website and the package's vignettes.

Disclaimer: This response is AI-generated and should be independently verified by the user at https://bioconductor.org.</b>

---------------------

<b>What datasets are provided in the MouseAgingData package and how can these datasets be accessed?</b>

<b>The `MouseAgingData` package in Bioconductor provides datasets that are specifically curated for the analysis of mouse aging studies. To access the datasets provided by the `MouseAgingData` package, you would first need to install the package and then load it into your R session. Here's how you can do it:

1. Install the `MouseAgingData` package using the `BiocManager` package:

```R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MouseAgingData")
```

2. Load the package:

```R
library(MouseAgingData)
```

Once the package is loaded, you can access the datasets included in it. The datasets are usually stored as data objects or accessible through specific functions provided by the package. You can use the `data()` function to list the datasets available in the package or to load a specific dataset. For example:

```R
data(package = "MouseAgingData")  # List all available datasets in the package
```

To load a specific dataset (assuming 'dataset_name' is the name of the dataset you want to access):

```R
data("dataset_name")
```

After loading the dataset, you can access it directly by its name in your R session.

To get more detailed information about the datasets and how to use them, you can check the package vignettes, which provide comprehensive examples and explanations:

```R
browseVignettes("MouseAgingData")
```

This will open a web page with links to the HTML versions of the package vignettes. You can read through these documents to understand the structure of the datasets and the recommended analysis workflows.

Please note that the actual content and structure of the datasets can vary, and the package might have been updated since my knowledge cutoff date. For the most accurate and up-to-date information, please refer to the package documentation on the Bioconductor website.

This response is AI-generated and should be independently verified by the user.</b>

---------------------

<b>Using data from the MouseAgingData package, how can I reproduce Figure 2C from the following publication?

Ximerakis et al. (2023) Heterochronic parabiosis reprograms the mouse brain transcriptome by shifting aging signatures in multiple cell types. Nature Aging, 3, 327–345.</b>

<b>To reproduce Figure 2C from the publication by Ximerakis et al. (2023) using data from the MouseAgingData package in Bioconductor, you would typically follow these steps:

1. Install the MouseAgingData package if you haven't already. You can do this using the `BiocManager::install()` function:

```R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MouseAgingData")
```

2. Load the MouseAgingData package:

```R
library(MouseAgingData)
```

3. Access the specific dataset used in the publication. The MouseAgingData package should contain documentation on the datasets included. You can view the documentation by typing:

```R
?MouseAgingData
```

4. Once you have identified the correct dataset, load it into your R session.

5. The publication should provide details on the analysis performed to generate Figure 2C. This might include normalization procedures, statistical tests, and plotting commands. You will need to follow the methods described in the paper as closely as possible.

6. Use the appropriate Bioconductor or R packages to perform the analysis. This may include packages for data manipulation (like `dplyr` or `tidyverse`), statistical analysis (like `limma` or `edgeR`), and plotting (like `ggplot2` or `ComplexHeatmap`).

7. Once you have performed the analysis, you can plot the figure using the plotting package of your choice. The `ggplot2` package is commonly used for creating publication-quality figures.

Here is an example of how you might start the plotting process, assuming you have a data frame `df` with the necessary data:

```R
library(ggplot2)

ggplot(df, aes(x = variableX, y = variableY, color = group)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Figure 2C: Description",
       x = "X-axis label",
       y = "Y-axis label") +
  scale_color_manual(values = c("Group1" = "color1", "Group2" = "color2"))
```

Please note that without specific details on the analysis and the exact figure, this is a general guide. For detailed instructions, you should refer to the methods section of the publication and the documentation of the MouseAgingData package.

If you encounter any issues or need further assistance with the specific methods used in the paper, I recommend reaching out to the authors or consulting the Bioconductor support forum.

Disclaimer: This response is AI-generated and should be independently verified by the user. For the most accurate and up-to-date information, please refer to the official Bioconductor website at https://bioconductor.org.</b>

---------------------
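
The loop above displays each answer but does not store it in the data frame. A minimal sketch of collecting the answers so they can be written to a "Response-RAG" column follows; `collect_responses` and `fake_answer` are hypothetical names, and `fake_answer` stands in for `query_engine.query` so the example runs on its own.

```python
def collect_responses(questions, answer_fn):
    """Call `answer_fn` on each question and return the answers as strings."""
    return [str(answer_fn(question)) for question in questions]


# Stand-in for query_engine.query, used here so the sketch runs on its own.
def fake_answer(question):
    return f"answer to: {question}"


responses = collect_responses(["Q1", "Q2"], fake_answer)
```

In the notebook this could be used as `df["Response-RAG"] = collect_responses(df["Question"], query_engine.query)`, after which the data frame can be saved with `df.to_csv` to an output path of your choice.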