# Test Azure GPT-4 based model with RAG store of Top 500 Bioconductor packages

### Get relevant keys from Azure

In [74]:
import os, openai, requests
import pandas as pd
    
## Declare keys
api_key = os.getenv("OPENAI_API_KEY")
search_key = os.getenv("OPENAI_SEARCH_KEY")
endpoint = os.getenv("OPENAI_ENDPOINT")
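`os.getenv` returns `None` for an unset variable, so a missing key would only surface later as an authentication or connection error. A minimal sketch of an early check, assuming the three variable names used above; `require_env` is a hypothetical helper and not part of the original notebook:

```python
import os

def require_env(names, environ=os.environ):
    """Return {name: value} for the given variables, failing early if any is unset or empty."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}

# Example usage (uncomment once the variables are exported in your shell):
# keys = require_env(["OPENAI_API_KEY", "OPENAI_SEARCH_KEY", "OPENAI_ENDPOINT"])
```

Passing `environ` explicitly makes the helper easy to try with a plain dictionary before touching the real environment.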

Declare the deployment name, endpoint and the name of the search index.

In [75]:

openai.api_type = "azure"
# Azure OpenAI on your own data is only supported by the 2023-08-01-preview API version
openai.api_version = "2023-08-01-preview"

# Azure OpenAI setup
openai.api_base = endpoint
openai.api_key = api_key 
deployment_id = "gpt-4-test" 

# Azure AI Search setup
search_endpoint = "https://hmsaisearch.search.windows.net"  # Add your Azure AI Search endpoint here
search_index_name = "bioc-top-500"  # Add your Azure AI Search index name here
    
# https://hms-it-openai.openai.azure.com/openai/deployments/gpt-4-test/extensions/chat/completions?api-version=2024-02-15-preview

Create a `client` object with `AzureOpenAI`, pointing it at the deployment's extensions endpoint and passing the API key.

In [76]:
client = openai.AzureOpenAI(
    base_url=f"{endpoint}/openai/deployments/{deployment_id}/extensions",
    api_key=api_key,
    api_version="2023-08-01-preview",
)

Note: This step sets up the RAG store, i.e., the "bring your own data" (BYOD) search index, which has to be registered with the deployment in use, `gpt-4-test`.

In [77]:
def setup_byod(deployment_id: str) -> None:
    """Sets up the OpenAI Python SDK to use your own data for the chat endpoint.
 
    :param deployment_id: The deployment ID for the model to use with your own data.

    To remove this configuration, simply set openai.requestssession to None.
    """

    class BringYourOwnDataAdapter(requests.adapters.HTTPAdapter):

        def send(self, request, **kwargs):
            # Redirect every matching request to the extensions chat completions endpoint
            request.url = f"{openai.api_base}/openai/deployments/{deployment_id}/extensions/chat/completions?api-version={openai.api_version}"
            return super().send(request, **kwargs)

    session = requests.Session()

    # Mount a custom adapter which will use the extensions endpoint for any call using the given `deployment_id`
    session.mount(
        prefix=f"{openai.api_base}/openai/deployments/{deployment_id}",
        adapter=BringYourOwnDataAdapter()
    )

    openai.requestssession = session

setup_byod(deployment_id)
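To make the rewrite performed by the adapter easier to inspect, the URL it builds can be reproduced as a plain function. This is only a sketch: `extensions_chat_url` is a hypothetical name, and the `rstrip("/")` is an added guard against a trailing slash in the endpoint that the adapter above does not include.

```python
def extensions_chat_url(api_base: str, deployment_id: str, api_version: str) -> str:
    """Build the extensions chat-completions URL for a given deployment."""
    return (
        f"{api_base.rstrip('/')}/openai/deployments/{deployment_id}"
        f"/extensions/chat/completions?api-version={api_version}"
    )

# Reproduces the URL quoted in the comment of the setup cell above
print(
    extensions_chat_url(
        "https://hms-it-openai.openai.azure.com",
        "gpt-4-test",
        "2024-02-15-preview",
    )
)
```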

### Generic function to ask our BYOD GPT-4 model a question

In [78]:
def ask_rag(question, verbose=False):
    """Send `question` to the RAG-backed deployment and return the answer text."""

    completion = client.chat.completions.create(
        messages=[{"role": "user", "content": question}],
        model=deployment_id,
        extra_body={
            "dataSources": [
                {
                    "type": "AzureCognitiveSearch",
                    "parameters": {
                        "endpoint": search_endpoint,
                        "key": search_key,
                        "indexName": search_index_name,
                        "roleInformation": "Act as an expert in the R programming language and the Bioconductor suite of packages.\n\nYour job is to advise users on the usage of the various Bioconductor packages considering the documents you have in the data store.\n\nTo complete this task, you can use the data you have stored that contain the vignettes of all the packages in Bioconductor and all the reference files of every function in every package of Bioconductor. You may also answer some general R, general programming, or Biomedical information.\n\nIf you do not know the answer ask the user to refer to https://bioconductor.org. \n\nAdd a disclaimer at the end of each response saying this model works only on the top 500 most used Bioconductor packages."
                    }
                }
            ]
        }
    )
    if verbose:
        print(f"{completion.choices[0].message.role}: {completion.choices[0].message.content}")

    return completion.choices[0].message.content
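The nested `dataSources` payload passed through `extra_body` is the part most likely to change between indexes. One way to keep it readable is to build it in a small helper. This is a sketch only: `build_search_data_source` is a hypothetical name, the field names are copied from the call above, and the values shown are placeholders rather than real settings.

```python
def build_search_data_source(endpoint, key, index_name, role_information):
    """Return one Azure AI Search entry for the `dataSources` list."""
    return {
        "type": "AzureCognitiveSearch",
        "parameters": {
            "endpoint": endpoint,
            "key": key,
            "indexName": index_name,
            "roleInformation": role_information,
        },
    }

# Placeholder values for illustration only; substitute your own settings.
extra_body = {
    "dataSources": [
        build_search_data_source(
            endpoint="https://example.search.windows.net",
            key="<search-key>",
            index_name="bioc-top-500",
            role_information="Act as an expert in R and Bioconductor.",
        )
    ]
}
```

The resulting `extra_body` dictionary has the same shape as the one passed to `client.chat.completions.create` in `ask_rag`.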

In [79]:
# `context` is in the model_extra for Azure
# print(f"\nContext: {completion.choices[0].message.model_extra['context']['messages'][0]['content']}")
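The commented-out line above indexes straight into `model_extra`, which would raise if a response carried no context. A more defensive sketch follows, assuming the response layout implied by that comment (`model_extra["context"]["messages"][0]["content"]`); `get_context` is a hypothetical helper, and the fake completion below merely stands in for a real API response.

```python
from types import SimpleNamespace

def get_context(completion):
    """Return the raw context content of the first choice, or None if it is absent."""
    message = completion.choices[0].message
    extra = getattr(message, "model_extra", None) or {}
    messages = extra.get("context", {}).get("messages", [])
    return messages[0].get("content") if messages else None

# Stand-in object that mimics the assumed response shape
fake_message = SimpleNamespace(
    role="assistant",
    content="answer text",
    model_extra={"context": {"messages": [{"role": "tool", "content": "retrieved text"}]}},
)
fake_completion = SimpleNamespace(choices=[SimpleNamespace(message=fake_message)])

print(get_context(fake_completion))
```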

### Questions for SummarizedExperiment

In [80]:
## summarized experiment 
question = "How many classes are there in the Summarized Experiment package? Just give me a number."

ask_rag(question, verbose=True)

assistant: The SummarizedExperiment package contains two classes: SummarizedExperiment and RangedSummarizedExperiment [doc1].


'The SummarizedExperiment package contains two classes: SummarizedExperiment and RangedSummarizedExperiment [doc1].'

### Questions for DESeq2

In [81]:
## DESeq2

question = "DESeq2 performs normalization by estimating size factors for each sample. If your experiment has 5 samples, how many size factors will DESeq2 estimate?"

ask_rag(question, verbose=True)

assistant: DESeq2 estimates one size factor for each sample in your experiment. Therefore, if your experiment has 5 samples, DESeq2 will estimate 5 size factors [doc1]. 

Please note that this model works only on the top 500 most used Bioconductor packages.


'DESeq2 estimates one size factor for each sample in your experiment. Therefore, if your experiment has 5 samples, DESeq2 will estimate 5 size factors [doc1]. \n\nPlease note that this model works only on the top 500 most used Bioconductor packages.'

### Questions for limma

In [82]:
## Limma
question = "You use limma to analyze RNA-seq data from a case-control study with 30 control samples and 30 case samples. After fitting the linear model, how many coefficients will be estimated by limma for the gene expression data (assuming no additional covariates are included in the model)?"

ask_rag(question, verbose=True)

assistant: In the context of a case-control study with 30 control samples and 30 case samples, when you fit a linear model using the limma package in R, you will estimate two coefficients. 

The first coefficient is the intercept, which represents the baseline level of gene expression in the reference group (usually the control group). The second coefficient is the slope, which represents the difference in gene expression between the case group and the control group. 

This is under the assumption that no additional covariates are included in the model. If additional covariates were included, each would have its own coefficient.

Please note that this model works only on the top 500 most used Bioconductor packages. For more specific information, please refer to the limma documentation or the Bioconductor website [doc3][doc4].


'In the context of a case-control study with 30 control samples and 30 case samples, when you fit a linear model using the limma package in R, you will estimate two coefficients. \n\nThe first coefficient is the intercept, which represents the baseline level of gene expression in the reference group (usually the control group). The second coefficient is the slope, which represents the difference in gene expression between the case group and the control group. \n\nThis is under the assumption that no additional covariates are included in the model. If additional covariates were included, each would have its own coefficient.\n\nPlease note that this model works only on the top 500 most used Bioconductor packages. For more specific information, please refer to the limma documentation or the Bioconductor website [doc3][doc4].'

### Questions for SingleCellExperiment

This question doesn't get a good answer.

In [83]:
question = "how many slots does the SingleCellExperiment object have?"

ask_rag(question, verbose=True)

assistant: The SingleCellExperiment object is a complex object with multiple slots. However, the exact number of slots is not specified in the retrieved documents. The SingleCellExperiment object is designed to store single-cell experiment data, including assay data, feature data, and cell metadata. It is a part of the Bioconductor project and is used extensively in R for single-cell data analysis.

For detailed information about the structure and slots of the SingleCellExperiment object, you can refer to the official Bioconductor documentation at https://bioconductor.org.

Please note that this model works only on the top 500 most used Bioconductor packages.


'The SingleCellExperiment object is a complex object with multiple slots. However, the exact number of slots is not specified in the retrieved documents. The SingleCellExperiment object is designed to store single-cell experiment data, including assay data, feature data, and cell metadata. It is a part of the Bioconductor project and is used extensively in R for single-cell data analysis.\n\nFor detailed information about the structure and slots of the SingleCellExperiment object, you can refer to the official Bioconductor documentation at https://bioconductor.org.\n\nPlease note that this model works only on the top 500 most used Bioconductor packages.'

## Read in Bioc QA top 10 

Read the questions and their ground-truth responses in with pandas.

In [84]:
top_10_qs = pd.read_csv('bioc_qa_top10.csv')

In [85]:
top_10_qs

| Unnamed: 0 | AID | QID | Question | Response |
|---|---|---|---|---|
| 0 | answer1 | question1 | I am a bit confused about the concepts of the ... | The thing to understand is that terms like FDR... |
| 1 | answer2 | question2 | I am working on RNA-Seq data. I'm using DESeq2... | Just to be clear, there's an important differe... |
| 2 | answer3 | question3 | I am new in this kind of analysis and I have a... | There is no good way to do a DE analysis of RN... |
| 3 | answer4 | question4 | I am testing salmon and kallisto for RNA-seq. ... | To answer your questions:1) scaledTPM is TPM's... |
| 4 | answer5 | question5 | In all RNA-seq analysis applications they talk... | The most complete explanation of what the disp... |
| 5 | answer6 | question6 | I know findOverlaps() from GenomicRanges packa... | From the discussion below, an efficient starti... |
| 6 | answer7 | question7 | I have just downloaded CNV level 3 files from ... | I wrote two helper functions, explained belowg... |
| 7 | answer8 | question8 | How can I filter out the genes with low read c... | If you want to filter out genes with low expre... |
| 8 | answer9 | question9 | I am analysing my RNA-Seq data with DESeq2. At... | You can use the ensembldb package to do the ma... |
| 9 | answer10 | question10 | How do I merge a list of GRanges? What I want ... | Merge is a pretty vague term. My understanding... |


Ask the RAG model (backed by the top 500 Bioconductor packages) each question in the `top_10_qs` DataFrame and append the responses to `azure_rag_responses`.

In [86]:
azure_rag_responses = []
for question in top_10_qs["Question"]:
    response = ask_rag(question)
    azure_rag_responses.append(response)
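The loop above stops at the first failed request, which discards the responses already collected. A minimal sketch of a more forgiving variant follows; `collect_responses` is a hypothetical helper, and recording `None` for a failed question is an assumption chosen so the list stays aligned with the rows of `top_10_qs`.

```python
def collect_responses(questions, ask_fn):
    """Call ask_fn on each question, recording None when a call raises."""
    responses = []
    for i, question in enumerate(questions):
        try:
            responses.append(ask_fn(question))
        except Exception as exc:  # broad on purpose: any failure is logged, not fatal
            print(f"Question {i} failed: {exc}")
            responses.append(None)
    return responses

# Example usage with the notebook's own objects:
# azure_rag_responses = collect_responses(top_10_qs["Question"], ask_rag)
```

Because the result always has one entry per question, it can still be inserted as a DataFrame column even when some calls fail.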


## We now run the same model without RAG, i.e., just a generic GPT-4 model

In [87]:
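# Note: this rebinds `client`, so `ask_rag` no longer targets the extensions endpoint after this cell runs
client = openai.AzureOpenAI(
    azure_endpoint=os.getenv("OPENAI_ENDPOINT"),
    api_key=os.getenv("OPENAI_API_KEY"),
    api_version="2024-02-15-preview",
)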


def ask_gpt4(question, verbose=False):
    """Send `question` to the plain GPT-4 deployment (no RAG) and return the answer text."""
    message_text = [{"role":"system","content":"You are an AI assistant that helps people find information."},
                    {"role": "user", "content": question}]

    completion = client.chat.completions.create(
        model="gpt-4-test", 
        messages = message_text,
        temperature=0.0,
        max_tokens=800,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )

    if verbose:
        print(f"{completion.choices[0].message.role}: {completion.choices[0].message.content}")

    return completion.choices[0].message.content


In [88]:
azure_gpt4_temp0_responses = []
for question in top_10_qs["Question"]:
    response = ask_gpt4(question)
    azure_gpt4_temp0_responses.append(response)

In [89]:
top_10_qs.insert(4, "Response_Azure_Bioc_RAG", azure_rag_responses, True)
top_10_qs.insert(5, "Response_Azure_GPT4_Temp0", azure_gpt4_temp0_responses, True)
top_10_qs.to_csv("top_10_qs_with_azure_RAG.csv", index=False)


### Evaluate first question by eye

In [114]:
def print_method(df, question_number):
    # question_number is the zero-based row index; the printed label is one-based
    print(f"*****Question {question_number + 1}:****** ", df["Question"][question_number])
    print("\n******Response Ground Truth:***** \n", df["Response"][question_number])
    print("\n*****Response from Azure RAG:**** \n ", df["Response_Azure_Bioc_RAG"][question_number])
    print("\n*****Response from Azure GPT4 Temp0:*** \n", df["Response_Azure_GPT4_Temp0"][question_number])

print_method(top_10_qs, 0)

*****Question 1:******  I am a bit confused about the concepts of the 3 things: FDR, FDR adjusted p-value and q-value, which I initially thought I was clear about. Are FDR adjusted p-value the same as q-value? (my understanding is that FDR adjusted p-value = original p-value * number of genes/rank of the gene, is that right?) When people say xxx genes are differentially expressed with an FDR cutoff of 0.05, does that mean xxx genes have an FDR adjusted p-value smaller than 0.05?

******Response Ground Truth:***** 
 The thing to understand is that terms like FDR and q-value were defined in specific ways by their original inventors but are used in more generic ways by later researchers who adapt, modify or use the ideas.The term "false discovery rate (FDR)" was created by Benjamini and Hochberg in their 1995 paper. They gave a particular definition of what they meant by FDR.  Their procedure accepted or rejected hypotheses, but did not produce adjusted p-values.Benjamini and Yekutieli pr

## Evaluate answers

In [102]:

top_10_qs


| Unnamed: 0 | AID | QID | Question | Response | Response_Azure_Bioc_RAG | Response_Azure_GPT4_Temp0 |
|---|---|---|---|---|---|---|
| 0 | answer1 | question1 | I am a bit confused about the concepts of the ... | The thing to understand is that terms like FDR... | The False Discovery Rate (FDR) is a statistica... | FDR, FDR adjusted p-value, and q-value are all... |
| 1 | answer2 | question2 | I am working on RNA-Seq data. I'm using DESeq2... | Just to be clear, there's an important differe... | It seems like you're dealing with a common iss... | In DESeq2, adding the batch effect in the desi... |
| 2 | answer3 | question3 | I am new in this kind of analysis and I have a... | There is no good way to do a DE analysis of RN... | Yes, you're correct that many differential exp... | You're correct that DESeq2, EdgeR, and limma a... |
| 3 | answer4 | question4 | I am testing salmon and kallisto for RNA-seq. ... | To answer your questions:1) scaledTPM is TPM's... | The `tximport` function in the Bioconductor pa... | 1. ScaledTPM and lengthScaledTPM are methods u... |
| 4 | answer5 | question5 | In all RNA-seq analysis applications they talk... | The most complete explanation of what the disp... | The dispersion parameter in RNA-seq analysis, ... | In RNA-seq analysis, dispersion refers to the ... |
| 5 | answer6 | question6 | I know findOverlaps() from GenomicRanges packa... | From the discussion below, an efficient starti... | Based on your question, it seems you want to p... | It seems like you're trying to perform an elem... |
| 6 | answer7 | question7 | I have just downloaded CNV level 3 files from ... | I wrote two helper functions, explained belowg... | To map the coordinates of genes to gene symbol... | Mapping genomic coordinates to gene symbols is... |
| 7 | answer8 | question8 | How can I filter out the genes with low read c... | If you want to filter out genes with low expre... | Yes, you are on the right track. However, the ... | Yes, you are on the right track. DESeq2 is a p... |
| 8 | answer9 | question9 | I am analysing my RNA-Seq data with DESeq2. At... | You can use the ensembldb package to do the ma... | The issue you're facing is not uncommon when d... | It seems like you are doing everything correct... |
| 9 | answer10 | question10 | How do I merge a list of GRanges? What I want ... | Merge is a pretty vague term. My understanding... | You can merge a list of GRanges objects using ... | To merge a list of GRanges objects, you can us... |
