<a href="https://colab.research.google.com/github/claudio1975/Medium-blog/blob/master/Gemini_in_Healthcare/Gemini_in_Healthcare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Competition

The competition's goal is to explore and demonstrate innovative use cases for the Gemini 1.5 model's large context window, which can process up to 2 million tokens at once. Participants are encouraged to create public Kaggle Notebooks and YouTube videos showing how applications can leverage this capability. The competition aims to highlight the potential of long context windows, which allow more direct methods such as in-context retrieval and extensive many-shot prompting, and it invites participants to "stress test" the model on creative and challenging scenarios.

# Gemini 1.5

On 8th of August 2024, Google published a report (https://arxiv.org/pdf/2403.05530) introducing Gemini 1.5, a family of multimodal models with exceptional compute efficiency and the ability to recall and reason over intricate details from extensive contexts spanning millions of tokens. This includes the processing of long documents and hours of video and audio content. Gemini 1.5 comprises two models: Gemini 1.5 Pro, an improved version with enhanced capabilities, and Gemini 1.5 Flash, a lightweight variant prioritising efficiency with minimal quality compromise. A key advancement is the models' ability to achieve near-perfect recall in long-context retrieval tasks across modalities, exceeding the capabilities of existing models like Claude 3.0 and GPT-4 Turbo.


# Use Case: A Comprehensive Review of Generative AI in Healthcare

This review paper examines the applications of generative AI models, specifically transformers and diffusion models, in the healthcare sector. The authors argue that these models have revolutionized the analysis of various data forms, including medical imaging, protein structure prediction, clinical documentation, and drug design. They present a novel classification of generative AI models in healthcare into diffusion models and transformer-based models, and they further categorize the uses of these models by specific healthcare task.

Transformer-based models have been used for tasks such as:

- Protein structure prediction.
- Clinical documentation and information extraction.
- Diagnostic assistance.
- Medical imaging and radiology interpretation.
- Clinical decision support.
- Medical coding and billing.
- Drug design and molecular representation.

Diffusion models have been used for tasks such as:

- Image reconstruction.
- Image-to-image translation.
- Image generation.
- Image classification.

The review argues that generative AI has the potential to become a highly trustworthy tool in healthcare, potentially replacing human doctors in certain tasks.
https://arxiv.org/abs/2310.00795

# Installation of Required Packages

Install the Python packages used in the rest of the notebook. Each cell notes what the package is used for.

In [None]:
!pip install -U sentence-transformers &>/dev/null # for sentence embeddings and similarity computation.

In [None]:
!pip install -U transformers &>/dev/null # for AutoTokenizer.

In [None]:
!pip install PyPDF2 &>/dev/null # for reading PDF files.

In [None]:
!pip install rouge-score &>/dev/null # for evaluating the similarity of texts.

In [None]:
!pip install bert-score &>/dev/null # for evaluating the similarity of texts.

In [None]:
!pip install faiss-cpu &>/dev/null # for efficient similarity searches.

In [None]:
!pip install ipywidgets &>/dev/null

# Importing Libraries

Imports various libraries needed for specific tasks

In [None]:
# Get the API key from here: https://ai.google.dev/tutorials/setup
# Create a new secret called "GEMINI_API_KEY" via Add-ons -> Secrets in the top menu, and attach it to this notebook.
from kaggle_secrets import UserSecretsClient # for accessing stored secrets in Kaggle (like API keys).
import os # File-system operations
import PyPDF2 # reading PDFs
from IPython.display import display, Markdown # displaying Markdown
from rouge_score import rouge_scorer # Scoring functions (ROUGE and BERT) for text evaluation
from bert_score import score as bert_score # Scoring functions (ROUGE and BERT) for text evaluation
import faiss # for handling and searching text data
from sklearn.feature_extraction.text import TfidfVectorizer # for handling and searching text data
from sentence_transformers import SentenceTransformer, util # for obtaining sentence embeddings
from transformers import AutoTokenizer # For counting tokens
from sklearn.preprocessing import normalize
import numpy as np


# Suppress FutureWarning messages for the rest of the notebook to reduce clutter in the output.
# (A `with warnings.catch_warnings():` block would only silence warnings raised inside the block.)
import warnings
warnings.simplefilter("ignore", FutureWarning)

# Set up access to the GEMINI API
user_secrets = UserSecretsClient()
apiKey = user_secrets.get_secret("GEMINI_API_KEY") # Retrieve the API key from the Kaggle secrets.
import google.generativeai as genai
genai.configure(api_key = apiKey) # Configure the genai library with the API key to access GEMINI's models.
llm = genai.GenerativeModel(
    model_name="gemini-1.5-flash-latest") # Initialize a generative model for use in later queries.


In [None]:
# Walks the input directory (/kaggle/input) to find files, specifically PDF paths.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



/kaggle/input/gemini-long-context/submission_instructions.txt
/kaggle/input/ai-in-healthcare/Figure_14.pdf
/kaggle/input/ai-in-healthcare/Figure_10.pdf
/kaggle/input/ai-in-healthcare/Figure_1.pdf
/kaggle/input/ai-in-healthcare/Figure_8.pdf
/kaggle/input/ai-in-healthcare/Figure_9.pdf
/kaggle/input/ai-in-healthcare/Figure_11.pdf
/kaggle/input/ai-in-healthcare/Figure_15.pdf
/kaggle/input/ai-in-healthcare/Figure_2.pdf
/kaggle/input/ai-in-healthcare/Figure_5.pdf
/kaggle/input/ai-in-healthcare/Figure_13.pdf
/kaggle/input/ai-in-healthcare/2310.00795v1.pdf


### Reading document

In [None]:
file_paths = ['/kaggle/input/ai-in-healthcare/2310.00795v1.pdf']

# Extracts text from each PDF using PyPDF2 and stores text in papers_text.
papers_text = []  # List to store the extracted text from papers
# Extract text from PDFs
for file_path in file_paths:
    try:
        with open(file_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            text = ""
            for page in pdf_reader.pages:
                text += (page.extract_text() or "") + "\n"  # treat pages with no extractable text as empty
            papers_text.append(text)
        print(f"Successfully extracted text from: {file_path}")
    except Exception as e:
        # Log the failure and continue with the remaining files
        print(f"Failed to extract text from {file_path}: {e}")

Successfully extracted text from: /kaggle/input/ai-in-healthcare/2310.00795v1.pdf


In [None]:
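# Load a BERT tokenizer to estimate the document length.
# This is only a rough proxy: Gemini uses its own tokenizer, so its token count will differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")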

total_tokens = 0

for paper in papers_text:
    # Tokenize the paper's text and count the tokens
    tokens = tokenizer.tokenize(paper)
    total_tokens += len(tokens)

print(f"Total number of tokens: {total_tokens}")

Token indices sequence length is longer than the specified maximum sequence length for this model (35853 > 512). Running this sequence through the model will result in indexing errors


Total number of tokens: 35853


### Questions

In [None]:
# Define a set of questions to ask regarding the paper's content.
question_1 = "Which types of encoder-decoder transformer architectures are discussed in the review ?"
question_2 = "Could you say who developed MERGIS, and what is it ?"
question_3 = "Could you say who developed AdaDiff, and what is it  ?"
question_4 = "What process represents the equation 1 used in the review, could you explain it ?"
question_5 = "Could you say who developed ProteinBERT, and what is it ?"


### Scoring Functions

ROUGE scores: Measures overlap between a generated and reference text.

In [None]:
def calculate_rouge_scores(reference_text, generated_text):
    # Initialize the ROUGE scorer with the desired metrics
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # Calculate ROUGE scores
    scores = scorer.score(reference_text, generated_text)
    # Display or return the results
    print("ROUGE Scores:")
    for key, value in scores.items():
        print(f"{key}: precision={value.precision:.4f}, recall={value.recall:.4f}, fmeasure={value.fmeasure:.4f}")
    return scores
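
To make the metric concrete, here is a minimal sketch of ROUGE-1 computed by hand. It assumes plain lowercase whitespace tokens and no stemming, so its numbers will not match the `rouge_score` package exactly; the function name `rouge1` and the toy sentences are illustrative only.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall and F1 (whitespace tokens, no stemming)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": f1}

# Toy example: 4 of the 6 candidate words also appear in the 6-word reference
toy_scores = rouge1("the cat sat on the mat", "the cat lay on a mat")
print(toy_scores)
```

Precision divides the overlap by the length of the generated text and recall divides it by the length of the reference, which is why a short but correct answer can score high precision and low recall.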


BERT score: Uses a pre-trained BERT model to evaluate the semantic similarity between texts.

In [None]:
def calculate_bert_score(reference_text, generated_text):
    precision, recall, f1 = bert_score([generated_text], [reference_text], model_type="bert-base-multilingual-cased",lang="en")
    print(f"BERTScore: Precision={precision.mean().item():.4f}, Recall={recall.mean().item():.4f}, F1={f1.mean().item():.4f}")
    return precision.mean().item(), recall.mean().item(), f1.mean().item()
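
The core of BERTScore is a greedy matching over token embeddings. The sketch below shows that matching step only, using made-up 2-dimensional vectors in place of real BERT embeddings; the helper names are hypothetical, and the real library additionally handles tokenization and optional weighting and rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_score(ref_vecs, cand_vecs):
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical token embeddings, invented for illustration
toy_ref = [[1.0, 0.0], [0.0, 1.0]]   # reference text with two tokens
toy_cand = [[1.0, 0.0]]              # candidate text with one token
p, r, f = greedy_match_score(toy_ref, toy_cand)
print(p, r, f)
```

Here the single candidate token matches a reference token perfectly (precision 1.0), but one reference token has no counterpart, so recall drops to 0.5.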

### Gemini context and task

In [None]:
# This context is used to frame the interaction with the generative model, providing a perspective or role to shape responses.
context_info = """ You are a Machine Learning research scientist with more than
15 years of experience. Your role is to retrieve and analyse information from a
paper about AI in healthcare topic. Please provide only information from the paper."""

# Model 1: knowledge base Gemini 1.5 Flash

Gemini 1.5 Flash is a smaller and more efficient version of Gemini 1.5 Pro, designed for faster performance and lower latency. Here is a summary of its features:

- Model Architecture: Unlike the sparse mixture-of-experts (MoE) architecture of Gemini 1.5 Pro, Flash is a dense transformer decoder model. This design choice, along with techniques like parallel computation of attention and feedforward components, contributes to its efficiency and lower latency.
- Distillation from Gemini 1.5 Pro: Gemini 1.5 Flash is trained through online distillation from the larger Gemini 1.5 Pro model. This means it learns from the knowledge and capabilities of the more powerful model, achieving comparable performance while maintaining a smaller size.
- Long-Context Handling: Like Gemini 1.5 Pro, Flash can handle a context window of over 2 million tokens. This allows it to process and recall information from very long inputs, including extensive documents, hours of video, and days of audio.
- Multimodality: Gemini 1.5 Flash inherits the multimodal capabilities of its larger counterpart, enabling it to process and understand various data types, including text, images, audio, and video. This makes it suitable for tasks requiring comprehension and reasoning across different modalities.
- Serving Efficiency and Latency: One of the most prominent features of Gemini 1.5 Flash is its fast inference time. It consistently demonstrates faster output generation than other leading large language models, including GPT-3.5 Turbo, GPT-4 Turbo, and Claude 3, across languages like English, Japanese, Chinese, and French.
- Performance: While smaller than Gemini 1.5 Pro, Flash maintains high performance across various tasks. It achieves near-perfect recall in long-context retrieval tasks across modalities, excels in long-document and long-video question answering, and matches or surpasses Gemini 1.0 Ultra's performance on several benchmarks.
- In-Context Learning: The long-context capabilities of Gemini 1.5 Flash also enhance its in-context learning ability. This is exemplified by its performance in low-resource language translation, where it shows continuous improvement as the number of in-context examples increases.
- Safety and Security: Gemini 1.5 Flash demonstrates significant improvements in safety and security compared to previous models. It shows a substantial decrease in policy violation rates and improved robustness against jailbreak attempts.

Overall, Gemini 1.5 Flash offers a compelling combination of efficiency, long-context understanding, and multimodal capabilities. It excels in tasks that benefit from its fast inference time and ability to process extensive inputs, making it a valuable tool for a wide range of applications.



### Step 1: Implementing the Query System

These functions manage conversations with the generative model

In [None]:

# start_chat_session: Starts a chat by combining a context and the text of the document
def start_chat_session(context_info, papers_text):
    if papers_text:
        # Combine context with the text content of the first document
        combined_content = context_info + "\n\n" + papers_text[0]

        # Use the combined text in the model history
        chat_session = llm.start_chat(
            history=[
                {
                    'role': 'user',
                    'parts': [combined_content]
                }
            ]
        )
        return chat_session
    return None

# chatAI: Sends prompts to the chat session and returns the model's response
def chatAI(chat_session, context_info, prompt):
    if chat_session:
        # Combine the prompt with context to provide more depth
        full_prompt = f"{context_info}\n\n{prompt}"
        response = chat_session.send_message(full_prompt)
        return response.text
    return "No chat session initialised."

# The chat session is initiated using the extracted paper text and context.
chat_session = start_chat_session(context_info, papers_text)

### Step 2: Asking Questions

In [None]:
# Each question is sent to the model, and its responses are captured and stored.
response_1_base=chatAI(chat_session, context_info,question_1)
response_2_base=chatAI(chat_session, context_info,question_2)
response_3_base=chatAI(chat_session, context_info,question_3)
response_4_base=chatAI(chat_session, context_info,question_4)
response_5_base=chatAI(chat_session, context_info,question_5)


# What is a Retrieval-Augmented Generation ?

Retrieval-Augmented Generation (RAG) enhances the capabilities of Large Language Models (LLMs) by incorporating external knowledge into their processing. While LLMs have shown great promise, they can struggle with tasks requiring specific or up-to-date information, sometimes generating inaccurate outputs, known as "hallucinations".
RAG aims to solve this by retrieving relevant information from external databases to supplement the LLM's internal knowledge. This process makes the LLM's outputs more accurate, credible, and grounded in factual data https://arxiv.org/abs/2312.10997.
Here's a breakdown of how RAG typically works:

1. User Interaction: A user poses a question or provides a prompt to the RAG system.
2. Retrieval Phase:
   - Query Processing: The user's query is transformed into a format suitable for retrieval. This may involve techniques like query expansion, rewriting, or transformation to improve its clarity and relevance for searching external data sources.
   - Search and Retrieval: The processed query is used to search an external knowledge base, which could consist of text documents, databases, knowledge graphs, or even previously generated LLM content. The system retrieves the most relevant chunks of information based on semantic similarity calculations.
3. Generation Phase:
   - Context Integration: The retrieved information is combined with the user's original query to create a comprehensive prompt for the LLM.
   - LLM Processing: The LLM processes the enriched prompt, drawing on both its internal knowledge and the retrieved external information to generate a response.
   - Output Generation: The LLM generates a response to the user's query, informed by both its own knowledge and the relevant external information.
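
The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the notebook's implementation: retrieval uses simple word overlap (Jaccard similarity) instead of embeddings, the knowledge base is three invented sentences, and `echo_llm` is a stand-in for a real model call.

```python
def tokenize(text):
    """Lowercase the text, strip basic punctuation and return the set of words."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())

def retrieve(query, knowledge_base, top_k=2):
    """Retrieval phase: rank chunks by Jaccard word overlap with the query."""
    q = tokenize(query)
    scored = []
    for chunk in knowledge_base:
        c = tokenize(chunk)
        score = len(q & c) / len(q | c) if q | c else 0.0
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def build_prompt(query, retrieved_chunks):
    """Context integration: put the retrieved chunks in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return f"Relevant Information:\n{context}\n\nQuestion:\n{query}"

def rag_answer(query, knowledge_base, generate):
    """Generation phase: pass the enriched prompt to any text-generation callable."""
    return generate(build_prompt(query, retrieve(query, knowledge_base)))

def echo_llm(prompt):
    # Stand-in for an LLM call (assumption): it simply returns the prompt it receives
    return prompt

toy_kb = [
    "FAISS is a library for similarity search over vectors.",
    "TF-IDF weights a word by how rare it is across documents.",
    "ROUGE measures n-gram overlap between two texts.",
]
toy_output = rag_answer("What does ROUGE measure?", toy_kb, echo_llm)
print(toy_output)
```

Because `generate` is just a callable, the same structure works with any model client that takes a prompt string and returns text.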



# Model 2: knowledge base RAG

Naive RAG, the earliest methodology in Retrieval-Augmented Generation, gained popularity shortly after the widespread use of ChatGPT. This approach follows a traditional, chain-like process: indexing, retrieval, and generation.

Indexing:

This stage involves cleaning and extracting data from various formats like PDF, HTML, and Markdown, converting it into plain text.
To handle the context limitations of language models, the text is divided into smaller chunks.
An embedding model is used to encode these chunks into vector representations which are stored in a vector database. This step is crucial for efficient similarity searches during retrieval.
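
As an illustration of the chunking step, the sketch below splits text into fixed-size word windows with a small overlap so that sentences cut at a boundary still appear whole in one chunk. The function name and the default sizes are assumptions chosen for the example; the notebook itself splits the paper on newlines instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Toy text of 12 placeholder words: w0 w1 ... w11
toy_text = " ".join(f"w{i}" for i in range(12))
toy_chunks = chunk_words(toy_text, chunk_size=5, overlap=2)
for chunk in toy_chunks:
    print(chunk)
```

Larger chunks keep more context together but make each retrieved unit less specific; the overlap trades some redundancy for robustness at chunk boundaries.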

Retrieval:

When a user asks a question, the same embedding model used during indexing transforms the query into a vector.
The RAG system calculates similarity scores between the query vector and the vectors of the indexed chunks.
The system retrieves the top K chunks with the highest similarity to the query and uses them as expanded context within the prompt.

Generation:

The chosen documents and the user's query are combined into a prompt, which is then fed to a large language model for response generation.
The model may use its internal knowledge or only information from the documents to answer, depending on the task.
For ongoing conversations, the dialogue history is integrated into the prompt, allowing for multi-turn interactions.

Limitations of Naive RAG:

Retrieval Challenges: The retrieval process often struggles with accurately identifying and retrieving relevant information, potentially selecting misaligned chunks or missing crucial information.
Generation Difficulties: The model might generate content unsupported by the retrieved context (hallucination), and outputs could suffer from irrelevance, bias, or toxicity.

Augmentation Hurdles: Integrating retrieved information effectively can be difficult, leading to disjointed outputs. Redundancy from similar information in multiple sources can result in repetitive responses. Further complexities arise in determining the importance of passages and ensuring stylistic consistency.

Over-reliance on Retrieved Information: There is a risk that the generation models may depend too heavily on augmented information, echoing retrieved content without adding insightful synthesis.



### Step 1: Text Processing with TF-IDF

Use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the extracted text into a numerical format that represents the significance of each word in the document. This will help in identifying important sections of the document to respond to queries.
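
To show what the weights mean, here is a small hand-rolled TF-IDF sketch. It assumes whitespace tokenization and uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which corresponds to the default settings of scikit-learn's `TfidfVectorizer`; the three toy documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the vocabulary and one L2-normalized TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs
    df = Counter(word for doc in tokenized for word in set(doc))
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vocab, vectors

toy_docs = [
    "diffusion models generate images",
    "transformer models generate text",
    "diffusion models denoise images",
]
vocab, vectors = tfidf_vectors(toy_docs)
weights = dict(zip(vocab, vectors[1]))  # weights for the second document
print(weights)
```

In the second document, "transformer" occurs in only one document and gets the largest weight, while "models" occurs in all three and gets the smallest non-zero weight.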

In [None]:
# Split paper text into sections for better vectorization (e.g., paragraphs or sentences)
sections = papers_text[0].split('\n')  # Split by newline or any other delimiter as per requirement

# Compute TF-IDF matrix
np.random.seed(0)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sections)

### Step 2: Creating FAISS Index

Use FAISS (Facebook AI Similarity Search) for efficient similarity search. Index the TF-IDF vectors of the document segments to retrieve relevant sections for each query.
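
A flat L2 index performs an exhaustive search: the query is compared with every stored vector and the closest ones are returned. The pure-Python sketch below mimics that behaviour on toy 2-dimensional vectors (the names are hypothetical) and checks the identity that, for unit-length vectors, squared L2 distance equals 2 - 2 * cosine similarity.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def flat_l2_search(query, vectors, top_k=2):
    """Exhaustive scan: return the top_k (squared_distance, index) pairs, closest first."""
    distances = [(squared_l2(query, v), i) for i, v in enumerate(vectors)]
    distances.sort()
    return distances[:top_k]

# Toy unit-length vectors standing in for TF-IDF rows
toy_vectors = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
toy_query = l2_normalize([2.0, 0.1])
toy_hits = flat_l2_search(toy_query, toy_vectors)

# For unit vectors: squared L2 distance = 2 - 2 * cosine similarity
best_index = toy_hits[0][1]
cosine_to_best = sum(a * b for a, b in zip(toy_query, toy_vectors[best_index]))
print(toy_hits, cosine_to_best)
```

Since TF-IDF rows are L2-normalized by default, ranking by L2 distance gives the same order as ranking by cosine similarity. One caveat worth keeping in mind: an all-zero row, such as an empty line, sits at squared distance 1 from any unit-length query, so it can rank ahead of sections whose cosine similarity is below 0.5.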

In [None]:
# Create FAISS index
dimension = tfidf_matrix.shape[1]  # Dimension of TF-IDF vectors
index = faiss.IndexFlatL2(dimension)  # Using L2 (euclidean) distance

# Convert the sparse matrix to a dense float32 array (FAISS operates on float32 vectors)
dense_vectors = tfidf_matrix.toarray().astype('float32')

# Add vectors to the index
index.add(dense_vectors)

### Step 3: Implementing the Query System

For each user query, use FAISS to retrieve the most relevant sections of the document and pass these sections as context, along with the query, to the Gemini model.

In [None]:
def retrieve_relevant_section(query, top_k=2):
    # Compute query vector (dense float32, matching the indexed vectors)
    query_vector = vectorizer.transform([query]).toarray().astype('float32')

    # Search FAISS index
    _, indices = index.search(query_vector, top_k)

    # Retrieve relevant text sections
    relevant_sections = [sections[i] for i in indices[0]]

    return " ".join(relevant_sections)

# Function to generate a response using retrieved information and context
def chatAI_RAG(query):
    # Retrieve relevant sections
    relevant_text = retrieve_relevant_section(query)

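    # Combine context with relevant sections and user query
    full_prompt = f"{context_info}\n\nRelevant Information:\n{relevant_text}\n\nQuestion:\n{query}"
    # Note: this reuses the chat session from Model 1, whose history already holds the full
    # paper and the earlier answers, so the retrieved sections are not the model's only context.
    response = chat_session.send_message(full_prompt)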

    return response.text

### Step 4: Asking Questions

Saves responses for each question using the chatAI_RAG function

In [None]:
response_1_rag = chatAI_RAG(question_1)
response_2_rag = chatAI_RAG(question_2)
response_3_rag = chatAI_RAG(question_3)
response_4_rag = chatAI_RAG(question_4)
response_5_rag = chatAI_RAG(question_5)


# Model 3: knowledge base Advanced RAG

Advanced RAG builds upon the foundation of Naive RAG, focusing on refining retrieval quality to address the limitations of the earlier approach. It employs strategies both before and after the retrieval stage to improve accuracy and relevance.

Pre-Retrieval Process:

This stage focuses on optimising the indexing structure and the initial user query to ensure that the retrieval process starts with the most relevant and refined inputs.

Indexing optimisation aims to enhance the quality of the indexed content:

- Optimising index structures: techniques like hierarchical indexing or knowledge graph indexing can facilitate faster and more accurate retrieval.
- Adding metadata: enriching chunks with metadata allows for targeted filtering during retrieval.
- Alignment optimisation: ensuring that the indexing and retrieval processes are aligned can improve the accuracy of retrieved information.
- Mixed retrieval: combining different retrieval methods can leverage their complementary strengths.

Query optimisation focuses on refining the user's question to be more suitable for retrieval:

- Query rewriting: LLMs can be used to rephrase the original query into a format that is more effective for retrieval.
- Query transformation: this involves techniques like using prompt engineering to generate a new query based on the original one.
- Query expansion: expanding the query into multiple queries or generating sub-queries can provide additional context and improve retrieval relevance.
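
One simple way to use several query variants is to run retrieval once per variant and merge the ranked lists, keeping each section's best score. The sketch below shows only that merging step; the section ids and scores are invented, and in practice the variants would come from an LLM rewrite step that is not shown here.

```python
def merge_expanded_results(results_per_query, top_k=3):
    """Merge ranked (doc_id, score) lists, keeping each document's best score."""
    best = {}
    for results in results_per_query:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical retrieval output (higher score = more relevant)
toy_results = [
    [("sec_3", 0.62), ("sec_7", 0.40)],  # original query
    [("sec_7", 0.71), ("sec_1", 0.35)],  # rewritten query
    [("sec_9", 0.55), ("sec_3", 0.50)],  # sub-query
]
toy_merged = merge_expanded_results(toy_results)
print(toy_merged)
```

Taking the maximum score is only one possible merging rule; summing scores or combining ranks are common alternatives with different trade-offs.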

Post-Retrieval Process:

After retrieving potentially relevant context, Advanced RAG implements strategies to effectively integrate this information with the query and prepare it for the language model.

Reranking chunks: reordering the retrieved information to prioritise the most relevant content at the beginning of the prompt enhances the language model's focus. This is implemented in frameworks like LlamaIndex, LangChain, and Haystack.

Context compression: to prevent information overload and ensure the language model focuses on essential details, post-retrieval efforts concentrate on:
selecting the most important information from the retrieved content;
emphasising critical sections within the retrieved documents;
shortening the context to be processed by the language model.
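
A minimal sketch of context compression, assuming a very crude relevance signal: sentences are scored by how many query words they contain, and the best ones are kept until a word budget is reached. The function name, the budget and the toy passages are illustrative assumptions; production systems typically use a learned model for this step.

```python
def compress_context(query, passages, max_words=30):
    """Keep the sentences that share the most words with the query, within a word budget."""
    query_terms = set(query.lower().split())
    sentences = []
    for passage in passages:
        for sentence in passage.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                overlap = len(query_terms & set(sentence.lower().split()))
                sentences.append((overlap, sentence))
    # Most relevant sentences first (stable sort keeps document order among ties)
    sentences.sort(key=lambda pair: pair[0], reverse=True)
    kept, used = [], 0
    for overlap, sentence in sentences:
        n_words = len(sentence.split())
        if overlap > 0 and used + n_words <= max_words:
            kept.append(sentence)
            used += n_words
    return ". ".join(kept) + ("." if kept else "")

toy_passages = [
    "ProteinBERT is a deep language model for protein sequences. The figure shows a timeline.",
    "AdaDiff targets MRI reconstruction. ProteinBERT combines local and global representations.",
]
toy_compressed = compress_context("what is ProteinBERT", toy_passages, max_words=12)
print(toy_compressed)
```

With a budget of 12 words only the highest-scoring sentence fits; raising `max_words` lets the second ProteinBERT sentence through while the unrelated sentences are still dropped.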

By addressing the limitations of Naive RAG through these pre- and post-retrieval strategies, Advanced RAG achieves significant improvements in retrieval quality, paving the way for more accurate and relevant responses from the language model.



### Step 1: Text Processing with semantic embeddings

Load a pre-trained sentence transformer model, which is used to generate sentence embeddings.
'all-MiniLM-L6-v2' is a specific variant of the transformer model that provides a balance between performance and computational efficiency.

In [None]:
# Load a sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')


Define a function that re-ranks documents based on a query with semantic similarity measures from a transformer model

In [None]:
def re_rank_documents(query, documents):
    # Encode the query and documents using the transformer model
    query_embedding = model.encode(query, convert_to_tensor=True)
    doc_embeddings = model.encode(documents, convert_to_tensor=True)

    # Compute cosine similarities
    cosine_scores = util.pytorch_cos_sim(query_embedding, doc_embeddings)[0]

    # Sort documents based on descending cosine similarity scores
    scores_and_docs = sorted(zip(cosine_scores.tolist(), documents), key=lambda x: x[0], reverse=True)

    # Return the documents in ranked order (the scores themselves are dropped)
    top_ranked_docs = [doc for _, doc in scores_and_docs]
    return top_ranked_docs


### Step 2: Text Processing with TF-IDF

In [None]:
# Pre-process and index documents for initial retrieval
# Split the paper text into sections using newline characters as delimiters.
sections = papers_text[0].split('\n')
# Create a TF-IDF vectorizer and use it to transform the sections into a TF-IDF matrix representation.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sections)
# Convert the sparse TF-IDF matrix to a dense float32 array (FAISS operates on float32 vectors).
dense_vectors = tfidf_matrix.toarray().astype('float32')


### Step 3: Creating FAISS Index

Create a FAISS index for fast nearest-neighbor search.

In [None]:
# Create FAISS index
dimension = tfidf_matrix.shape[1] # The number of features (or terms) in the TF-IDF matrix.
index = faiss.IndexFlatL2(dimension) # A FAISS index using the L2 (Euclidean) distance metric, often used for dense vector similarity.
index.add(dense_vectors) # Add the dense TF-IDF vectors to the index.


### Step 4: Implementing the Query System with re-ranking

Define a function to retrieve the most relevant sections from the text based on a query.
Define a function to generate a conversational response based on a query.

In [None]:
def retrieve_relevant_section(query, top_k=10):
    # Compute query vector (dense float32, matching the indexed vectors)
    query_vector = vectorizer.transform([query]).toarray().astype('float32')

    # Search FAISS index for initial retrieval
    _, indices = index.search(query_vector, top_k)
    initial_relevant_sections = [sections[i] for i in indices[0]]

    # Re-rank the initially retrieved sections
    top_ranked_docs = re_rank_documents(query, initial_relevant_sections)

    # Select a subset for the final response, if desired
    return " ".join(top_ranked_docs[:2])  # Take top 2 after re-ranking

def chatAI_ARAG(query):
    relevant_text = retrieve_relevant_section(query)
    full_prompt = f"{context_info}\n\nRelevant Information:\n{relevant_text}\n\nQuestion:\n{query}"
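    # Note: this reuses the chat session from Model 1, whose history already holds the full
    # paper and the earlier answers, so the retrieved sections are not the model's only context.
    response = chat_session.send_message(full_prompt)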
    return response.text


### Step 5: Asking Questions

Call the chatAI_ARAG function for multiple queries and store the responses in separate variables.

In [None]:
response_1_arag = chatAI_ARAG(question_1)
response_2_arag = chatAI_ARAG(question_2)
response_3_arag = chatAI_ARAG(question_3)
response_4_arag = chatAI_ARAG(question_4)
response_5_arag = chatAI_ARAG(question_5)



### Right answers

List of the actual answers retrieved from the text

In [None]:
right_answer_1 = """
Encoder-only models:
These models are advantageous for text classification tasks such as sentiment analysis. An example of an LLM utilizing an encoder-only model is BERT.

Decoder-only or autoregressive models:
These models are suitable for text generation tasks, akin to the predictive text functionality in a smartphone chat application. For instance, as you input text, the AI predicts the subsequent word or phrase. An example of this model is GPT-3.

Encoder-decoder models:
These models facilitate generative AI tasks such as language translation and summarization. Notable LLMs employing this methodology include Facebook’s BART and Google’s T5.
"""

In [None]:
right_answer_2 = """
MERGIS, was proposed by (Nimalsiri et al. 2023). It uses image segmentation and a modern
transformer-based encoder-decoder model to enhance the accuracy of automated report generation.
"""

In [None]:
right_answer_3 = """
(Özbey et al. 2023) introduced an innovative technique named Adaptive Diffusion
Priors (AdaDiff) for the reconstruction of MRI. This approach involves a series of diffusion processes
that enhance the authenticity of the generated images. AdaDiff dynamically adjusts its priors during
the inference stage to align more closely with the distribution of the test data.
"""

In [None]:
right_answer_4 ="""
The forward diffusion process is depicted as a Markov Chain, characterized by the inclusion of
Gaussian noise in a series of stages, culminating in generating noisy samples. The uncorrupted
or original data distribution is represented as 𝑞(𝑥0). With a data sample 𝑥0 drawn from this
distribution, 𝑞(𝑥0), a forward noising operation, denoted as 𝑝, is employed.
This operation introduces Gaussian noise iteratively at various time points, represented by 𝑡,
resulting in a series of latent states 𝑥1 through 𝑥𝑇. The process can be mathematically defined as
follows:
𝑞(𝑥𝑡 | 𝑥𝑡−1)=𝒩(𝑥𝑡:√1− 𝛽𝑡.𝑥𝑡−1,𝛽𝑡.Ι ),∀𝑡∈{1,…,𝑇}
𝑇 denotes the number of diffusion steps, while 𝛽1,…, 𝛽𝑇, each within the interval of [0, 1), signify
the variance schedule spread throughout the diffusion steps.
The identity matrix is symbolized by 𝐈, and 𝒩(𝑥; 𝜇,𝜎), which characterizes the normal distribution
possessing a mean of 𝜇 and a covariance of 𝜎.
"""

In [None]:
right_answer_5 ="""
ProteinBERT was developed by (Brandes et al. 2022). It's a specialized deep language model for protein
sequences that amalgamates local and global representations for comprehensive end-to-end processing.
"""

### Responses and evaluation

LLM evaluation metrics are used to assess the quality of text generated by Large Language Models (LLMs). They provide a quantifiable measure of how well the LLM is performing on a specific task. While general-purpose metrics exist, it's crucial to select metrics tailored to the specific use case of the LLM application.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a statistical metric commonly used for evaluating text summaries generated by NLP models. It calculates the overlap of n-grams (sequences of words) between the generated summary and a reference summary. However, statistical methods like ROUGE are considered less effective for evaluating complex LLM outputs because they don't fully capture the semantic nuances and reasoning involved.
BERTScore, a model-based metric, offers a more semantically aware approach. It leverages pre-trained language models like BERT to compute the similarity between the contextual embeddings of words in the generated text and the reference text. However, BERTScore can be susceptible to biases present in the pre-trained models and may struggle with long, complex texts (https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation).


In [None]:
display(Markdown(response_1_base))

The review mentions three types of encoder-decoder transformer architectures:

1.  Encoder-only models (e.g., BERT) suitable for text classification tasks.
2.  Decoder-only or autoregressive models (e.g., GPT-3) suitable for text generation tasks.
3.  Encoder-decoder models (e.g., Facebook's BART and Google's T5) suitable for tasks like language translation and summarization.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_1, response_1_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_1, response_1_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6724, recall=0.3861, fmeasure=0.4906
rouge2: precision=0.4561, recall=0.2600, fmeasure=0.3312
rougeL: precision=0.4828, recall=0.2772, fmeasure=0.3522


BERTScore: Precision=0.7877, Recall=0.7473, F1=0.7670


In [None]:
display(Markdown(response_1_rag))

The review discusses three types of encoder-decoder transformer architectures: encoder-only models (e.g., BERT), decoder-only or autoregressive models (e.g., GPT-3), and encoder-decoder models (e.g., Facebook's BART and Google's T5).


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_1, response_1_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_1, response_1_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6579, recall=0.2475, fmeasure=0.3597
rouge2: precision=0.4324, recall=0.1600, fmeasure=0.2336
rougeL: precision=0.5789, recall=0.2178, fmeasure=0.3165
BERTScore: Precision=0.7573, Recall=0.6871, F1=0.7205


In [None]:
display(Markdown(response_1_arag))

The review mentions three types of encoder-decoder transformer architectures:  encoder-only models (like BERT), decoder-only or autoregressive models (like GPT-3), and encoder-decoder models (like BART and T5).


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_1, response_1_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_1, response_1_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6774, recall=0.2079, fmeasure=0.3182
rouge2: precision=0.3667, recall=0.1100, fmeasure=0.1692
rougeL: precision=0.5806, recall=0.1782, fmeasure=0.2727
BERTScore: Precision=0.7824, Recall=0.6746, F1=0.7245


In [None]:
display(Markdown(response_2_base))

MERGIS was developed by Nimalsiri et al.  It is a transformer-based encoder-decoder model that uses image segmentation to enhance the accuracy of automated report generation.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_2, response_2_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_2, response_2_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.8889, recall=0.8571, fmeasure=0.8727
rouge2: precision=0.6538, recall=0.6296, fmeasure=0.6415
rougeL: precision=0.7778, recall=0.7500, fmeasure=0.7636
BERTScore: Precision=0.9364, Recall=0.9032, F1=0.9195


In [None]:
display(Markdown(response_2_rag))

Nimalsiri et al. developed MERGIS.  It's a transformer-based encoder-decoder model that uses image segmentation to improve the accuracy of automated report generation.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_2, response_2_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_2, response_2_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.8400, recall=0.7500, fmeasure=0.7925
rouge2: precision=0.5417, recall=0.4815, fmeasure=0.5098
rougeL: precision=0.6800, recall=0.6071, fmeasure=0.6415
BERTScore: Precision=0.9198, Recall=0.8871, F1=0.9032


In [None]:
display(Markdown(response_2_arag))

Nimalsiri et al. developed MERGIS.  It is a transformer-based encoder-decoder model that uses image segmentation to improve the accuracy of automated report generation.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_2, response_2_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_2, response_2_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.8400, recall=0.7500, fmeasure=0.7925
rouge2: precision=0.5417, recall=0.4815, fmeasure=0.5098
rougeL: precision=0.6800, recall=0.6071, fmeasure=0.6415
BERTScore: Precision=0.9269, Recall=0.8885, F1=0.9073


In [None]:
display(Markdown(response_3_base))

AdaDiff was developed by Özbey et al.  It is an Adaptive Diffusion Priors method for MRI reconstruction that dynamically adjusts its priors during inference to better match the test data distribution, leading to improved reconstruction quality and speed.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_3, response_3_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_3, response_3_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6053, recall=0.4259, fmeasure=0.5000
rouge2: precision=0.2703, recall=0.1887, fmeasure=0.2222
rougeL: precision=0.5263, recall=0.3704, fmeasure=0.4348
BERTScore: Precision=0.8303, Recall=0.7937, F1=0.8116


In [None]:
display(Markdown(response_3_rag))

Özbey et al. developed AdaDiff.  It's an Adaptive Diffusion Priors method for MRI reconstruction that dynamically adjusts its priors during inference to better match the test data distribution, resulting in superior reconstruction quality and speed.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_3, response_3_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_3, response_3_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6389, recall=0.4259, fmeasure=0.5111
rouge2: precision=0.2857, recall=0.1887, fmeasure=0.2273
rougeL: precision=0.5556, recall=0.3704, fmeasure=0.4444
BERTScore: Precision=0.8316, Recall=0.7973, F1=0.8141


In [None]:
display(Markdown(response_3_arag))

Özbey et al. developed AdaDiff. It is an Adaptive Diffusion Priors method for MRI reconstruction that dynamically adjusts its priors during inference to better match the test data distribution, leading to superior reconstruction quality and speed.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_3, response_3_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_3, response_3_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6389, recall=0.4259, fmeasure=0.5111
rouge2: precision=0.2857, recall=0.1887, fmeasure=0.2273
rougeL: precision=0.5556, recall=0.3704, fmeasure=0.4444
BERTScore: Precision=0.8348, Recall=0.7984, F1=0.8162


In [None]:
display(Markdown(response_4_base))

Equation (1),  𝑞(𝑥𝑡 | 𝑥𝑡−1)=𝒩(𝑥𝑡:√1− 𝛽𝑡.𝑥𝑡−1,𝛽𝑡.Ι ),∀𝑡∈{1,…,𝑇}, represents the forward diffusion process in Denoising Diffusion Probabilistic Models (DDPMs).  This process is a Markov chain where Gaussian noise is iteratively added to the input data (x<sub>t-1</sub>) at each time step (t).  The amount of noise added is controlled by the variance schedule (β<sub>t</sub>), resulting in a series of increasingly noisy latent states (x<sub>1</sub> through x<sub>T</sub>).  The equation shows that each noisy state x<sub>t</sub> is drawn from a normal distribution (𝒩) with a mean dependent on the previous state (scaled by √1− 𝛽𝑡) and a variance (β<sub>t</sub>I), where I is the identity matrix.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_4, response_4_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_4, response_4_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.5517, recall=0.4848, fmeasure=0.5161
rouge2: precision=0.2174, recall=0.1908, fmeasure=0.2033
rougeL: precision=0.2845, recall=0.2500, fmeasure=0.2661
BERTScore: Precision=0.7437, Recall=0.7865, F1=0.7645


In [None]:
display(Markdown(response_4_rag))

Equation (1), 𝑞(𝑥𝑡 | 𝑥𝑡−1)=𝒩(𝑥𝑡:√1− 𝛽𝑡.𝑥𝑡−1,𝛽𝑡.Ι ),∀𝑡∈{1,…,𝑇}, represents the forward diffusion process in Denoising Diffusion Probabilistic Models (DDPMs).  This is a Markov chain where Gaussian noise is iteratively added to the input data at each time step. The amount of noise added is controlled by the variance schedule (βt), resulting in a series of increasingly noisy latent states.  The equation shows that each noisy state x<sub>t</sub> is drawn from a normal distribution (𝒩) with a mean dependent on the previous state (scaled by √1− 𝛽𝑡) and a variance (β<sub>t</sub>I), where I is the identity matrix.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_4, response_4_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_4, response_4_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6224, recall=0.4621, fmeasure=0.5304
rouge2: precision=0.2577, recall=0.1908, fmeasure=0.2193
rougeL: precision=0.3163, recall=0.2348, fmeasure=0.2696
BERTScore: Precision=0.7739, Recall=0.7870, F1=0.7804


In [None]:
display(Markdown(response_4_arag))

Equation (1),  𝑞(𝑥𝑡 | 𝑥𝑡−1)=𝒩(𝑥𝑡:√1− 𝛽𝑡.𝑥𝑡−1,𝛽𝑡.Ι ),∀𝑡∈{1,…,𝑇}, describes the forward diffusion process within Denoising Diffusion Probabilistic Models (DDPMs).  It's a Markov chain where Gaussian noise is iteratively added to the input data (x<sub>t-1</sub>) at each timestep (t). The noise amount is controlled by the variance schedule (β<sub>t</sub>), creating a sequence of increasingly noisy latent states (x<sub>1</sub> to x<sub>T</sub>). The equation states that each noisy state x<sub>t</sub> is sampled from a normal distribution (𝒩) with a mean dependent on the previous state (scaled by √1− 𝛽𝑡) and a variance (β<sub>t</sub>I), where I is the identity matrix.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_4, response_4_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_4, response_4_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.4865, recall=0.4091, fmeasure=0.4444
rouge2: precision=0.1727, recall=0.1450, fmeasure=0.1577
rougeL: precision=0.2523, recall=0.2121, fmeasure=0.2305
BERTScore: Precision=0.7366, Recall=0.7771, F1=0.7563


In [None]:
display(Markdown(response_5_base))

ProteinBERT was developed by Brandes et al.  It is a specialized deep language model for protein sequences that combines local and global representations for comprehensive end-to-end processing.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_5, response_5_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_5, response_5_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.9310, recall=0.9000, fmeasure=0.9153
rouge2: precision=0.8214, recall=0.7931, fmeasure=0.8070
rougeL: precision=0.9310, recall=0.9000, fmeasure=0.9153
BERTScore: Precision=0.9656, Recall=0.9331, F1=0.9491


In [None]:
display(Markdown(response_5_rag))

Brandes et al. developed ProteinBERT. It's a specialized deep language model for protein sequences that uses both local and global representations for complete end-to-end processing.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_5, response_5_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_5, response_5_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.8929, recall=0.8333, fmeasure=0.8621
rouge2: precision=0.7037, recall=0.6552, fmeasure=0.6786
rougeL: precision=0.8214, recall=0.7667, fmeasure=0.7931
BERTScore: Precision=0.9408, Recall=0.9086, F1=0.9245


In [None]:
display(Markdown(response_5_arag))

Brandes et al. developed ProteinBERT.  It is a specialized deep language model for protein sequences that combines local and global representations for comprehensive end-to-end processing.


In [None]:
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_5, response_5_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_5, response_5_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.9259, recall=0.8333, fmeasure=0.8772
rouge2: precision=0.7308, recall=0.6552, fmeasure=0.6909
rougeL: precision=0.8519, recall=0.7667, fmeasure=0.8070
BERTScore: Precision=0.9472, Recall=0.9127, F1=0.9296


### Conclusions

In this study, I stress-tested Gemini 1.5 Flash by letting it chat directly with the full paper, and compared it with a naive RAG and an advanced RAG pipeline on a knowledge Q&A task over a review of generative AI in healthcare.
The first considerations concern the evaluation metrics.
I did not use Ragas, a library built specifically for evaluating RAG pipelines, because only two of the three approaches compared here have a RAG architecture. Moreover, Ragas is a recent library and I have not found support for it in Google's documentation.
I used two metrics: ROUGE and BERTScore.
Looking at the F1 scores, ROUGE is clearly not well suited to this evaluation: the answers are correct overall, yet their ROUGE scores are often low, whereas BERTScore reflects their quality better.
All three approaches give very similar results. The advanced RAG may need better tuning and perhaps a different embedding model, while Gemini's long context proves to be a valuable alternative to Retrieval-Augmented Generation.
Although the dataset uses only about one third of the suggested number of tokens, I think it is still useful because it shows that Gemini 1.5 Flash, on a text of almost 50 pages, is competitive with RAG tools.
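
To make the F1 comparison easier to check, the per-question F-measure values printed in the cells above can be averaged per approach. The sketch below copies those values by hand; the dictionary names and the `average_by_approach` helper are hypothetical, and with only five questions the averages are indicative rather than statistically meaningful.

```python
from statistics import mean

# F-measure values copied from the outputs above, in question order 1 to 5.
bertscore_f1 = {
    "base": [0.7670, 0.9195, 0.8116, 0.7645, 0.9491],
    "naive_rag": [0.7205, 0.9032, 0.8141, 0.7804, 0.9245],
    "advanced_rag": [0.7245, 0.9073, 0.8162, 0.7563, 0.9296],
}
rouge_l_f1 = {
    "base": [0.3522, 0.7636, 0.4348, 0.2661, 0.9153],
    "naive_rag": [0.3165, 0.6415, 0.4444, 0.2696, 0.7931],
    "advanced_rag": [0.2727, 0.6415, 0.4444, 0.2305, 0.8070],
}


def average_by_approach(scores):
    """Return the mean score per approach, rounded to four decimals."""
    return {name: round(mean(values), 4) for name, values in scores.items()}


bert_avg = average_by_approach(bertscore_f1)
rouge_avg = average_by_approach(rouge_l_f1)

for name in bertscore_f1:
    print(f"{name:>12}: BERTScore F1={bert_avg[name]:.4f}, ROUGE-L F={rouge_avg[name]:.4f}")
```

On these numbers the average BERTScore F1 values of the three approaches differ by less than 0.02, which is consistent with the observation that they give very similar results.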