<a href="https://colab.research.google.com/github/caglarmert/DI725/blob/main/DI725_Lab_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DI 725: Transformers and Attention-Based Deep Networks

## A Tutorial for Retrieval Augmented Generation
<a id='intro_tutorial'></a>
The purpose of this notebook is to introduce Retrieval Augmented Generation, RAG for short. RAG is a form of in-context learning. In-context learning is a way of transferring information from a context to a large language model that has never seen that information before!

RAG is a tool that can enhance the question-answering abilities of already trained models by incorporating new information. To achieve this, we will import the fresh data into a vector database, which will essentially act as an external memory for the model. The retrieval framework, which in our case is llama-index, will then use this database to retrieve the text most relevant to the given task and build a specific prompt, so that the retrieved text is passed along with the prompt to the language model.
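The final step described above, passing the retrieved text to the language model together with the question, can be sketched in plain Python. The function name `build_rag_prompt` and the template wording are illustrative assumptions; llama-index ships its own prompt templates, which this sketch does not reproduce.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved text chunks and a question into one prompt string."""
    # Number each chunk so an answer can be traced back to its source text.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Answer the question using only the context above. "
        "If the context does not contain the answer, say so.\n"
        f"Question: {question}\n"
        "Answer: "
    )


prompt = build_rag_prompt(
    "What is QLoRA?",
    ["QLoRA is an efficient finetuning approach that reduces memory usage."],
)
print(prompt)
```

The language model never changes in this scheme; only the prompt grows to include the retrieved text.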

We will work through two examples. First, we will start with Quantized Low Rank Adapters (QLoRA) from the work [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314). This is an interesting approach, one that is particularly useful for researchers with limited resources. It combines quantization and finetuning to reach performance similar to that of fully finetuned baseline models, which require a tremendous amount of resources compared to QLoRA. The second example is the [Class Distance Weighted Cross Entropy](https://arxiv.org/abs/2202.05167) (CDW-CE) loss function, a loss function that has been shown to be useful for ordinal classes, which are common in the medical domain. Neither of these papers is expected to be present in the training data of the LLM we are going to use in this lab! We will check this by asking questions about these terms, and will observe the model either failing to answer or hallucinating outright!

We will start with an LLM from Anthropic; see the details in the [Computational Requirements section](#comp_req). This model does not know anything about QLoRA, and we will verify this by asking it about the term. The LLM will respond that it does not recognize the term.

After downloading the paper that describes QLoRA work, we will then use Retrieval Augmented Generation (RAG) to provide the context to the LLM. This is how in-context learning is applied to an already existing and powerful enough model (a foundational model).

In our second example, we will use a work conducted at METU. This time, when asked about the term "Class Distance Weighted Cross Entropy", the LLM will provide an answer even though it does not know anything about it: it will hallucinate about the term. In my brief experiments, this LLM hallucinated about CDW-CE, coming up with "Contrastive Divergence with Data Augmentation and Consistency Encoding" or other nonsensical content. Furthermore, it explained this hallucinated content as if it were a real phenomenon. This is totally unacceptable!

We will again use RAG (and thus in-context learning) to enhance the capability of our model and provide a context for it to work on! We will upload the work conducted in our Institute. Now that the model has a context, its answers will follow the original work instead of being hallucinated!

Up to this point, we have demonstrated the capability of RAG with a single context document. How would the models fare with multiple documents, like an encyclopedia with many different topics to choose from? This is a pretty valid and important use-case scenario for companies. They would have in-house documentation, or very specific documents, and would want their LLM (their own chatbots) to answer without hallucinating.




### Author
[Ümit Mert Çağlar](https://avesis.metu.edu.tr/mecaglar)



### Computational Requirements
<a id='comp_req'></a>
In this notebook, we will be using lightweight computational resources. You do not need a GPU or training of any sort to follow or complete this lab.

Although we will be employing Large Language Models (LLMs), we will access them through the API provided by [Anthropic](https://www.anthropic.com/), specifically the API for their foundation model [Claude-3](https://www.anthropic.com/claude). If you want to try this yourself, you can follow the steps from [here](https://console.anthropic.com/login?returnTo=%2F%3F): register with your e-mail and enter your phone number to receive 5 dollars of introductory credit, valid for 14 days. The whole tutorial costs about 10 cents, so you can use the remaining credit balance however you like.

# Example-1: Single Document about QLoRA to Provide Context

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. We release all of our models and code, including CUDA kernels for 4-bit training.
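To make the memory claim in the abstract concrete, here is a rough back-of-the-envelope calculation. It counts only the storage of the model weights and is a simplifying assumption: activations, adapter weights, optimizer state, and quantization constants are all ignored, so real usage is higher. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_65b = 65_000_000_000
fp16_gb = weight_memory_gb(params_65b, 16)  # regular 16-bit weights
nf4_gb = weight_memory_gb(params_65b, 4)    # 4-bit quantized weights
print(f"16-bit weights: {fp16_gb:.1f} GB, 4-bit weights: {nf4_gb:.1f} GB")
# prints: 16-bit weights: 130.0 GB, 4-bit weights: 32.5 GB
```

Under these assumptions, only the 4-bit weights fit within the 48GB GPU mentioned in the abstract.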

## Context

In [None]:
!wget "https://arxiv.org/pdf/2305.14314.pdf" -O /content/QLORA.pdf

--2024-05-09 18:18:16--  https://arxiv.org/pdf/2305.14314.pdf
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.67.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2305.14314 [following]
--2024-05-09 18:18:16--  http://arxiv.org/pdf/2305.14314
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1065470 (1.0M) [application/pdf]
Saving to: ‘/content/QLORA.pdf’


2024-05-09 18:18:16 (15.8 MB/s) - ‘/content/QLORA.pdf’ saved [1065470/1065470]



## Requirements
Install the requirements with pip install.

In [None]:
!pip install torch llama-index==0.10.20 transformers accelerate bitsandbytes pypdf chromadb==0.4.24 sentence-transformers pydantic==1.10.11 llama-index-embeddings-huggingface llama-index-llms-huggingface llama-index-readers-file llama-index-vector-stores-chroma llama-index-llms-anthropic --quiet

## Imports
We import the necessary libraries here. The main ones are [torch](https://pytorch.org/), [llama index](https://www.llamaindex.ai/), and [transformers](https://huggingface.co/docs/transformers/index) from Hugging Face. You can read more about llama index and RAG [here](https://docs.llamaindex.ai/en/stable/).

In [None]:
import torch
import sys
import chromadb
from llama_index.core import VectorStoreIndex, download_loader, ServiceContext, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.storage.storage_context import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.readers.file import PDFReader
from llama_index.llms.anthropic import Anthropic
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
from llama_index.core import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM
from IPython.display import Markdown, display, HTML
from pathlib import Path
import os

## API Key
Provide your API key here.

In [None]:
os.environ["ANTHROPIC_API_KEY"] = "YOUR API KEY HERE"
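If you would rather not paste the key into the notebook, one possible alternative is to read it from the environment and only prompt for it when it is missing. This is a minimal sketch; the helper name `resolve_api_key` and its parameters are hypothetical and not part of any library.

```python
import os
from getpass import getpass


def resolve_api_key(env=None, prompt=getpass, name="ANTHROPIC_API_KEY"):
    """Return the key from the environment, or ask for it without echoing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed characters, so the key is not shown on screen
        key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is empty")
    return key


# Usage in the notebook (left commented out so this cell does not block on input):
# os.environ["ANTHROPIC_API_KEY"] = resolve_api_key()
```

This keeps the key out of the saved notebook, which matters if the notebook is shared or committed to a repository.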

## Document
Provide the document here. For now, we load the PDF we downloaded above.

In [None]:
loader = PDFReader()
documents = loader.load_data(file=Path("/content/QLORA.pdf"))

## Large Language Model
Here we set up the LLM, a Claude-3 model, as described in the [Computational Requirements section](#comp_req).

In [None]:
llm = Anthropic(
    model="claude-3-sonnet-20240229",
)

tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer
Settings.llm = llm
Settings.chunk_size = 1024
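`Settings.chunk_size = 1024` limits how large each piece of the document may be before it is embedded and stored. The idea can be sketched with a fixed-size sliding window. This is a simplification under stated assumptions: whitespace-separated words stand in for real tokens, the function `chunk_tokens` is hypothetical, and llama-index's own splitters also respect sentence boundaries.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


words = "QLoRA backpropagates gradients through a frozen 4-bit quantized model".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(chunk)
```

Smaller chunks give more precise retrieval hits but less surrounding context per hit; the overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks.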

## First response
Here we get our first response. When asked (queried) about QLORA, the LLM responds that it doesn't know the term. Good, but how can we "teach" it this term? Do we need to train a model from scratch? Thankfully no, we can use this foundational model and RAG!

In [None]:
# resp contains the response
resp = llm.complete("What is QLORA?")

# Using HTML with inline CSS for styling (blue color, 14px font size)
html_text = f'<p style="color: #1f77b4; font-size: 14px;"><b>{resp}</b></p>'
display(HTML(html_text))

Response: QLORA is not a commonly recognized acronym or term that I'm familiar with. Without more context, it's difficult for me to provide a definitive explanation of what QLORA means or refers to. Acronyms can have multiple meanings across different fields or contexts. Could you provide some additional details about where you encountered this term or what domain it relates to? That would help me try to determine the intended meaning of QLORA.
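The styling snippet above is repeated several times in this notebook, and it inserts the model's text into HTML without escaping it. A small helper could cover both points. This is a sketch; `response_to_html` is a hypothetical name, and the escaping is an addition that the original cells do not perform.

```python
import html


def response_to_html(response, color="#1f77b4", font_size_px=14):
    """Wrap a response in a styled paragraph, escaping HTML special characters."""
    # Escape first, then turn newlines into <br> so multi-line answers keep their layout.
    safe_text = html.escape(str(response)).replace("\n", "<br>")
    return f'<p style="color: {color}; font-size: {font_size_px}px;"><b>{safe_text}</b></p>'


snippet = response_to_html("a < b\nsecond line")
print(snippet)
```

In the notebook it would be used as `display(HTML(response_to_html(resp)))`.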

## Vector Database

We can observe from the preceding output that the model has no knowledge of this topic, which works in our favor: it gives us an opportunity to supply that knowledge with RAG. We will now configure ChromaDB as our vector database and load the paper we downloaded into it. Chroma is an open-source vector embedding database. When a query is made, the feature vector (embedding) of our prompt is computed and the most relevant chunks of the loaded documents are retrieved by similarity search. The retrieved text can then be passed to the language model as context. ChromaDB runs inside our Jupyter notebook and has already been installed, so there is no need to attach any external server. Since we only use one document, Chroma will not have any difficulty determining which document to return, but you can experiment with loading additional documents to see how it affects the result.
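The similarity search described above can be illustrated with a tiny in-memory store. This is a toy sketch, not how Chroma is implemented: the class `TinyVectorStore` is hypothetical, and word counts stand in for the neural embeddings that a model such as `BAAI/bge-base-en-v1.5` would produce.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a neural embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Keeps (text, vector) pairs and returns the texts closest to a query."""

    def __init__(self):
        self._items = []

    def add(self, text):
        self._items.append((text, embed(text)))

    def query(self, text, top_k=1):
        query_vector = embed(text)
        scored = [(cosine_similarity(query_vector, vec), doc) for doc, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
store.add("qlora finetunes a quantized language model with low rank adapters")
store.add("cdw-ce is a loss function for ordinal classification of medical images")
store.add("the regulations describe graduate course registration rules")

best_score, best_doc = store.query("which loss function suits ordinal classes")[0]
print(f"{best_score:.2f}  {best_doc}")
```

With several entries loaded, the store has to pick the one that matches the query, which is the situation the last sentence of the paragraph above invites you to experiment with.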

In [None]:
#Create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("firstcollection")

# Load the embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_documents(
  documents, storage_context=storage_context, service_context=service_context
)



In [None]:
# Define the query
query = "what is QLORA?"

query_engine = index.as_query_engine(response_mode="compact")
response = query_engine.query(query)

# Using HTML with inline CSS for styling (blue color)
html_text = f'<p style="color: #1f77b4; font-size: 14px;"><b>{response}</b></p>'
display(HTML(html_text))

Response: Based on the context provided, QLORA appears to be a technique or method used for finetuning large language models like OPT on various datasets and tasks. The context mentions using QLORA for finetuning OPT models of different sizes (7B, 13B, 33B, 65B) on datasets like Self-Instruct, Alpaca, Unnatural Instructions, Longform, and Chip2. It also provides hyperparameter details used for QLORA finetuning across different model sizes and datasets. However, the context does not explicitly define or describe what QLORA is in detail.

## RAG Results
We have successfully implemented RAG with our LLM-based foundational model! We can now use a powerful tool together with a context to answer a question while staying within that context. Here are some additional questions and answers. Note that we can use this approach to extract a lot of useful information from documents.

In [None]:
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What would be potential real world use cases for QLoRA?")
print(response)

Querying with: What would be potential real world use cases for QLoRA?
Based on the context information provided, some potential real-world use cases for QLoRA (Quantized Low-Rank Adaptation) could be:

1. Fine-tuning large language models for specific tasks or domains while requiring significantly less compute resources and memory compared to full model fine-tuning. This makes it feasible to personalize large models on consumer hardware like laptops or edge devices.

2. Enabling on-device fine-tuning and personalization of language models for virtual assistants, chatbots, and other conversational AI applications on mobile devices or IoT products with limited compute capabilities.

3. Allowing efficient model adaptation and knowledge injection for large language models in cloud/server environments, reducing costs associated with full fine-tuning at scale.

4. Facilitating research and experimentation with large language model customization by making the process more accessible and reso

Response:

Querying with: What would be potential real world use cases for QLoRA?
Based on the context information provided, some potential real-world use cases for QLoRA (Quantized Low-Rank Adaptation) could be:

1. Fine-tuning large language models for specific tasks or domains while requiring significantly less compute resources and memory compared to full model fine-tuning. This makes it feasible to personalize large models on consumer hardware like laptops or edge devices.

2. Enabling on-device fine-tuning and personalization of language models for virtual assistants, chatbots, and other conversational AI applications on mobile devices or IoT products with limited compute capabilities.

3. Allowing efficient model adaptation and continuous learning for large language models in scenarios where data is arriving in a stream, such as for customer service chatbots or language translation systems.

4. Facilitating privacy-preserving fine-tuning, where the base model remains unchanged, and only small LoRA weights need to be transmitted for personalization on the client-side.

5. Enabling more efficient and scalable fine-tuning pipelines for large language model providers, reducing the computational costs associated with serving fine-tuned models for various downstream tasks.

The low-rank, quantized nature of QLoRA makes it particularly well-suited for resource-constrained environments and applications requiring efficient model customization or continuous adaptation.

Let's find those key points from the text ourselves.


1. Our QLORA finetuning method is the first method that enables the finetuning of 33B parameter models on a single consumer GPU and 65B parameter models on a single professional GPU, while not degrading performance relative to a full finetuning baseline.
1. Another potential source of impact is deployment to mobile phones.
1. Since instruction finetuning is an essential tool to transform raw pretrained LLMs into ChatGPT-like chatbots, we believe that our method will make finetuning widespread and common in particular for the researchers that have the least resources, a big win for the accessibility of state of the art NLP technology.
1. QLORA can help enable privacy-preserving usage of LLMs, where users can own and manage their own data and models, while simultaneously making LLMs easier to deploy.

The results are convincing!
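Instead of searching the paper by hand, the retrieved passages can also be listed programmatically. The sketch below assumes that the response object exposes a `source_nodes` list whose items carry `score` and `node` attributes, and that `PDFReader` stores the page under the `page_label` metadata key; check both against your installed llama-index version. The helper `summarize_sources` is hypothetical, and a stand-in object with made-up values is used so the sketch runs without an API key.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return one line per retrieved chunk: page, similarity score, text preview."""
    lines = []
    for item in response.source_nodes:
        page = item.node.metadata.get("page_label", "?")
        # Collapse whitespace so PDF line breaks do not clutter the preview.
        text = " ".join(item.node.get_content().split())
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"page {page} (score {score}): {text[:max_chars]}")
    return lines


# Stand-in shaped like a response; the page, score, and text are made-up values.
fake_node = SimpleNamespace(
    metadata={"page_label": "3"},
    get_content=lambda: "Example chunk text\nspanning two lines.",
)
fake_response = SimpleNamespace(
    source_nodes=[SimpleNamespace(node=fake_node, score=0.71234)]
)
for line in summarize_sources(fake_response):
    print(line)
```

In the notebook, calling `summarize_sources(response)` on the result of `query_engine.query(...)` or `chat_engine.chat(...)` would show which pages an answer was drawn from.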


In [None]:
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("Explain QLoRA vs. Standard Finetuning")
print(response)

Querying with: Explain QLoRA vs. Standard Finetuning
QLoRA (Quantized Low-Rank Adaptation) is a method that enables efficient finetuning of large language models by combining quantization and low-rank adaptation techniques. Unlike standard full finetuning, which updates all the parameters of the pre-trained model, QLoRA first quantizes the pre-trained model to a lower precision (e.g., 4-bit) and then only finetunes a small set of additional parameters, called adapters, during the finetuning process.

The key advantages of QLoRA over standard finetuning are:

1. Memory efficiency: By quantizing the pre-trained model to lower precision, QLoRA significantly reduces the memory footprint, allowing finetuning of much larger models on hardware with limited memory.

2. Computational efficiency: Since QLoRA only updates the small adapter parameters during finetuning, it requires significantly fewer computational resources compared to full finetuning, which updates all model parameters.

3. Matc

In [None]:
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("Explain QLoRA Finetuning steps")
print(response)

Querying with: Explain QLoRA Finetuning steps
QLoRA (Quantized Low-Rank Adaptation) is a method for efficient finetuning of large language models. Here are the key steps involved:

1. Start with a pre-trained large language model that has been quantized to lower precision (e.g. 4-bit) to reduce memory requirements.

2. Add a small set of trainable parameters called LoRA (Low-Rank Adaptation) to the quantized model. These are low-rank matrices that can adapt the pre-trained model to a new task or domain.

3. Finetune only the LoRA parameters on the target dataset, while keeping the quantized pre-trained model fixed. This requires much less memory and compute compared to full finetuning.

4. During inference, combine the LoRA parameters with the quantized pre-trained model to obtain the finetuned model outputs.

The key advantages of QLoRA are its high efficiency, enabling finetuning of very large models like 33B or 65B parameters on a single GPU, and its performance being on par with fu

In [None]:
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("Explain Qualitative Analysis on QLoRA")
print(response)

Querying with: Explain Qualitative Analysis on QLoRA
Based on the provided context, QLoRA (Quantized Low-Rank Adaptation) is a technique used for efficient finetuning of large language models. It involves adding low-rank matrices to the existing weights of the model during finetuning, rather than updating all weights. This allows for faster and more memory-efficient finetuning compared to standard finetuning methods.

The context provides details on the hyperparameters used for QLoRA finetuning experiments across different model sizes (7B, 13B, 33B, 65B) and datasets like OASST1, HH-RLHF, Longform, and others. It mentions using techniques like LoRA dropout, tuning the LoRA rank (r), and other settings like learning rates and training steps for different model-dataset combinations.

The qualitative analysis likely refers to evaluating the performance and behavior of models finetuned with QLoRA on various tasks and datasets, compared to standard finetuning or other baselines. This could 

In [None]:
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What are the benchmark results for Guanaco")
print(response)

Querying with: What are the benchmark results for Guanaco
According to the context information provided, the Guanaco models achieved impressive results on various benchmarks:

1. On the Vicuna benchmark, which favors open-source models, the Guanaco 33B and 65B models outperformed all other models except GPT-4. They performed comparably to ChatGPT.

2. On the larger OA (Open Assistant) benchmark, the Guanaco 33B and 65B models also outperformed all models besides GPT-4 and performed similarly to ChatGPT.

3. In the Elo rating competition, where GPT-4 judged the quality of responses on the Vicuna benchmark, the Guanaco 65B model achieved an Elo rating of 1022, and the Guanaco 33B model achieved an Elo rating of 992, ranking them as the second and third best models after GPT-4.

4. The Guanaco 33B model achieved 97.8% of the performance level of ChatGPT on the Vicuna benchmark, while the Guanaco 65B model essentially closed the gap, reaching 99.3% of ChatGPT's performance.

5. Even the sm

Response:
Querying with: What are the benchmark results for Guanaco
According to the context information provided, the Guanaco models achieved impressive results on various benchmarks:

1. On the Vicuna benchmark, which favors open-source models, the Guanaco 33B and 65B models outperformed all other models except GPT-4. They performed comparably to ChatGPT.

2. On the larger OA (Open Assistant) benchmark, the Guanaco 33B and 65B models also outperformed all models besides GPT-4 and performed similarly to ChatGPT.

3. In the Elo rating competition, where GPT-4 judged the responses, the Guanaco 65B model achieved an Elo rating of 1022, and the Guanaco 33B model achieved an Elo rating of 992, ranking them as the second and third best models, respectively, after GPT-4.

4. The Guanaco 13B model outperformed the Bard model in the Elo rating competition, with an Elo rating of 916 compared to Bard's 902.

5. On the Vicuna benchmark, the smallest Guanaco model (7B parameters) outperformed the 26GB Alpaca model by more than 20 percentage points.

Overall, the Guanaco models, especially the 33B and 65B versions, demonstrated state-of-the-art performance, rivaling ChatGPT on various benchmarks and outperforming other open-source models by a significant margin.

# Example-2: Single Document about CDW-CE to Prevent Hallucination

In this example, we will start with the following document, with the following abstract:

In scoring systems used to measure the endoscopic activity of ulcerative colitis, such as Mayo endoscopic score or Ulcerative Colitis Endoscopic Index Severity, levels increase with severity of the disease activity. Such relative ranking among the scores makes it an ordinal regression problem. On the other hand, most studies use categorical cross-entropy loss function to train deep learning models, which is not optimal for the ordinal regression problem. In this study, we propose a novel loss function, class distance weighted cross-entropy (CDW-CE), that respects the order of the classes and takes the distance of the classes into account in calculation of the cost. Experimental evaluations show that models trained with CDW-CE outperform the models trained with conventional categorical cross-entropy and other commonly used loss functions which are designed for the ordinal regression problems. In addition, the class activation maps of models trained with CDW-CE loss are more class-discriminative and they are found to be more reasonable by the domain experts.
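To give a feel for a loss that "takes the distance of the classes into account", here is a small sketch. The formula used, a sum over classes of `-log(1 - p_i) * |i - c|**alpha` with true class `c`, is my understanding of the formulation in the paper; since this notebook does not reproduce it, treat it as an assumption and check it against the original work. The function name, the default `alpha`, and the `eps` clamp are illustrative choices.

```python
import math


def cdw_ce_loss(probs, true_class, alpha=2.0, eps=1e-12):
    """Class-distance-weighted loss for one sample (assumed formulation, see text)."""
    loss = 0.0
    for i, p in enumerate(probs):
        # Probability placed on class i is penalized more the farther i is from the truth.
        distance_weight = abs(i - true_class) ** alpha
        loss -= math.log(max(1.0 - p, eps)) * distance_weight
    return loss


# Same confidence in the true class (0.7); only the placement of the rest differs.
near_miss = cdw_ce_loss([0.7, 0.2, 0.1], true_class=0)
far_miss = cdw_ce_loss([0.7, 0.1, 0.2], true_class=0)
print(f"near miss: {near_miss:.3f}, far miss: {far_miss:.3f}")
```

Plain cross-entropy would score both predictions identically, because it only looks at the probability of the true class; here the prediction that leans toward the more distant class receives the larger loss.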

In [None]:
!wget "https://arxiv.org/pdf/2202.05167" -O /content/cdwce.pdf

--2024-05-09 18:24:39--  https://arxiv.org/pdf/2202.05167
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5256318 (5.0M) [application/pdf]
Saving to: ‘/content/cdwce.pdf’


2024-05-09 18:24:39 (24.6 MB/s) - ‘/content/cdwce.pdf’ saved [5256318/5256318]



Again, we will load the document with our PDF reader, initialize our LLM, and ask about potential use cases for CDW-CE.

In [None]:
loader = PDFReader()
documents_cdw_ce = loader.load_data(file=Path("/content/cdwce.pdf"))

In [None]:
llm2 = Anthropic(
    model="claude-3-sonnet-20240229",
)

tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer
Settings.llm = llm2
Settings.chunk_size = 1024

In [None]:
# resp contains the response
resp = llm2.complete("What would be potential real world use cases for CDW-CE?")
print(resp)

CDW-CE, or Contrastive Divergence with Data Augmentation and Consistency Encoding, is a machine learning technique used for training generative models, particularly energy-based models (EBMs) and implicit generative models. Some potential real-world use cases for CDW-CE include:

1. Image generation and manipulation: CDW-CE can be used to train generative models for creating realistic synthetic images or manipulating existing images. This has applications in computer vision, graphics, and multimedia domains, such as image editing, style transfer, and data augmentation for training computer vision models.

2. Anomaly detection: Generative models trained with CDW-CE can learn the underlying distribution of normal data, making them useful for detecting anomalies or outliers in various domains, such as fraud detection, manufacturing defect detection, and medical imaging analysis.

3. Data imputation and denoising: CDW-CE can be used to train generative models for imputing missing data or d

This is hallucination!
All of this information is hallucinated, so there is no way of determining where it originated or how the LLM arrived at its answer. How can we prevent this from happening? We can use RAG!

In [None]:
#Create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("secondcollection")

# Load the embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=llm2, embed_model=embed_model)
index_cdw_ce = VectorStoreIndex.from_documents(
  documents_cdw_ce, storage_context=storage_context, service_context=service_context
)



We have again used ChromaDB to create a vector database of our document (the CDW-CE paper), and we now ask the same question again.

In [None]:
chat_engine = index_cdw_ce.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What would be potential real world use cases for CDW-CE?")
print(response)

Querying with: What would be potential real world use cases for CDW-CE?
Potential real-world use cases for the Class Distance Weighted Cross-Entropy Loss (CDW-CE) method could include:

1. Medical image analysis and disease diagnosis - As demonstrated in the context, CDW-CE can improve the performance of convolutional neural networks in classifying medical images like endoscopy images for assessing ulcerative colitis severity. It provides better explainability through class activation maps highlighting more relevant and discriminative regions related to the disease.

2. Any ordinal classification task where there is a natural ordering between classes and misclassifications to distant classes should be penalized more heavily. Examples could include age estimation, product rating prediction, document grading etc.

3. Applications requiring better interpretability of model decisions, as CDW-CE encourages the model to focus on more relevant features/regions compared to standard cross-entro

In [None]:
chat_engine = index_cdw_ce.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("How does CDW-CE perform compared to other loss functions?")
print(response)

Querying with: How does CDW-CE perform compared to other loss functions?
According to the experimental results presented in the context, the proposed Class Distance Weighted Cross-Entropy (CDW-CE) loss function outperforms other ordinal loss approaches like Cross-Entropy (CE), Squared Ordinal Regression (CORN), and Hierarchical Ordinal Regression (HO2). 

Specifically, CDW-CE achieves the highest performance scores across different evaluation metrics and CNN models for the remission classification task. It significantly reduces mispredictions that are farther away (two or more class distances) from the true class compared to CE. While the sensitivity for edge classes may remain similar, CDW-CE notably increases the sensitivity for intermediate classes by centering wrong estimates closer to the true class due to the higher penalty given to distant mispredictions.

The context also mentions that CDW-CE provides better explainability through Class Activation Map (CAM) visualizations, high

In [None]:
chat_engine = index_cdw_ce.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("Which loss function is the best performing loss function in terms of QWK and F1 Score?")
print(response)

Querying with: Which loss function is the best performing loss function in terms of QWK and F1 Score?
Based on the results presented, the Class Distance Weighted Cross-Entropy (CDW-CE) loss function performs the best in terms of Quadratic Weighted Kappa (QWK) and F1 scores across the different models evaluated - ResNet18, Inception-v3, and MobileNet-v3-Large. The CDW-CE loss outperforms the standard cross-entropy (CE) loss as well as other ordinal loss functions like CORN, HO2, and CO2, achieving the highest QWK and F1 scores for all three model architectures. The visualizations in Figure 2 also show that CDW-CE provides more stable performance across different values of the hyperparameter α compared to CE loss.


In [None]:
chat_engine = index_cdw_ce.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("Which loss function has best explainability?")
print(response)

Querying with: Which loss function has best explainability?
Based on the experimental results described in the context, the proposed Class Distance Weighted Cross-Entropy (CDW-CE) loss function provides better explainability compared to the standard cross-entropy loss. Specifically, the context mentions that training models with the CDW-CE loss results in Class Activation Map (CAM) visualizations that highlight more relevant and discriminative regions for the decision-making process. This improved explainability is attributed to the way CDW-CE penalizes mispredictions according to their distance from the true class, encouraging the model to learn more meaningful representations. Therefore, the CDW-CE loss function has the best explainability according to the information provided.


In [None]:
html_text = f'<p style="color: #1f77b4; font-size: 14px;"><b>{response}</b></p>'
display(HTML(html_text))

In [None]:
chat_engine = index_cdw_ce.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What is QLORA?")
print(response)

Querying with: What is QLORA?
Unfortunately, there is no mention of "QLORA" in the given context information. The context appears to be discussing deep learning models, loss functions, and evaluation metrics for analyzing medical images related to ulcerative colitis. Without any relevant information provided, I cannot determine what "QLORA" refers to.


In [None]:
html_text = f'<p style="color: #1f77b4; font-size: 14px;"><b>{response}</b></p>'
display(HTML(html_text))

# Example-3: Multiple Documents and Letting the LLM Choose

In [None]:
!wget "https://github.com/caglarmert/DI725/blob/b056f9f05e612942689031655d92dc21f802f28c/METU_Regulations.pdf?raw=true" -O /content/gda.pdf

--2024-05-09 18:55:56--  https://github.com/caglarmert/DI725/blob/b056f9f05e612942689031655d92dc21f802f28c/METU_Regulations.pdf?raw=true
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/caglarmert/DI725/raw/b056f9f05e612942689031655d92dc21f802f28c/METU_Regulations.pdf [following]
--2024-05-09 18:55:57--  https://github.com/caglarmert/DI725/raw/b056f9f05e612942689031655d92dc21f802f28c/METU_Regulations.pdf
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/caglarmert/DI725/b056f9f05e612942689031655d92dc21f802f28c/METU_Regulations.pdf [following]
--2024-05-09 18:55:57--  https://raw.githubusercontent.com/caglarmert/DI725/b056f9f05e612942689031655d92dc21f802f28c/METU_Regulations.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185

In [None]:
loader = PDFReader()
documents_gda = loader.load_data(file=Path("/content/gda.pdf"))

llm3 = Anthropic(
    model="claude-3-sonnet-20240229",
)

tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer
Settings.llm = llm3
Settings.chunk_size = 1024
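
`Settings.chunk_size` limits the length of the pieces that each document is split into before embedding. The sketch below shows the general idea only: it treats whitespace-separated words as a stand-in for tokens, and the helper `split_into_chunks` and its `overlap` parameter are hypothetical. The splitter that llama-index actually applies is more sophisticated, so this is not its implementation.

```python
def split_into_chunks(text, chunk_size=1024, overlap=0):
    """Split text into chunks of at most chunk_size words (words stand in for tokens)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Tiny example: 10 "tokens", chunks of 4 with 1 token shared between neighbours
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_into_chunks(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Smaller chunks make retrieval more precise but give the LLM less surrounding text per hit; larger chunks do the opposite.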





In [None]:
# resp contains the response
resp = llm3.complete("What is the date of Higher Education Act 2547?")
print(resp)

Unfortunately, I could not find any specific information about a "Higher Education Act 2547". The year 2547 seems to refer to the Buddhist Era calendar, which is widely used in several Southeast Asian countries like Thailand.

Without more context about which country this act is from, it's difficult to pinpoint the exact date corresponding to the year 2547 BE on the Gregorian calendar that is commonly used internationally.

If you could provide some more details about which country's legislation this refers to, I may be able to determine the equivalent Gregorian calendar year for 2547 BE. Thailand, for example, is currently in the year 2566 BE, which corresponds to 2023 CE on the Gregorian calendar.


In [None]:
# resp contains the response
resp = llm3.complete("I want to improve my grade from a course I took 5 semesters ago, can I take the course now?")
print(resp)

The ability to retake a course to improve your grade usually depends on the policies of the specific college or university. Here are a few common scenarios:

- Many schools allow students to retake courses in which they received a low grade (C or below). The new grade replaces the old grade in calculating your GPA, though the original grade may still appear on your transcript.

- Some schools have time limits on retaking courses for grade replacement, such as within 1-2 years after originally taking the course.

- Other schools may average the new grade with the old grade instead of replacing it entirely.

- There are usually restrictions on how many courses or credits can be retaken for grade replacement purposes.

- Retaking a course after graduating is typically not allowed for the purpose of improving your final degree GPA.

The best thing to do is to check with your school's registrar or academic advising office about their specific course repeat/grade replacement policy. They can

In [None]:
# resp contains the response
resp = llm3.complete("I want to apply to a graduate school in METU, which exam shall I take?")
print(resp)

To apply for graduate programs at Middle East Technical University (METU) in Turkey, you will typically need to take one of the following standardized exams, depending on the program you are applying to:

1. GRE (Graduate Record Examination): The GRE is a widely accepted exam for admission to many graduate programs, including those at METU. It tests your verbal reasoning, quantitative reasoning, and analytical writing skills.

2. ALES (Academic Personnel and Postgraduate Education Entrance Exam): ALES is a national exam administered by the Student Selection and Placement Center (ÖSYM) in Turkey. It is required for admission to many graduate programs in Turkish universities, including METU.

3. Subject-specific exams: Some graduate programs at METU may require you to take a subject-specific exam, such as the GMAT for business-related programs or the GRE Subject Tests for specific fields of study.

It's important to check the specific requirements of the graduate program you are interest

In [None]:
# resp contains the response
resp = llm3.complete("What is the minimum CGPA requirement to take DCA in METU?")
print(resp)

Unfortunately, I do not have specific information about the minimum CGPA (Cumulative Grade Point Average) requirement to take the DCA (Department of Computer Applications) program at METU (Middle East Technical University) in Turkey.

The CGPA requirements can vary between universities and even between different programs within the same university. They are usually set by the individual institution based on factors such as program competitiveness, available seats, and their own academic standards.

To get the most accurate and up-to-date information about the CGPA cutoff for the DCA program at METU, I would recommend checking the university's official website, contacting their admissions office directly, or speaking with an academic advisor there. They would have the authoritative details on the specific CGPA criteria needed to be eligible for that particular program.


In [None]:
# Create a client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("thirdcollection")

# Load the embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Use llm3, the LLM configured for this example
service_context = ServiceContext.from_defaults(llm=llm3, embed_model=embed_model)
index_gan = VectorStoreIndex.from_documents(
  documents_gda, storage_context=storage_context, service_context=service_context
)



In [None]:

chat_engine = index_gan.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What is the date of Higher Education Act 2547?")
print(response)

Querying with: What is the date of Higher Education Act 2547?
The Higher Education Act 2547 is dated November 4, 1981.


In [None]:
chat_engine = index_gan.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What is the maximum duration of education for an undergraduate student at METU?")
print(response)

Querying with: What is the maximum duration of education for an undergraduate student at METU?
According to Article 6, the maximum duration of an undergraduate program at METU is seven years (fourteen semesters). For programs offering a master's degree along with an undergraduate degree, the maximum duration is nine years (sixteen semesters).


In [None]:
chat_engine = index_gan.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What is the maximum duration of education for a master's student at METU?")
print(response)

Querying with: What is the maximum duration of education for a master's student at METU?
According to Article 31, the maximum duration for a Master's program with a thesis at METU is six semesters. This does not include any time spent in the Academic Deficiency Program.


In [None]:
chat_engine = index_gan.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What is the maximum duration of education for a doctoral student at METU?")
print(response)

Querying with: What is the maximum duration of education for a doctoral student at METU?
According to Article 39, the maximum duration of a Ph.D. program at METU is twelve academic semesters. For Ph.D. on Bachelor's degree programs, the maximum duration is fourteen academic semesters.


In [None]:
chat_engine = index_gan.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("Can I withdraw from a course as an undergraduate student?")
print(response)

Querying with: Can I withdraw from a course as an undergraduate student?
Yes, as an undergraduate student at METU, you can withdraw from a course. The context information does not explicitly mention course withdrawal, but it does outline the registration process and rules around taking courses with an "NI" (not included) status. Specifically, Article 19(d) states that "The status of courses falling into the NI status cannot be altered after the registration process of the concerned semester is completed." This implies that during the registration period, you can likely withdraw from or drop a course before it enters the "NI" status, after which the status cannot be changed for that semester.


In [None]:
chat_engine = index_gan.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What is the regulation regarding course withdrawal for graduate students?")
print(response)

Querying with: What is the regulation regarding course withdrawal for graduate students?
According to the provided context information, the regulations regarding course withdrawal for graduate students are as follows:

b) Course withdrawals are processed and advisor approvals are given online.
c) An advisor approval is required following a one-on-one meeting with the advisor for course withdrawal.  
e) Course withdrawal may be processed for only one course in a semester.
f) Course withdrawal may be processed for a maximum of six courses throughout the duration of education.
g) Course withdrawal is not possible in the first two semesters of the curriculum.
h) Course withdrawal is not possible for repeated courses, previously withdrawn courses, courses in NI (Not Included) status, or non-credit courses.
i) Course withdrawal is not allowed for students taking the minimum course load or below in a semester.


In [None]:
chat_engine = index_gan.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What are scores and coefficients corresponding to letter grade AA and CC?")
print(response)

Querying with: What are scores and coefficients corresponding to letter grade AA and CC?
According to the context information provided:

For the letter grade AA:
- The score interval is 90-100
- The coefficient is 4.00

For the letter grade CC: 
- The score interval is 70-74
- The coefficient is 2.00


In [None]:
chat_engine = index_gan.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("I am a graduate student, I have a CGPA of 2.6 will I be dismissed?")
print(response)

Querying with: I am a graduate student, I have a CGPA of 2.6 will I be dismissed?
Yes, based on the information provided in the context, if you are a graduate student with a Cumulative Grade Point Average (CGPA) below 3.00, you will be dismissed from your graduate program.

Specifically, for Master's programs, Article 33 (5) states: "Students who cannot successfully complete the courses (credit courses and seminar course) specified by the concerned GSD in a maximum of two academic years (four semesters), and/or students whose Cumulative Grade Point Average is below 3.00 are dismissed from their graduate program."

Similarly, for Ph.D. programs, Article 41 (6) mentions: "Students who cannot successfully complete the courses (credit courses and seminar course) specified by the concerned GSD within four academic semesters in Ph.D. programs, and in six academic semesters in Ph.D. on Bachelor's degree programs, and/or students whose Cumulative Grade Point Average is below 3.00 may not sit t

In [None]:
chat_engine = index_gan.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("I am an undergraduate student, I have a CGPA of 2.6 will I be considered successful?")
print(response)

Querying with: I am an undergraduate student, I have a CGPA of 2.6 will I be considered successful?
Yes, based on the information provided in the context, if you are an undergraduate student with a Cumulative Grade Point Average (CGPA) of 2.6, you will be considered successful. According to Article 31, one of the requirements for graduation from an undergraduate program is to have a CGPA of at least 2.00. Since your CGPA of 2.6 exceeds this minimum requirement, you will be considered a successful student.


These questions and answers show that, when the index spans multiple documents, our RAG approach chooses among them by comparing the embedding vector of the query with the vectors of the document chunks and retrieving the most relevant ones. This lets us combine larger knowledge bases into a single vectorized database and still retrieve only the useful information. We asked the same kind of question for undergraduate and graduate students and received different answers, each drawn from the relevant part of the regulations, even though the LLM was not specifically trained on this topic!
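
The relevance-based selection described above can be sketched with cosine similarity over toy vectors. The three-dimensional embeddings and the chunk names below are hypothetical and hand-written; a real embedding model produces vectors with hundreds of dimensions, and the vector database performs the search for us.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=1):
    """Return the names of the k chunks most similar to the query."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Hypothetical embeddings of three chunks from different parts of the regulations
chunk_vecs = {
    "undergraduate_rules": [0.9, 0.1, 0.0],
    "graduate_rules": [0.1, 0.9, 0.1],
    "grading_table": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.8, 0.0]  # pretend embedding of a question about graduate students
print(top_k(query_vec, chunk_vecs, k=2))
```

Only the top-ranked chunks are passed to the LLM, which is why the same question phrased for a graduate or an undergraduate student can pull in different articles.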

## Conclusion
In this lab we have seen how RAG operates and how it lets us adapt a foundation model (LLM), trained on closed datasets with an enormous amount of data, time, and computation power, to our own needs. Specifically, we have observed how to let the LLM answer from the context we provide, how this reduces hallucination, and how RAG selects among different documents.
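
As a closing illustration of answering from provided context, the sketch below assembles a prompt from retrieved text and a question. The template wording and the function `build_rag_prompt` are hypothetical; the library builds its own prompts internally, and this is not a copy of them.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved text and the user's question into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Answer the question using only the context above. "
        "If the context does not contain the answer, say so.\n"
        f"Question: {question}\n"
        "Answer: "
    )

# Placeholder chunks; in the lab these would come from the vector index
prompt = build_rag_prompt(
    "What is the maximum duration of a Ph.D. program?",
    ["(retrieved chunk 1 would go here)", "(retrieved chunk 2 would go here)"],
)
print(prompt)
```

Instructing the model to rely only on the supplied context is what lets it decline to answer, as it did for the QLoRA question against the CDW-CE index, instead of guessing.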

## References

1. [A documentation about RAG](https://docs.llamaindex.ai/en/stable/getting_started/concepts/)
1. [A blog about in-context learning](https://www.lakera.ai/blog/what-is-in-context-learning)
1. [A course about LLMs](https://app.datacamp.com/learn/courses/introduction-to-llms-in-python)
1. [Another course about LLM applications with LangChain](https://app.datacamp.com/learn/courses/developing-llm-applications-with-langchain)
1. [A video tutorial about RAG](https://www.youtube.com/watch?v=sVcwVQRHIc8)
1. [A tutorial about RAG implementation with open source LLM](https://learnbybuilding.ai/tutorials/rag-from-scratch)
1. [Another tutorial about RAG implementation](https://blog.risingstack.com/retrieval-augmented-generation-tutorial-google-colab/)