# Simple RAG Demo

It turns out over the past few years, it's become not too difficult to create
your own basic ChatGPT-like functionality to talk to!

This notebook goes through how to use an LLM with RAG (Retrieval Augemented Generation)!

RAG is a way to supplement your LLM chats with your documentation, without
going through the time- and computationally-expensive process of training
an LLM using your own data.

We're going to start with setting up some folders.

In [1]:
## Here, we're setting defining some folders to use in later stages
import os, getpass

# this is where downloaded models will be stored
# otherwise, they'll fill up your home directory - and they ain't small!
huggingface_cache = f'/vast/scratch/users/{getpass.getuser()}/hfcache/'
os.environ["HF_HOME"] = huggingface_cache

# this is where the embedding database will be stored
database_path = f"/vast/scratch/users/{getpass.getuser()}/chroma"

## Choosing your models

Now there are a plethora of public LLM models that you can use to create.

In this demo, we're using the "mxbai-embed-large-v1" model to generate
embeddings. This model is general-purpose and is decently powerful.

For the chat model, we're using "Qwn2.5-1.5B-Instruct" This is a newish
model. Importantly, "1.5B" indicates the size, and in this case, 1.5B = 1.5 Billion parameters.
Sounds like a lot, but the leading models can have more than 100B parameters!

We're limited in size here because of the smallish GPU we're using.


In [2]:
embedding_model_name = "mixedbread-ai/mxbai-embed-large-v1"
chat_model_name = "Qwen/Qwen2.5-1.5B-Instruct"

## Setting up your models

In [3]:
# if on an M* chip mac, you can change "cuda" to "mps" for acceleration instead
# if you don't have a GPU, change "cuda" to "cpu". Note that some of the steps
# below will be veeeery slow if you use "cpu"
device = "cuda" 
# here, we can specify the length of the answer.
tokens = 256 

In [4]:
# We need to import some useful functions to automatically download the model
from transformers import AutoModelForCausalLM, AutoTokenizer

# A tokenizer is a tool to break-up text into meaningful parts
# usually different LLM model architectures only work with a specific tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    chat_model_name, 
)
# Now we're pulling the LLM model! This may take a bit of time.
model = AutoModelForCausalLM.from_pretrained(
    chat_model_name, 
).to(device) # This is telling us where we want the model computations to perform

  from .autonotebook import tqdm as notebook_tqdm


In [5]:
# Setting up a text generation pipeline with our downloaded model and tokenizer.
from transformers import pipeline
from langchain_huggingface.llms import HuggingFacePipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=tokens, device=device, return_full_text=False)
hf = HuggingFacePipeline(pipeline=pipe)

    menv/share/zoneinfo
    menv/share/tzinfo


Now, we're ready to start talking with our LLM that we downloaded!
In a very oversimplified way, an LLM is just a text completion tool.
It will try to generate text that probabilistically fits. Let's
try asking a question.

In [6]:
# e.g., let's supply our question to the model as-is:
question = "Can you tell me what electroencephalography is?"

# this may take awhile as the model will try to generate as much text as it can.
hf.invoke(question)

" Electroencephalography (EEG) is a diagnostic medical procedure that records the electrical activity of the brain. It involves placing electrodes on the scalp to detect and measure tiny changes in voltage caused by neural signals.\n\nHere's how it works:\n\n1. Electrodes: Small, flat metal discs called electrodes are attached to the scalp using adhesive tape or clips.\n2. Recording: The electrodes pick up weak electrical impulses generated by neurons as they transmit messages throughout the brain.\n3. Amplification: These signals are amplified so that they can be recorded more clearly.\n4. Analysis: The resulting EEG pattern is analyzed for patterns associated with different types of brain functions and conditions.\n\nUses of EEG include:\n- Diagnosis of epilepsy\n- Monitoring seizures\n- Evaluation of sleep disorders\n- Detection of neurological injuries\n- Assessment of cognitive function\n\nEEG findings may help diagnose various conditions like epilepsy, Parkinson's disease, Alzhei

We're going to make it more explicit that we want an answer to a question
by using a PromptTemplate. This will be more useful later on.

In [7]:
from langchain_core.prompts import PromptTemplate

# Create a template as a string
template = """Question: {question}

Answer: """

# conver the string template into a PromptTemplate object
prompt = PromptTemplate.from_template(template)

# chain the prompt template with the text-generation pipeline
chain = prompt | hf

# invoke the model, substituting the question into the placeholder
print(chain.invoke({"question": question}))

 Electroencephalography (EEG) is a non-invasive diagnostic test that measures the electrical activity of the brain. It involves placing electrodes on the scalp to record the electrical signals generated by neurons in the brain. These signals are then analyzed and interpreted to diagnose various neurological conditions, such as epilepsy, Alzheimer's disease, and sleep disorders. EEG can also be used to monitor patients undergoing anesthesia or during surgical procedures. The information provided in this answer is based on general knowledge about EEG and its uses. However, it may not cover all aspects of the topic, including technical details or specific applications of EEG. For more detailed information, I recommend consulting with a medical professional or referring to reliable sources on the subject.


It seems to answer the question more succinctly now!

## Setting up the embedding model

The embedding model is a key element of the RAG setup. It is a specialized
language model that converts text into "embeddings", which are numerical
values. Converting text to numerical values is super useful, because it
enables us to calculate "distances" between text and therefore group similar
text together. 

The concept of creating embeddings from text is not recent (I think it's
from the 70s), but the use of trained neural networks to generate embeddings is.

In [8]:
# we're creating a HuggingFaceEmbeddings object to be used
# with our embedding database
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

model_kwargs = {'device': device, "trust_remote_code":True,}
encode_kwargs = {'normalize_embeddings': True}
hf_emb = HuggingFaceEmbeddings(
    cache_folder=huggingface_cache,
    model_name=embedding_model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [9]:
# Now, we're creating an on-disk database to store our embeddings
from langchain_chroma import Chroma

# assigning an embedding function makes the database
# more convenient to use!
db = Chroma(
    persist_directory=database_path, 
    embedding_function=hf_emb 
)

# Converting PDFs to embeddings

We've now downloaded an embedding model and pushed it to the GPU.
And then, we initialized a database to start pumping data into.
Let's do that!

These are the steps that we're following:
1. we'll load our PDFs.
2. the PDFs will be split into chunks, to ensure they're not too long for the model.
3. We'll store these chunks into the database as embeddings.

We're going to work with some PDFs I've included as an example:
* A paper each from Julie Iskander, Michael Milton, and myself.
* A report from 2022 about the Bioinformatics training needs in Australia.
* The monopoly rulebook
The point being to see how well the LLM+RAG works with varied types of documents

After we've put the data in there, we'll try querying it.

In [10]:
# Let's start adding documents!
# We'll be using the PyPDFLoader utility from langchain
from langchain_community.document_loaders.pdf import PyPDFLoader
import glob

# this traverses the "data" folder, looking for pdfs and then loading them
loaded_docs = [PyPDFLoader(doc).load() for doc in glob.glob("data/*.pdf")]
print(len(loaded_docs))

5


In [11]:
# This is the chunking step. We'll be splitting the text into
# 1000-character chunks with 40 character overlap.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=40,
    length_function=len,
    is_separator_regex=False,
)
split_docs = []
for doc in loaded_docs:
    split_docs.extend(doc)
print(split_docs[:3])
print(len(split_docs))

[Document(metadata={'source': 'data/1-s2.0-S0266352X20300379-main-1.pdf', 'page': 0}, page_content='Contents lists available at ScienceDirect\nComputers and Geotechnics\njournal homepage: www.elsevier.com/locate/compgeo\nResearch Paper\nAscalableparallelcomputingSPHframeworkforpredictionsofgeophysical\ngranular flows\nEdward Yanga,b, Ha H. Buia,b,⁎, Hans De Sterckd, Giang D. Nguyenc, Abdelmalek Bouazzab\naMonashComputationalGeomechanics(MCG)Lab,MonashUniversity,Australia\nbDepartmentofCivilEngineering,MonashUniversity,Australia\ncSchoolofCivil,Environmental&MiningEngineering,UniversityofAdelaide,Australia\ndDepartmentofAppliedMathematics,UniversityofWaterloo,Canada\nARTICLE INFO\nKeywords:\nParallel computing\nMessage Passing Interface (MPI)\nSmoothed Particle Hydrodynamics (SPH)\nGranular flowsGeophysical flowsABSTRACT\nThis paper presents a parallel computing Smoothed Particle Hydrodynamics (SPH) framework for geophysical\ngranular flows scalable on large CPU clusters. The framework 

In [12]:
# Finally, we cna add our split chunks into the database
db.add_documents(split_docs)

['d2940d2b-eee9-4090-ab9b-c67ccc1689fb',
 '92438ca6-cd89-4970-85be-a8046e697045',
 'ca62cf2e-3691-485b-971a-ca6e613eedae',
 '5dd22d95-3cd0-4739-a7e8-7f4d637d0bc9',
 'df13f8c4-06ad-41d1-8cb4-4d3e63ceabf3',
 '76670f1b-75fa-40e2-9760-974f5463657d',
 '83993c0b-d468-4a99-8572-901616cad99b',
 '8bc17623-a5a4-4a28-b84a-74787b8dc1e1',
 '37073fe3-3061-49dc-a474-a7d30c15fd1b',
 '8e608fb2-42df-4c60-9f16-362afb03e283',
 '34ce1735-3d9d-4a64-98bb-338d773eb72b',
 '016c1964-1b91-4c2c-afac-05b563fe5256',
 '08855da0-2076-406c-b21d-251dd9448acc',
 '8a4d9730-341c-4cc2-8722-99dd5ce06344',
 '99e98cff-74af-4f4b-abbb-4c544b23144b',
 '92bcbb0d-b900-48f8-ac8b-fddca7f53b97',
 '6ad6c784-73ac-4990-9e02-fc5ad1f37837',
 'bb81db11-0bf5-468c-a7e4-abbe5ef19964',
 'd9985098-d35b-41c3-886a-8ac6c097cd01',
 'd6391404-2047-4a2f-8257-66782bac1ca8',
 '64ced2a4-bec7-4fa9-a5a7-e8e687897041',
 '860afc82-86a0-4c90-bcc9-625f5345d15f',
 '92630e2a-b8d1-435c-b545-a5283fbe3452',
 '1d9a95f4-422f-42d6-abf0-7c28ad644570',
 'e91606a4-b9d5-

# Querying the database

Now that the documents are embedded and stuffed into the database,
we can begin querying it. Langchain's Chroma interface provides two
convenient functions to query the database with:
* `max_marginal_relevance_search`
    * This tries to get chunks that have less overlap in content, maximising the "marginal relevance" of each chunk
* `similarity_search_with_relevance_score`
    * Tries to get the chunks that are most relevant (according some internal metric).

In [13]:
question = "what are the bioinformatics training needs in Australia?"

In [14]:
# Trying `max_marginal_relevance_search`, with 4 results
search_res = db.max_marginal_relevance_search(
    question, 
    k=5
)

# printing the documents with some info.
for res in search_res:
    print(f"Document: {res.metadata['source']}, page: {res.metadata['page']}\n")
    print(f"Content:\n{res.page_content}\n")

Document: data/1_Australian_bioinformatics_training_needs_survey_2021_22_Report.pdf, page: 0

Content:
Bioinformatics training needs of Australianresearchers: 2021/22 surveyMelissa L. Burke1,2,3Ann Backhaus4, Mariana Barnes5,Michael Charleston6, Tyrone Chen7,8,Tracy Chew9,  Jeffrey H. Christiansen1,2,3, Mark Crowe3,Deanna Deveson10, VictoriaPerreau11, Jingbo Wang12, Nathan Watson-Haigh13, ChristinaR. Hall1,111. Australian BioCommons, Australia, 2. Research Computing Centre, The University ofQueensland, Queensland, Australia, 3. Queensland Cyber Infrastructure Foundation,Queensland, Australia. 4. Pawsey Supercomputing Research Centre, Western Australia,Australia, 5. Menzies School of Health Research, Darwin, Northern Territory, Australia, 6.University of Tasmania, Tasmania, Australia, 7. COMBINE - Australian ComputationalBiology and Bioinformatics Student Society, Australia, 8. Monash University, Victoria,Australia, 9. Sydney Informatics Hub, University of Sydney, New South Wales, Austr

In [15]:
# Trying `max_marginal_relevance_search`, with 4 results
search_res = db.similarity_search_with_relevance_scores(
    question, 
    k=5
)

# printing the documents with some info.
for res, score in search_res:
    print(f"Document: {res.metadata['source']}, page: {res.metadata['page']}")
    print(f"Relevance score [0, 1]: {score}\n")
    print(f"Content:\n{res.page_content}\n")

Document: data/1_Australian_bioinformatics_training_needs_survey_2021_22_Report.pdf, page: 0
Relevance score [0, 1]: 0.8085996971129107

Content:
Bioinformatics training needs of Australianresearchers: 2021/22 surveyMelissa L. Burke1,2,3Ann Backhaus4, Mariana Barnes5,Michael Charleston6, Tyrone Chen7,8,Tracy Chew9,  Jeffrey H. Christiansen1,2,3, Mark Crowe3,Deanna Deveson10, VictoriaPerreau11, Jingbo Wang12, Nathan Watson-Haigh13, ChristinaR. Hall1,111. Australian BioCommons, Australia, 2. Research Computing Centre, The University ofQueensland, Queensland, Australia, 3. Queensland Cyber Infrastructure Foundation,Queensland, Australia. 4. Pawsey Supercomputing Research Centre, Western Australia,Australia, 5. Menzies School of Health Research, Darwin, Northern Territory, Australia, 6.University of Tasmania, Tasmania, Australia, 7. COMBINE - Australian ComputationalBiology and Bioinformatics Student Society, Australia, 8. Monash University, Victoria,Australia, 9. Sydney Informatics Hub, U

Whichever search type we used, the query pulled out chunks solely from the Bioinformatics training report. 
This makes sense, given none of the PDFs cover this topic (although Michael's paper covers a 
bioinformatics workflow manager). 

The maximal marginal relevance search did a decent job of ensuring the chunks didn't overlap in much
in content, but ended up pulling more irrelevant chunks. It also didn't provide a score to judge how
relevant it thought the chunk was. It might be more useful if we had more documents covering the topic
of the question.

## Summarising the retrieved chunks with an LLM

Being able to get relevant chunks is already quite useful - especially if there are more than the 5
PDFs used here.

The next step is to use an LLM to summarize the information for us!

The steps will be:
* we need to integrate the retrieved documents into the prompt
* the LLM can answer the prompt
* we'll add the sources of information to the end of the LLMs answer

In [16]:
# We'll format the retrieved docs' content
# we're joining the content by a "---" delimiter and a couple spaces
formatted_docs = "\n\n---\n\n".join([doc.page_content for doc, _ in search_res])
print(formatted_docs)

Bioinformatics training needs of Australianresearchers: 2021/22 surveyMelissa L. Burke1,2,3Ann Backhaus4, Mariana Barnes5,Michael Charleston6, Tyrone Chen7,8,Tracy Chew9,  Jeffrey H. Christiansen1,2,3, Mark Crowe3,Deanna Deveson10, VictoriaPerreau11, Jingbo Wang12, Nathan Watson-Haigh13, ChristinaR. Hall1,111. Australian BioCommons, Australia, 2. Research Computing Centre, The University ofQueensland, Queensland, Australia, 3. Queensland Cyber Infrastructure Foundation,Queensland, Australia. 4. Pawsey Supercomputing Research Centre, Western Australia,Australia, 5. Menzies School of Health Research, Darwin, Northern Territory, Australia, 6.University of Tasmania, Tasmania, Australia, 7. COMBINE - Australian ComputationalBiology and Bioinformatics Student Society, Australia, 8. Monash University, Victoria,Australia, 9. Sydney Informatics Hub, University of Sydney, New South Wales, Australia, 10.Monash Bioinformatics Platform, Biomedicine Discovery Institute, Faculty of Medicine,Nursing a

In [17]:
# Let's create a new template
# the variables between the curly brackets are placeholders
# to substitute our question and chunks of content.
template = """Context: {context}

Answer the below question, given the above context: 

Question: {question}

Answer:
"""
prompt = PromptTemplate.from_template(template)

# creating a new chain
chain = prompt | hf

In [18]:
answer = chain.invoke({"question": question, "context": formatted_docs})

# append the source documents and scores to the end of the response
answer += '\n\n'
for doc, score in search_res:
    answer += f'* Document: {doc.metadata["source"]}\n    * Page: {doc.metadata["page"]}\n    * Relevance [0, 1]: {score}\n'
print(answer)

The bioinformatics training needs in Australia continue to grow, as evidenced by the results of the 2021/22 Australian Bioinformatics Training Needs Survey. Respondents expressed a high level of importance for various bioinformatics training topics, including:

1. Basic computing skills like Linux and scripting (Python, R)
2. Data management and metadata
3. Integrating multiple data types
4. Scaling analysis to cloud environments

Since the last survey in 2016, several new topics have emerged, reflecting advancements in technology and methodologies. These include:

- Single-cell RNA sequencing (scRNA-seq)
- Machine learning techniques

Additionally, the pandemic has altered how life scientists work collaboratively, which may influence the type of training they seek.

However, despite increased interest in bioinformatics training, many respondents still express concerns about accessing relevant training locally. This highlights the need for continuous evaluation of training needs and im

In [19]:
# delete the db contents
db.delete_collection()