### Based on:
https://docs.llamaindex.ai/en/stable/presentations/materials/2024-02-28-rag-bootcamp-vector-institute/?h=rag

- If using Ollama for the LLM and embeddings, use this notebook as is with `USE_OPENAI = False`.
- If using OpenAI for the LLM and embeddings, use `llamaindex_using_pickledata.ipynb` instead, since it avoids requesting the embeddings again.

In [96]:
import os
import nest_asyncio

nest_asyncio.apply()

In [97]:
USE_OPENAI = False

In [98]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

if USE_OPENAI:
    Settings.llm = OpenAI(model="gpt-3.5-turbo", api_key=os.getenv('OPENAI_API_KEY'))
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
else:
    Settings.llm = Ollama(model="llama3:instruct")
    Settings.embed_model = OllamaEmbedding(
        model_name="llama3:instruct",
        base_url="http://localhost:11434",
        ollama_additional_kwargs={"mirostat": 0},
    )

In [100]:
"""Load the data.

With llama-index, before any transformations are applied,
data is loaded in the `Document` abstraction, which is
a container that holds the text of the document.
"""

from llama_index.core import SimpleDirectoryReader

# Alternatively, pass input_dir="./data" instead of input_files.
loader = SimpleDirectoryReader(
    input_files=[
        "./data/idpp.pdf",
        "./data/metagpt.pdf",
        "./data/state_of_the_union.txt",
    ]
)
documents = loader.load_data()

In [101]:
# if you want to see what the text looks like
# print (documents[0].text[:100])

In [102]:
"""Chunk, Encode, and Store into a Vector Store.

To streamline the process, we can make use of the IngestionPipeline
class, which applies your specified transformations to the
Documents.
"""

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        Settings.embed_model,
    ],
    vector_store=vector_store,
)
_nodes = pipeline.run(documents=documents, num_workers=4)
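
To give a feel for what the chunking step does before embeddings are computed, here is a minimal standard-library sketch of sentence-aware chunking. It is not the algorithm `SentenceSplitter` uses (that class measures chunk size in tokens and supports overlap between chunks); the `chunk_by_sentence` helper and its character-based `max_chars` limit are assumptions made purely for illustration.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Heath was a fighter. Danielle found purpose. We are going to do better."
for chunk in chunk_by_sentence(sample, max_chars=45):
    print(repr(chunk))
```

Keeping sentences intact is the design choice being illustrated: a chunk that ends mid-sentence tends to embed and read worse than one that ends on a boundary.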



In [103]:
# if you want to see the nodes
print (len(_nodes))
# print (_nodes[0].text)

59


In [104]:
"""Create a llama-index... wait for it... Index.

After uploading your encoded documents into your vector
store of choice, you can connect to it with a VectorStoreIndex
which then gives you access to all of the llama-index functionality.
"""

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

In [105]:
"""Retrieve relevant documents against a query.

With our Index ready, we can now query it to
retrieve the most relevant document chunks.
"""

retriever = index.as_retriever(similarity_top_k=2)
retrieved_nodes = retriever.retrieve("What did the president say about Justice Breyer?")
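
Conceptually, `similarity_top_k=2` asks for the two stored chunks whose embeddings score highest against the query embedding. The sketch below illustrates that idea with cosine similarity over toy 3-dimensional vectors. The vectors, the chunk names, and the `top_k` helper are made up for illustration; in the notebook the actual similarity metric and search are handled by the vector store.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 2):
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, not produced by any model).
toy_store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.6, 0.8, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
for name, score in results:
    print(name, round(score, 3))
```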

In [106]:
# to view the two retrieved nodes
print (retrieved_nodes[0].text)
print ("================")
print (retrieved_nodes[1].text)

And he loved building Legos with their daughter. But cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body.

Danielle says Heath was a fighter to the very end. He didn’t know how to stop fighting, and neither did she.

Through her pain, she found purpose to demand that we do better. Tonight, Danielle, we are going to do better.

The VA — the VA is pioneering new ways of linking toxic exposures to disease, already helping more veterans get benefits. And tonight, I’m announcing we’re expanding eligibility to veterans suffering from nine respiratory cancers.

I’m also calling on Congress to pass a law to make sure veterans devastated by toxic exposure in Iraq and Afghanistan finally get the benefits and the comprehensive healthcare they deserve.

And fourth and last, let’s end cancer as we know it. This is personal. This is personal to me and to Jill and to Kamala and so many of you. So many of you have lost someone you love — husband, wife, son, daughter, mom, dad.



In [107]:
"""Context-Augmented Generation.

With our Index ready, we can create a QueryEngine
that handles the retrieval and context augmentation
in order to get the final response.
"""

query_engine = index.as_query_engine(similarity_top_k=2)

In [108]:
# to inspect the default prompt being used
print(
    query_engine.get_prompts()[
        "response_synthesizer:text_qa_template"
    ].default_template.template
)

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 
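
The two placeholders in this template are filled in at query time: `{context_str}` with the text of the retrieved chunks and `{query_str}` with the question. A rough standard-library sketch of that substitution follows. The `build_prompt` helper and the blank-line separator between chunks are assumptions for illustration, not a description of llama-index internals.

```python
# Template text copied from the printed default prompt above.
TEXT_QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(chunks: list[str], query: str) -> str:
    """Join the retrieved chunks and substitute them, with the query, into the template."""
    context_str = "\n\n".join(chunks)  # separator is an assumption
    return TEXT_QA_TEMPLATE.format(context_str=context_str, query_str=query)


prompt = build_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What did the president say about Justice Breyer?",
)
print(prompt)
```

Because the template tells the model to rely on the context and not on prior knowledge, the quality of the answer depends directly on which chunks end up in `context_str`.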


In [109]:
response = query_engine.query("What were the models tried for predicting ALS progression in the idpp paper?.")
print (response)
# According to the provided context, several machine learning algorithms for regression as well as a Long Short-Term Memory (LSTM) neural network were explored to model the temporal dependencies in the sequential sensor data. The naive model was also tried, which simply carries the last observed value forward. Additionally, ElasticNet and Lasso models were used with different subsets of Task1 and Task2 data for training.
# The models tried for predicting ALS progression in the iDPP paper included a naive model that carried the last observed value forward, various Machine Learning algorithms for regression, and a Long Short-Term Memory (LSTM) neural network to model the temporal dependencies in the sequential sensor data.


According to the provided context, various machine learning algorithms for regression, as well as a Long Short-Term Memory (LSTM) neural network, were implemented to model the temporal dependencies in the sequential sensor data. The specific models tried include:

* Naive model
* ElasticNet model
* Lasso model
* FS Model Q1-Q12

These models were used to predict ALSFRS-R scores assigned by medical professionals using sensor data collected via a dedicated app, as part of Task 1 in the iDPP@CLEF 2024 competition.


In [110]:
response = query_engine.query("What was the validation strategy used by the authors in the idpp paper?.")
print (response)
# The validation strategy used by the authors in the iDPP paper is not explicitly mentioned. However, based on the provided context, it can be inferred that the authors evaluated the performance of their models using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) metrics, as identified by the challenge organizers. The authors also report the results of their experiments, comparing the performance of different models, which suggests a hold-out set or cross-validation approach was used for model evaluation.
# The authors in the idpp paper used a nested k-fold cross-validation strategy for their validation. This strategy consisted of two loops - an inner loop and an outer loop. In the inner loop, a test set containing 10% of complete patient data was set aside, while the remaining data underwent further k-fold cross-validation. This was adapted to ensure that each patient's complete set of observations was included in both the training and validation sets. The outer loop repeated the same procedure on another test set covering another 10% of the patients. This process was repeated for 10 iterations in the outer loop to compute RMSE for model selection. The best hyperparameters were chosen based on these iterations, and all the data was fit using those hyperparameters prior to the final submission.


The validation strategy used by the authors in the iDPP paper is not explicitly mentioned. However, based on the provided context, it can be inferred that the authors evaluated the performance of their models using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) metrics, as identified by the challenge organizers. The authors also report the results of their experiments, comparing the performance of different models, which suggests a hold-out set or cross-validation approach was used for model evaluation.


In [111]:
response = query_engine.query("Which model performed the best with lowest RMSE in the idpp paper?.")
print (response)
# According to Table 2 in the provided context, the best-performing model for each of the 12 questions in ALSFRS-R scores is denoted by a green box. However, it's not specified which model performed the best overall with the lowest RMSE.
# The ElasticNet + Naive model performed the best with the lowest RMSE in the iDPP paper.

According to Table 2 in the provided context, the best-performing model for each of the 12 questions in ALSFRS-R scores is denoted by a green box. However, it's not specified which model performed the best overall with the lowest RMSE.

To answer your query, I would recommend looking at the text or other tables in the paper to find the overall best-performing model. Alternatively, you could try contacting the authors of the paper for more information on their results.


In [112]:
response = query_engine.query("What did the president say about Justice Breyer")
print (response)
# I apologize, but there is no mention of a president or Justice Breyer in the provided context. The text appears to be discussing a study on predicting ALSFRS-R scores based on time-series sensor data and does not contain any information related to politics or justices. Therefore, I cannot provide an answer to this query as it is outside the scope of the given context.
# The president expressed gratitude and appreciation for Justice Breyer's service to the country, acknowledging his dedication as an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court.


I'm happy to help! However, I must point out that there is no mention of a president or Justice Breyer in the provided context information. The text appears to be discussing a study on predicting ALSFRS-R scores based on time-series sensor data and machine learning models. Therefore, it's not possible for me to provide an answer about what the president said about Justice Breyer, as this information is not present in the given context.
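
When an answer says the context does not mention the topic, as it does here, it can help to check which chunks were actually retrieved. The sketch below assumes the response object exposes a `source_nodes` list whose items carry a `score`, a `text`, and a `metadata` dict that may contain a `file_name` key. The `summarize_sources` helper is hypothetical, and the stub objects stand in for a real response so that the snippet runs on its own.

```python
from types import SimpleNamespace


def summarize_sources(source_nodes, preview_chars: int = 60) -> list[str]:
    """Build one summary line per retrieved chunk: source file, score, text preview."""
    lines = []
    for item in source_nodes:
        file_name = item.metadata.get("file_name", "<unknown>")
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        # Collapse whitespace so multi-line chunks fit on one line.
        preview = " ".join(item.text.split())[:preview_chars]
        lines.append(f"{file_name} (score={score}): {preview}")
    return lines


# Stub objects for illustration only; with a real response,
# pass response.source_nodes instead (assumed attribute).
stub_nodes = [
    SimpleNamespace(
        score=0.8123,
        text="Predicting ALSFRS-R scores\nfrom sensor data.",
        metadata={"file_name": "idpp.pdf"},
    ),
    SimpleNamespace(score=None, text="Some other chunk.", metadata={}),
]
for line in summarize_sources(stub_nodes):
    print(line)
```

If every summary line points at the wrong file, the problem lies in retrieval (embeddings or `similarity_top_k`) and not in the generation step.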


In [113]:
response = query_engine.query("How do agents share information with other agents in MetaGPT?")
print (response)
# According to the provided context, MetaGPT is a meta-programming framework for multi-agent collaboration based on Large Language Models (LLMs). It has well-defined functions like role definition and message sharing, making it a useful platform for developing LLM-based multi-agent systems. However, it does not explicitly mention how agents share information with other agents.
# Agents share information with other agents by utilizing a shared message pool. This shared message pool allows all agents to exchange messages directly. Agents publish their structured messages in the pool and can also access messages from other entities transparently. This system enables any agent to retrieve necessary information directly from the shared pool without having to inquire about other agents and wait for their responses, ultimately enhancing communication efficiency.


The MetaGPT framework enables multi-agent collaboration by providing well-defined functions for role definition and message sharing. This allows agents to effectively share information and work together towards a common goal.


In [116]:
import pickle

# Store the vector store's nodes, which hold the actual embeddings, so we don't have to call OpenAIEmbedding again every time we run this (cost savings).
# with open('models/openAI_idpp_metagpt_state/_nodes.pickle', 'wb') as handle:
#     pickle.dump(_nodes, handle, protocol=pickle.HIGHEST_PROTOCOL)
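
A minimal sketch of the save/load round trip described in the comment above, using a temporary directory and plain dictionaries in place of the real `_nodes` list. The stand-in data and the file location are illustrative assumptions. As usual with `pickle`, only load files you created yourself.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the real node list (hypothetical data for illustration).
stand_in_nodes = [
    {"text": "chunk one", "embedding": [0.1, 0.2, 0.3]},
    {"text": "chunk two", "embedding": [0.4, 0.5, 0.6]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "_nodes.pickle"

    # Save once, after the (possibly paid) embedding step.
    with open(pickle_path, "wb") as handle:
        pickle.dump(stand_in_nodes, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Later runs load the file instead of recomputing embeddings.
    with open(pickle_path, "rb") as handle:
        restored_nodes = pickle.load(handle)

print(len(restored_nodes))
```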