### IRIS GraphRAG Demo

This notebook demonstrates how to use the IRIS Vector Search capabilities in a GraphRAG application.

The following cell installs all the requirements. The Jupyter image should already include them, but running this cell anyway is advised, just to be safe.

In [3]:
! pip install -U langchain_community arxiv tiktoken langchainhub pymilvus langchain langgraph tavily-python sentence-transformers langchain-milvus langchain-ollama langchain-huggingface beautifulsoup4 langchain-experimental neo4j json-repair langchain-openai iris flask setuptools==69.0.3

Collecting langchain_community
  Using cached langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting pymilvus
  Using cached pymilvus-2.5.10-py3-none-any.whl.metadata (5.7 kB)
Collecting langchain
  Using cached langchain-0.3.25-py3-none-any.whl.metadata (7.8 kB)
Collecting langgraph
  Downloading langgraph-0.4.7-py3-none-any.whl.metadata (6.8 kB)
Collecting tavily-python
  Downloading tavily_python-0.7.3-py3-none-any.whl.metadata (7.0 kB)
Collecting sentence-transformers
  Using cached sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting langchain-milvus
  Using cached langchain_milvus-0.1.10-py3-none-any.whl.metadata (3.7 kB)
Collecting langchain-ollama
  Using cached langchain_ollama-0.3.3-py3-none-any.whl.metadata (1.5 kB)
Collecting langchain-huggingface
  Using cached langchain_huggingface-0.2.0-py3-none-any.whl.metadata (941 bytes)
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.13.4-py3-none-any.whl.metadata (3.8 kB)
Collecting js

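The next cell performs basic setup for the LangChain application: it silences warnings, sets the data path, and turns off LangChain's debug and verbose output.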

In [4]:
from dotenv import load_dotenv
from langchain.globals import set_verbose, set_debug
import os
import warnings

warnings.simplefilter("ignore")

max_papers=30

data_path="/home/jovyan/workspace/data/"

#load_dotenv()

# Set langchain variables
set_debug(False)
set_verbose(False)

Set your OpenAI API key here. It is used by the LLM in the rest of the notebook.

In [None]:
### LLM
import os
os.environ["OPENAI_API_KEY"] = "openai_key"

gpt4omini = "gpt-4o-mini"

model = gpt4omini

For our project we are using immunology and clinical trials papers from arXiv. This data has already been converted into graphs and exported as CSV files. Below our example you will find the code we used to produce these files.

The following code cell connects to our IRIS container, which has the Vector Search code. If you are running this locally, please enter the information for your own IRIS server.

In [5]:
# load iris module
import iris
import pandas as pd

import warnings

# change these variables to reflect your connection
hostname = "iris"
port = 1972
namespace = "IRISAPP"
username = "SuperUser"
password = "SYS"

# connect
connection = iris.connect("{:}:{:}/{:}".format(hostname, port, namespace), username, password)
irispy = iris.createIRIS(connection)

RuntimeError: <COMMUNICATION LINK ERROR> Failed to connect to server; Details: <COMMUNICATION LINK ERROR> Failed to connect to server; Details: Error code: -1 Error message: <SSL Error> 
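
If you run the notebook against different servers, the connection settings could be read from environment variables instead of being edited in the cell. The following is only a sketch: the `IRIS_*` variable names and the `iris_connection_args` helper are assumptions made for illustration, and the defaults simply mirror the values in the cell above.

```python
import os

def iris_connection_args(env=os.environ):
    """Build the arguments for iris.connect from environment variables.

    The IRIS_* variable names are hypothetical; defaults mirror the notebook cell.
    """
    hostname = env.get("IRIS_HOSTNAME", "iris")
    port = int(env.get("IRIS_PORT", "1972"))
    namespace = env.get("IRIS_NAMESPACE", "IRISAPP")
    username = env.get("IRIS_USERNAME", "SuperUser")
    password = env.get("IRIS_PASSWORD", "SYS")
    # Same "host:port/namespace" string that the notebook passes to iris.connect
    return f"{hostname}:{port}/{namespace}", username, password

connection_string, username, password = iris_connection_args()
print(connection_string)
# The result would then be used as in the cell above:
# connection = iris.connect(connection_string, username, password)
```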

In [2]:
print(iris.__file__)



/opt/conda/lib/python3.10/site-packages/iris/__init__.py


This cell loads the data into IRIS, together with the embeddings for the documents and entities.

In [None]:

docsfile = '/home/irisowner/dev/CSV/papers300.csv'
relationsfile = '/home/irisowner/dev/CSV/relations300.csv'
entitiesfile = '/home/irisowner/dev/CSV/entities300.csv'
entitiesembeddingsfile = '/home/irisowner/dev/CSV/entities_embeddings300.csv'
papersembeddingsfile = '/home/irisowner/dev/CSV/papers_embeddings300.csv'


irispy.classMethodValue("GraphKB.Documents", "LoadData", docsfile)
irispy.classMethodValue("GraphKB.Entity", "LoadData", entitiesfile)
irispy.classMethodValue("GraphKB.Relations", "LoadData", relationsfile)
irispy.classMethodValue("GraphKB.DocumentsEmbeddings", "LoadData", papersembeddingsfile)
irispy.classMethodValue("GraphKB.EntityEmbeddings", "LoadData", entitiesembeddingsfile)


Result Set idx: 0

1

Finally, here are the functions which will perform the GraphRAG. Note that these are relatively simple implementations, and can be improved.

In [26]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
import ast

import os, sys

class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout

def extract_query_entities(query):
    # Few-shot prompt: the model should reply with a Python-style list of entity strings
    prompt_text = '''Based on the following example, extract entities from the user provided queries.
                Below are a number of example queries and their extracted entities. Provide only the entities.
                'How many wars was George Washington involved in' -> ['War', 'George Washington'].\n
                'What are the relationships between the employees' -> ['relationships','employees'].\n

                For the following query, extract entities as in the above example.\n query: {content}'''

    llm = ChatOpenAI(temperature=0, model_name=model)
    prompt = ChatPromptTemplate.from_template(prompt_text)
    chain = prompt | llm | StrOutputParser()
    response = chain.invoke({"content": query})
    # literal_eval raises if the reply is not a valid Python literal
    return ast.literal_eval(response)

def global_query(query, items=50,vector_search=10, batch_size = 10):
    with HiddenPrints():
        # Integer division keeps the two item counts whole numbers
        docs = irispy.classMethodValue("GraphKB.Query","Search",query,items//2,items//2)
        docs = docs.split('\n\r\n')
    
    answers = []
    
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i+batch_size]
        response = llm_answer_for_batch(batch, query)
        answers.append(response)

    return llm_answer_summarize(query, answers)

def ask_query(query, items = 10, method='local'):
    with HiddenPrints():
        # Integer division keeps the two item counts whole numbers
        docs = [irispy.classMethodValue("GraphKB.Query","Search",query,items//2,items//2)]
        
    response = llm_answer_for_batch(docs, query, False)
    return response


def llm_answer_summarize(query, answers):
    llm = ChatOpenAI(temperature=0, model_name=model)
    prompt_text = """You are an assistant for question-answering tasks. 
    Use the following answers to a query derived from analyzing batches of documents. Please compile these answers into one overall answer. If you don't know the answer, just say that you don't know. 
    Question: {question}  
    Previous Answers: {answers}
    Answer: 
    """
    prompt = ChatPromptTemplate.from_template(prompt_text)
    chain = prompt | llm | StrOutputParser()
    response = chain.invoke({"question": query, 'answers':answers})
    return response
    
    
    
def llm_answer_for_batch(batch, query, cutoff=True):
    llm = ChatOpenAI(temperature=0, model_name=model)
    prompt_text = """You are an assistant for question-answering tasks. 
    Use the following pieces of retrieved context from a graph database to answer the question. If you don't know the answer, just say that you don't know. 
    """ + (("Use three sentences maximum and keep the answer concise:") if cutoff else " ") + """
    Question: {question}  
    Graph Context: {graph_context}
    Answer: 
    """
    prompt = ChatPromptTemplate.from_template(prompt_text)
    chain = prompt | llm | StrOutputParser()
    response = chain.invoke({"question": query, 'graph_context':batch})
    return response
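
`extract_query_entities` parses the model's reply with `ast.literal_eval`, which raises an error whenever the reply is not a valid Python literal (for example when the model adds a prefix or drops a quote). A more forgiving parser might look like the sketch below. The `parse_entity_list` helper is hypothetical and not part of the notebook, and the regular-expression fallback is a heuristic rather than a complete solution.

```python
import ast
import re

def parse_entity_list(response):
    """Parse an LLM reply that should look like "['A', 'B']" into a list of strings."""
    text = response.strip()
    # Keep only the bracketed part if the model wrapped the list in extra text
    start, end = text.find("["), text.rfind("]")
    if start != -1 and end > start:
        text = text[start:end + 1]
    try:
        parsed = ast.literal_eval(text)
        if isinstance(parsed, (list, tuple)):
            return [str(item) for item in parsed]
    except (ValueError, SyntaxError):
        pass
    # Fallback: collect anything that looks like a quoted string,
    # tolerating a missing closing quote
    return re.findall(r"""['"]([^'"\]\[,]+)['"]?""", text)

print(parse_entity_list("Entities: ['HIV', 'vaccine']"))
print(parse_entity_list("['relationships','employees]"))
```

Returning an empty list for unparseable replies lets the caller decide what to do, instead of failing inside the parser.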



### Example

Here we show an example of running the GraphRAG algorithm on our loaded data. First, we use the ask_query function, which retrieves the requested number of relevant documents and performs RAG with them. The most relevant documents are identified using Vector Search, as well as by traversing the created graph.
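
The retrieval itself happens inside IRIS (the `GraphKB.Query` class), and its implementation is not shown in this notebook. Purely as a conceptual illustration of combining vector similarity with graph traversal, the toy sketch below ranks documents by cosine similarity and then adds their one-hop graph neighbours. All names and data in it are made up, and it should not be read as a description of what IRIS does internally.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, doc_vecs, edges, top_k=2):
    """Rank documents by cosine similarity, then add one-hop neighbours of the top hits."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    seeds = ranked[:top_k]
    result = list(seeds)
    # Treat edges as undirected: follow them from either end
    for source, target in edges:
        for a, b in ((source, target), (target, source)):
            if a in seeds and b not in result:
                result.append(b)
    return result

# Toy embeddings and a single graph edge (illustrative values only)
doc_vecs = {
    "paper_a": [1.0, 0.0],
    "paper_b": [0.9, 0.1],
    "paper_c": [0.0, 1.0],
    "paper_d": [0.1, 1.0],
}
edges = [("paper_a", "paper_c")]
print(retrieve([1.0, 0.0], doc_vecs, edges, top_k=2))
```

In this toy run, paper_c is not similar to the query but is still returned because it is linked to paper_a, which is the kind of result that similarity search alone would miss.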

In [33]:
print(ask_query("what data do you have", items = 10))

The most significant papers in immunology summarized from the provided abstracts are:

1. **Longitudinal Evaluation of T and B Cell Immunity to SARS-CoV-2**: This paper presents a detailed longitudinal study of immune responses in a single individual over a year following vaccination. It emphasizes the insights gained from personal biological sample analysis, contributing to understanding vaccine-elicited immunity.

2. **Vaccine Development and Immune Response Markers**: This research discusses the challenges in developing effective vaccines for high-burden diseases like HIV. It proposes a nonparametric methodology for estimating the impact of immune response markers on infection probabilities, enhancing the evaluation of immune responses in clinical trials.

3. **Diagnostic Algorithm for HIV Treatment Monitoring**: This paper critiques existing WHO guidelines for monitoring HIV treatment effectiveness and proposes a new diagnostic algorithm that optimizes the use of viral load testing

### Note

You may notice that if we search for many items (say 100), the retrieved text may be too large for the LLM's context window, and we won't be able to get a good result.

The 'global_query' function thus takes advantage of another GraphRAG concept. It groups the retrieved documents into batches, answers the question for each batch, and then summarizes the batch answers into one final answer. This means that we can now ask for as many items as we think may be necessary without running into context issues.
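
The batching idea can be shown in isolation, without IRIS or an LLM. The sketch below mirrors the loop in `global_query`, but takes the two LLM steps as function arguments; `batched_answer` and the two `fake_*` stand-ins are hypothetical names used only for this illustration.

```python
def batched_answer(query, docs, answer_batch, summarize, batch_size=10):
    """Answer the query once per batch of documents, then merge the partial answers."""
    partial_answers = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        partial_answers.append(answer_batch(batch, query))
    return summarize(query, partial_answers)

# Stand-ins for the two LLM calls (llm_answer_for_batch and llm_answer_summarize)
def fake_answer_batch(batch, query):
    return f"{len(batch)} docs"

def fake_summarize(query, answers):
    return " | ".join(answers)

docs = [f"doc {i}" for i in range(25)]
# 25 documents with batch_size=10 are split into batches of 10, 10 and 5
print(batched_answer("example question", docs, fake_answer_batch, fake_summarize))
```

Each LLM call only ever sees `batch_size` documents, so the context size is bounded by the batch size rather than by the total number of retrieved items.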

In [28]:
print(global_query("What are the next directions in HIV treatment", items = 50))

The next directions in HIV treatment involve several key areas of focus. Firstly, there is an emphasis on optimizing clinical trial designs through advanced data analysis and the use of synthetic clinical trial data to enhance patient outcomes. This includes developing diagnostic algorithms that selectively utilize viral load testing to monitor treatment effectiveness, particularly in resource-limited settings. 

Additionally, there is ongoing research into personalized treatment regimens that take into account immunologic and virologic parameters, which may improve the efficacy of structured treatment interruptions. Another important direction is the development of effective preventive vaccines, which, despite slow progress, hold significant potential for public health impact. Furthermore, leveraging historical clinical trial data and advanced automated coding systems is expected to enhance data interoperability, thereby accelerating research and drug development. Collectively, these 

This ends the example. Feel free to play around with the above cell and ask more questions about the dataset we are using.

### Data processing

Below is the code we used to download the arXiv data, create graphs from the data, and then export the graphs into CSV files that can be imported into IRIS. The code below can be modified to make use of other types of data and documents.

(Note: The GraphKB classes are currently hard-coded to accept the format of the data we provide. We did not have time to write a more generalized implementation, but doing so should be relatively simple.)

In [27]:
import arxiv
import tarfile
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_milvus import Milvus
from langchain_community.embeddings import HuggingFaceEmbeddings

# Replace with your own search query if desired
search_query = "immunology OR 'clinical trials' OR 'neuroscience'"
max_papers = 30
max_results = max_papers

# Fetch papers from arXiv
client = arxiv.Client()
search = arxiv.Search(
    query=search_query, max_results=max_results, sort_by=arxiv.SortCriterion.Relevance
)

docs = []
for result in client.results(search):
    docs.append(
        {"title": result.title, "summary": result.summary, "url": result.entry_id, "authors": result.authors}
    )
docs_to_print = docs[:3]


# Print the details of each paper in the docs list
for i, doc in enumerate(docs_to_print, start=1):
    authors_str = ", ".join([str(author) for author in doc['authors']])  # Convert authors to strings
    print(f"Paper {i}:")
    print(f"Title: {doc['title']}")
    print(f"Summary: {doc['summary']}")
    print(f"URL: {doc['url']}")
    print(f"Authors: {authors_str}")  # Join the authors as a string
    print("-" * 50)  # Divider to separate the papers

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=2000, chunk_overlap=50
)
doc_splits = text_splitter.create_documents(
    [doc["summary"]+" "+doc["title"]+" "+str(doc["authors"]) for doc in docs], metadatas=docs
)

print(f"Number of papers: {len(docs)}")
print(f"Number of chunks: {len(doc_splits)}")


Paper 1:
Title: Bayesian Models and Decision Algorithms for Complex Early Phase Clinical Trials
Summary: An early phase clinical trial is the first step in evaluating the effects in
humans of a potential new anti-disease agent or combination of agents. Usually
called "phase I" or "phase I/II" trials, these experiments typically have the
nominal scientific goal of determining an acceptable dose, most often based on
adverse event probabilities. This arose from a tradition of phase I trials to
evaluate cytotoxic agents for treating cancer, although some methods may be
applied in other medical settings, such as treatment of stroke or immunological
diseases. Most modern statistical designs for early phase trials include
model-based, outcome-adaptive decision rules that choose doses for successive
patient cohorts based on data from previous patients in the trial. Such designs
have seen limited use in clinical practice, however, due to their complexity,
the requirement of intensive, computer-

Here we save the raw data as a CSV file, which will be loaded into an IRIS table in a later step.

In [28]:
# Save the file into the datapath
data_path="/home/jovyan/workspace/data/"
filename=data_path+"docs"+str(max_papers)+".csv"
with open(filename,"w") as file:
    print("docid|title|abstract|url|authors",file=file)
    # s[0] ("|") precedes the first author, s[1] (",") precedes every later author
    s="|,"
    for i,doc in enumerate(docs):
        abstract=doc['summary'].replace("\n",' ')
        title=doc['title']
        try:
            print(f"{i}|{title}|{abstract}|{doc['url']}",end="",file=file)
        except UnicodeEncodeError:
            err=1
        a=0
        for author in doc["authors"]:
            auth=str(author).replace('\u0107','').replace('\u0131','').replace('\u0142','').replace('\u016b','').replace('\u010d','')
            auth=auth.replace('\u0111','').replace('\u015f','')
            try:
                print(f"{s[a]}{auth}",end="",file=file)
                a=1
            except UnicodeEncodeError:
                err=2
        print(file=file)
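
The cell above strips a fixed set of non-ASCII characters from author names and silently skips anything that still cannot be encoded. A possible alternative, sketched below, is to write the file as UTF-8 and to make sure the field separator never appears inside a field. The `write_docs_file` helper is hypothetical, and whether the GraphKB loader accepts UTF-8 input is an assumption that would need to be checked.

```python
import io

def write_docs_file(docs, file):
    """Write one pipe-delimited row per paper: docid|title|abstract|url|authors."""
    def clean(value):
        # The pipe is the field separator, so it must not appear inside a field
        return str(value).replace("|", " ").replace("\n", " ")

    print("docid|title|abstract|url|authors", file=file)
    for docid, doc in enumerate(docs):
        authors = ",".join(clean(author) for author in doc["authors"])
        row = [str(docid), clean(doc["title"]), clean(doc["summary"]), clean(doc["url"]), authors]
        print("|".join(row), file=file)

# Made-up sample record with the same keys as the docs list above
sample = [{
    "title": "A | B",
    "summary": "line one\nline two",
    "url": "http://example.org/1",
    "authors": ["Ada Lovelace", "Alan Turing"],
}]
buffer = io.StringIO()
write_docs_file(sample, buffer)
print(buffer.getvalue())
```

For a real export the buffer would be replaced by something like `open(filename, "w", encoding="utf-8")`, which avoids the `UnicodeEncodeError` handling altogether.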


In [29]:
# GraphRAG Setup

from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_core.documents import Document
from langchain_experimental.llms.ollama_functions import OllamaFunctions
from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

graph_llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")

graph_transformer = LLMGraphTransformer(
    llm=graph_llm,
    allowed_nodes=["Paper", "Author", "Topic"],
    node_properties=["title", "summary", "url", "author"],
    allowed_relationships=["AUTHORED", "DISCUSSES", "RELATED_TO"],
)

graph_documents = graph_transformer.convert_to_graph_documents(doc_splits)

print(f"Graph documents: {len(graph_documents)}")
print(f"Nodes from 1st graph doc:{graph_documents[0].nodes}")
print(f"Relationships from 1st graph doc:{graph_documents[0].relationships}")


Graph documents: 30
Nodes from 1st graph doc:[Node(id='Early Phase Clinical Trial', type='Topic', properties={'summary': 'An early phase clinical trial is the first step in evaluating the effects in humans of a potential new anti-disease agent or combination of agents.', 'title': 'Bayesian Models and Decision Algorithms for Complex Early Phase Clinical Trials'}), Node(id='Peter F. Thall', type='Author', properties={'author': 'Peter F. Thall'})]
Relationships from 1st graph doc:[Relationship(source=Node(id='Bayesian Models And Decision Algorithms For Complex Early Phase Clinical Trials', type='Topic', properties={}), target=Node(id='Peter F. Thall', type='Author', properties={}), type='AUTHORED', properties={})]


In [31]:

filename=data_path+"entities"+str(max_papers)+".csv"
with open(filename,"w") as file:
    print("docid|entityid|type",file=file)
    for i, doc in enumerate(graph_documents):
        for node in doc.nodes:
            try:
                print(f"{i}|{node.id}|{node.type}",file=file)
            except UnicodeEncodeError:
                err=3


In [32]:
filename=data_path+"relations"+str(max_papers)+".csv"
with open(filename,"w") as file:
    print("docid|source|sourcetype|target|targettype|type",file=file)
    for i, doc in enumerate(graph_documents):
        for rel in doc.relationships:
            try:
                print(f"{i}|{rel.source.id}|{rel.source.type}|{rel.target.id}|{rel.target.type}|{rel.type}",file=file)
            except UnicodeEncodeError:
                err=4
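
Before importing the files into IRIS, it can be useful to sanity-check them. As a rough sketch, the relations file could be read back into an adjacency map with the standard `csv` module. The `load_adjacency` helper and the sample rows are made up for illustration; only the column names come from the header written above.

```python
import csv
import io
from collections import defaultdict

def load_adjacency(file):
    """Read docid|source|sourcetype|target|targettype|type rows into an adjacency map."""
    # QUOTE_NONE: the export above writes raw text without any quoting
    reader = csv.DictReader(file, delimiter="|", quoting=csv.QUOTE_NONE)
    adjacency = defaultdict(list)
    for row in reader:
        adjacency[row["source"]].append((row["type"], row["target"]))
    return dict(adjacency)

# Made-up rows in the same layout as the relations file
sample = io.StringIO(
    "docid|source|sourcetype|target|targettype|type\n"
    "0|Paper X|Paper|Author Y|Author|AUTHORED\n"
    "0|Paper X|Paper|Topic Z|Topic|DISCUSSES\n"
)
print(load_adjacency(sample))
```

With a real file, `sample` would be replaced by `open(filename, newline="")`. Rows whose fields contain a pipe character would be split incorrectly, which is itself a useful thing to detect at this stage.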

In [34]:
import sys
print(sys.executable)

/opt/conda/bin/python


In [35]:
conda env list


# conda environments:
#
base                     /opt/conda


Note: you may need to restart the kernel to use updated packages.


In [38]:
conda list --explicit > environment.yml


Note: you may need to restart the kernel to use updated packages.


