# Summary

This notebook shows how to set up Microsoft GraphRAG, load a medical textbook, and then ask questions against that textbook.

*   Set up Microsoft GraphRAG
*   Load the Anatomy textbook
*   Ask questions from an Anatomy question set

The Anatomy book is sourced from https://huggingface.co/datasets/MedRAG/textbooks

The question set is sourced from https://github.com/Teddy-XiongGZ/MIRAGE/blob/main/rawdata/mmlu/data/test/anatomy_test.csv

# Install GraphRAG

In [None]:
! pip install graphrag

Collecting graphrag
  Downloading graphrag-0.5.0-py3-none-any.whl.metadata (6.1 kB)
Collecting aiofiles<25.0.0,>=24.1.0 (from graphrag)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting aiolimiter<2.0.0,>=1.1.0 (from graphrag)
  Downloading aiolimiter-1.2.0-py3-none-any.whl.metadata (4.5 kB)
Collecting azure-identity<2.0.0,>=1.17.1 (from graphrag)
  Downloading azure_identity-1.19.0-py3-none-any.whl.metadata (80 kB)
Collecting azure-search-documents<12.0.0,>=11.4.0 (from graphrag)
  Downloading azure_search_documents-11.5.2-py3-none-any.whl.metadata (23 kB)
Collecting azure-storage-blob<13.0.0,>=12.22.0 (from graphrag)
  Downloading azure_storage_blob-12.24.0-py3-none-any.whl.metadata (26 kB)
Collecting datashaper<0.0.50,>=0.0.49 (from graphrag)
  Downloading datashaper-0.0.49-py3-none-any.whl.metadata (3.7 kB)
Collecting devtools<0.13.0
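The log above shows graphrag 0.5.0 being downloaded. Command-line options and output file names may differ in other releases, so it can help to record which version is actually installed before running the later cells. A minimal sketch using only the standard library (the helper name `installed_version` is made up for this example):

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints None when graphrag is not installed in the current environment.
print("graphrag version:", installed_version("graphrag"))
```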

# Initialize the GraphRAG Workspace

In [None]:
! graphrag init --root ./ragtest


Initializing project at /content/ragtest


# Create folder for input files

In [None]:
import os

os.makedirs("ragtest/input", exist_ok=True)
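The notebook does not show the step that copies the textbook into `ragtest/input`. One possible approach is sketched below. It assumes the book has been downloaded as a JSON Lines file in which each record carries its text in a `content` field. The field name, the function name, and the `chunk_` file prefix are assumptions for illustration and should be checked against the dataset.

```python
import json
import os


def write_chunks_as_text_files(jsonl_path, output_dir, text_field="content"):
    """Write each JSON Lines record to its own .txt file and return the file count.

    The text_field default is an assumption about the dataset layout.
    """
    os.makedirs(output_dir, exist_ok=True)
    count = 0
    with open(jsonl_path, "r", encoding="utf-8") as source:
        for line in source:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            out_path = os.path.join(output_dir, f"chunk_{count:05}.txt")
            with open(out_path, "w", encoding="utf-8") as target:
                target.write(record[text_field])
            count += 1
    return count
```

A call such as `write_chunks_as_text_files("anatomy.jsonl", "ragtest/input")` would then populate the input folder, where `anatomy.jsonl` is a hypothetical local file name.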

# Update your settings.yaml file



*   Go to the ./ragtest folder and open settings.yaml
*   Update all instances of api_key with your API key from OpenAI
*   Update the LLM model to one you have access to from OpenAI. For this task I used gpt-4o
*   Update the embedding LLM model to one you have access to from OpenAI. For this task I used text-embedding-3-small
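The api_key step above can also be scripted rather than edited by hand. The sketch below rewrites every `api_key:` line in a block of YAML text. The function name `set_api_key` and the sample settings text are made up for illustration; the real settings.yaml generated by `graphrag init` may be laid out differently, so inspect the result before indexing.

```python
import re

# Matches an uncommented "api_key:" line and captures the indentation plus key name.
API_KEY_LINE = re.compile(r"^([ \t]*api_key:[ \t]*).*$", flags=re.MULTILINE)


def set_api_key(settings_text, api_key):
    """Return settings_text with the value of every api_key line replaced."""
    return API_KEY_LINE.sub(lambda match: match.group(1) + api_key, settings_text)


# Illustrative settings text only, not the real generated file.
sample_settings = (
    "llm:\n"
    "  api_key: PLACEHOLDER\n"
    "  model: gpt-4o\n"
    "embeddings:\n"
    "  llm:\n"
    "    api_key: PLACEHOLDER\n"
    "    model: text-embedding-3-small\n"
)

print(set_api_key(sample_settings, "YOUR_OPENAI_KEY"))
```

To apply it to the workspace, read `./ragtest/settings.yaml` into a string, pass it through `set_api_key`, and write the result back. Avoid committing the updated file, since it then contains the key in plain text.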



# Create the Knowledge Graph

In [None]:
! graphrag index --root ./ragtest


Streaming output truncated to the last 5000 lines.
GraphRAG Indexer
├── Loading Input (text) - 3017 files loaded (0 filtered) 100%
├── create_base_text_units
├── create_final_documents
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── create_final_relationships
├── create_final_text_units

# View the Communities

In [None]:
import pandas as pd

file_path = './ragtest/output/create_final_communities.parquet'
df = pd.read_parquet(file_path)
df.size  # total number of cells (rows x columns), not the number of rows


12950


# View the Community Reports

In [None]:
import pandas as pd

file_path = './ragtest/output/create_final_community_reports.parquet'
df = pd.read_parquet(file_path)
df.size  # total number of cells (rows x columns), not the number of rows


16848

# View the documents

In [None]:
import pandas as pd

file_path = './ragtest/output/create_final_documents.parquet'
df = pd.read_parquet(file_path)
df["title"].unique().size


3017

# View the entities

In [None]:
import pandas as pd

file_path = './ragtest/output/create_final_entities.parquet'
df = pd.read_parquet(file_path)
df.size  # total number of cells (rows x columns), not the number of rows

35652

# View the nodes

In [None]:
import pandas as pd

file_path = './ragtest/output/create_final_nodes.parquet'
df = pd.read_parquet(file_path)
df.size  # total number of cells (rows x columns), not the number of rows

139488

# View the relationships

In [None]:
import pandas as pd

file_path = './ragtest/output/create_final_relationships.parquet'
df = pd.read_parquet(file_path)
df.size  # total number of cells (rows x columns), not the number of rows

85968

# View the text units

In [None]:
import pandas as pd

file_path = './ragtest/output/create_final_text_units.parquet'
df = pd.read_parquet(file_path)
df.size  # total number of cells (rows x columns), not the number of rows

21119

# Global Queries

In [None]:
! graphrag query --root ./ragtest --method global --query "What is Anatomy? A - Study of the Body. B - Study of Animals. C - Study of Finance"



creating llm client with {'api_key': 'REDACTED,len=164', 'type': "openai_chat", 'model': 'gpt-4o', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'audience': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response:
### Understanding Anatomy

Anatomy is the study of the human body and its structure. It is a foundational discipline that is crucial for various healthcare professions. This field provides essential knowledge for understanding the human body, which is vital for medical education and practice [Data: Reports (0)].

### Conclusion

Based on the provided information, the correct answer is **A - Study of the Body**. Anatomy focuses specifically on the human body, di

# Local Queries

In [None]:
! graphrag query --root ./ragtest --method local --query "What is a LINGUAL ARTERY?"




INFO: Vector Store Args: {
    "type": "lancedb",
    "db_uri": "/content/ragtest/output/lancedb",
    "container_name": "==== REDACTED ====",
    "overwrite": true
}
creating llm client with {'api_key': 'REDACTED,len=164', 'type': "openai_chat", 'model': 'gpt-4o', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'audience': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=164', 'type': "openai_embedding", 'model': 'text-embedding-3-small', 'max_tokens': 4000, 'temperature': 0, 'top_p': 1, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'audience': None, 'deployment_name': None
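The global and local queries above differ only in the value passed to `--method`. A small helper can build either command from Python. This is a sketch: the function names are made up for this example, and it only accepts the two methods used in this notebook. Passing the arguments as a list means quotation marks inside the question do not need shell escaping.

```python
import subprocess

# The two search methods used in this notebook.
SUPPORTED_METHODS = ("local", "global")


def build_query_command(question, method="local", root="./ragtest"):
    """Return the graphrag query command as an argument list."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"method must be one of {SUPPORTED_METHODS}, got {method!r}")
    return ["graphrag", "query", "--root", root, "--method", method, "--query", question]


def run_query(question, method="local", root="./ragtest"):
    """Run the query and return whatever the CLI printed to standard output."""
    completed = subprocess.run(
        build_query_command(question, method, root),
        capture_output=True,
        text=True,
    )
    return completed.stdout


# Building the command does not call the CLI, so this line runs without graphrag installed.
print(build_query_command("What is a LINGUAL ARTERY?", method="local"))
```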

In [None]:
import os

os.makedirs("anatomy_questions", exist_ok=True)

# Create Question files from the anatomy test csv

In [None]:
# Read the test CSV file
# File sourced from https://github.com/Teddy-XiongGZ/MIRAGE/blob/main/rawdata/mmlu/data/test/anatomy_test.csv

import pandas as pd

# Path to a local copy of the question set
file_path = "anatomy_test.csv"

# The CSV has no header row: column 0 is the question, columns 1-4 are options A-D
df = pd.read_csv(file_path, header=None)

for index, row in df.iterrows():
    # One text file per question, numbered by row index
    output_file = f"anatomy_questions/test_anatomy-question_{index:03}.txt"

    # Write the question followed by its four lettered options
    with open(output_file, "w") as f:
        f.write(f"{row[0]}\n\n")
        f.write(f"A. {row[1]}\n")
        f.write(f"B. {row[2]}\n")
        f.write(f"C. {row[3]}\n")
        f.write(f"D. {row[4]}")




# Ask Anatomy Questions from GraphRAG

In [None]:
import os

os.makedirs("anatomy_answers", exist_ok=True)

In [None]:
import os
import subprocess

# Directories holding the question files and the GraphRAG answers
directory_path = "./anatomy_questions"
output_directory = "./anatomy_answers"

# Loop through each question file in a stable order
for file_name in sorted(os.listdir(directory_path)):
    file_path = os.path.join(directory_path, file_name)
    output_file_path = os.path.join(output_directory, file_name)
    # Skip anything that is not a regular file
    if not os.path.isfile(file_path):
        continue
    with open(file_path, "r") as file:
        question = file.read()
    # Pass the arguments as a list so quotes in the question cannot break the command
    result = subprocess.run(
        ["graphrag", "query", "--root", "./ragtest", "--method", "local", "--query", question],
        capture_output=True,
        text=True,
    )
    # Save the response text under the same file name as the question
    with open(output_file_path, "w") as out:
        out.write(result.stdout)
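The notebook does not show how the free-text responses in `anatomy_answers` were turned into the answer letters compared in the next section. One possible heuristic is sketched below. It assumes responses phrase their conclusion like the global query output earlier ("the correct answer is **A - Study of the Body**"). The pattern and the function name are assumptions, and responses that do not match should be reviewed by hand.

```python
import re

# "answer is" is matched case-insensitively; the letter itself must be an upper-case A-D
# followed by a word boundary, so words such as "Anterior" are not mistaken for an option.
ANSWER_PATTERN = re.compile(r"(?i:answer is)[:\s]*\**\s*\(?([A-D])\b")


def extract_answer_letter(response_text):
    """Return the first answer letter found in the response, or '' if none is found."""
    match = ANSWER_PATTERN.search(response_text)
    return match.group(1) if match else ""


example = "Based on the provided information, the correct answer is **A - Study of the Body**."
print(extract_answer_letter(example))
```

Returning an empty string for unmatched responses keeps the value comparable with the answer key without raising an error.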


# Compare Answers between GraphRAG and Question Key

In [None]:

# Compare the GraphRAG answers with the answer key
import pandas as pd
import numpy as np

# Show full DataFrames when displaying results
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.expand_frame_repr', False)

# Column 5 of the question set holds the correct answer letter
source = pd.read_csv("./anatomy_test.csv", header=None)
answer_key = source.iloc[:,5]

# answers.csv holds the GraphRAG answers; its second column is the answer letter
graph_rag_answers = pd.read_csv("./anatomy_answers_graph_rag/answers.csv")
graph_rag_answers = graph_rag_answers.replace(np.nan, '', regex=True)
graph_rag_answer_key = graph_rag_answers.iloc[:,1]

# Line the two answer columns up by row position and flag the matches
new_df = pd.DataFrame({'correct_answer': answer_key, 'graph_rag_answer': graph_rag_answer_key})
new_df['is_graph_rag_answer_correct'] = new_df['correct_answer'] == new_df['graph_rag_answer']
new_df['graph_rag_answer_comment'] = graph_rag_answers['Comments']

# Number of questions GraphRAG answered correctly
correct_count = new_df['is_graph_rag_answer_correct'].sum()
print(correct_count)
new_df.to_csv('./graph_rag_answers_comparison_to_answer_key.csv', index=False)


114
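The count printed above is easier to interpret as a fraction of the questions asked. A minimal sketch of that calculation on plain lists follows. The function name and the four sample answers are made up for illustration and are not results from this notebook.

```python
def score_answers(answer_key, predicted):
    """Return (correct, total, accuracy) for two equally long lists of answer letters."""
    if len(answer_key) != len(predicted):
        raise ValueError("answer_key and predicted must have the same length")
    correct = sum(1 for expected, actual in zip(answer_key, predicted) if expected == actual)
    total = len(answer_key)
    accuracy = correct / total if total else 0.0
    return correct, total, accuracy


# Illustrative values only; an empty string stands for a question with no extracted answer.
correct, total, accuracy = score_answers(["A", "B", "C", "D"], ["A", "B", "D", ""])
print(f"{correct} of {total} correct ({accuracy:.1%})")
```

With the DataFrame from the cell above, the same function could be called as `score_answers(list(new_df['correct_answer']), list(new_df['graph_rag_answer']))`.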
