# SBML Model RAG System for Background Knowledge Extraction

This notebook demonstrates how to extract background knowledge for the species in an SBML model using a RAG (Retrieval-Augmented Generation) approach built on LangChain and OpenAI.

**What this notebook does:**
1. Loads and processes a PDF containing background information
2. Extracts key concepts/keywords from the PDF
3. Parses an SBML model to identify species
4. Uses RAG to generate background knowledge for each species
5. Returns two lists (species background knowledge and PDF keywords), together with a `species_stats` value describing the model's species
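
The retrieval part of step 4 can be illustrated without any external service. The sketch below is not the code in `sbml_rag_utils`: it swaps the embedding-based similarity search for a plain word-overlap score so that it runs with the standard library alone. The function names `tokenize`, `retrieve`, and `build_prompt` are hypothetical, and the passages are illustrative rather than taken from the PDF.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, chunks, k=2):
    """Return the k chunks that share the most tokens with the query.

    Assumption: this word-overlap score is only a stand-in for the
    vector-store similarity search a real RAG pipeline would use.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_tokens & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(species_name, context_chunks):
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        f"Using only the context below, describe the species '{species_name}'.\n"
        f"Context:\n{context}"
    )

# Illustrative passages (hypothetical, not extracted from the notebook's PDF).
chunks = [
    "IL6 binds its receptor and recruits gp130.",
    "The model has several compartments.",
    "Soluble gp130 (sgp130) inhibits trans-signaling.",
]
top = retrieve("sgp130", chunks, k=1)
prompt = build_prompt("sgp130", top)
```

In a full pipeline, `prompt` would be passed to the chat model, and the returned text would become the `background` field for that species.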

## 1. Install Required Packages

In [1]:
# Uncomment to install required packages if not already installed
# !pip install langchain langchain-openai langchain-community faiss-cpu python-libsbml pypdf

## 2. Import the SBML RAG Utilities

In [2]:
# Import the utilities from our custom module
import os
import sys

import pandas as pd

# Directory containing sbml_rag_utils.py; replace with the path on your machine
sys.path.append('/Users/U1013680/workplace/projects/inigoo18/AIAgents4Pharma/notebook')
from sbml_rag_utils import process_sbml_and_pdf

## 3. Setup File Paths and API Key

In [3]:
# Read the OpenAI API key from the OPENAI_API_KEY environment variable
openai_api_key = os.environ.get("OPENAI_API_KEY")

# Fail early if the key is missing or empty
if not openai_api_key:
    raise ValueError("The OPENAI_API_KEY environment variable is not set or is empty")

# Define file paths
sbml_file_path = "./data/Dwivedi_Model537_empty.xml"   # Replace with your SBML file path
pdf_file_path = "./data/psp201364a.pdf"  # Replace with your PDF file path

# Verify files exist
assert os.path.exists(sbml_file_path), f"SBML file not found at {sbml_file_path}"
assert os.path.exists(pdf_file_path), f"PDF file not found at {pdf_file_path}"
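
The `assert` checks above are skipped when Python runs with the `-O` flag, and `os.path.exists` also accepts directories. A minimal alternative, assuming a hypothetical helper named `require_file`, raises an explicit exception instead:

```python
from pathlib import Path

def require_file(path, label):
    """Return `path` as a Path, or raise FileNotFoundError if it is not a regular file."""
    candidate = Path(path)
    if not candidate.is_file():
        raise FileNotFoundError(f"{label} file not found at {candidate}")
    return candidate

# Possible usage in this notebook (left commented out so the sketch has no side effects):
# sbml_file_path = require_file("./data/Dwivedi_Model537_empty.xml", "SBML")
# pdf_file_path = require_file("./data/psp201364a.pdf", "PDF")
```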

## 4. Process the SBML Model and PDF

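Now we run the main processing function to extract the species background knowledge and the PDF keywords. The call sets `max_species=2`, so only the first two species are processed even though the model contains more.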

In [5]:
# Process the SBML model and PDF
species_backgrounds, keywords, species_stats = process_sbml_and_pdf(
    sbml_file_path=sbml_file_path,
    pdf_file_path=pdf_file_path,
    api_key=openai_api_key,
    max_species=2
)

Analyzing SBML model species...
Found 44 species in the model
Species are distributed across 4 compartments
Loading and processing PDF...
Extracting keywords from PDF...
Parsing SBML model for species...
Limiting processing to 2 species (out of 44 total)
Extracting background information for 2 species...
Processing species 1/2: IL6
Processing species 2/2: sgp130
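
The log above reports the number of species and the number of compartments they occupy. The sketch below shows one way such counts could be derived from an SBML file. It is an illustration only: the notebook installs `python-libsbml`, so `sbml_rag_utils` presumably relies on that library, whereas this example uses the standard-library XML parser on a small made-up model. The function names and the toy species and compartments are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy SBML document (hypothetical; not the model used in this notebook).
SBML_SNIPPET = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy">
    <listOfCompartments>
      <compartment id="serum"/>
      <compartment id="liver"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="s1" name="IL6" compartment="serum"/>
      <species id="s2" name="sgp130" compartment="serum"/>
      <species id="s3" name="CRP" compartment="liver"/>
    </listOfSpecies>
  </model>
</sbml>"""

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]

def species_stats(sbml_text):
    """Return the list of species and a per-compartment species count."""
    root = ET.fromstring(sbml_text)
    species = [
        {
            "id": element.get("id"),
            "name": element.get("name") or element.get("id"),
            "compartment": element.get("compartment"),
        }
        for element in root.iter()
        if local_name(element.tag) == "species"
    ]
    per_compartment = Counter(entry["compartment"] for entry in species)
    return species, per_compartment

species, per_compartment = species_stats(SBML_SNIPPET)
print(f"Found {len(species)} species in the model")
print(f"Species are distributed across {len(per_compartment)} compartments")
```

Matching on the local tag name keeps the sketch independent of the SBML level and version, since each level uses a different XML namespace.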


## 5. Display the Results

### 5.1 PDF Keywords

In [6]:
print("PDF Keywords:")
for i, keyword in enumerate(keywords, 1):
    print(f"{i}. {keyword}")

PDF Keywords:
1. Crohn's disease
2. Interleukin-6 (IL-6)
3. T-cells


### 5.2 Species Background Knowledge

In [7]:
# Create a DataFrame for better visualization
species_df = pd.DataFrame(species_backgrounds)
species_df

| | id | name | original_name | compartment | background |
|---|---|---|---|---|---|
| 0 | mwf626e95e_543f_41e4_aad4_c6bf60ab345b | IL6 | IL6 | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | Interleukin-6 (IL-6) is a cytokine that plays ... |
| 1 | mwbbbce920_e8dd_4320_9386_fc94bfb2fc99 | sgp130 | sgp130 | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | Based on the provided context from the documen... |


### 5.3 Detailed Species Information

In [8]:
# Display detailed information for each species
for i, species in enumerate(species_backgrounds, 1):
    print(f"\n{'='*80}\n{i}. {species['name']} (ID: {species['id']})\n{'='*80}")
    print(species['background'])


1. IL6 (ID: mwf626e95e_543f_41e4_aad4_c6bf60ab345b)
Interleukin-6 (IL-6) is a cytokine that plays a significant role in the immune system and has been implicated in various inflammatory diseases, including Crohn’s disease. In the context of Crohn's disease, IL-6 is identified as an important factor contributing to enhanced T-cell survival and resistance to apoptosis in the lamina propria, which is a part of the intestinal mucosa. This activity is associated with increased chemokine secretion, which can exacerbate the inflammatory response.

IL-6 signaling can occur via two pathways: the classical signaling pathway and the trans-signaling pathway. The classical pathway is mediated by the membrane-bound IL-6 receptor (IL-6Rα), while the trans-signaling pathway involves a soluble form of the IL-6 receptor (sIL-6Rα). Both pathways involve the recruitment of a membrane-bound gp130 coreceptor, forming a complex that ultimately leads to the activation of Janus kinase (Jak) proteins and signa
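
Each background is a long, multi-paragraph string, so printing it raw produces very wide lines. A small sketch, assuming a hypothetical helper named `format_background`, wraps every paragraph to a fixed width with the standard-library `textwrap` module while keeping the blank lines between paragraphs:

```python
import textwrap

def format_background(text, width=80):
    """Wrap each paragraph to `width` columns, separated by one blank line."""
    paragraphs = [part.strip() for part in text.split("\n\n") if part.strip()]
    return "\n\n".join(textwrap.fill(part, width=width) for part in paragraphs)

# Placeholder text (not model output) to demonstrate the wrapping.
sample = (
    "First paragraph with enough words to need wrapping at a narrow width."
    "\n\n"
    "Second paragraph."
)
wrapped = format_background(sample, width=30)
print(wrapped)
```

In the loop above, `print(format_background(species['background']))` could replace the plain `print` call.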

## 6. Save Results to Files (Optional)

In [9]:
# Save species background information to CSV
species_df.to_csv("species_backgrounds.csv", index=False)

# Save keywords to a text file (explicit UTF-8 so non-ASCII characters are written consistently)
with open("pdf_keywords.txt", "w", encoding="utf-8") as f:
    f.write(", ".join(keywords))

print("Results saved to 'species_backgrounds.csv' and 'pdf_keywords.txt'")

Results saved to 'species_backgrounds.csv' and 'pdf_keywords.txt'
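
Joining the keywords with `", "` cannot be reversed reliably if a keyword itself contains a comma. When the results need to be reloaded later, JSON keeps both lists intact. The sketch below assumes hypothetical helpers `save_results` and `load_results` and uses placeholder data rather than the notebook's actual output:

```python
import json
import tempfile
from pathlib import Path

def save_results(species_backgrounds, keywords, path):
    """Write both result lists to a single UTF-8 JSON file."""
    payload = {"species_backgrounds": species_backgrounds, "keywords": keywords}
    Path(path).write_text(
        json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
    )

def load_results(path):
    """Read the two result lists back from the JSON file."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    return payload["species_backgrounds"], payload["keywords"]

# Round-trip demonstration with placeholder data in a temporary directory.
demo_species = [
    {"id": "s1", "name": "IL6", "background": "Line one.\n\nLine two, with a comma."}
]
demo_keywords = ["Crohn's disease", "Interleukin-6 (IL-6)", "T-cells"]
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "rag_results.json"
    save_results(demo_species, demo_keywords, out_path)
    loaded_species, loaded_keywords = load_results(out_path)
```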


## 7. Return the Required Lists

The two output lists, `species_backgrounds` and `keywords`, are shown below:

In [10]:
# List 1: Species background knowledge
species_backgrounds

[{'id': 'mwf626e95e_543f_41e4_aad4_c6bf60ab345b',
  'name': 'IL6',
  'original_name': 'IL6',
  'compartment': 'mw53ffe9e6_beef_45c4_90a5_a79197ed506e',
  'background': "Interleukin-6 (IL-6) is a cytokine that plays a significant role in the immune system and has been implicated in various inflammatory diseases, including Crohn’s disease. In the context of Crohn's disease, IL-6 is identified as an important factor contributing to enhanced T-cell survival and resistance to apoptosis in the lamina propria, which is a part of the intestinal mucosa. This activity is associated with increased chemokine secretion, which can exacerbate the inflammatory response.\n\nIL-6 signaling can occur via two pathways: the classical signaling pathway and the trans-signaling pathway. The classical pathway is mediated by the membrane-bound IL-6 receptor (IL-6Rα), while the trans-signaling pathway involves a soluble form of the IL-6 receptor (sIL-6Rα). Both pathways involve the recruitment of a membrane-boun

In [11]:
# List 2: PDF keywords
keywords

["Crohn's disease", 'Interleukin-6 (IL-6)', 'T-cells']