# SBML Model RAG System for Background Knowledge Extraction

This notebook demonstrates how to extract background knowledge for the species in an SBML model using Retrieval-Augmented Generation (RAG) with LangChain and OpenAI.

**What this notebook does:**
1. Loads and processes a PDF containing background information
2. Extracts key concepts/keywords from the PDF
3. Parses an SBML model to identify species
4. Uses RAG to generate background knowledge for each species
5. Returns two lists: species background knowledge and PDF keywords
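
To make step 4 concrete, here is a minimal, standard-library sketch of the retrieve-then-prompt pattern behind RAG. It is not the implementation in `sbml_rag_utils`: it ranks passages by keyword overlap as a stand-in for the embedding similarity search a FAISS index would normally provide, and it stops at building the prompt rather than calling a model. The function names and the sample passages are hypothetical and are not taken from the PDF.

```python
import re


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, top_k=2):
    """Return up to top_k chunks that share the most tokens with the query."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop chunks that share no token with the query at all.
    return [chunk for score, chunk in scored[:top_k] if score > 0]


def build_prompt(species_name, context_chunks):
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return (
        "Using only the context below, summarize what is known about "
        f"the species '{species_name}'.\n\nContext:\n{context}"
    )


# Made-up sample passages standing in for chunks of the PDF.
passages = [
    "IL6 binds the soluble receptor sR.",
    "CRP is produced in the liver.",
    "The antibody Ab binds sR.",
]
context = retrieve("IL6", passages, top_k=1)
print(build_prompt("IL6", context))
```

In a full pipeline the prompt would be sent to a chat model and the answer stored as that species' background text.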

## 1. Install Required Packages

First, let's make sure we have all the required packages installed.

In [1]:
# Uncomment to install required packages if not already installed
# !pip install langchain langchain-openai langchain-community faiss-cpu python-libsbml pypdf
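
Before running the rest of the notebook, it can help to confirm that the packages above are importable. The sketch below uses only the standard library; the helper name `find_missing_modules` is hypothetical. Note that some import names differ from the pip package names (`faiss-cpu` installs `faiss`, `python-libsbml` installs `libsbml`).

```python
from importlib.util import find_spec

# Import names (not pip names) for the packages listed in the install cell.
REQUIRED_MODULES = [
    "langchain",
    "langchain_openai",
    "langchain_community",
    "faiss",    # installed by faiss-cpu
    "libsbml",  # installed by python-libsbml
    "pypdf",
]


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in module_names if find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```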

## 2. Import the SBML RAG Utilities

In [2]:
# Standard-library and third-party imports
import os
import sys

import pandas as pd

# Make the directory containing sbml_rag_utils.py importable.
# Adjust this path to match your local checkout.
sys.path.append('/Users/U1013680/workplace/projects/inigoo18/AIAgents4Pharma/notebook')
from sbml_rag_utils import process_sbml_and_pdf

## 3. Setup File Paths and API Key

In [3]:
# Read the OpenAI API key from the environment
# (set OPENAI_API_KEY before starting the notebook server)
openai_api_key = os.environ.get("OPENAI_API_KEY")

# Fail early if the key is missing or empty
if not openai_api_key:
    raise ValueError("The OPENAI_API_KEY environment variable is not set")

# Define file paths
sbml_file_path = "./data/Dwivedi_Model537_empty.xml"   # Replace with your SBML file path
pdf_file_path = "./data/psp201364a.pdf"  # Replace with your PDF file path

# Verify files exist (explicit checks, since asserts are skipped under python -O)
for label, path in (("SBML", sbml_file_path), ("PDF", pdf_file_path)):
    if not os.path.exists(path):
        raise FileNotFoundError(f"{label} file not found at {path}")

## 4. Process the SBML Model and PDF

Now we'll run the main processing function to extract species background knowledge and PDF keywords.

In [4]:
# Process the SBML model and PDF
species_backgrounds, keywords = process_sbml_and_pdf(
    sbml_file_path=sbml_file_path,
    pdf_file_path=pdf_file_path,
    api_key=openai_api_key,
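    # Note: the log below still reports 44 species being processed,
    # so check how process_sbml_and_pdf applies this limit.
    max_species=10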
)

Analyzing SBML model species...
Found 44 species in the model
Species are distributed across 4 compartments
Loading and processing PDF...
Extracting keywords from PDF...
Parsing SBML model for species...
Extracting background information for 44 species...
Processing species 1/44: IL6
Processing species 2/44: sgp130
Processing species 3/44: sR_IL6_sgp130
Processing species 4/44: CRP
Processing species 5/44: sR
Processing species 6/44: sR_IL6
Processing species 7/44: Ab
Processing species 8/44: Ab_sR
Processing species 9/44: Ab_sR_IL6
Processing species 10/44: CRP Suppression (%)
Processing species 11/44: CRP (% of baseline)
Processing species 12/44: gp130
Processing species 13/44: R_IL6_gp130
Processing species 14/44: sR_IL6 (mw88ca8d9a_f5cf_41bf_9d9d_fc48f6e1a19e, #2)
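
The species and compartment counts reported in the log can be checked independently of `sbml_rag_utils`. The sketch below is one way to do that with only the standard library's XML parser. It is an assumption-laden illustration, not the module's own parsing code (which presumably relies on `python-libsbml`): the function name `list_species` is hypothetical, and the embedded SBML snippet, including its compartment names, is made up for the demo.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def list_species(sbml_text):
    """Return one dict (id, name, compartment) per <species> element."""
    root = ET.fromstring(sbml_text)
    species = []
    for element in root.iter():
        # Compare the local tag name so any SBML level/version namespace works.
        if element.tag.rsplit("}", 1)[-1] == "species":
            species.append({
                "id": element.get("id"),
                "name": element.get("name") or element.get("id"),
                "compartment": element.get("compartment"),
            })
    return species


# Made-up SBML fragment used only to exercise the function.
SAMPLE_SBML = """
<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="demo">
    <listOfCompartments>
      <compartment id="serum"/>
      <compartment id="liver"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="s1" name="IL6" compartment="serum"/>
      <species id="s2" name="CRP" compartment="liver"/>
      <species id="s3" name="Ab" compartment="serum"/>
    </listOfSpecies>
  </model>
</sbml>
"""

species = list_species(SAMPLE_SBML)
per_compartment = Counter(s["compartment"] for s in species)
print(f"Found {len(species)} species in the model")
print(f"Species are distributed across {len(per_compartment)} compartments")
```

To run it against a real model, pass the file's contents instead of `SAMPLE_SBML`, for example the text read from `sbml_file_path`.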


## 5. Display the Results

### 5.1 PDF Keywords

In [None]:
print("PDF Keywords:")
for i, keyword in enumerate(keywords, 1):
    print(f"{i}. {keyword}")

### 5.2 Species Background Knowledge

In [None]:
# Create a DataFrame for better visualization
species_df = pd.DataFrame(species_backgrounds)
species_df

### 5.3 Detailed Species Information

In [None]:
# Display detailed information for each species
for i, species in enumerate(species_backgrounds, 1):
    print(f"\n{'='*80}\n{i}. {species['name']} (ID: {species['id']})\n{'='*80}")
    print(species['background'])

## 6. Save Results to Files (Optional)

In [None]:
# Save species background information to CSV
species_df.to_csv("species_backgrounds.csv", index=False)

# Save keywords to a text file (explicit encoding so non-ASCII terms survive on any platform)
with open("pdf_keywords.txt", "w", encoding="utf-8") as f:
    f.write(", ".join(keywords))

print("Results saved to 'species_backgrounds.csv' and 'pdf_keywords.txt'")
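
A comma-joined text file cannot be split back reliably if a keyword itself contains a comma. As an alternative, the sketch below stores both result lists in a single JSON file and reads them back unchanged. The helper names and the sample data are hypothetical; the dictionary keys `id`, `name`, and `background` follow the ones used in section 5.3.

```python
import json
import os
import tempfile


def save_results_json(species_backgrounds, keywords, path):
    """Write both result lists to one JSON file."""
    payload = {"species_backgrounds": species_backgrounds, "keywords": keywords}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)


def load_results_json(path):
    """Read the two result lists back from a file written by save_results_json."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return payload["species_backgrounds"], payload["keywords"]


# Made-up sample data with the same shape as the notebook's results.
sample_species = [{"id": "s1", "name": "IL6", "background": "Example text."}]
sample_keywords = ["IL-6", "target-mediated disposition, nonlinear"]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "results.json")
    save_results_json(sample_species, sample_keywords, json_path)
    loaded_species, loaded_keywords = load_results_json(json_path)

print(loaded_keywords)
```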

## 7. Return the Required Lists

The notebook produces two lists: the species background knowledge (`species_backgrounds`) and the PDF keywords (`keywords`).

In [None]:
# List 1: Species background knowledge
species_backgrounds

In [None]:
# List 2: PDF keywords
keywords