# SBML Model RAG System for Background Knowledge Extraction

This notebook demonstrates how to extract background knowledge for the species in an SBML model using a Retrieval-Augmented Generation (RAG) approach built on LangChain and OpenAI.

**What this notebook does:**
1. Loads and processes a PDF containing background information
2. Extracts key concepts/keywords from the PDF
3. Parses an SBML model to identify species
4. Uses RAG to generate background knowledge for each species
5. Returns two lists, the species background knowledge and the PDF keywords (the processing function also returns a third value, `species_stats`)
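
To make the retrieval step of RAG concrete, here is a minimal, dependency-free sketch. It is not how `sbml_rag_utils` works internally (that module is not shown here); it swaps the embedding-based vector store for a simple bag-of-words cosine similarity. The function names, the prompt wording, and the example chunks are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


def build_prompt(species_name, context_chunks):
    """Assemble the text a language model would receive (hypothetical wording)."""
    context = "\n---\n".join(context_chunks)
    return (
        f"Context:\n{context}\n\n"
        f"Using only the context above, describe the species '{species_name}'."
    )


# Made-up stand-ins for PDF chunks, for illustration only
chunks = [
    "CRP is produced by hepatocytes in response to IL-6 signaling.",
    "The model includes four compartments.",
    "Tocilizumab is an antibody that binds the IL-6 receptor.",
]
top = retrieve("CRP hepatocytes", chunks, k=1)
prompt = build_prompt("CRP", top)
```

In a real pipeline the similarity search would typically run over embeddings in a vector index such as FAISS, and the assembled prompt would be sent to the chat model.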

## 1. Install Required Packages

In [1]:
# Uncomment to install required packages if not already installed
# !pip install langchain langchain-openai langchain-community faiss-cpu python-libsbml pypdf pandas

## 2. Import the SBML RAG Utilities

In [2]:
# Standard-library and third-party imports
import os
import re
import sys

import pandas as pd

# Make the local utilities module importable (replace with the path to your own checkout)
sys.path.append('/Users/U1013680/workplace/projects/inigoo18/AIAgents4Pharma/notebook')
from sbml_rag_utils import process_sbml_and_pdf

## 3. Setup File Paths and API Key

In [3]:
# Read the OpenAI API key from the OPENAI_API_KEY environment variable
openai_api_key = os.environ.get("OPENAI_API_KEY")

# Fail early if the key is missing or empty
if not openai_api_key:
    raise ValueError("The OPENAI_API_KEY environment variable is not set")

# Define file paths
sbml_file_path = "./data/Dwivedi_Model537_empty.xml"   # Replace with your SBML file path
pdf_file_path = "./data/psp201364a.pdf"  # Replace with your PDF file path

# Verify files exist
assert os.path.exists(sbml_file_path), f"SBML file not found at {sbml_file_path}"
assert os.path.exists(pdf_file_path), f"PDF file not found at {pdf_file_path}"
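
Python skips `assert` statements when it runs with the `-O` flag, so the checks above can silently disappear in optimized runs. A possible alternative is an explicit check that raises `FileNotFoundError`. The helper name `require_file` is hypothetical, and the temporary file below only stands in for a real input.

```python
import tempfile
from pathlib import Path


def require_file(path, label):
    """Return the path as a Path, or raise FileNotFoundError if it is not a file."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"{label} file not found at {p}")
    return p


# Demonstrate both outcomes with a temporary file and a missing path
with tempfile.NamedTemporaryFile(suffix=".xml") as tmp:
    found = require_file(tmp.name, "SBML")

try:
    require_file("does_not_exist.pdf", "PDF")
    missing_error = None
except FileNotFoundError as exc:
    missing_error = str(exc)
```

In this notebook the calls would be `require_file(sbml_file_path, "SBML")` and `require_file(pdf_file_path, "PDF")`.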

## 4. Process the SBML Model and PDF

Now we'll run the main processing function to extract species background knowledge and PDF keywords.

In [4]:
# Process the SBML model and PDF
species_backgrounds, keywords, species_stats = process_sbml_and_pdf(
    sbml_file_path=sbml_file_path,
    pdf_file_path=pdf_file_path,
    api_key=openai_api_key,
    max_species=44
)

Analyzing SBML model species...
Found 44 species in the model
Species are distributed across 4 compartments
Loading and processing PDF...
Extracting keywords from PDF...
Parsing SBML model for species...
Extracting background information for 44 species...
Processing species 1/44: IL6
Processing species 2/44: sgp130
Processing species 3/44: sR_IL6_sgp130
Processing species 4/44: CRP
Processing species 5/44: sR
Processing species 6/44: sR_IL6
Processing species 7/44: Ab
Processing species 8/44: Ab_sR
Processing species 9/44: Ab_sR_IL6
Processing species 10/44: CRP Suppression (%)
Processing species 11/44: CRP (% of baseline)
Processing species 12/44: gp130
Processing species 13/44: R_IL6_gp130
Processing species 14/44: sR_IL6 (mw88ca8d9a_f5cf_41bf_9d9d_fc48f6e1a19e, #2)
Processing species 15/44: R
Processing species 16/44: IL6 (mw88ca8d9a_f5cf_41bf_9d9d_fc48f6e1a19e, #2)
Processing species 17/44: R_IL6
Processing species 18/44: Ractive
Processing species 19/44: STAT3
Processing species 2
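
The log lists species by display name, and some names (such as `IL6` and `sR_IL6`) occur more than once and carry a suffix that appears to distinguish them. The internals of `process_sbml_and_pdf` are not shown in this notebook, so the following is only a rough sketch of how species could be listed and duplicate names disambiguated. It uses the standard library rather than `python-libsbml`, and the inline model, its IDs, and the labelling scheme are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up SBML fragment for illustration; IDs and names are hypothetical
SBML_TEXT = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy">
    <listOfCompartments>
      <compartment id="serum"/>
      <compartment id="liver"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="s1" name="IL6" compartment="serum"/>
      <species id="s2" name="CRP" compartment="serum"/>
      <species id="s3" name="IL6" compartment="liver"/>
    </listOfSpecies>
  </model>
</sbml>
"""


def list_species(sbml_text):
    """Return one dict per <species> element, ignoring the XML namespace."""
    root = ET.fromstring(sbml_text)
    species = []
    for elem in root.iter():
        # Tags look like '{namespace}species'; compare only the local name
        if elem.tag.rsplit("}", 1)[-1] == "species":
            species.append({
                "id": elem.get("id"),
                "name": elem.get("name") or elem.get("id"),
                "compartment": elem.get("compartment"),
            })
    return species


def disambiguate(species):
    """Append the compartment to any name that occurs more than once."""
    counts = Counter(s["name"] for s in species)
    return [
        f"{s['name']} ({s['compartment']})" if counts[s["name"]] > 1 else s["name"]
        for s in species
    ]


species = list_species(SBML_TEXT)
labels = disambiguate(species)
```

For real models, `python-libsbml` is the more robust choice because it validates the document and handles every SBML level and version.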

## 5. Display the Results

### 5.1 PDF Keywords

In [5]:
print("PDF Keywords:")
for i, keyword in enumerate(keywords, 1):
    print(f"{i}. {keyword}")

PDF Keywords:
1. Crohns disease
2. IL-6
3. IL-6R
4. sIL-6R
5. CRP
6. gp130
7. Jak
8. STAT3
9. T-cells
10. hepatocytes
11. apoptosis
12. inflammation
13. GI tract
14. sgp130
15. tocilizumab
16. cytokines
17. Janus kinase
18. lamina propria
19. leukocytes
20. chemokine


### 5.2 Species Background Knowledge

In [6]:
# Create a DataFrame for better visualization
species_df = pd.DataFrame(species_backgrounds)
species_df

|   | id | name | original_name | compartment | background |
|---|---|---|---|---|---|
| 0 | mwf626e95e_543f_41e4_aad4_c6bf60ab345b | IL6 | IL6 | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | The provided context outlines the biological f... |
| 1 | mwbbbce920_e8dd_4320_9386_fc94bfb2fc99 | sgp130 | sgp130 | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | Based on the context provided, sgp130 is a bio... |
| 2 | mw810ff751_fa4e_4143_bd50_169b3e325e1e | sR_IL6_sgp130 | sR_IL6_sgp130 | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | Based on the provided context from the documen... |
| 3 | mw114aa90f_5f5b_4fe8_9406_361c8489b6a1 | CRP | CRP | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | The term 'CRP' within the context provided ref... |
| 4 | mw30ae63db_6cd3_4b6f_93ad_3350cd360bcc | sR | sR | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | Based on the provided context, the species nam... |
| 5 | mw03db56ac_8dc6_4931_ae82_fef706d2ee3d | sR_IL6 | sR_IL6 | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | The species labeled 'sR_IL6' in the context pr... |
| 6 | mwf345ed7a_0622_403c_b816_c8749a2c9ded | Ab | Ab | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | Based on the provided documents, there isn't d... |
| 7 | mw1da111f2_a036_4392_8512_015005bdcbb7 | Ab_sR | Ab_sR | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | The specific species 'Ab_sR' (ID: mw1da111f2_a... |
| 8 | mw9947742a_0e4b_4636_9a4b_b6eef2a8f6ac | Ab_sR_IL6 | Ab_sR_IL6 | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | The context provided does not directly specify... |
| 9 | CRP_Suppression___ | CRP Suppression (%) | CRP Suppression (%) | mw53ffe9e6_beef_45c4_90a5_a79197ed506e | Based on the context provided, "CRP Suppressio... |
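
The `background` column is cut off in the tabular view. One way to get a longer but still bounded preview is `textwrap.shorten` from the standard library. The sketch below uses made-up stand-in records with the same `name` and `background` keys as `species_backgrounds`; the helper name `preview` and the 60-character width are arbitrary choices.

```python
import textwrap

# Stand-in records with the same keys as `species_backgrounds`; values are made up
records = [
    {"name": "IL6", "background": "IL-6 is a cytokine involved in immune regulation. " * 3},
    {"name": "CRP", "background": "Short note."},
]


def preview(records, width=60):
    """Return one 'name: shortened background' line per record."""
    return [
        f"{r['name']}: {textwrap.shorten(r['background'], width=width, placeholder=' ...')}"
        for r in records
    ]


lines = preview(records)
```

When working with the DataFrame directly, `pd.set_option("display.max_colwidth", None)` is another option for showing full cell contents.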


### 5.3 Detailed Species Information

In [7]:
# Display detailed information for each species
for i, species in enumerate(species_backgrounds, 1):
    print(f"\n{'='*80}\n{i}. {species['name']} (ID: {species['id']})\n{'='*80}")
    print(species['background'])


1. IL6 (ID: mwf626e95e_543f_41e4_aad4_c6bf60ab345b)
The provided context outlines the biological function and role of IL-6, a cytokine important in the immune system. Based on the information:

1. **Biological Function**:
   IL-6 is a cytokine involved in immune regulation. It plays a crucial role in enhancing T-cell survival and resistance to apoptosis, particularly in the context of Crohn’s Disease, as well as in promoting chemokine secretion.

2. **Role in Pathways**:
   IL-6 signaling can proceed via two distinct pathways:
   - The classical pathway, which involves the membrane-bound IL-6 receptor (IL-6Rα).
   - The trans-signaling pathway, which involves a soluble IL-6 receptor (sIL-6Rα). In both pathways, IL-6 forms a complex with its respective receptor, which then recruits the gp130 coreceptor, leading to the formation of a hexameric receptor complex. This complex initiates phosphorylation of gp130-bound Janus kinase (Jak) proteins and subsequent activation of STAT3 (Signal Tr

## 6. Save Results to Files

In [8]:
# Save species background information to CSV
species_df.to_csv("species_backgrounds.csv", index=False)

# Save keywords to a text file
with open("pdf_keywords.txt", "w", encoding="utf-8") as f:
    f.write(", ".join(keywords))

print("Results saved to 'species_backgrounds.csv' and 'pdf_keywords.txt'")

Results saved to 'species_backgrounds.csv' and 'pdf_keywords.txt'
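
Joining keywords with `", "` works for the list above, but the boundaries are lost if a keyword ever contains a comma itself. A possible alternative is to store the list as JSON, which round-trips exactly. In this sketch the file name `pdf_keywords.json` is hypothetical, the example list is made up (its last entry contains a comma on purpose), and a temporary directory is used so nothing is written to the working directory.

```python
import json
import tempfile
from pathlib import Path

# Example list; the last entry is made up to contain a comma
keywords_example = ["IL-6", "GI tract", "Jak, Janus kinase"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "pdf_keywords.json"  # hypothetical file name
    out_path.write_text(json.dumps(keywords_example, indent=2), encoding="utf-8")
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))

# For comparison: the comma-joined form splits the last keyword in two
naive = ", ".join(keywords_example).split(", ")
```

The JSON version reloads as the same three-item list, while the comma-joined text splits back into four items.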


## 7. Return the Required Lists

The two requested lists, `species_backgrounds` and `keywords`, are shown below:

In [9]:
# List 1: Species background knowledge
species_backgrounds

[{'id': 'mwf626e95e_543f_41e4_aad4_c6bf60ab345b',
  'name': 'IL6',
  'original_name': 'IL6',
  'compartment': 'mw53ffe9e6_beef_45c4_90a5_a79197ed506e',
  'background': "The provided context outlines the biological function and role of IL-6, a cytokine important in the immune system. Based on the information:\n\n1. **Biological Function**:\n   IL-6 is a cytokine involved in immune regulation. It plays a crucial role in enhancing T-cell survival and resistance to apoptosis, particularly in the context of Crohn’s Disease, as well as in promoting chemokine secretion.\n\n2. **Role in Pathways**:\n   IL-6 signaling can proceed via two distinct pathways:\n   - The classical pathway, which involves the membrane-bound IL-6 receptor (IL-6Rα).\n   - The trans-signaling pathway, which involves a soluble IL-6 receptor (sIL-6Rα). In both pathways, IL-6 forms a complex with its respective receptor, which then recruits the gp130 coreceptor, leading to the formation of a hexameric receptor complex. Thi

In [10]:
# List 2: PDF keywords
keywords

['Crohns disease',
 'IL-6',
 'IL-6R',
 'sIL-6R',
 'CRP',
 'gp130',
 'Jak',
 'STAT3',
 'T-cells',
 'hepatocytes',
 'apoptosis',
 'inflammation',
 'GI tract',
 'sgp130',
 'tocilizumab',
 'cytokines',
 'Janus kinase',
 'lamina propria',
 'leukocytes',
 'chemokine']