## 1. Setup and Imports

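This notebook uses libraries for cloud data access (google.cloud.bigquery, with google_auth_oauthlib for the OAuth flow), logging (logging), data handling (pandas), process mining and visualization (pm4py), retrieval-augmented generation (pinecone, langchain, langchain_openai, langchain_pinecone, and langchain_community), and environment management (dotenv).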

In [1]:
# Import libraries for cloud data access
from google.cloud import bigquery
from google_auth_oauthlib.flow import InstalledAppFlow

In [2]:
# Import libraries for logging
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

In [3]:
# Import libraries for data handling
import pandas as pd

In [8]:
# Import libraries for process mining
import pm4py

In [48]:
# Import libraries for RAG (Retrieval-Augmented Generation)
from pinecone import Pinecone
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.chat_models import init_chat_model

In [51]:
# Import libraries for loading environment variables
import os
from dotenv import load_dotenv

load_dotenv()

True

## 2. Authentication Setup

Before accessing Google Cloud services, an OAuth 2.0 client ID must be created in the Google Cloud Console, and the Google account that signs in must hold the necessary IAM roles (e.g., BigQuery Admin). The client’s JSON secret file is stored locally, and its path is passed to InstalledAppFlow, which opens a browser window for user consent and returns the credentials used to initialize the BigQuery client.

In [9]:
# Configuration
project_id = "integration-of-pm-and-llms"
client_secret_path = "/Users/alejandromateocobo/Documents/PythonProjects/Integration_Of_LLMs_And_Process_Mining/keys/client_secret_316641064865-57id3o26obibotvs226jeevisjdujha5.apps.googleusercontent.com.json"

# Authentication flow
logging.info("Starting OAuth authentication...")
SCOPES = ["https://www.googleapis.com/auth/cloud-platform"]
flow = InstalledAppFlow.from_client_secrets_file(
    client_secret_path,
    scopes=SCOPES
)
credentials = flow.run_local_server(port=0)
client = bigquery.Client(credentials=credentials, project=project_id)

logging.info("Authentication successful.")
logging.info("BigQuery client initialized.")

2025-07-12 11:20:19,738 - INFO - Starting OAuth authentication...


Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=316641064865-57id3o26obibotvs226jeevisjdujha5.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A51350%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&state=8Tey3xSSvnwo7WprOFtv0TARLz4LLN&access_type=offline


2025-07-12 11:20:25,781 - INFO - "GET /?state=8Tey3xSSvnwo7WprOFtv0TARLz4LLN&code=4/0AVMBsJhSGxVbFmpp_ElFyf7G7dbsHlgK4JTq4S6BPwXslj4ZAwdCM9TQ3lZtSnvAy1Veqw&scope=https://www.googleapis.com/auth/cloud-platform HTTP/1.1" 200 65
2025-07-12 11:20:26,145 - INFO - Authentication successful.
2025-07-12 11:20:26,146 - INFO - BigQuery client initialized.


## 3. Query BigQuery

Google Cloud Platform (GCP) is a suite of cloud computing services for scalable storage, processing, and data analysis on Google’s infrastructure. To analyze clinical datasets such as MIMIC-III on GCP, users must create a Google account, set up a GCP project with billing, enable the necessary APIs, and configure OAuth client authentication to access cloud resources securely. The MIMIC-III database, which contains detailed health records from over 40,000 critical care patients, can be queried through Google BigQuery, the access method recommended by the MIT Laboratory for Computational Physiology.

In [39]:
query = """
        SELECT e.*
        FROM `integration-of-pm-and-llms.integration_of_pm_and_llms.filtered_eventlog` e
        INNER JOIN `physionet-data.mimiciii_clinical.icustays` icu
        ON e.icustay_id = icu.icustay_id
        WHERE e.icustay_id IN (211555, 290738, 236225, 213113)
        AND e.linksto = 'datetimeevents'
    """

# Check that the authentication cell (Section 2) created the client before running the query
if not client:
    raise RuntimeError("BigQuery client not initialized. Run the authentication cell in Section 2 first.")

# Run the query
logging.info("Running user-provided query...")
job = client.query(query)

logging.info("Waiting for query to complete...")
df = job.to_dataframe(progress_bar_type='tqdm')

logging.info(f"Query completed. Retrieved {len(df)} rows.")

2025-07-12 12:03:45,959 - INFO - Running user-provided query...
2025-07-12 12:03:47,160 - INFO - Waiting for query to complete...


Job ID 9da185b4-94aa-4bcd-bc0f-9ec8e1b66e7f successfully executed: 100%|██████████|
Downloading: 100%|██████████|

2025-07-12 12:03:48,928 - INFO - Query completed. Retrieved 1616 rows.





In [40]:
# Process the dataframe as needed
print(df.head())

   itemid  subject_id  hadm_id  icustay_id     event_timestamp  \
0  224295       47546   112012      236225 2113-04-18 18:09:00   
1  224282       97599   135263      213113 2142-01-11 00:00:00   
2  225766       97599   135263      213113 2142-01-01 00:00:00   
3  224279       97599   135263      213113 2142-01-20 06:00:00   
4  224290       97599   135263      213113 2142-01-17 16:45:00   

                               label                 category         linksto  
0  Cordis/Introducer Dressing Change  Access Lines - Invasive  datetimeevents  
1          Multi Lumen Tubing Change  Access Lines - Invasive  datetimeevents  
2              Sheath Insertion Date  Access Lines - Invasive  datetimeevents  
3        Multi Lumen Dressing Change  Access Lines - Invasive  datetimeevents  
4        Arterial line Tubing Change  Access Lines - Invasive  datetimeevents  


## 4. Prepare Event Log for PM4PY

In this step, the dataset is reformatted to match the structure expected by PM4PY, where each event log requires a case identifier, an activity name, and a timestamp. The data is then converted into a PM4PY event log object, which enables process mining algorithms to analyze the sequence of events across cases.

In [41]:
# Rename columns for PM4PY (using icustay_id as case identifier)
df_eventlog = df.rename(columns={
    "icustay_id": "case:concept:name",  # Update to use icustay_id as case identifier
    "event_timestamp": "time:timestamp",
    "label": "concept:name"
})

# Convert the timestamp column to datetime
df_eventlog["time:timestamp"] = pd.to_datetime(df_eventlog["time:timestamp"], errors="coerce")

# Format the dataframe using pm4py.utils.format_dataframe
df_eventlog = pm4py.utils.format_dataframe(
    df_eventlog,
    case_id='case:concept:name',
    activity_key='concept:name',
    timestamp_key='time:timestamp'
)

# Convert the dataframe to an event log object
event_log = pm4py.convert_to_event_log(df_eventlog)

# Print basic statistics
logging.info(f"Total cases: {len(set(df_eventlog['case:concept:name']))}")
logging.info(f"Total activities: {len(set(df_eventlog['concept:name']))}")
logging.info(f"Total events: {len(df_eventlog)}")

2025-07-12 12:03:57,004 - INFO - Total cases: 4
2025-07-12 12:03:57,009 - INFO - Total activities: 35
2025-07-12 12:03:57,018 - INFO - Total events: 1616
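Because the timestamp conversion above uses `errors="coerce"`, unparseable values silently become `NaT`. The following is a minimal sketch of a sanity check that could be run before formatting the dataframe; the helper name `check_event_log_frame` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# The three columns PM4PY expects, as named in the rename step above
REQUIRED_COLUMNS = ["case:concept:name", "concept:name", "time:timestamp"]


def check_event_log_frame(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: verify required columns and count unusable values."""
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return {
        "rows": len(frame),
        "null_timestamps": int(frame["time:timestamp"].isna().sum()),
        "null_activities": int(frame["concept:name"].isna().sum()),
    }


# Toy data standing in for the query result (values are made up)
toy = pd.DataFrame({
    "case:concept:name": ["A", "A", "B"],
    "concept:name": ["Insert line", "Change dressing", "Insert line"],
    "time:timestamp": ["2142-01-01 00:00:00", "not a date", "2142-01-02 08:30:00"],
})

# Same conversion as in the notebook: bad values are coerced to NaT
toy["time:timestamp"] = pd.to_datetime(toy["time:timestamp"], errors="coerce")

report = check_event_log_frame(toy)
print(report)
```

In the notebook itself the same check could be applied to `df_eventlog` right after the `pd.to_datetime` call, so that any coerced rows are noticed before discovery.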


In [42]:
# Preview the formatted event log
print(df_eventlog.head())

   itemid  subject_id  hadm_id case:concept:name            time:timestamp  \
0  226515       53232   115492            211555 2128-01-02 00:00:00+00:00   
1  224290       53232   115492            211555 2151-06-29 00:00:00+00:00   
2  224298       53232   115492            211555 2151-06-29 00:00:00+00:00   
3  224288       53232   115492            211555 2151-06-29 00:00:00+00:00   
4  224298       53232   115492            211555 2151-06-29 00:00:00+00:00   

                      concept:name                 category         linksto  \
0                    Date of Birth                      ADT  datetimeevents   
1      Arterial line Tubing Change  Access Lines - Invasive  datetimeevents   
2  Cordis/Introducer Tubing Change  Access Lines - Invasive  datetimeevents   
3     Arterial line Insertion Date  Access Lines - Invasive  datetimeevents   
4  Cordis/Introducer Tubing Change  Access Lines - Invasive  datetimeevents   

   @@index  @@case_index  
0        0             0  
1 

## 5. Discover and Visualize the Directly-Follows Graph (DFG) with PM4PY

In this step, the frequency-based Directly-Follows Graph (DFG) is discovered from the event log using PM4PY and visualized to understand the control-flow structure of the process. A Petri net is also discovered with the Inductive Miner and visualized. The DFG is then converted into a readable textual format to prepare it for further analysis with a Large Language Model (LLM).

In [46]:
# Discover the Directly-Follows Graph (DFG)
dfg, start_activities, end_activities = pm4py.discovery.discover_dfg(
    event_log,
    activity_key='concept:name',
    case_id_key='case:concept:name',
    timestamp_key='time:timestamp'
)
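To make the directly-follows relation concrete, the sketch below counts it by hand with pandas on toy data. It illustrates the idea only and is not how PM4PY implements discovery; the function name `directly_follows_counts` and the toy events are assumptions. When several events in a case share a timestamp, as happens in this dataset, the result depends on how ties are ordered, so counts may differ from those PM4PY reports.

```python
from collections import Counter

import pandas as pd


def directly_follows_counts(frame, case_col="case:concept:name",
                            act_col="concept:name", ts_col="time:timestamp"):
    """Hypothetical helper: count (a, b) pairs where b directly follows a in a case."""
    ordered = frame.sort_values([case_col, ts_col])
    counts = Counter()
    for _, activities in ordered.groupby(case_col, sort=False)[act_col]:
        sequence = activities.tolist()
        # Pair each activity with its immediate successor in the same case
        counts.update(zip(sequence, sequence[1:]))
    return dict(counts)


# Toy events (made up), deliberately out of order and without timestamp ties
toy_events = pd.DataFrame({
    "case:concept:name": ["A", "B", "A", "B", "A"],
    "concept:name": ["Change dressing", "Insert line", "Insert line",
                     "Change dressing", "Change dressing"],
    "time:timestamp": pd.to_datetime([
        "2142-01-05", "2142-01-02", "2142-01-01", "2142-01-04", "2142-01-03",
    ]),
})

toy_dfg = directly_follows_counts(toy_events)
print(toy_dfg)
```

Each key is a pair of activities and each value is how often the second activity immediately followed the first within one case, which is the same shape as the `dfg` dictionary returned above.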

In [None]:
# Visualize the DFG
pm4py.vis.view_dfg(
    dfg,
    start_activities,
    end_activities,
    format='png',       # You can also use 'svg' if supported
    bgcolor='white',
    rankdir='LR'        # Left-to-right layout
)

In [43]:
# Discover Petri net using the Inductive Miner
net, im, fm = pm4py.discovery.discover_petri_net_inductive(
    event_log,
    activity_key='concept:name',
    case_id_key='case:concept:name',
    timestamp_key='time:timestamp',
    multi_processing=True
)

In [None]:
# Visualize the Petri net
pm4py.vis.view_petri_net(net, im, fm, format='png')
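The conversion of the DFG into text is not shown in the cells above. One possible sketch, assuming the `dfg`, `start_activities`, and `end_activities` objects are plain dictionaries as returned by the discovery call, is given below; the function name `dfg_to_text` and the output wording are illustrative choices, not the notebook's actual format.

```python
def dfg_to_text(dfg, start_activities, end_activities):
    """Hypothetical helper: render a frequency DFG as plain text for an LLM prompt."""

    def by_count(item):
        key, count = item
        # Most frequent first; ties broken by name for a stable ordering
        return (-count, key)

    lines = ["Start activities:"]
    for activity, count in sorted(start_activities.items(), key=by_count):
        lines.append(f"- {activity} (count = {count})")

    lines.append("End activities:")
    for activity, count in sorted(end_activities.items(), key=by_count):
        lines.append(f"- {activity} (count = {count})")

    lines.append("Directly-follows relations:")
    for (source, target), count in sorted(dfg.items(), key=by_count):
        lines.append(f"- {source} -> {target} (frequency = {count})")

    return "\n".join(lines)


# Toy inputs (made up) with the same structure as the discovery output
example_dfg = {
    ("Insert line", "Change dressing"): 2,
    ("Change dressing", "Change dressing"): 1,
}
example_start = {"Insert line": 2}
example_end = {"Change dressing": 2}

dfg_text = dfg_to_text(example_dfg, example_start, example_end)
print(dfg_text)
```

Sorting by frequency puts the dominant transitions at the top, which helps if the text has to be truncated to fit a prompt.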

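## 6. LangChain

In this step, text files with background context are loaded, split into chunks of up to 1,000 characters, embedded with OpenAI's text-embedding-3-small model, and stored in a Pinecone index. For a given question, the 10 most similar chunks are retrieved and passed, together with the question, to the gpt-4o-mini chat model.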

In [53]:
# Load Pinecone and OpenAI keys
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Initialize Pinecone client
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(name="integration-of-pm-and-llms")

In [54]:
# Load documents from generated folder
loader = DirectoryLoader(
    "/Users/alejandromateocobo/Documents/PythonProjects/Integration_Of_LLMs_And_Process_Mining/data/context",
    glob="**/*.txt",
    loader_cls=TextLoader
)
documents = loader.load()

In [55]:
# Split text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunked_docs = text_splitter.split_documents(documents)

# Generate unique chunk IDs
ids = [f"doc_{i}" for i in range(len(chunked_docs))]

# Compute embeddings
embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY, model="text-embedding-3-small")

# Store documents in Pinecone
vector_store = PineconeVectorStore(index=index, embedding=embeddings)
vector_store.add_documents(documents=chunked_docs, ids=ids)

logging.info("Documents successfully embedded and stored in Pinecone.")

2025-07-12 13:33:38,579 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-07-12 13:33:40,353 - INFO - Documents successfully embedded and stored in Pinecone.
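The positional IDs above (`doc_0`, `doc_1`, ...) change meaning whenever the source folder or the chunking changes, so a re-run may overwrite unrelated vectors or leave stale ones behind. A possible alternative, sketched here as an assumption rather than the notebook's approach, derives each ID from the chunk's source path and content; the helper name `make_chunk_id` is hypothetical.

```python
import hashlib


def make_chunk_id(source: str, text: str) -> str:
    """Hypothetical helper: derive a stable ID from a chunk's source path and content."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    # A 16-hex-character prefix keeps IDs short; this length is an arbitrary choice
    return f"doc_{digest[:16]}"


# Toy chunks (made up) as (source path, text) pairs
chunks = [
    ("context/mimic.txt", "MIMIC-III is a large, freely-available database."),
    ("context/mimic.txt", "It includes vital signs and laboratory results."),
]

chunk_ids = [make_chunk_id(source, text) for source, text in chunks]
print(chunk_ids)

# With the notebook's objects this could look like the following (not executed here):
# ids = [make_chunk_id(d.metadata.get("source", ""), d.page_content) for d in chunked_docs]
```

With this scheme, identical chunks from the same file map to the same ID and therefore collapse into one vector, which may or may not be desirable.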


In [58]:
llm = init_chat_model("gpt-4o-mini", model_provider="openai")
query = "What is the most common workflow for patients in the ICU??"

retrieved_docs = vector_store.similarity_search(query, k=10)
docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

prompt = f"""You are a helpful and knowledgeable assistant specialized in process mining. Your role is to analyze process-related data, models, and documentation to provide accurate and concise answers.

Here’s a user question: {query}

Here is some extracted documentation and context:

{docs_content}

Now please answer the question as clearly and precisely as possible:
"""

print(prompt)
answer = llm.invoke(prompt)
print(answer)

2025-07-12 13:40:04,621 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


You are a helpful and knowledgeable assistant specialized in process mining. Your role is to analyze process-related data, models, and documentation to provide accurate and concise answers.

Here’s a user question: What is the most common workflow for patients in the ICU??

Here is some extracted documentation and context:

Abstract
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge). MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three

2025-07-12 13:40:10,046 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


content="The most common workflow for patients in the ICU can be summarized through a series of frequent measures and interventions based on the data extracted from the MIMIC-III database. The workflow typically includes the following key components:\n\n1. **Monitoring Vital Signs**: Continuous monitoring of vital signs such as heart rate, blood pressure, respiratory rate, and oxygen saturation. Measurements are often taken at least hourly.\n\n2. **Ventilator Management**: Many ICU patients require mechanical ventilation. This involves:\n   - Regular adjustments of ventilator settings (e.g., Tidal Volume, PEEP, Inspired O2 Fraction).\n   - Continual observation of patient response (e.g., Arterial CO2 and O2 levels).\n   \n3. **Laboratory Testing**: Frequent laboratory tests are conducted to monitor vital biochemical parameters, such as:\n   - Serum electrolytes (Potassium, HCO3).\n   - Blood gas analyses (Arterial pH, Arterial Base Excess).\n   - Other critical values (e.g., Glucose le