<a href="https://colab.research.google.com/github/adamstiefel/AI-Business-Agents/blob/main/Bottleneck_Detection_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Bottleneck Detection Agent 🚧

**Leveraging Data to Uncover Process Delays and Inefficiencies**

## What This Tool Does

This Google Colab notebook implements a **Bottleneck Detection Agent**. It's designed to:

1.  **Ingest Process Data:** Primarily focused on event logs (CSV files) containing case IDs, activity names, and timestamps.
2.  **Calculate Key Time Metrics:** Automatically computes essential metrics like:
    * **Activity Duration:** How long each specific task takes.
    * **Waiting Time:** The idle time between the completion of one activity and the start of the next for the same process instance (case).
3.  **Detect Anomalous Delays:** Uses either a statistical rule (interquartile range, IQR) or a machine-learning model (Isolation Forest) to identify activities or waiting periods that are unusually long compared to the norm.
4.  **Highlight Potential Bottlenecks:** Flags these anomalous delays, providing insights into where processes are getting stuck or taking longer than expected.

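The two time metrics in step 2 can be illustrated on a few rows. The following is a minimal sketch, assuming the column names `CaseID`, `Activity`, `Timestamp_Start`, and `Timestamp_End` used by the sample data in this notebook; the three-row DataFrame is illustrative only.

```python
import pandas as pd

# Three events from one case; column names follow the notebook's sample data.
log = pd.DataFrame({
    "CaseID": [1, 1, 1],
    "Activity": ["Order Received", "Payment Processed", "Items Packed"],
    "Timestamp_Start": pd.to_datetime(
        ["2023-01-01 09:00:00", "2023-01-01 09:05:00", "2023-01-01 09:30:00"]),
    "Timestamp_End": pd.to_datetime(
        ["2023-01-01 09:04:00", "2023-01-01 09:08:00", "2023-01-01 09:55:00"]),
})

# Activity duration: end minus start of the same event.
log["ActivityDuration_Minutes"] = (
    (log["Timestamp_End"] - log["Timestamp_Start"]).dt.total_seconds() / 60
)

# Waiting time: start of this event minus end of the previous event in the same case.
log = log.sort_values(["CaseID", "Timestamp_Start"])
prev_end = log.groupby("CaseID")["Timestamp_End"].shift(1)
log["WaitingTime_Minutes"] = (log["Timestamp_Start"] - prev_end).dt.total_seconds() / 60

print(log[["Activity", "ActivityDuration_Minutes", "WaitingTime_Minutes"]])
```

The first event of a case has no predecessor, so its waiting time stays `NaN` rather than being set to zero.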
## How This Delivers Business Value (The "Makes Money" Aspect)

Identifying and addressing bottlenecks is crucial for operational excellence. This agent helps by:

* **💰 Reducing Wasted Time & Costs:** Pinpoints specific delays, allowing for targeted improvements that cut down on idle time and associated operational costs.
* **🚀 Accelerating Processes:** By tackling the slowest parts of your process, you can significantly improve overall cycle times and throughput.
* **🛠️ Optimizing Resource Use:** Highlights areas where resources might be overloaded or where delays indicate a need for reallocation or support.
* **📈 Enhancing Productivity:** Smoother processes with fewer delays mean teams can achieve more.
* **🔍 Enabling Proactive Management:** Spot unusual delays early to prevent minor issues from becoming major disruptions.
* **📊 Supporting Data-Driven Improvements:** Provides concrete data to justify and guide process re-engineering efforts.

This tool empowers you to move from raw process data to actionable insights for a more efficient and cost-effective operation.

In [2]:
#@title 2. Setup and Installations
# --- Install necessary libraries ---
!pip install pandas scikit-learn plotly -q

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from google.colab import files
import io
import plotly.express as px
import plotly.graph_objects as go
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

print("Libraries installed and imported.")
print(f"Pandas version: {pd.__version__}")

Libraries installed and imported.
Pandas version: 2.2.2


In [6]:
# Python code to create the sample_process_log.csv file in your Colab environment

csv_data_content = """CaseID,Activity,Timestamp_Start,Timestamp_End
1,Order Received,2023-01-01 09:00:00,2023-01-01 09:04:00
1,Payment Processed,2023-01-01 09:05:00,2023-01-01 09:08:00
1,Items Packed,2023-01-01 09:30:00,2023-01-01 09:55:00
1,Shipped,2023-01-01 10:00:00,2023-01-01 10:05:00
2,Order Received,2023-01-01 10:15:00,2023-01-01 10:19:00
2,Payment Processed,2023-01-01 10:20:00,2023-01-01 10:25:00
2,Shipped,2023-01-01 11:30:00,2023-01-01 11:33:00
3,Order Received,2023-01-02 08:00:00,2023-01-02 08:02:00
3,Payment Processed,2023-01-02 08:03:00,2023-01-02 08:06:00
3,Items Packed,2023-01-02 08:45:00,2023-01-02 09:10:00
3,Shipped,2023-01-02 09:15:00,2023-01-02 09:20:00
4,Order Received,2023-01-02 11:00:00,2023-01-02 11:04:00
4,Payment Processed,2023-01-02 11:05:00,2023-01-02 11:08:00
4,Items Packed,2023-01-02 11:35:00,2023-01-02 11:58:00
4,Shipped,2023-01-02 12:05:00,2023-01-02 12:08:00
5,Order Received,2023-01-03 13:00:00,2023-01-03 13:01:00
5,Payment Processed,2023-01-03 13:02:00,2023-01-03 13:05:00
5,Items Packed,2023-01-03 13:30:00,2023-01-03 14:25:00
5,Items Packed,2023-01-03 14:30:00,2023-01-03 14:55:00
5,Shipped,2023-01-03 15:00:00,2023-01-03 15:03:00
6,Order Received,2023-01-04 09:00:00,2023-01-04 09:03:00
6,Payment Processed,2023-01-04 09:05:00,2023-01-04 09:15:00
6,Items Packed,2023-01-04 10:00:00,2023-01-04 10:30:00
6,Shipped,2023-01-04 10:35:00,2023-01-04 10:40:00
7,Order Received,2023-01-04 11:00:00,2023-01-04 11:02:00
7,Payment Processed,2023-01-04 11:03:00,2023-01-04 11:05:00
7,Items Packed,2023-01-04 11:06:00,2023-01-04 12:30:00
7,Shipped,2023-01-04 12:35:00,2023-01-04 12:38:00
8,Order Received,2023-01-05 14:00:00,2023-01-05 14:03:00
8,Payment Processed,2023-01-05 14:30:00,2023-01-05 14:35:00
8,Items Packed,2023-01-05 14:40:00,2023-01-05 14:55:00
8,Shipped,2023-01-05 15:00:00,2023-01-05 15:05:00
"""

file_name = "sample_process_log.csv"
with open(file_name, "w") as f:
    f.write(csv_data_content)

print(f"'{file_name}' has been created in your Colab environment's current directory.")
print("You can now download it using the Colab file explorer (folder icon on the left sidebar).")

# Optional: Code to trigger a download directly (might be blocked by some browsers if not user-initiated)
# from google.colab import files
# files.download(file_name)
# print(f"If your browser didn't block it, a download prompt for '{file_name}' should appear.")

'sample_process_log.csv' has been created in your Colab environment's current directory.
You can now download it using the Colab file explorer (folder icon on the left sidebar).


In [7]:
#@title 3. Data Input: Upload CSV or Use Sample Data

#@markdown Select your data source:
data_source = "Upload CSV" #@param ["Upload CSV", "Use Sample Data"]

#@markdown ---
#@markdown ### Expected CSV Column Names:
#@markdown Please ensure your CSV has columns with these names. If it uses different names, enter them in the fields below.
case_id_col = "CaseID" #@param {type:"string"}
activity_col = "Activity" #@param {type:"string"}
timestamp_start_col = "Timestamp_Start" #@param {type:"string"}
timestamp_end_col = "Timestamp_End" #@param {type:"string"}
#@markdown ---
#@markdown **Timestamp Format (if not automatically parsed by Pandas):**
#@markdown Example: `%Y-%m-%d %H:%M:%S` or `%m/%d/%Y %I:%M:%S %p`. Leave blank if Pandas can infer.
timestamp_format_str = "" #@param {type:"string"}

df_process_data = None

if data_source == "Upload CSV":
    print("Please upload your process event log CSV file.")
    print("The file should contain columns for Case ID, Activity Name, Start Timestamp, and End Timestamp.")
    uploaded = files.upload()

    if not uploaded:
        print("No file uploaded. Please upload a file or select 'Use Sample Data'.")
    else:
        file_name = next(iter(uploaded))
        print(f"\nProcessing uploaded file: \"{file_name}\"")
        try:
            df_process_data = pd.read_csv(io.BytesIO(uploaded[file_name]))
            print("CSV file loaded successfully into a DataFrame.")
        except Exception as e:
            print(f"Error reading CSV: {e}")
            print("Please ensure it's a valid CSV file.")

elif data_source == "Use Sample Data":
    print("Loading sample process event log data...")
    # Create a sample DataFrame
    data = {
        case_id_col: [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5],  # 20 IDs, one per event below
        activity_col: ['Order Received', 'Payment Processed', 'Items Packed', 'Shipped',
                       'Order Received', 'Payment Processed', 'Shipped', # Skipped 'Items Packed' for Case 2
                       'Order Received', 'Payment Processed', 'Items Packed', 'Shipped',
                       'Order Received', 'Payment Processed', 'Items Packed', 'Shipped',
                       'Order Received', 'Payment Processed', 'Items Packed', 'Items Packed', 'Shipped'], # Repetitive 'Items Packed'
        timestamp_start_col: pd.to_datetime([
            '2023-01-01 09:00:00', '2023-01-01 09:05:00', '2023-01-01 09:30:00', '2023-01-01 10:00:00', # Case 1
            '2023-01-01 10:15:00', '2023-01-01 10:20:00', '2023-01-01 11:30:00',                    # Case 2
            '2023-01-02 08:00:00', '2023-01-02 08:03:00', '2023-01-02 08:45:00', '2023-01-02 09:15:00', # Case 3
            '2023-01-02 11:00:00', '2023-01-02 11:05:00', '2023-01-02 11:35:00', '2023-01-02 12:05:00', # Case 4 (normal)
            '2023-01-03 13:00:00', '2023-01-03 13:02:00', '2023-01-03 13:30:00', '2023-01-03 14:30:00', '2023-01-03 15:00:00' # Case 5 (delay packing, then ship)
        ]),
        timestamp_end_col: pd.to_datetime([
            '2023-01-01 09:04:00', '2023-01-01 09:08:00', '2023-01-01 09:55:00', '2023-01-01 10:05:00', # Case 1
            '2023-01-01 10:19:00', '2023-01-01 10:25:00', '2023-01-01 11:33:00',                    # Case 2
            '2023-01-02 08:02:00', '2023-01-02 08:06:00', '2023-01-02 09:10:00', '2023-01-02 09:20:00', # Case 3
            '2023-01-02 11:04:00', '2023-01-02 11:08:00', '2023-01-02 11:58:00', '2023-01-02 12:08:00', # Case 4
            '2023-01-03 13:01:00', '2023-01-03 13:05:00', '2023-01-03 14:25:00', '2023-01-03 14:55:00', '2023-01-03 15:03:00' # Case 5
        ])
    }
    df_process_data = pd.DataFrame(data)
    print("Sample data loaded.")

if df_process_data is not None:
    print("\n--- Data Preview (First 5 rows) ---")
    display(df_process_data.head())
    print("\n--- Data Info ---")
    df_process_data.info()
else:
    print("\nNo data loaded. Please run the cell again and choose a data source.")

# Basic validation of expected columns
if df_process_data is not None:
    required_cols = [case_id_col, activity_col, timestamp_start_col, timestamp_end_col]
    missing_cols = [col for col in required_cols if col not in df_process_data.columns]
    if missing_cols:
        print(f"\n⚠️ WARNING: The following expected columns are missing from your DataFrame: {missing_cols}")
        print("Please check your column name parameters or your CSV file. Subsequent cells might fail.")
    else:
        print("\n✅ All expected columns found.")

Please upload your process event log CSV file.
The file should contain columns for Case ID, Activity Name, Start Timestamp, and End Timestamp.


Saving sample_process_log.csv to sample_process_log (1).csv

Processing uploaded file: "sample_process_log (1).csv"
CSV file loaded successfully into a DataFrame.

--- Data Preview (First 5 rows) ---


Unnamed: 0,CaseID,Activity,Timestamp_Start,Timestamp_End
0,1,Order Received,2023-01-01 09:00:00,2023-01-01 09:04:00
1,1,Payment Processed,2023-01-01 09:05:00,2023-01-01 09:08:00
2,1,Items Packed,2023-01-01 09:30:00,2023-01-01 09:55:00
3,1,Shipped,2023-01-01 10:00:00,2023-01-01 10:05:00
4,2,Order Received,2023-01-01 10:15:00,2023-01-01 10:19:00



--- Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   CaseID           32 non-null     int64 
 1   Activity         32 non-null     object
 2   Timestamp_Start  32 non-null     object
 3   Timestamp_End    32 non-null     object
dtypes: int64(1), object(3)
memory usage: 1.1+ KB

✅ All expected columns found.

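If an uploaded CSV uses different headers, one option is to rename them to the expected names right after loading instead of editing every parameter. The following is a minimal sketch; the helper `rename_to_expected` and the source headers (`order_id`, `step`, `started_at`, `finished_at`) are hypothetical and not part of this notebook.

```python
import pandas as pd

def rename_to_expected(df, column_map):
    """Return a copy of df with columns renamed per column_map; raise if a source column is missing."""
    missing = [src for src in column_map if src not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in the CSV: {missing}")
    return df.rename(columns=column_map)

# Hypothetical export whose headers differ from the notebook defaults.
raw = pd.DataFrame({
    "order_id": [1, 1],
    "step": ["Order Received", "Payment Processed"],
    "started_at": ["2023-01-01 09:00:00", "2023-01-01 09:05:00"],
    "finished_at": ["2023-01-01 09:04:00", "2023-01-01 09:08:00"],
})

mapped = rename_to_expected(raw, {
    "order_id": "CaseID",
    "step": "Activity",
    "started_at": "Timestamp_Start",
    "finished_at": "Timestamp_End",
})
print(list(mapped.columns))
```

Failing early on a missing source column avoids the less obvious errors that would otherwise surface in the later steps.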

In [8]:
#@title 4. Data Preprocessing & Feature Calculation

if df_process_data is None:
    print("🛑 Data not loaded. Please run the 'Data Input' cell (Step 3) first.")
else:
    df_analysis = df_process_data.copy()

    # --- 1. Timestamp Parsing ---
    print("\n--- Timestamp Parsing ---")
    try:
        if timestamp_format_str: # If user provided a format string
            df_analysis[timestamp_start_col] = pd.to_datetime(df_analysis[timestamp_start_col], format=timestamp_format_str)
            df_analysis[timestamp_end_col] = pd.to_datetime(df_analysis[timestamp_end_col], format=timestamp_format_str)
        else: # Try to let Pandas infer
            df_analysis[timestamp_start_col] = pd.to_datetime(df_analysis[timestamp_start_col])
            df_analysis[timestamp_end_col] = pd.to_datetime(df_analysis[timestamp_end_col])
        print("Timestamps parsed successfully.")
    except Exception as e:
        print(f"⚠️ Error parsing timestamps: {e}")
        print("Please check your timestamp columns and the format string if provided.")
        print("Attempting to continue, but further calculations might fail.")

    # --- 2. Calculate Activity Duration ---
    print("\n--- Calculating Activity Duration ---")
    if timestamp_start_col in df_analysis and timestamp_end_col in df_analysis:
        try:
            df_analysis['ActivityDuration_Seconds'] = (df_analysis[timestamp_end_col] - df_analysis[timestamp_start_col]).dt.total_seconds()
            df_analysis['ActivityDuration_Minutes'] = df_analysis['ActivityDuration_Seconds'] / 60
            print("Activity Duration calculated in seconds and minutes.")
        except Exception as e:
            print(f"⚠️ Error calculating Activity Duration: {e}")
            df_analysis['ActivityDuration_Seconds'] = np.nan
            df_analysis['ActivityDuration_Minutes'] = np.nan
    else:
        print("⚠️ Timestamp columns for duration calculation not found. Skipping Activity Duration.")
        df_analysis['ActivityDuration_Seconds'] = np.nan
        df_analysis['ActivityDuration_Minutes'] = np.nan


    # --- 3. Calculate Waiting Time (Transition Time) ---
    print("\n--- Calculating Waiting Time ---")
    # Sort by case and then by start time to ensure correct order for waiting time calculation
    if case_id_col in df_analysis and timestamp_start_col in df_analysis and timestamp_end_col in df_analysis:
        try:
            df_analysis = df_analysis.sort_values(by=[case_id_col, timestamp_start_col, timestamp_end_col])
            # Calculate waiting time as: start_time_of_current_activity - end_time_of_previous_activity (for the same case)
            df_analysis['Prev_Activity_End_Time'] = df_analysis.groupby(case_id_col)[timestamp_end_col].shift(1)
            df_analysis['WaitingTime_Seconds'] = (df_analysis[timestamp_start_col] - df_analysis['Prev_Activity_End_Time']).dt.total_seconds()
            df_analysis['WaitingTime_Minutes'] = df_analysis['WaitingTime_Seconds'] / 60
            # Fill NaN for the first activity in each case (as there's no preceding activity)
            # df_analysis['WaitingTime_Seconds'].fillna(0, inplace=True) # Or keep as NaN
            # df_analysis['WaitingTime_Minutes'].fillna(0, inplace=True)
            print("Waiting Time calculated in seconds and minutes (NaN for first activity in a case).")
        except Exception as e:
            print(f"⚠️ Error calculating Waiting Time: {e}")
            df_analysis['WaitingTime_Seconds'] = np.nan
            df_analysis['WaitingTime_Minutes'] = np.nan
    else:
        print("⚠️ Necessary columns for Waiting Time calculation not found. Skipping Waiting Time.")
        df_analysis['WaitingTime_Seconds'] = np.nan
        df_analysis['WaitingTime_Minutes'] = np.nan


    print("\n--- Preprocessed Data Preview (with new features) ---")
    display(df_analysis.head())

    # Store for next step
    processed_df = df_analysis


--- Timestamp Parsing ---
Timestamps parsed successfully.

--- Calculating Activity Duration ---
Activity Duration calculated in seconds and minutes.

--- Calculating Waiting Time ---
Waiting Time calculated in seconds and minutes (NaN for first activity in a case).

--- Preprocessed Data Preview (with new features) ---


Unnamed: 0,CaseID,Activity,Timestamp_Start,Timestamp_End,ActivityDuration_Seconds,ActivityDuration_Minutes,Prev_Activity_End_Time,WaitingTime_Seconds,WaitingTime_Minutes
0,1,Order Received,2023-01-01 09:00:00,2023-01-01 09:04:00,240.0,4.0,NaT,,
1,1,Payment Processed,2023-01-01 09:05:00,2023-01-01 09:08:00,180.0,3.0,2023-01-01 09:04:00,60.0,1.0
2,1,Items Packed,2023-01-01 09:30:00,2023-01-01 09:55:00,1500.0,25.0,2023-01-01 09:08:00,1320.0,22.0
3,1,Shipped,2023-01-01 10:00:00,2023-01-01 10:05:00,300.0,5.0,2023-01-01 09:55:00,300.0,5.0
4,2,Order Received,2023-01-01 10:15:00,2023-01-01 10:19:00,240.0,4.0,NaT,,

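Before looking for anomalies, it can help to check the log for timing inconsistencies, since an end time before a start time or an event that starts before the previous one ended would produce negative durations or waiting times. The following is a minimal sketch; the helper `find_timing_issues` and its output column names are assumptions for illustration, not part of this notebook.

```python
import pandas as pd

def find_timing_issues(df, case_col="CaseID", start_col="Timestamp_Start", end_col="Timestamp_End"):
    """Return rows with a negative duration or a negative waiting time (overlap with the previous event)."""
    df = df.sort_values([case_col, start_col, end_col]).copy()
    duration = (df[end_col] - df[start_col]).dt.total_seconds()
    prev_end = df.groupby(case_col)[end_col].shift(1)
    waiting = (df[start_col] - prev_end).dt.total_seconds()
    df["Negative_Duration"] = duration < 0
    # NaN compares as False, so the first event of each case is never flagged here.
    df["Overlaps_Previous"] = waiting < 0
    return df[df["Negative_Duration"] | df["Overlaps_Previous"]]

# Illustrative events: the second overlaps the first, the third ends before it starts.
events = pd.DataFrame({
    "CaseID": [1, 1, 2],
    "Activity": ["A", "B", "A"],
    "Timestamp_Start": pd.to_datetime(
        ["2023-01-01 09:00:00", "2023-01-01 09:03:00", "2023-01-01 10:00:00"]),
    "Timestamp_End": pd.to_datetime(
        ["2023-01-01 09:04:00", "2023-01-01 09:08:00", "2023-01-01 09:50:00"]),
})
issues = find_timing_issues(events)
print(issues[["CaseID", "Activity", "Negative_Duration", "Overlaps_Previous"]])
```

Whether overlapping events are errors or legitimate parallel work depends on the process, so the sketch only reports them and leaves the decision to the analyst.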

In [12]:
#@title 5. Anomaly Detection for Bottlenecks

if 'processed_df' not in globals() or processed_df is None:
    print("🛑 Processed data not found. Please run Step 4 first.")
else:
    #@markdown ### Select Metric for Anomaly Detection:
    #@markdown Choose the time-based feature you want to analyze for unusual delays.
    metric_to_analyze = "ActivityDuration_Minutes" #@param ["ActivityDuration_Minutes", "WaitingTime_Minutes", "ActivityDuration_Seconds", "WaitingTime_Seconds"]

    #@markdown ---
    #@markdown ### Select Anomaly Detection Method:
    detection_method = "Isolation Forest (Machine Learning)" #@param ["IQR (Statistical)", "Isolation Forest (Machine Learning)"]

    #@markdown ---
    #@markdown ### Method-Specific Parameters:
    #@markdown **For IQR:**
    iqr_multiplier = 1.5 #@param {type:"number"}
    #@markdown An event is an anomaly if Metric > Q3 + (IQR_Multiplier * IQR)

    #@markdown **For Isolation Forest:**
    #@markdown Contamination: Expected proportion of outliers (0.01 for 1%, 'auto' for model to decide)
    isolation_forest_contamination = 'auto' #@param ["auto", "0.01", "0.02", "0.05", "0.1"]
    if isolation_forest_contamination != 'auto':
        try:
            isolation_forest_contamination = float(isolation_forest_contamination)
        except ValueError:
            print("Invalid contamination value, defaulting to 'auto'.")
            isolation_forest_contamination = 'auto'


    df_anomalies = processed_df.copy()
    df_anomalies = df_anomalies.dropna(subset=[metric_to_analyze])

    if df_anomalies.empty or df_anomalies[metric_to_analyze].isnull().all():
        print(f"⚠️ No valid data available for metric '{metric_to_analyze}' after dropping NaNs. Cannot perform anomaly detection.")
    else:
        print(f"\n--- Performing Anomaly Detection using: {detection_method} on '{metric_to_analyze}' ---")

        metric_values = df_anomalies[metric_to_analyze].values.reshape(-1, 1)

        if detection_method == "IQR (Statistical)":
            Q1 = df_anomalies[metric_to_analyze].quantile(0.25)
            Q3 = df_anomalies[metric_to_analyze].quantile(0.75)
            IQR = Q3 - Q1
            upper_bound = Q3 + iqr_multiplier * IQR

            df_anomalies['Is_Anomaly_IQR'] = df_anomalies[metric_to_analyze] > upper_bound
            anomalies_detected_df = df_anomalies[df_anomalies['Is_Anomaly_IQR']].copy()  # copy to avoid SettingWithCopy warnings
            print(f"IQR Method: Q1={Q1:.2f}, Q3={Q3:.2f}, IQR={IQR:.2f}, Upper Bound (Anomaly Threshold)={upper_bound:.2f}")

        elif detection_method == "Isolation Forest (Machine Learning)":
            model_if = IsolationForest(contamination=isolation_forest_contamination, random_state=42)
            model_if.fit(metric_values)

            df_anomalies['Anomaly_Score_IF'] = model_if.decision_function(metric_values)
            # predict() returns -1 for outliers and 1 for inliers; store the result as a boolean flag
            df_anomalies['Is_Anomaly_IF'] = model_if.predict(metric_values) == -1
            anomalies_detected_df = df_anomalies[df_anomalies['Is_Anomaly_IF']].copy()  # copy to avoid SettingWithCopy warnings
            print(f"Isolation Forest: Model trained and predictions made. Contamination set to: {isolation_forest_contamination}")

        print(f"\nFound {len(anomalies_detected_df)} potential anomalous delays (bottlenecks).")

        if not anomalies_detected_df.empty:
            print("\n--- Detected Anomalous Delays (Potential Bottlenecks) ---")

            # Prepare a DataFrame for display purposes
            anomalies_to_display_df = anomalies_detected_df.copy()
            cols_to_show = [case_id_col, activity_col, timestamp_start_col, metric_to_analyze]

            if metric_to_analyze.startswith("WaitingTime"):
                # Look up the preceding activity in the full, sorted log (processed_df).
                # Shifting within the anomaly subset would skip non-anomalous rows and
                # report the wrong predecessor.
                prev_activity = processed_df.groupby(case_id_col)[activity_col].shift(1)
                # Assignment aligns on the row index, so each anomaly gets its own predecessor.
                anomalies_detected_df.loc[:, 'Previous_Activity'] = prev_activity
                anomalies_to_display_df.loc[:, 'Previous_Activity'] = anomalies_detected_df['Previous_Activity']

                # Avoid adding the column twice if the cell is re-run
                if 'Previous_Activity' not in cols_to_show:
                    cols_to_show.insert(2, 'Previous_Activity')

            # Ensure all columns in cols_to_show actually exist in anomalies_to_display_df
            valid_cols_to_show = [col for col in cols_to_show if col in anomalies_to_display_df.columns]
            display(anomalies_to_display_df[valid_cols_to_show].sort_values(by=metric_to_analyze, ascending=False).head(20))

            final_anomalies_df = anomalies_detected_df # This now potentially includes 'Previous_Activity'
            all_data_with_flags_df = df_anomalies
        else:
            final_anomalies_df = pd.DataFrame()
            all_data_with_flags_df = df_anomalies # Still pass this for visualization consistency
            print("No anomalies detected with the current settings.")


--- Performing Anomaly Detection using: Isolation Forest (Machine Learning) on 'ActivityDuration_Minutes' ---
Isolation Forest: Model trained and predictions made. Contamination set to: auto

Found 10 potential anomalous delays (bottlenecks).

--- Detected Anomalous Delays (Potential Bottlenecks) ---


Unnamed: 0,CaseID,Activity,Timestamp_Start,ActivityDuration_Minutes
26,7,Items Packed,2023-01-04 11:06:00,84.0
17,5,Items Packed,2023-01-03 13:30:00,55.0
22,6,Items Packed,2023-01-04 10:00:00,30.0
2,1,Items Packed,2023-01-01 09:30:00,25.0
9,3,Items Packed,2023-01-02 08:45:00,25.0
18,5,Items Packed,2023-01-03 14:30:00,25.0
13,4,Items Packed,2023-01-02 11:35:00,23.0
30,8,Items Packed,2023-01-05 14:40:00,15.0
21,6,Payment Processed,2023-01-04 09:05:00,10.0
15,5,Order Received,2023-01-03 13:00:00,1.0

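Isolation Forest isolates points at both ends of the distribution, so unusually short events can be flagged as well; the output above includes a 1.0-minute `Order Received`. If only slow events are of interest, one option is to keep flagged rows whose metric lies above the median. The following is a minimal sketch; the helper `keep_slow_anomalies` and the five-row DataFrame are hypothetical, and the median cut-off is an assumption rather than part of this notebook.

```python
import pandas as pd

def keep_slow_anomalies(df, metric_col, flag_col):
    """Keep flagged rows whose metric is above the overall median, i.e. unusually slow ones."""
    median = df[metric_col].median()
    return df[df[flag_col] & (df[metric_col] > median)]

# Illustrative values; Is_Anomaly_IF mimics the boolean flag column produced in Step 5.
scored = pd.DataFrame({
    "Activity": ["Order Received", "Payment Processed", "Shipped", "Items Packed", "Items Packed"],
    "ActivityDuration_Minutes": [1.0, 3.0, 5.0, 25.0, 84.0],
    "Is_Anomaly_IF": [True, False, False, False, True],
})
slow = keep_slow_anomalies(scored, "ActivityDuration_Minutes", "Is_Anomaly_IF")
print(slow)
```

In this example the 1.0-minute event is dropped and only the 84.0-minute event remains.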

In [13]:
#@title 6. Visualization & Reporting of Potential Bottlenecks

if 'all_data_with_flags_df' not in globals() or all_data_with_flags_df is None:
    print("🛑 Anomaly detection not run or data not available. Please run Step 5 first.")
elif 'final_anomalies_df' not in globals():
    print("🛑 Anomaly detection results not found. Please run Step 5.")
else:
    # Determine the anomaly flag column based on the method used
    anomaly_flag_col = ''
    if detection_method == "IQR (Statistical)" and 'Is_Anomaly_IQR' in all_data_with_flags_df.columns:
        anomaly_flag_col = 'Is_Anomaly_IQR'
    elif detection_method == "Isolation Forest (Machine Learning)" and 'Is_Anomaly_IF' in all_data_with_flags_df.columns:
        anomaly_flag_col = 'Is_Anomaly_IF'

    if not anomaly_flag_col:
        print("⚠️ Anomaly flag column not found. Cannot create visualizations based on anomaly status.")
    else:
        # --- 1. Distribution Plot (Histogram or Box Plot) ---
        print(f"\n--- Distribution of '{metric_to_analyze}' with Anomalies ---")
        # Create a temporary column for plotting colors based on anomaly status
        # Ensure we are working with a copy if modifications are needed for plotting
        plot_df = all_data_with_flags_df.copy()
        plot_df['Anomaly_Status_Plot'] = plot_df[anomaly_flag_col].map({True: 'Anomaly', False: 'Normal'})

        fig_dist = px.histogram(plot_df, x=metric_to_analyze,
                                color='Anomaly_Status_Plot',
                                color_discrete_map={'Normal': 'blue', 'Anomaly': 'red'},
                                marginal="box",
                                title=f"Distribution of {metric_to_analyze} (Anomalies Highlighted)")
        fig_dist.update_layout(xaxis_title=metric_to_analyze, yaxis_title="Frequency")
        fig_dist.show()

        # --- 2. Summary of Anomalous Activities/Transitions ---
        if not final_anomalies_df.empty:
            print(f"\n--- Summary: Top Activities/Transitions Associated with Delays ({metric_to_analyze}) ---")

            summary_df = None # Initialize summary_df

            if metric_to_analyze.startswith("WaitingTime"):
                # Check if 'Previous_Activity' column exists (should have been added in Step 5)
                if 'Previous_Activity' in final_anomalies_df.columns:
                    # Group by the 'Previous_Activity' to see which activities typically precede long waits
                    # Make sure to handle potential NaN values in 'Previous_Activity' if they can occur
                    summary_df = final_anomalies_df.dropna(subset=['Previous_Activity']).groupby('Previous_Activity')[metric_to_analyze].agg(['count', 'mean', 'max']).sort_values(by='count', ascending=False)
                    summary_df.columns = [f'Anomalous_Wait_Count_After_This_Activity', f'Mean_Anomalous_{metric_to_analyze}', f'Max_Anomalous_{metric_to_analyze}']
                    print(f"(Summary focuses on the activity *before* the anomalous waiting time)")
                else:
                    print("⚠️ 'Previous_Activity' column not found in anomaly data for WaitingTime summary. Grouping by current activity instead.")
                    summary_df = final_anomalies_df.groupby(activity_col)[metric_to_analyze].agg(['count', 'mean', 'max']).sort_values(by='count', ascending=False)
                    summary_df.columns = [f'Anomalous_Wait_Count_Before_This_Activity', f'Mean_Anomalous_{metric_to_analyze}', f'Max_Anomalous_{metric_to_analyze}']
            else: # For ActivityDuration
                summary_df = final_anomalies_df.groupby(activity_col)[metric_to_analyze].agg(['count', 'mean', 'max']).sort_values(by='count', ascending=False)
                summary_df.columns = [f'Anomalous_Duration_Count_For_This_Activity', f'Mean_Anomalous_{metric_to_analyze}', f'Max_Anomalous_{metric_to_analyze}']

            if summary_df is not None and not summary_df.empty:
                display(summary_df.head(10))
            elif summary_df is not None and summary_df.empty: # Should not happen if final_anomalies_df was not empty
                 print("Summary table is empty after grouping.")
        else:
            print("No anomalies were detected, so no summary to display.")


--- Distribution of 'ActivityDuration_Minutes' with Anomalies ---



--- Summary: Top Activities/Transitions Associated with Delays (ActivityDuration_Minutes) ---


Activity,Anomalous_Duration_Count_For_This_Activity,Mean_Anomalous_ActivityDuration_Minutes,Max_Anomalous_ActivityDuration_Minutes
Items Packed,8,35.25,84.0
Order Received,1,1.0,1.0
Payment Processed,1,10.0,10.0
