<table align="center">

  <td align="center"><a target="_blank" href="https://colab.research.google.com/github/a-rebmann/nlp4bpa/blob/main/nlp4bpa_tutorial_2023.ipynb">
        <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# NLP for BPA - Hands-on Exercises

## Outline

### 1. Event log analysis
    1.1 Importing and analyzing an event log with GPTs and pm4py
    1.2 Creating functions for custom tasks: action and object extraction from event labels using LLMs
    1.3 Label standardization
    1.4 Activity type categorization

### 2. Analyzing textual process descriptions
    Imperative model extraction from text
    
### 3. Future of NLP for BPA

## Setup

### API key

In [5]:
import os
oaik = "replace this with an OpenAI API key"
os.environ["OPENAI_API_KEY"] = oaik

#### How to obtain your own API key?
To get your own key, do the following:
1. Create an account on https://platform.openai.com (as of September 8, 2023, new accounts get a budget of $18 to get started)
2. Log into your account
3. In the top-right corner, click on your account
4. Select "View API keys"
5. Generate a key by clicking on "Create new secret key"
6. Copy the generated key from the pop-up window and save it somewhere.

To use the key in this notebook, replace the string assigned to `oaik` in the cell above with your key.
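If you prefer not to paste the key into the notebook, it can also be read from the environment or entered interactively. The following is a minimal sketch under that assumption; `load_api_key` and its parameters are hypothetical helper names, not part of the original notebook.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY", ask=getpass):
    """Return the key from the environment, prompting once if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed key, so it does not end up in the cell output
        key = ask(f"Enter your {env_var}: ").strip()
        os.environ[env_var] = key
    return key
```

Calling `load_api_key()` once at the top of the notebook would then make the key available to all later cells through `os.environ["OPENAI_API_KEY"]`.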


### Required installs

In [None]:
# Required installs
!pip install -q pm4py==2.7.3
# The cells below use the pre-1.0 `openai.ChatCompletion` interface
!pip install -q "openai<1"
!pip install -q "langchain[all]"
!pip install -q chromadb

## 1. Event log analysis

In this hands-on exercise, we will analyze an event log using NLP techniques. In particular, we will analyze (a subset of) an [event log of a travel process](https://data.4tu.nl/datasets/db35afac-2133-40f3-a565-2dc77a9329a3) at a university, which was published in the context of the BPI Challenge 2020.

### 1.1 Importing and analyzing an event log with GPTs and pm4py

In this part, we will use a [Generative Pre-trained Transformer (GPT)](https://openai.com) and [pm4py](https://pm4py.fit.fraunhofer.de) to analyze the real-life event log. 

<small>
Alessandro Berti, Daniel Schuster, and Wil M. P. van der Aalst: Abstractions, Scenarios, and Prompt Definitions for Process Mining with LLMs: A Case Study. In: BPM 2023 Workshops.
</small>

#### Importing an event log (Google Colab)

In [None]:
!git clone https://github.com/a-rebmann/nlp4bpa

In [None]:
# Importing an example event log
import pm4py
travel_event_log = pm4py.read_xes("/content/nlp4bpa/content/PermitLog_small.xes")

In [None]:
travel_event_log

#### Importing an event log (locally)

In [None]:
# Importing an example event log
import pm4py
travel_event_log = pm4py.read_xes("content/PermitLog_small.xes")

In [None]:
travel_event_log

#### Describing the process captured in an event log

In [None]:
ans_desc = pm4py.openai.describe_process(travel_event_log, openai_model="gpt-3.5-turbo", api_key=os.environ["OPENAI_API_KEY"])
print(ans_desc)

This result is based on the following query, which abstracts the log to a directly-follows graph that is in turn described textually.

In [None]:
from pm4py.algo.querying.openai import log_to_dfg_descr
d_query = log_to_dfg_descr.apply(travel_event_log, parameters={})
d_query += "can you provide a description of the process?"
print(d_query)
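To make the idea of a directly-follows abstraction more concrete, here is a minimal sketch in plain Python. It is not pm4py's implementation, and the textual format pm4py produces may differ; the helper names and the toy traces are hypothetical.

```python
from collections import Counter


def directly_follows(traces):
    """Count how often activity a is directly followed by activity b."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg


def describe_dfg(dfg):
    """Render the graph as one line of text per edge, most frequent first."""
    return "\n".join(
        f"{a} -> {b} (frequency = {n})" for (a, b), n in dfg.most_common()
    )


# Hypothetical toy traces, not taken from the travel log
toy_traces = [
    ["submit permit", "approve permit", "start trip"],
    ["submit permit", "reject permit"],
    ["submit permit", "approve permit", "start trip"],
]
toy_dfg = directly_follows(toy_traces)
print(describe_dfg(toy_dfg))
```

A textual rendering like this is what makes the log digestible for a language model: the model never sees the raw events, only the aggregated edges and their frequencies.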

#### Checking for potentially undesired behavior

In [None]:
ans_ad = pm4py.openai.anomaly_detection(travel_event_log, openai_model="gpt-3.5-turbo", api_key=os.environ["OPENAI_API_KEY"])
print(ans_ad)

This result is obtained by abstracting the event log to trace variants:

In [None]:
from pm4py.algo.querying.openai import log_to_variants_descr
a_query = log_to_variants_descr.apply(travel_event_log, parameters={})
a_query += "what are the main anomalies? An anomaly involves a strange ordering of the activities, or a significant amount of rework. Please only data and process specific considerations, not general considerations. Please sort the anomalies based on their seriousness."
print(a_query)
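The variant abstraction can be illustrated in the same spirit. The sketch below groups identical activity sequences and lists them by frequency; it is a simplified stand-in rather than pm4py's own code, and the function names and toy traces are assumptions made for illustration.

```python
from collections import Counter


def trace_variants(traces):
    """Group identical activity sequences and count their occurrences."""
    return Counter(tuple(trace) for trace in traces)


def describe_variants(variants, top=5):
    """List the most frequent variants as text, one per line."""
    return "\n".join(
        f"{' -> '.join(variant)} (frequency = {count})"
        for variant, count in variants.most_common(top)
    )


# Hypothetical toy traces; the second one contains rework (a second submission)
toy_traces = [
    ["submit", "approve", "pay"],
    ["submit", "reject", "submit", "approve", "pay"],
    ["submit", "approve", "pay"],
]
toy_variants = trace_variants(toy_traces)
print(describe_variants(toy_variants))
```

Unlike the directly-follows graph, variants keep the full ordering of each trace, which is why they are better suited to questions about strange orderings and rework.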

### 1.2 Creating functions for custom tasks: action and object extraction from event labels using LLMs
In this part, we will focus on an NLP task in the context of business process analysis.
We will show how such a task can be implemented using GPTs without any fine-tuning.

As an example, we will look at the extraction of business objects and actions from event labels.
The automated analysis of such labels based on traditional NLP techniques and based on transformers has been actively researched. It enables many downstream pre-processing tasks, such as the cleaning/standardization of activity labels and the automated assessment of the type of activity that is performed (we will explore both of these later in this notebook).

The central part of creating such a function is the creation of a "prompt" for the GPT. Designing a prompt is essentially how you “program” a GPT model, usually by providing instructions or some examples of how to successfully complete a task.

For our task, such a prompt can look like this:

In [None]:
task_prompt = """You are an expert activity label tagger system. 
Your task is to accept activity labels such as 'create purchase order' as input and provide a list of pairs, where each pair consists of the main action and the object it is applied on. 
For 'create purchase order', you would return [('create', 'purchase order')] and for 'purchase order' [('', 'purchase order')]. 
If actions are not provided as verbs, change them into verbs, e.g., for 'purchase order creation' you would hence return [('create', 'purchase order')] as well. 
Also turn past tense actions into present tense ones, e.g., 'purchase order created' becomes [('create', 'purchase order')] too. 
If multiple actions are applied to the same object, split this into multiple pairs, e.g., 'create and send purchase order' becomes [('create', 'purchase order'), ('send', 'purchase order')].
If there is additional information, e.g., about who is performing the action or about an IT system that is involved, discard that. 
If there are any special characters, just replace them with whitespace. 

Under no circumstances (!) put any other text in your answer, only a (possibly empty) list of pairs with nothing before or after. 
In each pair the (optional) action comes first, followed by the object (if any).
If the activity label does not contain any actions, return an empty list, i.e., [].

Here is the activity label that shall be tagged.
Text:
"""

Next, we define the actual Python function wrapping the task:

In [None]:
import ast
import openai

def extract_object_action_pairs_from_label(label, model="gpt-3.5-turbo"):
    # Append the lowercased label, wrapped in single quotes, to the task prompt
    label_prompt = task_prompt + " '" + label.lower() + "'"
    openai.api_key = os.environ["OPENAI_API_KEY"]
    messages = [{"role": "user", "content": label_prompt}]
    # temperature=0 keeps the output as deterministic as possible
    response = openai.ChatCompletion.create(model=model, messages=messages, temperature=0)
    res = response["choices"][0]["message"]["content"]
    # Parse the returned list literal without executing arbitrary code
    return ast.literal_eval(res)

extract_object_action_pairs_from_label("notification letter creation and approval")
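Since the model's reply is plain text, it may not always be a well-formed list of pairs, even though the prompt asks for one. The following sketch shows one way to parse such a reply defensively; `parse_pairs` is a hypothetical helper, and silently dropping malformed entries is a design assumption, not the notebook's behavior.

```python
import ast


def parse_pairs(reply):
    """Parse a model reply into a list of (action, object) string pairs.

    Returns an empty list if the reply is not a list literal; entries that
    are not pairs of strings are skipped.
    """
    try:
        value = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(value, list):
        return []
    pairs = []
    for item in value:
        is_pair = isinstance(item, tuple) and len(item) == 2
        if is_pair and all(isinstance(part, str) for part in item):
            pairs.append((item[0].strip(), item[1].strip()))
    return pairs


print(parse_pairs("[('create', 'purchase order')]"))
```

`ast.literal_eval` only accepts Python literals, so a reply that contains extra prose or executable code is rejected rather than evaluated.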

### 1.3 Label standardization

Next, we use the implementation of our custom task to preprocess our event log.

#### Inspecting the unique labels of the event log

In [None]:
unique_labels = set(travel_event_log["concept:name"].unique())
print(len(unique_labels), "unique activity labels are in the log.")
unique_labels

#### Applying the function to the event log
Next, we apply the custom function to the distinct event labels of our event log and create a mapping from each original label to a new label that only consists of an action applied to an object.

In [None]:
label_mapping = {}
for label in unique_labels:
    processed_label = extract_object_action_pairs_from_label(label)
    print(label, "->", processed_label)
    # Use the first (action, object) pair; strip the blank left by an empty action
    label_mapping[label] = " ".join(processed_label[0]).strip() if processed_label else ""

In [None]:
label_mapping

Note that the GPT could not process all labels properly. One option is to provide more examples to it by changing the prompt.

Alternatively, we may improve performance by using dedicated techniques, e.g., an activity-label tagger based on a pretrained language model that was specifically fine-tuned for the task at hand; see [here](https://hanvanderaa.com/wp-content/uploads/2022/08/IS2022-Enabling-semantics-aware-process-mining-through-the-automated-annotation-of-event-logs.pdf) for details and [here](https://gitlab.uni-mannheim.de/processanalytics/label-tagger) for an easy-to-use Python package wrapping such a dedicated label tagger. Note that using the tagger requires a number of Python packages (> 1 GB) and a relatively large model (500 MB), which take too long to install in the context of the live session.
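The first option, providing more examples through the prompt, can be sketched as a few-shot message list in the chat format (system, user, and assistant roles). The helper name and the example labels below are hypothetical; the resulting list would be passed as the `messages` argument of the chat completion call.

```python
def build_few_shot_messages(task_prompt, examples, label):
    """Build a chat message list with worked examples before the real label."""
    messages = [{"role": "system", "content": task_prompt}]
    for example_label, expected_pairs in examples:
        # Each example is a user turn followed by the answer we would like to see
        messages.append({"role": "user", "content": example_label.lower()})
        messages.append({"role": "assistant", "content": repr(expected_pairs)})
    messages.append({"role": "user", "content": label.lower()})
    return messages


# Hypothetical examples; pick labels that resemble the ones that were mislabeled
examples = [
    ("Purchase Order Created", [("create", "purchase order")]),
    ("Invoice", [("", "invoice")]),
]
messages = build_few_shot_messages("You are a tagger.", examples, "Send Reminder")
print(len(messages), "messages")
```

Keeping the examples as data rather than as prompt text makes it easy to add a new example whenever a label is processed incorrectly.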

#### Replacing the labels in the event log
Having established this mapping, we can replace the original event labels with the ones obtained through our custom function.
In this manner, we obtain simpler, standardized labels and reduce the total number of labels in the event log. The latter in turn can lead to simpler process models (due to fewer nodes) when applying process discovery.
Also note that the role information (e.g., EMPLOYEE, SUPERVISOR, ADMINISTRATION) that we omitted from the original labels in the standardization process is still available in the log, because it was already captured in the dedicated <code>org:role</code> attribute.

In [None]:
travel_event_log["concept:name"] = travel_event_log["concept:name"].apply(lambda x: x if label_mapping[x] == '' else label_mapping[x])

In [None]:
travel_event_log

In [None]:
unique_labels_preprocessed = set(travel_event_log["concept:name"].unique())
print(len(unique_labels_preprocessed), "unique preprocessed activity labels are in the log.")
unique_labels_preprocessed
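The effect of the replacement on the number of distinct labels can be illustrated without the event log. In this sketch, the labels and the mapping are illustrative values in the style of the log, not actual model output, and `standardize` is a hypothetical helper.

```python
def standardize(labels, mapping):
    """Replace each label by its mapped form; keep the original label
    if the mapping has no entry or an empty entry for it."""
    return [mapping.get(label) or label for label in labels]


# Illustrative labels and mapping (assumed values, not actual results)
labels = [
    "Permit SUBMITTED by EMPLOYEE",
    "Permit APPROVED by SUPERVISOR",
    "Permit APPROVED by ADMINISTRATION",
    "Start trip",
]
mapping = {
    "Permit SUBMITTED by EMPLOYEE": "submit permit",
    "Permit APPROVED by SUPERVISOR": "approve permit",
    "Permit APPROVED by ADMINISTRATION": "approve permit",
    "Start trip": "",  # extraction yielded nothing, so the original is kept
}
standardized = standardize(labels, mapping)
print(len(set(labels)), "labels before,", len(set(standardized)), "after")
```

Here, the two approval labels collapse into one because they differ only in the role, which is exactly the information that remains available in `org:role`.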

### 1.4 Activity type categorization
Given the new, standardized event labels of our event log, we next want to acquire more information about the type of the activities executed in the process underlying the event log.

Knowing the type of an activity, e.g., whether an activity captures the update of a request or the decision about the acceptance of a request, can be helpful for other analysis tasks. 
For instance, in conformance checking we may want to assign a higher severity to conformance violations if these involve decisions than if they involve updates.

We can create a prompt that describes this task and define a function to wrap it as follows:

In [None]:
task_2_prompt = """You are an expert activity label classification system. 
Your task is to accept activity labels such as 'create purchase order' as input and provide the 
category of the activity as output. The categories are 'decide', 'create', 'update', 'delete'.
Do not put any other text in your answer, only a category with nothing before or after. 
If the activity label does not correspond to any of these categories, return 'other'.

Here is the activity label that shall be tagged.
Text:
"""

def categorize_activity(label, model="gpt-3.5-turbo"):
    # Append the lowercased label, wrapped in single quotes, to the task prompt
    label_prompt = task_2_prompt + "'" + label.lower() + "'"
    import openai
    openai.api_key = os.environ["OPENAI_API_KEY"]
    messages = [{"role": "user", "content": label_prompt}]
    response = openai.ChatCompletion.create(model=model, messages=messages, temperature=0)
    # Strip surrounding whitespace so the reply can be compared against the category names
    return response["choices"][0]["message"]["content"].strip()

categorize_activity("approve declaration")

Of course, these activity types are not set in stone and can be adjusted according to your preferences and analysis purpose. For instance, having an activity type "communicate" may be beneficial for the analysis of handovers in the process.
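One way to keep the categories adjustable is to generate the prompt from a list and to map the model's reply back onto that list. The sketch below assumes this design; both helper names are hypothetical, and the prompt wording is a shortened variant of the one used above.

```python
def build_category_prompt(categories):
    """Build a classification prompt for a configurable list of categories."""
    listed = ", ".join(f"'{c}'" for c in categories)
    return (
        "You are an expert activity label classification system.\n"
        f"The categories are {listed}.\n"
        "Answer with a category only; if none fits, answer 'other'.\n"
        "Text:\n"
    )


def normalize_category(reply, categories):
    """Map a raw model reply onto a known category, falling back to 'other'."""
    cleaned = reply.strip().strip("'\".").lower()
    return cleaned if cleaned in categories else "other"


categories = ["decide", "create", "update", "delete", "communicate"]
print(build_category_prompt(categories))
print(normalize_category(" 'Decide'. ", categories))
```

Normalizing the reply guards against small deviations such as quotes, capitalization, or a trailing period, which would otherwise create spurious categories in the log.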

In [None]:
activity_categories = {}
for label in unique_labels_preprocessed:
    processed_label = categorize_activity(label)
    print(label, "->" , processed_label)
    activity_categories[label] = processed_label

In [None]:
travel_event_log["activity:category"] = travel_event_log["concept:name"].apply(lambda x: "" if x not in activity_categories else activity_categories[x])
travel_event_log
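With a category attached to each activity, the conformance-checking idea mentioned at the start of this section can be sketched as a weighted severity score. The weights and the function name are hypothetical and only illustrate the principle that violations involving decisions count more than those involving updates.

```python
# Hypothetical weights; choose values that fit your own analysis purpose
SEVERITY_BY_CATEGORY = {
    "decide": 3.0,
    "create": 2.0,
    "delete": 2.0,
    "update": 1.0,
    "other": 1.0,
}


def weighted_severity(violating_activities, categories, weights=SEVERITY_BY_CATEGORY):
    """Sum the category weights of the activities involved in violations."""
    total = 0.0
    for activity in violating_activities:
        # Activities without a known category are treated as 'other'
        category = categories.get(activity, "other")
        total += weights.get(category, weights["other"])
    return total


demo_categories = {"approve declaration": "decide", "update permit": "update"}
demo_violations = ["approve declaration", "update permit", "unknown activity"]
print(weighted_severity(demo_violations, demo_categories))
```

In the notebook, the `activity_categories` dictionary could play the role of `demo_categories`, and the violating activities would come from a conformance-checking technique of your choice.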

## 2. Analyzing textual process descriptions

In this hands-on exercise we will have LLMs extract imperative process models from textual process descriptions.
We provide three examples of textual process descriptions taken from exercises in the [Fundamentals of Business Process Management Textbook](http://fundamentals-of-bpm.org) (Exercises 3.1-3.3). 

We will give these to a GPT model asking it to extract an imperative process model.
Technically, GPT models can generate markup output such as BPMN-XML. This is (currently) not practical, though. Due to the verbosity of XML, the output limit of GPTs prevents the direct generation of complete XML-code even for small BPMN diagrams. Also, after several attempts (on small models), we did not obtain XML code such that it could be displayed by a (commercial) BPMN modeling tool. 

Therefore, we will create a prompt that defines an intermediate notation that the LLM shall use to capture the imperative process model. This intermediate notation only captures tasks, arcs, and gateways (AND, XOR, and OR).

We give the prompt, including the textual process description, to the LLM and obtain a "model". We then compare this result to a process model created by a human modeling expert for the respective description. Note that there are naturally multiple valid ways to model a process based on natural language text.

These are the descriptions we will use:

<b>2.1</b>: Once a loan application has been approved by the loan provider, an acceptance pack is prepared and sent to the customer. The acceptance pack includes a repayment schedule which the customer needs to agree upon by sending the signed documents back to the loan provider. The latter then verifies the repayment agreement: if the applicant disagreed with the repayment schedule, the loan provider cancels the application; if the applicant agreed, the loan provider approves the application. In either case, the process completes with the loan provider notifying the applicant of the application status.

<b>2.2</b>: A loan application is approved if it passes two checks: (i) the applicant’s loan risk assessment, done automatically by a system, and (ii) the appraisal of the property for which the loan has been asked, carried out by a property appraiser. The risk assessment requires a credit history check on the applicant, which is performed by a financial officer. Once both the loan risk assessment and the property appraisal have been performed, a loan officer can assess the applicant’s eligibility. If the applicant is not eligible, the application is rejected, otherwise the acceptance pack is prepared and sent to the applicant.

<b>2.3</b>: A loan application may be coupled with a home insurance which is offered at discounted prices. The applicants may express their interest in a home insurance plan at the time of submitting their loan application to the loan provider. Based on this information, if the loan application is approved, the loan provider may either only send an acceptance pack to the applicant, or also send a home insurance quote. The process then continues with the verification of the repayment agreement.



In [None]:
process_description_3_1 = "Once a loan application has been approved by the loan provider, an acceptance pack is prepared and sent to the customer. The acceptance pack includes a repayment schedule which the customer needs to agree upon by sending the signed documents back to the loan provider. The latter then verifies the repayment agreement: if the applicant disagreed with the repayment schedule, the loan provider cancels the application; if the applicant agreed, the loan provider approves the application. In either case, the process completes with the loan provider notifying the applicant of the application status."
process_description_3_2 = "A loan application is approved if it passes two checks: (i) the applicant’s loan risk assessment, done automatically by a system, and (ii) the appraisal of the property for which the loan has been asked, carried out by a property appraiser. The risk assessment requires a credit history check on the applicant, which is performed by a financial officer. Once both the loan risk assessment and the property appraisal have been performed, a loan officer can assess the applicant’s eligibility. If the applicant is not eligible, the application is rejected, otherwise the acceptance pack is prepared and sent to the applicant."
process_description_3_3 = "A loan application may be coupled with a home insurance which is offered at discounted prices. The applicants may express their interest in a home insurance plan at the time of submitting their loan application to the loan provider. Based on this information, if the loan application is approved, the loan provider may either only send an acceptance pack to the applicant, or also send a home insurance quote. The process then continues with the verification of the repayment agreement."


In [None]:
text_to_intermediate_prompt = """
create a BPMN process model for the process description that I’ll give to you. Do not consider tasks of external parties.
Use the following notation for control-flow constructs in your output:
1. Tasks, i.e., the basic construct, represent tasks as words in a verb-object style, e.g., receive order, when possible.
2. (Potentially nested) constructs:
2.1 Sequences denoted as ->(construct1, construct2, ...), which means that construct1 is followed by construct2 and construct2 is followed by ...
2.2 XOR construct as XOR(construct1, construct2, ...), which resembles XOR gateways. In case of XOR, provide me with the condition of using its elements using this notation: XOR([condition]construct1, [condition]construct2, ...).
2.3 OR construct OR(construct1, construct2, ...), which resembles OR gateways.
2.4 AND construct AND(construct1, construct2, ...), which resembles AND gateways.
Do not include any line breaks or textual explanation in your output and stick to the provided notation.
"""

# The prompt below can be used to play around with the LLM's capabilities to generate actual BPMN-XML
# (which does not really work well yet, as explained above).
text_to_bpmn_prompt = "Convert the given description into a BPMN diagram and provide the XML code for it. Provide only the code and nothing else."

In [None]:
def text_to_model(description, model="gpt-3.5-turbo", xml=False):
    # Pick the instruction first, then always append the description; in a single
    # conditional expression, the XML prompt would be sent without the description
    instruction = text_to_bpmn_prompt if xml else text_to_intermediate_prompt
    prompt = instruction + "\n Here is the description:\n" + description
    import openai
    openai.api_key = os.environ["OPENAI_API_KEY"]
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(model=model, messages=messages, temperature=0)
    return response["choices"][0]["message"]["content"]
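Because the intermediate notation is a small nested expression language, an answer in this notation can be turned into a tree for further processing, e.g., for comparison with a reference model. The following is a minimal sketch of such a parser; the function names and the dictionary-based tree format are assumptions, and the sketch does not handle malformed output or conditions that themselves contain a closing bracket.

```python
OPERATORS = ("->", "XOR", "AND", "OR")


def split_top_level(text):
    """Split on commas that are not nested inside parentheses or brackets."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_construct(text):
    """Parse one construct of the intermediate notation into a nested dict."""
    text = text.strip()
    condition = None
    if text.startswith("["):
        # A leading [condition] belongs to a branch of an XOR construct
        end = text.index("]")
        condition = text[1:end].strip()
        text = text[end + 1:].strip()
    for op in OPERATORS:
        if text.startswith(op + "(") and text.endswith(")"):
            inner = text[len(op) + 1:-1]
            children = [parse_construct(part) for part in split_top_level(inner)]
            node = {"type": op, "children": children}
            break
    else:
        # Anything that is not an operator expression is treated as a task
        node = {"type": "task", "label": text}
    if condition is not None:
        node["condition"] = condition
    return node


# Hand-written example in the notation, not actual model output
example = ("->(prepare acceptance pack, XOR([applicant disagreed]cancel application, "
           "[applicant agreed]approve application), notify applicant)")
tree = parse_construct(example)
print(tree["type"], [child["type"] for child in tree["children"]])
```

A tree like this could, for instance, be traversed to count gateways or to list all tasks, which makes the comparison with the expert model less dependent on reading the raw string.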

This is the model provided by a human expert:
<img src="https://raw.githubusercontent.com/a-rebmann/nlp4bpa/main/content/exercise_3_1.png" alt="Exercise 3.1" />

This is what the GPT comes up with:

In [None]:
ans_sketch = text_to_model(process_description_3_1)
print(ans_sketch)

This is the model provided by a human expert:
<img src="https://raw.githubusercontent.com/a-rebmann/nlp4bpa/main/content/exercise_3_2.png" alt="Exercise 3.2" />

This is what the GPT comes up with:

In [None]:
ans_sketch = text_to_model(process_description_3_2)
print(ans_sketch)

This is the model provided by a human expert:
<img src="https://raw.githubusercontent.com/a-rebmann/nlp4bpa/main/content/exercise_3_3.png" alt="Exercise 3.3" />

This is what the GPT comes up with:

In [None]:
ans_sketch = text_to_model(process_description_3_3)
print(ans_sketch)

## 3. Future of NLP for BPA

A conceptual architecture of a **Large Process Model (LPM)**:

<img src="content/LPM_architecture.png" alt="LPM" />
Kampik et al., 2023

<b>Exercise 4: Data connection exercise</b> - Ask a Company Process Bot about the internal processes

Your function should do the following:

* **Document Loading:** Read the example_process_descriptions.txt file inside the content folder
* **Splitting:** Split this into chunks (you choose the size)
* **Storage:** Write this to a ChromaDB Vector Store
* **Retrieval:** Use contextual compression to return only the portion of the document that is relevant to the question

<img src="content/exercise_4.png" alt="Exercise 4" />

In [2]:
# import
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings

from langchain.chat_models import ChatOpenAI
from langchain.retrievers.document_compressors import LLMChainExtractor 
from langchain.retrievers import ContextualCompressionRetriever

# PART ONE
# Load "content/example_process_descriptions.txt" into Document objects
# (Colab path; use "content/example_process_descriptions.txt" when running locally)
loader = TextLoader("/content/nlp4bpa/content/example_process_descriptions.txt")
documents = loader.load()

# PART TWO
# Split the document into chunks (you choose how and what size)
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=50)
docs = text_splitter.split_documents(documents)

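# PART THREE
# Embed the documents (now in chunks) and store them in a Chroma vector store
# (in memory here; pass `persist_directory` to `from_documents` to persist it)
# Alternative embedding: SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
embedding_function = OpenAIEmbeddings()
db = Chroma.from_documents(docs, embedding_function)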

#PART FOUR 
# query it
query = "What is the first step in the claim handling process?"
llm = ChatOpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, 
                                                       base_retriever=db.as_retriever())
compressed_docs = compression_retriever.get_relevant_documents(query)
compressed_docs[0].page_content  # compressed extract of the best-matching chunk

Created a chunk of size 1047, which is longer than the specified 200
Created a chunk of size 858, which is longer than the specified 200
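The two messages above indicate that some chunks ended up longer than the requested 200 characters. To our understanding, this happens because `CharacterTextSplitter` only cuts at its separator (a blank line by default), so passages without such a separator stay in one piece. For comparison, here is a minimal sketch of strict fixed-size chunking with overlap; it is a simplified stand-in, not LangChain's implementation, and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=50):
    """Cut text into windows of at most chunk_size characters,
    where consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


# Synthetic text of 500 characters for demonstration
demo_text = "".join(str(i % 10) for i in range(500))
demo_chunks = chunk_text(demo_text)
print([len(chunk) for chunk in demo_chunks])
```

Cutting at fixed positions guarantees the size limit but may split sentences in the middle, so a separator-aware splitter with smaller separators is usually the better compromise for process descriptions.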
