<table align="center">

  <td align="center"><a target="_blank" href="https://colab.research.google.com/github/a-rebmann/nlp4bpa/blob/main/nlp4bpa_tutorial_2023.ipynb">
        <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# NLP for BPA - Hands-on Exercises

## Outline

### 1. Event log and Process Model Analysis
    1.1 Importing and analyzing an event log with GPTs and pm4py
    1.2 Custom task: action and object extraction from activity labels using LLMs
    1.3 Potential use cases of the custom task

### 2. Analyzing textual process descriptions
    Imperative model extraction from text
    
### 3. Future of NLP for BPA

#### API key

In [2]:
import os
os.environ["OPENAI_API_KEY"] = ""

#### Required installs

In [3]:
# Required installs
!pip install -q pm4py==2.7.3
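# The code in this notebook uses the pre-1.0 openai interface (openai.ChatCompletion), so we pin the version
!pip install -q "openai<1"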

### 1.1 Importing and analyzing an event log with GPTs and pm4py

In this part, we will use a [Generative Pre-trained Transformer (GPT)](https://openai.com) and [pm4py](https://pm4py.fit.fraunhofer.de) to analyze a real-life event log. 

<small>
Alessandro Berti, Daniel Schuster, and Wil M. P. van der Aalst: Abstractions, Scenarios, and Prompt Definitions for Process Mining with LLMs: A Case Study. In: BPM 2023 Workshops.
</small>

#### Importing an event log (Google Colab)

In [None]:
!git clone https://github.com/a-rebmann/nlp4bpa

In [None]:
# Importing an example event log
import pm4py
travel_event_log = pm4py.read_xes("/content/nlp4bpa/content/PermitLog_small.xes")

In [None]:
travel_event_log

#### Importing an event log (locally)

In [4]:
# Importing an example event log
import pm4py
travel_event_log = pm4py.read_xes("content/PermitLog_small.xes")




In [5]:
travel_event_log

Unnamed: 0,id,org:resource,concept:name,time:timestamp,org:role,case:OrganizationalEntity,case:ProjectNumber,case:TaskNumber,case:dec_id_0,case:ActivityNumber,...,case:Cost Type_14,case:Cost Type_10,case:Cost Type_11,case:Cost Type_12,case:Task_5,case:Task_4,case:Task_9,case:Task_8,case:Task_7,case:Task_6
0,st_step 76310_0,STAFF MEMBER,Permit SUBMITTED by EMPLOYEE,2017-01-10 07:19:45+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,
1,st_step 76311_0,STAFF MEMBER,Permit FINAL_APPROVED by SUPERVISOR,2017-01-10 07:19:48+00:00,SUPERVISOR,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,
2,rv_travel permit 76305_6,STAFF MEMBER,Start trip,2017-01-18 23:00:00+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,
3,rv_travel permit 76305_7,STAFF MEMBER,End trip,2017-01-19 23:00:00+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,
4,st_step 76312_0,STAFF MEMBER,Declaration SUBMITTED by EMPLOYEE,2017-01-24 08:18:42+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8335,st_step 18822_0,STAFF MEMBER,Permit APPROVED by SUPERVISOR,2019-01-02 10:03:24+00:00,SUPERVISOR,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,,
8336,st_step 18820_0,STAFF MEMBER,Permit FINAL_APPROVED by DIRECTOR,2019-01-09 08:26:24+00:00,DIRECTOR,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,,
8337,rv_travel permit 18817_7,STAFF MEMBER,End trip,2019-04-14 22:00:00+00:00,EMPLOYEE,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,,
8338,rv_travel permit 18817_27,SYSTEM,Send Reminder,2019-06-01 04:00:52+00:00,UNDEFINED,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,,


#### Describing the process captured in an event log

In [6]:
ans_desc = pm4py.openai.describe_process(travel_event_log, openai_model="gpt-3.5-turbo", api_key=os.environ["OPENAI_API_KEY"])
print(ans_desc)

Based on the provided flow, the process involves various steps and actions. The description of the process can be summarized as follows:

1. The process starts with a "Request Payment" action, which is followed by the handling of the payment.
2. A trip is initiated by the "Start trip" action and ends with the "End trip" action.
3. An employee submits a "Permit" request, which is then approved by the administration. This approval is further approved by a supervisor, and ultimately, the permit is finalized by a supervisor.
4. A declaration is submitted by an employee, which is then approved by the administration. This approval is further approved by a budget owner, and ultimately, the declaration is finalized by a supervisor.
5. The process involves interactions related to "Request For Payment," including submission, approval, and final approval by a supervisor.
6. The process also includes actions related to the rejection of declarations and request for payments by an employee and the a

This result is based on the following query, which abstracts the log to a directly-follows graph that is in turn described textually.

In [7]:
from pm4py.algo.querying.openai import log_to_dfg_descr
d_query = log_to_dfg_descr.apply(travel_event_log, parameters={})
d_query += "can you provide a description of the process?"
print(d_query)

If I have a process with flow:

Request Payment -> Payment Handled ( frequency = 681  performance = 286966.174743025 )
Start trip -> End trip ( frequency = 601  performance = 458463.8935108153 )
Permit SUBMITTED by EMPLOYEE -> Permit APPROVED by ADMINISTRATION ( frequency = 549  performance = 52210.22222222222 )
Declaration FINAL_APPROVED by SUPERVISOR -> Request Payment ( frequency = 505  performance = 271652.11881188117 )
Declaration SUBMITTED by EMPLOYEE -> Declaration APPROVED by ADMINISTRATION ( frequency = 442  performance = 84768.30542986425 )
End trip -> Declaration SUBMITTED by EMPLOYEE ( frequency = 409  performance = 978732.7701711492 )
Permit FINAL_APPROVED by SUPERVISOR -> Start trip ( frequency = 391  performance = 2810881.997442455 )
Permit APPROVED by ADMINISTRATION -> Permit FINAL_APPROVED by SUPERVISOR ( frequency = 290  performance = 172946.19310344828 )
Declaration APPROVED by ADMINISTRATION -> Declaration FINAL_APPROVED by SUPERVISOR ( frequency = 256  performance 
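
To make the abstraction step more tangible, the following is a minimal sketch of how a directly-follows graph with frequencies and mean durations could be derived and turned into text. It is not pm4py's implementation; the toy events, the function names `directly_follows` and `describe`, and the reading of "performance" as the mean number of seconds between two activities are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical toy log: (case id, activity, timestamp); not taken from the permit log.
toy_events = [
    ("c1", "Start trip", datetime(2017, 1, 18, 23)),
    ("c1", "End trip", datetime(2017, 1, 19, 23)),
    ("c1", "Declaration SUBMITTED by EMPLOYEE", datetime(2017, 1, 24, 8)),
    ("c2", "Start trip", datetime(2017, 2, 1, 23)),
    ("c2", "End trip", datetime(2017, 2, 3, 23)),
]

def directly_follows(events):
    """Return {(a, b): (frequency, mean seconds between a and b)}."""
    traces = defaultdict(list)
    for case_id, activity, timestamp in events:
        traces[case_id].append((timestamp, activity))
    durations = defaultdict(list)
    for trace in traces.values():
        trace.sort()  # order the events of each case by timestamp
        for (t1, a), (t2, b) in zip(trace, trace[1:]):
            durations[(a, b)].append((t2 - t1).total_seconds())
    return {pair: (len(d), sum(d) / len(d)) for pair, d in durations.items()}

def describe(dfg):
    """Render the graph as text, most frequent edges first."""
    edges = sorted(dfg.items(), key=lambda kv: -kv[1][0])
    lines = [f"{a} -> {b} ( frequency = {freq}  performance = {perf} )"
             for (a, b), (freq, perf) in edges]
    return "If I have a process with flow:\n\n" + "\n".join(lines)

toy_dfg = directly_follows(toy_events)
print(describe(toy_dfg))
```

The textual form keeps the prompt compact: the LLM only sees aggregated edges rather than thousands of individual events.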

#### Checking for potentially undesired behavior

In [8]:
ans_ad = pm4py.openai.anomaly_detection(travel_event_log, openai_model="gpt-3.5-turbo", api_key=os.environ["OPENAI_API_KEY"])
print(ans_ad)

Based on the provided process variants, the main anomalies can be identified as follows:

1. Permit SUBMITTED by EMPLOYEE -> Permit APPROVED by ADMINISTRATION -> Permit APPROVED by BUDGET OWNER -> Permit FINAL_APPROVED by SUPERVISOR -> Start trip -> End trip -> Send Reminder -> Send Reminder:
This anomaly involves sending multiple reminders, which indicates inefficiency and a potential delay in the process.

2. Permit SUBMITTED by EMPLOYEE -> Permit APPROVED by ADMINISTRATION -> Start trip -> Permit FINAL_APPROVED by SUPERVISOR -> End trip -> Declaration SUBMITTED by EMPLOYEE -> Declaration APPROVED by ADMINISTRATION -> Declaration FINAL_APPROVED by SUPERVISOR -> Request Payment -> Payment Handled:
This anomaly involves an out-of-order approval of the permit. The permit should ideally be approved by the administration before the start of the trip.

3. Permit SUBMITTED by EMPLOYEE -> Permit APPROVED by ADMINISTRATION -> Permit FINAL_APPROVED by SUPERVISOR -> Start trip -> End trip -> Se

This result is obtained by abstracting the event log to trace variants:

In [9]:
from pm4py.algo.querying.openai import log_to_variants_descr
a_query = log_to_variants_descr.apply(travel_event_log, parameters={})
a_query += "what are the main anomalies? An anomaly involves a strange ordering of the activities, or a significant amount of rework. Please only data and process specific considerations, not general considerations. Please sort the anomalies based on their seriousness."
print(a_query)

If I have a process with the following process variants:

 Permit SUBMITTED by EMPLOYEE -> Permit APPROVED by ADMINISTRATION -> Permit FINAL_APPROVED by SUPERVISOR -> Start trip -> End trip -> Declaration SUBMITTED by EMPLOYEE -> Declaration APPROVED by ADMINISTRATION -> Declaration FINAL_APPROVED by SUPERVISOR -> Request Payment -> Payment Handled ( frequency = 86  performance = 4588332.023255814 )
 Permit SUBMITTED by EMPLOYEE -> Permit APPROVED by ADMINISTRATION -> Permit APPROVED by BUDGET OWNER -> Permit FINAL_APPROVED by SUPERVISOR -> Start trip -> End trip -> Declaration SUBMITTED by EMPLOYEE -> Declaration APPROVED by ADMINISTRATION -> Declaration APPROVED by BUDGET OWNER -> Declaration FINAL_APPROVED by SUPERVISOR -> Request Payment -> Payment Handled ( frequency = 46  performance = 5932601.934782608 )
 Permit SUBMITTED by EMPLOYEE -> Permit APPROVED by ADMINISTRATION -> Permit FINAL_APPROVED by SUPERVISOR -> Start trip -> End trip -> Send Reminder -> Send Reminder ( frequency
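
The variant abstraction can be sketched in a few lines of plain Python. This is an illustrative approximation, not the pm4py code: the helper names `trace_variants` and `describe_variants` and the toy events are hypothetical, and the performance value is left out for brevity.

```python
from collections import Counter, defaultdict

def trace_variants(events):
    """events: iterable of (case id, activity, sortable timestamp)."""
    traces = defaultdict(list)
    for case_id, activity, timestamp in events:
        traces[case_id].append((timestamp, activity))
    # A variant is the ordered sequence of activities of a case.
    return Counter(tuple(a for _, a in sorted(trace)) for trace in traces.values())

def describe_variants(variants):
    """Render the variants as text, most frequent first."""
    lines = [f" {' -> '.join(v)} ( frequency = {n} )" for v, n in variants.most_common()]
    return "If I have a process with the following process variants:\n\n" + "\n".join(lines)

# Hypothetical toy log with integer timestamps.
toy_events = [
    ("c1", "Start trip", 1), ("c1", "End trip", 2),
    ("c2", "Start trip", 1), ("c2", "End trip", 2),
    ("c3", "Start trip", 1), ("c3", "End trip", 2), ("c3", "Send Reminder", 3),
]
toy_variants = trace_variants(toy_events)
print(describe_variants(toy_variants))
```

Compared to the directly-follows graph, variants preserve complete orderings per case, which is why they suit questions about unusual orderings and rework.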

### 1.2 Implementing a custom task using GPTs

In this part, we will focus on an NLP task in the context of business process analysis.
We will show how such a task can be implemented using LLMs without any fine-tuning.

As an example, we will look at the extraction of business objects and actions from event labels.
The automated analysis of such labels, based on both traditional NLP techniques and transformers, has been actively researched. It enables many downstream pre-processing tasks, such as the cleaning/standardization of activity labels and the automated assessment of the type of activity that is performed.

In [10]:
task_prompt = """You are an expert activity label tagger system. 
Your task is to accept activity labels such as 'create purchase order' as input and provide a list of pairs, where each pair consists of the main action and the object it is applied on. 
For 'create purchase order', you would return [('create', 'purchase order')] and for 'purchase order' [('', 'purchase order')]. 
If actions are not provided as verbs, change them into verbs, e.g., for 'purchase order creation' you would hence return ('create', 'purchase order') as well. 
Also turn past tense actions into present tense ones, e.g., 'purchase order created' becomes ('create', 'purchase order') too. 
If multiple actions are applied to the same object, split this into multiple pairs, e.g., 'create and send purchase order' becomes [('create', 'purchase order'), ('send', 'purchase order')]
If there is additional information, e.g., about who is performing the action or about an IT system that is involved, discard that. 
If there are any special characters, just replace them with whitespace. 

Under no circumstances (!) put any other text in your answer, only a (possibly empty) list of pairs with nothing before or after. 
In each pair the (optional) action comes first, followed by the object (if any).
If the activity label does not contain any actions, return an empty list, i.e., []

Here is the activity label that shall be tagged.
Text:
"""

def extract_object_action_pairs_from_label(label, model="gpt-3.5-turbo"):
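    import ast
    import openai
    # Wrap the lower-cased label in single quotes and append it to the task prompt.
    label_prompt = task_prompt + " '" + label.lower() + "'"
    openai.api_key = os.environ["OPENAI_API_KEY"]
    messages = [{"role": "user", "content": label_prompt}]
    # temperature=0 keeps the answers as reproducible as possible.
    response = openai.ChatCompletion.create(model=model, messages=messages, temperature=0)
    res = response["choices"][0]["message"]["content"]
    # Parse the returned list literal without executing arbitrary code (unlike eval).
    return ast.literal_eval(res.strip())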
    
extract_object_action_pairs_from_label("notification letter creation and approval")

[('create', 'notification letter'), ('approve', 'notification letter')]
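
The model's reply is plain text that has to be turned into a Python list, and nothing guarantees that it always follows the requested format. A more defensive parser could look like the sketch below; `parse_pairs` is a hypothetical helper that is not part of the notebook, and returning an empty list for malformed replies is one possible design choice among several.

```python
import ast

def parse_pairs(reply):
    """Parse a model reply into a list of (action, object) string pairs.

    Returns an empty list when the reply is not a well-formed list of pairs.
    """
    try:
        # literal_eval only accepts Python literals, so no code is executed.
        parsed = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError, TypeError):
        return []
    # Accept a single bare pair such as ('create', 'purchase order') as well.
    if isinstance(parsed, tuple) and len(parsed) == 2 and all(isinstance(p, str) for p in parsed):
        parsed = [parsed]
    if not isinstance(parsed, list):
        return []
    pairs = []
    for item in parsed:
        if (isinstance(item, (tuple, list)) and len(item) == 2
                and all(isinstance(part, str) for part in item)):
            pairs.append((item[0].strip(), item[1].strip()))
    return pairs

print(parse_pairs("[('create', 'notification letter'), ('approve', 'notification letter')]"))
print(parse_pairs("Sure! Here is the list you asked for."))
```

Silently dropping malformed replies keeps a batch run going; logging them instead would make it easier to spot labels that need a better prompt.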

### 1.3 Exemplary use case of the custom task

Next, we use the implementation of our custom task for preprocessing our event log.

#### Inspecting the unique labels of the event log

In [11]:
unique_labels = set(travel_event_log["concept:name"].unique())
print(len(unique_labels), "unique activity labels are in the log.")
unique_labels

46 unique activity labels are in the log.


{'Declaration APPROVED by ADMINISTRATION',
 'Declaration APPROVED by BUDGET OWNER',
 'Declaration APPROVED by PRE_APPROVER',
 'Declaration APPROVED by SUPERVISOR',
 'Declaration FINAL_APPROVED by DIRECTOR',
 'Declaration FINAL_APPROVED by SUPERVISOR',
 'Declaration REJECTED by ADMINISTRATION',
 'Declaration REJECTED by BUDGET OWNER',
 'Declaration REJECTED by EMPLOYEE',
 'Declaration REJECTED by MISSING',
 'Declaration REJECTED by PRE_APPROVER',
 'Declaration REJECTED by SUPERVISOR',
 'Declaration SAVED by EMPLOYEE',
 'Declaration SUBMITTED by EMPLOYEE',
 'End trip',
 'Payment Handled',
 'Permit APPROVED by ADMINISTRATION',
 'Permit APPROVED by BUDGET OWNER',
 'Permit APPROVED by PRE_APPROVER',
 'Permit APPROVED by SUPERVISOR',
 'Permit FINAL_APPROVED by DIRECTOR',
 'Permit FINAL_APPROVED by SUPERVISOR',
 'Permit REJECTED by ADMINISTRATION',
 'Permit REJECTED by BUDGET OWNER',
 'Permit REJECTED by EMPLOYEE',
 'Permit REJECTED by MISSING',
 'Permit REJECTED by PRE_APPROVER',
 'Permit RE

#### Applying the function to the event log
Next, we apply the custom function to the distinct event labels of our event log and create a mapping from each original label to a new label that only consists of an action applied to an object.

In [12]:
label_mapping = {}
for label in unique_labels:
    processed_label = extract_object_action_pairs_from_label(label)
    print(label, "->" , processed_label)
    label_mapping[label] = " ".join(processed_label[0] if len(processed_label) > 0 else "")

Permit REJECTED by EMPLOYEE -> [('reject', 'permit')]
Request For Payment SAVED by EMPLOYEE -> [('save', 'request for payment')]
Declaration REJECTED by EMPLOYEE -> [('reject', 'declaration')]
Permit APPROVED by PRE_APPROVER -> [('approve', 'permit')]
Request For Payment APPROVED by ADMINISTRATION -> [('approve', 'request for payment')]
Declaration REJECTED by PRE_APPROVER -> [('reject', 'declaration')]
Permit REJECTED by SUPERVISOR -> [('reject', 'permit')]
Request Payment -> [('request', 'payment')]
Declaration FINAL_APPROVED by DIRECTOR -> [('approve', 'declaration final'), ('direct', 'declaration final')]
Start trip -> [('start', 'trip')]
Declaration REJECTED by MISSING -> []
Permit FINAL_APPROVED by SUPERVISOR -> [('permit', 'final approved')]
Declaration APPROVED by BUDGET OWNER -> [('approve', 'declaration')]
Permit SUBMITTED by EMPLOYEE -> [('submit', 'permit')]
Declaration SUBMITTED by EMPLOYEE -> [('submit', 'declaration')]
Request For Payment APPROVED by SUPERVISOR -> [('app

In [13]:
label_mapping

{'Permit REJECTED by EMPLOYEE': 'reject permit',
 'Request For Payment SAVED by EMPLOYEE': 'save request for payment',
 'Declaration REJECTED by EMPLOYEE': 'reject declaration',
 'Permit APPROVED by PRE_APPROVER': 'approve permit',
 'Request For Payment APPROVED by ADMINISTRATION': 'approve request for payment',
 'Declaration REJECTED by PRE_APPROVER': 'reject declaration',
 'Permit REJECTED by SUPERVISOR': 'reject permit',
 'Request Payment': 'request payment',
 'Declaration FINAL_APPROVED by DIRECTOR': 'approve declaration final',
 'Start trip': 'start trip',
 'Declaration REJECTED by MISSING': '',
 'Permit FINAL_APPROVED by SUPERVISOR': 'permit final approved',
 'Declaration APPROVED by BUDGET OWNER': 'approve declaration',
 'Permit SUBMITTED by EMPLOYEE': 'submit permit',
 'Declaration SUBMITTED by EMPLOYEE': 'submit declaration',
 'Request For Payment APPROVED by SUPERVISOR': 'approve request for payment',
 'Request For Payment APPROVED by BUDGET OWNER': 'approve request for payme

Note that the GPT model did not process all labels properly. For example, 'Permit FINAL_APPROVED by SUPERVISOR' was returned as ('permit', 'final approved'). One option is to extend the prompt with more examples.
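
One way to supply additional examples is to pass them as earlier turns of the conversation instead of packing everything into a single prompt string. The sketch below only builds the message list; `build_few_shot_messages` and the example pairs are hypothetical, and whether these examples actually improve the results on this log has not been verified here.

```python
# Hypothetical extra examples; the label/answer pairs are illustrative only.
FEW_SHOT_EXAMPLES = [
    ("Declaration FINAL_APPROVED by DIRECTOR", [("approve", "declaration")]),
    ("Permit SAVED by EMPLOYEE", [("save", "permit")]),
]

def build_few_shot_messages(task_prompt, label, examples=FEW_SHOT_EXAMPLES):
    """Build a chat message list with one user/assistant turn per example."""
    messages = [{"role": "system", "content": task_prompt}]
    for example_label, pairs in examples:
        messages.append({"role": "user", "content": "'" + example_label.lower() + "'"})
        # The assistant turn shows the exact output format we expect back.
        messages.append({"role": "assistant", "content": repr(pairs)})
    messages.append({"role": "user", "content": "'" + label.lower() + "'"})
    return messages

few_shot = build_few_shot_messages("You are an expert activity label tagger system.",
                                   "Permit FINAL_APPROVED by SUPERVISOR")
for message in few_shot:
    print(message["role"], ":", message["content"])
```

The resulting list can be passed as the `messages` argument of the chat completion call used in `extract_object_action_pairs_from_label`.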

Still, we may improve performance by using dedicated techniques, e.g., an activity-label tagger based on a pretrained language model that was specifically fine-tuned for the task at hand; see [here](https://hanvanderaa.com/wp-content/uploads/2022/08/IS2022-Enabling-semantics-aware-process-mining-through-the-automated-annotation-of-event-logs.pdf) for details and [here](https://gitlab.uni-mannheim.de/processanalytics/label-tagger) for an easy-to-use Python package wrapping such a dedicated label tagger.

##### Installing and using a dedicated label tagger
Note that this will download a relatively large model (500MB), which takes too long in the context of the live session.

In [None]:
!git clone https://gitlab.uni-mannheim.de/processanalytics/label-tagger.git
!pip install "/content/label-tagger/."
!python -m spacy download en_core_web_sm

In [None]:
from label_tagger.tagger import LabelTagger
labels = ["create purchase order", "Archive invoice", "send request to customer"]
tagger = LabelTagger()
tagger.tag_list_of_labels(labels)

#### Replacing the labels in the event log
Having established this mapping, we can replace the original event labels with the ones obtained through our custom function.
In this manner, we obtain simpler/standardized labels and reduce the total number of labels in the event log. The latter can in turn lead to simpler process models (due to fewer nodes) when applying process discovery.
Also note that the role information (e.g., EMPLOYEE, SUPERVISOR, ADMINISTRATION) that we omitted from the original labels in the standardization process is still available in the log, because it was already captured in the dedicated <code>org:role</code> attribute.

In [14]:
travel_event_log["concept:name"] = travel_event_log["concept:name"].apply(lambda x: x if label_mapping[x] == '' else label_mapping[x])

In [15]:
travel_event_log

Unnamed: 0,id,org:resource,concept:name,time:timestamp,org:role,case:OrganizationalEntity,case:ProjectNumber,case:TaskNumber,case:dec_id_0,case:ActivityNumber,...,case:Cost Type_10,case:Cost Type_11,case:Cost Type_12,case:Task_5,case:Task_4,case:Task_9,case:Task_8,case:Task_7,case:Task_6,start_timestamp
0,st_step 76310_0,STAFF MEMBER,submit permit,2017-01-10 07:19:45+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,2017-01-10 07:19:45+00:00
1,st_step 76311_0,STAFF MEMBER,permit final approved,2017-01-10 07:19:48+00:00,SUPERVISOR,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,2017-01-10 07:19:48+00:00
2,rv_travel permit 76305_6,STAFF MEMBER,start trip,2017-01-18 23:00:00+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,2017-01-18 23:00:00+00:00
3,rv_travel permit 76305_7,STAFF MEMBER,end trip,2017-01-19 23:00:00+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,2017-01-19 23:00:00+00:00
4,st_step 76312_0,STAFF MEMBER,submit declaration,2017-01-24 08:18:42+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,,2017-01-24 08:18:42+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8335,st_step 18822_0,STAFF MEMBER,approve permit,2019-01-02 10:03:24+00:00,SUPERVISOR,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,,2019-01-02 10:03:24+00:00
8336,st_step 18820_0,STAFF MEMBER,permit final approved,2019-01-09 08:26:24+00:00,DIRECTOR,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,,2019-01-09 08:26:24+00:00
8337,rv_travel permit 18817_7,STAFF MEMBER,end trip,2019-04-14 22:00:00+00:00,EMPLOYEE,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,,2019-04-14 22:00:00+00:00
8338,rv_travel permit 18817_27,SYSTEM,send reminder,2019-06-01 04:00:52+00:00,UNDEFINED,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,,2019-06-01 04:00:52+00:00


In [16]:
unique_labels_preprocessed = set(travel_event_log["concept:name"].unique())
print(len(unique_labels_preprocessed), "unique preprocessed activity labels are in the log.")
unique_labels_preprocessed

24 unique preprocessed activity labels are in the log.


{'Declaration REJECTED by MISSING',
 'Permit REJECTED by MISSING',
 'approve declaration',
 'approve declaration final',
 'approve permit',
 'approve request for payment',
 'declaration final approved supervisor',
 'end trip',
 'handle payment',
 'permit final approved',
 'permit saved employee',
 'reject declaration',
 'reject permit',
 'reject request for payment',
 'request for payment final approved director',
 'request for payment final approved supervisor',
 'request payment',
 'save declaration',
 'save request for payment',
 'send reminder',
 'start trip',
 'submit declaration',
 'submit permit',
 'submit request for payment'}
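
To see which original labels were merged into the same standardized label, the mapping can be inverted. The following is a small sketch under the assumption that the mapping has the shape of `label_mapping` above (original label to new label, with an empty string when no pair was extracted); `group_by_new_label` is a hypothetical helper, and the example uses only a short excerpt of the mapping.

```python
from collections import defaultdict

def group_by_new_label(label_mapping):
    """Invert {original: new} into {new: sorted originals}; '' keeps the original label."""
    groups = defaultdict(list)
    for original, new in label_mapping.items():
        groups[new or original].append(original)
    return {new: sorted(originals) for new, originals in groups.items()}

# Small excerpt of the mapping printed earlier in this notebook.
example_mapping = {
    "Permit REJECTED by EMPLOYEE": "reject permit",
    "Permit REJECTED by SUPERVISOR": "reject permit",
    "Declaration REJECTED by MISSING": "",
    "Start trip": "start trip",
}
merged = group_by_new_label(example_mapping)
for new, originals in merged.items():
    print(new, "<-", originals)
```

Groups with several original labels show where the standardization reduced the label count; groups that still carry an original label point to cases the model could not process.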

Given the new/standardized event labels of our event log, we next want to acquire more information about the type of the activities executed in the process underlying the event log.

Knowing the type of an activity, e.g., whether it captures the update of a request or the decision about the acceptance of a request, can be helpful for other analysis tasks.
For instance, in conformance checking we may want to assign a higher severity to conformance violations if these involve decisions than if they involve updates.

We can create a prompt that describes this task and define a function to wrap it as follows:

In [20]:
task_2_prompt = """You are an expert activity label classification system. 
Your task is to accept activity labels such as 'create purchase order' as input and provide the 
category of the activity as output. The categories are 'decide', 'create', 'update', 'delete'.
Do not put any other text in your answer, only a category with nothing before or after. 
If the activity label does not correspond to any of these categories, return 'other'.

Here is the activity label that shall be tagged.
Text:
"""

def categorize_activity(label, model="gpt-3.5-turbo"):
    import openai
    # Wrap the lower-cased label in single quotes and append it to the task prompt.
    label_prompt = task_2_prompt + "'" + label.lower() + "'"
    openai.api_key = os.environ["OPENAI_API_KEY"]
    messages = [{"role": "user", "content": label_prompt}]
    response = openai.ChatCompletion.create(model=model, messages=messages, temperature=0)
    res = response["choices"][0]["message"]["content"]
    return res

categorize_activity("approve declaration")

'decide'

Of course, these activity types are not set in stone and can be adjusted according to your preferences and analysis purpose. For instance, having an activity type "communicate" may be beneficial for the analysis of handovers in the process.
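
Adjusting the categories is easier when the prompt is generated from a list rather than hard-coded. The sketch below is one possible way to do this; `build_category_prompt` and `normalize_category` are hypothetical helpers, and the wording mirrors the prompt used above. The second helper guards against replies that are not one of the allowed categories.

```python
DEFAULT_CATEGORIES = ("decide", "create", "update", "delete")

def build_category_prompt(categories=DEFAULT_CATEGORIES):
    """Generate the classification prompt for a configurable set of categories."""
    listed = ", ".join("'" + c + "'" for c in categories)
    return (
        "You are an expert activity label classification system.\n"
        "Your task is to accept activity labels such as 'create purchase order' as input "
        "and provide the category of the activity as output. "
        f"The categories are {listed}.\n"
        "Do not put any other text in your answer, only a category with nothing before or after.\n"
        "If the activity label does not correspond to any of these categories, return 'other'.\n\n"
        "Here is the activity label that shall be tagged.\nText:\n"
    )

def normalize_category(reply, categories=DEFAULT_CATEGORIES):
    """Map a raw model reply to a known category, falling back to 'other'."""
    cleaned = reply.strip().strip("'\"").lower()
    return cleaned if cleaned in categories else "other"

extended = DEFAULT_CATEGORIES + ("communicate",)
print(build_category_prompt(extended))
print(normalize_category(" 'Decide' "))
```

With the extended tuple, a label such as "send reminder" could be assigned to "communicate" instead of "other", provided the model follows the instruction.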

In [21]:
activity_categories = {}
for label in unique_labels_preprocessed:
    processed_label = categorize_activity(label)
    print(label, "->" , processed_label)
    activity_categories[label] = processed_label

request for payment final approved supervisor -> decide
submit permit -> create
permit final approved -> create
request for payment final approved director -> decide
start trip -> other
submit declaration -> create
approve permit -> decide
send reminder -> other
Declaration REJECTED by MISSING -> other
reject request for payment -> decide
permit saved employee -> create
save request for payment -> create
request payment -> create
end trip -> other
handle payment -> create
approve request for payment -> decide
reject declaration -> decide
approve declaration final -> decide
approve declaration -> decide
reject permit -> decide
Permit REJECTED by MISSING -> other
declaration final approved supervisor -> decide
submit request for payment -> create
save declaration -> create


In [22]:
travel_event_log["activity:category"] = travel_event_log["concept:name"].apply(lambda x: "" if x not in activity_categories else activity_categories[x])
travel_event_log

Unnamed: 0,id,org:resource,concept:name,time:timestamp,org:role,case:OrganizationalEntity,case:ProjectNumber,case:TaskNumber,case:dec_id_0,case:ActivityNumber,...,case:Cost Type_11,case:Cost Type_12,case:Task_5,case:Task_4,case:Task_9,case:Task_8,case:Task_7,case:Task_6,start_timestamp,activity:category
0,st_step 76310_0,STAFF MEMBER,submit permit,2017-01-10 07:19:45+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,2017-01-10 07:19:45+00:00,create
1,st_step 76311_0,STAFF MEMBER,permit final approved,2017-01-10 07:19:48+00:00,SUPERVISOR,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,2017-01-10 07:19:48+00:00,create
2,rv_travel permit 76305_6,STAFF MEMBER,start trip,2017-01-18 23:00:00+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,2017-01-18 23:00:00+00:00,other
3,rv_travel permit 76305_7,STAFF MEMBER,end trip,2017-01-19 23:00:00+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,2017-01-19 23:00:00+00:00,other
4,st_step 76312_0,STAFF MEMBER,submit declaration,2017-01-24 08:18:42+00:00,EMPLOYEE,organizational unit 65455,project 76307,task 427,declaration 76308,UNKNOWN,...,,,,,,,,,2017-01-24 08:18:42+00:00,create
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8335,st_step 18822_0,STAFF MEMBER,approve permit,2019-01-02 10:03:24+00:00,SUPERVISOR,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,2019-01-02 10:03:24+00:00,decide
8336,st_step 18820_0,STAFF MEMBER,permit final approved,2019-01-09 08:26:24+00:00,DIRECTOR,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,2019-01-09 08:26:24+00:00,create
8337,rv_travel permit 18817_7,STAFF MEMBER,end trip,2019-04-14 22:00:00+00:00,EMPLOYEE,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,2019-04-14 22:00:00+00:00,other
8338,rv_travel permit 18817_27,SYSTEM,send reminder,2019-06-01 04:00:52+00:00,UNDEFINED,organizational unit 65454,UNKNOWN,UNKNOWN,,UNKNOWN,...,,,,,,,,,2019-06-01 04:00:52+00:00,other
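
As motivated above, the activity category can be used to weight conformance violations. The following sketch ranks a list of violations by a per-category severity; the weights, the function name `rank_violations`, and the example violations are hypothetical and only illustrate the idea, while the three category assignments are taken from the output above.

```python
# Hypothetical severity weights per activity category; tune them to the analysis at hand.
SEVERITY_BY_CATEGORY = {"decide": 3, "create": 2, "delete": 2, "update": 1, "other": 1}

def rank_violations(violations, activity_categories, weights=SEVERITY_BY_CATEGORY):
    """violations: list of (case id, activity label). Returns them sorted by severity."""
    scored = []
    for case_id, activity in violations:
        category = activity_categories.get(activity, "other")
        scored.append((weights.get(category, 1), case_id, activity, category))
    # Highest severity first; ties keep their original order.
    return sorted(scored, key=lambda item: -item[0])

example_categories = {"approve permit": "decide", "save declaration": "create", "start trip": "other"}
example_violations = [("case 1", "start trip"), ("case 2", "approve permit"), ("case 3", "save declaration")]
ranked = rank_violations(example_violations, example_categories)
for severity, case_id, activity, category in ranked:
    print(severity, case_id, activity, category)
```

In the notebook, `activity_categories` could be passed directly as the second argument.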


## 2. Analyzing textual process descriptions

In this hands-on exercise we will have LLMs extract imperative process models from textual process descriptions.
We provide three examples of textual process descriptions taken from exercises in the [Fundamentals of Business Process Management Textbook](http://fundamentals-of-bpm.org) (Exercises 3.1-3.3). 

We will give these to a GPT model asking it to extract an imperative process model.
Technically, GPT models can generate markup output such as BPMN-XML. This is (currently) not practical, though: XML is verbose, so the output limit of GPT models prevents the direct generation of complete XML code even for small BPMN diagrams. Also, in several attempts (on small models), we did not obtain XML code that a (commercial) BPMN modeling tool could display.

Therefore, we will create a prompt that defines an intermediate notation that the LLM shall use to capture the imperative process model. This intermediate notation only captures tasks, arcs, and gateways (AND, XOR, and OR).

We give the prompt, including the textual process description, to the LLM and obtain a resulting "model". We then compare this result to a process model created by a human modeling expert for the respective description. Note that there are naturally multiple valid ways to model a process based on a natural language text.

<b>2.1</b>: Once a loan application has been approved by the loan provider, an acceptance pack is prepared and sent to the customer. The acceptance pack includes a repayment schedule which the customer needs to agree upon by sending the signed documents back to the loan provider. The latter then verifies the repayment agreement: if the applicant disagreed with the repayment schedule, the loan provider cancels the application; if the applicant agreed, the loan provider approves the application. In either case, the process completes with the loan provider notifying the applicant of the application status.

<b>2.2</b>: A loan application is approved if it passes two checks: (i) the applicant’s loan risk assessment, done automatically by a system, and (ii) the appraisal of the property for which the loan has been asked, carried out by a property appraiser. The risk assessment requires a credit history check on the applicant, which is performed by a financial officer. Once both the loan risk assessment and the property appraisal have been performed, a loan officer can assess the applicant’s eligibility. If the applicant is not eligible, the application is rejected, otherwise the acceptance pack is prepared and sent to the applicant.

<b>2.3</b>: A loan application may be coupled with a home insurance which is offered at discounted prices. The applicants may express their interest in a home insurance plan at the time of submitting their loan application to the loan provider. Based on this information, if the loan application is approved, the loan provider may either only send an acceptance pack to the applicant, or also send a home insurance quote. The process then continues with the verification of the repayment agreement.



In [24]:
process_description_3_1 = "Once a loan application has been approved by the loan provider, an acceptance pack is prepared and sent to the customer. The acceptance pack includes a repayment schedule which the customer needs to agree upon by sending the signed documents back to the loan provider. The latter then verifies the repayment agreement: if the applicant disagreed with the repayment schedule, the loan provider cancels the application; if the applicant agreed, the loan provider approves the application. In either case, the process completes with the loan provider notifying the applicant of the application status."
process_description_3_2 = "A loan application is approved if it passes two checks: (i) the applicant’s loan risk assessment, done automatically by a system, and (ii) the appraisal of the property for which the loan has been asked, carried out by a property appraiser. The risk assessment requires a credit history check on the applicant, which is performed by a financial officer. Once both the loan risk assessment and the property appraisal have been performed, a loan officer can assess the applicant’s eligibility. If the applicant is not eligible, the application is rejected, otherwise the acceptance pack is prepared and sent to the applicant."
process_description_3_3 = "A loan application may be coupled with a home insurance which is offered at discounted prices. The applicants may express their interest in a home insurance plan at the time of submitting their loan application to the loan provider. Based on this information, if the loan application is approved, the loan provider may either only send an acceptance pack to the applicant, or also send a home insurance quote. The process then continues with the verification of the repayment agreement."


In [41]:
text_to_intermediate_prompt = """
create a BPMN process model for the process description that I’ll give to you. Do not consider tasks of external parties.
Use the following notation for control-flow constructs in your output:
1. Tasks, i.e., the basic construct, represent tasks as words in a verb-object style, e.g., receive order, when possible.
2. (Potentially nested) constructs:
2.1 Sequences denoted as ->(construct1, construct2, ...), which means that construct1 is followed by construct2 and construct2 is followed by ...
2.2 XOR construct as XOR(construct1, construct2, ...), which resembles XOR gateways. In case of XOR, provide me with the condition of using its elements using this notation: XOR([condition]construct1, [condition]construct2, ...).
2.3 OR construct OR(construct1, construct2, ...), which resembles OR gateways.
2.4 AND construct AND(construct1, construct2, ...), which resembles AND gateways.
Do not include any line breaks or textual explanation in your output and stick to the provided notation.
"""

# The prompt below can be used to play around with the LLM's capability to generate actual BPMN-XML
# (which does not work well yet, as explained above).
text_to_bpmn_prompt = "Convert the given description into a BPMN diagram and provide the XML code for it. Provide only the code and nothing else."

In [42]:
def text_to_model(description, model="gpt-3.5-turbo", xml=False):
    import openai  # this cell uses the pre-1.0 interface (openai.ChatCompletion)
    # Pick the instruction, then append the description in both modes.
    instruction = text_to_bpmn_prompt if xml else text_to_intermediate_prompt
    prompt = instruction + "\n Here is the description:\n" + description
    openai.api_key = os.environ["OPENAI_API_KEY"]
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(model=model, messages=messages, temperature=0)
    return response["choices"][0]["message"]["content"]
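
The cell above relies on `openai.ChatCompletion`, which is only available in `openai` releases before 1.0. Because the install cell does not pin a version, a newer release may be installed. The following is a minimal sketch of the same request with the 1.x client, assuming `openai>=1.0`; the function name `chat_once` and its optional `client` parameter are hypothetical and not part of the tutorial.

```python
import os


def chat_once(prompt, model="gpt-3.5-turbo", client=None):
    """Send one user message and return the text of the model's reply."""
    if client is None:
        # Imported lazily so the function can also be tried with a stub client.
        from openai import OpenAI  # assumes openai>=1.0 is installed
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # In the 1.x client the response is an object, not a dict.
    return response.choices[0].message.content
```

With this helper, the body of `text_to_model` would reduce to building the prompt and returning `chat_once(prompt, model=model)`.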

This is the model provided by a human expert:
<img src="https://raw.githubusercontent.com/a-rebmann/nlp4bpa/main/content/exercise_3_1.png" alt="Exercise 3.1" />

This is what GPT-3.5 comes up with:

In [43]:
ans_sketch = text_to_model(process_description_3_1)
print(ans_sketch)

receive loan application -> approve loan application -> prepare acceptance pack -> send acceptance pack to customer -> receive signed documents from customer -> verify repayment agreement -> XOR([applicant disagreed] cancel application, [applicant agreed] approve application) -> notify applicant of application status


This is the model provided by a human expert:
<img src="https://raw.githubusercontent.com/a-rebmann/nlp4bpa/main/content/exercise_3_2.png" alt="Exercise 3.2" />

This is what GPT-3.5 comes up with:

In [44]:
ans_sketch = text_to_model(process_description_3_2)
print(ans_sketch)

receive loan application ->(perform loan risk assessment, perform property appraisal) -> AND(assess eligibility, prepare acceptance pack) -> XOR([not eligible]reject application, send acceptance pack)


This is the model provided by a human expert:
<img src="https://raw.githubusercontent.com/a-rebmann/nlp4bpa/main/content/exercise_3_3.png" alt="Exercise 3.3" />

This is what GPT-3.5 comes up with:

In [45]:
ans_sketch = text_to_model(process_description_3_3)
print(ans_sketch)

receive loan application -> XOR([express interest in home insurance]send acceptance pack, [express interest in home insurance]send acceptance pack, send acceptance pack and home insurance quote) -> verify repayment agreement
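
To compare such answers with the expert models programmatically, the textual notation has to be turned into a data structure first. The following is a minimal sketch of a parser for the prefix notation requested in the prompt (`->(...)`, `XOR(...)`, `OR(...)`, `AND(...)`, with optional `[condition]` prefixes). The function names and the example string are hypothetical, and the sketch assumes that the answer follows the requested notation strictly. The answers shown in this section also chain steps with an infix `a -> b`, which this sketch does not handle.

```python
OPERATORS = ("->", "XOR", "AND", "OR")


def split_top_level(text):
    """Split on commas that are not nested inside parentheses or brackets."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse(text):
    """Parse one construct into a task (str) or an operator node (dict)."""
    text = text.strip()
    condition = None
    if text.startswith("["):
        # An optional [condition] prefix, as used inside XOR constructs.
        end = text.index("]")
        condition, text = text[1:end].strip(), text[end + 1:].strip()
    node = text  # a plain task label unless an operator matches below
    for op in OPERATORS:
        if text.startswith(op + "(") and text.endswith(")"):
            inner = text[len(op) + 1:-1]
            node = {"op": op, "children": [parse(p) for p in split_top_level(inner)]}
            break
    if condition is not None:
        return {"condition": condition, "node": node}
    return node


def tasks(node):
    """Collect all task labels of a parsed construct, from left to right."""
    if isinstance(node, str):
        return [node]
    if "condition" in node:
        return tasks(node["node"])
    return [label for child in node["children"] for label in tasks(child)]


# Hypothetical answer in strict prefix notation, loosely based on description 3.2.
example = ("->(receive application, AND(assess risk, appraise property), "
           "XOR([not eligible]reject application, [eligible]send acceptance pack))")
tree = parse(example)
print(tree["op"])
print(tasks(tree))
```

The resulting task list can then be compared with the activities of the expert model, for example as sets of labels.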


## 3. Future of NLP for BPA

A conceptual architecture of a **Large Process Model (LPM)**

<img src="content/LPM_architecture.png" alt="LPM" />
Kampik et al., 2023

<b>Exercise 4: Data connection exercise</b> - Ask a company process bot about the internal processes

Your function should do the following:

* **Document Loading:** Read the `example_process_descriptions.txt` file in the `content` folder
* **Splitting:** Split the document into chunks (you choose the size)
* **Storage:** Write the chunks to a ChromaDB vector store
* **Retrieval:** Use contextual compression to return the portion of the document that is relevant to the question

<img src="content/exercise_4.png" alt="Exercise 4" />
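
Before using the libraries, it can help to see the split, store, and retrieve steps in isolation. The following is a minimal, dependency-free sketch under strong simplifying assumptions: word counts stand in for the OpenAI embeddings, a plain Python list stands in for the ChromaDB vector store, and there is no contextual compression. All function names and example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Cut text into character windows of chunk_size that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text):
    """Toy stand-in for an embedding: counts of lower-cased words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=1):
    """Return the k chunks that are most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks; in the exercise they come from the text splitter.
chunks = [
    "The loan officer assesses the applicant's eligibility.",
    "The warehouse clerk packs the ordered items.",
    "The invoice is sent to the customer by email.",
]
print(retrieve("Who assesses eligibility for a loan?", chunks))
```

A real embedding model also matches paraphrases that share no words with the question, which this word-count stand-in cannot do.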

In [None]:
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor 

In [None]:
def process_helper(question):
    '''
    Takes in a question about the internal processes and returns the most relevant
    part. Notice it may not directly answer the actual question! 
    
    Follow the steps below to fill out this function:
    '''
    # PART ONE:
    # Load "content/example_process_descriptions.txt" into a Document object
    loader = TextLoader("content/example_process_descriptions.txt")
    documents = loader.load()
    
    # PART TWO
    # Split the document into chunks (you choose how and what size)
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
    docs = text_splitter.split_documents(documents)
    
    # PART THREE
    # Embed the documents (now in chunks) into a persisted ChromaDB
    embedding_function = OpenAIEmbeddings()
    db = Chroma.from_documents(docs, embedding_function, persist_directory='./example_process_descriptions')
    db.persist()

    # PART FOUR
    # Use ChatOpenAI and ContextualCompressionRetriever to return the most
    # relevant part of the documents.

    llm = ChatOpenAI(temperature=0)
    compressor = LLMChainExtractor.from_llm(llm)

    compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, 
                                                           base_retriever=db.as_retriever())

    compressed_docs = compression_retriever.get_relevant_documents(question)

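    # The compressor may discard every chunk, so guard against an empty result.
    if not compressed_docs:
        return "No relevant passage found."
    return compressed_docs[0].page_content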

In [None]:
print(process_helper("What activity follows after BLABLABLA?"))