# PM4PY for Process Mining
<hr>
pm4py is a Python library that implements state-of-the-art process mining algorithms.

In [2]:
pip install pm4py

Collecting pm4py
Note: you may need to restart the kernel to use updated packages.
  Using cached pm4py-2.2.22-py3-none-any.whl (1.8 MB)
Collecting deprecation
  Using cached deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting jsonpickle
  Using cached jsonpickle-2.2.0-py2.py3-none-any.whl (39 kB)

Collecting cvxopt
  Using cached cvxopt-1.3.0-cp39-cp39-win_amd64.whl (12.7 MB)
Collecting tqdm
  Using cached tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
Collecting pyvis
  Using cached pyvis-0.2.1-py3-none-any.whl
Collecting lxml
  Using cached lxml-4.9.0-cp39-cp39-win_amd64.whl (3.6 MB)
Collecting graphviz
  Using cached graphviz-0.20-py3-none-any.whl (46 kB)
Collecting networkx
  Using cached networkx-2.8.4-py3-none-any.whl (2.0 MB)
Collecting stringdist
  Using cached StringDist-1.0.9-py3-none-any.whl
Collecting sympy
  Using cached sympy-1.10.1-py3-none-any.whl (6.4 MB)
Collecting pydotplus
  Using cached pydotplus-2.0.2-py3-none-any.whl
Collecting intervaltree
  Using cached interva

In [3]:
import pm4py

## Importing XES files

In [1]:
from pm4py.objects.log.importer.xes import importer as xes_importer

# Event logs are stored as an extension of the Python list data structure.
log = xes_importer.apply('C:/Users/AREFA/Documents/DataScienceProjects/PM4PY/running-example.xes')

parsing log, completed traces ::   0%|          | 0/6 [00:00<?, ?it/s]

Event logs are stored as an extension of the Python list data structure. 

In [2]:
print(log[0]) #prints the first trace of the log
print(log[0][0]) #prints the first event of the first trace

{'attributes': {'concept:name': '3', 'creator': 'Fluxicon Nitro'}, 'events': [{'concept:name': 'register request', 'org:resource': 'Pete', 'time:timestamp': datetime.datetime(2010, 12, 30, 14, 32, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'Activity': 'register request', 'Resource': 'Pete', 'Costs': '50'}, '..', {'concept:name': 'pay compensation', 'org:resource': 'Ellen', 'time:timestamp': datetime.datetime(2011, 1, 15, 10, 45, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'Activity': 'pay compensation', 'Resource': 'Ellen', 'Costs': '200'}]}
{'concept:name': 'register request', 'org:resource': 'Pete', 'time:timestamp': datetime.datetime(2010, 12, 30, 14, 32, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'Activity': 'register request', 'Resource': 'Pete', 'Costs': '50'}


#### apply() method
The apply() method of xes_importer (defined in pm4py/objects/log/importer/xes/importer.py) accepts two optional parameters: variant and parameters.
- variant selects which importer implementation to use (for example, Variants.ITERPARSE).
- parameters is a Python dictionary of settings specific to the chosen variant.

In [3]:
from pm4py.objects.log.importer.xes import importer as xes_importer
variant = xes_importer.Variants.ITERPARSE
parameters = {variant.value.Parameters.TIMESTAMP_SORT: True}
log = xes_importer.apply('C:/Users/AREFA/Documents/DataScienceProjects/PM4PY/running-example.xes',
                         variant=variant, parameters=parameters)

parsing log, completed traces ::   0%|          | 0/6 [00:00<?, ?it/s]

## Importing CSV Files

- Import directly as a pandas DataFrame.
- Import, then convert the DataFrame to an Event Log object (not an Event Stream).

In [27]:
import pandas as pd
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.conversion.log import converter as log_converter

log_csv = pd.read_csv('C:/Users/AREFA/Documents/DataScienceProjects/PM4PY/running-example.csv', sep=';')
log_csv.head(20)

Unnamed: 0,case_id,activity,timestamp,costs,resource
0,3,register request,2010-12-30 14:32:00+01:00,50,Pete
1,3,examine casually,2010-12-30 15:06:00+01:00,400,Mike
2,3,check ticket,2010-12-30 16:34:00+01:00,100,Ellen
3,3,decide,2011-01-06 09:18:00+01:00,200,Sara
4,3,reinitiate request,2011-01-06 12:18:00+01:00,200,Sara
5,3,examine thoroughly,2011-01-06 13:06:00+01:00,400,Sean
6,3,check ticket,2011-01-08 11:43:00+01:00,100,Pete
7,3,decide,2011-01-09 09:55:00+01:00,200,Sara
8,3,pay compensation,2011-01-15 10:45:00+01:00,200,Ellen
9,2,register request,2010-12-30 11:32:00+01:00,50,Mike


### Case Identifier for events
A CSV file is essentially an event stream: a flat list of events. To convert it to an Event Log, the converter needs to know which attribute identifies the case each event belongs to.

In the converter, this is set through the CASE_ID_KEY parameter.

Its default value is 'case:concept:name'. So when the input data, stored in a CSV file, has a column named case:concept:name, that column is used to group events into traces.

We therefore rename the case_id column to case:concept:name.

In [29]:
log_csv.rename(columns={'case_id': 'case:concept:name'}, inplace=True)
log_csv.head()

Unnamed: 0,case:concept:name,activity,timestamp,costs,resource
0,3,register request,2010-12-30 14:32:00+01:00,50,Pete
1,3,examine casually,2010-12-30 15:06:00+01:00,400,Mike
2,3,check ticket,2010-12-30 16:34:00+01:00,100,Ellen
3,3,decide,2011-01-06 09:18:00+01:00,200,Sara
4,3,reinitiate request,2011-01-06 12:18:00+01:00,200,Sara
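
To make the role of the case identifier concrete, here is a minimal plain-Python sketch of grouping a flat event stream into traces. It is not PM4Py's implementation; the helper name `group_into_traces` and the three sample rows are assumptions chosen to mirror the columns shown above.

```python
from collections import defaultdict

# Hypothetical miniature event stream; the rows mirror the CSV columns above.
events = [
    {"case:concept:name": 3, "activity": "register request", "timestamp": "2010-12-30 14:32:00+01:00"},
    {"case:concept:name": 3, "activity": "examine casually", "timestamp": "2010-12-30 15:06:00+01:00"},
    {"case:concept:name": 2, "activity": "register request", "timestamp": "2010-12-30 11:32:00+01:00"},
]


def group_into_traces(stream, case_id_key="case:concept:name"):
    """Group a flat list of events into one list of events per case."""
    traces = defaultdict(list)
    for event in stream:
        # The case identifier becomes the key of the trace, not an event attribute.
        traces[event[case_id_key]].append(
            {k: v for k, v in event.items() if k != case_id_key}
        )
    return dict(traces)


traces = group_into_traces(events)
print({case: [e["activity"] for e in evs] for case, evs in traces.items()})
```

Each key of `traces` corresponds to one case, and the events keep the order in which they appear in the stream, which is why sorting by timestamp before converting matters.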


In [30]:
log_csv = dataframe_utils.convert_timestamp_columns_in_df(log_csv)
log_csv = log_csv.sort_values('timestamp')
event_log = log_converter.apply(log_csv)

In [31]:
print(event_log)

[{'attributes': {'concept:name': 1}, 'events': [{'activity': 'register request', 'timestamp': Timestamp('2010-12-30 10:02:00+0000', tz='UTC'), 'costs': 50, 'resource': 'Pete'}, '..', {'activity': 'reject request', 'timestamp': Timestamp('2011-01-07 13:24:00+0000', tz='UTC'), 'costs': 200, 'resource': 'Pete'}]}, '....', {'attributes': {'concept:name': 6}, 'events': [{'activity': 'register request', 'timestamp': Timestamp('2011-01-06 14:02:00+0000', tz='UTC'), 'costs': 50, 'resource': 'Mike'}, '..', {'activity': 'pay compensation', 'timestamp': Timestamp('2011-01-16 10:47:00+0000', tz='UTC'), 'costs': 200, 'resource': 'Mike'}]}]


Because the DataFrame was sorted before conversion, the events within each trace are ordered by timestamp.

In [32]:
log_csv.rename(columns={'case:concept:name': 'case'}, inplace=True)
log_csv.head()

Unnamed: 0,case,activity,timestamp,costs,resource
14,1,register request,2010-12-30 10:02:00+00:00,50,Pete
9,2,register request,2010-12-30 10:32:00+00:00,50,Mike
10,2,check ticket,2010-12-30 11:12:00+00:00,100,Mike
11,2,examine casually,2010-12-30 13:16:00+00:00,400,Sean
0,3,register request,2010-12-30 13:32:00+00:00,50,Pete


### Case-Level attributes
PM4Py lets us mark a column as a case-level attribute, under the assumption that its value does not change during the execution of a case.

This is controlled by an additional parameter, CASE_ATTRIBUTE_PREFIX, whose default value is 'case:'. Columns whose names start with this prefix are treated as case-level attributes.

In [34]:
log_csv = pd.read_csv('C:/Users/AREFA/Documents/DataScienceProjects/PM4PY/running-example.csv', sep=';')
log_csv.rename(columns={'case_id': 'case'}, inplace = True)
# Illustrative only: this CSV has no clientID column, so pandas silently skips this rename
log_csv.rename(columns={'clientID': 'case:clientID'}, inplace = True)
# The case identifier column is named 'case' here, so CASE_ID_KEY must be set explicitly
parameters = {log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY: 'case'}
event_log = log_converter.apply(log_csv, parameters=parameters, variant=log_converter.Variants.TO_EVENT_LOG)
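
As a rough illustration of what a case-attribute prefix means, the following plain-Python sketch splits one flat row into case-level and event-level attributes. It is not PM4Py code: `split_attributes` is a hypothetical helper, and the `clientID` value is invented for the example.

```python
CASE_PREFIX = "case:"  # mirrors the default prefix mentioned above


def split_attributes(row, prefix=CASE_PREFIX):
    """Separate one flat row into (case-level, event-level) attribute dicts."""
    case_attrs = {k[len(prefix):]: v for k, v in row.items() if k.startswith(prefix)}
    event_attrs = {k: v for k, v in row.items() if not k.startswith(prefix)}
    return case_attrs, event_attrs


# Hypothetical row: the clientID value is made up for illustration.
row = {"case:concept:name": 3, "case:clientID": "C-17", "activity": "decide", "costs": 200}
case_attrs, event_attrs = split_attributes(row)
print(case_attrs)   # {'concept:name': 3, 'clientID': 'C-17'}
print(event_attrs)  # {'activity': 'decide', 'costs': 200}
```

This matches the shape of the printed event logs in this notebook, where the trace-level 'attributes' dictionary holds 'concept:name' without the 'case:' prefix.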

## Exporting XES Files
In the example, the log object is assumed to be an Event Log object. 

The exporter also accepts an Event Stream or DataFrame object as an input.

In [36]:
from pm4py.objects.log.exporter.xes import exporter as xes_exporter
xes_exporter.apply(log, 'C:/Users/AREFA/Documents/DataScienceProjects/PM4PY/exported.xes')

exporting log, completed traces ::   0%|          | 0/6 [00:00<?, ?it/s]

## Exporting CSV Files
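The event log is converted to a pandas DataFrame, which is then exported with to_csv.

In [37]:
# Note: log_csv is already a DataFrame here; pass an Event Log object (e.g. event_log) to convert a log
dataframe = log_converter.apply(log_csv, variant = log_converter.Variants.TO_DATA_FRAME)
dataframe.to_csv('C:/Users/AREFA/Documents/DataScienceProjects/PM4PY/exported_csv.csv')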

## Filtering

### 1. Filtering on timeframe
Example: keep only the traces (cases) that fall entirely within a specific time interval.

In [39]:
# For log object
from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_contained(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
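
The distinction between a trace being contained in an interval and merely intersecting it can be sketched in plain Python. This is an illustration under assumed definitions (containment: first and last event both inside the interval; intersection: the span of the trace overlaps the interval), not PM4Py's implementation. The case bounds are the first and last timestamps of cases 1 and 3 from this notebook, with the time zone dropped for simplicity, and the interval is chosen for the example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def parse(ts):
    return datetime.strptime(ts, FMT)


def is_contained(first, last, start, end):
    """True if the whole trace lies inside [start, end]."""
    return start <= first and last <= end


def is_intersecting(first, last, start, end):
    """True if the span of the trace overlaps [start, end] at all."""
    return first <= end and last >= start


# (first event, last event) per case, time zone omitted.
case_bounds = {
    1: (parse("2010-12-30 11:02:00"), parse("2011-01-07 14:24:00")),
    3: (parse("2010-12-30 14:32:00"), parse("2011-01-15 10:45:00")),
}
start, end = parse("2010-12-30 00:00:00"), parse("2011-01-10 23:59:59")

contained = [c for c, (f, l) in case_bounds.items() if is_contained(f, l, start, end)]
intersecting = [c for c, (f, l) in case_bounds.items() if is_intersecting(f, l, start, end)]
print(contained)     # [1]
print(intersecting)  # [1, 3]
```

Case 3 ends after the interval closes, so it intersects the interval without being contained in it.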

In [46]:
dataframe.head()

Unnamed: 0,case,activity,timestamp,costs,resource
0,3,register request,2010-12-30 14:32:00+01:00,50,Pete
1,3,examine casually,2010-12-30 15:06:00+01:00,400,Mike
2,3,check ticket,2010-12-30 16:34:00+01:00,100,Ellen
3,3,decide,2011-01-06 09:18:00+01:00,200,Sara
4,3,reinitiate request,2011-01-06 12:18:00+01:00,200,Sara


In [57]:
dataframe['timestamp'].dtype

dtype('O')

In [52]:
#converting strings to datetime
import datetime

date1 = '2011-03-09 00:00:00'
date1 = datetime.datetime.strptime(date1, '%Y-%m-%d %H:%M:%S')

date2 = '2012-01-18 23:59:59'
date2 = datetime.datetime.strptime(date2, '%Y-%m-%d %H:%M:%S')

In [66]:
#converting timestamp column to type timestamp
log_csv = dataframe_utils.convert_timestamp_columns_in_df(log_csv)
dataframe = log_converter.apply(log_csv, variant = log_converter.Variants.TO_DATA_FRAME)

In [67]:
# For a pandas DataFrame (traces contained in a time interval;
# filter_traces_intersecting selects traces that merely intersect it)
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_contained = timestamp_filter.filter_traces_contained(
    dataframe, '2011-01-09 00:00:00', '2012-01-18 23:59:59',
    parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case",
                timestamp_filter.Parameters.TIMESTAMP_KEY: "timestamp"})

### 2. Filtering on Case Performance
Keep only the cases whose total duration lies between a minimum and a maximum value; cases that are shorter or longer are removed.

Both bounds are given in seconds (86400 s = 1 day, 864000 s = 10 days).

In [68]:
#for log
from pm4py.algo.filtering.log.cases import case_filter
filtered_log2 = case_filter.filter_case_performance(log, 86400, 864000)
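
To see what these bounds mean on the running example, the sketch below computes case durations by hand and keeps the cases lasting between one and ten days. It is plain Python rather than PM4Py; the helper names are hypothetical, inclusive bounds are an assumption, and the timestamps are the first and last events of cases 1 to 3 as printed in this notebook.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# (first event, last event) per case, taken from the tables in this notebook.
case_bounds = {
    1: ("2010-12-30 11:02", "2011-01-07 14:24"),
    2: ("2010-12-30 11:32", "2011-01-08 12:05"),
    3: ("2010-12-30 14:32", "2011-01-15 10:45"),
}


def case_duration_seconds(first, last):
    """Elapsed time between the first and last event of a case, in seconds."""
    return (datetime.strptime(last, FMT) - datetime.strptime(first, FMT)).total_seconds()


def filter_by_duration(bounds, min_seconds, max_seconds):
    """Keep the cases whose duration lies within [min_seconds, max_seconds]."""
    return {
        case: b for case, b in bounds.items()
        if min_seconds <= case_duration_seconds(*b) <= max_seconds
    }


kept = filter_by_duration(case_bounds, 86400, 864000)
print(sorted(kept))  # [1, 2]
```

Case 3 lasts almost 16 days, so it exceeds the ten-day maximum and is dropped.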

In [70]:
# for a pandas DataFrame
from pm4py.algo.filtering.pandas.cases import case_filter
df_cases = case_filter.filter_case_performance(
    dataframe, min_case_performance=86400, max_case_performance=864000,
    parameters={case_filter.Parameters.CASE_ID_KEY: "case",
                case_filter.Parameters.TIMESTAMP_KEY: "timestamp"})

### 3. Filtering on Start Activity

In [71]:
from pm4py.algo.filtering.log.start_activities import start_activities_filter

log_start = start_activities_filter.get_start_activities(log)
filtered_log = start_activities_filter.apply(log, ["register request"])

In [78]:
dataframe.head()

Unnamed: 0,case,activity,timestamp,costs,resource
0,3,register request,2010-12-30 13:32:00+00:00,50,Pete
1,3,examine casually,2010-12-30 14:06:00+00:00,400,Mike
2,3,check ticket,2010-12-30 15:34:00+00:00,100,Ellen
3,3,decide,2011-01-06 08:18:00+00:00,200,Sara
4,3,reinitiate request,2011-01-06 11:18:00+00:00,200,Sara


In [90]:
#adding case identifier
parameters = {log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY: 'case'}
dataframe = log_converter.apply(log_csv, parameters=parameters,variant = log_converter.Variants.TO_DATA_FRAME)

In [93]:
from pm4py.algo.filtering.log.start_activities import start_activities_filter
# DECREASING_FACTOR: start activities are ranked by frequency; roughly, an activity is kept while its count exceeds 0.6 times that of the preceding, more frequent one
log_af_sa = start_activities_filter.apply_auto_filter(log, parameters={start_activities_filter.Parameters.DECREASING_FACTOR: 0.6}) 


### 4. Filtering on end activities
This filter keeps only the traces whose end activity is in a given set of activities.

In [96]:
#to look at keys
dataframe2 = log_converter.apply(log, parameters=parameters,variant = log_converter.Variants.TO_DATA_FRAME)

In [105]:
dataframe2.head(20)

Unnamed: 0,concept:name,org:resource,time:timestamp,Activity,Resource,Costs,case:concept:name,case:creator
0,register request,Pete,2010-12-30 11:02:00+01:00,register request,Pete,50,1,Fluxicon Nitro
1,examine thoroughly,Sue,2010-12-31 10:06:00+01:00,examine thoroughly,Sue,400,1,Fluxicon Nitro
2,check ticket,Mike,2011-01-05 15:12:00+01:00,check ticket,Mike,100,1,Fluxicon Nitro
3,decide,Sara,2011-01-06 11:18:00+01:00,decide,Sara,200,1,Fluxicon Nitro
4,reject request,Pete,2011-01-07 14:24:00+01:00,reject request,Pete,200,1,Fluxicon Nitro
5,register request,Mike,2010-12-30 11:32:00+01:00,register request,Mike,50,2,Fluxicon Nitro
6,check ticket,Mike,2010-12-30 12:12:00+01:00,check ticket,Mike,100,2,Fluxicon Nitro
7,examine casually,Sean,2010-12-30 14:16:00+01:00,examine casually,Sean,400,2,Fluxicon Nitro
8,decide,Sara,2011-01-05 11:22:00+01:00,decide,Sara,200,2,Fluxicon Nitro
9,pay compensation,Ellen,2011-01-08 12:05:00+01:00,pay compensation,Ellen,200,2,Fluxicon Nitro


In [102]:
from pm4py.algo.filtering.log.end_activities import end_activities_filter

end_activities = end_activities_filter.get_end_activities(log)
filtered_log_ea = end_activities_filter.apply(log, ["pay compensation"])

In [103]:
print(filtered_log_ea)

[{'attributes': {'concept:name': '2', 'creator': 'Fluxicon Nitro'}, 'events': [{'concept:name': 'register request', 'org:resource': 'Mike', 'time:timestamp': datetime.datetime(2010, 12, 30, 11, 32, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'Activity': 'register request', 'Resource': 'Mike', 'Costs': '50', 'case:concept:name': '2', 'case:creator': 'Fluxicon Nitro'}, '..', {'concept:name': 'pay compensation', 'org:resource': 'Ellen', 'time:timestamp': datetime.datetime(2011, 1, 8, 12, 5, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'Activity': 'pay compensation', 'Resource': 'Ellen', 'Costs': '200', 'case:concept:name': '2', 'case:creator': 'Fluxicon Nitro'}]}, '....', {'attributes': {'concept:name': '6', 'creator': 'Fluxicon Nitro'}, 'events': [{'concept:name': 'register request', 'org:resource': 'Mike', 'time:timestamp': datetime.datetime(2011, 1, 6, 15, 2, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'Activity': 'register request', 'R

In [104]:
dataframe3 = log_converter.apply(filtered_log_ea, parameters=parameters,variant = log_converter.Variants.TO_DATA_FRAME)
dataframe3.head(20)

Unnamed: 0,concept:name,org:resource,time:timestamp,Activity,Resource,Costs,case:concept:name,case:creator
0,register request,Mike,2010-12-30 11:32:00+01:00,register request,Mike,50,2,Fluxicon Nitro
1,check ticket,Mike,2010-12-30 12:12:00+01:00,check ticket,Mike,100,2,Fluxicon Nitro
2,examine casually,Sean,2010-12-30 14:16:00+01:00,examine casually,Sean,400,2,Fluxicon Nitro
3,decide,Sara,2011-01-05 11:22:00+01:00,decide,Sara,200,2,Fluxicon Nitro
4,pay compensation,Ellen,2011-01-08 12:05:00+01:00,pay compensation,Ellen,200,2,Fluxicon Nitro
5,register request,Pete,2010-12-30 14:32:00+01:00,register request,Pete,50,3,Fluxicon Nitro
6,examine casually,Mike,2010-12-30 15:06:00+01:00,examine casually,Mike,400,3,Fluxicon Nitro
7,check ticket,Ellen,2010-12-30 16:34:00+01:00,check ticket,Ellen,100,3,Fluxicon Nitro
8,decide,Sara,2011-01-06 09:18:00+01:00,decide,Sara,200,3,Fluxicon Nitro
9,reinitiate request,Sara,2011-01-06 12:18:00+01:00,reinitiate request,Sara,200,3,Fluxicon Nitro


Thus, case 1 is removed because it did not end with 'pay compensation'.
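
The logic of this filter can be reproduced in a few lines of plain Python. This is only a sketch, not PM4Py's implementation: `filter_end_activities` is a hypothetical helper, and the two traces are the activity sequences of cases 1 and 2 from the table above.

```python
# Activity sequences of cases 1 and 2, as listed in the DataFrame above.
traces = {
    "1": ["register request", "examine thoroughly", "check ticket", "decide", "reject request"],
    "2": ["register request", "check ticket", "examine casually", "decide", "pay compensation"],
}


def filter_end_activities(traces, allowed):
    """Keep the traces whose last activity is one of the allowed end activities."""
    return {case: acts for case, acts in traces.items() if acts and acts[-1] in allowed}


kept = filter_end_activities(traces, {"pay compensation"})
print(sorted(kept))  # ['2']
```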

### 5. Filtering on Attribute values

#### Attributes of cases: 
- case identifier/name
- resource (for example, the resource executing the case, such as a manager; org:resource attribute)

#### Attributes of events:
- activity (concept:name attribute)
- cost
- resource

Filtering on attribute values supports four alternatives:

- Keep cases that contain at least one event with one of the given attribute values
- Remove cases that contain an event with one of the given attribute values
- Keep only the events that have one of the given attribute values (trimming traces)
- Remove the events that have one of the given attribute values (trimming traces)
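
The four alternatives above can be sketched in plain Python as two helpers with a positive/negative switch. This is an illustration, not PM4Py's implementation: the helper names are hypothetical, dropping traces that become empty after trimming is an assumption of the sketch, and the sample data uses the resources of cases 1 and 2 from this notebook.

```python
def filter_cases(traces, key, values, positive=True):
    """Keep (positive) or remove (negative) whole cases containing a matching event."""
    def matches(trace):
        return any(event.get(key) in values for event in trace)
    return {case: t for case, t in traces.items() if matches(t) == positive}


def filter_events(traces, key, values, positive=True):
    """Trim traces: keep (positive) or remove (negative) the matching events."""
    trimmed = {
        case: [e for e in t if (e.get(key) in values) == positive]
        for case, t in traces.items()
    }
    # Assumption of this sketch: traces left without events are dropped.
    return {case: t for case, t in trimmed.items() if t}


# Resources per event for cases 1 and 2, as listed in this notebook.
traces = {
    "1": [{"org:resource": r} for r in ["Pete", "Sue", "Mike", "Sara", "Pete"]],
    "2": [{"org:resource": r} for r in ["Mike", "Mike", "Sean", "Sara", "Ellen"]],
}

print(sorted(filter_cases(traces, "org:resource", {"Sue"}, positive=True)))   # ['1']
print(sorted(filter_cases(traces, "org:resource", {"Sue"}, positive=False)))  # ['2']
print({c: len(t) for c, t in filter_events(traces, "org:resource", {"Sue"}, positive=True).items()})
print({c: len(t) for c, t in filter_events(traces, "org:resource", {"Sue"}, positive=False).items()})
```

The first two calls keep or remove whole cases, while the last two keep every case that still has events but change how many events each one contains.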

In [106]:
#getting list of resources and activities before filtering 
from pm4py.algo.filtering.log.attributes import attributes_filter

activities = attributes_filter.get_attribute_values(log, "concept:name")
resources = attributes_filter.get_attribute_values(log, "org:resource")

In [107]:
print(activities)
print(resources)

{'register request': 6, 'examine thoroughly': 3, 'check ticket': 9, 'decide': 9, 'reject request': 3, 'examine casually': 6, 'pay compensation': 3, 'reinitiate request': 3}
{'Pete': 7, 'Sue': 2, 'Mike': 11, 'Sara': 12, 'Sean': 3, 'Ellen': 7}


In [109]:
# Note: "Resource10" does not occur in this log (see the resources printed above); substitute e.g. "Sue" for a non-empty result
# keep traces containing at least one event performed by one of the given resources
tracefilter_log_pos = attributes_filter.apply(log, ["Resource10"],parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", attributes_filter.Parameters.POSITIVE: True})

# keep traces containing no event performed by any of the given resources
tracefilter_log_neg = attributes_filter.apply(log, ["Resource10"],parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", attributes_filter.Parameters.POSITIVE: False})

In [113]:
df= log_converter.apply(log, parameters=parameters,variant = log_converter.Variants.TO_DATA_FRAME)

In [114]:
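# DataFrames should go through the pandas variant of the filter; it is imported under an alias so the log variant stays available
from pm4py.algo.filtering.pandas.attributes import attributes_filter as df_attributes_filter
df_traces_pos = df_attributes_filter.apply(df, ["Resource10"], parameters={df_attributes_filter.Parameters.CASE_ID_KEY: "case:concept:name", df_attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", df_attributes_filter.Parameters.POSITIVE: True})
df_traces_neg = df_attributes_filter.apply(df, ["Resource10"], parameters={df_attributes_filter.Parameters.CASE_ID_KEY: "case:concept:name", df_attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", df_attributes_filter.Parameters.POSITIVE: False})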

#### Trimming the cases

In [119]:
#keep only the events performed by a given list of resources
tracefilter_log_pos = attributes_filter.apply_events(log, ["Resource10"],parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", attributes_filter.Parameters.POSITIVE: True})
#keep only the events not performed by a given list of resources
tracefilter_log_neg = attributes_filter.apply_events(log, ["Resource10"],parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", attributes_filter.Parameters.POSITIVE: False})