This project contains a Python utility that uses data coming from fiber to create .xes event logs.
To use this tool you need access to the Mount Sinai Data Warehouse.
Follow these steps to install fiber2xes:
- Install fiber according to its installation guide.
- Download and install Spark 3.1.2 according to its installation guide. This website provides a concise overview of how the Spark environment can be set up. Make sure that both the `SPARK_HOME` and `JAVA_HOME` environment variables are correctly set and exported. Should the available Spark version change, the pyspark version of this package, as well as that of the Docker image, needs to be changed accordingly.
- Run the pip installation to install `fiber2xes`:

```
pip install git+https://gitlab.hpi.de/pm1920/fiber2xes.git
```
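The Spark-related environment variables mentioned above are typically exported in your shell profile. The paths below are placeholders and depend on where Spark 3.1.2 and Java are installed on your machine:

```shell
# Placeholder paths -- adjust to your local Spark 3.1.2 and Java installations.
export SPARK_HOME=/opt/spark-3.1.2-bin-hadoop3.2
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH="$SPARK_HOME/bin:$PATH"
```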
For development and testing, all dev dependencies can be installed using

```
pip install -e .[dev]
```

If you're using zsh, escape the square brackets:

```
pip install -e .\[dev\]
```
In case you encounter version or dependency issues related to fiber, it is advisable to run

```
sed -i 's/==/>=/' requirements.txt
```

in the fiber directory. This relaxes fiber's pinned dependency versions so that the installation of fiber2xes can resolve the right ones.
After following all installation steps, `example.py`, a demo file containing a short overview of how fiber2xes can be executed, can be run by calling

```
python3 ./example.py
```

This example creates a sample cohort for an MRN-based event log, which will be extracted and saved to the repository's root directory as a file called `./log_<timestamp>_mrn_5.xes`. This file can then be used for process mining.
The package offers two methods for event log creation and filters for trace and event filtering. The following sections contain more details about these methods.
To create a log from a fiber cohort, just call the `cohort_to_event_log` method:
```python
import multiprocessing

from fiber2xes import cohort_to_event_log

cohort_to_event_log(
    cohort,
    trace_type,
    verbose=False,
    remove_unlisted=True,
    remove_duplicates=True,
    event_filter=None,
    trace_filter=None,
    cores=multiprocessing.cpu_count(),
    window_size=500,
    abstraction_path=None,
    abstraction_exact_match=False,
    abstraction_delimiter=";",
    include_anamnesis_events=True,
    duplicate_event_identifier="BACK PAIN",
    event_identifier_to_merge="CHRONIC LOW BACK PAIN",
    perform_complex_duplicate_detection=False
)
```
Parameters:
- cohort: The fiber cohort with the patients
- trace_type: The type of a trace (`mrn` or `visit`)
- verbose=False: Flag if the events should contain the original, non-abstracted values (default False)
- remove_unlisted=True: Flag if a trace should only contain listed events (default True)
- remove_duplicates=True: Flag if duplicate events should be removed (default True)
- event_filter=None: A custom filter to filter events (default None)
- trace_filter=None: A custom filter to filter traces (default None)
- cores=multiprocessing.cpu_count(): The number of cores which should be used to process the cohort (default amount of CPUs)
- window_size=500: The number of patients per window (default 500)
- abstraction_path=None: The path to the abstraction file (default None)
- abstraction_exact_match=False: Flag if the abstraction algorithm should only abstract exact matches (default False)
- abstraction_delimiter=";": The delimiter of the abstraction file (default ;)
- include_anamnesis_events=True: Flag if anamnesis events should be included in the log (default True)
- duplicate_event_identifier="BACK PAIN": Event identifier to be analysed separately for duplications (default "BACK PAIN")
- event_identifier_to_merge="CHRONIC LOW BACK PAIN": Event identifier to be used for separately identified duplicates (default "CHRONIC LOW BACK PAIN")
- perform_complex_duplicate_detection=False: Flag if complex time- and lifecycle-based duplicate detection should be performed (default False)
The method `save_event_log_to_file` serialises a created log to a file:

```python
from fiber2xes import save_event_log_to_file

save_event_log_to_file(log, file_path)
```
Parameters:
- log: The log generated by the `cohort_to_event_log` method
- file_path: The file path / name
With trace and event filters, it is possible to filter the traces or events during the creation process. The following conditions are available; they can be combined by And, Or and Not operations.
A filter for a specific diagnosis given by its code.

```python
from fiber2xes.filter.condition import Diagnosis

filter = Diagnosis(diagnosis_code)
```
Parameter:
- diagnosis_code: The diagnosis code
A filter for a specific material given by its code.

```python
from fiber2xes.filter.condition import Material

filter = Material(material_code)
```
Parameter:
- material_code: The material code
A filter for a specific procedure given by its code.

```python
from fiber2xes.filter.condition import Procedure

filter = Procedure(procedure_code)
```
Parameter:
- procedure_code: The procedure code
A filter for traces based on timing conditions (see parameters).

```python
from fiber2xes.filter.condition import Time

filter = Time(one_event_after=None, one_event_before=None, all_events_after=None, all_events_before=None)
```
Parameters:
- one_event_after: The trace is relevant if one event of the trace was after the given date
- one_event_before: The trace is relevant if one event of the trace was before the given date
- all_events_after: The trace is relevant if all events of the trace were after the given date
- all_events_before: The trace is relevant if all events of the trace were before the given date
A filter for traces or events based on a given lambda expression. The lambda expression receives the trace or event as a parameter and should return true or false. If it returns true, the trace or event is considered relevant; otherwise it is filtered out.

```python
from fiber2xes.filter.condition import Generic

filter = Generic(lambda_expression)
```
Parameter:
- lambda_expression: The lambda expression which will be applied on all traces and events
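For illustration, such a lambda simply maps a trace or event to a boolean. The event object below is a hypothetical stand-in: real events come from the fiber cohort, and their attribute names may differ.

```python
from types import SimpleNamespace

# Hypothetical stand-in events; real fiber events may expose
# different attribute names.
is_back_pain = lambda event: "BACK PAIN" in event.description

event = SimpleNamespace(description="CHRONIC LOW BACK PAIN")
other = SimpleNamespace(description="FRACTURE")

print(is_back_pain(event))  # relevant event
print(is_back_pain(other))  # filtered out
```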
An aggregation of two other filters with a logical and as the aggregation function.

```python
from fiber2xes.filter.operator import And

filter = And(filter1, filter2)
```
Parameter:
- filter1 and filter2: Two other trace or event filters which will be aggregated by a logical and.
An aggregation of two other filters with a logical or as the aggregation function.

```python
from fiber2xes.filter.operator import Or

filter = Or(filter1, filter2)
```
Parameter:
- filter1 and filter2: Two other trace or event filters which will be aggregated by a logical or.
An inverter of the result of another filter.

```python
from fiber2xes.filter.operator import Not

filter = Not(filter)
```
Parameter:
- filter: The result of the given filter will be negated.
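The combination semantics can be sketched with minimal stand-in classes. These are not the fiber2xes implementations (the real operators live in `fiber2xes.filter.operator`), and the `is_relevant` method name is illustrative; the sketch only shows how the operators compose filter results:

```python
# Minimal stand-ins illustrating the operator semantics only;
# the method name is_relevant is an assumption for this sketch.
class Predicate:
    def __init__(self, fn):
        self.fn = fn
    def is_relevant(self, item):
        return self.fn(item)

class And:
    def __init__(self, f1, f2):
        self.f1, self.f2 = f1, f2
    def is_relevant(self, item):
        # Relevant only if both aggregated filters accept the item
        return self.f1.is_relevant(item) and self.f2.is_relevant(item)

class Not:
    def __init__(self, f):
        self.f = f
    def is_relevant(self, item):
        # Negates the result of the wrapped filter
        return not self.f.is_relevant(item)

# Relevant if a code starts with "M54" but is not exactly "M54.5"
combined = And(Predicate(lambda code: code.startswith("M54")),
               Not(Predicate(lambda code: code == "M54.5")))

print(combined.is_relevant("M54.2"))
print(combined.is_relevant("M54.5"))
```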
This pipeline tool utilises Spark for transforming large event data sets. For local development, or for using the tool on differently equipped hardware, it can be sensible to change memory requirements and other Spark configuration options. For this, the `.env` file in the project's root directory can be used to override the default options passed to the Spark calls.
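As a sketch, such overrides in `.env` follow the usual key-value format. The variable names below are assumptions for illustration only; check the default `.env` shipped in the repository's root directory for the keys the Spark calls actually recognise:

```shell
# Illustrative values only -- the actual keys are defined by the
# project's default .env file.
SPARK_DRIVER_MEMORY=8g
SPARK_EXECUTOR_MEMORY=4g
```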
To contribute, please fork this repository and create a merge request. Assign one of the developers of this project for a review. Please always add a short introduction to your submission explaining the reason for it.