# Current Process-Atom Research Code (main end-to-end pipeline)

### Main issues
* The components that handle the individual sub-tasks are tightly intertwined
  * strong focus on specific methods (BERT tagger, sentence transformer) for assessing semantic similarity/matching
  * everything is implemented on top of pandas DataFrames, which is not modular enough

### High-level TODOs
* Create a proper data model for atoms of different types and for atoms that are instantiated for a specific event log
* Decouple the different sub-tasks
* Make them more modular
  * Extraction per atom type
  * Extraction per model
  * Matching function as a parameter
  * ...
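
A minimal sketch of what such a data model could look like, assuming plain dataclasses and a matcher passed in as a function. All names here (`AtomType`, `ProcessAtom`, `InstantiatedAtom`, `instantiate`, `exact_match`) are hypothetical and do not exist in `semconstmining`; the threshold and the "minimum operand score" relevance are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class AtomType(Enum):
    ACTIVITY = "activity"
    MULTI_OBJECT = "multi_object"
    OBJECT = "object"
    RESOURCE = "resource"


@dataclass(frozen=True)
class ProcessAtom:
    """An atom as mined from the model collection (log-independent)."""
    atom_type: AtomType
    template: str                # e.g. a Declare template name
    operands: Tuple[str, ...]    # normalized labels from the models


@dataclass(frozen=True)
class InstantiatedAtom:
    """An atom whose operands have been mapped to labels of one event log."""
    source: ProcessAtom
    log_operands: Tuple[str, ...]
    relevance: float


# A matcher scores how well a model label fits a log label (0.0 to 1.0).
Matcher = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Trivial placeholder matcher: case-insensitive equality."""
    return 1.0 if a.lower() == b.lower() else 0.0


def instantiate(atom: ProcessAtom, log_labels: Sequence[str],
                matcher: Matcher, threshold: float = 0.5) -> Optional[InstantiatedAtom]:
    """Map every operand to its best-scoring log label.

    Returns None if there are no labels or operands, or if any operand
    has no label scoring at least `threshold`.
    """
    if not log_labels or not atom.operands:
        return None
    matched, scores = [], []
    for operand in atom.operands:
        best = max(log_labels, key=lambda label: matcher(operand, label))
        score = matcher(operand, best)
        if score < threshold:
            return None
        matched.append(best)
        scores.append(score)
    # Assumption: the weakest operand match determines the atom's relevance.
    return InstantiatedAtom(atom, tuple(matched), min(scores))


atom = ProcessAtom(AtomType.ACTIVITY, "Response", ("create order", "approve order"))
result = instantiate(atom, ["Create Order", "Approve Order", "Ship Goods"], exact_match)
```

Because the matcher is just a parameter, an embedding-based or synonym-based matcher could be swapped in without touching the extraction or instantiation code.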

In [None]:
from semconstmining.parsing.label_parser.nlp_helper import NlpHelper
from semconstmining.selection.instantiation.recommendation_config import RecommendationConfig
from semconstmining.selection.consistency.consistency import ConsistencyChecker
from semconstmining.config import Config
from semconstmining.main import (
    get_resource_handler, get_or_mine_constraints, get_parts_of_constraints,
    get_log_and_info, compute_relevance_for_log, recommend_constraints_for_log,
    add_original_labels, check_constraints, get_violation_to_cases,
)
from pathlib import Path
import os
import pandas as pd

In [None]:
LOG_FILE = "BPI_Challenge_2019.xes"
MODEL_COLLECTION = "semantic_sap_sam_filtered"

config = Config(Path(os.getcwd()).parents[0].resolve(), MODEL_COLLECTION)

### Load the NLP resources needed for matching

Loads:
* spaCy
* GloVe embeddings
* a custom pretrained transformer for analyzing activity labels (extracting objects and actions)
* a pretrained sentence transformer to assess similarities between process atom components and event log components
* WordNet for synonym checking of actions (such as 'assess' and 'check')

In [None]:
nlp_helper = NlpHelper(config)

### Loads and processes models so process atoms can be extracted/mined from them 

* transforms models into Petri nets (PNs) and plays them out to generate activity sequences
* filters out models that are unsuitable (e.g., unsound)
* parses and stores activity labels 

In [None]:
resource_handler = get_resource_handler(config, nlp_helper)

### Extracts process atoms from activity sequences.
* Extracts four types of process atoms from the activity sequence set that has been produced per model
  * Activity
  * Multi-object (projects the activity sequences to object types contained in the labels)
  * Object (projects the activity sequences to actions for each distinct object type found in the labels)
  * Resource (uses pool and lane info stored in a role_to_activity mapping in the resource handler)
* Extraction is based on Declare mining (extends an existing Python library)
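
The two projections behind the multi-object and object atom types can be illustrated with a small sketch. It assumes every activity label has already been parsed into an `(action, object)` pair; the `parsed` mapping and both function names are hypothetical, and collapsing consecutive repeats of the same object is an assumption rather than the pipeline's confirmed behavior.

```python
from typing import Dict, List, Tuple

ParsedLabel = Tuple[str, str]  # (action, object), hypothetical parser output


def project_to_objects(trace: List[str], parsed: Dict[str, ParsedLabel]) -> List[str]:
    """Multi-object view: replace each activity by its object.

    Consecutive repeats of the same object are collapsed (an assumption).
    """
    projected: List[str] = []
    for activity in trace:
        obj = parsed[activity][1]
        if not projected or projected[-1] != obj:
            projected.append(obj)
    return projected


def project_to_actions(trace: List[str], parsed: Dict[str, ParsedLabel]) -> Dict[str, List[str]]:
    """Object view: for each object, the ordered actions applied to it."""
    per_object: Dict[str, List[str]] = {}
    for activity in trace:
        action, obj = parsed[activity]
        per_object.setdefault(obj, []).append(action)
    return per_object


parsed = {
    "create order": ("create", "order"),
    "approve order": ("approve", "order"),
    "create invoice": ("create", "invoice"),
    "send invoice": ("send", "invoice"),
}
trace = ["create order", "approve order", "create invoice", "send invoice"]

object_view = project_to_objects(trace, parsed)   # ['order', 'invoice']
action_view = project_to_actions(trace, parsed)
# {'order': ['create', 'approve'], 'invoice': ['create', 'send']}
```

Declare mining would then run on the projected sequences instead of the original activity sequences.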

In [None]:
all_constraints = get_or_mine_constraints(config, resource_handler, min_support=1)

### Pre-computes embeddings of the natural-language process atom components

In [None]:
nlp_helper.pre_compute_embeddings(sentences=get_parts_of_constraints(config, all_constraints))

### Load the event log from disk 
(this part will be handled via API to PINT)

In [None]:
event_log, log_info = get_log_and_info(config, nlp_helper, LOG_FILE)

### Compute the relevance/semantic similarity between log components (events/activities and parts of events/activities) and process atom components 

In [None]:
all_constraints_with_relevance = compute_relevance_for_log(config, all_constraints, nlp_helper, LOG_FILE, pd_log=event_log, precompute=True)
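
To illustrate the idea of embedding-based relevance, here is a minimal sketch that assumes cosine similarity over precomputed vectors. The two-dimensional toy vectors and the helper names `cosine_similarity` and `best_match` are hypothetical; how `compute_relevance_for_log` actually scores components is not shown in this notebook.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(component_vec: Sequence[float],
               log_vectors: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the log label whose embedding is most similar to the component."""
    scored = ((label, cosine_similarity(component_vec, vec))
              for label, vec in log_vectors.items())
    return max(scored, key=lambda pair: pair[1])


# Toy embeddings standing in for sentence-transformer output (assumption).
atom_component = [1.0, 0.0]                  # e.g. "approve order"
log_vectors = {
    "Approve Purchase Order": [0.9, 0.1],
    "Ship Goods": [0.0, 1.0],
}

label, score = best_match(atom_component, log_vectors)
```

Here `label` is `"Approve Purchase Order"`, since its vector points in nearly the same direction as the atom component.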

### Instantiate the constraints based on their relevance -> generate queries that match the "language of the event log"

In [None]:
rec_config = RecommendationConfig(config, semantic_weight=0.9, top_k=250)
recommended_constraints = recommend_constraints_for_log(config, rec_config, all_constraints_with_relevance, nlp_helper, LOG_FILE, pd_log=event_log)
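
One plausible reading of `semantic_weight` and `top_k` is a weighted ranking followed by a cut-off. The sketch below assumes that the score is a convex combination of a semantic relevance value and a support value; this formula, the dictionary keys, and `rank_constraints` are assumptions for illustration, not the confirmed behavior of `RecommendationConfig`.

```python
from typing import Dict, List


def rank_constraints(constraints: List[Dict], semantic_weight: float = 0.9,
                     top_k: int = 250) -> List[Dict]:
    """Sort by a weighted score and keep the top_k best constraints."""
    def score(c: Dict) -> float:
        # Assumed scoring: relevance and support are both in [0, 1].
        return semantic_weight * c["relevance"] + (1 - semantic_weight) * c["support"]
    return sorted(constraints, key=score, reverse=True)[:top_k]


candidates = [
    {"id": "c1", "relevance": 0.9, "support": 0.1},
    {"id": "c2", "relevance": 0.2, "support": 1.0},
    {"id": "c3", "relevance": 0.8, "support": 0.9},
]

top_two = [c["id"] for c in rank_constraints(candidates, semantic_weight=0.9, top_k=2)]
```

With a high semantic weight, `c1` and `c3` are kept and the well-supported but semantically weak `c2` is dropped.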

### Check for obvious contradictions between pairs of constraints

In [None]:
consistency_checker = ConsistencyChecker(config)
# Check for trivial inconsistencies
consistent_recommended_constraints = consistency_checker.check_trivial_consistency(recommended_constraints)
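
As a rough illustration of a pairwise check, the sketch below flags two constraints over the same operands whose Declare templates are a positive/negative pair. The tuple representation, the `CONFLICTING` set, and `find_trivial_conflicts` are hypothetical; which contradictions `check_trivial_consistency` actually detects is not shown here. Such pairs can only hold together vacuously, which is why they are treated as conflicts in this sketch.

```python
from itertools import combinations
from typing import List, Tuple

Constraint = Tuple[str, str, str]  # (template, operand_a, operand_b)

# Assumed conflicting template pairs (each template and its negated form).
CONFLICTING = {
    frozenset(("Co-Existence", "Not Co-Existence")),
    frozenset(("Succession", "Not Succession")),
    frozenset(("Chain Succession", "Not Chain Succession")),
}


def find_trivial_conflicts(constraints: List[Constraint]) -> List[Tuple[Constraint, Constraint]]:
    """Return all pairs with identical operands and conflicting templates."""
    conflicts = []
    for c1, c2 in combinations(constraints, 2):
        same_operands = c1[1:] == c2[1:]
        if same_operands and frozenset((c1[0], c2[0])) in CONFLICTING:
            conflicts.append((c1, c2))
    return conflicts


constraints = [
    ("Succession", "create order", "approve order"),
    ("Not Succession", "create order", "approve order"),
    ("Co-Existence", "create invoice", "send invoice"),
]

conflicts = find_trivial_conflicts(constraints)
```

Only the first two constraints are reported, because they share operands and use a template together with its negation.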

### Add the original labels to process atoms so that a connection to events is possible
(needed because the labels were normalized/preprocessed earlier). This should happen much sooner, already when matching/instantiating atoms.

In [None]:
consistent_recommended_constraints = add_original_labels(config, consistent_recommended_constraints, log_info)

### Identify violations: needs to be handled via queries to PINT

In [None]:
violations = check_constraints(config, LOG_FILE, consistent_recommended_constraints, nlp_helper, pd_log=event_log, with_id=True)
violations_to_cases = get_violation_to_cases(config, violations, with_id=True)
violation_df = pd.DataFrame.from_records([{"violation": violation, "num_violations": len(cases), "cases": cases} for violation, cases in violations_to_cases.items()])

In [None]:
violation_df

In [None]:
if len(violation_df) > 0:
    violation_df = pd.merge(consistent_recommended_constraints.reset_index(), violation_df,
                         left_on=config.RECORD_ID, right_on='violation', how='inner')
violation_df