In [28]:
import pandas as pd
import numpy as np
import pm4py
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display
from ipywidgets import interactive
from matplotlib.ticker import FuncFormatter

from pm4py.algo.transformation.log_to_features import algorithm as log_to_features
from pm4py.algo.transformation.log_to_features.variants.event_based import extract_all_ev_features_names_from_log
from pm4py.objects.log.util import interval_lifecycle
from pm4py.util import constants

from helper import create_sample

DICT = {'3-way match, invoice after GR':'DF1', '3-way match, invoice before GR':'DF2', '2-way match':'DF3', 'Consignment':'DF4'}
WINDOWS = False
VERSION = 'DF1'

In [29]:
df = create_sample(num_cases=10000, Windows=WINDOWS, version=VERSION)
dataframe = pm4py.format_dataframe(df, case_id='case_concept_name', activity_key='event_concept_name', timestamp_key='event_time_timestamp')
event_log = pm4py.convert_to_event_log(dataframe)
print(f"event_log length: {len(event_log)}")

event_log length: 63


Here are a few important questions that we might
seek to address:
1. Is there a relationship between advertising budget and sales?
Our first goal should be to determine whether the data provide evidence
of an association between advertising expenditure and sales. If
the evidence is weak, then one might argue that no money should be
spent on advertising!
2. How strong is the relationship between advertising budget and sales?
Assuming that there is a relationship between advertising and sales,
we would like to know the strength of this relationship. Does knowledge
of the advertising budget provide a lot of information about
product sales?
3. Which media are associated with sales?
Are all three media—TV, radio, and newspaper—associated with
sales, or are just one or two of the media associated? To answer this
question, we must find a way to separate out the individual contribution
of each medium to sales when we have spent money on all three
media.
4. How large is the association between each medium and sales?
For every dollar spent on advertising in a particular medium, by
what amount will sales increase? How accurately can we predict this
amount of increase?
5. How accurately can we predict future sales?
For any given level of television, radio, or newspaper advertising, what
is our prediction for sales, and what is the accuracy of this prediction?
6. Is the relationship linear?
If there is approximately a straight-line relationship between advertising
expenditure in the various media and sales, then linear regression
is an appropriate tool. If not, then it may still be possible to transform
the predictor or the response so that linear regression can be
used.
7. Is there synergy among the advertising media?
Perhaps spending $50,000 on television advertising and $50,000 on radio
advertising is associated with higher sales than allocating $100,000
to either television or radio individually. In marketing, this is known
as a synergy effect, while in statistics it is called an interaction effect.
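
Question 7 can be made concrete with a small sketch. The block below fits an ordinary least squares model that includes a TV × radio product term, which is one common way to represent an interaction effect. It runs on synthetic budgets, not on the actual advertising data: the variable names, the "true" coefficients, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, made-up advertising budgets (assumption: NOT the real data set).
tv = rng.uniform(0, 100, n)
radio = rng.uniform(0, 50, n)

# Assumed ground truth with a positive interaction coefficient of 0.02.
sales = 3.0 + 0.05 * tv + 0.10 * radio + 0.02 * tv * radio + rng.normal(0, 1.0, n)

# Design matrix: intercept, the two main effects, and the interaction tv * radio.
X = np.column_stack([np.ones(n), tv, radio, tv * radio])

# Least-squares fit; coef[3] estimates the interaction (synergy) effect.
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

for name, value in zip(["intercept", "tv", "radio", "tv:radio"], coef):
    print(f"{name:>10}: {value:.4f}")
```

An estimated `tv:radio` coefficient clearly different from zero suggests that the effect of one medium depends on the level of the other; on real data this would need a significance test rather than a visual check of the estimate.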


### Data exploration

Exploring your dataframe to assess its suitability for linear regression involves several steps. Linear regression assumes a linear relationship between the independent variables (features) and the dependent variable (target). Here's a structured approach to explore your data:

1. Understanding Your Data
   - Summary Statistics: Use .describe() to get summary statistics for numerical columns, including count, mean, standard deviation, minimum, and maximum values. Look for anomalies or outliers that might need addressing.
   - Data Types: Check the data type of each column with .info() to ensure it is appropriate for the analysis (for regression, the features and the target variable should be numerical or numerically encoded).
2. Visual Exploration
   - Histograms: Plot histograms of your features and target variable to understand their distributions. Linear regression does not require the variables themselves to be normally distributed (the normality assumption concerns the residuals, see step 6), but strongly skewed variables may benefit from a transformation.
   - Scatter Plots: Create scatter plots between each feature and the target variable to inspect visually whether the relationship is linear.
   - Correlation Matrix: Use .corr() to generate a correlation matrix and visualize it as a heatmap. High correlation between a feature and the target suggests a potential predictor, while high correlation between features indicates multicollinearity, which could be problematic.
3. Checking for Linearity
   The core assumption of linear regression is that there is a linear relationship between the predictors and the response variable. You can:
   - Plot scatter plots to visually check for a linear relationship.
   - Calculate correlation coefficients to measure the strength and direction of the linear relationships.
4. Assessing Multicollinearity
   - Variance Inflation Factor (VIF): Calculate the VIF for each feature. A VIF greater than 10 (some suggest a stricter threshold of 5) indicates high multicollinearity, which inflates the variance of the coefficient estimates and makes them unstable and hard to interpret.
5. Outliers and Leverage Points
   - Outliers: Outliers can significantly affect the fit of a linear regression model. Use box plots or Z-scores to identify outliers in your data.
   - Leverage Points: Points that have an unusual combination of predictor values can unduly influence the model. Leverage plots can help identify these.
6. Homoscedasticity and Normality of Residuals
   - Homoscedasticity: The variance of the error terms should be constant across all levels of the independent variables. You can check this assumption by plotting residuals against predicted values.
   - Normality of Residuals: The residuals should be normally distributed. This can be checked with a Q-Q plot or a normality test such as the Shapiro-Wilk test.
7. Data Preprocessing
   - Handling Missing Values: Decide on a strategy for handling missing data, which might include imputation or removal of rows or columns.
   - Feature Engineering: Consider creating new features that might better capture the linear relationship with the target variable.
   - Scaling/Normalization: Ordinary least squares does not require scaling to fit the model, but scaling matters for regularized variants such as Ridge or Lasso, whose penalties depend on the magnitude of the coefficients.
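
The multicollinearity check in step 4 can be sketched with NumPy alone. The helper below regresses each feature on all the others and reports VIF = 1 / (1 - R²). The function name `variance_inflation_factors` and the synthetic columns `x1` to `x4` are hypothetical and exist only for this illustration; they are not taken from the event-log features in this notebook. statsmodels also ships a `variance_inflation_factor` function that could be used instead.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (shape: n_samples x n_features).

    Hypothetical helper for illustration: each column is regressed on the
    remaining columns (plus an intercept), and VIF = 1 / (1 - R^2).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Intercept plus every column except column j.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        # With an intercept the residuals have zero mean, so this is 1 - SS_res / SS_tot.
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic data (assumption): x3 is almost a linear combination of x1 and x2,
# while x4 is independent of the other columns.
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)
x4 = rng.normal(size=n)

X = np.column_stack([x1, x2, x3, x4])
vifs = variance_inflation_factors(X)

for name, v in zip(["x1", "x2", "x3", "x4"], vifs):
    print(f"{name}: VIF = {v:.1f}")
```

In this setup the three entangled columns should exceed the threshold of 10 while the independent column stays close to 1. For one-hot encoded features such as the ones produced later in this notebook, a full set of dummy columns for one attribute is perfectly collinear, so one level per attribute would typically be dropped before computing VIFs.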

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1455 entries, 0 to 6547
Data columns (total 28 columns):
 #   Column                            Non-Null Count  Dtype              
---  ------                            --------------  -----              
 0   eventID                           1455 non-null   int64              
 1   case_Spend_area_text              1455 non-null   object             
 2   case_Company                      1455 non-null   object             
 3   case_Document_Type                1455 non-null   object             
 4   case_Sub_spend_area_text          1455 non-null   object             
 5   case_Purchasing_Document          1455 non-null   int64              
 6   case_Purch._Doc._Category_name    1455 non-null   object             
 7   case_Vendor                       1455 non-null   object             
 8   case_Item_Type                    1455 non-null   object             
 9   case_Item_Category                1455 non-null   object            

In [44]:
# Use pm4py to extract features from the event log
parameters = {
    constants.PARAMETER_CONSTANT_ACTIVITY_KEY: 'event_concept_name',
    constants.PARAMETER_CONSTANT_CASEID_KEY: 'case_concept_name',
    constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: 'event_time_timestamp',
}
data, feature_names = log_to_features.apply(event_log, parameters=parameters)
print(f"features length: {len(data)}")
print(feature_names)
print(data[:1])

# Convert data to a pandas DataFrame
data = pd.DataFrame(data, columns=feature_names)

# Select the feature names that start with 'event:case_Document_Type@'
# and show their values for the first five rows
event_features = [f for f in feature_names if f.startswith('event:case_Document_Type@')]
print(event_features)
print(data[event_features][:5])


features length: 63
['event:case_Document_Type@EC Purchase order', 'event:case_Document_Type@Framework order', 'event:case_Document_Type@Standard PO', 'event:case_Spend_classification_text@NPR', 'event:case_Spend_classification_text@PR', 'event:case_Item_Type@Service', 'event:case_Item_Type@Standard', 'event:case_Item_Type@Subcontracting', 'event:case_Item_Type@Third-party', 'event:case_Purch._Doc._Category_name@Purchase order', 'event:case_Company@companyID_0000', 'event:case_Spend_area_text@Additives', 'event:case_Spend_area_text@CAPEX & SOCS', 'event:case_Spend_area_text@Enterprise Services', 'event:case_Spend_area_text@Latex & Monomers', 'event:case_Spend_area_text@Logistics', 'event:case_Spend_area_text@Marketing', 'event:case_Spend_area_text@Packaging', 'event:case_Spend_area_text@Sales', 'event:case_Spend_area_text@Trading & End Products', 'event:case_Item_Category@3-way match, invoice after GR', 'event:case_Source@sourceSystemID_0000', 'event:log_type@DF1', 'event:event_concept

In [51]:
import sys
import os

# Get the current notebook path
notebook_path = os.getcwd()

# Construct the path to the parent directory
parent_dir = os.path.dirname(notebook_path)

# Add the parent directory to sys.path
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

from wise.utils.get_features import total_number_of_events

# Get the total number of events
total_number_of_events(data, event_features)


ImportError: cannot import name 'event_names' from 'utils' (/Users/urszulajessen/code/gitHub/WISE/wise_flow/wise/utils/__init__.py)