Hello, and welcome to this walkthrough. Here, you will quickly see the inner workings of the framework.
Let us start by importing the necessary libraries.

In [None]:
import pandas as pd  # used directly below (pd.read_csv, pd.DataFrame)
from Managers.common_functions import *
from Managers.encoder_manager import *
from Editors import *
from Encoders import *
from DataPreparators import *
from Trainers import *

Here, the goal is to predict the next activity from an incomplete case, by training an LSTM network on the "env_permit" dataset. To do so, the data must be edited, encoded, and processed into pairs of "prefixes" (the incomplete cases) and "suffixes" (the activity to predict). Here are the first rows of the env_permit file:

In [None]:
pd.read_csv("../Data/env_permit.csv", nrows=5)

Unnamed: 0,CaseID,ActivityID,CompleteTimestamp
0,2742737,23,2011-06-01 08:00:00
1,2742737,25,2011-06-16 20:17:11
2,2742737,26,2011-06-16 20:17:14
3,2742737,29,2011-06-16 20:17:16
4,2742737,30,2011-06-16 20:17:17


As we can see, it has three columns: a case id, an activity, and a date. Each column must be encoded to fit inside a neural network.

Here are the main parameters for the framework, stored in the "config.py" file. From top to bottom:
* The path for the input file, here "env_permit.csv",
* The name of the folder for the results, stored inside the "Output" folder, here "env_permit_online",
* The size of the chunk when reading the input file, i.e. the maximum number of lines to hold in RAM,
* The number of processed cases to hold in RAM,
* The batch size for the neural network,
* The number of epochs for the neural network,
* A boolean indicating whether the input file has two timestamp columns tied to each activity (for the start and end dates). Here, the file has only one timestamp per activity,
* The indexes of the columns which contain dates. Here, only the third column has dates in it.

In [None]:
input_path = "../Data/env_permit.csv"
output_name = "env_permit_online"
input_chunk_size = 50000
output_chunk_size = 500
batch_size = 32
epoch_counter = 5
double_timestamps = False
dates_ids = [2]

Now that everything is set, we can create the main components of the framework: editors and encoders.

Editors edit cases by adding steps to them (e.g. "End of State"). Here, two editors are created:
* An editor which creates a "Start of State" at the start of each case, named SosForAll,
* An editor which creates an "End of State" at the end of each case, named EosForAll.

In [None]:
editors = [SosForAll(), EosForAll()]
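
To make the idea of an editor concrete, here is a minimal sketch, not the framework's own code. The two helper functions are hypothetical stand-ins for SosForAll and EosForAll, and the choice of reusing the timestamp of the neighbouring row for the added step is an assumption.

```python
import numpy as np
import pandas as pd


def add_start_of_state(case: np.ndarray, activity_col: int = 1) -> np.ndarray:
    """Prepend a copy of the first row whose activity is replaced by "SoS"."""
    first = case[0].copy()
    first[activity_col] = "SoS"
    return np.vstack([first, case])


def add_end_of_state(case: np.ndarray, activity_col: int = 1) -> np.ndarray:
    """Append a copy of the last row whose activity is replaced by "EoS"."""
    last = case[-1].copy()
    last[activity_col] = "EoS"
    return np.vstack([case, last])


# A toy case with the same three columns as the input file.
case = np.array(
    [
        [2742737, 23, pd.Timestamp("2011-06-01 08:00:00")],
        [2742737, 25, pd.Timestamp("2011-06-16 20:17:11")],
    ],
    dtype=object,
)

# Apply the hypothetical editors one after another.
edited = case
for edit in (add_start_of_state, add_end_of_state):
    edited = edit(edited)

print(edited[:, 1].tolist())
```

Chaining the functions in a loop mimics what a manager holding a list of editors could do.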

Encoders encode edited cases into data that can be interpreted by the neural network. Here, three encoders are created:
* An encoder which deletes the first column, named DeleteEncoder,
* An encoder which converts categorical data from the second column into one-hot vectors, named OneHotEncoder,
* An encoder which computes a time difference between two successive dates inside a case, named TimeDifferenceSingleEncoder.

In [None]:
encoders = [DeleteEncoder(0), OneHotEncoder(1, activity=True), TimeDifferenceSingleEncoder(2)]
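
As an illustration of one-hot encoding on activities, here is a small self-contained sketch. It is not the OneHotEncoder of the framework: `build_vocabulary` and `one_hot` are hypothetical helpers, and the column ordering (the two special steps first, then the labels sorted as strings) is an assumption chosen to resemble the column names that the encoder manager displays later in this walkthrough.

```python
import numpy as np


def build_vocabulary(activities):
    """Special steps first, then the activity labels sorted as strings."""
    return ["EoS", "SoS"] + sorted({str(a) for a in activities})


def one_hot(labels, vocabulary):
    """Return one row per label, with a single 1.0 in the label's column."""
    index = {label: i for i, label in enumerate(vocabulary)}
    encoded = np.zeros((len(labels), len(vocabulary)))
    for row, label in enumerate(labels):
        encoded[row, index[str(label)]] = 1.0
    return encoded


vocabulary = build_vocabulary([23, 25, 26, 100, 1, 10])
encoded = one_hot(["SoS", 23, 25, "EoS"], vocabulary)

print(vocabulary)
print(encoded)
```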

Now that the main components are created, we can assign a manager to them. Those managers will be commanded by an **orchestrator**, which manages all the pre-processing of the data.

In [None]:
editor_manager = EditorManager(editors)
encoder_manager = EncoderManager(encoders)

The encoder manager can display the internal parameters of its encoders:

In [None]:
df = encoder_manager.get_all_encoders_description_df()
df

Unnamed: 0,Name,Properties,Input column names,Input column indexes,Output column names,Output column indexes
0,Delete,[],,0,,
1,OneHot,[],,1,,
2,TimeDifferenceSingle,[0],,2,[Time_diff],[0]


As we can see, the encoders have no internal parameters at the moment. However, two encoders need parameters to be set before they can function:
* The one-hot encoder needs the list of all activities of the input file,
* The date difference encoder needs to get the maximum value of the difference between two consecutive dates inside a case.

To get those parameters, the orchestrator reads the database once, in full, when it is created.

In [None]:
orchestrator = build_orchestrator(input_path, output_name, input_chunk_size, encoder_manager, editor_manager, dates_ids, double_timestamps)

Analyze data: 100%|██████████| 1/1 [00:00<00:00, 27.18it/s]
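
The kind of single pass described above can be sketched as follows. This is a simplified illustration under assumptions, not the code of `build_orchestrator`: the CSV content is made up, and `last_seen` is a hypothetical dictionary used to compare each row with the previous row of the same case, even across chunk boundaries.

```python
import io

import pandas as pd

# Made-up data with the same columns as the input file.
csv_text = """CaseID,ActivityID,CompleteTimestamp
1,23,2011-06-01 08:00:00
1,25,2011-06-01 08:00:10
2,23,2011-06-02 09:00:00
2,26,2011-06-02 09:01:00
2,25,2011-06-02 09:01:05
"""

activities = set()      # every activity label met in the file
max_diff_seconds = 0.0  # largest gap between two consecutive dates of a case
last_seen = {}          # case id -> last timestamp read for that case

# Only `chunksize` rows are loaded at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, parse_dates=[2]):
    activities.update(chunk["ActivityID"].astype(str))
    for case_id, timestamp in zip(chunk["CaseID"], chunk["CompleteTimestamp"]):
        if case_id in last_seen:
            diff = (timestamp - last_seen[case_id]).total_seconds()
            max_diff_seconds = max(max_diff_seconds, diff)
        last_seen[case_id] = timestamp

print(sorted(activities), max_diff_seconds)
```

Note that this sketch keeps one timestamp per case id in memory, which a real implementation may avoid if the rows of a case are always contiguous.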


While reading the file, the orchestrator gets information about it, which can be seen here:

In [None]:
orchestrator.show_infos()

--------Orchestrator--------
Database file name: ../Data/env_permit.csv
Ouput folder: env_permit_online
Database column names: ['CaseID', 'ActivityID', 'CompleteTimestamp']
Indexes of the dates columns: [2]
Number of cases: 937
Number of chunks: 1
Is it a double timestamps file? False
Number of activities: 383
Maximum length of a case: 97
Number of features of the neural network: 384
Are there leftovers? True
Number of encoders: 3
---------------------------


And, if we check the properties of the encoders, we can see that they have been updated:

In [None]:
df = encoder_manager.get_all_encoders_description_df()
df

Unnamed: 0,Name,Properties,Input column names,Input column indexes,Output column names,Output column indexes
0,Delete,[],CaseID,0,,
1,OneHot,"[EoS, SoS, 1, 10, 100, 101, 102, 103, 104, 105...",ActivityID,1,"[EoS, SoS, 1, 10, 100, 101, 102, 103, 104, 105...","[0, 382]"
2,TimeDifferenceSingle,[62398463],CompleteTimestamp,2,[Time_diff],[383]


Let us now go through the inner workings of the framework. First, the orchestrator breaks the file into cases. Here is the first case of the env_permit file:

In [None]:
# Get the list of chunks
chunks = pd.read_csv(orchestrator.input_path, chunksize=input_chunk_size, parse_dates=orchestrator.dates_ids)
# Create all preliminary data before the chunks are processed
id_column = orchestrator.column_names[0]
for og_chunk in chunks:
    complete_cases, case_ids, previous_case, previous_case_id = \
                    get_complete_cases(og_chunk, id_column, True, False, None, "")
    case = complete_cases[0]
    break
case

Unnamed: 0,CaseID,ActivityID,CompleteTimestamp
0,2742737,23,2011-06-01 08:00:00
1,2742737,25,2011-06-16 20:17:11
2,2742737,26,2011-06-16 20:17:14
3,2742737,29,2011-06-16 20:17:16
4,2742737,30,2011-06-16 20:17:17
5,2742737,31,2011-06-16 20:17:22
6,2742737,32,2011-06-16 20:17:24


Now, the orchestrator calls the editor manager to edit the case, i.e. add a Start of State and an End of State. To do so, the manager calls one editor after another:

In [None]:
np_case = case.to_numpy()
edited_case = orchestrator.editor_manager.edit_case(np_case, orchestrator)
pd.DataFrame(edited_case, columns=orchestrator.column_names)

Unnamed: 0,CaseID,ActivityID,CompleteTimestamp
0,2742737,SoS,2011-06-01 08:00:00
1,2742737,23,2011-06-01 08:00:00
2,2742737,25,2011-06-16 20:17:11
3,2742737,26,2011-06-16 20:17:14
4,2742737,29,2011-06-16 20:17:16
5,2742737,30,2011-06-16 20:17:17
6,2742737,31,2011-06-16 20:17:22
7,2742737,32,2011-06-16 20:17:24
8,2742737,EoS,2011-06-16 20:17:24


Now that the case has been edited, it can be encoded by the encoder manager, which calls every encoder and merges their results:

In [None]:
encoded_case = orchestrator.encoder_manager.encode_case(edited_case)
pd.DataFrame(encoded_case, columns=orchestrator.encoder_manager.all_output_column_names)

Unnamed: 0,EoS,SoS,1,10,100,101,102,103,104,105,...,91,92,93,94,95,96,97,98,99,Time_diff
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02147859
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.807811e-08
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.205207e-08
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.602604e-08
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.013018e-08
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.205207e-08
8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can see the result of the one-hot encoder on activities, concatenated with the time difference encoder. The case id has been deleted.
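
The values of the Time_diff column can be checked by hand. The sketch below assumes that the property of the time difference encoder (62398463) is a maximum expressed in seconds and that each difference is divided by it. This assumption is consistent with the numbers displayed above, but it is not taken from the encoder's source code.

```python
import pandas as pd

max_diff_seconds = 62398463  # property shown for TimeDifferenceSingle

# First four dates of the edited case (the SoS row repeats the first date).
timestamps = pd.Series(
    pd.to_datetime(
        [
            "2011-06-01 08:00:00",
            "2011-06-01 08:00:00",
            "2011-06-16 20:17:11",
            "2011-06-16 20:17:14",
        ]
    )
)

# Seconds elapsed since the previous row; the first row has no predecessor.
diffs = timestamps.diff().dt.total_seconds().fillna(0.0)
normalised = diffs / max_diff_seconds

print(diffs.tolist())
print(normalised.tolist())
```

The third and fourth values match the 0.02147859 and 4.807811e-08 displayed in the encoded case.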

Every encoder can leave a **leftover**, i.e. data that is lost during the encoding. As we will see later, leftovers are needed to decode the data and get back the original values.

Two encoders out of three leave a leftover:
* The delete encoder leaves the case id that has been deleted (here "2742737"),
* The time difference encoder leaves the first date of the case (here "2011-06-01 08:00:00"). For decoding, the time differences are combined with this date to get the original dates of the case.

In [None]:
leftover = orchestrator.encoder_manager.get_leftover(edited_case)
leftover.tolist()

[2742737, Timestamp('2011-06-01 08:00:00')]

This framework makes it possible to automatically decode this encoded data, to interpret the results. To do so, decoders are created. Here, they are created automatically according to the encoders that were built.

In [None]:
decoder_manager = create_all_decoders(orchestrator)
decoder_manager.get_all_encoders_description_df()
decoder_manager.set_all_output_column_names()

Once the decoders are built, we assign to them the leftovers that were generated before:

In [None]:
decoder_manager.encoders[0].set_leftover([leftover[0]])
decoder_manager.encoders[2].set_leftover([leftover[1]])

And we can run them to get the original data:

In [None]:
decoded_case = decoder_manager.encode_case(encoded_case)
#pd.DataFrame(data=decoded_case, columns=decoder_manager.all_output_column_names)
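
To show why the leftovers are enough to recover the original data, here is a minimal decoding sketch. It is not the code of the decoders built by `create_all_decoders`: the four-label vocabulary and the encoded array are made up, and the time differences are assumed to be seconds divided by the maximum value 62398463.

```python
import numpy as np
import pandas as pd

max_diff_seconds = 62398463
first_date = pd.Timestamp("2011-06-01 08:00:00")  # leftover of the time encoder
vocabulary = ["EoS", "SoS", "23", "25"]           # made-up, shortened vocabulary

# One row per step: one-hot activity columns, then the normalised time difference.
encoded = np.array(
    [
        [0.0, 1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0, 1340231 / max_diff_seconds],
        [1.0, 0.0, 0.0, 0.0, 0.0],
    ]
)

# The position of the 1.0 in each row gives the activity label back.
activities = [vocabulary[i] for i in encoded[:, :-1].argmax(axis=1)]

# Summing the time differences gives the seconds elapsed since the first date.
elapsed = np.cumsum(encoded[:, -1] * max_diff_seconds)
dates = [first_date + pd.Timedelta(seconds=int(round(float(s)))) for s in elapsed]

print(activities)
print(dates)
```

Without the first date, only the gaps between the steps could be recovered, not the dates themselves.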

The case can now be sliced into prefixes and suffixes, as in this example:

In [None]:
preparator = SlicerLSTM()
preparator.build(input_chunk_size, output_chunk_size, batch_size, orchestrator)
result = next(preparator.run_online())
prefix = result[0][3]
suffix = result[1][3]
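
The slicing itself can be pictured with the following sketch. It is an illustration under assumptions, not the code of SlicerLSTM: `slice_case` is a hypothetical function, prefixes are assumed to be padded with zeros at the top up to a fixed length, the suffix is assumed to contain only the one-hot activity columns of the next step, and the case is assumed to be no longer than `max_length` + 1 rows.

```python
import numpy as np


def slice_case(encoded_case, max_length, n_activity_columns):
    """Build one (prefix, suffix) pair per step to predict."""
    prefixes, suffixes = [], []
    n_features = encoded_case.shape[1]
    for cut in range(1, len(encoded_case)):
        # Left padding: the known steps sit at the bottom of the prefix.
        prefix = np.zeros((max_length, n_features))
        prefix[max_length - cut:] = encoded_case[:cut]
        prefixes.append(prefix)
        # The suffix is the activity of the step that follows the prefix.
        suffixes.append(encoded_case[cut, :n_activity_columns])
    return np.stack(prefixes), np.stack(suffixes)


# Toy encoded case: columns EoS, SoS, A, then the time difference.
encoded_case = np.array(
    [
        [0.0, 1.0, 0.0, 0.0],  # SoS
        [0.0, 0.0, 1.0, 0.0],  # A
        [0.0, 0.0, 1.0, 0.5],  # A
        [1.0, 0.0, 0.0, 0.0],  # EoS
    ]
)

prefixes, suffixes = slice_case(encoded_case, max_length=5, n_activity_columns=3)

print(prefixes.shape, suffixes.shape)
print(prefixes[1])
print(suffixes[1])
```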

This is the input of the neural network. You can see the original values at the end of the table:

In [None]:
prefix

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 2.14785899e-02],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 4.80781073e-08]])

And this is the expected output of this example:

In [None]:
suffix

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

You have now seen the inner workings of the framework. You can tinker with the configuration and have a look at the "main.py" file. You can also run the code yourself, as in this example...

In [None]:
create_directories(orchestrator.output_name)
orchestrator.process_online(input_chunk_size)
preparator = SlicerLSTM()
preparator.build(input_chunk_size, output_chunk_size, batch_size, orchestrator)
preparator.run_online()
lstm_trainer = LSTMTrainer()
lstm_trainer.build(preparator, epoch_counter)

... and train the model!

In [None]:
model = lstm_trainer.train_model_online()

Build model...
39881 cases
Epoch 1/5


2022-07-04 16:45:37.446162: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


  33/1247 [..............................] - ETA: 2:09 - loss: 4.2293


KeyboardInterrupt



You can see all the results in the "Output/env_permit_online" folder, including the encoded and decoded data, a description of the input file, and the model that has been trained.