
Batch processing #42

Merged
takojunior merged 40 commits into master from batch_processing
Apr 13, 2023

Conversation

@takojunior
Contributor

This PR enables pattern mining on batches of sequences, instead of only being able to run on the entire set. The changes in this PR include:

  • Enable mining on batches of sequences, so that mining tasks can run in parallel
  • Enable the backend mining task to run in a child process, which improves memory efficiency
  • Enable aggregation of the patterns mined from batches into the final results
  • Require that input integer sequences contain only positive integers
  • Fixes to the docs and readme files
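The split/mine-in-child-process/aggregate flow described above can be sketched as follows. This is a minimal illustration, not the actual Seq2Pat implementation: `mine_batch` here is a stand-in that only counts item frequencies, whereas the real backend mines constrained sequential patterns.

```python
from multiprocessing import Pool


def make_batches(sequences, batch_size):
    """Split the input sequences into consecutive batches of at most batch_size."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]


def mine_batch(batch):
    """Stand-in miner: count in how many sequences each item occurs.

    The real Seq2Pat miner finds constrained sequential patterns; frequency
    counting is used here only to illustrate the batch/aggregate flow.
    """
    counts = {}
    for seq in batch:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    return counts


def aggregate(batch_results):
    """Sum the per-batch frequencies into the final result."""
    total = {}
    for counts in batch_results:
        for item, count in counts.items():
            total[item] = total.get(item, 0) + count
    return total


if __name__ == "__main__":
    sequences = [[1, 2, 3], [1, 2], [2, 3], [1, 3]]
    batches = make_batches(sequences, batch_size=2)
    # Each batch is mined in a child process, so batches run in parallel and
    # the memory used by a worker is released when the child exits.
    with Pool(processes=2) as pool:
        results = pool.map(mine_batch, batches)
    print(aggregate(results))
```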

Signed-off-by: Xin Wang <xin.wang@fmr.com>
Contributor

@bkleyn bkleyn left a comment


Looks good to me at a high level. Added a few clarifying comments/questions.

Have you done any analysis to evaluate speed-up and/or improved scalability?

@skadio
Contributor

skadio commented Dec 19, 2022

I am done with my first pass.

To be direct, I am not comfortable with a PR that comes this late in the year, just weeks before the AAAI Tutorial. This PR changes BOTH the API and the behavior, and with a tutorial ahead of us, for this PR to fly it should have been crystal clear, which is unfortunately not the case.

High-level comments:

  • We should have had a quick API Design Review beforehand (note to ourselves)
  • What's the functionality of this feature? I might have missed it, but what happens when one runs this? What are the runtime benefits of batching? What are the experimental gains/observations? Is there any analysis on batch size/runtime/dataset? There is a quick comment in the pydocs about setting batch_size; what analysis is that based on? More importantly, apart from runtime, what happens to the patterns? Do we get roughly the same set of patterns? Are patterns completely different when batched? How do the patterns change as a function of batch_size? What should the user expect in terms of pattern changes?

@takojunior
Contributor Author

> I am done with my first pass.
>
> To be direct, I am not comfortable with a PR that comes this late in the year, just weeks before the AAAI Tutorial. This PR changes BOTH the API and the behavior, and with a tutorial ahead of us, for this PR to fly it should have been crystal clear, which is unfortunately not the case.
>
> High-level comments:
>
> • We should have had a quick API Design Review beforehand (note to ourselves)
> • What's the functionality of this feature? I might have missed it, but what happens when one runs this? What are the runtime benefits of batching? What are the experimental gains/observations? Is there any analysis on batch size/runtime/dataset? There is a quick comment in the pydocs about setting batch_size; what analysis is that based on? More importantly, apart from runtime, what happens to the patterns? Do we get roughly the same set of patterns? Are patterns completely different when batched? How do the patterns change as a function of batch_size? What should the user expect in terms of pattern changes?

I agree that this PR includes some major updates to the API and to the mining behavior when batching is used. I will send another note to the reviewers and facilitate follow-up discussions to resolve the comments. Given that the AAAI tutorial is only a few weeks away, the PR would not be ready to be pushed through.

I think it might be better to withdraw the PR and take the comments as follow-up actions. I will resubmit the PR once I have addressed the reviewers' questions and comments more clearly.

@takojunior takojunior closed this Dec 21, 2022
@takojunior
Contributor Author

Hi All,

I would like to reopen this PR for consideration of adding a batch processing capability. Since the last review, I have made a few updates to the PR:

  • Renamed the parameter min_frequency_df to discount_factor
  • Revised comments in the unit tests for clarity, mostly in test_seq2pat_batch.py, for the tests that check whether the results from the original C++ implementation and seq2pat are the same
  • Moved the shuffle_data method to be internal to Seq2Pat, since it is only used in batching mode and users do not necessarily need to see it
  • Moved set_attribute to be internal to Attribute, since it is also only used in batching and does not need to be exposed
  • Added validation of the parameters used in batching, e.g. batch_size, n_jobs, discount_factor
  • Added an example notebook batch_processing.ipynb to show the usage of batch mode on sample_data
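For reference, validation along the lines added here might look like the sketch below. The parameter names batch_size, n_jobs, and discount_factor come from this PR; the exact checks, conventions, and error messages are assumptions, not the actual Seq2Pat implementation.

```python
def validate_batch_params(batch_size, n_jobs, discount_factor):
    """Validate batching parameters. Sketch only; the real checks may differ."""
    # batch_size: either unset (None) or a positive integer.
    if batch_size is not None and (not isinstance(batch_size, int) or batch_size <= 0):
        raise ValueError("batch_size must be a positive integer or None")
    # n_jobs: assumes the common convention where n_jobs=-1 means "all cores".
    if not isinstance(n_jobs, int) or n_jobs == 0:
        raise ValueError("n_jobs must be a nonzero integer, e.g. 2 or -1")
    # discount_factor: a ratio in (0, 1].
    if not isinstance(discount_factor, float) or not 0.0 < discount_factor <= 1.0:
        raise ValueError("discount_factor must be a float in (0, 1]")


# A valid combination passes silently; invalid values raise ValueError.
validate_batch_params(batch_size=10000, n_jobs=2, discount_factor=0.2)
```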

For the analysis of batch_size/discount_factor/runtime, we addressed the questions from the last PR with more experiments. Here is a summary of the analysis:

  • We tested on a data set with approximately 100k sequences. The results show that as batch_size increases, e.g. from 10000 to 100000, the runtime increases, while the mined patterns are all the same as when mining the entire set on a single thread. batch_size=10000 gives the largest runtime benefit compared to running on the entire set.
  • On the same 100k sequences, we set batch_size=10000 and varied discount_factor from 0.1 to 1.0. We observe that the runtime decreases as discount_factor increases. Only when discount_factor=1.0 does batching mode miss some patterns compared to running on the entire set. We recommend discount_factor=0.2 by default for robustness of the results, at the expense of runtime.
  • In an even larger test on ~1M sequences, we set batch_size=10000, discount_factor=0.8, n_jobs=8. Batch mode saves 60% of the runtime compared to running on the entire set, while the resulting patterns from the two processes are the same.
  • When the data size is small, e.g. a few thousand sequences, there is no benefit to running in batch mode. We therefore recommend using batch mode only when the data has at least hundreds of thousands of sequences.
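One way to read the discount_factor trade-off observed above: the global frequency threshold is discounted when applied within each batch, so borderline patterns are not pruned before aggregation. The helper below is a hypothetical sketch of that idea, not the actual Seq2Pat code.

```python
def per_batch_threshold(min_frequency, batch_size, discount_factor):
    """Scale a global min_frequency ratio down to a per-batch count.

    Hypothetical sketch: a smaller discount_factor yields a lower per-batch
    threshold, so more candidate patterns survive to the aggregation step
    (more robust results, but more runtime), while discount_factor=1.0
    prunes each batch at the full threshold and may miss global patterns.
    """
    return max(1, int(discount_factor * min_frequency * batch_size))
```

Under this reading, discount_factor=0.2 with min_frequency=0.1 and batch_size=10000 keeps any pattern seen in at least 200 sequences of a batch, rather than the undiscounted 1000, which matches the robustness-versus-runtime behavior reported above.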

Let me know if you have more questions.

@takojunior takojunior reopened this Apr 9, 2023
Contributor

@skadio skadio left a comment


LGTM. See some of my comments before the final merge.

@takojunior
Contributor Author

Hi All, I believe all comments have been resolved. A few recent commits make these updates:

  • Enable batch processing to be applied dynamically. For regular users, when batch_size is not set but the dataset has >500k sequences, we automatically set batch_size=10000 to ease the task and improve the user experience.

  • Allow an integer min_frequency threshold in batch processing. The value is converted to a ratio for the algorithm to run.

  • Add an internal random state to Seq2Pat, which controls the shuffling process.

  • Clarify the notes for the set_value function and the batch_size attribute in Seq2Pat, and attach the experimental analysis to the comments.

  • Update docs.
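The first two updates above can be sketched as follows. The 500k cutoff and the 10000 default come from this comment; the function names and exact logic are illustrative, not the actual Seq2Pat internals.

```python
def resolve_batch_size(n_sequences, batch_size=None):
    """Apply batch processing dynamically.

    If the user did not set batch_size and the dataset is large
    (more than 500k sequences), fall back to a default batch size.
    """
    if batch_size is None and n_sequences > 500_000:
        return 10_000
    return batch_size


def min_frequency_as_ratio(min_frequency, n_sequences):
    """Convert an integer min_frequency threshold into a ratio of the data
    size, so the same relative threshold can be applied within each batch."""
    if isinstance(min_frequency, int):
        return min_frequency / n_sequences
    return min_frequency
```

For example, a user who passes min_frequency=50 on 1000 sequences would, under this sketch, get the ratio 0.05 applied per batch; a user with 600k sequences and no batch_size would be batched automatically.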

I plan to merge and cut the new release either today or tomorrow.

Thank you all again for the constructive comments!

@takojunior takojunior merged commit e9e002f into master Apr 13, 2023
@takojunior takojunior deleted the batch_processing branch April 13, 2023 14:45
4 participants