
Batch processing #42

Merged
takojunior merged 40 commits into master from batch_processing
Apr 13, 2023

Conversation

@takojunior
Contributor

This PR enables pattern mining on batches of sequences, instead of only being able to run on the entire set. The changes in this PR include:

  • Enable mining on batches of sequences, so that mining tasks can run in parallel
  • Enable the backend mining task to run in a child process, which improves memory efficiency
  • Enable aggregation of the patterns mined from batches into the final results
  • Require that input integer sequences contain only positive integers
  • Fixes to the docs and readme files
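The split/mine-in-child-process/aggregate flow described above can be sketched as follows. This is a minimal illustration, not the actual Seq2Pat implementation: `mine_batch` here is a stand-in that only counts item frequencies, whereas the real backend mines constrained sequential patterns.

```python
from multiprocessing import Pool


def make_batches(sequences, batch_size):
    """Split the input sequences into consecutive batches of at most batch_size."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]


def mine_batch(batch):
    """Stand-in miner: count in how many sequences each item occurs.

    The real Seq2Pat miner finds constrained sequential patterns; frequency
    counting is used here only to illustrate the batch/aggregate flow.
    """
    counts = {}
    for seq in batch:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    return counts


def aggregate(batch_results):
    """Sum the per-batch frequencies into the final result."""
    total = {}
    for counts in batch_results:
        for item, count in counts.items():
            total[item] = total.get(item, 0) + count
    return total


if __name__ == "__main__":
    sequences = [[1, 2, 3], [1, 2], [2, 3], [1, 3]]
    batches = make_batches(sequences, batch_size=2)
    # Each batch is mined in a child process, so batches run in parallel and
    # the memory used by a worker is released when the child exits.
    with Pool(processes=2) as pool:
        results = pool.map(mine_batch, batches)
    print(aggregate(results))
```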

Signed-off-by: Xin Wang <xin.wang@fmr.com>
Contributor

@bkleyn bkleyn left a comment


Looks good to me at a high level. Added a few clarifying comments/questions.

Have you done any analysis to evaluate speed-up and/or improved scalability?

@skadio
Contributor

skadio commented Dec 19, 2022

I am done with my first pass.

To be direct, I am not comfortable with a PR that comes this late in the year, just weeks before the AAAI Tutorial. This PR changes BOTH the API and the behavior, and with a tutorial ahead of us, for this PR to fly it should have been crystal clear, which is unfortunately not the case.

High-level comments:

  • We should have had a quick API Design Review beforehand (note to ourselves)
  • What's the functionality of this feature? I might have missed it, but what happens when one runs this? What are the runtime benefits of batching? What are the experimental gains/observations? Is there any analysis on batch size/runtime/dataset? There is a quick comment in the pydocs about setting batch_size; what analysis is that based on? More importantly, apart from runtime, what happens to the patterns? Do we get roughly the same set of patterns? Are patterns completely different when batched? How do the patterns change as a function of batch_size? What should the user expect in terms of pattern changes?

@takojunior
Contributor Author

> I am done with my first pass.
>
> To be direct, I am not comfortable with a PR that comes this late in the year, just weeks before the AAAI Tutorial. This PR changes BOTH the API and the behavior, and with a tutorial ahead of us, for this PR to fly it should have been crystal clear, which is unfortunately not the case.
>
> High-level comments:
>
> • We should have had a quick API Design Review beforehand (note to ourselves)
> • What's the functionality of this feature? I might have missed it, but what happens when one runs this? What are the runtime benefits of batching? What are the experimental gains/observations? Is there any analysis on batch size/runtime/dataset? There is a quick comment in the pydocs about setting batch_size; what analysis is that based on? More importantly, apart from runtime, what happens to the patterns? Do we get roughly the same set of patterns? Are patterns completely different when batched? How do the patterns change as a function of batch_size? What should the user expect in terms of pattern changes?

I agree that this PR includes some major updates to the API and to the mining behavior when batching is used. I will send another note to the reviewers and facilitate follow-up discussions to resolve the comments. Given that the AAAI tutorial is only a few weeks away, the PR would not be ready to be pushed through.

I think it might be better to withdraw the PR and take the comments as follow-up actions. I will resubmit the PR once I have addressed the reviewers' questions and comments more clearly.

@takojunior takojunior closed this Dec 21, 2022
@takojunior
Contributor Author

Hi All,

I would like to reopen this PR for consideration of adding a batch processing capability. Since the last review, I have made a few updates to the PR:

  • Renamed the parameter min_frequency_df to discount_factor
  • Revised comments in the unit tests for clarity, mostly in test_seq2pat_batch.py, for the tests that check whether the results from the original C++ implementation and seq2pat are the same
  • Moved the shuffle_data method to be internal to Seq2Pat, since it is only used in batching mode and users do not necessarily need to see it
  • Moved set_attribute to be internal to Attribute, since it is also only used in batching and does not need to be exposed
  • Added validation of the parameters used in batching, e.g. batch_size, n_jobs, discount_factor
  • Added an example notebook batch_processing.ipynb to show the usage of batch mode on sample_data
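For reference, validation along the lines added here might look like the sketch below. The parameter names batch_size, n_jobs, and discount_factor come from this PR; the exact checks, conventions, and error messages are assumptions, not the actual Seq2Pat implementation.

```python
def validate_batch_params(batch_size, n_jobs, discount_factor):
    """Validate batching parameters. Sketch only; the real checks may differ."""
    # batch_size: either unset (None) or a positive integer.
    if batch_size is not None and (not isinstance(batch_size, int) or batch_size <= 0):
        raise ValueError("batch_size must be a positive integer or None")
    # n_jobs: assumes the common convention where n_jobs=-1 means "all cores".
    if not isinstance(n_jobs, int) or n_jobs == 0:
        raise ValueError("n_jobs must be a nonzero integer, e.g. 2 or -1")
    # discount_factor: a ratio in (0, 1].
    if not isinstance(discount_factor, float) or not 0.0 < discount_factor <= 1.0:
        raise ValueError("discount_factor must be a float in (0, 1]")


# A valid combination passes silently; invalid values raise ValueError.
validate_batch_params(batch_size=10000, n_jobs=2, discount_factor=0.2)
```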

For the analysis of batch_size/discount_factor/runtime, we addressed the questions from the last PR with more experiments. Here is a summary of the analysis:

  • We tested on a data set with approximately 100k sequences. The results show that as batch_size increases, e.g. from 10000 to 100000, the runtime increases, while the mined patterns are all the same as when mining the entire set on a single thread. batch_size=10000 gives the largest runtime benefit compared to running on the entire set.
  • On the same 100k sequences, we set batch_size=10000 and varied discount_factor from 0.1 to 1.0. We observe that the runtime decreases as discount_factor increases. Only when discount_factor=1.0 does batching mode miss some patterns compared to running on the entire set. We recommend discount_factor=0.2 by default for robustness of the results, at the expense of runtime.
  • In an even larger test on ~1M sequences, we set batch_size=10000, discount_factor=0.8, n_jobs=8. Batch mode saves 60% of the runtime compared to running on the entire set, while the resulting patterns from the two processes are the same.
  • When the data size is small, e.g. a few thousand sequences, there is no benefit to running in batch mode. We therefore recommend using batch mode only when the data has at least hundreds of thousands of sequences.
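One way to read the discount_factor trade-off observed above: the global frequency threshold is discounted when applied within each batch, so borderline patterns are not pruned before aggregation. The helper below is a hypothetical sketch of that idea, not the actual Seq2Pat code.

```python
def per_batch_threshold(min_frequency, batch_size, discount_factor):
    """Scale a global min_frequency ratio down to a per-batch count.

    Hypothetical sketch: a smaller discount_factor yields a lower per-batch
    threshold, so more candidate patterns survive to the aggregation step
    (more robust results, but more runtime), while discount_factor=1.0
    prunes each batch at the full threshold and may miss global patterns.
    """
    return max(1, int(discount_factor * min_frequency * batch_size))
```

Under this reading, discount_factor=0.2 with min_frequency=0.1 and batch_size=10000 keeps any pattern seen in at least 200 sequences of a batch, rather than the undiscounted 1000, which matches the robustness-versus-runtime behavior reported above.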

Let me know if you have more questions.

@takojunior takojunior reopened this Apr 9, 2023
Contributor

@skadio skadio left a comment


LGTM. See some of my comments before the final merge.

@takojunior
Contributor Author

Hi All, I believe all comments have been resolved. A few recent commits make these updates:

  • Enable batch processing to be applied dynamically. For regular users, when batch_size is not set but the dataset has >500k sequences, we automatically set batch_size=10000 to ease the task and improve the user experience.

  • Allow an integer min_frequency threshold in batch processing. The value is converted to a ratio for the algorithm to run.

  • Add an internal random state to Seq2Pat, which controls the shuffling process.

  • Clarify the notes for the set_value function and the batch_size attribute in Seq2Pat, and attach the experimental analysis to the comments.

  • Update docs.
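The first two updates above can be sketched as follows. The 500k cutoff and the 10000 default come from this comment; the function names and exact logic are illustrative, not the actual Seq2Pat internals.

```python
def resolve_batch_size(n_sequences, batch_size=None):
    """Apply batch processing dynamically.

    If the user did not set batch_size and the dataset is large
    (more than 500k sequences), fall back to a default batch size.
    """
    if batch_size is None and n_sequences > 500_000:
        return 10_000
    return batch_size


def min_frequency_as_ratio(min_frequency, n_sequences):
    """Convert an integer min_frequency threshold into a ratio of the data
    size, so the same relative threshold can be applied within each batch."""
    if isinstance(min_frequency, int):
        return min_frequency / n_sequences
    return min_frequency
```

For example, a user who passes min_frequency=50 on 1000 sequences would, under this sketch, get the ratio 0.05 applied per batch; a user with 600k sequences and no batch_size would be batched automatically.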

I plan to merge and cut the new release either today or tomorrow.

Thank you all again for the constructive comments!

@takojunior takojunior merged commit e9e002f into master Apr 13, 2023
@takojunior takojunior deleted the batch_processing branch April 13, 2023 14:45
4 participants