Skip to content

WIP: Default span constraint on index#25

Merged
takojunior merged 14 commits intomasterfrom
default-span-constraint-on-index
Jul 5, 2022
Merged

WIP: Default span constraint on index#25
takojunior merged 14 commits intomasterfrom
default-span-constraint-on-index

Conversation

@takojunior
Copy link
Copy Markdown
Contributor

The PR is for adding a default index attribute and a default maximum span constraint on this attribute. This is going to avoid regular users to run into scaling issues when they have long sequences in data and usually apply no constraints at the beginning.

The PR has the following changes:

  • Given a new parameter max_span_index to initialize Seq2Pat, add the default constraint. max_span_index=10 by default.
  • A fix to backend for checking upper and lower bounds in Gap constraints, where we may want to use original gap value ( a_i - a_{i-1} ) instead of the abs value | a_i - a_{i-1} |. With this change, I can get expected results when gap has negative upper/lower bounds in the tests, where it works together with default span cs. For this change, I might want to ask aminhn to confirm that it is a safe change.
  • Update unit tests to align with the default constraint. All tests get passed.

I like to have the above changes reviewed first, then notebooks and docs can be updated with the default constraint.

Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
if ((*lgap)[att] == 0)
continue;
if (abs((*attrs)[(*lgapi)[att]].at(i).at(endp - 1) - (*attrs)[(*lgapi)[att]].at(i).at(strp - 1)) < (*lgap)[att])
// Allow lower bound of gap constraint to be 0 if not work with abs values
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix that Amin did in the original/earlier repository, correct?

My suggestions is; go that repo and look at the latest commits. It should be easy to find the commit for this fix at the backend. Then, you can compare & contrast the solutions.

I vaguely remember looking into it at the time (~2 years ago) and it was a simple/one-liner change like this.

It should be easy to verify.

Copy link
Copy Markdown
Contributor Author

@takojunior takojunior Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see his original repo uses abs value at this line though (see the line 81), which is the latest commit.

In an earlier commit, there is a fix at another line (line 46), which was to apply abs instead of removing it. So I tried to do the same fix as Amin did, which didn't give me expected results as I have negative upper/lower bounds and run it on paper though.

I am thinking that maybe I can leave an inline comment (shown as an issue) to the original repo and Amin can provide his thoughts there?

Copy link
Copy Markdown
Contributor

@skadio skadio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This is a small change but will make the behavior much better for new comers / first experience with the library. Thank you!


def __init__(self, sequences: List[list]):
# Validate input sequences
def __init__(self, sequences: List[list], max_span_index: Union[int, None] = 10):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the naming.. The "index" is about the Attribute but this parameter is about the "length" of the span. It is not index, really.

So I would suggest max_span instead.

On the contrary, index_attr = Attribute() in the code below makes total sense --because that thing is about the index

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_span makes sense. I will change the naming.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the naming.. The "index" is about the Attribute but this parameter is about the "length" of the span. It is not index, really.

So I would suggest max_span instead.

On the contrary, index_attr = Attribute() in the code below makes total sense --because that thing is about the index

Another thought is to use max_span instead of rolling_window_size in Pat2Feat? Since these two parameters are related, although they don't have to be the same. They can be configured separately but they are both restricting the length.

Copy link
Copy Markdown
Contributor

@skadio skadio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the usage example notebook, you might want to add a comment about this new parameter.

Btw, shall we consider changing that notebook name to "sequential_pattern_mining" to complement the other notebook "dichomotic_pattern_mining"?

And once this change is merged notebooks and pypi updated etc. there are a few threads we can circle back:

  • The current github issue might be solved in the next version
  • XAI notebook you shared can be updated, Shovon et al should install new version
  • Danielle should install new version

@takojunior
Copy link
Copy Markdown
Contributor Author

In the usage example notebook, you might want to add a comment about this new parameter.

Btw, shall we consider changing that notebook name to "sequential_pattern_mining" to complement the other notebook "dichomotic_pattern_mining"?

And once this change is merged notebooks and pypi updated etc. there are a few threads we can circle back:

  • The current github issue might be solved in the next version
  • XAI notebook you shared can be updated, Shovon et al should install new version
  • Danielle should install new version

With this update, I will circle back with these threads.

For the dichotomic pattern mining note, it can be simplified now by dropping the span constraint and having the default constraint being in use.

Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
@takojunior
Copy link
Copy Markdown
Contributor Author

The latest commit has the following updates to this PR:

  • Refactored the parameter for built-in span constraint to be max_span in Seq2Pat
  • Refactored the rolling_window_size parameter to be max_span for consistency in Pat2Feat
  • Updated usage_example notebook to sequential_pattern_mining notebook. Added a comment to introduce max_span and the built-in constraint
  • Updated dichotomic pattern mining notebook by dropping the manually created attribute and then utilized the built-in constraint.
  • Updated docs in the API pages.

@skadio @dorukkilitcioglu Let me know if you have further comments. Thanks.

Signed-off-by: Xin Wang <xin.wang@fmr.com>
"""

def __init__(self, rolling_window_size: Union[int, None] = 10):
def __init__(self, max_span: Union[int, None] = 10):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good point on making this consistent!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this makes perfect sense!

Copy link
Copy Markdown
Contributor

@skadio skadio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM --thank you @takojunior! This will very much simplify and speed-up everything else (notebooks, examples, etc.)

Copy link
Copy Markdown
Contributor

@dorukkilitcioglu dorukkilitcioglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good! Left a couple of comments, but good to go as it is.

"""

def __init__(self, rolling_window_size: Union[int, None] = 10):
def __init__(self, max_span: Union[int, None] = 10):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this makes perfect sense!

sequences : List[list]
A list of sequences each with a list of events.
The event values can be all strings or all integers.
max_span: Union[int, None]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why Union[int, None] instead of Optional[int]? It accomplishes the same thing either way, but I'm wondering if you had a preference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong preferences. Technically they are doing the same thing, but Optional[...] seems better, since it documents the reason why None is introduced, and it is easier to move Union[...] to be a separate type alias when there are multiple types to specify. I refer to a discussion here. So I think I will update it to Optional.

"## Seq2Pat for Positive Labels\n",
"- There are two attributes: `event_time` and `event_order`\n",
"- Constraint 1: to enforce the average event time greater than 20 sec\n",
"- Constraint 2: to enforce the span of event order less than 9. This is to restrict the length of sequence to be <= 10."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So before you'd have to have additional data input for this, now it's just internally handled? This is very nice!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the span constraint is now a built-in constraint and optional to use.


for (int att = 0; att < (*lgap).size(); att++){
if ((*lgap)[att] == 0)
// It enables the case when lower bound of gap constraint is 0. If lower bound is not a number, continue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: the discussion with Serdar on the original repo, if there are changes here, can we point out the differences?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add a note to point out the differences.

Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
@takojunior takojunior merged commit 0a0fc60 into master Jul 5, 2022
@takojunior takojunior deleted the default-span-constraint-on-index branch July 5, 2022 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants