WIP: Default span constraint on index #25
Signed-off-by: Xin Wang <xin.wang@fmr.com>
sequential/backend/build_mdd.cpp
Outdated
| if ((*lgap)[att] == 0) | ||
| continue; | ||
| if (abs((*attrs)[(*lgapi)[att]].at(i).at(endp - 1) - (*attrs)[(*lgapi)[att]].at(i).at(strp - 1)) < (*lgap)[att]) | ||
| // Allow lower bound of gap constraint to be 0 if not work with abs values |
This is the fix that Amin did in the original/earlier repository, correct?
My suggestion is to go to that repo and look at the latest commits. It should be easy to find the commit for this backend fix. Then you can compare and contrast the solutions.
I vaguely remember looking into it at the time (~2 years ago) and it was a simple/one-liner change like this.
It should be easy to verify.
His original repo still uses the abs value at this line (see line 81) as of the latest commit.
In an earlier commit, there is a fix at another line (line 46), which applied abs rather than removing it. I tried the same fix as Amin's, but it did not give the expected results: I have negative upper/lower bounds, and I worked through the case on paper.
Maybe I can leave an inline comment (shown as an issue) on the original repo so that Amin can share his thoughts there?
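To make the abs-versus-signed question concrete, here is a minimal Python sketch of a lower-bound gap check. It is not the backend code in build_mdd.cpp; the function name `gap_below_lower_bound` and its parameters are hypothetical and only illustrate why the two variants disagree when differences or bounds can be negative.

```python
from typing import Sequence


def gap_below_lower_bound(values: Sequence[float], start: int, end: int,
                          lower: float, use_abs: bool) -> bool:
    """Return True when the gap between two events falls below `lower`.

    Hypothetical helper: `values` holds one attribute of one sequence.
    """
    gap = values[end] - values[start]
    if use_abs:
        gap = abs(gap)
    return gap < lower


attr = [5, 3, 8]
# The gap from position 0 to position 1 is -2 when signed and 2 in absolute value.
signed_violation = gap_below_lower_bound(attr, 0, 1, lower=0, use_abs=False)
abs_violation = gap_below_lower_bound(attr, 0, 1, lower=0, use_abs=True)
```

With a lower bound of 0, the signed check flags the pair while the absolute check never can, since an absolute value is never below 0. That is one way the two fixes could give different results on the same data.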
skadio left a comment
LGTM! This is a small change, but it will make the behavior much better for newcomers and their first experience with the library. Thank you!
sequential/seq2pat.py
Outdated
| def __init__(self, sequences: List[list]): | ||
| # Validate input sequences | ||
| def __init__(self, sequences: List[list], max_span_index: Union[int, None] = 10): |
Not sure about the naming. The "index" is about the Attribute, but this parameter is about the "length" of the span. It is not really an index.
So I would suggest max_span instead.
By contrast, index_attr = Attribute() in the code below makes total sense, because that one is about the index.
max_span makes sense. I will change the naming.
Another thought: should we also use max_span instead of rolling_window_size in Pat2Feat? The two parameters are related, although they don't have to take the same value. They can be configured separately, but both restrict the length.
skadio left a comment
In the usage example notebook, you might want to add a comment about this new parameter.
Btw, shall we consider changing that notebook name to "sequential_pattern_mining" to complement the other notebook "dichotomic_pattern_mining"?
Once this change is merged and the notebooks and PyPI are updated, there are a few threads we can circle back to:
- The current GitHub issue might be solved in the next version
- The XAI notebook you shared can be updated; Shovon et al. should install the new version
- Danielle should install the new version
With this update, I will circle back to these threads. The dichotomic pattern mining notebook can now be simplified by dropping the span constraint and relying on the default constraint.
Signed-off-by: Xin Wang <xin.wang@fmr.com>
Signed-off-by: Xin Wang <xin.wang@fmr.com>
The latest commit has the following updates to this PR:
@skadio @dorukkilitcioglu Let me know if you have further comments. Thanks.
Signed-off-by: Xin Wang <xin.wang@fmr.com>
sequential/pat2feat.py
Outdated
| """ | ||
| def __init__(self, rolling_window_size: Union[int, None] = 10): | ||
| def __init__(self, max_span: Union[int, None] = 10): |
Ah, good point on making this consistent!
Agreed, this makes perfect sense!
skadio left a comment
LGTM, thank you @takojunior! This will very much simplify and speed up everything else (notebooks, examples, etc.)
dorukkilitcioglu left a comment
These changes look good! Left a couple of comments, but good to go as it is.
sequential/seq2pat.py
Outdated
| sequences : List[list] | ||
| A list of sequences each with a list of events. | ||
| The event values can be all strings or all integers. | ||
| max_span: Union[int, None] |
Any reason why Union[int, None] instead of Optional[int]? It accomplishes the same thing either way, but I'm wondering if you had a preference.
No strong preferences. Technically they are doing the same thing, but Optional[...] seems better, since it documents the reason why None is introduced, and it is easier to move Union[...] to be a separate type alias when there are multiple types to specify. I refer to a discussion here. So I think I will update it to Optional.
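A short Python check of the point above: `Optional[int]` and `Union[int, None]` from the standard `typing` module describe the same type. The function `describe_span` is a hypothetical stand-in for a constructor argument, not part of the library.

```python
from typing import Optional, Union

# Optional[X] is shorthand for Union[X, None]; the two annotations compare equal.
same_type = Optional[int] == Union[int, None]


def describe_span(max_span: Optional[int] = 10) -> str:
    """Hypothetical helper showing how a None default-off value reads with Optional."""
    if max_span is None:
        return "no span limit"
    return f"span limited to {max_span}"
```

Calling `describe_span(None)` switches the limit off, which is the intent that `Optional` documents in the signature.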
| "## Seq2Pat for Positive Labels\n", | ||
| "- There are two attributes: `event_time` and `event_order`\n", | ||
| "- Constraint 1: to enforce the average event time greater than 20 sec\n", | ||
| "- Constraint 2: to enforce the span of event order less than 9. This is to restrict the length of sequence to be <= 10." |
So before you'd have to have additional data input for this, now it's just internally handled? This is very nice!
Yes, the span constraint is now a built-in constraint and optional to use.
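As a rough sketch of what "built-in" could mean here, the snippet below derives an index attribute from the sequences themselves and checks a span limit against it, so no extra data input is needed. The helper names `build_index_attribute` and `within_max_span` are hypothetical, and the rule that an occurrence may cover at most `max_span` consecutive positions is an assumption based on the notebook text above, not the library's confirmed implementation.

```python
from typing import List, Optional


def build_index_attribute(sequences: List[list]) -> List[List[int]]:
    """Assign each event its position in its sequence (0, 1, 2, ...)."""
    return [list(range(len(seq))) for seq in sequences]


def within_max_span(positions: List[int], max_span: Optional[int]) -> bool:
    """Check whether an occurrence, given as sorted positions, fits the span limit.

    Assumption: the occurrence may cover at most `max_span` consecutive
    positions, i.e. last - first <= max_span - 1. None disables the check.
    """
    if max_span is None:
        return True
    return positions[-1] - positions[0] <= max_span - 1


index_attr = build_index_attribute([["A", "B", "C"], ["A", "C"]])
```

For example, `within_max_span([0, 3, 9], 10)` holds because the occurrence covers positions 0 through 9, while `within_max_span([0, 10], 10)` does not.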
sequential/backend/build_mdd.cpp
Outdated
| for (int att = 0; att < (*lgap).size(); att++){ | ||
| if ((*lgap)[att] == 0) | ||
| // It enables the case when lower bound of gap constraint is 0. If lower bound is not a number, continue |
Re: the discussion with Serdar on the original repo, if there are changes here, can we point out the differences?
Will add a note to point out the differences.
This PR adds a default index attribute and a default maximum span constraint on this attribute. This helps regular users avoid scaling issues when their data contains long sequences and they apply no constraints at the start, which is the usual case.
The PR has the following changes:
- max_span_index to initialize Seq2Pat, add the default constraint.
- max_span_index=10 by default.

I would like to have the above changes reviewed first; then the notebooks and docs can be updated with the default constraint.
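The scaling concern in the description can be illustrated with a simple counting argument. The sketch below counts how many position subsets of a single sequence could host a pattern occurrence, with and without a span limit. The function `count_occurrences` is hypothetical and says nothing about how the MDD backend actually enumerates patterns; it only shows how quickly the unconstrained search space grows with sequence length.

```python
from typing import Optional


def count_occurrences(n: int, max_span: Optional[int] = None) -> int:
    """Count non-empty position subsets of a length-n sequence whose first
    and last positions are at most max_span - 1 apart (no limit if None)."""
    total = 0
    for start in range(n):
        remaining = n - 1 - start  # positions available after `start`
        reach = remaining if max_span is None else min(max_span - 1, remaining)
        # Any subset of the reachable later positions can follow `start`.
        total += 2 ** reach
    return total


unconstrained = count_occurrences(30)             # equals 2**30 - 1
constrained = count_occurrences(30, max_span=10)  # several orders of magnitude fewer
```

Under this simplified model, the unconstrained count doubles with every extra event, while the constrained count grows only linearly once the sequence is longer than the span.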