-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Seaborn Plots #62
Seaborn Plots #62
Conversation
Codecov Report
@@ Coverage Diff @@
## master #62 +/- ##
=====================================
Coverage 100% 100%
=====================================
Files 16 17 +1
Lines 532 629 +97
=====================================
+ Hits 532 629 +97
|
does count by time make sense for continuous? I feel like it only makes sense if you were to bucket it in discrete labels or if you take some sort of rolling average by time |
I agree. When
Also, do we want to plot: continuous labels vs. cutoff times? |
maybe just use overall total number of label by time. don't try to break it up at all. |
I see. So, we'd just count event per cutoff time without grouping by label. |
Updated plot for count by time here. May need a hard refresh in browser to remove cached files. |
composeml/label_times.py
Outdated
labels = labels.groupby(self.name) | ||
distribution = labels['count'].count() | ||
return distribution | ||
def _is_categorical(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a user should be able to provide a label_type
parameter to override this. Label types could be "categorical" or "continuous". we can also pass this parameter through when using the label maker. for example, bin would know to pass categorical through.
In terms of infer categorical, I'd also check for any non numeric dtyes like objects, strs, or the pandas categorical dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
additionally, we only want to be calculating this once when the labeltime are initialized, not every time _is_categorical
is accessed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about overriding it by using type hints?
>>> def labeling_function(df) -> 'categorical':
...
>>> labeling_function.__annotations__
{'return': 'categorical'}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, casting label times to categorical dtype if categorical. Related to #60.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that the type hints for annotations is very intuitive. might just be better to make it a parameter to the label maker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, will make parameter for label_type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new plotting stuff looks good. just some comments on how we do the label type inference
@@ -1,9 +1,10 @@ | |||
import pandas as pd | |||
|
|||
from composeml.label_plots import LabelPlots | |||
|
|||
|
|||
class LabelTimes(pd.DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the label times class should store the type of label it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added label_type
attribute to label times
value = self.groupby('cutoff_time') | ||
value = value[self.name].count() | ||
value = value.cumsum() | ||
return value | ||
|
||
def describe(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the describe method should say the label type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should label type be under settings? I guess it is a parameter of search.
composeml/label_maker.py
Outdated
@@ -327,14 +329,23 @@ def search(self, | |||
|
|||
labels = LabelTimes(data=labels, name=name, target_entity=self.target_entity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should pass the label type to the LabelTime
so it can track it as well. perhaps even move the logic to handle and check is into the init.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved check to label times init
composeml/label_maker.py
Outdated
|
||
else: | ||
labels = labels.infer_type() | ||
if labels.label_type == 'discrete': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do this casting before we init the label_times? like above line 330
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Labels are records (list of dictionaries) above line 330. Should I pass records to a pandas data frame to make categorical before initializing label times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. it's fine to leave it here then
composeml/label_times.py
Outdated
error = 'label type must be "continuous" or "discrete"' | ||
assert label_type in ['continuous', 'discrete'], error | ||
|
||
if label_type is None and name in self.columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would name
not be in self.columns
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Came across behavior where pandas would initialize label times without passing value for name
. I refactored logic to infer inside is_discrete
only if label type is none. link
composeml/label_times.py
Outdated
if is_discrete: | ||
return True | ||
|
||
labels = self[self.name].iloc[:100] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now, let's only look at the dtype to infer type. we can always add this functionality in later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated function
composeml/label_times.py
Outdated
@@ -320,16 +355,9 @@ def infer_type(self): | |||
"""Infer label type. | |||
|
|||
Returns: | |||
LabelTimes : Label Times as inferred type. | |||
str : Inferred label type. Can be "continuous" or "discrete". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think intuitively I'd expect the logic in is_discrete
to be here in infer_type
and then I'd expect is_discrete
to just check if label_type == "discrete
.
then every we currently check if self.label_type == 'discrete'
we'd just replace with is_discrete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR uses Seaborn for plotting Label Times. Theses are plot examples for categorical and continuous Label Times. Closes #60. Closes #16 . Closes #56 .