Feature alignment #63
Conversation
```python
        return len(self.metadata)
    elif self.feature_type == TabularType.NUMERIC:
        return 1
    else:
```
Rather than returning zero here, does it make sense to just raise an exception, as getting the dimension for other feature types isn't "supported"? Maybe there are downstream implications though, so correct me if I'm wrong.
It's true we are assuming that the target feature cannot be of type string, but since this method belongs to the TabularFeature class, I think it's odd for it to raise an error only for certain instances of the class. Originally I thought to return the length of the vocabulary for string features here, but because string pipelines produce sparse matrices, such a length would not reflect the actual "dimension" of the output. So I just put zero here instead... I don't think there are any downstream implications, though, so let me know what you think is best.
I would say it's okay for us to raise an exception for only some feature_types here if using get_metadata_dimension on them is "improper" and should be avoided. The exception would signal that something unexpected has happened: we're calling get_metadata_dimension on a feature type that doesn't really have one. However, I may be missing something.
Sounds good!
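To illustrate the agreed approach, here is a minimal sketch of raising for unsupported types instead of returning zero. The enum values, class shape, and the choice of ValueError are assumptions reconstructed from the snippet above, not the repository's actual code.

```python
from enum import Enum
from typing import List, Optional


class TabularType(Enum):
    BINARY = "binary"
    ORDINAL = "ordinal"
    NUMERIC = "numeric"
    STRING = "string"


class TabularFeature:
    def __init__(self, feature_type: TabularType, metadata: Optional[List[str]] = None) -> None:
        self.feature_type = feature_type
        self.metadata = metadata or []

    def get_metadata_dimension(self) -> int:
        if self.feature_type in (TabularType.BINARY, TabularType.ORDINAL):
            return len(self.metadata)
        if self.feature_type == TabularType.NUMERIC:
            return 1
        # String pipelines produce sparse matrices, so a dense "dimension"
        # is undefined for them; fail loudly instead of returning 0.
        raise ValueError(f"get_metadata_dimension is not supported for {self.feature_type}")
```

Raising here makes any accidental call on a string feature surface immediately rather than silently propagating a zero dimension downstream.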
emersodb
left a comment
I think this is looking really good, and you've improved the code readability a lot! Most of the comments I left are pretty minor. One thing we don't have at all is tests; I think it would be good to at least mock up some very small misaligned dataframes to make sure all of the transforms etc. work as expected.
fatemetkl
left a comment
The structure of the code and its readability have improved considerably!
I just had two minor comments; testing would also be a good addition. Other than that, the code looks great to me.
```python
# Construct TabularFeature objects.
for feature_name in features_to_types:
    tabular_feature = TabularFeaturesInfoEncoder._construct_tab_feature(
        df, feature_name, features_to_types, fill_values
```
Super minor: is there a reason for not sending features_to_types[feature_name] to _construct_tab_feature instead of features_to_types, which stores the types of all the features? The current approach seems less memory efficient to me.
Good point!
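A stubbed sketch of the suggested change, passing only the single feature's type into the constructor. The stub body and the example data are hypothetical; only the names mirror the snippet above.

```python
from typing import Any, Dict, Optional, Tuple


class TabularFeaturesInfoEncoder:
    @staticmethod
    def _construct_tab_feature(
        df: Any, feature_name: str, feature_type: str, fill_values: Dict[str, Any]
    ) -> Tuple[str, str, Optional[Any]]:
        # Stub: the real method would build a TabularFeature; here we
        # just echo what was passed in to show the narrower signature.
        return (feature_name, feature_type, fill_values.get(feature_name))


features_to_types = {"age": "numeric", "admit_type": "ordinal"}
fill_values = {"age": 0}

# Pass features_to_types[feature_name] rather than the whole mapping.
features = [
    TabularFeaturesInfoEncoder._construct_tab_feature(
        None, name, features_to_types[name], fill_values
    )
    for name in features_to_types
]
print(features)  # [('age', 'numeric', 0), ('admit_type', 'ordinal', None)]
```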
```python
else:
    fill_value = fill_values[feature_name]

if feature_type == TabularType.ORDINAL or feature_type == TabularType.BINARY:
```
Are we extracting categories for binary type as well?
Yes. This is inherited from the previous iteration of the code. I thought it might be useful if the user wishes to impute missing values with a special unknown value, in which case the binary type would essentially become a categorical type (since there would be more than two categories). Do you think it makes sense to keep this?
Right, that is something useful to have!
I've now added some tests covering the basic functionality on some small dataframes.
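A tiny pure-Python sketch of the scenario described above: imputing missing values in a binary column with a sentinel value produces a third category, so the feature is effectively categorical afterward. The column values and the "unknown" sentinel are illustrative assumptions.

```python
# A binary column with missing values: imputing with a sentinel
# "unknown" value yields three distinct categories, so the feature
# effectively becomes categorical rather than binary.
column = ["yes", "no", None, "yes", None]
filled = [value if value is not None else "unknown" for value in column]
categories = sorted(set(filled))
print(categories)  # ['no', 'unknown', 'yes']
```

This is why extracting categories for binary features is useful: the category set can grow past two once imputation runs.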
```python
"""
User defined method that returns a pandas dataframe.

Args:
```
Looks like this docstring has an empty Args section?
```python
# Dropping columns to create misalignment.
df2 = df2.drop(columns=["ExpiredHospital", "admit_type", "NumRx", "ethnicity"])
log(INFO, "Hospital2 missing columns: ExpiredHospital, admit_type, NumRx, ethnicity")
```
Very minor, but you could "automate" the log here (note the `f` prefix needed for the interpolation to work):

```python
columns_to_drop = ["ExpiredHospital", "admit_type", "NumRx", "ethnicity"]
df2 = df2.drop(columns=columns_to_drop)
log(INFO, f"Hospital2 missing columns: {', '.join(columns_to_drop)}")
```
```diff
  def get_metadata_dimension(self) -> int:
-     if self.feature_type == TabularType.BINARY or TabularType.ORDINAL:
+     if self.feature_type == TabularType.BINARY or self.feature_type == TabularType.ORDINAL:
```
Nice catch!
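For readers unfamiliar with why the original line was a bug: in `x == A or B`, the `B` is evaluated as a standalone expression, and a non-empty enum member is always truthy, so the condition holds for every feature type. A minimal sketch (the enum values are assumptions mirroring the diff above):

```python
from enum import Enum


class TabularType(Enum):
    BINARY = "binary"
    ORDINAL = "ordinal"
    NUMERIC = "numeric"


feature_type = TabularType.NUMERIC

# Buggy form: `or TabularType.ORDINAL` stands alone, and an enum member
# is always truthy, so this is True for every feature type.
buggy = feature_type == TabularType.BINARY or TabularType.ORDINAL
print(bool(buggy))  # True, even though feature_type is NUMERIC

# Fixed form: membership testing compares against both members.
fixed = feature_type in (TabularType.BINARY, TabularType.ORDINAL)
print(fixed)  # False
```

The `in (…)` membership form is an idiomatic alternative to chaining two explicit `==` comparisons.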
```python
self.initialize_parameters = initialize_parameters
self.format_info_gathered = False
self.dimension_info: Dict[str, int] = {}
# casting self.strategy to BasicFedAvg so its on_fit_config_fn can be specified.
```
I think we can remove this comment, since it's no longer a cast?
emersodb
left a comment
I left some very minor comments that are certainly not "deal-breaking." Changes look great to me. Tests are awesome. I'm good to go. I'd just make sure @fatemetkl is good with the way you addressed her comments and we can merge.
fatemetkl
left a comment
The changes look great to me, thanks for adding the test.
PR Type
Feature
Data preprocessing pipelines for tabular feature alignment.
ClickUp: https://app.clickup.com/t/860q2qc3c
There is also an example under the examples folder which demonstrates the usage of this pipeline.
The ClickUp task also includes two new subtasks that I think are natural next steps following this PR.
Tests Added
test_alignment_pipeline.py