Feature alignment #63
Conversation
```python
        return len(self.metadata)
    elif self.feature_type == TabularType.NUMERIC:
        return 1
    else:
```
Rather than returning zero here, does it make sense to just raise an exception, as getting the dimension for other feature types isn't "supported"? Maybe there are downstream implications though, so correct me if I'm wrong.
It's true we are assuming that the target feature cannot be of type string, but since this method belongs to the TabularFeature class, I think it's odd for it to raise an error only for certain instances of the class. Originally I thought to return the length of the vocabulary for string features here, but because string pipelines produce sparse matrices, such a length would not reflect the actual "dimension" of the output. So I just put zero here instead... I don't think there are any downstream implications, though, so let me know what you think is best.
I would say it's okay for us to raise an exception for only some feature_types here if using get_metadata_dimension on them is "improper" and should be avoided. The exception would signal that something unexpected has happened: we're calling get_metadata_dimension on a feature type that doesn't really have one. However, I may be missing something.
Sounds good!
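To illustrate the agreed approach, here is a minimal sketch of raising for unsupported types instead of returning zero. The enum values, class shape, and the choice of ValueError are assumptions reconstructed from the snippet above, not the repository's actual code.

```python
from enum import Enum
from typing import List, Optional


class TabularType(Enum):
    BINARY = "binary"
    ORDINAL = "ordinal"
    NUMERIC = "numeric"
    STRING = "string"


class TabularFeature:
    def __init__(self, feature_type: TabularType, metadata: Optional[List[str]] = None) -> None:
        self.feature_type = feature_type
        self.metadata = metadata or []

    def get_metadata_dimension(self) -> int:
        if self.feature_type in (TabularType.BINARY, TabularType.ORDINAL):
            return len(self.metadata)
        if self.feature_type == TabularType.NUMERIC:
            return 1
        # String pipelines produce sparse matrices, so a dense "dimension"
        # is undefined for them; fail loudly instead of returning 0.
        raise ValueError(f"get_metadata_dimension is not supported for {self.feature_type}")
```

Raising here makes any accidental call on a string feature surface immediately rather than silently propagating a zero dimension downstream.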
emersodb
left a comment
I think this is looking really good, and you've improved the code readability a lot! Most of the comments I left are pretty minor. One thing we don't have at all is tests; I think it would be good to at least mock up some very small misaligned dataframes to make sure all of the transforms etc. work as expected.
fatemetkl
left a comment
The structure of the code and its readability have improved considerably!
I just had two minor comments; testing would also be a good addition. Other than that, the code looks great to me.
```python
# Construct TabularFeature objects.
for feature_name in features_to_types:
    tabular_feature = TabularFeaturesInfoEncoder._construct_tab_feature(
        df, feature_name, features_to_types, fill_values
```
Super minor: is there a reason for not sending features_to_types[feature_name] to _construct_tab_feature instead of features_to_types, which stores the types of all the features? The current approach seems less memory efficient to me.
Good point!
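A stubbed sketch of the suggested change, passing only the single feature's type into the constructor. The stub body and the example data are hypothetical; only the names mirror the snippet above.

```python
from typing import Any, Dict, Optional, Tuple


class TabularFeaturesInfoEncoder:
    @staticmethod
    def _construct_tab_feature(
        df: Any, feature_name: str, feature_type: str, fill_values: Dict[str, Any]
    ) -> Tuple[str, str, Optional[Any]]:
        # Stub: the real method would build a TabularFeature; here we
        # just echo what was passed in to show the narrower signature.
        return (feature_name, feature_type, fill_values.get(feature_name))


features_to_types = {"age": "numeric", "admit_type": "ordinal"}
fill_values = {"age": 0}

# Pass features_to_types[feature_name] rather than the whole mapping.
features = [
    TabularFeaturesInfoEncoder._construct_tab_feature(
        None, name, features_to_types[name], fill_values
    )
    for name in features_to_types
]
print(features)  # [('age', 'numeric', 0), ('admit_type', 'ordinal', None)]
```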
```python
else:
    fill_value = fill_values[feature_name]

if feature_type == TabularType.ORDINAL or feature_type == TabularType.BINARY:
```
Are we extracting categories for binary type as well?
Yes. This is inherited from the previous iteration of the code. I thought it might be useful if the user wishes to impute missing values with a special unknown value, in which case the binary type would essentially become a categorical type (since there would be more than two categories). Do you think it makes sense to keep this?
Right, that is something useful to have!
I've now added some tests covering the basic functionality on some small dataframes.
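A tiny pure-Python sketch of the scenario described above: imputing missing values in a binary column with a sentinel value produces a third category, so the feature is effectively categorical afterward. The column values and the "unknown" sentinel are illustrative assumptions.

```python
# A binary column with missing values: imputing with a sentinel
# "unknown" value yields three distinct categories, so the feature
# effectively becomes categorical rather than binary.
column = ["yes", "no", None, "yes", None]
filled = [value if value is not None else "unknown" for value in column]
categories = sorted(set(filled))
print(categories)  # ['no', 'unknown', 'yes']
```

This is why extracting categories for binary features is useful: the category set can grow past two once imputation runs.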
```python
"""
User defined method that returns a pandas dataframe.

Args:
```
Looks like this docstring has an empty Args section?
```python
# Dropping columns to create misalignment.
df2 = df2.drop(columns=["ExpiredHospital", "admit_type", "NumRx", "ethnicity"])
log(INFO, "Hospital2 missing columns: ExpiredHospital, admit_type, NumRx, ethnicity")
```
Very minor, but you could "automate" the log here (note the `f` prefix needed for the interpolation to work):

```python
columns_to_drop = ["ExpiredHospital", "admit_type", "NumRx", "ethnicity"]
df2 = df2.drop(columns=columns_to_drop)
log(INFO, f"Hospital2 missing columns: {', '.join(columns_to_drop)}")
```
```diff
  def get_metadata_dimension(self) -> int:
-     if self.feature_type == TabularType.BINARY or TabularType.ORDINAL:
+     if self.feature_type == TabularType.BINARY or self.feature_type == TabularType.ORDINAL:
```
Nice catch!
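For readers unfamiliar with why the original line was a bug: in `x == A or B`, the `B` is evaluated as a standalone expression, and a non-empty enum member is always truthy, so the condition holds for every feature type. A minimal sketch (the enum values are assumptions mirroring the diff above):

```python
from enum import Enum


class TabularType(Enum):
    BINARY = "binary"
    ORDINAL = "ordinal"
    NUMERIC = "numeric"


feature_type = TabularType.NUMERIC

# Buggy form: `or TabularType.ORDINAL` stands alone, and an enum member
# is always truthy, so this is True for every feature type.
buggy = feature_type == TabularType.BINARY or TabularType.ORDINAL
print(bool(buggy))  # True, even though feature_type is NUMERIC

# Fixed form: membership testing compares against both members.
fixed = feature_type in (TabularType.BINARY, TabularType.ORDINAL)
print(fixed)  # False
```

The `in (…)` membership form is an idiomatic alternative to chaining two explicit `==` comparisons.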
```python
self.initialize_parameters = initialize_parameters
self.format_info_gathered = False
self.dimension_info: Dict[str, int] = {}
# casting self.strategy to BasicFedAvg so its on_fit_config_fn can be specified.
```
I think we can remove this comment, since it's no longer a cast?
emersodb
left a comment
I left some very minor comments that are certainly not "deal-breaking." Changes look great to me. Tests are awesome. I'm good to go. I'd just make sure @fatemetkl is good with the way you addressed her comments and we can merge.
fatemetkl
left a comment
The changes look great to me, thanks for adding the test.
PR Type
Feature
Data preprocessing pipelines for tabular feature alignment.
ClickUp: https://app.clickup.com/t/860q2qc3c
There is also an example under the examples folder which demonstrates the usage of this pipeline.
The ClickUp task also includes two new subtasks that I think are natural next steps following this PR.
Tests Added
test_alignment_pipeline.py