Refactor feature serialization to avoid storing duplicate primitive information #2144

Merged
thehomebrewnerd merged 12 commits into serialization-updates from refactor-feat-serialization
Jun 29, 2022

Conversation

thehomebrewnerd
Contributor

Separates storage of primitive information from feature information to avoid storing duplicate primitive information when serializing features.
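The idea can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the dict-based feature shape, the function names `serialize_features` and `deserialize_features`, and the use of a canonical JSON string to recognize equal primitives are all assumptions made for the example. Only the string-keyed primitives dictionary and the `primitive_id_to_key` mapping mirror names visible in the review thread.

```python
import json


def serialize_features(features):
    """Serialize features, storing each distinct primitive only once.

    `features` is a list of dicts with "name" and "primitive" entries.
    This shape is a stand-in for illustration, not the featuretools model.
    """
    primitives = {}           # string key -> primitive description
    primitive_id_to_key = {}  # primitive identity -> string key
    feature_entries = []

    for feature in features:
        primitive = feature["primitive"]
        # Assumption: two primitives are "the same" if their canonical
        # JSON forms match. The real serializer may decide this differently.
        primitive_id = json.dumps(primitive, sort_keys=True)
        if primitive_id not in primitive_id_to_key:
            key = str(len(primitives))  # cast to str once, reuse afterwards
            primitive_id_to_key[primitive_id] = key
            primitives[key] = primitive
        # The feature stores only a reference to the shared primitive entry.
        feature_entries.append(
            {"name": feature["name"], "primitive": primitive_id_to_key[primitive_id]}
        )

    return json.dumps({"features": feature_entries, "primitives": primitives})


def deserialize_features(payload):
    """Rebuild the feature list by resolving each primitive reference."""
    data = json.loads(payload)
    primitives = data["primitives"]
    return [
        {"name": entry["name"], "primitive": primitives[entry["primitive"]]}
        for entry in data["features"]
    ]
```

With this layout, ten features built on the same primitive add one primitive entry and ten short references, rather than ten copies of the primitive description.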

@thehomebrewnerd thehomebrewnerd marked this pull request as draft June 24, 2022 19:19
@thehomebrewnerd thehomebrewnerd marked this pull request as ready for review June 28, 2022 20:52

codecov bot commented Jun 28, 2022

Codecov Report

Merging #2144 (4f8dd84) into serialization-updates (fb2292d) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 4f8dd84 differs from pull request most recent head 9e6830c. Consider uploading reports for the commit 9e6830c to get more accurate results

@@                  Coverage Diff                   @@
##           serialization-updates    #2144   +/-   ##
======================================================
  Coverage                  99.23%   99.23%           
======================================================
  Files                        143      143           
  Lines                      17035    17100   +65     
======================================================
+ Hits                       16904    16969   +65     
  Misses                       131      131           
| Impacted Files | Coverage Δ |
|---|---|
| featuretools/feature_base/feature_base.py | 97.85% <ø> (-0.02%) ⬇️ |
| ...aturetools/tests/primitive_tests/test_agg_feats.py | 99.51% <ø> (-0.01%) ⬇️ |
| ...imitive_tests/test_groupby_transform_primitives.py | 100.00% <ø> (ø) |
| ...s/tests/primitive_tests/test_transform_features.py | 99.86% <ø> (-0.01%) ⬇️ |
| featuretools/feature_base/features_deserializer.py | 100.00% <100.00%> (ø) |
| featuretools/feature_base/features_serializer.py | 100.00% <100.00%> (ø) |
| featuretools/primitives/utils.py | 99.60% <100.00%> (-0.02%) ⬇️ |
| ...retools/tests/primitive_tests/test_feature_base.py | 100.00% <100.00%> (ø) |
| ...ests/primitive_tests/test_feature_serialization.py | 99.60% <100.00%> (ø) |
| ...ests/primitive_tests/test_features_deserializer.py | 100.00% <100.00%> (ø) |

... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fb2292d...9e6830c.

primitives_dict_key = str(primitive_number)
primitive_id_to_key[primitive_id] = primitives_dict_key
self._primitives_dict[
str(primitives_dict_key)
Contributor

primitives_dict_key is already cast to a string on line 118, so the extra str() call here is redundant.

Contributor Author

Yep, sure is. Fixed in 9e6830c.

@thehomebrewnerd
Contributor Author

Looks like the Windows tests are failing due to an unrelated graphviz issue, and one failed due to codecov (possibly because of GitHub API problems), so I'm going ahead and merging this into the feature branch.

@thehomebrewnerd thehomebrewnerd merged commit 25c71fb into serialization-updates Jun 29, 2022
@thehomebrewnerd thehomebrewnerd deleted the refactor-feat-serialization branch June 29, 2022 16:20
thehomebrewnerd added a commit that referenced this pull request Jun 30, 2022
…2136)

* serialization and deserialization improvements

* add pr number

* Improve feature deserialization to use common primitive instances (#2127)

* update feature deserialization

* update release notes

* lint fix

* fix comment

* fix for list inputs and test cleanup

* remove files

* remove file

* use tmp_path

* Allow users to directly set feature output column names and save during serialization (#2142)

* add set_feature_names method

* initial serialization updates

* update release notes

* fix tests

* code cleanup - only store names when set

* only use on multi-output features

* fix test

* lint fix

* remove unused functions

* add more test cases

* update serialization test

* Update featuretools/feature_base/feature_base.py

Co-authored-by: Roy Wedge <roy.wedge@alteryx.com>

* Refactor feature serialization to avoid storing duplicate primitive information (#2144)

* initial refactor of serialization

* update release notes

* lint and remove breakpoint

* new approach without hash

* refactor and update tests

* remove comment

* update s3 file

* update feature base args

* lint fix

* misc cleanup

* remove extra str casting

* fix spelling error

* remove instance cache

* update save and load docstring examples

* lint fix

* more docstring cleanup

* update release notes

* tweak serialization

* update json

Co-authored-by: Roy Wedge <roy.wedge@alteryx.com>
@ozzieD ozzieD mentioned this pull request Jun 30, 2022