Serialize features as JSON #532

Merged
merged 14 commits into from May 8, 2019

3 participants
@CJStadler
Contributor

commented May 6, 2019

This changes the implementation of save_features and load_features to
use JSON instead of pickling. This gives us greater control over
version changes and makes it easier to inspect the saved data.

At the top level the JSON object has the following keys:

  • schema_version: the version of the features schema. During
    deserialization, an error is raised if the saved version is greater
    than the current version.
  • ft_version: the version of featuretools.
  • entityset: the entityset metadata (all features must be associated
    with the same entityset).
  • feature_list: an array of the names of the features.
  • feature_definitions: an object where the keys are feature names and
    the values are objects with the data for the corresponding feature.
    This may include features which were not included in the given list
    but are dependencies of the features in the list.
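
The top-level layout described above can be sketched as follows. This is a minimal illustration rather than the featuretools implementation: only the key names come from the description, while the version strings, feature names, class names, and the helper `check_schema_version` are hypothetical placeholders.

```python
import json

SCHEMA_VERSION = "1.0.0"  # hypothetical current schema version


def parse_version(version):
    """Turn a dotted version string such as "1.2.0" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def check_schema_version(saved, current=SCHEMA_VERSION):
    """Raise if the saved schema is newer than the one this code understands."""
    if parse_version(saved) > parse_version(current):
        raise RuntimeError(
            "Saved schema version %s is newer than supported version %s"
            % (saved, current)
        )


# Illustrative document; every value below is made up for the example.
features_document = {
    "schema_version": SCHEMA_VERSION,
    "ft_version": "0.0.0",  # placeholder, not a real release number
    "entityset": {},  # entityset metadata elided
    "feature_list": ["customers: COUNT(sessions)"],
    "feature_definitions": {
        "customers: COUNT(sessions)": {
            "type": "ToyAggregationFeature",
            "dependencies": ["sessions: id"],
            "arguments": {},
        },
        # Not in feature_list, but present because it is a dependency.
        "sessions: id": {
            "type": "ToyIdentityFeature",
            "dependencies": [],
            "arguments": {},
        },
    },
}

serialized = json.dumps(features_document)
loaded = json.loads(serialized)
check_schema_version(loaded["schema_version"])
```

Because the result is plain JSON, the saved file can be opened and inspected in any text editor, which is one of the stated motivations for moving away from pickling.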

Each feature object has the following keys:

  • type: the name of the class of this feature.
  • dependencies: a list of the names of features which this feature
    depends on.
  • arguments: an object storing the data necessary to construct this
    feature. This is generated by feature.get_arguments during
    serialization, and passed to feature_class.from_dictionary during
    deserialization to construct the new feature.
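
A rough sketch of that round trip, using toy classes rather than the real featuretools feature classes. The class names, the `FEATURE_CLASSES` lookup, and the two-argument `from_dictionary` signature are assumptions made for illustration; the actual signature in featuretools may take more parameters.

```python
class ToyIdentity:
    """Toy feature with no dependencies."""

    def __init__(self, variable):
        self.variable = variable

    def get_name(self):
        return self.variable

    def get_dependencies(self):
        return []

    def get_arguments(self):
        return {"variable": self.variable}

    @classmethod
    def from_dictionary(cls, arguments, dependencies):
        return cls(arguments["variable"])


class ToyAbsolute:
    """Toy feature that depends on exactly one base feature."""

    def __init__(self, base):
        self.base = base

    def get_name(self):
        return "ABS(%s)" % self.base.get_name()

    def get_dependencies(self):
        return [self.base]

    def get_arguments(self):
        # Dependencies are referenced by name, never embedded inline.
        return {"base": self.base.get_name()}

    @classmethod
    def from_dictionary(cls, arguments, dependencies):
        return cls(dependencies[arguments["base"]])


# Hypothetical lookup from the stored "type" string to a class.
FEATURE_CLASSES = {cls.__name__: cls for cls in (ToyIdentity, ToyAbsolute)}


def serialize(feature, definitions):
    """Record a feature and, recursively, everything it depends on."""
    name = feature.get_name()
    if name in definitions:
        return
    definitions[name] = {
        "type": type(feature).__name__,
        "dependencies": [dep.get_name() for dep in feature.get_dependencies()],
        "arguments": feature.get_arguments(),
    }
    for dep in feature.get_dependencies():
        serialize(dep, definitions)


def deserialize(name, definitions, built):
    """Rebuild a feature by name, building its dependencies first."""
    if name in built:
        return built[name]
    definition = definitions[name]
    dependencies = {
        dep_name: deserialize(dep_name, definitions, built)
        for dep_name in definition["dependencies"]
    }
    feature_class = FEATURE_CLASSES[definition["type"]]
    built[name] = feature_class.from_dictionary(definition["arguments"], dependencies)
    return built[name]


definitions = {}
serialize(ToyAbsolute(ToyIdentity("amount")), definitions)
restored = deserialize("ABS(amount)", definitions, {})
```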

Primitives may be included as arguments of features, and have the
following keys:

  • type: the name of the primitive class.
  • module: the name of the module the primitive class is defined in.
  • arguments: an object storing the data necessary to construct this
    primitive. This is built by reflecting on the signature of the
    primitive's constructor and, for every parameter, reading the
    attribute with the same name from the primitive.
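
The reflection step might look roughly like this. `ToyNMostCommon` and `serialize_primitive` are hypothetical names for the sketch; the approach assumes every constructor parameter is stored on the instance under the same name.

```python
import inspect


class ToyNMostCommon:
    """Toy primitive whose constructor parameter is stored as an attribute."""

    def __init__(self, n=3):
        self.n = n


def serialize_primitive(primitive):
    """Build the primitive's dictionary by reflecting on its constructor."""
    cls = type(primitive)
    # inspect.signature on a class reports the constructor parameters
    # without "self".
    parameters = inspect.signature(cls).parameters
    arguments = {name: getattr(primitive, name) for name in parameters}
    return {
        "type": cls.__name__,
        "module": cls.__module__,
        "arguments": arguments,
    }


serialized_primitive = serialize_primitive(ToyNMostCommon(n=5))
```

A consequence of this design is that a primitive which does not keep a constructor parameter as a same-named attribute would fail here with an AttributeError.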

Since primitive classes may come from different modules or be defined
dynamically, they are found by searching the descendants of
PrimitiveBase.
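
A minimal sketch of such a search, assuming a stand-in base class `ToyPrimitiveBase` in place of PrimitiveBase. The helper names are hypothetical; the technique is a depth-first walk over `__subclasses__()`.

```python
class ToyPrimitiveBase:
    """Stand-in for PrimitiveBase in this sketch."""


class ToyAggregation(ToyPrimitiveBase):
    pass


class ToyCount(ToyAggregation):
    pass


def find_descendants(cls):
    """Yield every direct and indirect subclass of cls, depth first."""
    for subclass in cls.__subclasses__():
        yield subclass
        yield from find_descendants(subclass)


def find_primitive_class(class_name, module, base=ToyPrimitiveBase):
    """Return the descendant of base matching the given name and module."""
    for candidate in find_descendants(base):
        if candidate.__name__ == class_name and candidate.__module__ == module:
            return candidate
    raise RuntimeError(
        'Primitive "%s" in module "%s" not found' % (class_name, module)
    )


found = find_primitive_class("ToyCount", ToyCount.__module__)
```

Matching on both name and module matters because two modules may define primitives with the same class name.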

Other changes:

  • Move featuretools.__version__ to its own module. This allows it to be
    imported into other featuretools modules without creating a circular
    dependency.
  • Add EntitySet.to_dictionary() to get the entity set metadata as a
    dictionary.
  • Add feature.unique_name(), which includes the entity id. get_name()
    may not be unique in a list of features because features on different
    entities could have the same name.
  • Add Timedelta.get_arguments() and Timedelta.from_dictionary().
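
To illustrate why get_name() alone is not a safe dictionary key, here is a small sketch with a toy feature class. The class and the exact "entity id: name" format are assumptions for the example, not necessarily what featuretools produces.

```python
class ToyFeature:
    """Toy feature that knows its entity id and its display name."""

    def __init__(self, entity_id, name):
        self.entity_id = entity_id
        self.name = name

    def get_name(self):
        return self.name

    def unique_name(self):
        # Assumed format: prefix the name with the entity id.
        return "%s: %s" % (self.entity_id, self.get_name())


customer_id = ToyFeature("customers", "id")
session_id = ToyFeature("sessions", "id")

# Keyed by get_name(), the second feature silently overwrites the first.
by_name = {f.get_name(): f for f in (customer_id, session_id)}
# Keyed by unique_name(), both features survive.
by_unique_name = {f.unique_name(): f for f in (customer_id, session_id)}
```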

Resolves #471

@codecov


commented May 6, 2019

Codecov Report

Merging #532 into master will increase coverage by 0.12%.
The diff coverage is 99.73%.

@@            Coverage Diff             @@
##           master     #532      +/-   ##
==========================================
+ Coverage    96.1%   96.23%   +0.12%     
==========================================
  Files         108      114       +6     
  Lines        8915     9245     +330     
==========================================
+ Hits         8568     8897     +329     
- Misses        347      348       +1
Impacted Files                                         Coverage Δ
featuretools/utils/api.py                              100% <ø> (ø) ⬆️
featuretools/feature_base/api.py                       100% <100%> (ø) ⬆️
featuretools/entityset/entityset.py                    95.07% <100%> (+0.02%) ⬆️
.../tests/primitive_tests/test_features_serializer.py  100% <100%> (ø)
featuretools/feature_base/feature_base.py              96.95% <100%> (+0.48%) ⬆️
featuretools/primitives/base/primitive_base.py         100% <100%> (ø) ⬆️
...s/tests/primitive_tests/test_transform_features.py  98.12% <100%> (+0.03%) ⬆️
...ools/tests/primitive_tests/test_direct_features.py  100% <100%> (ø) ⬆️
featuretools/primitives/utils.py                       97.36% <100%> (+0.93%) ⬆️
...ests/primitive_tests/test_feature_serialization.py  100% <100%> (ø) ⬆️
... and 19 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0739b88...107213a.

Update test to not compare entityset dictionaries
In older Python versions, some lists in the dictionaries do not always
have the same order. Instead, convert them back to entitysets and
compare those.
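
The problem can be reproduced with a toy example. The dictionaries and the `ToyEntitySet` class below are hypothetical; the sketch assumes that equality on the rebuilt object does not depend on list order, which is what makes comparing objects more robust than comparing raw dictionaries.

```python
# Two descriptions of the same toy entity set; only the list order differs.
first = {
    "id": "toy",
    "relationships": ["sessions->customers", "transactions->sessions"],
}
second = {
    "id": "toy",
    "relationships": ["transactions->sessions", "sessions->customers"],
}


class ToyEntitySet:
    """Toy entity set whose equality ignores relationship order."""

    def __init__(self, description):
        self.id = description["id"]
        self.relationships = frozenset(description["relationships"])

    def __eq__(self, other):
        return (
            isinstance(other, ToyEntitySet)
            and self.id == other.id
            and self.relationships == other.relationships
        )


dictionaries_equal = first == second
entitysets_equal = ToyEntitySet(first) == ToyEntitySet(second)
```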
@CJStadler

Contributor Author

commented May 6, 2019

@rwedge this is ready for review. Thanks!

@rwedge rwedge self-requested a review May 6, 2019

Resolved review thread: featuretools/feature_base/feature_base.py (outdated)
Resolved review thread: featuretools/feature_base/feature_base.py (outdated)
Resolved review thread: featuretools/feature_base/features_deserializer.py (outdated)
Resolved review thread: featuretools/feature_base/features_deserializer.py (outdated)
Resolved review thread: featuretools/feature_base/features_deserializer.py (outdated)
Resolved review thread: featuretools/feature_base/features_serializer.py
Resolved review thread: featuretools/tests/primitive_tests/test_features_deserializer.py (outdated)
Resolved review thread: featuretools/tests/primitive_tests/test_groupby_transform_primitives.py (outdated)
Resolved review thread: featuretools/tests/primitive_tests/test_groupby_transform_primitives.py (outdated)
Resolved review thread: featuretools/__init__.py

@CJStadler CJStadler requested a review from rwedge May 7, 2019

Resolved review thread: featuretools/primitives/utils.py (outdated)
Resolved review thread: featuretools/primitives/utils.py (outdated)

CJStadler added some commits May 8, 2019

        raise RuntimeError('Primitive "%s" in module "%s" not found' %
                           (class_name, module))
    if class_cache:
        class_cache[cache_key] = cls

@rwedge

rwedge May 8, 2019

Contributor

This only caches the primitive class found by _find_class_in_descendants, but I think we can cache any primitive class examined by _find_class_in_descendants once it has looked at all of its subclasses.

@rwedge

Contributor

commented May 8, 2019

Current untested scenarios:

  • Trying to deserialize an unknown feature class
  • Trying to deserialize an unknown primitive class

CJStadler added some commits May 8, 2019

Refactor primitive class lookup
Added PrimitivesDeserializer, which wraps a cache and a generator that
iterates over all primitive classes. When deserializing a primitive, if
it is not in the cache we iterate until it is found, adding every class
seen along the way to the cache. When deserializing the next primitive,
the iteration resumes where it left off, so we never visit a class more
than once.

A PrimitivesDeserializer is initialized in FeaturesDeserializer and then
passed to every `Feature.from_dictionary` call.
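
The cache-plus-generator idea can be sketched as below. This is an illustration under assumptions, not the merged code: `ToyPrimitivesDeserializer`, its attribute names, the `visited` counter, and the toy class hierarchy are all hypothetical.

```python
class ToyPrimitiveBase:
    """Stand-in for PrimitiveBase in this sketch."""


class ToyAggregation(ToyPrimitiveBase):
    pass


class ToyTransform(ToyPrimitiveBase):
    pass


class ToyCount(ToyAggregation):
    pass


class ToyAbsolute(ToyTransform):
    pass


def iterate_descendants(cls):
    """Lazily yield every direct and indirect subclass of cls."""
    for subclass in cls.__subclasses__():
        yield subclass
        yield from iterate_descendants(subclass)


class ToyPrimitivesDeserializer:
    """Finds primitive classes, visiting each class at most once."""

    def __init__(self, base=ToyPrimitiveBase):
        self.class_cache = {}
        # A single generator shared by all lookups, so iteration resumes
        # where the previous lookup stopped.
        self.primitive_classes = iterate_descendants(base)
        self.visited = 0  # counter kept only to demonstrate the behaviour

    def find_class(self, class_name, module):
        cache_key = (class_name, module)
        if cache_key in self.class_cache:
            return self.class_cache[cache_key]
        for candidate in self.primitive_classes:
            self.visited += 1
            candidate_key = (candidate.__name__, candidate.__module__)
            # Cache every class seen, not only the one being searched for.
            self.class_cache[candidate_key] = candidate
            if candidate_key == cache_key:
                return candidate
        raise RuntimeError(
            'Primitive "%s" in module "%s" not found' % (class_name, module)
        )


deserializer = ToyPrimitivesDeserializer()
count_class = deserializer.find_class("ToyCount", ToyCount.__module__)
```

Sharing one deserializer across all `from_dictionary` calls is what makes the cache useful: a fresh instance per feature would restart the iteration every time.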
@rwedge

rwedge approved these changes May 8, 2019

@CJStadler CJStadler merged commit 7c1c1a9 into master May 8, 2019

4 checks passed

codecov/patch 99.73% of diff hit (target 96.1%)
codecov/project 96.23% (+0.12%) compared to 0739b88
license/cla Contributor License Agreement is signed.
test_all_python_versions Workflow: test_all_python_versions

@CJStadler CJStadler deleted the features-json branch May 8, 2019

@rwedge rwedge referenced this pull request May 17, 2019

Merged

v0.8.0 #548
