Serialize features as JSON #532
This changes the implementation of `save_features` and `load_features` to use JSON instead of pickling. This will give us greater control over version changes, and also makes it easier to inspect the saved data.

At the top level the JSON object has the following keys:

- `schema_version`: the version of the features schema. During deserialization, if the saved version is greater than the current version an error will be raised.
- `ft_version`: the version of featuretools.
- `entityset`: the entityset metadata (all features must be associated with the same entityset).
- `feature_list`: an array of the names of the features.
- `feature_definitions`: an object where the keys are feature names and the values are objects with the data for the corresponding feature. This may include features which were not included in the given list but are dependencies of the features in the list.

Each feature object has the following keys:

- `type`: the name of the class of this feature.
- `dependencies`: a list of the names of features which this feature depends on.
- `arguments`: an object storing the data necessary to construct this feature. This is generated by `feature.get_arguments` during serialization, and passed to `feature_class.from_dictionary` during deserialization to construct the new feature.

Primitives may be included as arguments of features, and have the following keys:

- `type`: the name of the primitive class.
- `module`: the name of the module the primitive class is defined in.
- `arguments`: an object storing the data necessary to construct this primitive. This is built by reflecting on the signature of the primitive's constructor and, for every parameter, getting the attribute with the same name.

Since primitive classes may come from different modules or be defined dynamically, they are found by searching the descendants of `PrimitiveBase`.

Other changes:

- Move `featuretools.__version__` to its own module. This allows it to be imported into other featuretools modules without creating a circular dependency.
- Add `EntitySet.to_dictionary()` to get the entity set metadata as a dictionary.
- Add `feature.unique_name()`, which includes the entity id. `get_name()` may not be unique in a list of features because features on different entities could have the same name.
- Add `Timedelta.get_arguments()` and `from_dictionary`.
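For readers skimming the description, here is a rough sketch of two of the ideas above: building a primitive's `arguments` by reflecting on its constructor, and rejecting a saved file whose schema version is newer than the current one. The names `SCHEMA_VERSION`, `Scale`, `serialize_primitive`, `parse_version`, and `check_schema_version` are hypothetical and are not taken from this PR's code. The sketch also assumes versions are dotted integers.

```python
import inspect
import json

SCHEMA_VERSION = "1.0.0"  # hypothetical current schema version


class Scale:
    """Hypothetical stand-in for a primitive class."""

    def __init__(self, factor=1, offset=0):
        self.factor = factor
        self.offset = offset


def serialize_primitive(primitive):
    """Build the type/module/arguments object for a primitive instance."""
    cls = type(primitive)
    # Reflect on the constructor signature; for every parameter, read the
    # attribute with the same name from the instance.
    params = inspect.signature(cls).parameters
    arguments = {name: getattr(primitive, name) for name in params}
    return {"type": cls.__name__,
            "module": cls.__module__,
            "arguments": arguments}


def parse_version(version):
    """Turn '1.2.3' into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def check_schema_version(saved_version, current_version=SCHEMA_VERSION):
    """Raise if the saved schema is newer than this code understands."""
    if parse_version(saved_version) > parse_version(current_version):
        raise RuntimeError(
            "Saved schema version %s is greater than current version %s"
            % (saved_version, current_version))


serialized = serialize_primitive(Scale(factor=3))
text = json.dumps(serialized)  # plain JSON, easy to inspect by hand
```

This only works when each constructor parameter is stored on the instance under the same name, which is the convention the description relies on.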
Codecov Report
@@ Coverage Diff @@
## master #532 +/- ##
==========================================
+ Coverage 96.1% 96.23% +0.12%
==========================================
Files 108 114 +6
Lines 8915 9245 +330
==========================================
+ Hits 8568 8897 +329
- Misses 347 348 +1
In older Python versions, some lists in the dictionaries do not always have the same order, so comparing the dictionaries directly is unreliable. Instead, convert them back to entitysets and compare those.
@rwedge this is ready for review. Thanks!
featuretools/primitives/utils.py
        raise RuntimeError('Primitive "%s" in module "%s" not found' %
                           (class_name, module))
    if class_cache:
        class_cache[cache_key] = cls
This only caches the primitive class found by `_find_class_in_descendants`, but I think we can cache any primitive class examined by `_find_class_in_descendants` once it has looked at all of its subclasses.
Current untested scenarios:
Added `PrimitivesDeserializer`, which wraps a cache and a generator that iterates over all primitive classes. When deserializing a primitive, if it is not in the cache we iterate until it is found, adding every seen class to the cache. When deserializing the next primitive, the iteration resumes where it left off, so we never visit a class more than once. A `PrimitivesDeserializer` is initialized in `FeaturesDeserializer` and then passed to every `Feature.from_dictionary` call.
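To make the resumable lookup concrete, here is a minimal sketch of the idea, not the code in this PR. `LazyClassLookup`, `iter_descendants`, and the small class hierarchy are hypothetical names used for illustration, and `PrimitiveBase` here is a local stand-in rather than the real featuretools class.

```python
class PrimitiveBase:
    """Hypothetical stand-in for the real primitive base class."""


class Transform(PrimitiveBase):
    pass


class Aggregation(PrimitiveBase):
    pass


class Absolute(Transform):
    pass


class Mean(Aggregation):
    pass


def iter_descendants(cls):
    """Yield every descendant of cls, depth first."""
    for subclass in cls.__subclasses__():
        yield subclass
        yield from iter_descendants(subclass)


class LazyClassLookup:
    """A cache plus one shared generator, so no class is visited twice."""

    def __init__(self, base=PrimitiveBase):
        self.class_cache = {}
        self.remaining = iter_descendants(base)
        self.visited = 0  # counts classes pulled from the generator

    def find(self, class_name, module):
        key = (class_name, module)
        if key in self.class_cache:
            return self.class_cache[key]
        # Resume the generator where the previous lookup stopped,
        # caching every class seen along the way.
        for cls in self.remaining:
            self.visited += 1
            seen_key = (cls.__name__, cls.__module__)
            self.class_cache[seen_key] = cls
            if seen_key == key:
                return cls
        raise RuntimeError('Primitive "%s" in module "%s" not found' %
                           (class_name, module))


lookup = LazyClassLookup()
found = lookup.find("Absolute", Absolute.__module__)
```

Because the generator is stored on the object rather than recreated per call, a later `find` either hits the cache or continues the same traversal. One limitation of this sketch is that classes defined after the generator is exhausted would not be discovered.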
Resolves #471