Parquet serialization #152
Conversation
"""Save the entityset at the given path. | ||
def read_parquet(path): | ||
from featuretools.entityset.entityset import EntitySet | ||
return EntitySet.read_parquet(path) |
Any reason to have this method on EntitySet? I feel like we should do it more like pandas.read_csv. So, ft.read_parquet.
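The pandas-style module-level wrapper the reviewer suggests could simply delegate to the classmethod. A minimal self-contained sketch of that pattern (the EntitySet class below is a stand-in for illustration, not featuretools code):

```python
class EntitySet:
    """Stand-in for featuretools' EntitySet; not the real class."""
    @classmethod
    def read_parquet(cls, path):
        # The real method would rebuild the EntitySet from serialized
        # Parquet files; here we just record where it was read from.
        es = cls()
        es.path = path
        return es


def read_parquet(path):
    """Module-level convenience wrapper, in the style of pandas.read_csv."""
    return EntitySet.read_parquet(path)


es = read_parquet("my_entityset/")
print(es.path)  # -> my_entityset/
```

In the real library, the import of EntitySet would happen inside the wrapper (as in the diff above) to avoid a circular import between the top-level module and entityset.py.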
featuretools/entityset/entityset.py
Outdated
@@ -165,22 +165,22 @@ def entities(self):

@property
def metadata(self):
    '''Defined as a property because an EntitySet's metadata
    is used in many places, for instance, for each feature in a feature list.
    '''An EntitySet's metadata is used in many places, for instance,
The first line of the docstring should be a one-liner defining metadata.
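A docstring shaped the way the reviewer asks for might look like this; the one-line summary wording is an assumption for illustration, not the fix that actually landed:

```python
class EntitySet:
    """Minimal stand-in class to carry the example docstring."""

    @property
    def metadata(self):
        """A low-memory, data-free copy of this EntitySet.

        Defined as a property because an EntitySet's metadata is used
        in many places, for instance for each feature in a feature list.
        """
        return self._metadata
```

Per PEP 257, the first line is a self-contained summary, followed by a blank line and the longer explanation.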
featuretools/entityset/entityset.py
Outdated
else:
    new_metadata = self._gen_metadata()
    # Don't want to keep making new copies of metadata
    # Only make a new one if something was changed
    if not self._metadata.__eq__(new_metadata):
Why not self._metadata != new_metadata?
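For context: in Python 3, a != b falls back to the negation of __eq__ when no __ne__ is defined, so the two spellings behave the same there (under Python 2, which featuretools still supported at the time, __ne__ would have to be defined separately). A minimal illustration with a toy class:

```python
class Metadata:
    """Toy stand-in for EntitySet metadata; not featuretools code."""
    def __init__(self, entities):
        self.entities = entities

    def __eq__(self, other):
        return isinstance(other, Metadata) and self.entities == other.entities


a = Metadata(["customers"])
b = Metadata(["customers", "sessions"])

# Python 3 derives != from __eq__ automatically, so these agree:
print(a != b)           # -> True
print(not a.__eq__(b))  # -> True
```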
featuretools/entityset/entityset.py
Outdated
    return self

@classmethod
def read_pickle(cls, path):
    return read_pickle(path)

@classmethod
def read_parquet(cls, path):
We should be like pandas and define the read_parquet method on the featuretools module, not the EntitySet class.
let's update api reference too
featuretools/entityset/entity.py
Outdated
}

@classmethod
def from_metadata(cls,
The function signature should fit within 80 characters.
featuretools/entityset/entity.py
Outdated
@@ -120,6 +119,11 @@ def __init__(self, id, df, entityset, variable_types=None, name=None,
    if self.index is not None and self.index not in inferred_variable_types:
        self.add_variable(self.index, vtypes.Index)

    # make sure index is at the beginning
    index_variable = [v for v in self.variables
I thought we already merged this into master?
featuretools/entityset/entity.py
Outdated
@@ -563,10 +567,23 @@ def infer_variable_types(self, ignore=None, link_vars=None):

def update_data(self, df=None, data=None, already_sorted=False,
looks like the data argument is never actually used
featuretools/entityset/entityset.py
Outdated
    write_entityset(self, path, to_parquet=False)
    return self

def to_parquet(self, path):
Let's add an engine parameter like pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_parquet.html
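For reference, pandas documents engine='auto' as trying pyarrow first and then fastparquet. A hypothetical sketch of that resolution logic (the function name resolve_parquet_engine is made up for illustration):

```python
def resolve_parquet_engine(engine="auto"):
    """Pick a Parquet engine the way pandas' to_parquet documents it."""
    valid = ("auto", "pyarrow", "fastparquet")
    if engine not in valid:
        raise ValueError("engine must be one of %r" % (valid,))
    if engine != "auto":
        return engine
    # 'auto': try pyarrow first, then fastparquet, as pandas does.
    for candidate in ("pyarrow", "fastparquet"):
        try:
            __import__(candidate)
            return candidate
        except ImportError:
            pass
    raise ImportError("no usable Parquet library found; "
                      "install pyarrow or fastparquet")
```

EntitySet.to_parquet could then accept engine='auto' and pass the resolved name through to each per-entity DataFrame.to_parquet call.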
featuretools/entityset/entityset.py
Outdated
                      dummy=not load_data)
if any(v['interesting_values'] is not None and len(v['interesting_values'])
       for v in entity['variables'].values()):
    add_interesting_values = True
add exact interesting values that were serialized
with open(os.path.join(entityset_path, 'metadata.json')) as f:
    metadata = json.load(f)
return EntitySet.from_metadata(metadata, root=entityset_path,
                               load_data=True)
rename root --> dataroot and then get rid of load_data
_datetime_types = vtypes.PandasTypes._pandas_datetimes

def read_parquet(path):
add load_data to these methods
                     load_data=True)


def load_entity_data(metadata, dummy=True, root=None):
Let's call dummy just load_data.
Let's split this into two functions: one that does variable types and one that does data.
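The suggested split might look like this; the function names and the metadata shape are assumptions for illustration, not the actual refactor:

```python
def load_variable_types(metadata):
    """Recover the variable-type mapping from serialized entity metadata."""
    return {v_id: v["type"] for v_id, v in metadata["variables"].items()}


def load_entity_data(metadata, load_data=True, root=None):
    """Return (dataframe, variable_types); dataframe is None when
    load_data is False. Actual file reading is stubbed out here."""
    variable_types = load_variable_types(metadata)
    if not load_data:
        return None, variable_types
    # A real implementation would read the entity's serialized file
    # (e.g. data.p or data.parquet) relative to `root`.
    raise NotImplementedError("data loading not sketched here")
```

Separating the two also makes the dummy/load_data=False path trivial: it just skips the data function entirely.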
    return df, variable_types


def write_entityset(entityset, path, to_parquet=False):
Let's change to_parquet to serialization_method and make it a string.
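One way to realize the string-valued parameter is a small dispatch table, which also leaves room for future formats without adding more boolean flags. The helper names below are hypothetical:

```python
def _write_pickle(entity, path):
    """Hypothetical pickle writer; returns a record for illustration."""
    return ("pickle", entity, path)


def _write_parquet(entity, path):
    """Hypothetical Parquet writer; returns a record for illustration."""
    return ("parquet", entity, path)


_WRITERS = {"pickle": _write_pickle, "parquet": _write_parquet}


def write_entity(entity, path, serialization_method="pickle"):
    """Dispatch on the serialization_method string instead of a flag."""
    try:
        writer = _WRITERS[serialization_method]
    except KeyError:
        raise ValueError("unknown serialization_method: %r"
                         % serialization_method)
    return writer(entity, path)
```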
                entity,
                metadata)
else:
    rel_filename = os.path.join(e_id, 'data.p')
Make this its own function.
Serialization now has the option of saving to pickle files or Parquet (users need to install a Parquet library that pandas recognizes: either fastparquet or pyarrow). EntitySets have a create_metadata_dict() method to create a serializable dictionary with all the metadata, which can be dumped to a JSON file. The serialization routines save this JSON to a single file, and each entity's data objects to separate files (either pickle or Parquet). The serialization API is:

EntitySet.to_pickle(filepath)
EntitySet.to_parquet(filepath)
EntitySet.read_pickle(filepath)
EntitySet.read_parquet(filepath)
ft.read_pickle(filepath)
ft.read_parquet(filepath)