Parquet serialization #152

Merged
merged 86 commits into master from parquet-serialization on Jun 22, 2018

Conversation

3 participants
bschreck commented May 17, 2018

Serialization now has the option of saving to pickle files or Parquet (users need to import a parquet library that pandas recognizes: either fastparquet or pyarrow). EntitySets have a create_metadata_dict() method that creates a serializable dictionary with all the metadata, which can be dumped to a JSON file. The serialization routines save this JSON to a single file, and each entity's data objects to separate files (either pickle or parquet). The serialization API is:

  • EntitySet.to_pickle(filepath)
  • EntitySet.to_parquet(filepath)
  • EntitySet.read_pickle(filepath)
  • EntitySet.read_parquet(filepath)
  • ft.read_pickle(filepath)
  • ft.read_parquet(filepath)
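The on-disk layout described above (a single metadata.json plus one data file per entity) can be sketched with only the standard library. write_entityset and read_entityset below are illustrative stand-ins for the PR's routines, not the merged featuretools code, and the metadata shape is an assumption:

```python
import json
import os
import pickle
import tempfile

def write_entityset(entities, metadata, path):
    # One metadata.json at the root, plus one pickled data file per entity,
    # mirroring the layout described in the PR summary.
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, 'metadata.json'), 'w') as f:
        json.dump(metadata, f)
    for e_id, data in entities.items():
        e_dir = os.path.join(path, e_id)
        os.makedirs(e_dir, exist_ok=True)
        with open(os.path.join(e_dir, 'data.p'), 'wb') as f:
            pickle.dump(data, f)

def read_entityset(path):
    # Read the metadata first, then use it to locate each entity's data file.
    with open(os.path.join(path, 'metadata.json')) as f:
        metadata = json.load(f)
    entities = {}
    for e_id in metadata['entities']:
        with open(os.path.join(path, e_id, 'data.p'), 'rb') as f:
            entities[e_id] = pickle.load(f)
    return entities, metadata

root = tempfile.mkdtemp()
write_entityset({'customers': [{'id': 1}]}, {'entities': ['customers']}, root)
entities, metadata = read_entityset(root)
```

Swapping pickle.dump for a parquet writer is the only change needed for the parquet path, which is why both formats can share the same metadata file.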

bschreck added some commits Jun 18, 2018

@@ -165,22 +165,22 @@ def entities(self):
     @property
     def metadata(self):
-        '''Defined as a property because an EntitySet's metadata
-        is used in many places, for instance, for each feature in a feature list.
+        '''An EntitySet's metadata is used in many places, for instance,

kmax12 Jun 19, 2018
the first line of the docstring should be a one-liner defining metadata.
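The convention being asked for here is PEP 257's: a one-line summary, then a blank line, then detail. A hypothetical version of the docstring (the summary wording is illustrative, not the merged text):

```python
class EntitySet:
    @property
    def metadata(self):
        '''A serializable summary of this EntitySet's entities and variables.

        Defined as a property because an EntitySet's metadata is used in
        many places, for instance for each feature in a feature list.
        '''
        return self._metadata  # hypothetical backing attribute
```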

else:
new_metadata = self._gen_metadata()
# Don't want to keep making new copies of metadata
# Only make a new one if something was changed
if not self._metadata.__eq__(new_metadata):

kmax12 Jun 19, 2018
why not self._metadata != new_metadata?
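On Python 3, != falls back to the negation of __eq__ when __ne__ is not defined, so the two spellings behave the same here. A quick sketch with a stand-in Metadata class (not the real EntitySet metadata):

```python
class Metadata:
    def __init__(self, entities):
        self.entities = entities

    def __eq__(self, other):
        # Two metadata objects are equal if they describe the same entities.
        return self.entities == other.entities

a = Metadata(['customers'])
b = Metadata(['customers'])
c = Metadata(['sessions'])

# `a != b` uses the default __ne__, which inverts __eq__,
# so it is equivalent to `not a.__eq__(b)` but far more readable.
```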

return self
@classmethod
def read_pickle(cls, path):
return read_pickle(path)
@classmethod
def read_parquet(cls, path):

kmax12 Jun 19, 2018
we should be like pandas and make the read_parquet method defined on the featuretools module not the EntitySet class
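A minimal sketch of the pandas-style layout being suggested, with module-level readers delegating to one shared helper; all names below are illustrative stand-ins, not the merged featuretools API:

```python
def _read_entityset(path, serialization_method):
    # Stand-in for the real deserialization logic; returns a placeholder
    # so the delegation pattern itself can be exercised.
    return {'path': path, 'method': serialization_method}

# Top-level readers mirror pandas.read_pickle / pandas.read_parquet,
# so users call ft.read_parquet(path) rather than a classmethod.
def read_pickle(path):
    return _read_entityset(path, 'pickle')

def read_parquet(path):
    return _read_entityset(path, 'parquet')
```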

kmax12 Jun 20, 2018
let's update api reference too

}
@classmethod
def from_metadata(cls,

kmax12 Jun 19, 2018
function signature should fit in 80 chars

bschreck and others added some commits Jun 19, 2018

@@ -120,6 +119,11 @@ def __init__(self, id, df, entityset, variable_types=None, name=None,
if self.index is not None and self.index not in inferred_variable_types:
self.add_variable(self.index, vtypes.Index)
# make sure index is at the beginning
index_variable = [v for v in self.variables

kmax12 Jun 20, 2018
I thought we already merged this into master?

@@ -563,10 +567,23 @@ def infer_variable_types(self, ignore=None, link_vars=None):
def update_data(self, df=None, data=None, already_sorted=False,

kmax12 Jun 20, 2018
looks like the data argument is never actually used

return self
@classmethod
def read_pickle(cls, path):
return read_pickle(path)
@classmethod
def read_parquet(cls, path):

kmax12 Jun 20, 2018
let's update api reference too

write_entityset(self, path, to_parquet=False)
return self
def to_parquet(self, path):

dummy=not load_data)
if any(v['interesting_values'] is not None and len(v['interesting_values'])
for v in entity['variables'].values()):
add_interesting_values = True

kmax12 Jun 20, 2018
add exact interesting values that were serialized

with open(os.path.join(entityset_path, 'metadata.json')) as f:
metadata = json.load(f)
return EntitySet.from_metadata(metadata, root=entityset_path,
load_data=True)

kmax12 Jun 20, 2018
rename root --> dataroot and then get rid of load_data

_datetime_types = vtypes.PandasTypes._pandas_datetimes
def read_parquet(path):

kmax12 Jun 20, 2018
add load_data to these methods

load_data=True)
def load_entity_data(metadata, dummy=True, root=None):

kmax12 Jun 20, 2018
let's call dummy just load_data

load_data=True)
def load_entity_data(metadata, dummy=True, root=None):

kmax12 Jun 20, 2018
let's split this into two functions. one that does variable types and one that does data
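One hedged sketch of the proposed split, with hypothetical function names and a simplified metadata shape (the v['type'] and rel_filename keys are assumptions, not the real serialized format):

```python
import os
import pickle

def load_variable_types(metadata):
    # First half of the split: reconstruct only the variable-type mapping
    # from the entity's metadata, without touching any data files.
    return {v_id: v['type'] for v_id, v in metadata['variables'].items()}

def load_data(metadata, root):
    # Second half: read the entity's pickled data object from disk,
    # using the relative filename recorded in the metadata.
    with open(os.path.join(root, metadata['rel_filename']), 'rb') as f:
        return pickle.load(f)
```

Splitting along this line lets callers that only need types (the dummy / metadata-only path) skip disk I/O entirely.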

return df, variable_types
def write_entityset(entityset, path, to_parquet=False):

kmax12 Jun 20, 2018
let's change to_parquet to serialization_method and make it a string
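A sketch of the suggested signature change, dispatching on a serialization_method string rather than a to_parquet boolean. The entity frames are assumed to expose pandas-style to_pickle/to_parquet methods; the function name and file names are illustrative:

```python
import os

def write_entityset(entities, path, serialization_method='pickle'):
    # Dispatch on a string instead of a boolean flag: this validates the
    # value up front and leaves room for future formats.
    if serialization_method == 'pickle':
        suffix, write = 'data.p', lambda df, p: df.to_pickle(p)
    elif serialization_method == 'parquet':
        suffix, write = 'data.parquet', lambda df, p: df.to_parquet(p)
    else:
        raise ValueError('unknown serialization_method: %r'
                         % serialization_method)
    for e_id, df in entities.items():
        e_dir = os.path.join(path, e_id)
        os.makedirs(e_dir, exist_ok=True)
        write(df, os.path.join(e_dir, suffix))
```

Unlike a to_parquet=False flag, a bad string fails loudly instead of silently falling back to pickle.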

entity,
metadata)
else:
rel_filename = os.path.join(e_id, 'data.p')

kmax12 Jun 20, 2018
make this its own function

@bschreck bschreck changed the base branch from master to remove-variable-stats Jun 20, 2018

bschreck added some commits Jun 20, 2018

@kmax12 kmax12 changed the base branch from remove-variable-stats to master Jun 22, 2018

@bschreck bschreck merged commit e3deb21 into master Jun 22, 2018

2 checks passed

ci/circleci: Your tests passed on CircleCI!
license/cla: Contributor License Agreement is signed.

@bschreck bschreck deleted the parquet-serialization branch Jun 22, 2018

@rwedge rwedge referenced this pull request Jun 22, 2018

Merged

v0.2.0 #173
