
Parquet serialization #152

Merged
merged 86 commits into master from parquet-serialization on Jun 22, 2018

Conversation

@bschreck (Contributor) commented May 17, 2018

Serialization now has the option of saving to pickle files or Parquet (users need a parquet library that pandas recognizes: either fastparquet or pyarrow). EntitySets have a create_metadata_dict() method that creates a serializable dictionary with all the metadata, which can be dumped to a JSON file. The serialization routines save this JSON to a single file, and each entity's data objects to separate files (either pickle or parquet). The serialization API is (a short usage sketch follows the list):

  • EntitySet.to_pickle(filepath)
  • EntitySet.to_parquet(filepath)
  • EntitySet.read_pickle(filepath)
  • EntitySet.read_parquet(filepath)
  • ft.read_pickle(filepath)
  • ft.read_parquet(filepath)
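
A minimal usage sketch of the API above, assuming a parquet engine (fastparquet or pyarrow) is installed and that es is an existing EntitySet; the paths are illustrative and, per the description above, each one holds a metadata.json plus per-entity data files:

    import featuretools as ft

    # parquet round trip (needs fastparquet or pyarrow installed)
    es.to_parquet('my_entityset_parquet')
    es_again = ft.read_parquet('my_entityset_parquet')

    # pickle round trip
    es.to_pickle('my_entityset_pickle')
    es_again = ft.read_pickle('my_entityset_pickle')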

"""Save the entityset at the given path.
def read_parquet(path):
from featuretools.entityset.entityset import EntitySet
return EntitySet.read_parquet(path)
Contributor:

any reason to have this method on EntitySet? I feel like we should do it more like pandas.read_csv. So, ft.read_parquet.

@@ -165,22 +165,22 @@ def entities(self):

     @property
     def metadata(self):
-        '''Defined as a property because an EntitySet's metadata
-        is used in many places, for instance, for each feature in a feature list.
+        '''An EntitySet's metadata is used in many places, for instance,
Contributor:

the first line of the doc string should be a one liner defining metadata.
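
A sketch of the shape being asked for; the one-line summary wording is illustrative, not the actual definition:

    @property
    def metadata(self):
        '''A copy of this EntitySet with the underlying entity data removed.

        An EntitySet's metadata is used in many places, for instance, for
        each feature in a feature list, so it is cached and only regenerated
        when the EntitySet changes.
        '''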

        else:
            new_metadata = self._gen_metadata()
            # Don't want to keep making new copies of metadata
            # Only make a new one if something was changed
            if not self._metadata.__eq__(new_metadata):
Contributor:

why not self._metadata != new_metadata?
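
For reference, the suggested form; note it assumes EntitySet defines __eq__ (and, on Python 2, __ne__ as well, since != is not derived from __eq__ there), and the assignment in the body is an assumption about what the diff does next:

    # same check, using the comparison operator directly
    if self._metadata != new_metadata:
        self._metadata = new_metadata  # assumed: keep the freshly generated copy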

        return self

    @classmethod
    def read_pickle(cls, path):
        return read_pickle(path)

    @classmethod
    def read_parquet(cls, path):
Contributor:

we should be like pandas and make the read_parquet method defined on the featuretools module not the EntitySet class

Contributor:

let's update api reference too

        }

    @classmethod
    def from_metadata(cls,
Contributor:

function signature should be 80 chars
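
A sketch of a signature that fits within 80 characters; the parameter names are taken from how from_metadata is called later in this diff (metadata, root, load_data) and may not be the complete list:

    @classmethod
    def from_metadata(cls, metadata, root=None, load_data=True):
        ...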

@@ -120,6 +119,11 @@ def __init__(self, id, df, entityset, variable_types=None, name=None,
        if self.index is not None and self.index not in inferred_variable_types:
            self.add_variable(self.index, vtypes.Index)

        # make sure index is at the beginning
        index_variable = [v for v in self.variables
Contributor:

I thought we already merged this into master?

@@ -563,10 +567,23 @@ def infer_variable_types(self, ignore=None, link_vars=None):

    def update_data(self, df=None, data=None, already_sorted=False,
Contributor:

looks like the data argument is never actually used

        return self

    @classmethod
    def read_pickle(cls, path):
        return read_pickle(path)

    @classmethod
    def read_parquet(cls, path):
Contributor:

let's update api reference too

        write_entityset(self, path, to_parquet=False)
        return self

    def to_parquet(self, path):

                               dummy=not load_data)
        if any(v['interesting_values'] is not None and len(v['interesting_values'])
               for v in entity['variables'].values()):
            add_interesting_values = True
Contributor:

add exact interesting values that were serialized
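
A hedged sketch of what restoring the exact values could look like; the 'variables' and 'interesting_values' keys follow the diff above, while the entity_id lookup and the interesting_values setter on Variable are assumptions about this codebase:

    for var_id, var_meta in entity['variables'].items():
        serialized = var_meta.get('interesting_values')
        if serialized:
            # assign exactly what was serialized instead of re-inferring it
            es[entity_id][var_id].interesting_values = serialized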

    with open(os.path.join(entityset_path, 'metadata.json')) as f:
        metadata = json.load(f)
    return EntitySet.from_metadata(metadata, root=entityset_path,
                                   load_data=True)
Contributor:

rename root --> dataroot and then get rid of load_data


_datetime_types = vtypes.PandasTypes._pandas_datetimes
def read_parquet(path):
Contributor:

add load_data to these methods
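
A sketch of what threading load_data through the module-level readers could look like, assuming the EntitySet classmethods accept the same keyword:

    def read_parquet(path, load_data=True):
        from featuretools.entityset.entityset import EntitySet
        return EntitySet.read_parquet(path, load_data=load_data)

    def read_pickle(path, load_data=True):
        from featuretools.entityset.entityset import EntitySet
        return EntitySet.read_pickle(path, load_data=load_data)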

load_data=True)


def load_entity_data(metadata, dummy=True, root=None):
Contributor:

let's call dummy just load_data

load_data=True)


def load_entity_data(metadata, dummy=True, root=None):
Contributor:

let's split this into two functions. one that does variable types and one that does data
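
A hedged sketch of the suggested split; the function names are illustrative and the metadata keys ('variables', 'type', 'rel_filename') are assumptions about the serialized format:

    import os
    import pandas as pd

    def load_variable_types(entity_metadata):
        # rebuild the variable_types mapping from serialized metadata alone
        return {var_id: var_meta['type']
                for var_id, var_meta in entity_metadata['variables'].items()}

    def load_entity_dataframe(entity_metadata, root):
        # read the entity's dataframe back from its pickle or parquet file
        filename = os.path.join(root, entity_metadata['rel_filename'])
        if filename.endswith('.parquet'):
            return pd.read_parquet(filename)
        return pd.read_pickle(filename)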

    return df, variable_types


def write_entityset(entityset, path, to_parquet=False):
Contributor:

let's change to_parquet to serialization_method and make it a string
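
A sketch of the suggested change, with 'pickle' assumed as the default value:

    def write_entityset(entityset, path, serialization_method='pickle'):
        if serialization_method not in ('pickle', 'parquet'):
            raise ValueError('unknown serialization_method: %s'
                             % serialization_method)
        ...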

                             entity,
                             metadata)
        else:
            rel_filename = os.path.join(e_id, 'data.p')
Contributor:

make this its own function
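
A hedged sketch of pulling the pickle branch into its own helper; the argument names and directory layout are assumptions:

    def _write_pickle_data(df, entityset_path, e_id):
        # mirror the layout above: each entity gets its own subdirectory
        entity_dir = os.path.join(entityset_path, e_id)
        if not os.path.exists(entity_dir):
            os.makedirs(entity_dir)
        rel_filename = os.path.join(e_id, 'data.p')
        df.to_pickle(os.path.join(entityset_path, rel_filename))
        return rel_filename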

@bschreck changed the base branch from master to remove-variable-stats on June 20, 2018 19:07
@kmax12 changed the base branch from remove-variable-stats to master on June 22, 2018 02:13
@bschreck merged commit e3deb21 into master on Jun 22, 2018
@bschreck deleted the parquet-serialization branch on June 22, 2018 03:23
@rwedge mentioned this pull request on Jun 22, 2018