Parquet serialization #152
Merged — 86 commits merged into master from no-base-entityset-and-other-clean-ups on Jun 22, 2018
Commits (86) — changes shown from 70 commits
77c0200
starting
bschreck Mar 27, 2018
0b86de0
works
bschreck Apr 3, 2018
aa09c85
fixed conflicts
bschreck Apr 16, 2018
48e2794
moved data to properties
bschreck Apr 24, 2018
f3b9c9c
trying
bschreck Apr 24, 2018
958e999
tests pass
bschreck Apr 24, 2018
3cb63bb
mostly working
bschreck Apr 24, 2018
9883f8a
tests pass
bschreck Apr 25, 2018
e95ab5b
modified deployment
bschreck Apr 25, 2018
c866ace
entityset optional arg
bschreck Apr 25, 2018
a17bb11
rm extraneous files
bschreck Apr 25, 2018
d683359
added comments
bschreck Apr 25, 2018
b8cf2a5
updates
bschreck Apr 25, 2018
95b3900
remove unnecessary import
bschreck Apr 25, 2018
9279055
update per roy suggestion
bschreck May 2, 2018
6bb2245
Merge branch 'fl-no-entityset' into no-base-entityset-and-other-clean…
bschreck May 3, 2018
e406bc7
added comment about entityset metadata memory saving
bschreck May 3, 2018
2ea566a
Merge branch 'fl-no-entityset' into no-base-entityset-and-other-clean…
bschreck May 3, 2018
12fded1
working
bschreck May 7, 2018
a42026d
got rid of lots of stuff
bschreck May 7, 2018
f23b1a5
1 entityset test fails
bschreck May 8, 2018
062dd7e
tests pass
bschreck May 8, 2018
d0eb395
merged
bschreck May 8, 2018
7b3e14f
lint passed
bschreck May 8, 2018
2d9e7bf
added __ne__ method to entityset
bschreck May 9, 2018
5a80afc
lint:
bschreck May 9, 2018
e48211d
merged
bschreck May 9, 2018
0015d84
merged master
bschreck May 9, 2018
e3646c6
fixed old merge conflict
bschreck May 9, 2018
3c0a8a6
tests pass
bschreck May 9, 2018
6b3788a
removed class_from_dtype method
bschreck May 9, 2018
af9f846
added docstrings
bschreck May 9, 2018
fcb9f41
merged
bschreck May 9, 2018
df5a761
remove import
bschreck May 10, 2018
14657a8
remove add variable to variable_types property
bschreck May 14, 2018
aa531de
entity_stores to entity_dict
bschreck May 14, 2018
824d492
update_data recalculates last_time_indexes and better serialization o…
bschreck May 16, 2018
79c3675
serialization using pickle, parquet, and saving metadata in a json
bschreck May 17, 2018
ec5e203
tests pass, lots of warnings though
bschreck May 17, 2018
6ca0e60
type warning removed
bschreck May 17, 2018
cad1ff9
added ft.read_pickle and ft.read_parquet
bschreck May 17, 2018
72f18f0
linted
bschreck May 17, 2018
0d1793d
reverted
bschreck May 17, 2018
430766f
fixed data type not understood error
bschreck May 18, 2018
cb6d1a3
updated test
bschreck May 18, 2018
a278b82
Merge branch 'master' into no-base-entityset-and-other-clean-ups
bschreck May 18, 2018
e3fba88
merged
bschreck May 18, 2018
764bbc8
commit
bschreck May 19, 2018
608ffcc
fixed everything except multiindex multiple types not raising typeerror
bschreck May 19, 2018
a259ee3
merged with pandas 23 fixes
bschreck May 19, 2018
d2cc944
merged with pandas 23 fixes
bschreck May 19, 2018
9cb9cd0
merged with master (pandas 23 fix)
bschreck May 23, 2018
da8393e
merged
bschreck May 23, 2018
1c9be57
updated reqs to run tests on ci
bschreck May 23, 2018
8f1b298
linted
bschreck May 23, 2018
cfba565
linted
bschreck May 23, 2018
8d8d10f
merged
bschreck Jun 12, 2018
0b07945
almost
bschreck Jun 12, 2018
b76ce4a
tests pass
bschreck Jun 12, 2018
fed43a7
raise useful error if attempting to use parquet with unicode column n…
bschreck Jun 13, 2018
2ece52a
Removed old cfm lines from merge conflict
bschreck Jun 13, 2018
94feaf4
switched to pyarrow
bschreck Jun 13, 2018
d222d4e
tests pass in py2 and py3
bschreck Jun 13, 2018
bb4df2a
tests pass
bschreck Jun 15, 2018
7e05b84
fixed lint
bschreck Jun 15, 2018
5d82132
explicit metadata dict
bschreck Jun 18, 2018
42acfee
linted
bschreck Jun 18, 2018
b404415
serialize to write namechange
bschreck Jun 18, 2018
94a6268
use public api
bschreck Jun 19, 2018
cf7048d
to_parquet test working
bschreck Jun 19, 2018
6e74bbe
initial pass through
kmax12 Jun 19, 2018
8b58943
all tests passing
kmax12 Jun 19, 2018
783bb6a
fix linting
kmax12 Jun 19, 2018
63ab80b
change test name
kmax12 Jun 20, 2018
b6037ba
update update data
bschreck Jun 20, 2018
c70b64c
merged
bschreck Jun 20, 2018
7a28213
update per code review
bschreck Jun 20, 2018
5123311
tests pass except stats
bschreck Jun 20, 2018
3a2cbc5
merged
bschreck Jun 20, 2018
5333480
tests pass
bschreck Jun 20, 2018
7cf4ab1
linted
bschreck Jun 20, 2018
7dbecb7
remove bad file
bschreck Jun 20, 2018
ea36a3d
added docstrings
bschreck Jun 20, 2018
e7a1ce0
remove metadata_filename arg
bschreck Jun 21, 2018
e28f28a
merged
bschreck Jun 21, 2018
ec64175
linted
bschreck Jun 21, 2018
1 change: 1 addition & 0 deletions featuretools/entityset/api.py
@@ -2,4 +2,5 @@
 from .entity import Entity
 from .entityset import EntitySet
 from .relationship import Relationship
+from .serialization import read_parquet, read_pickle
 from .timedelta import Timedelta
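With this re-export, the new loaders become reachable from the top-level featuretools namespace — the pandas-style module-level API a reviewer asks for further down, and what the "added ft.read_pickle and ft.read_parquet" commit refers to. A minimal usage sketch; the path is hypothetical and assumes an EntitySet was previously saved with to_pickle:

import featuretools as ft

# read_pickle is re-exported through featuretools.entityset.api, so it can
# be called from the top-level package rather than through EntitySet.
es = ft.read_pickle("/tmp/saved_es")  # hypothetical path written by es.to_pickle(...)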
25 changes: 21 additions & 4 deletions featuretools/entityset/entity.py
@@ -41,7 +41,7 @@ class Entity(object):
     index = None
     indexed_by = None
 
-    def __init__(self, id, df, entityset, variable_types=None, name=None,
+    def __init__(self, id, df, entityset, variable_types=None,
                  index=None, time_index=None, secondary_time_index=None,
                  last_time_index=None, encoding=None, relationships=None,
                  already_sorted=False, created_index=None, verbose=False):
@@ -56,7 +56,6 @@ def __init__(self, id, df, entityset, variable_types=None,
                 entity_id to variable_types dict with which to initialize an
                 entity's store.
                 An entity's variable_types dict maps string variable ids to types (:class:`.Variable`).
-            name (str): Name of entity.
             index (str): Name of id column in the dataframe.
             time_index (str): Name of time column in the dataframe.
             secondary_time_index (dict[str -> str]): Dictionary mapping columns
@@ -80,7 +79,6 @@ def __init__(self, id, df, entityset, variable_types=None,
         self.created_index = created_index
         self.convert_all_variable_data(variable_types)
         self.id = id
-        self.name = name
         self.entityset = entityset
         self.indexed_by = {}
         variable_types = variable_types or {}
@@ -92,6 +90,7 @@ def __init__(self, id, df, entityset, variable_types=None,
             if ti not in cols:
                 cols.append(ti)
 
+        relationships = relationships or []
         link_vars = [v.id for rel in relationships for v in [rel.parent_variable, rel.child_variable]
                      if v.entity.id == self.id]
 
@@ -120,6 +119,11 @@ def __init__(self, id, df, entityset, variable_types=None,
         if self.index is not None and self.index not in inferred_variable_types:
             self.add_variable(self.index, vtypes.Index)
 
+        # make sure index is at the beginning
+        index_variable = [v for v in self.variables
+                          if v.id == self.index][0]
+        self.variables = [index_variable] + [v for v in self.variables
+                                             if v.id != self.index]

Contributor comment (on this block): I thought we already merged this into master?

         self.update_data(df=self.df,
                          already_sorted=already_sorted,
                          recalculate_last_time_indexes=False,
@@ -563,10 +567,23 @@ def infer_variable_types(self, ignore=None, link_vars=None):

     def update_data(self, df=None, data=None, already_sorted=False,
                     reindex=True, recalculate_last_time_indexes=True):

Contributor comment (on the signature): looks like the data argument is never actually used

+        to_check = None
+        if df is not None:
+            to_check = df
+        elif data is not None:
+            to_check = data['df']
+
+        if to_check is not None and len(to_check.columns) != len(self.variables):
+            raise ValueError("Updated dataframe contains {} columns, expecting {}".format(len(to_check.columns),
+                                                                                          len(self.variables)))
+        for v in self.variables:
+            if v.id not in to_check.columns:
+                raise ValueError("Updated dataframe is missing new {} column".format(v.id))
         if data is not None:
             self.data = data
         elif df is not None:
             self.df = df
+        self.df = self.df[[v.id for v in self.variables]]
         self.set_index(self.index)
         self.set_time_index(self.time_index, already_sorted=already_sorted)
         self.set_secondary_time_index(self.secondary_time_index)
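The new guard makes update_data fail fast when the replacement dataframe's columns no longer match the entity's variables, instead of silently storing mismatched data. A sketch of the failure mode, assuming an entity "items" whose variables are ["id", "value"] (names hypothetical):

new_df = es["items"].df.drop(columns=["value"])
try:
    es["items"].update_data(df=new_df)
except ValueError as err:
    # Prints: "Updated dataframe contains 1 columns, expecting 2"
    print(err)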
@@ -684,7 +701,7 @@ def set_time_index(self, variable_id, already_sorted=False):
             # sort by time variable, then by index
             self.df.sort_values([variable_id, self.index], inplace=True)
 
-        t = vtypes.TimeIndex
+        t = vtypes.NumericTimeIndex
         if col_is_datetime(self.df[variable_id]):
             t = vtypes.DatetimeTimeIndex
         self.convert_variable_type(variable_id, t, convert_data=False)
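With this one-line change, a time index that is not datetime-typed gets the concrete NumericTimeIndex type rather than the abstract TimeIndex, presumably so serialization can round-trip a concrete variable type. A hedged sketch, assuming NumericTimeIndex is re-exported from featuretools.variable_types:

import pandas as pd
import featuretools as ft
from featuretools.variable_types import NumericTimeIndex

df = pd.DataFrame({"id": [0, 1], "t": [100, 200]})  # integer "time" column
es = ft.EntitySet(id="numeric_time")
es.entity_from_dataframe(entity_id="log", dataframe=df, index="id", time_index="t")
assert isinstance(es["log"]["t"], NumericTimeIndex)  # not the abstract TimeIndex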
141 changes: 71 additions & 70 deletions featuretools/entityset/entityset.py
@@ -10,7 +10,10 @@

 from .entity import Entity
 from .relationship import Relationship
-from .serialization import read_pickle, to_pickle
+from .serialization import (load_entity_data,
+                            read_parquet,
+                            read_pickle,
+                            write_entityset)
 
 import featuretools.variable_types.variable as vtypes
 from featuretools.utils.gen_utils import make_tqdm_iterator
@@ -165,22 +168,22 @@ def entities(self):

     @property
     def metadata(self):
-        '''Defined as a property because an EntitySet's metadata
-        is used in many places, for instance, for each feature in a feature list.
+        '''An EntitySet's metadata is used in many places, for instance,
+        for each feature in a feature list.

Contributor comment (on the docstring): the first line of the doc string should be a one liner defining metadata.

         To prevent using copying the full metadata object to each feature,
         we generate a new metadata object and check if it's the same as the existing one,
         and if it is return the existing one. Thus, all features in the feature list
         would reference the same object, rather than copies. This saves a lot of memory
         '''
+        new_metadata = self.from_metadata(self.create_metadata_dict(),
+                                          load_data=False)
         if self._metadata is None:
-            self._metadata = self._gen_metadata()
+            self._metadata = new_metadata
         else:
-            new_metadata = self._gen_metadata()
             # Don't want to keep making new copies of metadata
             # Only make a new one if something was changed
             if not self._metadata.__eq__(new_metadata):
                 self._metadata = new_metadata

Contributor comment (on the __eq__ check): why not self._metadata != new_metadata?

         return self._metadata
 
     @property
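On the reviewer's question about `!=`: under Python 2, which this PR still supports (see the "tests pass in py2 and py3" commit), defining __eq__ does not automatically provide __ne__, so `!=` falls back to the default object comparison and treats two distinct instances as unequal; the "added __ne__ method to entityset" commit addresses exactly that. A minimal standalone sketch of the pitfall, not the EntitySet implementation itself:

class Box(object):
    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        return isinstance(other, Box) and self.x == other.x

    def __ne__(self, other):
        # Required on Python 2: without it, != ignores __eq__ and falls
        # back to the default comparison. Python 3 derives __ne__ from __eq__.
        return not self.__eq__(other)

assert Box(1) == Box(1)
assert not (Box(1) != Box(1))  # on py2 this would fail without __ne__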
@@ -192,13 +195,74 @@ def is_metadata(self):
         return all(e.df.empty for e in self.entity_dict.values())
 
     def to_pickle(self, path):
-        to_pickle(self, path)
+        write_entityset(self, path, to_parquet=False)
         return self
 
+    def to_parquet(self, path):
+        write_entityset(self, path, to_parquet=True)
+        return self
+
     @classmethod
     def read_pickle(cls, path):
         return read_pickle(path)
 
+    @classmethod
+    def read_parquet(cls, path):
+        return read_parquet(path)

Contributor comment (on read_parquet): we should be like pandas and make the read_parquet method defined on the featuretools module, not the EntitySet class

Contributor comment: let's update api reference too
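Taken together, the new methods give a pandas-style round trip. A sketch, under the assumption that the target path is a location holding the serialized dataframes plus JSON metadata (per the "saving metadata in a json" commit); names and paths are illustrative:

import pandas as pd
import featuretools as ft

df = pd.DataFrame({"id": [0, 1, 2], "value": [10, 20, 30]})
es = ft.EntitySet(id="example")
es.entity_from_dataframe(entity_id="items", dataframe=df, index="id")

es.to_parquet("/tmp/example_es")          # parquet dataframes + JSON metadata
es2 = ft.read_parquet("/tmp/example_es")  # module-level loader added by this PR
assert es2["items"].df.shape == es["items"].df.shape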

+    def create_metadata_dict(self):
+        return {
+            'id': self.id,
+            'relationships': [{
+                'parent_entity': r.parent_entity.id,
+                'parent_variable': r.parent_variable.id,
+                'child_entity': r.child_entity.id,
+                'child_variable': r.child_variable.id,
+            } for r in self.relationships],
+            'entity_dict': {eid: {
+                'index': e.index,
+                'time_index': e.time_index,
+                'secondary_time_index': e.secondary_time_index,
+                'encoding': e.encoding,
+                'variables': {
+                    v.id: v.create_metadata_dict()
+                    for v in e.variables
+                },
+                'has_last_time_index': e.last_time_index is not None
+            } for eid, e in self.entity_dict.items()},
+        }
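For orientation, here is roughly what create_metadata_dict returns for the one-entity set from the round-trip sketch above. Values are illustrative, and the per-variable dicts come from Variable.create_metadata_dict, which is outside this diff:

example_metadata = {
    'id': 'example',
    'relationships': [],   # one dict of entity/variable ids per relationship
    'entity_dict': {
        'items': {
            'index': 'id',
            'time_index': None,
            'secondary_time_index': None,  # illustrative
            'encoding': None,
            'variables': {},               # v.id -> v.create_metadata_dict()
            'has_last_time_index': False,
        },
    },
}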

+    @classmethod
+    def from_metadata(cls, metadata, root=None, load_data=False):
+        es = EntitySet(metadata['id'])
+        set_last_time_indexes = False
+        add_interesting_values = False
+        for eid, entity in metadata['entity_dict'].items():
+            df, variable_types = load_entity_data(entity, root=root,
+                                                  dummy=not load_data)
+            if any(v['interesting_values'] is not None and len(v['interesting_values'])
+                   for v in entity['variables'].values()):
+                add_interesting_values = True
+            es.entity_from_dataframe(eid,
+                                     df,
+                                     index=entity['index'],
+                                     time_index=entity['time_index'],
+                                     secondary_time_index=entity['secondary_time_index'],
+                                     encoding=entity['encoding'],
+                                     variable_types=variable_types)
+            if entity['has_last_time_index']:
+                set_last_time_indexes = True
+        for rel in metadata['relationships']:
+            es.add_relationship(Relationship(
+                es[rel['parent_entity']][rel['parent_variable']],
+                es[rel['child_entity']][rel['child_variable']],
+            ))
+        if set_last_time_indexes:
+            es.add_last_time_indexes()
+        if add_interesting_values:
+            es.add_interesting_values()
+        return es

Contributor comment (on add_interesting_values): add exact interesting values that were serialized
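from_metadata is also what the metadata property above relies on: with the default load_data=False, load_entity_data is called with dummy=True, so the rebuilt EntitySet carries the schema but, presumably, empty stand-in dataframes — which is what the is_metadata property checks. A sketch, reusing the es from the round-trip example:

meta = es.create_metadata_dict()
es_meta = ft.EntitySet.from_metadata(meta)  # load_data=False by default
assert es_meta.is_metadata                  # dataframes are empty stand-ins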

     ###########################################################################
     # Public getter/setter methods #########################################
     ###########################################################################
@@ -1102,69 +1166,6 @@ def gen_relationship_var(self, child_eid, parent_eid):
     # Private methods ######################################################
     ###########################################################################
 
-    def _gen_metadata(self):
-        new_entityset = object.__new__(EntitySet)
-        new_entityset_dict = {}
-        for k, v in self.__dict__.items():
-            if k not in ["entity_dict", "relationships"]:
-                new_entityset_dict[k] = v
-        new_entityset_dict["entity_dict"] = {}
-        for eid, e in self.entity_dict.items():
-            metadata_e = self._entity_metadata(e)
-            new_entityset_dict['entity_dict'][eid] = metadata_e
-        new_entityset_dict["relationships"] = []
-        for r in self.relationships:
-            metadata_r = self._relationship_metadata(r)
-            new_entityset_dict['relationships'].append(metadata_r)
-        new_entityset.__dict__ = copy.deepcopy(new_entityset_dict)
-        for e in new_entityset.entity_dict.values():
-            e.entityset = new_entityset
-            for v in e.variables:
-                v.entity = new_entityset[v.entity_id]
-        for r in new_entityset.relationships:
-            r.entityset = new_entityset
-        return new_entityset
-
-    @classmethod
-    def _entity_metadata(cls, e):
-        new_dict = {}
-        for k, v in e.__dict__.items():
-            if k not in ["data", "entityset", "variables"]:
-                new_dict[k] = v
-        new_dict["data"] = {
-            "df": e.df.head(0),
-            "last_time_index": None,
-            "indexed_by": {}
-        }
-        new_dict["variables"] = [cls._variable_metadata(v)
-                                 for v in e.variables]
-        new_dict = copy.deepcopy(new_dict)
-        new_entity = object.__new__(Entity)
-        new_entity.__dict__ = new_dict
-        return new_entity
-
-    @classmethod
-    def _relationship_metadata(cls, r):
-        new_dict = {}
-        for k, v in r.__dict__.items():
-            if k != "entityset":
-                new_dict[k] = v
-        new_dict = copy.deepcopy(new_dict)
-        new_r = object.__new__(Relationship)
-        new_r.__dict__ = new_dict
-        return new_r
-
-    @classmethod
-    def _variable_metadata(cls, var):
-        new_dict = {}
-        for k, v in var.__dict__.items():
-            if k != "entity":
-                new_dict[k] = v
-        new_dict = copy.deepcopy(new_dict)
-        new_v = object.__new__(type(var))
-        new_v.__dict__ = new_dict
-        return new_v

     def _import_from_dataframe(self,
                                entity_id,
                                dataframe,