Improve Entity Set Serialization #361
b730fcf to 61badaa
61badaa to 0e96b5b
Codecov Report
@@ Coverage Diff @@
## master #361 +/- ##
==========================================
+ Coverage 96.31% 96.45% +0.14%
==========================================
Files 96 98 +2
Lines 8543 8611 +68
==========================================
+ Hits 8228 8306 +78
+ Misses 315 305 -10
I completed the adjustments. Let me know if anything needs modification. Thanks.
@jeff-hernandez there are a lot of places in this PR where you don't change the code but only change the formatting, which makes the diff a bit harder to review. Can you go through and revert the formatting-only changes?
featuretools/entityset/entityset.py
Outdated
@@ -147,8 +149,8 @@ def metadata(self):
would reference the same object, rather than copies. This saves a lot of memory |
Can we update the docstring for this method? We no longer do the check it describes; we simply check that `self._metadata is None`. It gets recomputed when `self.reset_metadata()` is called.
Yes. Would the following description work:
'''Returns the EntitySet's metadata and recomputes if the metadata does not exist.'''
looks good
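For readers following the thread, the behaviour the new docstring describes can be sketched as below. This is a minimal illustration, not the actual `EntitySet` code: the class name `CachedMetadataSketch`, the `_compute_metadata` helper, and the `compute_count` counter are hypothetical, and it assumes `reset_metadata()` simply clears the cached value so that the next access recomputes it.

```python
class CachedMetadataSketch:
    """Hypothetical stand-in for EntitySet; only the caching logic is shown."""

    def __init__(self, entity_ids):
        self.entity_ids = list(entity_ids)
        self._metadata = None
        self.compute_count = 0  # illustration only: counts recomputations

    def _compute_metadata(self):
        # Hypothetical helper: builds a lightweight description of the data.
        self.compute_count += 1
        return {'entities': list(self.entity_ids)}

    @property
    def metadata(self):
        '''Returns the metadata and recomputes it if it does not exist.'''
        if self._metadata is None:
            self._metadata = self._compute_metadata()
        return self._metadata

    def reset_metadata(self):
        # Assumption: resetting clears the cache; the next access recomputes.
        self._metadata = None


es = CachedMetadataSketch(['customers', 'sessions'])
first = es.metadata       # computed on first access
second = es.metadata      # served from the cache
es.reset_metadata()
third = es.metadata       # computed again after the reset
```

Repeated accesses return the same object until the cache is reset, which matches the memory-saving intent mentioned in the surrounding docstring.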
@@ -133,6 +133,9 @@ def _dataframes_equal(df1, df2):
    elif not df1.empty and df2.empty:
        return False
    elif not df1.empty and not df2.empty:
        for df in [df1, df2]:
Can you explain this addition? Isn't it possible for something to be dtype `object` but not castable to `unicode`?
A pandas DataFrame can have a column of dtype `object` in which each element is a `tuple`. After the DataFrame is written to disk and read back, each element in that column has been converted to a string, while the column itself remains dtype `object`. As a result, `_dataframes_equal` does not see the columns as equal, because one element is a `tuple` and the other is a string, even though their string representations are identical. Casting the elements of `object` columns to `unicode` in both DataFrames makes the comparison more reliable by ensuring the elements are of the same type; the columns themselves remain dtype `object`. I don't know of a case where elements of an `object` column cannot be cast to `unicode`.
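The round-trip effect described above can be reproduced in a few lines. The sketch below swaps in the standard-library `csv` module in place of pandas so that it runs without extra dependencies; the mechanism is the same, since a text format stores only the string form of each value. The sample rows and variable names are made up for illustration.

```python
import csv
import io

# Sample data (hypothetical): one column holds tuples.
rows = [{'id': 0, 'point': (1, 2)}, {'id': 1, 'point': (3, 4)}]

# Write the rows to an in-memory CSV file; tuples are stored as their str() form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['id', 'point'])
writer.writeheader()
writer.writerows(rows)

# Read the rows back; every value now comes back as a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

original_points = [row['point'] for row in rows]
restored_points = [row['point'] for row in restored]

# The element types differ (tuple vs. str), so a strict comparison fails...
types_match = all(type(a) is type(b) for a, b in zip(original_points, restored_points))

# ...but after casting both sides to text, the columns compare equal.
equal_as_text = [str(p) for p in original_points] == [str(p) for p in restored_points]
```

Casting both sides to text before comparing is the same idea as the change under review, applied here to plain lists rather than DataFrame columns.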
'name': self.name,
'interesting_values': self._interesting_values
'type': {
    'value': self._dtype_repr,
What do you think about renaming `_dtype_repr` to just `type_string`? It seems clearer about what it does.
I agree. I think it would be clearer to use `Variable.type_string` or even `Variable.type`.
Let's go with `type_string` for now.
Done
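A rough sketch of how the renamed attribute might feed into the serialized description is shown below. Everything here is an assumption for illustration: the class `VariableSketch`, the method name `to_data_description`, and in particular the rule that `type_string` is the lower-cased class name are not confirmed by this thread.

```python
class VariableSketch:
    """Hypothetical stand-in for a Variable class."""

    def __init__(self, name, interesting_values=None):
        self.name = name
        self._interesting_values = interesting_values or []

    @property
    def type_string(self):
        # Assumption: the type is identified by the lower-cased class name.
        return type(self).__name__.lower()

    def to_data_description(self):
        # Mirrors the keys visible in the diff excerpt above.
        return {
            'name': self.name,
            'interesting_values': self._interesting_values,
            'type': {'value': self.type_string},
        }


class Numeric(VariableSketch):
    """Hypothetical subclass used only to show the derived type string."""


description = Numeric('amount', interesting_values=[10, 20]).to_data_description()
```

A public property named `type_string` reads naturally at the call site, which is the point of the rename discussed above.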
…into serialization
…into serialization
path (str) : Root directory to serialized entityset.
'''
from_disk = path is not None
dataframe = read_entity_data(description, path=path) if from_disk else empty_dataframe(description)
No need for one-liners; just make this:

    if path:
        dataframe = read_entity_data(description, path=path)
    else:
        dataframe = empty_dataframe(description)
Change applied in this commit.
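One detail worth noting about the suggested form: `if path:` is not strictly equivalent to `path is not None`, because an empty string is falsy. The sketch below illustrates this with stub helpers. The bodies of `read_entity_data` and `empty_dataframe` and the wrapper name `load_dataframe` are hypothetical; plain dictionaries stand in for DataFrames.

```python
def read_entity_data(description, path):
    # Stub (assumption): stands in for reading the entity's data from disk.
    return {'source': 'disk', 'path': path, 'columns': description['columns']}


def empty_dataframe(description):
    # Stub (assumption): stands in for building an empty frame from the description.
    return {'source': 'empty', 'columns': description['columns']}


def load_dataframe(description, path=None):
    # The multi-line form suggested in the review.
    if path:
        dataframe = read_entity_data(description, path=path)
    else:
        dataframe = empty_dataframe(description)
    return dataframe


description = {'columns': ['id', 'amount']}
from_default = load_dataframe(description)
from_disk = load_dataframe(description, path='some/dir')
from_blank = load_dataframe(description, path='')  # '' is falsy, so treated like None
```

Treating an empty path like a missing one is probably the desired behaviour here, but it is a small semantic change from the original `path is not None` check.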
featuretools/entityset/serialize.py
Outdated
location = os.path.join('data', basename)
file = os.path.join(path, location)
if format == 'csv':
    params = ['compression', 'encoding', 'index']
`sep` can be added as a valid param.
Change applied in this commit.
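The excerpt does not show how the `params` list is used. One plausible reading is that it acts as a whitelist for the keyword arguments forwarded to the CSV writer. The sketch below assumes exactly that; the helper name `select_params` and the sample keyword arguments are hypothetical.

```python
# Whitelist from the diff, with `sep` added as suggested in the review.
CSV_PARAMS = ['compression', 'encoding', 'index', 'sep']


def select_params(kwargs, params):
    # Keep only the keyword arguments named in the whitelist.
    return {key: kwargs[key] for key in params if key in kwargs}


# 'engine' is not whitelisted, so it is dropped before the write call.
user_kwargs = {'sep': '\t', 'index': False, 'engine': 'auto'}
csv_kwargs = select_params(user_kwargs, CSV_PARAMS)
```

All four whitelisted names are keyword arguments accepted by pandas `DataFrame.to_csv`, so the filtered dictionary could be passed to it with `**csv_kwargs`.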
featuretools/entityset/entityset.py
Outdated
path (str): Directory on disk to read `data_description.json`.
kwargs (keywords): Additional keyword arguments to pass as keyword arguments to the underlying deserialization method.
'''
return EntitySet.from_data_description(deserialize.read_data_description(path), **kwargs)
Let's make this two lines:

    data_description = deserialize.read_data_description(path)
    return EntitySet.from_data_description(data_description, **kwargs)
Change applied in this commit.
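To make the two-step read concrete, here is a self-contained sketch of the flow. It assumes that `read_data_description` just loads `data_description.json` from the given directory, as the docstring suggests; the class `EntitySetSketch`, the wrapper `read_entityset`, and the contents of the JSON file are hypothetical.

```python
import json
import os
import tempfile


def read_data_description(path):
    # Assumption: the description lives in `data_description.json` under `path`.
    filepath = os.path.join(path, 'data_description.json')
    with open(filepath) as file:
        return json.load(file)


class EntitySetSketch:
    """Hypothetical stand-in for EntitySet."""

    def __init__(self, id):
        self.id = id

    @classmethod
    def from_data_description(cls, description, **kwargs):
        # A real implementation would also rebuild entities and relationships.
        return cls(description['id'])


def read_entityset(path, **kwargs):
    # The two-line form requested in the review.
    data_description = read_data_description(path)
    return EntitySetSketch.from_data_description(data_description, **kwargs)


# Write a tiny description to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, 'data_description.json'), 'w') as file:
        json.dump({'id': 'demo', 'entities': {}}, file)
    entityset = read_entityset(root)
```

Naming the intermediate `data_description` also gives a convenient place to inspect or log the parsed description while debugging.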
* bump version number
* update changelog
…into serialization
…s into serialization
This is the implementation for serializing entity sets. I will be adjusting the code according to the integration tests.