Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Entity Set Serialization #361

Merged
merged 96 commits into from Feb 25, 2019
Merged

Conversation

jeff-hernandez
Copy link
Contributor

@jeff-hernandez jeff-hernandez commented Dec 30, 2018

This is the implementation for serializing entity sets. I will be adjusting the code according to the integration tests.

@jeff-hernandez jeff-hernandez force-pushed the serialization branch 2 times, most recently from b730fcf to 61badaa Compare December 30, 2018 20:18
@codecov
Copy link

codecov bot commented Dec 30, 2018

Codecov Report

Merging #361 into master will increase coverage by 0.14%.
The diff coverage is 99.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #361      +/-   ##
==========================================
+ Coverage   96.31%   96.45%   +0.14%     
==========================================
  Files          96       98       +2     
  Lines        8543     8611      +68     
==========================================
+ Hits         8228     8306      +78     
+ Misses        315      305      -10
Impacted Files Coverage Δ
featuretools/tests/testing_utils/mock_ds.py 87.76% <100%> (ø) ⬆️
featuretools/entityset/entity.py 96.1% <100%> (+0.35%) ⬆️
featuretools/entityset/entityset.py 95.04% <100%> (-0.2%) ⬇️
featuretools/tests/entityset_tests/test_es.py 100% <100%> (+0.74%) ⬆️
featuretools/entityset/api.py 100% <100%> (ø) ⬆️
featuretools/entityset/deserialize.py 100% <100%> (ø)
featuretools/entityset/serialize.py 100% <100%> (ø)
...etools/tests/entityset_tests/test_serialization.py 100% <100%> (ø)
featuretools/utils/wrangle.py 73.91% <100%> (-5.87%) ⬇️
featuretools/variable_types/variable.py 98.02% <98.43%> (+0.72%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce1a335...39ab514. Read the comment docs.

@jeff-hernandez
Copy link
Contributor Author

I completed the adjustments. Let me know if anything needs modification. Thanks.

@kmax12
Copy link
Contributor

kmax12 commented Jan 8, 2019

@jeff-hernandez there are a lot of places in this PR where you don't change the code, but instead are just changing the formatting.

it makes the diff a bit harder to review. can you go through and revert the formatting only changes back?

@@ -147,8 +149,8 @@ def metadata(self):
would reference the same object, rather than copies. This saves a lot of memory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we update the doc string for this method?

we no longer do the check it says. we simply check that self._metadata is None. it gets recomputed when self.reset_metadata() is called

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Would the following description work:

'''Returns the EntitySet's metadata and recomputes if the metadata does not exist.'''

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

featuretools/entityset/entityset.py Outdated Show resolved Hide resolved
featuretools/entityset/entityset.py Outdated Show resolved Hide resolved
@@ -133,6 +133,9 @@ def _dataframes_equal(df1, df2):
elif not df1.empty and df2.empty:
return False
elif not df1.empty and not df2.empty:
for df in [df1, df2]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain this addition? isn't it possible for something to be dtype object but not able to cast as unicode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pandas dataframe could have a column type object where each element in the column is type tuple. After writing and reading this dataframe from disk, each element in the column has been converted to type string while the column remains as type object. As a result, _dataframes_equal does not see the columns as being equal, because one element is type tuple while the other element is type string, although the string representations of both elements are identical. Casting elements of column type object to unicode in both data frames will better equate whether the columns are equal by ensuring elements inside the columns are of the same type. The column remains as an object. I don't know of a case where elements of a column type object are not able to cast as type unicode.

featuretools/tests/testing_utils/mock_ds.py Show resolved Hide resolved
featuretools/entityset/entityset.py Outdated Show resolved Hide resolved
'name': self.name,
'interesting_values': self._interesting_values
'type': {
'value': self._dtype_repr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think about renaming _dtype_repr to just type_string? seems like it is more clear what it does

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I think it would be more clear to use Variable.type_string or even Variable.type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's go with type_string for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

featuretools/variable_types/variable.py Outdated Show resolved Hide resolved
featuretools/entityset/serialization.py Outdated Show resolved Hide resolved
featuretools/entityset/serialization.py Outdated Show resolved Hide resolved
path (str) : Root directory to serialized entityset.
'''
from_disk = path is not None
dataframe = read_entity_data(description, path=path) if from_disk else empty_dataframe(description)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for oneliners, just make this

if path:
   dataframe = read_entity_data(description, path=path)
else:
   dataframe = empty_dataframe(description)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change applied in this commit.

location = os.path.join('data', basename)
file = os.path.join(path, location)
if format == 'csv':
params = ['compression', 'encoding', 'index']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sep can be added as a valid parm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change applied in this commit.

path (str): Directory on disk to read `data_description.json`.
kwargs (keywords): Additional keyword arguments to pass as keyword arguments to the underlying deserialization method.
'''
return EntitySet.from_data_description(deserialize.read_data_description(path), **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make this two lines

data_description = deserialize.read_data_description(path)
return EntitySet.from_data_description(data_description, **kwargs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change applied in this commit.

@kmax12 kmax12 changed the title Serialization Improve Entity Set Serialization Feb 25, 2019
@kmax12 kmax12 merged commit 4bf6709 into alteryx:master Feb 25, 2019
@rwedge rwedge mentioned this pull request Mar 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants