Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Entity Set Serialization #361

Merged
merged 96 commits into from Feb 25, 2019

Conversation

Projects
None yet
3 participants
@jeff-hernandez
Copy link
Collaborator

commented Dec 30, 2018

This is the implementation for serializing entity sets. I will be adjusting the code according to the integration tests.

@jeff-hernandez jeff-hernandez force-pushed the jeff-hernandez:serialization branch 2 times, most recently from b730fcf to 61badaa Dec 30, 2018

@jeff-hernandez jeff-hernandez force-pushed the jeff-hernandez:serialization branch from 61badaa to 0e96b5b Dec 30, 2018

@codecov

This comment has been minimized.

Copy link

commented Dec 30, 2018

Codecov Report

Merging #361 into master will increase coverage by 0.14%.
The diff coverage is 99.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #361      +/-   ##
==========================================
+ Coverage   96.31%   96.45%   +0.14%     
==========================================
  Files          96       98       +2     
  Lines        8543     8611      +68     
==========================================
+ Hits         8228     8306      +78     
+ Misses        315      305      -10
Impacted Files Coverage Δ
featuretools/tests/testing_utils/mock_ds.py 87.76% <100%> (ø) ⬆️
featuretools/entityset/entity.py 96.1% <100%> (+0.35%) ⬆️
featuretools/entityset/entityset.py 95.04% <100%> (-0.2%) ⬇️
featuretools/tests/entityset_tests/test_es.py 100% <100%> (+0.74%) ⬆️
featuretools/entityset/api.py 100% <100%> (ø) ⬆️
featuretools/entityset/deserialize.py 100% <100%> (ø)
featuretools/entityset/serialize.py 100% <100%> (ø)
...etools/tests/entityset_tests/test_serialization.py 100% <100%> (ø)
featuretools/utils/wrangle.py 73.91% <100%> (-5.87%) ⬇️
featuretools/variable_types/variable.py 98.02% <98.43%> (+0.72%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce1a335...39ab514. Read the comment docs.

@jeff-hernandez

This comment has been minimized.

Copy link
Collaborator Author

commented Dec 30, 2018

I completed the adjustments. Let me know if anything needs modification. Thanks.

@kmax12

This comment has been minimized.

Copy link
Member

commented Jan 8, 2019

@jeff-hernandez there are a lot of places in this PR where you don't change the code, but instead are just changing the formatting.

it makes the diff a bit harder to review. can you go through and revert the formatting only changes back?

@@ -147,8 +149,8 @@ def metadata(self):
would reference the same object, rather than copies. This saves a lot of memory

This comment has been minimized.

Copy link
@kmax12

kmax12 Jan 10, 2019

Member

can we update the doc string for this method?

we no longer do the check it says. we simply check that self._metadata is None. it gets recomputed when self.reset_metadata() is called

This comment has been minimized.

Copy link
@jeff-hernandez

jeff-hernandez Jan 11, 2019

Author Collaborator

Yes. Would the following description work:

'''Returns the EntitySet's metadata and recomputes if the metadata does not exist.'''

This comment has been minimized.

Copy link
@kmax12

kmax12 Jan 21, 2019

Member

looks good

Show resolved Hide resolved featuretools/entityset/entityset.py Outdated
Show resolved Hide resolved featuretools/entityset/entityset.py Outdated
@@ -133,6 +133,9 @@ def _dataframes_equal(df1, df2):
elif not df1.empty and df2.empty:
return False
elif not df1.empty and not df2.empty:
for df in [df1, df2]:

This comment has been minimized.

Copy link
@kmax12

kmax12 Jan 10, 2019

Member

can you explain this addition? isn't it possible for something to be dtype object but not able to cast as unicode

This comment has been minimized.

Copy link
@jeff-hernandez

jeff-hernandez Jan 14, 2019

Author Collaborator

A pandas dataframe could have a column type object where each element in the column is type tuple. After writing and reading this dataframe from disk, each element in the column has been converted to type string while the column remains as type object. As a result, _dataframes_equal does not see the columns as being equal, because one element is type tuple while the other element is type string, although the string representations of both elements are identical. Casting elements of column type object to unicode in both data frames will better equate whether the columns are equal by ensuring elements inside the columns are of the same type. The column remains as an object. I don't know of a case where elements of a column type object are not able to cast as type unicode.

Show resolved Hide resolved featuretools/tests/testing_utils/mock_ds.py
Show resolved Hide resolved featuretools/entityset/entityset.py Outdated
'name': self.name,
'interesting_values': self._interesting_values
'type': {
'value': self._dtype_repr,

This comment has been minimized.

Copy link
@kmax12

kmax12 Jan 10, 2019

Member

what do you think about renaming _dtype_repr to just type_string? seems like it is more clear what it does

This comment has been minimized.

Copy link
@jeff-hernandez

jeff-hernandez Jan 14, 2019

Author Collaborator

I agree. I think it would be more clear to use Variable.type_string or even Variable.type

This comment has been minimized.

Copy link
@kmax12

kmax12 Jan 22, 2019

Member

let's go with type_string for now

This comment has been minimized.

Copy link
@jeff-hernandez

jeff-hernandez Jan 25, 2019

Author Collaborator

Done

Show resolved Hide resolved featuretools/variable_types/variable.py Outdated
Show resolved Hide resolved featuretools/entityset/serialization.py Outdated
Show resolved Hide resolved featuretools/entityset/serialization.py Outdated
path (str) : Root directory to serialized entityset.
'''
from_disk = path is not None
dataframe = read_entity_data(description, path=path) if from_disk else empty_dataframe(description)

This comment has been minimized.

Copy link
@kmax12

kmax12 Feb 17, 2019

Member

no need for oneliners, just make this

if path:
   dataframe = read_entity_data(description, path=path)
else:
   dataframe = empty_dataframe(description)

This comment has been minimized.

Copy link
@jeff-hernandez

jeff-hernandez Feb 19, 2019

Author Collaborator

Change applied in this commit.

location = os.path.join('data', basename)
file = os.path.join(path, location)
if format == 'csv':
params = ['compression', 'encoding', 'index']

This comment has been minimized.

Copy link
@kmax12

kmax12 Feb 17, 2019

Member

sep can be added as a valid parm

This comment has been minimized.

Copy link
@jeff-hernandez

jeff-hernandez Feb 19, 2019

Author Collaborator

Change applied in this commit.

path (str): Directory on disk to read `data_description.json`.
kwargs (keywords): Additional keyword arguments to pass as keyword arguments to the underlying deserialization method.
'''
return EntitySet.from_data_description(deserialize.read_data_description(path), **kwargs)

This comment has been minimized.

Copy link
@kmax12

kmax12 Feb 17, 2019

Member

let's make this two lines

data_description = deserialize.read_data_description(path)
return EntitySet.from_data_description(data_description, **kwargs)

This comment has been minimized.

Copy link
@jeff-hernandez

jeff-hernandez Feb 19, 2019

Author Collaborator

Change applied in this commit.

@kmax12 kmax12 changed the title Serialization Improve Entity Set Serialization Feb 25, 2019

@kmax12

kmax12 approved these changes Feb 25, 2019

@kmax12 kmax12 merged commit 4bf6709 into Featuretools:master Feb 25, 2019

3 checks passed

codecov/patch 99.68% of diff hit (target 96.31%)
Details
codecov/project 96.45% (+0.14%) compared to ce1a335
Details
license/cla Contributor License Agreement is signed.
Details

@rwedge rwedge referenced this pull request Mar 29, 2019

Merged

v0.7.0 #477

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.