New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up entity creation logic #336

Merged
merged 32 commits into from Dec 7, 2018

Conversation

Projects
None yet
2 participants
@rwedge
Copy link
Contributor

rwedge commented Dec 3, 2018

The idea behind this PR is to make it easier to find where in the code we handle data type and variable type conversion when creating a new entity. Currently it's handled in two places: EntitySet._import_from_dataframe and Entity.__init__.

This PR moves the relevant logic in EntitySet._import_from_dataframe into Entity.__init__ and removes a couple statements that try to infer variable types that can be handled by Entity.infer_variable_types instead.

@codecov

This comment has been minimized.

Copy link

codecov bot commented Dec 3, 2018

Codecov Report

Merging #336 into master will increase coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #336      +/-   ##
==========================================
+ Coverage   95.28%   95.34%   +0.05%     
==========================================
  Files          74       74              
  Lines        7783     7792       +9     
==========================================
+ Hits         7416     7429      +13     
+ Misses        367      363       -4
Impacted Files Coverage Δ
featuretools/tests/testing_utils/mock_ds.py 87.4% <ø> (ø) ⬆️
featuretools/entityset/entity.py 95.65% <100%> (+1.71%) ⬆️
featuretools/entityset/entityset.py 95.73% <100%> (-0.23%) ⬇️
featuretools/tests/entityset_tests/test_es.py 99.22% <100%> (+0.01%) ⬆️
featuretools/tests/entityset_tests/test_entity.py 100% <100%> (ø) ⬆️
featuretools/utils/gen_utils.py 85.71% <0%> (-2.39%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9fae88b...e356a72. Read the comment docs.

@@ -93,9 +116,8 @@ def __init__(self, id, df, entityset, variable_types=None,

inferred_variable_types = self.infer_variable_types(ignore=list(variable_types.keys()),
link_vars=link_vars)

for var_id, desired_type in variable_types.items():

This comment has been minimized.

@kmax12

kmax12 Dec 4, 2018

Member

can simplify loop to inferred_variable_types.update(variable_types)

make_index (bool, optional) : If True, assume index does not exist as a column in
dataframe, and create a new column of that name using integers the (0, len(dataframe)).
Otherwise, assume index exists in dataframe.
"""
assert is_string(id), "Entity id must be a string"

This comment has been minimized.

@kmax12

kmax12 Dec 5, 2018

Member

maybe add a function from line 69 - 76 called _validate_entity_params(id, time_index, columns)

@@ -669,3 +681,21 @@ def col_is_datetime(col):
else:
return True
return False


def create_index(index, make_index, df):

This comment has been minimized.

@kmax12

kmax12 Dec 5, 2018

Member

add short comment describing function

something like "handles index creation logic"

This comment has been minimized.

@rwedge

rwedge Dec 5, 2018

Contributor

Maybe this should also be private

This comment has been minimized.

@kmax12
index = df.columns[0]
elif make_index and index in df.columns:
raise RuntimeError("Cannot make index: index variable already present")
elif make_index or index not in df.columns:

This comment has been minimized.

@kmax12

kmax12 Dec 5, 2018

Member

i think this last elif block can be simplified to

elif index not in df.columns:
    if not make_index:
        logger.warning("index %s not found in dataframe, creating new "
                       "integer column", index)
    df.insert(0, index, range(0, len(df)))
    created_index = index

at this point if index is in df.columns then make_index must be false and we should skip this block. this is because we handle the case where make_index and index in df.columns is True above

This comment has been minimized.

@kmax12

kmax12 Dec 5, 2018

Member

if my logic is correct might be helpful to leave a comment about this entire if/elif block so people can quickly understand

Show resolved Hide resolved featuretools/entityset/entityset.py
created_index, index, df = create_index(index, make_index, df)
if index not in variable_types:
variable_types[index] = vtypes.Index

self.data = {"df": df,

This comment has been minimized.

@kmax12

kmax12 Dec 5, 2018

Member

can we avoid setting the data like this? perhaps by just making df an argument to some of the other functions?

it feels wrong to set the data here and then call self.update_data(...) a second time at the end

This comment has been minimized.

@rwedge

rwedge Dec 5, 2018

Contributor

should we just call the relevant parts of self.update_data separately and skip calling update_data?. It already expects the indexes to be set and it's a little strange to be "updating" the dataframe on initialization

This comment has been minimized.

@kmax12

kmax12 Dec 5, 2018

Member

ya, i agree with that approach

self.convert_all_variable_data(inferred_variable_types)
variable_types = variable_types or {}
secondary_time_index = secondary_time_index or {}
self.create_variables(variable_types, index, time_index, secondary_time_index)

This comment has been minimized.

@kmax12

kmax12 Dec 5, 2018

Member

i think this should be self._create_variables since a user should never need call it, right?

self.convert_all_variable_data(inferred_variable_types)
variable_types = variable_types or {}
secondary_time_index = secondary_time_index or {}
self.create_variables(variable_types, index, time_index, secondary_time_index)

This comment has been minimized.

@kmax12

kmax12 Dec 5, 2018

Member

i think secondary_time_index is getting modified by reference within this function (actually when it gets passed to infer_variable_types). if that's true, i'd say let's avoid that or at least leave a comment.

This comment has been minimized.

@rwedge

rwedge Dec 5, 2018

Contributor

This code here, right?

secondary_time_index = secondary_time_index or {}
for ti, cols in secondary_time_index.items():
    if ti not in cols:
        cols.append(ti)

The first line should be thrown out and mabye the rest should go in _validate_entity_params with either a warning or an error if the secondary time index column isn't included.

self.variables = [index_variable] + [v for v in variables
if v.id != index]

def infer_variable_types(self, variable_types, time_index, secondary_time_index):
"""Extracts the variables from a dataframe

This comment has been minimized.

@kmax12

kmax12 Dec 5, 2018

Member

update doc string

@kmax12 kmax12 changed the title [WIP] Clean up entity creation logic Clean up entity creation logic Dec 7, 2018

@kmax12

kmax12 approved these changes Dec 7, 2018

@kmax12

This comment has been minimized.

Copy link
Member

kmax12 commented Dec 7, 2018

Looks good to merge

@rwedge rwedge merged commit d367818 into master Dec 7, 2018

3 checks passed

codecov/patch 100% of diff hit (target 95.28%)
Details
codecov/project 95.34% (+0.05%) compared to 9fae88b
Details
license/cla Contributor License Agreement is signed.
Details

@rwedge rwedge deleted the entity-from-df-cleanup-1 branch Dec 7, 2018

@rwedge rwedge referenced this pull request Dec 17, 2018

Merged

v0.5.0 #351

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment