Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up entity creation logic #336

Merged
merged 32 commits into from
Dec 7, 2018
Merged

Clean up entity creation logic #336

merged 32 commits into from
Dec 7, 2018

Conversation

rwedge
Copy link
Contributor

@rwedge rwedge commented Dec 3, 2018

The idea behind this PR is to make it easier to find where in the code we handle data type and variable type conversion when creating a new entity. Currently it's handled in two places: EntitySet._import_from_dataframe and Entity.__init__.

This PR moves the relevant logic in EntitySet._import_from_dataframe into Entity.__init__ and removes a couple statements that try to infer variable types that can be handled by Entity.infer_variable_types instead.

@codecov
Copy link

codecov bot commented Dec 3, 2018

Codecov Report

Merging #336 into master will increase coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #336      +/-   ##
==========================================
+ Coverage   95.28%   95.34%   +0.05%     
==========================================
  Files          74       74              
  Lines        7783     7792       +9     
==========================================
+ Hits         7416     7429      +13     
+ Misses        367      363       -4
Impacted Files Coverage Δ
featuretools/tests/testing_utils/mock_ds.py 87.4% <ø> (ø) ⬆️
featuretools/entityset/entity.py 95.65% <100%> (+1.71%) ⬆️
featuretools/entityset/entityset.py 95.73% <100%> (-0.23%) ⬇️
featuretools/tests/entityset_tests/test_es.py 99.22% <100%> (+0.01%) ⬆️
featuretools/tests/entityset_tests/test_entity.py 100% <100%> (ø) ⬆️
featuretools/utils/gen_utils.py 85.71% <0%> (-2.39%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9fae88b...e356a72. Read the comment docs.

@@ -93,9 +116,8 @@ def __init__(self, id, df, entityset, variable_types=None,

inferred_variable_types = self.infer_variable_types(ignore=list(variable_types.keys()),
link_vars=link_vars)

for var_id, desired_type in variable_types.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can simplify loop to inferred_variable_types.update(variable_types)


make_index (bool, optional) : If True, assume index does not exist as a column in
dataframe, and create a new column of that name using integers the (0, len(dataframe)).
Otherwise, assume index exists in dataframe.
"""
assert is_string(id), "Entity id must be a string"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a function from line 69 - 76 called _validate_entity_params(id, time_index, columns)

@@ -669,3 +681,21 @@ def col_is_datetime(col):
else:
return True
return False


def create_index(index, make_index, df):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add short comment describing function

something like "handles index creation logic"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should also be private

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

index = df.columns[0]
elif make_index and index in df.columns:
raise RuntimeError("Cannot make index: index variable already present")
elif make_index or index not in df.columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this last elif block can be simplified to

elif index not in df.columns:
    if not make_index:
        logger.warning("index %s not found in dataframe, creating new "
                       "integer column", index)
    df.insert(0, index, range(0, len(df)))
    created_index = index

at this point if index is in df.columns then make_index must be false and we should skip this block. this is because we handle the case where make_index and index in df.columns is True above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if my logic is correct might be helpful to leave a comment about this entire if/elif block so people can quickly understand

featuretools/entityset/entityset.py Show resolved Hide resolved
created_index, index, df = create_index(index, make_index, df)
if index not in variable_types:
variable_types[index] = vtypes.Index

self.data = {"df": df,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid setting the data like this? perhaps by just making df an argument to some of the other functions?

it feels wrong to set the data here and then call self.update_data(...) a second time at the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just call the relevant parts of self.update_data separately and skip calling update_data?. It already expects the indexes to be set and it's a little strange to be "updating" the dataframe on initialization

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, i agree with that approach

self.convert_all_variable_data(inferred_variable_types)
variable_types = variable_types or {}
secondary_time_index = secondary_time_index or {}
self.create_variables(variable_types, index, time_index, secondary_time_index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should be self._create_variables since a user should never need call it, right?

self.convert_all_variable_data(inferred_variable_types)
variable_types = variable_types or {}
secondary_time_index = secondary_time_index or {}
self.create_variables(variable_types, index, time_index, secondary_time_index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think secondary_time_index is getting modified by reference within this function (actually when it gets passed to infer_variable_types). if that's true, i'd say let's avoid that or at least leave a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code here, right?

secondary_time_index = secondary_time_index or {}
for ti, cols in secondary_time_index.items():
    if ti not in cols:
        cols.append(ti)

The first line should be thrown out and mabye the rest should go in _validate_entity_params with either a warning or an error if the secondary time index column isn't included.

self.variables = [index_variable] + [v for v in variables
if v.id != index]

def infer_variable_types(self, variable_types, time_index, secondary_time_index):
"""Extracts the variables from a dataframe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update doc string

@kmax12 kmax12 changed the title [WIP] Clean up entity creation logic Clean up entity creation logic Dec 7, 2018
@kmax12
Copy link
Contributor

kmax12 commented Dec 7, 2018

Looks good to merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants