
Clean up EntitySet class #145


Merged
merged 56 commits into from
Jun 8, 2018

Conversation

bschreck
Contributor

@bschreck bschreck commented May 9, 2018

This PR removes unnecessary code in EntitySet, and removes BaseEntity/BaseEntitySet.

A lot of this is cleaning up duplicated or convoluted functionality. There is now only a single file for both Entity and EntitySet, and I dropped all functionality that is just a thin wrapper around pandas.

One outstanding issue with this PR: creating EntitySets now produces lots of "data type not understood" warnings. This is related to variable type checking within numpy, but I haven't pinpointed the problem.

Entity

convert_variable_types

Old:
Previously, there were several functions to convert variable types and the underlying pandas dtypes:

  • entity.py::Entity.convert_variable_types would take in a desired FT variable types dictionary, and then call self.entityset_convert_variable_type with each variable.
  • entity.py::Entity.entityset_convert_variable_type would convert underlying variable data
  • base_entity.py::BaseEntity.convert_variable_type converts both the FT type and the underlying data, calling Entity.entityset_convert_variable_type to do the data conversion

New: Similar to old but in the same file, with better names

  • Entity.convert_all_variable_data takes in desired FT variable types and calls Entity.convert_variable_data
  • Entity.convert_variable_data takes in a variable and desired type, and does underlying data conversion
  • Entity.convert_variable_type converts a variable to a new type, and optionally calls convert_variable_data to change the underlying type.
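
The layering of the three new methods can be sketched roughly as follows. This is a toy stand-in, not the featuretools implementation: the `ToyEntity` class, the plain lists used in place of a pandas dataframe, the string type labels, and the `caster` and `convert_data` parameters are all assumptions made for illustration.

```python
class ToyEntity:
    """Hypothetical stand-in for Entity; columns are plain lists, not pandas."""

    def __init__(self, columns, variable_types):
        self.columns = dict(columns)                # column name -> list of values
        self.variable_types = dict(variable_types)  # column name -> type label

    def convert_variable_data(self, name, caster):
        # Convert only the underlying data of a single column.
        self.columns[name] = [caster(value) for value in self.columns[name]]

    def convert_all_variable_data(self, casters):
        # Apply convert_variable_data to every requested column.
        for name, caster in casters.items():
            self.convert_variable_data(name, caster)

    def convert_variable_type(self, name, new_type, caster=None, convert_data=True):
        # Change the type label and, optionally, the underlying data too.
        if convert_data and caster is not None:
            self.convert_variable_data(name, caster)
        self.variable_types[name] = new_type


entity = ToyEntity({"age": ["31", "45"]}, {"age": "Text"})
entity.convert_variable_type("age", "Numeric", caster=int)
```

After the last call, both the type label and the stored values of `age` have changed; passing `convert_data=False` would change only the label.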

Indexes & sorting

Old:

  • Entity.set_index sets the index on the underlying data, asserts uniqueness, calls convert_variable_type to convert the variable (but not the data) to an Index, and then calls BaseEntity.set_index
  • BaseEntity.set_index sets the index attribute
  • Entity.set_time_index does type checking, sorts the data, and then calls BaseEntity.set_time_index
  • BaseEntity.set_time_index sets the time_index attribute
  • Entity.add_all_variable_statistics needs to be called explicitly

New: Entity.update_data makes sure the data is sorted properly. This method is used whenever the underlying data is replaced by a new dataframe

  • index and time_index attributes are first set in the __init__ method
  • Entity.update_data is called from __init__, which sets the underlying dataframe as an attribute, and calls set_index(self.index), set_time_index(self.time_index), and add_all_variable_statistics
  • Entity.set_index sets the index on the underlying data, asserts uniqueness, calls convert_variable_type to convert the variable (but not the data) to an Index, and sets the index attribute
  • Entity.set_time_index does type checking, sorts the data, and then sets the time_index attribute
  • Entity.update_data can accept either a dataframe or the whole "data" dictionary, which includes the dataframe, indexed_by, and last_time_index. It includes optional parameters to re-sort, reindex, and recompute last_time_index. It reindexes through a new method, Entity.index_data
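
The call order described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual code: `ToyEntity` is hypothetical, rows are plain dictionaries rather than a dataframe, and the variable statistics, `indexed_by`, and `last_time_index` handling are left out.

```python
class ToyEntity:
    """Hypothetical stand-in for Entity; rows are dicts instead of a dataframe."""

    def __init__(self, rows, index, time_index=None):
        # The index and time_index attributes are set first ...
        self.index = index
        self.time_index = time_index
        # ... then update_data stores the data and re-establishes the invariants.
        self.update_data(rows)

    def update_data(self, rows):
        # Used whenever the underlying data is replaced.
        self.rows = list(rows)
        self.set_index(self.index)
        self.set_time_index(self.time_index)

    def set_index(self, name):
        values = [row[name] for row in self.rows]
        assert len(values) == len(set(values)), "index is not unique"
        self.index = name

    def set_time_index(self, name):
        if name is not None:
            # Keep the rows sorted by the time index.
            self.rows.sort(key=lambda row: row[name])
        self.time_index = name


entity = ToyEntity([{"id": 1, "t": 3}, {"id": 2, "t": 1}], index="id", time_index="t")
entity.update_data([{"id": 5, "t": 9}, {"id": 6, "t": 2}])
```

Because every replacement of the data goes through `update_data`, the uniqueness check and the sort cannot be skipped by accident.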

Adding and removing variables

Old:

  • BaseEntity.add_variable adds an FT variable to the variable list, and calls add_variable_statistics
  • BaseEntity.delete_variable removes an FT variable from the variable list
  • Entity.add_column adds column data to the dataframe, optionally infers the variable type, and adds an FT variable to the variable_types dictionary, but not to the variable list. I don't actually know if this works, because variable_types is a property that just converts the variable list into a dictionary. It does not call BaseEntity.add_variable
  • Entity.delete_column deletes a column from the dataframe and from the variable_types property. Again, not sure how this works.

New:

  • Entity.delete_variable deletes a variable both from the variables list and from the dataframe
  • Entity.add_variable adds an FT variable to the variable list (optionally inferring the type), adds data to the underlying dataframe, and calls add_variable_statistics
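
A minimal sketch of why keeping both operations in one place helps: the variable list and the data stay in sync, and `variable_types` remains a derived view. The `ToyEntity` class, the string type labels, and the crude type inference are assumptions for illustration only, and variable statistics are omitted.

```python
class ToyEntity:
    """Hypothetical stand-in for Entity; columns are plain lists, not pandas."""

    def __init__(self):
        self.variables = []  # list of (name, type label) pairs
        self.columns = {}    # column name -> list of values

    def add_variable(self, name, data, type_label=None):
        if type_label is None:
            # Crude inference, purely illustrative.
            is_numeric = all(isinstance(value, (int, float)) for value in data)
            type_label = "Numeric" if is_numeric else "Text"
        # Update the variable list and the data together.
        self.variables.append((name, type_label))
        self.columns[name] = list(data)

    def delete_variable(self, name):
        # Remove the variable from both the list and the data.
        self.variables = [v for v in self.variables if v[0] != name]
        del self.columns[name]

    @property
    def variable_types(self):
        # Derived from the variable list, so it cannot drift out of sync.
        return dict(self.variables)


entity = ToyEntity()
entity.add_variable("price", [1.5, 2.0])
entity.add_variable("label", ["a", "b"])
entity.delete_variable("label")
```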

Miscellaneous methods/properties changed/deleted

  • name is removed
  • show_instance is removed
  • get_shape is removed (this was a duplicate of shape)
  • get_column_stat is removed
  • has_time_index is removed
  • num_instances is removed
  • is_index_column is removed
  • get_column_type is removed
  • get_column_max/min/etc are all removed
  • get_all_instances is removed
  • get_top_n_instances is removed
  • sample_instances is removed
  • get_sliced_instance_ids is removed
  • get_column_data is removed
  • get_sample is renamed to sample

EntitySet

Methods that were just wrappers around the Entity, or around pandas, were removed. Other methods that didn't seem to be useful were removed as well:

  • get_sample is removed
  • get_instance_data is removed
  • num_instances is removed
  • get_all_instances is removed
  • get_top_n_instances is removed
  • sample_instances is removed
  • get_sliced_instance_ids is removed
  • get_dataframe is removed
  • get_column_names is removed
  • get_index is removed
  • get_time_index is removed
  • get_secondary_time_index is removed
  • get_column_X are all removed
  • get_variable_types is removed
  • add/delete_column are both removed
  • store_convert_variable_type is removed
  • _related_instances is now public as related_instances
  • get_name is removed (and correspondingly removed from PrimitiveBase.get_name, and other primitives)
  • delete_entity_variables is removed
  • make_index_variable_name is a top level function outside of the EntitySet class
  • add_entity is removed, and import_from_dataframe is now in charge of adding the entity to self.entity_stores, whose name is changed to self.entity_dict
  • index_by_variable is now private _index_by_variable

@bschreck bschreck requested a review from kmax12 May 17, 2018 19:39
@bschreck bschreck force-pushed the no-base-entityset-and-other-clean-ups branch from 72f18f0 to 0d1793d Compare May 17, 2018 22:05
@bschreck bschreck changed the title Clean up EntitySet class [WIP] Clean up EntitySet class May 18, 2018
@bschreck bschreck changed the title [WIP] Clean up EntitySet class Clean up EntitySet class May 31, 2018
@property
def entities(self):
    return list(self.entity_dict.values())

@property
def metadata(self):
Contributor

@kmax12 kmax12 Jun 6, 2018


i think the logic here can just be

    @property
    def metadata(self):
        '''Defined as a property because an EntitySet's metadata
        is used in many places, for instance, for each feature in a feature list.
        To prevent copying the full metadata object to each feature,
        we generate a new metadata object and check if it's the same as the existing one,
        and if it is return the existing one. Thus, all features in the feature list
        would reference the same object, rather than copies. This saves a lot of memory
        '''
        if self._metadata is None:
            self._metadata = self._gen_metadata()
        else:
            new_metadata = self._gen_metadata()
            # Don't want to keep making new copies of metadata
            # Only make a new one if something was changed
            if not self._metadata.__eq__(new_metadata):
                self._metadata = new_metadata

        return self._metadata

also, any reason to use not self._metadata.__eq__(new_metadata) rather than self._metadata != new_metadata?
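
On the `!=` question, one relevant language detail: Python 2 does not derive `!=` from `__eq__`, so `self._metadata != new_metadata` only compares by value there if the class also defines `__ne__`, while Python 3 derives it automatically. Whether that was the motivation here is not stated in the thread. The sketch below shows the caching pattern with `!=` and an explicit `__ne__`; the `Metadata` and `ToyEntitySet` classes and their `entity_names` attribute are hypothetical stand-ins, not the featuretools classes.

```python
class Metadata:
    """Hypothetical metadata object that is compared by value."""

    def __init__(self, entity_names):
        self.entity_names = tuple(entity_names)

    def __eq__(self, other):
        if not isinstance(other, Metadata):
            return NotImplemented
        return self.entity_names == other.entity_names

    def __ne__(self, other):
        # Explicit __ne__ so that `!=` behaves the same on Python 2 and 3.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result


class ToyEntitySet:
    def __init__(self, entity_names):
        self.entity_names = list(entity_names)
        self._metadata = None

    def _gen_metadata(self):
        return Metadata(self.entity_names)

    @property
    def metadata(self):
        # Replace the cached object only if something actually changed,
        # so unchanged callers keep sharing one object.
        new_metadata = self._gen_metadata()
        if self._metadata is None or self._metadata != new_metadata:
            self._metadata = new_metadata
        return self._metadata


es = ToyEntitySet(["customers"])
first = es.metadata
second = es.metadata               # unchanged, so the same object is returned
es.entity_names.append("sessions")
third = es.metadata                # changed, so a new object is cached
```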

@kmax12
Contributor

kmax12 commented Jun 8, 2018

Looks good to me. Merging

@kmax12 kmax12 merged commit ea47bb0 into master Jun 8, 2018
@rwedge rwedge mentioned this pull request Jun 22, 2018
@kmax12 kmax12 deleted the no-base-entityset-and-other-clean-ups branch August 15, 2018 23:02
3 participants