Add DataTable.update_dataframe method #407

Merged
merged 14 commits into from
Nov 20, 2020

Contributor

@thehomebrewnerd thehomebrewnerd commented Nov 18, 2020

Adds a new update_dataframe method to DataTable that replaces the existing dataframe with the supplied dataframe. The new dataframe must have the same columns as the original dataframe. All DataTable information (index, time_index, logical_types, semantic_tags) is retained. This modification is done in place.

Note, this PR does not implement the already_sorted parameter that is present in Featuretools. A separate issue is being created to implement this behavior here and on DataTable.set_time_index.
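
As a rough illustration of the behavior described above, here is a minimal sketch using a toy stand-in class. MiniTable, update_data, and the dict-of-lists data layout are hypothetical names invented for this example; they are not Woodwork's API or its implementation. The sketch only shows the two stated properties: the new data must have the same columns, and the metadata survives the in-place swap.

```python
# Toy stand-in for illustration only; this is not Woodwork's DataTable.
class MiniTable:
    def __init__(self, data, index=None, time_index=None, semantic_tags=None):
        self.data = data  # dict: column name -> list of values
        self.index = index
        self.time_index = time_index
        self.semantic_tags = semantic_tags or {}

    def update_data(self, new_data):
        """Replace the stored data in place, keeping all metadata."""
        if set(new_data) != set(self.data):
            raise ValueError("New data must have the same columns as the original data")
        self.data = new_data


table = MiniTable(
    {"id": [0, 1, 2], "signup": ["2020-11-01", "2020-11-02", "2020-11-03"]},
    index="id",
    time_index="signup",
    semantic_tags={"id": {"index"}},
)

# Swap in new rows; index, time_index and semantic_tags are left untouched.
table.update_data({"id": [7, 8], "signup": ["2020-11-18", "2020-11-19"]})
```

Passing data with a different set of columns raises ValueError in this sketch, which mirrors the same-columns requirement stated in the description.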

codecov bot commented Nov 18, 2020

Codecov Report

Merging #407 (dd377fe) into main (efc9d65) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #407   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           31        31           
  Lines         3900      3964   +64     
=========================================
+ Hits          3900      3964   +64     
Impacted Files                                Coverage Δ
woodwork/datatable.py                         100.00% <100.00%> (ø)
woodwork/tests/datatable/test_datatable.py    100.00% <100.00%> (ø)


@thehomebrewnerd thehomebrewnerd self-assigned this Nov 18, 2020
woodwork/datatable.py (outdated)
            new_df (DataFrame): DataFrame containing the new data
        '''
        if self.make_index:
            new_df = _make_index(new_df, self.index)
Contributor

How do we want to handle a situation where someone might be trying to update with data from the datatable itself? Something like:

dt = DataTable(sample_df,
               index='created_index',
               make_index=True)
new_from_dt = dt.to_dataframe().tail(2) 
dt.update_dataframe(new_from_dt)

In this case the index column already exists, so there is no need to create it, and trying to do so raises ValueError: cannot insert created_index, already exists.

It might even arise when they modify the original dataframe and then pass it to update_dataframe: if the DataTable wasn't created with a copy of the data, we'll already have added the index column to that dataframe.

Contributor Author

@thehomebrewnerd thehomebrewnerd Nov 19, 2020

Hmmm...didn't think about that case. I'm debating whether we need to handle this edge case or not. Do we just require that any updated data have the same columns originally used to create the table (same number of columns and same column names)? In this case, the user could just drop the index column from new_from_dt and it should work.

Contributor

Could it take a long time to create the index on a large table? That's really the only reason I could see to handle this case instead of having users drop the column, as you said.

Contributor Author

Yeah, I guess we could run _make_index only if make_index is True and the index is not already in new_df. Let me check that out and make sure it works ok.

Contributor Author

@thehomebrewnerd thehomebrewnerd Nov 19, 2020

Updated the check from `if self.make_index:` to `if self.make_index and self.index not in new_df.columns:`, and it seems to work fine.

This also got me thinking that we probably need to rerun our validation checks for the index and time_index to make sure the values present in the new dataframe are valid.
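
To make the guard concrete, here is a small self-contained sketch of the idea. It uses plain dicts of lists rather than real dataframes, and make_index and prepare_update are hypothetical helpers written for this example; they are not the PR's _make_index or Woodwork code. The error message simply mirrors the one quoted earlier in this thread.

```python
def make_index(columns, index_name):
    """Return a copy of `columns` with a 0..n-1 index column added in front."""
    if index_name in columns:
        # Mirrors the error quoted above for an index column that already exists.
        raise ValueError(f"cannot insert {index_name}, already exists")
    n_rows = len(next(iter(columns.values()), []))
    return {index_name: list(range(n_rows)), **columns}


def prepare_update(new_columns, index_name, make_index_flag):
    """Create the index only when requested and not already present."""
    if make_index_flag and index_name not in new_columns:
        return make_index(new_columns, index_name)
    return new_columns


original = make_index({"value": ["a", "b", "c", "d"]}, "created_index")

# Last two rows taken from the table itself: the index column comes along.
last_two = {name: values[-2:] for name, values in original.items()}

# Index already present, so it is reused as-is instead of being re-created.
reused = prepare_update(last_two, "created_index", make_index_flag=True)

# Index missing, so a new one is generated.
fresh = prepare_update({"value": ["x", "y"]}, "created_index", make_index_flag=True)
```

Without the extra membership check, the call on last_two would go straight to make_index and raise the ValueError.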

Contributor

yeah, that's a fair point.

Also, worth noting that not dropping and then recreating the index means it'll keep the index values the same instead of re-indexing from zero, which seems like it might be useful.

Contributor Author

I made the update to handle the situation you identified and also added in the validation checks for the index and time index.
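
The thread does not spell out what the validation checks cover, so the following is only a sketch under assumed rules: the index must be unique with no missing values, and every time index value must parse as a date. All function names and the date format are hypothetical and chosen for this example; they are not Woodwork's actual checks.

```python
from datetime import datetime


def check_index(values):
    """Assumed rule: index values must be present and unique."""
    if any(v is None for v in values):
        raise ValueError("Index contains missing values")
    if len(set(values)) != len(values):
        raise ValueError("Index column must be unique")


def check_time_index(values, fmt="%Y-%m-%d"):
    """Assumed rule: every time index value must parse as a date."""
    for v in values:
        try:
            datetime.strptime(v, fmt)
        except (TypeError, ValueError):
            raise ValueError(f"Time index value {v!r} is not a valid datetime") from None


def validate_update(new_columns, index=None, time_index=None):
    """Re-run the checks against the incoming data before it is stored."""
    if index is not None:
        check_index(new_columns[index])
    if time_index is not None:
        check_time_index(new_columns[time_index])


good = {"id": [0, 1], "signup": ["2020-11-18", "2020-11-19"]}
validate_update(good, index="id", time_index="signup")  # passes silently
```

Running the checks before the swap means invalid data is rejected while the table still holds its previous, valid contents.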

Contributor

@tamargrey tamargrey left a comment

Small comment about the docstring, but other than that, looking good!

@@ -498,6 +493,40 @@ def select(self, include):
        cols_to_include = self._filter_cols(include)
        return self._new_dt_from_cols(cols_to_include)

    def update_dataframe(self, new_df, already_sorted=False):
        '''Update DataTable's dataframe with new data, making sure the new DataFrame dtypes are updated.
Contributor

Might be worth mentioning the behavior with make_index here, to make it clear that a new index will be created if necessary but that the index column will otherwise remain unchanged.

Contributor

@gsheni gsheni left a comment

lgtm

        self._dataframe = self._dataframe.koalas.attach_id_column('distributed-sequence', index)
    else:
        self._dataframe.insert(0, index, range(len(self._dataframe)))
    self.make_index = make_index or None
Contributor

Nice!

@thehomebrewnerd thehomebrewnerd merged commit 3e1f627 into main Nov 20, 2020
@thehomebrewnerd thehomebrewnerd deleted the update-data branch November 20, 2020 13:36
@gsheni gsheni mentioned this pull request Nov 30, 2020
Successfully merging this pull request may close these issues.

Add update_dataframe function to DataTable
3 participants