Add `DataTable.update_dataframe` method #407
Conversation
Codecov Report

```
@@ Coverage Diff @@
##              main     #407   +/-  ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           31       31
  Lines         3900     3964   +64
=========================================
+ Hits          3900     3964   +64
=========================================
```
```python
        new_df (DataFrame): DataFrame containing the new data
    '''
    if self.make_index:
        new_df = _make_index(new_df, self.index)
```
How do we want to handle a situation where someone might be trying to update with data from the DataTable itself? Something like:

```python
dt = DataTable(sample_df,
               index='created_index',
               make_index=True)
new_from_dt = dt.to_dataframe().tail(2)
dt.update_dataframe(new_from_dt)
```

In this case, we don't need to create the index, so we get `ValueError: cannot insert created_index, already exists`.

It might even arise if they're changing the original dataframe and then passing it to `update_dataframe`, if they didn't create the DataTable with a copy of the data, since we'll have added the index column to the data.
Hmmm...I hadn't thought about that case. I'm debating whether we need to handle this edge case or not. Do we just require that any updated data have the same columns originally used to create the table (same number of columns and same column names)? In that case, the user could just drop the index column from `new_from_dt` and it should work.
Could it take a long time to create the index on a large table? That's really the only reason I can see to handle this case instead of having users drop the column as you said.
Yeah, I guess we could only run make index if `make_index` was True and the index was not in `new_df`. Let me check that out and make sure it works ok.
Updated `if self.make_index:` to `if self.make_index and self.index not in new_df.columns:` and it seems to work fine.

This also got me thinking that we probably need to rerun our validation checks for the index and `time_index` to make sure the values present in the new dataframe are valid.
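The guard described above can be sketched in isolation (function and parameter names are assumptions for illustration, not the actual woodwork internals):

```python
import pandas as pd

# Only create the index column when make_index is True AND the column
# is not already present -- so data round-tripped from the table is a
# no-op instead of a ValueError.
def maybe_make_index(new_df, index, make_index):
    if make_index and index not in new_df.columns:
        new_df = new_df.copy()
        new_df.insert(0, index, range(len(new_df)))
    return new_df

fresh = pd.DataFrame({'values': [1, 2]})
with_index = maybe_make_index(fresh, 'id', make_index=True)      # adds 'id'
unchanged = maybe_make_index(with_index, 'id', make_index=True)  # no-op
```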
Yeah, that's a fair point.

Also worth noting that not dropping and then recreating the index means it'll keep the index values the same instead of re-indexing from zero, which seems like it might be useful.
I made the update to handle the situation you identified and also added in the validation checks for the index and time index.
Small comment about the docstring, but other than that, looking good!
woodwork/datatable.py (Outdated)

```python
@@ -498,6 +493,40 @@ def select(self, include):
        cols_to_include = self._filter_cols(include)
        return self._new_dt_from_cols(cols_to_include)

    def update_dataframe(self, new_df, already_sorted=False):
        '''Update DataTable's dataframe with new data, making sure the new DataFrame dtypes are updated.
```
Might be worth mentioning something about the behavior with `make_index` here, to make it clear that a new index will be created if necessary, but otherwise the index column will remain unchanged.
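One way the suggested docstring note could read (illustrative wording only, not the text that was merged):

```python
def update_dataframe(self, new_df, already_sorted=False):
    '''Replace the DataTable's dataframe with new data, updating dtypes.

    If the DataTable was created with ``make_index=True`` and the index
    column is not present in ``new_df``, a new index column will be
    created; otherwise the existing index column is left unchanged.
    '''
```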
lgtm
```python
            self._dataframe = self._dataframe.koalas.attach_id_column('distributed-sequence', index)
        else:
            self._dataframe.insert(0, index, range(len(self._dataframe)))
        self.make_index = make_index or None
```
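The `make_index = make_index or None` line leans on a small Python idiom: `or` returns its first truthy operand, so any falsy `make_index` collapses to `None` and the attribute is only ever `True` or `None`, never `False`:

```python
# `x or None` keeps truthy values and maps all falsy ones to None.
print(True or None)    # True
print(False or None)   # None
print(None or None)    # None
```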
Nice!
Adds a new `update_dataframe` method to DataTable that replaces the existing dataframe with the supplied dataframe. The new dataframe must have the same columns as the original dataframe. All DataTable information (index, time_index, logical_types, semantic_tags) is retained. This modification is done in-place.

Note: this PR does not implement the `already_sorted` parameter that is present in Featuretools. A separate issue is being created to implement this behavior here and on `DataTable.set_time_index`.
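The contract described above can be illustrated with a pandas-only mock. This is not the real DataTable class; the class name, the `LookupError`, and the stored-dtypes detail are assumptions made for illustration:

```python
import pandas as pd

class MockTable:
    '''Toy stand-in for DataTable, illustrating the described contract.'''
    def __init__(self, dataframe, dtypes):
        self.dtypes = dtypes                        # metadata that is retained
        self._dataframe = dataframe.astype(dtypes)

    def update_dataframe(self, new_df):
        # The new dataframe must have the same columns as the original.
        if set(new_df.columns) != set(self._dataframe.columns):
            raise LookupError('New dataframe must contain the same columns')
        # Replace in place, re-applying the stored dtypes to the new data.
        self._dataframe = new_df.astype(self.dtypes)

table = MockTable(pd.DataFrame({'age': [1, 2]}), {'age': 'int64'})
table.update_dataframe(pd.DataFrame({'age': [3.0, 4.0]}))
print(table._dataframe['age'].dtype)  # int64
```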