Add DataTable.update_dataframe method #407

thehomebrewnerd · 2020-11-18T21:10:33Z

Add DataTable.update_dataframe method
Closes Add update_dataframe function to DataTable #391

Adds a new update_dataframe method to DataTable that replaces the existing dataframe with the supplied dataframe. The new dataframe must have the same columns as the original dataframe. All DataTable information (index, time_index, logical_types, semantic_tags) is retained. This modification is done in-place.

Note, this PR does not implement the already_sorted parameter that is present in Featuretools. A separate issue is being created to implement this behavior here and on DataTable.set_time_index.
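
As a rough illustration of the contract described above (same columns required, metadata retained, data swapped in place), here is a minimal sketch. It does not use Woodwork: `MiniTable`, `update_data`, and the dict-of-lists storage are hypothetical stand-ins, and reordering the new columns to match the original order is an assumption rather than something stated in this description.

```python
class MiniTable:
    """Hypothetical stand-in for DataTable; columns are stored as a dict of lists."""

    def __init__(self, data, index=None, time_index=None):
        self.data = data
        self.index = index
        self.time_index = time_index
        # Stand-in for logical types / semantic tags: one label per column.
        self.logical_types = {
            name: type(values[0]).__name__ for name, values in data.items()
        }

    def update_data(self, new_data):
        # The new data must have the same columns as the original data.
        if set(new_data) != set(self.data):
            raise ValueError("new data must contain the same columns as the original data")
        # Keep the original column order (an assumption), then swap the data in place.
        self.data = {name: list(new_data[name]) for name in self.data}


table = MiniTable(
    {"id": [0, 1], "signup": ["2020-01-01", "2020-01-02"]},
    index="id",
    time_index="signup",
)
# Columns arrive in a different order; the metadata on `table` is left untouched.
table.update_data({"signup": ["2020-02-01"], "id": [7]})
```

After the call, `table.index`, `table.time_index`, and `table.logical_types` are unchanged; only `table.data` has been replaced.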

codecov · 2020-11-18T21:18:12Z

Codecov Report

Merging #407 (dd377fe) into main (efc9d65) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #407   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           31        31           
  Lines         3900      3964   +64     
=========================================
+ Hits          3900      3964   +64

Impacted Files	Coverage Δ
woodwork/datatable.py	`100.00% <100.00%> (ø)`
woodwork/tests/datatable/test_datatable.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tamargrey · 2020-11-19T17:02:05Z

woodwork/datatable.py

+            new_df (DataFrame): DataFrame containing the new data
+        '''
+        if self.make_index:
+            new_df = _make_index(new_df, self.index)


How do we want to handle a situation where someone tries to update with data from the datatable itself? Something like:

    dt = DataTable(sample_df, index='created_index', make_index=True)
    new_from_dt = dt.to_dataframe().tail(2)
    dt.update_dataframe(new_from_dt)

In this case, we don't need to create the index, so we get ValueError: cannot insert created_index, already exists.

It might even arise if they modify the original dataframe and then pass it to update_dataframe: if they didn't create the DataTable with a copy of the data, we'll have added the index column to their dataframe.

Hmmm...didn't think about that case. I'm debating whether we need to handle this edge case or not. Do we just require that any updated data have the same columns originally used to create the table (same number of columns and same column names)? In this case, the user could just drop the index column from new_from_dt and it should work.

Could it take a long time to create the index on a large table? That's really the only reason I could see to handle this case instead of having users drop the column, as you said.

Yeah, I guess we could run _make_index only if make_index is True and the index is not in new_df. Let me check that out and make sure it works ok.

Updated `if self.make_index:` to `if self.make_index and self.index not in new_df.columns:` and it seems to work fine.
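
The guard discussed in this thread can be sketched without Woodwork or pandas. Everything below is a hypothetical stand-in: `make_index` mimics the role of `_make_index` on a list of row dicts, and `prepare_new_data` is an illustrative name, not a function from the PR.

```python
def make_index(rows, index_name):
    """Hypothetical stand-in for _make_index: add a 0..n-1 index column."""
    if rows and index_name in rows[0]:
        # Mirrors the error reported above when the column is already present.
        raise ValueError(f"cannot insert {index_name}, already exists")
    return [{index_name: i, **row} for i, row in enumerate(rows)]


def prepare_new_data(rows, index_name, make_index_flag):
    # Create the index only when it was requested and is not already present.
    columns = rows[0].keys() if rows else ()
    if make_index_flag and index_name not in columns:
        return make_index(rows, index_name)
    return rows


fresh = [{"value": "a"}, {"value": "b"}]
from_table = [
    {"created_index": 2, "value": "c"},
    {"created_index": 3, "value": "d"},
]

with_index = prepare_new_data(fresh, "created_index", True)
unchanged = prepare_new_data(from_table, "created_index", True)
```

With this guard, data that already carries the index column passes through with its existing index values, while data without it gets a fresh index starting at zero.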

This also got me thinking that we probably need to rerun our validation checks for the index and time_index to make sure the values present in the new dataframe are valid.

yeah, that's a fair point.

Also, worth noting that not dropping and then recreating the index means it'll keep the index values the same instead of re-indexing from zero, which seems like it might be useful.

I made the update to handle the situation you identified and also added in the validation checks for the index and time index.
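
For readers unfamiliar with what such validation might involve, here is a minimal sketch. The exact checks added in this PR are not shown in the thread, so the rules below (unique index values, datetime or numeric time index values) are assumptions, and `check_index` and `check_time_index` are hypothetical names.

```python
from datetime import datetime
from numbers import Number


def check_index(values):
    # Assumed rule: every value in the index column must be unique.
    if len(set(values)) != len(values):
        raise ValueError("index column must be unique")


def check_time_index(values):
    # Assumed rule: time index values must be datetimes or numbers.
    if not all(isinstance(v, (datetime, Number)) for v in values):
        raise TypeError("time index column must contain datetime or numeric values")


# Valid new data passes silently.
check_index([0, 1, 2])
check_time_index([datetime(2020, 11, 19), datetime(2020, 11, 20)])
```

Running checks like these again on the new data catches cases where the replacement dataframe has the right columns but invalid values in them.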

tamargrey

Small comment about the docstring, but other than that, looking good!

tamargrey · 2020-11-19T18:17:21Z

woodwork/datatable.py

@@ -498,6 +493,40 @@ def select(self, include):
        cols_to_include = self._filter_cols(include)
        return self._new_dt_from_cols(cols_to_include)

+    def update_dataframe(self, new_df, already_sorted=False):
+        '''Update DataTable's dataframe with new data, making sure the new DataFrame dtypes are updated.


Might be worth mentioning something about the behavior with make_index here to make it clear that a new index will be created if necessary, but otherwise, the index column will remain unchanged

gsheni

lgtm

gsheni · 2020-11-20T00:08:31Z

woodwork/datatable.py

-                self._dataframe = self._dataframe.koalas.attach_id_column('distributed-sequence', index)
-            else:
-                self._dataframe.insert(0, index, range(len(self._dataframe)))
+        self.make_index = make_index or None


Nate Parsons added 3 commits November 18, 2020 15:08

add DataTable.update_dataframe method

04afe46

update release_notes

289aaa5

update api reference

e33fe44

remove duplicate calls

19a89fb

thehomebrewnerd requested review from tamargrey and gsheni November 18, 2020 21:35

Nate Parsons added 2 commits November 18, 2020 15:55

simplify implementation

51a09ef

lint fix

e40fbf6

thehomebrewnerd self-assigned this Nov 18, 2020

tamargrey reviewed Nov 18, 2020

Nate Parsons and others added 4 commits November 18, 2020 16:25

update for column ordering

abe0582

Merge branch 'main' of https://github.com/FeatureLabs/woodwork into u…

173bed3

…pdate-data

Merge branch 'main' into update-data

d31b992

update to work with make_index

7c32b3b

tamargrey reviewed Nov 19, 2020

Nate Parsons added 3 commits November 19, 2020 11:31

improve validation checks

3a5c8a3

update test

c0536b5

lint fix

a383bf5

tamargrey approved these changes Nov 19, 2020

update docstring

dd377fe

gsheni approved these changes Nov 20, 2020

thehomebrewnerd merged commit 3e1f627 into main Nov 20, 2020

thehomebrewnerd deleted the update-data branch November 20, 2020 13:36

gsheni mentioned this pull request Nov 30, 2020

v0.0.6 #418

Merged
