Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement .at property for Koalas DataFrames and Series #384

Merged
merged 14 commits into from May 31, 2019

Conversation

floscha
Copy link
Collaborator

@floscha floscha commented May 25, 2019

As requested in #382, this PR implements pandas' .at property for Koalas DataFrames and Series.

@codecov-io
Copy link

codecov-io commented May 25, 2019

Codecov Report

Merging #384 into master will decrease coverage by 0.04%.
The diff coverage is 98.03%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #384      +/-   ##
=========================================
- Coverage   94.74%   94.7%   -0.05%     
=========================================
  Files          41      42       +1     
  Lines        4513    4666     +153     
=========================================
+ Hits         4276    4419     +143     
- Misses        237     247      +10
Impacted Files Coverage Δ
databricks/koalas/missing/frame.py 100% <ø> (ø) ⬆️
databricks/koalas/missing/series.py 100% <ø> (ø) ⬆️
databricks/koalas/generic.py 93.89% <100%> (-0.28%) ⬇️
databricks/koalas/tests/test_indexing.py 88.53% <100%> (+1.03%) ⬆️
databricks/koalas/indexing.py 93.2% <96.29%> (+0.46%) ⬆️
databricks/koalas/series.py 92.85% <0%> (-0.74%) ⬇️
databricks/koalas/missing/indexes.py 100% <0%> (ø) ⬆️
databricks/koalas/tests/test_series.py 100% <0%> (ø) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dabecef...ebea010. Read the comment docs.

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@floscha Thanks! I left some comments.

Also, we might need to mention in the doc that this might cause OOM if the row_index matches a lot of rows since unlike .loc or .iloc, this immediately collects the corresponding data into local.

if self._ks is None and len(key) != 2:
raise TypeError("Use DataFrame.at like .at[row_index, column_name]")
if self._ks is not None and len(key) != 1:
raise TypeError("Use Series.at like .at[column_name]")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

row_index instead of column_name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Fixed with 8137630.

def __getitem__(self, key):
from databricks.koalas.frame import DataFrame

if self._ks is None and len(key) != 2:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need to check the isinstance(key, tuple) first. If key is str, i.e., kdf.at['AB'] , len(key) returns the string length. Could you add a test for such a case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added with 09df062.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests for such a case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done with fb3587d.


row = key[0]
sdf = (series._kdf._sdf
.where(F.col(self._kdf._metadata.index_columns[0]) == row)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the kdf have multi-index or no index?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a check now that raises an exception if the index level doesn't equal 1 with a61f163. I'll have to look closer into how .at behaves for multilevel indices which generally seems to be rather uncommon. Also, is a Koalas DataFrame without an index even valid? Like, wouldn't it for example crash when calling to_pandas() on it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for multi- or no index, currently .loc raises an exception for multi- or no index. Maybe we can follow the behavior for now, but please make sure to check it and raise the exception.
And yes, no index in Koalas DataFrame is valid, but I'm not sure whether we should crash in that case. E.g., sdf.to_koalas() doesn't have index, but should we explicitly set_index prior to to_pandas() in that case? This is the future work anyway. Let's file an issue to discuss it.
Thanks!

sdf = (series._kdf._sdf
.where(F.col(self._kdf._metadata.index_columns[0]) == row)
.select(column))
if sdf.count() < 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This runs an extra Spark job.
How about checking the length after to_pandas() below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I've implemented your suggestion in 2dc8fa5.


column = key[1] if len(key) > 1 else self._ks.name
if column is not None and column not in self._kdf._metadata.data_columns:
raise KeyError("'%s" % column)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just out of curiosity, what's ' for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just to be consistent with the way pandas uses Python's KeyError, putting the missing key between 's. However, I've just realized I've missed the closing ' and the 's are not even necessary to start with 🙈 So I've fixed this with 2dc8fa5.

.where(F.col(self._kdf._metadata.index_columns[0]) == row)
.select(column))
if sdf.count() < 1:
raise KeyError("'%s" % row)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

if sdf.count() < 1:
raise KeyError("'%s" % row)

values = DataFrame(sdf).to_pandas().iloc[:, 0].values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we don't need to create Koalas DataFrame here. Spark DataFrame has toPandas().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Fixed this with 2dc8fa5.

@floscha
Copy link
Collaborator Author

floscha commented May 27, 2019

Thanks for your thorough review @ueshin!

I've addressed all of your comments and also added a warning note for matching a lot of rows with f14d136.


if self._ks is None and (not isinstance(key, tuple) or len(key) != 2):
raise TypeError("Use DataFrame.at like .at[row_index, column_name]")
if self._ks is not None and len(key) != 1:
Copy link
Collaborator

@ueshin ueshin May 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check key is not str before len(key) != 1?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean what if kdf.a.at['abc'], for example?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I've added the check and an additional unit test (that used to fail previously) with
415e78a.

"""
Access a single value for a row/column label pair.
Similar to ``loc``, in that both provide label-based lookups. Use ``at`` if you only need to
get a single value in a DataFrame or Series.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should also mention the behavior when multiple values are matched (returning an array)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added it to the docs with 0dc9e08.


Get value at specified row/column pair

>>> kdf.at[4, 'B']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also put an example with multiple values matching the same index, to show that it returns an array?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done with 83a77d0.

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two nits, otherwise LGTM.


Get array if an index occurs multiple times

>>> kdf.to_pandas().at[5, 'B']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kdf.at[5, 'B'] ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Fixed with ebea010.

self.assertEqual(test_series.at['b'], 6)
self.assertEqual(test_series.at['b'], pdf.loc[3].at['b'])

#
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot to add a comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Guess I was kind of tired yesterday when committing this... Now fixed with ebea010 though.

@softagram-bot
Copy link

Softagram Impact Report for pull/384 (head commit: ebea010)

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

📄 Full report

Give feedback on this report to support@softagram.com

@ueshin
Copy link
Collaborator

ueshin commented May 31, 2019

Thanks! merging.

@ueshin ueshin merged commit c6310ff into databricks:master May 31, 2019
@floscha floscha deleted the at branch May 31, 2019 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants