New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement .at property for Koalas DataFrames and Series #384
Conversation
Codecov Report
@@ Coverage Diff @@
## master #384 +/- ##
=========================================
- Coverage 94.74% 94.7% -0.05%
=========================================
Files 41 42 +1
Lines 4513 4666 +153
=========================================
+ Hits 4276 4419 +143
- Misses 237 247 +10
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@floscha Thanks! I left some comments.
Also, we might need to mention in the doc that this might cause OOM if the row_index matches a lot of rows since unlike .loc
or .iloc
, this immediately collects the corresponding data into local.
databricks/koalas/indexing.py
Outdated
if self._ks is None and len(key) != 2: | ||
raise TypeError("Use DataFrame.at like .at[row_index, column_name]") | ||
if self._ks is not None and len(key) != 1: | ||
raise TypeError("Use Series.at like .at[column_name]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
row_index
instead of column_name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Fixed with 8137630.
databricks/koalas/indexing.py
Outdated
def __getitem__(self, key): | ||
from databricks.koalas.frame import DataFrame | ||
|
||
if self._ks is None and len(key) != 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we need to check the isinstance(key, tuple)
first. If key
is str
, i.e., kdf.at['AB']
, len(key)
returns the string length. Could you add a test for such a case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, added with 09df062.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests for such a case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done with fb3587d.
|
||
row = key[0] | ||
sdf = (series._kdf._sdf | ||
.where(F.col(self._kdf._metadata.index_columns[0]) == row) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the kdf
have multi-index or no index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a check now that raises an exception if the index level doesn't equal 1 with a61f163. I'll have to look closer into how .at
behaves for multilevel indices which generally seems to be rather uncommon. Also, is a Koalas DataFrame without an index even valid? Like, wouldn't it for example crash when calling to_pandas()
on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for multi- or no index, currently .loc
raises an exception for multi- or no index. Maybe we can follow the behavior for now, but please make sure to check it and raise the exception.
And yes, no index in Koalas DataFrame is valid, but I'm not sure whether we should crash in that case. E.g., sdf.to_koalas()
doesn't have index, but should we explicitly set_index
prior to to_pandas()
in that case? This is the future work anyway. Let's file an issue to discuss it.
Thanks!
databricks/koalas/indexing.py
Outdated
sdf = (series._kdf._sdf | ||
.where(F.col(self._kdf._metadata.index_columns[0]) == row) | ||
.select(column)) | ||
if sdf.count() < 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This runs an extra Spark job.
How about checking the length after to_pandas()
below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I've implemented your suggestion in 2dc8fa5.
databricks/koalas/indexing.py
Outdated
|
||
column = key[1] if len(key) > 1 else self._ks.name | ||
if column is not None and column not in self._kdf._metadata.data_columns: | ||
raise KeyError("'%s" % column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just out of curiosity, what's '
for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just to be consistent with the way pandas uses Python's KeyError
, putting the missing key between '
s. However, I've just realized I've missed the closing '
and the '
s are not even necessary to start with 🙈 So I've fixed this with 2dc8fa5.
databricks/koalas/indexing.py
Outdated
.where(F.col(self._kdf._metadata.index_columns[0]) == row) | ||
.select(column)) | ||
if sdf.count() < 1: | ||
raise KeyError("'%s" % row) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above.
databricks/koalas/indexing.py
Outdated
if sdf.count() < 1: | ||
raise KeyError("'%s" % row) | ||
|
||
values = DataFrame(sdf).to_pandas().iloc[:, 0].values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we don't need to create Koalas DataFrame
here. Spark DataFrame has toPandas()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Fixed this with 2dc8fa5.
databricks/koalas/indexing.py
Outdated
|
||
if self._ks is None and (not isinstance(key, tuple) or len(key) != 2): | ||
raise TypeError("Use DataFrame.at like .at[row_index, column_name]") | ||
if self._ks is not None and len(key) != 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to check key is not str before len(key) != 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean what if kdf.a.at['abc']
, for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I've added the check and an additional unit test (that used to fail previously) with
415e78a.
""" | ||
Access a single value for a row/column label pair. | ||
Similar to ``loc``, in that both provide label-based lookups. Use ``at`` if you only need to | ||
get a single value in a DataFrame or Series. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should also mention the behavior when multiple values are matched (returning an array)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added it to the docs with 0dc9e08.
|
||
Get value at specified row/column pair | ||
|
||
>>> kdf.at[4, 'B'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also put an example with multiple values matching the same index, to show that it returns an array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done with 83a77d0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two nits, otherwise LGTM.
databricks/koalas/indexing.py
Outdated
|
||
Get array if an index occurs multiple times | ||
|
||
>>> kdf.to_pandas().at[5, 'B'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kdf.at[5, 'B']
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Fixed with ebea010.
self.assertEqual(test_series.at['b'], 6) | ||
self.assertEqual(test_series.at['b'], pdf.loc[3].at['b']) | ||
|
||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgot to add a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Guess I was kind of tired yesterday when committing this... Now fixed with ebea010 though.
Softagram Impact Report for pull/384 (head commit: ebea010)⭐ Change Overview
📄 Full report
Give feedback on this report to support@softagram.com |
Thanks! merging. |
As requested in #382, this PR implements pandas'
.at
property for Koalas DataFrames and Series.