Implement .at property for Koalas DataFrames and Series #384

floscha · 2019-05-25T12:17:18Z

As requested in #382, this PR implements pandas' .at property for Koalas DataFrames and Series.

codecov-io · 2019-05-25T12:58:54Z

Codecov Report

Merging #384 into master will decrease coverage by 0.04%.
The diff coverage is 98.03%.

@@            Coverage Diff            @@
##           master    #384      +/-   ##
=========================================
- Coverage   94.74%   94.7%   -0.05%     
=========================================
  Files          41      42       +1     
  Lines        4513    4666     +153     
=========================================
+ Hits         4276    4419     +143     
- Misses        237     247      +10

Impacted Files	Coverage Δ
databricks/koalas/missing/frame.py	`100% <ø> (ø)`	⬆️
databricks/koalas/missing/series.py	`100% <ø> (ø)`	⬆️
databricks/koalas/generic.py	`93.89% <100%> (-0.28%)`	⬇️
databricks/koalas/tests/test_indexing.py	`88.53% <100%> (+1.03%)`	⬆️
databricks/koalas/indexing.py	`93.2% <96.29%> (+0.46%)`	⬆️
databricks/koalas/series.py	`92.85% <0%> (-0.74%)`	⬇️
databricks/koalas/missing/indexes.py	`100% <0%> (ø)`	⬆️
databricks/koalas/tests/test_series.py	`100% <0%> (ø)`	⬆️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dabecef...ebea010.

ueshin

@floscha Thanks! I left some comments.

Also, we might need to mention in the docs that this can cause an OOM if row_index matches a lot of rows, since, unlike .loc or .iloc, it immediately collects the matching data locally.

ueshin · 2019-05-27T06:45:06Z

databricks/koalas/indexing.py

+        if self._ks is None and len(key) != 2:
+            raise TypeError("Use DataFrame.at like .at[row_index, column_name]")
+        if self._ks is not None and len(key) != 1:
+            raise TypeError("Use Series.at like .at[column_name]")


row_index instead of column_name?

True. Fixed with 8137630.

ueshin · 2019-05-27T06:53:52Z

databricks/koalas/indexing.py

+    def __getitem__(self, key):
+        from databricks.koalas.frame import DataFrame
+
+        if self._ks is None and len(key) != 2:


I guess we need to check isinstance(key, tuple) first. If key is a str, e.g., kdf.at['AB'], len(key) returns the string length. Could you add a test for such a case?

Sure, added with 09df062.

Could you add tests for such a case?

Sure, done with fb3587d.
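
The pitfall raised in this thread can be shown in isolation. The following is a minimal sketch, not the code merged in the PR: `validate_frame_at_key` is a hypothetical helper name, and only the error message is taken from the reviewed diff.

```python
def validate_frame_at_key(key):
    """Hypothetical helper: validate a key passed to DataFrame.at."""
    # Check the type first. A bare length check would accept 'AB',
    # because len('AB') == 2, and would fail on an int, which has no len().
    if not isinstance(key, tuple) or len(key) != 2:
        raise TypeError("Use DataFrame.at like .at[row_index, column_name]")
    row_index, column_name = key
    return row_index, column_name


print(validate_frame_at_key((4, "B")))
```

Because `or` short-circuits, `len(key)` is only evaluated once the key is known to be a tuple, so both `'AB'` and a bare `4` are rejected with the same TypeError.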

ueshin · 2019-05-27T06:54:59Z

databricks/koalas/indexing.py

+
+        row = key[0]
+        sdf = (series._kdf._sdf
+               .where(F.col(self._kdf._metadata.index_columns[0]) == row)


What if the kdf has a multi-index or no index?

I've now added a check with a61f163 that raises an exception if the number of index levels isn't 1. I'll have to look closer into how .at behaves for multi-level indices, which generally seem to be rather uncommon. Also, is a Koalas DataFrame without an index even valid? Like, wouldn't it for example crash when calling to_pandas() on it?

As for multi-index or no index: .loc currently raises an exception in both cases. Maybe we can follow that behavior for now, but please make sure to check for it and raise the exception.
And yes, a Koalas DataFrame with no index is valid, but I'm not sure whether we should crash in that case. E.g., the result of sdf.to_koalas() doesn't have an index, but should we explicitly call set_index prior to to_pandas() in that case? This is future work anyway. Let's file an issue to discuss it.
Thanks!

ueshin · 2019-05-27T06:56:18Z

databricks/koalas/indexing.py

+        sdf = (series._kdf._sdf
+               .where(F.col(self._kdf._metadata.index_columns[0]) == row)
+               .select(column))
+        if sdf.count() < 1:


This runs an extra Spark job.
How about checking the length after to_pandas() below?

Good point! I've implemented your suggestion in 2dc8fa5.
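
To illustrate why the reordering saves a job: in Spark, both `count()` and `toPandas()` are actions, so calling them back to back executes the filtered query twice. The sketch below models this without Spark. `FakeSparkFrame` and both lookup functions are hypothetical stand-ins, not Koalas or PySpark APIs.

```python
class FakeSparkFrame:
    """Hypothetical stand-in for a filtered Spark DataFrame; counts triggered jobs."""

    def __init__(self, values):
        self._values = list(values)
        self.jobs = 0

    def count(self):
        self.jobs += 1  # an action: runs a job
        return len(self._values)

    def collect_values(self):
        self.jobs += 1  # an action: runs a job (stands in for toPandas())
        return list(self._values)


def lookup_with_count(sdf, row):
    # Pattern from the reviewed diff: one job only to test for emptiness.
    if sdf.count() < 1:
        raise KeyError(row)
    return sdf.collect_values()  # second job


def lookup_after_collect(sdf, row):
    # Suggested pattern: collect once, then check the local result.
    values = sdf.collect_values()
    if len(values) < 1:
        raise KeyError(row)
    return values


before, after = FakeSparkFrame([6]), FakeSparkFrame([6])
lookup_with_count(before, 4)
lookup_after_collect(after, 4)
print(before.jobs, after.jobs)
```

The second variant returns the same values and raises the same KeyError for a missing row, but touches the cluster only once.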

ueshin · 2019-05-27T07:00:05Z

databricks/koalas/indexing.py

+
+        column = key[1] if len(key) > 1 else self._ks.name
+        if column is not None and column not in self._kdf._metadata.data_columns:
+            raise KeyError("'%s" % column)


just out of curiosity, what's ' for?

This is just to be consistent with the way pandas uses Python's KeyError, putting the missing key between 's. However, I've just realized I've missed the closing ' and the 's are not even necessary to start with 🙈 So I've fixed this with 2dc8fa5.
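
For context on why the quotes are unnecessary: Python's built-in KeyError formats a single argument with `repr()`, so a string key is already shown in quotes. A small demonstration, using `'B'` as an arbitrary example key:

```python
# KeyError renders its single argument with repr(), so string keys are quoted automatically.
plain = KeyError("B")
print(str(plain))

# Adding only an opening quote by hand, as in the reviewed diff, leaves a stray quote in the message.
manual = KeyError("'%s" % "B")
print(str(manual))
```

Passing the bare key, as in `KeyError(column)`, therefore gives the same quoted form that pandas users see.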

ueshin · 2019-05-27T07:00:12Z

databricks/koalas/indexing.py

+               .where(F.col(self._kdf._metadata.index_columns[0]) == row)
+               .select(column))
+        if sdf.count() < 1:
+            raise KeyError("'%s" % row)


See comment above.

ueshin · 2019-05-27T07:01:17Z

databricks/koalas/indexing.py

+        if sdf.count() < 1:
+            raise KeyError("'%s" % row)
+
+        values = DataFrame(sdf).to_pandas().iloc[:, 0].values


Actually we don't need to create Koalas DataFrame here. Spark DataFrame has toPandas().

You're right. Fixed this with 2dc8fa5.

floscha · 2019-05-27T08:15:11Z

Thanks for your thorough review @ueshin!

I've addressed all of your comments and also added a warning note for matching a lot of rows with f14d136.

ueshin · 2019-05-27T08:20:20Z

databricks/koalas/indexing.py

+
+        if self._ks is None and (not isinstance(key, tuple) or len(key) != 2):
+            raise TypeError("Use DataFrame.at like .at[row_index, column_name]")
+        if self._ks is not None and len(key) != 1:


Need to check key is not str before len(key) != 1?

I mean what if kdf.a.at['abc'], for example?

Good point! I've added the check and an additional unit test (that used to fail previously) with 415e78a.
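
One way to handle the Series case is to treat anything that is not a tuple as the row label itself, so a string such as `'abc'` is never measured with `len()`. This is only a sketch under that assumption: `normalize_series_at_key` is a hypothetical helper, the error message follows the wording suggested earlier in the review, and the check actually added in 415e78a may differ.

```python
def normalize_series_at_key(key):
    """Hypothetical helper: return the row label from a key passed to Series.at."""
    # Only tuples are unpacked; strings and other scalars are labels as they are.
    if isinstance(key, tuple):
        if len(key) != 1:
            raise TypeError("Use Series.at like .at[row_index]")
        return key[0]
    return key


print(normalize_series_at_key("abc"))
```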

thunterdb · 2019-05-27T16:44:11Z

databricks/koalas/indexing.py

+    """
+    Access a single value for a row/column label pair.
+    Similar to ``loc``, in that both provide label-based lookups. Use ``at`` if you only need to
+    get a single value in a DataFrame or Series.


you should also mention the behavior when multiple values are matched (returning an array)

I've added it to the docs with 0dc9e08.

thunterdb · 2019-05-27T16:44:14Z

databricks/koalas/indexing.py

+
+    Get value at specified row/column pair
+
+    >>> kdf.at[4, 'B']


can you also put an example with multiple values matching the same index, to show that it returns an array?

Sure, done with 83a77d0.
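
The behavior being documented here (a single value for a unique label, several values when the label occurs more than once) can be sketched without Spark. This is an illustration only: `at_lookup` is a hypothetical function, the rows are made-up data, and a plain list stands in for the array that the thread says is returned.

```python
def at_lookup(rows, row_label):
    """Hypothetical sketch: rows is a list of (index_label, value) pairs."""
    matches = [value for label, value in rows if label == row_label]
    if len(matches) == 0:
        raise KeyError(row_label)
    if len(matches) == 1:
        return matches[0]  # unique label: return the value itself
    return matches  # repeated label: return all matching values


rows = [(4, 10), (5, 20), (5, 30)]
print(at_lookup(rows, 4))
print(at_lookup(rows, 5))
```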

ueshin

I left two nits, otherwise LGTM.

ueshin · 2019-05-31T07:43:50Z

databricks/koalas/indexing.py

+
+    Get array if an index occurs multiple times
+
+    >>> kdf.to_pandas().at[5, 'B']


kdf.at[5, 'B'] ?

You're right. Fixed with ebea010.

ueshin · 2019-05-31T07:45:09Z

databricks/koalas/tests/test_indexing.py

+        self.assertEqual(test_series.at['b'], 6)
+        self.assertEqual(test_series.at['b'], pdf.loc[3].at['b'])
+
+        #


forgot to add a comment?

Correct. Guess I was kind of tired yesterday when committing this... Now fixed with ebea010 though.

softagram-bot · 2019-05-31T08:24:01Z

Softagram Impact Report for pull/384 (head commit: `ebea010`)

⭐ Change Overview

ueshin · 2019-05-31T08:40:41Z

Thanks! merging.

floscha added 2 commits May 25, 2019 14:15

Implement .at property for Koalas DataFrames and Series

3160ae2

Fix return value and tests

13a0acc

ueshin reviewed May 27, 2019

floscha added 6 commits May 27, 2019 09:29

Fix TypeError message for Series.at

8137630

Add check for tuple type

09df062

Add check for index level to be 1

a61f163

Avoid unnecessary Spark job and DataFrame creation

2dc8fa5

Fix KeyError message

c1c981a

Add warning note for matching a lot of rows

f14d136

ueshin reviewed May 27, 2019

thunterdb reviewed May 27, 2019

floscha added 5 commits May 30, 2019 17:33

Add test case where index is str of len 2

fb3587d

Remove unused import

85cd230

Add support for str indices with len > 1

415e78a

Add doctest for index occuring multiple times

83a77d0

Extend docs for index occuring multiple times

0dc9e08

ueshin reviewed May 31, 2019

Fix doctest and missing comment

ebea010

ueshin merged commit c6310ff into databricks:master May 31, 2019

floscha deleted the at branch May 31, 2019 09:21

garawalid mentioned this pull request May 31, 2019

Implement missing functions for 10-minute tutorial with Koalas #382

Closed

7 tasks
