[SPARK-10417] [SQL] Iterating through Column results in infinite loop
`pyspark.sql.column.Column` defines a `__getitem__` method, which makes it iterable in Python. `__getitem__` exists so that, when a column holds a list or dict, you can access individual elements through the DataFrame API. Iterability is only a side effect, and a harmful one: because `Column.__getitem__` returns a new expression for any index and never raises `IndexError`, iterating over a `Column` loops forever. This can confuse people getting familiar with Spark DataFrames, since iterating over a column this way works as expected in pandas, for instance.

Issue reproduction:
```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]: print i
```
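
The root cause is Python's legacy iteration protocol: when a class defines `__getitem__` but no `__iter__`, `iter()` falls back to calling `obj[0]`, `obj[1]`, ... until an `IndexError` is raised, and `Column.__getitem__` never raises one. A minimal standalone sketch of the mechanism and of the fix this commit applies (plain Python, not Spark code; the class names are made up for illustration):

```python
# A class with only __getitem__: Python's legacy sequence protocol
# makes it iterable, calling obj[0], obj[1], ... indefinitely.
class FieldAccessOnly:
    def __getitem__(self, key):
        # Stands in for Column.getField: always succeeds,
        # never raises IndexError, so iteration never stops.
        return "field_%s" % key

it = iter(FieldAccessOnly())   # works via the __getitem__ fallback
print(next(it))                # -> field_0, and next() would go on forever

# The fix: define __iter__ to raise TypeError. This blocks the
# fallback protocol while leaving item access intact.
class NonIterable:
    def __getitem__(self, key):
        return "field_%s" % key

    def __iter__(self):
        raise TypeError("Column is not iterable")

col = NonIterable()
print(col["name"])             # item access still works -> field_name
try:
    iter(col)
except TypeError as e:
    print(e)                   # -> Column is not iterable
```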

Author: 0x0FFF <programmerag@gmail.com>

Closes #8574 from 0x0FFF/SPARK-10417.
0x0FFF authored and davies committed Sep 2, 2015
1 parent 2da3a9e commit 6cd98c1
Showing 2 changed files with 12 additions and 0 deletions.
3 changes: 3 additions & 0 deletions python/pyspark/sql/column.py
```diff
@@ -226,6 +226,9 @@ def __getattr__(self, item):
             raise AttributeError(item)
         return self.getField(item)
 
+    def __iter__(self):
+        raise TypeError("Column is not iterable")
+
     # string methods
     rlike = _bin_op("rlike")
     like = _bin_op("like")
```
9 changes: 9 additions & 0 deletions python/pyspark/sql/tests.py
```diff
@@ -1066,6 +1066,15 @@ def test_with_column_with_existing_name(self):
         keys = self.df.withColumn("key", self.df.key).select("key").collect()
         self.assertEqual([r.key for r in keys], list(range(100)))
 
+    # regression test for SPARK-10417
+    def test_column_iterator(self):
+
+        def foo():
+            for x in self.df.key:
+                break
+
+        self.assertRaises(TypeError, foo)
+
 
 class HiveContextSQLTests(ReusedPySparkTestCase):
 
```
