Skip to content

Commit

Permalink
[SPARK-28003][PYTHON] Allow NaT values when creating Spark dataframe …
Browse files Browse the repository at this point in the history
…from pandas with Arrow

## What changes were proposed in this pull request?

This patch removes `fillna(0)` when creating ArrowBatch from a pandas Series.

With `fillna(0)`, the original code would turn a timestamp type into object type, which pyarrow will complain later:
```
>>> s = pd.Series([pd.NaT, pd.Timestamp('2015-01-01')])
>>> s.dtypes
dtype('<M8[ns]')
>>> s.fillna(0)
0                      0
1    2015-01-01 00:00:00
dtype: object
```

## How was this patch tested?

Added `test_timestamp_nat`

Closes #24844 from icexelloss/SPARK-28003-arrow-nat.

Authored-by: Li Jin <ice.xelloss@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
  • Loading branch information
icexelloss authored and BryanCutler committed Jun 24, 2019
1 parent 9df7587 commit d0fbc4d
Show file tree
Hide file tree
Showing 2 changed files with 10 additions and 2 deletions.
3 changes: 1 addition & 2 deletions python/pyspark/serializers.py
Expand Up @@ -296,8 +296,7 @@ def create_array(s, t):
mask = s.isnull()
# Ensure timestamp series are in expected form for Spark internal representation
if t is not None and pa.types.is_timestamp(t):
s = _check_series_convert_timestamps_internal(s.fillna(0), self._timezone)

s = _check_series_convert_timestamps_internal(s, self._timezone)
try:
array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
except pa.ArrowException as e:
Expand Down
9 changes: 9 additions & 0 deletions python/pyspark/sql/tests/test_arrow.py
Expand Up @@ -383,6 +383,15 @@ def test_timestamp_dst(self):
assert_frame_equal(pdf, df_from_python.toPandas())
assert_frame_equal(pdf, df_from_pandas.toPandas())

# Regression test for SPARK-28003
def test_timestamp_nat(self):
dt = [pd.NaT, pd.Timestamp('2019-06-11'), None] * 100
pdf = pd.DataFrame({'time': dt})
df_no_arrow, df_arrow = self._createDataFrame_toggle(pdf)

assert_frame_equal(pdf, df_no_arrow.toPandas())
assert_frame_equal(pdf, df_arrow.toPandas())

def test_toPandas_batch_order(self):

def delay_first_part(partition_index, iterator):
Expand Down

0 comments on commit d0fbc4d

Please sign in to comment.