
SPARK-25881 #22888

Closed · wants to merge 2 commits

Conversation

@351zyf commented Oct 30, 2018

Add parameter coerce_float.
https://issues.apache.org/jira/browse/SPARK-25881

What changes were proposed in this pull request?

When using PySpark's DataFrame.toPandas(), columns of Spark decimal type become object dtype in the resulting pandas DataFrame.

```
>>> for i in df_spark.dtypes:
...     print(i)
...
('dt', 'string')
('cost_sum', 'decimal(38,3)')
('req_sum', 'bigint')
('pv_sum', 'bigint')
('click_sum', 'bigint')
```

```
>>> df_pd = df_spark.toPandas()
>>> df_pd.dtypes
dt            object
cost_sum      object
req_sum        int64
pv_sum         int64
click_sum      int64
dtype: object
```

The coerce_float parameter of pd.DataFrame.from_records converts decimal.Decimal values to floating point.

```
>>> rows = df_spark.collect()
>>> df2_pd = pd.DataFrame.from_records(rows, columns=df_spark.columns, coerce_float=True)
>>> df2_pd.dtypes
dt            object
cost_sum     float64
req_sum        int64
pv_sum         int64
click_sum      int64
dtype: object
```


@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member)

I think you can just manually convert it in the pandas DataFrame, no?
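
A minimal sketch of that manual pandas-side conversion, using the cost_sum column from the example above (an assumed illustration, not code from this PR):

```python
# After toPandas(), decimal columns arrive as object dtype holding
# decimal.Decimal values; cast them to float64 explicitly.
df_pd = df_spark.toPandas()
df_pd["cost_sum"] = df_pd["cost_sum"].astype("float64")
```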

@351zyf (Author) commented Oct 30, 2018

> I think you can just manually convert it in the pandas DataFrame, no?

If I'm using the toPandas function, I don't think converting decimal to object is right. Aren't decimal values usually meant for calculation? I mean, numbers.

@351zyf (Author) commented Oct 30, 2018

And this also has no effect on timestamp values. Tested.

@HyukjinKwon (Member)

Then you can convert the type into double or float in the Spark DataFrame. This is super easy to work around in either the pandas DataFrame or Spark's DataFrame. I don't think we should add this flag.

BTW, the same feature would also need to be added for when Arrow optimization is enabled.
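
A minimal sketch of the Spark-side workaround described above (column name assumed from the earlier example):

```python
# Cast the decimal column to double before converting, so pandas
# receives float64 instead of object.
from pyspark.sql.functions import col

df_pd = df_spark.withColumn("cost_sum", col("cost_sum").cast("double")).toPandas()
```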

@351zyf (Author) commented Oct 30, 2018

> Then you can convert the type into double or float in the Spark DataFrame. This is super easy to work around in either the pandas DataFrame or Spark's DataFrame. I don't think we should add this flag.
>
> BTW, the same feature would also need to be added for when Arrow optimization is enabled.

Or could we fix this conversion in the function dataframe._to_corrected_pandas_type? Converting the decimal type manually every time doesn't sound good.
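
A sketch of what such a correction could look like, using a hypothetical helper (this is not the actual _to_corrected_pandas_type implementation):

```python
# Hypothetical helper, for illustration only: walk the Spark schema and
# cast any DecimalType column in the pandas result to float64.
from pyspark.sql.types import DecimalType

def coerce_decimal_columns(pdf, spark_df):
    for field in spark_df.schema.fields:
        if isinstance(field.dataType, DecimalType):
            pdf[field.name] = pdf[field.name].astype("float64")
    return pdf

df_pd = coerce_decimal_columns(df_spark.toPandas(), df_spark)
```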

@HyukjinKwon (Member)

You're introducing a flag to convert. I think enabling the flag is virtually the same as calling a function to convert.

@HyukjinKwon (Member)

I would close this, @351zyf.

@351zyf (Author) commented Oct 30, 2018

OK
