
SPARK-25881 #22888

Closed · wants to merge 2 commits

Conversation

@351zyf commented Oct 30, 2018

Add parameter coerce_float.
https://issues.apache.org/jira/browse/SPARK-25881

What changes were proposed in this pull request?

When using PySpark's DataFrame.toPandas(), columns of Spark decimal type become object dtype in the resulting pandas DataFrame.

```
>>> for i in df_spark.dtypes:
...     print(i)
...
('dt', 'string')
('cost_sum', 'decimal(38,3)')
('req_sum', 'bigint')
('pv_sum', 'bigint')
('click_sum', 'bigint')
```

```
>>> df_pd = df_spark.toPandas()
>>> df_pd.dtypes
dt            object
cost_sum      object
req_sum        int64
pv_sum         int64
click_sum      int64
dtype: object
```

The coerce_float parameter of pd.DataFrame.from_records converts decimal.Decimal values to floating point.

```
>>> rows = df_spark.collect()
>>> df2_pd = pd.DataFrame.from_records(rows, columns=df_spark.columns, coerce_float=True)
>>> df2_pd.dtypes
dt            object
cost_sum     float64
req_sum        int64
pv_sum         int64
click_sum      int64
dtype: object
```


@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member)

I think you can just manually convert it in the pandas DataFrame, no?
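
A minimal sketch of that manual pandas-side conversion, using the cost_sum column from the example above (an assumed illustration, not code from this PR):

```python
# After toPandas(), decimal columns arrive as object dtype holding
# decimal.Decimal values; cast them to float64 explicitly.
df_pd = df_spark.toPandas()
df_pd["cost_sum"] = df_pd["cost_sum"].astype("float64")
```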

@351zyf (Author) commented Oct 30, 2018

> I think you can just manually convert it in the pandas DataFrame, no?

If I'm using the toPandas function, I don't think converting decimal to object is right. Aren't decimal values usually meant for calculation? I mean, numbers.

@351zyf (Author) commented Oct 30, 2018

And this also has no effect on timestamp values. Tested.

@HyukjinKwon (Member)

Then you can convert the type into double or float in the Spark DataFrame. This is super easy to work around in either the pandas DataFrame or Spark's DataFrame. I don't think we should add this flag.

BTW, the same feature would also need to be added for when Arrow optimization is enabled.
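
A minimal sketch of the Spark-side workaround described above (column name assumed from the earlier example):

```python
# Cast the decimal column to double before converting, so pandas
# receives float64 instead of object.
from pyspark.sql.functions import col

df_pd = df_spark.withColumn("cost_sum", col("cost_sum").cast("double")).toPandas()
```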

@351zyf (Author) commented Oct 30, 2018

> Then you can convert the type into double or float in the Spark DataFrame. This is super easy to work around in either the pandas DataFrame or Spark's DataFrame. I don't think we should add this flag.
>
> BTW, the same feature would also need to be added for when Arrow optimization is enabled.

Or could we fix this conversion in the function dataframe._to_corrected_pandas_type? Converting the decimal type manually every time doesn't sound good.
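
A sketch of what such a correction could look like, using a hypothetical helper (this is not the actual _to_corrected_pandas_type implementation):

```python
# Hypothetical helper, for illustration only: walk the Spark schema and
# cast any DecimalType column in the pandas result to float64.
from pyspark.sql.types import DecimalType

def coerce_decimal_columns(pdf, spark_df):
    for field in spark_df.schema.fields:
        if isinstance(field.dataType, DecimalType):
            pdf[field.name] = pdf[field.name].astype("float64")
    return pdf

df_pd = coerce_decimal_columns(df_spark.toPandas(), df_spark)
```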

@HyukjinKwon (Member)

You're introducing a flag to convert. I think enabling the flag is virtually the same as calling a function to convert.

@HyukjinKwon (Member)

I would close this, @351zyf.

@351zyf (Author) commented Oct 30, 2018

OK
