[SPARK-32846][SQL][PYTHON] Support createDataFrame from an RDD of pd.DataFrames #29719
Conversation
Can one of the admins verify this patch? |
I think you're already able to do the same thing with … |
Thank you @HyukjinKwon, the issue is that this only applies to DataFrames, which means that only Spark-supported types can be used as input. I believe this can enable more seamless integration with Python packages that do not natively support Spark. |
Friendly ping @HyukjinKwon, is anything more needed for a review? |
@BryanCutler @ueshin @viirya Who would be the right person to review this? |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Sync with master
…andas-rdd-to-spark-df-SPARK-3284
Conflicts:
python/pyspark/sql/pandas/types.py
python/pyspark/sql/session.py
Hi @HyukjinKwon @BryanCutler, I've synced with master, hoping this could get reviewed and get the "stale" tag removed |
Any chance of getting it through? I'd love to have the feature. My use case is many to many pandas dataframes/arrow tables. applyInPandas with cogroup is just many to one... |
This would be a really useful feature, if it could be merged. |
This looks like a potentially useful feature.
pandasRDD=True creates a DataFrame from an RDD of pandas dataframes
(currently only supported using arrow) |
Would it make sense to do type checking here instead of having a flag?
Can we somehow define/get the type of the RDD[py-object] without evaluating the first element of it?
If not, then the RDD might contain any type of object, so the pandasRDD option is used as a way to differentiate between initialization from an RDD and an RDD of pd.DataFrames.
Thank you for reviewing! Please let me know if there's anything else I can do to get this merged.
That's a good point. If we look in session.py, we can see _createFromRDD does its magic there. Personally I would instead put this logic inside of _inferSchema and toInternal respectively, but I'm coming at this from more of a core-Spark dev perspective; maybe @HyukjinKwon has a different view.
Even if we don't refactor this back into session.py, I'd encourage you to look at session.py and consider structuring this in a similar way so that we don't have to have this flag here.
I agree that this seems to fit well into _inferSchema & _createFromRDD, although we still would need some way to discern between an RDD of DataFrames and other types when the user provides a schema (and we don't want to peek into the first item).
Do you think it would be better to move the pandas flag into _createFromRDD?
So _inferSchema does effectively peek into the first element. I think we could just put the logic down inside of the map and then the user doesn't have to specify this flag.
That's in case the user wants to infer the schema (so we have to peek into the RDD), but in case the user does specify the schema, there's no need to peek, and we're left with no other option to tell which code path we need.
So let's say the user specifies a schema; in that case, inside of _createFromRDD we can just look at the type of each element that we're processing and see if it's a DataFrame or a Row or a Dictionary and dispatch the logic there. What do you think? Or is there a reason I'm missing why we couldn't do the dispatch inside of _createFromRDD based on type?
Well in case the user specifies a schema, the entire process is lazy, so there's no need to evaluate any of the rdd elements...
if we keep everything lazy and map each element to either a row or RecordBatch, we would still need to know which path to take, e.g. for RecordBatches we need to call:
from pyspark.sql.dataframe import DataFrame
jrdd = rb_rdd._to_java_object_rdd()
jdf = self._jvm.PythonSQLUtils.toDataFrame(jrdd, schema.json(), self._wrapped._jsqlContext)
df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
and for Rows we need to call:
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
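For illustration only, here is a rough, untested sketch of the per-element dispatch idea described above (assuming spark, rdd and schema are in scope; to_rows is a hypothetical helper, not code from this PR). Note that routing pandas DataFrames through the Row path this way bypasses the Arrow conversion used by the RecordBatch path shown above:

import pandas as pd
from pyspark.sql import Row

def to_rows(element):
    # Hypothetical dispatch: explode a pandas DataFrame into Rows; pass Rows/dicts through unchanged
    if isinstance(element, pd.DataFrame):
        return [Row(**rec) for rec in element.to_dict(orient="records")]
    return [element]

row_rdd = rdd.flatMap(to_rows)
df = spark.createDataFrame(row_rdd, schema=schema)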
Hey @linar-jether, just pinging to see if this is still something you're working on. |
@linar-jether, just to clarify, what's your use case? You can leverage a binary DataFrame and call RDD -> DataFrame[binary] -> DataFrame.mapInPandas(func_binary_to_pandas, schema) |
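A rough, untested sketch of that suggestion (assuming an active SparkSession named spark, an RDD of pandas DataFrames named rdd_of_pdfs, and a target schema target_schema; all names are placeholders) might look like:

import pickle

# Wrap each pandas DataFrame into a single pickled binary cell
bin_df = spark.createDataFrame(
    rdd_of_pdfs.map(lambda pdf: (bytearray(pickle.dumps(pdf)),)),
    "data binary",
)

def binary_to_pandas(iterator):
    # Each incoming batch has one 'data' column holding pickled pandas DataFrames
    for batch in iterator:
        for blob in batch["data"]:
            yield pickle.loads(bytes(blob))

result = bin_df.mapInPandas(binary_to_pandas, schema=target_schema)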
My biggest concern is that the current API here encourages more use of RDDs, whereas Spark generally encourages using the DataFrame APIs to leverage the Spark SQL optimizer, etc. |
The exact reason I need this API is that Spark DataFrame optimization does not work for the complex transformations I have. A simple switch to my own custom wrapper on RDDs (a lame analogue of this API) gave me 4-20 times better performance and a much lower cost of pandas-to-PySpark migration. So for me it is not a concern. |
@HyukjinKwon @holdenk My use case is efficiently creating a Spark DataFrame from a distributed dataset; Spark currently supports doing this either with remote storage (e.g. writing to parquet files) or using the … The suggestion to use … I must say that we use this internally quite a bit (since Spark 2.X) and it greatly improves productivity. Some example use cases: reading large climatological datasets using xarray and treating them as a single Spark DataFrame. I believe this feature can improve interoperability with other Python libraries, similar to what can be done with Dask's … |
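Purely to illustrate the climatological use case above (made-up paths and a hypothetical my_schema; pandasRDD is the flag proposed in this PR), the pattern would be roughly:

import xarray as xr

paths = ["s3://bucket/era5_1979.nc", "s3://bucket/era5_1980.nc"]  # made-up example paths

# Each task opens one dataset with xarray and returns it as a pandas DataFrame
pdf_rdd = spark.sparkContext.parallelize(paths, len(paths)).map(
    lambda p: xr.open_dataset(p).to_dataframe().reset_index()
)

df = spark.createDataFrame(pdf_rdd, schema=my_schema, pandasRDD=True)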
@linar-jether, would you mind sharing your pseudo codes? I am trying to figure out the general approach to address this problem (e.g., SPARK-32846, SPARK-30153, SPARK-26413). |
@HyukjinKwon What do you mean by pseudo codes? My initial snippet for using pandas<->arrow<->spark conversions was done using this: … And this comment for converting directly from arrow RecordBatches without using pandas: https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5#gistcomment-3452086 (works with spark 3.x). Basically, all of the logic for creating a dataframe from Arrow … |
Regarding the issues you've mentioned, I think a simple … If you feel this is a good approach, I can add an option for … |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Have there been any updates for adding this kind of functionality since this pull request? Being able to take an RDD of pyarrow RecordBatches or pandas DataFrames and turn it into a Spark DataFrame would be very useful for turning a dataset distributed at the workers outside of Spark into a Spark DataFrame for analysis. Even if an API like this hasn't been added, is there any guidance on achieving this (building a Spark DataFrame from an RDD of pyarrow RecordBatches or pandas DataFrames) in Spark 3.4/3.5? As far as I can tell, the code in this pull request no longer works on the latest versions of Spark because … |
I think currently RDD[python arrow table] -> DataFrame would be even better as it would allow not only pandas but polars and pyarrow as well. |
@samkumar @JacekPliszka Hopefully this can be incorporated into a proper api...

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.pandas.types import from_arrow_schema
import pyarrow as pa
import pandas as pd
import numpy as np
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable


def map_to_record_batch(i):
    df = pd.DataFrame(np.random.randn(10, 4) * i, columns=list('ABCD'))
    return pa.RecordBatch.from_pandas(df)


if __name__ == '__main__':
    spark = SparkSession.builder.appName("Python Arrow-in-Spark example").getOrCreate()

    # Create an RDD of Arrow RecordBatch objects
    ardd = spark.sparkContext.parallelize([1, 2, 3, 4], 4)
    ardd = ardd.map(map_to_record_batch).cache()  # cache to avoid recomputing after inferring schema

    # Peek at the first record batch to infer schema
    arrow_schema = ardd.first().schema
    spark_schema = from_arrow_schema(arrow_schema)

    # Convert RDD[RecordBatch] to RDD[bytearray] for serialization
    ardd = ardd.map(lambda x: bytearray(x.serialize()))

    # Create a spark DataFrame from RDD[bytearray] and schema
    jrdd = ardd._to_java_object_rdd()
    jdf = spark._jvm.PythonSQLUtils.toDataFrame(jrdd, spark_schema.json(), spark._jsparkSession)
    df = DataFrame(jdf, spark)
    df._schema = spark_schema
    df.show() |
Fails for me here with py4j.Py4JException: Method toDataFrame([class org.apache.spark.api.java.JavaRDD, class java.lang.String, class org.apache.spark.sql.SparkSession]) does not exist. This is a single-node cluster and the method does exist... strange. According to Databricks this is 3.5.0 - on 3.3.0 it failed earlier. Same error on my PC. |
You could even do something like this:

import pyarrow
spark_schema = from_arrow_schema(ardd.first().schema)
ser_ardd = ardd.map(lambda x: bytearray(x.serialize()))
df = spark.createDataFrame(ser_ardd, "binary")
df.mapInArrow(pyarrow.deserialize, schema=spark_schema) |
My pyarrow.deserialize has no text_signature. But this worked as a workaround:

def f(obj): …
df1 = df.mapInArrow(f, schema=spark_schema)

Still df1.collect() failed - does it work for you? |
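An untested sketch of how the function passed to mapInArrow could be written instead (using pyarrow's IPC reader rather than the deprecated pyarrow.deserialize, and assuming arrow_schema is the batch schema captured on the driver as in the earlier snippet); whether this also fixes the collect() failure is unverified:

import pyarrow as pa

def deserialize_batches(iterator):
    for batch in iterator:
        # column 0 holds one serialized RecordBatch per row
        for blob in batch.column(0):
            yield pa.ipc.read_record_batch(pa.py_buffer(blob.as_py()), arrow_schema)

df1 = df.mapInArrow(deserialize_batches, schema=spark_schema)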
What changes were proposed in this pull request?
Added support to createDataFrame to receive an RDD of pd.DataFrame objects, and convert them using arrow into an RDD of record batches, which is then directly converted to a spark DF.

Added a pandasRDD flag to createDataFrame to distinguish between RDD[pd.DataFrame] and other RDDs without peeking into their content.

How was this patch tested?
Added a new test for creating a spark DF from an RDD of pandas dataframes.
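A minimal usage sketch of the proposed API (assuming an active SparkSession named spark; the exact argument forms are illustrative, but pandasRDD is the flag added by this PR) could look like:

import pandas as pd
import numpy as np

# Each RDD element is a whole pandas DataFrame; the proposed path converts them via Arrow
pdf_rdd = spark.sparkContext.parallelize(range(4)).map(
    lambda i: pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"])
)

df = spark.createDataFrame(pdf_rdd, schema="a double, b double", pandasRDD=True)
df.show()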