[WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs #24965
Conversation
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}

abstract class BaseArrowPythonRunner[T](
This is just some common functionality that I needed for both the new data passing mechanism and the existing Arrow streaming mechanism. I've broken it out here mainly because it made it easier for me to track what new functionality I'd actually added. I don't think a proper solution would really have this class hierarchy.
import org.apache.arrow.vector.ipc.message.{ArrowRecordBatch, MessageSerializer}

class InterleavedArrowWriter(
    leftRoot: VectorSchemaRoot,
This is analogous to org.apache.arrow.vector.ipc.ArrowWriter but allows interleaved dataframes to be sent. I suspect it could all be more memory efficient if we had a different interface that allowed the left batch to be sent before the right batch is loaded.
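The interleaving idea described above can be sketched in plain Python. This is only an illustration of the alternation between the two sides: the real InterleavedArrowWriter writes Arrow record batches over a stream, and all names here are invented for the sketch, with plain list elements standing in for batches.

```python
# Sketch of interleaved batch sending: alternate left/right batches for a
# cogroup so the consumer never has to buffer one whole side first.
# (Illustrative only; the real writer emits Arrow record batches.)

def interleave_batches(left_batches, right_batches):
    """Yield ('L', batch) / ('R', batch) pairs, alternating sides until
    both sides are exhausted."""
    left_it, right_it = iter(left_batches), iter(right_batches)
    while True:
        sent = False
        for side, it in (("L", left_it), ("R", right_it)):
            try:
                yield side, next(it)
                sent = True
            except StopIteration:
                pass  # this side is exhausted; keep draining the other
        if not sent:
            return

print(list(interleave_batches([1, 2, 3], ["a"])))
# → [('L', 1), ('R', 'a'), ('L', 2), ('L', 3)]
```

Once one side runs out, the remaining batches of the other side are sent back to back, so the reader only ever needs the current left/right pair in memory.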
def __init__(self, stream):
    import pyarrow as pa
    self._schema1 = pa.read_schema(stream)
I wanted to read these using the message reader as well, but for some reason pa.read_schema(self_reader.read_next_message()) didn't work.
def merge_pandas(left, right):
    return pd.merge(left, right, how='outer', on=['k', 'id'])
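For context, the merge_pandas helper used in the test is just a standard pandas outer join on the two key columns. A minimal standalone sketch (the example frames and values are invented for illustration):

```python
import pandas as pd

def merge_pandas(left, right):
    # Outer join on both key columns, as in the test helper above.
    return pd.merge(left, right, how='outer', on=['k', 'id'])

# Hypothetical left/right sides of a cogroup.
left = pd.DataFrame({'id': [1, 2], 'k': ['a', 'b'], 'v1': [10, 20]})
right = pd.DataFrame({'id': [1, 3], 'k': ['a', 'c'], 'v2': [100, 300]})

merged = merge_pandas(left, right)
print(merged)
# Three rows: (1, 'a') matched on both sides, (2, 'b') left-only with
# v2 missing, (3, 'c') right-only with v1 missing.
```

The outer join keeps unmatched keys from both sides, which is why the test compares against pd.merge of the full toPandas() frames further down.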
# TODO: Grouping by a string fails to resolve here as analyzer cannot determine side |
As the comment says, l.groupBy('id').cogroup(r.groupBy('id')) will fail because the resolver can't work out whether each 'id' column should be resolved from l or r. I'm a bit unsure as to the best way of solving this.
…7463-poc

# Conflicts:
#	core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
#	python/pyspark/rdd.py
#	python/pyspark/worker.py
ok to test
Test build #106892 has finished for PR 24965 at commit
def wrapped(left, right):
Is left a list of pd.Series here? Probably name them left_series and value_series to be more readable?
Yes, they are: they are the value series for the left and right sides of the cogroup respectively. Agreed that the names aren't the best. I'll improve them when I do a tidy-up.
@@ -47,8 +47,8 @@ import org.apache.spark.sql.types.{NumericType, StructType}
  */
 @Stable
 class RelationalGroupedDataset protected[sql](
-    df: DataFrame,
-    groupingExprs: Seq[Expression],
+    private val df: DataFrame,
Are these changes needed?
I think they are: I need the accessors in flatMapCoGroupsInPandas.
@d80tb7 Thanks for working on this! I think at the high level this makes sense. I am a slight +1 for introducing a new way of serializing data other than using the Arrow stream format. The code doesn't seem too complicated to me. I think doing this in other ways will be more complicated. @BryanCutler @HyukjinKwon WDYT?
Thanks for working on this @d80tb7 , I have been busy this week but will try to take a look soon. I would really prefer to stick with the Arrow stream format if at all possible. Could the Scala side send 2 complete Arrow streams to Python sequentially for each group? Then the worker would convert each stream into a Pandas DataFrame to evaluate the cogroup UDF. It would add some overhead since it will be sending more streams, but I think it will be minimal. WDYT? |
Hi @BryanCutler Yes, that seems like a valid approach. Let me see if I can produce another prototype based on that approach and see if we can compare them. I think this solution is probably more flexible in the long run, but there would obviously be a cost to defining and maintaining our own custom streaming format (even if it is largely the same as the arrow format). If we have code examples for both it'll be easier to see. |
@BryanCutler I think the main issue with the approach that you suggested is that the Python worker needs to hold much more data. For example, assuming each Arrow stream has 10 batches, in order to process the first cogroup, the worker will need to read all 10 batches from the left table and the first batch from the right table, so 11 batches in total instead of 2. I think that could mean significantly more memory usage in the Python worker. If we want to send two Arrow streams, I think we would need to do it with two separate connections between Python and Java so the Python worker can alternate between the two streams. I think that could be more complicated, but I'm not entirely sure. Is this what you prefer?
@icexelloss - my understanding of @BryanCutler's idea is that he wants a completely separate Arrow stream for every group. In this case we would only have to hold 2 batches in memory at any one time, albeit at the cost of paying the stream overhead (schema etc.) for every group. Assuming that the stream overhead isn't significant (and I think it's reasonable to assume it won't be), this should work. I think the implementation might be a bit more tricky (you'd have to send some sort of marker to indicate that all the Arrow streams have finished), but hopefully a PoC could help with this.
Yeah I am happy as long as we don't hold more than 2 batches in memory at the same time :)
Yes, this is what I was getting at. Each complete stream is one group, so there wouldn't be more than the 2 groups required in memory at a time. Btw, the reason I prefer to stick to the stream protocol is that Arrow already tests this thoroughly and while sending messages piecemeal will probably work fine, it's just not actively being tested between Java and C++/Python.
Ok I've put another PR (#24981) up to show an implementation that uses the Arrow Stream format. For what it's worth I marginally prefer it to this implementation, but would be happy with either. Let me know what you think.
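The stream-per-group protocol discussed above can be sketched in toy form. This is a pure-Python stand-in under stated assumptions: the real implementation uses the Arrow IPC stream format, and the explicit end-of-groups marker is taken from the discussion, not from the final protocol; all names are invented for the sketch.

```python
# Toy sketch: one "stream" per cogroup, terminated by an explicit marker,
# so the reader only ever holds one left/right pair at a time.

END_OF_GROUPS = None  # marker meaning "no more (left, right) streams follow"

def send_groups(groups):
    """Emit each cogroup as its own (left, right) message, then a marker."""
    for left_batches, right_batches in groups:
        yield ("group", left_batches, right_batches)
    yield END_OF_GROUPS

def receive_groups(wire):
    """Consume one group's batches at a time; stop at the marker."""
    for msg in wire:
        if msg is END_OF_GROUPS:
            return
        _, left, right = msg
        yield left, right

groups = [([1, 2], [3]), ([4], [5, 6])]
received = list(receive_groups(send_groups(groups)))
print(received)
# → [([1, 2], [3]), ([4], [5, 6])]
```

Because each group is a self-contained stream, the reader never has to buffer one whole side before seeing the other, which is the memory property discussed above; the per-group schema overhead is the cost.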
.merge(l.toPandas(), r.toPandas(), how='outer', on=['k', 'id'])

assert_frame_equal(expected, result, check_column_type=_check_column_type)
Hi @d80tb7, I work with Li and am also interested in cogroup.
Can I ask how you were able to get your test to run? I wasn't able to run it without the following snippet:
if __name__ == "__main__":
    from pyspark.sql.tests.test_pandas_udf_grouped_map import *

    try:
        import xmlrunner
        testRunner = xmlrunner.XMLTestRunner(output='target/test-reports', verbosity=2)
    except ImportError:
        testRunner = None
    unittest.main(testRunner=testRunner, verbosity=2)
taken from the other similar tests like test_pandas_udf_grouped_map.
Hi @hjoo
So far I've just been running it via PyCharm's unit test runner under Python 3. I suspect the problem you had was that the iterator I added wasn't compatible with Python 2 (sorry!). I've fixed the iterator and added a similar snippet to the one you provided above. Now I can run it using python/run-tests --testnames pyspark.sql.tests.test_pandas_udf_cogrouped_map
If you still have problems let me know the error you get and I'll take a look.
Test build #107109 has finished for PR 24965 at commit
Test build #107111 has finished for PR 24965 at commit
functionExpr: Expression,
output: Seq[Attribute],
left: LogicalPlan,
right: LogicalPlan) extends BinaryNode {
(BTW, check out this style guide: https://github.com/databricks/scala-style-guide)
Yea, I prefer to stick to the Arrow format as well; it will make it easier to understand too.
Shall we leave this closed and go ahead with #24981?
Can one of the admins verify this patch?
Yes, I'll close this. Essentially this has been superseded by #24981.
I opened a PR to address the refactoring stuff at #25989 |
…oup arrow runner and its plan

### What changes were proposed in this pull request?

This PR proposes to avoid abstract classes introduced at #24965 but instead uses trait and object.

- `abstract class BaseArrowPythonRunner` -> `trait PythonArrowOutput` to allow mix-in

**Before:**

```
BasePythonRunner
├── BaseArrowPythonRunner
│   ├── ArrowPythonRunner
│   └── CoGroupedArrowPythonRunner
├── PythonRunner
└── PythonUDFRunner
```

**After:**

```
└── BasePythonRunner
    ├── ArrowPythonRunner
    ├── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner
```

- `abstract class BasePandasGroupExec` -> `object PandasGroupUtils` to decouple

**Before:**

```
└── BasePandasGroupExec
    ├── FlatMapGroupsInPandasExec
    └── FlatMapCoGroupsInPandasExec
```

**After:**

```
├── FlatMapGroupsInPandasExec
└── FlatMapCoGroupsInPandasExec
```

### Why are the changes needed?

The problem is that the R code path is being matched with the Python side:

**Python:**

```
└── BasePythonRunner
    ├── ArrowPythonRunner
    ├── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner
```

**R:**

```
└── BaseRRunner
    ├── ArrowRRunner
    └── RRunner
```

I would like to match the hierarchy and decouple other stuff for now if possible. Ideally we should deduplicate both code paths. The internal implementation is also similar intentionally.

The `BasePandasGroupExec` case is similar as well. R (with Arrow optimization, in particular) has some code duplicated with Pandas UDFs:

- `FlatMapGroupsInRWithArrowExec` <> `FlatMapGroupsInPandasExec`
- `MapPartitionsInRWithArrowExec` <> `ArrowEvalPythonExec`

In order to prepare for deduplication here as well, it might be better to avoid changing the hierarchy in the Python side alone.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Locally tested existing tests. Jenkins tests should verify this too.

Closes #25989 from HyukjinKwon/SPARK-29317.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
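The abstract-class-to-mix-in refactoring in that follow-up commit can be illustrated with a rough Python analogue. All class names below are invented for the illustration (the real code is Scala and the classes are Spark internals); the point is only the shape of the hierarchy before and after.

```python
# Before: a shared abstract base class adds a layer to the hierarchy
# purely to hold common code.

class BasePythonRunner:
    def run(self):
        # delegate to whatever output-reading strategy the subclass provides
        return self.read_output()

class BaseArrowRunner(BasePythonRunner):  # extra layer just to share code
    def read_output(self):
        return "arrow-batch"

class ArrowRunner(BaseArrowRunner):
    pass

# After: the shared code becomes a mix-in, so concrete runners sit
# directly under the common base and the hierarchy stays flat.

class ArrowOutputMixin:
    def read_output(self):
        return "arrow-batch"

class FlatArrowRunner(ArrowOutputMixin, BasePythonRunner):
    pass

print(ArrowRunner().run())      # → arrow-batch
print(FlatArrowRunner().run())  # → arrow-batch
```

Both shapes behave the same; the mix-in version simply avoids committing the class tree to an intermediate abstract layer, which is what the commit message's before/after diagrams show.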
This is a rough first cut of a Pandas UDF cogroup implementation. Currently implemented is:
The code is still pretty rough with the main caveats being:
At this point I think I'd like to focus on: