Skip to content

[SPARK-27971[SQL][R] MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the first batch#24818

Closed
HyukjinKwon wants to merge 1 commit intoapache:masterfrom
HyukjinKwon:SPARK-27971
Closed

[SPARK-27971[SQL][R] MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the first batch#24818
HyukjinKwon wants to merge 1 commit intoapache:masterfrom
HyukjinKwon:SPARK-27971

Conversation

@HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR is the same fix as #24816 but in vectorized dapply in SparkR.

How was this patch tested?

Manually tested.

@SparkQA
Copy link

SparkQA commented Jun 7, 2019

Test build #106264 has finished for PR 24818 at commit d848354.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

cc @BryanCutler and @mengxr

val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
assert(outputTypes == actualDataTypes, "Invalid schema from dapply(): " +
s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
batch.rowIterator.asScala
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is that it was wanting to check the data type at the first batch only. I think this fix is right.

@dongjoon-hyun
Copy link
Member

Retest this please

@SparkQA
Copy link

SparkQA commented Jun 8, 2019

Test build #106309 has finished for PR 24818 at commit d848354.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Hi, @HyukjinKwon .
#24816 said Issued fixed in #24734 but that PR might takes longer to merge. and #24734 is merged yesterday. Do we still need this?

@HyukjinKwon
Copy link
Member Author

Yea here's a different code path for R.

@HyukjinKwon
Copy link
Member Author

Let me merge this one. R vectorization and Python vectorization code paths are being matched for the possibility to deduplicate both code path.

@felixcheung
Copy link
Member

just curious, is it possible to share more infra between python and R so we don't constantly trying to match up with python afterwards?

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Jun 9, 2019

Yea .. I was thinking about that. I just wanted to avoid .. like .. ugly deduplication ... let me probably try it in the near future with some neat deduplication. Let me cc you guys there.

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
…t eagerly read the first batch

## What changes were proposed in this pull request?

This PR is the same fix as apache#24816 but in vectorized `dapply` in SparkR.

## How was this patch tested?

Manually tested.

Closes apache#24818 from HyukjinKwon/SPARK-27971.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon deleted the SPARK-27971 branch March 3, 2020 01:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants