Skip to content

[SPARK-18960][SQL][SS] Avoid double reading file which is being copied.#16370

Closed
uncleGen wants to merge 1 commit intoapache:masterfrom
uncleGen:SPARK-18960
Closed

[SPARK-18960][SQL][SS] Avoid double reading file which is being copied.#16370
uncleGen wants to merge 1 commit intoapache:masterfrom
uncleGen:SPARK-18960

Conversation

@uncleGen
Copy link
Contributor

What changes were proposed in this pull request?

In HDFS, when we copy a file into target directory, there will a temporary ._COPY_ file for a period of time. The duration depends on file size. If we do not skip this file, we will may read the same data for two times.

How was this patch tested?

update unit test

@uncleGen
Copy link
Contributor Author

uncleGen commented Dec 21, 2016

cc @zsxwing @liancheng

@SparkQA
Copy link

SparkQA commented Dec 21, 2016

Test build #70456 has started for PR 16370 at commit 1d248c3.

@uncleGen
Copy link
Contributor Author

@AmplabJenkins retest it please

@viirya
Copy link
Member

viirya commented Dec 21, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Dec 21, 2016

Test build #70473 has finished for PR 16370 at commit 1d248c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen
Copy link
Contributor Author

unrelated errors, retest this please.

@SparkQA
Copy link

SparkQA commented Dec 21, 2016

Test build #70476 has finished for PR 16370 at commit 1d248c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable to me.

@viirya
Copy link
Member

viirya commented Dec 22, 2016

LGTM.

@zsxwing
Copy link
Member

zsxwing commented Dec 22, 2016

A similar PR got refused as copy is not the correct way to stream files: #3203

You should always move files instead of copying them.

@uncleGen
Copy link
Contributor Author

uncleGen commented Dec 23, 2016

@zsxwing Thanks for your reminder!!
In some ways, we really can evade this issue, just like not use -cp. But this is an user-side behaviour, we can not ensure everyone knows and uses correct ways to move data. It may confuse users if they do not know this issue. Besides, current change is so tiny that does not have any harm to codes. So, IMHO, it is OK to add the check for protection. What do you think?

@srowen
Copy link
Member

srowen commented Dec 23, 2016

I don't see a harm in ignoring these files, other comments notwithstanding. In fact, I think it's mandatory. The HDFS copy mechanism basically is copy (to a .COPYING file), then move (rename). This is exactly what anyone copying from a different FS would have to do anyway manually as you can't rename a file that isn't already on the FS, and don't necessarily have ability to write anywhere else.

@uncleGen
Copy link
Contributor Author

@zsxwing Is there any farther feedback?

@uncleGen
Copy link
Contributor Author

@srowen It looks like @zsxwing has no time to feed back.

@srowen
Copy link
Member

srowen commented Dec 28, 2016

Merged to master

@asfgit asfgit closed this in 76e9bd7 Dec 28, 2016
@zsxwing
Copy link
Member

zsxwing commented Dec 28, 2016

Sorry for the delay. LGTM.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 29, 2016
## What changes were proposed in this pull request?

In HDFS, when we copy a file into target directory, there will a temporary `._COPY_` file for a period of time. The duration depends on file size. If we do not skip this file, we will may read the same data for two times.

## How was this patch tested?
update unit test

Author: uncleGen <hustyugm@gmail.com>

Closes apache#16370 from uncleGen/SPARK-18960.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

In HDFS, when we copy a file into target directory, there will a temporary `._COPY_` file for a period of time. The duration depends on file size. If we do not skip this file, we will may read the same data for two times.

## How was this patch tested?
update unit test

Author: uncleGen <hustyugm@gmail.com>

Closes apache#16370 from uncleGen/SPARK-18960.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants