[SPARK-18960][SQL][SS] Avoid double reading file which is being copied. #16370
uncleGen wants to merge 1 commit into apache:master
Conversation
|
Test build #70456 has started for PR 16370 at commit |
|
@AmplabJenkins retest it please |
|
retest this please. |
|
Test build #70473 has finished for PR 16370 at commit
|
|
unrelated errors, retest this please. |
|
Test build #70476 has finished for PR 16370 at commit
|
|
LGTM. |
|
A similar PR was rejected on the grounds that you should always move files instead of copying them. |
|
@zsxwing Thanks for your reminder!! |
|
I don't see any harm in ignoring these files, other comments notwithstanding. In fact, I think it's mandatory. The HDFS copy mechanism is basically copy (to a `._COPYING_` file), then move (rename). This is exactly what anyone copying from a different FS would have to do manually anyway, as you can't rename a file that isn't already on the FS, and you don't necessarily have the ability to write anywhere else. |
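The copy-then-rename pattern described in the comment above can be sketched on a local filesystem. This is only an illustration, not HDFS code: `copy_then_rename` and `demo` are hypothetical helpers, and the `._COPYING_` suffix is assumed to match the one the HDFS shell uses for in-flight copies.

```python
import os
import shutil
import tempfile

# Assumed suffix for in-flight copies (the one the HDFS shell is understood to use).
COPYING_SUFFIX = "._COPYING_"


def copy_then_rename(src, dst):
    """Copy src to a temporary sibling of dst, then rename it into place."""
    tmp = dst + COPYING_SUFFIX
    # Slow step: while this runs, the temporary file is visible in the
    # target directory and may be only partially written.
    shutil.copyfile(src, tmp)
    # Fast step: a rename within the same directory makes the final name appear.
    os.replace(tmp, dst)


def demo():
    """Copy one file into a fresh target directory and list the result."""
    with tempfile.TemporaryDirectory() as source_dir, \
            tempfile.TemporaryDirectory() as target_dir:
        src = os.path.join(source_dir, "data.csv")
        with open(src, "w") as f:
            f.write("a,b\n1,2\n")
        copy_then_rename(src, os.path.join(target_dir, "data.csv"))
        return sorted(os.listdir(target_dir))
```

A directory listing taken during the slow step would show only `data.csv._COPYING_`; one taken after the rename shows only `data.csv`. A reader that picks up both names at different times would process the same data twice.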
|
@zsxwing Is there any further feedback? |
|
Merged to master |
|
Sorry for the delay. LGTM. |
## What changes were proposed in this pull request?

In HDFS, when we copy a file into the target directory, a temporary `._COPYING_` file exists for a period of time. The duration depends on the file size. If we do not skip this file, we may read the same data twice.

## How was this patch tested?

Updated the unit test.

Author: uncleGen <hustyugm@gmail.com>

Closes apache#16370 from uncleGen/SPARK-18960.
## What changes were proposed in this pull request?

In HDFS, when we copy a file into the target directory, a temporary `._COPYING_` file exists for a period of time. The duration depends on the file size. If we do not skip this file, we may read the same data twice.

## How was this patch tested?

Updated the unit test.
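
The idea behind the change can be illustrated with a small name-based filter. This is a minimal sketch, not the code from the patch: `should_skip` and `visible_files` are hypothetical names, and the suffix is assumed to be `._COPYING_`, the one the HDFS shell appends to in-flight copies.

```python
# Assumed suffix for files that are still being copied into the directory.
COPYING_SUFFIX = "._COPYING_"


def should_skip(file_name):
    """Return True if the file looks like an in-flight copy."""
    return file_name.endswith(COPYING_SUFFIX)


def visible_files(file_names):
    """Keep only the files that are safe to read."""
    return [name for name in file_names if not should_skip(name)]
```

For example, `visible_files(["a.json", "b.json._COPYING_"])` keeps only `a.json`. Once the copy finishes and the file is renamed to `b.json`, the next listing picks it up exactly once.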