[SPARK-20448][DOCS] Document how FileInputDStream works with object storage #17743

steveloughran
Contributor

Change-Id: I88c272444ca734dc2cbc2592607c11287b90a383

What changes were proposed in this pull request?

The documentation on File DStreams is enhanced to

  1. Detail the exact timestamp logic for examining directories and files.
  2. Detail how object stores differ from filesystems, and so why using them as a source of data should be treated with caution, possibly publishing data to the store differently (direct PUTs as opposed to stage + rename).
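
For readers unfamiliar with the "stage + rename" pattern named in item 2, here is a minimal Python sketch of it on a local filesystem. It is not taken from the patch: the function name `publish_by_rename` and the directory and file names are hypothetical. The idea is that a file is written in full somewhere the stream does not watch, then moved into the monitored directory in one step so a reader never sees a partial file.

```python
import os
import tempfile


def publish_by_rename(data: bytes, staging_dir: str, monitored_dir: str, name: str) -> str:
    """Hypothetical helper: write data in a staging directory, then rename it
    into the monitored directory so it appears there in a single step."""
    staged_path = os.path.join(staging_dir, name)
    with open(staged_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before publishing
    final_path = os.path.join(monitored_dir, name)
    # On POSIX, os.replace is atomic when both paths are on the same filesystem.
    os.replace(staged_path, final_path)
    return final_path


def demo() -> bytes:
    """Publish one small file and return what a reader of the monitored dir sees."""
    with tempfile.TemporaryDirectory() as root:
        staging = os.path.join(root, "staging")
        monitored = os.path.join(root, "incoming")
        os.makedirs(staging)
        os.makedirs(monitored)
        final_path = publish_by_rename(b"a,b\n1,2\n", staging, monitored, "part-0000.csv")
        assert os.listdir(staging) == []  # nothing is left behind in staging
        with open(final_path, "rb") as f:
            return f.read()
```

On a real filesystem the rename is a cheap metadata operation. Object store connectors typically implement rename as a copy followed by a delete, which is one reason the description suggests a direct PUT to the destination instead.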

How was this patch tested?

n/a

Change-Id: Icef71513c228fd8d61e23a03f16b8effc89fe8eb
@SparkQA

SparkQA commented Apr 24, 2017

Test build #76103 has finished for PR 17743 at commit c83af37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 24, 2017

Test build #76104 has finished for PR 17743 at commit 1e620ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Just reread this; still looks correct. Review comments welcome.

Member

@srowen srowen left a comment

It looks OK to me. I believe the content is accurate, though I'm not as sure about the behavior of object stores; I take your word for it. Earlier I had wondered whether this level of detail is useful to the user, but I think it is here.

@steveloughran
Contributor Author

People don't realise how much object stores aren't file systems until they discover all their assumptions are broken.

Once you know how they work, you can set up a workflow which is more efficient and reliable.

@SparkQA

SparkQA commented Sep 23, 2017

Test build #3934 has finished for PR 17743 at commit 1e620ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Sep 23, 2017

Merged to master

@asfgit asfgit closed this in c792aff Sep 23, 2017
@steveloughran steveloughran deleted the cloud/SPARK-20448-document-dstream-blobstore branch March 9, 2023 15:08