Timestamps of versions change when files are copied #192

Closed
lmcoy opened this issue Sep 25, 2019 · 6 comments
Labels
acknowledged This issue has been read and acknowledged by Delta admins documentation Improvements or additions to documentation

Comments

@lmcoy

lmcoy commented Sep 25, 2019

I have a (tiny) Delta table with multiple versions. When I copy the table's files to another location, the timestamps in the history are wrong. As far as I understand it now, the timestamp recorded in the Delta log JSON files is overridden by the modification timestamp of those files.
I want to use the copy of the table as input for tests of my data loading code. Is there a way to preserve the timestamps (i.e. use the timestamps inside the JSON), or what would be the best approach for such tests?

@zsxwing
Member

zsxwing commented Sep 25, 2019

We use the modification timestamps of the JSON files in the _delta_log directory. This should not be an issue unless you are using timestamps for time travel.

You can check whether your file system provides a command to copy files without changing their timestamps, or write code to update the modification timestamps of the JSON files in the _delta_log directory.
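
For a plain local file system, one way to copy without changing the timestamps is a copy that preserves file metadata. The following is a minimal Python sketch, assuming both paths are on local disks. The function name `copy_table_preserving_mtime` is made up for illustration and is not part of Delta Lake:

```python
import shutil


def copy_table_preserving_mtime(src: str, dst: str) -> None:
    """Copy a table directory, including _delta_log, to a new location.

    shutil.copytree uses shutil.copy2 by default, which attempts to
    preserve file metadata such as the modification time.
    Assumption: src and dst are local paths and dst does not exist yet.
    """
    shutil.copytree(src, dst)
```

Whether modification times survive depends on the target file system. Object stores and many transfer tools assign a new timestamp on upload, in which case this approach does not help.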

@lmcoy
Author

lmcoy commented Sep 25, 2019

> We use timestamps of JSON files in _delta_log directory. But it should not be an issue unless you are using timestamps to do the time travel.

I noticed this when I tried to use timestamps for time travelling. I got an error message about the timestamps and was really confused because I expected intuitively that the timestamps written to the json files would be used. I had to look at the actual code to find out that the modification timestamp is used instead.

> You can check your file system to see if it provides some command to copy files without changing the timestamps, or write codes to update modification timestamps of the JSON files in _delta_log directory.

I think I will have to look into setting the modification timestamps of the JSON files programmatically, because I have to copy the files across system boundaries and I don't think the timestamps can be preserved there.
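
One possible way to do this for a test fixture is to set each commit file's modification time from the `commitInfo.timestamp` stored inside it. This is only a sketch under assumptions: the copy lives on a local file system where `os.utime` works, each commit file is newline-delimited JSON with one action per line, and `commitInfo` may be missing. The helper names `commit_timestamp_ms` and `restore_mtimes` are hypothetical. Note that `commitInfo.timestamp` is not the value Delta originally used, so the result approximates the original history rather than reproducing it.

```python
import json
import os
from pathlib import Path
from typing import Optional


def commit_timestamp_ms(commit_file: Path) -> Optional[int]:
    """Return commitInfo.timestamp (ms since epoch) from a commit file, if present."""
    with commit_file.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            action = json.loads(line)
            info = action.get("commitInfo")
            if info and "timestamp" in info:
                return int(info["timestamp"])
    return None


def restore_mtimes(table_path: str) -> int:
    """Set the mtime of each commit file to its commitInfo.timestamp.

    Returns the number of files updated. Files without a commitInfo
    timestamp are left untouched.
    """
    log_dir = Path(table_path) / "_delta_log"
    updated = 0
    for commit_file in sorted(log_dir.glob("*.json")):
        ts_ms = commit_timestamp_ms(commit_file)
        if ts_ms is None:
            continue
        ts_ns = ts_ms * 1_000_000  # milliseconds to nanoseconds
        os.utime(commit_file, ns=(ts_ns, ts_ns))
        updated += 1
    return updated
```

Calling `restore_mtimes("/path/to/copied/table")` after the copy would rewrite the modification times in place; the path here is a placeholder.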

@XRayA4T

XRayA4T commented Jan 27, 2020

I was investigating the time travel functionality and picked up this issue.

We have a need to query our delta lake to see how the data looked at a point in time.

The JSON file contains a timestamp of the commit: {"commitInfo":{"timestamp":1579786725976,

Why not use this rather than the modified time of the file?

Then, if someone makes a mistake while replicating the data and the modification timestamps change, the Delta Lake table would not be affected.

@rdettai

rdettai commented Oct 5, 2021

> Why not use this rather than the modified time of the file?

I guess this is for performance reasons to avoid having to open the file itself, but I agree that it is risky to rely on this metadata.

Is this behavior mentioned in the documentation or the protocol? I couldn't find any clear statement about it.

@dennyglee
Contributor

Great call out @rdettai - we will update the Delta Documentation FAQ to include a call out for this scenario and as you surmised, using the file timestamp allows for faster retrieval. Will keep this issue open until the documentation PR has been merged and published. Thanks!

@dennyglee dennyglee self-assigned this Oct 7, 2021
@dennyglee dennyglee added acknowledged This issue has been read and acknowledged by Delta admins documentation Improvements or additions to documentation labels Oct 7, 2021
@zsxwing
Member

zsxwing commented Jan 25, 2022

> The JSON file contains a timestamp of the commit: {"commitInfo":{"timestamp":1579786725976,
>
> Why not use this rather than the modified time of the file?

The timestamp in commitInfo is generated before we create the JSON file, so using commitInfo.timestamp could easily produce timestamps that are not in the same order as the versions. It is also a client-side timestamp, which makes clock skew or an incorrect clock more likely to show up. Moreover, if we had to read the content of the JSON files to find the version for a given timestamp, we would need to open tons of JSON files. Currently we only need the file listing result, which is much faster.
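
To illustrate the listing-based lookup described above, here is a rough Python sketch. It is not Delta's actual implementation. It assumes a local `_delta_log` directory whose commit files are named with the zero-padded version number plus `.json`, and the function names `list_commits` and `version_at` are hypothetical. The point is that only directory metadata is read and no JSON file is opened.

```python
import bisect
import os
from pathlib import Path
from typing import List, Optional, Tuple


def list_commits(table_path: str) -> List[Tuple[int, int]]:
    """Return (version, mtime_ms) pairs sorted by version, from a listing only."""
    log_dir = Path(table_path) / "_delta_log"
    commits = []
    for entry in os.scandir(log_dir):
        stem, ext = os.path.splitext(entry.name)
        # Keep only commit files such as 00000000000000000003.json
        if ext == ".json" and stem.isdigit():
            commits.append((int(stem), entry.stat().st_mtime_ns // 1_000_000))
    return sorted(commits)


def version_at(commits: List[Tuple[int, int]], target_ms: int) -> Optional[int]:
    """Return the latest version whose timestamp is <= target_ms.

    Assumes timestamps do not decrease as versions increase; the binary
    search is only valid under that assumption.
    """
    timestamps = [ts for _, ts in commits]
    idx = bisect.bisect_right(timestamps, target_ms)
    return commits[idx - 1][0] if idx > 0 else None
```

The binary search relies on timestamps being ordered like versions, which is the ordering concern raised above for client-generated `commitInfo.timestamp` values.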

Since we have updated the docs for this issue (https://docs.delta.io/latest/delta-faq.html#can-i-copy-my-delta-lake-table-to-another-location), I'm going to close this.


5 participants