
How to detect if a file was actually saved? #1

Closed
cannatag opened this issue Jul 1, 2015 · 8 comments

cannatag commented Jul 1, 2015

Hi,
how can I detect whether a file was actually saved by the copy() method? If two files with different names have the same content, the hash is the same, so the first file is actually saved, while the second gets the same path as the first and the save operation is not performed. In either case the copy() method returns the address id of the file. How can I detect this? This information would be helpful for knowing the ratio of duplicate files.

Thanks,
Giovanni

dgilland commented Jul 1, 2015

Currently, you'd have to pre-check whether the file exists before calling put:

obj = ...  # path or readable object
stream = hashfs.hashfs.Stream(obj)  # if you need an iterable wrapper for object
file_id = fs.computehash(stream)

if fs.exists(file_id):
    # Handle duplicate file
    filepath = fs.idpath(file_id)
    address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)
else:
    # Handle unique file
    address = fs.put(stream)

This is somewhat suboptimal since the file hash is calculated twice if put is used in the else block. You could replace it with fs.copy, but then you'd only get the filepath back, so if you want the HashAddress in both cases, this may work better:

file_id = fs.computehash(stream)
filepath = fs.idpath(file_id)
address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)

if fs.exists(file_id):
    pass  # Handle duplicate file
else:
    # Handle unique file
    fs.copy(stream, file_id)

However, neither of these is ideal since you have to utilize more of the internal implementation of HashFS. I'm open to suggestions to make this less painful.
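
For completeness, a self-contained version of the second approach might look like this (just a sketch; the HashFS root directory and input file below are placeholders):

import hashfs
from hashfs.hashfs import Stream

fs = hashfs.HashFS('my-hashfs-root')  # placeholder root directory

with open('some-file.bin', 'rb') as fileobj:  # placeholder input file
    stream = Stream(fileobj)  # iterable wrapper, as above
    file_id = fs.computehash(stream)
    filepath = fs.idpath(file_id)
    address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)

    if fs.exists(file_id):
        # Handle duplicate file
        print('duplicate:', filepath)
    else:
        # Handle unique file
        fs.copy(stream, file_id)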

cannatag commented Jul 1, 2015

Hi Derrick, thanks for your response. Would it be possible to return a tuple from the copy() method? It could return the id and a boolean indicating whether the file was actually saved. Or, if you don't want to modify the copy() method, you could add another method (something like verify_copy()) that does the check:

    def verify_copy(self, stream, id, extension=None):
        """Copy the contents of `stream` onto disk with an optional file
        extension appended. The copy process uses a temporary file to store the
        initial contents and then moves this file to its final location.

        Returns a tuple of the file path and a boolean indicating whether the
        file was actually copied (False means a file with that id already
        existed on disk).
        """
        filepath = self.idpath(id, extension)

        # Only copy the file if it doesn't already exist.
        if not os.path.isfile(filepath):
            with tmpfile(stream, self.fmode) as fname:
                self.makepath(os.path.dirname(filepath))
                shutil.copy(fname, filepath)
            return filepath, True
        else:
            return filepath, False
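
A caller could then detect whether the save actually happened, e.g. (hypothetical usage of the proposed verify_copy(), with fs and stream set up as in your earlier snippets):

file_id = fs.computehash(stream)
filepath, saved = fs.verify_copy(stream, file_id)

if not saved:
    # The content was already stored under this hash, so nothing was written.
    print('duplicate:', filepath)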

dgilland commented Jul 1, 2015

Are you using copy directly? If copy is modified to return whether the file already exists, put would need to return that as well, since put is meant to be the public API while copy is a lower-level utility.

Changing the return of put would be a breaking change, so I need to consider this carefully even though it's still pre-v1.

Another option would be to modify HashAddress to contain an attribute that indicates whether the file existed previously, but that value would only have meaning when calling put. This would be a non-breaking change.

Seems like the options are:

  1. Modify return of put/copy to indicate whether the file existed previously.
  2. Create new methods for put/copy (perhaps verify_put/verify_copy) that return whether the file existed previously.
  3. Modify HashAddress to include an attribute that indicates previous file existence.

dgilland commented Jul 1, 2015

After some more thought, I am leaning towards option 3, modifying HashAddress to include whether the file was a duplicate during a put operation.
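
Roughly, that could look something like this (just a sketch, not the final implementation; it assumes HashAddress stays a namedtuple with the existing id/relpath/abspath fields and uses is_duplicate as a placeholder name):

from collections import namedtuple


class HashAddress(namedtuple('HashAddress', ['id', 'relpath', 'abspath', 'is_duplicate'])):
    """File address returned by put(), extended with a flag that says whether
    the content already existed on disk."""

    def __new__(cls, id, relpath, abspath, is_duplicate=False):
        # Default to False so existing three-argument call sites keep working,
        # which is what makes this a non-breaking change.
        return super(HashAddress, cls).__new__(
            cls, id, relpath, abspath, is_duplicate)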

cannatag commented Jul 2, 2015

Option 3 looks like the right solution to me because it should not break the interface contract. I will change my code to use the put() method instead of copy(). May I also suggest renaming the internal methods (those that are not supposed to be called directly by external code) to _method()? That is the standard way to indicate that a method is not public.

Thanks again for your HashFS project.
Giovanni

dgilland added a commit that referenced this issue Jul 2, 2015
…urn HashAddress with "is_duplicate=True" when file with same hash already exists on disk.

Refs #1.

dgilland commented Jul 2, 2015

This has been released as v0.5.0.

dgilland closed this as completed Jul 2, 2015

dgilland commented Jul 2, 2015

BTW, the new attribute for HashAddress is is_duplicate:

address = fs.put(stream)

if address.is_duplicate:
    pass  # Handle duplicate file condition
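
So, for the original use case of tracking the ratio of duplicate files, something along these lines should work (a sketch; the HashFS root directory and the file_paths iterable are placeholders):

import hashfs

fs = hashfs.HashFS('my-hashfs-root')  # placeholder root directory

total = 0
duplicates = 0

for path in file_paths:  # placeholder iterable of files to ingest
    address = fs.put(path)
    total += 1
    if address.is_duplicate:
        duplicates += 1

if total:
    print('duplicate ratio: {:.2%}'.format(duplicates / float(total)))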

cannatag commented Jul 3, 2015

Thanks. It works.

Giovanni
